From the course: Delivering and Analyzing a Software Pilot: GitHub Copilot
Preparing your data for analysis - GitHub Copilot Tutorial
- [Instructor] Now imagine trying to cook a really nice meal with ingredients straight from the garden: dirt, stems, leaves, and all. You wouldn't start cooking without first cleaning and prepping those ingredients. The same principle applies to data analysis. Before you can dive into the analysis, you need to prepare your data properly. In this video, we'll focus on the importance of preparing your data for analysis, explore key techniques for cleaning and preprocessing text data, and walk through coding examples that show you how to remove stop words and identify bigrams using Python. Preparing your data is a crucial step that can significantly impact the quality of your analysis. Clean, well-prepared data leads to more accurate and meaningful results. So let's start with one of the foundational tasks: removing stop words. Stop words are common words such as "and," "the," and "is" that don't add much value to text analysis. Removing them helps reduce noise in your data and makes it easier to focus on meaningful words. So let's dive straight in. I'm going to need to import nltk. And then from nltk.corpus, I'm going to import stopwords. And don't forget that a corpus is just a collection of words, and stop words, as we've just identified, are the meaningless words that we want to remove from the example text we're going to use. And then I want to import re. Now that I've imported the stop words, I actually need to download them. So I'm going to say nltk.download, open parenthesis, single quote, close parenthesis, otherwise I'll forget, and then stopwords. Now I need to define the function to actually remove the words. So I'm going to define remove underscore, whoops, that's my bad spelling. You get live coding.
Stopwords, open parenthesis, text, close parenthesis, then colon. And then stop underscore words is going to be the set, open parenthesis, of stopwords dot words, open parenthesis, English, 'cause that's what I'm focusing on right now, and then double close parentheses. And then words are going to be re dot findall, open parenthesis, r, single quote, backslash w plus, and then after the quote it's comma and then text, and then close parenthesis. And then filtered, oops, filtered underscore words are going to be square bracket, that's an important distinction: word for word in words. Lots of use of the word "word" there, almost as though that's a stop word in itself. If word dot lower, open close parentheses, not in stop underscore words, close square bracket. Then return space dot join, open parenthesis, filtered underscore words, close parenthesis, otherwise I'll forget. So that has defined the function to remove the actual stop words from whatever text I put in. Now, you could run this against an actual file, but for this example, I'm going to run it against some sample text. So I say sample underscore text equals, and if you've watched the previous videos, you know exactly what I'm about to paste in here: it's our fictitious GitHub Copilot review, surrounded by triple quotation marks. And then under that, I'm going to say cleaned underscore text equals remove underscore stopwords, open parenthesis, sample underscore text, close parenthesis, and then print, open parenthesis, cleaned underscore text, close parenthesis. And there we have it. We can see the stop words like "our," "has," "a," and "in" have all been removed from that sample text, leaving the most meaningful words behind. So as you can see, the function filters out common stop words from the sample text, leaving only the more relevant words. This clean data is now much easier to analyze. Now let's talk about identifying bigrams.
Bigrams are pairs of consecutive words that can provide more context than single words alone. For example, "machine learning" is more meaningful as a bigram than the two separate words "machine" and "learning." Given the sentence "I love Mike's LinkedIn Learning course," the bigrams would be: I love, love Mike's, Mike's LinkedIn, LinkedIn Learning, and Learning course. So let's use nltk to identify bigrams in our sample text. So in my editor, in my Jupyter notebook, I'm going to say from nltk dot collocations import BigramAssocMeasures, and also BigramCollocationFinder. Now I need to define the function to find the bigrams in the text we generated a moment ago. So, def find underscore bigrams, which takes text, and then close parenthesis and colon. And then words equals re dot findall, open parenthesis, r, single quote, backslash w plus, very similar to earlier, and the single quote again, then comma, and then text, and then close parenthesis. And then on the next line, bigram underscore measures equals BigramAssocMeasures, open close parentheses. And then finder equals BigramCollocationFinder dot from underscore words, open parenthesis, words, close parenthesis. And then bigrams equals finder dot nbest, open parenthesis, bigram underscore measures dot pmi, and then let's give it a value for how many bigrams we want it to find. Let's say 10, the top 10 bigrams it can find in the text. And then return bigrams. Let me just scroll down a bit. So then we're going to use it on our cleaned text from earlier. So I'm going to say bigrams equals find underscore bigrams, open parenthesis, cleaned underscore text, close parenthesis. And because we've run these cells in succession, the text we generated earlier is still available in this environment. And then print, open parenthesis, bigrams, close parenthesis.
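Assembled from the dictation above, a minimal sketch of the bigram finder (assuming NLTK is installed; the sentence passed in at the end is a short stand-in for the cleaned review text):

```python
import re
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

def find_bigrams(text):
    # Same word extraction as before: \w+ drops punctuation
    words = re.findall(r'\w+', text)
    bigram_measures = BigramAssocMeasures()
    # Build consecutive word pairs from the word list
    finder = BigramCollocationFinder.from_words(words)
    # Rank bigrams by pointwise mutual information (PMI), keep the top 10
    bigrams = finder.nbest(bigram_measures.pmi, 10)
    return bigrams

# Stand-in input; in the video this is cleaned_text from the previous cell
bigrams = find_bigrams("machine learning is more meaningful than machine or learning alone")
print(bigrams)
```

Each result comes back as a tuple of two words, which is why the printed output looks like a list of word pairs rather than joined strings.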
Now if I run that, we can see that the top 10 identified bigrams in that cleaned text are: 30 developers, AI ability, since implementing, ability quickly, able focus, also decreased, automating repetitive, building features, deadlines consistently, and decreased need. All of those words make more impact and more sense together than they potentially would if they were separate. So in summary, preparing your data by removing stop words and identifying bigrams is essential for effective text analysis. These steps help you focus on the most meaningful parts of your data, setting the stage for deeper analysis. Up next, we'll explore techniques to identify sentiments and themes in your text data, helping you uncover deeper insights and patterns. So stay tuned.