Data preprocessing and scaling
From the course: Advanced Python Projects: Build AI Applications
- [Instructor] Next, we're going to be doing data preprocessing. This method is used to check the number of null values in each column of the df data frame. In this case, when we run this, we notice that there are no null values.

Next, let's scroll down. We're going to convert the zip code column in the df data frame to the string data type. This is done to ensure that the zip codes are treated as alphanumeric values when performing operations such as joining with the population data, which we're going to do soon. So let's run this.

Next, in this block of code, the find_zip_code function is defined. This function takes a geocode as input and searches for a five-digit pattern at the end of the string using a regular expression. If a match is found, the function returns the extracted zip code. Let's run this.

Next, in this code, the find_zip_code function is applied to the geography column in the population data using the apply method over here. The result is stored in a new zip code column in the population data frame. So this process extracts the last five digits of the zip code from the geography column. Let's run this. Excellent.

Over here, we're making a copy of the df data frame with the name cafe_data, and then the pd.merge function is used to merge cafe_data with the population data frame based on the zip code column. The result is stored back in the df data frame, so now let's run that. All right.

Here, a list of column names is created by combining the columns from the cafe_data data frame with the total column. The df data frame is then updated to include only the columns specified in that list. Finally, the total column is renamed to population. So now let's run that. Excellent.

Next, let's display the data frame to see what it looks like. So here's the merged data frame, and at the end of it we have the population data as expected. As you see over here, we now have 412 rows and 12 columns. This reduction in rows and columns is expected, given that we've merged the two datasets together.

Now, we're only going to keep the relevant features from that dataset. What's relevant to us is the zip code, rating, median salary, latte price, and population, because that's what we will use to identify the top five zip codes and the price of a latte in each of those zip codes. So now let's run this. Excellent. As expected, we've now reduced the columns from 12 to five, and here are the columns that are present in the df data frame.

So next, what we're doing here is calculating the total number of coffee shops for each zip code and storing it in the coffee_shop_counts data frame. Both zip code columns are ensured to be of type string for proper merging, the counts are merged back into the original df data frame, and the data frame is printed to display the changes.
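For reference, here is a minimal sketch of the preprocessing steps just described. It assumes the data frames are named df and population_data, and that the columns are named zip_code, geography, total, rating, median_salary, and latte_price as suggested by the narration; the exact names in the course notebook may differ.

    import re
    import pandas as pd

    # Check the number of null values in each column (none are expected here)
    print(df.isnull().sum())

    # Treat zip codes as strings so they join as text, not numbers
    df['zip_code'] = df['zip_code'].astype(str)

    def find_zip_code(geocode):
        # Search for a five-digit pattern at the end of the string
        match = re.search(r'(\d{5})$', str(geocode))
        return match.group(1) if match else None

    # Extract the zip code from the geography column
    population_data['zip_code'] = population_data['geography'].apply(find_zip_code)

    # Copy df, then merge it with the population data on the zip code column
    cafe_data = df.copy()
    df = pd.merge(cafe_data, population_data, on='zip_code')

    # Keep the cafe columns plus the population total, then rename it
    columns = list(cafe_data.columns) + ['total']
    df = df[columns].rename(columns={'total': 'population'})

    # Keep only the relevant features
    df = df[['zip_code', 'rating', 'median_salary', 'latte_price', 'population']]

    # Count coffee shops per zip code and merge the counts back in
    coffee_shop_counts = (df.groupby('zip_code').size()
                            .reset_index(name='coffee_shop_count'))
    coffee_shop_counts['zip_code'] = coffee_shop_counts['zip_code'].astype(str)
    df['zip_code'] = df['zip_code'].astype(str)
    df = df.merge(coffee_shop_counts, on='zip_code')
    print(df)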
Next, we're going to identify the top five zip codes using the criteria listed here. We want our top five zip codes to have a high population, a low number of coffee shops, low existing ratings, and a high median salary. This is to make sure that the zip codes we choose have a lot of demand and little competition, and that the folks who live in those areas have a median salary high enough to afford our coffee. A new data frame, sorted_df, is created by sorting df based on those criteria. We're going to sort population from highest to lowest. Then the coffee shop count is sorted from lowest to highest, and rating is also sorted low to high, so that the existing coffee shops in those areas have low ratings and we have a better chance of establishing a more successful business that can earn a higher rating. Median salary, again, is sorted from high to low, so that we target areas with a higher median salary.

Now, we create a list named lst to store unique zip codes. The loop iterates through the sorted data frame, sorted_df, checking whether the length of the list is less than five and whether the current zip code is not already in the list. If both conditions are met, the zip code is added to lst. Finally, sorted_df is filtered to include only rows where the zip code is in lst, creating a data frame that displays all the records of the top five zip codes based on the criteria we established. Now, let's run this code. A sketch of this sorting and selection logic appears at the end of this section.

Here are the sorted values, and here's the output for our top zip codes. If you notice, our first zip code appears five different times. This is because there are five different coffee shops in that location, and these are the data related to those coffee shops. Also notice that the population stays the same, which is expected because there's only one population figure for that particular zip code; that's why the value repeats across these 18 rows. Here's the first zip code, here's the second zip code, here's the third zip code, here's the fourth zip code, and last but not least, here's our fifth zip code. So we've identified the top five zip codes.

Next, we create a feature matrix, labeled X, by dropping the latte price and zip code columns from the df data frame. The target variable y is assigned the values from the latte price column. This prepares the data for a machine learning model, with X representing the features and y representing the target variable. The train_test_split function from scikit-learn is then used to split the feature matrix X and the target variable y into training and testing sets. The parameter test_size=0.2 specifies that 20% of the data should be used for testing and the rest for training, and random_state=42 ensures reproducibility by fixing the random seed. The resulting sets are assigned to X_train, X_test, y_train, and y_test. Let's now run both of these.

Next, we're going to be doing feature scaling. Feature scaling is a method used to normalize the range of the independent variables, or features, of the data. It's important in machine learning because many algorithms, like linear regression, require features to be on the same scale. If features are not on the same scale, the algorithm may be biased toward the features with large values. So in these lines of code, the StandardScaler from scikit-learn is used to standardize the feature matrices X_train and X_test. The fit_transform method is applied to X_train to compute the mean and standard deviation needed for scaling and to transform the training data. The transform method over here is then used to scale the testing data based on the parameters learned from the training data. What this truly does is ensure that the features have a mean of zero and a standard deviation of one, which can be beneficial for certain machine learning algorithms.
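Here is a minimal sketch of the sorting and top-five selection described above, assuming pandas sort_values and the column names from the earlier sketch (population, coffee_shop_count, rating, median_salary); the actual notebook may differ in details.

    # Sort by the criteria: high population, few competing shops,
    # low existing ratings, high median salary
    sorted_df = df.sort_values(
        by=['population', 'coffee_shop_count', 'rating', 'median_salary'],
        ascending=[False, True, True, False]
    )

    # Collect the first five unique zip codes in sorted order
    lst = []
    for zip_code in sorted_df['zip_code']:
        if len(lst) < 5 and zip_code not in lst:
            lst.append(zip_code)

    # Keep every row belonging to those top five zip codes
    top_zip_codes = sorted_df[sorted_df['zip_code'].isin(lst)]
    print(top_zip_codes)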
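The train/test split might look like the following sketch, assuming the target and identifier columns are named latte_price and zip_code as above; the test_size and random_state values match the narration.

    from sklearn.model_selection import train_test_split

    # Features: everything except the target and the zip code identifier
    X = df.drop(columns=['latte_price', 'zip_code'])
    y = df['latte_price']

    # 80/20 split; random_state=42 fixes the random seed for reproducibility
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )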
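And the feature scaling step could be sketched like this; whether the scaled arrays overwrite X_train and X_test or are stored under new names is an assumption here. Fitting the scaler only on the training data and then reusing it on the test data is what keeps test-set statistics from leaking into the model.

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()

    # fit_transform learns the mean and standard deviation from the
    # training data and standardizes it in one step
    X_train = scaler.fit_transform(X_train)

    # transform reuses the parameters learned from the training data,
    # so no information from the test set leaks into the scaling
    X_test = scaler.transform(X_test)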