From the course: Applied Machine Learning: Value Estimation
Train model: Linear regression
- [Instructor] So let's use this here. I'm going to write some code that I would say is a little more robust than simpler alternatives: we're going to use scikit-learn Pipelines. Pipelines let us put this code into production and be careful about what we call leaking variables; we're not going to leak any information from the training data into the testing data. So I've got a bunch of imports here, mostly coming from the scikit-learn library. I'm not going to go over them, but I will go over the code below and talk about what it's doing. Here we're loading our dataset again. We've got the features here; this is the same as what we had before, but it doesn't have that sale price column in it. And I'm going to make two variables: one called capital X and another called lowercase y. If you're a Python person, you might be wondering why I'm using capital X. Well, typically in machine learning literature and in linear algebra, a capital variable represents a two-dimensional array, while a lowercase variable represents a vector, and this convention holds for tabular or structured machine learning. Typically we have a two-dimensional set of features where each row represents a sample and each column represents a feature describing that sample. And then we have y, which is one-dimensional, but it's best to think of y as a column where each value in y corresponds to a row in our X. So let's run this code and make sure that it works. Okay, and I'll just show you what X looks like after we've done that. So again, X is a DataFrame; it's very common to use a DataFrame for that. And y is a Pandas Series, and again, you can think of this as a column: the price for each row. Because our data has both numeric values and non-numeric values, I'm going to further segment them.
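The X/y split described above can be sketched as follows. This is a minimal illustration, not the course's actual code: the column names and values are hypothetical stand-ins for the housing dataset.

```python
import pandas as pd

# Hypothetical stand-in for the course's housing data;
# the column names here are assumptions, not the real dataset.
df = pd.DataFrame({
    "Lot Area": [8450, 9600, 11250],
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr"],
    "SalePrice": [208500, 181500, 223500],
})

# Capital X: 2-D feature matrix (DataFrame, one row per sample).
# Lowercase y: 1-D target (Series, one label per row of X).
X = df.drop(columns="SalePrice")
y = df["SalePrice"]
```

Dropping the target column from X up front is what keeps the label out of the features.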
I'm going to make a list of the numeric features and a list of the non-numeric features up here. If you look at this, you can see that we have a neighborhood value that is categorical. Linear regression by itself cannot deal with categorical values, so we need a way to process those or turn them into numeric values, and one way to do that is what's called one-hot encoding. Basically, for every categorical value, we make a new column and put a one or a zero in it depending on whether the row has that value. So if we've got 20 different neighborhoods, this is going to add 20 new columns. Most of them will be zeros, and at most one of them will have a one in it; all of them could be zero if a row has a neighborhood that wasn't seen before. The next step is to make a pipeline. You can think of a pipeline as a process for cleaning up your data. Again, we could do this with Pandas, but in practice we want to preserve the behavior, and while that's possible in Pandas, scikit-learn Pipelines make it really easy; they just require a little bit of code. So the pipeline is saying we've got some steps. The first thing we're going to do is imputation; that's a fancy statistical term for "replace values that are missing." Linear regression also does not like dealing with missing values, so we're going to deal with those by just providing the median for a column: if a value is missing, we fill in the median value. And then after we do that, we're going to standardize our data. Standardization is a statistical term that means we're going to give each column a mean value of zero and a standard deviation of one.
Again, we do that often with machine learning algorithms because a lot of algorithms are distance based, and if you have one column, like lot area, that can be very big, and another column, maybe number of basement bathrooms, that is very small in scale, the feature with the large scale can overpower the information in the feature with the small scale in a distance metric. So standardizing these, basically shifting them and shrinking or expanding them so they have a standard deviation of one and are centered on zero, allows the algorithm to look at the information in the data and not the scale of the data. Let's just try this out. I'm going to call fit_transform on this transformer, and what you'll see with scikit-learn is that it has a consistent interface; it likes this fit thing here. And it says, "numerical_features is not defined"; that's because I didn't run the cell up above. Let's run that cell up there and we'll run this one now. Okay, and we get this output right here. So this is a NumPy array. What's going on here is it has taken all of the numerical columns from X, imputed any missing values, and then standardized them. Now, this NumPy array is a little bit hard to read. I'm not a big fan of using NumPy arrays for machine learning because this just looks like a bunch of numbers; I like to have some context about what's going on. So I'm going to run this code now with Pandas output, and you can see that I get a Pandas DataFrame coming out of this. This just makes it easier to see that this is the first column here and these are the values in it, rather than just a bunch of numbers without any context. I can validate that these values are standardized by calling describe here. This is going to tell me the mean value and the standard deviation. And remember, the mean value is close to zero. 5.8 doesn't really look close to zero, but this is 5.8 times 10 to the minus 17.
So for floating point numbers, it's essentially zero, and the standard deviation is very close to one. So it looks like these were indeed standardized. Let's look at our categorical pipeline. This is a little bit different because we're dealing with categorical values. I'm going to use SimpleImputer here. In this case, I'm not going to use the median, because that doesn't make sense for categorical values; instead I'm going to use the constant strategy, so we just fill in a value called "missing" if a value is missing. And then I'm going to do one-hot encoding. We talked about that: it basically says, "Take all of the values and make a column for each of the categorical values, using a one or a zero to indicate whether a row has that feature." Let's run this here with our categorical features. And I got an error. It says that Pandas does not support sparse data. Remember I said that these one-hot encoded features are mostly zeros, and by default this gives us sparse data, where it doesn't store a value if it's a zero. So we're going to come in here and change this parameter to say sparse_output is False, and let's try it again. That looks like it's working. I've also put in a few more parameters here. I've said max_categories is five, so it's going to take the top five categories. And if you look at this, there's 1, 2, 3, 4; it's only showing four categories. Why is it doing that? Because I'm actually dropping the first category as well. This is a common thing in one-hot encoding. If you think about how one-hot encoding works, consider a binary categorical, say you are sick or you are not sick: we really don't need two columns to represent sickness, we only need one, which is either a one or a zero. And this holds true as you add more values. If you've got four different options, you really don't need four columns, you only need three.
So we're just dropping one of the columns there. Okay, once we've got those two pipelines, we're going to combine them into this thing called a ColumnTransformer, and we're going to say, "Here's the processing that we're going to do." We make a transformer called num that takes our numerical transformer and applies it only to the numerical features, and one that takes our categorical transformer and applies it only to the categorical features. Once we have this, we can throw in our whole DataFrame, X, and it looks like fit_transform works; it gives us the transformed data. Also note that it prepends each column or feature with the name of the transformer that was applied to it, so you can see that these are numeric, and we've got categorical at the end. Okay, we're almost done with this pipeline. Now that we've got our preprocessor pipeline that combines the categorical and the numeric processing, we're going to combine this into one more pipeline that says, "Pre-process our data," and then after we do that, stick it into this linear regression estimator. What we're going to do with this pipeline is call fit and pass in X and y. You can think of X as the features and y as the labels. It takes X and throws it through the pre-processor, which throws it through the numeric processor and the categorical processor, and then it takes the result of that and trains the formula, that y=mX+b formula, against this y value. Let's run this and make sure that it works. Okay, it looks like it does. Scikit-learn actually gives us this handy pipeline visualization so we can see what's going on. You can expand this as well, and you can see how we've parameterized the behavior; for example, the one-hot encoding had some parameters that we changed. Okay, so that is a pipeline. There's a bit of code to do that.
But again, once you have that code, it makes it very easy to make sure that you're not cheating. What do I mean by cheating? Well, you might want to evaluate your model, and one common way to do that is by splitting the data into a training set and a testing set. The whole reason for making a machine learning model is to make predictions on data that we haven't seen. So how do we know how our model is performing? What we typically do is hold out some of the data: we train the model on a subset, and then we evaluate it on data that it hasn't seen to get a feel for what it's doing. So we've got this train_test_split from scikit-learn. We're going to pass in X and y, and it gives us a training X and a training y, and a test X and a test y. You can see that this thing on the left here is the index; it has basically pulled out different rows at random. There are 2300 rows for that, and here are the corresponding labels. What we're going to do now is call fit on our pipeline with the training data. Okay, so now our model has been fit with the training data. That looks like it worked. In the next section, we're going to talk about model evaluation.
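The hold-out split described above can be sketched with `train_test_split`; the toy data here is made up, and `random_state` is added only to make the example reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the course's X and y.
X = pd.DataFrame({"Lot Area": range(100)})
y = pd.Series(range(100), name="SalePrice")

# Randomly hold out 25% of the rows (the default) for evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
```

The model is then fit only on `X_train`/`y_train`, and the held-out `X_test`/`y_test` are reserved for evaluation in the next section.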