From the course: Machine Learning with Python: Decision Trees

How to build a regression tree in Python

- [Instructor] In this exercise, we'll use a sample income data set to build a regression tree that predicts the salary of a worker based on their age and education level. Before we get started, note that this video is the first in a three-video sequence that explains how to build, visualize, and prune a regression tree. We start by importing the pandas package. Then we import the data into a DataFrame called income and preview it to make sure that the import worked as expected.

Now that we have our data, let's try to understand it. First, we get a concise summary of the structure of the data by calling the info method of the DataFrame. From the summary, we can tell that there are 30 instances in the data set by looking at the RangeIndex. We can also tell that there are three features in the data set. Looking at the Dtype column of the summary, we see that the age column holds integer values, the education column holds text (a.k.a. object), and the salary column holds floating-point, or decimal, values. Next, we get summary statistics for the numeric columns by calling the describe method of the DataFrame. From the statistics, we see that the minimum salary value in the data is 16.8, while the maximum value is 118. Note that these values are in thousands, so what we're seeing here is $16,800 and $118,000. We also see that the minimum, median, and maximum age values are 24, 45, and 65, respectively.

Next, let's explore the data visually by creating a few plots. To ensure that our plots show up inline, we run the %matplotlib inline magic command. Then we import pyplot from the matplotlib package, as well as the seaborn package. The first plot we create is a box plot that shows the distribution of salary by education level. The chart shows that those with a high school diploma tend to earn the least, while those with a professional degree tend to earn the most. Next, let's create another box plot to show the distribution of age by education level.
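The loading and exploration steps described so far can be sketched as follows. The DataFrame below is a small hypothetical stand-in, since the course's income file isn't reproduced here; the plotting calls are shown as comments because they only render in a notebook:

```python
import pandas as pd

# Hypothetical stand-in for the course's income data set
# (the real file has 30 rows; column names match the video).
income = pd.DataFrame({
    "age": [24, 31, 38, 45, 52, 65],
    "education": ["High School", "Bachelor's", "Bachelor's",
                  "Master's", "Professional", "Professional"],
    "salary": [16.8, 42.0, 55.2, 61.5, 95.0, 118.0],  # in thousands
})

income.info()             # concise structural summary: row count, dtypes
print(income.describe())  # summary statistics for the numeric columns

# In a notebook, the exploration plots would look like:
# %matplotlib inline
# import matplotlib.pyplot as plt
# import seaborn as sns
# sns.boxplot(data=income, x="education", y="salary")   # salary by education
# sns.boxplot(data=income, x="education", y="age")      # age by education
# sns.scatterplot(data=income, x="age", y="salary")     # salary vs. age
```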
This chart doesn't show much separation between the groups. However, we do see that those with professional degrees tend to be a bit older than the rest of the workers in the data set. Finally, let's create a scatter plot to look at the relationship between salary and age. The chart shows somewhat of a linear relationship between these two variables. This means that, generally, the older a worker is, the higher their salary.

Now that we've done some initial data exploration, let's prepare our data for modeling by splitting it into training and test sets. Prior to doing so, we must first separate the dependent variable from the independent variables. Let's start by creating a DataFrame called y for the dependent variable, which is salary. Then we create a second DataFrame, X, for the independent variables, age and education. Next, we import the train_test_split function from the sklearn.model_selection subpackage. Then we split the X and y DataFrames into X_train, X_test, y_train, and y_test. Note that we set train_size to 0.6, which means we want 60% of the original data to become the training data, while 40% becomes the test data. We also set stratify to the education column in X, which means we want the data split using stratified random sampling based on the values of the education column. Finally, we set random_state to 1234, simply so we get the same results every time we do the split. The shape attributes of the X_train and X_test DataFrames tell us how many instances, or records, are in each data set. From the results, we can see that we have 18 instances in the training set and 12 instances in the test set.

The scikit-learn package we intend to use to build our regression tree does not support non-numeric values, like those in the education column of our data. As a result, we have to dummy code the education columns in the X_train and X_test DataFrames. Before we dummy code X_train, let's preview it using the head method.
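The splitting steps just described might look like this sketch, again using a hypothetical 30-row stand-in for the course data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 30-row stand-in with the same columns as the course data.
income = pd.DataFrame({
    "age": list(range(24, 54)),
    "education": ["High School", "Bachelor's", "Master's"] * 10,
    "salary": [20.0 + 2.0 * i for i in range(30)],
})

y = income[["salary"]]            # dependent variable
X = income[["age", "education"]]  # independent variables

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    train_size=0.6,               # 60% training, 40% test
    stratify=X["education"],      # stratified sampling on education
    random_state=1234,            # reproducible split
)
print(X_train.shape, X_test.shape)  # (18, 2) (12, 2)
```

With 30 rows and train_size=0.6, the split yields 18 training and 12 test instances, matching the counts quoted in the video.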
To dummy code X_train, we pass it to the pandas get_dummies function and preview the updated DataFrame. Notice that each of the categorical values in the education column is now a column of its own, each with a dichotomous value of zero or one. Let's dummy code and preview the test data set, X_test, as well. We are done with data preparation, so we can now build our model.

To build the regression tree in Python, we need to import the DecisionTreeRegressor class from the sklearn.tree subpackage. We then instantiate an object from the class and call the object regressor. Using the regressor object, we can fit a regression tree on the training data. To evaluate and estimate the future performance of the model, let's see how well it fits against the test data. To do so, we pass the test data to the score method of the model. This returns the R-squared of the model on the test data. The R-squared value we get here tells us that our model is only able to explain 58.5% of the variability in the response values of the test data. We can do better.

Another way to evaluate a regression tree is to evaluate how accurate it is. This means comparing the predicted values against the actual values, or getting the mean absolute error of the predictions. Before we can get the mean absolute error, we need to get the model's predicted response values for the test data. We assign these results to a variable called y_test_pred. Next, we import the mean_absolute_error function from the sklearn.metrics subpackage and calculate the mean absolute error between the actual response values, y_test, and the predicted response values, y_test_pred. What does this mean? The mean absolute error implies that, going forward, we should expect the salary values our regression tree predicts to be off the mark by an average of plus or minus $13,542.
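Putting the remaining steps together (dummy coding, fitting, and scoring), here is a minimal end-to-end sketch. It runs on the same hypothetical stand-in data, so the R-squared and mean absolute error it prints will differ from the 58.5% and 13,542 quoted in the video:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# Hypothetical 30-row stand-in for the course's income data set.
income = pd.DataFrame({
    "age": list(range(24, 54)),
    "education": ["High School", "Bachelor's", "Master's"] * 10,
    "salary": [18.0 + 2.5 * i for i in range(30)],
})
y = income[["salary"]]
X = income[["age", "education"]]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, stratify=X["education"], random_state=1234)

# scikit-learn can't handle the text-valued education column directly,
# so dummy code it in both sets. (Because the split is stratified on
# education, every category appears in both sets and the dummy columns match.)
X_train = pd.get_dummies(X_train)
X_test = pd.get_dummies(X_test)

regressor = DecisionTreeRegressor(random_state=1234)
regressor.fit(X_train, y_train)               # fit the regression tree

r_squared = regressor.score(X_test, y_test)   # R-squared on the test data
y_test_pred = regressor.predict(X_test)       # predicted salaries
mae = mean_absolute_error(y_test, y_test_pred)
print(f"R-squared: {r_squared:.3f}, MAE: {mae:.3f}")
```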
