From the course: Machine Learning with Python: Decision Trees
How to visualize a regression tree in Python
- [Instructor] Before we get started, note that this video is the second in a three-video sequence that explains how to build, visualize, and prune a regression tree. So if you have not done so, watch the previous video for a detailed walkthrough of the code in the cells above. Now that we've trained our regression tree, let's visualize it to get a better understanding of the tree logic. First, we make sure that we import the tree object from the sklearn package. The figure method of pyplot allows us to specify the size of our plot; feel free to adjust this to see how it impacts the size of your tree. Next, we'll use the plot_tree method of the tree object to visualize the tree. The first argument we pass to this method is the regression tree model itself. Then we specify the dummy-coded independent variables as a list. Finally, we specify that we want the nodes of the tree color filled. Because of the depth and number of nodes in our tree, it's a bit difficult to make out the details of the tree. For illustrative purposes alone, let's limit the tree visualization to the first three nodes in the tree. To do this, we set the max_depth argument of the plot_tree method to one.

Now we have a clearer view of the top set of nodes. Let's take a moment to understand the structure of the tree based on what we have here. We can interpret the root node as asking the question, is a worker 34 years old or younger? If so, branch to the left; else, branch to the right. The fact that the age variable was used for the first split lets us know that it is the most important variable within the dataset in predicting the salary of a worker. Note that the branch logic in the next two nodes is a bit peculiar. For example, the right branch evaluates whether education professional is less than or equal to 0.5. To understand what is happening here, recall that we had to dummy code the original education values to either zero or one. So if a worker has a professional degree, the value will be one, and zero if they don't. With that in mind, we can interpret this branch logic as asking the question, does a worker not have a professional degree? If yes, branch to the left; else, branch to the right. I know, it does sound a bit odd. So we can ask the opposite question and reverse the direction of the responses. In other words, we can ask, does a worker have a professional degree? If so, branch to the right; else, branch to the left.

Within each node, besides the branch logic, we also get a value for the MSE, or mean squared error. This can be interpreted as a measure of the degree of impurity, or variability, in a partition. The smaller this value is, the closer the values are to the mean. Conversely, the larger this value is, the further the values are from the mean. We also see the number of items, or samples, within each partition. Notice that this value decreases as we work our way down the tree toward the leaf nodes. This is expected, since the primary objective of recursive partitioning is to create smaller, more homogeneous subsets of the data. The last piece of information in each node is the value, which is the mean of the values in a particular partition. This is the predicted value of the regression tree. For example, if the regression tree were just one node, the root node, the tree would predict that all worker salaries were $65,367, the average of all the salaries in the training data.
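The notebook cells themselves are not reproduced in this transcript, so here is a minimal sketch of the visualization steps just described. The variable names `regressor` (the fitted regression tree) and `X_train` (the DataFrame of dummy-coded independent variables) are placeholders; substitute whatever names the earlier cells in your notebook use.

```python
import matplotlib.pyplot as plt
from sklearn import tree

# Set the figure size; feel free to adjust this to see how it
# impacts the size of the rendered tree.
plt.figure(figsize=(15, 10))

# Plot the full regression tree: pass the model, the dummy-coded
# feature names as a list, and ask for the nodes to be color filled.
tree.plot_tree(regressor,
               feature_names=list(X_train.columns),
               filled=True)
plt.show()

# For a clearer view of just the top of the tree (the root node and
# its two children), limit the plot by setting max_depth to 1.
plt.figure(figsize=(15, 10))
tree.plot_tree(regressor,
               feature_names=list(X_train.columns),
               filled=True,
               max_depth=1)
plt.show()
```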
One of the benefits of decision trees is that they are pretty good at ranking the effectiveness of independent variables in predicting the values of the dependent variable. This is known as feature importance. We can get the feature importance of each independent variable from the feature_importances_ attribute of our model. We get back an array of importance scores, one for each independent variable. To put these scores in context, let's connect them to the feature names and visualize the scores. To do this, the first thing we do is create a pandas Series called feature_importance, using the importance array as the values and the independent variable names as the index. Then we sort the Series by value and plot it. From the plot, we see that the age variable is the most important in predicting salary, while the education masters feature is the least important.
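As a sketch under the same assumptions (placeholder names `regressor` and `X_train`), the feature importance step described above might look like this. `feature_importances_` is the standard scikit-learn attribute on a fitted tree model.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Build a Series of importance scores, indexed by the feature names.
feature_importance = pd.Series(regressor.feature_importances_,
                               index=X_train.columns)

# Sort the scores and plot them as a horizontal bar chart so the most
# important feature appears at the top.
feature_importance.sort_values().plot(kind='barh')
plt.show()
```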