How to prune a regression tree in Python

From the course: Machine Learning with Python: Decision Trees

- [Instructor] Before we get started, note that this video is the third in a three-video sequence that explains how to build, visualize, and prune a regression tree in Python. If you have not done so, watch the previous two videos for a detailed walkthrough of the code in the cells above. Now that we've trained and visualized a regression tree, let's look into what we can do to improve its performance by pruning.

Decision trees are prone to overfitting. One telltale sign that a tree has overfit is that it performs well on the training data but very poorly on the test data. Let's evaluate our tree to see if it overfit the training data. To do this, we pass the training data to the score method of the model. Our model is able to explain 99% of the variability in the training data. Let's see how it does on the test data. Similarly, we pass the test data to the score method of the model. Our model is only able to explain 59% of the variability in the test data. It has overfit the training data and needs to be pruned.

There are two ways to prune a decision tree. One is to set parameters that manage its growth during the recursive partitioning process. This is known as pre-pruning. The other approach is to allow the tree to grow unimpeded and then gradually reduce its size in order to improve its performance. This is known as post-pruning. In this tutorial, we will use the post-pruning approach. The specific strategy we will use is known as cost complexity pruning. The primary objective in cost complexity pruning is finding the right value for a parameter known as alpha. The right alpha is the one that performs best on the test data.

To get a list of effective alpha values to choose from, we start by passing the training data to the cost complexity pruning path method of our previously instantiated regressor object. Then we extract a list of the effective alphas. The list of effective alphas goes from zero all the way to 222.77. The larger the value of alpha, the smaller the tree will be. The maximum value of alpha represents a tree with just one node. We do not want that one, so we remove it from our list of effective alphas. That's better.

Next, we train and evaluate several trees using different values for alpha. We start by creating two empty lists, train scores and test scores, to store the results of our model evaluation. Then we loop through all the alpha values in our list of effective alphas. For each alpha, we instantiate a regressor with that alpha, fit a regression tree to the training data, evaluate the tree's performance on the training and test datasets, and append the results to the train scores and test scores lists. Let's run that.

With that information, we can now plot the training and test scores against different values of alpha. The plot shows that when alpha is zero, the tree overfits: the training score is at its highest. As alpha increases, more of the tree is pruned, which results in lower training scores. The test scores behave a little differently. As alpha increases, the test score initially increases, then it starts to decline as well. The best alpha is the one that corresponds to the highest test score. By visual inspection alone, this seems to fall somewhere between 10 and 20. Let's get a list of these test scores. The test scores are listed in the same order as the effective alphas in the ccp_alphas list.
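The cells described above might look roughly like the following minimal sketch. It assumes a scikit-learn DecisionTreeRegressor and train/test splits named X_train, y_train, X_test, and y_test; these names are placeholders and may differ from the notebook used in the course.

```python
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt

# The unpruned tree from the earlier videos, re-created here so the sketch stands on its own.
regressor = DecisionTreeRegressor(random_state=0)
regressor.fit(X_train, y_train)

# A large gap between the two R-squared values is the telltale sign of overfitting.
print(regressor.score(X_train, y_train))   # roughly 0.99 in the video
print(regressor.score(X_test, y_test))     # roughly 0.59 in the video

# Effective alphas along the cost complexity pruning path.
path = regressor.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas

# The largest alpha prunes the tree down to a single node, so drop it.
ccp_alphas = ccp_alphas[:-1]

# Train one tree per alpha and record its R-squared on both datasets.
train_scores, test_scores = [], []
for alpha in ccp_alphas:
    tree = DecisionTreeRegressor(ccp_alpha=alpha, random_state=0)
    tree.fit(X_train, y_train)
    train_scores.append(tree.score(X_train, y_train))
    test_scores.append(tree.score(X_test, y_test))

# Plot both curves against alpha to see where the test score peaks.
plt.plot(ccp_alphas, train_scores, marker='o', label='train')
plt.plot(ccp_alphas, test_scores, marker='o', label='test')
plt.xlabel('alpha')
plt.ylabel('R-squared')
plt.legend()
plt.show()
```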
To get the value for the best alpha, we first get the index of the highest test score and then use that index to select the corresponding alpha from the ccp_alphas list. We see that the best alpha for the regression tree is 14.8. Next, we fit a regression tree on the training data, pruned using the best alpha. Let's get the model's R squared on the training data. We see that the R squared on the training data has gone down from .99 to .877. Now, let's get the model's R squared on the test data as well. We see that the R squared on the test data has gone up from .585 to .757. Finally, we can visualize our pruned regression tree. Our new regression tree is smaller than the one we started off with, but it performs better on the test data, which means that it now generalizes better.
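Continuing the sketch above, one way to select the best alpha, refit the pruned tree, and visualize it could look like this (again using the placeholder names introduced earlier, not necessarily those in the course notebook):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, plot_tree
import matplotlib.pyplot as plt

# Index of the highest test score, then the matching alpha (about 14.8 in the video).
best_index = np.argmax(test_scores)
best_alpha = ccp_alphas[best_index]

# Refit a regression tree on the training data, pruned with the best alpha.
pruned_regressor = DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=0)
pruned_regressor.fit(X_train, y_train)

print(pruned_regressor.score(X_train, y_train))  # roughly 0.877 in the video
print(pruned_regressor.score(X_test, y_test))    # roughly 0.757 in the video

# Visualize the smaller, pruned tree.
plt.figure(figsize=(12, 6))
plot_tree(pruned_regressor, filled=True)
plt.show()
```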
