From the course: Machine Learning with Python: Decision Trees

How to prune a classification tree in Python

- [Instructor] Before we get started, know that this video is the third in a three-video sequence that explains how to build, visualize, and prune a classification tree. If you have not done so, watch the previous two videos for a detailed explanation of the prior code. Now that we've trained and visualized a classification tree, let's look into what we can do to improve its performance by pruning. Decision trees are prone to overfitting. One telltale sign that a tree has overfit is a high accuracy score on the training data paired with a low accuracy score on the test data. Let's start by getting our tree's accuracy on the training data. To do this, we pass the training data to the score method of the model. Our model is 100% accurate on the training data. That's suspicious. Let's get a second opinion from the test data. Similarly, we pass the test data to the score method of the model. Our model is 50% accurate on the test data. Our model has definitely overfit on the training data and needs to be pruned.

There are two ways to prune a decision tree. One is to set parameters that manage its growth during the recursive partitioning process. This is known as pre-pruning. The other is to allow the tree to grow fully, unimpeded, and then gradually reduce its size in order to improve its performance. This is known as post-pruning. In this tutorial, we will use a pre-pruning approach. This means we need to figure out the combination of parameter values that gives the tree the best performance. This is known as hyperparameter tuning. The scikit-learn package provides several parameters we can tune during this process. We will limit ourselves to three of them.

We start by creating a dictionary, which we call grid, that holds the values of the parameters we want to try out. The first parameter is max_depth. This sets the maximum depth of the decision tree. We will try setting the value to two, three, four, and five to see which is best. The next parameter is min_samples_split. This sets the minimum number of items a partition must have before it can be split. Studies show that a value between 1 and 40 works best. We will try setting the value to two, three, and four. Next is the min_samples_leaf parameter. This sets the minimum number of items allowed in a leaf node. Studies show that the best values are between 1 and 20. We will try setting the value to one, two, three, four, five, and six.

The GridSearchCV class from the scikit-learn model_selection subpackage allows us to perform a grid search to find the best parameter values for our tree. We import the class, then we instantiate a DecisionTreeClassifier object, and then we pass that object to a new GridSearchCV object, which we call gcv. We also pass the parameter grid to the object. We then pass the training data to the fit method of gcv so it evaluates each hyperparameter combination in grid. The best_estimator_ attribute of gcv returns the classifier with the best combination of hyperparameters for our data, so let's get that. We then fit a classification tree on the training data using this classifier. The output shows that the best combination of hyperparameters is max_depth set to two and min_samples_leaf set to six. Now we can re-evaluate how well our model fits the training data by passing the training data to the score method of the model. We see that the accuracy has gone down from 100% to 87.5%.
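The code shown on screen isn't reproduced in this transcript, so here is a minimal sketch of the steps just described. It assumes the training and test splits from the previous videos are stored in variables named X_train, y_train, X_test, and y_test, and the random_state value is an illustrative choice, not something confirmed by the video.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Assumes X_train, y_train, X_test, y_test were created in the previous videos.
# First, confirm that the fully grown tree has overfit.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
print(model.score(X_train, y_train))  # ~1.0 on the training data in the video
print(model.score(X_test, y_test))    # ~0.5 on the test data in the video

# Candidate values for the three hyperparameters we want to tune.
grid = {
    'max_depth': [2, 3, 4, 5],
    'min_samples_split': [2, 3, 4],
    'min_samples_leaf': [1, 2, 3, 4, 5, 6],
}

# Grid search over every combination of values in 'grid'.
gcv = GridSearchCV(estimator=DecisionTreeClassifier(random_state=42),
                   param_grid=grid)
gcv.fit(X_train, y_train)

# The classifier with the best combination of hyperparameters.
model = gcv.best_estimator_
model.fit(X_train, y_train)
print(model.score(X_train, y_train))  # drops from 100% to 87.5% in the video
```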
Let's see how the model fits the test data as well. The model's accuracy on the test data has risen from 50% to 83.3%. That is much better. Finally, we can visualize our pruned model. Our pruned tree is much smaller than the one we started with, but it generalizes much better.
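Continuing the sketch above, the final evaluation and visualization might look like the following. The exact plotting approach from the earlier videos isn't shown in this transcript, so scikit-learn's plot_tree is used here as one common way to draw the tree; the variable names are the same assumptions as before.

```python
from matplotlib import pyplot as plt
from sklearn.tree import plot_tree

# Accuracy on the held-out test data (rises from 50% to 83.3% in the video).
print(model.score(X_test, y_test))

# Visualize the pruned tree; it is much smaller than the unpruned one.
plt.figure(figsize=(12, 6))
plot_tree(model, filled=True)
plt.show()
```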
