How do classification trees measure impurity?

From the course: Machine Learning with Python: Decision Trees

- [Instructor] Classification trees are built using a process known as recursive partitioning. The objective of recursive partitioning is to create child partitions that are purer than their parents. Classification tree algorithms use a mathematical formula to quantify the degree of impurity within a partition. Two of the most commonly used measures of impurity are entropy and Gini. Entropy is the preferred measure of impurity for the C5.0 decision tree algorithm. It is a concept borrowed from information theory, and when applied to decision trees, it quantifies the level of randomness or disorder that exists within a partition. A partition with low entropy is one in which most of the items have the same value or outcome, while a partition with high entropy is one that has no dominant outcome. The entropy of a partition S, with C possible outcomes or labels, is calculated as shown here: the negative sum, over all C outcomes, of the proportion of examples with that outcome multiplied by the binary log of that proportion.

To understand how this formula works, let's walk through an example. Consider the following orange partition with two possible outcomes: red triangle and green circle. To calculate the entropy of the partition, we first multiply the negative proportion of examples that belong to the triangle class by the binary log of the proportion itself, which is negative 0.53 times the binary log of 0.53. How do we get 0.53? Well, I'm glad you asked. There are 30 examples in the partition, and 16 of them belong to the triangle class. Therefore, the proportion of examples in the triangle class is 16 divided by 30, which is 0.53. Next, we multiply the negative proportion of examples that belong to the circle class by the binary log of the proportion itself. And how do we get 0.47? 14 of the 30 examples belong to the circle class, and 14 divided by 30 is 0.47. Adding the two products yields 0.997. This is the entropy of the orange partition.

So how does the classification tree algorithm use entropy to decide where to split the data during the recursive partitioning process? Let's assume that a classification tree algorithm is considering where to split this data in order to create purer partitions. One of the decisions it has to make is whether a split on a loan amount of $40,000 is the best split available. To figure this out, the algorithm starts by calculating the entropy of the left partition, which yields 0.986. Then it calculates the entropy of the right partition, which yields 0.951. The gray partition accounts for 47% of the data, while the yellow partition accounts for 53%. The combined entropy of both partitions after the split is a weighted sum of their individual entropy values: 0.47 times the entropy of the left partition, 0.986, plus 0.53 times the entropy of the right partition, 0.951. This comes out to 0.967, which is the combined entropy of the left and right partitions after the data is split on a loan amount of $40,000. Notice that the entropy of the orange partition prior to the split is 0.997, while the combined entropy of the gray and yellow partitions is 0.967. This reduction in entropy that occurs as a result of the split is known as information gain. It is the difference between the entropy of the partition before the split and the combined entropy of the partitions after the split, which in this case is 0.997 minus 0.967, or 0.03.
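To make the arithmetic concrete, here is a minimal Python sketch of the calculations above. The `entropy` helper and its class-count input are illustrative choices of mine, not code from the course, and the left and right partition entropies (0.986 and 0.951) and their weights (0.47 and 0.53) are taken directly from the values quoted in this walkthrough rather than recomputed from raw data.

```python
import math

def entropy(counts):
    """Entropy of a partition: -sum(p_i * log2(p_i)) over the class proportions."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# The orange partition: 16 red triangles and 14 green circles.
entropy_before = entropy([16, 14])
print(round(entropy_before, 3))  # 0.997

# Candidate split on a loan amount of $40,000, using the values quoted above:
# the gray (left) partition holds 47% of the data with entropy 0.986, and the
# yellow (right) partition holds 53% with entropy 0.951.
entropy_after = 0.47 * 0.986 + 0.53 * 0.951
information_gain = entropy_before - entropy_after
print(round(entropy_after, 3))     # 0.967
print(round(information_gain, 2))  # 0.03
```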
So what does all of this mean? Well, imagine that a classification tree algorithm is considering different split values for loan amount, as shown here. It'll first calculate the information gain that results from each potential split. Then it'll choose the split that results in the largest information gain, or reduction in entropy. That is the best split. This example uses information gain, by way of entropy, to determine the best split. As previously mentioned, entropy is the preferred impurity measure for the C5.0 decision tree algorithm.

Another common impurity measure is Gini, which is a measure of statistical dispersion. Gini is the preferred measure of impurity for the CART decision tree algorithm. One way to think of Gini is as a measure of how often a particular example in a partition would be incorrectly labeled if it were randomly labeled based on the distribution of labels in the partition. Similar to entropy, the greater the degree of randomness or impurity within a partition, the higher the Gini impurity value, and the smaller the degree of randomness or impurity within a partition, the lower the Gini impurity value. The Gini impurity of a data partition S, with C possible outcomes or labels, is calculated as shown here: one minus the sum, over all C outcomes, of the squared proportion of examples with that outcome. Similar to entropy, a classification tree algorithm that uses Gini as its measure of impurity will evaluate several candidate splits and choose the one that results in the largest reduction in Gini impurity.
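And here is the same idea for Gini, again as a rough sketch rather than course code: the `gini` helper is my own, and the 0.498 value is not quoted in the video; it is simply what the formula gives for the same 16 and 14 class counts in the orange partition.

```python
def gini(counts):
    """Gini impurity of a partition: 1 - sum(p_i ** 2) over the class proportions."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# The orange partition again: 16 triangles and 14 circles. The result, roughly
# 0.498, is close to the 0.5 maximum for a two-class partition, mirroring the
# near-maximal entropy of 0.997 computed earlier.
print(round(gini([16, 14]), 3))  # 0.498
```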
