Decision Tree (CART)
The decision tree is one of the most important models in machine learning, as it lays out the core ideas behind models such as Random Forest and XGBoost and behind bagging & boosting in general, which together fall under the ensemble methods. It is a tree-shaped model consisting of a root node, branches, and internal & leaf nodes, and it is mostly used for supervised learning. It is therefore important to understand how a decision tree works, and that is what we explain here.
A decision tree can be broadly categorized into regressors and classifiers, which is where the name Classification and Regression Trees (CART) comes from. The two serve different purposes: classification trees assign targets to discrete classes, e.g., labelling customers as loyal vs. fraudulent, whereas regression trees predict continuous values, e.g., changing house prices.
Some of the important concepts behind decision trees are the Gini index, entropy, information gain, and purity.
i) Purity: A node can be either pure or impure. Pure means that all the data in a leaf belongs to the same class, and impure means that the leaf holds a mixture of different classes.
ii) Entropy: Entropy is used to measure how pure or impure a branch or node is, and its mathematical formula is given below. Here "p" stands for the probability (proportion) of a particular class among all the samples in that node.
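For a node containing c classes, where p_i is the proportion of class i in that node:

Entropy = -\sum_{i=1}^{c} p_i \log_2(p_i)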
iii) Gini Index: The Gini index is similar to entropy except that it does not involve logarithmic values, which makes it computationally cheaper than entropy. It measures how likely a randomly chosen sample from the node is to be classified incorrectly if it is labelled according to the class distribution of that node.
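With the same notation, the Gini index of a node is:

Gini = 1 - \sum_{i=1}^{c} p_i^2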
iv) Information gain: Information gain measures how much a candidate split reduces the entropy (or Gini index) from the parent node to its children. The split that produces the highest information gain is chosen, the node is split accordingly, and the same process is repeated on each resulting node until the tree is grown; this is how the tree improves its accuracy.
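A common way to write this for a parent node split into child nodes, where n_k is the number of samples in child k and n is the number of samples in the parent, is:

Information\ Gain = Entropy(parent) - \sum_{k} \frac{n_k}{n}\, Entropy(child_k)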
Entropy vs Gini Index
For a two-class problem, entropy lies in the range 0 to 1, whereas the Gini index lies in the range 0 to 0.5, since it does not involve logarithmic expressions. Calculating entropy is computationally heavier than calculating the Gini index, and for that reason the Gini index is often preferred over entropy. In both cases, a node whose data is split into equal parts across the classes has the highest impurity, which for entropy is the maximum value of 1.
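As a quick check of these ranges (a minimal Python sketch, not part of the original walkthrough), we can compute both measures for a two-class node split into equal halves:

import math

p = [0.5, 0.5]  # two classes in equal proportion (the most impure split)
entropy = -sum(pi * math.log2(pi) for pi in p)  # = 1.0, the maximum entropy for two classes
gini = 1 - sum(pi ** 2 for pi in p)             # = 0.5, the maximum Gini index for two classes
print(entropy, gini)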
Pruning
At times the tree obtained can grow very large and include branches and leaf nodes of low importance, so these are pruned to decrease the size of the tree and to improve its accuracy on unseen data. A few of the pruning techniques are:
1. Reduced error pruning
2. Cost complexity pruning
3. Weakest link pruning
We picked a breast cancer dataset on which we test our decision tree and random forest algorithms, which use multiple features to predict whether a patient has breast cancer or not. After fitting the model, we check its accuracy to see how well it predicts the target values. We then apply cost complexity pruning, which uses a parameter alpha (the complexity parameter); the alpha value that gives the highest accuracy is used for the pruning step. One more important point: decision trees are largely unaffected by outliers, because each split depends only on which side of a threshold a sample falls, not on the exact magnitude of its values, so a few extreme points barely change the chosen splits or the accuracy.
Here is the decision tree algorithm walked through in Python.
We first import the breast cancer dataset and install all the prerequisite packages that are needed. The dataset holds 31 columns including the target: the 30 feature columns are loaded into X, and the target variable is loaded into y.
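A minimal sketch of this setup, assuming the scikit-learn version of the breast cancer dataset (the variable names are illustrative):

from sklearn.datasets import load_breast_cancer
import pandas as pd

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)  # 30 feature columns
y = data.target                                          # target: 0 = malignant, 1 = benign
print(X.shape)                                           # (569, 30)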
Then, the dataset is split into 80% for training and 20% for testing. A decision tree classifier is created as clf and fit on the training data; it then predicts the test values, from which we obtain an accuracy of 94.73%.
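The split, fit and accuracy check could look like the sketch below; the exact accuracy depends on the random split (random_state=42 is an assumption here), so 94.73% is the figure reported above rather than something this snippet is guaranteed to reproduce.

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)              # learn the splits on the training data
y_pred = clf.predict(X_test)           # predict the held-out test samples
print(accuracy_score(y_test, y_pred))  # accuracy before pruning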
And this is how the decision tree looks:
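The tree can be drawn with scikit-learn's plot_tree; a sketch, assuming the clf fitted above:

from sklearn import tree
import matplotlib.pyplot as plt

plt.figure(figsize=(20, 10))
tree.plot_tree(clf, feature_names=data.feature_names,
               class_names=data.target_names, filled=True)
plt.show()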
Since the tree we receive is very large and contains nodes of very low importance, we use the cost complexity pruning method to further increase our accuracy and to decrease the size of the tree by removing the nodes and branches that are of low importance.
Computing the pruning path gives us the candidate alpha values and their corresponding impurities:
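scikit-learn exposes these values through cost_complexity_pruning_path; a sketch, reusing the training data from above:

path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities  # candidate alphas and node impurities
print(ccp_alphas)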
After receiving the different values of alpha, we fit our classifier once for each alpha and then plot the train and test accuracy scores against the alpha values.
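One way to do this iteration and plot (a sketch, reusing the split defined earlier):

train_scores, test_scores = [], []
for alpha in ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    pruned.fit(X_train, y_train)
    train_scores.append(pruned.score(X_train, y_train))  # accuracy on the training set
    test_scores.append(pruned.score(X_test, y_test))     # accuracy on the test set

plt.plot(ccp_alphas, train_scores, marker='o', label='train')
plt.plot(ccp_alphas, test_scores, marker='o', label='test')
plt.xlabel('ccp_alpha')
plt.ylabel('accuracy')
plt.legend()
plt.show()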
These are the accuracies we receive after iterating over the different alpha values, and the graph we plot is shown below:
Once we obtain the graph, we choose the alpha value for which the test accuracy is highest (several alpha values tie for the best score). Refitting with that alpha, we obtain an accuracy of 95.61% after pruning, compared to the 94.73% obtained before pruning the tree.
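Choosing the best alpha and refitting could look like this sketch (the 95.61% figure above is the article's result and depends on the particular split):

import numpy as np

best_alpha = ccp_alphas[np.argmax(test_scores)]  # alpha with the highest test accuracy
final_clf = DecisionTreeClassifier(random_state=42, ccp_alpha=best_alpha)
final_clf.fit(X_train, y_train)
print(final_clf.score(X_test, y_test))           # accuracy after pruning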
And here is the tree we receive after pruning; the figure is given below:
You can connect with me at : https://www.linkedin.com/in/sudeepdas27/