In this Series of learning Machine Learning Algorithms, we have already learned Linear Regression and Logistic Regression which are Regression and Classification Algorithms respectively. Today we are going to start with the Decision Tree which is one of the simplest algorithms in Machine Learning.
- Decision Tree is a supervised machine learning algorithm capable of solving both Regression and Classification problems.
- It has a tree-like structure that shows how each decision is taken based on what condition.
- It starts with the root node and divides into branches till the leaf node. In this, each node represents a feature (attribute), each link or branch represents a decision (rule) and each leaf represents an outcome.
- The decision Tree is simple to understand as it mimics human-level thinking.
- It enables looking at the logic behind the interpretation and hence it becomes easy to relate to this algorithm.
How does a Decision Tree Build?
- Start with the root node which contains the complete dataset.
- Divide the current node using the best feature. The best feature is selected using the attribute selection measure (ASM).
- Repeat step 2 until you arrive at the stopping condition.
We will learn about the attribute selection measure and stopping condition in detail.
Attribute Selection Measure (ASM)
Attribute Selection Measure helps you to find out the best feature i.e. the feature on which the current decision tree can be split further. Below is a few Attribute Selection Measure:
- Information Gain
- Information Gain Ratio
- Gini Index
Entropy is the measure of uncertainty, impurity, disorderness, and randomness in the dataset.
Example 1, You want to decide the place for a holiday trip, 4 members of your family want to visit Goa and 4 other members want to visit Kerala. It is very difficult to decide the place in this case as this problem has high randomness hence high Entropy.
Example 2, In Example 1 if 2 members want to visit Kerala and the other 6 wants to visit Goa. It becomes easy to decide the place since the majority of members are in favour of Goa. Now the randomness is reduced and hence low Entropy.
From the above two examples, we infer that to arrive at a conclusion, Entropy should be low.
Entropy (H) =
- The value of Entropy lies between 0 and 1.
- 0 is the lowest value that represents no impurity i.e pure node (all values belong to the same class).
- 1 is the highest value representing randomness, with no purity.
Best Feature = Entropy is calculated on each feature and the feature with minimum entropy is considered the best feature and is selected for splitting.
Pure Node: Node in which all values belong to one class.
Information Gain is the measure of the difference in entropy before and after splitting on a feature.
We already know that low entropy is good, hence we should select the feature which reduces the entropy most after splitting on it, hence Information Gain should be high.
Information Gain (IG) = IG =
- is the original dataset.
- is the jth sub-dataset after being split.
- and are the numbers of samples belong to the original dataset and the sub-dataset, respectively.
- is the Entropy of the jth sub-dataset.
Best Feature = Information Gain is calculated on each feature and the feature with maximum Information Gain is considered the best feature and is selected for splitting.
Problem with Information Gain
Let there be a feature, splitting on which gives more pure nodes. According to IG, it is considered to be the best feature to split on since IG is the highest. But this feature may be causing more splits which makes the decision tree very complex and can cause an overfitting problem.
Information Gain is biased toward high branching features. It tends to use a feature that has more unique values.
Information Gain Ratio
Information Gain Ratio is a modified version of Information Gain that prevents more split by introducing a normalizing term called Intrinsic Information.
Information Gain Ratio is the ratio of Information Gain by Intrinsic Information.
Intrinsic Information (II) = Information Gain Ratio =
Information Gain Ratio should be high, which can be increased by decreasing Intrinsic Information.
Problem with Information Gain Ratio
Due to Intrinsic Information, Information Gain Ratio prefers splitting partitions that are much smaller than the others.
Gini Index or Gini Impurity measures the probability of a particular variable being wrongly classified when it is randomly chosen.
Gini = where
Best Feature = Gini Index is calculated on each feature and the feature with minimum Gini is considered the best feature and is selected for splitting.
Advantages of Decision Tree
- Data Scaling is not required.
- No effect of missing value.
- Easy to understand as we can see the logic behind a prediction.
Disadvantage of Decision Tree
- Unstable: A small change in the data can cause a large change in the structure of the decision tree causing instability.
- Overfitting: The decision Tree tries to fit the training data very well causing an overfitting problem. This is the major problem in this algorithm. It can be prevented by changing the stopping condition.
From the above graph, we can see
- When the complexity of the trained model is low, both training error and testing errors are high, this situation is called Underfitting.
- As complexity increases, the error on training data decreases but the error on testing data starts increasing, this situation is called Overfitting.
- By applying a stopping condition, we can arrive at an optimal solution.
The stopping condition tells a Decision Tree when to stop further splitting. Below are the default stopping conditions.
- When it arrives at a Pure Node.
- When it runs out of the feature.
Using the above two conditions, the Decision Tree can become very complex even with little data causing overfitting.
Methods to Overcome Overfitting
- Early Stopping or Pre-Pruning
- Post-Pruning or Backward Pruning
1. Early Stopping or Pre-Pruning
It is the process of stopping a tree from splitting further before actually completing it. This can be done by selecting the length of the tree in advance.
- Deciding the length of the tree.
- In some cases, we get more info by splitting on one side of the tree.
In the above tree, more splits are on the left side. But if we decide on length in advance, it will apply to both sides and important splits may be removed.
To solve problem 2, we can stop further splitting when there is no significant improvement in the score like Information Gain.
2. Post-Pruning or Backward Pruning
In this method, a complete Decision Tree is created using the default stopping condition and then non-significant branches are pruned. Non-Significant branches are those splitting on which does not reduce much of the randomness in data.
Post-Pruning is the most commonly used Pruning method.
Thank you for reading this article. By the end of this article, we are familiar with the Decision Trees, different splitting methods, overfitting problems in the decision trees and its solution, and different stopping conditions.
I hope this article was informative. Feel free to ask any query or give your feedback in the comment box below.