Once you can measure impurity, the next question is how to choose a split. Decision trees do this by computing how much each candidate split reduces entropy. That reduction is called information gain.
The workflow at a node:
- Compute the entropy of the current node.
- For each candidate feature, imagine splitting the data.
- Compute the entropy of each child branch.
- Take a weighted average of child entropies based on how many examples go left and right.
- Subtract that weighted child impurity from the parent impurity.
Formula:
information gain = H(parent) - [w_left * H(left) + w_right * H(right)]
Here w_left and w_right are the fractions of examples that go to the left and right branches. The weighting matters because a highly impure branch containing many examples is more important than a highly impure branch containing only a tiny number of examples.
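The workflow and formula above can be sketched in a few lines of code. This is a minimal illustration, not a library API: `entropy` and `information_gain` are hypothetical helper names, and the example numbers (a 50/50 parent node split into 4-of-5 and 1-of-5 children) are assumptions chosen to mirror the kind of worked example the source describes.

```python
import math

def entropy(p):
    """Binary entropy (in bits) of a branch with positive-class fraction p."""
    if p in (0, 1):
        return 0.0  # a pure branch has zero impurity
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def information_gain(parent_pos, parent_n, left_pos, left_n, right_pos, right_n):
    """Parent entropy minus the size-weighted average of child entropies."""
    w_left = left_n / parent_n
    w_right = right_n / parent_n
    return (entropy(parent_pos / parent_n)
            - (w_left * entropy(left_pos / left_n)
               + w_right * entropy(right_pos / right_n)))

# Hypothetical node: 5 positives out of 10 examples; a candidate split
# sends 5 examples left (4 positive) and 5 examples right (1 positive).
gain = information_gain(5, 10, 4, 5, 1, 5)  # roughly 0.28 bits
```

Note how the branch-size weights `w_left` and `w_right` enter the average: an impure branch that receives most of the examples drags the weighted child entropy up far more than an equally impure branch that receives only a few.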
Interpretation: a large information gain means the split made the children much cleaner than the parent. A small information gain means the split did little to reduce uncertainty and may not be worth the added complexity.
Source note: in the worked example, splitting the root on ear shape gives a larger reduction in entropy than splitting on face shape or whiskers, which is why the tree chooses ear shape as the root feature.
Why use reduction instead of only weighted child entropy? Because the reduction value is directly useful as a stopping signal. If the gain is tiny, the split may not justify a larger tree. This connects split quality to regularization: the tree should grow only when it earns the right to grow.
Production guidance: information gain is local, not global. A greedy tree algorithm chooses the best split at the current node, not the globally optimal full tree. This is one reason trees are fast and practical, but also one reason different samples or perturbations can lead to different learned trees.
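The greedy, local nature of split selection can be sketched as a single-node search. This is a minimal sketch under simplifying assumptions (binary 0/1 features, binary labels); `best_split`, `min_gain`, and the helper names are hypothetical, not any real library's API. The `min_gain` threshold turns a small information gain into a stopping signal, connecting split quality to regularization as discussed above.

```python
import math

def entropy(p):
    """Binary entropy (in bits) of a branch with positive-class fraction p."""
    if p in (0, 1):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def information_gain(parent_pos, parent_n, l_pos, l_n, r_pos, r_n):
    """Parent entropy minus the size-weighted average of child entropies."""
    return (entropy(parent_pos / parent_n)
            - (l_n / parent_n) * entropy(l_pos / l_n)
            - (r_n / parent_n) * entropy(r_pos / r_n))

def best_split(X, y, min_gain=1e-3):
    """Greedy choice at one node: score every binary feature by information
    gain, keep the best, and refuse to split when the best gain is too small."""
    n = len(y)
    parent_pos = sum(y)
    best = None  # (gain, feature_index)
    for j in range(len(X[0])):
        left = [label for row, label in zip(X, y) if row[j] == 1]
        right = [label for row, label in zip(X, y) if row[j] == 0]
        if not left or not right:
            continue  # everything went one way: this split teaches nothing
        gain = information_gain(parent_pos, n,
                                sum(left), len(left),
                                sum(right), len(right))
        if best is None or gain > best[0]:
            best = (gain, j)
    if best is None or best[0] < min_gain:
        return None  # stopping signal: no split earns the added complexity
    return best
```

The search is local by construction: only the current node's slice of data is scored, and a full tree emerges by applying the same routine recursively to each child, which is why small perturbations of the data can change which feature wins at a node and reshape the tree below it.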
Architecture note: information gain is the scoring function behind learned routing. It decides which question best separates the current slice of data. If the gain is weak everywhere, the node may already be as simple as it should be for the available evidence.
Interview-Ready Deepening
Source-backed reinforcement: these points restate and extend the core ideas from the source, with emphasis on production tradeoffs.
- Information gain measures how much a candidate split reduces weighted entropy, allowing the tree to choose the most purity-improving feature.
- When building a decision tree, the way we'll decide what feature to split on at a node will be based on what choice of feature reduces entropy the most.
- The way we choose a split is by computing the weighted average entropy for each candidate feature and picking whichever is lowest, because that split yields the left and right sub-branches with the lowest average weighted entropy.
- In decision tree learning, the reduction of entropy is called information gain.
- Rather than working with the weighted average entropy directly, we compute the reduction in entropy compared to not splitting at all.
Tradeoffs You Should Be Able to Explain
- More expressive models improve fit but can reduce interpretability and raise overfitting risk.
- Higher optimization speed can reduce training time but may increase instability if learning dynamics are not monitored.
- Feature-rich pipelines improve performance ceilings but increase maintenance and monitoring complexity.
First-time learner note: Read each model as a dataflow system: inputs become representations, representations become scores, and scores become decisions through a chosen loss and thresholding policy.
Production note: Track three things relentlessly in ML systems: data shape contracts, evaluation methodology, and the operational meaning of the model's errors. Most expensive failures come from one of those three.
Information gain is weighted impurity reduction. The branch-size weighting is crucial because impurity in a large branch matters more than impurity in a tiny branch.
Greedy tradeoff: split selection is local, not globally optimal. The algorithm chooses the best immediate reduction and relies on recursive refinement plus stopping rules for practical quality.