If you look at the documentation for the DecisionTreeClassifier class in scikit-learn, you’ll see something like this for the `criterion`

parameter:

The RandomForestClassifier documentation says the same thing. Both mention that the default criterion is “gini” for the **Gini Impurity**. What is that?!

TLDR: Read the Recap.

## Decision Trees 🌲

Training a decision tree consists of iteratively splitting the current data into two branches. Say we had the following datapoints:

Right now, we have 1 branch with 5 blues and 5 greens.

Let’s make a split at $x = 2$:

This is a **perfect** split! It breaks our dataset perfectly into two branches:

- Left branch, with 5 blues.
- Right branch, with 5 greens.

What if we’d made a split at $x = 1.5$ instead?

This imperfect split breaks our dataset into these branches:

- Left branch, with 4 blues.
- Right branch, with 1 blue and 5 greens.

It’s obvious that this split is worse, but **how can we quantify that?**

Being able to measure the quality of a split becomes even more important if we add a third class, reds . Imagine the following split:

- Branch 1, with 3 blues, 1 green, and 1 red.
- Branch 2, with 3 greens and 1 red.

Compare that against this split:

- Branch 1, with 3 blues, 1 green, and 2 reds.
- Branch 2, with 3 greens.

Which split is better? It’s no longer immediately obvious. We need a way to **quantitatively evaluate** how good a split is.

## Gini Impurity

This is where the Gini Impurity metric comes in.

Suppose we

- Randomly pick a datapoint in our dataset, then
**Randomly classify it according to the class distribution in the dataset**. For our dataset, we’d classify it as blue $\frac{5}{10}$ of the time and as green $\frac{5}{10}$ of the time, since we have 5 datapoints of each color.

**What’s the probability we classify the datapoint incorrectly?** The answer to that question is the Gini Impurity.

### Example 1: The Whole Dataset

Let’s calculate the Gini Impurity of our entire dataset. If we randomly pick a datapoint, it’s either blue (50%) or green (50%).

Now, we randomly classify our datapoint according to the class distribution. Since we have 5 of each color, we classify it as blue 50% of the time and as green 50% of the time.

What’s the probability we classify our datapoint **incorrectly**?

Event | Probability |
---|---|

Pick Blue, Classify Blue ✓ | 25% |

Pick Blue, Classify Green ❌ | 25% |

Pick Green, Classify Blue ❌ | 25% |

Pick Green, Classify Green ✓ | 25% |

We only classify it incorrectly in 2 of the events above. Thus, our total probability is 25% + 25% = 50%, so the Gini Impurity is $\boxed{0.5}$.

### The Formula

If we have $C$ total classes and $p(i)$ is the probability of picking a datapoint with class $i$, then the Gini Impurity is calculated as

$G = \sum_{i=1}^C p(i) * (1 - p(i))$For the example above, we have $C = 2$ and $p(1) = p(2) = 0.5$, so

$\begin{aligned} G &= p(1) * (1 - p(1)) + p(2) * (1 - p(2)) \\ &= 0.5 * (1 - 0.5) + 0.5 * (1 - 0.5) \\ &= \boxed{0.5} \\ \end{aligned}$which matches what we calculated!

### Example 2: A Perfect Split

Let’s go back to the perfect split we had. What are the Gini Impurities of the two branches after the split?

Left Branch has only blues, so its Gini Impurity is

$G_{left} = 1 * (1 - 1) + 0 * (1 - 0) = \boxed{0}$Right Branch has only greens, so its Gini Impurity is

$G_{right} = 0 * (1 - 0) + 1 * (1 - 1) = \boxed{0}$Both branches have $0$ impurity! The perfect split turned a dataset with $0.5$ impurity into 2 branches with $0$ impurity.

**A Gini Impurity of 0 is the lowest and best possible impurity**. It can only be achieved when everything is the same class (e.g. only blues or only greens).

### Example 3: An Imperfect Split

Finally, let’s return to our imperfect split.

Left Branch has only blues, so we know that $G_{left} = \boxed{0}$.

Right Branch has 1 blue and 5 greens, so

$\begin{aligned} G_{right} &= \frac{1}{6} * (1 - \frac{1}{6}) + \frac{5}{6} * (1 - \frac{5}{6}) \\ &= \frac{5}{18} \\ &= \boxed{0.278} \\ \end{aligned}$## Picking The Best Split

It’s finally time to answer the question we posed earlier: **how can we quantitatively evaluate the quality of a split?**

Here’s the imperfect split yet again:

We’ve already calculated the Gini Impurities for:

- Before the split (the entire dataset): $0.5$
- Left Branch: $0$
- Right Branch: $0.278$

We’ll determine the quality of the split by **weighting the impurity of each branch by how many elements it has**. Since Left Branch has 4 elements and Right Branch has 6, we get:

Thus, the amount of impurity we’ve “removed” with this split is

$0.5 - 0.167 = \boxed{0.333}$I’ll call this value the Gini Gain. This is what’s used to pick the best split in a decision tree! **Higher Gini Gain = Better Split**. For example, it’s easy to verify that the Gini Gain of the perfect split on our dataset is $0.5 > 0.333$.

## Recap

**Gini Impurity** is the probability of *incorrectly* classifying a randomly chosen element in the dataset if it were randomly labeled *according to the class distribution* in the dataset. It’s calculated as

where $C$ is the number of classes and $p(i)$ is the probability of randomly picking an element of class $i$.

When training a decision tree, the best split is chosen by **maximizing the Gini Gain**, which is calculated by subtracting the weighted impurities of the branches from the original impurity.