A Simple Explanation of Information Gain and Entropy

What Information Gain and Information Entropy are and how they're used to train Decision Trees.

Information Gain, like Gini Impurity, is a metric used to train Decision Trees. Specifically, these metrics measure the quality of a split. For example, say we have the following data:

The Dataset

What if we made a split at x=1.5x = 1.5?

An Imperfect Split

This imperfect split breaks our dataset into these branches:

  • Left branch, with 4 blues.
  • Right branch, with 1 blue and 5 greens.

It’s clear this split isn’t optimal, but how good is it? How can we quantify the quality of a split?

That’s where Information Gain comes in.

Confused? Not sure what Decision Trees are or how they’re trained? Read the beginning of my introduction to Random Forests and Decision Trees.

Information Entropy

Before we get to Information Gain, we have to first talk about Information Entropy. In the context of training Decision Trees, Entropy can be roughly thought of as how much variance the data has. For example:

  • A dataset of only blues would have very low (in fact, zero) entropy.
  • A dataset of mixed blues, greens, and reds would have relatively high entropy.

Here’s how we calculate Information Entropy for a dataset with CC classes:

E=iCpilog2piE = -\sum_i^C p_i \log_2 p_i

where pip_i is the probability of randomly picking an element of class ii (i.e. the proportion of the dataset made up of class ii).

The easiest way to understand this is with an example. Consider a dataset with 1 blue, 2 greens, and 3 reds: . Then

E=(pblog2pb+pglog2pg+prlog2pr)E = -(p_b \log_2 p_b + p_g \log_2 p_g + p_r \log_2 p_r)

We know pb=16p_b = \frac{1}{6} because 16\frac{1}{6} of the dataset is blue. Similarly, pg=26p_g = \frac{2}{6} (greens) and pr=36p_r = \frac{3}{6} (reds). Thus,

E=(16log2(16)+26log2(26)+36log2(36))=1.46\begin{aligned} E &= -(\frac{1}{6} \log_2(\frac{1}{6}) + \frac{2}{6} \log_2(\frac{2}{6}) + \frac{3}{6} \log_2(\frac{3}{6})) \\ &= \boxed{1.46} \\ \end{aligned}

What about a dataset of all one color? Consider 3 blues as an example: . The entropy would be

E=(1log21)=0E = -(1 \log_2 1) = \boxed{0}

Information Gain

It’s finally time to answer the question we posed earlier: how can we quantify the quality of a split?

Let’s consider this split again:

An Imperfect Split

Before the split, we had 5 blues and 5 greens, so the entropy was

Ebefore=(0.5log20.5+0.5log20.5)=1\begin{aligned} E_{before} &= -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) \\ &= \boxed{1} \\ \end{aligned}

After the split, we have two branches.

Left Branch has 4 blues, so Eleft=0E_{left} = \boxed{0} because it’s a dataset of all one color.

Right Branch has 1 blue and 5 greens, so

Eright=(16log2(16)+56log2(56))=0.65\begin{aligned} E_{right} &= -(\frac{1}{6} \log_2 (\frac{1}{6}) + \frac{5}{6} \log_2 (\frac{5}{6})) \\ &= \boxed{0.65} \\ \end{aligned}

Now that we have the entropies for both branches, we can determine the quality of the split by weighting the entropy of each branch by how many elements it has. Since Left Branch has 4 elements and Right Branch has 6, we weight them by 0.40.4 and 0.60.6, respectively:

Esplit=0.40+0.60.65=0.39\begin{aligned} E_{split} &= 0.4 * 0 + 0.6 * 0.65 \\ &= \boxed{0.39} \\ \end{aligned}

We started with Ebefore=1E_{before} = 1 entropy before the split and now are down to 0.390.39! Information Gain = how much Entropy we removed, so

Gain=10.39=0.61\text{Gain} = 1 - 0.39 = \boxed{0.61}

This makes sense: higher Information Gain = more Entropy removed, which is what we want. In the perfect case, each branch would contain only one color after the split, which would be zero entropy!

Recap

Information Entropy can be thought of as how how unpredictable a dataset is.

  • A set of only one class (say, blue ) is extremely predictable: anything in it is blue. This would have low entropy.
  • A set of many mixed classes is unpredictable: a given element could be any color! This would have high entropy.

The actual formula for calculating Information Entropy is:

E=iCpilog2piE = -\sum_i^C p_i \log_2 p_i

Information Gain is calculated for a split by subtracting the weighted entropies of each branch from the original entropy. When training a Decision Tree using these metrics, the best split is chosen by maximizing Information Gain.

Want to learn more? Check out my explanation of Gini Impurity, a similar metric, or my in-depth guide Random Forests for Complete Beginners.

I write about ML, Web Dev, and more. Subscribe to get new posts by email!



This blog is open-source on Github.

At least this isn't a full screen popup

That would be more annoying. Anyways, consider subscribing to my newsletter to get new posts by email! I write about ML, Web Dev, and more.