One-Hot Encoding, Explained

A simple guide on the what, why, and how of One-Hot Encoding.

One-Hot Encoding takes a single integer and produces a vector where a single element is 1 and all other elements are 0, like [0, 1, 0, 0].
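As a minimal sketch of that definition in plain Python (the function name `one_hot` and its parameters are illustrative, not taken from any library):

```python
def one_hot(index, size):
    """Return a list of length `size` with a 1 at position `index` and 0 elsewhere."""
    if not 0 <= index < size:
        raise ValueError(f"index {index} is out of range for size {size}")
    vector = [0] * size
    vector[index] = 1
    return vector


print(one_hot(1, 4))  # [0, 1, 0, 0]
```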

For example, imagine we’re working with categorical data, where only a limited number of colors are possible: red, green, or blue. One way we could represent this numerically is by assigning each color a number:

Color   Value
Red     0
Green   1
Blue    2

This is known as integer encoding. For Machine Learning, this encoding can be problematic: it imposes an order and a spacing on categories that have neither. In this example, we’re essentially saying “green” is the average of “red” and “blue”, which can lead a model to learn relationships that don’t exist in the data.
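A small sketch of that problem, using the color values from the table above (the variable names are only for illustration):

```python
# Integer encoding of the three colors
color_to_int = {"red": 0, "green": 1, "blue": 2}

# The kind of arithmetic a model can implicitly perform on encoded inputs
midpoint = (color_to_int["red"] + color_to_int["blue"]) / 2

print(midpoint)                           # 1.0
print(midpoint == color_to_int["green"])  # True
```

Nothing about the colors themselves justifies that relationship; it is purely an artifact of the numbers we happened to assign.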

It’s often more useful to use the one-hot encoding instead:

Color   Integer Encoding   One-Hot Encoding
Red     0                  [1, 0, 0]
Green   1                  [0, 1, 0]
Blue    2                  [0, 0, 1]

This representation is usually a better input for something like a neural network, because every category gets its own dimension and no ordering is implied.
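One rough way to see the difference is to compare distances between categories under each encoding. The sketch below uses squared Euclidean distance; the dictionaries and the `squared_distance` helper are illustrative names, not part of any library:

```python
from itertools import combinations

one_hot_codes = {"red": [1, 0, 0], "green": [0, 1, 0], "blue": [0, 0, 1]}
integer_codes = {"red": 0, "green": 1, "blue": 2}


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Columns: pair of colors, one-hot distance, integer-encoding distance
for first, second in combinations(one_hot_codes, 2):
    print(
        first,
        second,
        squared_distance(one_hot_codes[first], one_hot_codes[second]),
        (integer_codes[first] - integer_codes[second]) ** 2,
    )
# red green 2 1
# red blue 2 4
# green blue 2 1
```

With one-hot vectors every pair of colors is equally far apart, while the integer encoding makes red and blue look further apart than either is from green.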

One-Hot Encoding in Python

Below are several different ways to implement one-hot encoding in Python.

scikit-learn

Using scikit-learn’s OneHotEncoder:

from sklearn.preprocessing import OneHotEncoder

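# sparse_output=False returns a dense NumPy array instead of a sparse matrix.
# (The parameter was named `sparse` before scikit-learn 1.2.)
encoder = OneHotEncoder(sparse_output=False)
# Columns follow the sorted category order: blue, green, red.
print(encoder.fit_transform([['red'], ['green'], ['blue']]))
'''
[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]
'''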
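OneHotEncoder orders its output columns by sorted category, which is why red maps to [0, 0, 1] in the output above. If you want the columns in a specific order, one option is the `categories` parameter. This is a sketch, assuming a single input column and the red, green, blue order used earlier in this post:

```python
from sklearn.preprocessing import OneHotEncoder

# Explicit category order for the single input column (red, green, blue),
# instead of the default sorted order (blue, green, red).
encoder = OneHotEncoder(categories=[["red", "green", "blue"]])

# The default output is a SciPy sparse matrix; .toarray() makes it dense.
encoded = encoder.fit_transform([["red"], ["green"], ["blue"]]).toarray()
print(encoded)
```

With this ordering, red, green, and blue map to the first, second, and third columns respectively, matching the table earlier in the post.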

Keras

Using Keras’s to_categorical:

from keras.utils import to_categorical

print(to_categorical([0, 1, 2]))
'''
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
'''

NumPy

Using NumPy:

import numpy as np

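arr = [2, 1, 0]
# Infer the number of classes from the largest label.
# (Named num_classes to avoid shadowing Python's built-in max.)
num_classes = np.max(arr) + 1
# Row i of the identity matrix is the one-hot vector for class i.
print(np.eye(num_classes)[arr])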
'''
[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]
'''
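Inferring the number of classes from the data only works when the largest class actually appears in the array. A hedged variation is to pass the class count explicitly; the `one_hot` helper and its `num_classes` parameter below are illustrative names, not a NumPy API:

```python
import numpy as np


def one_hot(labels, num_classes):
    """One-hot encode integer labels into an array of shape (len(labels), num_classes)."""
    labels = np.asarray(labels)
    # Index rows of the identity matrix: row i is the one-hot vector for class i.
    return np.eye(num_classes)[labels]


batch = [0, 1, 1]  # class 2 exists but is absent from this batch
print(one_hot(batch, num_classes=3))
'''
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 1. 0.]]
'''
```

Here `np.max(batch) + 1` would have been 2, producing vectors that are one element too short.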

This blog is open-source on GitHub.