In the previous chapter, Predicting Online Ads Click-through with Tree-Based Algorithms, we saw how one-hot encoding transforms categorical features into numerical ones so that they can be used by the tree algorithms in scikit-learn and TensorFlow. One-hot encoding does not limit us to tree-based algorithms, however: it makes categorical features usable by any other algorithm that only accepts numerical input.
The simplest way to transform a categorical feature with k possible values into a numerical one is to map it to integers from 1 to k. For example, [Tech, Fashion, Fashion, Sports, Tech, Tech, Sports] becomes [1, 2, 2, 3, 1, 1, 3]. However, this imposes an ordinal relationship (for instance, Sports being greater than Tech) and a distance property (for instance, Sports being closer to Fashion than to Tech), neither of which exists in the original data.
Instead, one-hot encoding converts the categorical feature into k binary features, each indicating the presence or absence of one possible value. In the preceding example, Tech, Fashion, and Sports each get their own indicator column, and every sample has exactly one of those columns set to 1.
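Before reaching for a library, we can sketch this transformation in a few lines of plain Python (variable names here are our own, for illustration only), using the interest values from the preceding example with the distinct values sorted alphabetically:

```python
interests = ['Tech', 'Fashion', 'Fashion', 'Sports', 'Tech', 'Tech', 'Sports']

# One column per distinct value, in alphabetical order
categories = sorted(set(interests))    # ['Fashion', 'Sports', 'Tech']

# Each sample becomes a binary row with a single 1 marking its value
encoded = [[1 if value == category else 0 for category in categories]
           for value in interests]
for row in encoded:
    print(row)
```

Note that exactly one entry per row is 1, so no artificial ordering or distance is introduced between the categories.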
Previously, we used OneHotEncoder from scikit-learn to convert a matrix of strings into a binary matrix. Here, let's take a look at another utility, DictVectorizer, which also provides an efficient conversion: it transforms dictionary objects of the form {categorical feature: value} into one-hot encoded vectors.
For example, take a look at the following code:
>>> from sklearn.feature_extraction import DictVectorizer
>>> X_dict = [{'interest': 'tech', 'occupation': 'professional'},
...           {'interest': 'fashion', 'occupation': 'student'},
...           {'interest': 'fashion', 'occupation': 'professional'},
...           {'interest': 'sports', 'occupation': 'student'},
...           {'interest': 'tech', 'occupation': 'student'},
...           {'interest': 'tech', 'occupation': 'retired'},
...           {'interest': 'sports', 'occupation': 'professional'}]
>>> dict_one_hot_encoder = DictVectorizer(sparse=False)
>>> X_encoded = dict_one_hot_encoder.fit_transform(X_dict)
>>> print(X_encoded)
[[ 0. 0. 1. 1. 0. 0.]
[ 1. 0. 0. 0. 0. 1.]
[ 1. 0. 0. 1. 0. 0.]
[ 0. 1. 0. 0. 0. 1.]
[ 0. 0. 1. 0. 0. 1.]
[ 0. 0. 1. 0. 1. 0.]
[ 0. 1. 0. 1. 0. 0.]]
We can also see the mapping by executing the following:
>>> print(dict_one_hot_encoder.vocabulary_)
{'interest=fashion': 0, 'interest=sports': 1,
'occupation=professional': 3, 'interest=tech': 2,
'occupation=retired': 4, 'occupation=student': 5}
When it comes to new data, we can transform it with the same fitted encoder:
>>> new_dict = [{'interest': 'sports', 'occupation': 'retired'}]
>>> new_encoded = dict_one_hot_encoder.transform(new_dict)
>>> print(new_encoded)
[[ 0. 1. 0. 0. 1. 0.]]
We can also convert the encoded features back to the original form with inverse_transform:
>>> print(dict_one_hot_encoder.inverse_transform(new_encoded))
[{'interest=sports': 1.0, 'occupation=retired': 1.0}]
One important thing to note is that if a new category (one not seen in the training data) is encountered in new data, it should be ignored rather than raising an error. DictVectorizer handles this implicitly, while OneHotEncoder needs to be constructed with the handle_unknown='ignore' parameter:
>>> new_dict = [{'interest': 'unknown_interest',
...              'occupation': 'retired'},
...             {'interest': 'tech',
...              'occupation': 'unseen_occupation'}]
>>> new_encoded = dict_one_hot_encoder.transform(new_dict)
>>> print(new_encoded)
[[ 0. 0. 0. 0. 1. 0.]
[ 0. 0. 1. 0. 0. 0.]]
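For comparison, here is a minimal sketch of how the same unseen-category situation is handled with OneHotEncoder and handle_unknown='ignore'; the training sample below is our own, chosen only to mirror the interest/occupation setup above:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit on a small sample of known categories
X_train = np.array([['tech', 'professional'],
                    ['fashion', 'student'],
                    ['sports', 'retired']])
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(X_train)

# An unseen interest simply encodes as all zeros in its columns
X_new = np.array([['unknown_interest', 'retired']])
result = encoder.transform(X_new).toarray()
print(result)
```

Without handle_unknown='ignore', the transform call would raise an error on the unseen category instead of producing an all-zero group of columns.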
Sometimes, we do prefer transforming a categorical feature with k possible values into a numerical feature with values ranging from 1 to k. We conduct this ordinal encoding in order to employ ordinal or ranking knowledge in our learning; for example, large, medium, and small become 3, 2, and 1 respectively, and good and bad become 1 and 0. One-hot encoding fails to preserve such useful ordering information. We can realize ordinal encoding easily with pandas, for example:
>>> import pandas as pd
>>> df = pd.DataFrame({'score': ['low',
... 'high',
... 'medium',
... 'medium',
... 'low']})
>>> print(df)
score
0 low
1 high
2 medium
3 medium
4 low
>>> mapping = {'low':1, 'medium':2, 'high':3}
>>> df['score'] = df['score'].replace(mapping)
>>> print(df)
score
0 1
1 3
2 2
3 2
4 1
We convert the string feature into ordinal values based on the mapping we define.
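As an alternative sketch (assuming a recent pandas version), an ordered Categorical dtype can produce the same codes without a hand-written mapping; note that .cat.codes starts at 0, so we shift by one to match the 1-to-k convention used above:

```python
import pandas as pd

df = pd.DataFrame({'score': ['low', 'high', 'medium', 'medium', 'low']})

# Declare the categories in ascending order; .cat.codes then yields 0, 1, 2
dtype = pd.CategoricalDtype(categories=['low', 'medium', 'high'], ordered=True)
df['score'] = df['score'].astype(dtype).cat.codes + 1   # shift to 1..3
print(df)
```

Declaring the order once on the dtype also lets pandas sort and compare values by rank rather than alphabetically.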