One-hot encoding

Many classifiers in the scikit-learn library require categorical variables to be one-hot encoded. One-hot encoding, also called a 1-of-K representation, records a categorical variable that has more than two possible values as multiple variables, each of which has only two possible values.

For example, let's say that we have five patients in our dataset and we wish to one-hot encode a column that encodes the primary visit diagnosis. Before one-hot encoding, the column looks like this:

`patient_id`	`primary_dx`
1	copd
2	hypertension
3	copd
4	chf
5	asthma

After one-hot encoding, this column is split into K columns, where K is the number of possible values (here, K = 4). Each column takes a value of 0 or 1, depending on whether the observation takes the value corresponding to that column:

`patient_id`	`primary_dx_copd`	`primary_dx_hypertension`	`primary_dx_chf`	`primary_dx_asthma`
1	1	0	0	0
2	0	1	0	0
3	1	0	0	0
4	0	0	1	0
5	0	0	0	1
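
As a rough illustration, the transformation above can be reproduced with pandas on the five example rows. The `patients` DataFrame and the `dtype=int` argument are assumptions made for this sketch (newer pandas versions return booleans rather than 0/1 integers by default). pandas typically orders the new columns alphabetically by value, so they will not appear in the same order as in the table.

```python
import pandas as pd

# The five-patient example from the tables above (hypothetical DataFrame name).
patients = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5],
    "primary_dx": ["copd", "hypertension", "copd", "chf", "asthma"],
})

# Split primary_dx into one 0/1 column per distinct value.
# dtype=int requests integers instead of booleans.
encoded = pd.get_dummies(patients, columns=["primary_dx"], dtype=int)

print(encoded)
```

Each row has exactly one 1 among the `primary_dx_*` columns, which is where the name "one-hot" comes from.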

Note that we have converted the strings of the previous column into an integer representation. This makes sense since machine learning algorithms are trained on numbers, not words! This is why one-hot encoding is necessary.

scikit-learn has a `OneHotEncoder` class in its `preprocessing` module. However, pandas has a `get_dummies()` function that accomplishes one-hot encoding in a single line, so let's use the pandas function. Before we do that, we must identify the categorical columns in our dataset so that we can pass them to the function. We do this by using the metadata to identify the categorical columns and then intersecting them with the columns that remain in our data:

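# Names of the columns that the metadata marks as categorical
categ_cols = df_helper.loc[
    df_helper['variable_type'] == 'CATEGORICAL', 'column_name'
]

# Keep only the categorical columns that are still present in the training data
one_hot_cols = list(set(categ_cols) & set(X_train.columns))

# Replace each of those columns with its one-hot encoded columns
X_train = pd.get_dummies(X_train, columns=one_hot_cols)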

We must also one-hot encode the test data:

X_test = pd.get_dummies(X_test, columns=one_hot_cols)

As a final note, the testing set may contain categorical values that were not seen in the training data, or it may lack values that were. Because `get_dummies()` is applied to each set separately, the two sets would then end up with different columns, which may cause an error when assessing the performance of the model using the testing set. To prevent this, you may have to write some extra code that sets any columns missing from the testing set to zero and drops any columns that the training set does not have. Fortunately, we do not have to worry about that with our dataset.
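
One possible way to write that extra code is to reindex the encoded testing set against the columns of the encoded training set. The sketch below uses small hypothetical `train` and `test` DataFrames, not the chapter's dataset, and `dtype=int` is again an assumption to get 0/1 integers. It is one approach among several, not necessarily the one the author has in mind.

```python
import pandas as pd

# Hypothetical toy data: "hypertension" appears only in the test set,
# while "chf" and "asthma" appear only in the training set.
train = pd.DataFrame({"primary_dx": ["copd", "chf", "asthma"]})
test = pd.DataFrame({"primary_dx": ["copd", "hypertension"]})

train_enc = pd.get_dummies(train, columns=["primary_dx"], dtype=int)
test_enc = pd.get_dummies(test, columns=["primary_dx"], dtype=int)

# Give the test set exactly the training columns, in the same order:
# columns missing from the test set are created and filled with 0,
# and columns that never appeared in training are dropped.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(test_aligned)
```

After alignment, a test row whose value was never seen in training (here, the hypertension row) has 0 in every `primary_dx_*` column, so the model receives input of the expected shape even though it learned nothing about that value.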
