One-hot encoding

Many classifiers in the scikit-learn library require categorical variables to be one-hot encoded. One-hot encoding, also called a 1-of-K representation, replaces a categorical variable that has more than two possible values with multiple variables that each have only two possible values.

For example, let's say that we have five patients in our dataset and we wish to one-hot encode a column that encodes the primary visit diagnosis. Before one-hot encoding, the column looks like this:

patient_id  primary_dx
1           copd
2           hypertension
3           copd
4           chf
5           asthma

 

After one-hot encoding, this column would be split into K columns, where K is the number of possible values, and each column takes a value of 0 or 1 depending on whether the observation takes the value corresponding to that column:

patient_id  primary_dx_copd  primary_dx_hypertension  primary_dx_chf  primary_dx_asthma
1           1                0                        0               0
2           0                1                        0               0
3           1                0                        0               0
4           0                0                        1               0
5           0                0                        0               1

 

Note that we have converted the strings of the previous column into an integer representation. This makes sense since machine learning algorithms are trained on numbers, not words! This is why one-hot encoding is necessary.

scikit-learn has a OneHotEncoder class in its preprocessing module. However, pandas has a get_dummies() function that accomplishes one-hot encoding in a single line. Let's use the pandas function. Before we do that, we must identify the categorical columns in our dataset so that they can be passed to the function. We do this by using the metadata to identify the categorical columns and then intersecting them with the columns that remain in our data:

categ_cols = df_helper.loc[
    df_helper['variable_type'] == 'CATEGORICAL', 'column_name'
]

one_hot_cols = list(set(categ_cols) & set(X_train.columns))

X_train = pd.get_dummies(X_train, columns=one_hot_cols)
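To see what get_dummies() produces on a small scale, the following sketch rebuilds the five-patient example from earlier in this section. The patients DataFrame is illustrative only and is not part of our dataset. The dtype=int argument is an assumption about the desired output: recent pandas versions return Boolean columns by default, so it is passed here to get 0/1 values like those in the table.

```python
import pandas as pd

# Toy data mirroring the five-patient table (illustrative, not our dataset).
patients = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5],
    "primary_dx": ["copd", "hypertension", "copd", "chf", "asthma"],
})

# Replace primary_dx with one 0/1 column per distinct diagnosis.
# dtype=int requests integers; without it, newer pandas returns True/False.
encoded = pd.get_dummies(patients, columns=["primary_dx"], dtype=int)

print(encoded)
```

Note that get_dummies() orders the new columns alphabetically by category value, so primary_dx_asthma comes first here rather than primary_dx_copd as in the table.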

We must also one-hot encode the test data:

X_test = pd.get_dummies(X_test, columns=one_hot_cols)

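As a final note, the training and testing sets can end up with different columns after one-hot encoding. If the testing set contains a categorical value that hasn't been seen in the training data, get_dummies() creates a column for it that the training set lacks; conversely, a value that appears only in the training data leaves the testing set without the corresponding column. Either mismatch may cause an error when assessing the performance of the model using the testing set. To prevent this, you may have to write some extra code that sets any missing columns in the testing set to zero and drops any columns that the model was not trained on. Fortunately, we do not have to worry about that with our dataset.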
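If the columns ever did differ, one possible way to handle it is the pandas reindex() method, sketched below. The train and test frames and the diagnosis values are hypothetical stand-ins for X_train and X_test; this is a minimal illustration under those assumptions, not code our dataset requires.

```python
import pandas as pd

# Hypothetical toy frames standing in for X_train and X_test.
train = pd.DataFrame({"primary_dx": ["copd", "chf", "asthma"]})
test = pd.DataFrame({"primary_dx": ["copd", "pneumonia"]})

train_enc = pd.get_dummies(train, columns=["primary_dx"], dtype=int)
test_enc = pd.get_dummies(test, columns=["primary_dx"], dtype=int)

# Align the test columns to the training columns:
#   - columns absent from the test set are created and filled with 0
#   - columns the training set never produced (primary_dx_pneumonia) are dropped
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(test_aligned)
```

After this step the two frames have identical columns in identical order. A row whose value was never seen in training, such as the pneumonia patient here, ends up with zeros in every diagnosis column.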
