One-hot encoding of categorical variables

Almost every dataset contains some categorical data. Categorical data is discrete data whose values come from a finite set of possible values (often stored as strings). Because most scikit-learn estimators expect numeric input, we must encode categorical variables numerically before performing machine learning with scikit-learn.

With one-hot encoding, also known as a 1-of-K encoding scheme, a single categorical variable having k possible values is converted into k different binary variables, each of which is positive if and only if the observation's value for the original variable equals the value that binary variable represents. In Chapter 7, Making Predictive Models in Healthcare, we provide a detailed example of one-hot encoding and use a pandas function called get_dummies() to perform it on a real clinical dataset. scikit-learn also provides a class for one-hot encoding: the OneHotEncoder class in the sklearn.preprocessing module.
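As a small illustration of the scheme described above, the following sketch applies get_dummies() to a toy column. The DataFrame, the column name blood_type, and its values are hypothetical and are not taken from the clinical dataset used in Chapter 7; the dtype=int argument is just one way to get 0/1 columns instead of booleans.

```python
import pandas as pd

# Hypothetical toy data; the column name and values are illustrative only.
df = pd.DataFrame({"blood_type": ["A", "B", "O", "A", "AB"]})

# One categorical column with k = 4 distinct values becomes 4 binary columns.
# dtype=int requests 0/1 integers rather than booleans.
encoded = pd.get_dummies(df, columns=["blood_type"], dtype=int)

print(encoded)
```

Each row of the result has exactly one column set to 1, which is the "if and only if" property of the encoding.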

For instructions on how OneHotEncoder is used, you can visit the scikit-learn documentation: http://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features.
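For a rough idea of what using the class looks like, here is a minimal sketch on the same kind of toy data. The array X and its values are hypothetical, and the choice of handle_unknown="ignore" is an assumption made for illustration, not a recommendation from the text.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one categorical feature, five observations.
X = np.array([["A"], ["B"], ["O"], ["A"], ["AB"]])

# handle_unknown="ignore" encodes categories unseen during fit as all zeros
# instead of raising an error (an illustrative choice, not a requirement).
encoder = OneHotEncoder(handle_unknown="ignore")

# fit_transform returns a sparse matrix by default; convert it for display.
X_encoded = encoder.fit_transform(X).toarray()

# The categories learned for the first (and only) feature.
categories = encoder.categories_[0]

print(categories)
print(X_encoded)
```

Unlike get_dummies(), the fitted encoder remembers the categories it saw, so the same encoding can later be applied to new data with encoder.transform().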
