The first three chapters of this book focused on data cleaning, exploration, and how to identify missing values and outliers. The next few chapters will delve heavily into feature engineering, starting, in this chapter, with techniques to encode, transform, and scale data to improve the performance of machine learning models.
Typically, machine learning algorithms require some form of encoding of variables. Additionally, our models often perform better with scaling so that features with higher variability do not overwhelm the optimization. We will show you how to use different scaling techniques when your features have dramatically different ranges.
Specifically, in this chapter, we will explore the following main topics:
In this chapter, we will work extensively with the feature-engine and category_encoders packages alongside the sklearn library. You can use pip to install these packages with pip install feature-engine, pip install category_encoders, and pip install scikit-learn. The code in this chapter uses version 0.24.2 of sklearn, version 1.1.2 of feature-engine, and version 2.2.2 of category_encoders. Note that either pip install feature-engine or pip install feature_engine will work.
All of the code for this chapter can be found on GitHub at https://github.com/PacktPublishing/Data-Cleaning-and-Exploration-with-Machine-Learning/tree/main/4.%20PruningEncodingandRescalingFeatures.
One of the biggest threats to the performance of our models is data leakage. Data leakage occurs whenever our models are informed by data that is not in the training dataset. Sometimes, we inadvertently assist our model training with information that cannot be gleaned from the training data alone and end up with an overly rosy assessment of our model's accuracy.
Data scientists do not intend for this to happen, hence the term leakage. This is not a "don't do it" kind of discussion; we all know not to do it. It is more of a "which steps should I take to avoid the problem?" discussion. It is actually quite easy to introduce some data leakage unless we develop routines to prevent it.
For example, if we have missing values for a feature, we might impute the mean across the whole dataset for those values. However, in order to validate our model, we subsequently split our data into training and testing datasets. We would then have accidentally introduced data leakage into our training dataset since the information from the full dataset (that is, the global mean) would have been used.
One of the practices that data scientists have adopted to avoid this is to establish separate training and testing datasets as close to the beginning of the analysis as possible. This can become a little more complicated with validation techniques such as cross-validation, but in the following chapters, we will go over how to avoid data leakage in a variety of situations.
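The following is a minimal sketch of leakage-free mean imputation. It assumes a small synthetic DataFrame (the `score` column and its values are made up for illustration) and uses scikit-learn's `SimpleImputer`, which is one possible tool for this rather than the only one. The split happens first, and the mean is learned from the training rows only:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only: one feature with missing values
df = pd.DataFrame({"score": [1.0, 2.0, np.nan, 4.0, 5.0,
                             np.nan, 7.0, 8.0, 9.0, 10.0]})

# Split first, so nothing from the testing rows can inform the imputation
train, test = train_test_split(df, test_size=0.3, random_state=0)

# Learn the mean from the training rows only
imputer = SimpleImputer(strategy="mean")
imputer.fit(train)

# Apply the training mean to both the training and the testing rows
train_imputed = pd.DataFrame(imputer.transform(train),
                             columns=train.columns, index=train.index)
test_imputed = pd.DataFrame(imputer.transform(test),
                            columns=test.columns, index=test.index)
```

Any missing value in the testing rows is filled with the training mean, not with a mean that the testing rows helped to compute.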
We can use scikit-learn to create training and testing DataFrames for the National Longitudinal Survey of Youth data.
Note
The National Longitudinal Survey (NLS) of Youth is conducted by the United States Bureau of Labor Statistics. This survey started with a cohort of individuals in 1997 who were born between 1980 and 1984, with annual follow-ups each year through to 2017. For this section, I pulled 89 variables on grades, employment, income, and attitudes toward the government from the hundreds of data items within the survey. Separate files for SPSS, Stata, and SAS can be downloaded from the repository. The NLS data can be downloaded for public use from https://www.nlsinfo.org/investigator/pages/search.
Let's start creating the DataFrame:
import pandas as pd
from sklearn.model_selection import train_test_split
nls97 = pd.read_csv("data/nls97b.csv")
nls97.set_index("personid", inplace=True)
feature_cols = ['satverbal','satmath','gpascience',
    'gpaenglish','gpamath','gpaoverall']
# Split before any cleaning or imputation to avoid data leakage
X_train, X_test, y_train, y_test = train_test_split(
    nls97[feature_cols], nls97[['wageincome']],
    test_size=0.3, random_state=0)
nls97.shape[0]
8984
X_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6288 entries, 574974 to 370933
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -------
0 satverbal 1001 non-null float64
1 satmath 1001 non-null float64
2 gpascience 3998 non-null float64
3 gpaenglish 4078 non-null float64
4 gpamath 4056 non-null float64
5 gpaoverall 4223 non-null float64
dtypes: float64(6)
memory usage: 343.9 KB
y_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6288 entries, 574974 to 370933
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -------
0 wageincome 3599 non-null float64
dtypes: float64(1)
memory usage: 98.2 KB
X_test.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2696 entries, 363170 to 629736
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -------
0 satverbal 405 non-null float64
1 satmath 406 non-null float64
2 gpascience 1686 non-null float64
3 gpaenglish 1720 non-null float64
4 gpamath 1710 non-null float64
5 gpaoverall 1781 non-null float64
dtypes: float64(6)
memory usage: 147.4 KB
y_test.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2696 entries, 363170 to 629736
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -------
0 wageincome 1492 non-null float64
dtypes: float64(1)
memory usage: 42.1 KB
We will use scikit-learn's train_test_split to create separate training and testing DataFrames in the rest of this chapter. We will introduce more complicated strategies for constructing testing datasets for validation in Chapter 6, Preparing for Model Evaluation.
Next, we begin our feature engineering work by removing features that are obviously unhelpful, either because they have the same data as another feature or because there is no variation in the responses.
During the process of data cleaning and manipulation, we often end up with data that is no longer meaningful. Perhaps we subsetted data based on a single feature value, and we have retained that feature even though it now has the same value for all observations. Or, for the subset of the data that we are using, two features have the same value. Ideally, we catch those redundancies during our data cleaning. However, if we do not catch them during that process, we can use the open source feature-engine package to help us.
Additionally, there might be features that are so highly correlated that it is very unlikely that we could build a model that could use all of them effectively. feature-engine has a transformer, DropCorrelatedFeatures, that makes it easy to remove a feature when it is highly correlated with another feature.
In this section, we will work with land temperature data, along with the NLS data. Note that we will only load temperature data for Poland here.
Data Note
The land temperature dataset contains the average temperature readings (in Celsius) in 2019 from over 12,000 stations across the world, though the majority of the stations are in the United States. The raw data was retrieved from the Global Historical Climatology Network integrated database. It has been made available for public use by the United States National Oceanic and Atmospheric Administration at https://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets/global-historical-climatology-network-monthly-version-4.
Let's start removing redundant and unhelpful features:
import pandas as pd
import feature_engine.selection as fesel
from sklearn.model_selection import train_test_split
nls97 = pd.read_csv("data/nls97b.csv")
nls97.set_index("personid", inplace=True)
ltpoland = pd.read_csv("data/ltpoland.csv")
ltpoland.set_index("station", inplace=True)
ltpoland.dropna(inplace=True)
feature_cols = ['satverbal','satmath','gpascience',
    'gpaenglish','gpamath','gpaoverall']
X_train, X_test, y_train, y_test = train_test_split(
    nls97[feature_cols], nls97[['wageincome']],
    test_size=0.3, random_state=0)
X_train.corr()
satverbal satmath gpascience gpaenglish
satverbal 1.000 0.729 0.439 0.444
satmath 0.729 1.000 0.480 0.430
gpascience 0.439 0.480 1.000 0.672
gpaenglish 0.444 0.430 0.672 1.000
gpamath 0.375 0.518 0.606 0.600
gpaoverall 0.421 0.485 0.793 0.844
gpamath gpaoverall
satverbal 0.375 0.421
satmath 0.518 0.485
gpascience 0.606 0.793
gpaenglish 0.600 0.844
gpamath 1.000 0.750
gpaoverall 0.750 1.000
Here, gpaoverall is highly correlated with gpascience, gpaenglish, and gpamath. The corr method returns the Pearson coefficients by default. This is fine when we can assume a linear relationship between the features. However, when this assumption does not make sense, we should consider requesting Spearman coefficients instead. We can do that by passing spearman to the method parameter of corr.
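To see why the choice of coefficient matters, here is a small sketch with synthetic data (the `x` and `y` columns are invented for illustration). The relationship is perfectly monotonic but strongly non-linear, so the Spearman coefficient is 1 while the Pearson coefficient is noticeably lower:

```python
import numpy as np
import pandas as pd

# Synthetic example: y always increases with x, but not linearly
x = np.arange(1, 51, dtype=float)
df = pd.DataFrame({"x": x, "y": np.exp(x / 5)})

# corr uses Pearson coefficients unless told otherwise
pearson = df.corr().loc["x", "y"]

# Spearman works on ranks, so any monotonic relationship scores 1
spearman = df.corr(method="spearman").loc["x", "y"]

print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```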
tr = fesel.DropCorrelatedFeatures(variables=None, method='pearson', threshold=0.75)
tr.fit(X_train)
X_train_tr = tr.transform(X_train)
X_test_tr = tr.transform(X_test)
X_train_tr.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6288 entries, 574974 to 370933
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -------
0 satverbal 1001 non-null float64
1 satmath 1001 non-null float64
2 gpascience 3998 non-null float64
3 gpaenglish 4078 non-null float64
4 gpamath 4056 non-null float64
dtypes: float64(5)
memory usage: 294.8 KB
Typically, we would evaluate a feature more carefully before deciding to drop it. However, there are times when feature selection is part of a pipeline, and we need to automate the process. This can be done with DropCorrelatedFeatures since all of the feature_engine transformers can be brought into a scikit-learn pipeline.
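To make the idea behind correlation-based pruning concrete, here is a rough pandas sketch. It is not feature_engine's implementation; the helper name `drop_correlated` and the synthetic columns `a`, `b`, and `c` are hypothetical. For each feature, in column order, it drops any later feature whose absolute correlation with it exceeds the threshold:

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.75, method="pearson"):
    """Hypothetical helper: keep the first feature of each correlated pair."""
    corr = df.corr(method=method).abs()
    cols = list(corr.columns)
    to_drop = []
    for i, col in enumerate(cols):
        if col in to_drop:
            continue  # a dropped feature cannot cause further drops
        for other in cols[i + 1:]:
            if other not in to_drop and corr.loc[col, other] > threshold:
                to_drop.append(other)
    return df.drop(columns=to_drop), to_drop

# Synthetic data: b is almost a multiple of a, c is unrelated noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
data = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})

reduced, dropped = drop_correlated(data, threshold=0.75)
```

As with any transformer, the decision about which features to drop should be made on the training data and then applied unchanged to the testing data.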
feature_cols = ['year','month','latabs',
    'latitude','elevation', 'longitude','country']
X_train, X_test, y_train, y_test = train_test_split(
    ltpoland[feature_cols], ltpoland[['temperature']],
    test_size=0.3, random_state=0)
X_train.sample(5, random_state=99)
year month latabs latitude elevation longitude country
station
SIEDLCE 2019 11 52 52 152 22 Poland
OKECIE 2019 6 52 52 110 21 Poland
BALICE 2019 1 50 50 241 20 Poland
BALICE 2019 7 50 50 241 20 Poland
BIALYSTOK 2019 11 53 53 151 23 Poland
X_train.year.value_counts()
2019 84
Name: year, dtype: int64
X_train.country.value_counts()
Poland 84
Name: country, dtype: int64
(X_train.latitude!=X_train.latabs).sum()
0
tr = fesel.DropConstantFeatures()
tr.fit(X_train)
X_train_tr = tr.transform(X_train)
X_test_tr = tr.transform(X_test)
X_train_tr.head()
month latabs latitude elevation longitude
station
OKECIE 1 52 52 110 21
LAWICA 8 52 52 94 17
LEBA 11 55 55 2 18
SIEDLCE 10 52 52 152 22
BIALYSTOK 11 53 53 151 23
tr = fesel.DropDuplicateFeatures()
tr.fit(X_train_tr)
X_train_tr = tr.transform(X_train_tr)
X_train_tr.head()
month latabs elevation longitude
station
OKECIE 1 52 110 21
LAWICA 8 52 94 17
LEBA 11 55 2 18
SIEDLCE 10 52 152 22
BIALYSTOK 11 53 151 23
This fixes some obvious problems with our features in the NLS data and the land temperature data for Poland. We dropped gpaoverall from a DataFrame that has the other GPA features because it is highly correlated with them. Additionally, we removed redundant data, dropping features with the same value throughout the DataFrame and features that duplicate the values of another feature.
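For readers who want to see what constant and duplicate features look like without feature_engine, the following pandas sketch performs similar checks by hand. The tiny DataFrame is made up to resemble the Poland data, and this is an illustration of the idea, not a description of how DropConstantFeatures or DropDuplicateFeatures work internally:

```python
import pandas as pd

# Made-up rows that mimic the structure of the Poland temperature data
df = pd.DataFrame({
    "year": [2019, 2019, 2019, 2019],
    "month": [1, 8, 11, 10],
    "latabs": [52, 52, 55, 52],
    "latitude": [52, 52, 55, 52],
    "elevation": [110, 94, 2, 152],
    "country": ["Poland"] * 4,
})

# A constant feature has exactly one distinct value
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) == 1]
df_nc = df.drop(columns=constant_cols)

# A duplicate feature repeats the values of an earlier column;
# transposing lets us reuse the row-wise duplicated() check
dup_mask = df_nc.T.duplicated()
duplicate_cols = dup_mask[dup_mask].index.tolist()
df_clean = df_nc.drop(columns=duplicate_cols)
```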
The rest of this chapter explores somewhat messier feature engineering challenges: encoding, transforming, binning, and scaling.
There are several reasons why we might need to encode features before using them in most machine learning algorithms. First, these algorithms typically require numeric data. Second, when a categorical feature is represented with numbers, for example, 1 for female and 2 for male, we need to encode the values so that they are recognized as categorical. Third, the feature might actually be ordinal, with a discrete number of values that represent some meaningful ranking. Our models need to capture that ranking. Finally, a categorical feature might have a large number of values (known as high cardinality), and we might want our encoding to collapse categories.
We can handle the encoding of features with a limited number of values, say 15 or fewer, with one-hot encoding. In this section, we will first go over one-hot encoding and then discuss ordinal encoding. We will look at strategies for handling categorical features with high cardinality in the next section.
One-hot encoding a feature creates a binary vector for each value of that feature. So, if a feature, called letter, has three unique values, A, B, and C, one-hot encoding creates three binary vectors to represent those values. The first binary vector, which we can call letter_A, has 1 whenever letter has a value of A, and 0 when it is B or C. letter_B and letter_C would be coded similarly. The transformed features, letter_A, letter_B, and letter_C, are often referred to as dummy variables. Figure 4.1 illustrates one-hot encoding:
A number of features from the NLS data are appropriate for one-hot encoding. In the following code blocks, we encode some of those features:
import pandas as pd
from feature_engine.encoding import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
nls97 = pd.read_csv("data/nls97b.csv")
nls97.set_index("personid", inplace=True)
feature_cols = ['gender','maritalstatus','colenroct99']
nls97demo = nls97[['wageincome'] + feature_cols].dropna()
X_demo_train, X_demo_test, y_demo_train, y_demo_test = train_test_split(
    nls97demo[feature_cols], nls97demo[['wageincome']],
    test_size=0.3, random_state=0)
pd.get_dummies(X_demo_train,
columns=['gender','maritalstatus']).head(2).T
personid 736081 832734
colenroct99 1. Not enrolled 1. Not enrolled
gender_Female 1 0
gender_Male 0 1
maritalstatus_Divorced 0 0
maritalstatus_Married 1 0
maritalstatus_Never-married 0 1
maritalstatus_Separated 0 0
maritalstatus_Widowed 0 0
We are not saving the DataFrame created by get_dummies because, later in this section, we will be using a different technique to do the encoding.
Typically, we create k-1 dummy variables for k unique values for a feature. So, if gender has two values in our data, we only need to create one dummy variable. If we know the value for gender_Female, we also know the value of gender_Male; therefore, the latter variable is redundant. Similarly, we know the value of maritalstatus_Divorced if we know the values of the other maritalstatus dummies. Creating a redundancy in this way is inelegantly referred to as the dummy variable trap. To avoid this problem, we drop one dummy from each group.
Note
For some machine learning algorithms, such as linear regression, dropping one dummy variable is actually required. Estimating the parameters of a linear model involves inverting a matrix built from the features. If our model has an intercept and all of the dummy variables are included, the dummies in each group sum to the intercept column, the features are perfectly collinear, and that matrix cannot be inverted.
pd.get_dummies(X_demo_train,
columns=['gender','maritalstatus'],
drop_first=True).head(2).T
personid 736081 832734
colenroct99 1. Not enrolled 1. Not enrolled
gender_Male 0 1
maritalstatus_Married 1 0
maritalstatus_Never-married 0 1
maritalstatus_Separated 0 0
maritalstatus_Widowed 0 0
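The note on linear models can be checked numerically. The sketch below uses a made-up marital status series and NumPy's `matrix_rank`. With an intercept and all k dummies, the design matrix has fewer independent columns than columns, so it is not of full rank; dropping one dummy restores full rank:

```python
import numpy as np
import pandas as pd

# Made-up categorical values for illustration
status = pd.Series(["Married", "Divorced", "Never-married",
                    "Married", "Divorced", "Never-married"])

all_dummies = pd.get_dummies(status, dtype=float)                # k dummies
k_minus_1 = pd.get_dummies(status, drop_first=True, dtype=float)  # k-1 dummies

intercept = np.ones((len(status), 1))
X_all = np.hstack([intercept, all_dummies.to_numpy()])      # 1 + 3 columns
X_dropped = np.hstack([intercept, k_minus_1.to_numpy()])    # 1 + 2 columns

# The three dummies sum to the intercept column, so one column is redundant
rank_all = np.linalg.matrix_rank(X_all)
rank_dropped = np.linalg.matrix_rank(X_dropped)

print(X_all.shape[1], rank_all)          # more columns than rank
print(X_dropped.shape[1], rank_dropped)  # columns equal rank
```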
An alternative to get_dummies is the one-hot encoder in either sklearn or feature_engine. These one-hot encoders have the advantage that they can be easily brought into a machine learning pipeline, and they can persist information gathered from the training dataset to the testing dataset.
ohe = OneHotEncoder(drop_last=True,
variables=['gender','maritalstatus'])
ohe.fit(X_demo_train)
X_demo_train_ohe = ohe.transform(X_demo_train)
X_demo_test_ohe = ohe.transform(X_demo_test)
X_demo_train_ohe.filter(regex='gen|mar', axis="columns").head(2).T
personid 736081 832734
gender_Female 1 0
maritalstatus_Married 1 0
maritalstatus_Never-married 0 1
maritalstatus_Divorced 0 0
maritalstatus_Separated 0 0
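The point about persisting information from the training data can be illustrated with scikit-learn's own `OneHotEncoder` (a different class from the feature_engine encoder used above, although it shares the name). In this sketch, with made-up values, the categories are learned from the training rows only, and `handle_unknown="ignore"` is one way to deal with a value that appears only in the testing rows:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder  # scikit-learn's encoder

# Made-up data: 'Widowed' never appears in the training rows
train = pd.DataFrame({"maritalstatus":
                      ["Married", "Divorced", "Never-married", "Married"]})
test = pd.DataFrame({"maritalstatus": ["Divorced", "Widowed"]})

# Categories are learned from the training data only
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)

# The testing data gets exactly the columns seen in training;
# an unseen value becomes a row of zeros
train_ohe = enc.transform(train).toarray()
test_ohe = enc.transform(test).toarray()

print(enc.categories_[0])
print(test_ohe)
```

By contrast, calling get_dummies separately on the training and testing data could produce different columns in each.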
This demonstrates that one-hot encoding is a fairly straightforward way to prepare nominal data for a machine learning algorithm. But what if our categorical features are ordinal, rather than nominal? In that case, we need to use ordinal encoding.
Categorical features can be either nominal or ordinal, as discussed in Chapter 1, Examining the Distribution of Features and Targets. Gender and marital status are nominal. Their values do not imply order. For example, "never married" is not a higher value than "divorced."
However, when a categorical feature is ordinal, we want the encoding to capture the ranking of the values. For example, if we have a feature that has the values of low, medium, and high, one-hot encoding would lose this ordering. Instead, a transformed feature with the values of 1, 2, and 3 for low, medium, and high, respectively, would be better. We can accomplish this with ordinal encoding.
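Here is a minimal sketch of that low, medium, and high example, using a made-up `rating` column. Passing the categories explicitly, in ranked order, means the codes do not depend on the order in which values happen to appear in the data. Note that scikit-learn's `OrdinalEncoder` numbers the categories from 0, so the codes are 0, 1, and 2 rather than 1, 2, and 3:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up ordinal feature for illustration
df = pd.DataFrame({"rating": ["medium", "low", "high", "low"]})

# State the ranking explicitly: low < medium < high
oe = OrdinalEncoder(categories=[["low", "medium", "high"]])
df["rating_enc"] = oe.fit_transform(df[["rating"]]).ravel()

print(df)
```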
The college enrollment feature on the NLS dataset can be considered an ordinal feature. The values range from 1. Not enrolled to 3. 4-year college. We should use ordinal encoding to prepare it for modeling. We will do that next:
X_demo_train.colenroct99.unique()
array(['1. Not enrolled', '2. 2-year college ',
'3. 4-year college'], dtype=object)
X_demo_train.head()
gender maritalstatus colenroct99
personid
736081 Female Married 1. Not enrolled
832734 Male Never-married 1. Not enrolled
453537 Male Married 1. Not enrolled
322059 Female Divorced 1. Not enrolled
324323 Female Married 2. 2-year college
# The category order sets the codes; unique() happens to return the
# values in ranked order here (see the output above)
oe = OrdinalEncoder(categories=
    [X_demo_train.colenroct99.unique()])
colenr_enc = pd.DataFrame(
    oe.fit_transform(X_demo_train[['colenroct99']]),
    columns=['colenroct99'], index=X_demo_train.index)
X_demo_train_enc = X_demo_train[['gender','maritalstatus']].\
    join(colenr_enc)
X_demo_train_enc.head()
gender maritalstatus colenroct99
personid
736081 Female Married 0
832734 Male Never-married 0
453537 Male Married 0
322059 Female Divorced 0
324323 Female Married 1
X_demo_train.colenroct99.value_counts().sort_index()
1. Not enrolled 3050
2. 2-year college 142
3. 4-year college 350
Name: colenroct99, dtype: int64
X_demo_train_enc.colenroct99.value_counts().sort_index()
0 3050
1 142
2 350
Name: colenroct99, dtype: int64
The ordinal encoding replaces the initial values for colenroct99 with numbers from 0 to 2. It is now in a form that is consumable by many machine learning models, and we have retained the meaningful ranking information.
Note
Ordinal encoding is appropriate for non-linear models such as decision trees. It might not make sense in a linear regression model because that would assume that the distance between values was equally meaningful across the whole distribution. In this example, that would assume that the increase from 0 to 1 (that is, from no enrollment to 2-year enrollment) is the same thing as the increase from 1 to 2 (that is, from 2-year enrollment to 4-year enrollment).
One-hot encoding and ordinal encoding are relatively straightforward approaches to engineering categorical features. It can be more complicated to deal with categorical features when there are many more unique values. In the next section, we will go over a couple of techniques for handling those features.
When we are working with a categorical feature that has many unique values, say 10 or more, it can be impractical to create a dummy variable for each value. When there is high cardinality, that is, a very large number of unique values, there might be too few observations with certain values to provide much information for our models. At the extreme, with an ID variable, there is just one observation for each value.
There are a couple of ways in which to handle medium or high cardinality. One way is to create dummies for the top k categories and group the remaining values into an other category. Another way is to use feature hashing, also known as the hashing trick. In this section, we will explore both strategies. We will be using the COVID-19 dataset for this example:
import pandas as pd
from feature_engine.encoding import OneHotEncoder
from category_encoders.hashing import HashingEncoder
from sklearn.model_selection import train_test_split
covidtotals = pd.read_csv("data/covidtotals.csv")
feature_cols = ['location','population',
    'aged_65_older','diabetes_prevalence','region']
covidtotals = covidtotals[['total_cases'] + feature_cols].dropna()
X_train, X_test, y_train, y_test = train_test_split(
    covidtotals[feature_cols], covidtotals[['total_cases']],
    test_size=0.3, random_state=0)
The feature region has 16 unique values, the first 6 of which have counts of 10 or more:
X_train.region.value_counts()
Eastern Europe 16
East Asia 12
Western Europe 12
West Africa 11
West Asia 10
East Africa 10
South America 7
South Asia 7
Central Africa 7
Southern Africa 7
Oceania / Aus 6
Caribbean 6
Central Asia 5
North Africa 4
North America 3
Central America 3
Name: region, dtype: int64
ohe = OneHotEncoder(top_categories=6, variables=['region'])
covidtotals_ohe = ohe.fit_transform(covidtotals)
covidtotals_ohe.filter(regex='location|region',
axis="columns").sample(5, random_state=99).T
97 173 92 187 104
location Israel Senegal Indonesia Sri Lanka Kenya
region_Eastern Europe 0 0 0 0 0
region_Western Europe 0 0 0 0 0
region_West Africa 0 1 0 0 0
region_East Asia 0 0 1 0 0
region_West Asia 1 0 0 0 0
region_East Africa 0 0 0 0 1
An alternative approach to one-hot encoding, when a categorical feature has many unique values, is to use feature hashing.
Feature hashing maps a large number of unique feature values to a smaller number of dummy variables. We can specify the number of dummy variables to create. However, collisions are possible; that is, some feature values might map to the same dummy variable combination. The number of collisions increases as we decrease the number of requested dummy variables.
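The mechanics can be sketched in a few lines. This is a simplified illustration of the hashing trick, not the algorithm that category_encoders uses; the helper names `hash_bucket` and `hash_encode` are hypothetical. Each value is hashed to an integer, and the remainder after dividing by the number of requested columns decides which dummy variable is set to 1:

```python
import hashlib
import pandas as pd

def hash_bucket(value, n_components):
    """Hypothetical helper: map a value to a column number with MD5."""
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_components

def hash_encode(series, n_components=6):
    """Hypothetical helper: one dummy column per hash bucket."""
    buckets = series.map(lambda v: hash_bucket(v, n_components))
    cols = {f"col_{j}": (buckets == j).astype(int)
            for j in range(n_components)}
    return pd.DataFrame(cols, index=series.index)

regions = pd.Series(["East Asia", "West Africa", "East Asia",
                     "Caribbean", "Oceania / Aus"])
encoded = hash_encode(regions, n_components=6)
print(encoded)
```

The same value always lands in the same column, and two different values collide whenever their hashes share a remainder, which becomes more likely as the number of columns shrinks.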
We can use HashingEncoder from category_encoders to do feature hashing. We use n_components to indicate that we want six dummy variables (we copy the region feature before we do the transform so that we can compare the original values to the new dummies):
X_train['region2'] = X_train.region
he = HashingEncoder(cols=['region'], n_components=6)
X_train_enc = he.fit_transform(X_train)
X_train_enc.\
    groupby(['col_0','col_1','col_2','col_3','col_4',
    'col_5','region2']).\
    size().reset_index().rename(columns={0:'count'})
col_0 col_1 col_2 col_3 col_4 col_5 region2 count
0 0 0 0 0 0 1 Caribbean 6
1 0 0 0 0 0 1 Central Africa 7
2 0 0 0 0 0 1 East Africa 10
3 0 0 0 0 0 1 North Africa 4
4 0 0 0 0 1 0 Central America 3
5 0 0 0 0 1 0 Eastern Europe 16
6 0 0 0 0 1 0 North America 3
7 0 0 0 0 1 0 Oceania / Aus 6
8 0 0 0 0 1 0 Southern Africa 7
9 0 0 0 0 1 0 West Asia 10
10 0 0 0 0 1 0 Western Europe 12
11 0 0 0 1 0 0 Central Asia 5
12 0 0 0 1 0 0 East Asia 12
13 0 0 0 1 0 0 South Asia 7
14 0 0 1 0 0 0 West Africa 11
15 1 0 0 0 0 0 South America 7
Unfortunately, this gives us a large number of collisions. For example, Caribbean, Central Africa, East Africa, and North Africa all get the same dummy variable values. In this case at least, using one-hot encoding and specifying the number of categories, as we did in the last section, was a better solution.
In the previous two sections, we covered common encoding strategies: one-hot encoding, ordinal encoding, and feature hashing. Almost all of our categorical features will require some kind of encoding before we can use them in a model. However, sometimes, we need to alter our features in other ways, including with transformations, binning, and scaling. In the next three sections, we will consider the reasons why we might need to alter our features in these ways and explore tools for doing that.
Sometimes, we want to use features that do not have a Gaussian distribution with a machine learning algorithm that assumes our features are distributed in that way. When that happens, we either need to change our minds about which algorithm to use (for example, we could choose KNN rather than linear regression) or transform our features so that they approximate a Gaussian distribution. In this section, we will go over a couple of strategies for doing the latter:
import pandas as pd
from feature_engine import transformation as vt
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from scipy import stats
covidtotals = pd.read_csv("data/covidtotals.csv")
feature_cols = ['location','population',
    'aged_65_older','diabetes_prevalence','region']
covidtotals = covidtotals[['total_cases'] + feature_cols].dropna()
X_train, X_test, y_train, y_test = train_test_split(
    covidtotals[feature_cols], covidtotals[['total_cases']],
    test_size=0.3, random_state=0)
y_train.total_cases.skew()
6.313169268923333
plt.hist(y_train.total_cases)
plt.title("Total COVID Cases (in millions)")
plt.xlabel('Cases')
plt.ylabel("Number of Countries")
plt.show()
This produces the following histogram:
This illustrates the very high skew for the total number of cases. In fact, it looks log-normal, which is not surprising given the large number of very low values and several very high values.
Note
For more information about the measures of skew and kurtosis, please refer to Chapter 1, Examining the Distribution of Features and Targets.
tf = vt.LogTransformer(variables = ['total_cases'])
y_train_tf = tf.fit_transform(y_train)
y_train_tf.total_cases.skew()
-1.3872728024141519
plt.hist(y_train_tf.total_cases)
plt.title("Total COVID Cases (log transformation)")
plt.xlabel('Cases')
plt.ylabel("Number of Countries")
plt.show()
This produces the following histogram:
Effectively, log transformations spread out values at the lower end of the distribution and compress values at the upper end, which produces a more symmetrical distribution. This is because the slope of the logarithmic function is steeper for smaller values than for larger ones.
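A quick numerical sketch of this effect, using a synthetic log-normal sample rather than the COVID data: the same raw gap of 1 becomes a large gap on the log scale for small values and a tiny one for large values, and the skew of the sample falls sharply after the transformation.

```python
import numpy as np
from scipy import stats

# The same raw difference of 1 at the low end and at the high end
small_gap = np.log(2) - np.log(1)
large_gap = np.log(1_000_001) - np.log(1_000_000)
print(small_gap, large_gap)

# Synthetic right-skewed sample (log-normal by construction)
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

skew_before = stats.skew(x)
skew_after = stats.skew(np.log(x))
print(skew_before, skew_after)
```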
tf = vt.BoxCoxTransformer(variables = ['total_cases'])
y_train_tf = tf.fit_transform(y_train)
y_train_tf.total_cases.skew()
0.07333475786753735
plt.hist(y_train_tf.total_cases)
plt.title("Total COVID Cases (Box-Cox transformation)")
plt.xlabel('Cases')
plt.ylabel("Number of Countries")
plt.show()
This produces the following plot:
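Box-Cox transformations identify a value for lambda between -5 and 5 that generates a distribution that is closest to normal. It uses the following equation for the transformation:
$$y^{(\lambda)} = \frac{y^{\lambda} - 1}{\lambda} \quad \text{for } \lambda \neq 0$$
or
$$y^{(\lambda)} = \log y \quad \text{for } \lambda = 0$$
Here, $y^{(\lambda)}$ is our transformed feature. Just for fun, let's see the value of the lambda that was used to transform total_cases: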
stats.boxcox(y_train.total_cases)[1]
0.10435377585681517
The lambda for the Box-Cox transformation is 0.104. For comparison, the lambda for a feature with a Gaussian distribution would be 1.000, meaning that no transformation would be necessary.
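As a check on the arithmetic, the sketch below applies the standard Box-Cox definition by hand, that is, the feature raised to the power lambda, minus 1, divided by lambda, with the natural log used when lambda is 0. It then compares the results with `scipy.stats.boxcox` for fixed lambdas. The helper name `boxcox_manual` and the input values are made up for illustration:

```python
import numpy as np
from scipy import stats

def boxcox_manual(x, lam):
    """Hypothetical helper: Box-Cox transform for a fixed lambda."""
    x = np.asarray(x, dtype=float)
    if lam == 0:
        return np.log(x)
    return (x ** lam - 1) / lam

# Made-up positive values (Box-Cox requires strictly positive data)
x = np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])

manual_half = boxcox_manual(x, 0.5)
scipy_half = stats.boxcox(x, lmbda=0.5)   # returns only the transformed data

manual_zero = boxcox_manual(x, 0.0)
scipy_zero = stats.boxcox(x, lmbda=0.0)

# With lambda = 1 the transform only shifts the values by 1
manual_one = boxcox_manual(x, 1.0)
```

The lambda of 1 case shows why a value near 1 signals that no transformation is needed: the shape of the distribution is left unchanged.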
Now that our transformed total cases feature looks good, we can build a model with it as the target. Additionally, we can set up our pipeline to restore values to their original scaling when we make predictions. feature_engine has a number of other transformations that are implemented similarly to the log and Box-Cox transformations.
Sometimes, we will want to convert a continuous feature into a categorical feature. The process of dividing the range of a distribution, from its minimum to its maximum value, into k intervals is called binning or, to use the somewhat less friendly term, discretization. The intervals can be equally spaced, or they can be chosen so that each holds roughly the same number of observations. Binning can address several important issues with a feature: skew, excessive kurtosis, and the presence of outliers.
Binning might be a good choice with the COVID case data, so let's try it. It might also be useful with other variables in the dataset, including total deaths and population, but we will only work with total cases for now. Note that total_cases is the target variable in the following code, so it is a column, indeed the only column, of the y_train DataFrame:
import pandas as pd
from feature_engine.discretisation import EqualFrequencyDiscretiser as efd
from feature_engine.discretisation import EqualWidthDiscretiser as ewd
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.model_selection import train_test_split
covidtotals = pd.read_csv("data/covidtotals.csv")
feature_cols = ['location','population',
    'aged_65_older','diabetes_prevalence','region']
covidtotals = covidtotals[['total_cases'] + feature_cols].dropna()
X_train, X_test, y_train, y_test = train_test_split(
    covidtotals[feature_cols], covidtotals[['total_cases']],
    test_size=0.3, random_state=0)
y_train['total_cases_group'] = pd.qcut(y_train.total_cases, q=10, labels=[0,1,2,3,4,5,6,7,8,9])
y_train.total_cases_group.value_counts().sort_index()
0 13
1 13
2 12
3 13
4 12
5 13
6 12
7 13
8 12
9 13
Name: total_cases_group, dtype: int64
def runtransform(bt, dftrain, dftest):
    # Learn the bin edges from the training data only
    bt.fit(dftrain)
    # Apply those same edges to both datasets
    train_bins = bt.transform(dftrain)
    test_bins = bt.transform(dftest)
    return train_bins, test_bins
y_train.drop(['total_cases_group'], axis=1, inplace=True)
bintransformer = efd(q=10, variables=['total_cases'])
y_train_bins, y_test_bins = runtransform(bintransformer, y_train, y_test)
y_train_bins.total_cases.value_counts().sort_index()
0 13
1 13
2 12
3 13
4 12
5 13
6 12
7 13
8 12
9 13
Name: total_cases, dtype: int64
This gives us the same results as qcut, but it has the advantage of being easier to bring into a machine learning pipeline since we are using feature_engine to produce it. The equal-frequency binning addresses both the skew and outlier problems.
Note
We will explore machine learning pipelines in detail in this book, starting with Chapter 6, Preparing for Model Evaluation. Here, the key point is that feature_engine transformers can be a part of a pipeline that includes other sklearn-compatible transformers, even ones we construct ourselves.
bintransformer = ewd(bins=10, variables=['total_cases'])
y_train_bins, y_test_bins = runtransform(bintransformer, y_train, y_test)
y_train_bins.total_cases.value_counts().sort_index()
0 119
1 4
5 1
9 2
Name: total_cases, dtype: int64
This is a far less successful transformation. Almost all of the values are at the bottom of the distribution in the data prior to the binning, so it is not surprising that equal-width binning would have the same problem. Only 4 of the bins contain any observations, even though we requested 10.
pd.options.display.float_format = '{:,.0f}'.format
y_train_bins = y_train_bins.\
    rename(columns={'total_cases':'total_cases_group'}).\
    join(y_train)
y_train_bins.groupby("total_cases_group")["total_cases"].agg(['min','max'])
min max
total_cases_group
0 1 3,304,135
1 3,740,567 5,856,682
5 18,909,037 18,909,037
9 30,709,557 33,770,444
Although in this case, equal-width binning was a bad choice, there are many times when it makes sense. It can be useful when data is more uniformly distributed or when the equal widths make sense substantively.
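The contrast between the two strategies is easy to reproduce with pandas alone. The sketch below uses synthetic, heavily right-skewed values (not the COVID data) and compares `pd.cut`, which produces equal-width bins, with `pd.qcut`, which produces equal-frequency bins:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values: 100 points growing exponentially
values = pd.Series(np.exp(np.linspace(0, 10, 100)))

# Equal-width bins: every interval spans the same range of values
equal_width = pd.cut(values, bins=10, labels=False)

# Equal-frequency bins: every interval holds about the same number of rows
equal_freq = pd.qcut(values, q=10, labels=False)

width_counts = equal_width.value_counts().sort_index()
freq_counts = equal_freq.value_counts().sort_index()

print(width_counts)
print(freq_counts)
```

With data this skewed, most observations fall into the first equal-width bin, while the equal-frequency bins each hold the same number of observations.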
Another option is to use k-means clustering to determine the bins. The k-means algorithm randomly selects k data points as the initial centers of clusters and then assigns the other data points to the closest center. The mean of each cluster is then computed and becomes the new center, and the data points are reassigned to the nearest new center. This process is repeated until the centers stop changing.
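The loop described above can be sketched for a single feature in plain NumPy. This is a bare-bones illustration, not scikit-learn's implementation; the function name `kmeans_1d`, the seed, and the sample values are hypothetical. Like any k-means run, the result depends on the random starting centers and may be a local rather than a global optimum:

```python
import numpy as np

def kmeans_1d(values, k, n_iter=100, seed=0):
    """Hypothetical helper: k-means for one feature, returning bin labels."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen data points
    centers = rng.choice(values, size=k, replace=False)
    for _ in range(n_iter):
        # Assign every point to its nearest center
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        # Recompute each center as the mean of its points
        # (an empty cluster keeps its previous center)
        new_centers = np.array([
            values[labels == j].mean() if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # the centers stopped moving
        centers = new_centers
    # Sort the centers so that a higher label means higher values
    centers = np.sort(centers)
    labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
    return labels, centers

values = np.array([1, 2, 3, 100, 101, 102, 1000, 1001, 1002])
labels, centers = kmeans_1d(values, k=3)
print(labels, centers)
```

Because the centers are sorted before the final assignment, the labels can be used directly as ordinal bin numbers.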
When k-means is used for binning, all data points in the same cluster will have the same ordinal value:
kbins = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='kmeans')
y_train_bins = pd.DataFrame(
    kbins.fit_transform(y_train),
    columns=['total_cases'])
y_train_bins.total_cases.value_counts().sort_index()
0 49
1 24
2 23
3 11
4 6
5 6
6 4
7 1
8 1
9 1
Name: total_cases, dtype: int64
y_train.total_cases.agg(['skew','kurtosis'])
skew 6.313
kurtosis 41.553
Name: total_cases, dtype: float64
y_train_bins.total_cases.agg(['skew','kurtosis'])
skew 1.439
kurtosis 1.923
Name: total_cases, dtype: float64
Binning can help us to address skew, kurtosis, and outliers in our data. However, it does mask much of the variation in the feature and reduces its explanatory potential. Often, some form of scaling, such as min-max or z-score, is a better option. Let's examine feature scaling next.
Often, the features we want to use in our model are on very different scales. Put simply, the distance between the minimum and maximum values, or the range, varies substantially across possible features. For example, in the COVID-19 data, the total cases feature goes from 1 to almost 34 million, while aged 65 or older goes from 9 to 27 (the number represents the percentage of the population).
Having features on very different scales impacts many machine learning algorithms. For example, KNN models often use Euclidean distance, and features with greater ranges will have a greater influence on the model. Scaling can address this problem.
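A small sketch makes the point about Euclidean distance concrete. The three rows below are hypothetical countries, not rows from the COVID-19 data, and the `euclidean` helper is defined here only for illustration.

```python
import numpy as np

# Hypothetical values for three countries: [population, aged_65_older]
X = np.array([
    [ 5_000_000.0, 10.0],   # country A
    [ 5_100_000.0, 25.0],   # country B: similar population, much older
    [50_000_000.0, 10.0],   # country C: same age profile, much larger
])

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return np.sqrt(((a - b) ** 2).sum())

# On the raw scale, the 15-point age gap between A and B barely registers
d_ab_raw = euclidean(X[0], X[1])
d_ac_raw = euclidean(X[0], X[2])

# After min-max scaling each column, both features contribute comparably
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
d_ab_scaled = euclidean(X_scaled[0], X_scaled[1])
d_ac_scaled = euclidean(X_scaled[0], X_scaled[2])

print(d_ab_raw, d_ac_raw)
print(d_ab_scaled, d_ac_scaled)
```

On the raw scale, the distance between A and B is essentially just the population difference. Once both columns are scaled to the same range, the age difference counts as much as the population difference does.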
In this section, we will go over two popular approaches to scaling: min-max scaling and standard (or z-score) scaling. Min-max scaling replaces each value with its location in the range. More precisely, the following happens:
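$$z_{ij} = \frac{x_{ij} - \min_j}{\max_j - \min_j}$$
Here, $z_{ij}$ is the min-max score, $x_{ij}$ is the value for the $i$th observation of the $j$th feature, and $\min_j$ and $\max_j$ are the minimum and maximum values of the $j$th feature.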
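Min-max scaling is easy to verify by hand. The sketch below subtracts the minimum and divides by the range for a toy feature, then compares the result with scikit-learn's `MinMaxScaler`. The four values are hypothetical and chosen only to keep the arithmetic simple.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature values (hypothetical, chosen for easy arithmetic)
x = np.array([[9.0], [12.0], [18.0], [27.0]])

# Subtract the minimum, then divide by the range (max - min)
manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# scikit-learn's scaler should give the same result
sk = MinMaxScaler().fit_transform(x)

print(manual.ravel())
print(np.allclose(manual, sk))
```

The smallest value maps to 0, the largest to 1, and every other value to its relative position in between.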
Standard scaling normalizes the feature values around a mean of 0. Those who studied undergraduate statistics will recognize it as the z-score. Specifically, it is as follows:
$$z_{ij} = \frac{x_{ij} - \mu_j}{s_j}$$
Here, $x_{ij}$ is the value for the $i$th observation of the $j$th feature, $\mu_j$ is the mean for feature $j$, and $s_j$ is the standard deviation for that feature.
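The z-score can be checked by hand in the same way. This sketch uses a hypothetical toy feature and compares a manual calculation with `StandardScaler`. One detail worth knowing is that `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas the pandas `std` and `describe` methods default to the sample standard deviation (`ddof=1`).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy feature values (hypothetical)
x = pd.DataFrame({'aged_65_older': [9.0, 12.0, 18.0, 27.0]})

# Subtract the mean and divide by the population standard deviation
manual = (x - x.mean()) / x.std(ddof=0)

# StandardScaler uses the same population (ddof=0) standard deviation
sk = StandardScaler().fit_transform(x)

print(manual['aged_65_older'].round(3).tolist())
print(np.allclose(manual.to_numpy(), sk))
```

With a small sample, the standard deviation that `describe` reports for the scaled feature can therefore be slightly larger than 1.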
We can use scikit-learn's preprocessing module to get the min-max and standard scalers:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
covidtotals = pd.read_csv("data/covidtotals.csv")
feature_cols = ['population','total_deaths',
'aged_65_older','diabetes_prevalence']
covidtotals = covidtotals[['total_cases'] + feature_cols].dropna()
X_train, X_test, y_train, y_test =  \
  train_test_split(covidtotals[feature_cols],
  covidtotals[['total_cases']], test_size=0.3, random_state=0)
scaler = MinMaxScaler()
X_train_mms = pd.DataFrame(scaler.fit_transform(X_train),
columns=X_train.columns, index=X_train.index)
X_train_mms.describe()
population total_deaths aged_65_older diabetes_prevalence
count 123.00 123.00 123.00 123.00
mean 0.04 0.04 0.30 0.41
std 0.13 0.14 0.24 0.23
min 0.00 0.00 0.00 0.00
25% 0.00 0.00 0.10 0.26
50% 0.01 0.00 0.22 0.37
75% 0.02 0.02 0.51 0.54
max 1.00 1.00 1.00 1.00
scaler = StandardScaler()
X_train_ss = pd.DataFrame(scaler.fit_transform(X_train),
columns=X_train.columns, index=X_train.index)
X_train_ss.describe()
population total_deaths aged_65_older diabetes_prevalence
count 123.00 123.00 123.00 123.00
mean -0.00 -0.00 -0.00 -0.00
std 1.00 1.00 1.00 1.00
min -0.29 -0.32 -1.24 -1.84
25% -0.27 -0.31 -0.84 -0.69
50% -0.24 -0.29 -0.34 -0.18
75% -0.11 -0.18 0.87 0.59
max 7.58 6.75 2.93 2.63
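So far, we have only scaled the training data. To stay clear of the data leakage discussed at the start of this chapter, one reasonable routine is to fit the scaler on the training data and then reuse the fitted scaler to transform the testing data. Here is a minimal sketch of that routine. It uses synthetic data as a stand-in for our features, so the values are assumptions rather than the actual COVID-19 figures.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the feature matrix (hypothetical values)
X = pd.DataFrame({
    'population': rng.lognormal(mean=15, sigma=2, size=200),
    'aged_65_older': rng.uniform(1, 28, size=200),
})
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

scaler = StandardScaler()
# Learn the means and standard deviations from the training data only
X_train_ss = pd.DataFrame(scaler.fit_transform(X_train),
  columns=X_train.columns, index=X_train.index)
# Reuse those training statistics on the testing data (transform, not fit)
X_test_ss = pd.DataFrame(scaler.transform(X_test),
  columns=X_test.columns, index=X_test.index)

print(X_train_ss.mean().round(2))
print(X_test_ss.mean().round(2))
```

The scaled testing features will generally not have a mean of exactly 0, because they are centered on the training means. That is expected, and it keeps information about the testing data out of the scaling step.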
If we have outliers in our data, robust scaling might be a good option. Robust scaling subtracts the median from each value of a variable and divides that value by the interquartile range. So, each value is as follows:
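$$r_{ij} = \frac{x_{ij} - \text{median}_j}{Q_{3j} - Q_{1j}}$$
Here, $x_{ij}$ is the value for the $i$th observation of the $j$th feature, and $\text{median}_j$, $Q_{3j}$, and $Q_{1j}$ are the median, third quartile, and first quartile of the $j$th feature. Robust scaling is less sensitive to extreme values since it does not use the mean or variance.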
scaler = RobustScaler()
X_train_rs = pd.DataFrame(
scaler.fit_transform(X_train),
columns=X_train.columns, index=X_train.index)
X_train_rs.describe()
population total_deaths aged_65_older diabetes_prevalence
count 123.00 123.00 123.00 123.00
mean 1.47 2.22 0.20 0.14
std 6.24 7.65 0.59 0.79
min -0.35 -0.19 -0.53 -1.30
25% -0.24 -0.15 -0.30 -0.40
50% 0.00 0.00 0.00 0.00
75% 0.76 0.85 0.70 0.60
max 48.59 53.64 1.91 2.20
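Robust scaling can also be reproduced by hand. The sketch below uses a hypothetical toy feature with one obvious outlier, subtracts the median, divides by the interquartile range, and compares the result with `RobustScaler` using its default settings.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy feature values (hypothetical); 100 is an obvious outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

median = np.median(x, axis=0)
q1, q3 = np.percentile(x, [25, 75], axis=0)

# Subtract the median, then divide by the interquartile range
manual = (x - median) / (q3 - q1)

# RobustScaler with default settings should match
sk = RobustScaler().fit_transform(x)

print(manual.ravel())  # [-1.  -0.5  0.   0.5 48.5]
print(np.allclose(manual, sk))
```

The outlier remains extreme after scaling, but it does not compress the other values. The median and the quartiles are unaffected by how far out the largest value sits.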
We use feature scaling with many machine learning algorithms. Although it is not always required, it often yields noticeably better results, particularly for distance-based algorithms such as KNN. Min-max scaling and standard scaling are popular scaling techniques, but there are times when robust scaling might be the better option.
In this chapter, we covered a wide range of feature engineering techniques. We used tools to drop redundant or highly correlated features. We explored the most common kinds of encoding – one-hot encoding, ordinal encoding, and hashing encoding. Following this, we used transformations to improve the distribution of our features. Finally, we used common binning and scaling approaches to address skew, kurtosis, and outliers, and to adjust for features with widely different ranges.
Some of the techniques we discussed in this chapter are required for most machine learning models. We almost always need to encode our features so that algorithms can interpret them correctly. For example, most algorithms cannot make sense of female or male values or know not to treat ZIP codes as ordinal. Although not typically necessary, scaling is often a very good idea when we have features with vastly different ranges. When we are using algorithms that assume a Gaussian distribution of our features, some form of transformation might be required for our features to be consistent with that assumption.
We now have a good sense of how our features are distributed, have imputed missing values, and have done some feature engineering where necessary. We are now prepared to begin perhaps the most interesting and meaningful part of the model building process – feature selection.
In the next chapter, we will examine key feature selection tasks, building on the feature cleaning, exploration, and engineering work that we have done so far.