Depending on how you began your data analytic work and your own intellectual interests, you might have a different perspective on the topic of feature selection. You might think, yeah, yeah, it is an important topic, but I really want to get to the model building. Or, at the other extreme, you might view feature selection as at the core of model building and believe that you are 90% of the way toward having your model once you have chosen your features. For now, let's just agree that we should spend a good chunk of time understanding the relationships between features – and their relationship to a target if we are building a supervised model – before we do any serious model specification.
It is helpful to approach our feature selection work with the attitude that less is more. If we can reach nearly the same degree of accuracy or explain as much of the variance with fewer features, we should select the simpler model. Sometimes, we can actually get better accuracy with fewer features. This can be hard to wrap our brains around, and even be a tad disappointing for those of us who cut our teeth on building models that told rich and complicated stories.
But we are less concerned with parameter estimates than with the accuracy of our predictions when fitting machine learning models. Unnecessary features can contribute to overfitting and tax hardware resources.
We can sometimes spend months specifying the features of our model, even when there is a limited number of columns in the data. Bivariate correlations, such as those created in Chapter 2, Examining Bivariate and Multivariate Relationships between Features and Targets, give us some sense of what to expect, but the importance of a feature can vary significantly once other potentially explanatory features are introduced. The feature may no longer be significant, or, conversely, may only be significant when other features are included. Two features might be so highly correlated that including both of them offers very little information beyond what we get from including just one.
This chapter takes a close look at feature selection techniques applicable to a variety of predictive modeling tasks. Specifically, we will explore the following topics:
We will work with the feature_engine, mlxtend, and boruta packages in this chapter, in addition to the scikit-learn library. You can use pip to install these packages. I have chosen a dataset with a small number of observations for our work in this chapter, so the code should work fine even on suboptimal workstations.
Note
We will work exclusively in this chapter with data from The National Longitudinal Survey of Youth, conducted by the United States Bureau of Labor Statistics. This survey started with a cohort of individuals in 1997 who were born between 1980 and 1985, with annual follow-ups through 2017. We will work with educational attainment, household demographic, weeks worked, and wage income data. The wage income column represents wages earned in 2016. The NLS dataset can be downloaded for public use at https://www.nlsinfo.org/investigator/pages/search.
The most straightforward feature selection methods are based on each feature's relationship with a target variable. The next two sections examine techniques for determining the k best features based on their linear or non-linear relationship with the target. These are known as filter methods. They are also sometimes called univariate methods since they evaluate the relationship between the feature and the target independent of the impact of other features.
We use somewhat different strategies when the target is categorical than when it is continuous. We'll go over the former in this section and the latter in the next.
We can use mutual information classification or analysis of variance (ANOVA) tests to select features when we have a categorical target. We will try mutual information classification first, and then ANOVA for comparison.
Mutual information is a measure of how much information about a variable is provided by knowing the value of another variable. At the extreme, when features are completely independent, the mutual information score is 0.
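To make the idea concrete, here is a minimal sketch on synthetic data (the variable names and the simulated values are my own assumptions, not NLS data). One feature shifts with a binary target and the other is drawn independently of it, so we would expect the second score to be close to 0:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 2000

# Hypothetical binary target
y = rng.integers(0, 2, size=n)

# 'informative' shifts with the target; 'noise' is drawn independently of it
informative = y + rng.normal(scale=0.5, size=n)
noise = rng.normal(size=n)
X = np.column_stack([informative, noise])

# random_state fixes the small amount of randomness in the estimator
scores = mutual_info_classif(X, y, random_state=0)
print(dict(zip(['informative', 'noise'], scores.round(3))))
```

The exact values depend on the simulated data, but the ordering of the two scores is the point.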
We can use scikit-learn's SelectKBest class to select the k features that have the highest predictive strength based on mutual information classification or some other appropriate measure. We can use hyperparameter tuning to select the value of k. We can also examine the scores of all features, whether they were identified as one of the k best or not, as we will see in this section.
Let's first try mutual information classification to identify features that are related to completing a bachelor's degree. Later, we will compare that with using ANOVA F-values as the basis for selection:
import pandas as pd
from feature_engine.encoding import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest,\
  mutual_info_classif, f_classif
nls97compba = pd.read_csv("data/nls97compba.csv")
feature_cols = ['gender','satverbal','satmath',
'gpascience', 'gpaenglish','gpamath','gpaoverall',
'motherhighgrade','fatherhighgrade','parentincome']
X_train, X_test, y_train, y_test = \
  train_test_split(nls97compba[feature_cols],
  nls97compba[['completedba']], test_size=0.3, random_state=0)
ohe = OneHotEncoder(drop_last=True, variables=['gender'])
X_train_enc = ohe.fit_transform(X_train)
scaler = StandardScaler()
standcols = X_train_enc.iloc[:,:-1].columns
X_train_enc = \
  pd.DataFrame(scaler.fit_transform(X_train_enc[standcols]),
  columns=standcols, index=X_train_enc.index).\
  join(X_train_enc[['gender_Female']])
Note
We will do a complete case analysis of the NLS data throughout this chapter; that is, we will remove all observations that have missing values for any of the features. This is not usually a good approach and is particularly problematic when data is not missing at random or when there is a large number of missing values for one or more features. In such cases, it would be better to use some of the approaches that we used in Chapter 3, Identifying and Fixing Missing Values. We will do a complete case analysis in this chapter to keep the examples as straightforward as possible.
ksel = SelectKBest(score_func=mutual_info_classif, k=5)
ksel.fit(X_train_enc, y_train.values.ravel())
selcols = X_train_enc.columns[ksel.get_support()]
selcols
Index(['satverbal', 'satmath', 'gpascience', 'gpaenglish', 'gpaoverall'], dtype='object')
pd.DataFrame({'score': ksel.scores_,
  'feature': X_train_enc.columns},
  columns=['feature','score']).\
  sort_values(['score'], ascending=False)
feature score
5 gpaoverall 0.108
1 satmath 0.074
3 gpaenglish 0.072
0 satverbal 0.069
2 gpascience 0.047
4 gpamath 0.038
8 parentincome 0.024
7 fatherhighgrade 0.022
6 motherhighgrade 0.022
9 gender_Female 0.015
Note
This is a stochastic process, so we will get different results each time we run it.
To get the same results each time, you can pass a partial function to score_func:
from functools import partial
SelectKBest(score_func=partial(mutual_info_classif,
random_state=0), k=5)
X_train_analysis = X_train_enc[selcols]
X_train_analysis.dtypes
satverbal float64
satmath float64
gpascience float64
gpaenglish float64
gpaoverall float64
dtype: object
That is all we need to do to select the k best features for our model using mutual information.
Alternatively, we can use ANOVA instead of mutual information. ANOVA evaluates how different the mean for a feature is for each target class. This is a good metric for univariate feature selection when we can assume a linear relationship between features and the target and our features are normally distributed. If those assumptions do not hold, mutual information classification is a better choice.
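As a quick check of what that score represents, the sketch below compares scikit-learn's f_classif with a one-way ANOVA from SciPy on a single simulated feature. The data and the gpa name are made up for illustration; the point is only that the two F-values agree:

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)

# Hypothetical binary target with 100 observations per class
y = np.repeat([0, 1], 100)

# A simulated feature whose mean differs between the two classes
gpa = np.where(y == 1, 3.2, 2.8) + rng.normal(scale=0.5, size=200)
X = gpa.reshape(-1, 1)

# f_classif returns one F-value and one p-value per feature
f_sklearn, p_sklearn = f_classif(X, y)

# The same test, computed directly as a one-way ANOVA across the classes
f_scipy, p_scipy = f_oneway(gpa[y == 0], gpa[y == 1])

print(round(f_sklearn[0], 3), round(f_scipy, 3))
```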
Let's try using ANOVA for our feature selection. We can set the score_func parameter of SelectKBest to f_classif to select based on ANOVA:
ksel = SelectKBest(score_func=f_classif, k=5)
ksel.fit(X_train_enc, y_train.values.ravel())
selcols = X_train_enc.columns[ksel.get_support()]
selcols
Index(['satverbal', 'satmath', 'gpascience', 'gpaenglish', 'gpaoverall'], dtype='object')
pd.DataFrame({'score': ksel.scores_,
  'feature': X_train_enc.columns},
  columns=['feature','score']).\
  sort_values(['score'], ascending=False)
feature score
5 gpaoverall 119.471
3 gpaenglish 108.006
2 gpascience 96.824
1 satmath 84.901
0 satverbal 77.363
4 gpamath 60.930
7 fatherhighgrade 37.481
6 motherhighgrade 29.377
8 parentincome 22.266
9 gender_Female 15.098
This selected the same features as were selected when we used mutual information. Showing the scores gives us some indication of whether the selected value for k makes sense. For example, there is a greater drop in score from the fifth- to the sixth-best feature (77-61) than from the fourth to the fifth (85-77). There is an even bigger decline from the sixth to the seventh, however (61-37), suggesting that we should at least consider a value for k of 6.
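The arithmetic behind that comparison can be laid out for every adjacent pair of features. This sketch simply retypes the ANOVA scores printed above into a pandas Series (it does not refit anything) and computes the drop from each feature to the next:

```python
import pandas as pd

# ANOVA F-scores copied from the output above, already sorted from best to worst
scores = pd.Series(
    [119.471, 108.006, 96.824, 84.901, 77.363,
     60.930, 37.481, 29.377, 22.266, 15.098],
    index=['gpaoverall', 'gpaenglish', 'gpascience', 'satmath', 'satverbal',
           'gpamath', 'fatherhighgrade', 'motherhighgrade', 'parentincome',
           'gender_Female'])

# drops[label] is the fall in score from the previous feature to this one
drops = -scores.diff().dropna()
print(drops.round(3))

# The label with the largest drop marks the first feature after the biggest gap
print(drops.idxmax())
```

The largest gap sits in front of fatherhighgrade, that is, between the sixth- and seventh-best features, which is why a k of 6 is worth considering.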
ANOVA tests, and the mutual information classification we did earlier, do not take into account features that are only important in multivariate analysis. For example, fatherhighgrade might matter among individuals with similar GPA or SAT scores. We use multivariate feature selection methods later in this chapter. We do more univariate feature selection in the next section where we explore selection techniques appropriate for continuous targets.
Regression models have a continuous target. The statistical techniques we used in the previous section are not appropriate for such targets. Fortunately, scikit-learn's feature_selection module provides several options for selecting features when building regression models. (By regression models here, I do not mean linear regression models. I am only referring to models with continuous targets.) Two good options are selection based on F-tests and selection based on mutual information for regression. Let's start with F-tests.
The F-statistic is a measure of the strength of the linear correlation between a target and a single regressor. Scikit-learn has an f_regression scoring function, which returns F-statistics. We can use it with SelectKBest to select features based on that statistic.
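The link between the F-statistic and the correlation can be checked directly. The sketch below uses simulated data (the slope of 0.4 is an arbitrary assumption) and rebuilds the score for one feature from the Pearson correlation r, using F = r² / (1 − r²) × (n − 2):

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(0)
n = 500

# A simulated feature with a modest linear relationship to the target
x = rng.normal(size=n)
y = 0.4 * x + rng.normal(size=n)

# f_regression returns one F-statistic and one p-value per feature
f_stat, p_val = f_regression(x.reshape(-1, 1), y)

# Rebuild the same statistic from the Pearson correlation
r = np.corrcoef(x, y)[0, 1]
f_manual = r**2 / (1 - r**2) * (n - 2)

print(round(f_stat[0], 3), round(f_manual, 3))
```

Because the score is a monotonic function of r², ranking features by this F-statistic is the same as ranking them by the absolute value of their correlation with the target.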
Let's use F-statistics to select features for a model of wages. We use mutual information for regression in the next section to select features for the same target:
import pandas as pd
import numpy as np
from feature_engine.encoding import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
nls97wages = pd.read_csv("data/nls97wages.csv")
feature_cols = ['satverbal','satmath','gpascience',
'gpaenglish','gpamath','gpaoverall','gender',
'motherhighgrade','fatherhighgrade','parentincome',
'completedba']
X_train, X_test, y_train, y_test = \
  train_test_split(nls97wages[feature_cols],
  nls97wages[['wageincome']], test_size=0.3, random_state=0)
ohe = OneHotEncoder(drop_last=True, variables=['gender'])
X_train_enc = ohe.fit_transform(X_train)
scaler = StandardScaler()
standcols = X_train_enc.iloc[:,:-1].columns
X_train_enc = \
  pd.DataFrame(scaler.fit_transform(X_train_enc[standcols]),
  columns=standcols, index=X_train_enc.index).\
  join(X_train_enc[['gender_Male']])
y_train = \
  pd.DataFrame(scaler.fit_transform(y_train),
  columns=['wageincome'], index=y_train.index)
Note
You may have noticed that we are not encoding or scaling the testing data. We will need to do that eventually to validate our models. We will introduce validation later in this chapter and go over it in much more detail in the next chapter.
ksel = SelectKBest(score_func=f_regression, k=5)
ksel.fit(X_train_enc, y_train.values.ravel())
selcols = X_train_enc.columns[ksel.get_support()]
selcols
Index(['satmath', 'gpascience', 'parentincome',
'completedba','gender_Male'],
dtype='object')
pd.DataFrame({'score': ksel.scores_,
  'feature': X_train_enc.columns},
  columns=['feature','score']).\
  sort_values(['score'], ascending=False)
feature score
1 satmath 45
9 completedba 38
10 gender_Male 26
8 parentincome 24
2 gpascience 21
0 satverbal 19
5 gpaoverall 17
4 gpamath 13
3 gpaenglish 10
6 motherhighgrade 9
7 fatherhighgrade 8
The disadvantage of the F-statistic is that it assumes a linear relationship between each feature and the target. When that assumption does not make sense, we can use mutual information for regression instead.
We can also use SelectKBest to select features using mutual information for regression:
from functools import partial
from sklearn.feature_selection import mutual_info_regression
ksel = SelectKBest(score_func=
partial(mutual_info_regression, random_state=0),
k=5)
ksel.fit(X_train_enc, y_train.values.ravel())
selcols = X_train_enc.columns[ksel.get_support()]
selcols
Index(['satmath', 'gpascience', 'fatherhighgrade', 'completedba','gender_Male'],dtype='object')
pd.DataFrame({'score': ksel.scores_,
  'feature': X_train_enc.columns},
  columns=['feature','score']).\
  sort_values(['score'], ascending=False)
feature score
1 satmath 0.101
10 gender_Male 0.074
7 fatherhighgrade 0.047
2 gpascience 0.044
9 completedba 0.044
4 gpamath 0.016
8 parentincome 0.015
6 motherhighgrade 0.012
0 satverbal 0.000
3 gpaenglish 0.000
5 gpaoverall 0.000
We get fairly similar results with mutual information for regression as we did with F-tests. parentincome was selected with F-tests and fatherhighgrade with mutual information. Otherwise, the same features are selected.
A key advantage of mutual information for regression compared with F-tests is that it does not assume a linear relationship between the feature and the target. If that assumption turns out to be unwarranted, mutual information is a better approach. (Again, there is also some randomness in the scoring and the score for each feature can bounce around within a limited range.)
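A small simulation shows what that difference can look like. In this sketch (synthetic data; the feature names are my own), the target depends on the square of one feature, so the relationship is strong but not linear. We would expect the F-statistic to largely miss it and mutual information to pick it up:

```python
import numpy as np
from sklearn.feature_selection import f_regression, mutual_info_regression

rng = np.random.default_rng(0)
n = 1000

# x_quad drives the target through its square; x_noise is unrelated to it
x_quad = rng.uniform(-1, 1, size=n)
x_noise = rng.uniform(-1, 1, size=n)
y = x_quad**2 + rng.normal(scale=0.05, size=n)
X = np.column_stack([x_quad, x_noise])

# Linear scoring: x and x squared are nearly uncorrelated on a symmetric range
f_scores, _ = f_regression(X, y)

# Non-linear scoring: mutual information does not assume a functional form
mi_scores = mutual_info_regression(X, y, random_state=0)

print('F:', f_scores.round(2))
print('MI:', mi_scores.round(3))
```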
Note
Our choice of k=5 to get the five best features is quite arbitrary. We can make it much more scientific with some hyperparameter tuning. We will go over tuning in the next chapter.
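As a preview of what that tuning might look like, here is a minimal sketch that treats k as a hyperparameter in a grid search. It uses a synthetic dataset from make_classification and a logistic regression model, both of which are stand-ins I have chosen for illustration rather than the NLS data and models used in this chapter:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in data: 10 features, only some of them informative
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           n_redundant=2, random_state=0)

# Putting the selector in a pipeline means it is refit inside each CV fold
pipe = Pipeline([
    ('select', SelectKBest(score_func=f_classif)),
    ('model', LogisticRegression(max_iter=1000)),
])

# Try every value of k from 1 to 10 and keep the one with the best CV accuracy
grid = GridSearchCV(pipe, param_grid={'select__k': list(range(1, 11))},
                    scoring='accuracy', cv=5)
grid.fit(X, y)

best_k = grid.best_params_['select__k']
print(best_k, round(grid.best_score_, 3))
```

Running the selection inside the pipeline matters: selecting features on the full training data before cross-validating would leak information from the validation folds.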
The feature selection methods we have used so far are known as filter methods. They examine the univariate relationship between each feature and the target. They are a good starting point. Similar to our discussion in previous chapters of the usefulness of having correlations handy before we start examining multivariate relationships, it is helpful to at least explore filter methods. Often, though, our model fitting will require taking into account features that are important, or not, when other features are also included. To do that, we need to use wrapper or embedded methods for feature selection. We explore wrapper methods in the next few sections, starting with forward and backward feature selection.
Forward and backward feature selection, as their names suggest, select features by adding them one by one – or subtracting them for backward selection – and assessing the impact on model performance after each iteration. Since both methods assess that performance based on a given algorithm, they are considered wrapper selection methods.
Wrapper feature selection methods have two advantages over the filter methods we have explored so far. First, they evaluate the importance of features as other features are included. Second, since features are evaluated based on their contribution to the performance of a specific algorithm, we get a better sense of which features will ultimately matter. For example, satmath seemed to be an important feature based on our results from the previous section. But it is possible that satmath is only important when we use a particular model, say linear regression, and not an alternative such as decision tree regression. Wrapper selection methods can help us discover that.
The main disadvantage of wrapper methods is that they can be quite expensive computationally since they retrain the model after each iteration. We will look at both forward and backward feature selection in this section.
Forward feature selection starts with no features in the model. At each step, it tries adding each remaining feature to the features already selected, keeps the one that most improves the performance of the chosen algorithm, and repeats until the requested number of features has been reached. Unlike the filter methods, each candidate is evaluated in combination with the features selected so far rather than on its own.
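The iterative procedure that a sequential forward selector carries out can be sketched in a few lines. This is a simplified illustration on synthetic data, not mlxtend's implementation; forward_select is a hypothetical helper, and logistic regression is used only to keep the loop fast:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_select(estimator, X, y, k, cv=5):
    """Greedy forward selection: repeatedly add the feature that gives
    the best cross-validated accuracy alongside those already chosen."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k:
        # Score each candidate together with the features chosen so far
        cv_scores = {
            j: cross_val_score(estimator, X[:, selected + [j]], y,
                               cv=cv, scoring='accuracy').mean()
            for j in remaining
        }
        best = max(cv_scores, key=cv_scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic stand-in data with 8 features
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
chosen = forward_select(LogisticRegression(max_iter=1000), X, y, k=3)
print(chosen)
```

Notice that a feature is never reconsidered once it is in selected, which is the limitation of forward selection discussed later in this section.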
We can use forward feature selection to develop a model of bachelor's degree completion. Since wrapper methods require us to choose an algorithm, and this is a binary target, let's use scikit-learn's random forest classifier. We will also need the feature_selection module of mlxtend to do the iteration required to select features:
import pandas as pd
from feature_engine.encoding import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from mlxtend.feature_selection import SequentialFeatureSelector
nls97compba = pd.read_csv("data/nls97compba.csv")
feature_cols = ['satverbal','satmath','gpascience',
'gpaenglish','gpamath','gpaoverall','gender',
'motherhighgrade','fatherhighgrade','parentincome']
X_train, X_test, y_train, y_test = \
  train_test_split(nls97compba[feature_cols],
  nls97compba[['completedba']], test_size=0.3, random_state=0)
ohe = OneHotEncoder(drop_last=True, variables=['gender'])
X_train_enc = ohe.fit_transform(X_train)
scaler = StandardScaler()
standcols = X_train_enc.iloc[:,:-1].columns
X_train_enc = \
  pd.DataFrame(scaler.fit_transform(X_train_enc[standcols]),
  columns=standcols, index=X_train_enc.index).\
  join(X_train_enc[['gender_Female']])
rfc = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
sfs = SequentialFeatureSelector(rfc, k_features=5,
forward=True, floating=False, verbose=2,
scoring='accuracy', cv=5)
sfs.fit(X_train_enc, y_train.values.ravel())
selcols = X_train_enc.columns[list(sfs.k_feature_idx_)]
selcols
Index(['satverbal', 'satmath', 'gpaoverall',
'parentincome', 'gender_Female'], dtype='object')
You might recall from the first section of this chapter that our univariate feature selection for the completed bachelor's degree target gave us somewhat different results:
Index(['satverbal', 'satmath', 'gpascience',
'gpaenglish', 'gpaoverall'], dtype='object')
Three of the features – satmath, satverbal, and gpaoverall – are the same. But our forward feature selection has identified parentincome and gender_Female as more important than gpascience and gpaenglish, which were selected in the univariate analysis. Indeed, gender_Female had among the lowest scores in the earlier analysis. These differences likely reflect the advantages of wrapper feature selection methods. We can identify features that are not important unless other features are included, and we are evaluating the effect on the performance of a particular algorithm, in this case, random forest classification.
One disadvantage of forward selection is that once a feature is selected, it is not removed, even though it may decline in importance as additional features are added. (Recall that forward feature selection adds features iteratively based on the contribution of that feature to the model.)
Let's see whether our results vary with backward feature selection.
Backward feature selection starts with all features and eliminates the least important. It then repeats this process with the remaining features. We can use mlxtend's SequentialFeatureSelector for backward selection in pretty much the same way we used it for forward selection.
We instantiate a RandomForestClassifier object from the scikit-learn library and then pass it to mlxtend's sequential feature selector:
rfc = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
sfs = SequentialFeatureSelector(rfc, k_features=5,
forward=False, floating=False, verbose=2,
scoring='accuracy', cv=5)
sfs.fit(X_train_enc, y_train.values.ravel())
selcols = X_train_enc.columns[list(sfs.k_feature_idx_)]
selcols
Index(['satverbal', 'gpascience', 'gpaenglish',
'gpaoverall', 'gender_Female'], dtype='object')
Perhaps unsurprisingly, we get different results for our feature selection. satmath and parentincome are no longer selected, and gpascience and gpaenglish are.
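If you would rather not add mlxtend as a dependency, scikit-learn ships its own class with the same name, SequentialFeatureSelector, in sklearn.feature_selection; it takes a direction argument instead of forward. The sketch below runs both directions on synthetic data (the data and the logistic regression model are stand-ins I have assumed for speed) so that the two selections can be compared side by side:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with 8 features
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

# Fit one selector per direction with otherwise identical settings
selectors = {
    direction: SequentialFeatureSelector(
        LogisticRegression(max_iter=1000), n_features_to_select=3,
        direction=direction, scoring='accuracy', cv=5).fit(X, y)
    for direction in ('forward', 'backward')
}

# Column indices retained by each direction
chosen = {d: set(s.get_support(indices=True)) for d, s in selectors.items()}
print(chosen)
print('shared:', chosen['forward'] & chosen['backward'])
```

The two directions need not agree, for the same reason our mlxtend results differed.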
Backward feature selection has the opposite drawback to forward feature selection. Once a feature has been removed, it is not re-evaluated, even though its importance may change with different feature mixtures. Let's try exhaustive feature selection instead.
If your results from forward and backward selection are unpersuasive, and you do not mind running a model while you go out for coffee or lunch, you can try exhaustive feature selection. Exhaustive feature selection trains a given model on all possible combinations of features and selects the best subset of features. But it does this at a price. As the name suggests, this procedure might exhaust both system resources and your patience.
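It helps to count the work involved before starting. This sketch assumes the settings we use in this section (10 encoded features, subsets of between 1 and 5 features, and 5-fold cross-validation) and counts how many feature subsets and model fits that implies:

```python
from math import comb

n_features = 10                     # columns in the encoded training data
min_features, max_features = 1, 5   # subset sizes to evaluate
cv_folds = 5

# One candidate per distinct subset of each allowed size
n_subsets = sum(comb(n_features, k)
                for k in range(min_features, max_features + 1))

# Each candidate subset is fit once per cross-validation fold
n_fits = n_subsets * cv_folds
print(n_subsets, n_fits)  # 637 3185
```

The count grows quickly with the number of features and the maximum subset size, which is why this method is usually only practical for small feature sets or cheap models.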
Let's use exhaustive feature selection for our model of bachelor's degree completion:
import pandas as pd
from feature_engine.encoding import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from mlxtend.feature_selection import ExhaustiveFeatureSelector
from sklearn.metrics import accuracy_score
nls97compba = pd.read_csv("data/nls97compba.csv")
feature_cols = ['satverbal','satmath','gpascience',
'gpaenglish','gpamath','gpaoverall','gender',
'motherhighgrade','fatherhighgrade','parentincome']
X_train, X_test, y_train, y_test = \
  train_test_split(nls97compba[feature_cols],
  nls97compba[['completedba']], test_size=0.3, random_state=0)
ohe = OneHotEncoder(drop_last=True, variables=['gender'])
ohe.fit(X_train)
X_train_enc, X_test_enc = \
  ohe.transform(X_train), ohe.transform(X_test)
scaler = StandardScaler()
standcols = X_train_enc.iloc[:,:-1].columns
scaler.fit(X_train_enc[standcols])
X_train_enc = \
  pd.DataFrame(scaler.transform(X_train_enc[standcols]),
  columns=standcols, index=X_train_enc.index).\
  join(X_train_enc[['gender_Female']])
X_test_enc = \
  pd.DataFrame(scaler.transform(X_test_enc[standcols]),
  columns=standcols, index=X_test_enc.index).\
  join(X_test_enc[['gender_Female']])
rfc = RandomForestClassifier(n_estimators=100, max_depth=2,n_jobs=-1, random_state=0)
efs = ExhaustiveFeatureSelector(rfc, max_features=5,
min_features=1, scoring='accuracy',
print_progress=True, cv=5)
efs.fit(X_train_enc, y_train.values.ravel())
efs.best_feature_names_
('satverbal', 'gpascience', 'gpamath', 'gender_Female')
X_train_efs = efs.transform(X_train_enc)
X_test_efs = efs.transform(X_test_enc)
rfc.fit(X_train_efs, y_train.values.ravel())
y_pred = rfc.predict(X_test_efs)
confusion = pd.DataFrame(y_pred, columns=['pred'],
  index=y_test.index).\
  join(y_test)
confusion.loc[confusion.pred==confusion.completedba].shape[0]\
  /confusion.shape[0]
0.6703296703296703
accuracy_score(y_test, y_pred)
0.6703296703296703
Note
The accuracy score is often used to assess the performance of a classification model. We will lean on it in this chapter, but other measures might be equally or more important depending on the purposes of your model. For example, we are sometimes more concerned with sensitivity, the ratio of our correct positive predictions to the number of actual positives. We examine the evaluation of classification models in detail in Chapter 6, Preparing for Model Evaluation.
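To see how accuracy and sensitivity can diverge, consider this toy example. The labels are made up; scikit-learn calls sensitivity recall:

```python
from sklearn.metrics import accuracy_score, recall_score

# Eight made-up observations: three actual positives, five actual negatives
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]

acc = accuracy_score(y_true, y_pred)   # 5 of 8 predictions are correct
sens = recall_score(y_true, y_pred)    # 1 of 3 actual positives is found

print(acc)             # 0.625
print(round(sens, 3))  # 0.333
```

A model can look reasonable on accuracy while missing most of the positive cases, which matters a great deal when the positives are what we care about.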
lr = LogisticRegression(solver='liblinear')
efs = ExhaustiveFeatureSelector(lr, max_features=5,
min_features=1, scoring='accuracy',
print_progress=True, cv=5)
efs.fit(X_train_enc, y_train.values.ravel())
efs.best_feature_names_
('satmath', 'gpascience', 'gpaenglish', 'motherhighgrade', 'gender_Female')
X_train_efs = efs.transform(X_train_enc)
X_test_efs = efs.transform(X_test_enc)
lr.fit(X_train_efs, y_train.values.ravel())
y_pred = lr.predict(X_test_efs)
accuracy_score(y_test, y_pred)
0.6923076923076923
rfc = RandomForestClassifier(n_estimators=100, max_depth=2,
n_jobs=-1, random_state=0)
efs = ExhaustiveFeatureSelector(rfc, max_features=5,
min_features=1, scoring='accuracy',
print_progress=True, cv=5)
%timeit efs.fit(X_train_enc, y_train.values.ravel())
5min 8s ± 3 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
lr = LogisticRegression(solver='liblinear')
efs = ExhaustiveFeatureSelector(lr, max_features=5,
min_features=1, scoring='accuracy',
print_progress=True, cv=5)
%timeit efs.fit(X_train_enc, y_train.values.ravel())
4.29 s ± 45.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Exhaustive feature selection can provide very clear guidance about the features to select, as I have mentioned, but that may come at too high a price for many projects. It may actually be better suited for diagnostic work than for use in a machine learning pipeline. If a linear model is appropriate, it can lower the computational costs considerably.
Wrapper methods, such as forward, backward, and exhaustive feature selection, tax system resources because the model must be retrained with each iteration, and the more expensive the chosen algorithm is to train, the more this is an issue. Recursive feature elimination (RFE) is something of a compromise between the simplicity of filter methods and the better information provided by wrapper methods. It is similar to backward feature selection, except that it simplifies the removal of a feature at each iteration by basing it on the fitted model's coefficients or feature importances rather than re-evaluating model performance without each candidate feature. We explore recursive feature elimination in the next two sections.
A popular wrapper method is RFE. This method starts with all features, removes the lowest weighted one (based on a coefficient or feature importance measure), and repeats the process until the requested number of features remains. When a feature is removed, it is given a ranking reflecting the point at which it was removed.
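The elimination loop is short enough to write by hand. The sketch below uses synthetic data in which only the first two of six features drive the target; manual_rfe is a hypothetical helper written for illustration, and its result is compared with scikit-learn's RFE class:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Six simulated features; only the first two affect the target
X = rng.normal(size=(500, 6))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=500)

def manual_rfe(X, y, n_keep):
    """Refit, drop the feature with the smallest absolute coefficient,
    and repeat until n_keep features remain."""
    remaining = list(range(X.shape[1]))
    ranking = np.ones(X.shape[1], dtype=int)
    while len(remaining) > n_keep:
        coefs = LinearRegression().fit(X[:, remaining], y).coef_
        weakest = remaining[int(np.argmin(np.abs(coefs)))]
        # Features dropped earlier receive larger (worse) ranks
        ranking[weakest] = len(remaining) - n_keep + 1
        remaining.remove(weakest)
    return remaining, ranking

kept, ranking = manual_rfe(X, y, n_keep=2)
sk = RFE(estimator=LinearRegression(), n_features_to_select=2).fit(X, y)

print(kept, ranking)
print(np.where(sk.support_)[0], sk.ranking_)
```

Selected features all share a ranking of 1, and the first feature eliminated gets the largest value.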
RFE can be used for both regression models and classification models. We will start by using it in a regression model:
import pandas as pd
from feature_engine.encoding import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
nls97wages = pd.read_csv("data/nls97wages.csv")
feature_cols = ['satverbal','satmath','gpascience',
'gpaenglish','gpamath','gpaoverall','motherhighgrade',
'fatherhighgrade','parentincome','gender','completedba']
X_train, X_test, y_train, y_test = \
  train_test_split(nls97wages[feature_cols],
  nls97wages[['weeklywage']], test_size=0.3, random_state=0)
ohe = OneHotEncoder(drop_last=True, variables=['gender'])
ohe.fit(X_train)
X_train_enc, X_test_enc = \
  ohe.transform(X_train), ohe.transform(X_test)
scaler = StandardScaler()
standcols = feature_cols[:-2]
scaler.fit(X_train_enc[standcols])
X_train_enc = \
  pd.DataFrame(scaler.transform(X_train_enc[standcols]),
  columns=standcols, index=X_train_enc.index).\
  join(X_train_enc[['gender_Male','completedba']])
X_test_enc = \
  pd.DataFrame(scaler.transform(X_test_enc[standcols]),
  columns=standcols, index=X_test_enc.index).\
  join(X_test_enc[['gender_Male','completedba']])
scaler.fit(y_train)
y_train, y_test = \
  pd.DataFrame(scaler.transform(y_train),
  columns=['weeklywage'], index=y_train.index),\
  pd.DataFrame(scaler.transform(y_test),
  columns=['weeklywage'], index=y_test.index)
Now, we are ready to do some recursive feature selection. Since RFE is a wrapper method, we need to choose an algorithm around which the selection will be wrapped. Random forests for regression make sense in this case. We are modeling a continuous target and do not want to assume a linear relationship between the features and the target.
rfr = RandomForestRegressor(max_depth=2)
treesel = RFE(estimator=rfr, n_features_to_select=5)
treesel.fit(X_train_enc, y_train.values.ravel())
selcols = X_train_enc.columns[treesel.get_support()]
selcols
Index(['satmath', 'gpaoverall', 'parentincome', 'gender_Male', 'completedba'], dtype='object')
Note that this gives us a somewhat different list of features than the filter method (with F-tests) did for the wage income target. gpaoverall is selected here rather than gpascience; the other four features are the same.
pd.DataFrame({'ranking': treesel.ranking_,
  'feature': X_train_enc.columns},
  columns=['feature','ranking']).\
  sort_values(['ranking'], ascending=True)
feature ranking
1 satmath 1
5 gpaoverall 1
8 parentincome 1
9 gender_Male 1
10 completedba 1
6 motherhighgrade 2
2 gpascience 3
0 satverbal 4
3 gpaenglish 5
4 gpamath 6
7 fatherhighgrade 7
fatherhighgrade was removed after the first iteration and gpamath after the second.
rfr.fit(treesel.transform(X_train_enc), y_train.values.ravel())
rfr.score(treesel.transform(X_test_enc), y_test)
0.13612629794428466
lr = LinearRegression()
lrsel = RFE(estimator=lr, n_features_to_select=5)
lrsel.fit(X_train_enc, y_train)
selcols = X_train_enc.columns[lrsel.get_support()]
selcols
Index(['satmath', 'gpaoverall', 'parentincome', 'gender_Male', 'completedba'], dtype='object')
lr.fit(lrsel.transform(X_train_enc), y_train)
lr.score(lrsel.transform(X_test_enc), y_test)
0.17773742846314056
The linear model is not really much better than the random forest model. This is likely a sign that, collectively, the features available to us only capture a small part of the variation in wages per week. This is an important reminder that we can identify several significant features and still have a model with limited explanatory power. (Perhaps it is also good news that our scores on standardized tests, and even our degree attainment, are important but not determinative of our wages many years later.)
Let's try RFE with a classification model.
RFE can also be a good choice for classification problems. We can use RFE to select features for a model of bachelor's degree completion. You may recall that we used exhaustive feature selection to select features for that model earlier in this chapter. Let's see whether we get better accuracy or an easier-to-train model with RFE:
import pandas as pd
from feature_engine.encoding import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.metrics import accuracy_score
nls97compba = pd.read_csv("data/nls97compba.csv")
feature_cols = ['satverbal','satmath','gpascience',
'gpaenglish','gpamath','gpaoverall','gender',
'motherhighgrade','fatherhighgrade','parentincome']
X_train, X_test, y_train, y_test = \
  train_test_split(nls97compba[feature_cols],
  nls97compba[['completedba']], test_size=0.3,
  random_state=0)
ohe = OneHotEncoder(drop_last=True, variables=['gender'])
ohe.fit(X_train)
X_train_enc, X_test_enc = \
  ohe.transform(X_train), ohe.transform(X_test)
scaler = StandardScaler()
standcols = X_train_enc.iloc[:,:-1].columns
scaler.fit(X_train_enc[standcols])
X_train_enc = \
  pd.DataFrame(scaler.transform(X_train_enc[standcols]),
  columns=standcols, index=X_train_enc.index).\
  join(X_train_enc[['gender_Female']])
X_test_enc = \
  pd.DataFrame(scaler.transform(X_test_enc[standcols]),
  columns=standcols, index=X_test_enc.index).\
  join(X_test_enc[['gender_Female']])
rfc = RandomForestClassifier(n_estimators=100, max_depth=2,
n_jobs=-1, random_state=0)
treesel = RFE(estimator=rfc, n_features_to_select=5)
treesel.fit(X_train_enc, y_train.values.ravel())
selcols = X_train_enc.columns[treesel.get_support()]
selcols
Index(['satverbal', 'satmath', 'gpascience', 'gpaenglish', 'gpaoverall'], dtype='object')
pd.DataFrame({'ranking': treesel.ranking_,
  'feature': X_train_enc.columns},
  columns=['feature','ranking']).\
  sort_values(['ranking'], ascending=True)
feature ranking
0 satverbal 1
1 satmath 1
2 gpascience 1
3 gpaenglish 1
5 gpaoverall 1
4 gpamath 2
8 parentincome 3
7 fatherhighgrade 4
6 motherhighgrade 5
9 gender_Female 6
rfc.fit(treesel.transform(X_train_enc), y_train.values.ravel())
y_pred = rfc.predict(treesel.transform(X_test_enc))
accuracy_score(y_test, y_pred)
0.684981684981685
Recall that we had 67% accuracy with the exhaustive feature selection. At 68%, we get about the same accuracy here. The benefit of RFE, though, is that it can be significantly less expensive to run than exhaustive feature selection.
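To get a rough sense of that difference in cost, we can count model fits. The sketch below assumes an exhaustive search over every subset of 1 to 5 of our 10 encoded features (the subset sizes are an assumption for illustration) and RFE with scikit-learn's default step of 1. Cross-validation would multiply the exhaustive count further.

```python
from math import comb

n_features = 10   # columns after one-hot encoding
n_select = 5      # number of features we want to keep

# Exhaustive search: one fit for every subset of size 1 through 5
# (the range of subset sizes is an assumption for this illustration).
exhaustive_fits = sum(comb(n_features, k) for k in range(1, n_select + 1))

# RFE with step=1: one fit for each feature eliminated,
# plus a final fit on the features that survive.
rfe_fits = (n_features - n_select) + 1

print(exhaustive_fits, rfe_fits)   # 637 6
```

The gap widens quickly as the number of candidate features grows, since the subset count grows combinatorially while the RFE count grows linearly.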
Another option among wrapper and wrapper-like feature selection methods is the Boruta library. Originally developed as an R package, it can now be used with any scikit-learn ensemble method. We use it with scikit-learn's random forest classifier in the next section.
The Boruta package takes a unique approach to feature selection, though it has some similarities with wrapper methods. For each feature, Boruta creates a shadow feature, one with the same range of values as the original feature but with shuffled values. It then evaluates whether the original feature offers more information than the shadow feature, gradually removing features providing the least information. Boruta outputs confirmed, tentative, and rejected features with each iteration.
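The shadow feature idea can be sketched by hand. This is a single round on synthetic data, not the Boruta algorithm itself: BorutaPy repeats the comparison over many iterations and applies a statistical test before confirming or rejecting a feature. The data, the forest settings, and names such as X_shadow and hits are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: with shuffle=False, the 3 informative features come
# first and the remaining 3 columns are pure noise.
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
    n_redundant=0, shuffle=False, random_state=0)

# One shadow per feature: permute each column independently, which keeps
# its distribution but breaks any relationship with the target.
rng = np.random.default_rng(0)
X_shadow = np.column_stack(
    [rng.permutation(X[:, j]) for j in range(X.shape[1])])

# Fit on the real and shadow features together.
rf = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=0)
rf.fit(np.hstack([X, X_shadow]), y)

real_imp = rf.feature_importances_[:X.shape[1]]
shadow_max = rf.feature_importances_[X.shape[1]:].max()

# A real feature scores a "hit" when it beats the best shadow feature.
hits = real_imp > shadow_max
print(np.round(real_imp, 3), round(shadow_max, 3), hits)
```

Features that rarely beat the best shadow feature across rounds are the ones a Boruta-style procedure would end up rejecting.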
Let's use the Boruta package to select features for a classification model of bachelor's degree completion (you can install the Boruta package with pip if you have not yet installed it):
import pandas as pd
from feature_engine.encoding import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy
from sklearn.metrics import accuracy_score
nls97compba = pd.read_csv("data/nls97compba.csv")
feature_cols = ['satverbal','satmath','gpascience',
'gpaenglish','gpamath','gpaoverall','gender',
'motherhighgrade','fatherhighgrade','parentincome']
# hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = \
  train_test_split(nls97compba[feature_cols],
  nls97compba[['completedba']], test_size=0.3, random_state=0)
# one-hot encode gender; the new dummy column is appended last
ohe = OneHotEncoder(drop_last=True, variables=['gender'])
ohe.fit(X_train)
X_train_enc, X_test_enc = \
  ohe.transform(X_train), ohe.transform(X_test)
# standardize every column except the gender dummy
scaler = StandardScaler()
standcols = X_train_enc.iloc[:,:-1].columns
scaler.fit(X_train_enc[standcols])
X_train_enc = \
  pd.DataFrame(scaler.transform(X_train_enc[standcols]),
  columns=standcols, index=X_train_enc.index).\
  join(X_train_enc[['gender_Female']])
X_test_enc = \
  pd.DataFrame(scaler.transform(X_test_enc[standcols]),
  columns=standcols, index=X_test_enc.index).\
  join(X_test_enc[['gender_Female']])
rfc = RandomForestClassifier(n_estimators=100,
max_depth=2, n_jobs=-1, random_state=0)
borsel = BorutaPy(rfc, random_state=0, verbose=2)
borsel.fit(X_train_enc.values, y_train.values.ravel())
BorutaPy finished running.
Iteration: 100 / 100
Confirmed: 9
Tentative: 1
Rejected: 0
selcols = X_train_enc.columns[borsel.support_]
selcols
Index(['satverbal', 'satmath', 'gpascience', 'gpaenglish', 'gpamath', 'gpaoverall', 'motherhighgrade', 'fatherhighgrade', 'parentincome', 'gender_Female'], dtype='object')
pd.DataFrame({'ranking': borsel.ranking_,
  'feature': X_train_enc.columns},
  columns=['feature','ranking']).\
  sort_values(['ranking'], ascending=True)
           feature  ranking
0        satverbal        1
1          satmath        1
2       gpascience        1
3       gpaenglish        1
4          gpamath        1
5       gpaoverall        1
6  motherhighgrade        1
7  fatherhighgrade        1
8     parentincome        1
9    gender_Female        2
rfc.fit(borsel.transform(X_train_enc.values), y_train.values.ravel())
y_pred = rfc.predict(borsel.transform(X_test_enc.values))
accuracy_score(y_test, y_pred)
0.684981684981685
Part of Boruta's appeal is the persuasiveness of its selection of each feature. If a feature has been selected, then it likely does provide information that is not captured by combinations of features that exclude it. However, it is quite computationally expensive, not unlike exhaustive feature selection. It can help us sort out which features matter, but it may not always be suitable for pipelines where training speed matters.
The last few sections have shown some of the advantages and some disadvantages of wrapper feature selection methods. We explore embedded selection methods in the next section. These methods provide more information than filter methods but without the computational costs of wrapper methods. They do this by embedding feature selection into the training process. We will explore embedded methods with the same data we have worked with so far.
Regularization methods are embedded methods. Like wrapper methods, embedded methods evaluate features relative to a given algorithm. But they are not as expensive computationally. That is because feature selection is built into the algorithm already and so happens as the model is being trained.
Embedded methods fold feature selection into model fitting: as the model is trained, it estimates how much each feature contributes, and features that contribute little are dropped.
Regularization accomplishes this by adding a penalty to the model that constrains its parameters. L1 regularization, also referred to as lasso regularization, shrinks some of the coefficients in a regression model to 0, effectively eliminating those features.
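A minimal sketch of that shrinkage, on synthetic standardized data rather than the NLS data (the dataset and the particular values of C are assumptions for illustration). In scikit-learn, C is the inverse of the regularization strength, so a smaller C means a heavier penalty.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 features, 3 of them informative.
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
    n_redundant=0, random_state=0)
X = StandardScaler().fit_transform(X)

# Count the coefficients that survive at each penalty strength.
nonzero = {}
for C in [1, 0.05, 0.001]:
    lr = LogisticRegression(C=C, penalty="l1", solver="liblinear")
    lr.fit(X, y)
    nonzero[C] = int(np.sum(lr.coef_ != 0))

print(nonzero)
```

As C falls, more coefficients are driven to exactly 0. With a heavy enough penalty, none survive at all.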
import pandas as pd
from feature_engine.encoding import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
nls97compba = pd.read_csv("data/nls97compba.csv")
feature_cols = ['satverbal','satmath','gpascience',
'gpaenglish','gpamath','gpaoverall','gender',
'motherhighgrade','fatherhighgrade','parentincome']
# hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = \
  train_test_split(nls97compba[feature_cols],
  nls97compba[['completedba']], test_size=0.3,
  random_state=0)
# one-hot encode gender; the new dummy column is appended last
ohe = OneHotEncoder(drop_last=True,
  variables=['gender'])
ohe.fit(X_train)
X_train_enc, X_test_enc = \
  ohe.transform(X_train), ohe.transform(X_test)
# standardize every column except the gender dummy
scaler = StandardScaler()
standcols = X_train_enc.iloc[:,:-1].columns
scaler.fit(X_train_enc[standcols])
X_train_enc = \
  pd.DataFrame(scaler.transform(X_train_enc[standcols]),
  columns=standcols, index=X_train_enc.index).\
  join(X_train_enc[['gender_Female']])
X_test_enc = \
  pd.DataFrame(scaler.transform(X_test_enc[standcols]),
  columns=standcols, index=X_test_enc.index).\
  join(X_test_enc[['gender_Female']])
lr = LogisticRegression(C=1, penalty="l1",
solver='liblinear')
regsel = SelectFromModel(lr, max_features=5)
regsel.fit(X_train_enc, y_train.values.ravel())
selcols = X_train_enc.columns[regsel.get_support()]
selcols
Index(['satmath', 'gpascience', 'gpaoverall',
'fatherhighgrade', 'gender_Female'], dtype='object')
lr.fit(regsel.transform(X_train_enc),
y_train.values.ravel())
y_pred = lr.predict(regsel.transform(X_test_enc))
accuracy_score(y_test, y_pred)
0.684981684981685
This gives us results fairly similar to those of the forward feature selection for bachelor's degree completion, where we used a random forest classifier in a wrapper method.
Lasso regularization is a good choice for feature selection in a case like this, particularly when performance is a key concern. It does, however, assume a linear relationship between the features and the target, which might not be appropriate. Fortunately, there are embedded feature selection methods that do not make that assumption. A good alternative to logistic regression for the embedded model is a random forest classifier. We try that next with the same data.
In this section, let's use a random forest classifier:
rfc = RandomForestClassifier(n_estimators=100,
max_depth=2, n_jobs=-1, random_state=0)
rfcsel = SelectFromModel(rfc, max_features=5)
rfcsel.fit(X_train_enc, y_train.values.ravel())
selcols = X_train_enc.columns[rfcsel.get_support()]
selcols
Index(['satverbal', 'gpascience', 'gpaenglish',
'gpaoverall'], dtype='object')
This actually selects very different features from the lasso regression. satmath, fatherhighgrade, and gender_Female are no longer selected, while satverbal and gpaenglish are. This is likely partly due to the relaxation of the assumption of linearity.
rfc.fit(rfcsel.transform(X_train_enc),
y_train.values.ravel())
y_pred = rfc.predict(rfcsel.transform(X_test_enc))
accuracy_score(y_test, y_pred)
0.673992673992674
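Notice that SelectFromModel returned four features for the random forest even though we set max_features=5. For an estimator without an L1 penalty, the default threshold is the mean feature importance, and max_features only caps how many features may pass it. The sketch below illustrates this on synthetic data (an assumption for illustration, not the NLS data); setting threshold=-np.inf switches the threshold off so that exactly the top max_features features are kept.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 10 features, 3 of them informative.
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
    n_redundant=0, random_state=0)

rfc = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)

# Default threshold (mean importance); max_features is only an upper bound.
capped = SelectFromModel(rfc, max_features=5).fit(X, y)

# No threshold: keep exactly the 5 most important features.
top5 = SelectFromModel(rfc, max_features=5, threshold=-np.inf).fit(X, y)

n_capped = int(capped.get_support().sum())
n_top5 = int(top5.get_support().sum())
print(capped.threshold_, n_capped, n_top5)
```

Because random forest importances sum to 1, the mean-importance threshold with 10 features is 0.1.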
Embedded methods are generally less CPU-/GPU-intensive than wrapper methods but can nonetheless produce good results. With our models of bachelor's degree completion in this section, we get about the same accuracy (67% to 68%) as we did with our models based on exhaustive feature selection.
Each of the methods we have discussed so far has important use cases. However, we have not yet addressed one very challenging feature selection problem. What do you do if you simply have too many features, many of which independently account for something important in your model? By too many, here I mean that there are so many features that the model cannot run efficiently, either for training or for predicting target values. How can we reduce the feature set without sacrificing too much of the predictive power of our model? In that situation, principal component analysis (PCA) might be a good approach. We'll discuss PCA in the next section.
A very different approach to feature selection than any of the methods we have discussed so far is PCA. PCA allows us to replace the existing feature set with a limited number of components, each of which explains an important amount of the variance. It does this by finding a component that captures the largest amount of variance, followed by a second component that captures the largest amount of remaining variance, and then a third component, and so on. One key advantage of this approach is that these components, known as principal components, are uncorrelated. We discuss PCA in detail in Chapter 15, Principal Component Analysis.
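Both properties, declining explained variance and uncorrelated components, are easy to check. The following is a minimal sketch on synthetic data (an assumption for illustration): three correlated columns driven by one latent factor, plus two independent noise columns.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: columns 0-2 share a latent factor, columns 3-4 are noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 1))
X = np.hstack([latent + 0.3 * rng.normal(size=(300, 3)),
               rng.normal(size=(300, 2))])

pca = PCA(n_components=3)
scores = pca.fit_transform(X)

# Each component explains no more variance than the one before it.
ratios = pca.explained_variance_ratio_

# The component scores are uncorrelated: off-diagonal correlations are ~0.
corr = np.corrcoef(scores, rowvar=False)
off_diag = corr[~np.eye(3, dtype=bool)]
print(np.round(ratios, 2), np.abs(off_diag).max())
```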
Although I include PCA here as a feature selection approach, it is probably better to think of it as a tool for dimension reduction. We use it for feature selection when we need to limit the number of dimensions without sacrificing too much explanatory power.
Let's work with the NLS data again and use PCA to select features for a model of bachelor's degree completion:
import numpy as np
import pandas as pd
from feature_engine.encoding import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
nls97compba = pd.read_csv("data/nls97compba.csv")
feature_cols = ['satverbal','satmath','gpascience',
'gpaenglish','gpamath','gpaoverall','gender',
'motherhighgrade', 'fatherhighgrade','parentincome']
# hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = \
  train_test_split(nls97compba[feature_cols],
  nls97compba[['completedba']], test_size=0.3,
  random_state=0)
# one-hot encode gender; the new dummy column is appended last
ohe = OneHotEncoder(drop_last=True,
  variables=['gender'])
ohe.fit(X_train)
X_train_enc, X_test_enc = \
  ohe.transform(X_train), ohe.transform(X_test)
# standardize every column except the gender dummy
scaler = StandardScaler()
standcols = X_train_enc.iloc[:,:-1].columns
scaler.fit(X_train_enc[standcols])
X_train_enc = \
  pd.DataFrame(scaler.transform(X_train_enc[standcols]),
  columns=standcols, index=X_train_enc.index).\
  join(X_train_enc[['gender_Female']])
X_test_enc = \
  pd.DataFrame(scaler.transform(X_test_enc[standcols]),
  columns=standcols, index=X_test_enc.index).\
  join(X_test_enc[['gender_Female']])
pca = PCA(n_components=5)
pca.fit(X_train_enc)
In the following output, columns 0 through 4 are the five principal components:
pd.DataFrame(pca.components_,
columns=X_train_enc.columns).T
                    0     1     2     3     4
satverbal       -0.34 -0.16 -0.61 -0.02 -0.19
satmath         -0.37 -0.13 -0.56  0.10  0.11
gpascience      -0.40  0.21  0.18  0.03  0.02
gpaenglish      -0.40  0.22  0.18  0.08 -0.19
gpamath         -0.38  0.24  0.12  0.08  0.23
gpaoverall      -0.43  0.25  0.23 -0.04 -0.03
motherhighgrade -0.19 -0.51  0.24 -0.43 -0.59
fatherhighgrade -0.20 -0.51  0.18 -0.35  0.70
parentincome    -0.16 -0.46  0.28  0.82 -0.08
gender_Female   -0.02  0.08  0.12 -0.04 -0.11
Another way to understand these scores is that they indicate how much each feature contributes to the component. (Indeed, if for each component, you square each of the 10 scores and then sum the squares, you get a total of 1.)
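That sum-of-squares property is quick to verify. The sketch below uses random synthetic data with 10 features (an assumption for illustration) in place of the NLS features; the check itself is the same for any fitted PCA.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data with 10 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))

pca = PCA(n_components=5).fit(X)

# Rows of components_ are components; columns are the original features.
# Squaring the scores and summing across features gives 1 per component.
sum_sq = (pca.components_ ** 2).sum(axis=1)
print(np.round(sum_sq, 6))   # [1. 1. 1. 1. 1.]
```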
pca.explained_variance_ratio_
array([0.46073387, 0.19036089, 0.09295703, 0.07163009, 0.05328056])
np.cumsum(pca.explained_variance_ratio_)
array([0.46073387, 0.65109476, 0.74405179, 0.81568188, 0.86896244])
X_train_pca = pca.transform(X_train_enc)
X_train_pca.shape
(634, 5)
np.round(X_train_pca[0:6],2)
array([[ 2.79, -0.34, 0.41, 1.42, -0.11],
[-1.29, 0.79, 1.79, -0.49, -0.01],
[-1.04, -0.72, -0.62, -0.91, 0.27],
[-0.22, -0.8 , -0.83, -0.75, 0.59],
[ 0.11, -0.56, 1.4 , 0.2 , -0.71],
[ 0.93, 0.42, -0.68, -0.45, -0.89]])
X_test_pca = pca.transform(X_test_enc)
We can now fit a model of bachelor's degree completion using these principal components. Let's run a random forest classification.
rfc = RandomForestClassifier(n_estimators=100,
max_depth=2, n_jobs=-1, random_state=0)
rfc.fit(X_train_pca, y_train.values.ravel())
y_pred = rfc.predict(X_test_pca)
accuracy_score(y_test, y_pred)
0.7032967032967034
A dimension reduction technique such as PCA can be a good option when the feature selection challenge is that we have highly correlated features and we want to reduce the number of dimensions without substantially reducing the explained variance. In this example, the high school GPA features moved together, as did the parental education and income levels and the SAT features. They became the key features for our first three components. (An argument can be made that our model could have had just those three components since together they accounted for 74% of the variance of the features.)
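Rather than fixing the number of components in advance, we can ask PCA for as many components as it takes to reach a target share of the variance. A minimal sketch on synthetic data (the data and the 70% target are assumptions for illustration): passing a float between 0 and 1 as n_components, together with svd_solver="full", keeps the smallest number of components whose cumulative explained variance exceeds that share.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: two blocks of four correlated columns, each driven by
# its own latent factor, plus two independent noise columns.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))
X = np.hstack([latent[:, [0]] + 0.3 * rng.normal(size=(300, 4)),
               latent[:, [1]] + 0.3 * rng.normal(size=(300, 4)),
               rng.normal(size=(300, 2))])

# Keep just enough components to explain more than 70% of the variance.
pca = PCA(n_components=0.70, svd_solver="full").fit(X)
cum_var = np.cumsum(pca.explained_variance_ratio_)
print(pca.n_components_, np.round(cum_var, 2))
```

The fitted n_components_ attribute reports how many components were needed, which is handy when the right number is not obvious from the correlations alone.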
There are several modifications to PCA that might be useful depending on your data and modeling objectives. These include strategies for handling outliers and for regularization. PCA can also be extended, by using kernels, to situations where the data are not linearly separable. We discuss PCA in detail in Chapter 15, Principal Component Analysis.
Let's summarize what we've learned in this chapter.
In this chapter, we went over a range of feature selection methods, from filter to wrapper to embedded methods. We also saw how they work with categorical and continuous targets. For wrapper and embedded methods, we considered how they work with different algorithms.
Filter methods are very easy to run and interpret and are easy on system resources. However, they do not take other features into account when evaluating each feature. Nor do they tell us how that assessment might vary by the algorithm used. Wrapper methods do not have any of these limitations but they are computationally expensive. Embedded methods are often a good compromise, selecting features based on multivariate relationships and a given algorithm without taxing system resources as much as wrapper methods. We also explored how a dimension reduction method, PCA, could improve our feature selection.
You also probably noticed that I slipped in a little bit of model validation during this chapter. We will go over model validation in much more detail in the next chapter.