Chapter 5: Feature Selection

Depending on how you began your data analytic work and your own intellectual interests, you might have a different perspective on the topic of feature selection. You might think, yeah, yeah, it is an important topic, but I really want to get to the model building. Or, at the other extreme, you might view feature selection as at the core of model building and believe that you are 90% of the way toward having your model once you have chosen your features. For now, let's just agree that we should spend a good chunk of time understanding the relationships between features – and their relationship to a target if we are building a supervised model – before we do any serious model specification.

It is helpful to approach our feature selection work with the attitude that less is more. If we can reach nearly the same degree of accuracy or explain as much of the variance with fewer features, we should select the simpler model. Sometimes, we can actually get better accuracy with fewer features. This can be hard to wrap our brains around, and even be a tad disappointing for those of us who cut our teeth on building models that told rich and complicated stories.

But we are less concerned with parameter estimates than with the accuracy of our predictions when fitting machine learning models. Unnecessary features can contribute to overfitting and tax hardware resources.

We can sometimes spend months specifying the features of our model, even when there is a limited number of columns in the data. Bivariate correlations, such as those created in Chapter 2, Examining Bivariate and Multivariate Relationships between Features and Targets, give us some sense of what to expect, but the importance of a feature can vary significantly once other potentially explanatory features are introduced. The feature may no longer be significant, or, conversely, may only be significant when other features are included. Two features might be so highly correlated that including both of them offers little additional information beyond including just one.

This chapter takes a close look at feature selection techniques applicable to a variety of predictive modeling tasks. Specifically, we will explore the following topics:

  • Selecting features for classification models
  • Selecting features for regression models
  • Using forward and backward feature selection
  • Using exhaustive feature selection
  • Eliminating features recursively in a regression model
  • Eliminating features recursively in a classification model
  • Using Boruta for feature selection
  • Using regularization and other embedded methods
  • Using principal component analysis

Technical requirements

We will work with the feature_engine, mlxtend, and boruta packages in this chapter, in addition to the scikit-learn library. You can use pip to install these packages. I have chosen a dataset with a small number of observations for our work in this chapter, so the code should work fine even on suboptimal workstations.

Note

We will work exclusively in this chapter with data from the National Longitudinal Survey of Youth (NLS), conducted by the United States Bureau of Labor Statistics. This survey started in 1997 with a cohort of individuals who were born between 1980 and 1985, with annual follow-ups through 2017. We will work with educational attainment, household demographic, weeks worked, and wage income data. The wage income column represents wages earned in 2016. The NLS dataset can be downloaded for public use at https://www.nlsinfo.org/investigator/pages/search.

Selecting features for classification models

The most straightforward feature selection methods are based on each feature's relationship with a target variable. The next two sections examine techniques for determining the k best features based on their linear or non-linear relationship with the target. These are known as filter methods. They are also sometimes called univariate methods since they evaluate the relationship between the feature and the target independent of the impact of other features.

We use somewhat different strategies when the target is categorical than when it is continuous. We'll go over the former in this section and the latter in the next.

Mutual information classification for feature selection with a categorical target

We can use mutual information classification or analysis of variance (ANOVA) tests to select features when we have a categorical target. We will try mutual information classification first, and then ANOVA for comparison.

Mutual information is a measure of how much information knowing the value of one variable provides about another variable. At the extreme, when two variables are completely independent, their mutual information score is 0.
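As a quick illustration (a minimal sketch on made-up data, separate from the NLS example that follows), mutual information is near zero for a feature that is independent of the target and clearly positive for a related one:

import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 1000)                    # binary target
unrelated = rng.normal(size=1000)               # independent of y
related = y + rng.normal(scale=0.5, size=1000)  # depends on y
mutual_info_classif(np.column_stack([unrelated, related]), y,
                    random_state=0)

The first score should be at or near 0, while the second should be clearly larger.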

We can use scikit-learn's SelectKBest class to select the k features that have the highest predictive strength based on mutual information classification or some other appropriate measure. We can use hyperparameter tuning to select the value of k. We can also examine the scores of all features, whether they were identified as one of the k best or not, as we will see in this section.

Let's first try mutual information classification to identify features that are related to completing a bachelor's degree. Later, we will compare that with using ANOVA F-values as the basis for selection:

  1. We start by importing OneHotEncoder from feature_engine to encode some of the data, and train_test_split from scikit-learn to create training and testing data. We will also need scikit-learn's SelectKBest, mutual_info_classif, and f_classif modules for our feature selection:

    import pandas as pd

    from feature_engine.encoding import OneHotEncoder

    from sklearn.model_selection import train_test_split

    from sklearn.preprocessing import StandardScaler

    from sklearn.feature_selection import SelectKBest, \
      mutual_info_classif, f_classif

  2. We load NLS data that has a binary variable for having completed a bachelor's degree and features possibly related to degree attainment: Scholastic Assessment Test (SAT) score, high school GPA, parental educational attainment and income, and gender. Observations with missing values for any feature have been removed. We then create training and testing DataFrames, encode the gender feature, and scale the other data:

    nls97compba = pd.read_csv("data/nls97compba.csv")

    feature_cols = ['gender','satverbal','satmath',

      'gpascience', 'gpaenglish','gpamath','gpaoverall',

      'motherhighgrade','fatherhighgrade','parentincome']

    X_train, X_test, y_train, y_test = \
      train_test_split(nls97compba[feature_cols],
      nls97compba[['completedba']], test_size=0.3, random_state=0)
    ohe = OneHotEncoder(drop_last=True, variables=['gender'])
    X_train_enc = ohe.fit_transform(X_train)
    scaler = StandardScaler()
    standcols = X_train_enc.iloc[:,:-1].columns
    X_train_enc = \
      pd.DataFrame(scaler.fit_transform(X_train_enc[standcols]),
      columns=standcols, index=X_train_enc.index).\
      join(X_train_enc[['gender_Female']])

    Note

    We will do a complete case analysis of the NLS data throughout this chapter; that is, we will remove all observations that have missing values for any of the features. This is not usually a good approach and is particularly problematic when data is not missing at random or when there is a large number of missing values for one or more features. In such cases, it would be better to use some of the approaches that we used in Chapter 3, Identifying and Fixing Missing Values. We will do a complete case analysis in this chapter to keep the examples as straightforward as possible.

  3. Now we are ready to select features for our model of bachelor's degree completion. One approach is to use mutual information classification. To do that, we set the score_func value of SelectKBest to mutual_info_classif and indicate that we want the five best features. Then, we call fit and use the get_support method to get the five best features:

    ksel = SelectKBest(score_func=mutual_info_classif, k=5)

    ksel.fit(X_train_enc, y_train.values.ravel())

    selcols = X_train_enc.columns[ksel.get_support()]

    selcols

    Index(['satverbal', 'satmath', 'gpascience', 'gpaenglish', 'gpaoverall'], dtype='object')

  4. If we also want to see the score for each feature, we can use the scores_ attribute, though we need to do a little work to associate the scores with a particular feature name and sort the scores in descending order:

    pd.DataFrame({'score': ksel.scores_,
      'feature': X_train_enc.columns},
       columns=['feature','score']).\
       sort_values(['score'], ascending=False)

            feature              score

    5       gpaoverall           0.108

    1       satmath              0.074

    3       gpaenglish           0.072

    0       satverbal            0.069

    2       gpascience           0.047

    4       gpamath              0.038

    8       parentincome         0.024

    7       fatherhighgrade      0.022

    6       motherhighgrade      0.022

    9       gender_Female        0.015

    Note

    This is a stochastic process, so we will get different results each time we run it.

To get the same results each time, you can pass a partial function to score_func:

from functools import partial

SelectKBest(score_func=partial(mutual_info_classif,

                               random_state=0), k=5)

  5. We can create a DataFrame with just the important features using the selcols array we created using get_support. (We could have used the transform method of SelectKBest instead, which would have returned the values of the selected features as a NumPy array; see the short sketch after this step.)

    X_train_analysis = X_train_enc[selcols] 

    X_train_analysis.dtypes

    satverbal       float64

    satmath         float64

    gpascience      float64

    gpaenglish      float64

    gpaoverall      float64

    dtype: object
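For comparison, here is what the transform approach mentioned above might look like (a minimal sketch reusing the ksel object we fitted earlier); it returns a NumPy array rather than a DataFrame:

    # same five selected features, but as an ndarray without column names
    X_train_np = ksel.transform(X_train_enc)
    X_train_np.shape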

That is all we need to do to select the k best features for our model using mutual information.

ANOVA F-value for feature selection with a categorical target

Alternatively, we can use ANOVA instead of mutual information. ANOVA evaluates how different the mean for a feature is for each target class. This is a good metric for univariate feature selection when we can assume a linear relationship between features and the target and our features are normally distributed. If those assumptions do not hold, mutual information classification is a better choice.
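To build some intuition for this, we can compare the mean of each standardized feature across the two target classes; features whose class means differ sharply will tend to get higher F-values. This is just an illustrative sketch using the training data we prepared above:

X_train_enc.join(y_train).groupby('completedba').mean().T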

Let's try using ANOVA for our feature selection. We can set the score_func parameter of SelectKBest to f_classif to select based on ANOVA:

ksel = SelectKBest(score_func=f_classif, k=5)
ksel.fit(X_train_enc, y_train.values.ravel())
selcols = X_train_enc.columns[ksel.get_support()]
selcols
Index(['satverbal', 'satmath', 'gpascience', 'gpaenglish', 'gpaoverall'], dtype='object')
pd.DataFrame({'score': ksel.scores_,
  'feature': X_train_enc.columns},
   columns=['feature','score']).\
   sort_values(['score'], ascending=False)
       feature                score
5      gpaoverall           119.471
3      gpaenglish           108.006
2      gpascience            96.824
1      satmath               84.901
0      satverbal             77.363
4      gpamath               60.930
7      fatherhighgrade       37.481
6      motherhighgrade       29.377
8      parentincome          22.266
9      gender_Female         15.098

This selected the same features as were selected when we used mutual information. Showing the scores gives us some indication of whether the selected value for k makes sense. For example, there is a greater drop in score from the fifth- to the sixth-best feature (77-61) than from the fourth to the fifth (85-77). There is an even bigger decline from the sixth to the seventh, however (61-37), suggesting that we should at least consider a value for k of 6.
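If we want to look at those gaps more systematically, a small sketch like the following computes the drop from each feature's score to the next, reusing the ANOVA scores we just calculated:

scores = pd.DataFrame({'feature': X_train_enc.columns,
  'score': ksel.scores_}).\
  sort_values(['score'], ascending=False)
# how much the score falls from each feature to the next one down
scores['drop'] = scores.score.diff(-1)
scores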

ANOVA tests, and the mutual information classification we did earlier, do not take into account features that are only important in multivariate analysis. For example, fatherhighgrade might matter among individuals with similar GPA or SAT scores. We use multivariate feature selection methods later in this chapter. We do more univariate feature selection in the next section where we explore selection techniques appropriate for continuous targets.

Selecting features for regression models

Regression models have a continuous target. The statistical techniques we used in the previous section are not appropriate for such targets. Fortunately, scikit-learn's selection module provides several options for selecting features when building regression models. (By regression models here, I do not mean linear regression models. I am only referring to models with continuous targets.) Two good options are selection based on F-tests and selection based on mutual information for regression. Let's start with F-tests.

F-tests for feature selection with a continuous target

The F-statistic is a measure of the strength of the linear correlation between a target and a single regressor. Scikit-learn has an f_regression scoring function, which returns F-statistics. We can use it with SelectKBest to select features based on that statistic.
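For a single regressor, the F-statistic that f_regression reports is just a transformation of the Pearson correlation between that regressor and the target, roughly r-squared / (1 - r-squared) multiplied by (n - 2). The following sketch on made-up data (the variable names are mine) illustrates the relationship:

import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 1))
y = 2 * x.ravel() + rng.normal(size=1000)
r = np.corrcoef(x.ravel(), y)[0, 1]
# F-statistic for a single regressor, computed from the correlation
r**2 / (1 - r**2) * (len(y) - 2)
# the same value from scikit-learn
f_regression(x, y)[0][0]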

Let's use F-statistics to select features for a model of wages. We use mutual information for regression in the next section to select features for the same target:

  1. We start by importing the one-hot encoder from feature_engine and train_test_split and SelectKBest from scikit-learn. We also import f_regression to get F-statistics later:

    import pandas as pd

    import numpy as np

    from feature_engine.encoding import OneHotEncoder

    from sklearn.model_selection import train_test_split

    from sklearn.preprocessing import StandardScaler

    from sklearn.feature_selection import SelectKBest, f_regression

  2. Next, we load the NLS data, including educational attainment, parental income, and wage income data:

    nls97wages = pd.read_csv("data/nls97wages.csv")

    feature_cols = ['satverbal','satmath','gpascience',

      'gpaenglish','gpamath','gpaoverall','gender',

      'motherhighgrade','fatherhighgrade','parentincome',

      'completedba']

  3. Then, we create training and testing DataFrames, encode the gender feature, and scale the training data. We need to scale the target in this case since it is continuous:

    X_train, X_test, y_train, y_test = \
      train_test_split(nls97wages[feature_cols],
      nls97wages[['wageincome']], test_size=0.3, random_state=0)
    ohe = OneHotEncoder(drop_last=True, variables=['gender'])
    X_train_enc = ohe.fit_transform(X_train)
    scaler = StandardScaler()
    standcols = X_train_enc.iloc[:,:-1].columns
    X_train_enc = \
      pd.DataFrame(scaler.fit_transform(X_train_enc[standcols]),
      columns=standcols, index=X_train_enc.index).\
      join(X_train_enc[['gender_Male']])
    y_train = \
      pd.DataFrame(scaler.fit_transform(y_train),
      columns=['wageincome'], index=y_train.index)

    Note

    You may have noticed that we are not encoding or scaling the testing data. We will need to do that eventually to validate our models. We will introduce validation later in this chapter and go over it in much more detail in the next chapter.

  4. Now, we are ready to select features. We set score_func of SelectKBest to f_regression and indicate that we want the five best features. The get_support method of SelectKBest returns True for each feature that was selected:

    ksel = SelectKBest(score_func=f_regression, k=5)

    ksel.fit(X_train_enc, y_train.values.ravel())

    selcols = X_train_enc.columns[ksel.get_support()]

    selcols

    Index(['satmath', 'gpascience', 'parentincome',

    'completedba','gender_Male'],

          dtype='object')

  5. We can use the scores_ attribute to see the score for each feature:

    pd.DataFrame({'score': ksel.scores_,
      'feature': X_train_enc.columns},
       columns=['feature','score']).\
       sort_values(['score'], ascending=False)

                  feature              score

    1             satmath              45

    9             completedba          38

    10            gender_Male          26

    8             parentincome         24

    2             gpascience           21

    0             satverbal            19

    5             gpaoverall           17

    4             gpamath              13

    3             gpaenglish           10

    6             motherhighgrade       9

    7             fatherhighgrade       8

The disadvantage of the F-statistic is that it assumes a linear relationship between each feature and the target. When that assumption does not make sense, we can use mutual information for regression instead.

Mutual information for feature selection with a continuous target

We can also use SelectKBest to select features using mutual information for regression:

  1. We need to set the score_func parameter of SelectKBest to mutual_info_regression, but there is a small complication. To get the same results each time we run the feature selection, we need to set a random_state value. As we discussed in the previous section, we can use a partial function for that. We pass partial(mutual_info_regression, random_state=0) to the score function.
  2. We can then run the fit method and use get_support to get the selected features. We can use the scores_ attribute to give us the score for each feature:

    from functools import partial
    from sklearn.feature_selection import mutual_info_regression

    ksel = SelectKBest(score_func=

      partial(mutual_info_regression, random_state=0),

      k=5)

    ksel.fit(X_train_enc, y_train.values.ravel())

    selcols = X_train_enc.columns[ksel.get_support()]

    selcols

    Index(['satmath', 'gpascience', 'fatherhighgrade', 'completedba','gender_Male'],dtype='object')

    pd.DataFrame({'score': ksel.scores_,
      'feature': X_train_enc.columns},
       columns=['feature','score']).\
       sort_values(['score'], ascending=False)

               feature               score

    1          satmath               0.101

    10         gender_Male           0.074

    7          fatherhighgrade       0.047

    2          gpascience            0.044

    9          completedba           0.044

    4          gpamath               0.016

    8          parentincome          0.015

    6          motherhighgrade       0.012

    0          satverbal             0.000

    3          gpaenglish            0.000

    5          gpaoverall            0.000

We get fairly similar results with mutual information for regression as we did with F-tests. parentincome was selected with F-tests and fatherhighgrade with mutual information. Otherwise, the same features are selected.

A key advantage of mutual information for regression compared with F-tests is that it does not assume a linear relationship between the feature and the target. If that assumption turns out to be unwarranted, mutual information is a better approach. (Again, there is also some randomness in the scoring and the score for each feature can bounce around within a limited range.)

Note

Our choice of k=5 to get the five best features is quite arbitrary. We can make it much more scientific with some hyperparameter tuning. We will go over tuning in the next chapter.
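As a preview, here is a minimal sketch of what that tuning might look like for the wage model. It assumes the encoded training data from above; the pipeline and the grid of k values are illustrative choices, not the approach used later in the book:

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
  ('select', SelectKBest(score_func=f_regression)),
  ('model', LinearRegression())])
# try several values of k and keep the one with the best cross-validated score
gs = GridSearchCV(pipe, {'select__k': [3, 4, 5, 6, 7]}, cv=5)
gs.fit(X_train_enc, y_train.values.ravel())
gs.best_params_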

The feature selection methods we have used so far are known as filter methods. They examine the univariate relationship between each feature and the target, and they are a good starting point. Similar to our discussion in previous chapters of the usefulness of having correlations handy before we start examining multivariate relationships, it is helpful to at least explore filter methods. Often, though, our model fitting needs to take into account features whose importance only appears (or disappears) when other features are also included. To do that, we need to use wrapper or embedded methods for feature selection. We explore wrapper methods in the next few sections, starting with forward and backward feature selection.

Using forward and backward feature selection

Forward and backward feature selection, as their names suggest, select features by adding them one by one – or subtracting them for backward selection – and assessing the impact on model performance after each iteration. Since both methods assess that performance based on a given algorithm, they are considered wrapper selection methods.

Wrapper feature selection methods have two advantages over the filter methods we have explored so far. First, they evaluate the importance of features as other features are included. Second, since features are evaluated based on their contribution to the performance of a specific algorithm, we get a better sense of which features will ultimately matter. For example, satmath seemed to be an important feature based on our results from the previous section. But it is possible that satmath is only important when we use a particular model, say linear regression, and not an alternative such as decision tree regression. Wrapper selection methods can help us discover that.

The main disadvantage of wrapper methods is that they can be quite expensive computationally since they retrain the model after each iteration. We will look at both forward and backward feature selection in this section.

Using forward feature selection

Forward feature selection starts with no features and adds the single feature that most improves model performance for the chosen algorithm. It then repeatedly adds whichever remaining feature yields the largest further improvement, retraining and evaluating the model at each step, until the requested number of features has been selected.

We can use forward feature selection to develop a model of bachelor's degree completion. Since wrapper methods require us to choose an algorithm, and this is a binary target, let's use scikit-learn's random forest classifier. We will also need the feature_selection module of mlxtend to do the iteration required to select features:

  1. We start by importing the necessary libraries:

    import pandas as pd

    from feature_engine.encoding import OneHotEncoder

    from sklearn.model_selection import train_test_split

    from sklearn.preprocessing import StandardScaler

    from sklearn.ensemble import RandomForestClassifier

    from mlxtend.feature_selection import SequentialFeatureSelector

  2. Then, we load the NLS data again. We also create a training DataFrame, encode the gender feature, and standardize the remaining features:

    nls97compba = pd.read_csv("data/nls97compba.csv")

    feature_cols = ['satverbal','satmath','gpascience',

      'gpaenglish','gpamath','gpaoverall','gender',

      'motherhighgrade','fatherhighgrade','parentincome']

    X_train, X_test, y_train, y_test = \
      train_test_split(nls97compba[feature_cols],
      nls97compba[['completedba']], test_size=0.3, random_state=0)
    ohe = OneHotEncoder(drop_last=True, variables=['gender'])
    X_train_enc = ohe.fit_transform(X_train)
    scaler = StandardScaler()
    standcols = X_train_enc.iloc[:,:-1].columns
    X_train_enc = \
      pd.DataFrame(scaler.fit_transform(X_train_enc[standcols]),
      columns=standcols, index=X_train_enc.index).\
      join(X_train_enc[['gender_Female']])

  3. We create a random forest classifier object and then pass that object to the feature selector of mlxtend. We indicate that we want it to select five features and that it should forward select. (We can also use the sequential feature selector to select backward.) After running fit, we can use the k_feature_idx_ attribute to get the list of selected features:

    rfc = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)

    sfs = SequentialFeatureSelector(rfc, k_features=5,

      forward=True, floating=False, verbose=2,

      scoring='accuracy', cv=5)

    sfs.fit(X_train_enc, y_train.values.ravel())

    selcols = X_train_enc.columns[list(sfs.k_feature_idx_)]

    selcols

    Index(['satverbal', 'satmath', 'gpaoverall',

    'parentincome', 'gender_Female'], dtype='object')

You might recall from the first section of this chapter that our univariate feature selection for the completed bachelor's degree target gave us somewhat different results:

Index(['satverbal', 'satmath', 'gpascience',

'gpaenglish', 'gpaoverall'], dtype='object')

Three of the features – satmath, satverbal, and gpaoverall – are the same. But our forward feature selection has identified parentincome and gender_Female as more important than gpascience and gpaenglish, which were selected in the univariate analysis. Indeed, gender_Female had among the lowest scores in the earlier analysis. These differences likely reflect the advantages of wrapper feature selection methods. We can identify features that are not important unless other features are included, and we are evaluating the effect on the performance of a particular algorithm, in this case, random forest classification.

One disadvantage of forward selection is that once a feature is selected, it is not removed, even though it may decline in importance as additional features are added. (Recall that forward feature selection adds features iteratively based on the contribution of that feature to the model.)

Let's see whether our results vary with backward feature selection.

Using backward feature selection

Backward feature selection starts with all features and eliminates the least important. It then repeats this process with the remaining features. We can use mlxtend's SequentialFeatureSelector for backward selection in pretty much the same way we used it for forward selection.

We instantiate a RandomForestClassifier object from the scikit-learn library and then pass it to mlxtend's sequential feature selector:

rfc = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
sfs = SequentialFeatureSelector(rfc, k_features=5,
  forward=False, floating=False, verbose=2,
  scoring='accuracy', cv=5)
sfs.fit(X_train_enc, y_train.values.ravel())
selcols = X_train_enc.columns[list(sfs.k_feature_idx_)]
selcols
Index(['satverbal', 'gpascience', 'gpaenglish',
 'gpaoverall', 'gender_Female'], dtype='object')

Perhaps unsurprisingly, we get different results for our feature selection. satmath and parentincome are no longer selected, and gpascience and gpaenglish are.

Backward feature selection has the opposite drawback to forward feature selection. Once a feature has been removed, it is not re-evaluated, even though its importance may change with different feature mixtures. Let's try exhaustive feature selection instead.

Using exhaustive feature selection

If your results from forward and backward selection are unpersuasive, and you do not mind running a model while you go out for coffee or lunch, you can try exhaustive feature selection. Exhaustive feature selection trains a given model on all possible combinations of features and selects the best subset of features. But it does this at a price. As the name suggests, this procedure might exhaust both system resources and your patience.
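To get a feel for the cost, consider the example below, which evaluates all subsets of one to five features drawn from 10 candidates with 5-fold cross-validation. A quick back-of-the-envelope count of the model fits involved:

from math import comb

sum(comb(10, k) for k in range(1, 6))      # 637 candidate feature subsets
sum(comb(10, k) for k in range(1, 6)) * 5  # 3,185 model fits with 5-fold cross-validation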

Let's use exhaustive feature selection for our model of bachelor's degree completion:

  1. We start by loading the required libraries, including the RandomForestClassifier and LogisticRegression modules from scikit-learn and ExhaustiveFeatureSelector from mlxtend. We also import the accuracy_score module so that we can evaluate a model with the selected features:

    import pandas as pd

    from feature_engine.encoding import OneHotEncoder

    from sklearn.model_selection import train_test_split

    from sklearn.preprocessing import StandardScaler

    from sklearn.ensemble import RandomForestClassifier

    from sklearn.linear_model import LogisticRegression

    from mlxtend.feature_selection import ExhaustiveFeatureSelector

    from sklearn.metrics import accuracy_score

  2. Next, we load the NLS educational attainment data and create training and testing DataFrames:

    nls97compba = pd.read_csv("data/nls97compba.csv")

    feature_cols = ['satverbal','satmath','gpascience',

      'gpaenglish','gpamath','gpaoverall','gender',

      'motherhighgrade','fatherhighgrade','parentincome']

    X_train, X_test, y_train, y_test = \
      train_test_split(nls97compba[feature_cols],
      nls97compba[['completedba']], test_size=0.3, random_state=0)

  3. Then, we encode and scale the training and testing data:

    ohe = OneHotEncoder(drop_last=True, variables=['gender'])

    ohe.fit(X_train)

    X_train_enc, X_test_enc = \
      ohe.transform(X_train), ohe.transform(X_test)
    scaler = StandardScaler()
    standcols = X_train_enc.iloc[:,:-1].columns
    scaler.fit(X_train_enc[standcols])
    X_train_enc = \
      pd.DataFrame(scaler.transform(X_train_enc[standcols]),
      columns=standcols, index=X_train_enc.index).\
      join(X_train_enc[['gender_Female']])
    X_test_enc = \
      pd.DataFrame(scaler.transform(X_test_enc[standcols]),
      columns=standcols, index=X_test_enc.index).\
      join(X_test_enc[['gender_Female']])

  4. We create a random forest classifier object and pass it to mlxtend's ExhaustiveFeatureSelector. We tell the feature selector to evaluate all combinations of one to five features and return the combination with the highest accuracy in predicting degree attainment. After running fit, we can use the best_feature_names_ attribute to get the selected features:

    rfc = RandomForestClassifier(n_estimators=100, max_depth=2,n_jobs=-1, random_state=0)

    efs = ExhaustiveFeatureSelector(rfc, max_features=5,

      min_features=1, scoring='accuracy',

      print_progress=True, cv=5)

    efs.fit(X_train_enc, y_train.values.ravel())

    efs.best_feature_names_

    ('satverbal', 'gpascience', 'gpamath', 'gender_Female')

  5. Let's evaluate the accuracy of this model. We first need to transform the training and testing data to include only the four selected features. Then, we can fit the random forest classifier again with just those features and generate the predicted values for bachelor's degree completion. We can then calculate the percentage of the time we predicted the target correctly, which is 67%:

    X_train_efs = efs.transform(X_train_enc)
    X_test_efs = efs.transform(X_test_enc)
    rfc.fit(X_train_efs, y_train.values.ravel())
    y_pred = rfc.predict(X_test_efs)

    confusion = pd.DataFrame(y_pred, columns=['pred'],
      index=y_test.index).\
      join(y_test)
    confusion.loc[confusion.pred==confusion.completedba].shape[0]\
      /confusion.shape[0]

    0.6703296703296703

  6. We get the same answer if we just use scikit-learn's accuracy score instead. (We calculate it in the previous step because it is pretty straightforward and it gives us a better sense of what is meant by accuracy in this case.)

    accuracy_score(y_test, y_pred)

    0.6703296703296703

    Note

    The accuracy score is often used to assess the performance of a classification model. We will lean on it in this chapter, but other measures might be equally or more important depending on the purposes of your model. For example, we are sometimes more concerned with sensitivity, the ratio of our correct positive predictions to the number of actual positives. We examine the evaluation of classification models in detail in Chapter 6, Preparing for Model Evaluation.

  7. Let's now try exhaustive feature selection with a logistic model:

    lr = LogisticRegression(solver='liblinear')

    efs = ExhaustiveFeatureSelector(lr, max_features=5,

      min_features=1, scoring='accuracy',

      print_progress=True, cv=5)

    efs.fit(X_train_enc, y_train.values.ravel())

    efs.best_feature_names_

    ('satmath', 'gpascience', 'gpaenglish', 'motherhighgrade', 'gender_Female')

  8. Let's look at the accuracy of the logistic model. We get a fairly similar accuracy score:

    X_train_efs = efs.transform(X_train_enc)

    X_test_efs = efs.transform(X_test_enc)

    lr.fit(X_train_efs, y_train.values.ravel())

    y_pred = lr.predict(X_test_efs)

    accuracy_score(y_test, y_pred)

    0.6923076923076923

  9. One key advantage of the logistic model is that it is much faster to train, which really makes a difference with exhaustive feature selection. If we time the training for each model (probably not a good idea to run this yourself unless you have a fairly high-end machine or do not mind walking away for a while), we see a substantial difference in average training time: roughly 5 minutes for the random forest versus about 4 seconds for the logistic regression. (Of course, the absolute numbers are machine-dependent.)

    rfc = RandomForestClassifier(n_estimators=100, max_depth=2,

      n_jobs=-1, random_state=0)

    efs = ExhaustiveFeatureSelector(rfc, max_features=5,

      min_features=1, scoring='accuracy',

      print_progress=True, cv=5)

    %timeit efs.fit(X_train_enc, y_train.values.ravel())

    5min 8s ± 3 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

    lr = LogisticRegression(solver='liblinear')

    efs = ExhaustiveFeatureSelector(lr, max_features=5,

      min_features=1, scoring='accuracy',

      print_progress=True, cv=5)

    %timeit efs.fit(X_train_enc, y_train.values.ravel())

    4.29 s ± 45.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Exhaustive feature selection can provide very clear guidance about the features to select, as I have mentioned, but that may come at too high a price for many projects. It may actually be better suited for diagnostic work than for use in a machine learning pipeline. If a linear model is appropriate, it can lower the computational costs considerably.

Wrapper methods, such as forward, backward, and exhaustive feature selection, tax system resources because the model has to be retrained at each iteration, and the more expensive the chosen algorithm is to train, the more of an issue this becomes. Recursive feature elimination (RFE) is something of a compromise between the simplicity of filter methods and the better information provided by wrapper methods. It is similar to backward feature selection, except that it simplifies the removal of a feature at each iteration by basing it on the model's feature weights or importances rather than re-evaluating the performance impact of every candidate removal. We explore recursive feature elimination in the next two sections.

Eliminating features recursively in a regression model

A popular wrapper method is RFE. This method starts with all features, removes the lowest weighted one (based on a coefficient or feature importance measure), and repeats the process until the best-fitting model has been identified. When a feature is removed, it is given a ranking reflecting the point at which it was removed.

RFE can be used for both regression models and classification models. We will start by using it in a regression model:

  1. We import the necessary libraries, three of which we have not used yet: the RFE, RandomForestRegressor, and LinearRegression modules from scikit-learn:

    import pandas as pd

    from feature_engine.encoding import OneHotEncoder

    from sklearn.model_selection import train_test_split

    from sklearn.preprocessing import StandardScaler

    from sklearn.feature_selection import RFE

    from sklearn.ensemble import RandomForestRegressor

    from sklearn.linear_model import LinearRegression

  2. Next, we load the NLS data on wages and create training and testing DataFrames:

    nls97wages = pd.read_csv("data/nls97wages.csv")

    feature_cols = ['satverbal','satmath','gpascience',

      'gpaenglish','gpamath','gpaoverall','motherhighgrade',

      'fatherhighgrade','parentincome','gender','completedba']

    X_train, X_test, y_train, y_test = \
      train_test_split(nls97wages[feature_cols],
      nls97wages[['weeklywage']], test_size=0.3, random_state=0)

  3. We need to encode the gender feature and standardize the other features and the target (weeklywage). We do not do any encoding or scaling of completedba, which is a binary feature:

    ohe = OneHotEncoder(drop_last=True, variables=['gender'])

    ohe.fit(X_train)

    X_train_enc, X_test_enc = \
      ohe.transform(X_train), ohe.transform(X_test)

    scaler = StandardScaler()

    standcols = feature_cols[:-2]

    scaler.fit(X_train_enc[standcols])

    X_train_enc = \
      pd.DataFrame(scaler.transform(X_train_enc[standcols]),
      columns=standcols, index=X_train_enc.index).\
      join(X_train_enc[['gender_Male','completedba']])
    X_test_enc = \
      pd.DataFrame(scaler.transform(X_test_enc[standcols]),
      columns=standcols, index=X_test_enc.index).\
      join(X_test_enc[['gender_Male','completedba']])

    scaler.fit(y_train)

    y_train, y_test = \
      pd.DataFrame(scaler.transform(y_train),
      columns=['weeklywage'], index=y_train.index), \
      pd.DataFrame(scaler.transform(y_test),
      columns=['weeklywage'], index=y_test.index)

Now, we are ready to do some recursive feature selection. Since RFE is a wrapper method, we need to choose an algorithm around which the selection will be wrapped. Random forests for regression make sense in this case. We are modeling a continuous target and do not want to assume a linear relationship between the features and the target.

  4. RFE is fairly easy to implement with scikit-learn. We instantiate an RFE object, telling it what estimator we want in the process. We indicate RandomForestRegressor. We then fit the model and use get_support to get the selected features. We limit max_depth to 2 to avoid overfitting:

    rfr = RandomForestRegressor(max_depth=2)

    treesel = RFE(estimator=rfr, n_features_to_select=5)

    treesel.fit(X_train_enc, y_train.values.ravel())

    selcols = X_train_enc.columns[treesel.get_support()]

    selcols

    Index(['satmath', 'gpaoverall', 'parentincome', 'gender_Male', 'completedba'], dtype='object')

Note that this gives us a somewhat different list of features than we got with the filter method (F-tests) for the wage target: gpaoverall is selected here rather than gpascience.

  5. We can use the ranking_ attribute to see when each of the eliminated features was removed:

    pd.DataFrame({'ranking': treesel.ranking_,
      'feature': X_train_enc.columns},
       columns=['feature','ranking']).\
       sort_values(['ranking'], ascending=True)

               feature                ranking

    1          satmath                1

    5          gpaoverall             1

    8          parentincome           1

    9          gender_Male            1

    10         completedba            1

    6          motherhighgrade        2

    2          gpascience             3

    0          satverbal              4

    3          gpaenglish             5

    4          gpamath                6

    7          fatherhighgrade        7

fatherhighgrade was removed after the first iteration and gpamath after the second.

  6. Let's run some test statistics. We fit a random forest regressor using only the selected features; the transform method of the RFE selector gives us just those features with treesel.transform(X_train_enc). We can use the score method to get the r-squared value, also known as the coefficient of determination. R-squared is a measure of the proportion of the total variation in the target explained by our model. We get a very low score, indicating that our model explains only a little of the variation. (Note that this is a stochastic process, so we will likely get different results each time we fit the model.)

    rfr.fit(treesel.transform(X_train_enc), y_train.values.ravel())

    rfr.score(treesel.transform(X_test_enc), y_test)

    0.13612629794428466

  7. Let's see whether we get any better results using RFE with a linear regression model. This model returns the same features as the random forest regressor:

    lr = LinearRegression()

    lrsel = RFE(estimator=lr, n_features_to_select=5)

    lrsel.fit(X_train_enc, y_train)

    selcols = X_train_enc.columns[lrsel.get_support()]

    selcols

    Index(['satmath', 'gpaoverall', 'parentincome', 'gender_Male', 'completedba'], dtype='object')

  8. Let's evaluate the linear model:

    lr.fit(lrsel.transform(X_train_enc), y_train)

    lr.score(lrsel.transform(X_test_enc), y_test)

    0.17773742846314056

The linear model is not really much better than the random forest model. This is likely a sign that, collectively, the features available to us only capture a small part of the variation in wages per week. This is an important reminder that we can identify several significant features and still have a model with limited explanatory power. (Perhaps it is also good news that our scores on standardized tests, and even our degree attainment, are important but not determinative of our wages many years later.)

Let's try RFE with a classification model.

Eliminating features recursively in a classification model

RFE can also be a good choice for classification problems. We can use RFE to select features for a model of bachelor's degree completion. You may recall that we used exhaustive feature selection to select features for that model earlier in this chapter. Let's see whether we get better accuracy or an easier-to-train model with RFE:

  1. We import the same libraries we have been working with so far in this chapter:

    import pandas as pd

    from feature_engine.encoding import OneHotEncoder

    from sklearn.model_selection import train_test_split

    from sklearn.preprocessing import StandardScaler

    from sklearn.ensemble import RandomForestClassifier

    from sklearn.feature_selection import RFE

    from sklearn.metrics import accuracy_score

  2. Next, we create training and testing data from the NLS educational attainment data:

    nls97compba = pd.read_csv("data/nls97compba.csv")

    feature_cols = ['satverbal','satmath','gpascience',

      'gpaenglish','gpamath','gpaoverall','gender',

      'motherhighgrade','fatherhighgrade','parentincome']

    X_train, X_test, y_train, y_test = \
      train_test_split(nls97compba[feature_cols],
      nls97compba[['completedba']], test_size=0.3,
      random_state=0)

  3. Then, we encode and scale the training and testing data:

    ohe = OneHotEncoder(drop_last=True, variables=['gender'])

    ohe.fit(X_train)

    X_train_enc, X_test_enc = \
      ohe.transform(X_train), ohe.transform(X_test)

    scaler = StandardScaler()

    standcols = X_train_enc.iloc[:,:-1].columns

    scaler.fit(X_train_enc[standcols])

    X_train_enc = \
      pd.DataFrame(scaler.transform(X_train_enc[standcols]),
      columns=standcols, index=X_train_enc.index).\
      join(X_train_enc[['gender_Female']])
    X_test_enc = \
      pd.DataFrame(scaler.transform(X_test_enc[standcols]),
      columns=standcols, index=X_test_enc.index).\
      join(X_test_enc[['gender_Female']])

  4. We instantiate a random forest classifier and pass it to the RFE selection method. We can then fit the model and get the selected features.

    rfc = RandomForestClassifier(n_estimators=100, max_depth=2,

      n_jobs=-1, random_state=0)

    treesel = RFE(estimator=rfc, n_features_to_select=5)

    treesel.fit(X_train_enc, y_train.values.ravel())

    selcols = X_train_enc.columns[treesel.get_support()]

    selcols

    Index(['satverbal', 'satmath', 'gpascience', 'gpaenglish', 'gpaoverall'], dtype='object')

  5. We can also show how the features are ranked by using the RFE ranking_ attribute:

    pd.DataFrame({'ranking': treesel.ranking_,
      'feature': X_train_enc.columns},
       columns=['feature','ranking']).\
       sort_values(['ranking'], ascending=True)

               feature                 ranking

    0          satverbal               1

    1          satmath                 1

    2          gpascience              1

    3          gpaenglish              1

    5          gpaoverall              1

    4          gpamath                 2

    8          parentincome            3

    7          fatherhighgrade         4

    6          motherhighgrade         5

    9          gender_Female           6

  6. Let's look at the accuracy of a model with the selected features using the same random forest classifier we used for our baseline model:

    rfc.fit(treesel.transform(X_train_enc), y_train.values.ravel())

    y_pred = rfc.predict(treesel.transform(X_test_enc))

    accuracy_score(y_test, y_pred)

    0.684981684981685

Recall that we had 67% accuracy with exhaustive feature selection. We get about the same accuracy here. The benefit of RFE, though, is that it is typically much less computationally expensive than exhaustive feature selection.

Another option among wrapper and wrapper-like feature selection methods is the Boruta library. Originally developed as an R package, its Python implementation, BorutaPy, can be used with scikit-learn estimators that expose feature importances, such as the ensemble tree methods. We use it with scikit-learn's random forest classifier in the next section.

Using Boruta for feature selection

The Boruta package takes a unique approach to feature selection, though it has some similarities with wrapper methods. For each feature, Boruta creates a shadow feature, one with the same range of values as the original feature but with shuffled values. It then evaluates whether the original feature offers more information than the shadow feature, gradually removing features providing the least information. Boruta outputs confirmed, tentative, and rejected features with each iteration.
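To make the shadow-feature idea concrete, here is a hand-rolled sketch for a single feature. This is not how the Boruta package itself is implemented; it simply shuffles one column, refits a random forest, and compares the importance of the real feature with that of its information-free copy (the function name and arguments are mine):

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def shadow_check(X, y, feature, random_state=0):
    # add a shuffled (shadow) copy of one feature and compare importances
    rng = np.random.default_rng(random_state)
    X_shadow = X.copy()
    X_shadow[feature + '_shadow'] = rng.permutation(X[feature].values)
    rfc = RandomForestClassifier(n_estimators=100, random_state=random_state)
    rfc.fit(X_shadow, y)
    imp = pd.Series(rfc.feature_importances_, index=X_shadow.columns)
    # a genuinely informative feature should clearly beat its shadow
    return imp[[feature, feature + '_shadow']]

For example, shadow_check(X_train_enc, y_train.values.ravel(), 'gpaoverall') compares gpaoverall with its shuffled copy; Boruta automates this comparison across all features over many iterations and keeps track of which features consistently beat their shadows.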

Let's use the Boruta package to select features for a classification model of bachelor's degree completion (you can install the Boruta package with pip if you have not yet installed it):

  1. We start by loading the necessary libraries:

    import pandas as pd

    from feature_engine.encoding import OneHotEncoder

    from sklearn.model_selection import train_test_split

    from sklearn.preprocessing import StandardScaler

    from sklearn.ensemble import RandomForestClassifier

    from boruta import BorutaPy

    from sklearn.metrics import accuracy_score

  2. We load the NLS educational attainment data again and create the training and test DataFrames:

    nls97compba = pd.read_csv("data/nls97compba.csv")

    feature_cols = ['satverbal','satmath','gpascience',

      'gpaenglish','gpamath','gpaoverall','gender',

      'motherhighgrade','fatherhighgrade','parentincome']

    X_train, X_test, y_train, y_test = \
      train_test_split(nls97compba[feature_cols],
      nls97compba[['completedba']], test_size=0.3, random_state=0)

  3. Next, we encode and scale the training and test data:

    ohe = OneHotEncoder(drop_last=True, variables=['gender'])

    ohe.fit(X_train)

    X_train_enc, X_test_enc = \
      ohe.transform(X_train), ohe.transform(X_test)

    scaler = StandardScaler()

    standcols = X_train_enc.iloc[:,:-1].columns

    scaler.fit(X_train_enc[standcols])

    X_train_enc = \
      pd.DataFrame(scaler.transform(X_train_enc[standcols]),
      columns=standcols, index=X_train_enc.index).\
      join(X_train_enc[['gender_Female']])
    X_test_enc = \
      pd.DataFrame(scaler.transform(X_test_enc[standcols]),
      columns=standcols, index=X_test_enc.index).\
      join(X_test_enc[['gender_Female']])

  4. We run Boruta feature selection in much the same way that we ran RFE feature selection. We use random forest as our baseline method again. We instantiate a random forest classifier and pass it to Boruta's feature selector. We then fit the model, which stops at 100 iterations, identifying 9 features that provide information:

    rfc = RandomForestClassifier(n_estimators=100,

      max_depth=2, n_jobs=-1, random_state=0)

    borsel = BorutaPy(rfc, random_state=0, verbose=2)

    borsel.fit(X_train_enc.values, y_train.values.ravel())

    BorutaPy finished running.

    Iteration:            100 / 100

    Confirmed:            9

    Tentative:            1

    Rejected:             0

    selcols = X_train_enc.columns[borsel.support_]

    selcols

    Index(['satverbal', 'satmath', 'gpascience', 'gpaenglish', 'gpamath', 'gpaoverall', 'motherhighgrade', 'fatherhighgrade', 'parentincome', 'gender_Female'], dtype='object')

  5. We can use the ranking_ property to view the rankings of the features:

    pd.DataFrame({'ranking': borsel.ranking_,
      'feature': X_train_enc.columns},
       columns=['feature','ranking']).\
       sort_values(['ranking'], ascending=True)

               feature               ranking

    0          satverbal             1

    1          satmath               1

    2          gpascience            1

    3          gpaenglish            1

    4          gpamath               1

    5          gpaoverall            1

    6          motherhighgrade       1

    7          fatherhighgrade       1

    8          parentincome          1

    9          gender_Female         2

  6. To evaluate the model's accuracy, we fit the random forest classifier model with just the selected features. We can then make predictions for the testing data and compute accuracy:

    rfc.fit(borsel.transform(X_train_enc.values), y_train.values.ravel())

    y_pred = rfc.predict(borsel.transform(X_test_enc.values))

    accuracy_score(y_test, y_pred)

    0.684981684981685

Part of Boruta's appeal is the persuasiveness of its selection of each feature. If a feature has been selected, then it likely does provide information that is not captured by combinations of features that exclude it. However, it is quite computationally expensive, not unlike exhaustive feature selection. It can help us sort out which features matter, but it may not always be suitable for pipelines where training speed matters.

The last few sections have shown some of the advantages and some disadvantages of wrapper feature selection methods. We explore embedded selection methods in the next section. These methods provide more information than filter methods but without the computational costs of wrapper methods. They do this by embedding feature selection into the training process. We will explore embedded methods with the same data we have worked with so far.

Using regularization and other embedded methods

Regularization methods are embedded methods. Like wrapper methods, embedded methods evaluate features relative to a given algorithm. But they are not as expensive computationally. That is because feature selection is built into the algorithm already and so happens as the model is being trained.

Embedded models use the following process:

  1. Train a model.
  2. Estimate each feature's importance to the model's predictions.
  3. Remove features with low importance.

Regularization accomplishes this by adding a penalty to the model's loss function to constrain the size of the parameters. L1 regularization, also referred to as lasso regularization, shrinks some of the coefficients in a regression model to 0, effectively eliminating those features.
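The following sketch, on synthetic data (the dataset and parameter values are made up purely for illustration), shows an L1 penalty driving the coefficients of uninformative features to exactly zero:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# 8 features, only 3 of which carry information about the target
X, y = make_classification(n_samples=1000, n_features=8,
  n_informative=3, n_redundant=0, random_state=0)
lr_l1 = LogisticRegression(C=0.05, penalty="l1", solver='liblinear')
lr_l1.fit(X, y)
lr_l1.coef_   # with a strong penalty, several coefficients are driven to 0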

Using L1 regularization

We will use L1 regularization with logistic regression to select features for a bachelor's degree attainment model:

  1. We need to first import the required libraries, including a module we will be using for the first time, SelectFromModel from scikit-learn:

    import pandas as pd

    from feature_engine.encoding import OneHotEncoder

    from sklearn.model_selection import train_test_split

    from sklearn.preprocessing import StandardScaler

    from sklearn.linear_model import LogisticRegression

    from sklearn.ensemble import RandomForestClassifier

    from sklearn.feature_selection import SelectFromModel

    from sklearn.metrics import accuracy_score

  2. Next, we load NLS data on educational attainment:

    nls97compba = pd.read_csv("data/nls97compba.csv")

    feature_cols = ['satverbal','satmath','gpascience',

      'gpaenglish','gpamath','gpaoverall','gender',

      'motherhighgrade','fatherhighgrade','parentincome']

    X_train, X_test, y_train, y_test = \
      train_test_split(nls97compba[feature_cols],
      nls97compba[['completedba']], test_size=0.3,
      random_state=0)

  3. Then, we encode and scale the training and testing data:

    ohe = OneHotEncoder(drop_last=True,

                        variables=['gender'])

    ohe.fit(X_train)

    X_train_enc, X_test_enc = \
      ohe.transform(X_train), ohe.transform(X_test)

    scaler = StandardScaler()

    standcols = X_train_enc.iloc[:,:-1].columns

    scaler.fit(X_train_enc[standcols])

    X_train_enc = \
      pd.DataFrame(scaler.transform(X_train_enc[standcols]),
      columns=standcols, index=X_train_enc.index).\
      join(X_train_enc[['gender_Female']])
    X_test_enc = \
      pd.DataFrame(scaler.transform(X_test_enc[standcols]),
      columns=standcols, index=X_test_enc.index).\
      join(X_test_enc[['gender_Female']])

  4. Now we are ready to do feature selection based on logistic regression with an L1 penalty:

    lr = LogisticRegression(C=1, penalty="l1",

                            solver='liblinear')

    regsel = SelectFromModel(lr, max_features=5)

    regsel.fit(X_train_enc, y_train.values.ravel())

    selcols = X_train_enc.columns[regsel.get_support()]

    selcols

    Index(['satmath', 'gpascience', 'gpaoverall',

    'fatherhighgrade', 'gender_Female'], dtype='object')

  5. Let's evaluate the accuracy of the model. We get an accuracy score of 0.68:

    lr.fit(regsel.transform(X_train_enc),

           y_train.values.ravel())

    y_pred = lr.predict(regsel.transform(X_test_enc))

    accuracy_score(y_test, y_pred)

    0.684981684981685

This gives us fairly similar results to those of the forward feature selection for bachelor's degree completion. We used a random forest classifier as a wrapper method in that example.

Lasso regularization is a good choice for feature selection in a case like this, particularly when performance is a key concern. It does, however, assume a linear relationship between the features and the target, which might not be appropriate. Fortunately, there are embedded feature selection methods that do not make that assumption. A good alternative to logistic regression for the embedded model is a random forest classifier. We try that next with the same data.

Using a random forest classifier

In this section, let's use a random forest classifier:

  1. We can use SelectFromModel to use a random forest classifier rather than logistic regression:

    rfc = RandomForestClassifier(n_estimators=100,

      max_depth=2, n_jobs=-1, random_state=0)

    rfcsel = SelectFromModel(rfc, max_features=5)

    rfcsel.fit(X_train_enc, y_train.values.ravel())

    selcols = X_train_enc.columns[rfcsel.get_support()]

    selcols

    Index(['satverbal', 'gpascience', 'gpaenglish',

      'gpaoverall'], dtype='object')

This actually selects quite different features from the lasso regression. satmath, fatherhighgrade, and gender_Female are no longer selected, while satverbal and gpaenglish are. This is likely partly due to the relaxation of the assumption of linearity. Note also that only four features were selected even though we set max_features to 5; SelectFromModel keeps only the features whose importance exceeds a threshold (by default, the mean importance for a random forest), and max_features merely caps that number.

  2. Let's evaluate the accuracy of the random forest classifier model. We get an accuracy score of 0.67. This is pretty much the same score that we got with the lasso regression:

    rfc.fit(rfcsel.transform(X_train_enc),

            y_train.values.ravel())

    y_pred = rfc.predict(rfcsel.transform(X_test_enc))

    accuracy_score(y_test, y_pred)

    0.673992673992674

Embedded methods are generally less CPU-/GPU-intensive than wrapper methods but can nonetheless produce good results. With our models of bachelor's degree completion in this section, we get the same accuracy as we did with our models based on exhaustive feature selection.

Each of the methods we have discussed so far has important use cases, as we have discussed. However, we have not yet really discussed one very challenging feature selection problem. What do you do if you simply have too many features, many of which independently account for something important in your model? By too many, here I mean that there are so many features that the model cannot run efficiently, either for training or for predicting target values. How can we reduce the feature set without sacrificing some of the predictive power of our model? In that situation, principal component analysis (PCA) might be a good approach. We'll discuss PCA in the next section.

Using principal component analysis

A very different approach to feature selection than any of the methods we have discussed so far is PCA. PCA allows us to replace the existing feature set with a limited number of components, each of which explains an important amount of the variance. It does this by finding a component that captures the largest amount of variance, followed by a second component that captures the largest amount of remaining variance, and then a third component, and so on. One key advantage of this approach is that these components, known as principal components, are uncorrelated. We discuss PCA in detail in Chapter 15, Principal Component Analysis.
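Once we have fitted the PCA object (as we do below), the claim that the components are uncorrelated is easy to verify. A quick sketch, reusing the pca and X_train_enc objects created in the following steps:

import numpy as np

# correlations between the component scores: approximately the identity matrix
np.corrcoef(pca.transform(X_train_enc), rowvar=False).round(2)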

Although I include PCA here as a feature selection approach, it is probably better to think of it as a tool for dimension reduction. We use it for feature selection when we need to limit the number of dimensions without sacrificing too much explanatory power.

Let's work with the NLS data again and use PCA to select features for a model of bachelor's degree completion:

  1. We start by loading the necessary libraries. The only module we have not already used in this chapter is scikit-learn's PCA. We also import NumPy, which we will need later to cumulate the explained variance ratios:

    import pandas as pd
    import numpy as np
    from feature_engine.encoding import OneHotEncoder
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

  2. Next, we create training and testing DataFrames once again:

    nls97compba = pd.read_csv("data/nls97compba.csv")
    feature_cols = ['satverbal','satmath','gpascience',
      'gpaenglish','gpamath','gpaoverall','gender',
      'motherhighgrade','fatherhighgrade','parentincome']
    X_train, X_test, y_train, y_test = \
      train_test_split(nls97compba[feature_cols],
      nls97compba[['completedba']], test_size=0.3,
      random_state=0)

  3. We need to encode and scale the data. Scaling is particularly important for PCA:

    ohe = OneHotEncoder(drop_last=True,
                        variables=['gender'])
    ohe.fit(X_train)
    X_train_enc, X_test_enc = \
      ohe.transform(X_train), ohe.transform(X_test)

    scaler = StandardScaler()
    standcols = X_train_enc.iloc[:,:-1].columns
    scaler.fit(X_train_enc[standcols])
    X_train_enc = \
      pd.DataFrame(scaler.transform(X_train_enc[standcols]),
      columns=standcols, index=X_train_enc.index).\
      join(X_train_enc[['gender_Female']])
    X_test_enc = \
      pd.DataFrame(scaler.transform(X_test_enc[standcols]),
      columns=standcols, index=X_test_enc.index).\
      join(X_test_enc[['gender_Female']])

  4. Now, we instantiate a PCA object and fit a model:

    pca = PCA(n_components=5)
    pca.fit(X_train_enc)

  5. The components_ attribute of the PCA object returns the scores of all 10 features on each of the 5 components. The features that drive the first component most are those with scores that have the highest absolute value. In this case, that is gpaoverall, gpaenglish, and gpascience. For the second component, the most important features are motherhighgrade, fatherhighgrade, and parentincome. satverbal and satmath drive the third component.

In the following output, columns 0 through 4 are the five principal components:

pd.DataFrame(pca.components_,
  columns=X_train_enc.columns).T

                   0       1      2       3       4
satverbal         -0.34   -0.16  -0.61   -0.02   -0.19
satmath           -0.37   -0.13  -0.56    0.10    0.11
gpascience        -0.40    0.21   0.18    0.03    0.02
gpaenglish        -0.40    0.22   0.18    0.08   -0.19
gpamath           -0.38    0.24   0.12    0.08    0.23
gpaoverall        -0.43    0.25   0.23   -0.04   -0.03
motherhighgrade   -0.19   -0.51   0.24   -0.43   -0.59
fatherhighgrade   -0.20   -0.51   0.18   -0.35    0.70
parentincome      -0.16   -0.46   0.28    0.82   -0.08
gender_Female     -0.02    0.08   0.12   -0.04   -0.11

Another way to understand these scores is that they indicate how much each feature contributes to the component. (Indeed, if for each component, you square each of the 10 scores and then sum the squares, you get a total of 1.)
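You can verify this by squaring the scores in components_ and summing them by component. This is a quick check rather than part of the original example; it assumes the pca object fit above:

    import numpy as np

    # each row of components_ is one component; its squared scores sum to 1
    np.round(np.sum(pca.components_**2, axis=1), 2)
    # returns an array of 1.0s, one per component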

  6. Let's also examine how much of the variance in the features is explained by each component. The first component alone accounts for 46% of the variance, followed by an additional 19% for the second component. We can use NumPy's cumsum method to see how much of the feature variance the five components explain cumulatively; together, they explain 87% of the variance in the 10 features. A short sketch after this step shows an alternative way to choose the number of components:

    pca.explained_variance_ratio_

    array([0.46073387, 0.19036089, 0.09295703, 0.07163009, 0.05328056])

    np.cumsum(pca.explained_variance_ratio_)

    array([0.46073387, 0.65109476, 0.74405179, 0.81568188, 0.86896244])
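Rather than fixing the number of components in advance, scikit-learn also lets you pass a fraction to n_components, in which case it keeps however many components are needed to reach that share of explained variance. The following is a brief sketch, not part of the original example:

    # keep as many components as needed to explain at least 90% of the variance
    pca90 = PCA(n_components=0.9)
    pca90.fit(X_train_enc)
    pca90.n_components_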

  7. Let's transform the training data based on these five principal components. This returns a NumPy array containing only the five components; we look at its first few rows. We also need to transform the testing DataFrame. A quick check after this step confirms that the components are uncorrelated, as noted earlier:

    X_train_pca = pca.transform(X_train_enc)
    X_train_pca.shape

    (634, 5)

    np.round(X_train_pca[0:6],2)

    array([[ 2.79, -0.34,  0.41,  1.42, -0.11],
           [-1.29,  0.79,  1.79, -0.49, -0.01],
           [-1.04, -0.72, -0.62, -0.91,  0.27],
           [-0.22, -0.8 , -0.83, -0.75,  0.59],
           [ 0.11, -0.56,  1.4 ,  0.2 , -0.71],
           [ 0.93,  0.42, -0.68, -0.45, -0.89]])

    X_test_pca = pca.transform(X_test_enc)
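As a quick check, and a sketch rather than part of the original example, we can confirm that the principal components are uncorrelated by computing the correlation matrix of the transformed training data. The off-diagonal values should be essentially zero:

    import numpy as np

    # correlation matrix of the five components; rows must be variables, so transpose
    np.round(np.corrcoef(X_train_pca.T), 2)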

We can now fit a model of bachelor's degree completion using these principal components. Let's run a random forest classification.

  8. We first create a random forest classifier object. We then pass the training data with the principal components and the target values to its fit method. We pass the testing data with the components to the classifier's predict method and then get an accuracy score:

    rfc = RandomForestClassifier(n_estimators=100,
      max_depth=2, n_jobs=-1, random_state=0)
    rfc.fit(X_train_pca, y_train.values.ravel())
    y_pred = rfc.predict(X_test_pca)
    accuracy_score(y_test, y_pred)

    0.7032967032967034

A dimension reduction technique such as PCA can be a good option when the feature selection challenge is that we have highly correlated features and we want to reduce the number of dimensions without substantially reducing the explained variance. In this example, the high school GPA features moved together, as did the parental education and income levels and the SAT features. They became the key features for our first three components. (An argument can be made that our model could have had just those three components since together they accounted for 74% of the variance of the features.)
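If you wanted to test that argument, you could refit the classifier on just the first three components. Because the columns of the transformed arrays are ordered by explained variance, slicing the first three columns keeps the first three principal components. This is a sketch rather than part of the original example, and it reuses the arrays and target from the previous steps:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    # refit on only the first three principal components
    rfc3 = RandomForestClassifier(n_estimators=100,
      max_depth=2, n_jobs=-1, random_state=0)
    rfc3.fit(X_train_pca[:, :3], y_train.values.ravel())
    y_pred3 = rfc3.predict(X_test_pca[:, :3])
    accuracy_score(y_test, y_pred3)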

There are several modifications to PCA that might be useful depending on your data and modeling objectives. These include strategies for handling outliers and for regularization. PCA can also be extended, by using kernels, to data whose structure cannot be captured well by linear components. We discuss PCA in detail in Chapter 15, Principal Component Analysis.
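As a quick illustration, and a sketch rather than one of the chapter's recipes, scikit-learn's KernelPCA can extract components with a nonlinear kernel. It assumes the encoded and scaled training and testing data from the previous steps:

    from sklearn.decomposition import KernelPCA

    # five components extracted with an RBF (Gaussian) kernel
    kpca = KernelPCA(n_components=5, kernel='rbf')
    X_train_kpca = kpca.fit_transform(X_train_enc)
    X_test_kpca = kpca.transform(X_test_enc)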

Let's summarize what we've learned in this chapter.

Summary

In this chapter, we went over a range of feature selection methods, from filter to wrapper to embedded methods. We also saw how they work with categorical and continuous targets. For wrapper and embedded methods, we considered how they work with different algorithms.

Filter methods are very easy to run and interpret and are easy on system resources. However, they do not take other features into account when evaluating each feature. Nor do they tell us how that assessment might vary by the algorithm used. Wrapper methods do not have any of these limitations but they are computationally expensive. Embedded methods are often a good compromise, selecting features based on multivariate relationships and a given algorithm without taxing system resources as much as wrapper methods. We also explored how a dimension reduction method, PCA, could improve our feature selection.

You also probably noticed that I slipped in a little bit of model validation during this chapter. We will go over model validation in much more detail in the next chapter.
