10 Feature Selection and Engineering for Interpretability

Join our book community on Discord

https://packt.link/EarlyAccessCommunity

In the first three chapters, we discussed how complexity hinders machine learning (ML) interpretability. There's a trade-off because you want some complexity to maximize predictive performance, yet not to the extent that you cannot rely on the model to satisfy the tenets of interpretability: fairness, accountability, and transparency. This chapter is the first of four focused on how to tune for interpretability. One of the easiest ways to improve interpretability is through feature selection. It has many benefits, such as faster training and making the model easier to interpret. But if these two reasons don't convince you, perhaps another one will.

A common misunderstanding is that complex models can self-select features and perform well nonetheless, so why even bother to select features? Yes, many model classes have mechanisms that can take care of useless features, but they aren't perfect. And the potential for overfitting increases with each one that remains. Overfitted models aren't reliable, even if they are more accurate. So, while employing model mechanisms such as regularization is still highly recommended to avoid overfitting, feature selection is the first step.

In this chapter, you will understand how irrelevant features adversely weigh on the outcome of a model and, thus, the importance of feature selection for model interpretability. We will begin by reviewing filter-based feature selection methods, such as Spearman's correlation, and then learn about embedded methods, such as LASSO and Ridge regression. Then, you will discover wrapper methods, such as sequential feature selection, and hybrid ones, such as recursive feature elimination (RFE), as well as more advanced ones, such as genetic algorithms (GAs). Lastly, even though feature engineering is typically conducted before selection, there's value in revisiting it after the dust has settled and the features have been selected.

These are the main topics we are going to cover in this chapter:

  • Understanding the effect of irrelevant features
  • Reviewing filter-based feature selection methods
  • Exploring embedded feature selection methods
  • Discovering wrapper, hybrid, and advanced feature selection methods
  • Considering feature engineering

Technical requirements

This chapter's example uses the mldatasets, pandas, numpy, scipy, mlxtend, sklearn_genetic, xgboost, sklearn, matplotlib, and seaborn libraries. Instructions on how to install all of these libraries are in the Preface.

The GitHub code for this chapter is located here: https://github.com/PacktPublishing/Interpretable-Machine-Learning-with-Python/tree/master/Chapter10/.

The mission

It has been estimated that there are over 10 million non-profits worldwide, and while a large portion of them have public funding, most depend primarily on private donors, both corporate and individual, to continue operations. As such, fundraising is mission-critical and carried out throughout the year.

Year over year, donation revenue has grown but there are several problems non-profits face: donor interests evolve, so a charity popular one year might be forgotten the next; competition is fierce between non-profits; and demographics are shifting. In the United States, the average donor only gives two charitable gifts per year and is over 64 years old. Identifying potential donors is challenging and campaigns to reach them can be expensive.

A National Veterans Organization non-profit arm has a large mailing list of about 190,000 past donors and would like to send a special mailer to ask for donations. However, even with a special bulk discount rate, it costs them $0.68 per address. This adds up to over $130,000. They only have a marketing budget of $35,000. Given that they have made this a high priority, they are willing to extend the budget but only if the return on investment (ROI) is high enough to justify the additional cost.

To minimize the use of their limited budget, instead of mass mailing, they'd like to try direct mailing, which aims to identify potential donors using what is already known, such as past donations, geographic location, and demographic data. They will reach other donors via email instead, which is much cheaper, costing no more than /month for their entire list. They hope this hybrid marketing plan will yield better results. They also recognize that high-value donors respond better to personalized paper mailers, while smaller donors respond better to email anyway.

No more than six percent of the mailing list donates in any given campaign. Using ML to predict human behavior is by no means an easy task, especially when the outcome is so imbalanced. Nevertheless, success is not measured by the highest predictive accuracy but by profit lift. In other words, the direct mailing model, evaluated on the test dataset, should produce more profit than if they mass-mailed the entire dataset.

They have sought your assistance to use ML to produce a model that identifies the most probable donors, but also in a way that guarantees an ROI; in other words, the model must be reliable in producing one.

You received the dataset from the non-profit, which is more or less evenly split between train and test. If you send the mailer to absolutely everybody in the test dataset, you make a profit of $11,173, but if you manage somehow to identify only those that will donate, the maximum yield of $73,136 will be attained. Your goal is to achieve a high profit lift and a reasonable ROI. When the campaign runs, it will target the most probable donors for the entire mailing list, and they hope to spend not much more than $35,000 in total. However, the dataset has 435 columns, and some simple statistical tests and modeling exercises show that the data is too noisy to identify potential donors reliably because of overfitting.

The approach

You've decided to first fit a base model with all the features and assess it at different levels of complexity to understand how having more features increases the propensity to overfit. Then, you will employ a series of feature selection methods, ranging from simple filter-based methods to the most advanced ones, to determine which achieves the profitability and reliability goals sought by the client. Lastly, once a final list of features has been selected, feature engineering can be considered to enhance model interpretability.

Given the cost-sensitive nature of the problem, thresholds are important for optimizing the profit lift. We will get into the role of thresholds later on, but one significant effect is that even though this is a classification problem, it is best to use regression models and then classify based on the predictions, so that there's only one threshold to tune. That is, for classification models, you would need a threshold for the label, say those that donated over $1, and then another for the predicted probabilities. Regression, on the other hand, predicts the donation amount directly, and the threshold can be optimized based on that.
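To make the single-threshold idea concrete, here is a minimal sketch with made-up numbers (these are not the book's data or code): a regression model predicts donation amounts, a single cutoff decides who gets mailed, and profit follows directly from that decision:

import numpy as np

donations_pred = np.array([0.00, 0.45, 0.80, 5.10, 0.10])  # hypothetical predictions
donations_true = np.array([0.00, 0.00, 1.00, 4.00, 0.00])  # hypothetical actual donations
threshold, var_cost = 0.68, 0.68  # mail only if the predicted donation covers the $0.68 cost

mail = donations_pred >= threshold                        # one threshold, one decision
profit = donations_true[mail].sum() - mail.sum()*var_cost
print(profit)   # revenue from mailed donors minus mailing costs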

The preparations

You will find the code for this example here: https://github.com/PacktPublishing/Interpretable-Machine-Learning-with-Python/tree/master/Chapter10/Mailer.ipynb.

Loading the libraries

To run this example, you need to install the following libraries:

  • mldatasets to load the dataset
  • pandas, numpy, and scipy to manipulate it
  • mlxtend, sklearn_genetic, xgboost, and sklearn (scikit-learn) to fit the models
  • matplotlib and seaborn to create and visualize the interpretations

To load the libraries, use the following code block:

import math
import os
import mldatasets
import pandas as pd
import numpy as np
import timeit
from tqdm.notebook import tqdm
from sklearn.feature_selection import (VarianceThreshold,
                                       mutual_info_classif, SelectKBest)
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import (LogisticRegression,
                                  LassoCV, LassoLarsCV, LassoLarsIC)
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.feature_selection import RFECV
from sklearn.decomposition import PCA
import shap
from sklearn_genetic import GAFeatureSelectionCV
from scipy.stats import rankdata
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
import matplotlib.pyplot as plt
import seaborn as sns

Next, we will load and prepare the dataset.

Understanding and preparing the data

We load the data into two dataframes with the features (X_train, X_test) and two NumPy arrays with the corresponding labels (y_train, y_test). Please note that these dataframes have already been prepared for us to remove sparse or unnecessary features, treat missing values, and encode categorical features:

X_train, X_test, y_train, y_test = mldatasets.load("nonprofit-mailer",
                                                   prepare=True)
y_train = y_train.squeeze()
y_test = y_test.squeeze()

All features are numeric with no missing values, and the categorical features have already been one-hot encoded for us. Between both the train and test mailing lists, there should be over 191,500 records and 435 features. You can check this is the case like this:

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

The preceding code should output the following:

(95485, 435)
(95485,)
(96017, 435)
(96017,)

Next we can verify that the test labels have the right amount of donators (test_donators), donations (test_donations), and profit ranges (test_min_profit, test_max_profit). We can print these, and then do the same for the training dataset:

var_cost = 0.68
y_test_donators = y_test[y_test > 0]
test_donators = len(y_test_donators)
test_donations = sum(y_test_donators)
test_min_profit = test_donations - (len(y_test)*var_cost)
test_max_profit = test_donations - (test_donators*var_cost)
print('%s test donators totaling $%.0f (min profit: $%.0f, '
      'max profit: $%.0f)' %
         (test_donators, test_donations, test_min_profit,
          test_max_profit))
y_train_donators = y_train[y_train > 0]
train_donators = len(y_train_donators)
train_donations = sum(y_train_donators)
train_min_profit = train_donations - (len(y_train)*var_cost)
train_max_profit = train_donations - (train_donators*var_cost)
print('%s train donators totaling $%.0f (min profit: $%.0f, '
      'max profit: $%.0f)' %
        (train_donators, train_donations, train_min_profit,
         train_max_profit))

The preceding code should output the following:

4894 test donators totaling $76464 (min profit: $11173, max profit: $73136)
4812 train donators totaling $75113 (min profit: $10183, max profit: $71841)

Indeed, if the non-profit mass-mailed everyone on the test mailing list, they'd make about $11,000 in profit but would have to go grossly over budget to achieve this. The non-profit recognizes that making the max profit by identifying and targeting only donors is a nearly impossible feat. Therefore, they would be content with a model that can reliably yield more than the min profit but at a smaller cost, preferably under budget.

Understanding the effect of irrelevant features

Feature selection is also known as variable or attribute selection. It is the method by which you can automatically or manually select a subset of specific features useful to the construction of ML models.

It's not necessarily true that more features lead to better models. Irrelevant features can impact the learning process, leading to overfitting. Therefore, we need some strategies to remove any features that might adversely affect learning. Some of the advantages of selecting a smaller subset of features include the following:

  • It's easier to understand simpler models: For instance, feature importance for a model that uses 15 variables is much easier to grasp than one that uses 150 variables.
  • Shorter training time: Reducing the number of variables decreases the cost of computing, speeds up model training, and perhaps most notably, simpler models have quicker inference times.
  • Improved generalization by reducing overfitting: Many of the variables are just noise with little prediction value. The ML model, however, learns from this noise, which promotes overfitting and hurts generalization. By removing these irrelevant noisy features, we can significantly enhance the generalization of ML models.
  • Variable redundancy: It is common for datasets to have collinear features, which could mean they are redundant. In cases like these, as long as no significant information is lost, we can retain only one variable and delete the others.

Now, we will fit some models to demonstrate the effect of too many features.

Creating a base model

Let's create a base model for our mailing list dataset to see how this plays out. But first, let's set our random numbers for reproducibility:

rand = 9
os.environ['PYTHONHASHSEED']=str(rand)
np.random.seed(rand)

We will use XGBoost's Random Forest (RF) regressor (XGBRFRegressor) throughout this chapter. It's just like scikit-learn's but faster because it uses second-order approximations of the objective function. It also has more options, such as setting the learning rate and monotonic constraints, examined in Chapter 12, Monotonic Constraints and Model Tuning for Interpretability. We initialize XGBRFRegressor with a max_depth value of 4 and always use 200 estimators for consistency. Then, we fit it with our training data. We will use timeit to measure how long it takes, which we save in a variable (baseline_time) for later reference:

stime = timeit.default_timer()
reg_mdl = xgb.XGBRFRegressor(max_depth=4, n_estimators=200, seed=rand)
fitted_mdl = reg_mdl.fit(X_train, y_train)
etime = timeit.default_timer()
baseline_time = etime-stime

Now that we have a base model, let's evaluate it.

Evaluating the model

Next, let's create a dictionary (reg_mdls) to house all the models we will fit in this chapter to test which feature subsets produce the best models. Here, we can evaluate the RF model with all the features and a max_depth value of 4 (rf_4_all) using evaluate_reg_mdl. It will make a summary and a scatter plot with a regression line:

reg_mdls = {}
reg_mdls['rf_4_all'] = mldatasets.evaluate_reg_mdl(fitted_mdl,
                              X_train, X_test, y_train, y_test,
                              plot_regplot=True, ret_eval_dict=True)

The preceding code produces the metrics and plot shown in Figure 10.1:

Figure 10.1 – Base model predictive performance

For a plot like the one in Figure 10.1, a diagonal line is usually expected, so one glance at this plot would tell you that the model is useless. Also, the RMSEs may not seem bad, but in the context of such a lopsided problem, they are dismal. Consider this: only 5% of the list makes a donation, and only 20% of those are over $20, so an average error of $4.30-$4.60 is enormous.

So, is this model useless? The answer lies in what thresholds we use to classify with it. Let's start by defining an array of thresholds (threshs), ranging from $0.40 to $25. They are spaced by one cent until $1, then by 10 cents until $3, and by $1 after that:

threshs = np.hstack([np.linspace(0.40,1,61), np.linspace(1.1,3,20),
                     np.linspace(4,25,22)])
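Before using the library function introduced next, it may help to see what such a profit-by-threshold calculation involves. The following is only a rough sketch of the kind of arithmetic performed; the function name and column names here are illustrative, and the actual mldatasets implementation may differ:

def profits_by_thresh_sketch(y_true, y_pred, threshs, var_costs, min_profit):
    rows = []
    for t in threshs:
        mail = y_pred >= t                   # who we would mail at this threshold
        revenue = y_true[mail].sum()         # donations received from those mailed
        costs = mail.sum() * var_costs       # $0.68 per mailer in our case
        profit = revenue - costs
        if profit >= min_profit:             # keep only thresholds that beat mass mailing
            rows.append((t, revenue, costs, profit, profit/costs))
    return pd.DataFrame(rows, columns=['threshold', 'revenue', 'costs',
                                       'profit', 'roi']).set_index('threshold')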

There's a function in mldatasets that can compute profit at every threshold (profits_by_thresh). All it needs is the actual labels (y_test) and the predicted ones, followed by the thresholds (threshs), the variable cost (var_costs), and the min_profit required. It produces a pandas dataframe with the revenue, costs, profit, and ROI for every threshold, as long as profit is above min_profit. Remember, we computed this minimum earlier ($11,173 for the test dataset) because it makes no sense to use a model that yields less than mass mailing would. After we generate these profit dataframes for the test and train datasets, we can place the maximum and minimum amounts in the model's dictionary for later use. Then, we employ compare_df_plots to plot the costs, profits, and ROI for test and train at every threshold where profit exceeded the minimum:

y_formatter = plt.FuncFormatter(lambda x, loc: "${:,}K".format(x/1000))
profits_test = mldatasets.profits_by_thresh(y_test,
                        reg_mdls['rf_4_all']['preds_test'], threshs,
                        var_costs=var_cost, min_profit=test_min_profit)
profits_train = mldatasets.profits_by_thresh(y_train,
                        reg_mdls['rf_4_all']['preds_train'], threshs,
                        var_costs=var_cost, min_profit=train_min_profit)
reg_mdls['rf_4_all']['max_profit_train'] = profits_train.profit.max()
reg_mdls['rf_4_all']['max_profit_test'] = profits_test.profit.max()
reg_mdls['rf_4_all']['max_roi'] = profits_test.roi.max()
reg_mdls['rf_4_all']['min_costs'] = profits_test.costs.min()
reg_mdls['rf_4_all']['profits_train'] = profits_train
reg_mdls['rf_4_all']['profits_test'] = profits_test
mldatasets.compare_df_plots(
         profits_test[['costs', 'profit', 'roi']],
         profits_train[['costs', 'profit', 'roi']],
         'Test', 'Train', y_formatter=y_formatter, x_label='Threshold',
         plot_args={'secondary_y':'roi'})

The preceding snippet generates the plots in Figure 10.2. You can tell that Test and Train are almost identical. Costs decrease steadily at a high rate and profit at a lower rate, while ROI increases steadily. However, there are some differences: the ROI curves eventually diverge a bit, and although viable thresholds start at the same point, Train ends at a different threshold. It turns out the model can turn a profit, so despite the appearance of the plot in Figure 10.1, the model is far from useless:

Figure 10.2 – Comparison between profit, costs, and ROI for the test and train datasets for the base model across thresholds

The difference in RMSEs between the train and test sets didn't lie. The model did not overfit. The main reason for this is that we used relatively shallow trees by setting our max_depth value to 4. We can easily see this effect of using shallow trees by computing how many features had a feature_importances_ value over 0:

reg_mdls['rf_4_all']['total_feat'] = \
        reg_mdls['rf_4_all']['fitted'].feature_importances_.shape[0]
reg_mdls['rf_4_all']['num_feat'] = \
        sum(reg_mdls['rf_4_all']['fitted'].feature_importances_ > 0)
print(reg_mdls['rf_4_all']['num_feat'])

The preceding code outputs 160. In other words, only 160 out of 435 features were used; there are only so many features that can be accommodated in such shallow trees! Naturally, this lowers overfitting, but at the same time, choosing features by impurity measures over a random selection of features is not necessarily optimal.
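If you are curious which features the shallow model actually leaned on, you can inspect its nonzero gain-based importances directly. This optional check is not part of the chapter's code:

imp = pd.Series(reg_mdls['rf_4_all']['fitted'].feature_importances_,
                index=X_train.columns)
# top ten features by gain-based importance
print(imp[imp > 0].sort_values(ascending=False).head(10))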

Training the base model at different max depths

So, what happens if we make the trees deeper? Let's repeat all the steps we did for the shallow one but for max depths between 5 and 12:

for depth in tqdm(range(5, 13)):
    mdlname = 'rf_'+str(depth)+'_all'
    stime = timeit.default_timer()
    reg_mdl = xgb.XGBRFRegressor(max_depth=depth, n_estimators=200,
                                 seed=rand)
    fitted_mdl = reg_mdl.fit(X_train, y_train)
    etime = timeit.default_timer()
    reg_mdls[mdlname] = mldatasets.evaluate_reg_mdl(fitted_mdl, X_train,
                            X_test, y_train, y_test, plot_regplot=False,
                            show_summary=False, ret_eval_dict=True)
    reg_mdls[mdlname]['speed'] = (etime - stime)/baseline_time
    reg_mdls[mdlname]['depth'] = depth
    reg_mdls[mdlname]['fs'] = 'all'
    profits_test = mldatasets.profits_by_thresh(y_test,
                        reg_mdls[mdlname]['preds_test'], threshs,
                        var_costs=var_cost, min_profit=test_min_profit)
    profits_train = mldatasets.profits_by_thresh(y_train,
                        reg_mdls[mdlname]['preds_train'], threshs,
                        var_costs=var_cost, min_profit=train_min_profit)
    reg_mdls[mdlname]['max_profit_train'] = profits_train.profit.max()
    reg_mdls[mdlname]['max_profit_test'] = profits_test.profit.max()
    reg_mdls[mdlname]['max_roi'] = profits_test.roi.max()
    reg_mdls[mdlname]['min_costs'] = profits_test.costs.min()
    reg_mdls[mdlname]['profits_train'] = profits_train
    reg_mdls[mdlname]['profits_test'] = profits_test
    reg_mdls[mdlname]['total_feat'] = \
        reg_mdls[mdlname]['fitted'].feature_importances_.shape[0]
    reg_mdls[mdlname]['num_feat'] = \
        sum(reg_mdls[mdlname]['fitted'].feature_importances_ > 0)

Now, let's plot the details in the profits dataframes for the "deepest" model (with a max depth of 12) as we did before with compare_df_plots, producing Figure 10.3:

Figure 10.3 – Comparison between profit, costs, and ROI for the test and train datasets for a "deep" base model across thresholds

See how different Test and Train are this time in Figure 10.3. Test reaches a maximum of about $15,000 while Train exceeds $20,000. Train's costs fall dramatically, making its ROI orders of magnitude higher than Test's. Also, the ranges of viable thresholds are quite different. Why is this a problem, you ask? If we had to guess what threshold to use to pick who to target in the next mailer, the optimum for Train is higher than for Test, meaning that with an overfit model, we could miss the mark and underperform on unseen data.

Next, let's convert our model dictionary (reg_mdls) into a dataframe and extract some details from it. Then, we can sort it by depth, format it, color-code it, and output it:

def display_mdl_metrics(reg_mdls, sort_by='depth', max_depth=None):
    reg_metrics_df = pd.DataFrame.from_dict(reg_mdls, 'index')[
                        ['depth', 'fs', 'rmse_train', 'rmse_test',
                         'max_profit_train', 'max_profit_test', 'max_roi',
                         'min_costs', 'speed', 'num_feat']]
    if max_depth is not None:
        # optionally limit the comparison to shallower models (used later on)
        reg_metrics_df = reg_metrics_df[reg_metrics_df.depth <= max_depth]
    pd.set_option('display.precision', 2)
    html = reg_metrics_df.sort_values(by=sort_by, ascending=False).style.\
        format({'max_profit_train':'${0:,.0f}',
                'max_profit_test':'${0:,.0f}', 'min_costs':'${0:,.0f}'}).\
        background_gradient(cmap='plasma', low=0.3, high=1,
                            subset=['rmse_train', 'rmse_test']).\
        background_gradient(cmap='viridis', low=1, high=0.3,
                            subset=['max_profit_train', 'max_profit_test'])
    return html
display_mdl_metrics(reg_mdls)

The preceding snippet leverages the display_mdl_metrics function to output the dataframe shown in Figure 10.4. Something that should be immediately visible is how the train and test RMSEs move in opposite directions: one decreases dramatically while the other increases slightly as depth increases. The same can be said for profit. ROI, training time, and the number of features used also tend to increase with depth:

Figure 10.4 – Comparing metrics for all base RF models with different depths

You could be tempted to use rf_11_all, which has the highest profitability, but it would be risky to use it! A common misunderstanding is that black-box models can effectively cut through any amount of irrelevant features. While they will often find something of value and make the most of it, too many features hinder reliability because the model overfits more easily. Fortunately, there is a sweet spot where you can reach high profitability with minimal overfitting, but to get there, you have to reduce the number of features first!

Reviewing filter-based feature selection methods

Filter-based methods independently pick out features from a dataset without employing any ML. These methods depend only on the variables' characteristics and are relatively effective, computationally inexpensive, and quick to perform. Therefore, being the low-hanging fruit of feature selection methods, they are usually the first step in any feature selection pipeline.

Two kinds of filter-based methods exist:

  • Univariate: Individually and independently of the feature space, they evaluate and rate a single feature at a time. One problem that can occur with univariate methods is that they may filter out too much since they don't take into consideration the relationship between features.
  • Multivariate: These take into account the entire feature space and how the features within it interact with each other.

Overall, filter methods are very strong for removing obsolete, redundant, constant, duplicated, and uncorrelated features. However, because they don't account for the complex, non-linear, and non-monotonic correlations and interactions that only ML models can find, they aren't effective whenever these relationships are prominent in the data.

We will review three categories of filter-based methods:

  • Basic
  • Correlation
  • Ranking

We will explain them further in their own sections.

Basic filter-based methods

We employ basic filter methods in the data preparation stage, specifically during data cleaning, before any modeling. The reason for this is that there's a low risk of making feature selection decisions that would adversely impact models. They involve common-sense operations such as removing features that carry no information or duplicate it.

Constant features with a variance threshold

Constant features don't change in the training dataset and, therefore, carry no information, and the model can't learn from them. We can use a univariate method called VarianceThreshold, which filters out features that are low-variance. We will use a threshold of zero because we want to filter out only features with zero variance—in other words, constant. It only works with numeric features, so we must first identify which features are numeric and which are categorical. Once we fit the method on the numeric columns, get_support() returns the list of features that aren't constant, and we can use set algebra to return only the constant features (num_const_cols):

num_cols_l = X_train.select_dtypes([np.number]).columns
# string dtype names avoid the deprecated np.bool/np.object aliases
cat_cols_l = X_train.select_dtypes(['bool', 'object']).columns
num_const = VarianceThreshold(threshold=0)
num_const.fit(X_train[num_cols_l])
num_const_cols = list(set(X_train[num_cols_l].columns) -
                          set(num_cols_l[num_const.get_support()]))

The preceding snippet produced a list of constant numeric features, but what about categorical features? Constant categorical features would only have one category, or unique value. You can easily check this by applying the nunique() function to the categorical features. It returns a pandas Series, and then a lambda function can filter out only those with one unique value. Then, .index.tolist() returns the names of the features as a list. Now, you just join both lists of constant features and voilà! You have all the constants (all_const_cols). You can print them; there should be three:

cat_const_cols = X_train[cat_cols_l].nunique()[lambda x:
                                                 x<2].index.tolist()
all_const_cols = num_const_cols + cat_const_cols
print(all_const_cols)

In most cases, removing constant features isn't good enough. A redundant feature might be almost constant or quasi-constant.

Quasi-constant features with Value-Counts

Quasi-constant features are almost entirely the same value. Unlike with constant features, a variance threshold won't work here because high variance and quasi-constantness aren't mutually exclusive. Instead, we will iterate over all features and get value_counts(), which returns the number of rows for each value. Then, we divide these counts by the total number of rows to get percentages and sort by the highest. If the top value's percentage is higher than the predetermined threshold (thresh), we append the feature to a list of quasi-constant columns (quasi_const_cols). Please note that this threshold must be chosen with a lot of care and an understanding of the problem. For instance, in this case, we know the data is lopsided because only 5% donate, most of whom donate a small amount, so even a tiny percentage of a feature might make an impact, which is why our threshold is so high, at 99.9%:

thresh = 0.999
quasi_const_cols = []
num_rows = X_train.shape[0]
for col in tqdm(X_train.columns):
   top_val = (X_train[col].value_counts() /
                     num_rows).sort_values(ascending=False).values[0]
   if top_val >= thresh:
      quasi_const_cols.append(col)
print(quasi_const_cols)

The preceding code should have printed five features, which include the three that were previously obtained. Next, we will deal with another form of irrelevant features: duplicates!

Duplicating features

Usually, when you discuss duplicates with data, you immediately think of duplicate rows, but duplicate columns are also problematic. You can find them just as you would find duplicate rows with the pandas duplicated() function, except that you would first transpose the dataframe, swapping columns and rows:

X_train_transposed = X_train.T
dup_cols = X_train_transposed[X_train_transposed.duplicated()].index.tolist()
print(dup_cols)

The preceding snippet outputs a list with the two duplicated columns.

Removing unnecessary features

Unlike other feature selection methods, which you should test with models, you can apply basic filter-based feature selection right away by removing the features you deemed useless. But just in case, it's good practice to make a copy of the original data first. Please note that we don't include the constant columns (all_const_cols) in the columns we are about to drop (drop_cols) because the quasi-constant ones already include them:

X_train_orig = X_train.copy()
X_test_orig = X_test.copy()
drop_cols = quasi_const_cols + dup_cols
X_train.drop(labels=drop_cols, axis=1, inplace=True)
X_test.drop(labels=drop_cols, axis=1, inplace=True)

Next, we will explore multivariate filter-based methods on the remaining features.

Correlation filter-based methods

Correlation filter-based methods quantify the strength of the relationship between two features. This is useful for feature selection because we might want to filter out extremely correlated features or those that aren't correlated with any others at all. Either way, it is a multivariate feature selection method, bivariate to be precise.

But first, we ought to choose a correlation method:

  • Pearson's correlation coefficient: Measures how linearly correlated two features are between -1 (negative) and 1 (positive) with 0 meaning no linear correlation. Like linear regression, it assumes linearity, normality, and homoscedasticity.
  • Spearman's rank correlation coefficient: Measures the strength of monotonicity between two features, regardless of whether they are linearly related or not. It is also measured between -1 and 1, with 0 meaning no monotonic correlation. It makes no distribution assumptions and can work with both continuous and discrete features. However, its weakness is with non-monotonic relationships (see the short comparison sketch after this list).
  • Kendall's tau correlation coefficient: Measures the ordinal association between features. It also ranges between -1 and 1, meaning complete disagreement and complete agreement between rankings, respectively. It's useful with discrete features.
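As promised, here is a quick, illustrative comparison (not part of the chapter's code) showing why a rank-based coefficient suits monotonic but non-linear relationships:

from scipy.stats import pearsonr, spearmanr

x = np.arange(1, 101)
y = x ** 3                          # monotonic, but far from linear
pearson_r, _ = pearsonr(x, y)       # understates the association
spearman_r, _ = spearmanr(x, y)     # captures it perfectly (1.0)
print('Pearson: %.3f, Spearman: %.3f' % (pearson_r, spearman_r))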

The dataset is a mix of continuous and discrete features, and we cannot make any linearity assumptions about it, so Spearman is the right choice. All three can be used with the pandas corr function, though:

corrs = X_train.corr(method='spearman')
print(corrs.shape)

The preceding code should output the shape of the correlation matrix, which is (428, 428). This dimension makes sense because there are 428 features left, and each feature has a relationship with 428 features, including itself.

We can now look for features to remove in the correlation matrix (corrs). Note that to do so, we must establish thresholds. For instance, we can say that an extremely correlated feature has an absolute coefficient over 0.99, and an uncorrelated one has no absolute coefficient above 0.15. With these thresholds in mind, we can find features that are correlated with only one feature and features that are extremely correlated with more than one. Why one feature? Because the diagonal of a correlation matrix is always 1, since a feature is always perfectly correlated with itself. The lambda functions in the following code make sure we account for this:

extcorr_cols = (abs(corrs) > 0.99).sum(axis=1)[lambda x: x>1].\
                                                          index.tolist()
print(extcorr_cols)
uncorr_cols = (abs(corrs) > 0.15).sum(axis=1)[lambda x: x==1].\
                                                          index.tolist()
print(uncorr_cols)

The preceding code outputs the two lists as follows:

['MAJOR', 'HHAGE1', 'HHAGE3', 'HHN3', 'HHP1', 'HV1', 'HV2', 'MDMAUD_R', 'MDMAUD_F', 'MDMAUD_A']
['TCODE', 'MAILCODE', 'NOEXCH', 'CHILD03', 'CHILD07', 'CHILD12', 'CHILD18', 'HC15', 'MAXADATE']

The first list contains features that are extremely correlated with features other than themselves. While this is useful to know, you shouldn't remove features from this list without understanding which features they are correlated with and how, as well as how they relate to the target. Then, only if redundancy is found, make sure you remove just one of each redundant pair. The second list contains features that are uncorrelated with any feature other than themselves, which in this case is suspicious given the sheer number of features. That being said, you should also inspect these one by one, especially measuring them against the target to see whether they are informative. However, we will take a chance and make a feature subset (corr_cols) excluding the uncorrelated ones:

corr_cols = X_train.columns[~X_train.columns.isin(uncorr_cols)].tolist()
print(len(corr_cols))

The preceding code should output 419. Let's now fit the RF model with only these features. Given that there are still over 400 features, we will use a max_depth value of 11. Except for that and a different model name (mdlname), it's the same code as before:

mdlname = 'rf_11_f-corr'
stime = timeit.default_timer()
reg_mdl = xgb.XGBRFRegressor(max_depth=11, n_estimators=200, seed=rand)
fitted_mdl = reg_mdl.fit(X_train[corr_cols], y_train)
:
reg_mdls[mdlname]['num_feat'] = \
               sum(reg_mdls[mdlname]['fitted'].feature_importances_ > 0)

Before we compare the results for the preceding model, let's learn about ranking filter methods.

Ranking filter-based methods

Ranking filter-based methods are based on statistical univariate ranking tests, which assess the strength of features against the target. These are some of the most popular methods:

  • ANOVA F-test: The Analysis of Variance (ANOVA) F-test measures the linear dependency between features and the target. As the name suggests, it does this by decomposing the variance. It makes similar assumptions to linear regression, such as normality, independence, and homoscedasticity. In scikit-learn, you can use f_regression and f_classif for regression and classification, respectively, to rank features by the F-score yielded by the F-test.
  • Chi-square test of independence: This test measures the association between non-negative categorical variables and binary targets, so it's only suitable for classification problems. In scikit-learn, you can use chi2.
  • Mutual information (MI): Unlike the two previous methods, this one is derived from information theory rather than classical statistical hypothesis testing. It's a different name but a concept we have already discussed in this book, the Kullback-Leibler (KL) divergence, because MI is the KL divergence between the joint distribution of feature X and target Y and the product of their marginal distributions. The Python implementation in scikit-learn uses a numerically stable and symmetric offshoot of KL called Jensen-Shannon (JS) divergence instead and leverages k-nearest neighbors to compute distances. Features can be ranked by MI with mutual_info_regression and mutual_info_classif for regression and classification, respectively.

Of the three options mentioned, the one that is most appropriate for this dataset is MI because we cannot assume linearity among our features, and most of them aren't categorical either. We can try classification with a threshold of $0.68, which at least covers the cost of sending the mailer. To that end, we must first create a binary classification target (y_train_class) with that threshold:

y_train_class = np.where(y_train > 0.68, 1, 0)

Next, we can use SelectKBest to get the top-160 features according to MI classification (MIC). We then employ get_support() to obtain a Boolean vector (or mask), which tells us which features are in the top 160, and we subset the list of features with this mask:

mic_selection = SelectKBest(mutual_info_classif, k=160).\
                                             fit(X_train, y_train_class)
mic_cols = X_train.columns[mic_selection.get_support()].tolist()
print(len(mic_cols))

The preceding code should confirm that there are 160 features in the mic_cols list. Incidentally, 160 is an arbitrary number. Ideally, if there were time, we could test different thresholds for the classification target and different values of k for the MI selection, looking for the model that achieves the highest profit lift while overfitting the least (a sketch of such a search appears after the next code block). Next, we can fit the RF model as we've done before but with the MIC features. This time, we will use a max depth of 5 because there are significantly fewer features:

mdlname = 'rf_5_f-mic'
stime = timeit.default_timer()
reg_mdl = xgb.XGBRFRegressor(max_depth=5, n_estimators=200, seed=rand)
fitted_mdl = reg_mdl.fit(X_train[mic_cols], y_train)
:
reg_mdls[mdlname]['num_feat'] = \
               sum(reg_mdls[mdlname]['fitted'].feature_importances_ > 0)
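Here is the search sketch promised earlier. It is purely illustrative and not run in this chapter: the label thresholds and k values are hypothetical, MI computation over the full dataset is slow, and each combination would still need to be evaluated for profit and overfitting like any other model in reg_mdls:

for label_thresh in [0.68, 1, 5]:                # hypothetical label thresholds
    y_class = np.where(y_train > label_thresh, 1, 0)
    for k in [80, 120, 160, 200]:                # hypothetical k values
        selection = SelectKBest(mutual_info_classif, k=k).fit(X_train, y_class)
        cols = X_train.columns[selection.get_support()].tolist()
        # ...fit an XGBRFRegressor on X_train[cols] and evaluate it as before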

Now, let's plot the profits for test and train as we did in Figure 10.3 but for the MIC model. It will produce what's shown in Figure 10.5:

Figure 10.5 – Comparison between profit, costs, and ROI for the test and train datasets for a model with MIC features across thresholds

In Figure 10.5, you can tell that there is quite a bit of difference between Test and Train, yet similarities indicate minimal overfitting. For instance, the highest profitability can be found between 0.66 and 0.75 for Train, and while Test is mostly between 0.66 and 0.7, it only gradually decreases afterward.

Although we have visually examined the MIC model, it's nice to have some reassurance by looking at raw metrics. Next, we will compare all the models we have trained so far using consistent metrics.

Comparing filter-based methods

We have been saving metrics into a dictionary (reg_mdls), which we easily convert to a dataframe and output as we have done before, but this time we sort by max_profit_test:

display_mdl_metrics(reg_mdls, 'max_profit_test')

The preceding snippet generated what is shown in Figure 10.6. It is evident that the filter MIC model is the least overfitted of all. It ranked higher than more complex models with more features and took less time to train than any other model. Its speed is an advantage for hyperparameter tuning. What if we wanted to find the best classification target threshold or MIC k? We won't do this now, but we could likely get a better model by running every combination, though it would take time, and even more so with more features:

Figure 10.6 – Comparing metrics for all base models and filter-based feature-selected models

In Figure 10.6, you can tell that the correlation filter model (f-corr) performs worse than the model with more features and an equal max_depth, which suggests that we must have removed an important feature. As cautioned in that section, the problem with blindly setting thresholds and removing anything beyond them is that you can inadvertently remove something useful. Not all extremely correlated and uncorrelated features are useless, so further inspection is required. Next, we will explore some embedded methods that, when combined with cross-validation, require less oversight.

Exploring embedded feature selection methods

Embedded methods exist within models themselves by naturally selecting features during training. You can leverage the intrinsic properties of any model that has them to capture the features selected:

  • Tree-based models: For instance, we have used the following code many times to count the number of features used by the RF models, which is evidence of feature selection naturally occurring in the learning process:
              sum(reg_mdls[mdlname]['fitted'].feature_importances_ > 0)

XGBoost's RF uses gain by default, which is the average decrease in error across all splits where a feature was used, to compute feature importance. We can increase the threshold above 0 to select even fewer features according to this relative contribution (a minimal sketch of this follows the list below). However, by constraining the trees' depth, we have already forced the model to choose fewer features.

  • Regularized models with coefficients: We will study this further in Chapter 12, Monotonic Constraints and Model Tuning for Interpretability, but many model classes can incorporate penalty-based regularization, such as L1, L2, and elastic net. However, not all of them have intrinsic parameters such as coefficients that can be extracted to determine which features were penalized.
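Here is the minimal sketch referenced above. It uses SelectFromModel with the already-fitted shallow RF and a threshold on its gain-based importances; the variable names are illustrative, and the chapter itself selects features with regularized models instead:

# keep only features whose gain importance is at or above the mean importance;
# X_train_orig has the original 435 columns the base model was trained on
rf_embedded_selection = SelectFromModel(reg_mdls['rf_4_all']['fitted'],
                                        threshold='mean', prefit=True)
rf_embedded_cols = \
    X_train_orig.columns[rf_embedded_selection.get_support()].tolist()
print(len(rf_embedded_cols))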

This section will only cover regularized models given that we are using a tree-based model already. It's best to leverage different model classes to get different perspectives of what features matter the most.

We covered some of these models in Chapter 3, Interpretation Challenges, but these are a few model classes that incorporate penalty-based regularization and output feature-specific coefficients:

  • Least absolute shrinkage and selection operator (LASSO): Because it uses L1 penalty in the loss function, LASSO can set coefficients to 0.
  • Least-angle regression (LARS): Similar to LASSO but is vector-based and is more suitable to high-dimensional data. It is also fairer toward equally correlated features.
  • Ridge regression: Uses the L2 penalty in the loss function and, because of this, can only shrink the coefficients of irrelevant features close to 0 but not to 0.
  • Elastic net regression: Uses a mix of both L1 and L2 norms as penalties.
  • Logistic regression: Contingent on the solver, it can handle L1, L2, or elastic net penalties.

There are also several variations of the preceding models, such as LASSO LARS, which is a LASSO fit using the LARS algorithm, or even LASSO LARS IC, which is the same but uses the AIC or BIC criteria for model selection:

  • Akaike's Information Criteria (AIC): A relative goodness of fit measure founded in information theory
  • Bayesian Information Criteria (BIC): Has a similar formula to AIC but has a different penalty term

OK, now let's use SelectFromModel to extract top features from a LASSO model. We will use LassoCV because it can automatically cross-validate to find optimal penalty strength. Once you fit it, we can get the feature mask with get_support(). We can then print the number of features and list of features:

lasso_selection = SelectFromModel(LassoCV(n_jobs=-1, random_state=rand))
lasso_selection.fit(X_train, y_train)
lasso_cols = X_train.columns[lasso_selection.get_support()].tolist()
print(len(lasso_cols))
print(lasso_cols)

The preceding code outputs the following:

7
['ODATEDW', 'TCODE', 'POP901', 'POP902', 'HV2', 'RAMNTALL', 'MAXRDATE']

Now, let's try the same but with LassoLarsCV:

llars_selection = SelectFromModel(LassoLarsCV(n_jobs=-1))
llars_selection.fit(X_train, y_train)
llars_cols = X_train.columns[llars_selection.get_support()].tolist()
print(len(llars_cols))
print(llars_cols)

The preceding snippet produces the following output:

8
['RECPGVG', 'MDMAUD', 'HVP3', 'RAMNTALL', 'LASTGIFT', 'AVGGIFT', 'MDMAUD_A', 'DOMAIN_SOCIALCLS']

LASSO shrank the coefficients of all but seven features to 0, and LASSO LARS did the same for all but eight. However, notice how there's no overlap between the two lists! OK, so let's try incorporating AIC model selection into LASSO LARS with LassoLarsIC:

llarsic_selection =  SelectFromModel(LassoLarsIC(criterion='aic'))
llarsic_selection.fit(X_train, y_train)
llarsic_cols = X_train.columns[llarsic_selection.get_support()].tolist()
print(len(llarsic_cols))
print(llarsic_cols)

The preceding snippet generates the following output:

111
['TCODE', 'STATE', 'MAILCODE', 'RECINHSE', 'RECP3', 'RECPGVG', 'RECSWEEP',..., 'DOMAIN_URBANICITY', 'DOMAIN_SOCIALCLS', 'ZIP_LON']

It's the same algorithm but with a different method for selecting the value of the regularization parameter. Note how this less conservative approach expands the number of features to 111. So far, all of the methods we have used employ the L1 norm. Let's try one with L2, more specifically, L2-penalized logistic regression. We do exactly what we did before, but this time we fit with the binary classification targets (y_train_class):

log_selection = SelectFromModel(LogisticRegression(C=0.0001,  
                                          solver='sag', penalty='l2',
                                          n_jobs=-1, random_state=rand))
log_selection.fit(X_train, y_train_class)
log_cols = X_train.columns[log_selection.get_support()].tolist()
print(len(log_cols))
print(log_cols)

The preceding code produces the following output:

87
['ODATEDW', 'TCODE', 'STATE', 'POP901', 'POP902', 'POP903', 'ETH1', 'ETH2', 'ETH5', 'CHIL1', 'HHN2',..., 'AMT_7', 'ZIP_LON']

Now that we have a few feature subsets to test, we can place their names into a list (fsnames) and the feature subset lists into another list (fscols):

fsnames = ['e-lasso', 'e-llars', 'e-llarsic', 'e-logl2']
fscols = [lasso_cols, llars_cols, llarsic_cols, log_cols]

We can then iterate across all list names and fit and evaluate our XGBRFRegressor model as we have done before but increasing max_depth at every iteration:

def train_mdls_with_fs(reg_mdls, fsnames, fscols, depths):
    for i, fsname in tqdm(enumerate(fsnames), total=len(fsnames)):
        depth = depths[i]
        cols = fscols[i]
        mdlname = 'rf_'+str(depth)+'_'+fsname
        stime = timeit.default_timer()
        reg_mdl = xgb.XGBRFRegressor(max_depth=depth, n_estimators=200,
                                     seed=rand)
        fitted_mdl = reg_mdl.fit(X_train[cols], y_train)
        :
        reg_mdls[mdlname]['num_feat'] = \
            sum(reg_mdls[mdlname]['fitted'].feature_importances_ > 0)
train_mdls_with_fs(reg_mdls, fsnames, fscols, [3, 4, 5, 6])

Now, let's see how our embedded feature-selected models fare in comparison to the filtered ones. We will rerun the code we ran to output what was shown in Figure 10.6. This time, we will get what is shown in Figure 10.7:

Figure 10.7 – Comparing metrics for all base models and filter-based and embedded feature-selected models

According to Figure 10.7, three of the four embedded methods we tried produced models with the lowest test RMSE. They also all train much faster than any others and are more profitable than any other model of equal complexity. One of them (rf_5_e-llarsic) is even highly profitable. Compare it with rf_9_all, which has similar test profitability, to see how much performance diverges on the training data.

Discovering wrapper, hybrid, and advanced feature selection methods

The feature selection methods studied so far are computationally inexpensive because they require no model fitting or fitting simpler white-box models. In this section, we will learn about other, more exhaustive methods with many possible tuning options. The categories of methods included here are as follows:

  • Wrapper: Exhaustively look for the best subset of features by fitting an ML model using a search strategy that measures improvement on a metric.
  • Hybrid: A method that combines embedded and filter methods with wrapper methods.
  • Advanced: A method that doesn't fall into any of the previously discussed categories. Examples include dimensionality reduction, model-agnostic feature importance, and GAs.

And now, let's get started with wrapper methods!

Wrapper methods

The concept behind wrapper methods is reasonably simple: evaluate different subsets of features on the ML model and choose the one that achieves the best score in a predetermined objective function. What varies here is the search strategy:

  • Sequential forward selection (SFS): This approach begins without a feature and adds one, one at a time.
  • Sequential forward floating selection (SFFS): Same as the previous except for every feature it adds, it can remove one as long as the objective function increases.
  • Sequential backward selection (SBS): This process begins with all features present and eliminates one feature at a time.
  • Sequential floating backward selection (SFBS): Same as the previous except for every feature it removes, it can add one as long as the objective function increases.
  • Exhaustive feature selection (EFS): This approach seeks all possible combinations of features.
  • Bidirectional search (BDS): This last one simultaneously allows both forward and backward feature selection to arrive at one unique solution.

These methods are greedy algorithms because they solve the problem piece by piece, choosing pieces based on their immediate benefit. Even though they may arrive at a global maximum, they take an approach more suited for finding local maxima. Depending on the number of features, they might be too computationally expensive to be practical, especially EFS, which grows combinatorially.

To allow for shorter search times, we will do two things:

  1. Start our search with the features collectively selected by other methods so we have a smaller feature space to choose from. To that end, we combine the feature lists from several methods into a single top_cols list:
top_cols = list(set(mic_cols).union(set(llarsic_cols)).
                                                   union(set(log_cols)))
len(top_cols)
  2. Sample our datasets so that ML models speed up. We can use np.random.choice to do random selection of row indexes without replacement:
sample_size = 0.1
sample_train_idx = np.random.choice(X_train.shape[0],
                               math.ceil(X_train.shape[0]*sample_size),
                               replace=False)
sample_test_idx = np.random.choice(X_test.shape[0],
                               math.ceil(X_test.shape[0]*sample_size),
                               replace=False)

Of the wrapper methods presented, we will only perform SFS given how time-consuming they are. Still, with an even smaller dataset, you can try the other options, which the mlxtend library also supports.
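Should you want to try one of the other strategies, only a couple of arguments change. For instance, a sequential floating backward selection (SFBS) on the same sample might look like the following sketch; the variable name is illustrative, and we won't run it here because it's even slower than SFS:

sfbs_lda = SequentialFeatureSelector(
               LinearDiscriminantAnalysis(n_components=1), forward=False,
               floating=True, k_features=27, cv=3, scoring='f1',
               verbose=2, n_jobs=-1)
sfbs_lda = sfbs_lda.fit(X_train.iloc[sample_train_idx][top_cols],
                        y_train_class[sample_train_idx])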

Sequential forward selection (SFS)

The first argument of a wrapper method is an unfitted estimator (a model). In SequentialFeatureSelector, we are placing a LinearDiscriminantAnalysis model. Other arguments include the direction (forward=True), whether it's floating (floating=False), the number of features we wish to select (k_features=27), the number of cross-validation folds (cv=3), and the scoring function to use (scoring='f1'). Some recommended optional arguments are the verbosity (verbose=2) and the number of jobs to run in parallel (n_jobs=-1). Since it could take a while, you'll definitely want it to output something and use as many processors as possible:

sfs_lda = SequentialFeatureSelector(
              LinearDiscriminantAnalysis(n_components=1), forward=True,
              floating=False, k_features=27, cv=3, scoring='f1',
              verbose=2, n_jobs=-1)
sfs_lda = sfs_lda.fit(X_train.iloc[sample_train_idx][top_cols],
                      y_train_class[sample_train_idx])
# k_feature_idx_ holds positions within top_cols, the columns we fit on
sfs_lda_cols = np.array(top_cols)[list(sfs_lda.k_feature_idx_)].tolist()

Once we fit the SFS, it returns the positions of the selected features with k_feature_idx_; since we fit on top_cols, we use those positions to subset that list and obtain the feature names.

Hybrid methods

Starting with 435 features, there are over 10⁴² possible 27-feature subsets alone! So, you can see how EFS would be impractical on such a large feature space. Therefore, except for EFS on the entire dataset, wrapper methods will invariably take some shortcuts to select the features. Whether you are going forward, backward, or both, as long as you are not assessing every single combination of features, you could easily miss out on the best one.

However, we can leverage the more rigorous, exhaustive search approach of wrapper methods with filter and embedded methods' efficiency. The result of this is hybrid methods. For instance, you could employ filter or embedded methods to derive only the top-10 features and perform EFS or SBS on only those.

Recursive feature elimination

Another, more common approach is similar to SBS, except that instead of removing features based on improving a metric alone, it uses the model's intrinsic parameters to rank the features and removes only the lowest ranked. The name of this approach is RFE, and it is a hybrid between embedded and wrapper methods. You can only use models with feature_importances_ or coefficients (coef_) because this is how the method knows which features to remove. Model classes in scikit-learn with these attributes are classified under linear_model, tree, and ensemble. The scikit-learn-compatible versions of XGBoost, LightGBM, and CatBoost have feature_importances_ as well.

We will use the cross-validated version of RFE because it's more reliable. RFECV takes the estimator first (LinearDiscriminantAnalysis). We can then define step, which sets how many features it should remove in every iteration, the number of cross-validations (cv), and the metric used for evaluation (scoring). Lastly, it is recommended to set the verbosity (verbose=2) and leverage as many processors as possible (n_jobs=-1). To speed it up, we will again use a sample for training and start with the 267 features in top_cols:

rfe_lda = RFECV(LinearDiscriminantAnalysis(n_components=1),
                step=2, cv=3, scoring='f1', verbose=2, n_jobs=-1)
rfe_lda.fit(X_train.iloc[sample_train_idx][top_cols],
            y_train_class[sample_train_idx])
rfe_lda_cols = np.array(top_cols)[rfe_lda.support_].tolist()

Next, we will try different methods that don't relate to the main three feature selection categories: filter, embedded, and wrapper.

Advanced methods

Many methods can be categorized under advanced feature selection methods, including the following subcategories:

  • Model-agnostic feature importance: Any feature importance method covered in Chapter 4, Global Model-Agnostic Interpretation Methods, can be used to obtain the top features of a model for feature selection purposes.
  • GA: This is a wrapper method in the sense that it "wraps" a model assessing predictive performance across many feature subsets. However, unlike the wrapper methods we examined, it's not greedy, and it's more optimized to work with large feature spaces. It's called genetic because it's inspired by biology—natural selection, specifically.
  • Dimensionality reduction: Some dimensionality reduction methods, such as Principal Component Analysis (PCA), can return explained variance on a feature basis. For others, such as factor analysis, it can be derived from other outputs. Explained variance can be used to rank features.
  • Auto-encoders: We won't delve into this one, but deep learning can be leveraged for feature selection with auto-encoders.

We will briefly cover the first two in this section so you can understand how they can be implemented. Let's dive right in!

Model-agnostic feature importance

A popular model-agnostic feature importance method that we have used throughout this book is SHAP, and it has many properties that make it more reliable than other methods. In the following code, we can take our best model and extract shap_values for it using TreeExplainer:

fitted_rf_mdl = reg_mdls['rf_11_all']['fitted']
shap_rf_explainer = shap.TreeExplainer(fitted_rf_mdl)
shap_rf_values = shap_rf_explainer.shap_values(
                     X_test_orig.iloc[sample_test_idx])
shap_imps = pd.DataFrame({'col':X_train_orig.columns,
                          'imp':np.abs(shap_rf_values).mean(0)}).\
                        sort_values(by='imp', ascending=False)
shap_cols = shap_imps.head(120).col.tolist()

Averaging the absolute value of the SHAP values across the first dimension provides us with a ranking for each feature. We put these values in a dataframe and sort it as we did for PCA. Lastly, we take the top 120 features and place them in a list (shap_cols).

Genetic algorithms

GAs are a stochastic global optimization technique inspired by natural selection, and they wrap a model much like wrapper methods do. However, they don't proceed step by step. GAs don't have iterations but generations, which include populations of chromosomes. Each chromosome is a binary representation of your feature space, where 1 means select a feature and 0 means don't. Each generation is produced with the following operations:

  • Selection: Like with natural selection, this is partially random (exploration) and partially based on what has already worked (exploitation). What has worked is its fitness. Fitness is assessed with a "scorer" much like wrapper methods. Poor fitness chromosomes are removed, whereas good ones get to reproduce through "crossover."
  • Crossover: Randomly, some good bits (or features) of each parent go to a child.
  • Mutation: Even when a chromosome has proved effective, given a low mutation rate, it will occasionally mutate, or flip, one of its bits (in other words, one of its features).
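
To make these operations concrete, here is a toy sketch of single-point crossover and bit-flip mutation on two parent chromosomes. It is purely illustrative and separate from the library we will use next:

import numpy as np

rng = np.random.default_rng(0)
n_features = 8
parent_a = rng.integers(0, 2, n_features)  # e.g., [1 0 1 ...]: 1 = keep
parent_b = rng.integers(0, 2, n_features)

# Crossover: the child takes parent A's bits up to a random cut point
# and parent B's bits after it (single-point crossover)
cut = rng.integers(1, n_features)
child = np.concatenate([parent_a[:cut], parent_b[cut:]])

# Mutation: flip each bit with a small probability
mutation_rate = 0.1
flips = rng.random(n_features) < mutation_rate
child = np.where(flips, 1 - child, child)
print(child)  # a new candidate feature subset to evaluate for fitness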

The Python implementation we will use has many options. We won't explain all of them here, but they are well documented in the code should you be interested. The first argument is the estimator. We can also define the cross-validation iterations (cv=3) and the scoring used to determine whether chromosomes are fit. There are some important probabilistic properties, such as the probability that a bit will mutate (mutation_probability) and that bits will be exchanged between parents (crossover_probability). Generation-wise, n_gen_no_change provides a means for early stopping if generations haven't improved, and generations defaults to 40, but we will use 5. You can fit GAFeatureSelectionCV as you would any model. It can take a while, so it is best to define the verbosity and allow it to use all the processing capacity. Once finished, we can use the Boolean mask (best_features_) to subset the features:

ga_rf = GAFeatureSelectionCV(
    RandomForestRegressor(random_state=rand, max_depth=3),
    cv=3, scoring='neg_root_mean_squared_error',
    crossover_probability=0.8, mutation_probability=0.1,
    generations=5, n_jobs=-1)
ga_rf = ga_rf.fit(X_train.iloc[sample_train_idx][top_cols].values,
                  y_train[sample_train_idx])
ga_rf_cols = np.array(top_cols)[ga_rf.best_features_].tolist()

OK, now that we have covered a wide variety of wrapper, hybrid, and advanced feature selection methods in this section, let's evaluate all of them at once and compare results.

Evaluating all feature-selected models

As we have done with embedded methods, we can place feature subset names (fsnames), lists (fscols), and corresponding depths in lists:

fsnames = ['w-sfs-lda', 'h-rfe-lda', 'a-shap', 'a-ga-rf']
fscols = [sfs_lda_cols, rfe_lda_cols, shap_cols, ga_rf_cols]
depths = [5, 6, 5, 6]

Then, we can use the two functions we created earlier: the first iterates across all feature subsets, training and evaluating a model with each one, and the second outputs the evaluation results in a dataframe alongside the previously trained models:

train_mdls_with_fs(reg_mdls, fsnames, fscols, depths) 
display_mdl_metrics(reg_mdls, 'max_profit_test', max_depth=7)

This time, we are limiting the models to those with a depth of no more than seven since deeper ones are very overfitted. The result of the snippet is depicted in Figure 10.8:

Figure 10.8 – Comparing metrics for all feature-selected models

Figure 10.8 shows that feature-selected models are more profitable than those that include all the features when compared at the same depths. Also, the embedded Lasso LARS with AIC (e-llarsic) method and the MIC (f-mic) filter method outperform all wrapper, hybrid, and advanced methods at the same depths. Still, we also impeded these methods by using a sample of the training dataset, which was necessary to speed up the process; maybe they would have outperformed the top ones otherwise. However, the three feature selection methods that follow are pretty competitive:

  • RFE with LDA: Hybrid method (h-rfe-lda)
  • Logistic regression with L2 regularization: Embedded method (e-logl2)
  • GAs with RF: Advanced method (a-ga-rf)

It would make sense to spend many days running variations of the methods reviewed in this chapter. For instance, perhaps RFE with L1-regularized logistic regression, or a GA with support vector machines and a higher mutation rate, yields the best model. There are so many possibilities! Nevertheless, if you were forced to make a recommendation based on Figure 10.8 by profit alone, the 111-feature e-llarsic is the best option, but it also has higher minimum costs and lower maximum ROI than any of the top models. There's a trade-off. And even though it has among the highest test RMSEs, the 160-feature model (f-mic) has a similar spread between max profit train and test and beat e-llarsic in max ROI and min costs. Therefore, these are the two most reasonable options. But before making a final determination, profitability would have to be compared side by side across different thresholds to assess when each model makes the most reliable predictions and at what costs and ROIs.

Considering feature engineering

Let's assume that the non-profit has chosen to use the model whose features were selected with Lasso LARS with AIC (e-llarsic) but would like to evaluate whether you can improve it further. Now that you have removed over 300 features that might have only marginally improved predictive performance but mostly added noise, you are left with more relevant features. However, you also know that the 8 features selected by e-llars produced the same RMSE as the 111 features. This means that while there's something in those extra features that improves profitability, it doesn't improve the RMSE.

From a feature selection standpoint, many things can be done to approach this problem. For instance, examine the overlap and difference of features between e-llarsic and e-llars, and do feature selection variations strictly on those features to see whether the RMSE dips on any combination while keeping or improving on current profitability. However, there's also another possibility, which is feature engineering. There are a few important reasons you would want to perform feature engineering at this stage:

  • Make model interpretation easier to understand: For instance, sometimes features have a scale that is not intuitive, or the scale is intuitive but the distribution makes it hard to understand. As long as transformations to these features don't worsen model performance, there's value in transforming the features to understand the outputs of interpretation methods better. As you train models on more engineered features, you realize what works and why it does. This will help you understand the model and, more importantly, the data.
  • Place guardrails on individual features: Sometimes, features have an uneven distribution, and models tend to overfit in sparser areas of the feature's histogram or where influential outliers exist.
  • Clean up counterintuitive interactions: Some interactions that models find make no sense and only exist because the features correlate, but not for the right reasons. They could be confounding variables or perhaps even redundant ones (such as the one we found in Chapter 4, Global Model-Agnostic Interpretation Methods). You could decide to engineer an interaction feature or remove a redundant one.

In reference to the last two reasons, we will examine feature engineering strategies in more detail in Chapter 12, Monotonic Constraints and Model Tuning for Interpretability. This section will focus on the first reason, particularly because it's a good place to start since it will allow you to understand the data better until you know it well enough to make more transformational changes.

So, we are left with 111 features but have no idea how they relate to the target or to each other. The first thing we ought to do is run a feature importance method. We can use SHAP's TreeExplainer on the e-llarsic model. An advantage of TreeExplainer is that it can compute SHAP interaction values with shap_interaction_values. Instead of outputting an array of (N, 111) dimensions, where N is the number of observations, as shap_values does, it outputs (N, 111, 111). You can produce a summary_plot graph with it that ranks both individual features and interactions. The only difference for interaction values is that you use plot_type="compact_dot":

winning_mdl = 'rf_5_e-llarsic'
fitted_rf_mdl = reg_mdls[winning_mdl]['fitted']
shap_rf_explainer = shap.TreeExplainer(fitted_rf_mdl)
shap_rf_interact_values = shap_rf_explainer.shap_interaction_values(
    X_test.iloc[sample_test_idx][llarsic_cols])
shap.summary_plot(shap_rf_interact_values,
                  X_test.iloc[sample_test_idx][llarsic_cols],
                  plot_type="compact_dot", sort=True)

The preceding snippet produces the SHAP interaction summary plot shown in Figure 10.9:

Figure 10.9 – SHAP interaction summary plot

You can read Figure 10.9 as you would any summary plot, except it includes each bivariate interaction twice: first from one feature's perspective and then from the other's. For instance, MDMAUD_A* - CLUSTER shows the SHAP values for that interaction from MDMAUD_A's perspective, so the feature values correspond to that feature alone, but the SHAP values are for the interaction. One thing we can agree on here is that the plot is hard to read, given the scale of the importance values and the complexity of comparing bivariate interactions in no particular order. We will address this later.

Throughout this book, chapters with tabular data have started with a data dictionary. This one was an exception, given that there were 435 features to begin with. Now, it makes sense to at the very least understand what the top features are. The complete data dictionary can be found here, https://kdd.ics.uci.edu/databases/kddcup98/epsilon_mirror/cup98dic.txt, but some of the features have already been changed because of categorical encoding, so we will explain them in more detail here:

  • MAXRAMNT: Continuous, the dollar amount of the largest gift to date
  • HVP2: Discrete, percentage of homes with a value of >= $150,000 in the neighborhoods of donors (values between 0 and 100)
  • LASTGIFT: Continuous, the dollar amount of the most recent gift
  • RAMNTALL: Continuous, the dollar amount of lifetime gifts to date
  • AVGGIFT: Continuous, the average dollar amount of gifts to date
  • MDMAUD_A: Ordinal, the donation amount code for donors who have given a $100 + gift at any time in their giving history (values between 0 and 3, -1 for those who have never exceeded $100). The amount code is the third byte of an RFA (recency/frequency/amount) major customer matrix code, which is the amount given. The categories are as follows:

0: Less than $100 (low dollar)

1: $100 – 499 (core)

2: $500 – 999 (major)

3: $1,000 + (top)

  • NGIFTALL: Discrete, number of lifetime gifts to date
  • AMT_14: Ordinal, donation amount code of the RFA for the 14th previous promotion (2 years prior), which corresponds to the last dollar amount given back then:

0: $0.01 – 1.99

1: $2.00 – 2.99

2: $3.00 – 4.99

3: $5.00 – 9.99

4: $10.00 – 14.99

5: $15.00 – 24.99

6: $25.00 and above

  • DOMAIN_SOCIALCLS: Nominal, socio-economic status (SES) of the neighborhood, which combines with DOMAIN_URBANICITY (0: Urban, 1: City, 2: Suburban, 3: Town, 4: Rural), meaning the following:

1: Highest SES

2: Average SES, except above average for urban communities

3: Lowest SES, except below average for urban communities

4: Lowest SES for urban communities only

  • CLUSTER: Nominal, code indicating which cluster group the donor falls in
  • MINRAMNT: Continuous, dollar amount of the smallest gift to date
  • LSC2: Discrete, percentage of Spanish-speaking families in the donor's neighborhood (values between 0 and 100)
  • IC15: Discrete, percentage of families with an income of < $15,000 in the donor's neighborhood (values between 0 and 100)

The following insights can be distilled by the preceding dictionary and Figure 10.9:

  • Gift amounts prevail: Seven of the top features pertain to gift amounts, whether it's a total, minimum, maximum, average, or the last gift. If you include the count of gifts (NGIFTALL), there are eight features involving donation history, which makes complete sense. So, why is this relevant? Because they are likely highly correlated, and understanding how could hold the key to improving the model. Perhaps other features can be created that distill these relationships much better.
  • High values of continuous gift amount features have high SHAP values: Plot a box plot of any of those features, like this: plt.boxplot(X_test.MAXRAMNT), and you'll see how right-skewed they are. Perhaps a transformation such as breaking them into bins, called "discretization", or using a different scale, such as logarithmic (try plt.boxplot(np.log(X_test.MAXRAMNT))), can help interpret these features, but also help find the pockets where the likelihood of donation dramatically increases. A minimal sketch follows this list.
  • Relationship with the 14th previous promotion: What happened 2 years prior that connects that promotion to the one denoted in the dataset labels? Were the promotional materials similar? Is there a seasonality factor occurring at the same time every couple of years? Maybe you can engineer a feature that better identifies this phenomenon.
  • Inconsistent classifications: DOMAIN_SOCIALCLS has different categories depending on the DOMAIN_URBANICITY value. We can make this consistent by using all five categories in the scale (Highest, Above Average, Average, Below Average, and Lowest), even if this means non-urban donors would only use three of them. The advantage of doing this would be easier interpretation, and it's highly unlikely to adversely impact the model's performance.
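
Here is a minimal sketch of the transformations mentioned in the second insight, assuming X_train is available; the engineered column names (MAXRAMNT_log and MAXRAMNT_bin) are hypothetical:

import numpy as np
import pandas as pd

# Logarithmic scale: log1p handles zero amounts gracefully
X_train['MAXRAMNT_log'] = np.log1p(X_train['MAXRAMNT'])

# Discretization: bin the dollar amounts into deciles
X_train['MAXRAMNT_bin'] = pd.qcut(X_train['MAXRAMNT'], q=10,
                                  labels=False, duplicates='drop')

Either version could replace the raw feature when retraining, and the binned version in particular can make plots produced by interpretation methods easier to read.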

The SHAP interaction summary plot can be useful for identifying feature and interaction rankings and some commonalities between them, but in this case (see Figure 10.9) it was hard to read. To dig deeper into interactions, you first need to quantify their impact. To this end, let's create a heatmap with only the top interactions as measured by their mean absolute SHAP value (shap_rf_interact_avgs). We should then set all the diagonal values to 0 (shap_rf_interact_avgs_nodiag) because these aren't interactions but feature SHAP values, and it's easier to observe the interactions without them. We can place this matrix in a dataframe, but it's a dataframe of 111 columns and 111 rows, so to filter it down to the features with the most interactions, we sum them and rank them with scipy's rankdata. Then, we use the ranking to identify the 12 most interactive features (most_interact_cols) and subset the dataframe by them. Finally, we plot the dataframe as a heatmap:

shap_rf_interact_avgs = np.abs(shap_rf_interact_values).mean(0)
shap_rf_interact_avgs_nodiag = shap_rf_interact_avgs.copy()
np.fill_diagonal(shap_rf_interact_avgs_nodiag, 0)
shap_rf_interact_df = pd.DataFrame(shap_rf_interact_avgs_nodiag)
shap_rf_interact_df.columns = X_test[llarsic_cols].columns
shap_rf_interact_df.index = X_test[llarsic_cols].columns
shap_rf_interact_ranks = 112 - rankdata(
    np.sum(shap_rf_interact_avgs_nodiag, axis=0))
most_interact_cols = shap_rf_interact_df.columns[
    shap_rf_interact_ranks < 13]
shap_rf_interact_df = shap_rf_interact_df.loc[
    most_interact_cols, most_interact_cols]
sns.heatmap(shap_rf_interact_df, cmap='Blues', annot=True,
            annot_kws={'size': 10}, fmt='.3f', linewidths=.5)

The preceding snippet outputs what is shown in Figure 10.10. It depicts the most salient feature interactions according to SHAP interaction absolute mean values. Note that these are averages, so given how right-skewed most of these features are, it is likely much higher for many observations. However, it's still a good indication of relative impact:

Figure 10.10 – SHAP interactions heatmap

One way in which we can understand feature interactions one by one is with SHAP's dependence_plot. For instance, we can take our top feature, MAXRAMNT, and plot it with color-coded interactions with features such as RAMNTALL, LSC4, HVP2, and AVGGIFT. But first, we will need to compute shap_values. There are a couple of problems, though, that need to be addressed, which we mentioned earlier. They have to do with the following:

  • The prevalence of outliers: We can cut them out of the plot by limiting the x and y axes using percentiles for the feature and SHAP values, respectively, with plt.xlim and plt.ylim. This essentially zooms in to cases that lie between the 1st and 99th percentiles.
  • Lopsided distribution of dollar amount features: It is common in any feature involving money for it to be right-skewed. There are many ways to simplify it, such as using percentiles to bin the feature, but a quick way to make it easier to appreciate is by using a logarithmic scale. In matplotlib, you can do this with plt.xscale('log') without any need to transform the feature.

The following code accounts for the two issues. You can try commenting out xlim, ylim, or xscale to see the big difference they individually make in understanding dependence_plot:

shap_rf_values = shap_rf_explainer.shap_values(
    X_test.iloc[sample_test_idx][llarsic_cols])
maxramt_shap = shap_rf_values[:, llarsic_cols.index("MAXRAMNT")]
shap.dependence_plot("MAXRAMNT", shap_rf_values,
                     X_test.iloc[sample_test_idx][llarsic_cols],
                     interaction_index="AVGGIFT", show=False, alpha=0.1)
plt.xlim(xmin=np.percentile(X_test.MAXRAMNT, 1),
         xmax=np.percentile(X_test.MAXRAMNT, 99))
plt.ylim(ymin=np.percentile(maxramt_shap, 1),
         ymax=np.percentile(maxramt_shap, 99))
plt.xscale('log')

The preceding code generates what is shown in Figure 10.11. It shows that there's a tipping point somewhere between 10 and 100 for MAXRAMNT where the impact on the model output starts to creep up, and that these higher values correlate with a higher AVGGIFT value:

Figure 10.11 – SHAP interaction plot between MAXRAMNT and AVGGIFT

A lesson you could take from Figure 10.11 is that certain values of these features, and possibly a few others, form a cluster that increases the likelihood of a donation. From a feature engineering standpoint, you could use unsupervised methods to create special cluster features based solely on the few features you have identified as related. Or you could take a more manual route, comparing different plots to understand how to best identify clusters. From this process, you could derive binary features, or even a ratio between features that more clearly depicts interactions or cluster membership. A minimal sketch of both ideas follows.
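
As a sketch of both routes, assuming X_train is available, here is a hypothetical ratio feature plus a small cluster feature built only from related donation-amount columns; the new column names and the choice of five clusters are illustrative:

import numpy as np
from sklearn.cluster import KMeans

donation_cols = ['MAXRAMNT', 'AVGGIFT', 'LASTGIFT']

# Ratio feature: how much the largest gift exceeds the average gift
X_train['max_to_avg_gift'] = X_train['MAXRAMNT'] / (X_train['AVGGIFT'] + 1)

# Cluster feature: group donors by their log-scaled giving profile
# (in practice, fit on train and apply the same model to test)
log_gifts = np.log1p(X_train[donation_cols])
kmeans = KMeans(n_clusters=5, n_init=10, random_state=9)
X_train['gift_cluster'] = kmeans.fit_predict(log_gifts)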

The idea here is not to reinvent the wheel trying to do what the model already does so well but to, first and foremost, aim for a more straightforward model interpretation. Hopefully, that will even have a positive impact on predictive performance by tidying up the features, because if you understand them better, maybe the model does so too! It's like smoothing a grainy image; it might confuse you less and the model too (see Chapter 13, Adversarial Robustness, for more on that)! But understanding the data better through the model has other positive side effects.

In fact, the lessons don't stop with feature engineering or modeling but can be directly applied to promotions. What if tipping points identified could be used to encourage donations? Perhaps get a free mug if you donate over $X? Or set up a recurring donation of $X and be in the exclusive list of "silver" patrons?

We will end this topic on that curious note, but hopefully, this inspires you to appreciate how we can apply lessons from model interpretation to feature selection, engineering, and much more.

Mission accomplished

To approach this mission, you have reduced overfitting using primarily the toolset of feature selection. The non-profit is pleased with a profit lift of roughly 30%, costing a total of $35,601, which is $30,000 less than it would cost to send everyone in the test dataset the mailer. However, they still want assurance that they can safely employ this model without worries that they'll experience losses.

In this chapter, we've examined how overfitting can cause the profitability curves not to align. Misalignment is critical because it could mean that choosing a threshold based on training data would not be reliable on out-of-sample data. So, you use compare_df_plots to compare profitability between the test and train sets as you've done before, but this time for the chosen model (rf_5_e-llarsic):

profits_test = reg_mdls['rf_5_e-llarsic']['profits_test']
profits_train = reg_mdls['rf_5_e-llarsic']['profits_train']
mldatasets.compare_df_plots(
    profits_test[['costs', 'profit', 'roi']],
    profits_train[['costs', 'profit', 'roi']],
    'Test', 'Train', x_label='Threshold',
    y_formatter=y_formatter,
    plot_args={'secondary_y': 'roi'})

The preceding code generates what is shown in Figure 10.12. You can show this to the non-profit to prove that there's a sweet spot at a threshold of $0.68, which yields the second-highest profit attainable in Test. It is also within reach of their budget and achieves an ROI of 41%. More importantly, these numbers are not far from what they are for Train. Another thing that is great to see is that the profit curve slowly slides down for both Train and Test instead of dramatically falling off a cliff. The non-profit can be assured that the operation would still be profitable if they choose to increase the threshold. After all, they want to target donors from the entire mailing list, and for that to be financially feasible, they have to be more exclusive. Say they use a threshold of $0.77 on the entire mailing list; the campaign would cost about $46,000 but return over $24,000 in profit:

Figure 10.12 – Comparison between profit, costs, and ROI for the test and train datasets for the model with Lasso Lars via AIC features across different thresholds

Congratulations! You have accomplished this mission!

But there's one crucial detail we'd be remiss if we didn't bring up.

Although we trained this model with the next campaign in mind, it will likely be used in future direct marketing campaigns without retraining. This model reuse presents a problem. There's a concept called data drift, also known as feature drift, which is when, over time, what the model learned about the features in relation to the target variable no longer holds true. Another, concept drift, is about how the definition of the target variable itself changes over time; for instance, what constitutes a profitable donor can change. Both drifts can happen simultaneously, and with problems involving human behavior, this is to be expected. Behavior is shaped by cultures, habits, attitudes, technologies, and fashions, which are always evolving. You can caution the non-profit that you can only assure them that the model will be reliable for the next campaign, but they can't afford to hire you to retrain the model every single time!

You can propose that the client create a script that monitors drift directly on their mailing list database. If it finds significant changes in the features used by the model, it alerts both them and you (a minimal sketch of such a check follows at the end of this section). You could, at this point, trigger automatic retraining of the model. However, if retraining is fully automatic and the drift is due to data corruption, you won't have an opportunity to address the underlying problem. And even if automatic retraining is done, the new model can't be deployed if its performance metrics don't meet predetermined standards. Either way, you should keep a close eye on predictive performance to be able to guarantee reliability. Reliability is an essential theme in model interpretability because it relates heavily to accountability. We won't cover drift detection in this book, but future chapters discuss data augmentation (Chapter 11, Bias Mitigation and Causal Inference Methods) and adversarial robustness (Chapter 13, Adversarial Robustness), which pertain to reliability.
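
To give a flavor of what such a monitoring script could do, here is a minimal sketch that flags drifted features with a two-sample Kolmogorov-Smirnov test; the detect_drift function, the significance threshold, and the X_new name are illustrative assumptions, not the chapter's code:

from scipy.stats import ks_2samp

def detect_drift(X_reference, X_current, cols, alpha=0.01):
    """Return features whose distribution shifted between two samples."""
    drifted = []
    for col in cols:
        stat, p_value = ks_2samp(X_reference[col], X_current[col])
        if p_value < alpha:
            drifted.append((col, stat))
    return drifted

# Usage sketch: compare training data against a fresh mailing-list pull
# drifted_feats = detect_drift(X_train[llarsic_cols],
#                              X_new[llarsic_cols], llarsic_cols)

In practice, a check like this would feed the alerts mentioned above and be paired with ongoing monitoring of predictive performance.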

Summary

In this chapter, we have learned about how irrelevant features impact model outcomes and how feature selection provides a toolset to solve this problem. We then explored many different methods in this toolset, from the most basic filter methods to the most advanced ones. Lastly, we broached the subject of feature engineering for interpretability. Feature engineering can make for a more interpretable model that will perform better. We will cover this topic in more detail in Chapter 12, Monotonic Constraints and Model Tuning for Interpretability. In the next chapter, we will discuss methods for bias mitigation and causal inference.

Dataset sources

Further reading
