Asking the model about the features using wrappers

While filters can help tremendously in getting rid of useless features, they can only go so far. After all the filtering, there might still be features that are independent among themselves and show some degree of dependence on the result variable, yet are totally useless from the model's point of view. Worse, because filters judge every feature in isolation, they can also discard features that are only useful in combination. Just think of the following data, which describes the XOR function: individually, neither A nor B shows any sign of dependence on Y, whereas together they clearly do:

A  B  Y
0  0  0
0  1  1
1  0  1
1  1  0

So, why not ask the model itself to give its vote on the individual features? This is what scikit-learn's wrappers do, as we can see in the following process chart:

Here, we pushed the calculation of feature importance into the model training process. Unfortunately (but understandably), feature importance does not come as a binary keep/drop decision, but as a ranking value, so we still have to specify where to make the cut: which part of the features we are willing to keep and which part we want to drop.
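As a quick sketch of making such a cut (this goes beyond the text and uses scikit-learn's SelectFromModel; the mean-importance threshold is just one possible choice), we could train a model, read off its feature importances, and keep everything scoring above the mean:

>>> from sklearn.datasets import make_classification
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.feature_selection import SelectFromModel
>>> X_demo, y_demo = make_classification(n_samples=100, n_features=10,
...                                      n_informative=3, random_state=0)
>>> forest = RandomForestClassifier(n_estimators=100, random_state=0)
>>> # keep only the features whose importance exceeds the mean importance
>>> sfm = SelectFromModel(forest, threshold="mean").fit(X_demo, y_demo)
>>> mask = sfm.get_support()  # Boolean mask of the features that made the cut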

Coming back to scikit-learn, we find various excellent wrapper classes in the sklearn.feature_selection package. A real workhorse in this field is RFE, which stands for recursive feature elimination. It takes an estimator and the desired number of features to keep as parameters, and then repeatedly trains the estimator on shrinking feature sets, dropping the least important features in each round, until the remaining subset is small enough. The RFE instance itself behaves like an estimator, thereby, indeed, wrapping the provided estimator.

In the following example, we will create an artificial classification problem of 100 samples using the datasets module's convenient make_classification() function. It lets us specify the creation of 10 features, out of which only three are really informative for solving the classification problem:

>>> from sklearn.feature_selection import RFE
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=100, n_features=10,
...                            n_informative=3, random_state=0)
>>> clf = LogisticRegression()
>>> selector = RFE(clf, n_features_to_select=3)
>>> selector = selector.fit(X, y)
>>> print(selector.support_)
[False False True False False False True True False False]
>>> print(selector.ranking_)
[5 4 1 2 6 7 1 1 8 3]

The problem in real-world scenarios is, of course, how we can know the right value for n_features_to_select. The truth is, we can't. However, most of the time, we can use a sample of the data and play with different settings to quickly get a feeling for the right ballpark.
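A minimal sketch of such an exploration, reusing the X, y, and clf defined above, simply loops over candidate values and prints the resulting masks and rankings:

>>> for n in range(1, 11):
...     sel = RFE(clf, n_features_to_select=n).fit(X, y)
...     print(n, sel.support_, sel.ranking_)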

The good thing is that we don't have to be that exact using wrappers. Let's try different values for n_features_to_select to see how support_ and ranking_ change:

n_features_to_select   support_                                                        ranking_
1                      [False False False False False False False True False False]   [ 7 6 3 4 8 9 2 1 10 5]
2                      [False False False False False False True True False False]    [6 5 2 3 7 8 1 1 9 4]
3                      [False False True False False False True True False False]     [5 4 1 2 6 7 1 1 8 3]
4                      [False False True True False False True True False False]      [4 3 1 1 5 6 1 1 7 2]
5                      [False False True True False False True True False True]       [3 2 1 1 4 5 1 1 6 1]
6                      [False True True True False False True True False True]        [2 1 1 1 3 4 1 1 5 1]
7                      [ True True True True False False True True False True]        [1 1 1 1 2 3 1 1 4 1]
8                      [ True True True True True False True True False True]         [1 1 1 1 1 2 1 1 3 1]
9                      [ True True True True True True True True False True]          [1 1 1 1 1 1 1 1 2 1]
10                     [ True True True True True True True True True True]           [1 1 1 1 1 1 1 1 1 1]

We can see that the result is very stable. Features that were selected when requesting smaller feature sets keep being selected when we let more features in. Finally, we can rely on our train/test split to warn us when we are heading in the wrong direction.
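To illustrate that last point (a sketch going beyond the text, using cross-validation instead of a single split), RFE can be wrapped in a pipeline and scored for a few candidate feature counts:

>>> from sklearn.model_selection import cross_val_score
>>> from sklearn.pipeline import make_pipeline
>>> for n in (3, 5, 7):
...     pipe = make_pipeline(RFE(LogisticRegression(), n_features_to_select=n),
...                          LogisticRegression())
...     print(n, cross_val_score(pipe, X, y, cv=5).mean())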
