Feature selection via random forest

We saw how feature selection works with L1-regularized logistic regression in a previous section, where 574 out of the 2,820 ad click features were chosen as the more important ones. This is because, with L1 regularization, the weights of less important features are shrunk to close to, or exactly, 0. Besides L1-regularized logistic regression, random forest is another frequently used feature selection technique.
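As a quick reminder of that approach, here is a minimal sketch of L1-based selection on the same training data used below; note that LogisticRegression with the liblinear solver and C=0.1 are illustrative choices, not necessarily the exact settings from the previous section:

>>> from sklearn.linear_model import LogisticRegression
>>> l1_logistic = LogisticRegression(penalty='l1', solver='liblinear',
...                                  C=0.1)
>>> l1_logistic.fit(X_train_10k, y_train_10k)
>>> # features with non-zero weights are the ones L1 keeps
>>> print((l1_logistic.coef_ != 0).sum())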

To recap, a random forest is a bagging ensemble of individual decision trees, where each tree considers a random subset of the features when searching for the best splitting point at each node. As an essence of the decision tree algorithm, only the significant features (along with their splitting values) are used to constitute tree nodes. Considering the forest as a whole, the more frequently a feature is used in a tree node, the more important it tends to be. In other words, we can rank the importance of features based on their occurrences in nodes across all trees, and select the top most important ones.
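To make the occurrence-counting intuition concrete, here is a minimal sketch (not how scikit-learn actually computes its importance scores) that tallies how many internal nodes split on each feature in a fitted forest; feature_occurrence_counts is an illustrative helper name:

>>> import numpy as np
>>> def feature_occurrence_counts(forest, n_features):
...     # tally how many internal nodes split on each feature, over all trees
...     counts = np.zeros(n_features, dtype=np.int64)
...     for tree in forest.estimators_:
...         node_features = tree.tree_.feature   # feature index per node; -2 marks a leaf
...         counts += np.bincount(node_features[node_features >= 0],
...                               minlength=n_features)
...     return counts

Once the random forest below has been fitted, calling feature_occurrence_counts(random_forest, X_train_10k.shape[1]) would return one count per feature, and sorting those counts gives an occurrence-based ranking.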

A trained RandomForestClassifier in scikit-learn comes with a feature_importances_ attribute indicating the importance of features. These scores refine the occurrence-counting intuition: each split is weighted by the impurity reduction it achieves (the so-called Gini importance), averaged over all trees and normalized to sum to 1. Again, we will examine feature selection with random forest on the dataset with 10,000 ad click samples:

>>> from sklearn.ensemble import RandomForestClassifier
>>> random_forest = RandomForestClassifier(n_estimators=100,
...     criterion='gini', min_samples_split=30, n_jobs=-1)
>>> random_forest.fit(X_train_10k, y_train_10k)

After fitting the random forest model, let's take a look at the bottom 10 importance scores and the indices of the corresponding 10 least important features:

>>> print(np.sort(random_forest.feature_importances_)[:10])
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
>>> print(np.argsort(random_forest.feature_importances_)[:10])
[1359 2198 2475 2378 1980 516 2369 1380 157 2625]

As well as the top 10 importance scores and the indices of the corresponding 10 most important features:

>>> print(np.sort(random_forest.feature_importances_)[-10:])
[ 0.0072243 0.00757724 0.00811834 0.00847693 0.00856942
0.00889287 0.00930343 0.01081189 0.013195 0.01567019]
>>> print(np.argsort(random_forest.feature_importances_)[-10:])
[ 549 1284 2265 1540 1085 1923 1503 2761 554 393]

Feature 2761 ('site_id=d9750ee7') appears in the top 10 lists selected by both L1-regularized logistic regression and random forest. This time, the most important feature is:

>>> dict_one_hot_encoder.feature_names_[393]
'C18=2'
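
To see all of the names at once, we can map each of the top 10 indices back to their feature names (reusing the dict_one_hot_encoder fitted earlier):

>>> top10_indices = np.argsort(random_forest.feature_importances_)[-10:]
>>> print([dict_one_hot_encoder.feature_names_[i] for i in top10_indices])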

Furthermore, we can select the top 500 features as follows:

>>> top500_feature = np.argsort(random_forest.feature_importances_)[-500:]
>>> X_train_10k_selected = X_train_10k[:, top500_feature]
>>> print(X_train_10k_selected.shape)
(10000, 500)
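
If a test split encoded with the same one-hot encoder is available (X_test_10k here is a hypothetical name for it), the same column selection can be applied before retraining a model on the reduced feature set:

>>> X_test_10k_selected = X_test_10k[:, top500_feature]
>>> random_forest_selected = RandomForestClassifier(n_estimators=100,
...     criterion='gini', min_samples_split=30, n_jobs=-1)
>>> random_forest_selected.fit(X_train_10k_selected, y_train_10k)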