Feature selection

In machine learning, it is a common misconception that more data is always better. This is usually true of observations (that is, the rows of the dataset). With features, however, more isn't always better. In some cases, a model may paradoxically perform better with fewer features, either because several highly correlated features are skewing the predictions or because the dataset contains more features than observations.

In other cases, the performance may be the same or only marginally worse with, say, half the features, but the smaller feature set may still be preferable for a number of reasons, including training time, memory availability, or ease of explanation and interpretation for non-technical stakeholders. In any case, it is almost always a good idea to perform some feature selection on the data. Even if you don't wish to remove any features, performing feature selection and ranking the features by importance can give you valuable insight into your model and help you understand its predictive behavior and performance.

There are a number of classes and functions in the sklearn.feature_selection module that are built for feature selection, and different sets of classes correspond to different methods of performing feature selection. For example, univariate feature selection involves measuring the statistical dependency between each predictor variable and the target variable, and this can be done using the SelectKBest or SelectPercentile classes, among others. The VarianceThreshold class removes features that have a low variance across observations, for example, those features that are almost always zero. And the SelectFromModel class prunes features that don't meet a certain strength requirement (in terms of either coefficient or feature importance) after the model has been fit.
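As a rough illustration of the three approaches just described, the following sketch applies VarianceThreshold, SelectKBest, and SelectFromModel in turn. The synthetic dataset, the choice of k=5, the f_classif scoring function, and the random forest estimator are all assumptions made for this example rather than recommendations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import (
    SelectFromModel,
    SelectKBest,
    VarianceThreshold,
    f_classif,
)

# Synthetic data (assumed for illustration): 200 observations, 10 features.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=2, random_state=0
)

# Append an all-zero column so that there is a zero-variance feature to remove.
X = np.hstack([X, np.zeros((X.shape[0], 1))])

# 1. VarianceThreshold: keep only features whose variance exceeds the threshold.
vt = VarianceThreshold(threshold=0.0)
X_vt = vt.fit_transform(X)

# 2. Univariate selection: keep the k features with the highest ANOVA F-score
#    against the target.
skb = SelectKBest(score_func=f_classif, k=5)
X_kbest = skb.fit_transform(X_vt, y)

# 3. Model-based selection: fit a model, then keep the features whose
#    importance is at least the median importance.
sfm = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    threshold="median",
)
X_sfm = sfm.fit_transform(X_vt, y)

print("original:", X.shape)
print("after VarianceThreshold:", X_vt.shape)
print("after SelectKBest:", X_kbest.shape)
print("after SelectFromModel:", X_sfm.shape)
```

Each selector exposes get_support(), which returns a Boolean mask of the retained columns, so you can map the selected features back to their original names.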

For a full list of the feature selection classes in scikit-learn, you can visit http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection.
