Avoiding overfitting with feature selection and dimensionality reduction

We typically represent data as a grid of numbers (a matrix). Each column represents a variable, which we call a feature in machine learning. In supervised learning, one of the variables is actually not a feature but the label that we're trying to predict, and each row is an example that we can use for training or testing.

The number of features corresponds to the dimensionality of the data. Our choice of machine learning approach depends on the number of dimensions relative to the number of examples. For instance, text and image data are very high dimensional, while stock market data has relatively low dimensionality.
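As a minimal illustration of this matrix view (the toy numbers below are made up purely for demonstration):

    import numpy as np

    # A toy data matrix: each row is an example, each column a feature
    X = np.array([[5.1, 3.5, 1.4],    # example 1, with three features
                  [4.9, 3.0, 1.3],    # example 2
                  [6.2, 3.4, 5.4]])   # example 3
    y = np.array([0, 0, 1])           # the label column in supervised learning

    n_examples, n_features = X.shape
    print(f'{n_examples} examples, {n_features} dimensions')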

Fitting high-dimensional data is computationally expensive and prone to overfitting due to the high model complexity. Data in more than three dimensions is also impossible to visualize directly, so we can't rely on simple visual diagnostics.

Not all features are useful; some may only add noise to our results. It's therefore often important to perform good feature selection. Feature selection is the process of picking a subset of significant features to use in building a better model. In practice, not every feature in a dataset carries information useful for discriminating samples; some features are redundant or irrelevant, and hence can be discarded with little loss.
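As a rough sketch of what feature selection looks like in practice (assuming scikit-learn is available; the dataset and the ANOVA F-test scoring used here are illustrative choices, not a prescribed method):

    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import SelectKBest, f_classif

    X, y = load_breast_cancer(return_X_y=True)
    print('Before selection:', X.shape)          # (569, 30)

    # Keep the 10 features that score highest on an ANOVA F-test against
    # the label; features with little discriminative information are dropped
    selector = SelectKBest(score_func=f_classif, k=10)
    X_selected = selector.fit_transform(X, y)
    print('After selection:', X_selected.shape)  # (569, 10)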

In principle, feature selection boils down to multiple binary decisions about whether to include each feature or not. For n features, we get 2^n possible feature sets, which can be a very large number for a large number of features. For example, for 10 features we have 1,024 possible feature sets (for instance, if we're deciding what clothes to wear, the features can be temperature, rain, the weather forecast, where we're going, and so on). At a certain point, brute-force evaluation becomes infeasible. We'll discuss better methods in Chapter 6, Predicting Online Ads Click-Through with Tree-Based Algorithms. Basically, we have two options: we either start with all of the features and remove features iteratively, or we start with a minimal set of features and add features iteratively. We then take the best feature sets from each iteration and compare them.
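As a sketch of this iterative idea (greedy forward addition and backward elimination rather than brute-force enumeration), scikit-learn's SequentialFeatureSelector can work in either direction; the estimator and the number of features to keep are arbitrary assumptions here, chosen only for illustration:

    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_breast_cancer(return_X_y=True)
    estimator = KNeighborsClassifier(n_neighbors=3)

    # Forward selection: start from an empty set and greedily add the feature
    # that improves cross-validated accuracy the most, until 5 are chosen
    forward = SequentialFeatureSelector(
        estimator, n_features_to_select=5, direction='forward', cv=5)
    forward.fit(X, y)
    print('Forward-selected features:', forward.get_support(indices=True))

    # Backward elimination: start from all features and iteratively drop
    # the least useful one until only 5 remain
    backward = SequentialFeatureSelector(
        estimator, n_features_to_select=5, direction='backward', cv=5)
    backward.fit(X, y)
    print('Backward-selected features:', backward.get_support(indices=True))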

We'll explore how to perform feature selection mainly in Chapter 7, Predicting Online Ads Click-Through with Logistic Regression.

Another common approach to reducing dimensionality is to transform high-dimensional data into a lower-dimensional space. This is called dimensionality reduction or feature projection. The transformation inevitably leads to some information loss, but we can keep the loss to a minimum.
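A minimal sketch of feature projection with principal component analysis (PCA) in scikit-learn, assuming we want to retain, say, 95% of the variance (the dataset and threshold are illustrative assumptions):

    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA

    X, _ = load_digits(return_X_y=True)
    print('Original dimensionality:', X.shape[1])   # 64 pixel features

    # Project onto the smallest number of principal components that
    # still preserve 95% of the variance in the data
    pca = PCA(n_components=0.95)
    X_reduced = pca.fit_transform(X)
    print('Reduced dimensionality:', X_reduced.shape[1])
    print('Variance retained:', pca.explained_variance_ratio_.sum())

PCA finds orthogonal directions of maximum variance and discards the components that explain little variance, which is where the (small) information loss comes from.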

We'll talk about and implement dimensionality reduction in Chapter 2, Exploring the 20 Newsgroups Dataset with Text Analysis Techniques, Chapter 3, Mining the 20 Newsgroups Dataset with Clustering and Topic Modeling Algorithms, and Chapter 10, Machine Learning Best Practices.
