Selecting features

When building models, more features are not always better. From an implementation perspective, a predictive pipeline deployed in a real-time clinical setting that must interact with multiple devices, health informatics systems, and source databases is more likely to fail than a simplified version that requires only a minimal number of features. From a modeling perspective, as you clean and explore your data you will find that not all of the features are significantly related to the outcome variable.

Furthermore, many of the variables may be highly correlated with other variables and will offer little new information for making accurate predictions. Leaving these variables in your model could, in fact, reduce its accuracy because they add random noise to the data. Therefore, a usual step in the machine learning pipeline is to perform feature selection and remove unwanted features from your data. How many variables to remove, and which ones, depends on many factors, including the choice of your machine learning algorithm and how interpretable you want the model to be.
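As a minimal sketch of the correlation-based pruning described above, the following helper drops one column from each pair of features whose absolute Pearson correlation exceeds a threshold. The function name `drop_highly_correlated`, the feature matrix `X`, and the 0.9 cutoff are illustrative assumptions, not prescriptions from this chapter:

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(features, threshold=0.9):
    """Return a copy of the feature DataFrame with one column from each
    highly correlated pair removed."""
    corr = features.corr().abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop)

# Hypothetical usage with a pandas DataFrame of predictors named X:
# X_reduced = drop_highly_correlated(X, threshold=0.9)
```

Which member of a correlated pair to keep is a judgment call; in practice you might prefer the variable that is cheaper to collect or easier to interpret clinically.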

There are many approaches to removing extraneous features from the final model. Iterative approaches, in which features are removed and the resulting model is rebuilt, evaluated, and compared to previous models, are popular because they allow you to measure how each adjustment affects the performance of the model. Common algorithms for selecting features include best subset selection and forward and backward stepwise regression. There are also a variety of measures of feature importance, including the relative risk ratio, the odds ratio, p-value significance, lasso regularization, the correlation coefficient, and the random forest out-of-bag error. We will explore some of these measures in Chapter 7, Making Predictive Models in Healthcare.
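One way to make the iterative, backward-elimination idea concrete is recursive feature elimination as implemented in scikit-learn, which repeatedly fits a model and discards the weakest features until a requested number remains. The synthetic data and the choice to retain five features below are assumptions made purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a cleaned clinical feature matrix and binary outcome
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

# Recursive feature elimination: fit the model, drop the weakest feature(s),
# and repeat until only the requested number of features remains
estimator = LogisticRegression(max_iter=1000)
selector = RFE(estimator, n_features_to_select=5)
selector.fit(X, y)

kept = [i for i, keep in enumerate(selector.support_) if keep]
print("Selected feature indices:", kept)
print("Feature ranking (1 = kept):", selector.ranking_)
```

In a real pipeline you would compare the reduced model against the full one on held-out data before committing to the smaller feature set.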
