Feature Construction

In the previous chapter, we worked with the Pima Indian Diabetes Prediction dataset to better understand which of the given features in our dataset are most valuable. Working with the features available to us, we identified missing values within our columns and applied techniques such as dropping rows with missing values, imputing, and normalizing/standardizing our data to improve the accuracy of our machine learning model.

It is important to note that, up to this point, we have only worked with features that are quantitative. We will now shift into dealing with categorical data, in addition to the quantitative data that has missing values. Our main focus will be to work with our given features to construct entirely new features for our models to learn from. 

There are various methods we can use to construct features. The most basic starting point is the pandas library in Python, which lets us scale an existing feature by a constant multiple. We will then dive into some more mathematically intensive methods, employing various packages available to us through the scikit-learn library, and we will also create our own custom classes. We will go over these classes in detail as we get into the code.
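As a minimal sketch of that most basic case, here is how a new feature can be constructed in pandas by scaling an existing column by a constant multiple (the column name and values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Illustrative toy data; the column name is hypothetical
df = pd.DataFrame({'plasma_glucose': [148, 85, 183]})

# Construct a new feature by scaling an existing one by a constant multiple
df['plasma_glucose_x10'] = df['plasma_glucose'] * 10
```

The same pattern extends to any elementwise arithmetic on a column, since pandas broadcasts scalar operations across the whole Series.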

We will be covering the following topics in our discussions:

  • Examining our dataset
  • Imputing categorical features
  • Encoding categorical variables
  • Extending numerical features
  • Text-specific feature construction
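To preview one of the topics above, encoding categorical variables can be done in a single call with pandas; this is a hedged sketch using invented city names, not data from the chapter:

```python
import pandas as pd

# Hypothetical categorical column
df = pd.DataFrame({'city': ['tokyo', 'paris', 'tokyo']})

# One-hot (dummy) encoding: each category becomes its own indicator column
encoded = pd.get_dummies(df, columns=['city'])
```

Each resulting column (for example `city_tokyo`) holds an indicator that is 1 where the row had that category and 0 elsewhere, turning a non-numeric feature into something a model can learn from.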