Feature Improvement - Cleaning Datasets

In the last two chapters, we moved from a basic understanding of feature engineering, and how it can enhance our machine learning pipelines, to getting our hands dirty with datasets: evaluating and understanding the different types of data that we can encounter in the wild.

In this chapter, we will use what we have learned and take things a step further by beginning to change the datasets that we work with. Specifically, we will start to clean and augment our datasets. By cleaning, we generally mean altering the columns and rows already given to us. By augmenting, we generally mean removing columns from, and adding columns to, our datasets. As always, our goal in all of these processes is to enhance our machine learning pipelines.

In the following chapters, we will be:

  • Identifying missing values in data
  • Removing harmful data
  • Imputing (filling in) these missing values
  • Normalizing/standardizing data
  • Constructing brand new features
  • Selecting (removing) features manually and automatically
  • Using mathematical matrix computations to transform datasets to different dimensions

These methods will help us develop a better sense of which features are important within our data. In this chapter, we will dive deeper into the first four methods and leave the other three for future chapters.
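
To make the first four methods concrete before we look at each in detail, here is a minimal sketch using pandas. The toy DataFrame, its column names, and the 50% missing-value threshold are all assumptions made up for illustration; they do not come from the datasets used in this chapter, and mean imputation with z-score standardization is only one of several reasonable choices.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; np.nan marks a missing value.
df = pd.DataFrame({
    "age": [25.0, np.nan, 35.0, 45.0],
    "income": [40000.0, 52000.0, np.nan, 61000.0],
    "notes": [np.nan, np.nan, np.nan, "vip"],
})

# 1. Identify missing values: count them per column.
missing_counts = df.isnull().sum()

# 2. Remove harmful data: drop columns that are mostly empty.
#    The 0.5 threshold is an arbitrary assumption for this sketch.
mostly_empty = df.columns[df.isnull().mean() > 0.5]
cleaned = df.drop(columns=mostly_empty)

# 3. Impute: fill each remaining gap with its column mean.
imputed = cleaned.fillna(cleaned.mean())

# 4. Standardize: rescale each column to zero mean and unit variance (z-score).
standardized = (imputed - imputed.mean()) / imputed.std(ddof=0)

print(missing_counts)
print(standardized)
```

Each step works on the result of the previous one, which mirrors the order in which a cleaning pipeline is usually assembled: find the gaps, discard what cannot be salvaged, fill what can, and only then rescale.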
