Section 1 – Data Cleaning and Machine Learning Algorithms

I try to avoid thinking about the parts of the model-building process as a strict sequence: cleaning data, then preprocessing, and so on through model validation. I do not want to think of that process as having phases that ever end. We start with data cleaning in this section, but I hope its chapters convey that we are always looking ahead, anticipating modeling challenges as we clean data, and that we typically look back on the data cleaning we have done when we evaluate our models.

To some extent, the clean and dirty metaphor hides the nuance in preparing data for subsequent analysis. The real concern is how representative our instances and attributes (observations and variables) are of the phenomena of interest. That representativeness can always be improved, and it is easily made worse without care. One thing is certain, though: nothing we can do in any other part of the model-building process will make right something important that we got wrong during data cleaning.

The first three chapters of this book are about getting our data as right as we can. To do that, we need a good sense of how all variables, both features and targets, are distributed. There are three questions we should ask ourselves before we do any formal analysis: 1) Are we confident that we know the full range of values, and the shape of the distribution, of every variable of interest? 2) Do we have a good idea of the bivariate relationships between variables, that is, how each moves with the others? 3) How successful are our attempts to fix potential problems, such as outliers and missing values? The chapters in this section provide the tools you need to answer these questions.
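To make the three questions a little more concrete, here is a minimal sketch of what a first pass at each might look like. It uses only the Python standard library so that it runs anywhere; the toy data, the column names, and the helper functions (summarize, pearson, iqr_outliers) are all hypothetical and are not taken from the chapters that follow. The interquartile-range rule used to flag outliers is one common convention among several, not the only reasonable choice.

```python
import math
import statistics

# Hypothetical toy data for illustration; None marks a missing value.
rows = [
    {"age": 23, "income": 31000},
    {"age": 35, "income": 52000},
    {"age": 41, "income": None},
    {"age": 29, "income": 40000},
    {"age": 52, "income": 250000},
    {"age": None, "income": 45000},
    {"age": 38, "income": 58000},
]


def column(rows, name):
    """Pull one variable out of the list of records."""
    return [row[name] for row in rows]


def summarize(values):
    """Question 1: the range and rough shape of a single variable."""
    present = [v for v in values if v is not None]
    q1, median, q3 = statistics.quantiles(present, n=4)
    return {
        "n": len(present),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "q1": q1,
        "median": median,
        "q3": q3,
    }


def pearson(xs, ys):
    """Question 2: linear association, using only rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    mean_x = statistics.fmean([x for x, _ in pairs])
    mean_y = statistics.fmean([y for _, y in pairs])
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


def iqr_outliers(values, k=1.5):
    """Question 3: flag values more than k * IQR beyond the quartiles."""
    stats = summarize(values)
    iqr = stats["q3"] - stats["q1"]
    low, high = stats["q1"] - k * iqr, stats["q3"] + k * iqr
    return [v for v in values if v is not None and (v < low or v > high)]


income_summary = summarize(column(rows, "income"))
age_income_r = pearson(column(rows, "age"), column(rows, "income"))
income_outliers = iqr_outliers(column(rows, "income"))

print(income_summary)
print(round(age_income_r, 3))
print(income_outliers)
```

Note that the correlation here is computed on complete pairs only, and that the single extreme income both gets flagged as an outlier and heavily influences the correlation. That interplay is exactly why the three questions are best asked together rather than one after another.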

This section comprises the following chapters:

  • Chapter 1, Examining the Distribution of Features and Targets
  • Chapter 2, Examining Bivariate and Multivariate Relationships between Features and Targets
  • Chapter 3, Identifying and Fixing Missing Values