Handling missing values

In the preprocessing work that we just completed, we decided to remove missing values. Deletion is a reasonable option when very few cases contain missing values, which was true in this example. Other situations may call for a different approach. Here are some common options in addition to deleting rows and columns:

  • Imputation with a measure of centrality (mean/median/mode): Use one of the measures of centrality to fill in the missing values. The mean can work well if you have normally distributed numeric data; the median is less sensitive to skew and outliers. Modal imputation can also be used on non-numeric data by replacing the missing values with the most frequent value.
  • Predict the missing values: You can use the known values to estimate the missing ones. Examples of this approach include fitting a regression when the relationship between the variables is roughly linear, or using the k-nearest neighbor (KNN) algorithm to assign a value based on similarity to known cases in the feature space.
  • Replace it with a constant value: The missing value can also be replaced with a constant that lies outside the range of the observed values or, for categorical data, with a label that is not already present. The advantage is that the missing values stay clearly set to the side, so it becomes clear later on whether they carry any informational value. This is in contrast to imputing with a measure of centrality: afterward, some entries hold the imputed value because they were missing, while others held that same value all along, and it becomes difficult to tell which were which.
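The three options above can be sketched in a few lines. The following is a minimal illustration in Python using only the standard library, not the code used elsewhere in this book; the function names (impute_central, impute_constant, impute_knn) and the toy data are hypothetical, missing values are assumed to be represented as None, and the KNN helper assumes that every column other than the one being filled is complete.

```python
import math
from statistics import mean, median, mode


def impute_central(values, strategy="median"):
    """Fill None entries with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]


def impute_constant(values, sentinel):
    """Fill None entries with a sentinel that must not already occur in the data."""
    if sentinel in values:
        raise ValueError("sentinel already occurs in the data")
    return [sentinel if v is None else v for v in values]


def impute_knn(rows, target, k=2):
    """Fill None in column `target` with the mean of that column over the
    k nearest complete rows (Euclidean distance on the other columns).
    Assumes all columns other than `target` have no missing values."""
    complete = [r for r in rows if r[target] is not None]
    filled = []
    for r in rows:
        new = list(r)
        if r[target] is None:
            def dist(other):
                # Distance ignores the target column, which is missing in r.
                return math.sqrt(sum(
                    (a - b) ** 2
                    for i, (a, b) in enumerate(zip(r, other))
                    if i != target
                ))
            nearest = sorted(complete, key=dist)[:k]
            new[target] = mean(n[target] for n in nearest)
        filled.append(new)
    return filled


# Toy data for illustration only.
ages = [22, None, 30, 26, None]
colors = ["red", None, "blue", "red"]
rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [2.1, None]]

ages_median = impute_central(ages, "median")
colors_mode = impute_central(colors, "mode")
ages_flagged = impute_constant(ages, -1)
rows_knn = impute_knn(rows, target=1, k=2)
```

Note that after impute_constant the two filled entries in ages_flagged are still identifiable by the sentinel, whereas in ages_median the filled entries are indistinguishable from the row that genuinely contained the median value.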