Chapter 2

Data Cleaning and Pre-processing

Learning Objectives

By the end of this chapter, you will be able to:

  • Perform the sort, rank, filter, subset, normalize, scale, and join operations in an R data frame.
  • Identify and handle outliers, missing values, and duplicates gracefully using the MICE and rpart packages.
  • Perform undersampling and oversampling on a dataset.
  • Apply the concepts of ROSE and SMOTE to handle unbalanced data.

This chapter covers the important concepts of handling data and making the data ready for analysis.

Introduction

Data cleaning and preparation takes about 70% of the effort in the entire process of a machine learning project. This step is essential because the quality of the data determines the accuracy of the prediction model. A clean dataset should contain good samples of the scenarios that we want to predict, which will give us good prediction results. The data should also be balanced, which means that every category we want to predict should have a similar number of samples. For example, if we want to predict whether it will rain on a particular day, and the sample size is 100, the data could contain 40 samples for "It will rain" and 60 samples for "It will not rain", or vice versa. However, if the ratio is 20:80 or 30:70, the dataset is unbalanced, and this will not yield good results for the minority class.

In the following section, we will look at the essential operations performed on data frames in R. These operations will help us to manipulate and analyze the data. The datasets we will be utilizing in this chapter are as follows:

Figure 2.1: Datasets

We will begin with advanced operations on R data frames.

Advanced Operations on Data Frames

In the previous chapter, we performed a number of operations on data frames, including rbind(). There are many more operations that can be performed on data frames, which are very useful while preparing the data for the model. The following exercises will describe these operations in detail and illustrate them through their corresponding implementation in R:

  • The order function: The order function is used to sort a data frame. It returns the permutation of row indices that sorts the data, and prefixing a column with the "-" symbol sorts that column in descending order.
  • The sort function: The sort function can also be used to sort the data. The sort order is specified with decreasing = TRUE or decreasing = FALSE.
  • The rank function: The rank function returns the rank of each value in a numeric vector.

Sorting, ordering, and ranking act as simple techniques for identifying outliers, that is, values that are too large or too small to fit the expected range. Because datasets are often messy, these operations help us scan through records quickly and decide what needs fixing next. Hence, they form the basis of pre-processing in R.
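To make the difference between these three functions concrete, the following minimal sketch applies them to a small toy vector (the values in the comments are what R returns):

# A toy vector to contrast order(), sort(), and rank()

x <- c(30, 10, 20)

order(x)   # 2 3 1 -> the indices that would sort x in ascending order

sort(x)    # 10 20 30 -> the sorted values themselves

rank(x)    # 3 1 2 -> the rank of each element in its original position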

Exercise 6: Sorting the Data Frame

In this exercise, we will be organizing the data using the order(), sort(), and rank() functions. We will be using the built-in PimaIndiansDiabetes dataset:

  1. Load the library and dataset:

    library(mlbench)

    library(caret)

    data(PimaIndiansDiabetes)

  2. Sort the dataset by glucose values using the order() function:

    # sort by glucose

    sorted_data <- PimaIndiansDiabetes[order(PimaIndiansDiabetes$glucose),]

    # View the output

    head(sorted_data)

    The output is as follows:

    Figure 2.2: The first rows of the data sorted by glucose
  3. Sort the dataset by glucose and pressure values using the order() function:

    # sort by glucose and pressure

    sorted_data <- PimaIndiansDiabetes[order(PimaIndiansDiabetes$glucose, PimaIndiansDiabetes$pressure),]

    head(sorted_data)

    The output is as follows:

    Figure 2.3: The first rows of data sorted by glucose and pressure
  4. Sort the dataset by glucose values and in descending order by pressure values using the order() function:

    #Sort in ascending order by glucose and descending order by pressure

    sorted_data <- PimaIndiansDiabetes[order(PimaIndiansDiabetes$glucose, -PimaIndiansDiabetes$pressure),]

    head(sorted_data)

    The output is as follows:

    Figure 2.4: The first rows of data sorted by glucose and pressure in descending order
  5. Sort the dataset by glucose using the sort() function:

    #Using the sort function to sort glucose

    sort(PimaIndiansDiabetes$glucose)

    The output is as follows:

    Figure 2.5: A section of sorted glucose values
  6. Sort the dataset by glucose in descending order using the sort() function:

    #Sort in descending order

    sort(PimaIndiansDiabetes$glucose, decreasing = TRUE)

    The output is as follows:

    Figure 2.6: A section of glucose values in descending order
  7. Rank the values of glucose using the rank() function:

    #Using the rank function to rank the values of glucose

    rank(PimaIndiansDiabetes$glucose)

    The output is as follows:

Figure 2.7: A section of the ranked values of glucose

Through these given examples, we have learned how to sort using order(), sort(), and rank().

Join Operations

Join operations are extremely useful when the data is spread across multiple tables. We can merge two datasets/data frames on a common column using a join operation. For instance, if one data frame contains credit card transaction information and the other data frame contains the credit card customer information, and the two have to be merged based on customer ID, then we must use a join operation to perform the merge. In this chapter, we will look at the inner join, the outer join, the left join, and the right join in detail.

Inner join: The inner join returns only the rows where the key values match exactly in both data frames.

The syntax is as follows:

merge(df1, df2, by="fields used to merge")

Note that in the preceding code line, the following abbreviations are used:

df1 <- dataframe1

df2 <- dataframe2

The code to perform an inner join is as follows:

# Inner Join

data1 <- head(PimaIndiansDiabetes)

data2 <- head(PimaIndiansDiabetes)

merge(data1, data2, by='glucose')

The output is as follows:

Figure 2.8: An inner join

Outer join: The outer join joins on exact matches, but it also keeps the unmatched rows from both data frames.

The syntax is as follows:

merge(df1, df2, by="common_key_column", all=TRUE)

The code to perform outer join is as follows:

#Outer Join

data1 <- head(PimaIndiansDiabetes)

data2 <- tail(PimaIndiansDiabetes)

merge(data1, data2, by='glucose', all=TRUE)

The output is as follows:

Figure 2.9: An outer join

Left outer join: The left outer join joins on exact matches, but it also keeps the unmatched rows from df1.

The syntax is as follows:

merge(df1, df2, by="common_key_column", all.x=TRUE)

The code to perform a left outer join is as follows:

#Left Join

data1 <- head(PimaIndiansDiabetes)

data2 <- tail(PimaIndiansDiabetes)

merge(data1, data2, by='glucose',all.x=TRUE)

The output is as follows:

Figure 2.10: A left outer join

Right outer join: The right outer join joins on exact matches, but it also keeps the unmatched rows from df2.

The syntax is as follows:

merge(df1, df2, by="common_key_column", all.y=TRUE)

The code to perform a right outer join is as follows:

#Right Join

data1 <- head(PimaIndiansDiabetes)

data2 <- tail(PimaIndiansDiabetes)

merge(data1, data2, by='glucose',all.y=TRUE)

The output is as illustrated:

Figure 2.11: A right outer join

Pre-Processing of Data Frames

Pre-processing is done on data frames to improve the quality of the dataset. At times, values are spread over very different ranges, and it becomes essential to bring them to a common scale without distorting the differences between the values.

Standardizing is a pre-processing technique that transforms data values so that they can be compared with each other. For instance, if age is in the range of 1-100 and salary is in the range of 2000-60000, the two fields cannot be directly compared because the ranges of values they take are different. Therefore, we transform the values so that the mean is 0 and the standard deviation is 1. Standardization can be performed using:

  • Scale: Each value in the variable is divided by the standard deviation, bringing it to a scale where a value of 1 corresponds to one standard deviation of the original variable.
  • Normalize: Squeezing the data into the range of 0-1 is called normalization.
  • Center: The mean of the column is subtracted from each value in the column, so the new mean of the column becomes 0.

When values have different scales, they contribute differently to the analysis. It is good to scale features wherever distances or gradients are computed, as in k-nearest neighbors, Principal Component Analysis (PCA), and gradient descent; tree-based models, by contrast, are largely insensitive to feature scale.
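The arithmetic behind these operations is straightforward. The following minimal base R sketch applies each transformation to a toy vector; the exercises that follow perform the same operations on the dataset using caret's preProcess() function:

# A toy vector to illustrate centering, scaling, and normalizing

x <- c(2, 4, 6, 8, 10)

x - mean(x)                       # center: subtract the mean, so the new mean is 0

x / sd(x)                         # scale: divide by the standard deviation

(x - min(x)) / (max(x) - min(x))  # normalize: squeeze the values into the 0-1 range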

The preProcess() function in R can take 16 arguments. In the upcoming exercise, we will look at the method argument. The method argument is a vector that specifies the types of processing to apply. The processing types we will look at in detail are center, scale, and pca.

Exercise 7: Centering Variables

In this exercise, we will perform the center pre-processing operation on the Pima Indians diabetes dataset:

  1. Attach the packages:

    #Attach the packages

    library(mlbench)

    library(caret)

    # load the dataset PimaIndiansDiabetes

    data(PimaIndiansDiabetes)

  2. View the summary of the dataset:

    # view data

    summary(PimaIndiansDiabetes[,1:2])

    The output is as follows:

    pregnant         glucose     

    Min.   : 0.000   Min.   :  0.0  

    1st Qu.: 1.000   1st Qu.: 99.0  

    Median : 3.000   Median :117.0  

    Mean   : 3.845   Mean   :120.9  

    3rd Qu.: 6.000   3rd Qu.:140.2  

    Max.   :17.000   Max.   :199.0  

  3. Perform the center operation:

    params <- preProcess(PimaIndiansDiabetes[,1:2],

                         method=c("center"))

    print(params)

    The output is as follows:

    Created from 768 samples and 2 variables

    Pre-processing:

      - centered (2)

      - ignored (0)

  4. Transform using the previous method:

    # transform the dataset using the parameters

    new_dataset <- predict(params, PimaIndiansDiabetes[,1:2])

    The predict() function takes the pre-processing parameters stored in params and transforms the variables accordingly.

  5. Summarize the transformed dataset:

    # summarize the transformed dataset

    summary(new_dataset)

    The output is as follows:

    pregnant          glucose        

    Min.   :-3.8451   Min.   :-120.895  

    1st Qu.:-2.8451   1st Qu.: -21.895  

    Median :-0.8451   Median :  -3.895  

    Mean   : 0.0000   Mean   :   0.000  

    3rd Qu.: 2.1549   3rd Qu.:  19.355  

    Max.   :13.1549   Max.   :  78.105  

The new values are obtained by subtracting the mean from the original values, so the mean of each transformed column is now zero. In the next exercise, we will normalize the values using the range method.

Exercise 8: Normalizing the Variables

In this exercise, we will perform the range operation during pre-processing on the PimaIndiansDiabetes dataset:

  1. Attach the caret and mlbench packages

    #Attach the caret and mlbench packages

    library(mlbench)

    library(caret)

    # load the dataset PimaIndiansDiabetes

    data(PimaIndiansDiabetes)

  2. View the summary of the dataset:

    # view the data

    summary(PimaIndiansDiabetes[,1:2])

    The output is as follows:

      pregnant         glucose     

    Min.   : 0.000   Min.   :  0.0  

    1st Qu.: 1.000   1st Qu.: 99.0  

    Median : 3.000   Median :117.0  

    Mean   : 3.845   Mean   :120.9  

    3rd Qu.: 6.000   3rd Qu.:140.2  

    Max.   :17.000   Max.   :199.0  

  3. Transform the dataset using range:

    # To normalize, we will use the range method

    params <- preProcess(PimaIndiansDiabetes[,1:2], method=c("range"))

    print(params)

    The output is as follows:

    Created from 768 samples and 2 variables

    Pre-processing:

      - ignored (0)

      - re-scaling to [0, 1] (2)

  4. Use predict() to create the new pre-processed dataset:

    # Transform the dataset using the parameters

    new_dataset <- predict(params, PimaIndiansDiabetes[,1:2])

  5. View the summary of the transformed dataset:

    # summarize the transformed dataset

    summary(new_dataset) 

    The output is as follows:

        pregnant          glucose      

    Min.   :0.00000   Min.   :0.0000  

    1st Qu.:0.05882   1st Qu.:0.4975  

    Median :0.17647   Median :0.5879  

    Mean   :0.22618   Mean   :0.6075  

    3rd Qu.:0.35294   3rd Qu.:0.7048  

    Max.   :1.00000   Max.   :1.0000  

We have successfully normalized the values, and the values now lie between 0 and 1.

Exercise 9: Scaling the Variables

In this exercise, we will perform the scale operation during pre-processing of the PimaIndiansDiabetes dataset:

  1. Load the dataset:

    data(PimaIndiansDiabetes)

  2. View the summary of the dataset:

    summary(PimaIndiansDiabetes[,1:2])

    The output is as follows:

      pregnant         glucose     

    Min.   : 0.000   Min.   :  0.0  

    1st Qu.: 1.000   1st Qu.: 99.0  

    Median : 3.000   Median :117.0  

    Mean   : 3.845   Mean   :120.9  

    3rd Qu.: 6.000   3rd Qu.:140.2  

    Max.   :17.000   Max.   :199.0

  3. Transform the dataset using scale:

    # to scale we will use scale keyword

    params <- preProcess(PimaIndiansDiabetes[,1:2], method=c("scale"))

    print(params)

    The output is as follows:

    Created from 768 samples and 2 variables

    Pre-processing:

      - ignored (0)

      - scaled (2)

  4. Use predict() to create the new pre-processed dataset:

    #Scale the data

    new_dataset <- predict(params, PimaIndiansDiabetes[,1:2])

  5. View the summary of the transformed dataset:

    # summarize the transformed dataset

    summary(new_dataset)

    The output is as follows:

        pregnant         glucose     

    Min.   :0.0000   Min.   :0.000  

    1st Qu.:0.2968   1st Qu.:3.096  

    Median :0.8903   Median :3.659  

    Mean   :1.1411   Mean   :3.781  

    3rd Qu.:1.7806   3rd Qu.:4.387  

    Max.   :5.0451   Max.   :6.224  

Thus, we have learned to perform the scale operation.

Activity 6: Centering and Scaling the Variables

In this activity, we will perform the center and scale operations during pre-processing on the PimaIndiansDiabetes dataset.

The dataset can be found at https://github.com/TrainingByPackt/Practical-Machine-Learning-with-R/blob/master/Data/PimaIndiansDiabetes.csv.

These are the steps that will help you solve the activity:

  1. Load the dataset.
  2. View the summary of the dataset.
  3. Find the parameters for centering and scaling with preProcess().
  4. Use predict() to transform the dataset.
  5. View the summary of the transformed dataset.

The summary of the new dataset will be as follows:

          pregnant          glucose       

Min.   :-1.1411   Min.   :-3.7812  

1st Qu.:-0.8443   1st Qu.:-0.6848  

Median :-0.2508   Median :-0.1218  

Mean   : 0.0000   Mean   : 0.0000  

3rd Qu.: 0.6395   3rd Qu.: 0.6054  

Max.   : 3.9040   Max.   : 2.4429

Note

The solution for this activity can be found on page 326.

Extracting the Principal Components

We will extract the principal components from the variables/columns in the dataset. These components are linear combinations of the features in the dataset, created so that they retain the maximum information and variance. The first component captures the maximum variance, and the variance captured decreases in the successive components.

In the next exercise, we will generate the principal components for the PimaIndiansDiabetes dataset.

Exercise 10: Extracting the Principal Components

In this exercise, we will perform pre-processing just as we did in the previous exercises, using center, scale, and pca. We will follow the same approach as in Exercise 8, Normalizing the Variables.

  1. Load the dataset:

    # load the dataset

    data(PimaIndiansDiabetes)

  2. Set the params:

    params <- preProcess(PimaIndiansDiabetes, method=c("center", "scale",

                                                       "pca"))

  3. Perform PCA from the parameters:

    # perform pca on the dataset using the parameters

    new_dataset <- predict(params, PimaIndiansDiabetes)

  4. View the new dataset:

    # view the new dataset

    summary(new_dataset)

    The output is as follows:

    Figure 2.12: Our data frame after the center, scale and PCA operations

    We note that there are eight principal components for the PimaIndiansDiabetes dataset.

  5. Compare new_dataset with the original using the glimpse() function from the dplyr package:

    library(dplyr)

    glimpse(PimaIndiansDiabetes)

    The output is as follows:

    Observations: 768

    Variables: 9

    $ pregnant <dbl> 6, 1, 8, 1, 0, 5, 3, 10, 2, ...

    $ glucose  <dbl> 148, 85, 183, 89, 137, 116, ...

    $ pressure <dbl> 72, 66, 64, 66, 40, 74, 50, ...

    $ triceps  <dbl> 35, 29, 0, 23, 35, 0, 32, 0,...

    $ insulin  <dbl> 0, 0, 0, 94, 168, 0, 88, 0, ...

    $ mass     <dbl> 33.6, 26.6, 23.3, 28.1, 43.1...

    $ pedigree <dbl> 0.627, 0.351, 0.672, 0.167, ...

    $ age      <dbl> 50, 31, 32, 21, 33, 30, 26, ...

    $ diabetes <fct> pos, neg, pos, neg, pos, neg...

  6. The glimpse of the new_dataset is as follows:

    glimpse(new_dataset)

    The output is as follows:

    Observations: 768

    Variables: 9

    $ diabetes <fct> pos, neg, pos, neg, pos, neg...

    $ PC1      <dbl> -1.0678069, 1.1209528, 0.396...

    $ PC2      <dbl> 1.2340908, -0.7333737, 1.594...

    $ PC3      <dbl> 0.09586737, -0.71247385, 1.7...

    $ PC4      <dbl> 0.49666654, 0.28487058, -0.0...

    $ PC5      <dbl> -0.10991328, -0.38925352, 0....

    $ PC6      <dbl> 0.35694989, -0.40606472, -0....

    $ PC7      <dbl> -0.85826202, -0.75654101, 1....

    $ PC8      <dbl> 0.97366903, 0.35398386, 1.06...

Since these components capture the maximum variance in the data, they are well suited for modeling.
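For comparison, the same kind of decomposition can be computed directly with base R's prcomp() function. This is only a sketch of the equivalent base R route, not the approach used in the exercise:

# PCA on the eight numeric columns using base R

pca_fit <- prcomp(PimaIndiansDiabetes[, 1:8], center = TRUE, scale. = TRUE)

summary(pca_fit)   # proportion of variance explained by each component

head(pca_fit$x)    # the principal component scores for the first rows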

Subsetting Data

Subsetting data means filtering the rows based on certain criteria or selecting specific columns from a dataset. The syntax for subsetting is as follows:

  • The syntax to select columns is dataframe[c('col1','col2','col3')].
  • If the columns are toward the start or end of the data frame, another syntax is dataframe[, col1:coln], where col1 is the index of the first column and coln is the index of the last column you would like to select.
  • The syntax to select a range of columns is dataset[c(indexn:indexm)].
  • The syntax to subset a data frame is subset(dataframe, cond1 & cond2, select=c(col1,col2)).

Often, we need only a section of the data rather than the entire dataset; subsetting enables us to use just that section of the data frame for analysis, which also makes the analysis quicker.

Exercise 11: Subsetting a Data Frame

In this exercise, we will subset a data frame using data operations:

  1. Load the dataset:

    library(mlbench)

    data("PimaIndiansDiabetes")

  2. Select the age, glucose, and pressure columns from the PimaIndiansDiabetes dataset:

    # select variables age, glucose, pressure

    myvars <- c("age", "glucose", "pressure")

    newdata <- PimaIndiansDiabetes[myvars]

    head(newdata)

    The output is as follows:

      age glucose pressure

    1  50     148       72

    2  31      85       66

    3  32     183       64

    4  21      89       66

    5  33     137       40

    6  30     116       74

  3. Select the first three columns in the PimaIndiansDiabetes dataset:

    # another method

    newdata <- PimaIndiansDiabetes[, 1:3]

    head(newdata)

    The output is as follows:

      pregnant glucose pressure

    1        6     148       72

    2        1      85       66

    3        8     183       64

    4        1      89       66

    5        0     137       40

    6        5     116       74

  4. Select the first column, and columns 5 through 9:

    # select 1st and 5th through 9th variables

    newdata <- PimaIndiansDiabetes[c(1,5:9)]

    head(newdata)

    The output is as follows:

      pregnant insulin mass pedigree age diabetes

    1        6       0 33.6    0.627  50      pos

    2        1       0 26.6    0.351  31      neg

    3        8       0 23.3    0.672  32      pos

    4        1      94 28.1    0.167  21      neg

    5        0     168 43.1    2.288  33      pos

    6        5       0 25.6    0.201  30      neg

  5. Select the insulin and age columns for rows that satisfy the insulin >= 20 and age < 30 conditions:

    # using subset function

    newdata <- subset(PimaIndiansDiabetes,

                      insulin >= 20 & age < 30,

                      select=c(insulin, age))

    head(newdata)

    The output is as follows:

       insulin age

    4       94  21

    7       88  26

    21     235  27

    28     140  22

    32     245  28

    33      54  22

Thus, we have selected the required part of the data frame in each of these cases.

Data Transposes

Let's transpose the PimaIndiansDiabetes data frame; that is, convert the columns to rows and rows to columns. Use the t(dataframe) syntax as follows:

#Transpose Data

t_PimaIndiansDiabetes <- head(t(PimaIndiansDiabetes))

head(PimaIndiansDiabetes)

The first six rows of the original dataset are as follows:

Figure 2.13: The original data

The transposed dataset is as follows:

head(t_PimaIndiansDiabetes)

The output is as follows:

Figure 2.14: The transposed dataset

Often, it is essential to transpose a dataset before we use it for analysis.

Identifying the Input and Output Variables

For any dataset, we should identify the input variables and the output variables. For the iris dataset, the input variables are the following:

  1. Sepal.Length
  2. Sepal.Width
  3. Petal.Length
  4. Petal.Width

The output variable, or the field to be predicted, is Species.
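A minimal sketch of separating these input variables from the output variable in R is as follows:

# Split iris into input features and the output variable

input_vars <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]

output_var <- iris$Species

str(input_vars)

table(output_var)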

Identifying the Category of Prediction

Based on the category of prediction, we will perform different pre-processing steps. The category of prediction could be any of these:

  • Categorical Prediction: In this type of prediction, the output to be predicted will have class values such as yes, no, or given categories.
  • Numeric Prediction: In a numeric prediction, the output that will be predicted is a numeric value, such as predicting the cost of a house.
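A quick, informal way to check which category applies is to inspect the class of the output variable: a factor implies a categorical prediction, while a numeric variable implies a numeric prediction:

# Check the class of the target variable

class(iris$Species)   # "factor"  -> categorical prediction

class(mtcars$mpg)     # "numeric" -> numeric prediction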

Handling Missing Values, Duplicates, and Outliers

In any dataset, we might have missing values, duplicate values, or outliers. We need to ensure that these are handled appropriately so that the data used by the model is clean.

Handling Missing Values

Missing values in a data frame can affect the model during the training process. Therefore, they need to be identified and handled during the pre-processing stage. They are represented as NA in a data frame. Using the example that follows, we will see how to identify a missing value in a dataset.

Using the is.na(), complete.cases(), and md.pattern() functions, we will identify the missing values.

The is.na() function, as the name suggests, returns TRUE for elements that are marked NA or, for numeric or complex vectors, NaN (Not a Number), and FALSE otherwise. The complete.cases() function returns TRUE for rows that contain no missing values, and md.pattern() gives a summary of the missing-value patterns in the data.
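The following toy data frame (a made-up example, separate from the exercise dataset) illustrates what each of these functions returns:

# A tiny data frame with two missing values

df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))

is.na(df)             # element-wise matrix: TRUE wherever a value is missing

complete.cases(df)    # TRUE FALSE FALSE -> rows with no missing values

mice::md.pattern(df)  # tabulates the missing-value patterns (requires the mice package)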

Exercise 12: Identifying the Missing Values

In the following example, we are adding rows with missing values to the PimaIndiansDiabetes dataset. We will then convert the columns of this dataset into numeric values. Using the is.na() and complete.cases() functions from base R and the md.pattern() function from the mice package, we will identify the missing values.

  1. Add the missing values:

    library(mlbench)

    data("PimaIndiansDiabetes")

    #Adding NA values

    PimaIndiansDiabetes_new <- rbind(

      PimaIndiansDiabetes,c(1, 212,NA,NA,3,44,0.45,23,"neg"))

    PimaIndiansDiabetes_new <- rbind(

      PimaIndiansDiabetes_new,c(1, 212,NA,NA,3,44,0.45,23,"pos"))

  2. Convert the characters to numeric:

    #Convert character to numeric

    PimaIndiansDiabetes_new$pregnant=as.numeric(

      PimaIndiansDiabetes_new$pregnant)

    PimaIndiansDiabetes_new$glucose=as.numeric(

      PimaIndiansDiabetes_new$glucose)

    PimaIndiansDiabetes_new$pressure=as.numeric(

      PimaIndiansDiabetes_new$pressure)

    PimaIndiansDiabetes_new$triceps=as.numeric(

      PimaIndiansDiabetes_new$triceps)

    PimaIndiansDiabetes_new$insulin=as.numeric(

      PimaIndiansDiabetes_new$insulin)

    PimaIndiansDiabetes_new$mass=as.numeric(

      PimaIndiansDiabetes_new$mass)

    PimaIndiansDiabetes_new$pedigree=as.numeric(

      PimaIndiansDiabetes_new$pedigree)

    PimaIndiansDiabetes_new$age=as.numeric(

      PimaIndiansDiabetes_new$age)

    PimaIndiansDiabetes_new$diabetes=as.numeric(

      PimaIndiansDiabetes_new$diabetes)

  3. Identify the missing values by combining the ! (not) logical operator with the complete.cases() function. This lists the rows that are not complete cases:

    #Identifying missing values

    #List the rows containing missing values

    PimaIndiansDiabetes_new[

      !complete.cases(

        PimaIndiansDiabetes_new),]

    The output is as follows:

    Figure 2.15: The identified missing values
  4. Use the is.na() function to find NA values:

    is.na(PimaIndiansDiabetes_new)

    The output is as follows:

    Figure 2.16: The section of output showing the NA values as TRUE
  5. Find the last rows using the tail() function:

    tail(is.na(PimaIndiansDiabetes_new) )

    The output is as follows:

    Figure 2.17: Identifying the missing values
  6. Find the missing data using the md.pattern() function:

    library(mice)

    md.pattern(PimaIndiansDiabetes_new)

    The output is as follows:

Figure 2.18: The output from MICE

Now that we have identified the missing values, it's time to handle them gracefully.

Techniques for Handling Missing Values

When we encounter missing values, we can handle them in a couple of ways, such as deleting the affected rows or replacing the missing values with the mean or median of the field.

In the following section, these techniques will be illustrated in detail with examples:

  • Delete missing values: Deleting the row with the missing value will remove the bad data from the dataset:

    #Remove rows containing missing values

    newdata <- na.omit(PimaIndiansDiabetes_new)

    is.na(newdata)

    The is.na() function will now return FALSE for every element, confirming that no NA values remain in newdata.

  • Impute missing values using the MICE package: Another approach is imputation, where the missing value is replaced with the mean, median, or mode of the field, or with a value predicted by the mice package (a minimal sketch of simple mean imputation follows this list).
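A minimal sketch of simple mean imputation, assuming the PimaIndiansDiabetes_new data frame created in Exercise 12, is as follows:

#Replace the missing pressure values with the mean of the observed values

mean_pressure <- mean(PimaIndiansDiabetes_new$pressure, na.rm = TRUE)

PimaIndiansDiabetes_new$pressure[is.na(PimaIndiansDiabetes_new$pressure)] <- mean_pressure

sum(is.na(PimaIndiansDiabetes_new$pressure))   # 0 -> no missing pressure values remain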

In the next exercise, we will learn how to impute using the MICE package.

Exercise 13: Imputing Using the MICE Package

This exercise will give us an overview of the mice package. It is a continuation of the previous exercise; here, we will impute the missing values using the mice() and complete() functions.

  1. Find the NA values:

    #View the NA

    tail(PimaIndiansDiabetes_new)

    The output is as follows:

    Figure 2.19: The NA values in the PimaIndiansDiabetes dataset
  2. Import the MICE package:

    library(mice)

  3. Impute the data using MICE:

    impute_step1 = mice(PimaIndiansDiabetes_new)

    imputed_data = complete(impute_step1)

    The output is as follows:

    Figure 2.20: Imputed data
  4. View the imputed values:

    #View the imputed values

    tail(imputed_data)

    The output is as follows:

Figure 2.21: Imputed values

Thus, we have imputed the NA values using MICE.

Exercise 14: Performing Predictive Mean Matching

The abbreviation pmm is short for predictive mean matching. It will predict the value to be written into the missing field. A sample is as follows:

mice(Dataset, m=1,maxit=30,meth='pmm',seed=50)

In the preceding line of code, the following are used:

m = The number of imputed datasets

meth = The imputation method used. Other methods can also be used.

maxit = The number of iterations for each imputation

In this exercise, we will predict the missing value using pmm.

  1. Find the NA values:

    tail(PimaIndiansDiabetes_new)

    The output is as follows:

    Figure 2.22: NA values in the PimaIndiansDiabetes dataset
  2. Impute the values using MICE and add pmm as the method:

    impute_step1 <- mice(PimaIndiansDiabetes_new,

      m=5,maxit=30,meth='pmm',seed=50)

    The output is as follows:

    Figure 2.23: A section of output after imputation
  3. Find the summary:

    summary(impute_step1)

    The output is as follows:

    Figure 2.24: Imputed summary
  4. Use the complete() function on the imputed data:

    completedData <- complete(impute_step1,1)

    tail(completedData)

    The output is as follows:

Figure 2.25: The last values imputed

Thus, we have imputed values using pmm.

Handling Duplicates

Duplicate data means rows that repeat themselves in the dataset. These rows need to be removed, as they reduce the quality of the data. If the training data contains duplicates, they can bias the model towards predicting those repeated samples well, at the expense of learning the other (non-duplicate) samples.

Exercise 15: Identifying Duplicates

There are functions in R that can be used to identify the duplicates in the data frame. We will identify duplicates using the duplicated() function.

  1. Add a duplicate value:

    #Adding duplicate values

    PimaIndiansDiabetes_new <- rbind(

      PimaIndiansDiabetes,c(1, 93,70,31,0,30.4,0.315,23,"pos"))

    PimaIndiansDiabetes_new <- rbind(

      PimaIndiansDiabetes_new,c(1, 93,70,31,0,30.4,0.315,23,"pos"))

    PimaIndiansDiabetes_new <- rbind(

      PimaIndiansDiabetes_new,c(1, 93,70,31,0,30.4,0.315,23,"pos"))

  2. Identify duplicates using the duplicated() function:

    #Identify Duplicates

    duplicated(PimaIndiansDiabetes_new)

    The output is as follows:

    Figure 2.26: A section of the output for duplicate values
  3. Display the duplicate values:

    #Display the duplicates

    PimaIndiansDiabetes_new[duplicated(PimaIndiansDiabetes_new),]

    The output is as follows:

    Figure 2.27: A section of the output for duplicate values
  4. Find the value that has been duplicated:

    #Display the unique values of the list of duplicates

    unique(PimaIndiansDiabetes_new[duplicated(PimaIndiansDiabetes_new),])

    The output is as follows:

    Figure 2.28: A section of the output for duplicate values
  5. Identify the unique values:

    #Display the unique values

    unique(PimaIndiansDiabetes_new)

    The output is as follows:

Figure 2.29: A section of output

Thus, the unique() and duplicated() functions can be used to eliminate duplicate values.

Techniques Used to Handle Duplicate Values

A technique to handle duplicate values is to remove duplicate rows:

#Remove duplicates

unique_data <- PimaIndiansDiabetes_new[!duplicated(PimaIndiansDiabetes_new),]

In the next section, we will handle outliers.

Handling Outliers

Any datapoint with a value that is very different from the other data points is an outlier. Outliers can affect the training process negatively and therefore they need to be handled gracefully. In the following section, we will illustrate via examples both the process of detecting an outlier and the techniques used to handle them.

Exercise 16: Identifying Outlier Values

The outliers package provides the outlier() function to detect outlier values. Using the opposite=TRUE parameter fetches the outliers from the opposite end of the distribution. The outlier values can be verified using a boxplot.

  1. Attach the outlier package:

    library(outliers)

  2. Detect outliers:

    #Detect outliers

    outlier(PimaIndiansDiabetes[,1:4])

    The output is as follows:

    pregnant  glucose pressure  triceps

          17        0        0       99

    Detect outliers from the other end:

    #This detects outliers from the other side

    outlier(PimaIndiansDiabetes[,1:4],opposite=TRUE)

    The output is as follows:

    pregnant  glucose pressure  triceps

           0      199      122        0

  3. Plot the outliers using box plots. Using a boxplot, we can view the range of the data and the outliers:

    #View the outliers

    boxplot(PimaIndiansDiabetes[,1:4])

    The output is as follows:

Figure 2.30: A boxplot to identify the outliers

A boxplot can be used to view the distribution of the data, with the range of values and the extreme values visible in the plot. For instance, the zero values in the pressure field appear as outliers (circles) far below the box, while most pressure values lie close to the median of 72. The black line within each box is the median value.

Techniques Used to Handle Outliers

The following are some of the techniques used to handle outliers:

  • Removing the outliers (a minimal sketch using the 1.5 * IQR rule is shown after this list)
  • Treating the outliers as missing values and imputing them with the mean, median, or mode (as covered in the previous section)
  • Predicting the values for the outlier fields using the other values in the dataset
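As a minimal sketch of the first technique, the commonly used 1.5 * IQR rule can be used to drop the rows whose pressure value falls outside the whiskers of the boxplot; the multiplier of 1.5 is a convention, not a requirement:

#Remove pressure outliers using the 1.5 * IQR rule

q <- quantile(PimaIndiansDiabetes$pressure, c(0.25, 0.75))

fence <- 1.5 * IQR(PimaIndiansDiabetes$pressure)

no_outliers <- subset(PimaIndiansDiabetes, pressure >= q[1] - fence & pressure <= q[2] + fence)

nrow(no_outliers)   # fewer rows than the original 768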

Exercise 17: Predicting Values to Handle Outliers

In this exercise, we will be predicting values to handle outliers. The rpart package can be used to predict the values, as shown in this exercise:

  1. We will first add rows with missing values:

    #Add rows with missing values

    iris_new <- rbind(iris, c(1, 2,NA,NA,"setosa"))

    iris_new <- rbind(iris_new, c(NA,NA,3,4,"setosa"))

    iris_new <- rbind(iris_new, c(4,2,3,4,NA))

  2. Since we have NA values, we will convert the characters to numeric values:

    #Convert character to numeric

    iris_new$Sepal.Length <- as.numeric(iris_new$Sepal.Length)

    iris_new$Sepal.Width <- as.numeric(iris_new$Sepal.Width)

    iris_new$Petal.Length <- as.numeric(iris_new$Petal.Length)

    iris_new$Petal.Width <- as.numeric(iris_new$Petal.Width)

  3. Attach the rpart package:

    install.packages("rpart")

    library(rpart)

  4. Use the rpart function on Species in the iris dataset:

    class_mod <- rpart(Species ~ . - Sepal.Length, data=iris_new[!is.na(iris_new$Species), ], method="class", na.action=na.omit)

    # since Species is a factor

  5. Use the rpart function on Petal.Length in the iris dataset:

    anova_mod <- rpart(Petal.Length ~ . - Sepal.Length, data=iris_new[!is.na(iris_new$Petal.Length), ], method="anova", na.action=na.omit)

    # since Petal.Length is numeric.  

  6. Use the predict() function to find the categoric and numeric predicted values:

    categoric_pred <- predict(class_mod, iris_new[is.na(iris_new$Species), ])

    numeric_pred <- predict(anova_mod, iris_new[is.na(iris_new$Petal.Length), ])

  7. View the values:

    categoric_pred

    The output is as follows:

        setosa versicolor virginica

    153      0 0.02173913 0.9782609

  8. View the value for numeric prediction:

    numeric_pred

    The output is as follows:

    151

    1.462

This shows that in row 153, the species is predicted to be virginica with 97.8 percent probability.

In row 151, the petal length is predicted to be 1.462.

Handling Missing Data

To pre-process the data, the syntax is preProcess(dataframe, method="medianImpute"), and the transformation is applied with predict(preprocess_object, newdata=dataframe). The medianImpute method replaces the missing values with the median of the column.

Exercise 18: Handling Missing Values

In this exercise, we will use the caret package to impute the missing values in the iris_new dataset from the previous exercise. This method works only on numeric data, so we select just the first four columns of the dataset.

These are the steps that will solve the exercise:

  1. Attach the caret package:

    library(caret)

  2. Print the rows with NA values:

    #print the rows with NA

    tail(iris_new[,1:4])

    The output is as follows:

        Sepal.Length Sepal.Width Petal.Length Petal.Width

    147          6.3         2.5          5.0         1.9

    148          6.5         3.0          5.2         2.0

    149          6.2         3.4          5.4         2.3

    150          5.9         3.0          5.1         1.8

    151          1.0         2.0           NA          NA

    152           NA          NA          3.0         4.0

  3. Use the preProcess() method of the caret package to impute the values:

    #Impute

    iris_caret <- predict(preProcess(iris_new[,1:4],method = 'medianImpute'),newdata = iris_new[,1:4])

  4. View the imputed values:

    #View the imputed values

    tail(iris_caret)

    The output is as follows:

        Sepal.Length Sepal.Width Petal.Length Petal.Width

    147          6.3         2.5          5.0         1.9

    148          6.5         3.0          5.2         2.0

    149          6.2         3.4          5.4         2.3

    150          5.9         3.0          5.1         1.8

    151          1.0         2.0          4.3         1.3

    152          5.8         3.0          3.0         4.0

Thus, we have replaced the NA values with data using prediction.

Activity 7: Identifying Outliers

In this activity, identify the outliers for the mtcars dataset. Also, display the outliers and plot a boxplot to verify it. The data can be found at https://github.com/TrainingByPackt/Practical-Machine-Learning-with-R/blob/master/Data/mtcars.csv.

These are the steps that will help you solve the activity:

  1. Load the mtcars dataset.
  2. Detect outliers using the outlier() function.
  3. Detect outliers using the outlier() function using the opposite=TRUE option.
  4. Plot a boxplot.

The output will display the outliers using a boxplot, as illustrated:

Figure 2.31: Outliers in the mtcars dataset

Note

The solution for this activity can be found on page 327.

Pre-Processing Categorical Data

A variable that contains distinct categories is called a categorical variable. For instance, the variable animal could have the classes cat, dog, and fish, and the variable married could have the classes yes and no. Pre-processing of a categorical field is essential because the model may not understand non-numeric literals. Therefore, these will be converted to numeric values.

Categorical data can be pre-processed in the following manner. The character values are converted to numeric values, which can be assigned by us:

#Categorical Variable

iris_new$Species <- factor(iris_new$Species,levels = c('setosa','versicolor','virginica'), labels = c(1,2,3))

iris_new$Species

The output is as follows:

Figure 2.32: Converting non-literal to numeric values

In the previous example, we saw how to convert a character factor to a numeric factor.

Handling Imbalanced Datasets

In many business scenarios, the data can be imbalanced. For example, if we are identifying credit card fraud, out of 100 transactions, only 5 transactions are likely to be fraudulent. Therefore, the data contains 95 samples of good transactions and 5 samples of fraudulent transactions. So, if we use the data directly, the good transactions will overpower the fraudulent ones during training, and the model might not learn to predict credit card fraud accurately. We can employ the following techniques to prevent this problem:

  • Oversampling
  • Undersampling
  • SMOTE
  • ROSE

We will use the PimaIndiansDiabetes dataset, which contains records of patients tested for diabetes; the output field, diabetes, is either neg or pos:

head(PimaIndiansDiabetes)

The output is as follows:

Figure 2.33: First five records of PimaIndiansDiabetes

Observe the variables in the PimaIndiansDiabetes dataset.

str(PimaIndiansDiabetes)

The output is as follows:

Figure 2.34: Variables in PimaIndiansDiabetes dataset

Get a summary of the diabetes variable:

summary(PimaIndiansDiabetes$diabetes)

The output is as follows:

neg pos

500 268

The diabetes field has imbalanced samples: the number of pos samples is low (268), while the number of neg samples is high (500).

The imbalance in the preceding example should be addressed, because it can affect the performance of the machine learning model.

Undersampling

This is a technique where some samples from the larger class are removed so that the class ratio is balanced. The disadvantage is that we are likely to lose information, as the samples to keep are picked randomly. For instance, we will randomly pick 268 of the 500 neg samples.

Exercise 19: Undersampling a Dataset

In this exercise, we will consider the diabetes field of the PimaIndiansDiabetes dataset and downsample the majority class.

  1. View the summary of the diabetes field to see the class count:

    summary(PimaIndiansDiabetes$diabetes)

    The output is as follows:

    neg pos

    500 268

  2. Use downSample() to undersample the data:

    set.seed(9560)

    undersampling <- downSample(

        x = PimaIndiansDiabetes[,-ncol(PimaIndiansDiabetes)],

        y = PimaIndiansDiabetes$diabetes)

    table(undersampling$Class)

    The output is as follows:

    neg pos

    268 268

In this example, we saw how the class with more data was undersampled to reduce the count of data.

Oversampling

This is a technique where the samples of the class with the lower count are repeated/duplicated to increase the count so that the ratio is balanced. The disadvantage is that we are more likely to overfit our model, and we do not have unique data for training and testing. The duplicated samples are picked randomly. For instance, we can randomly pick and duplicate pos samples to increase their count from 268 to 500.

Exercise 20: Oversampling

The goal of this exercise is to perform oversampling on the minority class. The minority positive class, with a count of 268, is oversampled to match the majority class count of 500.

  1. Set a seed value so that we can repeat the experiment and get the same results:

    set.seed(9560)

  2. Perform oversampling using upSample():

    oversampling <- upSample(

      x = PimaIndiansDiabetes[,-ncol(PimaIndiansDiabetes)],

      y = PimaIndiansDiabetes$diabetes)

    table(oversampling$Class)

    The output is as follows:

    neg pos

    500 500

Through the preceding example, we learned how to use upSample() to oversample the pos minority class for the diabetes column.

ROSE

The Random Oversampling Examples (ROSE) technique is another technique used for imbalanced binary classification problems. It uses a smoothed bootstrap approach to create artificial samples for the minority class, thereby balancing the dataset.

Exercise 21: Oversampling using ROSE

In this exercise, we will learn to generate synthetic samples for the minority class and balance the dataset using ROSE.

  1. Attach the packages:

    library(caret)

    library(ROSE)

  2. Set the seed value:

    set.seed(2)

  3. Create an imbalanced dataset:

    imbalance_data <- twoClassSim(1000, intercept = -15, linearVars = 5)

  4. View the count of the classes in this imbalanced dataset:

    table(imbalance_data$Class)

    The output is as follows:

    Class1 Class2

       908     92

  5. Use ROSE to balance the dataset by increasing the number of samples for the minority class, Class2:

    balanced_data <- ROSE(Class ~ ., data = imbalance_data, seed = 3)$data

    table(balanced_data$Class)

    The output is as follows:

    Class1 Class2

       480    520

Through the preceding example, we learned to implement the ROSE method to perform oversampling.

SMOTE

Synthetic Minority Oversampling Technique (SMOTE) is used to handle imbalanced binary classes. In this technique, the minority class is oversampled by synthesizing new samples that interpolate between existing minority samples and their nearest neighbors, and the majority class is undersampled.

Exercise 22: Implementing the SMOTE Technique

In this exercise, we will implement the SMOTE concept. Here are the steps to complete the exercise:

  1. Set the seed:

    set.seed(2)

  2. Create the imbalanced data:

    imbalance_data <- twoClassSim(1000, intercept = -15, linearVars = 5)

  3. View the summary of the Class column:

    table(imbalance_data$Class)

    The output is as follows:

    Class1 Class2

       903     97

  4. Use the repeatedcv method to get the ctrl values:

    ctrl <- trainControl(method = "repeatedcv",

                         number = 10,

                         repeats = 5,

                         summaryFunction = twoClassSummary,

                         classProbs = TRUE)

  5. Set the sampling as smote:

    ctrl$sampling <- "smote"

  6. Set the smote_fit values:

    smote_fit <- train(Class ~ .,

                       data = imbalance_data,

                       method = "gbm",

                       verbose = FALSE,

                       metric = "ROC",

                       trControl = ctrl)

  7. Print the smote_fit values:

    smote_fit

    The output is as follows:

Figure 2.35: SMOTE output

In the preceding example, we saw how to use SMOTE to balance the data.
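As an alternative to driving SMOTE through caret's sampling option, the SMOTE() function from the DMwR package (which caret's "smote" sampling has relied on) can be applied directly. The following is only a sketch under that assumption; the perc.over and perc.under percentages are illustrative values, not recommendations:

#Apply SMOTE directly to the simulated imbalance_data

library(DMwR)

smoted_data <- SMOTE(Class ~ ., data = imbalance_data, perc.over = 200, perc.under = 150)

table(smoted_data$Class)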

Activity 8: Oversampling and Undersampling using SMOTE

The mushrooms dataset contains imbalanced data and has a property named bruises, which we will oversample and undersample in this activity.

The dataset can be found at https://github.com/TrainingByPackt/Practical-Machine-Learning-with-R/blob/master/Data/mushrooms.csv.

These are the steps that will help you solve the activity:

  1. Read the mushrooms.csv file using the read.csv() command, and save the value in the ms variable.
  2. Perform downsampling on bruises.
  3. Perform oversampling on bruises.

The output will be oversampled as shown in the following:

   f    t

4748 4748

Note

The solution for this activity can be found on page 329.

Activity 9: Sampling and Oversampling using ROSE

We want to use the German Credit dataset to make predictions about its Class variable. However, the dataset does not have a good balance of Good and Bad values. We will use ROSE to perform sampling so that the class values are balanced.

The dataset can be found at https://github.com/TrainingByPackt/Practical-Machine-Learning-with-R/blob/master/Data/GermanCredit.csv.

These are the steps that will help you solve the activity:

  1. Load the dataset.
  2. View samples from the data.
  3. Check the count of unbalanced classes using the summary() method.
  4. Use ROSE to balance the numbers.

The balanced data sampled using ROSE will look as follows:

Good  Bad

480  520

Note

The solution for this activity can be found on page 330.

Summary

In this chapter, we learned how to perform several operations on a data frame, including scaling, standardizing, and normalizing. Also, we covered the sorting, ranking, and joining operations with their implementations in R. We discussed the need for pre-processing of the data; and identified and handled outliers, missing values, and duplicate values.

Next, we moved on to the sampling of data. It is important for the data to contain a reasonable sample of each class that is to be predicted. If the data is imbalanced, it can affect our predictions in a negative manner. Therefore, we can use the undersampling, oversampling, ROSE, or SMOTE techniques to ensure that the dataset is representative of all the classes that we want to predict. These tasks can be carried out using the MICE, rpart, ROSE, and caret packages.

In the next chapter, we will cover feature engineering in detail, where we will focus on extracting features to create models.
