By the end of this chapter, you will be able to:
This chapter covers the important concepts of handling data and making the data ready for analysis.
Data cleaning and preparation takes about 70% of the effort in the entire process of a machine learning project. This step is essential because the quality of the data determines the accuracy of the prediction model. A clean dataset should contain good samples of the scenarios that we want to predict, and this will give us good prediction results. Also, the data should be balanced, which means that every category we want to predict should have a similar number of samples. For example, if we want to predict whether or not it will rain on a particular day, and the sample size is 100, the data could contain 40 samples for "It will rain" and 60 samples for "It will not rain", or vice versa. However, if the ratio is 20:80 or 30:70, the dataset is imbalanced, and this will not yield good results for the minority class.
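To make this concrete, the following minimal sketch (using a hypothetical vector of labels, not a real dataset) shows how the class balance of such a dataset could be checked in R:
# Hypothetical label vector: 40 rainy days and 60 dry days out of 100 samples
labels <- c(rep("It will rain", 40), rep("It will not rain", 60))
# Count the samples per class
table(labels)
# Proportions per class; values close to 0.5 indicate a reasonably balanced binary dataset
prop.table(table(labels))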
In the following section, we will look at the essential operations performed on data frames in R. These operations will help us to manipulate and analyze the data. The datasets we will be utilizing in this chapter are as follows:
We will begin with advanced operations on R data frames.
In the previous chapter, we performed a number of operations on data frames, including rbind(). There are many more operations that can be performed on data frames, which are very useful while preparing the data for the model. The following exercises will describe these operations in detail and illustrate them through their corresponding implementation in R:
Sorting, ordering, and ranking are operations that, among other things, help us identify outliers. Outliers are values that are either too large or too small and do not fit within the expected range of values. Because datasets are often messy, fixing them is usually a challenge; sorting through the records makes it easier to spot extreme values and decide what to fix next. Hence, these operations form the basis of pre-processing in R.
In this exercise, we will be organizing the data using the order(), sort(), and rank() functions. We will be using the built-in PimaIndiansDiabetes dataset:
library(mlbench)
library(caret)
data(PimaIndiansDiabetes)
# sort by glucose
sorted_data <- PimaIndiansDiabetes[order(PimaIndiansDiabetes$glucose), ]
# View the output
head(sorted_data)
The output is as follows:
# sort by glucose and pressure
sorted_data <- PimaIndiansDiabetes[order(PimaIndiansDiabetes$glucose, PimaIndiansDiabetes$pressure), ]
head(sorted_data)
The output is as follows:
# Sort in ascending order by glucose and descending order by pressure
sorted_data <- PimaIndiansDiabetes[order(PimaIndiansDiabetes$glucose, -PimaIndiansDiabetes$pressure), ]
head(sorted_data)
The output is as follows:
# Using the sort function to sort glucose
sort(PimaIndiansDiabetes$glucose)
The output is as follows:
# Sort in descending order
sort(PimaIndiansDiabetes$glucose, decreasing = TRUE)
The output is as follows:
# Using the rank function to rank the values of glucose
rank(PimaIndiansDiabetes$glucose)
The output is as follows:
Through these examples, we have learned how to organize data using order(), sort(), and rank().
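To make the difference between the three functions explicit, here is a small illustrative sketch on a toy vector (not part of the dataset):
x <- c(30, 10, 50, 20)
sort(x)   # returns the values in ascending order: 10 20 30 50
order(x)  # returns the positions that would sort x: 2 4 1 3
rank(x)   # returns the rank of each value in place: 3 1 4 2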
Join operations are extremely useful when the data is spread across multiple tables. We can merge two datasets/data frames on a common column using a join. For instance, if one data frame contains credit card transaction information and another contains credit card customer information, and the two have to be merged based on customer ID, then we use a join operation to perform the merge. In this chapter, we will look at the inner join, the outer join, the left join, and the right join in detail.
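As a quick illustration of the credit card example, the following sketch merges two small hypothetical data frames on a shared customer_id column (the data frames, values, and column names are made up for illustration):
transactions <- data.frame(customer_id = c(1, 2, 2, 4),
                           amount = c(250, 80, 40, 120))
customers <- data.frame(customer_id = c(1, 2, 3),
                        name = c("Asha", "Ben", "Carla"))
# Inner join: only customer_ids present in both data frames are kept
merge(transactions, customers, by = "customer_id")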
Inner join: The inner join gives us only the data where the fields in both the data frames have been merged by an exact match.
The syntax is as follows:
merge(df1, df2, by="fields used to merge")
Note that in the preceding code line, the following abbreviations are used:
df1 <- dataframe1
df2 <- dataframe2
The code to perform an inner join is as follows:
# Inner Join
data1 <- head(PimaIndiansDiabetes)
data2 <- head(PimaIndiansDiabetes)
merge(data1, data2, by='glucose')
The output is as follows:
Outer join: The outer join will join based on the exact match, but it will also keep the data that is not matched.
The syntax is as follows:
merge(df1, df2, by="common_key_column", all=TRUE)
The code to perform outer join is as follows:
#Outer Join
data1 <- head(PimaIndiansDiabetes)
data2 <- tail(PimaIndiansDiabetes)
merge(data1, data2, by='glucose', all=TRUE)
The output is as follows:
Left outer join: The left outer join will join based on exact matches, but it will also keep the data from df1 (which is not matched).
The syntax is as follows:
merge(df1, df2, by="common_key_column", all.x=TRUE)
The code to perform a left outer join is as follows:
#Left Join
data1 <- head(PimaIndiansDiabetes)
data2 <- tail(PimaIndiansDiabetes)
merge(data1, data2, by='glucose',all.x=TRUE)
The output is as follows:
Right outer join: The right outer join will join based on exact matches, but it will also keep the data from df2 (that which isn't matched).
The syntax is as follows:
merge(df1, df2, by="common_key_column", all.y=TRUE)
The code to perform a right outer join is as follows:
#Right Join
data1 <- head(PimaIndiansDiabetes)
data2 <- tail(PimaIndiansDiabetes)
merge(data1, data2, by='glucose',all.y=TRUE)
The output is as illustrated:
Pre-processing is done on data frames to improve the quality of the dataset. At times, values are spread over a long range, and it becomes essential to align the values to a common scale without altering the ranges of values.
Standardizing is a pre-processing technique that transforms data values so that they can be compared with each other. For instance, if age is in the range of 1-100 and salary is in the range of 2,000-60,000, the two fields cannot be directly compared because their ranges of values are different. Therefore, we will transform the values such that each variable has a mean of 0 and a standard deviation of 1. Standardization can be performed using:
When values are on different scales, they contribute unequally to the analysis. Scaling is particularly important for methods that rely on distances or gradients, such as k-nearest neighbors, Principal Component Analysis (PCA), and gradient descent; tree-based models, by contrast, are largely insensitive to feature scaling.
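As a minimal sketch of what standardization does, the base R scale() function can be applied to a single column; this assumes the mlbench package and the PimaIndiansDiabetes dataset used throughout this chapter:
library(mlbench)
data(PimaIndiansDiabetes)
# Subtract the mean and divide by the standard deviation
glucose_std <- as.vector(scale(PimaIndiansDiabetes$glucose))
mean(glucose_std)  # approximately 0
sd(glucose_std)    # 1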
The preProcess() function in R's caret package can take 16 arguments. In the upcoming exercise, we will look at the method argument, a character vector that specifies the types of processing to apply. The processing types we will look at in detail are center, scale, and pca.
In this exercise, we will perform the center pre-processing operation on the Pima Indians diabetes dataset:
#Attach the packages
library(mlbench)
library(caret)
# load the dataset PimaIndiansDiabetes
data(PimaIndiansDiabetes)
# view data
summary(PimaIndiansDiabetes[,1:2])
The output is as follows:
pregnant glucose
Min. : 0.000 Min. : 0.0
1st Qu.: 1.000 1st Qu.: 99.0
Median : 3.000 Median :117.0
Mean : 3.845 Mean :120.9
3rd Qu.: 6.000 3rd Qu.:140.2
Max. :17.000 Max. :199.0
params <- preProcess(PimaIndiansDiabetes[,1:2],
method=c("center"))
print(params)
The output is as follows:
Created from 768 samples and 2 variables
Pre-processing:
- centered (2)
- ignored (0)
# transform the dataset using the parameters
new_dataset <- predict(params, PimaIndiansDiabetes[,1:2])
The predict() function takes the centering parameters stored in params and transforms the variables accordingly.
# summarize the transformed dataset
summary(new_dataset)
The output is as follows:
pregnant glucose
Min. :-3.8451 Min. :-120.895
1st Qu.:-2.8451 1st Qu.: -21.895
Median :-0.8451 Median : -3.895
Mean : 0.0000 Mean : 0.000
3rd Qu.: 2.1549 3rd Qu.: 19.355
Max. :13.1549 Max. : 78.105
The new values are found by subtracting the mean from the original values; hence, the mean of the transformed data is now zero. In the next exercise, we will normalize the values using the range operation.
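As a quick sanity check (a sketch using the objects created above), centering should be identical to subtracting the column mean manually:
centered_manually <- PimaIndiansDiabetes$glucose - mean(PimaIndiansDiabetes$glucose)
all.equal(new_dataset$glucose, centered_manually)  # expected to be TRUE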
In this exercise, we will perform the range operation during pre-processing on the PimaIndiansDiabetes dataset:
#Attach the caret and mlbench packages
library(mlbench)
library(caret)
# load the dataset PimaIndiansDiabetes
data(PimaIndiansDiabetes)
# view the data
summary(PimaIndiansDiabetes[,1:2])
The output is as follows:
pregnant glucose
Min. : 0.000 Min. : 0.0
1st Qu.: 1.000 1st Qu.: 99.0
Median : 3.000 Median :117.0
Mean : 3.845 Mean :120.9
3rd Qu.: 6.000 3rd Qu.:140.2
Max. :17.000 Max. :199.0
# To normalize, we will use the range method
params <- preProcess(PimaIndiansDiabetes[,1:2], method=c("range"))
print(params)
The output is as follows:
Created from 768 samples and 2 variables
Pre-processing:
- ignored (0)
- re-scaling to [0, 1] (2)
# Transform the dataset using the parameters
new_dataset <- predict(params, PimaIndiansDiabetes[,1:2])
# summarize the transformed dataset
summary(new_dataset)
The output is as follows:
pregnant glucose
Min. :0.00000 Min. :0.0000
1st Qu.:0.05882 1st Qu.:0.4975
Median :0.17647 Median :0.5879
Mean :0.22618 Mean :0.6075
3rd Qu.:0.35294 3rd Qu.:0.7048
Max. :1.00000 Max. :1.0000
We have successfully normalized the values, and the values now lie between 0 and 1.
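As a quick check (a sketch using the objects created above), the range method is min-max normalization, (x - min) / (max - min), so we can reproduce it manually:
g <- PimaIndiansDiabetes$glucose
normalized_manually <- (g - min(g)) / (max(g) - min(g))
all.equal(new_dataset$glucose, normalized_manually)  # expected to be TRUE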
In this exercise, we will perform the scale operation during pre-processing of the PimaIndiansDiabetes dataset:
data(PimaIndiansDiabetes)
summary(PimaIndiansDiabetes[,1:2])
The output is as follows:
pregnant glucose
Min. : 0.000 Min. : 0.0
1st Qu.: 1.000 1st Qu.: 99.0
Median : 3.000 Median :117.0
Mean : 3.845 Mean :120.9
3rd Qu.: 6.000 3rd Qu.:140.2
Max. :17.000 Max. :199.0
# To scale, we will use the scale method
params <- preProcess(PimaIndiansDiabetes[,1:2], method=c("scale"))
print(params)
The output is as follows:
Created from 768 samples and 2 variables
Pre-processing:
- ignored (0)
- scaled (2)
#Scale the data
new_dataset <- predict(params, PimaIndiansDiabetes[,1:2])
# summarize the transformed dataset
summary(new_dataset)
The output is as follows:
pregnant glucose
Min. :0.0000 Min. :0.000
1st Qu.:0.2968 1st Qu.:3.096
Median :0.8903 Median :3.659
Mean :1.1411 Mean :3.781
3rd Qu.:1.7806 3rd Qu.:4.387
Max. :5.0451 Max. :6.224
Thus, we have learned to perform the scale operation.
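As a quick check (a sketch using the objects created above), the scale method divides each column by its standard deviation without centering it, so the transformed column has a standard deviation of 1:
g <- PimaIndiansDiabetes$glucose
all.equal(new_dataset$glucose, g / sd(g))  # expected to be TRUE
sd(new_dataset$glucose)                    # 1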
In this activity, we will perform the center and scale operations during pre-processing on the PimaIndiansDiabetes dataset.
The dataset can be found at https://github.com/TrainingByPackt/Practical-Machine-Learning-with-R/blob/master/Data/PimaIndiansDiabetes.csv.
These are the steps that will help you solve the activity:
The summary of the new dataset will be as follows:
pregnant glucose
Min. :-1.1411 Min. :-3.7812
1st Qu.:-0.8443 1st Qu.:-0.6848
Median :-0.2508 Median :-0.1218
Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.6395 3rd Qu.: 0.6054
Max. : 3.9040 Max. : 2.4429
The solution for this activity can be found on page 326.
We will extract the principal components from the variables/columns in the dataset. These components are linear combinations of the features in the dataset, constructed so that they capture as much information (variance) as possible. The first component has the maximum variance, and the variance captured decreases with each successive component.
In the next exercise, we will generate the principal components for the PimaIndiansDiabetes dataset.
In this exercise, we will perform pre-processing just as we did in the previous exercises, and we'll use center, scale, and pca to do this. We will follow the same approach as in Exercise 8, Normalizing the Data.
# load the dataset
data(PimaIndiansDiabetes)
params <- preProcess(PimaIndiansDiabetes, method=c("center", "scale",
"pca"))
# perform pca on the dataset using the parameters
new_dataset <- predict(params, PimaIndiansDiabetes)
# view the new dataset
summary(new_dataset)
The output is as follows:
We note that there are eight principal components for the PimaIndiansDiabetes dataset.
library(dplyr)
glimpse(PimaIndiansDiabetes)
The output is as follows:
Observations: 768
Variables: 9
$ pregnant <dbl> 6, 1, 8, 1, 0, 5, 3, 10, 2, ...
$ glucose <dbl> 148, 85, 183, 89, 137, 116, ...
$ pressure <dbl> 72, 66, 64, 66, 40, 74, 50, ...
$ triceps <dbl> 35, 29, 0, 23, 35, 0, 32, 0,...
$ insulin <dbl> 0, 0, 0, 94, 168, 0, 88, 0, ...
$ mass <dbl> 33.6, 26.6, 23.3, 28.1, 43.1...
$ pedigree <dbl> 0.627, 0.351, 0.672, 0.167, ...
$ age <dbl> 50, 31, 32, 21, 33, 30, 26, ...
$ diabetes <fct> pos, neg, pos, neg, pos, neg...
glimpse(new_dataset)
The output is as follows:
Observations: 768
Variables: 9
$ diabetes <fct> pos, neg, pos, neg, pos, neg...
$ PC1 <dbl> -1.0678069, 1.1209528, 0.396...
$ PC2 <dbl> 1.2340908, -0.7333737, 1.594...
$ PC3 <dbl> 0.09586737, -0.71247385, 1.7...
$ PC4 <dbl> 0.49666654, 0.28487058, -0.0...
$ PC5 <dbl> -0.10991328, -0.38925352, 0....
$ PC6 <dbl> 0.35694989, -0.40606472, -0....
$ PC7 <dbl> -0.85826202, -0.75654101, 1....
$ PC8 <dbl> 0.97366903, 0.35398386, 1.06...
Since these components capture the maximum variance in the data, they are good inputs for modeling.
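To see how much variance each component explains, here is a sketch using base R's prcomp() on the eight numeric columns, centered and scaled to mirror the preProcess() call above:
pca <- prcomp(PimaIndiansDiabetes[, 1:8], center = TRUE, scale. = TRUE)
summary(pca)  # the Cumulative Proportion row shows the variance explained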
Subsetting data means filtering the data based on certain criteria, or selecting only certain columns from a dataset.
Often, we need only a section of the data rather than the entire dataset; subsetting enables us to use just that section of the data frame for analysis, ensuring quicker analysis. The general forms of subsetting are shown in the sketch below.
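The following sketch shows the common subsetting forms on a small hypothetical data frame:
df <- data.frame(a = 1:5, b = letters[1:5], c = c(10, 20, 30, 40, 50))
df[1:2, ]                              # select rows by position
df[, c("a", "c")]                      # select columns by name
df[df$c > 20, ]                        # select rows by a condition
subset(df, c > 20, select = c(a, c))   # the same filter and selection using subset()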
In this exercise, we will subset a data frame using data operations:
library(mlbench)
data("PimaIndiansDiabetes")
Subsetting Data
# select variables age, glucose, pressure
myvars <- c("age", "glucose", "pressure")
newdata <- PimaIndiansDiabetes[myvars]
head(newdata)
The output is as follows:
age glucose pressure
1 50 148 72
2 31 85 66
3 32 183 64
4 21 89 66
5 33 137 40
6 30 116 74
# another method
newdata <- PimaIndiansDiabetes[, 1:3]
head(newdata)
The output is as follows:
pregnant glucose pressure
1 6 148 72
2 1 85 66
3 8 183 64
4 1 89 66
5 0 137 40
6 5 116 74
# select 1st and 5th through 9th variables
newdata <- PimaIndiansDiabetes[c(1,5:9)]
head(newdata)
The output is as follows:
pregnant insulin mass pedigree age diabetes
1 6 0 33.6 0.627 50 pos
2 1 0 26.6 0.351 31 neg
3 8 0 23.3 0.672 32 pos
4 1 94 28.1 0.167 21 neg
5 0 168 43.1 2.288 33 pos
6 5 0 25.6 0.201 30 neg
# using subset function
newdata <- subset(PimaIndiansDiabetes,
insulin >= 20 & age < 30,
select=c(insulin, age))
head(newdata)
The output is as follows:
insulin age
4 94 21
7 88 26
21 235 27
28 140 22
32 245 28
33 54 22
Thus, we have selected the required part of the data frame in each of these cases.
Let's transpose the PimaIndiansDiabetes data frame; that is, convert the columns to rows and rows to columns. Use the t(dataframe) syntax as follows:
#Transpose Data
t_PimaIndiansDiabetes <- head(t(PimaIndiansDiabetes))
head(PimaIndiansDiabetes)
The first six rows of the original dataset are as follows:
The transposed dataset is as follows:
head(t_PimaIndiansDiabetes)
The output is as follows:
Often, it is essential to transpose a dataset before we use it for analysis.
For any dataset, we should identify the input variables and the output variables. For the iris dataset, the input variables are the following:
The output variable, or the field to be predicted, is Species.
Based on the category of prediction, we will perform different pre-processing steps. The category of prediction could be any of these:
In any dataset, we might have missing values, duplicate values, or outliers. We need to ensure that these are handled appropriately so that the data used by the model is clean.
Missing values in a data frame can affect the model during the training process. Therefore, they need to be identified and handled during the pre-processing stage. They are represented as NA in a data frame. Using the example that follows, we will see how to identify a missing value in a dataset.
Using the is.na(), complete.cases(), and md.pattern() functions, we will identify the missing values.
The is.na() function, as the name suggests, returns TRUE for elements that are NA or, for numeric or complex vectors, NaN (Not a Number), and FALSE otherwise. The complete.cases() function returns TRUE for rows that contain no missing values, and md.pattern() gives a summary of the missing-value patterns.
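A small sketch on a toy data frame makes this behavior clear:
df <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))
is.na(df)           # TRUE wherever a value is missing
complete.cases(df)  # TRUE only for rows with no missing values: TRUE FALSE FALSE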
In the following example, we add rows containing missing values to the PimaIndiansDiabetes dataset and convert the columns of this dataset into numeric values. Using the base R functions is.na() and complete.cases(), and the md.pattern() function from the mice package, we will identify the missing values.
library(mlbench)
data("PimaIndiansDiabetes")
#Adding NA values
PimaIndiansDiabetes_new <- rbind(
PimaIndiansDiabetes,c(1, 212,NA,NA,3,44,0.45,23,"neg"))
PimaIndiansDiabetes_new <- rbind(
PimaIndiansDiabetes_new,c(1, 212,NA,NA,3,44,0.45,23,"pos"))
#Convert character to numeric
PimaIndiansDiabetes_new$pregnant=as.numeric(
PimaIndiansDiabetes_new$pregnant)
PimaIndiansDiabetes_new$glucose=as.numeric(
PimaIndiansDiabetes_new$glucose)
PimaIndiansDiabetes_new$pressure=as.numeric(
PimaIndiansDiabetes_new$pressure)
PimaIndiansDiabetes_new$triceps=as.numeric(
PimaIndiansDiabetes_new$triceps)
PimaIndiansDiabetes_new$insulin=as.numeric(
PimaIndiansDiabetes_new$insulin)
PimaIndiansDiabetes_new$mass=as.numeric(
PimaIndiansDiabetes_new$mass)
PimaIndiansDiabetes_new$pedigree=as.numeric(
PimaIndiansDiabetes_new$pedigree)
PimaIndiansDiabetes_new$age=as.numeric(
PimaIndiansDiabetes_new$age)
PimaIndiansDiabetes_new$diabetes=as.numeric(
PimaIndiansDiabetes_new$diabetes)
#Identifying missing values
#List the rows containing missing values
PimaIndiansDiabetes_new[
!complete.cases(
PimaIndiansDiabetes_new),]
The output is as follows:
is.na(PimaIndiansDiabetes_new)
The output is as follows:
tail(is.na(PimaIndiansDiabetes_new))
The output is as follows:
library(mice)
md.pattern(PimaIndiansDiabetes_new)
The output is as follows:
Now that we have identified the missing values, it's time to handle them gracefully.
When we encounter missing values, we can handle them in a couple of ways. Some of the techniques include deleting the rows that contain them or replacing them with the mean or median.
In the following section, these techniques will be illustrated in detail with examples:
#Remove rows containing missing values
newdata <- na.omit(PimaIndiansDiabetes_new)
is.na(newdata)
The is.na() function will now return FALSE for every element, confirming that no NA values remain in the data.
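The other technique mentioned earlier, replacing missing values with the mean or median, can be sketched as follows (using the pressure column, in which we introduced NA values):
pressure <- PimaIndiansDiabetes_new$pressure
# Replace every NA with the mean of the non-missing values
pressure[is.na(pressure)] <- mean(pressure, na.rm = TRUE)
summary(pressure)  # no NA values remain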
In the next exercise, we will learn how to impute using the MICE package.
This exercise will give us an overview of the mice package. It is a continuation of the previous exercise, and we will impute the missing values using the complete() function.
#View the NA
tail(PimaIndiansDiabetes_new)
The output is as follows:
library(mice)
impute_step1 <- mice(PimaIndiansDiabetes_new)
imputed_data <- complete(impute_step1)
The output is as follows:
#View the imputed values
tail(imputed_data)
The output is as follows:
Thus, we have imputed the NA values using MICE.
The abbreviation pmm is short for predictive mean matching. It will predict the value to be written into the missing field. A sample is as follows:
mice(Dataset, m=1,maxit=30,meth='pmm',seed=50)
In the preceding line of code, the following are used:
m = The number of imputed datasets
meth = The imputation method used. Other methods can also be used.
maxit = The number of iterations for each imputation
In this exercise, we will predict the missing value using pmm.
tail(PimaIndiansDiabetes_new)
The output is as follows:
impute_step1 <- mice(PimaIndiansDiabetes_new,
m=5,maxit=30,meth='pmm',seed=50)
The output is as follows:
summary(impute_step1)
The output is as follows:
completedData <- complete(impute_step1,1)
tail(completedData)
The output is as follows:
Thus, we have imputed values using pmm.
Duplicate data means rows that repeat themselves in the dataset. These duplicate data rows need to be removed, as they will reduce the quality of the data. If our training data contains duplicates, the duplicates can overtrain a model and bias it to predict those samples well. Thus, the model does not learn the other samples (non-duplicates) as well.
There are functions in R that can be used to identify the duplicates in the data frame. We will identify duplicates using the duplicated() function.
#Adding duplicate values
PimaIndiansDiabetes_new <- rbind(
PimaIndiansDiabetes,c(1, 93,70,31,0,30.4,0.315,23,"pos"))
PimaIndiansDiabetes_new <- rbind(
PimaIndiansDiabetes_new,c(1, 93,70,31,0,30.4,0.315,23,"pos"))
PimaIndiansDiabetes_new <- rbind(
PimaIndiansDiabetes_new,c(1, 93,70,31,0,30.4,0.315,23,"pos"))
#Identify Duplicates
duplicated(PimaIndiansDiabetes_new)
The output is as follows:
#Display the duplicates
PimaIndiansDiabetes_new[duplicated(PimaIndiansDiabetes_new),]
The output is as follows:
#Display the unique values of the list of duplicates
unique(PimaIndiansDiabetes_new[duplicated(PimaIndiansDiabetes_new),])
The output is as follows:
#Display the unique values
unique(PimaIndiansDiabetes_new)
The output is as follows:
Thus, the unique() and duplicated() functions can be used to eliminate duplicate values.
A technique to handle duplicate values is to remove duplicate rows:
#Remove duplicates
unique_data <- iris[!duplicated(iris),]
In the next section, we will handle outliers.
Any datapoint with a value that is very different from the other data points is an outlier. Outliers can affect the training process negatively and therefore they need to be handled gracefully. In the following section, we will illustrate via examples both the process of detecting an outlier and the techniques used to handle them.
The outliers package can detect outlier values. The outlier() function returns, for each column, the value that is farthest from the mean; using the opposite=TRUE parameter fetches the most extreme value from the opposite end of the distribution. The outlier values can be verified using a boxplot.
library(outliers)
#Detect outliers
outlier(PimaIndiansDiabetes[,1:4])
The output is as follows:
pregnant glucose pressure triceps
17 0 0 99
Detect outliers from the other end:
#This detects outliers from the other side
outlier(PimaIndiansDiabetes[,1:4],opposite=TRUE)
The output is as follows:
pregnant glucose pressure triceps
0 199 122 0
#View the outliers
boxplot(PimaIndiansDiabetes[,1:4])
The output is as follows:
A boxplot can be used to view the distribution of the data: the box spans the interquartile range, the black line within each box is the median value, and the circles beyond the whiskers mark the extreme values (the outliers).
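The circles follow the standard boxplot rule: values more than 1.5 times the interquartile range beyond the quartiles are flagged. A sketch using boxplot.stats() lists these values for a single column:
boxplot.stats(PimaIndiansDiabetes$pressure)$out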
The following are some of the techniques used to handle outliers:
In this exercise, we will handle problem values by predicting replacements for them (the values to be replaced are marked as missing). The rpart package can be used to predict the values, as shown in this exercise:
#Add rows with missing values
iris_new <- rbind(iris, c(1, 2,NA,NA,"setosa"))
iris_new <- rbind(iris_new, c(NA,NA,3,4,"setosa"))
iris_new <- rbind(iris_new, c(4,2,3,4,NA))
#Convert character to numeric
iris_new$Sepal.Length <- as.numeric(iris_new$Sepal.Length)
iris_new$Sepal.Width <- as.numeric(iris_new$Sepal.Width)
iris_new$Petal.Length <- as.numeric(iris_new$Petal.Length)
iris_new$Petal.Width <- as.numeric(iris_new$Petal.Width)
install.packages("rpart")
library(rpart)
class_mod <- rpart(Species ~ . - Sepal.Length, data=iris_new[!is.na(iris_new$Species), ], method="class", na.action=na.omit)
# since Species is a factor
anova_mod <- rpart(Petal.Length ~ . - Sepal.Length,
                   data=iris_new[!is.na(iris_new$Petal.Length), ], method="anova", na.action=na.omit)
# since Petal.Length is numeric.
categoric_pred <- predict(class_mod, iris_new[is.na(iris_new$Species), ])
numeric_pred <- predict(anova_mod, iris_new[is.na(iris_new$Petal.Length), ])
categoric_pred
The output is as follows:
setosa versicolor virginica
153 0 0.02173913 0.9782609
numeric_pred
The output is as follows:
151
1.462
This shows that in row 153, the species is predicted to be virginica with 97.8 percent probability.
In row 151, the petal length is predicted to be 1.462.
To pre-process the data, the syntax is preProcess(dataframe, method="medianImpute"), and the syntax for predicting is predict(params, newdata=dataframe), where params is the object returned by preProcess(). The medianImpute method replaces missing values with the column median.
In this exercise, we will use the caret package to impute the missing values in the iris dataset. The medianImpute method works only on numeric data, so we select the first four columns of the dataset.
These are the steps that will solve the exercise:
library(caret)
#print the rows with NA
tail(iris_new[,1:4])
The output is as follows:
Sepal.Length Sepal.Width Petal.Length Petal.Width
147 6.3 2.5 5.0 1.9
148 6.5 3.0 5.2 2.0
149 6.2 3.4 5.4 2.3
150 5.9 3.0 5.1 1.8
151 1.0 2.0 NA NA
152 NA NA 3.0 4.0
#Impute
iris_caret <- predict(preProcess(iris_new[,1:4], method = 'medianImpute'),
                      newdata = iris_new[,1:4])
#View the imputed values
tail(iris_caret)
The output is as follows:
Sepal.Length Sepal.Width Petal.Length Petal.Width
147 6.3 2.5 5.0 1.9
148 6.5 3.0 5.2 2.0
149 6.2 3.4 5.4 2.3
150 5.9 3.0 5.1 1.8
151 1.0 2.0 4.3 1.3
152 5.8 3.0 3.0 4.0
Thus, we have replaced the NA values with data using prediction.
In this activity, identify the outliers for the mtcars dataset. Also, display the outliers and plot a boxplot to verify it. The data can be found at https://github.com/TrainingByPackt/Practical-Machine-Learning-with-R/blob/master/Data/mtcars.csv.
These are the steps that will help you solve the activity:
The output will display the outliers using a boxplot, as illustrated:
The solution for this activity can be found on page 327.
A variable that contains distinct categories is called a categorical variable. For instance, the variable animal could have the classes cat, dog, and fish, and the variable married could have the classes yes and no. Pre-processing of a categorical field is essential because the model may not understand non-numeric literals. Therefore, these will be converted to numeric values.
Categorical data can be pre-processed in the following manner. The character values are converted to numeric values, which can be assigned by us:
#Categorical Variable
iris_new$Species <- factor(iris_new$Species,levels = c('setosa','versicolor','virginica'), labels = c(1,2,3))
iris_new$Species
The output is as follows:
In the previous example, we saw how to convert a character factor to a numeric factor.
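As another minimal sketch, the hypothetical animal variable mentioned at the start of this section can be encoded in the same way:
animal <- c("cat", "dog", "fish", "dog", "cat")
animal_coded <- factor(animal,
                       levels = c("cat", "dog", "fish"),
                       labels = c(1, 2, 3))
animal_coded
as.numeric(as.character(animal_coded))  # plain numeric values: 1 2 3 2 1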
In many business scenarios, the data can be imbalanced. For example, if we are identifying credit card fraud, out of 100 transactions, only 5 are likely to be fraudulent. The data would then contain 95 samples of good transactions and only 5 samples of fraudulent transactions. If we use this data directly, the good samples will dominate training, and the model might not learn to predict credit card fraud with high accuracy. Hence, we can employ the following techniques to address this problem:
We will use the PimaIndiansDiabetes dataset, which is a set of patients with diabetes, and the output field is either neg or pos:
head(PimaIndiansDiabetes)
The output is as follows:
Observe the variables in the PimaIndiansDiabetes dataset.
str(PimaIndiansDiabetes)
The output is as follows:
Get a summary of the diabetes variable:
summary(PimaIndiansDiabetes$diabetes)
The output is as follows:
neg pos
500 268
The diabetes field has imbalanced samples: the number of pos samples is low (268), while the number of neg samples is high (500).
The imbalance in the preceding example should be addressed, because it can affect the performance of the machine learning model.
This is a technique where some samples from the larger class are removed to decrease its count so that the ratio is balanced. The disadvantage is that we lose information, because the samples to keep are picked randomly. For instance, we randomly keep only 268 of the 500 samples for the neg class.
In this exercise, we will consider the diabetes field of the PimaIndiansDiabetes dataset and downsample it.
summary(PimaIndiansDiabetes$diabetes)
The output is as follows:
neg pos
500 268
set.seed(9560)
undersampling <- downSample(
x = PimaIndiansDiabetes[,-ncol(PimaIndiansDiabetes)],
y = PimaIndiansDiabetes$diabetes)
table(undersampling$Class)
The output is as follows:
neg pos
268 268
In this example, we saw how the class with more data was undersampled to reduce the count of data.
This is a technique where the samples of the class with the lower count are repeated/duplicated to increase its count so that the ratio is balanced. The disadvantage is that we are more likely to overfit the model, because the training data no longer consists of unique samples. The duplicated samples are picked randomly. For instance, we can randomly pick and duplicate samples to increase the pos count from 268 to 500.
The goal of this exercise is to perform oversampling for the data that contains minorities. The minority of the positive class having a count of 268 is oversampled to match the majority that had a count of 500.
set.seed(9560)
oversampling <- upSample(
x = PimaIndiansDiabetes[,-ncol(PimaIndiansDiabetes)],
y = PimaIndiansDiabetes$diabetes)
table(oversampling$Class)
The output is as follows:
neg pos
500 500
Through the preceding example, we learned how to use upSample() to oversample the pos minority class for the diabetes column.
The Random Oversampling Examples (ROSE) technique generates synthetic samples for the minority class and is another technique used for binary imbalanced classification problems. It uses a smoothed bootstrap approach to create artificial samples in the data, thereby balancing it.
In this exercise, we will learn to generate synthetic samples for the minority class to balance the dataset by oversampling using random examples in ROSE.
library(caret)
library(ROSE)
set.seed(2)
imbalance_data <- twoClassSim(1000, intercept = -15, linearVars = 5)
table(imbalance_data$Class)
The output is as follows:
Class1 Class2
908 92
balanced_data <- ROSE(Class ~ ., data = imbalance_data, seed = 3)$data
table(balanced_data$Class)
The output is as follows:
Class1 Class2
480 520
Through the preceding example, we learned to implement the ROSE method to perform oversampling.
Synthetic Minority Oversampling Technique (SMOTE) is used to handle imbalanced binary classes. In this technique, the minority class is oversampled and the majority class is undersampled.
In this exercise, we will implement the SMOTE concept. Here are the steps to complete the exercise:
set.seed(2)
imbalance_data <- twoClassSim(1000, intercept = -15, linearVars = 5)
table(imbalance_data$Class)
The output is as follows:
Class1 Class2
903 97
ctrl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 5,
summaryFunction = twoClassSummary,
classProbs = TRUE)
ctrl$sampling <- "smote"
smote_fit <- train(Class ~ .,
data = imbalance_data,
method = "gbm",
verbose = FALSE,
metric = "ROC",
trControl = ctrl)
smote_fit
The output is as follows:
In the preceding example, we saw how to use SMOTE to balance the data.
The mushrooms dataset contains imbalanced data and has a property named bruises, which we will oversample and undersample in this activity.
The dataset can be found at https://github.com/TrainingByPackt/Practical-Machine-Learning-with-R/blob/master/Data/mushrooms.csv.
These are the steps that will help you solve the activity:
The output will be oversampled as shown in the following:
f t
4748 4748
The solution for this activity can be found on page 329.
We want to use the German Credit dataset to predict the Class variable. However, the dataset does not have a good balance of Good and Bad values. We will use ROSE to perform sampling so that the Class values, and hence the dataset, are balanced.
The dataset can be found at https://github.com/TrainingByPackt/Practical-Machine-Learning-with-R/blob/master/Data/GermanCredit.csv.
These are the steps that will help you solve the activity:
The balanced data sampled using ROSE will look as follows:
Good Bad
480 520
The solution for this activity can be found on page 330.
In this chapter, we learned how to perform several operations on a data frame, including scaling, standardizing, and normalizing. Also, we covered the sorting, ranking, and joining operations with their implementations in R. We discussed the need for pre-processing of the data; and identified and handled outliers, missing values, and duplicate values.
Next, we moved on to the sampling of data. It is important for the data to contain a reasonable sample of each class that is to be predicted. If the data is imbalanced, it can affect our predictions in a negative manner. Therefore, we can use the undersampling, oversampling, ROSE, or SMOTE techniques on imbalanced data to ensure that the dataset is representative of all the classes that we want to predict. This can be done using the caret and ROSE packages, while missing values can be handled using the mice and rpart packages.
In the next chapter, we will cover feature engineering in detail, where we will focus on extracting features to create models.