Appendix

About

This section is included to help students perform the activities in the book. It provides the detailed steps students need to follow to complete the activities and achieve the book's objectives.

Chapter 1: An Introduction to Machine Learning

Activity 1: Finding the Distribution of Diabetic Patients in the PimaIndiansDiabetes Dataset

Solution:

  1. Load the dataset.

    PimaIndiansDiabetes<-read.csv("PimaIndiansDiabetes.csv")

  2. Create a variable PimaIndiansDiabetesData for further use.

    #Assign it to a local variable for further use

    PimaIndiansDiabetesData<- PimaIndiansDiabetes

  3. Use the head() function to view the first six rows of the dataset.

    #Display the first six rows

    head(PimaIndiansDiabetesData)

    The output is as follows:

      pregnant glucose pressure triceps insulin mass pedigree age diabetes

    1        6     148       72      35       0 33.6    0.627  50      pos

    2        1      85       66      29       0 26.6    0.351  31      neg

    3        8     183       64       0       0 23.3    0.672  32      pos

    4        1      89       66      23      94 28.1    0.167  21      neg

    5        0     137       40      35     168 43.1    2.288  33      pos

    6        5     116       74       0       0 25.6    0.201  30      neg

    From the preceding data, identify the input features and the column to be predicted: the first eight columns are the input features, and diabetes is the output variable.

  4. Display the different categories of the output variable:

    levels(PimaIndiansDiabetesData$diabetes)

    The output is as follows:

    [1] "neg" "pos"

  5. Load the required library for plotting graphs.

    library(ggplot2)

  6. Create a bar plot of age, filling the bars by the output variable, diabetes.

    barplot <- ggplot(data= PimaIndiansDiabetesData, aes(x=age))

    barplot + geom_histogram(binwidth=0.2, color="black", aes(fill=diabetes))  + ggtitle("Bar plot of Age")

    The output is as follows:

Figure 1.36: Bar plot output for diabetes

We can conclude that most of the data lies in the 20-30 age group. Graphical representation thus helps us understand the data at a glance.

Activity 2: Grouping the PimaIndiansDiabetes Data

Solution:

  1. View the structure of the PimaIndiansDiabetes dataset.

    #View the structure of the data

    str(PimaIndiansDiabetesData)

    The output is as follows:

    'data.frame':768 obs. of  9 variables:

    $ pregnant: num  6 1 8 1 0 5 3 10 2 8 ...

    $ glucose : num  148 85 183 89 137 116 78 115 197 125 ...

    $ pressure: num  72 66 64 66 40 74 50 0 70 96 ...

    $ triceps : num  35 29 0 23 35 0 32 0 45 0 ...

    $ insulin : num  0 0 0 94 168 0 88 0 543 0 ...

    $ mass    : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...

    $ pedigree: num  0.627 0.351 0.672 0.167 2.288 ...

    $ age     : num  50 31 32 21 33 30 26 29 53 54 ...

    $ diabetes: Factor w/ 2 levels "neg","pos": 2 1 2 1 2 1 2 1 2 2 ...

  2. View the summary of the PimaIndiansDiabetes dataset.

    #View the Summary of the data

    summary(PimaIndiansDiabetesData)

    The output is as follows:

    Figure 1.37: Summary of PimaIndiansDiabetes data
  3. View the statistics of the columns of the PimaIndiansDiabetes dataset grouped by the diabetes column.

    #Perform Group by and view statistics for the columns

    #Install the package

    install.packages("psych")

    library(psych) #Load package psych to use function describeBy

    Use describeBy() with the pregnant and diabetes columns.

    describeBy(PimaIndiansDiabetesData$pregnant, PimaIndiansDiabetesData$diabetes)

    The output is as follows:

    Descriptive statistics by group

    group: neg

       vars   n mean   sd median trimmed  mad min max range skew kurtosis   se

    X1    1 500  3.3 3.02      2    2.88 2.97   0  13    13 1.11     0.65 0.13

    ----------------------------------------------------------------------------------------------

    group: pos

       vars   n mean   sd median trimmed  mad min max range skew kurtosis   se

    X1    1 268 4.87 3.74      4     4.6 4.45   0  17    17  0.5    -0.47 0.23

    We can view the mean, median, min, and max of the pregnant attribute (the number of times pregnant) for the group of people who have diabetes (pos) and for those who do not (neg).

  4. Use describeBy() with the pressure and diabetes columns.

    describeBy(PimaIndiansDiabetesData$pressure, PimaIndiansDiabetesData$diabetes)

    The output is as follows:

    Descriptive statistics by group

    group: neg

       vars   n  mean    sd median trimmed   mad min max range skew kurtosis   se

    X1    1 500 68.18 18.06     70   69.97 11.86   0 122   122 -1.8     5.58 0.81

    ----------------------------------------------------------------------------------------------

    group: pos

       vars   n  mean    sd median trimmed   mad min max range  skew kurtosis   se

    X1    1 268 70.82 21.49     74   73.99 11.86   0 114   114 -1.92     4.53 1.31

    We can view the mean, median, min, and max of pressure for the group of people who have diabetes (pos) and for those who do not (neg).

    We have learned how to view the structure of a dataset and print statistics about the range of every column using summary().
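    As a lighter-weight alternative, base R's aggregate() can compute a chosen statistic per group without an extra package. A minimal sketch, assuming PimaIndiansDiabetesData is still loaded:

    #Mean number of pregnancies per diabetes group, using base R

    aggregate(pregnant ~ diabetes, data = PimaIndiansDiabetesData, FUN = mean)

    #Several statistics at once for the pressure column

    aggregate(pressure ~ diabetes, data = PimaIndiansDiabetesData, FUN = function(x) c(mean = mean(x), median = median(x), min = min(x), max = max(x)))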

Activity 3: Performing EDA on the PimaIndiansDiabetes Dataset

Solution:

  1. Load the PimaIndiansDiabetes dataset and assign it to a working variable.

    PimaIndiansDiabetes<-read.csv("PimaIndiansDiabetes.csv")

    PimaIndiansDiabetesData<-PimaIndiansDiabetes

  2. View the correlation among the features of the PimaIndiansDiabetes dataset.

    #Calculate correlations

    correlation <- cor(PimaIndiansDiabetesData[,1:4])

  3. Round the values to two decimal places.

    #Round the values to 2 decimal places

    round(correlation,2)

    The output is as follows:

             pregnant glucose pressure triceps

    pregnant     1.00    0.13     0.14   -0.08

    glucose      0.13    1.00     0.15    0.06

    pressure     0.14    0.15     1.00    0.21

    triceps     -0.08    0.06     0.21    1.00

  4. Pair them on a plot.

    #Plot the pairs on a plot

    pairs(PimaIndiansDiabetesData[,1:4])

    The output is as follows:

    Figure 1.38: A pair plot for the diabetes data
  5. Create a box plot to view the data distribution for the pregnant column and color by diabetes.

    # Load library

    library(ggplot2)

    boxplot <- ggplot(data=PimaIndiansDiabetesData, aes(x=diabetes, y=pregnant))

    boxplot + geom_boxplot(aes(fill=diabetes)) +

      ylab("Pregnant") + ggtitle("Diabetes Data Boxplot") +

      stat_summary(fun.y=mean, geom="point", shape=5, size=4)

    The output is as follows:

Figure 1.39: The box plot output using ggplot

In the preceding graph, we can see the distribution of "number of times pregnant" in people who do not have diabetes (neg) and in people who have diabetes (pos).

Activity 4: Building Linear Models for the GermanCredit Dataset

Solution:

These are the steps that will help you solve the activity:

  1. Load the data.

    GermanCredit <-read.csv("GermanCredit.csv")

  2. Subset the data.

    GermanCredit_Subset=GermanCredit[,1:10]

  3. Fit a linear model using lm().

    # fit model

    fit <- lm(Duration~., GermanCredit_Subset)

  4. Summarize the results using the summary() function.

    # summarize the fit

    summary(fit)

    The output is as follows:

    Call:

    lm(formula = Duration ~ ., data = GermanCredit_Subset)

    Residuals:

        Min      1Q  Median      3Q     Max

    -44.722  -5.524  -1.187   4.431  44.287

    Coefficients:

                                Estimate Std. Error t value Pr(>|t|)    

    (Intercept)                2.0325685  2.3612128   0.861  0.38955    

    Amount                     0.0029344  0.0001093  26.845  < 2e-16 ***

    InstallmentRatePercentage  2.7171134  0.2640590  10.290  < 2e-16 ***

    ResidenceDuration          0.2068781  0.2625670   0.788  0.43094    

    Age                       -0.0689299  0.0260365  -2.647  0.00824 **

    NumberExistingCredits     -0.3810765  0.4903225  -0.777  0.43723    

    NumberPeopleMaintenance   -0.0999072  0.7815578  -0.128  0.89831    

    Telephone                  0.6354927  0.6035906   1.053  0.29266    

    ForeignWorker              4.9141998  1.4969592   3.283  0.00106 **

    ClassGood                 -2.0068114  0.6260298  -3.206  0.00139 **

    ---

    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 8.784 on 990 degrees of freedom

    Multiple R-squared:  0.4742,    Adjusted R-squared:  0.4694

    F-statistic:  99.2 on 9 and 990 DF,  p-value: < 2.2e-16

  5. Use predict() to make the predictions.

    # make predictions

    predictions <- predict(fit, GermanCredit_Subset)

  6. Calculate the RMSE for the predictions.

    # summarize accuracy

    rmse <- sqrt(mean((GermanCredit_Subset$Duration - predictions)^2))

    print(rmse)

    The output is as follows:

    [1] 76.3849

    In this activity, we have learned to build a linear model, make predictions, and evaluate performance using RMSE. Note that the predictions here are made on the same data the model was fitted to.
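    The RMSE above is therefore an in-sample estimate and understates the error on unseen data. A minimal sketch of a held-out evaluation; the 70/30 split below is illustrative:

    # Hold out 30% of the rows for testing (illustrative split)

    set.seed(1)

    test_idx <- sample(nrow(GermanCredit_Subset), size = round(0.3 * nrow(GermanCredit_Subset)))

    train_data <- GermanCredit_Subset[-test_idx, ]

    test_data <- GermanCredit_Subset[test_idx, ]

    # Fit on the training rows only, then score the held-out rows

    fit_holdout <- lm(Duration ~ ., data = train_data)

    sqrt(mean((test_data$Duration - predict(fit_holdout, test_data))^2))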

Activity 5: Using Multiple Variables for a Regression Model for the Boston Housing Dataset

Solution:

These are the steps that will help you solve the activity:

  1. Load the dataset.

    BostonHousing <-read.csv("BostonHousing.csv")

  2. Build a regression model using multiple variables.

    #Build multi variable regression

    regression <- lm(medv~crim + indus+rad , data = BostonHousing)

  3. View the summary of the built regression model.

    #View the summary

    summary(regression)

    The output is as follows:

    Call:

    lm(formula = medv ~ crim + indus + rad, data = BostonHousing)

    Residuals:

        Min      1Q  Median      3Q     Max

    -12.047  -4.860  -1.736   3.081  32.596

    Coefficients:

                Estimate Std. Error t value Pr(>|t|)    

    (Intercept) 29.27515    0.68220  42.913  < 2e-16 ***

    crim        -0.23952    0.05205  -4.602 5.31e-06 ***

    indus       -0.51671    0.06336  -8.155 2.81e-15 ***

    rad         -0.01281    0.05845  -0.219    0.827    

    ---

    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 7.838 on 502 degrees of freedom

    Multiple R-squared:  0.2781, Adjusted R-squared:  0.2737

    F-statistic: 64.45 on 3 and 502 DF,  p-value: < 2.2e-16

  4. Plot the regression model using the plot() function.

    #Plot the fit

    plot(regression)

    The output is as follows:

Figure 1.40: Residual versus fitted values

The preceding plot compares the predicted values and the residual values.

Hit <Return> to see the next plot:

Figure 1.41: Normal QQ

The preceding plot shows the distribution of error. It is a normal probability plot. A normal distribution of error will display a straight line.

Hit <Return> to see the next plot:

Figure 1.42: Scale location plot

The preceding plot compares the spread of the residuals against the fitted values, showing whether the variance of the residuals is constant across the range of predictions.

Hit <Return> to see the next plot:

Figure 1.43: Cook's distance plot

This plot helps identify data points that are influential to the regression model, that is, observations whose inclusion or exclusion would noticeably change the model results.
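The same influence measure can be extracted numerically with cooks.distance(). A minimal sketch, assuming the regression object fitted in step 2; the 4/n cutoff is a common rule of thumb, not a fixed standard:

    # Cook's distance for every observation

    cooks_d <- cooks.distance(regression)

    # Flag observations above the 4/n rule of thumb

    influential <- which(cooks_d > 4 / nrow(BostonHousing))

    head(cooks_d[influential])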

We have now explored the datasets with one or more variables.

Chapter 2: Data Cleaning and Pre-processing

Activity 6: Pre-processing using Center and Scale

Solution:

In this activity, we will perform the center and scale pre-processing operations.

  1. Load the mlbench library and the PimaIndiansDiabetes dataset:

    # Load Library caret

    library(caret)

    library(mlbench)

    # load the dataset PimaIndiansDiabetes

    data(PimaIndiansDiabetes)

    View the summary:

    # view the data

    summary(PimaIndiansDiabetes [,1:2])

    The output is as follows:

        pregnant         glucose     

    Min.   : 0.000   Min.   :  0.0  

    1st Qu.: 1.000   1st Qu.: 99.0  

    Median : 3.000   Median :117.0  

    Mean   : 3.845   Mean   :120.9  

    3rd Qu.: 6.000   3rd Qu.:140.2  

    Max.   :17.000   Max.   :199.0

  2. Use preProcess() to pre-process the data to center and scale:

    # to standardise we will scale and center

    params <- preProcess(PimaIndiansDiabetes [,1:2], method=c("center", "scale"))

  3. Transform the dataset using predict():

    # transform the dataset

    new_dataset <- predict(params, PimaIndiansDiabetes [,1:2])

  4. Print the summary of the new dataset:

    # summarize the transformed dataset

    summary(new_dataset)

    The output is as follows:

        pregnant          glucose       

    Min.   :-1.1411   Min.   :-3.7812  

    1st Qu.:-0.8443   1st Qu.:-0.6848  

    Median :-0.2508   Median :-0.1218  

    Mean   : 0.0000   Mean   : 0.0000  

    3rd Qu.: 0.6395   3rd Qu.: 0.6054  

    Max.   : 3.9040   Max.   : 2.4429

    Notice that the values are now centered (each column has a mean of 0) and scaled.
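    The same standardization can be reproduced with base R's scale(), which is a quick way to sanity-check the preProcess() output. A minimal sketch:

    # Center and scale manually with base R

    manual <- as.data.frame(scale(PimaIndiansDiabetes[, 1:2]))

    summary(manual)

    # The two approaches should agree up to floating-point error

    all.equal(manual$glucose, new_dataset$glucose)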

Activity 7: Identifying Outliers

Solution:

  1. Load the dataset:

    mtcars = read.csv("mtcars.csv")

  2. Load the outliers package:

    #Load the outlier library

    library(outliers)

  3. Detect outliers in the dataset using the outlier() function:

    #Detect outliers

    outlier(mtcars)

    The output is as follows:

        mpg     cyl    disp      hp    drat      wt    qsec      vs      am    gear    carb
     33.900   4.000 472.000 335.000   4.930   5.424  22.900   1.000   1.000   5.000   8.000

  4. Display the other side of the outlier values:

    #This detects outliers from the other side

    outlier(mtcars,opposite=TRUE)

    The output is as follows:

       mpg    cyl   disp     hp   drat     wt   qsec     vs     am   gear   carb
    10.400  8.000 71.100 52.000  2.760  1.513 14.500  0.000  0.000  3.000  1.000

  5. Plot a box plot:

    #View the outliers

    boxplot(mtcars)

    The output is as follows:

Figure 2.36: Outliers in the mtcars dataset.

The circles mark the outliers.
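The values drawn as circles can also be listed directly with boxplot.stats(), whose $out component contains the points beyond the whiskers. A minimal sketch, restricted to numeric columns in case the CSV also contains a column of car names:

    # Points beyond the whiskers for a single column

    boxplot.stats(mtcars$hp)$out

    # Collect them for every numeric column

    lapply(Filter(is.numeric, mtcars), function(x) boxplot.stats(x)$out)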

Activity 8: Oversampling and Undersampling

Solution:

The detailed solution is as follows:

  1. Read the mushroom CSV file:

    ms<-read.csv('mushrooms.csv')

    summary(ms$bruises)

    The output is as follows:

       f    t

    4748 3376

  2. Perform downsampling using downSample() from the caret package:

    library(caret) #provides downSample() and upSample()

    set.seed(9560)

    undersampling <- downSample(x = ms[, -ncol(ms)], y = ms$bruises)

    table(undersampling$bruises)

    The output is as follows:

       f    t

    3376 3376

  3. Perform oversampling:

    set.seed(9560)

    oversampling <- upSample(x = ms[, -ncol(ms)],y = ms$bruises)

    table(oversampling$bruises)

    The output is as follows:

       f    t

    4748 4748

    In this activity, we learned to use downSample() and upSample() from the caret package to perform downsampling and oversampling.

Activity 9: Sampling and OverSampling using ROSE

Solution:

The detailed solution is as follows:

  1. Load the German credit dataset:

    #load the dataset

    library(caret)

    library(ROSE)

    data(GermanCredit)

  2. View the samples in the German credit dataset:

    #View samples

    head(GermanCredit)

    str(GermanCredit)

  3. Check the number of unbalanced data in the German credit dataset using the summary() method:

    #View the imbalanced data

    summary(GermanCredit$Class)

    The output is as follows:

    Bad Good

     300  700

  4. Use ROSE to balance the numbers:

    balanced_data <- ROSE(Class ~ ., data = GermanCredit, seed = 3)$data

    table(balanced_data$Class)

    The output is as follows:

    Good  Bad

     480  520

    Using the preceding example, we learned how to increase and decrease the class count using ROSE.
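    ROSE also provides ovun.sample() for plain over- and undersampling without synthetic data generation. A minimal sketch on the same dataset; the N values below are derived from the 700/300 class counts and are illustrative:

    # Oversample the minority class until both classes have 700 rows

    over <- ovun.sample(Class ~ ., data = GermanCredit, method = "over", N = 1400, seed = 3)$data

    table(over$Class)

    # Undersample the majority class down to 300 rows

    under <- ovun.sample(Class ~ ., data = GermanCredit, method = "under", N = 600, seed = 3)$data

    table(under$Class)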

Chapter 3: Feature Engineering

Activity 10: Calculating Time series Feature – Binning

Solution:

  1. Load the caret library:

    #Time series features

    library(caret)

    #Install caret if not installed

    #install.packages('caret')

  2. Load the GermanCredit dataset:

    GermanCredit = read.csv("GermanCredit.csv")

    duration<- GermanCredit$Duration #take the duration column

  3. Check the data summary as follows:

    summary(duration)

    The output is as follows:

    Figure 3.27: The summary of the Duration values of the German Credit dataset
  4. Load the ggplot2 library:

    library(ggplot2)

  5. Plot using the command:

    ggplot(data=GermanCredit, aes(x=Duration)) +

      geom_density(fill='lightblue') +

      geom_rug() +

      labs(x='Duration')

    The output is as follows:

    Figure 3.28: Plot of the duration vs density
  6. Create bins:

    #Creating Bins

    # set up boundaries for intervals/bins

    breaks <- c(0,10,20,30,40,50,60,70,80)

  7. Create labels:

    # specify interval/bin labels

    labels <- c("<10", "10-20", "20-30", "30-40", "40-50", "50-60", "60-70", "70-80")

  8. Bucket the data points into the bins.

    # bucketing data points into bins

    bins <- cut(duration, breaks, include.lowest = T, right=FALSE, labels=labels)

  9. Find the number of elements in each bin:

    # inspect bins

    summary(bins)

    The output is as follows:

      <10 10-20 20-30 30-40 40-50 50-60 60-70 70-80

      143   403   241   131    66     2    13     1

  10. Plot the bins:

    #Plotting the bins

    plot(bins, main="Frequency of Duration", ylab="Duration Count", xlab="Duration Bins",col="bisque")

    The output is as follows:

Figure 3.29: Plot of duration in bins

We can conclude that the largest number of customers falls within the 10-20 duration bin.
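Fixed-width breaks are only one choice; quantile-based breaks instead put roughly equal numbers of observations in each bin. A minimal sketch, assuming the duration vector from step 2 and distinct quartile values:

    # Quantile-based breaks give bins of roughly equal size

    q_breaks <- quantile(duration, probs = seq(0, 1, 0.25))

    q_bins <- cut(duration, q_breaks, include.lowest = TRUE)

    table(q_bins)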

Activity 11: Identifying Skewness

Solution:

  1. Load the library mlbench.

    #Skewness

    library(mlbench)

    library(e1071)

  2. Load the PimaIndiansDiabetes data.

    PimaIndiansDiabetes = read.csv("PimaIndiansDiabetes.csv")

  3. Print the skewness of the glucose column, using the skewness() function.

    #Printing the skewness of the columns

    #Not skewed

    skewness(PimaIndiansDiabetes$glucose)

    The output is as follows:

    [1] 0.1730754

  4. Plot the histogram using the histogram() function.

    histogram(PimaIndiansDiabetes$glucose)

    The output is as follows:

    Figure 3.30: Histogram of the glucose values of the PimaIndiansDiabetes dataset

    A negative skewness value means that the data is skewed to the left, and a positive skewness value means that the data is skewed to the right. Since the value here is 0.17, which is close to zero, the data is neither strongly left- nor right-skewed; it is approximately symmetric.

  5. Find the skewness of the age column using the skewness() function.

    #Highly skewed

    skewness(PimaIndiansDiabetes$age)

    The output is as follows:

    [1] 1.125188

  6. Plot the histogram using the histogram() function.

    histogram(PimaIndiansDiabetes$age)

    The output is as follows:

Figure 3.31: Histogram of the age values of the PimaIndiansDiabetes dataset

The positive skewness value means that the data is skewed to the right, as the histogram above shows.
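A common remedy for right skew is a log transform. A minimal sketch checking how it changes the skewness of age; since the minimum age is 21, the logarithm is well defined:

    # Log-transforming a right-skewed column pulls the skewness towards zero

    skewness(log(PimaIndiansDiabetes$age))

    histogram(log(PimaIndiansDiabetes$age))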

Activity 12: Generating PCA

Solution:

  1. Load the GermanCredit data.

    #PCA Analysis

    data(GermanCredit)

  2. Create a subset of the first nine columns in another variable named GermanCredit_subset.

    #Use the German Credit Data

    GermanCredit_subset <- GermanCredit[,1:9]

  3. Find the principal components:

    #Find out the Principal components

    principal_components <- prcomp(x = GermanCredit_subset, scale. = T)

  4. Print the principal components:

    #Print the principal components

    print(principal_components)

    The output is as follows:

    Standard deviations (1, .., p=9):

    [1] 1.3505916 1.2008442 1.1084157 0.9721503 0.9459586 0.9317018 0.9106746 0.8345178 0.5211137

    Rotation (n x k) = (9 x 9):

Figure 3.32: The rotation matrix of the principal components

Therefore, by using principal component analysis, we can compute the nine principal components of this dataset. Each component is a combination of the original fields, and the components can be used as features in their own right.
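To decide how many components to keep, summary() reports the proportion of variance each component explains, and predict() returns the component scores that can be used as features. A minimal sketch:

    # Proportion of variance explained by each component

    summary(principal_components)

    # Component scores: one row per observation, usable as features

    pca_features <- predict(principal_components, GermanCredit_subset)

    head(pca_features[, 1:3])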

Activity 13: Implementing the Random Forest Approach

Solution:

  1. Load the GermanCredit data:

    data(GermanCredit)

  2. Create a subset to load the first ten columns into GermanCredit_subset.

    GermanCredit_subset <- GermanCredit[,1:10]

  3. Attach the randomForest package:

    library(randomForest)

  4. Train a random forest model using random_forest = randomForest(Class~., data=GermanCredit_subset):

    random_forest = randomForest(Class~., data=GermanCredit_subset)

  5. Invoke importance() for the trained random_forest:

    # Create an importance based on mean decreasing gini

    importance(random_forest)

    The output is as follows:

                              MeanDecreaseGini

    Duration                         70.380265

    Amount                          121.458790

    InstallmentRatePercentage        27.048517

    ResidenceDuration                30.409254

    Age                              86.476017

    NumberExistingCredits            18.746057

    NumberPeopleMaintenance          12.026969

    Telephone                        15.581802

    ForeignWorker                     2.888387

  6. Use the varImp() function to view the list of important variables.

    varImp(random_forest)

    The output is as follows:

                                 Overall

    Duration                   70.380265

    Amount                    121.458790

    InstallmentRatePercentage  27.048517

    ResidenceDuration          30.409254

    Age                        86.476017

    NumberExistingCredits      18.746057

    NumberPeopleMaintenance    12.026969

    Telephone                  15.581802

    ForeignWorker               2.888387

    In this activity, we built a random forest model and used it to measure the importance of each variable in the dataset. Variables with higher scores are considered more important. We can then sort by importance and select, say, the top 5 or top 10 variables, or set an importance threshold and keep every variable that meets it.
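    Turning the importance scores into a feature subset is then straightforward. A minimal sketch, assuming the random_forest object from step 4; the threshold of 20 is illustrative:

    # Keep only the variables whose mean decrease in Gini exceeds the threshold

    imp <- importance(random_forest)

    selected <- rownames(imp)[imp[, "MeanDecreaseGini"] > 20]

    selected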

Activity 14: Selecting Features Using Variable Importance

Solution:

  1. Install the following packages:

    install.packages("rpart")

    library(rpart)

    library(caret)

    set.seed(10)

  2. Load the GermanCredit dataset:

    data(GermanCredit)

  3. Create a subset to load the first ten columns into GermanCredit_subset:

    GermanCredit_subset <- GermanCredit[,1:10]

  4. Train an rpart model using rPartMod <- train(Class ~ ., data=GermanCredit_subset, method="rpart"):

    #Train a rpart model

    rPartMod <- train(Class ~ ., data=GermanCredit_subset, method="rpart")

  5. Invoke the varImp() function, as in rpartImp <- varImp(rPartMod).

    #Find variable importance

    rpartImp <- varImp(rPartMod)

  6. Print rpartImp.

    #Print variable importance

    print(rpartImp)

    The output is as follows:

    rpart variable importance

                              Overall

    Amount                    100.000

    Duration                   89.670

    Age                        75.229

    ForeignWorker              22.055

    InstallmentRatePercentage  17.288

    Telephone                   7.813

    ResidenceDuration           4.471

    NumberExistingCredits       0.000

    NumberPeopleMaintenance     0.000

  7. Plot rpartImp using plot().

    #Plot top 5 variable importance

    plot(rpartImp, top = 5, main='Variable Importance')

    The output is as follows:

Figure 3.33: Variable importance for the fields

From the preceding plot, we can observe that Amount, Duration, and Age have high importance values.

Chapter 4: Introduction to neuralnet and Evaluation Methods

Activity 15: Training a Neural Network

Solution:

  1. Attach the packages:

    # Attach the packages

    library(caret)

    library(groupdata2)

    library(neuralnet)

    library(NeuralNetTools)

  2. Set the seed value for reproducibility and easier comparison:

    # Set seed for reproducibility and easier comparison

    set.seed(1)

  3. Load the GermanCredit dataset:

    # Load the German Credit dataset

    GermanCredit <- read.csv("GermanCredit.csv")

  4. Remove the Age column:

    # Remove the Age column

    GermanCredit$Age <- NULL

  5. Create balanced partitions such that all three partitions have the same ratio of each class:

    # Partition with same ratio of each class in all three partitions

    partitions <- partition(GermanCredit, p = c(0.6, 0.2),

                            cat_col = "Class")

    train_set <- partitions[[1]]

    dev_set <- partitions[[2]]

    valid_set <- partitions[[3]]

  6. Find the preprocessing parameters for scaling and centering from the training set:

    # Find scaling and centering parameters

    params <- preProcess(train_set[, 1:6], method=c("center", "scale"))

  7. Apply standardization to the first six predictors in all three partitions, using the preProcess parameters from the previous step:

    # Transform the training set

    train_set[, 1:6] <- predict(params, train_set[, 1:6])

    # Transform the development set

    dev_set[, 1:6] <- predict(params, dev_set[, 1:6])

    # Transform the validation set

    valid_set[, 1:6] <- predict(params, valid_set[, 1:6])

  8. Train the neural network classifier:

    # Train the neural network classifier

    nn <- neuralnet(Class == "Good" ~ InstallmentRatePercentage +

                    ResidenceDuration + NumberExistingCredits,

                    train_set, linear.output = FALSE)

  9. Plot the network with its weights:

    # Plot the network

    plotnet(nn, var_labs=FALSE)

    The output is as follows:

    Figure 4.18: Neural network architecture using three predictors
  10. Print the error:

    train_error <- nn$result.matrix[1]

    train_error

    The output is as follows:

    ## [1] 62.15447

    The random initialization of the neural network weights can lead to slightly different results from one training to another. To avoid this, we use the set.seed() function at the beginning of the script, which helps when comparing models. We could also train the same model architecture with five different seeds to get a better sense of its performance.
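    A minimal sketch of the multiple-seeds idea, retraining the same architecture under different seeds and summarizing the training errors; the seed values are arbitrary, and individual runs may occasionally fail to converge:

    # Train the same architecture under several seeds and collect the errors

    seed_errors <- sapply(c(1, 2, 3, 4, 5), function(seed) {

      set.seed(seed)

      nn_i <- neuralnet(Class == "Good" ~ InstallmentRatePercentage + ResidenceDuration + NumberExistingCredits, train_set, linear.output = FALSE)

      nn_i$result.matrix[1]

    })

    mean(seed_errors)

    sd(seed_errors)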

Activity 16: Training and Comparing Neural Network Architectures

Solution:

  1. Attach the packages:

    # Attach the packages

    library(groupdata2)

    library(caret)

    library(neuralnet)

    library(mlbench)

  2. Set the random seed to 1:

    # Set seed for reproducibility and easier comparison

    set.seed(1)

  3. Load the PimaIndiansDiabetes2 dataset:

    # Load the PimaIndiansDiabetes2 dataset

    PimaIndiansDiabetes2 <- read.csv("PimaIndiansDiabetes2.csv")

  4. Summarize the dataset.

    summary(PimaIndiansDiabetes2)

    The summary is as follows:

    ##     pregnant         glucose         pressure         triceps     

    ##  Min.   : 0.000   Min.   : 44.0   Min.   : 24.00   Min.   : 7.00  

    ##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 64.00   1st Qu.:22.00  

    ##  Median : 3.000   Median :117.0   Median : 72.00   Median :29.00  

    ##  Mean   : 3.845   Mean   :121.7   Mean   : 72.41   Mean   :29.15  

    ##  3rd Qu.: 6.000   3rd Qu.:141.0   3rd Qu.: 80.00   3rd Qu.:36.00  

    ##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  

    ##                   NA's   :5       NA's   :35       NA's   :227  

    ##  

    ##     insulin            mass          pedigree           age       

    ##  Min.   : 14.00   Min.   :18.20   Min.   :0.0780   Min.   :21.00  

    ##  1st Qu.: 76.25   1st Qu.:27.50   1st Qu.:0.2437   1st Qu.:24.00  

    ##  Median :125.00   Median :32.30   Median :0.3725   Median :29.00  

    ##  Mean   :155.55   Mean   :32.46   Mean   :0.4719   Mean   :33.24  

    ##  3rd Qu.:190.00   3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00  

    ##  Max.   :846.00   Max.   :67.10   Max.   :2.4200   Max.   :81.00  

    ##  NA's   :374      NA's   :11               

    ##                       

    ##  diabetes

    ##  neg:500  

    ##  pos:268        

  5. Handle missing data (quick solution). Start by assigning the dataset to a new name:

    # Assign/copy dataset to a new name

    diabetes_data <- PimaIndiansDiabetes2

  6. Remove the triceps and insulin columns:

    # Remove the triceps and insulin columns

    diabetes_data$triceps <- NULL

    diabetes_data$insulin <- NULL

  7. Remove all rows containing missing data (NAs):

    # Remove all rows with NAs (missing data)

    diabetes_data <- na.omit(diabetes_data)

  8. Partition the dataset into a training set (60%), a development set (20%), and a validation set (20%). Use cat_col="diabetes" to balance the ratios of each class between the partitions:

    # Partition with same ratio of each class in all three partitions

    partitions <- partition(diabetes_data, p = c(0.6, 0.2),

                                cat_col = "diabetes")

    train_set <- partitions[[1]]

    dev_set <- partitions[[2]]

    valid_set <- partitions[[3]]

  9. Find the preProcess parameters for scaling and centering the first six features:

    # Find scaling and centering parameters

    params <- preProcess(train_set[, 1:6], method = c("center", "scale"))

  10. Apply the scaling and centering to each partition:

    # Transform the training set

    train_set[, 1:6] <- predict(params, train_set[, 1:6])

    # Transform the development set

    dev_set[, 1:6] <- predict(params, dev_set[, 1:6])

    # Transform the validation set

    valid_set[, 1:6] <- predict(params, valid_set[, 1:6])

  11. Train multiple neural network architectures. Adjust them by changing the number of nodes and/or layers. In the model formula, use diabetes == "pos":

    # Training multiple neural nets

    nn4 <- neuralnet(diabetes == "pos" ~ ., train_set,

                     linear.output = FALSE, hidden = c(3))

    nn5 <- neuralnet(diabetes == "pos" ~ ., train_set,

                     linear.output = FALSE, hidden = c(2,1))

    nn6 <- neuralnet(diabetes == "pos" ~ ., train_set,

                     linear.output = FALSE, hidden = c(3,2))

  12. Put the model objects into a list:

    # Put the model objects into a list

    models <- list("nn4"=nn4,"nn5"=nn5,"nn6"=nn6)

  13. Create one-hot encoding of the diabetes variable:

    # Evaluating each model on the dev_set

    # Create one-hot encoding of diabetes variable

    dev_true_labels <- ifelse(dev_set$diabetes == "pos", 1, 0)

  14. Create a for loop for evaluating the models. By running the evaluations in a for loop, we avoid repeating the code:

    # Evaluate one model at a time in a loop, to avoid repeating the code

    for (i in 1:length(models)){

      

      # Predict the classes in the development set

      dev_predicted_probabilities <- predict(models[[i]], dev_set)

      dev_predictions <- ifelse(dev_predicted_probabilities > 0.5, 1, 0)

      

      # Create confusion Matrix

      confusion_matrix <- confusionMatrix(as.factor(dev_predictions),

                                          as.factor(dev_true_labels),

                                          mode="prec_recall",

                                          positive = "1")

      

      # Print the results for this model

      # Note: paste0() concatenates the strings

      # to (name of model + " on the dev...")

      print( paste0( names(models)[[i]], " on the development set: "))

      print(confusion_matrix)

      

    }

    The output is as follows:

    ## [1] "nn4 on the development set: "

    ## Confusion Matrix and Statistics

    ##

    ##           Reference

    ## Prediction  0  1

    ##          0 79 19

    ##          1 16 30

    ##                                           

    ##                Accuracy : 0.7569          

    ##                  95% CI : (0.6785, 0.8245)

    ##     No Information Rate : 0.6597          

    ##     P-Value [Acc > NIR] : 0.007584        

    ##                                           

    ##                   Kappa : 0.4505          

    ##  Mcnemar's Test P-Value : 0.735317        

    ##                                           

    ##               Precision : 0.6522          

    ##                  Recall : 0.6122          

    ##                      F1 : 0.6316          

    ##              Prevalence : 0.3403          

    ##          Detection Rate : 0.2083          

    ##    Detection Prevalence : 0.3194          

    ##       Balanced Accuracy : 0.7219          

    ##                                           

    ##        'Positive' Class : 1               

    ##                                           

    ## [1] "nn5 on the development set: "

    ## Confusion Matrix and Statistics

    ##

    ##           Reference

    ## Prediction  0  1

    ##          0 77 16

    ##          1 18 33

    ##                                          

    ##                Accuracy : 0.7639         

    ##                  95% CI : (0.686, 0.8306)

    ##     No Information Rate : 0.6597         

    ##     P-Value [Acc > NIR] : 0.004457       

    ##                                          

    ##                   Kappa : 0.4793         

    ##  Mcnemar's Test P-Value : 0.863832       

    ##                                          

    ##               Precision : 0.6471         

    ##                  Recall : 0.6735         

    ##                      F1 : 0.6600         

    ##              Prevalence : 0.3403         

    ##          Detection Rate : 0.2292         

    ##    Detection Prevalence : 0.3542         

    ##       Balanced Accuracy : 0.7420         

    ##                                          

    ##        'Positive' Class : 1              

    ##                                          

    ## [1] "nn6 on the development set: "

    ## Confusion Matrix and Statistics

    ##

    ##           Reference

    ## Prediction  0  1

    ##          0 76 14

    ##          1 19 35

    ##                                           

    ##                Accuracy : 0.7708          

    ##                  95% CI : (0.6935, 0.8367)

    ##     No Information Rate : 0.6597          

    ##     P-Value [Acc > NIR] : 0.002528        

    ##                                           

    ##                   Kappa : 0.5019          

    ##  Mcnemar's Test P-Value : 0.486234        

    ##                                           

    ##               Precision : 0.6481          

    ##                  Recall : 0.7143          

    ##                      F1 : 0.6796          

    ##              Prevalence : 0.3403          

    ##          Detection Rate : 0.2431          

    ##    Detection Prevalence : 0.3750          

    ##       Balanced Accuracy : 0.7571          

    ##                                           

    ##        'Positive' Class : 1               

  15. As the nn6 model has the highest accuracy and F1 score, it is the best model.
  16. Evaluate the best model on the validation set. Start by creating the one-hot encoding of the diabetes variable in the validation set:

    # Create one-hot encoding of Class variable

    valid_true_labels <- ifelse(valid_set$diabetes == "pos", 1, 0)

  17. Use the best model to predict the diabetes variable in the validation set:

    # Predict the classes in the validation set

    predicted_probabilities <- predict(nn6, valid_set)

    predictions <- ifelse(predicted_probabilities > 0.5, 1, 0)

  18. Create a confusion matrix:

    # Create confusion Matrix

    confusion_matrix <- confusionMatrix(as.factor(predictions),

                                        as.factor(valid_true_labels),

                                        mode="prec_recall", positive = "1")

  19. Print the results:

    # Print the results for this model

    # Note that by separating two function calls by ";"

    # we can have multiple calls per line

    print("nn6 on the validation set:"); print(confusion_matrix)

    The output is as follows:

    ## [1] "nn6 on the validation set:"

    ## Confusion Matrix and Statistics

    ##

    ##           Reference

    ## Prediction  0  1

    ##          0 70 16

    ##          1 25 35

    ##                                           

    ##                Accuracy : 0.7192          

    ##                  95% CI : (0.6389, 0.7903)

    ##     No Information Rate : 0.6507          

    ##     P-Value [Acc > NIR] : 0.04779         

    ##                                           

    ##                   Kappa : 0.4065          

    ##  Mcnemar's Test P-Value : 0.21152         

    ##                                           

    ##               Precision : 0.5833          

    ##                  Recall : 0.6863          

    ##                      F1 : 0.6306          

    ##              Prevalence : 0.3493          

    ##          Detection Rate : 0.2397          

    ##    Detection Prevalence : 0.4110          

    ##       Balanced Accuracy : 0.7116          

    ##                                           

    ##        'Positive' Class : 1               

  20. Plot the best model:

    plotnet(nn6, var_labs=FALSE)

    The output will look as follows:

    Figure 4.19: The best neural network architecture without cross-validation.

In this activity, we have trained multiple neural network architectures and evaluated the best model on the validation set.

Activity 17: Training and Comparing Neural Network Architectures with Cross-Validation

Solution:

  1. Attach the packages.

    # Attach the packages

    library(groupdata2)

    library(caret)

    library(neuralnet)

    library(mlbench)

  2. Set the random seed to 1.

    # Set seed for reproducibility and easier comparison

    set.seed(1)

  3. Load the PimaIndiansDiabetes2 dataset.

    # Load the PimaIndiansDiabetes2 dataset

    data(PimaIndiansDiabetes2)

  4. Handle missing data (quick solution).

    Start by assigning the dataset to a new name.

    # Handling missing data (quick solution)

    # Assign/copy dataset to a new name

    diabetes_data <- PimaIndiansDiabetes2

  5. Remove the triceps and insulin columns.

    # Remove the triceps and insulin columns

    diabetes_data$triceps <- NULL

    diabetes_data$insulin <- NULL

  6. Remove all rows with NAs.

    # Remove all rows with NAs (missing data)

    diabetes_data <- na.omit(diabetes_data)

  7. Partition the dataset into a training set (80%) and validation set (20%). Use cat_col="diabetes" to balance the ratios of each class between the partitions.

    # Partition into a training set and a validation set

    partitions <- partition(diabetes_data, p = 0.8, cat_col = "diabetes")

    train_set <- partitions[[1]]

    valid_set <- partitions[[2]]

  8. Find the preProcess parameters for scaling and centering the first six features.

    # Find scaling and centering parameters

    # Note: We could also decide to do this inside the training loop!

    params <- preProcess(train_set[, 1:6], method=c("center", "scale"))

  9. Apply the scaling and centering to both partitions.

    # Transform the training set

    train_set[, 1:6] <- predict(params, train_set[, 1:6])

    # Transform the validation set

    valid_set[, 1:6] <- predict(params, valid_set[, 1:6])

  10. Create 4 folds in the training set, using the fold() function. Use cat_col="diabetes" to balance the ratios of each class between the folds.

    # Create folds for cross-validation

    # Balance on the Class variable

    train_set <- fold(train_set, k=4, cat_col = "diabetes")

    # Note: This creates a factor in the dataset called ".folds"

    # Take care not to use this as a predictor.

  11. Write the cross-validation training section. Start by initializing the vectors for collecting errors and accuracies.

    ## Cross-validation loop

    # Change the model formula in the loop and run the below

    # for each model architecture you're testing

    # Initialize vectors for collecting errors and accuracies

    errors <- c()

    accuracies <- c()

    Start the training for loop. We have 4 folds, so we need 4 iterations.

    # Training loop

    for (part in 1:4){

  12. Assign the chosen fold as test set and the rest of the folds as train set. Be aware of the indentation.

      # Assign the chosen fold as test set

      # and the rest of the folds as train set

      cv_test_set <- train_set[train_set$.folds == part,]

      cv_train_set <- train_set[train_set$.folds != part,]

  13. Train the neural network with your chosen predictors.

      # Train neural network classifier

      # Make sure not to include the .folds column as a predictor!

      nn <- neuralnet(diabetes == "pos" ~ .,

                      cv_train_set[, 1:7],

                      linear.output = FALSE,

                      hidden=c(2,2))

  14. Append the error to the errors vector.

      # Append error to errors vector

      errors <- append(errors, nn$result.matrix[1])

  15. Create one-hot encoding of the target variable in the CV test set.

      # Create one-hot encoding of Class variable

      true_labels <- ifelse(cv_test_set$diabetes == "pos", 1, 0)

  16. Use the trained neural network to predict the target variable in the CV test set.

      # Predict the class in the test set

      # It returns probabilities that the observations are "pos"

      predicted_probabilities <- predict(nn, cv_test_set)

      predictions <- ifelse(predicted_probabilities > 0.5, 1, 0)

  17. Calculate accuracy. We could also use confusionMatrix() here, if we wanted other metrics.

      # Calculate accuracy manually

      # Note: TRUE == 1, FALSE == 0

      cv_accuracy <- sum(true_labels == predictions) / length(true_labels)

  18. Append the calculated accuracy to the accuracies vector.

      # Append the accuracy to the accuracies vector

      accuracies <- append(accuracies, cv_accuracy)

  19. Close the for loop.

    }

  20. Calculate average_error and print it.

    # Calculate average error and accuracy

    # Note that we could also have gathered the predictions from all the

    # folds and calculated the accuracy only once. This could lead to slightly

    # different results, e.g. if the folds are not exactly the same size.

    average_error <- mean(errors)

    average_error

    The output is as follows:

    ## [1] 28.38503

  21. Calculate average_accuracy and print it. Note that we could also have gathered the predictions from all the folds and calculated the accuracy only once (see the sketch after this activity).

    average_accuracy <- mean(accuracies)

    average_accuracy

    The output is as follows:

    ## [1] 0.7529813

  22. Evaluate the best model architecture on the validation set. Start by training an instance of the model architecture on the entire training set.

    # Once you have chosen the best model, train it on the entire training set

    # and evaluate on the validation set

    # Note that we set the stepmax, to make sure

    # it has enough training steps to converge

    nn_best <- neuralnet(diabetes == "pos" ~ .,

                         train_set[, 1:7],

                         linear.output = FALSE,

                         hidden=c(2,2),

                         stepmax = 2e+05)

  23. Create an one-hot encoding of the diabetes variable in the validation set.

    # Find the true labels in the validation set

    valid_true_labels <- ifelse(valid_set$diabetes == "pos", 1, 0)

  24. Use the model to predict the diabetes variable in the validation set.

    # Predict the classes in the validation set

    predicted_probabilities <- predict(nn_best, valid_set)

    predictions <- ifelse(predicted_probabilities > 0.5, 1, 0)

  25. Create a confusion matrix.

    # Create confusion matrix

    confusion_matrix <- confusionMatrix(as.factor(predictions),

                                        as.factor(valid_true_labels),

                                        mode="prec_recall", positive = "1")

  26. Print the results.

    # Print the results for this model

    print("nn_best on the validation set:")

    ## [1] "nn_best on the validation set:"

    print(confusion_matrix)

    ## Confusion Matrix and Statistics

    ##

    ##           Reference

    ## Prediction  0  1

    ##          0 78 20

    ##          1 17 30

    ##                                           

    ##                Accuracy : 0.7448          

    ##                  95% CI : (0.6658, 0.8135)

    ##     No Information Rate : 0.6552          

    ##     P-Value [Acc > NIR] : 0.01302         

    ##                                           

    ##                   Kappa : 0.4271          

    ##  Mcnemar's Test P-Value : 0.74231         

    ##                                           

    ##               Precision : 0.6383          

    ##                  Recall : 0.6000          

    ##                      F1 : 0.6186          

    ##              Prevalence : 0.3448          

    ##          Detection Rate : 0.2069          

    ##    Detection Prevalence : 0.3241          

    ##       Balanced Accuracy : 0.7105          

    ##                                           

    ##        'Positive' Class : 1               

    ##

  27. Plot the neural network.

    plotnet(nn_best, var_labs=FALSE)

    The output will be as follows:

Figure 4.20: Best neural network architecture found with cross-validation.
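Step 21 noted that the fold predictions could instead be pooled and scored once. A minimal sketch of that variant, reusing train_set and the .folds column created in the solution above; the pooled accuracy may differ slightly from the averaged fold accuracies:

    # Pool predictions across folds and compute the accuracy once

    all_true <- c()

    all_pred <- c()

    for (part in 1:4){

      cv_test_set <- train_set[train_set$.folds == part, ]

      cv_train_set <- train_set[train_set$.folds != part, ]

      nn <- neuralnet(diabetes == "pos" ~ ., cv_train_set[, 1:7], linear.output = FALSE, hidden = c(2, 2))

      probs <- predict(nn, cv_test_set)

      all_pred <- append(all_pred, ifelse(probs > 0.5, 1, 0))

      all_true <- append(all_true, ifelse(cv_test_set$diabetes == "pos", 1, 0))

    }

    # One pooled accuracy instead of an average of per-fold accuracies

    sum(all_true == all_pred) / length(all_true)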

Chapter 5: Linear and Logistic Regression Models

Activity 18: Implementing Linear Regression

Solution:

  1. Attach the packages:

    # Attach packages

    library(groupdata2)

    library(cvms)

    library(caret)

    library(knitr)

  2. Set the random seed to 1:

    # Set seed for reproducibility and easy comparison

    set.seed(1)

  3. Load the cars dataset from caret:

    # Load the cars dataset

    data(cars)

  4. Partition the dataset into a training set (80%) and a validation set (20%):

    # Partition the dataset

    partitions <- partition(cars, p = 0.8)

    train_set <- partitions[[1]]

    valid_set <- partitions[[2]]

  5. Fit multiple linear regression models on the training set with the lm() function, predicting Price. Try different predictors. View and interpret the summary() of each fitted model. How do the interpretations change when you add or subtract predictors?

    # Fit a couple of linear models and interpret them

    # Model 1 - Predicting price by mileage

    model_1 <- lm(Price ~ Mileage, data = train_set)

    summary(model_1)

    The summary of the model is as follows:

    ##

    ## Call:

    ## lm(formula = Price ~ Mileage, data = train_set)

    ##

    ## Residuals:

    ##    Min     1Q Median     3Q    Max

    ## -12977  -7400  -3453   6009  45540

    ##

    ## Coefficients:

    ##               Estimate Std. Error t value Pr(>|t|)   

    ## (Intercept)  2.496e+04  1.019e+03  24.488  < 2e-16 ***

    ## Mileage     -1.736e-01  4.765e-02  -3.644 0.000291 ***

    ## ---

    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    ##

    ## Residual standard error: 9762 on 641 degrees of freedom

    ## Multiple R-squared:  0.02029,    Adjusted R-squared:  0.01876

    ## F-statistic: 13.28 on 1 and 641 DF,  p-value: 0.0002906

    # Model 2 - Predicting price by number of doors

    model_2 <- lm(Price ~ Doors, data = train_set)

    summary(model_2)

    The summary of the model is as follows:

    ##

    ## Call:

    ## lm(formula = Price ~ Doors, data = train_set)

    ##

    ## Residuals:

    ##    Min     1Q Median     3Q    Max

    ## -12540  -7179  -2934   5814  45805

    ##

    ## Coefficients:

    ##             Estimate Std. Error t value Pr(>|t|)   

    ## (Intercept)  25682.2     1662.1  15.452   <2e-16 ***

    ## Doors        -1176.6      457.5  -2.572   0.0103 *

    ## ---

    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    ##

    ## Residual standard error: 9812 on 641 degrees of freedom

    ## Multiple R-squared:  0.01021,    Adjusted R-squared:  0.008671

    ## F-statistic: 6.615 on 1 and 641 DF,  p-value: 0.01034

    # Model 3 - Predicting price by mileage and number of doors

    model_3 <- lm(Price ~ Mileage + Doors, data = train_set)

    summary(model_3)

    The summary of the model is as follows:

    ##

    ## Call:

    ## lm(formula = Price ~ Mileage + Doors, data = train_set)

    ##

    ## Residuals:

    ##    Min     1Q Median     3Q    Max

    ## -12642  -7503  -3000   5595  43576

    ##

    ## Coefficients:

    ##               Estimate Std. Error t value Pr(>|t|)   

    ## (Intercept)  2.945e+04  1.926e+03  15.292  < 2e-16 ***

    ## Mileage     -1.786e-01  4.744e-02  -3.764 0.000182 ***

    ## Doors       -1.242e+03  4.532e+02  -2.740 0.006308 **

    ## ---

    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    ##

    ## Residual standard error: 9713 on 640 degrees of freedom

    ## Multiple R-squared:  0.03165,    Adjusted R-squared:  0.02863

    ## F-statistic: 10.46 on 2 and 640 DF,  p-value: 3.388e-05

  6. Create model formulas with combine_predictors(). Limit the number of possibilities by a) using only the first four predictors, b) limiting the number of fixed effects in the formulas to three by specifying max_fixed_effects = 3, c) limiting the biggest possible interaction to a two-way interaction by specifying max_interaction_size = 2, and d) limiting the number of times a predictor can be included in a formula by specifying max_effect_frequency = 1. These limitations will decrease the number of models to run, which you may or may not want in your own projects:

    # Create list of model formulas with combine_predictors()

    # Use only the 4 first predictors (to save time)

    # Limit the number of fixed effects (predictors) to 3,

    # Limit the biggest possible interaction to a 2-way interaction

    # Limit the number of times a fixed effect is included to 1

    model_formulas <- combine_predictors(

        dependent = "Price",

        fixed_effects = c("Mileage", "Cylinder",

                          "Doors", "Cruise"),

        max_fixed_effects = 3,

        max_interaction_size = 2,

        max_effect_frequency = 1)

    # Output model formulas

    model_formulas

    The output is as follows:

    ##  [1] "Price ~ Cruise"                    

    ##  [2] "Price ~ Cylinder"                  

    ##  [3] "Price ~ Doors"                     

    ##  [4] "Price ~ Mileage"                   

    ##  [5] "Price ~ Cruise * Cylinder"         

    ##  [6] "Price ~ Cruise * Doors"            

    ##  [7] "Price ~ Cruise * Mileage"          

    ##  [8] "Price ~ Cruise + Cylinder"         

    ##  [9] "Price ~ Cruise + Doors"            

    ## [10] "Price ~ Cruise + Mileage"          

    ## [11] "Price ~ Cylinder * Doors"          

    ## [12] "Price ~ Cylinder * Mileage"        

    ## [13] "Price ~ Cylinder + Doors"          

    ## [14] "Price ~ Cylinder + Mileage"        

    ## [15] "Price ~ Doors * Mileage"           

    ## [16] "Price ~ Doors + Mileage"           

    ## [17] "Price ~ Cruise * Cylinder + Doors"

    ## [18] "Price ~ Cruise * Cylinder + Mileage"

    ## [19] "Price ~ Cruise * Doors + Cylinder"

    ## [20] "Price ~ Cruise * Doors + Mileage"  

    ## [21] "Price ~ Cruise * Mileage + Cylinder"

    ## [22] "Price ~ Cruise * Mileage + Doors"  

    ## [23] "Price ~ Cruise + Cylinder * Doors"

    ## [24] "Price ~ Cruise + Cylinder * Mileage"

    ## [25] "Price ~ Cruise + Cylinder + Doors"

    ## [26] "Price ~ Cruise + Cylinder + Mileage"

    ## [27] "Price ~ Cruise + Doors * Mileage"  

    ## [28] "Price ~ Cruise + Doors + Mileage"  

    ## [29] "Price ~ Cylinder * Doors + Mileage"

    ## [30] "Price ~ Cylinder * Mileage + Doors"

    ## [31] "Price ~ Cylinder + Doors * Mileage"

    ## [32] "Price ~ Cylinder + Doors + Mileage"

  7. Create five fold columns with four folds each in the training set, using fold() with k = 4 and num_fold_cols = 5. Feel free to choose a higher number of fold columns:

    # Create 5 fold columns with 4 folds each in the training set

    train_set <- fold(train_set, k = 4,

                      num_fold_cols = 5)

  8. Create the fold column names with paste0():

    # Create list of fold column names

    fold_cols <- paste0(".folds_", 1:5)

  9. Perform repeated cross-validation on your model formulas with cvms:

    # Cross-validate the models with cvms

    CV_results <- cross_validate(train_set,

                                 models = model_formulas,

                                 fold_cols = fold_cols,

                                 family = "gaussian")

  10. Print the top 10 performing models according to RMSE. Select the best model:

    # Select the best model by RMSE

    # Order by RMSE

    CV_results <- CV_results[order(CV_results$RMSE),]

    # Select the 10 best performing models for printing

    # (Feel free to view all the models)

    CV_results_top10 <- head(CV_results, 10)

    # Show metrics and model definition columns

    # Use kable for a prettier output

    kable(select_metrics(CV_results_top10), digits = 2)

    The output is as follows:

    Figure 5.29: Top 10 performing models using RMSE
  11. Fit the best model on the entire training set and evaluate it on the validation set. This can be done with the validate() function in cvms:

    # Evaluate the best model on the validation set with validate()

    V_results <- validate(

        train_data = train_set,

        test_data = valid_set,

        models = "Price ~ Cruise * Cylinder + Mileage",

        family = "gaussian")

  12. The output contains the results data frame and the trained model. Assign these to variable names:

    valid_results <- V_results$Results

    valid_model <- V_results$Models[[1]]

  13. Print the results:

    # Print the results

    kable(select_metrics(valid_results), digits = 2)

    The output is as follows:

    Figure 5.30: Results of the validated model
  14. View and interpret the summary of the best model:

    # Print the model summary and interpret it

    summary(valid_model)

    The summary is as follows:

    ##

    ## Call:

    ## lm(formula = model_formula, data = train_set)

    ##

    ## Residuals:

    ##    Min     1Q Median     3Q    Max

    ## -10485  -5495  -1425   3494  34693

    ##

    ## Coefficients:

    ##                   Estimate Std. Error t value Pr(>|t|)   

    ## (Intercept)      8993.2446  3429.9320   2.622  0.00895 **

    ## Cruise          -1311.6871  3585.6289  -0.366  0.71462   

    ## Cylinder         1809.5447   741.9185   2.439  0.01500 *

    ## Mileage            -0.1569     0.0367  -4.274 2.21e-05 ***

    ## Cruise:Cylinder  1690.0768   778.7838   2.170  0.03036 *

    ## ---

    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    ##

    ## Residual standard error: 7503 on 638 degrees of freedom

    ## Multiple R-squared:  0.424,  Adjusted R-squared:  0.4203

    ## F-statistic: 117.4 on 4 and 638 DF,  p-value: < 2.2e-16

Activity 19: Classifying Room Types

Solution:

  1. Attach the groupdata2, cvms, caret, randomForest, rPref, and doParallel packages:

    library(groupdata2)

    library(cvms)

    library(caret)

    library(randomForest)

    library(rPref)

    library(doParallel)

  2. Set the random seed to 3:

    set.seed(3)

  3. Load the amsterdam.listings dataset from https://github.com/TrainingByPackt/Practical-Machine-Learning-with-R/blob/master/Data/amsterdam.listings.csv:

    # Load the amsterdam.listings dataset

    full_data <- read.csv("amsterdam.listings.csv")

  4. Convert the id and neighbourhood columns to factors:

    full_data$id <- factor(full_data$id)

    full_data$neighbourhood <- factor(full_data$neighbourhood)

  5. Summarize the dataset:

    summary(full_data)

    The summary of data is as follows:

    ##        id                        neighbourhood            room_type   

    ##  2818   :    1   De Baarsjes - Oud-West :3052   Entire home/apt:13724

    ##  20168  :    1   De Pijp - Rivierenbuurt:2166   Private room   : 3594

    ##  25428  :    1   Centrum-West           :2019                         

    ##  27886  :    1   Centrum-Oost           :1498                         

    ##  28658  :    1   Westerpark             :1315                         

    ##  28871  :    1   Zuid                   :1187                         

    ##  (Other):17312   (Other)                :6081                         

    ##  availability_365   log_price     log_minimum_nights log_number_of_reviews

    ##  Min.   :  0.00   Min.   :2.079   Min.   :0.0000     Min.   :0.000       

    ##  1st Qu.:  0.00   1st Qu.:4.595   1st Qu.:0.6931     1st Qu.:1.386       

    ##  Median :  0.00   Median :4.852   Median :0.6931     Median :2.398       

    ##  Mean   : 48.33   Mean   :4.883   Mean   :0.8867     Mean   :2.370       

    ##  3rd Qu.: 49.00   3rd Qu.:5.165   3rd Qu.:1.0986     3rd Qu.:3.219       

    ##  Max.   :365.00   Max.   :7.237   Max.   :4.7875     Max.   :6.196       

    ##                                                                          

    ##  log_reviews_per_month

    ##  Min.   :-4.60517    

    ##  1st Qu.:-1.42712    

    ##  Median :-0.61619    

    ##  Mean   :-0.67858    

    ##  3rd Qu.: 0.07696    

    ##  Max.   : 2.48907    

    ##

  6. Partition the dataset into a training set (80%) and a validation set (20%). Balance the partitions by room_type:

    partitions <- partition(full_data, p = 0.8,

                            cat_col = "room_type")

    train_set <- partitions[[1]]

    valid_set <- partitions[[2]]

  7. Prepare for running the baseline evaluations and the cross-validations in parallel by registering the number of cores for doParallel:

    # Register four CPU cores

    registerDoParallel(4)
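
    If you are unsure how many cores your machine has, you can check before registering; detectCores() comes from the parallel package that ships with R:

    parallel::detectCores()

    Registering one or two fewer cores than the total keeps the machine responsive while the evaluations run.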

  8. Create the baseline evaluation for the task on the validation set with the baseline() function from cvms. Run 100 evaluations in parallel. Specify the dependent column as room_type. Note that the default positive class is Private room:

    room_type_baselines <- baseline(test_data = valid_set,

                                    dependent_col = "room_type",

                                    n = 100,

                                    family = "binomial",

                                    parallel = TRUE)

    # Inspect summarized metrics

    room_type_baselines$summarized_metrics

    The output is as follows:

    ## # A tibble: 10 x 15

    ##    Measure 'Balanced Accur…       F1 Sensitivity Specificity

    ##    <chr>              <dbl>    <dbl>       <dbl>       <dbl>

    ##  1 Mean              0.502   0.295        0.503      0.500

    ##  2 Median            0.502   0.294        0.502      0.500

    ##  3 SD                0.0101  0.00954      0.0184     0.00924

    ##  4 IQR               0.0154  0.0131       0.0226     0.0122

    ##  5 Max               0.525   0.317        0.551      0.522

    ##  6 Min               0.480   0.275        0.463      0.480

    ##  7 NAs               0       0            0          0     

    ##  8 INFs              0       0            0          0     

    ##  9 All_0             0.5    NA            0          1     

    ## 10 All_1             0.5     0.344        1          0     

    ## # … with 10 more variables: 'Pos Pred Value' <dbl>, 'Neg Pred

    ## #   Value' <dbl>, AUC <dbl>, 'Lower CI' <dbl>, 'Upper CI' <dbl>,

    ## #   Kappa <dbl>, MCC <dbl>, 'Detection Rate' <dbl>, 'Detection

    ## #   Prevalence' <dbl>, Prevalence <dbl>
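
    The returned list also holds the individual evaluations behind these summaries. Assuming the random_evaluations element from cvms' output structure, they can be inspected with:

    # Inspect the 100 individual baseline evaluations
    head(room_type_baselines$random_evaluations)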

  9. Fit multiple logistic regression models on the training set with the glm() function, predicting room_type. Try different predictors. View the summary() of each fitted model and try to interpret the estimated coefficients. Observe how the interpretations change when you add or subtract predictors:

    logit_model_1 <- glm("room_type ~ log_number_of_reviews",

                         data = train_set, family = "binomial")

    summary(logit_model_1)

    The summary of the model is as follows:

    ## ## Call:

    ## glm(formula = "room_type ~ log_number_of_reviews", family = "binomial",

    ##     data = train_set)

    ##

    ## Deviance Residuals:

    ##     Min       1Q   Median       3Q      Max

    ## -1.3030  -0.7171  -0.5668  -0.3708   2.3288

    ##

    ## Coefficients:

    ##                       Estimate Std. Error z value Pr(>|z|)   

    ## (Intercept)           -2.64285    0.05374  -49.18   <2e-16 ***

    ## log_number_of_reviews  0.49976    0.01745   28.64   <2e-16 ***

    ## ---

    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    ##

    ## (Dispersion parameter for binomial family taken to be 1)

    ##

    ##     Null deviance: 14149  on 13853  degrees of freedom

    ## Residual deviance: 13249  on 13852  degrees of freedom

    ## AIC: 13253

    ##

    ## Number of Fisher Scoring iterations: 4

  10. Add availability_365 as a predictor:

    logit_model_2 <- glm(

        "room_type ~ availability_365 + log_number_of_reviews",

        data = train_set, family = "binomial")

    summary(logit_model_2)

    The summary of the model is as follows:

    ##

    ## Call:

    ## glm(formula = "room_type ~ availability_365 + log_number_of_reviews",

    ##     family = "binomial", data = train_set)

    ##

    ## Deviance Residuals:

    ##     Min       1Q   Median       3Q      Max

    ## -1.5277  -0.6735  -0.5535  -0.3688   2.3365

    ##

    ## Coefficients:

    ##                         Estimate Std. Error z value Pr(>|z|)   

    ## (Intercept)           -2.6622105  0.0533345  -49.91   <2e-16 ***

    ## availability_365       0.0039866  0.0002196   18.16   <2e-16 ***

    ## log_number_of_reviews  0.4172148  0.0178015   23.44   <2e-16 ***

    ## ---

    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    ##

    ## (Dispersion parameter for binomial family taken to be 1)

    ##

    ##     Null deviance: 14149  on 13853  degrees of freedom

    ## Residual deviance: 12935  on 13851  degrees of freedom

    ## AIC: 12941

    ##

    ## Number of Fisher Scoring iterations: 4

  11. Add log_price as a predictor:

    logit_model_3 <- glm(

        "room_type ~ availability_365 + log_number_of_reviews + log_price",

        data = train_set, family = "binomial")

    summary(logit_model_3)

    The summary of the model is as follows:

    ##

    ## Call:

    ## glm(formula = "room_type ~ availability_365 + log_number_of_reviews + log_price",

    ##     family = "binomial", data = train_set)

    ##

    ## Deviance Residuals:

    ##     Min       1Q   Median       3Q      Max

    ## -3.7678  -0.5805  -0.3395  -0.1208   3.9864

    ##

    ## Coefficients:

    ##                         Estimate Std. Error z value Pr(>|z|)   

    ## (Intercept)           12.6730455  0.3419771   37.06   <2e-16 ***

    ## availability_365       0.0081215  0.0002865   28.34   <2e-16 ***

    ## log_number_of_reviews  0.3613055  0.0199845   18.08   <2e-16 ***

    ## log_price             -3.2539506  0.0745417  -43.65   <2e-16 ***

    ## ---

    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    ##

    ## (Dispersion parameter for binomial family taken to be 1)

    ##

    ##     Null deviance: 14149.2  on 13853  degrees of freedom

    ## Residual deviance:  9984.7  on 13850  degrees of freedom

    ## AIC: 9992.7

    ##

    ## Number of Fisher Scoring iterations: 6

  12. Add log_minimum_nights as a predictor:

    logit_model_4 <- glm(

        "room_type ~ availability_365 + log_number_of_reviews + log_price + log_minimum_nights",

         data = train_set, family = "binomial")

    summary(logit_model_4)

    The summary of the model is as follows:

    ##

    ## Call:

    ## glm(formula = "room_type ~ availability_365 + log_number_of_reviews + log_price + log_minimum_nights",

    ##     family = "binomial", data = train_set)

    ##

    ## Deviance Residuals:

    ##     Min       1Q   Median       3Q      Max

    ## -3.6268  -0.5520  -0.3142  -0.1055   4.6062

    ##

    ## Coefficients:

    ##                        Estimate Std. Error z value Pr(>|z|)   

    ## (Intercept)           13.470868   0.354695   37.98   <2e-16 ***

    ## availability_365       0.008310   0.000295   28.17   <2e-16 ***

    ## log_number_of_reviews  0.360133   0.020422   17.64   <2e-16 ***

    ## log_price             -3.252957   0.076343  -42.61   <2e-16 ***

    ## log_minimum_nights    -1.007354   0.051131  -19.70   <2e-16 ***

    ## ---

    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    ##

    ## (Dispersion parameter for binomial family taken to be 1)

    ##

    ##     Null deviance: 14149.2  on 13853  degrees of freedom

    ## Residual deviance:  9543.9  on 13849  degrees of freedom

    ## AIC: 9553.9

    ##

    ## Number of Fisher Scoring iterations: 6

  13. Replace log_number_of_reviews with log_reviews_per_month:

    logit_model_5 <- glm(

        "room_type ~ availability_365 + log_reviews_per_month + log_price + log_minimum_nights",

        data = train_set, family = "binomial")

    summary(logit_model_5)

    The summary of the model is as follows:

    ##

    ## Call:

    ## glm(formula = "room_type ~ availability_365 + log_reviews_per_month + log_price + log_minimum_nights",

    ##     family = "binomial", data = train_set)

    ##

    ## Deviance Residuals:

    ##     Min       1Q   Median       3Q      Max

    ## -3.7351  -0.5229  -0.2934  -0.0968   4.7252

    ##

    ## Coefficients:

    ##                         Estimate Std. Error z value Pr(>|z|)   

    ## (Intercept)           14.7495851  0.3652303   40.38   <2e-16 ***

    ## availability_365       0.0074308  0.0003019   24.61   <2e-16 ***

    ## log_reviews_per_month  0.6364218  0.0246423   25.83   <2e-16 ***

    ## log_price             -3.2850702  0.0781567  -42.03   <2e-16 ***

    ## log_minimum_nights    -0.8504701  0.0526379  -16.16   <2e-16 ***

    ## ---

    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    ##

    ## (Dispersion parameter for binomial family taken to be 1)

    ##

    ##     Null deviance: 14149  on 13853  degrees of freedom

    ## Residual deviance:  9103  on 13849  degrees of freedom

    ## AIC: 9113

    ##

    ## Number of Fisher Scoring iterations: 6
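
    Because logistic regression coefficients are on the log-odds scale, exponentiating them can make them easier to read as odds ratios. A minimal sketch for the model just fitted:

    # Convert the log-odds coefficients to odds ratios
    exp(coef(logit_model_5))

    # For example, exp(-3.285) is roughly 0.037, so each one-unit increase
    # in log_price multiplies the odds of "Private room" by about 0.037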

  14. Create model formulas with combine_predictors(). To save time, limit the interaction size to 2 by specifying max_interaction_size = 2, and limit the number of times an effect can be included in a formula to 1 by specifying max_effect_frequency = 1:

    model_formulas <- combine_predictors(

        dependent = "room_type",

        fixed_effects = c("log_minimum_nights",

                          "log_number_of_reviews",

                          "log_price",

                          "availability_365",

                          "log_reviews_per_month"),

        max_interaction_size = 2,

        max_effect_frequency = 1)

    head(model_formulas, 10)

    The output is as follows:

    ##  [1] "room_type ~ availability_365"                       

    ##  [2] "room_type ~ log_minimum_nights"                     

    ##  [3] "room_type ~ log_number_of_reviews"                  

    ##  [4] "room_type ~ log_price"                              

    ##  [5] "room_type ~ log_reviews_per_month"                  

    ##  [6] "room_type ~ availability_365 * log_minimum_nights"  

    ##  [7] "room_type ~ availability_365 * log_number_of_reviews"

    ##  [8] "room_type ~ availability_365 * log_price"           

    ##  [9] "room_type ~ availability_365 * log_reviews_per_month"

    ## [10] "room_type ~ availability_365 + log_minimum_nights"
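
    Before cross-validating, it can be useful to know how many candidate formulas these constraints produced:

    # Count the generated model formulas
    length(model_formulas)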

  15. Create five fold columns, each with five folds, in the training set, using fold() with k = 5 and num_fold_cols = 5. Balance the folds by room_type. Feel free to choose a higher number of fold columns:

    train_set <- fold(train_set, k = 5,

                      num_fold_cols = 5,

                      cat_col = "room_type")

  16. Perform cross-validation (not repeated) on your model formulas with cvms. Specify fold_cols = ".folds_1". Order the results by F1 and show the best 10 models:

    initial_cv_results <- cross_validate(

        train_set,

        models = model_formulas,

        fold_cols = ".folds_1",

        family = "binomial",

        parallel = TRUE)

    initial_cv_results <- initial_cv_results[

        order(initial_cv_results$F1, decreasing = TRUE),]

    head(initial_cv_results, 10)

    The output is as follows:

    ## # A tibble: 10 x 26

    ##    'Balanced Accur…    F1 Sensitivity Specificity 'Pos Pred Value'

    ##               <dbl> <dbl>       <dbl>       <dbl>            <dbl>

    ##  1            0.764 0.662       0.567       0.962            0.797

    ##  2            0.764 0.662       0.565       0.963            0.798

    ##  3            0.763 0.662       0.563       0.963            0.801

    ##  4            0.763 0.661       0.563       0.963            0.800

    ##  5            0.761 0.654       0.564       0.958            0.778

    ##  6            0.757 0.653       0.549       0.966            0.807

    ##  7            0.757 0.652       0.549       0.965            0.804

    ##  8            0.758 0.649       0.560       0.957            0.774

    ##  9            0.756 0.649       0.550       0.962            0.792

    ## 10            0.758 0.649       0.559       0.957            0.775

    ## # … with 21 more variables: 'Neg Pred Value' <dbl>, AUC <dbl>, 'Lower

    ## #   CI' <dbl>, 'Upper CI' <dbl>, Kappa <dbl>, MCC <dbl>, 'Detection

    ## #   Rate' <dbl>, 'Detection Prevalence' <dbl>, Prevalence <dbl>,

    ## #   Predictions <list>, ROC <list>, 'Confusion Matrix' <list>,

    ## #   Coefficients <list>, Folds <int>, 'Fold Columns' <int>, 'Convergence

    ## #   Warnings' <dbl>, 'Singular Fit Messages' <int>, Family <chr>,

    ## #   Link <chr>, Dependent <chr>, Fixed <chr>

  17. Perform repeated cross-validation on the 10-20 best model formulas (by F1) with cvms:

    # Reconstruct the best 20 models' formulas

    reconstructed_formulas <- reconstruct_formulas(

        initial_cv_results,

        topn = 20)

    # Create fold_cols

    fold_cols <- paste0(".folds_", 1:5)

    # Perform repeated cross-validation

    repeated_cv_results <- cross_validate(

        train_set,

        models = reconstructed_formulas,

        fold_cols = fold_cols,

        family = "binomial",

        parallel = TRUE)

    # Order by F1

    repeated_cv_results <- repeated_cv_results[

        order(repeated_cv_results$F1, decreasing = TRUE),]

    # Inspect the best models' results

    head(repeated_cv_results)

    The output is as follows:

    ## # A tibble: 6 x 27

    ##   'Balanced Accur…    F1 Sensitivity Specificity 'Pos Pred Value'

    ##              <dbl> <dbl>       <dbl>       <dbl>            <dbl>

    ## 1            0.764 0.662       0.566       0.962            0.796

    ## 2            0.763 0.661       0.564       0.963            0.798

    ## 3            0.763 0.660       0.562       0.963            0.800

    ## 4            0.762 0.659       0.561       0.963            0.800

    ## 5            0.761 0.654       0.563       0.958            0.780

    ## 6            0.758 0.654       0.551       0.965            0.805

    ## # … with 22 more variables: 'Neg Pred Value' <dbl>, AUC <dbl>, 'Lower

    ## #   CI' <dbl>, 'Upper CI' <dbl>, Kappa <dbl>, MCC <dbl>, 'Detection

    ## #   Rate' <dbl>, 'Detection Prevalence' <dbl>, Prevalence <dbl>,

    ## #   Predictions <list>, ROC <list>, 'Confusion Matrix' <list>,

    ## #   Coefficients <list>, Results <list>, Folds <int>, 'Fold

    ## #   Columns' <int>, 'Convergence Warnings' <dbl>, 'Singular Fit

    ## #   Messages' <int>, Family <chr>, Link <chr>, Dependent <chr>,

    ## #   Fixed <chr>

  18. Find the Pareto front based on the F1 and balanced accuracy scores. Use psel() from the rPref package and specify pref = high("F1") * high("`Balanced Accuracy`"). Note the ticks around `Balanced Accuracy`:

    # Find the Pareto front

    front <- psel(repeated_cv_results,

                  pref = high("F1") * high("`Balanced Accuracy`"))

    # Remove rows with NA in F1 or Balanced Accuracy

    front <- front[complete.cases(front[1:2]), ]

    # Inspect front

    front

    The output is as follows:

    ## # A tibble: 1 x 27

    ##   'Balanced Accur…    F1 Sensitivity Specificity 'Pos Pred Value'

    ##              <dbl> <dbl>       <dbl>       <dbl>            <dbl>

    ## 1            0.764 0.662       0.566       0.962            0.796

    ## # … with 22 more variables: 'Neg Pred Value' <dbl>, AUC <dbl>, 'Lower

    ## #   CI' <dbl>, 'Upper CI' <dbl>, Kappa <dbl>, MCC <dbl>, 'Detection

    ## #   Rate' <dbl>, 'Detection Prevalence' <dbl>, Prevalence <dbl>,

    ## #   Predictions <list>, ROC <list>, 'Confusion Matrix' <list>,

    ## #   Coefficients <list>, Results <list>, Folds <int>, 'Fold

    ## #   Columns' <int>, 'Convergence Warnings' <dbl>, 'Singular Fit

    ## #   Messages' <int>, Family <chr>, Link <chr>, Dependent <chr>,

    ## #   Fixed <chr>

    The best model according to F1 is also the best model by balanced accuracy, so the Pareto front only contains one model.

  19. Plot the Pareto front with the ggplot2 code from Exercise 61, Plotting the Pareto Front. Note that you may need to add ticks around 'Balanced Accuracy' when specifying x or y in aes() in the ggplot call:

    # Create ggplot object

    # with F1 on the x-axis and balanced accuracy on the y-axis

    ggplot(repeated_cv_results, aes(x = F1, y = `Balanced Accuracy`)) +

      # Add the models as points

      geom_point(shape = 1, size = 0.5) +

      # Add the nondominated models as larger points

      geom_point(data = front, size = 3) +

      # Add a line to visualize the Pareto front

      geom_step(data = front, direction = "vh") +

      # Add the light theme

      theme_light()

    The output is similar to the following:

    Figure 5.31: Pareto front with the F1 and balanced accuracy scores
  20. Use validate() to train the nondominated models on the training set and evaluate them on the validation set:

    # Reconstruct the formulas for the front models

    reconstructed_formulas <- reconstruct_formulas(front)

    # Validate the models in the Pareto front

    v_results_list <- validate(train_data = train_set,

                               test_data = valid_set,

                               models = reconstructed_formulas,

                               family = "binomial")

    # Assign the results and model(s) to variable names

    v_results <- v_results_list$Results

    v_model <- v_results_list$Models[[1]]

    v_results

    The output is as follows:

    ## # A tibble: 1 x 24

    ##   'Balanced Accur…    F1 Sensitivity Specificity 'Pos Pred Value'

    ##              <dbl> <dbl>       <dbl>       <dbl>            <dbl>

    ## 1            0.758 0.652       0.554       0.962            0.794

    ## # … with 19 more variables: 'Neg Pred Value' <dbl>, AUC <dbl>, 'Lower

    ## #   CI' <dbl>, 'Upper CI' <dbl>, Kappa <dbl>, MCC <dbl>, 'Detection

    ## #   Rate' <dbl>, 'Detection Prevalence' <dbl>, Prevalence <dbl>,

    ## #   Predictions <list>, ROC <list>, 'Confusion Matrix' <list>,

    ## #   Coefficients <list>, 'Convergence Warnings' <dbl>, 'Singular Fit

    ## #   Messages' <dbl>, Family <chr>, Link <chr>, Dependent <chr>,

    ## #   Fixed <chr>

    These results are a lot better than the baseline on both F1 and balanced accuracy.
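
    To make the comparison explicit, we can put the validation metrics next to the baseline means from step 8; the column names below are taken from the outputs printed earlier:

    # Mean baseline metrics versus the validated model
    summarized <- room_type_baselines$summarized_metrics
    summarized[summarized$Measure == "Mean", c("F1", "Balanced Accuracy")]
    v_results[, c("F1", "Balanced Accuracy")]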

  21. View the summaries of the nondominated model(s):

    summary(v_model)

    The summary of the model is as follows:

    ##

    ## Call:

    ## glm(formula = model_formula, family = binomial(link = link),

    ##     data = train_set)

    ##

    ## Deviance Residuals:

    ##     Min       1Q   Median       3Q      Max

    ## -3.8323  -0.4836  -0.2724  -0.0919   3.9091

    ##

    ## Coefficients:

    ##                           Estimate Std. Error z value  Pr(>|z|)

    ## (Intercept)               15.3685268  0.4511978  34.062  < 2e-16 ***

    ## availability_365          -0.0140209  0.0030623  -4.579 4.68e-06 ***

    ## log_price                 -3.4441520  0.0956189 -36.020  < 2e-16 ***

    ## log_minimum_nights        -0.7163252  0.0535452 -13.378  < 2e-16 ***

    ## log_number_of_reviews      -0.0823821  0.0282115  -2.920   0.0035 **

    ## log_reviews_per_month      0.0733808  0.0381629   1.923   0.0545 .

    ## availability_365:log_price 0.0042772  0.0006207   6.891 5.53e-12 ***

    ## log_n_o_reviews:log_r_p_month 0.3730603 0.0158122 23.593 < 2e-16 ***

    ## ---

    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    ##

    ## (Dispersion parameter for binomial family taken to be 1)

    ##

    ##     Null deviance: 14149.2  on 13853  degrees of freedom

    ## Residual deviance:  8476.7  on 13846  degrees of freedom

    ## AIC: 8492.7

    ##

    ## Number of Fisher Scoring iterations: 6

    Note that we have shortened log_number_of_reviews and log_reviews_per_month in the interaction term log_n_o_reviews:log_r_p_month so that it fits the output width. The coefficients for the two interaction terms are both statistically significant. The interaction term log_n_o_reviews:log_r_p_month tells us that when log_number_of_reviews increases by one unit, the coefficient for log_reviews_per_month increases by 0.37, and vice versa. We might question the meaningfulness of including both of these predictors in the model, as they carry some of the same information. If we also had the number of months a listing had been listed, we could recreate log_number_of_reviews from log_reviews_per_month and that duration, which would probably be easier to interpret as well.

    The second interaction term, availability_365:log_price, tells us that when availability_365 increases by a single unit, the coefficient for log_price increases by 0.004, and vice versa. The coefficient estimate for log_price is -3.44, meaning that when availability_365 is low, a higher log_price decreases the probability that the listing is a private room. This fits with the intuition that a private room is usually cheaper than an entire home/apartment.

    The coefficient for log_minimum_nights tells us that when there is a higher minimum requirement for the number of nights when we book the listing, there's a lower probability that the listing is a private room.
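
    To see what these coefficients mean on the probability scale, we can ask predict() for type = "response". The listing values below are made up for illustration:

    # A hypothetical listing
    example_listing <- data.frame(
        availability_365 = 100,
        log_price = log(60),
        log_minimum_nights = log(2),
        log_number_of_reviews = log(25),
        log_reviews_per_month = log(1.5))

    # Predicted probability that the listing is a private room
    predict(v_model, newdata = example_listing, type = "response")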

Chapter 6: Unsupervised Learning

Activity 20: Perform DIANA, AGNES, and k-means on the Built-In Motor Car Dataset

Solution:

  1. Attach the cluster and factoextra packages:

    library(cluster)

    library(factoextra)

  2. Load the dataset:

    df <- read.csv("mtcars.csv")

  3. Set the row names to the values of the X column (the car model names). Remove the X column afterward:

    rownames(df) <- df$X

    df$X <- NULL

    Note

    The row names (car models) become a column, X, when the data frame is saved as a CSV file. So, we need to set them back, as the row names are used as labels in the cluster plots.

  4. Remove those rows with missing data and standardize the dataset:

    df <- na.omit(df)

    df <- scale(df)

  5. Implement divisive hierarchical clustering using DIANA. For easy comparison, document the dendrogram output. Feel free to experiment with different distance metrics:

    dv <- diana(df, metric = "manhattan", stand = TRUE)

    plot(dv)

    The output is as follows:

    Figure 6.41: Banner from diana()

    The next plot is as follows:

    Figure 6.42: Dendrogram from diana()
  6. Implement bottom-up hierarchical clustering using AGNES. Take note of the dendrogram created for comparison purposes later on:

    agn <- agnes(df)

    pltree(agn)

    The output is as follows:

    Figure 6.43: Dendrogram from agnes()
  7. Implement k-means clustering. Use the elbow method to determine the optimal number of clusters:

    fviz_nbclust(df, kmeans, method = "wss") +

        geom_vline(xintercept = 4, linetype = 2) +

        labs(subtitle = "Elbow method")

    The output is as follows:

    Figure 6.44: Optimal clusters using the elbow method
  8. Perform k-means clustering with four clusters:

    k4 <- kmeans(df, centers = 4, nstart = 20)

    fviz_cluster(k4, data = df)

    The output is as follows:

    Figure 6.45: k-means with four clusters
  9. Compare the clusters, starting with the smallest one. The following are your expected results for DIANA, AGNES, and k-means, respectively:
Figure 6.46: Dendrogram from running DIANA, cut at 20

If we consider cutting the DIANA tree at height 20, the Ferrari is clustered together with the Ford and the Maserati (the smallest cluster).

Figure 6.47: Dendrogram from agnes, cut at 4

Meanwhile, cutting the AGNES dendrogram at height 4 results in the Ferrari being clustered with the Mazda RX4, the Mazda RX4 Wag, and the Porsche. Finally, as shown in the following plot, k-means clusters the Ferrari with the Mazdas, the Ford, and the Maserati.

Figure 6.48: k-means clustering

Clearly, the choice of clustering technique and algorithms results in different clusters being created. It is important to apply some domain knowledge to determine the most valuable end results.
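
One way to compare the memberships directly is to cut the two dendrograms at the heights discussed above and cross-tabulate the resulting clusters against the k-means assignment; a minimal sketch:

# Cut the hierarchical trees at the heights used above
diana_clusters <- cutree(as.hclust(dv), h = 20)
agnes_clusters <- cutree(as.hclust(agn), h = 4)

# Cross-tabulate the hierarchical memberships against the k-means clusters
table(diana_clusters, k4$cluster)
table(agnes_clusters, k4$cluster)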
