Appendix

About

This section is included to help students perform the activities in the book. It provides the detailed steps students need to follow to complete the activities and achieve the book's objectives.

Chapter 1: An Introduction to Machine Learning

Activity 1: Finding the Distribution of Diabetic Patients in the PimaIndiansDiabetes Dataset

Solution:

  1. Load the dataset.

    PimaIndiansDiabetes<-read.csv("PimaIndiansDiabetes.csv")

  2. Create a variable PimaIndiansDiabetesData for further use.

    #Assign it to a local variable for further use

    PimaIndiansDiabetesData<- PimaIndiansDiabetes

  3. Use the head() function to view the first six rows of the dataset.

    #Display the first six rows

    head(PimaIndiansDiabetesData)

    The output is as follows:

      pregnant glucose pressure triceps insulin mass pedigree age diabetes

    1        6     148       72      35       0 33.6    0.627  50      pos

    2        1      85       66      29       0 26.6    0.351  31      neg

    3        8     183       64       0       0 23.3    0.672  32      pos

    4        1      89       66      23      94 28.1    0.167  21      neg

    5        0     137       40      35     168 43.1    2.288  33      pos

    6        5     116       74       0       0 25.6    0.201  30      neg

    From the preceding data, identify the input features and the column to be predicted: the first eight columns are the input features, and diabetes is the output variable.

  4. Display the different categories of the output variable:

    levels(PimaIndiansDiabetesData$diabetes)

    The output is as follows:

    [1] "neg" "pos"

  5. Load the required library for plotting graphs.

    library(ggplot2)

  6. Create a bar plot of age, filling the bars by the output variable, diabetes.

    barplot <- ggplot(data= PimaIndiansDiabetesData, aes(x=age))

    barplot + geom_histogram(binwidth=0.2, color="black", aes(fill=diabetes))  + ggtitle("Bar plot of Age")

    The output is as follows:

Figure 1.36: Bar plot output for diabetes

We can conclude that most of the data lies in the 20-30 age group. Graphical representation thus helps us understand the data at a glance.

Activity 2: Grouping the PimaIndiansDiabetes Data

Solution:

  1. View the structure of the PimaIndiansDiabetes dataset.

    #View the structure of the data

    str(PimaIndiansDiabetesData)

    The output is as follows:

    'data.frame':768 obs. of  9 variables:

    $ pregnant: num  6 1 8 1 0 5 3 10 2 8 ...

    $ glucose : num  148 85 183 89 137 116 78 115 197 125 ...

    $ pressure: num  72 66 64 66 40 74 50 0 70 96 ...

    $ triceps : num  35 29 0 23 35 0 32 0 45 0 ...

    $ insulin : num  0 0 0 94 168 0 88 0 543 0 ...

    $ mass    : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...

    $ pedigree: num  0.627 0.351 0.672 0.167 2.288 ...

    $ age     : num  50 31 32 21 33 30 26 29 53 54 ...

    $ diabetes: Factor w/ 2 levels "neg","pos": 2 1 2 1 2 1 2 1 2 2 ...

  2. View the summary of the PimaIndiansDiabetes dataset.

    #View the Summary of the data

    summary(PimaIndiansDiabetesData)

    The output is as follows:

    Figure 1.37: Summary of PimaIndiansDiabetes data
  3. View the statistics of the columns of the PimaIndiansDiabetes dataset grouped by the diabetes column.

    #Perform Group by and view statistics for the columns

    #Install the package

    install.packages("psych")

    library(psych) #Load package psych to use function describeBy

    Use describeBy() with the pregnant and diabetes columns.

    describeBy(PimaIndiansDiabetesData$pregnant, PimaIndiansDiabetesData$diabetes)

    The output is as follows:

    Descriptive statistics by group

    group: neg

       vars   n mean   sd median trimmed  mad min max range skew kurtosis   se

    X1    1 500  3.3 3.02      2    2.88 2.97   0  13    13 1.11     0.65 0.13

    ----------------------------------------------------------------------------------------------

    group: pos

       vars   n mean   sd median trimmed  mad min max range skew kurtosis   se

    X1    1 268 4.87 3.74      4     4.6 4.45   0  17    17  0.5    -0.47 0.23

    We can view the mean, median, min, and max of the pregnant attribute (the number of times pregnant) for the group of people who have diabetes (pos) and for those who do not (neg).

  4. Use describeBy() with the pressure and diabetes columns.

    describeBy(PimaIndiansDiabetesData$pressure, PimaIndiansDiabetesData$diabetes)

    The output is as follows:

    Descriptive statistics by group

    group: neg

       vars   n  mean    sd median trimmed   mad min max range skew kurtosis   se

    X1    1 500 68.18 18.06     70   69.97 11.86   0 122   122 -1.8     5.58 0.81

    ----------------------------------------------------------------------------------------------

    group: pos

       vars   n  mean    sd median trimmed   mad min max range  skew kurtosis   se

    X1    1 268 70.82 21.49     74   73.99 11.86   0 114   114 -1.92     4.53 1.31

    We can view the mean, median, min, and max of pressure for the group of people who have diabetes (pos) and for those who do not (neg).

    We have learned how to view the structure of a dataset and print statistics about the range of every column using summary().
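    As a lighter-weight alternative, base R's aggregate() can compute a chosen statistic per group without an extra package. A minimal sketch, assuming PimaIndiansDiabetesData is still loaded:

    #Mean number of pregnancies per diabetes group, using base R

    aggregate(pregnant ~ diabetes, data = PimaIndiansDiabetesData, FUN = mean)

    #Several statistics at once for the pressure column

    aggregate(pressure ~ diabetes, data = PimaIndiansDiabetesData, FUN = function(x) c(mean = mean(x), median = median(x), min = min(x), max = max(x)))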

Activity 3: Performing EDA on the PimaIndiansDiabetes Dataset

Solution:

  1. Load the PimaIndiansDiabetes dataset and assign it to a working variable.

    PimaIndiansDiabetes<-read.csv("PimaIndiansDiabetes.csv")

    PimaIndiansDiabetesData<-PimaIndiansDiabetes

  2. View the correlation among the features of the PimaIndiansDiabetes dataset.

    #Calculate correlations

    correlation <- cor(PimaIndiansDiabetesData[,1:4])

  3. Round the values to two decimal places.

    #Round the values to 2 decimal places

    round(correlation,2)

    The output is as follows:

             pregnant glucose pressure triceps

    pregnant     1.00    0.13     0.14   -0.08

    glucose      0.13    1.00     0.15    0.06

    pressure     0.14    0.15     1.00    0.21

    triceps     -0.08    0.06     0.21    1.00

  4. Pair them on a plot.

    #Plot the pairs on a plot

    pairs(PimaIndiansDiabetesData[,1:4])

    The output is as follows:

    Figure 1.38: A pair plot for the diabetes data
  5. Create a box plot to view the data distribution for the pregnant column and color by diabetes.

    # Load library

    library(ggplot2)

    boxplot <- ggplot(data=PimaIndiansDiabetesData, aes(x=diabetes, y=pregnant))

    boxplot + geom_boxplot(aes(fill=diabetes)) +

      ylab("Pregnant") + ggtitle("Diabetes Data Boxplot") +

      stat_summary(fun.y=mean, geom="point", shape=5, size=4)

    The output is as follows:

Figure 1.39: The box plot output using ggplot

In the preceding graph, we can see the distribution of "number of times pregnant" in people who do not have diabetes (neg) and in people who have diabetes (pos).

Activity 4: Building Linear Models for the GermanCredit Dataset

Solution:

These are the steps that will help you solve the activity:

  1. Load the data.

    GermanCredit <-read.csv("GermanCredit.csv")

  2. Subset the data.

    GermanCredit_Subset=GermanCredit[,1:10]

  3. Fit a linear model using lm().

    # fit model

    fit <- lm(Duration~., GermanCredit_Subset)

  4. Summarize the results using the summary() function.

    # summarize the fit

    summary(fit)

    The output is as follows:

    Call:

    lm(formula = Duration ~ ., data = GermanCredit_Subset)

    Residuals:

        Min      1Q  Median      3Q     Max

    -44.722  -5.524  -1.187   4.431  44.287

    Coefficients:

                                Estimate Std. Error t value Pr(>|t|)    

    (Intercept)                2.0325685  2.3612128   0.861  0.38955    

    Amount                     0.0029344  0.0001093  26.845  < 2e-16 ***

    InstallmentRatePercentage  2.7171134  0.2640590  10.290  < 2e-16 ***

    ResidenceDuration          0.2068781  0.2625670   0.788  0.43094    

    Age                       -0.0689299  0.0260365  -2.647  0.00824 **

    NumberExistingCredits     -0.3810765  0.4903225  -0.777  0.43723    

    NumberPeopleMaintenance   -0.0999072  0.7815578  -0.128  0.89831    

    Telephone                  0.6354927  0.6035906   1.053  0.29266    

    ForeignWorker              4.9141998  1.4969592   3.283  0.00106 **

    ClassGood                 -2.0068114  0.6260298  -3.206  0.00139 **

    ---

    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 8.784 on 990 degrees of freedom

    Multiple R-squared:  0.4742,    Adjusted R-squared:  0.4694

    F-statistic:  99.2 on 9 and 990 DF,  p-value: < 2.2e-16

  5. Use predict() to make the predictions.

    # make predictions

    predictions <- predict(fit, GermanCredit_Subset)

  6. Calculate the RMSE for the predictions.

    # summarize accuracy

    rmse <- sqrt(mean((GermanCredit_Subset$Duration - predictions)^2))

    print(rmse)

    The output is as follows:

    [1] 76.3849

    In this activity, we have learned to build a linear model, make predictions, and evaluate performance using RMSE. Note that the predictions here are made on the same data the model was fitted to.
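    The RMSE above is therefore an in-sample estimate and understates the error on unseen data. A minimal sketch of a held-out evaluation; the 70/30 split below is illustrative:

    # Hold out 30% of the rows for testing (illustrative split)

    set.seed(1)

    test_idx <- sample(nrow(GermanCredit_Subset), size = round(0.3 * nrow(GermanCredit_Subset)))

    train_data <- GermanCredit_Subset[-test_idx, ]

    test_data <- GermanCredit_Subset[test_idx, ]

    # Fit on the training rows only, then score the held-out rows

    fit_holdout <- lm(Duration ~ ., data = train_data)

    sqrt(mean((test_data$Duration - predict(fit_holdout, test_data))^2))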

Activity 5: Using Multiple Variables for a Regression Model for the Boston Housing Dataset

Solution:

These are the steps that will help you solve the activity:

  1. Load the dataset.

    BostonHousing <-read.csv("BostonHousing.csv")

  2. Build a regression model using multiple variables.

    #Build multi variable regression

    regression <- lm(medv~crim + indus+rad , data = BostonHousing)

  3. View the summary of the built regression model.

    #View the summary

    summary(regression)

    The output is as follows:

    Call:

    lm(formula = medv ~ crim + indus + rad, data = BostonHousing)

    Residuals:

        Min      1Q  Median      3Q     Max

    -12.047  -4.860  -1.736   3.081  32.596

    Coefficients:

                Estimate Std. Error t value Pr(>|t|)    

    (Intercept) 29.27515    0.68220  42.913  < 2e-16 ***

    crim        -0.23952    0.05205  -4.602 5.31e-06 ***

    indus       -0.51671    0.06336  -8.155 2.81e-15 ***

    rad         -0.01281    0.05845  -0.219    0.827    

    ---

    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 7.838 on 502 degrees of freedom

    Multiple R-squared:  0.2781, Adjusted R-squared:  0.2737

    F-statistic: 64.45 on 3 and 502 DF,  p-value: < 2.2e-16

  4. Plot the regression model using the plot() function.

    #Plot the fit

    plot(regression)

    The output is as follows:

Figure 1.40: Residual versus fitted values

The preceding plot compares the predicted values and the residual values.

Hit <Return> to see the next plot:

Figure 1.41: Normal QQ

The preceding plot shows the distribution of error. It is a normal probability plot. A normal distribution of error will display a straight line.

Hit <Return> to see the next plot:

Figure 1.42: Scale location plot

The preceding plot compares the spread of the residuals against the fitted values, showing whether the variance of the residuals is constant across the range of predictions.

Hit <Return> to see the next plot:

Figure 1.43: Cook's distance plot

This plot helps identify data points that are influential to the regression model, that is, observations whose inclusion or exclusion would noticeably change the model results.
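The same influence measure can be extracted numerically with cooks.distance(). A minimal sketch, assuming the regression object fitted in step 2; the 4/n cutoff is a common rule of thumb, not a fixed standard:

    # Cook's distance for every observation

    cooks_d <- cooks.distance(regression)

    # Flag observations above the 4/n rule of thumb

    influential <- which(cooks_d > 4 / nrow(BostonHousing))

    head(cooks_d[influential])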

We have now explored the datasets with one or more variables.

Chapter 2: Data Cleaning and Pre-processing

Activity 6: Pre-processing using Center and Scale

Solution:

In this activity, we will perform the center and scale pre-processing operations.

  1. Load the mlbench library and the PimaIndiansDiabetes dataset:

    # Load Library caret

    library(caret)

    library(mlbench)

    # load the dataset PimaIndiansDiabetes

    data(PimaIndiansDiabetes)

    View the summary:

    # view the data

    summary(PimaIndiansDiabetes [,1:2])

    The output is as follows:

        pregnant         glucose     

    Min.   : 0.000   Min.   :  0.0  

    1st Qu.: 1.000   1st Qu.: 99.0  

    Median : 3.000   Median :117.0  

    Mean   : 3.845   Mean   :120.9  

    3rd Qu.: 6.000   3rd Qu.:140.2  

    Max.   :17.000   Max.   :199.0

  2. Use preProcess() to pre-process the data to center and scale:

    # to standardise we will scale and center

    params <- preProcess(PimaIndiansDiabetes [,1:2], method=c("center", "scale"))

  3. Transform the dataset using predict():

    # transform the dataset

    new_dataset <- predict(params, PimaIndiansDiabetes [,1:2])

  4. Print the summary of the new dataset:

    # summarize the transformed dataset

    summary(new_dataset)

    The output is as follows:

        pregnant          glucose       

    Min.   :-1.1411   Min.   :-3.7812  

    1st Qu.:-0.8443   1st Qu.:-0.6848  

    Median :-0.2508   Median :-0.1218  

    Mean   : 0.0000   Mean   : 0.0000  

    3rd Qu.: 0.6395   3rd Qu.: 0.6054  

    Max.   : 3.9040   Max.   : 2.4429

    Notice that the values are now centered (each column has a mean of 0) and scaled.
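    The same standardization can be reproduced with base R's scale(), which is a quick way to sanity-check the preProcess() output. A minimal sketch:

    # Center and scale manually with base R

    manual <- as.data.frame(scale(PimaIndiansDiabetes[, 1:2]))

    summary(manual)

    # The two approaches should agree up to floating-point error

    all.equal(manual$glucose, new_dataset$glucose)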

Activity 7: Identifying Outliers

Solution:

  1. Load the dataset:

    mtcars = read.csv("mtcars.csv")

  2. Load the outliers package:

    #Load the outlier library

    library(outliers)

  3. Detect outliers in the dataset using the outlier() function:

    #Detect outliers

    outlier(mtcars)

    The output is as follows:

        mpg     cyl    disp      hp    drat      wt    qsec      vs      am    gear    carb
     33.900   4.000 472.000 335.000   4.930   5.424  22.900   1.000   1.000   5.000   8.000

  4. Display the other side of the outlier values:

    #This detects outliers from the other side

    outlier(mtcars,opposite=TRUE)

    The output is as follows:

       mpg    cyl   disp     hp   drat     wt   qsec     vs     am   gear   carb
    10.400  8.000 71.100 52.000  2.760  1.513 14.500  0.000  0.000  3.000  1.000

  5. Plot a box plot:

    #View the outliers

    boxplot(mtcars)

    The output is as follows:

Figure 2.36: Outliers in the mtcars dataset.

The circles mark the outliers.
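The values drawn as circles can also be listed directly with boxplot.stats(), whose $out component contains the points beyond the whiskers. A minimal sketch, restricted to numeric columns in case the CSV also contains a column of car names:

    # Points beyond the whiskers for a single column

    boxplot.stats(mtcars$hp)$out

    # Collect them for every numeric column

    lapply(Filter(is.numeric, mtcars), function(x) boxplot.stats(x)$out)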

Activity 8: Oversampling and Undersampling

Solution:

The detailed solution is as follows:

  1. Read the mushroom CSV file:

    ms<-read.csv('mushrooms.csv')

    summary(ms$bruises)

    The output is as follows:

       f    t

    4748 3376

  2. Perform downsampling using downSample() from the caret package:

    library(caret) #provides downSample() and upSample()

    set.seed(9560)

    undersampling <- downSample(x = ms[, -ncol(ms)], y = ms$bruises)

    table(undersampling$bruises)

    The output is as follows:

       f    t

    3376 3376

  3. Perform oversampling:

    set.seed(9560)

    oversampling <- upSample(x = ms[, -ncol(ms)],y = ms$bruises)

    table(oversampling$bruises)

    The output is as follows:

       f    t

    4748 4748

    In this activity, we learned to use downSample() and upSample() from the caret package to perform downsampling and oversampling.

Activity 9: Sampling and OverSampling using ROSE

Solution:

The detailed solution is as follows:

  1. Load the German credit dataset:

    #load the dataset

    library(caret)

    library(ROSE)

    data(GermanCredit)

  2. View the samples in the German credit dataset:

    #View samples

    head(GermanCredit)

    str(GermanCredit)

  3. Check the number of unbalanced data in the German credit dataset using the summary() method:

    #View the imbalanced data

    summary(GermanCredit$Class)

    The output is as follows:

    Bad Good

     300  700

  4. Use ROSE to balance the numbers:

    balanced_data <- ROSE(Class ~ ., data = GermanCredit, seed = 3)$data

    table(balanced_data$Class)

    The output is as follows:

    Good  Bad

     480  520

    Using the preceding example, we learned how to increase and decrease the class count using ROSE.
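    ROSE also provides ovun.sample() for plain over- and undersampling without synthetic data generation. A minimal sketch on the same dataset; the N values below are derived from the 700/300 class counts and are illustrative:

    # Oversample the minority class until both classes have 700 rows

    over <- ovun.sample(Class ~ ., data = GermanCredit, method = "over", N = 1400, seed = 3)$data

    table(over$Class)

    # Undersample the majority class down to 300 rows

    under <- ovun.sample(Class ~ ., data = GermanCredit, method = "under", N = 600, seed = 3)$data

    table(under$Class)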

Chapter 3: Feature Engineering

Activity 10: Calculating Time series Feature – Binning

Solution:

  1. Load the caret library:

    #Time series features

    library(caret)

    #Install caret if not installed

    #install.packages('caret')

  2. Load the GermanCredit dataset:

    GermanCredit = read.csv("GermanCredit.csv")

    duration<- GermanCredit$Duration #take the duration column

  3. Check the data summary as follows:

    summary(duration)

    The output is as follows:

    Figure 3.27: The summary of the Duration values of the German Credit dataset
  4. Load the ggplot2 library:

    library(ggplot2)

  5. Plot using the command:

    ggplot(data=GermanCredit, aes(x=Duration)) +

      geom_density(fill='lightblue') +

      geom_rug() +

      labs(x='Duration')

    The output is as follows:

    Figure 3.28: Plot of the duration vs density
  6. Create bins:

    #Creating Bins

    # set up boundaries for intervals/bins

    breaks <- c(0,10,20,30,40,50,60,70,80)

  7. Create labels:

    # specify interval/bin labels

    labels <- c("<10", "10-20", "20-30", "30-40", "40-50", "50-60", "60-70", "70-80")

  8. Bucket the data points into the bins.

    # bucketing data points into bins

    bins <- cut(duration, breaks, include.lowest = T, right=FALSE, labels=labels)

  9. Find the number of elements in each bin:

    # inspect bins

    summary(bins)

    The output is as follows:

      <10 10-20 20-30 30-40 40-50 50-60 60-70 70-80

      143   403   241   131    66     2    13     1

  10. Plot the bins:

    #Plotting the bins

    plot(bins, main="Frequency of Duration", ylab="Duration Count", xlab="Duration Bins",col="bisque")

    The output is as follows:

Figure 3.29: Plot of duration in bins

We can conclude that the largest number of customers falls within the 10-20 duration bin.
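Fixed-width breaks are only one choice; quantile-based breaks instead put roughly equal numbers of observations in each bin. A minimal sketch, assuming the duration vector from step 2 and distinct quartile values:

    # Quantile-based breaks give bins of roughly equal size

    q_breaks <- quantile(duration, probs = seq(0, 1, 0.25))

    q_bins <- cut(duration, q_breaks, include.lowest = TRUE)

    table(q_bins)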

Activity 11: Identifying Skewness

Solution:

  1. Load the library mlbench.

    #Skewness

    library(mlbench)

    library(e1071)

  2. Load the PimaIndiansDiabetes data.

    PimaIndiansDiabetes = read.csv("PimaIndiansDiabetes.csv")

  3. Print the skewness of the glucose column, using the skewness() function.

    #Printing the skewness of the columns

    #Not skewed

    skewness(PimaIndiansDiabetes$glucose)

    The output is as follows:

    [1] 0.1730754

  4. Plot the histogram using the histogram() function.

    histogram(PimaIndiansDiabetes$glucose)

    The output is as follows:

    Figure 3.30: Histogram of the glucose values of the PimaIndiansDiabetes dataset

    A negative skewness value means that the data is skewed to the left, and a positive skewness value means that the data is skewed to the right. Since the value here is 0.17, which is close to zero, the data is neither strongly left- nor right-skewed; it is approximately symmetric.

  5. Find the skewness of the age column using the skewness() function.

    #Highly skewed

    skewness(PimaIndiansDiabetes$age)

    The output is as follows:

    [1] 1.125188

  6. Plot the histogram using the histogram() function.

    histogram(PimaIndiansDiabetes$age)

    The output is as follows:

Figure 3.31: Histogram of the age values of the PimaIndiansDiabetes dataset

The positive skewness value means that the data is skewed to the right, as the histogram above shows.
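A common remedy for right skew is a log transform. A minimal sketch checking how it changes the skewness of age; since the minimum age is 21, the logarithm is well defined:

    # Log-transforming a right-skewed column pulls the skewness towards zero

    skewness(log(PimaIndiansDiabetes$age))

    histogram(log(PimaIndiansDiabetes$age))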

Activity 12: Generating PCA

Solution:

  1. Load the GermanCredit data.

    #PCA Analysis

    data(GermanCredit)

  2. Create a subset of the first nine columns in another variable named GermanCredit_subset.

    #Use the German Credit Data

    GermanCredit_subset <- GermanCredit[,1:9]

  3. Find the principal components:

    #Find out the Principal components

    principal_components <- prcomp(x = GermanCredit_subset, scale. = T)

  4. Print the principal components:

    #Print the principal components

    print(principal_components)

    The output is as follows:

    Standard deviations (1, .., p=9):

    [1] 1.3505916 1.2008442 1.1084157 0.9721503 0.9459586 0.9317018 0.9106746 0.8345178 0.5211137

    Rotation (n x k) = (9 x 9):

Figure 3.32: The rotation matrix of the principal components

Therefore, by using principal component analysis, we can compute the nine principal components of this dataset. Each component is a combination of the original fields, and the components can be used as features in their own right.
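To decide how many components to keep, summary() reports the proportion of variance each component explains, and predict() returns the component scores that can be used as features. A minimal sketch:

    # Proportion of variance explained by each component

    summary(principal_components)

    # Component scores: one row per observation, usable as features

    pca_features <- predict(principal_components, GermanCredit_subset)

    head(pca_features[, 1:3])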

Activity 13: Implementing the Random Forest Approach

Solution:

  1. Load the GermanCredit data:

    data(GermanCredit)

  2. Create a subset to load the first ten columns into GermanCredit_subset.

    GermanCredit_subset <- GermanCredit[,1:10]

  3. Attach the randomForest package:

    library(randomForest)

  4. Train a random forest model using random_forest = randomForest(Class~., data=GermanCredit_subset):

    random_forest = randomForest(Class~., data=GermanCredit_subset)

  5. Invoke importance() for the trained random_forest:

    # Create an importance based on mean decreasing gini

    importance(random_forest)

    The output is as follows:

                              MeanDecreaseGini

    Duration                         70.380265

    Amount                          121.458790

    InstallmentRatePercentage        27.048517

    ResidenceDuration                30.409254

    Age                              86.476017

    NumberExistingCredits            18.746057

    NumberPeopleMaintenance          12.026969

    Telephone                        15.581802

    ForeignWorker                     2.888387

  6. Use the varImp() function to view the list of important variables.

    varImp(random_forest)

    The output is as follows:

                                 Overall

    Duration                   70.380265

    Amount                    121.458790

    InstallmentRatePercentage  27.048517

    ResidenceDuration          30.409254

    Age                        86.476017

    NumberExistingCredits      18.746057

    NumberPeopleMaintenance    12.026969

    Telephone                  15.581802

    ForeignWorker               2.888387

    In this activity, we built a random forest model and used it to measure the importance of each variable in the dataset. Variables with higher scores are considered more important. We can then sort by importance and select, say, the top 5 or top 10 variables, or set an importance threshold and keep every variable that meets it.
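    Turning the importance scores into a feature subset is then straightforward. A minimal sketch, assuming the random_forest object from step 4; the threshold of 20 is illustrative:

    # Keep only the variables whose mean decrease in Gini exceeds the threshold

    imp <- importance(random_forest)

    selected <- rownames(imp)[imp[, "MeanDecreaseGini"] > 20]

    selected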

Activity 14: Selecting Features Using Variable Importance

Solution:

  1. Install the following packages:

    install.packages("rpart")

    library(rpart)

    library(caret)

    set.seed(10)

  2. Load the GermanCredit dataset:

    data(GermanCredit)

  3. Create a subset to load the first ten columns into GermanCredit_subset:

    GermanCredit_subset <- GermanCredit[,1:10]

  4. Train an rpart model using rPartMod <- train(Class ~ ., data=GermanCredit_subset, method="rpart"):

    #Train a rpart model

    rPartMod <- train(Class ~ ., data=GermanCredit_subset, method="rpart")

  5. Invoke the varImp() function, as in rpartImp <- varImp(rPartMod).

    #Find variable importance

    rpartImp <- varImp(rPartMod)

  6. Print rpartImp.

    #Print variable importance

    print(rpartImp)

    The output is as follows:

    rpart variable importance

                              Overall

    Amount                    100.000

    Duration                   89.670

    Age                        75.229

    ForeignWorker              22.055

    InstallmentRatePercentage  17.288

    Telephone                   7.813

    ResidenceDuration           4.471

    NumberExistingCredits       0.000

    NumberPeopleMaintenance     0.000

  7. Plot rpartImp using plot().

    #Plot top 5 variable importance

    plot(rpartImp, top = 5, main='Variable Importance')

    The output is as follows:

Figure 3.33: Variable importance for the fields

From the preceding plot, we can observe that Amount, Duration, and Age have high importance values.

Chapter 4: Introduction to neuralnet and Evaluation Methods

Activity 15: Training a Neural Network

Solution:

  1. Attach the packages:

    # Attach the packages

    library(caret)

    library(groupdata2)

    library(neuralnet)

    library(NeuralNetTools)

  2. Set the seed value for reproducibility and easier comparison:

    # Set seed for reproducibility and easier comparison

    set.seed(1)

  3. Load the GermanCredit dataset:

    # Load the German Credit dataset

    GermanCredit <- read.csv("GermanCredit.csv")

  4. Remove the Age column:

    # Remove the Age column

    GermanCredit$Age <- NULL

  5. Create balanced partitions such that all three partitions have the same ratio of each class:

    # Partition with same ratio of each class in all three partitions

    partitions <- partition(GermanCredit, p = c(0.6, 0.2),

                            cat_col = "Class")

    train_set <- partitions[[1]]

    dev_set <- partitions[[2]]

    valid_set <- partitions[[3]]

  6. Find the preprocessing parameters for scaling and centering from the training set:

    # Find scaling and centering parameters

    params <- preProcess(train_set[, 1:6], method=c("center", "scale"))

  7. Apply standardization to the first six predictors in all three partitions, using the preProcess parameters from the previous step:

    # Transform the training set

    train_set[, 1:6] <- predict(params, train_set[, 1:6])

    # Transform the development set

    dev_set[, 1:6] <- predict(params, dev_set[, 1:6])

    # Transform the validation set

    valid_set[, 1:6] <- predict(params, valid_set[, 1:6])

  8. Train the neural network classifier:

    # Train the neural network classifier

    nn <- neuralnet(Class == "Good" ~ InstallmentRatePercentage +

                    ResidenceDuration + NumberExistingCredits,

                    train_set, linear.output = FALSE)

  9. Plot the network with its weights:

    # Plot the network

    plotnet(nn, var_labs=FALSE)

    The output is as follows:

    Figure 4.18: Neural network architecture using three predictors
  10. Print the error:

    train_error <- nn$result.matrix[1]

    train_error

    The output is as follows:

    ## [1] 62.15447

    The random initialization of the neural network weights can lead to slightly different results from one training to another. To avoid this, we use the set.seed() function at the beginning of the script, which helps when comparing models. We could also train the same model architecture with five different seeds to get a better sense of its performance.
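    A minimal sketch of the multiple-seeds idea, retraining the same architecture under different seeds and summarizing the training errors; the seed values are arbitrary, and individual runs may occasionally fail to converge:

    # Train the same architecture under several seeds and collect the errors

    seed_errors <- sapply(c(1, 2, 3, 4, 5), function(seed) {

      set.seed(seed)

      nn_i <- neuralnet(Class == "Good" ~ InstallmentRatePercentage + ResidenceDuration + NumberExistingCredits, train_set, linear.output = FALSE)

      nn_i$result.matrix[1]

    })

    mean(seed_errors)

    sd(seed_errors)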

Activity 16: Training and Comparing Neural Network Architectures

Solution:

  1. Attach the packages:

    # Attach the packages

    library(groupdata2)

    library(caret)

    library(neuralnet)

    library(mlbench)

  2. Set the random seed to 1:

    # Set seed for reproducibility and easier comparison

    set.seed(1)

  3. Load the PimaIndiansDiabetes2 dataset:

    # Load the PimaIndiansDiabetes2 dataset

    PimaIndiansDiabetes2 <- read.csv("PimaIndiansDiabetes2.csv")

  4. Summarize the dataset.

    summary(PimaIndiansDiabetes2)

    The summary is as follows:

    ##     pregnant         glucose         pressure         triceps     

    ##  Min.   : 0.000   Min.   : 44.0   Min.   : 24.00   Min.   : 7.00  

    ##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 64.00   1st Qu.:22.00  

    ##  Median : 3.000   Median :117.0   Median : 72.00   Median :29.00  

    ##  Mean   : 3.845   Mean   :121.7   Mean   : 72.41   Mean   :29.15  

    ##  3rd Qu.: 6.000   3rd Qu.:141.0   3rd Qu.: 80.00   3rd Qu.:36.00  

    ##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  

    ##                   NA's   :5       NA's   :35       NA's   :227  

    ##  

    ##     insulin            mass          pedigree           age       

    ##  Min.   : 14.00   Min.   :18.20   Min.   :0.0780   Min.   :21.00  

    ##  1st Qu.: 76.25   1st Qu.:27.50   1st Qu.:0.2437   1st Qu.:24.00  

    ##  Median :125.00   Median :32.30   Median :0.3725   Median :29.00  

    ##  Mean   :155.55   Mean   :32.46   Mean   :0.4719   Mean   :33.24  

    ##  3rd Qu.:190.00   3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00  

    ##  Max.   :846.00   Max.   :67.10   Max.   :2.4200   Max.   :81.00  

    ##  NA's   :374      NA's   :11               

    ##                       

    ##  diabetes

    ##  neg:500  

    ##  pos:268        

  5. Handle missing data (quick solution). Start by assigning the dataset to a new name:

    # Assign/copy dataset to a new name

    diabetes_data <- PimaIndiansDiabetes2

  6. Remove the triceps and insulin columns:

    # Remove the triceps and insulin columns

    diabetes_data$triceps <- NULL

    diabetes_data$insulin <- NULL

  7. Remove all rows containing missing data (NAs):

    # Remove all rows with NAs (missing data)

    diabetes_data <- na.omit(diabetes_data)

  8. Partition the dataset into a training set (60%), a development set (20%), and a validation set (20%). Use cat_col="diabetes" to balance the ratios of each class between the partitions:

    # Partition with same ratio of each class in all three partitions

    partitions <- partition(diabetes_data, p = c(0.6, 0.2),

                                cat_col = "diabetes")

    train_set <- partitions[[1]]

    dev_set <- partitions[[2]]

    valid_set <- partitions[[3]]

  9. Find the preProcess parameters for scaling and centering the first six features:

    # Find scaling and centering parameters

    params <- preProcess(train_set[, 1:6], method = c("center", "scale"))

  10. Apply the scaling and centering to each partition:

    # Transform the training set

    train_set[, 1:6] <- predict(params, train_set[, 1:6])

    # Transform the development set

    dev_set[, 1:6] <- predict(params, dev_set[, 1:6])

    # Transform the validation set

    valid_set[, 1:6] <- predict(params, valid_set[, 1:6])

  11. Train multiple neural network architectures. Adjust them by changing the number of nodes and/or layers. In the model formula, use diabetes == "pos":

    # Training multiple neural nets

    nn4 <- neuralnet(diabetes == "pos" ~ ., train_set,

                     linear.output = FALSE, hidden = c(3))

    nn5 <- neuralnet(diabetes == "pos" ~ ., train_set,

                     linear.output = FALSE, hidden = c(2,1))

    nn6 <- neuralnet(diabetes == "pos" ~ ., train_set,

                     linear.output = FALSE, hidden = c(3,2))

  12. Put the model objects into a list:

    # Put the model objects into a list

    models <- list("nn4"=nn4,"nn5"=nn5,"nn6"=nn6)

  13. Create one-hot encoding of the diabetes variable:

    # Evaluating each model on the dev_set

    # Create one-hot encoding of diabetes variable

    dev_true_labels <- ifelse(dev_set$diabetes == "pos", 1, 0)

  14. Create a for loop for evaluating the models. By running the evaluations in a for loop, we avoid repeating the code:

    # Evaluate one model at a time in a loop, to avoid repeating the code

    for (i in 1:length(models)){

      

      # Predict the classes in the development set

      dev_predicted_probabilities <- predict(models[[i]], dev_set)

      dev_predictions <- ifelse(dev_predicted_probabilities > 0.5, 1, 0)

      

      # Create confusion Matrix

      confusion_matrix <- confusionMatrix(as.factor(dev_predictions),

                                          as.factor(dev_true_labels),

                                          mode="prec_recall",

                                          positive = "1")

      

      # Print the results for this model

      # Note: paste0() concatenates the strings

      # to (name of model + " on the dev...")

      print( paste0( names(models)[[i]], " on the development set: "))

      print(confusion_matrix)

      

    }

    The output is as follows:

    ## [1] "nn4 on the development set: "

    ## Confusion Matrix and Statistics

    ##

    ##           Reference

    ## Prediction  0  1

    ##          0 79 19

    ##          1 16 30

    ##                                           

    ##                Accuracy : 0.7569          

    ##                  95% CI : (0.6785, 0.8245)

    ##     No Information Rate : 0.6597          

    ##     P-Value [Acc > NIR] : 0.007584        

    ##                                           

    ##                   Kappa : 0.4505          

    ##  Mcnemar's Test P-Value : 0.735317        

    ##                                           

    ##               Precision : 0.6522          

    ##                  Recall : 0.6122          

    ##                      F1 : 0.6316          

    ##              Prevalence : 0.3403          

    ##          Detection Rate : 0.2083          

    ##    Detection Prevalence : 0.3194          

    ##       Balanced Accuracy : 0.7219          

    ##                                           

    ##        'Positive' Class : 1               

    ##                                           

    ## [1] "nn5 on the development set: "

    ## Confusion Matrix and Statistics

    ##

    ##           Reference

    ## Prediction  0  1

    ##          0 77 16

    ##          1 18 33

    ##                                          

    ##                Accuracy : 0.7639         

    ##                  95% CI : (0.686, 0.8306)

    ##     No Information Rate : 0.6597         

    ##     P-Value [Acc > NIR] : 0.004457       

    ##                                          

    ##                   Kappa : 0.4793         

    ##  Mcnemar's Test P-Value : 0.863832       

    ##                                          

    ##               Precision : 0.6471         

    ##                  Recall : 0.6735         

    ##                      F1 : 0.6600         

    ##              Prevalence : 0.3403         

    ##          Detection Rate : 0.2292         

    ##    Detection Prevalence : 0.3542         

    ##       Balanced Accuracy : 0.7420         

    ##                                          

    ##        'Positive' Class : 1              

    ##                                          

    ## [1] "nn6 on the development set: "

    ## Confusion Matrix and Statistics

    ##

    ##           Reference

    ## Prediction  0  1

    ##          0 76 14

    ##          1 19 35

    ##                                           

    ##                Accuracy : 0.7708          

    ##                  95% CI : (0.6935, 0.8367)

    ##     No Information Rate : 0.6597          

    ##     P-Value [Acc > NIR] : 0.002528        

    ##                                           

    ##                   Kappa : 0.5019          

    ##  Mcnemar's Test P-Value : 0.486234        

    ##                                           

    ##               Precision : 0.6481          

    ##                  Recall : 0.7143          

    ##                      F1 : 0.6796          

    ##              Prevalence : 0.3403          

    ##          Detection Rate : 0.2431          

    ##    Detection Prevalence : 0.3750          

    ##       Balanced Accuracy : 0.7571          

    ##                                           

    ##        'Positive' Class : 1               

  15. As the nn6 model has the highest accuracy and F1 score, it is the best model.
  16. Evaluate the best model on the validation set. Start by creating the one-hot encoding of the diabetes variable in the validation set:

    # Create one-hot encoding of Class variable

    valid_true_labels <- ifelse(valid_set$diabetes == "pos", 1, 0)

  17. Use the best model to predict the diabetes variable in the validation set:

    # Predict the classes in the validation set

    predicted_probabilities <- predict(nn6, valid_set)

    predictions <- ifelse(predicted_probabilities > 0.5, 1, 0)

  18. Create a confusion matrix:

    # Create confusion Matrix

    confusion_matrix <- confusionMatrix(as.factor(predictions),

                                        as.factor(valid_true_labels),

                                        mode="prec_recall", positive = "1")

  19. Print the results:

    # Print the results for this model

    # Note that by separating two function calls by ";"

    # we can have multiple calls per line

    print("nn6 on the validation set:"); print(confusion_matrix)

    The output is as follows:

    ## [1] "nn6 on the validation set:"

    ## Confusion Matrix and Statistics

    ##

    ##           Reference

    ## Prediction  0  1

    ##          0 70 16

    ##          1 25 35

    ##                                           

    ##                Accuracy : 0.7192          

    ##                  95% CI : (0.6389, 0.7903)

    ##     No Information Rate : 0.6507          

    ##     P-Value [Acc > NIR] : 0.04779         

    ##                                           

    ##                   Kappa : 0.4065          

    ##  Mcnemar's Test P-Value : 0.21152         

    ##                                           

    ##               Precision : 0.5833          

    ##                  Recall : 0.6863          

    ##                      F1 : 0.6306          

    ##              Prevalence : 0.3493          

    ##          Detection Rate : 0.2397          

    ##    Detection Prevalence : 0.4110          

    ##       Balanced Accuracy : 0.7116          

    ##                                           

    ##        'Positive' Class : 1               

  20. Plot the best model:

    plotnet(nn6, var_labs=FALSE)

    The output will look as follows:

    Figure 4.19: The best neural network architecture without cross-validation.

In this activity, we have trained multiple neural network architectures and evaluated the best model on the validation set.

Activity 17: Training and Comparing Neural Network Architectures with Cross-Validation

Solution:

  1. Attach the packages.

    # Attach the packages

    library(groupdata2)

    library(caret)

    library(neuralnet)

    library(mlbench)

  2. Set the random seed to 1.

    # Set seed for reproducibility and easier comparison

    set.seed(1)

  3. Load the PimaIndiansDiabetes2 dataset.

    # Load the PimaIndiansDiabetes2 dataset

    data(PimaIndiansDiabetes2)

  4. Handle missing data (quick solution).

    Start by assigning the dataset to a new name.

    # Handling missing data (quick solution)

    # Assign/copy dataset to a new name

    diabetes_data <- PimaIndiansDiabetes2

  5. Remove the triceps and insulin columns.

    # Remove the triceps and insulin columns

    diabetes_data$triceps <- NULL

    diabetes_data$insulin <- NULL

  6. Remove all rows with NAs.

    # Remove all rows with NAs (missing data)

    diabetes_data <- na.omit(diabetes_data)

  7. Partition the dataset into a training set (80%) and validation set (20%). Use cat_col="diabetes" to balance the ratios of each class between the partitions.

    # Partition into a training set and a validation set

    partitions <- partition(diabetes_data, p = 0.8, cat_col = "diabetes")

    train_set <- partitions[[1]]

    valid_set <- partitions[[2]]

  8. Find the preProcess parameters for scaling and centering the first six features.

    # Find scaling and centering parameters

    # Note: We could also decide to do this inside the training loop!

    params <- preProcess(train_set[, 1:6], method=c("center", "scale"))

  9. Apply the scaling and centering to both partitions.

    # Transform the training set

    train_set[, 1:6] <- predict(params, train_set[, 1:6])

    # Transform the validation set

    valid_set[, 1:6] <- predict(params, valid_set[, 1:6])

  10. Create 4 folds in the training set, using the fold() function. Use cat_col="diabetes" to balance the ratios of each class between the folds.

    # Create folds for cross-validation

    # Balance on the Class variable

    train_set <- fold(train_set, k=4, cat_col = "diabetes")

    # Note: This creates a factor in the dataset called ".folds"

    # Take care not to use this as a predictor.

  11. Write the cross-validation training section. Start by initializing the vectors for collecting errors and accuracies.

    ## Cross-validation loop

    # Change the model formula in the loop and run the below

    # for each model architecture you're testing

    # Initialize vectors for collecting errors and accuracies

    errors <- c()

    accuracies <- c()

    Start the training for loop. We have 4 folds, so we need 4 iterations.

    # Training loop

    for (part in 1:4){

  12. Assign the chosen fold as test set and the rest of the folds as train set. Be aware of the indentation.

      # Assign the chosen fold as test set

      # and the rest of the folds as train set

      cv_test_set <- train_set[train_set$.folds == part,]

      cv_train_set <- train_set[train_set$.folds != part,]

  13. Train the neural network with your chosen predictors.

      # Train neural network classifier

      # Make sure not to include the .folds column as a predictor!

      nn <- neuralnet(diabetes == "pos" ~ .,

                      cv_train_set[, 1:7],

                      linear.output = FALSE,

                      hidden=c(2,2))

  14. Append the error to the errors vector.

      # Append error to errors vector

      errors <- append(errors, nn$result.matrix[1])

  15. Create one-hot encoding of the target variable in the CV test set.

      # Create one-hot encoding of Class variable

      true_labels <- ifelse(cv_test_set$diabetes == "pos", 1, 0)

  16. Use the trained neural network to predict the target variable in the CV test set.

      # Predict the class in the test set

      # It returns probabilities that the observations are "pos"

      predicted_probabilities <- predict(nn, cv_test_set)

      predictions <- ifelse(predicted_probabilities > 0.5, 1, 0)

  17. Calculate accuracy. We could also use confusionMatrix() here, if we wanted other metrics.

      # Calculate accuracy manually

      # Note: TRUE == 1, FALSE == 0

      cv_accuracy <- sum(true_labels == predictions) / length(true_labels)

  18. Append the calculated accuracy to the accuracies vector.

      # Append the accuracy to the accuracies vector

      accuracies <- append(accuracies, cv_accuracy)

  19. Close the for loop.

    }

  20. Calculate average_error and print it.

    # Calculate average error and accuracy

    # Note that we could also have gathered the predictions from all the

    # folds and calculated the accuracy only once. This could lead to slightly

    # different results, e.g. if the folds are not exactly the same size.

    average_error <- mean(errors)

    average_error

    The output is as follows:

    ## [1] 28.38503

  21. Calculate average_accuracy and print it. Note that we could also have gathered the predictions from all the folds and calculated the accuracy only once (see the sketch after this activity).

    average_accuracy <- mean(accuracies)

    average_accuracy

    The output is as follows:

    ## [1] 0.7529813

  22. Evaluate the best model architecture on the validation set. Start by training an instance of the model architecture on the entire training set.

    # Once you have chosen the best model, train it on the entire training set

    # and evaluate on the validation set

    # Note that we set the stepmax, to make sure

    # it has enough training steps to converge

    nn_best <- neuralnet(diabetes == "pos" ~ .,

                         train_set[, 1:7],

                         linear.output = FALSE,

                         hidden=c(2,2),

                         stepmax = 2e+05)

  23. Create an one-hot encoding of the diabetes variable in the validation set.

    # Find the true labels in the validation set

    valid_true_labels <- ifelse(valid_set$diabetes == "pos", 1, 0)

  24. Use the model to predict the diabetes variable in the validation set.

    # Predict the classes in the validation set

    predicted_probabilities <- predict(nn_best, valid_set)

    predictions <- ifelse(predicted_probabilities > 0.5, 1, 0)

  25. Create a confusion matrix.

    # Create confusion matrix

    confusion_matrix <- confusionMatrix(as.factor(predictions),

                                        as.factor(valid_true_labels),

                                        mode="prec_recall", positive = "1")

  26. Print the results.

    # Print the results for this model

    print("nn_best on the validation set:")

    ## [1] "nn_best on the validation set:"

    print(confusion_matrix)

    ## Confusion Matrix and Statistics

    ##

    ##           Reference

    ## Prediction  0  1

    ##          0 78 20

    ##          1 17 30

    ##                                           

    ##                Accuracy : 0.7448          

    ##                  95% CI : (0.6658, 0.8135)

    ##     No Information Rate : 0.6552          

    ##     P-Value [Acc > NIR] : 0.01302         

    ##                                           

    ##                   Kappa : 0.4271          

    ##  Mcnemar's Test P-Value : 0.74231         

    ##                                           

    ##               Precision : 0.6383          

    ##                  Recall : 0.6000          

    ##                      F1 : 0.6186          

    ##              Prevalence : 0.3448          

    ##          Detection Rate : 0.2069          

    ##    Detection Prevalence : 0.3241          

    ##       Balanced Accuracy : 0.7105          

    ##                                           

    ##        'Positive' Class : 1               

    ##

  27. Plot the neural network.

    plotnet(nn_best, var_labs=FALSE)

    The output will be as follows:

Figure 4.20: Best neural network architecture found with cross-validation.
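Step 21 noted that the fold predictions could instead be pooled and scored once. A minimal sketch of that variant, reusing train_set and the .folds column created in the solution above; the pooled accuracy may differ slightly from the averaged fold accuracies:

    # Pool predictions across folds and compute the accuracy once

    all_true <- c()

    all_pred <- c()

    for (part in 1:4){

      cv_test_set <- train_set[train_set$.folds == part, ]

      cv_train_set <- train_set[train_set$.folds != part, ]

      nn <- neuralnet(diabetes == "pos" ~ ., cv_train_set[, 1:7], linear.output = FALSE, hidden = c(2, 2))

      probs <- predict(nn, cv_test_set)

      all_pred <- append(all_pred, ifelse(probs > 0.5, 1, 0))

      all_true <- append(all_true, ifelse(cv_test_set$diabetes == "pos", 1, 0))

    }

    # One pooled accuracy instead of an average of per-fold accuracies

    sum(all_true == all_pred) / length(all_true)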

Chapter 5: Linear and Logistic Regression Models

Activity 18: Implementing Linear Regression

Solution:

  1. Attach the packages:

    # Attach packages

    library(groupdata2)

    library(cvms)

    library(caret)

    library(knitr)

  2. Set the random seed to 1:

    # Set seed for reproducibility and easy comparison

    set.seed(1)

  3. Load the cars dataset from caret:

    # Load the cars dataset

    data(cars)

  4. Partition the dataset into a training set (80%) and a validation set (20%):

    # Partition the dataset

    partitions <- partition(cars, p = 0.8)

    train_set <- partitions[[1]]

    valid_set <- partitions[[2]]

  5. Fit multiple linear regression models on the training set with the lm() function, predicting Price. Try different predictors. View and interpret the summary() of each fitted model. How do the interpretations change when you add or subtract predictors?

    # Fit a couple of linear models and interpret them

    # Model 1 - Predicting price by mileage

    model_1 <- lm(Price ~ Mileage, data = train_set)

    summary(model_1)

    The summary of the model is as follows:

    ##

    ## Call:

    ## lm(formula = Price ~ Mileage, data = train_set)

    ##

    ## Residuals:

    ##    Min     1Q Median     3Q    Max

    ## -12977  -7400  -3453   6009  45540

    ##

    ## Coefficients:

    ##               Estimate Std. Error t value Pr(>|t|)   

    ## (Intercept)  2.496e+04  1.019e+03  24.488  < 2e-16 ***

    ## Mileage     -1.736e-01  4.765e-02  -3.644 0.000291 ***

    ## ---

    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    ##

    ## Residual standard error: 9762 on 641 degrees of freedom

    ## Multiple R-squared:  0.02029,    Adjusted R-squared:  0.01876

    ## F-statistic: 13.28 on 1 and 641 DF,  p-value: 0.0002906

    # Model 2 - Predicting price by number of doors

    model_2 <- lm(Price ~ Doors, data = train_set)

    summary(model_2)

    The summary of the model is as follows:

    ##

    ## Call:

    ## lm(formula = Price ~ Doors, data = train_set)

    ##

    ## Residuals:

    ##    Min     1Q Median     3Q    Max

    ## -12540  -7179  -2934   5814  45805

    ##

    ## Coefficients:

    ##             Estimate Std. Error t value Pr(>|t|)   

    ## (Intercept)  25682.2     1662.1  15.452   <2e-16 ***

    ## Doors        -1176.6      457.5  -2.572   0.0103 *

    ## ---

    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    ##

    ## Residual standard error: 9812 on 641 degrees of freedom

    ## Multiple R-squared:  0.01021,    Adjusted R-squared:  0.008671

    ## F-statistic: 6.615 on 1 and 641 DF,  p-value: 0.01034

    # Model 3 - Predicting price by mileage and number of doors

    model_3 <- lm(Price ~ Mileage + Doors, data = train_set)

    summary(model_3)

    The summary of the model is as follows:

    ##

    ## Call:

    ## lm(formula = Price ~ Mileage + Doors, data = train_set)

    ##

    ## Residuals:

    ##    Min     1Q Median     3Q    Max

    ## -12642  -7503  -3000   5595  43576

    ##

    ## Coefficients:

    ##               Estimate Std. Error t value Pr(>|t|)   

    ## (Intercept)  2.945e+04  1.926e+03  15.292  < 2e-16 ***

    ## Mileage     -1.786e-01  4.744e-02  -3.764 0.000182 ***

    ## Doors       -1.242e+03  4.532e+02  -2.740 0.006308 **

    ## ---

    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    ##

    ## Residual standard error: 9713 on 640 degrees of freedom

    ## Multiple R-squared:  0.03165,    Adjusted R-squared:  0.02863

    ## F-statistic: 10.46 on 2 and 640 DF,  p-value: 3.388e-05

  6. Create model formulas with combine_predictors(). Limit the number of possibilities by a) using only the first four predictors, b) limiting the number of fixed effects in the formulas to three by specifying max_fixed_effects = 3, c) limiting the biggest possible interaction to a two-way interaction by specifying max_interaction_size = 2, and d) limiting the number of times a predictor can be included in a formula by specifying max_effect_frequency = 1. These limitations will decrease the number of models to run, which you may or may not want in your own projects:

    # Create list of model formulas with combine_predictors()

    # Use only the 4 first predictors (to save time)

    # Limit the number of fixed effects (predictors) to 3,

    # Limit the biggest possible interaction to a 2-way interaction

    # Limit the number of times a fixed effect is included to 1

    model_formulas <- combine_predictors(

        dependent = "Price",

        fixed_effects = c("Mileage", "Cylinder",

                          "Doors", "Cruise"),

        max_fixed_effects = 3,

        max_interaction_size = 2,

        max_effect_frequency = 1)

    # Output model formulas

    model_formulas

    The output is as follows:

    ##  [1] "Price ~ Cruise"                    

    ##  [2] "Price ~ Cylinder"                  

    ##  [3] "Price ~ Doors"                     

    ##  [4] "Price ~ Mileage"                   

    ##  [5] "Price ~ Cruise * Cylinder"         

    ##  [6] "Price ~ Cruise * Doors"            

    ##  [7] "Price ~ Cruise * Mileage"          

    ##  [8] "Price ~ Cruise + Cylinder"         

    ##  [9] "Price ~ Cruise + Doors"            

    ## [10] "Price ~ Cruise + Mileage"          

    ## [11] "Price ~ Cylinder * Doors"          

    ## [12] "Price ~ Cylinder * Mileage"        

    ## [13] "Price ~ Cylinder + Doors"          

    ## [14] "Price ~ Cylinder + Mileage"        

    ## [15] "Price ~ Doors * Mileage"           

    ## [16] "Price ~ Doors + Mileage"           

    ## [17] "Price ~ Cruise * Cylinder + Doors"

    ## [18] "Price ~ Cruise * Cylinder + Mileage"

    ## [19] "Price ~ Cruise * Doors + Cylinder"

    ## [20] "Price ~ Cruise * Doors + Mileage"  

    ## [21] "Price ~ Cruise * Mileage + Cylinder"

    ## [22] "Price ~ Cruise * Mileage + Doors"  

    ## [23] "Price ~ Cruise + Cylinder * Doors"

    ## [24] "Price ~ Cruise + Cylinder * Mileage"

    ## [25] "Price ~ Cruise + Cylinder + Doors"

    ## [26] "Price ~ Cruise + Cylinder + Mileage"

    ## [27] "Price ~ Cruise + Doors * Mileage"  

    ## [28] "Price ~ Cruise + Doors + Mileage"  

    ## [29] "Price ~ Cylinder * Doors + Mileage"

    ## [30] "Price ~ Cylinder * Mileage + Doors"

    ## [31] "Price ~ Cylinder + Doors * Mileage"

    ## [32] "Price ~ Cylinder + Doors + Mileage"

  7. Create five fold columns with four folds each in the training set, using fold() with k = 4 and num_fold_cols = 5. Feel free to choose a higher number of fold columns:

    # Create 5 fold columns with 4 folds each in the training set

    train_set <- fold(train_set, k = 4,

                      num_fold_cols = 5)

  8. Create the fold column names with paste0():

    # Create list of fold column names

    fold_cols <- paste0(".folds_", 1:5)

  9. Perform repeated cross-validation on your model formulas with cvms:

    # Cross-validate the models with cvms

    CV_results <- cross_validate(train_set,

                                 models = model_formulas,

                                 fold_cols = fold_cols,

                                 family = "gaussian")

  10. Print the top 10 performing models according to RMSE. Select the best model:

    # Select the best model by RMSE

    # Order by RMSE

    CV_results <- CV_results[order(CV_results$RMSE),]

    # Select the 10 best performing models for printing

    # (Feel free to view all the models)

    CV_results_top10 <- head(CV_results, 10)

    # Show metrics and model definition columns

    # Use kable for a prettier output

    kable(select_metrics(CV_results_top10), digits = 2)

    The output is as follows:

    Figure 5.29: Top 10 performing models using RMSE
  11. Fit the best model on the entire training set and evaluate it on the validation set. This can be done with the validate() function in cvms:

    # Evaluate the best model on the validation set with validate()

    V_results <- validate(

        train_data = train_set,

        test_data = valid_set,

        models = "Price ~ Cruise * Cylinder + Mileage",

        family = "gaussian")

  12. The output contains the results data frame and the trained model. Assign these to variable names:

    valid_results <- V_results$Results

    valid_model <- V_results$Models[[1]]

  13. Print the results:

    # Print the results

    kable(select_metrics(valid_results), digits = 2)

    The output is as follows:

    Figure 5.30: Results of the validated model
  14. View and interpret the summary of the best model:

    # Print the model summary and interpret it

    summary(valid_model)

    The summary is as follows:

    ##

    ## Call:

    ## lm(formula = model_formula, data = train_set)

    ##

    ## Residuals:

    ##    Min     1Q Median     3Q    Max

    ## -10485  -5495  -1425   3494  34693

    ##

    ## Coefficients:

    ##                   Estimate Std. Error t value Pr(>|t|)   

    ## (Intercept)      8993.2446  3429.9320   2.622  0.00895 **

    ## Cruise          -1311.6871  3585.6289  -0.366  0.71462   

    ## Cylinder         1809.5447   741.9185   2.439  0.01500 *

    ## Mileage            -0.1569     0.0367  -4.274 2.21e-05 ***

    ## Cruise:Cylinder  1690.0768   778.7838   2.170  0.03036 *

    ## ---

    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    ##

    ## Residual standard error: 7503 on 638 degrees of freedom

    ## Multiple R-squared:  0.424,  Adjusted R-squared:  0.4203

    ## F-statistic: 117.4 on 4 and 638 DF,  p-value: < 2.2e-16

Activity 19: Classifying Room Types

Solution:

  1. Attach the groupdata2, cvms, caret, randomForest, rPref, and doParallel packages:

    library(groupdata2)

    library(cvms)

    library(caret)

    library(randomForest)

    library(rPref)

    library(doParallel)

  2. Set the random seed to 3:

    set.seed(3)

  3. Load the amsterdam.listings dataset from https://github.com/TrainingByPackt/Practical-Machine-Learning-with-R/blob/master/Data/amsterdam.listings.csv:

    # Load the amsterdam.listings dataset

    full_data <- read.csv("amsterdam.listings.csv")

  4. Convert the id and neighbourhood columns to factors:

    full_data$id <- factor(full_data$id)

    full_data$neighbourhood <- factor(full_data$neighbourhood)

  5. Summarize the dataset:

    summary(full_data)

    The summary of data is as follows:

    ##        id                        neighbourhood            room_type   

    ##  2818   :    1   De Baarsjes - Oud-West :3052   Entire home/apt:13724

    ##  20168  :    1   De Pijp - Rivierenbuurt:2166   Private room   : 3594

    ##  25428  :    1   Centrum-West           :2019                         

    ##  27886  :    1   Centrum-Oost           :1498                         

    ##  28658  :    1   Westerpark             :1315                         

    ##  28871  :    1   Zuid                   :1187                         

    ##  (Other):17312   (Other)                :6081                         

    ##  availability_365   log_price     log_minimum_nights log_number_of_reviews

    ##  Min.   :  0.00   Min.   :2.079   Min.   :0.0000     Min.   :0.000       

    ##  1st Qu.:  0.00   1st Qu.:4.595   1st Qu.:0.6931     1st Qu.:1.386       

    ##  Median :  0.00   Median :4.852   Median :0.6931     Median :2.398       

    ##  Mean   : 48.33   Mean   :4.883   Mean   :0.8867     Mean   :2.370       

    ##  3rd Qu.: 49.00   3rd Qu.:5.165   3rd Qu.:1.0986     3rd Qu.:3.219       

    ##  Max.   :365.00   Max.   :7.237   Max.   :4.7875     Max.   :6.196       

    ##                                                                          

    ##  log_reviews_per_month

    ##  Min.   :-4.60517    

    ##  1st Qu.:-1.42712    

    ##  Median :-0.61619    

    ##  Mean   :-0.67858    

    ##  3rd Qu.: 0.07696    

    ##  Max.   : 2.48907    

    ##

  6. Partition the dataset into a training set (80%) and a validation set (20%). Balance the partitions by room_type:

    partitions <- partition(full_data, p = 0.8,

                            cat_col = "room_type")

    train_set <- partitions[[1]]

    valid_set <- partitions[[2]]

  7. Prepare for running the baseline evaluations and the cross-validations in parallel by registering the number of cores for doParallel:

    # Register four CPU cores

    registerDoParallel(4)
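
    If you are unsure how many cores your machine has, you can check before registering; detectCores() comes from the parallel package that ships with R:

    parallel::detectCores()

    Registering one or two fewer cores than the total keeps the machine responsive while the evaluations run.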

  8. Create the baseline evaluation for the task on the validation set with the baseline() function from cvms. Run 100 evaluations in parallel. Specify the dependent column as room_type. Note that the default positive class is Private room:

    room_type_baselines <- baseline(test_data = valid_set,

                                    dependent_col = "room_type",

                                    n = 100,

                                    family = "binomial",

                                    parallel = TRUE)

    # Inspect summarized metrics

    room_type_baselines$summarized_metrics

    The output is as follows:

    ## # A tibble: 10 x 15

    ##    Measure 'Balanced Accur…       F1 Sensitivity Specificity

    ##    <chr>              <dbl>    <dbl>       <dbl>       <dbl>

    ##  1 Mean              0.502   0.295        0.503      0.500

    ##  2 Median            0.502   0.294        0.502      0.500

    ##  3 SD                0.0101  0.00954      0.0184     0.00924

    ##  4 IQR               0.0154  0.0131       0.0226     0.0122

    ##  5 Max               0.525   0.317        0.551      0.522

    ##  6 Min               0.480   0.275        0.463      0.480

    ##  7 NAs               0       0            0          0     

    ##  8 INFs              0       0            0          0     

    ##  9 All_0             0.5    NA            0          1     

    ## 10 All_1             0.5     0.344        1          0     

    ## # … with 10 more variables: 'Pos Pred Value' <dbl>, 'Neg Pred

    ## #   Value' <dbl>, AUC <dbl>, 'Lower CI' <dbl>, 'Upper CI' <dbl>,

    ## #   Kappa <dbl>, MCC <dbl>, 'Detection Rate' <dbl>, 'Detection

    ## #   Prevalence' <dbl>, Prevalence <dbl>
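
    The returned list also holds the individual evaluations behind these summaries. Assuming the random_evaluations element from cvms' output structure, they can be inspected with:

    # Inspect the 100 individual baseline evaluations
    head(room_type_baselines$random_evaluations)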

  9. Fit multiple logistic regression models on the training set with the glm() function, predicting room_type. Try different predictors. View the summary() of each fitted model and try to interpret the estimated coefficients. Observe how the interpretations change when you add or subtract predictors:

    logit_model_1 <- glm("room_type ~ log_number_of_reviews",

                         data = train_set, family = "binomial")

    summary(logit_model_1)

    The summary of the model is as follows:

    ## ## Call:

    ## glm(formula = "room_type ~ log_number_of_reviews", family = "binomial",

    ##     data = train_set)

    ##

    ## Deviance Residuals:

    ##     Min       1Q   Median       3Q      Max

    ## -1.3030  -0.7171  -0.5668  -0.3708   2.3288

    ##

    ## Coefficients:

    ##                       Estimate Std. Error z value Pr(>|z|)   

    ## (Intercept)           -2.64285    0.05374  -49.18   <2e-16 ***

    ## log_number_of_reviews  0.49976    0.01745   28.64   <2e-16 ***

    ## ---

    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    ##

    ## (Dispersion parameter for binomial family taken to be 1)

    ##

    ##     Null deviance: 14149  on 13853  degrees of freedom

    ## Residual deviance: 13249  on 13852  degrees of freedom

    ## AIC: 13253

    ##

    ## Number of Fisher Scoring iterations: 4

  10. Add availability_365 as a predictor:

    logit_model_2 <- glm(

        "room_type ~ availability_365 + log_number_of_reviews",

        data = train_set, family = "binomial")

    summary(logit_model_2)

    The summary of the model is as follows:

    ##

    ## Call:

    ## glm(formula = "room_type ~ availability_365 + log_number_of_reviews",

    ##     family = "binomial", data = train_set)

    ##

    ## Deviance Residuals:

    ##     Min       1Q   Median       3Q      Max

    ## -1.5277  -0.6735  -0.5535  -0.3688   2.3365

    ##

    ## Coefficients:

    ##                         Estimate Std. Error z value Pr(>|z|)   

    ## (Intercept)           -2.6622105  0.0533345  -49.91   <2e-16 ***

    ## availability_365       0.0039866  0.0002196   18.16   <2e-16 ***

    ## log_number_of_reviews  0.4172148  0.0178015   23.44   <2e-16 ***

    ## ---

    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    ##

    ## (Dispersion parameter for binomial family taken to be 1)

    ##

    ##     Null deviance: 14149  on 13853  degrees of freedom

    ## Residual deviance: 12935  on 13851  degrees of freedom

    ## AIC: 12941

    ##

    ## Number of Fisher Scoring iterations: 4

  11. Add log_price as a predictor:

    logit_model_3 <- glm(

        "room_type ~ availability_365 + log_number_of_reviews + log_price",

        data = train_set, family = "binomial")

    summary(logit_model_3)

    The summary of the model is as follows:

    ##

    ## Call:

    ## glm(formula = "room_type ~ availability_365 + log_number_of_reviews + log_price",

    ##     family = "binomial", data = train_set)

    ##

    ## Deviance Residuals:

    ##     Min       1Q   Median       3Q      Max

    ## -3.7678  -0.5805  -0.3395  -0.1208   3.9864

    ##

    ## Coefficients:

    ##                         Estimate Std. Error z value Pr(>|z|)   

    ## (Intercept)           12.6730455  0.3419771   37.06   <2e-16 ***

    ## availability_365       0.0081215  0.0002865   28.34   <2e-16 ***

    ## log_number_of_reviews  0.3613055  0.0199845   18.08   <2e-16 ***

    ## log_price             -3.2539506  0.0745417  -43.65   <2e-16 ***

    ## ---

    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    ##

    ## (Dispersion parameter for binomial family taken to be 1)

    ##

    ##     Null deviance: 14149.2  on 13853  degrees of freedom

    ## Residual deviance:  9984.7  on 13850  degrees of freedom

    ## AIC: 9992.7

    ##

    ## Number of Fisher Scoring iterations: 6

  12. Add log_minimum_nights as a predictor:

    logit_model_4 <- glm(

        "room_type ~ availability_365 + log_number_of_reviews + log_price + log_minimum_nights",

         data = train_set, family = "binomial")

    summary(logit_model_4)

    The summary of the model is as follows:

    ##

    ## Call:

    ## glm(formula = "room_type ~ availability_365 + log_number_of_reviews + log_price + log_minimum_nights",

    ##     family = "binomial", data = train_set)

    ##

    ## Deviance Residuals:

    ##     Min       1Q   Median       3Q      Max

    ## -3.6268  -0.5520  -0.3142  -0.1055   4.6062

    ##

    ## Coefficients:

    ##                        Estimate Std. Error z value Pr(>|z|)   

    ## (Intercept)           13.470868   0.354695   37.98   <2e-16 ***

    ## availability_365       0.008310   0.000295   28.17   <2e-16 ***

    ## log_number_of_reviews  0.360133   0.020422   17.64   <2e-16 ***

    ## log_price             -3.252957   0.076343  -42.61   <2e-16 ***

    ## log_minimum_nights    -1.007354   0.051131  -19.70   <2e-16 ***

    ## ---

    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    ##

    ## (Dispersion parameter for binomial family taken to be 1)

    ##

    ##     Null deviance: 14149.2  on 13853  degrees of freedom

    ## Residual deviance:  9543.9  on 13849  degrees of freedom

    ## AIC: 9553.9

    ##

    ## Number of Fisher Scoring iterations: 6

  13. Replace log_number_of_reviews with log_reviews_per_month:

    logit_model_5 <- glm(

        "room_type ~ availability_365 + log_reviews_per_month + log_price + log_minimum_nights",

        data = train_set, family = "binomial")

    summary(logit_model_5)

    The summary of the model is as follows:

    ##

    ## Call:

    ## glm(formula = "room_type ~ availability_365 + log_reviews_per_month + log_price + log_minimum_nights",

    ##     family = "binomial", data = train_set)

    ##

    ## Deviance Residuals:

    ##     Min       1Q   Median       3Q      Max

    ## -3.7351  -0.5229  -0.2934  -0.0968   4.7252

    ##

    ## Coefficients:

    ##                         Estimate Std. Error z value Pr(>|z|)   

    ## (Intercept)           14.7495851  0.3652303   40.38   <2e-16 ***

    ## availability_365       0.0074308  0.0003019   24.61   <2e-16 ***

    ## log_reviews_per_month  0.6364218  0.0246423   25.83   <2e-16 ***

    ## log_price             -3.2850702  0.0781567  -42.03   <2e-16 ***

    ## log_minimum_nights    -0.8504701  0.0526379  -16.16   <2e-16 ***

    ## ---

    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    ##

    ## (Dispersion parameter for binomial family taken to be 1)

    ##

    ##     Null deviance: 14149  on 13853  degrees of freedom

    ## Residual deviance:  9103  on 13849  degrees of freedom

    ## AIC: 9113

    ##

    ## Number of Fisher Scoring iterations: 6
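
    Because logistic regression coefficients are on the log-odds scale, exponentiating them can make them easier to read as odds ratios. A minimal sketch for the model just fitted:

    # Convert the log-odds coefficients to odds ratios
    exp(coef(logit_model_5))

    # For example, exp(-3.285) is roughly 0.037, so each one-unit increase
    # in log_price multiplies the odds of "Private room" by about 0.037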

  14. Create model formulas with combine_predictors(). To save time, limit the interaction size to 2 by specifying max_interaction_size = 2, and limit the number of times an effect can be included in a formula to 1 by specifying max_effect_frequency = 1:

    model_formulas <- combine_predictors(

        dependent = "room_type",

        fixed_effects = c("log_minimum_nights",

                          "log_number_of_reviews",

                          "log_price",

                          "availability_365",

                          "log_reviews_per_month"),

        max_interaction_size = 2,

        max_effect_frequency = 1)

    head(model_formulas, 10)

    The output is as follows:

    ##  [1] "room_type ~ availability_365"                       

    ##  [2] "room_type ~ log_minimum_nights"                     

    ##  [3] "room_type ~ log_number_of_reviews"                  

    ##  [4] "room_type ~ log_price"                              

    ##  [5] "room_type ~ log_reviews_per_month"                  

    ##  [6] "room_type ~ availability_365 * log_minimum_nights"  

    ##  [7] "room_type ~ availability_365 * log_number_of_reviews"

    ##  [8] "room_type ~ availability_365 * log_price"           

    ##  [9] "room_type ~ availability_365 * log_reviews_per_month"

    ## [10] "room_type ~ availability_365 + log_minimum_nights"
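
    Before cross-validating, it can be useful to know how many candidate formulas these constraints produced:

    # Count the generated model formulas
    length(model_formulas)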

  15. Create five fold columns, each with five folds, in the training set, using fold() with k = 5 and num_fold_cols = 5. Balance the folds by room_type. Feel free to choose a higher number of fold columns:

    train_set <- fold(train_set, k = 5,

                      num_fold_cols = 5,

                      cat_col = "room_type")

  16. Perform cross-validation (not repeated) on your model formulas with cvms. Specify fold_cols = ".folds_1". Order the results by F1 and show the best 10 models:

    initial_cv_results <- cross_validate(

        train_set,

        models = model_formulas,

        fold_cols = ".folds_1",

        family = "binomial",

        parallel = TRUE)

    initial_cv_results <- initial_cv_results[

        order(initial_cv_results$F1, decreasing = TRUE),]

    head(initial_cv_results, 10)

    The output is as follows:

    ## # A tibble: 10 x 26

    ##    'Balanced Accur…    F1 Sensitivity Specificity 'Pos Pred Value'

    ##               <dbl> <dbl>       <dbl>       <dbl>            <dbl>

    ##  1            0.764 0.662       0.567       0.962            0.797

    ##  2            0.764 0.662       0.565       0.963            0.798

    ##  3            0.763 0.662       0.563       0.963            0.801

    ##  4            0.763 0.661       0.563       0.963            0.800

    ##  5            0.761 0.654       0.564       0.958            0.778

    ##  6            0.757 0.653       0.549       0.966            0.807

    ##  7            0.757 0.652       0.549       0.965            0.804

    ##  8            0.758 0.649       0.560       0.957            0.774

    ##  9            0.756 0.649       0.550       0.962            0.792

    ## 10            0.758 0.649       0.559       0.957            0.775

    ## # … with 21 more variables: 'Neg Pred Value' <dbl>, AUC <dbl>, 'Lower

    ## #   CI' <dbl>, 'Upper CI' <dbl>, Kappa <dbl>, MCC <dbl>, 'Detection

    ## #   Rate' <dbl>, 'Detection Prevalence' <dbl>, Prevalence <dbl>,

    ## #   Predictions <list>, ROC <list>, 'Confusion Matrix' <list>,

    ## #   Coefficients <list>, Folds <int>, 'Fold Columns' <int>, 'Convergence

    ## #   Warnings' <dbl>, 'Singular Fit Messages' <int>, Family <chr>,

    ## #   Link <chr>, Dependent <chr>, Fixed <chr>

  17. Perform repeated cross-validation on the 10-20 best model formulas (by F1) with cvms:

    # Reconstruct the best 20 models' formulas

    reconstructed_formulas <- reconstruct_formulas(

        initial_cv_results,

        topn = 20)

    # Create fold_cols

    fold_cols <- paste0(".folds_", 1:5)

    # Perform repeated cross-validation

    repeated_cv_results <- cross_validate(

        train_set,

        models = reconstructed_formulas,

        fold_cols = fold_cols,

        family = "binomial",

        parallel = TRUE)

    # Order by F1

    repeated_cv_results <- repeated_cv_results[

        order(repeated_cv_results$F1, decreasing = TRUE),]

    # Inspect the best models' results

    head(repeated_cv_results)

    The output is as follows:

    ## # A tibble: 6 x 27

    ##   'Balanced Accur…    F1 Sensitivity Specificity 'Pos Pred Value'

    ##              <dbl> <dbl>       <dbl>       <dbl>            <dbl>

    ## 1            0.764 0.662       0.566       0.962            0.796

    ## 2            0.763 0.661       0.564       0.963            0.798

    ## 3            0.763 0.660       0.562       0.963            0.800

    ## 4            0.762 0.659       0.561       0.963            0.800

    ## 5            0.761 0.654       0.563       0.958            0.780

    ## 6            0.758 0.654       0.551       0.965            0.805

    ## # … with 22 more variables: 'Neg Pred Value' <dbl>, AUC <dbl>, 'Lower

    ## #   CI' <dbl>, 'Upper CI' <dbl>, Kappa <dbl>, MCC <dbl>, 'Detection

    ## #   Rate' <dbl>, 'Detection Prevalence' <dbl>, Prevalence <dbl>,

    ## #   Predictions <list>, ROC <list>, 'Confusion Matrix' <list>,

    ## #   Coefficients <list>, Results <list>, Folds <int>, 'Fold

    ## #   Columns' <int>, 'Convergence Warnings' <dbl>, 'Singular Fit

    ## #   Messages' <int>, Family <chr>, Link <chr>, Dependent <chr>,

    ## #   Fixed <chr>

  18. Find the Pareto front based on the F1 and balanced accuracy scores. Use psel() from the rPref package and specify pref = high("F1") * high("`Balanced Accuracy`"). Note the ticks around `Balanced Accuracy`:

    # Find the Pareto front

    front <- psel(repeated_cv_results,

                  pref = high("F1") * high("`Balanced Accuracy`"))

    # Remove rows with NA in F1 or Balanced Accuracy

    front <- front[complete.cases(front[1:2]), ]

    # Inspect front

    front

    The output is as follows:

    ## # A tibble: 1 x 27

    ##   'Balanced Accur…    F1 Sensitivity Specificity 'Pos Pred Value'

    ##              <dbl> <dbl>       <dbl>       <dbl>            <dbl>

    ## 1            0.764 0.662       0.566       0.962            0.796

    ## # … with 22 more variables: 'Neg Pred Value' <dbl>, AUC <dbl>, 'Lower

    ## #   CI' <dbl>, 'Upper CI' <dbl>, Kappa <dbl>, MCC <dbl>, 'Detection

    ## #   Rate' <dbl>, 'Detection Prevalence' <dbl>, Prevalence <dbl>,

    ## #   Predictions <list>, ROC <list>, 'Confusion Matrix' <list>,

    ## #   Coefficients <list>, Results <list>, Folds <int>, 'Fold

    ## #   Columns' <int>, 'Convergence Warnings' <dbl>, 'Singular Fit

    ## #   Messages' <int>, Family <chr>, Link <chr>, Dependent <chr>,

    ## #   Fixed <chr>

    The best model according to F1 is also the best model by balanced accuracy, so the Pareto front only contains one model.

  19. Plot the Pareto front with the ggplot2 code from Exercise 61, Plotting the Pareto Front. Note that you may need to add ticks around 'Balanced Accuracy' when specifying x or y in aes() in the ggplot call:

    # Create ggplot object

    # with F1 on the x-axis and balanced accuracy on the y-axis

    ggplot(repeated_cv_results, aes(x = F1, y = `Balanced Accuracy`)) +

      # Add the models as points

      geom_point(shape = 1, size = 0.5) +

      # Add the nondominated models as larger points

      geom_point(data = front, size = 3) +

      # Add a line to visualize the Pareto front

      geom_step(data = front, direction = "vh") +

      # Add the light theme

      theme_light()

    The output is similar to the following:

    Figure 5.31: Pareto front with the F1 and balanced accuracy scores
  20. Use validate() to train the nondominated models on the training set and evaluate them on the validation set:

    # Reconstruct the formulas for the front models

    reconstructed_formulas <- reconstruct_formulas(front)

    # Validate the models in the Pareto front

    v_results_list <- validate(train_data = train_set,

                               test_data = valid_set,

                               models = reconstructed_formulas,

                               family = "binomial")

    # Assign the results and model(s) to variable names

    v_results <- v_results_list$Results

    v_model <- v_results_list$Models[[1]]

    v_results

    The output is as follows:

    ## # A tibble: 1 x 24

    ##   'Balanced Accur…    F1 Sensitivity Specificity 'Pos Pred Value'

    ##              <dbl> <dbl>       <dbl>       <dbl>            <dbl>

    ## 1            0.758 0.652       0.554       0.962            0.794

    ## # … with 19 more variables: 'Neg Pred Value' <dbl>, AUC <dbl>, 'Lower

    ## #   CI' <dbl>, 'Upper CI' <dbl>, Kappa <dbl>, MCC <dbl>, 'Detection

    ## #   Rate' <dbl>, 'Detection Prevalence' <dbl>, Prevalence <dbl>,

    ## #   Predictions <list>, ROC <list>, 'Confusion Matrix' <list>,

    ## #   Coefficients <list>, 'Convergence Warnings' <dbl>, 'Singular Fit

    ## #   Messages' <dbl>, Family <chr>, Link <chr>, Dependent <chr>,

    ## #   Fixed <chr>

    These results are a lot better than the baseline on both F1 and balanced accuracy.
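
    To make the comparison explicit, we can put the validation metrics next to the baseline means from step 8; the column names below are taken from the outputs printed earlier:

    # Mean baseline metrics versus the validated model
    summarized <- room_type_baselines$summarized_metrics
    summarized[summarized$Measure == "Mean", c("F1", "Balanced Accuracy")]
    v_results[, c("F1", "Balanced Accuracy")]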

  21. View the summaries of the nondominated model(s):

    summary(v_model)

    The summary of the model is as follows:

    ##

    ## Call:

    ## glm(formula = model_formula, family = binomial(link = link),

    ##     data = train_set)

    ##

    ## Deviance Residuals:

    ##     Min       1Q   Median       3Q      Max

    ## -3.8323  -0.4836  -0.2724  -0.0919   3.9091

    ##

    ## Coefficients:

    ##                           Estimate Std. Error z value  Pr(>|z|)

    ## (Intercept)               15.3685268  0.4511978  34.062  < 2e-16 ***

    ## availability_365          -0.0140209  0.0030623  -4.579 4.68e-06 ***

    ## log_price                 -3.4441520  0.0956189 -36.020  < 2e-16 ***

    ## log_minimum_nights        -0.7163252  0.0535452 -13.378  < 2e-16 ***

    ## log_number_of_reviews      -0.0823821  0.0282115  -2.920   0.0035 **

    ## log_reviews_per_month      0.0733808  0.0381629   1.923   0.0545 .

    ## availability_365:log_price 0.0042772  0.0006207   6.891 5.53e-12 ***

    ## log_n_o_reviews:log_r_p_month 0.3730603 0.0158122 23.593 < 2e-16 ***

    ## ---

    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    ##

    ## (Dispersion parameter for binomial family taken to be 1)

    ##

    ##     Null deviance: 14149.2  on 13853  degrees of freedom

    ## Residual deviance:  8476.7  on 13846  degrees of freedom

    ## AIC: 8492.7

    ##

    ## Number of Fisher Scoring iterations: 6

    Note that we have shortened log_number_of_reviews and log_reviews_per_month in the interaction term log_n_o_reviews:log_r_p_month so that it fits the output width. The coefficients for the two interaction terms are both statistically significant. The interaction term log_n_o_reviews:log_r_p_month tells us that when log_number_of_reviews increases by one unit, the coefficient for log_reviews_per_month increases by 0.37, and vice versa. We might question the meaningfulness of including both of these predictors in the model, as they carry some of the same information. If we also had the number of months a listing had been listed, we could recreate log_number_of_reviews from log_reviews_per_month and that duration, which would probably be easier to interpret as well.

    The second interaction term, availability_365:log_price, tells us that when availability_365 increases by a single unit, the coefficient for log_price increases by 0.004, and vice versa. The coefficient estimate for log_price is -3.44, meaning that when availability_365 is low, a higher log_price decreases the probability that the listing is a private room. This fits with the intuition that a private room is usually cheaper than an entire home/apartment.

    The coefficient for log_minimum_nights tells us that when there is a higher minimum requirement for the number of nights when we book the listing, there's a lower probability that the listing is a private room.
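
    To see what these coefficients mean on the probability scale, we can ask predict() for type = "response". The listing values below are made up for illustration:

    # A hypothetical listing
    example_listing <- data.frame(
        availability_365 = 100,
        log_price = log(60),
        log_minimum_nights = log(2),
        log_number_of_reviews = log(25),
        log_reviews_per_month = log(1.5))

    # Predicted probability that the listing is a private room
    predict(v_model, newdata = example_listing, type = "response")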

Chapter 6: Unsupervised Learning

Activity 20: Perform DIANA, AGNES, and k-means on the Built-In Motor Car Dataset

Solution:

  1. Attach the cluster and factoextra packages:

    library(cluster)

    library(factoextra)

  2. Load the dataset:

    df <- read.csv("mtcars.csv")

  3. Set the row names to the values of the X column (the car model names). Remove the X column afterward:

    rownames(df) <- df$X

    df$X <- NULL

    Note

    The row names (car models) become a column, X, when the data frame is saved as a CSV file. So, we need to set them back, as the row names are used as labels in the cluster plots.

  4. Remove those rows with missing data and standardize the dataset:

    df <- na.omit(df)

    df <- scale(df)

  5. Implement divisive hierarchical clustering using DIANA. For easy comparison, document the dendrogram output. Feel free to experiment with different distance metrics:

    dv <- diana(df, metric = "manhattan", stand = TRUE)

    plot(dv)

    The output is as follows:

    Figure 6.41: Banner from diana()

    The next plot is as follows:

    Figure 6.42: Dendrogram from diana()
  6. Implement bottom-up hierarchical clustering using AGNES. Take note of the dendrogram created for comparison purposes later on:

    agn <- agnes(df)

    pltree(agn)

    The output is as follows:

    Figure 6.43: Dendrogram from agnes()
  7. Implement k-means clustering. Use the elbow method to determine the optimal number of clusters:

    fviz_nbclust(df, kmeans, method = "wss") +

        geom_vline(xintercept = 4, linetype = 2) +

        labs(subtitle = "Elbow method")

    The output is as follows:

    Figure 6.44: Optimal clusters using the elbow method
  8. Perform k-means clustering with four clusters:

    k4 <- kmeans(df, centers = 4, nstart = 20)

    fviz_cluster(k4, data = df)

    The output is as follows:

    Figure 6.45: k-means with four clusters
  9. Compare the clusters, starting with the smallest one. The following are your expected results for DIANA, AGNES, and k-means, respectively:
Figure 6.46: Dendrogram from running DIANA, cut at 20

If we consider cutting the DIANA tree at height 20, the Ferrari is clustered together with the Ford and the Maserati (the smallest cluster).

Figure 6.47: Dendrogram from agnes, cut at 4

Meanwhile, cutting the AGNES dendrogram at height 4 results in the Ferrari being clustered with the Mazda RX4, the Mazda RX4 Wag, and the Porsche. Finally, as shown in the following plot, k-means clusters the Ferrari with the Mazdas, the Ford, and the Maserati.

Figure 6.48: k-means clustering

Clearly, the choice of clustering technique and algorithms results in different clusters being created. It is important to apply some domain knowledge to determine the most valuable end results.
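
One way to compare the memberships directly is to cut the two dendrograms at the heights discussed above and cross-tabulate the resulting clusters against the k-means assignment; a minimal sketch:

# Cut the hierarchical trees at the heights used above
diana_clusters <- cutree(as.hclust(dv), h = 20)
agnes_clusters <- cutree(as.hclust(agn), h = 4)

# Cross-tabulate the hierarchical memberships against the k-means clusters
table(diana_clusters, k4$cluster)
table(agnes_clusters, k4$cluster)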
