Chapter 4 Introduction to neuralnet and Evaluation Methods

Learning Objectives

By the end of this chapter, you will be able to:

Use a neural network to solve a classification problem by using the neuralnet package.
Create balanced partitions from a dataset, while decreasing leakage, with the groupdata2 package.
Evaluate and select between models using cross-validation.
Explain the concept of multiclass classification.

In this chapter, we aim to equip you with a practical understanding of neural networks.

Introduction

While neural networks have been around in some form since the mid-twentieth century, they have recently surged in popularity. Be it self-driving cars or healthcare technologies, neural networks are fundamental to some of the most innovative products being developed.

In this chapter, we will train a neural network to predict whether a loan applicant in the GermanCredit dataset has a good or bad credit rating. To do this, we will partition the dataset into a training set, a development set, and a validation set. The neural network will be trained on the training set, and we will evaluate whether it makes good predictions on the development and validation sets. We will use cross-validation, along with four different evaluation metrics, to select between different neural network architectures.

Classification

The goal of classification is to create a model that can predict classes in never-before-seen data. This means that the model should generalize beyond the training data. As the data we work with is labeled, meaning that we already know the answer to our question (does the applicant have a good or bad credit rating?), a model that can merely repeat those answers is rarely interesting. Instead, we need the model to classify the unlabeled data we gather in the future. We will discuss this further when covering under- and overfitting.

The datasets being used are the following:

Figure 4.1: Datasets

Binary Classification

In binary classification, our question is of the either/or type. There are only two options, yes or no; not maybe, and not both yes and no. In multiclass classification, we can have more than two options, though we can only choose one of them. In multilabel classification, it is possible to predict both yes and no at the same time, hence an observation can have multiple labels. We will discuss multiclass classification later, but multilabel classification is outside the scope of this chapter.

Exercise 40: Preparing the Dataset

In this exercise, we will load and assess the GermanCredit dataset. If the data has inappropriately biasing features, we will remove them:

Attach the required packages:
# Attaching the packages
library(caret)
library(groupdata2)
Load the GermanCredit dataset:
# Load the GermanCredit dataset
data(GermanCredit)
We can inspect the structure of the dataset. We will only check a subset, as the rest of the columns are all numerics (num), valued either 0 or 1. Feel free to check them yourself:
str(GermanCredit[,1:10])
The output is as follows:
## 'data.frame':    1000 obs. of  10 variables:
##  $ Duration                 : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                   : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage: int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration        : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                      : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits    : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance  : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ ForeignWorker            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Class                    : Factor w/ 2 levels "Bad","Good": 2 1 2 2 1 2 2 2 2 1 ...
To avoid unfair discrimination, we will not use the Age feature. It is not always straightforward to explain why a neural network makes a certain prediction; hence it can be useful to discuss the inclusion/exclusion of features. In many contexts, we would like to avoid biasing the model against certain groups of people (for instance, young people):
# Remove the Age column
GermanCredit$Age <- NULL

We have assessed the GermanCredit dataset and removed the Age feature to avoid biasing our models against a specific age group.

Note

To get a sense of a feature's importance, we can train models with and without it, and see whether their predictions differ. Be aware that interpretation can be difficult, as the different features can interact in nontransparent ways. If we include a feature, we can alter the value of that feature for an applicant and see how it changes the prediction. If we have a gender variable, for instance, we can change it from male to female to see whether the applicant would have gotten the loan if they were of a different gender.
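The idea of altering a feature and comparing predictions can be sketched on simulated data. The following is a minimal illustration, not part of the GermanCredit workflow: the toy data frame, the is_group_a and income columns, and the logistic regression (used here as a stand-in for any classifier) are all assumptions made for the example.

```r
# Toy data: a hypothetical binary feature and a numeric feature
set.seed(1)
n <- 200
toy <- data.frame(
  is_group_a = rbinom(n, 1, 0.5),
  income = rnorm(n)
)
# Simulated target that depends on both features
toy$approved <- rbinom(n, 1, plogis(0.5 * toy$income + 1.0 * toy$is_group_a))

# Fit a simple logistic regression as a stand-in for any classifier
model <- glm(approved ~ is_group_a + income, data = toy, family = binomial)

# Copy the data and flip the binary feature for every applicant
toy_flipped <- toy
toy_flipped$is_group_a <- 1 - toy$is_group_a

# Compare predicted probabilities before and after the flip
p_original <- predict(model, toy, type = "response")
p_flipped <- predict(model, toy_flipped, type = "response")
change <- p_flipped - p_original
summary(change)
```

Large changes in the predicted probabilities suggest that the model relies on the altered feature. As noted, interactions between features can make such comparisons harder to interpret for a neural network.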

Balanced Partitioning Using the groupdata2 Package

We wish to partition the data into a training set (60%), a development set (20%), and a validation set (20%). We will train our models on the training set and select the model that performs best on the development set. Once we have selected the best model, we will evaluate it on the validation set. We will treat the development and validation sets as if they were collected after we trained our model on the training set. This means that any preprocessing parameters (such as for scaling and centering) will be found for the training set and applied to all three partitions.

For the partitioning, we will use the groupdata2 package, which contains multiple functions for balanced grouping and splitting of data. A dataset can be balanced or imbalanced in multiple ways. In our case, we will balance our partitions by having the same ratio of Good and Bad loan applicants in the three partitions. This makes it easier to compare the evaluations of the different partitions and ensures that each partition contains both Good and Bad loan applicants. Alternatively, we could ensure that the development and validation sets have the same number of observations per class. We will also discuss how to avoid leakage in repeated measures datasets, although this does not apply to our dataset.

Note

The groupdata2 package was created by one of the authors of this book, Ludvig Renbo Olsen.

Exercise 41: Partitioning the Dataset

In this exercise, we will partition the dataset into a training set with 60% of the observations, a development set with 20% of the observations, and a validation set with the remaining 20% of the observations. For now, we will partition the dataset without balancing it on the target variable. In the following exercise, we will then make balanced partitions and compare the results of the two approaches.

Attach the groupdata2 package:
# Attach groupdata2
library(groupdata2)
Set the seed value:
# Set seed for reproducibility and easier comparison
set.seed(1)
Partition the data:
# Simple partitioning
# Note that we only need to specify the 2 first partition sizes
# as the remaining observations are put into a third partition.
partitions <- partition(GermanCredit, p = c(0.6, 0.2))
The partition() function splits the data into groups. We specify the data we wish to partition and the list of partition sizes (as proportions between 0 and 1 in this case). Note that we only specify the first two partitions, as the remaining 20% will automatically be placed in a third partition.
Assign each partition to a variable name:
# This returns a list with three data frames.
train_set <- partitions[[1]]
dev_set <- partitions[[2]]
valid_set <- partitions[[3]]
Count the number of observations from each class in the training and test sets:
# Count the number of observations from each class
# in the training and test sets.
train_counts <- table(train_set$Class)
dev_counts <- table(dev_set$Class)
valid_counts <- table(valid_set$Class)
For each partition, print the counts of the two classes:
Inspect the class counts from the training set:
train_counts
The output is as follows:
## Bad Good
## 188 412
Inspect the class counts from the development set:
dev_counts
The output is as follows:
## Bad Good
## 60 140
Inspect the class counts from the validation set:
valid_counts
The output is as follows:
## Bad Good
## 52 148
For each partition, print the ratios of the two classes:
Inspect the ratio of the two classes in the training set:
train_counts/max(train_counts)
The output is as follows:
## Bad Good
## 0.4563107 1.0000000
Inspect the ratio of the two classes in the development set:
dev_counts/max(dev_counts)
The output is as follows:
## Bad Good
## 0.4285714 1.0000000
Inspect the ratio of the two classes in the validation set:
valid_counts/max(valid_counts)
The output is as follows:
## Bad Good
## 0.3513514 1.0000000

In this exercise, we used the partition() function to split the dataset into three subsets (partitions). The ratios printed in the final three steps show that we have different ratios of Good and Bad loan applicants in our partitions.

In the next exercise, we will include the cat_col argument, which takes the name of a categorical column and tries to balance it so that there is the same ratio of the classes in all partitions.

Note

What happens when we specify the cat_col argument in partition()? First, the dataset is subset by each class level. In our case, we will have one subset with all the applicants with Good credit ratings and another subset with the applicants with Bad credit ratings. Then, both subsets are partitioned and merged, so that partition one from the Good subset is merged with partition one from the Bad subset, and so on.
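The subset, partition, and merge steps described in the note can be sketched in base R. This is only an illustration of the idea on toy data, not the actual groupdata2 implementation; partition_rows() is a hypothetical helper written for this example.

```r
set.seed(1)
# Toy data with the same class imbalance as GermanCredit (30% Bad, 70% Good)
toy <- data.frame(
  id = 1:100,
  Class = factor(rep(c("Bad", "Good"), times = c(30, 70)))
)

# Hypothetical helper: shuffle the rows and cut them into partitions of sizes p
partition_rows <- function(df, p = c(0.6, 0.2)) {
  n <- nrow(df)
  sizes <- round(n * p)
  sizes <- c(sizes, n - sum(sizes))  # remaining rows form the last partition
  groups <- rep(seq_along(sizes), times = sizes)
  split(df[sample(n), , drop = FALSE], groups)
}

# 1) Subset by class level, 2) partition each subset
per_class <- lapply(split(toy, toy$Class), partition_rows)

# 3) Merge partition k from every class subset
balanced <- lapply(1:3, function(k) {
  do.call(rbind, lapply(per_class, function(parts) parts[[k]]))
})

# Class counts per partition
lapply(balanced, function(part) table(part$Class))
```

Because each class is partitioned separately with the same proportions, every merged partition ends up with the same ratio of Bad to Good observations.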

Exercise 42: Creating Balanced Partitions

In this exercise, we will load the GermanCredit dataset and use partition() to create balanced training (60%), development (20%), and validation (20%) sets. In this case, balanced means that the ratios of the two classes should be similar in all three partitions. Also, remember to remove the Age column:

Attach the packages:
# Attaching the packages
library(caret) # GermanCredit
library(groupdata2) # partition()
Load the GermanCredit dataset:
# Load the German Credit dataset
data("GermanCredit")
Remove the Age column:
# Remove the Age column
GermanCredit$Age <- NULL
Partition the data:
# Partition into train, dev and valid sets.
partitions <- partition(GermanCredit, p = c(0.6, 0.2), cat_col = "Class")
The partition() function splits the data into groups. We specify the data we wish to partition, the list of partition sizes, and the categorical column to balance the partitions by.
Assign each partition to a variable name:
# Assign each partition to a variable name
train_set <- partitions[[1]]
dev_set <- partitions[[2]]
valid_set <- partitions[[3]]
Count the number of observations from each class in the three partitions:
# Count the number of observations from each class
# in the training and test sets.
train_counts <- table(train_set$Class)
dev_counts <- table(dev_set$Class)
valid_counts <- table(valid_set$Class)
Inspect the class counts from the training set:
train_counts
The output is as follows:
## Bad Good
## 180 420
Inspect the class counts from the development set:
dev_counts
The output is as follows:
## Bad Good
## 60 140
Inspect the class counts from the validation set:
valid_counts
The output is as follows:
## Bad Good
## 60 140
For each partition, print the ratios of the two classes. Inspect the ratio of the two classes in the training set:
train_counts/max(train_counts)
The output is as follows:
## Bad Good
## 0.4285714 1.0000000
Inspect the ratio of the two classes in the development set:
dev_counts/max(dev_counts)
The output is as follows:
## Bad Good
## 0.4285714 1.0000000
Inspect the ratio of the two classes in the validation set:
valid_counts/max(valid_counts)
The output is as follows:
## Bad Good
## 0.4285714 1.0000000

Notice how the ratios of the two classes are now the same for all partitions.

In this exercise, you created balanced partitions using the partition() function from groupdata2. You will likely do this very often in your future machine learning practice.

Leakage

As noted previously, we should treat the development and validation sets as if they were independent of the training set. In practice, they might not be so, but when creating the partitions, it is worth discussing whether there is any leakage between the partitions. One such example could be that we had recorded multiple loans per loan applicant. In that case, we would have an ID column with multiple rows per applicant ID. We refer to this as repeated measures data. In this case, the same applicant should not be in both the training set and one of the test sets, as we wish to test how well our model performs on new applicants, not the applicants it encountered during training. The id_col argument in the partition() function ensures that all rows with the same applicant ID are placed in only one of the partitions.

The following code shows how we would specify the id_col argument in partition(), if we had an ApplicantID column in our dataset.

# Avoiding ID leakage
# NOTE: Can't be run as we don't actually have an ApplicantID column
partition(GermanCredit, p = c(0.6, 0.2),
          cat_col = "Class",
          id_col = "ApplicantID")
# With this, each applicant would have all their observations/rows
# in the same partition!

Note

What happens when we specify the id_col argument in partition()? First, we extract a list of the unique IDs. Then, that list is partitioned and the rows are put in the partition of their ID.

What happens when we specify both the cat_col and id_col arguments in partition()? First, the dataset is subset by each class level. For each subset, we extract and partition the list of unique IDs. Finally, the rows are put in the partition of their ID. This ensures that all rows with the same ID are put in the same partition, while also balancing the ratios of each class between the partitions. Note that this approach requires all rows with the same ID to have the same value in the cat_col column.
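The ID-based steps can be sketched in base R as follows. This is a simplified illustration on made-up repeated measures data, not the groupdata2 implementation; the loans data frame and its ApplicantID column are hypothetical.

```r
set.seed(1)
# Toy repeated measures data: 10 hypothetical applicants with 3 loans each
loans <- data.frame(
  ApplicantID = rep(1:10, each = 3),
  Amount = round(runif(30, 500, 5000))
)

# 1) Extract the unique IDs and shuffle them
ids <- sample(unique(loans$ApplicantID))

# 2) Partition the IDs (60% / 20% / 20%), not the rows
id_partition <- rep(1:3, times = c(6, 2, 2))
names(id_partition) <- ids

# 3) Put every row in the partition of its ID
loans$partition <- unname(id_partition[as.character(loans$ApplicantID)])

# Each applicant should appear in exactly one partition
table(loans$ApplicantID, loans$partition)
```

In the resulting table, every applicant has all three rows in a single column, so no applicant is shared between the training set and the test sets.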

Exercise 43: Ensuring an Equal Number of Observations Per Class

In this exercise, we will partition the dataset such that the development and validation sets have an equal number of observations per class. This makes our evaluation metrics easier to interpret. We can do this by first specifying the number of observations per class for the two test partitions and then using the rest of the observations as the training set. Note that if you specify the id_col argument, this will put that number of individuals (per class) into the partitions. Also note that the training set is now the third element in the list of partitions.

Attach the necessary packages:
# Attach groupdata2
library(groupdata2)
Set the seed value for reproducibility and easier comparison:
# Set seed for reproducibility and easier comparison
set.seed(1)
Partition the data to have the same number (50) of observations per class in the test partitions:
# Partition with an equal number of observations (50) per class in
# the development and validation sets
# The remaining observations will end up in a third partition,
# which will be our training set
partitions_equal <- partition(GermanCredit, p = c(50, 50), cat_col = "Class")
Assign each partition to a variable name:
dev_set_equal <- partitions_equal[[1]]
valid_set_equal <- partitions_equal[[2]]
train_set_equal <- partitions_equal[[3]]
Count the number of observations from each class in the three partitions:
# Count the number of observations from each class
# in the training and test sets.
train_counts_equal <- table(train_set_equal$Class)
dev_counts_equal <- table(dev_set_equal$Class)
valid_counts_equal <- table(valid_set_equal$Class)
Inspect the number of observations per class in each of the partitions:
Inspect the class counts from the training set:
train_counts_equal
The output is as follows:
## Bad Good
## 200 600
Inspect the class counts from the development set:
dev_counts_equal
The output is as follows:
## Bad Good
## 50 50
Inspect the class counts from the validation set:
valid_counts_equal
The output is as follows:
## Bad Good
## 50 50

Notice how the development and validation sets both have 50 Good and 50 Bad loan applicants.

In this exercise, we have partitioned the dataset such that the development and validation sets have an equal number of observations per class, and everything else is the training set.

Standardizing

We will work with the partitions from Exercise 42, Creating Balanced Partitions. A few of the features could benefit from being standardized, and the rest are already one-hot encoded. We only run the preProcess() function on the training set to find the scaling and centering parameters, which are then applied to all three partitions:

# Find scaling and centering parameters for the first 6 columns
params <- preProcess(train_set[, 1:6], method=c("center", "scale"))

The predict() function uses the parameters in the params object to apply the preprocessing transformations to the specified dataset:

# Transform the training set
train_set[, 1:6] <- predict(params, train_set[, 1:6])

# Transform the development set
dev_set[, 1:6] <- predict(params, dev_set[, 1:6])

# Transform the validation set
valid_set[, 1:6] <- predict(params, valid_set[, 1:6])
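To make it clear what centering and scaling with training set parameters means, the following base R sketch standardizes a single made-up Amount column by hand. The toy values are assumptions for illustration; the point is that the mean and standard deviation come from the training set only and are reused for the other partitions.

```r
# Toy partitions with one numeric feature (values are made up)
train <- data.frame(Amount = c(1000, 2000, 3000, 4000, 5000))
dev <- data.frame(Amount = c(1500, 6000))

# Find the parameters on the training set only
train_mean <- mean(train$Amount)
train_sd <- sd(train$Amount)

# Apply the same parameters to both partitions
train$Amount_std <- (train$Amount - train_mean) / train_sd
dev$Amount_std <- (dev$Amount - train_mean) / train_sd

train$Amount_std
dev$Amount_std
```

The standardized training column has a mean of 0 and a standard deviation of 1. The development column generally does not, which is expected, as we treat it as data collected after training.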

Neural Networks with neuralnet

We are now ready to train a classifier to predict whether an applicant is creditworthy or not. The neuralnet package makes it easy to specify and train an artificial neural network. The first argument in the neuralnet() function is the model formula. Here, we tell our model to predict whether Class is Good or Bad. The tilde ~ means predicted by, and the variables to the right of it are the predictors. If we wish to use all the variables (excluding the target variable) as predictors, we can use a dot (y~.).

To begin with, we specify a simple formula with only Duration and Amount as predictors:

# Attaching neuralnet
library(neuralnet)

# Set seed for reproducibility and easier comparison
set.seed(1)

# Classifying if class is "Good"
nn1 <- neuralnet(Class == "Good" ~ Duration + Amount,
                 train_set, linear.output = FALSE)

plot(nn1, rep="best", fontsize = 10)

The output is as follows:

Figure 4.2: Neural network architecture with trained weights, error, and number of training steps.

The plot shows the neural network with three layers (input, hidden, output), from left to right. We have two nodes (also called neurons) in the input layer. The values in these nodes are multiplied by the weights (the numbers on the lines) going into the hidden layer with one node. The output layer has one node as well, which is our probability of Class being Good. The last two layers also have bias parameters (the "1" nodes). At the bottom of the plot, we see the training error, along with the number of training steps.
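The computation that the plot depicts can be sketched by hand. The following example runs one applicant through a network with two inputs, one hidden node, and one output node. The weights and biases are hypothetical (they are not taken from the trained nn1 model), and we assume a logistic activation function in both the hidden and the output layer.

```r
# Logistic (sigmoid) activation function
logistic <- function(z) 1 / (1 + exp(-z))

# Hypothetical weights and biases (not taken from the trained nn1 model)
w_hidden <- c(Duration = -0.8, Amount = 0.3)  # input -> hidden node
b_hidden <- 0.5                                # bias into the hidden node
w_output <- 1.2                                # hidden -> output node
b_output <- 0.1                                # bias into the output node

# One standardized applicant
x <- c(Duration = 1.0, Amount = -0.5)

# Forward pass: weighted sum plus bias, then activation, layer by layer
hidden <- logistic(sum(w_hidden * x) + b_hidden)
p_good <- logistic(w_output * hidden + b_output)
p_good
```

Training the network amounts to adjusting these weights and biases so that the output is close to 1 for Good applicants and close to 0 for Bad applicants in the training set.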

We can also get the error from the result matrix in the model object:

train_error <- nn1$result.matrix[1]
train_error

The output is as follows:

## [1] 59.630264

Now we will use columns 11 to 20 as predictors (remember to include column 9, as that's the target variable – Class). With the regular plot() function, the plot gets very messy with larger networks, so we will use the plotnet() function from the NeuralNetTools package. Let's see whether our error is reduced:

# Attach packages
library(neuralnet)
library(NeuralNetTools)

# Set seed for reproducibility and easier comparison
set.seed(1)

# Classifying if class is "Good"
# Using columns 11 to 20
# Notice that we choose the predictors by subsetting the data frame
# and use every column as predictor with the dot "~."
nn2 <- neuralnet(Class == "Good" ~ ., train_set[, c(9, 11:20)],
                 linear.output = FALSE)

plotnet(nn2, var_labs = FALSE)

The output is as follows:

Figure 4.3: Neural network architecture using columns 11-20 as predictors

The plot produced by plotnet() does not show the weights as numbers but as line thickness and shade (gray is negative and black is positive). Let's check the error:

train_error <- nn2$result.matrix[1]
train_error

The output is as follows:

## [1] 52.13454

Choosing these predictors lowered our error. In the next activity, we will train a neural network and calculate the training error.

Activity 15: Training a Neural Network

In this activity, we will train the neural network using the German Credit dataset found at https://github.com/TrainingByPackt/Practical-Machine-Learning-with-R/blob/master/Data/GermanCredit.csv.

We will preprocess the GermanCredit dataset and train a neural network to predict creditworthiness. Feel free to reuse the code from Exercise 42, Creating Balanced Partitions, so that you get to train multiple neural networks.

Expected output: we expect the training error to be close to 62.15447. Note that, depending on the version of R and the packages installed, the results might vary.

Here are the steps that will help you complete the activity:

Attach the caret, groupdata2, neuralnet, and NeuralNetTools packages.
Set the random seed to 1.
Load the GermanCredit dataset with read.csv().
Remove the Age column.
Partition the dataset into a training set (60%), a development set (20%), and a validation set (20%). Use the cat_col argument to balance the ratio of the two classes.
Find the preprocessing parameters for scaling and centering from the training set.
Apply standardization to the first six predictors in all three partitions using the preProcess parameters from the previous step.
Train a neural network using the InstallmentRatePercentage, ResidenceDuration, and NumberExistingCredits variables as predictors. The goal is to classify whether or not Class is Good. Feel free to experiment with the other predictors as well and see whether you can obtain a lower error.
Plot the neural network and print the error.

The random initialization of the neural network weights can lead to slightly different results from one training to another. To avoid this, we use the set.seed() function at the beginning of the script, which helps when comparing models. We could also train the same model architecture with five different seeds to get a better sense of its performance.

Note

The solution for this activity can be found on page 342.

Model Selection

Now that we know how to train a classifier, we will review how to evaluate and choose between models. First, we will cover four common metrics: Accuracy, Precision, Recall, and F1, and then we will discuss cross-validation as a good tool for model comparison.

Evaluation Metrics

We use a confusion matrix to split the predictions into True Positives, False Positives, True Negatives, and False Negatives. In our case, True Positives are the applicants that were correctly identified as being "Good," False Positives are the applicants that were incorrectly identified as being "Good," True Negatives are the applicants that were correctly identified as "Bad," while False Negatives are the applicants that were incorrectly identified as "Bad."

Figure 4.4: Confusion matrix

In Figure 4.4, the rows are the true classes, while the columns are the predicted classes. If the true class is Positive and the model predicts Positive, that prediction is a True Positive.
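A confusion matrix with this layout can be built in base R with table(). The labels and predictions below are made up for illustration, with 1 meaning Good (the positive class) and 0 meaning Bad.

```r
# Made-up true labels and predictions (1 = "Good", 0 = "Bad")
true_labels <- c(1, 1, 1, 0, 0, 1, 0, 1, 1, 0)
predictions <- c(1, 1, 0, 0, 1, 1, 0, 1, 0, 1)

# Rows are the true classes, columns are the predicted classes
conf_mat <- table(
  True = factor(true_labels, levels = c(1, 0)),
  Predicted = factor(predictions, levels = c(1, 0))
)
conf_mat

# Extract the four cells, treating "1" (Good) as the positive class
TP <- conf_mat["1", "1"]
FN <- conf_mat["1", "0"]
FP <- conf_mat["0", "1"]
TN <- conf_mat["0", "0"]
```

Be aware that the confusionMatrix() function from caret, which we use later in this chapter, prints the predictions as rows and the reference (true) classes as columns, so always check the axis labels before reading off the cells.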

Accuracy

Accuracy: what percentage of the predictions were correct?

Figure 4.5: Accuracy formula

Precision

Precision: of all the Good predictions, how many were actually Good?

Note that we can choose to use the other class, Bad, as the Positive class. In that case, we would be asking how many of the Bad predictions actually were Bad.

Figure 4.6: Precision formula

Recall

Recall: how many of the applicants with a Good credit rating were correctly identified?

Figure 4.7: Recall formula

F1

F1: the harmonic mean of precision and recall, which is a common way to merge the two scores into one, making it easier to compare models.

Figure 4.8: F1 formula

It is easier to visualize the difference between the F1 score and a simple average of the two metrics than to explain it in words:

Figure 4.9: Contour plots comparing the F1 score to the simple averaging of precision and recall

Figure 4.9 shows that the F1 score rewards models where both the precision and recall scores are decent, whereas the simple average does not take an imbalance between the metrics into account.
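The four metrics can be computed directly from the confusion matrix counts. The helper functions below are a minimal sketch written for this illustration (they are not part of caret), and the counts describe a hypothetical model that predicts Good for 200 applicants, of whom 140 are Good and 60 are Bad.

```r
# Metrics from the four confusion matrix counts
accuracy <- function(TP, FP, TN, FN) (TP + TN) / (TP + FP + TN + FN)
precision <- function(TP, FP) TP / (TP + FP)
recall <- function(TP, FN) TP / (TP + FN)
f1 <- function(p, r) 2 * p * r / (p + r)

# Counts for a model that predicts "Good" for all 200 applicants
TP <- 140; FP <- 60; TN <- 0; FN <- 0

p <- precision(TP, FP)
r <- recall(TP, FN)
metrics <- c(accuracy = accuracy(TP, FP, TN, FN),
             precision = p, recall = r, f1 = f1(p, r))
metrics

# Imbalanced scores: the simple average hides the poor recall, F1 does not
comparison <- c(f1 = f1(0.9, 0.1), average = mean(c(0.9, 0.1)))
comparison
```

With a precision of 0.9 and a recall of 0.1, the simple average is 0.5, while the F1 score is only 0.18. This is the behavior that the contour plots illustrate.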

Exercise 44: Creating a Confusion Matrix

In this exercise, we will find the confusion matrix for a trained neural network. We will use the trained neural network from Activity 15, Training a Neural Network, to predict the development set. From these predictions, we will create a confusion matrix. The confusionMatrix() function from the caret package also calculates our evaluation metrics:

Attach the caret package:
# Attach caret
library(caret)
Create a one-hot encoding of the Class variable:
# Create a one-hot encoding of the Class variable
# ifelse() takes the arguments: "If x, then a, else b"
true_labels <- ifelse(dev_set$Class == "Good", 1, 0)
Predict the class in the development set, using the trained neural network from Activity 15, Training a Neural Network:
# Predict the class in the dev set
# It returns probabilities that the observations are "Good"
predicted_probabilities <- predict(nn, dev_set)
predictions <- ifelse(predicted_probabilities > 0.5, 1, 0)
Create a confusion matrix with the predictions:
# Create confusion matrix
confusion_matrix <- confusionMatrix(as.factor(predictions), as.factor(true_labels),
mode="prec_recall", positive = "1")
Print the confusion matrix and try to interpret it:
confusion_matrix
The output is as follows:
## Confusion Matrix and Statistics
##
##           Reference
## Prediction   0   1
##          0   0   0
##          1  60 140
##
##                Accuracy : 0.7
##                  95% CI : (0.6314, 0.7626)
##     No Information Rate : 0.7
##     P-Value [Acc > NIR] : 0.5348
##
##                   Kappa : 0
##  Mcnemar's Test P-Value : 2.599e-14
##
##               Precision : 0.7000
##                  Recall : 1.0000
##                      F1 : 0.8235
##              Prevalence : 0.7000
##          Detection Rate : 0.7000
##    Detection Prevalence : 1.0000
##       Balanced Accuracy : 0.5000
##
##        'Positive' Class : 1
##

If we look at the metrics, our model has an accuracy of 70%, precision of 70%, recall of 100%, and an F1 score of 82.35%. This seems pretty good, but when we look at the predictions in the confusion matrix, we see that the model simply predicts Good for all observations. With the imbalance in the training set, where we have 420 Good and 180 Bad observations, the model will get a decent error by simply predicting Good. This especially happens if our features do not provide enough information about the classes to get a smaller error. Another possible reason is that the model is simply not big enough (the number of nodes and layers) to extract useful information from the features. In that case, we would say that the model is underfitting the data.

An important point to take away from this example is that we can get seemingly high accuracies, precision scores, and so on, when our dataset is imbalanced, even though our model is useless. It is always a good idea to look at the confusion matrix to check that this isn't the case. In the upcoming exercise, we will create baseline evaluations so that we have something to compare our models against.

Exercise 45: Creating Baseline Evaluations

In this exercise, we will create and evaluate three sets of predictions in order to know what accuracy we should beat to be better than chance, or the discussed "one-class-predictor." The three sets of predictions are 1) all "Good" predictions, 2) all "Bad" predictions, and 3) random predictions, preferably repeated 100 times or more:

Attach caret:
# Attach caret
library(caret)
Create a one-hot encoding of the target variable:
# Create one-hot encoding of Class variable
true_labels <- ifelse(dev_set$Class == "Good", 1, 0)
Specify the number of predictions to make. This is the same as the number of true labels:
# The number of predictions to make
num_total_predictions <- 200 # Alternatively: length(true_labels)
Create the set of "All Good" predictions by simply repeating "1" 200 times:
# All "Good"
good_predictions <- rep(1, num_total_predictions)
Create a confusion matrix and inspect the results:
# Create confusion Matrix
confusion_matrix_good <- confusionMatrix(
    as.factor(good_predictions), as.factor(true_labels),
    mode="prec_recall", positive = "1")
confusion_matrix_good
The output is as follows:
## Confusion Matrix and Statistics
##
##           Reference
## Prediction   0   1
##          0   0   0
##          1  60 140
##
##                Accuracy : 0.7
##                  95% CI : (0.6314, 0.7626)
##     No Information Rate : 0.7
##     P-Value [Acc > NIR] : 0.5348
##
##                   Kappa : 0
##  Mcnemar's Test P-Value : 2.599e-14
##
##               Precision : 0.7000
##                  Recall : 1.0000
##                      F1 : 0.8235
##              Prevalence : 0.7000
##          Detection Rate : 0.7000
##    Detection Prevalence : 1.0000
##       Balanced Accuracy : 0.5000
##
##        'Positive' Class : 1
##
Create the set of "All Bad" predictions by simply repeating "0" 200 times:
# All "Bad"
bad_predictions <- rep(0, num_total_predictions)
Create a confusion matrix and inspect the results:
# Create confusion Matrix
confusion_matrix_bad <- confusionMatrix(
    as.factor(bad_predictions), as.factor(true_labels),
    mode="prec_recall", positive = "1")
confusion_matrix_bad
The output is as follows:
## Confusion Matrix and Statistics
##
##           Reference
## Prediction   0   1
##          0  60 140
##          1   0   0
##
##                Accuracy : 0.3
##                  95% CI : (0.2374, 0.3686)
##     No Information Rate : 0.7
##     P-Value [Acc > NIR] : 1
##
##                   Kappa : 0
##  Mcnemar's Test P-Value : <2e-16
##
##               Precision :  NA
##                  Recall : 0.0
##                      F1 :  NA
##              Prevalence : 0.7
##          Detection Rate : 0.0
##    Detection Prevalence : 0.0
##       Balanced Accuracy : 0.5
##
##        'Positive' Class : 1
Note that the precision and F1 scores could not be calculated, as the precision calculation becomes 0 / (0 + 0), which is undefined, and the F1 metric uses precision in its calculation. By default, the confusionMatrix() function would set Bad as the positive class, but we are using positive = "1". If you wanted to know how precise the model is at finding Bad applicants, you could set positive = "0".
For the random predictions, we could either sample and evaluate within a loop and average the evaluations or, as we will do in this example, draw all 100 x 200 predictions at once and repeat the true_labels vector 100 times. Which method to use depends on the chosen set of evaluation metrics.
Set the random seed for reproducibility:
# Set seed for reproducibility
set.seed(1)
Set the number of evaluations to perform. We will perform 100 evaluations with random predictions:
# The number of evaluations to do
num_evaluations <- 100
Repeat the true labels vector 100 times:
# Repeat the true labels vector 100 times
true_labels_100 <- rep(true_labels, num_evaluations)
Draw 100*200 random predictions and ensure that they are either 0 or 1:
# Random predictions
# sample.int() draws either 1 or 2, so we subtract 1 to get 0 or 1
random_predictions <- sample.int(
    2, size = num_total_predictions * num_evaluations,
    replace = TRUE) - 1
head(random_predictions)
The output is as follows:
## [1] 0 0 1 1 0 1
Find the average number of times the prediction is Good. It should be about half of the 200 predictions:
# Average number of times predicting "Good"
sum(random_predictions) / num_evaluations
The output is as follows:
## [1] 99.59
Create a confusion matrix and inspect the results:
# Create confusion matrix
confusion_matrix_random <- confusionMatrix(
    as.factor(random_predictions),
    as.factor(true_labels_100),
    mode = "prec_recall",
    positive = "1")
confusion_matrix_random
The random confusion matrix is as follows:
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    0    1
##          0 3020 7021
##          1 2980 6979
##
##                Accuracy : 0.5
##                  95% CI : (0.493, 0.5069)
##     No Information Rate : 0.7
##     P-Value [Acc > NIR] : 1
##
##                   Kappa : 0.0015
##  Mcnemar's Test P-Value : <2e-16
##
##               Precision : 0.7008
##                  Recall : 0.4985
##                      F1 : 0.5826
##              Prevalence : 0.7000
##          Detection Rate : 0.3489
##    Detection Prevalence : 0.4980
##       Balanced Accuracy : 0.5009
##
##        'Positive' Class : 1
##

When randomly predicting the class of the applicants, we get around 50% accuracy, 50% recall, 70% precision, and an F1 score of 58%. Our model must perform better than this before its predictions are more useful than simple guessing.
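As a check on the reported metrics, we can recompute them by hand from the four counts in the random confusion matrix. This is a minimal sketch in base R; the variable names are our own, and the formulas are the standard definitions of accuracy, precision, recall, and F1.

```r
# Counts read from the random confusion matrix (positive class is 1)
true_negatives  <- 3020  # predicted 0, reference 0
false_negatives <- 7021  # predicted 0, reference 1
false_positives <- 2980  # predicted 1, reference 0
true_positives  <- 6979  # predicted 1, reference 1

total <- true_negatives + false_negatives + false_positives + true_positives

# Standard metric definitions
accuracy  <- (true_positives + true_negatives) / total
precision <- true_positives / (true_positives + false_positives)
recall    <- true_positives / (true_positives + false_negatives)
f1        <- 2 * precision * recall / (precision + recall)

c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1)
```

The values should agree with the Accuracy, Precision, Recall, and F1 rows printed by confusionMatrix(), up to rounding.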

Over and Underfitting

Underfitting happens when a model is too simple to use the information in the training data. Conversely, overfitting happens when a model is too complex and learns patterns that are specific to the training data, such as noise, which do not generalize to new data.

Finding the right number of nodes and layers in a neural network can be a long, tedious process with a lot of trial and error. If we increase the size of the network, we can handle more advanced tasks but risk overfitting, whereas if we make it smaller, we risk making the model too simple to solve our task (underfitting). We need to find the middle ground, usually by increasing the network size until overfitting and then dialing it back some. We must also consider that larger networks can take longer to train and take up more disk space. If, for instance, we intend to run a model on a smartphone, we might sacrifice a bit of accuracy in order to have the model run faster and save some megabytes:

Figure 4.10: Three examples of curve fitting

Figure 4.10 shows three examples of curve fitting. The black line with points shows the training data, which is a sine wave with random noise. The dashed grey line shows the pure sine wave, which is the formula the model should learn. The solid magenta line shows the predicted values for each x. We wish to avoid underfitting and overfitting as their predictions do not generalize to new data.
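The idea behind the curve-fitting figure can be sketched in base R. The example below is not the code used to create the figure: it is an illustration that assumes polynomial regression with lm() as the model, with the polynomial degree standing in for model complexity. The degrees, noise level, and number of points are arbitrary choices.

```r
set.seed(1)

# Training data: a sine wave with random noise
x <- seq(0, 2 * pi, length.out = 30)
train_data <- data.frame(x = x, y = sin(x) + rnorm(length(x), sd = 0.3))

# New data from the pure sine wave, used to check generalization
new_data <- data.frame(x = seq(0, 2 * pi, length.out = 200))
new_data$y <- sin(new_data$x)

# Root mean squared error
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))

# Fit polynomials of increasing complexity
degrees <- c(1, 3, 15)
results <- data.frame(degree = degrees, train_rmse = NA_real_, new_rmse = NA_real_)
for (i in seq_along(degrees)) {
  degree <- degrees[i]
  fit <- lm(y ~ poly(x, degree), data = train_data)
  results$train_rmse[i] <- rmse(train_data$y, fitted(fit))
  results$new_rmse[i] <- rmse(new_data$y, predict(fit, newdata = new_data))
}
results
```

The training error can only go down as the degree increases, because each larger polynomial contains the smaller ones. The error on the new data is the one to watch: a model that is too simple typically does poorly on both, while a model that is too complex typically does well on the training data only.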

Adding Layers and Nodes in neuralnet

So far, our neural networks have had only one node in one hidden layer (that is, excluding the input and output layers). Adding nodes and layers is very simple and can decrease the error a lot. We simply pass a vector with the number of nodes per layer in the hidden argument. Let's create a model with two layers with two nodes each. We can have thousands of nodes in tens or even hundreds of layers, but that would take too long to train for this example (and would most likely overfit):

# Attach neuralnet

library(neuralnet)

# Set seed for reproducibility and easy comparison

set.seed(1)

Train the neural network:

# Classifying if class is "Good"

nn3 <- neuralnet(Class == "Good" ~ ., train_set[, c(9, 11:20)],
                 linear.output = FALSE, hidden = c(2, 2))

Plot the trained model:

plotnet(nn3, var_labs = FALSE)

The output is as follows:

Figure 4.11: Neural network architecture with two hidden layers with two nodes each.

Calculate the training error:

# Print the training error

train_error <- nn3$result.matrix[1]

train_error

The output is as follows:

## [1] 48.42343

We get a lower error, but we have no real indication of whether it is underfitting or overfitting.
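To get a feel for how quickly a network grows when we add nodes and layers, we can count its weights. The helper below, count_weights(), is hypothetical and not part of neuralnet. It assumes a fully connected feed-forward network with one bias per receiving node, and 10 predictors (columns 11 to 20) with a single output node.

```r
# Hypothetical helper: count the weights (including biases) in a
# fully connected feed-forward network
count_weights <- function(num_inputs, hidden, num_outputs = 1) {
  layer_sizes <- c(num_inputs, hidden, num_outputs)
  # Each layer sends its node values plus one bias to the next layer
  senders <- head(layer_sizes, -1) + 1
  receivers <- tail(layer_sizes, -1)
  sum(senders * receivers)
}

# One hidden layer with one node
weights_small <- count_weights(10, hidden = 1)
weights_small

# Two hidden layers with two nodes each, as in hidden = c(2, 2)
weights_larger <- count_weights(10, hidden = c(2, 2))
weights_larger
```

More weights give the model more capacity to fit the training data, which is why a lower training error alone does not tell us whether the larger network generalizes better.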

Cross-Validation

Cross-validation (CV) is a common technique used when deciding between model architectures. We split the training set into a number (k) of groups called folds. We train a model k times, using a different fold as the test set each time, while the other folds constitute the CV training set.

Note

The CV training set is not to be confused with the overall training set that we used to create the folds.

Say we use 4 folds (k = 4). We train the first instance of the model on folds 2, 3, and 4 and evaluate on fold 1. Then, we train a model instance on folds 1, 3, and 4 and evaluate on fold 2, and so on. Once all the folds have been used as test folds, we average the results and compare them to the cross-validation results of the other model architectures. Finally, we train an instance of the best-performing model architecture on the entire training set and test it on the validation set.
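The rotation of the folds can be sketched in a few lines of base R. No model is trained here; the code only lists which folds would be used for training and testing in each iteration. The schedule object is our own construct for illustration.

```r
# Number of folds
k <- 4
fold_ids <- seq_len(k)

# For each iteration, one fold is the test fold and the rest are training folds
schedule <- lapply(fold_ids, function(test_fold) {
  list(test = test_fold, train = setdiff(fold_ids, test_fold))
})

# Print the usage of each fold per iteration
for (iteration in schedule) {
  cat("Test on fold", iteration$test,
      "| train on folds", paste(iteration$train, collapse = ", "), "\n")
}
```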

An alternative approach to averaging the results from each iteration is to collect the predictions from all iterations and evaluate once.
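The two approaches can give different numbers when the folds differ in size. The following sketch uses made-up correctness indicators (1 for a correct prediction, 0 for an incorrect one) for three folds of unequal size; the values are purely illustrative.

```r
# Hypothetical results per fold: 1 = correct prediction, 0 = incorrect
correct_per_fold <- list(
  fold_1 = c(1, 1, 1, 0),
  fold_2 = c(1, 0),
  fold_3 = c(1, 1, 1, 1, 1, 0, 0, 1)
)

# Approach 1: evaluate each fold, then average the fold accuracies
fold_accuracies <- sapply(correct_per_fold, mean)
averaged_accuracy <- mean(fold_accuracies)

# Approach 2: collect all predictions and evaluate once
pooled_accuracy <- mean(unlist(correct_per_fold))

c(averaged = averaged_accuracy, pooled = pooled_accuracy)
```

Averaging gives every fold the same weight, whereas pooling gives every observation the same weight. With folds of equal size, the two accuracies are identical.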

When working with a small development set, we risk it not being representative of our data distribution. In cross-validation, we evaluate our model architecture on all observations in the training set. We learn how well the model architecture can be fitted to multiple subsets of the training set instead of just a single subset. We can also detect inconsistencies in our data if, for instance, we get a very low accuracy when evaluating one of the folds compared to the others.

Figure 4.12: The percentages of the dataset assigned to the partitions and folds

Figure 4.12 shows the percentage split of the dataset into, first, the training and validation sets, and second, the 4 folds. Figure 4.13 shows how these folds are used in the cross-validation training loop:

Figure 4.13: The usage of each fold per cross-validation iteration

Similar to when we created the partitions, we would like to have balanced ratios of the classes and avoid leakage between the folds. For this purpose, groupdata2 has a fold() function, working similarly to partition(). Instead of returning a list of folds though, it simply creates a column with the fold identifiers in our data frame, called .folds.
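To illustrate what balancing on a categorical column means, here is a base R sketch that assigns fold identifiers separately within each class. This is not how groupdata2 implements fold(); it is a simplified stand-in on a made-up data frame, intended only to show the resulting class ratios.

```r
set.seed(1)

# Made-up data: 28 "Good" and 12 "Bad" applicants
toy <- data.frame(Class = factor(rep(c("Good", "Bad"), times = c(28, 12))))

k <- 4
toy$.folds <- NA_integer_

# Assign fold identifiers within each class, so every fold
# receives the same share of each class
for (cls in levels(toy$Class)) {
  rows <- which(toy$Class == cls)
  toy$.folds[rows] <- sample(rep(seq_len(k), length.out = length(rows)))
}
toy$.folds <- factor(toy$.folds)

# Class counts per fold
fold_table <- table(toy$.folds, toy$Class)
fold_table
```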

There are multiple variations of cross-validation. We are using stratified cross-validation, as we are balancing the ratio of the classes between the folds. Another method is called leave-one-out cross-validation, where we would treat each observation as a fold. In situations where that would lead to leakage between the folds, we can use leave-one-group-out cross-validation, where, for instance, each applicant is treated as a fold (containing all their loan applications). A fourth method is repeated cross-validation, where we repeat the fold creation step multiple times instead of just once. We create the folds, run cross-validation, create new folds, run cross-validation, and so on. This allows us to evaluate a lot more combinations of our observations.
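Leave-one-group-out cross-validation can be sketched as follows. The applications data frame, its applicant_id column, and its values are made up for illustration; the point is that all rows from one applicant always end up on the same side of the split.

```r
# Made-up data: some applicants have several loan applications
applications <- data.frame(
  applicant_id = c("A", "A", "B", "C", "C", "C", "D"),
  amount = c(1200, 800, 5000, 300, 450, 700, 2500)
)

group_ids <- unique(applications$applicant_id)
fold_sizes <- integer(0)

for (held_out in group_ids) {
  # All applications from the held-out applicant form the test set
  is_test <- applications$applicant_id == held_out
  cv_test_set <- applications[is_test, ]
  cv_train_set <- applications[!is_test, ]

  # No applicant appears in both sets, so there is no leakage
  stopifnot(!any(cv_train_set$applicant_id %in% cv_test_set$applicant_id))

  fold_sizes[held_out] <- nrow(cv_test_set)
}

# Number of test observations per held-out applicant
fold_sizes
```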

Creating Folds

We will create and preprocess a training and validation set and create four folds from the training set. It is common to use 10 folds, but there are no hard rules:

# Attach groupdata2

library(groupdata2)

# Set seed for reproducibility and easier comparison

set.seed(1)

# Partition into a training set and a validation set

partitions <- partition(GermanCredit, p = 0.8, cat_col = "Class")

train_set <- partitions[[1]]

valid_set <- partitions[[2]]

Find scaling and centering parameters:

# Note: We could also decide to do this inside the training loop!

params <- preProcess(train_set[, 1:6], method=c("center", "scale"))

Transform the training set:

train_set[, 1:6] <- predict(params, train_set[, 1:6])

# Transform the validation set

valid_set[, 1:6] <- predict(params, valid_set[, 1:6])

To create four folds, we will use k = 4:

# Create folds for cross-validation

# Again balanced on the Class variable

train_set <- fold(train_set, k=4, cat_col = "Class")

Note

This creates a factor in the dataset called ".folds". Take care not to use this as a predictor.
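One way to guard against this is to drop the column by name before passing the data to a formula that uses the dot shorthand (~ .). The sketch below uses a made-up data frame with the same column name; the predictor names x1 and x2 are placeholders.

```r
set.seed(1)

# Made-up data frame with a ".folds" column, as created by fold()
toy <- data.frame(
  x1 = rnorm(8),
  x2 = rnorm(8),
  Class = factor(rep(c("Good", "Bad"), 4)),
  .folds = factor(rep(1:4, each = 2))
)

# Drop the fold identifiers before modeling
model_data <- toy[, setdiff(colnames(toy), ".folds")]

# Check which predictors the formula "Class ~ ." expands to
predictors <- attr(terms(Class ~ ., data = model_data), "term.labels")
predictors
```

Selecting the columns by index also works, but it silently breaks if the column order changes, so selecting by name can be the safer option.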

Exercise 46: Writing a Cross-Validation Training Loop

In this exercise, we will perform cross-validation with a simple for loop. For each iteration, we subset the cross-validation training and test sets, train our model on the training set, and evaluate the model on the test set. We then find the average accuracy and error:

Initialize two vectors for collecting errors and accuracies:
# Initialize vectors for collecting errors and accuracies
errors <- c()
accuracies <- c()
Start the training for loop. We have four folds, so we need four iterations:
# Training loop
for (part in 1:4){
Assign the chosen fold as the CV test set and the rest of the folds as the CV training set:
  # Assign the chosen fold as test set
  # and the rest of the folds as train set
  cv_test_set <- train_set[train_set$.folds == part,]
  cv_train_set <- train_set[train_set$.folds != part,]
Train the neural network on predictors 11 - 20:
  # Train neural network classifier
  # Make sure not to include the ".folds" column as a predictor!
  nn <- neuralnet(Class == "Good" ~ .,
                  cv_train_set[, c(9, 11:20)],
                  linear.output = FALSE)
Append the error to the errors vector:
  # Append error to the errors vector
  errors <- append(errors, nn$result.matrix[1])
Create a binary (0/1) encoding of the target variable in the CV test set:
  # Create binary encoding of the Class variable (Good = 1, Bad = 0)
  true_labels <- ifelse(cv_test_set$Class == "Good", 1, 0)
Use the trained neural network to predict the target variable in the CV test set:
  # Predict the class in the test set
  # It returns probabilities that the observations are "Good"
  predicted_probabilities <- predict(nn, cv_test_set)
  predictions <- ifelse(predicted_probabilities > 0.5, 1, 0)
Calculate the accuracy. We could also use confusionMatrix() here if we wanted other metrics:
  # Calculate accuracy manually
  # Note: TRUE == 1, FALSE == 0
  cv_accuracy <- sum(true_labels == predictions) / length(true_labels)
Append the calculated accuracy to the accuracies vector:
  # Append the accuracy to the accuracies vector
  accuracies <- append(accuracies, cv_accuracy)
Close the for loop:
}
Calculate the average error and accuracy and print them. Note that we could also have gathered the predictions from all the folds and calculated the accuracy only once. This could lead to slightly different results if, for instance, the folds were not exactly the same size:
# Calculate average error and accuracy
average_error <- mean(errors)
average_error
## [1] 51.47703
average_accuracy <- mean(accuracies)
average_accuracy
## [1] 0.74375

Once we have found the model architecture that gives the best average accuracy (or some other chosen metric), we can train this model on the entire training set and evaluate it on the validation set.

As we will be rerunning the cross-validation training loop again and again when comparing multiple model architectures, it would make sense to convert this code into a function, taking the arguments for neuralnet() and returning the average accuracy and error. A function in R is defined like this:

# Basic function in R
function_name <- function(arg1, arg2){
  # Do something with the arguments
  result <- arg1 + arg2
  # Return the result
  return(result)
}
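As a sketch of how the cross-validation loop could be wrapped in a function, consider the hypothetical cross_validate() helper below. It is one possible design, not the solution used in the activities: the model fitting and prediction steps are passed in as functions (fit_fn and predict_fn, both names of our own choosing), and, so that the example runs without neuralnet, it is demonstrated with a logistic regression from glm() on simulated data.

```r
# Hypothetical helper: cross-validate a binary classifier
# fit_fn(train_data) returns a model; predict_fn(model, test_data)
# returns the predicted probabilities of the positive class
cross_validate <- function(data, fold_col, target_col, positive,
                           fit_fn, predict_fn) {
  fold_values <- as.character(data[[fold_col]])
  fold_ids <- unique(fold_values)
  accuracies <- numeric(length(fold_ids))

  for (i in seq_along(fold_ids)) {
    is_test <- fold_values == fold_ids[i]
    # Exclude the fold column so it cannot be used as a predictor
    keep_cols <- setdiff(colnames(data), fold_col)
    cv_train_set <- data[!is_test, keep_cols, drop = FALSE]
    cv_test_set <- data[is_test, keep_cols, drop = FALSE]

    model <- fit_fn(cv_train_set)
    predicted_probabilities <- predict_fn(model, cv_test_set)
    predictions <- ifelse(predicted_probabilities > 0.5, 1, 0)
    true_labels <- ifelse(cv_test_set[[target_col]] == positive, 1, 0)
    accuracies[i] <- mean(predictions == true_labels)
  }

  list(accuracies = accuracies, average_accuracy = mean(accuracies))
}

# Simulated data with four folds (for illustration only)
set.seed(1)
n <- 80
toy <- data.frame(x = rnorm(n))
toy$Class <- factor(ifelse(toy$x + rnorm(n, sd = 0.5) > 0, "Good", "Bad"))
toy$.folds <- factor(rep(1:4, length.out = n))

# Stand-in model: logistic regression instead of a neural network
results <- cross_validate(
  toy, fold_col = ".folds", target_col = "Class", positive = "Good",
  fit_fn = function(d) {
    glm(as.numeric(Class == "Good") ~ x, data = d, family = binomial)
  },
  predict_fn = function(m, d) predict(m, newdata = d, type = "response")
)
results$average_accuracy
```

To compare neural network architectures, fit_fn could instead wrap a neuralnet() call like the one in the exercise, with only the hidden argument changing between runs. The training error could be collected and returned in the same way as the accuracies.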

In the upcoming activity, we will be training neural networks.

Activity 16: Training and Comparing Neural Network Architectures

In this activity, we will predict whether a diabetes test is positive or negative based on eight predictors. We will be using the PimaIndiansDiabetes2 dataset from https://github.com/TrainingByPackt/Practical-Machine-Learning-with-R/blob/master/Data/PimaIndiansDiabetes2.csv.

The dataset has missing data, which you will need to handle. A quick solution would be to remove the columns with many missing values and the rows with missing data in the other columns. Let's summarize the dataset and discuss this before diving into the activity:

# Attach packages

library(groupdata2)

library(caret)

library(mlbench)

library(neuralnet)

# Load the data

PimaIndiansDiabetes2 <- read.csv("PimaIndiansDiabetes2.csv")

# Summarize the dataset

summary(PimaIndiansDiabetes2)

##     pregnant         glucose         pressure         triceps
##  Min.   : 0.000   Min.   : 44.0   Min.   : 24.00   Min.   : 7.00
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 64.00   1st Qu.:22.00
##  Median : 3.000   Median :117.0   Median : 72.00   Median :29.00
##  Mean   : 3.845   Mean   :121.7   Mean   : 72.41   Mean   :29.15
##  3rd Qu.: 6.000   3rd Qu.:141.0   3rd Qu.: 80.00   3rd Qu.:36.00
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00
##                   NA's   :5       NA's   :35       NA's   :227
##
##     insulin            mass          pedigree           age
##  Min.   : 14.00   Min.   :18.20   Min.   :0.0780   Min.   :21.00
##  1st Qu.: 76.25   1st Qu.:27.50   1st Qu.:0.2437   1st Qu.:24.00
##  Median :125.00   Median :32.30   Median :0.3725   Median :29.00
##  Mean   :155.55   Mean   :32.46   Mean   :0.4719   Mean   :33.24
##  3rd Qu.:190.00   3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00
##  Max.   :846.00   Max.   :67.10   Max.   :2.4200   Max.   :81.00
##  NA's   :374      NA's   :11
##
##  diabetes
##  neg:500
##  pos:268

Two of the predictors (triceps and insulin) contain a lot of NAs ("Not Available"). If we remove these two columns, we can either try to infer the remaining missing values or simply remove the rows containing NAs. This is up to you.
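The quick approach to the missing data can be sketched on a small made-up data frame. The column names match the dataset, but the values below are invented for illustration and the data frame is not the real PimaIndiansDiabetes2 data.

```r
# Made-up data with missing values (NA)
toy <- data.frame(
  glucose  = c(148, 85, NA, 89),
  triceps  = c(35, NA, NA, NA),
  insulin  = c(NA, NA, NA, 94),
  diabetes = factor(c("pos", "neg", "pos", "neg"))
)

# Count the missing values per column
na_counts <- colSums(is.na(toy))
na_counts

# Remove the two columns with many missing values
toy_reduced <- toy[, !(colnames(toy) %in% c("triceps", "insulin"))]

# Remove the rows that still contain missing values
toy_complete <- na.omit(toy_reduced)
toy_complete
```

Removing the sparse columns first matters: calling na.omit() on the original data frame would discard every row that has a missing triceps or insulin value, leaving far fewer observations.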

The diabetes column is the target variable that we wish to predict.

The main purpose of this activity is to train and compare multiple neural network architectures. Try changing the number of layers and nodes and compare them on accuracy, precision, recall, and F1.

Here are the steps that will help you complete the activity. Note that these do not include cross-validation, which will be performed in the next activity:

Attach the packages.
Set the random seed to 1.
Load the PimaIndiansDiabetes2 dataset.
Summarize the dataset.
Handle the missing data (possible solution: remove the triceps and insulin columns and use na.omit() on the dataset).
Partition the dataset into a training set (60%), development set (20%), and validation set (20%). Use cat_col="diabetes" to balance the ratios of each class between the partitions.
Assign the partitions to variable names.
Find the preProcess parameters for scaling and centering on the training set and apply them to all partitions.
Train multiple neural network architectures. Adjust them by changing the number of nodes and/or layers. In the model formula, use diabetes == "pos".
Evaluate each model on the development set using the confusionMatrix() function.
Evaluate the best model on the validation set.
Plot the best model.
Consider/discuss whether the model is underfitting or overfitting the training set based on the validation set results.

The output will be similar to the following:

Figure 4.14: The best neural network architecture without cross-validation

In this activity, we have trained multiple neural network architectures and evaluated the best model on the validation set.

Note

The solution for this activity can be found on page 344.

Activity 17: Training and Comparing Neural Network Architectures with Cross-Validation

In this activity, we will perform the same operations as in the previous activity, Activity 16, Training and Comparing Neural Network Architectures, but instead of using a development set, we will use cross-validation to select the best model. We will be using the cross-validation code from Exercise 46, Writing a Cross-Validation Training Loop.

We will be using the PimaIndiansDiabetes2 dataset from https://github.com/TrainingByPackt/Practical-Machine-Learning-with-R/blob/master/Data/PimaIndiansDiabetes2.csv.

Here are the steps to complete the activity:

Attach the groupdata2, caret, neuralnet, and mlbench packages.
Set the random seed to 1.
Load the PimaIndiansDiabetes2 dataset.
Handle the missing data (quick solution: remove the triceps and insulin columns and use na.omit() on the dataset).
Partition the dataset into a training set (80%) and a validation set (20%). Use cat_col="diabetes" to balance the ratios of each class between the partitions.
Assign the partitions to variable names.
Find the preProcess parameters for scaling and centering on the training set and apply them to both partitions.
Create four folds in the training set, using the fold() function. Use cat_col="diabetes" to balance the ratios of each class between the folds.
Use the cross-validation training loop code to cross-validate your models. In the model formula, use diabetes == "pos". Also, remember to not use the .folds column as a predictor.
Select the best performing model and train it on the entire training set.
Evaluate the best model on the validation set.
Plot the best model.
Consider/discuss whether the model is underfitting or overfitting the training set based on the validation set results.

The output will be similar to the following:

Figure 4.15: The best neural network with cross-validation

In this activity, we used cross-validation to evaluate the performance of neural networks, and then we evaluated the best model on the validation set. In the provided solution, we only showed one model architecture, so you might have found another model architecture that performs better.

Note

The solution for this activity can be found on page 351.

Multiclass Classification Overview

When we have more than two classes, we have to modify our approach slightly. In the output layer of the neural network, we now have the same number of nodes as the number of classes. The values in these nodes are normalized using the softmax function, such that they all add up to 1. We can interpret these normalized values as probabilities, and the node with the highest probability is our predicted class. The softmax function is given by $\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$, where $z$ is the vector of output nodes.
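The softmax normalization can be sketched in base R as follows. The output node values are made up, and subtracting the maximum before exponentiating is a common trick for numerical stability that does not change the result.

```r
# Softmax: exponentiate each value and divide by the sum of the exponentials
softmax <- function(z) {
  # Subtracting max(z) avoids overflow and leaves the result unchanged
  exp_z <- exp(z - max(z))
  exp_z / sum(exp_z)
}

# Made-up values of three output nodes
output_nodes <- c(2.0, 1.0, 0.1)

# Normalized values that can be interpreted as probabilities
probabilities <- softmax(output_nodes)
probabilities

# The values add up to 1
sum(probabilities)

# The node with the highest probability is the predicted class
predicted_class <- which.max(probabilities)
predicted_class
```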

When evaluating the model, we have to increase the size of our confusion matrix. Figure 4.16 shows a confusion matrix with three classes. The "Yay!" boxes contain the counts of correct predictions, while the "Nope!" boxes contain the counts of incorrect predictions:

Figure 4.16: The confusion matrix with three classes

With this, we can calculate both overall metrics and one-vs-all metrics. In one-vs-all evaluations, we have one class (such as class 2) against all the other classes combined as one negative class (class 1 + class 3). We can then use the metrics we know.

For "class 2-vs-all", we would use the confusion matrix in Figure 4.17:

Figure 4.17: The confusion matrix for the one-vs-all evaluation for Class 2

We can average the metrics from the one-vs-all evaluations. We can also calculate an overall accuracy, which is simply all the predictions in the diagonal "Yay!" boxes divided by the total number of predictions. Both the average and overall accuracy metrics can inform us of the model performance. How we interpret them depends on how balanced the validation set is. If one class has 90% of the observations, the overall accuracy could be 90% by simply predicting that class all the time, while the average accuracy would be much lower.

By looking at the confusion matrix and the one-vs-all metrics, we can gain an understanding of which classes might be difficult to differentiate. This might suggest that we need to collect more data from these classes, or that the classes are very similar on the chosen features.
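As a sketch of the one-vs-all evaluation, the code below collapses a three-class confusion matrix into a two-class matrix for class 2 and calculates a few metrics. The counts are made up for illustration, and the layout follows the convention of predictions in rows and references in columns.

```r
# Made-up confusion matrix with three classes
# Rows are predictions, columns are the true classes (references)
cm <- matrix(
  c(50, 3, 7,    # reference class 1
    5, 40, 5,    # reference class 2
    2, 6, 32),   # reference class 3
  nrow = 3,
  dimnames = list(Prediction = c("1", "2", "3"),
                  Reference = c("1", "2", "3"))
)

# Overall accuracy: the diagonal divided by the total number of predictions
overall_accuracy <- sum(diag(cm)) / sum(cm)

# One-vs-all counts for class 2
target <- "2"
tp <- cm[target, target]
fp <- sum(cm[target, ]) - tp   # predicted 2, but another class
fn <- sum(cm[, target]) - tp   # class 2, but predicted as another class
tn <- sum(cm) - tp - fp - fn   # everything else

# Collapsed two-class confusion matrix
one_vs_all <- matrix(
  c(tn, fp, fn, tp), nrow = 2,
  dimnames = list(Prediction = c("all others", "2"),
                  Reference = c("all others", "2"))
)
one_vs_all

# The metrics we already know can now be applied
precision_class_2 <- tp / (tp + fp)
recall_class_2 <- tp / (tp + fn)
c(overall_accuracy = overall_accuracy,
  precision = precision_class_2,
  recall = recall_class_2)
```

Repeating this for each class in turn gives the set of one-vs-all metrics that can then be averaged.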

Summary

In this chapter, you trained, evaluated, and compared multiple neural network architectures on the GermanCredit and PimaIndiansDiabetes2 classification tasks. To achieve this, you created balanced partitions and folds with the groupdata2 package. You used the neuralnet package to specify and train neural networks and used those trained models to predict the classes in the development and validation sets. Both in theory and by using caret's confusionMatrix function, you learned how to calculate accuracy, precision, recall, and F1 metrics. You implemented a cross-validation training loop and used it to compare multiple model architectures. Finally, we introduced multiclass classification and the softmax function.

If you wish to build more advanced neural networks while keeping the code simple, the keras package would be a good place to start.

In the next chapter, you will learn how to fit and interpret linear and logistic regression models. We will use the cvms package to easily cross-validate multiple model formulas at once, without having to write a training loop ourselves.
