By the end of this chapter, you will be able to:
In this chapter, we aim to equip you with a practical understanding of neural networks.
While neural networks have been around in some form since the mid-twentieth century, they have recently surged in popularity. Be it self-driving cars or healthcare technologies, neural networks are fundamental to some of the most innovative products being developed.
In this chapter, we will train a neural network to predict whether a loan applicant in the GermanCredit dataset has a good or bad credit rating. To do this, we will partition the dataset into a training set, a development set, and a validation set. The neural network will be trained on the training set, and we will evaluate whether it makes good predictions on the development and validation sets. We will use cross-validation, along with four different evaluation metrics, to select between different neural network architectures.
The goal of classification is to create a model that can predict classes in never-before-seen data. This means that the model should generalize beyond the training data. As the data we work with is labeled, meaning that we already know the answer to our question (does the applicant have a good or bad credit rating?), a model that can merely repeat those answers is rarely interesting. Instead, we need the model to classify the unlabeled data we gather in the future. We will discuss this further when covering under- and overfitting.
The datasets being used are the following:
In binary classification, our question is of the either/or type. There are only two options, yes or no; not maybe, and not both yes and no. In multiclass classification, we can have more than two options, though we can only choose one of them. In multilabel classification, it is possible to predict both yes and no at the same time, hence an observation can have multiple labels. We will discuss multiclass classification later, but multilabel classification is outside the scope of this chapter.
In this exercise, we will load and assess the GermanCredit dataset. If the data has inappropriately biasing features, we will remove them:
# Attaching the packages
library(caret)
library(groupdata2)
# Load the GermanCredit dataset
data(GermanCredit)
str(GermanCredit[,1:10])
The output is as follows:
## 'data.frame': 1000 obs. of 10 variables:
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ InstallmentRatePercentage: int 4 2 2 2 3 2 3 2 2 4 ...
## $ ResidenceDuration : int 4 2 3 4 4 4 4 2 4 2 ...
## $ Age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ NumberExistingCredits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ NumberPeopleMaintenance : int 1 1 2 2 2 2 1 1 1 1 ...
## $ Telephone : num 0 1 1 1 1 0 1 0 1 1 ...
## $ ForeignWorker : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Class : Factor w/ 2 levels "Bad","Good": 2 1 2 2 1 2 2 2 2 1 ...
# Remove the Age column
GermanCredit$Age <- NULL
We have assessed the GermanCredit dataset and removed the Age feature to avoid biasing our models against a specific age group.
To get a sense of a feature's importance, we can train models with and without it, and see whether their predictions differ. Be aware that interpretation can be difficult, as the different features can interact in nontransparent ways. If we include a feature, we can alter the value of that feature for an applicant and see how it changes the prediction. If we have a gender variable, for instance, we can change it from male to female to see whether the applicant would have gotten the loan if they were of a different gender.
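The idea of altering one feature for an applicant and comparing the predictions can be sketched as follows. This is a minimal illustration under stated assumptions: the toy data frame, its amount and gender columns, and the logistic regression fitted with glm() are all hypothetical stand-ins for the GermanCredit data and a neural network.

```r
# Synthetic stand-in for a credit dataset (hypothetical columns)
set.seed(1)
n <- 200
toy <- data.frame(
  amount = rnorm(n),
  gender = factor(sample(c("female", "male"), n, replace = TRUE))
)
# Simulate a target that depends on both features
logit <- 0.8 * toy$amount + 0.7 * (toy$gender == "male")
toy$good <- rbinom(n, size = 1, prob = plogis(logit))

# Fit a simple classifier (logistic regression as a stand-in model)
model <- glm(good ~ amount + gender, data = toy, family = binomial)

# Take one applicant and create a copy where only the gender is altered
applicant <- toy[1, ]
counterfactual <- applicant
counterfactual$gender <- factor(
  ifelse(applicant$gender == "male", "female", "male"),
  levels = levels(toy$gender)
)

# Compare the predicted probabilities of a good rating
p_original <- unname(predict(model, applicant, type = "response"))
p_altered <- unname(predict(model, counterfactual, type = "response"))
c(original = p_original, altered = p_altered)
```

If the two probabilities fall on different sides of the decision threshold, the altered feature alone changed the outcome for this applicant.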
We wish to partition the data into a training set (60%), a development set (20%), and a validation set (20%). We will train our models on the training set and select the model that performs best on the development set. Once we have selected the best model, we will evaluate it on the validation set. We will treat the development and validation sets as if they were collected after we trained our model on the training set. This means that any preprocessing parameters (such as for scaling and centering) will be found for the training set and applied to all three partitions.
For the partitioning, we will use the groupdata2 package, which contains multiple functions for balanced grouping and splitting of data. A dataset can be balanced or imbalanced in multiple ways. In our case, we will balance our partitions by having the same ratio of Good and Bad loan applicants in the three partitions. This makes it easier to compare the evaluations of the different partitions and ensures that each partition contains both Good and Bad loan applicants. Alternatively, we could ensure that the development and validation sets have the same number of observations per class. We will also discuss how to avoid leakage in repeated measures datasets, although this does not apply to our dataset.
The groupdata2 package was created by one of the authors of this book, Ludvig Renbo Olsen.
In this exercise, we will partition the dataset into a training set with 60% of the observations, a development set with 20% of the observations, and a validation set with the remaining 20% of the observations. We will do so without balancing the partitions on the target variable. In the following exercise, we will then make balanced partitions and compare the results of the two approaches.
# Attach groupdata2
library(groupdata2)
# Set seed for reproducibility and easier comparison
set.seed(1)
# Simple partitioning
# Note that we only need to specify the first two partition sizes,
# as the remaining observations are put into a third partition.
partitions <- partition(GermanCredit, p = c(0.6, 0.2))
The partition() function splits the data into groups. We specify the data we wish to partition and the list of partition sizes (as proportions between 0 and 1 in this case). Note that we only specify the first two partitions, as the remaining 20% will automatically be placed in a third partition.
# This returns a list with three data frames.
train_set <- partitions[[1]]
dev_set <- partitions[[2]]
valid_set <- partitions[[3]]
# Count the number of observations from each class
# in the training and test sets.
train_counts <- table(train_set$Class)
dev_counts <- table(dev_set$Class)
valid_counts <- table(valid_set$Class)
Inspect the class counts from the training set:
train_counts
The output is as follows:
## Bad Good
## 188 412
dev_counts
The output is as follows:
## Bad Good
## 60 140
valid_counts
The output is as follows:
## Bad Good
## 52 148
Inspect the ratio of the two classes in the training set:
train_counts/max(train_counts)
The output is as follows:
## Bad Good
## 0.4563107 1.0000000
dev_counts/max(dev_counts)
The output is as follows:
## Bad Good
## 0.4285714 1.0000000
valid_counts/max(valid_counts)
The output is as follows:
## Bad Good
## 0.3513514 1.0000000
In this exercise, we used the partition() function to split the dataset into three subsets (partitions). The class ratios computed above show that the partitions have different ratios of Good and Bad loan applicants.
In the next exercise, we will include the cat_col argument, which takes the name of a categorical column and tries to balance it so that there is the same ratio of the classes in all partitions.
What happens when we specify the cat_col argument in partition()? First, the dataset is subset by each class level. In our case, we will have one subset with all the applicants with Good credit ratings and another subset with the applicants with Bad credit ratings. Then, both subsets are partitioned and merged, so that partition one from the Good subset is merged with partition one from the Bad subset, and so on.
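The subset, partition, and merge steps described above can be sketched in base R. This is only an illustration of the idea on a hypothetical toy data frame; the helper assign_partitions() is made up for this sketch, and it is not the actual groupdata2 implementation.

```r
# Toy data: 30 "Bad" and 70 "Good" rows (hypothetical stand-in)
set.seed(1)
toy <- data.frame(
  x = rnorm(100),
  Class = factor(rep(c("Bad", "Good"), times = c(30, 70)))
)

# Assign partition labels (1, 2, 3) to n rows in a 60/20/20 split
assign_partitions <- function(n, p = c(0.6, 0.2)) {
  sizes <- round(n * p)
  sizes <- c(sizes, n - sum(sizes))  # the remainder forms the last partition
  sample(rep(seq_along(sizes), times = sizes))  # shuffle the labels
}

# 1) Subset by class level, 2) partition each subset,
# 3) merge the subsets again by partition label
toy$part <- NA_integer_
for (lvl in levels(toy$Class)) {
  rows <- which(toy$Class == lvl)
  toy$part[rows] <- assign_partitions(length(rows))
}
partitions <- split(toy, toy$part)

# Class counts per partition
counts <- sapply(partitions, function(d) table(d$Class))
counts
```

Because each class is partitioned separately, every partition ends up with the same 30/70 ratio of Bad to Good rows.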
In this exercise, we will load the GermanCredit dataset and use partition() to create balanced training (60%), development (20%), and validation (20%) sets. In this case, balanced means that the ratios of the two classes should be similar in all three partitions. Also, remember to remove the Age column:
# Attaching the packages
library(caret) # GermanCredit
library(groupdata2) # partition()
# Load the German Credit dataset
data("GermanCredit")
# Remove the Age column
GermanCredit$Age <- NULL
# Partition into train, dev and valid sets.
partitions <- partition(GermanCredit, p = c(0.6, 0.2), cat_col = "Class")
The partition() function splits the data into groups. We specify the data we wish to partition, the list of partition sizes, and the categorical column to balance the partitions by.
# Assign each partition to a variable name
train_set <- partitions[[1]]
dev_set <- partitions[[2]]
valid_set <- partitions[[3]]
# Count the number of observations from each class
# in the training and test sets.
train_counts <- table(train_set$Class)
dev_counts <- table(dev_set$Class)
valid_counts <- table(valid_set$Class)
train_counts
The output is as follows:
## Bad Good
## 180 420
dev_counts
The output is as follows:
## Bad Good
## 60 140
valid_counts
The output is as follows:
## Bad Good
## 60 140
train_counts/max(train_counts)
The output is as follows:
## Bad Good
## 0.4285714 1.0000000
dev_counts/max(dev_counts)
The output is as follows:
## Bad Good
## 0.4285714 1.0000000
valid_counts/max(valid_counts)
The output is as follows:
## Bad Good
## 0.4285714 1.0000000
Notice how the ratios of the two classes are now the same for all partitions.
In this exercise, you created balanced partitions using the partition() function from groupdata2. You will likely do this very often in your future machine learning practices.
As noted previously, we should treat the development and validation sets as if they were independent of the training set. In practice, they might not be so, but when creating the partitions, it is worth discussing whether there is any leakage between the partitions. One such example could be that we had recorded multiple loans per loan applicant. In that case, we would have an ID column with multiple rows per applicant ID. We refer to this as repeated measures data. In this case, the same applicant should not be in both the training set and one of the test sets, as we wish to test how well our model performs on new applicants, not the applicants it encountered during training. The id_col argument in the partition() function ensures that all rows with the same applicant ID are placed in only one of the partitions.
The following code shows how we would specify the id_col argument in partition(), if we had an ApplicantID column in our dataset.
# Avoiding ID leakage
# NOTE: Can't be run as we don't actually have an ApplicantID column
partition(GermanCredit, p = c(0.6, 0.2),
          cat_col = "Class",
          id_col = "ApplicantID")
# With this, each applicant would have all their observations/rows
# in the same partition!
What happens when we specify the id_col argument in partition()? First, we extract a list of the unique IDs. Then, that list is partitioned and the rows are put in the partition of their ID. What happens when we specify both the cat_col and id_col arguments in partition()? First, the dataset is subset by each class level. For each subset, we extract and partition the list of unique IDs. Finally, the rows are put in the partition of their ID. This ensures that all rows with the same ID are put in the same partition, while also balancing the ratios of each class between the partitions. Note that this approach requires all rows with the same ID to have the same value in the cat_col column.
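The ID-based steps can be sketched in base R as well. The loans data frame and its ApplicantID and Amount columns are hypothetical, and the code illustrates the idea rather than how groupdata2 implements it.

```r
# Toy repeated measures data: 10 applicants with 3 loans each (hypothetical)
set.seed(1)
loans <- data.frame(
  ApplicantID = rep(1:10, each = 3),
  Amount = round(runif(30, 500, 5000))
)

# 1) Extract the unique IDs
ids <- unique(loans$ApplicantID)

# 2) Partition the IDs (60/20/20) rather than the rows
id_partition <- sample(rep(1:3, times = c(6, 2, 2)))
names(id_partition) <- ids

# 3) Put each row in the partition of its ID
loans$part <- unname(id_partition[as.character(loans$ApplicantID)])
partitions <- split(loans, loans$part)

# Check for leakage: no ID may appear in more than one partition
parts_per_id <- tapply(loans$part, loans$ApplicantID,
                       function(p) length(unique(p)))
all(parts_per_id == 1)
```

Since the IDs are partitioned rather than the rows, the partition sizes are given in applicants; here the first partition holds 6 applicants and therefore 18 rows.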
In this exercise, we will partition the dataset such that the development and validation sets have an equal number of observations per class. This makes our evaluation metrics easier to interpret. We can do this by first specifying the number of observations per class for the two test partitions and then using the rest of the observations as the training set. Note that if you specify the id_col argument, the given number of individuals (per class), rather than observations, will be placed in each partition. Also note that the training set is now the third element in the list of partitions.
# Attach groupdata2
library(groupdata2)
# Set seed for reproducibility and easier comparison
set.seed(1)
# Partition with an equal number of observations (50) per class in
# the development and validation sets
# The remaining observations will end up in a third partition,
# which will be our training set
partitions_equal <- partition(GermanCredit, p = c(50, 50), cat_col = "Class")
dev_set_equal <- partitions_equal[[1]]
valid_set_equal <- partitions_equal[[2]]
train_set_equal <- partitions_equal[[3]]
# Count the number of observations from each class
# in the training and test sets.
train_counts_equal <- table(train_set_equal$Class)
dev_counts_equal <- table(dev_set_equal$Class)
valid_counts_equal <- table(valid_set_equal$Class)
train_counts_equal
The output is as follows:
## Bad Good
## 200 600
dev_counts_equal
The output is as follows:
## Bad Good
## 50 50
valid_counts_equal
The output is as follows:
## Bad Good
## 50 50
Notice how the development and validation sets both have 50 Good and 50 Bad loan applicants.
In this exercise, we have partitioned the dataset such that the development and validation sets have an equal number of observations per class, and everything else is the training set.
We will work with the partitions from Exercise 42, Creating Balanced Partitions. A few of the features could benefit from being standardized, and the rest are already one-hot encoded. We only run the preProcess() function on the training set to find the scaling and centering parameters, which are then applied to all three partitions:
# Find scaling and centering parameters for the first 6 columns
params <- preProcess(train_set[, 1:6], method=c("center", "scale"))
The predict() function uses the parameters in the params object to apply the preprocessing transformations to the specified dataset:
# Transform the training set
train_set[, 1:6] <- predict(params, train_set[, 1:6])
# Transform the development set
dev_set[, 1:6] <- predict(params, dev_set[, 1:6])
# Transform the validation set
valid_set[, 1:6] <- predict(params, valid_set[, 1:6])
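To make the role of the training set parameters concrete, here is a minimal base R sketch of centering and scaling with a mean and standard deviation found on the training data only. The two small data frames are hypothetical, and the arithmetic is meant to mirror what the "center" and "scale" methods do, not to replace preProcess().

```r
# Toy training and development sets (hypothetical values)
train <- data.frame(Duration = c(6, 12, 24, 48))
dev <- data.frame(Duration = c(12, 36))

# Find the parameters on the training set only
train_mean <- mean(train$Duration)
train_sd <- sd(train$Duration)

# Apply the same parameters to both sets
train$Duration_scaled <- (train$Duration - train_mean) / train_sd
dev$Duration_scaled <- (dev$Duration - train_mean) / train_sd

# The training column now has mean 0 and standard deviation 1,
# while the development column generally does not
c(mean(train$Duration_scaled), sd(train$Duration_scaled))
c(mean(dev$Duration_scaled), sd(dev$Duration_scaled))
```

Using the training parameters everywhere keeps the development and validation sets from influencing the preprocessing, in line with treating them as data collected after training.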
We are now ready to train a classifier to predict whether an applicant is creditworthy or not. The neuralnet package makes it easy to specify and train an artificial neural network. The first argument in the neuralnet() function is the model formula. Here, we tell our model to predict whether Class is Good or Bad. The tilde ~ means predicted by, and the variables to the right of it are the predictors. If we wish to use all the variables (excluding the target variable) as predictors, we can use a dot (y~.).
To begin with, we specify a simple formula with only Duration and Amount as predictors:
# Attaching neuralnet
library(neuralnet)
# Set seed for reproducibility and easier comparison
set.seed(1)
# Classifying if class is "Good"
nn1 <- neuralnet(Class == "Good" ~ Duration + Amount,
train_set, linear.output = FALSE)
plot(nn1, rep="best", fontsize = 10)
The output is as follows:
The plot shows the neural network with three layers (input, hidden, output), from left to right. We have two nodes (also called neurons) in the input layer. The values in these nodes are multiplied by the weights (the numbers on the lines) going into the hidden layer with one node. The output layer has one node as well, which is our probability of Class being Good. The last two layers also have bias parameters (the "1" nodes). At the bottom of the plot, we see the training error, along with the number of training steps.
We can also get the error from the result matrix in the model object:
train_error <- nn1$result.matrix[1]
train_error
## [1] 59.630264
Now we will use columns 11 to 20 as predictors (remember to include column 9, as that's the target variable – Class). With the regular plot() function, the plot gets very messy with larger networks, so we will use the plotnet() function from the NeuralNetTools package. Let's see whether our error is reduced:
# Attach packages
library(neuralnet)
library(NeuralNetTools)
# Set seed for reproducibility and easier comparison
set.seed(1)
# Classifying if class is "Good"
# Using columns 11 to 20
# Notice that we choose the predictors by subsetting the data frame
# and use every column as predictor with the dot "~."
nn2 <- neuralnet(Class == "Good" ~ ., train_set[, c(9, 11:20)],
linear.output = FALSE)
plotnet(nn2, var_labs = FALSE)
The output is as follows:
The plot produced by plotnet() shows the weights not as numbers but through line thickness and shade (gray is negative and black is positive). Let's check the error:
train_error <- nn2$result.matrix[1]
train_error
The output is as follows:
## [1] 52.13454
Choosing these predictors lowered our error. In the next activity, we will train a neural network and calculate the training error.
In this activity, we will train the neural network using the German Credit dataset found at https://github.com/TrainingByPackt/Practical-Machine-Learning-with-R/blob/master/Data/GermanCredit.csv.
We will preprocess the GermanCredit dataset and train a neural network to predict creditworthiness. Feel free to reuse the code from Exercise 42, Creating Balanced Partitions, so that you get to train multiple neural networks.
Expected output: we expect the training error to be close to 62.15447. Note that, depending on the version of R and the packages installed, the results might vary.
Here are the steps that will help you complete the activity:
The random initialization of the neural network weights can lead to slightly different results from one training to another. To avoid this, we use the set.seed() function at the beginning of the script, which helps when comparing models. We could also train the same model architecture with five different seeds to get a better sense of its performance.
The solution for this activity can be found on page 342.
Now that we know how to train a classifier, we will review how to evaluate and choose between models. First, we will cover four common metrics: Accuracy, Precision, Recall, and F1, and then we will discuss cross-validation as a good tool for model comparison.
We use a confusion matrix to split the predictions into True Positives, False Positives, True Negatives, and False Negatives. In our case, True Positives are the applicants that were correctly identified as being "Good," False Positives are the applicants that were incorrectly identified as being "Good," True Negatives are the applicants that were correctly identified as "Bad," while False Negatives are the applicants that were incorrectly identified as "Bad."
In Figure 4.4, the rows are the true classes, while the columns are the predicted classes. If the true class is Positive and the model predicts Positive, that prediction is a True Positive.
Accuracy: what percentage of the predictions were correct?
Precision: of all the Good predictions, how many were actually Good?
Note that we can choose to use the other class, Bad, as the Positive class. In that case, we would be asking how many of the Bad predictions actually were Bad.
Recall: how many of the applicants with a Good credit rating were correctly identified?
F1: the harmonic mean of precision and recall, which is a common way to merge the two scores into one, making it easier to compare models.
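For reference, the standard definitions of these four metrics can be written in terms of the confusion matrix counts. The abbreviations TP, FP, TN, and FN (True Positives, False Positives, True Negatives, and False Negatives) are introduced here for brevity:

```latex
\begin{align*}
\text{Accuracy}  &= \frac{TP + TN}{TP + FP + TN + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN} \\
F_1              &= 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{align*}
```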
It is easier to visualize the difference between the F1 score and a simple average of the two metrics than to explain it in words:
Figure 4.9 shows that the F1 score rewards models where both the precision and recall scores are decent, whereas the simple average does not take an imbalance between the metrics into account.
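The same point can be checked numerically. The following sketch compares the F1 score with the simple average for two hypothetical models; the precision and recall values are made up for illustration.

```r
# Harmonic mean of precision and recall
f1_score <- function(precision, recall) {
  2 * precision * recall / (precision + recall)
}

# Two hypothetical models with the same simple average (0.5)
balanced <- c(precision = 0.5, recall = 0.5)
imbalanced <- c(precision = 0.9, recall = 0.1)

simple_avg <- c(balanced = mean(balanced), imbalanced = mean(imbalanced))
f1 <- c(
  balanced = f1_score(balanced[["precision"]], balanced[["recall"]]),
  imbalanced = f1_score(imbalanced[["precision"]], imbalanced[["recall"]])
)
round(rbind(simple_avg, f1), 2)
```

Both models have a simple average of 0.5, but the F1 score is 0.5 for the balanced model and only 0.18 for the imbalanced one.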
In this exercise, we will find the confusion matrix for a trained neural network. We will use the trained neural network from Activity 15, Training a neural network, to predict the development set. From these predictions, we will create a confusion matrix. The confusionMatrix() function from the caret package also calculates our evaluation metrics:
# Attach caret
library(caret)
# Create a binary (0/1) encoding of the Class variable
# ifelse() takes the arguments: "If x, then a, else b"
true_labels <- ifelse(dev_set$Class == "Good", 1, 0)
# Predict the class in the dev set
# It returns probabilities that the observations are "Good"
predicted_probabilities <- predict(nn, dev_set)
predictions <- ifelse(predicted_probabilities > 0.5, 1, 0)
# Create confusion matrix
confusion_matrix <- confusionMatrix(as.factor(predictions), as.factor(true_labels),
mode="prec_recall", positive = "1")
confusion_matrix
The output is as follows:
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 0 0
## 1 60 140
##
## Accuracy : 0.7
## 95% CI : (0.6314, 0.7626)
## No Information Rate : 0.7
## P-Value [Acc > NIR] : 0.5348
##
## Kappa : 0
## Mcnemar's Test P-Value : 2.599e-14
##
## Precision : 0.7000
## Recall : 1.0000
## F1 : 0.8235
## Prevalence : 0.7000
## Detection Rate : 0.7000
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : 1
##
If we look at the metrics, our model has an accuracy of 70%, precision of 70%, recall of 100%, and an F1 score of 82.35%. This seems pretty good, but when we look at the predictions in the confusion matrix, we see that the model simply predicts Good for all observations. With the imbalance in the training set, where we have 420 Good and 180 Bad observations, the model can achieve a fairly low error by simply predicting Good. This especially happens if our features do not provide enough information about the classes to get a smaller error. Another possible reason is that the model is simply not big enough (in terms of the number of nodes and layers) to extract useful information from the features. In that case, we would say that the model is underfitting the data.
An important point to take away from this example is that we can get seemingly high accuracies, precision scores, and so on, when our dataset is imbalanced, even though our model is useless. It is always a good idea to look at the confusion matrix to check that this isn't the case. In the upcoming exercise, we will create baseline evaluations so that we have something to compare our models against.
In this exercise, we will create and evaluate three sets of predictions in order to know what accuracy we should beat to be better than chance, or the discussed "one-class-predictor." The three sets of predictions are 1) all "Good" predictions, 2) all "Bad" predictions, and 3) random predictions, preferably repeated 100 times or more:
# Attach caret
library(caret)
# Create a binary (0/1) encoding of the Class variable
true_labels <- ifelse(dev_set$Class == "Good", 1, 0)
# The number of predictions to make
num_total_predictions <- 200 # Alternatively: length(true_labels)
# All "Good"
good_predictions <- rep(1, num_total_predictions)
# Create confusion Matrix
confusion_matrix_good <- confusionMatrix(
as.factor(good_predictions), as.factor(true_labels),
mode = "prec_recall", positive = "1")
confusion_matrix_good
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 0 0
## 1 60 140
##
## Accuracy : 0.7
## 95% CI : (0.6314, 0.7626)
## No Information Rate : 0.7
## P-Value [Acc > NIR] : 0.5348
##
## Kappa : 0
## Mcnemar's Test P-Value : 2.599e-14
##
## Precision : 0.7000
## Recall : 1.0000
## F1 : 0.8235
## Prevalence : 0.7000
## Detection Rate : 0.7000
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : 1
##
# All "Bad"
bad_predictions <- rep(0, num_total_predictions)
Create a confusion matrix and inspect the results:
# Create confusion Matrix
confusion_matrix_bad <- confusionMatrix(
as.factor(bad_predictions), as.factor(true_labels),
mode="prec_recall", positive = "1")
confusion_matrix_bad
The output is as follows:
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 60 140
## 1 0 0
##
## Accuracy : 0.3
## 95% CI : (0.2374, 0.3686)
## No Information Rate : 0.7
## P-Value [Acc > NIR] : 1
##
## Kappa : 0
## Mcnemar's Test P-Value : <2e-16
##
## Precision : NA
## Recall : 0.0
## F1 : NA
## Prevalence : 0.7
## Detection Rate : 0.0
## Detection Prevalence : 0.0
## Balanced Accuracy : 0.5
##
## 'Positive' Class : 1
Note that the precision and F1 scores could not be calculated: with no positive predictions, the precision calculation becomes 0 / (0 + 0), which is undefined, and the F1 metric uses precision in its calculation. By default, the confusionMatrix() function would set Bad as the positive class, but we are using positive = "1". If you wanted to know how precise the model is at finding Bad applicants, you could set positive = "0".
For the random predictions, we could either sample and evaluate within a loop and find the average evaluation, or, as we will do in this example, we could sample the number of predictions 100 times and repeat the true_labels vector 100 times. The method to use depends on the chosen set of evaluation metrics.
# Set seed for reproducibility
set.seed(1)
# The number of evaluations to do
num_evaluations <- 100
# Repeat the true labels vector 100 times
true_labels_100 <- rep(true_labels, num_evaluations)
# Random predictions
# Draw random predictions (either 1 or 2, hence the "-1")
random_predictions <- sample.int(
2, size = num_total_predictions * num_evaluations,
replace = TRUE) - 1
head(random_predictions)
The output is as follows:
## [1] 0 0 1 1 0 1
# Average number of times predicting "Good"
sum(random_predictions) / num_evaluations
The output is as follows:
## [1] 99.59
# Create confusion matrix
confusion_matrix_random <- confusionMatrix(
as.factor(random_predictions),
as.factor(true_labels_100),
mode = "prec_recall",
positive = "1")
confusion_matrix_random
The random confusion matrix is as follows:
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3020 7021
## 1 2980 6979
##
## Accuracy : 0.5
## 95% CI : (0.493, 0.5069)
## No Information Rate : 0.7
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0015
## Mcnemar's Test P-Value : <2e-16
##
## Precision : 0.7008
## Recall : 0.4985
## F1 : 0.5826
## Prevalence : 0.7000
## Detection Rate : 0.3489
## Detection Prevalence : 0.4980
## Balanced Accuracy : 0.5009
##
## 'Positive' Class : 1
##
When randomly predicting the class of the applicants, we get around 50% accuracy, 50% recall, 70% precision, and an F1 score of 58%. Hence, our model should perform better than this, before its predictions are more useful than if we simply guessed.
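As a sanity check, the metrics reported for the random predictions can be recomputed by hand from the four counts in the confusion matrix above. This sketch only uses base R arithmetic; the variable names are chosen for illustration.

```r
# Counts from the random-prediction confusion matrix,
# with "1" (Good) as the positive class
tp <- 6979  # predicted 1, truly 1
fp <- 2980  # predicted 1, truly 0
fn <- 7021  # predicted 0, truly 1
tn <- 3020  # predicted 0, truly 0

accuracy <- (tp + tn) / (tp + fp + fn + tn)
precision <- tp / (tp + fp)
recall <- tp / (tp + fn)
f1 <- 2 * precision * recall / (precision + recall)

c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1)
```

The results agree with the caret output: an accuracy of about 0.5, a precision of about 0.7008, a recall of 0.4985, and an F1 score of about 0.5826.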
Underfitting happens when a model is too simple to use the information in the training data. Conversely, overfitting happens when a model is too complex and learns patterns specific to the training data, such as noise, that do not generalize to new data.
Finding the right number of nodes and layers in a neural network can be a long, tedious process with a lot of trial and error. If we increase the size of the network, we can handle more advanced tasks but risk overfitting, whereas if we make it smaller, we risk making the model too simple to solve our task (underfitting). We need to find the middle ground, usually by increasing the network size until overfitting and then dialing it back some. We must also consider that larger networks can take longer to train and take up more disk space. If, for instance, we intend to run a model on a smartphone, we might sacrifice a bit of accuracy in order to have the model run faster and save some megabytes:
Figure 4.10 shows three examples of curve fitting. The black line with points shows the training data, which is a sine wave with random noise. The dashed grey line shows the pure sine wave, which is the formula the model should learn. The solid magenta line shows the predicted values for each x. We wish to avoid underfitting and overfitting as their predictions do not generalize to new data.
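A curve-fitting experiment of this kind can be sketched with polynomial regression instead of a neural network. Everything here is an assumption made for illustration: the synthetic noisy sine data, the chosen polynomial degrees, and the use of lm() with poly() as the model.

```r
# Noisy sine wave as training data, plus fresh test data (synthetic)
set.seed(1)
make_data <- function(n) {
  x <- sort(runif(n, 0, 2 * pi))
  data.frame(x = x, y = sin(x) + rnorm(n, sd = 0.3))
}
train <- make_data(30)
test <- make_data(200)

# Root mean squared error of a model's predictions on a dataset
rmse <- function(model, data) {
  sqrt(mean((data$y - predict(model, data))^2))
}

# Fit polynomials of increasing flexibility
degrees <- c(1, 3, 15)
results <- t(sapply(degrees, function(d) {
  model <- lm(y ~ poly(x, d), data = train)
  c(degree = d,
    train_rmse = rmse(model, train),
    test_rmse = rmse(model, test))
}))
results
```

The training error can only go down as the degree increases, since each polynomial contains the simpler ones as special cases. The test error is the number to watch: a straight line is typically too simple for a sine wave, while a degree-15 polynomial tends to chase the noise and predict new points poorly.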
So far, we have only one node in one layer (excluding the input and output layers) in our neural networks. Adding nodes and layers is very simple and can decrease the error a lot. We simply pass a vector with the number of nodes per layer in the hidden argument. Let's create a model with two layers with two nodes each. We can have thousands of nodes in tens or even hundreds of layers, but that would take too long to train for this example (and would most likely overfit):
# Attach neuralnet
library(neuralnet)
# Set seed for reproducibility and easy comparison
set.seed(1)
Train the neural network:
# Classifying if class is "Good"
nn3 <- neuralnet(Class == "Good" ~ ., train_set[, c(9, 11:20)],
linear.output = FALSE, hidden = c(2,2))
Plot the trained model:
plotnet(nn3, var_labs = FALSE)
The output is as follows:
Calculate the training error:
# Print the training error
train_error <- nn3$result.matrix[1]
train_error
The output is as follows:
## [1] 48.42343
We get a lower error, but we have no real indication of whether it is underfitting or overfitting.
Cross-validation (CV) is a common technique used when deciding between model architectures. We split the training set into a number (k) of groups called folds. We train a model k times, using a different fold as the test set each time, while the other folds constitute the CV training set.
The CV training set is not to be confused with the overall training set that we used to create the folds.
Say we use 4 folds (k = 4). We train the first instance of the model on folds 2, 3, and 4 and evaluate on fold 1. Then, we train a model instance on folds 1, 3, and 4 and evaluate on fold 2. Once all the folds have been used as test folds, we average the results and compare them to the cross-validation results of the other model architectures. Finally, we train an instance of the best-performing model architecture on the entire training set and test it on the validation set.
An alternative approach to averaging the results from each iteration is to collect the predictions from all iterations and evaluate once.
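The two ways of summarizing cross-validation results can be compared in a small sketch. The synthetic data, the random fold assignment, and the logistic regression fitted with glm() are assumptions made for illustration and stand in for the GermanCredit data and a neural network.

```r
# Synthetic binary classification data (hypothetical stand-in)
set.seed(1)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- rbinom(n, size = 1, prob = plogis(1.5 * toy$x1 - toy$x2))

# Randomly assign each row to one of k equally sized folds
k <- 4
toy$fold <- sample(rep(1:k, length.out = n))

# Collect the out-of-fold predictions for every row
pooled_predictions <- numeric(n)
fold_accuracies <- numeric(k)
for (f in 1:k) {
  is_test <- toy$fold == f
  model <- glm(y ~ x1 + x2, data = toy[!is_test, ], family = binomial)
  probs <- predict(model, toy[is_test, ], type = "response")
  preds <- ifelse(probs > 0.5, 1, 0)
  pooled_predictions[is_test] <- preds
  fold_accuracies[f] <- mean(preds == toy$y[is_test])
}

# Approach 1: average the per-fold results
mean(fold_accuracies)
# Approach 2: evaluate once on the pooled predictions
mean(pooled_predictions == toy$y)
```

For accuracy with equally sized folds, the two approaches give the same number. They can differ when the folds have unequal sizes or when the metric is not a simple average over observations, as is the case for precision, recall, and F1.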
When working with a small development set, we risk it not being representative of our data distribution. In cross-validation, we evaluate our model architecture on all observations in the training set. We learn how well the model architecture can be fitted to multiple subsets of the training set instead of just a single subset. We can also detect inconsistencies in our data if, for instance, we get a very low accuracy when evaluating one of the folds compared to the others.
Figure 4.12 shows the percentage split of the dataset into, first, the training and validation sets, and second, the 4 folds. Figure 4.13 shows how these folds are used in the cross-validation training loop:
Similar to when we created the partitions, we would like to have balanced ratios of the classes and avoid leakage between the folds. For this purpose, groupdata2 has a fold() function, working similarly to partition(). Instead of returning a list of folds though, it simply creates a column with the fold identifiers in our data frame, called .folds.
There are multiple variations of cross-validation. We are using stratified cross-validation, as we are balancing the ratio of the classes between the folds. Another method is called leave-one-out cross-validation, where we would treat each observation as a fold. In situations where that would lead to leakage between the folds, we can use leave-one-group-out cross-validation, where, for instance, each applicant is treated as a fold (containing all their loan applications). A fourth method is repeated cross-validation, where we repeat the fold creation step multiple times, rather than just once. We create the folds, run cross-validation, create new folds, run cross-validation, and so on. This allows us to evaluate a lot more combinations of our observations.
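Stratified and repeated fold creation can be sketched in base R. The toy data frame and the stratified_folds() helper are hypothetical and only illustrate the idea; in the exercises, the fold() function from groupdata2 takes care of this step.

```r
# Toy data with imbalanced classes (hypothetical stand-in)
set.seed(1)
toy <- data.frame(
  Class = factor(rep(c("Bad", "Good"), times = c(30, 70)))
)

# Stratified fold identifiers: shuffle the fold labels within each class
stratified_folds <- function(classes, k) {
  folds <- integer(length(classes))
  for (lvl in unique(as.character(classes))) {
    rows <- which(classes == lvl)
    folds[rows] <- sample(rep(1:k, length.out = length(rows)))
  }
  folds
}

# Repeated cross-validation: create a new fold column per repetition
num_repetitions <- 3
for (r in 1:num_repetitions) {
  toy[[paste0("folds_", r)]] <- stratified_folds(toy$Class, k = 5)
}

# Class counts per fold for the first repetition
fold_counts <- table(toy$Class, toy$folds_1)
fold_counts
```

Each of the five folds contains 6 Bad and 14 Good rows, so the class ratio is the same in every fold. Each repetition draws a new random assignment, so cross-validation can then be run once per fold column.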
We will create and preprocess a training and validation set and create four folds from the training set. It is common to use 10 folds, but there are no hard rules:
# Attach groupdata2
library(groupdata2)
# Set seed for reproducibility and easier comparison
set.seed(1)
# Partition into a training set and a validation set
partitions <- partition(GermanCredit, p = 0.8, cat_col = "Class")
train_set <- partitions[[1]]
valid_set <- partitions[[2]]
Find scaling and centering parameters:
# Note: We could also decide to do this inside the training loop!
params <- preProcess(train_set[, 1:6], method=c("center", "scale"))
Transform the training set:
train_set[, 1:6] <- predict(params, train_set[, 1:6])
# Transform the validation set
valid_set[, 1:6] <- predict(params, valid_set[, 1:6])
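The idea of estimating the parameters on the training set and reusing them on the validation set can be illustrated with a small base R sketch. It uses made-up numbers and plain arithmetic in place of caret's preProcess(), so treat it as an illustration of the principle rather than of caret's internals.

```r
# Made-up values for a single numeric predictor
train_x <- c(10, 20, 30, 40, 50)
valid_x <- c(25, 60)

# Estimate the centering and scaling parameters on the training data only
train_mean <- mean(train_x)
train_sd <- sd(train_x)

# Apply the same parameters to both sets
train_scaled <- (train_x - train_mean) / train_sd
valid_scaled <- (valid_x - train_mean) / train_sd

# The training data now has mean 0 and standard deviation 1
mean(train_scaled)
sd(train_scaled)

# The validation data is expressed on the training set's scale,
# so its own mean and standard deviation are generally not 0 and 1
valid_scaled
```

Because the validation set plays no part in estimating the parameters, no information about it leaks into the preprocessing.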
To create four folds, we will use k = 4:
# Create folds for cross-validation
# Again balanced on the Class variable
train_set <- fold(train_set, k=4, cat_col = "Class")
This creates a factor in the dataset called ".folds". Take care not to use this as a predictor.
In this exercise, we will perform cross-validation with a simple for loop. For each iteration, we subset the cross-validation training and test sets, train our model on the training set, and evaluate the model on the test set. We then find the average accuracy and error:
# Initialize vectors for collecting errors and accuracies
errors <- c()
accuracies <- c()
# Training loop
for (part in 1:4){
  # Assign the chosen fold as test set
  # and the rest of the folds as train set
  cv_test_set <- train_set[train_set$.folds == part,]
  cv_train_set <- train_set[train_set$.folds != part,]
  # Train neural network classifier
  # Make sure not to include the ".folds" column as a predictor!
  nn <- neuralnet(Class == "Good" ~ .,
                  cv_train_set[, c(9, 11:20)],
                  linear.output = FALSE)
  # Append error to the errors vector
  errors <- append(errors, nn$result.matrix[1])
  # Encode the Class variable as 1 ("Good") or 0 ("Bad")
  true_labels <- ifelse(cv_test_set$Class == "Good", 1, 0)
  # Predict the class in the test set
  # It returns probabilities that the observations are "Good"
  predicted_probabilities <- predict(nn, cv_test_set)
  predictions <- ifelse(predicted_probabilities > 0.5, 1, 0)
  # Calculate accuracy manually
  # Note: TRUE == 1, FALSE == 0
  cv_accuracy <- sum(true_labels == predictions) / length(true_labels)
  # Append the accuracy to the accuracies vector
  accuracies <- append(accuracies, cv_accuracy)
}
# Calculate average error and accuracy
# Note that we could also have gathered the predictions from all the folds
# and calculated the accuracy only once. This could lead to slightly
# different results if, for instance, the folds were not exactly the same size.
average_error <- mean(errors)
average_error
## [1] 51.47703
average_accuracy <- mean(accuracies)
average_accuracy
## [1] 0.74375
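The difference between averaging the per-fold accuracies and calculating a single accuracy on the pooled predictions can be seen in a small arithmetic sketch. The counts and fold sizes below are made up for illustration:

```r
# Made-up counts of correct predictions and fold sizes
correct <- c(8, 9, 3)
fold_sizes <- c(10, 10, 5)

# Average of the per-fold accuracies (each fold weighs the same)
mean_of_fold_accuracies <- mean(correct / fold_sizes)

# Accuracy on the pooled predictions (each observation weighs the same)
pooled_accuracy <- sum(correct) / sum(fold_sizes)

mean_of_fold_accuracies
pooled_accuracy
```

With equally sized folds, the two numbers would be identical. Here, the smaller third fold pulls the averaged accuracy below the pooled accuracy.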
Once we have found the model architecture that gives the best average accuracy (or some other chosen metric), we can train this model on the entire training set and evaluate it on the validation set.
As we will be rerunning the cross-validation training loop again and again when comparing multiple model architectures, it would make sense to convert this code into a function, taking the arguments for neuralnet() and returning the average accuracy and error. A function in R is defined like this:
# Basic function in R
function_name <- function(arg1, arg2){
  # Do something with the arguments
  result <- arg1 + arg2
  # Return the result
  return(result)
}
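As a rough sketch of what such a cross-validation function could look like, the following wraps the loop in a function. To keep it self-contained, it uses simulated data, simple random (not stratified) folds, and base R's glm() as a stand-in for neuralnet(). The names cross_validate, fit_fn, and predict_fn are hypothetical, and the sketch only collects accuracies, not the training error.

```r
# Hypothetical helper: cross-validate any model given a fitting function
# and a function that returns predicted probabilities
cross_validate <- function(data, fit_fn, predict_fn,
                           target = "y", fold_col = ".folds") {
  accuracies <- c()
  for (fold in unique(data[[fold_col]])) {
    # Use the chosen fold as test set and the rest as training set
    in_test <- data[[fold_col]] == fold
    # Drop the fold column so that it cannot be used as a predictor
    keep <- setdiff(names(data), fold_col)
    cv_train_set <- data[!in_test, keep]
    cv_test_set <- data[in_test, keep]
    model <- fit_fn(cv_train_set)
    predictions <- ifelse(predict_fn(model, cv_test_set) > 0.5, 1, 0)
    accuracies <- append(accuracies,
                         mean(predictions == cv_test_set[[target]]))
  }
  list(accuracies = accuracies, average_accuracy = mean(accuracies))
}

# Simulated data: y depends on x plus noise (made up for illustration)
set.seed(1)
sim <- data.frame(x = rnorm(200))
sim$y <- ifelse(sim$x + rnorm(200, sd = 0.5) > 0, 1, 0)
# Simple random folds (not stratified) to keep the sketch short
sim$.folds <- sample(rep(1:4, length.out = nrow(sim)))

# Logistic regression as a stand-in for the neural network
cv_results <- cross_validate(
  sim,
  fit_fn = function(d) glm(y ~ x, data = d, family = binomial),
  predict_fn = function(m, d) predict(m, newdata = d, type = "response")
)
cv_results$average_accuracy
```

Passing the model as a function means that comparing architectures only requires swapping fit_fn, for instance for a function that calls neuralnet() with a different hidden argument.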
In the upcoming activity, we will be training neural networks.
In this activity, we will predict whether a diabetes test is positive or negative based on eight predictors. We will be using the PimaIndiansDiabetes2 dataset from https://github.com/TrainingByPackt/Practical-Machine-Learning-with-R/blob/master/Data/PimaIndiansDiabetes2.csv.
The dataset has missing data, which you will need to handle. A quick solution would be to remove the columns with many missing values and the rows with missing data in the other columns. Let's summarize the dataset and discuss this before diving into the activity:
# Attach packages
library(groupdata2)
library(caret)
library(mlbench)
library(neuralnet)
# Load the data
PimaIndiansDiabetes2 <- read.csv("PimaIndiansDiabetes2.csv")
# Summarize the dataset
summary(PimaIndiansDiabetes2)
## pregnant glucose pressure triceps
## Min. : 0.000 Min. : 44.0 Min. : 24.00 Min. : 7.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 64.00 1st Qu.:22.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :29.00
## Mean : 3.845 Mean :121.7 Mean : 72.41 Mean :29.15
## 3rd Qu.: 6.000 3rd Qu.:141.0 3rd Qu.: 80.00 3rd Qu.:36.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## NA's :5 NA's :35 NA's :227
##
## insulin mass pedigree age
## Min. : 14.00 Min. :18.20 Min. :0.0780 Min. :21.00
## 1st Qu.: 76.25 1st Qu.:27.50 1st Qu.:0.2437 1st Qu.:24.00
## Median :125.00 Median :32.30 Median :0.3725 Median :29.00
## Mean :155.55 Mean :32.46 Mean :0.4719 Mean :33.24
## 3rd Qu.:190.00 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.00 Max. :67.10 Max. :2.4200 Max. :81.00
## NA's :374 NA's :11
##
## diabetes
## neg:500
## pos:268
Two of the predictors (triceps and insulin) contain a lot of NAs ("Not Available"). If we remove these two columns, we can either try to impute the remaining missing values or simply remove the rows containing NAs. This is up to you.
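The quick solution of dropping sparse columns and then incomplete rows could be sketched as follows. The toy data frame and the 30% threshold are made up for illustration; they are not taken from the activity's solution.

```r
# Toy data frame with missing values (made-up numbers)
toy <- data.frame(
  glucose = c(148, 85, NA, 89, 137, 116),
  insulin = c(NA, NA, NA, 94, 168, NA),
  mass = c(33.6, 26.6, 23.3, NA, 43.1, 25.6),
  diabetes = c("pos", "neg", "pos", "neg", "pos", "neg")
)

# Proportion of missing values in each column
na_share <- colMeans(is.na(toy))
na_share

# Remove columns where more than 30% of the values are missing
# (the threshold is an arbitrary choice for this sketch)
toy_reduced <- toy[, na_share <= 0.3]

# Remove the rows that still contain missing values
toy_complete <- na.omit(toy_reduced)
toy_complete
```

Removing the sparse columns first matters: calling na.omit() on the original data frame would discard every row with a missing insulin value, leaving far fewer observations.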
The diabetes column is the target variable that we wish to predict.
The main purpose of this activity is to train and compare multiple neural network architectures. Try changing the number of layers and nodes and compare them on accuracy, precision, recall, and F1.
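As a reminder of how the four metrics relate to the confusion matrix counts, here is a small base R sketch with made-up labels and predictions. In the activity itself, caret's confusionMatrix function can be used instead.

```r
# Made-up true labels and predictions (1 = positive, 0 = negative)
true_labels <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
predictions <- c(1, 1, 1, 0, 1, 0, 0, 0, 0, 0)

# Count the four outcomes of the confusion matrix
tp <- sum(predictions == 1 & true_labels == 1)
fp <- sum(predictions == 1 & true_labels == 0)
tn <- sum(predictions == 0 & true_labels == 0)
fn <- sum(predictions == 0 & true_labels == 1)

accuracy <- (tp + tn) / (tp + fp + tn + fn)
precision <- tp / (tp + fp)
recall <- tp / (tp + fn)
f1 <- 2 * precision * recall / (precision + recall)

c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1)
```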
Here are the steps that will help you complete the activity. Note that these do not include cross-validation, which will be performed in the next activity:
The output will be similar to the following:
In this activity, we have trained multiple neural network architectures and evaluated the best model on the validation set.
The solution for this activity can be found on page 344.
In this activity, we will perform the same operations as in the previous activity, Activity 16, Training and Comparing Neural Network Architectures, but instead of using a development set, we will use cross-validation to select the best model. We will be using the cross-validation code from Exercise 6, Writing a Cross-validation Training Loop.
We will be using the PimaIndiansDiabetes2 dataset from https://github.com/TrainingByPackt/Practical-Machine-Learning-with-R/blob/master/Data/PimaIndiansDiabetes2.csv.
Here are the steps to complete the activity:
The output will be similar to the following:
In this activity, we used cross-validation to evaluate the performance of neural networks, and then we evaluated the best model on the validation set. In the provided solution, we only showed one model architecture, so you might have found another model architecture that performs better.
The solution for this activity can be found on page 351.
When we have more than two classes, we have to modify our approach slightly. In the output layer of the neural network, we now have the same number of nodes as the number of classes. The values in these nodes are normalized using the softmax function, such that they all add up to 1. We can interpret these normalized values as probabilities, and the node with the highest probability is our predicted class. The softmax function is given by $\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$, where $z$ is the vector of output nodes and $K$ is the number of classes.
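The softmax normalization, where each output value is exponentiated and divided by the sum of all the exponentiated values, can be sketched in a few lines of base R. The output node values are made up, and subtracting the maximum is a common numerical safeguard rather than part of the definition.

```r
# Softmax: exponentiate each value and divide by the sum of the exponentials
softmax <- function(z) {
  # Subtracting max(z) does not change the result but avoids overflow
  exp_z <- exp(z - max(z))
  exp_z / sum(exp_z)
}

# Made-up values for three output nodes
output_nodes <- c(2.0, 1.0, 0.1)
probabilities <- softmax(output_nodes)
probabilities
sum(probabilities)

# The predicted class is the node with the highest probability
predicted_class <- which.max(probabilities)
predicted_class
```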
When evaluating the model, we have to increase the size of our confusion matrix. Figure 4.16 shows a confusion matrix with three classes. The "Yay!" boxes contain the counts of correct predictions, while the "Nope!" boxes contain the counts of incorrect predictions:
With this, we can calculate both overall metrics and one-vs-all metrics. In one-vs-all evaluations, we have one class (such as class 2) against all the other classes combined as one negative class (class 1 + class 3). We can then use the metrics we know.
For "class 2-vs-all", we would use the confusion matrix in Figure 4.17:
We can average the metrics from the one-vs-all evaluations. We can also calculate an overall accuracy, which is simply the sum of the counts in the diagonal "Yay!" boxes divided by the total number of predictions. Both the averaged and overall metrics can inform us of the model performance. How we interpret them depends on how balanced the validation set is. If one class has 90% of the observations, the overall accuracy could be 90% by simply predicting that class all the time. The averaged one-vs-all accuracy would be similarly high in that case, as every correctly rejected observation counts as a true negative, while the average recall across the classes would be much lower.
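The effect of class imbalance on these metrics can be checked with a small base R sketch. The data is made up: 90 of 100 observations belong to class 1, and the model always predicts class 1.

```r
# Made-up example: 90 of 100 observations belong to class 1,
# and the model always predicts class 1
classes <- c("class1", "class2", "class3")
targets <- factor(rep(classes, times = c(90, 5, 5)), levels = classes)
predictions <- factor(rep("class1", 100), levels = classes)

# Multiclass confusion matrix (rows: predictions, columns: targets)
conf_mat <- table(prediction = predictions, target = targets)
conf_mat

# Overall accuracy: correct predictions (the diagonal) over all predictions
overall_accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# One-vs-all accuracy and recall for each class
one_vs_all <- sapply(classes, function(cl) {
  tp <- sum(predictions == cl & targets == cl)
  tn <- sum(predictions != cl & targets != cl)
  fn <- sum(predictions != cl & targets == cl)
  c(accuracy = (tp + tn) / length(targets), recall = tp / (tp + fn))
})
one_vs_all

overall_accuracy
# Averages across the three one-vs-all evaluations
rowMeans(one_vs_all)
```

With these numbers, the overall accuracy and the averaged one-vs-all accuracy both stay high, while the average recall drops to one third because classes 2 and 3 are never predicted. This is why it helps to inspect several metrics along with the confusion matrix itself.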
By looking at the confusion matrix and the one-vs-all metrics, we can gain an understanding of which classes might be difficult to differentiate. This might suggest that we need to collect more data from these classes, or that the classes are very similar on the chosen features.
In this chapter, you trained, evaluated, and compared multiple neural network architectures on the GermanCredit and PimaIndiansDiabetes2 classification tasks. To achieve this, you created balanced partitions and folds with the groupdata2 package. You used the neuralnet package to specify and train neural networks and used those trained models to predict the classes in the development and validation sets. Both in theory and by using caret's confusionMatrix function, you learned how to calculate accuracy, precision, recall, and F1 metrics. You implemented a cross-validation training loop and used it to compare multiple model architectures. Finally, we introduced multiclass classification and the softmax function.
If you wish to build more advanced neural networks while keeping the code simple, the keras package would be a good place to start.
In the next chapter, you will learn how to fit and interpret linear and logistic regression models. We will use the cvms package to easily cross-validate multiple model formulas at once, without having to write a training loop ourselves.