This section is included to assist students in performing the activities in the book. It provides the detailed steps students should follow to complete each activity and achieve the objectives of the book.
Solution:
PimaIndiansDiabetes<-read.csv("PimaIndiansDiabetes.csv")
#Assign it to a local variable for further use
PimaIndiansDiabetesData<- PimaIndiansDiabetes
#Display the first six rows
head(PimaIndiansDiabetesData)
The output is as follows:
pregnant glucose pressure triceps insulin mass pedigree age diabetes
1 6 148 72 35 0 33.6 0.627 50 pos
2 1 85 66 29 0 26.6 0.351 31 neg
3 8 183 64 0 0 23.3 0.672 32 pos
4 1 89 66 23 94 28.1 0.167 21 neg
5 0 137 40 35 168 43.1 2.288 33 pos
6 5 116 74 0 0 25.6 0.201 30 neg
From the preceding data, identify the input features and the output (target) variable. Here, the output variable is diabetes.
levels(PimaIndiansDiabetesData$diabetes)
The output is as follows:
[1] "neg" "pos"
library(ggplot2)
barplot <- ggplot(data= PimaIndiansDiabetesData, aes(x=age))
barplot + geom_histogram(binwidth=0.2, color="black", aes(fill=diabetes)) + ggtitle("Bar plot of Age")
The output is as follows:
We can conclude that most of the data lies in the 20-30 age group. Graphical representation thus helps us understand the data at a glance.
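The visual impression can also be confirmed numerically; the following is a minimal sketch, assuming PimaIndiansDiabetesData is loaded as in the steps above:

```r
# Count observations per age decade to back up what the histogram shows
age_bins <- cut(PimaIndiansDiabetesData$age,
                breaks = seq(20, 90, by = 10), right = FALSE)
table(age_bins)
```

The [20,30) bin should hold the largest count, matching the plot.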
Solution:
#View the structure of the data
str(PimaIndiansDiabetesData)
The output is as follows:
'data.frame':768 obs. of 9 variables:
$ pregnant: num 6 1 8 1 0 5 3 10 2 8 ...
$ glucose : num 148 85 183 89 137 116 78 115 197 125 ...
$ pressure: num 72 66 64 66 40 74 50 0 70 96 ...
$ triceps : num 35 29 0 23 35 0 32 0 45 0 ...
$ insulin : num 0 0 0 94 168 0 88 0 543 0 ...
$ mass : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
$ pedigree: num 0.627 0.351 0.672 0.167 2.288 ...
$ age : num 50 31 32 21 33 30 26 29 53 54 ...
$ diabetes: Factor w/ 2 levels "neg","pos": 2 1 2 1 2 1 2 1 2 2 ...
#View the Summary of the data
summary(PimaIndiansDiabetesData)
The output is as follows:
#Perform Group by and view statistics for the columns
#Install the package
install.packages("psych")
library(psych) #Load package psych to use function describeBy
Use describeBy() with the pregnant and diabetes columns.
describeBy(PimaIndiansDiabetesData$pregnant, PimaIndiansDiabetesData$diabetes)
The output is as follows:
Descriptive statistics by group
group: neg
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 500 3.3 3.02 2 2.88 2.97 0 13 13 1.11 0.65 0.13
----------------------------------------------------------------------------------------------
group: pos
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 268 4.87 3.74 4 4.6 4.45 0 17 17 0.5 -0.47 0.23
We can view the mean, median, min, and max of the number of times pregnant attribute in the group of people who have diabetes (pos) and who do not have diabetes (neg).
describeBy(PimaIndiansDiabetesData$pressure, PimaIndiansDiabetesData$diabetes)
The output is as follows:
Descriptive statistics by group
group: neg
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 500 68.18 18.06 70 69.97 11.86 0 122 122 -1.8 5.58 0.81
----------------------------------------------------------------------------------------------
group: pos
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 268 70.82 21.49 74 73.99 11.86 0 114 114 -1.92 4.53 1.31
We can view the mean, median, min, and max of the pressure in the group of people who have diabetes (pos) and who do not have diabetes (neg).
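If the psych package is unavailable, similar group-wise statistics can be obtained with base R alone; this is a sketch, assuming the same data frame as above:

```r
# Group-wise summary of blood pressure using only base R
tapply(PimaIndiansDiabetesData$pressure,
       PimaIndiansDiabetesData$diabetes, summary)
```

This returns one summary() per diabetes level, covering the min, quartiles, median, mean, and max.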
We have learned how to view the structure of any dataset and print the statistics about the range of every column using summary().
Solution:
PimaIndiansDiabetesData<-read.csv("PimaIndiansDiabetes.csv")
#Calculate correlations
correlation <- cor(PimaIndiansDiabetesData[,1:4])
#Round the values to two decimal places
round(correlation,2)
The output is as follows:
pregnant glucose pressure triceps
pregnant 1.00 0.13 0.14 -0.08
glucose 0.13 1.00 0.15 0.06
pressure 0.14 0.15 1.00 0.21
triceps -0.08 0.06 0.21 1.00
#Plot the pairs on a plot
pairs(PimaIndiansDiabetesData[,1:4])
The output is as follows:
# Load library
library(ggplot2)
boxplot <- ggplot(data=PimaIndiansDiabetesData, aes(x=diabetes, y=pregnant))
boxplot + geom_boxplot(aes(fill=diabetes)) +
ylab("Pregnant") + ggtitle("Diabetes Data Boxplot") +
stat_summary(fun=mean, geom="point", shape=5, size=4)
The output is as follows:
In the preceding graph, we can see the distribution of "number of times pregnant" in people who do not have diabetes (neg) and in people who have diabetes (pos).
Solution:
These are the steps that will help you solve the activity:
GermanCredit <-read.csv("GermanCredit.csv")
GermanCredit_Subset=GermanCredit[,1:10]
# fit model
fit <- lm(Duration~., GermanCredit_Subset)
# summarize the fit
summary(fit)
The output is as follows:
Call:
lm(formula = Duration ~ ., data = GermanCredit_Subset)
Residuals:
Min 1Q Median 3Q Max
-44.722 -5.524 -1.187 4.431 44.287
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.0325685 2.3612128 0.861 0.38955
Amount 0.0029344 0.0001093 26.845 < 2e-16 ***
InstallmentRatePercentage 2.7171134 0.2640590 10.290 < 2e-16 ***
ResidenceDuration 0.2068781 0.2625670 0.788 0.43094
Age -0.0689299 0.0260365 -2.647 0.00824 **
NumberExistingCredits -0.3810765 0.4903225 -0.777 0.43723
NumberPeopleMaintenance -0.0999072 0.7815578 -0.128 0.89831
Telephone 0.6354927 0.6035906 1.053 0.29266
ForeignWorker 4.9141998 1.4969592 3.283 0.00106 **
ClassGood -2.0068114 0.6260298 -3.206 0.00139 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.784 on 990 degrees of freedom
Multiple R-squared: 0.4742, Adjusted R-squared: 0.4694
F-statistic: 99.2 on 9 and 990 DF, p-value: < 2.2e-16
# make predictions
predictions <- predict(fit, GermanCredit_Subset)
# summarize accuracy
rmse <- sqrt(mean((GermanCredit_Subset$Duration - predictions)^2))
print(rmse)
The output is as follows:
[1] 76.3849
In this activity, we have learned to build a linear model, make predictions on new data, and evaluate performance using RMSE.
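Note that the RMSE above is computed on the same rows the model was fit to, which tends to be optimistic. A hedged sketch of a simple holdout evaluation follows; the 70/30 split and the seed are arbitrary choices for illustration, not part of the activity:

```r
# Holdout sketch: fit on 70% of the rows, compute RMSE on the remaining 30%
set.seed(42)  # arbitrary seed for this sketch
n <- nrow(GermanCredit_Subset)
train_idx <- sample(n, size = round(0.7 * n))
fit_train <- lm(Duration ~ ., GermanCredit_Subset[train_idx, ])
preds <- predict(fit_train, GermanCredit_Subset[-train_idx, ])
sqrt(mean((GermanCredit_Subset$Duration[-train_idx] - preds)^2))
```

An RMSE measured on held-out rows is a fairer estimate of how the model will perform on new data.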
Solution:
These are the steps that will help you solve the activity:
BostonHousing <-read.csv("BostonHousing.csv")
#Build multi variable regression
regression <- lm(medv~crim + indus+rad , data = BostonHousing)
#View the summary
summary(regression)
The output is as follows:
Call:
lm(formula = medv ~ crim + indus + rad, data = BostonHousing)
Residuals:
Min 1Q Median 3Q Max
-12.047 -4.860 -1.736 3.081 32.596
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.27515 0.68220 42.913 < 2e-16 ***
crim -0.23952 0.05205 -4.602 5.31e-06 ***
indus -0.51671 0.06336 -8.155 2.81e-15 ***
rad -0.01281 0.05845 -0.219 0.827
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.838 on 502 degrees of freedom
Multiple R-squared: 0.2781, Adjusted R-squared: 0.2737
F-statistic: 64.45 on 3 and 502 DF, p-value: < 2.2e-16
#Plot the fit
plot(regression)
The output is as follows:
The preceding plot compares the predicted values and the residual values.
Hit <Return> to see the next plot:
The preceding plot shows the distribution of error. It is a normal probability plot. A normal distribution of error will display a straight line.
Hit <Return> to see the next plot:
The preceding plot compares the spread and the predicted values. We can see how the spread is with respect to the predicted values.
Hit <Return> to see the next plot:
This plot helps to identify which data points are influential to the regression model, that is, which of our model results would be affected if we included or excluded them.
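The influential points flagged in the last plot can also be extracted directly; a minimal sketch using Cook's distance with the common 4/n cutoff (the cutoff is a convention, not part of the activity):

```r
# List observations whose Cook's distance exceeds 4/n for the fitted model
cooks <- cooks.distance(regression)
influential <- which(cooks > 4 / length(cooks))
head(sort(cooks[influential], decreasing = TRUE))
```

Rows returned here are candidates for a closer look before trusting the fit.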
We have now explored the datasets with one or more variables.
Solution:
In this exercise, we will perform the center and scale pre-processing operations.
# Load Library caret
library(caret)
library(mlbench)
# load the dataset PimaIndiansDiabetes
data(PimaIndiansDiabetes)
View the summary:
# view the data
summary(PimaIndiansDiabetes [,1:2])
The output is as follows:
pregnant glucose
Min. : 0.000 Min. : 0.0
1st Qu.: 1.000 1st Qu.: 99.0
Median : 3.000 Median :117.0
Mean : 3.845 Mean :120.9
3rd Qu.: 6.000 3rd Qu.:140.2
Max. :17.000 Max. :199.0
# to standardise we will scale and center
params <- preProcess(PimaIndiansDiabetes [,1:2], method=c("center", "scale"))
# transform the dataset
new_dataset <- predict(params, PimaIndiansDiabetes [,1:2])
# summarize the transformed dataset
summary(new_dataset)
The output is as follows:
pregnant glucose
Min. :-1.1411 Min. :-3.7812
1st Qu.:-0.8443 1st Qu.:-0.6848
Median :-0.2508 Median :-0.1218
Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.6395 3rd Qu.: 0.6054
Max. : 3.9040 Max. : 2.4429
Notice that the values are now centered (mean of 0) and scaled to unit variance.
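The same standardization can be reproduced with base R's scale(), which is a quick way to sanity-check the caret transform; a sketch assuming the same dataset:

```r
# Base-R equivalent of center + scale; the summary should match new_dataset
manual <- as.data.frame(scale(PimaIndiansDiabetes[, 1:2]))
summary(manual)
```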
Solution:
mtcars = read.csv("mtcars.csv")
#Load the outlier library
library(outliers)
#Detect outliers
outlier(mtcars)
The output is as follows:
   mpg    cyl   disp     hp   drat     wt   qsec     vs     am
33.900  4.000 472.000 335.000 4.930  5.424 22.900  1.000  1.000
  gear   carb
 5.000  8.000
#This detects outliers from the other side
outlier(mtcars,opposite=TRUE)
The output is as follows:
   mpg    cyl   disp     hp   drat     wt   qsec     vs     am
10.400  8.000 71.100 52.000  2.760  1.513 14.500  0.000  0.000
  gear   carb
 3.000  1.000
#View the outliers
boxplot(mtcars)
The output is as follows:
The circles mark the outliers.
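outlier() only reports the single most extreme value per column. A common alternative is the 1.5 * IQR rule that boxplot() itself uses to draw those circles; the following sketch applies it to one mtcars column:

```r
# Flag hp values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q <- quantile(mtcars$hp, c(0.25, 0.75))
iqr <- q[2] - q[1]
mtcars$hp[mtcars$hp < q[1] - 1.5 * iqr | mtcars$hp > q[2] + 1.5 * iqr]
```

The same pattern can be repeated for any numeric column of interest.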
Solution:
The detailed solution is as follows:
ms<-read.csv('mushrooms.csv')
summary(ms$bruises)
The output is as follows:
f t
4748 3376
set.seed(9560)
#downSample() comes from the caret package
library(caret)
undersampling <- downSample(x = ms[, -ncol(ms)], y = ms$bruises)
table(undersampling$bruises)
The output is as follows:
f t
3376 3376
set.seed(9560)
oversampling <- upSample(x = ms[, -ncol(ms)],y = ms$bruises)
table(oversampling$bruises)
The output is as follows:
f t
4748 4748
In this activity, we learned to use downSample() and upSample() from the caret package to perform downsampling and oversampling.
Solution:
The detailed solution is as follows:
#load the dataset
library(caret)
library(ROSE)
data(GermanCredit)
#View samples
head(GermanCredit)
str(GermanCredit)
#View the imbalanced data
summary(GermanCredit$Class)
The output is as follows:
Bad Good
300 700
balanced_data <- ROSE(Class ~ ., data = GermanCredit, seed=3)$data
table(balanced_data$Class)
The output is as follows:
Good Bad
480 520
Using the preceding example, we learned how to increase and decrease the class count using ROSE.
Solution:
#Binning a numeric feature
library(caret)
#Install caret if not installed
#install.packages('caret')
GermanCredit = read.csv("GermanCredit.csv")
duration<- GermanCredit$Duration #take the duration column
summary(duration)
The output is as follows:
library(ggplot2)
ggplot(data=GermanCredit, aes(x=Duration)) +
geom_density(fill='lightblue') +
geom_rug() +
labs(x='Duration')
The output is as follows:
#Creating Bins
# set up boundaries for intervals/bins
breaks <- c(0,10,20,30,40,50,60,70,80)
# specify interval/bin labels
labels <- c("<10", "10-20", "20-30", "30-40", "40-50", "50-60", "60-70", "70-80")
# bucketing data points into bins
bins <- cut(duration, breaks, include.lowest = T, right=FALSE, labels=labels)
# inspect bins
summary(bins)
The output is as follows:
<10 10-20 20-30 30-40 40-50 50-60 60-70 70-80
143 403 241 131 66 2 13 1
#Plotting the bins
plot(bins, main="Frequency of Duration", ylab="Duration Count", xlab="Duration Bins",col="bisque")
The output is as follows:
We can conclude that the largest group of customers falls in the 10 to 20 duration bin.
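The raw counts can be turned into shares to quantify that statement; a small sketch assuming the bins object from above:

```r
# Proportion of customers per duration bin, rounded to two decimals
round(prop.table(table(bins)), 2)
```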
Solution:
#Skewness
library(mlbench)
library(e1071)
PimaIndiansDiabetes = read.csv("PimaIndiansDiabetes.csv")
#Printing the skewness of the columns
#Not skewed
skewness(PimaIndiansDiabetes$glucose)
The output is as follows:
[1] 0.1730754
#histogram() is provided by the lattice package
library(lattice)
histogram(PimaIndiansDiabetes$glucose)
The output is as follows:
A negative skewness value means that the data is skewed to the left, and a positive value means that it is skewed to the right. Since the value here is only 0.17, the data is close to symmetric; that is, it is not meaningfully skewed.
#Highly skewed
skewness(PimaIndiansDiabetes$age)
The output is as follows:
[1] 1.125188
histogram(PimaIndiansDiabetes$age)
The output is as follows:
The positive skewness value means that the data is skewed to the right, as the preceding histogram shows.
Solution:
#PCA Analysis
#The GermanCredit dataset ships with the caret package
library(caret)
data(GermanCredit)
#Use the German Credit Data
GermanCredit_subset <- GermanCredit[,1:9]
#Find out the Principal components
principal_components <- prcomp(x = GermanCredit_subset, scale. = T)
#Print the principal components
print(principal_components)
The output is as follows:
Standard deviations (1, .., p=9):
[1] 1.3505916 1.2008442 1.1084157 0.9721503 0.9459586
0.9317018 0.9106746 0.8345178 0.5211137
Rotation (n x k) = (9 x 9):
Therefore, by using principal component analysis, we can compute the nine principal components of the dataset. These components are linear combinations of the original fields and can be used as features in their own right.
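To decide how many components to keep, it helps to look at the proportion of variance each one explains; a sketch assuming the principal_components object from above:

```r
# Proportion of variance explained by each principal component
summary(principal_components)
# Or compute it directly from the standard deviations
round(principal_components$sdev^2 / sum(principal_components$sdev^2), 3)
```

Components are ordered by variance explained, so the first few typically carry most of the signal.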
Solution:
#GermanCredit and varImp() are provided by the caret package
library(caret)
data(GermanCredit)
GermanCredit_subset <- GermanCredit[,1:10]
library(randomForest)
random_forest = randomForest(Class~., data=GermanCredit_subset)
# Create an importance based on mean decreasing gini
importance(random_forest)
The output is as follows:
MeanDecreaseGini
Duration 70.380265
Amount 121.458790
InstallmentRatePercentage 27.048517
ResidenceDuration 30.409254
Age 86.476017
NumberExistingCredits 18.746057
NumberPeopleMaintenance 12.026969
Telephone 15.581802
ForeignWorker 2.888387
varImp(random_forest)
The output is as follows:
Overall
Duration 70.380265
Amount 121.458790
InstallmentRatePercentage 27.048517
ResidenceDuration 30.409254
Age 86.476017
NumberExistingCredits 18.746057
NumberPeopleMaintenance 12.026969
Telephone 15.581802
ForeignWorker 2.888387
In this activity, we built a random forest model and used it to measure the importance of each variable in the dataset. Variables with higher scores are considered more important. Having done this, we can sort the variables by importance and choose the top 5 or top 10 for the model, or set an importance threshold and keep every variable that meets it.
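Selecting the top variables programmatically can be sketched as follows, assuming the random_forest model from above:

```r
# Sort variables by MeanDecreaseGini and keep the five most important
imp <- importance(random_forest)
top5 <- head(rownames(imp)[order(imp[, 1], decreasing = TRUE)], 5)
top5
```

The resulting names can then be used to subset the data before refitting a model.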
Solution:
install.packages("rpart")
library(rpart)
library(caret)
set.seed(10)
data(GermanCredit)
GermanCredit_subset <- GermanCredit[,1:10]
#Train a rpart model
rPartMod <- train(Class ~ ., data=GermanCredit_subset, method="rpart")
#Find variable importance
rpartImp <- varImp(rPartMod)
#Print variable importance
print(rpartImp)
The output is as follows:
rpart variable importance
Overall
Amount 100.000
Duration 89.670
Age 75.229
ForeignWorker 22.055
InstallmentRatePercentage 17.288
Telephone 7.813
ResidenceDuration 4.471
NumberExistingCredits 0.000
NumberPeopleMaintenance 0.000
#Plot top 5 variable importance
plot(rpartImp, top = 5, main='Variable Importance')
The output is as follows:
From the preceding plot, we can observe that Amount, Duration, and Age have high importance values.
Solution:
# Attach the packages
library(caret)
library(groupdata2)
library(neuralnet)
library(NeuralNetTools)
# Set seed for reproducibility and easier comparison
set.seed(1)
# Load the German Credit dataset
GermanCredit <- read.csv("GermanCredit.csv")
# Remove the Age column
GermanCredit$Age <- NULL
# Partition with same ratio of each class in all three partitions
partitions <- partition(GermanCredit, p = c(0.6, 0.2),
cat_col = "Class")
train_set <- partitions[[1]]
dev_set <- partitions[[2]]
valid_set <- partitions[[3]]
# Find scaling and centering parameters
params <- preProcess(train_set[, 1:6], method=c("center", "scale"))
# Transform the training set
train_set[, 1:6] <- predict(params, train_set[, 1:6])
# Transform the development set
dev_set[, 1:6] <- predict(params, dev_set[, 1:6])
# Transform the validation set
valid_set[, 1:6] <- predict(params, valid_set[, 1:6])
# Train the neural network classifier
nn <- neuralnet(Class == "Good" ~ InstallmentRatePercentage +
ResidenceDuration + NumberExistingCredits,
train_set, linear.output = FALSE)
# Plot the network
plotnet(nn, var_labs=FALSE)
The output is as follows:
train_error <- nn$result.matrix[1]
train_error
The output is as follows:
## [1] 62.15447
The random initialization of the neural network weights can lead to slightly different results from one training to another. To avoid this, we use the set.seed() function at the beginning of the script, which helps when comparing models. We could also train the same model architecture with five different seeds to get a better sense of its performance.
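The suggestion of training with several seeds can be sketched as a short loop; this is an illustration added here, not part of the original activity:

```r
# Train the same architecture under five seeds and compare training errors
errors <- sapply(1:5, function(s) {
  set.seed(s)
  nn_s <- neuralnet(Class == "Good" ~ InstallmentRatePercentage +
                      ResidenceDuration + NumberExistingCredits,
                    train_set, linear.output = FALSE)
  nn_s$result.matrix[1]   # final training error for this seed
})
mean(errors); sd(errors)
```

A small standard deviation across seeds suggests the architecture's performance is stable rather than a lucky initialization.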
Solution:
# Attach the packages
library(groupdata2)
library(caret)
library(neuralnet)
library(mlbench)
# Set seed for reproducibility and easier comparison
set.seed(1)
# Load the PimaIndiansDiabetes2 dataset
PimaIndiansDiabetes2 <- read.csv("PimaIndiansDiabetes2.csv")
summary(PimaIndiansDiabetes2)
The summary is as follows:
## pregnant glucose pressure triceps
## Min. : 0.000 Min. : 44.0 Min. : 24.00 Min. : 7.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 64.00 1st Qu.:22.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :29.00
## Mean : 3.845 Mean :121.7 Mean : 72.41 Mean :29.15
## 3rd Qu.: 6.000 3rd Qu.:141.0 3rd Qu.: 80.00 3rd Qu.:36.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## NA's :5 NA's :35 NA's :227
##
## insulin mass pedigree age
## Min. : 14.00 Min. :18.20 Min. :0.0780 Min. :21.00
## 1st Qu.: 76.25 1st Qu.:27.50 1st Qu.:0.2437 1st Qu.:24.00
## Median :125.00 Median :32.30 Median :0.3725 Median :29.00
## Mean :155.55 Mean :32.46 Mean :0.4719 Mean :33.24
## 3rd Qu.:190.00 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.00 Max. :67.10 Max. :2.4200 Max. :81.00
## NA's :374 NA's :11
##
## diabetes
## neg:500
## pos:268
# Assign/copy dataset to a new name
diabetes_data <- PimaIndiansDiabetes2
# Remove the triceps and insulin columns
diabetes_data$triceps <- NULL
diabetes_data$insulin <- NULL
# Remove all rows with NAs (missing data)
diabetes_data <- na.omit(diabetes_data)
# Partition with same ratio of each class in all three partitions
partitions <- partition(diabetes_data, p = c(0.6, 0.2),
cat_col = "diabetes")
train_set <- partitions[[1]]
dev_set <- partitions[[2]]
valid_set <- partitions[[3]]
# Find scaling and centering parameters
params <- preProcess(train_set[, 1:6], method = c("center", "scale"))
# Transform the training set
train_set[, 1:6] <- predict(params, train_set[, 1:6])
# Transform the development set
dev_set[, 1:6] <- predict(params, dev_set[, 1:6])
# Transform the validation set
valid_set[, 1:6] <- predict(params, valid_set[, 1:6])
# Training multiple neural nets
nn4 <- neuralnet(diabetes == "pos" ~ ., train_set,
linear.output = FALSE, hidden = c(3))
nn5 <- neuralnet(diabetes == "pos" ~ ., train_set,
linear.output = FALSE, hidden = c(2,1))
nn6 <- neuralnet(diabetes == "pos" ~ ., train_set,
linear.output = FALSE, hidden = c(3,2))
# Put the model objects into a list
models <- list("nn4"=nn4,"nn5"=nn5,"nn6"=nn6)
# Evaluating each model on the dev_set
# Create one-hot encoding of diabetes variable
dev_true_labels <- ifelse(dev_set$diabetes == "pos", 1, 0)
# Evaluate one model at a time in a loop, to avoid repeating the code
for (i in 1:length(models)){
# Predict the classes in the development set
dev_predicted_probabilities <- predict(models[[i]], dev_set)
dev_predictions <- ifelse(dev_predicted_probabilities > 0.5, 1, 0)
# Create confusion Matrix
confusion_matrix <- confusionMatrix(as.factor(dev_predictions),
as.factor(dev_true_labels),
mode="prec_recall",
positive = "1")
# Print the results for this model
# Note: paste0() concatenates the strings
# to (name of model + " on the dev...")
print( paste0( names(models)[[i]], " on the development set: "))
print(confusion_matrix)
}
The output is as follows:
## [1] "nn4 on the development set: "
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 79 19
## 1 16 30
##
## Accuracy : 0.7569
## 95% CI : (0.6785, 0.8245)
## No Information Rate : 0.6597
## P-Value [Acc > NIR] : 0.007584
##
## Kappa : 0.4505
## Mcnemar's Test P-Value : 0.735317
##
## Precision : 0.6522
## Recall : 0.6122
## F1 : 0.6316
## Prevalence : 0.3403
## Detection Rate : 0.2083
## Detection Prevalence : 0.3194
## Balanced Accuracy : 0.7219
##
## 'Positive' Class : 1
##
## [1] "nn5 on the development set: "
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 77 16
## 1 18 33
##
## Accuracy : 0.7639
## 95% CI : (0.686, 0.8306)
## No Information Rate : 0.6597
## P-Value [Acc > NIR] : 0.004457
##
## Kappa : 0.4793
## Mcnemar's Test P-Value : 0.863832
##
## Precision : 0.6471
## Recall : 0.6735
## F1 : 0.6600
## Prevalence : 0.3403
## Detection Rate : 0.2292
## Detection Prevalence : 0.3542
## Balanced Accuracy : 0.7420
##
## 'Positive' Class : 1
##
## [1] "nn6 on the development set: "
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 76 14
## 1 19 35
##
## Accuracy : 0.7708
## 95% CI : (0.6935, 0.8367)
## No Information Rate : 0.6597
## P-Value [Acc > NIR] : 0.002528
##
## Kappa : 0.5019
## Mcnemar's Test P-Value : 0.486234
##
## Precision : 0.6481
## Recall : 0.7143
## F1 : 0.6796
## Prevalence : 0.3403
## Detection Rate : 0.2431
## Detection Prevalence : 0.3750
## Balanced Accuracy : 0.7571
##
## 'Positive' Class : 1
# Create one-hot encoding of Class variable
valid_true_labels <- ifelse(valid_set$diabetes == "pos", 1, 0)
# Predict the classes in the validation set
predicted_probabilities <- predict(nn6, valid_set)
predictions <- ifelse(predicted_probabilities > 0.5, 1, 0)
# Create confusion Matrix
confusion_matrix <- confusionMatrix(as.factor(predictions),
as.factor(valid_true_labels),
mode="prec_recall", positive = "1")
# Print the results for this model
# Note that by separating two function calls by ";"
# we can have multiple calls per line
print("nn6 on the validation set:"); print(confusion_matrix)
The output is as follows:
## [1] "nn6 on the validation set:"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 70 16
## 1 25 35
##
## Accuracy : 0.7192
## 95% CI : (0.6389, 0.7903)
## No Information Rate : 0.6507
## P-Value [Acc > NIR] : 0.04779
##
## Kappa : 0.4065
## Mcnemar's Test P-Value : 0.21152
##
## Precision : 0.5833
## Recall : 0.6863
## F1 : 0.6306
## Prevalence : 0.3493
## Detection Rate : 0.2397
## Detection Prevalence : 0.4110
## Balanced Accuracy : 0.7116
##
## 'Positive' Class : 1
plotnet(nn6, var_labs=FALSE)
The output will look as follows:
In this activity, we have trained multiple neural network architectures and evaluated the best model on the validation set.
Solution:
# Attach the packages
library(groupdata2)
library(caret)
library(neuralnet)
library(mlbench)
# Set seed for reproducibility and easier comparison
set.seed(1)
# Load the PimaIndiansDiabetes2 dataset
data(PimaIndiansDiabetes2)
Start by assigning the dataset to a new name.
# Handling missing data (quick solution)
# Assign/copy dataset to a new name
diabetes_data <- PimaIndiansDiabetes2
# Remove the triceps and insulin columns
diabetes_data$triceps <- NULL
diabetes_data$insulin <- NULL
# Remove all rows with NAs (missing data)
diabetes_data <- na.omit(diabetes_data)
# Partition into a training set and a validation set
partitions <- partition(diabetes_data, p = 0.8, cat_col = "diabetes")
train_set <- partitions[[1]]
valid_set <- partitions[[2]]
# Find scaling and centering parameters
# Note: We could also decide to do this inside the training loop!
params <- preProcess(train_set[, 1:6], method=c("center", "scale"))
# Transform the training set
train_set[, 1:6] <- predict(params, train_set[, 1:6])
# Transform the validation set
valid_set[, 1:6] <- predict(params, valid_set[, 1:6])
# Create folds for cross-validation
# Balance on the Class variable
train_set <- fold(train_set, k=4, cat_col = "diabetes")
# Note: This creates a factor in the dataset called ".folds"
# Take care not to use this as a predictor.
## Cross-validation loop
# Change the model formula in the loop and run the below
# for each model architecture you're testing
# Initialize vectors for collecting errors and accuracies
errors <- c()
accuracies <- c()
Start the training for loop. We have 4 folds, so we need 4 iterations.
# Training loop
for (part in 1:4){
# Assign the chosen fold as test set
# and the rest of the folds as train set
cv_test_set <- train_set[train_set$.folds == part,]
cv_train_set <- train_set[train_set$.folds != part,]
# Train neural network classifier
# Make sure not to include the .folds column as a predictor!
nn <- neuralnet(diabetes == "pos" ~ .,
cv_train_set[, 1:7],
linear.output = FALSE,
hidden=c(2,2))
# Append error to errors vector
errors <- append(errors, nn$result.matrix[1])
# Create one-hot encoding of Class variable
true_labels <- ifelse(cv_test_set$diabetes == "pos", 1, 0)
# Predict the class in the test set
# It returns probabilities that the observations are "pos"
predicted_probabilities <- predict(nn, cv_test_set)
predictions <- ifelse(predicted_probabilities > 0.5, 1, 0)
# Calculate accuracy manually
# Note: TRUE == 1, FALSE == 0
cv_accuracy <- sum(true_labels == predictions) / length(true_labels)
# Append the accuracy to the accuracies vector
accuracies <- append(accuracies, cv_accuracy)
}
# Calculate average error and accuracy
# Note that we could also have gathered the predictions from all the
# folds and calculated the accuracy only once. This could lead to slightly
# different results, e.g. if the folds are not exactly the same size.
average_error <- mean(errors)
average_error
The output is as follows:
## [1] 28.38503
average_accuracy <- mean(accuracies)
average_accuracy
The output is as follows:
## [1] 0.7529813
# Once you have chosen the best model, train it on the entire training set
# and evaluate on the validation set
# Note that we set the stepmax, to make sure
# it has enough training steps to converge
nn_best <- neuralnet(diabetes == "pos" ~ .,
train_set[, 1:7],
linear.output = FALSE,
hidden=c(2,2),
stepmax = 2e+05)
# Find the true labels in the validation set
valid_true_labels <- ifelse(valid_set$diabetes == "pos", 1, 0)
# Predict the classes in the validation set
predicted_probabilities <- predict(nn_best, valid_set)
predictions <- ifelse(predicted_probabilities > 0.5, 1, 0)
# Create confusion matrix
confusion_matrix <- confusionMatrix(as.factor(predictions),
as.factor(valid_true_labels),
mode="prec_recall", positive = "1")
# Print the results for this model
print("nn_best on the validation set:")
## [1] "nn_best on the validation set:"
print(confusion_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 78 20
## 1 17 30
##
## Accuracy : 0.7448
## 95% CI : (0.6658, 0.8135)
## No Information Rate : 0.6552
## P-Value [Acc > NIR] : 0.01302
##
## Kappa : 0.4271
## Mcnemar's Test P-Value : 0.74231
##
## Precision : 0.6383
## Recall : 0.6000
## F1 : 0.6186
## Prevalence : 0.3448
## Detection Rate : 0.2069
## Detection Prevalence : 0.3241
## Balanced Accuracy : 0.7105
##
## 'Positive' Class : 1
##
plotnet(nn_best, var_labs=FALSE)
The output will be as follows:
Solution:
# Attach packages
library(groupdata2)
library(cvms)
library(caret)
library(knitr)
# Set seed for reproducibility and easy comparison
set.seed(1)
# Load the cars dataset
data(cars)
# Partition the dataset
partitions <- partition(cars, p = 0.8)
train_set <- partitions[[1]]
valid_set <- partitions[[2]]
# Fit a couple of linear models and interpret them
# Model 1 - Predicting price by mileage
model_1 <- lm(Price ~ Mileage, data = train_set)
summary(model_1)
The summary of the model is as follows:
##
## Call:
## lm(formula = Price ~ Mileage, data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12977 -7400 -3453 6009 45540
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.496e+04 1.019e+03 24.488 < 2e-16 ***
## Mileage -1.736e-01 4.765e-02 -3.644 0.000291 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9762 on 641 degrees of freedom
## Multiple R-squared: 0.02029, Adjusted R-squared: 0.01876
## F-statistic: 13.28 on 1 and 641 DF, p-value: 0.0002906
# Model 2 - Predicting price by number of doors
model_2 <- lm(Price ~ Doors, data = train_set)
summary(model_2)
The summary of the model is as follows:
##
## Call:
## lm(formula = Price ~ Doors, data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12540 -7179 -2934 5814 45805
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25682.2 1662.1 15.452 <2e-16 ***
## Doors -1176.6 457.5 -2.572 0.0103 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9812 on 641 degrees of freedom
## Multiple R-squared: 0.01021, Adjusted R-squared: 0.008671
## F-statistic: 6.615 on 1 and 641 DF, p-value: 0.01034
# Model 3 - Predicting price by mileage and number of doors
model_3 <- lm(Price ~ Mileage + Doors, data = train_set)
summary(model_3)
The summary of the model is as follows:
##
## Call:
## lm(formula = Price ~ Mileage + Doors, data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12642 -7503 -3000 5595 43576
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.945e+04 1.926e+03 15.292 < 2e-16 ***
## Mileage -1.786e-01 4.744e-02 -3.764 0.000182 ***
## Doors -1.242e+03 4.532e+02 -2.740 0.006308 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9713 on 640 degrees of freedom
## Multiple R-squared: 0.03165, Adjusted R-squared: 0.02863
## F-statistic: 10.46 on 2 and 640 DF, p-value: 3.388e-05
# Create list of model formulas with combine_predictors()
# Use only the 4 first predictors (to save time)
# Limit the number of fixed effects (predictors) to 3,
# Limit the biggest possible interaction to a 2-way interaction
# Limit the number of times a fixed effect is included to 1
model_formulas <- combine_predictors(
dependent = "Price",
fixed_effects = c("Mileage", "Cylinder",
"Doors", "Cruise"),
max_fixed_effects = 3,
max_interaction_size = 2,
max_effect_frequency = 1)
# Output model formulas
model_formulas
The output is as follows:
## [1] "Price ~ Cruise"
## [2] "Price ~ Cylinder"
## [3] "Price ~ Doors"
## [4] "Price ~ Mileage"
## [5] "Price ~ Cruise * Cylinder"
## [6] "Price ~ Cruise * Doors"
## [7] "Price ~ Cruise * Mileage"
## [8] "Price ~ Cruise + Cylinder"
## [9] "Price ~ Cruise + Doors"
## [10] "Price ~ Cruise + Mileage"
## [11] "Price ~ Cylinder * Doors"
## [12] "Price ~ Cylinder * Mileage"
## [13] "Price ~ Cylinder + Doors"
## [14] "Price ~ Cylinder + Mileage"
## [15] "Price ~ Doors * Mileage"
## [16] "Price ~ Doors + Mileage"
## [17] "Price ~ Cruise * Cylinder + Doors"
## [18] "Price ~ Cruise * Cylinder + Mileage"
## [19] "Price ~ Cruise * Doors + Cylinder"
## [20] "Price ~ Cruise * Doors + Mileage"
## [21] "Price ~ Cruise * Mileage + Cylinder"
## [22] "Price ~ Cruise * Mileage + Doors"
## [23] "Price ~ Cruise + Cylinder * Doors"
## [24] "Price ~ Cruise + Cylinder * Mileage"
## [25] "Price ~ Cruise + Cylinder + Doors"
## [26] "Price ~ Cruise + Cylinder + Mileage"
## [27] "Price ~ Cruise + Doors * Mileage"
## [28] "Price ~ Cruise + Doors + Mileage"
## [29] "Price ~ Cylinder * Doors + Mileage"
## [30] "Price ~ Cylinder * Mileage + Doors"
## [31] "Price ~ Cylinder + Doors * Mileage"
## [32] "Price ~ Cylinder + Doors + Mileage"
# Create 5 fold columns with 4 folds each in the training set
train_set <- fold(train_set, k = 4,
num_fold_cols = 5)
# Create list of fold column names
fold_cols <- paste0(".folds_", 1:5)
# Cross-validate the models with cvms
CV_results <- cross_validate(train_set,
models = model_formulas,
fold_cols = fold_cols,
family = "gaussian")
# Select the best model by RMSE
# Order by RMSE
CV_results <- CV_results[order(CV_results$RMSE),]
# Select the 10 best performing models for printing
# (Feel free to view all the models)
CV_results_top10 <- head(CV_results, 10)
# Show metrics and model definition columns
# Use kable for a prettier output
kable(select_metrics(CV_results_top10), digits = 2)
The output is as follows:
# Evaluate the best model on the validation set with validate()
V_results <- validate(
train_data = train_set,
test_data = valid_set,
models = "Price ~ Cruise * Cylinder + Mileage",
family = "gaussian")
valid_results <- V_results$Results
valid_model <- V_results$Models[[1]]
# Print the results
kable(select_metrics(valid_results), digits = 2)
The output is as follows:
# Print the model summary and interpret it
summary(valid_model)
The summary is as follows:
##
## Call:
## lm(formula = model_formula, data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10485 -5495 -1425 3494 34693
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8993.2446 3429.9320 2.622 0.00895 **
## Cruise -1311.6871 3585.6289 -0.366 0.71462
## Cylinder 1809.5447 741.9185 2.439 0.01500 *
## Mileage -0.1569 0.0367 -4.274 2.21e-05 ***
## Cruise:Cylinder 1690.0768 778.7838 2.170 0.03036 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7503 on 638 degrees of freedom
## Multiple R-squared: 0.424, Adjusted R-squared: 0.4203
## F-statistic: 117.4 on 4 and 638 DF, p-value: < 2.2e-16
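The RMSE that ranks the models above is simple to compute by hand. A minimal base-R sketch, using made-up `actual`/`predicted` vectors (the real values come from `cross_validate()`'s predictions):

```r
# Toy values for illustration only
actual    <- c(21000, 15500, 30250, 12800)
predicted <- c(20000, 16000, 29000, 14000)

# Root mean square error: mean of the squared residuals, then square root
rmse <- sqrt(mean((actual - predicted)^2))
rmse
```

Lower RMSE means the model's predictions sit closer to the observed prices, which is why the results table is ordered by it.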
Solution:
library(groupdata2)
library(cvms)
library(caret)
library(randomForest)
library(rPref)
library(doParallel)
set.seed(3)
# Load the amsterdam.listings dataset
full_data <- read.csv("amsterdam.listings.csv")
full_data$id <- factor(full_data$id)
full_data$neighbourhood <- factor(full_data$neighbourhood)
summary(full_data)
The summary of data is as follows:
## id neighbourhood room_type
## 2818 : 1 De Baarsjes - Oud-West :3052 Entire home/apt:13724
## 20168 : 1 De Pijp - Rivierenbuurt:2166 Private room : 3594
## 25428 : 1 Centrum-West :2019
## 27886 : 1 Centrum-Oost :1498
## 28658 : 1 Westerpark :1315
## 28871 : 1 Zuid :1187
## (Other):17312 (Other) :6081
## availability_365 log_price log_minimum_nights log_number_of_reviews
## Min. : 0.00 Min. :2.079 Min. :0.0000 Min. :0.000
## 1st Qu.: 0.00 1st Qu.:4.595 1st Qu.:0.6931 1st Qu.:1.386
## Median : 0.00 Median :4.852 Median :0.6931 Median :2.398
## Mean : 48.33 Mean :4.883 Mean :0.8867 Mean :2.370
## 3rd Qu.: 49.00 3rd Qu.:5.165 3rd Qu.:1.0986 3rd Qu.:3.219
## Max. :365.00 Max. :7.237 Max. :4.7875 Max. :6.196
##
## log_reviews_per_month
## Min. :-4.60517
## 1st Qu.:-1.42712
## Median :-0.61619
## Mean :-0.67858
## 3rd Qu.: 0.07696
## Max. : 2.48907
##
partitions <- partition(full_data, p = 0.8,
cat_col = "room_type")
train_set <- partitions[[1]]
valid_set <- partitions[[2]]
# Register four CPU cores
registerDoParallel(4)
room_type_baselines <- baseline(test_data = valid_set,
dependent_col = "room_type",
n = 100,
family = "binomial",
parallel = TRUE)
# Inspect summarized metrics
room_type_baselines$summarized_metrics
The output is as follows:
## # A tibble: 10 x 15
## Measure 'Balanced Accur… F1 Sensitivity Specificity
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Mean 0.502 0.295 0.503 0.500
## 2 Median 0.502 0.294 0.502 0.500
## 3 SD 0.0101 0.00954 0.0184 0.00924
## 4 IQR 0.0154 0.0131 0.0226 0.0122
## 5 Max 0.525 0.317 0.551 0.522
## 6 Min 0.480 0.275 0.463 0.480
## 7 NAs 0 0 0 0
## 8 INFs 0 0 0 0
## 9 All_0 0.5 NA 0 1
## 10 All_1 0.5 0.344 1 0
## # … with 10 more variables: 'Pos Pred Value' <dbl>, 'Neg Pred
## # Value' <dbl>, AUC <dbl>, 'Lower CI' <dbl>, 'Upper CI' <dbl>,
## # Kappa <dbl>, MCC <dbl>, 'Detection Rate' <dbl>, 'Detection
## # Prevalence' <dbl>, Prevalence <dbl>
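The baseline's balanced accuracy of roughly 0.5 is what random guessing yields. A minimal base-R sketch of that intuition (the labels, class proportions, and seed here are made up; `baseline()` itself is more thorough):

```r
set.seed(1)
# Made-up true labels with a similar class imbalance to the data
truth <- factor(sample(c("Entire home/apt", "Private room"),
                       3000, replace = TRUE, prob = c(0.79, 0.21)))
# Random predictions that ignore the features entirely
guess <- factor(sample(levels(truth), 3000, replace = TRUE),
                levels = levels(truth))
# Sensitivity and specificity with "Private room" as the positive class
sensitivity <- mean(guess[truth == "Private room"] == "Private room")
specificity <- mean(guess[truth == "Entire home/apt"] == "Entire home/apt")
balanced_accuracy <- (sensitivity + specificity) / 2
balanced_accuracy  # hovers around 0.5
```

A useful model must clearly beat this before its metrics mean anything.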
logit_model_1 <- glm("room_type ~ log_number_of_reviews",
data = train_set, family="binomial")
summary(logit_model_1)
The summary of the model is as follows:
##
## Call:
## glm(formula = "room_type ~ log_number_of_reviews", family = "binomial",
## data = train_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3030 -0.7171 -0.5668 -0.3708 2.3288
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.64285 0.05374 -49.18 <2e-16 ***
## log_number_of_reviews 0.49976 0.01745 28.64 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 14149 on 13853 degrees of freedom
## Residual deviance: 13249 on 13852 degrees of freedom
## AIC: 13253
##
## Number of Fisher Scoring iterations: 4
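The logistic coefficients above are on the log-odds scale; exponentiating one gives an odds ratio. A quick base-R check for the `log_number_of_reviews` estimate:

```r
# Coefficient for log_number_of_reviews from the summary above
beta <- 0.49976
# A one-unit increase in log_number_of_reviews multiplies the
# odds of the positive class by exp(beta) -- about 1.65
odds_ratio <- exp(beta)
round(odds_ratio, 3)
```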
logit_model_2 <- glm(
"room_type ~ availability_365 + log_number_of_reviews",
data = train_set, family = "binomial")
summary(logit_model_2)
The summary of the model is as follows:
##
## Call:
## glm(formula = "room_type ~ availability_365 + log_number_of_reviews",
## family = "binomial", data = train_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5277 -0.6735 -0.5535 -0.3688 2.3365
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.6622105 0.0533345 -49.91 <2e-16 ***
## availability_365 0.0039866 0.0002196 18.16 <2e-16 ***
## log_number_of_reviews 0.4172148 0.0178015 23.44 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 14149 on 13853 degrees of freedom
## Residual deviance: 12935 on 13851 degrees of freedom
## AIC: 12941
##
## Number of Fisher Scoring iterations: 4
logit_model_3 <- glm(
"room_type ~ availability_365 + log_number_of_reviews + log_price",
data = train_set, family = "binomial")
summary(logit_model_3)
The summary of the model is as follows:
##
## Call:
## glm(formula = "room_type ~ availability_365 + log_number_of_reviews + log_price",
## family = "binomial", data = train_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.7678 -0.5805 -0.3395 -0.1208 3.9864
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 12.6730455 0.3419771 37.06 <2e-16 ***
## availability_365 0.0081215 0.0002865 28.34 <2e-16 ***
## log_number_of_reviews 0.3613055 0.0199845 18.08 <2e-16 ***
## log_price -3.2539506 0.0745417 -43.65 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 14149.2 on 13853 degrees of freedom
## Residual deviance: 9984.7 on 13850 degrees of freedom
## AIC: 9992.7
##
## Number of Fisher Scoring iterations: 6
logit_model_4 <- glm(
"room_type ~ availability_365 + log_number_of_reviews + log_price + log_minimum_nights",
data = train_set, family = "binomial")
summary(logit_model_4)
The summary of the model is as follows:
##
## Call:
## glm(formula = "room_type ~ availability_365 + log_number_of_reviews + log_price + log_minimum_nights",
## family = "binomial", data = train_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.6268 -0.5520 -0.3142 -0.1055 4.6062
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 13.470868 0.354695 37.98 <2e-16 ***
## availability_365 0.008310 0.000295 28.17 <2e-16 ***
## log_number_of_reviews 0.360133 0.020422 17.64 <2e-16 ***
## log_price -3.252957 0.076343 -42.61 <2e-16 ***
## log_minimum_nights -1.007354 0.051131 -19.70 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 14149.2 on 13853 degrees of freedom
## Residual deviance: 9543.9 on 13849 degrees of freedom
## AIC: 9553.9
##
## Number of Fisher Scoring iterations: 6
logit_model_5 <- glm(
"room_type ~ availability_365 + log_reviews_per_month + log_price + log_minimum_nights",
data = train_set, family = "binomial")
summary(logit_model_5)
The summary of the model is as follows:
##
## Call:
## glm(formula = "room_type ~ availability_365 + log_reviews_per_month + log_price + log_minimum_nights",
## family = "binomial", data = train_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.7351 -0.5229 -0.2934 -0.0968 4.7252
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 14.7495851 0.3652303 40.38 <2e-16 ***
## availability_365 0.0074308 0.0003019 24.61 <2e-16 ***
## log_reviews_per_month 0.6364218 0.0246423 25.83 <2e-16 ***
## log_price -3.2850702 0.0781567 -42.03 <2e-16 ***
## log_minimum_nights -0.8504701 0.0526379 -16.16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 14149 on 13853 degrees of freedom
## Residual deviance: 9103 on 13849 degrees of freedom
## AIC: 9113
##
## Number of Fisher Scoring iterations: 6
model_formulas <- combine_predictors(
dependent = "room_type",
fixed_effects = c("log_minimum_nights",
"log_number_of_reviews",
"log_price",
"availability_365",
"log_reviews_per_month"),
max_interaction_size = 2,
max_effect_frequency = 1)
head(model_formulas, 10)
The output is as follows:
## [1] "room_type ~ availability_365"
## [2] "room_type ~ log_minimum_nights"
## [3] "room_type ~ log_number_of_reviews"
## [4] "room_type ~ log_price"
## [5] "room_type ~ log_reviews_per_month"
## [6] "room_type ~ availability_365 * log_minimum_nights"
## [7] "room_type ~ availability_365 * log_number_of_reviews"
## [8] "room_type ~ availability_365 * log_price"
## [9] "room_type ~ availability_365 * log_reviews_per_month"
## [10] "room_type ~ availability_365 + log_minimum_nights"
train_set <- fold(train_set, k = 5,
num_fold_cols = 5,
cat_col = "room_type")
initial_cv_results <- cross_validate(
train_set,
models = model_formulas,
fold_cols = ".folds_1",
family = "binomial",
parallel = TRUE)
initial_cv_results <- initial_cv_results[
order(initial_cv_results$F1, decreasing = TRUE),]
head(initial_cv_results, 10)
The output is as follows:
## # A tibble: 10 x 26
## 'Balanced Accur… F1 Sensitivity Specificity 'Pos Pred Value'
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.764 0.662 0.567 0.962 0.797
## 2 0.764 0.662 0.565 0.963 0.798
## 3 0.763 0.662 0.563 0.963 0.801
## 4 0.763 0.661 0.563 0.963 0.800
## 5 0.761 0.654 0.564 0.958 0.778
## 6 0.757 0.653 0.549 0.966 0.807
## 7 0.757 0.652 0.549 0.965 0.804
## 8 0.758 0.649 0.560 0.957 0.774
## 9 0.756 0.649 0.550 0.962 0.792
## 10 0.758 0.649 0.559 0.957 0.775
## # … with 21 more variables: 'Neg Pred Value' <dbl>, AUC <dbl>, 'Lower
## # CI' <dbl>, 'Upper CI' <dbl>, Kappa <dbl>, MCC <dbl>, 'Detection
## # Rate' <dbl>, 'Detection Prevalence' <dbl>, Prevalence <dbl>,
## # Predictions <list>, ROC <list>, 'Confusion Matrix' <list>,
## # Coefficients <list>, Folds <int>, 'Fold Columns' <int>, 'Convergence
## # Warnings' <dbl>, 'Singular Fit Messages' <int>, Family <chr>,
## # Link <chr>, Dependent <chr>, Fixed <chr>
# Reconstruct the best 20 models' formulas
reconstructed_formulas <- reconstruct_formulas(
initial_cv_results,
topn = 20)
# Create fold_cols
fold_cols <- paste0(".folds_", 1:5)
# Perform repeated cross-validation
repeated_cv_results <- cross_validate(
train_set,
models = reconstructed_formulas,
fold_cols = fold_cols,
family = "binomial",
parallel = TRUE)
# Order by F1
repeated_cv_results <- repeated_cv_results[
order(repeated_cv_results$F1, decreasing = TRUE),]
# Inspect the best models' results
head(repeated_cv_results)
The output is as follows:
## # A tibble: 6 x 27
## 'Balanced Accur… F1 Sensitivity Specificity 'Pos Pred Value'
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.764 0.662 0.566 0.962 0.796
## 2 0.763 0.661 0.564 0.963 0.798
## 3 0.763 0.660 0.562 0.963 0.800
## 4 0.762 0.659 0.561 0.963 0.800
## 5 0.761 0.654 0.563 0.958 0.780
## 6 0.758 0.654 0.551 0.965 0.805
## # … with 22 more variables: 'Neg Pred Value' <dbl>, AUC <dbl>, 'Lower
## # CI' <dbl>, 'Upper CI' <dbl>, Kappa <dbl>, MCC <dbl>, 'Detection
## # Rate' <dbl>, 'Detection Prevalence' <dbl>, Prevalence <dbl>,
## # Predictions <list>, ROC <list>, 'Confusion Matrix' <list>,
## # Coefficients <list>, Results <list>, Folds <int>, 'Fold
## # Columns' <int>, 'Convergence Warnings' <dbl>, 'Singular Fit
## # Messages' <int>, Family <chr>, Link <chr>, Dependent <chr>,
## # Fixed <chr>
# Find the Pareto front
front <- psel(repeated_cv_results,
pref = high(F1) * high(`Balanced Accuracy`))
# Remove rows with NA in F1 or Balanced Accuracy
front <- front[complete.cases(front[1:2]), ]
# Inspect front
front
The output is as follows:
## # A tibble: 1 x 27
## 'Balanced Accur… F1 Sensitivity Specificity 'Pos Pred Value'
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.764 0.662 0.566 0.962 0.796
## # … with 22 more variables: 'Neg Pred Value' <dbl>, AUC <dbl>, 'Lower
## # CI' <dbl>, 'Upper CI' <dbl>, Kappa <dbl>, MCC <dbl>, 'Detection
## # Rate' <dbl>, 'Detection Prevalence' <dbl>, Prevalence <dbl>,
## # Predictions <list>, ROC <list>, 'Confusion Matrix' <list>,
## # Coefficients <list>, Results <list>, Folds <int>, 'Fold
## # Columns' <int>, 'Convergence Warnings' <dbl>, 'Singular Fit
## # Messages' <int>, Family <chr>, Link <chr>, Dependent <chr>,
## # Fixed <chr>
The best model according to F1 is also the best model by balanced accuracy, so the Pareto front only contains one model.
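Pareto dominance with two maximized metrics can be sketched in base R (the scores below are toy values, not the actual CV results):

```r
# Toy model scores: both metrics are maximized
scores <- data.frame(F1 = c(0.662, 0.661, 0.654),
                     BA = c(0.764, 0.763, 0.761))
# A row is dominated if some other row is at least as good on both
# metrics and strictly better on at least one
dominated <- sapply(seq_len(nrow(scores)), function(i) {
  any(sapply(seq_len(nrow(scores)), function(j) {
    j != i &&
      all(scores[j, ] >= scores[i, ]) &&
      any(scores[j, ] > scores[i, ])
  }))
})
scores[!dominated, ]  # the Pareto front: here only the first row
```

`psel()` performs the same selection for us, which is why a model that is best on both metrics leaves a front of size one.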
# Create ggplot object
# with F1 on the x-axis and balanced accuracy on the y-axis
ggplot(repeated_cv_results, aes(x = F1, y = `Balanced Accuracy`)) +
# Add the models as points
geom_point(shape = 1, size = 0.5) +
# Add the nondominated models as larger points
geom_point(data = front, size = 3) +
# Add a line to visualize the Pareto front
geom_step(data = front, direction = "vh") +
# Add the light theme
theme_light()
The output is similar to the following:
# Reconstruct the formulas for the front models
reconstructed_formulas <- reconstruct_formulas(front)
# Validate the models in the Pareto front
v_results_list <- validate(train_data = train_set,
test_data = valid_set,
models = reconstructed_formulas,
family = "binomial")
# Assign the results and model(s) to variable names
v_results <- v_results_list$Results
v_model <- v_results_list$Models[[1]]
v_results
The output is as follows:
## # A tibble: 1 x 24
## 'Balanced Accur… F1 Sensitivity Specificity 'Pos Pred Value'
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.758 0.652 0.554 0.962 0.794
## # … with 19 more variables: 'Neg Pred Value' <dbl>, AUC <dbl>, 'Lower
## # CI' <dbl>, 'Upper CI' <dbl>, Kappa <dbl>, MCC <dbl>, 'Detection
## # Rate' <dbl>, 'Detection Prevalence' <dbl>, Prevalence <dbl>,
## # Predictions <list>, ROC <list>, 'Confusion Matrix' <list>,
## # Coefficients <list>, 'Convergence Warnings' <dbl>, 'Singular Fit
## # Messages' <dbl>, Family <chr>, Link <chr>, Dependent <chr>,
## # Fixed <chr>
These results are considerably better than the baseline on both F1 and balanced accuracy.
summary(v_model)
The summary of the model is as follows:
##
## Call:
## glm(formula = model_formula, family = binomial(link = link),
## data = train_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.8323 -0.4836 -0.2724 -0.0919 3.9091
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 15.3685268 0.4511978 34.062 < 2e-16 ***
## availability_365 -0.0140209 0.0030623 -4.579 4.68e-06 ***
## log_price -3.4441520 0.0956189 -36.020 < 2e-16 ***
## log_minimum_nights -0.7163252 0.0535452 -13.378 < 2e-16 ***
## log_number_of_reviews -0.0823821 0.0282115 -2.920 0.0035 **
## log_reviews_per_month 0.0733808 0.0381629 1.923 0.0545 .
## availability_365:log_price 0.0042772 0.0006207 6.891 5.53e-12 ***
## log_n_o_reviews:log_r_p_month 0.3730603 0.0158122 23.593 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 14149.2 on 13853 degrees of freedom
## Residual deviance: 8476.7 on 13846 degrees of freedom
## AIC: 8492.7
##
## Number of Fisher Scoring iterations: 6
Note that we have shortened log_number_of_reviews and log_reviews_per_month in the interaction term, log_n_o_reviews:log_r_p_month. The coefficients for the two interaction terms are both statistically significant. The interaction term log_n_o_reviews:log_r_p_month tells us that when log_number_of_reviews increases by one unit, the coefficient for log_reviews_per_month increases by 0.37, and vice versa. We might question whether it is meaningful to include both of these predictors in the model, as they carry overlapping information. If we also had the number of months a listing had been listed, we could recreate log_number_of_reviews from log_reviews_per_month and that count, which would likely be easier to interpret as well.
The second interaction term, availability_365:log_price, tells us that when availability_365 increases by a single unit, the coefficient for log_price increases by 0.004, and vice versa. The coefficient estimate for log_price is -3.44, meaning that when availability_365 is low, a higher log_price decreases the probability that the listing is a private room. This fits with the intuition that a private room is usually cheaper than an entire home/apartment.
The coefficient for log_minimum_nights tells us that when there is a higher minimum requirement for the number of nights when we book the listing, there's a lower probability that the listing is a private room.
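One way to read the availability_365:log_price interaction is that the effective log_price slope depends on availability_365. A base-R sketch using the coefficient estimates from the summary above (the `effective_slope` helper is our own illustration, not part of the model object):

```r
# Estimates taken from the model summary above
b_log_price   <- -3.4441520
b_interaction <-  0.0042772
# Effective log_price slope at a given availability_365
effective_slope <- function(availability_365) {
  b_log_price + b_interaction * availability_365
}
effective_slope(0)    # rarely available listings: strongly negative
effective_slope(365)  # always-available listings: less negative
```

At both extremes the slope stays negative, consistent with higher prices lowering the probability of a private room, but the effect weakens as availability grows.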
Solution:
library(cluster)
library(factoextra)
df <- read.csv("mtcars.csv")
rownames(df) <- df$X
df$X <- NULL
The row names (car models) become a column, X, when the data frame is saved as a CSV file, so we need to restore them, as the row names are used in the plot in step 7.
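This round trip can be reproduced with base R, using the built-in mtcars dataset and a temporary file:

```r
tmp <- tempfile(fileext = ".csv")
write.csv(mtcars, tmp)   # row names become the first, unnamed column
df2 <- read.csv(tmp)     # ...which read.csv labels "X"
rownames(df2) <- df2$X   # restore the car models as row names
df2$X <- NULL
identical(rownames(df2), rownames(mtcars))  # TRUE
```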
df <- na.omit(df)
df <- scale(df)
dv <- diana(df, metric = "manhattan", stand = TRUE)
plot(dv)
The output is as follows:
The next plot is as follows:
agn <- agnes(df)
pltree(agn)
The output is as follows:
fviz_nbclust(df, kmeans, method = "wss") +
geom_vline(xintercept = 4, linetype = 2) +
labs(subtitle = "Elbow method")
The output is as follows:
k4 <- kmeans(df, centers = 4, nstart = 20)
fviz_cluster(k4, data = df)
The output is as follows:
If we consider cutting the DIANA tree at height 20, the Ferrari is clustered together with the Ford and the Maserati (the smallest cluster):
Meanwhile, cutting the AGNES dendrogram at height 4 results in the Ferrari being clustered with the Mazda RX4, the Mazda RX4 Wag, and the Porsche. k-means clusters the Ferrari with the Mazdas, the Ford, and the Maserati.
Clearly, the choice of clustering technique and algorithm results in different clusters being created. It is important to apply some domain knowledge to determine which result is the most valuable.
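Cutting a dendrogram at a chosen height (or into a chosen number of clusters) is done with cutree(). A base-R sketch on the built-in mtcars, with hclust() standing in for DIANA/AGNES since it needs no extra packages:

```r
scaled <- scale(mtcars)
hc <- hclust(dist(scaled))     # agglomerative clustering, like AGNES
clusters <- cutree(hc, k = 4)  # or cut at a height with h = ...
# Which cars share a cluster with the Ferrari Dino?
names(clusters)[clusters == clusters["Ferrari Dino"]]
```

Changing `k` (or `h`) moves the cut up or down the tree, which is exactly what produces the different Ferrari groupings discussed above.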