We will use caret to try different classifiers at once:
library(caret)
We first need to prepare the training scheme:
control <- trainControl(method="cv", number=10)
Note that the repeats argument only applies to method="repeatedcv"; with plain method="cv" it is ignored, so here we simply request 10 folds.
And we will set up the different models to try:
- Random forest
- Gradient boosting machines
- Logit boost
- Naive Bayes
Do not forget to set the seed for the random number generator before training each model; resetting it to the same value ensures that every model is evaluated on identical cross-validation folds, which makes the comparison fair and the results repeatable:
set.seed(7)
modelRF <- train(
  label~.,
  data=df,
  method="rf",
  trControl=control
)
set.seed(7)
modelGbm <- train(
  label~.,
  data=df,
  method="gbm",
  trControl=control,
  verbose=FALSE
)
set.seed(7)
modelLogitBoost <- train(
  label~.,
  data=df,
  method="LogitBoost",
  trControl=control
)
set.seed(7)
modelNaiveBayes <- train(
  label~.,
  data=df,
  method="nb",
  trControl=control
)
The training will take some time, but once this is done, we can collect the results in a data frame for later exploration:
results <- resamples(
  list(
    RF=modelRF,
    GBM=modelGbm,
    LB=modelLogitBoost,
    NB=modelNaiveBayes
  )
)
We can easily get a summary of the results, with the familiar summary function:
> summary(results)
Call:
summary.resamples(object = results)
Models: RF, GBM, LB, NB
Number of resamples: 10
Accuracy
       Min. 1st Qu.  Median    Mean 3rd Qu.   Max. NA's
RF   0.7116  0.7189  0.7284 0.72732  0.7363 0.7412    0
GBM  0.7168  0.7199  0.7352 0.73228  0.7410 0.7496    0
LB   0.5680  0.5933  0.6124 0.60656  0.6184 0.6420    0
NB   0.6244  0.6291  0.6374 0.63992  0.6510 0.6588    0

Kappa
       Min. 1st Qu.  Median    Mean 3rd Qu.   Max. NA's
RF   0.4232  0.4378  0.4568 0.45464  0.4726 0.4824    0
GBM  0.4336  0.4398  0.4704 0.46456  0.4820 0.4992    0
LB   0.1360  0.1866  0.2248 0.21312  0.2368 0.2840    0
NB   0.2488  0.2582  0.2748 0.27984  0.3020 0.3176    0
Or more detailed visualizations with:
bwplot(results)
The preceding code produces a box-and-whisker plot of the accuracy and kappa distributions for each model. You can get a similar view as a dot plot with:
dotplot(results)
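Beyond visual comparison, caret can also test whether the differences between the models' resampled metrics are statistically significant, via the diff method for resamples objects. A short sketch, assuming the results object created above:

```r
# Pairwise differences between the models' resampled accuracy/kappa values,
# with paired t-tests (p-values adjusted for multiple comparisons)
differences <- diff(results)
summary(differences)

# The distributions of the differences can be plotted as well
bwplot(differences)
```

This is useful because two models with similar mean accuracy may still differ consistently across the same folds, and the paired tests pick that up.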
Anyhow, time to look at the performance metrics!
On the plots and in the preceding summary, accuracy is the percentage of correctly classified instances out of all instances. The kappa metric (Cohen's kappa statistic) is the accuracy normalized against the baseline of random classification, that is, the accuracy you would get by assigning each observation to a class at random according to the observed class frequencies.
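To make the kappa definition concrete, here is a small sketch that computes Cohen's kappa by hand from a hypothetical 2x2 confusion matrix (the counts are made up purely for illustration):

```r
# Hypothetical confusion matrix: rows = predicted class, columns = actual class
conf <- matrix(c(40, 10,
                  5, 45), nrow = 2, byrow = TRUE)

n        <- sum(conf)
observed <- sum(diag(conf)) / n                       # plain accuracy: 0.85
expected <- sum(rowSums(conf) * colSums(conf)) / n^2  # chance agreement: 0.50
kappa    <- (observed - expected) / (1 - expected)    # kappa: 0.70
```

A kappa of 0 means the classifier does no better than chance, while 1 means perfect agreement, which is why kappa is a more honest metric than raw accuracy when the classes are imbalanced.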
So we see that the tree-based methods do quite well here. This stands in contrast to the one-hot encoded models: tree-based methods are usually not recommended for data with a large number of sparse features, which would have been the case had we one-hot encoded the text as before. Since we are instead embedding the text into dense vectors, trees are welcome again, and they perform well even without tweaking hyperparameters.