First attempt – logistic regression

We start with perhaps the most basic classifier, logistic regression; more specifically, multinomial logistic regression, since this is a multiclass problem. It is a probabilistic linear classifier parameterized by a weight matrix W (also called the coefficient matrix) and a bias vector b (also called the intercept). It maps an input vector x to a set of probabilities P(y=1), P(y=2), ..., P(y=K) for K possible classes.

A multinomial logistic regression for two possible classes can be represented graphically as follows:

Suppose x is n-dimensional. Then the weight matrix W is of size n by K, with each column Wk holding the coefficients associated with class k; similarly, the bias vector b is of length K, with each element bk serving as the bias for class k. For simplicity, the bias b can be viewed as an additional row of the weight matrix W (with a constant 1 appended to x). The probability of x belonging to class k can then be expressed mathematically as:

$$P(y = k \mid x) = \mathrm{softmax}(W^{T} x + b)_k = \frac{\exp(W_k^{T} x + b_k)}{\sum_{j=1}^{K} \exp(W_j^{T} x + b_j)}$$

Here softmax() denotes the softmax function, which is why multinomial logistic regression is often called softmax regression.
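
As a small illustration of the softmax function, the following sketch turns a vector of class scores into probabilities. The function name softmax and the example scores are assumptions made for this illustration; subtracting the maximum score is a common trick to avoid numerical overflow and does not change the result.

```r
# Softmax for a vector of class scores (one score per class).
softmax <- function(z) {
  z <- z - max(z)            # shift for numerical stability; result is unchanged
  exp(z) / sum(exp(z))
}

scores <- c(2.0, 1.0, 0.1)   # hypothetical scores for K = 3 classes
probs <- softmax(scores)
print(round(probs, 3))       # one probability per class
print(sum(probs))            # the probabilities sum to 1
```

The class with the largest score always receives the largest probability, so softmax preserves the ranking of the scores.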

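Given a set of m training samples (x^(i), y^(i)), i = 1, ..., m, where y^(i) ∈ {1, ..., K}, the optimal model w is obtained by minimizing the cost (also called log loss), which is defined as:

$$J(w) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} 1\{y^{(i)} = k\} \log P(y = k \mid x^{(i)})$$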


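As usual, we resort to gradient descent, an iterative optimization algorithm, to solve for the optimal w. In each iteration, w moves a step in the direction opposite to the derivative Δw of the objective function at the current point, with a step size proportional to it. That is, w := w – ηΔw, where η is the learning rate. With m denoting the number of training samples and 1{·} the indicator function, each column Δwk of Δw can be computed as:

$$\Delta w_k = \frac{1}{m} \sum_{i=1}^{m} \left( P(y = k \mid x^{(i)}) - 1\{y^{(i)} = k\} \right) x^{(i)}$$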

The trained model, that is, the optimal w, is then used to classify a new sample x' by:

$$y' = \arg\max_{k} P(y = k \mid x')$$

Armed with the mechanics of multinomial logistic regression that we just reviewed, we can now apply it as the first solution to our digit classification project.
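
Before turning to the packages, the mechanics reviewed above can be sketched from scratch in a few lines of R. This is a minimal illustration under stated assumptions, not the implementation used by nnet: it trains on the built-in iris dataset rather than the digit data, the bias is absorbed into the weight matrix through a leading column of ones, and the names softmax_rows, log_loss, eta and n_iter as well as their values are chosen only for this example.

```r
# Softmax regression trained with batch gradient descent on iris (3 classes).

softmax_rows <- function(Z) {
  Z <- Z - apply(Z, 1, max)          # stabilize each row before exponentiating
  E <- exp(Z)
  E / rowSums(E)                     # each row now sums to 1
}

# Log loss: average negative log-probability assigned to the true class.
log_loss <- function(P, Y) -mean(rowSums(Y * log(P)))

X <- cbind(1, scale(as.matrix(iris[, 1:4])))  # leading 1 absorbs the bias
y <- iris$Species
K <- nlevels(y)
Y <- diag(K)[as.integer(y), ]                 # one-hot labels, m x K
m <- nrow(X)

w <- matrix(0, nrow = ncol(X), ncol = K)      # (n + 1) x K weight matrix
eta <- 0.5                                    # learning rate (assumed value)
n_iter <- 500                                 # number of iterations (assumed)
loss <- numeric(n_iter)

for (t in seq_len(n_iter)) {
  P <- softmax_rows(X %*% w)                  # predicted class probabilities
  loss[t] <- log_loss(P, Y)
  grad <- t(X) %*% (P - Y) / m                # column k is the gradient for class k
  w <- w - eta * grad                         # gradient descent update
}

# Classify each sample as the class with the highest probability.
pred <- levels(y)[max.col(softmax_rows(X %*% w), ties.method = "first")]
accuracy <- mean(pred == y)
print(c(first_loss = loss[1], last_loss = loss[n_iter]))
print(accuracy)
```

The loss vector makes it easy to check that the cost goes down over the iterations, which is the same behavior nnet reports in its training log.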

We first split the dataset into two subsets, one for training and one for testing, using the caret package.

caret stands for classification and regression training. The package is designed to streamline the process of training and evaluating models. It contains tools and methods for data splitting, data preprocessing, feature selection, and model tuning. Documentation and a full list of functions can be found in

Install and load the caret package:
> if (!require("caret"))
+ install.packages("caret")
> library(caret)
Loading required package: lattice
Loading required package: ggplot2

We split the data into two partitions, 75% for training and 25% for testing, using the createDataPartition function:

> set.seed(42)
> train_perc = 0.75
> train_index <- createDataPartition(data$label, p=train_perc, list=FALSE)
> data_train <- data[train_index,]
> data_test <- data[-train_index,]
To ensure the experiments are reproducible, it is always good practice to set a seed for the random number generator before any random operation such as data splitting.
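
When the outcome is a factor, createDataPartition samples within each class so that the class proportions in the partition stay close to those of the full dataset. The following base R sketch illustrates that idea on a toy label vector; it is not caret's actual implementation, and the labels, class sizes and variable names are made up for this illustration.

```r
set.seed(42)

# Toy labels with unbalanced classes (hypothetical data).
labels <- factor(rep(0:2, times = c(40, 40, 20)))
train_perc <- 0.75

# Sample 75% of the indices within each class separately.
idx_by_class <- split(seq_along(labels), labels)
train_index <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = ceiling(train_perc * length(idx)))]
}))

# Class proportions are preserved in both partitions.
print(prop.table(table(labels)))
print(prop.table(table(labels[train_index])))
print(prop.table(table(labels[-train_index])))
```

Sampling per class matters most when some classes are rare, because a purely random split could leave very few of their samples in the test set.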

Then, we fit the multinomial logistic regression model using the nnet package. The package contains functions for feed-forward neural networks with a single hidden layer, as well as for multinomial logistic regression models. More details can be found in

> library(nnet)
> # Multinomial logistic regression
> model_lr <- multinom(label ~ ., data=data_train, MaxNWts=10000,
+                      decay=5e-3, maxit=100)
# weights: 7860 (7065 variable)
initial value 72538.338185
iter 10 value 17046.804658
iter 20 value 11166.225504
iter 30 value 9514.340319
iter 40 value 8819.724147
iter 50 value 8405.001712
iter 60 value 8164.997939
iter 70 value 7983.427139
iter 80 value 7897.005940
iter 90 value 7831.663204
iter 100 value 7730.047242
final value 7730.047242
stopped after 100 iterations

We fit a multinomial logistic regression model on the training subset with the following parameters:

  • MaxNWts=10000: Allows at most 10,000 weights. In our case, there are (784 dimensions + 1 bias) * 10 classes = 7850 elements in the weight matrix w. The training log reports 7065 of the weights as variable, which equals 785 * 9
  • decay=5e-3: The regularization strength (weight decay), set to 0.005
  • maxit=100: The maximum number of iterations, set to 100

The error value is printed every 10 iterations, and it keeps decreasing. Training stops because the maximum number of iterations is reached, so the model has not necessarily converged; the error was still going down at iteration 100. We then use the trained model to predict the classes of the testing samples:

> prediction_lr <- predict(model_lr, data_test, type = "class")

Take a look at the prediction results of the first five samples:

> prediction_lr[1:5]
[1] 1 0 7 5 8
Levels: 0 1 2 3 4 5 6 7 8 9

And their true values are:

> data_test$label[1:5]
[1] 1 0 7 5 8
Levels: 0 1 2 3 4 5 6 7 8 9

We can also obtain the confusion matrix by:

> cm_lr = table(data_test$label, prediction_lr)
> cm_lr

And the classification accuracy:

> accuracy_lr = mean(prediction_lr == data_test$label)
> accuracy_lr
[1] 0.8935886
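
The accuracy can also be read off the confusion matrix, together with the per-class recall, which shows which digits are harder to recognize. The sketch below uses small made-up label vectors (truth and pred are hypothetical, not the digit data) and follows the same table(true, predicted) layout as above, so rows correspond to true classes.

```r
# Hypothetical true and predicted labels for a 3-class problem.
truth <- factor(c(0, 0, 1, 1, 1, 2, 2, 2, 2, 2), levels = 0:2)
pred  <- factor(c(0, 1, 1, 1, 1, 2, 2, 0, 2, 2), levels = 0:2)

cm <- table(truth, pred)               # rows: true class, columns: prediction
accuracy <- sum(diag(cm)) / sum(cm)    # correct predictions lie on the diagonal
recall <- diag(cm) / rowSums(cm)       # fraction of each true class recovered

print(cm)
print(accuracy)
print(recall)
```

Giving both factors the same levels keeps the table square even if some class is never predicted.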

That gives 89.4% accuracy on the first try. Not bad! We could certainly do better by tweaking the model parameters, such as decay and maxit, but our focus is on a more advanced model that learns the underlying patterns better. So we move on to the second solution, a feed-forward neural network with a single hidden layer.

