Going from logistic regression to single-layer neural networks

Basically, logistic regression is a feed-forward neural network without a hidden layer, where the output layer connects directly with the input layer. In other words, logistic regression is a single neuron that maps the input directly to the output. In theory, a neural network with an additional hidden layer between the input and output layers should be able to capture more of the underlying relationship between inputs and outputs.

A single-layer neural network for two possible classes can be represented graphically as follows:

Suppose x is n-dimensional and there are H hidden units in the hidden layer. The weight matrix w^(1) connecting the input layer to the hidden layer is then of size n by H, with each column w_h^(1) representing the coefficients associated with the h-th hidden unit. So the output (also called the activation) of the h-th hidden unit, a_h^(2), can be expressed mathematically as:
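Assuming the conventional formulation with a bias term b_h^(1) for each hidden unit (the bias notation is an assumption here), the activation can be written as:

a_h^(2) = f( w_h^(1)ᵀ x + b_h^(1) ) = f( Σ_{i=1}^{n} w_{ih}^(1) x_i + b_h^(1) )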

For example, the outputs of the first, the second, and the last hidden unit are:
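Under the same assumption of one bias term per hidden unit, these are:

a_1^(2) = f( w_1^(1)ᵀ x + b_1^(1) )
a_2^(2) = f( w_2^(1)ᵀ x + b_2^(1) )
...
a_H^(2) = f( w_H^(1)ᵀ x + b_H^(1) )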

where f(z) is an activation function. Typical choices for the activation function in simple networks include the logistic function (more often called the sigmoid function) and the tanh function (which can be considered a re-scaled version of the logistic function):
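Their standard definitions are:

sigmoid(z) = 1 / (1 + e^(-z))
tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z)) = 2 * sigmoid(2z) - 1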

Plots of these two functions are as follows:
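A quick way to generate such plots in base R (a minimal sketch; the variable names below are only illustrative):

> z <- seq(-5, 5, by = 0.1)
> sigmoid <- function(z) 1 / (1 + exp(-z))
> plot(z, sigmoid(z), type = "l", col = "blue", ylim = c(-1, 1),
+      ylab = "f(z)", main = "Logistic (sigmoid) and tanh activation functions")
> lines(z, tanh(z), col = "red", lty = 2)
> legend("topleft", legend = c("sigmoid", "tanh"),
+        col = c("blue", "red"), lty = c(1, 2))

The tanh curve ranges from -1 to 1, while the sigmoid curve ranges from 0 to 1.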

We will be using the logistic function in our single-layered networks for now.

For a case of K possible classes, the weight matrix w^(2) connecting the hidden layer to the output layer is of size H by K. Each column w_k^(2) represents the coefficients associated with class k. The input to the output layer is the output of the hidden layer, a^(2) = {a_1^(2), a_2^(2), ..., a_H^(2)}, and the probability of x being class k can be expressed mathematically as (for consistency, we denote it as a_k^(3)):
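Assuming the usual softmax normalization over the K output units, with a bias term b_k^(2) per class (both are assumptions about the exact formulation), this probability can be written as:

a_k^(3) = P(y = k | x) = exp( w_k^(2)ᵀ a^(2) + b_k^(2) ) / Σ_{j=1}^{K} exp( w_j^(2)ᵀ a^(2) + b_j^(2) )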

Similarly, given m training samples, we train the neural network by learning all the weights, w = {w^(1), w^(2)}, using gradient descent, with the goal of minimizing the mean squared error cost J(w).
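Writing y_{ik} for the one-hot encoded target of sample x_i and class k (a notational assumption), one common form of this cost is:

J(w) = (1 / m) Σ_{i=1}^{m} Σ_{k=1}^{K} (1/2) ( a_k^(3)(x_i) - y_{ik} )^2

and gradient descent repeatedly updates w := w - α ∇J(w) for some learning rate α.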

Computation of the gradients Δw can be realized through the backpropagation algorithm. The idea of the backpropagation algorithm is the following: we first travel forward through the network and compute all the outputs of the hidden layer and the output layer; then, moving backward from the last layer, we calculate how much each node contributed to the error in the final output and propagate these error terms back to the previous layers. In our single-layer network, the detailed steps are:

  1. Compute the activations a^(2) of the hidden layer and feed them to the output layer to compute the outputs a^(3).
  2. For the output layer, compute the derivative of the cost function of one sample, j(w), with respect to each output unit. For the mean squared error cost with logistic outputs, this error term is δ_k^(3) = (a_k^(3) - y_k) a_k^(3) (1 - a_k^(3)), or, rewritten for the entire layer, δ^(3) = (a^(3) - y) ⊙ a^(3) ⊙ (1 - a^(3)), where ⊙ denotes element-wise multiplication.
  3. For the hidden layer, compute the error term δ^(2) as a weighted average of the output-layer error terms: δ^(2) = (w^(2) δ^(3)) ⊙ a^(2) ⊙ (1 - a^(2)).
  4. Compute the gradients by applying the chain rule: ∇_{w^(2)} j = a^(2) (δ^(3))ᵀ and ∇_{w^(1)} j = x (δ^(2))ᵀ (these steps are sketched in R below).

We repeatedly update all weights by taking these steps until the cost function converges.
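As an illustration of these four steps only (not the actual optimizer used by nnet), here is a minimal sketch of a single gradient-descent update in plain R. It assumes sigmoid activations in both layers, an m-by-n input matrix X, an m-by-K one-hot label matrix Y, and a learning rate lr; all of these names are illustrative:

sigmoid <- function(z) 1 / (1 + exp(-z))

backprop_step <- function(X, Y, W1, b1, W2, b2, lr = 0.1) {
  m <- nrow(X)
  # Step 1: forward pass through the hidden layer and the output layer
  A2 <- sigmoid(X %*% W1 + matrix(b1, m, length(b1), byrow = TRUE))   # m x H
  A3 <- sigmoid(A2 %*% W2 + matrix(b2, m, length(b2), byrow = TRUE))  # m x K
  # Step 2: output-layer error term for the mean squared error cost
  delta3 <- (A3 - Y) * A3 * (1 - A3)                                  # m x K
  # Step 3: hidden-layer error term, backpropagated through W2
  delta2 <- (delta3 %*% t(W2)) * A2 * (1 - A2)                        # m x H
  # Step 4: gradients via the chain rule, averaged over the m samples
  grad_W2 <- t(A2) %*% delta3 / m
  grad_W1 <- t(X) %*% delta2 / m
  # Gradient-descent update of all weights and biases
  list(W1 = W1 - lr * grad_W1, b1 = b1 - lr * colMeans(delta2),
       W2 = W2 - lr * grad_W2, b2 = b2 - lr * colMeans(delta3))
}

In practice, the weights would be initialized to small random values and this update repeated until the cost J(w) converges; the nnet package handles this optimization internally, so we never have to code these steps by hand.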

After this brief review of single-layer networks, we can now apply one as the second solution to our digit classification project.

Again, we use the nnet package to implement our single-layer network:

> model_nn <- nnet(label ~ ., data=data_train, size=50, maxit=300, MaxNWts=100000, decay=1e-4)
# weights: 39760
initial value 108597.598656
iter 10 value 27708.286001
iter 20 value 16027.005297
iter 30 value 14058.141050
iter 40 value 12615.442747
iter 50 value 11793.700937
iter 60 value 11026.672273
iter 70 value 10654.855058
iter 80 value 10193.580947
iter 90 value 9854.836168
iter 100 value 9544.973159
iter 110 value 9307.192737
iter 120 value 9043.028253
iter 130 value 8845.069307
iter 140 value 8686.707561
iter 150 value 8525.104362
iter 160 value 8281.609223
iter 170 value 8140.051273
iter 180 value 7998.721024
iter 190 value 7854.388240
iter 200 value 7712.459027
iter 210 value 7636.945553
iter 220 value 7557.675909
iter 230 value 7449.854506
iter 240 value 7355.021651
iter 250 value 7259.186906
iter 260 value 7192.798089
iter 270 value 7055.027833
iter 280 value 6957.926522
iter 290 value 6866.641511
iter 300 value 6778.342997
final value 6778.342997
stopped after 300 iterations

We fit the model with parameters including:

  • size=50: There are 50 hidden units in the hidden layer.
  • MaxNWts=100000: It allows at most 100,000 weights. In our case, there are (784 input dimensions + 1 bias) * 50 hidden units = 39,250 elements in the weight matrix w^(1), and (50 hidden units + 1 bias) * 10 output units = 510 elements in the weight matrix w^(2), so there are 39,760 weights in total (this count can be verified directly, as shown after this list).
  • decay=1e-4: The regularization strength; the weight decay is 0.0001.
  • maxit=300: The maximum number of iterations is set to be 300.
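Since the fitted nnet object stores its weight vector in the wts component, the total weight count can be checked directly (assuming the model_nn object fitted above):

> length(model_nn$wts)

This should match the 39,760 reported in the # weights line of the training output.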

We apply the trained network model on the testing set:

> prediction_nn <- predict(model_nn, data_test, type = "class")
> cm_nn = table(data_test$label, prediction_nn)
> cm_nn
   prediction_nn
       0    1    2    3    4    5    6    7    8    9
  0  987    0    3    3    2   11   10    7    5    5
  1    0 1134    9    6    0    2    0    5   11    4
  2   14    9  918   31   11   11    7   15   22    6
  3    3    1   17  966    0   41    3   14   24   18
  4    4    3    8    2  929    4   11   11    5   41
  5   12    2    6   17    5  851   15    9   26    5
  6   10    1   14    0    9   14  970    0   15    1
  7    4    4   23    6    5    2    0 1010    7   39
  8    5   15    9   18    5   31    9    4  912    7
  9   11    2    1   20   52    4    0   36    8  913
> accuracy_nn = mean(prediction_nn == data_test$label)
> accuracy_nn
[1] 0.9135944

Better than our first attempt! Can we do even better with deep learning models, that is, with more hidden layers? Sure.
