Tuning hyperparameters and advanced FFNNs

The flexibility of neural networks is also one of their main drawbacks: there are many hyperparameters to tweak. Even in a simple MLP, you can change the number of layers, the number of neurons per layer, and the type of activation function to use in each layer. You can also change the weight initialization logic, the dropout keep probability, and so on.

Additionally, dealing with some common problems in FFNNs, such as the vanishing gradient problem, and selecting the most suitable activation function, learning rate, and optimizer are of prime importance.

Tuning FFNN hyperparameters

Hyperparameters are parameters that are not directly learned within estimators. It is possible and recommended that you search the hyperparameter space for the best cross-validation (http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation) score. Any parameter provided when constructing an estimator may be optimized in this manner. Now, the question is: how do you know what combination of hyperparameters is best for your task? Of course, you can use grid search, with cross-validation, to find the right hyperparameters for linear machine learning models.

However, for DNNs, there are many hyperparameters to tune. Since training a neural network on a large dataset takes a lot of time, you will only be able to explore a tiny part of the hyperparameter space in a reasonable amount of time. Here are some insights that can be followed.

Moreover, as I said, you can use grid search or randomized search, with cross-validation, to find the right hyperparameters, just as for linear machine learning models. We will see some possible ways of exhaustive and randomized grid searching and cross-validation later in this section.

Number of hidden layers

For many problems, you can start with just one or two hidden layers, and this setting will often work just fine. Using two hidden layers with the same total number of neurons (see the next subsection to get an idea about the number of neurons) takes roughly the same amount of training time. Now let's see some naïve estimation about setting the number of hidden layers:

  • 0: Only capable of representing linear separable functions or decisions
  • 1: Can approximate any function that contains a continuous mapping from one finite space to another
  • 2: Can represent an arbitrary decision boundary to arbitrary accuracy, with rational activation functions, and can approximate any smooth mapping to any accuracy

However, for a more complex problem, you can gradually ramp up the number of hidden layers, until you start overfitting the training set. Very complex tasks, such as large image classification or speech recognition, typically require networks with dozens of layers, and they need a large amount of training data.

Nevertheless, you can try increasing the number of neurons gradually until the network starts overfitting. This means the upper bound on the number of hidden neurons that will not result in overfitting is:

N_h = N_s / (α * (N_i + N_o))

In the preceding equation:

  • N_i = number of input neurons
  • N_o = number of output neurons
  • N_s = number of samples in the training dataset
  • α = an arbitrary scaling factor, usually 2-10

Note that the above equation does not come from any research, but from my personal working experience. However, for an automated procedure, you would start with an α of 2 (that is, twice as many degrees of freedom in your training data as in your model) and work your way up to 10 if the error on the training data is significantly smaller than on the cross-validation dataset.
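As a purely illustrative calculation, assume the MNIST training set (N_s = 55,000 samples, N_i = 784 input neurons, N_o = 10 output neurons). With α = 2, the upper bound would be N_h = 55,000 / (2 * (784 + 10)) ≈ 35 hidden neurons, while with α = 10 it drops to about 7.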

Number of neurons per hidden layer

Obviously, the number of neurons in the input and output layers is determined by the type of input and output your task requires. For example, if your dataset consists of 28x28 images, the input layer should have 784 neurons, and the output layer should have as many neurons as there are classes to be predicted.

We will see how this works in practice in the next example, using an MLP, where there will be four hidden layers with 256 neurons each (just one hyperparameter to tune, instead of one per layer). Just like for the number of layers, you can try increasing the number of neurons gradually until the network starts overfitting.

There are some empirically derived rules-of-thumb, of which the most commonly relied on is: "The optimal size of the hidden layer is usually between the size of the input and size of the output layers."

In summary, for most problems, you could probably get decent performance (even without a second optimization step) by setting the hidden layer configuration using just two rules:

  • The number of hidden layers equals one
  • The number of neurons in that layer is the mean of the neurons in the input and output layers
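For instance, with 784 input neurons and 10 output neurons, these rules would suggest a single hidden layer of roughly (784 + 10) / 2 ≈ 397 neurons.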


Weight and bias initialization

As we will see in the next example, initializing weight and biases for the hidden layers is an important hyperparameter to be taken care of:

  • Do not use all-zero initialization: A reasonable-sounding idea might be to set all the initial weights to zero, but it does not work in practice: if the weights are initialized to be the same, every neuron in the network computes the same output (and the same gradients), so there is no source of asymmetry between neurons.
  • Small random numbers: It is also possible to initialize the weights of the neurons to small random numbers, but not identically zero. Alternatively, the small numbers can be drawn from a uniform distribution (see the sketch after this list).
  • Initializing the biases: It is common to initialize the biases to zero, since the small random numbers in the weights already provide the asymmetry breaking. Setting the biases to a small constant value, such as 0.01 for all biases, ensures that all ReLU units can propagate some gradient; however, it neither performs clearly better nor shows consistent improvement. Therefore, sticking with zero is recommended.
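As a minimal sketch of these recommendations in the TensorFlow 1.x style used in this chapter (the 784 and 256 layer sizes and the stddev value are illustrative assumptions):

import tensorflow as tf

# Small random weights break the symmetry between neurons
W1 = tf.Variable(tf.truncated_normal([784, 256], stddev=0.1))
# Alternatively, small numbers drawn from a uniform distribution
# W1 = tf.Variable(tf.random_uniform([784, 256], minval=-0.1, maxval=0.1))
# Biases can safely start at zero
B1 = tf.Variable(tf.zeros([256]))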

Selecting the most suitable optimizer

Since, in FFNNs, one of the objectives is to minimize the evaluated cost, we must define an optimizer. We have already seen how to use tf.train.AdamOptimizer (https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer). The TensorFlow tf.train module (https://www.tensorflow.org/api_docs/python/tf/train) provides a set of classes and functions that help to train models. Personally, I have found that the Adam optimizer works well for me in practice, without having to think much about learning rates and so on.

For most cases, we can utilize Adam, but sometimes we can adopt the implemented RMSPropOptimizer function, which is an advanced form of gradient descent. The RMSPropOptimizer function implements the RMSProp algorithm.

The RMSPropOptimizer function divides the learning rate by an exponentially decaying average of squared gradients. The suggested value of the decay parameter is 0.9, while a good default value for the learning rate is 0.001:

optimizer = tf.train.RMSPropOptimizer(0.001, 0.9).minimize(cost_op)

When using the most common optimizer, SGD, the learning rate must scale with 1/T to get convergence, where T is the number of iterations. RMSProp tries to overcome this limitation automatically by adjusting the step size so that the step is on the same scale as the gradients.
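The following is a minimal NumPy sketch of the RMSProp update rule described above, not the TensorFlow implementation itself (the epsilon constant is an assumption, added only for numerical stability):

import numpy as np

def rmsprop_update(weight, gradient, cache, learning_rate=0.001, decay=0.9, eps=1e-10):
    # Keep an exponentially decaying average of squared gradients
    cache = decay * cache + (1 - decay) * gradient ** 2
    # Divide the learning rate by the root of that average, so the step
    # stays roughly on the same scale as the gradients
    weight = weight - learning_rate * gradient / (np.sqrt(cache) + eps)
    return weight, cache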

So, if you are training a neural network in a mini-batch setting, where the gradients must be computed on each mini-batch anyway, using tf.train.RMSPropOptimizer() would often be the faster way of learning. Researchers also recommend using the Momentum optimizer when training deep networks such as CNNs.

Finally, if you want to play around with these optimizers, you just need to change one line. Due to time constraints, I have not tried all of them. However, according to a research paper by Sebastian Ruder (see https://arxiv.org/abs/1609.04747), optimizers with adaptive learning-rate methods, that is, Adagrad, Adadelta, RMSProp, and Adam, are the most suitable and provide the best convergence for these scenarios.
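As a sketch of how swapping the optimizer is indeed a one-line change (assuming cost_op is the cost tensor defined elsewhere and 0.001 is an illustrative learning rate):

optimizer = tf.train.AdamOptimizer(0.001).minimize(cost_op)
# optimizer = tf.train.MomentumOptimizer(0.001, momentum=0.9).minimize(cost_op)
# optimizer = tf.train.AdagradOptimizer(0.001).minimize(cost_op)
# optimizer = tf.train.GradientDescentOptimizer(0.001).minimize(cost_op)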

GridSearch and randomized search for hyperparameter tuning

Two generic approaches to sampling search candidates are provided in other Python-based machine-learning libraries such as Scikit-learn. For given values, GridSearchCV (http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) exhaustively considers all parameter combinations, while RandomizedSearchCV (http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html#sklearn.model_selection.RandomizedSearchCV) can sample a given number of candidates from a parameter space with a specified distribution.

GridSearchCV is a great way to test and optimize hyperparameters automatically. I often use it with Scikit-learn. However, it is not yet so straightforward to optimize learning_rate, batch_size, and so on with TensorFlowEstimator. Moreover, as I said, we often have many hyperparameters to tune to get the best result. Nevertheless, I found this article quite useful for learning how to tune the aforementioned hyperparameters: https://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/

The randomized search and the grid search explore exactly the same space of parameters. The resulting parameter settings are quite similar, while the runtime for the randomized search is drastically lower.

Some benchmarks (for example, http://scikit-learn.org/stable/auto_examples/model_selection/) have reported that the performance is slightly worse for the randomized search, though this is most likely a noise effect and would not carry over to a held-out test set.
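As a minimal Scikit-learn sketch of both approaches (the MLPClassifier model and the parameter values are illustrative assumptions, not the TensorFlow network used in this chapter):

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.neural_network import MLPClassifier

param_grid = {
    'hidden_layer_sizes': [(256,), (256, 256)],
    'learning_rate_init': [0.001, 0.01],
    'alpha': [0.0001, 0.001],  # L2 regularization strength
}

# Exhaustively evaluates every combination with 3-fold cross-validation
grid_search = GridSearchCV(MLPClassifier(max_iter=200), param_grid, cv=3)
# Samples a fixed number of candidates from the same parameter space
random_search = RandomizedSearchCV(MLPClassifier(max_iter=200), param_grid, n_iter=4, cv=3)
# grid_search.fit(X_train, y_train); print(grid_search.best_params_)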

Regularization

There are several ways of controlling the training of DNNs to prevent overfitting in the training phase, for example, L2/L1 regularization, max-norm constraints, and dropout:

  • L2 regularization: This is probably the most common form of regularization. With the gradient descent parameter update, L2 regularization means that every weight is decayed linearly towards zero (see the sketch after this list).
  • L1 regularization: For each weight w, we add the term λ|w| to the objective. It is also possible to combine L1 and L2 regularization to achieve elastic net regularization.
  • Max-norm constraints: These enforce an absolute upper bound on the magnitude of the weight vector for each hidden layer neuron. Projected gradient descent can then be used to enforce the constraint.
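As a minimal sketch of adding an L2 penalty to the cost in the TensorFlow 1.x style used in this chapter (the cross_entropy tensor and the 0.001 regularization strength are assumptions; W1 to W5 are the weight variables of the five-layer network):

l2_lambda = 0.001  # regularization strength (an assumed value)
l2_penalty = l2_lambda * (tf.nn.l2_loss(W1) + tf.nn.l2_loss(W2) +
                          tf.nn.l2_loss(W3) + tf.nn.l2_loss(W4) + tf.nn.l2_loss(W5))
cost_op = cross_entropy + l2_penalty  # the optimizer now also decays the weights towards zero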

The vanishing gradient problem arises in very deep neural networks (and typically in RNNs, which will be covered in a dedicated chapter) that use activation functions whose gradients tend to be small (in the range of 0 to 1).

Since these small gradients are multiplied together during backpropagation, they tend to "vanish" throughout the layers, preventing the network from learning long-range dependencies. A common way to counter this problem is to use activation functions like the Rectified Linear Unit (aka ReLU), which does not suffer from small gradients. We will also see an improved variant of the RNN, called Long Short-Term Memory (aka LSTM), which can combat this problem. A more detailed discussion on this topic is given in Chapter 5, Optimizing TensorFlow Autoencoders.

Nevertheless, we have seen that the last architectural change improved the accuracy of our model, but we can do even better by replacing the sigmoid activation function with the ReLU, shown as follows:


Figure 20: ReLU function

A ReLU unit computes the function f(x) = max(0, x). ReLU is computationally fast because it does not require any exponential computation, such as that required in sigmoid or tanh activation. Furthermore, it was found to accelerate the convergence of stochastic gradient descent greatly, compared to the sigmoid/tanh functions. To use the ReLU function, we simply change, in the previously implemented model, the following definitions of the first four layers:

First layer output:

Y1 = tf.nn.relu(tf.matmul(XX, W1) + B1) # Output from layer 1

Second layer output:

Y2 = tf.nn.relu(tf.matmul(Y1, W2) + B2) # Output from layer 2

Third layer output:

Y3 = tf.nn.relu(tf.matmul(Y2, W3) + B3) # Output from layer 3

Fourth layer output:

Y4 = tf.nn.relu(tf.matmul(Y3, W4) + B4) # Output from layer 4

Output layer:

Ylogits = tf.matmul(Y4, W5) + B5 # computing the logits
Y = tf.nn.softmax(Ylogits) # output from layer 5

Of course, tf.nn.relu is TensorFlow's implementation of ReLU. The accuracy of the model is almost 98%, as you can see by running the network:

>>>
Loading data/train-images-idx3-ubyte.mnist
Loading data/train-labels-idx1-ubyte.mnist
Loading data/t10k-images-idx3-ubyte.mnist
Loading data/t10k-labels-idx1-ubyte.mnist
Epoch: 0
Epoch: 1
Epoch: 2
Epoch: 3
Epoch: 4
Epoch: 5
Epoch: 6
Epoch: 7
Epoch: 8
Epoch: 9
Accuracy: 0.9789
done
>>>

As regards the TensorBoard analysis, from the folder where the source has been executed, you should type:

$> tensorboard --logdir='log_relu' # don't put spaces before or after '='

Then open the browser at localhost to visualize TensorBoard's starting page. In the following figure, we show the trend of the accuracy over the number of examples of the training set:


Figure 21: Accuracy function over the training set

You can easily see how the accuracy, after a poor initial trend, begins a rapid, progressive improvement after about 1,000 examples.

Dropout optimization

While working with a DNN, we need another placeholder for dropout, whose keep probability is a hyperparameter to be tuned. Dropout is implemented by keeping a neuron active only with some probability (say, p < 1.0), and setting it to zero otherwise. The idea is to use a single neural net at test time without dropout; the weights of this network are scaled-down versions of the trained weights. If a unit is retained with probability p (dropout_keep_prob < 1.0) during training, the outgoing weights of that unit are multiplied by p at test time.

During the learning phase, the connections with the next layer can be limited to a subset of neurons, reducing the weights to be updated. This learning optimization technique is called dropout. Dropout is, therefore, a technique used to decrease overfitting within a network with many layers and/or neurons. In general, the dropout layers are positioned after the layers that possess a large number of trainable neurons.

This technique allows us to set to 0, and thus exclude from the activation, a certain percentage of the neurons of the preceding layer. The probability that a neuron's activation is kept is indicated by a parameter within the layer (pkeep in the following code), a number between 0 and 1; otherwise, the activation is discarded, that is, set to 0.


Figure 22: Dropout representation

In this way, for each input, the network owns an architecture slightly different from the previous one. Some connections are active and some are not, in a different way, every time, even though these architectures share the same weights. The preceding figure shows how dropout works: each hidden unit is randomly omitted from the network with a fixed probability (1 - pkeep in the code that follows).

One thing to notice, though, is that the selected dropout units are different for each training instance, which is why dropout is applied only during training. Dropout can be seen as an efficient way to perform model averaging across a large number of different neural networks, where overfitting can be avoided at a much lower computational cost than training each architecture separately. Dropout reduces the possibility that a neuron relies on the presence of particular other neurons; in this way, each neuron is forced to learn more robust features that are useful in conjunction with many different subsets of the other neurons.

The TensorFlow function that allows us to build a dropout layer is tf.nn.dropout. The inputs of this function are the output of the previous layer and the keep probability; tf.nn.dropout returns an output tensor of the same size as the input tensor. The implementation of this model follows the same rules used for the five-layer network. In this case, we must insert the dropout function between one layer and the next:

pkeep = tf.placeholder(tf.float32) # keep probability, fed at training and test time

Y1 = tf.nn.relu(tf.matmul(XX, W1) + B1) # Output from layer 1
Y1d = tf.nn.dropout(Y1, pkeep)

Y2 = tf.nn.relu(tf.matmul(Y1d, W2) + B2) # Output from layer 2
Y2d = tf.nn.dropout(Y2, pkeep)

Y3 = tf.nn.relu(tf.matmul(Y2d, W3) + B3) # Output from layer 3
Y3d = tf.nn.dropout(Y3, pkeep)

Y4 = tf.nn.relu(tf.matmul(Y3d, W4) + B4) # Output from layer 4
Y4d = tf.nn.dropout(Y4, pkeep)

Ylogits = tf.matmul(Y4d, W5) + B5 # computing the logits
Y = tf.nn.softmax(Ylogits) # output from layer 5
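At training time, pkeep is fed with a value below 1.0, while at test time it must be set to 1.0 so that no activations are dropped. The following is a minimal sketch of the feed dictionaries (the 0.75 value and the X, Y_, train_step, and accuracy names are assumptions about the surrounding training loop):

# Training step: drop about 25% of the activations in each hidden layer
sess.run(train_step, feed_dict={X: batch_X, Y_: batch_Y, pkeep: 0.75})
# Evaluation: keep every activation
test_acc = sess.run(accuracy, feed_dict={X: mnist.test.images, Y_: mnist.test.labels, pkeep: 1.0})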

The dropout optimization produces the following results:

>>>
Loading data/train-images-idx3-ubyte.mnist
Loading data/train-labels-idx1-ubyte.mnist
Loading data/t10k-images-idx3-ubyte.mnist
Loading data/t10k-labels-idx1-ubyte.mnist
Epoch: 0
Epoch: 1
Epoch: 2
Epoch: 3
Epoch: 4
Epoch: 5
Epoch: 6
Epoch: 7
Epoch: 8
Epoch: 9
Accuracy: 0.9666
done
>>>

Despite this implementation, the previous ReLU network is still better, but you can try to change the network parameters to improve the model's accuracy. Also, since this is a tiny network and we dealt with a small-scale dataset, when you handle a large-scale, high-dimensional dataset with a more complex network, you will realize that dropout can be really important. We will see a few hands-on examples in the next chapter.

Now, to see the effect of the dropout optimization, let's start the TensorBoard analysis. Just type the following:

$> tensorboard --logdir='log_softmax_relu_dropout/'

The following graphs show the accuracy and the cost function as functions of the training examples:


Figure 23: a) accuracy in dropout optimization, b) the cost function over the training set

In the preceding charts, we display the accuracy and the cost function as functions of the training examples. Both trends are what we expected: the accuracy increases with the number of training examples, while the cost function decreases with increasing iterations.
