D.5. Holding your model back

During model.fit(), gradient descent is over-enthusiastic about pursuing the lowest possible error in your model. This can lead to overfitting, where your model does really well on the training set but poorly on new, unseen examples (the test set). So you probably want to “hold back” on the reins of your model. Here are three ways to do that:

  • Regularization
  • Random dropout
  • Batch normalization

D.5.1. Regularization

In any machine learning model, overfitting will eventually come up. Luckily, you have several tools to combat it. The first is regularization, which is a penalty applied to the learned parameters at each training step. It’s usually, but not always, a function of the parameters themselves. L1-norm and L2-norm regularization are the most common.

L1 regularization

L1 regularization is the sum of the absolute values of all the parameters (weights), multiplied by some lambda (a hyperparameter), usually a small float between 0 and 1. This sum is applied to the weight update; the idea is that weights with large magnitudes incur a penalty, so the model is encouraged to use more of its weights ... evenly.
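To make the arithmetic concrete, here’s a minimal sketch of the penalty on a toy weight vector (the weight values and lambda are invented for illustration, not taken from any model in this book):

>>> import numpy as np
 
>>> weights = np.array([0.5, -1.2, 3.0])             # toy weight vector
>>> lambda_ = 0.01                                    # regularization strength
>>> l1_penalty = lambda_ * np.sum(np.abs(weights))    # 0.01 * (0.5 + 1.2 + 3.0) = 0.047

This penalty is added to the loss, so the gradient nudges every weight toward zero by the same amount, regardless of how large that weight is.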

L2 regularization

Similarly, L2 is a weight penalty, but defined slightly differently. In this case, it’s the sum of the squares of the weights, multiplied by some lambda (a separate hyperparameter to be chosen ahead of training).
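If you want to experiment with either penalty in Keras, you can attach it to a layer through the kernel_regularizer argument. The following is only a sketch; the layer sizes and lambda values are arbitrary and not part of the models in this book:

>>> from keras.models import Sequential
>>> from keras.layers import Dense
>>> from keras import regularizers
 
>>> model = Sequential()
>>> model.add(Dense(64, input_dim=300,
...                 kernel_regularizer=regularizers.l1(0.01)))
>>> model.add(Dense(1, activation='sigmoid',
...                 kernel_regularizer=regularizers.l2(0.01)))

Keras adds the resulting penalty terms to the loss during training, so you don’t have to modify the weight update yourself.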

D.5.2. Dropout

In neural networks, dropout is another handy tool for this situation, one that seems magical at first glance. Dropout is the idea that at any given layer of a neural network, we turn off a percentage of the signal coming through that layer at training time. Note that this occurs only during training, never during inference. During any given training pass, a subset of the neurons in the layer below are “ignored”; their output values are explicitly set to zero. Because they contribute nothing to the resulting prediction, they receive no weight update during the backpropagation step. In the next training step, a different random subset of neurons is chosen, and their outputs are zeroed out instead.

How is a network supposed to learn anything with 20% of its brain turned off at any given time? The idea is that no specific weight path should wholly define a particular attribute of the data. The model must generalize its internal structures to be able to handle data via multiple paths through the neurons.

The percentage of the signal that gets turned off is defined as a hyperparameter, expressed as a float between 0 and 1. In practice, a dropout of 0.1 to 0.5 is usually effective, but of course it’s model dependent. And at inference time, dropout is ignored, and the full power of the trained weights is brought to bear on the novel data.

Keras provides a very simple way to implement this, and it can be seen in the book’s examples and in the following listing.

Listing D.1. A dropout layer in Keras reduces overfitting
>>> from keras.models import Sequential
>>> from keras.layers import Dropout, LSTM, Flatten, Dense
 
>>> num_neurons = 20                                        1
>>> maxlen = 100
>>> embedding_dims = 300
>>> model = Sequential()
 
>>> model.add(LSTM(num_neurons, return_sequences=True,
...                input_shape=(maxlen, embedding_dims)))
>>> model.add(Dropout(.2))                                 2
 
>>> model.add(Flatten())
>>> model.add(Dense(1, activation='sigmoid'))

  • 1 Arbitrary hyperparameters used as an example
  • 2 .2 here is the hyperparameter, so 20% of the outputs of the LSTM layer above will be zeroed out and therefore ignored.

D.5.3. Batch normalization

A newer concept in neural networks called batch normalization can help regularize and generalize your model. Batch normalization is the idea that, much like the input data, the outputs of each layer should be normalized: within each mini-batch, they’re shifted and rescaled toward zero mean and unit variance. There’s still some debate about how, why, and when this is beneficial, and under which conditions it should be used. We leave it to you to explore that research on your own.

But Keras does provide a handy implementation with its BatchNormalization layer, as shown in the following listing.

Listing D.2. BatchNormalization
>>> from keras.models import Sequential
>>> from keras.layers import Activation, Dropout, LSTM, Flatten, Dense
>>> from keras.layers.normalization import BatchNormalization
 
>>> model = Sequential()
>>> model.add(Dense(64, input_dim=14))
>>> model.add(BatchNormalization())
>>> model.add(Activation('sigmoid'))
>>> model.add(Dense(64, input_dim=14))
>>> model.add(BatchNormalization())
>>> model.add(Activation('sigmoid'))
>>> model.add(Dense(1, activation='sigmoid'))