Batch normalization

Batch normalization (BN) is a method for reducing internal covariate shift while training regular DNNs, and it applies to CNNs as well. Because of the normalization, BN prevents small changes to the parameters from being amplified as they propagate through the network, which in turn allows higher learning rates and makes training even faster.

The idea is to place an additional step between the layers, in which the output of the preceding layer is normalized. More specifically, in the original formulation the BN transformation is applied immediately before the non-linear operation (for example, ReLU). Typically, the overall process has the following workflow:

  • Transforming the network into a BN network (see Figure 1)
  • Then training the new network
  • Transforming the batch statistics into population statistics (for use at inference time)

This way, BN can fully take part in the backpropagation process. As shown in Figure 1, BN is performed before the rest of the layer's operations are applied. Moreover, any kind of gradient descent (for example, stochastic gradient descent (SGD) and its variants) can be used to train the BN network.
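
To make the transformation concrete, the following NumPy sketch (an illustrative assumption, not code from the paper) shows what a BN layer computes: during training, each feature is normalized by its mini-batch mean and variance and then scaled and shifted by the learnable parameters gamma and beta; at inference time, the batch statistics are replaced with population (running) estimates:

import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    # x: mini-batch of activations with shape (batch_size, num_features)
    mu = x.mean(axis=0)                      # per-feature batch mean
    var = x.var(axis=0)                      # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize to zero mean, unit variance
    return gamma * x_hat + beta, mu, var     # scale and shift with learnable gamma, beta

def batch_norm_infer(x, gamma, beta, pop_mu, pop_var, eps=1e-5):
    # At test time, use the population statistics gathered during training
    x_hat = (x - pop_mu) / np.sqrt(pop_var + eps)
    return gamma * x_hat + beta

# Toy usage: a mini-batch of 4 examples with 3 features each
x = np.random.randn(4, 3)
gamma, beta = np.ones(3), np.zeros(3)
y, mu, var = batch_norm_train(x, gamma, beta)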

Interested readers can refer to the original paper for more information: Ioffe, Sergey, and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015).

Now a valid question would be: where should the BN layer be placed? A quick evaluation of BatchNorm layer performance on ImageNet-2012 (https://github.com/ducha-aiki/caffenet-benchmark/blob/master/batchnorm.md) provides a useful benchmark.

That benchmark indicates that placing BN after the non-linearity is the better choice. The second question would be: which activation function should be used together with a BN layer? The same benchmark offers guidance here as well.

It suggests that using ReLU or one of its variants is the better idea. The next question is how to use BN with a deep learning library. In TensorFlow (using the 1.x API), it looks like this:

import tensorflow as tf

# Boolean placeholder that switches BN between training and inference behavior
training = tf.placeholder(tf.bool)
x = tf.layers.dense(input_x, units=100)                    # fully connected layer
x = tf.layers.batch_normalization(x, training=training)    # BN before the non-linearity
x = tf.nn.relu(x)                                          # non-linear activation
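
Note that the preceding snippet follows the original paper and applies BN before the ReLU. If you want the post-activation placement that the benchmark above favors, a minimal variant (reusing the same input_x and training placeholder) could look like this:

x = tf.layers.dense(input_x, units=100)
x = tf.nn.relu(x)                                          # non-linearity first
x = tf.layers.batch_normalization(x, training=training)    # then batch normalization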

A general warning: set the training placeholder to True for training and False for testing. However, the preceding addition also introduces extra ops into the graph: they update the BN layer's moving mean and variance variables, but they are not dependencies of your training op by default. One simple way to handle this is to run these ops together with the training op, as follows:

# Collect the BN moving mean/variance update ops and run them alongside the training op
extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
sess.run([train_op, extra_update_ops], ...)
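
Alternatively, the update ops can be attached as explicit dependencies of the training op when the graph is built, so that a single sess.run(train_op) also keeps the moving statistics up to date. A minimal sketch, assuming loss is your training objective and using a standard optimizer:

update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    # The moving mean/variance updates now run before each training step
    train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)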