In the following steps, you will learn to preprocess the data before it is fed to the neural network:
- To make sure we get the same result every time we run the experiment, we will fix the seed of NumPy's random number generator. This way, shuffling the training samples from the MNIST dataset will always result in the same order:
In [1]: import numpy as np
... np.random.seed(1337)
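To see why fixing the seed matters, here is a minimal sketch (using a toy permutation of ten indices rather than the actual MNIST data) showing that two runs seeded with the same value produce an identical shuffle order:

```python
import numpy as np

# First "run": seed, then shuffle ten sample indices
np.random.seed(1337)
order_a = np.random.permutation(10)

# Second "run": same seed, same shuffle order
np.random.seed(1337)
order_b = np.random.permutation(10)

print(np.array_equal(order_a, order_b))  # True
```

Without the call to `np.random.seed`, the two permutations would almost certainly differ, and so would any result that depends on the sample order.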
- Keras provides a loading function similar to train_test_split from scikit-learn's model_selection module. Its syntax might look strangely familiar to you:
In [2]: from keras.datasets import mnist
... (X_train, y_train), (X_test, y_test) = mnist.load_data()
In contrast to other datasets we have encountered so far, MNIST comes with a predefined train-test split. This allows the dataset to be used as a benchmark, as the test score reported by different algorithms will always apply to the same test samples.
- The neural networks in Keras act on the feature matrix slightly differently than the standard OpenCV and scikit-learn estimators. Whereas the rows of a feature matrix in Keras still correspond to the number of samples (X_train.shape[0] in the following code), we can preserve the two-dimensional nature of the input images by adding more dimensions to the feature matrix:
In [3]: img_rows, img_cols = 28, 28
... X_train = X_train.reshape(X_train.shape[0], img_rows, img_cols, 1)
... X_test = X_test.reshape(X_test.shape[0], img_rows, img_cols, 1)
... input_shape = (img_rows, img_cols, 1)
- Here, we have reshaped the feature matrix into a four-dimensional array with dimensions n_samples x 28 x 28 x 1. We also need to make sure we operate on 32-bit floating point numbers in the range [0, 1], rather than unsigned integers in [0, 255]:
... X_train = X_train.astype('float32') / 255.0
... X_test = X_test.astype('float32') / 255.0
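The reshaping and rescaling steps above can be sketched end to end on a small stand-in array (random `uint8` images in place of the real MNIST data, so the snippet runs without downloading anything):

```python
import numpy as np

# Hypothetical stand-in for X_train: five grayscale 28 x 28 images
X = np.random.randint(0, 256, size=(5, 28, 28), dtype=np.uint8)

# Add a trailing channel dimension: (samples, rows, cols, channels)
X = X.reshape(X.shape[0], 28, 28, 1)

# Convert to 32-bit floats scaled into [0, 1]
X = X.astype('float32') / 255.0

print(X.shape, X.dtype)  # (5, 28, 28, 1) float32
```

The same two operations applied to the real `X_train` yield an array of shape (60000, 28, 28, 1) ready for Keras.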
- Then, we can one-hot encode the training labels as we did before. This will make sure each category of target labels can be assigned to a neuron in the output layer. We could do this with scikit-learn's preprocessing module, but in this case, it is easier to use Keras' own utility function:
In [4]: from keras.utils import np_utils
... n_classes = 10
... Y_train = np_utils.to_categorical(y_train, n_classes)
... Y_test = np_utils.to_categorical(y_test, n_classes)
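What the one-hot encoding produces can be sketched with plain NumPy (the example labels below are made up for illustration, but the indexing trick is equivalent to what to_categorical does):

```python
import numpy as np

y = np.array([0, 2, 9])  # three example digit labels
n_classes = 10

# Row i of the identity matrix is the one-hot vector for class i,
# so fancy indexing with y picks out one row per label
Y = np.eye(n_classes, dtype='float32')[y]

print(Y[1])  # label 2 -> 1.0 in column 2, 0.0 everywhere else
```

Each row of `Y` sums to 1 and has exactly one nonzero entry, matching the ten output neurons of the network.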