Get started with exploring MNIST

The MNIST dataset from http://yann.lecun.com/exdb/mnist/ consists of a training set of 60,000 samples, and a testing set of 10,000 samples. As said previously, images were originally taken from the NIST, and then centered and resized to the same height and width (28 * 28).

Rather than handling the ubyte files, train-images-idx3-ubyte.gz and train-labels-idx1-ubyte.gz in the preceding website and merge them, we use a dataset that is well-formatted from the Kaggle competition Digit Recognizer, https://www.kaggle.com/c/digit-recognizer/. We can download the training dataset, train.csv directly from https://www.kaggle.com/c/digit-recognizer/data. It is the only labeled dataset provided in the site, and we will use it to train classification models, evaluate models and do predictions. Now let's load it up:

> data <- read.csv ("train.csv")
> dim(data)
[1] 42000 785

We have 42,000 labeled samples available, and each sample has 784 features, which means each digit image has 784 (28 * 28) pixels. Take a look at the label and the first 5 features (pixels) for each of the first 6 data samples:

> head(data[1:6])
label pixel0 pixel1 pixel2 pixel3 pixel4
1 1 0 0 0 0 0
2 0 0 0 0 0 0
3 1 0 0 0 0 0
4 4 0 0 0 0 0
5 0 0 0 0 0 0
6 0 0 0 0 0 0

The target label ranging from 0 to 9 denotes 10 digits:

> unique(unlist(data[1]))
[1] 1 0 4 7 3 5 8 9 2 6

The pixel variable ranges from 0 to 255, representing the brightness of the pixel, for example 0 means black and 255 stands for white:

> min(data[2:785])
[1] 0
> max(data[2:785])
[1] 255

Now let's take a look at two samples, first, the fourth image:

> sample_4 <- matrix(as.numeric(data[4,-1]), nrow = 28, byrow = TRUE)
> image(sample_4, col = grey.colors(255))

Where we reshaped the feature vector of length 784 into a matrix of 28 * 28.

Second, the 7th image:

> sample_7 <- matrix(as.numeric(data[7,-1]), nrow = 28, byrow = TRUE)
> image(sample_7, col = grey.colors(255))

The result is as follows:

We noticed that the images are rotated 90 degrees to the left. To better view the images, a rotation of 90 degrees clockwise is required. We simply need to reserve elements in each column of an image matrix:

> # Rotate the matrix by reversing elements in each column
> rotate <- function(x) t(apply(x, 2, rev))

Now visualize the rotated images:


> image(rotate(sample_4), col = grey.colors(255))
> image(rotate(sample_7), col = grey.colors(255))

After viewing what the data and images behind look like, we do more exploratory analysis on the labels and features. Firstly, because it is a classification problem, we inspect whether the classes from the data are balanced or unbalanced as a good practice. But before doing so, we should transform the label from integer to factor:

> # Transform target variable "label" from integer to factor, in order to perform classification
> is.factor(data$label)
[1] FALSE
> data$label <- as.factor(data$label)
> is.factor(data$label)
[1] TRUE

Now, we can summarize the label distribution in counts:

> summary(data$label)
0 1 2 3 4 5 6 7 8 9
4132 4684 4177 4351 4072 3795 4137 4401 4063 4188

Or combined with proportion (%):

> proportion <- prop.table(table(data$label)) * 100
> cbind(count=table(data$label), proportion=proportion)
count proportion
0 4132 9.838095
1 4684 11.152381
2 4177 9.945238
3 4351 10.359524
4 4072 9.695238
5 3795 9.035714
6 4137 9.850000
7 4401 10.478571
8 4063 9.673810
9 4188 9.971429

Classes are balanced.

Now, we explore the distribution of features, the pixels. As an example, we take the 4 pixels from the central 2*2 block (that is, pixel376, pixel377, pixel404, and pixel405) in each image and display the histogram for each of the 9 digits:

> central_block <- c("pixel376", "pixel377", "pixel404", "pixel405")
> par(mfrow=c(2, 2))
> for(i in 1:9) {
+ hist(c(as.matrix(data[data$label==i, central_block])),
+ main=sprintf("Histogram for digit %d", i),
+ xlab="Pixel value")
+ }

The resulting pixel brightness histograms for digit 1 to 4 are displayed respectively, as follows:

Histograms for digits 5 to 9:

And that for digit 9:

The brightness of central pixels is distributed differently among 9 digits. For instance, the majority of the central pixels are bright, as digit 8 is usually written in a way that strokes go through the center; while digit 7 is not written in this way, hence most of the central pixels are dark. Pixels taken from other positions can also be distinctly distributed among different digits.

The exploratory analysis we just conducted helps move us forward with building classification models based on pixels.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset