Running k-means

Setting up k-means works exactly the same as in the previous examples. We tell the algorithm to perform at most 10 iterations and to stop early if our estimate of the cluster centers moves by less than a distance of 1.0 from one iteration to the next:

In [2]: import cv2
...     criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER,
...                 10, 1.0)
...     flags = cv2.KMEANS_RANDOM_CENTERS

Then, we apply k-means to the data as we did before. Since there are 10 different digits (0-9), we tell the algorithm to look for 10 distinct clusters:

In [3]: import numpy as np
...     digits.data = digits.data.astype(np.float32)
...     compactness, clusters, centers = cv2.kmeans(digits.data, 10, None,
...                                                 criteria, 10, flags)

And we're done!
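
As a quick sanity check, we can inspect the three values that cv2.kmeans returned. Here is a minimal sketch, assuming the variables from In [3] are still in scope:

# Inspect the k-means return values (a sketch; assumes In [3] has run):
#   compactness - sum of squared distances from each sample to its center
#   clusters    - one cluster index per digit image
#   centers     - one flattened 8 x 8 image per cluster
print(compactness)
print(clusters.shape)  # (1797, 1) for scikit-learn's digits dataset
print(centers.shape)   # (10, 64)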

Similar to the N x 3 matrices that represented the different RGB colors, this time, the centers array is of size N x 64, where N is the number of clusters and every row holds a flattened 8 x 8 center image. Therefore, if we want to plot the centers, we first have to reshape the centers matrix back into a stack of 8 x 8 images:

In [4]: import matplotlib.pyplot as plt
...     plt.style.use('ggplot')
...     %matplotlib inline
...     fig, ax = plt.subplots(2, 5, figsize=(8, 3))
...     centers = centers.reshape(10, 8, 8)
...     for axi, center in zip(ax.flat, centers):
...         axi.set(xticks=[], yticks=[])
...         axi.imshow(center, interpolation='nearest', cmap=plt.cm.binary)

The output is a figure showing the 10 cluster centers, each rendered as an 8 x 8 grayscale image.

Does it look familiar?

Remarkably, k-means was able to partition the digit images not just into any 10 random clusters, but into clusters corresponding to the digits 0-9! However, the cluster IDs it assigns are arbitrary, because the algorithm never saw the true digit names. To find out which images were grouped into which clusters, we generate a labels vector, as we know it from supervised learning problems, by relabeling each cluster with the most frequent true digit among its members:

In [5]: from scipy.stats import mode
...     labels = np.zeros_like(clusters.ravel())
...     for i in range(10):
...         mask = (clusters.ravel() == i)
...         labels[mask] = mode(digits.target[mask])[0]
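
To see how the relabeling turned out, we can print how many samples each cluster contains and which digit it was mapped to. Here is a minimal sketch, assuming the clusters and labels arrays from the previous steps are still in scope:

# Per-cluster summary (a sketch; assumes `clusters` from In [3] and
# `labels` from In [5], and that no cluster is empty)
for i in range(10):
    mask = (clusters.ravel() == i)
    print('cluster %d: %3d samples, relabeled as digit %d'
          % (i, mask.sum(), labels[mask][0]))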

Then, we can calculate the performance of the algorithm using scikit-learn's accuracy_score metric:

In [6]: from sklearn.metrics import accuracy_score
...     accuracy_score(digits.target, labels)
Out[6]: 0.78464106844741233

Remarkably, k-means achieved roughly 78.5% accuracy without knowing the first thing about the labels of the original images!

We can gain more insight about what went wrong, and how, by looking at the confusion matrix. The confusion matrix is a 2D matrix, C, where every element, C[i, j], is equal to the number of observations known to be in group (or cluster) i but predicted to be in group j. Hence, all elements on the diagonal of the matrix represent data points that have been correctly classified (that is, known to be in group i and predicted to be in group i). Off-diagonal elements show misclassifications.

In scikit-learn, creating a confusion matrix is essentially a one-liner:

In [7]: from sklearn.metrics import confusion_matrix
...     confusion_matrix(digits.target, labels)
Out[7]: array([[177,   0,   0,   0,   1,   0,   0,   0,   0,   0],
               [  0, 154,  25,   0,   0,   1,   2,   0,   0,   0],
               [  1,   3, 147,  11,   0,   0,   0,   3,  12,   0],
               [  0,   1,   2, 159,   0,   2,   0,   9,  10,   0],
               [  0,  12,   0,   0, 162,   0,   0,   5,   2,   0],
               [  0,   0,   0,  40,   2, 138,   2,   0,   0,   0],
               [  1,   2,   0,   0,   0,   0, 177,   0,   1,   0],
               [  0,  14,   0,   0,   0,   0,   0, 164,   1,   0],
               [  0,  23,   3,   8,   0,   5,   1,   2, 132,   0],
               [  0,  21,   0, 145,   0,   5,   0,   8,   1,   0]])

The confusion matrix tells us that k-means did a pretty good job at classifying data points from the first nine classes (0-8); however, it confused nearly all the nines with threes: the last row shows that 145 of the nines landed in the cluster labeled 3, and not a single one was classified correctly. Still, this result is pretty solid, given that the algorithm had no target labels to train on.
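
One way to make these misclassifications jump out is to draw the confusion matrix as a heatmap. Here is a minimal sketch, assuming digits.target and labels from the previous steps are still in scope:

# Render the confusion matrix as an annotated heatmap (a sketch;
# assumes `digits.target` and `labels` from the previous steps)
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
conf_mat = confusion_matrix(digits.target, labels)
fig, ax = plt.subplots(figsize=(5, 5))
ax.imshow(conf_mat, interpolation='nearest', cmap=plt.cm.Blues)
ax.set(xticks=range(10), yticks=range(10),
       xlabel='predicted label', ylabel='true label')
# Annotate every cell with its count, in a color that stays readable
for i in range(10):
    for j in range(10):
        ax.text(j, i, conf_mat[i, j], ha='center', va='center',
                color='white' if conf_mat[i, j] > 100 else 'black')

The dark column over the 3 label makes the fate of the nines immediately visible.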
