In this chapter, you will learn how to apply unsupervised learning techniques to identify patterns and structure within datasets.
Unsupervised learning techniques are a valuable set of tools for exploratory analysis. They bring out patterns and structure within datasets, yielding information that may be useful in itself or serve as a guide to further analysis. It's critical to have a solid set of unsupervised learning tools that you can apply to help break up unfamiliar or complex datasets into actionable information.
We'll begin by reviewing Principal Component Analysis (PCA), a fundamental data manipulation technique with a range of dimensionality reduction applications. Next, we will discuss k-means clustering, a widely-used and approachable unsupervised learning technique. Then, we will discuss Kohonen's Self-Organizing Map (SOM), a method of topological clustering that enables the projection of complex datasets into two dimensions.
Throughout the chapter, we will spend some time discussing how to effectively apply these techniques to make high-dimensional datasets readily accessible. We will use the UCI Handwritten Digits dataset to demonstrate technical applications of each algorithm. In the course of discussing and applying each technique, we will review practical applications and methodological questions, particularly regarding how to calibrate and validate each technique as well as which performance measures are valid. To recap, then, we will be covering the following topics in order:
- Principal Component Analysis (PCA)
- k-means clustering
- Self-Organizing Maps (SOM)
In order to work effectively with high-dimensional datasets, it is important to have a set of techniques that can reduce this dimensionality down to manageable levels. The advantages of this dimensionality reduction include the ability to plot multivariate data in two dimensions, capture the majority of a dataset's informational content within a minimal number of features, and, in some contexts, identify collinear model components.
For those in need of a refresher, collinearity in a machine learning context refers to model features that share an approximately linear relationship. These features tend to be unhelpful because each one adds little information beyond what the other already provides. Moreover, collinear features may emphasize local minima or other false leads.
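As a minimal sketch of what collinearity looks like in practice, the following builds a small synthetic dataset (the variables x1, x2, and x3 are assumptions made purely for illustration) and inspects the pairwise Pearson correlations. A correlation close to 1 or -1 between two features suggests that they are nearly collinear.

```python
import numpy as np

# Synthetic data (illustrative assumption): x2 is almost a linear function of x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 3.0 * x1 + rng.normal(scale=0.1, size=500)  # nearly collinear with x1
x3 = rng.normal(size=500)                        # generated independently of x1 and x2

features = np.column_stack([x1, x2, x3])

# Pearson correlation matrix; rowvar=False treats each column as a variable.
corr = np.corrcoef(features, rowvar=False)
print(np.round(corr, 2))
```

Here the entry for the (x1, x2) pair should sit very close to 1, while the entries involving x3 should be close to 0.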
Probably the most widely-used dimensionality reduction technique today is PCA. As we'll be applying PCA in multiple contexts throughout this book, it's appropriate for us to review the technique, understand the theory behind it, and write Python code to effectively apply it.
PCA is a powerful decomposition technique; it allows one to break down a highly multivariate dataset into a set of orthogonal components. When taken together in sufficient number, these components can explain almost all of the dataset's variance. In essence, these components deliver an abbreviated description of the dataset. PCA has a broad set of applications and its extensive utility makes it well worth our time to cover.
Note the slightly cautious phrasing here: a set of components smaller than the number of variables in the original dataset will almost always lose some of the information content within the source dataset. This lossiness is typically minimal, given enough components, but in cases where small numbers of principal components are composed from very high-dimensional datasets, there may be substantial lossiness. As such, when performing PCA, it is always appropriate to consider how many components will be necessary to effectively model the dataset in question.
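One common way to approach the question of how many components are needed is to look at the cumulative explained variance. The sketch below assumes the digits dataset that ships with scikit-learn (used later in this chapter) and an arbitrary 90% threshold; both the threshold and the variable names are illustrative choices, not a recommendation.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

data = load_digits().data

# Fit PCA with all components so the full variance spectrum is available.
pca_full = PCA().fit(data)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.90  # illustrative choice
n_needed = int(np.searchsorted(cumulative, threshold) + 1)
print(n_needed, cumulative[n_needed - 1])
```

Raising or lowering the threshold trades compactness against retained variance, so the right value depends on what the reduced data will be used for.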
PCA works by successively identifying the axes of greatest variance in a dataset (the principal components). It does this as follows:
Let's unpack these concepts briefly:
In summary, the covariance matrix is used to calculate eigenvectors. An orthonormalization process is undertaken that produces orthogonal, normalized vectors from the eigenvectors. The eigenvector with the greatest eigenvalue is the first principal component, with successive components having smaller eigenvalues. In this way, the PCA algorithm has the effect of taking a dataset and transforming it into a new, lower-dimensional coordinate system.
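The summary above can be sketched in a few lines of NumPy. This is a minimal illustration rather than how scikit-learn implements PCA: the function name pca_eig and the synthetic input are assumptions, and np.linalg.eigh is used because it already returns orthonormal eigenvectors for a symmetric matrix, so no separate orthonormalization step is needed.

```python
import numpy as np

def pca_eig(X, n_components):
    """Project X onto its top principal components via eigendecomposition."""
    # Center the data so the covariance matrix describes spread around the mean.
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)

    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)

    # Keep the eigenvectors with the largest eigenvalues, largest first.
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    return X_centered @ components, eigvals[order], components

# Correlated synthetic data (illustrative assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

projected, variances, components = pca_eig(X, 2)
print(projected.shape, variances)
```

The variance of the data along each retained component equals the corresponding eigenvalue, which is why sorting by eigenvalue orders the components by how much variance they capture.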
Now that we've reviewed the PCA algorithm at a high level, we're going to jump straight in and apply PCA to a key Python dataset: the UCI handwritten digits dataset, distributed as part of scikit-learn.
This dataset is composed of 1,797 instances of handwritten digits gathered from 44 different writers. The input (pressure and location) from these authors' writing is resampled twice across an 8 x 8 grid so as to yield maps of the kind shown in the following image:
These maps can be transformed into feature vectors of length 64, which are then readily usable as analysis input. With an input dataset of 64 features, there is an immediate appeal to using a technique like PCA to reduce the set of variables to a manageable number. As it currently stands, we cannot effectively explore the dataset with exploratory visualization!
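To make the relationship between the 8 x 8 maps and the 64-element feature vectors concrete, the short sketch below flattens the images attribute of the scikit-learn digits object and compares the result with its data attribute. The variable name flattened is an illustrative choice.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

print(digits.images.shape)  # (1797, 8, 8): one 8 x 8 map per instance
print(digits.data.shape)    # (1797, 64): one feature vector per instance

# Flatten each 8 x 8 map row by row into a vector of length 64.
flattened = digits.images.reshape(len(digits.images), -1)
print(np.array_equal(flattened, digits.data))
```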
We will begin applying PCA to the handwritten digits dataset with the following code:
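import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
# sklearn.lda has been removed from scikit-learn; LDA now lives in discriminant_analysis
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

digits = load_digits()
# Standardize each feature to zero mean and unit variance before PCA
data = scale(digits.data)

n_samples, n_features = data.shape
n_digits = len(np.unique(digits.target))
labels = digits.target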
This code does several things for us:
- It imports numpy, a set of components from scikit-learn (including the digits dataset itself, PCA, and data scaling functions), and the plotting capability of matplotlib.
- It loads the digits dataset. It does several things in order:
  - The data variable is created for subsequent use, and the number of distinct digits in the target vector (0 through 9, so n_digits = 10) is saved as a variable that we can easily access for subsequent analysis.
  - The target vector is also saved as labels for later use.

With the dataset loaded, we initialize PCA with 10 components and apply it to the data:

pca = PCA(n_components=10)
data_r = pca.fit(data).transform(data)
print('explained variance ratio (first 10 components): %s' % str(pca.explained_variance_ratio_))
print('sum of explained variance (first 10 components): %s' % str(sum(pca.explained_variance_ratio_)))
Collectively, this set of 10 principal components explains 0.589 of the overall dataset variance. This isn't actually too bad, considering that it's a reduction from 64 variables to 10 components. It does, however, illustrate the potential lossiness of PCA. The key question, though, is whether this reduced set of components makes subsequent analysis or classification easier to achieve; that is, whether many of the remaining components contained variance that disrupts classification attempts.
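The lossiness can also be measured directly by mapping the reduced data back into the original 64-dimensional space and comparing it with the input. The sketch below is one way to do this, under some assumptions: the features are standardized with scale, svd_solver='full' is set so that the decomposition is exact rather than approximated, and the names data_back and lost_fraction are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# Assumption: features are standardized before PCA.
data = scale(load_digits().data)

# svd_solver='full' requests an exact decomposition.
pca = PCA(n_components=10, svd_solver='full')
data_r = pca.fit_transform(data)

# Map the 10-component representation back to 64 features.
data_back = pca.inverse_transform(data_r)

# Fraction of the total variance that the reconstruction fails to recover.
residual = np.sum((data - data_back) ** 2)
total = np.sum((data - data.mean(axis=0)) ** 2)
lost_fraction = residual / total

print(lost_fraction, 1 - pca.explained_variance_ratio_.sum())
```

The two printed values should agree: the variance that cannot be reconstructed is exactly the variance not explained by the retained components.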
Having created a data_r object containing the output of pca performed over the digits dataset, let's visualize the output. To do so, we'll first create a vector of colors for class coloration. We then simply create a scatterplot with colorized classes:
plt.figure()
# One color per digit class (0 through 9)
colors = cm.rainbow(np.linspace(0, 1, n_digits))
for c, i in zip(colors, range(n_digits)):
    # Plot the first two principal components for the instances of digit i
    plt.scatter(data_r[labels == i, 0], data_r[labels == i, 1],
                color=c, alpha=0.4, label=str(i))
plt.legend()
plt.title('Scatterplot of points plotted in the first two principal components')
plt.show()
The resulting scatterplot looks as follows:
This plot shows us that, while there is some separation between classes in the first two principal components, it may be difficult to classify this dataset with high accuracy. However, classes do appear to be clustered and we may be able to get reasonably good results by employing a clustering analysis. In this way, PCA has given us some insight into how the dataset is structured and has informed our subsequent analysis.
At this point, let's take this insight and move on to examine clustering by the application of the k-means clustering algorithm.