In this section, we will take a look at an alternative approach to prototype-based clustering: hierarchical clustering. One advantage of hierarchical clustering algorithms is that it allows us to plot dendrograms (visualizations of a binary hierarchical clustering), which can help with the interpretation of the results by creating meaningful taxonomies. Another useful advantage of this hierarchical approach is that we do not need to specify the number of clusters upfront.
The two main approaches to hierarchical clustering are agglomerative and divisive hierarchical clustering. In divisive hierarchical clustering, we start with one cluster that encompasses all our samples, and we iteratively split the cluster into smaller clusters until each cluster only contains one sample. In this section, we will focus on agglomerative clustering, which takes the opposite approach. We start with each sample as an individual cluster and merge the closest pairs of clusters until only one cluster remains.
The two standard algorithms for agglomerative hierarchical clustering are single linkage and complete linkage. Using single linkage, we compute the distances between the most similar members for each pair of clusters and merge the two clusters for which the distance between the most similar members is the smallest. The complete linkage approach is similar to single linkage but, instead of comparing the most similar members in each pair of clusters, we compare the most dissimilar members to perform the merge. This is shown in the following diagram:
Other commonly used algorithms for agglomerative hierarchical clustering include average linkage and Ward's linkage. In average linkage, we merge the cluster pairs based on the minimum average distances between all group members in the two clusters. In Ward's method, those two clusters that lead to the minimum increase of the total within-cluster SSE are merged.
In this section, we will focus on agglomerative clustering using the complete linkage approach. This is an iterative procedure that can be summarized by the following steps:
Now we will discuss how to compute the distance matrix (step 1). But first, let's generate some random sample data to work with. The rows represent different observations (IDs 0 to 4), and the columns are the different features (X, Y, Z) of those samples:
>>> import pandas as pd >>> import numpy as np >>> np.random.seed(123) >>> variables = ['X', 'Y', 'Z'] >>> labels = ['ID_0','ID_1','ID_2','ID_3','ID_4'] >>> X = np.random.random_sample([5,3])*10 >>> df = pd.DataFrame(X, columns=variables, index=labels) >>> df
After executing the preceding code, we should now see the following distance matrix:
To calculate the distance matrix as input for the hierarchical clustering algorithm, we will use the pdist
function from SciPy's spatial.distance
submodule:
>>> from scipy.spatial.distance import pdist, squareform >>> row_dist = pd.DataFrame(squareform( ... pdist(df, metric='euclidean')), ... columns=labels, index=labels) >>> row_dist
Using the preceding code, we calculated the Euclidean distance between each pair of sample points in our dataset based on the features X, Y, and Z. We provided the condensed distance matrix—returned by pdist
—as input to the squareform
function to create a symmetrical matrix of the pair-wise distances, as shown here:
Next we will apply the complete linkage agglomeration to our clusters using the linkage
function from SciPy's cluster.hierarchy
submodule, which returns a so-called linkage matrix.
However, before we call the linkage
function, let's take a careful look at the function documentation:
>>> from scipy.cluster.hierarchy import linkage >>> help(linkage) […] Parameters: y : ndarray A condensed or redundant distance matrix. A condensed distance matrix is a flat array containing the upper triangular of the distance matrix. This is the form that pdist returns. Alternatively, a collection of m observation vectors in n dimensions may be passed as an m by n array. method : str, optional The linkage algorithm to use. See the Linkage Methods section below for full descriptions. metric : str, optional The distance metric to use. See the distance.pdist function for a list of valid distance metrics. Returns: Z : ndarray The hierarchical clustering encoded as a linkage matrix. […]
Based on the function description, we conclude that we can use a condensed distance matrix (upper triangular) from the pdist
function as an input attribute. Alternatively, we could also provide the initial data array and use the euclidean
metric as a function argument in linkage
. However, we should not use the squareform distance matrix that we defined earlier, since it would yield different distance values from those expected. To sum it up, the three possible scenarios are listed here:
>>> from scipy.cluster.hierarchy import linkage >>> row_clusters = linkage(row_dist, ... method='complete', ... metric='euclidean')
>>> row_clusters = linkage(pdist(df, metric='euclidean'), ... method='complete')
>>> row_clusters = linkage(df.values, ... method='complete', ... metric='euclidean')
To take a closer look at the clustering results, we can turn them to a pandas DataFrame
(best viewed in IPython Notebook) as follows:
>>> pd.DataFrame(row_clusters, ... columns=['row label 1', ... 'row label 2', ... 'distance', ... 'no. of items in clust.'], ... index=['cluster %d' %(i+1) for i in ... range(row_clusters.shape[0])])
As shown in the following table, the linkage matrix consists of several rows where each row represents one merge. The first and second columns denote the most dissimilar members in each cluster, and the third row reports the distance between those members. The last column returns the count of the members in each cluster.
Now that we have computed the linkage matrix, we can visualize the results in the form of a dendrogram:
>>> from scipy.cluster.hierarchy import dendrogram # make dendrogram black (part 1/2) # from scipy.cluster.hierarchy import set_link_color_palette # set_link_color_palette(['black']) >>> row_dendr = dendrogram(row_clusters, ... labels=labels, ... # make dendrogram black (part 2/2) ... # color_threshold=np.inf ... ) >>> plt.tight_layout() >>> plt.ylabel('Euclidean distance') >>> plt.show()
If you are executing the preceding code or reading the e-book version of this book, you will notice that the branches in the resulting dendrogram are shown in different colors. The coloring scheme is derived from a list of matplotlib colors that are cycled for the distance thresholds in the dendrogram. For example, to display the dendrograms in black, you can uncomment the respective sections that I inserted in the preceding code.
Such a dendrogram summarizes the different clusters that were formed during the agglomerative hierarchical clustering; for example, we can see that the samples ID_0 and ID_4, followed by ID_1 and ID_2, are the most similar ones based on the Euclidean distance metric.
In practical applications, hierarchical clustering dendrograms are often used in combination with a heat map, which allows us to represent the individual values in the sample matrix with a color code. In this section, we will discuss how to attach a dendrogram to a heat map plot and order the rows in the heat map correspondingly.
However, attaching a dendrogram to a heat map can be a little bit tricky, so let's go through this procedure step by step:
figure
object and define the x axis position, y axis position, width, and height of the dendrogram via the add_axes
attribute. Furthermore, we rotate the dendrogram 90 degrees counter-clockwise. The code is as follows:>>> fig = plt.figure(figsize=(8,8)) >>> axd = fig.add_axes([0.09,0.1,0.2,0.6]) >>> row_dendr = dendrogram(row_clusters, orientation='right')
DataFrame
according to the clustering labels that can be accessed from the dendrogram object, which is essentially a Python dictionary, via the leaves
key. The code is as follows:>>> df_rowclust = df.ix[row_dendr['leaves'][::-1]]
DataFrame
and position it right next to the dendrogram:>>> axm = fig.add_axes([0.23,0.1,0.6,0.6]) >>> cax = axm.matshow(df_rowclust, ... interpolation='nearest', cmap='hot_r')
>>> axd.set_xticks([]) >>> axd.set_yticks([]) >>> for i in axd.spines.values(): ... i.set_visible(False) >>> fig.colorbar(cax) >>> axm.set_xticklabels([''] + list(df_rowclust.columns)) >>> axm.set_yticklabels([''] + list(df_rowclust.index)) >>> plt.show()
After following the previous steps, the heat map should be displayed with the dendrogram attached:
As we can see, the row order in the heat map reflects the clustering of the samples in the dendrogram. In addition to a simple dendrogram, the color-coded values of each sample and feature in the heat map provide us with a nice summary of the dataset.
In this section, we saw how to perform agglomerative hierarchical clustering using SciPy. However, there is also an AgglomerativeClustering
implementation in scikit-learn, which allows us to choose the number of clusters that we want to return. This is useful if we want to prune the hierarchical cluster tree. By setting the n_cluster
parameter to 2
, we will now cluster the samples into two groups using the same complete linkage approach based on the Euclidean distance metric as before:
>>> from sklearn.cluster import AgglomerativeClustering >>> ac = AgglomerativeClustering(n_clusters=2, ... affinity='euclidean', ... linkage='complete') >>> labels = ac.fit_predict(X) >>> print('Cluster labels: %s' % labels) Cluster labels: [0 1 1 0 0]
Looking at the predicted cluster labels, we can see that the first, fourth, and fifth sample (ID_0, ID_3, and ID_4) were assigned to one cluster (0
), and the samples ID_1 and ID_2 were assigned to a second cluster (1
), which is consistent with the results that we can observe in the dendrogram.