In Python Data Analysis, you learned about clustering: separating data into groups without providing any hints, which is a form of unsupervised learning. Sometimes we need to guess the number of clusters, as we did in the Clustering streaming data with Spark recipe.
Nothing prevents clusters from containing other clusters; in such a case, we speak of hierarchical clustering. We need a distance metric to separate data points. Take a look at the following equations:

$$d(\mathbf{a}, \mathbf{b}) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2} \tag{9.2}$$

$$d(A, B) = \min\{\, d(a, b) : a \in A,\ b \in B \,\} \tag{9.3}$$
In this recipe, we will use the Euclidean distance (9.2), provided by the SciPy pdist() function. The distance between sets of points is given by the linkage criterion. In this recipe, we will use the single-linkage criterion (9.3), provided by the SciPy linkage() function, which takes the minimum distance over all pairs of points drawn from two clusters, as shown in the short sketch that follows.
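To see what these two functions produce before we apply them to the weather data, here is a minimal sketch; the toy points array is made up purely for illustration:

import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

# Hypothetical toy data: four 2-D points forming two obvious pairs.
points = np.array([[0.0, 0.0],
                   [0.0, 1.0],
                   [5.0, 5.0],
                   [5.0, 6.0]])

# pdist() returns the condensed distance matrix: the n*(n-1)/2
# pairwise Euclidean distances (6 values for 4 points).
dist = pdist(points)  # metric='euclidean' is the default

# linkage() builds the hierarchy; method='single' (the default)
# applies the single-linkage criterion (9.3). Each row of Z
# records one merge of two clusters and the distance between them.
Z = linkage(dist, method='single')
print(Z)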
The script is in the clustering_hierarchy.ipynb file in this book's code bundle:
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import dendrogram
import dautil as dl
import matplotlib.pyplot as plt

# Load the weather data and resample it to annual means (recent
# pandas versions require an explicit aggregation after resample()).
df = dl.data.Weather.load().resample('A').mean().dropna()

# Compute the condensed matrix of pairwise Euclidean distances.
dist = pdist(df)

# Build the single-linkage hierarchy and plot it as a dendrogram,
# labeling each leaf with its year.
dendrogram(linkage(dist),
           labels=[d.year for d in df.index],
           orientation='right')
plt.tick_params(labelsize=8)
plt.xlabel('Cluster')
plt.ylabel('Year')
Refer to the following screenshot for the end result:
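As an optional extension, not part of the original recipe: if you need flat cluster labels rather than a dendrogram, SciPy's fcluster() function can cut the tree at a given number of clusters. Continuing from the script above, where dist has already been computed:

from scipy.cluster.hierarchy import fcluster

# Cut the single-linkage tree into at most three flat clusters;
# the choice of three is arbitrary here, for illustration only.
labels = fcluster(linkage(dist), t=3, criterion='maxclust')
print(labels)  # one cluster id per year in df.index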
See also:

- The documentation of the pdist() function at https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html (retrieved November 2015)
- The documentation of the linkage() function at https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html (retrieved November 2015)