Hierarchically clustering data

In Python Data Analysis, you learned about clustering (separating data into clusters without providing any hints), which is a form of unsupervised learning. Sometimes we need to guess the number of clusters, as we did in the Clustering streaming data with Spark recipe.

There is no restriction against clusters containing other clusters; in that case, we speak of hierarchical clustering. We need a distance metric to separate data points. Take a look at the following equations:

The Euclidean distance between two points a and b:

    d(a, b) = sqrt(sum_i (a_i - b_i)^2)                (9.2)

The single-linkage distance between two clusters A and B:

    d(A, B) = min { d(a, b) : a in A, b in B }         (9.3)

In this recipe, we will use the Euclidean distance (9.2), the default metric of the SciPy pdist() function. The distance between sets of points is given by the linkage criterion. In this recipe, we will use the single-linkage criterion (9.3), the default of the SciPy linkage() function.
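To make the two formulas concrete before applying them to the weather data, here is a minimal sketch on made-up toy points (not the recipe's data), showing that pdist() computes Euclidean distances (9.2) by default and that linkage() merges clusters by the single-linkage rule (9.3) by default:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

# Three toy points, chosen so the distances are easy to check by hand.
points = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]])

# pdist() defaults to the Euclidean metric (9.2) and returns the
# condensed distances for the pairs (0,1), (0,2), (1,2).
dist = pdist(points)
print(dist)  # [5.0, 1.0, 4.2426...] since sqrt(3**2 + 3**2) = sqrt(18)

# linkage() defaults to method='single' (9.3): points 0 and 2 merge
# first at distance 1, then point 1 joins at min(5, sqrt(18)).
Z = linkage(dist)
print(Z)
```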

How to do it...

The script is in the clustering_hierarchy.ipynb file in this book's code bundle:

  1. The imports are as follows:
    from scipy.spatial.distance import pdist
    from scipy.cluster.hierarchy import linkage
    from scipy.cluster.hierarchy import dendrogram
    import dautil as dl
    import matplotlib.pyplot as plt
  2. Load the data, resample to annual values, and compute distances:
    df = dl.data.Weather.load().resample('A').mean().dropna()
    dist = pdist(df)
  3. Plot the hierarchical cluster as follows:
    dendrogram(linkage(dist), labels=[d.year for d in df.index],
               orientation='right')
    plt.tick_params(labelsize=8)
    plt.xlabel('Cluster')
    plt.ylabel('Year')
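If the dautil weather bundle is not at hand, the steps above can be run end to end on synthetic data. The DataFrame below and its columns (TEMP, RAIN) are stand-ins invented for illustration, not the book's data; the point is the shape of the pipeline, one row per year after resampling:

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram

# Synthetic daily "weather": ten years of two random series.
rng = np.random.default_rng(42)
idx = pd.date_range('2000-01-01', '2009-12-31', freq='D')
df = pd.DataFrame({'TEMP': rng.normal(10, 5, len(idx)),
                   'RAIN': rng.normal(2, 1, len(idx))}, index=idx)

# Annual means, one row per year. Recent pandas spells the annual
# frequency 'YE' and may reject 'A', hence the fallback.
try:
    annual = df.resample('A').mean().dropna()
except ValueError:
    annual = df.resample('YE').mean().dropna()

dist = pdist(annual)                        # Euclidean distances (9.2)
result = dendrogram(linkage(dist),          # single linkage (9.3)
                    labels=[d.year for d in annual.index],
                    orientation='right')
plt.tick_params(labelsize=8)
plt.ylabel('Year')
print(result['ivl'])  # leaf labels (years) in dendrogram order
```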

Refer to the following screenshot for the end result:

(Screenshot: a dendrogram of the annually resampled weather data, with years on the y-axis.)

See also
