How it works...

In this recipe, we demonstrate how to validate clusters. To validate a clustering method, we often rely on two measures: intercluster distance and intracluster distance. The higher the intercluster distance, the better separated the clusters are; the lower the intracluster distance, the more compact they are. To calculate the related statistics, we can apply cluster.stats from the fpc package to a distance matrix and the cluster assignments taken from the fitted clustering object.
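As a minimal sketch of this step, the following R code clusters a toy data set and reads the average intercluster and intracluster distances from the result of cluster.stats. The iris data, the choice of k = 3, and the variable names are assumptions for illustration and are not part of the recipe. The call to fpc is guarded so that the script still runs if the package is not installed.

```r
# Toy data (an assumption): the four numeric columns of the built-in iris data set
x <- scale(iris[, 1:4])
d <- dist(x)                              # Euclidean distance matrix

# Complete-linkage hierarchical clustering, cut into k = 3 groups
hc <- hclust(d, method = "complete")
groups <- cutree(hc, k = 3)               # integer cluster label per observation

if (requireNamespace("fpc", quietly = TRUE)) {
  cs <- fpc::cluster.stats(d, groups)
  # average.between: mean distance between points in different clusters
  # average.within:  mean distance between points in the same cluster
  print(unlist(cs[c("average.between", "average.within")]))
}
```

A well-separated, compact clustering should show an average.between value that is clearly larger than average.within.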

From the output, the within.cluster.ss measurement stands for the within-cluster sum of squares, and avg.silwidth represents the average silhouette width. The within.cluster.ss measurement shows how closely related objects are within clusters; the smaller the value, the more closely related the objects in each cluster are. A silhouette, on the other hand, is a measurement that considers both how closely related objects are within a cluster and how well clusters are separated from each other. Mathematically, we can define the silhouette width for each point x as follows:

$$ s(x) = \frac{b(x) - a(x)}{\max\{a(x),\, b(x)\}} $$

In the preceding equation, a(x) is the average distance between x and all other points within its own cluster, and b(x) is the minimum, taken over the other clusters, of the average distance between x and the points in that cluster. The silhouette value ranges from -1 to 1; a value closer to 1 suggests the point is well clustered, a value near 0 suggests it lies between two clusters, and a negative value suggests it may have been assigned to the wrong cluster.
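To make the definition concrete, here is a small base R sketch that computes the silhouette width of each point directly from a(x) and b(x). The helper name silhouette_width is hypothetical and is not a function from fpc. The sketch assumes at least two clusters and, as a common convention, assigns a width of 0 to a point that is alone in its cluster.

```r
# Hypothetical helper: silhouette width per point, straight from the definition.
# d: a "dist" object; clustering: one cluster label per observation.
silhouette_width <- function(d, clustering) {
  m <- as.matrix(d)
  sapply(seq_len(nrow(m)), function(i) {
    own <- clustering == clustering[i]
    own[i] <- FALSE                        # exclude the point itself
    if (!any(own)) return(0)               # singleton cluster: width set to 0
    a <- mean(m[i, own])                   # a(x): mean distance within own cluster
    others <- setdiff(unique(clustering), clustering[i])
    b <- min(vapply(others,                # b(x): smallest mean distance to another cluster
                    function(k) mean(m[i, clustering == k]),
                    numeric(1)))
    (b - a) / max(a, b)
  })
}

# Four points on a line, forming two tight and well-separated clusters
pts <- matrix(c(0, 1, 10, 11), ncol = 1)
cl  <- c(1, 1, 2, 2)
s   <- silhouette_width(dist(pts), cl)
print(round(s, 3))   # 0.905 0.895 0.895 0.905
print(mean(s))       # average silhouette width
```

For the first point, a(x) = 1 and b(x) = (10 + 11) / 2 = 10.5, so s(x) = 9.5 / 10.5, which is about 0.905. The mean of these values is the quantity that avg.silwidth reports for a whole clustering.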

The summary table generated in the last step shows that complete-linkage hierarchical clustering outperforms both single-linkage hierarchical clustering and k-means clustering on within.cluster.ss and avg.silwidth.
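A comparison table of this kind can be sketched as follows. The iris data, k = 3, the seed, and the names in the list are assumptions for illustration, so the ranking it produces need not match the one reported for the recipe's own data. The fpc call is again guarded in case the package is not installed.

```r
set.seed(22)                              # k-means uses random starting centers
x <- scale(iris[, 1:4])                   # toy data (an assumption)
d <- dist(x)
k <- 3

# One vector of cluster labels per method
clusterings <- list(
  hc_single   = cutree(hclust(d, method = "single"),   k = k),
  hc_complete = cutree(hclust(d, method = "complete"), k = k),
  kmeans      = kmeans(x, centers = k, nstart = 10)$cluster
)

if (requireNamespace("fpc", quietly = TRUE)) {
  # Rows: the two validation statistics; columns: the clustering methods
  res <- sapply(clusterings, function(cl) {
    unlist(fpc::cluster.stats(d, cl)[c("within.cluster.ss", "avg.silwidth")])
  })
  print(res)
}
```

When reading the table, a lower within.cluster.ss and a higher avg.silwidth both point to the better clustering.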
