Chapter 22
Measuring Cluster Goodness

22.1 Rationale for Measuring Cluster Goodness

Every modeling technique requires an evaluation phase. For example, we may work hard to develop a multiple regression model for predicting the amount of money to be spent on a new car. But, if the standard error of the estimate s for this regression model is $100,000, then the usefulness of the regression model is questionable. In the classification realm, we would expect that a model predicting who will respond to our direct-mail marketing operation will yield more profitable results than the baseline “send-a-coupon-to-everybody” or “send-out-no-coupons-at-all” models.

In a similar way, clustering models need to be evaluated as well. Some of the questions of interest might be the following:

  • Do my clusters actually correspond to reality, or are they simply artifacts of mathematical convenience?
  • I am not sure how many clusters there are in the data. What is the optimal number of clusters to identify?
  • How do I measure whether one set of clusters is preferable to another?

In this chapter, we introduce two methods for measuring cluster goodness: the silhouette method and the pseudo-F statistic. These techniques help to address the questions above by evaluating and measuring the goodness of our cluster solutions. We also examine a method for validating our clusters using cross-validation with graphical and statistical analysis.

Any measure of cluster goodness, or cluster quality, should address the concepts of cluster separation as well as cluster cohesion. Cluster separation represents how distant the clusters are from each other; cluster cohesion refers to how tightly related the records within the individual clusters are. Good measures of cluster quality need to incorporate both criteria. For example, it has been written elsewhere that the sum of squares error (SSE) is a good measure of cluster quality. However, by measuring only the distance between each record and its cluster center, SSE accounts for cluster cohesion alone and does not account for cluster separation. Thus, SSE is monotonically decreasing as the number of clusters increases, which is not a desired property of a valid measure of cluster goodness. Both the silhouette method and the pseudo-F statistic, by contrast, account for cluster cohesion as well as cluster separation.
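The monotone decrease of SSE can be verified directly on a small example. The following Python sketch (the data set and function names are our own illustration) finds, for each k, the minimum-SSE clustering of a tiny one-dimensional data set, exploiting the fact that optimal one-dimensional clusters are contiguous in sorted order:

```python
import itertools

def sse_of_groups(groups):
    """Sum of squared distances of each point to its own cluster mean."""
    total = 0.0
    for g in groups:
        center = sum(g) / len(g)
        total += sum((x - center) ** 2 for x in g)
    return total

def best_sse(data, k):
    """Exact minimum SSE over all k-cluster solutions of 1-D data.
    Optimal 1-D clusters are contiguous in sorted order, so we
    enumerate every placement of k - 1 breakpoints."""
    xs = sorted(data)
    n = len(xs)
    best = float("inf")
    for cuts in itertools.combinations(range(1, n), k - 1):
        bounds = (0,) + cuts + (n,)
        groups = [xs[bounds[i]:bounds[i + 1]] for i in range(k)]
        best = min(best, sse_of_groups(groups))
    return best

data = [0, 2, 4, 6, 10]
sses = [best_sse(data, k) for k in range(1, 6)]
print([round(s, 4) for s in sses])  # [59.2, 16.0, 4.0, 2.0, 0.0]
```

Because adding a cluster can only reduce the best attainable SSE, the minimum SSE falls all the way to zero when k equals the number of records, which is why SSE alone cannot be used to select the number of clusters.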

22.2 The Silhouette Method

The silhouette is a characteristic of each data value, and is defined as follows:

s_i = (b_i − a_i) / max(a_i, b_i)

where a_i is the distance between the ith data value and its own cluster center, and b_i is the distance between the ith data value and the nearest cluster center to which it does not belong.

The silhouette value is used to gauge how good the cluster assignment is for that particular point. A positive value indicates that the assignment is good, with higher values being better than lower values. A value close to zero is considered a weak assignment, as the observation could have been assigned to the next closest cluster with limited negative consequence. A data value with a negative silhouette value is considered misclassified, as assignment to the next closest cluster would have been better.
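As a quick Python sketch (the function name is ours), the definition can be coded directly. Note that this chapter works with the center-based variant of the silhouette; Rousseeuw's original silhouette replaces the center distances with mean pairwise distances within and between clusters:

```python
def silhouette(a_i, b_i):
    """Silhouette of one data value: a_i is its distance to its own
    cluster center, b_i its distance to the nearest other cluster center."""
    return (b_i - a_i) / max(a_i, b_i)

print(silhouette(2.0, 8.0))   # 0.75  -> good assignment
print(silhouette(3.0, 3.2))   # ~0.06 -> weak assignment
print(silhouette(5.0, 2.0))   # -0.6  -> likely misclassified
```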

Note how the definition of silhouette accounts for both separation and cohesion. The value of a_i represents cohesion, as it measures the distance between the data value and its own cluster center, while b_i represents separation, as it measures the distance between the data value and a different cluster. This is illustrated in Figure 22.1. Each of the data values in Cluster 1 has its values of a_i and b_i represented by solid lines and dotted lines, respectively. Clearly, b_i > a_i for each data value, as represented by the longer dotted lines. Thus, each data value's silhouette value is positive, indicating that the data values have not been misclassified. The dotted lines indicate separation, and the solid lines indicate cohesion. (The silhouettes for the data values in Cluster 2 are not represented in Figure 22.1.)


Figure 22.1 Illustration of how silhouette accounts for both separation and cohesion.

Taking the average silhouette value over all records yields a useful measure of how well the cluster solution fits the data. A common rule of thumb, following Kaufman and Rousseeuw, interprets an average silhouette above about 0.5 as evidence of reasonable to strong cluster structure, values between about 0.25 and 0.5 as weak structure, and values below 0.25 as essentially no structure. This thumbnail interpretation is meant as a guideline only, and should bow before the expertise of the domain expert.

22.3 Silhouette Example

Suppose we apply k-means clustering to the following little one-dimensional data set:

{0, 2, 4, 6, 10}

k-means with k = 2 assigns the first three data values to Cluster 1 and the last two to Cluster 2, as shown in Figure 22.2. The cluster center for Cluster 1 is m_1 = 2, and the cluster center for Cluster 2 is m_2 = 8, represented by the dotted vertical lines in Figure 22.2. The value of a_i represents the distance between the data value x_i and the center of the cluster to which x_i belongs. The value of b_i represents the distance between the data value and the other cluster center. Note that s_i > 0 for every data value, because b_i > a_i: each point lies closer to its own cluster center than to the other.


Figure 22.2 Distances between the data values and the cluster centers.

Table 22.1 contains the calculations for the individual data value silhouettes, along with the mean silhouette. Using our rule of thumb, mean silhouette = 0.70 represents good evidence of the reality of the clusters in the data. Note that x = 2 is perfectly classified as belonging to Cluster 1, as it sits right on the cluster center m_1 = 2; thus, its silhouette value is a perfect 1.00. However, x = 4 is somewhat farther from its own cluster center, and somewhat closer to the other cluster center; hence, its silhouette value is lower, 0.50.

Table 22.1 Calculations for individual data value silhouettes and mean silhouette

x_i   a_i   b_i   max(a_i, b_i)   s_i = (b_i − a_i)/max(a_i, b_i)
 0     2     8         8              6/8 = 0.75
 2     0     6         6              6/6 = 1.00
 4     2     4         4              2/4 = 0.50
 6     2     4         4              2/4 = 0.50
10     2     8         8              6/8 = 0.75
Mean silhouette = 0.70
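Table 22.1 can be reproduced with a few lines of Python (a sketch; the cluster assignment is taken from the text rather than computed by k-means):

```python
clusters = {1: [0, 2, 4], 2: [6, 10]}
centers = {c: sum(xs) / len(xs) for c, xs in clusters.items()}  # {1: 2.0, 2: 8.0}

sils = []
for c, members in clusters.items():
    other = centers[2 if c == 1 else 1]   # center of the other cluster
    for x in members:
        a = abs(x - centers[c])           # cohesion: distance to own center
        b = abs(x - other)                # separation: distance to other center
        sils.append((b - a) / max(a, b))

mean_sil = sum(sils) / len(sils)
print(sils, mean_sil)  # [0.75, 1.0, 0.5, 0.5, 0.75] 0.7
```

Using the rule of thumb, a mean silhouette of 0.70 is good evidence that the two clusters are real.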

22.4 Silhouette Analysis of the IRIS Data Set

Next, we apply the silhouette method to Fisher's well-known Iris data set. The data set consists of 150 observations of three species of iris, along with measurements of petal width, petal length, sepal width, and sepal length. Figure 22.3 shows a scatter plot of petal width versus petal length, with an overlay of iris species. (Note that min–max normalization has been applied.) Figure 22.3 shows that one species is well separated, but the other two are not, at least in this two-dimensional projection. So, one question we could ask of these irises is the following: true, there are three species in the data set, but are there really three clusters in the data, or only two?


Figure 22.3 Two of the Iris species seem to blend into one another.

It makes sense to begin with k = 3 clusters, so k-means clustering was applied to the Iris data, asking for k = 3. A logical question might be: do the clusters match perfectly with the species? (Of course, the species type was not included as input to the clustering algorithm.) The answer is: not quite. Compare Figure 22.4 with Figure 22.3. For example, most of the Iris virginica belong to Cluster 2, but some belong to Cluster 3. And most of the Iris versicolor belong to Cluster 3, but some belong to Cluster 2.


Figure 22.4 The species do not quite correspond one-to-one with the clusters.

So, we proceed with the silhouette analysis. The silhouette values for each flower were calculated and graphed in the silhouette plot in Figure 22.5, which shows the silhouette values for each cluster, sorted from highest to lowest. Cluster 1 is the best-defined cluster, as most of its silhouette values are rather high. Clusters 2 and 3, by contrast, each contain some records with high silhouette values and some with low silhouette values. Note, however, that there are no records with negative silhouette values, which would have indicated incorrect cluster assignment. The mean silhouette values for each cluster, and the overall mean silhouette, are provided in Table 22.2. These values support our suggestion that, while Cluster 1 is well defined, Clusters 2 and 3 are not so well defined. This makes sense, in light of what we learned in Figures 22.3 and 22.4.


Figure 22.5 Silhouette plot of the Iris data, for k = 3 clusters.

Table 22.2 Mean silhouette values for k = 3 clusters

Cluster 1 Cluster 2 Cluster 3 Overall
Mean silhouette 0.8002 0.5593 0.5254 0.6258

Many of the low silhouette values for Clusters 2 and 3 come from the boundary area between the two clusters. Evidence for this is shown in Figure 22.6, where the silhouette values have been binned for illustrative purposes: a silhouette value below 0.5 is labeled low, and a silhouette value of at least 0.5 is labeled high. The lower silhouette values in this boundary area result from the proximity of the "other" cluster's center, which holds down the value of b_i, and thus the silhouette value.


Figure 22.6 The boundary area between Clusters 2 and 3 is where you will find many low silhouette values.

It is worth noting that, throughout this section, the clusters were formed using four predictors, whereas the scatter plots show only two of them. Each plot is thus a projection of the four-dimensional predictor space down to two dimensions, and so loses some of the information available in the full four dimensions.

Next, k-means was applied, with k = 2 clusters. This clustering combines I. versicolor and I. virginica into a single cluster, as shown in Figure 22.7. The silhouette plot for k = 2 clusters is shown in Figure 22.8. There seem to be fewer low silhouette values than for k = 3 clusters. This is supported by the mean silhouette values reported in Table 22.3: the overall mean silhouette is 17% higher than for k = 3 (0.7321 vs 0.6258), and the cluster mean silhouettes are higher as well.


Figure 22.7 Iris versicolor and Iris virginica are now combined in Cluster 2.


Figure 22.8 Silhouette plot of the Iris data, for k = 2 clusters.

Table 22.3 Mean silhouette values for k = 2 clusters

Cluster 1 Cluster 2 Overall
Mean silhouette 0.8285 0.6838 0.7321

So, it is clear that the silhouette method prefers the clustering model with k = 2. This is fine, but be aware that the k = 2 solution recognizes no distinction between I. versicolor and I. virginica, whereas the k = 3 solution does recognize this distinction. Such a distinction may be important to the client. There may be times when the data analyst must choose a model that is attuned to the requirements of the client rather than the model preferred by a given statistic.

Next, we turn to another useful tool for measuring cluster goodness, the pseudo-F statistic.

22.5 The Pseudo-F Statistic

Suppose we have k clusters, containing n_1, n_2, …, n_k data values, respectively, so that n_1 + n_2 + ⋯ + n_k = N, the total sample size. Let x_ij refer to the jth data value in the ith cluster, let m_i refer to the cluster center (centroid) of the ith cluster, and let GM represent the grand mean of all the data. Define SSB, the sum of squares between the clusters, as follows:

SSB = Σ_{i=1}^{k} n_i · d(m_i, GM)²

Define SSE, the sum of squares within the clusters, as follows:

SSE = Σ_{i=1}^{k} Σ_{j=1}^{n_i} d(x_ij, m_i)²

where

d(x, y) denotes the (Euclidean) distance between x and y.

Then, the pseudo-F statistic equals

pseudo-F = MSB / MSE = [SSB / (k − 1)] / [SSE / (N − k)]
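The SSB, SSE, and pseudo-F formulas above can be sketched in Python as follows (the function names are ours; Euclidean distance is assumed, and points may be scalars or coordinate tuples):

```python
def pseudo_F(clusters):
    """Pseudo-F statistic for a list of clusters, each a list of points."""
    # promote scalars to 1-tuples so 1-D and multi-D data are handled alike
    cl = [[p if isinstance(p, tuple) else (p,) for p in c] for c in clusters]
    k = len(cl)
    N = sum(len(c) for c in cl)

    def centroid(pts):
        return tuple(sum(p[d] for p in pts) / len(pts) for d in range(len(pts[0])))

    def dist2(p, q):  # squared Euclidean distance
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

    grand = centroid([p for c in cl for p in c])
    centers = [centroid(c) for c in cl]
    ssb = sum(len(c) * dist2(m, grand) for c, m in zip(cl, centers))
    sse = sum(dist2(p, m) for c, m in zip(cl, centers) for p in c)
    return (ssb / (k - 1)) / (sse / (N - k))   # MSB / MSE

print(round(pseudo_F([[0, 2, 4], [6, 10]]), 4))  # 8.1  (SSB = 43.2, SSE = 16)
```

Applied to the little data set of Section 22.3, this reproduces the hand calculation worked out in Section 22.6.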

The pseudo-F statistic measures the ratio of (i) the separation between the clusters, as measured by MSB, the mean square between the clusters, to (ii) the spread of the data within the clusters, as measured by MSE, the mean square error.

The hypotheses being tested are the following:

H_0: There are no clusters in the data.
H_a: There are k clusters in the data.

Reject H_0 for a sufficiently small p-value, where

p-value = P(F > pseudo-F)

and F follows an F distribution with k − 1 and N − k degrees of freedom.
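The p-value can be computed with any statistical package; as a self-contained Python sketch (function names ours), the upper tail of the F distribution can be evaluated via the standard continued-fraction expansion of the regularized incomplete beta function:

```python
import math

def _betacf(a, b, x, eps=3e-12, max_iter=200):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c, d = 1.0, 1.0 - qab * x / qap
    if abs(d) < 1e-30:
        d = 1e-30
    d = 1.0 / d
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        for aa in (m * (b - m) * x / ((qam + m2) * (a + m2)),
                   -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))):
            d = 1.0 + aa * d
            if abs(d) < 1e-30:
                d = 1e-30
            c = 1.0 + aa / c
            if abs(c) < 1e-30:
                c = 1e-30
            d = 1.0 / d
            h *= d * c
        if abs(d * c - 1.0) < eps:
            break
    return h

def _betai(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    ln_bt = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
             + a * math.log(x) + b * math.log(1.0 - x))
    bt = math.exp(ln_bt)
    if x < (a + 1.0) / (a + b + 2.0):
        return bt * _betacf(a, b, x) / a
    return 1.0 - bt * _betacf(b, a, 1.0 - x) / b

def f_sf(f, df1, df2):
    """P(F > f) for an F distribution with df1 and df2 degrees of freedom."""
    return _betai(df2 / 2.0, df1 / 2.0, df2 / (df2 + df1 * f))

# p-value for the worked example of Section 22.6: pseudo-F = 8.1, df (1, 3)
print(round(f_sf(8.1, 1, 3), 3))  # 0.065
```

In particular, when comparing cluster models with different degrees of freedom, the models should be ranked by these p-values rather than by the raw pseudo-F values.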

The reason we call this statistic pseudo-F is that it rejects the null hypothesis far too easily. For example, 100 random Uniform(0,1) variates were drawn, and k-means clustering was told to find k = 2 clusters in this random, uniformly distributed data, where there should be no clusters.

k-means duly found the clusters shown in Figure 22.9, by simply separating the data values larger than 0.5 from those smaller than 0.5. The resulting pseudo-F statistic has a p-value near zero, strongly rejecting the null hypothesis that there are no clusters in the data. But as the data are randomly generated, the presumption must be that there are no true clusters in the data. For this reason, the F statistic should not be used to test for the presence of clusters in data, and it thereby earns the name pseudo-F.


Figure 22.9 The pseudo-F statistic found the presence of clusters in randomly generated data.

However, if we have reason to believe that clusters do exist in the data, but we do not know how many clusters there are, then pseudo-F can be helpful. The process is essentially as follows:

1. Cluster the data for each candidate number of clusters k.
2. For each resulting cluster solution, compute the pseudo-F statistic.
3. Convert each pseudo-F statistic into its p-value, using an F distribution with k − 1 and N − k degrees of freedom.
4. Select the cluster solution with the smallest p-value.

Note: It has been written elsewhere that the best clustering model is the one with the largest value of pseudo-F. This is not always correct. One must account for the different degrees of freedom, k − 1 and N − k, of each model. For example, suppose Model A has pseudo-F = 3.1, while Model B, with different degrees of freedom, has pseudo-F = 3.0. The larger value of pseudo-F belongs to Model A. However, Model A's p-value is 0.0121, while Model B's p-value is 0.0098, indicating that Model B is in fact preferred.

22.6 Example of the Pseudo-F Statistic

Recall that we applied k-means clustering to the following data set and found, for k = 2, that k-means assigns the first three data values to Cluster 1 and the last two to Cluster 2.

{0, 2, 4, 6, 10}

Let us calculate the pseudo-F statistic for this clustering.

We have k = 2 clusters, with n_1 = 3 and n_2 = 2 data values, so that N = 5. The cluster centers are m_1 = 2 and m_2 = 8, and the grand mean is GM = 4.4. Now, because we are working in one dimension, d(x, y) = |x − y|, so that:

SSB = n_1 · d(m_1, GM)² + n_2 · d(m_2, GM)² = 3(2 − 4.4)² + 2(8 − 4.4)² = 17.28 + 25.92 = 43.2

Next,

SSE = (0 − 2)² + (2 − 2)² + (4 − 2)² + (6 − 8)² + (10 − 8)² = 4 + 0 + 4 + 4 + 4 = 16

Then, the pseudo-F statistic equals

pseudo-F = MSB / MSE = [SSB / (k − 1)] / [SSE / (N − k)] = (43.2/1) / (16/3) = 8.1

Figure 22.10 shows the distribution of the F statistic with k − 1 = 1 and N − k = 3 degrees of freedom. Note that the p-value, P(F > 8.1) = 0.06532, does not indicate strong evidence in favor of the reality of the clusters. But this is probably due in part to the very small sample size.


Figure 22.10 The p-value of 0.06532 does not indicate strong evidence in favor of the reality of the clusters.

22.7 Pseudo-F Statistic Applied to the IRIS Data Set

Next, we see which value of k is favored by the pseudo-F statistic for clustering the Iris data set. For the vector (sepal length, sepal width, petal length, petal width) of min–max normalized measurements, we have N = 150 data values, with grand mean

GM ≈ (0.4287, 0.4406, 0.4675, 0.4580)

For k = 3 clusters, we have cluster counts n_1, n_2, n_3 and cluster centers m_1, m_2, m_3, as reported by the clustering algorithm. Each cluster then makes the following contribution to SSB:

  • Cluster 1: n_1 · d(m_1, GM)²
  • Cluster 2: n_2 · d(m_2, GM)²
  • Cluster 3: n_3 · d(m_3, GM)²

Summing the three contributions gives us SSB.

The value of SSE is obtained by squaring the distance between each observation and its cluster center, and then summing over all the observations (calculations not shown). The resulting pseudo-F statistic, referred to an F distribution with k − 1 = 2 and N − k = 147 degrees of freedom, has a p-value equal to zero to the 57th (!) decimal place.

For k = 2 clusters, analogous calculations yield a pseudo-F statistic (now with 1 and 148 degrees of freedom) whose p-value is equal to zero to only the 41st decimal place. Thus, the pseudo-F statistic prefers the k = 3 solution, in contrast to the silhouette method, which prefers the k = 2 solution.

There are other methods for determining the optimal number of clusters. For example, the Bayesian information criterion (BIC) may be used, as demonstrated by Zhao, Xu, and Fränti.1

Next, we turn to a topic closely associated with measuring cluster goodness: the topic of cluster validation.

22.8 Cluster Validation

As with any other data mining modeling technique, the application of cluster analysis should be subject to cross-validation, in order to ensure that the clusters are real and not just a result of random noise in the training data set. Many sophisticated cluster validation techniques exist, such as the Prediction Strength2 method. However, these methods require programming beyond the intent of this book.

Instead, we use a simple graphical and statistical approach to cluster validation, summarized as follows:

1. Partition the data into a training data set and a test data set.
2. Apply the clustering algorithm to each partition separately, using the same value of k.
3. Graphically compare the profiles of the resulting clusters across the two partitions.
4. Compare the cluster means of each variable across the two partitions, using two-sample t-tests where appropriate, and judging above all whether any differences are of practical significance.
5. If the clusters from the two partitions broadly match, consider the clusters validated.

This methodology is simply a restatement of the usual cross-validation methodology in terms of cluster analysis. We now apply it to the Loans data set.

22.9 Cluster Validation Applied to the Loans Data Set

Recall the Loans data set, where a bank loan approval analyst is using debt-to-income ratio (DIR), FICO score, and request amount in order to predict loan approval. There are 150,302 records in the training data set and 49,698 records in the test data set. For simplicity, k-means clustering, with k = 3 clusters, was applied to both the training and the test data sets.

Summary graphics of the clusters generated from the training and test data sets are shown in Figures 22.11 and 22.12. For each partition, Cluster 1 is the smallest and Cluster 3 is the largest, with the percentages involved relatively close. For each partition:

  • Cluster 1 contains high debt-to-income applicants, with low FICO scores and low request amounts;
  • Cluster 2 contains moderate debt-to-income applicants, with moderate/high FICO scores, and high request amounts;
  • Cluster 3 contains low debt-to-income applicants, with high FICO scores, and low request amounts.

Figure 22.11 Summary graphics of the clusters generated by the training data set.


Figure 22.12 Summary graphics of the clusters generated by the test data set.

Table 22.4 contains the summary statistics (mean, standard deviation, and number of records) for the debt-to-income ratio (DIR) across the clusters generated by the training and test data sets. Table 22.5 contains the difference in variable means for each cluster, along with the t-statistic for the two-sample t-test,3 and the p-value for this hypothesis test.

Table 22.4 Summary statistics for clusters generated by training and test data sets


Table 22.5 Difference between the variables, with results from the two-sample hypothesis test


Practically all the p-values are small, indicating rejection of the null hypothesis that the true means are equal. In other words, the hypothesis test results suggest that the clusters from the training and test data sets do not match.

However, here we should recall a possible downside of statistical hypothesis tests: the null hypothesis is very easily rejected for very large sample sizes. It is well known that classical hypothesis tests will reject the null hypothesis for tiny effect sizes, that is, tiny differences between samples, when the sample sizes are large enough. For example, take the t-test for the difference between partitions in the mean DIR for Cluster 3. In Table 22.5, the p-value is zero, and the null hypothesis is rejected. But suppose we repeat this t-test with the very same values for the means and standard deviations, but this time with only a 10th of the number of records in each partition: 8017 for the training set and 2696 for the test set. In that case, the p-value becomes 0.203 and the null hypothesis is no longer rejected. So, for the very same difference in means, sufficiently increasing the sample sizes will eventually lead to rejection of most null hypotheses. This makes classical hypothesis testing of limited usefulness for many big data applications.
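The effect of sample size on the test statistic is easy to see from the two-sample t formula. In the Python sketch below, the means and standard deviations are hypothetical (the actual Cluster 3 summary statistics appear in Table 22.4); what matters is that the difference in means is held fixed at 0.003 while the sample sizes are multiplied by 10:

```python
import math

def two_sample_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t statistic (unequal variances) from summary statistics."""
    return (mean1 - mean2) / math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)

# Hypothetical DIR summary statistics: same 0.003 difference in means and
# same standard deviations, at one-tenth scale and at full scale.
t_small = two_sample_t(0.300, 0.1, 8017, 0.303, 0.1, 2696)
t_full  = two_sample_t(0.300, 0.1, 80170, 0.303, 0.1, 26960)

print(abs(t_small), abs(t_full))  # ~1.35 vs ~4.26
```

At one-tenth scale |t| is about 1.35 and the null hypothesis is not rejected; at full scale the very same 0.003 difference gives |t| of about 4.26 and a rejection, even though the practical meaning of the difference is unchanged. In general, multiplying both sample sizes by c multiplies the t statistic by √c.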

Instead, consider whether the difference of 0.003 between the partitions for the Cluster 3 DIR means is really of practical significance. Probably not. The analyst should concentrate on the big picture: we are trying to determine whether there is a match between the clusters uncovered in the training and test data sets. If the clusters are broadly similar, with similar profiles, and variable means close to each other, then they should be confirmed as validated. Based on these criteria, we judge the clusters for the Loans data set to be validated.

Why, then, is Table 22.5 included here? First, for small-to-moderate sample sizes, statistical inference is quite helpful in this situation. Second, regardless of the number of records involved, the analyst should report the difference in variable means between the partitions, so that the client (or someone well familiar with the data) can render judgment on whether the difference is of practical significance.

R References

  1. Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., Hornik, K. (2013). cluster: Cluster Analysis Basics and Extensions. R package version 1.14.4.
  2. Walesiak, M., Dudek, A. (2014). clusterSim: Searching for Optimal Clustering Procedure for a Data Set. R package version 0.43-4. http://CRAN.R-project.org/package=clusterSim. Accessed 2014 Sep 30.
  3. R Core Team (2012). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. http://www.R-project.org/. Accessed 2014 Sep 30.

Exercises

1. Why do we need evaluation measures for cluster algorithms?

2. What are cluster separation and cluster cohesion?

3. Why is SSE not necessarily a good measure of cluster quality?

4. What is a silhouette? What is its range? Is it a characteristic of a cluster, a variable, or a data value?

5. How do we interpret a silhouette value?

6. Explain how silhouette accounts for both separation and cohesion.

7. How is average silhouette interpreted?

8. When will a data value have a perfect silhouette value? What is this value?

9. Describe what a silhouette plot is.

10. Should the analyst always choose the cluster solution with the better mean silhouette value? Explain.

11. Explain how the pseudo-F statistic accounts for both separation and cohesion.

12. Why does the pseudo-F statistic have the word pseudo in its name?

13. Explain how we can use the pseudo-F statistic to select the optimal number of clusters.

14. True or false: The best clustering model is the one with the largest value of pseudo-F. Explain.

15. What is our cluster validation methodology?

16. Why might statistical hypothesis tests not be very helpful for big data applications?

17. What are the criteria for determining whether there is a match between the clusters uncovered in the training and test data sets?

Hands-On Exercises

Use the Loans_training data set and the Loans_test data set for the following exercises. These data sets are available from the textbook web site www.DataMiningConsultant.com.

18. Use k-means with k = 3 (the value used for the Loans data in this chapter) to generate a cluster model with the training data set.

19. Generate a silhouette plot of your cluster model.

20. Calculate the mean silhouette values for each cluster, as well as the overall mean silhouette for the cluster model.

21. Provide a two-dimensional scatter plot, using variables of your choice, with an overlay of cluster membership. Choose variables that result in an interesting plot. Note where the cluster boundaries are close, and where they are not so close.

22. Using the same variables as the previous exercise, provide a two-dimensional scatter plot, with an overlay of binned silhouette values, as shown in this chapter. Comment on the relationship between your two scatter plots.

23. Repeat Exercises 18–22 using k-means with a different value of k, say k = 2.

24. Compare the mean silhouette values for the two cluster models. Which model is preferred?

25. Compare the pseudo-F statistics for the two cluster models. Which model is preferred?

26. Develop a good classification model for predicting loan approval, based solely on cluster membership. Apply data-driven misclassification costs as shown in Chapter 16. Compare your results for the two values of k using overall model cost. Which model is preferred?

27. With the test data set, apply k-means using the value of k from the preferred model above. Perform validation of the clusters you uncovered, using the training and test data sets of the preferred model.
