Every modeling technique requires an evaluation phase. For example, we may work hard to develop a multiple regression model for predicting the amount of money to be spent on a new car. But, if the standard error of the estimate s for this regression model is $100,000, then the usefulness of the regression model is questionable. In the classification realm, we would expect that a model predicting who will respond to our direct-mail marketing operation will yield more profitable results than the baseline “send-a-coupon-to-everybody” or “send-out-no-coupons-at-all” models.
In a similar way, clustering models need to be evaluated as well. Some of the questions of interest might be the following:

- Is there really any natural clustering in the data set, or are the clusters merely an artifact of the algorithm?
- How well does a given clustering solution fit the data?
- How many clusters should we use?
- Do the clusters uncovered in the training data set hold up in new data?
In this chapter, we introduce two methods for measuring cluster goodness: the silhouette method and the pseudo-F statistic. These techniques will help to address these questions by evaluating and measuring the goodness of our cluster solutions. We also examine a method for validating our clusters, using cross-validation with graphical and statistical analysis.
Any measure of cluster goodness, or cluster quality, should address the concepts of cluster separation as well as cluster cohesion. Cluster separation represents how distant the clusters are from each other; cluster cohesion refers to how tightly related the records within the individual clusters are. Good measures of cluster quality need to incorporate both criteria. For example, it has been written elsewhere that the sum of squares error (SSE) is a good measure of cluster quality. However, by measuring the distance between each record and its cluster center, SSE accounts only for cluster cohesion and does not account for cluster separation. Thus, SSE is monotonically decreasing as the number of clusters increases, which is not a desired property of a valid measure of cluster goodness. Of course, both the silhouette method and the pseudo-F statistic account for both cluster cohesion and cluster separation.
The silhouette $s_i$ is a characteristic of each data value $x_i$, and is defined as follows:

$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$$

where $a_i$ is the distance between $x_i$ and the center of the cluster to which $x_i$ belongs, and $b_i$ is the distance between $x_i$ and the center of the nearest cluster to which $x_i$ does not belong.
The silhouette value is used to gauge how good the cluster assignment is for that particular point. A positive value indicates that the assignment is good, with higher values being better than lower values. A value that is close to zero is considered to be a weak assignment, as the observation could have been assigned to the next closest cluster with limited negative consequence. A negative silhouette value is considered to be misclassified, as assignment to the next closest cluster would have been better.
Note how the definition of silhouette accounts for both separation and cohesion. The value of $a_i$ represents cohesion, as it measures the distance between the data value and its cluster center, while $b_i$ represents separation, as it measures the distance between the data value and a different cluster. This is illustrated in Figure 22.1. Each of the data values in Cluster 1 has its values of $a_i$ and $b_i$ represented by solid lines and dotted lines, respectively. Clearly, $b_i > a_i$ for each data value, as represented by the longer dotted lines. Thus, each data value's silhouette value is positive, indicating that the data values have not been misclassified. The dotted lines indicate separation, and the solid lines indicate cohesion. (The silhouettes for the data values in Cluster 2 are not represented in Figure 22.1.)
Taking the average silhouette value over all records yields a useful measure of how well the cluster solution fits the data. The following thumbnail interpretation of average silhouette is meant as a guideline only, and should bow before the expertise of the domain expert:

- Average silhouette of 0.5 or better: good evidence of the reality of the clusters in the data.
- Average silhouette between 0.25 and 0.5: some evidence of the reality of the clusters; seek corroboration from the domain expert.
- Average silhouette below 0.25: scant evidence of the reality of the clusters.
Suppose we apply $k$-means clustering to the following little one-dimensional data set: 0, 2, 4, 6, 10.
$k$-means, with $k = 2$, assigns the first three data values to Cluster 1 and the last two to Cluster 2, as shown in Figure 22.2. The cluster center for Cluster 1 is $m_1 = 2$, and the cluster center for Cluster 2 is $m_2 = 8$, represented by the dotted vertical lines in Figure 22.2. The values for $a_i$ represent the distance between data value $x_i$ and the center of the cluster to which $x_i$ belongs. The values for $b_i$ represent the distance between $x_i$ and the other cluster center. Note that $s_i > 0$ for each data value, because $b_i > a_i$.
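For readers who wish to experiment, the assign/update loop of $k$-means can be sketched in a few lines of Python and applied to this toy data set. The initial centers (0 and 10) are an illustrative assumption; the chapter does not specify the initialization.

```python
# A minimal sketch of k-means (Lloyd's algorithm) in one dimension,
# applied to the toy data set {0, 2, 4, 6, 10} with k = 2.

def kmeans_1d(data, centers, max_iter=100):
    """One-dimensional k-means; returns (labels, centers)."""
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(len(centers)), key=lambda c: abs(x - centers[c]))
                  for x in data]
        # Update step: each center moves to the mean of its points.
        new_centers = [
            sum(x for x, l in zip(data, labels) if l == c) /
            sum(1 for l in labels if l == c)
            for c in range(len(centers))
        ]
        if new_centers == centers:      # converged
            break
        centers = new_centers
    return labels, centers

labels, centers = kmeans_1d([0, 2, 4, 6, 10], centers=[0.0, 10.0])
print(labels)    # [0, 0, 0, 1, 1] -> first three in Cluster 1, last two in Cluster 2
print(centers)   # [2.0, 8.0]      -> the cluster centers from Figure 22.2
```

From this initialization, the algorithm converges in two passes to the centers 2 and 8, reproducing the assignment described above.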
Table 22.1 contains the calculations for the individual data value silhouettes, along with the mean silhouette. Using our rule of thumb, mean silhouette = 0.7 represents good evidence of the reality of the clusters in the data. Note that $x_2 = 2$ is perfectly classified as belonging to Cluster 1, as it sits right on the cluster center $m_1 = 2$; thus, its silhouette value is a perfect 1.00. However, $x_3 = 4$ is somewhat farther from its own cluster center, and somewhat closer to the other cluster center; hence, its silhouette value is lower, 0.50.
Table 22.1 Calculations for individual data value silhouettes and mean silhouette
$x_i$ | $a_i$ | $b_i$ | $\max(a_i, b_i)$ | $s_i$
0 | 2 | 8 | 8 | 0.75
2 | 0 | 6 | 6 | 1.00
4 | 2 | 4 | 4 | 0.50
6 | 2 | 4 | 4 | 0.50
10 | 2 | 8 | 8 | 0.75
Mean silhouette = 0.70
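The entries of Table 22.1 can be reproduced with a short sketch of the chapter's center-based silhouette. (Note that some software instead uses the average distance to all records in a cluster, rather than the distance to the cluster center, so packaged silhouette routines may give slightly different values.)

```python
# Silhouette as defined in this chapter, using distances to cluster centers.
# Toy clustering: Cluster 1 = {0, 2, 4} with center 2, Cluster 2 = {6, 10}
# with center 8.

def silhouette(x, own_center, other_center):
    """s_i = (b_i - a_i) / max(a_i, b_i)."""
    a = abs(x - own_center)       # cohesion: distance to own center
    b = abs(x - other_center)     # separation: distance to the other center
    return (b - a) / max(a, b)

data    = [0, 2, 4, 6, 10]
labels  = [0, 0, 0, 1, 1]
centers = [2, 8]

sils = [silhouette(x, centers[l], centers[1 - l]) for x, l in zip(data, labels)]
print(sils)                      # [0.75, 1.0, 0.5, 0.5, 0.75], matching Table 22.1
print(sum(sils) / len(sils))     # 0.7 -> the mean silhouette
```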
Next, we apply the silhouette method to Fisher's well-known Iris data set. The data set consists of 150 observations of three species of Iris, along with measurements of their petal width, petal length, sepal width, and sepal length. Figure 22.3 shows a scatter plot of petal width versus petal length, with an overlay of Iris species. (Note that min–max normalization is used.) Figure 22.3 shows that one species is well separated, but the other two are not, at least in this dimension. So, one question we could ask of these Irises: true, there are three species in the data set, but are there really three clusters in the data set, or only two?
It makes sense to begin with $k = 3$ clusters. $k$-means clustering was applied to the Iris data, asking for $k = 3$ clusters. A logical question might be: do the clusters match perfectly with the species? (Of course, the species type was not included as input to the clustering algorithm.) The answer is, not quite. Compare Figure 22.4 with Figure 22.3. For example, most of the Iris virginica belong to Cluster 2, but some belong to Cluster 3. And most of the Iris versicolor belong to Cluster 3, but some belong to Cluster 2.
So, we proceed with the silhouette analysis. The silhouette values for each flower were calculated and graphed in the silhouette plot in Figure 22.5. This silhouette plot shows the silhouette values, sorted from highest to lowest, for each cluster. Cluster 1 is the best-defined cluster, as most of its silhouette values are rather high. However, Clusters 2 and 3 each have some records with high silhouette and some records with low silhouette. Still, there are no records with negative silhouette, which would have indicated assignment to the wrong cluster. The mean silhouette values for each cluster, and the overall mean silhouette, are provided in Table 22.2. These values support our suggestion that, although Cluster 1 is well-defined, Clusters 2 and 3 are not so well-defined. This makes sense, in light of what we learned in Figures 22.3 and 22.4.
Table 22.2 Mean silhouette values for k = 3 clusters
  | Cluster 1 | Cluster 2 | Cluster 3 | Overall
Mean silhouette | 0.8002 | 0.5593 | 0.5254 | 0.6258
Many of the low silhouette values for Clusters 2 and 3 come from the boundary area between their respective clusters. Evidence for this is shown in Figure 22.6. The silhouette values were binned (for illustrative purposes): a silhouette value below 0.5 is low; a silhouette value of at least 0.5 is high. The lower silhouette values in this boundary area result from the proximity of the "other" cluster center, which holds down the value of $b_i$, and thus of the silhouette value.
It is worth noting that, throughout this section, the clusters were formed using four predictors, but we are examining scatter plots of two predictors only. This represents a projection of the four-dimensional predictor space down to two dimensions, and so loses some of the information available in four dimensions.
Next, $k$-means was applied, with $k = 2$ clusters. This clustering combines I. versicolor and I. virginica into a single cluster, as shown in Figure 22.7. The silhouette plot for $k = 2$ clusters is shown in Figure 22.8. There seem to be fewer low silhouette values than for $k = 3$ clusters. This is supported by the mean silhouette values reported in Table 22.3. The overall mean silhouette is 17% higher than for $k = 3$, and the cluster mean silhouettes are higher as well.
Table 22.3 Mean silhouette values for k = 2 clusters
  | Cluster 1 | Cluster 2 | Overall
Mean silhouette | 0.8285 | 0.6838 | 0.7321
So, it is clear that the silhouette method prefers the clustering model where $k = 2$. This is fine, but just be aware that the $k = 2$ solution recognizes no distinction between I. versicolor and I. virginica, whereas the $k = 3$ solution does recognize this distinction. Such a distinction may be important to the client. There may be times when the data analyst may have to choose a model that is attuned to the requirements of the client, rather than a model that is preferred by a given statistic.
Next, we turn to another useful tool for measuring cluster goodness, the pseudo-F statistic.
Suppose we have $k$ clusters, with $n_1, n_2, \ldots, n_k$ data values, respectively, so that $n_1 + n_2 + \cdots + n_k = N$, the total sample size. Let $x_{ij}$ refer to the $j$th data value in the $i$th cluster, let $m_i$ refer to the cluster center (centroid) of the $i$th cluster, and let $M$ represent the grand mean of all the data. Define SSB, the sum of squares between the clusters, as follows:

$$\mathrm{SSB} = \sum_{i=1}^{k} n_i \cdot d(m_i, M)^2$$
Define SSE, the sum of squares within the clusters, as follows:

$$\mathrm{SSE} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} d(x_{ij}, m_i)^2$$

where $d(a, b)$ represents the Euclidean distance between points $a$ and $b$.
Then, the pseudo-F statistic equals

$$F_{k-1,\,N-k} = \frac{\mathrm{MSB}}{\mathrm{MSE}} = \frac{\mathrm{SSB}/(k-1)}{\mathrm{SSE}/(N-k)}$$
The pseudo-F statistic measures the ratio of (i) the separation between the clusters, as measured by MSB, the mean square between the clusters, to (ii) the spread of the data within the clusters, as measured by MSE, the mean square error.
The hypotheses being tested are the following:

$H_0$: There are no clusters in the data.
$H_a$: There are clusters in the data.
Reject $H_0$ for a sufficiently small $p$-value, where

$$p\text{-value} = P(F_{k-1,\,N-k} > F_{\text{observed}})$$
The reason we call this statistic pseudo-F is that it rejects the null hypothesis far too easily. For example, 100 random Uniform(0,1) variates were drawn, and $k$-means clustering was told to find $k = 2$ clusters in this random, uniformly distributed data, where there should be no clusters.
$k$-means duly found the $k = 2$ clusters shown in Figure 22.9, by simply separating the data values larger than 0.5 from those smaller than 0.5. The resulting pseudo-F statistic has a $p$-value near zero, strongly rejecting the null hypothesis that there are no clusters in the data. But as the data are randomly generated, the presumption must be that there are no true clusters in the data. For this reason, the statistic should not be used to test for the presence of clusters in the data, and thereby earns the name pseudo-F.
However, if we have reason to believe that clusters do exist in the data, but we do not know how many clusters there are, then pseudo-F can be helpful. The process is as follows:

1. Cluster the data for a range of candidate values of $k$.
2. For each candidate clustering, compute the pseudo-F statistic and its associated $p$-value, using $df_1 = k - 1$ and $df_2 = N - k$ degrees of freedom.
3. Select the value of $k$ whose clustering yields the smallest $p$-value.
Note: It has been written elsewhere that the best clustering model is the one with the largest value of pseudo-F. This is not always correct. One must account for the different degrees of freedom $df_1 = k - 1$ and $df_2 = N - k$ for each model. For example, suppose Model A has pseudo-F = 3.1 and Model B has pseudo-F = 3.0, with differing degrees of freedom. The larger value of pseudo-F is for Model A. However, Model A's $p$-value is 0.0121, while Model B's $p$-value is 0.0098, indicating that Model B is in fact preferred.
Recall that we applied $k$-means clustering to the data set 0, 2, 4, 6, 10, and found, for $k = 2$, that $k$-means assigns the first three data values to Cluster 1 and the last two to Cluster 2.
Let us calculate the pseudo-F statistic for this clustering.
We have $k = 2$ clusters, with $n_1 = 3$ and $n_2 = 2$ data values, and $N = 5$. The cluster centers are $m_1 = 2$ and $m_2 = 8$, and the grand mean is $M = 4.4$. Now, because we are in one dimension, $d(m_i, M)^2 = (m_i - M)^2$, so that:

$$\mathrm{SSB} = \sum_{i=1}^{2} n_i (m_i - M)^2 = 3(2 - 4.4)^2 + 2(8 - 4.4)^2 = 17.28 + 25.92 = 43.2$$
Next,

$$\mathrm{SSE} = \sum_{i=1}^{2} \sum_{j=1}^{n_i} (x_{ij} - m_i)^2 = (0-2)^2 + (2-2)^2 + (4-2)^2 + (6-8)^2 + (10-8)^2 = 16$$
Then, the pseudo-F statistic equals

$$F = \frac{\mathrm{SSB}/(k-1)}{\mathrm{SSE}/(N-k)} = \frac{43.2/1}{16/3} = 8.1$$
Figure 22.10 shows the distribution of the $F$ statistic, with $df_1 = k - 1 = 1$ and $df_2 = N - k = 3$. Note that the $p$-value, $P(F_{1,3} > 8.1) = 0.06532$, does not indicate strong evidence in favor of the reality of the clusters. But this is probably due in part to the very small sample size.
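The SSB, SSE, and pseudo-F calculations above can be verified with a short sketch. The closed-form $p$-value below relies on the identity that an $F_{1,3}$ variable is the square of a $t$ variable with 3 degrees of freedom, together with the known closed-form CDF of $t_3$, so it applies only to these degrees of freedom; in practice a general-purpose routine (e.g., scipy.stats.f.sf) would be used instead.

```python
import math

# Pseudo-F for the toy clustering: Cluster 1 = {0, 2, 4} with center m1 = 2,
# Cluster 2 = {6, 10} with center m2 = 8, grand mean M = 4.4.

clusters = [[0, 2, 4], [6, 10]]
centers = [sum(c) / len(c) for c in clusters]          # [2.0, 8.0]
N = sum(len(c) for c in clusters)                      # 5 records in all
k = len(clusters)                                      # 2 clusters
grand = sum(x for c in clusters for x in c) / N        # grand mean 4.4

# SSB: separation between the clusters.
ssb = sum(len(c) * (m - grand) ** 2 for c, m in zip(clusters, centers))
# SSE: spread of the data within the clusters.
sse = sum((x - m) ** 2 for c, m in zip(clusters, centers) for x in c)
pseudo_f = (ssb / (k - 1)) / (sse / (N - k))           # MSB / MSE

# p-value = P(F_{1,3} > pseudo_f) = 2 * P(T_3 > sqrt(pseudo_f)), using the
# closed-form upper tail of the t distribution with 3 degrees of freedom.
# This shortcut is valid only for df1 = 1, df2 = 3.
t = math.sqrt(pseudo_f)
u = t / math.sqrt(3)
p_value = 2 * (0.5 - (math.atan(u) + u / (1 + u * u)) / math.pi)

print(ssb, sse, pseudo_f)   # SSB ~ 43.2, SSE = 16.0, pseudo-F ~ 8.1
print(p_value)              # ~ 0.065, consistent with the chapter's 0.06532
```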
Next, we see which value of $k$ is favored by the pseudo-F statistic for clustering the Iris data set. For the vector (sepal length, sepal width, petal length, petal width), we have $N = 150$ data values, with grand mean vector $M$ computed over all 150 records.
For $k = 3$ clusters, we have cluster counts $n_1, n_2, n_3$ and cluster centers $m_1, m_2, m_3$ (values not shown). Each cluster contributes $n_i \cdot d(m_i, M)^2$ to SSB; summing the three contributions gives us SSB.
The value for SSE is obtained by squaring the distance between each observation and its cluster center, and then summing over all 150 records. Having done this (calculations not shown), we obtain the pseudo-F statistic, which has a $p$-value equal to zero to the 57th (!) decimal place.
For $k = 2$, analogous calculations yield a pseudo-F statistic with a $p$-value equal to zero only to the 41st decimal place. Thus, the pseudo-F statistic prefers the $k = 3$ solution, in contrast to the silhouette method, which prefers the $k = 2$ solution.
There are other methods for determining the optimal number of clusters. For example, the Bayesian information criterion may be used, as demonstrated by Zhao, Xu, and Fränti.1
Next, we turn to a topic closely associated with measuring cluster goodness: the topic of cluster validation.
As with any other data mining modeling technique, the application of cluster analysis should be subject to cross-validation, in order to ensure that the clusters are real and not just a result of random noise in the training data set. Many sophisticated cluster validation techniques exist, such as the Prediction Strength2 method. However, these methods require programming beyond the scope of this book.
Instead, we use a simple graphical and statistical approach to cluster validation, summarized as follows:

1. Partition the data into a training data set and a test data set.
2. Apply the clustering algorithm separately to each partition, using the same value of $k$.
3. Compare the resulting clusters graphically and by their summary statistics. If the clusters from the two partitions are broadly similar, the clustering is validated.
This methodology is simply a restatement of the usual cross-validation methodology, in terms of cluster analysis. We now apply this cluster validation methodology to the Loans data set.
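The methodology can be sketched on synthetic one-dimensional data; the data-generating mechanism (two normal components, near 0 and near 10) and the comparison of cluster centers across partitions are illustrative assumptions.

```python
import random

# Simple 1-D k-means (Lloyd's algorithm), returning sorted cluster centers.
def kmeans_centers(data, centers, max_iter=100):
    for _ in range(max_iter):
        labels = [min(range(len(centers)), key=lambda c: abs(x - centers[c]))
                  for x in data]
        new = [sum(x for x, l in zip(data, labels) if l == c) /
               max(1, sum(1 for l in labels if l == c))
               for c in range(len(centers))]
        if new == centers:          # converged
            break
        centers = new
    return sorted(centers)

# Synthetic data with two real clusters, near 0 and near 10.
def make_data(n):
    return ([random.gauss(0, 1) for _ in range(n // 2)] +
            [random.gauss(10, 1) for _ in range(n - n // 2)])

random.seed(1)
train, test = make_data(600), make_data(200)   # the two partitions

# Step 2: cluster each partition separately, with the same k = 2.
train_centers = kmeans_centers(train, [min(train), max(train)])
test_centers = kmeans_centers(test, [min(test), max(test)])

# Step 3: compare the cluster summary statistics across partitions.
# Small differences between corresponding centers -> clusters validated.
for m_tr, m_te in zip(train_centers, test_centers):
    print(abs(m_tr - m_te))
```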
Recall the Loans data set, where a bank loan approval analyst is using debt-to-income ratio (DIR), FICO score, and request amount in order to predict loan approval. There are 150,302 records in the training data set and 49,698 records in the test data set. For simplicity, $k$-means clustering was applied to both the training and the test data sets, with $k = 3$ clusters.
Summary graphics of the clusters generated from the training and test data sets are shown in Figures 22.11 and 22.12. For each partition, Cluster 1 is the smallest and Cluster 3 is the largest, with the percentages involved relatively close across the two partitions.
Table 22.4 contains the summary statistics (mean, standard deviation, number of records) for the clusters generated by the training and test data sets (DIR). Table 22.5 contains the difference in variable means for each cluster, along with the $t$-statistic for the two-sample $t$-test,3 and the $p$-value for this hypothesis test.
Table 22.4 Summary statistics for clusters generated by training and test data sets
Table 22.5 Difference between the variables, with results from the two-sample hypothesis test
Practically all the $p$-values are small, indicating rejection of the null hypothesis that the true means are equal. In other words, the hypothesis test results suggest that the clusters from the training and test data sets do not match.
However, here we should recall a possible downside to statistical hypothesis tests: the null hypothesis is very easily rejected for very large sample sizes. It is well known that classical hypothesis tests will reject the null hypothesis for tiny effect sizes, or tiny differences between samples, when the sample sizes are large enough. For example, take the $t$-test for the difference between partitions for the mean DIR for Cluster 3. In Table 22.5, the $p$-value is zero, and the null hypothesis is rejected. But suppose we repeat this $t$-test with the very same values for the means and standard deviations, but this time with only one-tenth of the number of records in each partition: 8017 for the training set and 2696 for the test set. In that case, the $p$-value becomes 0.203, and the null hypothesis is no longer rejected. So, for the very same difference in means, sufficiently increasing the sample sizes will eventually lead to rejection of most null hypotheses. This makes classical hypothesis testing of limited usefulness for many big data applications.
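The effect of sample size can be illustrated with a sketch. The mean difference of 0.003 and the one-tenth record counts are from the text; the common standard deviation of 0.1 is an illustrative assumption (not the actual DIR value from Table 22.4), and the full-size counts (80,170 and 26,960) are inferred by scaling the text's one-tenth figures.

```python
import math

def two_sample_p(diff, sd, n1, n2):
    """Two-sided p-value for a difference in means (normal approximation,
    appropriate here because the degrees of freedom are large)."""
    se = sd * math.sqrt(1 / n1 + 1 / n2)     # standard error of the difference
    z = diff / se
    return math.erfc(abs(z) / math.sqrt(2))  # 2 * P(Z > |z|)

# The very same mean difference, at two different sample sizes.
p_full = two_sample_p(0.003, 0.1, 80170, 26960)   # full Cluster 3 counts
p_tenth = two_sample_p(0.003, 0.1, 8017, 2696)    # one-tenth the records

print(p_full)    # tiny: null hypothesis rejected
print(p_tenth)   # > 0.05: null hypothesis not rejected
```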
Instead, consider whether the difference of 0.003 between the partitions for the Cluster 3 DIR means is really of practical significance. Probably not. Instead, the analyst should concentrate on the big picture: We are trying to determine whether there is a match between the clusters uncovered in the training and test data sets. If the clusters are broadly similar, with similar profiles, and variable means close to each other, then they should be confirmed as validated. Based on these criteria, we judge the clusters for the Loans data set to be validated.
Why, then, is Table 22.5 included here? First, for small-to-moderate sample sizes, statistical inference is quite helpful in this situation. Second, regardless of the number of records involved, the analyst should report the difference in variable means between the partitions, so that the client (or someone well familiar with the data) can render judgment on whether the difference is of practical significance.
1. Why do we need evaluation measures for cluster algorithms?
2. What is cluster separation and cluster cohesion?
3. Why is SSE not necessarily a good measure of cluster quality?
4. What is a silhouette? What is its range? Is it a characteristic of a cluster, a variable, or a data value?
5. How do we interpret a silhouette value?
6. Explain how silhouette accounts for both separation and cohesion.
7. How is average silhouette interpreted?
8. When will a data value have a perfect silhouette value? What is this value?
9. Describe what a silhouette plot is.
10. Should the analyst always choose the cluster solution with the better mean silhouette value? Explain.
11. Explain how the pseudo-F statistic accounts for both separation and cohesion.
12. Why does the pseudo-F statistic have the word pseudo in its name?
13. Explain how we can use the pseudo-F statistic to select the optimal number of clusters.
14. True or false: The best clustering model is the one with the largest value of pseudo-F. Explain.
15. What is our cluster validation methodology?
16. Why might statistical hypothesis tests not be very helpful for big data applications?
17. What are the criteria for determining whether there is a match between the clusters uncovered in the training and test data sets?
Use the Loans_training data set and the Loans_test data set for the following exercises. These data sets are available from the textbook web site www.DataMiningConsultant.com.
18. Use $k$-means with $k = 3$ to generate a cluster model with the training data set.
19. Generate a silhouette plot of your cluster model.
20. Calculate the mean silhouette values for each cluster, as well as the overall mean silhouette for the cluster model.
21. Provide a two-dimensional scatter plot, using variables of your choice, with an overlay of cluster membership. Choose variables that result in an interesting plot. Note where the cluster boundaries are close, and where they are not so close.
22. Using the same variables as the previous exercise, provide a two-dimensional scatter plot, with an overlay of binned silhouette values, as shown in this chapter. Comment on the relationship between your two scatter plots.
23. Repeat Exercises 18–22 using $k$-means with $k = 2$.
24. Compare the mean silhouette values for the two cluster models. Which model is preferred?
25. Compare the pseudo-F statistics for the two cluster models. Which model is preferred?
26. Develop a good classification model for predicting loan approval, based solely on cluster membership. Apply data-driven misclassification costs as shown in Chapter 16. Compare your results for the $k = 3$ and $k = 2$ cases using overall model cost. Which model is preferred?
27. With the test data set, apply $k$-means with the value of $k$ from the preferred model above. Perform validation of the clusters you uncovered, using the training and test data sets of the preferred model.