Every modeling technique requires an evaluation phase. For example, we may work hard to develop a multiple regression model for predicting the amount of money to be spent on a new car. But, if the standard error of the estimate s for this regression model is $100,000, then the usefulness of the regression model is questionable. In the classification realm, we would expect that a model predicting who will respond to our direct-mail marketing operation will yield more profitable results than the baseline “send-a-coupon-to-everybody” or “send-out-no-coupons-at-all” models.
In a similar way, clustering models need to be evaluated as well. Some of the questions of interest might be the following:

- Is there really any natural clustering in the data set, or are the clusters merely an artifact of the algorithm?
- How well does a given clustering solution fit the data?
- How many clusters should we use?
- Do the clusters uncovered in the training data set hold up in new data?
In this chapter, we introduce two methods for measuring cluster goodness: the silhouette method and the pseudo-F statistic. These techniques will help to address these questions by evaluating and measuring the goodness of our cluster solutions. We also examine a method for validating our clusters, using cross-validation with graphical and statistical analysis.
Any measure of cluster goodness, or cluster quality, should address the concepts of cluster separation as well as cluster cohesion. Cluster separation represents how distant the clusters are from each other; cluster cohesion refers to how tightly related the records within the individual clusters are. Good measures of cluster quality need to incorporate both criteria. For example, it has been written elsewhere that the sum of squares error (SSE) is a good measure of cluster quality. However, by measuring the distance between each record and its cluster center, SSE accounts only for cluster cohesion and does not account for cluster separation. Thus, SSE is monotonically decreasing as the number of clusters increases, which is not a desired property of a valid measure of cluster goodness. Of course, both the silhouette method and the pseudo-F statistic account for both cluster cohesion and cluster separation.
The silhouette $s_i$ is a characteristic of each data value $x_i$, and is defined as follows:

$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$$

where $a_i$ is the distance between $x_i$ and the center of the cluster to which $x_i$ belongs, and $b_i$ is the distance between $x_i$ and the center of the nearest cluster to which $x_i$ does not belong.
The silhouette value is used to gauge how good the cluster assignment is for that particular point. A positive value indicates that the assignment is good, with higher values being better than lower values. A value that is close to zero is considered to be a weak assignment, as the observation could have been assigned to the next closest cluster with limited negative consequence. A negative silhouette value is considered to be misclassified, as assignment to the next closest cluster would have been better.
Note how the definition of silhouette accounts for both separation and cohesion. The value of $a_i$ represents cohesion, as it measures the distance between the data value and its cluster center, while $b_i$ represents separation, as it measures the distance between the data value and a different cluster. This is illustrated in Figure 22.1. Each of the data values in Cluster 1 has its values of $a_i$ and $b_i$ represented by solid lines and dotted lines, respectively. Clearly, $b_i > a_i$ for each data value, as represented by the longer dotted lines. Thus, each data value's silhouette value is positive, indicating that the data values have not been misclassified. The dotted lines indicate separation, and the solid lines indicate cohesion. (The silhouettes for the data values in Cluster 2 are not represented in Figure 22.1.)
Taking the average silhouette value over all records yields a useful measure of how well the cluster solution fits the data. The following thumbnail interpretation of average silhouette is meant as a guideline only, and should bow before the expertise of the domain expert:

- Average silhouette of 0.5 or better: good evidence of the reality of the clusters in the data.
- Average silhouette between 0.25 and 0.5: some evidence of the reality of the clusters; seek corroboration from the domain expert.
- Average silhouette below 0.25: scant evidence of the reality of the clusters.
Suppose we apply $k$-means clustering to the following little one-dimensional data set: 0, 2, 4, 6, 10.
$k$-means, with $k = 2$, assigns the first three data values to Cluster 1 and the last two to Cluster 2, as shown in Figure 22.2. The cluster center for Cluster 1 is $m_1 = 2$, and the cluster center for Cluster 2 is $m_2 = 8$, represented by the dotted vertical lines in Figure 22.2. The values for $a_i$ represent the distance between data value $x_i$ and the center of the cluster to which $x_i$ belongs. The values for $b_i$ represent the distance between $x_i$ and the other cluster center. Note that $s_i > 0$ for each data value, because $b_i > a_i$.
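For readers who wish to experiment, the assign/update loop of $k$-means can be sketched in a few lines of Python and applied to this toy data set. The initial centers (0 and 10) are an illustrative assumption; the chapter does not specify the initialization.

```python
# A minimal sketch of k-means (Lloyd's algorithm) in one dimension,
# applied to the toy data set {0, 2, 4, 6, 10} with k = 2.

def kmeans_1d(data, centers, max_iter=100):
    """One-dimensional k-means; returns (labels, centers)."""
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(len(centers)), key=lambda c: abs(x - centers[c]))
                  for x in data]
        # Update step: each center moves to the mean of its points.
        new_centers = [
            sum(x for x, l in zip(data, labels) if l == c) /
            sum(1 for l in labels if l == c)
            for c in range(len(centers))
        ]
        if new_centers == centers:      # converged
            break
        centers = new_centers
    return labels, centers

labels, centers = kmeans_1d([0, 2, 4, 6, 10], centers=[0.0, 10.0])
print(labels)    # [0, 0, 0, 1, 1] -> first three in Cluster 1, last two in Cluster 2
print(centers)   # [2.0, 8.0]      -> the cluster centers from Figure 22.2
```

From this initialization, the algorithm converges in two passes to the centers 2 and 8, reproducing the assignment described above.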
Table 22.1 contains the calculations for the individual data value silhouettes, along with the mean silhouette. Using our rule of thumb, mean silhouette = 0.7 represents good evidence of the reality of the clusters in the data. Note that $x_2 = 2$ is perfectly classified as belonging to Cluster 1, as it sits right on the cluster center $m_1 = 2$; thus, its silhouette value is a perfect 1.00. However, $x_3 = 4$ is somewhat farther from its own cluster center, and somewhat closer to the other cluster center; hence, its silhouette value is lower, 0.50.
Table 22.1 Calculations for individual data value silhouettes and mean silhouette
$x_i$ | $a_i$ | $b_i$ | $\max(a_i, b_i)$ | $s_i$
0 | 2 | 8 | 8 | 0.75
2 | 0 | 6 | 6 | 1.00
4 | 2 | 4 | 4 | 0.50
6 | 2 | 4 | 4 | 0.50
10 | 2 | 8 | 8 | 0.75
Mean silhouette = 0.70
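The entries of Table 22.1 can be reproduced with a short sketch of the chapter's center-based silhouette. (Note that some software instead uses the average distance to all records in a cluster, rather than the distance to the cluster center, so packaged silhouette routines may give slightly different values.)

```python
# Silhouette as defined in this chapter, using distances to cluster centers.
# Toy clustering: Cluster 1 = {0, 2, 4} with center 2, Cluster 2 = {6, 10}
# with center 8.

def silhouette(x, own_center, other_center):
    """s_i = (b_i - a_i) / max(a_i, b_i)."""
    a = abs(x - own_center)       # cohesion: distance to own center
    b = abs(x - other_center)     # separation: distance to the other center
    return (b - a) / max(a, b)

data    = [0, 2, 4, 6, 10]
labels  = [0, 0, 0, 1, 1]
centers = [2, 8]

sils = [silhouette(x, centers[l], centers[1 - l]) for x, l in zip(data, labels)]
print(sils)                      # [0.75, 1.0, 0.5, 0.5, 0.75], matching Table 22.1
print(sum(sils) / len(sils))     # 0.7 -> the mean silhouette
```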
Next, we apply the silhouette method to Fisher's well-known Iris data set. The data set consists of 150 observations of three species of Iris, along with measurements of their petal width, petal length, sepal width, and sepal length. Figure 22.3 shows a scatter plot of petal width versus petal length, with an overlay of Iris species. (Note that min–max normalization is used.) Figure 22.3 shows that one species is well separated, but the other two are not, at least in this dimension. So, one question we could ask of these Irises: true, there are three species in the data set, but are there really three clusters in the data set, or only two?
It makes sense to begin with $k = 3$ clusters. $k$-means clustering was applied to the Iris data, asking for $k = 3$ clusters. A logical question might be: do the clusters match perfectly with the species? (Of course, the species type was not included as input to the clustering algorithm.) The answer is, not quite. Compare Figure 22.4 with Figure 22.3. For example, most of the Iris virginica belong to Cluster 2, but some belong to Cluster 3. And most of the Iris versicolor belong to Cluster 3, but some belong to Cluster 2.
So, we proceed with the silhouette analysis. The silhouette values for each flower were calculated and graphed in the silhouette plot in Figure 22.5. This silhouette plot shows the silhouette values, sorted from highest to lowest, for each cluster. Cluster 1 is the best-defined cluster, as most of its silhouette values are rather high. However, Clusters 2 and 3 each have some records with high silhouette and some records with low silhouette. Still, there are no records with negative silhouette, which would have indicated assignment to the wrong cluster. The mean silhouette values for each cluster, and the overall mean silhouette, are provided in Table 22.2. These values support our suggestion that, although Cluster 1 is well-defined, Clusters 2 and 3 are not so well-defined. This makes sense, in light of what we learned in Figures 22.3 and 22.4.
Table 22.2 Mean silhouette values for k = 3 clusters
  | Cluster 1 | Cluster 2 | Cluster 3 | Overall
Mean silhouette | 0.8002 | 0.5593 | 0.5254 | 0.6258
Many of the low silhouette values for Clusters 2 and 3 come from the boundary area between their respective clusters. Evidence for this is shown in Figure 22.6. The silhouette values were binned (for illustrative purposes): a silhouette value below 0.5 is low; a silhouette value of at least 0.5 is high. The lower silhouette values in this boundary area result from the proximity of the "other" cluster center, which holds down the value of $b_i$, and thus of the silhouette value.
It is worth noting that, throughout this section, the clusters were formed using four predictors, but we are examining scatter plots of two predictors only. This represents a projection of the four-dimensional predictor space down to two dimensions, and so loses some of the information available in four dimensions.
Next, $k$-means was applied, with $k = 2$ clusters. This clustering combines I. versicolor and I. virginica into a single cluster, as shown in Figure 22.7. The silhouette plot for $k = 2$ clusters is shown in Figure 22.8. There seem to be fewer low silhouette values than for $k = 3$ clusters. This is supported by the mean silhouette values reported in Table 22.3. The overall mean silhouette is 17% higher than for $k = 3$, and the cluster mean silhouettes are higher as well.
Table 22.3 Mean silhouette values for k = 2 clusters
  | Cluster 1 | Cluster 2 | Overall
Mean silhouette | 0.8285 | 0.6838 | 0.7321
So, it is clear that the silhouette method prefers the clustering model where $k = 2$. This is fine, but just be aware that the $k = 2$ solution recognizes no distinction between I. versicolor and I. virginica, whereas the $k = 3$ solution does recognize this distinction. Such a distinction may be important to the client. There may be times when the data analyst may have to choose a model that is attuned to the requirements of the client, rather than a model that is preferred by a given statistic.
Next, we turn to another useful tool for measuring cluster goodness, the pseudo-F statistic.
Suppose we have $k$ clusters, with $n_1, n_2, \ldots, n_k$ data values, respectively, so that $n_1 + n_2 + \cdots + n_k = N$, the total sample size. Let $x_{ij}$ refer to the $j$th data value in the $i$th cluster, let $m_i$ refer to the cluster center (centroid) of the $i$th cluster, and let $M$ represent the grand mean of all the data. Define SSB, the sum of squares between the clusters, as follows:

$$\mathrm{SSB} = \sum_{i=1}^{k} n_i \cdot d(m_i, M)^2$$
Define SSE, the sum of squares within the clusters, as follows:

$$\mathrm{SSE} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} d(x_{ij}, m_i)^2$$

where $d(a, b)$ represents the Euclidean distance between points $a$ and $b$.
Then, the pseudo-F statistic equals

$$F_{k-1,\,N-k} = \frac{\mathrm{MSB}}{\mathrm{MSE}} = \frac{\mathrm{SSB}/(k-1)}{\mathrm{SSE}/(N-k)}$$
The pseudo-F statistic measures the ratio of (i) the separation between the clusters, as measured by MSB, the mean square between the clusters, to (ii) the spread of the data within the clusters, as measured by MSE, the mean square error.
The hypotheses being tested are the following:

$H_0$: There are no clusters in the data.
$H_a$: There are clusters in the data.
Reject $H_0$ for a sufficiently small $p$-value, where

$$p\text{-value} = P(F_{k-1,\,N-k} > F_{\text{observed}})$$
The reason we call this statistic pseudo-F is that it rejects the null hypothesis far too easily. For example, 100 random Uniform(0,1) variates were drawn, and $k$-means clustering was told to find $k = 2$ clusters in this random, uniformly distributed data, where there should be no clusters.
$k$-means duly found the $k = 2$ clusters shown in Figure 22.9, by simply separating the data values larger than 0.5 from those smaller than 0.5. The resulting pseudo-F statistic has a $p$-value near zero, strongly rejecting the null hypothesis that there are no clusters in the data. But as the data are randomly generated, the presumption must be that there are no true clusters in the data. For this reason, the statistic should not be used to test for the presence of clusters in the data, and thereby earns the name pseudo-F.
However, if we have reason to believe that clusters do exist in the data, but we do not know how many clusters there are, then pseudo-F can be helpful. The process is as follows:

1. Cluster the data for a range of candidate values of $k$.
2. For each candidate clustering, compute the pseudo-F statistic and its associated $p$-value, using $df_1 = k - 1$ and $df_2 = N - k$ degrees of freedom.
3. Select the value of $k$ whose clustering yields the smallest $p$-value.
Note: It has been written elsewhere that the best clustering model is the one with the largest value of pseudo-F. This is not always correct. One must account for the different degrees of freedom $df_1 = k - 1$ and $df_2 = N - k$ for each model. For example, suppose Model A has pseudo-F = 3.1 and Model B has pseudo-F = 3.0, with differing degrees of freedom. The larger value of pseudo-F is for Model A. However, Model A's $p$-value is 0.0121, while Model B's $p$-value is 0.0098, indicating that Model B is in fact preferred.
Recall that we applied $k$-means clustering to the data set 0, 2, 4, 6, 10, and found, for $k = 2$, that $k$-means assigns the first three data values to Cluster 1 and the last two to Cluster 2.
Let us calculate the pseudo-F statistic for this clustering.
We have $k = 2$ clusters, with $n_1 = 3$ and $n_2 = 2$ data values, and $N = 5$. The cluster centers are $m_1 = 2$ and $m_2 = 8$, and the grand mean is $M = 4.4$. Now, because we are in one dimension, $d(m_i, M)^2 = (m_i - M)^2$, so that:

$$\mathrm{SSB} = \sum_{i=1}^{2} n_i (m_i - M)^2 = 3(2 - 4.4)^2 + 2(8 - 4.4)^2 = 17.28 + 25.92 = 43.2$$
Next,

$$\mathrm{SSE} = \sum_{i=1}^{2} \sum_{j=1}^{n_i} (x_{ij} - m_i)^2 = (0-2)^2 + (2-2)^2 + (4-2)^2 + (6-8)^2 + (10-8)^2 = 16$$
Then, the pseudo-F statistic equals

$$F = \frac{\mathrm{SSB}/(k-1)}{\mathrm{SSE}/(N-k)} = \frac{43.2/1}{16/3} = 8.1$$
Figure 22.10 shows the distribution of the $F$ statistic, with $df_1 = k - 1 = 1$ and $df_2 = N - k = 3$. Note that the $p$-value, $P(F_{1,3} > 8.1) = 0.06532$, does not indicate strong evidence in favor of the reality of the clusters. But this is probably due in part to the very small sample size.
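The SSB, SSE, and pseudo-F calculations above can be verified with a short sketch. The closed-form $p$-value below relies on the identity that an $F_{1,3}$ variable is the square of a $t$ variable with 3 degrees of freedom, together with the known closed-form CDF of $t_3$, so it applies only to these degrees of freedom; in practice a general-purpose routine (e.g., scipy.stats.f.sf) would be used instead.

```python
import math

# Pseudo-F for the toy clustering: Cluster 1 = {0, 2, 4} with center m1 = 2,
# Cluster 2 = {6, 10} with center m2 = 8, grand mean M = 4.4.

clusters = [[0, 2, 4], [6, 10]]
centers = [sum(c) / len(c) for c in clusters]          # [2.0, 8.0]
N = sum(len(c) for c in clusters)                      # 5 records in all
k = len(clusters)                                      # 2 clusters
grand = sum(x for c in clusters for x in c) / N        # grand mean 4.4

# SSB: separation between the clusters.
ssb = sum(len(c) * (m - grand) ** 2 for c, m in zip(clusters, centers))
# SSE: spread of the data within the clusters.
sse = sum((x - m) ** 2 for c, m in zip(clusters, centers) for x in c)
pseudo_f = (ssb / (k - 1)) / (sse / (N - k))           # MSB / MSE

# p-value = P(F_{1,3} > pseudo_f) = 2 * P(T_3 > sqrt(pseudo_f)), using the
# closed-form upper tail of the t distribution with 3 degrees of freedom.
# This shortcut is valid only for df1 = 1, df2 = 3.
t = math.sqrt(pseudo_f)
u = t / math.sqrt(3)
p_value = 2 * (0.5 - (math.atan(u) + u / (1 + u * u)) / math.pi)

print(ssb, sse, pseudo_f)   # SSB ~ 43.2, SSE = 16.0, pseudo-F ~ 8.1
print(p_value)              # ~ 0.065, consistent with the chapter's 0.06532
```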
Next, we see which value of $k$ is favored by the pseudo-F statistic for clustering the Iris data set. For the vector (sepal length, sepal width, petal length, petal width), we have $N = 150$ data values, with grand mean vector $M$ computed over all 150 records.
For $k = 3$ clusters, we have cluster counts $n_1, n_2, n_3$ and cluster centers $m_1, m_2, m_3$ (values not shown). Each cluster contributes $n_i \cdot d(m_i, M)^2$ to SSB; summing the three contributions gives us SSB.
The value for SSE is obtained by squaring the distance between each observation and its cluster center, and then summing over all 150 records. Having done this (calculations not shown), we obtain the pseudo-F statistic, which has a $p$-value equal to zero to the 57th (!) decimal place.
For $k = 2$, analogous calculations yield a pseudo-F statistic with a $p$-value equal to zero only to the 41st decimal place. Thus, the pseudo-F statistic prefers the $k = 3$ solution, in contrast to the silhouette method, which prefers the $k = 2$ solution.
There are other methods for determining the optimal number of clusters. For example, the Bayesian information criterion may be used, as demonstrated by Zhao, Xu, and Fränti.1
Next, we turn to a topic closely associated with measuring cluster goodness: the topic of cluster validation.
As with any other data mining modeling technique, the application of cluster analysis should be subject to cross-validation, in order to ensure that the clusters are real and not just a result of random noise in the training data set. Many sophisticated cluster validation techniques exist, such as the Prediction Strength2 method. However, these methods require programming beyond the scope of this book.
Instead, we use a simple graphical and statistical approach to cluster validation, summarized as follows:

1. Partition the data into a training data set and a test data set.
2. Apply the clustering algorithm separately to each partition, using the same value of $k$.
3. Compare the resulting clusters graphically and by their summary statistics. If the clusters from the two partitions are broadly similar, the clustering is validated.
This methodology is simply a restatement of the usual cross-validation methodology, in terms of cluster analysis. We now apply this cluster validation methodology to the Loans data set.
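The methodology can be sketched on synthetic one-dimensional data; the data-generating mechanism (two normal components, near 0 and near 10) and the comparison of cluster centers across partitions are illustrative assumptions.

```python
import random

# Simple 1-D k-means (Lloyd's algorithm), returning sorted cluster centers.
def kmeans_centers(data, centers, max_iter=100):
    for _ in range(max_iter):
        labels = [min(range(len(centers)), key=lambda c: abs(x - centers[c]))
                  for x in data]
        new = [sum(x for x, l in zip(data, labels) if l == c) /
               max(1, sum(1 for l in labels if l == c))
               for c in range(len(centers))]
        if new == centers:          # converged
            break
        centers = new
    return sorted(centers)

# Synthetic data with two real clusters, near 0 and near 10.
def make_data(n):
    return ([random.gauss(0, 1) for _ in range(n // 2)] +
            [random.gauss(10, 1) for _ in range(n - n // 2)])

random.seed(1)
train, test = make_data(600), make_data(200)   # the two partitions

# Step 2: cluster each partition separately, with the same k = 2.
train_centers = kmeans_centers(train, [min(train), max(train)])
test_centers = kmeans_centers(test, [min(test), max(test)])

# Step 3: compare the cluster summary statistics across partitions.
# Small differences between corresponding centers -> clusters validated.
for m_tr, m_te in zip(train_centers, test_centers):
    print(abs(m_tr - m_te))
```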
Recall the Loans data set, where a bank loan approval analyst is using debt-to-income ratio (DIR), FICO score, and request amount in order to predict loan approval. There are 150,302 records in the training data set and 49,698 records in the test data set. For simplicity, $k$-means clustering was applied to both the training and the test data sets, with $k = 3$ clusters.
Summary graphics of the clusters generated from the training and test data sets are shown in Figures 22.11 and 22.12. For each partition, Cluster 1 is the smallest and Cluster 3 is the largest, with the percentages involved relatively close across the two partitions.
Table 22.4 contains the summary statistics (mean, standard deviation, number of records) for the clusters generated by the training and test data sets (DIR). Table 22.5 contains the difference in variable means for each cluster, along with the $t$-statistic for the two-sample $t$-test,3 and the $p$-value for this hypothesis test.
Table 22.4 Summary statistics for clusters generated by training and test data sets
Table 22.5 Difference between the variables, with results from the two-sample hypothesis test
Practically all the $p$-values are small, indicating rejection of the null hypothesis that the true means are equal. In other words, the hypothesis test results suggest that the clusters from the training and test data sets do not match.
However, here we should recall a possible downside to statistical hypothesis tests: the null hypothesis is very easily rejected for very large sample sizes. It is well known that classical hypothesis tests will reject the null hypothesis for tiny effect sizes, or tiny differences between samples, when the sample sizes are large enough. For example, take the $t$-test for the difference between partitions for the mean DIR for Cluster 3. In Table 22.5, the $p$-value is zero, and the null hypothesis is rejected. But suppose we repeat this $t$-test with the very same values for the means and standard deviations, but this time with only one-tenth of the number of records in each partition: 8017 for the training set and 2696 for the test set. In that case, the $p$-value becomes 0.203, and the null hypothesis is no longer rejected. So, for the very same difference in means, sufficiently increasing the sample sizes will eventually lead to rejection of most null hypotheses. This makes classical hypothesis testing of limited usefulness for many big data applications.
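The effect of sample size can be illustrated with a sketch. The mean difference of 0.003 and the one-tenth record counts are from the text; the common standard deviation of 0.1 is an illustrative assumption (not the actual DIR value from Table 22.4), and the full-size counts (80,170 and 26,960) are inferred by scaling the text's one-tenth figures.

```python
import math

def two_sample_p(diff, sd, n1, n2):
    """Two-sided p-value for a difference in means (normal approximation,
    appropriate here because the degrees of freedom are large)."""
    se = sd * math.sqrt(1 / n1 + 1 / n2)     # standard error of the difference
    z = diff / se
    return math.erfc(abs(z) / math.sqrt(2))  # 2 * P(Z > |z|)

# The very same mean difference, at two different sample sizes.
p_full = two_sample_p(0.003, 0.1, 80170, 26960)   # full Cluster 3 counts
p_tenth = two_sample_p(0.003, 0.1, 8017, 2696)    # one-tenth the records

print(p_full)    # tiny: null hypothesis rejected
print(p_tenth)   # > 0.05: null hypothesis not rejected
```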
Instead, consider whether the difference of 0.003 between the partitions for the Cluster 3 DIR means is really of practical significance. Probably not. Instead, the analyst should concentrate on the big picture: We are trying to determine whether there is a match between the clusters uncovered in the training and test data sets. If the clusters are broadly similar, with similar profiles, and variable means close to each other, then they should be confirmed as validated. Based on these criteria, we judge the clusters for the Loans data set to be validated.
Why, then, is Table 22.5 included here? First, for small-to-moderate sample sizes, statistical inference is quite helpful in this situation. Second, regardless of the number of records involved, the analyst should report the difference in variable means between the partitions, so that the client (or someone well familiar with the data) can render judgment on whether the difference is of practical significance.
1. Why do we need evaluation measures for cluster algorithms?
2. What is cluster separation and cluster cohesion?
3. Why is SSE not necessarily a good measure of cluster quality?
4. What is a silhouette? What is its range? Is it a characteristic of a cluster, a variable, or a data value?
5. How do we interpret a silhouette value?
6. Explain how silhouette accounts for both separation and cohesion.
7. How is average silhouette interpreted?
8. When will a data value have a perfect silhouette value? What is this value?
9. Describe what a silhouette plot is.
10. Should the analyst always choose the cluster solution with the better mean silhouette value? Explain.
11. Explain how the pseudo-F statistic accounts for both separation and cohesion.
12. Why does the pseudo-F statistic have the word pseudo in its name?
13. Explain how we can use the pseudo-F statistic to select the optimal number of clusters.
14. True or false: The best clustering model is the one with the largest value of pseudo-F. Explain.
15. What is our cluster validation methodology?
16. Why might statistical hypothesis tests not be very helpful for big data applications?
17. What are the criteria for determining whether there is a match between the clusters uncovered in the training and test data sets?
Use the Loans_training data set and the Loans_test data set for the following exercises. These data sets are available from the textbook web site www.DataMiningConsultant.com.
18. Use $k$-means with $k = 3$ to generate a cluster model with the training data set.
19. Generate a silhouette plot of your cluster model.
20. Calculate the mean silhouette values for each cluster, as well as the overall mean silhouette for the cluster model.
21. Provide a two-dimensional scatter plot, using variables of your choice, with an overlay of cluster membership. Choose variables that result in an interesting plot. Note where the cluster boundaries are close, and where they are not so close.
22. Using the same variables as the previous exercise, provide a two-dimensional scatter plot, with an overlay of binned silhouette values, as shown in this chapter. Comment on the relationship between your two scatter plots.
23. Repeat Exercises 18–22 using $k$-means with $k = 2$.
24. Compare the mean silhouette values for the two cluster models. Which model is preferred?
25. Compare the pseudo-F statistics for the two cluster models. Which model is preferred?
26. Develop a good classification model for predicting loan approval, based solely on cluster membership. Apply data-driven misclassification costs as shown in Chapter 16. Compare your results for the $k = 3$ and $k = 2$ cases using overall model cost. Which model is preferred?
27. With the test data set, apply $k$-means with the value of $k$ from the preferred model above. Perform validation of the clusters you uncovered, using the training and test data sets of the preferred model.