Modeling and evaluation

Having created our data frame, df, we can begin to develop the clustering algorithms. We will start with hierarchical and then try our hand at k-means. After this, we will need to manipulate our data a little bit to demonstrate how to incorporate mixed data and conduct PAM.

Hierarchical clustering

To build a hierarchical cluster model in R, you can utilize the hclust() function in the base stats package. The two primary inputs needed for the function are a distance matrix and the clustering method. The distance matrix is easily created with the dist() function, and we will use Euclidean distance. A number of clustering methods are available, and the default for hclust() is complete linkage. We will try this, but I also recommend Ward's linkage method, which tends to produce clusters with a similar number of observations.

With the complete linkage method, the distance between any two clusters is the maximum distance between any one observation in the first cluster and any one observation in the other cluster. Ward's linkage method, in contrast, merges the pair of clusters that leads to the smallest increase in the total within-cluster sum of squares.
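
To make the complete linkage idea concrete, here is a minimal sketch on toy data (the toy matrix and the cluster memberships are made up purely for illustration):

> # five toy observations with two variables each
> toy = matrix(rnorm(10), ncol=2)
> # full Euclidean distance matrix for the toy data
> toyDist = as.matrix(dist(toy, method="euclidean"))
> # two hypothetical cluster memberships
> clusterA = c(1, 2)
> clusterB = c(3, 4, 5)
> # complete linkage: the largest pairwise distance between members of A and B
> max(toyDist[clusterA, clusterB])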

It is noteworthy that the R method ward.D2 squares the Euclidean distances internally, which makes it the proper implementation of Ward's linkage method. The ward.D method is also available in R, but it expects your distance matrix to already contain squared values. As we will be building a distance matrix of non-squared values, we will require ward.D2.
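
If you would like to convince yourself of this relationship, a quick check along the following lines should work (a sketch only; it builds a throwaway Euclidean distance matrix from df and compares the resulting three-cluster memberships, which should agree even though the reported heights differ):

> # ward.D on squared distances versus ward.D2 on the distances themselves
> d = dist(df, method="euclidean")
> wardSquared = hclust(d^2, method="ward.D")
> wardD2 = hclust(d, method="ward.D2")
> # expect TRUE: the merge structure, and hence the memberships, should match
> identical(cutree(wardSquared, 3), cutree(wardD2, 3))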

Now, the big question is: how many clusters should we create? The short, and probably not very satisfying, answer is that it depends. Even though there are cluster validity measures to help with this dilemma (which we will look at), it really requires an intimate knowledge of the business context, the underlying data, and, quite frankly, trial and error. As our sommelier partner is fictional, we will have to rely on the validity measures. However, they are no panacea for selecting the number of clusters, as there are several dozen of them.

As exploring the positives and negatives of the vast array of cluster validity measures is way outside the scope of this chapter, we can turn to a couple of papers, and even R itself, to simplify this problem for us. A paper by Milligan and Cooper (1985) explored the performance of 30 different measures/indices on simulated data. The top five performers were the CH index, Duda index, Cindex, Gamma, and Beale index. Another well-known method to determine the number of clusters is the gap statistic (Tibshirani, Walther, and Hastie, 2001). These are two good papers for you to explore if your cluster validity curiosity gets the better of you.

With R, one can use the NbClust() function in the NbClust package to pull results on 23 indices, including the top five from Milligan and Cooper and the gap statistic. You can see a list of all the available indices in the help file for the package. There are two ways to approach this process. One is to pick your favorite index or indices and call them individually. The other is to include all of them in the analysis and go with the majority rules method, which the function summarizes for you nicely. The function will also produce a couple of plots.
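
For example, if you only wanted a single measure, say the gap statistic, a call along these lines should do the trick (gapOnly is just an illustrative object name; the index name comes from the package's help file):

> # request a single index rather than all of them
> gapOnly = NbClust(df, distance="euclidean", min.nc=2, max.nc=6, method="complete", index="gap")
> # the recommended number of clusters and the index value for it
> gapOnly$Best.nc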

With this stage set, let's walk through an example using the complete linkage method. When using the function, you will need to specify the minimum and maximum number of clusters, the distance measure, and the indices, in addition to the linkage method. As you can see in the following code, we will create an object called numComplete. The function specifications are Euclidean distance, a minimum of two clusters, a maximum of six clusters, complete linkage, and all indices. When you run the command, the function will automatically produce output similar to what you can see here, along with a discussion of the graphical methods and the majority rules conclusion:

> numComplete = NbClust(df, distance="euclidean", min.nc=2, max.nc=6, method="complete", index="all")
*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a 
significant increase of the value of the measure i.e the significant peak in Hubert index second differences plot. 
 
*** : The D index is a graphical method of determining the number of clusters. 
In the plot of D index, we seek a significant knee (the significant peak in Dindex second differences plot) that corresponds to a significant increase of the value of the measure. 
 
******************************************************************* 
* Among all indices:                                                
* 1 proposed 2 as the best number of clusters 
* 11 proposed 3 as the best number of clusters 
* 6 proposed 5 as the best number of clusters 
* 5 proposed 6 as the best number of clusters 

                   ***** Conclusion *****                            
 
* According to the majority rule, the best number of clusters is 3

Going with the majority rules method, we would select three clusters as the optimal solution, at least for hierarchical clustering. The two plots that are produced contain two graphs each. As the preceding output states, you are looking for a significant knee in the graph on the left-hand side and the peak of the graph on the right-hand side. This is the Hubert index plot:

[Figure: Hubert index plot]

You can see that the bend or knee is at three clusters in the graph on the left-hand side. Additionally, the graph on the right-hand side has its peak at three clusters. The following Dindex plot provides the same information:

[Figure: Dindex plot]

There are a number of values that you can extract from the object created by the function, and there is one that I would like to show. This output is the best number of clusters for each index and the index value for that corresponding number of clusters. This is done with $Best.nc. I've abbreviated the output to the first nine indices:

> numComplete$Best.nc
                     KL      CH Hartigan   CCC    Scott
Number_clusters  5.0000  3.0000   3.0000 5.000   3.0000
Value_Index     14.2227 48.9898  27.8971 1.148 340.9634
                     Marriot   TrCovW   TraceW Friedman
Number_clusters 3.000000e+00     3.00   3.0000   3.0000
Value_Index     6.872632e+25 22389.83 256.4861  10.6941

You can see that the first index, KL, proposes five as the optimal number of clusters, while the next index, CH, proposes three.

With three clusters as the recommended selection, we will now compute the distance matrix and build our hierarchical cluster object. This code will build the distance matrix:

> dis = dist(df, method="euclidean")

Then, we will use this matrix as the input for the actual clustering with hclust():

> hc = hclust(dis, method="complete")

The common way to visualize hierarchical clustering is to plot a dendrogram. We will do this with the plot function. Note that hang=-1 puts the observations across the bottom of the diagram:

> plot(hc, hang=-1,labels=FALSE, main="Complete-Linkage")
[Figure: Dendrogram for complete linkage]

The dendrogram is a tree diagram that shows you how the individual observations are clustered together. The arrangement of the connections (branches, if you will) tells us which observations are similar. The height at which branches merge indicates how similar or dissimilar the observations are to each other, according to the distance matrix. Note that I specified labels=FALSE. This was done to aid in the interpretation because of the number of observations. In a smaller dataset of, say, no more than 40 observations, the row names can be displayed.
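
As an illustrative aside, if you did want labeled leaves, you could cluster and plot just a small subset; the 30 rows used here are purely for demonstration:

> # dendrogram of the first 30 observations with the cultivar labels displayed
> smallHC = hclust(dist(df[1:30, ], method="euclidean"), method="complete")
> plot(smallHC, hang=-1, labels=wine$Class[1:30], main="First 30 Observations")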

To aid in visualizing the clusters, you can produce a colored dendrogram using the sparcl package. To color the appropriate number of clusters, you need to cut the dendrogram tree to the proper number of clusters using the cutree() function. This will also create the cluster label for each of the observations:

> comp3 = cutree(hc, 3)

Now, the comp3 object is used in the function to build the colored dendrogram:

> ColorDendrogram(hc, y = comp3, main = "Complete", branchlength = 50)
[Figure: Colored dendrogram for complete linkage with three clusters]

Note that I used branchlength = 50. This value will vary based on your own data. As we have the cluster labels, let's build a table that shows the count per cluster:

> table(comp3)
comp3
 1  2  3 
69 58 51 

Out of curiosity, let's go ahead and see how this clustering algorithm compares to the cultivar labels:

> table(comp3,wine$Class)
     
comp3  1  2  3
    1 51 18  0
    2  8 50  0
    3  0  3 48

In this table, the rows are the clusters and the columns are the cultivars. This method matched the cultivar labels at an 84 percent rate. Note that we are not trying to use the clusters to predict a cultivar, and in this example, we have no a priori reason to match clusters to the cultivars.
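
If you are wondering where the 84 percent comes from, it is simply the sum of the diagonal of the preceding table divided by the total number of observations:

> # observations whose cluster lines up with their cultivar, over the total
> (51 + 50 + 48) / nrow(wine)
[1] 0.8370787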

We will now try Ward's linkage. The code is the same as before; it starts with identifying the number of clusters, except that we will need to change the method to ward.D2:

> NbClust(df, diss=NULL, distance="euclidean", min.nc=2, max.nc=6, method="ward.D2", index="all")
*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a significant increase of the value of the measure i.e the significant peak in Hubert index second differences plot. 
 
*** : The D index is a graphical method of determining the number of clusters. 
In the plot of D index, we seek a significant knee (the significant peak in Dindex second differences plot) that corresponds to a significant increase of the value of the measure. 
 
******************************************************************* 
* Among all indices:                                                
* 2 proposed 2 as the best number of clusters 
* 18 proposed 3 as the best number of clusters 
* 2 proposed 6 as the best number of clusters 

                   ***** Conclusion *****                            
 
* According to the majority rule, the best number of clusters is 3

This time around, the majority rules method again points to a three-cluster solution. Looking at the Hubert index, the best solution is three clusters as well:

[Figure: Hubert index plot for Ward's linkage]

The Dindex adds further support to the three-cluster solution:

[Figure: Dindex plot for Ward's linkage]

Let's move on to the actual clustering and production of the dendrogram for Ward's linkage:

> hcWard = hclust(dis, method="ward.D2")

> plot(hcWard, labels=FALSE, main="Ward's-Linkage")
[Figure: Dendrogram for Ward's linkage]

The plot shows three pretty distinct clusters that are roughly equal in size. Let's get a count of the cluster sizes and show them in relation to the cultivar labels:

> ward3 = cutree(hcWard, 3)
> table(ward3,wine$Class)
     
ward3  1  2  3
    1 59  5  0
    2  0 58  0
    3  0  8 48

So, cluster one has 64 observations, cluster two has 58, and cluster three has 56. This method matches the cultivar categories more closely than complete linkage did.
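
To put a number on "more closely", the same diagonal calculation works here and comes out to roughly 93 percent, versus the 84 percent we saw with complete linkage:

> (59 + 58 + 48) / nrow(wine)
[1] 0.9269663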

With another table, we can compare how the two methods match observations:

> table(comp3, ward3)
     ward3
comp3  1  2  3
    1 53 11  5
    2 11 47  0
    3  0  0 51

While cluster three for each method is pretty close, the other two are not. The question now is: how do we identify the differences in order to interpret them? In many examples, the datasets are very small and you can simply look at the labels for each cluster. In the real world, this is often impossible. A good way to compare is to use the aggregate() function, summarizing on a statistic such as the mean or median. Additionally, instead of doing it on the scaled data, let's try it on the original data. In the function, you will need to specify the dataset, what you are aggregating it by, and the summary statistic:

> aggregate(wine[,-1],list(comp3),mean)
  Group.1  Alcohol MalicAcid      Ash  Alk_ash magnesium T_phenols
1       1 13.40609  1.898986 2.305797 16.77246 105.00000  2.643913
2       2 12.41517  1.989828 2.381379 21.11724  93.84483  2.424828
3       3 13.11784  3.322157 2.431765 21.33333  99.33333  1.675686
  Flavanoids  Non_flav Proantho C_Intensity       Hue OD280_315  Proline
1  2.6689855 0.2966667 1.832899    4.990725 1.0696522  2.970000 984.6957
2  2.3398276 0.3668966 1.678103    3.280345 1.0579310  2.978448 573.3793
3  0.8105882 0.4443137 1.164314    7.170980 0.6913725  1.709804 622.4902

This gives us the mean by cluster for each of the 13 variables in the data. With complete linkage done, let's give Ward's method a try:

> aggregate(wine[,-1],list(ward3),mean)
  Group.1  Alcohol MalicAcid      Ash  Alk_ash magnesium T_phenols
1       1 13.66922  1.970000 2.463125 17.52812 106.15625  2.850000
2       2 12.20397  1.938966 2.215172 20.20862  92.55172  2.262931
3       3 13.06161  3.166607 2.412857 21.00357  99.85714  1.694286
  Flavanoids  Non_flav Proantho C_Intensity      Hue OD280_315   Proline
1  3.0096875 0.2910937 1.908125    5.450000 1.071406  3.158437 1076.0469
2  2.0881034 0.3553448 1.686552    2.895345 1.060000  2.862241  501.4310
3  0.8478571 0.4494643 1.129286    6.850179 0.721000  1.727321  624.9464

The numbers are very close. Cluster one for Ward's method has slightly higher mean values for nearly all the variables, with nonflavanoid phenols being the marginal exception. For cluster two of Ward's method, the mean values are smaller, except for Hue. This would be something to share with someone who has the domain expertise to assist in the interpretation. We can help this effort by plotting the values of the variables by cluster for the two methods. A nice plot to compare distributions is the boxplot, which shows us the minimum, first quartile, median, third quartile, maximum, and potential outliers. Let's build a comparison of two boxplot graphs, with the assumption that we are curious about the Proline values for each clustering method. The first thing to do is prepare our plot area in order to display the graphs side by side. This is done with the par() function:

> par(mfrow=c(1,2))

Here, we specified that we wanted one row and two columns with mfrow=c(1,2). If you wanted two rows and one column, it would be mfrow=c(2,1).
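
One small housekeeping note: this setting persists for subsequent plots, so once you are finished with the side-by-side comparison, you can reset the layout:

> # back to one plot per page
> par(mfrow=c(1,1))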

With the boxplot() function in R, the variables for the x and y axes need to be in the same dataset, so we will need to turn the clusters from both methods into variables in the wine dataset as follows:

> wine$comp_cluster = comp3

> wine$ward_cluster = ward3

In the boxplot() function, we will need to specify that the y axis values are a function of the x axis values with the tilde ~ symbol:

> boxplot(Proline~comp_cluster, data=wine, main="Proline by Complete Linkage")

> boxplot(Proline~ward_cluster, data=wine, main="Proline by Ward's Linkage")
[Figure: Boxplots of Proline by cluster, complete linkage versus Ward's linkage]

Looking at the boxplot, the box itself represents the interquartile range: its lower edge is the first quartile, the thick horizontal line inside it is the median, and its upper edge is the third quartile. The ends of the dotted lines, commonly referred to as whiskers, extend to the most extreme values not flagged as outliers. You can see that cluster two in complete linkage has five small circles above the upper whisker. These are known as suspected outliers and are values more than 1.5 times the interquartile range beyond the first or third quartile. Any value more than three times the interquartile range beyond the box is deemed an outlier and would be represented as a solid black circle. For what it's worth, clusters one and two of Ward's linkage have tighter interquartile ranges with no suspected outliers. Looking at the boxplots for each of the variables could help you and a domain expert determine the best hierarchical clustering method to accept. If you are curious where those outlier fences actually fall, a quick calculation is sketched next; after that, we will move on to k-means clustering.
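
Here is that fence calculation (a rough sketch; p, q, and iqr are just scratch names, and boxplot() technically computes hinges with fivenum(), which can differ slightly from quantile()):

> # Proline values for cluster two of the complete linkage solution
> p = wine$Proline[wine$comp_cluster==2]
> q = quantile(p, c(0.25, 0.75))
> iqr = q[2] - q[1]
> # points beyond these fences are drawn as the small circles in the boxplot
> c(lower=q[1] - 1.5*iqr, upper=q[2] + 1.5*iqr)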

K-means clustering

As we did with hierarchical clustering, we can also use NbClust() to determine the optimum number of clusters for k-means. All you need to do is specify kmeans as the method in the function. Let's also loosen up the maximum number of clusters to 15. I've abbreviated the following output to just the majority rules portion:

> NbClust(df, min.nc=2, max.nc=15, method="kmeans")
* Among all indices:                                                
* 4 proposed 2 as the best number of clusters 
* 15 proposed 3 as the best number of clusters 
* 1 proposed 10 as the best number of clusters 
* 1 proposed 12 as the best number of clusters 
* 1 proposed 14 as the best number of clusters 
* 1 proposed 15 as the best number of clusters 

                   ***** Conclusion *****                            
 
* According to the majority rule, the best number of clusters is 3

Once again, three clusters appear to be the optimum solution. Here is the Hubert plot, which confirms this:

[Figure: Hubert index plot for k-means]

In R, we can use the kmeans() function to do this analysis. In addition to the input data, we have to specify the number of clusters we are solving for and, with the nstart argument, the number of random starting assignments to try; the best-performing solution is kept. We will also need to set a random seed:

> set.seed(1234)

> km=kmeans(df,3,nstart=25)
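
As a brief aside, the object returned by kmeans() stores the total within-cluster sum of squares of the solution that was kept, which is the quantity the 25 random starts compete on; you can inspect it directly:

> # total within-cluster sum of squares for the retained solution
> km$tot.withinss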

Creating a table of the clusters gives us a sense of the distribution of the observations in them:

> table(km$cluster)

 1  2  3 
62 65 51

The number of observations per cluster is well balanced. I have seen on a number of occasions, with larger datasets and many more variables, that no choice of k yields a promising and compelling k-means result. Another way to analyze the clustering is to look at the matrix of cluster centers, which gives the center of each variable in each cluster:

> km$centers
     Alcohol  MalicAcid        Ash    Alk_ash   magnesium   T_phenols
1  0.8328826 -0.3029551  0.3636801 -0.6084749  0.57596208  0.88274724
2 -0.9234669 -0.3929331 -0.4931257  0.1701220 -0.49032869 -0.07576891
3  0.1644436  0.8690954  0.1863726  0.5228924 -0.07526047 -0.97657548
   Flavanoids    Non_flav    Proantho C_Intensity        Hue  OD280_315
1  0.97506900 -0.56050853  0.57865427   0.1705823  0.4726504  0.7770551
2  0.02075402 -0.03343924  0.05810161  -0.8993770  0.4605046  0.2700025
3 -1.21182921  0.72402116 -0.77751312   0.9388902 -1.1615122 -1.2887761
     Proline
1  1.1220202
2 -0.7517257
3 -0.4059428

Note that cluster one has, on average, a higher alcohol content. Let's produce a boxplot to look at the distribution of alcohol content in the same manner as we did before and also compare it to Ward's:

> wine$km_cluster = km$cluster

> boxplot(Alcohol~km_cluster, data=wine, main="Alcohol Content, K-Means")

> boxplot(Alcohol~ward_cluster, data=wine, main="Alcohol Content, Ward's")
[Figure: Boxplots of alcohol content by cluster, k-means versus Ward's linkage]

The alcohol content for each cluster is almost exactly the same. On the surface, this tells me that three clusters is the proper latent structure for the wines and that there is little difference between k-means and hierarchical clustering. Finally, let's do the comparison of the k-means clusters versus the cultivars:

> table(km$cluster, wine$Class)
   
     1  2  3
  1 59  3  0
  2  0 65  0
  3  0  3 48

This is very similar to the distribution produced by Ward's method and either one would probably be acceptable to our hypothetical sommelier. However, to demonstrate how you can cluster on data with both numeric and non-numeric values, let's work through a final example.

Clustering with mixed data

To begin this step, we will need to wrangle our data a little bit. As this method can take variables that are factors, we will convert alcohol to either high or low content as well as incorporate the cultivar class as a factor. The easiest step is to incorporate the cultivars:

> df$class = as.factor(wine$Class)

Changing alcohol also takes only one line of code, but we will need to utilize the ifelse() function and turn the variable into a factor. Because the data was scaled, a value greater than zero means the alcohol content is above the mean, and it will be coded as High; otherwise, it will be Low:

> df$Alcohol = as.factor(ifelse(df$Alcohol>0,"High","Low"))

Check the structure to verify that it all worked:

> str(df)
'data.frame':178 obs. of  17 variables:
 $ Alcohol     : Factor w/ 2 levels "High","Low": 1 1 1 1 1 1 1 1 1 1 ...
 $ MalicAcid   : num  -0.5607 -0.498 0.0212 -0.3458 0.2271 ...
 $ Ash         : num  0.231 -0.826 1.106 0.487 1.835 ...
 $ Alk_ash     : num  -1.166 -2.484 -0.268 -0.807 0.451 ...
 $ magnesium   : num  1.9085 0.0181 0.0881 0.9283 1.2784 ...
 $ T_phenols   : num  0.807 0.567 0.807 2.484 0.807 ...
 $ Flavanoids  : num  1.032 0.732 1.212 1.462 0.661 ...
 $ Non_flav    : num  -0.658 -0.818 -0.497 -0.979 0.226 ...
 $ Proantho    : num  1.221 -0.543 2.13 1.029 0.4 ...
 $ C_Intensity : num  0.251 -0.292 0.268 1.183 -0.318 ...
 $ Hue         : num  0.361 0.405 0.317 -0.426 0.361 ...
 $ OD280_315   : num  1.843 1.11 0.786 1.181 0.448 ...
 $ Proline     : num  1.0102 0.9625 1.3912 2.328 -0.0378 ...
 $ comp_cluster: num  -1.1 -1.1 -1.1 -1.1 0.124 ...
 $ ward_cluster: num  -1.16 -1.16 -1.16 -1.16 -1.16 ...
 $ km_cluster  : num  -1.18 -1.18 -1.18 -1.18 -1.18 ...
 $ class       : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...

We are now ready to create the dissimilarity matrix using the daisy() function from the cluster package, specifying the metric as gower:

> disMat = daisy(df, metric="gower")

The creation of the cluster object—let's call it pamFit—is done with the pam() function, which is a part of the cluster package. We will create three clusters in this example and create a table of the cluster size:

> set.seed(123)

> pamFit = pam(disMat, k=3)

> table(pamFit$clustering)

 1  2  3 
60 69 49
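
A nice feature of PAM is that each cluster is anchored by an actual observation, the medoid. Because we passed pam() a dissimilarity matrix, the fitted object stores the medoid row indices, so a quick look at the representative wines is possible (a sketch; output not shown):

> # row indices of the three medoid observations
> pamFit$id.med
> # the representative wine for each cluster
> df[pamFit$id.med, ]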

Now, let's see how it does compared to the cultivar labels:

> table(pamFit$clustering, wine$Class)
   
     1  2  3
  1 59  1  0
  2  0 69  0
  3  0  1 48

Well, no surprise that by actually including the cultivar class, the clusters almost achieve a perfect match. So, let's take this solution and build a descriptive statistics table using the power of the compareGroups package. In base R, creating presentation-worthy tables can be quite difficult, and this package offers an excellent solution. The first step is to create an object of the descriptive statistics by cluster with the compareGroups() function of the package. Then, using createTable(), we will turn the statistics into an easy-to-export table, which we will save as a .csv. If you want, you can also export the table in .pdf, HTML, or LaTeX format:

> df$cluster = pamFit$clustering

> group = compareGroups(cluster~., data=df)

> clustab = createTable(group)

> clustab

--------Summary descriptives table by 'cluster'---------

_____________________________________________________________ 
                  1            2            3       p.overall 
                 N=60         N=69         N=49               
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯ 
Alcohol:                                             <0.001   
    High      58 (96.7%)   6 (8.70%)    28 (57.1%)            
    Low        2 (3.33%)  63 (91.3%)    21 (42.9%)            
MalicAcid    -0.31 (0.62) -0.37 (0.89) 0.90 (0.97)   <0.001   
Ash          0.28 (0.89)  -0.42 (1.14) 0.25 (0.67)   <0.001   
Alk_ash      -0.75 (0.76) 0.24 (1.00)  0.58 (0.67)   <0.001   
magnesium    0.43 (0.77)  -0.34 (1.18) -0.05 (0.77)  <0.001   
T_phenols    0.87 (0.54)  -0.06 (0.86) -0.99 (0.56)  <0.001   
Flavanoids   0.96 (0.40)  0.04 (0.70)  -1.23 (0.31)  <0.001   
Non_flav     -0.58 (0.56) 0.00 (0.98)  0.71 (1.00)   <0.001   
Proantho     0.55 (0.72)  0.05 (1.06)  -0.75 (0.72)  <0.001   
C_Intensity  0.20 (0.53)  -0.87 (0.38) 0.99 (1.00)   <0.001   
Hue          0.46 (0.51)  0.44 (0.89)  -1.19 (0.51)  <0.001   
OD280_315    0.77 (0.50)  0.25 (0.69)  -1.30 (0.38)  <0.001   
Proline      1.14 (0.74)  -0.72 (0.51) -0.38 (0.37)  <0.001   
comp_cluster -0.94 (0.42) -0.14 (0.59) 1.35 (0.00)   <0.001   
ward_cluster -1.16 (0.00) 0.11 (0.49)  1.27 (0.00)   <0.001   
km_cluster   -1.16 (0.16) 0.06 (0.34)  1.33 (0.00)   <0.001   
class:                                               <0.001   
    1         59 (98.3%)   0 (0.00%)    0 (0.00%)             
    2         1 (1.67%)    69 (100%)    1 (2.04%)             
    3         0 (0.00%)    0 (0.00%)    48 (98.0%)            
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯

This table shows the proportion of the factor levels by the cluster, and for the numeric variables, the mean and standard deviation are displayed in parentheses. To export the table to a .csv file, just use the export2csv() function:

> export2csv(clustab,file="wine_clusters.csv")

If you open this file, you will get this table, which is conducive to further analysis and can easily be manipulated for presentation purposes:

[Figure: The exported wine_clusters.csv table]