This chapter covers

- Using clustering methods to explore data and look for similarities
- Choosing the right distance metric and the right number of clusters
- Evaluating a clustering for stability
In the previous chapter, we covered using the vtreat package to prepare messy real-world data for modeling. In this chapter, we’ll look at methods to discover unknown relationships in data. These methods are called unsupervised methods. With unsupervised methods, there’s no outcome that you’re trying to predict; instead, you want to discover patterns in the data that perhaps you hadn’t previously suspected. For example, you may want to find groups of customers with similar purchase patterns, or correlations between population movement and socioeconomic factors. We will still consider this pattern discovery to be “modeling,” and as such, the outcomes of the algorithms can still be evaluated, as shown in the mental model for this chapter (figure 9.1).
Unsupervised analyses are often not ends in themselves; rather, they’re ways of finding relationships and patterns that can be used to build predictive models. In fact, we encourage you to think of unsupervised methods as exploratory—procedures that help you get your hands in the data—rather than as black-box approaches that mysteriously and automatically give you “the right answer.”
In this chapter, we’ll look at two classes of unsupervised methods:
In cluster analysis, the goal is to group the observations in your data into clusters such that every data point in a cluster is more similar to the other points in its own cluster than it is to points in other clusters. For example, a company that offers guided tours might want to cluster its clients by behavior and tastes: which countries they like to visit; whether they prefer adventure tours, luxury tours, or educational tours; what kinds of activities they participate in; and what sorts of sites they like to visit. Such information can help the company design attractive travel packages and target the appropriate segments of its client base with them.
Cluster analysis is a topic worthy of a book in itself; in this chapter, we’ll discuss two approaches. Hierarchical clustering finds nested groups of clusters. An example of hierarchical clustering might be the standard plant taxonomy, which classifies plants by family, then genus, then species, and so on. The second approach we’ll cover is k-means, which is a quick and popular way of finding clusters in quantitative data.
Historically, cluster analysis is related to the problem of density estimation: if you think of your data as living in a high-dimensional space, then you want to find the regions of the space where the data is densest. If those regions are distinct, or nearly so, then you have clusters.
In order to cluster, you need the notions of similarity and dissimilarity. Dissimilarity can be thought of as distance, so that the points in a cluster are closer to each other than they are to the points in other clusters. This is shown in figure 9.2.
Different application areas will have different notions of distance and dissimilarity. In this section, we’ll cover a few of the most common ones: Euclidean distance, Hamming distance, Manhattan (city block) distance, and cosine similarity.
Suppose you have measurements on how many minutes per day subjects spend on different activities, and you want to group the subjects by their activity patterns.
Since your measurements are numerical and continuous, Euclidean distance is a good distance to use for clustering. Euclidean distance is the measure people tend to think of when they think of “distance.” Optimizing squared Euclidean distance is the basis of k-means. Of course, Euclidean distance only makes sense when all the data is real-valued (quantitative). If the data is categorical (in particular, binary), then other distances should be used.
The Euclidean distance between two vectors x and y is defined as
edist(x, y) <- sqrt((x[1] - y[1])^2 + (x[2] - y[2])^2 + ...)
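As a quick sanity check, the definition above can be written as a runnable R function. Note that edist() here is our own throwaway helper for illustration, not a library function:

```r
# Euclidean distance between two numeric vectors of the same length.
# edist() is an illustrative helper, not a built-in R function.
edist <- function(x, y) {
  sqrt(sum((x - y)^2))
}

edist(c(0, 0), c(3, 4))  # the classic 3-4-5 right triangle: distance 5
```

For many points at once, R’s built-in dist() function (used later in this chapter) computes the same quantity for all pairs of rows of a matrix.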
Suppose you want to group your recipe box into groups of similar recipes. One way to do that is to measure the similarity of their ingredients lists.
By this measure, pancakes, waffles, and crepes are highly similar (they have almost identical ingredients, and vary only in proportions); they all differ somewhat from cornbread (which uses cornmeal, rather than flour); and they all differ to a greater degree from mashed potatoes.
For categorical variables like recipe ingredients, gender (male/female), or qualitative size (small/medium/large), you can define the distance as 0 if two points are in the same category, and 1 otherwise. If all the variables are categorical, then you can use Hamming distance, which counts the number of mismatches:
hdist(x, y) <- sum((x[1] != y[1]) + (x[2] != y[2]) + ...)
Here, a != b is defined to have a value of 1 if the expression is true, and a value of 0 if the expression is false.
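The Hamming distance pseudocode above translates directly into R, because a logical comparison already yields 1 for TRUE and 0 for FALSE when summed. Again, hdist() is an illustrative helper of our own, not a library function:

```r
# Hamming distance: the number of coordinates in which two
# equal-length vectors disagree. Illustrative helper only.
hdist <- function(x, y) {
  sum(x != y)
}

hdist(c("small", "red", "round"), c("small", "blue", "square"))  # 2 mismatches
```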
You can also expand categorical variables to indicator variables (as we discussed in section 7.1.4), one for each level of the variable.
If the categories are ordered (like small/medium/large) so that some categories are “closer” to each other than others, then you can convert them to a numerical sequence. For example, (small/medium/large) might map to (1/2/3). Then you can use Euclidean distance or other distances for quantitative data.
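A minimal sketch of this conversion in R, using a factor with explicitly ordered levels (the data here is made up for illustration):

```r
# Convert an ordered categorical variable to a numeric sequence.
# The ordering small < medium < large is stated explicitly via levels.
sizes <- c("small", "large", "medium", "small")
size_num <- as.numeric(factor(sizes, levels = c("small", "medium", "large")))
size_num
## [1] 1 3 2 1
```

Now abs(size_num[i] - size_num[j]) treats small-to-large as a bigger difference than small-to-medium, as desired.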
Suppose you run a delivery service that caters to downtown businesses. You want to cluster your clients so that you can place pickup/drop-off boxes that are centrally located in each cluster.
Manhattan distance measures distance as the number of horizontal and vertical units it takes to get from one (real-valued) point to the other (no diagonal moves). This is also known as L1 distance (and Euclidean distance is L2 distance).
In this example, Manhattan distance is more appropriate because you want to measure distance by how far people will walk along the streets, not diagonally point-to-point (Euclidean distance). For example, in figure 9.3, client A is 2 blocks north of the site and 2 blocks west, while client B is 3 blocks south of the site and 1 block east. They are equidistant from the site (4 blocks) by Manhattan distance. But client B is further by Euclidean distance: the diagonal of a 3-by-1 rectangle is longer than the diagonal of a 2-by-2 square.
The Manhattan distance between two vectors x and y is defined as
mdist(x, y) <- sum(abs(x[1] - y[1]) + abs(x[2] - y[2]) + ...)
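As a sketch (mdist() and edist() are our own illustrative helpers), here is the client A/client B comparison from figure 9.3 in runnable form:

```r
# Manhattan (L1) and Euclidean (L2) distance helpers, for illustration only.
mdist <- function(x, y) sum(abs(x - y))
edist <- function(x, y) sqrt(sum((x - y)^2))

site <- c(0, 0)
clientA <- c(-2, 2)  # 2 blocks west, 2 blocks north of the site
clientB <- c(1, -3)  # 1 block east, 3 blocks south of the site

mdist(site, clientA)  # 4 blocks
mdist(site, clientB)  # also 4 blocks: equidistant by Manhattan distance
edist(site, clientB) > edist(site, clientA)  # TRUE: B is further by Euclidean distance
```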
Suppose you represent a document as rows of a document-text matrix, as we did in section 6.3.3, where each element i of the row vector gives the number of times that word i appeared in the document. Then the cosine similarity between two row vectors is a measure of the similarity of the corresponding documents.
Cosine similarity is a common similarity metric in text analysis. It measures the smallest angle between two vectors. In our text example, we assume non-negative vectors, so the angle theta between two vectors is between 0 and 90 degrees. Cosine similarity is shown in figure 9.4.
Two perpendicular vectors (theta = 90 degrees) are the most dissimilar; the cosine of 90 degrees is 0. Two parallel vectors are the most similar (identical, if you assume they’re both based at the origin); the cosine of 0 degrees is 1.
From elementary geometry, you can derive that the cosine of the angle between two vectors is given by the normalized dot product between the two vectors:
dot(x, y) <- sum(x[1] * y[1] + x[2] * y[2] + ...)
cossim(x, y) <- dot(x, y) / sqrt(dot(x, x) * dot(y, y))
You can turn the cosine similarity into a pseudo distance by subtracting it from 1.0 (though to get an actual metric, you should use 2 * acos(cossim(x, y)) / pi, which is 0 for identical vectors and 1 for perpendicular ones).
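The normalized dot product above is easy to check in R on the two extreme cases; cossim() is an illustrative helper, not a library function:

```r
# Cosine similarity as the normalized dot product of two vectors.
# cossim() is an illustrative helper.
cossim <- function(x, y) {
  sum(x * y) / sqrt(sum(x * x) * sum(y * y))
}

cossim(c(1, 0), c(0, 1))  # perpendicular vectors: similarity 0
cossim(c(1, 2), c(2, 4))  # parallel vectors: similarity 1
```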
Different distance metrics will give you different clusters, as will different clustering algorithms. The application domain may give you a hint as to the most appropriate distance, or you can try several distance metrics. In this chapter, we’ll use (squared) Euclidean distance, as it’s the most natural distance for quantitative data.
To demonstrate clustering, we’ll use a small dataset from 1973 on protein consumption from nine different food groups in 25 countries in Europe.[1] The goal is to group the countries based on patterns in their protein consumption. The dataset is loaded into R as a data frame called protein, as shown in the next listing.
The original dataset was from the Data and Story Library, previously hosted at CMU. It is no longer online there. A tab-separated text file with the data can be found at https://github.com/WinVector/PDSwR2/tree/master/Protein/. The data file is called protein.txt; additional information can be found in the file protein_README.txt.
protein <- read.table("protein.txt", sep = "\t", header = TRUE)
summary(protein)
##            Country      RedMeat         WhiteMeat           Eggs
##  Albania       : 1   Min.   : 4.400   Min.   : 1.400   Min.   :0.500
##  Austria       : 1   1st Qu.: 7.800   1st Qu.: 4.900   1st Qu.:2.700
##  Belgium       : 1   Median : 9.500   Median : 7.800   Median :2.900
##  Bulgaria      : 1   Mean   : 9.828   Mean   : 7.896   Mean   :2.936
##  Czechoslovakia: 1   3rd Qu.:10.600   3rd Qu.:10.800   3rd Qu.:3.700
##  Denmark       : 1   Max.   :18.000   Max.   :14.000   Max.   :4.700
##  (Other)       :19
##       Milk            Fish           Cereals          Starch
##  Min.   : 4.90   Min.   : 0.200   Min.   :18.60   Min.   :0.600
##  1st Qu.:11.10   1st Qu.: 2.100   1st Qu.:24.30   1st Qu.:3.100
##  Median :17.60   Median : 3.400   Median :28.00   Median :4.700
##  Mean   :17.11   Mean   : 4.284   Mean   :32.25   Mean   :4.276
##  3rd Qu.:23.30   3rd Qu.: 5.800   3rd Qu.:40.10   3rd Qu.:5.700
##  Max.   :33.70   Max.   :14.200   Max.   :56.70   Max.   :6.500
##
##       Nuts           Fr.Veg
##  Min.   :0.700   Min.   :1.400
##  1st Qu.:1.500   1st Qu.:2.900
##  Median :2.400   Median :3.800
##  Mean   :3.072   Mean   :4.136
##  3rd Qu.:4.700   3rd Qu.:4.900
##  Max.   :7.800   Max.   :7.900
The documentation for this dataset doesn’t mention what the units of measurement are; we will assume all the columns are measured in the same units. This is important: units (or, more precisely, disparity in units) affect what clusterings an algorithm will discover. If you measure vital statistics of your subjects as age in years, height in feet, and weight in pounds, you’ll get different distances—and possibly different clusters—than if you measure age in years, height in meters, and weight in kilograms.
Ideally, you want a unit of change in each coordinate to represent the same degree of difference. In the protein dataset, we assume that the measurements are all in the same units, so it might seem that we’re okay. This may well be a correct assumption, but different food groups provide different amounts of protein. Animal-based food sources in general have more grams of protein per serving than plant-based food sources, so one could argue that a change in consumption of five grams is a bigger difference in terms of vegetable consumption than it is in terms of red meat consumption.
One way to try to make the units of each variable more compatible is to transform all the columns to have a mean value of 0 and a standard deviation of 1. This makes the standard deviation the unit of measurement in each coordinate. Assuming that your training data has a distribution that accurately represents the population at large, then a standard deviation represents approximately the same degree of difference in every coordinate.
You can scale numeric data in R using the function scale(). The output of scale() is a matrix. For the purposes of this chapter, you can mostly think of a matrix as a data frame with all numeric columns (this isn’t strictly true, but it’s close enough).
The scale() function annotates its output with two attributes: scaled:center holds the mean values of all the columns, and scaled:scale holds the standard deviations. You’ll store these away so you can “unscale” the data later.
vars_to_use <- colnames(protein)[-1]         # use all the columns except the first (Country)
pmatrix <- scale(protein[, vars_to_use])     # scale the data columns to mean 0, sd 1

pcenter <- attr(pmatrix, "scaled:center")    # store the centering and scaling
pscale <- attr(pmatrix, "scaled:scale")      # parameters, so you can "unscale" later

rm_scales <- function(scaled_matrix) {       # convenience function to remove the scaling
  attr(scaled_matrix, "scaled:center") <- NULL    # attributes from a scaled matrix
  attr(scaled_matrix, "scaled:scale") <- NULL
  scaled_matrix
}

pmatrix <- rm_scales(pmatrix)                # remove the scaling attributes from our copy
Figure 9.5 shows the effect of scaling on two variables, Fr.Veg and RedMeat. The raw (unscaled) variables have different ranges, reflecting the fact that the amount of protein supplied via red meat tends to be higher than the amount of protein supplied via fruits and vegetables. The scaled variables now have similar ranges, which makes comparing relative variation in each variable easier.
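As a quick sanity check on what scale() is doing, its output can be reproduced by hand: subtract each column’s mean, then divide by its standard deviation. The small data frame below is made up for illustration:

```r
# Verify that scale() is equivalent to centering each column at 0
# and dividing by its standard deviation. Toy data, made up.
d <- data.frame(age = c(25, 35, 45), height = c(1.60, 1.70, 1.80))

scaled <- scale(d)
by_hand <- apply(d, 2, function(col) (col - mean(col)) / sd(col))

max(abs(scaled - by_hand))  # effectively zero: the two methods agree
```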
Now you are ready to cluster the protein data. We’ll start with hierarchical clustering.
The hclust() function takes as input a distance matrix (as an object of class dist), which records the distances between all pairs of points in the data (using any one of a variety of metrics). You can compute the distance matrix using the function dist().
dist() can compute distance matrices using the Euclidean distance (method = "euclidean"), the Manhattan distance (method = "manhattan"), and something like the Hamming distance when categorical variables are expanded to indicators (method = "binary"). If you want to use another distance metric, you’ll have to compute the appropriate distance matrix yourself and convert it to a dist object using as.dist() (see help(dist) for further details).
hclust() also uses one of a variety of clustering methods to produce a tree that records the nested cluster structure. We’ll use Ward’s method, which starts out with each data point as an individual cluster and merges clusters iteratively so as to minimize the total within sum of squares (WSS) of the clustering (we’ll explain more about WSS later in the chapter).
Let’s cluster the protein data.
distmat <- dist(pmatrix, method = "euclidean")   # compute the distance matrix
pfit <- hclust(distmat, method = "ward.D")       # cluster with Ward's method
plot(pfit, labels = protein$Country)             # plot the dendrogram
hclust() returns a dendrogram: a tree that represents the nested clusters. The dendrogram for the protein data is shown in figure 9.6. Two leaves of the tree are in the same cluster if there is still a path between them after the tree is cut at a certain depth; cutting the tree deeper disconnects more paths, and so creates more, smaller clusters.
This dendrogram suggests five clusters might be an appropriate number, as shown in figure 9.6. You can draw the rectangles on the dendrogram using the function rect.hclust():
rect.hclust(pfit, k=5)
To extract the members of each cluster from the hclust object, use cutree().
groups <- cutree(pfit, k = 5)

print_clusters <- function(data, groups, columns) {   # convenience function for printing
  groupedD <- split(data, groups)                     # the members of each cluster, along
  lapply(groupedD, function(df) df[, columns])        # with the values of selected columns
}

cols_to_print <- wrapr::qc(Country, RedMeat, Fish, Fr.Veg)

print_clusters(protein, groups, cols_to_print)
## $`1`
##       Country RedMeat Fish Fr.Veg
## 1     Albania    10.1  0.2    1.7
## 4    Bulgaria     7.8  1.2    4.2
## 18    Romania     6.2  1.0    2.8
## 25 Yugoslavia     4.4  0.6    3.2
##
## $`2`
##        Country RedMeat Fish Fr.Veg
## 2      Austria     8.9  2.1    4.3
## 3      Belgium    13.5  4.5    4.0
## 9       France    18.0  5.7    6.5
## 12     Ireland    13.9  2.2    2.9
## 14 Netherlands     9.5  2.5    3.7
## 21 Switzerland    13.1  2.3    4.9
## 22          UK    17.4  4.3    3.3
## 24   W Germany    11.4  3.4    3.8
##
## $`3`
##           Country RedMeat Fish Fr.Veg
## 5  Czechoslovakia     9.7  2.0    4.0
## 7       E Germany     8.4  5.4    3.6
## 11        Hungary     5.3  0.3    4.2
## 16         Poland     6.9  3.0    6.6
## 23           USSR     9.3  3.0    2.9
##
## $`4`
##    Country RedMeat Fish Fr.Veg
## 6  Denmark    10.6  9.9    2.4
## 8  Finland     9.5  5.8    1.4
## 15  Norway     9.4  9.7    2.7
## 20  Sweden     9.9  7.5    2.0
##
## $`5`
##     Country RedMeat Fish Fr.Veg
## 10   Greece    10.2  5.9    6.5
## 13    Italy     9.0  3.4    6.7
## 17 Portugal     6.2 14.2    7.9
## 19    Spain     7.1  7.0    7.2
There’s a certain logic to these clusters: the countries in each cluster tend to be in the same geographical region. It makes sense that countries in the same region would have similar dietary habits. You can also see patterns in the printed columns: for example, the Scandinavian countries of cluster 4 have high fish consumption, and the Mediterranean countries of cluster 5 have high fruit and vegetable consumption.
This dataset has only 25 points; it’s harder to “eyeball” the clusters and the cluster members when there are very many data points. In the next few sections, we’ll look at some ways to examine clusters more holistically.
As we mentioned in chapter 3, visualization is an effective way to get an overall view of the data, or in this case, the clusterings. The protein data is nine-dimensional, so it’s hard to visualize with a scatter plot.
We can try to visualize the clustering by projecting the data onto the first two principal components of the data.[1] If N is the number of variables that describe the data, then the principal components describe the hyperellipsoid in N-space that roughly bounds the data. Each principal component is an N-dimensional vector that describes an axis of that hyperellipsoid. Figure 9.7 shows this for N = 3.
We can project the data onto any two of the principal components, but the first two are the most likely to show useful information.
If you order the principal components by the length of the hyperellipsoid’s corresponding axes (longest first), then the first two principal components describe a plane in N-space that captures as much of the variation of the data as can be captured in two dimensions. In other words, it describes the best 2-D projection of the data. We’ll use the prcomp() call to do the principal components decomposition.
library(ggplot2)

princ <- prcomp(pmatrix)                          # calculate the principal components
nComp <- 2
project <- predict(princ, pmatrix)[, 1:nComp]     # project the data onto the first two
                                                  # principal components

project_plus <- cbind(as.data.frame(project),     # combine the projection with the
                      cluster = as.factor(groups),     # cluster labels and country names
                      country = protein$Country)

ggplot(project_plus, aes(x = PC1, y = PC2)) +     # plot the data, faceted by cluster
  geom_point(data = as.data.frame(project), color = "darkgrey") +
  geom_point() +
  geom_text(aes(label = country), hjust = 0, vjust = 1) +
  facet_wrap(~ cluster, ncol = 3, labeller = label_both)
You can see in figure 9.8 that cluster 1 (Romania/Yugoslavia/Bulgaria/Albania) and the Mediterranean cluster (cluster 5) are separated from the others. The other three clusters comingle in this projection, though they’re probably more separated in other projections.
An important question when evaluating clusters is whether a given cluster is “real”—does the cluster represent actual structure in the data, or is it an artifact of the clustering algorithm? As you’ll see, this is especially important with clustering algorithms like k-means, where the user has to specify the number of clusters beforehand. It’s been our experience that clustering algorithms will often produce several clusters that represent actual structure or relationships in the data, and then one or two clusters that are buckets that represent “other” or “miscellaneous.” Clusters of "other" tend to be made up of data points that have no real relationship to each other; they just don’t fit anywhere else.
One way to assess whether a cluster represents true structure is to see if the cluster holds up under plausible variations in the dataset. The fpc package has a function called clusterboot() that uses bootstrap resampling to evaluate how stable a given cluster is.[1] clusterboot() is an integrated function that both performs the clustering and evaluates the final produced clusters. It has interfaces to a number of R clustering algorithms, including both hclust and kmeans.
For a full description of the algorithm, see Christian Hennig, “Cluster-wise assessment of cluster stability,” Research Report 271, Dept. of Statistical Science, University College London, December 2006.
clusterboot’s algorithm uses the Jaccard coefficient, a similarity measure between sets. The Jaccard similarity between two sets A and B is the ratio of the number of elements in the intersection of A and B over the number of elements in the union of A and B. This is shown in figure 9.9.
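As a concrete sketch of the definition (jaccard() is our own helper, not part of fpc), the Jaccard coefficient of two small sets:

```r
# Jaccard similarity between two sets, represented here as vectors
# of distinct elements. Illustrative helper only.
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

jaccard(c(1, 2, 3, 4), c(3, 4, 5))  # 2 shared elements out of 5 total: 0.4
```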
The basic general strategy is as follows:

1. Cluster the data as usual.
2. Draw a new dataset of the same size as the original by resampling the original dataset with replacement (a bootstrap sample), and cluster the new dataset.
3. For every cluster in the original clustering, find the most similar cluster in the new clustering (the one with the maximum Jaccard coefficient) and record that coefficient. If this maximum Jaccard coefficient is less than 0.5, the original cluster is considered to be dissolved: it didn’t show up in the new clustering.
4. Repeat steps 2 and 3 several times.
The cluster stability of each cluster in the original clustering is the mean value of its Jaccard coefficient over all the bootstrap iterations. As a rule of thumb, clusters with a stability value less than 0.6 should be considered unstable. Values between 0.6 and 0.75 indicate that the cluster is measuring a pattern in the data, but there isn’t high certainty about which points should be clustered together. Clusters with stability values above about 0.85 can be considered highly stable (they’re likely to be real clusters).
Different clustering algorithms can give different stability values, even when the algorithms produce highly similar clusterings, so clusterboot() is also measuring how stable the clustering algorithm is.
Let’s run clusterboot() on the protein data, using hierarchical clustering with five clusters. Note that clusterboot() is randomized, so you may not get identical results.
library(fpc)                                            # load the fpc package

kbest_p <- 5                                            # set the desired number of clusters
cboot_hclust <- clusterboot(pmatrix,
                            clustermethod = hclustCBI,  # use hierarchical clustering
                            method = "ward.D",          # with Ward's method
                            k = kbest_p)

summary(cboot_hclust$result)                            # the results of the clustering
##               Length Class  Mode
## result        7      hclust list
## noise         1      -none- logical
## nc            1      -none- numeric
## clusterlist   5      -none- list
## partition     25     -none- numeric
## clustermethod 1      -none- character
## nccl          1      -none- numeric

groups <- cboot_hclust$result$partition                 # the vector of cluster labels
print_clusters(protein, groups, cols_to_print)          # the same clusters as before
## $`1`
##       Country RedMeat Fish Fr.Veg
## 1     Albania    10.1  0.2    1.7
## 4    Bulgaria     7.8  1.2    4.2
## 18    Romania     6.2  1.0    2.8
## 25 Yugoslavia     4.4  0.6    3.2
##
## $`2`
##        Country RedMeat Fish Fr.Veg
## 2      Austria     8.9  2.1    4.3
## 3      Belgium    13.5  4.5    4.0
## 9       France    18.0  5.7    6.5
## 12     Ireland    13.9  2.2    2.9
## 14 Netherlands     9.5  2.5    3.7
## 21 Switzerland    13.1  2.3    4.9
## 22          UK    17.4  4.3    3.3
## 24   W Germany    11.4  3.4    3.8
##
## $`3`
##           Country RedMeat Fish Fr.Veg
## 5  Czechoslovakia     9.7  2.0    4.0
## 7       E Germany     8.4  5.4    3.6
## 11        Hungary     5.3  0.3    4.2
## 16         Poland     6.9  3.0    6.6
## 23           USSR     9.3  3.0    2.9
##
## $`4`
##    Country RedMeat Fish Fr.Veg
## 6  Denmark    10.6  9.9    2.4
## 8  Finland     9.5  5.8    1.4
## 15  Norway     9.4  9.7    2.7
## 20  Sweden     9.9  7.5    2.0
##
## $`5`
##     Country RedMeat Fish Fr.Veg
## 10   Greece    10.2  5.9    6.5
## 13    Italy     9.0  3.4    6.7
## 17 Portugal     6.2 14.2    7.9
## 19    Spain     7.1  7.0    7.2

cboot_hclust$bootmean                                   # the vector of cluster stabilities
## [1] 0.8090000 0.7939643 0.6247976 0.9366667 0.7815000

cboot_hclust$bootbrd                                    # the count of how many times each
                                                        # cluster was dissolved (by default,
                                                        # over 100 bootstrap iterations)
## [1] 19 14 45  9 30
The clusterboot() results show that the cluster of countries with high fish consumption (cluster 4) is highly stable: the cluster stability is high, and the cluster was dissolved relatively few times. Clusters 1 and 2 are also quite stable; cluster 5 less so (you can see in figure 9.8 that the members of cluster 5 are separated from the other countries, but also fairly separated from each other). Cluster 3 has the characteristics of what we’ve been calling the “other” cluster.
clusterboot() assumes that you know the number of clusters, k. We eyeballed the appropriate k from the dendrogram, but this isn’t always feasible with a large dataset. Can we pick a plausible k in a more automated fashion? We’ll look at this question in the next section.
There are a number of heuristics and rules of thumb for picking clusters; a given heuristic will work better on some datasets than others. It’s best to take advantage of domain knowledge to help set the number of clusters, if that’s possible. Otherwise, try a variety of heuristics, and perhaps a few different values of k.
One simple heuristic is to compute the total within sum of squares (WSS) for different values of k and look for an “elbow” in the curve. We'll walk through the definition of WSS in this section.
Figure 9.10 shows data with four clusters. Define the centroid of each cluster as the point that is the mean value of all the points in the cluster. The centroid will be in the center of the cluster, as shown in the figure. The within sum of squares (or WSS_i) for a single cluster is the summed squared distance of each point in the cluster from the cluster’s centroid. This is shown in the figure for cluster 4.
The total within sum of squares is the sum of the WSS_i of all the clusters. We show the calculation in the following listing.
sqr_edist <- function(x, y) {            # squared Euclidean distance between two vectors
  sum((x - y)^2)
}

wss_cluster <- function(clustermat) {    # WSS of a single cluster
  c0 <- colMeans(clustermat)             # the centroid is the mean of all the points
  sum(apply(clustermat, 1,
            FUN = function(row) { sqr_edist(row, c0) }))   # sum of squared distances
}                                                          # of each point to the centroid

wss_total <- function(dmatrix, labels) { # total WSS of a clustering
  wsstot <- 0
  k <- length(unique(labels))
  for(i in 1:k)
    wsstot <- wsstot + wss_cluster(subset(dmatrix, labels == i))  # add up the WSS
  wsstot                                                          # of each cluster
}

wss_total(pmatrix, groups)
## [1] 71.94342
The total WSS will decrease as the number of clusters increases, because each cluster will be smaller and tighter. The hope is that the rate at which the WSS decreases will slow down for k beyond the optimal number of clusters. In other words, the graph of WSS versus k should flatten out beyond the optimal k, so the optimal k will be at the “elbow” of the graph. Let’s try calculating WSS for up to 10 clusters.
get_wss <- function(dmatrix, max_clusters) {   # get the total WSS for a range of clusterings
  wss = numeric(max_clusters)
  wss[1] <- wss_cluster(dmatrix)               # wss[1] treats all the data as one cluster

  d <- dist(dmatrix, method = "euclidean")
  pfit <- hclust(d, method = "ward.D")         # cluster the data once

  for(k in 2:max_clusters) {                   # for each k, cut the tree into k clusters
    labels <- cutree(pfit, k = k)              # and compute the total WSS
    wss[k] <- wss_total(dmatrix, labels)
  }
  wss
}

kmax <- 10
cluster_meas <- data.frame(nclusters = 1:kmax,
                           wss = get_wss(pmatrix, kmax))
breaks <- 1:kmax

ggplot(cluster_meas, aes(x = nclusters, y = wss)) +   # plot WSS as a function of k
  geom_point() + geom_line() +
  scale_x_continuous(breaks = breaks)
Figure 9.11 shows the plot of WSS as a function of k. Unfortunately, in this case the elbow of the graph is hard to see, although if you squint your eyes you might be able to convince yourself that there is an elbow at k = 2, and another one at k = 5 or 6. This means the best clusterings might be 2 clusters, 5 clusters, or 6 clusters.
The Calinski-Harabasz index is another commonly used measure of cluster goodness. It tries to find the point where all the clusters are tight, and also far apart from each other. To motivate (and calculate) the Calinski-Harabasz index (CH index, for short), we first need to define a few more terms.
As shown in figure 9.12, the total sum of squares (TSS) of a set of points is the sum of the squared distances of all the points from the centroid of the data. In the function get_wss() of listing 9.8, the value wss[1] is the TSS, and it is independent of the clustering. For a given clustering with total within sum of squares, we can also define the between sum of squares (BSS):
BSS = TSS - WSS
BSS measures how far apart the clusters are from each other. A good clustering has a small WSS (all the clusters are tight around their centers) and a large BSS. We can compare how BSS and WSS vary as we vary the number of clusters.
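The identity BSS = TSS - WSS can also be verified directly on a kmeans() fit (kmeans() is discussed later in this chapter), since the fit reports all three quantities as totss, tot.withinss, and betweenss. The toy data below is randomly generated for illustration:

```r
# Check the identity BSS = TSS - WSS on a small k-means fit.
# kmeans() reports totss (TSS), tot.withinss (WSS), and betweenss (BSS).
# The toy data here is made up for illustration.
set.seed(25)
toy <- matrix(rnorm(60), ncol = 2)            # 30 random 2-D points
km <- kmeans(toy, centers = 3, nstart = 10)

all.equal(km$betweenss, km$totss - km$tot.withinss)  # TRUE
```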
total_ss <- function(dmatrix) {          # the total sum of squares of the data
  grandmean <- colMeans(dmatrix)
  sum(apply(dmatrix, 1,
            FUN = function(row) { sqr_edist(row, grandmean) }))
}

tss <- total_ss(pmatrix)
cluster_meas$bss <- with(cluster_meas, tss - wss)

library(cdata)                                      # load the cdata package to reshape the data
cmlong <- unpivot_to_blocks(cluster_meas,           # gather the wss and bss columns
                            nameForNewKeyColumn = "measure",    # into a single
                            nameForNewValueColumn = "value",    # (measure, value) pair
                            columnsToTakeFrom = c("wss", "bss"))

ggplot(cmlong, aes(x = nclusters, y = value)) +
  geom_point() + geom_line() +
  facet_wrap(~measure, ncol = 1, scale = "free_y") +
  scale_x_continuous(breaks = 1:10)
Figure 9.13 shows that as k increases, BSS increases, while WSS decreases. We want a clustering with a good balance of BSS and WSS. To find such a clustering, we have to look at a couple of measures related to the BSS and the WSS.
The within cluster variance W is given by
W = WSS / (n - k)
Here, n is the number of data points and k is the number of clusters. You can think of W as the “average” WSS.
The between cluster variance B is given by
B = BSS / (k - 1)
Again, you can think of B as the average contribution to the BSS from each cluster.
A good clustering should have a small average WSS and a large average BSS, so we might try to maximize the ratio of B to W. This is the Calinski-Harabasz (CH) index. Let’s calculate the CH index and plot it for up to 10 clusters.
cluster_meas$B <- with(cluster_meas, bss / (nclusters - 1))   # the between cluster variance
n <- nrow(pmatrix)
cluster_meas$W <- with(cluster_meas, wss / (n - nclusters))   # the within cluster variance
cluster_meas$ch_crit <- with(cluster_meas, B / W)             # the CH index is the ratio B / W

ggplot(cluster_meas, aes(x = nclusters, y = ch_crit)) +
  geom_point() + geom_line() +
  scale_x_continuous(breaks = 1:kmax)
Looking at figure 9.14, you see that the CH criterion is maximized at k = 2, with another local maximum at k = 5. The k = 2 clustering corresponds to the first split of the protein data dendrogram, as shown in figure 9.15; if you use clusterboot() to do the clustering, you’ll see that the clusters are highly stable, though perhaps not very informative.
There are several other measures that you can try when picking k. The gap statistic[a] is an attempt to automate the “elbow finding” on the WSS curve. It works best when the data comes from a mix of populations that all have approximately Gaussian distributions (called a mixture of Gaussians). We’ll see one more measure, the average silhouette width, when we discuss kmeans().
See Robert Tibshirani, Guenther Walther, and Trevor Hastie, “Estimating the number of clusters in a data set via the gap statistic,” Journal of the Royal Statistical Society B, 2001. 63(2), pp. 411-423; www.stanford.edu/~hastie/Papers/gap.pdf.
K-means is a popular clustering algorithm when the data is all numeric and the distance metric is squared Euclidean (though you could in theory run it with other distance metrics). It’s fairly ad hoc and has the major disadvantage that you must pick k in advance. On the plus side, it’s easy to implement (one reason it’s so popular) and can be faster than hierarchical clustering on large datasets. It works best on data that looks like a mixture of Gaussians, which the protein data unfortunately doesn’t appear to be.
The function to run k-means in R is kmeans(). The output of kmeans() includes the cluster labels, the centers (centroids) of the clusters, the total sum of squares, total WSS, total BSS, and the WSS of each cluster.
The k-means algorithm is illustrated in figure 9.16, with k = 2. This algorithm isn’t guaranteed to have a unique stopping point. K-means can be fairly unstable, in that the final clusters depend on the initial cluster centers. It’s good practice to run k-means several times with different random starts, and then select the clustering with the lowest total WSS. The kmeans() function can do this automatically, though it defaults to using only one random start.
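The assign/update loop of figure 9.16 can be sketched from scratch in a few lines of R. This is purely illustrative (simple_kmeans() is our own toy function with no empty-cluster handling or convergence test); in practice, use R’s kmeans():

```r
# A bare-bones k-means loop: alternate between assigning each point to
# its nearest center and recomputing each center as its cluster's mean.
# Illustrative only; R's built-in kmeans() is the practical choice.
simple_kmeans <- function(x, k, iters = 20) {
  set.seed(5)
  centers <- x[sample(nrow(x), k), , drop = FALSE]  # random initial centers
  for (i in seq_len(iters)) {
    # assign each point to the nearest center (squared Euclidean distance)
    d2 <- sapply(seq_len(k),
                 function(j) colSums((t(x) - centers[j, ])^2))
    labels <- max.col(-d2)
    # recompute each (nonempty) cluster's center as the mean of its points
    for (j in seq_len(k)) {
      if (any(labels == j))
        centers[j, ] <- colMeans(x[labels == j, , drop = FALSE])
    }
  }
  list(cluster = labels, centers = centers)
}
```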
Let’s run kmeans() on the protein data (scaled to 0 mean and unit standard deviation, as before). We’ll use k = 5, as shown in listing 9.11. Note that kmeans() is randomized code, so you may not get exactly the results shown.
kbest_p <- 5
pclusters <- kmeans(pmatrix, kbest_p,              # run kmeans with five clusters,
                    nstart = 100, iter.max = 100)  # 100 random starts, and up to
                                                   # 100 iterations per run

summary(pclusters)                                 # kmeans() returns all the
                                                   # sum-of-squares measures
##              Length Class  Mode
## cluster      25     -none- numeric
## centers      45     -none- numeric
## totss         1     -none- numeric
## withinss      5     -none- numeric
## tot.withinss  1     -none- numeric
## betweenss     1     -none- numeric
## size          5     -none- numeric
## iter          1     -none- numeric
## ifault        1     -none- numeric

pclusters$centers                                  # the centroids of the clusters,
                                                   # in scaled coordinates
##        RedMeat  WhiteMeat        Eggs       Milk       Fish    Cereals
## 1 -0.570049402  0.5803879 -0.08589708 -0.4604938 -0.4537795  0.3181839
## 2 -0.508801956 -1.1088009 -0.41248496 -0.8320414  0.9819154  0.1300253
## 3 -0.807569986 -0.8719354 -1.55330561 -1.0783324 -1.0386379  1.7200335
## 4  0.006572897 -0.2290150  0.19147892  1.3458748  1.1582546 -0.8722721
## 5  1.011180399  0.7421332  0.94084150  0.5700581 -0.2671539 -0.6877583
##       Starch       Nuts      Fr.Veg
## 1  0.7857609 -0.2679180  0.06873983
## 2 -0.1842010  1.3108846  1.62924487
## 3 -1.4234267  0.9961313 -0.64360439
## 4  0.1676780 -0.9553392 -1.11480485
## 5  0.2288743 -0.5083895  0.02161979

pclusters$size                                     # the number of points in each cluster
## [1] 5 4 4 4 8

groups <- pclusters$cluster                        # the vector of cluster labels
cols_to_print <- wrapr::qc(Country, RedMeat, Fish, Fr.Veg)
print_clusters(protein, groups, cols_to_print)
## $`1`
##           Country RedMeat Fish Fr.Veg
## 5  Czechoslovakia     9.7  2.0    4.0
## 7       E Germany     8.4  5.4    3.6
## 11        Hungary     5.3  0.3    4.2
## 16         Poland     6.9  3.0    6.6
## 23           USSR     9.3  3.0    2.9
##
## $`2`
##     Country RedMeat Fish Fr.Veg
## 10   Greece    10.2  5.9    6.5
## 13    Italy     9.0  3.4    6.7
## 17 Portugal     6.2 14.2    7.9
## 19    Spain     7.1  7.0    7.2
##
## $`3`
##       Country RedMeat Fish Fr.Veg
## 1     Albania    10.1  0.2    1.7
## 4    Bulgaria     7.8  1.2    4.2
## 18    Romania     6.2  1.0    2.8
## 25 Yugoslavia     4.4  0.6    3.2
##
## $`4`
##    Country RedMeat Fish Fr.Veg
## 6  Denmark    10.6  9.9    2.4
## 8  Finland     9.5  5.8    1.4
## 15  Norway     9.4  9.7    2.7
## 20  Sweden     9.9  7.5    2.0
##
## $`5`
##        Country RedMeat Fish Fr.Veg
## 2      Austria     8.9  2.1    4.3
## 3      Belgium    13.5  4.5    4.0
## 9       France    18.0  5.7    6.5
## 12     Ireland    13.9  2.2    2.9
## 14 Netherlands     9.5  2.5    3.7
## 21 Switzerland    13.1  2.3    4.9
## 22          UK    17.4  4.3    3.3
## 24   W Germany    11.4  3.4    3.8
To run kmeans(), you must know k. The fpc package (the same package that has clusterboot()) has a function called kmeansruns() that calls kmeans() over a range of k and estimates the best k. It then returns its pick for the best value of k, the output of kmeans() for that value, and a vector of criterion values as a function of k. Currently, kmeansruns() has two criteria: the Calinski-Harabasz index ("ch") and the average silhouette width ("asw"). For either criterion, the maximum value indicates the optimal number of clusters (for more about silhouette clustering, see http://mng.bz/Qe15). It’s a good idea to examine the criterion values over the entire range of k, since you may see evidence for a k that the algorithm didn’t automatically pick. The following listing illustrates this point.
clustering_ch <- kmeansruns(pmatrix, krange = 1:10, criterion = "ch")    # Run kmeansruns() from 1-10 clusters with the CH criterion
clustering_ch$bestk                                                      # The CH criterion picks two clusters
## [1] 2

clustering_asw <- kmeansruns(pmatrix, krange = 1:10, criterion = "asw")  # Run kmeansruns() from 1-10 clusters with the ASW criterion
clustering_asw$bestk                                                     # The ASW criterion picks three clusters
## [1] 3

clustering_asw$crit                                                      # Look at the ASW criterion as a function of k
## [1] 0.0000000 0.3271084 0.3351694 0.2617868 0.2639450 0.2734815 0.2471165
## [8] 0.2429985 0.2412922 0.2388293

clustering_ch$crit                                                       # Look at the CH criterion as a function of k
## [1]  0.000000 14.094814 11.417985 10.418801 10.011797  9.964967  9.861682
## [8]  9.412089  9.166676  9.075569

cluster_meas$ch_crit                                                     # Compare with the CH values computed for the hclust() clusterings in an earlier listing
## [1]       NaN 12.215107 10.359587  9.690891 10.011797  9.964967  9.506978
## [8]  9.092065  8.822406  8.695065

summary(clustering_ch)                                                   # kmeansruns() also returns the output of kmeans() for k = bestk
##              Length Class  Mode
## cluster      25     -none- numeric
## centers      18     -none- numeric
## totss         1     -none- numeric
## withinss      2     -none- numeric
## tot.withinss  1     -none- numeric
## betweenss     1     -none- numeric
## size          2     -none- numeric
## iter          1     -none- numeric
## ifault        1     -none- numeric
## crit         10     -none- numeric
## bestk         1     -none- numeric
The top graph of figure 9.17 compares the results of the two clustering criteria provided by kmeansruns(); both criteria have been rescaled to compatible units. They suggest two to three clusters as the best choice. However, if you compare the values of the (unscaled) CH criterion for the k-means and hierarchical clusterings, as shown in the bottom graph of figure 9.17, you'll see that the CH criterion produces different curves for the kmeans() and hclust() clusterings, but that it takes the same value (which probably means it found the same clusters) at k = 5 and k = 6. This might be taken as evidence that either 5 or 6 is the optimal choice for k.
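A graph like the top panel of figure 9.17 can be sketched as follows. The criterion vectors are hard-coded here from the output of the previous listing so the snippet stands alone; in practice you would use clustering_ch$crit and clustering_asw$crit directly. The use of scale() to put the two criteria into compatible units is our choice of rescaling, not the only possible one.

```r
library(ggplot2)

# CH and ASW criterion values for k = 1..10, as printed in the previous listing
ch_crit  <- c(0, 14.094814, 11.417985, 10.418801, 10.011797,
              9.964967, 9.861682, 9.412089, 9.166676, 9.075569)
asw_crit <- c(0, 0.3271084, 0.3351694, 0.2617868, 0.2639450,
              0.2734815, 0.2471165, 0.2429985, 0.2412922, 0.2388293)

# Rescale each criterion to mean 0 / unit variance so the curves are comparable
critframe <- data.frame(
  k       = rep(1:10, times = 2),
  value   = c(scale(ch_crit), scale(asw_crit)),
  measure = rep(c("ch", "asw"), each = 10)
)

# Plot both criteria as a function of k; the peaks suggest the best k
ggplot(critframe, aes(x = k, y = value, color = measure, shape = measure)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(breaks = 1:10)
```

The maxima of the two curves (at k = 2 for CH and k = 3 for ASW) correspond to the bestk values reported by kmeansruns().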
We can run clusterboot() using the k-means algorithm, as well.
kbest_p <- 5

cboot <- clusterboot(pmatrix, clustermethod = kmeansCBI,   # Run clusterboot() with the k-means algorithm, 100 bootstrap iterations, and k = 5
                     runs = 100, iter.max = 100,
                     krange = kbest_p, seed = 15555)       # Set the random seed to make the run reproducible

groups <- cboot$result$partition
print_clusters(protein, groups, cols_to_print)
## $`1`
##       Country RedMeat Fish Fr.Veg
## 1     Albania    10.1  0.2    1.7
## 4    Bulgaria     7.8  1.2    4.2
## 18    Romania     6.2  1.0    2.8
## 25 Yugoslavia     4.4  0.6    3.2
##
## $`2`
##    Country RedMeat Fish Fr.Veg
## 6  Denmark    10.6  9.9    2.4
## 8  Finland     9.5  5.8    1.4
## 15  Norway     9.4  9.7    2.7
## 20  Sweden     9.9  7.5    2.0
##
## $`3`
##           Country RedMeat Fish Fr.Veg
## 5  Czechoslovakia     9.7  2.0    4.0
## 7       E Germany     8.4  5.4    3.6
## 11        Hungary     5.3  0.3    4.2
## 16         Poland     6.9  3.0    6.6
## 23           USSR     9.3  3.0    2.9
##
## $`4`
##        Country RedMeat Fish Fr.Veg
## 2      Austria     8.9  2.1    4.3
## 3      Belgium    13.5  4.5    4.0
## 9       France    18.0  5.7    6.5
## 12     Ireland    13.9  2.2    2.9
## 14 Netherlands     9.5  2.5    3.7
## 21 Switzerland    13.1  2.3    4.9
## 22          UK    17.4  4.3    3.3
## 24   W Germany    11.4  3.4    3.8
##
## $`5`
##     Country RedMeat Fish Fr.Veg
## 10   Greece    10.2  5.9    6.5
## 13    Italy     9.0  3.4    6.7
## 17 Portugal     6.2 14.2    7.9
## 19    Spain     7.1  7.0    7.2

cboot$bootmean
## [1] 0.8670000 0.8420714 0.6147024 0.7647341 0.7508333

cboot$bootbrd
## [1] 15 20 49 17 32
Note that the stability numbers as given by cboot$bootmean (and the number of times that the clusters were “dissolved” as given by cboot$bootbrd) are different for the hierarchical clustering and k-means, even though the discovered clusters are the same. This shows that the stability of a clustering is partly a function of the clustering algorithm, not just the data. Again, the fact that both clustering algorithms discovered the same clusters might be taken as an indication that 5 is the optimal number of clusters.
Clustering is often used as part of data exploration, or as a precursor to other supervised learning methods. But you may want to use the clusters that you discovered to categorize new data, as well. One common way to do so is to treat the centroid of each cluster as the representative of the cluster as a whole, and then assign new points to the cluster with the nearest centroid. Note that if you scaled the original data before clustering, then you should also scale the new data point the same way before assigning it to a cluster.
Listing 9.14 shows an example of a function that assigns a new data point, newpt (represented as a vector), to a clustering, centers, which is represented as a matrix where each row is a cluster centroid. This is the representation of cluster centroids that kmeans() returns. If the data was scaled using scale() before clustering, then xcenter and xscale are the scaled:center and scaled:scale attributes, respectively.
assign_cluster <- function(newpt, centers, xcenter = 0, xscale = 1) {
  xpt <- (newpt - xcenter) / xscale              # Center and scale the new data point
  dists <- apply(centers, 1, FUN = function(c0) {
    sqr_edist(c0, xpt)                           # Calculate how far the new point is from each cluster centroid
  })
  which.min(dists)                               # Return the cluster number of the nearest centroid
}
Note that the function sqr_edist() (the squared Euclidean distance) was defined previously, in section 9.1.1.
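As a quick refresher, that function simply sums the squared coordinate differences between two vectors:

```r
# Squared Euclidean distance between two vectors x and y
sqr_edist <- function(x, y) {
  sum((x - y)^2)
}

sqr_edist(c(0, 0), c(3, 4))   # returns 25 (the squared length of a 3-4-5 triangle's hypotenuse)
```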
Let’s look at an example of assigning points to clusters, using synthetic data. First, we’ll generate the data.
mean1 <- c(1, 1, 1)        # Set the parameters for three clusters
sd1 <- c(1, 2, 1)

mean2 <- c(10, -3, 5)
sd2 <- c(2, 1, 2)

mean3 <- c(-5, -5, -5)
sd3 <- c(1.5, 2, 1)

library(MASS)              # Use mvrnorm() from the MASS package to generate multivariate normal data
clust1 <- mvrnorm(100, mu = mean1, Sigma = diag(sd1))
clust2 <- mvrnorm(100, mu = mean2, Sigma = diag(sd2))
clust3 <- mvrnorm(100, mu = mean3, Sigma = diag(sd3))
toydata <- rbind(clust3, rbind(clust1, clust2))

tmatrix <- scale(toydata)                   # Scale the synthetic data
tcenter <- attr(tmatrix, "scaled:center")   # Get the scaling attributes, then remove them from the matrix
tscale <- attr(tmatrix, "scaled:scale")
tmatrix <- rm_scales(tmatrix)

kbest_t <- 3
tclusters <- kmeans(tmatrix, kbest_t, nstart = 100, iter.max = 100)   # Cluster the synthetic data into three clusters
tclusters$size                              # The resulting clusters are about the right size
## [1] 101 100 99
Let’s compare the centers of the found k-means clusters to the true cluster centers. To do that, we need to unscale tclusters$centers. The scale() function works by subtracting the center vector, then dividing by the scale vector. So to reverse the process, first “unscale” the scaled matrix, then “uncenter” it.
unscaled <- scale(tclusters$centers, center = FALSE, scale = 1 / tscale)
rm_scales(scale(unscaled, center = -tcenter, scale = FALSE))
##         [,1]      [,2]       [,3]
## 1  9.8234797 -3.005977  4.7662651
## 2 -4.9749654 -4.862436 -5.0577002
## 3  0.8926698  1.185734  0.8336977
Comparing the unscaled centers to mean1, mean2, and mean3 in listing 9.15, we see that center 1 is close to mean2 = (10, -3, 5), center 2 is close to mean3 = (-5, -5, -5), and center 3 is close to mean1 = (1, 1, 1).
So it appears that the discovered clusters are consistent with the true clusters.
Now we can demonstrate assigning new points to the clusters. Let’s generate a point from each of the true clusters, and see which k-means cluster it is assigned to.
assign_cluster(mvrnorm(1, mean1, diag(sd1)),   # Generate a random point from the first true cluster and assign it
               tclusters$centers,
               tcenter, tscale)
## 3
## 3

assign_cluster(mvrnorm(1, mean2, diag(sd2)),   # Generate and assign a random point from the second true cluster
               tclusters$centers,
               tcenter, tscale)
## 1
## 1

assign_cluster(mvrnorm(1, mean3, diag(sd3)),   # Generate and assign a random point from the third true cluster
               tclusters$centers,
               tcenter, tscale)
## 2
## 2
The assign_cluster() function has correctly assigned each point to the appropriate cluster.
At this stage, you have learned how to estimate the appropriate number of clusters for a dataset, how to cluster a dataset using both hierarchical clustering and k-means, and how to evaluate the resulting clusters. Here’s what you should remember about clustering:
Sometimes, rather than looking for subsets of data points that are highly similar to each other, you’d like to know what kinds of data (or which data attributes) tend to occur together. In the next section, we’ll look at one approach to this problem.
Association rule mining is used to find objects or attributes that frequently occur together—for example, products that are often bought together during a shopping session, or queries that tend to occur together during a session on a website’s search engine. Such information can be used to recommend products to shoppers, to place frequently bundled items together on store shelves, or to redesign websites for easier navigation.
Suppose you work in a library. You want to know which books tend to be checked out together, to help you make predictions about book availability.
The unit of “togetherness” when mining association rules is called a transaction. Depending on the problem, a transaction could be a single shopping basket, a single user session on a website, or even a single customer. The objects that comprise a transaction are referred to as items in an itemset: the products in the shopping basket, the pages visited during a website session, the actions of a customer. Sometimes transactions are referred to as baskets, from the shopping basket analogy.
When a library patron checks out a set of books, that’s a transaction; the books that the patron checks out are the itemset that comprise the transaction. Table 9.1 represents a database of transactions (you run a library where fantasy is quite popular).
Mining for association rules occurs in two steps:
Let's look at the transactions that involve the items The Hobbit (H for short) and The Princess Bride (PB for short). The columns of table 9.2 represent transactions; each row marks (with an X) the transactions in which a given itemset appears.
|         | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Total |
|:--------|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:--:|------:|
| H       | X | X | X |   | X |   |   | X |   |    | 5 |
| PB      | X |   |   |   | X |   |   | X |   | X  | 4 |
| {H, PB} | X |   |   |   | X |   |   | X |   |    | 3 |
Looking over all the transactions in table 9.2, you find that The Hobbit appears in 5 of the 10 transactions (50%), The Princess Bride appears in 4 of them (40%), and the two books appear together in 3 transactions (30%).
We’d say the support of the itemset {The Hobbit, The Princess Bride} is 30%.
Of the five transactions that contain The Hobbit, three (60%) also contain The Princess Bride. So you can make a rule: “People who check out The Hobbit also check out The Princess Bride.” This rule should be correct (according to your data) 60% of the time. We’d say that the confidence of the rule is 60%.
Conversely, of the four transactions that contain The Princess Bride, three (75%) also contain The Hobbit. So the rule “People who check out The Princess Bride also check out The Hobbit” has 75% confidence.
Let's formally define rules, support, and confidence.
The rule “if X, then Y” means that every time you see the itemset X in a transaction, you expect to also see Y (with a given confidence). For the apriori algorithm (which we’ll look at in this section), Y is always an itemset with one item.
Suppose that your database of transactions is called T, and X is an itemset. Then support(X) is the number of transactions that contain X divided by the total number of transactions in T.
The confidence of the rule “if X, then Y” gives the fraction or percentage of the time that a rule is true, relative to how often you see X. In other words, if support(X) is how often the itemset X occurs in a transaction, and support({X, Y}) is how often both itemsets X and Y occur in a transaction, then the confidence of the rule “if X, then Y” is support({X, Y})/support(X).
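These definitions are easy to check directly. Here's a small base-R sketch using a toy encoding of the library transactions; the exact placement of checkouts and the filler titles are invented, but the totals match table 9.2 (The Hobbit in 5 of 10 transactions, The Princess Bride in 4, and both together in 3):

```r
# Ten toy transactions; titles other than "H" and "PB" are hypothetical filler
transactions <- list(
  c("H", "PB"), c("H"), c("H"), c("Dune"), c("H", "PB"),
  c("Dune"), c("Emma"), c("H", "PB", "Dune"), c("Emma"), c("PB")
)

# support(X): fraction of transactions that contain every item in the itemset X
support <- function(X, trans) {
  mean(vapply(trans, function(t) all(X %in% t), logical(1)))
}

# confidence of "if X, then Y": support(X and Y together) / support(X)
confidence <- function(X, Y, trans) {
  support(c(X, Y), trans) / support(X, trans)
}

support(c("H", "PB"), transactions)   # 0.3: both books in 3 of 10 transactions
confidence("H", "PB", transactions)   # 0.6: the rule "if H, then PB"
confidence("PB", "H", transactions)   # 0.75: the rule "if PB, then H"
```

This is, of course, only for intuition; the apriori implementation in the arules package computes these quantities efficiently over large transaction databases.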
The goal in association rule mining is to find all the interesting rules in the database with at least a given minimum support (say, 10%) and a minimum given confidence (say, 60%).
Suppose you work for a bookstore, and you want to recommend books that a customer might be interested in, based on all of their previous purchases and book interests. You want to use historical book interest information to develop some recommendation rules.
You can get information about customers’ book interests two ways: either they’ve purchased a book from you, or they’ve rated a book on your website (even if they bought the book somewhere else). In this case, a transaction is a customer, and an itemset is all the books that they’ve expressed an interest in, either by purchase or by rating.
The data that you’ll use is based on data collected in 2004 from the book community Book-Crossing[1] for research conducted at the Institut für Informatik, University of Freiburg.[2] The information is condensed into a single tab-separated text file called bookdata.tsv. Each row of the file consists of a user ID, a book title (which has been designated as a unique ID for each book), and the rating (which you won’t actually use in this example):
The original data repository can be found at http://mng.bz/2052. Since some artifacts in the original files caused errors when reading into R, we’re providing copies of the data as a prepared RData object: https://github.com/WinVector/PDSwR2/blob/master/Bookdata/bxBooks.RData. The prepared version of the data that we’ll use in this section is at https://github.com/WinVector/PDSwR2/blob/master/Bookdata/bookdata.tsv.gz. Further information and scripts for preparing the data can be found at https://github.com/WinVector/PDSwR2/tree/master/Bookdata.
The researchers’ original paper is “Improving Recommendation Lists Through Topic Diversification,” Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, Georg Lausen; Proceedings of the 14th International World Wide Web Conference (WWW ’05), May 10-14, 2005, Chiba, Japan. It can be found online at http://mng.bz/7trR.
| token                 | userid | rating | title                 |
|:----------------------|-------:|-------:|:----------------------|
| always have popsicles | 172742 |      0 | Always Have Popsicles |
The token column contains lowercased title strings; the tokens were used to identify books with different ISBNs (the original book IDs) that had the same title except for casing. The title column holds properly capitalized title strings; these are unique per book, so for this example you will use them as book IDs.
In this format, the transaction (customer) information is diffused through the data, rather than being all in one row; this reflects the way the data would naturally be stored in a database, since the customer’s activity would be diffused throughout time. Books generally come in different editions or from different publishers. For this example, we’ve condensed all different versions into a single item; hence, different copies or printings of Little Women will all map to the same item ID in our data (namely, the title “Little Women”).
The original data includes approximately a million ratings of 271,379 books from 278,858 readers. Our data will have fewer books due to the mapping that we discussed earlier.
Now you’re ready to mine.
You’ll use the package arules for association rule mining. arules includes an implementation of the popular association rule algorithm apriori, as well as implementations to read in and examine transaction data.[1] The package uses special data types to hold and manipulate the data; you’ll explore these data types as you work the example.
For a more comprehensive introduction to arules than we can give in this chapter, please see Hahsler, Grün, Hornik, and Buchta, “Introduction to arules—A computational environment for mining association rules and frequent item sets,” online at cran.r-project.org/web/packages/arules/vignettes/arules.pdf.
You can read the data directly from the bookdata.tsv.gz file into the object bookbaskets using the function read.transactions().
library(arules)                                      # Load the arules package
bookbaskets <- read.transactions("bookdata.tsv.gz",
                                 format = "single",  # Specify the file and the file format
                                 header = TRUE,      # The file has a header row
                                 sep = "\t",         # Specify the column separator (a tab)
                                 cols = c("userid", "title"),   # Specify the columns that hold the transaction IDs and the item IDs, respectively
                                 rm.duplicates = TRUE)          # Tell the function to look for and remove duplicate entries
The read.transactions() function reads data in two formats: the format where every row corresponds to a single item (like bookdata.tsv.gz), and a format where each row corresponds to a single transaction, possibly with a transaction ID, like table 9.1. To read data in the first format, use the argument format = "single"; to read data in the second format, use the argument format = "basket".
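As a sketch of the second format (the file contents here are invented for illustration), each line of a basket-format file is one complete transaction, with the items separated by the sep character:

```r
library(arules)

# Hypothetical basket-format data: one transaction per line, comma-separated items
basket_file <- tempfile(fileext = ".txt")
writeLines(c("The Hobbit,The Princess Bride",
             "The Hobbit",
             "The Princess Bride,The Time Machine"),
           basket_file)

toybaskets <- read.transactions(basket_file, format = "basket", sep = ",")
toybaskets   # three transactions over three distinct items
```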
It sometimes happens that a reader will buy one edition of a book and then later add a rating for that book under a different edition. Because of the way we’re representing books for this example, these two actions will result in duplicate entries. The rm.duplicates = TRUE argument will eliminate them. It will also output some (not too useful) diagnostics about the duplicates.
Once you’ve read in the data, you can inspect the resulting object.
Transactions are represented as a special object called transactions. You can think of a transactions object as a 0/1 matrix, with one row for every transaction (in this example, a customer) and one column for every possible item (in this example, a book). The matrix entry (i, j) is 1 if the ith transaction contains item j, or if customer i has expressed an interest in book j. There are a number of calls you can use to examine the transaction data, as the next listing shows.
class(bookbaskets)           # The object is of class transactions
## [1] "transactions"
## attr(,"package")
## [1] "arules"

bookbaskets                  # Printing the object tells you its dimensions
## transactions in sparse format with
## 92108 transactions (rows) and
## 220447 items (columns)

dim(bookbaskets)             # You can also use dim() to see the dimensions of the matrix
## [1]  92108 220447

colnames(bookbaskets)[1:5]   # The columns are labeled by book title
## [1] " A Light in the Storm:[...]"
## [2] " Always Have Popsicles"
## [3] " Apple Magic"
## [4] " Ask Lily"
## [5] " Beyond IBM: Leadership Marketing and Finance for the 1990s"

rownames(bookbaskets)[1:5]   # The rows are labeled by customer ID
## [1] "10"     "1000"   "100001" "100002" "100004"
You can examine the distribution of transaction sizes (or basket sizes) with the function size():
basketSizes <- size(bookbaskets)
summary(basketSizes)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     1.0     1.0     1.0    11.1     4.0 10250.0
Most customers (at least half of them, in fact) only expressed interest in one book. But someone has expressed interest in more than 10,000! You probably want to look more closely at the size distribution to see what’s going on.
quantile(basketSizes, probs = seq(0, 1, 0.1))   # Look at the basket size distribution, in 10% increments
##    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100%
##     1     1     1     1     1     1     2     3     5    13 10253

library(ggplot2)                                # Plot the distribution to get a better look
ggplot(data.frame(count = basketSizes)) +
  geom_density(aes(x = count)) +
  scale_x_log10()
Figure 9.18 shows the distribution of basket sizes. 90% of customers expressed interest in fewer than 15 books; most of the remaining customers expressed interest in up to about 100 books or so; the call quantile(basketSizes, probs = c(0.99, 1)) will show you that 99% of customers expressed interest in 179 books or fewer. Still, a few people have expressed interest in several hundred, or even several thousand books.
Which books are they reading? The function itemFrequency() can tell you how often each book shows up in the transaction data.
bookCount <- itemFrequency(bookbaskets, "absolute")
summary(bookCount)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
##    1.000    1.000    1.000    4.638    3.000 2502.000
You can also find the 10 most frequently occurring books.
orderedBooks <- sort(bookCount, decreasing = TRUE)   # Sort the counts in decreasing order
knitr::kable(orderedBooks[1:10])                     # Display the 10 most frequently occurring books
# |                                                |    x|
# |:-----------------------------------------------|----:|
# |Wild Animus                                     | 2502|
# |The Lovely Bones: A Novel                       | 1295|
# |She's Come Undone                               |  934|
# |The Da Vinci Code                               |  905|
# |Harry Potter and the Sorcerer's Stone           |  832|
# |The Nanny Diaries: A Novel                      |  821|
# |A Painted House                                 |  819|
# |Bridget Jones's Diary                           |  772|
# |The Secret Life of Bees                         |  762|
# |Divine Secrets of the Ya-Ya Sisterhood: A Novel |  737|

orderedBooks[1] / nrow(bookbaskets)                  # The most popular book occurred in fewer than 3% of the baskets
## Wild Animus
##  0.02716376
The last observation in the preceding listing highlights one of the issues with mining high-dimensional data: when you have thousands of variables, or thousands of items, almost every event is rare. Keep this point in mind when deciding on support thresholds for rule mining; your thresholds will often need to be quite low.
Before we get to the rule mining, let’s refine the data a bit more. As you observed earlier, half of the customers in the data only expressed interest in a single book. Since you want to find books that occur together in people’s interest lists, you can’t make any direct use of people who haven’t yet shown interest in multiple books. You can restrict the dataset to customers who have expressed interest in at least two books:
bookbaskets_use <- bookbaskets[basketSizes > 1]
dim(bookbaskets_use)
## [1]  40822 220447
Now you’re ready to look for association rules.
In order to mine rules, you need to decide on a minimum support level and a minimum confidence level. For this example, let’s try restricting the itemsets that we’ll consider to those with a minimum support of 0.2%, or 0.002. This corresponds to itemsets that appear at least 0.002 * nrow(bookbaskets_use) times, which is about 82 transactions. We’ll use a confidence threshold of 75%.
rules <- apriori(bookbaskets_use,
                 parameter = list(support = 0.002,    # Call apriori() with a minimum support of 0.002 and a minimum confidence of 0.75
                                  confidence = 0.75))
summary(rules)
## set of 191 rules                                   # The number of rules discovered
##
## rule length distribution (lhs + rhs):sizes         # The distribution of rule lengths: a rule of length 3 has two items on the left side and one on the right
##   2   3   4   5
##  11 100  66  14
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   2.000   3.000   3.000   3.435   4.000   5.000
##
## summary of quality measures:                       # A summary of the rule quality measures, including support and confidence
##     support           confidence          lift            count
##  Min.   :0.002009   Min.   :0.7500   Min.   : 40.89   Min.   : 82.0
##  1st Qu.:0.002131   1st Qu.:0.8113   1st Qu.: 86.44   1st Qu.: 87.0
##  Median :0.002278   Median :0.8468   Median :131.36   Median : 93.0
##  Mean   :0.002593   Mean   :0.8569   Mean   :129.68   Mean   :105.8
##  3rd Qu.:0.002695   3rd Qu.:0.9065   3rd Qu.:158.77   3rd Qu.:110.0
##  Max.   :0.005830   Max.   :0.9882   Max.   :321.89   Max.   :238.0
##
## mining info:                                       # Some information on how apriori() was called
##            data ntransactions support confidence
## bookbaskets_use         40822   0.002       0.75
The quality measures on the rules include a rule’s support and confidence, the support count (how many transactions the rule applied to), and a quantity called lift. Lift compares the frequency of an observed pattern with how often you’d expect to see that pattern just by chance. The lift of a rule “if X, then Y” is given by support({X, Y}) / (support(X) * support(Y)). If the lift is near 1, then there’s a good chance that the pattern you observed is occurring just by chance. The larger the lift, the more likely that the pattern is “real.” In this case, all the discovered rules have a lift of at least 40, so they’re likely to be real patterns in customer behavior.
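To make the lift formula concrete, here's the calculation for the library checkout example from earlier in the chapter, where support(H) = 0.5, support(PB) = 0.4, and support({H, PB}) = 0.3:

```r
support_H    <- 0.5   # The Hobbit appears in 5 of 10 transactions
support_PB   <- 0.4   # The Princess Bride appears in 4 of 10
support_both <- 0.3   # Both books appear together in 3 of 10

# lift("if H, then PB") = support({H, PB}) / (support(H) * support(PB))
lift <- support_both / (support_H * support_PB)
lift
## [1] 1.5
```

A lift of 1.5 means the pair occurs 1.5 times more often than you'd expect if the two books were checked out independently; nowhere near the lifts of 40 and up seen in the book data, so with only 10 transactions this pattern could easily be chance.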
There are also other metrics and interest measures you can use to evaluate the rules by using the function interestMeasure(). We’ll look at two of these measures: coverage and fishersExactTest. Coverage is the support of the left side of the rule (X); it tells you how often the rule would be applied in the dataset. Fisher’s exact test is a significance test for whether an observed pattern is real or chance (the same thing lift measures; Fisher’s test is more formal). Fisher’s exact test returns the p-value, or the probability that you would see the observed pattern by chance; you want the p-value to be small.
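Fisher's exact test is also available in base R as fisher.test(), which operates on a 2×2 contingency table. As a toy illustration (an aside, not part of the arules workflow), here it is applied to the counts from the library example: 3 transactions with both books, 1 with only The Princess Bride, 2 with only The Hobbit, and 4 with neither:

```r
# 2x2 contingency table: rows = contains The Princess Bride?, columns = contains The Hobbit?
ctab <- matrix(c(3, 1,
                 2, 4),
               nrow = 2, byrow = TRUE,
               dimnames = list(PB = c("yes", "no"),
                               H  = c("yes", "no")))

fisher.test(ctab)$p.value   # with only 10 transactions, expect a large (non-significant) p-value
```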
measures <- interestMeasure(rules,                    # The first argument to interestMeasure() is the discovered rules
                            measure = c("coverage", "fishersExactTest"),   # The second argument is a list of interest measures to apply
                            transactions = bookbaskets_use)   # The last argument is a dataset to evaluate the measures over, usually the one used to mine the rules
summary(measures)
##     coverage        fishersExactTest
##  Min.   :0.002082   Min.   : 0.000e+00
##  1st Qu.:0.002511   1st Qu.: 0.000e+00
##  Median :0.002719   Median : 0.000e+00
##  Mean   :0.003039   Mean   :5.080e-138
##  3rd Qu.:0.003160   3rd Qu.: 0.000e+00
##  Max.   :0.006982   Max.   :9.702e-136
The coverage of the discovered rules ranges from 0.002–0.007, equivalent to a range of about 82–286 people. All the p-values from Fisher’s test are small, so it’s likely that the rules reflect actual customer behavior patterns.
You can also call interestMeasure() with the methods support, confidence, and lift, among others. This would be useful in our example if you wanted to get support, confidence, and lift estimates for the full dataset bookbaskets, rather than the filtered dataset bookbaskets_use—or for a subset of the data, for instance, only customers from the United States.
The function inspect() pretty-prints the rules. The function sort() allows you to sort the rules by a quality or interest measure, like confidence. To print the five most confident rules in the dataset, you could use the following statement, which we will expand out using pipe notation.
library(magrittr)                  # Load magrittr to get the pipe notation
rules %>%
  sort(., by = "confidence") %>%   # Sort the rules by confidence
  head(., n = 5) %>%               # Take the first five rules
  inspect(.)                       # Call inspect() to pretty-print the rules
For legibility, we show the output of this command in table 9.3.
| Left side | Right side | Support | Confidence | Lift | Count |
|:----------|:-----------|--------:|-----------:|-----:|------:|
| Four to Score, High Five, Seven Up, Two for the Dough | Three to Get Deadly | 0.002 | 0.988 | 165 | 84 |
| Harry Potter and the Order of the Phoenix, Harry Potter and the Prisoner of Azkaban, Harry Potter and the Sorcerer's Stone | Harry Potter and the Chamber of Secrets | 0.003 | 0.966 | 73 | 117 |
| Four to Score, High Five, One for the Money, Two for the Dough | Three to Get Deadly | 0.002 | 0.966 | 162 | 85 |
| Four to Score, Seven Up, Three to Get Deadly, Two for the Dough | High Five | 0.002 | 0.966 | 181 | 84 |
| High Five, Seven Up, Three to Get Deadly, Two for the Dough | Four to Score | 0.002 | 0.966 | 168 | 84 |
There are two things to notice in table 9.3. First, the rules concern books that come in series: the numbered series of novels about bounty hunter Stephanie Plum, and the Harry Potter series. So these rules essentially say that if a reader has read four Stephanie Plum or three Harry Potter books, they’re almost sure to buy another one.
The second thing to notice is that rules 1, 4, and 5 are permutations of the same itemset. This is likely to happen when the rules get long.
You can restrict which items appear in the left side or right side of a rule. Suppose you’re interested specifically in books that tend to co-occur with the novel The Lovely Bones. You can do this by restricting which books appear on the right side of the rule, using the appearance parameter.
brules <- apriori(bookbaskets_use,
                  parameter = list(support = 0.001,   # Relax the minimum support to 0.001 and the minimum confidence to 0.6
                                   confidence = 0.6),
                  appearance = list(rhs = c("The Lovely Bones: A Novel"),   # Only The Lovely Bones is allowed to appear on the right side of the rules
                                    default = "lhs"))                       # By default, all the books can go into the left side of the rules
summary(brules)
## set of 46 rules
##
## rule length distribution (lhs + rhs):sizes
##  3  4
## 44  2
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   3.000   3.000   3.000   3.043   3.000   4.000
##
## summary of quality measures:
##     support           confidence          lift           count
##  Min.   :0.001004   Min.   :0.6000   Min.   :21.81   Min.   :41.00
##  1st Qu.:0.001029   1st Qu.:0.6118   1st Qu.:22.24   1st Qu.:42.00
##  Median :0.001102   Median :0.6258   Median :22.75   Median :45.00
##  Mean   :0.001132   Mean   :0.6365   Mean   :23.14   Mean   :46.22
##  3rd Qu.:0.001219   3rd Qu.:0.6457   3rd Qu.:23.47   3rd Qu.:49.75
##  Max.   :0.001396   Max.   :0.7455   Max.   :27.10   Max.   :57.00
##
## mining info:
##            data ntransactions support confidence
## bookbaskets_use         40822   0.001        0.6
The supports, confidences, counts, and lifts are lower than they were in our previous example, but the lifts are still much greater than one, so it’s likely that the rules reflect real customer behavior patterns.
Let’s inspect the rules, sorted by confidence. Since they’ll all have the same right side, you can use the lhs() function to only look at the left sides.
brules %>%
  sort(., by = "confidence") %>%
  lhs(.) %>%                        # Get the left sides of the sorted rules
  head(., n = 5) %>%
  inspect(.)
##   items
## 1 {Divine Secrets of the Ya-Ya Sisterhood: A Novel,
##    Lucky : A Memoir}
## 2 {Lucky : A Memoir,
##    The Notebook}
## 3 {Lucky : A Memoir,
##    Wild Animus}
## 4 {Midwives: A Novel,
##    Wicked: The Life and Times of the Wicked Witch of the West}
## 5 {Lucky : A Memoir,
##    Summer Sisters}
Note that four of the five most confident rules include Lucky: A Memoir in the left side, which perhaps isn’t surprising, since Lucky was written by the author of The Lovely Bones. Suppose you want to find out about works by other authors that are interesting to people who showed interest in The Lovely Bones; you can use subset() to filter down to only rules that don’t include Lucky.
brulesSub <- subset(brules, subset = !(lhs %in% "Lucky : A Memoir"))   # Restrict to the subset of rules where Lucky is not in the left side
brulesSub %>%
  sort(., by = "confidence") %>%
  lhs(.) %>%
  head(., n = 5) %>%
  inspect(.)
##   items
## 1 {Midwives: A Novel,
##    Wicked: The Life and Times of the Wicked Witch of the West}
## 2 {She's Come Undone,
##    The Secret Life of Bees,
##    Wild Animus}
## 3 {A Walk to Remember,
##    The Nanny Diaries: A Novel}
## 4 {Beloved,
##    The Red Tent}
## 5 {The Da Vinci Code,
##    The Reader}
These examples show that association rule mining is often highly interactive. To get interesting rules, you must often set the support and confidence levels fairly low; as a result, you can get many, many rules. Some rules will be more interesting or surprising to you than others; to find them requires sorting the rules by different interest measures, or perhaps restricting yourself to specific subsets of rules.
You've now walked through an example of using association rules to explore common patterns in purchase data. Here’s what you should remember about association rules:
In this chapter, you’ve learned how to find similarities in data using two different clustering methods in R, and how to find items that tend to occur together in data using association rules. You’ve also learned how to evaluate your discovered clusters and your discovered rules.
Unsupervised methods like the ones we’ve covered in this chapter are really more exploratory in nature. Unlike with supervised methods, there’s no “ground truth” to evaluate your findings against. But the findings from unsupervised methods can be the starting point for more-focused experiments and modeling.
In the last few chapters, we’ve covered the most basic modeling and data analysis techniques; they’re all good first approaches to consider when you’re starting a new project. In the next chapter, we’ll touch on a few more-advanced methods.
In this chapter you have learned