This chapter covers

- Using clustering methods to explore data and look for similarities
- Choosing the right distance metric and the right number of clusters
- Evaluating a clustering for stability
In the previous chapter, we covered using the vtreat package to prepare messy real-world data for modeling. In this chapter, we’ll look at methods to discover unknown relationships in data. These methods are called unsupervised methods. With unsupervised methods, there’s no outcome that you’re trying to predict; instead, you want to discover patterns in the data that perhaps you hadn’t previously suspected. For example, you may want to find groups of customers with similar purchase patterns, or correlations between population movement and socioeconomic factors. We will still consider this pattern discovery to be “modeling,” and as such, the outcomes of the algorithms can still be evaluated, as shown in the mental model for this chapter (figure 9.1).
Unsupervised analyses are often not ends in themselves; rather, they’re ways of finding relationships and patterns that can be used to build predictive models. In fact, we encourage you to think of unsupervised methods as exploratory—procedures that help you get your hands in the data—rather than as black-box approaches that mysteriously and automatically give you “the right answer.”
In this chapter, we’ll look at two classes of unsupervised methods:
In cluster analysis, the goal is to group the observations in your data into clusters such that every data point in a cluster is more similar to the other points in its own cluster than it is to points in other clusters. For example, a company that offers guided tours might want to cluster its clients by behavior and tastes: which countries they like to visit; whether they prefer adventure tours, luxury tours, or educational tours; what kinds of activities they participate in; and what sorts of sites they like to visit. Such information can help the company design attractive travel packages and target the appropriate segments of its client base with them.
Cluster analysis is a topic worthy of a book in itself; in this chapter, we’ll discuss two approaches. Hierarchical clustering finds nested groups of clusters. An example of hierarchical clustering might be the standard plant taxonomy, which classifies plants by family, then genus, then species, and so on. The second approach we’ll cover is k-means, which is a quick and popular way of finding clusters in quantitative data.
Historically, cluster analysis is related to the problem of density estimation: if you think of your data as living in a high-dimensional space, then you want to find the regions of the space where the data is densest. If those regions are distinct, or nearly so, then you have clusters.
In order to cluster, you need the notions of similarity and dissimilarity. Dissimilarity can be thought of as distance, so that the points in a cluster are closer to each other than they are to the points in other clusters. This is shown in figure 9.2.
Different application areas will have different notions of distance and dissimilarity. In this section, we’ll cover a few of the most common ones: Euclidean distance, Hamming distance, Manhattan (city block) distance, and cosine similarity.
Suppose you have measurements on how many minutes per day subjects spend on different activities, and you want to group the subjects by their activity patterns.
Since your measurements are numerical and continuous, Euclidean distance is a good distance to use for clustering. Euclidean distance is the measure people tend to think of when they think of “distance.” Optimizing squared Euclidean distance is the basis of k-means. Of course, Euclidean distance only makes sense when all the data is real-valued (quantitative). If the data is categorical (in particular, binary), then other distances should be used.
The Euclidean distance between two vectors x and y is defined as
edist(x, y) <- sqrt((x[1] - y[1])^2 + (x[2] - y[2])^2 + ...)
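As a quick sanity check, the definition above can be written as a runnable R function. Note that edist() here is our own throwaway helper for illustration, not a library function:

```r
# Euclidean distance between two numeric vectors of the same length.
# edist() is an illustrative helper, not a built-in R function.
edist <- function(x, y) {
  sqrt(sum((x - y)^2))
}

edist(c(0, 0), c(3, 4))  # the classic 3-4-5 right triangle: distance 5
```

For many points at once, R’s built-in dist() function (used later in this chapter) computes the same quantity for all pairs of rows of a matrix.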
Suppose you want to group your recipe box into groups of similar recipes. One way to do that is to measure the similarity of their ingredients lists.
By this measure, pancakes, waffles, and crepes are highly similar (they have almost identical ingredients, and vary only in proportions); they all differ somewhat from cornbread (which uses cornmeal, rather than flour); and they all differ to a greater degree from mashed potatoes.
For categorical variables like recipe ingredients, gender (male/female), or qualitative size (small/medium/large), you can define the distance as 0 if two points are in the same category, and 1 otherwise. If all the variables are categorical, then you can use Hamming distance, which counts the number of mismatches:
hdist(x, y) <- sum((x[1] != y[1]) + (x[2] != y[2]) + ...)
Here, a != b is defined to have a value of 1 if the expression is true, and a value of 0 if the expression is false.
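The Hamming distance pseudocode above translates directly into R, because a logical comparison already yields 1 for TRUE and 0 for FALSE when summed. Again, hdist() is an illustrative helper of our own, not a library function:

```r
# Hamming distance: the number of coordinates in which two
# equal-length vectors disagree. Illustrative helper only.
hdist <- function(x, y) {
  sum(x != y)
}

hdist(c("small", "red", "round"), c("small", "blue", "square"))  # 2 mismatches
```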
You can also expand categorical variables to indicator variables (as we discussed in section 7.1.4), one for each level of the variable.
If the categories are ordered (like small/medium/large) so that some categories are “closer” to each other than others, then you can convert them to a numerical sequence. For example, (small/medium/large) might map to (1/2/3). Then you can use Euclidean distance or other distances for quantitative data.
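A minimal sketch of this conversion in R, using a factor with explicitly ordered levels (the data here is made up for illustration):

```r
# Convert an ordered categorical variable to a numeric sequence.
# The ordering small < medium < large is stated explicitly via levels.
sizes <- c("small", "large", "medium", "small")
size_num <- as.numeric(factor(sizes, levels = c("small", "medium", "large")))
size_num
## [1] 1 3 2 1
```

Now abs(size_num[i] - size_num[j]) treats small-to-large as a bigger difference than small-to-medium, as desired.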
Suppose you run a delivery service that caters to downtown businesses. You want to cluster your clients so that you can place pickup/drop-off boxes that are centrally located in each cluster.
Manhattan distance measures distance as the number of horizontal and vertical units it takes to get from one (real-valued) point to the other (no diagonal moves). This is also known as L1 distance (and Euclidean distance is L2 distance).
In this example, Manhattan distance is more appropriate because you want to measure distance by how far people will walk along the streets, not diagonally point-to-point (Euclidean distance). For example, in figure 9.3, client A is 2 blocks north of the site and 2 blocks west, while client B is 3 blocks south of the site and 1 block east. They are equidistant from the site (4 blocks) by Manhattan distance. But client B is further by Euclidean distance: the diagonal of a 3-by-1 rectangle is longer than the diagonal of a 2-by-2 square.
The Manhattan distance between two vectors x and y is defined as
mdist(x, y) <- sum(abs(x[1] - y[1]) + abs(x[2] - y[2]) + ...)
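As a sketch (mdist() and edist() are our own illustrative helpers), here is the client A/client B comparison from figure 9.3 in runnable form:

```r
# Manhattan (L1) and Euclidean (L2) distance helpers, for illustration only.
mdist <- function(x, y) sum(abs(x - y))
edist <- function(x, y) sqrt(sum((x - y)^2))

site <- c(0, 0)
clientA <- c(-2, 2)  # 2 blocks west, 2 blocks north of the site
clientB <- c(1, -3)  # 1 block east, 3 blocks south of the site

mdist(site, clientA)  # 4 blocks
mdist(site, clientB)  # also 4 blocks: equidistant by Manhattan distance
edist(site, clientB) > edist(site, clientA)  # TRUE: B is further by Euclidean distance
```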
Suppose you represent a document as rows of a document-text matrix, as we did in section 6.3.3, where each element i of the row vector gives the number of times that word i appeared in the document. Then the cosine similarity between two row vectors is a measure of the similarity of the corresponding documents.
Cosine similarity is a common similarity metric in text analysis. It measures the smallest angle between two vectors. In our text example, we assume non-negative vectors, so the angle theta between two vectors is between 0 and 90 degrees. Cosine similarity is shown in figure 9.4.
Two perpendicular vectors (theta = 90 degrees) are the most dissimilar; the cosine of 90 degrees is 0. Two parallel vectors are the most similar (identical, if you assume they’re both based at the origin); the cosine of 0 degrees is 1.
From elementary geometry, you can derive that the cosine of the angle between two vectors is given by the normalized dot product between the two vectors:
dot(x, y) <- sum(x[1] * y[1] + x[2] * y[2] + ...)
cossim(x, y) <- dot(x, y) / sqrt(dot(x, x) * dot(y, y))
You can turn the cosine similarity into a pseudo distance by subtracting it from 1.0 (though to get an actual metric, you should use 2 * acos(cossim(x, y)) / pi, which is 0 for identical vectors and 1 for perpendicular ones).
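The normalized dot product above is easy to check in R on the two extreme cases; cossim() is an illustrative helper, not a library function:

```r
# Cosine similarity as the normalized dot product of two vectors.
# cossim() is an illustrative helper.
cossim <- function(x, y) {
  sum(x * y) / sqrt(sum(x * x) * sum(y * y))
}

cossim(c(1, 0), c(0, 1))  # perpendicular vectors: similarity 0
cossim(c(1, 2), c(2, 4))  # parallel vectors: similarity 1
```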
Different distance metrics will give you different clusters, as will different clustering algorithms. The application domain may give you a hint as to the most appropriate distance, or you can try several distance metrics. In this chapter, we’ll use (squared) Euclidean distance, as it’s the most natural distance for quantitative data.
To demonstrate clustering, we’ll use a small dataset from 1973 on protein consumption from nine different food groups in 25 countries in Europe.[1] The goal is to group the countries based on patterns in their protein consumption. The dataset is loaded into R as a data frame called protein, as shown in the next listing.
The original dataset was from the Data and Story Library, previously hosted at CMU. It is no longer online there. A tab-separated text file with the data can be found at https://github.com/WinVector/PDSwR2/tree/master/Protein/. The data file is called protein.txt; additional information can be found in the file protein_README.txt.
protein <- read.table("protein.txt", sep = "\t", header = TRUE)
summary(protein)
##            Country      RedMeat         WhiteMeat           Eggs
##  Albania       : 1   Min.   : 4.400   Min.   : 1.400   Min.   :0.500
##  Austria       : 1   1st Qu.: 7.800   1st Qu.: 4.900   1st Qu.:2.700
##  Belgium       : 1   Median : 9.500   Median : 7.800   Median :2.900
##  Bulgaria      : 1   Mean   : 9.828   Mean   : 7.896   Mean   :2.936
##  Czechoslovakia: 1   3rd Qu.:10.600   3rd Qu.:10.800   3rd Qu.:3.700
##  Denmark       : 1   Max.   :18.000   Max.   :14.000   Max.   :4.700
##  (Other)       :19
##       Milk            Fish           Cereals          Starch
##  Min.   : 4.90   Min.   : 0.200   Min.   :18.60   Min.   :0.600
##  1st Qu.:11.10   1st Qu.: 2.100   1st Qu.:24.30   1st Qu.:3.100
##  Median :17.60   Median : 3.400   Median :28.00   Median :4.700
##  Mean   :17.11   Mean   : 4.284   Mean   :32.25   Mean   :4.276
##  3rd Qu.:23.30   3rd Qu.: 5.800   3rd Qu.:40.10   3rd Qu.:5.700
##  Max.   :33.70   Max.   :14.200   Max.   :56.70   Max.   :6.500
##
##       Nuts           Fr.Veg
##  Min.   :0.700   Min.   :1.400
##  1st Qu.:1.500   1st Qu.:2.900
##  Median :2.400   Median :3.800
##  Mean   :3.072   Mean   :4.136
##  3rd Qu.:4.700   3rd Qu.:4.900
##  Max.   :7.800   Max.   :7.900
The documentation for this dataset doesn’t mention what the units of measurement are; we will assume all the columns are measured in the same units. This is important: units (or, more precisely, disparity in units) affect what clusterings an algorithm will discover. If you measure vital statistics of your subjects as age in years, height in feet, and weight in pounds, you’ll get different distances—and possibly different clusters—than if you measure age in years, height in meters, and weight in kilograms.
Ideally, you want a unit of change in each coordinate to represent the same degree of difference. In the protein dataset, we assume that the measurements are all in the same units, so it might seem that we’re okay. This may well be a correct assumption, but different food groups provide different amounts of protein. Animal-based food sources in general have more grams of protein per serving than plant-based food sources, so one could argue that a change in consumption of five grams is a bigger difference in terms of vegetable consumption than it is in terms of red meat consumption.
One way to try to make the units of each variable more compatible is to transform all the columns to have a mean value of 0 and a standard deviation of 1. This makes the standard deviation the unit of measurement in each coordinate. Assuming that your training data has a distribution that accurately represents the population at large, then a standard deviation represents approximately the same degree of difference in every coordinate.
You can scale numeric data in R using the function scale(). The output of scale() is a matrix. For the purposes of this chapter, you can mostly think of a matrix as a data frame with all numeric columns (this isn’t strictly true, but it’s close enough).
The scale() function annotates its output with two attributes: scaled:center holds the mean values of all the columns, and scaled:scale holds the standard deviations. You’ll store these away so you can “unscale” the data later.
vars_to_use <- colnames(protein)[-1]         # use all the columns except the first (Country)
pmatrix <- scale(protein[, vars_to_use])     # scale the data columns to mean 0, sd 1

pcenter <- attr(pmatrix, "scaled:center")    # store the centering and scaling
pscale <- attr(pmatrix, "scaled:scale")      # parameters, so you can "unscale" later

rm_scales <- function(scaled_matrix) {       # convenience function to remove the scaling
  attr(scaled_matrix, "scaled:center") <- NULL    # attributes from a scaled matrix
  attr(scaled_matrix, "scaled:scale") <- NULL
  scaled_matrix
}

pmatrix <- rm_scales(pmatrix)                # remove the scaling attributes from our copy
Figure 9.5 shows the effect of scaling on two variables, Fr.Veg and RedMeat. The raw (unscaled) variables have different ranges, reflecting the fact that the amount of protein supplied via red meat tends to be higher than the amount of protein supplied via fruits and vegetables. The scaled variables now have similar ranges, which makes comparing relative variation in each variable easier.
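As a quick sanity check on what scale() is doing, its output can be reproduced by hand: subtract each column’s mean, then divide by its standard deviation. The small data frame below is made up for illustration:

```r
# Verify that scale() is equivalent to centering each column at 0
# and dividing by its standard deviation. Toy data, made up.
d <- data.frame(age = c(25, 35, 45), height = c(1.60, 1.70, 1.80))

scaled <- scale(d)
by_hand <- apply(d, 2, function(col) (col - mean(col)) / sd(col))

max(abs(scaled - by_hand))  # effectively zero: the two methods agree
```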
Now you are ready to cluster the protein data. We’ll start with hierarchical clustering.
The hclust() function takes as input a distance matrix (as an object of class dist), which records the distances between all pairs of points in the data (using any one of a variety of metrics). You can compute the distance matrix using the function dist().
dist() can compute distance matrices using the Euclidean distance (method = "euclidean"), the Manhattan distance (method = "manhattan"), and something like the Hamming distance when categorical variables are expanded to indicators (method = "binary"). If you want to use another distance metric, you’ll have to compute the appropriate distance matrix yourself and convert it to a dist object using as.dist() (see help(dist) for further details).
hclust() also uses one of a variety of clustering methods to produce a tree that records the nested cluster structure. We’ll use Ward’s method, which starts out with each data point as an individual cluster and merges clusters iteratively so as to minimize the total within sum of squares (WSS) of the clustering (we’ll explain more about WSS later in the chapter).
Let’s cluster the protein data.
distmat <- dist(pmatrix, method = "euclidean")   # compute the distance matrix
pfit <- hclust(distmat, method = "ward.D")       # cluster with Ward's method
plot(pfit, labels = protein$Country)             # plot the dendrogram
hclust() returns a dendrogram: a tree that represents the nested clusters. The dendrogram for the protein data is shown in figure 9.6. Two leaves of the tree are in the same cluster if there is still a path between them after the tree is cut at a certain depth; cutting the tree deeper disconnects more paths, and so creates more, smaller clusters.
This dendrogram suggests five clusters might be an appropriate number, as shown in figure 9.6. You can draw the rectangles on the dendrogram using the function rect.hclust():
rect.hclust(pfit, k=5)
To extract the members of each cluster from the hclust object, use cutree().
groups <- cutree(pfit, k = 5)

print_clusters <- function(data, groups, columns) {   # convenience function for printing
  groupedD <- split(data, groups)                     # the members of each cluster, along
  lapply(groupedD, function(df) df[, columns])        # with the values of selected columns
}

cols_to_print <- wrapr::qc(Country, RedMeat, Fish, Fr.Veg)

print_clusters(protein, groups, cols_to_print)
## $`1`
##       Country RedMeat Fish Fr.Veg
## 1     Albania    10.1  0.2    1.7
## 4    Bulgaria     7.8  1.2    4.2
## 18    Romania     6.2  1.0    2.8
## 25 Yugoslavia     4.4  0.6    3.2
##
## $`2`
##        Country RedMeat Fish Fr.Veg
## 2      Austria     8.9  2.1    4.3
## 3      Belgium    13.5  4.5    4.0
## 9       France    18.0  5.7    6.5
## 12     Ireland    13.9  2.2    2.9
## 14 Netherlands     9.5  2.5    3.7
## 21 Switzerland    13.1  2.3    4.9
## 22          UK    17.4  4.3    3.3
## 24   W Germany    11.4  3.4    3.8
##
## $`3`
##           Country RedMeat Fish Fr.Veg
## 5  Czechoslovakia     9.7  2.0    4.0
## 7       E Germany     8.4  5.4    3.6
## 11        Hungary     5.3  0.3    4.2
## 16         Poland     6.9  3.0    6.6
## 23           USSR     9.3  3.0    2.9
##
## $`4`
##    Country RedMeat Fish Fr.Veg
## 6  Denmark    10.6  9.9    2.4
## 8  Finland     9.5  5.8    1.4
## 15  Norway     9.4  9.7    2.7
## 20  Sweden     9.9  7.5    2.0
##
## $`5`
##     Country RedMeat Fish Fr.Veg
## 10   Greece    10.2  5.9    6.5
## 13    Italy     9.0  3.4    6.7
## 17 Portugal     6.2 14.2    7.9
## 19    Spain     7.1  7.0    7.2
There’s a certain logic to these clusters: the countries in each cluster tend to be in the same geographical region. It makes sense that countries in the same region would have similar dietary habits. You can also see patterns in the printed columns: for example, the Scandinavian countries of cluster 4 have high fish consumption, and the Mediterranean countries of cluster 5 have high fruit and vegetable consumption.
This dataset has only 25 points; it’s harder to “eyeball” the clusters and the cluster members when there are very many data points. In the next few sections, we’ll look at some ways to examine clusters more holistically.
As we mentioned in chapter 3, visualization is an effective way to get an overall view of the data, or in this case, the clusterings. The protein data is nine-dimensional, so it’s hard to visualize with a scatter plot.
We can try to visualize the clustering by projecting the data onto the first two principal components of the data.[1] If N is the number of variables that describe the data, then the principal components describe the hyperellipsoid in N-space that roughly bounds the data. Each principal component is an N-dimensional vector that describes an axis of that hyperellipsoid. Figure 9.7 shows this for N = 3.
We can project the data onto any two of the principal components, but the first two are the most likely to show useful information.
If you order the principal components by the length of the hyperellipsoid’s corresponding axes (longest first), then the first two principal components describe a plane in N-space that captures as much of the variation of the data as can be captured in two dimensions. In other words, it describes the best 2-D projection of the data. We’ll use the prcomp() call to do the principal components decomposition.
library(ggplot2)

princ <- prcomp(pmatrix)                          # calculate the principal components
nComp <- 2
project <- predict(princ, pmatrix)[, 1:nComp]     # project the data onto the first two
                                                  # principal components

project_plus <- cbind(as.data.frame(project),     # combine the projection with the
                      cluster = as.factor(groups),     # cluster labels and country names
                      country = protein$Country)

ggplot(project_plus, aes(x = PC1, y = PC2)) +     # plot the data, faceted by cluster
  geom_point(data = as.data.frame(project), color = "darkgrey") +
  geom_point() +
  geom_text(aes(label = country), hjust = 0, vjust = 1) +
  facet_wrap(~ cluster, ncol = 3, labeller = label_both)
You can see in figure 9.8 that cluster 1 (Romania/Yugoslavia/Bulgaria/Albania) and the Mediterranean cluster (cluster 5) are separated from the others. The other three clusters comingle in this projection, though they’re probably more separated in other projections.
An important question when evaluating clusters is whether a given cluster is “real”—does the cluster represent actual structure in the data, or is it an artifact of the clustering algorithm? As you’ll see, this is especially important with clustering algorithms like k-means, where the user has to specify the number of clusters beforehand. It’s been our experience that clustering algorithms will often produce several clusters that represent actual structure or relationships in the data, and then one or two clusters that are buckets that represent “other” or “miscellaneous.” Clusters of "other" tend to be made up of data points that have no real relationship to each other; they just don’t fit anywhere else.
One way to assess whether a cluster represents true structure is to see if the cluster holds up under plausible variations in the dataset. The fpc package has a function called clusterboot() that uses bootstrap resampling to evaluate how stable a given cluster is.[1] clusterboot() is an integrated function that both performs the clustering and evaluates the final produced clusters. It has interfaces to a number of R clustering algorithms, including both hclust and kmeans.
For a full description of the algorithm, see Christian Hennig, “Cluster-wise assessment of cluster stability,” Research Report 271, Dept. of Statistical Science, University College London, December 2006.
clusterboot’s algorithm uses the Jaccard coefficient, a similarity measure between sets. The Jaccard similarity between two sets A and B is the ratio of the number of elements in the intersection of A and B over the number of elements in the union of A and B. This is shown in figure 9.9.
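As a concrete sketch of the definition (jaccard() is our own helper, not part of fpc), the Jaccard coefficient of two small sets:

```r
# Jaccard similarity between two sets, represented here as vectors
# of distinct elements. Illustrative helper only.
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

jaccard(c(1, 2, 3, 4), c(3, 4, 5))  # 2 shared elements out of 5 total: 0.4
```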
The basic general strategy is as follows:

1. Cluster the data as usual.
2. Draw a new dataset of the same size as the original by resampling the original dataset with replacement (a bootstrap sample), and cluster the new dataset.
3. For every cluster in the original clustering, find the most similar cluster in the new clustering (the one with the maximum Jaccard coefficient) and record that coefficient. If this maximum Jaccard coefficient is less than 0.5, the original cluster is considered to be dissolved: it didn’t show up in the new clustering.
4. Repeat steps 2 and 3 several times.
The cluster stability of each cluster in the original clustering is the mean value of its Jaccard coefficient over all the bootstrap iterations. As a rule of thumb, clusters with a stability value less than 0.6 should be considered unstable. Values between 0.6 and 0.75 indicate that the cluster is measuring a pattern in the data, but there isn’t high certainty about which points should be clustered together. Clusters with stability values above about 0.85 can be considered highly stable (they’re likely to be real clusters).
Different clustering algorithms can give different stability values, even when the algorithms produce highly similar clusterings, so clusterboot() is also measuring how stable the clustering algorithm is.
Let’s run clusterboot() on the protein data, using hierarchical clustering with five clusters. Note that clusterboot() is randomized, so you may not get identical results.
library(fpc)                                            # load the fpc package

kbest_p <- 5                                            # set the desired number of clusters
cboot_hclust <- clusterboot(pmatrix,
                            clustermethod = hclustCBI,  # use hierarchical clustering
                            method = "ward.D",          # with Ward's method
                            k = kbest_p)

summary(cboot_hclust$result)                            # the results of the clustering
##               Length Class  Mode
## result        7      hclust list
## noise         1      -none- logical
## nc            1      -none- numeric
## clusterlist   5      -none- list
## partition     25     -none- numeric
## clustermethod 1      -none- character
## nccl          1      -none- numeric

groups <- cboot_hclust$result$partition                 # the vector of cluster labels
print_clusters(protein, groups, cols_to_print)          # the same clusters as before
## $`1`
##       Country RedMeat Fish Fr.Veg
## 1     Albania    10.1  0.2    1.7
## 4    Bulgaria     7.8  1.2    4.2
## 18    Romania     6.2  1.0    2.8
## 25 Yugoslavia     4.4  0.6    3.2
##
## $`2`
##        Country RedMeat Fish Fr.Veg
## 2      Austria     8.9  2.1    4.3
## 3      Belgium    13.5  4.5    4.0
## 9       France    18.0  5.7    6.5
## 12     Ireland    13.9  2.2    2.9
## 14 Netherlands     9.5  2.5    3.7
## 21 Switzerland    13.1  2.3    4.9
## 22          UK    17.4  4.3    3.3
## 24   W Germany    11.4  3.4    3.8
##
## $`3`
##           Country RedMeat Fish Fr.Veg
## 5  Czechoslovakia     9.7  2.0    4.0
## 7       E Germany     8.4  5.4    3.6
## 11        Hungary     5.3  0.3    4.2
## 16         Poland     6.9  3.0    6.6
## 23           USSR     9.3  3.0    2.9
##
## $`4`
##    Country RedMeat Fish Fr.Veg
## 6  Denmark    10.6  9.9    2.4
## 8  Finland     9.5  5.8    1.4
## 15  Norway     9.4  9.7    2.7
## 20  Sweden     9.9  7.5    2.0
##
## $`5`
##     Country RedMeat Fish Fr.Veg
## 10   Greece    10.2  5.9    6.5
## 13    Italy     9.0  3.4    6.7
## 17 Portugal     6.2 14.2    7.9
## 19    Spain     7.1  7.0    7.2

cboot_hclust$bootmean                                   # the vector of cluster stabilities
## [1] 0.8090000 0.7939643 0.6247976 0.9366667 0.7815000

cboot_hclust$bootbrd                                    # the count of how many times each
                                                        # cluster was dissolved (by default,
                                                        # over 100 bootstrap iterations)
## [1] 19 14 45  9 30
The clusterboot() results show that the cluster of countries with high fish consumption (cluster 4) is highly stable: the cluster stability is high, and the cluster was dissolved relatively few times. Clusters 1 and 2 are also quite stable; cluster 5 less so (you can see in figure 9.8 that the members of cluster 5 are separated from the other countries, but also fairly separated from each other). Cluster 3 has the characteristics of what we’ve been calling the “other” cluster.
clusterboot() assumes that you know the number of clusters, k. We eyeballed the appropriate k from the dendrogram, but this isn’t always feasible with a large dataset. Can we pick a plausible k in a more automated fashion? We’ll look at this question in the next section.
There are a number of heuristics and rules of thumb for picking clusters; a given heuristic will work better on some datasets than others. It’s best to take advantage of domain knowledge to help set the number of clusters, if that’s possible. Otherwise, try a variety of heuristics, and perhaps a few different values of k.
One simple heuristic is to compute the total within sum of squares (WSS) for different values of k and look for an “elbow” in the curve. We'll walk through the definition of WSS in this section.
Figure 9.10 shows data with four clusters. Define the centroid of each cluster as the point that is the mean value of all the points in the cluster. The centroid will be in the center of the cluster, as shown in the figure. The within sum of squares (or WSS_i) for a single cluster is the summed squared distance of each point in the cluster from the cluster’s centroid. This is shown in the figure for cluster 4.
The total within sum of squares is the sum of the WSS_i of all the clusters. We show the calculation in the following listing.
sqr_edist <- function(x, y) {            # squared Euclidean distance between two vectors
  sum((x - y)^2)
}

wss_cluster <- function(clustermat) {    # WSS of a single cluster
  c0 <- colMeans(clustermat)             # the centroid is the mean of all the points
  sum(apply(clustermat, 1,
            FUN = function(row) { sqr_edist(row, c0) }))   # sum of squared distances
}                                                          # of each point to the centroid

wss_total <- function(dmatrix, labels) { # total WSS of a clustering
  wsstot <- 0
  k <- length(unique(labels))
  for(i in 1:k)
    wsstot <- wsstot + wss_cluster(subset(dmatrix, labels == i))  # add up the WSS
  wsstot                                                          # of each cluster
}

wss_total(pmatrix, groups)
## [1] 71.94342
The total WSS will decrease as the number of clusters increases, because each cluster will be smaller and tighter. The hope is that the rate at which the WSS decreases will slow down for k beyond the optimal number of clusters. In other words, the graph of WSS versus k should flatten out beyond the optimal k, so the optimal k will be at the “elbow” of the graph. Let’s try calculating WSS for up to 10 clusters.
get_wss <- function(dmatrix, max_clusters) {   # get the total WSS for a range of clusterings
  wss = numeric(max_clusters)
  wss[1] <- wss_cluster(dmatrix)               # wss[1] treats all the data as one cluster

  d <- dist(dmatrix, method = "euclidean")
  pfit <- hclust(d, method = "ward.D")         # cluster the data once

  for(k in 2:max_clusters) {                   # for each k, cut the tree into k clusters
    labels <- cutree(pfit, k = k)              # and compute the total WSS
    wss[k] <- wss_total(dmatrix, labels)
  }
  wss
}

kmax <- 10
cluster_meas <- data.frame(nclusters = 1:kmax,
                           wss = get_wss(pmatrix, kmax))
breaks <- 1:kmax

ggplot(cluster_meas, aes(x = nclusters, y = wss)) +   # plot WSS as a function of k
  geom_point() + geom_line() +
  scale_x_continuous(breaks = breaks)
Figure 9.11 shows the plot of WSS as a function of k. Unfortunately, in this case the elbow of the graph is hard to see, although if you squint your eyes you might be able to convince yourself that there is an elbow at k = 2, and another one at k = 5 or 6. This means the best clusterings might be 2 clusters, 5 clusters, or 6 clusters.
The Calinski-Harabasz index is another commonly used measure of cluster goodness. It tries to find the point where all the clusters are tight, and also far apart from each other. To motivate (and calculate) the Calinski-Harabasz index (CH index, for short), we first need to define a few more terms.
As shown in figure 9.12, the total sum of squares (TSS) of a set of points is the sum of the squared distances of all the points from the centroid of the data. In the function get_wss() of listing 9.8, the value wss[1] is the TSS, and it is independent of the clustering. For a given clustering with total within sum of squares, we can also define the between sum of squares (BSS):
BSS = TSS - WSS
BSS measures how far apart the clusters are from each other. A good clustering has a small WSS (all the clusters are tight around their centers) and a large BSS. We can compare how BSS and WSS vary as we vary the number of clusters.
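The identity BSS = TSS - WSS can also be verified directly on a kmeans() fit (kmeans() is discussed later in this chapter), since the fit reports all three quantities as totss, tot.withinss, and betweenss. The toy data below is randomly generated for illustration:

```r
# Check the identity BSS = TSS - WSS on a small k-means fit.
# kmeans() reports totss (TSS), tot.withinss (WSS), and betweenss (BSS).
# The toy data here is made up for illustration.
set.seed(25)
toy <- matrix(rnorm(60), ncol = 2)            # 30 random 2-D points
km <- kmeans(toy, centers = 3, nstart = 10)

all.equal(km$betweenss, km$totss - km$tot.withinss)  # TRUE
```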
total_ss <- function(dmatrix) {          # the total sum of squares of the data
  grandmean <- colMeans(dmatrix)
  sum(apply(dmatrix, 1,
            FUN = function(row) { sqr_edist(row, grandmean) }))
}

tss <- total_ss(pmatrix)
cluster_meas$bss <- with(cluster_meas, tss - wss)

library(cdata)                                      # load the cdata package to reshape the data
cmlong <- unpivot_to_blocks(cluster_meas,           # gather the wss and bss columns
                            nameForNewKeyColumn = "measure",    # into a single
                            nameForNewValueColumn = "value",    # (measure, value) pair
                            columnsToTakeFrom = c("wss", "bss"))

ggplot(cmlong, aes(x = nclusters, y = value)) +
  geom_point() + geom_line() +
  facet_wrap(~measure, ncol = 1, scale = "free_y") +
  scale_x_continuous(breaks = 1:10)
Figure 9.13 shows that as k increases, BSS increases, while WSS decreases. We want a clustering with a good balance of BSS and WSS. To find such a clustering, we have to look at a couple of measures related to the BSS and the WSS.
The within cluster variance W is given by
W = WSS / (n - k)
Here, n is the number of data points and k is the number of clusters. You can think of W as the “average” WSS.
The between cluster variance B is given by
B = BSS / (k - 1)
Again, you can think of B as the average contribution to the BSS from each cluster.
A good clustering should have a small average WSS and a large average BSS, so we might try to maximize the ratio of B to W. This is the Calinski-Harabasz (CH) index. Let’s calculate the CH index and plot it for up to 10 clusters.
cluster_meas$B <- with(cluster_meas, bss / (nclusters - 1))   # the between cluster variance
n <- nrow(pmatrix)
cluster_meas$W <- with(cluster_meas, wss / (n - nclusters))   # the within cluster variance
cluster_meas$ch_crit <- with(cluster_meas, B / W)             # the CH index is the ratio B / W

ggplot(cluster_meas, aes(x = nclusters, y = ch_crit)) +
  geom_point() + geom_line() +
  scale_x_continuous(breaks = 1:kmax)
Looking at figure 9.14, you see that the CH criterion is maximized at k = 2, with another local maximum at k = 5. The k = 2 clustering corresponds to the first split of the protein data dendrogram, as shown in figure 9.15; if you use clusterboot() to do the clustering, you’ll see that the clusters are highly stable, though perhaps not very informative.
There are several other measures that you can try when picking k. The gap statistic[a] is an attempt to automate the “elbow finding” on the WSS curve. It works best when the data comes from a mix of populations that all have approximately Gaussian distributions (called a mixture of Gaussians). We’ll see one more measure, the average silhouette width, when we discuss kmeans().
See Robert Tibshirani, Guenther Walther, and Trevor Hastie, “Estimating the number of clusters in a data set via the gap statistic,” Journal of the Royal Statistical Society B, 2001. 63(2), pp. 411-423; www.stanford.edu/~hastie/Papers/gap.pdf.
K-means is a popular clustering algorithm when the data is all numeric and the distance metric is squared Euclidean (though you could in theory run it with other distance metrics). It’s fairly ad hoc and has the major disadvantage that you must pick k in advance. On the plus side, it’s easy to implement (one reason it’s so popular) and can be faster than hierarchical clustering on large datasets. It works best on data that looks like a mixture of Gaussians, which the protein data unfortunately doesn’t appear to be.
The function to run k-means in R is kmeans(). The output of kmeans() includes the cluster labels, the centers (centroids) of the clusters, the total sum of squares, total WSS, total BSS, and the WSS of each cluster.
The k-means algorithm is illustrated in figure 9.16, with k = 2. This algorithm isn’t guaranteed to have a unique stopping point. K-means can be fairly unstable, in that the final clusters depend on the initial cluster centers. It’s good practice to run k-means several times with different random starts, and then select the clustering with the lowest total WSS. The kmeans() function can do this automatically, though it defaults to using only one random start.
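The assign/update loop of figure 9.16 can be sketched from scratch in a few lines of R. This is purely illustrative (simple_kmeans() is our own toy function with no empty-cluster handling or convergence test); in practice, use R’s kmeans():

```r
# A bare-bones k-means loop: alternate between assigning each point to
# its nearest center and recomputing each center as its cluster's mean.
# Illustrative only; R's built-in kmeans() is the practical choice.
simple_kmeans <- function(x, k, iters = 20) {
  set.seed(5)
  centers <- x[sample(nrow(x), k), , drop = FALSE]  # random initial centers
  for (i in seq_len(iters)) {
    # assign each point to the nearest center (squared Euclidean distance)
    d2 <- sapply(seq_len(k),
                 function(j) colSums((t(x) - centers[j, ])^2))
    labels <- max.col(-d2)
    # recompute each (nonempty) cluster's center as the mean of its points
    for (j in seq_len(k)) {
      if (any(labels == j))
        centers[j, ] <- colMeans(x[labels == j, , drop = FALSE])
    }
  }
  list(cluster = labels, centers = centers)
}
```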
Let’s run kmeans() on the protein data (scaled to 0 mean and unit standard deviation, as before). We’ll use k = 5, as shown in listing 9.11. Note that kmeans() is randomized code, so you may not get exactly the results shown.
kbest_p <- 5
pclusters <- kmeans(pmatrix, kbest_p,              # run kmeans with five clusters,
                    nstart = 100, iter.max = 100)  # 100 random starts, and up to
                                                   # 100 iterations per run

summary(pclusters)                                 # kmeans() returns all the
                                                   # sum-of-squares measures
##              Length Class  Mode
## cluster      25     -none- numeric
## centers      45     -none- numeric
## totss         1     -none- numeric
## withinss      5     -none- numeric
## tot.withinss  1     -none- numeric
## betweenss     1     -none- numeric
## size          5     -none- numeric
## iter          1     -none- numeric
## ifault        1     -none- numeric

pclusters$centers                                  # the centroids of the clusters,
                                                   # in scaled coordinates
##        RedMeat  WhiteMeat        Eggs       Milk       Fish    Cereals
## 1 -0.570049402  0.5803879 -0.08589708 -0.4604938 -0.4537795  0.3181839
## 2 -0.508801956 -1.1088009 -0.41248496 -0.8320414  0.9819154  0.1300253
## 3 -0.807569986 -0.8719354 -1.55330561 -1.0783324 -1.0386379  1.7200335
## 4  0.006572897 -0.2290150  0.19147892  1.3458748  1.1582546 -0.8722721
## 5  1.011180399  0.7421332  0.94084150  0.5700581 -0.2671539 -0.6877583
##       Starch       Nuts      Fr.Veg
## 1  0.7857609 -0.2679180  0.06873983
## 2 -0.1842010  1.3108846  1.62924487
## 3 -1.4234267  0.9961313 -0.64360439
## 4  0.1676780 -0.9553392 -1.11480485
## 5  0.2288743 -0.5083895  0.02161979

pclusters$size                                     # the number of points in each cluster
## [1] 5 4 4 4 8

groups <- pclusters$cluster                        # the vector of cluster labels
cols_to_print <- wrapr::qc(Country, RedMeat, Fish, Fr.Veg)
print_clusters(protein, groups, cols_to_print)
## $`1`
##           Country RedMeat Fish Fr.Veg
## 5  Czechoslovakia     9.7  2.0    4.0
## 7       E Germany     8.4  5.4    3.6
## 11        Hungary     5.3  0.3    4.2
## 16         Poland     6.9  3.0    6.6
## 23           USSR     9.3  3.0    2.9
##
## $`2`
##     Country RedMeat Fish Fr.Veg
## 10   Greece    10.2  5.9    6.5
## 13    Italy     9.0  3.4    6.7
## 17 Portugal     6.2 14.2    7.9
## 19    Spain     7.1  7.0    7.2
##
## $`3`
##       Country RedMeat Fish Fr.Veg
## 1     Albania    10.1  0.2    1.7
## 4    Bulgaria     7.8  1.2    4.2
## 18    Romania     6.2  1.0    2.8
## 25 Yugoslavia     4.4  0.6    3.2
##
## $`4`
##    Country RedMeat Fish Fr.Veg
## 6  Denmark    10.6  9.9    2.4
## 8  Finland     9.5  5.8    1.4
## 15  Norway     9.4  9.7    2.7
## 20  Sweden     9.9  7.5    2.0
##
## $`5`
##        Country RedMeat Fish Fr.Veg
## 2      Austria     8.9  2.1    4.3
## 3      Belgium    13.5  4.5    4.0
## 9       France    18.0  5.7    6.5
## 12     Ireland    13.9  2.2    2.9
## 14 Netherlands     9.5  2.5    3.7
## 21 Switzerland    13.1  2.3    4.9
## 22          UK    17.4  4.3    3.3
## 24   W Germany    11.4  3.4    3.8
To run kmeans(), you must know k. The fpc package (the same package that has clusterboot()) has a function called kmeansruns() that calls kmeans() over a range of k and estimates the best k. It then returns its pick for the best value of k, the output of kmeans() for that value, and a vector of criterion values as a function of k. Currently, kmeansruns() has two criteria: the Calinski-Harabasz index ("ch") and the average silhouette width ("asw"). For either criterion, the maximum value indicates the optimal number of clusters (for more about silhouette clustering, see http://mng.bz/Qe15). It’s a good idea to examine the criterion values over the entire range of k, since you may see evidence for a k that the algorithm didn’t automatically pick. The following listing illustrates this point.
clustering_ch <- kmeansruns(pmatrix, krange = 1:10, criterion = "ch")    # Run kmeansruns() from 1-10 clusters with the CH criterion
clustering_ch$bestk                                                      # The CH criterion picks two clusters
## [1] 2

clustering_asw <- kmeansruns(pmatrix, krange = 1:10, criterion = "asw")  # Run kmeansruns() from 1-10 clusters with the ASW criterion
clustering_asw$bestk                                                     # The ASW criterion picks three clusters
## [1] 3

clustering_asw$crit                                                      # Look at the ASW criterion as a function of k
## [1] 0.0000000 0.3271084 0.3351694 0.2617868 0.2639450 0.2734815 0.2471165
## [8] 0.2429985 0.2412922 0.2388293

clustering_ch$crit                                                       # Look at the CH criterion as a function of k
## [1]  0.000000 14.094814 11.417985 10.418801 10.011797  9.964967  9.861682
## [8]  9.412089  9.166676  9.075569

cluster_meas$ch_crit                                                     # Compare with the CH values computed for the hclust() clusterings in an earlier listing
## [1]       NaN 12.215107 10.359587  9.690891 10.011797  9.964967  9.506978
## [8]  9.092065  8.822406  8.695065

summary(clustering_ch)                                                   # kmeansruns() also returns the output of kmeans() for k = bestk
##              Length Class  Mode
## cluster      25     -none- numeric
## centers      18     -none- numeric
## totss         1     -none- numeric
## withinss      2     -none- numeric
## tot.withinss  1     -none- numeric
## betweenss     1     -none- numeric
## size          2     -none- numeric
## iter          1     -none- numeric
## ifault        1     -none- numeric
## crit         10     -none- numeric
## bestk         1     -none- numeric
The top graph of figure 9.17 compares the results of the two clustering criteria provided by kmeansruns(); both criteria have been rescaled to compatible units. They suggest two to three clusters as the best choice. However, if you compare the values of the (unscaled) CH criterion for the k-means and hierarchical clusterings, as shown in the bottom graph of figure 9.17, you'll see that the CH criterion produces different curves for the kmeans() and hclust() clusterings, but that it takes the same value (which probably means it found the same clusters) at k = 5 and k = 6. This might be taken as evidence that either 5 or 6 is the optimal choice for k.
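A graph like the top panel of figure 9.17 can be sketched as follows. The criterion vectors are hard-coded here from the output of the previous listing so the snippet stands alone; in practice you would use clustering_ch$crit and clustering_asw$crit directly. The use of scale() to put the two criteria into compatible units is our choice of rescaling, not the only possible one.

```r
library(ggplot2)

# CH and ASW criterion values for k = 1..10, as printed in the previous listing
ch_crit  <- c(0, 14.094814, 11.417985, 10.418801, 10.011797,
              9.964967, 9.861682, 9.412089, 9.166676, 9.075569)
asw_crit <- c(0, 0.3271084, 0.3351694, 0.2617868, 0.2639450,
              0.2734815, 0.2471165, 0.2429985, 0.2412922, 0.2388293)

# Rescale each criterion to mean 0 / unit variance so the curves are comparable
critframe <- data.frame(
  k       = rep(1:10, times = 2),
  value   = c(scale(ch_crit), scale(asw_crit)),
  measure = rep(c("ch", "asw"), each = 10)
)

# Plot both criteria as a function of k; the peaks suggest the best k
ggplot(critframe, aes(x = k, y = value, color = measure, shape = measure)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(breaks = 1:10)
```

The maxima of the two curves (at k = 2 for CH and k = 3 for ASW) correspond to the bestk values reported by kmeansruns().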
We can run clusterboot() using the k-means algorithm, as well.
kbest_p <- 5

cboot <- clusterboot(pmatrix, clustermethod = kmeansCBI,   # Run clusterboot() with the k-means algorithm, 100 bootstrap iterations, and k = 5
                     runs = 100, iter.max = 100,
                     krange = kbest_p, seed = 15555)       # Set the random seed to make the run reproducible

groups <- cboot$result$partition
print_clusters(protein, groups, cols_to_print)
## $`1`
##       Country RedMeat Fish Fr.Veg
## 1     Albania    10.1  0.2    1.7
## 4    Bulgaria     7.8  1.2    4.2
## 18    Romania     6.2  1.0    2.8
## 25 Yugoslavia     4.4  0.6    3.2
##
## $`2`
##    Country RedMeat Fish Fr.Veg
## 6  Denmark    10.6  9.9    2.4
## 8  Finland     9.5  5.8    1.4
## 15  Norway     9.4  9.7    2.7
## 20  Sweden     9.9  7.5    2.0
##
## $`3`
##           Country RedMeat Fish Fr.Veg
## 5  Czechoslovakia     9.7  2.0    4.0
## 7       E Germany     8.4  5.4    3.6
## 11        Hungary     5.3  0.3    4.2
## 16         Poland     6.9  3.0    6.6
## 23           USSR     9.3  3.0    2.9
##
## $`4`
##        Country RedMeat Fish Fr.Veg
## 2      Austria     8.9  2.1    4.3
## 3      Belgium    13.5  4.5    4.0
## 9       France    18.0  5.7    6.5
## 12     Ireland    13.9  2.2    2.9
## 14 Netherlands     9.5  2.5    3.7
## 21 Switzerland    13.1  2.3    4.9
## 22          UK    17.4  4.3    3.3
## 24   W Germany    11.4  3.4    3.8
##
## $`5`
##     Country RedMeat Fish Fr.Veg
## 10   Greece    10.2  5.9    6.5
## 13    Italy     9.0  3.4    6.7
## 17 Portugal     6.2 14.2    7.9
## 19    Spain     7.1  7.0    7.2

cboot$bootmean
## [1] 0.8670000 0.8420714 0.6147024 0.7647341 0.7508333

cboot$bootbrd
## [1] 15 20 49 17 32
Note that the stability numbers as given by cboot$bootmean (and the number of times that the clusters were “dissolved” as given by cboot$bootbrd) are different for the hierarchical clustering and k-means, even though the discovered clusters are the same. This shows that the stability of a clustering is partly a function of the clustering algorithm, not just the data. Again, the fact that both clustering algorithms discovered the same clusters might be taken as an indication that 5 is the optimal number of clusters.
Clustering is often used as part of data exploration, or as a precursor to other supervised learning methods. But you may want to use the clusters that you discovered to categorize new data, as well. One common way to do so is to treat the centroid of each cluster as the representative of the cluster as a whole, and then assign new points to the cluster with the nearest centroid. Note that if you scaled the original data before clustering, then you should also scale the new data point the same way before assigning it to a cluster.
Listing 9.14 shows an example of a function that assigns a new data point, newpt (represented as a vector), to a clustering, centers, which is represented as a matrix where each row is a cluster centroid. This is the representation of cluster centroids that kmeans() returns. If the data was scaled using scale() before clustering, then xcenter and xscale are the scaled:center and scaled:scale attributes, respectively.
assign_cluster <- function(newpt, centers, xcenter = 0, xscale = 1) {
  xpt <- (newpt - xcenter) / xscale              # Center and scale the new data point
  dists <- apply(centers, 1, FUN = function(c0) {
    sqr_edist(c0, xpt)                           # Calculate how far the new point is from each cluster centroid
  })
  which.min(dists)                               # Return the cluster number of the nearest centroid
}
Note that the function sqr_edist() (the squared Euclidean distance) was defined previously, in section 9.1.1.
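As a quick refresher, that function simply sums the squared coordinate differences between two vectors:

```r
# Squared Euclidean distance between two vectors x and y
sqr_edist <- function(x, y) {
  sum((x - y)^2)
}

sqr_edist(c(0, 0), c(3, 4))   # returns 25 (the squared length of a 3-4-5 triangle's hypotenuse)
```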
Let’s look at an example of assigning points to clusters, using synthetic data. First, we’ll generate the data.
mean1 <- c(1, 1, 1)        # Set the parameters for three clusters
sd1 <- c(1, 2, 1)

mean2 <- c(10, -3, 5)
sd2 <- c(2, 1, 2)

mean3 <- c(-5, -5, -5)
sd3 <- c(1.5, 2, 1)

library(MASS)              # Use mvrnorm() from the MASS package to generate multivariate normal data
clust1 <- mvrnorm(100, mu = mean1, Sigma = diag(sd1))
clust2 <- mvrnorm(100, mu = mean2, Sigma = diag(sd2))
clust3 <- mvrnorm(100, mu = mean3, Sigma = diag(sd3))
toydata <- rbind(clust3, rbind(clust1, clust2))

tmatrix <- scale(toydata)                   # Scale the synthetic data
tcenter <- attr(tmatrix, "scaled:center")   # Get the scaling attributes, then remove them from the matrix
tscale <- attr(tmatrix, "scaled:scale")
tmatrix <- rm_scales(tmatrix)

kbest_t <- 3
tclusters <- kmeans(tmatrix, kbest_t, nstart = 100, iter.max = 100)   # Cluster the synthetic data into three clusters
tclusters$size                              # The resulting clusters are about the right size
## [1] 101 100 99
Let’s compare the centers of the found k-means clusters to the true cluster centers. To do that, we need to unscale tclusters$centers. The scale() function works by subtracting the center vector, then dividing by the scale vector. So to reverse the process, first “unscale” the scaled matrix, then “uncenter” it.
unscaled <- scale(tclusters$centers, center = FALSE, scale = 1 / tscale)
rm_scales(scale(unscaled, center = -tcenter, scale = FALSE))
##         [,1]      [,2]       [,3]
## 1  9.8234797 -3.005977  4.7662651
## 2 -4.9749654 -4.862436 -5.0577002
## 3  0.8926698  1.185734  0.8336977
Comparing the unscaled centers to mean1, mean2, and mean3 in listing 9.15, we see that center 1 is close to mean2 = (10, -3, 5), center 2 is close to mean3 = (-5, -5, -5), and center 3 is close to mean1 = (1, 1, 1).
So it appears that the discovered clusters are consistent with the true clusters.
Now we can demonstrate assigning new points to the clusters. Let’s generate a point from each of the true clusters, and see which k-means cluster it is assigned to.
assign_cluster(mvrnorm(1, mean1, diag(sd1)),   # Generate a random point from the first true cluster and assign it
               tclusters$centers,
               tcenter, tscale)
## 3
## 3

assign_cluster(mvrnorm(1, mean2, diag(sd2)),   # Generate and assign a random point from the second true cluster
               tclusters$centers,
               tcenter, tscale)
## 1
## 1

assign_cluster(mvrnorm(1, mean3, diag(sd3)),   # Generate and assign a random point from the third true cluster
               tclusters$centers,
               tcenter, tscale)
## 2
## 2
The assign_cluster() function has correctly assigned each point to the appropriate cluster.
At this stage, you have learned how to estimate the appropriate number of clusters for a dataset, how to cluster a dataset using both hierarchical clustering and k-means, and how to evaluate the resulting clusters. Here’s what you should remember about clustering:
Sometimes, rather than looking for subsets of data points that are highly similar to each other, you’d like to know what kinds of data (or which data attributes) tend to occur together. In the next section, we’ll look at one approach to this problem.
Association rule mining is used to find objects or attributes that frequently occur together—for example, products that are often bought together during a shopping session, or queries that tend to occur together during a session on a website’s search engine. Such information can be used to recommend products to shoppers, to place frequently bundled items together on store shelves, or to redesign websites for easier navigation.
Suppose you work in a library. You want to know which books tend to be checked out together, to help you make predictions about book availability.
The unit of “togetherness” when mining association rules is called a transaction. Depending on the problem, a transaction could be a single shopping basket, a single user session on a website, or even a single customer. The objects that comprise a transaction are referred to as items in an itemset: the products in the shopping basket, the pages visited during a website session, the actions of a customer. Sometimes transactions are referred to as baskets, from the shopping basket analogy.
When a library patron checks out a set of books, that’s a transaction; the books that the patron checks out are the itemset that comprise the transaction. Table 9.1 represents a database of transactions (you run a library where fantasy is quite popular).
Mining for association rules occurs in two steps:
Let's look at the transactions that involve the items The Hobbit (H for short) and The Princess Bride (PB for short). The columns of table 9.2 represent transactions; each row marks (with an X) the transactions in which a given itemset appears.
|         | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Total |
|:--------|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:--:|------:|
| H       | X | X | X |   | X |   |   | X |   |    | 5 |
| PB      | X |   |   |   | X |   |   | X |   | X  | 4 |
| {H, PB} | X |   |   |   | X |   |   | X |   |    | 3 |
Looking over all the transactions in table 9.2, you find that The Hobbit appears in 5 of the 10 transactions (50%), The Princess Bride appears in 4 of them (40%), and the two books appear together in 3 transactions (30%).
We’d say the support of the itemset {The Hobbit, The Princess Bride} is 30%.
Of the five transactions that contain The Hobbit, three (60%) also contain The Princess Bride. So you can make a rule: “People who check out The Hobbit also check out The Princess Bride.” This rule should be correct (according to your data) 60% of the time. We’d say that the confidence of the rule is 60%.
Conversely, of the four transactions that contain The Princess Bride, three (75%) also contain The Hobbit. So the rule “People who check out The Princess Bride also check out The Hobbit” has 75% confidence.
Let's formally define rules, support, and confidence.
The rule “if X, then Y” means that every time you see the itemset X in a transaction, you expect to also see Y (with a given confidence). For the apriori algorithm (which we’ll look at in this section), Y is always an itemset with one item.
Suppose that your database of transactions is called T, and X is an itemset. Then support(X) is the number of transactions that contain X divided by the total number of transactions in T.
The confidence of the rule “if X, then Y” gives the fraction or percentage of the time that a rule is true, relative to how often you see X. In other words, if support(X) is how often the itemset X occurs in a transaction, and support({X, Y}) is how often both itemsets X and Y occur in a transaction, then the confidence of the rule “if X, then Y” is support({X, Y})/support(X).
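These definitions are easy to check directly. Here's a small base-R sketch using a toy encoding of the library transactions; the exact placement of checkouts and the filler titles are invented, but the totals match table 9.2 (The Hobbit in 5 of 10 transactions, The Princess Bride in 4, and both together in 3):

```r
# Ten toy transactions; titles other than "H" and "PB" are hypothetical filler
transactions <- list(
  c("H", "PB"), c("H"), c("H"), c("Dune"), c("H", "PB"),
  c("Dune"), c("Emma"), c("H", "PB", "Dune"), c("Emma"), c("PB")
)

# support(X): fraction of transactions that contain every item in the itemset X
support <- function(X, trans) {
  mean(vapply(trans, function(t) all(X %in% t), logical(1)))
}

# confidence of "if X, then Y": support(X and Y together) / support(X)
confidence <- function(X, Y, trans) {
  support(c(X, Y), trans) / support(X, trans)
}

support(c("H", "PB"), transactions)   # 0.3: both books in 3 of 10 transactions
confidence("H", "PB", transactions)   # 0.6: the rule "if H, then PB"
confidence("PB", "H", transactions)   # 0.75: the rule "if PB, then H"
```

This is, of course, only for intuition; the apriori implementation in the arules package computes these quantities efficiently over large transaction databases.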
The goal in association rule mining is to find all the interesting rules in the database with at least a given minimum support (say, 10%) and a minimum given confidence (say, 60%).
Suppose you work for a bookstore, and you want to recommend books that a customer might be interested in, based on all of their previous purchases and book interests. You want to use historical book interest information to develop some recommendation rules.
You can get information about customers’ book interests two ways: either they’ve purchased a book from you, or they’ve rated a book on your website (even if they bought the book somewhere else). In this case, a transaction is a customer, and an itemset is all the books that they’ve expressed an interest in, either by purchase or by rating.
The data that you’ll use is based on data collected in 2004 from the book community Book-Crossing[1] for research conducted at the Institut für Informatik, University of Freiburg.[2] The information is condensed into a single tab-separated text file called bookdata.tsv. Each row of the file consists of a user ID, a book title (which has been designated as a unique ID for each book), and the rating (which you won’t actually use in this example):
The original data repository can be found at http://mng.bz/2052. Since some artifacts in the original files caused errors when reading into R, we’re providing copies of the data as a prepared RData object: https://github.com/WinVector/PDSwR2/blob/master/Bookdata/bxBooks.RData. The prepared version of the data that we’ll use in this section is at https://github.com/WinVector/PDSwR2/blob/master/Bookdata/bookdata.tsv.gz. Further information and scripts for preparing the data can be found at https://github.com/WinVector/PDSwR2/tree/master/Bookdata.
The researchers’ original paper is “Improving Recommendation Lists Through Topic Diversification,” Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, Georg Lausen; Proceedings of the 14th International World Wide Web Conference (WWW ’05), May 10-14, 2005, Chiba, Japan. It can be found online at http://mng.bz/7trR.
| token                 | userid | rating | title                 |
|:----------------------|-------:|-------:|:----------------------|
| always have popsicles | 172742 |      0 | Always Have Popsicles |
The token column contains lowercased title strings; the tokens were used to identify books with different ISBNs (the original book IDs) that had the same title except for casing. The title column holds properly capitalized title strings; these are unique per book, so for this example you will use them as book IDs.
In this format, the transaction (customer) information is diffused through the data, rather than being all in one row; this reflects the way the data would naturally be stored in a database, since the customer’s activity would be diffused throughout time. Books generally come in different editions or from different publishers. For this example, we’ve condensed all different versions into a single item; hence, different copies or printings of Little Women will all map to the same item ID in our data (namely, the title “Little Women”).
The original data includes approximately a million ratings of 271,379 books from 278,858 readers. Our data will have fewer books due to the mapping that we discussed earlier.
Now you’re ready to mine.
You’ll use the package arules for association rule mining. arules includes an implementation of the popular association rule algorithm apriori, as well as implementations to read in and examine transaction data.[1] The package uses special data types to hold and manipulate the data; you’ll explore these data types as you work the example.
For a more comprehensive introduction to arules than we can give in this chapter, please see Hahsler, Grün, Hornik, and Buchta, “Introduction to arules—A computational environment for mining association rules and frequent item sets,” online at cran.r-project.org/web/packages/arules/vignettes/arules.pdf.
You can read the data directly from the bookdata.tsv.gz file into the object bookbaskets using the function read.transactions().
library(arules)                                      # Load the arules package
bookbaskets <- read.transactions("bookdata.tsv.gz",
                                 format = "single",  # Specify the file and the file format
                                 header = TRUE,      # The file has a header row
                                 sep = "\t",         # Specify the column separator (a tab)
                                 cols = c("userid", "title"),   # Specify the columns that hold the transaction IDs and the item IDs, respectively
                                 rm.duplicates = TRUE)          # Tell the function to look for and remove duplicate entries
The read.transactions() function reads data in two formats: the format where every row corresponds to a single item (like bookdata.tsv.gz), and a format where each row corresponds to a single transaction, possibly with a transaction ID, like table 9.1. To read data in the first format, use the argument format = "single"; to read data in the second format, use the argument format = "basket".
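As a sketch of the second format (the file contents here are invented for illustration), each line of a basket-format file is one complete transaction, with the items separated by the sep character:

```r
library(arules)

# Hypothetical basket-format data: one transaction per line, comma-separated items
basket_file <- tempfile(fileext = ".txt")
writeLines(c("The Hobbit,The Princess Bride",
             "The Hobbit",
             "The Princess Bride,The Time Machine"),
           basket_file)

toybaskets <- read.transactions(basket_file, format = "basket", sep = ",")
toybaskets   # three transactions over three distinct items
```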
It sometimes happens that a reader will buy one edition of a book and then later add a rating for that book under a different edition. Because of the way we’re representing books for this example, these two actions will result in duplicate entries. The rm.duplicates = TRUE argument will eliminate them. It will also output some (not too useful) diagnostics about the duplicates.
Once you’ve read in the data, you can inspect the resulting object.
Transactions are represented as a special object called transactions. You can think of a transactions object as a 0/1 matrix, with one row for every transaction (in this example, a customer) and one column for every possible item (in this example, a book). The matrix entry (i, j) is 1 if the ith transaction contains item j, or if customer i has expressed an interest in book j. There are a number of calls you can use to examine the transaction data, as the next listing shows.
class(bookbaskets)           # The object is of class transactions
## [1] "transactions"
## attr(,"package")
## [1] "arules"

bookbaskets                  # Printing the object tells you its dimensions
## transactions in sparse format with
## 92108 transactions (rows) and
## 220447 items (columns)

dim(bookbaskets)             # You can also use dim() to see the dimensions of the matrix
## [1]  92108 220447

colnames(bookbaskets)[1:5]   # The columns are labeled by book title
## [1] " A Light in the Storm:[...]"
## [2] " Always Have Popsicles"
## [3] " Apple Magic"
## [4] " Ask Lily"
## [5] " Beyond IBM: Leadership Marketing and Finance for the 1990s"

rownames(bookbaskets)[1:5]   # The rows are labeled by customer ID
## [1] "10"     "1000"   "100001" "100002" "100004"
You can examine the distribution of transaction sizes (or basket sizes) with the function size():
basketSizes <- size(bookbaskets)
summary(basketSizes)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     1.0     1.0     1.0    11.1     4.0 10250.0
Most customers (at least half of them, in fact) only expressed interest in one book. But someone has expressed interest in more than 10,000! You probably want to look more closely at the size distribution to see what’s going on.
quantile(basketSizes, probs = seq(0, 1, 0.1))   # Look at the basket size distribution, in 10% increments
##    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100%
##     1     1     1     1     1     1     2     3     5    13 10253

library(ggplot2)                                # Plot the distribution to get a better look
ggplot(data.frame(count = basketSizes)) +
  geom_density(aes(x = count)) +
  scale_x_log10()
Figure 9.18 shows the distribution of basket sizes. 90% of customers expressed interest in fewer than 15 books; most of the remaining customers expressed interest in up to about 100 books or so; the call quantile(basketSizes, probs = c(0.99, 1)) will show you that 99% of customers expressed interest in 179 books or fewer. Still, a few people have expressed interest in several hundred, or even several thousand books.
Which books are they reading? The function itemFrequency() can tell you how often each book shows up in the transaction data.
bookCount <- itemFrequency(bookbaskets, "absolute")
summary(bookCount)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
##    1.000    1.000    1.000    4.638    3.000 2502.000
You can also find the 10 most frequently occurring books.
orderedBooks <- sort(bookCount, decreasing = TRUE)   # Sort the counts in decreasing order
knitr::kable(orderedBooks[1:10])                     # Display the 10 most frequently occurring books
# |                                                |    x|
# |:-----------------------------------------------|----:|
# |Wild Animus                                     | 2502|
# |The Lovely Bones: A Novel                       | 1295|
# |She's Come Undone                               |  934|
# |The Da Vinci Code                               |  905|
# |Harry Potter and the Sorcerer's Stone           |  832|
# |The Nanny Diaries: A Novel                      |  821|
# |A Painted House                                 |  819|
# |Bridget Jones's Diary                           |  772|
# |The Secret Life of Bees                         |  762|
# |Divine Secrets of the Ya-Ya Sisterhood: A Novel |  737|

orderedBooks[1] / nrow(bookbaskets)                  # The most popular book occurred in fewer than 3% of the baskets
## Wild Animus
##  0.02716376
The last observation in the preceding listing highlights one of the issues with mining high-dimensional data: when you have thousands of variables, or thousands of items, almost every event is rare. Keep this point in mind when deciding on support thresholds for rule mining; your thresholds will often need to be quite low.
Before we get to the rule mining, let’s refine the data a bit more. As you observed earlier, half of the customers in the data only expressed interest in a single book. Since you want to find books that occur together in people’s interest lists, you can’t make any direct use of people who haven’t yet shown interest in multiple books. You can restrict the dataset to customers who have expressed interest in at least two books:
bookbaskets_use <- bookbaskets[basketSizes > 1]
dim(bookbaskets_use)
## [1]  40822 220447
Now you’re ready to look for association rules.
In order to mine rules, you need to decide on a minimum support level and a minimum confidence level. For this example, let’s try restricting the itemsets that we’ll consider to those with a minimum support of 0.2%, or 0.002. This corresponds to itemsets that appear at least 0.002 * nrow(bookbaskets_use) times, which is about 82 transactions. We’ll use a confidence threshold of 75%.
rules <- apriori(bookbaskets_use,
                 parameter = list(support = 0.002,    # Call apriori() with a minimum support of 0.002 and a minimum confidence of 0.75
                                  confidence = 0.75))
summary(rules)
## set of 191 rules                                   # The number of rules discovered
##
## rule length distribution (lhs + rhs):sizes         # The distribution of rule lengths: a rule of length 3 has two items on the left side and one on the right
##   2   3   4   5
##  11 100  66  14
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   2.000   3.000   3.000   3.435   4.000   5.000
##
## summary of quality measures:                       # A summary of the rule quality measures, including support and confidence
##     support           confidence          lift            count
##  Min.   :0.002009   Min.   :0.7500   Min.   : 40.89   Min.   : 82.0
##  1st Qu.:0.002131   1st Qu.:0.8113   1st Qu.: 86.44   1st Qu.: 87.0
##  Median :0.002278   Median :0.8468   Median :131.36   Median : 93.0
##  Mean   :0.002593   Mean   :0.8569   Mean   :129.68   Mean   :105.8
##  3rd Qu.:0.002695   3rd Qu.:0.9065   3rd Qu.:158.77   3rd Qu.:110.0
##  Max.   :0.005830   Max.   :0.9882   Max.   :321.89   Max.   :238.0
##
## mining info:                                       # Some information on how apriori() was called
##            data ntransactions support confidence
## bookbaskets_use         40822   0.002       0.75
The quality measures on the rules include a rule’s support and confidence, the support count (how many transactions the rule applied to), and a quantity called lift. Lift compares the frequency of an observed pattern with how often you’d expect to see that pattern just by chance. The lift of a rule “if X, then Y” is given by support({X, Y}) / (support(X) * support(Y)). If the lift is near 1, then there’s a good chance that the pattern you observed is occurring just by chance. The larger the lift, the more likely that the pattern is “real.” In this case, all the discovered rules have a lift of at least 40, so they’re likely to be real patterns in customer behavior.
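To make the lift formula concrete, here's the calculation for the library checkout example from earlier in the chapter, where support(H) = 0.5, support(PB) = 0.4, and support({H, PB}) = 0.3:

```r
support_H    <- 0.5   # The Hobbit appears in 5 of 10 transactions
support_PB   <- 0.4   # The Princess Bride appears in 4 of 10
support_both <- 0.3   # Both books appear together in 3 of 10

# lift("if H, then PB") = support({H, PB}) / (support(H) * support(PB))
lift <- support_both / (support_H * support_PB)
lift
## [1] 1.5
```

A lift of 1.5 means the pair occurs 1.5 times more often than you'd expect if the two books were checked out independently; nowhere near the lifts of 40 and up seen in the book data, so with only 10 transactions this pattern could easily be chance.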
There are also other metrics and interest measures you can use to evaluate the rules by using the function interestMeasure(). We’ll look at two of these measures: coverage and fishersExactTest. Coverage is the support of the left side of the rule (X); it tells you how often the rule would be applied in the dataset. Fisher’s exact test is a significance test for whether an observed pattern is real or chance (the same thing lift measures; Fisher’s test is more formal). Fisher’s exact test returns the p-value, or the probability that you would see the observed pattern by chance; you want the p-value to be small.
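Fisher's exact test is also available in base R as fisher.test(), which operates on a 2×2 contingency table. As a toy illustration (an aside, not part of the arules workflow), here it is applied to the counts from the library example: 3 transactions with both books, 1 with only The Princess Bride, 2 with only The Hobbit, and 4 with neither:

```r
# 2x2 contingency table: rows = contains The Princess Bride?, columns = contains The Hobbit?
ctab <- matrix(c(3, 1,
                 2, 4),
               nrow = 2, byrow = TRUE,
               dimnames = list(PB = c("yes", "no"),
                               H  = c("yes", "no")))

fisher.test(ctab)$p.value   # with only 10 transactions, expect a large (non-significant) p-value
```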
measures <- interestMeasure(rules,                    # The first argument to interestMeasure() is the discovered rules
                            measure = c("coverage", "fishersExactTest"),   # The second argument is a list of interest measures to apply
                            transactions = bookbaskets_use)   # The last argument is a dataset to evaluate the measures over, usually the one used to mine the rules
summary(measures)
##     coverage        fishersExactTest
##  Min.   :0.002082   Min.   : 0.000e+00
##  1st Qu.:0.002511   1st Qu.: 0.000e+00
##  Median :0.002719   Median : 0.000e+00
##  Mean   :0.003039   Mean   :5.080e-138
##  3rd Qu.:0.003160   3rd Qu.: 0.000e+00
##  Max.   :0.006982   Max.   :9.702e-136
The coverage of the discovered rules ranges from 0.002–0.007, equivalent to a range of about 82–286 people. All the p-values from Fisher’s test are small, so it’s likely that the rules reflect actual customer behavior patterns.
You can also call interestMeasure() with the methods support, confidence, and lift, among others. This would be useful in our example if you wanted to get support, confidence, and lift estimates for the full dataset bookbaskets, rather than the filtered dataset bookbaskets_use—or for a subset of the data, for instance, only customers from the United States.
The function inspect() pretty-prints the rules. The function sort() allows you to sort the rules by a quality or interest measure, like confidence. To print the five most confident rules in the dataset, you could use the following statement, which we will expand out using pipe notation.
library(magrittr)                  # Load magrittr to get the pipe notation
rules %>%
  sort(., by = "confidence") %>%   # Sort the rules by confidence
  head(., n = 5) %>%               # Take the first five rules
  inspect(.)                       # Call inspect() to pretty-print the rules
For legibility, we show the output of this command in table 9.3.
| Left side | Right side | Support | Confidence | Lift | Count |
|:----------|:-----------|--------:|-----------:|-----:|------:|
| Four to Score, High Five, Seven Up, Two for the Dough | Three to Get Deadly | 0.002 | 0.988 | 165 | 84 |
| Harry Potter and the Order of the Phoenix, Harry Potter and the Prisoner of Azkaban, Harry Potter and the Sorcerer's Stone | Harry Potter and the Chamber of Secrets | 0.003 | 0.966 | 73 | 117 |
| Four to Score, High Five, One for the Money, Two for the Dough | Three to Get Deadly | 0.002 | 0.966 | 162 | 85 |
| Four to Score, Seven Up, Three to Get Deadly, Two for the Dough | High Five | 0.002 | 0.966 | 181 | 84 |
| High Five, Seven Up, Three to Get Deadly, Two for the Dough | Four to Score | 0.002 | 0.966 | 168 | 84 |
There are two things to notice in table 9.3. First, the rules concern books that come in series: the numbered series of novels about bounty hunter Stephanie Plum, and the Harry Potter series. So these rules essentially say that if a reader has read four Stephanie Plum or three Harry Potter books, they’re almost sure to buy another one.
The second thing to notice is that rules 1, 4, and 5 are permutations of the same itemset. This is likely to happen when the rules get long.
You can restrict which items appear in the left side or right side of a rule. Suppose you’re interested specifically in books that tend to co-occur with the novel The Lovely Bones. You can do this by restricting which books appear on the right side of the rule, using the appearance parameter.
brules <- apriori(bookbaskets_use,
                  parameter = list(support = 0.001,   # Relax the minimum support to 0.001 and the minimum confidence to 0.6
                                   confidence = 0.6),
                  appearance = list(rhs = c("The Lovely Bones: A Novel"),   # Only The Lovely Bones is allowed to appear on the right side of the rules
                                    default = "lhs"))                       # By default, all the books can go into the left side of the rules
summary(brules)
## set of 46 rules
##
## rule length distribution (lhs + rhs):sizes
##  3  4
## 44  2
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   3.000   3.000   3.000   3.043   3.000   4.000
##
## summary of quality measures:
##     support           confidence          lift           count
##  Min.   :0.001004   Min.   :0.6000   Min.   :21.81   Min.   :41.00
##  1st Qu.:0.001029   1st Qu.:0.6118   1st Qu.:22.24   1st Qu.:42.00
##  Median :0.001102   Median :0.6258   Median :22.75   Median :45.00
##  Mean   :0.001132   Mean   :0.6365   Mean   :23.14   Mean   :46.22
##  3rd Qu.:0.001219   3rd Qu.:0.6457   3rd Qu.:23.47   3rd Qu.:49.75
##  Max.   :0.001396   Max.   :0.7455   Max.   :27.10   Max.   :57.00
##
## mining info:
##            data ntransactions support confidence
## bookbaskets_use         40822   0.001        0.6
The supports, confidences, counts, and lifts are lower than they were in our previous example, but the lifts are still much greater than one, so it’s likely that the rules reflect real customer behavior patterns.
Let’s inspect the rules, sorted by confidence. Since they’ll all have the same right side, you can use the lhs() function to only look at the left sides.
brules %>%
  sort(., by = "confidence") %>%
  lhs(.) %>%                        # Get the left sides of the sorted rules
  head(., n = 5) %>%
  inspect(.)
##   items
## 1 {Divine Secrets of the Ya-Ya Sisterhood: A Novel,
##    Lucky : A Memoir}
## 2 {Lucky : A Memoir,
##    The Notebook}
## 3 {Lucky : A Memoir,
##    Wild Animus}
## 4 {Midwives: A Novel,
##    Wicked: The Life and Times of the Wicked Witch of the West}
## 5 {Lucky : A Memoir,
##    Summer Sisters}
Note that four of the five most confident rules include Lucky: A Memoir in the left side, which perhaps isn’t surprising, since Lucky was written by the author of The Lovely Bones. Suppose you want to find out about works by other authors that are interesting to people who showed interest in The Lovely Bones; you can use subset() to filter down to only rules that don’t include Lucky.
brulesSub <- subset(brules, subset = !(lhs %in% "Lucky : A Memoir"))   # Restrict to the subset of rules where Lucky is not in the left side
brulesSub %>%
  sort(., by = "confidence") %>%
  lhs(.) %>%
  head(., n = 5) %>%
  inspect(.)
##   items
## 1 {Midwives: A Novel,
##    Wicked: The Life and Times of the Wicked Witch of the West}
## 2 {She's Come Undone,
##    The Secret Life of Bees,
##    Wild Animus}
## 3 {A Walk to Remember,
##    The Nanny Diaries: A Novel}
## 4 {Beloved,
##    The Red Tent}
## 5 {The Da Vinci Code,
##    The Reader}
These examples show that association rule mining is often highly interactive. To get interesting rules, you must often set the support and confidence levels fairly low; as a result, you can get many, many rules. Some rules will be more interesting or surprising to you than others; to find them requires sorting the rules by different interest measures, or perhaps restricting yourself to specific subsets of rules.
You've now walked through an example of using association rules to explore common patterns in purchase data. Here’s what you should remember about association rules:
In this chapter, you’ve learned how to find similarities in data using two different clustering methods in R, and how to find items that tend to occur together in data using association rules. You’ve also learned how to evaluate your discovered clusters and your discovered rules.
Unsupervised methods like the ones we’ve covered in this chapter are really more exploratory in nature. Unlike with supervised methods, there’s no “ground truth” to evaluate your findings against. But the findings from unsupervised methods can be the starting point for more-focused experiments and modeling.
In the last few chapters, we’ve covered the most basic modeling and data analysis techniques; they’re all good first approaches to consider when you’re starting a new project. In the next chapter, we’ll touch on a few more-advanced methods.
In this chapter you have learned