By the end of this chapter, you will be able to:

- Distinguish between supervised and unsupervised learning
- Perform bottom-up (agglomerative) and top-down (divisive) hierarchical clustering
- Distinguish between monothetic and polythetic clustering techniques
- Implement and evaluate k-means clustering

In this chapter, we aim to equip you with a practical understanding of unsupervised learning.
In this chapter, we will look at the implementation of unsupervised learning. We will explore different ways of clustering, namely bottom-up (or agglomerative) and top-down (or divisive). We will also look at the distinction between monothetic and polythetic hierarchical clustering, and delve deeper into the implementation of k-means, a popular clustering technique.
Before we go into the details of the chapter, let's take a brief look at an overview of machine learning. Machine learning, in general, can be divided into three distinct groups, namely reinforcement learning, supervised learning, and unsupervised learning, as shown in Figure 6.1. There is also one more category, semi-supervised learning, which falls between supervised learning and unsupervised learning. The most widely used learning techniques are supervised and unsupervised learning.
Reinforcement learning is a category of machine learning that focuses on training an agent to take actions in an environment. The agent is trained through a set of rewards and penalties. Reinforcement learning is popular in areas such as gaming and robotics.
In supervised learning, a model is provided with examples of each class. The model is trained to learn the best mapping of the input to the output in tasks such as classification and regression.
In this chapter, we will look at unsupervised learning. The main distinction vis-à-vis supervised learning is that the training examples are unlabeled. In supervised learning, we start off knowing the groups or categories that each data element belongs to. In unsupervised learning, the goal is to discover these categories. One real-life example, described in Altuncu et al. (2018), is the grouping of related news articles. This form of document clustering is applicable in many areas where manual categorization is impractical.
Unsupervised learning is a subcategory of machine learning that learns or trains using unlabeled data. In other words, as opposed to supervised learning, where the model is expected to predict or categorize data into a set of known classes, unsupervised learning establishes the structure within data to create the categories or groups.
Before delving into unsupervised learning and, specifically, clustering, there are a few questions you need to answer:
Answering these questions will help us to create a better clustering model. There are different ways to perform clustering. Let's now look at three ways of classifying clustering techniques:
Let's examine each of these clustering techniques in detail.
In hard clustering, a datapoint can only belong to one of the groups. That is, if it is a description of a dog, it can't also be a description of a horse. In soft clustering, members can belong to multiple subgroups with varying degrees of strength. For instance, an emoji can be judged as part sad and part angry. One soft clustering technique is the Expectation Maximization (EM) algorithm. EM is a probabilistic, generative model that uses joint probability and Bayes' theorem.
The following steps, 1 to 5, explain the workings of the EM algorithm:

1. Decide on the number of distributions (clusters), k.
2. Initialize the parameters (the mean and variance) of each distribution.
3. For each datapoint, calculate the probability that it belongs to each distribution.
4. Adjust the mean and variance of each distribution based on the memberships calculated in step 3.
5. Repeat steps 3 and 4 until the parameters stop changing (convergence).
EM is regarded as a soft clustering technique, since a point can belong to different clusters with different probabilities. In step 3, we establish to what extent each datapoint belongs to each distribution (for instance, x1 belongs to the first distribution with a high probability, and to the second distribution with a lower probability). In step 4, we adjust the means and variances based on the memberships predicted in step 3. That is, the distribution mean becomes the weighted mean of the datapoints belonging to that distribution: the "amount" that x1 is part of the first distribution (b1) multiplied by its value (b1x1), plus the amount that x2 is part of the first distribution multiplied by its value (b2x2), and so on, divided by the sum of the weights (b1+b2+...). Steps 3 and 4 are repeated until the parameters stop changing (convergence).
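To make steps 3 and 4 concrete, here is a minimal EM sketch in R for a two-component, one-dimensional Gaussian mixture. This is an illustration only: the simulated data, starting values, and component count are assumptions, and packages such as mclust implement the full algorithm properly.

set.seed(42)
x <- c(rnorm(100, mean = 0), rnorm(100, mean = 5)) # simulated data
mu <- c(min(x), max(x)) # step 2: initialize the means...
s2 <- c(1, 1)           # ...the variances...
w  <- c(0.5, 0.5)       # ...and the mixing weights
for (iter in 1:50) {
  # Step 3 (E-step): membership probabilities via Bayes' theorem
  d1 <- w[1] * dnorm(x, mu[1], sqrt(s2[1]))
  d2 <- w[2] * dnorm(x, mu[2], sqrt(s2[2]))
  b  <- d1 / (d1 + d2) # the "amount" each point belongs to distribution 1
  # Step 4 (M-step): weighted means and variances, as described above
  mu <- c(sum(b * x) / sum(b), sum((1 - b) * x) / sum(1 - b))
  s2 <- c(sum(b * (x - mu[1])^2) / sum(b),
          sum((1 - b) * (x - mu[2])^2) / sum(1 - b))
  w  <- c(mean(b), 1 - mean(b))
}
round(mu, 2) # estimated component means, roughly 0 and 5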
Hierarchical clustering produces different levels of abstraction in a taxonomic fashion, unlike flat clustering, where all members are on the same level. Hierarchical clustering aims to create a dendrogram where the top cluster is a superset containing all the elements. If the top level were animal, the second level could contain the different animal groups (fish, mammals, birds, and so on), and the third level could be the specific types of fish, mammals, and birds:
In Figure 6.3, the vertical axis (y) represents dissimilarity within the cluster. As you go up, the cluster members become more dissimilar. The x-axis represents the objects that are recursively being grouped.
Hierarchical clustering can be executed top-down or bottom-up. As the name suggests, for top-down clustering, we subdivide bigger clusters into successively smaller clusters. Bottom-up clustering involves grouping smaller clusters into bigger ones as we move up the dendrogram. An example of top-down clustering is recursive k-means.
In recursive k-means, we perform several iterations. In the first iteration, we might split the data into two parts. In the second iteration, each of the two clusters is further split into other clusters. We continue iteratively splitting the clusters into smaller sub-clusters until the sub-clusters only contain elements of a single class, as shown in Figure 6.4. As you can see, by Iteration 2, the clusters contain similar elements out of the entire data provided:
Top-down clustering is fast but has the disadvantage of not allowing the traversal of boundaries set at higher levels. On the other hand, bottom-up clustering is slow but has the advantage of being more likely to group similar elements. In bottom-up clustering, each point is regarded as a cluster, as shown in Figure 6.5. In the first iteration, we find a pair of clusters that are similar and merge them into one new cluster:
Figure 6.5 shows a cluster with a pair of points, a and b. We then recursively continue the process until we have only one cluster. We can also set a threshold, where we "cut" the dendrogram at a certain threshold, as shown in Figure 6.6:
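In R, this kind of cut can be sketched with base functions; the snippet below assumes the Euclidean distance and base R's hclust(), whereas the exercises in this chapter use the cluster package's implementations instead.

hc <- hclust(dist(USArrests))  # bottom-up clustering of a built-in dataset
groups <- cutree(hc, h = 150)  # "cut" the dendrogram; the height is arbitrary
table(groups)                  # cluster sizes below the cut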
An example of bottom-up clustering is agglomerative clustering, also known as AGNES, which forms clusters based on similarity. This topic will be covered in detail in Exercise 64, Agglomerative Clustering Using AGNES.
In monothetic clustering, the membership of a cluster is determined by the absence or presence of a single attribute. In polythetic clustering, multiple variables are used to determine the memberships.
We will use R's mona() function, which stands for monothetic analysis. It is monothetic and hierarchical. At each step, a split is made based on a single variable. mona() is a divisive clustering technique and requires the variables to be binary, that is, either zero or one.
For more details of the mona algorithm, please refer to Struyf, Anja, Mia Hubert, and Peter Rousseeuw. "Clustering in an Object-Oriented Environment." Journal of Statistical Software 1.4 (1997): 1-30.
In the following exercise, we are going to perform clustering on a dataset that has binary values for features.
In this exercise, we'll perform monothetic and hierarchical clustering. To start with, we will use the animals dataset that ships with R's cluster package. This dataset contains a list of 20 animals with six binary attributes, such as being warm-blooded, being able to fly, and being endangered.
library(cluster)
data(animals)
apply(animals, 2, table) # simple overview
The output is as follows:
From the preceding results, of the 20 observations, 10 are warm-blooded (war) and 10 are not. 4 animals can fly and 16 cannot. 14 animals are vertebrates (ver) and 6 are not. 6 animals are endangered (end), 12 are not, and 2 are unknown (NA). 11 live on the ground (gro), 6 do not, and 3 are unknown (NA). 9 have hair (hai), while 11 do not.
ma <- mona(animals)
ma
The output is as follows:
Figure 6.8 shows the data used by mona(), in binary (0 or 1) form. mona() also outputs details of the clustering, such as the variable used at each separation step:
To read up on the banner, refer to Struyf, Anja, Mia Hubert, and Peter Rousseeuw. "Clustering in an object-oriented environment." Journal of Statistical Software 1.4 (1997): 1-30.
plot(ma)
The plot is as follows:
To interpret the preceding graph, we start with all the elements in one group. At step 1, we place all warm-blooded animals on one side (below sal on the y-axis) and all the rest on the other side (above and including sal). At step 2, the warm-blooded animals are split by whether they can fly, while the cold-blooded ones are split by whether they are vertebrates. The process continues, recursively, until we can no longer effect a split.
DIANA stands for DIvisive ANAlysis clustering. It is a hierarchical clustering technique that starts with one cluster and successively divides the clusters until each cluster contains just a single element. At each step, we select the cluster with the most dissimilar elements. The most dissimilar element within that cluster is then used as the starting point for a new cluster, consequently splitting the original cluster into two. The divisive coefficient measures the amount of clustering structure found.
For more information on DIANA, please refer to Kaufman, Leonard, and Peter J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Hoboken: Wiley (1990).
In the next exercise, we will cover hierarchical clustering using DIANA.
In this exercise, we will perform hierarchical clustering using DIANA. We will be using the built-in agriculture dataset (European Union agricultural workforces). This contains the GNP (Gross National Product per capita) and the percentage of the population working in agriculture for each EU country.
library(cluster)
data(agriculture)
agriculture
The output is as follows:
##
## x y
## B 16.8 2.7
## DK 21.3 5.7
## D 18.7 3.5
## GR 5.9 22.2
## E 11.4 10.9
## F 17.8 6.0
## IRL 10.9 14.0
## I 16.6 8.5
## L 21.0 3.5
## NL 16.4 4.3
## P 7.8 17.4
## UK 14.0 2.3
dv <- diana(agriculture, metric = "manhattan", stand = TRUE)
print(dv)
The output is as follows:
The order of objects, the heights, the divisive coefficient, and the available components are as follows:
The diana() function creates clusters based on the chosen metric, as shown in the preceding output.
plot(dv)
The banner plot will appear as follows:
Figure 6.13 shows the splitting banner from the DIANA algorithm. Imagine that you are starting off with all the countries in one group on the left, and the goal is to divide the group into smaller parts until you have the elements on the right. We split the group into two groups at a height of 7.34 (the height is the diameter of the cluster prior to splitting). The "F" group contains F, D, L, DK, I, UK, NL, and B. The "GR" group contains GR, P, E, and IRL. The "GR" group is split again at a height of ~3.5, and we now have a "P" group and an "E" group. The P group is split again at a height of ~1.3 into individual P and GR elements.
This dendrogram communicates the same information as the banner. Starting from the top, we split the cluster into two: one group with B, NL, UK, I, DK, L, D, and F, while the second group contains everything else. Each of the groups is then split into two as we go down, and the elements become increasingly similar.
str(as.dendrogram(dv))
The output is as follows:
The str() representation of the dendrogram presents the exact splitting points. Notice, for example, the first split at height 7.34, and the next one at height 3.54.
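The split heights are also stored on the diana object itself, in its height component; a quick sketch:

sort(dv$height, decreasing = TRUE) # heights of the splits, first split first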
AGNES stands for AGglomerative NESting and performs bottom-up, or agglomerative, hierarchical clustering on a given dataset. Using agnes(), you can obtain the agglomerative coefficient, which denotes the amount of clustering structure found in the dataset. Instead of splitting heterogeneous clusters, we now merge similar clusters.
For this exercise, we will use the built-in R dataset votes.repub, which contains the state-wise percentage of Republican votes in 31 US presidential elections.
library(cluster)
data(votes.repub)
agn <- agnes(votes.repub)
agn
The output is as follows:
##
## Call: agnes(x = votes.repub)
## Agglomerative coefficient: 0.7688431
## Order of objects:
## [1]  Alabama        Georgia        Texas          Louisiana      Arkansas       Florida
## [7]  Mississippi    South Carolina Alaska         Michigan       Connecticut    New York
## [13] New Hampshire  Illinois       New Jersey     Pennsylvania   Indiana        Ohio
## [19] California     Oregon         Washington     Iowa           South Dakota
## [24] Kansas         Nebraska       Minnesota      North Dakota   Wisconsin      Hawaii        Maine
## [31] Massachusetts  Rhode Island   Arizona        Nevada         Montana
## [36] Colorado       Idaho          Wyoming        Utah           Oklahoma       Delaware
## [42] Missouri       New Mexico     West Virginia  Kentucky
## [46] Maryland       North Carolina Tennessee      Virginia       Vermont
## Height (summary):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.20 26.89 34.75 41.93 50.08 144.14
##
## Available components:
## [1] "order" "height" "ac" "merge" "diss" "call" "method"
## [8] "order.lab" "data"
From the height summary, we can tell that the first merge was made at a height of 17.20, and that the last merge was made at a height of 144.14.
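Those two numbers can also be read directly off the agnes object, since height is one of its available components; a quick check:

range(agn$height) # smallest (first) and largest (last) merge heights

Next, we draw the banner plot: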
bannerplot(agn)
The banner plot is as follows:
The white section of the banner plot represents the height.
Notice that the white section is on the left in Figure 6.16. We start with individual elements and group (merge) similar elements. As seen in the previous step, we make the first merge at height ~17, and successively merge up to the last merge at height ~144.
pltree(agn)
The output is as follows:
Figure 6.17 shows the dendrogram representation of the clustering. The first merge happens between Idaho and Wyoming at height ~17. We then group Connecticut and New York at height 19 and continue merging until the last grouping at height 144.
Distance metrics, such as Euclidean and Manhattan Distance metrics, can be used to calculate dissimilarity matrices, containing the distances between the elements. These are useful when grouping elements or splitting clusters.
Euclidean distance uses the Pythagorean theorem to calculate the distance between two points, as shown in the following diagram:
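In symbols, the standard definition is:

$d_{euc}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$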
The Manhattan distance calculates the distance that needs to be traveled to reach from one point to another:
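In symbols, the standard definition is:

$d_{man}(x, y) = \sum_{i=1}^{n} |x_i - y_i|$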
Here, x and y are two vectors of length n.
The main difference is that the Euclidean distance "penalizes" outlier differences more (if, for instance, two objects are similar in most attributes but very different in one aspect) due to squaring. Note how this is similar to the discussion of the RMSE and MAE metrics from Chapter 5, Linear and Logistic Regression Models.
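A small illustrative sketch of this squaring effect, using made-up vectors: y1 differs from x by a little in every attribute, while y2 differs by a lot in a single attribute. The total absolute difference is 8 in both cases, so the Manhattan distances are equal, but the Euclidean distance penalizes the concentrated difference.

x  <- c(1, 1, 1, 1)
y1 <- c(3, 3, 3, 3) # small differences spread across all attributes
y2 <- c(1, 1, 1, 9) # one large, outlier-style difference
dist(rbind(x, y1))                       # Euclidean: 4
dist(rbind(x, y2))                       # Euclidean: 8
dist(rbind(x, y1), method = "manhattan") # Manhattan: 8
dist(rbind(x, y2), method = "manhattan") # Manhattan: 8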
In this topic, we will see how distance metrics, such as the Euclidean and Manhattan distances, can be used to calculate dissimilarity matrices.
In this exercise, we will use R's dist() function, which computes a pairwise distance matrix, on the built-in agriculture dataset. We will compare the Euclidean and Manhattan distance metrics. The dataset contains the GNP (Gross National Product per capita) and the percentage of the population working in agriculture for all EU countries.
library(cluster)
data(agriculture)
dist(agriculture) # Euclidean distance is the default
The output is as follows:
Figure 6.20 shows the dissimilarity matrix with the Euclidean distances between the countries. The greater the value, the more dissimilar the countries are. For instance, B is most similar to NL, and least similar to GR.
dist(agriculture, method = "manhattan")
The output is as follows:
The Manhattan distance is interpreted in the same way. The greater the distance, the more dissimilar the countries are. For instance, B is still most similar to NL, and least similar to GR.
The dist() function only allows numeric values. If we had other types of variables (such as nominal, ordinal, or binary variables), we could use the daisy() function. In our case, the results will be the same, but let's try it anyway.
daisy(agriculture, metric = "euclidean", stand = FALSE)
The output is as follows:
daisy(agriculture, metric = "manhattan", stand = FALSE)
The output is as follows:
The interpretation is the same as for step 4. In the next section, we will look at correlation-based distance metrics.
The intuition behind this type of metric is that two items are considered similar if they are highly correlated. In fields such as genomics, correlation-based metrics are commonly used. Here are some common correlation-based distance metrics:
Given two variables, x and y, the Pearson correlation distance is computed by:
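The standard form is one minus the Pearson correlation coefficient:

$d_{pearson}(x, y) = 1 - \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$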
Pearson correlation measures the degree of a linear relationship between two profiles.
Spearman correlation distance:
Given two variables, x and y, the Spearman correlation distance is computed by:
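The standard form applies the Pearson formula to the ranks, where $x'_i = \mathrm{rank}(x_i)$ and $y'_i = \mathrm{rank}(y_i)$:

$d_{spearman}(x, y) = 1 - \dfrac{\sum_{i=1}^{n} (x'_i - \bar{x'})(y'_i - \bar{y'})}{\sqrt{\sum_{i=1}^{n} (x'_i - \bar{x'})^2 \sum_{i=1}^{n} (y'_i - \bar{y'})^2}}$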
Pearson correlation is used to evaluate the linear relationship between continuous variables. Spearman and Kendall correlations are non-parametric and are used to perform rank-based correlation analysis. The rank of a vector element is its index in the sorted version of the vector. You can use the rank() function in R to find the rank of each element. If we ran rank(c(0,5,3,6,7)), the result would be c(1,3,2,4,5).
The Kendall correlation method is calculated from the number of concordant and discordant pairs:
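In the standard form, $n_c$ is the number of concordant pairs and $n_d$ is the number of discordant pairs among the $n(n-1)/2$ possible pairs:

$d_{kendall}(x, y) = 1 - \dfrac{n_c - n_d}{\frac{1}{2} n (n - 1)}$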
Here are the definitions of the pair types used in the preceding equation:
Given two pairs, (x1, y1) and (x2, y2), a concordant pair is one such that sgn(x1 - x2) = sgn(y1 - y2), where sgn(z) is 1 if z > 0, 0 if z = 0, and -1 if z < 0.
Similarly, a discordant pair is one such that sgn(x1 - x2) = -sgn(y1 - y2).
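As a quick sketch, base R's cor() computes all three coefficients on a pair of made-up vectors, and the corresponding distance is simply one minus the coefficient:

x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)
cor(x, y, method = "pearson")      # linear relationship
cor(x, y, method = "spearman")     # rank-based
cor(x, y, method = "kendall")      # concordant versus discordant pairs
1 - cor(x, y, method = "pearson")  # the Pearson correlation distance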
In the next exercise, we will calculate the correlation-based metrics based on these formulas.
For more information on correlation-based distance metrics, please refer to the official documentation: https://www.rdocumentation.org/packages/amap/versions/0.8-17/topics/Dist
In this exercise, we will establish the Pearson correlation distance matrix, the Spearman correlation distance matrix, and the Kendall correlation distance matrix. We will only look at the first five rows to simplify the output.
install.packages("factoextra")
library(factoextra)
df <- swiss
pearson_dist <- get_dist(df, method = "pearson")
round(as.matrix(pearson_dist)[1:5, 1:5], 1)
The output is as follows:
The highest values in the (visible) correlation distance matrix are between Neuveville and Delemont, and between Neuveville and Franches-Mnt.
spearman_dist <- get_dist(df, method = "spearman")
round(as.matrix(spearman_dist)[1:5, 1:5], 1)
The output is as follows:
The highest value in the (visible) correlation distance matrix is between Courtelary and Delemont.
kendall_dist <- get_dist(df, method = "kendall")
round(as.matrix(kendall_dist)[1:5, 1:5], 1)
The output is as follows:
The highest values in the (visible) correlation distance matrix are between Courtelary and Delemont, and between Courtelary and Franches-Mnt.
As we can tell from the varying results, the choice of correlation-based distance metric affects any algorithm that relies on the correlation distance matrix.
Clustering is useful in a variety of fields. We will look at two of them in detail:
Most businesses start off not knowing how to group their customers. Clustering is useful in helping these businesses to identify segments in their data.
Additionally, in order to pick out high-level topics, document clustering can be applied together with topic modeling to perform some form of information retrieval.
The k-means clustering algorithm is one of the most popular clustering techniques. It produces hard (an element can only be a member of one cluster), flat, and polythetic (membership is determined by similarity based on multiple attributes) clusters. The k-means algorithm has no training or testing data per se. It works by creating clusters around centroids. A centroid is an average cluster member; that is, the center of a cluster. k-means requires us to specify the number of clusters (k). It is important to note that the number of clusters specified greatly affects the performance of the k-means algorithm. Deciding on the number of clusters can be informed by domain knowledge. For example, knowing about the features of a given dataset will help to set parameters for clusters. In situations where this information is not available, there are two techniques we can use to help us decide on the correct number of clusters.
In situations where data can be plotted in two dimensions, we can use scatter plots to perform exploratory data analysis:
Just by eyeballing how the data is grouped, we can come up with an estimate of the number of clusters that exist within the data. In Figure 6.31, we can identify two clusters within the data.
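For instance, the built-in faithful dataset shows two visually obvious groups; a quick sketch:

plot(faithful$eruptions, faithful$waiting,
     xlab = "Eruption time (min)", ylab = "Waiting time (min)")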
The elbow method works by running k-means successively with different values of k each time, computing an average score for all of the clusters. This score is usually the distortion score, which is calculated as the sum of squared distances from each point to its centroid. Other scores that might be used include the CH score (Calinski and Harabasz) and the Silhouette Coefficient. The CH score is calculated as the ratio of dispersion between, and within, the clusters. The Silhouette Coefficient is a score ranging from -1 to +1 that signifies how similar a sample is to its own cluster compared to other clusters.
The optimal number of clusters is selected as the point where the line graph makes an elbow, as shown in Figure 6.32:
Here are the basic steps of the k-means clustering algorithm (a from-scratch sketch follows the list):

1. Cluster assignment: For each point, find the nearest centroid and assign the point to that centroid's cluster. To find the nearest centroid, use one of the distance metrics discussed previously, such as the Euclidean distance.
2. Move the centroid: Adjust each centroid to the mean of its assigned points, minimizing the within-cluster variance.
3. Repeat steps 1 and 2, stopping after a predetermined number of iterations, or once the cluster assignments stop changing.
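Here is a from-scratch sketch of these steps on simulated data. It is for illustration only and assumes the Euclidean distance; the exercise below uses R's built-in kmeans() instead.

set.seed(1)
X <- rbind(matrix(rnorm(40, mean = 0), ncol = 2), # two simulated groups
           matrix(rnorm(40, mean = 4), ncol = 2))
k <- 2
centroids <- X[sample(nrow(X), k), ] # random initial centroids
for (iter in 1:10) {
  # Step 1: assign each point to its nearest centroid (Euclidean distance)
  d <- as.matrix(dist(rbind(centroids, X)))[-(1:k), 1:k]
  assignment <- max.col(-d) # index of the closest centroid per point
  # Step 2: move each centroid to the mean of its assigned points
  centroids <- apply(X, 2, function(col) tapply(col, assignment, mean))
}
table(assignment) # cluster sizes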
We will implement k-means in the next exercise.
For this exercise, we will use a publicly available dataset that looks at the arrests per state within the US in 1973. The data has 4 columns and 50 rows and appears as follows in an Excel sheet:
The dataset has been modified and can be found at https://github.com/TrainingByPackt/Practical-Machine-Learning-with-R/blob/master/Data/USArrests.csv. The USArrests dataset is derived from real crime data. However, due to sensitivity and the violent and extreme nature of some of the crimes, we have taken the editorial decision to leave out descriptions. Readers should use their own discretion as to whether they wish to research any further details about the USArrests dataset.
library(cluster)
library(factoextra) # used for visualization
# unlike earlier where we used it for correlation
df <- read.csv("USArrests.csv")
rownames(df) <- df$X
df$X <- NULL
The row names (states) become a column, X, when you save it as a CSV file. So, we need to change it back, since the row names are used in the plot in step 7.
df <- scale(df)
k3 <- kmeans(df, centers = 3, nstart = 20)
str(k3)
The output is as follows:
Inspecting the kmeans object with str(), as in the preceding example, reveals the following elements: cluster (each point's cluster assignment), centers (the cluster centroids), totss (the total sum of squares), withinss (the within-cluster sum of squares for each cluster), tot.withinss (the sum of the withinss values), betweenss (the between-cluster sum of squares), size (the number of points in each cluster), iter (the number of iterations performed), and ifault (an indicator of potential algorithm problems). We can also print the object directly:
k3
The output of k-means clustering is as follows:
##
## K-means clustering with 3 clusters of sizes 17, 13, 20
##
## Cluster means:
## Crime1 Crime2 Crime3 Crime4
## 1 -0.4469795 -0.3465138 0.4788049 -0.2571398
## 2 -0.9615407 -1.1066010 -0.9301069 -0.9667633
## 3 1.0049340 1.0138274 0.1975853 0.8469650
##
## Clustering vector:
## Alabama Alaska Arizona Arkansas California Colorado
## 3 3 3 1 3 3
## Connecticut Delaware Florida Georgia Hawaii Idaho
## 1 1 3 3 1 2
## Illinois Indiana Iowa Kansas Kentucky Louisiana
## 3 1 2 1 2 3
## Maine Maryland Massachusetts Michigan Minnesota Mississippi
## 2 3 1 3 2 3
## Missouri Montana Nebraska Nevada New Hampshire New Jersey
## 3 2 2 3 2 1
## New Mexico New York North Carolina North Dakota Ohio Oklahoma
## 3 3 3 2 1 1
## Oregon Pennsylvania Rhode Island South Carolina South Dakota
## 1 1 1 3 2
## Tennessee Texas Utah Vermont Virginia Washington
## 3 3 1 2 1 1
## West Virginia Wisconsin Wyoming
## 2 2 1
##
## Within cluster sum of squares by cluster:
## [1] 19.62285 11.95246 46.74796
## (between_SS / total_SS = 60.0 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss"
## [7] "size" "iter" "ifault"
fviz_cluster(k3, data = df)
The output will be as follows:
We used three clusters in the previous example. How do we determine the optimal number of clusters?
install.packages("NbClust")
library(NbClust)
The fviz_nbclust() function plots the score against the number of clusters, k, as shown in the following diagram. In this example, we use "wss", which stands for within cluster sums of squares. This measures the variability within a cluster. The smaller it is, the more compact the clusters are. The silhouette coefficient described earlier can also be used by specifying it as a method. We can also use the gap statistic, which is calculated as the deviation of the observed intracluster variation from the expected intracluster variation under the null hypothesis.
To learn more about the null hypothesis, refer to the following link: https://en.wikipedia.org/wiki/Null_hypothesis
fviz_nbclust(df, kmeans, method = "wss") +
geom_vline(xintercept = 4, linetype = 2) +
labs(subtitle = "Elbow method")
The output is as follows:
Observing Figure 6.36, we can determine that the optimal number of clusters is four. Running the k-means code from earlier, but with k set to 4, yields a slightly different visualization:
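Here is a sketch of that re-run, assuming the same scaled df from earlier:

k4 <- kmeans(df, centers = 4, nstart = 20) # as before, but with four centers
fviz_cluster(k4, data = df)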
These clusters are better as the points are closer to their centroid, as seen in Figure 6.37.
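For completeness, fviz_nbclust() also accepts the other two scores discussed earlier; each produces its own plot and suggested number of clusters (a sketch, using the same df):

fviz_nbclust(df, kmeans, method = "silhouette") # average silhouette width
fviz_nbclust(df, kmeans, method = "gap_stat")   # gap statistic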
This activity builds upon earlier exercises that implemented DIANA, AGNES, and k-means clustering on different datasets. The aim of this activity is to implement the different clustering techniques, compare the results of each method, and find the best clusters. We will implement the three clustering techniques using the mtcars dataset.
The dataset can be found at https://github.com/TrainingByPackt/Practical-Machine-Learning-with-R/blob/master/Data/mtcars.csv.
Here are the steps to complete the activity:
Expected output:
The dendrogram for AGNES with a cut at 4 will look as follows:
The following output is obtained after k-means clustering is performed:
The solution for this activity can be found on page 379.
In this chapter, we have learned to differentiate between supervised and unsupervised learning. We also performed and compared the DIANA, AGNES, and k-means clustering techniques, and discussed the applications of clustering.
We have thus covered the basics of identifying machine learning-related business problems, preparing datasets for analysis, and selecting and training suitable model architectures and evaluating their performance. We covered the basics of a commonly used set of machine learning methods and used different R packages, such as rpart, randomForest, MICE, groupdata2, and cvms. Having worked with tasks such as classification, regression, and clustering, you now possess the tools required to tackle many of your data-related business problems.