Chapter 6

Unsupervised Learning

Learning Objectives

By the end of this chapter, you will be able to:

  • Distinguish between unsupervised and supervised learning
  • Implement different clustering techniques, such as soft versus hard clustering, monothetic versus polythetic clustering, and bottom-up versus top-down clustering
  • Perform k-means clustering
  • Compare performance using DIANA, AGNES, and k-means

In this chapter, we aim to equip you with a practical understanding of unsupervised learning.

Introduction

In this chapter, we will look at the implementation of unsupervised learning. We will explore different ways of clustering; namely, bottom-up (or agglomerative) and top-down (or divisive). We will also look at the distinction between monothetic and polythetic hierarchical clustering and delve deeper into the implementation of k-means, a popular clustering technique.

Before we go into the details of the chapter, let's take a brief look at an overview of machine learning. Machine learning, in general, can be divided into three distinct groups: reinforcement learning, supervised learning, and unsupervised learning, as shown in Figure 6.1. There is also one more category, semi-supervised learning, which falls between supervised and unsupervised learning. The most widely used techniques are supervised and unsupervised learning.

Figure 6.1: Types of machine learning

Reinforcement learning is a category of machine learning that focuses on training an agent to take actions in an environment. The agent is trained through a set of rewards and penalties. Reinforcement learning is popular in areas such as gaming and robotics.

In supervised learning, a model is provided with examples of each class. The model is trained to learn the best mapping of the input to the output in tasks such as classification and regression.

In this chapter, we will look at unsupervised learning. The main distinction vis-à-vis supervised learning is that the training examples are unlabeled. In supervised learning, we start off knowing the group or category that each data element belongs to. In unsupervised learning, the goal is to discover these categories. One real-life example, described in Altuncu et al. (2018), is the grouping of related news articles. This form of document clustering is applicable in many areas where manual categorization is impractical.

Overview of Unsupervised Learning (Clustering)

Unsupervised learning is a subcategory of machine learning that learns or trains using unlabeled data. In other words, as opposed to supervised learning, where the model is expected to predict or categorize data into a set of known classes, unsupervised learning establishes the structure within data to create the categories or groups.

Before delving into unsupervised learning and, specifically, clustering, there are a few questions you need to answer:

  • Based on domain knowledge, does your dataset inherently have subgroups? If yes, how do you identify the subgroups? How many subgroups are present in the dataset?
  • Are the members of each subgroup similar? Typically, clustering should only be applied to datasets that have subgroups with somewhat similar datapoints.
  • Are there outliers in the dataset? Outliers can often influence the choice of which clustering algorithm to use.

Answering these questions will help us to create a better clustering model. There are different ways to perform clustering. Let's now look at three ways of classifying clustering techniques:

  • Hard versus soft clusters
  • Flat versus hierarchical clustering
  • Monothetic versus polythetic clustering

Let's now look at these clustering techniques in detail.

Hard versus Soft Clusters

In hard clustering, a datapoint can only belong to one of the groups. That is, if it is a description of a dog, it can't also be a description of a horse. In soft clustering, members can belong to multiple subgroups with varying degrees of strength. For instance, an emoji can be judged as part sad and part angry. One soft clustering technique is the Expectation Maximization (EM) algorithm. EM fits a probabilistic, generative model using joint probabilities and Bayes' theorem.

The following steps, 1 to 5, explain the workings of the EM algorithm:

Figure 6.2: Steps of the EM algorithm

EM is regarded as a soft clustering technique since a point can belong to different clusters with different probabilities. In step 3, we establish to what extent each datapoint belongs to each distribution (for instance, x1 belongs to the first distribution with a high probability, and to the second distribution with a lower probability). In step 4, we adjust the means and variances based on the memberships estimated in step 3. That is, each distribution's mean becomes the weighted mean of the datapoints belonging to that distribution: the "amount" that x1 belongs to the first distribution (b1) multiplied by its value (b1x1), plus the amount that x2 belongs to the first distribution multiplied by its value (b2x2), and so on, divided by the sum of the weights (b1 + b2 + ...). Steps 3 and 4 are repeated until the estimates stop changing (convergence).
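To make these steps concrete, here is a minimal sketch of EM for a one-dimensional mixture of two normal distributions. The simulated data, starting values, and variable names are our own illustrative assumptions; only base R functions such as dnorm() are used:

    # Illustrative data: two overlapping groups
    set.seed(42)
    x <- c(rnorm(50, mean = 0, sd = 1), rnorm(50, mean = 5, sd = 1))

    # Steps 1-2: initialize the means, variances, and mixing weights
    mu <- c(min(x), max(x)); sigma2 <- c(var(x), var(x)); w <- c(0.5, 0.5)

    for (iter in 1:100) {
      # Step 3: how strongly each point belongs to each distribution
      d1 <- w[1] * dnorm(x, mu[1], sqrt(sigma2[1]))
      d2 <- w[2] * dnorm(x, mu[2], sqrt(sigma2[2]))
      b  <- d1 / (d1 + d2)                     # membership in distribution 1

      # Step 4: weighted means and variances, as described above
      mu     <- c(sum(b * x) / sum(b), sum((1 - b) * x) / sum(1 - b))
      sigma2 <- c(sum(b * (x - mu[1])^2) / sum(b),
                  sum((1 - b) * (x - mu[2])^2) / sum(1 - b))
      w      <- c(mean(b), 1 - mean(b))
    }

    mu                 # estimated means of the two distributions
    round(b[1:5], 3)   # soft memberships of the first five points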

Flat versus Hierarchical Clustering

Hierarchical clustering produces different levels of abstraction in a taxonomic fashion, unlike flat clustering, where all members are on the same level. Hierarchical clustering aims to create a dendrogram in which the top cluster is a superset containing all the elements. If the top level were animal, the second level could contain the different animal groups (fish, mammals, birds, and so on), and the third level could be the specific types of fish, mammals, birds, and so on:

Figure 6.3: Dendrogram

In Figure 6.3, the vertical axis (y) represents dissimilarity within the cluster. As you go up, the cluster members become more dissimilar. The x-axis represents the objects that are recursively being grouped.

Hierarchical clustering can be executed top-down or bottom-up. As the name suggests, for top-down clustering, we subdivide bigger clusters into successively smaller clusters. Bottom-up clustering involves grouping smaller clusters into bigger ones as we move up the dendrogram. An example of top-down clustering is recursive k-means.

In recursive k-means, we perform several iterations. In the first iteration, we might split the data into two parts. In the second iteration, each of the two clusters is further split into smaller clusters. We continue iteratively splitting the clusters into smaller sub-clusters until each sub-cluster only contains elements of a single class, as shown in Figure 6.4. As you can see, by iteration 2, each cluster contains only similar elements from the data provided:

Figure 6.4: The k-means clusters
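The following is a minimal sketch of recursive (top-down) k-means on the agriculture dataset from the cluster package. The recursive_kmeans() helper and the choice of two splitting iterations are our own illustrative assumptions:

    library(cluster)

    # Recursively split each cluster into two with kmeans()
    recursive_kmeans <- function(data, depth) {
      if (depth == 0 || nrow(data) < 2) return(list(data))
      km <- kmeans(data, centers = 2, nstart = 10)
      c(recursive_kmeans(data[km$cluster == 1, , drop = FALSE], depth - 1),
        recursive_kmeans(data[km$cluster == 2, , drop = FALSE], depth - 1))
    }

    data(agriculture)
    # Two iterations of splitting, giving up to four sub-clusters
    sub_clusters <- recursive_kmeans(scale(agriculture), depth = 2)
    lapply(sub_clusters, rownames)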

Top-down clustering is fast but has the disadvantage of not allowing the traversal of boundaries set at higher levels. On the other hand, bottom-up clustering is slow but has the advantage of being more likely to group similar elements. In bottom-up clustering, each point is regarded as a cluster, as shown in Figure 6.5. In the first iteration, we find a pair of clusters that are similar and merge them into one new cluster:

Figure 6.5: Bottom-up clustering (step 1)

Figure 6.5 shows a cluster with a pair of points, a and b. We then recursively continue the process until we have only one cluster. We can also set a threshold, where we "cut" the dendrogram at a certain threshold, as shown in Figure 6.6:

Figure 6.6: Bottom-up clustering (step 2)

An example of bottom-up clustering is agglomerative clustering, also known as AGNES, which forms clusters based on similarity. This topic will be covered in detail in Exercise 62, Agglomerative Clustering Using AGNES.
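As a minimal sketch of the threshold "cut" mentioned above, the cutree() function can cut an agglomerative tree into a chosen number of clusters. The dataset and the choice of k = 2 here are our own illustrative assumptions; the AGNES exercise later in the chapter covers the full workflow:

    library(cluster)

    data(agriculture)
    agn <- agnes(agriculture, metric = "manhattan", stand = TRUE)

    # Cut the dendrogram so that two clusters remain
    cutree(as.hclust(agn), k = 2)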

Monothetic versus Polythetic Clustering

In monothetic clustering, the membership of a cluster is determined by the absence or presence of a single attribute. In polythetic clustering, multiple variables are used to determine the memberships.

We will use R's mona() function, which stands for monothetic analysis. It is monothetic and hierarchical. At each step, a split is made based on a single variable. mona() is a divisive clustering technique and requires the variables to be binary, that is, either zero or one.

Note

For more details of the mona algorithm, please read Struyf, Anja, Mia Hubert, and Peter Rousseeuw. "Clustering in an object-oriented environment." Journal of Statistical Software 1.4 (1997): 1-30.

In the following exercise, we are going to perform clustering on a dataset that has binary values for features.

Exercise 60: Monothetic and Hierarchical Clustering on a Binary Dataset

In this exercise, we'll perform monothetic and hierarchical clustering. To start with, we will use the built-in R dataset of animals. This dataset contains a list of 20 animals with attributes such as warm-blooded, can fly, and endangered.

  1. Attach the cluster package:

    library(cluster)

  2. Load and view the animals dataset:

    data(animals)

  3. View the number of observations for each column. We apply the table function as follows:

    apply(animals, 2, table) # simple overview

    The output is as follows:

    Figure 6.7: Summary of the attributes where 1 is "no" and 2 is "yes"

    From the preceding results, of the 20 observations, 10 are warm-blooded (war) and 10 are not. 4 animals can fly and 16 cannot. 14 animals are vertebrates (ver) and 6 are not. 6 animals are endangered (end), 12 are not, and 2 are unknown (NA). 11 live on the ground (gro), 6 do not, and 3 are unknown (NA). 9 have hair (hai), while 11 do not.

  4. Apply the mona() function to create clusters based on whether they are warm-blooded, can fly, are vertebrates, are endangered, live on the ground, and whether they have hair:

    ma <- mona(animals)

    ma

    The output is as follows:

    Figure 6.8: Revised data from mona output – part 1

    Figure 6.8 shows the data used by mona(), in binary (0 or 1) form. mona() also outputs details of the clustering, such as the variable used at each separation step:

    Figure 6.9: Cluster information from mona output – part 2
  5. Plot a banner:

    Note

    To read up on the banner, refer to Struyf, Anja, Mia Hubert, and Peter Rousseeuw. "Clustering in an object-oriented environment." Journal of Statistical Software 1.4 (1997): 1-30.

    plot(ma)

    The plot is as follows:

Figure 6.10: Banner using mona

To interpret the preceding graph, we start with all the elements in one group. At step one, we place all warm-blooded animals on one side (below sal on the y-axis) and all the rest on the other side (above and including sal). At step 2, the warm-blooded animals are split by whether they can fly, while the cold-blooded ones are split by whether they are vertebrates. The process continues, recursively, until we can no longer effect a split.

DIANA

DIANA stands for DIvisive ANAlysis clustering. It is a hierarchical clustering technique that starts with a single cluster containing all the elements and successively divides clusters until each cluster contains just a single element. At each step, we select the cluster with the most dissimilar elements. The most dissimilar element within that cluster is then used as the starting point of a new cluster, thereby splitting the original cluster into two. The divisive coefficient measures the amount of clustering structure found.

Note

For more information on DIANA, please refer to Kaufman, Leonard, and Peter J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Hoboken: Wiley (1990).

In the next exercise, we will be covering the concepts of hierarchical clustering using DIANA.

Exercise 61: Implement Hierarchical Clustering Using DIANA

In this exercise, we will perform hierarchical clustering using DIANA. We will be using the agriculture dataset from the cluster package, which contains the GNP (Gross National Product) per capita and the percentage of the population working in agriculture for the EU member countries.

  1. Attach the cluster package:

    library(cluster)

  2. Load the agriculture dataset:

    data(agriculture)

  3. Show the dataset:

    agriculture

    The output is as follows:

    ##

    ##        x    y

    ## B   16.8  2.7

    ## DK  21.3  5.7

    ## D   18.7  3.5

    ## GR   5.9 22.2

    ## E   11.4 10.9

    ## F   17.8  6.0

    ## IRL 10.9 14.0

    ## I   16.6  8.5

    ## L   21.0  3.5

    ## NL  16.4  4.3

    ## P    7.8 17.4

    ## UK  14.0  2.3

  4. Apply the diana() function and print the output. Note that you do not need to understand the output for now, as we will plot the dendrogram to visualize the clusters in the next step. You can read the explanation of the output at ?diana:

    dv <- diana(agriculture, metric = "manhattan", stand = TRUE)

    print(dv)

    The output is as follows:

    Figure 6.11: Merge information from the DIANA output – part 1

    The order of objects, height, divisive coefficients, and available components are as follows:

    Figure 6.12: Output using diana – part 2

    The diana() function creates clusters based on the suggested metrics, as shown in the preceding diagram.

  5. Plot the banner and dendrogram:

    plot(dv)

    The banner plot will appear as follows:

    Figure 6.13: Banner using DIANA

    Figure 6.13 shows the splitting banner from the DIANA algorithm. Imagine that you are starting off with all the countries in one group on the left, and the goal is to divide the group into smaller parts until you have the elements on the right. We split the group into two groups at a height of 7.34 (the height is the diameter of the cluster prior to splitting). The "F" group contains F, D, L, DK, I, UK, NL, and B. The "GR" group contains GR, P, E, and IRL. The "GR" group is split again at a height of ~3.5, and we now have a "P" group and an "E" group. The P group is split again at a height of ~1.3 into individual P and GR elements.

  6. Press Enter to view the next plot:
    Figure 6.14: Dendrogram using DIANA

    This dendrogram communicates the same information as the banner. Starting from the top, we split the cluster into two: one group with B, NL, UK, I, DK, L, D, and F, while the second group contains everything else. Each of the groups is then split into two as we go down, and the elements become increasingly similar.

  7. To view the dendrogram as text, use str():

    str(as.dendrogram(dv))

    The output is as follows:

Figure 6.15: Dendrogram using the str() function

The str() representation of the dendrogram presents the exact splitting points. Notice, for example, the first split at height 7.34, and the next one at height 3.54.

AGNES

AGNES stands for AGglomerative NESting and performs bottom-up (agglomerative) hierarchical clustering on the given dataset. Using agnes(), you can obtain the agglomerative coefficient, which denotes the amount of clustering structure found in the dataset. Instead of splitting heterogeneous clusters, we now merge similar clusters.

Exercise 62: Agglomerative Clustering Using AGNES

For this exercise, we will use the built-in R dataset votes.repub, which contains the state-wise percentage of Republican votes in 31 US presidential elections.

  1. Attach the cluster package:

    library(cluster)

  2. Load the dataset:

    data(votes.repub)

  3. Perform agglomerative clustering using AGNES:

    agn <- agnes(votes.repub)

    agn

    The output is as follows:

    ##

    ## Call:     agnes(x = votes.repub)

    ## Agglomerative coefficient:  0.7688431

    ## Order of objects:

    ##  [1] Alabama   Georgia   Texas   Louisiana  Arkansas   Florida       

    ##  [7] Mississippi   South Carolina   Alaska   Michigan   

    ##      Connecticut   New York      

    ## [13] New Hampshire   Illinois   New Jersey   Pennsylvania   

    ##      Indiana  Ohio          

    ## [19] California   Oregon   Washington   Iowa   South Dakota

    ## [24] Kansas   Nebraska   Minnesota   North Dakota   Wisconsin      

    ##      Hawaii   Maine         

    ## [31] Massachusetts   Rhode Island   Arizona   Nevada   Montana   

    ## [36] Colorado   Idaho   Wyoming   Utah   Oklahoma   Delaware

    ## [42] Missouri   New Mexico   West Virginia   Kentucky

    ## [46] Maryland   North Carolina   Tennessee   Virginia  Vermont       

    ## Height (summary):

    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.

    ##   17.20   26.89   34.75   41.93   50.08  144.14

    ##

    ## Available components:

    ## [1] "order" "height" "ac" "merge" "diss" "call" "method"

    ## [8] "order.lab" "data"

    From the height summary, we can tell that the first merge was made at a height of 17.20, and that the last merge was made at a height of 144.14.

  4. Plot the bannerplot for agnes:

    bannerplot(agn)

    The banner plot is as follows:

    Figure 6.16: Banner using AGNES

    Note

    The white section of the banner plot represents the height.

    Notice that the white section is on the left in Figure 6.16. We start with individual elements and group (merge) similar elements. As seen in the previous step, we make the first merge at height ~17, and successively merge up to the last merge at height ~144.

  5. Plot the tree:

    pltree(agn)

    The output is as follows:

Figure 6.17: Dendrogram using AGNES

Figure 6.17 shows the dendrogram representation of the clustering. The first merge happens between Idaho and Wyoming at height ~17. We then group Connecticut and New York at height 19 and continue merging until the last grouping at height 144.

Distance Metrics in Clustering

Distance metrics, such as the Euclidean and Manhattan distances, can be used to calculate dissimilarity matrices containing the pairwise distances between elements. These are useful when grouping elements or splitting clusters.

Euclidean distance uses the Pythagorean theorem to calculate the distance between two points, as shown in the following diagram:

Figure 6.18: Euclidean distance

The Manhattan distance calculates the distance that needs to be traveled to reach from one point to another:

Figure 6.19: Manhattan distance

Here, x and y are two vectors of length n.

The main difference is that the Euclidean distance "penalizes" outlier differences more (if, for instance, two objects are similar in most attributes but very different in one aspect) due to squaring. Note how this is similar to the discussion of the RMSE and MAE metrics from Chapter 5, Linear and Logistic Regression Models.
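As a small illustrative check, with vectors of our own choosing, the following compares the two metrics when the same total difference is concentrated in a single attribute versus spread evenly across all attributes:

    x <- c(1, 2, 3, 4)
    y <- c(1, 2, 3, 14)          # one large difference of 10
    z <- c(3.5, 4.5, 5.5, 6.5)   # four small differences of 2.5 each

    sum(abs(x - y)); sum(abs(x - z))             # Manhattan: 10 and 10
    sqrt(sum((x - y)^2)); sqrt(sum((x - z)^2))   # Euclidean: 10 and 5

Both comparisons have the same Manhattan distance, but the Euclidean distance penalizes the single large (outlier) difference twice as heavily.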

In the following exercise, we will see how distance metrics, such as the Euclidean and Manhattan distances, can be used to calculate dissimilarity matrices.

Exercise 63: Calculate Dissimilarity Matrices Using Euclidean and Manhattan Distance

In this exercise, we will use R's dist() function, which computes a pairwise distance matrix, on the built-in agriculture dataset. We will compare the Euclidean and Manhattan distance metrics. The dataset contains the GNP (Gross National Product per capita) and the percentage of the population working in agriculture for all EU countries.

  1. Attach the cluster package:

    library(cluster)

  2. Load the dataset:

    data(agriculture)

  3. Compute the Euclidean distance:

    dist(agriculture) # Euclidean distance is the default

    The output is as follows:

    Figure 6.20: Dissimilarity matrix using the Euclidean distance

    Figure 6.20 shows the dissimilarity matrix with the Euclidean distances between the countries. The greater the value, the more dissimilar the countries are. For instance, B is most similar to NL, and least similar to GR.

  4. Compute the Manhattan distances:

    dist(agriculture, method = "manhattan")

    The output is as follows:

    Figure 6.21: Dissimilarity matrix using the Manhattan distance

    The Manhattan distance is interpreted in the same way. The greater the distance, the more dissimilar the countries are. For instance, B is still most similar to NL, and least similar to GR.

    The dist() function only allows numeric values. If we had other types of variables (such as nominal, ordinal, or binary variables), we could use the daisy() function. In our case, the results will be the same, but let's try it anyway.

  5. Use R's daisy() function to compute pairwise Euclidean distances:

    daisy(agriculture, metric = "euclidean", stand = FALSE)

    The output is as follows:

    Figure 6.22: Dissimilarity matrix using the Euclidean distance using daisy()
  6. Use R's daisy() function to compute pairwise Manhattan distances:

    daisy(agriculture, metric = "manhattan", stand = FALSE)

    The output is as follows:

Figure 6.23: Dissimilarity matrix using the Manhattan distance and daisy()

The interpretation is the same as for step 4. In the next section, we will look at correlation-based distance metrics.

Correlation-Based Distance Metrics

The intuition behind this type of metric is that two items are considered similar if they are highly correlated. In fields such as genomics, correlation-based metrics are commonly used. Here are some common correlation-based distance metrics:

Pearson correlation distance:

Given two variables, x and y, the Pearson correlation distance is computed by:

Figure 6.24: Pearson correlation distance

Pearson correlation measures the degree of a linear relationship between two profiles.
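As a minimal sketch, with vectors of our own choosing, the Pearson correlation distance can be computed as 1 minus the Pearson correlation coefficient returned by cor():

    x <- c(1, 2, 3, 4, 5)
    y <- c(2, 4, 6, 8, 11)

    # Close to 0, since the two profiles are almost perfectly linearly related
    1 - cor(x, y, method = "pearson")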

Spearman correlation distance:

Given two variables, x and y, the Spearman correlation distance is computed by:

Figure 6.25: Spearman correlation distance

Pearson correlation is used to evaluate the linear relationship between continuous variables. Spearman and Kendall correlations are non-parametric and are used to perform rank-based correlation analysis. The rank of a vector element is its index in the sorted version of the vector. You can use the rank() function in R to find the rank of each element. If we ran rank(c(0,5,3,6,7)), the result would be c(1,3,2,4,5).
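A minimal sketch, again with vectors of our own choosing, shows that the Spearman correlation is simply the Pearson correlation of the ranks, and that the corresponding distance is commonly taken as 1 minus that correlation:

    x <- c(0, 5, 3, 6, 7)
    y <- c(10, 40, 30, 50, 45)

    rank(x)                                    # 1 3 2 4 5, as described above
    cor(rank(x), rank(y), method = "pearson")  # equals cor(x, y, method = "spearman")
    1 - cor(x, y, method = "spearman")         # Spearman correlation distance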

Kendall correlation distance:

The Kendall correlation method is calculated from the number of concordant and discordant pairs:

Figure 6.26: Kendall correlation distance

Here are the definitions of the variables used in the preceding equation:

  • nc is the total number of concordant pairs
  • nd is the total number of discordant pairs
  • n is the length of x and y

Given two pairs, x1, y1 and x2, y2, a concordant pair is one such that sgn(x1-x2) = sgn(y1-y2), where:

Figure 6.27: sgn values of x

Similarly, a discordant pair is one such that sgn(x1-x2) = - sgn(y1-y2).
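The following is a minimal sketch, using our own helper function rather than a library routine, that counts concordant and discordant pairs and reproduces R's built-in Kendall correlation; the distance shown is one common form, 1 minus that correlation:

    kendall_tau <- function(x, y) {
      n <- length(x); nc <- 0; nd <- 0
      for (i in 1:(n - 1)) {
        for (j in (i + 1):n) {
          s <- sign(x[i] - x[j]) * sign(y[i] - y[j])
          if (s > 0) nc <- nc + 1     # concordant pair
          if (s < 0) nd <- nd + 1     # discordant pair
        }
      }
      (nc - nd) / (n * (n - 1) / 2)
    }

    x <- c(1, 2, 3, 4, 5)
    y <- c(2, 1, 4, 3, 5)

    kendall_tau(x, y)       # matches cor(x, y, method = "kendall")
    1 - kendall_tau(x, y)   # one common form of the Kendall distance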

In the next exercise, we will calculate the correlation-based metrics based on these formulas.

Note

For more information on correlation-based distance metrics, please refer to the official documentation: https://www.rdocumentation.org/packages/amap/versions/0.8-17/topics/Dist

Exercise 64: Apply Correlation-Based Metrics

In this exercise, we will compute the Pearson, Spearman, and Kendall correlation distance matrices for the built-in swiss dataset. We will only look at the first five rows and columns to simplify the output.

  1. Install the factoextra R package:

    install.packages("factoextra")

  2. Attach the factoextra package:

    library(factoextra)

  3. Load the dataset:

    df <- swiss

  4. Compute the distance using the Pearson correlation:

    pearson_dist <- get_dist(df, method = "pearson")

  5. Display the results as a matrix:

    round(as.matrix(pearson_dist)[1:5, 1:5], 1)

    The output is as follows:

    Figure 6.28: Pearson correlation metrics

    The highest values in the (visible) correlation distance matrix are between Neuveville and Delemont, and between Neuveville and Franches-Mnt.

  6. Repeat the process using the spearman and kendall correlation metrics. Compute distances using the Spearman correlation:

    spearman_dist <- get_dist(df, method = "spearman")

  7. Display the results as a matrix:

    round(as.matrix(spearman_dist)[1:5, 1:5], 1)

    The output is as follows:

    Figure 6.29: Spearman correlation metrics

    The highest value in the (visible) correlation distance matrix is between Courtelary and Delemont.

  8. Compute the distances using the kendall correlation metric:

    kendall_dist <- get_dist(df, method = "kendall")

  9. Display the results as a matrix:

    round(as.matrix(kendall_dist)[1:5, 1:5], 1)

    The output is as follows:

Figure 6.30: Kendall correlation metrics

The highest values in the (visible) correlation distance matrix are between Courtelary and Delemont, and between Courtelary and Franches-Mnt.

As we can tell from the varying results, the choice of correlation-based distance metric affects any algorithm that relies on the resulting distance matrix.

Applications of Clustering

Clustering is useful in a variety of fields. We will look at two of them in detail:

  • Market segmentation: Segmentation is the process of splitting a heterogeneous group of consumers or customers into smaller homogeneous groups. These smaller groups can be targeted differently based on their characteristics and behavior. Segmentation is important to businesses because it shapes both marketing efforts and product development. Businesses designing products must decide which product is targeted to which segment of consumers or customers. They must consequently decide what features to include, how to price the product, and how to take it to market. All these actions are heavily influenced by the characteristics of the different segments.

    Most businesses start off not knowing how to group their customers. Clustering is useful in helping these businesses to identify segments in their data.

  • Document clustering and information retrieval: In the information and knowledge age, we have seen unprecedented growth in the amount of digital information. This information is often in documents that include raw text in different formats. Tweets, news articles, blogs, research papers, and publications can all be considered documents. In order to be able to group news articles that are related, search engine companies often employ text and document clustering techniques.

Additionally, in order to pick out high-level topics, document clustering can be applied together with topic modeling to perform some form of information retrieval.

k-means Clustering

The k-means clustering algorithm is one of the most popular clustering techniques. It produces hard (an element can only be a member of one cluster), flat, and polythetic (membership is determined by similarity based on multiple attributes) clusters. The k-means algorithm has no training or testing data per se. It works by creating clusters around centroids. A centroid is an average cluster member; that is, the center of a cluster. k-means requires us to specify the number of clusters (k), and this choice greatly affects the performance of the algorithm. Deciding on the number of clusters can be informed by domain knowledge; for example, knowing how many natural subgroups a dataset contains helps us set k. In situations where this information is not available, there are two techniques we can use to help us decide on the number of clusters.

Exploratory Data Analysis Using Scatter Plots

In situations where data can be plotted in two dimensions, we can use scatter plots to perform exploratory data analysis:

Figure 6.31: Scatter plot

Just by eyeballing how the data is grouped, we can come up with an estimate of the number of clusters that exist within the data. In Figure 6.31, we can identify two clusters within the data.

The Elbow Method

The elbow method works by running k-means successively with different values of k each time, computing an average score for all of the clusters. This score is usually the distortion score, which is calculated as the sum of squared distances from each point to its centroid. Other scores that might be used include the CH score (Calinski and Harabasz) and the Silhouette Coefficient. The CH score is calculated as the ratio of dispersion between, and within, the clusters. The Silhouette Coefficient is a score ranging from -1 to +1 that signifies how similar a sample is to its own cluster compared to other clusters.
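A minimal sketch of this idea, using the agriculture dataset from the cluster package and a loop of our own, runs kmeans() for several values of k and records the total within-cluster sum of squares:

    library(cluster)

    data(agriculture)
    df <- scale(agriculture)

    # Total within-cluster sum of squares for k = 1, ..., 6
    wss <- sapply(1:6, function(k) kmeans(df, centers = k, nstart = 20)$tot.withinss)

    plot(1:6, wss, type = "b",
         xlab = "Number of clusters k",
         ylab = "Total within-cluster sum of squares")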

The optimal number of clusters is selected as the point where the line graph makes an elbow, as shown in Figure 6.32:

Figure 6.32: The elbow method

Here are the basic steps of the k-means clustering algorithm:

  1. Start by specifying the number of clusters (k).
  2. Initialize k random centroids based on datapoints in the data.
  3. Repeat the following steps:

    Cluster assignment: For each point, find the nearest centroid and assign the point to that cluster. To find the nearest centroid, use one of the distance metrics discussed previously, such as the Euclidean distance.

    Move the centroid: Move each centroid to the mean of the points assigned to it, minimizing the within-cluster variance.

    Stop after a predetermined number of iterations, or once the cluster assignments stop changing.

We will implement k-means in the next exercise.
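Before turning to R's built-in kmeans() function in the next exercise, here is a minimal from-scratch sketch of the preceding steps. The simple_kmeans() helper, the use of the Euclidean distance, and the agriculture dataset from the cluster package are our own illustrative assumptions:

    simple_kmeans <- function(data, k, iterations = 10) {
      data <- as.matrix(data)
      # Step 2: initialize k random centroids from the datapoints
      centroids <- data[sample(nrow(data), k), , drop = FALSE]
      for (i in 1:iterations) {
        # Cluster assignment: nearest centroid (Euclidean distance) for each point
        d <- as.matrix(dist(rbind(centroids, data)))[-(1:k), 1:k]
        assignment <- apply(d, 1, which.min)
        # Move the centroids: mean of the points assigned to each cluster
        for (j in 1:k) {
          members <- data[assignment == j, , drop = FALSE]
          if (nrow(members) > 0) centroids[j, ] <- colMeans(members)
        }
      }
      list(cluster = assignment, centers = centroids)
    }

    library(cluster)
    data(agriculture)
    set.seed(1)
    simple_kmeans(scale(agriculture), k = 2)$cluster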

Exercise 65: Implementation of k-means Clustering in R

For this exercise, we will use a publicly available dataset that records the arrests per state within the US in 1973. The data has 4 columns and 50 rows.

Note

The dataset has been modified and can be found at https://github.com/TrainingByPackt/Practical-Machine-Learning-with-R/blob/master/Data/USArrests.csv. The USArrests dataset is derived from real crime data. However, due to sensitivity and the violent and extreme nature of some of the crimes, we have taken the editorial decision to leave out descriptions. Readers should use their own discretion as to whether they wish to research any further details about the USArrests dataset.

  1. Attach the required packages:

    library(cluster)   

    library(factoextra) # used for visualization

    # unlike earlier where we used it for correlation

  2. Load the data:

    df <- read.csv("USArrests.csv")

  3. Set the row names to the values of the X column (the state names). Remove the X column afterward:

    rownames(df) <- df$X

    df$X <- NULL

    Note

    The row names (states) become a column, X, when you save it as a CSV file. So, we need to change it back, since the row names are used in the plot in step 7.

  4. Scale and standardize the data using the scale() function:

    df <- scale(df)

  5. Find three clusters in the data. Note that the kmeans() function only accepts numeric columns. In the case of categorical columns, we would need to convert them to numeric using methods such as one-hot encoding or label encoding:

    k3 <- kmeans(df, centers = 3, nstart = 20)

    str(k3)

    The output is as follows:

    Figure 6.33: Cluster information

    Printing out the kmeans object as in the preceding example returns the following elements:

    Figure 6.34: k-means object
  6. View k3:

    k3

    The output of k-means clustering is as follows:

    ##

    ## K-means clustering with 3 clusters of sizes 17, 13, 20

    ##

    ## Cluster means:

    ##       Crime1     Crime2     Crime3     Crime4

    ## 1 -0.4469795 -0.3465138  0.4788049 -0.2571398

    ## 2 -0.9615407 -1.1066010 -0.9301069 -0.9667633

    ## 3  1.0049340  1.0138274  0.1975853  0.8469650

    ##

    ## Clustering vector:

    ##  Alabama Alaska Arizona Arkansas California Colorado

    ##    3       3       3        1        3         3

    ##  Connecticut Delaware Florida Georgia Hawaii Idaho

    ##       1          1       3       3       1     2

    ##  Illinois Indiana Iowa Kansas Kentucky Louisiana

    ##     3        1      2    1        2        3

    ##  Maine Maryland Massachusetts Michigan Minnesota Mississippi

    ##     2      3          1           3        2          3

    ##  Missouri Montana Nebraska Nevada  New Hampshire New Jersey

    ##     3        2        2       3          2           1

    ##  New Mexico New York North Carolina North Dakota Ohio Oklahoma

    ##      3          3          3             2         1      1

    ##  Oregon Pennsylvania Rhode Island South Carolina South Dakota

    ##    1         1            1             3             2

    ##  Tennessee Texas Utah Vermont Virginia Washington

    ##   3          3     1     2        1        1

    ##  West Virginia Wisconsin Wyoming

    ##       2            2        1

    ##

    ## Within cluster sum of squares by cluster:

    ## [1] 19.62285 11.95246 46.74796

    ##  (between_SS / total_SS =  60.0 %)

    ##

    ## Available components:

    ##

    ## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"   

    ## [7] "size"         "iter"         "ifault"    

  7. Use the fviz_cluster() function:

    fviz_cluster(k3, data = df)

    The output will be as follows:

    Figure 6.35: Visualization of the clusters

    We used three clusters in the previous example. How do we determine the optimal number of clusters?

  8. To apply the elbow method described earlier, we need to install the NbClust package:

    install.packages("NbClust")

  9. Thereafter, we can attach the package:

    library(NbClust)

    The fviz_nbclust() function plots the score against the number of clusters, k, as shown in the following diagram. In this example, we use "wss", which stands for within cluster sums of squares. This measures the variability within a cluster. The smaller it is, the more compact the clusters are. The silhouette coefficient described earlier can also be used by specifying it as a method. We can also use the gap statistic, which is calculated as the deviation of the observed intracluster variation from the expected intracluster variation under the null hypothesis.

    Note

    To learn more about the null hypothesis, refer to the following link: https://en.wikipedia.org/wiki/Null_hypothesis

  10. Use the wss method to measure the variation within the cluster and plot the elbow graph:

    fviz_nbclust(df, kmeans, method = "wss") +

        geom_vline(xintercept = 4, linetype = 2) +

        labs(subtitle = "Elbow method")

    The output is as follows:

Figure 6.36: Elbow graph using wss

Observing Figure 6.36, we can determine that the optimal number of clusters is four. Running the k-means code from earlier, but with k set as 4, yields a slightly different visualization:

Figure 6.37: Visualization after performing wss

These clusters are better as the points are closer to their centroid, as seen in Figure 6.37.
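As an alternative to "wss", the same fviz_nbclust() function can score candidate values of k using the silhouette coefficient described earlier; the subtitle label below is our own:

    fviz_nbclust(df, kmeans, method = "silhouette") +
        labs(subtitle = "Silhouette method")

The value of k that maximizes the average silhouette width is then suggested as the number of clusters.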

Activity 20: Perform DIANA, AGNES, and k-means on the Built-In Motor Car Dataset

This activity builds upon earlier exercises that implemented DIANA, AGNES, and k-means clustering on different datasets. The aim of this activity is to implement different clustering techniques, compare the results of each method, and find the best clusters. We want to implement three clustering techniques using the mtcars dataset.

The dataset can be found at https://github.com/TrainingByPackt/Practical-Machine-Learning-with-R/blob/master/Data/mtcars.csv.

Here are the steps to complete the activity:

  1. Attach the cluster and factoextra packages.
  2. Load the dataset.
  3. Prepare the data as demonstrated in the preceding implementation. Set the row names to the values of the X column and remove it, remove rows with missing data, and standardize the dataset.
  4. Implement divisive hierarchical clustering using DIANA. This is similar to Exercise 61, Implement Hierarchical Clustering Using DIANA. For easy comparison, save the dendrogram output.
  5. Implement bottom-up hierarchical clustering using AGNES. This is similar to Exercise 62, Agglomerative Clustering Using AGNES. Take note of the dendrogram created for comparison purposes later on.
  6. Implement k-means clustering. Use the elbow method to determine the optimal number of clusters.
  7. For easy comparison, consider the elements in the smaller clusters for each method.

Expected output:

Figure 6.38: Dendrogram for DIANA with a cut at 20

The dendrogram for AGNES with a cut at 4 will look as follows:

Figure 6.39: Dendrogram for AGNES with a cut at 4

The following output is obtained after k-means clustering is performed:

Figure 6.40: K-means clustering

Note

The solution for this activity can be found on page 379.

Summary

In this chapter, we have learned to differentiate between supervised and unsupervised learning. We also performed and compared the DIANA, AGNES, and k-means clustering techniques, and discussed the applications of clustering.

We have thus covered the basics of identifying machine learning-related business problems, preparing datasets for analysis, and selecting and training suitable model architectures and evaluating their performance. We covered the basics of a commonly used set of machine learning methods and used different R packages, such as rpart, randomForest, MICE, groupdata2, and cvms. Having worked with tasks such as classification, regression, and clustering, you now possess the tools required to tackle many of your data-related business problems.
