Chapter 9. Finding Groups of Data – Clustering with k-means

Have you ever spent time watching large crowds? As a sociologist, this was one of my favorite pastimes. I would choose a busy location, such as a coffee shop, library, or cafeteria, and observe the masses of people for interesting patterns of behavior. The goal was to look for details that reveal an insight into how people, as a general rule, relate to each other and their environment.

The more you perform such observational research, the more you may see recurring personalities. Perhaps a certain type of person, identified by a freshly-pressed suit and a briefcase, comes to typify the white-collar business executive. A twenty-something wearing tight jeans, a flannel shirt, and sunglasses might fall into the hipster category, while a woman unloading children from a minivan could be labeled a soccer mom.

Of course, these types of stereotypes are dangerous to apply to individuals—no two people are exactly alike. Used in aggregate, however, the labels may reflect some underlying pattern of similarity among the individuals falling within the group.

This chapter describes methods to address the machine learning task of clustering, which involves finding natural groupings of data. As you will see, clustering works in a process very similar to the observational research described just now. Along the way, you will learn:

  • Ways clustering tasks differ from the classification tasks we've examined previously and how clustering defines groups
  • The basic methods used by k-means, a classic and easy-to-understand clustering algorithm
  • How to apply clustering to a real-world task of identifying marketing segments within teenage social media users

Before jumping into action, let's begin by taking an in-depth look at exactly what clustering entails.

Understanding clustering

Clustering is an unsupervised machine learning task that automatically divides the data into clusters, or groupings of similar items. It does this without having been told what the groups should look like ahead of time. As we may not even know what we're looking for, clustering is used for knowledge discovery rather than prediction. It provides an insight into the natural groupings found within data.

Without advance knowledge of what comprises a cluster, how could a computer possibly know where one group ends and another begins? The answer is simple. Clustering is guided by the principle that records inside a cluster should be very similar to each other, but very different from those outside. As you will see later, the definition of similarity might vary across applications, but the basic idea is always the same: group the data such that related elements are placed together.

The resulting clusters can then be used for action. For instance, you might find clustering methods employed in applications such as:

  • Segmenting customers into groups with similar demographics or buying patterns for targeted marketing campaigns and/or detailed analysis of purchasing behavior by subgroup
  • Detecting anomalous behavior, such as unauthorized intrusions into computer networks, by identifying patterns of use falling outside known clusters
  • Simplifying extremely large datasets by grouping a large number of features with similar values into a much smaller number of homogeneous categories

Overall, clustering is useful whenever diverse and varied data can be exemplified by a much smaller number of groups. It results in meaningful and actionable structures within data that reduce complexity and provide insight into patterns of relationships.

Clustering as a machine learning task

Clustering is somewhat different from the classification, numeric prediction, and pattern detection tasks we've examined so far. In each of these cases, the result is a model that relates features to an outcome or features to other features; the model identifies patterns within data. In contrast, clustering creates new data. Unlabeled examples are given a cluster label that is inferred entirely from the relationships within the data. For this reason, you will sometimes see the clustering task referred to as unsupervised classification because, in a sense, it classifies unlabeled examples.

The catch is that the class labels obtained from an unsupervised classifier are without intrinsic meaning. Clustering will tell you which groups of examples are closely related—for instance, it might return groups A, B, and C—but it's up to you to apply an actionable and meaningful label. To see how this impacts the clustering task, let's consider a hypothetical example.

Suppose you were organizing a conference on the topic of data science. To facilitate professional networking and collaboration, you planned to seat people in groups according to one of three research specialties: computer and/or database science, math and statistics, and machine learning. Unfortunately, after sending out the conference invitations, you realize that you had forgotten to include a survey asking the discipline the attendee would prefer to be seated with.

In a stroke of brilliance, you realize that you might be able to infer each scholar's research specialty by examining his or her publication history. Toward this end, you begin collecting data on the number of articles each attendee published in computer science-related journals and the number of articles published in math or statistics-related journals. Using the data collected for several scholars, you create a scatterplot:

[Figure: scatterplot of each attendee's computer science and math/statistics publication counts]

As expected, there seems to be a pattern here. We might guess that the upper-left corner, which represents people with many computer science publications but few articles on math, could be a cluster of computer scientists. Following this logic, the lower-right corner might be a group of mathematicians. Similarly, the upper-right corner, those with both math and computer science experience, may be machine learning experts.

Rather than defining the group boundaries subjectively, it would be nice to use machine learning to define them objectively. Given the axis-parallel splits in the previous figure, this problem seems like an obvious application for decision trees as described in Chapter 5, Divide and Conquer – Classification Using Decision Trees and Rules. This would provide us with a clean rule like "if the scholar has few math publications, then he/she is a computer science expert." Unfortunately, there's a problem with this plan. As we do not know the true class value for each point, we cannot deploy supervised learning algorithms.

Our groupings were formed visually; we simply identified clusters as closely grouped data points. In spite of the seemingly obvious groupings, we have no way to know whether they are truly homogeneous without personally asking each scholar about his/her academic specialty. The labels we applied required us to make qualitative, presumptive judgments about the types of people that would fall into the group. For this reason, you might imagine the cluster labels in uncertain terms, as follows:

[Figure: the same scatterplot with tentatively labeled groups of computer scientists, mathematicians, and machine learning experts]

Clustering algorithms use a process very similar to what we did by visually inspecting the scatterplot. Using a measure of how closely the examples are related, they can be assigned to homogeneous groups. In the next section, we'll start looking at how clustering algorithms are implemented.

Tip

This example highlights an interesting application of clustering. If you begin with unlabeled data, you can use clustering to create class labels. From there, you could apply a supervised learner such as decision trees to find the most important predictors of these classes. This is called semi-supervised learning.

The k-means algorithm for clustering

The k-means algorithm is perhaps the most often used clustering method. Having been studied for several decades, it serves as the foundation for many more sophisticated clustering techniques. If you understand the simple principles it uses, you will have the knowledge needed to understand nearly any clustering algorithm in use today. Many such methods are listed on the following site, the CRAN task view for clustering:

http://cran.r-project.org/web/views/Cluster.html

Note

As k-means has evolved over time, there are many implementations of the algorithm. One popular approach is described in A k-means clustering algorithm in Applied Statistics, Vol. 28, pp. 100-108, by Hartigan, J.A. and Wong, M.A. (1979).

Even though clustering methods have advanced since the inception of k-means, that does not suggest that it is obsolete. In fact, the method may be more popular now than ever. The following table lists some reasons why k-means is still used widely:

Strengths

  • Uses simple principles for identifying clusters which can be explained in non-statistical terms
  • It is highly flexible and can be adapted to address nearly all of its shortcomings with simple adjustments
  • It is fairly efficient and performs well at dividing the data into useful clusters

Weaknesses

  • It is less sophisticated than more recent clustering algorithms
  • Because it uses an element of random chance, it is not guaranteed to find the optimal set of clusters
  • Requires a reasonable guess as to how many clusters naturally exist in the data

If the name k-means sounds familiar to you, you may be recalling the kNN algorithm presented in Chapter 3, Lazy Learning – Classification Using Nearest Neighbors. As you will soon see, k-means shares more in common with k-nearest neighbors than just the letter k.

The k-means algorithm involves assigning each of the n examples to one of the k clusters, where k is a number that has been defined ahead of time. The goal is to minimize the differences within each cluster and maximize the differences between clusters.

Unless k and n are extremely small, it is not feasible to compute the optimal clusters across all possible combinations of examples. Instead, the algorithm uses a heuristic process that finds locally optimal solutions. Put simply, this means that it starts with an initial guess for the cluster assignments, and then modifies the assignments slightly to see whether the changes improve the homogeneity within the clusters.

We will cover the process in depth shortly, but the algorithm essentially involves two phases. First, it assigns examples to an initial set of k clusters. Then, it updates the assignments by adjusting the cluster boundaries according to the examples that currently fall into the cluster. The process of updating and assigning occurs several times until making changes no longer improves the cluster fit. At this point, the process stops and the clusters are finalized.
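
To make these two phases concrete, the following is a deliberately minimal R sketch of the idea. It is not the implementation behind R's built-in kmeans() function, which we will use later in this chapter; it assumes for simplicity that no cluster ever ends up empty, and it recomputes a full distance matrix on each pass, so it is only suitable for small datasets. The function name simple_kmeans is our own invention.

simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # phase 1: choose k random examples to serve as the initial cluster centers
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(0, nrow(x))
  for (i in seq_len(max_iter)) {
    # assign each example to its nearest center using Euclidean distance
    d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]
    new_assignment <- apply(d, 1, which.min)
    # stop once an assignment pass changes nothing
    if (identical(assignment, new_assignment)) break
    assignment <- new_assignment
    # phase 2: move each center to the mean (centroid) of its assigned examples
    for (j in seq_len(k)) {
      centers[j, ] <- colMeans(x[assignment == j, , drop = FALSE])
    }
  }
  list(cluster = assignment, centers = centers)
}

In practice there is no need to write this yourself; the sketch simply shows how little machinery the assign-and-update cycle requires.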

Tip

Due to the heuristic nature of k-means, you may end up with somewhat different final results by making only slight changes to the starting conditions. If the results vary dramatically, this could indicate a problem. For instance, the data may not have natural groupings or the value of k has been poorly chosen. For this reason, it's a good idea to try a cluster analysis more than once to test the robustness of your findings.

To see how the process of assigning and updating works in practice, let's revisit the example data for the data science conference. Though this is a simple example, it will illustrate the basics of how k-means operates under the hood.

Using distance to assign and update clusters

As with kNN, k-means treats feature values as coordinates in a multidimensional feature space. For the conference data, there are only two features, so we can represent the feature space as a two-dimensional scatterplot, as depicted previously.

The k-means algorithm begins by choosing k points in the feature space to serve as the cluster centers. These centers are the catalyst that spurs the remaining examples to fall into place. Often, the points are chosen by selecting k random examples from the training dataset. Because we hope to identify three clusters, k = 3 points are selected. These points are indicated by the star, triangle, and diamond in the following figure:

[Figure: the scatterplot with three randomly chosen initial cluster centers marked by a star, a triangle, and a diamond]

There are several other ways to choose the initial cluster centers. One option is to choose random values occurring anywhere in the feature space (rather than only selecting among values observed in the data). Another option is to skip this step altogether; by randomly assigning each example to a cluster, the algorithm can jump ahead immediately to the update phase. Each of these approaches adds a particular bias to the final set of clusters, which you may be able to use to tailor your results.

After choosing the initial cluster centers, the other examples are assigned to the cluster center that is most similar or nearest according to the distance function. You will remember that we studied distance functions while learning about kNN. Traditionally, k-means uses Euclidean distance, but Manhattan distance or Minkowski distance are also sometimes used.

Recall that if n indicates the number of features, the formula for Euclidean distance between example x and example y is as follows:

dist(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2)

For instance, if we are comparing an event guest with five computer science publications and one math publication to a guest with zero computer science papers and two math papers, we could compute this in R as:

> sqrt((5 - 0)^2 + (1 - 2)^2)
[1] 5.09902

Using this distance function, we find the distance between each example and each cluster center. The example is then assigned to the nearest cluster center.
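
To see this in miniature, the following snippet computes the distance from the five-publication, one-publication guest used above to three hypothetical cluster centers; the center coordinates here are made up purely for illustration:

> guest <- c(cs = 5, math = 1)        # the conference guest from the example above
> centers <- rbind(A = c(1, 8),       # three hypothetical cluster centers
                   B = c(6, 6),
                   C = c(7, 1))
> dists <- apply(centers, 1, function(ctr) sqrt(sum((guest - ctr)^2)))
> dists              # roughly 8.06, 5.10, and 2.00
> which.min(dists)   # the smallest distance is to center C, so the guest joins that cluster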

Tip

Keep in mind that because we are using distance calculations, all data need to be numeric, and the values should be normalized to a standard range ahead of time. The methods presented in Chapter 3, Lazy Learning – Classification Using Nearest Neighbors, will prove helpful here.

As shown in the following figure, the three cluster centers partition the examples into three segments labeled Cluster A, B, and C. The dashed lines indicate the boundaries of the Voronoi diagram created by the cluster centers. A Voronoi diagram indicates the areas that are closer to one cluster center than any other; the vertex where all three boundaries meet is equidistant from all three cluster centers. Using these boundaries, we can easily see the regions claimed by each of the initial k-means seeds:

[Figure: Voronoi diagram partitioning the examples into Cluster A, Cluster B, and Cluster C around the initial centers]

Now that the initial assignment phase has been completed, the k-means algorithm proceeds to the update phase. The first step of updating the clusters involves shifting the initial centers to a new location, known as the centroid, which is calculated as the mean value of the points currently assigned to that cluster. The following figure illustrates how the cluster centers shift to the new centroids:
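
For a single cluster, the centroid is nothing more than the column-wise mean of its members' coordinates. As a quick illustration with made-up points:

> cluster_a <- rbind(c(1, 8),
                     c(2, 6),
                     c(3, 7))   # hypothetical coordinates of Cluster A's current members
> colMeans(cluster_a)           # the new centroid: (2, 7)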

[Figure: the cluster centers shifted to their new centroids, adjusting the cluster boundaries]

Because the cluster boundaries have been adjusted according to the repositioned centers, Cluster A is able to claim an additional example from Cluster B (indicated by an arrow). Because of this reassignment, the k-means algorithm will continue through another update phase. After recalculating the centroids for the clusters, the figure looks like this:

[Figure: the clusters after the centroids are recalculated]

Two more points have been reassigned from Cluster B to Cluster A during this phase, as they are now closer to the centroid for A than B. This leads to another update as shown:

[Figure: the final update, after which no further points change clusters]

As no points were reassigned during this phase, the k-means algorithm stops. The cluster assignments are now final.

The learned clusters can be reported in one of two ways. First, you can simply report the cluster assignment for each example. Alternatively, you could report the coordinates of the cluster centroids after the final update. Either report lets you recover the other: the centroids can be calculated from the assignments, and the assignments can be recreated by placing each example in the cluster of its nearest centroid.
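
As a small preview (using simulated publication data, since we do not have the real conference numbers), the object returned by the kmeans() function we will meet later in this chapter stores both forms of output:

> pubs <- data.frame(cs = rnorm(30, mean = 5, sd = 2),
                     math = rnorm(30, mean = 5, sd = 2))   # simulated data
> km <- kmeans(pubs, centers = 3)
> km$cluster   # option 1: the cluster label assigned to each example
> km$centers   # option 2: the coordinates of each final centroid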

Choosing the appropriate number of clusters

In the introduction to k-means, we learned that the algorithm can be sensitive to randomly chosen cluster centers. Indeed, if we had selected a different combination of three starting points in the previous example, we may have found clusters that split the data differently from what we had expected.

Tip

Choosing the number of clusters requires a delicate balance. Setting k very large will improve the homogeneity of the clusters, but at the same time it risks overfitting the data.

Ideally, you will have some a priori knowledge (that is, a prior belief) about the true groupings, and you can begin applying k-means using this information. For instance, if you were clustering movies, you might begin by setting k equal to the number of genres considered for the Academy Awards. In the data science conference seating problem that we worked through previously, k might reflect the number of academic fields of study that were invited.

Sometimes the number of clusters is dictated by business requirements or the motivation for the analysis. For example, the number of tables in the meeting hall could dictate how many groups of people should be created from the data science attendee list. Extending this idea to a business case, if the marketing department only has resources to create three distinct advertising campaigns, it might make sense to set k = 3 to assign all the potential customers to one of the three appeals.

Without any a priori knowledge at all, one rule of thumb suggests setting k equal to the square root of (n / 2), where n is the number of examples in the dataset. However, this rule of thumb is likely to result in an unwieldy number of clusters for large datasets. Luckily, there are other statistical methods that can assist in finding a suitable k-means cluster set.
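
As a quick check of just how unwieldy, here is the rule applied to the 30,000-record dataset we cluster later in this chapter:

> sqrt(30000 / 2)
[1] 122.4745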

A technique known as the elbow method attempts to gauge how the homogeneity or heterogeneity within the clusters changes for various values of k. As illustrated in the following figures, the homogeneity within clusters is expected to increase as additional clusters are added; similarly, heterogeneity will continue to decrease with more clusters. Because you could continue to see improvements until each example is in its own cluster, the goal is not to maximize homogeneity or minimize heterogeneity, but rather to find the k beyond which the improvements show diminishing returns. This value of k is known as the elbow point because the bend in the curve resembles an elbow.

[Figures: plots of within-cluster homogeneity and heterogeneity versus the number of clusters k, with the elbow point marked]

There are numerous statistics to measure homogeneity and heterogeneity within clusters that can be used with the elbow method (have a look at the following information box). Still, in practice, it is not always feasible to iteratively test a large number of k values. This is in part because clustering large datasets can be fairly time consuming; clustering the data repeatedly is even worse. Regardless, applications requiring the exact optimal set of clusters are fairly rare. In most clustering applications, it suffices to choose a k based on convenience rather than strict performance requirements.
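
If you do want to try the elbow method, one simple approach (a sketch, using the total within-cluster sum of squares reported by kmeans() as the heterogeneity measure, and assuming the standardized interests_z data frame we create later in this chapter) is to cluster the data for a range of k values and plot the results; be aware that clustering 30,000 records ten times can take a little while:

> wss <- sapply(1:10, function(k) {
    kmeans(interests_z, centers = k, nstart = 5)$tot.withinss
  })
> plot(1:10, wss, type = "b", xlab = "Number of clusters (k)",
       ylab = "Total within-cluster sum of squares")   # look for the bend in the curve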

Note

For a very thorough review of the vast assortment of measures of cluster performance, have a look at On clustering validation techniques, Journal of Intelligent Information Systems Vol. 17, pp. 107-145, by M. Halkidi, Y. Batistakis, and M. Vazirgiannis (2001).

The process of setting k itself can sometimes lead to interesting insights. By observing how the characteristics of the clusters change as k is varied, one might infer where the data have naturally defined boundaries. Groups that are more tightly clustered will change little, while less homogeneous groups will form and disband over time.

In general, it may be wise to spend little time worrying about getting k exactly right. The next example will demonstrate how even a tiny bit of subject-matter knowledge borrowed from a Hollywood film can be used to set k such that actionable and interesting clusters are found. Because clustering is unsupervised, the task is really about what you make of it—the insights you take away from the algorithm's findings.

Finding teen market segments using k-means clustering

Interacting with friends on social networking sites such as Facebook and MySpace has become a rite of passage for teenagers around the world. Having a relatively large amount of disposable income, these adolescents are a coveted demographic for businesses hoping to sell snacks, beverages, electronics, and hygiene products.

The many millions of teenage consumers browsing such sites have attracted the attention of marketers struggling to find an edge in an increasingly competitive market. One way to gain this edge is to identify segments of teenagers who share similar tastes, so that clients can avoid targeting advertisements to teens with no interest in the product being sold. For instance, a sports beverage is likely to be a difficult sell to teens with no interest in sports.

Given the text of teenagers' Social Networking Service (SNS) pages, we can identify groups that share common interests such as sports, religion, or music. Clustering can automate the process of discovering the natural segments in this population. However, it will be up to us to decide whether or not the clusters are interesting and how we can use them for advertising. Let's try this process from start to finish.

Step 1 – collecting data

For this analysis, we will be using a dataset representing a random sample of 30,000 U.S. high school students who had profiles on a well-known SNS in 2006. To protect the users' anonymity, the SNS will remain unnamed. However, at the time the data was collected, the SNS was a popular web destination for U.S. teenagers. Therefore, it is reasonable to assume that the profiles represent a fairly wide cross section of American adolescents in 2006.

Tip

This dataset was compiled while conducting sociological research on teenage identities at the University of Notre Dame. If you use the data for research purposes, please cite this book chapter. The full dataset is available at Packt Publishing's website with the filename snsdata.csv. To follow along interactively, this chapter assumes you have saved this file to your R working directory.

The data was sampled evenly across four high school graduation years (2006 through 2009), representing the senior, junior, sophomore, and freshman classes at the time of data collection. Using an automated web crawler, the full text of the SNS profiles was downloaded, and each teen's gender, age, and number of SNS friends were recorded.

A text mining tool was used to divide the remaining SNS page content into words. From the top 500 words appearing across all pages, 36 words were chosen to represent five categories of interests, namely extracurricular activities, fashion, religion, romance, and antisocial behavior. The 36 words include terms such as football, sexy, kissed, bible, shopping, death, and drugs. The final dataset indicates, for each person, how many times each word appeared in the person's SNS profile.

Step 2 – exploring and preparing the data

We can use the default settings of read.csv() to load the data into a data frame:

> teens <- read.csv("snsdata.csv")

Let's also take a quick look at the specifics of the data. The first several lines of the str() output are as follows:

> str(teens)
'data.frame':  30000 obs. of  40 variables:
 $ gradyear    : int  2006 2006 2006 2006 2006 2006 2006 2006 ...
 $ gender      : Factor w/ 2 levels "F","M": 2 1 2 1 NA 1 1 2 ...
 $ age         : num  19 18.8 18.3 18.9 19 ...
 $ friends     : int  7 0 69 0 10 142 72 17 52 39 ...
 $ basketball  : int  0 0 0 0 0 0 0 0 0 0 ...

As we had expected, the data include 30,000 teenagers with four variables indicating personal characteristics and 36 words indicating interests.

Do you notice anything strange around the gender variable? If you were looking carefully, you may have noticed NA, which is out of place compared to the 1 and 2 values. This is R's way of telling us that the record has a missing value—we do not know the person's gender. Until now, we haven't dealt with missing data, but it can be a significant problem for many types of analyses.

Let's see how substantial this problem is. One option is to use the table() command, as follows:

> table(teens$gender)
    F     M 
22054  5222

Although this tells us how many F and M values are present, the table() function excluded the NA values rather than treating them as a separate category. To include the NA values (if there are any), we simply need to add an additional parameter:

> table(teens$gender, useNA = "ifany")
    F     M  <NA> 
22054  5222  2724

Here, we see that 2,724 records (9 percent) have missing gender data. Interestingly, there are over four times as many females as males in the SNS data, suggesting that males are not as inclined to use SNSs as females.

If you examine the other variables in the teens data frame, you will find that besides the gender variable, only age has missing values. For numeric data, the summary() command tells us the number of missing NA values:

> summary(teens$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  3.086  16.310  17.290  17.990  18.260 106.900    5086

A total of 5,086 records (17 percent) have missing values for age. Also concerning is the fact that the minimum and maximum values seem suspect; it is unlikely that a 3-year-old or a 106-year-old is attending high school. To ensure that these extreme values don't cause problems for the analysis, we'll need to clean them up before moving on.

A reasonable range of ages for high school students includes those who are at least 13 years old and not yet 20 years old. Any age value falling outside this range will be treated the same as missing data—we cannot trust the age provided. To recode the age variable, we can use the ifelse() function, assigning teen$age the value of teen$age if the age is at least 13 and less than 20 years; otherwise, it will receive the value NA:

> teens$age <- ifelse(teens$age >= 13 & teens$age < 20,
                        teens$age, NA)

By rechecking the summary() output, we see that the age range now follows a distribution that looks much more like an actual high school:

> summary(teens$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  13.03   16.30   17.26   17.25   18.22   20.00    5523

Unfortunately, now we've created an even larger missing data problem. We'll need to find a way to deal with these values before continuing with our analysis.

Data preparation – dummy coding missing values

An easy solution for handling missing values is to exclude any record with a missing value. However, if you think through the implications of this practice, you might think twice before doing so. (I said it was easy, I never said it was a good idea!) The problem with this approach is that even if the missingness is not extensive, you can very quickly start to exclude large portions of data.

For example, suppose that in our data the people with NA values for gender are entirely distinct from those with missing age data. This would imply that excluding everyone missing either gender or age would remove 26 percent of the data (9 percent plus 17 percent), or over 7,500 records. And this is for missing data on only two variables! The larger the number of missing values present in a dataset, the more likely it is that any given record will be excluded. Fairly soon, you would be left with a tiny subset of data, or worse, the remaining records would be systematically different from, and non-representative of, the full population.
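
If you would like to check how much of our own data would be lost this way, a quick sketch is to count the records missing either value (the exact figure will reflect the age recoding we performed earlier, so it may not match the hypothetical 26 percent exactly):

> sum(is.na(teens$gender) | is.na(teens$age))    # records missing gender, age, or both
> mean(is.na(teens$gender) | is.na(teens$age))   # the same figure as a proportion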

An alternative solution for categorical data like gender is to treat a missing value as a separate category. For instance, rather than limiting to female and male, we can add an additional level for "unknown." At the same time, we should also utilize dummy coding, which is covered in more depth in Chapter 3, Lazy Learning – Classification Using Nearest Neighbors, to transform the nominal gender variable into a numeric form that can be used for distance calculations.

Dummy coding involves creating a separate binary 1 or 0 valued dummy variable for each level of a nominal feature except one, which is held out to serve as the reference group. The reason one category can be excluded is because it can be inferred from the other categories. For instance, if someone is not female and not unknown gender, they must be male. Therefore, we need to only create dummy variables for female and unknown gender:

> teens$female <- ifelse(teens$gender == "F" &
                           !is.na(teens$gender), 1, 0)
> teens$no_gender <- ifelse(is.na(teens$gender), 1, 0)

The first statement assigns teens$female the value 1 if gender is equal to F and is not NA; otherwise, it assigns the value 0. The is.na() function tests whether gender is NA. If is.na() returns TRUE, then the teens$no_gender variable is assigned 1; otherwise, it is assigned the value 0. To confirm that we did the work correctly, let's compare our constructed dummy variables to the original gender variable:

> table(teens$gender, useNA = "ifany")
    F     M  <NA> 
22054  5222  2724 
> table(teens$female, useNA = "ifany")
    0     1 
 7946 22054 
> table(teens$no_gender, useNA = "ifany")
    0     1 
27276  2724

The number of 1 values for teens$female and teens$no_gender matches the number of F and NA values in the initial coding, so we should be able to trust our work.

Data preparation – imputing missing values

Next, let's eliminate the 5,523 missing values on age. As age is numeric, it doesn't make sense to create an additional category for unknown values—where would you rank "unknown" relative to the other age values? Instead, we'll use a different strategy known as imputation, which involves filling in the missing data with a guess as to what the true value really is.

Can you think of a way we might be able to use the SNS data to make an educated guess about a teenager's age? If you thought about using the graduation year, you've got the right idea. Most people in a graduation cohort were born within a single calendar year. If we can identify the typical age for each cohort, we will have a fairly reasonable estimate of the age of a student in that graduation year.

One way to find a typical value is by calculating the average, or mean, value. If we try to apply the mean() function as we have done for previous analyses, there's a problem:

> mean(teens$age)
[1] NA

The issue is that the mean value is undefined for a vector containing missing data. As age contains missing values, mean(teens$age) returns a missing value. We can correct this by adding an additional parameter to remove the missing values before calculating the mean:

> mean(teens$age, na.rm = TRUE)
[1] 17.25243

This reveals that the average student in our data is about 17 years old. This only gets us part of the way there; we actually need the average age for each graduation year. You might be tempted to calculate the mean four times, but one of the benefits of R is that there's usually a more efficient way. In this case, the aggregate() function is the tool for the job. It computes statistics for subgroups of data. Here, it calculates the mean age for levels of gradyear after removing the NA values:

> aggregate(data = teens, age ~ gradyear, mean, na.rm = TRUE)
  gradyear      age
1     2006 18.65586
2     2007 17.70617
3     2008 16.76770
4     2009 15.81957

The mean age differs by roughly one year per change in graduation year. This is not at all surprising, but it is a helpful confirmation that the data are reasonable.

The aggregate() output is in a data frame which is human-readable but requires extra work to merge back onto our original data. As an alternative, we can use the ave() function, which returns a vector with the group means repeated such that the result is equal in length to the original vector:

> ave_age <- ave(teens$age, teens$gradyear, FUN =
                  function(x) mean(x, na.rm = TRUE))

To impute these means onto the missing values, we need one more ifelse() call to use the ave_age value only if the original age value was NA:

> teens$age <- ifelse(is.na(teens$age), ave_age, teens$age)

The summary() results show that the missing values have now been eliminated:

> summary(teens$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  13.03   16.28   17.24   17.24   18.21   20.00

With the data ready for analysis, we are ready to dive into the interesting part of this project. Let's see if our efforts have paid off.

Step 3 – training a model on the data

To cluster the teenagers into marketing segments, we will use an implementation of k-means in the stats package, which should be included in your R installation by default. If by chance you do not have this package, you can install it as you would any other package and load it using the library(stats) command. Although there is no shortage of k-means functions available in various R packages, the kmeans() function in the stats package is widely used and provides a vanilla implementation of the algorithm.


The kmeans() function requires a data frame containing only numeric data and a parameter specifying the desired number of clusters. If you have these two things ready, the actual process of building the model is simple. The trouble is that choosing the right combination of data and clusters can be a bit of an art; sometimes a great deal of trial and error is involved.

We'll start our cluster analysis by considering only the 36 features that represent the number of times various interests appeared on the SNS profiles of teens. For convenience, let's make a data frame containing only these features:

> interests <- teens[5:40]

A common practice employed prior to any analysis using distance calculations is to normalize or z-score standardize the features such that each utilizes the same scale. By doing so, you can avoid a problem in which some features come to dominate solely because they tend to have larger values than others.

If you recall from Chapter 3, Lazy Learning – Classification Using Nearest Neighbors, z-score standardization rescales features such that they have a mean of zero and a standard deviation of one. This transformation changes the interpretation of the feature in a way that may be useful here. Specifically, if someone mentions football a total of 3 times on their profile, without additional information, we have no idea whether this implies they like football more or less than their peers. On the other hand, if the z-score value is 3, we know that they mentioned football many more times than the average teenager.

To apply z-score standardization to the interests data frame, we can use the scale() function with lapply(), as follows:

> interests_z <- as.data.frame(lapply(interests, scale))

Note

As lapply() returns a list, the result must be converted back into a data frame using the as.data.frame() function.
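
If you would like to confirm that the transformation worked, a quick sanity check on any single column (basketball is used here only as an example) is to verify that its mean is essentially zero and its standard deviation is one:

> mean(interests_z$basketball)   # a very small number, effectively zero
> sd(interests_z$basketball)     # should be one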

Our last decision involves deciding how many clusters to use for segmenting the data. If we use too many clusters, we may find them too specific to be useful; conversely, choosing too few may result in heterogeneous groupings. You should feel comfortable experimenting with the values of k. If you don't like the result, you can easily try another value and start over.

Tip

Choosing the number of clusters is easier if you are familiar with the analysis population. Having a hunch about the true number of natural groupings can save you some trial and error.

To help us predict the number of clusters in the data, I'll defer to one of my favorite films, the 1985 John Hughes coming-of-age comedy, The Breakfast Club. The high-school-age characters in this movie are identified in terms of five stereotypes: a Brain, an Athlete, a Basket Case, a Princess, and a Criminal. Given that these identities prevail throughout popular teen fiction, five seems like a reasonable starting point for k.

To divide teens into five clusters, we can use the following command:

> teen_clusters <- kmeans(interests_z, 5)

This saves the result of the k-means clustering in an object named teen_clusters.
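
Because kmeans() chooses its starting centers at random, running the same command again will likely produce somewhat different clusters. If you want your results to be repeatable, one option (a sketch; the seed value itself is arbitrary) is to fix R's random seed beforehand, and optionally use the nstart parameter to run several random starts and keep the best solution:

> set.seed(2345)   # any fixed value works; 2345 is arbitrary
> teen_clusters <- kmeans(interests_z, 5, nstart = 10)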

Step 4 – evaluating model performance

Evaluating the results of clustering can be somewhat subjective. Ultimately, the success or failure of the model hinges on whether the clusters are useful for their intended purpose. As the goal of this analysis was to identify clusters of teenagers with similar interests for marketing purposes, we will largely measure our success in qualitative terms. For other clustering applications, more quantitative measures of success may be needed.

One of the most basic ways to evaluate the utility of a set of clusters is to examine the number of examples falling in each of the groups. If the groups are too large or too small, then they are not likely to be very useful. To obtain the size of the kmeans() clusters, use the teen_clusters$size component as follows:

> teen_clusters$size
[1]  3376   601  1036  3279 21708

Here we see the five clusters we requested. The smallest is 601 teenagers (2 percent) while the largest is 21,708 (72 percent). Although the large cluster is slightly concerning, without examining it more carefully, we will not know whether it indicates a problem or not. In good news, we did not find any clusters containing only a single person, which can happen occasionally with k-means.

Tip

Given the random nature of k-means, do not be alarmed if your results differ from those shown here. Instead, consider it as an opportunity to apply your analytic skills to the unique result you obtained!

For a more in-depth look at the clusters, we can examine the coordinates of the cluster centroids using the teen_clusters$centers component, which is as follows for the first eight features:

> teen_clusters$centers
   basketball    football      soccer    softball
1  0.02447191  0.10550409  0.04357739 -0.02411100
2 -0.09442631  0.06927662 -0.09956009 -0.04697009
3  0.37669577  0.38401287  0.14650286  0.15136541
4  1.12232737  1.03625113  0.53915320  0.87051183
5 -0.18869703 -0.19317864 -0.09245172 -0.13366478

   volleyball    swimming cheerleading    baseball
1  0.04803724  0.31298181   0.63868578 -0.03875155
2 -0.07806216  0.04578401  -0.10703701 -0.11182941
3  0.09157715  0.24413955   0.18678448  0.28545186
4  0.78664128  0.11992750   0.01325191  0.86858544
5 -0.12850235 -0.07970857  -0.10728007 -0.13570044

The rows of the output (numbered 1 to 5) refer to the clusters, while the numbers in the output indicate the average value for the interest listed at the top of the column. As the values are z-score standardized, negative values are below the overall mean for all students and positive values are above the mean.

Given only these eight interests, we can already infer some characteristics of the clusters. Cluster 4 is substantially above the mean on all the listed sports except cheerleading, suggesting that this group may include athletes. Cluster 1 includes the most mentions of cheerleading and is above the average in football interest.

By continuing to examine the clusters in this way, it's possible to construct a table listing the dominant interests of each of the groups. In the following figure, each cluster is shown with the features that most distinguish it from the other clusters. Interestingly, Cluster 5 is distinguished by the fact that it is unremarkable; its members had lower-than-average levels of interest in every measured activity. It is also the single largest group in terms of the number of members. One potential explanation is that these users created a profile on the website but never posted any interests.

[Figure: a table listing each cluster's most distinguishing interests]
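
If you would like to build such a table yourself, one approach (a sketch; the choice of five interests per cluster is arbitrary) is to sort each row of the centroid matrix and keep the interests with the largest standardized means:

> apply(teen_clusters$centers, 1, function(ctr) {
    names(sort(ctr, decreasing = TRUE))[1:5]
  })

Each column of the result lists the five highest-scoring interests for one of the five clusters.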

Tip

When sharing the results of a segmentation analysis, it is often helpful to apply informative labels that capture the essence of the groups like The Breakfast Club typology applied here. The risk in adding such labels is that they can obscure the groups' nuances by stereotyping the group members.

Given the table, a marketing executive would have a clear depiction of five types of teenage visitors to the social networking website. Based on these profiles, the executive could sell targeted advertising impressions to businesses with products relevant to one or more of the clusters. In the next section, we will see how the cluster labels can be applied back to the original population for such uses.

Step 5 – improving model performance

Because clustering creates new information, the performance of a clustering algorithm depends at least somewhat on both the quality of the clusters themselves as well as what you do with that information. In the prior section, we already demonstrated that the five clusters provided useful and novel insights into the interests of teenagers; by that measure, the algorithm appears to be performing quite well. Therefore, we can now focus our effort on turning these insights into action.

We'll begin by applying the clusters back onto the full dataset. When the k-means clusters were created, the kmeans() function stored the cluster assignments for all 30,000 people in the sample in a component named cluster (accessible as teen_clusters$cluster). We can add this as a column on the teens data frame using the following command:

> teens$cluster <- teen_clusters$cluster

Given this information, we can determine which cluster each user has been assigned to. For example, here's the personal information for the first five users in the SNS data:

> teens[1:5, c("cluster", "gender", "age", "friends")]
  cluster gender    age friends
1       5      M 18.982       7
2       1      F 18.801       0
3       5      M 18.335      69
4       5      F 18.875       0
5       3   <NA> 18.995      10

Using the aggregate() function we had used before, we can also look at the demographic characteristics of the clusters overall. The mean age does not vary much by cluster, although we wouldn't necessarily think that interests should systematically differ by age. This is depicted as follows:

> aggregate(data = teens, age ~ cluster, mean)
  cluster      age
1       1 16.99678
2       2 17.38765
3       3 17.10022
4       4 17.09634
5       5 17.29841

On the other hand, there are some notable differences in the proportion of females by cluster. This is an interesting finding, as we didn't use gender data to create the clusters, yet the clusters are still very predictive of gender:

> aggregate(data = teens, female ~ cluster, mean)
  cluster    female
1       1 0.8942536
2       2 0.7221298
3       3 0.8001931
4       4 0.7130223
5       5 0.7109821

Recall that overall about 74 percent of the SNS users are female. Cluster 1, the so-called Princesses, is nearly 90 percent female, while Clusters 2, 4, and 5 are only about 70 percent female.

Given our success in predicting gender, you might also suspect that the clusters are predictive of the number of friends the users have. This hypothesis seems to be supported by the data, which is as follows:

> aggregate(data = teens, friends ~ cluster, mean)
  cluster  friends
1       1 38.74733
2       2 32.88186
3       3 30.57046
4       4 36.14029
5       5 27.85314

On average, Princesses have the most friends (38.7), followed by Athletes (36.1) and Brains (32.9). Criminals have only 30.6, while Basket Cases have 27.9. As with gender, this finding is remarkable given that we did not use the number of friends as an input to the clustering algorithm.

The association among group membership, gender, and number of friends suggests that the clusters can be useful predictors. Validating their predictive ability in this way may make the clusters an easier sell when they are pitched to the marketing team, ultimately improving the performance of the algorithm.
