Chapter 3. Finding Patterns in the Noise – Clustering and Unsupervised Learning

One of the natural questions to ask about a dataset is whether it contains groups. For example, if we examine financial market data consisting of stock price fluctuations over time, are there groups of stocks that fall and rise with a similar pattern? Similarly, for a set of customer transactions from an e-commerce business, we might ask whether there are groups of user accounts distinguished by patterns of similar purchasing activity. By identifying groups of related items using the methods described in this chapter, we can understand data as a set of general patterns rather than just individual points. These patterns can help in making high-level summaries at the outset of a predictive modeling project, or as an ongoing way to report on the shape of the data we are modeling. The groupings produced can serve as insights themselves, or they can provide starting points for the models we will cover in later chapters. For example, the group to which a datapoint is assigned can become a feature of this observation, adding additional information beyond its individual values. Additionally, we can potentially calculate statistics (such as mean and standard deviation) of item features within these groups, which may be more robust as model features than individual entries.

In contrast to the methods we will discuss in later chapters, grouping or clustering algorithms are known as unsupervised learning, meaning we have no response value, such as a sale price or click-through rate, which is used to determine the optimal parameters of the algorithm. Rather, we identify similar datapoints using only the features, and as a secondary analysis might ask whether the clusters we identify share a common pattern in their responses (and thus suggest the cluster is useful in finding groups associated with the outcome we are interested in).

The task of finding these groups, or clusters, has a few common ingredients that vary between algorithms. One is a notion of distance or similarity between items in the dataset, which will allow us to quantitatively compare them. A second is the number of groups we wish to identify; this can be specified initially using domain knowledge, or determined by running an algorithm with different numbers of clusters. We can then identify the best number of clusters that describes a dataset through statistics, such as examining numerical variance within the groups determined by the algorithm, or by visual inspection. In this chapter we will dive into:

  • How to normalize data for use in a clustering algorithm and to compute similarity measurements for both categorical and numerical data
  • How to use k-means clustering to identify an optimal number of clusters by examining the squared error function
  • How to use agglomerative clustering to identify clusters at different scales
  • How to use affinity propagation to automatically identify the number of clusters in a dataset
  • How to use spectral methods to cluster data with nonlinear relationships between points

Similarity and distance metrics

The first step in clustering any new dataset is to decide how to compare the similarity (or dissimilarity) between items. Sometimes the choice is dictated by what kinds of similarity we are most interested in measuring; in other cases it is restricted by the properties of the dataset. In the following sections we illustrate several kinds of distance for numerical, categorical, time series, and set-based data—while this list is not exhaustive, it should cover many of the common use cases you will encounter in business analysis. We will also cover normalizations that may be needed for different data types prior to running clustering algorithms.

Numerical distance metrics

Let us begin by exploring the data in the wine.data file. It contains a set of chemical measurements describing the properties of different kinds of wines, and the quality level (I-III) to which the wines are assigned (Forina, M., et al. PARVUS – An Extendible Package for Data Exploration, Classification and Correlation (1988)). Open the file in an IPython notebook and look at the first few rows with the following commands:

>>> import pandas as pd
>>> df = pd.read_csv("wine.data",header=None)
>>> df.head()
(Figure: the first rows of the wine dataset returned by df.head())

Notice that in this dataset we have no column descriptions, which makes the data hard to understand since we do not know what the features are. We need to parse the column names from the dataset description file wine.names, which, in addition to the column names, contains further information about the dataset. With the following code, we compile a regular expression that will match a column name (using the pattern that a number followed by a closing parenthesis is followed by a column name, as you can see in the list of column names in the file):

>>> import re
>>> expr = re.compile('.*[0-9]+\)\s?(\w+).*')

We then create an array where the first element is the class label of the wine (whether it belongs to quality category I-III).

>>> header_names = ['Class']

Iterating through the lines in the file, we extract those that match our regular expression:

>>> df_header = open("wine.names")
>>> for l in df_header.readlines():
...     if len(expr.findall(l.strip())) != 0:
...         header_names.append(expr.findall(l.strip())[0])
>>> df_header.close()

We then assign this list to the dataframe columns property, which contains the names of the columns:

>>> df.columns = header_names

Now that we have assigned the column names, we can look at a summary of the dataset using the df.describe() method:

(Figure: summary statistics of the wine dataset produced by df.describe())

Having performed some cleanup on the data, how can we calculate a similarity measurement between wines based on the information in each row? One option would be to consider each of the wines as a point in a thirteen-dimensional space specified by its measurements (each of the columns other than Class). Since the resulting space has thirteen dimensions, we cannot directly visualize the datapoints using a scatterplot to see if they are nearby, but we can calculate distances just the same as with a more familiar 2- or 3-dimensional space using the Euclidean distance formula, which is simply the length of the straight line between two points. The formula for this length can be used whether the points are in a 2-dimensional plot or a more complex space such as this example, and is given by:

$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

Here x and y are rows of the dataset and n is the number of columns. One important aspect of the Euclidean distance formula is that columns whose scale is much different from others can dominate the overall result of the calculation. In our example, the values describing the magnesium content of each wine are ~100x greater than the magnitude of features describing the alcohol content or ash percentage.

If we were to calculate the distance between these datapoints, it would largely be determined by the magnesium concentration (as even small differences on this scale overwhelmingly determine the value of the distance calculation), rather than any of its other properties. While this might sometimes be desirable (for example, if the column with the largest numerical value is the one we most care about for judging similarity), in most applications we do not favor one feature over another and want to give equal weight to all columns. To get a fair distance comparison between these points, we need to normalize the columns so that they fall into the same numerical range (have similar maximal and minimal values). We can do so using the scale() function in scikit-learn and the following commands, which use the array header_names we constructed previously to access all columns but the class label (the first element of the array):

>>> from sklearn import preprocessing
>>> df_normalized = pd.DataFrame(preprocessing.scale(df[header_names[1:]]))
>>> df_normalized.columns = header_names[1:]
>>> df_normalized.describe()
(Figure: summary statistics of the normalized wine features from df_normalized.describe())

This function will subtract the mean value of a column from each element and then divide each point by the standard deviation of the column. This normalization centers the data in each column at mean 0 with variance 1, and in the case of normally distributed data this results in a standard normal distribution. Also, note that the scale() function returns a numpy array, which is why we must wrap the output in a pandas DataFrame to use the pandas describe() method.
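As a quick sanity check (a sketch that assumes the df, header_names, and preprocessing objects created above), we can standardize a single column by hand and confirm that it matches the output of scale():

>>> import numpy as np
>>> col = df[header_names[1]]                      # the first feature column
>>> manual = (col - col.mean()) / col.std(ddof=0)  # ddof=0: scale() divides by the population std
>>> np.allclose(manual, preprocessing.scale(col))
True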

Now that we have scaled the data, we can calculate Euclidean distances between the rows using the following commands:

>>> import sklearn.metrics.pairwise as pairwise
>>> distances = pairwise.euclidean_distances(df_normalized)

You can verify that this command produces a square matrix of dimension 178 x 178 (the number of rows in the original dataset) with the following command:

>>> distances.shape

We have now converted our dataset of 178 rows and 13 columns into a square matrix giving the distance between each pair of rows. In other words, row i, column j in this matrix represents the Euclidean distance between rows i and j in our dataset. This 'distance matrix' is the input we will use for the clustering algorithms in the following sections.
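To make the contents of this matrix concrete, here is a quick check (a sketch using the objects defined above) that a single entry equals the Euclidean distance computed by hand between the corresponding rows:

>>> import numpy as np
>>> i, j = 0, 1   # any two row indices
>>> by_hand = np.sqrt(((df_normalized.iloc[i] - df_normalized.iloc[j]) ** 2).sum())
>>> np.isclose(by_hand, distances[i, j])
True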

If we want to get a sense of how the points are distributed relative to one another using a given distance metric, we can use multidimensional scaling (MDS)—(Borg, Ingwer, and Patrick J.F. Groenen. Modern Multidimensional Scaling: Theory and Applications. Springer Science & Business Media, 2005; Kruskal, Joseph B. Nonmetric multidimensional scaling: a numerical method. Psychometrika 29.2 (1964): 115-129; Kruskal, Joseph B. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika 29.1 (1964): 1-27)—to create a visualization. MDS attempts to find the set of lower dimensional coordinates (here, we will use two dimensions) that best represents the distances between points in the original, higher dimensions of a dataset (here, the pairwise Euclidean distances we calculated from the 13 dimensions). MDS finds the optimal 2-d coordinates according to the strain function:

$\text{Strain}(x_1, \dots, x_n) = \left( \dfrac{\sum_{i<j} \left( b_{ij} - \langle x_i, x_j \rangle \right)^2}{\sum_{i<j} b_{ij}^2} \right)^{1/2}$

Here the terms $b_{ij}$ are inner products derived (by double centering) from the matrix of distances D that we calculated between points. The 2-d coordinates that minimize this function are found using Singular Value Decomposition (SVD), which we will discuss in more detail in Chapter 6, Words and Pixels – Working with Unstructured Data. After obtaining the coordinates from MDS, we can then plot the results using the wine class to color points in the diagram. Note that the coordinates themselves have no interpretation (in fact, they could change each time we run the algorithm due to numerical randomness in the algorithm). Rather, it is the relative position of points that we are interested in:

>>> from sklearn.manifold import MDS
>>> mds_coords = MDS().fit_transform(distances)
>>> pd.DataFrame(mds_coords).plot(kind='scatter',x=1,y=0,color=df.Class[:],colormap='Reds')
(Figure: MDS plot of the wine dataset, with points colored by quality class)

Given that there are many ways we could have calculated the distance between datapoints, is the Euclidean distance a good choice here? Visually, based on the multidimensional scaling plot, we can see there is separation between the classes based on the features we have used to calculate distance, so conceptually it appears that this is a reasonable choice in this case. However, the decision also depends on what we are trying to compare. If we are interested in detecting wines with similar attributes in absolute values, then it is a good metric. However, what if we're not interested so much in the absolute composition of the wine, but whether its variables follow similar trends among wines with different alcohol contents? In this case, we would not be interested in the absolute difference in values, but rather the correlation between the columns. This sort of comparison is common for time series, which we consider next.

Correlation similarity metrics and time series

For time series data, we are often concerned with whether the patterns between series exhibit the same variation over time, rather than their absolute differences in value. For example, if we were to compare stocks, we might want to identify groups of stocks whose prices move up and down in similar patterns over time. The absolute price is of less interest than this pattern of increase and decrease. Let us look at an example using the variation in prices of stocks in the Dow Jones Industrial Average (DJIA) over time (Brown, Michael Scott, Michael J. Pelosi, and Henry Dirska. Dynamic-radius species-conserving genetic algorithm for the financial forecasting of Dow Jones index stocks. Machine Learning and Data Mining in Pattern Recognition. Springer Berlin Heidelberg, 2013. 27-41.). Start by importing the data and examining the first rows:

>>> df = pd.read_csv("dow_jones_index/dow_jones_index.data")
>>> df.head()
(Figure: the first rows of the Dow Jones index dataset returned by df.head())

This data contains the weekly stock prices (over a period of six months) for a set of 30 stocks. Because all of the numerical values (the prices) are on the same scale, we won't normalize this data as we did with the wine dimensions.

We notice two things about this data. First, the closing price per week (the variable we will use to calculate correlation) is presented as a string. Second, the date is not in the correct format for plotting. We will process both columns to fix this, converting these columns to a float and datetime object, respectively, using the following commands.

To convert the closing price to a number, we apply an anonymous function that strips the leading dollar sign and casts the remaining characters as a float:

>>> df.close = df.close.apply( lambda x: float(x[1:]))

To convert the date, we also use an anonymous function on each row of the date column, splitting the string on '/' and casting the year, month, and day elements as integers to use as arguments for a datetime object:

>>> import datetime
>>> df.date = df.date.apply(lambda x: datetime.datetime(int(x.split('/')[2]), int(x.split('/')[0]), int(x.split('/')[1])))

With this transformation, we can now make a pivot table (as we covered in Chapter 2, Exploratory Data Analysis and Visualization in Python) to place the closing prices for each week as columns and individual stocks as rows using the following commands:

>>> df_pivot = df.pivot('stock','date','close').reset_index()
>>> df_pivot.head()
(Figure: the pivoted data frame, with one row per stock and one column per week, from df_pivot.head())

As we can see, we only need the third column onward to calculate correlations between rows, as the first two columns are the index and the stock ticker symbol. Let us now calculate the correlation between these time series of stock prices by selecting the third through last columns of the data frame for each row, computing the pairwise correlations as our distance metric, and visualizing the result using MDS, as before:

>>> import numpy as np
>>> correlations = np.corrcoef(np.float64(np.array(df_pivot)[:,2:]))
>>> mds_coords = MDS().fit_transform(correlations)
>>> pd.DataFrame(mds_coords).plot(kind='scatter',x=1,y=0)
(Figure: MDS plot of the stocks based on Pearson correlation)

It is important to note that the Pearson coefficient, which we have calculated here, is a measure of linear correlation between these time series. In other words, it captures the linear increase (or decrease) of the trend in one price relative to another, but won't necessarily capture nonlinear trends (such as a parabola or sigmoidal pattern). We can see this by looking at the formula for the Pearson correlation, which is given by:

$\rho_{a,b} = \dfrac{\operatorname{cov}(a, b)}{\sigma_a \sigma_b} = \dfrac{\operatorname{E}\left[ (a - \mu_a)(b - \mu_b) \right]}{\sigma_a \sigma_b}$

Here μ and σ represent the mean and standard deviation of series a and b. This value varies from 1 (highly correlated) to -1 (inversely correlated), with 0 representing no correlation (such as a spherical cloud of points). You might recognize the numerator of this equation as the covariance, which is a measure of how much two datasets, a and b, vary in sync with one another. You can understand this by considering that the numerator is maximized when corresponding points in both datasets are above or below their mean value. However, whether this accurately captures the similarity in the data depends upon the scale. In data that is distributed in regular intervals between a maximum and minimum, with roughly the same difference between consecutive values, it captures this pattern well. However, consider a case in which the data is exponentially distributed, with orders of magnitude differences between the minimum and maximum, and the difference between consecutive datapoints also varying widely. Here, the Pearson correlation would be numerically dominated by only the largest values in the series, which might or might not represent the overall similarity in the data. This numerical sensitivity also occurs in the denominator, which represents the product of the standard deviations of both datasets. The value of the correlation is maximized when the variation in the two datasets is roughly explained by the product of their individual variations; there is no left-over variation between the datasets that is not explained by their respective standard deviations. By extracting data for the first two stocks in this collection and plotting their pairwise values, we can see that this assumption of linearity appears to be a valid one for comparing datapoints:

>>> df_pivot.iloc[0:2].transpose().iloc[2:].plot(kind='scatter',x=0,y=1)
(Figure: scatterplot of the weekly closing prices of the first two stocks against each other)

In addition to verifying that these stocks have a roughly linear correlation, this command introduces some new functions in pandas you may find useful. The first is iloc, which allows you to select indexed rows from a dataframe. The second is transpose, which inverts the rows and columns. Here, we select the first two rows, transpose, and then select all rows (prices) after the second (since the first is the index and the second is the Ticker symbol).
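To connect the formula above to the numpy calculation, the following sketch (reusing the df_pivot frame built earlier) computes the Pearson coefficient for the first two stocks directly from the covariance and the standard deviations, and compares it to np.corrcoef:

>>> a = np.float64(np.array(df_pivot)[0, 2:])
>>> b = np.float64(np.array(df_pivot)[1, 2:])
>>> cov_ab = ((a - a.mean()) * (b - b.mean())).mean()   # the covariance (numerator)
>>> r_manual = cov_ab / (a.std() * b.std())             # divide by the product of standard deviations
>>> np.isclose(r_manual, np.corrcoef(a, b)[0, 1])
True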

Despite the trend we see in this example, we could imagine there might be a nonlinear relationship between prices. In these cases, it might be better to measure not the linear correlation of the prices themselves, but whether the high prices for one stock coincide with those of another. In other words, the rank of trading periods by price should be the same, even if the prices are nonlinearly related. We can calculate this rank correlation, also known as Spearman's rho, using SciPy, with the following formula:

$\rho = 1 - \dfrac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$

Note

Note that this formula assumes that the ranks are distinct (no ties); in the case of ties, we can instead calculate the Pearson correlation using the ranks of the datasets instead of their raw values.

Where n is the number of datapoints in each of two sets a and b, and d is the difference in ranks between each pair of datapoints ai and bi. Because we only compare the ranks of the data, not their actual values, this measure can capture variations between two datasets even if their numerical values vary over wide ranges. Let us see if plotting the results using the Spearman correlation metric generates any differences in the pairwise distance of the stocks computed from MDS, using the following commands:

>>> import scipy.stats
>>> correlations2 = scipy.stats.spearmanr(np.float64(np.array(df_pivot)[:,2:]), axis=1)
>>> mds_coords = MDS().fit_transform(correlations2.correlation)
>>> pd.DataFrame(mds_coords).plot(kind='scatter',x=1,y=0)
(Figure: MDS plot of the stocks based on Spearman rank correlation)

Judging by the spread along the x and y axes, the Spearman correlation coordinates appear closer to each other than the Pearson coordinates, suggesting that from the perspective of rank correlation the time series are more similar to one another.
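As the note above suggests, Spearman's rho is equivalent to computing the Pearson correlation on the ranks of the data. The following sketch (reusing the first two stock series) illustrates this equivalence:

>>> from scipy.stats import rankdata
>>> a = np.float64(np.array(df_pivot)[0, 2:])
>>> b = np.float64(np.array(df_pivot)[1, 2:])
>>> r_on_ranks = np.corrcoef(rankdata(a), rankdata(b))[0, 1]   # Pearson correlation of the ranks
>>> np.isclose(r_on_ranks, scipy.stats.spearmanr(a, b).correlation)
True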

Though they differ in their assumptions about how two datasets are distributed numerically, Pearson and Spearman correlations share the requirement that the two sets are of the same length. This is usually a reasonable assumption, and will be true of most of the examples we consider in this book. However, for cases where we wish to compare time series of unequal lengths, we can use Dynamic Time Warping (DTW). Conceptually, the idea of DTW is to warp one time series to align it with a second, by opening gaps in either series so that the two can be matched element by element. What the algorithm needs to resolve is where the most similar points of the two series are, so that gaps can be placed in the appropriate locations. In the simplest implementation, DTW consists of the following steps (see the following diagram, and the code sketch after the list):

(Figure: the three stages of the DTW calculation referenced in the steps below—initializing the cost matrix (left), accumulating costs (middle), and reading the distance from the bottom-right corner (right))
  1. For a dataset a of length n and a dataset b of length m, construct a matrix of size n by m.
  2. Set the top row and the leftmost column of this matrix to infinity (left, in the figure above).
  3. For each point i in set a and point j in set b, compare their similarity using a cost function. To this cost function, add the minimum of the elements at (i-1, j-1), (i-1, j), and (i, j-1)—that is, the cells reached by moving diagonally, vertically, or horizontally in the matrix. These conceptually represent the costs of opening a gap in one of the series versus aligning the same element in both (middle, above).
  4. At the end of step 3, we will have traced the minimum-cost path to align the two series, and the DTW distance will be found in the bottom-right corner of the matrix, (n, m) (right, above).
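To make these steps concrete, here is a minimal (and deliberately unoptimized) implementation of the basic DTW recursion, using the absolute difference as the cost function. This is only a sketch of the idea, not what the FastDTW library used later does internally:

>>> import numpy as np
>>> def naive_dtw(a, b):
...     n, m = len(a), len(b)
...     cost = np.full((n + 1, m + 1), np.inf)    # steps 1-2: matrix with an infinite border
...     cost[0, 0] = 0.0
...     for i in range(1, n + 1):
...         for j in range(1, m + 1):
...             d = abs(a[i - 1] - b[j - 1])      # step 3: local cost of aligning a[i] with b[j]
...             cost[i, j] = d + min(cost[i - 1, j - 1],   # diagonal move (align both elements)
...                                  cost[i - 1, j],       # gap in b
...                                  cost[i, j - 1])       # gap in a
...     return cost[n, m]                         # step 4: bottom-right corner is the DTW distance
>>> naive_dtw([1, 2, 3, 4], [1, 2, 2, 3, 4])      # series of unequal length can still be compared
0.0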

A negative aspect of this algorithm is that step 3 involves computing a value for every element of series a and b. For large time series or large datasets, this can be computationally prohibitive. While a full discussion of algorithmic improvements is beyond the scope of our present examples, we refer interested readers to FastDTW (which we will use in our example) and SparseDTW as examples of improvements that can be evaluated using many fewer calculations (Al-Naymat, Ghazi, Sanjay Chawla, and Javid Taheri. Sparsedtw: A novel approach to speed up dynamic time warping. Proceedings of the Eighth Australasian Data Mining Conference-Volume 101. Australian Computer Society, Inc., 2009. Salvador, Stan, and Philip Chan. Toward accurate dynamic time warping in linear time and space. Intelligent Data Analysis 11.5 (2007): 561-580.).

We can use the FastDTW algorithm to compare the stocks data and plot the results again using MDS. First we will compare pairwise each pair of stocks and record their DTW distance in a matrix:

>>> from fastdtw import fastdtw
>>> dtw_matrix = np.zeros(shape=(df_pivot.shape[0],df_pivot.shape[0]))
>>> for i in np.arange(0,df_pivot.shape[0]):
...     for j in np.arange(i+1,df_pivot.shape[0]):
...         dtw_matrix[i,j] = fastdtw(df_pivot.iloc[i,2:],df_pivot.iloc[j,2:])[0]

Note

This function is found in the fastdtw library, which you can install using pip or easy_install.

For computational efficiency (because the distance between stocks i and j equals the distance between stocks j and i), we calculate only the upper triangle of this matrix. We then add the transpose (that is, the lower triangle) to this result to get the full distance matrix.

>>> dtw_matrix+=dtw_matrix.transpose()

Finally, we can use MDS again to plot the results:

>>> mds_coords = MDS().fit_transform(dtw_matrix)
>>> pd.DataFrame(mds_coords).plot(kind='scatter',x=1,y=0) 
(Figure: MDS plot of the stocks based on DTW distances)

Compared to the distribution of coordinates along the x and y axis for Pearson correlation and rank correlation, the DTW distances appear to span a wider range, picking up more nuanced differences between the time series of stock prices.

Now that we have looked at numerical and time series data, as a last example let us examine calculating similarity measurements for categorical datasets.

Similarity metrics for categorical data

Text represents one class of categorical data: for instance, we might use a vector to represent the presence or absence of a given keyword for a set of papers submitted to an academic conference, as in our example dataset (Moran, Kelly H., Byron C. Wallace, and Carla E. Brodley. Discovering Better AAAI Keywords via Clustering with Community-sourced Constraints. Twenty-Eighth AAAI Conference on Artificial Intelligence. 2014.). If we open the data, we see that the keywords are represented as a string in one column, which we will need to convert into a binary vector:

>>> df2 = pd.read_csv("Papers.csv",sep=",")
>>> df2.head()
(Figure: the first rows of the conference papers dataset from df2.head())

While in Chapter 6, Words and Pixels – Working with Unstructured Data, we will examine special functions to do this conversion from text to vector, for illustrative purposes we will now code the solution ourselves. We need to gather all the unique keywords and assign a unique index to each of them to generate a new column name 'keyword_n' for each keyword:

>>> keywords_mapping = {}
>>> keyword_index = 0

>>> for k in df2.keywords:
...     k = k.split('\n')
...     for kw in k:
...         if keywords_mapping.get(kw,None) is None:
...             keywords_mapping[kw] = 'keyword_'+str(keyword_index)
...             keyword_index += 1

We then generate a new set of columns using this keyword to column name mapping, to set a 1 in each row where the keyword appears in that article's keywords:

>>> for (k,v) in keywords_mapping.items():
...     df2[v] = df2.keywords.map(lambda x: 1 if k in x.split('\n') else 0)

These columns will be appended to the right of the existing columns and we can select out these binary indicators using the iloc command, as before:

>>> df2.head().iloc[:,6:]
(Figure: the binary keyword indicator columns selected with df2.head().iloc[:,6:])

In this case, a Euclidean distance between articles could be computed, but because each coordinate is either 0 or 1, it does not provide the continuous distribution of distances we would like (we will get many ties, since there are only so many ways to add and subtract 1 and 0). Similarly, measurements of correlation between these binary vectors are less than ideal because the values can only be identical or non-identical, again leading to many duplicate correlation values.

What kinds of similarity metric could we use instead? One is the Jaccard coefficient:

$J(a, b) = \dfrac{|a \cap b|}{|a \cup b|}$

This is the number of intersecting items (positions where both a and b are set to 1 in our example) divided by the union (the total number of positions where either a or b is set to 1). This measure could be biased, however, if the articles have very different numbers of keywords, as a larger set of words will have a greater probability of being similar to another article. If we are concerned about such bias, we could use the cosine similarity, which measures the angle between vectors and takes into account the number of elements in each:

$\cos(\theta) = \dfrac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$

Where:

$a \cdot b = \sum_{i=1}^{n} a_i b_i \quad \text{and} \quad \lVert a \rVert = \sqrt{\sum_{i=1}^{n} a_i^2}$

We could also use the Hamming distance (Hamming, Richard W. Error detecting and error correcting codes. Bell System technical journal 29.2 (1950): 147-160.), which simply sums whether the elements of two sets are identical or not:

$d_{\mathrm{Hamming}}(a, b) = \sum_{i=1}^{n} \mathbb{1}\left[ a_i \neq b_i \right]$

Clearly, this measure will be best if we are primarily looking for matches and mismatches. It is also, like the Jaccard coefficient, sensitive to the number of items in each set, as simply increasing the number of elements increases the possible upper bound of the distance. Similar to Hamming is the Manhattan distance, which does not require the data to be binary:

$d_{\mathrm{Manhattan}}(a, b) = \sum_{i=1}^{n} \lvert a_i - b_i \rvert$
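To make these formulas concrete, the following sketch computes each of the four measures for two small, made-up binary keyword vectors (not taken from the dataset):

>>> import numpy as np
>>> a = np.array([1, 1, 0, 1, 0])    # toy indicator vectors
>>> b = np.array([1, 0, 0, 1, 1])
>>> jaccard   = float(np.logical_and(a, b).sum()) / np.logical_or(a, b).sum()
>>> cosine    = a.dot(b) / (np.sqrt(a.dot(a)) * np.sqrt(b.dot(b)))
>>> hamming   = (a != b).sum()
>>> manhattan = np.abs(a - b).sum()
>>> print(jaccard, cosine, hamming, manhattan)   # roughly: 0.5, 0.67, 2, 2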

If we use the Manhattan distance as an example, we can use MDS again to plot the arrangement of the documents in keyword space using the following commands:

>>> distances = pairwise.pairwise_distances(np.float64(np.array(df2)[:,6:]),metric='manhattan')
>>> mds_coords = MDS().fit_transform(distances)
>>> pd.DataFrame(mds_coords).plot(kind='scatter',x=1,y=0)
(Figure: MDS plot of the papers based on Manhattan distance in keyword space)

We see a number of groups of papers, suggesting that a simple comparison of common keywords provides a way to distinguish between articles.

The diagram below provides a summary of the different distance methods we have discussed and the decision process for choosing one over another for a particular problem. While it is not exhaustive, we hope it provides a starting point for your own clustering applications.

(Figure: decision flowchart summarizing the distance metrics covered in this section and when each is appropriate)

Tip

Aside: Normalizing categorical data

As you may have noted, we don't normalize categorical data in the same way that we used the scale() function for numerical data. The reason for this is twofold. First, with categorical data we are usually dealing with data in the range [0,1], so the problem of one column of the dataset containing wildly larger values that overwhelm the distance metric is minimized. Secondly, the premise of the scale() function is that the data in a column is shifted away from zero (biased), and we remove that bias by subtracting the mean. For categorical data, the 'mean' has a less clear interpretation, as the data can only take the value of 0 or 1, while the mean might be somewhere between the two (for example, 0.25). Subtracting this value doesn't make sense, as it would make the data elements no longer binary indicators.

Aside: Blending distance metrics

In the examples considered so far in this chapter, we have dealt with data that may be described as either numerical, time series, or categorical. However, we might easily find examples where this is not true. For instance, we could have a dataset of stock values over time that also contains categorical information about which industry the stock belongs to and numerical information about the size and revenue of the company. In this case, it would be difficult to choose a single distance metric that adequately handles all of these features. Instead, we could calculate a different distance metric for each of the three sets of features (time-series, categorical, and numerical) and blend them (by taking their average, sum, or product, for instance). Since these distances might cover very different numerical scales, we might need to normalize them, for instance using the scale() function discussed above to convert each distance metric into a distribution with mean 0, standard deviation 1 before combining them.
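As an illustration of this idea, the following sketch (using randomly generated stand-in arrays rather than any of the datasets above) computes a Euclidean distance matrix for numerical features and a Hamming distance matrix for binary features, rescales each, and averages them into a single blended matrix:

>>> import numpy as np
>>> from sklearn import preprocessing
>>> import sklearn.metrics.pairwise as pairwise
>>> num_features = np.random.randn(10, 3)                 # stand-in numerical columns
>>> cat_features = np.random.randint(0, 2, size=(10, 5))  # stand-in binary indicator columns
>>> d_num = pairwise.euclidean_distances(preprocessing.scale(num_features))
>>> d_cat = pairwise.pairwise_distances(cat_features, metric='hamming')
>>> # rescale each distance matrix to mean 0, standard deviation 1 so neither dominates, then average
>>> blended = (preprocessing.scale(d_num.ravel()) + preprocessing.scale(d_cat.ravel())) / 2.0
>>> blended = blended.reshape(d_num.shape)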

Now that we have some ways to compare the similarity of items in a dataset, let us start implementing some clustering pipelines.

K-means clustering

K-means clustering is the classical divisive clustering algorithm. The idea is relatively simple: the k in the title comes from the number of clusters we wish to identify, which we need to decide before running the algorithm, while means denotes the fact that the algorithm attempts to assign datapoints to clusters where they are closest to the average value of the cluster. Thus a given datapoint chooses among k different means in order to be assigned to the most appropriate cluster. The basic steps of the simplest version of this algorithm are (MacKay, David. An example inference task: clustering. Information theory, inference and learning algorithms (2003): 284-292):

  1. Choose a desired number of groups k.
  2. Assign k cluster centers; these could simply be k random points from the dataset, which is known as the Forgy method (Hamerly, Greg, and Charles Elkan. Alternatives to the k-means algorithm that find better clusterings. Proceedings of the eleventh international conference on Information and knowledge management. ACM, 2002.). Alternatively, we could assign a random cluster to each datapoint and compute the average of the datapoints assigned to the same cluster as the k centers, a method called Random Partitioning (Hamerly, Greg, and Charles Elkan. Alternatives to the k-means algorithm that find better clusterings. Proceedings of the eleventh international conference on Information and knowledge management. ACM, 2002.). More sophisticated methods are also possible, as we will see shortly.
  3. Assign any remaining datapoints to the nearest cluster, based on some similarity measure (such as the squared Euclidean distance).
  4. Recalculate the center of each of the k groups by averaging the points assigned to them. Note that at this point the center may no longer represent the location of a single datapoint, but is the weighted center of mass of all points in the group.
  5. Repeat steps 3 and 4 until no points change cluster assignment or the maximum number of iterations is reached.

    Tip

    K-means ++

    In the initialization of the algorithm in step 2 above, there are two potential problems. If we simply choose random points as cluster centers, they may not be optimally distributed within the data (particularly if the size of the clusters is unequal). The k points may not actually end up in the k clusters in the data (for example, multiple random points may reside within the largest cluster in the dataset, as in the figure below, top middle panel), which means the algorithm may not converge to the 'correct' solution or may take a long time to do so. Similarly, the Random Partitioning method will tend to place all the centers near the greatest mass of datapoints (see figure below, top right panel), as any random set of points will be dominated by the largest cluster. To improve the initial choice of centers, we can instead use the k-means++ initialization proposed in 2007 (Arthur, David, and Sergei Vassilvitskii. "k-means++: The advantages of careful seeding." Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, 2007.). In this algorithm, we choose an initial datapoint at random to be the center of the first cluster. We then calculate the squared distance from every other datapoint to the chosen datapoint, and choose the next cluster center with probability proportional to this distance. Subsequently, we choose each remaining center by calculating, for every datapoint, this squared distance to its nearest previously selected center and again sampling with probability proportional to it. Thus, this initialization preferentially chooses points that are far from any previously chosen point, spreading the initial centers more evenly across the space. This algorithm is the default used in scikit-learn.


Kmeans++ clustering. (Top, left): Example data with three clusters of unequal size. (Top, middle): Random choice of cluster centers is biased towards points in the largest underlying cluster. (Top, right): Random Partitioning results in the centers of mass of all three random clusters lying near the bottom of the plot. (Bottom panels): Kmeans++ results in sequential selection of three cluster centers that are evenly spaced across the dataset.
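The seeding idea described in the tip above can be written down in a few lines. The following is only a sketch of the selection procedure (in practice, scikit-learn's built-in implementation is what you would use), and the demo array X_demo is made up purely to exercise the function:

>>> import numpy as np
>>> def kmeanspp_init(X, k, rng=np.random):
...     centers = [X[rng.randint(len(X))]]    # first center: a uniformly random datapoint
...     for _ in range(k - 1):
...         # squared distance from each point to its nearest already-chosen center
...         sq_dist = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
...         probs = sq_dist / sq_dist.sum()   # points far from existing centers are more likely
...         centers.append(X[rng.choice(len(X), p=probs)])
...     return np.array(centers)
>>> X_demo = np.random.randn(300, 2)
>>> initial_centers = kmeanspp_init(X_demo, 3)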

Let's think for a moment about why this works; even if we start with random group centers, once we assign points to these groups the centers are pulled towards the average position of observations in our dataset. The updated center is nearer to this average value. After many iterations, the center of each group will be dominated by the average position of datapoints near the randomly chosen starting point. If the center was poorly chosen to begin with, it will be dragged towards this average, while datapoints that were perhaps incorrectly assigned to this group will gradually be reassigned. During this process, the overall value that is minimized is typically the sum of squared errors (when we are using the Euclidean distance metric), given by:

$SSE = \sum_{i=1}^{n} D(x_i, c_i)^2$

Where D is the Euclidean distance and $c_i$ is the center of the cluster to which point $x_i$ is assigned. This value is also sometimes referred to as the inertia. If we think about this for a moment, we can see that the algorithm works best for data that is composed of circles (or spheres in higher dimensions); the overall SSE for a cluster is minimized when points are uniformly distant from its center in a spherical cloud. In contrast, a non-uniform shape (such as an ellipse) will tend to have higher SSE values, and the algorithm will be optimized by splitting the data into two clusters, even if visually we can see that they appear to be represented well by one. This fact reinforces why normalization is often beneficial (as the 0 mean, 1 standard deviation normalization attempts to approximate the shape of a normal distribution for all dimensions, thus forming circles or spheres of data), and the important role of data visualization in addition to numerical statistics in judging the quality of clusters.

It is also important to consider the implications of this minimization criterion for step 4. The SSE is equivalent to the summed squared Euclidean distance between cluster points and their centroid. Thus, by using the squared Euclidean distance as the metric for comparison, we guarantee that the cluster assignment is also optimizing the minimization criterion. We could use other distance metrics, but then this will not be guaranteed. If we are using Manhattan or Hamming distance, we could instead make our minimization criterion the sum of distances to the cluster center, which we term the k-median, since the value that optimizes this statistic is the cluster median (Jain, Anil K., and Richard C. Dubes. Algorithms for Clustering Data. Prentice-Hall, Inc., 1988.). Alternatively, we could use an arbitrary distance metric with an algorithm such as k-medoids (see below).

Clearly, this method will be sensitive to our initial choice of group centers, so we will usually run the algorithm many times and use the best result.
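The loop described in the numbered steps above, together with multiple random restarts, can be sketched as follows. This is purely illustrative (it ignores corner cases such as clusters that lose all of their points), since in practice we will rely on scikit-learn's KMeans below, and the demo data X_demo is made up for the example:

>>> import numpy as np
>>> def lloyd_kmeans(X, k, n_iter=100, rng=np.random):
...     centers = X[rng.choice(len(X), k, replace=False)]   # Forgy initialization: k random points
...     for _ in range(n_iter):
...         dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
...         labels = dists.argmin(axis=1)                    # assign each point to its nearest center
...         new_centers = np.array([X[labels == c].mean(axis=0) for c in range(k)])
...         if np.allclose(new_centers, centers):            # stop once the centers no longer move
...             break
...         centers = new_centers                            # otherwise, recompute and repeat
...     return labels, centers, dists.min(axis=1).sum()      # also return the SSE (inertia)
>>> X_demo = np.random.randn(200, 2)
>>> # run several times from different random starts and keep the solution with the lowest SSE
>>> best_labels, best_centers, best_sse = min((lloyd_kmeans(X_demo, 3) for _ in range(10)),
...                                           key=lambda result: result[2])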

Let's look at an example: type the following commands in the notebook to read in a sample dataset.

>>> df = pd.read_csv('kmeans.txt',sep='\t')
>>> df.plot(kind='scatter',x='x_coord',y='y_coord')

(Figure: scatterplot of the example dataset, showing several visually apparent clusters)

By visual inspection, this dataset clearly has a number of clusters in it. Let's try clustering with k=5.

>>> from sklearn.cluster import KMeans
>>> kmeans_clusters = KMeans(5).fit_predict(X=np.array(df)[:,1:])
>>> df.plot(kind='scatter', x='x_coord', y='y_coord', c=kmeans_clusters)

(Figure: the same scatterplot with points colored by their k-means cluster assignment for k=5)

You will notice that we use the slice operator '[]' to index a numpy array that we create from the input dataframe, selecting all rows and the columns after the first (the first contains the label, so we don't need it as it isn't part of the data being used for clustering). We call the KMeans model using a pattern that will become familiar for many algorithms in scikit-learn and PySpark: we create a model object (KMeans) with parameters (here, 5, the number of clusters), and call fit_predict to both calibrate the model parameters and apply the model to the input data. Here, applying the model generates cluster assignments, while in the regression or classification models that we will discuss in Chapter 4, Connecting the Dots with Models – Regression Methods, and Chapter 5, Putting Data in its Place – Classification Methods and Analysis, predict will yield an estimated continuous response or class label, respectively, for each data point. We could also simply call the fit method for KMeans, which would return an object describing the cluster centers and the statistics resulting from fitting the model, such as the inertia measure we described above.
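To connect the fitted model back to the inertia statistic described earlier, here is a quick check (a sketch using the same data) that the model's reported inertia equals the summed squared distance of each point to its assigned center:

>>> X = np.array(df)[:, 1:]
>>> model = KMeans(5).fit(X)
>>> manual_inertia = ((X - model.cluster_centers_[model.labels_]) ** 2).sum()
>>> np.isclose(manual_inertia, model.inertia_)
True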

Is this a good number of clusters to fit to the data or not? We can explore this question if we cluster using several values of k and plot the inertia at each. In Python we can use the following commands.

>>> inertias = []
>>> ks = np.arange(1,20)
>>> for k in ks:
...     inertias.append(KMeans(k).fit(X=np.array(df)[:,1:]).inertia_)
>>> results = pd.DataFrame({"num_clusters": ks, "sum_distance": inertias})

Recall that the inertia is defined as the sum of squared distances from the points in a cluster to the center of the cluster to which they are assigned, which is the objective we are trying to optimize in k-means. By visualizing this inertia value at each cluster number k, we can get a feeling for the number of clusters that best fits the data:

>>> results.plot(kind='scatter', x='num_clusters', y='sum_distance')
(Figure: inertia plotted against the number of clusters, showing an elbow around k=5)

We notice there is an elbow around the five-cluster mark, which fortunately was the value we initially selected. This elbow indicates that after five clusters we do not significantly decrease the inertia as we add more clusters, suggesting that at k=5 we have captured the important group structure in the data.

This exercise also illustrates some problems: as you can see from the plot, some of our clusters might be formed from what appear to be overlapping segments, forming a cross shape. Is this a single cluster or two mixed clusters? Unfortunately, without a specification in our cluster model of what shape the clusters should conform to, the results are driven entirely by the distance metric, not patterns you might notice visually. This underscores the importance of visualizing the results and examining them with domain experts to judge whether the obtained clusters make sense. In the absence of a domain expert, we could also check whether the obtained clusters agree with a known labeling of the points—if a high percentage of each cluster's members share a single label, this indicates the clusters are of good conceptual quality as well as minimizing our distance metric.
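If, as mentioned above, the first column of this dataset holds a known group label (an assumption here), one simple way to perform this check is to cross-tabulate that label against the cluster assignments and look for clusters dominated by a single label (a sketch, reusing the df and kmeans_clusters objects created above):

>>> # rows: known labels, columns: k-means cluster ids; counts concentrated within each row indicate agreement
>>> pd.crosstab(df.iloc[:, 0], kmeans_clusters)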

We could also try using a method that will automatically calculate the best number of clusters for a dataset.
