Mathematics behind clustering

Earlier in this chapter, we discussed how a measure of similarity or dissimilarity is needed for the purpose of clustering observations. In this section, we will see what those measures are and how they are used.

Distances between two observations

If we consider each observation as a point in an n-dimensional space, where n is the number of columns in the dataset, we can calculate the mathematical distance between the points. The smaller the distance, the more similar the points are. The points that are closer to each other are clubbed together.

Now, there are many ways of calculating distances, and different algorithms use different methods of calculating distance. Let us see the different methods with a few examples. To illustrate the distances better, let us consider a sample dataset of 10 observations with three variables each. The following dataset contains the percentage marks obtained by 10 students in English, Maths, and Science:

Student   English   Maths   Science
1         0.12      0.49    0.21
2         0.21      0.81    0.79
3         0.73      0.30    0.99
4         0.55      0.03    0.17
5         0.15      0.83    0.25
6         0.24      0.37    0.63
7         0.20      0.82    0.85
8         0.17      0.92    0.45
9         0.26      0.16    0.31
10        0.15      0.47    0.23

Table 7.1: Percentage marks obtained by 10 students in English, Maths, and Science
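For the code sketches later in this section, it helps to have this table available in Python. The following is a minimal sketch; the array name marks is our own choice, not something defined in the text:

    import numpy as np

    # Marks of the 10 students (rows) in English, Maths, and Science (columns),
    # copied from Table 7.1
    marks = np.array([
        [0.12, 0.49, 0.21],
        [0.21, 0.81, 0.79],
        [0.73, 0.30, 0.99],
        [0.55, 0.03, 0.17],
        [0.15, 0.83, 0.25],
        [0.24, 0.37, 0.63],
        [0.20, 0.82, 0.85],
        [0.17, 0.92, 0.45],
        [0.26, 0.16, 0.31],
        [0.15, 0.47, 0.23],
    ])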

Euclidean distance

This is the most commonly used definition of distance. Let us denote each observation as X_i. Then, the Euclidean distance between two points, X_i and X_j, for a dataset having n columns, is defined as follows:

D(X_i, X_j) = \sqrt{\sum_{k=1}^{n} (X_{ik} - X_{jk})^2}

For our dataset, the Euclidean distance between students 1 and 2 is calculated as follows:

D(X_1, X_2) = \sqrt{(0.12 - 0.21)^2 + (0.49 - 0.81)^2 + (0.21 - 0.79)^2} = \sqrt{0.0081 + 0.1024 + 0.3364} = \sqrt{0.4469} \approx 0.67

Thus, the Euclidean distance between students 1 and 2 comes out to be approximately 0.67.

Manhattan distance

Manhattan distance is defined as follows:

D(X_i, X_j) = \sum_{k=1}^{n} |X_{ik} - X_{jk}|

Minkowski distance

Minkowski distance is defined as follows:

D(X_i, X_j) = \left(\sum_{k=1}^{n} |X_{ik} - X_{jk}|^p\right)^{1/p}

where p >= 1. For p = 1, the Minkowski distance reduces to the Manhattan distance, and for p = 2, it reduces to the Euclidean distance.
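Each of these distances can be computed with SciPy's distance functions. The following is a minimal sketch for students 1 and 2, assuming the marks array and numpy import defined after Table 7.1:

    from scipy.spatial.distance import euclidean, cityblock, minkowski

    x1, x2 = marks[0], marks[1]          # rows for students 1 and 2

    print(euclidean(x1, x2))             # Euclidean distance, ~0.67
    print(cityblock(x1, x2))             # Manhattan (city block) distance, ~0.99
    print(minkowski(x1, x2, p=3))        # Minkowski distance with p = 3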

The distance matrix

Once we have defined the distance between two points, we can calculate the distance between any two points in the dataset and store the distances as a matrix. This matrix, representing the distance between any two points (observations) in the dataset, is called a distance matrix. A distance matrix has the following properties:

  • All the diagonal elements in a distance matrix have the value 0, as they represent the distance of a point from itself, that is, D_{i,i} = 0 for all i.
  • For a dataset with n observations, we get an n x n distance matrix.
  • A distance matrix is a symmetric matrix, as the distance between two points is the same irrespective of the order of the points. The distance between point 1 and point 2 is the same as the distance between point 2 and point 1. This implies D_{i,j} = D_{j,i} for all i and j.

For our dataset containing information about the marks of 10 students, it would be a 10x10 matrix. If we decide to calculate the matrix for Euclidean distances, the distance matrix will look as follows:

 

        1     2     3     4     5     6     7     8     9     10
 1    0.00  0.67  1.01  0.63  0.33  0.45  0.72  0.49  0.37  0.04
 2    0.67  0.00  0.76  1.06  0.55  0.47  0.06  0.36  0.81  0.66
 3    1.01  0.76  0.00  0.88  1.08  0.62  0.76  1.00  0.84  0.97
 4    0.63  1.06  0.88  0.00  0.89  0.65  1.10  1.01  0.35  0.60
 5    0.33  0.55  1.08  0.89  0.00  0.60  0.60  0.23  0.68  0.35
 6    0.45  0.47  0.62  0.65  0.60  0.00  0.50  0.58  0.38  0.41
 7    0.72  0.06  0.76  1.10  0.60  0.50  0.00  0.41  0.85  0.71
 8    0.49  0.36  1.00  1.01  0.23  0.58  0.41  0.00  0.78  0.50
 9    0.37  0.81  0.84  0.35  0.68  0.38  0.85  0.78  0.00  0.34
10    0.04  0.66  0.97  0.60  0.35  0.41  0.71  0.50  0.34  0.00

Table 7.2: Distance matrix for the marks of 10 students

Each element of this matrix has been calculated in the same way as we calculated D_{1,2}, only using the corresponding pair of rows. One can observe that the properties stated above are satisfied: the diagonal elements are all 0 and the matrix is symmetric. Interpreting a distance matrix is simple. The value in the 2nd row and the 1st column gives the distance between the 2nd and the 1st student (or the 1st and the 2nd student), and so on. So, D_{1,3} = 1.01, D_{2,5} = 0.55, D_{6,8} = 0.58, and so on.
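The full matrix in Table 7.2 can be reproduced with SciPy's pairwise distance utilities; a minimal sketch, assuming the marks array and numpy import from earlier:

    from scipy.spatial.distance import pdist, squareform

    # Condensed vector of pairwise Euclidean distances, expanded into the
    # full symmetric 10 x 10 matrix
    dist_matrix = squareform(pdist(marks, metric='euclidean'))
    print(np.round(dist_matrix, 2))

    # Zero diagonal and symmetry
    print(dist_matrix[0, 0], dist_matrix[1, 0], dist_matrix[0, 1])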

Normalizing the distances

These distances can sometimes be misleading if the variables are not in the same numerical range. In such cases, the variables having larger numerical values start influencing the distance more than the variables having smaller numerical values. This gives undue importance to the variables with large numerical values and subdues the importance of the ones with small numerical values.

The following dataset has the Monthly Income, Monthly Expense, and Education Level of 10 customers. Look at customers 1 and 9. The difference in their monthly incomes is 1758, and the difference in their monthly expenses is 7707, while the difference in their education levels is just 3. The contribution of the education level to the distance becomes insignificant compared to the contribution of the other two variables, even though a difference of 3 in the education level might actually matter more. This creates a problem, as the distance is no longer a reliable measure of similarity and can give rise to incorrect clustering.

Let us look at the following datasheet:

Fig. 7.2: Monthly Income, Monthly Expense and Education Levels

The solution to this problem is normalizing the values of the variables to bring them into the same numerical range. Normalization entails a numerical transformation of each value so that all the variables come into the same numerical range. A common method uses the min and max values of a variable, and the transformation is defined as follows:

Z_i = \frac{X_i - X_{min}}{X_{max} - X_{min}}

Note that X_i is the value of the variable, X_{min} is the minimum value of the variable, X_{max} is the maximum value of the variable, and Z_i is the normalized value.

After normalizing the preceding dataset using the preceding equation, the datasheet looks as shown:

Fig. 7.3: Monthly Income, Monthly Expense, and Education Levels after normalization
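The same transformation is easy to apply in code. The sketch below defines a small min-max normalization helper (the function name is our own) and demonstrates it on the marks data from Table 7.1; the same call would be applied to the customer data shown in Fig. 7.2:

    import pandas as pd

    # Min-max normalization: Z = (X - Xmin) / (Xmax - Xmin), applied column-wise
    def min_max_normalize(df):
        return (df - df.min()) / (df.max() - df.min())

    marks_df = pd.DataFrame(marks, columns=['English', 'Maths', 'Science'])
    print(min_max_normalize(marks_df))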

Linkage methods

While clustering the observations using the distances, one needs to link two or more points or clusters to form a new cluster or grow an already existing cluster. The distance between two clusters can be defined in many ways; for example, it can be the minimum distance between two points in two clusters, the maximum distance between two such points, or the distance between the centroids of the two clusters. The two clusters having the minimum distance between each other are clubbed together. Corresponding to each definition of the distance between the two clusters, there is a linkage method. Some of these linkage methods are shown in the following sections.

Single linkage

  • The distance between two clusters is the minimum distance between a point in cluster 1 and a point in cluster 2.
  • The two clusters having the smallest distance between them are combined. The distance is defined as follows:

    D(C_m, C_n) = \min_{i \in C_m,\, j \in C_n} D(X_i, X_j)

    Here, C_m and C_n are two clusters; i and j are points in the clusters m and n, respectively.

Complete linkage

  • The distance between two clusters is the maximum distance between a point in cluster 1 and a point in cluster 2.
  • The two clusters having the smallest distance between them are combined. The distance is defined as follows:

    D(C_m, C_n) = \max_{i \in C_m,\, j \in C_n} D(X_i, X_j)

    Here, C_m and C_n are two clusters; i and j are points in the clusters m and n, respectively.

Average linkage

  • The distance between two clusters is the average distance between a point in cluster 1 and a point in cluster 2.
  • The two clusters having the smallest distance between them are combined. The distance is defined as follows:

    D(C_m, C_n) = \frac{1}{|C_m|\,|C_n|} \sum_{i \in C_m} \sum_{j \in C_n} D(X_i, X_j)

    Here, C_m and C_n are two clusters; i and j are points in the clusters m and n, respectively.

Centroid linkage

  • The distance between two clusters is the distance between the centroid (mean) of all the points in cluster 1 and the centroid (mean) of all the points in cluster 2.
  • The two clusters having the smallest distance between them are combined. The distance is defined as follows:

    D(C_m, C_n) = D(\mu_m, \mu_n)

    Here, C_m and C_n are two clusters, and \mu_m and \mu_n are the centroids (means) of the points in C_m and C_n.

Ward's method

In Ward's method, the cluster that minimizes the increase in the combined error sum of squares (the within-cluster sum of squares used in ANOVA) is joined to an already existing cluster to form a new, joined cluster. ANOVA is a statistical method that checks whether there is more variation within the clusters or across the overall dataset. The smallest increase in this error term shows that the elements of the newly joined cluster are more similar to each other than elements in other clusters.
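To make these definitions concrete, the following sketch computes the single, complete, average, and centroid linkage distances between two small illustrative clusters taken from the marks data. The grouping is chosen only for illustration, and the sketch assumes the marks array and numpy import from earlier:

    from scipy.spatial.distance import cdist

    # Illustrative clusters: students 1 and 10 versus students 2 and 7
    cluster_m = marks[[0, 9]]
    cluster_n = marks[[1, 6]]

    # All pairwise distances between a point in cluster_m and a point in cluster_n
    pairwise = cdist(cluster_m, cluster_n)

    single_link = pairwise.min()      # minimum pairwise distance
    complete_link = pairwise.max()    # maximum pairwise distance
    average_link = pairwise.mean()    # average pairwise distance
    centroid_link = np.linalg.norm(
        cluster_m.mean(axis=0) - cluster_n.mean(axis=0))  # distance between centroids

    print(single_link, complete_link, average_link, centroid_link)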

Hierarchical clustering

Hierarchical clustering is an agglomerative method of clustering, wherein we start with each point as a separate cluster and then agglomerate them into a single cluster based on the similarity between the observations.

For a dataset with N observations and an N x N distance matrix, a hierarchical cluster can be created using the following steps:

  1. Start with each observation as a cluster, so that you have N clusters to begin with.
  2. Find the smallest distance in the distance matrix. Join the two observations having the smallest distance to form a cluster.
  3. Recompute the distances between all the old clusters and the new cluster. If one follows the single linkage method, the distance between two clusters is the minimum distance between two points in the two clusters.
  4. Repeat steps 2 and 3 until you are left with a single cluster containing all the N observations.

Let us take the distance matrix we created earlier in this chapter and follow the above steps to create a hierarchical cluster. The distance matrix, as we have seen before, looks like this:

Fig. 7.4: Distance matrix for the students' marks

Iteration 1:

Fig. 7.5: The first iteration of clustering

Iteration 2:

Fig. 7.6: The second iteration of clustering

Iteration 3:

Fig. 7.7: The third iteration of clustering

Iteration 4:

Fig. 7.8: The fourth iteration of clustering

The next iterations can be performed in a similar manner. Here, we used the single linkage method. If we use a different linkage method, the clustering will take a different shape. Hierarchical clusters can best be understood using a hierarchical tree (also called a dendrogram) depicting when and where two or more points or clusters joined to form a bigger cluster.

For the distance matrix used above and the single linkage method, one would get the following tree structure:

Fig. 7.9: A tree using a single linkage method
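A tree like the one in Fig. 7.9 can also be drawn programmatically. The following is a minimal sketch using SciPy's hierarchy module on the marks array defined earlier (the array name is our own); method='single' selects single linkage, and 'complete', 'average', 'centroid', or 'ward' can be used instead:

    from scipy.cluster.hierarchy import linkage, dendrogram
    import matplotlib.pyplot as plt

    # Agglomerative clustering of the 10 students using single linkage
    # on Euclidean distances
    links = linkage(marks, method='single', metric='euclidean')

    # Plot the hierarchical tree; labels 1-10 are the student numbers
    dendrogram(links, labels=[str(i) for i in range(1, 11)])
    plt.show()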

While using Ward's method, one gets a tree as shown in the following figure:

Fig. 7.10: A tree using Ward's linkage method

The tree formed using a single linkage can be interpreted as follows:

  • The #1 and #10 observations are joined to form a cluster at first, followed by a cluster formed by #2 and #7.
  • The #5 and #8 observations form a different cluster that later joins the cluster formed by #1 and #10.
  • This bigger cluster containing #1, #10, #5, and #8 is joined by the cluster containing #2 and #7, and so on.
  • The observations that joined earlier to form a cluster are more similar to each other. So, (#1, #10), (#2, #7), and (#5, #8) are similar to each other.

Hierarchical clusters produce efficient and analyzable results when the number of observations in the dataset is relatively small. As the number of observations increases, the trees resulting from hierarchical clustering become messier and more difficult to analyze. In such cases, it is better to try another method of clustering.

K-means clustering

K-means clustering is another unsupervised algorithm for clustering a number of observations. It is different from hierarchical clustering in the sense that here the number of desired clusters and the initial cluster centroids need to be defined before the model is built. The cluster centroids keep getting updated based on the observations assigned to each cluster. The output consists of an array containing the cluster number to which each observation belongs.

A step-by-step description of the k-means clustering algorithm is as follows:

  1. Decide the number of clusters and assign a cluster centroid for each cluster. The number of clusters can be decided based on the business context. The cluster centroids can be passed manually (based on the business context or some prior information), or randomly chosen observations can serve as the cluster centroids to start with.
  2. Calculate the distance of each observation from each cluster. Assign the observation to the cluster from which its distance is the least. The distance of an observation from a cluster is defined as its distance from the cluster's centroid.
  3. Recalculate the cluster centroid using the mean of the observations in the cluster. The following formula defines the update of the cluster centroids:

    V_i = \frac{1}{c_i} \sum_{m=1}^{c_i} X_m^{(i)}

    Here, c_i is the number of observations in the ith cluster, and X_m^{(i)} is the mth observation vector assigned to that cluster; its length is equal to the number of columns in the dataset. A cluster centroid is also a vector as long as an observation vector, that is, the number of columns.

  4. Repeat the steps starting from step 2.
  5. Stop if none of the observations were reassigned from one cluster to another.

    If none of the observations are being reassigned, it means that all the observations are already in the correct clusters and their distances from the cluster centroids cannot be reduced any further.

The goal of this algorithm is to attain a configuration of cluster centroids and cluster assignments such that the overall J squared error function, or J-score, is minimized:

J = \sum_{i=1}^{c} \sum_{m=1}^{c_i} \left\| X_m^{(i)} - V_i \right\|^2

Here, c is the number of clusters, c_i is the number of points in the ith cluster, X_m^{(i)} is the mth point of the ith cluster, and V_i is the centroid of the ith cluster.

The J squared error function can be understood as the sum of the squared distances of the points from their respective cluster centroids. A smaller value of the J squared error function implies tightly packed, homogeneous clusters. It also implies that most of the points have been placed in the right clusters.
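Before turning to SciPy, the steps above and the J-score can be translated almost directly into a short NumPy sketch. The function and variable names here are our own (kmeans_sketch is not a library function), and edge cases such as empty clusters are ignored for simplicity:

    import numpy as np

    def kmeans_sketch(X, k, max_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # Step 1: pick k distinct observations at random as the initial centroids
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        labels = None
        for _ in range(max_iter):
            # Step 2: assign each observation to its nearest centroid
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            new_labels = dists.argmin(axis=1)
            # Step 5: stop when no observation changes its cluster
            if labels is not None and np.array_equal(new_labels, labels):
                break
            labels = new_labels
            # Step 3: move each centroid to the mean of its assigned observations
            centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # J-score: sum of squared distances of observations from their centroids
        j_score = ((X - centroids[labels]) ** 2).sum()
        return labels, centroids, j_score

    # For example, two clusters on the marks data from Table 7.1:
    # labels, centroids, j_score = kmeans_sketch(marks, k=2)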

Let us try the k-means clustering algorithm for clustering some random numbers between 0 and 1. The Python library SciPy has some in-built methods to perform the algorithm and return a list defining which observation belongs to which cluster:

  1. Define a set of observations consisting of random numbers ranging from 0 to 1. In this case, we have defined an observation set of size 30x3:
    import numpy as np
    # 90 random numbers between 0 and 1, reshaped into 30 observations with 3 columns
    obs=np.random.random(90).reshape(30,3)
    obs

    The output of the code looks as follows:

    Fig. 7.11: A 30x3 array of random numbers between 0 and 1

  2. Decide that we want two clusters (there is no hard and fast rule; you can also try with three clusters). Select two observations at random to make them the initial cluster centroids:
    # indices of two randomly chosen observations to act as initial centroids
    c1=np.random.choice(range(len(obs)))
    c2=np.random.choice(range(len(obs)))
    # stack the two chosen rows into a 2x3 array of cluster centroids
    clust_cen=np.vstack([obs[c1],obs[c2]])
    clust_cen

    The output of the code looks as follows:

    Fig. 7.12: Selecting two rows (out of 30) at random to be initial cluster centers

    The two rows in the clust_cen array correspond to the two cluster centroids.

  3. With the number of clusters and the cluster centroids defined, one is ready to implement the k-means clustering. The assignment of each observation to its nearest cluster centroid can be done using the vq function from SciPy's cluster.vq module:
    from scipy.cluster.vq import vq
    # assign each observation to the nearer of the two centroids and
    # return its distance from that centroid
    vq(obs,clust_cen)

    Fig. 7.13: Cluster label and distance from cluster centers for each observation

    The first array gives us the information as to which cluster the observation belongs to. The first observation belongs to cluster c2, the second observation belongs to c1, the third belongs to c2, the fourth to c1, and so on.

    The second array gives the distance of the observation from the final cluster centroid. Hence, the first observation is at a distance of 0.33 units from the centroid of the cluster c2, the second observation is at a distance of 0.43 from the centroid of the cluster c1, and so on.

  4. Find the final cluster centroids for the two clusters. This is done using the kmeans function in SciPy's cluster.vq module:
    from scipy.cluster.vq import kmeans
    # run k-means starting from the chosen initial centroids
    kmeans(obs,clust_cen)

    The output of the code looks as follows:

    Fig. 7.14: The final cluster centers and the value of the squared error function or J-score

    The two rows in the array correspond to the two final cluster centroids. The centroid of the first cluster is at (0.524, 0.837, 0.676). The number at the end is the value of the squared error function or J-score, which we seek to minimize. Its value comes out to be 0.35.

    K-means also works if one provides just the number of required clusters instead of the cluster centroids. In that case, the method randomly selects that many observations from the observation set to serve as the initial cluster centroids. Thus, we could have also written the following:

    from scipy.cluster.vq import kmeans
    kmeans(obs,2)