Earlier in this chapter, we discussed how a measure of similarity or dissimilarity is needed for the purpose of clustering observations. In this section, we will see what those measures are and how they are used.
If we consider each observation as a point in an n-dimensional space, where n is the number of columns in the dataset, one can calculate the mathematical distance between the points. The smaller the distance between two points, the more similar the corresponding observations are. The points that lie closer to each other are clubbed together into the same cluster.
Now, there are many ways of calculating distances, and different algorithms use different methods of calculating distance. Let us see the different methods with a few examples. To illustrate the distances better, let us consider a sample dataset of 10 observations with three variables each. The following dataset contains percentage marks obtained by 10 students in English, Maths, and Science:
Student | English | Maths | Science
---|---|---|---
1 | 0.12 | 0.49 | 0.21
2 | 0.21 | 0.81 | 0.79
3 | 0.73 | 0.30 | 0.99
4 | 0.55 | 0.03 | 0.17
5 | 0.15 | 0.83 | 0.25
6 | 0.24 | 0.37 | 0.63
7 | 0.20 | 0.82 | 0.85
8 | 0.17 | 0.92 | 0.45
9 | 0.26 | 0.16 | 0.31
10 | 0.15 | 0.47 | 0.23
Table 7.1: Percentage marks obtained by 10 students in English, Maths, and Science
This is the most commonly used definition of the distance. Let us denote each observation as Xi. Then, the Euclidean distance between the two points, Xi and Xj, for a dataset having n columns, is defined as follows:

D(Xi, Xj) = √[(Xi1 − Xj1)² + (Xi2 − Xj2)² + … + (Xin − Xjn)²]
For our dataset, the Euclidean distance between students 1 and 2 is calculated as follows:

D(X1, X2) = √[(0.12 − 0.21)² + (0.49 − 0.81)² + (0.21 − 0.79)²] = √0.4469 ≈ 0.67

Thus, the Euclidean distance between students 1 and 2 comes out to be 0.67.
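As a quick check, this calculation can be reproduced in Python. The following is a minimal sketch using NumPy:

```python
import numpy as np

# Marks of students 1 and 2 from Table 7.1 (English, Maths, Science)
x1 = np.array([0.12, 0.49, 0.21])
x2 = np.array([0.21, 0.81, 0.79])

# Euclidean distance: the square root of the sum of squared differences
d = np.sqrt(np.sum((x1 - x2) ** 2))
print(round(d, 2))  # 0.67
```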
Once we have defined the distance between two points, we can calculate the distance between any two points in the dataset and store them as a matrix. This matrix representing the distance between any two points (observations) in the dataset is called a distance matrix. A distance matrix has the following properties:

- It is a symmetric matrix: the distance between points i and j is the same as the distance between points j and i, that is, Dij = Dji.
- All its diagonal elements are 0, as the distance of a point from itself is 0.
For our dataset containing information about the marks of 10 students, it would be a 10x10 matrix. If we decide to calculate the matrix for Euclidean distances, the distance matrix will look as follows:
Student | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
---|---|---|---|---|---|---|---|---|---|---
1 | 0.00 | 0.67 | 1.01 | 0.63 | 0.33 | 0.45 | 0.72 | 0.49 | 0.37 | 0.04
2 | 0.67 | 0.00 | 0.76 | 1.06 | 0.55 | 0.47 | 0.06 | 0.36 | 0.81 | 0.66
3 | 1.01 | 0.76 | 0.00 | 0.88 | 1.08 | 0.62 | 0.76 | 1.00 | 0.84 | 0.97
4 | 0.63 | 1.06 | 0.88 | 0.00 | 0.89 | 0.65 | 1.10 | 1.01 | 0.35 | 0.60
5 | 0.33 | 0.55 | 1.08 | 0.89 | 0.00 | 0.60 | 0.60 | 0.23 | 0.68 | 0.35
6 | 0.45 | 0.47 | 0.62 | 0.65 | 0.60 | 0.00 | 0.50 | 0.58 | 0.38 | 0.41
7 | 0.72 | 0.06 | 0.76 | 1.10 | 0.60 | 0.50 | 0.00 | 0.41 | 0.85 | 0.71
8 | 0.49 | 0.36 | 1.00 | 1.01 | 0.23 | 0.58 | 0.41 | 0.00 | 0.78 | 0.50
9 | 0.37 | 0.81 | 0.84 | 0.35 | 0.68 | 0.38 | 0.85 | 0.78 | 0.00 | 0.34
10 | 0.04 | 0.66 | 0.97 | 0.60 | 0.35 | 0.41 | 0.71 | 0.50 | 0.34 | 0.00
Table 7.2: Distance matrix for marks information for 10 students
Each element of this matrix has been calculated in the same way as we calculated D1,2; only the rows used in the calculation differ. One can observe that the two properties stated above are satisfied: the matrix is symmetric and the diagonal elements are all 0. Interpreting a distance matrix is simple. The value in the 2nd row and the 1st column gives the distance between the 2nd and the 1st student (or, equivalently, the 1st and the 2nd student), and so on. So, D1,3 = 1.01, D2,5 = 0.55, D6,8 = 0.58, and so on.
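As a sketch, the whole matrix can be reproduced with SciPy's distance utilities; since SciPy computes the distances exactly, a few entries may differ from Table 7.2 by about 0.01 of rounding:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Table 7.1 as a 10 x 3 array (rows = students, columns = subjects)
marks = np.array([
    [0.12, 0.49, 0.21], [0.21, 0.81, 0.79], [0.73, 0.30, 0.99],
    [0.55, 0.03, 0.17], [0.15, 0.83, 0.25], [0.24, 0.37, 0.63],
    [0.20, 0.82, 0.85], [0.17, 0.92, 0.45], [0.26, 0.16, 0.31],
    [0.15, 0.47, 0.23],
])

# pdist computes the condensed pairwise distances; squareform expands
# them into the full, symmetric 10 x 10 distance matrix
D = squareform(pdist(marks, metric='euclidean'))
print(np.round(D, 2))
```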
These distances can sometimes be misleading if the variables are not in the same numerical range. What happens in such cases is that the variables having larger numerical values start influencing the distance more than the variables having smaller numerical values. This gives undue importance to the variables with large numerical values and subdues the importance of the ones with small numerical values.
The following dataset contains the Monthly Income, Monthly Expense, and Education Level of 10 customers. Look at customers 1 and 9: the difference in their monthly incomes is 1758 and the difference in their monthly expenses is 7707, while the difference in their education levels is just 3. The contribution of the education level to the distance becomes insignificant compared to the contribution of the other two variables, even though a difference of 3 in education level might actually matter more. This creates a problem, as the distance is no longer a reliable measure of similarity and can give rise to incorrect clustering.
Let us look at the following dataset:
The solution to this problem is to normalize the values of the variables so that they all lie in the same numerical range. Normalization entails a numerical transformation of each value so that the values come into the same numerical range. The following method, which uses the minimum and maximum values of each variable, is commonly used, and the transformation is defined as follows:

Zi = (Xi − Xmin) / (Xmax − Xmin)

Note that Xi is a value of the variable, Xmin is the minimum value of the variable, Xmax is the maximum value of the variable, and Zi is the normalized value of the variable.
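A min-max normalization of this kind can be sketched in code as follows; the customer figures here are hypothetical, made up purely for illustration:

```python
import numpy as np

# Hypothetical values for Monthly Income, Monthly Expense, Education Level
data = np.array([
    [40000.0, 15000.0, 12.0],
    [52000.0, 22000.0, 16.0],
    [31000.0,  9000.0, 10.0],
])

# Column-wise min-max normalization: Z = (X - Xmin) / (Xmax - Xmin)
z = (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))
print(z)  # every column now lies in the range [0, 1]
```

After this transformation, a difference of one unit means the same thing in every column, so no single variable dominates the distance.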
After the values are normalized using the preceding equation, the dataset looks as shown:
While clustering the observations using the distances, one needs to link two or more points or clusters to form a new cluster or grow an already existing cluster. The distance between two clusters can be defined in many ways; for example, it can be the minimum distance between two points in two clusters, the maximum distance between two such points, or the distance between the centroids of the two clusters. The two clusters having the minimum distance between each other are clubbed together. Corresponding to each definition of the distance between the two clusters, there is a linkage method. Some of these linkage methods are shown in the following sections.
The cluster that minimizes the increase in the combined error sum of squares (the ANOVA error term) is joined to an already existing cluster to form a new, joined cluster. ANOVA is a statistical method that checks whether there is more variation within a cluster than in the overall dataset. The smallest increase in the ANOVA error term indicates that the elements of the newly joined cluster are more similar to each other than to the elements of other clusters.
Hierarchical clustering is an agglomerative method of clustering, wherein we start with each point as a separate cluster and then agglomerate them in a single cluster based on the similarity between observations.
For a dataset with N observations and an NxN distance matrix, a hierarchical cluster can be created using the following steps:

1. Start by treating each of the N observations as its own cluster.
2. Find the two clusters separated by the smallest distance and merge them into one cluster.
3. Recompute the distances between this new cluster and all the remaining clusters (using the chosen linkage method) and update the distance matrix.
4. Repeat steps 2 and 3 until all the observations have been merged into a single cluster.
Let us take the distance matrix we created earlier in this chapter and follow the above steps to create a hierarchical cluster. The distance matrix, as we have seen before, looks like this:
Iteration 1: The smallest off-diagonal entry in the matrix is D1,10 = 0.04, so students 1 and 10 are merged into one cluster.

Iteration 2: The next smallest distance is D2,7 = 0.06, so students 2 and 7 form a second cluster.

Iteration 3: The next smallest distance is D5,8 = 0.23, so students 5 and 8 form a third cluster.

Iteration 4: The distance matrix is updated using the chosen linkage, and the closest pair among the remaining points and newly formed clusters is merged in the same way.
Subsequent iterations can be performed in a similar manner. Here, we used the single linkage method; if we use a different linkage method, the clustering takes a different shape. Hierarchical clusters are best understood using a hierarchical tree depicting when and where two or more points or clusters joined to form a bigger cluster.
For the distance matrix used above and the single linkage method, one would get the following tree structure:
While using Ward's method, one gets a tree as shown in the following figure:
The tree formed using single linkage can be interpreted as follows: the points joined by branches lower down the tree (such as students 1 and 10) are the most similar, and each join higher up shows two clusters merging at a correspondingly larger distance.
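These merges can be reproduced with SciPy's hierarchy module. The following is a sketch; since SciPy computes the distances exactly, the merge heights may differ from the rounded values in Table 7.2 by about 0.01:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Table 7.1 as a 10 x 3 array (rows = students 1-10)
marks = np.array([
    [0.12, 0.49, 0.21], [0.21, 0.81, 0.79], [0.73, 0.30, 0.99],
    [0.55, 0.03, 0.17], [0.15, 0.83, 0.25], [0.24, 0.37, 0.63],
    [0.20, 0.82, 0.85], [0.17, 0.92, 0.45], [0.26, 0.16, 0.31],
    [0.15, 0.47, 0.23],
])

# Single-linkage agglomerative clustering; each row of Z records one
# merge: the two cluster ids joined, the merge distance, and the size
# of the newly formed cluster (ids 0-9 are the original observations)
Z = linkage(marks, method='single', metric='euclidean')
print(np.round(Z, 2))
```

The first row shows students 1 and 10 (ids 0 and 9) joining at a distance of about 0.04, matching the smallest entry in the distance matrix.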
Hierarchical clustering produces efficient and analyzable results when the number of observations in the dataset is relatively small. As the number of observations increases, the trees resulting from hierarchical clustering become messier and more difficult to analyze. In such cases, it is better to try another method of clustering.
K-means clustering is another unsupervised algorithm for clustering a number of observations. It differs from hierarchical clustering in that the number of desired clusters and the initial cluster centroids need to be defined before the model is built. The cluster centroids keep updating based on the observations assigned to each cluster. The output is an array containing the cluster number to which each observation belongs.
A step-by-step description of the k-means clustering algorithm is as follows:

1. Choose the number of clusters, k, and the initial cluster centroids.
2. Assign each observation to the cluster whose centroid is closest to it.
3. Recalculate each cluster centroid as the mean of the observations currently assigned to it:

   μi = (1/ci) Σ Xm, where the sum runs over the ci observations Xm assigned to cluster i

4. Repeat steps 2 and 3 until the cluster assignments stop changing.

Here, ci is the number of observations in the cluster, and Xm is an observation vector whose length is equal to the number of columns in the observation. A cluster centroid is, therefore, a vector of the same length as an observation.
The algorithm converges when none of the observations is reassigned; this means that all the observations are already in the correct clusters and their distances to the cluster centroids can't be reduced further.
The goal of this algorithm is to attain a configuration of cluster centroids and cluster assignments such that the overall J squared error function, or J-score, is minimized:

J = Σi Σ(Xm in ci) ||Xm − μi||²

where the outer sum runs over the k clusters and μi is the centroid of cluster i.
The J squared error function can be understood as the sum of the squared distance of points from their respective cluster centroids. A smaller value of J squared function implies tightly packed and homogeneous clusters. This also implies that most of the points have been placed in the right clusters.
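To make the J-score concrete, here is a minimal from-scratch sketch of k-means in NumPy. The function name `kmeans_j`, the fixed iteration count, and the random seed are assumptions for illustration only:

```python
import numpy as np

def kmeans_j(obs, k, n_iter=20, seed=0):
    """Minimal k-means sketch; returns labels, centroids, and the J-score."""
    rng = np.random.default_rng(seed)
    # Pick k distinct observations at random as the initial centroids
    centroids = obs[rng.choice(len(obs), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each observation to its nearest centroid
        dists = np.linalg.norm(obs[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned observations
        # (keep the old centroid if a cluster happens to be empty)
        centroids = np.array([
            obs[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
    # J-score: sum of squared distances of points from their centroids
    j = sum(((obs[labels == i] - centroids[i]) ** 2).sum() for i in range(k))
    return labels, centroids, j

demo = np.random.default_rng(1).random((30, 3))
labels, centroids, j = kmeans_j(demo, 2)
print(labels, round(j, 3))
```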
Let us try the k-means clustering algorithm on some random numbers between 0 and 1. The Python library SciPy has inbuilt methods to perform the algorithm and return a list defining which observation belongs to which cluster:
```
import numpy as np
obs = np.random.random(90).reshape(30, 3)
obs
```
The output of the code looks as follows:
Next, let us choose two observations at random to serve as the initial cluster centroids:

```
c1 = np.random.choice(range(len(obs)))
c2 = np.random.choice(range(len(obs)))
clust_cen = np.vstack([obs[c1], obs[c2]])
clust_cen
```
The output of the code looks as follows:
The two rows in the `clust_cen` array correspond to the two cluster centroids.
Each observation can now be assigned to one of the two clusters using the `vq` (vector quantization) method of SciPy:

```
from scipy.cluster.vq import vq
vq(obs, clust_cen)
```
The first array gives us the information as to which cluster the observation belongs to. The first observation belongs to cluster c2, the second observation belongs to c1, the third belongs to c2, the fourth to c1, and so on.
The second array gives the distance of the observation from the final cluster centroid. Hence, the first observation is at a distance of 0.33 units from the centroid of the cluster c2, the second observation is at a distance of 0.43 from the centroid of the cluster c1, and so on.
To get the final cluster centroids, let us use the `kmeans` method of SciPy:

```
from scipy.cluster.vq import kmeans
kmeans(obs, clust_cen)
```
The output of the code looks as follows:
The two rows in the array correspond to the two final cluster centroids; the centroid of the first cluster is at (0.524, 0.837, 0.676). The number at the end is the distortion reported by SciPy: the mean Euclidean distance between the observations and their nearest centroid, which, like the J-score, the algorithm seeks to minimize. Its value comes out to be 0.35.
K-means also works if one provides just the number of required clusters rather than the cluster centroids. If only the required number of clusters is provided, the method randomly selects that many observations from the observation set to serve as the initial cluster centroids. Thus, we could have also written the following:
```
from scipy.cluster.vq import kmeans
kmeans(obs, 2)
```
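Note that `kmeans` returns only the centroids and the distortion; to recover the cluster membership of each observation, the returned centroids can be passed to `vq`, as in this sketch:

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

obs = np.random.random(90).reshape(30, 3)

# Run k-means with 2 clusters, then label each observation with
# the index of its nearest final centroid
centroids, distortion = kmeans(obs, 2)
labels, dists = vq(obs, centroids)
print(labels)  # cluster index for each of the 30 observations
```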