This chapter discusses the circumstances in which the cluster analysis technique can be used. Starting from two observations, different distance (dissimilarity) measures for metric variables and similarity measures for binary variables are calculated. Different hierarchical agglomeration schedules are described, as well as how to interpret dendrograms aiming to allocate the observations to each group. The nonhierarchical k-means agglomeration schedule and its differences in relation to hierarchical schedules will also be studied. Finally, we will develop a cluster analysis in an algebraic manner and by using IBM SPSS Statistics Software and Stata Statistical Software, and then interpret their results.
Cluster analysis; Clustering; Distance or dissimilarity measures; Similarity measures; Agglomeration schedules; Hierarchical method; Nonhierarchical k-means method; Dendrogram; SPSS and Stata Software
Maybe Hamlet is right.
We could be bounded in a nutshell and count ourselves kings of infinite space.
Stephen Hawking
Cluster analysis represents a set of very useful exploratory techniques that can be applied whenever we intend to verify the existence of similar behavior among observations (individuals, companies, municipalities, countries, among other examples) in relation to certain variables, and there is the intention of creating groups or clusters in which internal homogeneity prevails. In this regard, this set of techniques has as its main objective to allocate observations to a relatively small number of clusters that are internally homogeneous and heterogeneous among themselves, and that represent the joint behavior of the observations on certain variables. That is, the observations in a certain group must be relatively similar to one another, in relation to the variables included in the analysis, and significantly different from the observations found in other groups.
Clustering techniques are considered exploratory, or interdependent, since their applications do not have a predictive nature for other observations not initially present in the sample. Moreover, the inclusion of new observations into the dataset makes it necessary to reapply the modeling, so that, possibly, new clusters can be generated. Besides, the inclusion of a new variable can also generate a complete rearrangement of the observations in the groups.
Researchers can choose to develop a cluster analysis when their main goal is to sort and allocate observations to groups and, from then on, to analyze what the ideal number of clusters formed is. Or they can, a priori, define the number of groups they wish to create, based on certain criteria, and verify how the sorting and allocation of observations behave in that specific number of groups. Regardless of the objective, clustering will continue being exploratory. If a researcher aims to use a technique to, in fact, confirm the creation of groups and to make the analysis predictive, he can use techniques such as discriminant analysis or multinomial logistic regression.
Elaborating a cluster analysis does not require vast knowledge of matrix algebra or statistics, unlike techniques such as factor analysis and correspondence analysis. Starting from the definition of the research objectives, the researcher interested in applying a cluster analysis needs to choose a certain distance or similarity measure, which will be the basis for considering observations closer to or farther from one another, and a certain agglomeration schedule, chosen between hierarchical and nonhierarchical methods. After that, he will be able to analyze, interpret, and compare the outcomes.
It is important to highlight that the outcomes obtained through hierarchical and nonhierarchical agglomeration schedules can be compared and, in this regard, the researcher is free to develop the technique using one method or the other, and to reapply it if he deems necessary. While hierarchical schedules allow us to identify the sorting and allocation of observations, offering researchers possibilities to study, assess, and decide on the number of clusters formed, in nonhierarchical schedules we start with a known number of clusters and, from then on, allocate the observations to these clusters, with a subsequent evaluation of the representativeness of each variable in creating them. Therefore, the result of one method can serve as input to carry out the other, making the analysis cyclical. Fig. 11.1 shows the logic from which a cluster analysis can be elaborated.
When choosing the distance or similarity measure and the agglomeration schedule, we must take some aspects into consideration, such as a previously desired number of clusters, defined based on some resource allocation criterion, as well as certain constraints that may lead the researcher to choose a specific solution. According to Bussab et al. (1990), different criteria regarding distance measures and agglomeration schedules may lead to different cluster formations, and the homogeneity desired by the researcher fundamentally depends on the objectives set in the research.
Imagine that a researcher is interested in studying the interdependence between individuals living in a certain municipality based only on two metric variables (age, in years, and average family income, in R$). His main goal is to assess the effectiveness of social programs aimed at providing health care and then, based on these variables, to propose a still unknown number of new programs aimed at homogeneous groups of people. After collecting the data, the researcher constructed a scatter plot, as shown in Fig. 11.2.
Based on the chart seen in Fig. 11.2, the researcher identified four clusters and highlighted them in a new chart (Fig. 11.3).
From the creation of these clusters, the researcher decided to develop an analysis of the behavior of the observations in each group, or, more precisely, of the existing variability within the clusters and between them, so that he could clearly and consciously base his decision as regards the allocation of individuals to these four new social programs. In order to illustrate this issue, the researcher constructed the chart found in Fig. 11.4.
Based on this chart, the researcher was able to notice that the groups formed showed a lot of internal homogeneity, with a certain individual being closer to other individuals in the same group than to individuals in other groups. This is the core of cluster analysis.
If the number of social programs to be provided for the population (number of clusters) had already been given to the researcher, due to budgetary, legal, or political constraints, even so we would be able to use clustering, solely, to determine the allocation of individuals from the municipality to that number of programs (groups).
Having concluded the research and allocated the individuals to the different social health care programs, the researcher decided, the following year, to carry out the same research with individuals from the same municipality. However, in the meantime, a group of elderly billionaires had decided to move to that city and, when he constructed the new scatter plot, the researcher realized that those four clusters, so clearly formed the previous year, did not exist anymore, since they merged when the billionaires were included. The new scatter plot can be seen in Fig. 11.5.
This new situation exemplifies the importance of always reapplying the cluster analysis whenever new observations (and also new variables) are included, which strips the technique of any predictive power, as we have already discussed.
Moreover, this example shows that, before elaborating any cluster analysis, it is advisable for the researcher to study the data behavior and to check for the existence of discrepant observations in relation to certain variables, since the creation of clusters is very sensitive to the presence of outliers. Excluding or retaining outliers in the dataset, however, will depend on the research objectives and on the type of data researchers have. If certain observations represent anomalies in terms of variable values, when compared to the other observations, and end up forming small, insignificant, or even individual clusters, they can, in fact, be excluded. On the other hand, if these observations represent one or more relevant groups, even if different from the others, they must be considered in the analysis and, whenever the technique is reapplied, they can be separated so that other segmentations can be better structured in new groups, formed with higher internal homogeneity.
We would like to emphasize that cluster analysis methods are considered static procedures, since the inclusion of new observations or variables may change the clusters, thus, making it mandatory to develop a new analysis.
In this example, we realized that the original variables from which the groups are established are metric, since the clustering started from the study of the distance behavior (dissimilarity measures) between the observations. In some cases, as we will study throughout this chapter, cluster analyses can be elaborated from the similarity behavior (similarity measures) between observations that present binary variables. However, it is common for researchers to use the incorrect arbitrary weighting procedure with qualitative variables, as, for example, variables on the Likert scale, and, from then on, to apply a cluster analysis. This is a major error, since there are exploratory techniques meant exclusively for the study of the behavior of qualitative variables as, for example, the correspondence analysis.
Historically speaking, even though many distance and similarity measures date back to the end of the 19th century and the beginning of the 20th century, cluster analyses, as a better structured set of techniques, began in the field of Anthropology with Driver and Kroeber (1932), and in Psychology with Zubin (1938a,b) and Tryon (1939), as discussed by Reis (2001) and Fávero et al. (2009). With the acknowledgment that observation clustering and classification procedures are scientific methods, together with astonishing technological developments, mainly verified after the 1960s, cluster analyses started being used more frequently after Sokal and Sneath’s (1963) relevant work was published, in which procedures are carried out to compare the biological similarities of organisms with similar characteristics and the respective species.
Currently, cluster analysis offers several application possibilities in the fields of consumer behavior, market segmentation, strategy, political science, economics, finance, accounting, actuarial science, engineering, logistics, computer science, education, medicine, biology, genetics, biostatistics, psychology, anthropology, demography, geography, ecology, climatology, geology, archeology, criminology and forensics, among others.
In this chapter, we will discuss cluster analysis techniques, aiming at: (1) introducing the concepts; (2) presenting the step by step of modeling, in an algebraic and practical way; (3) interpreting the results obtained; and (4) applying the technique in SPSS and in Stata. Following the logic proposed in the book, first, we will present the algebraic solution of an example jointly with the presentation of the concepts. Only after the introduction of concepts will the procedures for elaborating the techniques in SPSS and Stata be presented.
Many are the procedures for elaborating a cluster analysis, since there are different distance or similarity measures for metric or binary variables, respectively. Besides, after defining the distance or similarity measure, the researcher still needs to determine, among several possibilities, the observation clustering method, from certain hierarchical or nonhierarchical criteria. Therefore, when one wishes to group observations in internally homogeneous clusters, what initially seems trivial can become quite complex, because there are multiple combinations between different distance or similarity measures and clustering methods. Hence, based on the underlying theory and on his research objectives, as well as on his experience and intuition, it is extremely important for the researcher to define the criteria from which the observations will be allocated to each one of the groups.
In the following sections, we will discuss the theoretical development of the technique, along with a practical example. In Sections 11.2.1 and 11.2.2, the concepts of distance and similarity measures and clustering methods are presented and discussed, respectively, always followed by the algebraic solutions developed from a dataset.
As we have already discussed, the first phase for elaborating a cluster analysis consists in defining the distance (dissimilarity) or similarity measure that will be the basis for each observation to be allocated to a certain group.
Distance measures are frequently used when the variables in the dataset are essentially metric, since the greater the differences between the variable values of two observations, the smaller the similarity between them or, in other words, the greater the dissimilarity.
On the other hand, similarity measures are often used when the variables are binary, and what most interests us is the frequency of converging answer pairs 1-1 or 0-0 of two observations. In this case, the greater the frequency of converging pairs, the higher the similarity between the observations.
An exception to this rule is Pearson’s correlation coefficient between two observations, which is calculated from metric variables but has similarity characteristics, as we will see in the following section.
We will study the dissimilarity measures for metric variables in Section 11.2.1.1 and, in Section 11.2.1.2, we will discuss the similarity measures for binary variables.
As a hypothetical situation, imagine that we intend to calculate the distance between two observations i (i = 1, 2) from a dataset that has three metric variables (X1i, X2i, X3i), with values in the same unit of measure. These data can be found in Table 11.1.
Table 11.1
Observation i | X1i | X2i | X3i |
---|---|---|---|
1 | 3.7 | 2.7 | 9.1 |
2 | 7.8 | 8.0 | 1.5 |
It is possible to illustrate the configuration of both observations in a three-dimensional space from these data, since we have exactly three variables. Fig. 11.6 shows the relative position of each observation, emphasizing the distance between them (d12).
Distance d12, which is a dissimilarity measure, can be easily calculated by using, for instance, its projection over the horizontal plane formed by axes X1 and X2, called distance d′12, as shown in Fig. 11.7.
Thus, based on the well-known Pythagorean distance formula for right-angled triangles, we can determine d12 through the following expression:
$d_{12} = \sqrt{(d'_{12})^2 + (X_{31} - X_{32})^2}$ (11.1)
where | X31 − X32 | is the distance of the vertical projections (axis X3) from points 1 and 2.
However, distance d′12 is unknown to us, so, once again, we need to use the Pythagorean formula, now using the distances of the projections from Points 1 and 2 over the other two axes (X1 and X2), as shown in Fig. 11.8.
Thus, we can say that:
$d'_{12} = \sqrt{(X_{11} - X_{12})^2 + (X_{21} - X_{22})^2}$ (11.2)
and, substituting (11.2) into (11.1), we have:
$d_{12} = \sqrt{(X_{11} - X_{12})^2 + (X_{21} - X_{22})^2 + (X_{31} - X_{32})^2}$ (11.3)
which is the expression of distance (dissimilarity measure) between Points 1 and 2, also known as the Euclidean distance formula.
Therefore, for the data in our example, we have:
$d_{12} = \sqrt{(3.7 - 7.8)^2 + (2.7 - 8.0)^2 + (9.1 - 1.5)^2} = 10.132$
whose unit of measure is the same as for the original variables in the dataset. It is important to highlight that, if the variables do not have the same unit of measure, a data standardization procedure will have to be carried out previously, as we will discuss later.
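Although this chapter develops the technique in SPSS and Stata, the calculation above is easy to verify in a short, self-contained Python sketch (the function name `euclidean` and the data layout are ours, not part of the chapter):

```python
from math import sqrt

def euclidean(p, q):
    """Euclidean distance between two observations given as lists of variable values."""
    return sqrt(sum((xp - xq) ** 2 for xp, xq in zip(p, q)))

obs1 = [3.7, 2.7, 9.1]  # observation 1 from Table 11.1
obs2 = [7.8, 8.0, 1.5]  # observation 2 from Table 11.1
print(round(euclidean(obs1, obs2), 3))  # 10.132
```

The result matches the algebraic calculation above, in the same unit of measure as the original variables.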
We can generalize this problem for a situation in which the dataset has n observations and, for each observation i (i = 1, ..., n), values corresponding to each one of the j (j = 1, ..., k) metric variables X, as shown in Table 11.2.
Table 11.2
Observation i | X1i | X2i | … | Xki |
---|---|---|---|---|
1 | X11 | X21 | … | Xk1 |
2 | X12 | X22 | … | Xk2 |
⋮ | ⋮ | ⋮ | ⋱ | ⋮ |
p | X1p | X2p | … | Xkp |
⋮ | ⋮ | ⋮ | ⋱ | ⋮ |
q | X1q | X2q | … | Xkq |
⋮ | ⋮ | ⋮ | ⋱ | ⋮ |
n | X1n | X2n | … | Xkn |
So, Expression (11.4), based on Expression (11.3), presents the general definition of the Euclidean distance between any two observations p and q:
$d_{pq} = \sqrt{(X_{1p} - X_{1q})^2 + (X_{2p} - X_{2q})^2 + \ldots + (X_{kp} - X_{kq})^2} = \sqrt{\sum_{j=1}^{k}(X_{jp} - X_{jq})^2}$ (11.4)
Although the Euclidean distance is the most commonly used in cluster analyses, there are other dissimilarity measures, and the use of each one depends on the researcher’s assumptions and objectives. The main alternatives are:

Squared Euclidean distance:

$d_{pq} = (X_{1p} - X_{1q})^2 + (X_{2p} - X_{2q})^2 + \ldots + (X_{kp} - X_{kq})^2 = \sum_{j=1}^{k}(X_{jp} - X_{jq})^2$ (11.5)

Minkowski distance:

$d_{pq} = \left[\sum_{j=1}^{k}\left(|X_{jp} - X_{jq}|\right)^m\right]^{\frac{1}{m}}$ (11.6)

where m takes on positive integer values (m = 1, 2, ...). We can see that the Euclidean distance is a particular case of the Minkowski distance, when m = 2.

Manhattan distance:

$d_{pq} = \sum_{j=1}^{k}|X_{jp} - X_{jq}|$ (11.7)

Chebyshev distance:

$d_{pq} = \max_{j}|X_{jp} - X_{jq}|$ (11.8)

It is a particular case of the Minkowski distance as well, when m → ∞.

Canberra distance:

$d_{pq} = \sum_{j=1}^{k}\frac{|X_{jp} - X_{jq}|}{X_{jp} + X_{jq}}$ (11.9)
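The relationship between the Minkowski family and its special cases can be checked numerically. The sketch below (our own helper names, using the data from Table 11.1) shows that m = 1 gives the Manhattan distance, m = 2 the Euclidean distance, and the maximum absolute difference corresponds to the m → ∞ limit:

```python
def minkowski(p, q, m):
    """Minkowski distance of order m between two observations."""
    return sum(abs(xp - xq) ** m for xp, xq in zip(p, q)) ** (1.0 / m)

def chebyshev(p, q):
    """Limit of the Minkowski distance as m grows without bound."""
    return max(abs(xp - xq) for xp, xq in zip(p, q))

obs1, obs2 = [3.7, 2.7, 9.1], [7.8, 8.0, 1.5]
print(round(minkowski(obs1, obs2, 1), 3))  # 17.0   (Manhattan, m = 1)
print(round(minkowski(obs1, obs2, 2), 3))  # 10.132 (Euclidean, m = 2)
print(round(chebyshev(obs1, obs2), 3))     # 7.6    (Chebyshev, m -> infinity)
```

For large but finite m, `minkowski` already approaches the Chebyshev value, which is why the latter is treated as the limiting case.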
Whenever there are metric variables, the researcher can also use Pearson’s correlation, which, even though it is not a dissimilarity measure (in fact, it is a similarity measure), can provide important information when the aim is to group rows of the dataset. Pearson’s correlation expression between the values of any two observations p and q, based on Expression (4.11) presented in Chapter 4, can be written as follows:
$\rho_{pq} = \frac{\sum_{j=1}^{k}(X_{jp} - \bar{X}_p)\cdot(X_{jq} - \bar{X}_q)}{\sqrt{\sum_{j=1}^{k}(X_{jp} - \bar{X}_p)^2}\cdot\sqrt{\sum_{j=1}^{k}(X_{jq} - \bar{X}_q)^2}}$ (11.10)

where $\bar{X}_p$ and $\bar{X}_q$ represent the means of all variable values for observations p and q, respectively, that is, the mean of each one of the rows of the dataset.
Therefore, we can see that we are dealing with a coefficient of correlation between rows and not between columns (variables). It is the most common in data analysis and its values vary between − 1 and 1. Pearson’s correlation coefficient can be used as a similarity measure between the rows of the dataset in analyses that include time series, for example, that is, cases in which the observations represent periods. In this case, the researcher may intend to study the correlations between different periods, to investigate, for instance, a possible recurrence of behavior in the same row for the set of variables, which may cause certain periods, not necessarily subsequent ones, to be grouped by similarity of behavior.
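A correlation between rows is computed exactly like the usual correlation between columns, only with each observation's own row mean. A minimal sketch (our helper name, using the Table 11.1 data):

```python
from math import sqrt

def pearson_rows(p, q):
    """Pearson correlation between two rows (observations) of the dataset."""
    mean_p, mean_q = sum(p) / len(p), sum(q) / len(q)
    num = sum((xp - mean_p) * (xq - mean_q) for xp, xq in zip(p, q))
    den = (sqrt(sum((xp - mean_p) ** 2 for xp in p))
           * sqrt(sum((xq - mean_q) ** 2 for xq in q)))
    return num / den

print(round(pearson_rows([3.7, 2.7, 9.1], [7.8, 8.0, 1.5]), 3))  # -0.993
```

The value close to −1 indicates strongly opposite behavior of the two rows across the variables.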
Going back to the data presented in Table 11.1, we can calculate the different distance measures between observations 1 and 2, given by Expressions (11.4)–(11.9), as well as the correlational similarity measure, given by Expression (11.10). Table 11.3 shows these calculations and the respective results.
Table 11.3
Observation i | X1i | X2i | X3i | Mean |
---|---|---|---|---|
1 | 3.7 | 2.7 | 9.1 | 5.167 |
2 | 7.8 | 8.0 | 1.5 | 5.767 |
Euclidean distance:

$d_{12} = \sqrt{(3.7 - 7.8)^2 + (2.7 - 8.0)^2 + (9.1 - 1.5)^2} = 10.132$

Squared Euclidean distance:

$d_{12} = (3.7 - 7.8)^2 + (2.7 - 8.0)^2 + (9.1 - 1.5)^2 = 102.660$

Manhattan distance:

$d_{12} = |3.7 - 7.8| + |2.7 - 8.0| + |9.1 - 1.5| = 17.000$

Chebyshev distance:

$d_{12} = \max(|3.7 - 7.8|, |2.7 - 8.0|, |9.1 - 1.5|) = |9.1 - 1.5| = 7.600$

Canberra distance:

$d_{12} = \frac{|3.7 - 7.8|}{3.7 + 7.8} + \frac{|2.7 - 8.0|}{2.7 + 8.0} + \frac{|9.1 - 1.5|}{9.1 + 1.5} = 1.569$

Pearson’s correlation (similarity):

$\rho_{12} = \frac{(3.7 - 5.167)\cdot(7.8 - 5.767) + (2.7 - 5.167)\cdot(8.0 - 5.767) + (9.1 - 5.167)\cdot(1.5 - 5.767)}{\sqrt{(3.7 - 5.167)^2 + (2.7 - 5.167)^2 + (9.1 - 5.167)^2}\cdot\sqrt{(7.8 - 5.767)^2 + (8.0 - 5.767)^2 + (1.5 - 5.767)^2}} = -0.993$
Based on the results shown in Table 11.3, we can see that different measures produce different results, which may cause the observations to be allocated to different homogeneous clusters, depending on which measure was chosen for the analysis, as discussed by Vicini and Souza (2005) and Malhotra (2012). Therefore, it is essential for the researcher to always underpin his choice and to bear in mind the reasons why he decided to use a certain measure, instead of others. Simply using more than one measure, when analyzing the same dataset, can support this decision, since, in this case, the results can be compared.
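As a sanity check on Table 11.3, the five dissimilarity measures can be recomputed in a few lines of plain Python (the dictionary layout and names are ours):

```python
from math import sqrt

obs1, obs2 = [3.7, 2.7, 9.1], [7.8, 8.0, 1.5]  # observations from Table 11.1
diffs = [abs(a - b) for a, b in zip(obs1, obs2)]

measures = {
    "euclidean":    sqrt(sum(d ** 2 for d in diffs)),
    "sq_euclidean": sum(d ** 2 for d in diffs),
    "manhattan":    sum(diffs),
    "chebyshev":    max(diffs),
    "canberra":     sum(abs(a - b) / (a + b) for a, b in zip(obs1, obs2)),
}
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

Running this reproduces the values 10.132, 102.660, 17.000, 7.600, and 1.569 shown in the table.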
This becomes really clear when we include a third observation in the analysis, as shown in Table 11.4.
Table 11.4
Observation i | X1i | X2i | X3i |
---|---|---|---|
1 | 3.7 | 2.7 | 9.1 |
2 | 7.8 | 8.0 | 1.5 |
3 | 8.9 | 1.0 | 2.7 |
While the Euclidean distance suggests that the most similar observations (the shortest distance) are 2 and 3, when we use the Chebyshev distance, observations 1 and 3 are the most similar. Table 11.5 shows these distances for each pair of observations, highlighting, in bold characters, the smallest value of each distance.
Table 11.5
Distance | Pair of Observations 1 and 2 | Pair of Observations 1 and 3 | Pair of Observations 2 and 3 |
---|---|---|---|
Euclidean | d12 = 10.132 | d13 = 8.420 | **d23 = 7.187** |
Chebyshev | d12 = 7.600 | **d13 = 6.400** | d23 = 7.000 |
Hence, in a certain cluster schedule, and only due to the dissimilarity measure chosen, we would have different initial clusters.
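The effect just described, in which the closest pair of observations changes with the measure chosen, can be verified directly (helper names are ours; data from Table 11.4):

```python
from math import sqrt
from itertools import combinations

obs = {1: [3.7, 2.7, 9.1], 2: [7.8, 8.0, 1.5], 3: [8.9, 1.0, 2.7]}  # Table 11.4

def euclidean(p, q):
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def chebyshev(p, q):
    return max(abs(a - b) for a, b in zip(p, q))

for name, dist in (("Euclidean", euclidean), ("Chebyshev", chebyshev)):
    closest = min(combinations(obs, 2), key=lambda pq: dist(obs[pq[0]], obs[pq[1]]))
    print(name, "->", closest)
# Euclidean -> (2, 3)
# Chebyshev -> (1, 3)
```

The first merge of a hierarchical schedule would therefore differ between the two measures, exactly as Table 11.5 indicates.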
Besides deciding which distance measure to choose, the researcher also has to verify whether the data need to be treated previously. So far, in the examples already discussed, we were careful to choose metric variables with values in the same unit of measure (as, for example, students’ grades in Math, Physics, and Chemistry, which vary from 0 to 10). However, if the variables are measured in different units (as, for example, income in R$, educational level in years of study, and number of children), the intensity of the distances between the observations may be arbitrarily influenced by the variables that present greater magnitude in their values, to the detriment of the others. In these situations, the researcher must standardize the data, so that the arbitrariness of the measurement units is eliminated and each variable contributes equally to the distance measure considered.
The Z-scores procedure is the most frequently used method to standardize variables. In it, for each observation i, the value of a new standardized variable ZXj is obtained by subtracting from the corresponding value of the original variable Xj its mean and then dividing the result by its standard deviation, as presented in Expression (11.11).

$ZX_{ji} = \frac{X_{ji} - \bar{X}_j}{s_j}$ (11.11)

where $\bar{X}_j$ and $s_j$ represent the mean and the standard deviation of variable Xj, respectively. Hence, regardless of the magnitude of the values and of the measurement units of the original variables in a dataset, all the variables standardized by the Z-scores procedure will have a mean equal to 0 and a standard deviation equal to 1, which ensures that possible arbitrary effects of the measurement units on the distance between each pair of observations are eliminated. In addition, Z-scores have the advantage of not changing the shape of the distribution of the original variable.
Therefore, if the original variables are in different units, the distance measure Expressions (11.4)–(11.9) must have the terms Xjp and Xjq replaced by ZXjp and ZXjq, respectively. Table 11.6 presents these expressions, based on the standardized variables.
Table 11.6
Distance Measure (Dissimilarity) | Expression |
---|---|
Euclidean | $d_{pq}=\sqrt{\sum_{j=1}^{k}(ZX_{jp}-ZX_{jq})^2}$ |
Squared Euclidean | $d_{pq}=\sum_{j=1}^{k}(ZX_{jp}-ZX_{jq})^2$ |
Minkowski | $d_{pq}=\left[\sum_{j=1}^{k}\left(|ZX_{jp}-ZX_{jq}|\right)^m\right]^{\frac{1}{m}}$ |
Manhattan | $d_{pq}=\sum_{j=1}^{k}|ZX_{jp}-ZX_{jq}|$ |
Chebyshev | $d_{pq}=\max_{j}|ZX_{jp}-ZX_{jq}|$ |
Canberra | $d_{pq}=\sum_{j=1}^{k}\frac{|ZX_{jp}-ZX_{jq}|}{ZX_{jp}+ZX_{jq}}$ |
Even though Pearson’s correlation is not a dissimilarity measure (in fact, it is a similarity measure), it is important to mention that its use also requires that the variables be standardized by using the Z-scores procedure in case they do not have the same measurement units. If the main goal were to group variables, which is the main goal of the following chapter (factor analysis), the standardization of variables through the Z-scores procedure would, in fact, be irrelevant, given that the analysis would consist in assessing the correlation between columns of the dataset. On the other hand, as the objective of this chapter is to group rows from the dataset that represent the observations, the standardization of the variables is necessary for elaborating an accurate cluster analysis.
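A minimal sketch of the Z-scores procedure of Expression (11.11), using hypothetical income and age values in very different units (the data are illustrative, not from the chapter; the sample standard deviation is used):

```python
from math import sqrt

def zscores(column):
    """Standardize one variable (column): mean 0, sample standard deviation 1."""
    m = sum(column) / len(column)
    s = sqrt(sum((x - m) ** 2 for x in column) / (len(column) - 1))
    return [(x - m) / s for x in column]

# Hypothetical variables measured in different units:
income = [1000.0, 2000.0, 3000.0]  # R$
age = [20.0, 30.0, 40.0]           # years
print(zscores(income))  # [-1.0, 0.0, 1.0]
print(zscores(age))     # [-1.0, 0.0, 1.0]
```

After standardization both variables live on the same scale, so neither dominates the distance calculation merely because of its unit of measure.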
Now, imagine that we intend to calculate the distance between two observations i (i = 1, 2) coming from a dataset that has seven variables (X1i, ..., X7i), however, all of them related to the presence or absence of characteristics. In this situation, it is common for the presence or absence of a certain characteristic to be represented by a binary variable, or a dummy, which assumes value 1, in case the characteristic occurs, and 0, if otherwise. These data can be found in Table 11.7.
Table 11.7
Observation i | X1i | X2i | X3i | X4i | X5i | X6i | X7i |
---|---|---|---|---|---|---|---|
1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 |
2 | 0 | 1 | 1 | 1 | 1 | 0 | 1 |
It is important to highlight that the use of binary variables does not generate arbitrary weighting problems resulting from the variable categories, contrary to what would happen if discrete values (1, 2, 3, ...) were assigned to each category of each qualitative variable. In this regard, if a certain qualitative variable has k categories, (k − 1) binary variables will be necessary to represent the presence or absence of each one of the categories. Thus, all the binary variables will be equal to 0 in case the reference category occurs.
Therefore, by using Expression (11.5), we can calculate the squared Euclidean distance between observations 1 and 2, as follows:
$d_{12} = \sum_{j=1}^{7}(X_{j1} - X_{j2})^2 = (0-0)^2 + (0-1)^2 + (1-1)^2 + (1-1)^2 + (0-1)^2 + (1-0)^2 + (1-1)^2 = 3$
which represents the total number of variables with answer differences between observations 1 and 2.
Therefore, for any two observations p and q, the greater the number of equal answers (0-0 or 1-1), the shorter the squared Euclidean distance between them will be, since:
$(X_{jp} - X_{jq})^2 = \begin{cases} 0, & \text{if } X_{jp} = X_{jq} \\ 1, & \text{if } X_{jp} \neq X_{jq} \end{cases}$ (11.12)
As discussed by Johnson and Wichern (2007), each term of the distance, as represented in Expression (11.12), is a dissimilarity measure, since the greater the number of answer discrepancies, the greater the squared Euclidean distance. On the other hand, this calculation weights the answer pairs 0-0 and 1-1 equally, without giving higher relative importance to the pair of answers 1-1, which, in many cases, is a stronger similarity indicator than the pair of answers 0-0. For example, when we group people, the fact that two of them eat lobster every day is stronger evidence of similarity than the absence of this characteristic in both.
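For binary variables, each squared term is 0 (agreement) or 1 (disagreement), so the squared Euclidean distance is simply a count of discordant answers. This can be checked directly with the Table 11.7 data (variable names are ours):

```python
obs1 = [0, 0, 1, 1, 0, 1, 1]  # observation 1 from Table 11.7
obs2 = [0, 1, 1, 1, 1, 0, 1]  # observation 2 from Table 11.7

# Squared Euclidean distance between binary vectors...
sq_euclidean = sum((a - b) ** 2 for a, b in zip(obs1, obs2))
# ...equals the number of variables on which the two observations disagree.
mismatches = sum(1 for a, b in zip(obs1, obs2) if a != b)
print(sq_euclidean, mismatches)  # 3 3
```

This equality is exactly why, for binary data, the squared Euclidean distance treats 0-0 and 1-1 agreements interchangeably.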
Hence, many authors, aiming at defining similarity measures between observations, proposed the use of coefficients that would take the similarity of the answers 1-1 and 0-0 into consideration, and these pairs would not necessarily have the same relative importance. In order for us to be able to present these measures, it is necessary to construct an absolute frequency table of answers 0 and 1 for each pair of observations p and q, as shown in Table 11.8.
Table 11.8
Observation p | |||
---|---|---|---|
Observation q | 1 | 0 | Total |
1 | a | b | a + b |
0 | c | d | c + d |
Total | a + c | b + d | a + b + c + d |
Next, based on this table, we will discuss the main similarity measures, bearing in mind that the use of each one depends on the researcher’s assumptions and objectives.
Simple matching coefficient:

$s_{pq} = \frac{a + d}{a + b + c + d}$ (11.13)

Jaccard coefficient:

$s_{pq} = \frac{a}{a + b + c}$ (11.14)

Dice coefficient:

$s_{pq} = \frac{2\cdot a}{2\cdot a + b + c}$ (11.15)

Anti-Dice coefficient:

$s_{pq} = \frac{a}{a + 2\cdot(b + c)}$ (11.16)

Russell and Rao coefficient:

$s_{pq} = \frac{a}{a + b + c + d}$ (11.17)

Ochiai coefficient:

$s_{pq} = \frac{a}{\sqrt{(a + b)\cdot(a + c)}}$ (11.18)

Yule coefficient:

$s_{pq} = \frac{a\cdot d - b\cdot c}{a\cdot d + b\cdot c}$ (11.19)

Rogers and Tanimoto coefficient:

$s_{pq} = \frac{a + d}{a + d + 2\cdot(b + c)}$ (11.20)

Sneath and Sokal coefficient:

$s_{pq} = \frac{2\cdot(a + d)}{2\cdot(a + d) + b + c}$ (11.21)

Hamann coefficient:

$s_{pq} = \frac{(a + d) - (b + c)}{a + b + c + d}$ (11.22)
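All of these coefficients are simple functions of the four counts a, b, c, d of Table 11.8. A sketch computing the counts and a few of the coefficients for the Table 11.7 observations (function name ours):

```python
from math import sqrt

def frequencies(p, q):
    """Counts a, b, c, d of Table 11.8 for two binary observations p and q."""
    a = sum(1 for x, y in zip(p, q) if x == 1 and y == 1)  # converging 1-1
    b = sum(1 for x, y in zip(p, q) if x == 0 and y == 1)  # p = 0, q = 1
    c = sum(1 for x, y in zip(p, q) if x == 1 and y == 0)  # p = 1, q = 0
    d = sum(1 for x, y in zip(p, q) if x == 0 and y == 0)  # converging 0-0
    return a, b, c, d

a, b, c, d = frequencies([0, 0, 1, 1, 0, 1, 1], [0, 1, 1, 1, 1, 0, 1])
n = a + b + c + d
print((a, b, c, d))                                 # (3, 2, 1, 1), as in Table 11.9
print(round((a + d) / n, 3))                        # 0.571  simple matching
print(round(a / (a + b + c), 3))                    # 0.5    Jaccard
print(round(a / sqrt((a + b) * (a + c)), 3))        # 0.671  Ochiai
print(round((a * d - b * c) / (a * d + b * c), 3))  # 0.2    Yule
```

The printed values agree with Table 11.10; the remaining coefficients follow the same pattern from the same four counts.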
As was discussed in Section 11.2.1.1 as regards the dissimilarity measures applied to metric variables, let’s go back to the data presented in Table 11.7, aiming at calculating the different similarity measures between observations 1 and 2, which only have binary variables. In order to do that, from that table, we must construct the absolute frequency table of answers 0 and 1 for the observations mentioned (Table 11.9).
Table 11.9
Observation 1 | |||
---|---|---|---|
Observation 2 | 1 | 0 | Total |
1 | 3 | 2 | 5 |
0 | 1 | 1 | 2 |
Total | 4 | 3 | 7 |
So, using Expressions (11.13)–(11.22), we are able to calculate the similarity measures themselves. Table 11.10 presents the calculations and the results of each coefficient.
Table 11.10
Simple Matching: | Jaccard: |
---|---|
$s_{12}=\frac{3+1}{7}=0.571$ | $s_{12}=\frac{3}{6}=0.500$ |
Dice: | Anti-Dice: |
$s_{12}=\frac{2\cdot 3}{2\cdot 3+2+1}=0.667$ | $s_{12}=\frac{3}{3+2\cdot(2+1)}=0.333$ |
Russell and Rao: | Ochiai: |
$s_{12}=\frac{3}{7}=0.429$ | $s_{12}=\frac{3}{\sqrt{(3+2)\cdot(3+1)}}=0.671$ |
Yule: | Rogers and Tanimoto: |
$s_{12}=\frac{3\cdot 1-2\cdot 1}{3\cdot 1+2\cdot 1}=0.200$ | $s_{12}=\frac{3+1}{3+1+2\cdot(2+1)}=0.400$ |
Sneath and Sokal: | Hamann: |
$s_{12}=\frac{2\cdot(3+1)}{2\cdot(3+1)+2+1}=0.727$ | $s_{12}=\frac{(3+1)-(2+1)}{7}=0.143$ |
Analogous to what was discussed when the dissimilarity measures were calculated, we can clearly see that different similarity measures generate different results, which may cause, when defining the cluster method, the observations to be allocated to different homogeneous clusters, depending on which measure was chosen for the analysis.
Bear in mind that it does not make any sense to apply the Z-scores standardization procedure to calculate the similarity measures discussed in this section, since the variables used for the cluster analysis are binary.
At this moment, it is important to emphasize that, instead of using similarity measures to define the clusters whenever there are binary variables, it is very common to define clusters from the coordinates of each observation, which can be generated when elaborating simple or multiple correspondence analyses, for instance. This is an exploratory technique applied solely to datasets that have qualitative variables, aiming at creating perceptual maps, which are constructed based on the frequency of the categories of each one of the variables in analysis (Fávero and Belfiore, 2017).
After defining the coefficient that will be used, based on the research objectives, on the underlying theory, and on his experience and intuition, the researcher must move on to the definition of the cluster schedule. The main cluster analysis schedules will be studied in the following section.
As discussed by Vicini and Souza (2005) and Johnson and Wichern (2007), in cluster analysis, choosing the clustering method, also known as the agglomeration schedule, is as important as defining the distance (or similarity) measure, and this decision must also be made based on what researchers intend to do in terms of their research objectives.
Basically, agglomeration schedules can be classified into two types: hierarchical and nonhierarchical. While the former are characterized by a hierarchical (step-by-step) structure when forming clusters, nonhierarchical schedules use algorithms that maximize the homogeneity within each cluster, without going through a hierarchical process.
Hierarchical agglomeration schedules can be agglomerative (clustering) or divisive (partitioning), depending on how the process starts. If all the observations are initially considered separate and, based on their distances (or similarities), groups are formed until we reach a final stage with only one cluster, the process is agglomerative (clustering). Among all hierarchical agglomeration schedules, the most commonly used are those with the following linkage methods: nearest-neighbor or single-linkage, furthest-neighbor or complete-linkage, and between-groups or average-linkage. On the other hand, if all the observations start grouped and, stage after stage, smaller groups are formed by the separation of observations, until these subdivisions generate individual groups (that is, totally separated observations), we have a divisive (partitioning) process.
Conversely, nonhierarchical agglomeration schedules, among which the most popular is the k-means procedure, refer to processes in which clustering centers are defined, and from which the observations are allocated based on their proximity to them. Different from hierarchical schedules, in which the researcher can study the several possibilities for allocating observations and even define the ideal number of clusters based on each one of the grouping stages, a nonhierarchical agglomeration schedule requires that we previously stipulate the number of clusters from which the clustering centers will be defined and the observations allocated. That is why we recommend the generation of a hierarchical agglomeration schedule before constructing a nonhierarchical schedule, when there is no reasonable estimate of the number of clusters that can be formed from the observations in the dataset and based on the variables in study.
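The allocation logic of the k-means procedure can be illustrated with a toy, deterministic implementation (this is a pedagogical sketch in our own code, not the SPSS or Stata routine; the initial centers are supplied explicitly rather than chosen at random, and the data are hypothetical):

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    """Coordinate-wise mean of a nonempty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def kmeans(points, centers, iters=20):
    """Plain k-means: assign each point to its nearest center, recompute centers."""
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(len(centers)), key=lambda k: dist2(p, centers[k]))
                  for p in points]
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            new_centers.append(centroid(members) if members else centers[k])
        centers = new_centers
    return labels, centers

# Two visually obvious groups and k = 2 starting centers:
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centers = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Note that the number of clusters (here, the two initial centers) must be stipulated in advance, which is precisely why a hierarchical schedule is often run first to suggest that number.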
Fig. 11.9 shows the logic of agglomeration schedules in cluster analysis.
We will study hierarchical agglomeration schedules in Section 11.2.2.1, and Section 11.2.2.2 will be used to discuss the nonhierarchical k-means agglomeration schedule.
In this section, we will discuss the main hierarchical agglomeration schedules, in which larger and larger clusters are formed at each clustering stage, as new observations or groups are added to them according to a certain criterion (linkage method) and based on the distance measure chosen. In Section 11.2.2.1.1, the main concepts of these schedules will be presented, and, in Section 11.2.2.1.2, a practical example will be presented and solved algebraically.
There are three main linkage methods in hierarchical agglomeration schedules, as shown in Fig. 11.9: the nearest-neighbor or single-linkage, the furthest-neighbor or complete-linkage, and the between-groups or average-linkage.
Table 11.11 illustrates the distance to be considered in each clustering stage, based on the linkage method chosen.
Table 11.11
Linkage Method | Distance (Dissimilarity) |
---|---|
Single (Nearest-Neighbor or Single-Linkage) | d23 |
Complete (Furthest-Neighbor or Complete-Linkage) | d15 |
Average (Between-Groups or Average-Linkage) | (d13 + d14 + d15 + d23 + d24 + d25)/6 |
The single-linkage method favors the shortest distances (thus, the nomenclature nearest neighbor) so that new clusters can be formed at each clustering stage through the incorporation of observations or groups. Therefore, applying it is advisable in cases in which the observations are relatively far apart, that is, different, and we would like to form clusters considering a minimum of homogeneity. On the other hand, its analysis may be hampered when there are observations or clusters just a little farther apart from each other, as shown in Fig. 11.10.
The complete-linkage method, on the other hand, goes in the opposite direction, that is, it favors the greatest distances between the observations or groups so that new clusters can be formed (hence, the name furthest neighbor) and, in this regard, using it is advisable in cases in which there is no considerable distance between the observations, and the researcher needs to identify the heterogeneities between them.
Finally, in the average-linkage method, two groups merge based on the average distance between all the pairs of observations that are in these groups (hence, the name average linkage). Accordingly, even though there are changes in the calculation of the distance measures between the clusters, the average-linkage method ends up preserving the order of the observations in each group, offered by the single-linkage method, in case there is a considerable distance between the observations. The same happens with the sorting solution provided by the complete-linkage method, if the observations are very close to each other.
Johnson and Wichern (2007) proposed a logical sequence of steps in order to facilitate the understanding of a cluster analysis, elaborated through a certain hierarchical agglomerative method:
Therefore, the values that form the D matrices of each one of the stages will be a function of the distance measure chosen and of the linkage method adopted. In a certain clustering stage s, imagine that a researcher groups two clusters M and N formed previously, containing m and n observations, respectively, so that cluster MN can be formed. Next, he intends to group MN with another cluster W, with w observations. Since, in hierarchical agglomerative methods, the next merger is always decided by the smallest distance between each pair of observations or groups, the agglomeration schedule is essential for analyzing the distances that will form each matrix Ds. Using this logic and based on Table 11.11, let’s discuss the criterion for calculating the distance between clusters MN and W, inserted in matrix Ds, for each linkage method:
d(MN)W=min{dMW;dNW}
where dMW and dNW are the distances between the closest observations in clusters M and W and in clusters N and W, respectively.
d(MN)W=max{dMW;dNW}
where dMW and dNW are the distances between the farthest observations in clusters M and W and in clusters N and W, respectively.
d(MN)W = [Σ(p=1 to m+n) Σ(q=1 to w) dpq] / [(m + n) ⋅ w]
where dpq represents the distance between any observation p in cluster MN and any observation q in cluster W, and m + n and w represent the number of observations in clusters MN and W, respectively.
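The three linkage criteria above can be sketched in a few lines of Python. This is an illustrative helper, not part of the chapter; `linkage_distance` and its arguments are hypothetical names.

```python
# Sketch of the three linkage criteria (Expressions 11.23-11.25): the distance
# between a merged cluster MN and another cluster W, computed from the
# pairwise Euclidian distances between their individual observations.
from itertools import product
from math import dist  # Euclidian distance between coordinate tuples

def linkage_distance(cluster_mn, cluster_w, method="single"):
    """Distance between two clusters, given as lists of coordinate tuples."""
    pairs = [dist(p, q) for p, q in product(cluster_mn, cluster_w)]
    if method == "single":    # nearest neighbor: minimum pairwise distance
        return min(pairs)
    if method == "complete":  # furthest neighbor: maximum pairwise distance
        return max(pairs)
    if method == "average":   # between groups: mean of all pairwise distances
        return sum(pairs) / len(pairs)
    raise ValueError(f"unknown method: {method}")
```

For the cluster Gabriela-Ovidio versus Luiz Felipe in the example that follows, the three methods return 10.132, 10.290, and 10.211, respectively.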
In the following section, we will present a practical example that will be solved algebraically, and from which the concepts of hierarchical agglomerative methods will be established.
Imagine that a college professor, who is very concerned about his students’ capacity to learn the subject he teaches, Quantitative Methods, is interested in allocating them to groups with the highest homogeneity possible, based on the grades they obtained on the college entrance exams in subjects considered quantitative (Math, Physics, and Chemistry).
In order to do that, the professor collected information on these grades, which vary from 0 to 10. In addition, since he will carry out a cluster analysis, first, in an algebraic way, he decided, for pedagogical purposes, to only work with five students. This dataset can be seen in Table 11.12.
Table 11.12
Student (Observation) | Grade in Mathematics (X1i) | Grade in Physics (X2i) | Grade in Chemistry (X3i) |
---|---|---|---|
Gabriela | 3.7 | 2.7 | 9.1 |
Luiz Felipe | 7.8 | 8.0 | 1.5 |
Patricia | 8.9 | 1.0 | 2.7 |
Ovidio | 7.0 | 1.0 | 9.0 |
Leonor | 3.4 | 2.0 | 5.0 |
Based on the data obtained, the chart in Fig. 11.11 is constructed, and, since the variables are metric, the dissimilarity measure known as the Euclidian distance will be used for the cluster analysis. Besides, since all the variables have values in the same unit of measure (grades from 0 to 10), in this case, it will not be necessary to standardize them through Z-scores.
In the following sections, hierarchical agglomeration schedules based on the Euclidian distance will be elaborated through the three linkage methods being studied.
At this moment, from the data presented in Table 11.12, let’s develop a cluster analysis through a hierarchical agglomeration schedule with the single-linkage method. First of all, we define matrix D0, formed by the Euclidian distances (dissimilarities) between each pair of observations, as follows:
It is important to mention that at this initial moment each observation is considered an individual cluster, that is, in stage 0, we have 5 clusters (sample size). Highlighted in matrix D0 is the smallest distance between all the observations and, therefore, in the first stage, observations Gabriela and Ovidio will initially be grouped, and will now be a new cluster.
We must construct matrix D1 so that we can go to the next clustering stage, in which the distances between the cluster Gabriela-Ovidio and the observations that are still isolated are calculated. Thus, by using the single-linkage method and based on Expression (11.23), we have:
d(Gabriela−Ovidio)Luiz Felipe=min{10.132;10.290}=10.132
d(Gabriela−Ovidio)Patricia=min{8.420;6.580}=6.580
d(Gabriela−Ovidio)Leonor=min{4.170;5.474}=4.170
Matrix D1 can be seen:
In the same way, in matrix D1 the smallest distance between all of them is highlighted. Therefore, in the second stage, observation Leonor is inserted into the already-formed cluster Gabriela-Ovidio. Observations Luiz Felipe and Patricia still remain isolated.
We must construct matrix D2 so that we can take the next step, in which the distances between the cluster Gabriela-Ovidio-Leonor and the two remaining observations are calculated. Analogously, we have:
d(Gabriela−Ovidio−Leonor)Luiz Felipe=min{10.132;8.223}=8.223
d(Gabriela−Ovidio−Leonor)Patricia=min{6.580;6.045}=6.045
Matrix D2 can be written as:
In the third clustering stage, observation Patricia is incorporated into the cluster Gabriela-Ovidio-Leonor, since the corresponding distance is the smallest among all the ones presented in matrix D2. Therefore, we can write matrix D3, which comes next, taking into consideration the following criterion:
d(Gabriela−Ovidio−Leonor−Patricia)Luiz Felipe=min{8.223;7.187}=7.187
Finally, in the fourth and last stage, all the observations are allocated to the same cluster, thus, concluding the hierarchical process. Table 11.13 presents a summary of this agglomeration schedule constructed by using the single-linkage method.
Table 11.13
Stage | Cluster | Grouped Observation | Smallest Euclidian Distance |
---|---|---|---|
1 | Gabriela | Ovidio | 3.713 |
2 | Gabriela-Ovidio | Leonor | 4.170 |
3 | Gabriela-Ovidio-Leonor | Patricia | 6.045 |
4 | Gabriela-Ovidio-Leonor-Patricia | Luiz Felipe | 7.187 |
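The schedule in Table 11.13 can be reproduced with a short, illustrative Python sketch; the loop and variable names are our own, not part of the chapter.

```python
# A minimal single-linkage agglomeration sketch that reproduces the fusion
# distances of Table 11.13 from the five students' grades (Table 11.12).
from itertools import combinations
from math import dist

students = {"Gabriela": (3.7, 2.7, 9.1), "Luiz Felipe": (7.8, 8.0, 1.5),
            "Patricia": (8.9, 1.0, 2.7), "Ovidio": (7.0, 1.0, 9.0),
            "Leonor": (3.4, 2.0, 5.0)}

clusters = [[name] for name in students]  # stage 0: every observation alone
schedule = []
while len(clusters) > 1:
    # Single linkage: the distance between two clusters is the smallest
    # Euclidian distance between any pair of their members.
    d, a, b = min((min(dist(students[p], students[q]) for p in ca for q in cb),
                   ca, cb) for ca, cb in combinations(clusters, 2))
    clusters.remove(a)
    clusters.remove(b)
    clusters.append(a + b)        # merge the two closest clusters
    schedule.append(round(d, 3))  # record the fusion distance

print(schedule)  # fusion distance at each of the four stages
```

The printed distances match the "Smallest Euclidian Distance" column of Table 11.13: 3.713, 4.170, 6.045, and 7.187.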
Based on this agglomeration schedule, we can construct a tree-shaped diagram, known as a dendrogram or phenogram, whose main objective is to illustrate the step-by-step clustering process and to facilitate the visualization of how the observations are allocated at each stage. The dendrogram can be seen in Fig. 11.12.
Through Figs. 11.13 and 11.14, we are able to interpret the dendrogram constructed.
First of all, we drew three lines (I, II, and III) that are orthogonal to the dendrogram lines, as shown in Fig. 11.13, which allow us to identify the number of clusters in each clustering stage, as well as the observations in each cluster.
Therefore, line I “cuts” the dendrogram immediately after the first clustering stage and, at this moment, we can verify that there are four clusters (four intersections with the dendrogram’s horizontal lines), one of them formed by observations Gabriela and Ovidio, and the others, by the individual observations.
On the other hand, line II intersects three horizontal lines of the dendrogram, which means that, after the second stage, in which observation Leonor was incorporated into the already formed cluster Gabriela-Ovidio, there are three clusters.
Finally, line III is drawn immediately after the third stage, in which observation Patricia merges with the cluster Gabriela-Ovidio-Leonor. Since two intersections between this line and the dendrogram’s horizontal lines are identified, we can see that observation Luiz Felipe remains isolated, while the others form a single cluster.
Besides providing a study of the number of clusters in each clustering stage and of the allocation of observations, a dendrogram also allows the researcher to analyze the magnitude of the distance leaps needed to establish the clusters. A leap of high magnitude, in comparison to the others, can indicate that a considerably different observation or cluster is being incorporated into already formed clusters, which supports establishing a solution for the number of clusters without the need for the next clustering stage.
Although we know that setting an inflexible, mandatory number of clusters may hamper the analysis, at least giving an idea of this number, given the distance measure used and the linkage method adopted, may help researchers better understand the characteristics of the observations that led to this fact. Moreover, since the number of clusters is important for constructing nonhierarchical agglomeration schedules, this piece of information (considered an output of the hierarchical schedule) may serve as input for the k-means procedure.
Fig. 11.14 presents three distance leaps (A, B, and C), one for each clustering stage, and, from their analysis, we can see that leap B, which represents the incorporation of observation Patricia into the already formed cluster Gabriela-Ovidio-Leonor, is the greatest of the three. Therefore, if we intend to set the ideal number of clusters in this example, the researcher may choose the solution with three clusters (line II in Fig. 11.13), without the stage in which observation Patricia is incorporated, since Patricia possibly has characteristics that are not homogeneous enough to justify including her in the previously formed cluster, given the large distance leap. Thus, in this case, we would have a cluster formed by Gabriela, Ovidio, and Leonor, another one formed only by Patricia, and a third one formed only by Luiz Felipe.
When using dissimilarity measures in clustering methods, a very useful criterion for identifying the number of clusters consists in locating a considerable distance leap (whenever possible) and defining the number of clusters formed in the clustering stage immediately before the great leap, since very high leaps may incorporate observations with characteristics that are not so homogeneous.
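This criterion can be sketched as follows; the helper function and its name are hypothetical, used only to make the rule concrete.

```python
# "Largest leap" criterion: given the fusion distances of an agglomeration
# schedule, keep the number of clusters formed just before the largest jump.
def clusters_before_largest_leap(fusion_distances, n_observations):
    # leap between each pair of consecutive clustering stages
    leaps = [b - a for a, b in zip(fusion_distances, fusion_distances[1:])]
    stage_before_leap = leaps.index(max(leaps)) + 1  # stage preceding the leap
    return n_observations - stage_before_leap        # clusters remaining then

# Single-linkage schedule of Table 11.13: the leap from 4.170 to 6.045 is the
# largest, so the suggestion is the three-cluster solution (line II, Fig. 11.13).
print(clusters_before_largest_leap([3.713, 4.170, 6.045, 7.187], 5))
```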
Furthermore, it is also important to mention that, if the distance leaps from one stage to another are small, because the variables take values that are too close across the observations, reading the dendrogram can become difficult. In this case, the researcher may use the squared Euclidian distance, so that the leaps become clearer and better defined, making it easier to identify the clusters in the dendrogram and providing better arguments for the decision-making process.
Software such as SPSS shows dendrograms with rescaled distance measures, in order to facilitate the interpretation of the allocation of each observation and the visualization of the large distance leaps.
Fig. 11.15 illustrates how clusters can be established after the single-linkage method is elaborated.
Next, we will develop the same example. However, now, let’s use the complete- and average-linkage methods, so that we can compare the order of the observations and the distance leaps.
Matrix D0, shown here, is obviously the same, and the smallest Euclidian distance, the one highlighted, is between observations Gabriela and Ovidio that become the first cluster. It is important to emphasize that the first cluster will always be the same, regardless of the linkage method used, since the first stage will always consider the smallest distance between two pairs of observations, which are still isolated.
In the complete-linkage method, we must use Expression (11.24) to construct matrix D1, as follows:
d(Gabriela−Ovidio)Luiz Felipe=max{10.132;10.290}=10.290
d(Gabriela−Ovidio)Patricia=max{8.420;6.580}=8.420
d(Gabriela−Ovidio)Leonor=max{4.170;5.474}=5.474
Matrix D1 can be seen and by analyzing it, we can see that observation Leonor will be incorporated into the cluster formed by Gabriela and Ovidio. Once again, the smallest value, among all the ones shown in matrix D1, is highlighted.
As verified when using the single-linkage method, here, observations Luiz Felipe and Patricia also remain isolated at this stage. The differences between the methods start arising now. Therefore, we will construct matrix D2 using the following criteria:
d(Gabriela−Ovidio−Leonor)Luiz Felipe=max{10.290;8.223}=10.290
d(Gabriela−Ovidio−Leonor)Patricia=max{8.420;6.045}=8.420
Matrix D2 can be written as follows:
In the third clustering stage, a new cluster is formed by the fusion of observations Patricia and Luiz Felipe, since the furthest-neighbor criterion adopted in the complete-linkage method makes the distance between these two observations become the smallest among all the ones calculated to construct matrix D2. Therefore, notice that at this stage differences related to the single-linkage method appear, in terms of the sorting and allocation of the observations to groups.
Hence, to construct matrix D3, we must take the following criterion into consideration:
d(Gabriela−Ovidio−Leonor)(Luiz Felipe−Patricia)=max{10.290;8.420}=10.290
In the same way, in the fourth and last stage, all the observations are allocated to the same cluster, since there is the clustering between Gabriela-Ovidio-Leonor and Luiz Felipe-Patricia. Table 11.14 shows a summary of this agglomeration schedule, elaborated by using the complete-linkage method.
Table 11.14
Stage | Cluster | Grouped Observation | Smallest Euclidian Distance |
---|---|---|---|
1 | Gabriela | Ovidio | 3.713 |
2 | Gabriela-Ovidio | Leonor | 5.474 |
3 | Luiz Felipe | Patricia | 7.187 |
4 | Gabriela-Ovidio-Leonor | Luiz Felipe-Patricia | 10.290 |
This agglomeration schedule’s dendrogram can be seen in Fig. 11.16. We can initially see that the sorting of the observations is different from what was observed in the dendrogram seen in Fig. 11.12.
Analogous to what was carried out in the previous method, we chose to draw two vertical lines (I and II) over the largest distance leap, as shown in Fig. 11.17.
Thus, if the researcher chooses to consider three clusters, the solution will be the same as the one achieved previously through the single-linkage method, one formed by Gabriela, Ovidio, and Leonor, another one by Luiz Felipe, and a third one by Patricia (line I in Fig. 11.17). However, if he chooses to define two clusters (line II), the solution will be different since, in this case, the second cluster will be formed by Luiz Felipe and Patricia, while in the previous case, it was formed only by Luiz Felipe, since observation Patricia was allocated to the first cluster.
Similar to what was done in the previous method, Fig. 11.18 illustrates how the clusters can be established after the complete-linkage method is carried out.
The definition of the clusters can also be based on the average-linkage method, in which two groups merge based on the average distance between all the pairs of observations that belong to these groups. Therefore, as we have already discussed, if the most suitable method is the single linkage, because there are observations considerably far apart from one another, the sorting and allocation of the observations will be maintained by the average-linkage method. On the other hand, if the observations are very similar in the variables under study, the outputs of this method will be consistent with the solution achieved through the complete-linkage method as regards the sorting and allocation of the observations.
Thus, it is advisable for the researcher to apply the three linkage methods when elaborating a cluster analysis through hierarchical agglomeration schedules. Therefore, let’s move on to the average-linkage method.
First of all, let’s show the Euclidian distance matrix between each pair of observations (matrix D0), once again, highlighting the smallest distance between them.
By using Expression (11.25), we are able to calculate the terms of matrix D1, given that the first cluster Gabriela-Ovidio has already been formed. Thus, we have:
d(Gabriela−Ovidio)Luiz Felipe = (10.132 + 10.290)/2 = 10.211
d(Gabriela−Ovidio)Patricia = (8.420 + 6.580)/2 = 7.500
d(Gabriela−Ovidio)Leonor = (4.170 + 5.474)/2 = 4.822
Matrix D1 can be seen and, through it, we can see that observation Leonor is once again incorporated into the cluster formed by Gabriela and Ovidio. The smallest value among all the ones presented in matrix D1 has also been highlighted.
In order to construct matrix D2, in which the distances between the cluster Gabriela-Ovidio-Leonor and the two remaining observations are calculated, we must perform the following calculations:
d(Gabriela−Ovidio−Leonor)Luiz Felipe = (10.132 + 10.290 + 8.223)/3 = 9.548
d(Gabriela−Ovidio−Leonor)Patricia = (8.420 + 6.580 + 6.045)/3 = 7.015
Note that the distances used to calculate the dissimilarities to be inserted into matrix D2 are the original Euclidian distances between each pair of observations, that is, they come from matrix D0. Matrix D2 can be seen:
As verified when the single-linkage method was elaborated, here, observation Patricia is also incorporated into the cluster already formed by Gabriela, Ovidio and Leonor, and observation Luiz Felipe remains isolated. Finally, matrix D3 can be constructed from the following calculation:
d(Gabriela−Ovidio−Leonor−Patricia)Luiz Felipe = (10.132 + 10.290 + 8.223 + 7.187)/4 = 8.958
Once again, in the fourth and last stage, all the observations are in the same cluster. Table 11.15 and Fig. 11.19 present a summary of this agglomeration schedule and the corresponding dendrogram, respectively, resulting from this average-linkage method.
Table 11.15
Stage | Cluster | Grouped Observation | Smallest Euclidian Distance |
---|---|---|---|
1 | Gabriela | Ovidio | 3.713 |
2 | Gabriela-Ovidio | Leonor | 4.822 |
3 | Gabriela-Ovidio-Leonor | Patricia | 7.015 |
4 | Gabriela-Ovidio-Leonor-Patricia | Luiz Felipe | 8.958 |
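As a quick check, each fusion distance in Table 11.15 is just the mean of the corresponding original Euclidian distances from matrix D0 (values copied from the example above; the small helper is our own).

```python
# Verifying the average-linkage fusion distances of Table 11.15 as plain
# means of the original matrix D0 Euclidian distances.
def mean(*vals):
    return round(sum(vals) / len(vals), 3)

# Stage 2: Leonor joins Gabriela-Ovidio
stage2 = mean(4.170, 5.474)
# Stage 3: Patricia joins Gabriela-Ovidio-Leonor
stage3 = mean(8.420, 6.580, 6.045)
# Stage 4: Luiz Felipe joins the remaining cluster
stage4 = mean(10.132, 10.290, 8.223, 7.187)
print(stage2, stage3, stage4)
```

The three values reproduce the schedule: 4.822, 7.015, and 8.958.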
Despite having other distance values, we can see that Table 11.15 and Fig. 11.19 show the same sorting and the same allocation of observations in the clusters as those presented in Table 11.13 and in Fig. 11.12, respectively, obtained when the single-linkage method was elaborated.
Hence, we can state that the observations are significantly different from one another with regard to the variables studied, a fact corroborated by the consistency between the answers obtained from the single- and average-linkage methods. If the observations were more similar to one another, which is not what the diagram in Fig. 11.11 shows, the consistency of answers would occur between the complete- and average-linkage methods, as already discussed. Therefore, whenever possible, the initial construction of scatter plots may help researchers, even if in a preliminary way, choose the method to be adopted.
Hierarchical agglomeration schedules are very useful and offer us the possibility to analyze, in an exploratory way, the similarity between observations based on the behavior of certain variables. However, it is essential for researchers to understand that these methods are not conclusive by themselves and more than one answer may be obtained, depending on what is desired and on the data behavior.
Besides, it is necessary for researchers to be aware of how sensitive these methods are to the presence of outliers. The existence of a very discrepant observation may cause other observations, not so similar to one another, to be allocated to the same cluster simply because they are all extremely different from the observation considered an outlier. Hence, it is advisable to apply the hierarchical agglomeration schedules with the chosen linkage method several times and, in each application, to identify one or more observations considered outliers. This procedure will make the cluster analysis more reliable, since more homogeneous clusters may be formed. Researchers are free to characterize the most discrepant observation as the one that remains isolated until the penultimate clustering stage, that is, until just before the total fusion. Nonetheless, there are many methods to define an outlier. Barnett and Lewis (1994), for instance, mention almost 1,000 articles in the existing literature on outliers, and, for pedagogical purposes, in the Appendix of this chapter, we will discuss an efficient procedure in Stata for detecting outliers when carrying out a multivariate data analysis.
It is also important to emphasize, as we have already discussed in this section, that different linkage methods, when elaborating hierarchical agglomeration schedules, must be applied to the same dataset, and the resulting dendrograms, compared. This procedure will help researchers in their decision-making processes with regard to choosing the ideal number of clusters, and also to sorting the observations and allocating each one of them to the different clusters formed. This will even allow researchers to make coherent decisions about the number of clusters that may be considered input in a possible nonhierarchical analysis.
Last but not least, it is worth mentioning that the agglomeration schedules presented in this section (Tables 11.13, 11.14, and 11.15) provide increasing values of the clustering measures because a dissimilarity measure was used (Euclidian distance) as a comparison criterion between the observations. If we had chosen Pearson’s correlation between the observations, a similarity measure also used for metric variables, as we discussed in Section 11.2.1.1, the values of the clustering measures in the agglomeration schedules would be decreasing. The latter is also true for cluster analyses in which similarity measures are used, as the ones studied in Section 11.2.1.2, to assess the behavior of observations based on binary variables.
In the following section we will develop the same example, in an algebraic way, using the nonhierarchical k-means agglomeration schedule.
Among all the nonhierarchical agglomeration schedules, the k-means procedure is the one most often used by researchers in several fields of knowledge. Given that the number of clusters must be previously defined by the researcher, this procedure can be elaborated after the application of a hierarchical agglomeration schedule whenever we have no idea of the number of clusters that can be formed; in this situation, the output obtained from the hierarchical schedule can serve as input for the nonhierarchical procedure.
As with the sequence developed in Section 11.2.2.1.1, we now present a logical sequence of steps, based on Johnson and Wichern (2007), in order to facilitate the understanding of the cluster analysis through the k-means procedure:
Centroid coordinate x̄ must be recalculated whenever a certain observation p is included in or excluded from the respective cluster, based on the following expressions:
x̄new = (N ⋅ x̄ + xp) / (N + 1), if observation p is inserted into the cluster under analysis
x̄new = (N ⋅ x̄ − xp) / (N − 1), if observation p is excluded from the cluster under analysis
where N and x̄ refer to the number of observations in the cluster and to its centroid coordinate before the reallocation of that observation, respectively. In addition, xp refers to the coordinate of observation p, which changed clusters.
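Expressions (11.26) and (11.27) can be sketched as simple helper functions; the names are our own, and coordinates are assumed to be tuples, one value per variable.

```python
# Incremental centroid updates for the k-means procedure:
# recompute a centroid when an observation enters or leaves the cluster.
def centroid_after_insert(centroid, n, x):
    """New coordinates when observation x joins a cluster of n points."""
    return tuple((n * c + xi) / (n + 1) for c, xi in zip(centroid, x))

def centroid_after_remove(centroid, n, x):
    """New coordinates when observation x leaves a cluster of n points."""
    return tuple((n * c - xi) / (n - 1) for c, xi in zip(centroid, x))

# Example from the chapter: Gabriela (3.7, 2.7, 9.1) leaving the
# Gabriela-Luiz Felipe centroid (5.75, 5.35, 5.30) restores Luiz Felipe's
# original grades, as noted below Table 11.18.
print(centroid_after_remove((5.75, 5.35, 5.30), 2, (3.7, 2.7, 9.1)))
```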
For two variables (X1 and X2), Fig. 11.20 shows a hypothetical situation that represents the end of the k-means procedure, in which it is no longer possible to reallocate any observation, because no observation is closer to the centroid of another cluster than to the centroid of its own.
Different from hierarchical agglomeration schedules, the matrix of distances between observations does not need to be computed at each step, which reduces the computational requirements and allows nonhierarchical agglomeration schedules to be applied to considerably larger datasets than those traditionally studied through hierarchical schedules.
In addition, bear in mind that the variables must be standardized before elaborating the k-means procedure, and in the hierarchical agglomeration schedules too, if the respective values are not in the same unit of measure.
Finally, after concluding this procedure, it is important for researchers to analyze if the values of a certain metric variable differ between the groups defined, that is, if the variability between the clusters is significantly higher than the internal variability of each cluster. The F-test of the one-way analysis of variance, or one-way ANOVA, allows us to develop this analysis, and its null and alternative hypotheses can be defined as follows:
Therefore, a single F-test can be applied to each variable, aiming to assess the existence of at least one difference among all the comparison possibilities; in this regard, the main advantage of applying it is that adjustments for the discrepant dimensions of the groups do not need to be carried out in order to analyze several comparisons. On the other hand, rejecting the null hypothesis at a certain significance level does not allow the researcher to know which group(s) is(are) statistically different from the others in relation to the variable being analyzed.
The F statistical expression, corresponding to this test, is given by the following expression:
F = variability between the groups / variability within the groups = [Σ(k=1 to K) Nk ⋅ (X̄k − X̄)² / (K − 1)] / [Σk Σi (Xki − X̄k)² / (n − K)]
where Nk is the number of observations in the k-th cluster, X̄k is the mean of variable X in that same k-th cluster, X̄ is the overall mean of variable X, and Xki is the value that variable X takes on for a certain observation i present in the k-th cluster. In addition, K represents the number of clusters to be compared, and n, the sample size.
By using the F statistic, researchers will be able to identify the variables whose means most differ between the groups, that is, those that most contribute to the formation of at least one of the K clusters (highest F statistic), as well as those that do not contribute to the formation of the suggested number of clusters, at a certain significance level.
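The F statistic above can be sketched for a single variable; the function name is our own, and, as an illustration, it is applied to the Math grades under the three-cluster solution discussed earlier (Gabriela-Ovidio-Leonor, Patricia, Luiz Felipe).

```python
# One-way ANOVA F statistic for one variable across K clusters:
# between-group variability divided by within-group variability.
def one_way_f(groups):
    """groups: list of lists, one list of the variable's values per cluster."""
    n = sum(len(g) for g in groups)          # total sample size
    K = len(groups)                          # number of clusters
    grand = sum(sum(g) for g in groups) / n  # overall mean of the variable
    between = sum(len(g) * (sum(g) / len(g) - grand) ** 2
                  for g in groups) / (K - 1)
    within = sum((x - sum(g) / len(g)) ** 2
                 for g in groups for x in g) / (n - K)
    return between / within

# Math grades split into the three clusters of the example
print(one_way_f([[3.7, 7.0, 3.4], [8.9], [7.8]]))
```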
In the following section, we will discuss a practical example that will be solved algebraically, and from which the concepts of the k-means procedure may be established.
To solve the nonhierarchical k-means agglomeration schedule algebraically, let’s use the data from our own example, which can be found in Table 11.12 and are shown in Table 11.16.
Table 11.16
Student (Observation) | Grade in Mathematics (X1i) | Grade in Physics (X2i) | Grade in Chemistry (X3i) |
---|---|---|---|
Gabriela | 3.7 | 2.7 | 9.1 |
Luiz Felipe | 7.8 | 8.0 | 1.5 |
Patricia | 8.9 | 1.0 | 2.7 |
Ovidio | 7.0 | 1.0 | 9.0 |
Leonor | 3.4 | 2.0 | 5.0 |
Software packages such as SPSS use the Euclidian distance as the standard dissimilarity measure, which is why we will develop the algebraic procedures based on this measure. This criterion will also allow the results obtained to be compared to the ones found when elaborating the hierarchical agglomeration schedules in Section 11.2.2.1.2, since, in those situations, the Euclidian distance was also used. In the same way, it will not be necessary to standardize the variables through Z-scores, since all of them are in the same unit of measure (grades from 0 to 10). Otherwise, it would be crucial for researchers to standardize the variables before elaborating the k-means procedure.
Using the logical sequence presented in Section 11.2.2.2.1, we will develop the k-means procedure with K = 3 clusters. This number of clusters may have come from a decision made by the researcher and based on a certain preliminary criterion, or it was chosen based on the outputs of the hierarchical agglomeration schedules. In our case, the decision was made based on the comparison of the dendrograms that had already been constructed, and by the similarity of the outputs obtained by the single- and average-linkage methods.
Thus, we need to arbitrarily allocate the observations to three clusters, so that the respective centroids can be calculated. Therefore, we can establish that observations Gabriela and Luiz Felipe form the first cluster, Patricia and Ovidio, the second, and Leonor, the third. Table 11.17 shows the arbitrary formation of these preliminary clusters, as well as the calculation of the respective centroid coordinates, which makes the initial step of the k-means procedure algorithm possible.
Table 11.17
Cluster | Grade in Mathematics | Grade in Physics | Grade in Chemistry |
---|---|---|---|
Gabriela-Luiz Felipe | (3.7 + 7.8)/2 = 5.75 | (2.7 + 8.0)/2 = 5.35 | (9.1 + 1.5)/2 = 5.30 |
Patricia-Ovidio | (8.9 + 7.0)/2 = 7.95 | (1.0 + 1.0)/2 = 1.00 | (2.7 + 9.0)/2 = 5.85 |
Leonor | 3.40 | 2.00 | 5.00 |
Based on these coordinates, we constructed the chart seen in Fig. 11.21, which shows the arbitrary allocation of each observation to its cluster and the respective centroids.
Based on the second step of the logical sequence presented in Section 11.2.2.2.1, we must choose a certain observation and calculate the distance between it and all the cluster centroids, assuming that it is or it is not reallocated to each cluster. Selecting the first observation (Gabriela), for example, we can calculate the distances between it and the centroids of the clusters that have already been formed (Gabriela-Luiz Felipe, Patricia-Ovidio, and Leonor) and, after that, assume that it leaves its cluster (Gabriela-Luiz Felipe), and is inserted into one of the other two clusters, forming the cluster Gabriela-Patricia-Ovidio or Gabriela-Leonor. Thus, from Expressions (11.26) and (11.27), we must recalculate the new centroid coordinates, simulating that, in fact, the reallocation of Gabriela to one of the two clusters takes place, as shown in Table 11.18.
Table 11.18
Centroid Coordinates

| Cluster | Simulation | Grade in Mathematics | Grade in Physics | Grade in Chemistry |
| --- | --- | --- | --- | --- |
| Luiz Felipe | Excluding Gabriela | (2 ⋅ 5.75 − 3.70)/(2 − 1) = 7.80 | (2 ⋅ 5.35 − 2.70)/(2 − 1) = 8.00 | (2 ⋅ 5.30 − 9.10)/(2 − 1) = 1.50 |
| Gabriela-Patricia-Ovidio | Including Gabriela | (2 ⋅ 7.95 + 3.70)/(2 + 1) = 6.53 | (2 ⋅ 1.00 + 2.70)/(2 + 1) = 1.57 | (2 ⋅ 5.85 + 9.10)/(2 + 1) = 6.93 |
| Gabriela-Leonor | Including Gabriela | (1 ⋅ 3.40 + 3.70)/(1 + 1) = 3.55 | (1 ⋅ 2.00 + 2.70)/(1 + 1) = 2.35 | (1 ⋅ 5.00 + 9.10)/(1 + 1) = 7.05 |
Obs.: Note that the values calculated for the Luiz Felipe centroid coordinates are exactly the same as this observation’s original coordinates, as shown in Table 11.16.
Thus, from Tables 11.16, 11.17, and 11.18, we can calculate the following Euclidian distances:
d(Gabriela, Gabriela-Luiz Felipe) = √[(3.70 − 5.75)² + (2.70 − 5.35)² + (9.10 − 5.30)²] = 5.066
d(Gabriela, Patricia-Ovidio) = √[(3.70 − 7.95)² + (2.70 − 1.00)² + (9.10 − 5.85)²] = 5.614
d(Gabriela, Leonor) = √[(3.70 − 3.40)² + (2.70 − 2.00)² + (9.10 − 5.00)²] = 4.170
d(Gabriela, Luiz Felipe) = √[(3.70 − 7.80)² + (2.70 − 8.00)² + (9.10 − 1.50)²] = 10.132
d(Gabriela, Gabriela-Patricia-Ovidio) = √[(3.70 − 6.53)² + (2.70 − 1.57)² + (9.10 − 6.93)²] = 3.743
d(Gabriela, Gabriela-Leonor) = √[(3.70 − 3.55)² + (2.70 − 2.35)² + (9.10 − 7.05)²] = 2.085
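These six Euclidian distances can be verified with a short Python sketch. Note that the centroid coordinates below are rounded to two decimal places, as in Tables 11.17 and 11.18, so the third decimal of some distances may differ slightly from the hand calculations:

```python
from math import dist  # Euclidean distance (Python 3.8+)

gabriela = (3.70, 2.70, 9.10)

# Centroids from Tables 11.17 and 11.18 (rounded to two decimal places).
centroids = {
    "Gabriela-Luiz Felipe":     (5.75, 5.35, 5.30),  # current cluster
    "Patricia-Ovidio":          (7.95, 1.00, 5.85),
    "Leonor":                   (3.40, 2.00, 5.00),
    "Luiz Felipe":              (7.80, 8.00, 1.50),  # after Gabriela leaves
    "Gabriela-Patricia-Ovidio": (6.53, 1.57, 6.93),  # after Gabriela joins
    "Gabriela-Leonor":          (3.55, 2.35, 7.05),  # after Gabriela joins
}

distances = {name: round(dist(gabriela, c), 3) for name, c in centroids.items()}
# Gabriela is moved to the candidate cluster with the smallest distance.
print(min(distances, key=distances.get))  # Gabriela-Leonor
```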
Since Gabriela is the closest to the Gabriela-Leonor centroid (the shortest Euclidian distance), we must reallocate this observation to the cluster initially formed only by Leonor. So, the cluster in which observation Gabriela was at first (Gabriela-Luiz Felipe) has just lost it, and now Luiz Felipe has become an individual cluster. Therefore, the centroids of the cluster that receives it and the one that loses it must be recalculated. Table 11.19 shows the creation of the new clusters and the calculation of the respective centroid coordinates too.
Table 11.19
Centroid Coordinates

| Cluster | Grade in Mathematics | Grade in Physics | Grade in Chemistry |
| --- | --- | --- | --- |
| Luiz Felipe | 7.80 | 8.00 | 1.50 |
| Patricia-Ovidio | 7.95 | 1.00 | 5.85 |
| Gabriela-Leonor | (3.7 + 3.4)/2 = 3.55 | (2.7 + 2.0)/2 = 2.35 | (9.1 + 5.0)/2 = 7.05 |
Based on these new coordinates, we can construct the chart shown in Fig. 11.22.
Once again, let’s repeat the previous step. At this moment, since observation Luiz Felipe is isolated, let’s simulate the reallocation of the third observation (Patricia). We must calculate the distances between it and the centroids of the clusters that have already been formed (Luiz Felipe, Patricia-Ovidio, and Gabriela-Leonor) and, afterwards, assume that it leaves its cluster (Patricia-Ovidio) and is inserted into one of the other two clusters, forming the cluster Luiz Felipe-Patricia or Gabriela-Patricia-Leonor. Also based on Expressions (11.26) and (11.27), we must recalculate the new centroid coordinates, simulating that, in fact, the reallocation of Patricia to one of these two clusters happens, as shown in Table 11.20.
Table 11.20
Centroid Coordinates

| Cluster | Simulation | Grade in Mathematics | Grade in Physics | Grade in Chemistry |
| --- | --- | --- | --- | --- |
| Luiz Felipe-Patricia | Including Patricia | (1 ⋅ 7.80 + 8.90)/(1 + 1) = 8.35 | (1 ⋅ 8.00 + 1.00)/(1 + 1) = 4.50 | (1 ⋅ 1.50 + 2.70)/(1 + 1) = 2.10 |
| Ovidio | Excluding Patricia | (2 ⋅ 7.95 − 8.90)/(2 − 1) = 7.00 | (2 ⋅ 1.00 − 1.00)/(2 − 1) = 1.00 | (2 ⋅ 5.85 − 2.70)/(2 − 1) = 9.00 |
| Gabriela-Patricia-Leonor | Including Patricia | (2 ⋅ 3.55 + 8.90)/(2 + 1) = 5.33 | (2 ⋅ 2.35 + 1.00)/(2 + 1) = 1.90 | (2 ⋅ 7.05 + 2.70)/(2 + 1) = 5.60 |
Obs.: Note that the values calculated for the Ovidio centroid coordinates are exactly the same as this observation’s original coordinates, as shown in Table 11.16.
Similar to what was carried out when simulating Gabriela’s reallocation, based on Tables 11.16, 11.19, and 11.20, let’s calculate the Euclidian distances between Patricia and each one of the centroids:
d(Patricia, Luiz Felipe) = √[(8.90 − 7.80)² + (1.00 − 8.00)² + (2.70 − 1.50)²] = 7.187
d(Patricia, Patricia-Ovidio) = √[(8.90 − 7.95)² + (1.00 − 1.00)² + (2.70 − 5.85)²] = 3.290
d(Patricia, Gabriela-Leonor) = √[(8.90 − 3.55)² + (1.00 − 2.35)² + (2.70 − 7.05)²] = 7.026
d(Patricia, Luiz Felipe-Patricia) = √[(8.90 − 8.35)² + (1.00 − 4.50)² + (2.70 − 2.10)²] = 3.593
d(Patricia, Ovidio) = √[(8.90 − 7.00)² + (1.00 − 1.00)² + (2.70 − 9.00)²] = 6.580
d(Patricia, Gabriela-Patricia-Leonor) = √[(8.90 − 5.33)² + (1.00 − 1.90)² + (2.70 − 5.60)²] = 4.684
Bearing in mind that the Euclidian distance between Patricia and the centroid of its own cluster, Patricia-Ovidio, is the shortest, there is no need to reallocate this observation to another cluster and, at this moment, we maintain the solution presented in Table 11.19 and in Fig. 11.22.
Next, we will develop the same procedure, however, simulating the reallocation of the fourth observation (Ovidio). Analogously, we must, therefore, calculate the distances between this observation and the centroids of the clusters that have already been formed (Luiz Felipe, Patricia-Ovidio, and Gabriela-Leonor) and, after that, assume that it leaves its cluster (Patricia-Ovidio) and is inserted into one of the other two clusters, forming the cluster Luiz Felipe-Ovidio or Gabriela-Ovidio-Leonor. Once again by using Expressions (11.26) and (11.27), we can recalculate the new centroid coordinates, simulating that, in fact, the reallocation of Ovidio to one of these two clusters takes place, as shown in Table 11.21.
Table 11.21
Centroid Coordinates

| Cluster | Simulation | Grade in Mathematics | Grade in Physics | Grade in Chemistry |
| --- | --- | --- | --- | --- |
| Luiz Felipe-Ovidio | Including Ovidio | (1 ⋅ 7.80 + 7.00)/(1 + 1) = 7.40 | (1 ⋅ 8.00 + 1.00)/(1 + 1) = 4.50 | (1 ⋅ 1.50 + 9.00)/(1 + 1) = 5.25 |
| Patricia | Excluding Ovidio | (2 ⋅ 7.95 − 7.00)/(2 − 1) = 8.90 | (2 ⋅ 1.00 − 1.00)/(2 − 1) = 1.00 | (2 ⋅ 5.85 − 9.00)/(2 − 1) = 2.70 |
| Gabriela-Ovidio-Leonor | Including Ovidio | (2 ⋅ 3.55 + 7.00)/(2 + 1) = 4.70 | (2 ⋅ 2.35 + 1.00)/(2 + 1) = 1.90 | (2 ⋅ 7.05 + 9.00)/(2 + 1) = 7.70 |
Obs.: Note that the values calculated for the Patricia centroid coordinates are exactly the same as this observation’s original coordinates, as shown in Table 11.16.
Next, we can see the calculations of the Euclidian distances between Ovidio and each one of the centroids, defined from Tables 11.16, 11.19, and 11.21:
d(Ovidio, Luiz Felipe) = √[(7.00 − 7.80)² + (1.00 − 8.00)² + (9.00 − 1.50)²] = 10.290
d(Ovidio, Patricia-Ovidio) = √[(7.00 − 7.95)² + (1.00 − 1.00)² + (9.00 − 5.85)²] = 3.290
d(Ovidio, Gabriela-Leonor) = √[(7.00 − 3.55)² + (1.00 − 2.35)² + (9.00 − 7.05)²] = 4.187
d(Ovidio, Luiz Felipe-Ovidio) = √[(7.00 − 7.40)² + (1.00 − 4.50)² + (9.00 − 5.25)²] = 5.145
d(Ovidio, Patricia) = √[(7.00 − 8.90)² + (1.00 − 1.00)² + (9.00 − 2.70)²] = 6.580
d(Ovidio, Gabriela-Ovidio-Leonor) = √[(7.00 − 4.70)² + (1.00 − 1.90)² + (9.00 − 7.70)²] = 2.791
In this case, since observation Ovidio is the closest to the centroid of Gabriela-Ovidio-Leonor (the shortest Euclidian distance), we must reallocate this observation to the cluster formed originally by Gabriela and Leonor. Therefore, observation Patricia becomes an individual cluster. Table 11.22 shows the centroid coordinates of clusters Luiz Felipe, Patricia, and Gabriela-Ovidio-Leonor.
Table 11.22
Centroid Coordinates

| Cluster | Grade in Mathematics | Grade in Physics | Grade in Chemistry |
| --- | --- | --- | --- |
| Luiz Felipe | 7.80 | 8.00 | 1.50 |
| Patricia | 8.90 | 1.00 | 2.70 |
| Gabriela-Ovidio-Leonor | 4.70 | 1.90 | 7.70 |
We will not carry out the procedure proposed for the fifth observation (Leonor), since it had already fused with observation Gabriela in the first step of the algorithm. We can consider that the k-means procedure is concluded, since it is no longer possible to reallocate any observation due to closer proximity to another cluster’s centroid. Fig. 11.23 shows the allocation of each observation to its cluster and their respective centroids. Note that the solution achieved is equal to the one reached through the single- (Fig. 11.15) and average-linkage methods, when we elaborated the hierarchical agglomeration schedules.
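For the interested reader, the step-by-step reallocation logic developed above can be condensed into a short Python sketch. This is an illustration of the chapter's procedure, not of the SPSS or Stata implementations; it repeats the reallocation sweep until no observation changes cluster:

```python
from math import dist  # Euclidean distance (Python 3.8+)

grades = {
    "Gabriela":    (3.7, 2.7, 9.1),
    "Luiz Felipe": (7.8, 8.0, 1.5),
    "Patricia":    (8.9, 1.0, 2.7),
    "Ovidio":      (7.0, 1.0, 9.0),
    "Leonor":      (3.4, 2.0, 5.0),
}

def centroid(members):
    pts = [grades[m] for m in members]
    return tuple(sum(p[i] for p in pts) / len(pts) for i in range(3))

# Arbitrary preliminary allocation, as in Table 11.17.
clusters = [["Gabriela", "Luiz Felipe"], ["Patricia", "Ovidio"], ["Leonor"]]

def reallocation_pass(clusters):
    """One sweep over the observations; returns True if any of them moved."""
    moved = False
    for obs in grades:
        i = next(k for k, c in enumerate(clusters) if obs in c)
        if len(clusters[i]) == 1:
            continue  # an isolated observation stays, keeping K = 3 clusters
        # Distance to its own centroid and to each centroid simulated with obs included.
        d = {k: dist(grades[obs], centroid(c if k == i else c + [obs]))
             for k, c in enumerate(clusters)}
        best = min(d, key=d.get)
        if best != i:
            clusters[i].remove(obs)
            clusters[best].append(obs)
            moved = True
    return moved

while reallocation_pass(clusters):
    pass  # repeat until no observation is reallocated

print(sorted(sorted(c) for c in clusters))
# [['Gabriela', 'Leonor', 'Ovidio'], ['Luiz Felipe'], ['Patricia']]
```

The sketch reproduces the final solution of Fig. 11.23: Luiz Felipe and Patricia as individual clusters, and Gabriela, Ovidio, and Leonor together.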
As we have already discussed, the matrix with the distances between the observations does not need to be defined at each step of the k-means procedure algorithm, unlike in the hierarchical agglomeration schedules. This reduces the requirements in terms of computational capability, allowing nonhierarchical agglomeration schedules to be applied to datasets significantly larger than the ones traditionally studied through hierarchical schedules.
Table 11.23 shows the Euclidian distances between each observation of the original dataset and the centroids of each one of the clusters formed.
Table 11.23
| Student (Observation) | Cluster: Luiz Felipe | Cluster: Patricia | Cluster: Gabriela-Ovidio-Leonor |
| --- | --- | --- | --- |
| Gabriela | 10.132 | 8.420 | 1.897 |
| Luiz Felipe | 0.000 | 7.187 | 9.234 |
| Patricia | 7.187 | 0.000 | 6.592 |
| Ovidio | 10.290 | 6.580 | 2.791 |
| Leonor | 8.223 | 6.045 | 2.998 |
We would like to emphasize that this algorithm can be elaborated with another preliminary allocation of the observations to the clusters besides the one chosen in this example. Reapplying the k-means procedure with several arbitrary choices, given K clusters, allows the researcher to assess how stable the clustering procedure is, and to underpin the allocation of the observations to the groups in a consistent way.
After concluding this procedure, it is essential to check, through the F-test of one-way ANOVA, if the values of each one of the three variables considered in the analysis are statistically different between the three clusters. To make the calculation of the F statistics that correspond to this test easier, we constructed Tables 11.24, 11.25, and 11.26, which show the means per cluster and the general mean of the variables mathematics, physics, and chemistry, respectively.
Table 11.24
| Cluster 1 | Cluster 2 | Cluster 3 |
| --- | --- | --- |
| X(Luiz Felipe) = 7.80 | X(Patricia) = 8.90 | X(Gabriela) = 3.70 |
|  |  | X(Ovidio) = 7.00 |
|  |  | X(Leonor) = 3.40 |
| X̄1 = 7.80 | X̄2 = 8.90 | X̄3 = 4.70 |

General mean: X̄ = 6.16
Table 11.25
| Cluster 1 | Cluster 2 | Cluster 3 |
| --- | --- | --- |
| X(Luiz Felipe) = 8.00 | X(Patricia) = 1.00 | X(Gabriela) = 2.70 |
|  |  | X(Ovidio) = 1.00 |
|  |  | X(Leonor) = 2.00 |
| X̄1 = 8.00 | X̄2 = 1.00 | X̄3 = 1.90 |

General mean: X̄ = 2.94
Table 11.26
| Cluster 1 | Cluster 2 | Cluster 3 |
| --- | --- | --- |
| X(Luiz Felipe) = 1.50 | X(Patricia) = 2.70 | X(Gabriela) = 9.10 |
|  |  | X(Ovidio) = 9.00 |
|  |  | X(Leonor) = 5.00 |
| X̄1 = 1.50 | X̄2 = 2.70 | X̄3 = 7.70 |

General mean: X̄ = 5.46
So, based on the values presented in these tables and by using Expression (11.28), we are able to calculate the variation between the groups and within them for each one of the variables, as well as the respective F statistics. Tables 11.27, 11.28, and 11.29 show these calculations.
Table 11.27
| Variability between the groups | [1 ⋅ (7.80 − 6.16)² + 1 ⋅ (8.90 − 6.16)² + 3 ⋅ (4.70 − 6.16)²] / (3 − 1) = 8.296 |
| Variability within the groups | [(3.70 − 4.70)² + (7.00 − 4.70)² + (3.40 − 4.70)²] / (5 − 3) = 3.990 |
| F | 8.296 / 3.990 = 2.079 |
Note: The calculation of the variability within the groups only took cluster 3 into consideration, since the others show variability equal to 0, because they are formed by a single observation.
Table 11.28
| Variability between the groups | [1 ⋅ (8.00 − 2.94)² + 1 ⋅ (1.00 − 2.94)² + 3 ⋅ (1.90 − 2.94)²] / (3 − 1) = 16.306 |
| Variability within the groups | [(2.70 − 1.90)² + (1.00 − 1.90)² + (2.00 − 1.90)²] / (5 − 3) = 0.730 |
| F | 16.306 / 0.730 = 22.337 |
Note: The same as the previous table.
Table 11.29
| Variability between the groups | [1 ⋅ (1.50 − 5.46)² + 1 ⋅ (2.70 − 5.46)² + 3 ⋅ (7.70 − 5.46)²] / (3 − 1) = 19.176 |
| Variability within the groups | [(9.10 − 7.70)² + (9.00 − 7.70)² + (5.00 − 7.70)²] / (5 − 3) = 5.470 |
| F | 19.176 / 5.470 = 3.506 |
Note: The same as Table 11.27.
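The F statistics of Tables 11.27, 11.28, and 11.29 can be reproduced with an illustrative Python sketch (variable indices 0, 1, and 2 stand for mathematics, physics, and chemistry; the function name is ours, not from the chapter):

```python
grades = {
    "Gabriela":    (3.7, 2.7, 9.1),
    "Luiz Felipe": (7.8, 8.0, 1.5),
    "Patricia":    (8.9, 1.0, 2.7),
    "Ovidio":      (7.0, 1.0, 9.0),
    "Leonor":      (3.4, 2.0, 5.0),
}
# Final k-means solution: clusters 1, 2, and 3.
clusters = [["Luiz Felipe"], ["Patricia"], ["Gabriela", "Ovidio", "Leonor"]]

def f_statistic(var):
    """One-way ANOVA F statistic for one variable across the K clusters."""
    groups = [[grades[m][var] for m in c] for c in clusters]
    n = sum(len(g) for g in groups)          # total observations (5)
    K = len(groups)                          # number of clusters (3)
    grand = sum(sum(g) for g in groups) / n  # general mean
    means = [sum(g) / len(g) for g in groups]
    between = sum(len(g) * (m - grand) ** 2
                  for g, m in zip(groups, means)) / (K - 1)
    within = sum((x - m) ** 2
                 for g, m in zip(groups, means) for x in g) / (n - K)
    return round(between / within, 3)

print([f_statistic(v) for v in range(3)])  # [2.079, 22.337, 3.506]
```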
Now, let’s analyze whether or not the null hypothesis of the F-tests is rejected for each one of the variables. Since there are two degrees of freedom for the variability between the groups (K − 1 = 2) and two degrees of freedom for the variability within the groups (n − K = 2), by using Table A in the Appendix, we have Fc = 19.00 (critical F at a significance level of 0.05). Therefore, only for the variable physics can we reject the null hypothesis that all the groups formed have the same mean, since Fcal = 22.337 > Fc = F2,2,5% = 19.00. So, for this variable, there is at least one group with a mean that is statistically different from the others. For the variables mathematics and chemistry, however, we cannot reject the test’s null hypothesis at a significance level of 0.05.
Software such as SPSS and Stata do not offer the Fc for the defined degrees of freedom and a certain significance level. However, they offer the Fcal significance level for these degrees of freedom. Thus, instead of analyzing if Fcal > Fc, we must verify if the Fcal significance level is less than 0.05 (5%). Therefore:
If Sig. F (or Prob. F) < 0.05, there is at least one difference between the groups for the variable under analysis.
The Fcal significance level can be obtained in Excel by using the command Formulas → Insert Function → FDIST (F.DIST.RT in more recent versions of Excel), which will open a dialog box like the one shown in Fig. 11.24.
As we can see in this figure, sig. F for the variable physics is less than 0.05 (sig. F = 0.043), that is, there is at least one difference between the groups for this variable at a significance level of 0.05. An inquisitive researcher will be able to carry out the same procedure for the variables mathematics and chemistry. In short, Table 11.30 presents the results of the one-way ANOVA, with the variation of each variable, the F statistics, and the respective significance levels.
Table 11.30
| Variable | Variability Between the Groups | Variability Within the Groups | F | Sig. F |
| --- | --- | --- | --- | --- |
| mathematics | 8.296 | 3.990 | 2.079 | 0.325 |
| physics | 16.306 | 0.730 | 22.337 | 0.043 |
| chemistry | 19.176 | 5.470 | 3.506 | 0.222 |
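The Sig. F column can also be cross-checked without Excel. In general, the significance level is the right tail of the F distribution for the given degrees of freedom; as a sketch, we use the fact (a known property of the F distribution, not from the chapter) that for the particular case of (2, 2) degrees of freedom the right-tail probability reduces to the closed form 1/(1 + F):

```python
# Right-tail p-value of the F distribution with (2, 2) degrees of freedom.
# This closed form holds only for this particular pair of degrees of freedom;
# in general, use a statistical library (e.g., scipy.stats.f.sf(f_cal, dfn, dfd)).
def sig_f_2_2(f_cal):
    return round(1 / (1 + f_cal), 3)

print(sig_f_2_2(2.079))   # mathematics: 0.325
print(sig_f_2_2(22.337))  # physics:     0.043
print(sig_f_2_2(3.506))   # chemistry:   0.222
```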
The one-way ANOVA table also allows the researcher to identify the variables that contribute most to the formation of at least one of the clusters: these are the variables with the greatest F statistic values, since they have, in at least one group, a mean that is statistically different from the means of the other groups. It is important to mention that F statistic values are very sensitive to the sample size and, in this case, the variables mathematics and chemistry ended up not having statistically different means among the three groups mainly because the sample is small (only five observations).
We would like to emphasize that this one-way ANOVA can also be carried out right after the application of a certain hierarchical agglomeration schedule, since it only depends on the classification of the observations into groups. The researcher must be careful about only one thing when comparing the results obtained by a hierarchical schedule to those obtained by a nonhierarchical one: the same distance measure must be used in both situations. If different distance measures are used, different allocations of the observations to the same number of clusters may occur and, therefore, different values of the F statistics may be calculated in the two situations.
In general, in case there are one or more variables that do not contribute to the formation of the suggested number of clusters, we recommend that the procedure be reapplied without it (or them). In these situations, the number of clusters may change and, if the researcher feels the need to underpin the initial input regarding the number of K clusters, he may even use a hierarchical agglomeration schedule without those variables before reapplying the k-means procedure, which will make the analysis cyclical.
Moreover, the existence of outliers may generate considerably disperse clusters, and treating the dataset in order to identify extremely discrepant observations becomes an advisable procedure, before elaborating nonhierarchical agglomeration schedules. In the Appendix of this chapter, an important procedure in Stata for detecting multivariate outliers will be presented.
As with hierarchical agglomeration schedules, the nonhierarchical k-means schedule cannot be used as an isolated technique to make a conclusive decision about the clustering of observations. The allocation of the observations and the formation of clusters may be extremely sensitive to the data behavior, the sample size, and the criteria adopted by the researcher. Combining the outputs found with those coming from other techniques can more powerfully underpin the choices made by the researcher and provide greater transparency in the decision-making process.
At the end of the cluster analysis, since the clusters formed can be represented in the dataset by a new qualitative variable with terms connected to each observation (cluster 1, cluster 2, ..., cluster K), other exploratory multivariate techniques can be elaborated from it, as, for example, a correspondence analysis, so that, depending on the researcher’s objectives, we can study a possible association between the clusters and the categories of other qualitative variables.
This new qualitative variable, which represents the allocation of each observation, may also be used as an explanatory variable of a certain phenomenon in confirmatory multivariate models, such as multiple regression models, as long as it is transformed into dummy variables that represent the categories (clusters) generated in the cluster analysis, as we will study in Chapter 13. On the other hand, such a procedure only makes sense when we intend to propose a diagnostic regarding the behavior of the dependent variable, without aiming at forecasts. Since a new observation is not allocated to any existing cluster, its allocation can only be obtained by including it in a new cluster analysis, which produces a new qualitative variable and, consequently, new dummies.
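As a hypothetical illustration of this transformation (the label strings and the choice of reference category are ours), the cluster membership of the example can be turned into dummy variables in a few lines:

```python
# Cluster labels produced by the analysis (cluster 3 = Gabriela-Ovidio-Leonor).
labels = {
    "Gabriela": "cluster 3", "Luiz Felipe": "cluster 1",
    "Patricia": "cluster 2", "Ovidio": "cluster 3", "Leonor": "cluster 3",
}
categories = sorted(set(labels.values()))
reference = categories[0]  # 'cluster 1' is omitted as the reference category
dummies = {obs: {c: int(lab == c) for c in categories if c != reference}
           for obs, lab in labels.items()}
print(dummies["Gabriela"])  # {'cluster 2': 0, 'cluster 3': 1}
```

With K clusters, K − 1 dummies are created, the omitted category acting as the reference level in the regression model.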
In addition, this new qualitative variable can also be considered dependent on a multinomial logistic regression model, allowing the researcher to evaluate the probabilities each observation has to belong to each one of the clusters formed, due to the behavior of other explanatory variables not initially considered in the cluster analysis. We would also like to highlight that this procedure depends on the research objectives and construct established, and has a diagnostic nature as regards the behavior of the variables in the sample for the existing observations, without a predictive purpose.
Finally, if the clusters formed present substantiality in relation to the number of observations allocated, by using other variables, we may even apply specific confirmatory techniques for each cluster identified, so that, possibly, better adjusted models can be generated.
Next, the same dataset will be used to run cluster analyses in SPSS and Stata. In Section 11.3, we will discuss the procedures for elaborating the techniques studied in SPSS and their results too. In Section 11.4, we will study the commands to perform the procedures in Stata, with the respective outputs.