11.3 Cluster Analysis with Hierarchical and Nonhierarchical Agglomeration Schedules in SPSS


In this section, we walk step by step through our example in IBM SPSS Statistics. The main objective is to give researchers the opportunity to run cluster analyses with hierarchical and nonhierarchical schedules in this software package, which is easy to use and whose operations are didactical. Every time an output is shown, we will mention the corresponding result obtained from the algebraic solution in the previous sections, so that researchers can compare them and deepen their knowledge of the topic. The use of the images in this section has been authorized by the International Business Machines Corporation©.

11.3.1 Elaborating Hierarchical Agglomeration Schedules in SPSS

Going back to the example presented in Section 11.2.2.1.2, remember that our professor is interested in grouping students into homogeneous clusters based on the grades (from 0 to 10) they obtained in Mathematics, Physics, and Chemistry on the college entrance exams. The data can be found in the file CollegeEntranceExams.sav and are exactly the same as the ones presented in Table 11.12. In this section, we will carry out the cluster analysis using the Euclidean distance between the observations and considering only the single-linkage method.

To run a hierarchical cluster analysis in SPSS, we must click on Analyze → Classify → Hierarchical Cluster.... A dialog box like the one shown in Fig. 11.25 will open.

Fig. 11.25 Dialog box for elaborating the cluster analysis with a hierarchical method in SPSS.

Next, we must insert the original variables from our example (mathematics, physics, and chemistry) into Variables and the variable that identifies the observations (student) in Label Cases by, as shown in Fig. 11.26. If the researcher does not have a variable that represents the name of the observations (in this case, a string), he may leave this last cell blank.

Fig. 11.26 Selecting the original variables.

First of all, in Statistics..., let’s choose the options Agglomeration schedule and Proximity matrix. The former adds to the outputs the table with the agglomeration schedule, constructed from the distance measure and the linkage method to be chosen; the latter adds the matrix with the distances between each pair of observations. Let’s keep the option None in Cluster Membership. Fig. 11.27 shows how this dialog box will look.

Fig. 11.27 Selecting the options that generate the agglomeration schedule and the matrix with the distances between the pairs of observations.

When we click on Continue, we will go back to the main dialog box of the hierarchical cluster analysis. Next, we must click on Plots.... As seen in Fig. 11.28, let’s select the option Dendrogram and the option None in Icicle.

In the same way, let’s click on Continue, so that we can go back to the main dialog box.

In Method..., which is the most important dialog box of the hierarchical cluster analysis, we must choose the single-linkage method, also known as the nearest neighbor. Thus, in Cluster Method, let’s select the option Nearest neighbor. An inquisitive researcher may see that the complete (Furthest neighbor) and average (Between-groups linkage) linkage methods, discussed in Section 11.2.2.1, are also available in this option.

Besides, since the variables in the dataset are metric, we have to choose one of the dissimilarity measures found in Measure → Interval. To maintain the same logic used when solving our example algebraically, we will choose the Euclidean distance as the dissimilarity measure and, therefore, we must select the option Euclidean distance. This option also offers the other measures studied in Section 11.2.1.1, such as the squared Euclidean distance, Minkowski, Manhattan (Block, in SPSS), Chebyshev, and Pearson’s correlation, which, even though it is a similarity measure, is also used for metric variables.

Although we do not use similarity measures in this example because we are not working with binary variables, it is important to mention that some similarity measures can be selected if necessary. Hence, as discussed in Section 11.2.1.2, in Measure → Binary, we can select the simple matching, Jaccard, Dice, Anti-Dice (Sokal and Sneath 2, in SPSS), Russell and Rao, Ochiai, Yule (Yule’s Q, in SPSS), Rogers and Tanimoto, Sneath and Sokal (Sokal and Sneath 1, in SPSS), and Hamann coefficients, among others.

Still in the same dialog box, the researcher may request that the cluster analysis be run on standardized variables. For situations in which the original variables have different measurement units, the option Z scores in Transform Values → Standardize can be selected. All calculations are then based on the standardized variables, which have means equal to 0 and standard deviations equal to 1.
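The effect of the Z-scores option can be sketched outside SPSS. The short Python example below is only an illustration: the variables income and age and their values are hypothetical (they are not part of our example), and we assume that the standardization divides by the sample standard deviation (denominator n − 1).

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize a variable: subtract its mean, divide by its sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1), an assumption of this sketch
    return [(v - m) / s for v in values]

# Hypothetical variables measured in very different units (illustrative values only)
income = [1200.0, 3400.0, 2500.0, 5100.0, 800.0]
age = [23.0, 41.0, 35.0, 52.0, 19.0]

z_income = z_scores(income)
z_age = z_scores(age)

# After standardization, each variable has mean 0 and standard deviation 1
for name, z in (("income", z_income), ("age", z_age)):
    print(name, round(mean(z), 6), round(stdev(z), 6))
```

Once both variables are on the same scale, neither of them dominates the distance calculations simply because of its unit of measure.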

After these considerations, the dialog box in our example will become what can be seen in Fig. 11.29.

Next, we can click on Continue and on OK.

The first output (Fig. 11.30) shows dissimilarity matrix D₀ formed by the Euclidian distances between each pair of observations. We can even see that in the legend it says, “This is a dissimilarity matrix.” If this matrix were formed by similarity measures, resulting from calculations elaborated from binary variables, it would say, “This is a similarity matrix.”

Through this matrix, which is equal to the one whose values were calculated and presented in Section 11.2.2.1.2, we can verify that observations Gabriela and Ovidio are the most similar (the smallest Euclidian distance) in relation to the variables mathematics, physics, and chemistry (d_{Gabriela−Ovidio} = 3.713).
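For readers who wish to check the proximity matrix outside SPSS, the following Python sketch computes the Euclidean distance for every pair of students. Note that the grades below are an assumption: Table 11.12 is not reproduced in this section, and the values were entered here because they reproduce the distances quoted in the text (for instance, 3.713 for Gabriela and Ovidio).

```python
from itertools import combinations
from math import dist

# Assumed grades (mathematics, physics, chemistry); Table 11.12 is not shown in this section
grades = {
    "Gabriela": (3.7, 2.7, 9.1),
    "Luiz Felipe": (7.8, 8.0, 1.5),
    "Patricia": (8.9, 1.0, 2.7),
    "Ovidio": (7.0, 1.0, 9.0),
    "Leonor": (3.4, 2.0, 5.0),
}

# Euclidean distance for each pair of observations
distances = {
    (a, b): dist(grades[a], grades[b]) for a, b in combinations(grades, 2)
}

for (a, b), d in distances.items():
    print(f"{a} - {b}: {d:.3f}")

# The most similar pair is the one with the smallest distance
closest = min(distances, key=distances.get)
print("Closest pair:", closest, round(distances[closest], 3))
```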

Therefore, in the hierarchical schedule shown in Fig. 11.31, the first clustering stage joins exactly these two students, with Coefficient (Euclidean distance) equal to 3.713. Note that the columns Cluster 1 and Cluster 2 under Cluster Combined identify what is merged at each stage: either isolated observations that have not yet been incorporated into any cluster, or clusters that have already been formed. Obviously, in the first clustering stage, the first cluster is formed by the fusion of two isolated observations.

Next, in the second stage, observation Leonor (5) is incorporated into the cluster previously formed by Gabriela (1) and Ovidio (4). With regard to the single-linkage method, we can see that the distance considered for the agglomeration of Leonor was the smallest between this observation and Gabriela or Ovidio, that is, the criterion adopted was:

$d_{(Gabriela - Ovidio) Leonor} = \min \{4.170; 5.474\} = 4.170$

We can also see that, while columns Stage Cluster First Appears Cluster 1 and Cluster 2 indicate in which previous stage each corresponding observation was incorporated into a certain cluster, column Next Stage shows in which future stage the respective cluster will receive a new observation or cluster, given that we are dealing with a clustering method.

In the third stage, observation Patricia (3) is incorporated to the already formed cluster, Gabriela-Ovidio-Leonor, respecting the following distance criterion:

$d_{(Gabriela - Ovidio - Leonor) Patricia} = \min \{8.420; 6.580; 6.045\} = 6.045$

And, finally, given that we have five observations, in the fourth and last stage, observation Luiz Felipe, which is still isolated (note that the last observation to be incorporated into a cluster corresponds to the last value equal to 0 in the column Stage Cluster First Appears Cluster 2), is incorporated to the cluster already formed by the other observations, concluding the agglomeration schedule. The distance considered at this stage is given by:

$d_{(Gabriela - Ovidio - Leonor - Patricia) Luiz Felipe} = \min \{10.132; 10.290; 8.223; 7.187\} = 7.187$
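The four stages described above can be reproduced with a short single-linkage routine. The Python sketch below is only an illustration, not the procedure SPSS implements internally. It takes as input the ten pairwise distances quoted in this section; the assignment of each value inside the min sets to a specific pair follows the order in which the students are named, which is our assumption.

```python
# Pairwise Euclidean distances quoted in the text (pair assignment inferred from the order of names)
D = {
    frozenset(("Gabriela", "Ovidio")): 3.713,
    frozenset(("Gabriela", "Leonor")): 4.170,
    frozenset(("Ovidio", "Leonor")): 5.474,
    frozenset(("Gabriela", "Patricia")): 8.420,
    frozenset(("Ovidio", "Patricia")): 6.580,
    frozenset(("Leonor", "Patricia")): 6.045,
    frozenset(("Gabriela", "Luiz Felipe")): 10.132,
    frozenset(("Ovidio", "Luiz Felipe")): 10.290,
    frozenset(("Leonor", "Luiz Felipe")): 8.223,
    frozenset(("Patricia", "Luiz Felipe")): 7.187,
}
students = ["Gabriela", "Luiz Felipe", "Patricia", "Ovidio", "Leonor"]

def single_linkage(labels, d):
    """Return the agglomeration schedule as (cluster A, cluster B, coefficient) per stage."""
    clusters = [frozenset([label]) for label in labels]
    schedule = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between clusters is the smallest pairwise distance
                link = min(d[frozenset((a, b))] for a in clusters[i] for b in clusters[j])
                if best is None or link < best[0]:
                    best = (link, i, j)
        link, i, j = best
        schedule.append((sorted(clusters[i]), sorted(clusters[j]), link))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return schedule

schedule = single_linkage(students, D)
for stage, (a, b, coefficient) in enumerate(schedule, start=1):
    print(stage, a, "+", b, coefficient)
```

The coefficients obtained in the four stages are 3.713, 4.170, 6.045, and 7.187, the same values that appear in the min expressions above.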

Based on how the observations are sorted in the agglomeration schedule and on the distances used as a clustering criterion, the dendrogram can be constructed, and it can be seen in Fig. 11.32. Note that the distance measures are rescaled to construct the dendrograms in SPSS, so that the interpretation of each observation allocation to the clusters and, mainly, visualizing the highest distance leaps can be made easier, as discussed in Section 11.2.2.1.2.1.

Fig. 11.32 Dendrogram—Single-linkage method and rescaled euclidian distances in SPSS.

The way the observations are sorted in the dendrogram corresponds to what was presented in the agglomeration schedule (Fig. 11.31), and, from the analysis shown in Fig. 11.32, it is possible to see that the greatest distance leap occurs when Patricia merges with Gabriela-Ovidio-Leonor, which had already been formed. This leap could have already been identified in the agglomeration schedule found in Fig. 11.31, since a large increase in distance occurs when we go from the second to the third stage, that is, when we increase the Euclidian distance from 4.170 to 6.045 (44.96%), so that a new cluster can be formed by incorporating another observation. Therefore, we can choose the existing configuration at the end of the second clustering stage, in which three clusters are formed. As discussed in Section 11.2.2.1.2.1, the criterion for identifying the number of clusters that considers the clustering stage immediately before a large leap is very useful and commonly used.
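The leap criterion just described can be written as a small calculation. The Python sketch below uses only the four coefficients of the agglomeration schedule quoted above; the variable names are our own.

```python
# Coefficients (Euclidean distances) of stages 1 to 4 in the agglomeration schedule
coefficients = [3.713, 4.170, 6.045, 7.187]
n_observations = 5

# Percentage increase in the coefficient from each stage to the next
increases = [
    (after - before) / before * 100
    for before, after in zip(coefficients, coefficients[1:])
]
for stage, pct in enumerate(increases, start=1):
    print(f"stage {stage} -> stage {stage + 1}: {pct:.2f}%")

# Keep the configuration at the end of the stage immediately before the largest leap
largest = max(range(len(increases)), key=increases.__getitem__)
stage_before_leap = largest + 1  # stages are numbered from 1
n_clusters = n_observations - stage_before_leap
print("Stage before the largest leap:", stage_before_leap)
print("Suggested number of clusters:", n_clusters)
```

The largest increase (44.96%) occurs between the second and third stages, which leads to the solution with three clusters.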

Fig. 11.33 shows a vertical line (a dashed line) that “cuts” the dendrogram in the region where the highest leaps occur. At this moment, since three intersections with lines from the dendrogram happen, we can identify three corresponding clusters formed by Gabriela-Ovidio-Leonor, Patricia, and Luiz Felipe, respectively.

As discussed, it is common to find dendrograms in which distance leaps are difficult to identify, mainly because the dataset contains observations that are considerably similar with respect to all the variables under analysis. In these situations, it is advisable to use the squared Euclidean distance and the complete-linkage method (furthest neighbor). This combination of criteria is very popular for datasets with extremely homogeneous observations.

Having adopted the solution with three clusters, we can once again click on Analyze → Classify → Hierarchical Cluster... and, on Statistics..., select the option Single solution in Cluster Membership. In this option, we must insert number 3 into Number of clusters, as shown in Fig. 11.34.

Fig. 11.34 Defining the number of clusters.

When we click on Continue, we will go back to the main dialog box of the cluster analysis. On Save..., let’s choose the option Single solution and, in the same way, insert number 3 into Number of clusters, as shown in Fig. 11.35, so that the new variable corresponding to the allocation of observations to the clusters can become available in the dataset.

Fig. 11.35 Selecting the option to save the allocation of observations to the clusters with the new variable in the dataset—Hierarchical procedure.

Next, we can click on Continue and on OK.

Although the outputs generated are the same, it is important to notice that a new table of results is presented, corresponding to the allocation of the observations to the clusters itself. Fig. 11.36 shows, for three clusters, that, while observations Gabriela, Ovidio, and Leonor form a single cluster, called 1, observations Luiz Felipe and Patricia form two individual clusters, called 2 and 3, respectively. Even though these names are numerical, it is important to highlight that they only represent the labels (categories) of a qualitative variable.
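The allocation shown for three clusters can be understood as replaying only the first two stages of the agglomeration schedule. The Python sketch below illustrates this idea; it is not the routine used by SPSS. The case numbers follow the text (Gabriela 1, Patricia 3, Ovidio 4, Leonor 5), and we assume that Luiz Felipe is case 2 and that clusters are numbered in order of first appearance in the dataset.

```python
students = ["Gabriela", "Luiz Felipe", "Patricia", "Ovidio", "Leonor"]

# Pairs of case numbers merged at stages 1 to 4 (each cluster is named after its lowest case number)
schedule = [(1, 4), (1, 5), (1, 3), (1, 2)]

def membership(n_cases, schedule, n_clusters):
    """Replay the first (n_cases - n_clusters) merges and label the resulting clusters."""
    parent = list(range(n_cases + 1))  # index 0 is unused; each case starts as its own cluster

    def find(case):
        while parent[case] != case:
            case = parent[case]
        return case

    for a, b in schedule[: n_cases - n_clusters]:
        root_a, root_b = find(a), find(b)
        parent[max(root_a, root_b)] = min(root_a, root_b)

    labels = {}
    assigned = []
    for case in range(1, n_cases + 1):
        root = find(case)
        if root not in labels:
            labels[root] = len(labels) + 1  # number clusters by first appearance
        assigned.append(labels[root])
    return assigned

clusters = membership(len(students), schedule, n_clusters=3)
for student, cluster in zip(students, clusters):
    print(student, cluster)
```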

Fig. 11.36 Allocating the observations to the clusters.

When elaborating the procedure described, we can see that a new variable is generated in the dataset. It is called CLU3_1 by SPSS, as shown in Fig. 11.37.

This new variable is automatically classified by the software as Nominal, that is, qualitative, as shown in Fig. 11.38, which can be obtained when we click on Variable View, in the lower left-hand side of the screen in SPSS.

As we have already discussed, variable CLU3_1 can be used in other exploratory techniques, such as correspondence analysis, or in confirmatory techniques. In the latter, it can be inserted, for example, into the vector of explanatory variables of a multiple regression model (as long as it is transformed into dummies), or used as the dependent variable of a multinomial logistic regression model, in which researchers intend to study how other variables, not included in the cluster analysis, affect the probability of each observation belonging to each one of the clusters formed. However, this decision depends on the research objectives.

At this moment, the researcher may consider the cluster analysis with hierarchical agglomeration schedules concluded. Nevertheless, based on the new variable CLU3_1 and by using a one-way ANOVA, he may still study whether the values of a certain variable differ between the clusters formed, that is, whether the variability between the groups is significantly higher than the variability within each one of them. Even though this analysis was not carried out when solving the hierarchical schedules algebraically, since we chose to present it only after the k-means procedure in Section 11.2.2.2.2, we can show how it is applied here, since we have already allocated the observations to the groups.

In order to do that, let’s click on Analyze → Compare Means → One-Way ANOVA.... In the dialog box that will open, we must insert the variables mathematics, physics, and chemistry into Dependent List and variable CLU3_1 (Single Linkage) into Factor. The dialog box will be as the one shown in Fig. 11.39.

Fig. 11.39 Dialog box with the selection of the variables to run the one-way analysis of variance in SPSS.

In Options..., let’s choose the options Descriptive (in Statistics) and Means plot, as shown in Fig. 11.40.

Next, we can click on Continue and on OK.

While Fig. 11.41 shows the descriptive statistics of the clusters per variable, similar to Tables 11.24, 11.25, and 11.26, Fig. 11.42 uses these values and shows the calculation of the variation between the groups (Between Groups) and within them (Within Groups), as well as the F statistics for each variable and the respective significance levels. We can see that these values correspond to the ones calculated algebraically in Section 11.2.2.2.2 and shown in Table 11.30.

From Fig. 11.42, we can see that sig. F for the variable physics is less than 0.05 (sig. F = 0.043), that is, there is at least one group that has a statistically different mean, when compared to the others, at a significance level of 0.05. However, the same cannot be said about the variables mathematics and chemistry.
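The F statistic and its significance level for the variable physics can be checked by hand. In the Python sketch below, the physics grades per cluster are an assumption (Table 11.12 is not reproduced in this section). The closed-form expression used for the significance level is a special case that holds only when the numerator has 2 degrees of freedom, as happens here with three clusters; it is not a general F-distribution routine.

```python
from statistics import mean

# Assumed physics grades per cluster: 1 = Gabriela, Ovidio, Leonor; 2 = Luiz Felipe; 3 = Patricia
physics = {1: [2.7, 1.0, 2.0], 2: [8.0], 3: [1.0]}

all_values = [v for group in physics.values() for v in group]
grand_mean = mean(all_values)

# Variation between groups and within groups
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in physics.values())
ss_within = sum((v - mean(g)) ** 2 for g in physics.values() for v in g)

df_between = len(physics) - 1               # K - 1
df_within = len(all_values) - len(physics)  # n - K

f_stat = (ss_between / df_between) / (ss_within / df_within)

# Upper-tail probability of F(2, df_within); valid only because df_between == 2
assert df_between == 2
sig_f = (1 + df_between * f_stat / df_within) ** (-df_within / 2)

print("F =", round(f_stat, 3))
print("sig. F =", round(sig_f, 3))
```

Under these assumptions, the significance level rounds to 0.043, the value reported for physics in Fig. 11.42.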

Although we have an idea of which group has a statistically different mean compared to the others for the variable physics, based on the outputs seen in Fig. 11.41, constructing the diagrams may facilitate the analysis of the differences between the variable means per cluster even more. The charts generated by SPSS (Figs. 11.43, 11.44, and 11.45) allow us to see these differences between the groups for each variable analyzed.

Fig. 11.43 Means of the variable mathematics in the three clusters.

Fig. 11.45 Means of the variable chemistry in the three clusters.

Therefore, from the chart seen in Fig. 11.44, it is possible to see that group 2, formed only by observation Luiz Felipe, in fact, has a mean different from the others in relation to the variable physics.

Besides, even though we can see from the diagrams in Figs. 11.43 and 11.45 that there are mean differences of the variables mathematics and chemistry between the groups, these differences cannot be considered statistically significant, at a significance level of 0.05, since we are dealing with a very small number of observations, and the F statistic values are very sensitive to the sample size. This graphical analysis becomes really useful when we are studying datasets with a larger number of observations and variables.

Finally, researchers can still complement their analysis by elaborating a procedure known as multidimensional scaling, since using the distance matrix may help them construct a chart that allows a two-dimensional visualization of the relative positions of each observation, regardless of the total number of variables.

In order to do that, we must structure a new dataset, formed exactly by the distance matrix. For the data in our example, we can open the file CollegeEntranceExamMatrix.sav, which contains the Euclidean distance matrix shown in Fig. 11.46. Note that both the columns and the rows of this new dataset refer to the observations in the original dataset (square distance matrix).

Fig. 11.46 Dataset with the Euclidean distance matrix.

Let’s click on Analyze → Scale → Multidimensional Scaling (ALSCAL).... In the dialog box that will open, we must insert the variables that represent the observations in Variables. Since the data already correspond to the distances, nothing needs to be done regarding the field Distances (Fig. 11.47).

Fig. 11.47 Dialog box with the selection of the variables to run the multidimensional scaling in SPSS.

In Model..., let’s select the option Ratio in Level of Measurement (note that the option Euclidean distance in Scaling Model has already been selected) and, in Options..., the option Group plots in Display, as shown in Figs. 11.48 and 11.49, respectively.

Fig. 11.48 Defining the nature of the variable that corresponds to the distance measure.

Fig. 11.49 Selecting the option for constructing the two-dimensional chart.

Next, we can click on Continue and on OK.

Fig. 11.50 shows the chart with the relative positions of the observations projected on a plane.

This type of chart is really useful when researchers wish to prepare didactical presentations of observation clusters (individuals, companies, municipalities, countries, among other examples) and to make the interpretation of the clusters easier, mainly when there is a relatively large number of variables in the dataset.

11.3.2 Elaborating Nonhierarchical K-Means Agglomeration Schedules in SPSS

Maintaining the same logic proposed in the chapter, from the same dataset, we will develop a cluster analysis based on the nonhierarchical k-means agglomeration schedule. Thus, we must once again use the file CollegeEntranceExams.sav.

In order to do that, we must click on Analyze → Classify → K-Means Cluster.... In the dialog box that will open, we must insert the variables mathematics, physics, and chemistry into Variables, and the variable student into Label Cases by. The main difference between this initial dialog box and the one corresponding to the hierarchical procedure is determining the number of clusters from which the k-means algorithm will be elaborated. In our example, let’s insert number 3 into Number of Clusters. Fig. 11.51 shows how the dialog box will be.

Fig. 11.51 Dialog box for elaborating the cluster analysis with the nonhierarchical K-means method in SPSS.

We can see that we inserted the original variables into the field Variables. This procedure is acceptable since, in our example, the values are in the same unit of measure. If this is not the case, researchers must standardize the variables through the Z-scores procedure before running the k-means procedure: in Analyze → Descriptive Statistics → Descriptives..., insert the original variables into Variables and select the option Save standardized values as variables. After clicking on OK, researchers will see that the new standardized variables have become part of the dataset.

Going back to the initial screen of the k-means procedure, we will click on Save.... In the dialog box that will open, we must select the option Cluster membership, as shown in Fig. 11.52.

Fig. 11.52 Selecting the option to save the allocation of observations to the clusters with the new variable in the dataset—Nonhierarchical procedure.

When we click on Continue, we will go back to the previous dialog box. In Options..., let’s select the options Initial cluster centers, ANOVA table, and Cluster information for each case, in Statistics, as shown in Fig. 11.53.

Fig. 11.53 Selecting the options to perform the K-means procedure.

Next, we can click on Continue and on OK. It is important to mention that SPSS already uses the Euclidian distance as a standard dissimilarity measure when elaborating the k-means procedure.

The first two outputs generated refer to the initial step and to the iteration of the k-means algorithm. The centroid coordinates are presented in the initial step and, through them, we can notice that SPSS considers that the three clusters are formed by the first three observations in the dataset. Although this decision is different from the one we used in Section 11.2.2.2.2, this choice is purely arbitrary and, as we will see later, it will not impact the formation of clusters in the final step of the k-means algorithm at all.

While Fig. 11.54 shows the values of the original variables for observations Gabriela, Luiz Felipe, and Patricia (as shown in Table 11.16) as the centroid coordinates of the three groups, in Fig. 11.55 we can see, after the first iteration of the algorithm, that the change in the centroid coordinate of the first cluster is 1.897, which corresponds exactly to the Euclidian distance between observation Gabriela and the cluster Gabriela-Ovidio-Leonor (as shown in Table 11.23). In this last figure, in the footnotes, it is also possible to see the measure 7.187 that corresponds to the Euclidian distance between observations Luiz Felipe and Patricia, which remain isolated after the iteration.
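The initial step and the first iteration can be followed in a few lines of code. The Python sketch below is a plain assign-and-update version of k-means and is not claimed to be the exact algorithm implemented in SPSS. The grades are an assumption (Table 11.12 is not reproduced in this section), and the initial centers are the first three observations, as described above.

```python
from math import dist

# Assumed grades (mathematics, physics, chemistry); Table 11.12 is not shown in this section
grades = {
    "Gabriela": (3.7, 2.7, 9.1),
    "Luiz Felipe": (7.8, 8.0, 1.5),
    "Patricia": (8.9, 1.0, 2.7),
    "Ovidio": (7.0, 1.0, 9.0),
    "Leonor": (3.4, 2.0, 5.0),
}
names = list(grades)
points = [grades[name] for name in names]
k = 3
centers = points[:k]  # initial centers: the first three observations

def assign(points, centers):
    """Allocate each observation to the nearest center (Euclidean distance)."""
    return [min(range(len(centers)), key=lambda c: dist(p, centers[c])) for p in points]

def update(points, labels, k):
    """Recompute each centroid as the mean of its members (assumes no cluster is empty)."""
    new_centers = []
    for c in range(k):
        members = [p for p, label in zip(points, labels) if label == c]
        new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return new_centers

labels = assign(points, centers)
shifts = []  # change in each centroid per iteration
while True:
    new_centers = update(points, labels, k)
    shifts.append([dist(old, new) for old, new in zip(centers, new_centers)])
    centers = new_centers
    new_labels = assign(points, centers)
    if new_labels == labels:
        break
    labels = new_labels

print("Change in centers, iteration 1:", [round(s, 3) for s in shifts[0]])
print("Final allocation:", dict(zip(names, (label + 1 for label in labels))))
```

Under these assumptions, the first centroid moves by 1.897 in the first iteration, while the other two, formed by a single observation each, do not move.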

Fig. 11.55 First iteration of the K-means algorithm and change in the centroid coordinates.

The next three figures refer to the final stage of the k-means algorithm. While the output Cluster Membership (Fig. 11.56) shows the allocation of each observation to each one of the three clusters, as well as the Euclidian distances between each observation and the centroid of the respective group, the output Distances between Final Cluster Centers (Fig. 11.58) shows the Euclidian distances between the group centroids. These two outputs have values that were calculated algebraically in Section 11.2.2.2.2 and shown in Table 11.23. Moreover, the output Final Cluster Centers (Fig. 11.57) shows the centroid coordinates of the groups after the final stage of this nonhierarchical procedure, which correspond to the values already calculated and presented in Table 11.22.

Fig. 11.57 Final stage of the K-Means algorithm—Cluster centroid coordinates.

Fig. 11.58 Final stage of the K-means algorithm—Distances between the cluster centroids.

Fig. 11.59 One-way analysis of variance in the K-means procedure—Variation between groups and within groups, F statistics, and significance levels per variable.

Fig. 11.60 Number of observations in each cluster.

Fig. 11.61 Dataset with the new variable QCL_1—Allocation of each observation.

Fig. 11.62 Description of the CollegeEntranceExams.dta dataset.

Fig. 11.63 Euclidean distance matrix between pairs of observations.

Fig. 11.64 Elaboration of the hierarchical cluster analysis and summary of the criteria used.

Fig. 11.65 Dataset with the new variables.

Fig. 11.66 Stages of the agglomeration schedule and respective Euclidian distances.

Fig. 11.67 Dendrogram—Single-linkage method and Euclidian distances in Stata.

Fig. 11.68 Allocating the observations to the clusters.

Fig. 11.69 ANOVA for the variables mathematics, physics, and chemistry.

Fig. 11.70 Elaborating the multidimensional scaling in Stata.

Fig. 11.71 Chart with projections of the relative positions of the observations.

Fig. 11.72 Elaborating the nonhierarchical K-means procedure and a summary of the criteria used.

Fig. 11.73 Number of observations in each cluster and allocation of observations.

Fig. 11.74 Means per cluster and general means of the variables mathematics, physics, and chemistry.

Fig. 11.75 Interrelationship between the variables and relative position of the observations in each cluster—matrix chart.

Fig. 11.76 Description of the Bacon.dta dataset.

Fig. 11.77 Applying the command bacon in Stata.

Fig. 11.78 Observations classified as multivariate outliers.

Fig. 11.79 Variables income and age—Relative position of the observations.

Fig. 11.80 Variables income and tgrad—Relative position of the observations.

Fig. 11.81 Variables age and tgrad—Relative position of the observations.

The ANOVA output (Fig. 11.59) is analogous to the one presented in Table 11.30 in Section 11.2.2.2.2 and in Fig. 11.42 in Section 11.3.1, and, through it, we can see that only the variable physics has a statistically different mean in at least one of the groups formed, when compared to the others, at a significance level of 0.05.

As we have previously discussed, if one or more variables are not contributing to the formation of the suggested number of clusters, we recommend that the algorithm be reapplied without these variables. The researcher can even use a hierarchical procedure without the aforementioned variables before reapplying the k-means procedure. For the data in our example, however, the analysis would become univariate due to the exclusion of the variables mathematics and chemistry, which demonstrates the risk researchers take when working with extremely small datasets in cluster analysis.

It is important to mention that the ANOVA output must only be used to study which variables contribute most to the formation of the specified number of clusters, since the clusters themselves are chosen so that the differences between the observations allocated to different groups are maximized. Thus, as explained in this output’s footnotes, we cannot use the F statistic to test whether the groups formed are equal or not. For this reason, it is common to find the term pseudo F for this statistic in the existing literature.

Finally, Fig. 11.60 shows the number of observations in each one of the clusters.

Similar to the hierarchical procedure, we can see that a new variable (obviously qualitative) is generated in the dataset after the preparation of the k-means procedure, which is called QCL_1 by SPSS, as shown in Fig. 11.61.

This variable ended up being identical to the variable CLU3_1 (Fig. 11.37) in this example. Nonetheless, this does not always happen, especially with a larger number of observations or when different dissimilarity measures are used in the hierarchical and nonhierarchical procedures.

Having presented the procedures for the application of the cluster analysis in SPSS, let’s discuss this technique in Stata.

11.4 Cluster Analysis With Hierarchical and Nonhierarchical Agglomeration Schedules in Stata

Now, we will present, step by step, how to prepare our example in Stata Statistical Software®. In this section, our main objective is not to discuss once again the concepts related to cluster analysis, but to give the researcher an opportunity to apply the technique by using the commands this software has to offer. Each time an output is presented, we will mention the respective result obtained when performing its algebraic solution and also when using SPSS. The use of the images in this section has been authorized by StataCorp LP©.

11.4.1 Elaborating Hierarchical Agglomeration Schedules in Stata

Therefore, let’s begin with the dataset constructed by the professor and which contains the grades in Mathematics, Physics, and Chemistry obtained by five students in the college entrance exams. The dataset can be found in the file CollegeEntranceExams.dta and is exactly the same as the one presented in Table 11.12 in Section 11.2.2.1.2.

Initially, we can type the command desc, which lets us inspect the characteristics of the dataset, such as the number of observations, the number of variables, and the description of each one of them. Fig. 11.62 shows the first output in Stata.

As discussed previously, since the original variables have values in the same unit of measure, in this example, it is not necessary to standardize them by using the Z-scores procedure. However, if the researcher wishes to, he may obtain the standardized variables through the following commands:

egen zmathematics = std(mathematics)

egen zphysics = std(physics)

egen zchemistry = std(chemistry)

First of all, let’s obtain the matrix with distances between the pairs of observations. In general, the sequence of commands for obtaining distance or similarity matrices in Stata is:

matrix dissimilarity D = variables⁎, option⁎

matrix list D

where the term variables⁎ will have to be substituted for the list of variables to be considered in the analysis, and the term option⁎ will have to be substituted for the term corresponding to the distance or similarity measure that the researcher wishes to use. While Table 11.31 shows the terms in Stata that correspond to each one of the measures for the metric variables studied in Section 11.2.1.1, Table 11.32 shows the terms related to the measures used for the binary variables studied in Section 11.2.1.2.

Table 11.31

Terms in Stata Corresponding to the Measures for Metric Variables
Measure for Metric Variables	Term in Stata
Euclidian	L2
Squared Euclidean	L2squared
Manhattan	L1
Chebyshev	Linf
Canberra	Canberra
Pearson’s Correlation	corr
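To make the terms in Table 11.31 more concrete, the Python sketch below computes the corresponding measures for a single pair of observations. The two vectors are hypothetical illustrative values, and the function names simply mirror the Stata terms; they are not Stata functions. The Canberra definition used here, the sum of |a − b| / (|a| + |b|), is the usual one and is stated as our assumption.

```python
from math import sqrt

def l1(x, y):
    """Manhattan distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def l2squared(x, y):
    """Squared Euclidean distance: sum of squared differences."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def l2(x, y):
    """Euclidean distance: square root of the squared Euclidean distance."""
    return sqrt(l2squared(x, y))

def linf(x, y):
    """Chebyshev distance: largest absolute difference."""
    return max(abs(a - b) for a, b in zip(x, y))

def canberra(x, y):
    """Canberra distance (terms with a zero denominator are skipped)."""
    return sum(abs(a - b) / (abs(a) + abs(b)) for a, b in zip(x, y) if abs(a) + abs(b) > 0)

# Hypothetical observations measured on three metric variables
x = (1.0, 2.0, 3.0)
y = (4.0, 6.0, 3.0)

for name, measure in (("L1", l1), ("L2", l2), ("L2squared", l2squared),
                      ("Linf", linf), ("Canberra", canberra)):
    print(name, round(measure(x, y), 3))
```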

Table 11.32

Terms in Stata Corresponding to the Measures for Binary Variables
Measure for Binary Variables	Term in Stata
Simple matching	matching
Jaccard	Jaccard
Dice	Dice
AntiDice	antiDice
Russell and Rao	Russell
Ochiai	Ochiai
Yule	Yule
Rogers and Tanimoto	Rogers
Sneath and Sokal	Sneath
Hamann	Hamann
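Three of the measures in Table 11.32 can be illustrated from the counts of a 2 × 2 agreement table. The Python sketch below uses two hypothetical binary observations; a, b, c, and d are the usual counts of (1,1), (1,0), (0,1), and (0,0) pairs, a notation we assume here.

```python
def agreement_counts(x, y):
    """Count the (1,1), (1,0), (0,1), and (0,0) pairs between two binary observations."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return a, b, c, d

def simple_matching(x, y):
    a, b, c, d = agreement_counts(x, y)
    return (a + d) / (a + b + c + d)

def jaccard(x, y):
    a, b, c, _ = agreement_counts(x, y)
    return a / (a + b + c)  # joint absences (d) are ignored

def dice(x, y):
    a, b, c, _ = agreement_counts(x, y)
    return 2 * a / (2 * a + b + c)  # joint presences receive double weight

# Hypothetical binary observations on eight variables
x = [1, 1, 0, 0, 1, 0, 1, 0]
y = [1, 0, 0, 1, 1, 0, 0, 0]

print("matching:", simple_matching(x, y))
print("Jaccard:", jaccard(x, y))
print("Dice:", round(dice(x, y), 4))
```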

Therefore, since we wish to obtain the Euclidian distance matrix between the pairs of observations, in order to maintain the criterion used in the chapter, we must type the following sequence of commands:

matrix dissimilarity D = mathematics physics chemistry, L2

matrix list D

The output generated, which can be seen in Fig. 11.63, is in accordance with what was presented in matrix D₀ in Section 11.2.2.1.2.1, and also in Fig. 11.30 when we elaborated the technique in SPSS (Section 11.3.1).

Next, we will carry out the cluster analysis itself. The general command used to run a cluster analysis through a hierarchical schedule in Stata is given by:

cluster method⁎ variables⁎, measure(option⁎)

where, besides the substitution of the terms variables⁎ and option⁎, as discussed previously, we must substitute the term method⁎ for the linkage method chosen by the researcher. Table 11.33 shows the terms in Stata related to the methods discussed in Section 11.2.2.1.

Table 11.33

Terms in Stata That Correspond to the Linkage Methods in Hierarchical Agglomeration Schedules
Linkage Method	Term in Stata
Single	singlelinkage
Complete	completelinkage
Average	averagelinkage

Therefore, for the data in our example and following the criterion adopted throughout this chapter (single-linkage method with Euclidian distance - term L2), we must type the following command:

cluster singlelinkage mathematics physics chemistry, measure(L2)

After that, we can type the command cluster list, which presents, in a summarized way, the criteria used by the researcher to develop the hierarchical cluster analysis. Fig. 11.64 shows the outputs generated.

From Fig. 11.64 and by analyzing the dataset, we can verify that three new variables are generated: the identification of each observation (_clus_1_id), the ordering of the observations when creating the clusters (_clus_1_ord), and the Euclidean distances at which a new observation is grouped in each one of the clustering stages (_clus_1_hgt). Fig. 11.65 shows the dataset after this cluster analysis is elaborated.

It is important to mention that Stata shows the values of the variable _clus_1_hgt shifted by one row, which can make the analysis a little confusing. For reference, distance 3.713 refers to the merger between observations Ovidio and Gabriela (first stage of the agglomeration schedule), while distance 7.187 corresponds to the fusion between Luiz Felipe and the cluster already formed by all the other observations (last stage of the agglomeration schedule), as already shown in Table 11.13 and in Fig. 11.31.

Thus, to correct this discrepancy and obtain the actual behavior of the distances in each new clustering stage, researchers can type the following sequence of commands, whose output can be seen in Fig. 11.66. Note that a new variable (dist) is generated; it corrects the discrepancy found in variable _clus_1_hgt (term [_n-1]) and presents the Euclidean distance at which a new cluster is established in each stage of the agglomeration schedule.

gen dist = _clus_1_hgt[_n-1]

replace dist = 0 if dist ==.

sort dist

list student dist
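The stage-by-stage distances that dist is meant to display can be cross-checked with a short single-linkage sketch, written here in Python for illustration. This is a naive implementation under stated assumptions, not Stata's internal routine: the grades are assumed values consistent with the distances quoted in the text (verify against Table 11.12), and the function name single_linkage is hypothetical.

```python
import math
from itertools import combinations

# ASSUMPTION: grades consistent with the distances quoted in the text;
# check them against Table 11.12.
grades = {
    "Gabriela":    (3.7, 2.7, 9.1),
    "Luiz Felipe": (7.8, 8.0, 1.5),
    "Patricia":    (8.9, 1.0, 2.7),
    "Ovidio":      (7.0, 1.0, 9.0),
    "Leonor":      (3.4, 2.0, 5.0),
}

def single_linkage(points):
    """Return the merges as (distance, members_a, members_b), in stage order."""
    clusters = [frozenset([name]) for name in points]
    merges = []
    while len(clusters) > 1:
        # Single linkage: the distance between two clusters is the smallest
        # distance between any member of one and any member of the other.
        dist, a, b = min(
            (
                (min(math.dist(points[p], points[q]) for p in a for q in b), a, b)
                for a, b in combinations(clusters, 2)
            ),
            key=lambda t: t[0],
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((dist, sorted(a), sorted(b)))
    return merges

merges = single_linkage(grades)
for stage, (dist, a, b) in enumerate(merges, start=1):
    print(stage, round(dist, 3), a, "+", b)
```

With the assumed grades, the first merger joins Gabriela and Ovidio at 3.713 and the last one attaches Luiz Felipe at 7.187, in line with the values discussed above.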

Having carried out this phase, we can ask Stata to construct the dendrogram by typing one of the two equivalent commands:

cluster dendrogram, labels(student) horizontal

cluster tree, labels(student) horizontal

The diagram generated can be seen in Fig. 11.67.

We can see that the dendrogram constructed by Stata, in terms of Euclidean distances, is equal to the one shown in Fig. 11.12, constructed when the modeling was solved algebraically. However, it differs from the one constructed by SPSS (Fig. 11.32) because it does not use rescaled measures. Regardless of this fact, we will adopt three clusters as a possible solution: one formed by Leonor, Ovidio, and Gabriela, another by Patricia, and the third by Luiz Felipe, since the criteria discussed about large distance leaps coherently lead us toward this decision.

In order to generate a new variable, corresponding to the allocation of the observations to the three clusters, we must type the following sequence of commands. Note that we have named this new variable cluster. The output seen in Fig. 11.68 shows the allocation of the observations to the groups and is equivalent to the one shown in Fig. 11.36 (SPSS).

cluster generate cluster = groups(3), name(_clus_1)

sort _clus_1_id

list student cluster

Finally, by using the one-way analysis of variance (ANOVA), we will study if the values of a certain variable differ between the groups represented by the categories of the new qualitative variable cluster generated in the dataset, that is, if the variation between the groups is significantly higher than the variation within each one of them, following the logic proposed in Section 11.3.1. In order to do that, let’s type the following commands, in which the three metric variables (mathematics, physics, and chemistry) are individually related to the variable cluster:

oneway mathematics cluster, tabulate

oneway physics cluster, tabulate

oneway chemistry cluster, tabulate

The results of the ANOVA for the three variables are in Fig. 11.69.

The outputs in this figure, which show the results of the variation Between groups and Within groups, the F statistics, and the respective significance levels (Prob. F, or Prob > F in Stata) for each variable, are equal to the ones calculated algebraically and presented in Table 11.30 (Section 11.2.2.2.2) and also in Fig. 11.42, when this procedure was elaborated in SPSS (Section 11.3.1).

Therefore, as we have already discussed, we can see that, while for the variable physics there is at least one cluster that has a statistically different mean, when compared to the others, at a significance level of 0.05 (Prob. F = 0.0429 < 0.05), the variables mathematics and chemistry do not have statistically different means between the three groups formed for this sample and at the significance level set.

It is important to bear in mind that, if more than one variable has Prob. F less than 0.05, the one considered the most discriminant of the groups is the one with the highest F statistic (that is, the lowest significance level Prob. F).
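As a hedged illustration of what oneway computes, the sketch below (in Python) reproduces the F statistic for the variable physics from the between-groups and within-groups sums of squares. The physics grades and their allocation to the three clusters are assumptions consistent with the results quoted in this section; check them against Table 11.12. The closed-form tail probability used here is valid only when the numerator has exactly 2 degrees of freedom, which is the case with three groups; the function name one_way_anova is illustrative.

```python
# ASSUMPTION: physics grades per cluster, consistent with the results quoted
# in the text; check them against Table 11.12.
physics_by_cluster = {
    "Gabriela, Ovidio, Leonor": [2.7, 1.0, 2.0],
    "Luiz Felipe": [8.0],
    "Patricia": [1.0],
}

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    values = [v for g in groups for v in g]
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    # Between groups: squared deviations of group means from the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within groups: squared deviations of each value from its group mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    df_b, df_w = k - 1, n - k
    f_stat = (ss_between / df_b) / (ss_within / df_w)
    return f_stat, df_b, df_w

f_stat, df_b, df_w = one_way_anova(list(physics_by_cluster.values()))

# Upper-tail probability of the F distribution. This closed form holds only
# for 2 numerator degrees of freedom (df_b == 2).
p_value = (1 + 2 * f_stat / df_w) ** (-df_w / 2)

print(round(f_stat, 3), round(p_value, 4))
```

Under these assumptions the tail probability rounds to 0.0429, the Prob. F value reported for physics above.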

Even though it is possible to conclude the hierarchical analysis at this point, the researcher has the option to run a multidimensional scaling in order to see the projections of the relative positions of the observations in a two-dimensional chart, similar to what was done in Section 11.3.1. To do that, the following command may be typed:

mds mathematics physics chemistry, id(student) method(modern) measure(L2) loss(sstress) config nolog

The outputs generated can be found in Figs. 11.70 and 11.71, and the chart in the latter corresponds to the one shown in Fig. 11.50.

Having presented the commands to carry out the cluster analysis with hierarchical agglomeration schedules in Stata, let’s move on to the elaboration of the nonhierarchical k-means agglomeration schedule in the same software package.

11.4.2 Elaborating Nonhierarchical K-Means Agglomeration Schedules in Stata

In order to apply the k-means procedure to the data in the file CollegeEntranceExams.dta, we must type the following command:

cluster kmeans mathematics physics chemistry, k(3) name(kmeans) measure(L2) start(firstk)

where the term k(3) tells the algorithm to form three clusters. Besides, we define that a new variable with the allocation of the observations to the three groups will be generated in the dataset with the name kmeans (term name(kmeans)), and that the distance measure used will be the Euclidean distance (term L2). Moreover, the term firstk specifies that the coordinates of the first k observations of the sample will be used as the initial centroids of the k clusters (in our case, k = 3), which corresponds exactly to the criterion adopted by SPSS, as discussed in Section 11.3.2.
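The role of start(firstk) can be illustrated with a minimal k-means sketch in Python. It is a plain assign-and-update loop written under stated assumptions, not a reproduction of Stata's internal routine: the grades and, importantly, the row order (Gabriela, Luiz Felipe, and Patricia first) are assumed to match Table 11.12 and the dataset, and the function name kmeans_firstk is hypothetical.

```python
import math

# ASSUMPTION: grades and row order mirror the dataset; the first k rows
# seed the centroids, so the order matters.
students = [
    ("Gabriela",    (3.7, 2.7, 9.1)),
    ("Luiz Felipe", (7.8, 8.0, 1.5)),
    ("Patricia",    (8.9, 1.0, 2.7)),
    ("Ovidio",      (7.0, 1.0, 9.0)),
    ("Leonor",      (3.4, 2.0, 5.0)),
]

def kmeans_firstk(points, k, max_iter=100):
    """k-means whose initial centroids are the first k observations."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # no reallocation: converged
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

points = [grade for _, grade in students]
labels, centroids = kmeans_firstk(points, k=3)
for (name, _), lab in zip(students, labels):
    print(name, lab + 1)  # clusters numbered from 1
```

With the assumed data, Gabriela, Ovidio, and Leonor end up in the cluster seeded by Gabriela, while Luiz Felipe and Patricia each remain alone in the clusters they seeded.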

Next, we can type the command cluster list kmeans so that, in a summarized way, the criteria adopted for elaborating the k-means procedure can be presented.

The outputs in Fig. 11.72 show what is generated by Stata after we type the last two commands.

The next two commands generate, in the outputs of the software, two tables that refer to the number of observations in each one of the three clusters formed, as well as to the allocation of each observation in these groups, respectively:

table kmeans

list student kmeans

Fig. 11.73 shows these outputs.

These results correspond to the ones found when the k-means procedure was solved algebraically in Section 11.2.2.2.2 (Fig. 11.23), and to the ones obtained when this procedure was elaborated using SPSS in Section 11.3.2 (Figs. 11.60 and 11.61).

Even though we are able to develop a one-way analysis of variance for the original variables in the dataset, from the new qualitative variable generated (kmeans), we chose not to carry out this procedure here, since we have already done that for the variable cluster generated in Section 11.4.1 after the hierarchical procedure, which is exactly the same as the variable kmeans in this case.

On the other hand, for pedagogical purposes, we present the command that allows the means of each variable in the three clusters to be generated, so that they can be compared:

tabstat mathematics physics chemistry, by(kmeans)

The output generated can be seen in Fig. 11.74 and is equivalent to the one presented in Tables 11.24, 11.25, and 11.26.

Finally, the researcher can also construct a chart to show the interrelationships between the variables, two at a time. This chart, known as a matrix chart, can give the researcher a better understanding of how the variables relate to one another and can even offer hints regarding the relative position of the observations in each cluster in these interrelationships. To construct the chart shown in Fig. 11.75, we must type the following command:

graph matrix mathematics physics chemistry, mlabel(kmeans)

Obviously, this chart could also have been constructed in the previous section. However, we chose to present it only at the end of the preparation of the k-means procedure in Stata. By analyzing it, it is possible to verify, among other things, that considering only the variables mathematics and chemistry is not enough to keep observations Luiz Felipe and Patricia (clusters 2 and 3, respectively) apart. It is necessary to consider the variable physics so that these two students can, in fact, be allocated to different clusters when forming three clusters. Although this may seem obvious from inspecting such a small dataset directly, the chart becomes extremely useful for larger samples with a considerable number of variables, a fact that would multiply these interrelationships.

11.5 Final Remarks

There are many situations in which researchers may wish to group observations (individuals, companies, municipalities, countries, political parties, plant species, among other examples) based on certain metric or even binary variables. Creating homogeneous clusters, reducing data structurally, and verifying the validity of previously established constructs are some of the main reasons that make researchers choose to work with cluster analysis.

This set of techniques allows decision-making mechanisms to be better structured and justified based on the behavior of, and the interdependence relationship between, the observations of a certain dataset. Since the variable that represents the clusters formed is qualitative, the outputs of the cluster analysis can serve as inputs in other multivariate techniques, both exploratory and confirmatory.

It is strongly advisable for researchers to justify, clearly and transparently, the measure they chose and that will serve as the basis for the observations to be considered more or less similar, as well as the reasons that make them define nonhierarchical or hierarchical agglomeration schedules and, in this last case, determine the linkage methods.

In the last few years, the evolution of technological capabilities and the development of new software, with greatly improved resources, have given rise to new and better cluster analysis techniques. These techniques use increasingly sophisticated algorithms and are aimed at the decision-making process in several fields of knowledge, always with the main goal of grouping observations based on certain criteria. In this chapter, however, we tried to offer a general overview of the main cluster analysis methods, which are also considered the most popular.

Lastly, we would like to highlight that the application of this important set of techniques must always be done by using the software chosen for the modeling correctly and sensibly, based on the underlying theory and on researchers’ experience and intuition.

11.6 Exercises

1) The scholarship department of a certain college wishes to investigate the interdependence relationship between the students entering university in a certain school year, based only on two metric variables (age, in years, and average family income, in US$). The main objective is to propose a still unknown number of new scholarship programs aimed at homogeneous groups of students. In order to do that, data on 100 new students were collected and a dataset was constructed, which can be found in the files Scholarship.sav and Scholarship.dta, with the following variables:

Variable	Description
student	A string variable that identifies all freshmen in the college
age	Student’s age (years)
income	Average family income (US$)

We would like you to:

a) Run a cluster analysis through a hierarchical agglomeration schedule, with the complete-linkage method (furthest neighbor) and the squared Euclidean distance. Only present the final part of the agglomeration schedule table and discuss the results. Reminder: Since the variables have different units of measure, it is necessary to apply the Z-scores standardization procedure to prepare the cluster analysis correctly.
b) Based on the table found in the previous item and in the dendrogram, we ask you: how many clusters of students will be formed?
c) Is it possible to identify one or more very discrepant students, in comparison to the others, regarding the two variables under analysis?
d) If the answer to the previous item is “yes,” once again run the hierarchical cluster analysis with the same criteria, however, now, without the student(s) considered discrepant. From the analysis of the new results, can new clusters be identified?
e) Discuss how the presence of outliers can hamper the interpretation of results in a cluster analysis.

2) The marketing department of a retail company wants to study possible discrepancies among its 18 stores, which are spread throughout three regional centers and distributed all over the country. In order to maintain and preserve its brand’s image and identity, top management would like to know if the stores are homogeneous in terms of customers’ perception of attributes such as services, variety of goods, and organization. Thus, a survey with samples of clients was first carried out in each store, so that data regarding these attributes could be collected. The attributes were measured by the average score obtained (0 to 100) in each store.
Next, a dataset was constructed and it contains the following variables:

Variable	Description
store	A string variable that varies from 01 to 18 and that identifies the commercial establishment (store)
regional	A string variable that identifies each regional center (Regional 1 to Regional 3)
services	Customers’ average evaluation of services rendered (score from 0 to 100)
assortment	Customers’ average evaluation of the variety of goods (score from 0 to 100)
organization	Customers’ average evaluation of the organization (score from 0 to 100)

These data can be found in the files Retail Regional Center.sav and Retail Regional Center.dta. We would like you to:

a) Run a cluster analysis through a hierarchical agglomeration schedule, with the single-linkage method and the Euclidean distance. Present the matrix with distances between each pair of observations. Reminder: Since the variables are in the same unit, it is not necessary to apply the Z-scores standardization procedure.
b) Present and discuss the agglomeration schedule table.
c) Based on the table found in the previous item and in the dendrogram, we ask you: how many clusters of stores will be formed?
d) Run a multidimensional scaling and, after that, present and discuss the two-dimensional chart generated with the relative positions of the stores.
e) Run a cluster analysis by using the k-means procedure, with the number of clusters suggested in item (c), and interpret the one-way analysis of variance for each variable considered in the study, considering a significance level of 0.05. Which variable contributes the most to the creation of at least one of the clusters formed, that is, which of them is the most discriminant of the groups?
f) Is there any correspondence between the allocations of the observations to the groups obtained by the hierarchical and nonhierarchical methods?
g) Is it possible to identify an association between any regional center and a certain discrepant group of stores, which could justify the management’s concern regarding the brand’s image and identity? If the answer is “yes,” run the hierarchical cluster analysis once again with the same criteria, but now without this discrepant group of stores. By analyzing the new results, is it possible to see the differences between the other stores more clearly?

3) A financial market analyst has decided to carry out a survey with CEOs and directors of large companies that operate in the health, education, and transport industries, in order to investigate how these companies’ operations are carried out and the mechanisms that guide their decision-making processes. In order to do that, he structured a questionnaire with 50 questions, whose answers are only dichotomous, or binary. After applying the questionnaire, he got answers from 35 companies and, from then on, structured a dataset, available in the files Binary Survey.sav and Binary Survey.dta. In a generic way, the variables are:

Variable	Description
q1 to q50	A list of 50 dummy variables that refer to the way the operations and the decision-making processes are carried out in these companies
sector	Company sector

The analyst’s main goal is to verify whether companies in the same sector show similarities in relation to the way their operations and decision-making processes are carried out, at least from their own managers’ perspective. In order to do that, after collecting the data, a cluster analysis can be elaborated. We would like you to:

a) Based on the hierarchical cluster analysis elaborated with the average-linkage method (between groups) and the simple matching similarity measure for binary variables, analyze the agglomeration schedule generated.
b) Interpret the dendrogram.
c) Check if there is any correspondence between the allocations of the companies to the clusters and the respective sectors, or, in other words, if the companies in the same sector show similarities regarding the way their operations and decision-making processes are carried out.

4) A greengrocer has decided to monitor the sales of his products for 16 weeks (4 months). The main objective is to verify if the sales behavior of three of his main products (bananas, oranges, and apples) is recurrent after a certain period, due to weekly wholesale price fluctuations, which are passed on to customers and may impact sales. These data can be found in the files Veggiefruit.sav and Veggiefruit.dta, which have the following variables:

Variable	Description
week	A string variable that varies from 1 to 16 and identifies the week in which the sales were monitored
week_month	A string variable that varies from 1 to 4 and identifies the week in each one of the months
banana	Number of bananas sold that week (un.)
orange	Number of oranges sold that week (un.)
apple	Number of apples sold that week (un.)

We would like you to:

a) Run a cluster analysis through a hierarchical agglomeration schedule, with the single-linkage method (nearest neighbor) and Pearson’s correlation measure. Present the matrix of similarity measures (Pearson’s correlation) between each row in the dataset (weekly periods). Reminder: Since the variables are in the same unit of measure, it is not necessary to apply the Z-scores standardization procedure.
b) Present and discuss the agglomeration schedule table.
c) Based on the table found in the previous item and on the dendrogram, we ask you: is there any indication that the joint sales behavior of bananas, oranges and apples is recurrent in certain weeks?

Appendix

A.1 Detecting Multivariate Outliers

Even though detecting outliers is extremely important when applying practically every single multivariate data analysis technique, we chose to add this Appendix to the present chapter because cluster analysis represents the first set of multivariate exploratory techniques being studied, whose outputs can be used as inputs in several other techniques, as well as because very discrepant observations may significantly interfere in the creation of clusters.

Barnett and Lewis (1994) mention almost 1000 articles in the existing literature on outliers. However, we chose to show a very effective, computationally simple, and fast algorithm for detecting multivariate outliers, bearing in mind that the identification of outliers for each variable individually, that is, in a univariate way, has already been studied in Chapter 3.

A) Brief Presentation of the Blocked Adaptive Computationally Efficient Outlier Nominators Algorithm

Billor et al. (2000), in extremely important work, show an interesting algorithm that has the purpose of detecting multivariate outliers. It is called Blocked Adaptive Computationally Efficient Outlier Nominators or simply BACON. This algorithm, explained in a very clear and didactical way by Weber (2010), is defined based on the preparation of a few steps, described briefly:

1. From a dataset with n observations and j (j = 1, ..., k) variables X, in which each observation is identified by i (i = 1, ..., n), the distance between one observation i that has a vector with dimension k $x_{i} = (x_{i 1} x_{i 2} \dots x_{ik})$ and the general mean of all sample values (group G), which also has a vector with dimension k $\bar{x} = ({\bar{x}}_{1} {\bar{x}}_{2} \dots {\bar{x}}_{k})$, is given by the following expression, known as the Mahalanobis distance:

$d_{iG} = \sqrt{{(x_{i} - \bar{x})}^{'} \cdot S^{- 1} \cdot (x_{i} - \bar{x})}$

(11.29)

where S represents the covariance matrix of the n observations. Therefore, the first step of the algorithm consists in identifying m (m > k) homogeneous observations (initial group M) that have the smallest Mahalanobis distances in relation to the entire sample.

It is important to mention that the dissimilarity measure known as Mahalanobis distance, not discussed in this chapter, is adopted by the aforementioned authors due to the fact that it is not susceptible to variables that are in different measurement units.

2. Next, we calculate the Mahalanobis distance between each observation i and the mean of the m observations that belong to group M, which is also a vector with dimension k ${\bar{x}}_{M} = ({\bar{x}}_{M 1} {\bar{x}}_{M 2} \dots {\bar{x}}_{Mk})$, such that:

$d_{iM} = \sqrt{{(x_{i} - {\bar{x}}_{M})}^{'} \cdot S_{M}^{- 1} \cdot (x_{i} - {\bar{x}}_{M})}$

(11.30)

where S_M represents the covariance matrix of the m observations.

3. All the observations with Mahalanobis distances less than a certain threshold are added to the group M of observations. This threshold is defined as a corrected percentile of the χ² distribution (the 85th percentile by default in Stata).

Steps 2 and 3 must be reapplied until there are no more modifications in group M, which will only have observations that are not considered outliers. Hence, the ones excluded from the group will be considered multivariate outliers.
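Expression (11.29) can be illustrated with a minimal Python sketch restricted to two variables, so that S⁻¹ can be written in closed form. The data are hypothetical (they are not taken from any file used in this chapter), and the function name mahalanobis_2d is illustrative. The sketch covers only the distance calculation, not the full iterative algorithm, and it also checks the property mentioned above: rescaling one variable leaves the Mahalanobis distances unchanged.

```python
import math

def mahalanobis_2d(points):
    """Mahalanobis distance of each 2-variable point to the sample mean."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Sample covariance matrix S (n - 1 in the denominator).
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    dists = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # (dx dy) S^{-1} (dx dy)' using the closed-form inverse of a 2x2 matrix.
        q = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        dists.append(math.sqrt(q))
    return dists

# HYPOTHETICAL data: (age in years, income in thousands of US$).
sample = [(20, 3.0), (22, 1.5), (21, 4.2), (23, 1.9), (40, 9.0)]
d_thousands = mahalanobis_2d(sample)

# Same observations with income expressed in US$ instead of thousands.
d_dollars = mahalanobis_2d([(age, inc * 1000) for age, inc in sample])

print([round(d, 3) for d in d_thousands])
print(all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(d_thousands, d_dollars)))
```

The second line printed compares the two sets of distances, showing that the change of unit does not affect them, which a Euclidean distance on the raw variables would not guarantee.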

Weber (2010) codifies the algorithm proposed in the paper written by Billor et al. (2000) in Stata, thus proposing the command bacon. Next, we will present and discuss an example in which this command is used, and whose main advantage is to be very fast computationally, even when applied to large datasets.

B) Example: The command bacon in Stata

Before the specific preparation of this procedure in Stata, we must install the command bacon by typing findit bacon and clicking on the link st0197 from http://www.stata-journal.com/software/sj10-3. After that, we must click on click here to install. Lastly, going back to the Stata command screen, we can type ssc install moremata and mata: mata mlib index. Having done this, we may apply the command bacon.

To apply this command, let’s use the file Bacon.dta, which shows data on 20,000 engineers: their median household income (US$), their age (years), and the time since they obtained a college degree (years). First of all, we can type the command desc, which makes it possible to analyze the characteristics of the dataset. Fig. 11.76 shows this first output.

Next, we can type the following command that, based on the algorithm presented, identifies the observations considered multivariate outliers:

bacon income age tgrad, generate(outbacon)

where the term generate(outbacon) makes a new dummy variable be generated in the dataset, called outbacon, which has values equal to 0 for observations not considered outliers, and values equal to 1 for the ones considered outliers. This output can be seen in Fig. 11.77.

From the figure, it is possible to see that four observations are classified as multivariate outliers. Besides, by default Stata uses the 85th percentile of the χ² distribution as the separation threshold between the observations considered outliers and nonoutliers, as previously discussed and highlighted by Weber (2010). This is the reason why the term BACON outliers (p = 0.15) appears in the outputs. This value may be altered according to a criterion established by the researcher. However, we would like to emphasize that the default percentile(0.15) is very adequate for obtaining consistent answers.

From the following command, which generates the output seen in Fig. 11.78, we can investigate which observations are classified as outliers:

list if outbacon == 1

Even if we are working with three variables, we can construct two-dimensional scatter plots, which allow us to identify the positions of the observations considered outliers in relation to the others. In order to do that, let’s type the following commands, which generate the mentioned charts for each pair of variables:

scatter income age, ml(outbacon) note("0 = not outlier, 1 = outlier")
scatter income tgrad, ml(outbacon) note("0 = not outlier, 1 = outlier")
scatter age tgrad, ml(outbacon) note("0 = not outlier, 1 = outlier")

These three charts can be seen in Figs. 11.79, 11.80, and 11.81.

Despite the fact that outliers have been identified, it is important to mention that the decision about what to do with these observations is entirely up to researchers, who must make it based on their research objectives. As already discussed throughout this chapter, excluding these outliers from the dataset may be an option. However, studying why they became multivariately discrepant can also result in many interesting research outcomes.
