12.3 Principal Component Factor Analysis in SPSS

In this section, we will present the step-by-step procedure for developing our example in the IBM SPSS Statistics Software. Following the logic proposed in this book, the main objective is to give researchers an opportunity to elaborate the principal component factor analysis in this software package, given how easy and didactical its operations are. Every time we present an output, we will mention the respective result obtained in the algebraic solution of the technique in the previous section, so that researchers can compare them and broaden their knowledge and understanding of the technique. The use of the images in this section has been authorized by the International Business Machines Corporation©.

Going back to the example presented in Section 12.2.6, remember that the professor is interested in creating a school performance ranking of his students based on the joint behavior of their final grades in four subjects. The data can be found in the file FactorGrades.sav and are exactly the same as the ones partially presented in Table 12.6 in Section 12.2.6.

Therefore, in order to run the factor analysis, let’s click on Analyze → Dimension Reduction → Factor …. A dialog box like the one shown in Fig. 12.10 will open.

Fig. 12.10
Fig. 12.10 Dialog box for running a factor analysis in SPSS.

Next, we must insert the original variables finance, costs, marketing, and actuarial into Variables, as shown in Fig. 12.11.

Fig. 12.11
Fig. 12.11 Selecting the original variables.

Unlike the cluster analysis developed in the previous chapter, it is important to mention that the researcher does not need to worry about the Z-scores standardization of the original variables when elaborating the factor analysis, since the correlations between the original variables and between their corresponding standardized variables are exactly the same. Even so, if researchers choose to standardize each one of the variables, they will see that the outputs are exactly the same.

In Descriptives …, first, let’s select the option Initial solution in Statistics …, which makes all the eigenvalues of the correlation matrix be presented in the outputs, even the ones that are less than 1. In addition, let’s select the options Coefficients, Determinant, and KMO and Bartlett’s test of sphericity in Correlation Matrix, as shown in Fig. 12.12.

Fig. 12.12
Fig. 12.12 Selecting the initial options for running the factor analysis.

When we click on Continue, we will go back to the main dialog box of the factor analysis. Next, we must click on Extraction …. As shown in Fig. 12.13, we will maintain the options regarding the factor extraction method selected (Method: Principal components) and the choice criterion of the number of factors. In this case, as discussed in Section 12.2.3, only the factors that correspond to eigenvalues greater than 1 will be considered (latent root criterion or Kaiser criterion), and, therefore, we must maintain the option Based on Eigenvalue → Eigenvalues greater than: 1 in Extract selected. Moreover, we will also maintain the options Unrotated factor solution, in Display, and Correlation matrix, in Analyze, selected.

Fig. 12.13
Fig. 12.13 Choosing the factor extraction method and the criterion for determining the number of factors.

In the same way, let’s click on Continue so that we can go back to the main dialog box of the factor analysis. In Rotation …, for now, let’s select the option Loading plot(s) in Display, while still maintaining the option None in Method selected, as shown in Fig. 12.14.

Fig. 12.14
Fig. 12.14 Dialog box for selecting the rotation method and the loading plot.

Choosing the extraction of unrotated factors at this moment is didactical, since the outputs generated may be compared to the ones obtained algebraically in Section 12.2.6. Nevertheless, researchers can choose to extract rotated factors at this opportunity.

After clicking on Continue, we can select the button Scores … in the technique’s main dialog box. At this moment, let’s select the option Display factor score coefficient matrix, as shown in Fig. 12.15, which makes the factor scores that correspond to each factor extracted be presented in the outputs.

Fig. 12.15
Fig. 12.15 Selecting the option to present the factor scores.

Next, we can click on Continue and on OK.

The first output (Fig. 12.16) shows the correlation matrix ρ, equal to the one in Table 12.7 in Section 12.2.6, through which we can see that the variable marketing is the only one that shows low Pearson’s correlation coefficients with all the other variables. As we have already discussed, this is a first indication that the variables finance, costs, and actuarial may be correlated with a certain factor, while the variable marketing may correlate strongly with another one.

Fig. 12.16
Fig. 12.16 Pearson’s correlation coefficients.

We can also see that the output in Fig. 12.16 shows the value of the determinant of the correlation matrix ρ, which is used to calculate the χ²Bartlett statistic, as discussed when we presented Expression (12.9).

In order to study the overall adequacy of the factor analysis, let’s analyze the outputs in Fig. 12.17, which show the results of the KMO statistic and of the χ²Bartlett statistic. While the former suggests that the overall adequacy of the factor analysis is middling (KMO = 0.737), based on the criterion presented in Table 12.2, the latter (χ²Bartlett = 192.335, Sig. < 0.05 for 6 degrees of freedom) allows us to reject the null hypothesis of Bartlett’s test of sphericity, namely, that the correlation matrix ρ is statistically equal to the identity matrix I with the same dimension, at a significance level of 0.05. Thus, we can conclude that the factor analysis is adequate.

Fig. 12.17
Fig. 12.17 Results of the KMO statistic and Bartlett’s test of sphericity.

The values of the KMO and χ²Bartlett statistics are calculated through Expressions (12.3) and (12.9), respectively, presented in Section 12.2.2, and are exactly the same as the ones obtained algebraically in Section 12.2.6.
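For readers who want to verify these formulas outside SPSS, the sketch below computes both statistics with NumPy. The function name and the data matrix are hypothetical; the KMO statistic is implemented via the partial correlations obtained from the inverse of the correlation matrix, and Bartlett’s χ² via the determinant of that matrix.

```python
import numpy as np

def kmo_and_bartlett(X):
    """KMO statistic and Bartlett's test of sphericity for a data
    matrix X with n observations (rows) and k variables (columns)."""
    n, k = X.shape
    R = np.corrcoef(X, rowvar=False)           # Pearson correlation matrix
    inv_R = np.linalg.inv(R)
    # Partial correlations: a_ij = -inv_ij / sqrt(inv_ii * inv_jj)
    d = np.sqrt(np.diag(inv_R))
    partial = -inv_R / np.outer(d, d)
    off = ~np.eye(k, dtype=bool)               # off-diagonal mask
    r2 = (R[off] ** 2).sum()
    a2 = (partial[off] ** 2).sum()
    kmo = r2 / (r2 + a2)
    # Bartlett: chi2 = -[(n - 1) - (2k + 5)/6] * ln(det(R)), df = k(k-1)/2
    chi2 = -((n - 1) - (2 * k + 5) / 6) * np.log(np.linalg.det(R))
    df = k * (k - 1) // 2
    return kmo, chi2, df
```

Applied to the professor’s 100 × 4 matrix of grades, this sketch should reproduce KMO = 0.737 and χ²Bartlett = 192.335 with 6 degrees of freedom.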

Next, Fig. 12.18 shows the four eigenvalues of correlation matrix ρ that correspond to each one of the factors extracted initially, with the respective proportions of variance shared by the original variables.

Fig. 12.18
Fig. 12.18 Eigenvalues and variance shared by the original variables to form each factor.

Note that the eigenvalues are exactly the same as the ones obtained algebraically in Section 12.2.6, such that:

λ₁² + λ₂² + ⋯ + λₖ² = 2.519 + 1.000 + 0.298 + 0.183 = 4

Since in the analysis we will only consider the factors whose eigenvalues are greater than 1, the right-hand side of Fig. 12.18 shows the proportion of variance shared by the original variables to form these factors only. Therefore, analogous to what was presented in Table 12.10, we can state that, while 62.975% of the total variance is shared to form the first factor, 25.010% is shared to form the second. Thus, to form these two factors, the total loss of variance of the original variables is equal to 12.015%.
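The relationship between eigenvalues and shared variance can be sketched with NumPy. The correlation matrix below is a hypothetical stand-in, not the example’s exact matrix ρ; it only mimics its pattern of three strongly correlated variables and one nearly uncorrelated variable.

```python
import numpy as np

# Hypothetical stand-in for the correlation matrix rho (finance, costs,
# marketing, actuarial): three strongly correlated variables and one
# nearly uncorrelated variable.
R = np.array([[1.00, 0.75, 0.00, 0.76],
              [0.75, 1.00, 0.03, 0.79],
              [0.00, 0.03, 1.00, 0.01],
              [0.76, 0.79, 0.01, 1.00]])

eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]   # descending eigenvalues
proportions = eigvals / eigvals.sum()            # variance shared per factor
# The eigenvalues always add up to the number of variables (trace of R).
print(eigvals, proportions)
```

Because the trace of a correlation matrix equals the number of variables, the eigenvalues of any 4 × 4 correlation matrix sum to 4, just as in the expression above.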

Having extracted two factors, Fig. 12.19 shows the factor scores that correspond to each one of the standardized variables for each one of these factors.

Fig. 12.19
Fig. 12.19 Factor scores.

Hence, we are able to write the expressions of factors F1 and F2 as follows:

F₁ᵢ = 0.355·Zfinanceᵢ + 0.371·Zcostsᵢ − 0.017·Zmarketingᵢ + 0.364·Zactuarialᵢ

F₂ᵢ = −0.007·Zfinanceᵢ + 0.049·Zcostsᵢ + 0.999·Zmarketingᵢ − 0.010·Zactuarialᵢ

Note that the expressions are identical to the ones obtained in Section 12.2.6 from the algebraic definition of unrotated factor scores.

Fig. 12.20 shows the factor loadings, which correspond to Pearson’s correlation coefficients between the original variables and each one of the factors. The values shown in Fig. 12.20 are equal to the ones presented in the first two columns of Table 12.12.

Fig. 12.20
Fig. 12.20 Factor loadings.

The highest factor loading is highlighted for each variable. Therefore, we can verify that, while the variables finance, costs, and actuarial show stronger correlations with the first factor, the variable marketing shows a stronger correlation with the second factor.

As we also discussed in Section 12.2.6, the sum of the squared factor loadings in the columns results in the eigenvalue of the corresponding factor, that is, it represents the proportion of variance shared by the four original variables to form each factor. Thus, we can verify that:

0.895² + 0.934² + 0.042² + 0.918² = 2.519

0.007² + 0.049² + 0.999² + 0.010² = 1.000

On the other hand, the sum of the squared factor loadings in the rows results in the communality of the respective variable, that is, it represents the proportion of shared variance of each original variable in the two factors extracted. Therefore, we can also see that:

communality_finance = 0.895² + 0.007² = 0.802
communality_costs = 0.934² + 0.049² = 0.875
communality_marketing = 0.042² + 0.999² = 1.000
communality_actuarial = 0.918² + 0.010² = 0.843
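These sums of squares can be checked numerically. The sketch below uses the loadings reported above (absolute values only, since the signs vanish when squared).

```python
import numpy as np

# Factor loadings reported in the outputs (absolute values; the signs
# do not affect the squared sums).
loadings = np.array([[0.895, 0.007],    # finance
                     [0.934, 0.049],    # costs
                     [0.042, 0.999],    # marketing
                     [0.918, 0.010]])   # actuarial

eigenvalues = (loadings ** 2).sum(axis=0)     # column sums of squares
communalities = (loadings ** 2).sum(axis=1)   # row sums of squares
```

The column sums recover the two eigenvalues (2.519 and 1.000, up to rounding), and the row sums recover the communalities of the four variables.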

In the SPSS outputs, the communalities table is also presented, as shown in Fig. 12.21.

Fig. 12.21
Fig. 12.21 Communalities.

The loading plot, which shows the relative position of each variable in each factor based on the respective factor loadings, is also presented in the outputs (Fig. 12.22, equivalent to Fig. 12.8 in Section 12.2.6), in which the X-axis represents factor F1, and the Y-axis, factor F2.

Fig. 12.22
Fig. 12.22 Loading plot.

Even though the relative position of the variables in each axis, that is, the magnitude of the correlations between each one of them and each factor, is very clear, for pedagogical purposes, we chose to elaborate the rotation of the axes, which can sometimes facilitate the interpretation of the factors because it provides a better distribution of the variables’ factor loadings across the factors.
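For readers curious about what the Varimax rotation does numerically, here is a minimal sketch, a hypothetical helper rather than SPSS code, of the classic algorithm: it iteratively seeks the orthogonal rotation that maximizes the variance of the squared loadings, optionally Kaiser-normalizing the rows first.

```python
import numpy as np

def varimax(L, normalize=True, tol=1e-8, max_iter=500):
    """Varimax rotation of a loading matrix L (variables x factors).
    Returns the rotated loadings and the orthogonal rotation matrix."""
    L = np.asarray(L, dtype=float).copy()
    if normalize:                               # Kaiser row normalization
        h = np.sqrt((L ** 2).sum(axis=1))
        L /= h[:, None]
    p, k = L.shape
    R = np.eye(k)
    var_old = 0.0
    for _ in range(max_iter):
        Lr = L @ R
        tmp = Lr ** 3 - Lr * ((Lr ** 2).sum(axis=0) / p)
        U, s, Vt = np.linalg.svd(L.T @ tmp)
        R = U @ Vt                              # best orthogonal update
        if s.sum() - var_old < tol:
            break
        var_old = s.sum()
    Lr = L @ R
    if normalize:
        Lr *= h[:, None]                        # undo the normalization
    return Lr, R
```

Because the rotation matrix R is orthogonal, the rotation preserves each variable’s communality, which is why the rotated and unrotated solutions share the same communalities.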

Thus, once again, let’s click on Analyze → Dimension Reduction → Factor … and, on the button Rotation …, select the option Varimax, as shown in Fig. 12.23.

Fig. 12.23
Fig. 12.23 Selecting the Varimax orthogonal rotation method.

When we click on Continue, we will go back to the main dialog box of the factor analysis. In Scores …, let’s select the option Save as variables, as shown in Fig. 12.24, so that the factors generated, now rotated, can be made available in the dataset as new variables. From these factors, the students’ school performance ranking will be created.

Fig. 12.24
Fig. 12.24 Selecting the option to save the factors as new variables in the dataset.

Next, we can click on Continue and on OK.

Figs. 12.25–12.29 show the outputs that differ from the previous ones due to the rotation. The results of the correlation matrix, of the KMO statistic, of Bartlett’s test of sphericity, and of the communalities table are not presented again because, even though the communalities are calculated from the rotated loadings, their values do not change.

Fig. 12.25
Fig. 12.25 Rotated factor loadings through the Varimax method.
Fig. 12.26
Fig. 12.26 Loading plot with rotated loadings.
Fig. 12.27
Fig. 12.27 Rotation angle (in radians).
Fig. 12.28
Fig. 12.28 Eigenvalues and variance shared by the original variables to form both rotated factors.
Fig. 12.29
Fig. 12.29 Rotated factor scores.

Fig. 12.25 shows these rotated factor loadings and, through them, it is possible to verify, even if very tenuously, a certain redistribution of the variable loadings in each factor.

Note that the rotated factor loadings in Fig. 12.25 are exactly the same as the ones obtained algebraically in Section 12.2.6, from Expressions (12.35) to (12.40), and presented in Table 12.17.

The new loading plot, constructed from the rotated factor loadings and equivalent to Fig. 12.9, can be seen in Fig. 12.26.

The rotation angle calculated algebraically in Section 12.2.6 is also a part of the SPSS outputs and can be found in Fig. 12.27.

As we have already discussed, from the rotated factor loadings, we can verify that there are no changes in the communality values of the variables considered in the analysis, that is:

communality_finance = 0.895² + 0.019² = 0.802
communality_costs = 0.935² + 0.021² = 0.875
communality_marketing = 0.013² + 1.000² = 1.000
communality_actuarial = 0.917² + 0.037² = 0.843

On the other hand, the new eigenvalues can be obtained as follows:

0.895² + 0.935² + 0.013² + 0.917² = λ₁² = 2.518
0.019² + 0.021² + 1.000² + 0.037² = λ₂² = 1.002

Fig. 12.28 shows the results of the eigenvalues for the first two rotated factors in Rotation Sums of Squared Loadings, with their respective proportions of variance shared by the four original variables. The results are in accordance with the ones presented in Table 12.18.

In comparison to the results obtained before the rotation, we can see that, even though there is no change in the sharing of 87.985% of the total variance of the original variables to form both rotated factors, the rotation redistributed the variance shared by the variables to each factor.

Fig. 12.29 shows the rotated factor scores, from which the expressions of the new factors can be obtained.

Therefore, we can write the following rotated factors expressions:

F′₁ᵢ = 0.355·Zfinanceᵢ + 0.372·Zcostsᵢ + 0.012·Zmarketingᵢ + 0.364·Zactuarialᵢ

F′₂ᵢ = −0.004·Zfinanceᵢ + 0.038·Zcostsᵢ + 0.999·Zmarketingᵢ − 0.021·Zactuarialᵢ

When developing the procedure described, we can verify that two new variables are generated in the dataset, called FAC1_1 and FAC2_1 by SPSS, as shown in Fig. 12.30 for the first 20 observations.

Fig. 12.30
Fig. 12.30 Dataset with the F1′ (FAC1_1) and F2′ (FAC2_1) values per observation.

These new variables, which show the values of both rotated factors for each one of the observations in the dataset, are orthogonal to one another, that is, they have a Pearson’s correlation coefficient equal to 0. This can be verified when we click on Analyze → Correlate → Bivariate …. In the dialog box that will open, we must insert the variables FAC1_1 and FAC2_1 into Variables and select the options Pearson (in Correlation Coefficients) and Two-tailed (in Test of Significance), as shown in Fig. 12.31.

Fig. 12.31
Fig. 12.31 Dialog box for determining Pearson’s correlation coefficient between both rotated factors.

When we click on OK, the output seen in Fig. 12.32 will be presented, in which it is possible to verify that Pearson’s correlation coefficient between both rotated factors is equal to 0.

Fig. 12.32
Fig. 12.32 Pearson’s correlation coefficient between both rotated factors.

According to what was studied in Sections 12.2.4 and 12.2.6, a more inquisitive researcher may also verify that the rotated factor scores can be obtained through the estimation of two multiple linear regression models, in which each factor is considered the dependent variable in one of the models, and the standardized variables are the explanatory variables. The factor scores will be the parameters estimated in each model.

In the same way, it is also possible to verify that the rotated factor loadings can be obtained by using the estimation of four multiple linear regression models as well, in which, in each one of them, a certain standardized variable is considered to be a dependent variable, and the factors, explanatory variables. While the factor loadings will be the parameters estimated in each model, the communalities will be the respective coefficients of determination R2. Therefore, the following expressions can be obtained:

Zfinanceᵢ = 0.895·F′₁ᵢ − 0.019·F′₂ᵢ + uᵢ,  R² = 0.802

Zcostsᵢ = 0.935·F′₁ᵢ + 0.021·F′₂ᵢ + uᵢ,  R² = 0.875

Zmarketingᵢ = 0.013·F′₁ᵢ + 1.000·F′₂ᵢ + uᵢ,  R² = 1.000

Zactuarialᵢ = 0.917·F′₁ᵢ − 0.037·F′₂ᵢ + uᵢ,  R² = 0.843

in which the terms ui represent additional sources of variance, besides factors F1′ and F2′, to explain the behavior of each variable, and they are also called error terms or residuals.
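This regression property is easy to verify numerically. The sketch below simulates a standardized variable from two orthogonal unit-variance factors (0.895 and 0.019 are the loading magnitudes reported for finance; the data are simulated, not the example’s) and shows that ordinary least squares recovers the loadings as slopes and the communality as R².

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
F = rng.standard_normal((n, 2))            # two orthogonal factors
true_loadings = np.array([0.895, 0.019])   # magnitudes from the example
noise_sd = np.sqrt(1 - (true_loadings ** 2).sum())
z = F @ true_loadings + noise_sd * rng.standard_normal(n)

X = np.column_stack([np.ones(n), F])       # intercept plus the two factors
beta, *_ = np.linalg.lstsq(X, z, rcond=None)
resid = z - X @ beta
r2 = 1 - resid.var() / z.var()             # approximates the communality
```

With orthogonal unit-variance regressors, each slope equals the Pearson correlation between the variable and the factor, which is precisely the definition of a factor loading.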

In case there is any interest in verifying these facts, we must obtain the standardized variables by clicking on Analyze → Descriptive Statistics → Descriptives …. When we select all the original variables, we must click on Save standardized values as variables. Although this specific procedure is not shown here, after clicking on OK, the standardized variables will be generated in the dataset itself.

Therefore, based on the factors generated, we are able to create the desired school performance ranking. In order to do that, we will use the criterion described in Section 12.2.6, known as weighted rank-sum criterion, in which a new variable is generated from the multiplication of the values of each factor by the respective proportions of variance shared by the original variables. Thus, this new variable, which we call ranking, has the following expression:

rankingᵢ = 0.62942·F′₁ᵢ + 0.25043·F′₂ᵢ

in which parameters 0.62942 and 0.25043 correspond to the proportions of variance shared by the first two factors, respectively, as shown in Fig. 12.28.
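The same weighted rank-sum computation can be sketched in a few lines of NumPy; the factor values below are simulated stand-ins for FAC1_1 and FAC2_1, not the example’s actual factors.

```python
import numpy as np

rng = np.random.default_rng(42)
f1 = rng.standard_normal(10)       # stand-in for FAC1_1
f2 = rng.standard_normal(10)       # stand-in for FAC2_1

# Weighted rank-sum criterion: weight each rotated factor by its
# proportion of shared variance, then sort in descending order.
ranking = 0.62942 * f1 + 0.25043 * f2
order = np.argsort(-ranking)       # best school performance first
```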

In order for the variable to be generated in the dataset, we must click on Transform → Compute Variable …. In Target Variable, we must type the name of the new variable (ranking) and, in Numeric Expression, we must type the weighted sum expression (FAC1_1*0.62942) + (FAC2_1*0.25043), as shown in Fig. 12.33. When we click on OK, the variable ranking will appear in the dataset.

Fig. 12.33
Fig. 12.33 Creating the new variable (ranking).

Finally, to sort variable ranking, we must click on Data → Sort Cases …. In addition to selecting the option Descending, we must insert the variable ranking into Sort by, as shown in Fig. 12.34. When we click on OK, the observations will appear sorted in the dataset, from the highest to the lowest value of variable ranking, as shown in Fig. 12.35 for the 20 observations with the best school performance.

Fig. 12.34
Fig. 12.34 Dialog box for sorting the observations by variable ranking.
Fig. 12.35
Fig. 12.35 Dataset with the school performance ranking.

We can see that the ranking constructed through the weighted rank-sum criterion points to Adelino as the student with the best school performance in that set of subjects, followed by Renata, Giulia, Felipe, and Cecilia.

Having presented the procedures for applying the principal component factor analysis in SPSS, let’s now discuss the technique in Stata, following the standard used in this book.

12.4 Principal Component Factor Analysis in Stata

We now present the step-by-step procedure for developing our example in the Stata Statistical Software. In this section, our main goal is not to discuss the concepts of the principal component factor analysis once again; instead, it is to give researchers an opportunity to elaborate the technique by using this software’s commands. Every time we present an output, we will mention the respective result obtained when applying the technique algebraically and also by using SPSS. The use of the images in this section has been authorized by StataCorp LP©.

Therefore, right away, we begin with the dataset constructed by the professor from the final grades of each one of his 100 students. This dataset can be found in the file FactorGrades.dta and is exactly the same as the one partially presented in Table 12.6 in Section 12.2.6.

First of all, we can type the command desc, which makes the analysis of the dataset characteristics possible, such as the number of observations, the number of variables, and the description of each one of them. Fig. 12.36 shows this first output in Stata.

Fig. 12.36
Fig. 12.36 Description of the FactorGrades.dta dataset.

The command pwcorr …, sig generates Pearson’s correlation coefficients between each pair of variables, with their respective significance levels. Therefore, we must type the following command:

pwcorr finance costs marketing actuarial, sig

Fig. 12.37 shows the output generated.

Fig. 12.37
Fig. 12.37 Pearson’s correlation coefficients and respective significance levels.

The outputs seen in Fig. 12.37 show that the correlations between the variable marketing and each one of the other variables are relatively low and not statistically significant, at a significance level of 0.05. On the other hand, the other variables show high and statistically significant correlations between one another at this significance level, which is a first indication that the factor analysis may group them into a certain factor without any substantial loss of their variances, while the variable marketing may show a high correlation with another factor. These results are in accordance with the ones presented in Table 12.7 in Section 12.2.6 and in Fig. 12.16, when we elaborated the technique in SPSS (Section 12.3).

The factor analysis’s overall adequacy can be evaluated through the results of the KMO statistic and Bartlett’s test of sphericity, which can be obtained by using the command factortest. Thus, let’s type:

factortest finance costs marketing actuarial

The outputs generated can be seen in Fig. 12.38.

Fig. 12.38
Fig. 12.38 Results of the KMO statistic and Bartlett’s test of sphericity.

Based on the result of the KMO statistic, the overall adequacy of the factor analysis can be considered middling. However, more important than this piece of information is the result of Bartlett’s test of sphericity. From the χ²Bartlett statistic, with a significance level of 0.05 and 6 degrees of freedom, we can say that Pearson’s correlation matrix is statistically different from the identity matrix with the same dimension, since χ²Bartlett = 192.335 (χ² calculated for 6 degrees of freedom) and Prob. χ²Bartlett (P-value) < 0.05. Note that the results of these statistics are in accordance with the ones calculated algebraically in Section 12.2.6 and also shown in Fig. 12.17 of Section 12.3. Fig. 12.38 also shows the value of the determinant of the correlation matrix, used to calculate the χ²Bartlett statistic.

Stata also allows us to obtain the eigenvalues and eigenvectors of the correlation matrix. In order to do that, we must type the following command:

pca finance costs marketing actuarial

Fig. 12.39 shows these eigenvalues and eigenvectors, which are exactly the same as the ones calculated algebraically in Section 12.2.6. Since we have not yet elaborated the procedure for rotating the factors, we can verify that the proportions of variance shared by the original variables to form each factor correspond to the ones presented in Table 12.10.

Fig. 12.39
Fig. 12.39 Eigenvalues and eigenvectors of the correlation matrix.

After having presented these first outputs, we can now elaborate the principal component factor analysis itself by typing the following command, whose results are shown in Fig. 12.40.

Fig. 12.40
Fig. 12.40 Outputs of the principal component factor analysis in Stata.

factor finance costs marketing actuarial, pcf

where the term pcf refers to the principal-component factor method.

The upper part of Fig. 12.40 shows the eigenvalues of the correlation matrix once again, with the respective proportions of shared variance of the original variables, since researchers may choose not to use the command pca. The lower part of the figure shows the factor loadings, which represent the correlations between each variable and the factors with eigenvalues greater than 1. Therefore, we can see that Stata automatically considers the latent root criterion (Kaiser criterion) when choosing the number of factors. If researchers wish to extract more factors by considering a smaller eigenvalue, they must add the term mineigen(#) at the end of the command factor, in which # corresponds to the minimum eigenvalue from which factors will be extracted.

The factor loadings shown in Fig. 12.40 are equal to the ones in the first two columns of Table 12.12 in Section 12.2.6, and in Fig. 12.20 of Section 12.3. Through them, we can see that, while the variables finance, costs, and actuarial show high correlations with the first factor, the variable marketing shows a strong correlation with the second factor. Besides, the factor loadings matrix also presents a column called Uniqueness (or exclusivity), whose values represent, for each variable, the proportion of variance lost to form the factors extracted, that is, (1 − communality). Therefore, we have:

uniqueness_finance = 1 − (0.8953² + 0.0068²) = 0.1983
uniqueness_costs = 1 − (0.9343² + 0.0487²) = 0.1246
uniqueness_marketing = 1 − (0.0424² + 0.9989²) = 0.0003
uniqueness_actuarial = 1 − (0.9179² + 0.0101²) = 0.1573

Consequently, because the variable marketing has low correlations with each one of the other original variables, it ends up having a high Pearson’s correlation with the second factor. This makes its uniqueness value very low, since the proportion of its variance shared with the second factor is almost 100%.
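The uniqueness column can be checked directly from the loadings reported in Fig. 12.40 (absolute values, since only their squares matter):

```python
import numpy as np

loadings = np.array([[0.8953, 0.0068],    # finance
                     [0.9343, 0.0487],    # costs
                     [0.0424, 0.9989],    # marketing
                     [0.9179, 0.0101]])   # actuarial

# uniqueness = 1 - communality = 1 - (sum of squared loadings per row)
uniqueness = 1 - (loadings ** 2).sum(axis=1)
```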

Knowing that two factors are extracted, at this moment, we will carry out the rotation by using the Varimax method. In order to do that, we must type the following command:

rotate, varimax horst

where the term horst defines the rotation angle from the standardized factor loadings. This procedure is in accordance with the one elaborated algebraically in Section 12.2.6. The outputs generated can be seen in Fig. 12.41.

Fig. 12.41
Fig. 12.41 Rotation of factors through the Varimax method.

From Fig. 12.41, as we have already discussed, we can verify that the proportion of variance shared by all the variables to form both factors remains equal to 87.98%, even though the eigenvalue of each rotated factor is different from the one obtained previously. The same can be said about the uniqueness values of each variable, even though the rotated factor loadings differ from their unrotated counterparts, since the Varimax method maximizes the loadings of each variable in a certain factor. Fig. 12.41 also shows the rotation angle at the end. All of these outputs are identical to the ones calculated in Section 12.2.6 and were also presented when we elaborated the technique in SPSS, in Figs. 12.25, 12.27, and 12.28.

Thus, we can say that:

uniqueness_finance = 1 − (0.8951² + 0.0195²) = 0.1983
uniqueness_costs = 1 − (0.9354² + 0.0213²) = 0.1246
uniqueness_marketing = 1 − (0.0131² + 0.9997²) = 0.0003
uniqueness_actuarial = 1 − (0.9172² + 0.0370²) = 0.1573

and that:

0.8951² + 0.9354² + 0.0131² + 0.9172² = λ₁² = 2.51768
0.0195² + 0.0213² + 0.9997² + 0.0370² = λ₂² = 1.00170

If researchers wish, Stata also allows them to compare the rotated factor loadings to the ones obtained before the rotation in the same table. In order to do that, after elaborating the rotation, it is necessary to type the following command:

estat rotatecompare

The outputs generated can be seen in Fig. 12.42.

Fig. 12.42
Fig. 12.42 Comparison of the rotated and unrotated factor loadings.

At this moment, the loading plot of the rotated factor loadings can be obtained by typing the command loadingplot. This chart, which corresponds to the ones presented in Figs. 12.9 and 12.26, can be seen in Fig. 12.43.

Fig. 12.43
Fig. 12.43 Loading plot with rotated loadings.

After developing these procedures, the researcher may want to generate two new variables in the dataset, which correspond to the rotated factors obtained through the factor analysis. Therefore, it is necessary to type the following command:

predict f1 f2

where f1 and f2 are the names of the variables corresponding to the first and second factors, respectively. When we type the command, in addition to creating these two new variables in the dataset, an output similar to the one in Fig. 12.44 will also be generated, in which the rotated factor scores are presented.

Fig. 12.44
Fig. 12.44 Generating the factors in the dataset and the rotated factor scores.

The results shown in Fig. 12.44 are equivalent to the ones in SPSS (Fig. 12.29). Besides, it is also possible to verify that both factors generated are orthogonal, that is, they have a Pearson’s correlation coefficient equal to 0. In order to do that, let’s type:

estat common

which results in the output seen in Fig. 12.45.

Fig. 12.45
Fig. 12.45 Pearson’s correlation coefficient between both rotated factors.

Only for pedagogical purposes, we can also obtain the scores and the rotated factor loadings from multiple linear regression models. In order to do that, first of all, we have to generate the standardized variables by using the Z-scores procedure in the dataset, from each one of the original variables, by typing the following sequence of commands:

egen zfinance = std(finance)
egen zcosts = std(costs)
egen zmarketing = std(marketing)
egen zactuarial = std(actuarial)
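The egen std() commands above apply the usual Z-scores transformation; an equivalent sketch in Python, with hypothetical grade values, would be:

```python
import numpy as np

grades = np.array([5.8, 7.1, 6.4, 8.0, 4.9])        # hypothetical grades
z = (grades - grades.mean()) / grades.std(ddof=1)   # sample std, as in Stata
```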

Having done this, we can type the two following commands, which represent two multiple linear regression models, each of which has a certain factor as the dependent variable and the standardized variables as explanatory variables.

reg f1 zfinance zcosts zmarketing zactuarial
reg f2 zfinance zcosts zmarketing zactuarial

The results of these models can be seen in Fig. 12.46.

Fig. 12.46
Fig. 12.46 Outputs of the multiple linear regression models with factors as dependent variables.

By analyzing Fig. 12.46, we note that the parameters estimated in each model correspond to the rotated factor scores for each variable, according to what has already been shown in Fig. 12.44. Thus, since all the intercepts are practically equal to 0, we can write:

F′₁ᵢ = 0.3554795·Zfinanceᵢ + 0.3721907·Zcostsᵢ + 0.0124719·Zmarketingᵢ + 0.3639452·Zactuarialᵢ

F′₂ᵢ = −0.0036389·Zfinanceᵢ + 0.0377955·Zcostsᵢ + 0.9986053·Zmarketingᵢ − 0.020781·Zactuarialᵢ

Obviously, since each factor is formed exclusively from variance shared by the four variables, the coefficients of determination R² of each model are equal to 1.

On the other hand, to obtain the rotated factor loadings, we must type the following four commands, which represent four multiple linear regression models, in which each one of them has a certain standardized variable as a dependent variable, and the rotated factors, as explanatory variables.

reg zfinance f1 f2
reg zcosts f1 f2
reg zmarketing f1 f2
reg zactuarial f1 f2

The results of these models can be seen in Fig. 12.47.

Fig. 12.47
Fig. 12.47 Outputs of the multiple linear regression models with standardized variables as dependent variables.

By analyzing this figure, note that the parameters estimated in each model correspond to the rotated factor loadings for each factor, according to what has already been shown in Fig. 12.41. Therefore, since all the intercepts are practically equal to 0, we can write:

Zfinanceᵢ = 0.895146·F′₁ᵢ − 0.0194694·F′₂ᵢ + uᵢ,  R² = 1 − uniqueness = 0.8017

Zcostsᵢ = 0.935375·F′₁ᵢ + 0.0212916·F′₂ᵢ + uᵢ,  R² = 1 − uniqueness = 0.8754

Zmarketingᵢ = 0.013053·F′₁ᵢ + 0.9997495·F′₂ᵢ + uᵢ,  R² = 1 − uniqueness = 0.9997

Zactuarialᵢ = 0.917223·F′₁ᵢ − 0.0370175·F′₂ᵢ + uᵢ,  R² = 1 − uniqueness = 0.8427

where the terms ui represent additional sources of variance, besides factors F1′ and F2′, to explain the behavior of each variable, since two other factors with eigenvalues less than 1 could also have been extracted. The coefficients of determination R² of each model, which are different from 1, correspond to the communality of each variable, that is, to (1 − uniqueness).

Although researchers can choose not to estimate multiple linear regression models when applying the factor analysis, since it is only a verification procedure, we believe that its didactical nature is essential for fully understanding the technique.

From the rotated factors extracted (variables f1 and f2), we can define the desired school performance ranking. As elaborated when applying the technique in SPSS, we will use the criterion described in Section 12.2.6, known as the weighted rank-sum criterion, in which a new variable is generated from the multiplication of the values of each factor by the respective proportions of variance shared by the original variables. Let’s type the following command:

gen ranking = f1*0.6294 + f2*0.2504

where the terms 0.6294 and 0.2504 correspond to the proportions of variance shared by the first two factors, respectively, as shown in Fig. 12.41. The new variable generated in the dataset is called ranking. Next, we can sort the observations, from the highest to the lowest value of variable ranking, by typing the following command:

gsort -ranking

After that, just as an example, we can list the school performance ranking of the best 20 students, based on the joint behavior of the final grades in all four subjects. In order to do that, we can type the following command:

list student ranking in 1/20

Fig. 12.48 shows the ranking of the top 20 students.
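As a didactical sketch, the weighted rank-sum criterion can also be reproduced in a few lines of Python. The factor scores below are hypothetical placeholders, not the actual scores from FactorGrades; only the weights 0.6294 and 0.2504 come from the example.

```python
import numpy as np

# Hypothetical factor scores for five students (illustrative only; the
# real scores come from the factor analysis of the FactorGrades data).
students = np.array(["Gustavo", "Ovidio", "Estela", "Leonor", "Dalila"])
f1 = np.array([1.20, 1.05, 0.30, -0.10, 0.15])
f2 = np.array([0.80, -1.40, -1.10, 0.75, 0.10])

# Weighted rank-sum criterion: weight each factor score by the
# proportion of variance shared by that factor (0.6294 and 0.2504).
ranking = f1 * 0.6294 + f2 * 0.2504

# Sort from the highest to the lowest value (the gsort -ranking step)
# and list the result (the list command).
order = np.argsort(-ranking)
for name, score in zip(students[order], ranking[order]):
    print(f"{name:10s} {score:+.4f}")
```

Note that a student with a middling score on the dominant first factor (Leonor) can still outrank one with a slightly higher f1 (Dalila) thanks to the second factor's weight.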

Fig. 12.48
Fig. 12.48 School performance ranking of the best 20 students.

12.5 Final Remarks

There are many situations in which researchers wish to group variables into one or more factors, verify the validity of previously established constructs, create orthogonal factors for later use in confirmatory multivariate techniques that require the absence of multicollinearity, or create rankings by developing performance indexes. In these situations, factor analysis procedures are highly recommended, and the one most frequently used is principal components.

Therefore, factor analysis allows us to improve decision-making processes based on the behavior of, and the interdependence between, quantitative variables that show some degree of correlation. Since the factors generated from the original variables are also quantitative variables, the outputs of a factor analysis can serve as inputs to other multivariate techniques, such as cluster analysis. The stratification of each factor into ranges may also allow the association between these ranges and the categories of other qualitative variables to be evaluated through a correspondence analysis.

The use of factors in confirmatory multivariate techniques may also make sense when researchers intend to elaborate diagnostics about the behavior of a certain dependent variable and use the extracted factors as explanatory variables, a fact that eliminates possible multicollinearity problems because the factors are orthogonal. A qualitative variable obtained by stratifying a certain factor into ranges can be used, for example, in a multinomial logistic regression model, which allows the preparation of a diagnostic on the probability each observation has of being in each range, due to the behavior of other explanatory variables not initially considered in the factor analysis.

Regardless of the main goal for applying the technique, factor analysis may bear good and interesting research fruits that can be useful for the decision-making process. Its preparation must always be carried out through the correct and conscious use of the software package chosen for the modeling, based on the underlying theory and on researchers’ experience and intuition.

12.6 Exercises

  (1) From a dataset that contains certain clients' variables (individuals), analysts from a bank's Customer Relationship Management (CRM) department elaborated a principal component factor analysis aiming to study the joint behavior of these variables so that, afterwards, they can propose the creation of an investment profile index. The variables used in the modeling were:
Variable     Description
age          Client i's age (years)
fixedif      Percentage of resources invested in fixed-income funds (%)
variableif   Percentage of resources invested in variable-income funds (%)
people       Number of people who live in the residence

In a certain management report, these analysts presented the factor loadings (Pearson's correlation coefficients) between each original variable and the two factors extracted using the latent root (Kaiser) criterion. These factor loadings can be found in the following table:

Variable      Factor 1   Factor 2
age             0.917      0.047
fixedif         0.874      0.077
variableif     −0.844      0.197
people          0.031      0.979

We would like you to answer the following questions:

  (a) Which eigenvalues correspond to the two factors extracted?
  (b) What are the proportions of variance shared by all the variables to form each factor? What is the total proportion of variance lost by the four variables when extracting these two factors?
  (c) For each variable, what is the proportion of shared variance used to form both factors (communality)?
  (d) What is the expression of each standardized variable based on the two factors extracted?
  (e) Construct a loading plot from the factor loadings.
  (f) Interpret both factors based on the distribution of the loadings of each variable.
  (2) A researcher specialized in analyzing the behavior of nations' socioeconomic indexes would like to investigate the possible relationship between variables related to corruption, violence, income, and education. In order to do that, he collected data on 50 countries considered to be developed or emerging for two consecutive years. The data can be found in the files CountriesIndexes.sav and CountriesIndexes.dta, which have the following variables:
Variable      Period    Description
country                 A string variable that identifies country i
cpi1          Year 1    Corruption perception index, which corresponds to citizens' perception of abuses committed by the public sector as regards a nation's private assets, including administrative and political aspects. The lower the index, the higher the perception of corruption in the country (Source: Transparency International)
cpi2          Year 2
violence1     Year 1    Number of murders per 100,000 inhabitants (Sources: World Health Organization, United Nations Office on Drugs and Crime, and GIMD Global Burden of Injuries)
violence2     Year 2
capita_gdp1   Year 1    Per capita GDP in US$ adjusted for inflation, using 2000 as the base year (Source: World Bank)
capita_gdp2   Year 2
school1       Year 1    Average number of years in school per person over 25 years of age, including primary, secondary, and higher education (Source: Institute for Health Metrics and Evaluation)
school2       Year 2


In order to create a socioeconomic index that generates a country ranking for each year, the researcher has decided to elaborate a principal component factor analysis using the variables of each period. Based on the results obtained, we would like you to answer the following questions:

  (a) By using the KMO statistic and Bartlett's test of sphericity, is it possible to state that the principal component factor analysis is adequate for each one of the years of study? In the case of Bartlett's test of sphericity, use a significance level of 0.05.
  (b) How many factors are extracted in the analysis in each of the years, considering the latent root criterion? Which eigenvalue(s) correspond to the factor(s) extracted each year, and what proportion(s) of variance do all the variables share to form this(these) factor(s)?
  (c) For each variable, what is the proportion of shared variance used to form the factor(s) each year? Did any alterations in the communalities of each variable occur from one year to the next?
  (d) What are the expression(s) of the factor(s) extracted each year, based on the standardized variables? From one year to the next, did any alterations occur in the factor scores of the variables in each factor? Discuss the importance of developing a specific factor analysis each year in order to create indexes.
  (e) Considering the principal factor extracted as a socioeconomic index, create a country ranking from this index for each one of the years. From one year to the next, were there any changes in the countries' positions in the ranking?
  (3) The general manager of a store, which belongs to a chain of drugstores, wishes to find out its consumers' perception of eight attributes, which are described below:
Attribute (Variable)   Description
assortment             Perception of the variety of goods
replacement            Perception of the quality and speed of inventory replacement
layout                 Perception of the store's layout
comfort                Perception of thermal, acoustic, and visual comfort inside the store
cleanliness            Perception of the store's general cleanliness
services               Perception of the quality of the services rendered
prices                 Perception of the store's prices compared to the competition
discounts              Perception of the store's discount policy

In order to do that, he carried out a survey with 1700 clients at the store over a certain period. The questionnaire was structured based on groups of attributes, and each question corresponding to an attribute asked the consumer to assign a score from 0 to 10 depending on his or her perception of that attribute: 0 corresponded to an entirely negative perception, and 10, to the best perception possible. Since the store's general manager is rather experienced, he decided, in advance, to gather the questions into three groups, such that the complete questionnaire would be as follows:

Based on your perception, fill out the questionnaire below with scores from 0 to 10, in which 0 means that your perception is entirely negative in relation to a certain attribute, and 10, that your perception is the best possible. Each item receives a score.

Products and store environment
  Please rate the store's variety of goods on a scale of 0–10
  Please rate the store's quality and speed of inventory replacement on a scale of 0–10
  Please rate the store's layout on a scale of 0–10
  Please rate the store's thermal, acoustic, and visual comfort on a scale of 0–10
  Please rate the store's general cleanliness on a scale of 0–10
Services
  Please rate the quality of the services rendered in our store on a scale of 0–10
Prices and discount policy
  Please rate the store's prices compared to the competition on a scale of 0–10
  Please rate our discount policy on a scale of 0–10


The complete dataset developed by the store’s general manager can be seen in the files DrugstorePerception.sav and DrugstorePerception.dta. We would like you to:

  (a) Present the correlation matrix between each pair of variables. Based on the magnitude of the values of Pearson's correlation coefficients, is it possible to identify any indication that the factor analysis may group the variables into factors?
  (b) By using the result of Bartlett's test of sphericity, is it possible to state, at a significance level of 0.05, that the principal component factor analysis is adequate?
  (c) How many factors are extracted in the analysis considering the latent root criterion? Which eigenvalue(s) correspond to the factor(s) extracted, and what proportion(s) of variance do all the variables share to form this(these) factor(s)?
  (d) What is the total percentage of variance of the original variables lost as a result of extracting the factor(s) based on the latent root criterion?
  (e) For each variable, what are the loading and the proportion of shared variance used to form the factor(s)?
  (f) By demanding the extraction of three factors, to the detriment of the latent root criterion, and based on the new factor loadings, is it possible to confirm the construct of the questionnaire proposed by the store's general manager? In other words, do the variables of each group in the questionnaire, in fact, end up sharing more variance with a common factor?
  (g) Discuss the impact of the decision to extract three factors on the communality values.
  (h) Carry out a Varimax rotation and, based on the redistribution of the factor loadings, discuss once again the construct initially proposed in the questionnaire by the store's general manager.
  (i) Present the 3D loading plot with the rotated factor loadings.

Appendix: Cronbach’s Alpha

A.1 Brief Presentation

The alpha statistic, proposed by Cronbach (1951), is a measure used to assess the internal consistency of the variables in a dataset, that is, it measures the level of reliability with which a certain scale, adopted to define the original variables, produces consistent results about the relationship between these variables. According to Nunnally and Bernstein (1994), the level of reliability is defined from the behavior of the correlations between the original (or standardized) variables, and, therefore, Cronbach's alpha can be used to evaluate the reliability with which a factor can be extracted from variables, and is thus related to factor analysis.

According to Rogers et al. (2002), even though Cronbach's alpha is not the only existing measure of reliability, since it has constraints related to multidimensionality, that is, to the identification of multiple factors, it can be defined as the measure that makes it possible to assess the intensity with which a certain construct or factor is present in the original variables. Therefore, a dataset whose variables share a single factor tends to have a high Cronbach's alpha.

Hence, Cronbach's alpha cannot be used to assess the overall adequacy of the factor analysis, unlike the KMO statistic and Bartlett's test of sphericity, since its magnitude only gives the researcher an indication of the internal consistency of the scale used to extract a single factor. If its value is low, not even the first factor will be adequately extracted, which is the main reason why some researchers choose to study the magnitude of Cronbach's alpha before running the factor analysis, even though this is not a mandatory requisite for developing the technique.

Cronbach’s alpha can be defined by the following expression:

\[
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{k} Var_k}{Var_{sum}}\right) \tag{12.41}
\]

where:

  • Var_k is the variance of the kth variable, and

    \[
    Var_{sum} = \frac{\displaystyle\sum_{i=1}^{n}\left(\sum_{k} X_{ki}\right)^2 - \frac{1}{n}\left(\sum_{i=1}^{n}\sum_{k} X_{ki}\right)^2}{n-1} \tag{12.42}
    \]

which represents the variance of the sum of each row in the dataset, that is, the variance of the sum of the values corresponding to each observation. Besides, we know that n is the sample size, and k, the number of variables X.

So, we can state that, if the variable values are consistent with one another (highly correlated), the term Var_sum will be large enough for alpha (α) to tend to 1. On the other hand, variables with low correlations, possibly due to random observation values, will make the term Var_sum approach the sum of the variances of each variable (Var_k), which will make alpha (α) tend to 0.

Although there is no consensus in the existing literature about the value of alpha from which there is internal consistency of the variables in the dataset, it is interesting that the result obtained is greater than 0.6 when we apply exploratory techniques.
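Expressions (12.41) and (12.42) can be implemented in a few lines. The sketch below, in Python with NumPy, computes Var_sum as the sample variance of the row sums; the two checks use hand-verifiable data (identical columns must give α = 1, and the small 3 × 2 matrix gives α = 2/3).

```python
import numpy as np

def cronbach_alpha(X):
    """Cronbach's alpha following Expressions (12.41) and (12.42).

    X is an (n, k) array with one column per variable."""
    n, k = X.shape
    sum_var_k = X.var(axis=0, ddof=1).sum()  # sum of the Var_k terms
    var_sum = X.sum(axis=1).var(ddof=1)      # Var_sum: variance of row sums
    return (k / (k - 1)) * (1 - sum_var_k / var_sum)

# Identical columns are perfectly consistent, so alpha equals 1:
col = np.array([1.0, 4.0, 2.0, 5.0, 3.0])
print(cronbach_alpha(np.column_stack([col] * 4)))

# Hand-checkable case: Var_1 = Var_2 = 1 and Var_sum = 3,
# so alpha = 2 * (1 - 2/3) = 2/3.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]])
print(cronbach_alpha(X))
```

Computing the variance of the row sums, rather than expanding Expression (12.42) term by term, is numerically equivalent because the sample variance already subtracts the squared mean of the sums.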

Next, we will discuss the calculation of Cronbach’s alpha for the data in the example used throughout this chapter.

A.2 Determining Cronbach’s Alpha Algebraically

From the standardized variables in the example studied throughout this chapter, we can construct Table 12.20, which helps us calculate Cronbach’s alpha.

Table 12.20

Procedure for Calculating Cronbach’s Alpha
Student        Zfinance_i   Zcosts_i   Zmarketing_i   Zactuarial_i   Σₖ₌₁⁴ Xₖᵢ   (Σₖ₌₁⁴ Xₖᵢ)²
Gabriela          −0.011      −0.290       −1.650          0.273        −1.679         2.817
Luiz Felipe       −0.876      −0.697        1.532         −1.319        −1.360         1.849
Patricia          −0.876      −0.290       −0.590         −0.523        −2.278         5.191
Gustavo            1.334       1.337        0.825          1.069         4.564        20.832
Leticia           −0.779      −1.104       −0.872         −0.841        −3.597        12.939
Ovidio             1.334       2.150       −1.650          1.865         3.699        13.682
Leonor            −0.267       0.116        0.825         −0.125         0.549         0.301
Dalila            −0.139       0.523        0.118          0.273         0.775         0.600
Antonio            0.021      −0.290       −0.590         −0.523        −1.382         1.909
Estela             0.982       0.113       −1.297          1.069         0.868         0.753
⋮
Variance           1.000       1.000        1.000          1.000    Σᵢ₌₁¹⁰⁰ ΣₖXₖᵢ = 0   Σᵢ₌₁¹⁰⁰ (ΣₖXₖᵢ)² = 832.570


Thus, based on Expression (12.42), we have:

\[
Var_{sum} = \frac{832.570}{99} = 8.410
\]

and, by using Expression (12.41), we can calculate Cronbach’s alpha:

\[
\alpha = \frac{4}{3}\left(1 - \frac{4}{8.410}\right) = 0.699
\]

We can consider this value acceptable for the internal consistency of the variables in our dataset. Nevertheless, as we will see when determining Cronbach's alpha in SPSS and in Stata, there is a considerable loss of reliability because the original variables are not all measuring the same factor, that is, the same dimension, since this statistic has constraints related to multidimensionality. Indeed, if we did not include the variable marketing when calculating Cronbach's alpha, its value would be considerably higher, which indicates that this variable does not contribute to the construct, or to the first factor, formed by the other variables (finance, costs, and actuarial).
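The arithmetic above is easy to double-check programmatically. This short Python sketch simply replays Expressions (12.41) and (12.42) with the figures from the example (n = 100, k = 4, standardized variables, so each Var_k = 1).

```python
# Replaying the algebraic steps for the grades example:
# n = 100, k = 4, and every Var_k = 1 (standardized variables).
# The double-sum (mean) term in Expression (12.42) is 0 because each
# standardized variable sums to 0 across the 100 students.
var_sum = 832.570 / 99               # Expression (12.42)
alpha = (4 / 3) * (1 - 4 / var_sum)  # Expression (12.41)
print(f"{var_sum:.3f} {alpha:.3f}")  # prints: 8.410 0.699
```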

The complete spreadsheet with the calculation of Cronbach’s alpha can be found in the file AlphaCronbach.xls.

Analogous to what was done throughout this chapter, next, we will present the procedures for obtaining Cronbach’s alpha in SPSS and in Stata.

A.3 Determining Cronbach’s Alpha in SPSS

Once again, let’s use the file FactorGrades.sav. In order for us to determine Cronbach’s alpha based on the standardized variables, first, we must standardize them by using the Z-scores procedure. To do that, let’s click on Analyze → Descriptive Statistics → Descriptives …. When we select all the original variables, we must click on Save standardized values as variables. Although this specific procedure is not shown here, after clicking on OK, the standardized variables will be generated in the dataset itself.

After that, let’s click on Analyze → Scale → Reliability Analysis …. A dialog box will open. We must insert the standardized variables into Items, as shown in Fig. 12.49.

Fig. 12.49
Fig. 12.49 Dialog box for determining Cronbach’s alpha in SPSS.

Next, in Statistics …, we must select the option Scale if item deleted, as shown in Fig. 12.50. This option calculates the different values of Cronbach’s alpha when each variable in the analysis is eliminated. The term item is often mentioned in Cronbach’s work (1951), and it is used as a synonym for variable.

Fig. 12.50
Fig. 12.50 Selecting the option to calculate alpha when excluding a certain variable.

Next, we can click on Continue and on OK.

Fig. 12.51 shows the result of Cronbach’s alpha, whose value is exactly the same as the one calculated through Expressions (12.41) and (12.42) and shown in the previous section.

Fig. 12.51
Fig. 12.51 Result of Cronbach’s alpha in SPSS.

Furthermore, Fig. 12.52 also shows, in the last column, the Cronbach's alpha values that would be obtained if a certain variable were excluded from the analysis. Therefore, we can see that the presence of the variable marketing contributes negatively to the identification of a single factor because, as we know, this variable shows a strong correlation with the second factor extracted by the principal component factor analysis elaborated throughout this chapter. Since Cronbach's alpha is a one-dimensional measure of reliability, excluding the variable marketing would raise its value to 0.904.
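The "Scale if item deleted" logic can be mimicked with a short Python sketch on simulated data. The dataset below is hypothetical: three columns share one common factor while the fourth is unrelated noise, mirroring the role that marketing plays in our example.

```python
import numpy as np

def cronbach_alpha(X):
    n, k = X.shape
    return (k / (k - 1)) * (1 - X.var(axis=0, ddof=1).sum()
                            / X.sum(axis=1).var(ddof=1))

# Hypothetical standardized scores: columns 0-2 share one common factor,
# while column 3 is unrelated noise (mirroring the role of marketing).
rng = np.random.default_rng(7)
common = rng.standard_normal(500)
cols = [common + 0.4 * rng.standard_normal(500) for _ in range(3)]
cols.append(rng.standard_normal(500))
X = np.column_stack(cols)
X = (X - X.mean(axis=0)) / X.std(axis=0)

# "Scale if item deleted": recompute alpha with each column dropped.
# Dropping the unrelated column should yield by far the highest alpha.
for j in range(X.shape[1]):
    a = cronbach_alpha(np.delete(X, j, axis=1))
    print(f"alpha without column {j}: {a:.3f}")
```

Just as in Fig. 12.52, the alpha computed without the unrelated column is much higher than the others, flagging that column as the one that harms unidimensionality.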

Fig. 12.52
Fig. 12.52 Cronbach’s alpha when excluding each variable.

Next, we will obtain the same outputs by using specific commands in Stata.

A.4 Determining Cronbach’s Alpha in Stata

Now, let’s open the file FactorGrades.dta. In order to calculate Cronbach’s alpha, we must type the following command:

alpha finance costs marketing actuarial, asis std

where the option std causes Cronbach's alpha to be calculated from the standardized variables, even though the original (unstandardized) variables are listed in the alpha command.

The output generated can be seen in Fig. 12.53.

Fig. 12.53
Fig. 12.53 Result of Cronbach’s alpha in Stata.

If researchers wish to obtain the Cronbach's alpha values that result from excluding each one of the variables, as is done in SPSS, they may type the following command:

alpha finance costs marketing actuarial, asis std item

The new outputs are shown in Fig. 12.54, in which the values of the last column are exactly the same as the ones presented in Fig. 12.52, which corroborates the fact that the variables finance, costs, and actuarial show high internal consistency for determining a single factor.

Fig. 12.54
Fig. 12.54 Internal consistency when excluding each variable—last column.

References

Bartlett M.S. A note on the multiplying factors for various χ2 approximations. J. Roy. Stat. Soc. Ser. B. 1954;16(2):296–298.

Cronbach L.J. Coefficient alpha and the internal structure of tests. Psychometrika. 1951;16(3):297–334.

Fávero L.P., Belfiore P. Manual de análise de dados: estatística e modelagem multivariada com Excel®, SPSS® e Stata®. Rio de Janeiro: Elsevier; 2017.

Gorsuch R.L. Factor Analysis. second ed. Mahwah: Lawrence Erlbaum Associates; 1983.

Gujarati D.N., Porter D.C. Econometria básica. fifth ed. New York: McGraw-Hill; 2008.

Harman H.H. Modern Factor Analysis. third ed. Chicago: University of Chicago Press; 1976.

Kaiser H.F. A second generation little jiffy. Psychometrika. 1970;35(4):401–415.

Kaiser H.F. The varimax criterion for analytic rotation in factor analysis. Psychometrika. 1958;23(3):187–200.

Nunnally J.C., Bernstein I.H. Psychometric Theory. third ed. New York: McGraw-Hill; 1994.

Pearson K. Mathematical contributions to the theory of evolution. III. Regression, Heredity, and Panmixia. Philos. Trans. R. Soc. London. 1896;187:253–318.

Reis E. Estatística multivariada aplicada. second ed. Lisboa: Edições Sílabo; 2001.

Rogers W.M., Schmitt N., Mullins M.E. Correction for unreliability of multifactor measures: comparison of alpha and parallel forms approaches. Organ. Res. Methods. 2002;5(2):184–199.

Spearman C.E. “General intelligence,” objectively determined and measured. Am. J. Psychol. 1904;15(2):201–292.


"To view the full reference list for the book, click here"

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset