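$$\text{performance}_{tjk} = \pi_{0jk} + \pi_{1jk}\,\text{year}_{jk} + e_{tjk}$$

$$\pi_{0jk} = b_{00k} + b_{01k}\,\text{gender}_{jk} + r_{0jk}$$

$$\pi_{1jk} = b_{10k} + b_{11k}\,\text{gender}_{jk} + r_{1jk}$$

$$b_{00k} = \gamma_{000} + \gamma_{001}\,\text{texp}_{k} + u_{00k}$$

$$b_{01k} = \gamma_{010}$$

$$b_{10k} = \gamma_{100} + \gamma_{101}\,\text{texp}_{k} + u_{10k}$$

$$b_{11k} = \gamma_{110},$$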

which results in the following expression:

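$$\text{performance}_{tjk} = \gamma_{000} + \gamma_{100}\,\text{year}_{jk} + \gamma_{010}\,\text{gender}_{jk} + \gamma_{001}\,\text{texp}_{k} + \gamma_{110}\,\text{gender}_{jk}\,\text{year}_{jk} + \gamma_{101}\,\text{texp}_{k}\,\text{year}_{jk} + u_{00k} + u_{10k}\,\text{year}_{jk} + r_{0jk} + r_{1jk}\,\text{year}_{jk} + e_{tjk}$$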
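As a brief sketch of the substitution behind this expression, the level-3 equations can first be inserted into the level-2 equations, which gives the intercept and the slope of each student as:

```latex
\begin{aligned}
\pi_{0jk} &= \gamma_{000} + \gamma_{001}\,\text{texp}_{k} + u_{00k} + \gamma_{010}\,\text{gender}_{jk} + r_{0jk} \\
\pi_{1jk} &= \gamma_{100} + \gamma_{101}\,\text{texp}_{k} + u_{10k} + \gamma_{110}\,\text{gender}_{jk} + r_{1jk}
\end{aligned}
```

Replacing both in the level-1 equation, the terms of the slope are multiplied by year, which is where the products gender × year and texp × year, as well as the random terms u10k × year and r1jk × year, come from.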

To estimate this model, it is necessary to create one more new variable (texpyear), which corresponds to the multiplication between texp and year. Thus, let’s type the following command:

gen texpyear = texp*year

Therefore, we can estimate the model proposed by typing the following command:

xtmixed performance year gender texp genderyear texpyear || school: year || student: year, var nolog reml

whose outputs can be found in Fig. 23.42.

Fig. 23.42 Outputs of the linear trend model with random intercepts and slopes and level-2 variable gender and level-3 variable texp.

Even though the estimations of the fixed effects parameters and random effects variances are significant, at a significance level of 0.05, it is necessary to study the structure of the variance-covariance matrix of the random effects (u00k, u10k and r0jk, r1jk). Based on the outputs found in Fig. 23.42, we have:

  •  Random effects variance-covariance matrix for level school

$$\operatorname{var}\begin{bmatrix} u_{00k} \\ u_{10k} \end{bmatrix} = \begin{bmatrix} 87.994 & 0 \\ 0 & 0.263 \end{bmatrix}$$

  •  Random effects variance-covariance matrix for level student

$$\operatorname{var}\begin{bmatrix} r_{0jk} \\ r_{1jk} \end{bmatrix} = \begin{bmatrix} 337.627 & 0 \\ 0 & 3.092 \end{bmatrix}$$

To store the results of this estimation, we have to type:

estimates store finalindependent

Since we did not specify any covariance structure for these error terms, in the preparation of the command xtmixed, Stata assumes that this structure is independent, that is, that cov(u00k, u10k) = 0 and that cov(r0jk, r1jk) = 0. Nevertheless, we can generalize these matrices’ structure by allowing u00k and u10k to be correlated, and r0jk and r1jk to be correlated too. In order to do that, in the command xtmixed, it is necessary to add the term covariance(unstructured) to the random effects components of levels school and student, such that:

xtmixed performance year gender texp genderyear texpyear || school: year, covariance(unstructured) || student: year, covariance(unstructured) var nolog reml

which generates the outputs seen in Fig. 23.43.

Fig. 23.43 Outputs of the linear trend model with random intercepts and slopes and level-2 variable gender and level-3 variable texp, with correlated random effects (u00k, u10k) and (r0jk, r1jk).

The fixed effects parameter estimations are extremely close to those obtained when estimating the model that assumes an independent structure for the random effects variance-covariance matrices (Fig. 23.42).

Regarding the random effects parameters, all the estimations are significant at a significance level of 0.05, except for the variance of u10k and cov(u00k, u10k), which are statistically significant at a significance level of 0.10, since their respective | z | > 1.64, the critical value of the standardized normal distribution for a two-tailed test at a significance level of 0.10. For educational purposes, we will use a confidence level of 90% to continue the analysis.
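The critical value quoted above can be checked outside Stata. The following is a minimal sketch in Python (not part of the book's Stata routine), assuming a two-tailed test; the function name two_tailed_critical_z is only illustrative.

```python
from statistics import NormalDist


def two_tailed_critical_z(alpha):
    """Critical |z| of the standard normal for a two-tailed test at level alpha."""
    # Half of alpha is placed in each tail, so we look up the 1 - alpha/2 quantile.
    return NormalDist().inv_cdf(1 - alpha / 2)


z_10 = two_tailed_critical_z(0.10)  # threshold used for the 90% confidence level
z_05 = two_tailed_critical_z(0.05)  # threshold used for the 95% confidence level
print(round(z_10, 3), round(z_05, 3))
```

An estimate whose | z | lies between these two thresholds is significant at 0.10 but not at 0.05, which is the situation described for the variance of u10k and for cov(u00k, u10k).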

Thus, considering that cov(u00k, u10k) and cov(r0jk, r1jk) are statistically different from zero, based on the outputs in Fig. 23.43 we can write that:

  •  Random effects variance-covariance matrix for level school

$$\operatorname{var}\begin{bmatrix} u_{00k} \\ u_{10k} \end{bmatrix} = \begin{bmatrix} 88.737 & 3.185 \\ 3.185 & 0.255 \end{bmatrix}$$

  •  Random effects variance-covariance matrix for level student

$$\operatorname{var}\begin{bmatrix} r_{0jk} \\ r_{1jk} \end{bmatrix} = \begin{bmatrix} 350.913 & 13.251 \\ 13.251 & 3.258 \end{bmatrix}$$
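To get a feeling for the size of these covariances, they can be converted into correlations between intercept and slope random effects. The sketch below is in Python and is only an illustration: the function implied_correlation is an assumed helper, and the inputs are the variances and covariances reported for the unstructured model.

```python
import math


def implied_correlation(var_intercept, var_slope, covariance):
    """Correlation implied by a 2x2 variance-covariance matrix."""
    return covariance / math.sqrt(var_intercept * var_slope)


# Level school: var(u00k), var(u10k), cov(u00k, u10k)
corr_school = implied_correlation(88.737, 0.255, 3.185)
# Level student: var(r0jk), var(r1jk), cov(r0jk, r1jk)
corr_student = implied_correlation(350.913, 3.258, 13.251)

print(f"school: {corr_school:.3f}, student: {corr_student:.3f}")
```

Both correlations are positive, which suggests that schools and students with higher intercepts also tend to have steeper slopes throughout time.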

Researchers will also obtain these matrices if they type the following command right after the last estimation:

estat recovariance

whose outputs can be found in Fig. 23.44.

Fig. 23.44 Variance-covariance matrices with correlated random effects (u00k, u10k) and (r0jk, r1jk).

Even though the estimations of the random effects covariances are statistically different from zero in both levels of the analysis, researchers who wish to confirm that this last model is more suitable than the one that considers independent error terms just need to run a likelihood-ratio test to compare both estimations.

For this purpose, first, let’s type the following command, regarding the estimation with unstructured random effects:

estimates store finalunstructured

Next, we can type the command to carry out the abovementioned test:

lrtest finalunstructured finalindependent

The result can be seen in Fig. 23.45.

Fig. 23.45 Likelihood-ratio test to compare the estimations of the complete models with independent and correlated random effects (u00k, u10k) and (r0jk, r1jk).

This χ2 statistic for the test, with 2 degrees of freedom, can also be obtained through the following expression:

$$\chi^2_2 = \left(-2 \cdot LL_{r\,ind}\right) - \left(-2 \cdot LL_{r\,unstruc}\right) = \{-2 \cdot (-7{,}419.679) - [-2 \cdot (-7{,}376.715)]\} = 85.93,$$

which results in a Sig. $\chi^2_2$ = 0.000 < 0.05. Therefore, we can state that the structure of the random effects variance-covariance matrices can be considered unstructured in this example. That is, we can consider that error terms u00k and u10k are correlated (cov(u00k, u10k) ≠ 0) and that error terms r0jk and r1jk are correlated too (cov(r0jk, r1jk) ≠ 0).
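The arithmetic of this likelihood-ratio test can be reproduced with a short Python sketch, shown here only as a cross-check of the Stata output. It assumes 2 degrees of freedom (the two covariances added by the unstructured model), in which case the upper-tail probability of the chi-square distribution has the closed form exp(−x/2). The function name lr_test_df2 is illustrative.

```python
import math


def lr_test_df2(ll_restricted, ll_unrestricted):
    """Likelihood-ratio statistic and p-value for a test with 2 degrees of freedom."""
    stat = -2.0 * ll_restricted - (-2.0 * ll_unrestricted)
    # For 2 degrees of freedom, P(chi-square > x) = exp(-x / 2).
    p_value = math.exp(-stat / 2.0)
    return stat, p_value


# Restricted log-likelihoods of the independent and unstructured estimations
stat, p_value = lr_test_df2(-7419.679, -7376.715)
print(f"chi2(2) = {stat:.2f}, p-value = {p_value:.3g}")
```

The p-value is far below 0.05, in line with the Sig. value of 0.000 displayed by Stata.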

We have arrived at our final model, with the following specification:

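$$\text{performance}_{tjk} = 54.734 + 4.516\,\text{year}_{jk} - 14.702\,\text{gender}_{jk} + 1.179\,\text{texp}_{k} + 0.652\,\text{gender}_{jk}\,\text{year}_{jk} - 0.057\,\text{texp}_{k}\,\text{year}_{jk} + u_{00k} + u_{10k}\,\text{year}_{jk} + r_{0jk} + r_{1jk}\,\text{year}_{jk} + e_{tjk}$$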

Next, we can obtain the BLUPs (best linear unbiased predictions) of random effects u10k, u00k, r1jk, and r0jk of our final model by typing:

predict u10final u00final r1final r0final, reffects

which generates four new variables in the dataset: u10final, u00final, r1final, and r0final. They correspond to the slope and intercept random effects of level school and to the slope and intercept random effects of level student, respectively. The following command, whose outputs can be found in Fig. 23.46, describes these random effects:

desc u10final u00final r1final r0final

Fig. 23.46 Description of random effects u10k, u00k, r1jk, and r0jk.

Besides, we can also obtain the expected values of each student’s school performance in each of the periods monitored, by typing the following command:

predict yhatstudent, fitted level(student)

which defines the variable yhatstudent, which can also be obtained through the following command:

gen yhatstudent = 54.73435 + 4.515641*year - 14.70213*gender + 1.178656*texp + .6518855*genderyear - .0566496*texpyear + u00final + u10final*year + r0final + r1final*year

which corresponds to the expression:

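$$\widehat{\text{performance\_student}}_{jk} = 54.734 + 4.516\,\text{year}_{jk} - 14.702\,\text{gender}_{jk} + 1.179\,\text{texp}_{k} + 0.652\,\text{gender}_{jk}\,\text{year}_{jk} - 0.057\,\text{texp}_{k}\,\text{year}_{jk} + u_{00k} + u_{10k}\,\text{year}_{jk} + r_{0jk} + r_{1jk}\,\text{year}_{jk}$$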

If researchers type the following command, they will obtain the expected values of each student’s school performance in each of the periods monitored; however, without considering the random effects in the level student:

predict yhatschool, fitted level(school)

which defines the variable yhatschool in the dataset, which can also be obtained through the following command:

gen yhatschool = 54.73435 + 4.515641*year - 14.70213*gender + 1.178656*texp + .6518855*genderyear - .0566496*texpyear + u00final + u10final*year

which corresponds to the expression:

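$$\widehat{\text{performance\_school}}_{k} = 54.734 + 4.516\,\text{year}_{jk} - 14.702\,\text{gender}_{jk} + 1.179\,\text{texp}_{k} + 0.652\,\text{gender}_{jk}\,\text{year}_{jk} - 0.057\,\text{texp}_{k}\,\text{year}_{jk} + u_{00k} + u_{10k}\,\text{year}_{jk}$$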

Error terms etjk can be obtained by typing the command predict etjk, res (which is equivalent to performance - yhatstudent).
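The relationship between the fixed part, the two predictions, and the residual can be sketched in Python as follows. The coefficients are the ones used in the gen commands of the text; the values of year, gender, texp, the four BLUPs, and the observed grade are hypothetical and serve only to illustrate the arithmetic.

```python
# Fixed effects estimates, copied from the gen yhatstudent command in the text
COEF = {
    "cons": 54.73435, "year": 4.515641, "gender": -14.70213,
    "texp": 1.178656, "genderyear": 0.6518855, "texpyear": -0.0566496,
}


def fixed_part(year, gender, texp):
    """Prediction that uses only the fixed effects component."""
    return (COEF["cons"] + COEF["year"] * year + COEF["gender"] * gender
            + COEF["texp"] * texp + COEF["genderyear"] * gender * year
            + COEF["texpyear"] * texp * year)


def yhat_school(year, gender, texp, u00, u10):
    """Fixed part plus the school-level random intercept and slope."""
    return fixed_part(year, gender, texp) + u00 + u10 * year


def yhat_student(year, gender, texp, u00, u10, r0, r1):
    """School-level prediction plus the student-level random intercept and slope."""
    return yhat_school(year, gender, texp, u00, u10) + r0 + r1 * year


# Hypothetical student: these inputs are NOT taken from the dataset
year, gender, texp = 2, 1, 10
u00, u10, r0, r1 = 3.0, 0.2, -5.0, 0.5
observed = 70.0

school_pred = yhat_school(year, gender, texp, u00, u10)
student_pred = yhat_student(year, gender, texp, u00, u10, r0, r1)
residual = observed - student_pred  # plays the role of etjk
print(school_pred, student_pred, residual)
```

The difference between both predictions is exactly r0jk + r1jk × year, that is, the part of the fitted value that is attributable to the student level.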

Therefore, at this moment, we are able to conclude the analysis. We have seen that students’ school performance follows a linear trend throughout time, there is a significant variance of intercepts and slopes between those who study at the same school and between those who study at different schools, and students’ gender is significant to explain part of this variation. Professors’ years of teaching experience at each school (level-3 variable) itself also explains part of the discrepancies in the annual school performance between students from different schools.

The following commands generate a chart (Fig. 23.47) with the predicted values of school performance throughout time for the first 50 students in the sample (yhatstudent), in which we can see different intercepts and slopes for different students:

sort student year

graph twoway connected yhatstudent year if student <= 50, connect(L)

Fig. 23.47 Predicted school performance values throughout time for the first 50 students in the sample.

Finally, a more inquisitive researcher, aiming at questioning the superiority of multilevel models in relation to traditional regression models estimated through OLS, whenever there are datasets with nested structures, decides to construct a chart. Through this chart, it is possible to compare the predicted school performance values generated by this three-level hierarchical modeling (HLM3) to those generated by an estimation through OLS, for all the students in the sample, in each of the periods analyzed, using the same explanatory variables year, gender, texp, genderyear, and texpyear. Obviously, there are only fixed effects in the estimation through OLS.

Thus, the following sequence of commands is typed, which generates the chart seen in Fig. 23.48:

quietly reg performance year gender texp genderyear texpyear
predict yhatreg
graph twoway mspline yhatreg performance || mspline yhatstudent performance || lfit performance performance ||, legend(label(1 "OLS") label(2 "HLM3") label(3 "Observed Values"))

Fig. 23.48 Values predicted through OLS and through HLM3 versus observed school performance values.

The dotted line at 45 degrees shows the observed school performance values of each one of the students in the sample in each of the periods analyzed (performance × performance). By using the chart in Fig. 23.48, we can clearly see the superiority of our linear trend model with explanatory variables and random intercepts and slopes in levels 2 and 3 (complete HLM3 model) over the multiple linear regression model estimated through OLS with the same explanatory variables. This demonstrates the importance of considering the random effects components whenever there are nested data structures.

In a consolidated way, Table 23.5 shows the general commands in Stata for preparing a two-level hierarchical linear model with clustered data, and a three-level hierarchical linear model with repeated measures, as studied in Sections 23.5.1 and 23.5.2, respectively. This is a broad topic and new intermediate models can always be estimated by researchers, based on their research objectives and on the constructs proposed.

Table 23.5

Hierarchical Modeling, Intermediate Models (Multilevel Step-Up Strategy) and Commands in Stata
Modeling / Intermediate Model / Command in Stata

Two-Level Hierarchical Linear Model with Clustered Data

  •  Null Model (Nonconditional Model)
     xtmixed Y || var(level 2):
  •  Random Intercepts Model
     xtmixed Y X || var(level 2):
  •  Random Intercepts and Slopes Model
     xtmixed Y X || var(level 2): X
  •  Random Intercepts and Slopes Model and Correlated Error Terms
     xtmixed Y X || var(level 2): X, covariance(unstructured)

Three-Level Hierarchical Linear Model with Repeated Measures

  •  Null Model (Nonconditional Model)
     xtmixed Y || var(level 3): || var(level 2):
  •  Linear Trend Model with Random Intercepts
     xtmixed Y t || var(level 3): || var(level 2):
  •  Linear Trend Model with Random Intercepts and Slopes
     xtmixed Y t || var(level 3): t || var(level 2): t
  •  Linear Trend Model with Random Intercepts and Slopes and Level-2 Variable
     xtmixed Y t X Xt || var(level 3): t || var(level 2): t
  •  Linear Trend Model with Random Intercepts and Slopes and Level-2 and Level-3 Variables
     xtmixed Y t X W Xt Wt WXt || var(level 3): t || var(level 2): t
  •  Linear Trend Model with Random Intercepts and Slopes and Level-2 and Level-3 Variables and Correlated Error Terms
     xtmixed Y t X W Xt Wt WXt || var(level 3): t, covariance(unstructured) || var(level 2): t, covariance(unstructured)

Note: Considering a level-2 variable X, a level-3 variable W (whenever there is one), and t as the temporal variable. In addition to this, Y refers to the dependent variable. In all the cases, the term that corresponds to the estimation method was omitted. As discussed previously, while the estimation method adopted by Stata up to Version 12 is the restricted maximum likelihood (reml) by default, the default method becomes the maximum likelihood (mle) after Version 13.

After having made these considerations and having respected the multilevel step-up strategy throughout this entire section, at this moment, let’s estimate the same models in SPSS. In order to give researchers the opportunity to compare both software packages, the procedures and routines for estimating the models are presented, as well as the logic with which the outputs are generated.

23.6 Estimation of Hierarchical Linear Models in SPSS

Now, let’s present the step-by-step procedure for preparing our examples in IBM SPSS Statistics Software. The use of the images in this section has been authorized by the International Business Machines Corporation©.

At this moment, the main objective is to give researchers the opportunity to use multilevel modeling techniques in SPSS. Every time we present an output, we will mention the respective result obtained when preparing the techniques in Stata, so that researchers can compare them and, thus, decide which software package to use, based on the characteristics of each one and on how accessible they are.

23.6.1 Estimation of a Two-Level Hierarchical Linear Model With Clustered Data in SPSS

Going back to the example used in Section 23.5.1, let’s remember that our professor collected data on the school performance (grades from 0 to 100 plus a bonus for participation in class) of 2,000 students from 46 schools. He also collected data on the number of hours spent studying per week (level-1 explanatory variable), the type of school (public or private), and professors’ teaching experience (in years) at each school (level-2 explanatory variables). The complete dataset is in the file PerformanceStudentSchool.sav.

Maintaining the logic presented here, initially, let’s estimate the null model, as follows:

  • Null Model

$$\text{performance}_{ij} = \gamma_{00} + u_{0j} + r_{ij}$$

Even though it is possible to estimate multilevel models using the option Analyze → Mixed Models in SPSS, based on point-and-click procedures, in this section, we have chosen to estimate the models through syntax, to provide a better comparison for the estimations elaborated in Section 23.5.1, and to facilitate the understanding of how to include variables into fixed and random effects components. In order to do that, with the file PerformanceStudentSchool.sav open, we must click on File → New → Syntax. For the null model, we must type the following syntax in the window that will open:

MIXED performance

/METHOD = REML

/PRINT = SOLUTION TESTCOV

/FIXED = INTERCEPT

/RANDOM = INTERCEPT | SUBJECT(school) .

where the first line (MIXED)⁴ only shows the dependent variable performance, and the two lines after it (METHOD and PRINT) determine, respectively, the estimation method adopted (in this case, restricted maximum likelihood estimation, or REML) and that the estimations of the fixed effects, with their corresponding standard errors, be presented in the outputs. Finally, in the last two lines (FIXED and RANDOM), in addition to the intercept term, the variables that will be part of the fixed and random effects components, respectively, can be specified. The term SUBJECT inserted after the vertical bar | identifies the group variable that corresponds to level 2 (in our case, the variable school).

Fig. 23.49 shows the window in SPSS with the inclusion of the syntax that corresponds to the null model, highlighting the button Run Selection that will have to be clicked so that the multilevel modeling can be estimated.

Fig. 23.49 Window with the inclusion of the syntax to estimate the null model in SPSS.

Next, in Fig. 23.50, the outputs generated by SPSS are presented.

Fig. 23.50 (A) and (B) Outputs of the null model in SPSS. (C) and (D) Fixed Effects. (E) Covariance Parameters.

Initially, we can see that the output Model Dimension shows the number of levels considered in the modeling (in this case, 2), and the number of parameters estimated (in this case, 3, including the error term). The term Variance Components informs us that a variance-covariance matrix structure with independent random effects is being considered.

In Information Criteria, the value -2 Restricted Log Likelihood is presented, which corresponds to − 2 times the maximum value obtained for the logarithm of the restricted likelihood function to estimate the model parameters. We can see that the output in SPSS shows that − 2 ∙ LLr = 17,504.04, which is exactly equal to − 2 times the value presented in Stata (Fig. 23.13), since − 2 ∙(− 8752.02) = 17,504.04.

Next, in Fixed Effects, the estimation of parameter γ00 is presented (fixed effect), which corresponds to the average of students’ expected school performance (horizontal line estimated in the null model, or general intercept). We can see that the estimation of γ00 = 61.049 corresponds to the one obtained in Fig. 23.13 in the estimation of the null model in Stata.

Finally, the estimations of level-1 and level-2 error terms’ variance components (random effects) are presented (Covariance Parameters). Here, we can also verify that the outputs correspond to the ones obtained in Stata, since the estimations are τ00 = 135.779 (Intercept [subject = school]) and σ² = 347.562 (Residual). Nevertheless, note that, unlike Stata, SPSS displays the z statistics of the estimations of the error terms’ variances directly, with their respective significance levels. Thus, for the data in our example, we can see that there is variability in the school performance of students from different schools, since Sig. z τ00 < 0.05 (if the confidence level is defined at 95%).

Based on the intraclass correlation calculated below, we can see that approximately 28% of the total school performance variance is due to differences between schools.

$$\text{rho} = \frac{\tau_{00}}{\tau_{00} + \sigma^2} = \frac{135.779}{135.779 + 347.562} = 0.281$$

In order to maintain the logic presented in Section 23.5.1, at this moment, let’s estimate the random intercepts model, including the explanatory variable hours, as follows:

  • Random Intercepts Model

$$\text{performance}_{ij} = \gamma_{00} + \gamma_{10}\,\text{hours}_{ij} + u_{0j} + r_{ij}$$

The syntax to estimate this model in SPSS is:

MIXED performance WITH hours

/METHOD = REML

/PRINT = SOLUTION TESTCOV

/FIXED = INTERCEPT hours

/RANDOM = INTERCEPT | SUBJECT(school) .

where all the explanatory variables that researchers want must be inserted after the term WITH in the first line of the syntax. After we run it, we arrive at the main outputs shown in Fig. 23.51.

Fig. 23.51 (A) Main outputs of the random intercepts model. (B) Fixed Effects. (C) Covariance Parameters.

These outputs correspond to the ones presented in Fig. 23.15 (Stata) and, through them, we can see that there is statistical significance in the estimations of the variances of error terms τ00 = 19.125 and σ2 = 31.764, which result in the following intraclass correlation:

$$\text{rho} = \frac{\tau_{00}}{\tau_{00} + \sigma^2} = \frac{19.125}{19.125 + 31.764} = 0.376$$
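The two intraclass correlations of the two-level example (null model and random intercepts model) can be reproduced with the minimal Python sketch below. It only repeats the arithmetic with the variance estimates reported by SPSS; the function name intraclass_correlation is illustrative.

```python
def intraclass_correlation(tau00, sigma2):
    """Share of the total variance that is due to differences between groups."""
    return tau00 / (tau00 + sigma2)


# Null model: tau00 = 135.779, sigma2 = 347.562
rho_null = intraclass_correlation(135.779, 347.562)
# Random intercepts model with hours: tau00 = 19.125, sigma2 = 31.764
rho_random_intercepts = intraclass_correlation(19.125, 31.764)

print(round(rho_null, 3), round(rho_random_intercepts, 3))
```

Note that both variance components shrink after the inclusion of hours, but the residual variance shrinks proportionally more, which is why the intraclass correlation increases.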

Thus, there is an increase in the proportion of the variance component of the intercept in relation to the null model. This favors the decision to include the variable hours to study the school performance behavior when comparing the schools.

Therefore, now, our model starts to have the following specification:

$$\text{performance}_{ij} = 0.534 + 3.252\,\text{hours}_{ij} + u_{0j} + r_{ij}$$

where the fixed effect of the intercept, now, corresponds to the average expected school performance, between schools, of the students who, for some reason, do not study (hoursij = 0). The slope allows us to state that one more hour spent studying per week, on average, makes the expected mean school performance, between schools, increase by 3.252 points, and this parameter is statistically significant.⁵

At this moment, let’s insert slope random effects into our multilevel model that, by maintaining the intercept random effects, will start to have the following expression:

  • Random Intercepts and Slopes Model

$$\text{performance}_{ij} = \gamma_{00} + \gamma_{10}\,\text{hours}_{ij} + u_{0j} + u_{1j}\,\text{hours}_{ij} + r_{ij}$$

The new syntax is:

MIXED performance WITH hours

/METHOD = REML

/PRINT = SOLUTION TESTCOV

/FIXED = INTERCEPT hours

/RANDOM = INTERCEPT hours | SUBJECT(school) .

which generates the outputs shown in Fig. 23.52.

Fig. 23.52 (A) Main outputs of the model with random intercepts and slopes. (B) Fixed Effects. (C) Covariance Parameters.

Analogously, these outputs correspond to the ones shown in Fig. 23.18 (Stata).

We can see that the parameter and variance estimations in the random intercepts and slopes model are identical to the ones obtained for the model that only had random intercepts (Fig. 23.51). This occurs because the estimation of variance τ11 (hours [subject = school]) is statistically equal to zero, which makes the value obtained of − 2 ∙ LLr the same as the one shown in Fig. 23.51.

Hence, applying a likelihood-ratio test would offer an output that would obviously favor the use of a random intercepts model, since the significance level Sig. $\chi^2_1$ (12,744.329 − 12,744.329 = 0) = 1.000 > 0.05, as shown in Fig. 23.19.
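This result can be verified with a minimal Python sketch, given only as a cross-check of the reasoning. It assumes 1 degree of freedom (the variance τ11 is the only additional parameter), for which the upper-tail probability of the chi-square distribution can be written with the complementary error function. The function name chi2_sf_df1 is illustrative.

```python
import math


def chi2_sf_df1(x):
    """P(chi-square with 1 degree of freedom > x), for x >= 0."""
    return math.erfc(math.sqrt(x / 2.0))


# Both models have the same -2 * restricted log-likelihood
stat = 12744.329 - 12744.329
p_value = chi2_sf_df1(stat)
print(stat, p_value)
```

Since the statistic is zero, the p-value is equal to 1, and the hypothesis that the simpler random intercepts model is sufficient cannot be rejected.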

If researchers wish to generalize the structure of the random effects variance-covariance matrix, allowing u0j and u1j to be correlated, they just need to estimate the model parameters using the term COVTYPE(UN) at the end of the RANDOM line of the last syntax. It will become:

MIXED performance WITH hours

/METHOD = REML

/PRINT = SOLUTION TESTCOV

/FIXED = INTERCEPT hours

/RANDOM = INTERCEPT hours | SUBJECT(school) COVTYPE(UN) .

where the term COVTYPE(UN) considers that there is an unstructured variance-covariance matrix. This model’s outputs are not presented here. However, a likelihood-ratio test to compare the estimations of random intercepts and slopes models with independent and correlated error terms u0j and u1j will show that the structure of the variance-covariance matrix between u0j and u1j can be considered independent, similar to what is shown in Fig. 23.23.

Since the random effects variance-covariance matrix structure can be considered independent and the random intercepts model is the most suitable, let’s now estimate the complete final model that has the following specification:

  • Complete Final Model

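$$\text{performance}_{ij} = \gamma_{00} + \gamma_{10}\,\text{hours}_{ij} + \gamma_{01}\,\text{texp}_{j} + \gamma_{02}\,\text{priv}_{j} + \gamma_{11}\,\text{priv}_{j}\,\text{hours}_{ij} + u_{0j} + r_{ij}$$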

Note that we have gone straight to the last estimation obtained in Section 23.5.1. The syntax to estimate the model is:

MIXED performance WITH hours texp priv

/METHOD = REML

/PRINT = SOLUTION TESTCOV

/FIXED = INTERCEPT hours texp priv priv*hours

/RANDOM = INTERCEPT | SUBJECT(school)

/SAVE = PRED FIXPRED .

where the last line shows the term SAVE = PRED FIXPRED, which generates two new variables in the dataset, PRED_1 and FXPRED_1. The former corresponds to the predicted values of school performance per student (yhat in Stata), with random intercepts components u0j. The latter refers to the predicted values of school performance only resulting from the fixed effects component. The outputs generated are shown in Fig. 23.53.

Fig. 23.53 (A) Main outputs of the final complete model with random intercepts. (B) Fixed Effects. (C) Covariance Parameters.

The BLUPs (best linear unbiased predictions) of our final model’s random effects u0j can be obtained through the following syntax:

COMPUTE blups = PRED_1-FXPRED_1.

which generates a new variable in the dataset, called blups, equal to the variable u0final defined in the estimation of this model in Stata.

These outputs correspond to the ones presented in Fig. 23.25 (Stata). With significant estimations of the variances of the random effects and of the fixed effects parameters, at a confidence level of 95% (except for the estimation of the parameter of the combined variable hours*priv, significant at a confidence level of 90%), we obtain the following expression of the model proposed:

$$\text{performance}_{ij} = 2.710 + 3.281\,\text{hours}_{ij} + 0.866\,\text{texp}_{j} - 5.610\,\text{priv}_{j} - 0.080\,\text{priv}_{j}\,\text{hours}_{ij} + u_{0j} + r_{ij}$$

constructed with the inclusion of level-1 and level-2 explanatory variables and through a multilevel step-up strategy. Hence, we can conclude that there are differences in the school performance behavior between students from the same schools and from different schools. These differences occur due to the number of hours each student spends studying per week, the type of school (public or private), and the professors’ teaching experience (in years) at each school.

Next and in SPSS too, let’s study an example with a three-level hierarchical linear model with repeated measures.

23.6.2 Estimation of a Three-Level Hierarchical Linear Model With Repeated Measures in SPSS

In this section, we are going back to the example used in Section 23.5.2. Bear in mind that our professor managed to get data on the school performance (grades from 0 to 100) throughout four years (level-1 temporal variable) of 2000 students from 15 schools. He also collected data on each student’s gender (level-2 explanatory variable), and on professors’ years of teaching experience in each of the schools (level-3 explanatory variable). The complete dataset is presented in the file PerformanceTimeStudentSchool.sav.

It is important to mention that the time SPSS takes to process estimations of multilevel models is considerably longer than the time Stata takes, mainly for models with three or more levels.

Maintaining the logic presented in Section 23.5.2, initially, let’s estimate the null model, as follows:

  • Null Model

$$\text{performance}_{tjk} = \gamma_{000} + u_{00k} + r_{0jk} + e_{tjk}$$

For this null model, we must type the following routine in the syntax window:

MIXED performance

/METHOD = REML

/PRINT = SOLUTION TESTCOV

/FIXED = INTERCEPT

/RANDOM = INTERCEPT | SUBJECT(student)

/RANDOM = INTERCEPT | SUBJECT(school) .

where the first line (MIXED) only shows the dependent variable performance, and the two lines after it (METHOD and PRINT) determine, respectively, the estimation method adopted (in this case, restricted maximum likelihood estimation, or REML) and that the estimations of the fixed effects, with their corresponding standard errors, be presented in the outputs. In the following line (FIXED), the variables that will be part of the fixed effects component can be specified, in addition to the intercept term. Finally, in the last two lines of the routine (RANDOM), besides the intercept terms, the variables that will be part of the random effects components in the different levels of the analysis can be specified. The term SUBJECT inserted after the vertical bar | identifies the group variable that corresponds to each level (in our case, student for level 2 and school for level 3).

Fig. 23.54 shows the outputs generated by SPSS.

Fig. 23.54 (A) and (B) Outputs of the null model in SPSS. (C) and (D) Fixed Effects. (E) Covariance Parameters.

We will not analyze all the outputs of the model generated once again, because they are identical to the ones shown in Fig. 23.34, obtained in the estimation of this null model in Stata.

Nevertheless, we can see that the estimation of parameter γ000 (Fixed Effects) is equal to 68.714, which corresponds to the average of the students’ expected annual school performance (horizontal line estimated in the null model, or general intercept).

Besides, we know that the estimations of the error terms’ variances (Covariance Parameters) τu000 = 180.194 (Intercept [subject = school]), τr000 = 325.799 (Intercept [subject = student]), and σ2 = 41.649 (Residual) are statistically different from zero, at a significance level of 0.05. This fact allows us to state that there is significant variability in the school performance throughout the four years of the analysis, there is significant variability in the school performance, throughout time, between students of the same school, and there is significant variability in the school performance, throughout time, between students from different schools.

Both intraclass correlations, which correspond to levels 2 and 3 of the analysis, can be calculated as follows:

  •  Level-2 intraclass correlation

$$\text{rho}_{student|school} = \operatorname{corr}(Y_{tjk}, Y_{t'jk}) = \frac{\tau_{u000} + \tau_{r000}}{\tau_{u000} + \tau_{r000} + \sigma^2} = \frac{180.194 + 325.799}{180.194 + 325.799 + 41.649} = 0.924$$

  •  Level-3 intraclass correlation

$$\text{rho}_{school} = \operatorname{corr}(Y_{tjk}, Y_{t'j'k}) = \frac{\tau_{u000}}{\tau_{u000} + \tau_{r000} + \sigma^2} = \frac{180.194}{180.194 + 325.799 + 41.649} = 0.329$$
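Both intraclass correlations of the three-level null model can be reproduced with the minimal Python sketch below, which only repeats the arithmetic with the variance estimates reported by SPSS. The function name three_level_icc is illustrative.

```python
def three_level_icc(tau_school, tau_student, sigma2):
    """Level-2 and level-3 intraclass correlations of a three-level null model."""
    total = tau_school + tau_student + sigma2
    rho_student_school = (tau_school + tau_student) / total  # same student, same school
    rho_school = tau_school / total                          # different students, same school
    return rho_student_school, rho_school


rho_student_school, rho_school = three_level_icc(180.194, 325.799, 41.649)
print(round(rho_student_school, 3), round(rho_school, 3))
```

The level-2 correlation is always at least as large as the level-3 correlation, since its numerator also includes the variance between students of the same school.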

Thus, the correlation between the annual school performances, for the same school, is equal to 32.9% (rhoschool), and the correlation between the annual school performances, for the same student of a certain school, is equal to 92.4% (rhostudent | school).

In order to maintain the same logic presented in Section 23.5.2, at this moment, let’s estimate the linear trend model with random intercepts and slopes, including the variable year (repeated measure) as an explanatory variable into level 1, as follows:

  • Linear Trend Model with Random Intercepts and Slopes

$$\text{performance}_{tjk} = \gamma_{000} + \gamma_{100}\,\text{year}_{jk} + u_{00k} + u_{10k}\,\text{year}_{jk} + r_{0jk} + r_{1jk}\,\text{year}_{jk} + e_{tjk}$$

The syntax to estimate this model in SPSS is:

MIXED performance WITH year

/METHOD = REML

/PRINT = SOLUTION TESTCOV

/FIXED = INTERCEPT year

/RANDOM = INTERCEPT year | SUBJECT(student)

/RANDOM = INTERCEPT year | SUBJECT(school) .

where all the explanatory variables that researchers want must be inserted after the term WITH in the first line of the syntax. After nine iterations and a few processing minutes, we arrived at the main outputs shown in Fig. 23.55.

Fig. 23.55 (A) Main outputs of the linear trend model with random intercepts and slopes. (B) Fixed Effects. (C) Covariance Parameters.

These outputs correspond to the ones shown in Fig. 23.39. Through them, we can see that the parameters estimated of the fixed and random effects components are statistically different from zero, at a significance level of 0.05. This supports the statement that students’ school performance follows a linear trend throughout time, and that there is a significant variance of intercepts and slopes between those who study at the same school and between those who study at different schools.⁶ By using the level-2 intraclass correlation, calculated below, we estimate that the random effects of students and schools account for approximately 99% of the total variance of the residuals!

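$$\text{rho}_{student|school} = \operatorname{corr}(Y_{tjk}, Y_{t'jk}) = \frac{\tau_{u000} + \tau_{u100} + \tau_{r000} + \tau_{r100}}{\tau_{u000} + \tau_{u100} + \tau_{r000} + \tau_{r100} + \sigma^2} = \frac{224.343 + 0.560 + 374.285 + 3.157}{224.343 + 0.560 + 374.285 + 3.157 + 3.868} = 0.994$$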
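To see how each component contributes to this proportion, the variance estimates can be expressed as shares of their sum. The Python sketch below is only an illustration of the arithmetic in the expression above; the dictionary keys are descriptive labels chosen here, not names used by SPSS.

```python
# Variance estimates of the linear trend model with random intercepts and slopes
components = {
    "school intercept (tau_u000)": 224.343,
    "school slope (tau_u100)": 0.560,
    "student intercept (tau_r000)": 374.285,
    "student slope (tau_r100)": 3.157,
    "residual (sigma2)": 3.868,
}

total = sum(components.values())
shares = {name: value / total for name, value in components.items()}

# Share of the total variance attributed to the random effects of schools and students
random_effects_share = 1.0 - shares["residual (sigma2)"]

for name, share in shares.items():
    print(f"{name}: {share:.4f}")
print(f"random effects share: {random_effects_share:.3f}")
```

Most of this share comes from the random intercepts of students and schools, while the random slopes, although statistically significant, represent a small fraction of the sum.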

At this moment, our model starts to have the following specification:

$$\text{performance}_{tjk} = 57.858 + 4.343\,\text{year}_{jk} + u_{00k} + u_{10k}\,\text{year}_{jk} + r_{0jk} + r_{1jk}\,\text{year}_{jk} + e_{tjk}$$

Finally, let’s investigate if level-2 and level-3 variables gender and texp also explain the variation in the annual school performance between students. After some intermediate analyses, let’s move on to estimate the following complete three-level model:

  • Linear Trend Model with Random Intercepts and Slopes, Level-2 Variable gender, and Level-3 Variable texp (Complete Model)

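$$\text{performance}_{tjk} = \gamma_{000} + \gamma_{100}\,\text{year}_{jk} + \gamma_{010}\,\text{gender}_{jk} + \gamma_{001}\,\text{texp}_{k} + \gamma_{110}\,\text{gender}_{jk}\,\text{year}_{jk} + \gamma_{101}\,\text{texp}_{k}\,\text{year}_{jk} + u_{00k} + u_{10k}\,\text{year}_{jk} + r_{0jk} + r_{1jk}\,\text{year}_{jk} + e_{tjk}$$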

To estimate this model, let’s generalize the structure of the random effects variance-covariance matrices, allowing (u00k, u10k) and (r0jk, r1jk) to be correlated (unstructured variance-covariance matrices). In order to do that, we must insert the term COVTYPE(UN) at the end of the RANDOM lines, making the syntax in SPSS become:

MIXED performance WITH year gender texp

/METHOD = REML

/PRINT = SOLUTION TESTCOV

/FIXED = INTERCEPT year gender texp gender*year texp*year

/RANDOM = INTERCEPT year | SUBJECT(student) COVTYPE(UN)

/RANDOM = INTERCEPT year | SUBJECT(school) COVTYPE(UN)

/SAVE = PRED FIXPRED RESID .

where the last line now shows the term SAVE = PRED FIXPRED RESID, which generates three new variables in the dataset: PRED_1, FXPRED_1, and RESID_1. They correspond, respectively, to the predicted values of the school performance per student (yhatstudent in Stata), to the predicted values of the school performance resulting only from the fixed effects component, and to the error terms etjk.

After five iterations and a few processing minutes, we arrived at the outputs shown in Fig. 23.56.

Fig. 23.56 (A) Main outputs of the linear trend model with random intercepts and slopes and level-2 variable gender and level-3 variable texp, with correlated random effects (u00k, u10k) and (r0jk, r1jk). (B) Fixed Effects. (C) Covariance Parameters.

These outputs correspond to the ones shown in Fig. 23.43 (Stata). Through them, we can see that all the estimated parameters of the fixed effects component are statistically different from zero, at a significance level of 0.05. As for the parameters of the random effects components, the estimates of the variance of u10k and of cov(u00k, u10k) are statistically significant only at a significance level of 0.10, while all the others are significant at a significance level of 0.05. Thus, considering that cov(u00k, u10k) and cov(r0jk, r1jk) are statistically different from zero, we can write:

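  •  Random effects variance-covariance matrix for level school

$$var\begin{bmatrix} u_{00k} \\ u_{10k} \end{bmatrix} = \begin{bmatrix} 88.734 & 3.185 \\ 3.185 & 0.255 \end{bmatrix}$$

  •  Random effects variance-covariance matrix for level student

$$var\begin{bmatrix} r_{0jk} \\ r_{1jk} \end{bmatrix} = \begin{bmatrix} 350.913 & 13.251 \\ 13.251 & 3.257 \end{bmatrix}$$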
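To get a feel for how strong these covariances are, we can convert them into correlations between random intercepts and slopes, dividing each covariance by the square root of the product of the two variances. The following lines are a minimal sketch in Stata, with the values typed in by hand from the two matrices; this step is our illustration and is not part of the SPSS output.

```stata
* Correlation between random intercepts and slopes: cov / sqrt(var1 * var2)
display "school level:  " %5.3f 3.185/sqrt(88.734*0.255)
display "student level: " %5.3f 13.251/sqrt(350.913*3.257)
```

The resulting correlations are roughly 0.67 at the school level and 0.39 at the student level, which is consistent with treating the random effects as correlated rather than independent.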

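Therefore, the expression of our final model has the following specification:⁷

$$\text{performance}_{tjk} = 54.734 + 4.516\,\text{year}_{jk} - 14.702\,\text{gender}_{jk} + 1.179\,\text{texp}_{k} + 0.652\,\text{gender}_{jk}\,\text{year}_{jk} - 0.057\,\text{texp}_{k}\,\text{year}_{jk} + u_{00k} + u_{10k}\,\text{year}_{jk} + r_{0jk} + r_{1jk}\,\text{year}_{jk} + e_{tjk}$$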

constructed with the inclusion of level-1 and level-2 explanatory variables and through a multilevel step-up strategy.

Therefore, we can conclude that students’ school performance follows a linear trend throughout time. In addition, there is a significant variance of intercepts and slopes between those who study at the same school and between those who study at different schools. Students’ gender is significant to explain part of this variation. Professors’ years of teaching experience at each school also explains part of the discrepancies in the annual school performance between students from different schools.

Similar to Table 23.5, presented at the end of Section 23.5, Table 23.6 consolidates the general estimation routines, in SPSS, for multilevel models.

Table 23.6 Hierarchical Modeling, Intermediate Models (Multilevel Step-Up Strategy) and Routines in SPSS

Modeling: Two-Level Hierarchical Linear Model with Clustered Data

  • Null Model (Nonconditional Model)

MIXED Y
/FIXED = INTERCEPT
/RANDOM = INTERCEPT | SUBJECT(level2_var) .

  • Random Intercepts Model

MIXED Y WITH X
/FIXED = INTERCEPT X
/RANDOM = INTERCEPT | SUBJECT(level2_var) .

  • Random Intercepts and Slopes Model

MIXED Y WITH X
/FIXED = INTERCEPT X
/RANDOM = INTERCEPT X | SUBJECT(level2_var) .

  • Random Intercepts and Slopes Model and Correlated Error Terms

MIXED Y WITH X
/FIXED = INTERCEPT X
/RANDOM = INTERCEPT X | SUBJECT(level2_var) COVTYPE(UN) .

Modeling: Three-Level Hierarchical Linear Model with Repeated Measures

  • Null Model (Nonconditional Model)

MIXED Y
/FIXED = INTERCEPT
/RANDOM = INTERCEPT | SUBJECT(level2_var)
/RANDOM = INTERCEPT | SUBJECT(level3_var) .

  • Linear Trend Model with Random Intercepts

MIXED Y WITH t
/FIXED = INTERCEPT t
/RANDOM = INTERCEPT | SUBJECT(level2_var)
/RANDOM = INTERCEPT | SUBJECT(level3_var) .

  • Linear Trend Model with Random Intercepts and Slopes

MIXED Y WITH t
/FIXED = INTERCEPT t
/RANDOM = INTERCEPT t | SUBJECT(level2_var)
/RANDOM = INTERCEPT t | SUBJECT(level3_var) .

  • Linear Trend Model with Random Intercepts and Slopes and Level-2 Variable

MIXED Y WITH t X
/FIXED = INTERCEPT t X X*t
/RANDOM = INTERCEPT t | SUBJECT(level2_var)
/RANDOM = INTERCEPT t | SUBJECT(level3_var) .

  • Linear Trend Model with Random Intercepts and Slopes and Level-2 and Level-3 Variables

MIXED Y WITH t X W
/FIXED = INTERCEPT t X W X*t W*t W*X*t
/RANDOM = INTERCEPT t | SUBJECT(level2_var)
/RANDOM = INTERCEPT t | SUBJECT(level3_var) .

  • Linear Trend Model with Random Intercepts and Slopes and Level-2 and Level-3 Variables and Correlated Error Terms

MIXED Y WITH t X W
/FIXED = INTERCEPT t X W X*t W*t W*X*t
/RANDOM = INTERCEPT t | SUBJECT(level2_var) COVTYPE(UN)
/RANDOM = INTERCEPT t | SUBJECT(level3_var) COVTYPE(UN) .

Note: X is a level-2 variable, W is a level-3 variable (whenever there is one), t is the temporal variable, and Y is the dependent variable. All the commands assume estimation through restricted maximum likelihood (the term /METHOD = REML is omitted).

23.7 Final Remarks

Data mining is a broad theme that is only beginning to be explored in depth in the field of business. This chapter only provides a brief discussion about the concepts, processes, stages, tasks, and the types of methods and techniques it can employ.

In this context, we believe that one of the most recent and relevant modeling techniques within the data-mining environment is multilevel modeling. It allows researchers and managers to assess the relationship between a certain performance variable and one or more predictor variables, which characterize different analysis levels. Moreover, each level is formed by individuals or groups nested into other groups and so on. Since variables from a certain group are invariable between groups or individuals that correspond to lower levels nested into that group, it is natural for many studies and constructs to use such models, because many datasets have nested data structures, such as those that simultaneously have student and school, company and country, municipality and state, or real estate and neighborhood characteristics.

Datasets with nested data structures can have many characteristics. The most common are those with absolute nesting, in which there are clustered data or data with repeated measures. In this chapter, we chose to present examples in which datasets are used to estimate two-level hierarchical linear models with clustered data and three-level hierarchical linear models with repeated measures. Nonetheless, from these examples, we believe researchers will be able to estimate, for example, three-level models with clustered data or even consider a higher number of analysis levels, resulting from more complex nesting structures.

Multilevel models allow us to identify and analyze individual heterogeneities and the heterogeneities between the groups to which these individuals belong, making it possible to specify random components in each analysis level. This fact represents the main difference from the traditional regression models estimated through OLS, which cannot consider the natural nesting of data and, consequently, generate biased parameter estimators.

Although many papers use multilevel models only to estimate null models to investigate the variance decomposition of the phenomenon being studied in the different analysis levels, the possibility of including explanatory variables that correspond to the different levels in the fixed and random effects components allows us to investigate possible relationships between these variables and the dependent variable. This makes it possible to establish new research objectives and interesting constructs.

Currently, it is possible to see a growing concern of software and tools manufacturers regarding the processing capability of commands and routines to estimate more complex multilevel models. We cannot forget to mention the important and educational software HLM (Hierarchical Linear and Nonlinear Modeling), produced by Scientific Software International (SSI) and developed by Professors Stephen Raudenbush (University of Michigan), Anthony Bryk (University of Chicago), and Richard Congdon (Harvard University).

To estimate multilevel models, as well as for any other modeling technique, it is necessary for the application to be accompanied by methodological rigor and certain care when analyzing the results, mainly if these are meant for making forecasts. The use of a certain estimation method, to the detriment of another, can help researchers and managers choose the most suitable model, adding value to their research, and allowing new studies on the topic chosen to be carried out.

Discovering implicit and contextual patterns in larger and larger volumes of data becomes an essential condition for organizations to succeed in competitive environments, and multilevel modeling contributes considerably to the list of techniques available for the data-mining process.

23.8 Exercises

  1) The organizers of an international science competition for high school students from 24 countries (j = 1, ..., 24) wish to investigate participants’ performance based on their characteristics and on the characteristics of the countries they came from. Even though the coordinators of the event know that performance results from several factors, such as participants’ dedication and the characteristics of the schools where they study, at this moment they wish to verify whether there is a relationship between the scores obtained in the competition, students’ social status, expressed by their median household income, and the importance given by their countries to issues such as scientific and technological development, expressed here by the investments in research and development. The dataset, which contains data on the top five students from each country, for a total of 120 participants in the competition (i = 1, ..., 120), and forms a balanced clustered data structure, can be found in the file Science_Competition.dta. The variables found in this dataset are:
Variable | Description
country | A string variable that identifies the country.
idcountry | Country code j.
resdevel | Country’s investments in research and development, in % of the GDP (Source: World Bank).
idstudent | Student code i.
score | Science score obtained by the student in the competition (0 to 100).
income | Student’s median household income per month (US$).

By using this dataset, we would like you to:

  a) Elaborate a table that proves the existence of a balanced clustered data structure of students in countries.
  b) Construct charts that allow us to visualize the average score obtained in the science competition by the participants from each country.
  c) Given the existence of two analysis levels, with students (level 1) nested into countries (level 2), estimate the following null model:

$$\text{score}_{ij} = b_{0j} + r_{ij}$$

$$b_{0j} = \gamma_{00} + u_{0j}$$

which results in:

$$\text{score}_{ij} = \gamma_{00} + u_{0j} + r_{ij}$$

  d) Through the estimation of the null model, is it possible to verify if there is variability in the scores obtained between students from different countries?
  e) From the result of the likelihood-ratio test generated, is it possible to reject the null hypothesis that the random intercepts are equal to zero? That is, is it possible to rule out the estimation of a traditional linear regression model for these clustered data?
  f) Also based on the estimation of the null model, calculate the intraclass correlation and discuss the result.
  g) Construct a chart that has a linear adjustment by OLS, for each country, of each student’s science score behavior based on their median household income.
  h) Estimate the following random intercepts model:

$$\text{score}_{ij} = b_{0j} + b_{1j}\,\text{income}_{ij} + r_{ij}$$

$$b_{0j} = \gamma_{00} + u_{0j}$$

$$b_{1j} = \gamma_{10},$$

which results in:

$$\text{score}_{ij} = \gamma_{00} + \gamma_{10}\,\text{income}_{ij} + u_{0j} + r_{ij}$$

  i) At a significance level of 0.05, discuss the statistical significance of the estimations of fixed and random effects parameters.
  j) Construct a bar chart that allows us to visualize random intercept terms u0j per country.
  k) Estimate the following random intercepts and slopes model:

$$\text{score}_{ij} = b_{0j} + b_{1j}\,\text{income}_{ij} + r_{ij}$$

$$b_{0j} = \gamma_{00} + u_{0j}$$

$$b_{1j} = \gamma_{10} + u_{1j}$$

which results in:

$$\text{score}_{ij} = \gamma_{00} + \gamma_{10}\,\text{income}_{ij} + u_{0j} + u_{1j}\,\text{income}_{ij} + r_{ij}$$

  l) Based on the estimations of the random intercepts model and random intercepts and slopes model, run a likelihood-ratio test and discuss the result.
  m) Estimate the following multilevel model:

$$\text{score}_{ij} = b_{0j} + b_{1j}\,\text{income}_{ij} + r_{ij}$$

$$b_{0j} = \gamma_{00} + u_{0j}$$

$$b_{1j} = \gamma_{10} + \gamma_{11}\,\text{resdevel}_{j}$$

which results in:

$$\text{score}_{ij} = \gamma_{00} + \gamma_{10}\,\text{income}_{ij} + \gamma_{11}\,\text{resdevel}_{j}\,\text{income}_{ij} + u_{0j} + r_{ij}$$

  n) Present the expression of the last model estimated, with random intercepts and level-1 and level-2 variables.
  o) Construct a chart in which it is possible to compare the predicted values of the score obtained in the science competition, generated through this two-level hierarchical modeling (HLM2), to the real values obtained (observed values) by the students of the sample.

  2) A firm that rents commercial offices has a portfolio with 277 properties in a certain municipality. Its board of directors would like to find out if there are differences in the rental prices per square meter between properties and in the average rental prices between different districts, throughout time. In order to do that, the marketing team structured a dataset, which can be found in the file Commercial_Properties.dta. It contains the characteristics of these 277 offices that have already been rented (j = 1, ..., 277), whose rental prices were monitored for the last six years (t = 1, ..., 6), and of the 15 municipal districts (k = 1, ..., 15), where these properties are located. The variables found in this dataset are:
Variable | Description
district | District code k.
property | Property code j.
lnp | Natural logarithm of the rental price per square meter (adjusted by the inflation, base year 1).
year | Temporal variable (repeated measure) that corresponds to the period of monitoring (year 1 to 6).
food | Is there a restaurant or food court in the building where the property is located? (No = 0; Yes = 1).
space4 | Are there four or more parking spaces? (No = 0; Yes = 1).
valet | Is there valet parking in the building where the property is located? (No = 0; Yes = 1).
subway | Is there a subway station in the district where the property is located? (No = 0; Yes = 1).
violence | Average mortality rate due to external causes in the district where the property is located (per 100,000 inhabitants).

This dataset, in which periods (level 1) are nested into properties (level 2), and these into districts (level 3), is structured according to the logic presented in the following figure:

[Figure: periods (level 1) nested into properties (level 2), and these nested into districts (level 3)]

We would like you to:

  a) Elaborate a table that proves the existence of an unbalanced clustered data structure of properties in districts.
  b) Elaborate a table that proves the existence of an unbalanced panel data in relation to the property periods of monitoring.
  c) Construct a chart that allows us to visualize the temporal evolution of the natural logarithm of the rental price per square meter of the properties under analysis.
  d) Construct a chart that allows us to check if there is an approximately linear behavior of the mean of the natural logarithm of the rental price per square meter of the properties during the periods.
  e) Construct a chart that has, per municipal district, the temporal evolutions of the means of the natural logarithms of the rental prices per square meter of the properties (linear adjustments through OLS).
  f) Given the existence of three analysis levels, with repeated measures (level 1) nested into the properties (level 2), and these nested into the municipal districts (level 3), estimate the following null model:

$$\ln(p)_{tjk} = \pi_{0jk} + e_{tjk}$$

$$\pi_{0jk} = b_{00k} + r_{0jk}$$

$$b_{00k} = \gamma_{000} + u_{00k}$$

which results in:

$$\ln(p)_{tjk} = \gamma_{000} + u_{00k} + r_{0jk} + e_{tjk}$$

  g) Based on the estimation of the null model, calculate the level-2 and level-3 intraclass correlations and discuss the results.
  h) Still through the estimation of the null model, is it possible to state that there is variability in the rental price of the commercial properties throughout the period analyzed, and that there is variability in the rental price, throughout time, between properties in the same district, and between properties located in different districts?
  i) From the result of the likelihood-ratio test generated, is it possible to reject the null hypothesis that the random intercepts are equal to zero? That is, is it possible to rule out the estimation of a traditional linear regression model for these data?
  j) Estimate the following linear trend model with random intercepts:

$$\ln(p)_{tjk} = \pi_{0jk} + \pi_{1jk}\,\text{year}_{jk} + e_{tjk}$$

$$\pi_{0jk} = b_{00k} + r_{0jk}$$

$$\pi_{1jk} = b_{10k}$$

$$b_{00k} = \gamma_{000} + u_{00k}$$

$$b_{10k} = \gamma_{100},$$

which results in the following expression:

$$\ln(p)_{tjk} = \gamma_{000} + \gamma_{100}\,\text{year}_{jk} + u_{00k} + r_{0jk} + e_{tjk}$$

  k) At a significance level of 0.05, discuss the statistical significance of the estimations of fixed and random effects parameters.
  l) Construct two bar charts that allow us to visualize the random intercepts per district and per property.
  m) Estimate the following linear trend model with random intercepts and slopes:

$$\ln(p)_{tjk} = \pi_{0jk} + \pi_{1jk}\,\text{year}_{jk} + e_{tjk}$$

$$\pi_{0jk} = b_{00k} + r_{0jk}$$

$$\pi_{1jk} = b_{10k} + r_{1jk}$$

$$b_{00k} = \gamma_{000} + u_{00k}$$

$$b_{10k} = \gamma_{100} + u_{10k}$$

which results in:

$$\ln(p)_{tjk} = \gamma_{000} + \gamma_{100}\,\text{year}_{jk} + u_{00k} + u_{10k}\,\text{year}_{jk} + r_{0jk} + r_{1jk}\,\text{year}_{jk} + e_{tjk}$$

  n) Calculate the new level-2 and level-3 intraclass correlations and discuss the results.
  o) Run a likelihood-ratio test to compare the estimations of the linear trend models with random intercepts and with random intercepts and slopes.
  p) Estimate the following linear trend model with random intercepts and slopes and level-2 variables:

$$\ln(p)_{tjk} = \pi_{0jk} + \pi_{1jk}\,\text{year}_{jk} + e_{tjk}$$

$$\pi_{0jk} = b_{00k} + b_{01k}\,\text{food}_{jk} + b_{02k}\,\text{space4}_{jk} + r_{0jk}$$

$$\pi_{1jk} = b_{10k} + b_{11k}\,\text{valet}_{jk} + r_{1jk}$$

$$b_{00k} = \gamma_{000} + u_{00k}$$

$$b_{01k} = \gamma_{010}$$

$$b_{02k} = \gamma_{020}$$

$$b_{10k} = \gamma_{100} + u_{10k}$$

$$b_{11k} = \gamma_{110},$$

which results in the following expression:

$$\ln(p)_{tjk} = \gamma_{000} + \gamma_{100}\,\text{year}_{jk} + \gamma_{010}\,\text{food}_{jk} + \gamma_{020}\,\text{space4}_{jk} + \gamma_{110}\,\text{valet}_{jk}\,\text{year}_{jk} + u_{00k} + u_{10k}\,\text{year}_{jk} + r_{0jk} + r_{1jk}\,\text{year}_{jk} + e_{tjk}$$

  q) Present the expression of the last model estimated, with repeated measures, random intercepts and slopes, and level-2 variables.
  r) Through this model, is it possible to state that the natural logarithm of the rental price per square meter of the properties follows a linear trend throughout time, and that there is a significant variance of intercepts and slopes between those located in the same district and between those located in different districts? If yes, does the existence of a restaurant or food court, the existence of four or more parking spaces, and the existence of valet parking in the building where the property is located explain part of this variability?
  s) Estimate the following linear trend model with random intercepts and slopes and level-2 and level-3 variables:

$$\ln(p)_{tjk} = \pi_{0jk} + \pi_{1jk}\,\text{year}_{jk} + e_{tjk}$$

$$\pi_{0jk} = b_{00k} + b_{01k}\,\text{food}_{jk} + b_{02k}\,\text{space4}_{jk} + r_{0jk}$$

$$\pi_{1jk} = b_{10k} + b_{11k}\,\text{valet}_{jk} + r_{1jk}$$

$$b_{00k} = \gamma_{000} + \gamma_{001}\,\text{subway}_{k} + u_{00k}$$

$$b_{01k} = \gamma_{010}$$

$$b_{02k} = \gamma_{020}$$

$$b_{10k} = \gamma_{100} + \gamma_{101}\,\text{subway}_{k} + \gamma_{102}\,\text{violence}_{k} + u_{10k}$$

$$b_{11k} = \gamma_{110},$$

which results in the following expression:

$$\ln(p)_{tjk} = \gamma_{000} + \gamma_{100}\,\text{year}_{jk} + \gamma_{010}\,\text{food}_{jk} + \gamma_{020}\,\text{space4}_{jk} + \gamma_{001}\,\text{subway}_{k} + \gamma_{110}\,\text{valet}_{jk}\,\text{year}_{jk} + \gamma_{101}\,\text{subway}_{k}\,\text{year}_{jk} + \gamma_{102}\,\text{violence}_{k}\,\text{year}_{jk} + u_{00k} + u_{10k}\,\text{year}_{jk} + r_{0jk} + r_{1jk}\,\text{year}_{jk} + e_{tjk}$$

  t) Present the random effects variance-covariance matrices for the levels district and property.
  u) Estimate the same linear trend model with random intercepts and slopes, and level-2 and level-3 variables; however, now considering correlated random effects (u00k, u10k) and (r0jk, r1jk).
  v) Present the random effects variance-covariance matrices for the levels district and property.
  w) Run a likelihood-ratio test to compare the estimations of the models with independent and correlated random effects (u00k, u10k) and (r0jk, r1jk). What can we conclude based on the result of this test?
  x) What is the final expression of the multilevel model estimated?
  y) Is it possible to state that the existence of subways and the violence rate in the district explain part of the variability of the evolution of the natural logarithm of the rental price per square meter between the properties located in different districts?
  z) Construct a chart in which it is possible to compare the predicted values of the natural logarithm of the rental price per square meter generated through this three-level hierarchical modeling (HLM3) to those generated through an estimate by OLS, which uses the same explanatory variables of the model in Item (x) inserted into the fixed effects component (year, food, space4, subway, valetyear, subwayyear, and violence*year), and to the real values observed of the natural logarithm of the rental price per square meter of the properties.

Appendix

A.1 Hierarchical Nonlinear Models

As we have already discussed, the generalized linear latent and mixed models (GLLAMM), similar to the generalized linear models (GLM), encompass the hierarchical linear models (HLM) we studied throughout this chapter, and the hierarchical nonlinear models (HNM). The latter refer to the situations in which there is a nested data structure and the dependent variable is categorical or has count data, which is why we have chosen to present examples of hierarchical logistic, Poisson, and negative binomial models in this Appendix. Fig. 23.57 shows the logic of the generalized linear latent and mixed models, highlighting the models that will be studied from now on.

(A) Hierarchical Logistic Models

Fig. 23.57 Generalized linear latent and mixed models, highlighting the hierarchical nonlinear models.

Analogous to what we studied in Chapter 14, mixed effects logistic regression models can be used whenever the dependent variable is qualitative and dichotomic, the data are found in a certain nested structure (in levels), and there may be clustered data or data with repeated measures. In these situations, researchers can estimate a model aiming at capturing the relationship between the behavior of explanatory variables and the occurrence of the phenomenon being studied, represented by a dichotomic variable (dummy), as well as studying the variance decomposition of the random effects components due to the presence of a multilevel structure.

In this section, we will present a two-level hierarchical logistic model with clustered data. In general and from Expressions (14.10) and (23.23), we can define this model with two analysis levels. The first level offers explanatory variables X1, ..., XQ, which refer to each individual i (i = 1, ..., n), and the second level, explanatory variables W1, ..., WS that refer to each group j (j = 1, ..., J), invariable for the observations that belong to the same group, as follows:

$$\text{Level 1: } p_{ij} = \frac{1}{1 + e^{-(b_{0j} + b_{1j} X_{1ij} + b_{2j} X_{2ij} + \dots + b_{Qj} X_{Qij})}} \tag{23.45}$$

where pij represents the probability of occurrence of the event we are interested in for each observation i that belongs to a certain group j, and bqj (q = 0, 1, ..., Q) refer to the level-1 coefficients.

$$\text{Level 2: } b_{qj} = \gamma_{q0} + \sum_{s=1}^{S_q} \gamma_{qs} W_{sj} + u_{qj} \tag{23.46}$$

where γqs (s = 0, 1, ..., Sq) refer to the level-2 coefficients, and uqj are the level-2 random effects, normally distributed, with mean equal to zero and variance τqq. Furthermore, possible independent error terms of uqj have a mean equal to zero and variance π²/3.

At this moment, let’s present an example. A global study was carried out to investigate if there are differences in how couples who reside in different countries travel abroad for tourism. In order to do that, data on 1,622 couples located in 50 countries were collected, including the average age of each couple and the number of children they have. Part of the dataset is presented in Table 23.7. The complete dataset can be found in the file Tourism.dta.

Table 23.7 Example: Traveling Abroad of Couples (Level 1) Residents in Different Countries (Level 2)

Observation (Couple i - Level 1) | Country j Where the Couple Lives (Level 2) | Traveled Abroad for Tourism in the Last Year (Yij) | Couple’s Average Age (X1ij) | Number of Children (X2ij)
1 | France | Yes | 68 | 2
2 | France | Yes | 37 | 0
117 | France | Yes | 54 | 3
1,604 | Egypt | No | 55 | 2
1,605 | Egypt | No | 51 | 2
1,622 | Egypt | Yes | 39 | 0

After opening this file, we can type the command desc, which makes it possible to analyze the dataset characteristics, such as, the number of observations, the number of variables, and the description of each one of them. Fig. 23.58 shows this output in Stata.

Fig. 23.58 Description of the Tourism.dta Dataset.

Since the main goal of this Appendix is not to discuss the concepts presented throughout this chapter once again, let’s carry out the following estimation:

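$$p(\text{tourism})_{ij} = \frac{1}{1 + e^{-(b_{0j} + b_{1j}\,\text{age}_{ij} + b_{2j}\,\text{children}_{ij})}}$$

$$b_{0j} = \gamma_{00} + u_{0j}$$

$$b_{1j} = \gamma_{10}$$

$$b_{2j} = \gamma_{20},$$

which results in the random intercepts model:

$$p(\text{tourism})_{ij} = \frac{1}{1 + e^{-(\gamma_{00} + \gamma_{10}\,\text{age}_{ij} + \gamma_{20}\,\text{children}_{ij} + u_{0j})}}$$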

where the variable tourism is dichotomic (dummy), in which values equal to 1 correspond to the couples that traveled abroad for tourism in the last year, and values equal to 0 are the opposite.

To estimate this model in Stata, we must type the following command:⁸

melogit tourism age children || country: , nolog

whose outputs can be found in Fig. 23.59.

Fig. 23.59 Outputs of the hierarchical logistic model with random intercepts in Stata.

Based on this figure, initially, we can see that we have 1622 observations (couples) nested into 50 groups (countries), which characterizes a two-level clustered data structure.

A more inquisitive researcher may verify that the parameter estimations of the fixed and random effects components are identical to the ones that would be obtained through the following command:

meglm tourism age children || country: , family(bernoulli) link(logit) nolog

where the term meglm means multilevel mixed effects generalized linear model. With this command, it is necessary to define the family of distributions of the dependent variable, which in this case is Bernoulli, and the canonical link function, which in this situation is the logit.⁹

Moreover, the odds ratios of the fixed effects parameters can also be obtained directly, by typing the term or (odds ratio) at the end of the commands presented.
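As a sketch of what this option does, the lines below re-estimate the same random intercepts model with the term or and then reproduce two odds ratios by hand as the exponential of the corresponding coefficients. The coefficient values are the ones reported for this model in the prediction expression further on; typing them by hand is our shortcut for illustration.

```stata
* Same random intercepts model, reporting odds ratios instead of coefficients
melogit tourism age children || country: , nolog or

* By hand: each odds ratio is exp(coefficient)
display "odds ratio for age:      " exp(0.0150543)
display "odds ratio for children: " exp(-0.4239421)
```

The two values are approximately 1.015 and 0.654. That is, ceteris paribus, each additional year of average age multiplies the odds of traveling abroad by about 1.015, while each additional child multiplies them by about 0.654.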

Given that the independent error terms of uqj have variance equal to π²/3, we can define the following intraclass correlation:

$$\rho = \frac{\tau_{00}}{\tau_{00} + \frac{\pi^2}{3}} = \frac{0.255}{0.255 + \frac{\pi^2}{3}} = 0.072,$$

which suggests that approximately 7% of the total variance of the error terms are due to alterations in the dependent variable’s behavior between countries. After Stata 13, it is possible to obtain this intraclass correlation directly, by typing the command estat icc right after the estimation of the corresponding model.
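Both routes to this intraclass correlation can be sketched in a few lines. The first line repeats the calculation by hand, using Stata’s built-in constant _pi. The second assumes that the melogit estimation above is still the active estimation in memory and that Stata 13 or later is being used.

```stata
* Intraclass correlation by hand: tau00 / (tau00 + pi^2/3)
display "rho = " %5.3f 0.255/(0.255 + _pi^2/3)

* The same quantity obtained directly after melogit
estat icc
```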

Even though Stata does not directly show the result of the z tests with their respective significance levels for the random effects parameters, the fact that the estimation of variance component τ00, which corresponds to random intercepts u0j, is considerably higher than its standard error suggests that there is significant variation in the behavior of couples who reside in different countries when it comes to traveling abroad for tourism. Statistically, we can see that z = 0.255 / 0.088 = 2.90 > 1.96, where 1.96 is the critical value of the standard normal distribution for a significance level of 0.05.
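This z statistic can be reproduced with a couple of display commands. The lines below are only a sketch: the scalar name z is ours, the estimate and standard error are typed in by hand from the outputs, and the two-sided p-value from the normal() function should be read as an approximate check, since a variance cannot be negative.

```stata
* Approximate z statistic for the variance of the random intercepts
scalar z = 0.255/0.088
display "z = " %4.2f z
display "two-sided p-value = " %6.4f 2*(1 - normal(z))
```

The p-value is approximately 0.004, below the significance level of 0.05, which agrees with the comparison to the critical value of 1.96.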

Even if country variables that may possibly explain such behavior have not been considered, such as, cultural, economic, or social characteristics, we are able to verify that, while an increment in age increases the expected probability that couples will start traveling abroad for tourism, ceteris paribus, traveling decreases with the increment in the number of children, also ceteris paribus. The model estimated has the following expression:

$$p(\text{tourism})_{ij} = \frac{1}{1 + e^{-(0.439 + 0.015\,\text{age}_{ij} - 0.424\,\text{children}_{ij} + u_{0j})}}$$

At the bottom of Fig. 23.59, we can see, from the result of the likelihood-ratio test, that the estimation of this multilevel model is more suitable than the estimation of a traditional binary logistic regression model for the data in our example.

Therefore, we can obtain the expected probability values of the occurrence of the event being studied (traveling abroad for tourism) for each of the couples in the sample. In order to do that, we must type the following command, which generates a new variable (phat) in the dataset:

predict phat

Besides, we can also obtain the error terms u0j, invariable for couples from the same country. In order to do that, we must type the following command:

predict u0, remeans

which makes the new variable, u0, also be generated in the dataset.

The following command, which generates the outputs seen in Fig. 23.60, shows the values of phat and the error terms u0 only for the couples who reside in Brazil:

list country tourism phat u0 if country == "Brazil"

Fig. 23.60 Expected probabilities of traveling abroad for tourism and error terms u0j for couples who reside in Brazil (j = Brazil).

Only for educational purposes, researchers may verify that variable phat can also be generated through the following expression:

gen phat = (1) / (1 + exp(-(0.4393717 + 0.0150543*age - 0.4239421*children + u0)))
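To see how this expression works for a single case, consider a hypothetical couple, not taken from the dataset, with an average age of 40, two children, and a country random effect u0 equal to zero. The lines below are a sketch that computes the expected probability by hand and then with Stata’s invlogit() function, which evaluates the same logistic transformation.

```stata
* Hypothetical couple: age = 40, children = 2, u0 = 0 (illustrative values only)
display 1/(1 + exp(-(0.4393717 + 0.0150543*40 - 0.4239421*2 + 0)))

* invlogit(x) = exp(x)/(1 + exp(x)), so it returns the same probability
display invlogit(0.4393717 + 0.0150543*40 - 0.4239421*2 + 0)
```

Both lines return a probability of approximately 0.55. For a couple from a country with a positive u0, the probability is higher, and for a negative u0, it is lower.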

Finally, we can construct a chart that shows, based on the variable children, the adjustments of curves S (sigmoid functions) of the expected probabilities that couples who reside in five specific countries, chosen based on their different locations around the globe, travel abroad for tourism. This chart, which can be seen in Fig. 23.61, is obtained by typing the following command:

graph twoway scatter phat children || mspline phat children if country =="France" || mspline phat children if country =="United States" || mspline phat children if country =="Japan" || mspline phat children if country =="South Africa" || mspline phat children if country =="Venezuela" ||, legend(label(2 "France") label(3 "United States") label(4 "Japan") label(5 "South Africa") label(6 "Venezuela"))

Fig. 23.61 Adjustments of the expected probabilities that couples who reside in five countries travel abroad for tourism, based on the number of children.

Through this chart, we are able to see the different behavior between couples from different countries in relation to traveling abroad for tourism clearly.

(B) Hierarchical Models for Count Data

Analogous to what we studied in Chapter 15, mixed effects regression models for count data can be used when the dependent variable is quantitative, however, with discrete and non-negative values, and when the data are in a certain nested structure (in levels), and there may be clustered data or data with repeated measures.

In this section, we will present a hierarchical model for count data with three levels and clustered data. In general and from Expressions (15.4), (23.30), and (23.31), we can define this three-level model. The first level has level-1 explanatory variables Z1, ..., ZP, which refer to units i (i = 1, ..., n). The second level has level-2 explanatory variables X1, ..., XQ, which refer to units j (j = 1, ..., J) and are invariable for the units that belong to the same group j. The third level has level-3 explanatory variables W1, ..., WS, which refer to units k (k = 1, ..., K) and are invariable for the units that belong to the same group k. This model is as follows:

$$\text{Level 1:}\quad \ln(\lambda_{ijk}) = \pi_{0jk} + \pi_{1jk} Z_{1jk} + \pi_{2jk} Z_{2jk} + \dots + \pi_{Pjk} Z_{Pjk} \tag{23.47}$$

where λ is the expected number of occurrences or the estimated average incidence rate of the phenomenon being studied for a certain exposure. πpjk (p = 0, 1, ..., P) refer to the level-1 coefficients, and Zpjk is the p-th level-1 explanatory variable for observation i in level-2 unit j and in level-3 unit k.

$$\text{Level 2:}\quad \pi_{pjk} = b_{p0k} + \sum_{q=1}^{Q_p} b_{pqk} X_{qjk} + r_{pjk} \tag{23.48}$$

where bpqk (q = 0, 1, ..., Qp) refer to the level-2 coefficients. Xqjk is the q-th level-2 explanatory variable for unit j in the level-3 unit k. rpjk are the level-2 random effects, assuming, for each unit j, that the vector (r0jk, r1jk, ..., rPjk)´ follows a multivariate normal distribution with each element having mean zero and variance τrπpp.

$$\text{Level 3:}\quad b_{pqk} = \gamma_{pq0} + \sum_{s=1}^{S_{pq}} \gamma_{pqs} W_{sk} + u_{pqk} \tag{23.49}$$

where γpqs (s = 0, 1, ..., Spq) refer to the level-3 coefficients, Wsk is the s-th level-3 explanatory variable for unit k, and upqk are the level-3 random effects, assuming that for each unit k, the vector formed by terms upqk follows a multivariate normal distribution with each element having mean zero and variance τuπpp.
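As a brief reminder of why the level-1 equation is written for the logarithm of λ, the Poisson probability function can be sketched as below. The symbol m (an observed count) is introduced here only for illustration and is not part of this section's notation; the expression is conditional on the random effects.

$$p(Y_{ijk} = m \mid \lambda_{ijk}) = \frac{e^{-\lambda_{ijk}} \, \lambda_{ijk}^{m}}{m!}, \qquad m = 0, 1, 2, \dots$$

$$E(Y_{ijk} \mid \lambda_{ijk}) = Var(Y_{ijk} \mid \lambda_{ijk}) = \lambda_{ijk}$$

Because the model is linear in ln(λ), the implied value of λ is always positive, whatever the values of the coefficients and random effects.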

Imagine that a national study has been carried out to investigate the relationship between the number of traffic accidents and the average amount of alcohol ingested per inhabitant/day (in grams). The study refers to the last year and covers municipal districts located throughout Brazil. It also seeks to find out whether this relationship differs between districts located in different municipalities and in different states of the federation. To that end, data from 1,062 municipal districts located in 234 municipalities in all 27 units of the federation (26 states and the Federal District) were analyzed. Part of the dataset is presented in Table 23.8; the complete dataset can be found in the file Traffic_Accidents.dta.

Table 23.8

Example: Traffic Accidents in Municipal Districts (Level 1) From Different Municipalities (Level 2) and Different States (Level 3)

| State k (Level 3) | Municipality j (Level 2) | Municipal district i (Level 1) | Number of Traffic Accidents in the Last Year (Yijk) | Average Amount of Alcohol Ingested per Inhabitant/Day, in Grams (Zjk) |
|---|---|---|---|---|
| AC | 1 | 1 | 9 | 12.57 |
| AC | 2 | 2 | 10 | 13.36 |
| ... | | | | |
| AC | 3 | 11 | 2 | 12.33 |
| ... | | | | |
| TO | 231 | 1,052 | 2 | 11.94 |
| TO | 231 | 1,053 | 3 | 10.54 |
| ... | | | | |
| TO | 234 | 1,062 | 5 | 11.74 |

Fig. 23.62 shows the output generated in Stata when we typed the command desc.

Fig. 23.62 Description of the Traffic_Accidents.dta Dataset.

Following the logic presented in Chapter 15, let's first construct a histogram of the variable accidents, which will be the dependent variable of the model to be proposed. In order to do that, we must type the following command, which generates the histogram in Fig. 23.63.

hist accidents, discrete freq

Fig. 23.63 Histogram of dependent variable accidents.

As studied in Chapter 15, before estimating models that involve count data, researchers should assess whether the mean and variance of the dependent variable are equal, or at least close to one another. This gives a first idea of whether the Poisson model is suitable or whether a negative binomial model will be necessary. This preliminary diagnostic is obtained by typing the following command, whose results can be found in Fig. 23.64:

tabstat accidents, stats(mean var)

Fig. 23.64 Mean and variance of the dependent variable accidents.
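The same diagnostic can also be sketched with the stored results of summarize, which gives the variance-to-mean ratio directly. This is only an illustrative alternative to tabstat; it assumes that Traffic_Accidents.dta is already in memory, and the label text is arbitrary.

```stata
* Mean, variance, and variance-to-mean ratio of the dependent variable
quietly summarize accidents
display "mean = " r(mean) "  variance = " r(Var) "  ratio = " r(Var)/r(mean)
```

A ratio close to 1 is compatible with the Poisson assumption, whereas a ratio well above 1 is a first sign of overdispersion.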

Even though the variance of the variable accidents is much higher than its mean, which indicates that there is overdispersion in the data, we will initially estimate a Poisson model for educational purposes. When modeling the number of traffic accidents, one possibility is to include dummy variables that represent municipalities and states in the fixed effects component. Instead, we will treat them as random effects and estimate a multilevel Poisson regression model with three levels and random intercepts. Whether the overdispersion in the data makes the multilevel negative binomial regression model more suitable than the Poisson model will be assessed later, through a likelihood-ratio test.

Therefore, let’s carry out the following estimation:

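$$\ln(\text{accidents}_{ijk}) = \pi_{0jk} + \pi_{1jk} \cdot \text{alcohol}_{jk}$$

$$\pi_{0jk} = b_{00k} + r_{0jk}$$

$$\pi_{1jk} = b_{10k}$$

$$b_{00k} = \gamma_{000} + u_{00k}$$

$$b_{10k} = \gamma_{100},$$

which results in the random intercepts model:

$$\ln(\text{accidents}_{ijk}) = \gamma_{000} + \gamma_{100} \cdot \text{alcohol}_{jk} + u_{00k} + r_{0jk}$$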

where the variable accidents represents the phenomenon being studied. It is quantitative and only has non-negative and discrete values (count data), indicating the incidence of traffic accidents in the last year in the municipal district i located in the municipality j of state k.
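To make the role of each term more concrete, the random intercepts model can be rewritten on the original scale of the counts by exponentiating both sides. The following is only a restatement of the expression above, read as the expected number of accidents:

$$\text{accidents}_{ijk} = e^{\gamma_{000}} \cdot e^{\gamma_{100} \cdot \text{alcohol}_{jk}} \cdot e^{u_{00k}} \cdot e^{r_{0jk}}$$

In this form, the state and municipality random effects act as multiplicative factors: a state with u00k > 0 has its expected count scaled up by e^{u00k} for every district it contains, and a municipality with r0jk < 0 has its expected count scaled down by e^{r0jk}.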

To estimate the proposed model in Stata, we must type the following command (see footnote 10):

mepoisson accidents alcohol || state: || municipality: , nolog

in which the insertion logic of the different levels follows the same nesting criterion discussed throughout this chapter, that is, from the highest to the lowest level, and these levels are separated by the terms ||. The outputs generated are shown in Fig. 23.65.

Fig. 23.65 Outputs of the multilevel Poisson model with random intercepts in Stata.

Based on this figure, initially, we can see the existence of a three-level unbalanced clustered data structure. Besides, the result of the likelihood-ratio test shows that there is significant variability between the districts located in different municipalities and states, which favors the use of the multilevel Poisson model in relation to a traditional Poisson regression model without random effects.

Before moving on, we can type the command estimates store mepoisson, which stores the results of this estimation for later comparison with the ones that will be obtained through the estimation of the negative binomial model. Moreover, we can also type predict lambda, which generates a new variable in the dataset (lambda) that corresponds to the estimated incidence of traffic accidents in the last year in each of the 1,062 municipal districts. Finally, researchers may also add the term irr (incidence rate ratio) at the end of the command presented, as studied in Chapter 15, to obtain the incidence rate ratios of traffic accidents per year associated with a one-unit change in each fixed effects explanatory variable.

An even more inquisitive researcher may verify that the parameter estimations of the fixed and random effects components are identical to the ones that would be obtained through the following command:

meglm accidents alcohol || state: || municipality: , family(poisson) link(log) nolog

which specifies, for the multilevel mixed-effects generalized linear model (command meglm), that the dependent variable follows a Poisson distribution and that the canonical link function is the logarithm.

Even after the random effects have been taken into account, the number of traffic accidents may still present overdispersion. Thus, we must re-examine the data by estimating a negative binomial model, so that its results may be compared to the ones obtained by the estimation of the Poisson model. In order to do that, we must type the following command (see footnote 11):

menbreg accidents alcohol || state: || municipality: , nolog

The results obtained are shown in Fig. 23.66.

Fig. 23.66 Outputs of the multilevel negative binomial model with random intercepts in Stata.

At the bottom of this figure and from the result of the likelihood-ratio test, we can see that the estimation of this multilevel model is more suitable than the estimation of a traditional negative binomial regression model without random effects for the data in our example. In addition, all the fixed and random effects parameters are statistically different from zero, at a significance level of 0.05.

The estimation of the variances of u00k and r0jk resulted in smaller values than the respective values obtained when estimating the multilevel Poisson model (from 0.386 to 0.377 for u00k and from 0.083 to 0.061 for r0jk), a fact that is due to the addition of an overdispersion parameter that controls the variability of the data.

In Fig. 23.66, we can see that the estimation of lnalpha is presented. As studied in Chapter 15, remember that alpha (or ϕ), which is the conditional overdispersion of the data, represents the inverse of the shape parameter of the Gamma distribution. For the data in our example, we have:

$$\hat{\alpha} = e^{-2.258} = 0.105$$
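To illustrate what this estimate implies, the conditional variance can be sketched under the assumption that menbreg is using its default mean-dispersion parameterization, in which the variance is a quadratic function of the conditional mean λ. The value λ = 10 below is hypothetical and chosen only to make the arithmetic easy to follow.

$$Var(Y_{ijk} \mid u_{00k}, r_{0jk}) = \lambda_{ijk} \, (1 + \alpha \, \lambda_{ijk})$$

$$\lambda_{ijk} = 10, \; \hat{\alpha} = 0.105 \;\Rightarrow\; Var = 10 \cdot (1 + 0.105 \cdot 10) = 20.5$$

Under a Poisson model the same district would have a conditional variance of 10, so even a modest value of alpha roughly doubles the variance at this level of the mean.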

Analogously, the fixed and random effects parameters can also be obtained through the following command:

meglm accidents alcohol || state: || municipality: , family(nbinomial) link(log) nolog

In order to compare the estimations of the multilevel Poisson and negative binomial models, we must run a likelihood-ratio test, by typing the following command:

lrtest mepoisson ., force

where the term mepoisson refers to the stored estimation of the Poisson model. Since we are comparing results from two different estimators (mepoisson and menbreg), we must use the term force when elaborating this likelihood-ratio test. The result of the test can be seen in Fig. 23.67 and, through it, we can see that the negative binomial model is the more suitable of the two, which indicates that there is overdispersion in the data.

Fig. 23.67 Likelihood-ratio test to compare the estimations of the multilevel Poisson and negative binomial models.
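As a complementary check that is not part of the procedure described in this section, the two models can also be compared through information criteria. The sketch below assumes that the Poisson results were stored under the name mepoisson and that the negative binomial results are still the active (most recent) estimation, which Stata refers to as a period.

```stata
* AIC and BIC for the stored Poisson model and the active negative binomial model
estimates stats mepoisson .
```

Lower values of AIC and BIC indicate the preferred model, so this table would be expected to point in the same direction as the likelihood-ratio test.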

Therefore, the expression of the estimated average number of traffic accidents per year, for a certain municipal district i in a certain municipality j in a state k, is given by:

$$u_{ijk} = e^{(0.754 + 0.047 \cdot \text{alcohol}_{jk} + u_{00k} + r_{0jk})}$$

where u represents the expected number of occurrences or the estimated average rate of incidence of traffic accidents for one year. In order for these estimated numbers to be generated in the dataset (new variable u), we can type the following command:

predict u
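The fixed effects coefficient of alcohol in the expression above can be read as an incidence rate ratio. As a short worked example, comparing two districts with the same random effects whose alcohol values differ by one gram gives:

$$\frac{u_{ijk}(\text{alcohol}_{jk} + 1)}{u_{ijk}(\text{alcohol}_{jk})} = e^{0.047} \approx 1.048$$

That is, each additional gram of alcohol ingested per inhabitant/day is associated with an expected number of traffic accidents about 4.8% higher, holding the state and municipality random effects constant.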

Besides, we can also obtain the error terms u00k (invariable for the districts located in the same state) and r0jk (invariable for the districts located in the same municipality). In order to do that, we must type the following command:

predict u00 r0, remeans

which makes two new variables, u00 and r0, be created in the dataset.

The following command, which generates the outputs in Fig. 23.68, shows the values of u, u00, and r0 only for the municipal districts in the State of Mato Grosso:

list state municipality accidents u u00 r0 if state =="MT", sepby(municipality)

Fig. 23.68 Real and estimated number of traffic accidents and error terms u00k and r0jk for municipal districts in the State of Mato Grosso (k = Mato Grosso).

Through this figure, we can see that the values of u00 are the same for all the municipal districts in the State of Mato Grosso, whereas the values of r0 are the same for all the districts within each municipality.
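The same property can be checked for the whole dataset rather than for a single state. The sketch below is only one possible way of doing it: it computes within-group standard deviations, which should be zero (up to rounding) when a term is constant inside the group. The variable names sd_u00 and sd_r0 are hypothetical and introduced only for this check.

```stata
* Within-state spread of u00 and within-municipality spread of r0
bysort state: egen double sd_u00 = sd(u00)
bysort state municipality: egen double sd_r0 = sd(r0)

* Both variables should show a minimum and a maximum of zero
summarize sd_u00 sd_r0
```

Groups that contain a single municipal district produce a missing standard deviation, so they are simply left out of the summary.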

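For educational purposes only, researchers may verify that the values of variable u can also be reproduced through the following expression. Because u already exists in the dataset, a different variable name is needed; ucheck is used here only as an illustration:

gen ucheck = exp(0.7538477 + 0.0466768*alcohol + u00 + r0)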

Finally, we can construct a chart that compares the estimation adjustments of the traditional and multilevel negative binomial models. This chart, which can be seen in Fig. 23.69, is obtained by typing the following commands:

quietly nbreg accidents alcohol
predict utrad
graph twoway scatter accidents alcohol || mspline utrad alcohol || mspline u alcohol ||, legend(label(2 "Traditional Negative Binomial") label(3 "Multilevel Negative Binomial"))

Fig. 23.69 Adjustments of the estimated number of traffic accidents obtained through the traditional and multilevel negative binomial models, based on the average amount of alcohol ingested per inhabitant/day in the district.



1 The command xtmixed started being available in Stata Version 9 (2005), and up until Version 12, it is the command for estimating hierarchical linear models, with the standard method being the restricted maximum likelihood (REML). After Stata 13, estimations of hierarchical linear models can be elaborated through commands xtmixed or simply mixed. However, when not specified by the researcher, by default, the estimation method becomes the maximum likelihood (MLE).

2 A more inquisitive researcher may verify this fact by typing the command predict yhat right after estimating the null model. A new variable (yhat) will be generated in the dataset, with all the values equal to 61.049 (in fact, it is a constant).

3 If a more inquisitive researcher wishes to run a likelihood-ratio test to compare the estimations of the null model and of the random intercepts model, whose fixed effects specifications are obviously different, he/she will have to do it by estimating these two models through maximum likelihood (MLE), instead of through restricted maximum likelihood (REML). In order to do that, he/she can type the following sequence of commands:

quietly xtmixed performance || school: , var nolog mle
estimates store nullmle
quietly xtmixed performance hours || school: , var nolog mle
estimates store randominterceptmle
lrtest nullmle randominterceptmle

whose result favors the model with random effects in the intercept in relation to the null model.

4 The command MIXED became available in SPSS after 2001, in Version 11.0.

5 If researchers wish to run a likelihood-ratio test to compare the estimations of the null model and the random intercepts model, whose fixed effects specifications are obviously different, they will have to do it by estimating these two models through maximum likelihood (ML), instead of through restricted maximum likelihood (REML). Hence, they must type both syntaxes below that correspond to the estimations through maximum likelihood (in SPSS, METHOD = ML) of the null model and the random intercepts model, respectively:

MIXED performance
 /METHOD = ML
 /PRINT = SOLUTION TESTCOV
 /FIXED = INTERCEPT
 /RANDOM = INTERCEPT | SUBJECT(school) .

MIXED performance WITH hours
 /METHOD = ML
 /PRINT = SOLUTION TESTCOV
 /FIXED = INTERCEPT hours
 /RANDOM = INTERCEPT | SUBJECT(school) .

which, even though they are not presented here, generate values of −2 ∙ LL, respectively, equal to 17,507.017 and 12,739.629. Thus, the likelihood-ratio test has a significance level Sig. χ²₁ (17,507.017 − 12,739.629 = 4,767.39) = 0.000 < 0.05, which favors the use of the model with random effects in the intercept.

6 If researchers wish to compare the results of this estimation to those resulting from the estimation of a linear trend model with only random intercepts, as carried out in Stata, they just need to type the following routine in the syntax window in SPSS:

MIXED performance WITH year
 /METHOD = REML
 /PRINT = SOLUTION TESTCOV
 /FIXED = INTERCEPT year
 /RANDOM = INTERCEPT | SUBJECT(student)
 /RANDOM = INTERCEPT | SUBJECT(school) .

Even though the outputs are not presented here, they generate the value −2 ∙ LL equal to 15,602.840. Therefore, a likelihood-ratio test will have a significance level Sig. χ²₂ (15,602.840 − 14,929.638 = 673.20) = 0.000 < 0.05, which favors the use of the linear trend model with random intercepts and slopes.

7 Analogously, if researchers wish to compare the results of this estimation to those from an estimate of a model considering independent random terms, as carried out in Stata, they just need to type the following routine in the syntax window in SPSS:

MIXED performance WITH year gender texp
 /METHOD = REML
 /PRINT = SOLUTION TESTCOV
 /FIXED = INTERCEPT year gender texp gender*year texp*year
 /RANDOM = INTERCEPT year | SUBJECT(student)
 /RANDOM = INTERCEPT year | SUBJECT(school) .

Even though the outputs are not presented here, they generate the value −2 ∙ LL equal to 14,839.357. Thus, a likelihood-ratio test will have a significance level Sig. χ²₂ (14,839.357 − 14,753.429 = 85.93) = 0.000 < 0.05. This allows us to state that the structure of the variance-covariance matrix between the error terms can be considered unstructured in this example. That is, we can consider that error terms u00k and u10k are correlated (cov(u00k, u10k) ≠ 0), and that error terms r0jk and r1jk are also correlated (cov(r0jk, r1jk) ≠ 0).

8 For the versions that came before Stata 13, the command must be xtmelogit tourism age children || country: , var nolog.

9 If researchers choose to estimate a probit hierarchical nonlinear model, whose distribution of the dependent variable is also Bernoulli, as we discussed in the Appendix of Chapter 14, they may use one of the following two commands:

meprobit tourism age children || country: , nolog
meglm tourism age children || country: , family(bernoulli) link(probit) nolog

10 For the versions that came before Stata 13, the command must be xtmepoisson accidents alcohol || state: || municipality: , var nolog.

11 The estimation of multilevel negative binomial models (command menbreg) started to be available in Stata after Version 13.
