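$$\text{performance}_{tjk} = \pi_{0jk} + \pi_{1jk} \cdot \text{year}_{jk} + e_{tjk}$$
$$\pi_{0jk} = b_{00k} + b_{01k} \cdot \text{gender}_{jk} + r_{0jk}$$
$$\pi_{1jk} = b_{10k} + b_{11k} \cdot \text{gender}_{jk} + r_{1jk}$$
$$b_{00k} = \gamma_{000} + \gamma_{001} \cdot \text{texp}_{k} + u_{00k}$$
$$b_{01k} = \gamma_{010}$$
$$b_{10k} = \gamma_{100} + \gamma_{101} \cdot \text{texp}_{k} + u_{10k}$$
$$b_{11k} = \gamma_{110},$$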
which results in the following expression:
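$$\text{performance}_{tjk} = \gamma_{000} + \gamma_{100} \cdot \text{year}_{jk} + \gamma_{010} \cdot \text{gender}_{jk} + \gamma_{001} \cdot \text{texp}_{k} + \gamma_{110} \cdot \text{gender}_{jk} \cdot \text{year}_{jk} + \gamma_{101} \cdot \text{texp}_{k} \cdot \text{year}_{jk} + u_{00k} + u_{10k} \cdot \text{year}_{jk} + r_{0jk} + r_{1jk} \cdot \text{year}_{jk} + e_{tjk}$$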
To estimate this model, we first need to create one more variable (texpyear), which corresponds to the product of texp and year:
gen texpyear = texp*year
We can then estimate the proposed model by typing the following command:
xtmixed performance year gender texp genderyear texpyear || school: year || student: year, var nolog reml
whose outputs can be found in Fig. 23.42.
Even though the estimations of the fixed effects parameters and random effects variances are significant at a significance level of 0.05, it is necessary to study the structure of the variance-covariance matrices of the random effects (u00k, u10k and r0jk, r1jk). Based on the outputs found in Fig. 23.42, we have:
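$$\operatorname{var}\begin{bmatrix} u_{00k} \\ u_{10k} \end{bmatrix} = \begin{bmatrix} 87.994 & 0 \\ 0 & 0.263 \end{bmatrix}$$
$$\operatorname{var}\begin{bmatrix} r_{0jk} \\ r_{1jk} \end{bmatrix} = \begin{bmatrix} 337.627 & 0 \\ 0 & 3.092 \end{bmatrix}$$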
To store the results of this estimation, we have to type:
estimates store finalindependent
Since we did not specify any covariance structure for these error terms, in the preparation of the command xtmixed, Stata assumes that this structure is independent, that is, that cov(u00k, u10k) = 0 and that cov(r0jk, r1jk) = 0. Nevertheless, we can generalize these matrices’ structure by allowing u00k and u10k to be correlated, and r0jk and r1jk to be correlated too. In order to do that, in the command xtmixed, it is necessary to add the term covariance(unstructured) to the random effects components of levels school and student, such that:
xtmixed performance year gender texp genderyear texpyear || school: year, covariance(unstructured) || student: year, covariance(unstructured) var nolog reml
which generates the outputs seen in Fig. 23.43.
The fixed effects parameter estimations are extremely close to those obtained when estimating the model with an independent structure for the random effects variance-covariance matrices (Fig. 23.42).
Regarding the random effects parameters, all the estimations are significant at a significance level of 0.05, except for the estimations of the variance of u10k and of cov(u00k, u10k), which are statistically significant at a significance level of 0.10, since their respective | z | > 1.64, the critical value of the standard normal distribution for a two-tailed test at a significance level of 0.10. For educational purposes, we will use a confidence level of 90% to continue the analysis.
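The 1.64 threshold mentioned above can be checked numerically. The following is a minimal sketch in Python (not part of the Stata workflow of this chapter), assuming a two-tailed z test; the function name `two_tailed_critical_z` is introduced here only for illustration.

```python
from statistics import NormalDist


def two_tailed_critical_z(alpha: float) -> float:
    """Critical |z| of the standard normal for a two-tailed test at level alpha."""
    return NormalDist().inv_cdf(1 - alpha / 2)


z_10 = two_tailed_critical_z(0.10)  # roughly 1.645, quoted as 1.64 in the text
z_05 = two_tailed_critical_z(0.05)  # roughly 1.960

print(f"alpha = 0.10 -> reject when |z| > {z_10:.3f}")
print(f"alpha = 0.05 -> reject when |z| > {z_05:.3f}")
```

An estimate whose | z | lies between these two values is therefore significant at 0.10 but not at 0.05, which is the situation described for the variance of u10k and for cov(u00k, u10k).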
Thus, considering that cov(u00k, u10k) and cov(r0jk, r1jk) are statistically different from zero, based on the outputs in Fig. 23.43 we can write that:
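$$\operatorname{var}\begin{bmatrix} u_{00k} \\ u_{10k} \end{bmatrix} = \begin{bmatrix} 88.737 & -3.185 \\ -3.185 & 0.255 \end{bmatrix}$$
$$\operatorname{var}\begin{bmatrix} r_{0jk} \\ r_{1jk} \end{bmatrix} = \begin{bmatrix} 350.913 & -13.251 \\ -13.251 & 3.258 \end{bmatrix}$$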
Researchers will also obtain these matrices if they type the following command right after the last estimation:
estat recovariance
whose outputs can be found in Fig. 23.44.
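To get a feeling for the size of the estimated covariances, they can be converted into correlations by dividing each covariance by the product of the corresponding standard deviations. The sketch below is written in Python only as an arithmetic check, using the variances and covariances reported for the unstructured model; the helper `implied_correlation` is a name assumed for this illustration.

```python
import math


def implied_correlation(var_a: float, var_b: float, cov_ab: float) -> float:
    """Correlation implied by two variances and their covariance."""
    return cov_ab / math.sqrt(var_a * var_b)


# Level school: var(u00k), var(u10k), cov(u00k, u10k)
corr_school = implied_correlation(88.737, 0.255, -3.185)
# Level student: var(r0jk), var(r1jk), cov(r0jk, r1jk)
corr_student = implied_correlation(350.913, 3.258, -13.251)

print(f"corr(u00k, u10k) = {corr_school:.2f}")
print(f"corr(r0jk, r1jk) = {corr_student:.2f}")
```

Both correlations are negative, so schools and students with higher random intercepts tend to have smaller random slopes over time.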
Even though the estimations of the random effects covariances are statistically different from zero at both levels of the analysis, researchers who wish to verify that this last model is more suitable than the one that considers independent error terms just need to run a likelihood-ratio test to compare both estimations.
For this purpose, first, let’s type the following command, regarding the estimation with unstructured random effects:
estimates store finalunstructured
Next, we can type the command to carry out the abovementioned test:
lrtest finalunstructured finalindependent
The result can be seen in Fig. 23.45.
This $\chi^2$ statistic for the test, with 2 degrees of freedom, can also be obtained through the following expression:
$$\chi^2_2 = \left(-2 \cdot LL_{r\text{-ind}} - \left(-2 \cdot LL_{r\text{-unstruc}}\right)\right) = \left\{-2 \cdot (-7{,}419.679) - \left[-2 \cdot (-7{,}376.715)\right]\right\} = 85.93,$$
which results in a Sig. $\chi^2_2$ = 0.000 < 0.05. Therefore, we can state that the structure of the random effects variance-covariance matrices can be considered unstructured in this example. That is, we can consider that error terms u00k and u10k are correlated (cov(u00k, u10k) ≠ 0) and that error terms r0jk and r1jk are correlated too (cov(r0jk, r1jk) ≠ 0).
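The likelihood-ratio statistic and its significance level can be reproduced by hand from the two restricted log-likelihoods. The Python sketch below is only an arithmetic check and uses the fact that, for exactly 2 degrees of freedom, the upper-tail probability of the chi-square distribution has the closed form exp(−x/2); the variable names are assumptions made for this illustration.

```python
import math

# Restricted log-likelihoods reported for the two estimations
ll_independent = -7419.679
ll_unstructured = -7376.715

# Likelihood-ratio statistic: difference between the two -2*LL values
lr_stat = -2 * ll_independent - (-2 * ll_unstructured)

# Upper-tail probability of a chi-square with 2 degrees of freedom,
# which has the closed form exp(-x / 2) (valid only for df = 2)
p_value = math.exp(-lr_stat / 2)

print(f"LR chi2(2) = {lr_stat:.2f}")
print(f"p-value    = {p_value:.3g}")
```

The p-value is far below 0.05, in line with the Sig. value of 0.000 displayed by Stata.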
We have arrived at our final model, with the following specification:
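$$\text{performance}_{tjk} = 54.734 + 4.516 \cdot \text{year}_{jk} - 14.702 \cdot \text{gender}_{jk} + 1.179 \cdot \text{texp}_{k} + 0.652 \cdot \text{gender}_{jk} \cdot \text{year}_{jk} - 0.057 \cdot \text{texp}_{k} \cdot \text{year}_{jk} + u_{00k} + u_{10k} \cdot \text{year}_{jk} + r_{0jk} + r_{1jk} \cdot \text{year}_{jk} + e_{tjk}$$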
Next, we can obtain the expected values (BLUPs, best linear unbiased predictions) of random effects u10k, u00k, r1jk, and r0jk of our final model by typing:
predict u10final u00final r1final r0final, reffects
which generates four new variables in the dataset, called u10final, u00final, r1final, and r0final. They correspond to the slope and intercept random effects of level school and to the slope and intercept random effects of level student, respectively. The following command, whose outputs can be found in Fig. 23.46, describes these random effects:
desc u10final u00final r1final r0final
Besides, we can also obtain the expected values of each student’s school performance in each of the periods monitored, by typing the following command:
predict yhatstudent, fitted level(student)
which defines the variable yhatstudent, which can also be obtained through the following command:
gen yhatstudent = 54.73435 + 4.515641*year - 14.70213*gender + 1.178656*texp + .6518855*genderyear - .0566496*texpyear + u00final + u10final*year + r0final + r1final*year
which corresponds to the expression:
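$$\widehat{\text{performance\_student}}_{jk} = 54.734 + 4.516 \cdot \text{year}_{jk} - 14.702 \cdot \text{gender}_{jk} + 1.179 \cdot \text{texp}_{k} + 0.652 \cdot \text{gender}_{jk} \cdot \text{year}_{jk} - 0.057 \cdot \text{texp}_{k} \cdot \text{year}_{jk} + u_{00k} + u_{10k} \cdot \text{year}_{jk} + r_{0jk} + r_{1jk} \cdot \text{year}_{jk}$$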
If researchers type the following command, they will obtain the expected values of each student’s school performance in each of the periods monitored; however, without considering the random effects in the level student:
predict yhatschool, fitted level(school)
which defines the variable yhatschool in the dataset, which can also be obtained through the following command:
gen yhatschool = 54.73435 + 4.515641*year - 14.70213*gender + 1.178656*texp + .6518855*genderyear - .0566496*texpyear + u00final + u10final*year
which corresponds to the expression:
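$$\widehat{\text{performance\_school}}_{k} = 54.734 + 4.516 \cdot \text{year}_{jk} - 14.702 \cdot \text{gender}_{jk} + 1.179 \cdot \text{texp}_{k} + 0.652 \cdot \text{gender}_{jk} \cdot \text{year}_{jk} - 0.057 \cdot \text{texp}_{k} \cdot \text{year}_{jk} + u_{00k} + u_{10k} \cdot \text{year}_{jk}$$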
Error terms etjk can be obtained by typing the command predict etjk, res (which is equivalent to performance - yhatstudent).
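The relationship between the fixed part, the two fitted values, and the residual can be sketched outside Stata. The Python example below uses the fixed effects coefficients of the final model; the covariate values, the random effects (BLUPs), and the observed grade are hypothetical numbers chosen only for illustration, and the function names are assumptions of this sketch.

```python
# Fixed effects coefficients of the final model (taken from the gen commands above)
FIXED = {
    "intercept": 54.73435,
    "year": 4.515641,
    "gender": -14.70213,
    "texp": 1.178656,
    "genderyear": 0.6518855,
    "texpyear": -0.0566496,
}


def fixed_part(year: float, gender: float, texp: float) -> float:
    """Prediction that uses only the fixed effects component."""
    return (FIXED["intercept"] + FIXED["year"] * year
            + FIXED["gender"] * gender + FIXED["texp"] * texp
            + FIXED["genderyear"] * gender * year
            + FIXED["texpyear"] * texp * year)


def yhat_school(year, gender, texp, u00, u10):
    """Fitted value with the school-level random effects only."""
    return fixed_part(year, gender, texp) + u00 + u10 * year


def yhat_student(year, gender, texp, u00, u10, r0, r1):
    """Fitted value with school-level and student-level random effects."""
    return yhat_school(year, gender, texp, u00, u10) + r0 + r1 * year


# Hypothetical observation: covariates, BLUPs, and observed grade are made up
year, gender, texp = 2, 1, 10
u00, u10, r0, r1 = 3.0, -0.2, 5.0, 0.5
observed = 70.0

fit_school = yhat_school(year, gender, texp, u00, u10)
fit_student = yhat_student(year, gender, texp, u00, u10, r0, r1)
residual = observed - fit_student  # plays the role of etjk

print(f"fixed part  = {fixed_part(year, gender, texp):.3f}")
print(f"yhatschool  = {fit_school:.3f}")
print(f"yhatstudent = {fit_student:.3f}")
print(f"residual    = {residual:.3f}")
```

The difference between the two fitted values is exactly the student-level contribution r0 + r1 × year, which is why yhatschool disregards the random effects of level student.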
Therefore, at this moment, we are able to conclude the analysis. We have seen that students’ school performance follows a linear trend throughout time, there is a significant variance of intercepts and slopes between those who study at the same school and between those who study at different schools, and students’ gender is significant to explain part of this variation. Professors’ years of teaching experience at each school (level-3 variable) itself also explains part of the discrepancies in the annual school performance between students from different schools.
The following commands generate a chart (Fig. 23.47) with the predicted values of school performance throughout time for the first 50 students in the sample (yhatstudent), in which we can see different intercepts and slopes for different students. The data are first sorted by student and year:
sort student year
graph twoway connected yhatstudent year if student <= 50, connect(L)
Finally, a more inquisitive researcher, aiming at questioning the superiority of multilevel models in relation to traditional regression models estimated through OLS, whenever there are datasets with nested structures, decides to construct a chart. Through this chart, it is possible to compare the predicted school performance values generated by this three-level hierarchical modeling (HLM3) to those generated by an estimation through OLS, for all the students in the sample, in each of the periods analyzed, using the same explanatory variables year, gender, texp, genderyear, and texpyear. Obviously, there are only fixed effects in the estimation through OLS.
Thus, the following sequence of commands is typed, which generates the chart seen in Fig. 23.48:
quietly reg performance year gender texp genderyear texpyear
predict yhatreg
graph twoway mspline yhatreg performance || mspline yhatstudent performance || lfit performance performance ||, legend(label(1 "OLS") label(2 "HLM3") label(3 "Observed Values"))
The dotted line at 45 degrees shows the observed school performance values of each one of the students in the sample in each of the periods analyzed (performance × performance). By using the chart in Fig. 23.48, we can clearly see the superiority of our linear trend model with explanatory variables and random intercepts and slopes in levels 2 and 3 (complete HLM3 model) over the multiple linear regression model estimated through OLS with the same explanatory variables. This demonstrates the importance of considering the random effects components whenever there are nested data structures.
In a consolidated way, Table 23.5 shows the general commands in Stata for preparing a two-level hierarchical linear model with clustered data, and a three-level hierarchical linear model with repeated measures, as studied in Sections 23.5.1 and 23.5.2, respectively. This is a broad topic and new intermediate models can always be estimated by researchers, based on their research objectives and on the constructs proposed.
Table 23.5
Modeling | Intermediate Model | Command in Stata |
---|---|---|
Two-Level Hierarchical Linear Model with Clustered Data | Null Model (Nonconditional Model) | xtmixed Y \|\| var(level 2): |
Two-Level Hierarchical Linear Model with Clustered Data | Random Intercepts Model | xtmixed Y X \|\| var(level 2): |
Two-Level Hierarchical Linear Model with Clustered Data | Random Intercepts and Slopes Model | xtmixed Y X \|\| var(level 2): X |
Two-Level Hierarchical Linear Model with Clustered Data | Random Intercepts and Slopes Model and Correlated Error Terms | xtmixed Y X \|\| var(level 2): X, covariance(unstructured) |
Three-Level Hierarchical Linear Model with Repeated Measures | Null Model (Nonconditional Model) | xtmixed Y \|\| var(level 3): \|\| var(level 2): |
Three-Level Hierarchical Linear Model with Repeated Measures | Linear Trend Model with Random Intercepts | xtmixed Y t \|\| var(level 3): \|\| var(level 2): |
Three-Level Hierarchical Linear Model with Repeated Measures | Linear Trend Model with Random Intercepts and Slopes | xtmixed Y t \|\| var(level 3): t \|\| var(level 2): t |
Three-Level Hierarchical Linear Model with Repeated Measures | Linear Trend Model with Random Intercepts and Slopes and Level-2 Variable | xtmixed Y t X Xt \|\| var(level 3): t \|\| var(level 2): t |
Three-Level Hierarchical Linear Model with Repeated Measures | Linear Trend Model with Random Intercepts and Slopes and Level-2 and Level-3 Variables | xtmixed Y t X W Xt Wt WXt \|\| var(level 3): t \|\| var(level 2): t |
Three-Level Hierarchical Linear Model with Repeated Measures | Linear Trend Model with Random Intercepts and Slopes and Level-2 and Level-3 Variables and Correlated Error Terms | xtmixed Y t X W Xt Wt WXt \|\| var(level 3): t, covariance(unstructured) \|\| var(level 2): t, covariance(unstructured) |
Note: Considering a level-2 variable X, a level-3 variable W (whenever there is one), and t as the temporal variable. In addition to this, Y refers to the dependent variable. In all the cases, the term that corresponds to the estimation method was omitted. As discussed previously, while the estimation method adopted by Stata up to Version 12 is the restricted maximum likelihood (reml) by default, the default method becomes the maximum likelihood (mle) after Version 13.
After having made these considerations and having respected the multilevel step-up strategy throughout this entire section, at this moment, let’s estimate the same models in SPSS. In order to give researchers the opportunity to compare both software packages, the procedures and routines for estimating the models are presented, as well as the logic with which the outputs are generated.
Now, let’s present the step-by-step procedure for preparing our examples in IBM SPSS Statistics Software. The use of the images in this section has been authorized by the International Business Machines Corporation©.
At this moment, the main objective is to give researchers the opportunity to use multilevel modeling techniques in SPSS. Every time we present an output, we will mention the respective result obtained when preparing the techniques in Stata, so that researchers can compare them and, thus, decide which software package to use, based on the characteristics of each one and on how accessible they are.
Going back to the example used in Section 23.5.1, let’s remember that our professor collected data on the school performance (grades from 0 to 100 plus a bonus for participation in class) of 2,000 students from 46 schools. He also collected data on the number of hours spent studying per week (level-1 explanatory variable), the type of school (public or private), and professors’ teaching experience (in years) at each school (level-2 explanatory variables). The complete dataset is in the file PerformanceStudentSchool.sav.
Maintaining the logic presented here, initially, let’s estimate the null model, as follows:
$$\text{performance}_{ij} = \gamma_{00} + u_{0j} + r_{ij}$$
Even though it is possible to estimate multilevel models using the option Analyze → Mixed Models in SPSS, based on point-and-click procedures, in this section, we have chosen to estimate the models through syntax, to provide a better comparison for the estimations elaborated in Section 23.5.1, and to facilitate the understanding of how to include variables into fixed and random effects components. In order to do that, with the file PerformanceStudentSchool.sav open, we must click on File → New → Syntax. For the null model, we must type the following syntax in the window that will open:
MIXED performance
/METHOD = REML
/PRINT = SOLUTION TESTCOV
/FIXED = INTERCEPT
/RANDOM = INTERCEPT | SUBJECT(school) .
where the first line (MIXED)4 only shows the dependent variable performance and both lines after that (METHOD and PRINT) determine the estimation method adopted (in this case, restricted maximum likelihood estimation, or REML), and that the estimations of the fixed effects with their corresponding standard errors be presented in the outputs, respectively. Finally, in the last two lines (FIXED and RANDOM), in addition to the intercept term, the variables that will be a part of the fixed and random effects components, respectively, can be specified, where the term SUBJECT inserted after the vertical bar | identifies the group variable that corresponds to level 2 (in our case, the variable school).
Fig. 23.49 shows the window in SPSS with the inclusion of the syntax that corresponds to the null model, highlighting the button Run Selection that will have to be clicked so that the multilevel modeling can be estimated.
Next, in Fig. 23.50, the outputs generated by SPSS are presented.
Initially, we can see that the output Model Dimension shows the number of levels considered in the modeling (in this case, 2), and the number of parameters estimated (in this case, 3, including the error term). The term Variance Components informs us that a variance-covariance matrix structure with independent random effects is being considered.
In Information Criteria, the value -2 Restricted Log Likelihood is presented, which corresponds to − 2 times the maximum value obtained for the logarithm of the restricted likelihood function to estimate the model parameters. We can see that the output in SPSS shows that − 2 ∙ LLr = 17,504.04, which is exactly equal to − 2 times the value presented in Stata (Fig. 23.13), since − 2 ∙(− 8752.02) = 17,504.04.
Next, in Fixed Effects, the estimation of parameter γ00 is presented (fixed effect), which corresponds to the average of students’ expected school performance (horizontal line estimated in the null model, or general intercept). We can see that the estimation of γ00 = 61.049 corresponds to the one obtained in Fig. 23.13 in the estimation of the null model in Stata.
Finally, the estimations of level-1 and level-2 error terms’ variance components (random effects) are presented (Covariance Parameters). Here, we can also verify that the outputs correspond to the ones obtained in Stata, since the estimations are τ00 = 135.779 (Intercept [subject = school]) and σ2 = 347.562 (Residual). Nevertheless, note that, unlike Stata, SPSS displays the z statistics of the estimations of the error terms’ variances directly, with their respective significance levels. Thus, for the data in our example, we can see that there is variability in the school performance of students from different schools, since Sig. z τ00 < 0.05 (if the confidence level is defined at 95%).
Based on the intraclass correlation, calculated below, we can see that approximately 28% of the total school performance variance is due to the variation between schools.
$$\text{rho} = \frac{\tau_{00}}{\tau_{00} + \sigma^2} = \frac{135.779}{135.779 + 347.562} = 0.281$$
In order to maintain the logic presented in Section 23.5.1, at this moment, let’s estimate the random intercepts model, including the explanatory variable hours, as follows:
$$\text{performance}_{ij} = \gamma_{00} + \gamma_{10} \cdot \text{hours}_{ij} + u_{0j} + r_{ij}$$
The syntax to estimate this model in SPSS is:
MIXED performance WITH hours
/METHOD = REML
/PRINT = SOLUTION TESTCOV
/FIXED = INTERCEPT hours
/RANDOM = INTERCEPT | SUBJECT(school) .
where all the explanatory variables that researchers want must be inserted after the term WITH in the first line of the syntax. After we run it, we arrive at the main outputs shown in Fig. 23.51.
These outputs correspond to the ones presented in Fig. 23.15 (Stata) and, through them, we can see that there is statistical significance in the estimations of the variances of error terms τ00 = 19.125 and σ2 = 31.764, which result in the following intraclass correlation:
$$\text{rho} = \frac{\tau_{00}}{\tau_{00} + \sigma^2} = \frac{19.125}{19.125 + 31.764} = 0.376$$
Thus, there is an increase in the proportion of the variance component of the intercept in relation to the null model. This favors the decision to include the variable hours to study the school performance behavior when comparing the schools.
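The two intraclass correlations obtained so far can be reproduced with a few lines of arithmetic. The sketch below is in Python only as a check of the calculations; the function name `intraclass_correlation` is an assumption of this illustration, and the variance components are the ones reported in the SPSS outputs.

```python
def intraclass_correlation(tau00: float, sigma2: float) -> float:
    """Share of the total variance that is due to the level-2 units (schools)."""
    return tau00 / (tau00 + sigma2)


# Null model: tau00 = 135.779, sigma2 = 347.562
rho_null = intraclass_correlation(135.779, 347.562)
# Random intercepts model with hours: tau00 = 19.125, sigma2 = 31.764
rho_hours = intraclass_correlation(19.125, 31.764)

print(f"rho (null model)        = {rho_null:.3f}")
print(f"rho (random intercepts) = {rho_hours:.3f}")
```

Although both variance components fall sharply after the inclusion of hours, the level-1 variance falls proportionally more, which is why the share attributed to schools increases.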
Therefore, now, our model starts to have the following specification:
$$\text{performance}_{ij} = 0.534 + 3.252 \cdot \text{hours}_{ij} + u_{0j} + r_{ij}$$
where the fixed effect of the intercept, now, corresponds to the average expected school performance, between schools, of the students who, for some reason, do not study (hoursij = 0). The slope allows us to state that one more hour spent studying per week, on average, makes the expected mean school performance, between schools, increase 3.252 points, and this parameter is statistically significant.5
At this moment, let’s insert slope random effects into our multilevel model that, by maintaining the intercept random effects, will start to have the following expression:
$$\text{performance}_{ij} = \gamma_{00} + \gamma_{10} \cdot \text{hours}_{ij} + u_{0j} + u_{1j} \cdot \text{hours}_{ij} + r_{ij}$$
The new syntax is:
MIXED performance WITH hours
/METHOD = REML
/PRINT = SOLUTION TESTCOV
/FIXED = INTERCEPT hours
/RANDOM = INTERCEPT hours | SUBJECT(school) .
which generates the outputs shown in Fig. 23.52.
Analogously, these outputs correspond to the ones shown in Fig. 23.18 (Stata).
We can see that the parameter and variance estimations in the random intercepts and slopes model are identical to the ones obtained when estimating the model that only had random intercepts (Fig. 23.51). This occurs because the estimation of variance τ11 (hours [subject = school]) is statistically equal to zero, which makes the value obtained of − 2 ∙ LLr the same as the one shown in Fig. 23.51.
Hence, applying a likelihood-ratio test would offer an output that would obviously favor the use of a random intercepts model, since the significance level Sig. $\chi^2_1$ (12,744.329 − 12,744.329 = 0) = 1.000 > 0.05, as shown in Fig. 23.19.
If researchers wish to generalize the structure of the random effects variance-covariance matrix, allowing u0j and u1j to be correlated, they just need to estimate the model parameters using the term COVTYPE(UN) at the end of the RANDOM line of the last syntax. It will become:
MIXED performance WITH hours
/METHOD = REML
/PRINT = SOLUTION TESTCOV
/FIXED = INTERCEPT hours
/RANDOM = INTERCEPT hours | SUBJECT(school) COVTYPE(UN) .
where the term COVTYPE(UN) considers that there is an unstructured variance-covariance matrix. This model’s outputs are not presented here. However, a likelihood-ratio test to compare the estimations of random intercepts and slopes models with independent and correlated error terms u0j and u1j will show that the structure of the variance-covariance matrix between u0j and u1j can be considered independent, similar to what is shown in Fig. 23.23.
Since the random effects variance-covariance matrix structure is independent and the random intercepts model is the most suitable, let’s now estimate the complete final model, which has the following specification:
$$\text{performance}_{ij} = \gamma_{00} + \gamma_{10} \cdot \text{hours}_{ij} + \gamma_{01} \cdot \text{texp}_{j} + \gamma_{02} \cdot \text{priv}_{j} + \gamma_{11} \cdot \text{priv}_{j} \cdot \text{hours}_{ij} + u_{0j} + r_{ij}$$
Note that we are moving directly to the last estimation obtained in Section 23.5.1. The syntax to estimate the model is:
MIXED performance WITH hours texp priv
/METHOD = REML
/PRINT = SOLUTION TESTCOV
/FIXED = INTERCEPT hours texp priv priv*hours
/RANDOM = INTERCEPT | SUBJECT(school)
/SAVE = PRED FIXPRED .
where the last line shows the term SAVE = PRED FIXPRED, which generates two new variables in the dataset, PRED_1 and FXPRED_1. The former corresponds to the predicted values of school performance per student (yhat in Stata), with random intercepts components u0j. The latter refers to the predicted values of school performance only resulting from the fixed effects component. The outputs generated are shown in Fig. 23.53 and the expected values (BLUPs, best linear unbiased predictions) of our final model’s random effects u0j can be obtained through the following syntax:
COMPUTE blups = PRED_1-FXPRED_1.
which generates a new variable in the dataset, called blups, equal to the variable u0final defined in the estimation of this model in Stata.
These outputs correspond to the ones presented in Fig. 23.25 (Stata). With significant estimations of the variances of the random effects and of the fixed effects parameters, at a confidence level of 95% (except for the estimation of the parameter of the combined variable hours*priv, significant at a confidence level of 90%), we obtain the following expression of the model proposed:
$$\text{performance}_{ij} = -2.710 + 3.281 \cdot \text{hours}_{ij} + 0.866 \cdot \text{texp}_{j} - 5.610 \cdot \text{priv}_{j} - 0.080 \cdot \text{priv}_{j} \cdot \text{hours}_{ij} + u_{0j} + r_{ij}$$
constructed with the inclusion of level-1 and level-2 explanatory variables and through a multilevel step-up strategy. Hence, we can conclude that there are differences in the school performance behavior between students from the same schools and from different schools. These differences are explained by the number of hours each student spends studying per week, by the type of school (public or private), and by the professors’ teaching experience (in years) at each school.
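Because the final two-level model contains the interaction between priv and hours, the effect of one more hour of study depends on the type of school. The Python sketch below illustrates this with the estimated fixed effects; it assumes that priv is coded 1 for private and 0 for public schools, and the covariate values used in the example (20 hours, 10 years of teaching experience) are hypothetical.

```python
# Fixed effects of the final two-level model (from the expression above)
COEF = {
    "intercept": -2.710,
    "hours": 3.281,
    "texp": 0.866,
    "priv": -5.610,
    "priv_hours": -0.080,
}


def fixed_prediction(hours: float, texp: float, priv: int) -> float:
    """Expected performance from the fixed effects only (u0j and rij set to zero)."""
    return (COEF["intercept"] + COEF["hours"] * hours + COEF["texp"] * texp
            + COEF["priv"] * priv + COEF["priv_hours"] * priv * hours)


def hours_slope(priv: int) -> float:
    """Expected change in performance for one more weekly hour of study."""
    return COEF["hours"] + COEF["priv_hours"] * priv


# Assumption: priv = 1 means private school, priv = 0 means public school
slope_public = hours_slope(0)
slope_private = hours_slope(1)

# Hypothetical student: 20 hours per week, school with texp = 10 years
pred_public = fixed_prediction(20, 10, 0)
pred_private = fixed_prediction(20, 10, 1)

print(f"slope of hours (priv = 0): {slope_public:.3f}")
print(f"slope of hours (priv = 1): {slope_private:.3f}")
print(f"prediction (priv = 0): {pred_public:.2f}")
print(f"prediction (priv = 1): {pred_private:.2f}")
```

The gap between the two predictions equals −5.610 − 0.080 × hours, so it widens slightly as the number of hours increases.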
Next and in SPSS too, let’s study an example with a three-level hierarchical linear model with repeated measures.
In this section, we are going back to the example used in Section 23.5.2. Bear in mind that our professor managed to get data on the school performance (grades from 0 to 100) throughout four years (level-1 temporal variable) of 2000 students from 15 schools. He also collected data on each student’s gender (level-2 explanatory variable), and on professors’ years of teaching experience in each of the schools (level-3 explanatory variable). The complete dataset is presented in the file PerformanceTimeStudentSchool.sav.
It is important to mention that SPSS takes considerably longer than Stata to process estimations of multilevel models, mainly for three or more levels.
Maintaining the logic presented in Section 23.5.2, initially, let’s estimate the null model, as follows:
$$\text{performance}_{tjk} = \gamma_{000} + u_{00k} + r_{0jk} + e_{tjk}$$
For this null model, we must type the following routine in the syntax window:
MIXED performance
/METHOD = REML
/PRINT = SOLUTION TESTCOV
/FIXED = INTERCEPT
/RANDOM = INTERCEPT | SUBJECT(student)
/RANDOM = INTERCEPT | SUBJECT(school) .
where the first line (MIXED) only shows the dependent variable performance and both lines after that (METHOD and PRINT) determine the estimation method adopted (in this case, restricted maximum likelihood estimation, or REML), and that the estimations of the fixed effects with their corresponding standard errors be presented in the outputs. In the following line (FIXED), the variable that will be a part of the fixed effects components can be specified, in addition to the intercept term. Finally, in the last two lines of the routine (RANDOM), besides the intercept terms, the variables that will be part of the random effects components in the different levels of the analysis can be specified. The term SUBJECT inserted after the vertical bar | identifies the group variable that corresponds to each level (in our case, student for level 2 and school for level 3).
Fig. 23.54 shows the outputs generated by SPSS.
We will not analyze all the outputs of the model generated once again, because they are identical to the ones shown in Fig. 23.34, obtained in the estimation of this null model in Stata.
Nevertheless, we can see that the estimation of parameter γ000 (Fixed Effects) is equal to 68.714, which corresponds to the average of the students’ expected annual school performance (horizontal line estimated in the null model, or general intercept).
Besides, we know that the estimations of the error terms’ variances (Covariance Parameters) τu000 = 180.194 (Intercept [subject = school]), τr000 = 325.799 (Intercept [subject = student]), and σ2 = 41.649 (Residual) are statistically different from zero, at a significance level of 0.05. This fact allows us to state that there is significant variability in the school performance throughout the four years of the analysis, there is significant variability in the school performance, throughout time, between students of the same school, and there is significant variability in the school performance, throughout time, between students from different schools.
Both intraclass correlations, which correspond to levels 2 and 3 of the analysis, can be calculated as follows:
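$$\text{rho}_{\text{student}\mid\text{school}} = \operatorname{corr}(Y_{tjk}, Y_{t'jk}) = \frac{\tau_{u000} + \tau_{r000}}{\tau_{u000} + \tau_{r000} + \sigma^2} = \frac{180.194 + 325.799}{180.194 + 325.799 + 41.649} = 0.924$$
$$\text{rho}_{\text{school}} = \operatorname{corr}(Y_{tjk}, Y_{t'j'k}) = \frac{\tau_{u000}}{\tau_{u000} + \tau_{r000} + \sigma^2} = \frac{180.194}{180.194 + 325.799 + 41.649} = 0.329$$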
Thus, the correlation between the annual school performances, for the same school, is equal to 32.9% (rhoschool), and the correlation between the annual school performances, for the same student of a certain school, is equal to 92.4% (rhostudent | school).
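Both intraclass correlations of the three-level null model share the same denominator, the total variance, and differ only in the numerator. The Python sketch below reproduces the calculation as an arithmetic check; the function name `three_level_icc` is an assumption of this illustration.

```python
def three_level_icc(tau_school: float, tau_student: float, sigma2: float):
    """Return (rho_student|school, rho_school) for a three-level null model."""
    total = tau_school + tau_student + sigma2
    rho_student_school = (tau_school + tau_student) / total
    rho_school = tau_school / total
    return rho_student_school, rho_school


# Variance components reported in the Covariance Parameters output
rho_student_school, rho_school = three_level_icc(180.194, 325.799, 41.649)

print(f"rho student|school = {rho_student_school:.3f}")
print(f"rho school         = {rho_school:.3f}")
```

The difference between the two values corresponds to the share of the total variance that is due to differences between students of the same school.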
In order to maintain the same logic presented in Section 23.5.2, at this moment, let’s estimate the linear trend model with random intercepts and slopes, including the variable year (repeated measure) as an explanatory variable into level 1, as follows:
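$$\text{performance}_{tjk} = \gamma_{000} + \gamma_{100} \cdot \text{year}_{jk} + u_{00k} + u_{10k} \cdot \text{year}_{jk} + r_{0jk} + r_{1jk} \cdot \text{year}_{jk} + e_{tjk}$$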
The syntax to estimate this model in SPSS is:
MIXED performance WITH year
/METHOD = REML
/PRINT = SOLUTION TESTCOV
/FIXED = INTERCEPT year
/RANDOM = INTERCEPT year | SUBJECT(student)
/RANDOM = INTERCEPT year | SUBJECT(school) .
where all the explanatory variables that researchers want must be inserted after the term WITH in the first line of the syntax. After nine iterations and a few processing minutes, we arrived at the main outputs shown in Fig. 23.55.
These outputs correspond to the ones shown in Fig. 23.39. Through them, we can see that the parameters estimated of the fixed and random effects components are statistically different from zero, at a significance level of 0.05. This gives us grounds to state that students’ school performance follows a linear trend throughout time, and that there is a significant variance of intercepts and slopes between those who study at the same school and between those who study at different schools6. By using the level-2 intraclass correlation, calculated below, we estimate that the random effects of students and schools account for approximately 99% of the total variance of the residuals!
$$\text{rho}_{\text{student}\mid\text{school}} = \operatorname{corr}(Y_{tjk}, Y_{t'jk}) = \frac{\tau_{u000} + \tau_{u100} + \tau_{r000} + \tau_{r100}}{\tau_{u000} + \tau_{u100} + \tau_{r000} + \tau_{r100} + \sigma^2} = \frac{224.343 + 0.560 + 374.285 + 3.157}{224.343 + 0.560 + 374.285 + 3.157 + 3.868} = 0.994$$
At this moment, our model starts to have the following specification:
$$\text{performance}_{tjk} = 57.858 + 4.343 \cdot \text{year}_{jk} + u_{00k} + u_{10k} \cdot \text{year}_{jk} + r_{0jk} + r_{1jk} \cdot \text{year}_{jk} + e_{tjk}$$
Finally, let’s investigate if level-2 and level-3 variables gender and texp also explain the variation in the annual school performance between students. After some intermediate analyses, let’s move on to estimate the following complete three-level model:
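$$\text{performance}_{tjk} = \gamma_{000} + \gamma_{100} \cdot \text{year}_{jk} + \gamma_{010} \cdot \text{gender}_{jk} + \gamma_{001} \cdot \text{texp}_{k} + \gamma_{110} \cdot \text{gender}_{jk} \cdot \text{year}_{jk} + \gamma_{101} \cdot \text{texp}_{k} \cdot \text{year}_{jk} + u_{00k} + u_{10k} \cdot \text{year}_{jk} + r_{0jk} + r_{1jk} \cdot \text{year}_{jk} + e_{tjk}$$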
To estimate this model, let’s generalize the structure of the random effects variance-covariance matrices, allowing (u00k, u10k) and (r0jk, r1jk) to be correlated (unstructured variance-covariance matrices). In order to do that, we must insert the term COVTYPE(UN) at the end of the RANDOM lines, making the syntax in SPSS become:
MIXED performance WITH year gender texp
/METHOD = REML
/PRINT = SOLUTION TESTCOV
/FIXED = INTERCEPT year gender texp gender*year texp*year
/RANDOM = INTERCEPT year | SUBJECT(student) COVTYPE(UN)
/RANDOM = INTERCEPT year | SUBJECT(school) COVTYPE(UN)
/SAVE = PRED FIXPRED RESID .
where the last line now shows the term SAVE = PRED FIXPRED RESID, which makes three new variables be generated in the dataset, PRED_1, FXPRED_1, and RESID_1. They correspond to the predicted values of the school performance per student (yhatstudent in Stata), to the predicted values of the school performance only resulting from the fixed effects component, and to error terms etjk, respectively.
After five iterations and a few processing minutes, we arrived at the outputs shown in Fig. 23.56.
These outputs correspond to the ones shown in Fig. 23.43 (Stata) and, through which, we can see that all the parameters estimated for the fixed effects component are statistically different from zero, at a significance level of 0.05. On the other hand, in relation to the parameters of the random effects components, only the estimations of u10k and cov(u00k, u10k) are statistically significant at a significance level of 0.10. All the others are significant at a significance level of 0.05. Thus, considering that cov(u00k, u10k) and cov(r0jk, r1jk) are statistically different from zero, we can write:
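$$\operatorname{var}\begin{bmatrix} u_{00k} \\ u_{10k} \end{bmatrix} = \begin{bmatrix} 88.734 & -3.185 \\ -3.185 & 0.255 \end{bmatrix}$$
$$\operatorname{var}\begin{bmatrix} r_{0jk} \\ r_{1jk} \end{bmatrix} = \begin{bmatrix} 350.913 & -13.251 \\ -13.251 & 3.257 \end{bmatrix}$$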
Therefore, the expression of our final model has the following specification7:
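$$\text{performance}_{tjk} = 54.734 + 4.516 \cdot \text{year}_{jk} - 14.702 \cdot \text{gender}_{jk} + 1.179 \cdot \text{texp}_{k} + 0.652 \cdot \text{gender}_{jk} \cdot \text{year}_{jk} - 0.057 \cdot \text{texp}_{k} \cdot \text{year}_{jk} + u_{00k} + u_{10k} \cdot \text{year}_{jk} + r_{0jk} + r_{1jk} \cdot \text{year}_{jk} + e_{tjk}$$
constructed with the inclusion of level-1, level-2, and level-3 explanatory variables and through a multilevel step-up strategy.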
Therefore, we can conclude that students’ school performance follows a linear trend throughout time. In addition, there is a significant variance of intercepts and slopes between those who study at the same school and between those who study at different schools. Students’ gender is significant to explain part of this variation. Professors’ years of teaching experience at each school also explains part of the discrepancies in the annual school performance between students from different schools.
Similar to Table 23.5, presented at the end of Section 23.5, Table 23.6 consolidates the general estimation routines, in SPSS, for multilevel models.
Table 23.6
Modeling | Intermediate Model | Routine in SPSS |
---|---|---|
Two-Level Hierarchical Linear Model with Clustered Data | Null Model (Nonconditional Model) | MIXED Y /FIXED = INTERCEPT /RANDOM = INTERCEPT \| SUBJECT(level2_var) . |
Two-Level Hierarchical Linear Model with Clustered Data | Random Intercepts Model | MIXED Y WITH X /FIXED = INTERCEPT X /RANDOM = INTERCEPT \| SUBJECT(level2_var) . |
Two-Level Hierarchical Linear Model with Clustered Data | Random Intercepts and Slopes Model | MIXED Y WITH X /FIXED = INTERCEPT X /RANDOM = INTERCEPT X \| SUBJECT(level2_var) . |
Two-Level Hierarchical Linear Model with Clustered Data | Random Intercepts and Slopes Model and Correlated Error Terms | MIXED Y WITH X /FIXED = INTERCEPT X /RANDOM = INTERCEPT X \| SUBJECT(level2_var) COVTYPE(UN) . |
Three-Level Hierarchical Linear Model with Repeated Measures | Null Model (Nonconditional Model) | MIXED Y /FIXED = INTERCEPT /RANDOM = INTERCEPT \| SUBJECT(level2_var) /RANDOM = INTERCEPT \| SUBJECT(level3_var) . |
Three-Level Hierarchical Linear Model with Repeated Measures | Linear Trend Model with Random Intercepts | MIXED Y WITH t /FIXED = INTERCEPT t /RANDOM = INTERCEPT \| SUBJECT(level2_var) /RANDOM = INTERCEPT \| SUBJECT(level3_var) . |
Three-Level Hierarchical Linear Model with Repeated Measures | Linear Trend Model with Random Intercepts and Slopes | MIXED Y WITH t /FIXED = INTERCEPT t /RANDOM = INTERCEPT t \| SUBJECT(level2_var) /RANDOM = INTERCEPT t \| SUBJECT(level3_var) . |
Three-Level Hierarchical Linear Model with Repeated Measures | Linear Trend Model with Random Intercepts and Slopes and Level-2 Variable | MIXED Y WITH t X /FIXED = INTERCEPT t X X*t /RANDOM = INTERCEPT t \| SUBJECT(level2_var) /RANDOM = INTERCEPT t \| SUBJECT(level3_var) . |
Three-Level Hierarchical Linear Model with Repeated Measures | Linear Trend Model with Random Intercepts and Slopes and Level-2 and Level-3 Variables | MIXED Y WITH t X W /FIXED = INTERCEPT t X W X*t W*t W*X*t /RANDOM = INTERCEPT t \| SUBJECT(level2_var) /RANDOM = INTERCEPT t \| SUBJECT(level3_var) . |
Three-Level Hierarchical Linear Model with Repeated Measures | Linear Trend Model with Random Intercepts and Slopes and Level-2 and Level-3 Variables and Correlated Error Terms | MIXED Y WITH t X W /FIXED = INTERCEPT t X W X*t W*t W*X*t /RANDOM = INTERCEPT t \| SUBJECT(level2_var) COVTYPE(UN) /RANDOM = INTERCEPT t \| SUBJECT(level3_var) COVTYPE(UN) . |
Note: X is a level-2 variable, W is a level-3 variable (whenever there is one), t is the temporal variable, and Y is the dependent variable. All the commands assume estimation through restricted maximum likelihood (the term /METHOD = REML is omitted).
Data mining is a broad theme that is only beginning to be explored in depth in the field of business. This chapter only provides a brief discussion about the concepts, processes, stages, tasks, and the types of methods and techniques it can employ.
In this context, we believe that one of the most recent and relevant modeling techniques within the data-mining environment is multilevel modeling. It allows researchers and managers to assess the relationship between a certain performance variable and one or more predictor variables that characterize different analysis levels. Moreover, each level is formed by individuals or groups nested into other groups, and so on. Since the variables of a certain group do not vary between the lower-level groups or individuals nested into that group, it is natural for many studies and constructs to use such models, as many datasets have nested data structures, such as those that simultaneously have student and school, company and country, municipality and state, or real estate and neighborhood characteristics.
Datasets with nested data structures can have many characteristics. The most common are those with absolute nesting, in which there are clustered data or data with repeated measures. In this chapter, we chose to present examples in which datasets are used to estimate two-level hierarchical linear models with clustered data and three-level hierarchical linear models with repeated measures. Building on these examples, we believe researchers will be able to estimate, for example, three-level models with clustered data, or even consider a higher number of analysis levels resulting from more complex nesting structures.
Multilevel models allow us to identify and analyze individual heterogeneities and the heterogeneities between the groups to which these individuals belong, making it possible to specify random components in each analysis level. This is the main difference from the traditional regression models estimated through OLS, which cannot consider the natural nesting of data and, consequently, generate biased parameter estimators.
Although many papers use multilevel models only to estimate null models to investigate the variance decomposition of the phenomenon being studied in the different analysis levels, the possibility of including explanatory variables that correspond to the different levels in the fixed and random effects components allows us to investigate possible relationships between these variables and the dependent variable. This makes it possible to establish new research objectives and interesting constructs.
Currently, it is possible to see a growing concern of software and tools manufacturers regarding the processing capability of commands and routines to estimate more complex multilevel models. We cannot forget to mention the important and educational software HLM (Hierarchical Linear and Nonlinear Modeling), produced by Scientific Software International (SSI) and developed by Professors Stephen Raudenbush (University of Michigan), Anthony Bryk (University of Chicago), and Richard Congdon (Harvard University).
To estimate multilevel models, as with any other modeling technique, the application must be accompanied by methodological rigor and care when analyzing the results, especially if these are meant for making forecasts. Choosing one estimation method instead of another can help researchers and managers select the most suitable model, adding value to their research and allowing new studies on the chosen topic to be carried out.
Discovering implicit and contextual patterns in ever larger volumes of data becomes an essential condition for organizations to be successful in competitive environments, and multilevel modeling contributes considerably to the set of techniques available for the data-mining process.
Variable | Description |
---|---|
country | A string variable that identifies the country. |
idcountry | Country code j. |
resdevel | Country’s investments in research and development, in % of the GDP (Source: World Bank). |
idstudent | Student code i. |
score | Science score obtained by the student in the competition (0 to 100). |
income | Student’s median household income per month (US$). |
By using this dataset, we would like you to:
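$$\text{score}_{ij} = b_{0j} + r_{ij}$$
$$b_{0j} = \gamma_{00} + u_{0j}$$
which results in:
$$\text{score}_{ij} = \gamma_{00} + u_{0j} + r_{ij}$$
$$\text{score}_{ij} = b_{0j} + b_{1j} \cdot \text{income}_{ij} + r_{ij}$$
$$b_{0j} = \gamma_{00} + u_{0j}$$
$$b_{1j} = \gamma_{10},$$
which results in:
$$\text{score}_{ij} = \gamma_{00} + \gamma_{10} \cdot \text{income}_{ij} + u_{0j} + r_{ij}$$
$$\text{score}_{ij} = b_{0j} + b_{1j} \cdot \text{income}_{ij} + r_{ij}$$
$$b_{0j} = \gamma_{00} + u_{0j}$$
$$b_{1j} = \gamma_{10} + u_{1j}$$
which results in:
$$\text{score}_{ij} = \gamma_{00} + \gamma_{10} \cdot \text{income}_{ij} + u_{0j} + u_{1j} \cdot \text{income}_{ij} + r_{ij}$$
$$\text{score}_{ij} = b_{0j} + b_{1j} \cdot \text{income}_{ij} + r_{ij}$$
$$b_{0j} = \gamma_{00} + u_{0j}$$
$$b_{1j} = \gamma_{10} + \gamma_{11} \cdot \text{resdevel}_{j}$$
which results in:
$$\text{score}_{ij} = \gamma_{00} + \gamma_{10} \cdot \text{income}_{ij} + \gamma_{11} \cdot \text{resdevel}_{j} \cdot \text{income}_{ij} + u_{0j} + r_{ij}$$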
Variable | Description |
---|---|
district | District code k. |
property | Property code j. |
lnp | Natural logarithm of the rental price per square meter (adjusted by the inflation, base year 1). |
year | Temporal variable (repeated measure) that corresponds to the period of monitoring (year 1 to 6). |
food | Is there a restaurant or food court in the building where the property is located? (No = 0; Yes = 1). |
space4 | Are there four or more parking spaces? (No = 0; Yes = 1). |
valet | Is there valet parking in the building where the property is located? (No = 0; Yes = 1). |
subway | Is there a subway station in the district where the property is located? (No = 0; Yes = 1). |
violence | Average mortality rate due to external causes in the district where the property is located (per 100,000 inhabitants). |
This dataset, in which periods (level 1) are nested into properties (level 2), and these into districts (level 3), is structured according to the logic presented in the following figure:
We would like you to:
ln(p)tjk=π0jk+etjk
π0jk=b00k+r0jk
b00k=γ000+u00k
which results in:
ln(p)tjk=γ000+u00k+r0jk+etjk
ln(p)tjk=π0jk+π1jk⋅yearjk+etjk
π0jk=b00k+r0jk
π1jk=b10k
b00k=γ000+u00k
b10k=γ100,
which results in the following expression:
ln(p)tjk=γ000+γ100⋅yearjk+u00k+r0jk+etjk
ln(p)tjk=π0jk+π1jk⋅yearjk+etjk
π0jk=b00k+r0jk
π1jk=b10k+r1jk
b00k=γ000+u00k
b10k=γ100+u10k
which results in:
ln(p)tjk=γ000+γ100⋅yearjk+u00k+u10k⋅yearjk+r0jk+r1jk⋅yearjk+etjk
ln(p)tjk=π0jk+π1jk⋅yearjk+etjk
π0jk=b00k+b01k⋅foodjk+b02k⋅space4jk+r0jk
π1jk=b10k+b11k⋅valetjk+r1jk
b00k=γ000+u00k
b01k=γ010
b02k=γ020
b10k=γ100+u10k
b11k=γ110,
which results in the following expression:
ln(p)tjk=γ000+γ100⋅yearjk+γ010⋅foodjk+γ020⋅space4jk+γ110⋅valetjk⋅yearjk+u00k+u10k⋅yearjk+r0jk+r1jk⋅yearjk+etjk
ln(p)tjk=π0jk+π1jk⋅yearjk+etjk
π0jk=b00k+b01k⋅foodjk+b02k⋅space4jk+r0jk
π1jk=b10k+b11k⋅valetjk+r1jk
b00k=γ000+γ001⋅subwayk+u00k
b01k=γ010
b02k=γ020
b10k=γ100+γ101⋅subwayk+γ102⋅violencek+u10k
b11k=γ110,
which results in the following expression:
ln(p)tjk=γ000+γ100⋅yearjk+γ010⋅foodjk+γ020⋅space4jk+γ001⋅subwayk+γ110⋅valetjk⋅yearjk+γ101⋅subwayk⋅yearjk+γ102⋅violencek⋅yearjk+u00k+u10k⋅yearjk+r0jk+r1jk⋅yearjk+etjk
As we have already discussed, the generalized linear latent and mixed models (GLLAMM), similar to the generalized linear models (GLM), encompass the hierarchical linear models (HLM) we studied throughout this chapter and the hierarchical nonlinear models (HNM). The latter refer to situations in which, given a nested data structure, the dependent variable is categorical or consists of count data, which is why we have chosen to present examples of logistic, Poisson, and negative binomial hierarchical nonlinear models in this Appendix. Fig. 23.57 shows the logic of the generalized linear latent and mixed models, highlighting the models that will be studied from now on.
Analogous to what we studied in Chapter 14, mixed effects logistic regression models can be used whenever the dependent variable is qualitative and dichotomic, the data are found in a certain nested structure (in levels), and there may be clustered data or data with repeated measures. In these situations, researchers can estimate a model aiming at capturing the relationship between the behavior of explanatory variables and the occurrence of the phenomenon being studied, represented by a dichotomic variable (dummy), as well as studying the variance decomposition of the random effects components due to the presence of a multilevel structure.
In this section, we will present a two-level hierarchical logistic model with clustered data. In general and from Expressions (14.10) and (23.23), we can define this model with two analysis levels. The first level offers explanatory variables X1, ..., XQ, which refer to each individual i (i = 1, ..., n), and the second level, explanatory variables W1, ..., WS that refer to each group j (j = 1, ..., J), invariable for the observations that belong to the same group, as follows:
$$\text{Level 1:}\quad p_{ij} = \frac{1}{1 + e^{-(b_{0j} + b_{1j} \cdot X_{1ij} + b_{2j} \cdot X_{2ij} + \ldots + b_{Qj} \cdot X_{Qij})}}$$
where pij represents the probability of occurrence of the event we are interested in for each observation i that belongs to a certain group j, and bqj (q = 0, 1, ..., Q) refer to the level-1 coefficients.
$$\text{Level 2:}\quad b_{qj} = \gamma_{q0} + \sum_{s=1}^{S_q} \gamma_{qs} \cdot W_{sj} + u_{qj}$$
where γqs (s = 0, 1, ..., Sq) refer to the level-2 coefficients, and uqj are the level-2 random effects, normally distributed, with mean equal to zero and variance τqq. Furthermore, possible independent error terms of uqj have a mean equal to zero and variance π²/3.
Let's now present an example. A global survey was carried out to investigate whether couples who reside in different countries differ when it comes to traveling abroad for tourism. To that end, data on 1,622 couples located in 50 countries were collected, including the average age of each couple and the number of children they have. Part of the dataset is presented in Table 23.7; the complete dataset can be found in the file Tourism.dta.
Table 23.7
Observation (Couple i - Level 1) | Country j Where the Couple Lives (Level 2) | Traveled Abroad for Tourism in the Last Year (Yij) | Couple’s Average Age (X1ij) | Number of Children (X2ij) |
---|---|---|---|---|
1 | France | Yes | 68 | 2 |
2 | France | Yes | 37 | 0 |
… | ||||
117 | France | Yes | 54 | 3 |
… | ||||
1,604 | Egypt | No | 55 | 2 |
1,605 | Egypt | No | 51 | 2 |
… | ||||
1,622 | Egypt | Yes | 39 | 0 |
After opening this file, we can type the command desc, which makes it possible to analyze the dataset characteristics, such as, the number of observations, the number of variables, and the description of each one of them. Fig. 23.58 shows this output in Stata.
Since the main goal of this Appendix is not to discuss the concepts presented throughout this chapter once again, let’s carry out the following estimation:
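$$p(\text{tourism})_{ij} = \frac{1}{1 + e^{-(b_{0j} + b_{1j} \cdot \text{age}_{ij} + b_{2j} \cdot \text{children}_{ij})}}$$
$$b_{0j} = \gamma_{00} + u_{0j}$$
$$b_{1j} = \gamma_{10}$$
$$b_{2j} = \gamma_{20},$$
which results in the random intercepts model:
$$p(\text{tourism})_{ij} = \frac{1}{1 + e^{-(\gamma_{00} + \gamma_{10} \cdot \text{age}_{ij} + \gamma_{20} \cdot \text{children}_{ij} + u_{0j})}}$$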
where the variable tourism is dichotomic (dummy), in which values equal to 1 correspond to the couples that traveled abroad for tourism in the last year, and values equal to 0 are the opposite.
To estimate this model in Stata, we must type the following command:
whose outputs can be found in Fig. 23.59.
Based on this figure, initially, we can see that we have 1622 observations (couples) nested into 50 groups (countries), which characterizes a two-level clustered data structure.
A more inquisitive researcher may verify that the parameter estimations of the fixed and random effects components are identical to the ones that would be obtained through the following command:
meglm tourism age children || country: , family(bernoulli) link(logit) nolog
where the term meglm means multilevel mixed effects generalized linear model. It is therefore necessary to define the family of distributions of the dependent variable, which in this case is Bernoulli, and the canonical link function, which in this situation is the logit.⁹
Moreover, the odds ratios of the fixed effects parameters can also be obtained directly, by typing the term or (odds ratio) at the end of the commands presented.
Given that the independent error terms of uqj have variance equal to π²/3, we can define the following intraclass correlation:
$$\text{rho} = \frac{\tau_{00}}{\tau_{00} + \frac{\pi^2}{3}} = \frac{0.255}{0.255 + \frac{\pi^2}{3}} = 0.072,$$
which suggests that approximately 7% of the total variance of the error terms is due to differences in the dependent variable's behavior between countries. After Stata 13, it is possible to obtain this intraclass correlation directly, by typing the command estat icc right after the estimation of the corresponding model.
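The arithmetic behind this intraclass correlation can be reproduced in a few lines. This Python sketch is only an illustration and is not how Stata computes estat icc internally; the function and constant names are assumptions, while 0.255 is the estimate of τ00 reported above.

```python
import math

# Variance of the standard logistic distribution, used as the level-1 variance.
LOGISTIC_RESIDUAL_VARIANCE = math.pi ** 2 / 3


def intraclass_correlation(tau00):
    """Share of the total variance attributable to the level-2 units."""
    return tau00 / (tau00 + LOGISTIC_RESIDUAL_VARIANCE)


rho = intraclass_correlation(0.255)
print(f"rho = {rho:.3f}")
```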
Even though Stata does not directly show the result of the z tests with their respective significance levels for the random effects parameters, the fact that the estimate of the variance component τ00, which corresponds to random intercepts u0j, is considerably higher than its standard error suggests that there is significant variation in the behavior of couples who reside in different countries when it comes to traveling abroad for tourism. Statistically, we can see that z = 0.255 / 0.088 = 2.90 > 1.96, where 1.96 is the critical value of the standard normal distribution for a significance level of 0.05.
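The z statistic above, together with an approximate two-sided p-value, can be checked with the Python standard library. This is a rough sketch using the normal approximation described in the text; the variable names are illustrative. Because a variance cannot be negative, this kind of test on a variance component should be read as approximate.

```python
import math

estimate = 0.255    # estimated variance of the random intercepts (tau00)
std_error = 0.088   # its standard error, as reported in the text

z = estimate / std_error

# Two-sided p-value under the standard normal distribution:
# 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
p_two_sided = math.erfc(abs(z) / math.sqrt(2))

print(f"z = {z:.2f}")
print(f"two-sided p-value = {p_two_sided:.4f}")
```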
Even if country variables that may possibly explain such behavior have not been considered, such as, cultural, economic, or social characteristics, we are able to verify that, while an increment in age increases the expected probability that couples will start traveling abroad for tourism, ceteris paribus, traveling decreases with the increment in the number of children, also ceteris paribus. The model estimated has the following expression:
$$p(\text{tourism})_{ij} = \frac{1}{1 + e^{-(0.439 + 0.015 \cdot \text{age}_{ij} - 0.424 \cdot \text{children}_{ij} + u_{0j})}}$$
At the bottom of Fig. 23.59, we can see, from the result of the likelihood-ratio test, that the estimation of this multilevel model is more suitable than the estimation of a traditional binary logistic regression model for the data in our example.
Therefore, we can obtain the expected probability values of the occurrence of the event being studied (traveling abroad for tourism) for each of the couples in the sample. In order to do that, we must type the following command, which generates a new variable (phat) in the dataset:
Besides, we can also obtain the error terms u0j, invariable for couples from the same country. In order to do that, we must type the following command:
which makes the new variable, u0, also be generated in the dataset.
The following command, which generates the outputs seen in Fig. 23.60, shows the values of phat and the error terms u0 only for the couples who reside in Brazil:
list country tourism phat u0 if country == "Brazil"
Only for educational purposes, researchers may verify that variable phat can also be generated through the following expression:
gen phat = (1) / (1 + exp(-(0.4393717 + 0.0150543*age - 0.4239421*children + u0)))
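The same expression can be evaluated outside Stata for a single couple. The Python sketch below is illustrative only: the function name and the couple's characteristics (age 40, two children, random intercept set to zero) are hypothetical, while the coefficients are the ones in the expression above.

```python
import math


def expected_probability(age, children, u0=0.0):
    """Expected probability of traveling abroad under the estimated model."""
    linear_predictor = 0.4393717 + 0.0150543 * age - 0.4239421 * children + u0
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# Hypothetical couple in a country whose random intercept u0j is zero.
p_two_children = expected_probability(age=40, children=2)
p_no_children = expected_probability(age=40, children=0)

print(f"two children: {p_two_children:.3f}")
print(f"no children:  {p_no_children:.3f}")
```

Holding age and country constant, the probability falls as the number of children rises, in line with the negative coefficient of children.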
Finally, we can construct a chart that shows, based on the variable children, the adjustments of curves S (sigmoid functions) of the expected probabilities that couples who reside in five specific countries, chosen based on their different locations around the globe, travel abroad for tourism. This chart, which can be seen in Fig. 23.61, is obtained by typing the following command:
graph twoway scatter phat children || mspline phat children if country =="France" || mspline phat children if country =="United States" || mspline phat children if country =="Japan" || mspline phat children if country =="South Africa" || mspline phat children if country =="Venezuela" ||, legend(label(2 "France") label(3 "United States") label(4 "Japan") label(5 "South Africa") label(6 "Venezuela"))
This chart clearly shows the different behavior of couples from different countries in relation to traveling abroad for tourism.
Analogous to what we studied in Chapter 15, mixed effects regression models for count data can be used when the dependent variable is quantitative but takes only discrete and non-negative values, and when the data have a nested structure (in levels), which may involve clustered data or data with repeated measures.
In this section, we will present a hierarchical model for count data with three levels and clustered data. In general and from Expressions (15.4), (23.30), and (23.31), we can define this three-level model. The first level contains level-1 explanatory variables Z1, ..., ZP, which refer to units i (i = 1, ..., n). The second level contains level-2 explanatory variables X1, ..., XQ, which refer to units j (j = 1, ..., J) and are invariable for the units that belong to the same group j. The third level contains level-3 explanatory variables W1, ..., WS, which refer to units k (k = 1, ..., K) and are invariable for the units that belong to the same group k. This model is as follows:
$$\text{Level 1:}\quad \ln(\lambda_{ijk}) = \pi_{0jk} + \pi_{1jk} \cdot Z_{1jk} + \pi_{2jk} \cdot Z_{2jk} + \ldots + \pi_{Pjk} \cdot Z_{Pjk}$$
where λ is the expected number of occurrences or the estimated average incidence rate of the phenomenon being studied for a certain exposure. πpjk (p = 0, 1, ..., P) refer to the level-1 coefficients, and Zpjk is the p-th level-1 explanatory variable for observation i in level-2 unit j and in level-3 unit k.
$$\text{Level 2:}\quad \pi_{pjk} = b_{p0k} + \sum_{q=1}^{Q_p} b_{pqk} \cdot X_{qjk} + r_{pjk}$$
where bpqk (q = 0, 1, ..., Qp) refer to the level-2 coefficients. Xqjk is the q-th level-2 explanatory variable for unit j in the level-3 unit k. rpjk are the level-2 random effects, assuming, for each unit j, that the vector (r0jk, r1jk, ..., rPjk)′ follows a multivariate normal distribution with each element having mean zero and variance $\tau_{r\pi pp}$.
$$\text{Level 3:}\quad b_{pqk} = \gamma_{pq0} + \sum_{s=1}^{S_{pq}} \gamma_{pqs} \cdot W_{sk} + u_{pqk}$$
where γpqs (s = 0, 1, ..., Spq) refer to the level-3 coefficients, Wsk is the s-th level-3 explanatory variable for unit k, and upqk are the level-3 random effects, assuming that for each unit k, the vector formed by terms upqk follows a multivariate normal distribution with each element having mean zero and variance $\tau_{u\pi pp}$.
Imagine that a national survey has been carried out to study the relationship between the number of traffic accidents and the average amount of alcohol ingested per inhabitant/day (in grams). The survey covered municipal districts throughout Brazil in the last year. It also aims to find out whether this relationship differs between districts located in different municipalities and in different states of the federation. To that end, data from 1,062 municipal districts located in 234 municipalities in all 27 units of the federation (26 states and the Federal District) were analyzed. Part of the dataset is presented in Table 23.8; the complete dataset can be found in the file Traffic_Accidents.dta.
Table 23.8
State k (Level 3) | Municipality j (Level 2) | Municipal district i (Level 1) | Number of Traffic Accidents in the Last Year (Yijk) | Average Amount of Alcohol Ingested per Inhabitant/Day, in Grams (Zjk) |
---|---|---|---|---|
AC | 1 | 1 | 9 | 12.57 |
AC | 2 | 2 | 10 | 13.36 |
... | ||||
AC | 3 | 11 | 2 | 12.33 |
... | ||||
TO | 231 | 1,052 | 2 | 11.94 |
TO | 231 | 1,053 | 3 | 10.54 |
... | ||||
TO | 234 | 1,062 | 5 | 11.74 |
Fig. 23.62 shows the output generated in Stata when we typed the command desc.
Following the logic presented in Chapter 15, initially, let’s construct a histogram for the variable accidents, which will be the dependent variable of the model to be proposed. In order to do that, we must type the following command, which generates the histogram in Fig. 23.63.
As studied in Chapter 15, before estimating models that involve count data, it is useful for researchers to assess whether the mean and variance of the dependent variable are equal, or at least close to one another. This gives them an idea of whether the Poisson model is suitable or whether it will be necessary to estimate a negative binomial model. The following command produces this preliminary diagnostic, whose results can be found in Fig. 23.64:
tabstat accidents, stats(mean var)
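For readers who want to see the logic of this diagnostic outside Stata, the Python sketch below compares the mean and the sample variance of a count variable. The counts are hypothetical and are not taken from Traffic_Accidents.dta; the variable names are also illustrative.

```python
import statistics

# Hypothetical counts, used only for illustration.
accidents = [0, 1, 2, 2, 3, 5, 9, 14, 20]

mean = statistics.mean(accidents)
variance = statistics.variance(accidents)  # sample variance (n - 1 denominator)

# A ratio well above 1 hints at overdispersion relative to a Poisson
# distribution, for which the mean and the variance are equal.
dispersion_ratio = variance / mean

print(f"mean = {mean:.2f}, variance = {variance:.2f}")
print(f"variance / mean = {dispersion_ratio:.2f}")
```

This comparison is only a preliminary indication; the formal assessment in the text relies on a likelihood-ratio test.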
Although the variance of the variable accidents is much higher than its mean, which indicates overdispersion in the data, we will initially estimate a Poisson model for educational purposes. In modeling the number of traffic accidents, one possibility would be to include dummy variables that represent municipalities and states in the fixed effects component; instead, we will treat them as random effects and estimate a multilevel Poisson regression model with three levels and random intercepts. The existence of overdispersion in the data, which would suggest that the multilevel negative binomial regression model is more suitable than the Poisson model, will be verified next through a likelihood-ratio test.
Therefore, let’s carry out the following estimation:
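$$\ln(\text{accidents}_{ijk}) = \pi_{0jk} + \pi_{1jk} \cdot \text{alcohol}_{jk}$$
$$\pi_{0jk} = b_{00k} + r_{0jk}$$
$$\pi_{1jk} = b_{10k}$$
$$b_{00k} = \gamma_{000} + u_{00k}$$
$$b_{10k} = \gamma_{100},$$
which results in the random intercepts model:
$$\ln(\text{accidents}_{ijk}) = \gamma_{000} + \gamma_{100} \cdot \text{alcohol}_{jk} + u_{00k} + r_{0jk}$$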
where the variable accidents represents the phenomenon being studied. It is quantitative and only has non-negative and discrete values (count data), indicating the incidence of traffic accidents in the last year in the municipal district i located in the municipality j of state k.
To estimate the model proposed in Stata, we must type the following command:
in which the insertion logic of the different levels follows the same nesting criterion discussed throughout this chapter, that is, from the highest to the lowest level, and these levels are separated by the terms ||. The outputs generated are shown in Fig. 23.65.
Based on this figure, initially, we can see the existence of a three-level unbalanced clustered data structure. Besides, the result of the likelihood-ratio test shows that there is significant variability between the districts located in different municipalities and states, which favors the use of the multilevel Poisson model in relation to a traditional Poisson regression model without random effects.
Before moving on, we can type the command estimates store mepoisson, which stores the results of this estimation for future comparison with the ones that will be obtained through the estimation of the negative binomial model. Moreover, we can also type predict lambda, which generates a new variable in the dataset (lambda) that corresponds to the estimated incidence of traffic accidents in the last year in each of the 1,062 municipal districts. Finally, researchers may also add the term irr (incidence rate ratio) at the end of the command presented, as studied in Chapter 15, so that the incidence rate ratios associated with each fixed effects parameter can be estimated.
An even more inquisitive researcher may verify that the parameter estimations of the fixed and random effects components are identical to the ones that would be obtained through the following command:
meglm accidents alcohol || state: || municipality: , family(poisson) link(log) nolog
which specifies, for the multilevel mixed effects generalized linear model (term meglm), that the distribution considered for the dependent variable is the Poisson and that the canonical link function is the logarithmic.
Even after the estimation of the random effects parameters, the number of traffic accidents may still present overdispersion. Thus, we must re-examine the data by estimating a negative binomial model, so that its results may be compared to the ones obtained by the estimation of the Poisson model. In order to do that, we must type the following command:
The results obtained are shown in Fig. 23.66.
At the bottom of this figure and from the result of the likelihood-ratio test, we can see that the estimation of this multilevel model is more suitable than the estimation of a traditional negative binomial regression model without random effects for the data in our example. In addition, all the fixed and random effects parameters are statistically different from zero, at a significance level of 0.05.
The estimation of the variances of u00k and r0jk resulted in smaller values than the respective values obtained when estimating the multilevel Poisson model (from 0.386 to 0.377 for u00k and from 0.083 to 0.061 for r0jk), a fact that is due to the addition of an overdispersion parameter that controls the variability of the data.
In Fig. 23.66, we can see that the estimation of lnalpha is presented. As studied in Chapter 15, remember that alpha (or ϕ), which is the conditional overdispersion of the data, represents the inverse of the shape parameter of the Gamma distribution. For the data in our example, we have $\hat{\alpha} = e^{-2.258} = 0.105$.
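The conversion from lnalpha to alpha, and what alpha implies for the conditional variance, can be sketched in Python as follows. The helper function assumes the mean-dispersion parameterization, in which the conditional variance is μ(1 + αμ); that assumption, the function name, and the value μ = 5 are illustrative and are not taken from Fig. 23.66.

```python
import math

lnalpha = -2.258             # estimate reported in the text
alpha = math.exp(lnalpha)    # conditional overdispersion parameter


def negative_binomial_variance(mu, alpha):
    """Conditional variance, ASSUMING the mean-dispersion form mu * (1 + alpha * mu)."""
    return mu * (1.0 + alpha * mu)


mu = 5.0  # hypothetical expected number of accidents
variance = negative_binomial_variance(mu, alpha)

print(f"alpha = {alpha:.3f}")
print(f"variance for mu = {mu}: {variance:.3f} (a Poisson model would imply {mu})")
```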
Analogously, the fixed and random effects parameters can also be obtained through the following command:
meglm accidents alcohol || state: || municipality: , family(nbinomial) link(log) nolog
In order to compare the estimations of the multilevel Poisson and negative binomial models, we must run a likelihood-ratio test, by typing the following command:
where the term mepoisson refers to the estimation of the Poisson model. Since we are comparing two different estimators (mepoisson and menbreg), we must use the term force when running this likelihood-ratio test. The result of the test can be seen in Fig. 23.67 and, through it, we can see that the negative binomial model is the most suitable, which indicates that there is overdispersion in the data.
Therefore, the expression of the estimated average number of traffic accidents per year, for a certain municipal district i in a certain municipality j in a state k, is given by:
$$u_{ijk} = e^{(0.754 + 0.047 \cdot \text{alcohol}_{jk} + u_{00k} + r_{0jk})}$$
where u represents the expected number of occurrences or the estimated average rate of incidence of traffic accidents for one year. In order for these estimated numbers to be generated in the dataset (new variable u), we can type the following command:
Besides, we can also obtain the error terms u00k (invariable for the districts located in the same state) and r0jk (invariable for the districts located in the same municipality). In order to do that, we must type the following command:
which makes two new variables, u00 and r0, be created in the dataset.
The following command, which generates the outputs in Fig. 23.68, shows the values of u, u00, and r0 only for the municipal districts in the State of Mato Grosso:
list state municipality accidents u u00 r0 if state =="MT", sepby(municipality)
Through this figure, we can see that the values of u00 are the same for all the municipal districts in the State of Mato Grosso, while the values of r0 are the same for all the municipal districts within each municipality.
Only for educational purposes, researchers may verify that variable u can also be generated through the following expression:
gen u = exp(0.7538477 + 0.0466768*alcohol + u00 + r0)
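The same calculation can be sketched in Python for a single municipal district. The function name and the inputs (12 grams of alcohol per inhabitant/day, random effects set to zero) are hypothetical; the coefficients are the ones in the expression above.

```python
import math


def expected_accidents(alcohol, u00=0.0, r0=0.0):
    """Expected yearly number of accidents under the estimated model."""
    return math.exp(0.7538477 + 0.0466768 * alcohol + u00 + r0)


# Hypothetical district with both random intercepts equal to zero.
expected = expected_accidents(alcohol=12.0)

# Multiplicative change in the expected count for one additional gram of alcohol.
rate_ratio = expected_accidents(alcohol=13.0) / expected

print(f"expected accidents: {expected:.2f}")
print(f"rate ratio per extra gram: {rate_ratio:.4f}")
```

The rate ratio equals the exponential of the alcohol coefficient, which is the quantity Stata reports when the term irr is added to the estimation command.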
Finally, we can construct a chart that compares the estimation adjustments of the traditional and multilevel negative binomial models. This chart, which can be seen in Fig. 23.69, is obtained by typing the following commands:
quietly nbreg accidents alcohol
predict utrad
graph twoway scatter accidents alcohol || mspline utrad alcohol || mspline u alcohol ||, legend(label(2 "Traditional Negative Binomial") label(3 "Multilevel Negative Binomial"))