Chapter 23

Data Mining and Multilevel Modeling

Abstract

This chapter presents a brief introduction to data mining and, within the context of modeling, it discusses multilevel models in detail, clarifying the circumstances in which they can be used. The main objective is to estimate the parameters of two-level hierarchical linear models with clustered data and of three-level models with repeated measures, as well as to offer the conditions for their correct interpretation. The results of the statistical and likelihood-ratio tests regarding these models are evaluated, which allows us to distinguish a multilevel model from a traditional regression model. From the concepts and techniques presented, we can propose models in which it is possible to identify the fixed and random effects on the dependent variable, understand the variance decomposition of multilevel random effects, and calculate and interpret the intraclass correlations of each analysis level. Understanding how nested structures of clustered data and of data with repeated measures work allows researchers and managers to define several types of constructs from which multilevel models can be used within the context of data mining. The multilevel models are estimated in Stata Statistical Software® and IBM SPSS Statistics Software®.

Keywords

Big data; KDD process; Data Mining; CRISP-DM; Multilevel modeling; Hierarchical linear models; Hierarchical nonlinear models; Mixed models; Nested models; Repeated measures; Variance decomposition; Fixed effects; Random effects; Likelihood-ratio test; Stata and SPSS

We must widen the circle of our love till it embraces the whole village; the village in its turn must take into its fold the district; the district the province; and so on, until the scope of our love becomes co-terminous with the world.

Mahatma Gandhi

23.1 Introduction to Data Mining

In this new millennium, with respect to the generation and availability of data, humankind has been witnessing and learning how to live with the simultaneous occurrence of five characteristics, or dimensions: data volume, velocity, variety, variability, and complexity.

Among other reasons, this excessive volume of data comes from the increase of technological capabilities, the increase of phenomena monitoring, and the emergence of social media. The velocity with which data become available for treatment and analysis, due to new collection methods that use electronic tags and radiofrequency antenna systems, is also visible and vital for decision-making processes in environments that are more and more competitive. Variety refers to the different formats in which data are accessed, such as texts, indicators, secondary datasets, or even speeches, and a converging analysis can foster a better decision-making process too. Beyond the three previous dimensions, data variability relates to cyclical or seasonal phenomena, sometimes of high frequency, directly observable or not, whose adequate treatment can generate differentiated information for researchers. Last but not least, data complexity, mainly for large volumes, resides in the fact that many sources can be accessed, with distinct codes, periodicities, or criteria, which forces researchers to maintain a managerial control process over the data in order to allow integrated analysis and decision making.

As shown in Fig. 23.1, the combination of these five data generation and availability dimensions is called Big Data, currently, a very frequent term in academic and business environments.

Fig. 23.1 Data Generation and Availability Dimensions, and Big Data.

These five dimensions that define Big Data cannot be supported without the enhancement of professional software packages that, in addition to offering enormous dataset processing capability, are able to elaborate the most diverse tests and models, adequate and robust for each situation, and according to what researchers and decision makers want. These are the main reasons why organizations from several different sectors have been investing in the structuring and development of multidisciplinary areas known as Business Analytics. These have the main goal of analyzing data and generating information, allowing pattern recognition and a real-time predictive capacity of the organization in relation to the market and to its competitors.

Within this perspective, with the emergence and improvement of complex and robust computer systems, and with the reduction in hardware and software prices, organizations have been storing more and more data. Data storing systems are constantly generated and enhanced, such as data warehouses, virtual libraries, and the web itself (Cios et al., 2007; Camilo and Silva, 2009).

According to Bramer (2016), NASA’s observation satellites generate around one terabyte of data a day, the Human Genome Project stores thousands of bytes for each of the billions of existing genetic datasets, financial institutions maintain repositories with millions of daily transactions done by their clients, and retailers control the flow of thousands of SKUs instantaneously. Nevertheless, excessive storing makes players from the most diverse areas ask how to treat the high volume and high variety of complex data generated with extreme velocity and variability. To answer this crucial question, in the 1980s, Data Mining emerged, aiming to propose technologies and treatments for situations in which traditional data exploration and analysis techniques are not enough or adequate.

As mentioned in Chapter 1, hierarchy between data, information, and knowledge has been present in all the discussions throughout this book. While data are transformed into information whenever treated and analyzed, knowledge is generated at the moment when such information is recognized and applied in decision making. As discussed by Fayyad et al. (1996), be it science, marketing, finance, health care, retail, or any other field, the classical approach to data analysis relies fundamentally on one or more analysts becoming intimately familiar with the data and serving as an interface between the data and the users and products. This manual probing of a data set is slow, expensive, and can be highly subjective, and, as data volumes grow dramatically, this type of manual data analysis is becoming completely impractical in many domains.

According to the same authors, we are witnessing the emergence of a new generation of computational theories and tools to assist humans in extracting useful information and knowledge from rapidly growing volumes of digital data. These theories and tools are the subject of the emerging field of knowledge discovery in databases (KDD). At an abstract level, KDD is concerned with the development of methods and techniques for making sense of data.

As stated in Han and Kamber (2000) and in Camilo and Silva (2009), KDD and Data Mining are synonyms, even though there is no consensus regarding the definition of these terms yet. For Fayyad et al. (1996) and Cios et al. (2007), while KDD includes all the phases for discovering knowledge from the existence of data, Data Mining is solely one of the phases of that process, as shown in Fig. 23.2.

Fig. 23.2 Stages of the KDD Process and Data Mining. (Source: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., 1996. From data mining to knowledge discovery in databases. AI Magazine 17 (3), 37–54.)

The data-mining stage of KDD currently relies heavily on known techniques from machine learning, pattern recognition, optimization, simulation, statistics, and multivariate analysis to find patterns from data (Fayyad et al., 1996). Data mining, thus, is a stage in the KDD process that consists in applying data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns (or models) over the data.

Following the logic proposed by Olson and Delen (2008), Camilo and Silva (2009), and Larose and Larose (2014), data mining can be structured in six phases, or stages, which form what is called CRISP-DM (cross-industry standard process for data mining):

1. Business understanding: knowledge about the business and about the market processes inherent to the business is fundamental in order to define the objectives of the data mining.

2. Data understanding: we must describe the data in a clear and objective way, always explaining their sources and possible interdependence behavior between the variables. As we studied in Part 5 of this book, exploratory techniques can be very useful in this phase.

3. Data preparation: preliminary analyses of the data, with possible treatment of outliers or missing values, can be extremely useful in order for the data-mining methods to be applied correctly. The clustering of variables itself or their categorization through a certain criterion can make one technique more suitable than another, respecting the analysis objectives.

4. Modeling: as discussed by Fávero and Belfiore (2017), several techniques can be applied, such as the preparation of exploratory techniques, the estimation of confirmatory models, or the implementation of algorithms, always based on the objectives proposed.

5. Analysis of results: it is essential for business experts, statisticians, and data scientists to take part in this phase, so that evaluations of the findings from the previous phase can be carried out, from the analysis of tests and validations (e.g., contingency tables, χ2 statistic, correlation matrices, Stepwise procedures, t-tests, among others).

6. Dissemination of results: after the modeling and output analysis, it is necessary for all those involved to be aware of the results found, so that it is possible to implement management procedures.

In a schematic way, Fig. 23.3 shows the phases that form the cross-industry standard process for data mining (CRISP-DM). Through it, it is possible to verify that the flow between the phases is not always unidirectional. That is, if, for example, a certain modeling is not possible due to the nature of the data, researchers can go back to the previous phase and prepare these data once again.

Fig. 23.3 Phases of the Cross-Industry Standard Process for Data Mining (CRISP-DM). (Sources: Olson, D., Delen, D., 2008. Advanced Data Mining Techniques. Springer, New York; Camilo, C.O., Silva, J.C., 2009. Mineração de dados: conceitos, tarefas, métodos e ferramentas. Technical Report RT-INF 001-09. Instituto de Informática, Universidade Federal de Goiás; Larose, D.T., Larose, C.D., 2014. Discovering Knowledge in Data: An Introduction to Data Mining. 2nd ed. John Wiley & Sons, New York; Fávero, L.P., Belfiore, P., 2017. Manual de análise de dados: estatística e modelagem multivariada com Excel®, SPSS® e Stata®. Elsevier, Rio de Janeiro.)

According to Linoff and Berry (2011), “although some data mining techniques are quite new, data mining itself is not a new technology, in the sense that people have been analyzing data on computers since the first computers were invented - and without computers for centuries before that.” Data mining has gone by many names, such as knowledge discovery, business intelligence, predictive modeling, and predictive analytics, but the following definition is one of the most utilized and accepted:

Data mining is a business process for exploring large amounts of data to discover meaningful patterns and rules.

In this sense, the main tasks of data mining are related to:

  •  Description (e.g., Statistical Summaries);
  •  Data Exploration and Visualization (e.g., Online Analytical Processing—OLAP, Construction of Maps);
  •  Classification and Prediction (e.g., Generalized Linear Models—GLM, Generalized Linear Latent and Mixed Models—GLLAMM, Artificial Neural Networks—NN);
  •  Clustering (e.g., Hierarchical Clustering, K-Means Clustering, Self-Organizing Maps—SOM, Decision Trees);
  •  Association Rule Mining (e.g., Factor Analysis, Simple and Multiple Correspondence Analysis, Multidimensional Scaling);
  •  Optimization and Simulation (e.g., Linear Programming, Network Programming, Integer Programming, Monte Carlo).
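As a toy illustration of the clustering task listed above, the following sketch implements a one-dimensional k-means in pure Python (the function name and the data are hypothetical; in practice, a library implementation from one of the packages cited in this chapter would be used):

```python
# Minimal 1-D k-means sketch: assign each value to the nearest center,
# then recompute each center as the mean of its assigned values.

def kmeans_1d(values, centers, n_iter=20):
    for _ in range(n_iter):
        clusters = {c: [] for c in range(len(centers))}
        for v in values:
            nearest = min(range(len(centers)), key=lambda c: abs(v - centers[c]))
            clusters[nearest].append(v)
        centers = [sum(vs) / len(vs) if vs else centers[c]
                   for c, vs in clusters.items()]
    return centers, clusters

# Hypothetical data with two clearly separated groups:
prices = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centers, clusters = kmeans_1d(prices, centers=[0.0, 5.0])
print(sorted(round(c, 2) for c in centers))  # -> [1.5, 10.5]
```

The same logic extends to several dimensions by replacing the absolute difference with a Euclidean distance.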

Many tools and software packages have been developed to facilitate the implementation of data mining by professionals from the most diverse fields. Among them, we can highlight Stata, IBM SPSS Modeler, RStudio, SAS Enterprise Miner, Pimiento, WEKA, KNIME, Dundas BI, Qlik Sense, Birst, DOMO, Orange, Microsoft SharePoint, Oracle Data Mining (ODM), Sisense, Salesforce Analytics Cloud, RapidMiner, LingPipe, IBM Cognos, and IBM DB2 Intelligent Miner.

Fig. 23.4 shows a screenshot of IBM SPSS Modeler with a Plot Spatial Data Extension, in which it is possible to see a range of interconnected advanced algorithms and techniques, and the map generated by the geospatial analysis.

Fig. 23.4 Data Mining and Geospatial Analytics in IBM SPSS Modeler.

Data mining is frequently and successfully applied in several fields of knowledge, as already discussed throughout this book. In addition, we can mention the following examples:

  •  Banking Sector: credit risk models and probability of default;
  •  Financial Sector: identification of standards in the behavior of financial asset prices;
  •  Marketing and CRM (Customer Relationship Management): identification of customers’ standards to increase retention rates;
  •  Retail: replacement and placement of products on shelves based on consumption standards;
  •  Medicine and Health: preparation of more precise diagnoses;
  •  Epidemiology: study of the dissemination and transmission of diseases in order to monitor and prevent them;
  •  Recruitment and selection of professionals: identification of the most suitable profiles for each position or function;
  •  Logistics: inventory management and vehicle routing based on demand fluctuations and peaks;
  •  Security: detection of terrorist and criminal activities;
  •  Public Policies: definition of priorities in the allocation of public resources and improvement of public management;
  •  Education: study of students’ performance and assistance in the preparation for college entrance exams;
  •  Social Media: monitoring to define new products, sales, and promotions.

As emphasized by Albright and Winston (2015), data mining is a huge topic that can fill a large book by itself, covering the role of data mining in real business problems, data warehousing, techniques, and software packages. The main goal of this chapter is to offer a brief overview of data-mining definitions and to present and discuss in detail a relevant and relatively recent technique, known as multilevel modeling. This technique helps researchers, managers, and practitioners focus their attention on suitable target areas during data-mining problem solving.

The myriad combinations of model specification make multilevel modeling an interesting data-mining tool, as it takes into account the influence of both the observations in the dataset and their contexts on the outcome variable, opening up new possibilities for prediction and exploratory work.

23.2 Multilevel Modeling

Multilevel regression models for panel data have become considerably important in several fields of knowledge, and papers that use estimations related to these models have been published more and more frequently. This is mainly due to the specification of research constructs that consider the existence of nested data structures, in which certain variables vary between the distinct units that represent groups, but not between observations that belong to the same group. The computational development itself, and the investments that certain manufacturers of data analysis software have made in the processing capacity needed to estimate multilevel models, also support researchers who are increasingly interested in this type of approach.

Imagine that a group of researchers is interested in studying how firms’ performance, measured, for example, by a certain profitability indicator, behaves in relation to certain company operational characteristics (size, investment, among others), and in relation to the characteristics of the industry in which each firm operates (participation in the GDP, tax and legal incentives, among others). Since sector characteristics do not vary among firms from the same industry, we characterize a two-level clustered data structure, with firms (level 1) nested into industries (level 2). Estimating a multilevel model may allow researchers to verify whether there are firm characteristics that explain possible performance differences between firms from the same industry, as well as whether there are sector characteristics that explain possible performance differences between firms from different industries.

Imagine that this study is expanded in order to investigate the temporal evolution of these firms’ performance. Different from longitudinal regression models for panel data, in which the variables change between observations and throughout time, assume that the dataset is structured only with firm variables (governance structure, production lines, among others) and industry variables (tax incidence, legislation, among others), which do not change during the period analyzed. Therefore, we characterize a three-level data structure with repeated measures, with periods (level 1) nested into firms (level 2), and these into sectors (level 3). From this structure, models can be estimated to investigate whether, throughout time, there is variability in performance between firms from the same sector and between firms from different sectors and, if so, whether there are firm and sector characteristics that explain this variability.

Theoretically, researchers can define a construct with a greater number of analysis levels, even if the interpretation of model parameters is not something trivial. For instance, imagine the study of school performance, throughout time, of students nested into schools, these nested into municipal districts, these into municipalities, and these into states of the federation. In this case, we would be working with six analysis levels (temporal evolution, students, schools, municipal districts, municipalities, and states).

The main advantage of multilevel models over traditional regression models, such as the ones estimated by OLS (Chapter 13), refers to the possibility of considering the natural nesting of the data. In other words, multilevel models allow us to identify and analyze individual heterogeneities and heterogeneities between the groups to which these individuals belong, making it possible to specify random components at each analysis level. For example, if companies are nested into sectors, it is possible to define a random component at the firm level and another one at the sector level, different from what a traditional regression model would allow, in which the effect of the sector on the firms’ performance would be considered homogeneous. Thus, multilevel models can also be called random coefficients models.
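To make this advantage concrete, consider a minimal simulation sketch in Python (numpy assumed; all values are hypothetical, and group dummies are used here as a simple stand-in for the random intercepts just described). Groups share a common slope but have different intercepts, and the ranges of x differ by group; a pooled OLS that ignores group membership mixes the within-group and between-group relationships, while allowing group-specific intercepts recovers the common slope:

```python
import numpy as np

rng = np.random.default_rng(42)

true_slope = 2.0
group_intercepts = [0.0, 10.0, 20.0]  # heterogeneity between groups

xs, ys, gs = [], [], []
for j, b0j in enumerate(group_intercepts):
    x = rng.uniform(5 * j, 5 * j + 5, size=200)  # x ranges differ by group
    y = b0j + true_slope * x + rng.normal(0, 1, size=200)
    xs.append(x)
    ys.append(y)
    gs.append(np.full(200, j))
x, y, g = np.concatenate(xs), np.concatenate(ys), np.concatenate(gs).astype(int)

# Pooled OLS: a single intercept for all observations (nesting ignored).
b_pooled, *_ = np.linalg.lstsq(np.column_stack([np.ones_like(x), x]), y, rcond=None)

# Group-specific intercepts (dummies) with a common slope.
X_group = np.column_stack([np.eye(3)[g], x])
b_group, *_ = np.linalg.lstsq(X_group, y, rcond=None)

# The pooled slope is pulled toward the between-group relationship,
# while the group-intercept model recovers the true within-group slope (2.0).
print(round(b_pooled[1], 2), round(b_group[-1], 2))
```

In practice, multilevel software such as Stata's mixed or SPSS's MIXED estimates the random intercepts jointly by maximum likelihood rather than through dummies; the sketch only illustrates why ignoring nesting distorts the estimates.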

According to Courgeau (2003), within a model structure with a single equation, there seems to be no connection between individuals and the society in which they live. In this sense, the use of level equations allows the researcher to “jump” from one science to another: students and schools, families and neighborhoods, firms and countries. Ignoring this relationship means elaborating incorrect analyses about the behavior of individuals and, equally, about the behavior of groups. Only the recognition of these reciprocal influences allows the correct analysis of the phenomena.

In this chapter, we will study multilevel models aimed at investigating the behavior of metric dependent variables (outcome variables), from which normally distributed residuals will be generated; these residuals, however, are not independent and do not have a constant variance. Therefore, our focus will be on linear multilevel models, also known as linear mixed models (LMM) or hierarchical linear models (HLM). This is the reason why multilevel models applied to data nested into two levels are also called HLM2, and why models applied to data nested into three levels are known as HLM3.

According to West et al. (2015), the name linear mixed models comes from the fact that these models present linear specification and the explanatory variables include a mix of fixed and random effects, that is, they can be inserted into components with fixed effects, as well as into components with random effects. While the estimated fixed effects parameters indicate the relationship between explanatory variables and the metric dependent variable, the random effects components can be represented by the combination of explanatory variables and nonobserved random effects.

In the Appendix of this chapter, a brief presentation on nonlinear multilevel models will be given, with applications in Stata of examples of logistic, Poisson, and negative binomial models.

Following the same logic as Chapters 13, 14, and 15, we will estimate all the models in this chapter in Stata. Moreover, we believe that also estimating them in SPSS may allow researchers to compare different software packages, procedures, and routines, as well as the logic with which the outputs are presented, enabling them to decide which software to use based on the characteristics of each one and on how accessible it is.

Hence, in this chapter, we will discuss multilevel regression models for panel data. Our main objectives here are: (1) to introduce the concepts of nested data structures; (2) to define the type of model to be estimated based on the characteristics of the data; (3) to estimate parameters through several methods in Stata and in SPSS; (4) to interpret the results obtained through several types of existing estimations for multilevel models; and (5) to define the most suitable estimation for diagnosing and forecasting effects in each of the cases studied. Initially, the main concepts inherent to each modeling will be presented. Next, the procedures for estimating the models in Stata and in SPSS will be discussed.

23.3 Nested Data Structures

Multilevel regression models allow us to investigate the behavior of a certain dependent variable Y, which represents the phenomenon we are interested in, based on the behavior of explanatory variables whose values may change, for clustered data, between observations and between the groups to which these observations belong, and, for data with repeated measures, throughout time. In other words, there must be variables whose data change between the individuals that represent a certain level, yet remain unchanged for certain groups of individuals, these groups representing a higher level.

First, imagine a dataset with data on n individuals, and each individual i = 1, ..., n belongs to one of the j = 1, ..., J groups, obviously n > J. Therefore, this dataset can have certain explanatory variables X1, ..., XQ that refer to each individual i, and other explanatory variables W1, ..., WS that refer to each group j, however, invariable for the individuals of a certain group. Table 23.1 shows the general model of a dataset with a two-level clustered/nested data structure (individual and group).

Table 23.1

General Model of a Dataset With a Two-Level Clustered/Nested Data Structure

Observation (Individual i), Level 1 | Group j, Level 2 | Y_ij      | X1_ij      | X2_ij      | ... | XQ_ij      | W1_j | W2_j | ... | WS_j
1                                   | 1                | Y_1,1     | X1_1,1     | X2_1,1     | ... | XQ_1,1     | W1_1 | W2_1 | ... | WS_1
2                                   | 1                | Y_2,1     | X1_2,1     | X2_2,1     | ... | XQ_2,1     | W1_1 | W2_1 | ... | WS_1
⋮
n1                                  | 1                | Y_n1,1    | X1_n1,1    | X2_n1,1    | ... | XQ_n1,1    | W1_1 | W2_1 | ... | WS_1
n1+1                                | 2                | Y_n1+1,2  | X1_n1+1,2  | X2_n1+1,2  | ... | XQ_n1+1,2  | W1_2 | W2_2 | ... | WS_2
n1+2                                | 2                | Y_n1+2,2  | X1_n1+2,2  | X2_n1+2,2  | ... | XQ_n1+2,2  | W1_2 | W2_2 | ... | WS_2
⋮
n2                                  | 2                | Y_n2,2    | X1_n2,2    | X2_n2,2    | ... | XQ_n2,2    | W1_2 | W2_2 | ... | WS_2
⋮
nJ−1+1                              | J                | Y_nJ−1+1,J | X1_nJ−1+1,J | X2_nJ−1+1,J | ... | XQ_nJ−1+1,J | W1_J | W2_J | ... | WS_J
nJ−1+2                              | J                | Y_nJ−1+2,J | X1_nJ−1+2,J | X2_nJ−1+2,J | ... | XQ_nJ−1+2,J | W1_J | W2_J | ... | WS_J
⋮
n                                   | J                | Y_n,J     | X1_n,J     | X2_n,J     | ... | XQ_n,J     | W1_J | W2_J | ... | WS_J

Based on Table 23.1, we can see that X1, ..., XQ are level-1 variables (data change between individuals), and W1, ..., WS are level-2 variables (data change between groups, but not between the individuals in each group). Furthermore, the numbers of individuals in groups 1, 2, ..., J are n1, n2 − n1, ..., n − nJ−1, respectively. Fig. 23.5 allows us to see the existing nesting between the level-1 units (individuals) and the level-2 units (groups), which characterizes the existence of clustered data.

Fig. 23.5 Two-level nested structure of clustered data.

If n1 = n2 − n1 = ... = n − nJ − 1, we will have a balanced nested data structure.
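The layout of Table 23.1 can be sketched with a toy dataset in Python (the firm and industry names and all values are hypothetical), checking the defining property that a level-2 variable is constant within each group:

```python
# Each row: (individual i, group j, Y_ij, X1_ij, W1_j).
# W1 is a level-2 variable: it repeats for every individual in a group.
dataset = [
    (1, "steel",   12.0, 3.1, 0.20),
    (2, "steel",    9.5, 2.4, 0.20),
    (3, "steel",   11.2, 2.9, 0.20),
    (4, "airline",  7.8, 1.7, 0.35),
    (5, "airline",  8.3, 1.9, 0.35),
    (6, "airline",  6.9, 1.2, 0.35),
]

# Check: a level-2 variable (W1) must take a single value within each group.
by_group = {}
for i, j, y, x1, w1 in dataset:
    by_group.setdefault(j, set()).add(w1)
assert all(len(ws) == 1 for ws in by_group.values())

# Balanced nested structure: every group has the same number of individuals.
sizes = {j: sum(1 for row in dataset if row[1] == j) for j in by_group}
print(sizes)  # -> {'steel': 3, 'airline': 3}
```

A dataset violating either check would require revisiting the level definitions before any multilevel model is specified.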

Imagine another dataset in which, in addition to the nesting presented for clustered data, there is temporal evolution, that is, data with repeated measures. Thus, besides the individuals, which will now belong to level 2 and therefore be indexed j = 1, ..., J, nested into k = 1, ..., K groups (which now form level 3), we will also have t = 1, ..., Tj periods in which each individual j is monitored. Consequently, this new dataset can have the same explanatory variables X1, ..., XQ that refer to each individual j; however, they are now invariable for each individual j during the monitoring periods. Moreover, it can also have the same explanatory variables W1, ..., WS that refer to each group k; however, they are also invariable throughout time for each group k. Table 23.2 shows the logic with which we can present a dataset with a three-level nested data structure with repeated measures (time, individual, and group).

Table 23.2

General Model of a Dataset With a Three-Level Nested Data Structure With Repeated Measures

Period t (Repeated Measure), Level 1 | Observation (Individual j), Level 2 | Group k, Level 3 | Y_tjk       | X1_jk | X2_jk | ... | XQ_jk | W1_k | W2_k | ... | WS_k
1                                    | 1                                   | 1                | Y_1,1,1     | X1_1,1 | X2_1,1 | ... | XQ_1,1 | W1_1 | W2_1 | ... | WS_1
2                                    | 1                                   | 1                | Y_2,1,1     | X1_1,1 | X2_1,1 | ... | XQ_1,1 | W1_1 | W2_1 | ... | WS_1
⋮
T1                                   | 1                                   | 1                | Y_T1,1,1    | X1_1,1 | X2_1,1 | ... | XQ_1,1 | W1_1 | W2_1 | ... | WS_1
T1+1                                 | 2                                   | 1                | Y_T1+1,2,1  | X1_2,1 | X2_2,1 | ... | XQ_2,1 | W1_1 | W2_1 | ... | WS_1
T1+2                                 | 2                                   | 1                | Y_T1+2,2,1  | X1_2,1 | X2_2,1 | ... | XQ_2,1 | W1_1 | W2_1 | ... | WS_1
⋮
T2                                   | 2                                   | 1                | Y_T2,2,1    | X1_2,1 | X2_2,1 | ... | XQ_2,1 | W1_1 | W2_1 | ... | WS_1
⋮
TJ−1+1                               | J                                   | K                | Y_TJ−1+1,J,K | X1_J,K | X2_J,K | ... | XQ_J,K | W1_K | W2_K | ... | WS_K
TJ−1+2                               | J                                   | K                | Y_TJ−1+2,J,K | X1_J,K | X2_J,K | ... | XQ_J,K | W1_K | W2_K | ... | WS_K
⋮
TJ                                   | J                                   | K                | Y_TJ,J,K    | X1_J,K | X2_J,K | ... | XQ_J,K | W1_K | W2_K | ... | WS_K

Based on Table 23.2, we can now see that the variable that corresponds to the period is a level-1 explanatory variable, since its data change in each row of the dataset; that X1, ..., XQ become level-2 variables (data change between individuals, but not for the same individual throughout time); and that W1, ..., WS become level-3 variables (data change between groups, but not for the same group throughout time). Furthermore, the numbers of periods in which individuals 1, 2, ..., J are monitored are T1, T2 − T1, ..., TJ − TJ−1, respectively. Analogous to what was exposed for the case with two levels, Fig. 23.6 allows us to see the existing nesting between the level-1 units (temporal variation), the level-2 units (individuals), and the level-3 units (groups), which characterizes a data structure with repeated measures.

Fig. 23.6 Three-level nested structure with repeated measures.

If T1 = T2 − T1 = ... = TJ − TJ − 1, we will have a balanced panel.
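Analogously, the three-level layout of Table 23.2 can be sketched as follows (all values hypothetical), checking that X variables are invariable within each individual over time and that W variables are invariable within each group:

```python
# Each row: (period t, individual j, group k, Y_tjk, X1_jk, W1_k).
rows = [
    (1, 1, 1, 10.0, 3.1, 0.2),
    (2, 1, 1, 10.8, 3.1, 0.2),
    (1, 2, 1,  9.1, 2.4, 0.2),
    (2, 2, 1,  9.6, 2.4, 0.2),
    (1, 3, 2,  7.5, 1.7, 0.5),
    (2, 3, 2,  7.9, 1.7, 0.5),
]

# X variables are level 2: fixed for each individual across the periods.
x_by_ind = {}
for t, j, k, y, x1, w1 in rows:
    x_by_ind.setdefault(j, set()).add(x1)
assert all(len(v) == 1 for v in x_by_ind.values())

# W variables are level 3: fixed for each group across periods and individuals.
w_by_group = {}
for t, j, k, y, x1, w1 in rows:
    w_by_group.setdefault(k, set()).add(w1)
assert all(len(v) == 1 for v in w_by_group.values())

# Balanced panel: every individual is observed for the same number of periods.
periods = {j: sum(1 for r in rows if r[1] == j) for j in x_by_ind}
print(periods)  # -> {1: 2, 2: 2, 3: 2}
```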

Through Tables 23.1 and 23.2, as well as through the corresponding Figs. 23.5 and 23.6, we can see that the data structures present absolute nesting. That is, a certain individual can be nested into only one group, that group into only one higher-level group, and so on. Nevertheless, there may be nested data structures with cross-classification, in which some observations of a group belong to one higher-level group while the others belong to another. For instance, imagine a study of the performance of firms nested into sectors and countries. There may be mining firms from Brazil and firms from other sectors, such as aviation, that also come from Brazil; however, if there are also mining firms from Australia in the dataset, for example, the nesting becomes cross-classified, making it necessary to estimate hierarchical cross-classified models (HCM). These models are not covered in the current edition of this book; however, researchers may study them in depth in Raudenbush and Bryk (2002), Raudenbush et al. (2004), and Rabe-Hesketh and Skrondal (2012a,b).

In Sections 23.5.1 and 23.6.1, we will estimate two-level hierarchical linear models with clustered data (HLM2) in Stata and SPSS, respectively. In Sections 23.5.2 and 23.6.2, we will estimate three-level hierarchical linear models with repeated measures (HLM3) in the same software packages. However, before that, it is necessary to present and discuss the algebraic formulations of each one of these models in the following section.

23.4 Hierarchical Linear Models

In this section, we will discuss the algebraic formulations and specifications of two-level hierarchical linear models with clustered data (Section 23.4.1) and three-level hierarchical linear models with repeated measures (Section 23.4.2).

23.4.1 Two-Level Hierarchical Linear Models With Clustered Data (HLM2)

In order to understand how the general expression of a two-level hierarchical linear model with clustered data is defined, we need to use a multiple linear regression model, whose specification, based on Expression (12.1), is presented here:

Yi = b0 + b1X1i + b2X2i + ... + bQXQi + ri    (23.1)

where Y represents the phenomenon being studied (dependent variable), b0 represents the intercept, b1, b2, ..., bQ are the coefficients of each variable, X1, ..., XQ are explanatory variables (metric or dummies), and r represents the error terms. The subscript i represents each one of the sample observations under analysis (i = 1, 2, ..., n, where n is the sample size). Note that some terms have a nomenclature different from the one proposed in Chapter 13 (for example, error terms), since another analysis level will be considered here to define the hierarchical modeling.
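As a quick numerical illustration of Expression (23.1), the sketch below (numpy assumed; all parameter values hypothetical) simulates a model with two explanatory variables and recovers the coefficients by least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate Y_i = b0 + b1*X1_i + b2*X2_i + r_i with b = (1, 2, -3).
n = 500
X1, X2 = rng.normal(size=n), rng.normal(size=n)
r = rng.normal(0, 0.5, size=n)
Y = 1.0 + 2.0 * X1 - 3.0 * X2 + r

# OLS estimation: column of ones for the intercept, then the X columns.
X = np.column_stack([np.ones(n), X1, X2])
b, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.round(b, 1))  # estimates approximately (1, 2, -3)
```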

The model represented by Expression (23.1) presents observations considered homogeneous, that is, they do not come from different groups that, for some reason, could influence the behavior of variable Y differently. Nevertheless, we could think of two groups of observations, from which two different models would be estimated, as follows:

Yi1 = b01 + b11X1i1 + b21X2i1 + ... + bQ1XQi1 + ri1    (23.2)

Yi2 = b02 + b12X1i2 + b22X2i2 + ... + bQ2XQi2 + ri2    (23.3)

where coefficients b01 and b02 represent the expected average values of Y for the observations of groups 1 and 2, respectively, when all the explanatory variables are equal to zero, and b11, b21, ..., bQ1 and b12, b22, ..., bQ2 are the coefficients of variables X1, ..., XQ in the model of each group (1 and 2), respectively. In addition to this, r1 and r2 represent the specific error terms in each model.

Therefore, for j = 1, ..., J groups, we can write the general expression of a regression model for clustered data, considered a first-level model, as follows:

Yij = b0j + b1jX1ij + b2jX2ij + ... + bQjXQij + rij = b0j + Σ_{q=1}^{Q} bqjXqij + rij    (23.4)

For educational purposes and aiming at constructing an illustrative chart, we can write the expression for the expected values of Y, that is, Ŷ, for each observation i that belongs to each group j, when there is only one explanatory variable X in the model proposed, as follows:

Group 1: Ŷi1 = β01 + β11Xi1    (23.5)

Group 2: Ŷi2 = β02 + β12Xi2    (23.6)

⋮

Group J: ŶiJ = β0J + β1JXiJ    (23.7)

where parameters β are the estimations of coefficients b, following the standard used in this book.

The chart in Fig. 23.7 shows the plotting of Expressions (23.5) to (23.7) in a conceptual way. Through it, we can see that the individual models that represent the observations of each group can have different intercepts and slopes, which may occur based on certain characteristics of the groups themselves.

Fig. 23.7 Individual models that represent the observations of each one of the J groups.

Thus, there must be invariable group characteristics (second level) for the observations that belong to each group (as shown in Table 23.1), which can explain the differences in the intercepts and slopes of the models that represent these groups. Therefore, based on the following regression model with one explanatory variable X and with observations nested into j = 1, ..., J groups:

$$Y_{ij} = b_{0j} + b_{1j}X_{ij} + r_{ij} \qquad (23.8)$$

we can write the intercept b0j and slope b1j expressions based on a certain explanatory variable W as follows, which represents a characteristic of the j groups:

  • Intercepts

$$\text{Group 1: } b_{01} = \gamma_{00} + \gamma_{01}W_1 + u_{01} \qquad (23.9)$$

$$\text{Group 2: } b_{02} = \gamma_{00} + \gamma_{01}W_2 + u_{02} \qquad (23.10)$$

$$\vdots$$

$$\text{Group } J\text{: } b_{0J} = \gamma_{00} + \gamma_{01}W_J + u_{0J} \qquad (23.11)$$

or, in a more general way:

$$b_{0j} = \gamma_{00} + \gamma_{01}W_j + u_{0j} \qquad (23.12)$$

where γ00 represents the expected value of the dependent variable for a certain observation i that belongs to a group j when X = W = 0 (general intercept), and γ01 represents the alteration in the expected value of the dependent variable for a certain observation i that belongs to a group j when there is a unit alteration in characteristic W of group j, ceteris paribus. Moreover, u0j represents the error terms that indicate that there is randomness in the intercepts, which can be generated by the presence of observations from different groups in the dataset.

  • Slopes

$$\text{Group 1: } b_{11} = \gamma_{10} + \gamma_{11}W_1 + u_{11} \qquad (23.13)$$

$$\text{Group 2: } b_{12} = \gamma_{10} + \gamma_{11}W_2 + u_{12} \qquad (23.14)$$

$$\vdots$$

$$\text{Group } J\text{: } b_{1J} = \gamma_{10} + \gamma_{11}W_J + u_{1J} \qquad (23.15)$$

or, in a more general way:

$$b_{1j} = \gamma_{10} + \gamma_{11}W_j + u_{1j} \qquad (23.16)$$

where γ10 represents the alteration in the expected value of the dependent variable for a certain observation i that belongs to a group j when there is a unit alteration in characteristic X of individual i, ceteris paribus (change in the slope because of X), and γ11 represents the alteration in the expected value of the dependent variable for a certain observation i that belongs to a group j when there is a unit alteration in the multiplication W.X, also ceteris paribus (change in the slope because of W.X). Besides, u1j represents the error terms that indicate that there is randomness in the slopes of the models regarding the groups, which can also be generated by the presence of observations from different groups in the dataset.

By combining Expressions (23.8), (23.12), and (23.16), we obtain the following expression:

$$Y_{ij} = \underbrace{(\gamma_{00} + \gamma_{01}W_j + u_{0j})}_{\text{random effects intercept}} + \underbrace{(\gamma_{10} + \gamma_{11}W_j + u_{1j})}_{\text{random effects slope}}X_{ij} + r_{ij}, \qquad (23.17)$$

which facilitates the visualization that the intercept and slope can be influenced by random effects resulting from the existence of observations that belong to different groups.

Essentially, multilevel models represent a set of techniques that, besides estimating the parameters of the proposed model, allow us to estimate the variance components of the error terms (in the case of the model in Expression (23.17), u0j, u1j, and rij), as well as their respective statistical significances, so that we can verify whether randomness in fact occurs in the intercepts and slopes as a result of the presence of higher analysis levels. If the variances of error terms u0j and u1j in the model in Expression (23.17) are not statistically significant, that is, if both are statistically equal to zero, it becomes suitable to estimate a linear regression model through traditional methods, such as OLS, since the existence of randomness in the intercepts and slopes cannot be verified.

We can assume that random effects u0j and u1j follow a multivariate normal distribution, with means equal to zero and variances equal to τ00 and τ11, respectively. Furthermore, error terms rij follow a normal distribution, with mean equal to zero and variance equal to σ2. Thus, we can define the following variance-covariance matrices for the error terms:

$$\operatorname{var}[u] = \operatorname{var}\begin{bmatrix} u_{0j} \\ u_{1j} \end{bmatrix} = \mathbf{G} = \begin{bmatrix} \tau_{00} & \sigma_{01} \\ \sigma_{01} & \tau_{11} \end{bmatrix} \qquad (23.18)$$

$$\operatorname{var}[r] = \operatorname{var}\begin{bmatrix} r_{1j} \\ \vdots \\ r_{nj} \end{bmatrix} = \sigma^2\mathbf{I}_n = \begin{bmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{bmatrix} \qquad (23.19)$$

These matrices will be used very soon when we discuss the methods for estimating multilevel model parameters. Therefore, we can define the relationship between the variances of these error terms, known as intraclass correlation, as follows:

$$\rho = \frac{\tau_{00} + \tau_{11}}{\tau_{00} + \tau_{11} + \sigma^2} \qquad (23.20)$$

This intraclass correlation measures the proportion of the total variance of Y that is due to the level-2 (group) structure. If it is equal to zero, there is no variance between the level-2 groups. However, if it is considerably different from zero, due to the presence of at least one significant error term resulting from the presence of level 2 in the analysis, traditional procedures for estimating model parameters, such as ordinary least squares (OLS), are not suitable. In the limit, a value equal to 1, that is, σ2 = 0, would suggest that there are no differences between the individuals within each group, that is, that all of them are identical, which is highly unlikely. This correlation is also called level-2 intraclass correlation.
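The intraclass correlation in Expression (23.20) can be computed directly from estimated variance components. A minimal sketch in Python, with purely illustrative variance values:

```python
# Level-2 intraclass correlation from estimated variance components,
# following Expression (23.20). The input values below are hypothetical.

def intraclass_correlation(tau00, tau11, sigma2):
    """Proportion of the total variance due to the level-2 structure."""
    between = tau00 + tau11   # variance of the level-2 random effects
    return between / (between + sigma2)

# If tau00 = tau11 = 0, rho = 0 and OLS estimation becomes adequate;
# if sigma2 = 0, rho = 1 and all individuals within a group are identical.
print(round(intraclass_correlation(10.0, 2.0, 28.0), 2))  # → 0.3
```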

In Section 23.5.1, we will use likelihood-ratio tests aiming at verifying if τ00 = τ11 = 0, which would favor the estimation of a traditional regression model, or at least if τ11 = 0, which would allow researchers to choose a random intercepts model (τ00 ≠ 0) instead of a random slopes model (τ11 ≠ 0).

We can rearrange Expression (23.17), in order to separate the fixed effects component, from which the model parameters are estimated, from the random effects component, from which the variances of the error terms are estimated. Therefore, we have:

$$Y_{ij} = \underbrace{\gamma_{00} + \gamma_{10}X_{ij} + \gamma_{01}W_j + \gamma_{11}W_jX_{ij}}_{\text{Fixed Effects}} + \underbrace{u_{0j} + u_{1j}X_{ij} + r_{ij}}_{\text{Random Effects}} \qquad (23.21)$$

which allows researchers to see more clearly that the random effects component can also influence the behavior of the dependent variable. We can even notice that one explanatory variable may be a part of this random component. By estimating such a multilevel model, we will see that, while fixed effects refer to the relationship between the behavior of certain characteristics and the behavior of Y, random effects allow us to analyze possible distortions in the behavior of Y between the units of the second analysis level.
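To make this decomposition concrete, the data-generating process behind Expression (23.21) can be simulated. All parameter and variance values below are hypothetical, chosen only to illustrate how the fixed and random components combine:

```python
# Illustrative simulation (not the chapter's dataset) of clustered data
# generated by the two-level model in Expression (23.21).
import numpy as np

rng = np.random.default_rng(0)
J, n_per_group = 50, 30
g00, g10, g01, g11 = 2.0, 1.5, 0.8, -0.3   # fixed effects (assumed values)
tau00, tau11, sigma2 = 4.0, 0.25, 1.0      # variance components (assumed values)

W = rng.normal(size=J)                      # level-2 characteristic of each group
u0 = rng.normal(0.0, np.sqrt(tau00), J)     # random intercepts u_0j
u1 = rng.normal(0.0, np.sqrt(tau11), J)     # random slopes u_1j

rows = []
for j in range(J):
    X = rng.normal(size=n_per_group)        # level-1 variable for group j
    r = rng.normal(0.0, np.sqrt(sigma2), n_per_group)
    # fixed effects component + random effects component, as in (23.21)
    Y = (g00 + g10 * X + g01 * W[j] + g11 * W[j] * X) + (u0[j] + u1[j] * X + r)
    rows.append(Y)

Y_all = np.concatenate(rows)
print(Y_all.shape)  # (1500,)
```

Fitting a multilevel model to data generated this way should recover the fixed effects and the three variance components, which is exactly what the estimation methods discussed next aim to do.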

In general and from Expression (23.4), we can define a model with two analysis levels, in which the first level offers explanatory variables X1, ..., XQ that refer to each individual i, and the second level, explanatory variables W1, ..., WS that refer to each group j, in the following way:

$$\text{Level 1: } Y_{ij} = b_{0j} + \sum_{q=1}^{Q} b_{qj}X_{qij} + r_{ij} \qquad (23.22)$$

$$\text{Level 2: } b_{qj} = \gamma_{q0} + \sum_{s=1}^{S_q} \gamma_{qs}W_{sj} + u_{qj} \qquad (23.23)$$

where q = 0, 1, ..., Q and s = 1, ..., Sq.

Concerning the estimation of the model, while fixed effects parameters are estimated in a traditional way in software packages such as Stata and SPSS, i.e., by the maximum likelihood estimation (MLE), as we studied in Chapters 14 and 15, the variance components of error terms can be estimated both by maximum likelihood and by restricted maximum likelihood (REML).

Parameter estimations through MLE or REML are computationally intensive, which is why we will not develop them algebraically in this chapter, as we did in Chapters 14 and 15 when we presented some practical examples. Nevertheless, both require the optimization of a certain objective function, which usually starts from initial values of the parameters and uses a sequence of iterations to find the parameters that maximize the previously defined likelihood function.

In order to present the concepts regarding the REML method, let's imagine, for example, a regression model with only one constant, where Yi (i = 1, ..., n) is a dependent variable that follows a normal distribution, with mean μ and variance σY2. While the maximum likelihood estimation of σY2 is obtained considering the n terms Yi − μ, the estimation of σY2 through REML is obtained from the first (n − 1) terms $Y_i - \bar{Y}$, whose distribution does not depend on μ. In other words, maximum likelihood methods applied to this last distribution generate an unbiased estimation of σY2, because this is the sample variance itself, obtained by dividing the sum of squares by (n − 1). This is the reason why the restricted maximum likelihood estimation is also known as estimation through reduced maximum likelihood.
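In this one-constant example, the contrast between the two estimators reduces to dividing the sum of squares by n (ML, biased) or by n − 1 (REML, unbiased). A quick numerical check with illustrative data:

```python
# ML vs. REML variance estimates for a model with only one constant.
# The data vector is illustrative.
import numpy as np

y = np.array([4.0, 7.0, 5.0, 9.0, 6.0])
n = y.size
ss = ((y - y.mean()) ** 2).sum()   # sum of squared deviations from the mean

ml_var = ss / n          # biased ML estimate (divides by n)
reml_var = ss / (n - 1)  # unbiased REML estimate (divides by n - 1)

# The REML estimate exceeds the ML estimate by the factor n/(n - 1).
print(ml_var, reml_var)
```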

In order to present the expressions of the likelihood and restricted likelihood functions, from which, through maximization, multilevel model parameters can be estimated, let’s write, in matrix notation, the general expression of a multilevel model with fixed and random effects:

$$\mathbf{Y} = \mathbf{A}\boldsymbol{\gamma} + \mathbf{B}\mathbf{u} + \mathbf{r} \qquad (23.24)$$

where Y is a vector n x 1 that represents the dependent variable, A is a matrix n x (q + s + q ∙ s + 1) with data from all the variables to be inserted into the model's fixed effects component, γ is a vector (q + s + q ∙ s + 1) x 1 with all the estimated fixed effects parameters, and B is a matrix n x (q + 1) with data from all the variables to be inserted into the random effects component, where u is a vector of random error terms with dimensions (q + 1) x 1 and variance-covariance matrix G. Besides, r is a vector n x 1 of error terms with mean zero and variance matrix $\sigma^2\mathbf{I}_n$. Based on Expressions (23.18) and (23.19), we can determine that:

$$\operatorname{var}\begin{bmatrix} \mathbf{u} \\ \mathbf{r} \end{bmatrix} = \begin{bmatrix} \mathbf{G} & \mathbf{0} \\ \mathbf{0} & \sigma^2\mathbf{I}_n \end{bmatrix} \qquad (23.25)$$

and, in this regard, the variance-covariance matrix n x n of Y, given by V, can be obtained as follows:

$$\mathbf{V} = \mathbf{B}\mathbf{G}\mathbf{B}' + \sigma^2\mathbf{I}_n \qquad (23.26)$$

From this matrix, as shown by Searle et al. (2006), the following expression of the logarithmic likelihood function can be defined, which must be maximized (MLE):

$$LL = -\frac{1}{2}\left[n\ln(2\pi) + \ln|\mathbf{V}| + (\mathbf{Y}-\mathbf{A}\boldsymbol{\gamma})'\mathbf{V}^{-1}(\mathbf{Y}-\mathbf{A}\boldsymbol{\gamma})\right] \qquad (23.27)$$

In addition, according to the same authors, from Expression (23.27), the expression of the logarithm of the restricted likelihood function is given by:

$$LL_r = LL - \frac{1}{2}\ln\left|\mathbf{A}'\mathbf{V}^{-1}\mathbf{A}\right| \qquad (23.28)$$

The fact that the REML method generates unbiased estimations of the error terms' variances in multilevel models may make researchers choose to use it unconditionally. However, likelihood-ratio tests based on estimations obtained through REML are not suitable for comparing models with different fixed effects specifications. For these situations, in which there is the intention of elaborating such tests, we recommend that the variances of the error terms be estimated through MLE, which is already the method used to estimate the model parameters. Besides, it is important to mention that the differences between the estimations of the error terms' variances obtained through REML or through MLE are practically nonexistent for large samples.

In the following section, we will discuss the specification of three-level hierarchical linear models with repeated measures, maintaining the logic proposed in this book.

23.4.2 Three-Level Hierarchical Linear Models With Repeated Measures (HLM3)

Following the logic proposed in the previous section, let’s present the specification of a three-level hierarchical linear model, in which there are data with repeated measures, that is, with temporal evolution in the dependent variable.

In general, and following the logic presented in Raudenbush et al. (2004), a three-level hierarchical model has three submodels, one for each analysis level of the nested data structure. Therefore, based on Expressions (23.22) and (23.23), we can define a general model with three analysis levels and nested data, in which the first level presents explanatory variables Z1, ..., ZP that refer to level-1 units i (i = 1, ..., n), the second level presents explanatory variables X1, ..., XQ that refer to level-2 units j (j = 1, ..., J), and the third level presents explanatory variables W1, ..., WS that refer to level-3 units k (k = 1, ..., K), as follows:

$$\text{Level 1: } Y_{ijk} = \pi_{0jk} + \sum_{p=1}^{P} \pi_{pjk}Z_{pijk} + e_{ijk} \qquad (23.29)$$

where πpjk (p = 0, 1, ..., P) refer to the level-1 coefficients, Zpijk is the p-th level-1 explanatory variable for observation i in the level-2 unit j and in the level-3 unit k, and eijk refers to the level-1 error terms, which follow a normal distribution with mean equal to zero and variance equal to σ2.

$$\text{Level 2: } \pi_{pjk} = b_{p0k} + \sum_{q=1}^{Q_p} b_{pqk}X_{qjk} + r_{pjk} \qquad (23.30)$$

where bpqk (q = 0, 1, ..., Qp) refer to the level-2 coefficients, Xqjk is the q-th level-2 explanatory variable for unit j in the level-3 unit k, and rpjk are the level-2 random effects, assuming, for each unit j, that the vector (r0jk, r1jk, ..., rPjk)′ follows a multivariate normal distribution with each element having mean zero and variance $\tau_{\pi_{pp}}$.

$$\text{Level 3: } b_{pqk} = \gamma_{pq0} + \sum_{s=1}^{S_{pq}} \gamma_{pqs}W_{sk} + u_{pqk} \qquad (23.31)$$

where γpqs (s = 0, 1, ..., Spq) refer to the level-3 coefficients, Wsk is the s-th level-3 explanatory variable for unit k, and upqk are the level-3 random effects, assuming that, for each unit k, the vector formed by terms upqk follows a multivariate normal distribution with each element having mean zero and variance $\tau_{b_{pp}}$, which results in a variance-covariance matrix Tb with a maximum dimension equal to:

$$\dim_{\max}\mathbf{T}_b = \left[\sum_{p=0}^{P}(Q_p+1)\right] \times \left[\sum_{p=0}^{P}(Q_p+1)\right], \qquad (23.32)$$

which depends on the number of level-3 coefficients specified with random effects.

In order to maintain the logic presented in the previous section and aiming at facilitating the understanding of the example that will be studied in Sections 23.5.2 and 23.6.2, let's imagine a single level-1 explanatory variable that corresponds to the periods in which the data of the dependent variable are monitored. In other words, level-2 units j nested into level-3 units k are monitored for a period t (t = 1, ..., Tj), which makes the dataset contain J time series, as shown in Table 23.2. The main objective is to verify if there are discrepancies in the temporal evolution of the data of the dependent variable and, if so, whether these happen due to characteristics of the level-2 and level-3 units. This temporal evolution is what characterizes the term repeated measures.

In this regard, Expression (23.29) can be rewritten as follows, in which subscripts i become subscripts t:

$$Y_{tjk} = \pi_{0jk} + \pi_{1jk}\,\text{period}_{tjk} + e_{tjk} \qquad (23.33)$$

where π0jk represents the intercept of the model that corresponds to the temporal evolution of the dependent variable of level-2 unit j nested into level-3 unit k, and π1jk corresponds to the average evolution (slope) of the dependent variable for the same unit throughout the period analyzed. The substructures that correspond to levels 2 and 3 remain with the same specifications as those respectively presented in Expressions (23.30) and (23.31).

The chart seen in Fig. 23.8 plots the set of models represented by Expression (23.33) in a conceptual way. Through it, we can see that the individual models that represent level-2 units j can present different intercepts and slopes throughout period t, a fact that may occur due to certain characteristics of the level-2 units j themselves or due to characteristics of the level-3 units k.

Fig. 23.8 Individual models that represent the temporal evolution of the dependent variable for each of the J level-2 units.

Thus, there must be characteristics of level-2 units j, temporally invariable, and of level-3 units k, invariable also for the level-2 units j nested into each level-3 unit k (as shown in Table 23.2), that can explain the differences in the intercepts and slopes of the models $\hat{Y}_{tjk} = \hat{\pi}_{0jk} + \hat{\pi}_{1jk}\,\text{period}_{tjk}$ represented in Fig. 23.8.

Hence, assuming that there is a single explanatory variable X that represents a characteristic of level-2 units j, and a single explanatory variable W that represents a characteristic of level-3 units k, from Expression (23.33) and based on Expressions (23.30) and (23.31), we can define the following model with three analysis levels. In this model, the first level refers to the measure repeated and only contains the temporal variable:

$$\text{Level 1: } Y_{tjk} = \pi_{0jk} + \pi_{1jk}\,\text{period}_{tjk} + e_{tjk} \qquad (23.34)$$

$$\text{Level 2: } \pi_{0jk} = b_{00k} + b_{01k}X_{jk} + r_{0jk} \qquad (23.35)$$

$$\pi_{1jk} = b_{10k} + b_{11k}X_{jk} + r_{1jk} \qquad (23.36)$$

$$\text{Level 3: } b_{00k} = \gamma_{000} + \gamma_{001}W_k + u_{00k} \qquad (23.37)$$

$$b_{01k} = \gamma_{010} + \gamma_{011}W_k + u_{01k} \qquad (23.38)$$

$$b_{10k} = \gamma_{100} + \gamma_{101}W_k + u_{10k} \qquad (23.39)$$

$$b_{11k} = \gamma_{110} + \gamma_{111}W_k + u_{11k} \qquad (23.40)$$

By combining Expressions (23.34) to (23.40), we obtain the following expression:

$$Y_{tjk} = \underbrace{(\gamma_{000} + \gamma_{001}W_k + \gamma_{010}X_{jk} + \gamma_{011}W_kX_{jk} + u_{00k} + u_{01k}X_{jk} + r_{0jk})}_{\text{random effects intercept}} + \underbrace{(\gamma_{100} + \gamma_{101}W_k + \gamma_{110}X_{jk} + \gamma_{111}W_kX_{jk} + u_{10k} + u_{11k}X_{jk} + r_{1jk})}_{\text{random effects slope}}\,\text{period}_{tjk} + e_{tjk} \qquad (23.41)$$

where γ000 represents the expected value of the dependent variable at the initial moment and when X = W = 0 (general intercept); γ001 represents the increase in the expected value of the dependent variable at the initial moment (alteration in the intercept) for a certain level-2 unit j that belongs to a level-3 unit k when there is a unit alteration in the characteristic W of k, ceteris paribus; γ010 represents the increase in the expected value of the dependent variable at the initial moment for a certain unit jk when there is a unit alteration in the characteristic X of j, ceteris paribus; and γ011 represents the increase in the expected value of the dependent variable at the initial moment for a certain unit jk when there is a unit alteration in the multiplication W.X, also ceteris paribus. Moreover, u00k and u01k represent the error terms that indicate that there is randomness in the intercepts, the latter associated with the alterations in variable X.

In addition, γ100 represents the alteration in the expected value of the dependent variable when there is a unit alteration in the analysis period (change in the slope due to a unit temporal evolution), ceteris paribus; γ101 represents the alteration in the expected value of the dependent variable due to a unit temporal evolution for a certain unit jk when there is a unit alteration in the characteristic W, ceteris paribus; γ110 represents the alteration in the expected value of the dependent variable due to a unit temporal evolution for a certain unit jk when there is a unit alteration in the characteristic X, ceteris paribus; and γ111 represents the alteration in the expected value of the dependent variable due to a unit temporal evolution for a certain unit jk when there is a unit alteration in the multiplication W.X, also ceteris paribus. Finally, u10k and u11k represent the error terms that indicate that there is randomness in the slopes, the latter associated with the alterations in variable X.

Expression (23.41) facilitates the visualization that the intercept and slope can be influenced by random effects resulting from different behaviors of the dependent variable throughout time for each of the level-2 units (different time series), and this phenomenon can be a result of these units’ characteristics, as well as of characteristics of the groups to which such units belong.

If researchers wish to analyze separately the fixed and random effects components that can influence the behavior of the dependent variable, which also facilitates the insertion of the commands to estimate multilevel models in Stata and in SPSS, as we will see, we just need to rearrange the terms of Expression (23.41) as follows:

$$Y_{tjk} = \underbrace{\gamma_{000} + \gamma_{001}W_k + \gamma_{010}X_{jk} + \gamma_{011}W_kX_{jk} + \gamma_{100}\,\text{period}_{tjk} + \gamma_{101}W_k\,\text{period}_{tjk} + \gamma_{110}X_{jk}\,\text{period}_{tjk} + \gamma_{111}W_kX_{jk}\,\text{period}_{tjk}}_{\text{Fixed Effects}} + \underbrace{u_{00k} + u_{01k}X_{jk} + u_{10k}\,\text{period}_{tjk} + u_{11k}X_{jk}\,\text{period}_{tjk} + r_{0jk} + r_{1jk}\,\text{period}_{tjk} + e_{tjk}}_{\text{Random Effects}} \qquad (23.42)$$

In three-level hierarchical models, we can define two intraclass correlations given the existence of two variance proportions. One corresponds to the behavior of the data that belong to the same level-2 units j and the same level-3 units k (level-2 intraclass correlation), and the other corresponds to the behavior of the data that belong to the same level-3 units k, however, from different level-2 units j (level-3 intraclass correlation). In Sections 23.5.2 and 23.6.2, we will calculate these intraclass correlations when we present some practical examples in Stata and in SPSS, respectively.
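The two intraclass correlations described above follow from the standard decomposition of the total variance into level-1, level-2, and level-3 components. A minimal sketch, with hypothetical variance components (tau_level2 for the level-2 random effects, tau_level3 for the level-3 random effects, and sigma2 for the level-1 error terms):

```python
# Level-2 and level-3 intraclass correlations in a three-level model,
# computed from hypothetical variance components.

def level3_icc(tau_level2, tau_level3, sigma2):
    # correlation between observations of the same level-3 unit k,
    # but from different level-2 units j
    total = tau_level3 + tau_level2 + sigma2
    return tau_level3 / total

def level2_icc(tau_level2, tau_level3, sigma2):
    # correlation between observations of the same level-2 unit j
    # (and therefore the same level-3 unit k)
    total = tau_level3 + tau_level2 + sigma2
    return (tau_level3 + tau_level2) / total

print(level3_icc(2.0, 3.0, 5.0))  # → 0.3
print(level2_icc(2.0, 3.0, 5.0))  # → 0.5
```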

From Expression (23.34) and as seen later, we can define the general expressions of the level-2 and 3 substructures of a hierarchical analysis with three levels and repeated measures, in which the second level offers explanatory variables X1, ..., XQ that refer to each unit j, and the third level, explanatory variables W1, ..., WS that refer to each unit k:

$$\text{Level 2: } \pi_{pjk} = b_{p0k} + \sum_{q=1}^{Q_p} b_{pqk}X_{qjk} + r_{pjk} \qquad (23.43)$$

$$\text{Level 3: } b_{pqk} = \gamma_{pq0} + \sum_{s=1}^{S_{pq}} \gamma_{pqs}W_{sk} + u_{pqk} \qquad (23.44)$$

Similar to what was discussed when the two-level hierarchical models were presented in the previous section, while the fixed effects parameters are estimated traditionally through maximum likelihood in software such as Stata and SPSS, the variance components of the error terms can be estimated both through maximum likelihood and through restricted maximum likelihood, as we will see in the following sections when we estimate three-level hierarchical models through these software packages.

Based on what has been exposed, in Section 23.5 we will estimate two-level hierarchical models with clustered data and three-level models with repeated measures in Stata. In Section 23.6, we will estimate the same models, however, in SPSS. The examples used follow the logic adopted throughout this book.

23.5 Estimation of Hierarchical Linear Models in Stata

The main objective of this section is to give researchers the opportunity to prepare multilevel modeling procedures through Stata Statistical Software. The use of the images in this section has been authorized by StataCorp LP©.

23.5.1 Estimation of a Two-Level Hierarchical Linear Model With Clustered Data in Stata

We will discuss an example that follows the same logic seen in Chapters 13, 14, and 15. Now, however, with data that vary between individuals and between groups to which these individuals belong, characterizing a nested structure.

Let’s imagine that our shrewd and talented professor is now interested in expanding his research to other schools. He has already explored the effects of certain explanatory variables regarding the time it takes a group of students to get to school, the probability of arriving late, and how many times students were late per week or per month through multiple regression, binary and multinomial logistic regression, and regression for count data models, respectively. Now, he wants to investigate if there are differences in the school performance behavior between students from different schools and, if yes, if these differences occur due to characteristics of the schools themselves.

In order to do this, the professor managed to get data on students' school performance (scores from 0 to 100, plus a bonus for participation in class). He collected data on 2,000 students from 46 schools. In addition, he also managed to get data on students' behavior, such as the number of hours spent studying per week, as well as data regarding the type of school (public or private) and professors' years of teaching experience. Part of the dataset can be seen in Table 23.3. The complete dataset can be found in the files PerformanceStudentSchool.xls (Excel) and PerformanceStudentSchool.dta (Stata).

Table 23.3

Example: School Performance and Students' (Level 1) and Schools' (Level 2) Characteristics

Student i (Level 1) | School j (Level 2) | Performance at School (Yij) | Number of Hours Spent Studying per Week (Xij) | Professors' Years of Teaching Experience (W1j) | Public or Private School (W2j)
1 | 1 | 35.4 | 11 | 2 | public
2 | 1 | 74.9 | 23 | 2 | public
... | ... | ... | ... | ... | ...
47 | 1 | 24.8 | 9 | 2 | public
48 | 2 | 41.0 | 13 | 2 | public
... | ... | ... | ... | ... | ...
72 | 2 | 65.2 | 20 | 2 | public
... | ... | ... | ... | ... | ...
121 | 4 | 66.4 | 20 | 9 | private
... | ... | ... | ... | ... | ...
140 | 4 | 93.4 | 27 | 9 | private
... | ... | ... | ... | ... | ...
1995 | 46 | 44.0 | 15 | 2 | public
... | ... | ... | ... | ... | ...
2000 | 46 | 56.6 | 17 | 2 | public

After opening the file PerformanceStudentSchool.dta, we can type the command desc, which makes it possible to analyze the dataset characteristics, such as the number of observations, the number of variables, and the description of each one of them. Fig. 23.9 shows this first output in Stata.

Fig. 23.9 Description of the PerformanceStudentSchool.dta Dataset.

First, we can obtain information on the number of students that were researched by the professor at each school, through the following command:

tabulate school, subpop(student)

The outputs can be found in Fig. 23.10 and, through them, we can see that, in this case, we have an unbalanced clustered data structure.

Fig. 23.10 Number of students per school.

Students’ average performance per school, which can be seen in Fig. 23.11, can be obtained through the following commands:

Fig. 23.11 Students' average performance per school.
bysort school: egen average_performance = mean(performance)

tabstat average_performance, by(school)

To conclude this initial diagnostic, we can construct a chart that allows the visualization of students’ average performance per school. This chart can be seen in Fig. 23.12 and can be obtained by typing the following command:

Fig. 23.12 Students' average performance per school.

graph twoway scatter performance school || connected average_performance school, connect(L) || , ytitle(performance at school)

Having characterized the nesting of students into schools based on our example’s clustered data, now, we can apply the multilevel modeling itself, constructing the procedures aiming at estimating a two-level hierarchical linear model (students and school). In the school performance modeling, even though a possibility is the inclusion of dummy variables that represent schools into the fixed effects component, let’s treat these level-2 units as random effects to estimate these models.

The first model to be estimated, known as null model or nonconditional model, allows us to check if there is variability in the school performance between students from different schools. This is because no explanatory variable will be inserted into the modeling, which only considers the existence of one intercept and error terms u0j and rij, with variances equal to τ00 and σ2, respectively. Therefore, the model to be estimated has the following expression:

  • Null Model

$$\text{performance}_{ij} = b_{0j} + r_{ij}$$

$$b_{0j} = \gamma_{00} + u_{0j},$$

which results in:

$$\text{performance}_{ij} = \gamma_{00} + u_{0j} + r_{ij}$$

For the data in our example, the command for estimating the null model in Stata is:

xtmixed performance || school: , var nolog reml

where the term xtmixed refers to the estimation of any hierarchical linear model, and the first variable to be inserted corresponds to the dependent variable, as in any other estimation of a regression model. Explanatory variables may be included afterwards. Furthermore, there is a second part of the command xtmixed that starts with the term ||. While the first part of the command corresponds to the fixed effects, the second part is related to the random effects that can be generated if there is a second analysis level, which, in this case, refers to the schools (hence the second part begins with the term school: ). The term var makes the estimations of the variances of the error terms u0j and rij (τ00 and σ2, respectively) be presented in the outputs, instead of the standard deviations. On the other hand, the term nolog makes the results of the iterations for the maximization of the logarithm of the restricted likelihood function not be presented in the outputs. Finally, researchers can also define the estimation method to be used through the terms reml (restricted maximum likelihood estimation) or mle (maximum likelihood estimation).1
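xtmixed itself maximizes the (restricted) likelihood function to obtain τ00 and σ2. Only as a rough illustration of the variance decomposition being estimated, a moment-based (one-way ANOVA) sketch on simulated, balanced clustered data, with variance values chosen to mimic the scale of this example:

```python
# Moment-based (one-way ANOVA) sketch of the null model's variance
# decomposition on simulated data; xtmixed uses (restricted) maximum
# likelihood instead, but estimates the same quantities.
import numpy as np

rng = np.random.default_rng(42)
J, n = 200, 25                                  # groups and observations per group
tau00_true, sigma2_true = 136.0, 348.0          # values chosen to mimic the example

u0 = rng.normal(0.0, np.sqrt(tau00_true), J)    # random intercept of each group
y = 61.0 + u0[:, None] + rng.normal(0.0, np.sqrt(sigma2_true), (J, n))

sigma2_hat = y.var(axis=1, ddof=1).mean()       # pooled within-group variance
# variance of group means = tau00 + sigma2/n, so subtract the sampling part:
tau00_hat = y.mean(axis=1).var(ddof=1) - sigma2_hat / n

rho = tau00_hat / (tau00_hat + sigma2_hat)
print(round(rho, 3))  # close to 136 / (136 + 348), i.e., roughly 0.28
```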

The outputs generated can be seen in Fig. 23.13.

Fig. 23.13 Outputs of the null model in Stata.

From the outputs in Fig. 23.13, initially, we can see that the estimation of parameter γ00 is equal to 61.049, which corresponds to students' expected average school performance (horizontal line estimated in the null model, or general intercept).2 Moreover, at the bottom of the outputs, the estimations of the variances of the error terms are presented: τ00 = 135.779 (in Stata, var(_cons)) and σ2 = 347.562 (in Stata, var(Residual)). Thus, based on Expression (23.20), we can calculate the following intraclass correlation:

$$\rho = \frac{\tau_{00}}{\tau_{00} + \sigma^2} = \frac{135.779}{135.779 + 347.562} = 0.281,$$

which suggests that approximately 28% of the total variance of the school performance is due to changes between schools, representing a first sign that there is variability in students’ school performance when they come from different schools. After Stata 13, it is possible to directly obtain this intraclass correlation by typing the command estat icc right after the estimation of the corresponding model.

Even though Stata does not directly show the result of the z tests with their respective significance levels for the random effect parameters, the fact that the estimation of variance component τ00, which corresponds to random intercepts u0j, is considerably higher than its standard error suggests that there is significant variation in the school performance between schools. Statistically, we can see that z = 135.779/30.750 = 4.416 > 1.96, where 1.96 is the critical value of the standardized normal distribution, which results in a significance level of 0.05.
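The two quick calculations above, the intraclass correlation and the informal z statistic, can be reproduced directly from the values reported in Fig. 23.13:

```python
# Reproducing the intraclass correlation and the informal z statistic
# from the null-model output values reported in Fig. 23.13.

tau00, se_tau00 = 135.779, 30.750   # var(_cons) and its standard error
sigma2 = 347.562                    # var(Residual)

rho = tau00 / (tau00 + sigma2)      # level-2 intraclass correlation
z = tau00 / se_tau00                # variance estimate over its standard error

print(round(rho, 3), round(z, 3))   # 0.281 4.416
```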

This piece of information is extremely important to support the choice of the hierarchical modeling, to the detriment of a traditional regression model estimate by OLS. Moreover, it is the main reason why a null model is always estimated when we carry out multilevel analyses.

At the bottom of Fig. 23.13, we can verify this fact, analyzing the result of the likelihood-ratio test (LR test). Since Sig. χ2 = 0.000, we can reject the null hypothesis that the random intercepts are equal to zero (H0: u0j = 0), which makes the estimation of a traditional linear regression model be ruled out for the clustered data in our example.

First, let’s investigate if the level-1 explanatory variable, hours, has any relationship to the school performance behavior of students from the same school (variation between students) and from different schools (variation between schools). A first diagnostic can be elaborated by typing the following command, which generates the chart seen in Fig. 23.14:

Fig. 23.14 School performance based on the variable hours (variation between students from the same school and between different schools).
statsby intercept =_b[_cons] slope =_b[hours], by(school) saving(ols, replace): reg performance hours
sort school
merge school using ols
drop _merge
gen yhat_ols = intercept + slope*hours
sort school hours
separate performance, by(school)
separate yhat_ols, by(school)
graph twoway connected yhat_ols1-yhat_ols46 hours || lfit performance hours, clwidth(thick) clcolor(black) legend(off) ytitle(performance at school)

The chart in Fig. 23.14 shows the linear adjustment by OLS, for each school, of the behavior of each student’s school performance based on the number of hours spent studying per week. We can see that, even though there is significant improvement in school performance as the number of hours spent studying per week increases (fortunately), this relationship is not the same for every school. Moreover, the intercepts of each model are clearly different.

Therefore, our duty is to investigate whether random effects occur in the intercepts and in the slopes generated by the variable hours, since there are several schools, and, if so, to investigate subsequently whether certain school characteristics can account for this fact. Note that this last command also generates a new file in Stata (ols.dta), in which the differences between the schools can be analyzed.
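The per-school OLS fits produced by the statsby command can be sketched in Python with a simple groupby. The DataFrame below is simulated, with hypothetical school-specific intercepts and slopes, but uses the same column names as the chapter's dataset (school, hours, performance):

```python
# Per-group OLS fits, analogous to the statsby command: one intercept and
# one slope per school. Data are simulated for illustration only.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
school = np.repeat(np.arange(10), 40)           # 10 hypothetical schools
hours = rng.uniform(5, 30, school.size)
intercepts = rng.normal(5.0, 4.0, 10)           # hypothetical school intercepts
slopes = rng.normal(3.0, 0.5, 10)               # hypothetical school slopes
performance = intercepts[school] + slopes[school] * hours \
    + rng.normal(0.0, 6.0, school.size)
df = pd.DataFrame({"school": school, "hours": hours, "performance": performance})

def ols_line(g):
    # np.polyfit with deg=1 returns (slope, intercept)
    slope, intercept = np.polyfit(g["hours"], g["performance"], deg=1)
    return pd.Series({"intercept": intercept, "slope": slope})

per_school = df.groupby("school").apply(ols_line)
print(per_school.head())
```

The resulting table plays the same role as the ols.dta file generated by statsby: one row per school, whose spread in intercepts and slopes is what the random effects will later try to capture.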

If researchers chose not to include random effects in the modeling, that is, if the likelihood-ratio test elaborated in the estimation of the null model did not reject H0 (u0j = 0), they would just need to type the following command, as discussed in Chapter 13, in order for our model parameters to be estimated:

reg performance hours

Only for educational purposes, the parameters estimated when we typed this last command (reg), whose outputs are not presented here, are the same as those that would be obtained through the following command:

xtmixed performance hours, reml

since the term xtmixed without the specification of random effects generates, through the restricted maximum likelihood estimation (term reml), parameters with values identical to the ones estimated by the ordinary least squares method (linear regression only with fixed effects).

Based on the logic proposed here, initially, let’s insert intercept random effects into our multilevel model, which will start having the following specification:

  • Random Intercepts Model

performanceij = b0j + b1j ⋅ hoursij + rij

b0j = γ00 + u0j

b1j = γ10,

which results in the following expression:

performanceij = γ00 + γ10 ⋅ hoursij + u0j + rij

For the data in our example, the command for estimating the random intercepts model in Stata is:

xtmixed performance hours || school: , var nolog reml

which generates the outputs seen in Fig. 23.15.

Fig. 23.15
Fig. 23.15 Outputs of the random intercepts model.

Similarly, at the top of the outputs, we can see the fixed effects of our model, which includes 46 separate intercepts (one for each school), even though they are not presented directly. At the bottom, we can see the estimation of the variances of error terms τ00 = 19.125 and σ2 = 31.764. This model’s intraclass correlation is calculated as follows:

rho = τ00/(τ00 + σ2) = 19.125/(19.125 + 31.764) = 0.376,

which shows an increase in the proportion of the variance component that corresponds to the intercept in relation to the null model, demonstrating the importance of including the variable hours to study the school performance behavior when comparing the schools. As already verified in the null model, the estimation of variance component τ00 is almost five times higher than its standard error (z = 19.125/4.199 = 4.555 > 1.96), suggesting that there may be a significant variation in the average school performance between schools due to the existence of random intercepts (the intercepts vary in a statistically significant way from school to school).
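Both quantities used in this assessment can be computed directly from the reported variance components; a minimal Python sketch (function names are ours):

```python
def icc(tau00, sigma2):
    """Intraclass correlation: share of the total residual variance
    attributable to the between-school (intercept) component."""
    return tau00 / (tau00 + sigma2)

def z_ratio(estimate, std_err):
    """Ratio of a variance component to its standard error, compared
    informally against 1.96 (5% critical value of the standard normal)."""
    return estimate / std_err
```

With the values above, icc(19.125, 31.764) returns approximately 0.376 and z_ratio(19.125, 4.199) approximately 4.555.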

By analyzing the result of the likelihood-ratio test (LR test), here, we can also reject the null hypothesis that the random intercepts are equal to zero (H0: u0j = 0), since Sig. χ2 = 0.000, proving that the estimation of a traditional linear regression model only with fixed effects must be ruled out.

Therefore, now, our model starts to have the following specification:

performanceij = 0.534 + 3.252 ⋅ hoursij + u0j + rij

where the fixed effect of the intercept now corresponds to the average expected school performance, between schools, of students who, for some reason, do not study (hoursij = 0). On the other hand, one more hour spent studying per week, on average, makes the expected mean of school performance, between schools, increase by 3.252 points, and this parameter is statistically significant.

Only for educational purposes, as this last estimation represents a model in which the random component only contains intercepts, the maximum likelihood method (not restricted) would generate parameter estimations identical to the ones that would be obtained through a traditional estimation considering longitudinal panel data. Furthermore, an even more inquisitive researcher would be able to verify that the preparation of a generalized linear latent and mixed model (GLLAMM) would also generate the same parameter estimations. In other words, the following three commands generate identical parameter estimations and estimations of the error terms’ variances:

  •  Multilevel Model with Maximum Likelihood Estimation

xtmixed performance hours || school: , var nolog mle

where the term mle means maximum likelihood estimation.

  •  Model for Panel Data with Maximum Likelihood Estimation
xtset school student
xtreg performance hours, mle
  •  Generalized Linear Latent and Mixed Model

gllamm performance hours, i(school) adapt

where the option adapt makes the adaptive quadrature process be used instead of the standard process of an ordinary Gauss-Hermite quadrature.
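As a side note, ordinary Gauss-Hermite quadrature approximates integrals of the form ∫ f(x)·e^(−x²) dx by a weighted sum over fixed nodes; the adaptive version recenters and rescales those nodes for each cluster. A minimal three-point sketch in Python (nodes and weights from the standard closed form; illustrative only, not gllamm's implementation):

```python
import math

# 3-point Gauss-Hermite rule: nodes 0 and ±sqrt(3/2), with known weights.
NODES = [-math.sqrt(1.5), 0.0, math.sqrt(1.5)]
WEIGHTS = [math.sqrt(math.pi) / 6, 2 * math.sqrt(math.pi) / 3,
           math.sqrt(math.pi) / 6]

def gauss_hermite(f):
    """Approximate the integral of f(x) * exp(-x^2) over the real line;
    exact for polynomials f of degree up to 5."""
    return sum(w * f(x) for w, x in zip(WEIGHTS, NODES))
```

For example, gauss_hermite(lambda x: x * x) recovers √π/2, the exact value of ∫ x²·e^(−x²) dx.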

It is important to mention that the generalized linear latent and mixed models (GLLAMM) are analogous to the generalized linear models (GLM) studied in Chapters 13, 14, and 15. That is, they are also extremely useful for estimating models in which the dependent variable is categorical or has count data, and there is a nested data structure. In the Appendix of this chapter, we will present some examples of logistic, Poisson, and negative binomial hierarchical nonlinear models. To study this topic in depth, we also recommend Rabe-Hesketh et al. (2002), Rabe-Hesketh and Skrondal (2012a,b), and Fávero and Belfiore (2017).

Going back to our random intercepts model (outputs in Fig. 23.15), we can store (estimates store command) the estimations obtained for future comparison to the ones that will be generated when we estimate a model with random intercepts and slopes. Besides, through the command predict, reffects, we can also obtain the expected values of random effects u0j, known as BLUPs (best linear unbiased predictions), since the command xtmixed does not show them directly. In order to do that, we can type the following sequence of commands:

quietly xtmixed performance hours || school: , var nolog reml
estimates store randomintercept
predict u0, reffects
desc u0
by student, sort: generate tolist = (_n ==1)
list student u0 if student <= 10 | student > 1990 & tolist

Fig. 23.16 shows the values of random intercept terms u0j for the first and last 10 students of the dataset. We can see that these error terms are invariant for students from the same school. However, they vary between schools, which characterizes the existence of one intercept for each school.

Fig. 23.16
Fig. 23.16 Random intercept terms u0j.

In order to provide a better visualization of the random intercepts per school, we can generate a chart by typing the following command:

graph hbar (mean) u0, over(school) ytitle("Random Intercepts per School")

This chart can be seen in Fig. 23.17.

Fig. 23.17
Fig. 23.17 Random intercepts per school.

Since we will still carry out additional estimations, in order to arrive at a more complete model with level-2 explanatory variables, we will postpone the commands that generate the predicted values of each student's school performance; this procedure will be carried out later.

Having checked that students’ school performance is influenced by the number of hours spent studying per week, and that there are differences in the model intercepts between schools, let’s now analyze whether the slopes are also different between the schools. Even though the charts in Figs. 23.14 and 23.17 allow us to see discrepant intercepts between schools clearly, the same cannot be said in relation to the slopes of the 46 linear adjustments. Nevertheless, this situation must be assessed from a statistical standpoint. Therefore, let’s insert slope random effects into our multilevel model which, by maintaining the intercept random effects, will start to have the following expression:

  •  Random Intercepts and Slopes Model

performanceij = b0j + b1j ⋅ hoursij + rij

b0j = γ00 + u0j

b1j = γ10 + u1j,

which results in:

performanceij = γ00 + γ10 ⋅ hoursij + u0j + u1j ⋅ hoursij + rij

For the data in our example, the command for estimating the model with random intercepts and slopes in Stata is:

xtmixed performance hours || school: hours, var nolog reml

Note that the variable hours inserted after the term school: (random component of the command xtmixed) comes from the term u1j.hoursij present in the specification of the multilevel model. The results obtained in this estimation can be seen in Fig. 23.18.

Fig. 23.18
Fig. 23.18 Outputs of the model with random intercepts and slopes.

We can see that the parameter and variance estimations in the model with random intercepts and slopes are practically identical to the ones obtained when the model parameters were estimated only with random intercepts (Fig. 23.15). This occurs because the estimation of the variance τ11 of random slope terms u1j is statistically equal to zero (an extremely low value and a considerably greater standard error, with values equal to zero for the confidence intervals).

Even though this fact is clear in this case, researchers may choose to elaborate the likelihood-ratio test to compare the estimations obtained through the random intercepts model and through the model with random intercepts and slopes. In order to do that, the following command must be typed:

estimates store randomslope

and, next, the command that will elaborate the test:

lrtest randomslope randomintercept

since the term randomintercept refers to the estimation carried out previously. The result of the test can be seen in Fig. 23.19.

Fig. 23.19
Fig. 23.19 Likelihood-ratio test to compare the estimations of the models with random intercepts and with random intercepts and slopes.

The significance level of the test is equal to 1.000 (much greater than 0.05) because the logarithms of both restricted likelihood functions are identical (LLr = − 6372.164), making the LR χ2 statistic, with 1 degree of freedom, equal to 0. The model that only has random effects in the intercept is favored, proving that the random error terms u1j are statistically equal to zero. It is important to mention, as the note at the bottom of Fig. 23.19 also explains, that this likelihood-ratio test is only valid when comparing estimations obtained through restricted maximum likelihood (REML) of two models with identical fixed effects specifications. Since, in our case, both models were estimated through REML and present the same fixed effects specification γ00 + γ10 ⋅ hoursij, the test is considered valid.3

Only for educational purposes, another way of analyzing the statistical significance of the error terms of a multilevel model is to insert the term estmetric at the end of the command xtmixed, as follows:

xtmixed performance hours || school: hours, estmetric nolog reml

The outputs generated can be seen in Fig. 23.20.

Fig. 23.20
Fig. 23.20 Estimation of the parameters of the model with random intercepts and slopes, using the term estmetric.

The fixed effects parameter estimations are identical to the ones obtained previously. However, the term estmetric makes Stata report the estimations of the natural logarithm of the standard deviations of the error terms, instead of the variances of these terms, with the respective z statistics and their significance levels, which facilitates the interpretation of the statistical significance of each random term.

For the term rij, for example, instead of presenting the estimation of its variance σ2 = 31.764 (Fig. 23.18), the estimation of the natural logarithm of its standard deviation is presented, such that:

ln(√31.764) = ln(5.636) = 1.729
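This variance-to-log-standard-deviation conversion is easy to verify; a minimal Python sketch (the function name is ours):

```python
import math

def log_sd(variance):
    """Natural logarithm of the standard deviation implied by a variance,
    the metric in which estmetric reports residual terms."""
    return math.log(math.sqrt(variance))
```

With the residual variance above, log_sd(31.764) returns approximately 1.729.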

Therefore, we can prove that the random slope terms u1j are statistically equal to zero at a confidence level of 95%, for example, since Sig. z = 0.978 > 0.05.

At this moment, another pertinent discussion is related to the structure of the random effects (u0j and u1j) variance-covariance matrix. Since we did not specify any covariance structure for these error terms, Stata assumes, through the command xtmixed, that this structure is independent, that is, that cov(u0j, u1j) = σ01 = 0. In other words, based on Expression (23.18) and in the outputs shown in Fig. 23.18, we have:

G = var[u] = var[(u0j, u1j)′] = [τ00, 0; 0, τ11] = [19.125, 0; 0, 8.37 × 10−14]

Nevertheless, we can generalize the structure of matrix G, allowing u0j and u1j to be correlated, that is, that cov(u0j , u1j) = σ01 ≠ 0. In order to do that, we just need to add the term covariance(unstructured) to the command xtmixed, such that:

xtmixed performance hours || school: hours, covariance(unstructured) var nolog reml

The new outputs generated can be seen in Fig. 23.21.

Fig. 23.21
Fig. 23.21 Estimation of the parameters of the model with random intercepts and slopes, with correlated random effects u0j and u1j.

The new estimations of the error terms’ variances generate the following variance-covariance matrix:

var[u] = var[(u0j, u1j)′] = [τ00, σ01; σ01, τ11] = [20.750, − 0.040; − 0.040, 7.59 × 10−5],

which can also be obtained through the following command:

estat recovariance

whose outputs can be seen in Fig. 23.22.

Fig. 23.22
Fig. 23.22 Variance-covariance matrix with correlated random effects u0j and u1j.

Even though the estimation of the covariance between u0j and u1j is cov(u0j, u1j) = σ01 = − 0.040 ≠ 0, a more inquisitive researcher will see, by including the term estmetric at the end of the last xtmixed command typed (without the term var), that this covariance is not statistically significant. In fact, the output, not presented here, will show the nonsignificance of the inverse hyperbolic tangent of the correlation between these two error terms.
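For reference, the transformation involved is the inverse hyperbolic tangent (Fisher's z), which maps a correlation in (−1, 1) onto the whole real line; a minimal Python sketch:

```python
import math

def fisher_z(r):
    """Inverse hyperbolic tangent of a correlation r in (-1, 1);
    equals 0.5 * ln((1 + r) / (1 - r))."""
    return math.atanh(r)
```

The significance test reported under estmetric is carried out on this transformed scale rather than on the correlation itself.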

Another way of verifying the nonsignificance of the correlation between the error terms is through a new likelihood-ratio test, which compares the estimation of the random intercepts and slopes model with independent error terms u0j and u1j (Fig. 23.18) with that of the same model with correlated error terms (Fig. 23.21), that is, with an unstructured variance-covariance matrix. In order to do that, we must type the following sequence of commands:

estimates store randomslopeunstructured

lrtest randomslopeunstructured randomslope

The result of this test can be seen in Fig. 23.23.

Fig. 23.23
Fig. 23.23 Likelihood-ratio test to compare the estimations of random intercepts and slopes models with independent and correlated error terms u0j and u1j.

The χ2 statistic for the test, with 1 degree of freedom, can also be obtained through the following expression:

χ2(1) = (− 2LLr,ind) − (− 2LLr,unstruc) = [− 2 ⋅ (− 6372.164)] − [− 2 ⋅ (− 6372.111)] ≈ 0.11

That is, we have Sig. χ2(1) = 0.744 > 0.05. Therefore, in this example, we can state that the structure of the variance-covariance matrix between u0j and u1j can be considered independent.
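The statistic and its significance can also be reproduced from the two restricted log-likelihoods; a hedged Python sketch (for 1 degree of freedom, the chi-square upper tail reduces to a complementary error function):

```python
import math

def lr_test_1df(ll_restricted, ll_general):
    """Likelihood-ratio statistic and p-value for one extra parameter."""
    chi2 = (-2 * ll_restricted) - (-2 * ll_general)
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square(1) upper tail
    return chi2, p_value
```

With the values above, lr_test_1df(-6372.164, -6372.111) gives a statistic of approximately 0.11 and a p-value of approximately 0.74.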

However, more than this, we can see that the estimated variance of u1j is statistically equal to zero, making the random intercepts model more suitable than the random intercepts and slopes model for our data.

Therefore, at this moment, let’s insert the variables texp and priv (level-2 explanatory variables - school) into our random intercepts model, such that the new specification of the hierarchical model will be as follows:

  • Complete Random Intercepts Model

performanceij = b0j + b1j ⋅ hoursij + rij

b0j = γ00 + γ01 ⋅ texpj + γ02 ⋅ privj + u0j

b1j = γ10 + γ11 ⋅ texpj + γ12 ⋅ privj,

which results in the following expression:

performanceij = γ00 + γ10 ⋅ hoursij + γ01 ⋅ texpj + γ02 ⋅ privj + γ11 ⋅ texpj ⋅ hoursij + γ12 ⋅ privj ⋅ hoursij + u0j + rij

Thus, initially, we need to generate two new variables, which correspond to the multiplication of texp by hours and priv by hours. The following commands generate these two variables (texphours and privhours):

gen texphours = texp*hours

gen privhours = priv*hours

Next, we can estimate our complete random intercepts model by typing the following command:

xtmixed performance hours texp priv texphours privhours || school: , var nolog reml

The outputs are shown in Fig. 23.24.

Fig. 23.24
Fig. 23.24 Outputs of the complete model with random intercepts.

When we analyze the estimated fixed effects parameters, we can see that those corresponding to the variables texphours and privhours are not statistically different from zero at a significance level of 0.05. Since there is no stepwise procedure associated with the command xtmixed in Stata, we have to exclude the variable texphours manually (that is, the variable texp from the expression of slope b1j), because it is the one whose estimated parameter presented the highest Sig. z. Therefore, the new model has the following expression:

performanceij = b0j + b1j ⋅ hoursij + rij

b0j = γ00 + γ01 ⋅ texpj + γ02 ⋅ privj + u0j

b1j = γ10 + γ11 ⋅ privj,

which results in:

performanceij = γ00 + γ10 ⋅ hoursij + γ01 ⋅ texpj + γ02 ⋅ privj + γ11 ⋅ privj ⋅ hoursij + u0j + rij

whose estimation can be obtained by typing the following command:

xtmixed performance hours texp priv privhours || school: , var nolog reml

The new outputs can be seen in Fig. 23.25.

Fig. 23.25
Fig. 23.25 Outputs of the final complete model with random intercepts without the variable texphours.

Note that, even though the estimated parameter γ11 related to the variable privhours is not statistically significant at a significance level of 0.05, it is at a significance level of 0.10. Only for educational purposes, we will consider this higher significance level at this moment, in order to continue the analysis with at least one level-2 variable (priv) in the expression of slope b1j, even if we have to do it without random effects in this slope. Therefore, the expression of our final estimated model with random intercepts and level-1 and level-2 explanatory variables is:

performanceij = − 2.710 + 3.281 ⋅ hoursij + 0.866 ⋅ texpj − 5.610 ⋅ privj − 0.080 ⋅ privj ⋅ hoursij + u0j + rij

A more inquisitive researcher could question the fact that the estimated parameter of variable priv is negative. Bear in mind that this fact only occurs in the presence of the other explanatory variables, because the correlation between performance and priv is positive and statistically significant, at a significance level of 0.05, which proves that students from private schools end up having better school performance, on average, than students from public schools.

Next, we can obtain the expected BLUPs (best linear unbiased predictions) of random effects u0j of our final model by typing:

predict u0final, reffects

which generates a new variable in the dataset, which is called u0final. Besides, we can also obtain the expected values of each student’s school performance by typing the following command:

predict yhat, fitted

which defines the variable yhat, which can also be obtained by the command:

gen yhat = -2.71035 + 3.281046*hours + .8662029*texp - 5.610535*priv - .0801207*privhours + u0final
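For readers working outside Stata, the same fitted values can be reproduced with the estimated coefficients; a Python sketch (the function name is ours, coefficients copied from the output above, u0 is the school's BLUP):

```python
def yhat(hours, texp, priv, u0):
    """Predicted school performance from the final random intercepts
    model; coefficients taken from the estimation output."""
    return (-2.71035 + 3.281046 * hours + 0.8662029 * texp
            - 5.610535 * priv - 0.0801207 * priv * hours + u0)
```

For instance, a public-school student (priv = 0) adds 3.281 points of expected performance per additional weekly hour of study.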

The following command generates a chart (Fig. 23.26) with the predicted values of each student’s school performance based on the number of hours spent studying per week for the 46 schools under analysis. Through it, we can see that the intercepts are different (random effects), however, without any discrepancies in the slopes.

Fig. 23.26
Fig. 23.26 Predicted school performance values based on the variable hours for the final complete model with random intercepts.

graph twoway connected yhat hours, connect(L)

Finally, Fig. 23.27 shows the values of the intercepts and slopes of the linear adjustments of the predicted values of the average school performance for each of the 46 schools, where it is possible to prove the existence of random effects in the intercepts and only of fixed effects in the slopes. This figure can be obtained by typing the following sequence of commands:

Fig. 23.27
Fig. 23.27 Random effects in the intercepts and fixed effects in the slopes (presented is the identification of the first observation at each school).
generate interceptfinal = _b[_cons] + u0final
generate slopefinal = _b[hours] + _b[privhours]*priv
by school, sort: generate group = (_n ==1)
list school interceptfinal slopefinal if group == 1

Hence, we can conclude that there are differences in the school performance behavior between students from the same school and from different schools. In addition, these differences occur based on the number of hours each student spends studying per week, on the type of school (public or private), and on the professors’ years of teaching experience at each school.

We chose to use the strategic multilevel analysis proposed by Raudenbush and Bryk (2002), and by Snijders and Bosker (2011). That is, first, we studied the variance decomposition from the definition of a null model (nonconditional model) so that, afterwards, a random intercepts model and a random intercepts and slopes model could be estimated. Finally, from the definition of the random nature of the error terms, we estimated the complete model by including level-2 variables into the analysis. This procedure is known as multilevel step-up strategy.

Next, let’s estimate a three-level hierarchical linear model, in which the nesting of data will be characterized due to the presence of repeated measures, that is, there is temporal evolution in the behavior of the dependent variable.

23.5.2 Estimation of a Three-Level Hierarchical Linear Model With Repeated Measures in Stata

Let’s discuss an example that follows the same logic of the previous section. However, now, with data that vary throughout time, between individuals, and between the groups to which these individuals belong, characterizing a nested structure with repeated measures.

Imagine that our highly qualified professor is now interested in expanding his research, monitoring students’ school performance for a certain period, in order to investigate if there is variability in this performance throughout time between students from the same school and between those from different schools. And, if yes, if there are certain student and school characteristics that explain this variability.

Therefore, 15 schools volunteered to provide data on their students’ school performance (scores from 0 to 100) in the last four years, a total of 610 students. In addition, the professor also included each student’s gender in the dataset, in order to verify if there are differences in school performance resulting from this variable. The variable regarding professors’ years of teaching experience, for each school, remains in the study. Part of the dataset can be seen in Table 23.4. The complete dataset, however, can be found in the files PerformanceTimeStudentSchool.xls (Excel) and PerformanceTimeStudentSchool.dta (Stata).

Table 23.4

Example: School Performance Throughout Time (Level 1—Repeated Measure) and Students' (Level 2) and Schools’ (Level 3) Characteristics

Student j (Level 2) | School k (Level 3) | Performance at School (Ytjk) | Year t (Level 1) | Gender (Xjk) | Professors’ Years of Teaching Experience (Wk)
1   | 1  | 35.4  | 1 | male   | 2
1   | 1  | 44.4  | 2 | male   | 2
1   | 1  | 46.4  | 3 | male   | 2
1   | 1  | 52.4  | 4 | male   | 2
…   | …  | …     | … | …      | …
121 | 4  | 66.4  | 1 | female | 9
121 | 4  | 66.4  | 2 | female | 9
121 | 4  | 74.4  | 3 | female | 9
121 | 4  | 79.4  | 4 | female | 9
…   | …  | …     | … | …      | …
610 | 15 | 87.6  | 1 | female | 9
610 | 15 | 92.6  | 2 | female | 9
610 | 15 | 94.6  | 3 | female | 9
610 | 15 | 100.0 | 4 | female | 9

After opening the file PerformanceTimeStudentSchool.dta, we can type the command desc, which allows us to analyze the characteristics in the dataset, such as, the number of observations, the number of variables, and the description of each one of them. Fig. 23.28 shows this output in Stata.

Fig. 23.28
Fig. 23.28 Description of the PerformanceTimeStudentSchool.dta Dataset.

Following the logic proposed in the previous section, initially, let’s analyze the number of students monitored by the professor in each period (year), by using the following command:

tabulate year, subpop(student)

The outputs are shown in Fig. 23.29 and, through them, we can see that we have a balanced panel data, since all 610 students are monitored in the four periods.

Fig. 23.29
Fig. 23.29 Number of students monitored in each period.

The chart in Fig. 23.30, obtained by typing the following command, allows us to analyze the temporal evolution of the school performance of the first 50 students in the sample:

Fig. 23.30
Fig. 23.30 Temporal evolution of the school performance of the first 50 students in the sample.

graph twoway connected performance year if student <= 50, connect(L)

This chart already allows us to see that the temporal evolutions of the school performance have different intercepts and slopes between students, which justifies the use of multilevel modeling and provides grounds for including intercept and slope random effects in level 2 of the models that will be estimated.

Besides, students’ average performance in the four periods can be analyzed in Figs. 23.31 and 23.32, obtained from the following commands. Through them, it is possible to verify that there is a growing behavior, approximately linear, of students’ school performance throughout time, and this is the reason why the variable year is also inserted, with a linear specification, into level 1 of the modeling, as we will see later.

Fig. 23.31
Fig. 23.31 Students’ average school performance in each period.
Fig. 23.32
Fig. 23.32 Evolution of students’ average school performance in each period.
bysort year: egen average_performance = mean(performance)

tabstat average_performance, by(year)

graph twoway scatter performance year || connected average_performance year, connect(L) || , ytitle(performance at school)

So that we can more convincingly justify estimating a three-level hierarchical model, let’s construct a chart (Fig. 23.33) that shows the temporal evolution of the average school performance at each school. In order to do that, we can type the following sequence of commands:

Fig. 23.33
Fig. 23.33 Temporal evolution of students’ average school performance at each school (linear adjustment through OLS).
statsby intercept =_b[_cons] slope =_b[year], by(school) saving(ols, replace): reg performance year
sort school
merge school using ols
drop _merge
gen yhat_ols = intercept + slope*year
sort school year
separate performance, by(school)
separate yhat_ols, by(school)
graph twoway connected yhat_ols1-yhat_ols15 year || lfit performance year, clwidth(thick) clcolor(black) legend(off) ytitle(performance at school)

This chart shows the linear adjustment through OLS, for each school, of the school performance behavior throughout time. It also provides grounds for including intercept and slope random effects in level 3 of the models that will be estimated, since the temporal evolutions of the school performance also present different intercepts and slopes between schools. Note that the last sequence of commands generates a new file in Stata (ols.dta), in which the differences in the school performance behavior between the schools, in terms of temporal intercepts and slopes, can be analyzed.

Having characterized the temporal nesting of the students from different schools in the data with repeated measures in our example, initially, let’s estimate a null model (nonconditional model) that allows us to check if there is variability in the school performance between students from the same school and between those from different schools. No explanatory variable will be inserted into the modeling, which only considers the existence of one intercept and of error terms u00k, r0jk, and etjk, with variances equal to τu000, τr000, and σ2, respectively. The model to be estimated has the following expression:

  • Null Model

performancetjk = π0jk + etjk

π0jk = b00k + r0jk

b00k = γ000 + u00k

which results in:

performancetjk = γ000 + u00k + r0jk + etjk

The command to estimate this null model in Stata is:

xtmixed performance || school: || student: , var nolog reml

which, as we can see, now shows two random effects components, one that corresponds to level 3 (school) and another to level 2 (student). It is important to highlight that the order in which the random effects components are inserted into the command xtmixed is decreasing when there are more than two levels. That is, we must begin with the highest data nesting level and continue until the lowest level (level 2). The outputs obtained can be seen in Fig. 23.34.

Fig. 23.34
Fig. 23.34 Outputs of the null model in Stata.

At the top of Fig. 23.34, we can initially verify that we have a balanced panel, since, for each student, the minimum and maximum numbers of monitoring periods are equal to four, with a mean also equal to four.
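Balance can also be checked programmatically: every student must appear in the same number of periods. A hedged Python sketch (names are ours, not part of the dataset):

```python
from collections import Counter

def panel_is_balanced(observations):
    """observations: iterable of (student_id, year) pairs; the panel is
    balanced if every student is observed in the same number of periods."""
    counts = Counter(student for student, _ in observations)
    return len(set(counts.values())) == 1
```

A panel where one student misses a year would return False, flagging an unbalanced structure.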

In relation to the fixed effects component, we can see that the estimation of parameter γ000 is equal to 68.714, which corresponds to students’ expected average annual school performance (the horizontal line estimated in the null model, or general intercept).

At the bottom of the outputs, the estimations of the variances of the error terms are presented: τu000 = 180.194 (in Stata, var(_cons) for school), τr000 = 325.799 (in Stata, var(_cons) for student), and σ2 = 41.649 (in Stata, var(Residual)).

Thus, we can define two intraclass correlations, given the existence of two variance proportions. The first refers to the correlation between the values of the variable performance in periods t and t′ for a certain student j from a certain school k (level-2 intraclass correlation). The other refers to the correlation between the values of the variable performance in periods t and t′ for different students j and j′ from a certain school k (level-3 intraclass correlation). Therefore, we have:

  •  Level-2 intraclass correlation

rhostudent|school = corr(Ytjk, Yt′jk) = (τu000 + τr000)/(τu000 + τr000 + σ2) = (180.194 + 325.799)/(180.194 + 325.799 + 41.649) = 0.924

  •  Level-3 intraclass correlation

rhoschool = corr(Ytjk, Yt′j′k) = τu000/(τu000 + τr000 + σ2) = 180.194/(180.194 + 325.799 + 41.649) = 0.329
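Both intraclass correlations follow directly from the three variance components; a Python sketch (the function name is ours):

```python
def three_level_iccs(tau_u, tau_r, sigma2):
    """Return (level-2, level-3) intraclass correlations for a
    three-level model with variance components tau_u (school),
    tau_r (student), and sigma2 (residual)."""
    total = tau_u + tau_r + sigma2
    return (tau_u + tau_r) / total, tau_u / total
```

With the null-model estimates, three_level_iccs(180.194, 325.799, 41.649) returns approximately (0.924, 0.329).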

From Stata 13 onward, it is possible to obtain these intraclass correlations directly, by typing the command estat icc right after the estimation of the corresponding model.

Hence, the correlation between annual school performances for the same school is equal to 32.9% (rhoschool), and the correlation between annual school performances for the same student of a certain school is equal to 92.4% (rhostudent|school). Therefore, for the model without explanatory variables, while the annual school performance is weakly correlated between schools, it becomes strongly correlated when the calculation is carried out for the same student from a certain school. In this last case, we estimate that student and school random effects account for approximately 92% of the total residual variance!

Regarding the statistical significance of these variances, the fact that the estimated values of τu000, τr000, and σ2 are considerably higher than their respective standard errors suggests that there is significant variation in the annual school performance between students and between schools. More specifically, we can see that all of these ratios are higher than 1.96, the critical value of the standardized normal distribution at a significance level of 0.05.

As discussed in Section 23.5.1, this information is essential to underpin the choice of multilevel modeling in this example, instead of a simple and traditional regression model through OLS. At the bottom of Fig. 23.34, we can verify this fact by analyzing the result of the likelihood-ratio test (LR test). Since Sig. χ2 = 0.000, we can reject the null hypothesis that the random intercepts are equal to zero (H0: u00k = r0jk = 0), which rules out the estimation of a traditional linear regression model for the data with repeated measures in our example.

Even though researchers frequently skip the estimation of null models, analyzing their results may help to reject (or not) the research hypotheses, and may even suggest adjustments to the proposed constructs. For the data in our example, the results of the null model allow us to state that there is significant variability in the school performance throughout the four years of the analysis. Furthermore, there is significant variability in the school performance, throughout time, both between students of the same school and between students from different schools. By themselves, these findings can reject or support research hypotheses and be used to structure a piece of work, depending on the researcher’s objectives, without additional models being necessary.

In addition to what has been discussed, since our main objective is to verify if there are student and school characteristics that would explain the variability in the school performance between students from the same school and between those from different schools, we will continue with the next modeling steps, respecting the multilevel step-up strategy.

Therefore, as already seen through the charts in Figs. 23.32 and 23.33, let’s insert level-1 variable year into the analysis, aiming at investigating if the temporal variable has a relationship to students’ school performance behavior and, more than this, if the school performance has a linear behavior throughout time.

  • Linear Trend Model with Random Intercepts

performance_tjk = π_0jk + π_1jk·year_tjk + e_tjk

π_0jk = b_00k + r_0jk

π_1jk = b_10k

b_00k = γ_000 + u_00k

b_10k = γ_100,

which results in the following expression:

performance_tjk = γ_000 + γ_100·year_tjk + u_00k + r_0jk + e_tjk
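Before estimating this model, the composition of the reduced-form expression can be illustrated with a short simulation. The sketch below is purely illustrative: it treats the variance estimates that Stata will report for this model (Fig. 23.35) as if they were true population values, and the sample sizes are hypothetical, not those of our dataset.

```python
import random

# Illustrative simulation of the reduced-form random-intercept model:
# performance_tjk = γ_000 + γ_100·year + u_00k + r_0jk + e_tjk.
# The variances below are the Fig. 23.35 estimates, treated here as if
# they were population values (an assumption for illustration only).
random.seed(42)
GAMMA_000, GAMMA_100 = 57.844, 4.348                        # fixed effects
SD_U, SD_R, SD_E = 180.196**0.5, 333.675**0.5, 10.146**0.5  # random-effect SDs

u_draws, r_draws, e_draws = [], [], []
for school in range(2000):                       # hypothetical sample sizes
    u = random.gauss(0, SD_U)                    # school intercept u_00k
    for student in range(5):
        r = random.gauss(0, SD_R)                # student intercept r_0jk
        for year in range(1, 5):
            e = random.gauss(0, SD_E)            # level-1 error e_tjk
            performance = GAMMA_000 + GAMMA_100 * year + u + r + e
            u_draws.append(u); r_draws.append(r); e_draws.append(e)

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

total = var(u_draws) + var(r_draws) + var(e_draws)
icc_school = var(u_draws) / total                            # close to 0.344
icc_student_school = (var(u_draws) + var(r_draws)) / total   # close to 0.981
```

Because u_00k is drawn once per school and r_0jk once per student, the simulated panel reproduces the nesting of repeated measures within students within schools, and the empirical variance decomposition approaches the intraclass correlations derived below.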

For the data in our example, the command for estimating the linear trend model with random intercepts in Stata is:

xtmixed performance year || school: || student: , var nolog reml

whose outputs are shown in Fig. 23.35.

Fig. 23.35
Fig. 23.35 Outputs of the linear trend model with random intercepts.

First, we can see that the mean annual increase in school performance is statistically significant, with an estimated parameter of γ_100 = 4.348, ceteris paribus.

Regarding the random effects components, we have also verified that there is statistical significance in the variances of u00k, r0jk, and etjk, because the estimations of τu000, τr000, and σ2 are considerably higher than the respective standard errors. Therefore, new intraclass correlations can be calculated, as follows:

  •  Level-2 intraclass correlation

ρ_student|school = corr(Y_tjk, Y_t′jk) = (τ_u000 + τ_r000) / (τ_u000 + τ_r000 + σ²) = (180.196 + 333.675) / (180.196 + 333.675 + 10.146) = 0.981

  •  Level-3 intraclass correlation

ρ_school = corr(Y_tjk, Y_t′j′k) = τ_u000 / (τ_u000 + τ_r000 + σ²) = 180.196 / (180.196 + 333.675 + 10.146) = 0.344
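These proportions can be verified with simple arithmetic outside Stata. The short Python check below uses only the variance estimates reported in Fig. 23.35:

```python
# Variance estimates reported in Fig. 23.35 (tau_u000, tau_r000, sigma^2)
tau_u000, tau_r000, sigma2 = 180.196, 333.675, 10.146

total = tau_u000 + tau_r000 + sigma2
icc_student_school = (tau_u000 + tau_r000) / total  # level-2 intraclass corr.
icc_school = tau_u000 / total                       # level-3 intraclass corr.

print(round(icc_student_school, 3))  # 0.981
print(round(icc_school, 3))          # 0.344
```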

Both variance proportions are higher than the ones obtained in the estimation of the null model, which shows the importance of including the variable that corresponds to the repeated measure in level 1. Besides, the result of the likelihood-ratio test (LR test) at the bottom of Fig. 23.35 confirms that the estimation of a simple traditional linear regression model (performance as a function of year), with fixed effects only, must be ruled out.

Therefore, now, our model starts to have the following specification:

performance_tjk = 57.844 + 4.348·year_tjk + u_00k + r_0jk + e_tjk

Next, we can store (command estimates store) the estimates obtained, for future comparison with the ones that will be generated by the estimation of a linear trend model with random intercepts and slopes. Through the command predict ..., reffects, we can also obtain the expected values of the random effects, known as BLUPs (best linear unbiased predictions), u_00k and r_0jk. Maintaining the logic proposed in the previous section, let’s type the following sequence of commands:

estimates store randomintercept
predict u00 r0, reffects
desc u00 r0
by student, sort: generate tolist = (_n ==1)
list student school u00 r0 if school <=2 & tolist

Fig. 23.36 shows the values of the random intercept terms u_00k and r_0jk for the students from the first two schools in the dataset. We can see that the error terms u_00k do not vary between students from the same school or over time (variable u00 generated in the dataset), whereas the terms r_0jk vary between students but not for the same student over time (variable r0 generated in the dataset), which characterizes one intercept for each student and one intercept for each school.

Fig. 23.36
Fig. 23.36 Random intercept terms u_00k and r_0jk for the first two schools in the sample (only the observation corresponding to each student’s first period is listed).

In order to provide a better visualization of the random intercepts per school and per student, we can generate two charts (Figs. 23.37 and 23.38), by typing the following commands:

graph hbar (mean) u00, over(school) ytitle("Random Intercepts per School")

graph hbar (mean) r0, over(student) ytitle("Random Intercepts per Student")

Fig. 23.37
Fig. 23.37 Random intercepts per school.
Fig. 23.38
Fig. 23.38 Random intercepts per student.

Therefore, at this moment of the modeling, we are able to state that students’ school performance follows a linear trend throughout time. In addition, there is a significant variance of intercepts between those who study at the same school and between those who study at different schools.

Thus, we also need to verify whether there is significant variance of the school performance slopes over time between the different students, since the charts in Figs. 23.30 and 23.33 already gave us an indication that this phenomenon occurs. Therefore, let’s insert slope random effects into levels 2 and 3 of our multilevel model which, by maintaining the intercept random effects, will start to have the following expression:

  • Linear Trend Model with Random Intercepts and Slopes

performance_tjk = π_0jk + π_1jk·year_tjk + e_tjk

π_0jk = b_00k + r_0jk

π_1jk = b_10k + r_1jk

b_00k = γ_000 + u_00k

b_10k = γ_100 + u_10k,

which results in:

performance_tjk = γ_000 + γ_100·year_tjk + u_00k + u_10k·year_tjk + r_0jk + r_1jk·year_tjk + e_tjk

The command for estimating this linear trend model with random intercepts and slopes in Stata is:

xtmixed performance year || school: year || student: year, var nolog reml

Note that the variable year is present in the fixed effects component, in the level-3 random effects component (multiplying the error term u_10k), and in the level-2 one (multiplying the error term r_1jk). The outputs obtained can be found in Fig. 23.39.

Fig. 23.39
Fig. 23.39 Outputs of the linear trend model with random intercepts and slopes.

We can see that, even though the fixed effects parameter estimations do not change considerably in relation to the previous model, the variance estimations are different, which generates new intraclass correlations, as follows:

  •  Level-2 intraclass correlation

ρ_student|school = corr(Y_tjk, Y_t′jk) = (τ_u000 + τ_u100 + τ_r000 + τ_r100) / (τ_u000 + τ_u100 + τ_r000 + τ_r100 + σ²) = (224.343 + 0.560 + 374.285 + 3.157) / (224.343 + 0.560 + 374.285 + 3.157 + 3.868) = 0.994

  •  Level-3 intraclass correlation

ρ_school = corr(Y_tjk, Y_t′j′k) = (τ_u000 + τ_u100) / (τ_u000 + τ_u100 + τ_r000 + τ_r100 + σ²) = (224.343 + 0.560) / (224.343 + 0.560 + 374.285 + 3.157 + 3.868) = 0.371
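Again, the arithmetic can be checked directly. The Python sketch below uses only the variance estimates reported in Fig. 23.39:

```python
# Variance estimates reported in Fig. 23.39
tau_u000, tau_u100 = 224.343, 0.560   # school-level intercept and slope variances
tau_r000, tau_r100 = 374.285, 3.157   # student-level intercept and slope variances
sigma2 = 3.868                        # level-1 residual variance

total = tau_u000 + tau_u100 + tau_r000 + tau_r100 + sigma2
icc_student_school = (total - sigma2) / total      # level-2 intraclass corr.
icc_school = (tau_u000 + tau_u100) / total         # level-3 intraclass corr.

print(round(icc_student_school, 3))  # 0.994
print(round(icc_school, 3))          # 0.371
```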

Therefore, for this model, we estimate that the student and school random effects account for approximately 99% of the total residual variance!

Let’s type the following command, so that we can demonstrate that this estimation is better suited than the previous one, without random slopes:

estimates store randomslope

Next, we can type the command that will elaborate the likelihood-ratio test:

lrtest randomslope randomintercept

where the term randomintercept refers to the estimation carried out previously. The result of the test can be seen in Fig. 23.40.

Fig. 23.40
Fig. 23.40 Likelihood-ratio test to compare the estimations of the linear trend models with random intercepts and with random intercepts and slopes.

By using the values of the restricted likelihood functions obtained in Figs. 23.35 and 23.39, we arrive at the χ2 statistic for the test, with 2 degrees of freedom:

χ²₂ = [−2·LLr_random intercept] − [−2·LLr_random slope] = {−2·(−7,801.420)} − {−2·(−7,464.819)} = 673.20,

which results in a Sig. χ²₂ = 0.000 < 0.05 and ends up favoring the linear trend model with random intercepts and slopes. It is important to mention once again, as the note at the bottom of Fig. 23.40 also explains, that this likelihood-ratio test is only valid when comparing estimations obtained through restricted maximum likelihood (REML) of two models with identical fixed effects specifications. Since, in our case, both models were estimated through REML and present the same fixed effects specification, γ_000 + γ_100·year_tjk, the test is considered valid.
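The test statistic itself is easy to reproduce from the restricted log-likelihoods of Figs. 23.35 and 23.39, as in the short Python sketch below:

```python
import math

# Restricted log-likelihoods reported by Stata (Figs. 23.35 and 23.39)
ll_random_intercept = -7801.420
ll_random_slope = -7464.819

# LR statistic with 2 degrees of freedom (two extra variance parameters)
chi2 = (-2 * ll_random_intercept) - (-2 * ll_random_slope)
# For df = 2, the chi-square survival function reduces to exp(-x/2)
p_value = math.exp(-chi2 / 2)

print(round(chi2, 2))  # 673.2
print(p_value < 0.05)  # True
```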

Hence, our model starts to have the following specification:

performance_tjk = 57.858 + 4.343·year_tjk + u_00k + u_10k·year_tjk + r_0jk + r_1jk·year_tjk + e_tjk

In the current situation, we are able to state that students’ school performance follows a linear trend throughout time. In addition, there is a significant variance of intercepts and slopes between those who study at the same school and between those who study at different schools.

Therefore, let’s insert level-2 variable gender into the analysis, in order to verify if this characteristic explains the variation in the annual school performance between students.

  • Linear Trend Model with Random Intercepts and Slopes and with Level-2 Variable gender

performance_tjk = π_0jk + π_1jk·year_tjk + e_tjk

π_0jk = b_00k + b_01k·gender_jk + r_0jk

π_1jk = b_10k + b_11k·gender_jk + r_1jk

b_00k = γ_000 + u_00k

b_01k = γ_010

b_10k = γ_100 + u_10k

b_11k = γ_110,

which results in the following expression:

performance_tjk = γ_000 + γ_100·year_tjk + γ_010·gender_jk + γ_110·gender_jk·year_tjk + u_00k + u_10k·year_tjk + r_0jk + r_1jk·year_tjk + e_tjk

Initially, we need to generate a new variable that corresponds to the multiplication between gender and year. The following command generates this variable (genderyear):

gen genderyear = gender*year

Next, we can estimate our linear trend model with random intercepts and slopes and level-2 variable gender, by typing the following command:

xtmixed performance year gender genderyear || school: year || student: year, var nolog reml

The outputs generated can be seen in Fig. 23.41.

Fig. 23.41
Fig. 23.41 Outputs of the linear trend model with random intercepts and slopes and level-2 variable gender.

This model shows significant estimates for the fixed effects parameters, as well as for the variances of the random effects terms, at a significance level of 0.05. Moreover, at this moment of the modeling, we are able to state that students’ school performance follows a linear trend over time, and there is a significant variance of intercepts and slopes between those who study at the same school and between those who study at different schools. Additionally, whether a student is female or male explains part of this variation in school performance.

The model begins to have the following specification:

performance_tjk = 64.498 + 4.029·year_tjk − 15.033·gender_jk + 0.705·gender_jk·year_tjk + u_00k + u_10k·year_tjk + r_0jk + r_1jk·year_tjk + e_tjk

from which we can see that male students (dummy gender = 1) have a worse performance than female students, on average and ceteris paribus.
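To make this interpretation concrete, we can evaluate the fixed-effects part of the specification for male and female students. The small Python sketch below is merely illustrative (the function name is ours, not from the dataset):

```python
# Illustrative helper using only the fixed-effects part of the estimated
# specification (dummy gender: 0 = female, 1 = male).
def expected_performance(year, gender):
    return 64.498 + 4.029 * year - 15.033 * gender + 0.705 * gender * year

# Estimated male-female gap in each year: -15.033 + 0.705·year,
# i.e., the gap shrinks by 0.705 points per year
gap_year1 = expected_performance(1, 1) - expected_performance(1, 0)
gap_year4 = expected_performance(4, 1) - expected_performance(4, 0)
print(round(gap_year1, 3))  # -14.328
print(round(gap_year4, 3))  # -12.213
```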

Finally, let’s investigate whether the level-3 variable texp (professors’ years of teaching experience) also explains the variation in annual school performance between students. After some intermediate analyses, let’s move on to estimate the three-level hierarchical model with the following specification:

  • Linear Trend Model with Random Intercepts and Slopes, Level-2 Variable gender, and Level-3 Variable texp (Complete Model)
