Chapter 23

Data Mining and Multilevel Modeling

Abstract

This chapter presents a brief introduction to data mining and, within the context of modeling, it discusses multilevel models in detail, clarifying the circumstances in which they can be used. The main objective is to estimate the parameters of two-level hierarchical linear models with clustered data and of three-level models with repeated measures, as well as to offer the conditions for their correct interpretation. The results of the statistical and likelihood-ratio tests regarding these models are evaluated, which allows us to distinguish a multilevel model from a traditional regression model. From the concepts and techniques presented, we can propose models in which it is possible to identify the fixed and random effects on the dependent variable, understand the variance decomposition of multilevel random effects, and calculate and interpret the intraclass correlations of each analysis level. Understanding how nested structures of clustered data and of data with repeated measures work allows researchers and managers to define several types of constructs from which multilevel models can be used within the context of data mining. The multilevel models are estimated in Stata Statistical Software® and IBM SPSS Statistics Software®.

Keywords

Big data; KDD process; Data Mining; CRISP-DM; Multilevel modeling; Hierarchical linear models; Hierarchical nonlinear models; Mixed models; Nested models; Repeated measures; Variance decomposition; Fixed effects; Random effects; Likelihood-ratio test; Stata and SPSS

We must widen the circle of our love till it embraces the whole village; the village in its turn must take into its fold the district; the district the province; and so on, until the scope of our love becomes co-terminous with the world.

Mahatma Gandhi

23.1 Introduction to Data Mining

In this new millennium, with respect to the generation and availability of data, humankind has been witnessing and learning how to live with the simultaneous occurrence of five characteristics, or dimensions: data volume, velocity, variety, variability, and complexity.

Among other reasons, this excessive volume of data comes from the increase of technological capabilities, the increase of phenomena monitoring, and the emergence of social media. The velocity with which data become available for treatment and analysis, due to new collection methods that use electronic tags and radiofrequency antenna systems, is also visible and vital for decision-making processes in environments that are more and more competitive. Variety refers to the different formats in which data are accessed, such as texts, indicators, secondary datasets, or even speeches, and a converging analysis can foster a better decision-making process too. Beyond the three previous dimensions, data variability relates to cyclical or seasonal phenomena, sometimes of high frequency, directly observable or not, whose adequate treatment can generate differentiated information for researchers. Last but not least, data complexity, mainly for large volumes, resides in the fact that many sources can be accessed, with distinct codes, periodicities, or criteria, which forces researchers to maintain a managerial control process over the data in order to allow integrated analysis and decision making.

As shown in Fig. 23.1, the combination of these five data generation and availability dimensions is called Big Data, currently, a very frequent term in academic and business environments.

Fig. 23.1 Data Generation and Availability Dimensions, and Big Data.

These five dimensions that define Big Data cannot be supported without the enhancement of professional software packages that, in addition to offering enormous dataset processing capability, are able to elaborate the most diverse tests and models, adequate and robust for each situation, and according to what researchers and decision makers want. These are the main reasons why organizations from several different sectors have been investing in the structuring and development of multidisciplinary areas known as Business Analytics. These have the main goal of analyzing data and generating information, allowing pattern recognition and a real-time predictive capacity of the organization in relation to the market and to its competitors.

Within this perspective, with the emergence and improvement of complex and robust computer systems, and with the reduction in hardware and software prices, organizations have been storing more and more data. Data storing systems are constantly generated and enhanced, such as data warehouses, virtual libraries, and the web itself (Cios et al., 2007; Camilo and Silva, 2009).

According to Bramer (2016), NASA’s observation satellites generate around one terabyte of data a day, the Human Genome Project stores thousands of bytes for each of the billions of existing genetic datasets, financial institutions maintain repositories with millions of daily transactions done by their clients, and retailers control the flow of thousands of SKUs instantaneously. Nevertheless, excessive storing makes players from the most diverse areas ask how to treat the high volume and high variety of complex data generated with extreme velocity and variability. To answer this crucial question, in the 1980s, Data Mining emerged, aiming to propose technologies and treatments for situations in which traditional data exploration and analysis techniques are not enough or adequate.

As mentioned in Chapter 1, hierarchy between data, information, and knowledge has been present in all the discussions throughout this book. While data are transformed into information whenever treated and analyzed, knowledge is generated at the moment when such information is recognized and applied in decision making. As discussed by Fayyad et al. (1996), be it science, marketing, finance, health care, retail, or any other field, the classical approach to data analysis relies fundamentally on one or more analysts becoming intimately familiar with the data and serving as an interface between the data and the users and products. This manual probing of a data set is slow, expensive, and can be highly subjective, and, as data volumes grow dramatically, this type of manual data analysis is becoming completely impractical in many domains.

According to the same authors, we are witnessing the emergence of a new generation of computational theories and tools to assist humans in extracting useful information and knowledge from rapidly growing volumes of digital data. These theories and tools are the subject of the emerging field of knowledge discovery in databases (KDD). At an abstract level, KDD is concerned with the development of methods and techniques for making sense of data.

As stated in Han and Kamber (2000) and in Camilo and Silva (2009), KDD and Data Mining are synonyms, even though there is no consensus regarding the definition of these terms yet. For Fayyad et al. (1996) and Cios et al. (2007), while KDD includes all the phases for discovering knowledge from the existence of data, Data Mining is solely one of the phases of that process, as shown in Fig. 23.2.

Fig. 23.2 Stages of the KDD Process and Data Mining. (Source: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., 1996. From data mining to knowledge discovery in databases. AI Magazine 17 (3), 37–54.)

The data-mining stage of KDD currently relies heavily on known techniques from machine learning, pattern recognition, optimization, simulation, statistics, and multivariate analysis to find patterns from data (Fayyad et al., 1996). Data mining, thus, is a stage in the KDD process that consists in applying data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns (or models) over the data.

Following the logic proposed by Olson and Delen (2008), Camilo and Silva (2009), and Larose and Larose (2014), data mining can be structured in six phases, or stages, which form what is called CRISP-DM (cross-industry standard process for data mining):

1. Business understanding: knowledge about the business and about the market processes inherent to the business is fundamental in order to define the objectives of the data mining.

2. Data understanding: we must describe the data in a clear and objective way, always explaining their sources and possible interdependence behavior between the variables. As we studied in Part 5 of this book, exploratory techniques can be very useful in this phase.

3. Data preparation: preliminary analyses of the data, with possible treatment of outliers or missing values, can be extremely useful in order for the data-mining methods to be applied correctly. The clustering of variables itself or their categorization through a certain criterion can make one technique more suitable than another, respecting the analysis objectives.

4. Modeling: as discussed by Fávero and Belfiore (2017), several techniques can be applied, such as the preparation of exploratory techniques, the estimation of confirmatory models, or the implementation of algorithms, always based on the objectives proposed.

5. Analysis of results: it is essential for business experts, statisticians, and data scientists to take part in this phase, so that evaluations of the findings from the previous phase can be carried out, from the analysis of tests and validations (e.g., contingency tables, χ2 statistic, correlation matrices, Stepwise procedures, t-tests, among others).

6. Dissemination of results: after the modeling and output analysis, it is necessary for all those involved to be aware of the results found, so that it is possible to implement management procedures.

In a schematic way, Fig. 23.3 shows the phases that form the cross-industry standard process for data mining (CRISP-DM). Through it, it is possible to verify that the flow between the phases is not always unidirectional. That is, if, for example, a certain modeling is not possible due to the nature of the data, researchers can go back to the previous phase and prepare these data once again.

Fig. 23.3 Phases of the Cross-Industry Standard Process for Data Mining (CRISP-DM). (Sources: Olson, D., Delen, D., 2008. Advanced Data Mining Techniques. Springer, New York; Camilo, C.O., Silva, J.C., 2009. Mineração de dados: conceitos, tarefas, métodos e ferramentas. Technical Report RT-INF 001-09. Instituto de Informática, Universidade Federal de Goiás; Larose, D.T., Larose, C.D., 2014. Discovering Knowledge in Data: An Introduction to Data Mining. 2nd ed. John Wiley & Sons, New York; Fávero, L.P., Belfiore, P., 2017. Manual de análise de dados: estatística e modelagem multivariada com Excel®, SPSS® e Stata®. Elsevier, Rio de Janeiro.)

According to Linoff and Berry (2011), “although some data mining techniques are quite new, data mining itself is not a new technology, in the sense that people have been analyzing data on computers since the first computers were invented - and without computers for centuries before that.” Data mining has gone by many names, such as knowledge discovery, business intelligence, predictive modeling, and predictive analytics, but the following definition is one of the most utilized and accepted:

Data mining is a business process for exploring large amounts of data to discover meaningful patterns and rules.

In this sense, the main tasks of data mining are related to:

  •  Description (e.g., Statistical Summaries);
  •  Data Exploration and Visualization (e.g., Online Analytical Processing—OLAP, Construction of Maps);
  •  Classification and Prediction (e.g., Generalized Linear Models—GLM, Generalized Linear Latent and Mixed Models—GLLAMM, Artificial Neural Networks—NN);
  •  Clustering (e.g., Hierarchical Clustering, K-Means Clustering, Self-Organizing Maps—SOM, Decision Trees);
  •  Association Rule Mining (e.g., Factor Analysis, Simple and Multiple Correspondence Analysis, Multidimensional Scaling);
  •  Optimization and Simulation (e.g., Linear Programming, Network Programming, Integer Programming, Monte Carlo).
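As a toy illustration of the clustering task listed above, the following sketch implements a one-dimensional k-means in pure Python (the function name and the data are hypothetical; in practice, a library implementation from one of the packages cited in this chapter would be used):

```python
# Minimal 1-D k-means sketch: assign each value to the nearest center,
# then recompute each center as the mean of its assigned values.

def kmeans_1d(values, centers, n_iter=20):
    for _ in range(n_iter):
        clusters = {c: [] for c in range(len(centers))}
        for v in values:
            nearest = min(range(len(centers)), key=lambda c: abs(v - centers[c]))
            clusters[nearest].append(v)
        centers = [sum(vs) / len(vs) if vs else centers[c]
                   for c, vs in clusters.items()]
    return centers, clusters

# Hypothetical data with two clearly separated groups:
prices = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centers, clusters = kmeans_1d(prices, centers=[0.0, 5.0])
print(sorted(round(c, 2) for c in centers))  # -> [1.5, 10.5]
```

The same logic extends to several dimensions by replacing the absolute difference with a Euclidean distance.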

Many tools and software packages have been developed to facilitate the implementation of data mining by professionals from the most diverse fields. Among them, we can highlight Stata, IBM SPSS Modeler, RStudio, SAS Enterprise Miner, Pimiento, WEKA, KNIME, Dundas BI, Qlik Sense, Birst, DOMO, Orange, Microsoft SharePoint, Oracle Data Mining (ODM), Sisense, Salesforce Analytics Cloud, RapidMiner, LingPipe, IBM Cognos, and IBM DB2 Intelligent Miner.

Fig. 23.4 shows a screenshot of IBM SPSS Modeler with a Plot Spatial Data Extension, in which it is possible to see a range of interconnected advanced algorithms and techniques, and the map generated by the geospatial analysis.

Fig. 23.4 Data Mining and Geospatial Analytics in IBM SPSS Modeler.

Data mining is frequently and successfully applied in several fields of knowledge, as already discussed throughout this book. In addition, we can mention the following examples:

  •  Banking Sector: credit risk models and probability of default;
  •  Financial Sector: identification of standards in the behavior of financial asset prices;
  •  Marketing and CRM (Customer Relationship Management): identification of customers’ standards to increase retention rates;
  •  Retail: replacement and placement of products on shelves based on consumption standards;
  •  Medicine and Health: preparation of more precise diagnoses;
  •  Epidemiology: study of the dissemination and transmission of diseases in order to monitor and prevent them;
  •  Recruitment and selection of professionals: identification of the most suitable profiles for each position or function;
  •  Logistics: inventory management and vehicle routing based on demand fluctuations and peaks;
  •  Security: detection of terrorist and criminal activities;
  •  Public Policies: definition of priorities in the allocation of public resources and improvement of public management;
  •  Education: study of students’ performance and assistance in the preparation for college entrance exams;
  •  Social Media: monitoring to define new products, sales, and promotions.

As emphasized by Albright and Winston (2015), data mining is a huge topic that can fill a large book by itself, covering the role of data mining in real business problems, data warehousing, techniques, and software packages. The main goal of this chapter is to offer a brief overview of data-mining definitions and to present and discuss in detail a relevant and relatively recent technique, known as multilevel modeling. This technique helps researchers, managers, and practitioners focus their attention on suitable target areas during data-mining problem solving.

The myriad combinations of model specification make multilevel modeling an interesting data-mining tool, as it takes into account the influence of both the observations in the dataset and their contexts on the outcome variable, opening up new possibilities for prediction and exploratory work.

23.2 Multilevel Modeling

Multilevel regression models for panel data have become considerably important in several fields of knowledge, and papers that use estimations related to these models have been published more and more frequently. This is mainly due to the specification of research constructs that consider the existence of nested data structures, in which certain variables vary between the distinct units that represent groups, but not between observations that belong to the same group. The computational development itself, and the investments that certain manufacturers of data analysis software have made in the processing capacity needed to estimate multilevel models, also support researchers who are increasingly interested in this type of approach.

Imagine that a group of researchers is interested in studying how firms’ performance, measured, for example, by a certain profitability indicator, behaves in relation to certain company operational characteristics (size, investment, among others), and in relation to the characteristics of the industry in which each firm operates (participation in the GDP, tax and legal incentives, among others). Since sector characteristics do not vary among firms from the same industry, we characterize a two-level clustered data structure, with firms (level 1) nested into industries (level 2). Estimating a multilevel model may allow researchers to verify whether there are firm characteristics that explain possible performance differences between firms from the same industry, as well as whether there are sector characteristics that explain possible performance differences between firms from different industries.

Imagine that this study is expanded in order to investigate the temporal evolution of these firms’ performance. Different from longitudinal regression models for panel data, in which the variables change between observations and throughout time, assume that the dataset is structured only with firm variables (governance structure, production lines, among others) and industry variables (tax incidence, legislation, among others), which do not change during the period analyzed. Therefore, we characterize a three-level data structure with repeated measures, with periods (level 1) nested into firms (level 2), and these into sectors (level 3). From this structure, models can be estimated to investigate whether, throughout time, there is variability in performance between firms from the same sector and between firms from different sectors and, if so, whether there are firm and sector characteristics that explain this variability.

Theoretically, researchers can define a construct with a greater number of analysis levels, even if the interpretation of model parameters is not something trivial. For instance, imagine the study of school performance, throughout time, of students nested into schools, these nested into municipal districts, these into municipalities, and these into states of the federation. In this case, we would be working with six analysis levels (temporal evolution, students, schools, municipal districts, municipalities, and states).

The main advantage of multilevel models over traditional regression models, such as the ones estimated by OLS (Chapter 13), refers to the possibility of considering the natural nesting of the data. In other words, multilevel models allow us to identify and analyze individual heterogeneities and heterogeneities between the groups to which these individuals belong, making it possible to specify random components at each analysis level. For example, if companies are nested into sectors, it is possible to define a random component at the firm level and another one at the sector level, different from what a traditional regression model would allow, in which the effect of the sector on the firms’ performance would be considered homogeneous. Thus, multilevel models can also be called random coefficients models.
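To make this advantage concrete, consider a minimal simulation sketch in Python (numpy assumed; all values are hypothetical, and group dummies are used here as a simple stand-in for the random intercepts just described). Groups share a common slope but have different intercepts, and the ranges of x differ by group; a pooled OLS that ignores group membership mixes the within-group and between-group relationships, while allowing group-specific intercepts recovers the common slope:

```python
import numpy as np

rng = np.random.default_rng(42)

true_slope = 2.0
group_intercepts = [0.0, 10.0, 20.0]  # heterogeneity between groups

xs, ys, gs = [], [], []
for j, b0j in enumerate(group_intercepts):
    x = rng.uniform(5 * j, 5 * j + 5, size=200)  # x ranges differ by group
    y = b0j + true_slope * x + rng.normal(0, 1, size=200)
    xs.append(x)
    ys.append(y)
    gs.append(np.full(200, j))
x, y, g = np.concatenate(xs), np.concatenate(ys), np.concatenate(gs).astype(int)

# Pooled OLS: a single intercept for all observations (nesting ignored).
b_pooled, *_ = np.linalg.lstsq(np.column_stack([np.ones_like(x), x]), y, rcond=None)

# Group-specific intercepts (dummies) with a common slope.
X_group = np.column_stack([np.eye(3)[g], x])
b_group, *_ = np.linalg.lstsq(X_group, y, rcond=None)

# The pooled slope is pulled toward the between-group relationship,
# while the group-intercept model recovers the true within-group slope (2.0).
print(round(b_pooled[1], 2), round(b_group[-1], 2))
```

In practice, multilevel software such as Stata's mixed or SPSS's MIXED estimates the random intercepts jointly by maximum likelihood rather than through dummies; the sketch only illustrates why ignoring nesting distorts the estimates.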

According to Courgeau (2003), within a model structure with a single equation, there seems to be no connection between individuals and the society in which they live. In this sense, the use of level equations allows the researcher to “jump” from one science to another: students and schools, families and neighborhoods, firms and countries. Ignoring this relationship means elaborating incorrect analyses about the behavior of individuals and, equally, about the behavior of groups. Only the recognition of these reciprocal influences allows the correct analysis of the phenomena.

In this chapter, we will study multilevel models aimed at investigating the behavior of metric dependent variables (outcome variables), from which normally distributed residuals will be generated; these residuals, however, are not independent and do not have a constant variance. Therefore, our focus will be on linear multilevel models, also known as linear mixed models (LMM) or hierarchical linear models (HLM). This is the reason why multilevel models applied to data nested into two levels are also called HLM2, and why models applied to data nested into three levels are known as HLM3.

According to West et al. (2015), the name linear mixed models comes from the fact that these models present linear specification and the explanatory variables include a mix of fixed and random effects, that is, they can be inserted into components with fixed effects, as well as into components with random effects. While the estimated fixed effects parameters indicate the relationship between explanatory variables and the metric dependent variable, the random effects components can be represented by the combination of explanatory variables and nonobserved random effects.

In the Appendix of this chapter, a brief presentation on nonlinear multilevel models will be given, with applications in Stata of examples of logistic, Poisson, and negative binomial models.

Following the same logic as Chapters 13, 14, and 15, we will estimate all the models in this chapter in Stata. Moreover, we believe that also estimating them in SPSS may allow researchers to compare different software packages, procedures, and routines, as well as the logic with which the outputs are presented, enabling them to decide which software to use based on the characteristics of each one and on how accessible it is.

Hence, in this chapter, we will discuss multilevel regression models for panel data. Our main objectives here are: (1) to introduce the concepts of nested data structures; (2) to define the type of model to be estimated based on the characteristics of the data; (3) to estimate parameters through several methods in Stata and in SPSS; (4) to interpret the results obtained through several types of existing estimations for multilevel models; and (5) to define the most suitable estimation for diagnosing and forecasting effects in each of the cases studied. Initially, the main concepts inherent to each modeling will be presented. Next, the procedures for estimating the models in Stata and in SPSS will be discussed.

23.3 Nested Data Structures

Multilevel regression models allow us to investigate the behavior of a certain dependent variable Y, which represents the phenomenon we are interested in, based on the behavior of explanatory variables whose values may change, for clustered data, between observations and between the groups to which these observations belong, and, for data with repeated measures, throughout time. In other words, there must be variables whose data change between the individuals that represent a certain level, yet remain unchanged for certain groups of individuals, these groups representing a higher level.

First, imagine a dataset with data on n individuals, and each individual i = 1, ..., n belongs to one of the j = 1, ..., J groups, obviously n > J. Therefore, this dataset can have certain explanatory variables X1, ..., XQ that refer to each individual i, and other explanatory variables W1, ..., WS that refer to each group j, however, invariable for the individuals of a certain group. Table 23.1 shows the general model of a dataset with a two-level clustered/nested data structure (individual and group).

Table 23.1

General Model of a Dataset With a Two-Level Clustered/Nested Data Structure

Observation (Individual i), Level 1 | Group j, Level 2 | Y_ij      | X1_ij      | X2_ij      | ... | XQ_ij      | W1_j | W2_j | ... | WS_j
1                                   | 1                | Y_1,1     | X1_1,1     | X2_1,1     | ... | XQ_1,1     | W1_1 | W2_1 | ... | WS_1
2                                   | 1                | Y_2,1     | X1_2,1     | X2_2,1     | ... | XQ_2,1     | W1_1 | W2_1 | ... | WS_1
⋮
n1                                  | 1                | Y_n1,1    | X1_n1,1    | X2_n1,1    | ... | XQ_n1,1    | W1_1 | W2_1 | ... | WS_1
n1+1                                | 2                | Y_n1+1,2  | X1_n1+1,2  | X2_n1+1,2  | ... | XQ_n1+1,2  | W1_2 | W2_2 | ... | WS_2
n1+2                                | 2                | Y_n1+2,2  | X1_n1+2,2  | X2_n1+2,2  | ... | XQ_n1+2,2  | W1_2 | W2_2 | ... | WS_2
⋮
n2                                  | 2                | Y_n2,2    | X1_n2,2    | X2_n2,2    | ... | XQ_n2,2    | W1_2 | W2_2 | ... | WS_2
⋮
nJ−1+1                              | J                | Y_nJ−1+1,J | X1_nJ−1+1,J | X2_nJ−1+1,J | ... | XQ_nJ−1+1,J | W1_J | W2_J | ... | WS_J
nJ−1+2                              | J                | Y_nJ−1+2,J | X1_nJ−1+2,J | X2_nJ−1+2,J | ... | XQ_nJ−1+2,J | W1_J | W2_J | ... | WS_J
⋮
n                                   | J                | Y_n,J     | X1_n,J     | X2_n,J     | ... | XQ_n,J     | W1_J | W2_J | ... | WS_J

Based on Table 23.1, we can see that X1, ..., XQ are level-1 variables (data change between individuals), and W1, ..., WS are level-2 variables (data change between groups, but not between the individuals in each group). Furthermore, the numbers of individuals in groups 1, 2, ..., J are n1, n2 − n1, ..., n − nJ−1, respectively. Fig. 23.5 allows us to see the existing nesting between the level-1 units (individuals) and the level-2 units (groups), which characterizes the existence of clustered data.

Fig. 23.5 Two-level nested structure of clustered data.

If n1 = n2 − n1 = ... = n − nJ − 1, we will have a balanced nested data structure.
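The layout of Table 23.1 can be sketched with a toy dataset in Python (the firm and industry names and all values are hypothetical), checking the defining property that a level-2 variable is constant within each group:

```python
# Each row: (individual i, group j, Y_ij, X1_ij, W1_j).
# W1 is a level-2 variable: it repeats for every individual in a group.
dataset = [
    (1, "steel",   12.0, 3.1, 0.20),
    (2, "steel",    9.5, 2.4, 0.20),
    (3, "steel",   11.2, 2.9, 0.20),
    (4, "airline",  7.8, 1.7, 0.35),
    (5, "airline",  8.3, 1.9, 0.35),
    (6, "airline",  6.9, 1.2, 0.35),
]

# Check: a level-2 variable (W1) must take a single value within each group.
by_group = {}
for i, j, y, x1, w1 in dataset:
    by_group.setdefault(j, set()).add(w1)
assert all(len(ws) == 1 for ws in by_group.values())

# Balanced nested structure: every group has the same number of individuals.
sizes = {j: sum(1 for row in dataset if row[1] == j) for j in by_group}
print(sizes)  # -> {'steel': 3, 'airline': 3}
```

A dataset violating either check would require revisiting the level definitions before any multilevel model is specified.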

Imagine another dataset in which, in addition to the nesting presented for clustered data, there is temporal evolution, that is, data with repeated measures. Thus, besides the individuals, which will now belong to level 2 and therefore be indexed j = 1, ..., J, nested into k = 1, ..., K groups (which now form level 3), we will also have t = 1, ..., Tj periods in which each individual j is monitored. Consequently, this new dataset can have the same explanatory variables X1, ..., XQ that refer to each individual j; however, they are now invariable for each individual j during the monitoring periods. Moreover, it can also have the same explanatory variables W1, ..., WS that refer to each group k; however, they are also invariable throughout time for each group k. Table 23.2 shows the logic with which we can present a dataset with a three-level nested data structure with repeated measures (time, individual, and group).

Table 23.2

General Model of a Dataset With a Three-Level Nested Data Structure With Repeated Measures

Period t (Repeated Measure), Level 1 | Observation (Individual j), Level 2 | Group k, Level 3 | Y_tjk       | X1_jk | X2_jk | ... | XQ_jk | W1_k | W2_k | ... | WS_k
1                                    | 1                                   | 1                | Y_1,1,1     | X1_1,1 | X2_1,1 | ... | XQ_1,1 | W1_1 | W2_1 | ... | WS_1
2                                    | 1                                   | 1                | Y_2,1,1     | X1_1,1 | X2_1,1 | ... | XQ_1,1 | W1_1 | W2_1 | ... | WS_1
⋮
T1                                   | 1                                   | 1                | Y_T1,1,1    | X1_1,1 | X2_1,1 | ... | XQ_1,1 | W1_1 | W2_1 | ... | WS_1
T1+1                                 | 2                                   | 1                | Y_T1+1,2,1  | X1_2,1 | X2_2,1 | ... | XQ_2,1 | W1_1 | W2_1 | ... | WS_1
T1+2                                 | 2                                   | 1                | Y_T1+2,2,1  | X1_2,1 | X2_2,1 | ... | XQ_2,1 | W1_1 | W2_1 | ... | WS_1
⋮
T2                                   | 2                                   | 1                | Y_T2,2,1    | X1_2,1 | X2_2,1 | ... | XQ_2,1 | W1_1 | W2_1 | ... | WS_1
⋮
TJ−1+1                               | J                                   | K                | Y_TJ−1+1,J,K | X1_J,K | X2_J,K | ... | XQ_J,K | W1_K | W2_K | ... | WS_K
TJ−1+2                               | J                                   | K                | Y_TJ−1+2,J,K | X1_J,K | X2_J,K | ... | XQ_J,K | W1_K | W2_K | ... | WS_K
⋮
TJ                                   | J                                   | K                | Y_TJ,J,K    | X1_J,K | X2_J,K | ... | XQ_J,K | W1_K | W2_K | ... | WS_K

Based on Table 23.2, we can now see that the variable that corresponds to the period is a level-1 explanatory variable, since its data change in each row of the dataset; that X1, ..., XQ become level-2 variables (data change between individuals, but not for the same individual throughout time); and that W1, ..., WS become level-3 variables (data change between groups, but not for the same group throughout time). Furthermore, the numbers of periods in which individuals 1, 2, ..., J are monitored are T1, T2 − T1, ..., TJ − TJ−1, respectively. Analogous to what was exposed for the case with two levels, Fig. 23.6 allows us to see the existing nesting between the level-1 units (temporal variation), the level-2 units (individuals), and the level-3 units (groups), which characterizes a data structure with repeated measures.

Fig. 23.6 Three-level nested structure with repeated measures.

If T1 = T2 − T1 = ... = TJ − TJ − 1, we will have a balanced panel.
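Analogously, the three-level layout of Table 23.2 can be sketched as follows (all values hypothetical), checking that X variables are invariable within each individual over time and that W variables are invariable within each group:

```python
# Each row: (period t, individual j, group k, Y_tjk, X1_jk, W1_k).
rows = [
    (1, 1, 1, 10.0, 3.1, 0.2),
    (2, 1, 1, 10.8, 3.1, 0.2),
    (1, 2, 1,  9.1, 2.4, 0.2),
    (2, 2, 1,  9.6, 2.4, 0.2),
    (1, 3, 2,  7.5, 1.7, 0.5),
    (2, 3, 2,  7.9, 1.7, 0.5),
]

# X variables are level 2: fixed for each individual across the periods.
x_by_ind = {}
for t, j, k, y, x1, w1 in rows:
    x_by_ind.setdefault(j, set()).add(x1)
assert all(len(v) == 1 for v in x_by_ind.values())

# W variables are level 3: fixed for each group across periods and individuals.
w_by_group = {}
for t, j, k, y, x1, w1 in rows:
    w_by_group.setdefault(k, set()).add(w1)
assert all(len(v) == 1 for v in w_by_group.values())

# Balanced panel: every individual is observed for the same number of periods.
periods = {j: sum(1 for r in rows if r[1] == j) for j in x_by_ind}
print(periods)  # -> {1: 2, 2: 2, 3: 2}
```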

Through Tables 23.1 and 23.2, as well as through the corresponding Figs. 23.5 and 23.6, we can see that the data structures present absolute nesting. That is, a certain individual can be nested into only one group, that group into only one higher-level group, and so on. Nevertheless, there may be nested data structures with cross-classification, in which some observations of a group belong to one higher-level group while the others belong to another. For instance, imagine a study of the performance of firms nested into sectors and countries. There may be mining firms from Brazil and firms from other sectors, such as aviation, that also come from Brazil; however, if there are also mining firms from Australia in the dataset, for example, the nesting becomes cross-classified, making it necessary to estimate hierarchical cross-classified models (HCM). These models are not covered in the current edition of this book; however, researchers may study them in depth in Raudenbush and Bryk (2002), Raudenbush et al. (2004), and Rabe-Hesketh and Skrondal (2012a,b).

In Sections 23.5.1 and 23.6.1, we will estimate two-level hierarchical linear models with clustered data (HLM2) in Stata and SPSS, respectively. In Sections 23.5.2 and 23.6.2, we will estimate three-level hierarchical linear models with repeated measures (HLM3) in the same software packages. However, before that, it is necessary to present and discuss the algebraic formulations of each one of these models in the following section.

23.4 Hierarchical Linear Models

In this section, we will discuss the algebraic formulations and specifications of two-level hierarchical linear models with clustered data (Section 23.4.1) and three-level hierarchical linear models with repeated measures (Section 23.4.2).

23.4.1 Two-Level Hierarchical Linear Models With Clustered Data (HLM2)

In order to understand how the general expression of a two-level hierarchical linear model with clustered data is defined, we need to use a multiple linear regression model, whose specification, based on Expression (12.1), is presented here:

Yi = b0 + b1X1i + b2X2i + ... + bQXQi + ri    (23.1)

where Y represents the phenomenon being studied (dependent variable), b0 represents the intercept, b1, b2, ..., bQ are the coefficients of each variable, X1, ..., XQ are explanatory variables (metric or dummies), and r represents the error terms. The subscript i represents each one of the sample observations under analysis (i = 1, 2, ..., n, where n is the sample size). Note that some terms have a nomenclature different from the one proposed in Chapter 13 (for example, error terms), since another analysis level will be considered here to define the hierarchical modeling.
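As a quick numerical illustration of Expression (23.1), the sketch below (numpy assumed; all parameter values hypothetical) simulates a model with two explanatory variables and recovers the coefficients by least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate Y_i = b0 + b1*X1_i + b2*X2_i + r_i with b = (1, 2, -3).
n = 500
X1, X2 = rng.normal(size=n), rng.normal(size=n)
r = rng.normal(0, 0.5, size=n)
Y = 1.0 + 2.0 * X1 - 3.0 * X2 + r

# OLS estimation: column of ones for the intercept, then the X columns.
X = np.column_stack([np.ones(n), X1, X2])
b, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.round(b, 1))  # estimates approximately (1, 2, -3)
```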

The model represented by Expression (23.1) presents observations considered homogeneous, that is, they do not come from different groups that, for some reason, could influence the behavior of variable Y differently. Nevertheless, we could think of two groups of observations, from which two different models would be estimated, as follows:

Yi1 = b01 + b11X1i1 + b21X2i1 + ... + bQ1XQi1 + ri1    (23.2)

Yi2 = b02 + b12X1i2 + b22X2i2 + ... + bQ2XQi2 + ri2    (23.3)

where coefficients b01 and b02 represent the expected average values of Y for the observations of groups 1 and 2, respectively, when all the explanatory variables are equal to zero, and b11, b21, ..., bQ1 and b12, b22, ..., bQ2 are the coefficients of variables X1, ..., XQ in the model of each group (1 and 2), respectively. In addition to this, r1 and r2 represent the specific error terms in each model.

Therefore, for j = 1, ..., J groups, we can write the general expression of a regression model for clustered data, considered a first-level model, as follows:

Yij = b0j + b1jX1ij + b2jX2ij + ... + bQjXQij + rij = b0j + Σ_{q=1}^{Q} bqjXqij + rij    (23.4)

For educational purposes and aiming at constructing an illustrative chart, we can write the expression for the expected values of Y, that is, Ŷ, for each observation i that belongs to each group j, when there is only one explanatory variable X in the model proposed, as follows:

Group 1: Ŷi1 = β01 + β11Xi1    (23.5)

Group 2: Ŷi2 = β02 + β12Xi2    (23.6)

⋮

Group J: ŶiJ = β0J + β1JXiJ    (23.7)

where parameters β are the estimations of coefficients b, following the standard used in this book.

The chart in Fig. 23.7 shows the plotting of Expressions (23.5) to (23.7) in a conceptual way. Through it, we can see that the individual models that represent the observations of each group can have different intercepts and slopes, which may occur based on certain characteristics of the groups themselves.

Fig. 23.7 Individual models that represent the observations of each one of the J groups.

Thus, there must be invariable group characteristics (second level) for the observations that belong to each group (as shown in Table 23.1), which can explain the differences in the intercepts and slopes of the models that represent these groups. Therefore, based on the following regression model with one explanatory variable X and with observations nested into j = 1, ..., J groups:

$$Y_{ij} = b_{0j} + b_{1j}X_{ij} + r_{ij} \qquad (23.8)$$

we can write the intercept b0j and slope b1j expressions based on a certain explanatory variable W as follows, which represents a characteristic of the j groups:

  • Intercepts

$$\text{Group 1: } b_{01} = \gamma_{00} + \gamma_{01}W_1 + u_{01} \qquad (23.9)$$

$$\text{Group 2: } b_{02} = \gamma_{00} + \gamma_{01}W_2 + u_{02} \qquad (23.10)$$

$$\vdots$$

$$\text{Group } J\text{: } b_{0J} = \gamma_{00} + \gamma_{01}W_J + u_{0J} \qquad (23.11)$$

or, in a more general way:

$$b_{0j} = \gamma_{00} + \gamma_{01}W_j + u_{0j} \qquad (23.12)$$

where γ00 represents the expected value of the dependent variable for a certain observation i that belongs to a group j when X = W = 0 (general intercept), and γ01 represents the alteration in the expected value of the dependent variable for a certain observation i that belongs to a group j when there is a unit alteration in characteristic W of group j, ceteris paribus. Moreover, u0j represents the error terms that indicate that there is randomness in the intercepts, which can be generated by the presence of observations from different groups in the dataset.

  • Slopes

$$\text{Group 1: } b_{11} = \gamma_{10} + \gamma_{11}W_1 + u_{11} \qquad (23.13)$$

$$\text{Group 2: } b_{12} = \gamma_{10} + \gamma_{11}W_2 + u_{12} \qquad (23.14)$$

$$\vdots$$

$$\text{Group } J\text{: } b_{1J} = \gamma_{10} + \gamma_{11}W_J + u_{1J} \qquad (23.15)$$

or, in a more general way:

$$b_{1j} = \gamma_{10} + \gamma_{11}W_j + u_{1j} \qquad (23.16)$$

where γ10 represents the alteration in the expected value of the dependent variable for a certain observation i that belongs to a group j when there is a unit alteration in characteristic X of individual i, ceteris paribus (change in the slope because of X), and γ11 represents the alteration in the expected value of the dependent variable for a certain observation i that belongs to a group j when there is a unit alteration in the multiplication W.X, also ceteris paribus (change in the slope because of W.X). Besides, u1j represents the error terms that indicate that there is randomness in the slopes of the models regarding the groups, which can also be generated by the presence of observations from different groups in the dataset.

By combining Expressions (23.8), (23.12), and (23.16), we obtain the following expression:

$$Y_{ij} = \underbrace{(\gamma_{00} + \gamma_{01}W_j + u_{0j})}_{\text{random effects intercept}} + \underbrace{(\gamma_{10} + \gamma_{11}W_j + u_{1j})}_{\text{random effects slope}}X_{ij} + r_{ij}, \qquad (23.17)$$

which facilitates the visualization that the intercept and slope can be influenced by random effects resulting from the existence of observations that belong to different groups.

Essentially, multilevel models represent a set of techniques that, besides estimating the parameters of the proposed model, allow us to estimate the variance components of the error terms (in the case of the model in Expression (23.17), u0j, u1j, and rij), as well as their respective statistical significances, so that we can verify whether randomness in fact occurs in the intercepts and slopes as a result of the presence of higher analysis levels. If the variances of error terms u0j and u1j in the model in Expression (23.17) are not statistically significant, that is, if both are statistically equal to zero, it becomes suitable to estimate a linear regression model through traditional methods, such as OLS, since the existence of randomness in the intercepts and slopes cannot be verified.

We can assume that random effects u0j and u1j follow a multivariate normal distribution, with means equal to zero and variances equal to τ00 and τ11, respectively. Furthermore, error terms rij follow a normal distribution, with mean equal to zero and variance equal to σ2. Thus, we can define the following variance-covariance matrices for the error terms:

$$\operatorname{var}[u] = \operatorname{var}\begin{bmatrix} u_{0j} \\ u_{1j} \end{bmatrix} = \mathbf{G} = \begin{bmatrix} \tau_{00} & \sigma_{01} \\ \sigma_{01} & \tau_{11} \end{bmatrix} \qquad (23.18)$$

$$\operatorname{var}[r] = \operatorname{var}\begin{bmatrix} r_{1j} \\ \vdots \\ r_{nj} \end{bmatrix} = \sigma^2\mathbf{I}_n = \begin{bmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{bmatrix} \qquad (23.19)$$

These matrices will be used very soon when we discuss the methods for estimating multilevel model parameters. Therefore, we can define the relationship between the variances of these error terms, known as intraclass correlation, as follows:

$$\rho = \frac{\tau_{00} + \tau_{11}}{\tau_{00} + \tau_{11} + \sigma^2} \qquad (23.20)$$

This intraclass correlation measures the proportion of the total variance of Y that is due to the level-2 (group) structure. If it is equal to zero, there is no variance between the level-2 groups. However, if it is considerably different from zero, due to the presence of at least one significant error term resulting from the presence of level 2 in the analysis, traditional procedures for estimating model parameters, such as ordinary least squares (OLS), are not suitable. In the limit, a value equal to 1, that is, σ2 = 0, would suggest that there are no differences between the individuals within each group, that is, that all of them are identical, which is highly unlikely. This correlation is also called level-2 intraclass correlation.
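The intraclass correlation in Expression (23.20) can be computed directly from estimated variance components. A minimal sketch in Python, with purely illustrative variance values:

```python
# Level-2 intraclass correlation from estimated variance components,
# following Expression (23.20). The input values below are hypothetical.

def intraclass_correlation(tau00, tau11, sigma2):
    """Proportion of the total variance due to the level-2 structure."""
    between = tau00 + tau11   # variance of the level-2 random effects
    return between / (between + sigma2)

# If tau00 = tau11 = 0, rho = 0 and OLS estimation becomes adequate;
# if sigma2 = 0, rho = 1 and all individuals within a group are identical.
print(round(intraclass_correlation(10.0, 2.0, 28.0), 2))  # → 0.3
```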

In Section 23.5.1, we will use likelihood-ratio tests aiming at verifying if τ00 = τ11 = 0, which would favor the estimation of a traditional regression model, or at least if τ11 = 0, which would allow researchers to choose a random intercepts model (τ00 ≠ 0) instead of a random slopes model (τ11 ≠ 0).

We can rearrange Expression (23.17), in order to separate the fixed effects component, from which the model parameters are estimated, from the random effects component, from which the variances of the error terms are estimated. Therefore, we have:

$$Y_{ij} = \underbrace{\gamma_{00} + \gamma_{10}X_{ij} + \gamma_{01}W_j + \gamma_{11}W_jX_{ij}}_{\text{Fixed Effects}} + \underbrace{u_{0j} + u_{1j}X_{ij} + r_{ij}}_{\text{Random Effects}} \qquad (23.21)$$

which allows researchers to see more clearly that the random effects component can also influence the behavior of the dependent variable. We can even notice that one explanatory variable may be a part of this random component. By estimating such a multilevel model, we will see that, while fixed effects refer to the relationship between the behavior of certain characteristics and the behavior of Y, random effects allow us to analyze possible distortions in the behavior of Y between the units of the second analysis level.
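To make this decomposition concrete, the data-generating process behind Expression (23.21) can be simulated. All parameter and variance values below are hypothetical, chosen only to illustrate how the fixed and random components combine:

```python
# Illustrative simulation (not the chapter's dataset) of clustered data
# generated by the two-level model in Expression (23.21).
import numpy as np

rng = np.random.default_rng(0)
J, n_per_group = 50, 30
g00, g10, g01, g11 = 2.0, 1.5, 0.8, -0.3   # fixed effects (assumed values)
tau00, tau11, sigma2 = 4.0, 0.25, 1.0      # variance components (assumed values)

W = rng.normal(size=J)                      # level-2 characteristic of each group
u0 = rng.normal(0.0, np.sqrt(tau00), J)     # random intercepts u_0j
u1 = rng.normal(0.0, np.sqrt(tau11), J)     # random slopes u_1j

rows = []
for j in range(J):
    X = rng.normal(size=n_per_group)        # level-1 variable for group j
    r = rng.normal(0.0, np.sqrt(sigma2), n_per_group)
    # fixed effects component + random effects component, as in (23.21)
    Y = (g00 + g10 * X + g01 * W[j] + g11 * W[j] * X) + (u0[j] + u1[j] * X + r)
    rows.append(Y)

Y_all = np.concatenate(rows)
print(Y_all.shape)  # (1500,)
```

Fitting a multilevel model to data generated this way should recover the fixed effects and the three variance components, which is exactly what the estimation methods discussed next aim to do.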

In general and from Expression (23.4), we can define a model with two analysis levels, in which the first level offers explanatory variables X1, ..., XQ that refer to each individual i, and the second level, explanatory variables W1, ..., WS that refer to each group j, in the following way:

$$\text{Level 1: } Y_{ij} = b_{0j} + \sum_{q=1}^{Q} b_{qj}X_{qij} + r_{ij} \qquad (23.22)$$

$$\text{Level 2: } b_{qj} = \gamma_{q0} + \sum_{s=1}^{S_q} \gamma_{qs}W_{sj} + u_{qj} \qquad (23.23)$$

where q = 0, 1, ..., Q and s = 1, ..., Sq.

Concerning the estimation of the model, while fixed effects parameters are estimated in a traditional way in software packages such as Stata and SPSS, i.e., by the maximum likelihood estimation (MLE), as we studied in Chapters 14 and 15, the variance components of error terms can be estimated both by maximum likelihood and by restricted maximum likelihood (REML).

Parameter estimations through MLE or REML are computationally intensive, which is why we will not develop them algebraically in this chapter, as we did in Chapters 14 and 15 when we presented some practical examples. Nevertheless, both require the optimization of a certain objective function, which usually starts from initial values of the parameters and uses a sequence of iterations to find the parameters that maximize the previously defined likelihood function.

In order to present the concepts regarding the REML method, let's imagine, for example, a regression model with only one constant, where Yi (i = 1, ..., n) is a dependent variable that follows a normal distribution, with mean μ and variance σY2. While the maximum likelihood estimation of σY2 is obtained considering the n terms Yi − μ, the estimation of σY2 through REML is obtained from the first (n − 1) terms $Y_i - \bar{Y}$, whose distribution does not depend on μ. In other words, maximum likelihood methods applied to this last distribution generate an unbiased estimation of σY2, because this is the sample variance itself, obtained by dividing the sum of squares by (n − 1). This is the reason why the restricted maximum likelihood estimation is also known as estimation through reduced maximum likelihood.
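In this one-constant example, the contrast between the two estimators reduces to dividing the sum of squares by n (ML, biased) or by n − 1 (REML, unbiased). A quick numerical check with illustrative data:

```python
# ML vs. REML variance estimates for a model with only one constant.
# The data vector is illustrative.
import numpy as np

y = np.array([4.0, 7.0, 5.0, 9.0, 6.0])
n = y.size
ss = ((y - y.mean()) ** 2).sum()   # sum of squared deviations from the mean

ml_var = ss / n          # biased ML estimate (divides by n)
reml_var = ss / (n - 1)  # unbiased REML estimate (divides by n - 1)

# The REML estimate exceeds the ML estimate by the factor n/(n - 1).
print(ml_var, reml_var)
```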

In order to present the expressions of the likelihood and restricted likelihood functions, from which, through maximization, multilevel model parameters can be estimated, let’s write, in matrix notation, the general expression of a multilevel model with fixed and random effects:

$$\mathbf{Y} = \mathbf{A}\boldsymbol{\gamma} + \mathbf{B}\mathbf{u} + \mathbf{r} \qquad (23.24)$$

where Y is a vector n x 1 that represents the dependent variable, A is a matrix n x (q + s + q ∙ s + 1) with data from all the variables to be inserted into the model's fixed effects component, γ is a vector (q + s + q ∙ s + 1) x 1 with all the estimated fixed effects parameters, and B is a matrix n x (q + 1) with data from all the variables to be inserted into the random effects component, where u is a vector of random error terms with dimensions (q + 1) x 1 and variance-covariance matrix G. Besides, r is a vector n x 1 of error terms with mean zero and variance matrix $\sigma^2\mathbf{I}_n$. Based on Expressions (23.18) and (23.19), we can determine that:

$$\operatorname{var}\begin{bmatrix} \mathbf{u} \\ \mathbf{r} \end{bmatrix} = \begin{bmatrix} \mathbf{G} & \mathbf{0} \\ \mathbf{0} & \sigma^2\mathbf{I}_n \end{bmatrix} \qquad (23.25)$$

and, in this regard, the variance-covariance matrix n x n of Y, given by V, can be obtained as follows:

$$\mathbf{V} = \mathbf{B}\mathbf{G}\mathbf{B}' + \sigma^2\mathbf{I}_n \qquad (23.26)$$

From this matrix, as shown by Searle et al. (2006), the following expression of the logarithmic likelihood function can be defined, which must be maximized (MLE):

$$LL = -\frac{1}{2}\left[n\ln(2\pi) + \ln|\mathbf{V}| + (\mathbf{Y}-\mathbf{A}\boldsymbol{\gamma})'\mathbf{V}^{-1}(\mathbf{Y}-\mathbf{A}\boldsymbol{\gamma})\right] \qquad (23.27)$$

In addition, according to the same authors, from Expression (23.27), the expression of the logarithm of the restricted likelihood function is given by:

$$LL_r = LL - \frac{1}{2}\ln\left|\mathbf{A}'\mathbf{V}^{-1}\mathbf{A}\right| \qquad (23.28)$$

The fact that the REML method generates unbiased estimations of the error terms' variances in multilevel models may make researchers choose to use it unconditionally. However, likelihood-ratio tests based on estimations obtained through REML are not suitable for comparing models with different fixed effects specifications. For these situations, in which there is the intention of elaborating such tests, we recommend that the variances of the error terms be estimated through MLE, which is already the method used to estimate the model parameters. Besides, it is important to mention that the differences between the estimations of the error terms' variances obtained through REML or through MLE are practically nonexistent for large samples.

In the following section, we will discuss the specification of three-level hierarchical linear models with repeated measures, maintaining the logic proposed in this book.

23.4.2 Three-Level Hierarchical Linear Models With Repeated Measures (HLM3)

Following the logic proposed in the previous section, let’s present the specification of a three-level hierarchical linear model, in which there are data with repeated measures, that is, with temporal evolution in the dependent variable.

In general, and following the logic presented in Raudenbush et al. (2004), a three-level hierarchical model has three submodels, one for each analysis level of the nested data structure. Therefore, based on Expressions (23.22) and (23.23), we can define a general model with three analysis levels and nested data, in which the first level presents explanatory variables Z1, ..., ZP that refer to level-1 units i (i = 1, ..., n), the second level presents explanatory variables X1, ..., XQ that refer to level-2 units j (j = 1, ..., J), and the third level presents explanatory variables W1, ..., WS that refer to level-3 units k (k = 1, ..., K), as follows:

$$\text{Level 1: } Y_{ijk} = \pi_{0jk} + \sum_{p=1}^{P} \pi_{pjk}Z_{pijk} + e_{ijk} \qquad (23.29)$$

where πpjk (p = 0, 1, ..., P) refer to the level-1 coefficients, Zpijk is the p-th level-1 explanatory variable for observation i in the level-2 unit j and in the level-3 unit k, and eijk refers to the level-1 error terms, which follow a normal distribution with mean equal to zero and variance equal to σ2.

$$\text{Level 2: } \pi_{pjk} = b_{p0k} + \sum_{q=1}^{Q_p} b_{pqk}X_{qjk} + r_{pjk} \qquad (23.30)$$

where bpqk (q = 0, 1, ..., Qp) refer to the level-2 coefficients, Xqjk is the q-th level-2 explanatory variable for unit j in the level-3 unit k, and rpjk are the level-2 random effects, assuming, for each unit j, that the vector (r0jk, r1jk, ..., rPjk)′ follows a multivariate normal distribution with each element having mean zero and variance $\tau_{\pi_{pp}}$.

$$\text{Level 3: } b_{pqk} = \gamma_{pq0} + \sum_{s=1}^{S_{pq}} \gamma_{pqs}W_{sk} + u_{pqk} \qquad (23.31)$$

where γpqs (s = 0, 1, ..., Spq) refer to the level-3 coefficients, Wsk is the s-th level-3 explanatory variable for unit k, and upqk are the level-3 random effects, assuming that, for each unit k, the vector formed by terms upqk follows a multivariate normal distribution with each element having mean zero and variance $\tau_{b_{pp}}$, which results in a variance-covariance matrix Tb with a maximum dimension equal to:

$$\dim_{\max}\mathbf{T}_b = \left[\sum_{p=0}^{P}(Q_p+1)\right] \times \left[\sum_{p=0}^{P}(Q_p+1)\right], \qquad (23.32)$$

which depends on the number of level-3 coefficients specified with random effects.

In order to maintain the logic presented in the previous section and aiming at facilitating the understanding of the example that will be studied in Sections 23.5.2 and 23.6.2, let's imagine a single level-1 explanatory variable that corresponds to the periods in which the data of the dependent variable are monitored. In other words, level-2 units j nested into level-3 units k are monitored for a period t (t = 1, ..., Tj), which makes the dataset contain J time series, as shown in Table 23.2. The main objective is to verify if there are discrepancies in the temporal evolution of the data of the dependent variable and, if so, whether these happen due to characteristics of the level-2 and level-3 units. This temporal evolution is what characterizes the term repeated measures.

In this regard, Expression (23.29) can be rewritten as follows, in which subscripts i become subscripts t:

$$Y_{tjk} = \pi_{0jk} + \pi_{1jk}\,\text{period}_{tjk} + e_{tjk} \qquad (23.33)$$

where π0jk represents the intercept of the model that corresponds to the temporal evolution of the dependent variable of level-2 unit j nested into level-3 unit k, and π1jk corresponds to the average evolution (slope) of the dependent variable for the same unit throughout the period analyzed. The substructures that correspond to levels 2 and 3 remain with the same specifications as those respectively presented in Expressions (23.30) and (23.31).

The chart seen in Fig. 23.8 plots the set of models represented by Expression (23.33) in a conceptual way. Through it, we can see that the individual models that represent level-2 units j can present different intercepts and slopes throughout period t, a fact that may occur due to certain characteristics of the level-2 units j themselves or due to characteristics of the level-3 units k.

Fig. 23.8 Individual models that represent the temporal evolution of the dependent variable for each of the J level-2 units.

Thus, there must be characteristics of level-2 units j, temporally invariable, and of level-3 units k, invariable also for the level-2 units j nested into each level-3 unit k (as shown in Table 23.2), that can explain the differences in the intercepts and slopes of the models $\hat{Y}_{tjk} = \hat{\pi}_{0jk} + \hat{\pi}_{1jk}\,\text{period}_{tjk}$ represented in Fig. 23.8.

Hence, assuming that there is a single explanatory variable X that represents a characteristic of level-2 units j, and a single explanatory variable W that represents a characteristic of level-3 units k, from Expression (23.33) and based on Expressions (23.30) and (23.31), we can define the following model with three analysis levels. In this model, the first level refers to the measure repeated and only contains the temporal variable:

$$\text{Level 1: } Y_{tjk} = \pi_{0jk} + \pi_{1jk}\,\text{period}_{tjk} + e_{tjk} \qquad (23.34)$$

$$\text{Level 2: } \pi_{0jk} = b_{00k} + b_{01k}X_{jk} + r_{0jk} \qquad (23.35)$$

$$\pi_{1jk} = b_{10k} + b_{11k}X_{jk} + r_{1jk} \qquad (23.36)$$

$$\text{Level 3: } b_{00k} = \gamma_{000} + \gamma_{001}W_k + u_{00k} \qquad (23.37)$$

$$b_{01k} = \gamma_{010} + \gamma_{011}W_k + u_{01k} \qquad (23.38)$$

$$b_{10k} = \gamma_{100} + \gamma_{101}W_k + u_{10k} \qquad (23.39)$$

$$b_{11k} = \gamma_{110} + \gamma_{111}W_k + u_{11k} \qquad (23.40)$$

By combining Expressions (23.34) to (23.40), we obtain the following expression:

$$Y_{tjk} = \underbrace{(\gamma_{000} + \gamma_{001}W_k + \gamma_{010}X_{jk} + \gamma_{011}W_kX_{jk} + u_{00k} + u_{01k}X_{jk} + r_{0jk})}_{\text{random effects intercept}} + \underbrace{(\gamma_{100} + \gamma_{101}W_k + \gamma_{110}X_{jk} + \gamma_{111}W_kX_{jk} + u_{10k} + u_{11k}X_{jk} + r_{1jk})}_{\text{random effects slope}}\,\text{period}_{tjk} + e_{tjk} \qquad (23.41)$$

where γ000 represents the expected value of the dependent variable at the initial moment and when X = W = 0 (general intercept); γ001 represents the increase in the expected value of the dependent variable at the initial moment (alteration in the intercept) for a certain level-2 unit j that belongs to a level-3 unit k when there is a unit alteration in the characteristic W of k, ceteris paribus; γ010 represents the increase in the expected value of the dependent variable at the initial moment for a certain unit jk when there is a unit alteration in the characteristic X of j, ceteris paribus; and γ011 represents the increase in the expected value of the dependent variable at the initial moment for a certain unit jk when there is a unit alteration in the multiplication W.X, also ceteris paribus. Moreover, u00k and u01k represent the error terms that indicate that there is randomness in the intercepts, the latter associated with the alterations in variable X.

In addition, γ100 represents the alteration in the expected value of the dependent variable when there is a unit alteration in the analysis period (change in the slope due to a unit temporal evolution), ceteris paribus; γ101 represents the alteration in the expected value of the dependent variable due to a unit temporal evolution for a certain unit jk when there is a unit alteration in the characteristic W, ceteris paribus; γ110 represents the alteration in the expected value of the dependent variable due to a unit temporal evolution for a certain unit jk when there is a unit alteration in the characteristic X, ceteris paribus; and γ111 represents the alteration in the expected value of the dependent variable due to a unit temporal evolution for a certain unit jk when there is a unit alteration in the multiplication W.X, also ceteris paribus. Finally, u10k and u11k represent the error terms that indicate that there is randomness in the slopes, the latter associated with the alterations in variable X.

Expression (23.41) facilitates the visualization that the intercept and slope can be influenced by random effects resulting from different behaviors of the dependent variable throughout time for each of the level-2 units (different time series), and this phenomenon can be a result of these units’ characteristics, as well as of characteristics of the groups to which such units belong.

If researchers wish to analyze separately the fixed and random effects components that can influence the behavior of the dependent variable, which also facilitates the insertion of the commands to estimate multilevel models in Stata and in SPSS, as we will see, we just need to rearrange the terms of Expression (23.41) as follows:

$$Y_{tjk} = \underbrace{\gamma_{000} + \gamma_{001}W_k + \gamma_{010}X_{jk} + \gamma_{011}W_kX_{jk} + \gamma_{100}\,\text{period}_{tjk} + \gamma_{101}W_k\,\text{period}_{tjk} + \gamma_{110}X_{jk}\,\text{period}_{tjk} + \gamma_{111}W_kX_{jk}\,\text{period}_{tjk}}_{\text{Fixed Effects}} + \underbrace{u_{00k} + u_{01k}X_{jk} + u_{10k}\,\text{period}_{tjk} + u_{11k}X_{jk}\,\text{period}_{tjk} + r_{0jk} + r_{1jk}\,\text{period}_{tjk} + e_{tjk}}_{\text{Random Effects}} \qquad (23.42)$$

In three-level hierarchical models, we can define two intraclass correlations given the existence of two variance proportions. One corresponds to the behavior of the data that belong to the same level-2 units j and the same level-3 units k (level-2 intraclass correlation), and the other corresponds to the behavior of the data that belong to the same level-3 units k, however, from different level-2 units j (level-3 intraclass correlation). In Sections 23.5.2 and 23.6.2, we will calculate these intraclass correlations when we present some practical examples in Stata and in SPSS, respectively.
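The two intraclass correlations described above follow from the standard decomposition of the total variance into level-1, level-2, and level-3 components. A minimal sketch, with hypothetical variance components (tau_level2 for the level-2 random effects, tau_level3 for the level-3 random effects, and sigma2 for the level-1 error terms):

```python
# Level-2 and level-3 intraclass correlations in a three-level model,
# computed from hypothetical variance components.

def level3_icc(tau_level2, tau_level3, sigma2):
    # correlation between observations of the same level-3 unit k,
    # but from different level-2 units j
    total = tau_level3 + tau_level2 + sigma2
    return tau_level3 / total

def level2_icc(tau_level2, tau_level3, sigma2):
    # correlation between observations of the same level-2 unit j
    # (and therefore the same level-3 unit k)
    total = tau_level3 + tau_level2 + sigma2
    return (tau_level3 + tau_level2) / total

print(level3_icc(2.0, 3.0, 5.0))  # → 0.3
print(level2_icc(2.0, 3.0, 5.0))  # → 0.5
```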

From Expression (23.34) and as seen later, we can define the general expressions of the level-2 and 3 substructures of a hierarchical analysis with three levels and repeated measures, in which the second level offers explanatory variables X1, ..., XQ that refer to each unit j, and the third level, explanatory variables W1, ..., WS that refer to each unit k:

$$\text{Level 2: } \pi_{pjk} = b_{p0k} + \sum_{q=1}^{Q_p} b_{pqk}X_{qjk} + r_{pjk} \qquad (23.43)$$

$$\text{Level 3: } b_{pqk} = \gamma_{pq0} + \sum_{s=1}^{S_{pq}} \gamma_{pqs}W_{sk} + u_{pqk} \qquad (23.44)$$

Similar to what was discussed when the two-level hierarchical models were presented in the previous section, while the fixed effects parameters are estimated traditionally through maximum likelihood in software such as Stata and SPSS, the variance components of the error terms can be estimated both through maximum likelihood and through restricted maximum likelihood, as we will see in the following sections when we estimate three-level hierarchical models through these software packages.

Based on what has been exposed, in Section 23.5 we will estimate two-level hierarchical models with clustered data and three-level models with repeated measures in Stata. In Section 23.6, we will estimate the same models, however, in SPSS. The examples used follow the logic adopted throughout this book.

23.5 Estimation of Hierarchical Linear Models in Stata

The main objective of this section is to give researchers the opportunity to prepare multilevel modeling procedures through Stata Statistical Software. The use of the images in this section has been authorized by StataCorp LP©.

23.5.1 Estimation of a Two-Level Hierarchical Linear Model With Clustered Data in Stata

We will discuss an example that follows the same logic seen in Chapters 13, 14, and 15. Now, however, with data that vary between individuals and between groups to which these individuals belong, characterizing a nested structure.

Let’s imagine that our shrewd and talented professor is now interested in expanding his research to other schools. He has already explored the effects of certain explanatory variables regarding the time it takes a group of students to get to school, the probability of arriving late, and how many times students were late per week or per month through multiple regression, binary and multinomial logistic regression, and regression for count data models, respectively. Now, he wants to investigate if there are differences in the school performance behavior between students from different schools and, if yes, if these differences occur due to characteristics of the schools themselves.

In order to do this, the professor managed to get data on students' school performance (scores from 0 to 100, plus a bonus for participation in class). He collected data on 2,000 students from 46 schools. In addition, he also managed to get data on students' behavior, such as the number of hours spent studying per week, as well as data regarding the type of school (public or private) and professors' years of teaching experience. Part of the dataset can be seen in Table 23.3. The complete dataset can be found in the files PerformanceStudentSchool.xls (Excel) and PerformanceStudentSchool.dta (Stata).

Table 23.3

Example: School Performance and Students' (Level 1) and Schools' (Level 2) Characteristics

Student i (Level 1) | School j (Level 2) | Performance at School (Yij) | Number of Hours Spent Studying per Week (Xij) | Professors' Years of Teaching Experience (W1j) | Public or Private School (W2j)
1 | 1 | 35.4 | 11 | 2 | public
2 | 1 | 74.9 | 23 | 2 | public
... | ... | ... | ... | ... | ...
47 | 1 | 24.8 | 9 | 2 | public
48 | 2 | 41.0 | 13 | 2 | public
... | ... | ... | ... | ... | ...
72 | 2 | 65.2 | 20 | 2 | public
... | ... | ... | ... | ... | ...
121 | 4 | 66.4 | 20 | 9 | private
... | ... | ... | ... | ... | ...
140 | 4 | 93.4 | 27 | 9 | private
... | ... | ... | ... | ... | ...
1995 | 46 | 44.0 | 15 | 2 | public
... | ... | ... | ... | ... | ...
2000 | 46 | 56.6 | 17 | 2 | public

After opening the file PerformanceStudentSchool.dta, we can type the command desc, which makes it possible to analyze the dataset characteristics, such as the number of observations, the number of variables, and the description of each one of them. Fig. 23.9 shows this first output in Stata.

Fig. 23.9 Description of the PerformanceStudentSchool.dta Dataset.

First, we can obtain information on the number of students that were researched by the professor at each school, through the following command:

tabulate school, subpop(student)

The outputs can be found in Fig. 23.10 and, through them, we can see that, in this case, we have an unbalanced clustered data structure.

Fig. 23.10 Number of students per school.

Students’ average performance per school, which can be seen in Fig. 23.11, can be obtained through the following commands:

Fig. 23.11 Students' average performance per school.
bysort school: egen average_performance = mean(performance)

tabstat average_performance, by(school)

To conclude this initial diagnostic, we can construct a chart that allows the visualization of students’ average performance per school. This chart can be seen in Fig. 23.12 and can be obtained by typing the following command:

Fig. 23.12 Students' average performance per school.

graph twoway scatter performance school || connected average_performance school, connect(L) || , ytitle(performance at school)

Having characterized the nesting of students into schools based on our example’s clustered data, now, we can apply the multilevel modeling itself, constructing the procedures aiming at estimating a two-level hierarchical linear model (students and school). In the school performance modeling, even though a possibility is the inclusion of dummy variables that represent schools into the fixed effects component, let’s treat these level-2 units as random effects to estimate these models.

The first model to be estimated, known as null model or nonconditional model, allows us to check if there is variability in the school performance between students from different schools. This is because no explanatory variable will be inserted into the modeling, which only considers the existence of one intercept and error terms u0j and rij, with variances equal to τ00 and σ2, respectively. Therefore, the model to be estimated has the following expression:

  • Null Model

$$\text{performance}_{ij} = b_{0j} + r_{ij}$$

$$b_{0j} = \gamma_{00} + u_{0j},$$

which results in:

$$\text{performance}_{ij} = \gamma_{00} + u_{0j} + r_{ij}$$

For the data in our example, the command for estimating the null model in Stata is:

xtmixed performance || school: , var nolog reml

where the term xtmixed refers to the estimation of any hierarchical linear model, and the first variable to be inserted corresponds to the dependent variable, as in any other estimation of a regression model. Explanatory variables may be included afterwards. Furthermore, there is a second part of the command xtmixed that starts with the term ||. While the first part of the command corresponds to the fixed effects, the second part is related to the random effects that can be generated if there is a second analysis level, which, in this case, refers to the schools (hence the second part begins with the term school: ). The term var makes the estimations of the variances of the error terms u0j and rij (τ00 and σ2, respectively) be presented in the outputs, instead of the standard deviations. On the other hand, the term nolog makes the results of the iterations for the maximization of the logarithm of the restricted likelihood function not be presented in the outputs. Finally, researchers can also define the estimation method to be used through the terms reml (restricted maximum likelihood estimation) or mle (maximum likelihood estimation).1
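xtmixed itself maximizes the (restricted) likelihood function to obtain τ00 and σ2. Only as a rough illustration of the variance decomposition being estimated, a moment-based (one-way ANOVA) sketch on simulated, balanced clustered data, with variance values chosen to mimic the scale of this example:

```python
# Moment-based (one-way ANOVA) sketch of the null model's variance
# decomposition on simulated data; xtmixed uses (restricted) maximum
# likelihood instead, but estimates the same quantities.
import numpy as np

rng = np.random.default_rng(42)
J, n = 200, 25                                  # groups and observations per group
tau00_true, sigma2_true = 136.0, 348.0          # values chosen to mimic the example

u0 = rng.normal(0.0, np.sqrt(tau00_true), J)    # random intercept of each group
y = 61.0 + u0[:, None] + rng.normal(0.0, np.sqrt(sigma2_true), (J, n))

sigma2_hat = y.var(axis=1, ddof=1).mean()       # pooled within-group variance
# variance of group means = tau00 + sigma2/n, so subtract the sampling part:
tau00_hat = y.mean(axis=1).var(ddof=1) - sigma2_hat / n

rho = tau00_hat / (tau00_hat + sigma2_hat)
print(round(rho, 3))  # close to 136 / (136 + 348), i.e., roughly 0.28
```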

The outputs generated can be seen in Fig. 23.13.

Fig. 23.13 Outputs of the null model in Stata.

From the outputs in Fig. 23.13, initially, we can see that the estimation of parameter γ00 is equal to 61.049, which corresponds to students' expected average school performance (horizontal line estimated in the null model, or general intercept).2 Moreover, at the bottom of the outputs, the estimations of the variances of the error terms are presented: τ00 = 135.779 (in Stata, var(_cons)) and σ2 = 347.562 (in Stata, var(Residual)). Thus, based on Expression (23.20), we can calculate the following intraclass correlation:

$$\rho = \frac{\tau_{00}}{\tau_{00} + \sigma^2} = \frac{135.779}{135.779 + 347.562} = 0.281,$$

which suggests that approximately 28% of the total variance of the school performance is due to changes between schools, representing a first sign that there is variability in students’ school performance when they come from different schools. After Stata 13, it is possible to directly obtain this intraclass correlation by typing the command estat icc right after the estimation of the corresponding model.

Even though Stata does not directly show the result of the z tests with their respective significance levels for the random effect parameters, the fact that the estimation of variance component τ00, which corresponds to random intercepts u0j, is considerably higher than its standard error suggests that there is significant variation in the school performance between schools. Statistically, we can see that z = 135.779/30.750 = 4.416 > 1.96, where 1.96 is the critical value of the standardized normal distribution, which results in a significance level of 0.05.
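The two quick calculations above, the intraclass correlation and the informal z statistic, can be reproduced directly from the values reported in Fig. 23.13:

```python
# Reproducing the intraclass correlation and the informal z statistic
# from the null-model output values reported in Fig. 23.13.

tau00, se_tau00 = 135.779, 30.750   # var(_cons) and its standard error
sigma2 = 347.562                    # var(Residual)

rho = tau00 / (tau00 + sigma2)      # level-2 intraclass correlation
z = tau00 / se_tau00                # variance estimate over its standard error

print(round(rho, 3), round(z, 3))   # 0.281 4.416
```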

This piece of information is extremely important to support the choice of the hierarchical modeling, to the detriment of a traditional regression model estimate by OLS. Moreover, it is the main reason why a null model is always estimated when we carry out multilevel analyses.

At the bottom of Fig. 23.13, we can verify this fact, analyzing the result of the likelihood-ratio test (LR test). Since Sig. χ2 = 0.000, we can reject the null hypothesis that the random intercepts are equal to zero (H0: u0j = 0), which makes the estimation of a traditional linear regression model be ruled out for the clustered data in our example.

First, let’s investigate if the level-1 explanatory variable, hours, has any relationship to the school performance behavior of students from the same school (variation between students) and from different schools (variation between schools). A first diagnostic can be elaborated by typing the following command, which generates the chart seen in Fig. 23.14:

Fig. 23.14 School performance based on the variable hours (variation between students from the same school and between different schools).
statsby intercept =_b[_cons] slope =_b[hours], by(school) saving(ols, replace): reg performance hours
sort school
merge school using ols
drop _merge
gen yhat_ols = intercept + slope*hours
sort school hours
separate performance, by(school)
separate yhat_ols, by(school)
graph twoway connected yhat_ols1-yhat_ols46 hours || lfit performance hours, clwidth(thick) clcolor(black) legend(off) ytitle(performance at school)

The chart in Fig. 23.14 shows the linear adjustment by OLS, for each school, of the behavior of each student’s school performance based on the number of hours spent studying per week. We can see that, even though there is significant improvement in school performance as the number of hours spent studying per week increases (fortunately), this relationship is not the same for every school. Moreover, the intercepts of each model are clearly different.

Therefore, our duty is to investigate whether random effects occur in the intercepts and in the slopes generated by the variable hours, since there are several schools, and, if so, to investigate subsequently whether certain school characteristics can account for this fact. Note that this last command also generates a new file in Stata (ols.dta), in which the differences between the schools can be analyzed.
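The per-school OLS fits produced by the statsby command can be sketched in Python with a simple groupby. The DataFrame below is simulated, with hypothetical school-specific intercepts and slopes, but uses the same column names as the chapter's dataset (school, hours, performance):

```python
# Per-group OLS fits, analogous to the statsby command: one intercept and
# one slope per school. Data are simulated for illustration only.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
school = np.repeat(np.arange(10), 40)           # 10 hypothetical schools
hours = rng.uniform(5, 30, school.size)
intercepts = rng.normal(5.0, 4.0, 10)           # hypothetical school intercepts
slopes = rng.normal(3.0, 0.5, 10)               # hypothetical school slopes
performance = intercepts[school] + slopes[school] * hours \
    + rng.normal(0.0, 6.0, school.size)
df = pd.DataFrame({"school": school, "hours": hours, "performance": performance})

def ols_line(g):
    # np.polyfit with deg=1 returns (slope, intercept)
    slope, intercept = np.polyfit(g["hours"], g["performance"], deg=1)
    return pd.Series({"intercept": intercept, "slope": slope})

per_school = df.groupby("school").apply(ols_line)
print(per_school.head())
```

The resulting table plays the same role as the ols.dta file generated by statsby: one row per school, whose spread in intercepts and slopes is what the random effects will later try to capture.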

If researchers chose not to include random effects in the modeling, that is, if the likelihood-ratio test elaborated in the estimation of the null model did not reject H0 (u0j = 0), they would just need to type the following command, as discussed in Chapter 13, in order for our model parameters to be estimated:

reg performance hours

Only for educational purposes, the parameters estimated when we typed this last command (reg), whose outputs are not presented here, are the same as those that would be obtained through the following command:

xtmixed performance hours, reml

since the term xtmixed without the specification of random effects generates, through the restricted maximum likelihood estimation (term reml), parameters with values identical to the ones estimated by the ordinary least squares method (linear regression only with fixed effects).

Based on the logic proposed here, initially, let’s insert intercept random effects into our multilevel model, which will start having the following specification:

  • Random Intercepts Model

performanceij = b0j + b1j ⋅ hoursij + rij

b0j = γ00 + u0j

b1j = γ10,

which results in the following expression:

performanceij = γ00 + γ10 ⋅ hoursij + u0j + rij

For the data in our example, the command for estimating the random intercepts model in Stata is:

xtmixed performance hours || school: , var nolog reml

which generates the outputs seen in Fig. 23.15.

Fig. 23.15
Fig. 23.15 Outputs of the random intercepts model.

Similarly, at the top of the outputs, we can see the fixed effects of our model, which includes 46 separate intercepts (one for each school), even though they are not presented directly. At the bottom, we can see the estimation of the variances of error terms τ00 = 19.125 and σ2 = 31.764. This model’s intraclass correlation is calculated as follows:

rho = τ00/(τ00 + σ2) = 19.125/(19.125 + 31.764) = 0.376,

which shows an increase in the proportion of the variance component that corresponds to the intercept in relation to the null model, demonstrating the importance of including the variable hours to study the school performance behavior when comparing the schools. As already verified in the null model, the estimation of variance component τ00 is almost five times higher than its standard error (z = 19.125/4.199 = 4.555 > 1.96), suggesting that there may be a significant variation in the average school performance between schools due to the existence of random intercepts (the intercepts vary in a statistically significant way from school to school).
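Both quantities used in this assessment can be computed directly from the reported variance components; a minimal Python sketch (function names are ours):

```python
def icc(tau00, sigma2):
    """Intraclass correlation: share of the total residual variance
    attributable to the between-school (intercept) component."""
    return tau00 / (tau00 + sigma2)

def z_ratio(estimate, std_err):
    """Ratio of a variance component to its standard error, compared
    informally against 1.96 (5% critical value of the standard normal)."""
    return estimate / std_err
```

With the values above, icc(19.125, 31.764) returns approximately 0.376 and z_ratio(19.125, 4.199) approximately 4.555.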

By analyzing the result of the likelihood-ratio test (LR test), here, we can also reject the null hypothesis that the random intercepts are equal to zero (H0: u0j = 0), since Sig. χ2 = 0.000, proving that the estimation of a traditional linear regression model only with fixed effects must be ruled out.

Therefore, now, our model starts to have the following specification:

performanceij = 0.534 + 3.252 ⋅ hoursij + u0j + rij

where the fixed effect of the intercept now corresponds to the average expected school performance, between schools, of students who, for some reason, do not study (hoursij = 0). On the other hand, one more hour spent studying per week, on average, makes the expected mean of school performance, between schools, increase by 3.252 points, and this parameter is statistically significant.

Only for educational purposes, as this last estimation represents a model in which the random component only contains intercepts, the maximum likelihood method (not restricted) would generate parameter estimations identical to the ones that would be obtained through a traditional estimation considering longitudinal panel data. Furthermore, an even more inquisitive researcher would be able to verify that the preparation of a generalized linear latent and mixed model (GLLAMM) would also generate the same parameter estimations. In other words, the following three commands generate identical parameter estimations and estimations of the error terms’ variances:

  •  Multilevel Model with Maximum Likelihood Estimation

xtmixed performance hours || school: , var nolog mle

where the term mle means maximum likelihood estimation.

  •  Model for Panel Data with Maximum Likelihood Estimation
xtset school student
xtreg performance hours, mle
  •  Generalized Linear Latent and Mixed Model

gllamm performance hours, i(school) adapt

where the option adapt makes the adaptive quadrature process be used instead of the standard process of an ordinary Gauss-Hermite quadrature.
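As a side note, ordinary Gauss-Hermite quadrature approximates integrals of the form ∫ f(x)·e^(−x²) dx by a weighted sum over fixed nodes; the adaptive version recenters and rescales those nodes for each cluster. A minimal three-point sketch in Python (nodes and weights from the standard closed form; illustrative only, not gllamm's implementation):

```python
import math

# 3-point Gauss-Hermite rule: nodes 0 and ±sqrt(3/2), with known weights.
NODES = [-math.sqrt(1.5), 0.0, math.sqrt(1.5)]
WEIGHTS = [math.sqrt(math.pi) / 6, 2 * math.sqrt(math.pi) / 3,
           math.sqrt(math.pi) / 6]

def gauss_hermite(f):
    """Approximate the integral of f(x) * exp(-x^2) over the real line;
    exact for polynomials f of degree up to 5."""
    return sum(w * f(x) for w, x in zip(WEIGHTS, NODES))
```

For example, gauss_hermite(lambda x: x * x) recovers √π/2, the exact value of ∫ x²·e^(−x²) dx.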

It is important to mention that the generalized linear latent and mixed models (GLLAMM) are analogous to the generalized linear models (GLM) studied in Chapters 13, 14, and 15. That is, they are also extremely useful for estimating models in which the dependent variable is categorical or has count data, and there is a nested data structure. In the Appendix of this chapter, we will present some examples of logistic, Poisson, and negative binomial hierarchical nonlinear models. To study this topic in depth, we also recommend Rabe-Hesketh et al. (2002), Rabe-Hesketh and Skrondal (2012a,b), and Fávero and Belfiore (2017).

Going back to our random intercepts model (outputs in Fig. 23.15), we can store (estimates store command) the estimations obtained for future comparison to the ones that will be generated when we estimate a model with random intercepts and slopes. Besides, through the command predict, reffects, we can also obtain the expected values of random effects u0j, known as BLUPs (best linear unbiased predictions), since the command xtmixed does not show them directly. In order to do that, we can type the following sequence of commands:

quietly xtmixed performance hours || school: , var nolog reml
estimates store randomintercept
predict u0, reffects
desc u0
by student, sort: generate tolist = (_n ==1)
list student u0 if student <= 10 | student > 1990 & tolist

Fig. 23.16 shows the values of random intercept terms u0j for the first and last 10 students of the dataset. We can see that these error terms are invariant for students from the same school. However, they vary between schools, which characterizes the existence of one intercept for each school.

Fig. 23.16
Fig. 23.16 Random intercept terms u0j.

In order to provide a better visualization of the random intercepts per school, we can generate a chart by typing the following command:

graph hbar (mean) u0, over(school) ytitle("Random Intercepts per School")

This chart can be seen in Fig. 23.17.

Fig. 23.17
Fig. 23.17 Random intercepts per school.

Since we will still carry out additional estimations, in order to arrive at a more complete model with level-2 explanatory variables, we will postpone the commands that generate the predicted values of each student's school performance; this procedure will be carried out later.

Having checked that students’ school performance is influenced by the number of hours spent studying per week, and that there are differences in the model intercepts between schools, let’s now analyze whether the slopes are also different between the schools. Even though the charts in Figs. 23.14 and 23.17 allow us to see discrepant intercepts between schools clearly, the same cannot be said in relation to the slopes of the 46 linear adjustments. Nevertheless, this situation must be assessed from a statistical standpoint. Therefore, let’s insert slope random effects into our multilevel model which, by maintaining the intercept random effects, will start to have the following expression:

  •  Random Intercepts and Slopes Model

performanceij = b0j + b1j ⋅ hoursij + rij

b0j = γ00 + u0j

b1j = γ10 + u1j,

which results in:

performanceij = γ00 + γ10 ⋅ hoursij + u0j + u1j ⋅ hoursij + rij

For the data in our example, the command for estimating the model with random intercepts and slopes in Stata is:

xtmixed performance hours || school: hours, var nolog reml

Note that the variable hours inserted after the term school: (random component of the command xtmixed) comes from the term u1j.hoursij present in the specification of the multilevel model. The results obtained in this estimation can be seen in Fig. 23.18.

Fig. 23.18
Fig. 23.18 Outputs of the model with random intercepts and slopes.

We can see that the parameter and variance estimations in the model with random intercepts and slopes are practically identical to the ones obtained when the model parameters were estimated only with random intercepts (Fig. 23.15). This occurs because the estimation of the variance τ11 of random slope terms u1j is statistically equal to zero (an extremely low value and a considerably greater standard error, with values equal to zero for the confidence intervals).

Even though this fact is clear in this case, researchers may choose to elaborate the likelihood-ratio test to compare the estimations obtained through the random intercepts model and through the model with random intercepts and slopes. In order to do that, the following command must be typed:

estimates store randomslope

and, next, the command that will elaborate the test:

lrtest randomslope randomintercept

since the term randomintercept refers to the estimation carried out previously. The result of the test can be seen in Fig. 23.19.

Fig. 23.19
Fig. 23.19 Likelihood-ratio test to compare the estimations of the models with random intercepts and with random intercepts and slopes.

The significance level of the test is equal to 1.000 (much greater than 0.05) because the logarithms of both restricted likelihood functions are identical (LLr = − 6372.164), making the LR χ2 statistic, with 1 degree of freedom, equal to 0. The model that only has random effects in the intercept is favored, proving that the random error terms u1j are statistically equal to zero. It is important to mention, as the note at the bottom of Fig. 23.19 also explains, that this likelihood-ratio test is only valid when comparing estimations obtained through restricted maximum likelihood (REML) of two models with identical fixed effects specifications. Since, in our case, both models were estimated through REML and present the same fixed effects specification γ00 + γ10 ⋅ hoursij, the test is considered valid.3

Only for educational purposes, another way of analyzing the statistical significance of the error terms of a multilevel model is to insert the term estmetric at the end of the command xtmixed, as follows:

xtmixed performance hours || school: hours, estmetric nolog reml

The outputs generated can be seen in Fig. 23.20.

Fig. 23.20
Fig. 23.20 Estimation of the parameters of the model with random intercepts and slopes, using the term estmetric.

The fixed effects parameter estimations are identical to the ones obtained previously. However, the term estmetric makes Stata report the estimations of the natural logarithm of the standard deviations of the error terms, instead of the variances of these terms, with the respective z statistics and their significance levels, which facilitates the interpretation of the statistical significance of each random term.

For the term rij, for example, instead of presenting the estimation of its variance σ2 = 31.764 (Fig. 23.18), the estimation of the natural logarithm of its standard deviation is presented, such that:

ln(√31.764) = ln(5.636) = 1.729
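This variance-to-log-standard-deviation conversion is easy to verify; a minimal Python sketch (the function name is ours):

```python
import math

def log_sd(variance):
    """Natural logarithm of the standard deviation implied by a variance,
    the metric in which estmetric reports residual terms."""
    return math.log(math.sqrt(variance))
```

With the residual variance above, log_sd(31.764) returns approximately 1.729.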

Therefore, we can prove that the random slope terms u1j are statistically equal to zero at a confidence level of 95%, for example, since Sig. z = 0.978 > 0.05.

At this moment, another pertinent discussion is related to the structure of the random effects (u0j and u1j) variance-covariance matrix. Since we did not specify any covariance structure for these error terms, Stata assumes, through the command xtmixed, that this structure is independent, that is, that cov(u0j, u1j) = σ01 = 0. In other words, based on Expression (23.18) and in the outputs shown in Fig. 23.18, we have:

G = var[u] = var[(u0j, u1j)′] = [τ00, 0; 0, τ11] = [19.125, 0; 0, 8.37 × 10−14]

Nevertheless, we can generalize the structure of matrix G, allowing u0j and u1j to be correlated, that is, that cov(u0j , u1j) = σ01 ≠ 0. In order to do that, we just need to add the term covariance(unstructured) to the command xtmixed, such that:

xtmixed performance hours || school: hours, covariance(unstructured) var nolog reml

The new outputs generated can be seen in Fig. 23.21.

Fig. 23.21
Fig. 23.21 Estimation of the parameters of the model with random intercepts and slopes, with correlated random effects u0j and u1j.

The new estimations of the error terms’ variances generate the following variance-covariance matrix:

var[u] = var[(u0j, u1j)′] = [τ00, σ01; σ01, τ11] = [20.750, − 0.040; − 0.040, 7.59 × 10−5],

which can also be obtained through the following command:

estat recovariance

whose outputs can be seen in Fig. 23.22.

Fig. 23.22
Fig. 23.22 Variance-covariance matrix with correlated random effects u0j and u1j.

Even though the estimation of the covariance between u0j and u1j is cov(u0j, u1j) = σ01 = − 0.040 ≠ 0, a more inquisitive researcher will see, by including the term estmetric at the end of the last xtmixed command typed (without the term var), that this covariance is not statistically significant. In fact, the output, not presented here, will show the nonsignificance of the inverse hyperbolic tangent of the correlation between these two error terms.
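For reference, the transformation involved is the inverse hyperbolic tangent (Fisher's z), which maps a correlation in (−1, 1) onto the whole real line; a minimal Python sketch:

```python
import math

def fisher_z(r):
    """Inverse hyperbolic tangent of a correlation r in (-1, 1);
    equals 0.5 * ln((1 + r) / (1 - r))."""
    return math.atanh(r)
```

The significance test reported under estmetric is carried out on this transformed scale rather than on the correlation itself.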

Another way of verifying the nonsignificance of the correlation between the error terms is through a new likelihood-ratio test, which compares the estimation of the random intercepts and slopes model with independent error terms u0j and u1j (Fig. 23.18) with that of the same model with correlated error terms (Fig. 23.21), that is, with an unstructured variance-covariance matrix. In order to do that, we must type the following sequence of commands:

estimates store randomslopeunstructured

lrtest randomslopeunstructured randomslope

The result of this test can be seen in Fig. 23.23.

Fig. 23.23
Fig. 23.23 Likelihood-ratio test to compare the estimations of random intercepts and slopes models with independent and correlated error terms u0j and u1j.

The χ2 statistic for the test, with 1 degree of freedom, can also be obtained through the following expression:

χ2(1) = (− 2LLr,ind) − (− 2LLr,unstruc) = [− 2 ⋅ (− 6372.164)] − [− 2 ⋅ (− 6372.111)] ≈ 0.11

That is, we have Sig. χ2(1) = 0.744 > 0.05. Therefore, in this example, we can state that the structure of the variance-covariance matrix between u0j and u1j can be considered independent.
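The statistic and its significance can also be reproduced from the two restricted log-likelihoods; a hedged Python sketch (for 1 degree of freedom, the chi-square upper tail reduces to a complementary error function):

```python
import math

def lr_test_1df(ll_restricted, ll_general):
    """Likelihood-ratio statistic and p-value for one extra parameter."""
    chi2 = (-2 * ll_restricted) - (-2 * ll_general)
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square(1) upper tail
    return chi2, p_value
```

With the values above, lr_test_1df(-6372.164, -6372.111) gives a statistic of approximately 0.11 and a p-value of approximately 0.74.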

However, more than this, we can see that the estimated variance of u1j is statistically equal to zero, making the random intercepts model more suitable than the random intercepts and slopes model for our data.

Therefore, at this moment, let’s insert the variables texp and priv (level-2 explanatory variables - school) into our random intercepts model, such that the new specification of the hierarchical model will be as follows:

  • Complete Random Intercepts Model

performanceij = b0j + b1j ⋅ hoursij + rij

b0j = γ00 + γ01 ⋅ texpj + γ02 ⋅ privj + u0j

b1j = γ10 + γ11 ⋅ texpj + γ12 ⋅ privj,

which results in the following expression:

performanceij = γ00 + γ10 ⋅ hoursij + γ01 ⋅ texpj + γ02 ⋅ privj + γ11 ⋅ texpj ⋅ hoursij + γ12 ⋅ privj ⋅ hoursij + u0j + rij

Thus, initially, we need to generate two new variables, which correspond to the multiplication of texp by hours and priv by hours. The following commands generate these two variables (texphours and privhours):

gen texphours = texp*hours

gen privhours = priv*hours

Next, we can estimate our complete random intercepts model by typing the following command:

xtmixed performance hours texp priv texphours privhours || school: , var nolog reml

The outputs are shown in Fig. 23.24.

Fig. 23.24
Fig. 23.24 Outputs of the complete model with random intercepts.

When we analyze the estimated fixed effects parameters, we can see that those corresponding to the variables texphours and privhours are not statistically different from zero at a significance level of 0.05. Since there is no stepwise procedure associated with the command xtmixed in Stata, we have to exclude the variable texphours manually (that is, the variable texp from the expression of slope b1j), because it is the one whose estimated parameter presented the highest Sig. z. Therefore, the new model has the following expression:

performanceij = b0j + b1j ⋅ hoursij + rij

b0j = γ00 + γ01 ⋅ texpj + γ02 ⋅ privj + u0j

b1j = γ10 + γ11 ⋅ privj,

which results in:

performanceij = γ00 + γ10 ⋅ hoursij + γ01 ⋅ texpj + γ02 ⋅ privj + γ11 ⋅ privj ⋅ hoursij + u0j + rij

whose estimation can be obtained by typing the following command:

xtmixed performance hours texp priv privhours || school: , var nolog reml

The new outputs can be seen in Fig. 23.25.

Fig. 23.25
Fig. 23.25 Outputs of the final complete model with random intercepts without the variable texphours.

Note that, even though the estimated parameter γ11 related to the variable privhours is not statistically significant at a significance level of 0.05, it is at a significance level of 0.10. Only for educational purposes, we will consider this higher significance level at this moment, in order to continue the analysis with at least one level-2 variable (priv) in the expression of slope b1j, even if we have to do it without random effects in this slope. Therefore, the expression of our final estimated model with random intercepts and level-1 and level-2 explanatory variables is:

performanceij = − 2.710 + 3.281 ⋅ hoursij + 0.866 ⋅ texpj − 5.610 ⋅ privj − 0.080 ⋅ privj ⋅ hoursij + u0j + rij

A more inquisitive researcher could question the fact that the estimated parameter of variable priv is negative. Bear in mind that this fact only occurs in the presence of the other explanatory variables, because the correlation between performance and priv is positive and statistically significant, at a significance level of 0.05, which proves that students from private schools end up having better school performance, on average, than students from public schools.

Next, we can obtain the expected BLUPs (best linear unbiased predictions) of random effects u0j of our final model by typing:

predict u0final, reffects

which generates a new variable in the dataset, which is called u0final. Besides, we can also obtain the expected values of each student’s school performance by typing the following command:

predict yhat, fitted

which defines the variable yhat, which can also be obtained by the command:

gen yhat = -2.71035 + 3.281046*hours + .8662029*texp - 5.610535*priv - .0801207*privhours + u0final
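For readers working outside Stata, the same fitted values can be reproduced with the estimated coefficients; a Python sketch (the function name is ours, coefficients copied from the output above, u0 is the school's BLUP):

```python
def yhat(hours, texp, priv, u0):
    """Predicted school performance from the final random intercepts
    model; coefficients taken from the estimation output."""
    return (-2.71035 + 3.281046 * hours + 0.8662029 * texp
            - 5.610535 * priv - 0.0801207 * priv * hours + u0)
```

For instance, a public-school student (priv = 0) adds 3.281 points of expected performance per additional weekly hour of study.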

The following command generates a chart (Fig. 23.26) with the predicted values of each student’s school performance based on the number of hours spent studying per week for the 46 schools under analysis. Through it, we can see that the intercepts are different (random effects), however, without any discrepancies in the slopes.

Fig. 23.26
Fig. 23.26 Predicted school performance values based on the variable hours for the final complete model with random intercepts.

graph twoway connected yhat hours, connect(L)

Finally, Fig. 23.27 shows the values of the intercepts and slopes of the linear adjustments of the predicted values of the average school performance for each of the 46 schools, where it is possible to prove the existence of random effects in the intercepts and only of fixed effects in the slopes. This figure can be obtained by typing the following sequence of commands:

Fig. 23.27
Fig. 23.27 Random effects in the intercepts and fixed effects in the slopes (presented is the identification of the first observation at each school).
generate interceptfinal = _b[_cons] + u0final
generate slopefinal = _b[hours] + _b[privhours]*priv
by school, sort: generate group = (_n ==1)
list school interceptfinal slopefinal if group == 1

Hence, we can conclude that there are differences in the school performance behavior between students from the same school and from different schools. In addition, these differences occur based on the number of hours each student spends studying per week, on the type of school (public or private), and on the professors’ years of teaching experience at each school.

We chose to use the strategic multilevel analysis proposed by Raudenbush and Bryk (2002), and by Snijders and Bosker (2011). That is, first, we studied the variance decomposition from the definition of a null model (nonconditional model) so that, afterwards, a random intercepts model and a random intercepts and slopes model could be estimated. Finally, from the definition of the random nature of the error terms, we estimated the complete model by including level-2 variables into the analysis. This procedure is known as multilevel step-up strategy.

Next, let’s estimate a three-level hierarchical linear model, in which the nesting of data will be characterized due to the presence of repeated measures, that is, there is temporal evolution in the behavior of the dependent variable.

23.5.2 Estimation of a Three-Level Hierarchical Linear Model With Repeated Measures in Stata

Let’s discuss an example that follows the same logic of the previous section. However, now, with data that vary throughout time, between individuals, and between the groups to which these individuals belong, characterizing a nested structure with repeated measures.

Imagine that our highly qualified professor is now interested in expanding his research, monitoring students’ school performance for a certain period, in order to investigate if there is variability in this performance throughout time between students from the same school and between those from different schools. And, if yes, if there are certain student and school characteristics that explain this variability.

Therefore, 15 schools volunteered to provide data on their students’ school performance (scores from 0 to 100) in the last four years, a total of 610 students. In addition, the professor also included each student’s gender in the dataset, in order to verify if there are differences in school performance resulting from this variable. The variable regarding professors’ years of teaching experience, for each school, remains in the study. Part of the dataset can be seen in Table 23.4. The complete dataset, however, can be found in the files PerformanceTimeStudentSchool.xls (Excel) and PerformanceTimeStudentSchool.dta (Stata).

Table 23.4

Example: School Performance Throughout Time (Level 1—Repeated Measure) and Students' (Level 2) and Schools’ (Level 3) Characteristics

Student j (Level 2) | School k (Level 3) | Performance at School (Ytjk) | Year t (Level 1) | Gender (Xjk) | Professors’ Years of Teaching Experience (Wk)
1   | 1  | 35.4  | 1 | male   | 2
1   | 1  | 44.4  | 2 | male   | 2
1   | 1  | 46.4  | 3 | male   | 2
1   | 1  | 52.4  | 4 | male   | 2
…   | …  | …     | … | …      | …
121 | 4  | 66.4  | 1 | female | 9
121 | 4  | 66.4  | 2 | female | 9
121 | 4  | 74.4  | 3 | female | 9
121 | 4  | 79.4  | 4 | female | 9
…   | …  | …     | … | …      | …
610 | 15 | 87.6  | 1 | female | 9
610 | 15 | 92.6  | 2 | female | 9
610 | 15 | 94.6  | 3 | female | 9
610 | 15 | 100.0 | 4 | female | 9

After opening the file PerformanceTimeStudentSchool.dta, we can type the command desc, which allows us to analyze the characteristics in the dataset, such as, the number of observations, the number of variables, and the description of each one of them. Fig. 23.28 shows this output in Stata.

Fig. 23.28
Fig. 23.28 Description of the PerformanceTimeStudentSchool.dta Dataset.

Following the logic proposed in the previous section, initially, let’s analyze the number of students monitored by the professor in each period (year), by using the following command:

tabulate year, subpop(student)

The outputs are shown in Fig. 23.29 and, through them, we can see that we have a balanced panel data, since all 610 students are monitored in the four periods.

Fig. 23.29
Fig. 23.29 Number of students monitored in each period.

The chart in Fig. 23.30, obtained by typing the following command, allows us to analyze the temporal evolution of the school performance of the first 50 students in the sample:

Fig. 23.30
Fig. 23.30 Temporal evolution of the school performance of the first 50 students in the sample.

graph twoway connected performance year if student <= 50, connect(L)

This chart already allows us to see that the temporal evolutions of the school performance have different intercepts and slopes between students, which justifies the use of multilevel modeling and provides grounds for including intercept and slope random effects in level 2 of the models that will be estimated.

Besides, students’ average performance in the four periods can be analyzed in Figs. 23.31 and 23.32, obtained from the following commands. Through them, it is possible to verify that there is a growing behavior, approximately linear, of students’ school performance throughout time, and this is the reason why the variable year is also inserted, with a linear specification, into level 1 of the modeling, as we will see later.

Fig. 23.31
Fig. 23.31 Students’ average school performance in each period.
Fig. 23.32
Fig. 23.32 Evolution of students’ average school performance in each period.
bysort year: egen average_performance = mean(performance)

tabstat average_performance, by(year)

graph twoway scatter performance year || connected average_performance year, connect(L) || , ytitle(performance at school)

So that we can more convincingly justify estimating a three-level hierarchical model, let’s construct a chart (Fig. 23.33) that shows the temporal evolution of the average school performance at each school. In order to do that, we can type the following sequence of commands:

Fig. 23.33
Fig. 23.33 Temporal evolution of students’ average school performance at each school (linear adjustment through OLS).
statsby intercept =_b[_cons] slope =_b[year], by(school) saving(ols, replace): reg performance year
sort school
merge school using ols
drop _merge
gen yhat_ols = intercept + slope*year
sort school year
separate performance, by(school)
separate yhat_ols, by(school)
graph twoway connected yhat_ols1-yhat_ols15 year || lfit performance year, clwidth(thick) clcolor(black) legend(off) ytitle(performance at school)

This chart shows the linear adjustment through OLS, for each school, of the school performance behavior throughout time. It also provides grounds for including intercept and slope random effects in level 3 of the models that will be estimated, since the temporal evolutions of the school performance also present different intercepts and slopes between schools. Note that the last sequence of commands generates a new file in Stata (ols.dta), in which the differences in the school performance behavior between the schools, in terms of temporal intercepts and slopes, can be analyzed.

Having characterized the temporal nesting of the students from different schools in the data with repeated measures in our example, initially, let’s estimate a null model (nonconditional model) that allows us to check if there is variability in the school performance between students from the same school and between those from different schools. No explanatory variable will be inserted into the modeling, which only considers the existence of one intercept and of error terms u00k, r0jk, and etjk, with variances equal to τu000, τr000, and σ2, respectively. The model to be estimated has the following expression:

  • Null Model

performancetjk = π0jk + etjk

π0jk = b00k + r0jk

b00k = γ000 + u00k

which results in:

performancetjk = γ000 + u00k + r0jk + etjk

The command to estimate this null model in Stata is:

xtmixed performance || school: || student: , var nolog reml

which, as we can see, now shows two random effects components, one that corresponds to level 3 (school) and another to level 2 (student). It is important to highlight that the order in which the random effects components are inserted into the command xtmixed is decreasing when there are more than two levels. That is, we must begin with the highest data nesting level and continue until the lowest level (level 2). The outputs obtained can be seen in Fig. 23.34.

Fig. 23.34
Fig. 23.34 Outputs of the null model in Stata.

At the top of Fig. 23.34, we can initially verify that we have a balanced panel, since, for each student, the minimum and maximum numbers of monitoring periods are equal to four, with a mean also equal to four.
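Balance can also be checked programmatically: every student must appear in the same number of periods. A hedged Python sketch (names are ours, not part of the dataset):

```python
from collections import Counter

def panel_is_balanced(observations):
    """observations: iterable of (student_id, year) pairs; the panel is
    balanced if every student is observed in the same number of periods."""
    counts = Counter(student for student, _ in observations)
    return len(set(counts.values())) == 1
```

A panel where one student misses a year would return False, flagging an unbalanced structure.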

In relation to the fixed effects component, we can see that the estimation of parameter γ000 is equal to 68.714, which corresponds to students’ expected average annual school performance (the horizontal line estimated in the null model, or general intercept).

At the bottom of the outputs, the estimations of the variances of the error terms are presented: τu000 = 180.194 (in Stata, var(_cons) for school), τr000 = 325.799 (in Stata, var(_cons) for student), and σ2 = 41.649 (in Stata, var(Residual)).

Thus, we can define two intraclass correlations, given the existence of two variance proportions. The first refers to the correlation between the values of the variable performance in periods t and t′ for a certain student j from a certain school k (level-2 intraclass correlation). The other refers to the correlation between the values of the variable performance in periods t and t′ for different students j and j′ from a certain school k (level-3 intraclass correlation). Therefore, we have:

  •  Level-2 intraclass correlation

rhostudent|school = corr(Ytjk, Yt′jk) = (τu000 + τr000)/(τu000 + τr000 + σ2) = (180.194 + 325.799)/(180.194 + 325.799 + 41.649) = 0.924

  •  Level-3 intraclass correlation

rhoschool = corr(Ytjk, Yt′j′k) = τu000/(τu000 + τr000 + σ2) = 180.194/(180.194 + 325.799 + 41.649) = 0.329
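Both intraclass correlations follow directly from the three variance components; a Python sketch (the function name is ours):

```python
def three_level_iccs(tau_u, tau_r, sigma2):
    """Return (level-2, level-3) intraclass correlations for a
    three-level model with variance components tau_u (school),
    tau_r (student), and sigma2 (residual)."""
    total = tau_u + tau_r + sigma2
    return (tau_u + tau_r) / total, tau_u / total
```

With the null-model estimates, three_level_iccs(180.194, 325.799, 41.649) returns approximately (0.924, 0.329).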

From Stata 13 onward, it is possible to obtain these intraclass correlations directly, by typing the command estat icc right after the estimation of the corresponding model.

Hence, the correlation between annual school performances for the same school is equal to 32.9% (rhoschool), and the correlation between annual school performances for the same student of a certain school is equal to 92.4% (rhostudent|school). Therefore, for the model without explanatory variables, while the annual school performance is weakly correlated between schools, it becomes strongly correlated when the calculation is carried out for the same student from a certain school. In this last case, we estimate that student and school random effects account for approximately 92% of the total residual variance!

Regarding the statistical significance of these variances, the fact that the estimated values of τu000, τr000, and σ2 are considerably higher than their respective standard errors suggests that there is significant variation in the annual school performance between students and between schools. More specifically, we can see that all of these ratios are higher than 1.96, the critical value of the standardized normal distribution at a significance level of 0.05.

As discussed in Section 23.5.1, this information is essential to underpin the choice of multilevel modeling in this example, instead of a simple and traditional regression model through OLS. At the bottom of Fig. 23.34, we can verify this fact by analyzing the result of the likelihood-ratio test (LR test). Since Sig. χ2 = 0.000, we can reject the null hypothesis that the random intercepts are equal to zero (H0: u00k = r0jk = 0), which rules out the estimation of a traditional linear regression model for the data with repeated measures in our example.

Even though researchers frequently skip the estimation of null models, analyzing their results may help to reject (or not) the research hypotheses, and may even suggest adjustments to the proposed constructs. For the data in our example, the results of the null model allow us to state that there is significant variability in the school performance throughout the four years of the analysis. Furthermore, there is significant variability in the school performance, throughout time, both between students of the same school and between students from different schools. By themselves, these findings can reject or support research hypotheses and be used to structure a piece of work, depending on the researcher’s objectives, without additional models being necessary.

In addition to what has been discussed, since our main objective is to verify if there are student and school characteristics that would explain the variability in the school performance between students from the same school and between those from different schools, we will continue with the next modeling steps, respecting the multilevel step-up strategy.

Therefore, as already seen through the charts in Figs. 23.32 and 23.33, let’s insert level-1 variable year into the analysis, aiming at investigating if the temporal variable has a relationship to students’ school performance behavior and, more than this, if the school performance has a linear behavior throughout time.

  • Linear Trend Model with Random Intercepts

performance_tjk = π_0jk + π_1jk·year_tjk + e_tjk

π_0jk = b_00k + r_0jk

π_1jk = b_10k

b_00k = γ_000 + u_00k

b_10k = γ_100,

which results in the following expression:

performance_tjk = γ_000 + γ_100·year_tjk + u_00k + r_0jk + e_tjk
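Before estimating this model, the composition of the reduced-form expression can be illustrated with a short simulation. The sketch below is purely illustrative: it treats the variance estimates that Stata will report for this model (Fig. 23.35) as if they were true population values, and the sample sizes are hypothetical, not those of our dataset.

```python
import random

# Illustrative simulation of the reduced-form random-intercept model:
# performance_tjk = γ_000 + γ_100·year + u_00k + r_0jk + e_tjk.
# The variances below are the Fig. 23.35 estimates, treated here as if
# they were population values (an assumption for illustration only).
random.seed(42)
GAMMA_000, GAMMA_100 = 57.844, 4.348                        # fixed effects
SD_U, SD_R, SD_E = 180.196**0.5, 333.675**0.5, 10.146**0.5  # random-effect SDs

u_draws, r_draws, e_draws = [], [], []
for school in range(2000):                       # hypothetical sample sizes
    u = random.gauss(0, SD_U)                    # school intercept u_00k
    for student in range(5):
        r = random.gauss(0, SD_R)                # student intercept r_0jk
        for year in range(1, 5):
            e = random.gauss(0, SD_E)            # level-1 error e_tjk
            performance = GAMMA_000 + GAMMA_100 * year + u + r + e
            u_draws.append(u); r_draws.append(r); e_draws.append(e)

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

total = var(u_draws) + var(r_draws) + var(e_draws)
icc_school = var(u_draws) / total                            # close to 0.344
icc_student_school = (var(u_draws) + var(r_draws)) / total   # close to 0.981
```

Because u_00k is drawn once per school and r_0jk once per student, the simulated panel reproduces the nesting of repeated measures within students within schools, and the empirical variance decomposition approaches the intraclass correlations derived below.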

For the data in our example, the command for estimating the linear trend model with random intercepts in Stata is:

xtmixed performance year || school: || student: , var nolog reml

whose outputs are shown in Fig. 23.35.

Fig. 23.35
Fig. 23.35 Outputs of the linear trend model with random intercepts.

First, we can see that the mean annual increase in school performance is statistically significant, with an estimated parameter of γ_100 = 4.348, ceteris paribus.

Regarding the random effects components, we have also verified that there is statistical significance in the variances of u00k, r0jk, and etjk, because the estimations of τu000, τr000, and σ2 are considerably higher than the respective standard errors. Therefore, new intraclass correlations can be calculated, as follows:

  •  Level-2 intraclass correlation

ρ_student|school = corr(Y_tjk, Y_t′jk) = (τ_u000 + τ_r000) / (τ_u000 + τ_r000 + σ²) = (180.196 + 333.675) / (180.196 + 333.675 + 10.146) = 0.981

  •  Level-3 intraclass correlation

ρ_school = corr(Y_tjk, Y_t′j′k) = τ_u000 / (τ_u000 + τ_r000 + σ²) = 180.196 / (180.196 + 333.675 + 10.146) = 0.344
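These proportions can be verified with simple arithmetic outside Stata. The short Python check below uses only the variance estimates reported in Fig. 23.35:

```python
# Variance estimates reported in Fig. 23.35 (tau_u000, tau_r000, sigma^2)
tau_u000, tau_r000, sigma2 = 180.196, 333.675, 10.146

total = tau_u000 + tau_r000 + sigma2
icc_student_school = (tau_u000 + tau_r000) / total  # level-2 intraclass corr.
icc_school = tau_u000 / total                       # level-3 intraclass corr.

print(round(icc_student_school, 3))  # 0.981
print(round(icc_school, 3))          # 0.344
```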

Both variance proportions are higher than the ones obtained in the estimation of the null model, which shows the importance of including the variable that corresponds to the repeated measure in level 1. Besides, the result of the likelihood-ratio test (LR test) at the bottom of Fig. 23.35 confirms that the estimation of a simple traditional linear regression model (performance as a function of year), with fixed effects only, must be ruled out.

Therefore, now, our model starts to have the following specification:

performance_tjk = 57.844 + 4.348·year_tjk + u_00k + r_0jk + e_tjk

Next, we can store (command estimates store) the estimates obtained, for future comparison with the ones that will be generated by the estimation of a linear trend model with random intercepts and slopes. Through the command predict ..., reffects, we can also obtain the expected values of the random effects, known as BLUPs (best linear unbiased predictions), u_00k and r_0jk. Maintaining the logic proposed in the previous section, let’s type the following sequence of commands:

estimates store randomintercept
predict u00 r0, reffects
desc u00 r0
by student, sort: generate tolist = (_n ==1)
list student school u00 r0 if school <=2 & tolist

Fig. 23.36 shows the values of the random intercept terms u_00k and r_0jk for the students from the first two schools in the dataset. We can see that the error terms u_00k do not vary between students from the same school or over time (variable u00 generated in the dataset), whereas the terms r_0jk vary between students but not for the same student over time (variable r0 generated in the dataset), which characterizes one intercept for each student and one intercept for each school.

Fig. 23.36
Fig. 23.36 Random intercept terms u_00k and r_0jk for the first two schools in the sample (only the observation corresponding to each student’s first period is listed).

In order to provide a better visualization of the random intercepts per school and per student, we can generate two charts (Figs. 23.37 and 23.38), by typing the following commands:

graph hbar (mean) u00, over(school) ytitle("Random Intercepts per School")

graph hbar (mean) r0, over(student) ytitle("Random Intercepts per Student")

Fig. 23.37
Fig. 23.37 Random intercepts per school.
Fig. 23.38
Fig. 23.38 Random intercepts per student.

Therefore, at this moment of the modeling, we are able to state that students’ school performance follows a linear trend throughout time. In addition, there is a significant variance of intercepts between those who study at the same school and between those who study at different schools.

Thus, we also need to verify whether there is significant variance of the school performance slopes over time between the different students, since the charts in Figs. 23.30 and 23.33 already gave us an indication that this phenomenon occurs. Therefore, let’s insert slope random effects into levels 2 and 3 of our multilevel model which, by maintaining the intercept random effects, will start to have the following expression:

  • Linear Trend Model with Random Intercepts and Slopes

performance_tjk = π_0jk + π_1jk·year_tjk + e_tjk

π_0jk = b_00k + r_0jk

π_1jk = b_10k + r_1jk

b_00k = γ_000 + u_00k

b_10k = γ_100 + u_10k,

which results in:

performance_tjk = γ_000 + γ_100·year_tjk + u_00k + u_10k·year_tjk + r_0jk + r_1jk·year_tjk + e_tjk

The command for estimating this linear trend model with random intercepts and slopes in Stata is:

xtmixed performance year || school: year || student: year, var nolog reml

Note that the variable year is present in the fixed effects component, in the level-3 random effects component (multiplying the error term u_10k), and in the level-2 one (multiplying the error term r_1jk). The outputs obtained can be found in Fig. 23.39.

Fig. 23.39
Fig. 23.39 Outputs of the linear trend model with random intercepts and slopes.

We can see that, even though the fixed effects parameter estimations do not change considerably in relation to the previous model, the variance estimations are different, which generates new intraclass correlations, as follows:

  •  Level-2 intraclass correlation

ρ_student|school = corr(Y_tjk, Y_t′jk) = (τ_u000 + τ_u100 + τ_r000 + τ_r100) / (τ_u000 + τ_u100 + τ_r000 + τ_r100 + σ²) = (224.343 + 0.560 + 374.285 + 3.157) / (224.343 + 0.560 + 374.285 + 3.157 + 3.868) = 0.994

  •  Level-3 intraclass correlation

ρ_school = corr(Y_tjk, Y_t′j′k) = (τ_u000 + τ_u100) / (τ_u000 + τ_u100 + τ_r000 + τ_r100 + σ²) = (224.343 + 0.560) / (224.343 + 0.560 + 374.285 + 3.157 + 3.868) = 0.371
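Again, the arithmetic can be checked directly. The Python sketch below uses only the variance estimates reported in Fig. 23.39:

```python
# Variance estimates reported in Fig. 23.39
tau_u000, tau_u100 = 224.343, 0.560   # school-level intercept and slope variances
tau_r000, tau_r100 = 374.285, 3.157   # student-level intercept and slope variances
sigma2 = 3.868                        # level-1 residual variance

total = tau_u000 + tau_u100 + tau_r000 + tau_r100 + sigma2
icc_student_school = (total - sigma2) / total      # level-2 intraclass corr.
icc_school = (tau_u000 + tau_u100) / total         # level-3 intraclass corr.

print(round(icc_student_school, 3))  # 0.994
print(round(icc_school, 3))          # 0.371
```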

Therefore, for this model, we estimate that the student and school random effects account for approximately 99% of the total residual variance!

Let’s type the following command, so that we can demonstrate that this estimation is better suited than the previous one, without random slopes:

estimates store randomslope

Next, we can type the command that will elaborate the likelihood-ratio test:

lrtest randomslope randomintercept

where the term randomintercept refers to the estimation carried out previously. The result of the test can be seen in Fig. 23.40.

Fig. 23.40
Fig. 23.40 Likelihood-ratio test to compare the estimations of the linear trend models with random intercepts and with random intercepts and slopes.

By using the values of the restricted likelihood functions obtained in Figs. 23.35 and 23.39, we arrive at the χ2 statistic for the test, with 2 degrees of freedom:

χ²₂ = [−2·LLr_random intercept] − [−2·LLr_random slope] = {−2·(−7,801.420)} − {−2·(−7,464.819)} = 673.20,

which results in a Sig. χ²₂ = 0.000 < 0.05 and ends up favoring the linear trend model with random intercepts and slopes. It is important to mention once again, as the note at the bottom of Fig. 23.40 also explains, that this likelihood-ratio test is only valid when comparing estimations obtained through restricted maximum likelihood (REML) of two models with identical fixed effects specifications. Since, in our case, both models were estimated through REML and present the same fixed effects specification, γ_000 + γ_100·year_tjk, the test is considered valid.
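The test statistic itself is easy to reproduce from the restricted log-likelihoods of Figs. 23.35 and 23.39, as in the short Python sketch below:

```python
import math

# Restricted log-likelihoods reported by Stata (Figs. 23.35 and 23.39)
ll_random_intercept = -7801.420
ll_random_slope = -7464.819

# LR statistic with 2 degrees of freedom (two extra variance parameters)
chi2 = (-2 * ll_random_intercept) - (-2 * ll_random_slope)
# For df = 2, the chi-square survival function reduces to exp(-x/2)
p_value = math.exp(-chi2 / 2)

print(round(chi2, 2))  # 673.2
print(p_value < 0.05)  # True
```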

Hence, our model starts to have the following specification:

performance_tjk = 57.858 + 4.343·year_tjk + u_00k + u_10k·year_tjk + r_0jk + r_1jk·year_tjk + e_tjk

In the current situation, we are able to state that students’ school performance follows a linear trend throughout time. In addition, there is a significant variance of intercepts and slopes between those who study at the same school and between those who study at different schools.

Therefore, let’s insert level-2 variable gender into the analysis, in order to verify if this characteristic explains the variation in the annual school performance between students.

  • Linear Trend Model with Random Intercepts and Slopes and with Level-2 Variable gender

performance_tjk = π_0jk + π_1jk·year_tjk + e_tjk

π_0jk = b_00k + b_01k·gender_jk + r_0jk

π_1jk = b_10k + b_11k·gender_jk + r_1jk

b_00k = γ_000 + u_00k

b_01k = γ_010

b_10k = γ_100 + u_10k

b_11k = γ_110,

which results in the following expression:

performance_tjk = γ_000 + γ_100·year_tjk + γ_010·gender_jk + γ_110·gender_jk·year_tjk + u_00k + u_10k·year_tjk + r_0jk + r_1jk·year_tjk + e_tjk

Initially, we need to generate a new variable that corresponds to the multiplication between gender and year. The following command generates this variable (genderyear):

gen genderyear = gender*year

Next, we can estimate our linear trend model with random intercepts and slopes and level-2 variable gender, by typing the following command:

xtmixed performance year gender genderyear || school: year || student: year, var nolog reml

The outputs generated can be seen in Fig. 23.41.

Fig. 23.41
Fig. 23.41 Outputs of the linear trend model with random intercepts and slopes and level-2 variable gender.

This model shows significant estimates for the fixed effects parameters, as well as for the variances of the random effects terms, at a significance level of 0.05. Moreover, at this moment of the modeling, we are able to state that students’ school performance follows a linear trend over time, and there is a significant variance of intercepts and slopes between those who study at the same school and between those who study at different schools. Additionally, whether a student is female or male explains part of this variation in school performance.

The model begins to have the following specification:

performance_tjk = 64.498 + 4.029·year_tjk − 15.033·gender_jk + 0.705·gender_jk·year_tjk + u_00k + u_10k·year_tjk + r_0jk + r_1jk·year_tjk + e_tjk

from which we can see that male students (dummy gender = 1) have a worse performance than female students, on average and ceteris paribus.
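To make this interpretation concrete, we can evaluate the fixed-effects part of the specification for male and female students. The small Python sketch below is merely illustrative (the function name is ours, not from the dataset):

```python
# Illustrative helper using only the fixed-effects part of the estimated
# specification (dummy gender: 0 = female, 1 = male).
def expected_performance(year, gender):
    return 64.498 + 4.029 * year - 15.033 * gender + 0.705 * gender * year

# Estimated male-female gap in each year: -15.033 + 0.705·year,
# i.e., the gap shrinks by 0.705 points per year
gap_year1 = expected_performance(1, 1) - expected_performance(1, 0)
gap_year4 = expected_performance(4, 1) - expected_performance(4, 0)
print(round(gap_year1, 3))  # -14.328
print(round(gap_year4, 3))  # -12.213
```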

Finally, let’s investigate whether the level-3 variable texp (professors’ years of teaching experience) also explains the variation in annual school performance between students. After some intermediate analyses, let’s move on to estimate the three-level hierarchical model with the following specification:

  • Linear Trend Model with Random Intercepts and Slopes, Level-2 Variable gender, and Level-3 Variable texp (Complete Model)
