13.3.3 The Problem of Heteroskedasticity

Besides the presuppositions previously discussed, the probability distribution of each random error term of Yi = a + b1 ⋅ X1i + b2 ⋅ X2i + ⋯ + bk ⋅ Xki + ui (i = 1, 2, …, n) should present the same variance; that is, the distributions should be homoskedastic. Therefore:

$$\operatorname{Var}(u_i) = E(u_i^2) = \sigma_u^2 \qquad (13.39)$$

Fig. 13.39 provides, for a simple linear regression model, a view of the heteroskedasticity problem, that is, the nonconstancy of the variance of the residuals along the explanatory variable. In other words, there is a correlation between the error terms and the X variable, perceived by the formation of a “cone” that becomes narrower as X increases. Obviously, the problem of heteroskedasticity also occurs if this “cone” presents itself in mirror image, that is, if the narrowing (reduction of the error term values) occurs as the values of the X variable decrease.

Fig. 13.39 The problem of heteroskedasticity.

13.3.3.1 Causes of Heteroskedasticity

According to Greene (2012), specification errors as to the functional form or as to the omission of a relevant variable can generate heteroskedastic terms of error in the model.

This phenomenon can also be generated by trial and error models. In this case, imagine that a group of analysts wants to elaborate forecasts regarding soy futures in the derivatives market. The same analysts make their forecasts in months t, t + 1, t + 2, and t + 3, so as to evaluate the learning curve of each of them regarding the phenomenon in question (correct commodity pricing). The graph in Fig. 13.40 is prepared after the experiment and, by means of its analysis, we can see that the analysts come to predict the price of soy more accurately over time, very probably due to the learning process to which they have been submitted.

Fig. 13.40 Trial and error models as a cause of heteroskedasticity.

Analogously, the increase of discretionary income (the portion of an individual's total income that is not pledged, over which the individual can exercise some degree of discretion as to its use) can also cause heteroskedasticity problems in regression models. Imagine that a survey was taken of law school graduates. From time to time, let's say every 5 years, the same students are questioned regarding their discretionary income at that exact moment. Fig. 13.41 is then prepared and, by means of it, we can see that the discretionary income of the students comes to present greater dispersion over time, when compared to that of recent graduates.

Fig. 13.41 Increase of discretionary income as a cause of heteroskedasticity.

Continuing with our example of discretionary income, we now imagine that another sample is taken with the same configuration, but with a single individual presenting a discrepant amount of discretionary income at t + 15, according to what is shown in Fig. 13.42. This outlier will, in this case, increase the intensity of the heteroskedasticity in the proposed model even more.

Fig. 13.42 Existence of an outlier as a cause of heteroskedasticity.

13.3.3.2 Consequences of Heteroskedasticity

All of the cases presented (errors in model specification, trial and error models, increase in discretionary income, and the presence of outliers) can lead to heteroskedasticity, which generates unbiased, though inefficient, parameter estimates and biased standard errors of the parameters, compromising the hypothesis tests of the t statistics.

In order to detect the presence of heteroskedasticity, we will next present the Breusch-Pagan/Cook-Weisberg test. Some procedures for the eventual correction of heteroskedasticity will also be discussed, such as estimation by the weighted least squares method and the Huber-White method for robust standard error estimates.

13.3.3.3 Heteroskedasticity Diagnostics: Breusch-Pagan/Cook-Weisberg Test

The Breusch-Pagan/Cook-Weisberg test, which is based on the Lagrange multiplier (LM), presents, as a null hypothesis, the fact that the variance of the error terms is constant (homoskedastic errors) and, as an alternative hypothesis, the fact that the variance of the error terms is not constant, that is, that the error terms are a function of one or more explanatory variables (heteroskedastic errors). It is important to mention that this test is indicated for those cases where the supposition of residual normality is verified.

To obtain the test results, we can, at first, estimate a certain regression model, from which we obtain the vector of residuals (ui) and the predicted values of the dependent variable (Ŷi). Next, we can standardize the squared residuals, forcing the average of this new variable to be equal to 1. That is, each standardized residual can be obtained by the following expression:

$$up_i = \frac{u_i^2}{\left(\sum_{i=1}^{n} u_i^2\right)\big/\,n} \qquad (13.40)$$

where n is the number of observations.

Next, we can estimate the regression upi = a + b ⋅ Ŷi + ξi, from which we calculate the sum of squares due to regression (SSR); dividing it by 2, we arrive at the χ²BP/CW statistic.

Thus, under the null hypothesis of the Breusch-Pagan/Cook-Weisberg test, the calculated χ²BP/CW statistic follows a chi-square distribution with 1 degree of freedom; the null hypothesis is not rejected when χ²BP/CW < χ²1 d.f. for a certain significance level. In other words, if the error terms are homoskedastic, the sum of squares due to regression will not increase with the increase of Ŷ.

In Section 13.5, we will present the application of this test, as well as its results, using Stata.
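Before that, a minimal generic sketch of the step-by-step computation described above, in Stata, may be useful (the variables y, x1, and x2 are merely illustrative):

* estimate the original model and recover residuals and fitted values
quietly reg y x1 x2
predict u, res
predict yhat, xb
* squared residuals standardized so that their mean equals 1
gen u2 = u^2
quietly summarize u2
gen up = u2/r(mean)
* auxiliary regression of the standardized squared residuals on Y-hat
quietly reg up yhat
* the statistic is the sum of squares due to regression divided by 2
display "chi2(1) BP/CW = " e(mss)/2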

13.3.3.4 Weighted Least Squares Method: A Possible Solution

As we have mentioned, failures in model specification can generate heteroskedastic error terms and, as we know and will discuss in Section 13.4, the relations between variables are complex and do not always follow linearity. In the absence of an underlying theory indicating the relations between two or more variables, it is up to the researcher, for example by preparing graphs of the residuals against the dependent variable or the explanatory variables, to try to infer an eventual nonlinear adjustment to be applied to the model being studied, such as the logarithmic, quadratic, or inverse forms.

In this sense, the weighted least squares method, which is a specific case of generalized least squares, can be applied when it has been diagnosed that the variance of the terms of error depends on the explanatory variable, or rather, when Expression (13.39) undergoes alteration, such that:

$$\operatorname{Var}(u_i) = \sigma_u^2 \cdot X_i$$

or

$$\operatorname{Var}(u_i) = \sigma_u^2 \cdot X_i^2$$

or

$$\operatorname{Var}(u_i) = \sigma_u^2 \cdot \sqrt{X_i}$$

or any other relation between Var(ui) and Xi.

Thus, the model can be changed in such a way that the error terms come to present constant variance. Imagine, for example, that the relation between ui and Xi is linear, that is, that | ui | = c ⋅ Xi and, as such, E(ui²) = (c ⋅ Xi)² = c² ⋅ Xi², where c is a constant. We can then propose a new model as such:

$$\frac{Y_i}{X_i} = \frac{a}{X_i} + b\,\frac{X_i}{X_i} + \frac{u_i}{X_i} \qquad (13.41)$$

Based on Expression (13.41), we have that the new terms of error present the following variance:

$$E\left(\frac{u_i}{X_i}\right)^{2} = \frac{1}{X_i^2}\,E(u_i^2) = \frac{1}{X_i^2}\,c^2 X_i^2 = c^2,$$ which is constant.

Therefore, the model proposed by means of Expression (13.41) can be estimated by OLS.
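As an illustration, assuming the diagnosis Var(ui) = c² ⋅ Xi², the transformed model of Expression (13.41) is equivalent to running OLS with analytic weights proportional to 1/Xi² in Stata (y and x are hypothetical variables):

* weighted least squares: analytic weights proportional to 1/X^2
* reproduce the transformation of Expression (13.41)
reg y x [aweight = 1/(x^2)]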

13.3.3.5 Huber-White Method for Robust Standard Errors

To gain a succinct idea of the procedure proposed in the seminal article written by White (1980), which follows the work of Huber (1967), we will again use the expression:

$$Y_i = a + b\,X_i + u_i, \quad \text{with} \quad \operatorname{Var}(u_i) = E(u_i^2) = \sigma_u^2 \qquad (13.42)$$

and

$$\operatorname{Var}(\hat{b}) = \frac{\sum X_i^2\,\sigma_u^2}{\left(\sum X_i^2\right)^2} \qquad (13.43)$$

However, since σu² is not directly observable, White (1980) proposes the adoption of ûi², instead of σu², for the evaluation of Var(b̂), in the following way:

$$\operatorname{Var}(\hat{b}) = \frac{\sum X_i^2\,\hat{u}_i^2}{\left(\sum X_i^2\right)^2} \qquad (13.44)$$

White (1980) shows that the Var(b̂) presented by means of Expression (13.44) is a consistent estimator of the variance presented by means of Expression (13.43); that is, as the sample size increases indefinitely, the former converges to the latter.

This procedure can be generalized for the multiple regression model:

$$Y_i = a + b_1 X_{1i} + b_2 X_{2i} + \cdots + b_k X_{ki} + u_i \qquad (13.45)$$

from which:

$$\operatorname{Var}(\hat{b}_j) = \frac{\sum \hat{w}_{ji}^2\,\hat{u}_i^2}{\left(\sum \hat{w}_{ji}^2\right)^2} \qquad (13.46)$$

where j = 1, 2, …, k, the ûi are the residuals obtained from the estimation of the original regression, and the ŵji represent the residuals obtained from each auxiliary regression of regressor Xj against the remaining regressors.
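In Stata, these heteroskedasticity-robust standard errors can be requested directly in the estimation command; a minimal sketch with hypothetical variables:

* OLS point estimates with Huber-White (robust) standard errors
reg y x1 x2, vce(robust)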

Given the computational ease in applying this procedure, it is actually quite frequent for researchers to use heteroskedasticity-robust standard errors in their academic work, to such a point that they no longer worry about verifying the existence of heteroskedasticity. However, this decision, which attempts to eliminate uncertainty regarding the source of the heteroskedasticity and eventually generates confidence in more robust results, does not represent a true solution in most cases. It is important to remember that this procedure, which generates standard error estimates of the parameters different from those that would have been obtained with the direct application of the OLS method (affecting the t and F statistics), does not alter the estimated parameters of the regression model themselves.

As such, the adoption of this procedure can only cause the researcher to pretend that the problem does not exist, instead of attempting to identify the reason for which it arose.

13.3.4 The Autocorrelation of Residuals Problem

The hypothesis of randomness and independence of the error terms only makes sense in models where there is a time evolution of the data. In other words, if we are working with a cross-sectional dataset, this presupposition is not justified, since changing the sequence in which the observations are disposed in a cross-section does not alter the dataset in any way, even though it would modify the correlation between the error terms of adjacent observations (such a correlation is, therefore, meaningless in this setting). On the other hand, since we must necessarily respect the sequence of the observations in a dataset with time evolution (t, t + 1, t + 2, etc.), the correlation (ρ) of the error terms between observations comes to make sense. As such, we can propose the following model, now with subscripts t instead of i:

$$Y_t = a + b_1 X_{1t} + b_2 X_{2t} + \cdots + b_k X_{kt} + \varepsilon_t \qquad (13.47)$$

where:

$$\varepsilon_t = \rho\,\varepsilon_{t-1} + u_t, \quad \text{with} \quad -1 \leq \rho \leq 1 \qquad (13.48)$$

That is, the error terms ɛt are not independent and, according to Expression (13.48), present first-order autocorrelation. This means that each value of ɛ depends on the value of ɛ in the previous period and on a random and independent term u, with normal distribution, zero mean, and constant variance. In this case, we have:

$$\begin{aligned}
\varepsilon_{t-1} &= \rho\,\varepsilon_{t-2} + u_{t-1}\\
\varepsilon_{t-2} &= \rho\,\varepsilon_{t-3} + u_{t-2}\\
&\ \ \vdots\\
\varepsilon_{t-p} &= \rho\,\varepsilon_{t-p-1} + u_{t-p}
\end{aligned} \qquad (13.49)$$
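To visualize what Expression (13.48) implies, one can simulate first-order autocorrelated errors; a minimal Stata sketch, in a fresh session (it clears the data in memory), with an illustrative ρ = 0.8:

* simulate 100 periods of an AR(1) error process with rho = 0.8
clear
set obs 100
set seed 1234
gen t = _n
tsset t
gen u = rnormal()
gen e = u
replace e = 0.8*L.e + u in 2/100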

Fig. 13.43 provides, for a simple linear regression, a view of the autocorrelation of residuals; that is, the error terms do not present randomness and are temporally correlated.

Fig. 13.43 The autocorrelation of residuals problem.

13.3.4.1 Causes of the Autocorrelation of Residuals

According to Vasconcellos and Alves (2000) and Greene (2012), specification errors as to the functional form or as to the omission of relevant explanatory variables can generate autocorrelated error terms. Besides this, the autocorrelation of residuals can be caused by seasonal phenomena and, consequently, by the deseasonalizing of those series.

Let’s imagine that a researcher wants to investigate the relation between the consumption of ice cream (in tons) in a certain city and the population growth over the quarters. As such, over a period of two years (eight quarters), data were collected and the graph presented in Fig. 13.44 was prepared. By means of the graph, we see that city population growth over time caused ice cream consumption to increase. However, due to the existing seasonality, since ice cream consumption is greater in the spring and summer periods and smaller in the fall and winter periods, the linear functional form (which disregards this seasonality) causes autocorrelated error terms to be generated over time.

Fig. 13.44 Seasonality as a cause of the autocorrelation of residuals.

13.3.4.2 Consequences of the Autocorrelation of Residuals

All of the causes presented (errors in the specification of the functional form of the model, omission of relevant explanatory variables, and the deseasonalizing of series) can lead to the autocorrelation of residuals, which generates unbiased, though inefficient, parameter estimates and underestimated standard errors of the parameters, causing problems with the t statistics hypothesis tests.

So as to be able to detect the presence of the autocorrelation of residuals, we next present the Durbin-Watson and Breusch-Godfrey tests.

13.3.4.3 Autocorrelation of Residuals Diagnostic: The Durbin-Watson Test

The Durbin-Watson test is the one most used by researchers intending to verify the existence of the autocorrelation of residuals, even though its application is only valid to test first-order autocorrelation. The DW statistic of the test is given by:

$$DW = \frac{\sum_{t=2}^{n}\left(\varepsilon_t - \varepsilon_{t-1}\right)^2}{\sum_{t=1}^{n}\varepsilon_t^2} \qquad (13.50)$$

where ɛt represents the estimated error terms of Expression (13.47). Since we know that the correlation between ɛt and ɛt−1 is given by:

$$\hat{\rho} = \frac{\sum_{t=2}^{n}\varepsilon_t\,\varepsilon_{t-1}}{\sum_{t=2}^{n}\varepsilon_{t-1}^2} \qquad (13.51)$$

for sufficiently large values of n, we can deduce that:

$$DW \approx 2\,(1 - \hat{\rho}) \qquad (13.52)$$

and it is for this reason that many researchers state that a DW statistic approximately equal to 2 indicates the inexistence of autocorrelation of residuals (ρ̂ ≈ 0). Even though this is true for first-order autoregressive processes, a table with critical values dU and dL of the DW distribution can offer the researcher a more concrete assessment of the real existence of autocorrelation, since it provides the dU and dL values as a function of the number of sample observations, the number of model parameters, and the level of statistical significance the researcher desires. While Table C in the Appendix provides these critical values, Fig. 13.45 presents the DW distribution and the criteria for the existence or nonexistence of autocorrelation.

Fig. 13.45 DW distribution and the criteria for the existence of autocorrelation.
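For reference, a minimal sketch of how the DW statistic is obtained in Stata for time series data (t, y, x1, and x2 are hypothetical variables):

* declare the time variable, estimate the model, and request DW
tsset t
quietly reg y x1 x2
estat dwatson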

Even though widely used, the Durbin-Watson test, as has been discussed, is only valid for verifying the existence of first-order autocorrelation of the error terms. Besides this, it is not appropriate for models where the lagged dependent variable is included as one of the explanatory variables. It is in this sense that the Breusch-Godfrey test comes to be a more interesting alternative.

13.3.4.4 Autocorrelation of Residuals Diagnostic: The Breusch-Godfrey Test

The Breusch-Godfrey test, arising from two individually published articles in 1978 (Breusch, 1978; Godfrey, 1978), allows for testing the autocorrelation of residuals in a model that presents a lagged dependent variable as one of its explanatory variables. Besides that, it also allows the researcher to verify if the autocorrelation is of 1st order, 2nd order, or of order p, being, therefore, more general than the Durbin-Watson test.

Given the same multiple linear regression model once again:

$$Y_t = a + b_1 X_{1t} + b_2 X_{2t} + \cdots + b_k X_{kt} + \varepsilon_t \qquad (13.53)$$

we can define that the error terms follow an autoregressive process of order p, such that:

$$\varepsilon_t = \rho_1\,\varepsilon_{t-1} + \rho_2\,\varepsilon_{t-2} + \cdots + \rho_p\,\varepsilon_{t-p} + u_t \qquad (13.54)$$

where u has normal distribution, zero mean, and constant variance.

As such, by means of the OLS estimation of the model represented by Expression (13.53), we can obtain ɛ̂t and prepare the following regression:

$$\hat{\varepsilon}_t = d_1 X_{1t} + d_2 X_{2t} + \cdots + d_k X_{kt} + \hat{\rho}_1\,\hat{\varepsilon}_{t-1} + \hat{\rho}_2\,\hat{\varepsilon}_{t-2} + \cdots + \hat{\rho}_p\,\hat{\varepsilon}_{t-p} + \nu_t \qquad (13.55)$$

Breusch and Godfrey prove that the statistic of the test is given by:

$$BG = (n - p)\,R^2 \sim \chi_p^2 \qquad (13.56)$$

where n is the size of the sample, p is the order of the autoregressive process, and R² is the coefficient of determination obtained from the estimation of the model in Expression (13.55). In this way, if (n − p) ⋅ R² is higher than the critical value of the chi-square with p degrees of freedom, we reject the null hypothesis of the inexistence of autocorrelation of residuals; that is, at least one ρ̂ parameter in Expression (13.55) is statistically different from zero.

The main disadvantage of the Breusch-Godfrey test is that it does not permit defining, a priori, the number of lags p in Expression (13.54), requiring the researcher to test different possibilities for p.
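In Stata, this amounts to simply running the test for several candidate lag orders; a minimal sketch with hypothetical variables:

* Breusch-Godfrey test for autocorrelation of orders 1 through 4
tsset t
quietly reg y x1 x2
estat bgodfrey, lags(1 2 3 4)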

13.3.4.5 Possible Solutions for the Autocorrelation of Residuals Problem

The autocorrelation of residuals can be treated by altering the functional form of the model or by including the omitted relevant variable. Tests to identify these specification problems are found in Section 13.3.5.

However, in case the conclusion is that the autocorrelation is “pure,” that is, that it is not due to specification problems resulting from an inadequate functional form or from the omission of a relevant variable, the problem can be solved by means of the generalized least squares method, whose objective is to find the best transformation of the original model so as to generate nonautocorrelated error terms.

We again imagine our original model, however, with only one explanatory variable. As such:

$$Y_t = a + b\,X_t + \varepsilon_t \qquad (13.57)$$

where:

$$\varepsilon_t = \rho\,\varepsilon_{t-1} + u_t \qquad (13.58)$$

where u has normal distribution, zero mean, and constant variance.

Since it is our intent to modify the model of Expression (13.57) so that the error terms come to be u, and no longer ɛ, we can multiply the terms of this expression by ρ and lag them by one period. Doing this, we have:

$$\rho\,Y_{t-1} = \rho\,a + \rho\,b\,X_{t-1} + \rho\,\varepsilon_{t-1} \qquad (13.59)$$

By subtracting Expression (13.59) from Expression (13.57), we come to have:

$$Y_t - \rho\,Y_{t-1} = a\,(1 - \rho) + b\,(X_t - \rho\,X_{t-1}) + u_t \qquad (13.60)$$

which becomes the model with noncorrelated error terms. For this transformation to occur, however, it is necessary that the researcher know ρ.
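When ρ is unknown, it can be estimated from the data themselves. One possibility in Stata is the prais command, which iteratively estimates ρ and applies a transformation analogous to Expression (13.60); a minimal sketch with hypothetical variables y, x, and time index t:

* feasible GLS for AR(1) errors: Prais-Winsten and Cochrane-Orcutt
tsset t
prais y x
prais y x, corc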

In Section 13.5, which presents the application of the multiple regression models by means of Stata, procedures will be presented to verify each of the presuppositions, with the respective tests and results.

13.3.5 Detection of Specification Problems: Linktest and RESET Test

As we have noticed, a large part of regression presupposition violations are generated by failures in model specification, that is, by problems in defining the functional form and by the omission of relevant explanatory variables. There are many detection methods for specification problems, but the most used are the linktest and the RESET test.

The linktest is a procedure that creates, based on the estimation of a regression model, two new variables: Ŷ and Ŷ². As such, based on the estimation of an original model:

$$Y_i = a + b_1 X_{1i} + b_2 X_{2i} + \cdots + b_k X_{ki} + u_i \qquad (13.61)$$

we can estimate the following model:

$$Y_i = a + d_1\,\hat{Y}_i + d_2\,(\hat{Y}_i)^2 + \nu_i \qquad (13.62)$$

from which we expect Ŷ to be statistically significant and Ŷ² not, since, if the original model was correctly specified in terms of functional form, the square of the predicted values of the dependent variable should not present explanatory power over the original dependent variable. When the linktest is applied directly in Stata, it presents exactly this configuration; however, a researcher interested in evaluating the statistical significance of the Ŷ variable with different exponents can do it manually.
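A minimal sketch of this manual route in Stata (y, x1, and x2 are hypothetical variables):

* manual version of the linktest procedure
quietly reg y x1 x2
predict yhat, xb
gen yhat2 = yhat^2
reg y yhat yhat2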

The RESET test (regression specification error test) evaluates the existence of model specification errors by the omission of relevant variables. Similar to the linktest, RESET also creates new variables from the Ŷ values generated by the original model estimation represented by Expression (13.61). From this, we can estimate the following model:

$$Y_i = a + b_1 X_{1i} + b_2 X_{2i} + \cdots + b_k X_{ki} + d_1\,(\hat{Y}_i)^2 + d_2\,(\hat{Y}_i)^3 + d_3\,(\hat{Y}_i)^4 + \nu_i \qquad (13.63)$$

From the estimation of the model represented by Expression (13.63), we can calculate the F statistic as follows:

$$F = \frac{\left(\sum_{i=1}^{n} u_i^2 - \sum_{i=1}^{n} \nu_i^2\right)\big/\,3}{\left(\sum_{i=1}^{n} \nu_i^2\right)\big/\,(n - k - 4)} \qquad (13.64)$$

where n is the number of observations and k is the number of explanatory variables in the original model.

As such, if the calculated F statistic for (3, n − k − 4) degrees of freedom is lower than the corresponding critical F (H0 of the RESET test), we can state that the original model does not present an omission of relevant explanatory variables.
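A minimal manual sketch of this procedure in Stata, under the same notation (hypothetical variables; the fit* names are merely illustrative):

* manual RESET: add powers of the fitted values and test them jointly
quietly reg y x1 x2
predict fit, xb
gen fit2 = fit^2
gen fit3 = fit^3
gen fit4 = fit^4
reg y x1 x2 fit2 fit3 fit4
test fit2 fit3 fit4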

In the same way as for the linktest, we will elaborate the RESET test in Section 13.5, based on the estimation of a model in Stata.

13.4 Nonlinear Regression Models

As we have already studied, a linear regression model with a single X variable can be represented by:

$$Y_i = a + b\,X_i + u_i \qquad (13.65)$$

However, imagine a situation where the Y variable is better explained by a nonlinear behavior of the X variable. In this case, the adoption of a linear functional form by the researcher could generate a model with a lower R² and, consequently, lower predictive power.

Imagine a hypothetical situation presented by means of Fig. 13.46. Clearly, X and Y relate in a nonlinear manner.

Fig. 13.46 Example of nonlinear behavior between a Y variable and an X variable.

A quite curious researcher prepared four regression models with the idea of choosing the most appropriate one for forecasting. The functional forms chosen were the linear, the semilogarithmic, the quadratic, and the exponential. Fig. 13.47 shows the results of these four models.

Fig. 13.47 Results of applying four different functional forms in a regression. (A) Linear specification, (B) semilogarithmic specification, (C) quadratic specification, (D) exponential specification.

In analyzing the results, the researcher determined that the semilogarithmic form presented the highest R², which provides the model with the greatest predictive power, and it was therefore the chosen model. Besides that, it was also noticed that, in this case, the linear functional form presented the lowest R².

The relations between variables can be given by means of innumerable nonlinear functional forms, which should eventually be considered in the estimation of regression models, so that the behavior of the different phenomena can be understood in a more adequate manner. In this sense, Box 13.3 presents the main functional forms used.

Box 13.3

Main Functional Forms in Regression Models

Functional Form                   Model
Linear                            Yi = a + b ⋅ Xi + ui
Semilogarithmic to the right      Yi = a + b ⋅ ln(Xi) + ui
Semilogarithmic to the left       ln(Yi) = a + b ⋅ Xi + ui
Logarithmic (or log-log)          ln(Yi) = a + b ⋅ ln(Xi) + ui
Inverse                           Yi = a + b ⋅ (1/Xi) + ui
Quadratic                         Yi = a + b ⋅ (Xi)² + ui
Cubic                             Yi = a + b ⋅ (Xi)³ + ui
Exponential                       Yi = a ⋅ (Xi)^b + ui

Source: Fouto (2004) and Fávero (2005).

The definition of the best functional form is an empirical question, to be decided in favor of the best adjustment to the data. We emphasize, however, that the researcher has the freedom to apply the functional forms that best agree with the underlying theory, the preliminary analysis of the data, and also their own experience. The decision in favor of a certain functional form, respecting the presuppositions of the technique, is then based on the higher R² (for the same sample and with the same number of parameters; otherwise, the option should be made for the functional form whose model presents the highest adjusted R², as has already been discussed).
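As an illustration of this empirical comparison, candidate forms that keep the same dependent variable (and are therefore directly comparable by R²) can be estimated in sequence; a minimal Stata sketch with hypothetical y and x:

* candidate functional forms with the same dependent variable
gen lnx = ln(x)
gen invx = 1/x
gen x2 = x^2
reg y x
reg y lnx
reg y invx
reg y x2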

While on the linear functional form the b parameter indicates the marginal effect of the X variable on the Y variable, on the semilogarithmic form to the right the b parameter represents the marginal effect of the ln(X) variable on the Y variable.

Yet the parameters of the models with the inverse, quadratic, and cubic functional forms represent the marginal effect, on the Y variable, of the variation of the inverse, the square, and the cube of X, respectively.

Finally, in the semilogarithmic to the left and logarithmic (log-log) forms, the coefficient of the X variable can be interpreted as a partial elasticity. It is important to mention that the binary and multinomial logistic models, the Poisson and negative binomial regression models for count data, and the regression models for survival data are specific cases of semilogarithmic to the left models, also known as log-linear or nonlinear exponential models. The binary and multinomial logistic models and the Poisson and negative binomial regression models for count data will be studied in Chapters 14 and 15, respectively. The regression models for survival data can be found in Fávero (2015) and Fávero and Belfiore (2017).

13.4.1 The Box-Cox Transformation: The General Regression Model

Box and Cox (1964), in a seminal article, present a general regression model from which all of the presented functional forms are derived, that is, of which they are particular cases. According to the authors, and to what Fávero (2005) and Fávero et al. (2009) discuss, based on a linear regression model with a single X variable, represented by means of Expression (13.65), a transformed model can be obtained by substituting Y with (Yλ − 1)/λ and X with (Xθ − 1)/θ, in which λ and θ are the transformation parameters. As such, the model comes to be:

$$\frac{Y_i^{\lambda} - 1}{\lambda} = a + b\left(\frac{X_i^{\theta} - 1}{\theta}\right) + u_i \qquad (13.66)$$

Based on Expression (13.66), we can attribute, according to what is shown in Box 13.4, values for λ and θ so as to obtain, as particular cases, some of the main functional forms defined in Box 13.3.

Box 13.4

Box-Cox Transformations and Values of λ and θ for Each Functional Form

Parameter λ    Parameter θ    Functional Form
1              1              Linear
1              0              Semilogarithmic to the right
0              1              Semilogarithmic to the left
0              0              Logarithmic (or log-log)
1              −1             Inverse
1              2              Quadratic
1              3              Cubic

Box and Cox (1964) demonstrate, by Taylor expansion, that the natural logarithm (ln) is obtained when the corresponding parameter (λ or θ) is equal to zero.
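Indeed, writing Y^λ = e^{λ ln Y} and expanding the exponential:

$$\lim_{\lambda \to 0}\frac{Y^{\lambda}-1}{\lambda} = \lim_{\lambda \to 0}\frac{e^{\lambda \ln Y}-1}{\lambda} = \lim_{\lambda \to 0}\frac{\lambda \ln Y + \frac{(\lambda \ln Y)^2}{2!} + \cdots}{\lambda} = \ln Y$$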

A new variable obtained by means of a Box-Cox transformation applied to an original variable comes to present a new distribution (new histogram). For this reason, it is very common for researchers to obtain new transformed variables based on the original variables in cases where the latter present great amplitudes and highly discrepant values. For example, imagine a dataset of rental prices per square foot of store space, which can vary from $100 to $10,000 per ft². In this case, the application of the natural logarithm would considerably reduce the amplitude and the discrepancy of the values (ln(100) = 4.6 and ln(10,000) = 9.2). In Finance and Accounting, for example, firm size is a variable traditionally measured as the natural logarithm of the company's assets.

For dummy variables, obviously, no Box-Cox transformation makes any sense, since, as these assume only the values 0 or 1, no exponent will alter the original value of the variable.

According to what we studied in Section 13.3, the presuppositions related to the residuals (normality, homoskedasticity, and absence of autocorrelation) in regression models can be violated by failures in the specification of the functional form. In this way, a Box-Cox transformation can help the researcher in the definition of other, nonlinear functional forms, even providing an answer to the following question: which Box-Cox parameter (λ for the dependent variable and θ for an explanatory variable) maximizes the adherence to normality of the distribution of the new transformed variable generated from an original variable? Since the Box-Cox parameters vary from −∞ to +∞, any value could be obtained. We will use the Stata software, in Section 13.5, to answer this important question.

13.5 Estimation of Regression Models in Stata

The objective of this section is not to discuss again the concepts inherent to the statistics and presuppositions of the regression technique, but to provide the researcher with knowledge of the Stata commands, as well as to show the software's advantages in relation to other packages with respect to confirmatory models. The same example used in Section 13.2 will be used here, since this is the criterion adopted throughout the book. The reproductions of the Stata Statistical Software in this section have the authorization of StataCorp LP©.

Returning to our example, we remember that a professor was interested in evaluating if the travel time for students to school, independent of where they lived, was influenced by variables such as distance, number of traffic lights, period of the day when they traveled, and their driving style. We now return to the final dataset built by the professor by means of the questionnaires given to the group of 10 students. The dataset is found in the file Timedistsemperstyle.dta and is exactly equal to that presented in Table 13.10.

Initially, we can type the command desc, which allows us to analyze the dataset characteristics, such as the number of observations, the number of variables, and a description of each. Fig. 13.48 presents the first Stata output.

Fig. 13.48 Description of the Timedistsemperstyle.dta dataset.

Even though the per variable is qualitative, it possesses only two categories, which, in the dataset, are already labeled as a dummy (morning = 1; afternoon = 0). On the other hand, the style variable has three categories and, therefore, will require us to create (3 − 1 = 2) dummies, according to what was discussed in Section 13.2.6. The tab command offers the frequency distribution of a qualitative variable, highlighting the number of categories. If researchers have any doubt regarding the number of categories, they can easily resort to this command (Fig. 13.49).

Fig. 13.49 Frequency distribution of style variable.

The command xi i.style will provide us with these two dummies, named by Stata as _Istyle_2 and _Istyle_3, exactly maintaining the criteria presented in Table 13.11 (Fig. 13.50).

Fig. 13.50 Creation of two dummies based on style variable.

Before preparing the actual multiple regression model, we can generate a graph that shows the interrelations between the variables, two by two. This graph, known as the matrix graph, can provide the researcher a better understanding of how the variables relate, including offering an eventual suggestion regarding nonlinear functional forms. Let's now prepare the graph using only the model's quantitative variables (Fig. 13.51), so as to make visualization easier. We should type the following command:

Fig. 13.51 Interrelationship between variables—graph matrix.

graph matrix time dist sem

By means of this graph, we can see that the relationship between the time variable and the dist and sem variables is positive and, apparently, linear. We can also see that there may exist a certain multicollinearity between the explanatory variables. A correlation matrix can also be generated before preparation of the regression, so as to arm the researcher with information in this phase of dataset diagnosis. To do this, we should type the following command:

pwcorr time dist sem per _Istyle_2 _Istyle_3, sig

Fig. 13.52 presents the correlation matrix.

Fig. 13.52 Correlation matrix.

By means of the correlation matrix, we can see that the correlations between time and dist and between time and sem are high and statistically significant at the 5% significance level. It is important to mention that the values presented below each correlation refer to the respective significance levels. By means of the same matrix, on the other hand, we can see that eventual multicollinearity problems may arise between some explanatory variables, such as, for example, between per and _Istyle_3. As we will see ahead, even though the correlation between time and per is higher, in absolute value, than that between time and _Istyle_3, the per variable will be excluded from the final model by the Stepwise procedure, differently from the _Istyle_3 variable.

Let’s now go to the modeling. To do this, we should enter the following command:

reg time dist sem per _Istyle_2 _Istyle_3

The reg command estimates a regression by means of the OLS method. If the researcher does not indicate the desired confidence level for the estimated parameter intervals, the standard will be 95%. However, if the researcher desires to alter the confidence level for the parameter intervals, for example to 90%, the following command should be entered:

reg time dist sem per _Istyle_2 _Istyle_3, level(90)

We will now continue with the analysis, maintaining the confidence level for the parameter intervals at 95%. The outputs found in Fig. 13.53 are exactly equal to those presented in Fig. 13.32.

Fig. 13.53 Multiple linear regression outputs in Stata.

Since the regression technique is part of the group of models known as Generalized Linear Models, and since the dependent variable presents a normal distribution (also known as the Gauss distribution or Gaussian distribution), the parameters estimated by OLS (reg command) and presented in Fig. 13.53 could also be obtained by maximum likelihood estimation, to be studied in the next chapter. For such, the following command could be entered:

glm time dist sem per _Istyle_2 _Istyle_3, family(gaussian)

As we have already discussed, the per and _Istyle_2 variables do not show themselves to be statistically significant in this model in the presence of the remaining variables, at the 5% significance level. We will now begin the application of the Stepwise procedure, which excludes the variables whose parameters do not show themselves to be statistically significant, even though this could generate a specification problem by the omission of a certain variable that would be relevant to explain the behavior of the dependent variable, in case there were no other explanatory variables in the final model. Further ahead, we will use the RESET test to verify the eventual existence of model specification errors by the omission of relevant variables.

Let’s now type the following command:

stepwise, pr(0.05): reg time dist sem per _Istyle_2 _Istyle_3

To prepare the stepwise command, the researcher must define the significance level for the t-test, based upon which the explanatory variables are excluded from the model. The outputs can be seen in Fig. 13.54 and are exactly equal to those presented in Fig. 13.33.

Fig. 13.54 Multiple linear regression outputs with stepwise procedure in Stata.

Analogously, the estimated parameters presented in Fig. 13.54 can also be obtained by means of the following command:

stepwise, pr(0.05): glm time dist sem per _Istyle_2 _Istyle_3, family(gaussian)

As we have already studied in Section 13.2.6, we arrive at the following multiple linear regression model:

$$\widehat{time}_i = 8.2919 + 0.7105\,dist_i + 7.8368\,sem_i + 8.9676\,\_Istyle\_3_i \quad \begin{cases} \text{calm} = 0\\ \text{aggressive} = 1 \end{cases}$$

The command predict yhat causes a new variable (yhat) to be generated in the dataset, which offers the predicted (Ŷ) values for each observation based on the last estimated model.

However, we may also want to know the predicted value for a certain observation not found in the dataset. In other words, we can again pose the question asked, and answered manually, at the end of Section 13.2.6: what would be the estimated time for a student who travels 17 km, goes through two traffic lights, decides to go to school in the morning, and has what is considered an aggressive driving style?

By means of the mfx command, Stata allows the researcher to answer this question directly. To do this, we should type the following command:

mfx, at(dist = 17 sem = 2 _Istyle_3 = 1)

Obviously, the term per = 1 does not need to be included in the mfx command, being that the per variable is not present in the final model. The output is presented in Fig. 13.55 and, by it, we arrive at our answer of 45.0109 min, which is exactly equal to that calculated manually in Section 13.2.6.

Fig. 13.55 Calculation of Y estimate for values of explanatory variables—mfx command.

Having defined the model, we now embark on the verification of the technique presuppositions, as studied in Section 13.3. Before that, however, it is always interesting for the researcher, in estimating a certain model, to conduct an analysis of eventual disparate observations in the dataset that may be significantly influencing the model parameter estimates. As we know, this influence, as well as the presence of outliers, can be one of the causes of heteroskedasticity.

To this end, we introduce the concept of leverage distance, which, for each observation i, corresponds to the value in the ith position of the main diagonal of the X(X′X)⁻¹X′ matrix. An observation can be considered a strong influencer of the model parameter estimates if its leverage distance is higher than 2 ⋅ k/n, where k is the number of explanatory variables and n is the sample size. The leverage distances are generated in Stata by means of the command:

predict lev, leverage

In our example, we will solicit Stata to generate the leverage distances for the final estimated model using the Stepwise procedure. These distances are presented in Table 13.16.

Table 13.16

Leverage Distances for the Final Model
Observation (i)     levi (Final Model)
Gabriela            0.23
Dalila              0.45
Gustavo             0.33
Leticia             0.54
Luiz Ovidio         0.54
Leonor              0.22
Ana                 0.28
Antonio             0.74
Julia               0.51
Mariana             0.16

In the final model, with 2 ⋅ k/n = 2 ⋅ 3/10 = 0.6, observation 8 (Antonio) is the one with the greatest potential to influence the estimation of the parameters and, consequently, special attention should be given to it, since eventual heteroskedasticity problems could arise due to this fact. A graph of the leverage distances as a function of the squared normalized (or standardized) residuals (Fig. 13.56) can provide the researcher with an easy analysis of the observations with greater influence over the model parameters (high leverage distances) and, at the same time, of the observations considered outliers (high squared normalized residuals). As we know, both can generate estimation problems. The command developed for the graph of our example is:

Fig. 13.56 Leverage distances in function of the squared normalized residuals.

lvr2plot, mlabel(student)
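If desired, the observations exceeding the threshold can also be listed directly; a minimal sketch using the lev variable created earlier and the dataset's student identifier:

* list observations whose leverage exceeds 2k/n = 2*3/10 = 0.6
list student lev if lev > 2*3/10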

By means of the graph in Fig. 13.56, we see that, while Antonio has greater influence over the model parameters, Ana tends to be an outlier in the sample due to the fact that she presents a greater error term in absolute value (and, consequently, a greater squared normalized residual). The degree of influence of these observations over the rise of heteroskedasticity in the model should be investigated during the preparation of the presupposition verification tests. So, let's get to them!

The first presupposition, according to what is shown in Box 13.2, refers to the normality of residuals. In this way, we will generate a variable that corresponds to the error terms in the final model. For such, we will type the following command:

predict res, res

After generating the res variable, which offers the values of the error terms for each observation in the final model fitted using the Stepwise procedure, we can prepare a graph that compares the distribution of the error terms generated by the model with the standard normal distribution. As such, we should type the following command:

kdensity res, normal

The generated graph can be found in Fig. 13.57 and, by means of it, we can get an idea of how the distribution of the generated residuals (kernel density estimate) compares to the standard normal distribution.

Fig. 13.57 Graph of adherence between residuals distribution and normal distribution.

Since the sample for this example has only 10 observations, we will apply the Shapiro-Wilk test, recommended for samples of up to 30 observations (as we discussed in Chapter 9), to effectively verify the hypothesis that the distribution of the residuals adheres to the normal distribution. We will use the following command:

swilk res

The output of this test is found in Fig. 13.58 and, by its analysis, we can see that the error terms present normal distribution at the 5% significance level, since the null hypothesis is not rejected.

Fig. 13.58 Result of Shapiro-Wilk normality test for residuals.

For larger samples, as we have discussed, application of the Shapiro-Francia test is recommended. The command is:

sfrancia res

The second presupposition to be verified has to do with the existence of multicollinearity of the explanatory variables. After preparing the complete model (as yet without the Stepwise procedure), we can type the following command:

estat vif

The outputs are presented in Fig. 13.59 and, by means of them, we see that the VIF statistic of the per variable is the highest of all (VIFper = 19.86), which indicates that the R² resulting from a regression of this variable as dependent on all the others would be approximately 95% (Toleranceper = 0.05). Fig. 13.52 shows us that the correlations between the per variable and the remaining explanatory variables are quite high, which initially leads us to understand that multicollinearity exists. However, as we know, the final model includes neither this variable nor the _Istyle_2 variable. Fig. 13.60 shows the outputs generated by means of the estat vif command applied after the Stepwise procedure.

Fig. 13.59 VIF and Tolerance statistics of explanatory variables for the complete model.
Fig. 13.60 VIF and Tolerance statistics of the explanatory variables for the final model.

Since the final model obtained after the Stepwise procedure does not present very high VIF statistics for any explanatory variable, we can assume that the multicollinearity in the complete model was considerably reduced. The sem variable, present in the final model, had its VIF statistic reduced from 6.35 to 2.35, mainly due to the exclusion of the per variable. It is only important that we verify, by means of the RESET test, whether the exclusion of these variables generates any specification problem by the omission of a relevant variable. This will be worked out further on.

The third presupposition considers the absence of heteroskedasticity. Initially, only for diagnosis, we will prepare a graph of the values of the error terms as a function of the fitted (Ŷ) values of the estimated model. Fig. 13.61 presents the graphs generated after the estimations of the complete and final models, in which the standardized residual values are plotted against the estimated values of the dependent variable. The command for the elaboration of these graphs, which should be typed after the estimation of each of the models, is:

Fig. 13.61 Graphing method for identification of heteroskedasticity. (A) Complete model with all variables. (B) Final model (stepwise procedure).

rvfplot, yline(0)

While Fig. 13.61A shows the formation of a clearly visible “cone,” the same cannot be said in relation to Fig. 13.61B. In fact, as we will see further on, the complete model, with the inclusion of all explanatory variables, shows heteroskedasticity, while the final model obtained by means of the Stepwise procedure generates homoskedastic error terms.

To verify the existence of heteroskedasticity, we will apply the Breusch-Pagan/Cook-Weisberg test, which, according to what we have already discussed, presents, as a null hypothesis, the fact that the variance of the error terms is constant (homoskedastic errors) and, as an alternative hypothesis, the fact that the variance of the error terms is not constant, that is, that the error terms are a function of one or more explanatory variables (heteroskedastic errors). This test is indicated for cases where the supposition of normality of the residuals is verified, as in the present example.

Section 13.3.3.3, as we have seen, describes the test and offers the option that the same be done manually, step by step. We will do this first so that the researcher can analyze the outputs and compare them to the results generated by Stata.

To do this, we must develop a table that allows for calculation of the Breusch-Pagan statistic, based on the estimation of the final model:

$$time_i = 8.2919 + 0.7105\,dist_i + 7.8368\,sem_i + 8.9676\,\_Istyle\_3_i + u_i$$

Based on the estimates of ui for each observation, we can calculate the values of ui² and, by means of Expression (13.40), the values of upi. Table 13.17 gives us these values.

Table 13.17

Elaboration of the Breusch-Pagan/Cook-Weisberg Test
Observation (i)    ui = (Yi − Ŷi)    ui²        upi = ui²/[(Σui²)/n]    Ŷi
Gabriela           1.02444           1.04948    1.14555                 13.97556
Dalila             −0.39149          0.15327    0.16730                 20.39149
Gustavo            1.05127           1.10517    1.20634                 18.94873
Leticia            0.69455           0.48241    0.52657                 39.30545
Luiz Ovidio        −0.69455          0.48241    0.52657                 50.69455
Leonor             1.05624           1.11564    1.21777                 23.94376
Ana                −1.84420          3.40106    3.71240                 11.84420
Antonio            0.46304           0.21440    0.23403                 54.53696
Julia              −1.02146          1.04339    1.13890                 36.02146
Mariana            −0.33784          0.11413    0.12458                 30.33784
Sum                                  9.16137
Average                              0.91614


To obtain the test result, the procedure is to estimate the regression upi = a + b ⋅ Ŷi + ξi, from which we obtain the sum of squares due to regression (SSR), which, divided by 2, gives the χ²BP/CW statistic. In our example, SSR = 3.18, from which it follows that χ²BP/CW = 1.59 < χ²1 d.f. = 3.84 at the 5% significance level. This means that the null hypothesis of the test (homoskedastic error terms) cannot be rejected.

The command to apply this test directly in Stata is given as:

estat hettest

which evaluates the existence of heteroskedasticity in the last estimated model. The result of this test for the complete model, with the inclusion of all explanatory variables, even though not presented here, shows the existence of heteroskedasticity, as we already expected based on the analysis of Fig. 13.61A. On the other hand, Fig. 13.62 presents the test results for the final model resulting from the Stepwise procedure, which are exactly the same as those obtained manually. By means of their analysis, we can say that the final model does not present heteroskedasticity problems (P-value χ² = 0.2069 > 0.05).

Fig. 13.62 Breusch-Pagan/Cook-Weisberg test for heteroskedasticity.

Analogous to the Breusch-Pagan/Cook-Weisberg test, the White test also evaluates the rejection or nonrejection of the null hypothesis that the error terms are homoskedastic, at a certain significance level. The command for this test is:

estat imtest, white

The output is presented in Fig. 13.63 and offers the same conclusion regarding the inexistence of heteroskedasticity of the final model residuals.

Fig. 13.63 White test for heteroskedasticity.

As we did not verify the existence of heteroskedasticity in the final proposed model, we did not prepare an estimation by the weighted least squares method. However, in case a researcher desires, for some reason, to estimate a model weighted by the per variable, the following estimation can be used:

$$\frac{time_i}{per_i} = \frac{a}{per_i} + b_1\frac{dist_i}{per_i} + b_2\frac{sem_i}{per_i} + b_3\frac{per_i}{per_i} + b_4\frac{\_Istyle\_2_i}{per_i} + b_5\frac{\_Istyle\_3_i}{per_i} + \frac{u_i}{per_i}$$

The command for the estimation of the weighted least squares model by the per variable would be:

wls0 time dist sem per _Istyle_2 _Istyle_3, wvar(per) type(abse)

We also do not present the outputs of the Huber-White robust standard error estimation, given the inexistence of heteroskedasticity in this example. However, if a researcher is interested in studying the technique, the command to estimate this model would be:

reg time dist sem per _Istyle_2 _Istyle_3, rob

Since the dataset for our example is a cross-section, we do not verify the autocorrelation of residuals presupposition in this case. However, further on, by means of another dataset, we will study the application of the Stata tests directed toward the verification of this presupposition.

As such, we now go on to the use of the linktest, which, as discussed in Section 13.3.5, refers to a procedure that creates two new variables based on the estimation of a regression model, namely Ŷ and Ŷ². By regressing Y on these two variables, it is hoped that Ŷ will be statistically significant and Ŷ² will not, since, if the original model was correctly specified in terms of functional form, the square of the fitted values of the dependent variable should not present explanatory power over the original dependent variable. The command for the application of this test in Stata is:

linktest

that should be entered after the preparation of the final model. The test outputs can be seen in Fig. 13.64.

Fig. 13.64 Linktest for verification the adequacy of the model functional form.

By analyzing the outputs in Fig. 13.64, more specifically the P-value of the t statistic of the _hatsq variable (which refers to Ŷ², that is, the squared predicted value of the time variable), it can be stated that the linktest does not reject the null hypothesis that the model was specified correctly in terms of functional form; in other words, the linear functional form is adequate in this case.

The RESET test, also discussed in Section 13.3.5, evaluates the existence of model specification errors by the omission of relevant variables and, analogous to the linktest, creates new variables based on the values of Ŷ generated from the estimation of the original model. In this way, after the estimation of the final model by means of the Stepwise procedure and according to Expression (13.63), we will estimate the following model, from which we will manually calculate the F statistic presented in Expression (13.64):

$$time_i = a + b_1\,dist_i + b_2\,sem_i + b_3\,\_Istyle\_3_i + d_1(\widehat{time}_i)^2 + d_2(\widehat{time}_i)^3 + d_3(\widehat{time}_i)^4 + \nu_i$$

Based on the final model generated by the Stepwise procedure (which has error terms ui) and on this last model, estimated according to Expression (13.63) in order to elaborate the RESET test (which has error terms νi), we can construct Table 13.18.

Table 13.18

Construction of F Statistic for the RESET Test
Observation (i)    ui          ui²        νi          νi²
Gabriela           1.02444     1.04948    1.27097     1.61537
Dalila             −0.39149    0.15327    −0.31770    0.10093
Gustavo            1.05127     1.10517    −0.49256    0.24261
Leticia            0.69455     0.48241    0.48498     0.23521
Luiz Ovidio        −0.69455    0.48241    −0.48498    0.23521
Leonor             1.05624     1.11564    0.51232     0.26247
Ana                −1.84420    3.40106    −0.75292    0.56689
Antonio            0.46304     0.21440    0.25524     0.06515
Julia              −1.02146    1.04339    0.12753     0.01626
Mariana            −0.33784    0.11413    −0.60288    0.36346
Sum                            9.16137                3.70356


And, based on Table 13.18, we can calculate the RESET test F statistic as follows:

$$F = \frac{\left(\sum_{i=1}^{n} u_i^2 - \sum_{i=1}^{n} \nu_i^2\right)\big/\,3}{\left(\sum_{i=1}^{n} \nu_i^2\right)\big/\,(n - k - 4)} = \frac{(9.16137 - 3.70356)/3}{3.70356/(10 - 3 - 4)} = 1.47$$

As the F statistic for (3, 3) degrees of freedom is less than the corresponding critical F (F(3,3) = 9.28 for the 5% significance level), we can state that the original model does not present the omission of relevant explanatory variables.

To conduct the RESET test in Stata, we should type the following command after the estimation of the final generated model by means of the Stepwise procedure:

ovtest

The output is found in Fig. 13.65.

Fig. 13.65 RESET test to verify the omission of relevant variables in the model.

In this way, the linktest and the RESET test show us that we do not have any errors in the specification of the final model generated by Stepwise. If this were not the case, we would need to respecify the model by means of changing its functional form or by means of including relevant explanatory variables that had been excluded in the estimation.

Therefore, the model proposed with the Stepwise procedure did not present any problems in relation to any of the presuppositions, nor does it present any specification errors.

So as to study the possible lack of linearity in regression models, we will now work with another dataset.

Now let’s imagine that our professor has been invited to give a lecture on urban mobility to a group of 50 public sector professionals. The lecture was based on the extensive research conducted on people's travel times as a function of distance traveled and other variables, such as the number of traffic lights crossed daily. At the end of the much applauded lecture, the professor could not miss the opportunity to collect more data for the investigation and, therefore, questioned each of the 50 participants regarding their travel time to the building where they were meeting, the distance traveled, and the number of traffic lights each had crossed that morning. Armed with this information, a dataset was put together, which can be found in the file Lecturetimedistsem.dta.

Following the professor’s steps, we should first prepare a multiple linear regression to evaluate the influence of the dist and sem variables on the time variable. To do this, we should type the following command:

reg time dist sem

The results can be found in Fig. 13.66.

Fig. 13.66 Multiple linear regression outputs.

Even though the preliminary results analysis shows a satisfactory estimation, the model presented in Fig. 13.66 presents a distribution of error terms not in keeping with normality, as can be verified by means of the Shapiro-Francia test (sample with more than 30 observations), obtained by means of the following commands:

predict res, res
sfrancia res

The test results are found in Fig. 13.67.

Fig. 13.67 Shapiro-Francia test results for residual normality verification.

As we discussed in Section 13.3.1, the presupposition of normality warrants that the P-value of the t-tests and the F-test be valid. However, the violation of said presupposition could be the result of specification errors as to the functional form of the model.

As such, we will need to prepare graphs of the dependent variable as a function of each of the explanatory variables individually and, in these graphs, present the linear adjustment (predicted values) and the adjustment known as lowess (locally weighted scatterplot smoothing), a nonparametric method that uses successive local regressions to indicate the pattern of behavior of the data and, by smoothing, fit a not-necessarily-linear curve. To achieve this, we should type the following commands:

graph twoway scatter time dist || lfit time dist || lowess time dist
graph twoway scatter time sem || lfit time sem || lowess time sem

Fig. 13.68 presents two generated graphs.

Fig. 13.68 Graphs with linear adjustment and lowess adjustment. (A) Time in function of distance traveled. (B) Time in function of number of traffic lights.

We can clearly see, by means of Fig. 13.68, that there are differences between the linear and lowess adjustments, especially for the dist variable (Fig. 13.68A). Another common and similar means of detecting the nonlinearity of the model is by means of graphs that present the relation between the augmented component-plus-residuals and each one of the explanatory variables. To obtain these graphs, we should type the following commands:

acprplot dist, lowess
acprplot sem, lowess

Fig. 13.69 presents the two generated graphs.

Fig. 13.69 Linear adjustment and lowess adjustment graphs for augmented component-plus-residuals. (A) Augmented component-plus-residuals in function of distance traveled. (B) Augmented component-plus-residuals in function of number of traffic lights.

Analogous to Fig. 13.68, the graph in Fig. 13.69A also shows that the lowess adjustment does not approximate the linear adjustment, contrary to the graph in Fig. 13.69B, which can indicate problems as to the linear functional form of the dist variable in the regression model. We can see, for this variable, that there is a considerable quantity of points that potentially influence the model behavior. The matrix graph clearly shows this phenomenon, as presented in Fig. 13.70, generated by using the following command:

Fig. 13.70 Interrelationship between variables—graph matrix.

graph matrix time dist sem, half

By means of the graph in Fig. 13.70, we see that the relation between the time and sem variables is apparently linear; however, the relationship between time and dist is clearly nonlinear, as has been discussed. Because of this, we will focus on the dist variable.

We will initially execute a logarithmic transformation on the dist variable, as such creating the lndist variable, as follows:

gen lndist = ln(dist)

And, in this way, we can estimate a new regression model, with the following functional form:

$$time_i = a + b_1\,lndist_i + b_2\,sem_i + u_i$$

whose parameters and Shapiro-Francia test results for residuals can be obtained in Stata by typing the following commands:

reg time lndist sem
predict res1, res
sfrancia res1

with the results presented in Fig. 13.71.

Fig. 13.71
Fig. 13.71 Results of the estimation of the nonlinear model and Shapiro-Francia test.

This shows that, even though the logarithmic transformation of explanatory variables can, in some cases, improve the quality of the model adjustment (which is not true in this case), it does not guarantee that the presupposition of residual normality is met. The graph in Fig. 13.72, obtained by means of the following command, shows us that the logarithmic functional form of the dist variable does not adequately adjust itself to the time variable.

acprplot lndist, lowess

Fig. 13.72
Fig. 13.72 Graph with linear adjustment and lowess adjustment for augmented component-plus-residuals in function of the natural logarithm of distance traveled.

In this way, as studied in Section 13.4.1, we can perform a Box-Cox transformation on the dependent variable, such that the newly created variable presents a distribution with higher adherence to the normal distribution, even though there is no guarantee whatsoever that this transformation will effectively generate a variable with a normal distribution. Let's create the variable bctime, based on the time variable, by means of the Box-Cox transformation. To do this, we need to type the following command:

bcskew0 bctime = time

Fig. 13.73 presents the results of the Box-Cox transformation, with emphasis on the λ parameter presented in Expression (13.66) (parameter L in the Stata output).

Fig. 13.73
Fig. 13.73 Box-Cox transformation on the dependent variable.

Thus, we have that:

$bctime_i = \dfrac{time_i^{\lambda} - 1}{\lambda} = \dfrac{time_i^{2.6486} - 1}{2.6486}$
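As a cross-check, the transformed variable created by bcskew0 can be reproduced manually from the λ reported in Fig. 13.73. A minimal sketch, assuming the λ value shown in the expression above (the variable name bctime_check is hypothetical):

* reproduce the Box-Cox transformation manually and compare with bctime
gen bctime_check = (time^2.6486 - 1)/2.6486
summarize bctime bctime_check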

A graph showing how close the distribution of the bctime variable (kernel density estimate) is to the normal distribution can be generated and compared with the analogous graph for the original time variable. These graphs can be obtained by means of the following commands:

kdensity time, normal
kdensity bctime, normal

and are presented in Fig. 13.74.

Fig. 13.74
Fig. 13.74 Graph of adherence between distribution of Y variable and normal distribution. (A) time variable, (B) bctime variable.

Even though the two variables do not present a very close adherence to normality, it can be seen that the greatest proximity is given by the bctime variable. Let’s, therefore, estimate the following model:

$bctime_i = a + b_1 \cdot dist_i + b_2 \cdot sem_i + u_i$

whose parameters and Shapiro-Francia test results can be obtained in Stata by typing the following commands:

reg bctime dist sem
predict res2, res
sfrancia res2

and whose results can be found in Fig. 13.75.

Fig. 13.75
Fig. 13.75 Results of the estimation of the model with Box-Cox transformation on dependent variable and Shapiro-Francia test.

This shows that greater adherence of the dependent variable's distribution to normality can allow regression models estimated by OLS to generate parameters more adequate for the determination of confidence intervals for forecasts, since these can then be generated with normal error terms.

We, then, arrive at the following model:

$\dfrac{time_i^{2.6486} - 1}{2.6486} = 7193.16 + 386.6511 \cdot dist_i + 840.903 \cdot sem_i + u_i$

which presents a low degree of heteroskedasticity (in fact, it presents homoskedastic error terms at the 1% significance level) and a VIF statistic of 1.83. The graph in Fig. 13.76 shows that the Box-Cox transformation on the dependent variable brings the estimated adjustment considerably close to the lowess adjustment. This graph can be obtained by means of the following command:

acprplot dist, lowess

Fig. 13.76
Fig. 13.76 Graph with linear adjustment and lowess adjustment for the augmented component-plus-residuals in function of distance traveled for the model with Box-Cox transformation.
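For readers who wish to reproduce the homoskedasticity and VIF statistics mentioned above, a minimal sketch, assuming the model of Fig. 13.75 is re-estimated first, is:

* Breusch-Pagan/Cook-Weisberg test and variance inflation factors
quietly reg bctime dist sem
estat hettest
estat vif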

It is then up to the researcher, in light of the diagnostics (which always need to be done), their experience, and the underlying theory, to define the adequate functional form for the estimation of regression models, so that the presuppositions are met and more efficient estimators are obtained for the preparation of forecasts.

Finally, we will now study the problem of residual autocorrelation by using Stata. Imagine that the professor, upon finishing the lecture and returning to school, has the idea of accompanying student travel times for a period of 30 days. To do this, day after day, the data referring to travel time, distance traveled, and number of traffic lights are collected from the students. However, instead of preparing the dataset per student, per day, which would result in a panel of longitudinal data, the professor tabulated the average of each variable per day, or rather, the average travel time per day, the average distance traveled by students per day, and the average number of traffic lights per day. The objective of the professor (and ours) is now to estimate the following model:

$time_t = a + b_1 \cdot dist_t + b_2 \cdot sem_t + \varepsilon_t, \quad t = 1, 2, \ldots, 30$

and the dataset can be found in the file Analysistemporaltimedistsem.dta.

Before estimating the proposed model, it is necessary to define the variable corresponding to the temporal evolution (in this case, the day variable). As such, we should type, upon opening the file, the following command:

tsset day

Information such as what appears in Fig. 13.77 will appear on the screen.

Fig. 13.77
Fig. 13.77 Definition of temporal variable.

In case the researcher forgets to define the variable referring to temporal evolution, which is quite common, Stata will not permit the Durbin-Watson and Breusch-Godfrey tests to be executed; an error message will appear on the output screen informing the researcher that the temporal variable must be defined. On the other hand, other statistical packages, such as SPSS, allow the Durbin-Watson statistic, for example, to be calculated even when the dataset is in cross-section, which is a serious error.
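As a quick check before running these tests, typing tsset with no arguments redisplays the current time-series declaration (a small sketch):

* verify that the temporal variable has been declared
tsset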

After preparing the regression, by means of the following command, we can then prepare the tests directed toward the verification of residual autocorrelation.

reg time dist sem

The results of the estimation are found in Fig. 13.78.

Fig. 13.78
Fig. 13.78 Temporal model estimation results.

Even though the estimated model presents problems, at the 5% significance level, in relation to the normality of residuals (Shapiro-Wilk test) and heteroskedasticity (Breusch-Pagan/Cook-Weisberg), we will restrict our analysis, at this time, to the residual autocorrelation. For this, we will initially perform the Durbin-Watson test by means of the following command:

estat dwatson

The test results are found in Fig. 13.79.

Fig. 13.79
Fig. 13.79 Durbin-Watson test result.

By means of Table C in the Appendix, and according to Fig. 13.45 in Section 13.3.4.3, we have, at the 5% significance level and for a model with three parameters and 30 observations, that dU = 1.567 < 1.779 < 2.433 = 4 − dU, or rather, a DW statistic approximately equal to 2 indicates the absence of first-order residual autocorrelation.

According to what was discussed in Section 13.3.4.4, since the Durbin-Watson test is only valid for verifying the existence of first-order autocorrelation of the error terms, the Breusch-Godfrey test turns out to be more general, in the sense that it is also adequate to evaluate residual autocorrelations with higher lags. In a daily dataset, for example, it might be interesting for the researcher to study eventual order 7 autocorrelations, so as to capture phenomena such as weekly seasonality. Following the same logic, for monthly data, it may be interesting to evaluate the existence of eventual order 12 autocorrelations in order to capture annual seasonality.

For teaching purposes, we will elaborate the Breusch-Godfrey test for our example with all the possible lags, or rather, with orders that vary from 1 to 28 (t − 1, t − 2, t − 3, …, t − 28). The command to be typed is:

estat bgodfrey, lags(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28)

The results are found in Fig. 13.80.

Fig. 13.80
Fig. 13.80 Breusch-Godfrey test results.

By means of Fig. 13.80, we can see that there are no residual autocorrelation problems at any of the proposed lags.
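As an aside, Stata's numlist syntax should also accept a more compact specification of the same lags, and individual seasonal orders can be tested directly; a sketch:

* equivalent, more compact specification of lags 1 through 28
estat bgodfrey, lags(1/28)
* e.g., testing only weekly seasonality in daily data
estat bgodfrey, lags(7)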

Stata's capability to estimate models and prepare statistical tests is enormous; however, we believe that what has been presented here covers what is essential for researchers who desire to correctly use simple and multiple regression techniques.

We now move on to solving the same examples by using SPSS, highlighting that, even though its processing capacity and output generation are considered by many to be more limited than Stata's, it is at times considered friendlier software, easier to use.

13.6 Estimation of Regression Models in SPSS

We will now present the step-by-step process for estimating our model by means of the IBM SPSS Statistics Software. The reproduction of its images in this section has been authorized by the International Business Machines Corporation©.

Following the same logic applied to the models estimated by means of Stata, we begin with the final dataset built by the professor from the questionnaires given to each of the 10 students. The data are found in the file Timedistsemperstyle.sav and, after opening it, we begin by clicking on Analyze → Regression → Linear …. The dialog box in Fig. 13.81 will open.

Fig. 13.81
Fig. 13.81 Dialog box for estimation of linear regression in SPSS.

We should select the time variable and include it in the Dependent box. The remaining variables should be simultaneously selected and inserted into the Independent(s) box. Let's keep, at this time, the option Method: Enter, as can be seen in Fig. 13.82. The Enter procedure, contrary to the Stepwise procedure, includes all variables in the estimation, even those whose parameters are statistically equal to zero, and corresponds exactly to the standard procedure performed by Excel, as well as by Stata when applying the reg command.

Fig. 13.82
Fig. 13.82 Dialog box for estimation of the linear regression in SPSS with inclusion of the dependent variable and the explanatory variables and selection of the enter procedure.

The Statistics … button allows us to select the options that will provide the parameters and respective confidence intervals in the outputs. The dialog box that opens when clicking on this button is presented in Fig. 13.83, where the Estimates option was selected (so that the parameters themselves, with the respective t statistics, are presented) and the Confidence intervals option (so that the confidence intervals for these parameters are calculated).

Fig. 13.83
Fig. 13.83 Dialog box for parameter and confidence intervals selection.

We can return to the main linear regression dialog box by clicking Continue.

The Options … button allows us to alter the significance levels used to reject the null hypothesis of the F-test and, consequently, the null hypotheses of the t-tests. The SPSS default, as can be seen in the dialog box that opens when we click on this button, is a 5% significance level. In this same dialog box, we can impose that the intercept a be equal to zero (by disabling the option Include constant in equation). Let's maintain the 5% default significance level and leave the intercept in the model (option Include constant in equation selected). This dialog box is presented in Fig. 13.84.

Fig. 13.84
Fig. 13.84 Dialog box for eventual alteration of the confidence levels and intercept exclusion in linear regression models.

Now we can select Continue and OK. The generated outputs are presented in Fig. 13.85.

Fig. 13.85
Fig. 13.85 Multiple linear regression outputs in SPSS—Enter procedure.

We will not analyze the generated outputs again, since they are exactly equal to those obtained when estimating the multiple linear regression in Excel (Fig. 13.32) and Stata (Fig. 13.53). It is worth mentioning that what Excel calls Significance F is called Sig. F in SPSS, and the P-value is called Sig. t.

We can now, finally, estimate the multiple linear regression by means of the Stepwise procedure. To elaborate this procedure, we should select the Method: Stepwise option in the main linear regression dialog box in SPSS, according to what is shown in Fig. 13.86.

Fig. 13.86
Fig. 13.86 Dialog box with stepwise procedure selection.

We again return to the main linear regression dialog box by clicking on Continue.

The Save … button allows the variables referring to Ŷ and to the residuals of the final model generated by the Stepwise procedure to be created in the original dataset. Thus, in clicking on this button, a dialog box will be opened, as shown in Fig. 13.87. To this end, we should select the Unstandardized (in Predicted Values) and Unstandardized (in Residuals) options.

Fig. 13.87
Fig. 13.87 Dialog box for insertion of predicted values (Ŷ) and the residuals in the dataset.

By clicking on Continue and then OK, new outputs are generated, as Fig. 13.88 shows. Notice that two new variables, called PRE_1 and RES_1, are created, corresponding to the Ŷ values and to the estimated residual values (exactly those shown in Fig. 13.33), respectively.

Fig. 13.88
Fig. 13.88 Multiple linear regression outputs in SPSS—stepwise procedure.

The Stepwise procedure performed by SPSS shows the step-by-step process for the models that were prepared, beginning with the inclusion of the most significant variable (greatest t statistic among all explanatory variables) and proceeding to the inclusion of that with the smallest t statistic that still presents Sig. t < 0.05. Just as important as the analysis of the variables included in the final model is the analysis of the list of excluded variables (Excluded Variables). With this, we can verify that, after including only the sem explanatory variable, the list of excluded variables contains all of the remaining ones. If, in a given step, there is any excluded explanatory variable that presents itself in a significant way (Sig. t < 0.05), as occurs for the dist variable, the next step will include it in the model (model 2). This occurs successively until the list of excluded variables no longer presents a variable with Sig. t < 0.05. The variables remaining on this list for our example are per and style2, as we discussed when preparing the regression in Excel and Stata. The final model (model 3 in the Stepwise procedure), which is exactly what was presented by means of Figs. 13.33 and 13.54, contains only the explanatory variables dist, sem, and style3, with R2 = 0.995. As such, as we have seen, the final estimated linear model is:

$\widehat{time}_i = 8.292 + 0.710 \cdot dist_i + 7.837 \cdot sem_i + 8.968 \cdot style3_i$

where style3 = 0 for a calm driving style and style3 = 1 for an aggressive one.
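For comparison, a forward stepwise selection analogous to SPSS's procedure can be sketched in Stata, assuming a .dta file with the same variables (per, style2, and style3 being the candidate variables mentioned above):

* forward selection with an entry p-value of 0.05 (a sketch)
stepwise, pe(0.05): regress time dist sem per style2 style3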

We now begin the verification of the model presuppositions. First, we will perform the Shapiro-Wilk test to verify the normality of the residuals. To do this, we should click on Analyze → Descriptive Statistics → Explore …. In the dialog box that opens, we should insert the variable RES_1 (unstandardized residual) in the Dependent List and click on Plots …. In this window, we should select Normality plots with tests, click on Continue, and then OK. Fig. 13.89 shows this step by step.

Fig. 13.89
Fig. 13.89 Procedure to elaborate the Shapiro-Wilk test for the RES_1 variable.

The Shapiro-Wilk test indicates that the error terms present a distribution in keeping with normality, since its result (Fig. 13.90) does not indicate the rejection of the null hypothesis. The result is exactly equal to that obtained in Stata and presented by means of Fig. 13.58.

Fig. 13.90
Fig. 13.90 Shapiro-Wilk normality test result for residuals.

Next, we will prepare the multicollinearity diagnostic for the explanatory variables. To do this, we should request that the software generate the VIF and Tolerance statistics when the model estimation is done. Therefore, in Analyze → Regression → Linear …, in the Statistics … button, we should select the option Collinearity diagnostics, as is shown in Fig. 13.91.

Fig. 13.91
Fig. 13.91 Dialog box for elaboration of multicollinearity diagnostic.

The outputs generated are the same as those presented in Fig. 13.88; however, the VIF and Tolerance statistics are now calculated for each explanatory variable, as shown for model 3 in Fig. 13.92. As we discussed when presenting Fig. 13.60, since the final model obtained after the Stepwise procedure does not present high VIF statistics for any explanatory variable, we can consider that there are no multicollinearity problems.

Fig. 13.92
Fig. 13.92 VIF and Tolerance statistics for the explanatory variables.

In relation to the heteroskedasticity problem, the most common approach is to first prepare a graph to evaluate the behavior of the residuals in function of the dependent variable. To do this, we should again click on Analyze → Regression → Linear …. The Plots … button allows diagnostic graphs of residual behavior in function of the estimated values of the dependent variable to be produced; by clicking on this button, a dialog box will be opened, as shown in Fig. 13.93. Let's request a graph of the standardized residuals in function of the standardized predicted values of the dependent variable. This procedure is analogous to the one that generated the graph in Fig. 13.61B.

Fig. 13.93
Fig. 13.93 Dialog box to elaborate a diagnostic graph of residuals behavior in function of the dependent variable.

The generated graph, presented in Fig. 13.94, shows that there are no indications of the existence of heteroskedasticity, as we discussed when analyzing Fig. 13.61B.

Fig. 13.94
Fig. 13.94 Diagnostic graph of residuals behavior in function of the dependent variable.

Even though SPSS does not offer a direct option to perform the Breusch-Pagan/Cook-Weisberg test, we can build a procedure to carry it out in SPSS. To do this, we first have to create a new variable, which we will call RES_1SQ, referring to the square of the residuals. Therefore, in Transform → Compute Variable …, we should proceed as shown in Fig. 13.95. In SPSS, the double asterisk corresponds to the exponentiation operator.

Fig. 13.95
Fig. 13.95 Creation of the variable referent to the square of the residuals (RES_1SQ).

This being done, we can calculate the residual sum of squares by clicking on Analyze → Descriptive Statistics → Descriptives … and selecting the Sum option on the Options … button, as shown in Fig. 13.96.

Fig. 13.96
Fig. 13.96 Calculation of the residual sum of squares.

The sum of the terms of variable RES_1SQ is 9.16137, which is in agreement with that presented in Table 13.17. We can now create a new variable called RESUP, where:

$RESUP_i = \dfrac{RES\_1SQ_i}{\left(\sum_{i=1}^{n} RES\_1SQ_i\right)/n} = \dfrac{RES\_1SQ_i}{9.16137/10}$

according to Expression (13.40). Next, in Transform → Compute Variable … we should proceed according to what is presented in Fig. 13.97.

Fig. 13.97
Fig. 13.97 Creation of RESUP variable.

Next, we prepare the RESUP regression in function of the dependent variable estimated values, or rather, in function of the PRE_1 variable. We will not show all of the outputs for this estimation; however, Fig. 13.98 presents the resulting ANOVA table.

Fig. 13.98
Fig. 13.98 ANOVA table of the regression of RESUP in function of PRE_1.

By means of the ANOVA table, we see that the sum of squares due to regression (SSR) is 3.185 which, divided by 2, gives the statistic χ²BP/CW = 1.59 < χ²1 d.f. = 3.84 at the 5% significance level; or rather, the null hypothesis of this test (homoskedastic error terms) cannot be rejected, as was also verified by means of Fig. 13.62.
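For readers replicating this check in Stata, the same statistic can be obtained directly, since estat hettest uses the fitted values by default; a sketch, assuming the corresponding .dta file with the same final model:

* Breusch-Pagan/Cook-Weisberg test after estimating the final model
quietly reg time dist sem style3
estat hettest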

According to the logic presented in Section 13.5, we will now open the file Lecturetimedistsem.sav and estimate the following nonlinear regression:

$time_i = a + b_1 \cdot lndist_i + b_2 \cdot sem_i + u_i$

To do this, we must create the variable lndist (Fig. 13.99), by clicking on Transform → Compute Variable ….

Fig. 13.99
Fig. 13.99 Creation of lndist variable.

From this point forward, we can estimate the proposed nonlinear model. The outputs will not be presented here; however, they are the same as in Fig. 13.71.

Unlike Stata, SPSS does not offer a direct option to perform Box-Cox transformations, which is why we did not estimate the model whose results are presented in Fig. 13.75. If a researcher wishes to perform that estimation, a new transformed dependent variable should be created manually in Transform → Compute Variable …. However, since the Box-Cox transformation parameter that maximizes adherence to normality is not known a priori, we strongly recommend at least obtaining the λ parameter by using Stata, with the procedure prepared so as to arrive at the results in Fig. 13.73.

Last, but not least, we will present the procedure to verify the existence of residual autocorrelation in SPSS. Since this software does not provide a direct procedure for performing the Breusch-Godfrey test, we will restrict ourselves to the application of the Durbin-Watson test. To do this, we open the file Analysistemporaltimedistsem.sav.

When performing the actual regression, in Analyze → Regression → Linear …, the Statistics … button offers the option to perform the Durbin-Watson test. We will choose this option, as shown in Fig. 13.100. Notice that no mention is made of whether the dataset presents a variable corresponding to temporal evolution, which means that modeling a dataset in cross-section will also allow the preparation of said test; as we have discussed, this is a serious mistake.

Fig. 13.100
Fig. 13.100 Dialog box for elaborating the Durbin-Watson test.

The test result is in Fig. 13.101, and is exactly the same as what was presented in Fig. 13.79.

Fig. 13.101
Fig. 13.101 Durbin-Watson test result.

As has already been discussed, DW = 1.779 indicates the inexistence of first-order autocorrelation of the error terms, at the 5% significance level and for a model with three parameters and 30 observations.

13.7 Final Remarks

The simple and multiple regression models estimated by the OLS method represent the group of regression techniques most used in academic and organizational environments, given the ease of application and interpretation of the obtained results, besides the fact that they are available in most software packages, even those without a specific focus on statistical data analysis. It is also important to highlight the practicality of the techniques studied in this chapter for the purpose of preparing diagnoses and forecasts.

It is of fundamental importance that researchers always evaluate and discuss the technique's presuppositions and, more than that, always reflect on the possibility that the models are not necessarily linear in functional form.

We finally express that the researcher need not restrict the analysis of the behavior of a certain phenomenon based only and exclusively on the underlying theory. The application of regression models requires, at times, that variables based on the experience and intuition of the researcher be included, so as to generate ever more interesting models, different from those traditionally proposed. In this way, new points of view and perspectives for the study of phenomena can always arise, which contribute to scientific development and the rise of ever more innovative empirical studies.

13.8 Exercises

(1) The following table gives GDP growth and investment in education data for a certain nation over a period of 15 years:

Year    Rate of GDP Growth (%)    Investment in Education (US$ Billions)
1998    −1.50     7.00
1999    −0.90     9.00
2000     1.30    15.00
2001     0.80    12.00
2002     0.30    10.00
2003     2.00    15.00
2004     4.00    20.00
2005     3.70    17.00
2006     0.20     8.00
2007    −2.00     5.00
2008     1.00    13.00
2009     1.10    13.00
2010     4.00    19.00
2011     2.70    19.00
2012     2.50    17.00

Questions:

(a) What equation evaluates the behavior of the GDP rate of growth (Y) in regard to investment in education (X)?
(b) What percentage of the variance of the GDP rate of growth is explained by investment in education (coefficient of determination R2)?
(c) Is the variable referring to investment in education statistically significant, at the 5% significance level, to explain the behavior of the GDP rate of growth?
(d) What is the investment in education that, on average, results in an expected GDP growth rate equal to zero?
(e) What would be the expected GDP growth rate if the government of this nation decided not to invest in education in a certain year?
(f) If the investment in education in a certain year were US$11 billion, what would be the expected GDP growth rate? What would be the minimum and maximum forecast values for the GDP growth rate, at the 95% confidence level?
(2) The files Corruption.sav and Corruption.dta give data on 52 countries for a certain year, namely:

Variable    Description
country    String variable that identifies country i
cpi    Corruption Perception Index, corresponding to citizen perception regarding public sector abuse of a nation's private benefits, covering administrators and politicians (Source: Transparency International)
age    Average age of the billionaires of a country (Source: Forbes)
hours    Average number of hours worked per week in a country, namely, the annual total of hours worked divided by 52 weeks (Source: International Labour Organization)

You wish to investigate whether the perception of corruption in a country is a function of the average age of its billionaires and the average quantity of hours worked weekly and, therefore, will estimate the following model:

$cpi_i = a + b_1 \cdot age_i + b_2 \cdot hours_i + u_i$

Requested:

(a) Analyze the significance level of the F-test. Is at least one of the variables (age and hours) statistically significant to explain the behavior of the cpi variable, at the 5% significance level?
(b) If the answer to the previous item is yes, analyze the significance level of each explanatory variable (t-tests). Are both statistically significant to explain the behavior of cpi, at the 5% significance level?
(c) What is the final estimated equation for the multiple linear regression model?
(d) What is the R2?
(e) Discuss the results in terms of the signs of the coefficients of the explanatory variables.
(f) Save the model residuals and verify the existence of normality in these error terms.
(g) By means of the Breusch-Pagan/Cook-Weisberg test, check if there is evidence of the existence of heteroskedasticity in the final proposed model.
(h) Present the VIF and Tolerance statistics and discuss the results.
(3) The Corruptionemer.sav and Corruptionemer.dta files give the same data as the previous exercise, however with one more variable, namely:

Variable    Description
emerging    Dummy variable corresponding to the fact that the country is considered either developed or emerging, according to the criteria of Compustat Global. If the country is developed, emerging = 0; otherwise, emerging = 1

Initially, it should be investigated whether, in fact, the countries considered emerging present lower cpi levels. This being so, it is requested:

(a) What is the difference between the average cpi index for emerging countries and that for developed countries? Is this difference statistically significant, at the 5% significance level?
(b) Prepare, by means of the Stepwise procedure with a 10% significance level for rejection of the null hypothesis of the t-tests, the estimation of the model with the linear functional form that follows. Write the equation of the final estimated model.

$cpi_i = a + b_1 \cdot age_i + b_2 \cdot hours_i + b_3 \cdot emerging_i + u_i$

(c) Based on this estimation, what would be the forecast, on average, of the cpi index for a country considered emerging, with an average billionaire age of 51 and an average of 37 h worked weekly?
(d) What are the minimum and maximum values of the confidence interval for the forecast of the previous item, at the 90% confidence level?
(e) Imagine that a researcher proposes, for the problem under consideration, that the following nonlinear functional form be estimated. Write the equation of the final model estimated by the Stepwise procedure with a 10% significance level for rejection of the null hypothesis of the t-tests.

$cpi_i = a + b_1 \cdot age_i + b_2 \cdot \ln(hours_i) + b_3 \cdot emerging_i + u_i$

(f) Given that no problems were identified in the regression models in either case, which functional form would be chosen for forecasting purposes?

(4) A cardiologist has monitored, over the last 48 months, the LDL cholesterol level (mg/dL), the body mass index (kg/m2), and the frequency of physical activity of a well-known executive. The desire is to give orientation regarding the importance of maintaining or losing weight and performing regular physical activity. The evolution of the LDL cholesterol index for this executive, over the period under analysis, is found in the following graph:
[Graph: evolution of the LDL cholesterol index over the 48 months under analysis]

The data can be found in the Cholesterol.sav and Cholesterol.dta files and are composed of the following variables:

Variable    Description
month    t-th month of the analysis
cholesterol    LDL cholesterol index (mg/dL)
bmi    Body mass index (kg/m2)
sport    Number of times physical activity is performed per week (monthly average)

The desire is to investigate if the behavior, over time, of the LDL cholesterol index is influenced by the executive’s body mass index and the number of times physical activity is performed per week. To do this, the following model will be estimated:

$cholesterol_t = a + b_1 \cdot bmi_t + b_2 \cdot sport_t + \varepsilon_t$

To this effect, it is requested:

(a) What is the final estimated model for the multiple linear regression?
(b) Discuss the results in terms of the signs of the coefficients of the explanatory variables.
(c) Even though the final estimated model does not present problems in relation to the normality of the residuals, heteroskedasticity, or multicollinearity, the same cannot be said in relation to residual autocorrelation. Perform the Durbin-Watson test; present and discuss the result.
(d) Perform the Breusch-Godfrey test (not available in SPSS) with lags of orders 1, 3, 4, and 12 and discuss the results.

Appendix: Quantile Regression Models

A.1 A Brief Introduction

Quantile regression models, in general, and median regression models, in particular, have as their main goal to estimate the percentiles of the dependent variable, conditional to the values of the explanatory variables. While the median regression expresses the median (50th percentile) of the conditional distribution of the dependent variable as a linear function of the explanatory variables, the other quantile regressions estimate the parameters of a model based on any other percentile of this conditional distribution (25th or 75th, for example). If, for instance, the researcher specifies a 25th quantile regression model, the parameters estimated will describe the behavior of the 25th percentile of the conditional distribution of the dependent variable.

These models allow us to characterize the entire conditional distribution of the dependent variable, based on certain explanatory variables, since different parameter estimates are obtained for different percentiles, which can be interpreted as differences in the behavior of the dependent variable, due to changes in the explanatory variables, at the most diverse points of its conditional distribution. This fact represents an important advantage of these models over the mean regression models estimated by the ordinary least squares (OLS) method studied throughout this chapter.

The estimation of quantile regression models is similar to the estimation by OLS; however, while the latter minimizes the sum of the squares of the residuals, the former minimizes the weighted sum of the absolute residuals.

Since the median, which is a measure of central tendency, is not affected by the presence of outliers, unlike the mean, many researchers use median regression models when there are extreme or discrepant observations in their samples, since the parameters estimated are not sensitive to the existence of disturbances in the data. However, it is important to emphasize that, as discussed by Rousseeuw and Leroy (1987), even the estimators of quantile regression models can be sensitive to the existence of outliers if the leverage distances of these observations are considerably high.

This technique was initially proposed by Koenker and Bassett (1978) aiming at estimating the parameters of the following regression model:

$Y_i = a + b_{\theta 1} \cdot X_{1i} + b_{\theta 2} \cdot X_{2i} + \cdots + b_{\theta k} \cdot X_{ki} + u_{\theta i} = X_i' b_\theta + u_{\theta i}$  (13.67)

such that:

$Perc_\theta(Y_i \mid X_i') = X_i' b_\theta$  (13.68)

where Percθ(Yi | Xi) represents the percentile θ (0 < θ < 1) of the dependent variable Y, conditional to the vector of explanatory variables X′. The estimation of the parameters in Expression (13.67) can be obtained by solving a linear programming problem, whose objective function is given by the following expression:

$\min_{b_\theta} \left[ \sum_{i:\, Y_i \ge X_i' b_\theta} \theta \, \lvert Y_i - X_i' b_\theta \rvert + \sum_{i:\, Y_i < X_i' b_\theta} (1 - \theta) \, \lvert Y_i - X_i' b_\theta \rvert \right]$  (13.69)
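Equivalently, Expression (13.69) can be written more compactly by means of the so-called check function ρθ(⋅), a standard reformulation following Koenker and Bassett (1978):

$\min_{b_\theta} \sum_{i=1}^{n} \rho_\theta\left(Y_i - X_i' b_\theta\right), \quad \text{where} \quad \rho_\theta(u) = u\left(\theta - \mathbb{1}\{u < 0\}\right)$

and $\mathbb{1}\{\cdot\}$ denotes the indicator function.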

The estimation of quantile regression models does not assume the existence of normality of the residuals. This makes it possible to use them as an alternative to the models estimated by the OLS method in cases in which not even the Box-Cox transformation of the dependent variable ensures residuals with a distribution that adheres to normality. Situations like this one may occur, among other reasons, when the dependent variable presents considerable skewness in its distribution.

Thus, these models are part of the group of estimations that can be used in studies that have dependent variables with asymmetrical distributions and in which we wish to investigate the different behaviors of the explanatory variables for different percentiles of the distribution.

In short, and following Buchinsky (1998), quantile regression models have the following characteristics and advantages:

  •  they allow the effects of each explanatory variable over the behavior of the dependent variable to vary between the percentiles;
  •  the objective function of the quantile regression represents the minimization of the weighted sum of the absolute residuals, which makes the estimated parameters not sensitive to extreme or discrepant observations;
  •  they offer more efficient estimations of the parameters than those obtained by the OLS method, when the error terms do not follow a normal distribution;
  •  they can be used when the dependent variable presents an asymmetrical distribution.

Since, for example, income distributions are inherently asymmetrical across different populations, and variations throughout the percentiles occur, quantile regression models can be extremely useful in the study of the behavior of income conditional on certain explanatory variables. In these cases, traditional mean regression models may not be satisfactory, because they may eventually lead researchers to incomplete conclusions.

Next, we are going to discuss an example in which a quantile regression model is estimated, having as its dependent variable the median household income of certain individuals.

A.2 Example: Quantile Regression Model in Stata

We are going to use the dataset QuantileIncome.dta, given the existence of multivariate outliers in the sample, which can be identified by applying the bacon algorithm discussed in the Appendix of Chapter 11. This dataset has data regarding the median household income ($) and the time since graduation (years) of 400 professionals who concluded Economics at a certain university. Therefore, now, we are going to estimate the parameters of the following model:

$\widehat{income}_i = \alpha + \beta_1 \cdot tgrad_i$

Initially, let’s analyze the histogram of the dependent variable income, by typing the following command:

hist income, freq

The chart generated can be seen in Fig. 13.102.

Fig. 13.102
Fig. 13.102 Histogram of the dependent variable.

From this histogram, we can see the existence of skewness, which represents the first favorable indication for the estimation of a quantile regression model.

After that, we can type the following command, which will generate the chart in Fig. 13.103.

qplot income

Fig. 13.103
Fig. 13.103 Chart of percentiles of the dependent variable.
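It is worth noting that qplot is a community-contributed command, published in the Stata Journal; if it is not yet installed, it can be located with:

* locate the qplot package for download
search qplot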

This chart displays the values of each percentile of the dependent variable income. Through the command sum income, detail, whose outputs are not presented here, we can see that the values of the quartiles of variable income are equal to $6250.00 (25th percentile), $7500.00 (median), and $8670.00 (75th percentile).

Even though the outputs are also not presented here, it is important to mention that the error terms generated from the estimation of an OLS regression model do not show adherence to normality. Adherence is not achieved either when we estimate the same model using the Box-Cox transformation of the dependent variable, which once again favors the estimation of a quantile regression model for the data in our example. A more inquisitive researcher may prove these facts based on the concepts studied throughout this chapter.

First, we are going to estimate the parameters of a quantile regression model with the 50th percentile (median regression), by typing the following command:

qreg income tgrad, quantile(0.50)

where the command qreg estimates a quantile regression model, and the term quantile(0.50) refers to a median regression model, which could have been omitted in this case because it is the default of command qreg in Stata. The outputs generated can be seen in Fig. 13.104.

Fig. 13.104
Fig. 13.104 Median regression model outputs in Stata.

It is important to mention that an even more inquisitive researcher may obtain these same outputs through the file QuantileIncome.xls, by using the Solver in Excel. Even though it is not presented here, the researcher will also have the option of determining the percentile desired to estimate the parameters of any quantile regression model in this file.

We can see (Fig. 13.104) that all the parameters estimated are statistically different from zero, with a confidence level of 95%, and the model obtained can be written as follows:

$\widehat{income}(\text{median})_i = 5243.333 + 273.333 \cdot tgrad_i$

In this regard, the expected median of the household income of a certain economist who graduated 7 years ago can be obtained as follows:

$\widehat{income}(\text{median})_i = 5243.333 + 273.333 \cdot (7) = \$7156.667$

Therefore, the parameters of a quantile regression model can be interpreted through the partial derivative of the conditional percentile based on a certain explanatory variable.
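To make this interpretation concrete, a worked illustration based on the outputs above: differentiating the estimated conditional median with respect to tgrad gives

$\dfrac{\partial \, \widehat{income}(\text{median})_i}{\partial \, tgrad_i} = 273.333$

or rather, each additional year since graduation is associated with an expected increase of approximately $273.33 in the median household income.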

The outputs also show that the sum of the absolute differences between the real values of the median household income and the value of its unconditional median ($7500.00) is equal to 491,360. In other words, we have:

$\sum_{i=1}^{400} \lvert income_i - 7500.00 \rvert = 491{,}360$

On the other hand, the weighted sum of the absolute residuals for the general expression obtained (conditional distribution of the variable income as a linear function of the variable tgrad) is equal to 464,040, as we can also see in the same file in Excel.

Hence, the pseudo R2 presented in the outputs can be calculated as follows:

$\text{pseudo } R^2 = 1 - \dfrac{464{,}040}{491{,}360} = 0.0556$

whose usefulness is extremely limited and restricted to the cases in which researchers are interested in comparing two or more different models.

If researchers also wish to estimate the parameters of the quantile regression models, for example, with the 25th and 75th percentiles, in order to compare them to the ones obtained by the median regression model and to the ones obtained by an OLS estimation, they may type the following sequence of commands:

⁎ ORDINARY LEAST SQUARES REGRESSION
quietly reg income tgrad
estimates store OLS
⁎ QUANTILE REGRESSION (PERCENTILE 25%)
quietly qreg income tgrad, quantile(0.25)
estimates store QREG25
⁎ QUANTILE REGRESSION (MEDIAN - PERCENTILE 50%)
quietly qreg income tgrad, quantile(0.50)
estimates store QREG50
⁎ QUANTILE REGRESSION (PERCENTILE 75%)
quietly qreg income tgrad, quantile(0.75)
estimates store QREG75
estimates table OLS QREG25 QREG50 QREG75, se

Fig. 13.105 presents the parameters estimated in each model.

Fig. 13.105
Fig. 13.105 Parameters estimated in each model and their respective standard errors.

From the outputs consolidated in Fig. 13.105, it is possible to see that there are discrepancies between the parameters estimated by OLS and those obtained by the quantile regressions. We can also verify that the standard errors of the parameters (values located below the respective parameters) are lowest for the quantile regression at the 25th percentile, which reflects greater precision of the estimation around this percentile of the conditional distribution of the dependent variable.

The following sequence of commands also allows us to see, through charts, the differences between the estimators obtained by the quantile regressions and those obtained by OLS:

quietly qreg income tgrad
grqreg, cons ci ols olsci
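Note that grqreg is also a community-contributed command; if it is not yet installed, it can be obtained from the SSC archive (assuming an internet connection):

* install the grqreg package from SSC
ssc install grqreg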

The charts generated, which can be found in Fig. 13.106, show the estimated parameters α and β across all percentiles, not restricted only to the 25th, 50th, and 75th, with respective 95% confidence intervals (term ci). Besides, while the term cons allows the chart of the intercept to be constructed, the terms ols and olsci include the parameters estimated by OLS and the respective confidence intervals in the charts, also at 95%.

Fig. 13.106
Fig. 13.106 Parameters estimated for quantile regression models and by OLS, with respective confidence intervals.

Through these charts, we verify that the parameters estimated by OLS and the respective confidence intervals do not vary with the percentiles, unlike those estimated for the quantile regression models. And, as we have already discussed, this fact represents one of the main advantages of these models over mean regression models, since it allows the entire conditional distribution of the dependent variable, based on a certain explanatory variable, to be characterized, providing a broader view of the relationship between them, and not restricting the analysis only to the conditional mean.

For the data in our example, we can even see that parameter β, which corresponds to the variable tgrad, stops being statistically different from zero, with a 95% confidence level, for higher percentiles, since its confidence interval starts including zero. To verify this fact, the researcher just needs to type, for instance, the command qreg income tgrad, quantile(0.80) and analyze the t statistic of the abovementioned parameter.

It is important to mention that, in other cases, changes in the sign of a certain parameter β may even occur as the percentiles vary. This allows researchers to carry out a more complete analysis of the differences in the behavior of the dependent variable due to changes in each explanatory variable, in the most diverse points of the conditional distribution of the first one.

For pedagogical purposes, let’s construct a chart that shows the linear adjustments between the predicted values for the dependent variable generated by OLS and quantile regression models with 25th, 50th, and 75th percentiles and the explanatory variable. The main objective is to compare these linear adjustments. In order to do that, we can type the following sequence of commands:

⁎ ORDINARY LEAST SQUARES REGRESSION
quietly reg income tgrad
predict yols
⁎ QUANTILE REGRESSION (PERCENTILE 25%)
quietly qreg income tgrad, quantile(0.25)
predict yqreg25
⁎ QUANTILE REGRESSION (MEDIAN - PERCENTILE 50%)
quietly qreg income tgrad, quantile(0.50)
predict yqreg50
⁎ QUANTILE REGRESSION (PERCENTILE 75%)
quietly qreg income tgrad, quantile(0.75)
predict yqreg75
graph twoway scatter income tgrad || lfit yols tgrad || lfit yqreg25 tgrad || lfit yqreg50 tgrad || lfit yqreg75 tgrad ||, legend(label(2 "OLS") label(3 "Percentile 25") label(4 "Percentile 50") label(5 "Percentile 75"))

The chart generated can be seen in Fig. 13.107.

Fig. 13.107
Fig. 13.107 Behavior of the dependent variable based on the explanatory variable tgrad, highlighting OLS and quantile estimations.

This chart shows the median household income adjusted by its mean and by the 25th, 50th, and 75th percentiles, based on the time elapsed since the individual graduated. Although, in this example, the increase of the median household income, in all the percentiles, as the time since graduation increases is evident, we can see differences between the adjustment to the mean (OLS) and the adjustment to the median (50th percentile). This happens due to the existence of outliers and the influence that they have over the estimation of the parameters by the OLS method. Therefore, researchers must always be aware of the sensitivity of parameters to the existence of extreme or discrepant observations in the dataset, which may make a certain estimation method preferable.

In short, and as we discussed previously, quantile regression models are more suitable for studying the relationship between the variables presented in this example, since they make it possible to analyze, for the several percentiles, the effects of the variable tgrad over the behavior of the variable income, allow the estimation of parameters that are not sensitive to the existence of outliers or to the asymmetry of the distribution of the dependent variable, and allow the determination of a model without the need for the residuals to have a normal distribution.

References

Belfiore P., Fávero L.P. Pesquisa operacional: para cursos de administração, contabilidade e economia. Rio de Janeiro: Campus Elsevier; 2012.

Box G.E.P., Cox D.R. An analysis of transformations. J. Roy. Stat. Soc. Ser. B. 1964;26(2):211–252.

Breusch T.S. Testing for autocorrelation in dynamic linear models. Australian Econ. Papers. 1978;17(31):334–355.

Buchinsky M. Recent advances in quantile regression models: a practical guideline for empirical research. J. Hum. Resour. 1998;33(1):88–126.

Fávero L.P. Análise de dados: modelos de regressão com Excel®, Stata® e SPSS®. Rio de Janeiro: Campus Elsevier; 2015.

Fávero L.P. O mercado imobiliário residencial da região metropolitana de São Paulo: uma aplicação de modelos de comercialização hedônica de regressão e correlação canônica. PhD Thesis. São Paulo: Faculdade de Economia, Administração e Contabilidade, Universidade de São Paulo; 2005. 319 f.

Fávero L.P., Belfiore P. Manual de análise de dados: estatística e modelagem multivariada com Excel®, SPSS® e Stata®. Rio de Janeiro: Elsevier; 2017.

Fávero L.P., Belfiore P., Silva F.L., Chan B.L. Análise de dados: modelagem multivariada para tomada de decisões. Rio de Janeiro: Campus Elsevier; 2009.

Fouto N.M.M.D. Determinação de uma função de preços hedônicos para computadores pessoais no Brasil. Masters Dissertation. São Paulo: Faculdade de Economia, Administração e Contabilidade, Universidade de São Paulo; 2004. 150 f.

Godfrey L.G. Testing against general autoregressive and moving average error models when the regressors include lagged dependent variables. Econometrica. 1978;46(6):1293–1301.

Greene W.H. Econometric Analysis. seventh ed. Harlow: Pearson; 2012.

Gujarati D.N. Econometria básica. fifth ed. Porto Alegre: Bookman; 2011.

Huber P.J. The behavior of maximum likelihood estimates under nonstandard conditions. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. 1967;vol. 1:221–233.

Kennedy P. A Guide to Econometrics. sixth ed. Cambridge: MIT Press; 2008.

Koenker R., Bassett G. Regression quantiles. Econometrica. 1978;46(1):33–50.

Maroco J. Análise estatística com o SPSS Statistics. sixth ed. Lisboa: Edições Sílabo; 2014.

Rousseeuw P.J., Leroy A.M. Robust Regression and Outlier Detection. New York: John Wiley & Sons; 1987.

Sharma S. Applied Multivariate Techniques. Hoboken: John Wiley & Sons; 1996.

Stock J.H., Watson M.W. Econometria. São Paulo: Pearson Education; 2004.

Vasconcellos M.A.S., Alves D. Manual de econometria. São Paulo: Atlas; 2000.

White H. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica. 1980;48(4):817–838.

Wooldridge J.M. Introductory Econometrics: A Modern Approach. fifth ed. Mason: Cengage Learning; 2012.


"To view the full reference list for the book, click here"

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset