Chapter 13

Simple and Multiple Regression Models

Abstract

This chapter presents the simple and multiple linear regression models, establishing the circumstances upon which they can be used. The parameters of the simple and multiple regression models are estimated by the least squares method and the model presuppositions are analyzed by means of tests and specific statistics. For the effect of forecast, confidence intervals of the model parameters are prepared. Nonlinear regression models are also specified, as well as the definition of the best functional form and the Box-Cox transformation. Finally, regression models are estimated in Microsoft Office Excel®, Stata Statistical Software®, and IBM SPSS Statistics Software®, and their results are interpreted.

Keywords

Simple and multiple linear regression; Nonlinear regression; Ordinary least squares; Functional form; Box-Cox transformation; Coefficient of determination R2; t-test; F-test; Confidence intervals for forecasts; Excel; Stata; SPSS software

… because politics is for the present, but an equation is something for eternity.

Albert Einstein

13.1 Introduction

Of the techniques studied in this book, without a doubt, those known as simple and multiple linear regression models are the most used in the different fields of knowledge.

Imagine that a group of researchers is interested in studying how the rate of return for a financial asset behaves in relation to the market, or how company expense varies when the factory increases its productive capability or increases the number of work hours, or, yet, how the number of bedrooms and amount of floor space in a residential real estate sample can influence the formation of sales prices.

Notice that, in all of these examples, the main phenomenon of interest is represented, in each case, by a metric or quantitative variable and can, therefore, be studied by means of linear regression models, whose main goal is to analyze how a set of explanatory variables, metric or dummies, relates to a metric dependent variable (the outcome variable that represents the phenomenon under study), provided that some conditions are respected and some presuppositions are met, as we shall see in this chapter.

It is important to emphasize that any and all linear regression models should be defined based on the subjacent theory and the experience of the researcher, such that it is possible to estimate the desired model, analyze the results obtained by means of statistical tests and prepare forecasts.

In this chapter, we will consider the simple and multiple linear regression models, with the following objectives: (1) Introduce the concepts of simple and multiple linear regression, (2) Interpret results obtained and prepare forecasts, (3) Discuss the technique presuppositions and (4) Present the application of the technique in Excel, Stata, and SPSS. Initially, the solution to an example will be prepared in Excel simultaneously to the presentation of the concepts and the manual solution of the example. Only after the introduction of the concepts will the procedures for the preparation of the regression technique be presented in Stata and SPSS.

13.2 Linear Regression Models

First, we will address linear regression models and their presuppositions. An analysis of nonlinear regressions will be covered in Section 13.4.

According to Fávero et al. (2009), the linear regression technique offers, primarily, the ability to study the relation between one or more explanatory variables, which are presented in a linear form, and a quantitative dependent variable. As such, a general linear regression model can be defined as follows:

$Y_i = a + b_1 X_{1i} + b_2 X_{2i} + \ldots + b_k X_{ki} + u_i$  (13.1)

where Y represents the phenomenon under study (quantitative dependent variable), a represents the intercept (constant or linear coefficient), bj (j = 1, 2, …, k) are the coefficients of each variable (angular coefficients), Xj are explanatory variables (metrics or dummies), and u is the error term (difference between the real value of Y and the predicted value of Y by means of the model for each observation). The subscripts i represent each of the observations of the sample under analysis (i = 1, 2, …, n, where n is the size of the sample).

The equation presented by means of Expression (13.1) represents a multiple linear regression model, since it considers the inclusion of various explanatory variables for the study of the phenomenon in question. On the other hand, if only one X variable is inserted, we have before us a simple linear regression model. For didactic reasons, we will introduce the concepts and present the step-by-step process of estimating the parameters by means of a simple regression model. Following, we will amplify the discussion by means of estimation in multiple regression models, including the consideration of dummy variables on the right side of the equation.

It is important to emphasize, therefore, that the estimated simple linear regression model presents the following expression:

$\hat{Y}_i = \alpha + \beta X_i$  (13.2)

where $\hat{Y}_i$ represents the predicted value of the dependent variable, which will be obtained by means of the model estimation for each observation i, and α and β represent the estimated parameters of the intercept and the slope of the proposed model, respectively. Fig. 13.1 presents, graphically, the general configuration of an estimated simple linear regression model.

Fig. 13.1
Fig. 13.1 Estimated simple linear regression model.

We can, therefore, verify that, while the estimated parameter α shows the point on the regression line where X = 0, the estimated parameter β represents the slope of the model, or rather, the average increase (or decrease) in Y for each additional unit of X.

Hence, the inclusion of the error term u in Expression (13.1), also known as the residual, is justified by the fact that any relation that can be proposed will rarely present itself perfectly. In other words, the phenomenon under study, represented by variable Y, will very probably present a relation with some X variable not included in the proposed model and that, therefore, will need to be represented by the error term u. As such, the error term u, for each observation i, can be written as:

$u_i = Y_i - \hat{Y}_i$  (13.3)

According to Kennedy (2008), Fávero et al. (2009), and Wooldridge (2012), error terms occur due to some reasons that need to be known and considered by the researchers, such as:

  •  Existence of aggregated and/or nonrandom variables.
  •  Failures in the specification of the model (nonlinear forms and omission of relevant explanatory variables).
  •  Errors in data gathering.

More consideration regarding error terms will be made in the study of regression model presuppositions, in Section 13.3.

Having discussed the preliminary concepts, we shall now begin the study of linear regression models estimation.

13.2.1 Estimation of the Linear Regression Model by Ordinary Least Squares

We often glimpse, in a rational or intuitive way, relations between variable behaviors, whether direct or indirect. If I swim more often at my club, will I increase my muscle mass? If I change jobs, will I have more time to spend with my children? If I save a greater portion of my wages, will I be able to retire at a younger age? These questions suggest clear relations between a certain dependent variable, which represents the phenomenon we wish to study, and, in each case, a single explanatory variable.

The objective of regression analysis is, therefore, to provide conditions for the researcher to evaluate how a Y variable behaves based on the behavior of one or more X variables, without, necessarily, the occurrence of a cause and effect relationship.

We will introduce the concepts of regression by means of an example that considers only one explanatory variable (simple linear regression). Imagine that, on a certain class day for a group of 10 students, the professor is interested in discovering the influence of the distance traveled to get to school over the travel time. The professor completes a questionnaire with each of the 10 students and prepares a dataset, which can be found in Table 13.1.

Table 13.1

Example: Travel Time × Distance Traveled
Student | Time to Get to School (min) | Distance Traveled to School (km)
Gabriela | 15 | 8
Dalila | 20 | 6
Gustavo | 20 | 15
Leticia | 40 | 20
Luiz Ovidio | 50 | 25
Leonor | 25 | 11
Ana | 10 | 5
Antonio | 55 | 32
Julia | 35 | 28
Mariana | 30 | 20

In fact, the professor wants to know the equation that governs the phenomenon “travel time to school” as a function of “distance traveled by students.” It is known that other variables influence the time of a certain route, such as the route taken, the type of transportation, or the time at which the student left for school that day. However, the professor knows that such variables will not be part of the model, since they were not collected for the formation of the dataset.

The problem can therefore be modeled in the following manner:

$time = f(dist)$

As such, the equation, or simple regression model, will be:

$time_i = a + b \cdot dist_i + u_i$

and, in this way, the expected value (estimate) of the dependent variable, for each i observation, will be given as:

$\hat{time}_i = \alpha + \beta \cdot dist_i$

where α and β are the estimates of parameters a and b, respectively.

This last equation shows that the expected value of the time variable ($\hat{Y}$), also known as the conditional mean, is calculated for each sample observation as a function of the behavior of the dist variable, where the subscript i represents, for our example data, the school students (i = 1, 2, …, 10). Our objective here is, therefore, to study whether the behavior of the dependent variable time presents a relation with the variation of the distance, in kilometers, that each student travels to arrive at school on a certain class day.

In our example, it does not make much sense to discuss time traveled when the distance to school is zero (parameter α). Parameter β, on the other hand, will inform us regarding the increase in time to arrive at school by increasing the distance traveled by one kilometer, on average.

We shall, as such, prepare a graph (Fig. 13.2) that relates the travel time (Y) with the distance traveled (X), where each point represents one of the students.

Fig. 13.2
Fig. 13.2 Travel time × distance traveled for each student.

As previously commented, it is not only the distance traveled that affects the time needed to get to school since it can also be affected by other variables related to traffic, means of transportation, or the individual. As such, the error term u should capture the effect of the remaining variables not included in the model. Now, in order to estimate the equation that best adjusts to this cloud of points, we should establish two fundamental conditions related to the residuals.

  (1) The sum of the residuals should be zero: $\sum_{i=1}^{n} u_i = 0$, where n is the sample size.

With only this first condition, several lines of regression can be found where the sum of the residuals is zero, as is shown in Fig. 13.3.

Fig. 13.3
Fig. 13.3 (A–C) Three examples of lines of regression where the sum of residuals is zero.

Notice that, for the same dataset, several lines can respect the condition that the sum of the residuals is equal to zero. Therefore, it becomes necessary to establish a second condition.

  (2) The residual sum of squares is the least possible: $\sum_{i=1}^{n} u_i^2 = \min$.

With this condition, we choose the model that presents the best possible adjustment to the cloud of points, which gives us the definition of least squares. In other words, α and β should be determined in such a way that the sum of the squares of the residuals is the least possible (ordinary least squares, or OLS, method). As such:

$\sum_{i=1}^{n}(Y_i - \beta X_i - \alpha)^2 = \min$  (13.4)

The minimization is obtained by differentiating Expression (13.4) with respect to α and β and setting the resulting expressions equal to zero. As such:

$\frac{\partial\left[\sum_{i=1}^{n}(Y_i - \beta X_i - \alpha)^2\right]}{\partial \alpha} = -2\sum_{i=1}^{n}(Y_i - \beta X_i - \alpha) = 0$  (13.5)

$\frac{\partial\left[\sum_{i=1}^{n}(Y_i - \beta X_i - \alpha)^2\right]}{\partial \beta} = -2\sum_{i=1}^{n}X_i(Y_i - \beta X_i - \alpha) = 0$  (13.6)

Distributing and dividing Expression (13.5) by 2·n, where n is the sample size, we have that:

$\frac{-2\sum_{i=1}^{n}Y_i}{2n} + \frac{2\beta\sum_{i=1}^{n}X_i}{2n} + \frac{2\sum_{i=1}^{n}\alpha}{2n} = \frac{0}{2n}$  (13.7)

from which comes:

$-\bar{Y} + \beta\bar{X} + \alpha = 0$  (13.8)

and, therefore:

$\alpha = \bar{Y} - \beta\bar{X}$  (13.9)

where $\bar{Y}$ and $\bar{X}$ represent the sample averages of Y and X, respectively.

In substituting this result in Expression (13.6), we have that:

$-2\sum_{i=1}^{n}X_i(Y_i - \beta X_i - \bar{Y} + \beta\bar{X}) = 0$  (13.10)

which, in developing:

$\sum_{i=1}^{n}X_i(Y_i - \bar{Y}) + \beta\sum_{i=1}^{n}X_i(\bar{X} - X_i) = 0$  (13.11)

which therefore generates:

$\beta = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}$  (13.12)

Returning to our example, the professor then prepares a calculation spreadsheet in order to obtain the linear regression model, as shown in Table 13.2.

Table 13.2

Calculation Spreadsheet for the Determination of α and β
Observation (i) | Time (Yi) | Distance (Xi) | Yi − Ȳ | Xi − X̄ | (Xi − X̄)(Yi − Ȳ) | (Xi − X̄)²
1 | 15 | 8 | −15 | −9 | 135 | 81
2 | 20 | 6 | −10 | −11 | 110 | 121
3 | 20 | 15 | −10 | −2 | 20 | 4
4 | 40 | 20 | 10 | 3 | 30 | 9
5 | 50 | 25 | 20 | 8 | 160 | 64
6 | 25 | 11 | −5 | −6 | 30 | 36
7 | 10 | 5 | −20 | −12 | 240 | 144
8 | 55 | 32 | 25 | 15 | 375 | 225
9 | 35 | 28 | 5 | 11 | 55 | 121
10 | 30 | 20 | 0 | 3 | 0 | 9
Sum | 300 | 170 | | | 1155 | 814
Average | 30 | 17 | | | |

By means of the spreadsheet presented in Table 13.2, we can calculate the estimators α and β as follows:

$\beta = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2} = \frac{1155}{814} = 1.4189$

$\alpha = \bar{Y} - \beta\bar{X} = 30 - 1.4189 \times 17 = 5.8784$

And the simple linear regression equation can be written as:

$\hat{time}_i = 5.8784 + 1.4189 \cdot dist_i$
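For readers who prefer to check these calculations outside of Excel, the sketch below reproduces Expressions (13.9) and (13.12) for the data in Table 13.1. It assumes Python with NumPy, which is not used elsewhere in this chapter; it is only an illustration of the closed-form OLS estimators, not part of the chapter's own procedures.

```python
import numpy as np

# Data from Table 13.1: travel time (min) and distance traveled (km)
time = np.array([15, 20, 20, 40, 50, 25, 10, 55, 35, 30], dtype=float)
dist = np.array([8, 6, 15, 20, 25, 11, 5, 32, 28, 20], dtype=float)

# Expression (13.12): beta = sum((Xi - Xbar)(Yi - Ybar)) / sum((Xi - Xbar)^2)
beta = np.sum((dist - dist.mean()) * (time - time.mean())) / np.sum((dist - dist.mean()) ** 2)

# Expression (13.9): alpha = Ybar - beta * Xbar
alpha = time.mean() - beta * dist.mean()

print(round(alpha, 4), round(beta, 4))  # approximately 5.8784 and 1.4189
```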

The estimation of our example model can be done by means of the Solver tool in Excel, respecting the conditions that $\sum_{i=1}^{10} u_i = 0$ and $\sum_{i=1}^{10} u_i^2 = \min$. In this way, we can initially open the file TimeLeastSquares.xls, which contains our example data, besides the columns referring to $\hat{Y}$, to u, and to u² for each observation. Fig. 13.4 presents this file before the preparation of the Solver procedure.

Fig. 13.4
Fig. 13.4 TimeLeastSquares.xls dataset.

According to the logic proposed by Belfiore and Fávero (2012), we now open the Excel Solver tool. The objective function is in cell E13, which is our target cell and which should be minimized (residual sum of squares). Besides this, parameters α and β, whose values are in cells H3 and H5, respectively, are the variable cells. Finally, we should impose that the value of cell D13 equal zero (restriction that the sum of the residuals be equal to zero). The Solver window will be as shown in Fig. 13.5.

Fig. 13.5
Fig. 13.5 Solver—minimization of the residual sum of squares.

By clicking on Solve and then OK, we obtain the best solution to the minimization of the residual sum of squares. Fig. 13.6 presents the results obtained by the model.

Fig. 13.6
Fig. 13.6 Obtaining the parameters of the sum minimization of u2 by Solver.

Therefore, the intercept α is 5.8784 and the angular coefficient β is 1.4189, as we estimated by means of the analytical solution. In an elementary reading, the average time to get to school for students who did not travel any distance, or rather, who were already at school, is 5.8784 min, which does not make much sense from a physical point of view. This type of situation occurs frequently, where values for α are not in keeping with reality. From the mathematical point of view, this is not incorrect. However, the researcher should always analyze the physical or economic sense of the situation under study, as well as the underlying theory used. In analyzing the graph in Fig. 13.2, we notice that there is no student with a distance traveled near zero, and the intercept only reflects an extension, projection, or extrapolation of the regression line up to the Y axis. It is even common for some models to present a negative α in the study of phenomena that cannot take negative values. Therefore, the researcher should always be aware of this fact, since a regression model can be quite useful for elaborating inferences regarding the behavior of a Y variable within the limits of the X variation, or rather, for the elaboration of interpolations. Extrapolations, on the other hand, can produce inconsistencies due to eventual changes in the behavior of the Y variable outside the limits of the X variation in the study sample.

Continuing the analysis, each additional kilometer in the distance between the departure point and the school increases travel time by 1.4189 min, on average. As such, a student who lives 10 km farther from school than another will tend to spend, on average, a little more than 14 min (1.4189 × 10) longer to get to school than the classmate who lives closer. Fig. 13.7 presents the simple linear regression model from our example.

Fig. 13.7
Fig. 13.7 Simple linear regression model between time and distance traveled.

Concomitant to the discussion of each of the concepts and to the solution of the proposed example in analytical form and with Solver, we will also present the systematic solution by means of the Excel Regression tool. In Sections 13.5 and 13.6, we will embark on the final solution by means of Stata and SPSS, respectively. In this way, we will now open the file Timedist.xls, which contains the data from our example, or rather, the fictitious travel time and distance covered by a group of students to the school location.

By clicking on Data → Data Analysis, the dialog box from Fig. 13.8 will appear.

Fig. 13.8
Fig. 13.8 Dialog box for data analysis in Excel.

We now click on Regression and then OK. The dialog box for insertion of data to be considered in regression will now appear (Fig. 13.9).

Fig. 13.9
Fig. 13.9 Dialog box for estimation of linear regression in Excel.

For our example, the time (in minutes) variable is the (Y) dependent and the dist (in kilometers) variable is the (X) explanatory. Therefore, we must insert their data in the respective entry intervals, according to what is shown in Fig. 13.10.

Fig. 13.10
Fig. 13.10 Insertion of data for estimation of linear regression in Excel.

Besides the insertion of data, we will also select the Residuals option, according to what is shown in Fig. 13.10. Following, we click on OK. A new spreadsheet will be generated with the regression outputs. We will analyze each of them according to when the concepts are introduced, as well as perform the calculations manually.

According to what we can observe by means of Fig. 13.11, four groups of outputs are generated: regression statistics, analysis of variance (ANOVA), table of regression coefficients, and residuals table. We will discuss each.

Fig. 13.11
Fig. 13.11 Simple linear regression outputs in Excel.

As calculated previously, we can verify the regression equation coefficients in the outputs (Fig. 13.12).

Fig. 13.12
Fig. 13.12 Linear regression equation coefficients.

13.2.2 Explanatory Power of the Regression Model: Coefficient of Determination R2

According to Fávero et al. (2009), to measure the explanatory power of a certain regression model, or the percentage of variability of the Y variable, which is explained by the variation of behavior of the explanatory variables, we need to understand some important concepts. While the total sum of squares (TSS) shows the variation in Y in regards to its own average, the sum of squares due to regression (SSR) offers a variation of Y considering the X variables used in the model. Besides this, the residual sum of squares (RSS) presents the variation of Y, which is not explained in the prepared model. We can therefore define that:

$TSS = SSR + RSS$  (13.13)

being:

$Y_i - \bar{Y} = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)$  (13.14)

where Yi is the value of Y in each observation i of the sample, $\bar{Y}$ is the average of Y, and $\hat{Y}_i$ represents the adjusted value of the regression model for each observation i. As such, we have that:

  • $Y_i - \bar{Y}$: total deviation of the value of each observation in relation to the average,
  • $\hat{Y}_i - \bar{Y}$: deviation of the value of the regression model for each observation in relation to the average,
  • $Y_i - \hat{Y}_i$: deviation of the value of each observation in relation to the regression model,

which results in:

$\sum_{i=1}^{n}(Y_i - \bar{Y})^2 = \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2$  (13.15)

or:

$\sum_{i=1}^{n}(Y_i - \bar{Y})^2 = \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n}(u_i)^2$  (13.16)

which is the very Expression (13.13).

Fig. 13.13 graphically shows this relation.

Fig. 13.13
Fig. 13.13 Deviations of Y for two observations.

With these considerations made and the regression equation defined, we embark on the study of the explanatory power of the regression model, also known as the coefficient of determination R2. Stock and Watson (2004) define R2 as the fraction of the variance of the Yi sample explained (or predicted) by the explanatory variables. In the same way, Wooldridge (2012) considers R2 as the proportion of sample variation of the dependent variable explained by the set of explanatory variables, able to be used as a measure of degree of adjustment for the proposed model.

According to Fávero et al. (2009), the explanatory capacity of the model is analyzed by the coefficient of determination R2 of the regression. For a simple regression model, this measure shows how much of the behavior of the Y variable is explained by the variation in behavior of the X variable, always remembering that there is not, necessarily, a cause and effect relationship between the X and Y variables. For the multiple regression model, this measure shows how much of the behavior of the Y variable is explained by the joint variation of the X variables considered in the model.

The R2 is obtained in the following manner:

$R^2 = \frac{SSR}{SSR + RSS} = \frac{SSR}{TSS}$  (13.17)

or

$R^2 = \frac{\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n}(u_i)^2}$  (13.18)

Also according to Fávero et al. (2009), the R2 can vary between 0 and 1 (0%–100%); however, it is practically impossible to obtain an R2 equal to 1, since it would be very difficult for all the points to fall on a line. In other words, if the R2 were 1, there would be no residuals for each of the observations in the sample under study, and the variability of the Y variable would be totally explained by the vector of X variables considered in the regression model.

The more disperse the cloud of points, the less the X and Y variables will relate, the residuals will be greater, and the R2 will be closer to zero. In an extreme case, if the X variation does not correspond to any variation in Y, the R2 will be zero. Fig. 13.14 presents, in an illustrative manner, the behavior of R2 in different cases.

Fig. 13.14
Fig. 13.14 R2 behavior for different simple linear regressions.

Returning to our example where the professor intends to study the behavior of the time students take to get to school and if this phenomenon is influenced by distance traveled by the students, we present the following spreadsheet (Table 13.3), which will aid us in calculating the R2.

Table 13.3

Spreadsheet for the Calculation of the Coefficient of Determination R2 of the Regression Model
Observation (i) | Time (Yi) | Distance (Xi) | Ŷi | ui = Yi − Ŷi | (Ŷi − Ȳ)² | (ui)²
1 | 15 | 8 | 17.23 | −2.23 | 163.08 | 4.97
2 | 20 | 6 | 14.39 | 5.61 | 243.61 | 31.45
3 | 20 | 15 | 27.16 | −7.16 | 8.05 | 51.30
4 | 40 | 20 | 34.26 | 5.74 | 18.12 | 32.98
5 | 50 | 25 | 41.35 | 8.65 | 128.85 | 74.80
6 | 25 | 11 | 21.49 | 3.51 | 72.48 | 12.34
7 | 10 | 5 | 12.97 | −2.97 | 289.92 | 8.84
8 | 55 | 32 | 51.28 | 3.72 | 453.00 | 13.81
9 | 35 | 28 | 45.61 | −10.61 | 243.61 | 112.53
10 | 30 | 20 | 34.26 | −4.26 | 18.12 | 18.12
Sum | 300 | 170 | | | 1638.85 | 361.15
Average | 30 | 17 | | | |

Obs.: Where $\hat{Y}_i = \hat{time}_i = 5.8784 + 1.4189 \cdot dist_i$.

The spreadsheet presented in Table 13.3 allows us to calculate the R2 of the simple linear regression model for our example. As such:

$R^2 = \frac{\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n}(u_i)^2} = \frac{1638.85}{1638.85 + 361.15} = 0.8194$

In this way, we can now affirm that, for the sample studied, 81.94% of the variability in the time to get to school is due to the variable referring to the distance traveled along the route taken by each student. And, therefore, a little more than 18% of the variability is due to other variables not included in the model, which appear as variation in the residuals.
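Continuing the illustrative Python sketch started in Section 13.2.1 (alpha, beta, time, and dist were defined there), the R2 of Expression (13.18) can be checked as follows; this is only a verification aid under that assumption, not part of the chapter's Excel procedure.

```python
# Fitted values and residuals (Expression 13.3)
y_hat = alpha + beta * dist
u = time - y_hat

ssr = np.sum((y_hat - time.mean()) ** 2)  # sum of squares due to regression
rss = np.sum(u ** 2)                      # residual sum of squares

r2 = ssr / (ssr + rss)                    # Expression (13.18)
print(round(r2, 4))                       # approximately 0.8194
```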

The outputs generated by Excel also bring out this information, according to what can be seen in Fig. 13.15.

Fig. 13.15
Fig. 13.15 Coefficient of determination R2 of the regression.

Note that the outputs also supply the values of $\hat{Y}$ and the residuals for each observation, as well as the minimum value of the sum of the squares of the residuals, which are exactly equal to those obtained by the estimation of the parameters by means of the Excel Solver tool (Fig. 13.6) and also calculated and presented in Table 13.3. By means of these values, we can now calculate the R2.

According to Stock and Watson (2004) and Fávero et al. (2009), the coefficient of determination R2 does not tell researchers if a certain explanatory variable is statistically significant and if this variable is the true cause of the change in behavior for the dependent variable. More than that, the R2 does not provide the ability to evaluate the existence of an eventual bias in the omission of explanatory variables and if the choice of those inserted into the proposed model was appropriate.

The importance given to the R2 dimension is often excessive. In different situations, researchers highlight the adequacy of their models by the high R2 values obtained, even giving emphasis to a cause and effect relationship between the explanatory variables and the dependent variable, which is quite erroneous, since this measure merely captures the relation between the variables used in the model. Wooldridge (2012) is even more emphatic, highlighting that it is fundamental not to give considerable importance to the R2 value in the evaluation of regression models.

According to Fávero et al. (2009), if we are able, for example, to find a variable that explains a 40% return on stock, this could at first seem like a low capacity of explanation. However, if a single variable is able to capture this entire relationship in a situation where innumerable other economic, financial, perceptual, and social factors exist, the model could be quite satisfactory.

The general statistical significance of the model and its estimated parameters is not given by the R2, but by means of appropriate statistical tests, which we will study in the next section.

13.2.3 General Statistical Significance of the Regression Model and Each of Its Parameters

To begin, it is of fundamental importance to study the general statistical significance of the estimated model. With this in mind, we should make use of the F-test, with its null and alternative hypotheses, for a general regression model, which are:

  • H0: β1 = β2 = … = βk = 0
  • H1: there is at least one βj ≠ 0, respectively

And, for a simple regression model, therefore, these hypotheses are expressed as:

  • H0: β = 0
  • H1: β ≠ 0

This test allows the researcher to verify if the model that is being estimated does in fact exist, since if all the βj (j = 1, 2, …, k) are statistically equal to zero, the alteration behavior of each of the explanatory variables will not influence in any way the variation behavior of the dependent variable. The F statistic is presented in the following expression:

$F = \frac{\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2 / (k-1)}{\sum_{i=1}^{n}(u_i)^2 / (n-k)} = \frac{SSR/(k-1)}{RSS/(n-k)}$  (13.19)

where k represents the number of parameters of the estimated model (including the intercept) and n, the size of the sample.

Therefore, we can obtain an F statistic expression based on the R2 expression presented in Expression (13.17). As such, we have that:

$F = \frac{SSR/(k-1)}{RSS/(n-k)} = \frac{R^2/(k-1)}{(1-R^2)/(n-k)}$  (13.20)

Returning, then, to our initial example, we obtain:

$F = \frac{1638.85/(2-1)}{361.15/(10-2)} = 36.30$

where, for 1 degree of freedom for the regression (k − 1 = 1) and 8 degrees of freedom for the residuals (n − k = 10 − 2 = 8), we have, by means of Table A in the Appendix, that Fc = 5.32 (critical F at the significance level of 5%). In this way, as the calculated F, Fcal = 36.30 > Fc = F1,8,5% = 5.32, we can reject the null hypothesis that all the βj parameters are statistically equal to zero. At least one X variable is statistically significant to explain the variability of Y, and we will have a statistically significant regression model for forecasting purposes. As, in this case, we have only one X variable (simple regression), it will be statistically significant, at the significance level of 5%, to explain the behavior of the variation of Y.
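The same check can be extended to the F-test of Expression (13.19). The sketch below assumes SciPy is available (an assumption of this illustration; the chapter itself uses statistical tables and Excel functions) and reuses ssr and rss from the previous snippet.

```python
from scipy.stats import f as f_dist  # assumes SciPy is installed

n, k = len(time), 2                      # sample size and number of parameters
F = (ssr / (k - 1)) / (rss / (n - k))    # Expression (13.19); approximately 36.30

F_crit = f_dist.ppf(0.95, k - 1, n - k)  # critical F at the 5% level; approximately 5.32
p_value = f_dist.sf(F, k - 1, n - k)     # the "F significance" reported by Excel

print(F > F_crit, p_value < 0.05)        # both True: reject H0
```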

The outputs offer, by means of analysis of variance (ANOVA), the F statistic and its corresponding significance level (Fig. 13.16).

Fig. 13.16
Fig. 13.16 ANOVA output—F-test for joint evaluation of parameters significance.

Software, such as Excel, Stata, and SPSS, do not directly offer Fc for the degrees of freedom defined and the determined significance level. However, they do offer the significance level of Fcal for these degrees of freedom. As such, instead of analyzing if Fcal > Fc, we should verify if the significance level of Fcal is less than 0.05 (5%) so as to give continuity to the regression analysis. Excel calls this significance level F significance. As such:

  • If the F significance is < 0.05, there is at least one βj ≠ 0.

The Fcal significance level can be obtained in Excel by means of the command Formulas → Insert Function → FDIST, which will open a dialog box as shown in Fig. 13.17.

Fig. 13.17
Fig. 13.17 Obtaining the F significance level (command Insert Function).

Many models present more than one explanatory X variable and, as the F-test evaluates the joint significance of the explanatory variables, it is unable to define which one or ones of the variables considered in the model present parameters estimated to be statistically different from zero, at a certain significance level. Therefore, it is necessary that the researcher evaluate whether each of the parameters of the regression model is statistically different from zero, so as to determine if its respective X variable should, in fact, be included in the proposed model.

The t statistic, also studied in Chapter 9, is important to provide the researcher with the statistical significance of each parameter to be considered in the regression model, and the hypotheses of the corresponding test (t-test) for the intercept and for each βj (j = 1, 2, …, k) are:

  • H0: α = 0
  • H1: α ≠ 0
  • H0: βj = 0
  • H1: βj ≠ 0, respectively

This test provides the researcher with a verification of the statistical significance of each estimated parameter, α and βj, and its expression is given as:

$t_\alpha = \frac{\alpha}{s.e.(\alpha)} \qquad t_{\beta_j} = \frac{\beta_j}{s.e.(\beta_j)}$  (13.21)

where s.e. corresponds to the standard error of each parameter under analysis, which will be discussed later. After obtaining the t-statistics, the researcher can use the respective distribution tables to obtain the critical values for a given significance level and verify if such tests reject the null hypothesis or not. However, as in the case of the F-test, the statistical packages also offer the values of the levels of significance for the t-tests, called P-values, which facilitates the decision, being that, with a 95% confidence level (5% significance level), we will have:

  • If P-value t < 0.05 for intercept, α ≠ 0
  • and
  • If P-value t < 0.05 for a certain X variable, β ≠ 0.

Using the data from our initial example, we have the standard error for the regression as:

$s.e. = \sqrt{\frac{\sum_{i=1}^{n}(u_i)^2}{n-k}} = \sqrt{\frac{361.15}{10-2}} = 6.7189$

which is also provided by the Excel outputs (Fig. 13.18).

Fig. 13.18
Fig. 13.18 Standard error calculation.

Based on Expression (13.21), we can calculate, for our example:

$t_\alpha = \frac{\alpha}{s.e.(\alpha)} = \frac{5.8784}{6.7189\sqrt{a_{jj}}}$

$t_\beta = \frac{\beta}{s.e.(\beta)} = \frac{1.4189}{6.7189\sqrt{a_{jj}}}$

where ajj is the jth element of the main diagonal of the matrix resulting from the following calculation:

$\left[\begin{pmatrix}1 & 1 & 1 & \cdots\\ 8 & 6 & 15 & \cdots\end{pmatrix}\cdot\begin{pmatrix}1 & 8\\ 1 & 6\\ 1 & 15\\ \vdots & \vdots\end{pmatrix}\right]^{-1} = \begin{pmatrix}0.4550 & -0.0209\\ -0.0209 & 0.0012\end{pmatrix}$

which therefore results in:

$t_\alpha = \frac{\alpha}{s.e.(\alpha)} = \frac{5.8784}{6.7189\sqrt{0.4550}} = \frac{5.8784}{4.532} = 1.2969$

$t_\beta = \frac{\beta}{s.e.(\beta)} = \frac{1.4189}{6.7189\sqrt{0.0012}} = \frac{1.4189}{0.2354} = 6.0252$

which, for 8 degrees of freedom (n − k = 10 − 2 = 8), we have, by means of Table B in the Appendix, that tc = 2.306 for the significance level of 5% (probability on the upper tail of 0.025 for the two-tailed distribution). As such, being that the tcal = 1.2969 < tc = t8,2.5% = 2.306, we cannot reject the null hypothesis that the α parameter is statistically equal to zero at this significance level for the sample in question.

The same, however, does not occur for the β parameter, being that the tcal = 6.0252 > tc = t8,2.5% = 2.306. We can, therefore, reject the null hypothesis in this case, or rather, at the significance level of 5% we cannot affirm that this parameter is statistically equal to zero. These outputs are shown in Fig. 13.19.

Fig. 13.19
Fig. 13.19 Calculation of coefficients and significance t-test of parameters.
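For the t-tests, the standard errors can also be obtained from the diagonal of the inverted matrix shown above. The minimal continuation of the illustrative Python sketch below is again an assumption of this rewrite rather than the chapter's own procedure; it reuses rss, n, k, alpha, beta, and dist from the previous snippets.

```python
import numpy as np
from scipy.stats import t as t_dist  # assumes SciPy is installed

X = np.column_stack([np.ones_like(dist), dist])   # design matrix: intercept column plus dist
se_reg = np.sqrt(rss / (n - k))                   # standard error of the regression; approx. 6.7189

XtX_inv = np.linalg.inv(X.T @ X)                  # matrix whose diagonal holds the a_jj terms
se_params = se_reg * np.sqrt(np.diag(XtX_inv))    # standard errors of alpha and beta

t_stats = np.array([alpha, beta]) / se_params     # approximately 1.2969 and 6.0252
p_values = 2 * t_dist.sf(np.abs(t_stats), n - k)  # two-tailed P-values
print(np.round(t_stats, 4), np.round(p_values, 4))
```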

Analogous to the F-test, instead of analyzing if tcal > tc for each parameter, we directly verify if the significance level (P-value) for each tcal is less than 0.05 (5%), so as to maintain the parameter in the final model. The P-value for each tcal can be obtained in Excel by means of the command Formulas → Insert Function → DISTT, which will open a dialog box as is shown in Fig. 13.20. In this figure, the dialog boxes corresponding to parameters α and β are already presented.

Fig. 13.20
Fig. 13.20 Obtaining the levels of significance of t for parameters α and β (command Insert Function).

It is important to mention that, for simple regressions, statistic F = t2 for parameter β, as is shown by Fávero et al. (2009). In our example, therefore, we can verify that:

$t_\beta^2 = F$

$t_\beta^2 = (6.0252)^2 = 36.30 = F$

Since hypothesis H1 of the F-test tells us that at least one β parameter is statistically different from zero at a certain significance level, and since a simple regression presents only one β parameter, if H0 is rejected for the F-test, H0 will also be rejected for the t-test of this β parameter.

However, for the α parameter, being that tcal < tc (P-value of tcal for the α parameter > 0.05) in our example, we could think of the estimation of a new regression that forces the intercept to be equal to zero. This can be elaborated by means of the Excel Regression dialog box, with the selection of the option Constant is zero.

However, we will not carry out such a procedure, since the nonrejection of the null hypothesis that the α parameter is statistically equal to zero is due to the small sample used, and it does not prevent the researcher from making forecasts by means of the model obtained. The imposition that α be zero could generate forecast bias by producing another model that would not be the most adequate for elaborating interpolations in the data. Fig. 13.21 illustrates this fact.

Fig. 13.21
Fig. 13.21 Original regression model and with the intercept equal to zero.

In this way, the fact that we cannot reject that the α parameter is equal to zero at a certain significance level does not necessarily imply that we should exclude it from the model. However, if this is the researcher's decision, it is important to be at least aware that the result will be a model different from the original, with consequences for the preparation of forecasts.

The nonrejection of the null hypothesis for a β parameter at a certain significance level, on the other hand, indicates that the corresponding X variable does not correlate with the Y variable and, therefore, should be excluded from the final model.

When later in this chapter we present the analysis of regression by means of the Stata (Section 13.5) and SPSS (Section 13.6) software, the Stepwise procedure will be introduced. This has a property that automatically excludes or maintains the β parameters in the model in function of the criteria presented and offers the final model with the β parameters statistically different from zero for the determined significance level.

13.2.4 Construction of the Confidence Intervals of the Model Parameters and Elaboration of Predictions

The confidence intervals for the α and βj (j = 1, 2, …, k) parameters, at the 95% confidence level, can be written, respectively, as follows:

$P\left[\alpha - t_{\alpha/2}\sqrt{\frac{\sum_{i=1}^{n}(u_i)^2}{n-k}}\sqrt{\frac{1}{n} + \frac{\bar{X}^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}} \leq \alpha \leq \alpha + t_{\alpha/2}\sqrt{\frac{\sum_{i=1}^{n}(u_i)^2}{n-k}}\sqrt{\frac{1}{n} + \frac{\bar{X}^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}}\right] = 95\%$

$P\left[\beta_j - t_{\alpha/2}\frac{s.e.}{\sqrt{\sum_{i=1}^{n}X_i^2 - \frac{\left(\sum_{i=1}^{n}X_i\right)^2}{n}}} \leq \beta_j \leq \beta_j + t_{\alpha/2}\frac{s.e.}{\sqrt{\sum_{i=1}^{n}X_i^2 - \frac{\left(\sum_{i=1}^{n}X_i\right)^2}{n}}}\right] = 95\%$  (13.22)

Therefore, for our example, we have that:

Parameter α:

$P\left[5.8784 - 2.306\sqrt{\frac{361.1486}{8}}\sqrt{\frac{1}{10} + \frac{289}{814}} \leq \alpha \leq 5.8784 + 2.306\sqrt{\frac{361.1486}{8}}\sqrt{\frac{1}{10} + \frac{289}{814}}\right] = 95\%$

$P\left[-4.5731 \leq \alpha \leq 16.3299\right] = 95\%$

Since the confidence interval for parameter α contains zero, we cannot reject, at the 95% confidence level, that this parameter is statistically equal to zero, in accordance with what was verified when calculating the t statistic.

Parameter β:

$P\left[1.4189 - 2.306\frac{6.7189}{\sqrt{3704 - \frac{(170)^2}{10}}} \leq \beta \leq 1.4189 + 2.306\frac{6.7189}{\sqrt{3704 - \frac{(170)^2}{10}}}\right] = 95\%$

$P\left[0.8758 \leq \beta \leq 1.9619\right] = 95\%$

Since the confidence interval for parameter β does not contain zero, we can reject, at the 95% confidence level, that this parameter is statistically equal to zero, also in accordance with what was verified when calculating the t statistic.
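These intervals of Expression (13.22) can also be reproduced with the quantities already computed in the illustrative Python sketch (se_params, n, k, alpha, and beta come from the previous snippets; this remains an assumption of this rewrite, not the chapter's Excel procedure).

```python
t_crit = t_dist.ppf(0.975, n - k)  # approximately 2.306 for 8 degrees of freedom

ci_lower = np.array([alpha, beta]) - t_crit * se_params
ci_upper = np.array([alpha, beta]) + t_crit * se_params
# alpha: approximately [-4.5731, 16.3299] (contains zero)
# beta:  approximately [ 0.8758,  1.9619] (does not contain zero)
```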

These intervals are also generated in the Excel outputs. Being that the software standard is to use the 95% confidence level, these intervals are shown twice, so as to allow the researcher to manually alter the confidence level desired by selecting the Confidence Level option in the Excel Regression dialog box, but still have the ability to analyze the intervals for the confidence level most commonly used (95%). In other words, the confidence level intervals of 95% in Excel will always be presented, giving the researcher the ability to analyze the intervals from another confidence level in parallel.

We will, therefore, alter the regression dialog box (Fig. 13.22) in order to allow the software to also calculate the interval parameters for the confidence level of, for example, 90%. These outputs are presented in Fig. 13.23.

Fig. 13.22
Fig. 13.22 Alteration of the confidence level of the intervals of the parameters to 90%.
Fig. 13.23
Fig. 13.23 Intervals with confidence levels of 95% and 90% for each of the parameters.

It can be seen that the lower and upper bounds are symmetrical in relation to the estimated average parameter and offer the researcher the ability to prepare forecasts with a certain confidence level. In the case of parameter β from our example, since the extremes of the lower and upper bounds are positive, we can say that this parameter is positive, with 95% confidence. Besides this, we can also say that the interval [0.8758; 1.9619] contains β with 95% confidence.

Differently from what we did for the 95% confidence level, we will not manually calculate the intervals for the 90% confidence level. However, an analysis of the Excel outputs allows us to affirm that the interval [0.9810; 1.8568] contains β with 90% confidence. In this way, we can say that the lower the confidence level, the narrower (smaller amplitude) the interval needed to contain a certain parameter. On the other hand, the higher the confidence level, the greater the amplitude of the interval needed to contain this parameter.

Fig. 13.24 illustrates what happens when we have a dispersed cloud of points surrounding a regression model.

Fig. 13.24
Fig. 13.24 Confidence intervals for a dispersion of points surrounding a regression model.

We can note that, even though parameter α is positive and mathematically equal to 5.8784, we cannot affirm that it is statistically different from zero for this small sample, since the confidence interval contains an intercept equal to zero (origin). A larger sample could solve this problem.

For parameter β, however, we can note that the slope has always been positive, with an average value mathematically calculated and equal to 1.4189. We can visually notice that its confidence interval does not contain a slope equal to zero.

As has already been discussed, the rejection of the null hypothesis for parameter β, at a certain significance level, indicates that the corresponding X variable is correlated with the Y variable and, consequently, should remain in the final model. Therefore, we can conclude that the decision to exclude an X variable from a certain regression model can be made by means of a direct analysis of the t statistic of its respective parameter β (if tcal < tc → P-value > 0.05 → we cannot reject that the parameter is statistically equal to zero) or by means of an analysis of the confidence interval (if it contains zero). Box 13.1 presents the criteria for including or excluding parameters βj (j = 1, 2, …, k) in regression models.

Box 13.1

Decision to Include βj Parameters in Regression Models

Parameter | t Statistic (for Significance Level α) | t-Test (Analysis of the P-Value for Significance Level α) | Analysis of Confidence Interval | Decision
βj | tcal < tc α/2 | P-value > significance level α | Confidence interval contains zero | Exclude parameter from model
βj | tcal > tc α/2 | P-value < significance level α | Confidence interval does not contain zero | Maintain parameter in model

Obs.: The most common in applied social sciences is the adoption of significance level α = 5%.

After a discussion of these concepts, the professor proposed the following exercise to his students: What is the average travel time forecast (Y estimated, or Ŷ) for a student who travels 17 km to get to school? What would be the minimum and maximum values that this travel time could assume, with 95% confidence?

The first part of the exercise could be solved by a simple substitution of the value of Xi = 17 in the initially obtained equation. Like this:

$\hat{time}_i = 5.8784 + 1.4189 \cdot dist_i = 5.8784 + 1.4189(17) = 29.9997 \ \text{min}$

The second part of the exercise takes us to the outputs in Fig. 13.23, being that the α and β parameters assume intervals of [− 4.5731; 16.3299] and [0.8758; 1.9619], respectively, at the 95% confidence level. As such, the equations that determine the minimum and maximum travel time values for this confidence level are:

Minimum time:

$\hat{time}_{min} = -4.5731 + 0.8758 \cdot dist_i = -4.5731 + 0.8758(17) = 10.3155 \ \text{min}$

Maximum time:

$\hat{time}_{max} = 16.3299 + 1.9619 \cdot dist_i = 16.3299 + 1.9619(17) = 49.6822 \ \text{min}$

We can therefore say that there is 95% confidence that a student who travels 17 km to get to school will take between 10.3155 and 49.6822 min, with an average estimated time of 29.9997 min.
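The professor's exercise can be mirrored in the illustrative Python sketch by substituting the 17 km distance into the average equation and into the equations formed by the interval extremes, exactly as done above. Note that this reproduces the chapter's calculation with the parameter interval extremes; it is not the usual prediction-interval formula, and ci_lower and ci_upper come from the previous snippet.

```python
dist_new = 17.0

time_avg = alpha + beta * dist_new               # approximately 29.9997 min
time_min = ci_lower[0] + ci_lower[1] * dist_new  # approximately 10.3155 min
time_max = ci_upper[0] + ci_upper[1] * dist_new  # approximately 49.6822 min
print(round(time_min, 4), round(time_avg, 4), round(time_max, 4))
```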

Obviously, the amplitude of these values is not small, due to the confidence interval of parameter α being quite ample. This fact can be corrected by the increase of the sample size or by the inclusion of new, statistically significant X variables in the model (which would then become a multiple regression model) being that, in the last case, the R2 value would be increased.

After the professor presented the results of the model to the class, a curious student raised his hand and asked, “Professor, is there any influence of the regression model coefficient of determination R2 on the amplitude of the confidence intervals? If we set up this linear regression and substituted Y for Ŷ, what would the results be? Would the equation change? And the R2? And the confidence intervals?”

And, the professor substituted Y for Ŷ and again set up the regression by means of the dataset presented in Table 13.4.

Table 13.4

Dataset for Preparation of New Regression
Observation (i) | Predicted Time (Ŷi) | Distance (Xi)
1 | 17.23 | 8
2 | 14.39 | 6
3 | 27.16 | 15
4 | 34.26 | 20
5 | 41.35 | 25
6 | 21.49 | 11
7 | 12.97 | 5
8 | 51.28 | 32
9 | 45.61 | 28
10 | 34.26 | 20

The first step taken by the professor was to prepare a new scatter plot graph, with the estimated regression model. This graph is presented in Fig. 13.25.

Fig. 13.25
Fig. 13.25 Scatter plot and linear regression model between predicted time (Ŷ) and distance traveled (X).

As we can see, all of the points are now located exactly on the regression line, since this procedure forced this situation: the calculation of each Ŷi used the regression model itself. As such, we can state in advance that the R2 for this new regression is 1. Let's look at the new outputs (Fig. 13.26).

Fig. 13.26
Fig. 13.26 Outputs of the linear regression model between predicted time (Ŷ) and distance traveled (X).

As expected, the R2 is 1. Moreover, the model equation is exactly that which was previously calculated, since it is the same line. However, we can see that the F and t-tests cause us to strongly reject their respective null hypotheses. Even parameter α, which previously could not be considered statistically different from zero, now presents its t-test and tells us that we can reject, at the 95% confidence level (or higher), that this parameter is statistically equal to zero. This occurs because previously the small sample used (n = 10 observations) did not allow us to affirm that the intercept was different from zero, being that the dispersion of points generated a confidence interval that had an intercept equal to zero (Fig. 13.24).

On the other hand, when all the points are on the model, each of the residual terms comes to be zero, which causes the R2 to become 1. Besides, the obtained equation is no longer an adjusted model to a dispersion of points, but the very line that passes through all the points and completely explains the sample behavior. Being such, we do not have a dispersion surrounding the regression model and the confidence intervals come to represent a null amplitude, as we can also see in Fig. 13.26. In this case, for any confidence level, the values for each parameter interval are no longer altered, which causes us to declare with 100% confidence that the [5.8784; 5.8784] interval contains α and the [1.4189; 1.4189] interval contains β. In other words, in this extreme case, α is mathematically equal to 5.8784 and β is mathematically equal to 1.4189.

Being as such, R2 is an indicator of just how ample the parameter confidence intervals are. Therefore, models with higher R2 levels will give the researcher the ability to make more accurate forecasts, given that the cloud of points is less dispersed along the regression model, which will reduce the amplitude of the parameter confidence intervals.

On the other hand, models with low R2 values can impair the preparation of forecasts, given the greater amplitude of the parameter confidence intervals, but this does not invalidate the existence of the model as such. As we have already discussed, many researchers give too much importance to the R2; however, it is the F-test that truly confirms that a regression model exists (at least one X variable considered is statistically significant to explain Y). As such, it is not rare to find very low R2 values and statistically significant F values in Administration, Accounting, or Economics models, which shows that the Y phenomenon studied underwent changes in its behavior due to some X variables adequately included in the model. However, there will be low forecast accuracy due to the impossibility of monitoring all the variables that effectively explain the variation of that Y phenomenon. Within the aforementioned knowledge areas, such a fact can easily be found in works on Finance and the Stock Market.

13.2.5 Estimation of Multiple Linear Regression Models

According to Fávero et al. (2009), the multiple linear regression presents the same logic as the simple linear, however now with the inclusion of more than one explanatory X variable in the model. The use of many explanatory variables depends on the subjacent theory and previous studies, as well as the experience and good sense of the researcher, in order to be able to give foundation to the decision.

Initially, the ceteris paribus concept (maintain remaining conditions constant) should be used in the multiple regression analysis, since the interpretation of the parameter of each variable should be done in isolation. As such, in a model that possesses two explanatory variables, X1 and X2, the respective coefficients will be analyzed in a way so as to consider the other factors as constants.

To illustrate the multiple linear regression, we will use the same example that we have used in this chapter. However, we will now imagine that the professor has made the decision to collect one more variable from each of the students. This variable will refer to the number of traffic lights, or semaphores, each student must pass. We will call this variable sem. As such, the theoretical model becomes:

$time_i = a + b_1 \cdot dist_i + b_2 \cdot sem_i + u_i$

which, analogous to what was presented for the simple regression, we have that:

$\hat{time}_i = \alpha + \beta_1 \cdot dist_i + \beta_2 \cdot sem_i$

where α, β1, and β2 are the estimates for parameters a, b1, and b2, respectively.

The new dataset is found in Table 13.5, as well as in the file Timedistsem.xls.

Table 13.5

Example: Travel Time × Distance Traveled and Number of Traffic Lights
Student | Time to Get to School (min) (Yi) | Distance Traveled to School (km) (X1i) | Number of Traffic Lights (X2i)
Gabriela | 15 | 8 | 0
Dalila | 20 | 6 | 1
Gustavo | 20 | 15 | 0
Leticia | 40 | 20 | 1
Luiz Ovidio | 50 | 25 | 2
Leonor | 25 | 11 | 1
Ana | 10 | 5 | 0
Antonio | 55 | 32 | 3
Julia | 35 | 28 | 1
Mariana | 30 | 20 | 1

We will now algebraically develop the procedures for calculating the model parameters, as we did in the simple regression model. By means of the following expression:

$Y_i = a + b_1 X_{1i} + b_2 X_{2i} + u_i$

we also define that the residual sum of squares is minimum. Therefore:

$\sum_{i=1}^{n}(Y_i - \beta_1 X_{1i} - \beta_2 X_{2i} - \alpha)^2 = \min$

The minimization is obtained by differentiating the previous expression with respect to α, β1, and β2 and setting the resulting expressions equal to zero. Therefore:

$\frac{\partial\left[\sum_{i=1}^{n}(Y_i - \beta_1 X_{1i} - \beta_2 X_{2i} - \alpha)^2\right]}{\partial \alpha} = -2\sum_{i=1}^{n}(Y_i - \beta_1 X_{1i} - \beta_2 X_{2i} - \alpha) = 0$  (13.23)

$\frac{\partial\left[\sum_{i=1}^{n}(Y_i - \beta_1 X_{1i} - \beta_2 X_{2i} - \alpha)^2\right]}{\partial \beta_1} = -2\sum_{i=1}^{n}X_{1i}(Y_i - \beta_1 X_{1i} - \beta_2 X_{2i} - \alpha) = 0$  (13.24)

$\frac{\partial\left[\sum_{i=1}^{n}(Y_i - \beta_1 X_{1i} - \beta_2 X_{2i} - \alpha)^2\right]}{\partial \beta_2} = -2\sum_{i=1}^{n}X_{2i}(Y_i - \beta_1 X_{1i} - \beta_2 X_{2i} - \alpha) = 0$  (13.25)

which generates the following system of three equations and three unknowns:

$\begin{cases}\sum_{i=1}^{n}Y_i = n\alpha + \beta_1\sum_{i=1}^{n}X_{1i} + \beta_2\sum_{i=1}^{n}X_{2i}\\ \sum_{i=1}^{n}Y_iX_{1i} = \alpha\sum_{i=1}^{n}X_{1i} + \beta_1\sum_{i=1}^{n}X_{1i}^2 + \beta_2\sum_{i=1}^{n}X_{1i}X_{2i}\\ \sum_{i=1}^{n}Y_iX_{2i} = \alpha\sum_{i=1}^{n}X_{2i} + \beta_1\sum_{i=1}^{n}X_{1i}X_{2i} + \beta_2\sum_{i=1}^{n}X_{2i}^2\end{cases}$  (13.26)

Dividing the first equation of Expression (13.26) by n, we arrive at:

$\alpha = \bar{Y} - \beta_1\bar{X}_1 - \beta_2\bar{X}_2$  (13.27)

By means of substituting the Expression (13.27) in the last two equations of the Expression (13.26), we arrive at the following system of two equations and two unknowns:

$\begin{cases}\sum_{i=1}^{n}Y_iX_{1i} - \frac{\sum_{i=1}^{n}Y_i\sum_{i=1}^{n}X_{1i}}{n} = \beta_1\left[\sum_{i=1}^{n}X_{1i}^2 - \frac{\left(\sum_{i=1}^{n}X_{1i}\right)^2}{n}\right] + \beta_2\left[\sum_{i=1}^{n}X_{1i}X_{2i} - \frac{\left(\sum_{i=1}^{n}X_{1i}\right)\left(\sum_{i=1}^{n}X_{2i}\right)}{n}\right]\\ \sum_{i=1}^{n}Y_iX_{2i} - \frac{\sum_{i=1}^{n}Y_i\sum_{i=1}^{n}X_{2i}}{n} = \beta_1\left[\sum_{i=1}^{n}X_{1i}X_{2i} - \frac{\left(\sum_{i=1}^{n}X_{1i}\right)\left(\sum_{i=1}^{n}X_{2i}\right)}{n}\right] + \beta_2\left[\sum_{i=1}^{n}X_{2i}^2 - \frac{\left(\sum_{i=1}^{n}X_{2i}\right)^2}{n}\right]\end{cases}$  (13.28)

We will now manually calculate the parameters for our example model. To do this, we need to use the spreadsheet in Table 13.6.

Table 13.6

Spreadsheet to Calculate the Parameters for the Multiple Linear Regression
Obs. (i) | Yi | X1i | X2i | YiX1i | YiX2i | X1iX2i | (Yi)² | (X1i)² | (X2i)²
1 | 15 | 8 | 0 | 120 | 0 | 0 | 225 | 64 | 0
2 | 20 | 6 | 1 | 120 | 20 | 6 | 400 | 36 | 1
3 | 20 | 15 | 0 | 300 | 0 | 0 | 400 | 225 | 0
4 | 40 | 20 | 1 | 800 | 40 | 20 | 1600 | 400 | 1
5 | 50 | 25 | 2 | 1250 | 100 | 50 | 2500 | 625 | 4
6 | 25 | 11 | 1 | 275 | 25 | 11 | 625 | 121 | 1
7 | 10 | 5 | 0 | 50 | 0 | 0 | 100 | 25 | 0
8 | 55 | 32 | 3 | 1760 | 165 | 96 | 3025 | 1024 | 9
9 | 35 | 28 | 1 | 980 | 35 | 28 | 1225 | 784 | 1
10 | 30 | 20 | 1 | 600 | 30 | 20 | 900 | 400 | 1
Sum | 300 | 170 | 10 | 6255 | 415 | 231 | 11,000 | 3704 | 18
Average | 30 | 17 | 1 | | | | | |

We will now substitute the values into the system represented by the Expression (13.28). Therefore:

$\begin{cases}6255 - \frac{300 \cdot 170}{10} = \beta_1\left[3704 - \frac{(170)^2}{10}\right] + \beta_2\left[231 - \frac{(170)(10)}{10}\right]\\ 415 - \frac{300 \cdot 10}{10} = \beta_1\left[231 - \frac{(170)(10)}{10}\right] + \beta_2\left[18 - \frac{(10)^2}{10}\right]\end{cases}$

Which results in:

$\begin{cases}1155 = 814\beta_1 + 61\beta_2\\ 115 = 61\beta_1 + 8\beta_2\end{cases}$

Solving the system, we arrive at:

$\beta_1 = 0.7972 \quad \text{and} \quad \beta_2 = 8.2963$

We have that:

$\alpha = \bar{Y} - \beta_1\bar{X}_1 - \beta_2\bar{X}_2 = 30 - 0.7972(17) - 8.2963(1) = 8.1512$

Therefore, the estimated time equation to get to school now comes to be:

$\hat{time}_i = 8.1512 + 0.7972 \cdot dist_i + 8.2963 \cdot sem_i$
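As a cross-check of the algebra above, the multiple regression parameters can also be obtained numerically. The sketch below continues the earlier illustrative Python snippets (time and dist are already defined there) and solves the least squares problem directly; it is an assumption of this rewrite, not the chapter's own procedure.

```python
import numpy as np

# Number of traffic lights for each student (Table 13.5)
sem = np.array([0, 1, 0, 1, 2, 1, 0, 3, 1, 1], dtype=float)

X_multi = np.column_stack([np.ones_like(dist), dist, sem])
coefs, *_ = np.linalg.lstsq(X_multi, time, rcond=None)  # OLS solution of the normal equations

print(np.round(coefs, 4))  # approximately [8.1512, 0.7972, 8.2963]
```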

It should be remembered that the estimation of these parameters can also be obtained by means of the Excel Solver tool, as shown in Section 13.2.1.

The calculations of the coefficient of determination R2, the F and t statistics, and the extreme values of the confidence intervals will not be performed again manually, given that they follow exactly the same procedures already performed in Sections 13.2.2–13.2.4 and can be done by means of the respective expressions presented until now. Table 13.7 can be of help in this sense.

Table 13.7

Spreadsheet to Calculate Remaining Statistics
Observation (i) | Time (Yi) | Distance (X1i) | Traffic Lights (X2i) | Ŷi | ui = Yi − Ŷi | (Ŷi − Ȳ)² | (ui)²
1 | 15 | 8 | 0 | 14.53 | 0.47 | 239.36 | 0.22
2 | 20 | 6 | 1 | 21.23 | −1.23 | 76.90 | 1.51
3 | 20 | 15 | 0 | 20.11 | −0.11 | 97.83 | 0.01
4 | 40 | 20 | 1 | 32.39 | 7.61 | 5.72 | 57.89
5 | 50 | 25 | 2 | 44.67 | 5.33 | 215.32 | 28.37
6 | 25 | 11 | 1 | 25.22 | −0.22 | 22.88 | 0.05
7 | 10 | 5 | 0 | 12.14 | −2.14 | 319.08 | 4.57
8 | 55 | 32 | 3 | 58.55 | −3.55 | 815.14 | 12.61
9 | 35 | 28 | 1 | 38.77 | −3.77 | 76.90 | 14.21
10 | 30 | 20 | 1 | 32.39 | −2.39 | 5.72 | 5.72
Sum | 300 | 170 | 10 | | | 1874.85 | 125.15
Average | 30 | 17 | 1 | | | |

Let’s go directly to the preparation of this multiple linear regression in Excel (file Timedistsem.xls). In the regression dialog box, we should jointly select the variables referent to the distance traveled and the number of traffic lights, as shown in Fig. 13.27.

Fig. 13.27
Fig. 13.27 Multiple linear regression—joint selection of set of explanatory variables.

Fig. 13.28 presents the generated outputs.

Fig. 13.28
Fig. 13.28 Multiple linear regression outputs in Excel.

Within these outputs, we find the parameters for our multiple linear regression model determined algebraically.

At this time, it is important to introduce the concept of the adjusted R2. According to Fávero et al. (2009), when we wish to compare the coefficient of determination (R2) between two models with different sample sizes or distinct quantities of parameters, the use of the adjusted R2 becomes necessary. It is a measure of the R2 of the regression estimated by the OLS method adjusted by the number of degrees of freedom, since the sample estimate of R2 tends to overestimate the population parameter. The adjusted R2 expression is:

$R^2_{adjust} = 1 - \frac{n-1}{n-k}(1 - R^2)$  (13.29)

where n is the size of the sample and k is the number of regression model parameters (number of explanatory variables plus the intercept). When the number of observations is very large, the adjustment by degrees of freedom becomes negligible; however when there is a significantly different number of X variables for the two samples, the adjusted R2 should be used for the preparation of the comparison between models and the model with the higher adjusted R2 should be opted for.

R2 increases when a new variable is added to the model, however the adjusted R2 will not always increase, and could well decrease or become negative. For this last case, Stock and Watson (2004) explain that the adjusted R2 can become negative when the explanatory variables, taken as a set, reduce the residual sum of squares to such a small amount that this reduction is unable to compensate the factor (n − 1)/(n − k).

For our example, we have that:

$R^2_{adjust} = 1 - \frac{10-1}{10-3}(1 - 0.9374) = 0.9195$
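The same figures can be verified in the illustrative Python sketch (X_multi and coefs come from the previous snippet, and the use of Python remains an assumption of this rewrite); k here counts the intercept plus the two explanatory variables, as in Expression (13.29).

```python
y_hat_m = X_multi @ coefs
rss_m = np.sum((time - y_hat_m) ** 2)          # approximately 125.15
tss = np.sum((time - time.mean()) ** 2)        # total sum of squares
r2_m = 1 - rss_m / tss                         # approximately 0.9374

n, k_m = len(time), X_multi.shape[1]           # 10 observations, 3 parameters
r2_adj = 1 - (n - 1) / (n - k_m) * (1 - r2_m)  # Expression (13.29); approximately 0.9195
print(round(r2_m, 4), round(r2_adj, 4))
```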

Therefore, instead of the simple regression initially estimated, we should now opt for this multiple regression as the better model to study the behavior of travel time to school, since the adjusted R2 is higher in this case.

Let's continue with the analysis of the remaining outputs. Initially, the F-test informs us that at least one of the X variables is statistically significant to explain the behavior of Y. Besides this, we can also verify, at the 5% significance level, that all the parameters (α, β1, and β2) are statistically different from zero (P-value < 0.05 → confidence interval does not contain zero). As has been discussed, the nonrejection of the null hypothesis that the intercept is statistically equal to zero can be altered by including a significant explanatory variable in the model. We also note that there was a perceptible increase in the R2 value, which also caused the confidence intervals of the parameters to become narrower.

In this way, we can conclude, for this case, that one additional traffic light along the route to school increases the average travel time by 8.2963 min, ceteris paribus. On the other hand, an increase of one kilometer in the distance to be traveled now adds only 0.7972 min to the average travel time, ceteris paribus. The reduction in the estimated value of β for the dist variable occurs because part of the behavior of this variable is captured by the sem variable. In other words, greater distances are more susceptible to a greater number of traffic lights and, therefore, there is a strong relation between them.

According to Kennedy (2008), Fávero et al. (2009), Gujarati (2011), and Wooldridge (2012), the existence of high correlations between explanatory variables, known as multicollinearity, does not compromise the use of the model for preparing forecasts. Gujarati (2011) also highlights that the existence of high correlations between explanatory variables does not necessarily generate bad or weak estimators and that the presence of multicollinearity does not mean that the model has problems. We will discuss multicollinearity more in Section 13.3.2.

The equations that determine the minimum and maximum values for travel time, at the 95% confidence level, are:

Minimum time:

$\hat{time}_{min} = 1.2463 + 0.2619 \cdot dist_i + 2.8967 \cdot sem_i$

Maximum time:

$\hat{time}_{max} = 15.0561 + 1.3325 \cdot dist_i + 13.6959 \cdot sem_i$

13.2.6 Dummy Variables in Regression Models

According to Sharma (1996) and Fávero et al. (2009), for metric variables, the determination of the number of variables necessary to investigate a phenomenon is direct and simply equal to the number of variables used to measure the respective characteristics. However, the procedure to determine the number of explanatory variables for data measured on qualitative scales is different.

Imagine, for example, that we wish to study how a certain organizational phenomenon, such as total profitability, behaves when companies from different sectors are in the same dataset. Or, in another situation, we wish to verify if the average grocery bill in supermarkets presents significant differences when comparing consumers of different genders and age groups. In a third situation, we want to study how GDP growth behaves in countries considered either emerging or developed. In all of these hypothetical situations, the dependent variables (or outcome variables) are quantitative (total profitability, average grocery bill, or rate of GDP growth); however, we wish to know how they behave as a function of qualitative explanatory variables (sector, gender, age group, country classification), which will be included on the right side of the respective regression models to be estimated.

We cannot simply attribute values to each of the qualitative variable categories, for this would be a serious error, called random weighting, since we would be supposing that the differences in the dependent variable were previously known and of equal magnitude to the differences in the values attributed to each of the qualitative explanatory variable categories. In these situations, so that this problem is completely eliminated, we should resort to the artifice of dummy variables, or binaries, which assume values equal to 0 or 1, in such a way as to stratify the sample according to how a determined criterion, event, or attribute was defined, to then be included in the model under analysis. Even a certain period (day, month, or year) in which an important event occurred can be the object of analysis.

Dummy variables should, therefore, be used when we wish to study the relation between the behavior of a certain qualitative explanatory variable and the phenomenon in question, represented by the dependent variable.

Returning to our example, now imagine that the professor also asked the students regarding the time of day they came to school, or rather, if each of them came in the morning in order to study in the library, or if they came in the afternoon for their night class. The intent of the professor is to know if the travel time to school undergoes variation due to the distance traveled, the quantity of traffic lights, and the time of day when the students leave to go to school. Therefore, a new variable was added to the dataset, as is shown in Table 13.8.

Table 13.8

Example: Travel Time × Distance Traveled, Number of Traffic Lights, and Time of Day for the Trip to School
Student      | Travel Time to Get to School (min) (Yi) | Distance Traveled to School (km) (X1i) | Number of Traffic Lights (X2i) | Time of Day (X3i)
Gabriela     | 15 | 8  | 0 | Morning
Dalila       | 20 | 6  | 1 | Morning
Gustavo      | 20 | 15 | 0 | Morning
Leticia      | 40 | 20 | 1 | Afternoon
Luiz Ovidio  | 50 | 25 | 2 | Afternoon
Leonor       | 25 | 11 | 1 | Morning
Ana          | 10 | 5  | 0 | Morning
Antonio      | 55 | 32 | 3 | Afternoon
Julia        | 35 | 28 | 1 | Morning
Mariana      | 30 | 20 | 1 | Morning

We should, therefore, define which of the qualitative variable categories will be the reference (dummy = 0). Since, in this case, we have only two categories (morning or afternoon), only one dummy needs to be created, in which the reference category will assume the value 0 and the other category the value 1. This procedure allows the researcher to study the differences that occur in the Y variable when the qualitative variable category changes, since the β of this dummy will represent exactly the difference that occurs in the behavior of variable Y when it passes from the reference category of the qualitative variable to the other category, with the behavior of the reference category represented by the intercept α. Therefore, the decision as to which category will be the reference is up to the researcher, and the parameters of the model will be obtained based on the criterion adopted.

As such, the professor decided that the reference category would be the afternoon period, or rather, the dataset cells with this category would assume values equal to 0. Then, the cells with the morning category would assume values equal to 1. This is because the professor wanted to evaluate if the trip to school in the morning posed any time benefit or loss in relation to the afternoon period, which was immediately before class. We will call this dummy variable per. Being as such, the dataset reflects what is presented in Table 13.9.

Table 13.9

Substitution of the Qualitative Variable Categories by the Dummy
Student      | Time to Get to School (min) (Yi) | Distance Traveled to School (km) (X1i) | Number of Traffic Lights (X2i) | Time of Day, Dummy per (X3i)
Gabriela     | 15 | 8  | 0 | 1
Dalila       | 20 | 6  | 1 | 1
Gustavo      | 20 | 15 | 0 | 1
Leticia      | 40 | 20 | 1 | 0
Luiz Ovidio  | 50 | 25 | 2 | 0
Leonor       | 25 | 11 | 1 | 1
Ana          | 10 | 5  | 0 | 1
Antonio      | 55 | 32 | 3 | 0
Julia        | 35 | 28 | 1 | 1
Mariana      | 30 | 20 | 1 | 1
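As an illustration of this coding, the dummy column of Table 13.9 can be generated with a minimal sketch (Python assumed; the list names are hypothetical):

```python
# Minimal sketch: coding the qualitative variable "time of day" as a dummy,
# with the afternoon period as the reference category (per = 0), as in Table 13.9.
periods = ["Morning", "Morning", "Morning", "Afternoon", "Afternoon",
           "Morning", "Morning", "Afternoon", "Morning", "Morning"]

per = [1 if p == "Morning" else 0 for p in periods]
print(per)  # [1, 1, 1, 0, 0, 1, 1, 0, 1, 1]
```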

Therefore, the new model is:

time_i = a + b_1\,dist_i + b_2\,sem_i + b_3\,per_i + u_i

and, analogous to what was presented for the simple regression, we have:

\widehat{time}_i = \alpha + \beta_1\,dist_i + \beta_2\,sem_i + \beta_3\,per_i

where α, β1, β2, and β3 are the estimates of parameters a, b1, b2, and b3, respectively.

Again, solving by Excel, we should now include the dummy variable per in the explanatory variables vector, as is shown in Fig. 13.29 (file Timedistsemper.xls).

Fig. 13.29 Multiple linear regression—joint selection of the explanatory variables with dummy.

The outputs are presented in Fig. 13.30.

Fig. 13.30 Multiple linear regression outputs with dummy in Excel.

By means of these outputs, we can, initially, see that the R2 went up to 0.9839, which allows us to say that more than 98% of the variation in the time to get to school is explained by the joint variation of the three X variables (dist, sem, and per). Besides that, this model is preferable to those previously studied, since it presents a higher adjusted R2.

While the F-test allows us to state that at least one estimated β parameter is statistically different from zero at the 5% significance level, the t-tests for each parameter show that all of them (β1, β2, β3, and even α) are statistically different from zero at this significance level, since each P-value < 0.05. As such, no X variable needs to be excluded from the model. The final equation that estimates the time to get to school presents itself in the following way:

\widehat{time}_i = 19.6353 + 0.7084\,dist_i + 5.2573\,sem_i - 9.9088\,per_i, \quad per_i = \begin{cases} 0, & \text{afternoon} \\ 1, & \text{morning} \end{cases}

In this way, we can state, for our example, that the average predicted time to get to school is 9.9088 min less for students who opt to go during the morning in relation to those who opt to go in the afternoon, ceteris paribus. This probably happens for reasons associated with traffic; however, deeper studies could be performed at this point. As such, the professor proposed one more exercise: What is the estimated time to get to school for a student who travels 17 km, goes through two traffic lights, and gets to school a little before the beginning of the evening class, in the afternoon? The solution is as follows:

\widehat{time} = 19.6353 + 0.7084\,(17) + 5.2573\,(2) - 9.9088\,(0) = 42.1934 \text{ min}

It should be noted that small differences arising from the third decimal place onward can occur due to rounding. We used the values obtained from the Excel outputs.

And what would be the estimated time for another student who also travels 17 km, goes through two traffic lights, but decides to go to school in the morning?

\widehat{time} = 19.6353 + 0.7084\,(17) + 5.2573\,(2) - 9.9088\,(1) = 32.2846 \text{ min}

According to what we have already discussed, the difference between the two situations is captured by the β3 of the dummy variable. The ceteris paribus condition imposes no other alteration to be considered, exactly as shown in this last exercise.
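A minimal sketch of these two forecasts, using the coefficients reported in the Excel outputs (values rounded to four decimal places, so results may differ slightly from the text), could be:

```python
# Minimal sketch: forecasts from the equation estimated with the dummy per.
# Coefficients are those reported in the Excel outputs (rounded to four decimals).
def predict_time(dist, sem, per):
    return 19.6353 + 0.7084 * dist + 5.2573 * sem - 9.9088 * per

afternoon = predict_time(17, 2, per=0)  # roughly 42.19 min
morning = predict_time(17, 2, per=1)    # roughly 32.28 min
print(afternoon, morning, afternoon - morning)  # the difference equals 9.9088 (-beta3)
```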

Imagine now that the professor, still not satisfied, questions the students one last time regarding their driving style. As such, each one was asked how they classified their driving style: calm, moderate, or aggressive. After obtaining their answers, he set up the last dataset, presented in Table 13.10.

Table 13.10

Example: Travel Time × Distance Traveled, Number of Traffic Lights, Time of Day for the Trip to School, and Driving Style
Student      | Time to Get to School (min) (Yi) | Distance Traveled to School (km) (X1i) | Number of Traffic Lights (X2i) | Time of Day (X3i) | Driving Style (X4i)
Gabriela     | 15 | 8  | 0 | Morning   | Calm
Dalila       | 20 | 6  | 1 | Morning   | Moderate
Gustavo      | 20 | 15 | 0 | Morning   | Moderate
Leticia      | 40 | 20 | 1 | Afternoon | Aggressive
Luiz Ovidio  | 50 | 25 | 2 | Afternoon | Aggressive
Leonor       | 25 | 11 | 1 | Morning   | Moderate
Ana          | 10 | 5  | 0 | Morning   | Calm
Antonio      | 55 | 32 | 3 | Afternoon | Calm
Julia        | 35 | 28 | 1 | Morning   | Moderate
Mariana      | 30 | 20 | 1 | Morning   | Moderate

To prepare the regression, the professor needs to transform the driving style variable into dummies. For a situation where a certain qualitative variable has more than two categories (e.g., marital status, favorite ball team, religion, area of work, among others), the researcher needs to use a greater number of dummy variables and, in general, for a qualitative variable with n categories, (n − 1) dummies will be needed, since a certain category should be chosen as the reference and its behavior captured by the estimated α parameter.

As we have discussed, it is unfortunately quite common to find procedures that practice the arbitrary substitution of the qualitative variable categories with values such as 1 and 2 when there are two categories, 1, 2, and 3 when there are three categories, and so on. This is a serious error, since, in this way, we would start from the presupposition that the differences that occur in the behavior of the Y variable when the qualitative variable category changes would always be of the same magnitude, which is not necessarily true. In other words, we cannot presume that the average difference between calm and moderate individuals would be the same as that between moderate and aggressive individuals.

In our example, therefore, the driving style variable should be transformed into two dummies (variables style2 and style3), with the calm category already defined as the reference (its behavior captured by the intercept). While Table 13.11 presents the criteria for the two dummies, Table 13.12 shows the final dataset to be used in the regression.

Table 13.11

Criteria for the Creation of Two Dummy Variables Based on the Driving Style Qualitative Variable
Category of Driving Style Qualitative Variable | Dummy Variable style2 | Dummy Variable style3
Calm       | 0 | 0
Moderate   | 1 | 0
Aggressive | 0 | 1

Table 13.12

Substitution of the Qualitative Variable Categories With the Respective Dummy Variables
Student      | Time to Get to School (min) (Yi) | Distance Traveled to School (km) (X1i) | Number of Traffic Lights (X2i) | Time of Day, Dummy per (X3i) | Driving Style, Dummy style2 (X4i) | Driving Style, Dummy style3 (X5i)
Gabriela     | 15 | 8  | 0 | 1 | 0 | 0
Dalila       | 20 | 6  | 1 | 1 | 1 | 0
Gustavo      | 20 | 15 | 0 | 1 | 1 | 0
Leticia      | 40 | 20 | 1 | 0 | 0 | 1
Luiz Ovidio  | 50 | 25 | 2 | 0 | 0 | 1
Leonor       | 25 | 11 | 1 | 1 | 1 | 0
Ana          | 10 | 5  | 0 | 1 | 0 | 0
Antonio      | 55 | 32 | 3 | 0 | 0 | 0
Julia        | 35 | 28 | 1 | 1 | 1 | 0
Mariana      | 30 | 20 | 1 | 1 | 1 | 0

And, in this way, the model will have the following equation:

time_i = a + b_1\,dist_i + b_2\,sem_i + b_3\,per_i + b_4\,style2_i + b_5\,style3_i + u_i

and, analogous to what has been presented for the previous models, we have that:

\widehat{time}_i = \alpha + \beta_1\,dist_i + \beta_2\,sem_i + \beta_3\,per_i + \beta_4\,style2_i + \beta_5\,style3_i

where α, β1, β2, β3, β4, and β5 are the estimates of parameters a, b1, b2, b3, b4, and b5, respectively.

In this way, analyzing the parameters of the style2 and style3 variables, we have that:

  • β4 = average travel time difference between an individual considered moderate and an individual considered calm.
  • β5 = average travel time difference between an individual considered aggressive and an individual considered calm.
  • (β5 − β4) = average travel time difference between an individual considered aggressive and an individual considered moderate.
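A minimal sketch of this coding (Python assumed; the list names are hypothetical), reproducing the criteria of Table 13.11, would be:

```python
# Minimal sketch: (n - 1) = 2 dummies for the 3-category variable "driving style",
# with "calm" as the reference category, following the criteria of Table 13.11.
styles = ["Calm", "Moderate", "Moderate", "Aggressive", "Aggressive",
          "Moderate", "Calm", "Calm", "Moderate", "Moderate"]

style2 = [1 if s == "Moderate" else 0 for s in styles]    # moderate vs. calm
style3 = [1 if s == "Aggressive" else 0 for s in styles]  # aggressive vs. calm
print(style2)  # [0, 1, 1, 0, 0, 1, 0, 0, 1, 1]
print(style3)  # [0, 0, 0, 1, 1, 0, 0, 0, 0, 0]
```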

Solving again by Excel, we should now include the style2 and style3 dummy variables in the explanatory variables vector. Fig. 13.31 shows this procedure, prepared by means of the Timedistsemperstyle.xls file.

Fig. 13.31 Multiple linear regression—joint selection of explanatory variables with all dummies.

The outputs are presented in Fig. 13.32.

Fig. 13.32 Multiple linear regression outputs with different dummies in Excel.

We now notice that, even though the R2 of the regression model is quite high (R2 = 0.9969), the parameters of the variables referring to the period in which the route was taken (X3) and to the moderate category of the driving style variable (X4) do not show themselves to be statistically different from zero at the 5% significance level. As such, these variables will be removed from the analysis and the model will be estimated again.

However, it is important to note that, in the presence of the remaining variables, the travel time to school does not present additional differences whether the route is taken in the morning or in the afternoon. The same is true in relation to driving style: there are no statistically significant differences in travel time for students with the moderate profile in relation to those who consider themselves calm. It should be remembered that, in a multiple regression, the analysis of the parameters that do not show themselves to be statistically different from zero is just as important as the analysis of the statistically significant parameters.

The Stepwise procedure, available in Stata, SPSS, and other modeling software, automatically excludes the explanatory variables whose parameters do not show themselves to be statistically different from zero. Since Excel does not possess this procedure, we will manually exclude the per and style2 variables and once again estimate the regression. The new outputs are presented in Fig. 13.33. It is recommended, however, that the researcher always take great care with the simultaneous exclusion of variables whose parameters, at first sight, do not show themselves to be statistically different from zero, since a certain β parameter can become statistically different from zero, even though it initially was not, when another variable whose parameter also does not show itself to be statistically different from zero is eliminated. Fortunately, this does not occur in this example and, as such, we opt to simultaneously exclude the two variables. This will be confirmed when we elaborate this regression by means of the Stepwise procedure in the Stata (Section 13.5) and SPSS (Section 13.6) software.

Fig. 13.33 Multiple linear regression outputs after the exclusion of variables.

And, in this way, the final model, with all the parameters statistically different from zero at the 5% significance level, with R2 = 0.9954 and the higher adjusted R2 among all those discussed in this chapter, comes to be:

\widehat{time}_i = 8.2919 + 0.7105\,dist_i + 7.8368\,sem_i + 8.9676\,style3_i, \quad style3_i = \begin{cases} 0, & \text{calm} \\ 1, & \text{aggressive} \end{cases}

It is also important to verify if there was a reduction in the confidence interval amplitudes for each of the parameters. In this way, we can ask:

What would be the estimated time for another student, who also travels 17 km, goes through two traffic lights, decides to go to school in the morning, but has what is considered an aggressive driving style?

\widehat{time} = 8.2919 + 0.7105\,(17) + 7.8368\,(2) + 8.9676\,(1) = 45.0109 \text{ min}

Finally, we can state, ceteris paribus, that a student who is considered aggressive when driving takes, on average, 8.9676 min longer to get to school in relation to another who is considered calm. This shows us, among other things, that aggressiveness in traffic leads nowhere!

13.3 Presuppositions of Regression Models Estimated by OLS

After the presentation of the multiple regression model estimated by the OLS method, Box 13.2 presents its presuppositions, the consequences of their violation, and the procedures for verifying each of them.

Box 13.2

Regression Model Presuppositions

Presupposition | Violation | Presupposition Verification
Residuals present a normal distribution | P-values of the t-test and F-test are not valid | Shapiro-Wilk test; Shapiro-Francia test
There are no high correlations between explanatory variables and there are more observations than explanatory variables | Multicollinearity | Correlation matrix; determinant of the (X′X) matrix; VIF (variance inflation factor) and Tolerance
The residuals do not present any correlation with any X variable | Heteroskedasticity | Breusch-Pagan/Cook-Weisberg test
The residuals are random and independent | Autocorrelation of the residuals (for temporal models) | Durbin-Watson test; Breusch-Godfrey test
Source: Kennedy (2008).

Next, we will present and discuss each of the presuppositions.

13.3.1 Normality of Residuals

The normal distribution of the residuals is required to validate the hypothesis tests in regression models. In other words, the normality presupposition assures that the P-values of the t-tests and the F-test are valid. However, Wooldridge (2012) argues that the violation of this presupposition can be minimized when using large samples, due to the asymptotic properties of the estimators obtained by OLS.

It is quite common for this presupposition to be violated by researchers when estimating regression models by the OLS method; however, it is important that this hypothesis be considered in order to obtain statistical results in line with the definition of the best functional form and to determine the confidence intervals for forecasts (Fig. 13.34), which are defined, as we have studied, based on the estimates of the model parameters.

Fig. 13.34 Normal distribution of residuals.

It should be emphasized that adherence of the dependent variable to the normal distribution, in OLS regression models, tends to generate error terms that are also normal and, consequently, estimated parameters more adequate for the determination of the confidence intervals for forecasting purposes.

Thus, it is recommended that the Shapiro-Wilk test or the Shapiro-Francia test be applied to the error terms so as to verify the presupposition of normally distributed residuals. According to Maroco (2014), while the Shapiro-Wilk test is more appropriate for small samples (those with up to 30 observations), the Shapiro-Francia test is more recommended for larger samples, as we discussed in Chapter 9.

In Section 13.5 we will present the application of these tests, as well as their results, using Stata.
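Outside Stata, this verification can also be sketched, for instance, in Python; the snippet below assumes scipy and statsmodels are available and uses the travel-time data only as an illustration (scipy implements the Shapiro-Wilk test; the Shapiro-Francia variant is not part of scipy).

```python
# Minimal sketch: Shapiro-Wilk test applied to the OLS residuals.
import numpy as np
from scipy import stats
import statsmodels.api as sm

time = np.array([15, 20, 20, 40, 50, 25, 10, 55, 35, 30], dtype=float)
dist = np.array([8, 6, 15, 20, 25, 11, 5, 32, 28, 20], dtype=float)

residuals = sm.OLS(time, sm.add_constant(dist)).fit().resid
W, p_value = stats.shapiro(residuals)
print(W, p_value)  # if p_value > 0.05, do not reject the normality of the residuals
```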

Nonadherence of the error terms to normality can indicate that the model was specified incorrectly as to its functional form or that relevant explanatory variables were omitted. To correct this problem, the functional form can be altered, or new explanatory variables can be included in the model.

In Section 13.3.5, we will present the linktest and the RESET test for the identification of specification problems in the functional form and the omission of relevant variables, respectively. In Section 13.4, we discuss nonlinear specifications, highlighting some specific functional forms. In that same section, we will discuss the Box-Cox transformations, whose purpose is to maximize adherence to the normal distribution of a variable generated from an original variable with a non-normal distribution. It is very common for this procedure to be applied to the dependent variable of a model whose estimation generated error terms that do not adhere to normality.

It is worth commenting that the claim that explanatory variables need to present distributions that adhere to normality is quite common, and it is a big mistake. If this were the case, it would not be possible to use dummy variables in our models.

13.3.2 The Multicollinearity Problem

The multicollinearity problem occurs when there are very high correlations between explanatory variables. In extreme cases, such correlations can be perfect, indicating a linear relation between the variables.

Initially, we present the general multiple linear regression model in its matrix form. Beginning with:

Y_i = a + b_1 X_{1i} + b_2 X_{2i} + \dots + b_k X_{ki} + u_i   (13.30)

we write that:

Y = Xb + U   (13.31)

or:

\begin{bmatrix} Y_1 \\ Y_2 \\ Y_3 \\ \vdots \\ Y_n \end{bmatrix}_{n \times 1} = \begin{bmatrix} 1 & X_{11} & X_{12} & \cdots & X_{1k} \\ 1 & X_{21} & X_{22} & \cdots & X_{2k} \\ 1 & X_{31} & X_{32} & \cdots & X_{3k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n1} & X_{n2} & \cdots & X_{nk} \end{bmatrix}_{n \times (k+1)} \cdot \begin{bmatrix} a \\ b_1 \\ b_2 \\ \vdots \\ b_k \end{bmatrix}_{(k+1) \times 1} + \begin{bmatrix} u_1 \\ u_2 \\ u_3 \\ \vdots \\ u_n \end{bmatrix}_{n \times 1}   (13.32)

from which we can show that the parameter estimates are given by the following vector:

\beta = (X'X)^{-1}(X'Y)   (13.33)
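Expression (13.33) can be verified directly. The sketch below, in Python with NumPy, assumes the travel-time data used earlier in the chapter and computes β = (X′X)−1(X′Y) by hand; it should reproduce, up to rounding, the estimates obtained in Excel.

```python
# Minimal sketch: OLS estimates via the normal equations, beta = (X'X)^(-1) X'Y.
import numpy as np

time = np.array([15, 20, 20, 40, 50, 25, 10, 55, 35, 30], dtype=float)
dist = np.array([8, 6, 15, 20, 25, 11, 5, 32, 28, 20], dtype=float)
sem = np.array([0, 1, 0, 1, 2, 1, 0, 3, 1, 1], dtype=float)

X = np.column_stack([np.ones_like(dist), dist, sem])  # column of 1s for the intercept
beta = np.linalg.inv(X.T @ X) @ (X.T @ time)
print(beta)  # [alpha, beta1, beta2] -- approximately [8.15, 0.80, 8.30]
```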

We can imagine a specific model with only two explanatory variables, as follows:

Y_i = a + b_1 X_{1i} + b_2 X_{2i} + u_i   (13.34)

If, for example, X2i = 4 ⋅ X1i, it would not be possible to separate the variations in the dependent variable due to alterations in X1 from those due to the influence of X2. Therefore, according to Vasconcellos and Alves (2000), it would be impossible, in this situation, to estimate all the parameters of Expression (13.34), since the inversion of the (X′X) matrix would be impossible and, consequently, so would the calculation of the vector of parameters β = (X′X)− 1 (X′Y). However, the following model could be estimated:

Y_i = a + (b_1 + 4 b_2)\,X_{1i} + u_i   (13.35)

whose estimated parameter would be a linear combination between b1 and b2.

Greater problems, however, occur when the correlation between explanatory variables is quite high, albeit not perfect, as will be discussed later by means of numerical examples and dataset applications.

13.3.2.1 Causes of Multicollinearity

One of the main causes of multicollinearity is the existence of variables that present the same tendency over certain periods. We can imagine, for example, that we want to see if the profitability, over a period of time, of a certain European fixed income fund tied to price indexes varies due to inflation indexes lagged by 3 months. In other words, we wish to estimate a model where fund profitability in period t is a function of determined inflation indexes in t − 3. For this, the researcher includes, as explanatory variables, inflation indexes that measure the change over time in the prices of a basket of goods and services acquired by European consumers, such as the Harmonized Index of Consumer Prices and the Monetary Union Index of Consumer Prices (both in t − 3). Since these indexes tend to be correlated over time, the resulting model will very probably present multicollinearity.

Such a phenomenon is not restricted to datasets where there is a temporal evolution. We can imagine another situation where a researcher wants to discover if the revenue for a sample of supermarkets in a month is due to square feet of sales area (ft2) and the number of employees assigned to each of the stores. Being that it is known that, for this type of retail operation, there is a certain correlation between the amount of sales area and number of employees, multicollinearity problems can occur in this cross-section as well.

Another common cause for multicollinearity is the use of datasets with an insufficient number of observations.

13.3.2.2 Consequences of Multicollinearity

The existence of multicollinearity has a direct impact on the calculation of the (X′X) matrix. To illustrate this problem, we present, by means of numerical examples, the calculations of the (X′X) and (X′X)− 1 matrices in three distinct cases of correlation between two explanatory variables: (a) perfect correlation; (b) very high, but not perfect, correlation; (c) low correlation.

(a) Perfect correlation

Imagine an X matrix with only two explanatory variables and two observations:

X = \begin{bmatrix} 1 & 4 \\ 2 & 8 \end{bmatrix}

Then:

X'X = \begin{bmatrix} 5 & 20 \\ 20 & 80 \end{bmatrix}

and, therefore, det(X′X) = 0, or rather, (X′X)− 1 cannot be calculated.

(b) Very high, but not perfect, correlation

Imagine now that the X matrix presents the following values:

X = \begin{bmatrix} 1 & 4 \\ 2 & 7.9 \end{bmatrix}

Then:

X'X = \begin{bmatrix} 5 & 19.8 \\ 19.8 & 78.41 \end{bmatrix}

from whence comes det(X′X) = 0.01 and, therefore:

(X'X)^{-1} = \begin{bmatrix} 7841 & -1980 \\ -1980 & 500 \end{bmatrix}

Since the variance-covariance matrix of the model parameters is given by σ2(X′X)− 1, and since the elements of the main diagonal of this matrix appear in the denominator of the t statistic, as studied in Section 13.2.3 (Expression (13.21)), the t statistics tend, in this case, to present underestimated values due to the existence of high values in the (X′X)− 1 matrix, which can eventually cause the researcher to consider the effects of some of the explanatory variables as insignificant. However, since the calculations of the F statistic and the R2 are not affected by this phenomenon, it is common to find models in which the explanatory variable coefficients are not statistically significant while the F-test rejects the null hypothesis at the same significance level, or rather, indicates that at least one parameter is statistically different from zero. In many cases, this inconsistency even comes accompanied by a high R2 value.

(c) Low correlation

Imagine, finally, that the X matrix comes to present the following values:

X = \begin{bmatrix} 1 & 4 \\ 2 & 3 \end{bmatrix}

Then:

X'X = \begin{bmatrix} 5 & 10 \\ 10 & 25 \end{bmatrix}

from which comes that det(X′X) = 25 and, therefore:

(X'X)^{-1} = \begin{bmatrix} 1 & -0.4 \\ -0.4 & 0.2 \end{bmatrix}

We can now verify that, given the low correlation between X1 and X2, the values presented in the (X′X)− 1 matrix are low, which has little influence on the reduction of the t statistics.
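The three cases can be reproduced with a minimal sketch (Python with NumPy assumed), which shows how det(X′X) shrinks, and the entries of (X′X)− 1 grow, as the correlation between the columns of X increases:

```python
# Minimal sketch: det(X'X) and (X'X)^(-1) for the three numerical cases above.
import numpy as np

cases = {
    "(a) perfect":   np.array([[1.0, 4.0], [2.0, 8.0]]),
    "(b) very high": np.array([[1.0, 4.0], [2.0, 7.9]]),
    "(c) low":       np.array([[1.0, 4.0], [2.0, 3.0]]),
}

for label, X in cases.items():
    XtX = X.T @ X
    det = np.linalg.det(XtX)
    print(label, "det(X'X) =", round(det, 4))
    if abs(det) > 1e-10:              # the inverse exists only when det is nonzero
        print(np.linalg.inv(XtX))     # large entries when det(X'X) is close to zero
```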

In Section 13.3.2.3, to follow, models are prepared using a dataset that allows for the study of these three situations.

13.3.2.3 Application of Multicollinearity Examples in Excel

Returning to the example used throughout this chapter, we now imagine that the professor wishes to evaluate the influence of the distance traveled (dist) and the number of intersections (cros) along the route on the time to get to school (time). To do this, the research questions were posed to students from three different classes (A, B, and C), so as to obtain, for each class, the following model:

time_i = a + b_1\,dist_i + b_2\,cros_i + u_i

The three cases presented refer to the data obtained from each of the three classes of students, respectively.

(a) Class A: The case of perfect correlation

Class A is composed only of students who live in the center of town—there coincidentally exists a perfect correlation between the distance traveled and the amount of intersections since each of the routes possesses the same characteristics and is always in the urban zone. The dataset collected from Class A is presented in Table 13.13.

Table 13.13

Class A and the Example of Perfect Correlation Between Explanatory Variables (Distance Traveled and Amount of Intersections)
Student      | Time to Get to School (min) (Yi) | Distance Traveled to School (km) (X1i) | Amount of Intersections (X2i)
Gabriela     | 15 | 8  | 16
Dalila       | 20 | 6  | 12
Gustavo      | 20 | 15 | 30
Leticia      | 40 | 20 | 40
Luiz Ovidio  | 50 | 25 | 50
Leonor       | 25 | 11 | 22
Ana          | 10 | 5  | 10
Antonio      | 55 | 32 | 64
Julia        | 35 | 28 | 56
Mariana      | 30 | 20 | 40

By means of the file Timedistcros_class_A.xls, we can prepare the multiple regression as shown in Fig. 13.35.

Fig. 13.35 Multiple linear regression for Class A.

The outputs are presented in Fig. 13.36.

Fig. 13.36 Multiple linear regression outputs for Class A.

As we can see, the parameter of variable X1 (dist) was not estimated, since the correlation between dist and cros is perfect and, therefore, the inversion of the (X′X) matrix is impossible, which, in this case, is given as:

X'X = \begin{bmatrix} 3704 & 7408 \\ 7408 & 14{,}816 \end{bmatrix}, from which we get that det(X′X) = 0.

In any case, since we know that crosi = 2 ⋅ disti, we can estimate the following model:

time_i = a + (b_1 + 2 b_2)\,dist_i + u_i

where the estimated parameter will be a linear combination between b1 and b2.

(b) Class B: The case of high, but not perfect, correlation

Class B, very similar to Class A in terms of travel characteristics, has only one student (Americo) who, because he uses an expressway, goes through one intersection less, proportionally, than the others, as can be seen in Table 13.14. As such, the correlation between dist and cros is no longer perfect, even though it is extremely high (in the case of this example, equal to 0.9998).

Table 13.14

Class B and the Example of Very High Correlation Between the Explanatory Variables (Distance Traveled and Number of Intersections)
Student     | Time to Get to School (min) (Yi) | Distance Traveled to School (km) (X1i) | Number of Intersections (X2i)
Grace       | 15 | 8  | 16
Phillip     | 20 | 6  | 12
Antonietta  | 20 | 15 | 30
Americo     | 40 | 20 | 39
Ferruccio   | 50 | 25 | 50
Francis     | 25 | 11 | 22
Camilo      | 10 | 5  | 10
William     | 55 | 32 | 64
Paula       | 35 | 28 | 56
Matthew     | 30 | 20 | 40

By means of the file Timedistcros_class_B.xls, we can prepare the same multiple regression, whose outputs are presented in Fig. 13.37.

Fig. 13.37 Multiple linear regression outputs for Class B.

In this case, as we have already discussed, it is possible to see an inconsistency between the F-test result and the t-test results, since the latter present underestimated statistics due to the higher values in the (X′X)− 1 matrix, or rather, due to the fact that det(X′X) is lower. In this case, we have:

X'X = \begin{bmatrix} 3704 & 7388 \\ 7388 & 14{,}737 \end{bmatrix}, from which it comes that det(X′X) = 3,304, which apparently is a high value; however, it is considerably lower than that calculated for Class C to follow. Besides this, in this case, we have that:

(X'X)^{-1} = \begin{bmatrix} 4.460 & -2.236 \\ -2.236 & 1.121 \end{bmatrix}

As a result, the outputs (Fig. 13.37) can cause a researcher to erroneously affirm that no parameter in the model in question is statistically significant, even though the F-test has indicated that at least one of them is statistically different from zero at the significance level of, for example, 5%, and even though the R2 itself is relatively high (R2 = 0.8379). This phenomenon represents a major error that can be committed in models with high multicollinearity between explanatory variables.

(c) Class C: The case of lower correlation

Class C is more heterogeneous in terms of travel characteristics, being that it is composed of students who also come from other communities and, therefore, use roads with a proportionally lower number of intersections along the route. The correlation between dist and cros, in this case, is 0.6505. Table 13.15 presents the dataset collected for Class C.

Table 13.15

Class C and the Example of a Lower Correlation Between Explanatory Variables (Distance Traveled and Number of Intersections)
Student    | Time to Get to School (min) (Yi) | Distance Traveled to School (km) (X1i) | Number of Intersections (X2i)
Juliana    | 15 | 8  | 12
Rachel     | 20 | 6  | 20
Larissa    | 20 | 15 | 25
Roger      | 40 | 20 | 37
Elizabeth  | 50 | 25 | 32
Wilson     | 25 | 11 | 17
Lauren     | 10 | 5  | 9
Sandra     | 55 | 32 | 60
Walter     | 35 | 28 | 12
Luke       | 30 | 20 | 17

The file Timedistcros_class_C.xls gives the data in Excel format, from which we can prepare the same multiple regression, whose outputs are presented in Fig. 13.38.

Fig. 13.38 Multiple linear regression outputs for Class C.

Now we can see that, given the lower correlation between dist and cros, the values present in the (X′X)− 1 matrix are much lower than those calculated for Class B, which has little influence on the reduction of the t statistics and, consequently, no inconsistencies occur between the t-tests and the F-test. In this case, we have:

X'X = \begin{bmatrix} 3704 & 4959 \\ 4959 & 7965 \end{bmatrix}, from which comes that det(X′X) = 4,910,679, which is a much higher value than that calculated for the previous case. Besides this, we have that:

(X'X)^{-1} = \begin{bmatrix} 0.0016 & -0.0010 \\ -0.0010 & 0.0008 \end{bmatrix}

13.3.2.4 Multicollinearity Diagnostics

The first and most simple method of multicollinearity diagnosis refers to the identification of high correlations between explanatory variables by means of the analysis of the correlation matrix. On one hand, this method presents great ease of application; on the other, it is unable to identify eventual existing relations between more than two variables simultaneously.

The second, less used, method refers to the study of the determinant of the (X′X) matrix. According to what we have studied in the two previous sections, very low det(X′X) values can indicate the presence of high correlations between explanatory variables, which hinders the analysis of the t statistics.

Last, but not least important, is the multicollinearity diagnostic prepared by means of the estimation of auxiliary regressions. According to Vasconcellos and Alves (2000), based on Expression (13.30), the following auxiliary regressions can be estimated:

X_{1i} = a + b_1 X_{2i} + b_2 X_{3i} + \dots + b_{k-1} X_{ki} + u_i
X_{2i} = a + b_1 X_{1i} + b_2 X_{3i} + \dots + b_{k-1} X_{ki} + u_i
\vdots
X_{ki} = a + b_1 X_{1i} + b_2 X_{2i} + \dots + b_{k-1} X_{(k-1)i} + u_i   (13.36)

and, for each of them, there will be an Rk2. If one or more of these auxiliary Rk2 values is high, we can consider the existence of multicollinearity. Based on them, we can define the Tolerance and VIF (Variance Inflation Factor) statistics, as follows:

Tolerance = 1 - R_k^2   (13.37)

VIF = \dfrac{1}{Tolerance}   (13.38)

Being thus, if the Tolerance is very low and, consequently, the VIF statistic high, we have an indication of multicollinearity problems. In other words, if the Tolerance is low for a certain auxiliary regression, it means that the explanatory variable that performs the dependent role in the auxiliary regression shares a high percentage of variance with the other explanatory variables.

While many authors state that multicollinearity problems arise with VIF values above 10, we notice that a VIF value equal to 4 results in a Tolerance of 0.25, or rather, in an Rk2 of 0.75 for that determined auxiliary regression, which represents a relatively high percentage of shared variance between a certain explanatory variable and the others.
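A minimal sketch of this diagnostic (Python with NumPy and statsmodels assumed), using the Class B data as an illustration, estimates an auxiliary regression as in Expression (13.36) and derives the Tolerance and VIF of Expressions (13.37) and (13.38):

```python
# Minimal sketch: Tolerance and VIF from an auxiliary regression (Class B data).
import numpy as np
import statsmodels.api as sm

dist = np.array([8, 6, 15, 20, 25, 11, 5, 32, 28, 20], dtype=float)
cros = np.array([16, 12, 30, 39, 50, 22, 10, 64, 56, 40], dtype=float)

# auxiliary regression: dist as a function of the other explanatory variable (cros)
r2_aux = sm.OLS(dist, sm.add_constant(cros)).fit().rsquared
tolerance = 1 - r2_aux   # Expression (13.37)
vif = 1 / tolerance      # Expression (13.38)
print(tolerance, vif)    # very low Tolerance / very high VIF signals multicollinearity
```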

13.3.2.5 Possible Solutions for the Multicollinearity Problem

Multicollinearity represents one of the most difficult problems to be solved in data modeling. While some researchers simply apply the Stepwise procedure in order to eliminate correlated explanatory variables, which can in fact correct the multicollinearity, such a solution can generate a specification problem through the omission of a relevant variable, as we will discuss in Section 13.3.5.

The creation of orthogonal factors based on the explanatory variables, by means of applying the factor analysis technique (Chapter 12), can correct multicollinearity problems. For forecasting purposes, however, the corresponding factors for new observations will not be known, which creates a problem for the researcher. Besides this, the creation of factors always entails the loss of a portion of the variance of the original explanatory variables.

The good news, as Vasconcellos and Alves (2000) also discuss, is that the existence of multicollinearity does not affect the intention to prepare forecasts, provided that the same conditions that generated the results are maintained for the forecast. In this way, the forecasts will incorporate the same pattern of relation between the explanatory variables, which does not present any problem. Gujarati (2011) also highlights that the existence of high correlations between explanatory variables does not necessarily generate bad or weak estimators and that the presence of multicollinearity does not mean that the model has problems. In other words, some authors argue that a solution for multicollinearity is to identify it, recognize it, and do nothing.
