Chapter 13

Simple and Multiple Regression Models

Abstract

This chapter presents the simple and multiple linear regression models, establishing the circumstances upon which they can be used. The parameters of the simple and multiple regression models are estimated by the least squares method and the model presuppositions are analyzed by means of tests and specific statistics. For the effect of forecast, confidence intervals of the model parameters are prepared. Nonlinear regression models are also specified, as well as the definition of the best functional form and the Box-Cox transformation. Finally, regression models are estimated in Microsoft Office Excel®, Stata Statistical Software®, and IBM SPSS Statistics Software®, and their results are interpreted.

Keywords

Simple and multiple linear regression; Nonlinear regression; Ordinary least squares; Functional form; Box-Cox transformation; Coefficient of determination R2; t-test; F-test; Confidence intervals for forecasts; Excel; Stata; SPSS software

… because politics is for the present, but an equation is something for eternity.

Albert Einstein

13.1 Introduction

Of the techniques studied in this book, without a doubt, those known as simple and multiple linear regression models are the most used in the different fields of knowledge.

Imagine that a group of researchers is interested in studying how the rate of return for a financial asset behaves in relation to the market, or how company expense varies when the factory increases its productive capability or increases the number of work hours, or, yet, how the number of bedrooms and amount of floor space in a residential real estate sample can influence the formation of sales prices.

Notice that, in all of these examples, the main phenomenon of interest is represented, in each case, by a metric or quantitative variable and can, therefore, be studied by means of linear regression models, whose main goal is to analyze how a set of explanatory variables, metric or dummies, relates to a metric dependent variable (the outcome variable that represents the phenomenon under study), provided that some conditions are respected and some presuppositions are met, as we shall see in this chapter.

It is important to emphasize that any and all linear regression models should be defined based on the subjacent theory and the experience of the researcher, such that it is possible to estimate the desired model, analyze the results obtained by means of statistical tests and prepare forecasts.

In this chapter, we will consider the simple and multiple linear regression models, with the following objectives: (1) Introduce the concepts of simple and multiple linear regression, (2) Interpret results obtained and prepare forecasts, (3) Discuss the technique presuppositions and (4) Present the application of the technique in Excel, Stata, and SPSS. Initially, the solution to an example will be prepared in Excel simultaneously to the presentation of the concepts and the manual solution of the example. Only after the introduction of the concepts will the procedures for the preparation of the regression technique be presented in Stata and SPSS.

13.2 Linear Regression Models

First, we will address linear regression models and their presuppositions. An analysis of nonlinear regressions will be covered in Section 13.4.

According to Fávero et al. (2009), the linear regression technique offers, primarily, the ability to study the relation between one or more explanatory variables, which are presented in a linear form, and a quantitative dependent variable. As such, a general linear regression model can be defined as follows:

$Y_i = a + b_1 X_{1i} + b_2 X_{2i} + \ldots + b_k X_{ki} + u_i$  (13.1)

where Y represents the phenomenon under study (quantitative dependent variable), a represents the intercept (constant or linear coefficient), bj (j = 1, 2, …, k) are the coefficients of each variable (angular coefficients), Xj are explanatory variables (metrics or dummies), and u is the error term (difference between the real value of Y and the predicted value of Y by means of the model for each observation). The subscripts i represent each of the observations of the sample under analysis (i = 1, 2, …, n, where n is the size of the sample).

The equation presented by means of Expression (13.1) represents a multiple linear regression model, since it considers the inclusion of various explanatory variables for the study of the phenomenon in question. On the other hand, if only one X variable is inserted, we have before us a simple linear regression model. For didactic reasons, we will introduce the concepts and present the step-by-step process of estimating the parameters by means of a simple regression model. Following, we will amplify the discussion by means of estimation in multiple regression models, including the consideration of dummy variables on the right side of the equation.

It is important to emphasize, therefore, that the estimated simple linear regression model presents the following expression:

$\hat{Y}_i = \alpha + \beta X_i$  (13.2)

where $\hat{Y}_i$ represents the predicted value of the dependent variable, which will be obtained by means of the model estimation for each observation i, and α and β represent the estimated parameters of the intercept and the slope of the proposed model, respectively. Fig. 13.1 presents, graphically, the general configuration of an estimated simple linear regression model.

Fig. 13.1
Fig. 13.1 Estimated simple linear regression model.

We can, therefore, verify that, while the estimated parameter α shows the point on the regression line where X = 0, the estimated parameter β represents the slope of the model, or rather, the average increase (or decrease) in Y for each additional unit of X.

Hence, the inclusion of the error term u in Expression (13.1), also known as the residual, is justified by the fact that any relation that can be proposed will rarely present itself perfectly. In other words, the phenomenon under study, represented by variable Y, will very probably present a relation with some X variable not included in the proposed model and that, therefore, will need to be represented by the error term u. As such, the error term u, for each observation i, can be written as:

$u_i = Y_i - \hat{Y}_i$  (13.3)

According to Kennedy (2008), Fávero et al. (2009), and Wooldridge (2012), error terms occur due to some reasons that need to be known and considered by the researchers, such as:

  •  Existence of aggregated and/or nonrandom variables.
  •  Failures in the specification of the model (nonlinear forms and omission of relevant explanatory variables).
  •  Errors in data gathering.

More consideration regarding error terms will be made in the study of regression model presuppositions, in Section 13.3.

Having discussed the preliminary concepts, we shall now begin the study of linear regression models estimation.

13.2.1 Estimation of the Linear Regression Model by Ordinary Least Squares

We often glimpse, in a rational or intuitive way, relations between variable behaviors, whether direct or indirect. If I swim more often at my club, will I increase my muscle mass? If I change jobs, will I have more time to spend with my children? If I save a greater portion of my wages, will I be able to retire at a younger age? These questions suggest clear relations between a certain dependent variable, which represents the phenomenon we wish to study, and, in each case, a single explanatory variable.

The objective of regression analysis is, therefore, to provide conditions for the researcher to evaluate how a Y variable behaves based on the behavior of one or more X variables, without, necessarily, the occurrence of a cause and effect relationship.

We will introduce the concepts of regression by means of an example that considers only one explanatory variable (simple linear regression). Imagine that, on a certain class day for a group of 10 students, the professor is interested in discovering the influence of the distance traveled to get to school over the travel time. The professor completes a questionnaire with each of the 10 students and prepares a dataset, which can be found in Table 13.1.

Table 13.1

Example: Travel Time × Distance Traveled
Student | Time to Get to School (min) | Distance Traveled to School (km)
Gabriela | 15 | 8
Dalila | 20 | 6
Gustavo | 20 | 15
Leticia | 40 | 20
Luiz Ovidio | 50 | 25
Leonor | 25 | 11
Ana | 10 | 5
Antonio | 55 | 32
Julia | 35 | 28
Mariana | 30 | 20

In fact, the professor wants to know the equation that governs the phenomenon “travel time to school” as a function of “distance traveled by students.” It is known that other variables influence the time of a certain route, such as the route taken, the type of transportation, or the time at which the student left for school that day. However, the professor knows that such variables will not be part of the model, since they were not collected for the formation of the dataset.

The problem can therefore be modeled in the following manner:

$time = f(dist)$

As such, the equation, or simple regression model, will be:

$time_i = a + b \cdot dist_i + u_i$

and, in this way, the expected value (estimate) of the dependent variable, for each i observation, will be given as:

$\hat{time}_i = \alpha + \beta \cdot dist_i$

where α and β are the estimates of parameters a and b, respectively.

This last equation shows that the expected value of the time variable ($\hat{Y}$), also known as the conditional mean, is calculated for each sample observation as a function of the behavior of the dist variable, where the subscript i represents, for our example data, the school students (i = 1, 2, …, 10). Our objective here is, therefore, to study whether the behavior of the dependent variable time presents a relation with the variation of the distance, in kilometers, that each student travels to arrive at school on a certain class day.

In our example, it does not make much sense to discuss time traveled when the distance to school is zero (parameter α). Parameter β, on the other hand, will inform us regarding the increase in time to arrive at school by increasing the distance traveled by one kilometer, on average.

We shall, as such, prepare a graph (Fig. 13.2) that relates the travel time (Y) with the distance traveled (X), where each point represents one of the students.

Fig. 13.2
Fig. 13.2 Travel time × distance traveled for each student.

As previously commented, it is not only the distance traveled that affects the time needed to get to school since it can also be affected by other variables related to traffic, means of transportation, or the individual. As such, the error term u should capture the effect of the remaining variables not included in the model. Now, in order to estimate the equation that best adjusts to this cloud of points, we should establish two fundamental conditions related to the residuals.

  (1) The sum of the residuals should be zero: $\sum_{i=1}^{n} u_i = 0$, where n is the sample size.

With only this first condition, several lines of regression can be found where the sum of the residuals is zero, as is shown in Fig. 13.3.

Fig. 13.3
Fig. 13.3 (A–C) Three examples of lines of regression where the sum of residuals is zero.

Notice that, for the same dataset, several lines can respect the condition that the sum of the residuals is equal to zero. Therefore, it becomes necessary to establish a second condition.

  (2) The residual sum of squares is the least possible: $\sum_{i=1}^{n} u_i^2 = \min$.

With this condition, we choose the model that presents the best possible adjustment to the cloud of points, which gives us the definition of least squares. In other words, α and β should be determined in such a way that the sum of the squares of the residuals is the least possible (ordinary least squares, or OLS, method). As such:

$\sum_{i=1}^{n}(Y_i - \beta X_i - \alpha)^2 = \min$  (13.4)

The minimization is obtained by differentiating Expression (13.4) with respect to α and β and setting the resulting expressions equal to zero. As such:

$\frac{\partial\left[\sum_{i=1}^{n}(Y_i - \beta X_i - \alpha)^2\right]}{\partial \alpha} = -2\sum_{i=1}^{n}(Y_i - \beta X_i - \alpha) = 0$  (13.5)

$\frac{\partial\left[\sum_{i=1}^{n}(Y_i - \beta X_i - \alpha)^2\right]}{\partial \beta} = -2\sum_{i=1}^{n}X_i(Y_i - \beta X_i - \alpha) = 0$  (13.6)

Distributing and dividing Expression (13.5) by 2·n, where n is the sample size, we have that:

$\frac{-2\sum_{i=1}^{n}Y_i}{2n} + \frac{2\beta\sum_{i=1}^{n}X_i}{2n} + \frac{2\sum_{i=1}^{n}\alpha}{2n} = \frac{0}{2n}$  (13.7)

from which comes:

$-\bar{Y} + \beta\bar{X} + \alpha = 0$  (13.8)

and, therefore:

$\alpha = \bar{Y} - \beta\bar{X}$  (13.9)

where $\bar{Y}$ and $\bar{X}$ represent the sample averages of Y and X, respectively.

In substituting this result in Expression (13.6), we have that:

$-2\sum_{i=1}^{n}X_i(Y_i - \beta X_i - \bar{Y} + \beta\bar{X}) = 0$  (13.10)

which, in developing:

$\sum_{i=1}^{n}X_i(Y_i - \bar{Y}) + \beta\sum_{i=1}^{n}X_i(\bar{X} - X_i) = 0$  (13.11)

which therefore generates:

$\beta = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}$  (13.12)

Returning to our example, the professor then prepares a calculation spreadsheet in order to obtain the linear regression model, as shown in Table 13.2.

Table 13.2

Calculation Spreadsheet for the Determination of α and β
Observation (i) | Time (Yi) | Distance (Xi) | Yi − Ȳ | Xi − X̄ | (Xi − X̄)(Yi − Ȳ) | (Xi − X̄)²
1 | 15 | 8 | −15 | −9 | 135 | 81
2 | 20 | 6 | −10 | −11 | 110 | 121
3 | 20 | 15 | −10 | −2 | 20 | 4
4 | 40 | 20 | 10 | 3 | 30 | 9
5 | 50 | 25 | 20 | 8 | 160 | 64
6 | 25 | 11 | −5 | −6 | 30 | 36
7 | 10 | 5 | −20 | −12 | 240 | 144
8 | 55 | 32 | 25 | 15 | 375 | 225
9 | 35 | 28 | 5 | 11 | 55 | 121
10 | 30 | 20 | 0 | 3 | 0 | 9
Sum | 300 | 170 | | | 1155 | 814
Average | 30 | 17 | | | |

By means of the spreadsheet presented in Table 13.2, we can calculate the estimators α and β as follows:

$\beta = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2} = \frac{1155}{814} = 1.4189$

$\alpha = \bar{Y} - \beta\bar{X} = 30 - 1.4189 \times 17 = 5.8784$

And the simple linear regression equation can be written as:

$\hat{time}_i = 5.8784 + 1.4189 \cdot dist_i$
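For readers who prefer to check these calculations outside of Excel, the sketch below reproduces Expressions (13.9) and (13.12) for the data in Table 13.1. It assumes Python with NumPy, which is not used elsewhere in this chapter; it is only an illustration of the closed-form OLS estimators, not part of the chapter's own procedures.

```python
import numpy as np

# Data from Table 13.1: travel time (min) and distance traveled (km)
time = np.array([15, 20, 20, 40, 50, 25, 10, 55, 35, 30], dtype=float)
dist = np.array([8, 6, 15, 20, 25, 11, 5, 32, 28, 20], dtype=float)

# Expression (13.12): beta = sum((Xi - Xbar)(Yi - Ybar)) / sum((Xi - Xbar)^2)
beta = np.sum((dist - dist.mean()) * (time - time.mean())) / np.sum((dist - dist.mean()) ** 2)

# Expression (13.9): alpha = Ybar - beta * Xbar
alpha = time.mean() - beta * dist.mean()

print(round(alpha, 4), round(beta, 4))  # approximately 5.8784 and 1.4189
```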

The estimation of our example model can be done by means of the Solver tool in Excel, respecting the conditions that $\sum_{i=1}^{10} u_i = 0$ and $\sum_{i=1}^{10} u_i^2 = \min$. In this way, we can initially open the file TimeLeastSquares.xls, which contains our example data, besides the columns referring to $\hat{Y}$, to u, and to u² for each observation. Fig. 13.4 presents this file before the preparation of the Solver procedure.

Fig. 13.4
Fig. 13.4 TimeLeastSquares.xls dataset.

According to the logic proposed by Belfiore and Fávero (2012), we now open the Excel Solver tool. The objective function is in cell E13, which is our target cell and which should be minimized (residual sum of squares). Besides this, parameters α and β, whose values are in cells H3 and H5, respectively, are the variable cells. Finally, we should impose that the value of cell D13 equal zero (restriction that the sum of the residuals be equal to zero). The Solver window will be as shown in Fig. 13.5.

Fig. 13.5
Fig. 13.5 Solver—minimization of the residual sum of squares.

By clicking on Solve and then OK, we obtain the best solution to the minimization of the residual sum of squares. Fig. 13.6 presents the results obtained by the model.

Fig. 13.6
Fig. 13.6 Obtaining the parameters of the sum minimization of u2 by Solver.

Therefore, the intercept α is 5.8784 and the angular coefficient β is 1.4189, as we estimated by means of the analytical solution. In an elementary reading, the average time to get to school for students who did not travel any distance, or rather, who were already at school, is 5.8784 min, which does not make much sense from a physical point of view. This type of situation occurs frequently, where values for α are not in keeping with reality. From the mathematical point of view, this is not incorrect. However, the researcher should always analyze the physical or economic sense of the situation under study, as well as the underlying theory used. In analyzing the graph in Fig. 13.2, we notice that there is no student with a distance traveled near zero, and the intercept only reflects an extension, projection, or extrapolation of the regression line up to the Y axis. It is even common for some models to present a negative α in the study of phenomena that cannot take negative values. Therefore, the researcher should always be aware of this fact, since a regression model can be quite useful for elaborating inferences regarding the behavior of a Y variable within the limits of the X variation, or rather, for the elaboration of interpolations. Extrapolations, on the other hand, can produce inconsistencies due to eventual changes in the behavior of the Y variable outside the limits of the X variation in the study sample.

Continuing the analysis, each additional kilometer in the distance between the departure point and the school increases travel time by 1.4189 min, on average. As such, a student who lives 10 km farther from school than another will tend to spend, on average, a little more than 14 min (1.4189 × 10) longer to get to school than the classmate who lives closer. Fig. 13.7 presents the simple linear regression model from our example.

Fig. 13.7
Fig. 13.7 Simple linear regression model between time and distance traveled.

Concomitant to the discussion of each of the concepts and to the solution of the proposed example in analytical form and with Solver, we will also present the systematic solution by means of the Excel Regression tool. In Sections 13.5 and 13.6, we will embark on the final solution by means of Stata and SPSS, respectively. In this way, we will now open the file Timedist.xls, which contains the data from our example, or rather, the fictitious travel time and distance covered by a group of students to the school location.

By clicking on Data → Data Analysis, the dialog box from Fig. 13.8 will appear.

Fig. 13.8
Fig. 13.8 Dialog box for data analysis in Excel.

We now click on Regression and then OK. The dialog box for insertion of data to be considered in regression will now appear (Fig. 13.9).

Fig. 13.9
Fig. 13.9 Dialog box for estimation of linear regression in Excel.

For our example, the time (in minutes) variable is the (Y) dependent and the dist (in kilometers) variable is the (X) explanatory. Therefore, we must insert their data in the respective entry intervals, according to what is shown in Fig. 13.10.

Fig. 13.10
Fig. 13.10 Insertion of data for estimation of linear regression in Excel.

Besides the insertion of data, we will also select the Residuals option, according to what is shown in Fig. 13.10. Following, we click on OK. A new spreadsheet will be generated with the regression outputs. We will analyze each of them according to when the concepts are introduced, as well as perform the calculations manually.

According to what we can observe by means of Fig. 13.11, four groups of outputs are generated: regression statistics, analysis of variance (ANOVA), table of regression coefficients, and residuals table. We will discuss each.

Fig. 13.11
Fig. 13.11 Simple linear regression outputs in Excel.

As calculated previously, we can verify the regression equation coefficients in the outputs (Fig. 13.12).

Fig. 13.12
Fig. 13.12 Linear regression equation coefficients.

13.2.2 Explanatory Power of the Regression Model: Coefficient of Determination R2

According to Fávero et al. (2009), to measure the explanatory power of a certain regression model, or the percentage of variability of the Y variable, which is explained by the variation of behavior of the explanatory variables, we need to understand some important concepts. While the total sum of squares (TSS) shows the variation in Y in regards to its own average, the sum of squares due to regression (SSR) offers a variation of Y considering the X variables used in the model. Besides this, the residual sum of squares (RSS) presents the variation of Y, which is not explained in the prepared model. We can therefore define that:

$TSS = SSR + RSS$  (13.13)

being:

$Y_i - \bar{Y} = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)$  (13.14)

where Yi is the value of Y in each observation i of the sample, $\bar{Y}$ is the average of Y, and $\hat{Y}_i$ represents the adjusted value of the regression model for each observation i. As such, we have that:

  • $Y_i - \bar{Y}$: total deviation of the value of each observation in relation to the average,
  • $\hat{Y}_i - \bar{Y}$: deviation of the value of the regression model for each observation in relation to the average,
  • $Y_i - \hat{Y}_i$: deviation of the value of each observation in relation to the regression model,

which results in:

$\sum_{i=1}^{n}(Y_i - \bar{Y})^2 = \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2$  (13.15)

or:

$\sum_{i=1}^{n}(Y_i - \bar{Y})^2 = \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n}(u_i)^2$  (13.16)

which is the very Expression (13.13).

Fig. 13.13 graphically shows this relation.

Fig. 13.13
Fig. 13.13 Deviations of Y for two observations.

With these considerations made and the regression equation defined, we embark on the study of the explanatory power of the regression model, also known as the coefficient of determination R2. Stock and Watson (2004) define R2 as the fraction of the variance of the Yi sample explained (or predicted) by the explanatory variables. In the same way, Wooldridge (2012) considers R2 as the proportion of sample variation of the dependent variable explained by the set of explanatory variables, able to be used as a measure of degree of adjustment for the proposed model.

According to Fávero et al. (2009), the explanatory capacity of the model is analyzed by the coefficient of determination R2 of the regression. For a simple regression model, this measure shows how much of the behavior of the Y variable is explained by the variation in behavior of the X variable, always remembering that there is not, necessarily, a cause and effect relationship between the X and Y variables. For the multiple regression model, this measure shows how much of the behavior of the Y variable is explained by the joint variation of the X variables considered in the model.

The R2 is obtained in the following manner:

$R^2 = \frac{SSR}{SSR + RSS} = \frac{SSR}{TSS}$  (13.17)

or

$R^2 = \frac{\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n}(u_i)^2}$  (13.18)

Also according to Fávero et al. (2009), the R2 can vary between 0 and 1 (0%–100%); however, it is practically impossible to obtain an R2 equal to 1, since it would be very difficult for all the points to fall on a line. In other words, if the R2 were 1, there would be no residuals for each of the observations in the sample under study, and the variability of the Y variable would be totally explained by the vector of X variables considered in the regression model.

The more disperse the cloud of points, the less the X and Y variables will relate, the residuals will be greater, and the R2 will be closer to zero. In an extreme case, if the X variation does not correspond to any variation in Y, the R2 will be zero. Fig. 13.14 presents, in an illustrative manner, the behavior of R2 in different cases.

Fig. 13.14
Fig. 13.14 R2 behavior for different simple linear regressions.

Returning to our example where the professor intends to study the behavior of the time students take to get to school and if this phenomenon is influenced by distance traveled by the students, we present the following spreadsheet (Table 13.3), which will aid us in calculating the R2.

Table 13.3

Spreadsheet for the Calculation of the Coefficient of Determination R2 of the Regression Model
Observation (i) | Time (Yi) | Distance (Xi) | Ŷi | ui = Yi − Ŷi | (Ŷi − Ȳ)² | (ui)²
1 | 15 | 8 | 17.23 | −2.23 | 163.08 | 4.97
2 | 20 | 6 | 14.39 | 5.61 | 243.61 | 31.45
3 | 20 | 15 | 27.16 | −7.16 | 8.05 | 51.30
4 | 40 | 20 | 34.26 | 5.74 | 18.12 | 32.98
5 | 50 | 25 | 41.35 | 8.65 | 128.85 | 74.80
6 | 25 | 11 | 21.49 | 3.51 | 72.48 | 12.34
7 | 10 | 5 | 12.97 | −2.97 | 289.92 | 8.84
8 | 55 | 32 | 51.28 | 3.72 | 453.00 | 13.81
9 | 35 | 28 | 45.61 | −10.61 | 243.61 | 112.53
10 | 30 | 20 | 34.26 | −4.26 | 18.12 | 18.12
Sum | 300 | 170 | | | 1638.85 | 361.15
Average | 30 | 17 | | | |

Obs.: Where $\hat{Y}_i = \hat{time}_i = 5.8784 + 1.4189 \cdot dist_i$.

The spreadsheet presented in Table 13.3 allows us to calculate the R2 of the simple linear regression model for our example. As such:

$R^2 = \frac{\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n}(u_i)^2} = \frac{1638.85}{1638.85 + 361.15} = 0.8194$

In this way, we can now affirm that, for the sample studied, 81.94% of the variability in the time to get to school is due to the variable referring to the distance traveled along the route taken by each student. And, therefore, a little more than 18% of the variability is due to other variables not included in the model, which appear as variation in the residuals.
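Continuing the illustrative Python sketch started in Section 13.2.1 (alpha, beta, time, and dist were defined there), the R2 of Expression (13.18) can be checked as follows; this is only a verification aid under that assumption, not part of the chapter's Excel procedure.

```python
# Fitted values and residuals (Expression 13.3)
y_hat = alpha + beta * dist
u = time - y_hat

ssr = np.sum((y_hat - time.mean()) ** 2)  # sum of squares due to regression
rss = np.sum(u ** 2)                      # residual sum of squares

r2 = ssr / (ssr + rss)                    # Expression (13.18)
print(round(r2, 4))                       # approximately 0.8194
```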

The outputs generated by Excel also bring out this information, according to what can be seen in Fig. 13.15.

Fig. 13.15
Fig. 13.15 Coefficient of determination R2 of the regression.

Note that the outputs also supply the values of $\hat{Y}$ and the residuals for each observation, as well as the minimum value of the sum of the squares of the residuals, which are exactly equal to those obtained by the estimation of the parameters by means of the Excel Solver tool (Fig. 13.6) and also calculated and presented in Table 13.3. By means of these values, we can now calculate the R2.

According to Stock and Watson (2004) and Fávero et al. (2009), the coefficient of determination R2 does not tell researchers if a certain explanatory variable is statistically significant and if this variable is the true cause of the change in behavior for the dependent variable. More than that, the R2 does not provide the ability to evaluate the existence of an eventual bias in the omission of explanatory variables and if the choice of those inserted into the proposed model was appropriate.

The importance given to the R2 dimension is often excessive. In different situations, researchers highlight the adequacy of their models by the high R2 values obtained, even giving emphasis to a cause and effect relationship between the explanatory variables and the dependent variable, which is quite erroneous, since this measure merely captures the relation between the variables used in the model. Wooldridge (2012) is even more emphatic, highlighting that it is fundamental not to give considerable importance to the R2 value in the evaluation of regression models.

According to Fávero et al. (2009), if we are able, for example, to find a variable that explains a 40% return on stock, this could at first seem like a low capacity of explanation. However, if a single variable is able to capture this entire relationship in a situation where innumerable other economic, financial, perceptual, and social factors exist, the model could be quite satisfactory.

The general statistical significance of the model and its estimated parameters is not given by the R2, but by means of appropriate statistical tests, which we will study in the next section.

13.2.3 General Statistical Significance of the Regression Model and Each of Its Parameters

To begin, it is of fundamental importance to study the general statistical significance of the estimated model. With this in mind, we should make use of the F-test, with its null and alternative hypotheses, for a general regression model, which are:

  • H0: β1 = β2 = … = βk = 0
  • H1: there is at least one βj ≠ 0, respectively

And, for a simple regression model, therefore, these hypotheses are expressed as:

  • H0: β = 0
  • H1: β ≠ 0

This test allows the researcher to verify if the model that is being estimated does in fact exist, since if all the βj (j = 1, 2, …, k) are statistically equal to zero, the alteration behavior of each of the explanatory variables will not influence in any way the variation behavior of the dependent variable. The F statistic is presented in the following expression:

$F = \frac{\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2 / (k-1)}{\sum_{i=1}^{n}(u_i)^2 / (n-k)} = \frac{SSR/(k-1)}{RSS/(n-k)}$  (13.19)

where k represents the number of parameters of the estimated model (including the intercept) and n, the size of the sample.

Therefore, we can obtain an F statistic expression based on the R2 expression presented in Expression (13.17). As such, we have that:

$F = \frac{SSR/(k-1)}{RSS/(n-k)} = \frac{R^2/(k-1)}{(1-R^2)/(n-k)}$  (13.20)

Returning, then, to our initial example, we obtain:

$F = \frac{1638.85/(2-1)}{361.15/(10-2)} = 36.30$

where, for 1 degree of freedom for the regression (k − 1 = 1) and 8 degrees of freedom for the residuals (n − k = 10 − 2 = 8), we have, by means of Table A in the Appendix, that Fc = 5.32 (critical F at the significance level of 5%). In this way, as the calculated F, Fcal = 36.30 > Fc = F1,8,5% = 5.32, we can reject the null hypothesis that all the βj parameters are statistically equal to zero. At least one X variable is statistically significant to explain the variability of Y, and we will have a statistically significant regression model for forecasting purposes. As, in this case, we have only one X variable (simple regression), it will be statistically significant, at the significance level of 5%, to explain the behavior of the variation of Y.
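The same check can be extended to the F-test of Expression (13.19). The sketch below assumes SciPy is available (an assumption of this illustration; the chapter itself uses statistical tables and Excel functions) and reuses ssr and rss from the previous snippet.

```python
from scipy.stats import f as f_dist  # assumes SciPy is installed

n, k = len(time), 2                      # sample size and number of parameters
F = (ssr / (k - 1)) / (rss / (n - k))    # Expression (13.19); approximately 36.30

F_crit = f_dist.ppf(0.95, k - 1, n - k)  # critical F at the 5% level; approximately 5.32
p_value = f_dist.sf(F, k - 1, n - k)     # the "F significance" reported by Excel

print(F > F_crit, p_value < 0.05)        # both True: reject H0
```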

The outputs offer, by means of analysis of variance (ANOVA), the F statistic and its corresponding significance level (Fig. 13.16).

Fig. 13.16
Fig. 13.16 ANOVA output—F-test for joint evaluation of parameters significance.

Software, such as Excel, Stata, and SPSS, do not directly offer Fc for the degrees of freedom defined and the determined significance level. However, they do offer the significance level of Fcal for these degrees of freedom. As such, instead of analyzing if Fcal > Fc, we should verify if the significance level of Fcal is less than 0.05 (5%) so as to give continuity to the regression analysis. Excel calls this significance level F significance. As such:

  • If the F significance is < 0.05, there is at least one βj ≠ 0.

The Fcal significance level can be obtained in Excel by means of the command Formulas → Insert Function → FDIST, which will open a dialog box as shown in Fig. 13.17.

Fig. 13.17
Fig. 13.17 Obtaining the F significance level (command Insert Function).

Many models present more than one explanatory X variable and, as the F-test evaluates the joint significance of the explanatory variables, it is unable to define which one or ones of the variables considered in the model present parameters estimated to be statistically different from zero, at a certain significance level. Therefore, it is necessary that the researcher evaluate whether each of the parameters of the regression model is statistically different from zero, so as to determine if its respective X variable should, in fact, be included in the proposed model.

The t statistic, also studied in Chapter 9, is important to provide the researcher with the statistical significance of each parameter to be considered in the regression model, and the hypotheses of the corresponding test (t-test) for the intercept and for each βj (j = 1, 2, …, k) are:

  • H0: α = 0
  • H1: α ≠ 0
  • H0: βj = 0
  • H1: βj ≠ 0, respectively

This test provides the researcher with a verification of the statistical significance of each estimated parameter, α and βj, and its expression is given as:

$t_\alpha = \frac{\alpha}{s.e.(\alpha)} \qquad t_{\beta_j} = \frac{\beta_j}{s.e.(\beta_j)}$  (13.21)

where s.e. corresponds to the standard error of each parameter under analysis, which will be discussed later. After obtaining the t-statistics, the researcher can use the respective distribution tables to obtain the critical values for a given significance level and verify if such tests reject the null hypothesis or not. However, as in the case of the F-test, the statistical packages also offer the values of the levels of significance for the t-tests, called P-values, which facilitates the decision, being that, with a 95% confidence level (5% significance level), we will have:

  • If P-value t < 0.05 for intercept, α ≠ 0
  • and
  • If P-value t < 0.05 for a certain X variable, β ≠ 0.

Using the data from our initial example, we have the standard error for the regression as:

$s.e. = \sqrt{\frac{\sum_{i=1}^{n}(u_i)^2}{n-k}} = \sqrt{\frac{361.15}{10-2}} = 6.7189$

which is also provided by the Excel outputs (Fig. 13.18).

Fig. 13.18
Fig. 13.18 Standard error calculation.

Based on Expression (13.21), we can calculate, for our example:

$t_\alpha = \frac{\alpha}{s.e.(\alpha)} = \frac{5.8784}{6.7189\sqrt{a_{jj}}}$

$t_\beta = \frac{\beta}{s.e.(\beta)} = \frac{1.4189}{6.7189\sqrt{a_{jj}}}$

where ajj is the jth element of the main diagonal of the matrix resulting from the following calculation:

$\left[\begin{pmatrix}1 & 1 & 1 & \cdots\\ 8 & 6 & 15 & \cdots\end{pmatrix}\cdot\begin{pmatrix}1 & 8\\ 1 & 6\\ 1 & 15\\ \vdots & \vdots\end{pmatrix}\right]^{-1} = \begin{pmatrix}0.4550 & -0.0209\\ -0.0209 & 0.0012\end{pmatrix}$

which therefore results in:

$t_\alpha = \frac{\alpha}{s.e.(\alpha)} = \frac{5.8784}{6.7189\sqrt{0.4550}} = \frac{5.8784}{4.532} = 1.2969$

$t_\beta = \frac{\beta}{s.e.(\beta)} = \frac{1.4189}{6.7189\sqrt{0.0012}} = \frac{1.4189}{0.2354} = 6.0252$

which, for 8 degrees of freedom (n − k = 10 − 2 = 8), we have, by means of Table B in the Appendix, that tc = 2.306 for the significance level of 5% (probability on the upper tail of 0.025 for the two-tailed distribution). As such, being that the tcal = 1.2969 < tc = t8,2.5% = 2.306, we cannot reject the null hypothesis that the α parameter is statistically equal to zero at this significance level for the sample in question.

The same, however, does not occur for the β parameter, being that the tcal = 6.0252 > tc = t8,2.5% = 2.306. We can, therefore, reject the null hypothesis in this case, or rather, at the significance level of 5% we cannot affirm that this parameter is statistically equal to zero. These outputs are shown in Fig. 13.19.

Fig. 13.19
Fig. 13.19 Calculation of coefficients and significance t-test of parameters.
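For the t-tests, the standard errors can also be obtained from the diagonal of the inverted matrix shown above. The minimal continuation of the illustrative Python sketch below is again an assumption of this rewrite rather than the chapter's own procedure; it reuses rss, n, k, alpha, beta, and dist from the previous snippets.

```python
import numpy as np
from scipy.stats import t as t_dist  # assumes SciPy is installed

X = np.column_stack([np.ones_like(dist), dist])   # design matrix: intercept column plus dist
se_reg = np.sqrt(rss / (n - k))                   # standard error of the regression; approx. 6.7189

XtX_inv = np.linalg.inv(X.T @ X)                  # matrix whose diagonal holds the a_jj terms
se_params = se_reg * np.sqrt(np.diag(XtX_inv))    # standard errors of alpha and beta

t_stats = np.array([alpha, beta]) / se_params     # approximately 1.2969 and 6.0252
p_values = 2 * t_dist.sf(np.abs(t_stats), n - k)  # two-tailed P-values
print(np.round(t_stats, 4), np.round(p_values, 4))
```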

Analogous to the F-test, instead of analyzing if tcal > tc for each parameter, we directly verify if the significance level (P-value) for each tcal is less than 0.05 (5%), so as to maintain the parameter in the final model. The P-value for each tcal can be obtained in Excel by means of the command Formulas → Insert Function → DISTT, which will open a dialog box as is shown in Fig. 13.20. In this figure, the dialog boxes corresponding to parameters α and β are already presented.

Fig. 13.20
Fig. 13.20 Obtaining the levels of significance of t for parameters α and β (command Insert Function).

It is important to mention that, for simple regressions, statistic F = t2 for parameter β, as is shown by Fávero et al. (2009). In our example, therefore, we can verify that:

$t_\beta^2 = F$

$t_\beta^2 = (6.0252)^2 = 36.30 = F$

Since hypothesis H1 of the F-test tells us that at least one β parameter is statistically different from zero at a certain significance level, and since a simple regression presents only one β parameter, if H0 is rejected for the F-test, H0 will also be rejected for the t-test of this β parameter.

However, for the α parameter, being that tcal < tc (P-value of tcal for the α parameter > 0.05) in our example, we could think of the estimation of a new regression that forces the intercept to be equal to zero. This can be elaborated by means of the Excel Regression dialog box, with the selection of the option Constant is zero.

However, we will not carry out such a procedure, since the nonrejection of the null hypothesis that the α parameter is statistically equal to zero is due to the small sample used, and it does not prevent the researcher from making forecasts by means of the model obtained. The imposition that α be zero could generate forecast bias by producing another model that would not be the most adequate for elaborating interpolations in the data. Fig. 13.21 illustrates this fact.

Fig. 13.21
Fig. 13.21 Original regression model and with the intercept equal to zero.

In this way, the fact that we cannot reject that the α parameter is equal to zero at a certain significance level does not necessarily imply that we should exclude it from the model. However, if this is the researcher's decision, it is important to be at least aware that the result will be a model different from the original, with consequences for the preparation of forecasts.

The nonrejection of the null hypothesis for a β parameter at a certain significance level, on the other hand, indicates that the corresponding X variable does not correlate with the Y variable and, therefore, should be excluded from the final model.

When later in this chapter we present the analysis of regression by means of the Stata (Section 13.5) and SPSS (Section 13.6) software, the Stepwise procedure will be introduced. This has a property that automatically excludes or maintains the β parameters in the model in function of the criteria presented and offers the final model with the β parameters statistically different from zero for the determined significance level.

13.2.4 Construction of the Confidence Intervals of the Model Parameters and Elaboration of Predictions

The confidence intervals for the α and βj (j = 1, 2, …, k) parameters, at the 95% confidence level, can be written, respectively, as follows:

$P\left[\alpha - t_{\alpha/2}\sqrt{\frac{\sum_{i=1}^{n}(u_i)^2}{n-k}}\sqrt{\frac{1}{n} + \frac{\bar{X}^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}} \leq \alpha \leq \alpha + t_{\alpha/2}\sqrt{\frac{\sum_{i=1}^{n}(u_i)^2}{n-k}}\sqrt{\frac{1}{n} + \frac{\bar{X}^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}}\right] = 95\%$

$P\left[\beta_j - t_{\alpha/2}\frac{s.e.}{\sqrt{\sum_{i=1}^{n}X_i^2 - \frac{\left(\sum_{i=1}^{n}X_i\right)^2}{n}}} \leq \beta_j \leq \beta_j + t_{\alpha/2}\frac{s.e.}{\sqrt{\sum_{i=1}^{n}X_i^2 - \frac{\left(\sum_{i=1}^{n}X_i\right)^2}{n}}}\right] = 95\%$  (13.22)

Therefore, for our example, we have that:

Parameter α:

$P\left[5.8784 - 2.306\sqrt{\frac{361.1486}{8}}\sqrt{\frac{1}{10} + \frac{289}{814}} \leq \alpha \leq 5.8784 + 2.306\sqrt{\frac{361.1486}{8}}\sqrt{\frac{1}{10} + \frac{289}{814}}\right] = 95\%$

$P\left[-4.5731 \leq \alpha \leq 16.3299\right] = 95\%$

Since the confidence interval for parameter α contains zero, we cannot reject, at the 95% confidence level, that this parameter is statistically equal to zero, in accordance with what was verified when calculating the t statistic.

Parameter β:

$P\left[1.4189 - 2.306\frac{6.7189}{\sqrt{3704 - \frac{(170)^2}{10}}} \leq \beta \leq 1.4189 + 2.306\frac{6.7189}{\sqrt{3704 - \frac{(170)^2}{10}}}\right] = 95\%$

$P\left[0.8758 \leq \beta \leq 1.9619\right] = 95\%$

Since the confidence interval for parameter β does not contain zero, we can reject, at the 95% confidence level, that this parameter is statistically equal to zero, also in accordance with what was verified when calculating the t statistic.
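These intervals of Expression (13.22) can also be reproduced with the quantities already computed in the illustrative Python sketch (se_params, n, k, alpha, and beta come from the previous snippets; this remains an assumption of this rewrite, not the chapter's Excel procedure).

```python
t_crit = t_dist.ppf(0.975, n - k)  # approximately 2.306 for 8 degrees of freedom

ci_lower = np.array([alpha, beta]) - t_crit * se_params
ci_upper = np.array([alpha, beta]) + t_crit * se_params
# alpha: approximately [-4.5731, 16.3299] (contains zero)
# beta:  approximately [ 0.8758,  1.9619] (does not contain zero)
```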

These intervals are also generated in the Excel outputs. Being that the software standard is to use the 95% confidence level, these intervals are shown twice, so as to allow the researcher to manually alter the confidence level desired by selecting the Confidence Level option in the Excel Regression dialog box, but still have the ability to analyze the intervals for the confidence level most commonly used (95%). In other words, the confidence level intervals of 95% in Excel will always be presented, giving the researcher the ability to analyze the intervals from another confidence level in parallel.

We will, therefore, alter the regression dialog box (Fig. 13.22) in order to allow the software to also calculate the interval parameters for the confidence level of, for example, 90%. These outputs are presented in Fig. 13.23.

Fig. 13.22
Fig. 13.22 Alteration of the confidence level of the intervals of the parameters to 90%.
Fig. 13.23
Fig. 13.23 Intervals with confidence levels of 95% and 90% for each of the parameters.

It can be seen that the lower and upper bounds are symmetrical in relation to the estimated average parameter and offer the researcher the ability to prepare forecasts with a certain confidence level. In the case of parameter β from our example, since the extremes of the lower and upper bounds are positive, we can say that this parameter is positive, with 95% confidence. Besides this, we can also say that the interval [0.8758; 1.9619] contains β with 95% confidence.

Differently from what we did for the 95% confidence level, we will not manually calculate the intervals for the 90% confidence level. However, an analysis of the Excel outputs allows us to affirm that the interval [0.9810; 1.8568] contains β with 90% confidence. In this way, we can say that the lower the confidence level, the narrower (smaller amplitude) the interval needed to contain a certain parameter. On the other hand, the higher the confidence level, the greater the amplitude of the interval needed to contain this parameter.

Fig. 13.24 illustrates what happens when we have a dispersed cloud of points surrounding a regression model.

Fig. 13.24
Fig. 13.24 Confidence intervals for a dispersion of points surrounding a regression model.

We can note that, even though parameter α is positive and mathematically equal to 5.8784, we cannot affirm that it is statistically different from zero for this small sample, since the confidence interval contains an intercept equal to zero (origin). A larger sample could solve this problem.

For parameter β, however, we can note that the slope has always been positive, with an average value mathematically calculated and equal to 1.4189. We can visually notice that its confidence interval does not contain a slope equal to zero.

As has already been discussed, the rejection of the null hypothesis for parameter β, at a certain significance level, indicates that the corresponding X variable is correlated with the Y variable and, consequently, should remain in the final model. Therefore, we can conclude that the decision to exclude an X variable from a certain regression model can be made by means of a direct analysis of the t statistic of its respective parameter β (if tcal < tc → P-value > 0.05 → we cannot reject that the parameter is statistically equal to zero) or by means of an analysis of the confidence interval (if it contains zero). Box 13.1 presents the criteria for including or excluding parameters βj (j = 1, 2, …, k) in regression models.

Box 13.1

Decision to Include βj Parameters in Regression Models

Parameter | t Statistic (for Significance Level α) | t-Test (Analysis of the P-Value for Significance Level α) | Analysis of Confidence Interval | Decision
βj | tcal < tc α/2 | P-value > significance level α | Confidence interval contains zero | Exclude parameter from model
βj | tcal > tc α/2 | P-value < significance level α | Confidence interval does not contain zero | Maintain parameter in model

Obs.: The most common in applied social sciences is the adoption of significance level α = 5%.

After a discussion of these concepts, the professor proposed the following exercise to his students: What is the average travel time forecast (Y estimated, or Ŷ) for a student who travels 17 km to get to school? What would be the minimum and maximum values that this travel time could assume, with 95% confidence?

The first part of the exercise could be solved by a simple substitution of the value of Xi = 17 in the initially obtained equation. Like this:

$\hat{time}_i = 5.8784 + 1.4189 \cdot dist_i = 5.8784 + 1.4189(17) = 29.9997 \ \text{min}$

The second part of the exercise takes us to the outputs in Fig. 13.23, being that the α and β parameters assume intervals of [− 4.5731; 16.3299] and [0.8758; 1.9619], respectively, at the 95% confidence level. As such, the equations that determine the minimum and maximum travel time values for this confidence level are:

Minimum time:

$\hat{time}_{min} = -4.5731 + 0.8758 \cdot dist_i = -4.5731 + 0.8758(17) = 10.3155 \ \text{min}$

Maximum time:

$\hat{time}_{max} = 16.3299 + 1.9619 \cdot dist_i = 16.3299 + 1.9619(17) = 49.6822 \ \text{min}$

We can therefore say that there is 95% confidence that a student who travels 17 km to get to school will take between 10.3155 and 49.6822 min, with an average estimated time of 29.9997 min.
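The professor's exercise can be mirrored in the illustrative Python sketch by substituting the 17 km distance into the average equation and into the equations formed by the interval extremes, exactly as done above. Note that this reproduces the chapter's calculation with the parameter interval extremes; it is not the usual prediction-interval formula, and ci_lower and ci_upper come from the previous snippet.

```python
dist_new = 17.0

time_avg = alpha + beta * dist_new               # approximately 29.9997 min
time_min = ci_lower[0] + ci_lower[1] * dist_new  # approximately 10.3155 min
time_max = ci_upper[0] + ci_upper[1] * dist_new  # approximately 49.6822 min
print(round(time_min, 4), round(time_avg, 4), round(time_max, 4))
```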

Obviously, the amplitude of these values is not small, due to the confidence interval of parameter α being quite ample. This fact can be corrected by the increase of the sample size or by the inclusion of new, statistically significant X variables in the model (which would then become a multiple regression model) being that, in the last case, the R2 value would be increased.

After the professor presented the results of the model to the class, a curious student raised his hand and asked, “Professor, is there any influence of the regression model coefficient of determination R2 on the amplitude of the confidence intervals? If we set up this linear regression and substituted Y for Ŷ, what would the results be? Would the equation change? And the R2? And the confidence intervals?”

And, the professor substituted Y for Ŷ and again set up the regression by means of the dataset presented in Table 13.4.

Table 13.4

Dataset for Preparation of New Regression
Observation (i) | Predicted Time (Ŷi) | Distance (Xi)
1 | 17.23 | 8
2 | 14.39 | 6
3 | 27.16 | 15
4 | 34.26 | 20
5 | 41.35 | 25
6 | 21.49 | 11
7 | 12.97 | 5
8 | 51.28 | 32
9 | 45.61 | 28
10 | 34.26 | 20

The first step taken by the professor was to prepare a new scatter plot graph, with the estimated regression model. This graph is presented in Fig. 13.25.

Fig. 13.25
Fig. 13.25 Scatter plot and linear regression model between predicted time (Ŷ) and distance traveled (X).

As we can see, all of the points are now located exactly on the regression line, since this procedure forced this situation: the calculation of each Ŷi used the regression model itself. As such, we can state in advance that the R2 for this new regression is 1. Let's look at the new outputs (Fig. 13.26).

Fig. 13.26
Fig. 13.26 Outputs of the linear regression model between predicted time (Ŷ) and distance traveled (X).

As expected, the R2 is 1. Moreover, the model equation is exactly that which was previously calculated, since it is the same line. However, we can see that the F and t-tests cause us to strongly reject their respective null hypotheses. Even parameter α, which previously could not be considered statistically different from zero, now presents its t-test and tells us that we can reject, at the 95% confidence level (or higher), that this parameter is statistically equal to zero. This occurs because previously the small sample used (n = 10 observations) did not allow us to affirm that the intercept was different from zero, being that the dispersion of points generated a confidence interval that had an intercept equal to zero (Fig. 13.24).

On the other hand, when all the points are on the model, each of the residual terms comes to be zero, which causes the R2 to become 1. Besides, the obtained equation is no longer an adjusted model to a dispersion of points, but the very line that passes through all the points and completely explains the sample behavior. Being such, we do not have a dispersion surrounding the regression model and the confidence intervals come to represent a null amplitude, as we can also see in Fig. 13.26. In this case, for any confidence level, the values for each parameter interval are no longer altered, which causes us to declare with 100% confidence that the [5.8784; 5.8784] interval contains α and the [1.4189; 1.4189] interval contains β. In other words, in this extreme case, α is mathematically equal to 5.8784 and β is mathematically equal to 1.4189.

Being as such, R2 is an indicator of just how ample the parameter confidence intervals are. Therefore, models with higher R2 levels will give the researcher the ability to make more accurate forecasts, given that the cloud of points is less dispersed along the regression model, which will reduce the amplitude of the parameter confidence intervals.

On the other hand, models with low R2 values can impair the preparation of forecasts, given the greater amplitude of the parameter confidence intervals, but this does not invalidate the existence of the model as such. As we have already discussed, many researchers give too much importance to the R2; however, it is the F-test that truly confirms that a regression model exists (at least one X variable considered is statistically significant to explain Y). As such, it is not rare to find very low R2 values and statistically significant F values in Administration, Accounting, or Economics models, which shows that the Y phenomenon studied underwent changes in its behavior due to some X variables adequately included in the model. However, there will be low forecast accuracy due to the impossibility of monitoring all the variables that effectively explain the variation of that Y phenomenon. Within the aforementioned knowledge areas, such a fact can easily be found in works on Finance and the Stock Market.

13.2.5 Estimation of Multiple Linear Regression Models

According to Fávero et al. (2009), the multiple linear regression presents the same logic as the simple linear, however now with the inclusion of more than one explanatory X variable in the model. The use of many explanatory variables depends on the subjacent theory and previous studies, as well as the experience and good sense of the researcher, in order to be able to give foundation to the decision.

Initially, the ceteris paribus concept (maintain remaining conditions constant) should be used in the multiple regression analysis, since the interpretation of the parameter of each variable should be done in isolation. As such, in a model that possesses two explanatory variables, X1 and X2, the respective coefficients will be analyzed in a way so as to consider the other factors as constants.

To illustrate the multiple linear regression, we will use the same example that we have used in this chapter. However, we will now imagine that the professor has made the decision to collect one more variable from each of the students. This variable will refer to the number of traffic lights, or semaphores, each student must pass. We will call this variable sem. As such, the theoretical model becomes:

$time_i = a + b_1 \cdot dist_i + b_2 \cdot sem_i + u_i$

which, analogous to what was presented for the simple regression, we have that:

$\hat{time}_i = \alpha + \beta_1 \cdot dist_i + \beta_2 \cdot sem_i$

where α, β1, and β2 are the estimates for parameters a, b1, and b2, respectively.

The new dataset is found in Table 13.5, as well as in the file Timedistsem.xls.

Table 13.5

Example: Travel Time × Distance Traveled and Number of Traffic Lights
Student | Time to Get to School (min) (Yi) | Distance Traveled to School (km) (X1i) | Number of Traffic Lights (X2i)
Gabriela | 15 | 8 | 0
Dalila | 20 | 6 | 1
Gustavo | 20 | 15 | 0
Leticia | 40 | 20 | 1
Luiz Ovidio | 50 | 25 | 2
Leonor | 25 | 11 | 1
Ana | 10 | 5 | 0
Antonio | 55 | 32 | 3
Julia | 35 | 28 | 1
Mariana | 30 | 20 | 1

We will now algebraically develop the procedures for calculating the model parameters, as we did in the simple regression model. By means of the following expression:

$Y_i = a + b_1 X_{1i} + b_2 X_{2i} + u_i$

we also define that the residual sum of squares is minimum. Therefore:

$\sum_{i=1}^{n}(Y_i - \beta_1 X_{1i} - \beta_2 X_{2i} - \alpha)^2 = \min$

The minimization is obtained by differentiating the previous expression with respect to α, β1, and β2 and setting the resulting expressions equal to zero. Therefore:

$\frac{\partial\left[\sum_{i=1}^{n}(Y_i - \beta_1 X_{1i} - \beta_2 X_{2i} - \alpha)^2\right]}{\partial \alpha} = -2\sum_{i=1}^{n}(Y_i - \beta_1 X_{1i} - \beta_2 X_{2i} - \alpha) = 0$  (13.23)

$\frac{\partial\left[\sum_{i=1}^{n}(Y_i - \beta_1 X_{1i} - \beta_2 X_{2i} - \alpha)^2\right]}{\partial \beta_1} = -2\sum_{i=1}^{n}X_{1i}(Y_i - \beta_1 X_{1i} - \beta_2 X_{2i} - \alpha) = 0$  (13.24)

$\frac{\partial\left[\sum_{i=1}^{n}(Y_i - \beta_1 X_{1i} - \beta_2 X_{2i} - \alpha)^2\right]}{\partial \beta_2} = -2\sum_{i=1}^{n}X_{2i}(Y_i - \beta_1 X_{1i} - \beta_2 X_{2i} - \alpha) = 0$  (13.25)

which generates the following system of three equations and three unknowns:

$\begin{cases}\sum_{i=1}^{n}Y_i = n\alpha + \beta_1\sum_{i=1}^{n}X_{1i} + \beta_2\sum_{i=1}^{n}X_{2i}\\ \sum_{i=1}^{n}Y_iX_{1i} = \alpha\sum_{i=1}^{n}X_{1i} + \beta_1\sum_{i=1}^{n}X_{1i}^2 + \beta_2\sum_{i=1}^{n}X_{1i}X_{2i}\\ \sum_{i=1}^{n}Y_iX_{2i} = \alpha\sum_{i=1}^{n}X_{2i} + \beta_1\sum_{i=1}^{n}X_{1i}X_{2i} + \beta_2\sum_{i=1}^{n}X_{2i}^2\end{cases}$  (13.26)

Dividing the first equation of Expression (13.26) by n, we arrive at:

$\alpha = \bar{Y} - \beta_1\bar{X}_1 - \beta_2\bar{X}_2$  (13.27)

By means of substituting the Expression (13.27) in the last two equations of the Expression (13.26), we arrive at the following system of two equations and two unknowns:

$\begin{cases}\sum_{i=1}^{n}Y_iX_{1i} - \frac{\sum_{i=1}^{n}Y_i\sum_{i=1}^{n}X_{1i}}{n} = \beta_1\left[\sum_{i=1}^{n}X_{1i}^2 - \frac{\left(\sum_{i=1}^{n}X_{1i}\right)^2}{n}\right] + \beta_2\left[\sum_{i=1}^{n}X_{1i}X_{2i} - \frac{\left(\sum_{i=1}^{n}X_{1i}\right)\left(\sum_{i=1}^{n}X_{2i}\right)}{n}\right]\\ \sum_{i=1}^{n}Y_iX_{2i} - \frac{\sum_{i=1}^{n}Y_i\sum_{i=1}^{n}X_{2i}}{n} = \beta_1\left[\sum_{i=1}^{n}X_{1i}X_{2i} - \frac{\left(\sum_{i=1}^{n}X_{1i}\right)\left(\sum_{i=1}^{n}X_{2i}\right)}{n}\right] + \beta_2\left[\sum_{i=1}^{n}X_{2i}^2 - \frac{\left(\sum_{i=1}^{n}X_{2i}\right)^2}{n}\right]\end{cases}$  (13.28)

We will now manually calculate the parameters for our example model. To do this, we need to use the spreadsheet in Table 13.6.

Table 13.6

Spreadsheet to Calculate the Parameters for the Multiple Linear Regression
Obs. (i) | Yi | X1i | X2i | YiX1i | YiX2i | X1iX2i | (Yi)² | (X1i)² | (X2i)²
1 | 15 | 8 | 0 | 120 | 0 | 0 | 225 | 64 | 0
2 | 20 | 6 | 1 | 120 | 20 | 6 | 400 | 36 | 1
3 | 20 | 15 | 0 | 300 | 0 | 0 | 400 | 225 | 0
4 | 40 | 20 | 1 | 800 | 40 | 20 | 1600 | 400 | 1
5 | 50 | 25 | 2 | 1250 | 100 | 50 | 2500 | 625 | 4
6 | 25 | 11 | 1 | 275 | 25 | 11 | 625 | 121 | 1
7 | 10 | 5 | 0 | 50 | 0 | 0 | 100 | 25 | 0
8 | 55 | 32 | 3 | 1760 | 165 | 96 | 3025 | 1024 | 9
9 | 35 | 28 | 1 | 980 | 35 | 28 | 1225 | 784 | 1
10 | 30 | 20 | 1 | 600 | 30 | 20 | 900 | 400 | 1
Sum | 300 | 170 | 10 | 6255 | 415 | 231 | 11,000 | 3704 | 18
Average | 30 | 17 | 1 | | | | | |

We will now substitute the values into the system represented by the Expression (13.28). Therefore:

$\begin{cases}6255 - \frac{300 \cdot 170}{10} = \beta_1\left[3704 - \frac{(170)^2}{10}\right] + \beta_2\left[231 - \frac{(170)(10)}{10}\right]\\ 415 - \frac{300 \cdot 10}{10} = \beta_1\left[231 - \frac{(170)(10)}{10}\right] + \beta_2\left[18 - \frac{(10)^2}{10}\right]\end{cases}$

Which results in:

$\begin{cases}1155 = 814\beta_1 + 61\beta_2\\ 115 = 61\beta_1 + 8\beta_2\end{cases}$

Solving the system, we arrive at:

$\beta_1 = 0.7972 \quad \text{and} \quad \beta_2 = 8.2963$

We have that:

$\alpha = \bar{Y} - \beta_1\bar{X}_1 - \beta_2\bar{X}_2 = 30 - 0.7972(17) - 8.2963(1) = 8.1512$

Therefore, the estimated time equation to get to school now comes to be:

$\hat{time}_i = 8.1512 + 0.7972 \cdot dist_i + 8.2963 \cdot sem_i$
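As a cross-check of the algebra above, the multiple regression parameters can also be obtained numerically. The sketch below continues the earlier illustrative Python snippets (time and dist are already defined there) and solves the least squares problem directly; it is an assumption of this rewrite, not the chapter's own procedure.

```python
import numpy as np

# Number of traffic lights for each student (Table 13.5)
sem = np.array([0, 1, 0, 1, 2, 1, 0, 3, 1, 1], dtype=float)

X_multi = np.column_stack([np.ones_like(dist), dist, sem])
coefs, *_ = np.linalg.lstsq(X_multi, time, rcond=None)  # OLS solution of the normal equations

print(np.round(coefs, 4))  # approximately [8.1512, 0.7972, 8.2963]
```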

It should be remembered that the estimation of these parameters can also be obtained by means of the Excel Solver tool, as shown in Section 13.2.1.

The calculations of the coefficient of determination R2, the F and t statistics, and the extreme values of the confidence intervals will not be performed again manually, given that they follow exactly the same procedures already performed in Sections 13.2.2–13.2.4 and can be done by means of the respective expressions presented until now. Table 13.7 can be of help in this sense.

Table 13.7

Spreadsheet to Calculate Remaining Statistics
Observation (i) | Time (Yi) | Distance (X1i) | Traffic Lights (X2i) | Ŷi | ui = Yi − Ŷi | (Ŷi − Ȳ)² | (ui)²
1 | 15 | 8 | 0 | 14.53 | 0.47 | 239.36 | 0.22
2 | 20 | 6 | 1 | 21.23 | −1.23 | 76.90 | 1.51
3 | 20 | 15 | 0 | 20.11 | −0.11 | 97.83 | 0.01
4 | 40 | 20 | 1 | 32.39 | 7.61 | 5.72 | 57.89
5 | 50 | 25 | 2 | 44.67 | 5.33 | 215.32 | 28.37
6 | 25 | 11 | 1 | 25.22 | −0.22 | 22.88 | 0.05
7 | 10 | 5 | 0 | 12.14 | −2.14 | 319.08 | 4.57
8 | 55 | 32 | 3 | 58.55 | −3.55 | 815.14 | 12.61
9 | 35 | 28 | 1 | 38.77 | −3.77 | 76.90 | 14.21
10 | 30 | 20 | 1 | 32.39 | −2.39 | 5.72 | 5.72
Sum | 300 | 170 | 10 | | | 1874.85 | 125.15
Average | 30 | 17 | 1 | | | |

Let’s go directly to the preparation of this multiple linear regression in Excel (file Timedistsem.xls). In the regression dialog box, we should jointly select the variables referent to the distance traveled and the number of traffic lights, as shown in Fig. 13.27.

Fig. 13.27
Fig. 13.27 Multiple linear regression—joint selection of set of explanatory variables.

Fig. 13.28 presents the generated outputs.

Fig. 13.28
Fig. 13.28 Multiple linear regression outputs in Excel.

Within these outputs, we find the parameters for our multiple linear regression model determined algebraically.

At this time, it is important to introduce the concept of the adjusted R2. According to Fávero et al. (2009), when we wish to compare the coefficient of determination (R2) between two models with different sample sizes or distinct quantities of parameters, the use of the adjusted R2 becomes necessary. It is a measure of the R2 of the regression estimated by the OLS method adjusted by the number of degrees of freedom, since the sample estimate of R2 tends to overestimate the population parameter. The adjusted R2 expression is:

$R^2_{adjust} = 1 - \frac{n-1}{n-k}(1 - R^2)$  (13.29)

where n is the size of the sample and k is the number of regression model parameters (number of explanatory variables plus the intercept). When the number of observations is very large, the adjustment by degrees of freedom becomes negligible; however when there is a significantly different number of X variables for the two samples, the adjusted R2 should be used for the preparation of the comparison between models and the model with the higher adjusted R2 should be opted for.

R2 increases when a new variable is added to the model, however the adjusted R2 will not always increase, and could well decrease or become negative. For this last case, Stock and Watson (2004) explain that the adjusted R2 can become negative when the explanatory variables, taken as a set, reduce the residual sum of squares to such a small amount that this reduction is unable to compensate the factor (n − 1)/(n − k).

For our example, we have that:

$R^2_{adjust} = 1 - \frac{10-1}{10-3}(1 - 0.9374) = 0.9195$
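The same figures can be verified in the illustrative Python sketch (X_multi and coefs come from the previous snippet, and the use of Python remains an assumption of this rewrite); k here counts the intercept plus the two explanatory variables, as in Expression (13.29).

```python
y_hat_m = X_multi @ coefs
rss_m = np.sum((time - y_hat_m) ** 2)          # approximately 125.15
tss = np.sum((time - time.mean()) ** 2)        # total sum of squares
r2_m = 1 - rss_m / tss                         # approximately 0.9374

n, k_m = len(time), X_multi.shape[1]           # 10 observations, 3 parameters
r2_adj = 1 - (n - 1) / (n - k_m) * (1 - r2_m)  # Expression (13.29); approximately 0.9195
print(round(r2_m, 4), round(r2_adj, 4))
```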

Therefore, instead of the simple regression initially estimated, we should now opt for this multiple regression as the better model to study the behavior of travel time to school, since the adjusted R2 is higher in this case.

Let's continue with the analysis of the remaining outputs. Initially, the F-test informs us that at least one of the X variables is statistically significant to explain the behavior of Y. Besides this, we can also verify, at the 5% significance level, that all the parameters (α, β1, and β2) are statistically different from zero (P-value < 0.05 → confidence interval does not contain zero). As has been discussed, the nonrejection of the null hypothesis that the intercept is statistically equal to zero can be altered by including a significant explanatory variable in the model. We also note that there was a perceptible increase in the R2 value, which also caused the confidence intervals of the parameters to become narrower.

In this way, we can conclude, for this case, that one additional traffic light along the route to school increases the average travel time by 8.2963 min, ceteris paribus. On the other hand, an increase of one kilometer in the distance to be traveled now adds only 0.7972 min to the average travel time, ceteris paribus. The reduction in the estimated value of β for the dist variable occurs because part of the behavior of this variable is captured by the sem variable. In other words, greater distances are more susceptible to a greater number of traffic lights and, therefore, there is a strong relation between them.

According to Kennedy (2008), Fávero et al. (2009), Gujarati (2011), and Wooldridge (2012), the existence of high correlations between explanatory variables, known as multicollinearity, does not compromise the use of the model for preparing forecasts. Gujarati (2011) also highlights that the existence of high correlations between explanatory variables does not necessarily generate bad or weak estimators and that the presence of multicollinearity does not mean that the model has problems. We will discuss multicollinearity more in Section 13.3.2.

The equations that determine the minimum and maximum values for travel time, at the 95% confidence level, are:

Minimum time:

$\hat{time}_{min} = 1.2463 + 0.2619 \cdot dist_i + 2.8967 \cdot sem_i$

Maximum time:

$\hat{time}_{max} = 15.0561 + 1.3325 \cdot dist_i + 13.6959 \cdot sem_i$

13.2.6 Dummy Variables in Regression Models

According to Sharma (1996) and Fávero et al. (2009), for metric variables, the determination of the number of variables necessary to investigate a phenomenon is direct and simply equal to the number of variables used to measure the respective characteristics. However, the procedure to determine the number of explanatory variables for data measured on qualitative scales is different.

Imagine, for example, that we wish to study how a certain organizational phenomenon, such as total profitability, behaves when companies from different sectors are in the same dataset. Or, in another situation, we wish to verify if the average grocery bill in supermarkets presents significant differences when comparing consumers of different genders and age groups. In a third situation, we want to study how GDP growth behaves in countries considered either emerging or developed. In all of these hypothetical situations, the dependent variables (or outcome variables) are quantitative (total profitability, average grocery bill, or rate of GDP growth); however, we wish to know how they behave as a function of qualitative explanatory variables (sector, gender, age group, country classification), which will be included on the right side of the respective regression models to be estimated.

We cannot simply attribute values to each of the qualitative variable categories, for this would be a serious error, called random weighting, since we would be supposing that the differences in the dependent variable were previously known and of equal magnitude to the differences in the values attributed to each of the qualitative explanatory variable categories. In these situations, so that this problem is completely eliminated, we should resort to the artifice of dummy variables, or binaries, which assume values equal to 0 or 1, in such a way as to stratify the sample according to how a determined criterion, event, or attribute was defined, to then be included in the model under analysis. Even a certain period (day, month, or year) in which an important event occurred can be the object of analysis.

Dummy variables should, therefore, be used when we wish to study the relation between the behavior of a certain qualitative explanatory variable and the phenomenon in question, represented by the dependent variable.

Returning to our example, now imagine that the professor also asked the students regarding the time of day they came to school, or rather, if each of them came in the morning in order to study in the library, or if they came in the afternoon for their night class. The intent of the professor is to know if the travel time to school undergoes variation due to the distance traveled, the quantity of traffic lights, and the time of day when the students leave to go to school. Therefore, a new variable was added to the dataset, as is shown in Table 13.8.

Table 13.8

Example: Travel Time × Distance Traveled, Number of Traffic Lights, and Time of Day for the Trip to School
Student      | Travel Time to Get to School (min) (Yi) | Distance Traveled to School (km) (X1i) | Number of Traffic Lights (X2i) | Time of Day (X3i)
Gabriela     | 15 | 8  | 0 | Morning
Dalila       | 20 | 6  | 1 | Morning
Gustavo      | 20 | 15 | 0 | Morning
Leticia      | 40 | 20 | 1 | Afternoon
Luiz Ovidio  | 50 | 25 | 2 | Afternoon
Leonor       | 25 | 11 | 1 | Morning
Ana          | 10 | 5  | 0 | Morning
Antonio      | 55 | 32 | 3 | Afternoon
Julia        | 35 | 28 | 1 | Morning
Mariana      | 30 | 20 | 1 | Morning

We should, therefore, define which of the qualitative variable categories will be the reference (dummy = 0). Since, in this case, we have only two categories (morning or afternoon), only one dummy needs to be created, in which the reference category will assume the value 0 and the other category the value 1. This procedure allows the researcher to study the differences that occur in the Y variable when the qualitative variable category changes, since the β of this dummy will represent exactly the difference that occurs in the behavior of variable Y when it passes from the reference category of the qualitative variable to the other category, with the behavior of the reference category represented by the intercept α. Therefore, the decision as to which category will be the reference is up to the researcher, and the parameters of the model will be obtained based on the criterion adopted.

As such, the professor decided that the reference category would be the afternoon period, or rather, the dataset cells with this category would assume values equal to 0. Then, the cells with the morning category would assume values equal to 1. This is because the professor wanted to evaluate if the trip to school in the morning posed any time benefit or loss in relation to the afternoon period, which was immediately before class. We will call this dummy variable per. Being as such, the dataset reflects what is presented in Table 13.9.

Table 13.9

Substitution of the Qualitative Variable Categories by the Dummy
Student      | Time to Get to School (min) (Yi) | Distance Traveled to School (km) (X1i) | Number of Traffic Lights (X2i) | Time of Day, Dummy per (X3i)
Gabriela     | 15 | 8  | 0 | 1
Dalila       | 20 | 6  | 1 | 1
Gustavo      | 20 | 15 | 0 | 1
Leticia      | 40 | 20 | 1 | 0
Luiz Ovidio  | 50 | 25 | 2 | 0
Leonor       | 25 | 11 | 1 | 1
Ana          | 10 | 5  | 0 | 1
Antonio      | 55 | 32 | 3 | 0
Julia        | 35 | 28 | 1 | 1
Mariana      | 30 | 20 | 1 | 1
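As an illustration of this coding, the dummy column of Table 13.9 can be generated with a minimal sketch (Python assumed; the list names are hypothetical):

```python
# Minimal sketch: coding the qualitative variable "time of day" as a dummy,
# with the afternoon period as the reference category (per = 0), as in Table 13.9.
periods = ["Morning", "Morning", "Morning", "Afternoon", "Afternoon",
           "Morning", "Morning", "Afternoon", "Morning", "Morning"]

per = [1 if p == "Morning" else 0 for p in periods]
print(per)  # [1, 1, 1, 0, 0, 1, 1, 0, 1, 1]
```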

Therefore, the new model is:

time_i = a + b_1\,dist_i + b_2\,sem_i + b_3\,per_i + u_i

and, analogous to what was presented for the simple regression, we have:

\widehat{time}_i = \alpha + \beta_1\,dist_i + \beta_2\,sem_i + \beta_3\,per_i

where α, β1, β2, and β3 are the estimates of parameters a, b1, b2, and b3, respectively.

Again, solving by Excel, we should now include the dummy variable per in the explanatory variables vector, as is shown in Fig. 13.29 (file Timedistsemper.xls).

Fig. 13.29 Multiple linear regression—joint selection of the explanatory variables with dummy.

The outputs are presented in Fig. 13.30.

Fig. 13.30 Multiple linear regression outputs with dummy in Excel.

By means of these outputs, we can, initially, see that the R2 went up to 0.9839, which allows us to say that more than 98% of the variation in the time to get to school is explained by the joint variation of the three X variables (dist, sem, and per). Besides that, this model is preferable to those previously studied, since it presents a higher adjusted R2.

While the F-test allows us to state that at least one estimated β parameter is statistically different from zero at the 5% significance level, the t-tests for each parameter show that all of them (β1, β2, β3, and even α) are statistically different from zero at this significance level, since each P-value < 0.05. As such, no X variable needs to be excluded from the model. The final equation that estimates the time to get to school presents itself in the following way:

\widehat{time}_i = 19.6353 + 0.7084\,dist_i + 5.2573\,sem_i - 9.9088\,per_i, \quad per_i = \begin{cases} 0, & \text{afternoon} \\ 1, & \text{morning} \end{cases}

In this way, we can state, for our example, that the average predicted time to get to school is 9.9088 min less for students who opt to go during the morning in relation to those who opt to go in the afternoon, ceteris paribus. This probably happens for reasons associated with traffic; however, deeper studies could be performed at this point. As such, the professor proposed one more exercise: What is the estimated time to get to school for a student who travels 17 km, goes through two traffic lights, and gets to school a little before the beginning of the evening class, in the afternoon? The solution is as follows:

\widehat{time} = 19.6353 + 0.7084\,(17) + 5.2573\,(2) - 9.9088\,(0) = 42.1934 \text{ min}

It should be noted that small differences arising from the third decimal place onward can occur due to rounding. We used the values obtained from the Excel outputs.

And what would be the estimated time for another student who also travels 17 km, goes through two traffic lights, but decides to go to school in the morning?

\widehat{time} = 19.6353 + 0.7084\,(17) + 5.2573\,(2) - 9.9088\,(1) = 32.2846 \text{ min}

According to what we have already discussed, the difference between the two situations is captured by the β3 of the dummy variable. The ceteris paribus condition imposes no other alteration to be considered, exactly as shown in this last exercise.
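A minimal sketch of these two forecasts, using the coefficients reported in the Excel outputs (values rounded to four decimal places, so results may differ slightly from the text), could be:

```python
# Minimal sketch: forecasts from the equation estimated with the dummy per.
# Coefficients are those reported in the Excel outputs (rounded to four decimals).
def predict_time(dist, sem, per):
    return 19.6353 + 0.7084 * dist + 5.2573 * sem - 9.9088 * per

afternoon = predict_time(17, 2, per=0)  # roughly 42.19 min
morning = predict_time(17, 2, per=1)    # roughly 32.28 min
print(afternoon, morning, afternoon - morning)  # the difference equals 9.9088 (-beta3)
```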

Imagine now that the professor, still not satisfied, questions the students one last time regarding their driving style. As such, each one was asked how they classified their driving style: calm, moderate, or aggressive. After obtaining their answers, he set up the last dataset, presented in Table 13.10.

Table 13.10

Example: Travel Time × Distance Traveled, Number of Traffic Lights, Time of Day for the Trip to School, and Driving Style
Student      | Time to Get to School (min) (Yi) | Distance Traveled to School (km) (X1i) | Number of Traffic Lights (X2i) | Time of Day (X3i) | Driving Style (X4i)
Gabriela     | 15 | 8  | 0 | Morning   | Calm
Dalila       | 20 | 6  | 1 | Morning   | Moderate
Gustavo      | 20 | 15 | 0 | Morning   | Moderate
Leticia      | 40 | 20 | 1 | Afternoon | Aggressive
Luiz Ovidio  | 50 | 25 | 2 | Afternoon | Aggressive
Leonor       | 25 | 11 | 1 | Morning   | Moderate
Ana          | 10 | 5  | 0 | Morning   | Calm
Antonio      | 55 | 32 | 3 | Afternoon | Calm
Julia        | 35 | 28 | 1 | Morning   | Moderate
Mariana      | 30 | 20 | 1 | Morning   | Moderate

To prepare the regression, the professor needs to transform the driving style variable into dummies. For a situation where a certain qualitative variable has more than two categories (e.g., marital status, favorite ball team, religion, area of work, among others), the researcher needs to use a greater number of dummy variables and, in general, for a qualitative variable with n categories, (n − 1) dummies will be needed, since a certain category should be chosen as the reference and its behavior captured by the estimated α parameter.

As we have discussed, it is unfortunately quite common to find procedures that practice the arbitrary substitution of the qualitative variable categories with values such as 1 and 2 when there are two categories, 1, 2, and 3 when there are three categories, and so on. This is a serious error, since, in this way, we would start from the presupposition that the differences that occur in the behavior of the Y variable when the qualitative variable category changes would always be of the same magnitude, which is not necessarily true. In other words, we cannot presume that the average difference between calm and moderate individuals would be the same as that between moderate and aggressive individuals.

In our example, therefore, the driving style variable should be transformed into two dummies (variables style2 and style3), with the calm category already defined as the reference (its behavior captured by the intercept). While Table 13.11 presents the criteria for the two dummies, Table 13.12 shows the final dataset to be used in the regression.

Table 13.11

Criteria for the Creation of Two Dummy Variables Based on the Driving Style Qualitative Variable
Category of Driving Style Qualitative Variable | Dummy Variable style2 | Dummy Variable style3
Calm       | 0 | 0
Moderate   | 1 | 0
Aggressive | 0 | 1

Table 13.12

Substitution of the Qualitative Variable Categories With the Respective Dummy Variables
Student      | Time to Get to School (min) (Yi) | Distance Traveled to School (km) (X1i) | Number of Traffic Lights (X2i) | Time of Day, Dummy per (X3i) | Driving Style, Dummy style2 (X4i) | Driving Style, Dummy style3 (X5i)
Gabriela     | 15 | 8  | 0 | 1 | 0 | 0
Dalila       | 20 | 6  | 1 | 1 | 1 | 0
Gustavo      | 20 | 15 | 0 | 1 | 1 | 0
Leticia      | 40 | 20 | 1 | 0 | 0 | 1
Luiz Ovidio  | 50 | 25 | 2 | 0 | 0 | 1
Leonor       | 25 | 11 | 1 | 1 | 1 | 0
Ana          | 10 | 5  | 0 | 1 | 0 | 0
Antonio      | 55 | 32 | 3 | 0 | 0 | 0
Julia        | 35 | 28 | 1 | 1 | 1 | 0
Mariana      | 30 | 20 | 1 | 1 | 1 | 0

And, in this way, the model will have the following equation:

time_i = a + b_1\,dist_i + b_2\,sem_i + b_3\,per_i + b_4\,style2_i + b_5\,style3_i + u_i

and, analogous to what has been presented for the previous models, we have that:

\widehat{time}_i = \alpha + \beta_1\,dist_i + \beta_2\,sem_i + \beta_3\,per_i + \beta_4\,style2_i + \beta_5\,style3_i

where α, β1, β2, β3, β4, and β5 are the estimates of parameters a, b1, b2, b3, b4, and b5, respectively.

In this way, analyzing the parameters of the style2 and style3 variables, we have that:

  • β4 = average travel time difference between an individual considered moderate and an individual considered calm.
  • β5 = average travel time difference between an individual considered aggressive and an individual considered calm.
  • (β5 − β4) = average travel time difference between an individual considered aggressive and an individual considered moderate.
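A minimal sketch of this coding (Python assumed; the list names are hypothetical), reproducing the criteria of Table 13.11, would be:

```python
# Minimal sketch: (n - 1) = 2 dummies for the 3-category variable "driving style",
# with "calm" as the reference category, following the criteria of Table 13.11.
styles = ["Calm", "Moderate", "Moderate", "Aggressive", "Aggressive",
          "Moderate", "Calm", "Calm", "Moderate", "Moderate"]

style2 = [1 if s == "Moderate" else 0 for s in styles]    # moderate vs. calm
style3 = [1 if s == "Aggressive" else 0 for s in styles]  # aggressive vs. calm
print(style2)  # [0, 1, 1, 0, 0, 1, 0, 0, 1, 1]
print(style3)  # [0, 0, 0, 1, 1, 0, 0, 0, 0, 0]
```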

Solving again by Excel, we should now include the style2 and style3 dummy variables in the explanatory variables vector. Fig. 13.31 shows this procedure, prepared by means of the Timedistsemperstyle.xls file.

Fig. 13.31 Multiple linear regression—joint selection of explanatory variables with all dummies.

The outputs are presented in Fig. 13.32.

Fig. 13.32 Multiple linear regression outputs with different dummies in Excel.

We now notice that, even though the R2 of the regression model is quite high (R2 = 0.9969), the parameters of the variables referring to the period in which the route was taken (X3) and to the moderate category of the driving style variable (X4) do not show themselves to be statistically different from zero at the 5% significance level. As such, these variables will be removed from the analysis and the model will be estimated again.

However, it is important to note that, in the presence of the remaining variables, the travel time to school does not present additional differences whether the route is taken in the morning or in the afternoon. The same is true in relation to driving style: there are no statistically significant differences in travel time for students with the moderate profile in relation to those who consider themselves calm. It should be remembered that, in a multiple regression, the analysis of the parameters that do not show themselves to be statistically different from zero is just as important as the analysis of the statistically significant parameters.

The Stepwise procedure, available in Stata, SPSS, and other modeling software, automatically excludes the explanatory variables whose parameters do not show themselves to be statistically different from zero. Since Excel does not possess this procedure, we will manually exclude the per and style2 variables and once again estimate the regression. The new outputs are presented in Fig. 13.33. It is recommended, however, that the researcher always take great care with the simultaneous exclusion of variables whose parameters, at first sight, do not show themselves to be statistically different from zero, since a certain β parameter can become statistically different from zero, even though it initially was not, when another variable whose parameter also does not show itself to be statistically different from zero is eliminated. Fortunately, this does not occur in this example and, as such, we opt to simultaneously exclude the two variables. This will be confirmed when we elaborate this regression by means of the Stepwise procedure in the Stata (Section 13.5) and SPSS (Section 13.6) software.

Fig. 13.33 Multiple linear regression outputs after the exclusion of variables.

And, in this way, the final model, with all the parameters statistically different from zero at the 5% significance level, with R2 = 0.9954 and the higher adjusted R2 among all those discussed in this chapter, comes to be:

\widehat{time}_i = 8.2919 + 0.7105\,dist_i + 7.8368\,sem_i + 8.9676\,style3_i, \quad style3_i = \begin{cases} 0, & \text{calm} \\ 1, & \text{aggressive} \end{cases}

It is also important to verify if there was a reduction in the confidence interval amplitudes for each of the parameters. In this way, we can ask:

What would be the estimated time for another student, who also travels 17 km, goes through two traffic lights, decides to go to school in the morning, but has what is considered an aggressive driving style?

\widehat{time} = 8.2919 + 0.7105\,(17) + 7.8368\,(2) + 8.9676\,(1) = 45.0109 \text{ min}

Finally, we can state, ceteris paribus, that a student who is considered aggressive when driving takes, on average, 8.9676 min longer to get to school in relation to another who is considered calm. This shows us, among other things, that aggressiveness in traffic leads nowhere!

13.3 Presuppositions of Regression Models Estimated by OLS

After the presentation of the multiple regression model estimated by the OLS method, Box 13.2 presents its presuppositions, the consequences of their violation, and the procedures for verifying each of them.

Box 13.2

Regression Model Presuppositions

Presupposition | Violation | Presupposition Verification
Residuals present a normal distribution | P-values of the t-test and F-test are not valid | Shapiro-Wilk test; Shapiro-Francia test
There are no high correlations between explanatory variables and there are more observations than explanatory variables | Multicollinearity | Correlation matrix; determinant of the (X′X) matrix; VIF (variance inflation factor) and Tolerance
The residuals do not present any correlation with any X variable | Heteroskedasticity | Breusch-Pagan/Cook-Weisberg test
The residuals are random and independent | Autocorrelation of the residuals (for temporal models) | Durbin-Watson test; Breusch-Godfrey test
Source: Kennedy (2008).

Next, we will present and discuss each of the presuppositions.

13.3.1 Normality of Residuals

The normal distribution of the residuals is required to validate the hypothesis tests in regression models. In other words, the normality presupposition assures that the P-values of the t-tests and the F-test are valid. However, Wooldridge (2012) argues that the violation of this presupposition can be minimized when using large samples, due to the asymptotic properties of the estimators obtained by OLS.

It is quite common for this presupposition to be violated by researchers when estimating regression models by the OLS method; however, it is important that this hypothesis be considered in order to obtain statistical results in line with the definition of the best functional form and to determine the confidence intervals for forecasts (Fig. 13.34), which are defined, as we have studied, based on the estimates of the model parameters.

Fig. 13.34 Normal distribution of residuals.

It should be emphasized that adherence of the dependent variable to the normal distribution, in OLS regression models, tends to generate error terms that are also normal and, consequently, estimated parameters more adequate for the determination of the confidence intervals for forecasting purposes.

Thus, it is recommended that the Shapiro-Wilk test or the Shapiro-Francia test be applied to the error terms so as to verify the presupposition of normally distributed residuals. According to Maroco (2014), while the Shapiro-Wilk test is more appropriate for small samples (those with up to 30 observations), the Shapiro-Francia test is more recommended for larger samples, as we discussed in Chapter 9.

In Section 13.5 we will present the application of these tests, as well as their results, using Stata.
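Outside Stata, this verification can also be sketched, for instance, in Python; the snippet below assumes scipy and statsmodels are available and uses the travel-time data only as an illustration (scipy implements the Shapiro-Wilk test; the Shapiro-Francia variant is not part of scipy).

```python
# Minimal sketch: Shapiro-Wilk test applied to the OLS residuals.
import numpy as np
from scipy import stats
import statsmodels.api as sm

time = np.array([15, 20, 20, 40, 50, 25, 10, 55, 35, 30], dtype=float)
dist = np.array([8, 6, 15, 20, 25, 11, 5, 32, 28, 20], dtype=float)

residuals = sm.OLS(time, sm.add_constant(dist)).fit().resid
W, p_value = stats.shapiro(residuals)
print(W, p_value)  # if p_value > 0.05, do not reject the normality of the residuals
```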

Nonadherence of the error terms to normality can indicate that the model was specified incorrectly as to its functional form or that relevant explanatory variables were omitted. To correct this problem, the functional form can be altered, or new explanatory variables can be included in the model.

In Section 13.3.5, we will present the linktest and the RESET test for the identification of specification problems in the functional form and the omission of relevant variables, respectively. In Section 13.4, we discuss nonlinear specifications, highlighting some specific functional forms. In that same section, we will discuss the Box-Cox transformations, whose purpose is to maximize adherence to the normal distribution of a variable generated from an original variable with a non-normal distribution. It is very common for this procedure to be applied to the dependent variable of a model whose estimation generated error terms that do not adhere to normality.

It is worth commenting that the claim that explanatory variables need to present distributions that adhere to normality is quite common, and it is a big mistake. If this were the case, it would not be possible to use dummy variables in our models.

13.3.2 The Multicollinearity Problem

The multicollinearity problem occurs when there are very high correlations between explanatory variables. In extreme cases, such correlations can be perfect, indicating a linear relation between the variables.

Initially, we present the general multiple linear regression model in its matrix form. Beginning with:

Y_i = a + b_1 X_{1i} + b_2 X_{2i} + \dots + b_k X_{ki} + u_i   (13.30)

we write that:

Y = Xb + U   (13.31)

or:

\begin{bmatrix} Y_1 \\ Y_2 \\ Y_3 \\ \vdots \\ Y_n \end{bmatrix}_{n \times 1} = \begin{bmatrix} 1 & X_{11} & X_{12} & \cdots & X_{1k} \\ 1 & X_{21} & X_{22} & \cdots & X_{2k} \\ 1 & X_{31} & X_{32} & \cdots & X_{3k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n1} & X_{n2} & \cdots & X_{nk} \end{bmatrix}_{n \times (k+1)} \cdot \begin{bmatrix} a \\ b_1 \\ b_2 \\ \vdots \\ b_k \end{bmatrix}_{(k+1) \times 1} + \begin{bmatrix} u_1 \\ u_2 \\ u_3 \\ \vdots \\ u_n \end{bmatrix}_{n \times 1}   (13.32)

from which we can show that the parameter estimates are given by the following vector:

\beta = (X'X)^{-1}(X'Y)   (13.33)
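Expression (13.33) can be verified directly. The sketch below, in Python with NumPy, assumes the travel-time data used earlier in the chapter and computes β = (X′X)−1(X′Y) by hand; it should reproduce, up to rounding, the estimates obtained in Excel.

```python
# Minimal sketch: OLS estimates via the normal equations, beta = (X'X)^(-1) X'Y.
import numpy as np

time = np.array([15, 20, 20, 40, 50, 25, 10, 55, 35, 30], dtype=float)
dist = np.array([8, 6, 15, 20, 25, 11, 5, 32, 28, 20], dtype=float)
sem = np.array([0, 1, 0, 1, 2, 1, 0, 3, 1, 1], dtype=float)

X = np.column_stack([np.ones_like(dist), dist, sem])  # column of 1s for the intercept
beta = np.linalg.inv(X.T @ X) @ (X.T @ time)
print(beta)  # [alpha, beta1, beta2] -- approximately [8.15, 0.80, 8.30]
```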

We can imagine a specific model with only two explanatory variables, as follows:

Y_i = a + b_1 X_{1i} + b_2 X_{2i} + u_i   (13.34)

If, for example, X2i = 4 ⋅ X1i, it would not be possible to separate the variations in the dependent variable due to alterations in X1 from those due to the influence of X2. Therefore, according to Vasconcellos and Alves (2000), it would be impossible, in this situation, to estimate all the parameters of Expression (13.34), since the inversion of the (X′X) matrix would be impossible and, consequently, so would the calculation of the vector of parameters β = (X′X)− 1 (X′Y). However, the following model could be estimated:

Y_i = a + (b_1 + 4 b_2)\,X_{1i} + u_i   (13.35)

whose estimated parameter would be a linear combination between b1 and b2.

Greater problems, however, occur when the correlation between explanatory variables is quite high, albeit not perfect, as will be discussed later by means of numerical examples and dataset applications.

13.3.2.1 Causes of Multicollinearity

One of the main causes of multicollinearity is the existence of variables that present the same tendency over certain periods. We can imagine, for example, that we want to see if the profitability, over a period of time, of a certain European fixed income fund tied to price indexes varies due to inflation indexes lagged by 3 months. In other words, we wish to estimate a model where fund profitability in period t is a function of determined inflation indexes in t − 3. For this, the researcher includes, as explanatory variables, inflation indexes that measure the change over time in the prices of a basket of goods and services acquired by European consumers, such as the Harmonized Index of Consumer Prices and the Monetary Union Index of Consumer Prices (both in t − 3). Since these indexes tend to be correlated over time, the resulting model will very probably present multicollinearity.

Such a phenomenon is not restricted to datasets where there is a temporal evolution. We can imagine another situation where a researcher wants to discover if the revenue for a sample of supermarkets in a month is due to square feet of sales area (ft2) and the number of employees assigned to each of the stores. Being that it is known that, for this type of retail operation, there is a certain correlation between the amount of sales area and number of employees, multicollinearity problems can occur in this cross-section as well.

Another common cause for multicollinearity is the use of datasets with an insufficient number of observations.

13.3.2.2 Consequences of Multicollinearity

The existence of multicollinearity has a direct impact on the calculation of the (X′X) matrix. To illustrate this problem, we present, by means of numerical examples, the calculations of the (X′X) and (X′X)− 1 matrices in three distinct cases of correlation between two explanatory variables: (a) perfect correlation; (b) very high, but not perfect, correlation; (c) low correlation.

(a) Perfect correlation

Imagine an X matrix with only two explanatory variables and two observations:

X = \begin{bmatrix} 1 & 4 \\ 2 & 8 \end{bmatrix}

Then:

X'X = \begin{bmatrix} 5 & 20 \\ 20 & 80 \end{bmatrix}

and, therefore, det(X′X) = 0, or rather, (X′X)− 1 cannot be calculated.

(b) Very high, but not perfect, correlation

Imagine now that the X matrix presents the following values:

X = \begin{bmatrix} 1 & 4 \\ 2 & 7.9 \end{bmatrix}

Then:

X'X = \begin{bmatrix} 5 & 19.8 \\ 19.8 & 78.41 \end{bmatrix}

from whence comes det(X′X) = 0.01 and, therefore:

(X'X)^{-1} = \begin{bmatrix} 7841 & -1980 \\ -1980 & 500 \end{bmatrix}

Since the variance-covariance matrix of the model parameters is given by σ2(X′X)− 1, and since the elements of the main diagonal of this matrix appear in the denominator of the t statistic, as studied in Section 13.2.3 (Expression (13.21)), the t statistics tend, in this case, to present underestimated values due to the existence of high values in the (X′X)− 1 matrix, which can eventually cause the researcher to consider the effects of some of the explanatory variables as insignificant. However, since the calculations of the F statistic and the R2 are not affected by this phenomenon, it is common to find models in which the explanatory variable coefficients are not statistically significant while the F-test rejects the null hypothesis at the same significance level, or rather, indicates that at least one parameter is statistically different from zero. In many cases, this inconsistency even comes accompanied by a high R2 value.

(c) Low correlation

Imagine, finally, that the X matrix comes to present the following values:

X = \begin{bmatrix} 1 & 4 \\ 2 & 3 \end{bmatrix}

Then:

X'X = \begin{bmatrix} 5 & 10 \\ 10 & 25 \end{bmatrix}

from which comes that det(X′X) = 25 and, therefore:

(X'X)^{-1} = \begin{bmatrix} 1 & -0.4 \\ -0.4 & 0.2 \end{bmatrix}

We can now verify that, given the low correlation between X1 and X2, the values presented in the (X′X)− 1 matrix are low, which has little influence on the reduction of the t statistics.
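The three cases can be reproduced with a minimal sketch (Python with NumPy assumed), which shows how det(X′X) shrinks, and the entries of (X′X)− 1 grow, as the correlation between the columns of X increases:

```python
# Minimal sketch: det(X'X) and (X'X)^(-1) for the three numerical cases above.
import numpy as np

cases = {
    "(a) perfect":   np.array([[1.0, 4.0], [2.0, 8.0]]),
    "(b) very high": np.array([[1.0, 4.0], [2.0, 7.9]]),
    "(c) low":       np.array([[1.0, 4.0], [2.0, 3.0]]),
}

for label, X in cases.items():
    XtX = X.T @ X
    det = np.linalg.det(XtX)
    print(label, "det(X'X) =", round(det, 4))
    if abs(det) > 1e-10:              # the inverse exists only when det is nonzero
        print(np.linalg.inv(XtX))     # large entries when det(X'X) is close to zero
```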

In Section 13.3.2.3, to follow, models are prepared using a dataset that allows for the study of these three situations.

13.3.2.3 Application of Multicollinearity Examples in Excel

Returning to the example used throughout this chapter, we now imagine that the professor wishes to evaluate the influence of the distance traveled (dist) and the number of intersections (cros) along the route on the time to get to school (time). To do this, the research questions were posed to students from three different classes (A, B, and C), so as to obtain, for each class, the following model:

time_i = a + b_1\,dist_i + b_2\,cros_i + u_i

The three cases presented refer to the data obtained from each of the three classes of students, respectively.

(a) Class A: The case of perfect correlation

Class A is composed only of students who live in the center of town—there coincidentally exists a perfect correlation between the distance traveled and the amount of intersections since each of the routes possesses the same characteristics and is always in the urban zone. The dataset collected from Class A is presented in Table 13.13.

Table 13.13

Class A and the Example of Perfect Correlation Between Explanatory Variables (Distance Traveled and Amount of Intersections)
Student      | Time to Get to School (min) (Yi) | Distance Traveled to School (km) (X1i) | Amount of Intersections (X2i)
Gabriela     | 15 | 8  | 16
Dalila       | 20 | 6  | 12
Gustavo      | 20 | 15 | 30
Leticia      | 40 | 20 | 40
Luiz Ovidio  | 50 | 25 | 50
Leonor       | 25 | 11 | 22
Ana          | 10 | 5  | 10
Antonio      | 55 | 32 | 64
Julia        | 35 | 28 | 56
Mariana      | 30 | 20 | 40

By means of the file Timedistcros_class_A.xls, we can prepare the multiple regression as shown in Fig. 13.35.

Fig. 13.35 Multiple linear regression for Class A.

The outputs are presented in Fig. 13.36.

Fig. 13.36 Multiple linear regression outputs for Class A.

As we can see, the parameter of variable X1 (dist) was not estimated, since the correlation between dist and cros is perfect and, therefore, the inversion of the (X′X) matrix is impossible, which, in this case, is given as:

X'X = \begin{bmatrix} 3704 & 7408 \\ 7408 & 14{,}816 \end{bmatrix}, from which we get that det(X′X) = 0.

In any case, since we know that crosi = 2 ⋅ disti, we can estimate the following model:

time_i = a + (b_1 + 2 b_2)\,dist_i + u_i

where the estimated parameter will be a linear combination between b1 and b2.

(b) Class B: The case of high, but not perfect, correlation

Class B, very similar to Class A in terms of travel characteristics, has only one student (Americo) who, because he uses an expressway, goes through one intersection less, proportionally, than the others, as can be seen in Table 13.14. As such, the correlation between dist and cros is no longer perfect, even though it is extremely high (in the case of this example, equal to 0.9998).

Table 13.14

Class B and the Example of Very High Correlation Between the Explanatory Variables (Distance Traveled and Number of Intersections)
Student     | Time to Get to School (min) (Yi) | Distance Traveled to School (km) (X1i) | Number of Intersections (X2i)
Grace       | 15 | 8  | 16
Phillip     | 20 | 6  | 12
Antonietta  | 20 | 15 | 30
Americo     | 40 | 20 | 39
Ferruccio   | 50 | 25 | 50
Francis     | 25 | 11 | 22
Camilo      | 10 | 5  | 10
William     | 55 | 32 | 64
Paula       | 35 | 28 | 56
Matthew     | 30 | 20 | 40

By means of the file Timedistcros_class_B.xls, we can prepare the same multiple regression, whose outputs are presented in Fig. 13.37.

Fig. 13.37 Multiple linear regression outputs for Class B.

In this case, as we have already discussed, it is possible to see an inconsistency between the F-test result and the t-test results, since the latter present underestimated statistics due to the higher values in the (X′X)− 1 matrix, or rather, due to the fact that det(X′X) is lower. In this case, we have:

X'X = \begin{bmatrix} 3704 & 7388 \\ 7388 & 14{,}737 \end{bmatrix}, from which it comes that det(X′X) = 3,304, which apparently is a high value; however, it is considerably lower than that calculated for Class C to follow. Besides this, in this case, we have that:

(X'X)^{-1} = \begin{bmatrix} 4.460 & -2.236 \\ -2.236 & 1.121 \end{bmatrix}

As a result, the outputs (Fig. 13.37) can cause a researcher to erroneously affirm that no parameter in the model in question is statistically significant, even though the F-test has indicated that at least one of them is statistically different from zero at the significance level of, for example, 5%, and even though the R2 itself is relatively high (R2 = 0.8379). This phenomenon represents a major error that can be committed in models with high multicollinearity between explanatory variables.

(c) Class C: The case of lower correlation

Class C is more heterogeneous in terms of travel characteristics, being that it is composed of students who also come from other communities and, therefore, use roads with a proportionally lower number of intersections along the route. The correlation between dist and cros, in this case, is 0.6505. Table 13.15 presents the dataset collected for Class C.

Table 13.15

Class C and the Example of a Lower Correlation Between Explanatory Variables (Distance Traveled and Number of Intersections)
Student    | Time to Get to School (min) (Yi) | Distance Traveled to School (km) (X1i) | Number of Intersections (X2i)
Juliana    | 15 | 8  | 12
Rachel     | 20 | 6  | 20
Larissa    | 20 | 15 | 25
Roger      | 40 | 20 | 37
Elizabeth  | 50 | 25 | 32
Wilson     | 25 | 11 | 17
Lauren     | 10 | 5  | 9
Sandra     | 55 | 32 | 60
Walter     | 35 | 28 | 12
Luke       | 30 | 20 | 17

The file Timedistcros_class_C.xls gives the data in Excel format, from which we can prepare the same multiple regression, whose outputs are presented in Fig. 13.38.

Fig. 13.38 Multiple linear regression outputs for Class C.

Now we can see that, given the lower correlation between dist and cros, the values present in the (X′X)− 1 matrix are much lower than those calculated for Class B, which has little influence on the reduction of the t statistics and, consequently, no inconsistencies occur between the t-tests and the F-test. In this case, we have:

X'X = \begin{bmatrix} 3704 & 4959 \\ 4959 & 7965 \end{bmatrix}, from which comes that det(X′X) = 4,910,679, which is a much higher value than that calculated for the previous case. Besides this, we have that:

(X'X)^{-1} = \begin{bmatrix} 0.0016 & -0.0010 \\ -0.0010 & 0.0008 \end{bmatrix}

13.3.2.4 Multicollinearity Diagnostics

The first and most simple method of multicollinearity diagnosis refers to the identification of high correlations between explanatory variables by means of the analysis of the correlation matrix. On one hand, this method presents great ease of application; on the other, it is unable to identify eventual existing relations between more than two variables simultaneously.

The second, less used, method refers to the study of the determinant of the (X′X) matrix. According to what we have studied in the two previous sections, very low det(X′X) values can indicate the presence of high correlations between explanatory variables, which hinders the analysis of the t statistics.

Last, but not least important, is the multicollinearity diagnostic prepared by means of the estimation of auxiliary regressions. According to Vasconcellos and Alves (2000), based on Expression (13.30), the following auxiliary regressions can be estimated:

X_{1i} = a + b_1 X_{2i} + b_2 X_{3i} + \dots + b_{k-1} X_{ki} + u_i
X_{2i} = a + b_1 X_{1i} + b_2 X_{3i} + \dots + b_{k-1} X_{ki} + u_i
\vdots
X_{ki} = a + b_1 X_{1i} + b_2 X_{2i} + \dots + b_{k-1} X_{(k-1)i} + u_i   (13.36)

and, for each of them, there will be an Rk2. If one or more of these auxiliary Rk2 values is high, we can consider the existence of multicollinearity. Based on them, we can define the Tolerance and VIF (Variance Inflation Factor) statistics, as follows:

Tolerance = 1 - R_k^2   (13.37)

VIF = \dfrac{1}{Tolerance}   (13.38)

Being thus, if the Tolerance is very low and, consequently, the VIF statistic high, we have an indication of multicollinearity problems. In other words, if the Tolerance is low for a certain auxiliary regression, it means that the explanatory variable that performs the dependent role in the auxiliary regression shares a high percentage of variance with the other explanatory variables.

While many authors state that multicollinearity problems arise with VIF values above 10, we notice that a VIF value equal to 4 results in a Tolerance of 0.25, or rather, in an Rk2 of 0.75 for that determined auxiliary regression, which represents a relatively high percentage of shared variance between a certain explanatory variable and the others.
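A minimal sketch of this diagnostic (Python with NumPy and statsmodels assumed), using the Class B data as an illustration, estimates an auxiliary regression as in Expression (13.36) and derives the Tolerance and VIF of Expressions (13.37) and (13.38):

```python
# Minimal sketch: Tolerance and VIF from an auxiliary regression (Class B data).
import numpy as np
import statsmodels.api as sm

dist = np.array([8, 6, 15, 20, 25, 11, 5, 32, 28, 20], dtype=float)
cros = np.array([16, 12, 30, 39, 50, 22, 10, 64, 56, 40], dtype=float)

# auxiliary regression: dist as a function of the other explanatory variable (cros)
r2_aux = sm.OLS(dist, sm.add_constant(cros)).fit().rsquared
tolerance = 1 - r2_aux   # Expression (13.37)
vif = 1 / tolerance      # Expression (13.38)
print(tolerance, vif)    # very low Tolerance / very high VIF signals multicollinearity
```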

13.3.2.5 Possible Solutions for the Multicollinearity Problem

Multicollinearity represents one of the most difficult problems to be solved in data modeling. While some researchers simply apply the Stepwise procedure in order to eliminate correlated explanatory variables, which can in fact correct the multicollinearity, such a solution can generate a specification problem through the omission of a relevant variable, as we will discuss in Section 13.3.5.

The creation of orthogonal factors based on the explanatory variables, by means of applying the factor analysis technique (Chapter 12), can correct multicollinearity problems. For forecasting purposes, however, the corresponding factors for new observations will not be known, which creates a problem for the researcher. Besides this, the creation of factors always entails the loss of a portion of the variance of the original explanatory variables.

The good news, as Vasconcellos and Alves (2000) also discuss, is that the existence of multicollinearity does not affect the intention to prepare forecasts, provided that the same conditions that generated the results are maintained for the forecast. In this way, the forecasts will incorporate the same pattern of relation between the explanatory variables, which does not present any problem. Gujarati (2011) also highlights that the existence of high correlations between explanatory variables does not necessarily generate bad or weak estimators and that the presence of multicollinearity does not mean that the model has problems. In other words, some authors argue that a solution for multicollinearity is to identify it, recognize it, and do nothing.
