Simple Modeling Issues and Heteroscedasticity
Today, Touro asks, “Prof. Metric, I heard that the linear regression technique only requires a model to be linear in parameters. What does that mean?” Booka then says, “I also have a question. In the previous chapters, we assumed that the data did not have any problems. What will happen if they violate one of the classic assumptions?” Prof. Metric praises them for raising good questions and tells us that several of these issues will be discussed this week. We learn that once we finish with the chapter, we will be able to:
1. Master simple model issues in regressions;
2. Explain the nature and consequences of the heteroscedasticity problem;
3. Obtain the corrected standard errors and the transformed models for estimations;
4. Carry out Excel applications to these models.
Functional Forms
Model Transformations
Prof. Metric explains that the “linear in parameters” requirement means that the model cannot contain parameters in any nonlinear form, such as a2² or 1/a2. Other than that, the dependent and explanatory variables can take any nonlinear form, such as ln(Y), ln(X), or X². The reason is that we can easily transform a model that has nonlinear variables into a linear one. Equation 4.1 presents a logarithmic model with linear parameters:

Ln(Y) = a1 + a2 Ln(X) + e. (4.1)
We can transform this model by letting Y* = Ln(Y) and letting X* = Ln(X); then, the model becomes a linear one:
Y* = a1 + a2 X*.
Similarly, a polynomial model

Y = a1 + a2 X² + a3 Z³ + e

can be transformed by letting X* = X² and Z* = Z³; then the model also becomes a linear one in parameters:
Y = a1 + a2 X*+ a3 Z*.
These transformations allow us to employ OLS estimation techniques using Excel, as usual.
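As a sketch of how such a transformation works in practice, the following Python snippet (hypothetical data generated purely for illustration; numpy only) estimates a log-log model by transforming both variables and running OLS on the transformed equation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: Y = exp(a1) * X^a2 * noise, with true a1 = 0.5, a2 = 1.5
X = rng.uniform(1.0, 10.0, 200)
Y = np.exp(0.5) * X**1.5 * np.exp(rng.normal(0, 0.05, 200))

# Transform: Y* = ln(Y), X* = ln(X); the model is now linear in parameters
Y_star, X_star = np.log(Y), np.log(X)

# OLS on the transformed model Y* = a1 + a2 X*
A = np.column_stack([np.ones_like(X_star), X_star])
(a1_hat, a2_hat), *_ = np.linalg.lstsq(A, Y_star, rcond=None)

print(round(a1_hat, 2), round(a2_hat, 2))  # close to 0.5 and 1.5
```

The same two-step pattern (transform the variables, then run ordinary least squares) applies to the polynomial case by replacing the log columns with X² and Z³ columns.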
Choosing a Logarithmic Model
The following log models are the most important in data analysis, and we want to learn how to interpret them.
Log-Log Model
This model is written as:

ln(yi) = a1 + a2 ln(xi) + ei.
In this case, the parameter a2 measures the elasticity of y with respect to x.
For example, the following model measures the price elasticity of demand:
ln DEMANDi = a1 + a2 ln PRICi + ei.
Booka raises her hand and offers her estimation for the price elasticity of demand for history books at her company. The estimation is based on her recent survey. The estimation results are:
ln DEMANDi = 0.3 − 1.1 ln PRICi.
Invo offers “So, a 1 percent increase in the price of your history books decreases the demand for these books by 1.1 percent.” Touro adds, “Oh, the demand for your history books is slightly elastic because its price elasticity is a little greater than 1. You might want to draw several more samples to see whether the results are robust to sample changes. If this is true, your company might want to lower the prices slightly to boost sales.” Prof. Metric praises them for an insightful discussion and moves to the next section.
Log-Linear Model
This model is written as:

ln(yi) = a1 + a2 xi + ei.
In this case, the slope a2 measures the percent change (a2 × 100) in y due to a one-unit change in x.
For example, the following model measures the impact of investing in new capital on the profit of a company:
ln PROFITi = a1 + a2 CAPITAi + ei.
Taila says that her company bought several new machines six months ago and found their effect on the company’s monthly profit as follows:
ln PROFITi = a1 + 0.02CAPITAi.
where CAPITAL was measured in thousands of dollars.
We are able to calculate the impact of this change in capital on the profit as ΔPROFIT = (0.02*100)% = 2%. Since the unit of capital is in thousands of dollars, this 2 percent change in profit is due to a $1,000 increase in capital.
Linear-Log Model
This model is written as:

yi = a1 + a2 ln(xi) + ei.
In this case, a 1 percent change in x leads to a (a2/100) unit change in y.
Touro offers an example model, in which spending depends on log of income:
SPENDi = a1 + a2 ln(INCOM)i + ei.
Prof. Metric says this is an interesting case: if a2 > 0, an increase in income will cause an increase in spending, indicating that the good is a normal good, but the log function also implies that spending will increase at a decreasing rate, because the graph of a log function is concave with a decreasing slope. Touro says this is the case with spending on travel at his company, where the result for regression is:
SPENDi = 200 + 1400 ln(INCOM)i,
where the unit of spending (SPEND) is in thousands of dollars.
We can calculate the impact of the change in income on the spending for travel, for Touro, as ΔSPEND = (1400/100) = 14. Since the unit of spending is in thousands of dollars, we conclude that a 1 percent change in income causes a $14,000 change in travel spending at Touro’s company.
Growth Rate
Prof. Metric reminds us that ln(1 + x) ≅ x if x is small (less than 20 percent) and that we learned this approximation in algebra classes. This formula has practical applications here. Invo offers an example on the growth of revenue at his company, where they found this equation relating revenue over time:

REVENUEt = REVENUEt−1 (1 + r),

where r is the yearly growth rate of revenue.
We are able to go back to the first period as follows:
REVENUEt = REVENUE0 (1 + r)^t.
Taking the logarithms of both sides yields:
ln (REVENUEt) = ln(REVENUE0) + [ln(1 + r)] × t.
Let ln(REVENUE0) = a1 and [ln(1 + r)] = a2, then
ln (REVENUEt) = a1 + a2t.
Invo provides us with the following estimation results:
ln (REVENUE) = 0.2413 + 0.0276 × t.
We are able to write the approximation as:
[ln (1 + r)] ≃ r = (0.0276 * 100)% = 2.76%.
Hence, the growth rate of revenue at Invo’s company is approximately 2.76 percent per year.
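A quick numerical check of the approximation, using the slope from Invo's regression (the “exact” rate below comes from inverting a2 = ln(1 + r)):

```python
import math

# The approximation ln(1 + r) ~ r for small r, applied to the slope estimate
a2 = 0.0276                  # slope from ln(REVENUE) = a1 + a2 * t
r_exact = math.exp(a2) - 1   # exact growth rate implied by a2 = ln(1 + r)
r_approx = a2                # the small-r approximation used in the text

print(f"exact: {r_exact:.4%}, approx: {r_approx:.4%}")
```

The exact rate is only a few hundredths of a percentage point above 2.76 percent, which is why the approximation is acceptable for small growth rates.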
Return to Training
Prof. Metric says that we can similarly have wages as a function of training, so:

ln(WAGE)t = a1 + a2 TRAININGt−1 + et,

where a2 ≈ r, the rate of return to an extra month of training.
Suppose the estimation result is:
ln(WAGE)t = 0.4315 + 0.0124 * TRAININGt−1,
then an additional month of training increases the wage rate by approximately 1.24 percent.
Prof. Metric reminds us that all log models can be extended to multiple regressions by adding more control variables, as discussed in Chapter 3, or by adding dummy variables, as will be discussed in the following section.
Intercept Dummy Variables
Additive Dummy Variables
An intercept dummy is called an additive dummy because this variable is added to the model. We can add dummy variables (also called indicator variables) to a model to control for differences in characteristics among various groups. Prof. Metric tells us to go back to Booka’s example of the paperback or hardback books. The original model on their sale values can be written as follows:

SALEi = a1 + a2 Xi + ei, (4.7)

where Xi denotes an explanatory variable for sale values.
Let D = 1 if the book is in paperback and D = 0 otherwise, then we have two groups in this case (numbers of groups = G = 2). Suppose the dataset has 15 observations for D = 1 and 15 observations for D = 0, then the dummy-variable method has two advantages:
(i) It allows us to use the whole dataset of 30 observations.
(ii) We can compare and contrast the difference between the two groups.
Prof. Metric says that we can have more than two groups—for example, cars can come in green, red, blue, or brown, and thus G = 4.
In the aforementioned example on book sales, if we add D to equation (4.7), then

SALEi = a1 + d Di + a2 Xi + ei.

The intercept might not have any meaning, but the difference in the intercepts depicts the differences in the prices for different types of books. Suppose d = −15 and SALE is in dollars; then a paperback copy reduces the sale value by $15.
A researcher can add only (G − 1) dummies to the regression; the omitted group serves as the reference group. For example, with two groups, we use one dummy. Otherwise, the sum of D1 and D2 would equal 1 for every observation; since the regressor for the constant term is a vector of all 1s, adding G dummies creates a perfect-collinearity problem. This problem is also called the dummy-variable trap.
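The trap can be verified numerically. This sketch (hypothetical group assignments; numpy only) shows that a design matrix containing the constant plus all G = 2 dummies loses full column rank, while dropping one dummy restores it:

```python
import numpy as np

# Dummy-variable trap: with G = 2 groups, including both D1 and D2 alongside
# the constant makes the design matrix perfectly collinear, since D1 + D2 = 1.
n = 6
D1 = np.array([1, 1, 1, 0, 0, 0])  # hypothetical: first three books are paperback
D2 = 1 - D1
const = np.ones(n)

X_trap = np.column_stack([const, D1, D2])  # constant + G dummies: 3 columns
X_ok = np.column_stack([const, D1])        # constant + (G - 1) dummies

print(np.linalg.matrix_rank(X_trap))  # 2: rank-deficient, OLS cannot separate the coefficients
print(np.linalg.matrix_rank(X_ok))    # 2: full column rank, OLS works
```

The group whose dummy is dropped (here D2, the hardbacks) becomes the reference group, and d measures the difference relative to it.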
Heteroscedasticity
Nature and Consequences
In the previous chapters, we assumed that the errors have a constant variance, var(ei) = σ². Prof. Metric says that when this variance is not constant, the classic assumption (iii) is violated, and the errors are said to be heteroscedastic. To see the problem, we all look at the original equation:

yi = a1 + a2 xi + ei. (4.10)
If var(ei) = σi² (note the subscript i), the variance changes as the observation changes, and we have a heteroscedasticity problem.
For example, if var(ei) = σ²√xi, where σ² = 100 and the variable xi changes from $64 to $81, then we can obtain

var(ei) = 100 × √64 = 800,

but

var(ei) = 100 × √81 = 900.

Thus, the error variance, var(ei), changes from 800 to 900 instead of remaining a constant. The heteroscedasticity problem could occur with time-series data as well.
To this point, Invo says, “Oh yes, suppose we want to perform a regression of food expenditures on income, then the variance of errors might change because this variance might be correlated with income as well.”
Touro asks, “How come the variance of the errors is correlated with income?” Invo answers, “Because food expenditures also depend on the price level, which in turn might go up if income goes up. Since price is not included in the model, it must be contained in the error term. As a result, the variance of the errors is correlated with income.”
Prof. Metric praises Invo and says that there are two consequences of heteroscedasticity:
(i) The standard errors are incorrect, so statistical inferences are not reliable.
(ii) The OLS estimator is no longer the BLUE, so its variance is not the smallest.
Each problem can be addressed separately, and we will learn how to overcome them in the next section.
Detecting Heteroscedasticity
A Lagrange multiplier (LM) test is usually performed on a variance function so that we can find out whether a heteroscedasticity problem exists. The theoretical foundation behind this test is simple. Given the model in equation (4.10) with var(ei) = σi², we let the error variance be a function of a variable w:

σi² = c1 + c2 wi.

We then test for the significance of c2. Since c1 is a constant, if c2 is not statistically different from zero, then var(ei) = c1 = a constant, so the hypotheses are:

H0: c2 = 0; Ha: c2 ≠ 0.
We learn that there are several LM tests. In this lecture, Prof. Metric teaches us the White version, which is the most popular one and which sets w = x. The test utilizes the chi-squared distribution, χ²(K−1), where K is the number of estimated coefficients (parameters) and (K − 1) is the degrees of freedom (df). Here are the steps to perform this LM test:
First, we need to estimate the original equation: yi = a1 + a2 xi + ei.
Next, we obtain the residuals êi and generate êi² so that we can estimate the variance function:

êi² = c1 + c2 xi + vi.
After that, we can follow the four-step procedure for any test:
(i) H0: c2 = 0; Ha: c2 ≠ 0.
(ii) Calculate LMSTAT = N*R2. (4.13)
(iii) Find the critical value χ²c either using a chi-squared distribution table or by typing =CHIINV(α, df) in any Excel cell, where the degrees of freedom is K − 1, which is 1 in this case.
(iv) If LMSTAT > χ²c, we reject the null hypothesis, meaning c2 is different from zero and implying that heteroscedasticity exists.
Prof. Metric then gives us an example, “Let’s say estimating êi² = c1 + c2 xi + vi yields R2 = 0.25 and the sample size is N = 30; how can we perform the LM test?”
We work out the problem together by calculating:
LMSTAT = 30*0.25 = 7.5; for α = 0.05, the critical value of χ²(1) is 3.84.
Since 7.5 > 3.84, we reject the null, meaning that c2 is different from zero and implying that the data has a heteroscedasticity problem.
We can also type =CHIINV(0.05, 1) into an Excel cell and see that Excel reports the critical value as 3.8415 ≈ 3.84.
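The same four steps can be sketched in Python on simulated data. The data-generating process below is an assumption for illustration (error variance proportional to x), not from the text; numpy is the only dependency:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 500

# Hypothetical heteroscedastic data: the error variance grows with x
x = rng.uniform(1.0, 10.0, N)
e = rng.normal(0, 1.0, N) * np.sqrt(x)   # var(e_i) = x_i
y = 2.0 + 0.5 * x + e

def ols(X, y):
    """Return OLS coefficients and residuals."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b, y - X @ b

# Step 1: regress y on x and obtain the residuals
X = np.column_stack([np.ones(N), x])
_, resid = ols(X, y)

# Step 2: regress e^2 on x (the variance function) and compute its R^2
e2 = resid**2
_, v = ols(X, e2)
R2 = 1 - (v @ v) / ((e2 - e2.mean()) @ (e2 - e2.mean()))

# Steps 3-4: LMSTAT = N * R^2, compared with the chi-squared(1) critical value 3.84
LM = N * R2
print(round(LM, 2))
```

With this design, LMSTAT comfortably exceeds 3.84, so the test correctly flags the heteroscedasticity that was built into the simulated errors.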
Since there are two possible consequences with heteroscedasticity, we can discuss each of them separately.
(i) The standard errors are incorrect
If the form of the heteroscedasticity is unknown, OLS can still be used to estimate the coefficients. The corrected variances are then calculated using the robust standard error method introduced by White (1980), and so they are also called White’s standard errors.
Theoretically, White’s variance and standard error for a2 are:

var(a2) = Σ(xi − x̄)²σi² / [Σ(xi − x̄)²]², se(a2) = √var(a2).

Empirically, σi² is replaced with the squared residuals êi²:

var(a2) = Σ(xi − x̄)²êi² / [Σ(xi − x̄)²]².
We are happy to hear that we do not have to calculate this White’s standard error because we will learn how to obtain it using Excel later in this chapter.
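For readers curious to see the empirical formula in action outside Excel, here is a minimal Python sketch of White’s standard error for the slope in a simple regression; the data are hypothetical:

```python
import numpy as np

def white_se_slope(x, y):
    """White's heteroscedasticity-robust standard error for the slope a2
    in y = a1 + a2*x + e, using the empirical formula from the text:
    var(a2) = sum((x - xbar)^2 * e_hat^2) / (sum((x - xbar)^2))^2."""
    X = np.column_stack([np.ones_like(x), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e_hat = y - X @ b
    dx2 = (x - x.mean())**2
    var_a2 = np.sum(dx2 * e_hat**2) / np.sum(dx2)**2
    return np.sqrt(var_a2)

# Hypothetical data with error variance rising in x
rng = np.random.default_rng(2)
x = rng.uniform(1.0, 10.0, 100)
y = 1.0 + 0.3 * x + rng.normal(0, 1.0, 100) * np.sqrt(x)
print(round(white_se_slope(x, y), 4))
```

The numerator and denominator here correspond exactly to the NUMER column and the squared sum built in the Excel steps later in the chapter.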
(ii) The OLS estimator is no longer the BLUE
Since the OLS estimator is no longer the BLUE when we have a heteroscedasticity problem, it is better if we can find an alternative estimator with a smaller variance. When the form of the heteroscedasticity is known, we can find a generalized least squares (GLS) estimator to replace the OLS estimator. There are several approaches to GLS estimation. Given the model in equation (4.10), we first study the simplest one by assuming that σi² = σ²xi. In this case we must divide both sides of equation (4.10) by √xi:

yi/√xi = a1(1/√xi) + a2(xi/√xi) + ei/√xi.

Then we perform a regression on the transformed equation:

yi* = a1 x1i* + a2 xi* + ei*, where yi* = yi/√xi, x1i* = 1/√xi, xi* = xi/√xi, and ei* = ei/√xi.
Touro exclaims, “Oh, this model no longer has a constant variable, because 1/√xi changes with each observation.” Prof. Metric says that is true, so we need to suppress the constant when performing the regression, and Prof. Empirie will show us how to do it in Excel.
He then says that the problem is solved, because

var(ei*) = var(ei/√xi) = (1/xi) var(ei) = (1/xi) σ²xi = σ² = a constant.
We also learn that the predicted values reported in Excel are for y*, so we need to multiply ŷ* by √xi to obtain the predicted value ŷ. The interval prediction values then can be calculated as usual.
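The whole transformation can be sketched in Python as well (hypothetical data generated with var(ei) = σ²xi; numpy only). Note that the design matrix contains no separate constant column, mirroring the suppressed intercept:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200

# Hypothetical data with var(e_i) = sigma^2 * x_i; true a1 = 5.0, a2 = 2.0
x = rng.uniform(1.0, 10.0, N)
y = 5.0 + 2.0 * x + rng.normal(0, 2.0, N) * np.sqrt(x)

# Transform: divide every term by sqrt(x); the constant becomes 1/sqrt(x)
w = np.sqrt(x)
y_star = y / w
X_star = np.column_stack([1.0 / w, x / w])   # no separate constant column

# Regress y* on 1/sqrt(x) and x/sqrt(x) with the intercept suppressed
(a1_hat, a2_hat), *_ = np.linalg.lstsq(X_star, y_star, rcond=None)
print(round(a1_hat, 2), round(a2_hat, 2))  # close to 5.0 and 2.0
```

Multiplying the fitted values of y* back by √xi recovers predictions on the original scale, just as in the Excel procedure.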
Prof. Metric tells us that sometimes the form of heteroscedasticity is much more complex, so transforming the model by dividing by √xi does not correct the problem. In practice, we don’t know the form of the heteroscedasticity, so we have to estimate a more general form.
We are excited and say in unison, “Let’s learn a more general case,” and Prof. Metric introduces this model:

var(ei) = σi² = σ²xi^γ.

Theoretically, ln(σi²) = ln(σ²) + γ ln(xi), so we can find γ by performing a regression of this equation.
Empirically, we don’t have σi², so we need to run a regression on yi = a1 + a2 xi + ei and obtain êi. After that, we can obtain estimated values of ln(σ²) and γ ln(xi) by writing the relation as:

ln(êi²) = α1 + α2 zi + vi, (4.18)

where α1 = ln(σ²), α2 = γ, and zi = ln(xi).
Therefore, the estimated version of equation (4.18) is:

ln(σ̂i²) = α̂1 + α̂2 zi. (4.19)

From equation (4.19), we can calculate the estimated variance:

σ̂i² = exp(α̂1 + α̂2 zi).
Next, we can make the transformation by dividing equation (4.10) by σ̂i:

yi/σ̂i = a1(1/σ̂i) + a2(xi/σ̂i) + ei/σ̂i.
The problem is solved, because

var(ei/σ̂i) ≈ (1/σi²) var(ei) = (1/σi²) σi² = 1 = a constant.
Prof. Metric says the correction can also be extended to multiple regressions. In this case, the formula for calculating the estimated variance is:

σ̂i² = exp(α̂1 + α̂2 z2i + … + α̂K zKi),

where K is the number of parameters.
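The multi-step feasible GLS procedure above can be sketched end to end in Python. The data below are hypothetical, with the true γ chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 300

# Hypothetical data with var(e_i) = sigma^2 * x_i^gamma (gamma = 1.5, unknown
# to the analyst); true a1 = 5.0, a2 = 2.0
x = rng.uniform(1.0, 10.0, N)
y = 5.0 + 2.0 * x + rng.normal(0, 1.0, N) * x**0.75

def ols(X, y):
    """Return OLS coefficients and residuals."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b, y - X @ b

# Step 1: OLS on the original model to get residuals e_hat
X = np.column_stack([np.ones(N), x])
_, e_hat = ols(X, y)

# Step 2: regress ln(e_hat^2) on z = ln(x) to estimate the variance function
Z = np.column_stack([np.ones(N), np.log(x)])
(alpha1, alpha2), _ = ols(Z, np.log(e_hat**2))

# Step 3: estimated standard deviations, then transform and re-estimate
sigma_hat = np.sqrt(np.exp(alpha1 + alpha2 * np.log(x)))
(a1_hat, a2_hat), _ = ols(X / sigma_hat[:, None], y / sigma_hat)
print(round(a1_hat, 2), round(a2_hat, 2))  # close to 5.0 and 2.0
```

Dividing each row of the design matrix and of y by σ̂i is exactly the transformation in the text; the re-estimated coefficients are then approximately BLUE.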
Data Analyses
Prof. Empirie reminds us that the Chow test is similar to the F-tests discussed in Chapter 3, in which she already showed us how to estimate the unrestricted and restricted models. Moreover, adding dummy variables to a regression is just like adding any variable. Hence, in this section she only provides instructions for heteroscedasticity.
Detecting Heteroscedasticity
The yearly data on spending (SPEND) and salary (SAL) are available in the file Ch04.xls, Fig.4.1; a section of the second regression, with the N and R2, is shown in Figure 4.1. First, we regress SPEND on SAL.
Click Data and then Data Analysis on the ribbon.
Select Regression, then click OK.
Type B1:B34 into the Input Y Range box.
Type A1:A34 into the Input X Range box.
Choose Labels and Residuals.
Check the Output Range button and enter F1.
Click OK then OK again to override the data range.
Copy and paste the residuals (e) from cells H24 through H57 into cells C1 through C34.
To generate e-squared (e2), type =C2^2 into cell D2, then press Enter.
Copy and paste this formula into cells D3 through D34.
Next, we need to regress e2 on SAL.
Click Data and then Data Analysis on the ribbon.
Select Regression, then click OK.
Enter D1:D34 into the Input Y Range box.
Enter A1:A34 into the Input X Range box.
Choose Labels.
Check the Output Range button and enter P1.
Click OK then OK again to override the data range.
From Figure 4.1, N = 33 and R2 = 0.1913, so LMSTAT = 33*0.1913 = 6.31.
Typing =CHIINV(0.05,1) into any Excel cell gives the critical value χ²c = 3.84.
Since LMSTAT = 6.31 > 3.84, we reject the null hypothesis, meaning the coefficient in the residual regression is statistically different from zero, implying that the model has a heteroscedasticity problem.
Obtaining White’s Standard Error
Prof. Empirie has utilized the results from the first regression for Figure 4.1, and we find them in the file Ch04.xls, Fig.4.2. We learn to perform the following steps:
In cell E2 type =AVERAGE(A2:A34), then press Enter (this is Xbar).
In cell F2 type =(A2 − $E$2)^2, then press Enter (this is (X − Xbar)2).
In cell G2 type =F2*D2, then press Enter (this is the numerator, NUMER, in the data file).
Copy and paste the formulas in cells F2 and G2 down the columns.
In cell F35 type =SUM(F2:F34), then press Enter.
In cell G35 type =SUM(G2:G34), then press Enter.
In cell G36 type =G35/(F35^2), then press Enter (this is the corrected var (a2)).
In cell G37 type =SQRT(G36), then press Enter (this is the corrected se (a2)).
The results are reported in Figure 4.2 with the corrected var(a2) = 0.0006 and se(a2) = 0.0238.
Estimating a Transformed Model
The yearly data on SPEND and SAL are used again in this demonstration and are available in the file Ch04.xls, Fig.4.3. We learn to perform the following steps:
In cell C2 type =A2^(1/2), then press Enter (this is SAL^(1/2)).
Copy and paste the formula into cells C3 through C34.
In cell D2 type =B2/C2 (this is SPEND*).
Copy and paste the formula into cells D3 through D34.
In cell E2 type =1/C2 (this is X1*).
Copy and paste the formula into cells E3 through E34.
In cell F2 type =A2/C2 (this is SAL*).
Copy and paste the formula into cells F3 through F34.
Next, we need to regress SPEND* on X1* and SAL*.
Go to Data Analysis and choose Regression.
Type D1:D34 into the Input Y Range box.
Type E1:F34 into the Input X Range box.
Check the boxes Labels and Constant is Zero.
(Make sure that you suppress the constant because the model no longer has a constant.)
Check the Output Range button and enter H1.
Click OK then OK again to obtain the regression results.
The regression results for the transformed model are displayed in Figure 4.3.
Prof. Empirie reminds us that to obtain the predicted value for SPEND*, we just need to check the box Residuals when we enter the regression commands so that Excel will report them in its Summary Output. However, we still need to multiply predicted SPEND* by SAL1/2 to obtain the predicted SPEND. The interval prediction values can then be calculated as usual.
Exercises
1. The yearly data on earning (EARN) and capital (CAP) are available in the file Capital.xls. Regress CAP on EARN using Excel and obtain the residuals ê. Use the results to perform the heteroscedasticity test for the model.
2. Propose two alternative methods to correct the heteroscedasticity problem in Question 1.
3. The estimation results of a model are listed in Table 4.1. The dependent variable is ln(price) with price in dollars, and the independent variables are the areas of the houses in square feet (sqft) and an intercept dummy: D1 = 1 if a house is in the volcano area (volcan) and 0 otherwise. Provide an interpretation of volcan, including the meaning, magnitude, and significance level of the estimated coefficient.
|           | Coefficients | Standard error | p-value |
| Intercept | 4.0536       | 1.1165         | 0.006   |
| Sqft      | 0.0671       | 0.0203         | 0.008   |
| Volcan    | −0.1056      | 0.0329         | 0.009   |