CHAPTER 6

DIAGNOSTICS FOR LEVERAGE AND INFLUENCE

6.1 IMPORTANCE OF DETECTING INFLUENTIAL OBSERVATIONS

When we compute a sample average, each observation in the sample has the same weight in determining the outcome. In the regression situation, this is not the case. For example, we noted in Section 2.9 that the location of observations in x space can play an important role in determining the regression coefficients (refer to Figures 2.8 and 2.9). We have also focused attention on outliers, or observations that have unusual y values. In Section 4.4 we observed that outliers are often identified by unusually large residuals and that these observations can also affect the regression results. The material in this chapter is an extension and consolidation of some of these issues.

Consider the situation illustrated in Figure 6.1. The point labeled A in this figure is remote in x space from the rest of the sample, but it lies almost on the regression line passing through the rest of the sample points. This is an example of a leverage point; that is, it has an unusual x value and may control certain model properties. This point does not affect the estimates of the regression coefficients, but it certainly will have a dramatic effect on the model summary statistics such as R² and the standard errors of the regression coefficients. Now consider the point labeled A in Figure 6.2. This point has a moderately unusual x coordinate, and the y value is unusual as well. This is an influence point; that is, it has a noticeable impact on the model coefficients in that it “pulls” the regression model in its direction.

We sometimes find that a small subset of the data exerts a disproportionate influence on the model coefficients and properties. In an extreme case, the parameter estimates may depend more on the influential subset of points than on the majority of the data. This is obviously an undesirable situation; we would like for a regression model to be representative of all of the sample observations, not an artifact of a few. Consequently, we would like to find these influential points and assess their impact on the model. If these influential points are indeed “bad” values, then they should be eliminated from the sample. On the other hand, there may be nothing wrong with these points, but if they control key model properties, we would like to know it, as it could affect the end use of the regression model.

Figure 6.1 An example of a leverage point.

Figure 6.2 An example of an influential observation.

In this chapter we present several diagnostics for leverage and influence. These diagnostics are available in most multiple regression computer packages. It is important to use these diagnostics in conjunction with the residual analysis techniques of Chapter 4. Sometimes we find that a regression coefficient may have a sign that does not make engineering or scientific sense, a regressor known to be important may be statistically insignificant, or a model that fits the data well and that is logical from an application–environment perspective may produce poor predictions. These situations may be the result of one or perhaps a few influential observations. Finding these observations then can shed considerable light on the problems with the model.

6.2 LEVERAGE

As observed above, the location of points in x space is potentially important in determining the properties of the regression model. In particular, remote points potentially have disproportionate impact on the parameter estimates, standard errors, predicted values, and model summary statistics. The hat matrix

$$H = X(X'X)^{-1}X' \qquad (6.1)$$

plays an important role in identifying influential observations. As noted earlier, H determines the variances and covariances of ŷ and e, since Var(ŷ) = σ²H and Var(e) = σ²(I – H). The elements hij of the matrix H may be interpreted as the amount of leverage exerted by the ith observation yi on the jth fitted value ŷj.

We usually focus attention on the diagonal elements hii of the hat matrix H, which may be written as

$$h_{ii} = x_i'(X'X)^{-1}x_i \qquad (6.2)$$

where xi is the ith row of the X matrix. The hat matrix diagonal is a standardized measure of the distance of the ith observation from the center (or centroid) of the x space. Thus, large hat diagonals reveal observations that are potentially influential because they are remote in x space from the rest of the sample. It turns out that the average size of a hat diagonal is h̄ = p/n [because Σᵢ hii = rank(H) = rank(X) = p], and we traditionally assume that any observation for which the hat diagonal exceeds twice the average, 2p/n, is remote enough from the rest of the data to be considered a leverage point.

Not all leverage points are going to be influential on the regression coefficients. For example, recall point A in Figure 6.1. This point will have a large hat diagonal and is assuredly a leverage point, but it has almost no effect on the regression coefficients because it lies almost on the line passing through the remaining observations. Because the hat diagonals examine only the location of the observation in x space, some analysts like to look at the studentized residuals or R-student in conjunction with the hii. Observations with large hat diagonals and large residuals are likely to be influential. Finally, note that in using the cutoff value 2p/n we must also be careful to assess the magnitudes of both p and n. There will be situations where 2p/n > 1, and in these situations, the cutoff does not apply.
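
The hat diagonals are easy to obtain in most packages. As a minimal sketch in R, assuming the delivery time data are in a data frame named deliver with columns time, cases, and dist (the same names used in the R code of Example 6.1 below), the 2p/n cutoff can be applied as follows:

# Fit the model and extract the hat diagonals h_ii
deliver.model <- lm(time ~ cases + dist, data = deliver)
h <- hatvalues(deliver.model)

# Flag points whose leverage exceeds twice the average hat diagonal, 2p/n
p <- length(coef(deliver.model))   # number of parameters (here p = 3)
n <- nrow(deliver)
which(h > 2 * p / n)               # for these data, observations 9 and 22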

Example 6.1 The Delivery Time Data

Column a of Table 6.1 shows the hat diagonals for the soft drink delivery time data of Example 3.1. Since p = 3 and n = 25, any point for which the hat diagonal hii exceeds 2p/n = 2(3)/25 = 0.24 is a leverage point. This criterion identifies observations 9 and 22 as leverage points. The remote location of these points (particularly point 9) was previously noted when we examined the matrix of scatterplots in Figure 3.4 and when we illustrated interpolation and extrapolation with this model in Figure 3.11.

In Example 4.1 we calculated the scaled residuals for the delivery time data. Table 4.1 contains the studentized residuals and R-student. These residuals are not unusually large for observation 22, indicating that it likely has little influence on the fitted model. However, both scaled residuals for point 9 are moderately large, suggesting that this observation may have moderate influence on the model. To illustrate the effect of these two points on the model, three additional analyses were performed: one deleting observation 9, a second deleting observation 22, and the third deleting both 9 and 22. The results of these additional runs are shown in the following table:

[Table of coefficient estimates and summary statistics for these additional runs: values not reproduced.]

TABLE 6.1 Statistics for Detecting Influential Observations for the Soft Drink Delivery Time Data

[Table values not reproduced. Column a contains the hat diagonals hii, column b Cook's Di, columns c–f DFFITSi and DFBETASj,i, and column g COVRATIOi.]

Deleting observation 9 produces only a minor change in β̂1 but results in approximately a 28% change in β̂2 and a 90% change in β̂0. This illustrates that observation 9 is off the plane passing through the other 24 points and exerts a moderately strong influence on the regression coefficient associated with x2 (distance). This is not surprising considering that the value of x2 for this observation (1460 feet) is very different from the other observations. In effect, observation 9 may be causing curvature in the x2 direction. If observation 9 were deleted, then MSRes would be reduced to 5.905. Note that √5.905 = 2.430, which is not too different from the estimate of pure error σ̂ = 1.969 found by the near-neighbor analysis in Example 4.10. It seems that most of the lack of fit noted in this model in Example 4.11 is due to point 9's large residual. Deleting point 22 produces relatively smaller changes in the regression coefficients and model summary statistics. Deleting both points 9 and 22 produces changes similar to those observed when deleting only 9.

The SAS code to generate the influence diagnostics, assuming the data are in a SAS data set named deliver, is:

proc reg data=deliver;
  model time = cases dist / influence;
run;

The R code is:

# Fit the delivery time model and display the influence diagnostics
deliver.model <- lm(time ~ cases + dist, data = deliver)
summary(deliver.model)
print(influence.measures(deliver.model))
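
The three additional analyses described above can be reproduced by deleting rows and refitting; this is a sketch assuming the rows of deliver appear in the same order as the observation numbers in Table 6.1:

# Refit with observation 9, observation 22, and both deleted
fit.no9     <- lm(time ~ cases + dist, data = deliver[-9, ])
fit.no22    <- lm(time ~ cases + dist, data = deliver[-22, ])
fit.no.both <- lm(time ~ cases + dist, data = deliver[-c(9, 22), ])

# Compare the coefficient estimates across the four fits
rbind(all     = coef(deliver.model),
      no.9    = coef(fit.no9),
      no.22   = coef(fit.no22),
      no.both = coef(fit.no.both))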

6.3 MEASURES OF INFLUENCE: COOK'S D

We noted in the previous section that it is desirable to consider both the location of the point in the x space and the response variable in measuring influence. Cook [1977, 1979] has suggested a way to do this, using a measure of the squared distance between the least-squares estimate based on all n points, β̂, and the estimate obtained by deleting the ith point, say β̂(i). This distance measure can be expressed in a general form as

$$D_i(M, c) = \frac{(\hat{\beta}_{(i)} - \hat{\beta})'\,M\,(\hat{\beta}_{(i)} - \hat{\beta})}{c}, \qquad i = 1, 2, \ldots, n \qquad (6.3)$$

The usual choices of M and c are M = X′X and c = pMSRes, so that Eq. (6.3) becomes

$$D_i = \frac{(\hat{\beta}_{(i)} - \hat{\beta})'\,X'X\,(\hat{\beta}_{(i)} - \hat{\beta})}{p\,MS_{Res}}, \qquad i = 1, 2, \ldots, n \qquad (6.4)$$

Points with large values of Di have considerable influence on the least-squares estimate β̂.

The magnitude of Di is usually assessed by comparing it to Fα,p,n–p. If Di = F0.5,p,n–p, then deleting point i would move β̂(i) to the boundary of an approximate 50% confidence region for β based on the complete data set. This is a large displacement and indicates that the least-squares estimate is sensitive to the ith data point. Since F0.5,p,n–p ≈ 1, we usually consider points for which Di > 1 to be influential. Ideally we would like each estimate β̂(i) to stay within the boundary of a 10 or 20% confidence region. This recommendation for a cutoff is based on the similarity of Di to the equation for the normal-theory confidence ellipsoid [Eq. (3.50)]. The distance measure Di is not an F statistic. However, the cutoff of unity works very well in practice.

The Di statistic may be rewritten as

$$D_i = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}}, \qquad i = 1, 2, \ldots, n \qquad (6.5)$$

Thus, we see that, apart from the constant p, Di is the product of the square of the ith studentized residual and hii/(1 – hii). This ratio can be shown to be the distance from the vector xi to the centroid of the remaining data. Thus, Di is made up of a component that reflects how well the model fits the ith observation yi and a component that measures how far that point is from the rest of the data. Either component (or both) may contribute to a large value of Di. Thus, Di combines residual magnitude for the ith observation and the location of that point in x space to assess influence.

Because ŷ = Xβ̂ and ŷ(i) = Xβ̂(i), another way to write Cook's distance measure is

$$D_i = \frac{(\hat{y}_{(i)} - \hat{y})'(\hat{y}_{(i)} - \hat{y})}{p\,MS_{Res}}$$

Therefore, another way to interpret Cook's distance is that it is the squared Euclidean distance (apart from pMSRes) that the vector of fitted values moves when the ith observation is deleted.
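
In R, cooks.distance() returns the Di values directly; they can also be assembled from the studentized residuals and hat diagonals via Eq. (6.5), which makes the decomposition above concrete. A brief sketch using the delivery time model from Example 6.1:

# Cook's distance directly, and assembled from its two components
D <- cooks.distance(deliver.model)
r <- rstandard(deliver.model)    # studentized residuals r_i
h <- hatvalues(deliver.model)
p <- length(coef(deliver.model))
all.equal(D, (r^2 / p) * (h / (1 - h)))   # TRUE: the two computations agree

which(D > 1)   # points exceeding the rough cutoff of unity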

Example 6.2 The Delivery Time Data

Column b of Table 6.1 contains the values of Cook's distance measure for the soft drink delivery time data. We illustrate the calculations by considering the first observation. The studentized residuals for the delivery time data are listed in Table 4.1; r1 = –1.6277, and from column a of Table 6.1, h11 = 0.10180. Thus,

$$D_1 = \frac{r_1^2}{p} \cdot \frac{h_{11}}{1 - h_{11}} = \frac{(-1.6277)^2}{3} \cdot \frac{0.10180}{1 - 0.10180} = 0.10009$$

The largest value of the Di statistic is D9 = 3.41835, which indicates that deletion of observation 9 would move the least-squares estimate to approximately the boundary of a 96% confidence region around β̂. The next largest value is D22 = 0.45106, and deletion of point 22 will move the estimate of β to approximately the edge of a 28% confidence region. Therefore, we would conclude that observation 9 is definitely influential using the cutoff of unity, and observation 22 is not influential. Notice that these conclusions agree quite well with those reached in Example 6.1 by examining the hat diagonals and studentized residuals separately.
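
The confidence region probabilities quoted here are simply F-distribution tail areas and are easy to verify. For instance, in R:

# Percentile of F(p, n - p) = F(3, 22) corresponding to each D value
pf(3.41835, 3, 22)   # approximately 0.96 for observation 9
pf(0.45106, 3, 22)   # approximately 0.28 for observation 22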

6.4 MEASURES OF INFLUENCE: DFFITS AND DFBETAS

Cook's distance measure is a deletion diagnostic; that is, it measures the influence of the ith observation if it is removed from the sample. Belsley, Kuh, and Welsch [1980] introduced two other useful measures of deletion influence. The first of these is a statistic that indicates how much the regression coefficient β̂j changes, in standard deviation units, if the ith observation were deleted. This statistic is

$$\mathrm{DFBETAS}_{j,i} = \frac{\hat{\beta}_j - \hat{\beta}_{j(i)}}{\sqrt{S_{(i)}^2\, C_{jj}}}$$

where Cjj is the jth diagonal element of (X′X)−1 and β̂j(i) is the jth regression coefficient computed without use of the ith observation. A large (in magnitude) value of DFBETASj,i indicates that observation i has considerable influence on the jth regression coefficient. Notice that the DFBETASj,i values form an n × p matrix that conveys similar information to the composite influence information in Cook's distance measure.

The computation of DFBETASj,i is interesting. Define the p × n matrix

$$R = (X'X)^{-1}X'$$

The n elements in the jth row of R produce the leverage that the n observations in the sample have on β̂j. If we let rj denote the jth row of R, then we can show (see Appendix C.13) that

$$\mathrm{DFBETAS}_{j,i} = \frac{r_{j,i}}{\sqrt{r_j' r_j}} \cdot \frac{t_i}{\sqrt{1 - h_{ii}}}$$

where ti is the R-student residual. Note that DFBETASj,i measures both leverage (rj,i/√(rj′rj) is a measure of the impact of the ith observation on β̂j) and the effect of a large residual. Belsley, Kuh, and Welsch [1980] suggest a cutoff of 2/√n for DFBETASj,i; that is, if |DFBETASj,i| > 2/√n, then the ith observation warrants examination.
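
In R the full matrix of DFBETAS values is returned by dfbetas(); a sketch of applying the 2/√n cutoff to the delivery time model of Example 6.1:

# DFBETAS: rows are observations, columns are coefficients
db <- dfbetas(deliver.model)
n <- nrow(db)
which(abs(db) > 2 / sqrt(n), arr.ind = TRUE)   # observation/coefficient pairs to examine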

We may also investigate the deletion influence of the ith observation on the predicted or fitted value. This leads to the second diagnostic proposed by Belsley, Kuh, and Welsch:

$$\mathrm{DFFITS}_i = \frac{\hat{y}_i - \hat{y}_{(i)}}{\sqrt{S_{(i)}^2\, h_{ii}}}, \qquad i = 1, 2, \ldots, n$$

where ŷ(i) is the fitted value of yi obtained without the use of the ith observation. The denominator is just a standardization, since Var(ŷi) = σ²hii. Thus, DFFITSi is the number of standard deviations that the fitted value ŷi changes if observation i is removed.

Computationally we may find (see Appendix C.13 for details)

$$\mathrm{DFFITS}_i = \left(\frac{h_{ii}}{1 - h_{ii}}\right)^{1/2} t_i$$

where ti is R-student. Thus, DFFITSi is the value of R-student multiplied by the leverage of the ith observation [hii/(1 − hii)]1/2. If the data point is an outlier, then R-student will be large in magnitude, while if the data point has high leverage, hii will be close to unity. In either of these cases DFFITSi can be large. However, if hii ≃ 0, the effect of R-student will be moderated. Similarly a near-zero R-student combined with a high leverage point could produce a small value of DFFITSi. Thus, DFFITSi is affected by both leverage and prediction error. Belsley, Kuh, and Welsch suggest that any observation for which |DFFITSi| > 2√(p/n) warrants attention.
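
The identity above is easy to check numerically. A sketch in R, again for the delivery time model:

# DFFITS directly, and from R-student and the hat diagonals
d  <- dffits(deliver.model)
ti <- rstudent(deliver.model)   # R-student
h  <- hatvalues(deliver.model)
all.equal(d, ti * sqrt(h / (1 - h)))   # TRUE

# Observations exceeding the recommended cutoff 2*sqrt(p/n)
p <- length(coef(deliver.model))
which(abs(d) > 2 * sqrt(p / length(d)))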

A Remark on Cutoff Values In this section we have provided recommended cutoff values for DFFITSi and DFBETASj,i. Remember that these recommendations are only guidelines, as it is very difficult to produce cutoffs that are correct for all cases. Therefore, we recommend that the analyst utilize information about both what the diagnostic means and the application environment in selecting a cutoff. For example, if DFFITSi = 1.0, say, we could translate this into actual response units to determine just how much ŷi is affected by removing the ith observation. Then DFBETASj,i could be used to see whether this observation is responsible for the significance (or perhaps nonsignificance) of particular coefficients or for changes of sign in a regression coefficient. Diagnostic DFBETASj,i can also be used to determine (by using the standard error of the coefficient) how much change in actual problem-specific units a data point has on the regression coefficient. Sometimes these changes will be important in a problem-specific context even though the diagnostic statistics do not exceed the formal cutoff.

Notice that the recommended cutoffs are a function of sample size n. Certainly, we believe that any formal cutoff should be a function of sample size; however, in our experience these cutoffs often identify more data points than an analyst may wish to analyze. This is particularly true in small samples. We believe that the cutoffs recommended by Belsley, Kuh, and Welsch make sense for large samples, but when n is small, we prefer the diagnostic view discussed previously.

Example 6.3 The Delivery Time Data

Columns c−f of Table 6.1 present the values of DFFITSi and DFBETASj,i for the soft drink delivery time data. The formal cutoff value for DFFITSi is 2√(p/n) = 2√(3/25) = 0.69. Inspection of Table 6.1 reveals that both points 9 and 22 have values of DFFITSi that exceed this value, and additionally DFFITS20 is close to the cutoff.

Examining DFBETASj,i and recalling that the cutoff is 2/√n = 2/√25 = 0.40, we immediately notice that points 9 and 22 have large effects on all three parameters. Point 9 has a very large effect on the intercept and smaller effects on β̂1 and β̂2, while point 22 has its largest effect on β̂1. Several other points produce effects on individual coefficients that are close to the formal cutoff, including points 1, 4, and 24. These points produce relatively small changes in comparison to point 9.

Adopting a diagnostic view, point 9 is clearly influential, since its deletion results in a displacement of every regression coefficient by at least 0.9 standard deviation. The effect of point 22 is much smaller. Furthermore, deleting point 9 displaces the predicted response by over four standard deviations. Once again, we have a clear signal that observation 9 is influential.

6.5 A MEASURE OF MODEL PERFORMANCE

The diagnostics Di, DFBETASj,i, and DFFITSi provide insight about the effect of observations on the estimated coefficients β̂j and fitted values ŷi. They do not provide any information about overall precision of estimation. Since it is fairly common practice to use the determinant of the covariance matrix as a convenient scalar measure of precision, called the generalized variance, we could define the generalized variance of β̂ as

$$GV(\hat{\beta}) = \left| \mathrm{Var}(\hat{\beta}) \right| = \left| \sigma^2 (X'X)^{-1} \right|$$

To express the role of the ith observation on the precision of estimation, we could define

$$\mathrm{COVRATIO}_i = \frac{\left| \left( X_{(i)}' X_{(i)} \right)^{-1} S_{(i)}^2 \right|}{\left| (X'X)^{-1}\, MS_{Res} \right|}, \qquad i = 1, 2, \ldots, n$$

Clearly if COVRATIOi > 1, the ith observation improves the precision of estimation, while if COVRATIOi < 1, inclusion of the ith point degrades precision. Computationally

$$\mathrm{COVRATIO}_i = \frac{\left( S_{(i)}^2 \right)^p}{MS_{Res}^p} \left( \frac{1}{1 - h_{ii}} \right)$$

Note that [1/(1 − hii)] is the ratio of |(X(i)′X(i))−1| to |(X′X)−1|, so that a high leverage point will make COVRATIOi large. This is logical, since a high leverage point will improve the precision unless the point is an outlier in y space. If the ith observation is an outlier, S²(i)/MSRes will be much less than unity.

Cutoff values for COVRATIO are not easy to obtain. Belsley, Kuh, and Welsch [1980] suggest that if COVRATIOi > 1 + 3p/n or if COVRATIOi < 1 − 3p/n, then the ith point should be considered influential. The lower bound is only appropriate when n > 3p. These cutoffs are only recommended for large samples.
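
In R, covratio() computes COVRATIOi, and the cutoffs can be applied directly. A minimal sketch for the delivery time model:

# COVRATIO with the Belsley, Kuh, and Welsch cutoffs 1 +/- 3p/n
cv <- covratio(deliver.model)
p <- length(coef(deliver.model))
n <- length(cv)
which(cv > 1 + 3 * p / n | cv < 1 - 3 * p / n)   # flags observations 9 and 22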

Example 6.4 The Delivery Time Data

Column g of Table 6.1 contains the values of COVRATIOi for the soft drink delivery time data. The formal recommended cutoff for COVRATIOi is 1 ± 3p/n = 1 ± 3(3)/25, or 0.64 and 1.36. Note that the values of COVRATIO9 and COVRATIO22 exceed these limits, indicating that these points are influential. Since COVRATIO9 < 1, this observation degrades precision of estimation, while since COVRATIO22 > 1, this observation tends to improve the precision. However, point 22 barely exceeds its cutoff, so the influence of this observation, from a practical viewpoint, is fairly small. Point 9 is much more clearly influential.

6.6 DETECTING GROUPS OF INFLUENTIAL OBSERVATIONS

We have focused on single-observation deletion diagnostics for influence and leverage. Now obviously, there could be situations where a group of points have high leverage or exert undue influence on the regression model. Very good discussions of this problem are in Belsley, Kuh, and Welsch [1980] and Rousseeuw and Leroy [1987].

In principle, we can extend the single-observation diagnostics to the multiple-observation case. In fact, there are several strategies for solving the multiple-outlier influential observation problem. For example, see Atkinson [1994], Hadi and Simonoff [1993], Hawkins, Bradu, and Kass [1984], Pena and Yohai [1995], and Rousseeuw and van Zomeren [1990]. To show how we could extend Cook's distance measure to assess the simultaneous influence of a group of m observations, let i denote the m × 1 vector of indices specifying the points to be deleted, and define

$$D_{\mathbf{i}} = \frac{(\hat{\beta}_{(\mathbf{i})} - \hat{\beta})'\,X'X\,(\hat{\beta}_{(\mathbf{i})} - \hat{\beta})}{p\,MS_{Res}}$$

Obviously, Di is a multiple-observation version of Cook's distance measure. The interpretation of Di is similar to that of the single-observation statistic: large values of Di indicate that the set of m points is influential. Selection of the subset of points to delete is not obvious, however, because in some data sets subsets of points are jointly influential even though no individual point is. Furthermore, it is not practical to investigate all possible combinations of the n sample points taken m = 1, 2, …, n points at a time.
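
For a given candidate subset, the multiple-observation statistic can be computed by brute force: refit with the m points deleted and evaluate the quadratic form. A sketch in R, using the subset {9, 22} from the delivery time data as the illustration:

# Multiple-observation Cook's distance for the subset i = {9, 22}
idx <- c(9, 22)
fit.all <- lm(time ~ cases + dist, data = deliver)
fit.del <- lm(time ~ cases + dist, data = deliver[-idx, ])

b.diff <- coef(fit.del) - coef(fit.all)
X <- model.matrix(fit.all)             # the X matrix
p <- length(coef(fit.all))
ms.res <- summary(fit.all)$sigma^2     # MS_Res from the full fit

D.i <- t(b.diff) %*% crossprod(X) %*% b.diff / (p * ms.res)
drop(D.i)   # a large value flags the subset as jointly influential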

Sebert, Montgomery, and Roilier [1998] investigate the use of cluster analysis to find the set of influential observations in regression. Cluster analysis is a multivariate technique for finding groups of similar observations. The procedure consists of defining a measure of similarity between observations and then using a set of rules to classify the observations into groups based on their interobservation similarities. They use a single-linkage clustering procedure (see Johnson and Wichern [1992] and Everitt [1993]) applied to the least-squares residuals and fitted values to cluster nm observations into a “clean” group and a potentially influential group of m observations. The potentially influential group of observations are then evaluated in subsets of size 1, 2, …, m using the multiple-observation version of Cook's distance measure. The authors report that this procedure is very effective in finding the subset of influential observations. There is some “swamping,” that is, identifying too many observations as potentially influential, but the use of Cook's distance efficiently eliminates the noninfluential observations. In studying nine data sets from the literature, the authors report no incidents of “masking,” that is, failure to find the correct subset of influential points. They also report successful results from an extensive performance study conducted by Monte Carlo simulation.
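
A rough sketch of the flavor of this approach, using base R's single-linkage clustering on the standardized fitted values and residuals; the two-group cut used here is purely illustrative, since the authors' actual tree-cutting rule differs in detail:

# Single-linkage clustering of (fitted value, residual) pairs
z <- scale(cbind(fitted(deliver.model), resid(deliver.model)))
hc <- hclust(dist(z), method = "single")

# Cut into two groups: a large "clean" group and a small suspect group
groups <- cutree(hc, k = 2)
which(groups != which.max(table(groups)))   # members of the minority group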

6.7 TREATMENT OF INFLUENTIAL OBSERVATIONS

Diagnostics for leverage and influence are an important part of the regression model-builder's arsenal of tools. They are intended to offer the analyst insight about the data and to signal which observations may deserve more scrutiny. How much effort should be devoted to the study of these points? It probably depends on the number of influential points identified, their actual impact on the model, and the importance of the model-building problem. If you have spent a year collecting 30 observations, then a lot of follow-up analysis of suspect points can probably be justified. This is particularly true if an unexpected result occurs because of a single influential observation.

Should influential observations ever be discarded? This question is analogous to the question regarding discarding outliers. As a general rule, if there is an error in recording a measured value or if the sample point is indeed invalid or not part of the population that was intended to be sampled, then discarding the observation is appropriate. However, if analysis reveals that an influential point is a valid observation, then there is no justification for its removal.

A “compromise” between deleting an observation and retaining it is to consider using an estimation technique that is not impacted as severely by influential points as least squares. These robust estimation techniques essentially downweight observations in proportion to residual magnitude or influence, so that a highly influential observation will receive less weight than it would in a least-squares fit. Robust regression methods are discussed in Section 15.1.

PROBLEMS

6.1 Perform a thorough influence analysis of the solar thermal energy test data given in Table B.2. Discuss your results.

6.2 Perform a thorough influence analysis of the property valuation data given in Table B.4. Discuss your results.

6.3 Perform a thorough influence analysis of the Belle Ayr liquefaction runs given in Table B.5. Discuss your results.

6.4 Perform a thorough influence analysis of the tube-flow reactor data given in Table B.6. Discuss your results.

6.5 Perform a thorough influence analysis of the NFL team performance data given in Table B.1. Discuss your results.

6.6 Perform a thorough influence analysis of the oil extraction data given in Table B.7. Discuss your results.

6.7 Perform a thorough influence analysis of the clathrate formation data given in Table B.8. Perform any appropriate transformations. Discuss your results.

6.8 Perform a thorough influence analysis of the pressure drop data given in Table B.9. Perform any appropriate transformations. Discuss your results.

6.9 Perform a thorough influence analysis of the kinematic viscosity data given in Table B.10. Perform any appropriate transformations. Discuss your results.

6.10 Formally show that

[equation not reproduced]

6.11 Formally show that

[equation not reproduced]

6.12 Table B.11 contains data on the quality of Pinot Noir wine. Fit a regression model using clarity, aroma, body, flavor, and oakiness as the regressors. Investigate this model for influential observations and comment on your findings.

6.13 Table B.12 contains data collected on heat treating of gears. Fit a regression model to these data using all of the regressors. Investigate this model for influential observations and comment on your findings.

6.14 Table B.13 contains data on the thrust of a jet turbine engine. Fit a regression model to these data using all of the regressors. Investigate this model for influential observations and comment on your findings.

6.15 Table B.14 contains data concerning the transient points of an electronic inverter. Fit a regression model to all 25 observations but use only x1 through x4 as the regressors. Investigate this model for influential observations and comment on your findings.

6.16 Perform a thorough influence analysis of the air pollution and mortality data given in Table B.15. Perform any appropriate transformations. Discuss your results.

6.17 For each model perform a thorough influence analysis of the life expectancy data given in Table B.16. Perform any appropriate transformations. Discuss your results.

6.18 Consider the patient satisfaction data in Table B.17. Fit a regression model to the satisfaction response using age and severity as the predictors. Perform an influence analysis of the data and comment on your findings.

6.19 Consider the fuel consumption data in Table B.18. For the purposes of this exercise, ignore regressor x1. Perform a thorough influence analysis of these data. What conclusions do you draw from this analysis?

6.20 Consider the wine quality of young red wines data in Table B.19. For the purposes of this exercise, ignore regressor x1. Perform a thorough influence analysis of these data. What conclusions do you draw from this analysis?

6.21 Consider the methanol oxidation data in Table B.20. Perform a thorough influence analysis of these data. What conclusions do you draw from this analysis?
