CHAPTER 6

DIAGNOSTICS FOR LEVERAGE AND INFLUENCE

6.1 IMPORTANCE OF DETECTING INFLUENTIAL OBSERVATIONS

When we compute a sample average, each observation in the sample has the same weight in determining the outcome. In the regression situation, this is not the case. For example, we noted in Section 2.9 that the location of observations in x space can play an important role in determining the regression coefficients (refer to Figures 2.8 and 2.9). We have also focused attention on outliers, or observations that have unusual y values. In Section 4.4 we observed that outliers are often identified by unusually large residuals and that these observations can also affect the regression results. The material in this chapter is an extension and consolidation of some of these issues.

Consider the situation illustrated in Figure 6.1. The point labeled A in this figure is remote in x space from the rest of the sample, but it lies almost on the regression line passing through the rest of the sample points. This is an example of a leverage point; that is, it has an unusual x value and may control certain model properties. This point does not affect the estimates of the regression coefficients, but it certainly will have a dramatic effect on the model summary statistics such as R² and the standard errors of the regression coefficients. Now consider the point labeled A in Figure 6.2. This point has a moderately unusual x coordinate, and the y value is unusual as well. This is an influence point; that is, it has a noticeable impact on the model coefficients in that it “pulls” the regression model in its direction.

We sometimes find that a small subset of the data exerts a disproportionate influence on the model coefficients and properties. In an extreme case, the parameter estimates may depend more on the influential subset of points than on the majority of the data. This is obviously an undesirable situation; we would like for a regression model to be representative of all of the sample observations, not an artifact of a few. Consequently, we would like to find these influential points and assess their impact on the model. If these influential points are indeed “bad” values, then they should be eliminated from the sample. On the other hand, there may be nothing wrong with these points, but if they control key model properties, we would like to know it, as it could affect the end use of the regression model.

Figure 6.1 An example of a leverage point.

Figure 6.2 An example of an influential observation.

In this chapter we present several diagnostics for leverage and influence. These diagnostics are available in most multiple regression computer packages. It is important to use these diagnostics in conjunction with the residual analysis techniques of Chapter 4. Sometimes we find that a regression coefficient may have a sign that does not make engineering or scientific sense, a regressor known to be important may be statistically insignificant, or a model that fits the data well and that is logical from an application–environment perspective may produce poor predictions. These situations may be the result of one or perhaps a few influential observations. Finding these observations then can shed considerable light on the problems with the model.

6.2 LEVERAGE

As observed above, the location of points in x space is potentially important in determining the properties of the regression model. In particular, remote points potentially have disproportionate impact on the parameter estimates, standard errors, predicted values, and model summary statistics. The hat matrix

$$H = X(X'X)^{-1}X' \qquad (6.1)$$

plays an important role in identifying influential observations. As noted earlier, H determines the variances and covariances of ŷ and e, since Var(ŷ) = σ²H and Var(e) = σ²(I – H). The elements hij of the matrix H may be interpreted as the amount of leverage exerted by the ith observation yi on the jth fitted value ŷj.

We usually focus attention on the diagonal elements hii of the hat matrix H, which may be written as

$$h_{ii} = x_i'(X'X)^{-1}x_i \qquad (6.2)$$

where xi is the ith row of the X matrix. The hat matrix diagonal is a standardized measure of the distance of the ith observation from the center (or centroid) of the x space. Thus, large hat diagonals reveal observations that are potentially influential because they are remote in x space from the rest of the sample. It turns out that the average size of a hat diagonal is h̄ = p/n [because Σᵢ hii = rank(H) = rank(X) = p], and we traditionally assume that any observation for which the hat diagonal exceeds twice the average, 2p/n, is remote enough from the rest of the data to be considered a leverage point.

Not all leverage points are going to be influential on the regression coefficients. For example, recall point A in Figure 6.1. This point will have a large hat diagonal and is assuredly a leverage point, but it has almost no effect on the regression coefficients because it lies almost on the line passing through the remaining observations. Because the hat diagonals examine only the location of the observation in x space, some analysts like to look at the studentized residuals or R-student in conjunction with the hii. Observations with large hat diagonals and large residuals are likely to be influential. Finally, note that in using the cutoff value 2p/n we must also be careful to assess the magnitudes of both p and n. There will be situations where 2p/n > 1, and in these situations, the cutoff does not apply.
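
The hat diagonals are easy to obtain in most packages. As a minimal sketch in R, assuming the delivery time data are in a data frame named deliver with columns time, cases, and dist (the same names used in the R code of Example 6.1 below), the 2p/n cutoff can be applied as follows:

# Fit the model and extract the hat diagonals h_ii
deliver.model <- lm(time ~ cases + dist, data = deliver)
h <- hatvalues(deliver.model)

# Flag points whose leverage exceeds twice the average hat diagonal, 2p/n
p <- length(coef(deliver.model))   # number of parameters (here p = 3)
n <- nrow(deliver)
which(h > 2 * p / n)               # for these data, observations 9 and 22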

Example 6.1 The Delivery Time Data

Column a of Table 6.1 shows the hat diagonals for the soft drink delivery time data of Example 3.1. Since p = 3 and n = 25, any point for which the hat diagonal hii exceeds 2p/n = 2(3)/25 = 0.24 is a leverage point. This criterion identifies observations 9 and 22 as leverage points. The remote location of these points (particularly point 9) was previously noted when we examined the matrix of scatterplots in Figure 3.4 and when we illustrated interpolation and extrapolation with this model in Figure 3.11.

In Example 4.1 we calculated the scaled residuals for the delivery time data. Table 4.1 contains the studentized residuals and R-student. These residuals are not unusually large for observation 22, indicating that it likely has little influence on the fitted model. However, both scaled residuals for point 9 are moderately large, suggesting that this observation may have moderate influence on the model. To illustrate the effect of these two points on the model, three additional analyses were performed: one deleting observation 9, a second deleting observation 22, and the third deleting both 9 and 22. The results of these additional runs are shown in the following table:

[Table of coefficient estimates and summary statistics for these additional runs: values not reproduced.]

TABLE 6.1 Statistics for Detecting Influential Observations for the Soft Drink Delivery Time Data

[Table values not reproduced. Column a contains the hat diagonals hii, column b Cook's Di, columns c–f DFFITSi and DFBETASj,i, and column g COVRATIOi.]

Deleting observation 9 produces only a minor change in β̂1 but results in approximately a 28% change in β̂2 and a 90% change in β̂0. This illustrates that observation 9 is off the plane passing through the other 24 points and exerts a moderately strong influence on the regression coefficient associated with x2 (distance). This is not surprising considering that the value of x2 for this observation (1460 feet) is very different from the other observations. In effect, observation 9 may be causing curvature in the x2 direction. If observation 9 were deleted, then MSRes would be reduced to 5.905. Note that √5.905 = 2.430, which is not too different from the estimate of pure error σ̂ = 1.969 found by the near-neighbor analysis in Example 4.10. It seems that most of the lack of fit noted in this model in Example 4.11 is due to point 9's large residual. Deleting point 22 produces relatively smaller changes in the regression coefficients and model summary statistics. Deleting both points 9 and 22 produces changes similar to those observed when deleting only 9.

The SAS code to generate the influence diagnostics, assuming the data are in a SAS data set named deliver, is:

proc reg data=deliver;
  model time = cases dist / influence;
run;

The R code is:

# Fit the delivery time model and display the influence diagnostics
deliver.model <- lm(time ~ cases + dist, data = deliver)
summary(deliver.model)
print(influence.measures(deliver.model))
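
The three additional analyses described above can be reproduced by deleting rows and refitting; this is a sketch assuming the rows of deliver appear in the same order as the observation numbers in Table 6.1:

# Refit with observation 9, observation 22, and both deleted
fit.no9     <- lm(time ~ cases + dist, data = deliver[-9, ])
fit.no22    <- lm(time ~ cases + dist, data = deliver[-22, ])
fit.no.both <- lm(time ~ cases + dist, data = deliver[-c(9, 22), ])

# Compare the coefficient estimates across the four fits
rbind(all     = coef(deliver.model),
      no.9    = coef(fit.no9),
      no.22   = coef(fit.no22),
      no.both = coef(fit.no.both))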

6.3 MEASURES OF INFLUENCE: COOK'S D

We noted in the previous section that it is desirable to consider both the location of the point in the x space and the response variable in measuring influence. Cook [1977, 1979] has suggested a way to do this, using a measure of the squared distance between the least-squares estimate based on all n points, β̂, and the estimate obtained by deleting the ith point, say β̂(i). This distance measure can be expressed in a general form as

$$D_i(M, c) = \frac{(\hat{\beta}_{(i)} - \hat{\beta})'\,M\,(\hat{\beta}_{(i)} - \hat{\beta})}{c}, \qquad i = 1, 2, \ldots, n \qquad (6.3)$$

The usual choices of M and c are M = X′X and c = pMSRes, so that Eq. (6.3) becomes

$$D_i = \frac{(\hat{\beta}_{(i)} - \hat{\beta})'\,X'X\,(\hat{\beta}_{(i)} - \hat{\beta})}{p\,MS_{Res}}, \qquad i = 1, 2, \ldots, n \qquad (6.4)$$

Points with large values of Di have considerable influence on the least-squares estimate β̂.

The magnitude of Di is usually assessed by comparing it to Fα,p,n–p. If Di = F0.5,p,n–p, then deleting point i would move β̂(i) to the boundary of an approximate 50% confidence region for β based on the complete data set. This is a large displacement and indicates that the least-squares estimate is sensitive to the ith data point. Since F0.5,p,n–p ≈ 1, we usually consider points for which Di > 1 to be influential. Ideally we would like each estimate β̂(i) to stay within the boundary of a 10 or 20% confidence region. This recommendation for a cutoff is based on the similarity of Di to the equation for the normal-theory confidence ellipsoid [Eq. (3.50)]. The distance measure Di is not an F statistic. However, the cutoff of unity works very well in practice.

The Di statistic may be rewritten as

$$D_i = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}}, \qquad i = 1, 2, \ldots, n \qquad (6.5)$$

Thus, we see that, apart from the constant p, Di is the product of the square of the ith studentized residual and hii/(1 – hii). This ratio can be shown to be the distance from the vector xi to the centroid of the remaining data. Thus, Di is made up of a component that reflects how well the model fits the ith observation yi and a component that measures how far that point is from the rest of the data. Either component (or both) may contribute to a large value of Di. Thus, Di combines residual magnitude for the ith observation and the location of that point in x space to assess influence.

Because ŷ = Xβ̂ and ŷ(i) = Xβ̂(i), another way to write Cook's distance measure is

$$D_i = \frac{(\hat{y}_{(i)} - \hat{y})'(\hat{y}_{(i)} - \hat{y})}{p\,MS_{Res}}$$

Therefore, another way to interpret Cook's distance is that it is the squared Euclidean distance (apart from pMSRes) that the vector of fitted values moves when the ith observation is deleted.
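
In R, cooks.distance() returns the Di values directly; they can also be assembled from the studentized residuals and hat diagonals via Eq. (6.5), which makes the decomposition above concrete. A brief sketch using the delivery time model from Example 6.1:

# Cook's distance directly, and assembled from its two components
D <- cooks.distance(deliver.model)
r <- rstandard(deliver.model)    # studentized residuals r_i
h <- hatvalues(deliver.model)
p <- length(coef(deliver.model))
all.equal(D, (r^2 / p) * (h / (1 - h)))   # TRUE: the two computations agree

which(D > 1)   # points exceeding the rough cutoff of unity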

Example 6.2 The Delivery Time Data

Column b of Table 6.1 contains the values of Cook's distance measure for the soft drink delivery time data. We illustrate the calculations by considering the first observation. The studentized residuals for the delivery time data are listed in Table 4.1; r1 = –1.6277, and from column a of Table 6.1, h11 = 0.10180. Thus,

$$D_1 = \frac{r_1^2}{p} \cdot \frac{h_{11}}{1 - h_{11}} = \frac{(-1.6277)^2}{3} \cdot \frac{0.10180}{1 - 0.10180} = 0.10009$$

The largest value of the Di statistic is D9 = 3.41835, which indicates that deletion of observation 9 would move the least-squares estimate to approximately the boundary of a 96% confidence region around β̂. The next largest value is D22 = 0.45106, and deletion of point 22 will move the estimate of β to approximately the edge of a 28% confidence region. Therefore, we would conclude that observation 9 is definitely influential using the cutoff of unity, and observation 22 is not influential. Notice that these conclusions agree quite well with those reached in Example 6.1 by examining the hat diagonals and studentized residuals separately.
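
The confidence region probabilities quoted here are simply F-distribution tail areas and are easy to verify. For instance, in R:

# Percentile of F(p, n - p) = F(3, 22) corresponding to each D value
pf(3.41835, 3, 22)   # approximately 0.96 for observation 9
pf(0.45106, 3, 22)   # approximately 0.28 for observation 22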

6.4 MEASURES OF INFLUENCE: DFFITS AND DFBETAS

Cook's distance measure is a deletion diagnostic; that is, it measures the influence of the ith observation if it is removed from the sample. Belsley, Kuh, and Welsch [1980] introduced two other useful measures of deletion influence. The first of these is a statistic that indicates how much the regression coefficient β̂j changes, in standard deviation units, if the ith observation were deleted. This statistic is

$$\mathrm{DFBETAS}_{j,i} = \frac{\hat{\beta}_j - \hat{\beta}_{j(i)}}{\sqrt{S_{(i)}^2\, C_{jj}}}$$

where Cjj is the jth diagonal element of (X′X)−1 and β̂j(i) is the jth regression coefficient computed without use of the ith observation. A large (in magnitude) value of DFBETASj,i indicates that observation i has considerable influence on the jth regression coefficient. Notice that the DFBETASj,i values form an n × p matrix that conveys similar information to the composite influence information in Cook's distance measure.

The computation of DFBETASj,i is interesting. Define the p × n matrix

$$R = (X'X)^{-1}X'$$

The n elements in the jth row of R produce the leverage that the n observations in the sample have on β̂j. If we let rj denote the jth row of R, then we can show (see Appendix C.13) that

$$\mathrm{DFBETAS}_{j,i} = \frac{r_{j,i}}{\sqrt{r_j' r_j}} \cdot \frac{t_i}{\sqrt{1 - h_{ii}}}$$

where ti is the R-student residual. Note that DFBETASj,i measures both leverage (rj,i/√(rj′rj) is a measure of the impact of the ith observation on β̂j) and the effect of a large residual. Belsley, Kuh, and Welsch [1980] suggest a cutoff of 2/√n for DFBETASj,i; that is, if |DFBETASj,i| > 2/√n, then the ith observation warrants examination.
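
In R the full matrix of DFBETAS values is returned by dfbetas(); a sketch of applying the 2/√n cutoff to the delivery time model of Example 6.1:

# DFBETAS: rows are observations, columns are coefficients
db <- dfbetas(deliver.model)
n <- nrow(db)
which(abs(db) > 2 / sqrt(n), arr.ind = TRUE)   # observation/coefficient pairs to examine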

We may also investigate the deletion influence of the ith observation on the predicted or fitted value. This leads to the second diagnostic proposed by Belsley, Kuh, and Welsch:

$$\mathrm{DFFITS}_i = \frac{\hat{y}_i - \hat{y}_{(i)}}{\sqrt{S_{(i)}^2\, h_{ii}}}, \qquad i = 1, 2, \ldots, n$$

where ŷ(i) is the fitted value of yi obtained without the use of the ith observation. The denominator is just a standardization, since Var(ŷi) = σ²hii. Thus, DFFITSi is the number of standard deviations that the fitted value ŷi changes if observation i is removed.

Computationally we may find (see Appendix C.13 for details)

$$\mathrm{DFFITS}_i = \left(\frac{h_{ii}}{1 - h_{ii}}\right)^{1/2} t_i$$

where ti is R-student. Thus, DFFITSi is the value of R-student multiplied by the leverage of the ith observation [hii/(1 − hii)]1/2. If the data point is an outlier, then R-student will be large in magnitude, while if the data point has high leverage, hii will be close to unity. In either of these cases DFFITSi can be large. However, if hii ≃ 0, the effect of R-student will be moderated. Similarly a near-zero R-student combined with a high leverage point could produce a small value of DFFITSi. Thus, DFFITSi is affected by both leverage and prediction error. Belsley, Kuh, and Welsch suggest that any observation for which |DFFITSi| > 2√(p/n) warrants attention.
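
The identity above is easy to check numerically. A sketch in R, again for the delivery time model:

# DFFITS directly, and from R-student and the hat diagonals
d  <- dffits(deliver.model)
ti <- rstudent(deliver.model)   # R-student
h  <- hatvalues(deliver.model)
all.equal(d, ti * sqrt(h / (1 - h)))   # TRUE

# Observations exceeding the recommended cutoff 2*sqrt(p/n)
p <- length(coef(deliver.model))
which(abs(d) > 2 * sqrt(p / length(d)))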

A Remark on Cutoff Values In this section we have provided recommended cutoff values for DFFITSi and DFBETASj,i. Remember that these recommendations are only guidelines, as it is very difficult to produce cutoffs that are correct for all cases. Therefore, we recommend that the analyst utilize information about both what the diagnostic means and the application environment in selecting a cutoff. For example, if DFFITSi = 1.0, say, we could translate this into actual response units to determine just how much ŷi is affected by removing the ith observation. Then DFBETASj,i could be used to see whether this observation is responsible for the significance (or perhaps nonsignificance) of particular coefficients or for changes of sign in a regression coefficient. Diagnostic DFBETASj,i can also be used to determine (by using the standard error of the coefficient) how much change in actual problem-specific units a data point has on the regression coefficient. Sometimes these changes will be important in a problem-specific context even though the diagnostic statistics do not exceed the formal cutoff.

Notice that the recommended cutoffs are a function of sample size n. Certainly, we believe that any formal cutoff should be a function of sample size; however, in our experience these cutoffs often identify more data points than an analyst may wish to analyze. This is particularly true in small samples. We believe that the cutoffs recommended by Belsley, Kuh, and Welsch make sense for large samples, but when n is small, we prefer the diagnostic view discussed previously.

Example 6.3 The Delivery Time Data

Columns c−f of Table 6.1 present the values of DFFITSi and DFBETASj,i for the soft drink delivery time data. The formal cutoff value for DFFITSi is 2√(p/n) = 2√(3/25) = 0.69. Inspection of Table 6.1 reveals that both points 9 and 22 have values of DFFITSi that exceed this value, and additionally DFFITS20 is close to the cutoff.

Examining DFBETASj,i and recalling that the cutoff is 2/√n = 2/√25 = 0.40, we immediately notice that points 9 and 22 have large effects on all three parameters. Point 9 has a very large effect on the intercept and smaller effects on β̂1 and β̂2, while point 22 has its largest effect on β̂1. Several other points produce effects on individual coefficients that are close to the formal cutoff, including points 1, 4, and 24. These points produce relatively small changes in comparison to point 9.

Adopting a diagnostic view, point 9 is clearly influential, since its deletion results in a displacement of every regression coefficient by at least 0.9 standard deviation. The effect of point 22 is much smaller. Furthermore, deleting point 9 displaces the predicted response by over four standard deviations. Once again, we have a clear signal that observation 9 is influential.

6.5 A MEASURE OF MODEL PERFORMANCE

The diagnostics Di, DFBETASj,i, and DFFITSi provide insight about the effect of observations on the estimated coefficients β̂j and fitted values ŷi. They do not provide any information about overall precision of estimation. Since it is fairly common practice to use the determinant of the covariance matrix as a convenient scalar measure of precision, called the generalized variance, we could define the generalized variance of β̂ as

$$GV(\hat{\beta}) = \left| \mathrm{Var}(\hat{\beta}) \right| = \left| \sigma^2 (X'X)^{-1} \right|$$

To express the role of the ith observation on the precision of estimation, we could define

$$\mathrm{COVRATIO}_i = \frac{\left| \left( X_{(i)}' X_{(i)} \right)^{-1} S_{(i)}^2 \right|}{\left| (X'X)^{-1}\, MS_{Res} \right|}, \qquad i = 1, 2, \ldots, n$$

Clearly if COVRATIOi > 1, the ith observation improves the precision of estimation, while if COVRATIOi < 1, inclusion of the ith point degrades precision. Computationally

$$\mathrm{COVRATIO}_i = \frac{\left( S_{(i)}^2 \right)^p}{MS_{Res}^p} \left( \frac{1}{1 - h_{ii}} \right)$$

Note that [1/(1 − hii)] is the ratio of |(X(i)′X(i))−1| to |(X′X)−1|, so that a high leverage point will make COVRATIOi large. This is logical, since a high leverage point will improve the precision unless the point is an outlier in y space. If the ith observation is an outlier, S²(i)/MSRes will be much less than unity.

Cutoff values for COVRATIO are not easy to obtain. Belsley, Kuh, and Welsch [1980] suggest that if COVRATIOi > 1 + 3p/n or if COVRATIOi < 1 − 3p/n, then the ith point should be considered influential. The lower bound is only appropriate when n > 3p. These cutoffs are only recommended for large samples.
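
In R, covratio() computes COVRATIOi, and the cutoffs can be applied directly. A minimal sketch for the delivery time model:

# COVRATIO with the Belsley, Kuh, and Welsch cutoffs 1 +/- 3p/n
cv <- covratio(deliver.model)
p <- length(coef(deliver.model))
n <- length(cv)
which(cv > 1 + 3 * p / n | cv < 1 - 3 * p / n)   # flags observations 9 and 22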

Example 6.4 The Delivery Time Data

Column g of Table 6.1 contains the values of COVRATIOi for the soft drink delivery time data. The formal recommended cutoff for COVRATIOi is 1 ± 3p/n = 1 ± 3(3)/25, or 0.64 and 1.36. Note that the values of COVRATIO9 and COVRATIO22 exceed these limits, indicating that these points are influential. Since COVRATIO9 < 1, this observation degrades precision of estimation, while since COVRATIO22 > 1, this observation tends to improve the precision. However, point 22 barely exceeds its cutoff, so the influence of this observation, from a practical viewpoint, is fairly small. Point 9 is much more clearly influential.

6.6 DETECTING GROUPS OF INFLUENTIAL OBSERVATIONS

We have focused on single-observation deletion diagnostics for influence and leverage. Now obviously, there could be situations where a group of points have high leverage or exert undue influence on the regression model. Very good discussions of this problem are in Belsley, Kuh, and Welsch [1980] and Rousseeuw and Leroy [1987].

In principle, we can extend the single-observation diagnostics to the multiple-observation case. In fact, there are several strategies for solving the multiple-outlier influential observation problem. For example, see Atkinson [1994], Hadi and Simonoff [1993], Hawkins, Bradu, and Kass [1984], Pena and Yohai [1995], and Rousseeuw and van Zomeren [1990]. To show how we could extend Cook's distance measure to assess the simultaneous influence of a group of m observations, let i denote the m × 1 vector of indices specifying the points to be deleted, and define

$$D_{\mathbf{i}} = \frac{(\hat{\beta}_{(\mathbf{i})} - \hat{\beta})'\,X'X\,(\hat{\beta}_{(\mathbf{i})} - \hat{\beta})}{p\,MS_{Res}}$$

Obviously, Di is a multiple-observation version of Cook's distance measure. The interpretation of Di is similar to that of the single-observation statistic: large values of Di indicate that the set of m points is influential. Selection of the subset of points to delete is not obvious, however, because in some data sets subsets of points are jointly influential even though no individual point is. Furthermore, it is not practical to investigate all possible combinations of the n sample points taken m = 1, 2, …, n points at a time.
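
For a given candidate subset, the multiple-observation statistic can be computed by brute force: refit with the m points deleted and evaluate the quadratic form. A sketch in R, using the subset {9, 22} from the delivery time data as the illustration:

# Multiple-observation Cook's distance for the subset i = {9, 22}
idx <- c(9, 22)
fit.all <- lm(time ~ cases + dist, data = deliver)
fit.del <- lm(time ~ cases + dist, data = deliver[-idx, ])

b.diff <- coef(fit.del) - coef(fit.all)
X <- model.matrix(fit.all)             # the X matrix
p <- length(coef(fit.all))
ms.res <- summary(fit.all)$sigma^2     # MS_Res from the full fit

D.i <- t(b.diff) %*% crossprod(X) %*% b.diff / (p * ms.res)
drop(D.i)   # a large value flags the subset as jointly influential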

Sebert, Montgomery, and Roilier [1998] investigate the use of cluster analysis to find the set of influential observations in regression. Cluster analysis is a multivariate technique for finding groups of similar observations. The procedure consists of defining a measure of similarity between observations and then using a set of rules to classify the observations into groups based on their interobservation similarities. They use a single-linkage clustering procedure (see Johnson and Wichern [1992] and Everitt [1993]) applied to the least-squares residuals and fitted values to cluster nm observations into a “clean” group and a potentially influential group of m observations. The potentially influential group of observations are then evaluated in subsets of size 1, 2, …, m using the multiple-observation version of Cook's distance measure. The authors report that this procedure is very effective in finding the subset of influential observations. There is some “swamping,” that is, identifying too many observations as potentially influential, but the use of Cook's distance efficiently eliminates the noninfluential observations. In studying nine data sets from the literature, the authors report no incidents of “masking,” that is, failure to find the correct subset of influential points. They also report successful results from an extensive performance study conducted by Monte Carlo simulation.
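
A rough sketch of the flavor of this approach, using base R's single-linkage clustering on the standardized fitted values and residuals; the two-group cut used here is purely illustrative, since the authors' actual tree-cutting rule differs in detail:

# Single-linkage clustering of (fitted value, residual) pairs
z <- scale(cbind(fitted(deliver.model), resid(deliver.model)))
hc <- hclust(dist(z), method = "single")

# Cut into two groups: a large "clean" group and a small suspect group
groups <- cutree(hc, k = 2)
which(groups != which.max(table(groups)))   # members of the minority group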

6.7 TREATMENT OF INFLUENTIAL OBSERVATIONS

Diagnostics for leverage and influence are an important part of the regression model-builder's arsenal of tools. They are intended to offer the analyst insight about the data and to signal which observations may deserve more scrutiny. How much effort should be devoted to the study of these points? It probably depends on the number of influential points identified, their actual impact on the model, and the importance of the model-building problem. If you have spent a year collecting 30 observations, then a lot of follow-up analysis of suspect points can probably be justified. This is particularly true if an unexpected result occurs because of a single influential observation.

Should influential observations ever be discarded? This question is analogous to the question regarding discarding outliers. As a general rule, if there is an error in recording a measured value or if the sample point is indeed invalid or not part of the population that was intended to be sampled, then discarding the observation is appropriate. However, if analysis reveals that an influential point is a valid observation, then there is no justification for its removal.

A “compromise” between deleting an observation and retaining it is to consider using an estimation technique that is not impacted as severely by influential points as least squares. These robust estimation techniques essentially downweight observations in proportion to residual magnitude or influence, so that a highly influential observation will receive less weight than it would in a least-squares fit. Robust regression methods are discussed in Section 15.1.

PROBLEMS

6.1 Perform a thorough influence analysis of the solar thermal energy test data given in Table B.2. Discuss your results.

6.2 Perform a thorough influence analysis of the property valuation data given in Table B.4. Discuss your results.

6.3 Perform a thorough influence analysis of the Belle Ayr liquefaction runs given in Table B.5. Discuss your results.

6.4 Perform a thorough influence analysis of the tube-flow reactor data given in Table B.6. Discuss your results.

6.5 Perform a thorough influence analysis of the NFL team performance data given in Table B.1. Discuss your results.

6.6 Perform a thorough influence analysis of the oil extraction data given in Table B.7. Discuss your results.

6.7 Perform a thorough influence analysis of the clathrate formation data given in Table B.8. Perform any appropriate transformations. Discuss your results.

6.8 Perform a thorough influence analysis of the pressure drop data given in Table B.9. Perform any appropriate transformations. Discuss your results.

6.9 Perform a thorough influence analysis of the kinematic viscosity data given in Table B.10. Perform any appropriate transformations. Discuss your results.

6.10 Formally show that

[equation not reproduced]

6.11 Formally show that

[equation not reproduced]

6.12 Table B.11 contains data on the quality of Pinot Noir wine. Fit a regression model using clarity, aroma, body, flavor, and oakiness as the regressors. Investigate this model for influential observations and comment on your findings.

6.13 Table B.12 contains data collected on heat treating of gears. Fit a regression model to these data using all of the regressors. Investigate this model for influential observations and comment on your findings.

6.14 Table B.13 contains data on the thrust of a jet turbine engine. Fit a regression model to these data using all of the regressors. Investigate this model for influential observations and comment on your findings.

6.15 Table B.14 contains data concerning the transient points of an electronic inverter. Fit a regression model to all 25 observations but use only x1 through x4 as the regressors. Investigate this model for influential observations and comment on your findings.

6.16 Perform a thorough influence analysis of the air pollution and mortality data given in Table B.15. Perform any appropriate transformations. Discuss your results.

6.17 For each model perform a thorough influence analysis of the life expectancy data given in Table B.16. Perform any appropriate transformations. Discuss your results.

6.18 Consider the patient satisfaction data in Table B.17. Fit a regression model to the satisfaction response using age and severity as the predictors. Perform an influence analysis of the data and comment on your findings.

6.19 Consider the fuel consumption data in Table B.18. For the purposes of this exercise, ignore regressor x1. Perform a thorough influence analysis of these data. What conclusions do you draw from this analysis?

6.20 Consider the wine quality of young red wines data in Table B.19. For the purposes of this exercise, ignore regressor x1. Perform a thorough influence analysis of these data. What conclusions do you draw from this analysis?

6.21 Consider the methanol oxidation data in Table B.20. Perform a thorough influence analysis of these data. What conclusions do you draw from this analysis?
