The pitfalls of regression analysis

Regression analysis has several pitfalls. We will go over three of its limitations:

  • Keep your estimates close to the original input variable range: The regression equation describes a straight line. A line in a two-dimensional plane has a slope and a y-intercept and extends infinitely in both directions. Because of this, we are capable of estimating values beyond the range of our input variables. For example, the y-intercept of our equation is 0.22, which means that a baseball team that scores an average of 0 runs per game should win an estimated 22% of its games (which is laughable). Just because we can estimate values outside the range of our input doesn't mean that we should.
  • There is more to regression than simple linear regression: This chapter only looks at simple linear regression, but there are several other types of regression, including log regression (where we first compute the natural log of the output variable) and log-log regression (where we first compute the natural log of both the input and the output variables). The fundamental approach is the same, but having an understanding of the curvature of the data (and not automatically assuming that the data forms a line) sometimes yields better results.
  • Any analysis that uses the mean of a dataset is easily skewed: Since the regression line depends so heavily on the averages of both the input and output variables, simple linear regression can produce a line that is distorted by a single extreme data point.

    The next example illustrates this clearly. The following table represents a contrived collection of four datasets known as Anscombe's quartet. The quartet was constructed by hand so that each of its datasets has a nearly identical mean of the x column, mean of the y column, correlation coefficient, and linear regression line. It demonstrates the problem with simple linear regression: the procedure is not robust with respect to outliers or to data that does not follow a linear trend. A simple linear regression analysis reports that all four datasets follow the same linear path when it is clear that three of them do not.

    The data for the following table was taken from the Wikipedia entry for Anscombe's quartet:

         I               II              III               IV
       x       y       x       y       x       y       x       y
    10.0    8.04    10.0    9.14    10.0    7.46     8.0    6.58
     8.0    6.95     8.0    8.14     8.0    6.77     8.0    5.76
    13.0    7.58    13.0    8.74    13.0   12.74     8.0    7.71
     9.0    8.81     9.0    8.77     9.0    7.11     8.0    8.84
    11.0    8.33    11.0    9.26    11.0    7.81     8.0    8.47
    14.0    9.96    14.0    8.10    14.0    8.84     8.0    7.04
     6.0    7.24     6.0    6.13     6.0    6.08     8.0    5.25
     4.0    4.26     4.0    3.10     4.0    5.39    19.0   12.50
    12.0   10.84    12.0    9.13    12.0    8.15     8.0    5.56
     7.0    4.82     7.0    7.26     7.0    6.42     8.0    7.91
     5.0    5.68     5.0    4.74     5.0    5.73     8.0    6.89
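
    The claim that the four datasets share nearly identical summary statistics can be checked directly. The following is a minimal sketch, not the chapter's own code: the names Dataset, mean, codeviation, correlation, and regressionLine are assumptions introduced here for illustration, and the values are copied from the table above. It computes the mean of x, the mean of y, the Pearson correlation coefficient, and the least-squares line for each dataset.

```haskell
module Main where

import Text.Printf (printf)

-- A dataset is a list of (x, y) observations (hypothetical type alias).
type Dataset = [(Double, Double)]

-- Datasets I, II, and III share the same x column.
xsShared :: [Double]
xsShared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]

setI, setII, setIII, setIV :: Dataset
setI   = zip xsShared [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
setII  = zip xsShared [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
setIII = zip xsShared [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
setIV  = zip [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
             [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

mean :: [Double] -> Double
mean vs = sum vs / fromIntegral (length vs)

-- Sum of the products of each pair's deviations from its column mean.
codeviation :: [Double] -> [Double] -> Double
codeviation us vs = sum (zipWith (*) (map (subtract mu) us) (map (subtract mv) vs))
  where
    mu = mean us
    mv = mean vs

-- Pearson correlation coefficient of the x and y columns.
correlation :: Dataset -> Double
correlation ds = codeviation xs ys / sqrt (codeviation xs xs * codeviation ys ys)
  where
    (xs, ys) = unzip ds

-- Least-squares line, returned as (intercept, slope).
regressionLine :: Dataset -> (Double, Double)
regressionLine ds = (intercept, slope)
  where
    (xs, ys)  = unzip ds
    slope     = codeviation xs ys / codeviation xs xs
    intercept = mean ys - slope * mean xs

-- Print one summary line per dataset.
report :: (String, Dataset) -> IO ()
report (name, ds) =
  printf "%-3s mean x = %.2f, mean y = %.2f, r = %.3f, line: y = %.2f + %.3f x\n"
         name (mean xs) (mean ys) (correlation ds) intercept slope
  where
    (xs, ys)           = unzip ds
    (intercept, slope) = regressionLine ds

main :: IO ()
main = mapM_ report (zip ["I", "II", "III", "IV"] [setI, setII, setIII, setIV])
```

    Running the program prints one line per dataset. The four lines should agree to roughly two decimal places, even though only dataset I resembles a noisy straight line.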

    Plotting the data of this table would give us the following charts:

    [Figure: charts of the four Anscombe's quartet datasets, I through IV]

    These graphs were created using the EasyPlot library, the Haskell functions that were defined in this chapter, and the same procedure that was used to analyze the baseball data (except that all the outlier observations were retained for the purpose of this demonstration).
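
Returning to the second pitfall in the list above, log regression and log-log regression can be sketched as a transformation step followed by an ordinary straight-line fit. The code below is an illustration under stated assumptions rather than the chapter's implementation: linearFit, logFit, and logLogFit are hypothetical names, and exponentialData and powerData are synthetic, noise-free values invented for the demonstration. Both transformations require strictly positive values wherever a logarithm is taken.

```haskell
module Main where

-- A dataset is a list of (x, y) observations (hypothetical type alias).
type Dataset = [(Double, Double)]

mean :: [Double] -> Double
mean vs = sum vs / fromIntegral (length vs)

-- Ordinary least-squares line, returned as (intercept, slope).
linearFit :: Dataset -> (Double, Double)
linearFit ds = (intercept, slope)
  where
    (xs, ys)  = unzip ds
    mx        = mean xs
    my        = mean ys
    slope     = sum [ (x - mx) * (y - my) | (x, y) <- ds ]
              / sum [ (x - mx) * (x - mx) | x <- xs ]
    intercept = my - slope * mx

-- Log regression: take the natural log of the output variable,
-- then fit ln y = intercept + slope * x.
logFit :: Dataset -> (Double, Double)
logFit ds = linearFit [ (x, log y) | (x, y) <- ds ]

-- Log-log regression: take the natural log of both variables,
-- then fit ln y = intercept + slope * ln x.
logLogFit :: Dataset -> (Double, Double)
logLogFit ds = linearFit [ (log x, log y) | (x, y) <- ds ]

-- Synthetic data (an assumption for this sketch, not from the chapter):
-- y = 3 * e^(0.4 x) and y = 2 * x^1.5, both without noise.
exponentialData, powerData :: Dataset
exponentialData = [ (x, 3 * exp (0.4 * x)) | x <- [1 .. 10] ]
powerData       = [ (x, 2 * x ** 1.5)      | x <- [1 .. 10] ]

main :: IO ()
main = do
  putStrLn ("log fit (intercept, slope):     " ++ show (logFit exponentialData))
  putStrLn ("log-log fit (intercept, slope): " ++ show (logLogFit powerData))
```

Because the synthetic data is exact, the log fit should recover a slope close to 0.4 with an intercept close to ln 3, and the log-log fit a slope close to 1.5 with an intercept close to ln 2. A plain linear fit on either dataset would report a line as well, which is the reason to look at the curvature of the data first.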
