9.3 Model Assumptions

In Section 9.2, we assumed that the probabilistic model relating the drug reaction time y to the percentage x of drug in the bloodstream is

y=β0+β1x+ε

We also recall that the least squares estimate of the deterministic component of the model, β0+β1x, is

ŷ=β̂0+β̂1x=−.1+.7x

Now we turn our attention to the random component ε of the probabilistic model and its relation to the errors in estimating β0 and β1. We will use a probability distribution to characterize the behavior of ε. We will see how the probability distribution of ε determines how well the model describes the relationship between the dependent variable y and the independent variable x.

Step 3 in a regression analysis requires us to specify the probability distribution of the random error ε. We will make four basic assumptions about the general form of this probability distribution:

  • Assumption 1 The mean of the probability distribution of ε is 0. That is, the average of the values of ε over an infinitely long series of experiments is 0 for each setting of the independent variable x. This assumption implies that the mean value of y, for a given value of x, is E(y)=β0+β1x.

  • Assumption 2 The variance of the probability distribution of ε is constant for all settings of the independent variable x. For our straight-line model, this assumption means that the variance of ε is equal to a constant—say, σ2—for all values of x.

  • Assumption 3 The probability distribution of ε is normal.

  • Assumption 4 The values of ε associated with any two observed values of y are independent. That is, the value of ε associated with one value of y has no effect on any of the values of ε associated with any other y values.

The implications of the first three assumptions can be seen in Figure 9.9, which shows distributions of errors for three values of x, namely, 5, 10, and 15. Note that the relative frequency distributions of the errors are normal with a mean of 0 and a constant variance σ2. (All of the distributions shown have the same amount of spread or variability.) The straight line shown in Figure 9.9 is the line of means; it indicates the mean value of y for a given value of x. We denote this mean value as E(y). Then the line of means is given by the equation

E(y)=β0+β1x

Figure 9.9 The probability distribution of ε

These assumptions make it possible for us to develop measures of reliability for the least squares estimators and to devise hypothesis tests for examining the usefulness of the least squares line. We have various techniques for checking the validity of these assumptions, and we have remedies to apply when they appear to be invalid. (These topics are beyond the scope of this text, but they are discussed in some of the chapter references.) Fortunately, the assumptions need not hold exactly in order for least squares estimators to be useful. The assumptions will be satisfied adequately for many applications encountered in practice.
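The four assumptions can be made concrete with a small simulation. The sketch below is illustrative only: the values of β0, β1, and σ are hypothetical (they are not estimates from the drug reaction data), and the three settings of x are the ones pictured in Figure 9.9. It draws independent normal errors with mean 0 and the same standard deviation at every x, then checks that the simulated y values center on the line of means with roughly the same spread at each setting.

```python
import random
import statistics

# ASSUMPTION: hypothetical parameter values chosen only for illustration;
# they are not estimates from the drug reaction data.
BETA0, BETA1, SIGMA = 1.0, 0.5, 2.0


def simulate_y(x, n, rng):
    """Draw n values of y = beta0 + beta1*x + eps with eps ~ N(0, sigma^2).

    Each eps is drawn independently (Assumption 4) from a normal
    distribution (Assumption 3) with mean 0 (Assumption 1) and the same
    standard deviation at every x (Assumption 2).
    """
    return [BETA0 + BETA1 * x + rng.gauss(0.0, SIGMA) for _ in range(n)]


rng = random.Random(12345)  # fixed seed so the run is repeatable
summary = {}
for x in (5, 10, 15):  # the three settings of x shown in Figure 9.9
    ys = simulate_y(x, 20_000, rng)
    summary[x] = (statistics.fmean(ys), statistics.stdev(ys))
    print(f"x={x:2d}  E(y)={BETA0 + BETA1 * x:5.2f}  "
          f"mean(y)={summary[x][0]:6.3f}  sd(y)={summary[x][1]:5.3f}")
```

At each x the sample mean of y should fall close to β0+β1x and the sample standard deviation close to σ, which is what the equal-spread normal curves in the figure depict.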

It seems reasonable to assume that the greater the variability of the random error ε (which is measured by its variance σ2), the greater will be the errors in estimating the model parameters β0 and β1 and the errors of prediction when ŷ is used to predict y for some value of x. Consequently, you should not be surprised, as we proceed through this chapter, to find that σ2 appears in the formulas for all confidence intervals and test statistics that we will be using.

Estimation of σ2 for a (First-Order) Straight-Line Model

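$$s^2 = \frac{SSE}{\text{Degrees of freedom for error}} = \frac{SSE}{n-2}$$

where

$$SSE = \sum (y_i - \hat{y}_i)^2 = SS_{yy} - \hat{\beta}_1 SS_{xy}$$

in which

$$SS_{yy} = \sum (y_i - \bar{y})^2 = \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n}$$

To estimate the standard deviation σ of ε, we calculate

$$s = \sqrt{s^2} = \sqrt{\frac{SSE}{n-2}}$$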

We will refer to s as the estimated standard error of the regression model.

In most practical situations, σ2 is unknown and we must use our data to estimate its value. The best estimate of σ2, denoted by s2, is obtained by dividing the sum of the squares of the deviations of the y values from the prediction line, or

$$SSE = \sum (y_i - \hat{y}_i)^2$$

by the number of degrees of freedom associated with this quantity. We use 2 df to estimate the two parameters β0 and β1 in the straight-line model, leaving (n − 2) df for the estimation of the error variance.
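As a rough sketch of this calculation, the function below computes SSE both from its definition and from the shortcut SSyy − β̂1·SSxy, then divides by n − 2. The function name and the five data points are hypothetical, chosen only so the arithmetic is easy to follow, and the slope and intercept use the usual least squares formulas β̂1 = SSxy/SSxx and β̂0 = ȳ − β̂1x̄.

```python
import math


def error_variance(x, y):
    """Return (sse_direct, sse_shortcut, s2, s) for a straight-line fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_yy = sum((yi - y_bar) ** 2 for yi in y)

    beta1_hat = ss_xy / ss_xx
    beta0_hat = y_bar - beta1_hat * x_bar

    # Definition: squared deviations of y from the prediction line.
    sse_direct = sum(
        (yi - (beta0_hat + beta1_hat * xi)) ** 2 for xi, yi in zip(x, y)
    )
    # Shortcut: SSE = SSyy - beta1_hat * SSxy.
    sse_shortcut = ss_yy - beta1_hat * ss_xy

    s2 = sse_direct / (n - 2)  # 2 df are used up by beta0_hat and beta1_hat
    return sse_direct, sse_shortcut, s2, math.sqrt(s2)


# ASSUMPTION: hypothetical data, not taken from the text.
x_demo = [2, 4, 6, 8, 10]
y_demo = [3, 7, 8, 12, 15]
sse_direct, sse_shortcut, s2, s = error_variance(x_demo, y_demo)
print(f"SSE={sse_direct:.4f}  shortcut={sse_shortcut:.4f}  "
      f"s2={s2:.4f}  s={s:.4f}")
```

For these points SSyy = 86, SSxy = 58, and β̂1 = 1.45, so both routes give SSE = 1.9 and s2 = 1.9/3.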

Caution

When performing these calculations, you may be tempted to round the calculated values of SSyy, β̂1, and SSxy. Be certain to carry at least six significant figures for each of these quantities, to avoid substantial errors in calculating the SSE.
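The effect of premature rounding can be seen in a small experiment. This is only a sketch on hypothetical data: the helper `round_sig` and the six data points are introduced here for illustration. Because SSE is a small difference between two large numbers, rounding the inputs to three significant figures shifts the result noticeably, while six significant figures keep it close to the full-precision value.

```python
import math


def round_sig(value, sig):
    """Round value to `sig` significant figures."""
    if value == 0:
        return 0.0
    return round(value, sig - 1 - math.floor(math.log10(abs(value))))


# ASSUMPTION: hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5, 6]
y = [10, 21, 29, 41, 50, 61]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
ss_xx = sum((xi - x_bar) ** 2 for xi in x)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
ss_yy = sum((yi - y_bar) ** 2 for yi in y)
beta1_hat = ss_xy / ss_xx


def sse_from_rounded(sig):
    """SSE = SSyy - beta1_hat * SSxy with every input rounded first."""
    return (round_sig(ss_yy, sig)
            - round_sig(beta1_hat, sig) * round_sig(ss_xy, sig))


sse_full = ss_yy - beta1_hat * ss_xy  # no intermediate rounding
sse_3 = sse_from_rounded(3)
sse_6 = sse_from_rounded(6)
print(f"full precision: {sse_full:.4f}")
print(f"3 sig. figures: {sse_3:.4f}")
print(f"6 sig. figures: {sse_6:.4f}")
```

Here SSyy is about 1793.33 and β̂1·SSxy is about 1790.23, so an error in the third significant figure of either term is as large as the SSE itself.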

STIMULUS Example 9.3 Estimating σ in Regression—Drug Reaction Data

Problem

  1. Refer to Example 9.2 and the simple linear regression of the drug reaction data in Table 9.1.

    1. Compute an estimate of σ.

    2. Give a practical interpretation of the estimate.

Solution

  1. We previously calculated SSE=1.10 for the least squares line ŷ=−.1+.7x. Recalling that there were n=5 data points, we have n−2=5−2=3 df for estimating σ2. Thus,

    $$s^2 = \frac{SSE}{n-2} = \frac{1.10}{3} = .367$$

    is the estimated variance, and

    $$s = \sqrt{.367} = .61$$

    is the estimated standard error of the regression model.

  2. You may be able to grasp s intuitively by recalling the interpretation of a standard deviation given in Chapter 2 and remembering that the least squares line estimates the mean value of y for a given value of x. Since s measures the spread of the distribution of y values about the least squares line and these errors of prediction are assumed to be normally distributed, we should not be surprised to find that most (about 95%) of the observations lie within 2s, or 2(.61)=1.22, of the least squares line. For this simple example (only five data points), all five data points fall within 2s of the least squares line. In Section 9.6, we use s to evaluate the error of prediction when the least squares line is used to predict a value of y to be observed for a given value of x.
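The arithmetic in this solution can be checked with a few lines of code. The values SSE=1.10 and n=5 come from the example. Table 9.1 is not reproduced in this section, so the five (x, y) pairs below are an assumption; they were chosen because a least squares fit to them gives SSE=1.10, matching the value quoted above.

```python
import math

sse, n = 1.10, 5            # values stated in the example
s2 = sse / (n - 2)          # estimated variance of the error term
s = math.sqrt(s2)           # estimated standard error of the regression model
print(f"s2 = {s2:.3f}, s = {s:.2f}")

# ASSUMPTION: Table 9.1 is not shown in this section. These pairs are
# assumed values that reproduce the stated SSE of 1.10.
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
x_bar, y_bar = sum(x) / n, sum(y) / n
ss_xx = sum((xi - x_bar) ** 2 for xi in x)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
beta1_hat = ss_xy / ss_xx
beta0_hat = y_bar - beta1_hat * x_bar

residuals = [yi - (beta0_hat + beta1_hat * xi) for xi, yi in zip(x, y)]
sse_check = sum(r * r for r in residuals)
within_2s = [abs(r) <= 2 * s for r in residuals]
print(f"SSE from data = {sse_check:.2f}, all within 2s: {all(within_2s)}")
```

Under this assumption the largest residual is .7 in absolute value, well inside 2s. Computing 2s from the unrounded s gives about 1.21 rather than 1.22, a difference due only to rounding s to .61 first.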

Figure 9.10

SAS printout for the time–drug regression

Look Back

The values of s2 and s can also be obtained from a simple linear regression printout. The SAS printout for the drug reaction example is reproduced in Figure 9.10. The value of s2 is highlighted on the printout (in the Mean Square column in the row labeled Error). The value s2=.36667, rounded to three decimal places, agrees with the one calculated by hand. The value of s is also highlighted in Figure 9.10 (next to the heading Root MSE). This value, s=.60553, agrees (except for rounding) with our hand-calculated value.

Now Work Exercise 9.40ab

Interpretation of s, the Estimated Standard Deviation of ϵ

We expect most (about 95%) of the observed y values to lie within 2s of their respective least squares predicted values, ŷ.
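This rule of thumb can be illustrated by simulation. The sketch below assumes the four model assumptions hold and uses hypothetical parameter values to generate the data. It fits a least squares line, computes s, and counts the share of observations that lie within 2s of their predicted values.

```python
import math
import random

rng = random.Random(2024)  # fixed seed so the run is repeatable

# ASSUMPTION: hypothetical parameter values used only to generate data.
BETA0, BETA1, SIGMA = 2.0, 0.8, 1.5
n = 5000
x = [rng.uniform(0.0, 20.0) for _ in range(n)]
y = [BETA0 + BETA1 * xi + rng.gauss(0.0, SIGMA) for xi in x]

# Least squares fit using the usual slope and intercept formulas.
x_bar, y_bar = sum(x) / n, sum(y) / n
ss_xx = sum((xi - x_bar) ** 2 for xi in x)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
beta1_hat = ss_xy / ss_xx
beta0_hat = y_bar - beta1_hat * x_bar

residuals = [yi - (beta0_hat + beta1_hat * xi) for xi, yi in zip(x, y)]
s = math.sqrt(sum(r * r for r in residuals) / (n - 2))
share_within_2s = sum(abs(r) <= 2 * s for r in residuals) / n
print(f"s = {s:.3f}, share within 2s = {share_within_2s:.3f}")
```

With normally distributed errors the share should land near 95%, and s should be close to the σ used to generate the data.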

Exercises 9.37–9.52

Understanding the Principles

  1. 9.37 What are the four assumptions made about the probability distribution of ε in regression? Illustrate the assumptions with a graph.

  2. 9.38 Visually compare the scatterplots shown below. If a least squares line were determined for each data set, which do you think would have the smallest variance s2? Explain.

Learning the Mechanics

  1. 9.39 Calculate SSE and s2 for each of the following cases:

    1. n=20, SSyy=95, SSxy=50, β̂1=.75

    2. n=40, $\sum y^2 = 860$, $\sum y = 50$, SSxy=2,700, β̂1=.2

    3. n=10, $\sum (y_i - \bar{y})^2 = 58$, SSxy=91, SSxx=170

  2. 9.40 Suppose you fit a least squares line to 12 data points and the calculated value of SSE is .429.

    1. Find s2, the estimator of σ2 (the variance of the random error term ε).

    2. Find s, the estimate of σ.

    3. What is the largest deviation that you might expect between any one of the 12 points and the least squares line?

  3. 9.41 Refer to Exercises 9.18 and 9.21 (p. 512). Calculate SSE, s2, and s for the least squares lines obtained in those exercises. Interpret the standard errors of the regression model for each.

Applying the Concepts—Basic

  1. 9.42 Do nice guys really finish last? Refer to the Nature (Mar. 20, 2008) study of whether “nice guys finish last,” Exercise 9.22 (p. 512). Recall that Boston-area college students repeatedly played a version of the game “prisoner’s dilemma,” where competitors choose cooperation, defection, or costly punishment. At the conclusion of the games, the researchers recorded the average payoff and the number of times punishment was used for each player. Based on a scatterplot of the data, the simple linear regression relating average payoff (y) to punishment use (x) resulted in SSE=1.04.

    1. Assuming a sample size of n=28, compute the estimated standard deviation of the error distribution, s.

    2. Give a practical interpretation of s.

  2. MOON 9.43 Measuring the moon’s orbit. Refer to the American Journal of Physics (Apr. 2014) study of the moon’s orbit, Exercise 9.23 (p. 513). Recall that the angular size (y) of the moon was modeled as a straight-line function of height above horizon (x). Find an estimate of σ, the standard deviation of the error of prediction, on the SAS printout (p. 513). Give a practical interpretation of this value.

  3. POLO 9.44 Game performance of water polo players. Refer to the Biology of Sport (Vol. 31, 2014) study of the physiological performance of top-level water polo players, Exercise 9.24 (p. 513). Recall that data for eight Olympic male water polo players were used to model y=mean heart rate as a straight-line function of x=maximal oxygen uptake. Find an estimate of σ, the standard deviation of the error of prediction, on the printout (p. 513). Give a practical interpretation of this value.

  4. BTYPE 9.45 New method for blood typing. Refer to the Analytical Chemistry (May 2010) study in which medical researchers tested a new method of typing blood using low-cost paper, Exercise 9.25 (p. 514). The data were used to fit the straight-line model relating y=wicking length to x=antibody concentration. The SPSS printout follows.

    1. Give the values of SSE, s2, and s shown on the printout.

    2. Give a practical interpretation of s. Recall that wicking length is measured in millimeters.

  5. 9.46 Quantitative models of music. Chance (Fall 2004) published a study on modeling a certain pitch of a musical composition. Data on 147 famous compositions were used to model the number of times (y) a certain pitch occurs—called entropy—as a straight-line function of year of birth (x) of the composer. On the basis of the scatterplot of the data, the standard deviation σ of the model is estimated to be s=.1. For a given year (x), about 95% of the actual entropy values (y) will fall within d units of their predicted values. Find the value of d.

  6. ANTS 9.47 Mongolian desert ants. Refer to the Journal of Biogeography (Dec. 2003) study of ant sites in Mongolia, presented in Exercise 9.26 (p. 514). The data were used to estimate the straight-line model relating annual rainfall (y) to maximum daily temperature (x).

    1. Give the values of SSE, s2, and s, shown on the MINITAB printout (p. 514).

    2. Give a practical interpretation of the value of s.

Applying the Concepts—Intermediate

  1. 9.48 Detecting rapid visual targets. When two targets are presented close together in a rapid visual stream, the second target is often missed. Psychologists call this phenomenon the attentional blink (AB). A study published in Advances in Cognitive Psychology (July 2013) investigated whether simultaneous or preceding sounds could reduce AB. Twenty subjects were presented a rapid visual stream of symbols and letters on a computer screen and asked to identify the first and second letters (the targets). After several trials, the subject’s AB magnitude was measured as the difference between the percentages of first target and second target letters correctly identified. Each subject performed the task under each of three conditions. In the Simultaneous condition, a sound (tone) was presented simultaneously with the second target; in the Alert condition, a sound was presented prior to the coming of the second target; and in the No-Tone condition, no sound was presented with the second target. Scatterplots of AB magnitude for each possible pair of conditions are shown below as well as the least squares line for each.

    1. Which pair of conditions produces the least squares line with the steepest estimated slope?

    2. Which pair of conditions produces the least squares line with the largest SSE?

    3. Which pair of conditions produces the least squares line with the smallest estimate of σ?

  2. FCAT 9.49 FCAT scores and poverty. Refer to the Journal of Educational and Behavioral Statistics (Spring 2004) study of scores on the Florida Comprehensive Assessment Test (FCAT), presented in Exercise 9.30 (p. 515).

    1. Consider the simple linear regression relating math score (y) to percentage (x) of students below the poverty level. Find and interpret the value of s for this regression.

    2. Consider the simple linear regression relating reading score (y) to percentage (x) of students below the poverty level. Find and interpret the value of s for this regression.

    3. Which dependent variable, math score or reading score, can be more accurately predicted by percentage (x) of students below the poverty level? Explain.

  3. OJUICE 9.50 Sweetness of orange juice. Refer to the study of the quality of orange juice produced at a juice manufacturing plant, Exercise 9.32 (p. 516). Recall that simple linear regression was used to predict the sweetness index (y) from the amount of pectin (x) in orange juice manufactured during a production run.

    1. Give the values of SSE, s2, and s for this regression.

    2. Explain why it is difficult to give a practical interpretation to s2.

    3. Use the value of s to derive a range within which most (about 95%) of the errors of prediction of sweetness index fall.

  4. HEIGHT 9.51 Ideal height of your mate. Refer to the Chance (Summer 2008) study of the height of the ideal mate, Exercise 9.33 (p. 516). The data were used to fit the simple linear regression model, E(y)=β0+β1x, where y=ideal partner’s height (in inches) and x=student’s height (in inches).

    1. Fit the straight-line model to the data for the male students. Find an estimate for σ, the standard deviation of the error term, and interpret its value practically.

    2. Repeat part a for the female students.

    3. For which group, males or females, is student’s height the more accurate predictor of ideal partner’s height?

Applying the Concepts—Advanced

  1. TOOL 9.52 Life tests of cutting tools. To improve the quality of the output of any production process, it is necessary first to understand the capabilities of the process. (For example, see Gitlow, H., Quality Management Systems: A Practical Guide, 2000.) In a particular manufacturing process, the useful life of a cutting tool is linearly related to the speed at which the tool is operated. The data in the accompanying table were derived from life tests for the two different brands of cutting tools currently used in the production process. For which brand would you feel more confident using the least squares line to predict useful life for a given cutting speed? Explain.

    Cutting Speed           Useful Life (hours)
    (meters per minute)     Brand A     Brand B
    30                      4.5         6.0
    30                      3.5         6.5
    30                      5.2         5.0
    40                      5.2         6.0
    40                      4.0         4.5
    40                      2.5         5.0
    50                      4.4         4.5
    50                      2.8         4.0
    50                      1.0         3.7
    60                      4.0         3.8
    60                      2.0         3.0
    60                      1.1         2.4
    70                      1.1         1.5
    70                       .5         2.0
    70                      3.0         1.0