9.3 Model Assumptions

In Section 9.2, we assumed that the probabilistic model relating the drug reaction time y to the percentage x of drug in the bloodstream is

$y = β_{0} + β_{1} x + ε$

We also recall that the least squares estimate of the deterministic component of the model, $β_{0} + β_{1} x$, is

$\hat{y} = {\hat{β}}_{0} + {\hat{β}}_{1} x = - .1 + .7 x$

Now we turn our attention to the random component $ε$ of the probabilistic model and its relation to the errors in estimating $β_{0}$ and $β_{1}$. We will use a probability distribution to characterize the behavior of $ε$. We will see how the probability distribution of $ε$ determines how well the model describes the relationship between the dependent variable y and the independent variable x.

Step 3 in a regression analysis requires us to specify the probability distribution of the random error $ε$. We will make four basic assumptions about the general form of this probability distribution:

Assumption 1 The mean of the probability distribution of $ε$ is 0. That is, the average of the values of $ε$ over an infinitely long series of experiments is 0 for each setting of the independent variable x. This assumption implies that the mean value of y, for a given value of x, is $E (y) = β_{0} + β_{1} x$.
Assumption 2 The variance of the probability distribution of $ε$ is constant for all settings of the independent variable x. For our straight-line model, this assumption means that the variance of $ε$ is equal to a constant, say $σ^{2}$, for all values of x.
Assumption 3 The probability distribution of $ε$ is normal.
Assumption 4 The values of $ε$ associated with any two observed values of y are independent. That is, the value of $ε$ associated with one value of y has no effect on any of the values of $ε$ associated with any other y values.

The implications of the first three assumptions can be seen in Figure 9.9, which shows distributions of errors for three values of x, namely, 5, 10, and 15. Note that the relative frequency distributions of the errors are normal with a mean of 0 and a constant variance $σ^{2}$. (All of the distributions shown have the same amount of spread or variability.) The straight line shown in Figure 9.9 is the line of means; it indicates the mean value of y for a given value of x. We denote this mean value as E(y). Then the line of means is given by the equation

$E (y) = β_{0} + β_{1} x$
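The four assumptions can be made concrete with a small simulation. The following is a minimal sketch, not part of the original analysis: the "true" values `BETA0`, `BETA1`, and `SIGMA` are hypothetical (in practice they are unknown), and the helper name `simulate_y` is invented for illustration. Each error is drawn independently from a normal distribution with mean 0 and the same standard deviation at every x, so the simulated y values at x = 5, 10, and 15 should center on the line of means with the same spread.

```python
import random
import statistics

# Hypothetical "true" parameter values, assumed only for this illustration.
BETA0, BETA1, SIGMA = -0.1, 0.7, 1.0


def simulate_y(x, n_draws, rng):
    """Draw y = beta0 + beta1*x + eps, with eps ~ Normal(0, SIGMA^2).

    Every eps is generated by a separate call to rng.gauss, so the errors
    are independent (Assumption 4), normal (Assumption 3), have mean 0
    (Assumption 1), and have the same variance at every x (Assumption 2).
    """
    return [BETA0 + BETA1 * x + rng.gauss(0.0, SIGMA) for _ in range(n_draws)]


rng = random.Random(0)  # fixed seed so the run is reproducible
summary = {}
for x in (5, 10, 15):   # the three x settings pictured in Figure 9.9
    ys = simulate_y(x, 20_000, rng)
    summary[x] = (statistics.fmean(ys), statistics.stdev(ys))
    line_of_means = BETA0 + BETA1 * x
    print(f"x = {x:2d}: E(y) = {line_of_means:5.2f}, "
          f"sample mean = {summary[x][0]:5.2f}, sample sd = {summary[x][1]:4.2f}")
```

With many draws, the sample mean at each x should be close to $β_{0} + β_{1} x$ and the sample standard deviation close to $σ$, whatever the value of x.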

These assumptions make it possible for us to develop measures of reliability for the least squares estimators and to devise hypothesis tests for examining the usefulness of the least squares line. We have various techniques for checking the validity of these assumptions, and we have remedies to apply when they appear to be invalid. (These topics are beyond the scope of this text, but they are discussed in some of the chapter references.) Fortunately, the assumptions need not hold exactly in order for least squares estimators to be useful. The assumptions will be satisfied adequately for many applications encountered in practice.

It seems reasonable to assume that the greater the variability of the random error $ε$ (which is measured by its variance $σ^{2}$), the greater will be the errors in the estimation of the model parameters $β_{0}$ and $β_{1}$ and the error of prediction when $\hat{y}$ is used to predict y for some value of x. Consequently, you should not be surprised, as we proceed through this chapter, to find that $σ^{2}$ appears in the formulas for all confidence intervals and test statistics that we will be using.

Estimation of $σ^{2}$ for a (First-Order) Straight-Line Model

$s^{2} = \frac{SSE}{\text{Degrees of freedom for error}} = \frac{SSE}{n - 2}$

where $SSE = \sum {(y_{i} - {\hat{y}}_{i})}^{2} = {SS}_{yy} - {\hat{β}}_{1} {SS}_{xy}$

in which

${SS}_{yy} = \sum {(y_{i} - \bar{y})}^{2} = \sum y_{i}^{2} - \frac{{(\sum y_{i})}^{2}}{n}$

To estimate the standard deviation $σ$ of $ε$, we calculate

$s = \sqrt{s^{2}} = \sqrt{\frac{SSE}{n - 2}}$

We will refer to s as the estimated standard error of the regression model.

In most practical situations, $σ^{2}$ is unknown and we must use our data to estimate its value. The best estimate of $σ^{2}$, denoted by $s^{2}$, is obtained by dividing the sum of the squares of the deviations of the y values from the prediction line, or

$SSE = \sum {(y_{i} - {\hat{y}}_{i})}^{2}$

by the number of degrees of freedom associated with this quantity. We use 2 df to estimate the two parameters $β_{0}$ and $β_{1}$ in the straight-line model, leaving $(n - 2)$ df for the estimation of the error variance.
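The estimate just described can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `sse_shortcut`, `error_variance`, and `regression_std_error` are invented here, and the inputs SSE = 1.10 and n = 5 are the values this section reports for the drug reaction data.

```python
import math


def sse_shortcut(ss_yy, beta1_hat, ss_xy):
    """Shortcut formula: SSE = SS_yy - beta1_hat * SS_xy."""
    return ss_yy - beta1_hat * ss_xy


def error_variance(sse, n):
    """s^2 = SSE / (n - 2); 2 df are used up estimating beta0 and beta1."""
    if n <= 2:
        raise ValueError("need more than 2 data points to estimate the error variance")
    return sse / (n - 2)


def regression_std_error(sse, n):
    """s = sqrt(s^2), the estimated standard error of the regression model."""
    return math.sqrt(error_variance(sse, n))


# Values reported in this section for the drug reaction data.
sse, n = 1.10, 5
s2 = error_variance(sse, n)
s = regression_std_error(sse, n)
print(f"s^2 = {s2:.5f}, s = {s:.5f}")  # s^2 = 0.36667, s = 0.60553
```

Dividing by n - 2 rather than n is the only step that is specific to the straight-line model; a model with more estimated parameters would lose more degrees of freedom.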

Caution

When performing these calculations, you may be tempted to round the calculated values of ${SS}_{yy}$, ${\hat{β}}_{1}$, and ${SS}_{xy}$. Be certain to carry at least six significant figures for each of these quantities, to avoid substantial errors in calculating the SSE.
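To see why rounding is risky, note that the shortcut formula subtracts two numbers that are often nearly equal, so a small error in ${\hat{β}}_{1}$ can swamp the result. The sketch below uses hypothetical summary statistics chosen only to make the effect visible (they do not come from any data set in this chapter), and it assumes the usual slope formula ${\hat{β}}_{1} = {SS}_{xy} / {SS}_{xx}$.

```python
def sse_shortcut(ss_yy, beta1_hat, ss_xy):
    """Shortcut formula: SSE = SS_yy - beta1_hat * SS_xy."""
    return ss_yy - beta1_hat * ss_xy


# Hypothetical summary statistics, assumed purely for illustration.
ss_xx = 2000.0
ss_xy = 1414.0
ss_yy = 1000.0

beta1_full = ss_xy / ss_xx            # 0.707, carried at full precision
beta1_rounded = round(beta1_full, 2)  # 0.71, rounded to two decimals

sse_full = sse_shortcut(ss_yy, beta1_full, ss_xy)
sse_rounded = sse_shortcut(ss_yy, beta1_rounded, ss_xy)

print(f"SSE with full-precision slope: {sse_full:.3f}")
print(f"SSE with rounded slope:        {sse_rounded:.3f}")
```

Here the full-precision SSE is about 0.302, while the rounded slope gives about -3.94, an impossible value for a sum of squares.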

STIMULUS Example 9.3 Estimating $σ$ in Regression: Drug Reaction Data

Problem

Refer to Example 9.2 and the simple linear regression of the drug reaction data in Table 9.1.
1. Compute an estimate of $σ$.
2. Give a practical interpretation of the estimate.

Solution

We previously calculated $SSE = 1.10$ for the least squares line $\hat{y} = - .1 + .7 x$. Recalling that there were $n = 5$ data points, we have $n - 2 = 5 - 2 = 3$ df for estimating $σ^{2}$. Thus,

$s^{2} = \frac{SSE}{n - 2} = \frac{1.10}{3} = .367$

is the estimated variance, and

$s = \sqrt{.367} = .61$

is the standard error of the regression model.

You may be able to grasp s intuitively by recalling the interpretation of a standard deviation given in Chapter 2 and remembering that the least squares line estimates the mean value of y for a given value of x. Since s measures the spread of the distribution of y values about the least squares line and these errors of prediction are assumed to be normally distributed, we should not be surprised to find that most (about 95%) of the observations lie within 2s, or $2 (.61) = 1.22$, of the least squares line. For this simple example (only five data points), all five data points fall within 2s of the least squares line. In Section 9.6, we use s to evaluate the error of prediction when the least squares line is used to predict a value of y to be observed for a given value of x.
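The 2s interpretation can be checked directly by computing each residual and comparing it with 2s. The sketch below is illustrative only. Table 9.1 is not reproduced in this section, so the five (x, y) pairs are an assumption: they were chosen because they are consistent with the fitted line $\hat{y} = - .1 + .7 x$ and with $SSE = 1.10$ quoted above. The fitted coefficients and $s = .61$ are taken from the text.

```python
# Assumed illustrative data: NOT quoted from Table 9.1, only chosen to be
# consistent with the fitted line and SSE reported in the text.
xs = [1, 2, 3, 4, 5]
ys = [1, 1, 2, 2, 4]

b0_hat, b1_hat = -0.1, 0.7   # least squares estimates from the text
s = 0.61                     # estimated standard deviation from the text

# Residual = observed y minus predicted y on the least squares line.
residuals = [y - (b0_hat + b1_hat * x) for x, y in zip(xs, ys)]
sse = sum(r * r for r in residuals)
within_2s = [abs(r) <= 2 * s for r in residuals]

for x, y, r, ok in zip(xs, ys, residuals, within_2s):
    print(f"x = {x}, y = {y}, residual = {r:+.2f}, within 2s: {ok}")
print(f"SSE = {sse:.2f}; points within 2s = {sum(within_2s)} of {len(xs)}")
```

For these assumed points the largest absolute residual is .7, well inside the 2s bound of 1.22, which matches the statement that all five observations fall within 2s of the line.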

Figure 9.10 SAS printout for the time–drug regression

Look Back

The values of $s^{2}$ and s can also be obtained from a simple linear regression printout. The SAS printout for the drug reaction example is reproduced in Figure 9.10. The value of $s^{2}$ is highlighted on the printout (in the Mean Square column in the row labeled Error). The value $s^{2} = .36667$, rounded to three decimal places, agrees with the one calculated by hand. The value of s is also highlighted in Figure 9.10 (next to the heading Root MSE). This value, $s = .60553$, agrees (except for rounding) with our hand-calculated value.

Now Work Exercise 9.40a–b

Interpretation of s, the Estimated Standard Deviation of $ε$

We expect most $(\approx 95\%)$ of the observed y values to lie within 2s of their respective least squares predicted values, $\hat{y}$.

Exercises 9.37–9.52

Understanding the Principles

9.37 What are the four assumptions made about the probability distribution of $ε$ in regression? Illustrate the assumptions with a graph.
9.38 Visually compare the scatterplots shown below. If a least squares line were determined for each data set, which do you think would have the smallest variance $s^{2}$? Explain.

Learning the Mechanics

9.39 Calculate SSE and $s^{2}$ for each of the following cases:
1. $n = 20, {SS}_{yy} = 95, {SS}_{xy} = 50, {\hat{β}}_{1} = .75$
2. $n = 40, \sum y^{2} = 860, \sum y = 50, {SS}_{xy} = 2{,}700, {\hat{β}}_{1} = .2$
3. $n = 10, \sum {(y_{i} - \overline{y})}^{2} = 58, {SS}_{xy} = 91, {SS}_{xx} = 170$
9.40 Suppose you fit a least squares line to 12 data points and the calculated value of SSE is .429.
1. Find $s^{2}$, the estimator of $σ^{2}$ (the variance of the random error term $ε$).
2. Find s, the estimate of $σ$.
3. What is the largest deviation that you might expect between any one of the 12 points and the least squares line?
9.41 Refer to Exercises 9.18 and 9.21 (p. 512). Calculate SSE, $s^{2}$, and s for the least squares lines obtained in those exercises. Interpret the standard errors of the regression model for each.

Applying the Concepts—Basic

9.42 Do nice guys really finish last? Refer to the Nature (Mar. 20, 2008) study of whether “nice guys finish last,” Exercise 9.22 (p. 512). Recall that Boston-area college students repeatedly played a version of the game “prisoner’s dilemma,” where competitors choose cooperation, defection, or costly punishment. At the conclusion of the games, the researchers recorded the average payoff and the number of times punishment was used for each player. Based on a scatterplot of the data, the simple linear regression relating average payoff (y) to punishment use (x) resulted in $SSE = 1.04$.
1. Assuming a sample size of $n = 28$, compute the estimated standard deviation of the error distribution, s.
2. Give a practical interpretation of s.
MOON 9.43 Measuring the moon’s orbit. Refer to the American Journal of Physics (Apr. 2014) study of the moon’s orbit, Exercise 9.23 (p. 513). Recall that the angular size (y) of the moon was modeled as a straight-line function of height above horizon (x). Find an estimate of $σ$, the standard deviation of the error of prediction, on the SAS printout (p. 513). Give a practical interpretation of this value.
POLO 9.44 Game performance of water polo players. Refer to the Biology of Sport (Vol. 31, 2014) study of the physiological performance of top-level water polo players, Exercise 9.24 (p. 513). Recall that data for eight Olympic male water polo players were used to model $y =$ mean heart rate as a straight-line function of $x =$ maximal oxygen uptake. Find an estimate of $σ$, the standard deviation of (p. 513). Give a practical interpretation of this value.
BTYPE 9.45 New method for blood typing. Refer to the Analytical Chemistry (May 2010) study in which medical researchers tested a new method of typing blood using low-cost paper, Exercise 9.25 (p. 514). The data were used to fit the straight-line model relating $y =$ wicking length to $x =$ antibody concentration. The SPSS printout follows.
1. Give the values of SSE, s², and s shown on the printout.
2. Give a practical interpretation of s. Recall that wicking length is measured in millimeters.
9.46 Quantitative models of music. Chance (Fall 2004) published a study on modeling a certain pitch of a musical composition. Data on 147 famous compositions were used to model the number of times (y) a certain pitch occurs (called entropy) as a straight-line function of year of birth (x) of the composer. On the basis of the scatterplot of the data, the standard deviation $σ$ of the model is estimated to be $s = .1$. For a given year (x), about 95% of the actual entropy values (y) will fall within d units of their predicted values. Find the value of d.
ANTS 9.47 Mongolian desert ants. Refer to the Journal of Biogeography (Dec. 2003) study of ant sites in Mongolia, presented in Exercise 9.26 (p. 514). The data were used to estimate the straight-line model relating annual rainfall (y) to maximum daily temperature (x).
1. Give the values of SSE, $s^{2}$, and s shown on the MINITAB printout (p. 514).
2. Give a practical interpretation of the value of s.

Applying the Concepts—Intermediate

9.48 Detecting rapid visual targets. When two targets are presented close together in a rapid visual stream, the second target is often missed. Psychologists call this phenomenon the attentional blink (AB). A study published in Advances in Cognitive Psychology (July 2013) investigated whether simultaneous or preceding sounds could reduce AB. Twenty subjects were presented a rapid visual stream of symbols and letters on a computer screen and asked to identify the first and second letters (the targets). After several trials, the subject’s AB magnitude was measured as the difference between the percentages of first target and second target letters correctly identified. Each subject performed the task under each of three conditions. In the Simultaneous condition, a sound (tone) was presented simultaneously with the second target; in the Alert condition, a sound was presented prior to the coming of the second target; and in the No-Tone condition, no sound was presented with the second target. Scatterplots of AB magnitude for each possible pair of conditions are shown below as well as the least squares line for each.
1. Which pair of conditions produces the least squares line with the steepest estimated slope?
2. Which pair of conditions produces the least squares line with the largest SSE?
3. Which pair of conditions produces the least squares line with the smallest estimate of $σ$?
FCAT 9.49 FCAT scores and poverty. Refer to the Journal of Educational and Behavioral Statistics (Spring 2004) study of scores on the Florida Comprehensive Assessment Test (FCAT), presented in Exercise 9.30 (p. 515).
1. Consider the simple linear regression relating math score (y) to percentage (x) of students below the poverty level. Find and interpret the value of s for this regression.
2. Consider the simple linear regression relating reading score (y) to percentage (x) of students below the poverty level. Find and interpret the value of s for this regression.
3. Which dependent variable, math score or reading score, can be more accurately predicted by percentage (x) of students below the poverty level? Explain.
OJUICE 9.50 Sweetness of orange juice. Refer to the study of the quality of orange juice produced at a juice manufacturing plant, Exercise 9.32 (p. 516). Recall that simple linear regression was used to predict the sweetness index (y) from the amount of pectin (x) in orange juice manufactured during a production run.
1. Give the values of SSE, s², and s for this regression.
2. Explain why it is difficult to give a practical interpretation to s².
3. Use the value of s to derive a range within which most (about 95%) of the errors of prediction of sweetness index fall.
HEIGHT 9.51 Ideal height of your mate. Refer to the Chance (Summer 2008) study of the height of the ideal mate, Exercise 9.33 (p. 516). The data were used to fit the simple linear regression model, $E (y) = β_{0} + β_{1} x$, where $y =$ ideal partner’s height (in inches) and $x =$ student’s height (in inches).
1. Fit the straight-line model to the data for the male students. Find an estimate for $σ$, the standard deviation of the error term, and interpret its value practically.
2. Repeat part a for the female students.
3. For which group, males or females, is student’s height the more accurate predictor of ideal partner’s height?

Applying the Concepts—Advanced

TOOL 9.52 Life tests of cutting tools. To improve the quality of the output of any production process, it is necessary first to understand the capabilities of the process. (For example, see Gitlow, H., Quality Management Systems: A Practical Guide, 2000.) In a particular manufacturing process, the useful life of a cutting tool is linearly related to the speed at which the tool is operated. The data in the accompanying table were derived from life tests for the two different brands of cutting tools currently used in the production process. For which brand would you feel more confident using the least squares line to predict useful life for a given cutting speed? Explain.

	Useful Life (hours)
Cutting Speed (meters per minute)	Brand A	Brand B
30	4.5	6.0
30	3.5	6.5
30	5.2	5.0
40	5.2	6.0
40	4.0	4.5
40	2.5	5.0
50	4.4	4.5
50	2.8	4.0
50	1.0	3.7
60	4.0	3.8
60	2.0	3.0
60	1.1	2.4
70	1.1	1.5
70	.5	2.0
70	3.0	1.0
