In this section, we present two statistics that describe the adequacy of a model: the coefficient of correlation and the coefficient of determination.
Recall (from optional Section 2.8) that a bivariate relationship describes a relationship—or correlation—between two variables x and y. Scatterplots are used to describe a bivariate relationship graphically. In this section, we will discuss the concept of correlation and how it can be used to measure the linear relationship between two variables x and y. A numerical descriptive measure of correlation is provided by the coefficient of correlation, r.
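The coefficient of correlation, r, is a measure of the strength of the linear relationship between two variables x and y. It is computed (for a sample of n measurements on x and y) as follows:

$$r = \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}}$$

where

$$SS_{xy} = \sum (x - \bar{x})(y - \bar{y}), \qquad SS_{xx} = \sum (x - \bar{x})^2, \qquad SS_{yy} = \sum (y - \bar{y})^2$$

*The two tests are equivalent in simple linear regression only.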
Note that the computational formula for the correlation coefficient r given above involves the same quantities that were used in computing the least squares prediction equation. In fact, since the numerators of the expressions for $\hat{\beta}_1$ and r are identical, it is clear that $r = 0$ when $\hat{\beta}_1 = 0$ (the case where x contributes no information for the prediction of y) and that r is positive when the slope is positive and negative when the slope is negative. Unlike $\hat{\beta}_1$, the correlation coefficient r is scaleless and assumes a value between $-1$ and $+1$, regardless of the units of x and y.
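The contrast between the slope and r can be checked numerically. The sketch below uses the five drug reaction observations (x = 1, 2, 3, 4, 5; y = 1, 1, 2, 2, 4) and, purely as an illustrative assumption, re-expresses x in a unit ten times smaller, so that every x value is multiplied by 10. The helper names `cross_dev` and `slope_and_r` are hypothetical and not taken from any package.

```python
from math import sqrt

def cross_dev(u, v):
    """Sum of cross-deviations: sum((u - u_bar) * (v - v_bar))."""
    u_bar = sum(u) / len(u)
    v_bar = sum(v) / len(v)
    return sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))

def slope_and_r(x, y):
    """Return the least squares slope and the correlation coefficient."""
    ss_xy = cross_dev(x, y)
    ss_xx = cross_dev(x, x)
    ss_yy = cross_dev(y, y)
    return ss_xy / ss_xx, ss_xy / sqrt(ss_xx * ss_yy)

x = [1, 2, 3, 4, 5]   # percent of drug in the bloodstream
y = [1, 1, 2, 2, 4]   # reaction time (seconds)

slope, r = slope_and_r(x, y)

# Hypothetical change of units for x (assumption made for illustration only).
x_rescaled = [10 * value for value in x]
slope_rescaled, r_rescaled = slope_and_r(x_rescaled, y)

print(f"original x:  slope = {slope:.3f}, r = {r:.3f}")
print(f"rescaled x:  slope = {slope_rescaled:.3f}, r = {r_rescaled:.3f}")
```

The slope shrinks by a factor of 10 when the unit of x changes, while r stays at about .904 in both runs.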
A value of r near or equal to 0 implies little or no linear relationship between y and x. In contrast, the closer r comes to 1 or $-1$, the stronger is the linear relationship between y and x. And if $r = 1$ or $r = -1$, all the sample points fall exactly on the least squares line. Positive values of r imply a positive linear relationship between y and x; that is, y increases as x increases. Negative values of r imply a negative linear relationship between y and x; that is, y decreases as x increases. Each of these situations is portrayed in Figure 9.16.
Now Work Exercise 9.79
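We use the data in Table 9.1 for the drug reaction example to demonstrate how to calculate the coefficient of correlation, r. The quantities needed to calculate r are $SS_{xy}$, $SS_{xx}$, and $SS_{yy}$. The first two quantities have been calculated previously and are $SS_{xy} = 7$ and $SS_{xx} = 10$. The calculation for $SS_{yy} = \sum (y - \bar{y})^2$ is shown in the last column of the Excel spreadsheet, Figure 9.5 (p. 508). The result is $SS_{yy} = 6$.
We now find the coefficient of correlation:

$$r = \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}} = \frac{7}{\sqrt{(10)(6)}} = \frac{7}{\sqrt{60}} = .904$$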
The fact that r is positive and near 1 indicates that the reaction time tends to increase as the amount of drug in the bloodstream increases—for the given sample of five subjects. This is the same conclusion we reached when we found the calculated value of the least squares slope to be positive.
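The same arithmetic can be reproduced from raw sums. The sketch below assumes the standard, algebraically equivalent shortcut forms $SS_{xy} = \sum xy - (\sum x)(\sum y)/n$, $SS_{xx} = \sum x^2 - (\sum x)^2/n$, and $SS_{yy} = \sum y^2 - (\sum y)^2/n$; the variable names are illustrative only.

```python
from math import sqrt

x = [1, 2, 3, 4, 5]   # percent of drug in the bloodstream
y = [1, 1, 2, 2, 4]   # reaction time (seconds)
n = len(x)

# Raw sums needed by the shortcut formulas.
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)

# Shortcut forms of the three sums of squares.
ss_xy = sum_xy - sum_x * sum_y / n
ss_xx = sum_x2 - sum_x ** 2 / n
ss_yy = sum_y2 - sum_y ** 2 / n

r = ss_xy / sqrt(ss_xx * ss_yy)
print(ss_xy, ss_xx, ss_yy, round(r, 3))  # prints 7.0 10.0 6.0 0.904
```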
Legalized gambling is available on several riverboat casinos operated by a city in Mississippi. The mayor of the city wants to know the correlation between the number of casino employees and the yearly crime rate. The records for the past 10 years are examined, and the results listed in Table 9.3 are obtained. Calculate the coefficient of correlation, r, for the data. Interpret the result.
Year | Number x of Casino Employees (thousands) | Crime Rate y (number of crimes per 1,000 population) |
---|---|---|
2006 | 15 | 1.35 |
2007 | 18 | 1.63 |
2008 | 24 | 2.33 |
2009 | 22 | 2.41 |
2010 | 25 | 2.63 |
2011 | 29 | 2.93 |
2012 | 30 | 3.41 |
2013 | 32 | 3.26 |
2014 | 35 | 3.63 |
2015 | 38 | 4.15 |
Data Set: CASINO
Rather than use the computing formula given earlier, we resort to a statistical software package. The data of Table 9.3 were entered into a computer and MINITAB was used to compute r. The MINITAB printout is shown in Figure 9.17.
Intentionally using the correlation coefficient only to make an inference about the relationship between two variables in situations where a nonlinear relationship may exist is considered unethical statistical practice.
The coefficient of correlation, highlighted at the top of the printout, is $r = .987$. Thus, the size of the casino workforce and crime rate in this city are very highly correlated, at least over the past 10 years. The implication is that a strong positive linear relationship exists between these variables. (See Figure 9.17.) We must be careful, however, not to jump to any unwarranted conclusions. For instance, the mayor may be tempted to conclude that hiring more casino workers next year will increase the crime rate; that is, that there is a causal relationship between the two variables. However, high correlation does not imply causality. The fact is, many things have probably contributed both to the increase in the casino workforce and to the increase in crime rate. The city’s tourist trade has undoubtedly grown since riverboat casinos were legalized, and it is likely that the casinos have expanded both in services offered and in number. We cannot infer a causal relationship on the basis of high sample correlation. When a high correlation is observed in the sample data, the only safe conclusion is that a linear trend may exist between x and y.
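For readers without access to MINITAB, the correlation for the data of Table 9.3 can be reproduced with a few lines of Python. This is only a sketch of the defining formula applied to the tabled values, not the procedure the software uses internally; the variable names are illustrative.

```python
from math import sqrt

# Table 9.3: casino employees (thousands) and crime rate (per 1,000 population)
employees = [15, 18, 24, 22, 25, 29, 30, 32, 35, 38]
crime_rate = [1.35, 1.63, 2.33, 2.41, 2.63, 2.93, 3.41, 3.26, 3.63, 4.15]

n = len(employees)
x_bar = sum(employees) / n
y_bar = sum(crime_rate) / n

# Sums of squares and cross-products of deviations from the means.
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(employees, crime_rate))
ss_xx = sum((x - x_bar) ** 2 for x in employees)
ss_yy = sum((y - y_bar) ** 2 for y in crime_rate)

r = ss_xy / sqrt(ss_xx * ss_yy)
print(f"r = {r:.3f}")
```

The result rounds to .987, a strong positive linear association in the sample. The computation says nothing about why the two series move together.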
Another variable, such as the increase in tourism, may be the underlying cause of the high correlation between x and y.
Now Work Exercise 9.85a
Two caveats apply in using the sample correlation coefficient r to infer the nature of the relationship between x and y: (1) A high correlation does not necessarily imply that a causal relationship exists between x and y—only that a linear trend may exist; (2) a low correlation does not necessarily imply that x and y are unrelated—only that x and y are not strongly linearly related.
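The second caveat is easy to demonstrate with a small constructed example. The data below are made up for illustration (they do not come from any study in this chapter): y is determined exactly by x through $y = x^2$, yet the sample correlation is 0 because the relationship is not linear.

```python
from math import sqrt

# Constructed data: y depends perfectly on x, but through a curve, not a line.
x = [-2, -1, 0, 1, 2]
y = [value ** 2 for value in x]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

ss_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
ss_xx = sum((a - x_bar) ** 2 for a in x)
ss_yy = sum((b - y_bar) ** 2 for b in y)

r = ss_xy / sqrt(ss_xx * ss_yy)
print(f"r = {r:.3f}")
```

Here the positive and negative cross-products cancel, so $SS_{xy} = 0$ and r = 0 even though x and y are perfectly (nonlinearly) related. A scatterplot would reveal the pattern immediately.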
Keep in mind that the correlation coefficient r measures the linear correlation between x values and y values in the sample, and a similar linear coefficient of correlation exists for the population from which the data points were selected. The population correlation coefficient is denoted by the symbol $\rho$ (rho). As you might expect, $\rho$ is estimated by the corresponding sample statistic r. Or, instead of estimating $\rho$, we might want to test the null hypothesis $H_0: \rho = 0$ against $H_a: \rho \ne 0$; that is, we can test the hypothesis that x contributes no information for the prediction of y by using the straight-line model against the alternative that the two variables are at least linearly related.
However, we already performed this identical test in Section 9.4 when we tested $H_0: \beta_1 = 0$ against $H_a: \beta_1 \ne 0$. That is, the null hypothesis $H_0: \rho = 0$ is equivalent to the hypothesis $H_0: \beta_1 = 0$.* When we tested the null hypothesis $H_0: \beta_1 = 0$ in connection with the drug reaction example, the data led to a rejection of the null hypothesis at the $\alpha = .05$ level. This rejection implies that the null hypothesis of a 0 linear correlation between the two variables (drug and reaction time) can also be rejected at the $\alpha = .05$ level. The only real difference between the least squares slope $\hat{\beta}_1$ and the coefficient of correlation, r, is the measurement scale. Therefore, the information they provide about the usefulness of the least squares model is to some extent redundant. For this reason, we will use the slope to make inferences about the existence of a positive or negative linear relationship between two variables.
For the sake of completeness, a summary of the test for linear correlation is provided in the following boxes.
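Test statistic: $t_c = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}$

Item | One-Tailed Tests | | Two-Tailed Test |
---|---|---|---|
Hypotheses | $H_0: \rho = 0$ vs. $H_a: \rho < 0$ | $H_0: \rho = 0$ vs. $H_a: \rho > 0$ | $H_0: \rho = 0$ vs. $H_a: \rho \ne 0$ |
Rejection region | $t_c < -t_\alpha$ | $t_c > t_\alpha$ | $\lvert t_c \rvert > t_{\alpha/2}$ |
p-value | $P(t < t_c)$ | $P(t > t_c)$ | $2P(t > t_c)$ if $t_c$ is positive; $2P(t < t_c)$ if $t_c$ is negative |

where the distribution of t depends on $(n - 2)$ df.
Condition required for a valid test of correlation: The sample of (x, y) values is randomly selected from a normal population.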
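The equivalence of the correlation test and the slope test can be verified on the drug reaction data. The sketch below assumes the usual form of the correlation test statistic, $t = r\sqrt{n-2}/\sqrt{1-r^2}$, and the usual slope statistic, $t = \hat{\beta}_1/(s/\sqrt{SS_{xx}})$ with $s^2 = SSE/(n-2)$. The critical value 3.182 is the tabled $t_{.025}$ for 3 df and is hard-coded rather than computed.

```python
from math import sqrt

x = [1, 2, 3, 4, 5]   # percent of drug in the bloodstream
y = [1, 1, 2, 2, 4]   # reaction time (seconds)
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

ss_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
ss_xx = sum((a - x_bar) ** 2 for a in x)
ss_yy = sum((b - y_bar) ** 2 for b in y)

# Test statistic computed from the correlation coefficient.
r = ss_xy / sqrt(ss_xx * ss_yy)
t_from_r = r * sqrt(n - 2) / sqrt(1 - r ** 2)

# Test statistic computed from the least squares slope.
slope = ss_xy / ss_xx
sse = ss_yy - slope * ss_xy
s = sqrt(sse / (n - 2))
t_from_slope = slope / (s / sqrt(ss_xx))

T_CRITICAL = 3.182  # tabled t_.025 with n - 2 = 3 df (two-tailed, alpha = .05)
reject_h0 = abs(t_from_r) > T_CRITICAL

print(f"t from r     = {t_from_r:.3f}")
print(f"t from slope = {t_from_slope:.3f}")
print(f"reject H0 at alpha = .05: {reject_h0}")
```

Both statistics come out to about 3.66, which exceeds 3.182, so the two tests lead to the same decision.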
Another way to measure the usefulness of a linear model is to measure the contribution of x in predicting y. To accomplish this, we calculate how much the errors of prediction of y were reduced by using the information provided by x. To illustrate, consider the sample shown in the scatterplot of Figure 9.18a. If we assume that x contributes no information for the prediction of y, the best prediction for a value of y is the sample mean $\bar{y}$, which is shown as the horizontal line in Figure 9.18b. The vertical line segments in Figure 9.18b are the deviations of the points about the mean $\bar{y}$. Note that the sum of the squares of the deviations for the prediction equation $\hat{y} = \bar{y}$ is

$$SS_{yy} = \sum (y_i - \bar{y})^2$$

Now suppose you fit a least squares line to the same set of data and locate the deviations of the points about the line, as shown in Figure 9.18c. Compare the deviations about the prediction lines in Figures 9.18b and 9.18c. You can see the following:
If x contributes little or no information for the prediction of y, the sums of the squares of the deviations for the two lines, $SS_{yy} = \sum (y_i - \bar{y})^2$ and $SSE = \sum (y_i - \hat{y}_i)^2$, will be nearly equal.
If x does contribute information for the prediction of y, the SSE will be smaller than $SS_{yy}$. In fact, if all the points fall on the least squares line, then $SSE = 0$.
Consequently, the reduction in the sum of the squares of the deviations that can be attributed to x, expressed as a proportion of $SS_{yy}$, is

$$\frac{SS_{yy} - SSE}{SS_{yy}}$$

Note that $SS_{yy}$ is the “total sample variability” of the observations around the mean $\bar{y}$ and that SSE is the remaining “unexplained sample variability” after fitting the line $\hat{y}$. Thus, the difference $(SS_{yy} - SSE)$ is the “explained sample variability” attributable to the linear relationship with x. A verbal description of the proportion is therefore

$$\frac{SS_{yy} - SSE}{SS_{yy}} = \frac{\text{Explained sample variability}}{\text{Total sample variability}} = \text{Proportion of total sample variability explained by the linear relationship}$$
In simple linear regression, it can be shown that this proportion—called the coefficient of determination—is equal to the square of the simple linear coefficient of correlation, r.
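A short derivation makes the "it can be shown" step explicit. It assumes two identities from the least squares development earlier in the chapter, namely $\hat{\beta}_1 = SS_{xy}/SS_{xx}$ and the shortcut $SSE = SS_{yy} - \hat{\beta}_1 SS_{xy}$; if those are written differently in your edition, the algebra below should be adapted accordingly.

```latex
\begin{align*}
\frac{SS_{yy} - SSE}{SS_{yy}}
  &= \frac{SS_{yy} - \left(SS_{yy} - \hat{\beta}_1 SS_{xy}\right)}{SS_{yy}}
   = \frac{\hat{\beta}_1 \, SS_{xy}}{SS_{yy}} \\
  &= \frac{\left(SS_{xy}/SS_{xx}\right) SS_{xy}}{SS_{yy}}
   = \frac{SS_{xy}^2}{SS_{xx}\, SS_{yy}}
   = \left(\frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}}\right)^{2}
   = r^2
\end{align*}
```

For the drug reaction data this gives $7^2 / \big((10)(6)\big) = 49/60 \approx .817$.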
The coefficient of determination is

$$r^2 = \frac{SS_{yy} - SSE}{SS_{yy}} = 1 - \frac{SSE}{SS_{yy}}$$

and represents the proportion of the total sample variability around $\bar{y}$ that is explained by the linear relationship between y and x. (In simple linear regression, it may also be computed as the square of the coefficient of correlation, r.)
Note that $r^2$ is always between 0 and 1 because r is between $-1$ and $+1$. Thus, an $r^2$ of .60 means that the sum of the squares of the deviations of the y values about their predicted values has been reduced 60% by the use of the least squares equation $\hat{y}$, instead of $\bar{y}$, to predict y.
Calculate the coefficient of determination for the drug reaction example. The data are repeated in Table 9.4 for convenience. Interpret the result.
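From previous calculations, $SS_{yy} = 6$ and $SSE = \sum (y - \hat{y})^2 = 1.10$.
Then, from our earlier definition, the coefficient of determination is

$$r^2 = \frac{SS_{yy} - SSE}{SS_{yy}} = \frac{6.0 - 1.1}{6.0} = \frac{4.9}{6.0} = .817$$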
Percent x of Drug | Reaction Time y (seconds) |
---|---|
1 | 1 |
2 | 1 |
3 | 2 |
4 | 2 |
5 | 4 |
Data Set: STIMULUS
Another way to compute $r^2$ is to recall from earlier in this section that $r = .904$. Then we have $r^2 = (.904)^2 = .817$. A third way to obtain $r^2$ is from a computer printout. Its value is highlighted on the SPSS printout in Figure 9.19. Our interpretation is as follows: We know that using the percent x of drug in the blood to predict y with the least squares line $\hat{y} = -.1 + .7x$ accounts for nearly 82% of the total sum of the squares of the deviations of the five sample y values about their mean. Or, stated another way, 82% of the sample variation in reaction time (y) can be “explained” by using the percent x of drug in a straight-line model.
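The agreement between the two hand computations can be checked directly from the data in Table 9.4. The sketch below fits the least squares line, sums the squared residuals to get SSE, and compares $(SS_{yy} - SSE)/SS_{yy}$ with the square of r; the variable names are illustrative only.

```python
from math import sqrt

x = [1, 2, 3, 4, 5]   # percent of drug in the bloodstream
y = [1, 1, 2, 2, 4]   # reaction time (seconds)
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

ss_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
ss_xx = sum((a - x_bar) ** 2 for a in x)
ss_yy = sum((b - y_bar) ** 2 for b in y)

# Least squares line y_hat = intercept + slope * x.
slope = ss_xy / ss_xx
intercept = y_bar - slope * x_bar

# SSE from the residuals about the fitted line.
sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

r2_from_sse = (ss_yy - sse) / ss_yy
r2_from_r = (ss_xy / sqrt(ss_xx * ss_yy)) ** 2

print(f"SSE = {sse:.2f}")
print(f"r^2 from SSE = {r2_from_sse:.3f}")
print(f"r^2 from r   = {r2_from_r:.3f}")
```

Both routes give approximately .817, matching the value obtained by hand.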
Now Work Exercise 9.87a
$100(r^2)\%$ of the sample variation in y (measured by the total sum of the squares of the deviations of the sample y values about their mean $\bar{y}$) can be explained by (or attributed to) using x to predict y in the straight-line model.
Using the Coefficients of Correlation and Determination to Assess the Dowsing Data
In the previous Statistics in Action Revisited, we discovered that using a dowser’s guess (x) in a straight-line model was not statistically useful in predicting actual pipe location (y). Both the coefficient of correlation and the coefficient of determination (highlighted on the MINITAB printouts in Figure SIA9.4) also support this conclusion. The value of the correlation coefficient, r, indicates a fairly weak positive linear relationship between the variables. This value, however, is not statistically significant. In other words, there is no evidence to indicate that the population correlation coefficient is different from 0. The coefficient of determination, $r^2$, implies that only about 10% of the sample variation in pipe location values can be explained by the simple linear model.
9.77 True or False. The correlation coefficient is a measure of the strength of the linear relationship between x and y.
9.78 Describe the slope of the least squares line if
9.79 Explain what each of the following sample correlation coefficients tells you about the relationship between the x and y values in the sample:
9.80 True or False. A value of the correlation coefficient near 1 or near $-1$ implies a causal relationship between x and y.
9.81 Construct a scatterplot for each data set. Then calculate r and $r^2$ for each data set.
a.
x | 0 | 1 | 2 | |||
y | 1 | 2 | 5 | 6 |
b.
x | 0 | 1 | 2 | |||
y | 6 | 5 | 3 | 2 | 0 |
c.
x | 1 | 2 | 2 | 3 | 3 | 3 | 4 | |
y | 2 | 1 | 3 | 1 | 2 | 3 | 2 |
d.
x | 0 | 1 | 3 | 5 | 6 | |
y | 0 | 1 | 2 | 1 | 0 |
9.82 Calculate $r^2$ for the least squares line in Exercise 9.18 (p. 512).
9.83 Calculate $r^2$ for the least squares line in Exercise 9.21 (p. 512).
Use the applet entitled Correlation by the Eye to explore the relationship between the pattern of data in a scatterplot and the corresponding correlation coefficient.
Run the applet several times. Each time, guess the value of the correlation coefficient. Then click Show r to see the actual correlation coefficient. How close is your value to the actual value of r? Click New data to reset the applet.
Click the trash can to clear the graph. Use the mouse to place five points on the scatterplot that are approximately in a straight line. Then guess the value of the correlation coefficient. Click Show r to see the actual correlation coefficient. How close were you this time?
Continue to clear the graph and plot sets of five points with different patterns among the points. Guess the value of r. How close do you come to the actual value of r each time?
On the basis of your experiences with the applet, explain why we need to use more reliable methods of finding the correlation coefficient than just “eyeing” it.
9.84 RateMyProfessors.com. A popular Web site among college students is RateMyProfessors.com (RMP). Established over 10 years ago, RMP allows students to post quantitative ratings of their instructors. In Practical Assessment, Research & Evaluation (May 2007), University of Maine researchers investigated whether instructor ratings posted on RMP are correlated with the formal in-class student evaluations of teaching (SET) that all universities are required to administer at the end of the semester. Data collected for University of Maine instructors yielded a correlation between RMP and SET ratings of .68.
a. Give the equation of a linear model relating SET rating (y) to RMP rating (x).
b. Give a practical interpretation of the value $r = .68$.
c. Is the estimated slope of the line, part a, positive or negative? Explain.
d. A test of the null hypothesis $H_0: \rho = 0$ yielded a p-value of .001. Interpret this result.
e. Compute the coefficient of determination, $r^2$, for the regression analysis. Interpret the result.
9.85 Last name and acquisition timing. Refer to the Journal of Consumer Research (Aug. 2011) study of the speed with which consumers decide to purchase a product, Exercise 7.12 (p. 382). Recall that the researchers theorized that consumers with last names that begin with letters later in the alphabet will tend to acquire items faster than those whose last names are earlier in the alphabet (i.e., the last name effect). Each in a sample of 50 MBA students was offered free tickets to attend a college basketball game for which there was a limited supply of tickets. The first letter of the last name of those who responded to an e-mail offer in time to receive the tickets was noted and given a numerical value (e.g., “A”, “B”, etc.). Each student’s response time (measured in minutes) was also recorded.
a. The researchers computed the correlation between the two variables as . Interpret this result.
b. The observed significance level for testing for a negative correlation in the population was reported as p-value . Interpret this result for .
c. Does this analysis support the researchers’ last name effect theory? Explain.
TASTE 9.86 Taste-testing scales. The Journal of Food Science (Feb. 2014) published the results of a taste-testing study. The researchers evaluated the general Labeled Magnitude Scale (gLMS), used to rate the palatability of food items on a scale ranging from (for strongest imaginable dislike) to (for strongest imaginable like). The researchers called this rating the perceived hedonic intensity. A sample of 200 students and staff at the University of Florida used the scale to rate their most favorite and least favorite foods. In addition, each taster rated the sensory intensity of four different solutions: salt, sucrose, citric acid, and hydrochloride. The averages of these four ratings were used by the researchers to quantify individual variation in taste intensity—called perceived sensory intensity. These data are saved in the TASTE file. The accompanying MINITAB printout shows the correlation between perceived sensory intensity (PSI) and perceived hedonic intensity for both favorite (PHI-F) and least favorite (PHI-L) foods. According to the researchers, “the palatability of the favorite and least favorite foods varies depending on the perceived intensity of taste: Those who experience the greatest taste intensity (that is, supertasters) tend to experience more extreme food likes and dislikes.” Do you agree? Explain.
9.87 Going for it on fourth down in the NFL. Each week coaches in the National Football League (NFL) face a decision during the game. On fourth down, should the team punt the ball or go for a first down? To aid in the decision-making process, statisticians at California State University, Northridge, developed a regression model for predicting the number of points scored (y) by a team that has a first down with a given number of yards (x) from the opposing goal line (Chance, Winter 2009). One of the models fit to data collected on five NFL teams from a recent season was the simple linear regression model, $E(y) = \beta_0 + \beta_1 x$. The regression yielded the following results: .
a. Give a practical interpretation of the coefficient of determination, $r^2$.
b. Compute the value of the coefficient of correlation, r, from the value of $r^2$. Is the value of r positive or negative? Why?
TRAPS 9.88 Lobster fishing study. Refer to the Bulletin of Marine Science (Apr. 2010) study of teams of fishermen fishing for the red spiny lobster in Baja California Sur, Mexico, Exercise 9.63 (p. 529). Recall that simple linear regression was used to model total catch of lobsters (in kilograms) during the season as a function of average percentage of traps allocated per day to exploring areas of unknown catch (called search frequency).
a. Locate and interpret the coefficient of determination, $r^2$, on the SAS printout shown on p. 529.
b. Note that the coefficient of correlation, r, is not shown on the SAS printout. Is there information on the printout to determine whether total catch (y) is negatively linearly related to search frequency (x)? Explain.
9.89 Physical activity of obese young adults. The International Journal of Obesity (Jan. 2007) published a study of the physical activity of obese young adults. For two groups of young adults—13 obese and 15 of normal weight—researchers recorded the total number of registered movements (counts) of each young adult over a period of time. Baseline physical activity was then computed as the number of counts per minute (cpm). Four years later, physical activity measurements were taken again—called physical activity at follow-up.
a. For the 13 obese young adults, the researchers reported a correlation of between baseline and follow-up physical activity, with an associated p-value of .07. Give a practical interpretation of this correlation coefficient and p-value.
b. Refer to part a. Construct a scatterplot of the 13 data points that would yield a value of
c. For the 15 young adults of normal weight, the researchers reported a correlation of between baseline and follow-up physical activity, with an associated p-value of .66. Give a practical interpretation of this correlation coefficient and p-value.
d. Refer to part c. Construct a scatterplot of the 15 data points that would yield a value of
9.90 Salary linked to height. Are short people shortchanged when it comes to salary? According to business professors T. A. Judge (University of Florida) and D. M. Cable (University of North Carolina), tall people tend to earn more money over their career than short people earn (Journal of Applied Psychology, June 2004). Using data collected from participants in the National Longitudinal Surveys, the researchers computed the correlation between average earnings (in dollars) and height (in inches) for several occupations. The results are given in the following table.
Occupation | Correlation, r | Sample Size, n |
---|---|---|
Sales | .41 | 117 |
Managers | .35 | 455 |
Blue Collar | .32 | 349 |
Service Workers | .31 | 265 |
Professional/Technical | .30 | 453 |
Clerical | .25 | 358 |
Crafts/Forepersons | .24 | 250 |
Source: Judge, T. A., and Cable, D. M. “The effect of physical height on workplace success and income: Preliminary test of a theoretical model.” Journal of Applied Psychology, Vol. 89, No. 3, June 2004 (Table 5). Copyright © 2004 by the American Psychological Association. Reprinted with permission.
a. Interpret the value of r for people in sales occupations.
b. Compute $r^2$ for people in sales occupations. Interpret the result.
c. Give $H_0$ and $H_a$ for testing whether average earnings and height are positively correlated.
d. Compute the test statistic for testing $H_0$ and $H_a$ in part c for people in sales occupations.
e. Use the result you obtained in part d to conduct the test at State the appropriate conclusion.
f. Select another occupation and repeat parts a–e.
9.91 View of rotated objects. Perception & Psychophysics (July 1998) reported on a study of how people view three-dimensional objects projected onto a rotating two-dimensional image. Each in a sample of 25 university students viewed various depth-rotated objects (e.g., a hairbrush, a duck, and a shoe) until they recognized the object. The recognition exposure time—that is, the minimum time (in milliseconds) required for the subject to recognize the object—was recorded for each object. In addition, each subject rated the “goodness of view” of the object on a numerical scale, with lower scale values corresponding to better views. The following table gives the correlation coefficient r between recognition exposure time and goodness of view for several different rotated objects:
Object | r | t |
---|---|---|
Piano | .447 | 2.40 |
Bench | .27 | |
Motorbike | .619 | 3.78 |
Armchair | .294 | 1.47 |
Teapot | .949 | 14.50 |
a. Interpret the value of r for each object.
b. Calculate and interpret the value of $r^2$ for each object.
c. The table also includes the t-value for testing the null hypothesis of no correlation (i.e., for testing $H_0: \rho = 0$). Interpret these results using .
9.92 Eye anatomy of giraffes. Refer to the African Zoology (Oct. 2013) study of giraffe eye characteristics, Exercise 9.71 (p. 530). Recall that the researchers fit a simple linear regression equation of the form , where y represents an eye characteristic and x represents body mass (measured in kilograms).
For the eye characteristic mass (grams), the regression equation yielded . Give a practical interpretation of this result.
Refer to part a above and Exercise 9.71 part a. Find the value of the correlation coefficient, r, and interpret its value.
For the eye characteristic axis angle (degrees), the regression equation yielded . Give a practical interpretation of this result.
Refer to part c above and Exercise 9.71 part b. Find the value of the correlation coefficient, r, and interpret its value.
9.93 Do nice guys finish first or last? Refer to the Nature (Mar. 20, 2008) study of the use of punishment in cooperation games, Exercise 9.22 (p. 512). Recall that college students repeatedly played a version of the game “prisoner’s dilemma” and the researchers recorded the average payoff (y) and the number of times punishment was used (x) for each player. A negative correlation was discovered between x and y.
a. Give the null and alternative hypotheses for testing whether average payoff and punishment use are negatively correlated.
b. The test, part a, yielded a p-value of .001. Interpret this result using .
c. Does the result, part b, imply that increasing punishment causes your payoff to decrease? Explain.
NAME2 9.94 The “name game.” Refer to the Journal of Experimental Psychology—Applied (June 2000) name-retrieval study, first presented in Exercise 9.34 (p. 517). Find and interpret the values of r and $r^2$ for the simple linear regression relating the proportion of names recalled (y) and the position (order) of the student (x) during the “name game.”
BOXING2 9.95 Effect of massage on boxing. Refer to the British Journal of Sports Medicine (Apr. 2000) study of the effect of massage on boxing performance, presented in Exercise 9.70 (p. 530). Find and interpret the values of r and $r^2$ for the simple linear regression relating the blood lactate concentration and the boxer’s perceived recovery.
MINES 9.96 Child labor in diamond mines. The role of child laborers in Africa’s colonial-era diamond mines was the subject of research published in the Journal of Family History (Vol. 35, 2010). One particular mining company lured children to the mines by offering incentives for adult male laborers to relocate their families close to the diamond mine. The success of the incentive program was examined by determining the annual accompaniment rate, i.e., the percentage of wives (or sons or daughters) who accompanied their husbands (or fathers) in relocating to the mine. The accompaniment rates over the years 1939–1947 are shown in the table below.
a. Find the correlation coefficient relating the accompaniment rates for wives and sons. Interpret this value.
b. Find the correlation coefficient relating the accompaniment rates for wives and daughters. Interpret this value.
c. Find the correlation coefficient relating the accompaniment rates for sons and daughters. Interpret this value.
Year | Wives | Sons | Daughters |
---|---|---|---|
1939 | 27.2 | 2.2 | 16.9 |
1940 | 40.1 | 1.5 | 15.7 |
1941 | 35.7 | 0.3 | 12.6 |
1942 | 37.8 | 3.5 | 22.2 |
1943 | 38.0 | 5.4 | 22.0 |
1944 | 38.4 | 11.0 | 24.3 |
1945 | 38.7 | 11.9 | 17.9 |
1946 | 29.8 | 8.6 | 17.7 |
1947 | 23.8 | 7.4 | 22.2 |
Source: Cleveland, T. “Minors in name only: Child laborers on the diamond mines of the Companhia de Diamantes de Angola (Diamang), 1917–1975.” Journal of Family History, Vol. 35, No. 1, 2010 (Table 1).
CLIFFS 9.97 Plants that grow on Swiss cliffs. Refer to the Alpine Botany (Nov. 2012) study of rare plants that grow on the limestone cliffs of the Northern Swiss Jura mountains, Exercise 2.165 (p. 97). Data on altitude above sea level (meters), plant population size (number of plants growing), and molecular variance (i.e., the variance in molecular weight of the plants) for a sample of 12 limestone cliffs are reproduced in the table. Recall that the researchers are interested in whether either altitude or population size is related to molecular variance.
Cliff Number | Altitude | Population Size | Molecular Variance |
---|---|---|---|
1 | 468 | 147 | 59.8 |
2 | 589 | 209 | 24.4 |
3 | 700 | 28 | 42.2 |
4 | 664 | 177 | 59.5 |
5 | 876 | 248 | 65.8 |
6 | 909 | 53 | 17.7 |
7 | 1032 | 33 | 12.5 |
8 | 952 | 114 | 27.6 |
9 | 832 | 217 | 35.9 |
10 | 1099 | 10 | 13.3 |
11 | 982 | 8 | 3.6 |
12 | 1053 | 15 | 3.2 |
Source: Rusterholz, H., Aydin, D., and Baur, B. “Population structure and genetic diversity of relict populations of Alyssum montanum on limestone cliffs in the Northern Swiss Jura mountains.” Alpine Botany, Vol. 122, No. 2, Nov. 2012 (Tables 1 and 2).
Use simple linear regression to investigate the relationship between molecular variance (y) and altitude (x). Find and interpret the value of .
Use simple linear regression to investigate the relationship between molecular variance (y) and population size (x). Find and interpret the value of .
What are your recommendations to the researchers?
9.98 Pain tolerance study. A study published in Psychosomatic Medicine (Mar./Apr. 2001) explored the relationship between reported severity of pain and actual pain tolerance in 337 patients who suffer from chronic pain. Each patient reported his/her severity of chronic pain on a seven-point scale. To obtain a pain tolerance level, a tourniquet was applied to the arm of each patient and twisted. The maximum pain level tolerated was measured on a quantitative scale.
a. According to the researchers, “Correlational analysis revealed a small but significant inverse relationship between [actual] pain tolerance and the reported severity of chronic pain.” On the basis of this statement, is the value of r for the 337 patients positive or negative?
b. Suppose that the result reported in part a is significant at Find the approximate value of r for the sample of 337 patients.