Chapter 18

Getting Straight Talk on Straight-Line Regression

In This Chapter

arrow Determining when to use straight-line regression

arrow Grasping the theory

arrow Doing straight-line regression and making sense of the output

arrow Watching for things that can go wrong

arrow Estimating the sample size you need

Chapter 17 talks about regression analyses in a general way. This chapter focuses on the simplest kind of regression analysis: straight-line regression. You can visualize it as “fitting” a straight line to the points in a scatter plot from a set of data involving just two variables. Those two variables are generally referred to as X and Y.

check.png The X variable is formally called the independent variable (or the predictor or cause).

check.png The Y variable is called the dependent variable (or the outcome or effect).

tip.eps You may see straight-line regression referred to in books and articles by several different names, including linear regression, simple linear regression, linear univariate regression, and linear bivariate regression. This abundance of names can be confusing, so I always use the term straight-line regression.

Knowing When to Use Straight-Line Regression

remember.eps Straight-line regression is the way to go when all of these things are true:

check.png You’re interested in the relationship between two (and only two) numerical variables.

check.png You’ve made a scatter plot of the two variables and the data points seem to lie, more or less, along a straight line (as shown in Figure 18-1a and Figure 18-1b). You shouldn’t try to fit a straight line to data that appears to lie along a curved line (as shown in Figure 18-1c and Figure 18-1d).

check.png The data points appear to scatter randomly around the straight line over the entire range of the chart, with no extreme outliers.

Figure 18-1: Straight-line regression is appropriate for both strong and weak linear relationships (a and b), but not for nonlinear (curved-line) relationships (c and d).

In addition, one or more of these things should be true:

check.png You want to test whether there’s a significant association between the X and Y variables.

check.png You want to know the value of the slope or the intercept (or both).

check.png You want to be able to predict the value of Y for any value of X.

Understanding the Basics of Straight-Line Regression

remember.eps The formula of a straight line can be written like this: Y = a + bX. This formula breaks down this way:

check.png Y is the dependent variable (or outcome).

check.png X is the independent variable (or predictor).

check.png a is the intercept (the value of Y when X = 0).

check.png b is the slope (the amount Y changes when X increases by 1).

The best line (in the least-squares sense) through a set of data is the one that minimizes the sum of the squares (SSQ) of the residuals (the vertical distances of each point from the fitted line), as shown in Figure 18-2.

Figure 18-2: On average, a good-fitting line has smaller residuals than a bad-fitting line.

For most types of curves, finding the best-fitting curve is a very complicated mathematical problem; the straight line is one of the very few for which you can calculate the least-squares parameters from explicit formulas. If you’re interested (or if your professor says that you’re interested), here’s a general outline of how those formulas are derived.

For any set of data Xi and Yi (in which i is an index that identifies each observation in the set, as described in Chapter 2), SSQ can be calculated like this:

SSQ = Σ(Yi – (a + bXi))²

If you’re good at first-semester calculus, you can find the values of a and b that minimize SSQ by setting the partial derivatives of SSQ with respect to a and b equal to 0. If you stink at calculus, trust that this leads to these two simultaneous equations:

a(N) + b(ΣX) = (ΣY)

a(ΣX) + b(ΣX²) = (ΣXY)

where N is the number of observed data points.

These equations can be solved for a and b:

a = (ΣY × ΣX² – ΣX × ΣXY) / (N × ΣX² – (ΣX)²)

b = (N × ΣXY – ΣX × ΣY) / (N × ΣX² – (ΣX)²)

tip.eps See Chapter 2 if you don’t feel comfortable reading the mathematical notations or expressions in this section.
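
If you’d like to see these formulas in action, here’s a tiny sketch in R (the same program whose output appears later in this chapter). The x and y values are just made-up numbers for illustration, and the last line uses R’s built-in lm function only to confirm that the explicit formulas give the same answer:

x <- c(1, 2, 3, 4, 5)                   # made-up predictor values
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)         # made-up outcome values
N <- length(x)
b <- (N * sum(x * y) - sum(x) * sum(y)) / (N * sum(x^2) - sum(x)^2)  # slope
a <- (sum(y) - b * sum(x)) / N                                       # intercept
c(intercept = a, slope = b)
coef(lm(y ~ x))                         # should match the two values above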

Running a Straight-Line Regression

warning_bomb.eps Never try to do regression calculations by hand (or on a calculator). You’ll go crazy trying to evaluate all those summations and other calculations, and you’ll almost certainly make a mistake somewhere in your calculations.

Fortunately, every major statistical software package (and most minor ones) can do straight-line regression. Excel has built-in functions for the slope and intercept of the least-squares straight line. You can find straight-line regression web pages (several are listed at StatPages.info), and you can download apps to do this task on a smartphone or tablet. (See Chapter 4 for an introduction to statistical software.)

In the following sections, I list the basic steps for running a straight-line regression, complete with an example.

Taking a few basic steps

remember.eps The exact steps you take to run a straight-line regression depend on what software you’re using, but here’s the general approach:

1. Get your data into the proper form.

Usually, the data consists of two columns of numbers, one representing the independent variable and the other representing the dependent variable.

2. Tell the software which variable is the independent variable and which one is the dependent variable.

Depending on the software, you may type the variable names or pick them from a menu or list in your file.

3. If the software offers output options, tell it that you want these results:

• Graphs of observed and calculated values

• Summaries and graphs of the residuals

• Regression table

• Goodness-of-fit measures

4. Press the Go button (or whatever it takes to start the calculations).

You should get your answers in the blink of an eye.
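
If you happen to be using R, for example, those steps boil down to just a few lines. This is only a sketch; the file name mydata.csv and the column names BP and Weight are placeholders for whatever your own data file uses:

mydata <- read.csv("mydata.csv")        # Step 1: two columns of numbers
fit <- lm(BP ~ Weight, data = mydata)   # Step 2: dependent ~ independent
summary(fit)                            # Step 3: regression table, residual summary, goodness-of-fit
plot(fit)                               # Step 3: graphs of the residuals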

Walking through an example

To see how to run and interpret the output of a simple straight-line regression, I use the following example throughout the rest of this chapter.

Consider how blood pressure (BP) is related to body weight. It may be reasonable to suspect that people who weigh more have higher BP. If you test this hypothesis on people and find that there really is an association between weight and BP, you may want to quantify that relationship. Maybe you want to say that every extra kilogram of weight tends to be associated with a certain amount of increased BP. The following sections take you through the steps of gathering data, creating a scatter plot, and interpreting the results.

Gathering the data

Suppose that you get a group of 20 representative adults from some population. Say you stand outside the college bookstore and recruit students as they pass by. You weigh them and measure their BP. To keep your study simple, you consider just the systolic blood pressure. Table 18-1 shows some actual weight and BP data from 20 people. Weight is recorded in kilograms (kg), and BP is recorded in the strange-sounding units of millimeters of mercury (mmHg). For clarity, I omit that rather cumbersome notation when a sentence reads better without it and when I’m obviously talking about a BP value.

Table 18-1 Weight and Blood Pressure

Subject   Body Weight (kg)   Systolic BP (mmHg)
001       74.4               109
002       85.1               114
003       78.3               94
004       77.2               109
005       63.8               104
006       77.9               132
007       78.9               127
008       60.9               98
009       75.6               126
010       74.5               126
011       82.2               116
012       99.8               121
013       78.0               111
014       71.8               116
015       90.2               115
016       105.4              133
017       100.4              128
018       80.9               128
019       81.8               105
020       109.0              127

Creating a scatter plot

It’s usually hard to spot patterns and trends in a table like Table 18-1, but you get a clearer picture of what’s happening if you make a scatter plot of the 20 subjects, with weight (the independent variable) on the X axis and systolic BP (the dependent variable) on the Y axis. See Figure 18-3.

Figure 18-3: Blood pressure versus body weight.
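
If you want to follow along in R, you can enter the Table 18-1 data as two vectors and draw a scatter plot much like Figure 18-3:

Weight <- c(74.4, 85.1, 78.3, 77.2, 63.8, 77.9, 78.9, 60.9, 75.6, 74.5,
            82.2, 99.8, 78.0, 71.8, 90.2, 105.4, 100.4, 80.9, 81.8, 109.0)
BP <- c(109, 114, 94, 109, 104, 132, 127, 98, 126, 126,
        116, 121, 111, 116, 115, 133, 128, 128, 105, 127)
plot(Weight, BP, xlab = "Body Weight (kg)", ylab = "Systolic BP (mmHg)")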

Examining the results

In Figure 18-3, you can see a possible pattern. There seems to be a tendency for the following:

check.png Low-weight people have low BP (represented by the points near the lower-left part of the graph).

check.png Higher-weight people have higher BP (represented by the points near the upper-right part of the graph).

There aren’t any really heavy people with really low BP; the lower-right part of the graph is pretty empty. But the agreement isn’t completely convincing. Several people in the 70- to 80-kilogram range have BPs over 125.

remember.eps A correlation analysis (described in Chapter 17) will tell you how strong the association is and let you decide whether or not it could be due solely to random fluctuations. A regression analysis will, in addition, give you a mathematical formula that expresses the relationship between the two variables (weight and BP, in this example).

Interpreting the Output of Straight-Line Regression

In the following sections, I take you through the printed and graphical output of a typical straight-line regression run. Its looks will vary depending on your software (this output was generated by the R statistical software), but you should be able to find the following parts of the output:

check.png A simple statement of what you asked the program to do

check.png A summary of the residuals, including graphs that display the residuals and help you assess whether they’re normally distributed

check.png The regression table

check.png Measures of goodness-of-fit of the line to the data

Seeing what you told the program to do

In Figure 18-4, the first two lines produced by the statistical software simply restate what you asked the program to do: fit the simple formula BP ~ Weight to your observed BP and weight values.

Figure 18-4: Typical regression output looks like this.

tip.eps The tilde in an expression like Y ~ X is a widely used shorthand way of saying that you’re fitting a model in which Y depends only on X. Read a tilde aloud as depends only on or is predicted by or is a function of. So in Figure 18-4, the tilde means you’re fitting a model in which BP depends only on weight.

The actual equation of the straight line is BP = a + b × weight, but the a (intercept) and b (slope) parameters have been left out of the model shown in Figure 18-4 for the sake of conciseness. This shorthand is particularly useful in Chapter 19, which deals with formulas that have lots of independent variables.
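
Using the Weight and BP vectors entered earlier, a single line of R fits this model, and summary produces output laid out like Figure 18-4 (the residual summary, the coefficients table, the residual standard error, R-squared, and the F statistic):

fit <- lm(BP ~ Weight)   # fit the straight-line model BP ~ Weight
summary(fit)             # print output like that shown in Figure 18-4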

Looking at residuals

Most regression software gives several measures of how the data points scatter above and below the fitted line; the “residuals” section in Figure 18-4 summarizes that scatter for this example.

remember.eps The residual for a point is the vertical distance of that point from the fitted line. It’s calculated as Residual = Y – (a + b × X), where a and b are the intercept and slope of the fitted straight line. The residuals for the sample data are shown in Figure 18-5.

Figure 18-5: Scattergram of BP versus weight, with the fitted straight line and the residuals of each point from the line.

Summary statistics for the residuals

If you read about summarizing data in Chapter 8, you know the distribution of a set of numbers is often summarized by quoting the mean, standard deviation, median, minimum, maximum, and quartiles. That’s exactly what you find in the “residuals” section of your software’s output. Here’s what you see in Figure 18-4:

check.png The Min and Max values are the two most extreme residuals (the two points that lie farthest below and above the line). One data point lies about 21 mmHg below the line, and one lies about 17 mmHg above it. (The sign of a residual is positive or negative, depending on whether the point lies above or below the fitted line, respectively.)

check.png The first and third quartiles (denoted 1Q and 3Q, respectively) tell you that about a quarter of the points (that is, 5 of the 20 points) lie more than 4.7 mmHg below the fitted line, about a quarter lie more than 6.5 mmHg above it, and the remaining half of the points lie between those two values.

check.png The Median value of –3.4 tells you that half of the residuals (that is, the residuals of 10 of the 20 points) are less than –3.4 and half are greater than –3.4 mmHg.

Note: The mean residual isn’t included in these summary numbers because the mean of the residuals is always exactly 0 for any kind of regression that includes an intercept term.

remember.eps The residual standard error, often called the root-mean-square (RMS) error in regression output, is a measure of how tightly or loosely the points scatter above or below the fitted line. You can think of it as the standard deviation (SD) of the residuals, although it’s computed in a slightly different way from the usual SD of a set of numbers: RMS uses N – 2 instead of N – 1 in the denominator of the SD formula. The R program shows the RMS value near the bottom of the output, but you can think of it as another summary statistic for residuals.

For this data, the residuals have a standard deviation of about 9.8 mmHg.
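
Continuing with the fit object from the preceding snippet, here’s how R gives you the residuals, their summary statistics, and the residual standard error (note the N – 2 in the denominator):

res <- residuals(fit)        # one residual per data point
summary(res)                 # Min, 1Q, Median, 3Q, Max (the mean is essentially 0)
N <- length(res)
sqrt(sum(res^2) / (N - 2))   # residual standard error, about 9.8 mmHg here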

Graphs of the residuals

Most regression programs will produce several graphs of the residuals if you ask them to. You can use these graphs to assess whether the data meets the criteria for doing a least-squares straight-line regression. Figure 18-6 shows two of the more common types of residual graphs, commonly called “residuals versus fitted” and “normal Q-Q” graphs.

A residuals versus fitted graph has the values of the residuals (observed Y minus predicted Y) plotted along the Y axis and the predicted Y values from the fitted straight line plotted along the X axis. A normal Q-Q graph shows the standardized residuals (residuals divided by the RMS value) along the Y axis and theoretical quantiles along the X axis. Theoretical quantiles are what you’d expect the standardized residuals to be if they were exactly normally distributed.
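
In R, the same fit object produces both of these graphs directly:

plot(fit, which = 1)   # residuals versus fitted values
plot(fit, which = 2)   # normal Q-Q plot of the standardized residuals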

remember.eps Together, the two kinds of graphs shown in Figure 18-6 give some insight into whether your data conforms to assumptions for straight-line regression:

check.png Your data must lie randomly above and below the line across the whole range of data.

check.png The average amount of scatter must be fairly constant across the whole range of data.

check.png The residuals should be approximately normally distributed.

Figure 18-6: These two kinds of graphs help determine whether your data meets the requirements for linear regression.

You’ll need some experience with residual graphs (maybe 10 or 20 years’ worth) before you can interpret them confidently, so don’t feel too discouraged if you can’t tell at a glance whether your data complies with the requirements for straight-line regression. Here’s how I interpret them (though other statisticians may disagree with me):

check.png The residuals versus fitted chart in Figure 18-6 indicates that points seem to lie equally above and below the fitted line, and that’s true whether you’re looking at the left, middle, or right part of the graph.

check.png Figure 18-6 also indicates that most of the points lie within ±10 mmHg of the line. But several of the larger residuals appear where the predicted BP is around 115 mmHg. This seems a little suspicious, so I should probably look at my raw data and see whether there’s something unusual about those subjects.

check.png If the residuals are normally distributed, then in the normal Q-Q chart in Figure 18-6, the points should lie close to the dotted diagonal line and shouldn’t display any overall curved shape. These points seem to follow the dotted line pretty well, so I’m not concerned about lack of normality in the residuals.

Making your way through the regression table

remember.eps The table of regression coefficients is arguably the most important part of the output for any kind of regression; it’s probably where you look first and where you concentrate most of your attention. Nearly all straight-line statistics programs produce a table of regression coefficients that looks much like the one in Figure 18-4.

For straight-line regression, the coefficients table has two rows that correspond to the two parameters of the straight line:

check.png The intercept row: You may find this row labeled at the far left by the term Intercept or Constant.

check.png The slope row: This row may be labeled as Slope, but it’s more frequently labeled with the name of the independent variable (in this case, Wgt).

The table usually has about four columns (more or less, depending on the software). The names of the columns vary from one software package to another. The columns are discussed in the following sections.

The values of the coefficients (the intercept and the slope)

tip.eps The first column usually shows the values of the slope and intercept of the fitted straight line. The heading of this column in the table might be Coefficient, Estimate, or perhaps (more cryptically) the single letter B or C (in uppercase or lowercase), depending on the software.

The intercept is the predicted value of Y when X is equal to 0 and is expressed in the same units of measurement as the Y variable. The slope is the amount the predicted value of Y changes when X increases by exactly one unit of measurement and is expressed in units equal to the units of Y divided by the units of X.

In the example shown in Figure 18-4, the estimated value of the intercept is 76.8602 mmHg, and the estimated value of the slope is 0.4871 mmHg/kilogram.

check.png The intercept value of 76.9 mmHg means that a person who weighs 0 kilograms should have a BP of about 77 mmHg. But nobody weighs 0 kilograms! The intercept in this example (and in many straight-line relations in biology) has no physiological meaning at all, because 0 kilograms is totally outside the range of possible human weights.

check.png The slope value of 0.4871 mmHg/kilogram does have a real-world meaning. It means that every additional 1 kilogram of weight is associated with a 0.4871 mmHg increase in systolic BP. Or, playing around with the decimal points, every additional 10 kilograms of body weight is associated with almost a 5 mmHg BP increase.

The standard errors of the coefficients

The second column in the regression table usually has the standard errors of the estimated parameters (sometimes abbreviated SE, Std. Err., or something similar). I use SE for standard error in the rest of this chapter.

remember.eps Because your observed data always have random fluctuations, anything you calculate from your observed data also has random fluctuations (whether it’s a simple average or something more complicated, like a regression coefficient). The SE tells you how precisely you were able to estimate the parameter from your data, which is very important if you plan to use the value of the slope (or the intercept) in some subsequent calculation. (See Chapter 11 to read how random fluctuations in numbers propagate through any calculations you may perform with those numbers.)

Keep these things in mind about SE:

check.png Standard errors always have the same units as the coefficients themselves. In the example shown in Figure 18-4, the SE of the intercept has units of mmHg, and the SE of the slope has units of mmHg/kg.

check.png Round off the estimated values. Quoting a lot of meaningless digits when you report your results is pointless. In this example, the SE of the intercept is about 14.7, so you can say that the estimate of the intercept in this regression is about 77 ± 15 mmHg. In the same way, you can say that the estimated slope is 0.49 ± 0.18 mmHg/kg.

When quoting regression coefficients in professional publications, you may include the SE like this: “The predicted increase in systolic blood pressure with weight (±1 SE) was 0.49 ± 0.18 mmHg/kg.”

If you have the SE, you can easily calculate a confidence interval (CI) around the estimated parameter. (See Chapter 10 for more info.) To a very good approximation, the 95 percent confidence limits, which mark the low and high ends of the confidence interval around a coefficient, are given by these expressions:

Lower 95% CL = Coefficient – 2 × SE

Upper 95% CL = Coefficient + 2 × SE

More informally, these are written as 95% CI = coefficient ± 2 × SE.

So the 95-percent CI around the slope is calculated as 0.49 ± 2 × 0.176, which works out to 0.49 ± 0.35, or 0.14 to 0.84 mmHg/kg. If you submit a manuscript for publication, you may express the precision of the results in terms of CIs instead of SEs, like this: “The predicted increase in systolic blood pressure as a function of body weight was 0.49 mmHg/kg (95% CI: 0.14 – 0.84).” Of course, you should always follow the guidelines specified by the journal you’re writing for.

technicalstuff.eps To be more precise, multiply the SE by the critical two-sided Student t value for the confidence level you want and the appropriate number of degrees of freedom (which, for N data points, is equal to N – 2). You can estimate critical t values from this book's online Cheat Sheet at www.dummies.com/cheatsheet/biostatistics or get them from more extensive tables, statistical software, or web pages. For a 95 percent CI and a set of 30 data points (28 degrees of freedom), the critical t value is 2.0484. The approximate value of 2 is fine for most practical work; you probably won't have to look up critical t values unless you have fewer than 20 data points.
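
In R, the confint function does this calculation for you, using the exact critical t value rather than the approximate value of 2; the last two lines rebuild the slope’s interval by hand from the coefficient, its SE, and qt:

confint(fit, level = 0.95)                         # CIs for the intercept and the slope
se <- coef(summary(fit))["Weight", "Std. Error"]   # SE of the slope
coef(fit)["Weight"] + c(-1, 1) * qt(0.975, df = 20 - 2) * se   # same interval by hand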

The Student t value

In many statistical programs, the third column in a regression table shows the ratio of the coefficient divided by its standard error. This column can go by different names, but it’s most commonly referred to as t or t value. You can think of this column as an intermediate quantity in the calculation of what you’re really interested in: the p value for the coefficient.

warning_bomb.eps The t values appearing in the regression table are not the “critical t values” that you use to construct confidence intervals, as described earlier.

The p value

The next (and usually last) column of the regression table contains the p value, which indicates whether the regression coefficient is significantly different from 0. Depending on your software, this column may be called p value, p, Signif, or Pr(>|t|), as shown in Figure 18-4.

Note: In Chapter 19, I explain how to interpret this peculiar notation, but just keep in mind that it’s only another way of designating the column that holds the p values.

In Figure 18-4, the p value for the intercept is shown as 5.49e-05, which is equal to 0.0000549 (see the description of scientific notation in Chapter 2). This value is much less than 0.05, so the intercept is significantly different from zero. But recall that in this example the intercept doesn’t have any real-world importance (it’s the expected BP for a person who doesn’t weigh anything), so you probably don’t care whether it’s different from zero or not.

But the p value for the slope is very important — if it’s less than 0.05, it means that the slope of the fitted straight line is significantly different from zero. In turn, that means that the X and Y variables are significantly associated with each other. A p value greater than 0.05 indicates that the true slope may be equal to zero, so there’s no conclusive evidence for a significant association between X and Y. In Figure 18-4, the p value for the slope is 0.0127, which means that the slope is significantly different from zero, and this tells you that body weight is significantly associated with systolic BP.

tip.eps If you want to test for a significant correlation between two variables, simply look at the p value for the slope of the least-squares straight line. If it’s less than 0.05, then the X and Y variables are significantly correlated. The p value for the significance of the slope in a straight-line regression is always exactly the same as the p value for the correlation test of whether r is significantly different from zero, as described in Chapter 17.

Wrapping up with measures of goodness-of-fit

The last few lines of output in Figure 18-4 contain several indicators of how well a straight line represents the data. The following sections describe this part of the output.

The correlation coefficient

Most straight-line regression programs provide the classic Pearson r correlation coefficient between X and Y (see Chapter 17 for details). But the program may give you the correlation coefficient in a roundabout way: as r2 rather than r itself. The software I use for this example shows r2 on the line that begins with “Multiple R-squared: 0.2984.” Just get out your calculator and take the square root of 0.2984 to get 0.546 for Pearson r.

remember.eps R squared can never be negative (it’s a square, after all), but the correlation coefficient can be positive or negative, depending on whether the fitted line slopes upward or downward. If the fitted line slopes downward, make your r value negative.
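
In R, you can recover Pearson r from the fitted model this way (attaching the sign of the slope); the cor function gives the same number directly:

sign(coef(fit)["Weight"]) * sqrt(summary(fit)$r.squared)   # r with the correct sign
cor(Weight, BP)                                            # same value, about 0.546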

Why did the program give you R squared instead of r in the first place? It’s because R squared is a useful number in its own right. It’s sometimes called the coefficient of determination, and it tells you what percent of the total variability in the Y variable can be explained by the fitted line.

check.png An R-squared value of 1 means that the points lie exactly on the fitted line, with no scatter at all.

check.png An R-squared value of 0 means that your data points are all over the place, with no tendency at all for the X and Y variables to be associated.

check.png An R-squared value of 0.3 (as in this example) means that 30 percent of the variance in the dependent variable is explainable by the straight-line model.

Note: I talk about the adjusted R-squared value in Chapter 19 when I explain multiple regression. For now, you can just ignore it.

The F statistic

The last line of the sample output addresses these questions: Is the straight-line model any good at all? How much better is the straight-line model (which has an intercept and a predictor) than the null model?

remember.eps The null model is a model that contains only a single parameter representing a constant term (such as an intercept), with no predictor variables at all.

If the p value associated with the F statistic is less than 0.05, then adding the predictor variable to the model makes it significantly better at predicting BPs.

For this example, the p value is 0.013, indicating that knowing a person’s weight makes you significantly better at predicting that person’s BP than not knowing the weight (and therefore having to quote the same overall mean BP value from your data [117 mmHg] as your guess every time).
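
If you want to see this comparison explicitly in R, you can fit the null (intercept-only) model yourself and compare it to the straight-line fit; the F statistic and p value match the last line of the output in Figure 18-4:

null_fit <- lm(BP ~ 1)   # null model: just an overall mean BP
anova(null_fit, fit)     # F test comparing the null model to the straight-line model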

Scientific fortune-telling with the prediction formula

As I describe in Chapter 17, one reason for doing regression analysis is to develop a prediction formula (or, if you want to sound fancy, a predictive model) that lets you guess the value of the dependent variable if you know the values of the independent variables.

tip.eps Some statistics programs show the actual equation of the best-fitting straight line. If yours doesn’t, don’t worry. Just substitute the coefficients of the intercept and slope for a and b in the straight-line equation: Y = a + bX.

With the output shown in Figure 18-4, where the intercept (a) is 76.9 and the slope (b) is 0.487, you can write the equation of the fitted straight line like this: BP = 76.9 + 0.487 Weight.

Then you can use this equation to predict someone’s BP if you know his weight. So, if a person weighs 100 kilograms, you can guess that that person’s BP may be about 76.9 + 100 × 0.487, which is 76.9 + 48.7, or about 125.6 mmHg. Your guess won’t be exactly on the nose, but it will probably be better than if you didn’t know that BP increases with increasing weight.

How far off may your guess be? The residual standard error provides a yardstick of your guessing prowess. As I explain in the earlier section Summary statistics for the residuals, the residual standard error indicates how much the individual points tend to scatter above and below the fitted line. For the BP example, this number is ±9.8, so you can expect your prediction to be within about ±10 mmHg most of the time (about 68 percent of the time if the residuals are truly normally distributed with a standard deviation of ±10) and within ±20 mmHg about 95 percent of the time.
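
In R, the predict function does this arithmetic for you; asking for a prediction interval also gives lower and upper limits that reflect the residual scatter just described. The 100-kilogram weight here is simply an example value:

predict(fit, newdata = data.frame(Weight = 100), interval = "prediction")
# returns the predicted BP (about 126 mmHg) plus lower and upper prediction limits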

Recognizing What Can Go Wrong with Straight-Line Regression

Fitting a straight line to a set of data is a pretty simple task for any piece of software, but you still have to be careful. A computer program happily does whatever you tell it to, even if it’s something you shouldn’t do.

warning_bomb.eps People frequently slip up on the following things when doing straight-line regression:

check.png Fitting a straight line to curved data: Examining the pattern of residuals in a residuals versus fitted chart (like the one in Figure 18-6) can alert you to this problem.

check.png Ignoring outliers in the data: Outliers can mess up all the classical statistical analyses, and regression is no exception. One or two data points that are way off the main trend of the points will tend to drag the fitted line away from the other points. That’s because the strength with which each point tugs at the line is proportional to the square of its distance from the line.

remember.eps Always look at a scatter plot of your data to make sure outliers aren’t present. Examine the residuals to make sure they seem to be distributed normally above and below the fitted line.

Figuring Out the Sample Size You Need

To figure out how many data points you need for a regression analysis, first ask yourself why you’re doing the regression in the first place.

check.png Do you want to show that the two variables are significantly associated? Then you want to calculate the sample size required to achieve a certain statistical power for the significance test (see Chapter 3 for an introduction to statistical power).

check.png Do you want to estimate the value of the slope (or intercept) to within a certain margin of error? Then you want to calculate the sample size required to achieve a certain precision in your estimate.

Testing the significance of a slope is exactly equivalent to testing the significance of a correlation coefficient, so the sample-size calculations are also the same for the two types of tests. If you haven’t already, check out Chapter 17, which has simple formulas for the number of subjects you need to test for any specified degree of correlation.

If you’re doing the regression to estimate the value of a regression coefficient — for example, the slope of the straight line — then the calculations get more complicated. The precision of the slope depends on several things:

check.png The number of data points: More data points give you greater precision. Standard errors vary inversely as the square root of the sample size. Or, the required sample size varies inversely as the square of the desired SE. So, if you quadruple the sample size, you cut the SE in half. This is a very important, and very generally applicable, principle.

check.png Tightness of the fit of the observed points to the line: The closer the data points hug the line, the more precisely you can estimate the regression coefficients. The effect is directly proportional — twice as much Y-scatter of the points produces twice as large a SE in the coefficients.

check.png How the data points are distributed across the range of the X variable: This effect is hard to quantify, but in general, having the data points spread out evenly over the entire range of X produces more precision than having most of them clustered near the middle of the range.

How, then, do you intelligently design a study to acquire data for a linear regression where you’re mainly interested in estimating a regression coefficient to within a certain precision? One practical approach is to first conduct a small pilot study of, say, 20 subjects and look at the SE of the regression coefficient. If you’re really lucky, the SE may be as small as you wanted, or even smaller — then you’re all done!

tip.eps But the SE probably isn’t small enough (unless you’re a lot luckier than I’ve ever been). That’s when you reach for the square-root law. Follow these steps to get the total sample size you need to get the precision you want:

1. Divide the SE that you got from your pilot run by the SE you want your full study to achieve.

2. Square the ratio.

3. Multiply the square of the ratio by the sample size of your pilot study.

Say you want to estimate the slope to a precision (standard error) of ±5. If a pilot study of 20 subjects gives you a SE of ±8.4 units, then the ratio is 8.4/5 (or 1.68). Squaring this ratio gives you 2.82, which tells you that to get an SE of 5, you need 2.82 × 20, or about 56 subjects. And of course, because you’ve already acquired the first 20 subjects for your pilot run — you took my advice, right? — you need only another 36 subjects to have a total of 56.
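
Here’s that calculation as a couple of lines of R, using the numbers from this example; swap in your own pilot SE and target SE:

n_pilot <- 20; se_pilot <- 8.4; se_target <- 5
round(n_pilot * (se_pilot / se_target)^2)   # about 56 subjects in total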

remember.eps This estimation is only approximate. But at least you have a ballpark idea of how big a sample you need to achieve the desired precision.
