Chapter 20

A Yes-or-No Proposition: Logistic Regression

In This Chapter

arrow Figuring out when to use logistic regression

arrow Getting a grip on the basics of logistic regression

arrow Running a logistic regression and making sense of the output

arrow Watching for things that can go wrong

arrow Estimating the sample size you need

You can use logistic regression to analyze the relationship between one or more predictor variables (the X variables) and a categorical outcome variable (the Y variable). Typical categorical outcomes include the following:

check.png Lived or died

check.png Did or didn’t rain today

check.png Did or didn’t have a stroke

check.png Responded or didn’t respond to a treatment

check.png Did or did not vote for Joe Smith (in an exit poll)

In this chapter, I explain logistic regression — when to use it, the important concepts, how to run it with software, and how to interpret the output. I also point out the pitfalls and show you how to determine the sample sizes you need.

Using Logistic Regression

remember.eps You can use logistic regression to do any (or all) of the following:

check.png Test whether the predictor and the outcome are significantly associated; for example, whether age or gender influenced a voter’s preference for a particular candidate.

check.png Overcome the limitations of the 2-x-2 cross-tab method (described in Chapter 14), which can analyze only one predictor at a time, and only a predictor that's a two-valued category, such as the presence or absence of a risk factor. With logistic regression, you can analyze any number of predictor variables, each of which can be a numeric variable or a categorical variable having two or more categories.

check.png Quantify the extent of an association between the predictor and the outcome (the amount by which a predictor influences the chance of getting the outcome); for example, how much a smoker’s chance of developing emphysema changes with each additional cigarette smoked per day.

check.png Develop a formula to predict the probability of getting the outcome from the values of the predictor variables. For example, you may want to predict the probability that a patient will benefit from a certain kind of therapy, based on the patient’s age, gender, severity of illness, and perhaps even genetic makeup.

check.png Make yes or no predictions about the outcome that take into account the consequences of false-positive and false-negative predictions. For example, you can generate a tentative cancer diagnosis from a set of observations and lab results, using a formula that balances the different consequences of a false-positive versus a false-negative diagnosis.

check.png See how one predictor influences the outcome after adjusting for the influence of other variables; for example, how the number of minutes of exercise per day influences the chance of having a heart attack, adjusting for the effects of age, gender, lipid levels, and other patient characteristics.

check.png Determine the value of a predictor that produces a certain probability of getting the outcome; for example, find the dose of a drug that produces a favorable clinical response in 80 percent of the patients treated with it (called the ED80, or 80 percent effective dose).

Understanding the Basics of Logistic Regression

In this section, I explain the concepts underlying logistic regression using a simple example involving data on mortality due to radiation exposure. This example illustrates why straight-line regression wouldn’t work and what you have to use instead.

Gathering and graphing your data

As in the other chapters in Part IV, here you see a simple real-world problem and its data, which I use throughout this chapter to illustrate what I’m talking about. This example examines exposure to gamma-ray radiation, which is deadly in large-enough doses, looking only at the short-term lethality of acute large doses, not long-term health effects such as cancer or genetic damage.

In Table 20-1, “dose” is the radiation exposure expressed in units called Roentgen Equivalent Man (REM). Looking at the “Dose” and “Outcome” columns, you can get a rough sense of how survival depends on dose. For low doses almost everyone lives, and for high doses almost everyone dies.

Table 20-1

How can you analyze this data? First, graph the data: Plot the dose received on the X axis (because it’s the predictor). Plot the outcome (0 if the person lived; 1 if he died) on the Y axis. This plotting gives you the graph in Figure 20-1a. Because the outcome variable is binary (having only the values 0 or 1), the points are restricted to two horizontal lines, making the graph difficult to interpret. You can get a better picture of the dose-lethality relationship by grouping the doses into intervals (say, every 200 REM) and plotting the fraction of people in each interval who died, as shown in Figure 20-1b. Clearly, the chance of dying increases with increasing dose.

9781118553992-fg2001.eps

Illustration by Wiley, Composition Services Graphics

Figure 20-1: Dose versus mortality, from Table 20-1. Graph A shows individual subjects; Graph B shows them grouped.

Fitting a function with an S shape to your data

warning_bomb.eps Don’t try to fit a straight line to binary-outcome data. The true dose-lethality curve is almost certainly not a straight line. For one thing, the fraction of subjects dying can never be smaller than 0 nor larger than 1, but a straight line (or a parabola or any polynomial) very happily violates those limits for very low and very high doses. That can’t be right!

Instead, you need to fit a function that has an S shape — a formula giving Y as some expression involving X that, by its very nature, can never produce a Y value outside of the range from 0 to 1, no matter how large or small X may become.

remember.eps Of the many mathematical expressions that produce S-shaped graphs, the logistic function is ideally suited to this kind of data. In its simplest form, the logistic function is written like this: Y = 1/(1 + e–X), where e is the mathematical constant 2.718 (which is what e represents throughout the rest of this chapter). Figure 20-2 shows the shape of the logistic function.

This function can be generalized (made more versatile for representing observed data) by adding two adjustable parameters (a and b) like this: Y = 1/(1 + e–(a + bX)).
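If you want to play with the generalized logistic function yourself, here's a minimal Python sketch (the function name and default parameter values are my own, not from any particular statistics package):

```python
import math

def logistic(x, a=0.0, b=1.0):
    """Generalized logistic function: Y = 1/(1 + e^-(a + bx)).

    With a = 0 and b = 1, this is the simple logistic function."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))
```

Whatever values you feed in, the result always stays strictly between 0 and 1, which is exactly the property that makes this function suitable for representing probabilities.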

Notice that the a + bX part looks just like the formula for a straight line (see Chapter 18); the rest of the logistic function is what bends that straight line into its characteristic S shape. The middle of the S (where Y = 0.5) always occurs when X = –a/b. The steepness of the curve in the middle region is determined by b, as follows:

check.png If b is positive, the logistic function is an upward-sloping S-shaped curve, like the one shown in Figure 20-2.

check.png If b is 0, the logistic function is a horizontal straight line whose Y value is equal to 1/(1 + e-a), as shown in Figure 20-3.

9781118553992-fg2002.eps

Illustration by Wiley, Composition Services Graphics

Figure 20-2: The logistic function looks like this.

9781118553992-fg2003.eps

Illustration by Wiley, Composition Services Graphics

Figure 20-3: When b is 0, the logistic function becomes a horizontal straight line.

check.png If b is negative, the curve is flipped upside down, as shown in Figure 20-4. Logistic curves don’t have to slope upward.

check.png If b is a very large number (positive or negative), the logistic curve is so steep that it looks like what mathematicians call a step function, as shown in Figure 20-5.

9781118553992-fg2004.eps

Illustration by Wiley, Composition Services Graphics

Figure 20-4: When b is negative, the logistic function slopes downward.

9781118553992-fg2005.eps

Illustration by Wiley, Composition Services Graphics

Figure 20-5: When b is very large, the logistic function becomes a “step function.”

warning_bomb.eps Because the logistic curve approaches the limits 0.0 and 1.0 for extreme values of the predictor(s), you should not use logistic regression in situations where the fraction of subjects having the outcome does not approach these two limits. Logistic regression is fine for the radiation example because no one dies from a radiation exposure of zero REMs, and everyone dies from an extremely large dose (like 10,000 REMs). But logistic regression wouldn’t be appropriate for analyzing the response of patients to a drug if very high doses of the drug don’t produce a 100% cure (or if some subjects spontaneously get better even if given no drug at all).

remember.eps Logistic regression fits the logistic model to your data by finding the values of a and b that make the logistic curve come as close as possible to all your plotted points. With this fitted model, you can then predict the probability of the outcome event (in this example, dying). See the later section Predicting probabilities with the fitted logistic formula for more details.
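If you're curious what "as close as possible" means computationally: the program finds the a and b that maximize the likelihood of the observed data. Here's a deliberately crude Python sketch that does this by gradient ascent on the log-likelihood (real statistics packages use faster Newton-type algorithms, and the toy data in the usage example is invented purely for illustration):

```python
import math

def fit_logistic(xs, ys, rate=0.05, steps=5000):
    """Crude maximum-likelihood fit of Y = 1/(1 + e^-(a + bX)).

    Gradient ascent on the log-likelihood; production software uses
    Newton-type algorithms, but the quantity being maximized is the same."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p        # dLL/da
            grad_b += (y - p) * x  # dLL/db
        a += rate * grad_a
        b += rate * grad_b
    return a, b
```

For example, fit_logistic([-2, -1, -0.5, 0.5, 1, 2], [0, 0, 1, 0, 1, 1]) returns a positive b, giving an upward-sloping S through those made-up points.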



Handling multiple predictors in your logistic model

The data in Table 20-1 has only one predictor variable, but you may have several predictors of a yes or no outcome. For example, a person’s chance of dying from radiation exposure may depend not only on the radiation dose received, but also on age, gender, weight, general health, radiation wavelength, and the amount of time over which the radiation is received. In Chapter 19, I describe how the simple straight-line regression model can be generalized to handle multiple predictors. You can generalize the simple logistic formula to handle multiple predictors in the same way.

Suppose the outcome variable Y is dependent on three predictors, called X, V, and W. Then the multivariate logistic model looks like this:

Y = 1/(1 + e–(a + bX + cV + dW)).

Logistic regression finds the best values of the parameters a, b, c, and d so that for any particular set of values for X, V, and W, you can predict Y — the probability of getting a yes outcome.
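In code, the generalization is just a longer sum inside the exponent; here's a small sketch (the function name is my own):

```python
import math

def logistic_prob(intercept, coeffs, values):
    """P(yes outcome) for any number of predictors:
    1/(1 + e^-(a + bX + cV + dW + ...))."""
    z = intercept + sum(c * v for c, v in zip(coeffs, values))
    return 1.0 / (1.0 + math.exp(-z))
```

With the one-predictor coefficients fitted from Table 20-1 (a = –4.828, b = 0.01146), logistic_prob(-4.828, [0.01146], [500]) gives about 0.71; adding more predictors just means longer coefficient and value lists.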

Running a Logistic Regression with Software

remember.eps The logistic regression theory is difficult, and the calculations are complicated (see the earlier sidebar Getting into the nitty-gritty of logistic regression for details). However, the great news is that most general statistics programs (like those in Chapter 4) can run logistic regression, and it isn’t any more difficult than running a simple straight-line or multiple linear regression (see Chapters 18 and 19). Here’s all you have to do:

1. Make sure your data set has a column for the outcome variable and that this column has only two different values.

You can code it as 1 or 0, according to whether the outcome is yes or no, or your software may let you record the data as yes or no (or as Lived or Died, or any other dichotomous classification), with the program doing the 0 or 1 recoding behind the scenes. (Check out Table 20-1 for an example.)

2. Make sure your data set has a column for each predictor variable and that these columns are in a format that your software accepts.

The predictors can be quantitative (such as age or weight; Table 20-1 uses dose amount) or categorical (like gender or treatment group), just as with ordinary least-squares regression. See Chapter 19, where I describe how to set up categorical predictor variables.

3. Tell your program which variables are the predictors and which variable is the outcome.

Depending on the software, you may do this by typing the variable names or by selecting the variables from a menu or list.

4. Tell your program that you want as many of the following outputs as it can give you:

• A summary of information about the variables

• Measures of goodness-of-fit

• A table of regression coefficients, including odds ratios and their confidence intervals

• Predicted probabilities of getting the outcome (which, ideally, the program puts into a new column that it creates in the database)

• If there’s only one predictor, a graph of predicted probabilities versus the value of the predictor (this will be a graph of the fitted logistic curve)

• A classification table of observed outcomes versus predicted outcomes

• Measures of prediction accuracy (overall accuracy, sensitivity, and specificity)

• An ROC curve

5. Press the Go button and stand back!

The computer does all the work and presents you with the answers.

Interpreting the Output of Logistic Regression

Figure 20-6 shows the kind of printed output that a typical logistic regression program may produce from the data in Table 20-1. The following sections explain the output’s different sections.

Seeing summary information about the variables

The program may provide some summary descriptive information about the variables: means and standard deviations of predictors that are numerical variables, and a count of how many subjects did or did not have the outcome event. In the “Descriptives” section of Figure 20-6, you see that 15 of the 30 subjects lived and 15 died. Some programs may also provide the mean and standard deviation of each numerical predictor variable.

9781118553992-fg2006.eps

Illustration by Wiley, Composition Services Graphics

Figure 20-6: Typical output from a logistic regression.

Assessing the adequacy of the model

The program indicates how well the fitted function represents the data (goodness-of-fit), and it may provide several such measures, most of which have an associated p value. (A p value is the probability that random fluctuations alone, in the absence of any real effect in the population, could’ve produced an observed effect at least as large as what you observed in your sample; see Chapter 3 for a refresher.) It’s easy to misinterpret these because they measure subtly different types of goodness-of-fit.

You may see the following, depending on your software:

check.png A p value associated with the drop in deviance (–2LL) between the null model (intercept-only) and the final model (with the predictor variables): (See the earlier sidebar Getting into the nitty-gritty of logistic regression for the definition of –2LL.) If this p value is less than 0.05, it indicates that adding the predictor variables to the null model significantly improves its ability to predict the outcome. In Figure 20-6, the p value for the reduction in deviance is less than 0.0001, which means that adding radiation dose to the model makes it significantly better at predicting an individual person’s chance of dying than the null model (which, in essence, always predicts a death probability equal to the observed fraction of subjects who died).

check.png A p value from the Hosmer-Lemeshow (H-L) test: If this p value is less than 0.05, your data isn’t consistent with the logistic function’s S shape. Perhaps it doesn’t approach a 100 percent response rate for large doses. (Most treatments aren’t 100 percent effective, even at large doses.) Perhaps the response rises with increasing dose up to some optimal dose and then declines with further dose increases. In Figure 20-6, the H-L p value is 0.842, which means that the data is consistent with the shape of a logistic curve.

check.png One or more pseudo–R-square values: Pseudo–R-square values indicate how much of the total variability in the outcomes is explainable by the fitted model, analogous to how R-square is interpreted in ordinary least-squares regression, as described in Chapter 19. In Figure 20-6, two such values are provided: the Cox/Snell and Nagelkerke R-square. These values (0.577 and 0.770, respectively) indicate that a majority of the variability in the outcomes is explainable by the logistic model.

check.png tip.eps Akaike's Information Criterion (AIC): AIC is related to the final model deviance, adjusted for how many predictor variables are in the model. Like deviance, AIC is a "smaller is better" number. It's very useful for choosing between different models (for example, deciding which predictors to include in a model). For an excellent description of the AIC, and its use in choosing between competing models, go to www.graphpad.com/guides/prism/6/curve-fitting, click on "Comparing fits of nonlinear models" in the lefthand menu, and then choose "How the AIC computations work."

Checking out the table of regression coefficients

remember.eps The most important output from a logistic regression program is the table of regression coefficients, which looks much like the coefficients table from ordinary straight-line or multivariate least-squares regression (see Chapters 18 and 19).

check.png Every predictor variable appears on a separate row.

check.png There’s one row for the constant (or intercept) term.

check.png The first column is almost always the fitted value of the regression coefficient.

check.png The second column is usually the standard error (SE) of the coefficient.

check.png A p value column (perhaps called Sig or Signif or Pr(>|z|)) indicates whether the coefficient is significantly different from 0.

For each predictor variable, the logistic regression should also provide the odds ratio and its 95 percent confidence interval, either as additional columns in the coefficients table or as a separate table. You can see these items in Figure 20-6.

tip.eps But don’t worry if the program doesn’t provide them: You can calculate them simply by exponentiating the corresponding coefficients and their confidence limits. The confidence limits for the coefficients are easily calculated by adding or subtracting 1.96 times the standard error from the coefficient. The formulas follow:

Odds ratio = eCoefficient

Lower 95 percent confidence limit = e(Coefficient – 1.96 × SE)

Upper 95 percent confidence limit = e(Coefficient + 1.96 × SE)
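Those formulas translate directly into code; here's a small sketch (the example coefficient in the usage note is the dose coefficient from Figure 20-6, but the standard error of 0.002 is a made-up value, just for illustration):

```python
import math

def odds_ratio_ci(coef, se, z=1.96):
    """Odds ratio and 95% confidence limits from a logistic
    regression coefficient and its standard error."""
    return (math.exp(coef),            # odds ratio
            math.exp(coef - z * se),   # lower 95% limit
            math.exp(coef + z * se))   # upper 95% limit
```

For example, odds_ratio_ci(0.01146, 0.002) gives an odds ratio of about 1.0115, with confidence limits just above and below it.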

Predicting probabilities with the fitted logistic formula

tip.eps The program may show you the fitted logistic formula. In Figure 20-6, the formula is shown as:

Prob(Death) = 1/(1 + Exp(–(–4.828 + 0.01146 * Dose)))

If the software doesn’t provide the formula, just substitute the regression coefficients from the regression table into the logistic formula.

The final model produced by the logistic regression program from the data in Table 20-1 and the resulting logistic curve are shown in Figure 20-7.

With the fitted logistic formula, you can predict the probability of having the outcome if you know the value of the predictor variable. For example, if a subject receives 500 REM of radiation, the probability of death is given by this formula: Probability of Death = 1/(1 + e–(–4.828 + 0.01146 × 500)), which equals 0.71. A person who receives 500 REM of radiation has about a 71 percent chance of dying shortly thereafter.
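You can verify that arithmetic in a couple of lines of Python:

```python
import math

a, b = -4.828, 0.01146    # fitted coefficients from Figure 20-6
dose = 500                # radiation exposure in REMs
p_death = 1.0 / (1.0 + math.exp(-(a + b * dose)))
print(round(p_death, 2))  # → 0.71
```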

You can also calculate some special points on a logistic curve, as you find out in the following sections.

9781118553992-fg2007.eps

Illustration by Wiley, Composition Services Graphics

Figure 20-7: The logistic curve that fits the dose-mortality data from Table 20-1.

warning_bomb.eps Be careful with your algebra when evaluating these formulas! The a coefficient in a logistic regression is often a negative number, and subtracting a negative number is like adding its absolute value.

Calculating effective doses on a logistic curve

When logistic regression is applied to dose-response data, the dose (X) that produces a 50 percent response (Y = 0.5) is called the median effective dose (ED50). Similarly, the X value that makes Y = 0.8 is called the 80 percent effective dose (ED80), and so on. It’s pretty easy to calculate these special dose levels from the a and b parameters of the fitted logistic model in the preceding section.

If you remember high-school algebra, you can solve the logistic formula Y = 1/(1 + e–(a + bX)) for X as a function of Y; if you don’t remember, here’s the answer:

X = (log(Y/(1 – Y)) – a)/b

where log stands for natural logarithm. Substituting 0.5 for Y in the preceding equation gives the ED50 as simply –a/b. Similarly, substituting 0.8 for Y gives the ED80 as (log(4) – a)/b, or approximately (1.39 – a)/b.

So if, for example, a drug produces a therapeutic response that’s represented by a logistic model with a = –3.45 and b = 0.0234 dL/mg, the 80 percent effective dose (ED80) would be equal to (1.39 – (–3.45))/0.0234, which works out to about 207 mg/dL.

Calculating lethal doses on a logistic curve

When death is the outcome event, the corresponding terms are median lethal dose (LD50), 80 percent lethal dose (LD80), and so on. So, for the data in Table 20-1, a = –4.83 and b = 0.0115, so –a/b = –(–4.83)/0.0115, which works out to 420 REMs. Someone who receives a 420 REMs dose of radiation has a 50-50 chance of dying shortly thereafter.
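The same formula handles ED50, ED80, LD50, and so on; here's a small sketch (the function name is my own):

```python
import math

def effective_dose(level, a, b):
    """Dose X at which the fitted logistic curve predicts
    probability `level`: X = (log(level/(1 - level)) - a)/b."""
    return (math.log(level / (1.0 - level)) - a) / b
```

With the radiation coefficients, effective_dose(0.5, -4.83, 0.0115) gives the LD50 of 420 REMs; with the drug example's coefficients, effective_dose(0.8, -3.45, 0.0234) gives the ED80 of about 207 mg/dL.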

Making yes or no predictions

A logistic model, properly fitted to a set of data, lets you calculate the predicted probability of having the outcome. But sometimes you’d rather make a yes or no prediction instead of quoting a probability. You can do this by comparing the calculated probability of getting a yes outcome to some arbitrary cut value (such as 0.5) that separates a yes prediction from a no prediction. That is, you can say, “If the predicted probability for a subject is greater than 0.5, I’ll predict yes; otherwise, I’ll predict no.”

In the following sections, I talk about yes or no predictions — what they can tell you about the predicting ability of the logistic model and how you can select the cut value that gives you the best tradeoff between wrongly predicting yes and wrongly predicting no.

Measuring accuracy, sensitivity, and specificity with classification tables

The logistic regression program provides several goodness-of-fit outputs (described earlier in this chapter), but these outputs may not be very easy to interpret. One other indicator, which is very intuitive, is the extent to which your yes or no predictions match the actual outcomes. You can cross-tabulate the predicted and observed outcomes into a fourfold classification table. Most statistical software can do all of this for you; it’s often as simple as selecting a check box to indicate that you want the program to generate a classification table based on some particular cut value. Most software assumes a cut value of 0.5 unless you tell it to use some other value. Figure 20-8 shows the classification table for the radiation example, using 0.5 as the cut value.

9781118553992-fg2008.eps

Illustration by Wiley, Composition Services Graphics

Figure 20-8: A classification table of observed versus predicted outcomes from radiation exposure, using a cut value of 0.5 predicted probability.

From the classification table, you can calculate several useful measures of the model’s predicting ability for any specified cut value. (I define and describe these measures in more detail in Chapter 14.)

check.png Overall accuracy: Predicting correctly. The upper-left and lower-right cells correspond to correct predictions. Of the 30 subjects in the data set from Table 20-1, the logistic model predicted correctly (13 + 13)/30, or about 87 percent of the time; the model would make a wrong prediction only about 13 percent of the time.

check.png Sensitivity: Predicting a yes outcome when the actual outcome is yes. The logistic model predicted 13 of the 15 observed deaths (the upper-left box of Figure 20-8), so the sensitivity is 13/15, or about 87 percent; the model would make a false-negative prediction only about 13 percent of the time.

check.png Specificity: Predicting a no outcome when the actual outcome is no. The logistic model predicted survival in 13 of the 15 observed survivors (the lower-right box of Figure 20-8), so the specificity is 13/15, or about 87 percent; the model would make a false-positive prediction only about 13 percent of the time.
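These calculations are easy to script; here's a sketch that builds the fourfold table and the three measures from observed outcomes and predicted probabilities (the probabilities in the usage example are invented so that the counts match Figure 20-8):

```python
def classification_metrics(observed, predicted_prob, cut=0.5):
    """Cross-tab yes/no predictions against observed outcomes
    and return accuracy, sensitivity, and specificity."""
    tp = fn = fp = tn = 0
    for y, p in zip(observed, predicted_prob):
        pred = 1 if p > cut else 0
        if y == 1 and pred == 1:
            tp += 1           # correctly predicted death
        elif y == 1:
            fn += 1           # false negative
        elif pred == 1:
            fp += 1           # false positive
        else:
            tn += 1           # correctly predicted survival
    total = tp + fn + fp + tn
    return {"accuracy": (tp + tn) / total,
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp)}
```

Feeding it 30 subjects arranged as 13 true positives, 2 false negatives, 2 false positives, and 13 true negatives reproduces the roughly 87 percent figures above.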

Sensitivity and specificity are especially relevant to screening tests for diseases. An ideal test would have 100 percent sensitivity and 100 percent specificity (and, therefore, 100 percent overall accuracy). But no test meets this ideal in the real world.

remember.eps By judiciously choosing the cut-point for converting a probability into a yes or no decision, you can often achieve high sensitivity or high specificity, but not both simultaneously. Depending on the test and on what happens if it produces a false-positive or false-negative result, you have to consider whether high sensitivity or high specificity is more important.

For example, consider screening tests for two different diseases: colon cancer and prostate cancer.

check.png A false-positive result from a colon cancer screening test may induce a lot of anxiety for a while, until a follow-up colonoscopy reveals that no cancer is present. But a false-negative result can give an unwarranted sense of security that may cause other symptoms to go ignored until the cancer has progressed to an incurable stage.

check.png A false-positive result from a prostate cancer screening test may result in an unnecessary prostatectomy, an operation with many serious side effects. A false-negative result can cause prostate cancer to go untreated, but in most instances (especially in older men), prostate cancer is slow growing and usually not the ultimate cause of death. (It has been said that many men die with prostate cancer, but relatively few die from it.)

Some people may say that high sensitivity is more important than high specificity for a colon cancer test, while the reverse is true for a prostate cancer test. But other people may disagree. And nobody is likely to agree on just how to best balance the conflicting goals. This isn’t an abstract or hypothetical issue — the appropriate diagnosis and treatment of prostate cancer is currently the subject of very vigorous debate centering around these very issues.

remember.eps A logistic model fitted to a set of data can yield any sensitivity (between 0 and 100 percent) and any specificity (between 0 and 100 percent), depending on what cut value you select. The trick is to pick a cut value that gives the optimal combination of sensitivity and specificity, striking the best balance between false-positive and false-negative predictions, in light of the different consequences of the two types of false predictions. To find this optimal cut value, you need to know precisely how sensitivity and specificity play against each other — that is, how they simultaneously vary with different cut values. And there’s a neat way to do exactly that, which I explain in the following section.

Rocking with ROC curves

A special kind of graph displays the sensitivity/specificity tradeoff for any fitted logistic model. It has the rather peculiar name Receiver Operating Characteristic (ROC) graph, which comes from its original use during World War II to analyze the performance characteristics of people who operated RADAR receivers. Nowadays it’s used for all kinds of things that have nothing to do with RADAR, but the original name has stuck.

remember.eps An ROC graph has a curve that shows you the complete range of sensitivity and specificity that can be achieved for any fitted logistic model, based on the selected cut value. The program generates an ROC curve by effectively trying all possible cut values between 0 and 1, calculating the predicted outcomes, cross-tabbing them against the observed outcomes, calculating sensitivity and specificity, and then graphing sensitivity versus specificity.
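That sweep is simple to reproduce; here's a minimal sketch (it assumes both yes and no outcomes are present in the data, and the function name is my own):

```python
def roc_points(observed, predicted_prob, n_cuts=101):
    """Try cut values from 0 to 1 and return (1 - specificity,
    sensitivity) pairs -- the points of the ROC curve."""
    pos = sum(observed)           # number of yes outcomes
    neg = len(observed) - pos     # number of no outcomes
    points = []
    for i in range(n_cuts):
        cut = i / (n_cuts - 1)
        tp = sum(1 for y, p in zip(observed, predicted_prob)
                 if y == 1 and p > cut)
        fp = sum(1 for y, p in zip(observed, predicted_prob)
                 if y == 0 and p > cut)
        points.append((fp / neg, tp / pos))
    return points
```

At a cut value of 0 everything is predicted yes (the upper-right corner of the graph), and at a cut value of 1 everything is predicted no (the lower-left corner).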

The ROC curve always runs from the lower-left corner of the graph (0 percent sensitivity and 100 percent specificity) to the upper-right corner (100 percent sensitivity and 0 percent specificity). Most programs also draw a diagonal straight line between the lower-left and upper-right corners (representing the formula: sensitivity = 1 – specificity) to indicate the total absence of any predicting ability at all.

Figure 20-9 shows the ROC curve for the data in Table 20-1, produced by the R statistical system. A conventional ROC graph has sensitivity (displayed either as fractions between 0 and 1 or as percentages between 0 and 100) running up the Y axis, and 1 – specificity running across the X axis. Alternatively, the specificity can run backwards (from right to left) across the X axis, as shown in Figure 20-9.

remember.eps ROC curves almost always lie in the upper-left part of the graph area, and the farther away from the diagonal line they are, the better the predictive model is. For a nearly perfect model, the ROC curve runs up along the Y axis from the lower-left corner to the upper-left corner, then along the top of the graph from the upper-left corner to the upper-right corner.

Because of how sensitivity and specificity are calculated, the graph appears as a series of steps, with more data producing more and smaller steps. For clarity, I show the cut values for predicted probability as a scale along the ROC curve itself; sadly, most statistical software doesn’t do this for you.

9781118553992-fg2009.eps

Illustration by Wiley, Composition Services Graphics

Figure 20-9: ROC curve from dose mortality data.

tip.eps The ROC curve helps you choose a cut value that gives the best tradeoff between sensitivity and specificity:

check.png To have very few false positives: Choose a higher cut value to give a high specificity. Figure 20-9 shows that by setting the cut value to 0.6, you can simultaneously achieve about 93 percent specificity and 87 percent sensitivity.

check.png To have very few false negatives: Choose a lower cut value to give higher sensitivity. Figure 20-9 shows you that if you set the cut value to 0.3, you can have almost perfect sensitivity (almost no false negatives), but your specificity will be only about 75 percent (about 25 percent false positives).

The software may optionally display the area under the ROC curve (ROC AUC), along with its standard error and a p value. This is another measure of how good the predictive model is. The diagonal line has an AUC of 0.5; the p value indicates whether the AUC is significantly greater than 0.5 (that is, whether your predictive model is better than a null model).

Heads Up: Knowing What Can Go Wrong with Logistic Regression

Logistic regression presents many of the same potential pitfalls as ordinary least-squares regression (see Chapters 18 and 19), as well as several that are specific to logistic regression. Watch out for some of the more common pitfalls, explained in the following sections.

Don’t fit a logistic function to nonlogistic data

warning_bomb.eps Don’t use logistic regression to fit data that doesn’t behave like the logistic S curve. Plot your grouped data (as shown in Figure 20-1b), and if it’s clear that the fraction of yes outcome subjects isn’t leveling off at Y = 0 or Y = 1 for very large or very small X values, then logistic regression isn’t the way to go. And pay attention to the Hosmer-Lemeshow p value (described earlier) produced by the regression software. If this value is much less than 0.05, it indicates that your data is not consistent with a logistic model. In Chapter 21, I describe a more generalized logistic model that contains other parameters for the upper and lower leveling-off values.

Watch out for collinearity and disappearing significance

All regression models with more than one predictor variable can be plagued with problems of collinearity (when two or more predictor variables are strongly correlated with each other), and logistic regression is no exception. I describe this problem, and the troubles it can cause, in Chapter 19.

Check for inadvertent reverse-coding of the outcome variable

warning_bomb.eps The outcome variable should always be 1 for a yes outcome and 0 for a no outcome (refer to Table 20-1 for an example). Some programs may let you record the outcome variable in your data file as descriptive terms like Lived and Died; then the program translates these terms to 0 and 1 behind the scenes. But the program may translate them as the opposite of what you want — it may translate Lived to 1 and Died to 0, in which case the fitted formula will predict the probability of living rather than dying. This reversal won’t affect any p values, but it will cause all odds ratios and their confidence intervals to be the reciprocals of what they would have been, because they will now refer to the odds of living rather than the odds of dying.

Don’t misinterpret odds ratios for numerical predictors

warning_bomb.eps The value of a regression coefficient depends on the units in which the corresponding predictor variable is expressed. So the coefficient of a height variable expressed in meters is 100 times larger than the coefficient of height expressed in centimeters. In logistic regression, odds ratios are obtained by exponentiating the coefficients, so switching from centimeters to meters corresponds to raising the odds ratio (and its confidence limits) to the 100th power. The odds ratio always represents the factor by which the odds of getting the outcome event increase when the predictor increases by exactly one unit of measure (whatever that unit may be).

Sometimes you may want to express the odds ratio in more convenient units than what the data was recorded in. For the example in Table 20-1, the odds ratio for dose as a predictor of death is 1.0115 per REM. This isn’t too meaningful because one REM is a very small increment of radiation. By raising 1.0115 to the 100th power (get out your calculator), you get the equivalent odds ratio of 3.1375 per 100 REMs, and you can express this as, “Every additional 100 REMs of radiation more than triples the odds of dying.”
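If you’d rather spare your calculator, the conversion is a one-liner. This sketch uses the per-REM odds ratio from the Table 20-1 example; the confidence limits shown are hypothetical:

```python
# Convert a per-REM odds ratio to a per-100-REM odds ratio by raising it
# (and its confidence limits) to the 100th power.
or_per_rem = 1.0115               # from the Table 20-1 example
ci_per_rem = (1.0045, 1.0185)     # hypothetical confidence limits

or_per_100rem = or_per_rem ** 100
ci_per_100rem = tuple(x ** 100 for x in ci_per_rem)
print(or_per_100rem)  # about 3.14 per 100 REMs
```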

Don’t misinterpret odds ratios for categorical predictors

warning_bomb.eps Categorical predictors should be coded numerically as I describe in Chapter 7. If you express categories as text, the computer may not translate them the way you want it to, and the resulting odds ratios may be the reciprocal of what you want or may be different in other ways.

tip.eps Check the software manual to see whether there’s a way to force the program to code categorical variables the way you want. As a last resort, you can create one or more new numeric variables and do the recoding yourself.
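The do-it-yourself recoding is simple. This sketch (with a hypothetical two-level treatment variable) makes Untreated the reference category, so the odds ratio refers to Treated versus Untreated:

```python
# Recode a two-level text predictor numerically yourself so the odds ratio
# compares the categories the way you want (Untreated = 0 is the reference).
treatment = ["Treated", "Untreated", "Treated", "Untreated"]
x = [1 if t == "Treated" else 0 for t in treatment]
print(x)  # [1, 0, 1, 0]
```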

Beware the complete separation problem

The complete separation problem, also called the perfect predictor problem, is a particularly nasty (and surprisingly frequent) problem that’s unique to logistic regression. As incredible as it may sound, it’s a sad fact that a logistic regression will fail if the data is too good!

warning_bomb.eps If your predictor variable completely separates the yes outcomes from the no outcomes, the maximum likelihood method will try to make the coefficient of that variable infinite (although most regression software will give up before getting quite that far). The odds ratio also wants to be infinity if the coefficient is positive, or 0 if the coefficient is negative. The standard error wants to be infinite too, so your confidence interval may have a lower bound of 0, an upper bound of infinity, or both. Also, if you’re doing multiple logistic regression, the perfect predictor problem will rear its ugly head if any of your predictor variables completely separates the outcomes.

Check out the problem shown in Figure 20-10. The regression is trying to make the curve come as close as possible to all the data points. Usually it has to strike a compromise, because (especially in the middle part of the data) there’s a mixture of 1s and 0s. But with perfectly separated data, no compromise is necessary. As b becomes infinitely large, the logistic function morphs into a step function that touches all the data points.

tip.eps Take the time to examine your data and see whether any individual variables may be perfect predictors:

1. Pick each predictor variable, one by one.

2. Sort your data file by that variable.

3. Run down the listing looking at the values in the outcome column to see whether they are completely separated (all nos followed by all yeses, or vice versa).

The perfect predictor problem may bite you even if each variable passes this test, because it can arise if a combination of two or more variables acting together can completely separate the outcome. Unfortunately, there’s no easy way to detect this situation by sorting or graphing your data.
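The one-variable-at-a-time check in the preceding steps is easy to automate. Here’s a minimal sketch, using hypothetical dose and age predictors:

```python
def completely_separates(xs, ys):
    """True if sorting the subjects by x puts all the 0 outcomes on one side
    and all the 1 outcomes on the other (complete separation)."""
    sorted_ys = [y for _, y in sorted(zip(xs, ys))]
    return sorted_ys == sorted(sorted_ys) or sorted_ys == sorted(sorted_ys, reverse=True)

# Hypothetical data: dose perfectly separates the outcomes; age doesn't.
data = {
    "dose": [10, 50, 90, 130, 170, 210],
    "age":  [62, 35, 48, 51, 29, 70],
}
ys = [0, 0, 0, 1, 1, 1]

for name, xs in data.items():
    if completely_separates(xs, ys):
        print(name, "completely separates the outcomes!")
```

Remember the caveat above: this check can’t catch separation caused by a combination of variables acting together.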

Figure 20-10: The complete separation (or perfect predictor) problem. (Illustration by Wiley, Composition Services Graphics)

warning_bomb.eps Now for the really bad news: No really good solution to the complete separation problem exists. You may be able to add more data points — this sometimes introduces enough random variability to break the complete separation. Or you can remove the variable(s) responsible for complete separation from your model, but that’s not very satisfying: Why would you want to throw away your best predictors? Some advanced logistic software will at least come up with a finite lower confidence limit for an infinite odds ratio (or a finite upper limit for a zero odds ratio), but that’s about the best you can hope for.

Figuring Out the Sample Size You Need for Logistic Regression

Estimating the required sample size for a logistic regression (even a simple one-predictor regression) can be a pain. Specifying the desired power and alpha level is easy enough (see Chapter 3 for more about these items), and you can state the effect size of importance as an odds ratio.

remember.eps But the required sample size also depends on a couple of other things:

check.png The relative frequencies of yes and no outcomes

check.png How the predictor variable is distributed

And with multiple predictors in the model, determining sample size is even more complicated. There can be a separate effect size of importance and desired power for each predictor, and the predictors themselves may be correlated.

Some programs and web pages calculate sample size for various logistic models involving one or more predictors, which can be dichotomous or continuous. But these programs are likely to ask you for more information than you’re able to provide. You can use simulation methods if data from an earlier, similar study is available, but this is no task for the amateur. For a rigorous sample-size calculation, you may have no choice but to seek the help of a professional statistician.

tip.eps Here are two simple approaches you can use if your logistic model has only one predictor. In each case, you replace the logistic regression with another analysis that’s sort of equivalent to it, and then do a sample-size calculation based on that other kind of analysis. It’s not ideal, but it can give you an answer that’s close enough for planning purposes.

check.png If the predictor is a dichotomous category (like gender), logistic regression gives the same p value you get from analyzing a fourfold table. Therefore, you can use the sample-size calculations I describe in Chapter 13.

check.png If the predictor is a continuous numerical quantity (like age), you can pretend that the outcome variable is the predictor and age is the outcome. I know this gets the cause-and-effect relationship backwards, but if you make that conceptual flip, then you can ask whether the two different outcome groups have different mean values for the predictor. You can test that question with an unpaired Student t test, so you can use the sample-size calculations I describe in Chapter 12.
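For the continuous-predictor case, the unpaired t test sample size can be approximated with the standard normal-approximation formula n = 2 × ((z for α/2 + z for power) / d)², where d is the difference in group means divided by the within-group standard deviation. Here’s a minimal sketch using the usual z values of 1.96 (α = 0.05, two-sided) and 0.84 (80 percent power):

```python
import math

def n_per_group(d, z_alpha_2=1.96, z_beta=0.84):
    """Approximate per-group sample size for an unpaired t test,
    for alpha = 0.05 (two-sided) and 80% power by default.
    d is the effect size: (difference in means) / (within-group SD)."""
    return math.ceil(2 * ((z_alpha_2 + z_beta) / d) ** 2)

# e.g., the two outcome groups differ in mean age by half a standard deviation:
print(n_per_group(0.5))  # 63 per group
```

This is only an approximation to the exact t-test calculation, but it’s close enough for planning purposes.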
