Chapter 21

Other Useful Kinds of Regression

In This Chapter

• Using Poisson regression to analyze counts and event rates

• Getting a grip on nonlinear regression

• Smoothing data without making any assumptions

This chapter covers some other kinds of regression you’re likely to encounter in biostatistical work. They’re not quite as ubiquitous as the types described in Chapters 18 through 20 (straight-line regression, multiple regression, and logistic regression), but you should be aware of them, so I collect them here. I don’t go into a lot of detail, but I describe what they are, when you may want to use them, how to run them and interpret the output, and what special situations you should watch out for.

Note: I don’t cover survival regression in this chapter, even though it’s one of the most important kinds of regression analysis in biostatistics. It has its own chapter (Chapter 24), in Part V of this book, which deals with the analysis of survival data.

Analyzing Counts and Rates with Poisson Regression

Statisticians often have to analyze outcomes consisting of the number of occurrences of an event over some interval of time, like the number of fatal highway accidents in a city in a year. If the occurrences seem to be getting more numerous as time goes on, you may want to perform a regression analysis to see whether the upward trend is statistically significant and to estimate the annual rate of increase (with its standard error and confidence interval).

Although they’re often analyzed by ordinary least-squares regression, event counts don’t really meet the least-squares assumptions — they aren’t well approximated as continuous, normally distributed data unless the counts are very large. Also, their variability is neither constant nor proportional to the counts themselves (the standard deviation of a Poisson-distributed count equals the square root of its mean). So event-count outcomes aren’t best analyzed by straight-line or multiple least-squares regression.

Because independent random events (like highway accidents) should follow a Poisson distribution (see Chapter 25), they should be analyzed by a kind of regression designed for Poisson outcomes. And there is indeed just that kind of specialized regression, called (you never would’ve guessed this) Poisson regression. The following sections provide the basics on the model used for this regression, how to run and interpret its output, and a few extra tasks it can handle.

Introducing the generalized linear model

Most statistical software packages don’t offer anything explicitly called Poisson regression; instead, they have a more general regression technique called the generalized linear model (GLM).

Warning: Don’t confuse the generalized linear model with the very similarly named general linear model that I describe in Chapter 12. It’s unfortunate that these two names are almost identical, because they describe two very different things. The general linear model used to be abbreviated GLM before the generalized linear model came on the scene in the 1970s, but the former is now usually abbreviated as LM in a (not very successful) attempt to avoid confusion.

GLM is similar to LM only in that the predictor variables usually appear in the model as the familiar linear combination:

c0 + c1x1 + c2x2 + c3x3 + …

where the x’s are the predictor variables, and the c’s are the regression coefficients (with c0 being called a constant term, or intercept).

But GLM extends the capabilities of LM in two important ways:

• With LM, the linear combination becomes the predicted value of the outcome, but with GLM, you can specify a transformation (called a link function) that turns the linear combination into the predicted value. As I note in Chapter 20, logistic regression applies exactly this kind of transformation: The linear combination (call it V) goes through the logistic function 1/(1 + e^(−V)) to convert it into a predicted probability of having the outcome event, and you can use GLM to perform logistic regression.

• With LM, the outcome is assumed to be a continuous, normally distributed variable, but with GLM, the outcome can be continuous or integer, obeying any of several different distribution functions, like normal, exponential, binomial, or Poisson. (For example, as I explain in Chapter 20, logistic regression is used when the outcome is a binomial variable indicating whether the event did or did not occur.)

Remember: GLM is the Swiss army knife of regression — it can do ordinary least-squares regression, logistic regression, Poisson regression, and a whole lot more. Most of the advanced statistical software systems (SAS, SPSS, R) offer GLM so that they don’t have to program a lot of other specialized regressions. So if your software package doesn’t offer logistic or Poisson regression, check to see whether it offers GLM; if so, then you’re all set. (Flip to Chapter 4 for an introduction to statistical software.)
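To give you a feel for this in R, here’s a minimal sketch (the outcome y and predictor x are hypothetical placeholders; only the family specification changes from one kind of regression to the next):

glm(y ~ x, family = gaussian)   # ordinary least-squares regression
glm(y ~ x, family = binomial)   # logistic regression
glm(y ~ x, family = poisson)    # Poisson regression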

Running a Poisson regression

Suppose you want to study the number of fatal highway accidents per year in a city. Table 21-1 shows some made-up fatal-accident data over the course of 12 years. Figure 21-1 shows a graph of this data, created using the R statistical software package.

Table 21-1 Yearly Data on Fatal Highway Accidents in One City

Calendar Year    Fatal Accidents
2000             10
2001             12
2002             15
2003             8
2004             8
2005             15
2006             4
2007             20
2008             20
2009             17
2010             29
2011             28


Figure 21-1: Yearly data on fatal highway accidents in one city.

Remember: Running a Poisson regression is similar in many (but not all) ways to running the other common kinds of regression:

1. Assemble the data for Poisson regression just as you would for any kind of regression. For this example, you have a row of data for each year, a column containing the outcome values (the number of accidents each year), and a column for the predictor (the year).

2. Tell the software what the predictor and outcome variables are, either by name or by picking from a list of variables, depending on the software.

3. Tell the software what kind of regression you want it to carry out by specifying the family of the dependent variable’s distribution and the link function.

Step 3 is not obvious, and you have to consult your software’s manual. The R program, for instance, has everything specified in a single instruction, which looks like this:

glm(formula = Accidents ~ Year, family = poisson(link = "identity"))

This tells R everything it needs to know: The outcome is the variable called Accidents, the predictor is the variable called Year, and the outcome variable follows the Poisson distribution. The link = "identity" tells R that you want to fit a model in which the true event rate rises in a linear fashion; that is, it increases by a constant amount each year.

4. Press the Go button and get ready!

The computer does all the work and presents you with the answers.
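Putting the four steps together, here’s a minimal sketch of the complete R session for this example, assuming you type in the Table 21-1 data directly (the object name fit.linear is my own choice, not anything R requires):

Year <- 2000:2011
Accidents <- c(10, 12, 15, 8, 8, 15, 4, 20, 20, 17, 29, 28)
fit.linear <- glm(Accidents ~ Year, family = poisson(link = "identity"))
summary(fit.linear)   # produces output like Figure 21-2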

Interpreting the Poisson regression output

After you follow the steps for running a Poisson regression in the preceding section, the program produces output like that shown in Figure 21-2.


Figure 21-2: Poisson regression output from R’s generalized linear model function (glm).

This output has the same general structure as the output from other kinds of regression. The most important parts of it are the following:

• In the Coefficients table, the estimated regression coefficient for Year is 1.3298, indicating that the annual number of fatal accidents is increasing by about 1.33 accidents per year.

• The standard error (SE) of 0.3169 (or about 0.32) indicates the precision of the estimated rate increase per year. From the SE, using the rules given in Chapter 10, the 95 percent confidence interval (CI) around the estimated annual increase is approximately 1.3298 ± 1.96 × 0.3169, which gives a 95 percent CI of 0.71 to 1.95. (The sketch after this list shows one way to get these numbers out of R directly.)

• The z value column contains the value of the regression coefficient divided by its standard error. It’s used to calculate the p value that appears in the last column of the table.

• The last column, Pr(>|z|), is the p value for the significance of the increasing trend. The Year variable has a p value of 2.71e-05, which is scientific notation (see Chapter 2) for 0.0000271, so the apparent increase in rate over the 12 years is highly significant. (Over the years, the value of 0.05 has become accepted as a reasonable criterion for declaring significance; don’t declare significance unless the p value is less than 0.05. See Chapter 3 for an introduction to p values.)

• AIC (Akaike’s Information Criterion) indicates how well this model fits the data. The value of 81.72 isn’t useful by itself, but it’s very useful when choosing between two alternative models, as I explain later in this chapter.
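If you want to pull these numbers out of R yourself, here’s a short sketch, reusing the fit.linear object from the earlier sketch:

coef(summary(fit.linear))     # estimates, SEs, z values, and p values
confint.default(fit.linear)   # Wald 95 percent CIs: estimate +/- 1.96 x SE
AIC(fit.linear)               # Akaike's Information Criterion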

The R program can also provide the predicted annual event rate for each year, from which you can add a “trend line” to the scatter graph, indicating how you think the true event rate might vary with time (see Figure 21-3).
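Here’s a sketch of one way to draw that trend line, again reusing the fitted model object:

plot(Year, Accidents)
lines(Year, fitted(fit.linear))   # predicted annual event rate for each year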


Figure 21-3: Poisson regression, assuming a constant increase in accident rate per year.

Discovering other things that Poisson regression can do

The following sections describe some additional things you can do with your data, using R’s GLM function to perform Poisson regression.

Examining nonlinear trends

The straight line in Figure 21-3 doesn’t seem to reflect the fact that the accident rate remained low for the first few years and then started to climb rapidly after 2006. Perhaps the true trend isn’t a straight line (where the rate increases by the same amount each year); it may be an exponential rise (where the rate increases by a certain percentage each year). You can have R fit an exponential rise by changing the link option from “identity” to “log” in the statement that invokes the Poisson regression:

glm(formula = Accidents ~ Year, family = poisson(link = "log"))

This produces the output shown in Figure 21-4 and graphed in Figure 21-5.


Figure 21-4: Output from an exponential trend Poisson regression.


Figure 21-5: Linear and exponential trends fitted to accident data.

Because of the “log” link used in this regression run, the coefficients are related to the logarithm of the event rate. So the relative rate of increase per year is obtained by taking the antilog of the regression coefficient for Year. This is done by raising e (the mathematical constant 2.718…) to the power of the regression coefficient for Year: e^0.10414, which is about 1.11. So according to an exponential increase model, the annual accident rate increases by a factor of 1.11 (that is, an 11 percent relative increase) each year. The dashed-line curve in Figure 21-5 shows this exponential trend, which appears to accommodate the steeper rate of increase seen after 2006.
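You can let R do the antilog arithmetic for you. A one-line sketch (after fitting the log-link model and storing it under a name of your choosing, say fit.expon):

fit.expon <- glm(Accidents ~ Year, family = poisson(link = "log"))
exp(coef(fit.expon)["Year"])   # about 1.11, an 11 percent relative increase per year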

Comparing alternative models

The AIC value for the exponential trend model is 78.476, which is about 3.2 units lower than for the linear trend model (AIC = 81.72). Smaller AIC values indicate better fit, so the true trend is more likely to be exponential than linear. But you can’t conclude that the model with the lower AIC is really better unless its AIC is at least about six units lower, so in this example you can’t say for sure whether the trend is linear or exponential (or something else). But the exponential curve does seem to predict the high accident rates seen in 2010 and 2011 better than the linear trend model.
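In R, you can line the two AIC values up side by side; a quick sketch, reusing the two fitted model objects from the earlier sketches:

AIC(fit.linear, fit.expon)   # smaller AIC indicates better fit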

Working with unequal observation intervals

In this fatal accident example, each of the 12 data points represents the accidents observed during a one-year interval. But in other applications (like analyzing the frequency of ER visits after a treatment for emphysema, where there is one data point per person), the width of the observation interval may vary from one person to another. GLM lets you provide, for each data point, an interval width along with the event count. For arcane reasons, many statistical programs refer to this interval-width variable as the offset.
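Here’s a hedged sketch of what that might look like in R for the ER-visit example; the variable names (Visits, Treatment, FollowUpYears) are made up for illustration. Note that with a log link, the offset you supply is the logarithm of each person’s observation interval:

glm(Visits ~ Treatment + offset(log(FollowUpYears)),
    family = poisson(link = "log"))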

Accommodating clustered events

The Poisson distribution applies when the observed events are all independent occurrences. But this assumption isn’t met if events occur in clusters. So, for example, if you count individual highway fatalities instead of fatal highway accidents, the Poisson distribution doesn’t apply, because one fatal accident may kill several people.

Tip: The standard deviation (SD) of a Poisson distribution is equal to the square root of the mean of the distribution. But if clustering is present, the SD of the data is larger than the square root of the mean, a situation called overdispersion. GLM can accommodate overdispersion; you just tell R to make the distribution family quasipoisson rather than poisson, like this:

glm(formula = Accidents ~ Year, family = quasipoisson(link = "log"))

Anything Goes with Nonlinear Regression

No treatment of regression would be complete without discussing the most general (and potentially the most challenging) of all kinds of least-squares regression — general nonlinear least-squares regression, or nonlinear curve-fitting. In the following sections, I explain how nonlinear regression is different from other kinds, I describe how to run and interpret a nonlinear regression (with the help of a drug research example), and I show you some tips involving equivalent functions.

Distinguishing nonlinear regression from other kinds

In the kinds of regression I describe earlier in this chapter and in Chapters 18 through 20, the predictor variables and regression coefficients always appear in the model as a linear combination: c0 + c1x1 + c2x2 + c3x3 + … + cnxn. But in nonlinear regression, the coefficients no longer have to appear paired up with predictor variables (like c2x2); they now have a more independent existence and can appear on their own, anywhere in the formula. In fact, the name coefficient, which implies a number that’s multiplied by a variable, is too limited to describe how they can be used in nonlinear regression; instead, they’re referred to as parameters.

Remember: The formula for a nonlinear regression model may be any algebraic expression, involving sums and differences, products and ratios, and powers and roots, together with any combination of logarithmic, exponential, trigonometric, and other advanced mathematical functions (see Chapter 2 for an introduction to these items). The formula can contain any number of predictor variables and any number of parameters (and these formulas often contain many more parameters than predictor variables).

Table 21-2 shows a few of the many nonlinear functions you may encounter in biological research.

Table 21-2 Some Examples of Nonlinear Functions

Function: Conc = C0 × e^(−k × Time)
Description: Concentration versus time: Exponential (first-order) decline from C0 at time 0 to zero at infinite time

Function: Conc = C0 × e^(−k × Time) + C
Description: Concentration versus time: Exponential (first-order) decline from C0 at time 0 to some non-zero leveling-off value

Function: Y = a + (b − a)/(1 + e^(−(X − c)/d))
Description: An S-shaped, logistic-type curve with arbitrary leveling-off values (not necessarily 0 and 100 percent)

Function: Y = aX^b
Description: A power curve in which the power isn’t necessarily a whole number

Function: Rate = A × e^(−Ea/(R × Temp))
Description: Arrhenius equation for temperature dependence of rate constants and many other physical/chemical properties

Remember: Unlike other types of regression that I describe earlier in this chapter and book, in which you give the statistical software your data, click the Go button, and wait for the answer to appear, full-blown nonlinear regression is never such a no-brainer. First, you have to decide what function you want to fit to your data (out of the infinite number of possible functions you could dream up). Sometimes the general form of the function is determined (or at least suggested) by a scientific theory (this would be called a theoretical or mechanistic function and is more common in the physical sciences than in the life sciences). Other times, you may simply pick a function that has the right general shape (this would be called an empirical function). You also have to provide starting guesses for each of the parameters appearing in the function. The regression software tries to refine these guesses, using an iterative process that may or may not converge to an answer, depending on the complexity of the function you’re fitting and how close your initial guesses are to the truth.

In addition to these special problems, all the other complications of multivariate regression (like collinearity; see Chapter 19) can appear in nonlinear problems, often in a more subtle and hard-to-deal-with way.

Checking out an example from drug research

One common nonlinear regression problem arises in drug development research. As soon as scientists start testing a promising new compound, they want to determine some of its basic pharmacokinetic (PK) properties; that is, to learn how the drug is absorbed, distributed, modified, and eliminated by the body. Some clinical trials are designed specifically to characterize the pharmacokinetics of the drug accurately and in great detail, but even the earliest Phase I trials (see Chapter 6) usually try to get at least some rudimentary PK data as a secondary objective of the trial.

Raw PK data often consists of the concentration of the drug in the blood at various times after administering a dose of the drug. Consider a simple experiment, in which 10,000 micrograms (μg) of a new drug is given as a single bolus (a rapid injection into a vein). Blood samples are drawn at pre-determined times after dosing and are analyzed for the drug. Hypothetical data from one subject is shown in Table 21-3 and graphed in Figure 21-6. The drug concentration in the blood is expressed in units of micrograms per deciliter (μg/dL); a deciliter is one-tenth of a liter.

Table 21-3 Blood Drug Concentration versus Time

Time after Dosing (in Hours)    Drug Concentration in Blood (μg/dL)
0.25                            57.4
0.5                             54.0
1                               44.8
1.5                             52.7
2                               43.6
3                               40.1
4                               27.9
6                               20.6
8                               15.0
12                              10.0


Figure 21-6: The concentration of an intravenous drug declines as it is eliminated from the body.

Several basic PK parameters (maximum concentration, time of maximum concentration, area under the curve) are usually calculated directly from the concentration-versus-time data, without having to fit any curve to the points. But two important parameters are usually obtained from a regression analysis:

• The volume of distribution (Vd): The effective volume of fluid or tissue through which the drug distributes. This effective volume could be equal to the blood volume but could be greater if the drug also spreads through fatty tissue or other parts of the body. If you know how much drug you infused (Dose), and you know the plasma concentration at the moment of infusion (C0), before any of the drug had been eliminated, you can calculate the volume of distribution as Vd = Dose/C0. But you can’t directly measure C0 — by the time the drug has distributed evenly through the bloodstream, some of it has already been eliminated from the body. So C0 has to be estimated by extrapolating the measured concentrations backward in time to the moment of infusion (Time = 0).

• The elimination half-life (λ): The time it takes for half of the drug in the body to be eliminated.

Pharmacokinetic theory is pretty well developed, and it predicts that (under some reasonable assumptions), the drug concentration (Conc) in the blood following a bolus infusion should vary with time (Time) according to the equation:

Conc = C0 × e^(−ke × Time)

where ke is the elimination rate constant. ke is related to the elimination half-life (λ) according to the formula: λ = 0.693/ke, where 0.693 is the natural logarithm of 2. So if you can fit the preceding equation to your Conc-versus-Time data in Table 21-3, you can get C0, from which you can calculate Vd, and you can get ke, from which you can calculate λ.

The preceding equation is nonlinear in the parameters (ke appears in the exponent). In the old days, before nonlinear regression software became widely available, people would shoehorn this nonlinear regression problem into a straight-line regression program by working with the logarithms of the concentrations. But that approach has several problems, one of which is that it can’t be generalized to handle more complicated equations that often arise.

Running a nonlinear regression

Remember: Nonlinear curve-fitting is supported by many modern statistics packages, like SPSS, SAS, GraphPad Prism, and R (see Chapter 4). You can also set up the calculations in Excel, although it's not particularly easy. Finally, the web page http://StatPages.info/nonlin.html can fit any function you can write, involving up to eight independent variables and up to eight parameters. Here I describe how to do nonlinear regression in R:

1. Provide the concentration and time data.

R can read data files in various formats (Excel, Access, text files, and so on), or you can directly assign values to variables, using statements like the following (which come from Table 21-3):

Time = c(0.25, 0.5, 1, 1.5, 2, 3, 4, 6, 8, 12)

Conc = c(57.4, 54.0, 44.8, 52.7, 43.6, 40.1, 27.9, 20.6, 15.0, 10.0)

In the two preceding equations, c is a built-in R function that creates an array (see Chapter 2) from a list of numbers.

2. Specify the equation to be fitted to the data, using the algebraic syntax your software requires.

I write the equation this way (using R’s algebraic syntax): Conc ~ C0 * exp(- ke * Time)

3. Let the software know that C0 and ke are parameters to be fitted, and provide initial guesses for these values.

Nonlinear curve-fitting is a complicated task that works by iteration — you give it some rough guesses, and it refines them into closer estimates to the truth, repeating this process until it arrives at the best (least-squares) solution.

Coming up with starting guesses can be tricky for some nonlinear regression problems; it’s more of an art than a science. Sometimes, if the parameters have physiological meaning, you may be able to make a guess based on known physiology or past experience, but sometimes it just has to be trial and error. You can graph your observed data in Excel, superimpose a curve calculated from the function for various parameter guesses that you type in, and play around with the parameters until the curve is at least in the ballpark of the observed data.

In this example, C0 is the concentration you expect at the moment of dosing (at t = 0). From Figure 21-6, it looks like the concentration starts out around 50, so you can use 50 as an initial guess for C0. The ke parameter affects how quickly the concentration decreases with time. Figure 21-6 indicates that the concentration seems to decrease by half about every few hours, so λ should be somewhere around 4 hours. Because λ = 0.693/ke, a little algebra gives ke = 0.693/λ, or 0.693/4, so you may try 0.2 as a starting guess for ke. You tell R the starting guesses by using the syntax: start=list(C0=50, ke=0.2).
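If you’d rather do this trial-and-error in R than in Excel, here’s a minimal sketch that overlays the trial curve for these starting guesses on the data from Table 21-3 (using the Time and Conc vectors from Step 1):

plot(Time, Conc)
curve(50 * exp(-0.2 * x), add = TRUE)   # trial curve for the guesses C0 = 50, ke = 0.2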

The full R statement for doing the regression, using its built-in function nls (which stands for nonlinear least-squares) and summarizing the output is

summary(nls(Conc ~ C0 * exp(-ke * Time), start = list(C0 = 50, ke = 0.2)))

Interpreting the output

As complicated as nonlinear curve-fitting may be, the output is quite simple — very much like the output from ordinary linear regression and not any more difficult to interpret. Figure 21-7 shows the relevant part of R’s output for this example.


Figure 21-7: Results of nonlinear regression in R.

First is a restatement of the function you’re fitting. Then comes the regression table, which has a row for every adjustable parameter that appears in the function. Like every other kind of regression table, it shows the fitted value for the parameter, its standard error, and the p value (the last column) indicating whether or not that parameter was significantly different from zero. C0 is 59.5 ± 2.3 micrograms per deciliter (μg/dL), and ke is 0.163 ± 0.0164 hr⁻¹ (first-order rate constants have units of “per time”). From these values, you can calculate the PK parameters you want:

• Volume of distribution: Vd = Dose/C0 = 10,000 μg/59.5 μg/dL = 168 dL, or 16.8 liters. (This amount is several times larger than the blood volume of the average human, indicating that this drug is going into other parts of the body besides the blood.)

• Elimination half-life: t½ = 0.693/ke = 0.693/0.163 hr⁻¹, or 4.25 hours. (After 4.25 hours, only 50 percent of the original dose is left in the body; after 8.5 hours, only 25 percent of the original dose remains; and so on.)

How precise are these PK parameters? (What is their SE?) Chapter 11 describes how SEs propagate through calculations, and gives you several ways to answer this question. Using the online calculator I describe in that chapter, you can calculate that the Vd = 16.8 ± 0.65 liters, and λ = 4.25 ± 0.43 hours.

R can also easily generate the predicted value for each data point, from which you can superimpose the fitted curve onto the observed data points, as in Figure 21-8.

R also provides the residual standard error, defined as the standard deviation of the vertical distances of the observed points from the fitted curve. The value of 3.556 means that the points scatter about 3.6 μg/dL above and below the fitted curve. R can also provide Akaike’s Information Criterion (AIC), which is useful in selecting which of several possible models best fits the data.
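Here’s a sketch of how you might get these extras out of R, this time storing the fitted model under a name (pk.fit, my own choice) instead of just wrapping the nls call in summary:

pk.fit <- nls(Conc ~ C0 * exp(-ke * Time), start = list(C0 = 50, ke = 0.2))
plot(Time, Conc)
t.grid <- seq(0, 12, by = 0.1)
lines(t.grid, predict(pk.fit, newdata = data.frame(Time = t.grid)))   # fitted curve
summary(pk.fit)$sigma   # residual standard error, about 3.56
AIC(pk.fit)             # for comparing alternative models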


Figure 21-8: Nonlinear model fitted to drug concentration data.

Using equivalent functions to fit the parameters you really want

It’s inconvenient, annoying, and error-prone to have to perform calculations on the parameters you get from a nonlinear regression (like C0 and the ke rate constant) to get the parameters you really wanted (like Vd and t½), and even more so to get their standard errors. Wouldn’t it be nice if you could get Vd and t½ and their SEs directly from the nonlinear regression program? Well, in many cases you can!

With nonlinear regression, there’s usually more than one way to skin a cat. Very often you can express the formula in an equivalent form that directly involves the parameters you’re interested in. Here’s how it works for the PK example I use in the preceding sections.

Because Vd = Dose/C0, it follows (from high-school algebra) that C0 = Dose/Vd, so why not use Dose/Vd instead of C0 in the formula you’re fitting? If you do, it becomes Conc = (Dose/Vd) × e^(−ke × Time). And you can go even further than that. It turns out that a first-order exponential-decline formula can be written either as C0 × e^(−ke × Time) or as the algebraically equivalent form C0 × 2^(−Time/t½). Applying both of these substitutions, you get the equivalent model Conc = (Dose/Vd) × 2^(−Time/t½), which produces exactly the same fitted curve as the original model, but has the tremendous advantage of giving you exactly the PK parameters you want (Vd and t½), rather than other parameters (C0 and ke) that you have to do further calculations on.

You already know that Dose is 10,000 micrograms (from the original description of this example), so you can substitute this value for Dose in the formula to be fitted. You’ve already estimated t½ as 4 hours and C0 as about 50 μg/dL from looking at Figure 21-6, as I describe earlier, so you can estimate Vd as 10,000/50, which is 200 deciliters. With these guesses, the final R statement is

summary(nls(Conc ~ (10000/Vd) * 2^(-Time/tHalf), start = list(Vd = 200, tHalf = 4)))

which produces the output shown in Figure 21-9.


Figure 21-9: Nonlinear regression using the PK parameters you’re interested in.

Now you can directly see, with no further calculations required, that the volume of distribution is 168.2 ± 6.5 dL (or 16.8 ± 0.66 liters), and the elimination half-life is 4.24 ± 0.43 hours.

Smoothing Nonparametric Data with LOWESS

Sometimes you want to fit a smooth curve to a set of points that don’t seem to conform to any curve (straight line, parabola, exponential, and so forth) that you’re familiar with. You can’t use the usual linear or nonlinear regression methods if you can’t write an equation for the curve you want to fit. What you need is a kind of nonparametric regression — one that doesn’t assume any particular model (formula) for the relationship, but rather just tries to draw a smooth line through the data points.

Several kinds of nonparametric data-smoothing methods have been developed. One popular one is called LOWESS, which stands for Locally Weighted Scatterplot Smoothing. Many statistical programs, like SAS and R, can do LOWESS regression. In the following sections, I explain how to run a LOWESS analysis and adjust the amount of smoothing (or “stiffness” of the curve).

Running LOWESS

Suppose you discover a new kind of hormone that is produced in the ovaries of women. The blood levels of this hormone should vary with age, being relatively low before puberty and after menopause, and high during child-bearing age. You want to characterize and quantify this age dependence as precisely as possible.

Now suppose you acquire 200 blood samples drawn from females of all ages (from 2 to 90 years) for another research project, and after addressing all human-subjects-protection issues, you analyze these specimens for your new hormone. A graph of hormone level versus age may look like Figure 21-10.


Figure 21-10: Data that doesn’t seem to conform to any simple function.

You have quite a lot of scatter in these points, which makes it hard to see the more subtle aspects of the age dependency: At what age does the hormone level start to rise? When does it peak? Does it remain fairly constant throughout child-bearing years? When does it start to decline? Is the rate of post-menopause decline constant or does it change with advancing age?

It would be easier to answer those questions if you had a curve that represented the data without all the random fluctuations of the individual points. How would you go about fitting a curve to this data? LOWESS to the rescue!

Running LOWESS in R is quite simple; you need only to provide the program with the x and y variables, and it does the rest. If you have imported your data into R as two variables, x and y, the R instruction to run a LOWESS regression is very simple: lowess(x, y, f = 0.2). (I explain the f = 0.2 part in the following section.)

Unlike other forms of regression, LOWESS doesn’t produce a coefficients table; the only output is a table of smoothed y values, one for each data point, from which (using another R instruction) you can plot the smoothed line superimposed on the scatter graph. Figure 21-11 shows the results of running the LOWESS routine provided with the R software.
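That other instruction can be as simple as the following sketch, assuming your data is in the vectors x (age) and y (hormone level):

smoothed <- lowess(x, y, f = 0.2)
plot(x, y)
lines(smoothed)   # superimpose the smoothed curve on the scatter graph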


Figure 21-11: The fitted LOWESS curve follows the shape of the data, whatever it may be.

The smoothed curve seems to fit the data quite well, except possibly at the lowest ages. The individual data points don’t show any noticeable upward trend until age 12 or so, but the smoothed curve starts climbing right from age 3. The curve completes its rise by age 20, and then remains flat until almost age 50, when it starts declining. The rate of decline seems to be greatest between ages 50 to 65, after which it declines less rapidly. These subtleties would be very difficult to spot just by looking at the individual data points without any smoothed curve.

Adjusting the amount of smoothing

R’s LOWESS program allows you to adjust the “stiffness” of the fitted curve by specifying a smoothing fraction, f, which is a number between 0 and 1. Figure 21-12 shows what the smoothed curve looks like for three different smoothing fractions.


Figure 21-12: You can adjust the smoothness of the fitted curve.

• Setting f = 0.667 (or 2/3, which is the value R uses if you leave the f parameter out of the LOWESS statement entirely) produces a rather “stiff” curve that rises steadily between ages 2 and 40, and then declines steadily after that. It misses important features of the data, like the low pre-puberty hormone levels, the flat plateau during child-bearing years, and the slowing down of the yearly decrease above age 65. You can say that this curve shows excessive bias, systematically departing from “the truth” in various places along its length.

• Setting f = 0.1, at the other extreme, produces a very jittery curve with a lot of up-and-down wiggles that can’t possibly be real age dependencies, but reflect only random fluctuations in the data. You can say that this curve shows excessive variance, with too many random fluctuations along its length.

• Setting f = 0.2 produces a curve that’s stiff enough not to have random wiggles, yet flexible enough to show that hormone levels are fairly low until age 10, reach their peak at age 20, stay fairly level until age 50, and then decline, with the rate of decline slowing down after age 70. This curve appears to strike a good balance, with low bias and low variance.

Remember: Whenever you do LOWESS regression, you have to explore different smoothing fractions to find the sweet spot that gives the best tradeoff between bias and variance — showing the real features while smoothing out the random noise. Used properly, LOWESS regression can be helpful in gleaning the most insight from noisy data.
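One convenient way to do that exploring in R is to overlay several candidate fractions on one graph; here’s a sketch (the varying line types are just to tell the curves apart):

plot(x, y)
f.values <- c(0.1, 0.2, 0.667)
for (i in seq_along(f.values)) {
  lines(lowess(x, y, f = f.values[i]), lty = i)
}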
