Chapter 24

Survival Regression

In This Chapter

arrow Knowing when to use survival regression

arrow Describing the concepts behind survival regression

arrow Running and interpreting the outcome of survival regression

arrow Peeking at prognosis curves

arrow Estimating sample size for survival regression

Survival regression is one of the most commonly used techniques in biostatistics. It overcomes the limitations of the log-rank test (see Chapter 23) and lets you analyze how survival time is influenced by one or more predictors (the X variables), which can be categorical or numerical (see Chapter 7). In this chapter, I introduce survival regression: when to use it, its basic concepts, running it and interpreting the output, building prognosis curves, and figuring out the sample size you need.

Note: Because time-to-event data is so often applied to survival, where the event is death, I use the terms death and survival time in this chapter, but everything I say applies also to analyzing times to the first occurrence of any event, like death, stroke, hospitalization, response to treatment, and recurrence of illness.

Knowing When to Use Survival Regression

In Chapter 22, I point out the special problems that come up when you can’t follow a subject long enough to observe a death (called censoring of data). In that chapter, I explain how to summarize survival data with life tables and the Kaplan-Meier method, and how to graph time-to-event data as survival curves. In Chapter 23, I describe the log-rank test, which you can use to compare survival among a small number of groups — for example, drug versus placebo or four stages of cancer.

But the log-rank test has its limitations:

check.png It doesn’t handle numerical predictors well. It compares survival among a small number of categories, but you may want to know how age (for example) affects survival. To use the log-rank test, you have to express the age as age-group categories (like 0–20, 21–40, 41–60, and so on) and compare survival among these categories. Lumping ages into broad groups throws away information, so the test is less efficient at detecting gradual trends across the whole age range.

check.png It doesn’t let you analyze the simultaneous effect of several predictors. If you try to create subgroups of subjects for each distinct combination of categories for several predictors, you could have dozens or even hundreds of groups, most of which would have very few subjects. Suppose, for example, you have three predictors: stage of disease (four categories), age group (eight categories), and mode of treatment (five categories). There are a total of 4 × 8 × 5, or 160, possible groups — one for each distinct combination of levels for the three predictors. If you have 250 subjects, most groups have only one or two subjects in them, and many groups are empty.

remember.eps Use survival regression when the outcome (the Y variable) is a time-to-event variable, like survival time; this regression lets you do any (or all) of the following:

check.png Determine whether there is a significant association between survival and one or more other variables

check.png Quantify the extent to which a variable influences survival, including testing whether survival is different between groups

check.png Adjust for the effects of confounding variables

check.png Generate a predicted survival curve (a prognosis curve) that is customized for any particular set of values of the predictor variables

Explaining the Concepts behind Survival Regression

Note: My explanation of survival regression has a little math in it, but nothing beyond high school algebra. For generality, I describe multiple survival regression (more than one predictor), but everything I say also applies when you have only one predictor variable.

Most kinds of regression require you to write a formula to fit to your data. The formula is easiest to understand and work with when the predictors appear in the function as a linear combination in which each predictor variable is multiplied by a coefficient, and these terms are all added together (perhaps with another coefficient, called an intercept, thrown in), like this: y = c₀ + c₁x₁ + c₂x₂ + c₃x₃. This linear combination can also have terms with higher powers (like squares or cubes) of the predictor variables, and it can have interaction terms (products of two or more predictors).

warning_bomb.eps Survival regression takes the linear combination and uses it to predict survival. But survival data presents some special challenges:

check.png Censoring: Censoring happens when the event doesn’t occur during the time you follow the subject. You need special methods (such as life tables, the Kaplan-Meier method, and the log-rank test; see Chapters 22 and 23) to deal with this problem.

check.png Survival curve shapes: In some disciplines, such as industrial quality control, the times to certain kinds of events (like the failure of mechanical or electronic components) do tend to follow certain distribution functions, like the Weibull distribution (see Chapter 25), pretty well. These disciplines often use a parametric form of survival regression, which assumes that you can represent the survival curves by algebraic formulas. Early biological applications of survival regression also used parametric models. But biological data tends to produce nonparametric survival curves whose shapes can’t be represented by any simple formulas.

Researchers wanted a hybrid, semi-parametric kind of survival regression: one that was partly nonparametric (didn’t assume any mathematical formula for the shape of the overall survival curve) and partly parametric (assumed that the predictors influence the shape of that curve according to a mathematical relationship). Fortunately, in 1972, a statistician named David Cox came up with just such a method, called proportional hazards (PH) regression. His original paper is one of the most widely cited publications in the life sciences, and PH regression is often simply called Cox regression. In the following sections, I list the steps for Cox PH regression and explain hazard ratios.

The steps of Cox PH regression

You can understand Cox PH regression in terms of several conceptual steps, which statistical software (like the programs in Chapter 4) carries out in an integrated way during the regression:

1. Figure out the overall shape of the survival curve by the Kaplan-Meier method.

2. Figure out how the predictor variables bend this curve upward or downward (how the predictors affect survival).

3. Determine the values of the regression coefficients that make the predicted survival times best fit your observed data.

Figuring out the baseline

Your software may define the baseline survival function in one of two ways:

check.png The survival curve of an average subject: One whose value of each predictor is equal to the group average value for that variable. The average-subject baseline is easy to understand — it’s very much like the overall survival curve you get from a Kaplan-Meier calculation by using all the available subjects.

check.png The survival curve of a hypothetical zero subject: One whose value of each predictor is equal to 0. Some mathematicians prefer to use the zero-subject baseline because it makes some of their formulas simpler. But the zero-subject baseline corresponds to a hypothetical subject who can’t possibly exist in the real world. Have you ever seen a person whose age is 0, weight is 0, or cholesterol level is 0? Neither have I. The survival curve for such an impossible person is so far away from reality that it usually doesn’t even look like a survival curve.

tip.eps Luckily, the way your software defines its baseline function doesn’t affect regression coefficients, standard errors, hazard ratios, confidence intervals, p values, or goodness-of-fit measures, so you don’t have to worry about it. But you should be aware of the two alternative definitions if you plan to generate prognosis curves, because the formulas to generate them are slightly different for the two different kinds of baseline function.

Bending the baseline

remember.eps Now for the tricky part. How do you bend (flex) this baseline curve to express how survival may increase or decrease for different predictor values? Because survival curves always start at 1 (100 percent) at time 0, the bending process must leave this special point where it is. And the bending process must also leave a survival value of 0 unchanged. One very simple mathematical operation — raising a number to a power — can do the job: It leaves 1 at 1 and 0 at 0, but smoothly raises or lowers all the values between 0 and 1.

You can see how this plays out when you look at a simple baseline function: a straight line. (No actual biological survival curve would ever be exactly a straight line, but this line makes for a nice, simple example.) Look at Figure 24-1a, which is simply a graph of the equation y = 1 – x.

Look what happens when you raise this straight line to various powers, which I refer to as h and show in Figure 24-1b:

check.png Squaring (h = 2) the y value for every point on the line always makes the values smaller (for example, 0.8^2 is 0.64), because the y values are always less than 1.

check.png Taking the square root (h = 0.5) of the y value of every point on the line makes the y values larger (for example, the square root of 0.25 is 0.5).

check.png Both 1^2 and 1^0.5 remain 1, and 0^2 and 0^0.5 both remain 0, so those two ends of the line don’t change.


Figure 24-1: Bending a straight line into different shapes by raising each point on the line to some power: h.
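
If you want to see this bending for yourself, here's a minimal R sketch (none of these variable names come from any package; they're just for illustration) that recreates the idea behind Figure 24-1 by raising the line y = 1 – x to the powers h = 2 and h = 0.5:

x <- seq(0, 1, by = 0.01)      # x values between 0 and 1
baseline <- 1 - x              # the straight "baseline" line of Figure 24-1a
squared <- baseline^2          # h = 2: pulls every point (except the ends) down
rooted <- baseline^0.5         # h = 0.5: pushes every point (except the ends) up
plot(x, baseline, type = "l", ylim = c(0, 1), xlab = "x", ylab = "y")
lines(x, squared, lty = 2)     # the lower curve in Figure 24-1b
lines(x, rooted, lty = 3)      # the upper curve in Figure 24-1b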

Does the same trick work for a survival curve that doesn’t follow any particular algebraic formula? Yes, it does; look at Figure 24-2.

check.png Figure 24-2a shows a typical survival curve. It’s not defined by any algebraic formula; it exists simply as a table of values obtained by a life-table or Kaplan-Meier calculation.

check.png Figure 24-2b shows how the baseline survival curve is flexed by raising every baseline survival value to a power. You get the lower curve by squaring (h = 2) every baseline survival value; you get the upper curve by taking the square root (h = 0.5) of every baseline survival value. Notice that the two flexed curves keep all the distinctive zigs and zags of the baseline curve; every step occurs at the same time value as it occurs in the baseline curve.

• The lower curve represents a group of people who had a poorer survival outcome than those making up the baseline group. In other words, at any instant in time, they were somewhat more likely to die (had a greater hazard rate) than a baseline person at that same moment.

• The upper curve represents subjects who had better survival (had a lower hazard rate) than a baseline person at any given moment.


Figure 24-2: Raising to a power works for survival curves, too.

Because of the mathematical relationship between hazard (chance of dying at any instant in time) and survival (chance of surviving up to some point in time), it turns out that raising the survival curve to the h power is exactly equivalent to multiplying the hazard curve by h. (The survival curve equals e raised to the negative of the cumulative hazard, so raising survival to the power h multiplies that cumulative hazard, and therefore the hazard itself, by h.) Because every point in the hazard curve is being multiplied by the same amount — by h — raising a survival curve to a power is referred to as a proportional hazards transformation.

remember.eps But what should the value of h be? The h value varies from one person to another. Keep in mind that the baseline curve describes the survival of a perfectly average person, but no individual is completely average. You can think of every subject as having her very own personalized survival curve, based on her very own h value, that provides the best estimate of that subject’s chance of survival over time.

Seeing how predictor variables influence h

The final piece of the survival regression problem is to figure out how the predictor variables influence h, which influences survival. Any kind of regression finds the values of the coefficients that make the predicted values agree as much as possible with the observed values; likewise, Cox PH regression figures out the coefficients of the predictor variables that make the predicted survival curves agree as much as possible with the observed survival times of each subject.

technicalstuff.eps How does Cox PH determine these regression coefficients? The short answer is, “Don’t ask!” The longer answer is that, like all other kinds of regression, Cox PH is based on maximum likelihood estimation. You first build a big, complicated expression for the probability of one particular person dying at any point in time. This expression involves that person’s predictor values and the regression coefficients. Then you construct a bigger expression giving the likelihood of getting exactly the survival times that you got for all the subjects in your data set. And as if this isn’t already complicated enough, the expression has to deal with the complication of censored data. You then have to find the values of the regression coefficients that maximize this big likelihood expression. As with other kinds of regression, the calculations are far too difficult for any sane person to attempt by hand. Fortunately, computer software is available to do it for you.

Hazard ratios

Hazard ratios are among the most useful things you get from a Cox PH regression. Their role in survival regression is similar to the role of odds ratios in logistic regression (see Chapter 20), and they’re even calculated the same way — by exponentiating the regression coefficients:

check.png In logistic regression: Odds ratio = e^(regression coefficient)

check.png In Cox PH regression: Hazard ratio = e^(regression coefficient)

remember.eps Keep in mind that hazard is the chance of dying in any small period of time. Each predictor variable in a Cox PH regression has a hazard ratio that tells you how much the hazard increases in the relative sense (that is, by what amount it’s multiplied) when you increase the variable by exactly 1.0 unit. Therefore, a hazard ratio’s numerical value depends on the units in which the variable is expressed in your data. And for categorical predictors, the hazard ratio depends on how you code the categories.

For example, if a survival regression model in a study of emphysema subjects includes cigarettes smoked per day as a predictor of survival, and if the hazard ratio for this variable comes out equal to 1.05, then a person’s chances of dying at any instant increase by a factor of 1.05 (5 percent) for every additional cigarette smoked per day. A 5 percent increase may not seem like much, but it’s applied for every additional cigarette per day. A person who smokes one pack (20 cigarettes) per day has that 1.05 multiplication applied 20 times, which is like multiplying by 1.05^20, which equals 2.65. And a two-pack-per-day smoker’s hazard increases by a factor of 2.65 over a one-pack-per-day smoker, which means a 2.65^2 (roughly sevenfold) increase in the chances of dying at any instant, compared to a nonsmoker.

If you change the units in which you record smoking levels from cigarettes per day to packs per day (using units that are 20 times larger), then the corresponding regression coefficient is 20 times larger, and the hazard ratio is raised to the 20th power (2.65 instead of 1.05 in this example).
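
Here's that arithmetic as a tiny R sketch (the 1.05 hazard ratio is just the hypothetical value from this example):

hr_per_cigarette <- 1.05             # hazard ratio per additional cigarette per day
hr_per_pack <- hr_per_cigarette^20   # per additional pack (20 cigarettes) per day: about 2.65
hr_per_pack^2                        # two packs a day versus none: roughly 7
exp(20 * log(hr_per_cigarette))      # same as hr_per_pack, worked through the regression coefficient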

Running a Survival Regression

As with all statistical methods dealing with time-to-event data, your dependent variable is actually a pair of variables:

check.png One variable is an event-occurrence indicator that’s one of the following:

• Equal to 1 if the event was known to occur (uncensored)

• Equal to 0 if the event didn’t occur during the observation period (censored)

check.png One variable is the time-to-event, which is the time from the start of observation to either the occurrence of the event (if it did occur) or to the end of the observation (if the event wasn’t observed to occur). I describe time-to-event data in more detail in Chapter 22.

And as with all regression methods, you have one or more variables for the predictors. The rules for representing the predictor variables are the same as described in Chapter 19.

check.png For continuous numerical variables, choose units of a convenient magnitude.

check.png For categorical predictors, carefully consider how you record the data provided to the software and what the reference level is.

You may not be sure which variables in your data to include as predictors in the regression. I discuss this model-building problem in Chapter 19; the same principles apply to survival regression.

After you assemble and properly code the data, running the program is no more complicated than running it for ordinary least-squares or logistic regression. You need to specify the variables in the regression model:

1. Specify the two components of the outcome event:

• Time to event

• Censoring indicator

2. Specify the predictor variables.

Specify the reference level (see Chapter 19) for categorical predictors if the software lets you do that.

remember.eps Most software also lets you specify the kinds of output you want to see. You should always specify at least the following:

check.png Coefficients table, including hazard ratios and confidence intervals

check.png Tests of whether the hazard proportionality assumption is valid

You may also want to see some or all of the following:

check.png Summary descriptive statistics on the data, including number of censored and uncensored observations, median survival time, and mean and standard deviation for each predictor variable in the model

check.png One or more measures of goodness-of-fit for the model

check.png Baseline survival function (as a table of values and as a survival curve)

check.png Baseline hazard function values (as a table and graph)

After you specify all the input to the program, click the Start button, and let the computer do all the work.

Interpreting the Output of a Survival Regression

Suppose you have conducted a long-term survival study of 200 cancer patients who were enrolled at various stages of the disease (1 through 4) and were randomized to receive either chemotherapy or radiation therapy. Subjects were followed for up to ten years, after which the survival data was summarized by treatment and by stage of disease (Figures 24-3a and 24-3b, respectively).

It would appear, from Figure 24-3, that chemotherapy and greater stage of disease are both associated with poorer survival, but are these apparent effects significant? Proportional-hazards regression can tell you that, and more.


Figure 24-3: Kaplan-Meier survival curves, by treatment and by stage of disease.

To run a proportional-hazards regression on the data from this example, you must provide the following data to the software:

check.png The time from treatment to death or censoring (a numerical variable Time, in years).

check.png The indicator of whether the subject died or was censored (a variable Status, set to 1 if the subject died or 0 if the subject was last seen alive).

check.png The treatment group (a categorical variable Tx, coded as Chemo or Radiation). In this example, I didn’t say which treatment was the reference level (see the discussion of reference levels in Chapter 19), so R took Chemo (which came before Radiation alphabetically) as the reference level.

check.png The stage of disease at the time of treatment (a numerical variable Stage, equal to 1, 2, 3, or 4). Using a numerical variable as Stage implies the assumption that every successive increase in stage number by 1 is associated with a constant relative increase in hazard. If you don’t want to make that assumption, you must code Stage as a categorical variable, with four levels (Stage 1 through Stage 4).
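
For concreteness, here's a sketch (in R) of what the first few rows of such a data set might look like; the values are made up purely for illustration, and only the variable names and coding match this example:

mydata <- data.frame(
  Time = c(2.3, 8.1, 5.6, 10.0),                       # years from treatment to death or censoring
  Status = c(1, 0, 1, 0),                              # 1 = died, 0 = censored (last seen alive)
  Tx = c("Chemo", "Radiation", "Radiation", "Chemo"),  # treatment group
  Stage = c(3, 1, 4, 2)                                # stage of disease at time of treatment
)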

Using the R statistical software, the proportional hazards regression can be invoked with a single command:

coxph(formula = Surv(Time, Status) ~ Stage + Tx)

Figure 24-4 shows R’s results, using the data that I graph in Figure 24-3. The output from other statistical programs won’t look exactly like Figure 24-4, but you should be able to find the main components described in the following sections.
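
In practice, you also need to load the survival package and keep the fitted model in an object so you can ask it for more output later. A minimal sketch, assuming your data sits in a data frame like the hypothetical mydata shown earlier, looks like this:

library(survival)                                              # provides Surv() and coxph()
fit <- coxph(Surv(Time, Status) ~ Stage + Tx, data = mydata)
summary(fit)    # coefficients, hazard ratios, confidence intervals, and fit statistics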


Figure 24-4: Output of a Cox PH regression.

Testing the validity of the assumptions

When you’re analyzing data by PH regression, you’re assuming that your data is really consistent with the idea of flexing a baseline survival curve by raising all the points in the entire curve to the same power (shown as h in Figures 24-1b and 24-2b). You’re not allowed to twist the curve so that it goes higher than the baseline curve (h < 1) for small time values and lower than baseline (h > 1) for large time values. That would be a non-PH flexing of the curve.

One quick check to see whether a predictor is affecting your data in a non-PH way is to take the following steps:

1. Split your data into two groups, based on the predictor.

2. Plot the Kaplan-Meier survival curve for each group (see Chapter 22).

warning_bomb.eps If the two survival curves show the slanted figure-eight pattern shown in Figure 24-5, don’t try to use Cox PH regression on that data. (At least don’t include that predictor variable in the model.)


Figure 24-5: Don’t try proportional-hazards regression on this kind of data.

tip.eps Your statistical software may offer several options to test the hazard-proportionality assumption. Check your software’s documentation to see what it offers (which may include the following) and how to interpret its output.

check.png Graphs of the hazard functions versus time, which let you see the extent to which the hazards are proportional.

check.png A statistical test of the proportional-hazards assumption (a significant result indicates that the hazards aren’t proportional). R provides a function called cox.zph for this purpose; other packages may offer a comparable option.
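
Continuing the earlier R sketch (the object fit is the fitted coxph model), checking the assumption takes only a couple of lines:

zph <- cox.zph(fit)   # test of the proportional-hazards assumption, one row per predictor
print(zph)            # small p values suggest that the assumption is violated
plot(zph)             # scaled Schoenfeld residuals versus time, one panel per predictor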

Checking out the table of regression coefficients

A regression coefficients table in a survival regression looks very much like the tables produced by almost all kinds of regression: ordinary least-squares, logistic, Poisson, and so on. The survival regression table has a row for every predictor variable, usually containing the following items:

check.png The value of the regression coefficient. Not too meaningful by itself, the coefficient tells you how much the logarithm of the hazard increases when the predictor variable increases by exactly 1.0 unit. In Figure 24-4, the coefficient for Stage is 0.4522, indicating that every increase of 1 in the stage of the disease (going from 1 to 2, from 2 to 3, or from 3 to 4) increases the logarithm of the hazard by 0.4522 (more advanced stage of disease is associated with poorer survival). For a categorical predictor like treatment (Tx), there is a row in the table for each non-reference level (in this case, a line for Radiation). The coefficient for Radiation is –0.4323; the negative sign indicates that Radiation has less hazard (better survival) than Chemo.

check.png The coefficient’s standard error (SE), which is a measure of the precision of the regression coefficient. The SE of the Stage coefficient is 0.1013, so you would express the Stage coefficient as 0.45 ± 0.10.

check.png The coefficient divided by its SE (often designated as z and sometimes called the Wald statistic).

check.png The p value. If less than 0.05, it indicates that the coefficient is significantly different from 0 (that is, the corresponding predictor variable is significantly associated with survival) after adjusting for the effects of all the other variables (if any) in the model. The p value for Stage is shown as 8.09e–06, which is scientific notation for 0.00000809, indicating that Stage is very significantly associated with survival.

check.png The hazard ratio and its confidence limits, which I describe in the next section.

tip.eps You may be surprised that no intercept (or constant) row is in the table. Cox PH regression doesn’t include an intercept in the linear part of the model; the intercept is absorbed into the baseline survival function.

Homing in on hazard ratios and their confidence intervals

Hazard ratios from survival and other time-to-event data are used extensively as safety and efficacy outcomes of clinical trials, as well as in large-scale epidemiological studies. Depending on how the software formats its output, it may show the hazard ratio for each predictor in a separate column in the regression table, or it may create a separate table just for the hazard ratios and their confidence intervals.

tip.eps If the software doesn’t give hazard ratios or their confidence intervals, you can calculate them from the regression coefficients (Coef) and standard errors (SE) as follows:

check.png Hazard Ratio = e^Coef

check.png Low 95 percent CI = e^(Coef – 1.96 × SE)

check.png High 95 percent CI = e^(Coef + 1.96 × SE)
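
For example, plugging in the Stage coefficient and standard error from Figure 24-4, a few lines of R (or a spreadsheet) reproduce the hazard ratio and its confidence interval:

coef <- 0.4522               # Stage coefficient from Figure 24-4
se <- 0.1013                 # its standard error
exp(coef)                    # hazard ratio: about 1.57
exp(coef - 1.96 * se)        # lower 95 percent confidence limit: about 1.29
exp(coef + 1.96 * se)        # upper 95 percent confidence limit: about 1.92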

Hazard ratios are useful and meaningful measures of the extent to which a variable influences survival.

check.png A hazard ratio of 1 (which corresponds to a regression coefficient of 0) indicates that the variable has no effect on survival.

check.png The confidence interval around the hazard ratio estimated from your data sample indicates the range in which the true hazard ratio (of the population from which your sample was drawn) probably lies.

In Figure 24-4, the hazard ratio for Stage is e^0.4522 = 1.57, with a 95 percent confidence interval of 1.29 to 1.92 per unit of stage number, which means that every increase of 1 in the stage of the disease is associated with a 57 percent increase in hazard (multiplying by 1.57 is equivalent to a 57 percent increase). Similarly, the hazard ratio for Radiation relative to Chemo is 0.649, with a 95 percent confidence interval of 0.43 to 0.98.

tip.eps Risk factors (such as smoking relative to nonsmoking) usually have hazard ratios greater than 1. Protective factors (such as drug relative to placebo) usually have hazard ratios less than 1.

Assessing goodness-of-fit and predictive ability of the model

There are several measures of how well a regression model fits the survival data. These measures can be useful when you’re choosing among several different models:

check.png Should you include a possible predictor variable (like Age) in the model?

check.png Should you include the squares or cubes of predictor variables in the model (like Age^2 or Age^3 in addition to Age)?

check.png Should you include a term for the interaction between two predictors? (See Chapter 19 for details on interactions.)

Your software may offer one or more of the following goodness-of-fit measures:

check.png A measure of concordance, or agreement, between the observed and predicted outcomes — the extent to which subjects with higher predicted hazard values had shorter observed survival times (which is what you’d expect). Figure 24-4 shows a concordance of 0.642 for this regression.

check.png An R (or R^2) value that’s interpreted like a correlation coefficient in ordinary regression — the larger the R^2 value, the better the model fits the data. In Figure 24-4, R-square is 0.116.

check.png A likelihood ratio number (and associated p value) that compares the full model (that includes all the parameters) to a model consisting of just the overall baseline function. In Figure 24-4, the likelihood ratio p value is shown as 4.46e–06, which is scientific notation for p = 0.00000446, indicating a model that includes the Tx and Stage variables can predict survival significantly better than just the overall (baseline) survival curve.

check.png Akaike’s Information Criterion (AIC) or Bayesian Information Criterion (BIC), which are especially useful for comparing alternative models (see Chapter 19).
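
If you fit the model in R as sketched earlier, several of these measures are available directly from the fitted object (exactly which components are reported can vary with the version of the survival package):

summary(fit)$concordance   # concordance statistic (0.642 for the model in Figure 24-4) and its standard error
AIC(fit)                   # Akaike's Information Criterion, handy for comparing candidate models
BIC(fit)                   # Bayesian Information Criterion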

Focusing on baseline survival and hazard functions

The baseline survival function is represented as a table with two columns (time and predicted survival) and a row for each distinct time at which one or more events were observed.

tip.eps The baseline survival function’s table may have hundreds of rows for large data sets, so printing it isn’t often useful. But if your software can save the table as a data file, you can use it to generate a customized prognosis curve for any specific set of values for the predictor variables. (I talk about prognosis curves in the following section.)

The software may also offer a graph of the baseline survival function. If your software is using an “average-subject” baseline, this graph is useful as an indicator of the entire group of subjects’ overall survival. But if your software uses a “zero-subject” baseline, the curve is probably of no use.

The baseline hazard function may also be available as a table or as a graph, which provides insight into the course of the disease. Some diseases have a long latency period (time during which little seems to be happening) during which deaths are relatively infrequent, whereas other diseases are more aggressive, with many early deaths.

How Long Have I Got, Doc? Constructing Prognosis Curves

One of the (many) reasons for doing any kind of regression analysis is to predict outcomes from any particular set of predictor values, and survival regression is no exception: You can use the regression coefficients from a Cox PH regression, along with the baseline survival curve, to construct an expected survival (prognosis) curve for any set of predictor values.

Suppose you’re an oncologist who’s analyzing survival time (from diagnosis to death) for a group of cancer patients in which the predictors are age, tumor stage, and tumor grade at the time of diagnosis. You’d run a Cox PH regression on your data and have the program generate the baseline survival curve as a table of times and survival probabilities. Then, if you (or any other doctor) diagnose a patient with cancer, you can take that person’s age, stage, and grade, and generate an expected survival curve tailored for that particular person. (I’m not sure I’d want to see that curve if I were the patient, but at least it could be done.)

tip.eps You'll probably have to do these calculations outside of the software that you use for the survival regression, but the calculations aren't difficult, and can easily be done in a spreadsheet. The example in the following sections shows how it's done, using the rather trivial set of sample data that's preloaded into the online calculator for Cox PH regression at StatPages.info/prophaz.html. This particular example has only one predictor, but the basic idea extends to multiple predictors in a simple way, which I explain as I go.

Running the proportional-hazards regression

Figure 24-6 shows the output from the built-in example (omitting the Iteration History and Overall Model Fit sections). Pretend that this data represents survival, in years, as a function of Age (which, in this output, is referred to as Variable 1) for people just diagnosed with some particular disease.

First, consider the table in the Baseline Survivor Function section, which has two columns — time (years) and predicted survival (as a fraction) — and four rows — one for each time point at which one or more deaths were actually observed. The baseline survival curve for the dummy data starts (as survival curves always do) at 1.0 (100 percent survival) at time 0. (This row isn’t shown in the output.) The survival curve remains flat at 100 percent until year two, when it suddenly drops down to 99.79 percent, where it stays until year seven, when it drops down to 98.20 percent, and so on.


Figure 24-6: Output of Cox PH regression for generating prognostic curves.

In the Descriptive Stats section near the start of the output, the average age of the 11 subjects in the test data set is 51.1818 years, so the baseline survival curve shows the predicted survival for a person who is exactly 51.1818 years old. But suppose you want to generate a survival curve that’s customized for a person who is, say, 55 years old. According to the proportional-hazards model, you need to raise the entire baseline curve (in this case, each of the four tabulated points) to some power: h.

remember.eps In general, h depends on two things:

check.png The particular value for that subject’s predictor variables (in this example, an Age of 55)

check.png The values of the corresponding regression coefficients (in this example, 0.3770, from the regression table)

Finding h

To calculate the h value, do the following for each predictor:

1. Subtract the average value from the patient’s value.

In this example, you subtract the average age (51.18) from the patient’s age (55), giving a difference of +3.82.

2. Multiply the difference by the regression coefficient and call the product v.

In this example, you multiply 3.82 from Step 1 by the regression coefficient for Age (0.377), giving a product of 1.44 for v.

3. Calculate the v value for each predictor in the model.

4. Add all the v values; call the sum of the individual v values V.

This example has only one predictor variable (Age), so V is just the v value you calculate for age in Step 2 (1.44).

5. Calculate e^V.

This is the value of h. In this example, e^1.44 gives the value 4.221, which is the h value for a 55-year-old person.

6. Raise each of the baseline survival values to the power of h to get the survival values for the prognosis curve.

In this example, you have the following prognosis:

• For year-zero survival: 1.000^4.221 = 1.000, or 100 percent

• For two-year survival: 0.9979^4.221 = 0.9912, or 99.12 percent

• For seven-year survival: 0.9820^4.221 = 0.9262, or 92.62 percent

• For nine-year survival: 0.9525^4.221 = 0.8143, or 81.43 percent

• For ten-year survival: 0.8310^4.221 = 0.4578, or 45.78 percent

You then graph these calculated survival values to give a customized survival curve for this particular person. And that’s all there is to it!

tip.eps Here’s a short version of the procedure:

1. V = sum of [(subject value – average value) * coefficient] summed over all the predictors

2. h = e^V

3. Customized survival = (baseline survival)^h
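
Here's the whole recipe as a short R sketch, using the baseline survival table, average age, and Age coefficient from the example above (for a zero-based baseline you'd skip the subtraction of the average, as noted in the list that follows):

base_time <- c(0, 2, 7, 9, 10)                           # years at which the baseline table has entries
base_surv <- c(1.0000, 0.9979, 0.9820, 0.9525, 0.8310)   # baseline survival values from Figure 24-6
coef <- 0.377                                            # regression coefficient for Age
avg_age <- 51.1818                                       # average age of the subjects
patient_age <- 55                                        # the person you want a prognosis curve for
V <- (patient_age - avg_age) * coef                      # add one such v term per predictor if you have several
h <- exp(V)                                              # about 4.22
prognosis <- base_surv^h                                 # roughly 1.000, 0.991, 0.926, 0.814, 0.458
plot(base_time, prognosis, type = "s", ylim = c(0, 1), xlab = "Years", ylab = "Predicted survival")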

remember.eps Some points to keep in mind:

check.png If your software puts out a zero-based baseline survival function, then the only difference is that you don’t subtract the average value from the subject’s value; instead, calculate the v term as the simple product of the subject’s predictor value multiplied by the regression coefficient.

check.png If a predictor is a categorical variable, you have to code the levels as numbers. If you have a dichotomous variable like gender, you could code male = 0 and female = 1. Then if, for example, 47.2 percent of the subjects are female, the “average gender” is 0.472, and the subtraction in Step 1 is (0 – 0.472), giving –0.472 if the patient is male, or (1 – 0.472), giving 0.528 if the subject is female. Then you carry out all the other steps exactly as described.

check.png It’s even a little trickier for multivalued categories like race or district, because you have to code each of these variables as a set of dummy variables (see Chapter 19).

Estimating the Required Sample Size for a Survival Regression

Note: Elsewhere in this chapter, I use the word power in its algebraic sense (x^2 is x to the power of 2). But in this section, I use power in its statistical sense: the probability of getting a significant result when performing a statistical test.

Sample-size calculations for regression analysis tend to be difficult for all but the simplest straight-line regressions (see Chapter 18). You can find software for many types of regression, including survival, but it often asks you for things you can’t readily provide.

tip.eps Very often, sample-size estimates for studies that use regression methods to analyze the data are based on simpler analytical methods. I recommend that when you’re planning a Cox PH regression, you base your sample-size estimate on the simpler log-rank test, which I describe in Chapter 23. The free PS-Power and Sample Size program handles these calculations very well.

You still have to specify the following:

check.png Alpha level (usually 0.05)

check.png Desired power (usually 80 percent)

check.png Effect size of importance (usually expressed as a hazard ratio or as the difference in median survival time between groups)

You also need some estimates of the following:

check.png Anticipated enrollment rate: How many subjects you hope to enroll per month or per year

check.png Planned duration of follow-up: How long, after the last subject has been enrolled, you plan to continue following all the subjects before ending the study and analyzing your data

I describe power calculations for survival comparisons in Chapter 23.

tip.eps If this simpler approach isn’t satisfactory, talk to a professional statistician, who will have access to more sophisticated software. Or, you can undertake a Monte-Carlo simulation of the proposed trial and regression analysis (see Chapter 3 for details on this simulation), but this task is seldom necessary.
