Chapter 23

Comparing Survival Times

In This Chapter

• Using the log-rank test to compare two groups

• Thinking about more complicated comparisons

• Calculating the necessary sample size

The life table and Kaplan-Meier survival curves described in Chapter 22 are ideal for summarizing and describing the time to the first (or only) occurrence of some event, based on times observed in a sample of subjects. They correctly incorporate censored data (when a subject isn’t observed long enough to experience the event). Animal studies or human studies involving endpoints that occur on a short time-scale (like duration of labor) might yield totally uncensored data, but most clinical studies will contain at least some censored observations.

In biological research (and especially in clinical trials), you often want to compare survival times between two or more groups of subjects. This chapter describes an important method — the log-rank test — for comparing survival between two groups and explains how to calculate the sample size you need to have sufficient statistical power (see Chapter 3) when performing this test. The log-rank test can be extended to handle three or more groups of subjects, but I don’t describe that test in this book.

Remember: In this chapter, as in Chapters 22 and 24, I use the term survival and refer to the outcome event as death, but everything I say applies to any kind of outcome event.

Warning: A fair bit of ambiguity is associated with the name log-rank test. This procedure is also referred to as the Mantel-Cox test, a stratified Cochran-Mantel-Haenszel test, the Mantel-Haenszel test for survival data, the Generalized Savage’s test, and the “Score Test” from the Cox Proportional Hazards model. (And I may have missed a few!) The log-rank test has also been extended in various ways; some of these variants have their own names (the Gehan-Breslow test, and Peto and Peto’s modification of the Gehan test, among others). And different implementations of the log-rank test may calculate the test statistic differently, resulting in slightly different p values. In this chapter, I describe the most commonly used form of the log-rank test.

Remember: If you’re lucky enough to have no censored observations in your data, you can skip most of this chapter. You simply have two (or more) groups of numbers (survival times) that you want to compare. One option is to use an unpaired Student t test to see whether one group has a significantly longer mean survival than the other (or an ANOVA if you have three or more groups), as described in Chapter 12. But because survival times are very likely to be non-normally distributed, it’s safer to use a nonparametric test: the Wilcoxon Sum-of-Ranks test (equivalent to the Mann-Whitney U test) to compare the median survival time between two groups, or the Kruskal-Wallis test for three or more groups.

Suppose you conduct a trial of a cancer drug with 90 subjects (randomized so that 60 receive the drug and 30 receive a placebo), following them for a total of five years and recording when each subject dies or is censored. You perform a life-table analysis on each group of subjects (drug and placebo) as described in Chapter 22 and graph the results, getting the survival curves shown in Figure 23-1. (The two life tables also provide the summary information you need for the log-rank test.)

Illustration by Wiley, Composition Services Graphics

Figure 23-1: Survival curves for two groups of subjects.

The two survival curves look different — the drug group seems to be showing better survival than the placebo group. But is this apparent difference real, or could it be the result of random fluctuations only? The log-rank test answers this question.

Comparing Survival between Two Groups with the Log-Rank Test

The log-rank test can be performed on individual-subject data or on data that has been summarized into a life-table format. I first explain the idea behind the test, then describe how to run it with statistical software, and finally walk through the calculations in detail, as you might carry them out using a spreadsheet like Excel.

Understanding what the log-rank test is doing

Basically, the log-rank test asks whether deaths are split between the two groups in the same proportion as the number of at-risk subjects in the two groups. The difference between the observed and expected number of deaths in each time slice for one of the groups (it doesn’t matter which one) is summed over all the time slices to get the total excess deaths for that group. The excess-deaths sum is then scaled down; that is, it’s divided by an estimate of its standard deviation. (I describe later in this chapter how that standard deviation estimate is calculated.) If the two groups truly have the same survival, the scaled-down excess-deaths sum is a number whose random sampling fluctuations should approximately follow a normal distribution, from which a p value can be easily calculated.

Don’t worry if the preceding paragraph makes your head spin; it’s just meant to give you a general sense of the rationale for the log-rank test.

Running the log-rank test on software

Remember: Most commercial statistical software packages (like those in Chapter 4) can perform a log-rank test. You first organize your data into a file consisting of one record per subject, having the following three variables:

• A categorical variable that identifies which group each subject belongs to

• A numerical variable containing the subject’s survival time (either the time to the event or the time to the end of observation)

• A variable that indicates the subject’s status at the end of the survival time (usually 1 if the event was observed; 0 if the observation was censored)

You identify these three variables to the program, either by typing the variable names or by picking them from a list of variables in the data file.
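The three-variable layout described above can be sketched in Python using only the standard library. This is a minimal illustration, not the routine any particular package uses: the function name `logrank_test`, the group labels, and the ten records are all hypothetical, and the sketch works from exact death times rather than life-table time slices, so its results may differ slightly from a life-table calculation.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical data: one record per subject as (group, survival time, status),
# where status is 1 if the death was observed and 0 if the subject was censored.
records = [
    ("drug", 5, 1), ("drug", 8, 0), ("drug", 12, 1), ("drug", 15, 0), ("drug", 20, 0),
    ("placebo", 2, 1), ("placebo", 4, 1), ("placebo", 6, 1), ("placebo", 9, 0),
    ("placebo", 10, 1),
]

def logrank_test(records, group1):
    """Two-group log-rank test; returns (z, two-sided p value)."""
    death_times = sorted({time for _, time, status in records if status == 1})
    total_excess = 0.0
    total_variance = 0.0
    for t in death_times:
        # Subjects still under observation just before time t
        n1 = sum(1 for g, time, _ in records if time >= t and g == group1)
        n2 = sum(1 for g, time, _ in records if time >= t and g != group1)
        # Deaths observed exactly at time t
        d1 = sum(1 for g, time, s in records if time == t and s == 1 and g == group1)
        d_total = sum(1 for _, time, s in records if time == t and s == 1)
        n_total = n1 + n2
        # Observed minus expected deaths for group 1 at this time
        total_excess += d1 - d_total * n1 / n_total
        if n_total > 1:
            total_variance += (d_total * (n1 / n_total) * (n2 / n_total)
                               * (n_total - d_total) / (n_total - 1))
    z = total_excess / sqrt(total_variance)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

z, p = logrank_test(records, "drug")
print(f"Z = {z:.3f}, p = {p:.3f}")
```

A negative Z here means the group passed as `group1` had fewer deaths than expected. With only ten subjects the normal approximation is rough, so treat the output as a demonstration of the data layout rather than a real analysis.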

The program should produce a p value for the log-rank test. If this p value is less than 0.05, you can conclude that the two groups have significantly different survival.

In addition to the p value, the program may produce the median survival time for each group as well as confidence intervals for the median times and the difference in median times between groups. It may also offer to produce analyses and graphs that assess whether your data is consistent with the hazard proportionality assumption that I describe later in this chapter. You should get these extra outputs if they’re available. Consult the program’s documentation for information about using the program’s output to assess hazard proportionality.

Looking at the calculations

It’s generally not a good idea to do log-rank tests by hand or with a home-written spreadsheet; things can go wrong in too many places. But, as with many tests that I describe in this book, you’ll have a better appreciation of the strengths and limitations of the log-rank test if you understand how it works. So in this section, I describe how the log-rank calculations can be carried out in a spreadsheet environment.

The log-rank test utilizes some of the information from the life tables you prepared in order to graph the survival of the two groups in Figure 23-1. The test needs only the number of subjects at risk and the number of observed deaths for each group at each time slice. Figure 23-2 shows a portion of the life tables that produced the curves shown in Figure 23-1, with the data for the two treatment groups displayed side by side. The Drug group’s results are in columns B through E, and the Placebo group’s results are in columns F through I.

Illustration by Wiley, Composition Services Graphics

Figure 23-2: Part of life-table calculations for two groups of subjects.

The calculations for the log-rank test are carried out in a second spreadsheet (as shown in Figure 23-3), with the following columns:

• Column A identifies the time slices, consistent with Figure 23-2.

• The log-rank test needs only the At Risk and Died columns for each group. Columns B and C of Figure 23-3 are taken from columns E and C, respectively, of Figure 23-2. Columns D and E of Figure 23-3 are taken from Columns I and G, respectively, of Figure 23-2.

• Columns F and G show the total number of subjects at risk and the total number of subjects who died; they’re obtained by combining the corresponding columns for the two treatment groups.

• Column H shows Group 1’s percentage of the total number of at-risk subjects.

• Column I shows the number of deaths you’d expect to see in Group 1 based on apportioning the total number of deaths (in both groups) by Group 1’s percentage of total at-risk subjects. For the 0–1 year row, Group 1 had about 2/3 of the 89 subjects at risk, so you’d expect it to have about 2/3 of the nine deaths.

• Column J shows the excess number of actual deaths compared to the expected number for Group 1.

• Column K shows the variance (the square of the standard deviation) of the excess deaths. It’s obtained from a rather complicated formula that’s based on the properties of the hypergeometric distribution, a close relative of the binomial distribution (see Chapter 25):

$$V = D_T \left(\frac{N_1}{N_T}\right) \left(\frac{N_2}{N_T}\right) \frac{N_T - D_T}{N_T - 1}$$

For the first time slice (0–1 yr), this becomes: V = 9(59.5/89)(29.5/89)(89 – 9)/(89 – 1), which equals approximately 1.813.

N refers to the number of subjects at risk, D refers to deaths, the subscripts 1 and 2 refer to groups 1 and 2, and T refers to the total of both groups combined.
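The per-slice arithmetic for the expected deaths and the variance can be checked with a few lines of Python. This is a minimal sketch using the first time slice's numbers quoted above (9 total deaths, 59.5 and 29.5 subjects at risk); the function names are illustrative, not taken from any package.

```python
def expected_deaths(d_total, n1, n2):
    """Deaths expected in Group 1 if deaths split in proportion to at-risk counts."""
    return d_total * n1 / (n1 + n2)

def slice_variance(d_total, n1, n2):
    """Variance of Group 1's excess deaths for one time slice."""
    n_total = n1 + n2
    return (d_total * (n1 / n_total) * (n2 / n_total)
            * (n_total - d_total) / (n_total - 1))

# First time slice (0-1 yr): 9 deaths in total, 59.5 at risk in Group 1, 29.5 in Group 2
expected = expected_deaths(9, 59.5, 29.5)
variance = slice_variance(9, 59.5, 29.5)
print(f"expected deaths = {expected:.3f}, variance = {variance:.3f}")
```

The expected count comes out at about 2/3 of the nine deaths, and the variance agrees with the hand calculation of approximately 1.813.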

Illustration by Wiley, Composition Services Graphics

Figure 23-3: Basic log-rank calculations (don’t try this at home, kids!).

Next, you add up the excess deaths in all the time slices to get the total number of excess deaths for Group 1 compared to what you would have expected if the deaths had been distributed between the two groups in the same ratio as the number of at-risk subjects.

Then you add up all the variances, because the variance of the sum of a set of numbers is the sum of the variances of the individual numbers (from the error-propagation rules given in Chapter 11).

Then you divide the total excess deaths by the square root of the total variance to get a test statistic called Z:

$$Z = \frac{\text{total excess deaths}}{\sqrt{\text{total variance}}}$$

The Z value is approximately normally distributed, so you can obtain a p value from a table of the normal distribution or from an online calculator. For the data in Figure 23-3, Z works out to 2.19, which corresponds to a p value of 0.028, so you can conclude that the two groups have significantly different survival.

Note: It doesn’t matter which group (drug or placebo) you call Group 1 in these calculations; only the sign of Z changes, and the p value is the same either way.
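The last two steps (summing over time slices to get Z, then converting Z to a p value) can be sketched with Python's standard library instead of a normal table. The function names and the three-slice lists are hypothetical, chosen only so the arithmetic is easy to follow; the final line uses the Z of 2.19 quoted above.

```python
from math import sqrt
from statistics import NormalDist

def logrank_z(excess_by_slice, variance_by_slice):
    """Total excess deaths divided by the square root of the total variance."""
    return sum(excess_by_slice) / sqrt(sum(variance_by_slice))

def two_sided_p(z):
    """Two-sided p value for a standard normal test statistic."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical excess deaths and variances for three time slices
z_demo = logrank_z([1.0, 2.0, 1.0], [1.0, 2.0, 1.0])

# Z value reported in the text for the data in Figure 23-3
p_value = two_sided_p(2.19)
print(f"demo Z = {z_demo:.2f}, p for Z of 2.19 = {p_value:.4f}")
```

The p value lands within rounding of the 0.028 quoted above; the small difference presumably comes from Z having been rounded to two decimals. Because `two_sided_p` takes the absolute value of Z, swapping the group labels gives the same p value.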

Assessing the assumptions

Like all statistical tests, the log-rank test assumes that you studied an unbiased sample from the population you’re trying to draw conclusions about. It also assumes that any censoring that occurred was due to circumstances unrelated to the efficacy of the treatment (for example, subjects didn’t drop out of the study because the drug made them sick).

Remember: One very important assumption is that the two groups have proportional hazards. I describe these hazards in more detail in Chapter 24, but for now the important thing to know is that the survival curves of the two groups must have generally similar shapes, as in Figure 23-4. (Flip to Chapter 22 for more about survival curves.)

Illustration by Wiley, Composition Services Graphics

Figure 23-4: Proportional (a) and nonproportional (b) hazards relationships between two survival curves.

The log-rank test looks for differences in overall survival time; it’s not good at detecting differences in shape between two survival curves with similar overall survival time, like the two curves shown in Figure 23-4b (which actually have the same median survival time). When two survival curves cross over each other, the excess deaths are positive for some time slices and negative for others, so they tend to cancel out when they’re added up, producing a smaller test statistic (Z value) and a larger (less significant) p value.

Considering More Complicated Comparisons

The log-rank test is good for comparing survival between two or more groups of subjects. But it doesn’t extend well to more complicated situations. What if you want to do one of the following?

• Test whether survival depends on age or some other continuous variable

• Test the simultaneous effect of several variables, or their interactions, on survival

• Correct for the presence of confounding variables or other covariates

In other areas of statistical testing, such situations are usually handled by regression techniques, so it’s not surprising that statisticians have developed a special type of regression to deal with survival outcomes with censored observations. I describe this special kind of regression in Chapter 24.

Coming Up with the Sample Size Needed for Survival Comparisons

I introduce power and sample size in Chapter 3. Calculating the sample size for survival comparisons is complicated by several things:

• The need to specify an alternative hypothesis: This hypothesis can take the form of a hazard ratio, described in Chapter 24 (the null hypothesis is that the hazard ratio = 1), or the difference between two median survival times.

• The effect of censoring: This effect can depend on things like accrual rate, dropout rate, and the length of additional follow-up after the last subject has been enrolled into the study.

• The shape of the survival curves: This shape is often assumed, for the sake of the sample-size calculations, to be a simple exponential curve, but that may not be realistic.

Tip: I recommend using software like the free PS (Power and Sample Size Calculation; see Chapter 4) to do these calculations, because it can take a lot of these complications into account.

Suppose you’re planning a study to compare a drug to a placebo. You’ll have two equal-size groups, and you expect to enroll subjects for one year and then continue to follow the subjects’ progress for another two years after enrollment is complete. You expect the median placebo time to be 20 months, and you think the drug should extend this to 30 months. If it truly does extend survival that much, you want to have an 80 percent chance of getting p < 0.05 when you compare drug to placebo using the log-rank test.

You set up the PS program as shown in Figure 23-5. Note that time must always be entered in the same units (months, in this example) in the various fields: the median survival times for the two groups (m1 and m2), the accrual interval (A), and the post-accrual follow-up period (F).

This tells you that you need to enroll 170 subjects in each group (a total of 340 subjects altogether).
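As a rough cross-check on a figure like this, the calculation can be approximated in Python. This is a sketch under stated assumptions and not necessarily the algorithm PS uses: it assumes exponential survival in both groups, uniform enrollment over the accrual interval, no dropouts, and Schoenfeld's approximation for the number of deaths a log-rank test needs. The function names are illustrative.

```python
from math import ceil, exp, log
from statistics import NormalDist

def death_probability(median, accrual, followup):
    """Chance a subject's death is observed, assuming exponential survival
    and uniform enrollment over the accrual interval (no dropouts)."""
    hazard = log(2) / median
    return 1 - (exp(-hazard * followup)
                - exp(-hazard * (accrual + followup))) / (hazard * accrual)

def subjects_per_group(m1, m2, accrual, followup, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided log-rank test
    with two equal-size groups."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # With exponential survival, the hazard ratio is the inverse ratio of the medians
    hazard_ratio = m1 / m2
    # Schoenfeld's approximation for the total number of deaths required
    total_deaths = 4 * (z_alpha + z_beta) ** 2 / log(hazard_ratio) ** 2
    # Average chance of observing a death across the two groups
    p_death = (death_probability(m1, accrual, followup)
               + death_probability(m2, accrual, followup)) / 2
    return ceil(total_deaths / p_death / 2)

# Medians of 20 and 30 months, 12 months of accrual, 24 months of further follow-up
n_per_group = subjects_per_group(20, 30, accrual=12, followup=24)
print(n_per_group)
```

Under these assumptions the result lands within a few subjects of the 170 per group reported above. Dedicated software remains the better choice for a real protocol, because it can account for complications this sketch ignores.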

Tip: Sample-size software often provides a brief paragraph describing the sample-size calculation, which you can copy and paste into your protocol (or proposal) document.

PS: Power and Sample Size Calculation by William D. Dupont

Figure 23-5: Sample-size calculation for comparing survival times using the PS program.
