Chapter 16

Feeling Noninferior (Or Equivalent)

In This Chapter

arrow Demonstrating the absence of an effect in your data

arrow Testing for bioequivalence, therapeutic noninferiority, and absence of harmful effects

Many statistical tests let you determine whether two things are different from each other, like showing that a drug is better than a placebo, or that the blood concentration of some enzyme is higher in people with some medical condition than in people without that condition. But sometimes you want to prove that two (or more) things are not different. Here are three examples I refer to throughout this chapter:

check.png Bioequivalence: You’re developing a generic formulation to compete with a name-brand drug, so you have to demonstrate that your product is bioequivalent to the name-brand product; that is, that it puts essentially the same amount of active ingredient into the bloodstream.

check.png Therapeutic noninferiority: You want to show that your new treatment for some disease is no worse than the current best treatment for that disease.

check.png Absence of harmful effects: Your new drug must demonstrate that, compared to a placebo, it doesn’t prolong the QT interval on an ECG (see Chapter 6 for more info on the role of QT testing during drug development).

This chapter describes how to analyze data from equivalence and noninferiority or nonsuperiority studies. Nonsuperiority studies are less frequently encountered than noninferiority studies, and the analysis is basically the same, so for the rest of this chapter, whatever I say about noninferiority also applies (in the reverse direction) to nonsuperiority.

Understanding the Absence of an Effect

remember.eps Absence of proof is not proof of absence! Proving the total absence of an effect statistically is impossible. For example, you can’t interpret a nonsignificant outcome from a t test as proving that no difference exists between two groups. The difference may have been real but small, or your sample size may have been too small, or your analytical method may not have been precise enough or sensitive enough. To demonstrate the absence of an effect, you need to test your data in a special way.

Defining the effect size: How different are the groups?

Before you can test whether two groups are different, you first have to come up with a numerical measure that quantifies how different the two groups are. I call this the effect size, and it’s defined in different ways for different kinds of studies:

check.png Bioequivalence studies: In bioequivalence (BE) studies, the amount of drug that gets into the bloodstream is usually expressed in terms of the AUC — the area under the curve of blood concentration of the drug versus time, as determined from a pharmacokinetic analysis (see Chapter 5). The effect size for BE studies is usually expressed as the ratio of the AUC of the new formulation divided by the AUC of the reference formulation (such as the brand-name drug). A ratio of 1.0 means that the two products are perfectly bioequivalent (that is, 1.0 is the no-effect value for a BE study).

check.png Therapeutic noninferiority studies: In a therapeutic trial, the effect size is often expressed as the difference or the ratio of the efficacy endpoint for that trial. So for a cancer treatment trial, it could be the between-group difference in any of the following measures of efficacy: median survival time, five-year survival rate, the percent of subjects responding to the treatment, or the hazard ratio from a survival regression analysis (see Chapter 24). For an arthritis treatment trial, it could be the between-group difference in pain score improvements or the percentage of subjects reporting an improvement in their condition. Often the effect size is defined so that a larger (or more positive) value corresponds to a better treatment.

check.png QT safety studies: In a QT trial, the effect size is the difference in QT interval prolongation between the drug and placebo.
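To make the bioequivalence effect size concrete, here’s a minimal Python sketch (not any regulatory workflow — the times and concentrations are made up for illustration) that computes each formulation’s AUC by the linear trapezoidal rule and takes their ratio:

```python
def auc_trapezoid(times, concs):
    """Area under the concentration-versus-time curve, linear trapezoidal rule."""
    return sum((t1 - t0) * (c0 + c1) / 2.0
               for t0, t1, c0, c1 in zip(times, times[1:], concs, concs[1:]))

# Hypothetical blood-level measurements at 0, 1, 2, 4, and 8 hours
times = [0, 1, 2, 4, 8]
ref_concs = [0.0, 4.0, 3.0, 1.5, 0.0]   # name-brand (reference) product
new_concs = [0.0, 3.8, 3.1, 1.4, 0.0]   # generic (test) product

# The BE effect size: ratio of the new product's AUC to the reference's
effect_size = auc_trapezoid(times, new_concs) / auc_trapezoid(times, ref_concs)
```

With these made-up numbers the ratio comes out close to 1, the no-effect value for a BE study.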

Defining an important effect size: How close is close enough?

Instead of trying to prove absolutely no difference between two groups (which is impossible), you need only prove no important difference between the groups. To do the latter, you must first define exactly what you mean by an important difference — you must come up with a number representing the smallest effect size you consider to be clinically meaningful. If the true effect size (in the entire population) is less than that amount, then for all practical purposes it’s like no difference at all between the groups. The minimal important effect size is referred to by several names, such as allowable difference and permissible tolerance.

Just how much of an effect is important depends on what you’re measuring:

check.png For bioequivalence studies: The customary tolerance in the AUC ratio is 0.8 to 1.25; two drugs are considered equivalent if their true AUC ratio lies within this range.

check.png For therapeutic noninferiority studies: The tolerance depends on the disease and the chosen efficacy endpoint. It’s usually established by consensus between expert clinicians in the particular discipline and the regulatory agencies.

check.png For QT safety studies: Current regulations say that the drug must not prolong the QT interval by more than 10 milliseconds, compared to placebo. If the true prolongation is less than 10 milliseconds, the drug is judged as not prolonging QT.

Recognizing effects: Can you spot a difference if there really is one?

When you’re trying to prove the absence of some effect, you have to convince people that your methodology isn’t oblivious to that effect and that you can actually spot the effect in your data when it’s truly present in the population you’re studying.

For example, I can always “prove” statistically that a drug doesn’t prolong the QT interval simply by having the ECGs read by someone who hasn’t the foggiest idea of how to measure QT intervals! That person’s results will consist almost entirely of random “noise,” and the statistical tests for a QT prolongation will come out nonsignificant almost every time.

Here the concept of assay sensitivity comes into the picture. I use the word assay here to refer to the entire methodology used in the study — not just a laboratory assay. To prove assay sensitivity, you need to show that you can recognize a difference if it’s really there.

remember.eps You demonstrate adequate assay sensitivity by using a positive control. In addition to the groups that you’re trying to show are not different, you must also include in your study another group that definitely is different from the reference group:

check.png In bioequivalence studies: One simple and obvious choice is the brand-name product itself, but at a different dose than that of the reference group — perhaps half (or maybe twice) the standard dose. To prove assay sensitivity, you’d better get a significant result when comparing the positive control to the standard product.

check.png warning_bomb.eps For therapeutic noninferiority studies: Noninferiority studies have a real problem with regard to assay sensitivity — there’s usually no way to use a positive control (a truly inferior treatment) in the study because noninferiority studies are generally used in situations when it would be unethical to withhold effective treatment from people with a serious illness. So when a noninferiority trial comes out successful, you can’t tell whether it was a well-designed trial on a treatment that’s truly noninferior to the reference treatment or whether it was a badly designed trial on a truly inferior treatment.

check.png For QT safety studies: The drug moxifloxacin (known to prolong QT, but not by a dangerous amount) is often used as a positive control. Subjects given moxifloxacin had better show a statistically significant QT prolongation compared to the placebo.

Proving Equivalence and Noninferiority

After you understand what the absence of a meaningful difference means (see the preceding section), you can put all the concepts together into a statistical process for demonstrating that there’s no important difference between two groups.

Equivalence and noninferiority can be demonstrated statistically by using significance tests in a special way or by using confidence intervals.

Using significance tests

Significance tests are used to show that an observed effect is unlikely to be due only to random fluctuations. (See Chapter 3 for a general discussion of hypothesis testing.) You may be tempted to test for bioequivalence by comparing the AUCs of the two drug formulations with a two-group Student t test or by comparing the ratio of the AUCs to the no-effect value of 1.0 using a one-group t test. Those approaches would be okay if you were trying to prove that the two formulations weren’t exactly the same, but that’s not what equivalence testing is all about.

Instead, you have to think like this:

check.png Bioequivalence: The rules for drug bioequivalence say that the true AUC ratio has to be between 0.8 and 1.25. That’s like saying that the observed mean AUC ratio must be significantly greater than 0.8, and it must be significantly less than 1.25. So instead of performing one significance test (against the no-effect value of 1.0), you perform two tests — one against the low (0.8) limit and one against the high (1.25) limit. Each of these tests is one-sided, because each test is concerned only with differences in one direction, so this procedure is called the two one-sided tests method for equivalence testing.

check.png Therapeutic noninferiority: You use the same idea for noninferiority testing, but you have to show only that the new treatment is significantly better than the worst end of the permissible range for the reference treatment. So if a cancer therapy trial used a five-year survival rate as the efficacy variable, the reference treatment had a 40 percent rate, and the permissible tolerance was five percent, then the five-year survival rate for the new treatment would have to be significantly greater than 35 percent.

check.png QT safety: You have to show that the drug’s true QT prolongation, relative to placebo, is less than 10 milliseconds (msec). So you have to show that the excess prolongation (drug minus placebo) is significantly less than 10 milliseconds.
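The two one-sided tests method in the first bullet can be sketched in a few lines of Python. This is a minimal sketch, assuming the analysis is done on each subject’s log-transformed AUC ratio (AUC ratios are customarily analyzed on the log scale), and the hard-coded 1.729 is the one-sided 5 percent critical value of the t distribution for 19 degrees of freedom — look up the value matching your own sample size:

```python
import math
from statistics import mean, stdev

def tost_bioequivalence(log_ratios, t_crit=1.729):
    """Two one-sided tests (TOST) for bioequivalence at the 5 percent level.

    log_ratios: each subject's log(AUC_test / AUC_reference).
    t_crit: one-sided 5 percent critical value of the t distribution for
            n - 1 degrees of freedom (1.729 is for df = 19, i.e. n = 20
            subjects -- look up the right value for your own n).
    """
    n = len(log_ratios)
    m = mean(log_ratios)
    se = stdev(log_ratios) / math.sqrt(n)    # standard error of the mean
    t_low = (m - math.log(0.8)) / se         # test 1: significantly above log(0.8)
    t_high = (math.log(1.25) - m) / se       # test 2: significantly below log(1.25)
    return t_low > t_crit and t_high > t_crit  # both tests must be significant
```

Only when both one-sided tests come out significant can you conclude bioequivalence; if either one fails, you can’t.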

If these significance-testing approaches sound confusing, don’t worry — you can test for equivalence and noninferiority another way that’s much easier to visualize and understand, and it always leads to exactly the same conclusions.

Using confidence intervals

In Chapter 10, I point out that you can use confidence intervals as an alternative to the usual significance tests by calculating the effect size, along with its confidence interval (CI), and checking whether the CI includes the no-effect value (0 for differences; 1 for ratios). It’s always true that an effect is statistically significant if, and only if, its CI doesn’t include the no-effect value. And confidence levels correspond to significance levels: 95 percent CIs correspond to p = 0.05; 99 percent CIs to p = 0.01, and so on.

Figure 16-1 illustrates this correspondence. The vertical line corresponds to the no-effect value (1 for a pharmacokinetic study, using the AUC ratio, in the left diagram; 0 for efficacy testing, using the difference in five-year survival, in the right diagram). The small diamonds are the observed value of the effect size, and the horizontal lines are the 95 percent CIs. When confidence intervals span across the no-effect value, the effect is not significant; when they don’t, it is significant.


Illustration by Wiley, Composition Services Graphics

Figure 16-1: Using 95 percent confidence intervals to test for significant effects for a pharmacokinetics trial and a cancer therapy trial.

You can also use confidence intervals to analyze equivalence and noninferiority/nonsuperiority studies, as shown in Figure 16-2.


Illustration by Wiley, Composition Services Graphics

Figure 16-2: Using confidence intervals to test for bioequivalence using the AUC ratio, and testing for noninferiority using the difference in five-year survival.

You can see two additional vertical lines, representing the lower and upper limits of the allowable tolerance. This time, you don’t care whether the confidence interval includes the no-effect line (which is why I made it so light). Instead, you’re interested in whether the CI stays within the tolerance lines:

check.png Equivalence: You can conclude equivalence if, and only if, the entire CI fits between the two tolerance lines.

check.png Noninferiority and nonsuperiority: You can conclude noninferiority if, and only if, the entire CI lies to the right of the worst tolerance line. (It doesn’t matter how high the CI extends.) You can conclude nonsuperiority if, and only if, the entire CI lies to the left of the better tolerance line. (It doesn’t matter how low the CI extends.)

check.png QT testing: For a drug to be judged as not substantially prolonging the QT interval, the CI around the difference in QT prolongation between drug and placebo must never extend beyond 10 milliseconds.

When testing noninferiority or nonsuperiority at the 5 percent significance level, you should use 95 percent CIs, as you would expect. But when testing equivalence at the 5 percent level, you should use 90 percent CIs! That’s because for equivalence, the 5 percent needs to be applied at both the high and low ends of the CI, not just at one end. And for QT testing, the 5 percent is applied only at the upper end.
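These decision rules boil down to comparing the ends of the CI to the two tolerance lines. Here’s a minimal Python sketch, assuming an effect size where larger values favor the new treatment; remember that the CI you feed in should be 90 percent for an equivalence conclusion and 95 percent for noninferiority or nonsuperiority, as just described. The numbers in the usage example are hypothetical:

```python
def ci_conclusion(ci_lo, ci_hi, tol_lo, tol_hi):
    """Classify a confidence interval against the two tolerance lines.

    Assumes an effect size where larger values favor the new treatment,
    with tol_lo the "worst permissible" line and tol_hi the "better" line.
    """
    if tol_lo < ci_lo and ci_hi < tol_hi:
        return "equivalent"    # whole CI fits between the tolerance lines
    if tol_lo < ci_lo:
        return "noninferior"   # whole CI lies to the right of the worst line
    if ci_hi < tol_hi:
        return "nonsuperior"   # whole CI lies to the left of the better line
    return "inconclusive"

# Hypothetical cancer-trial example: difference in five-year survival
# (new minus reference, in percentage points), with a tolerance of 5 points
print(ci_conclusion(-2.0, 6.0, -5.0, 5.0))   # prints "noninferior"
```

The CI from −2 to +6 dips below zero but stays entirely to the right of the −5 tolerance line, so the new treatment is noninferior even though it hasn’t been shown superior.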

Some precautions about noninferiority testing

Although noninferiority testing is sometimes necessary, it has a number of weaknesses that you should keep in mind:

check.png No positive control: In the earlier section “Recognizing effects: Can you spot a difference if there really is one?” I describe how noninferiority trials usually can’t incorporate a truly ineffective treatment, for ethical reasons.

check.png No true proof of efficacy: Proving that a new drug isn’t inferior to a drug that has been shown to be significantly better than placebo isn’t really the same as proving that the new drug is significantly better than placebo. The new drug may be less effective than the reference drug (perhaps even significantly less effective) but still within the allowable tolerance for noninferiority.

check.png Noninferiority creep: If a new drug is tested against a reference drug that was, itself, approved on the basis of a noninferiority study, this new “third-generation” drug may be less effective than the reference drug, which may have been less effective than the first drug that was tested against placebo. Each successive generation of noninferior drugs may be less effective than the preceding generation. This so-called noninferiority creep (sometimes referred to as bio-creep) is a matter of considerable concern among researchers and regulatory agencies.

tip.eps check.png Estimating sample size: There are no simple rules of thumb for estimating the sample size needed for equivalence and noninferiority or nonsuperiority studies; you need to use special software designed for these studies. Web pages that estimate sample size for some of these studies are listed on the StatPages.info website.
