Chapter 14
Taking a Closer Look at Fourfold Tables
In This Chapter
Beginning with the basics of fourfold tables
Digging into sampling designs for fourfold tables
Using fourfold tables in different scenarios
In Chapter 13, I show you how to compare proportions between two or more groups with a cross-tab table. In general, a cross-tab shows the relationship between two categorical variables. Each row of the table represents one particular category of one variable, and each column of the table represents one particular category of the other variable. The table can have two or more rows and two or more columns, depending on the number of levels (different categories) present in each of the two variables. So a cross-tab between a treatment variable that has three levels (like old drug, new drug, and placebo) and an outcome variable that has five levels (like died, got worse, unchanged, improved, and cured) has three rows and five columns.
A special case occurs when both variables are dichotomous (or binary); that is, they both have only two values, like gender (male and female) and compliance (good and bad). The cross-tab of these two variables has two rows and two columns. Because a 2x2 cross-tab table has four cells, it’s commonly called a fourfold table.
Everything in Chapter 13 applies to the fourfold table and to larger cross-tab tables. Because the fourfold table springs up in so many different contexts in biological research, and because so many other quantities are calculated from the fourfold table, it warrants a chapter all its own. In this chapter, I describe the various research scenarios in which fourfold tables often occur: comparing proportions, testing for association, evaluating risk factors, quantifying the performance of diagnostic tests, assessing the effectiveness of therapies, and measuring inter-rater and intra-rater reliability. I describe how to calculate several common measures (called indices) used in each scenario, along with their confidence intervals. And I describe different kinds of sampling strategies (ways of selecting subjects to study) that you need to be aware of.
Focusing on the Fundamentals of Fourfold Tables
The most obvious thing you can get from a fourfold table is a p value indicating whether a significant association exists between the two categorical variables from which the table was created. A p value is the probability that random fluctuations alone, in the absence of any real effect in the population, could have produced an observed effect at least as large as what you observed in your sample. If the p value is less than some arbitrary value (often set at 0.05), the effect is said to be statistically significant (see Chapter 3 for a more detailed discussion of p values and significance). Assessing significance is often the main reason (and sometimes the only reason) why someone creates a cross-tab of any size. But fourfold tables can yield other interesting numbers besides a p value.
Like any other number you calculate from your data, an index from a fourfold table is only a sample statistic — an estimate of the corresponding population parameter. So a good researcher always wants to quote the precision of that estimate. In Chapters 9 and 10, I describe how to calculate the standard error (SE) and confidence interval (CI) for simple sample statistics like means, proportions, and regression coefficients. And in this chapter, I show you how to calculate the SE and CI for the various indices you can get from a fourfold table.
Though an index itself may be easy to calculate, its SE or CI usually is not. Approximate formulas are available for some of the more common indices; these are usually based on the fact that the random sampling fluctuations of an index (or its logarithm) are often nearly normally distributed if the sample size isn’t too small. I provide such formulas where they’re available. Fortunately, the fourfold table web page provides confidence intervals for all the indices it calculates, using a general (but still approximate) method.
Illustration by Wiley, Composition Services Graphics
Figure 14-1: These designations for cell counts and totals are used throughout this chapter: a, b, c, and d are the four cell counts (a and b in the top row; c and d in the bottom row); r1 and r2 are the row totals; c1 and c2 are the column totals; and t is the grand total.
Choosing the Right Sampling Strategy
When designing a study whose objective involves two categorical variables that will be cross-tabulated into a fourfold table, you have to give thought to how you select your subjects. For example, suppose you’re planning a simple research project to investigate the relationship between obesity and high blood pressure (hypertension, or HTN). You enroll a sample of subjects and prepare a fourfold table from your data, with obesity as the row variable (obese in the top row; non-obese in the bottom row) and HTN as the column variable (subjects with HTN in the left column; subjects without HTN in the right column). For the sake of the example, if an association exists, obesity is considered the cause and HTN the effect. How will you go about enrolling subjects for this study?
You can enroll a certain number of subjects without knowing how many do or do not have the risk factor, or how many do or do not have the outcome. You can decide to enroll, say, 100 subjects, not knowing in advance how many of these subjects are obese or how many have HTN. In terms of the cells in Figure 14-1, this means that you predetermine the value of t as 100, but you don’t know what the values of r1, r2, c1, or c2 will be until you determine the obesity and hypertension status of each subject. This is called a natural sampling design.
You can enroll a certain number of subjects with the risk factor, and a certain number without the risk factor. You can decide to enroll, say, 50 obese and 50 non-obese subjects, not knowing what their HTN status is. You specify, in advance, that r1 will be 50 and r2 will be 50 (and therefore t will be 100), but you don’t know what c1 and c2 are until you determine the HTN status of each subject. This is called a cohort or prospective study design — you select two cohorts of subjects based on the presence or absence of the risk factor (the cause) and then compare how many subjects in each cohort got the outcome (conceptually looking forward from cause to effect). Statisticians often use this kind of design when the risk factor is very rare, to be sure of getting enough subjects with the rare risk factor.
You can enroll a certain number of subjects who have the outcome and a certain number who do not have the outcome. You can decide to enroll, say, 50 subjects with hypertension and 50 subjects without hypertension, without knowing what their obesity status is. You specify, in advance, that c1 will be 50 and c2 will be 50 (and therefore t will be 100), but you don’t know what r1 and r2 are until you determine the obesity status of each subject. This is called a case-control or retrospective study design — you select a bunch of cases (subjects with the outcome of hypertension) and a bunch of controls (subjects without hypertension) and then compare the prevalence of the obesity risk factor between the cases and the controls (conceptually looking backward from effect to cause). Statisticians often use this kind of design when the outcome is very rare, to be sure of getting enough subjects with the rare outcome.
Why is this distinction among ways of acquiring subjects important? As you see in the rest of this chapter, some indices are meaningful only if the sampling is done a certain way.
Producing Fourfold Tables in a Variety of Situations
Fourfold tables can arise from a number of different scenarios, including the following:
Comparing proportions between two groups (see Chapter 13)
Testing whether two binary variables are associated
Assessing risk factors
Evaluating diagnostic procedures
Evaluating therapies
Evaluating inter-rater reliability
Note: These scenarios can also give rise to tables larger than 2x2. And fourfold tables can arise in other scenarios besides these.
Describing the association between two binary variables
Suppose you select a random sample of 60 adults from the local population. You measure their height and weight, calculate their body mass index, and classify them as obese or non-obese. You also measure their blood pressure under various conditions and categorize them as hypertensive or non-hypertensive. This is a natural sampling strategy, as described in the earlier section "Choosing the Right Sampling Strategy." You can summarize your data in a fourfold table (see Figure 14-2).
This table indicates that most obese people have hypertension, and most non-obese people don’t have hypertension. You can show that this apparent association is statistically significant in this sample using either a Yates chi-square or a Fisher Exact test on this table (as I describe in Chapter 13), getting p = 0.016 or p = 0.013, respectively.
But when you present the results of this study, just saying that a significant association exists between obesity and hypertension isn't enough; you should also indicate how strong this relationship is. For two continuous variables (such as weight or blood pressure, that are not restricted to whole numbers, but could, in theory at least, be measured to any number of decimal places), you can present the correlation coefficient — a number that varies from 0 (indicating no correlation at all) to plus or minus 1 (indicating perfect positive or perfect negative correlation). Wouldn't it be nice if there were a correlation coefficient designed for two binary categorical variables that worked the same way?
The tetrachoric correlation coefficient (RTet) is based on the concept that a subject's category (like hypertensive or non-hypertensive) could have been derived from some continuous variable (like systolic blood pressure), based on an arbitrary "cut value" (like 150 mmHg). This concept doesn't make much sense for intrinsically dichotomous variables like gender, but it's reasonable for things like hypertension based on blood pressure or obesity based on body mass index.
Figure 14-2: A fourfold table summarizing obesity and hypertension in a sample of 60 subjects: 14 obese subjects with HTN (cell a), 7 obese subjects without HTN (cell b), 12 non-obese subjects with HTN (cell c), and 27 non-obese subjects without HTN (cell d).
You can estimate it from the cell counts of a fourfold table with this approximate formula: RTet = cos[π/(1 + √(ad/(bc)))], where cos is the cosine of an angle in radians.
For the data in Figure 14-2, RTet = cos[π/(1 + √((14 × 27)/(7 × 12)))] = cos[π/(1 + √4.5)], which is about 0.53.
Note: No simple formulas exist for the standard error or confidence intervals for the tetrachoric correlation coefficient, but the fourfold-table web page (at StatPages.info/ctab2x2.html) can calculate them. For this example, the 95 percent CI is 0.09 to 0.81.
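The arithmetic above is easy to verify. Here's a minimal Python sketch of the approximation (the function name is my own; this isn't the code behind the web page):

```python
import math

def tetrachoric(a, b, c, d):
    """Approximate tetrachoric correlation from fourfold-table cell counts."""
    # RTet = cos(pi / (1 + sqrt(ad/bc))), with the angle in radians
    return math.cos(math.pi / (1 + math.sqrt((a * d) / (b * c))))

# Cell counts from Figure 14-2: a=14, b=7, c=12, d=27
r_tet = tetrachoric(14, 7, 12, 27)  # about 0.53
```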
Assessing risk factors
How much does a suspected risk factor (cause) increase the chances of getting a particular outcome (effect)? For example, how much does being obese increase your chances of having hypertension? You can calculate a couple of indices from the fourfold table that describe this increase, as you discover in the following sections.
Relative risk (risk ratio)
The risk (or probability) of getting a bad outcome is estimated as the fraction of subjects in a group who had the outcome. You can calculate the risk separately for subjects with and without the risk factor. The risk for subjects with the risk factor is a/r1; for the example from Figure 14-2, it’s 14/21, which is 0.667 (66.7 percent). And for those without the risk factor, the risk is c/r2; for this example, it’s 12/39, which is 0.308 (30.8 percent).
The relative risk (RR), also called the risk ratio, is the ratio of these two risks: RR = (a/r1)/(c/r2). For this example, the RR is (14/21)/(12/39), which is 0.667/0.308, which is 2.17. So in this sample, obese subjects are slightly more than twice as likely to have hypertension as non-obese subjects.
You can calculate an approximate 95 percent confidence interval around the observed RR using the following formulas, which are based on the assumption that the logarithm of the RR is normally distributed:
1. Calculate the standard error of the log of the RR using the following formula: SE = √(1/a – 1/r1 + 1/c – 1/r2)
2. Calculate Q with the following formula: Q = e^(1.96 × SE), where Q is simply a convenient intermediate quantity, which will be used in the next part of the calculation, and e is the mathematical constant 2.718.
3. Find the lower and upper limits of the confidence interval with the following formulas: lower limit = RR/Q; upper limit = RR × Q
For other confidence levels, replace the 1.96 in Step 2 with the appropriate multiplier shown in Table 10-1 of Chapter 10. So for 50 percent CIs, use 0.67; for 80 percent, use 1.28; for 90 percent, use 1.64; for 98 percent, use 2.33; and for 99 percent, use 2.58.
So for the example in Figure 14-2, you calculate 95 percent CI around the observed relative risk as follows:
1. SE = √(1/14 – 1/21 + 1/12 – 1/39), which is 0.2855.
2. Q = e^(1.96 × 0.2855), which is 1.75.
3. The confidence interval extends from 2.17/1.75 to 2.17 × 1.75, which is 1.24 to 3.80.
Or, you can enter the four cell counts from Figure 14-2 into the fourfold table web page, and it will calculate the RR as 2.17, with 95 percent confidence limits of 1.14 to 3.71 (using a different formula).
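If you'd rather script Steps 1 through 3 than work them by hand, here's a short Python sketch (the function name and layout are mine, not the book's):

```python
import math

def relative_risk_ci(a, b, c, d, z=1.96):
    """Relative risk and approximate CI from fourfold-table cell counts.

    Rows are risk factor present/absent; columns are outcome present/absent.
    z = 1.96 gives a 95 percent confidence interval.
    """
    r1, r2 = a + b, c + d                      # row totals
    rr = (a / r1) / (c / r2)                   # risk ratio
    se = math.sqrt(1/a - 1/r1 + 1/c - 1/r2)    # SE of log(RR)
    q = math.exp(z * se)                       # intermediate quantity Q
    return rr, rr / q, rr * q

# Figure 14-2: a=14, b=7, c=12, d=27
rr, lo, hi = relative_risk_ci(14, 7, 12, 27)
# rr is about 2.17, with a 95 percent CI of roughly 1.24 to 3.8
```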
Odds ratio
The odds of something happening are the probability of it happening divided by the probability of it not happening: p/(1 – p). In a sample of data, you estimate the odds of having an outcome event as the number of subjects who had the event divided by the number of subjects who didn't have it.
The odds of having the outcome event for subjects with the risk factor are a/b; for the example in Figure 14-2, they’re 14/7, which is 2.00. And for those without the risk factor, the odds are c/d; for this example they’re 12/27, which is 0.444 (odds usually aren’t expressed as percentages). See Chapter 3 for a more detailed discussion of odds.
The odds ratio (OR) is the ratio of these two odds: OR = (a/b)/(c/d), which simplifies to ad/(bc). For this example, the odds ratio is (14/7)/(12/27), which is 2.00/0.444, which is 4.50. So in this sample, obese subjects have 4.5 times the odds of having hypertension compared to non-obese subjects.
You can calculate an approximate 95 percent confidence interval around the observed odds ratio using the following formulas, which are based on the assumption that the logarithm of the OR is normally distributed:
1. Calculate the standard error of the log of the OR with the following formula: SE = √(1/a + 1/b + 1/c + 1/d)
2. Calculate Q with the following formula: Q = e^(1.96 × SE), where Q is simply a convenient intermediate quantity, which will be used in the next part of the calculation, and e is the mathematical constant 2.718.
3. Find the limits of the confidence interval with the following formulas: lower limit = OR/Q; upper limit = OR × Q
For other confidence levels, replace the 1.96 in Step 2 with the appropriate multiplier shown in Table 10-1 of Chapter 10. So for 50 percent CIs, use 0.67; for 80 percent, use 1.28; for 90 percent, use 1.64; for 98 percent, use 2.33; and for 99 percent, use 2.58.
So for the example in Figure 14-2, you calculate 95 percent CI around the observed odds ratio as follows:
1. SE = √(1/14 + 1/7 + 1/12 + 1/27), which is 0.5785.
2. Q = e^(1.96 × 0.5785), which is 3.11.
3. The confidence interval extends from 4.5/3.11 to 4.5 × 3.11, which is 1.45 to 14.0.
Or, you can enter the four cell counts from Figure 14-2 into the fourfold table web page, and it will calculate the OR as 4.5, with 95 percent confidence limits of 1.27 to 16.5 (using a different formula).
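The same calculation scripts just as easily for the odds ratio. A minimal sketch (function name mine; the web page uses a different CI formula, so its limits differ somewhat):

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and approximate CI from fourfold-table cell counts."""
    oratio = (a * d) / (b * c)                 # (a/b)/(c/d) simplifies to ad/bc
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)      # SE of log(OR)
    q = math.exp(z * se)                       # intermediate quantity Q
    return oratio, oratio / q, oratio * q

# Figure 14-2: a=14, b=7, c=12, d=27
oratio, lo, hi = odds_ratio_ci(14, 7, 12, 27)
# oratio is 4.5, with a 95 percent CI of roughly 1.45 to 14
```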
Evaluating diagnostic procedures
Many diagnostic procedures give a positive or negative test result, which, ideally, should correspond to the true presence or absence of the medical condition being tested for (as determined by some gold standard that’s assumed to be perfectly accurate in diagnosing the condition). But gold standard diagnostic procedures can be time-consuming, expensive, and unpleasant for the patient, so quick, inexpensive, and relatively noninvasive screening tests are very valuable if they’re reasonably accurate.
Most tests produce some false positive results (coming out positive when the condition is truly not present) and some false negative results (coming out negative when the condition truly is present). It’s important to know how well a test performs.
You usually evaluate a proposed screening test for a medical condition by administering the proposed test to a group of subjects, whose true status has been (or will be) determined by the gold standard method. You can then cross-tabulate the test results against the true condition, producing a fourfold table like Figure 14-3.
Figure 14-3: This is how data is summarized when evaluating a proposed diagnostic screening test.
For example, consider a home pregnancy test that’s administered to 100 randomly chosen women who suspect they may be pregnant. This is a natural sampling from a population defined as “all women who think they might be pregnant,” which is the population to whom a home pregnancy test would be marketed. Eventually, their true status becomes known, so it’s cross-tabulated against the test results, giving Figure 14-4.
Figure 14-4: Results from a test of a proposed home pregnancy test: 33 true positives (cell a), 12 false positives (cell b), 4 false negatives (cell c), and 51 true negatives (cell d).
You can easily calculate at least five important characteristics of the home test from this table, as you find out in the following sections.
Overall accuracy
Overall accuracy measures how often a test is right. A perfectly accurate test never produces false positive or false negative results. In Figure 14-4, cells a and d represent correct test results, so the overall accuracy of the home pregnancy test is (a + d)/t. Using the data in Figure 14-4, accuracy = (33 + 51)/100, which is 0.84, or 84 percent.
Sensitivity and specificity
A perfectly sensitive test never produces a false negative result; if the condition is truly present, the test always comes out positive. (In other words, if it’s there, you’ll see it.) So when a perfectly sensitive test comes out negative, you can be sure the person doesn’t have the condition. You calculate sensitivity by dividing the number of true positive cases by the total number of cases where the condition was truly present: a/c1 (that is, true positive/all present). Using the data in Figure 14-4, sensitivity = 33/37, which is 0.89; that means that the home test comes out positive in 89 percent of truly pregnant women.
A perfectly specific test never produces a false positive result; if the condition is truly absent, the test always comes out negative. (In other words, if it’s not there, you won’t see it.) So when a perfectly specific test comes out positive, you can be sure the person has the condition. You calculate specificity by dividing the number of true negative cases by the total number of cases where the condition was truly absent: d/c2 (that is, true negative/all not present). Using the data in Figure 14-4, specificity = 51/63, which is 0.81; that means that the home test comes out negative in 81 percent of truly non-pregnant women.
Sensitivity and specificity are important characteristics of the test itself, but they don't answer the very practical question, "How likely is a particular test result (positive or negative) to be correct?" The answer to that question depends on the prevalence of the condition in the population the test is applied to, and sensitivity and specificity don't take prevalence into account. (Positive predictive value and negative predictive value, explained in the following section, do answer that question, because their values do depend on the prevalence of the condition in the population.)
Positive predictive value and negative predictive value
The positive predictive value (PPV) is the fraction of all positive test results that are true positives (the woman is truly pregnant). If you see it, it’s there! You calculate PPV as a/r1. For the data in Figure 14-4, the PPV is 33/45, which is 0.73. So if the pregnancy test comes out positive, there’s a 73 percent chance that the woman is truly pregnant.
The negative predictive value (NPV) is the fraction of all negative test results that are true negatives (the woman is truly not pregnant). If you don’t see it, it’s not there! You calculate NPV as d/r2. For the data in Figure 14-4, the NPV is 51/55, which is 0.93. So if the pregnancy test comes out negative, there’s a 93 percent chance that the woman is truly not pregnant.
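All five of these indices come straight from the cell counts, so they're easy to script. Here's a Python sketch (the function name and dictionary keys are my own labels, not standard API names):

```python
def screening_indices(a, b, c, d):
    """Accuracy, sensitivity, specificity, PPV, and NPV from a fourfold table
    laid out like Figure 14-3 (rows: test +/-; columns: condition +/-)."""
    r1, r2 = a + b, c + d      # test-positive and test-negative totals
    c1, c2 = a + c, b + d      # condition-present and condition-absent totals
    t = r1 + r2
    return {
        "accuracy":    (a + d) / t,  # all correct results / all subjects
        "sensitivity": a / c1,       # true positives / all truly present
        "specificity": d / c2,       # true negatives / all truly absent
        "ppv":         a / r1,       # true positives / all positive tests
        "npv":         d / r2,       # true negatives / all negative tests
    }

# Figure 14-4: a=33, b=12, c=4, d=51
ix = screening_indices(33, 12, 4, 51)
# accuracy 0.84, sensitivity 0.89, specificity 0.81, PPV 0.73, NPV 0.93
```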
Investigating treatments
One of the simplest ways to investigate the effectiveness of some treatment (drug, surgical procedure, and so on) is to study a sample of subjects with the target condition (obesity, hypertension, diabetes, and so on) and randomly assign some of them to receive the proposed treatment and some of them to receive a placebo or sham treatment. Then observe whether the treatment helped the subject. Of course, placebos help many subjects, so you need to compare the fraction of successful outcomes between the two groups of subjects.
Suppose you study 200 subjects with arthritis, randomize them so that 100 receive an experimental drug and 100 receive a placebo, and record whether each subject felt that the product helped their arthritis. You tabulate the results in a fourfold table, like Figure 14-5.
Figure 14-5: Comparing a treatment to a placebo: 70 of 100 drug-group subjects reported improvement (cell a), versus 50 of 100 placebo-group subjects (cell c).
Seventy percent of subjects taking the new drug report that it helped their arthritis, which is quite impressive until you see that 50 percent of subjects who received the placebo also reported improvement. (Pain studies are notorious for showing very strong placebo effects.) Nevertheless, a Yates chi-square or Fisher Exact test (see Chapter 13) shows that the drug helped a significantly greater fraction of the time than the placebo (p = 0.006 by either test).
But how do you quantify the amount of improvement? You can calculate a couple of useful effect-size indices from this fourfold table, as you find out in the following sections.
Difference in proportion
One very simple and obvious number is the between-group difference in the fraction of subjects helped: a/r1 – c/r2. For the numbers in Figure 14-5, the difference = 70/100 – 50/100, which is 0.7 – 0.5 = 0.2, or a 20 percent superiority in the proportion of subjects helped by the drug relative to the placebo.
You can calculate (approximately) the standard error (SE) of the difference as SE = √[p1(1 – p1)/r1 + p2(1 – p2)/r2], where p1 = a/r1 and p2 = c/r2. For the data in Figure 14-5, SE = √[(0.7 × 0.3)/100 + (0.5 × 0.5)/100], which is 0.0678, so you'd report the difference in proportion helped as 0.20 ± 0.0678.
You obtain the 95 percent CI around the difference by adding and subtracting 1.96 times the SE, which gives 0.2 – 1.96 × 0.0678, and 0.2 + 1.96 × 0.0678, for a 95 percent CI of 0.067 to 0.333.
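This calculation is a one-liner to script. Here's a hedged Python sketch (function name and argument order are mine):

```python
import math

def diff_in_proportions(a, r1, c, r2, z=1.96):
    """Difference in proportions, its approximate SE, and CI limits."""
    p1, p2 = a / r1, c / r2
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / r1 + p2 * (1 - p2) / r2)
    return diff, se, diff - z * se, diff + z * se

# Figure 14-5: 70 of 100 helped on the drug, 50 of 100 on placebo
diff, se, lo, hi = diff_in_proportions(70, 100, 50, 100)
# diff 0.20, SE about 0.0678, 95 percent CI about 0.067 to 0.333
```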
Number needed to treat
The number needed to treat (NNT) is an interesting number that physicians love. Basically, it answers the very practical question, “How many subjects would I have to treat with the new drug before helping, on average, one additional subject beyond those who would have been helped even by a placebo?” This number turns out to be simply the reciprocal of the difference in the proportions helped (ignoring the sign of the difference), which I describe in the preceding section: NNT = 1/|Diff|. So for the example in Figure 14-5, NNT = 1/0.2, or 5 subjects.
The SE of the NNT isn’t particularly useful because NNT has a very skewed sampling distribution. You can obtain the confidence limits around NNT by taking the reciprocals of the confidence limits for Diff (and swapping the lower and upper limits). So the 95 percent confidence limits for NNT are 1/0.333 and 1/0.067, which is 3.0 to 15, approximately.
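The NNT arithmetic is just reciprocals. A tiny Python sketch, plugging in the rounded difference and CI limits worked out above:

```python
# Number needed to treat: reciprocal of the absolute difference in proportions.
# Its CI limits are the reciprocals of the difference's CI limits, swapped.
diff, lo_diff, hi_diff = 0.2, 0.067, 0.333  # values from the preceding section

nnt = 1 / abs(diff)                         # 5 subjects
nnt_lo, nnt_hi = 1 / hi_diff, 1 / lo_diff   # roughly 3.0 to 15
```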
Looking at inter- and intra-rater reliability
Many measurements in biological and sociological research are obtained by the subjective judgment of humans. Examples include the reading of X-rays, CAT scans, ECG tracings, ultrasound images, biopsy specimens, and audio and video recordings of subject behavior in various situations. The human may make quantitative measurements (like the length of a bone on an ultrasound image) or categorical ratings (like the presence or absence of some atypical feature on an ECG tracing).
You need to know how consistent such ratings are among different raters reading the same thing (inter-rater reliability) and how reproducible the ratings are for one rater reading the same thing multiple times (intra-rater reliability).
When considering a binary reading (like yes or no) between two raters, you can estimate inter-rater reliability by having each rater read the same batch of, say, 50 specimens, and then cross-tabbing the results, as in Figure 14-6.
Cell a contains a count of how many specimens were rated yes by Rater 1 and yes by Rater 2; cell b counts how many specimens were rated yes by Rater 1 but no by Rater 2; and so on.
Figure 14-6: Results of two raters reading the same set of 50 specimens and rating each specimen yes or no: 22 specimens rated yes by both raters (cell a), 5 rated yes by Rater 1 only (cell b), 7 rated yes by Rater 2 only (cell c), and 16 rated no by both (cell d).
You can construct a similar table for estimating intra-rater reliability by having one rater read the same batch of specimens on two separate occasions; in this case, you’d replace the word Rater with Reading in the row and column labels.
The usual index of agreement for this kind of table is Cohen's Kappa (κ), which measures how much the observed agreement exceeds what random guessing alone would produce. For a fourfold table, you can calculate it as κ = 2(ad – bc)/(r1c2 + r2c1). For perfect agreement, κ = 1; for completely random ratings (indicating no rating ability whatsoever), κ = 0. Random sampling fluctuations can actually cause κ to be negative. Like the student taking a true/false test, where the number of wrong answers is subtracted from the number of right answers to compensate for guessing, getting a score less than zero indicates the interesting combination of being stupid and unlucky!
For the data in Figure 14-6: κ = 2(22 × 16 – 5 × 7)/(27 × 21 + 23 × 29), which is 0.5138, indicating only fair agreement between the two raters.
You won't find any simple formulas for calculating SEs or CIs for kappa, but the fourfold-table web page (StatPages.info/ctab2x2.html) provides approximate CIs for Cohen's Kappa. For the preceding example, the 95 percent CI is 0.202 to 0.735.
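To close, the kappa formula itself is easy to script from the cell counts. A minimal Python sketch (function name mine):

```python
def cohens_kappa(a, b, c, d):
    """Cohen's kappa for a fourfold agreement table (Figure 14-6 layout)."""
    r1, r2 = a + b, c + d   # Rater 1's yes and no totals
    c1, c2 = a + c, b + d   # Rater 2's yes and no totals
    # 2(ad - bc)/(r1*c2 + r2*c1), equivalent to (po - pe)/(1 - pe) for 2x2
    return 2 * (a * d - b * c) / (r1 * c2 + r2 * c1)

# Figure 14-6: a=22, b=5, c=7, d=16
kappa = cohens_kappa(22, 5, 7, 16)  # about 0.514
```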