Chapter 9
Aiming for Accuracy and Precision
In This Chapter
Starting with accuracy and precision fundamentals
Boosting accuracy and precision
Determining standard errors for a variety of statistics
A very wise scientist once said, “A measurement whose accuracy is completely unknown has no use whatever.” Whenever you’re reporting a numerical result (and as a researcher, you report numerical results all the time), you must include, along with the numerical value, some indication of how good that value is. A good numeric result is both accurate and precise. In this chapter, I describe what accuracy and precision are, how you can improve the accuracy and precision of your results, and how you can express quantitatively just how precise your results are.
Beginning with the Basics of Accuracy and Precision
Before you read any further, make sure you’ve looked at the Statistical Estimation Theory section of Chapter 3, which gives an example introducing the concepts of accuracy and precision and the difference between them. In a nutshell: Accuracy refers to how close your numbers come to the true values; precision refers to how close your numbers come to each other. In this section, I define accuracy and precision more formally in terms of concepts like sample statistic, population parameter, and sampling distribution.
Getting to know sample statistics and population parameters
Scientists conduct experiments on limited samples of subjects in order to draw conclusions that (they hope) are valid for a large population of people. Suppose you want to conduct an experiment to determine some quantity of interest. For example, you may have a scientific interest in one of these questions:
What is the average fasting blood glucose concentration in adults with diabetes?
What percent of children like chocolate?
How much does blood urea nitrogen (BUN) tend to increase (or decrease) with every additional year after age 60?
To get exact answers to questions like these, you’d have to examine every adult diabetic, or every child, or every person over age 60. But you can’t examine every person in the population; you have to study a relatively small sample of subjects, in a clinical trial or a survey.
Understanding accuracy and precision in terms of the sampling distribution
Imagine a scenario in which an experiment (like a clinical trial or a survey) is carried out over and over again an enormous number of times, each time on a different random sample of subjects. Using the “percent of kids who like chocolate” example, each experiment could consist of interviewing 50 randomly chosen children and reporting what percentage of kids in that sample said that they liked chocolate. Repeating that entire experiment N times (and supposing that N is up in the millions) would require a lot of scientists, take a lot of time, and cost a lot of money, but suppose that you could actually do it. For each repetition of the experiment, you’d get some particular value for the sample statistic you were interested in (the percent of kids in that sample who like chocolate), and you’d write this number down on a (really big) piece of paper.
After conducting your experiment N times, you'd have a huge set of values for the sample statistic (that is, the percent of kids who like chocolate). You could then calculate the mean of those values by adding them up and dividing by N. And you could calculate the standard deviation by subtracting the mean from each value, squaring each difference, adding up the squares, dividing by N – 1, and then taking the square root. And you could construct a histogram of the N percentage values to see how they were spread out, as described in Chapter 8.
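The thought experiment above can be simulated on a computer. The following Python sketch assumes, purely for illustration, that the true proportion of chocolate-lovers is 70 percent and that each experiment interviews 50 children; it repeats the experiment 10,000 times (far fewer than "millions," but enough to see the idea) and computes the mean and SD of the resulting percentages exactly as described:

```python
import random

random.seed(0)

TRUE_P = 0.7            # assumed true fraction of kids who like chocolate
SAMPLE_SIZE = 50        # children interviewed per experiment
N_EXPERIMENTS = 10_000  # number of repetitions of the whole experiment

# Each experiment: interview 50 random children, record the sample percentage.
percents = []
for _ in range(N_EXPERIMENTS):
    likes = sum(random.random() < TRUE_P for _ in range(SAMPLE_SIZE))
    percents.append(100 * likes / SAMPLE_SIZE)

# Mean and SD of the sampling distribution, computed as the text describes.
mean = sum(percents) / N_EXPERIMENTS
sd = (sum((x - mean) ** 2 for x in percents) / (N_EXPERIMENTS - 1)) ** 0.5

print(round(mean, 1))  # lands close to the true 70 percent (accuracy)
print(round(sd, 1))    # spread of the sampling distribution (precision)
```

The mean of the percentages comes out near the true 70 percent, and their SD (the width of the sampling distribution) is the precision of a single 50-child survey.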
Accuracy refers to how close your observed sample statistic comes to the true population parameter, or more formally, how close the mean of the sampling distribution is to the mean of the population distribution. For example, how close is the mean of all your percentage values to the true percentage of children who like chocolate?
Precision refers to how close your replicate values of the sample statistic are to each other, or more formally, how wide the sampling distribution is, which can be expressed as the standard deviation of the sampling distribution. For example, what is the standard deviation of your big collection of percentage values?
Thinking of measurement as a kind of sampling
No measuring instrument (ruler, scale, voltmeter, hematology analyzer, and so on) is perfect, so questions of measurement accuracy and precision are just as relevant as questions of sampling accuracy and precision. In fact, statisticians think of measuring as a kind of sampling process. This analogy may seem like quite a stretch, but it lets them analyze measurement errors using the same concepts, terminology, and mathematical techniques that they use to analyze sampling errors.
For example, suppose you happen to weigh exactly 86.73839 kilograms at this very moment. If you were to step onto a bathroom scale (the old kind, with springs and a dial), it certainly wouldn’t show exactly that weight. And if you were to step off the scale and then on it again, it might not show exactly the same weight as the first time. A set of repeated weighings would differ from your true weight — and they’d differ from each other — for any of many reasons. For example, maybe you couldn’t read the dial that precisely, the scale was miscalibrated, you shifted your weight slightly, or you stood in a slightly different spot on the platform each time.
You can consider your measured weight to be a number randomly drawn from a hypothetical population of possible weights that the scale might produce if the same person were to be weighed repeatedly on it. If you weigh yourself a thousand times, those 1,000 numbers will be spread out into a sampling distribution that describes the accuracy and precision of the process of measuring your weight with that particular bathroom scale.
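A quick simulation can make this concrete. The sketch below uses hypothetical numbers for the scale's flaws (a fixed miscalibration of 0.8 kg and random reading noise with an SD of 0.5 kg — both invented for illustration) and generates 1,000 weighings. The mean of the readings is off from the true weight by about the bias (inaccuracy), while their SD reflects the random fluctuations (imprecision):

```python
import random

random.seed(1)

TRUE_WEIGHT = 86.73839  # kilograms, the "true" value from the text

# Hypothetical flaws of the bathroom scale (assumed numbers for illustration):
BIAS = 0.8      # systematic error: miscalibration shifts every reading up
NOISE_SD = 0.5  # random error: dial-reading, stance, spring jitter, etc.

readings = [TRUE_WEIGHT + BIAS + random.gauss(0, NOISE_SD) for _ in range(1000)]

mean = sum(readings) / len(readings)
sd = (sum((x - mean) ** 2 for x in readings) / (len(readings) - 1)) ** 0.5

print(round(mean - TRUE_WEIGHT, 2))  # close to the 0.8 kg bias (inaccuracy)
print(round(sd, 2))                  # close to the 0.5 kg noise (imprecision)
```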
Expressing errors in terms of accuracy and precision
In the preceding section, I explain the difference between accuracy and precision. In the following sections, I describe what can cause your results to be inaccurate and what can cause them to be imprecise.
Inaccuracy comes from systematic errors
Inaccuracy results from the effects of systematic errors — those that tend to affect all replications the same way — leading to a biased result (one that’s off in a definite direction). These errors can arise in sampling and in measuring.
Systematic errors in a clinical study can result from causes such as the following:
Enrolling subjects who are not representative of the population that you want to draw conclusions about, either through incorrect inclusion/exclusion criteria (such as wanting to draw conclusions that apply to males and females but enrolling only males) or through inappropriate advertising (for example, putting a notice in a newspaper, on the web, or on a college cafeteria bulletin board that only part of the target population ever looks at)
Human error (mistakes) such as recording lab results in the wrong units (entering all glucose values as milligrams per deciliter [mg/dL] when the protocol calls for millimoles per liter [mmol/L]) or administering the wrong product to the subject (giving a placebo to a subject who should have gotten the real product)
Systematic errors in a measurement can result from the following types of circumstances:
Physical changes occur in the measuring instrument (for example, wooden rulers might shrink and scale springs might get stiff with age).
The measuring instrument is used improperly (for example, the balance isn’t zeroed before weighing).
The measuring instrument is poorly calibrated (or not calibrated at all).
The operator makes mistakes (such as using the wrong reagents in an analyzer).
Imprecision comes from random errors
Imprecision results from the effects of random fluctuations — those that tend to be unpredictable — and can affect each replication differently.
Sampling imprecision (as, for example, in a clinical trial) arises from several sources:
Subject-to-subject variability (for example, different subjects have different weights, different blood pressure, and different tendencies to respond to a treatment)
Within-subject variability (for example, one person’s blood pressure, recorded every 15 minutes, will show random variability from one reading to another because of the combined action of a large number of internal factors, such as stress, and external factors, like activity, noise, and so on)
Random sampling errors (inherent in the random sampling process itself)
Measurement imprecision arises from the combined effects of a large number of individual, uncontrolled factors, such as
Environmental factors (like temperature, humidity, mechanical vibrations, voltage fluctuations, and so on)
Physically induced randomness (such as electrical noise or static, or nuclear decay in assay methods using radioactive isotopes)
Operator variability (for example, reading a scale from a slightly different angle or estimating digits between scale markings)
Improving Accuracy and Precision
While perfect accuracy and precision will always be an unattainable ideal, you can take steps to minimize the effects of systematic errors and random fluctuations on your sampled and measured data.
Enhancing sampling accuracy
You improve sampling accuracy by eliminating sources of bias in the selection of subjects for your study. The study’s inclusion criteria should ideally define the population you want your study’s conclusions to apply to. If you want your conclusions to apply to all adult diabetics, for example, your inclusion criteria may state that subjects must be 18 years or older and must have a definitive clinical diagnosis of diabetes mellitus, as confirmed by a glucose tolerance test. The study’s exclusion criteria should be limited to only those conditions and situations that make it impossible for a subject to safely participate in the study and provide usable data for analysis.
You also want to try to select subjects as broadly and evenly as possible from the total target population. This task may be difficult or even impossible (it’s almost impossible to obtain a representative sample from a worldwide population). But the scientific validity of a study depends on having as representative a sample as possible, so you should sample as wide a geographic region as is practically feasible.
Getting more accurate measurements
Measurement accuracy very often becomes a matter of properly calibrating an instrument against known standards. The instrument may be as simple as a ruler or as complicated as a million-dollar analyzer, but the principles are the same. They generally involve the following steps:
1. Acquire one or more known standards from a reliable source.
Known standards are generally prepared and certified by an organization or a company that you have reason to believe has much more accurate instruments than you do, such as the National Institute of Standards and Technology (NIST) or a well-respected company like Hewlett-Packard or Fisher Scientific.
If you’re calibrating a blood glucose analyzer, for example, you need to acquire a set of glucose solutions whose concentrations are known with great accuracy, and can be taken as “true” concentration values (perhaps five vials, with glucose values of 50, 100, 200, 400, and 800 mg/dL).
2. Run your measuring process or assay, using your instrument, on those standards; record the instrument’s results, along with the “true” values.
Continuing with the glucose example, you might split each vial into four aliquots (portions), and run these 20 specimens through the analyzer.
3. Plot your instrument’s readings against the true values and fit the best line possible to that data.
You’d plot the results of the analysis of the standards as 20 points on a scattergram, with the true value from the standards provider (GlucTrue) on the X axis, and the instrument’s results (GlucInstr) on the Y axis. The best line may not be a straight line, so you may have to do some nonlinear curve-fitting (I describe how to do this in Chapter 21).
4. Use that fitted line to convert your instrument’s readings into the values you report. (You have to do some algebra to rearrange the formula to calculate the X value from the Y value.)
Suppose the fitted equation from Step 3 was GlucInstr = 1.54 + 0.9573 × GlucTrue. With a little algebra, this equation can be rearranged to GlucTrue = (GlucInstr – 1.54)/0.9573. If you were to run a patient’s specimen through that instrument and get a value of 200.0, you’d use the calibration equation to get the corrected value: (200 – 1.54)/0.9573, which works out to 207.3, the value you’d report for this specimen.
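The four calibration steps can be sketched in a few lines of Python. The instrument readings below are hypothetical, generated to lie exactly on the calibration line quoted in Step 4, so the least-squares fit recovers that line and the inversion reproduces the worked example:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of a straight line; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

# Step 1: true values of the five glucose standards (mg/dL).
gluc_true = [50, 100, 200, 400, 800]

# Step 2: assumed instrument readings, generated here from the text's
# calibration line (a real run would give 20 noisy readings instead).
gluc_instr = [1.54 + 0.9573 * g for g in gluc_true]

# Step 3: fit the line GlucInstr = intercept + slope * GlucTrue.
intercept, slope = fit_line(gluc_true, gluc_instr)

# Step 4: invert the fitted line to correct a patient reading.
def corrected(reading):
    return (reading - intercept) / slope

print(round(corrected(200.0), 1))  # 207.3, matching the worked example
```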
If done properly, this process can effectively remove almost all systematic errors from your measurements, resulting in very accurate measurements.
Improving sampling precision
You improve the precision of anything you observe from your sample of subjects by having a larger sample. The central limit theorem (or CLT, one of the foundations of probability theory) describes how random fluctuations behave when a bunch of random variables are added (or averaged) together. Among many other things, the CLT describes how the precision of a sample statistic depends on the sample size.
You can also get better precision (and smaller SEs) by setting up your experiment in a way that lessens random variability in the population. For example, if you want to compare a weight-loss product to a placebo, you should try to have the two treatment groups in your trial as equally balanced as possible with respect to every subject characteristic that can conceivably influence weight loss. Identical twins make ideal (though hard-to-find) subjects for clinical trials because they’re so closely matched in so many ways. Alternatively, you can make your inclusion criteria more stringent. For example, you can restrict the study groups to just males within a narrow age, height, and weight range and impose other criteria that eliminate other sources of between-subject variability (such as history of smoking, hypertension, nervous disorders, and so on). But tightening your inclusion criteria this way has two drawbacks:

It makes finding suitable subjects harder.

Your inferences (conclusions) from the study can be applied only to the narrower population that corresponds to your more stringent inclusion criteria.
Increasing the precision of your measurements
Here are a few general suggestions for achieving better precision (smaller random errors) in your measurements:
Use the most precise measuring instruments you can afford. For example, a beam balance may yield more precise measurements than a spring scale, and an electronic balance may be even more precise.
Control as many sources of random fluctuations due to external perturbations as you can. Depending on how the measuring device operates, a reading can be influenced by temperature, humidity, mechanical vibrations, electrical power fluctuations, and a host of other environmental factors. Operator technique also contributes to random variability in readings.
When reading an instrument with an analog display, like a dial or linear scale (as opposed to a computerized device with a digital readout), try to interpolate (estimate an extra digit) between the divisions and record the number with that extra decimal place. So if you’re weighing someone on a scale with a large rotary dial with lines spaced every kilogram, try to estimate the position of the dial pointer to the nearest tenth of a kilogram.
Make replicate readings and average them. This technique is one of the most widely applicable (see the next section for more information).
Calculating Standard Errors for Different Sample Statistics
As I mention in the earlier section Imprecision comes from random errors, the standard error (SE) is just the standard deviation (SD) of the sampling distribution of the numbers that you get from measuring the same thing over and over again. But you don’t necessarily have to carry out this repetitive process in practice. You can usually estimate the SE of a sample statistic obtained from a single experiment by using a formula appropriate for the sample statistic. The following sections describe how to calculate the SE for various kinds of sample statistics.
A mean
From the central limit theorem (see the earlier section Improving sampling precision for details), the SE of the mean of N numbers (SEMN) is related to the standard deviation (SD) of the numbers by the formula SEMN = SD/√N. So if you study 25 adult diabetics and find that they have an average fasting blood glucose level of 130 milligrams per deciliter (mg/dL) with an SD of ±40 mg/dL, you can say that your estimate of the mean has a precision (SE) of ±40/√25, which is equal to ±40/5, or ±8 mg/dL.
Making three or four replicates of each measurement and then reporting the average of those replicates is typical practice in laboratory research (though less common in clinical studies). Applied to measurements, the central limit theorem tells you that the SEMN is more precise than any one individual measurement (SE1) by a factor of the square root of N: SEMN = SE1/√N.
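The SE-of-the-mean formula can be checked in a couple of lines of Python, applied to the glucose example:

```python
def se_of_mean(sd, n):
    """SE of the mean of n values: the SD divided by the square root of n."""
    return sd / n ** 0.5

# 25 adult diabetics with an SD of 40 mg/dL:
print(se_of_mean(40, 25))  # 8.0 mg/dL, matching the text
```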
A proportion
If you were to survey 100 typical children and find that 70 of them like chocolate, you’d estimate that 70 percent of children like chocolate. How precise is that estimated 70-percent figure?
Based on the properties of the binomial distribution (see Chapters 3 and 25), which generally describes observed proportions of this type, the standard error (SE) of an observed proportion (p), based on a sample size of N, is given by this approximate formula:

SE = √(p × (1 – p)/N)
For small values of N, this formula underestimates SE, but for N = 10 or more, the approximation is very good.
Plugging in the numbers, the SE of the observed 70 percent (0.7 proportion) in 100 children is √(0.7 × 0.3/100) = √0.0021, which is about 0.046, or 4.6 percent.
So you report the percentage of children who like chocolate as 70 percent ± 4.6 percent, being sure to state that the ± number is the standard error of the percentage.
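The proportion formula can be checked in a couple of lines of Python, applied to the chocolate survey:

```python
def se_of_proportion(p, n):
    """Approximate SE of an observed proportion p, based on n subjects."""
    return (p * (1 - p) / n) ** 0.5

# 70 of 100 children like chocolate (p = 0.7, N = 100):
print(round(100 * se_of_proportion(0.7, 100), 1))  # 4.6 percent
```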
Event counts and rates
Closely related to the binomial case in the preceding section (where you have some number of events observed out of some number of opportunities for the event to occur) is the case of the observed number of sporadic events over some interval of time or space.
For example, suppose that there were 36 fatal highway accidents in your county in the last three months. If that’s the only safety data you have to go on, then your best estimate of the monthly fatal accident rate is simply that observed count divided by the length of time during which they were observed: 36/3, or 12.0 fatal accidents per month. How precise is that estimate?
Based on the properties of the Poisson distribution (see Chapters 3 and 25), which generally describes the sporadic occurrences of independent events, the standard error (SE) of an event rate (R), based on the occurrence of N events in T units of time, is given by this approximate formula:

SE = √N/T
For small values of N, this formula underestimates the SE, but for an N of ten or more, the approximation is very good.
Plugging in the numbers, the SE for an observed count of 36 fatalities (N) in 3 months (T) is √36/3, which is 6/3, or 2.0 fatal accidents per month.
So you would report that the estimated rate is 12.0 ± 2.0 fatal accidents per month, being sure to state that the ± number is the SE of the monthly rate.
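The rate formula can be checked in a couple of lines of Python, applied to the accident data:

```python
def se_of_rate(n_events, t):
    """Approximate SE of an event rate: the square root of N divided by T."""
    return n_events ** 0.5 / t

# 36 fatal accidents observed in 3 months:
rate = 36 / 3
print(rate, se_of_rate(36, 3))  # 12.0 and 2.0 fatal accidents per month
```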
A regression coefficient
Suppose you’re interested in whether or not blood urea nitrogen (BUN), a measure of kidney performance, tends to naturally increase after age 60 in generally healthy adults. You enroll a bunch of generally healthy adults age 60 and above, record their ages, and measure their BUN. Next, you create a scatter plot of BUN versus age, and then you fit a straight line to the data points, using regression analysis (see Chapter 18 for details). The regression analysis gives you two regression coefficients: the slope and the intercept of the fitted straight line. The slope of this line has units of (mg/dL)/year, and tells you how much, on average, a healthy person’s BUN goes up with every additional year of age after age 60. Suppose the answer you get is a 1.4 mg/dL glucose increase per year. How precise is that estimate of yearly increase?