Chapter 9
Aiming for Accuracy and Precision
In This Chapter
Starting with accuracy and precision fundamentals
Boosting accuracy and precision
Determining standard errors for a variety of statistics
A very wise scientist once said, “A measurement whose accuracy is completely unknown has no use whatever.” Whenever you’re reporting a numerical result (and as a researcher, you report numerical results all the time), you must include, along with the numerical value, some indication of how good that value is. A good numeric result is both accurate and precise. In this chapter, I describe what accuracy and precision are, how you can improve the accuracy and precision of your results, and how you can express quantitatively just how precise your results are.
Beginning with the Basics of Accuracy and Precision
Before you read any further, make sure you’ve looked at the Statistical Estimation Theory section of Chapter 3, which gives an example introducing the concepts of accuracy and precision and the difference between them. In a nutshell: Accuracy refers to how close your numbers come to the true values; precision refers to how close your numbers come to each other. In this section, I define accuracy and precision more formally in terms of concepts like sample statistic, population parameter, and sampling distribution.
Getting to know sample statistics and population parameters
Scientists conduct experiments on limited samples of subjects in order to draw conclusions that (they hope) are valid for a large population of people. Suppose you want to conduct an experiment to determine some quantity of interest. For example, you may have a scientific interest in one of these questions:
What is the average fasting blood glucose concentration in adults with diabetes?
What percent of children like chocolate?
How much does blood urea nitrogen (BUN) tend to increase (or decrease) with every additional year after age 60?
To get exact answers to questions like these, you’d have to examine every adult diabetic, or every child, or every person over age 60. But you can’t examine every person in the population; you have to study a relatively small sample of subjects, in a clinical trial or a survey.
Understanding accuracy and precision in terms of the sampling distribution
Imagine a scenario in which an experiment (like a clinical trial or a survey) is carried out over and over again an enormous number of times, each time on a different random sample of subjects. Using the “percent of kids who like chocolate” example, each experiment could consist of interviewing 50 randomly chosen children and reporting what percentage of kids in that sample said that they liked chocolate. Repeating that entire experiment N times (and supposing that N is up in the millions) would require a lot of scientists, take a lot of time, and cost a lot of money, but suppose that you could actually do it. For each repetition of the experiment, you’d get some particular value for the sample statistic you were interested in (the percent of kids in that sample who like chocolate), and you’d write this number down on a (really big) piece of paper.
After conducting your experiment N times, you'd have a huge set of values for the sample statistic (that is, the percent of kids who like chocolate). You could then calculate the mean of those values by adding them up and dividing by N. And you could calculate the standard deviation by subtracting the mean from each value, squaring each difference, adding up the squares, dividing by N – 1, and then taking the square root. And you could construct a histogram of the N percentage values to see how they were spread out, as described in Chapter 8.
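The thought experiment above can be simulated on a computer. The following Python sketch assumes, purely for illustration, that the true proportion of chocolate-lovers is 70 percent and that each experiment interviews 50 children; it repeats the experiment 10,000 times (far fewer than "millions," but enough to see the idea) and computes the mean and SD of the resulting percentages exactly as described:

```python
import random

random.seed(0)

TRUE_P = 0.7            # assumed true fraction of kids who like chocolate
SAMPLE_SIZE = 50        # children interviewed per experiment
N_EXPERIMENTS = 10_000  # number of repetitions of the whole experiment

# Each experiment: interview 50 random children, record the sample percentage.
percents = []
for _ in range(N_EXPERIMENTS):
    likes = sum(random.random() < TRUE_P for _ in range(SAMPLE_SIZE))
    percents.append(100 * likes / SAMPLE_SIZE)

# Mean and SD of the sampling distribution, computed as the text describes.
mean = sum(percents) / N_EXPERIMENTS
sd = (sum((x - mean) ** 2 for x in percents) / (N_EXPERIMENTS - 1)) ** 0.5

print(round(mean, 1))  # lands close to the true 70 percent (accuracy)
print(round(sd, 1))    # spread of the sampling distribution (precision)
```

The mean of the percentages comes out near the true 70 percent, and their SD (the width of the sampling distribution) is the precision of a single 50-child survey.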
Accuracy refers to how close your observed sample statistic comes to the true population parameter, or more formally, how close the mean of the sampling distribution is to the mean of the population distribution. For example, how close is the mean of all your percentage values to the true percentage of children who like chocolate?
Precision refers to how close your replicate values of the sample statistic are to each other, or more formally, how wide the sampling distribution is, which can be expressed as the standard deviation of the sampling distribution. For example, what is the standard deviation of your big collection of percentage values?
Thinking of measurement as a kind of sampling
No measuring instrument (ruler, scale, voltmeter, hematology analyzer, and so on) is perfect, so questions of measurement accuracy and precision are just as relevant as questions of sampling accuracy and precision. In fact, statisticians think of measuring as a kind of sampling process. This analogy may seem like quite a stretch, but it lets them analyze measurement errors using the same concepts, terminology, and mathematical techniques that they use to analyze sampling errors.
For example, suppose you happen to weigh exactly 86.73839 kilograms at this very moment. If you were to step onto a bathroom scale (the old kind, with springs and a dial), it certainly wouldn’t show exactly that weight. And if you were to step off the scale and then on it again, it might not show exactly the same weight as the first time. A set of repeated weighings would differ from your true weight — and they’d differ from each other — for any of many reasons. For example, maybe you couldn’t read the dial that precisely, the scale was miscalibrated, you shifted your weight slightly, or you stood in a slightly different spot on the platform each time.
You can consider your measured weight to be a number randomly drawn from a hypothetical population of possible weights that the scale might produce if the same person were to be weighed repeatedly on it. If you weigh yourself a thousand times, those 1,000 numbers will be spread out into a sampling distribution that describes the accuracy and precision of the process of measuring your weight with that particular bathroom scale.
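A quick simulation can make this concrete. The sketch below uses hypothetical numbers for the scale's flaws (a fixed miscalibration of 0.8 kg and random reading noise with an SD of 0.5 kg — both invented for illustration) and generates 1,000 weighings. The mean of the readings is off from the true weight by about the bias (inaccuracy), while their SD reflects the random fluctuations (imprecision):

```python
import random

random.seed(1)

TRUE_WEIGHT = 86.73839  # kilograms, the "true" value from the text

# Hypothetical flaws of the bathroom scale (assumed numbers for illustration):
BIAS = 0.8      # systematic error: miscalibration shifts every reading up
NOISE_SD = 0.5  # random error: dial-reading, stance, spring jitter, etc.

readings = [TRUE_WEIGHT + BIAS + random.gauss(0, NOISE_SD) for _ in range(1000)]

mean = sum(readings) / len(readings)
sd = (sum((x - mean) ** 2 for x in readings) / (len(readings) - 1)) ** 0.5

print(round(mean - TRUE_WEIGHT, 2))  # close to the 0.8 kg bias (inaccuracy)
print(round(sd, 2))                  # close to the 0.5 kg noise (imprecision)
```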
Expressing errors in terms of accuracy and precision
In the preceding section, I explain the difference between accuracy and precision. In the following sections, I describe what can cause your results to be inaccurate and what can cause them to be imprecise.
Inaccuracy comes from systematic errors
Inaccuracy results from the effects of systematic errors — those that tend to affect all replications the same way — leading to a biased result (one that’s off in a definite direction). These errors can arise in sampling and in measuring.
Systematic errors in a clinical study can result from causes such as the following:
Enrolling subjects who are not representative of the population that you want to draw conclusions about, either through incorrect inclusion/exclusion criteria (such as wanting to draw conclusions that apply to males and females but enrolling only males) or through inappropriate advertising (for example, putting a notice in a newspaper, on the web, or on a college cafeteria bulletin board that only part of the target population ever looks at)
Human error (mistakes) such as recording lab results in the wrong units (entering all glucose values as milligrams per deciliter [mg/dL] when the protocol calls for millimoles per liter [mmol/L]) or administering the wrong product to the subject (giving a placebo to a subject who should have gotten the real product)
Systematic errors in a measurement can result from the following types of circumstances:
Physical changes occur in the measuring instrument (for example, wooden rulers might shrink and scale springs might get stiff with age).
The measuring instrument is used improperly (for example, the balance isn’t zeroed before weighing).
The measuring instrument is poorly calibrated (or not calibrated at all).
The operator makes mistakes (such as using the wrong reagents in an analyzer).
Imprecision comes from random errors
Imprecision results from the effects of random fluctuations — those that tend to be unpredictable — and can affect each replication differently.
Sampling imprecision (as, for example, in a clinical trial) arises from several sources:
Subject-to-subject variability (for example, different subjects have different weights, different blood pressure, and different tendencies to respond to a treatment)
Within-subject variability (for example, one person’s blood pressure, recorded every 15 minutes, will show random variability from one reading to another because of the combined action of a large number of internal factors, such as stress, and external factors, like activity, noise, and so on)
Random sampling errors (inherent in the random sampling process itself)
Measurement imprecision arises from the combined effects of a large number of individual, uncontrolled factors, such as
Environmental factors (like temperature, humidity, mechanical vibrations, voltage fluctuations, and so on)
Physically induced randomness (such as electrical noise or static, or nuclear decay in assay methods using radioactive isotopes)
Operator variability (for example, reading a scale from a slightly different angle or estimating digits between scale markings)
Improving Accuracy and Precision
While perfect accuracy and precision will always be an unattainable ideal, you can take steps to minimize the effects of systematic errors and random fluctuations on your sampled and measured data.
Enhancing sampling accuracy
You improve sampling accuracy by eliminating sources of bias in the selection of subjects for your study. The study’s inclusion criteria should ideally define the population you want your study’s conclusions to apply to. If you want your conclusions to apply to all adult diabetics, for example, your inclusion criteria may state that subjects must be 18 years or older and must have a definitive clinical diagnosis of diabetes mellitus, as confirmed by a glucose tolerance test. The study’s exclusion criteria should be limited to only those conditions and situations that make it impossible for a subject to safely participate in the study and provide usable data for analysis.
You also want to try to select subjects as broadly and evenly as possible from the total target population. This task may be difficult or even impossible (it’s almost impossible to obtain a representative sample from a worldwide population). But the scientific validity of a study depends on having as representative a sample as possible, so you should sample as wide a geographic region as is practically feasible.
Getting more accurate measurements
Measurement accuracy very often becomes a matter of properly calibrating an instrument against known standards. The instrument may be as simple as a ruler or as complicated as a million-dollar analyzer, but the principles are the same. They generally involve the following steps:
1. Acquire one or more known standards from a reliable source.
Known standards are generally prepared and certified by an organization or a company that you have reason to believe has much more accurate instruments than you do, such as the National Institute of Standards and Technology (NIST) or a well-respected company like Hewlett-Packard or Fisher Scientific.
If you’re calibrating a blood glucose analyzer, for example, you need to acquire a set of glucose solutions whose concentrations are known with great accuracy, and can be taken as “true” concentration values (perhaps five vials, with glucose values of 50, 100, 200, 400, and 800 mg/dL).
2. Run your measuring process or assay, using your instrument, on those standards; record the instrument’s results, along with the “true” values.
Continuing with the glucose example, you might split each vial into four aliquots (portions), and run these 20 specimens through the analyzer.
3. Plot your instrument’s readings against the true values and fit the best line possible to that data.
You’d plot the results of the analysis of the standards as 20 points on a scattergram, with the true value from the standards provider (GlucTrue) on the X axis, and the instrument’s results (GlucInstr) on the Y axis. The best line may not be a straight line, so you may have to do some nonlinear curve-fitting (I describe how to do this in Chapter 21).
4. Use that fitted line to convert your instrument’s readings into the values you report. (You have to do some algebra to rearrange the formula to calculate the X value from the Y value.)
Suppose the fitted equation from Step 3 was GlucInstr = 1.54 + 0.9573 × GlucTrue. With a little algebra, this equation can be rearranged to GlucTrue = (GlucInstr – 1.54)/0.9573. If you were to run a patient’s specimen through that instrument and get a value of 200.0, you’d use the calibration equation to get the corrected value: (200 – 1.54)/0.9573, which works out to 207.3, the value you’d report for this specimen.
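The four calibration steps can be sketched in a few lines of Python. The instrument readings below are hypothetical, generated to lie exactly on the calibration line quoted in Step 4, so the least-squares fit recovers that line and the inversion reproduces the worked example:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of a straight line; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

# Step 1: true values of the five glucose standards (mg/dL).
gluc_true = [50, 100, 200, 400, 800]

# Step 2: assumed instrument readings, generated here from the text's
# calibration line (a real run would give 20 noisy readings instead).
gluc_instr = [1.54 + 0.9573 * g for g in gluc_true]

# Step 3: fit the line GlucInstr = intercept + slope * GlucTrue.
intercept, slope = fit_line(gluc_true, gluc_instr)

# Step 4: invert the fitted line to correct a patient reading.
def corrected(reading):
    return (reading - intercept) / slope

print(round(corrected(200.0), 1))  # 207.3, matching the worked example
```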
If done properly, this process can effectively remove almost all systematic errors from your measurements, resulting in very accurate measurements.
Improving sampling precision
You improve the precision of anything you observe from your sample of subjects by having a larger sample. The central limit theorem (or CLT, one of the foundations of probability theory) describes how random fluctuations behave when a bunch of random variables are added (or averaged) together. Among many other things, the CLT describes how the precision of a sample statistic depends on the sample size.
You can also get better precision (and smaller SEs) by setting up your experiment in a way that lessens random variability in the population. For example, if you want to compare a weight-loss product to a placebo, you should try to have the two treatment groups in your trial as equally balanced as possible with respect to every subject characteristic that can conceivably influence weight loss. Identical twins make ideal (though hard-to-find) subjects for clinical trials because they’re so closely matched in so many ways. Alternatively, you can make your inclusion criteria more stringent. For example, you can restrict the study groups to just males within a narrow age, height, and weight range and impose other criteria that eliminate other sources of between-subject variability (such as history of smoking, hypertension, nervous disorders, and so on). But tightening your inclusion criteria this way has two drawbacks:

It makes finding suitable subjects harder.

Your inferences (conclusions) from the study can be applied only to the narrower population that corresponds to your more stringent inclusion criteria.
Increasing the precision of your measurements
Here are a few general suggestions for achieving better precision (smaller random errors) in your measurements:
Use the most precise measuring instruments you can afford. For example, a beam balance may yield more precise measurements than a spring scale, and an electronic balance may be even more precise.
Control as many sources of random fluctuations due to external perturbations as you can. Depending on how the measuring device operates, a reading can be influenced by temperature, humidity, mechanical vibrations, electrical power fluctuations, and a host of other environmental factors. Operator technique also contributes to random variability in readings.
When reading an instrument with an analog display, like a dial or linear scale (as opposed to a computerized device with a digital readout), try to interpolate (estimate an extra digit) between the divisions and record the number with that extra decimal place. So if you’re weighing someone on a scale with a large rotary dial with lines spaced every kilogram, try to estimate the position of the dial pointer to the nearest tenth of a kilogram.
Make replicate readings and average them. This technique is one of the most widely applicable (see the next section for more information).
Calculating Standard Errors for Different Sample Statistics
As I mention in the earlier section Imprecision comes from random errors, the standard error (SE) is just the standard deviation (SD) of the sampling distribution of the numbers that you get from measuring the same thing over and over again. But you don’t necessarily have to carry out this repetitive process in practice. You can usually estimate the SE of a sample statistic obtained from a single experiment by using a formula appropriate for the sample statistic. The following sections describe how to calculate the SE for various kinds of sample statistics.
A mean
From the central limit theorem (see the earlier section Improving sampling precision for details), the SE of the mean of N numbers (SEMN) is related to the standard deviation (SD) of the numbers by the formula SEMN = SD/√N. So if you study 25 adult diabetics and find that they have an average fasting blood glucose level of 130 milligrams per deciliter (mg/dL) with an SD of ±40 mg/dL, you can say that your estimate of the mean has a precision (SE) of ±40/√25, which is equal to ±40/5, or ±8 mg/dL.
Making three or four replicates of each measurement and then reporting the average of those replicates is typical practice in laboratory research (though less common in clinical studies). Applied to measurements, the central limit theorem tells you that the SEMN is more precise than any one individual measurement (SE1) by a factor of the square root of N: SEMN = SE1/√N.
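The SE-of-the-mean formula can be checked in a couple of lines of Python, applied to the glucose example:

```python
def se_of_mean(sd, n):
    """SE of the mean of n values: the SD divided by the square root of n."""
    return sd / n ** 0.5

# 25 adult diabetics with an SD of 40 mg/dL:
print(se_of_mean(40, 25))  # 8.0 mg/dL, matching the text
```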
A proportion
If you were to survey 100 typical children and find that 70 of them like chocolate, you’d estimate that 70 percent of children like chocolate. How precise is that estimated 70-percent figure?
Based on the properties of the binomial distribution (see Chapters 3 and 25), which generally describes observed proportions of this type, the standard error (SE) of an observed proportion (p), based on a sample size of N, is given by this approximate formula:

SE = √(p × (1 – p)/N)
For small values of N, this formula underestimates SE, but for N = 10 or more, the approximation is very good.
Plugging in the numbers, the SE of the observed 70 percent (0.7 proportion) in 100 children is √(0.7 × 0.3/100) = √0.0021, which is about 0.046, or 4.6 percent.
So you report the percentage of children who like chocolate as 70 percent ± 4.6 percent, being sure to state that the ± number is the standard error of the percentage.
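The proportion formula can be checked in a couple of lines of Python, applied to the chocolate survey:

```python
def se_of_proportion(p, n):
    """Approximate SE of an observed proportion p, based on n subjects."""
    return (p * (1 - p) / n) ** 0.5

# 70 of 100 children like chocolate (p = 0.7, N = 100):
print(round(100 * se_of_proportion(0.7, 100), 1))  # 4.6 percent
```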
Event counts and rates
Closely related to the binomial case in the preceding section (where you have some number of events observed out of some number of opportunities for the event to occur) is the case of the observed number of sporadic events over some interval of time or space.
For example, suppose that there were 36 fatal highway accidents in your county in the last three months. If that’s the only safety data you have to go on, then your best estimate of the monthly fatal accident rate is simply that observed count divided by the length of time during which they were observed: 36/3, or 12.0 fatal accidents per month. How precise is that estimate?
Based on the properties of the Poisson distribution (see Chapters 3 and 25), which generally describes the sporadic occurrences of independent events, the standard error (SE) of an event rate (R), based on the occurrence of N events in T units of time, is given by this approximate formula:

SE = √N/T
For small values of N, this formula underestimates the SE, but for an N of ten or more, the approximation is very good.
Plugging in the numbers, the SE for an observed count of 36 fatalities (N) in 3 months (T) is √36/3, which is 6/3, or 2.0 fatal accidents per month.
So you would report that the estimated rate is 12.0 ± 2.0 fatal accidents per month, being sure to state that the ± number is the SE of the monthly rate.
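The rate formula can be checked in a couple of lines of Python, applied to the accident data:

```python
def se_of_rate(n_events, t):
    """Approximate SE of an event rate: the square root of N divided by T."""
    return n_events ** 0.5 / t

# 36 fatal accidents observed in 3 months:
rate = 36 / 3
print(rate, se_of_rate(36, 3))  # 12.0 and 2.0 fatal accidents per month
```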
A regression coefficient
Suppose you’re interested in whether or not blood urea nitrogen (BUN), a measure of kidney performance, tends to naturally increase after age 60 in generally healthy adults. You enroll a bunch of generally healthy adults age 60 and above, record their ages, and measure their BUN. Next, you create a scatter plot of BUN versus age, and then you fit a straight line to the data points, using regression analysis (see Chapter 18 for details). The regression analysis gives you two regression coefficients: the slope and the intercept of the fitted straight line. The slope of this line has units of (mg/dL)/year, and tells you how much, on average, a healthy person’s BUN goes up with every additional year of age after age 60. Suppose the answer you get is a 1.4 mg/dL glucose increase per year. How precise is that estimate of yearly increase?