Chapter 3

Getting Statistical: A Short Review of Basic Statistics

In This Chapter

arrow Getting a handle on probability, randomness, sampling, and inference

arrow Tackling hypothesis testing

arrow Knowing about nonparametric statistical tests

This chapter provides a brief overview of some basic concepts that are often taught in a one-semester introductory statistics course. They form a conceptual framework for topics that I cover in more depth throughout this book. Here, you get the scoop on probability, randomness, populations, samples, statistical inference, hypothesis testing, and nonparametric statistics.

Note: I can only summarize the concepts here; they’re covered in much more depth in Statistics For Dummies, 2nd Edition, and Statistics II For Dummies, both written by Deborah J. Rumsey, PhD, and published by Wiley. So you may want to skim through this chapter to get an idea of what topics you’re already comfortable with and which ones you need to brush up on.

Taking a Chance on Probability

Defining probability without using some word that means the same (or nearly the same) thing can be hard. Probability is the degree of certainty, the chance, or the likelihood that something will happen. Of course, if you then try to define chance or likelihood or certainty, you may wind up using the word probability in the definition.

Don’t worry; I clear up the basics of probability in the following sections. I explain how to define probability as a number, provide a few simple rules of probability, and compare probability to odds (these two terms are related but not the same thing).

Thinking of probability as a number

Probability describes the relative frequency of the occurrence of an event (like getting heads on a coin flip or drawing the ace of spades from a deck of cards). Probability is a number between 0 and 1, although in casual conversation, you often see probabilities expressed as percentages, often followed by the word chance instead of probability. For example: If the probability of rain is 0.7, you may hear someone say that there’s a 70 percent chance of rain.

remember.eps Probabilities are numbers between 0 and 1 that can be interpreted this way:

check.png A probability of 0 means that the event definitely won’t occur.

check.png A probability of 1 (or 100 percent) means that the event definitely will occur.

check.png A probability between 0 and 1 (like 0.7) means that the event will occur some part of the time (like 70 percent) in the long run.

The probability of one particular thing happening out of N equally likely things that could happen is 1/N. So with a deck of 52 different cards, the probability of drawing any one specific card (like the ace of spades) is 1/52.

Following a few basic rules

Here are three basic rules, or formulas, of probabilities — I call them the not rule, the and rule, and the or rule. In the formulas that follow, I use Prob as an abbreviation for probability, expressed as a fraction (between 0 and 1).

warning_bomb.eps Don’t use percentage numbers (0 to 100) in probability formulas.

remember.eps Most of the mathematical underpinning of statistics is based on the careful application of the following basic rules to ever more complicated situations:

check.png The not rule: The probability of some event X not happening is 1 minus the probability of X happening:

Prob(not X) = 1 – Prob(X)

So if the probability of rain tomorrow is 0.7, then the probability of no rain tomorrow is 1 – 0.7, or 0.3.

check.png The and rule: For two independent events, X and Y, the probability of event X and event Y both happening is equal to the product of the probability of each of the two events:

Prob(X and Y) = Prob(X) × Prob(Y)

So, if you flip a fair coin and then draw a card from a deck, what’s the probability of getting heads on the coin flip and then drawing the ace of spades? The probability of getting heads in a fair coin flip is 1/2, and the probability of drawing the ace of spades from a deck of cards is 1/52, so the probability of having both of these things happen is (1/2)(1/52), or 1/104, or 0.0096 (approximately).

check.png The or rule: For two independent events, X and Y, the probability of one or the other or both events happening is given by a more complicated formula, which you can derive from the preceding two rules (saying that X or Y happens is the same as saying that not-X and not-Y don’t both happen):

Prob(X or Y) = 1 – (1 – Prob(X)) × (1 – Prob(Y))

Suppose you roll a pair of dice. What’s the probability of at least one of the dice coming up a 4? If the dice aren’t loaded, there’s a 1/6 chance (a probability of 0.167, approximately) of getting a 4 (or any other specified number) on any die you roll, so the probability of getting a 4 on at least one of the two dice is 1 – (1 – 0.167) × (1 – 0.167), which works out to 1 – 0.833 × 0.833, or 0.31, approximately.

remember.eps The “and” and “or” rules apply only to independent events. For example, you can’t use these rules to calculate the probability of a person selected at random being obese and hypertensive by using the prevalences (probabilities) of obesity and hypertension in the general population because these two medical conditions tend to be associated — if you have one, you’re at greater risk of having the other also.
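
When the events are independent, you can check this kind of arithmetic with a few lines of code. Here's a minimal sketch in Python (the function names are mine, purely for illustration) that reproduces the three examples above:

    # A minimal sketch of the three probability rules, reproducing the
    # examples above. Remember: probabilities are fractions between 0 and 1,
    # never percentages.

    def prob_not(p_x):
        """The not rule: Prob(not X) = 1 - Prob(X)."""
        return 1 - p_x

    def prob_and(p_x, p_y):
        """The and rule (independent events only): Prob(X) * Prob(Y)."""
        return p_x * p_y

    def prob_or(p_x, p_y):
        """The or rule (independent events only)."""
        return 1 - (1 - p_x) * (1 - p_y)

    print(prob_not(0.7))        # no rain tomorrow: 0.3
    print(prob_and(1/2, 1/52))  # heads, then the ace of spades: about 0.0096
    print(prob_or(1/6, 1/6))    # at least one 4 on a pair of dice: about 0.306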

Comparing odds versus probability

You see the word odds used a lot in this book, especially in Chapter 14 (on the fourfold cross-tab table) and Chapter 20 (on logistic regression). Odds and probability are related, but the two words are not synonymous.

remember.eps Odds equal the probability of something happening divided by the probability of that thing not happening. So, knowing that the probability of something not happening is 1 minus the probability of that thing happening (see the preceding section), you have the formula:

Odds = Probability/(1 – Probability)

With a little algebra (which you don’t need to worry about), you can solve this formula for probability as a function of odds:

Probability = Odds/(1 + Odds)
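
Here's a minimal Python sketch of those two conversion formulas (again, the function names are just illustrative); it reproduces several of the rows you're about to see in Table 3-1:

    # Converting between probability and odds. Note that a probability of
    # exactly 1 has infinite odds, so the first function fails there.

    def odds_from_probability(p):
        """Odds = Probability / (1 - Probability)."""
        return p / (1 - p)

    def probability_from_odds(odds):
        """Probability = Odds / (1 + Odds)."""
        return odds / (1 + odds)

    for p in [0.9, 0.75, 0.5, 0.25, 0.1]:
        print(f"probability {p} -> odds {odds_from_probability(p):.4f}")

    print(probability_from_odds(3))  # odds of 3 (that is, 3:1) -> probability 0.75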

Table 3-1 shows how probability and odds are related.

Table 3-1 The Relationship between Probability and Odds

Probability   Odds       Interpretation
1.0           Infinity   The event will definitely occur.
0.9           9          The event will occur 90% of the time (is nine times as likely to occur as to not occur).
0.75          3          The event will occur 75% of the time (is three times as likely to occur as to not occur).
0.667         2          The event will occur two-thirds of the time (is twice as likely to occur as to not occur).
0.5           1.0        The event will occur about half the time (is equally likely to occur or not occur).
0.333         0.5        The event will occur one-third of the time (is only half as likely to occur as to not occur).
0.25          0.3333     The event will occur 25% of the time (is one-third as likely to occur as to not occur).
0.1           0.1111     The event will occur 10% of the time (is 1/9th as likely to occur as to not occur).
0             0          The event definitely will not occur.

remember.eps As you can see in Table 3-1, for very low probability, the odds are very close to the probability; but as probability increases, the odds increase faster. By the time probability reaches 0.5, the odds have become 1, and as probability approaches 1, the odds become infinitely large! This definition of odds is consistent with its common-language use. For instance: If the odds of a horse losing a race are 3:1, that means you have three chances of losing and one chance of winning, for a 0.75 probability of losing.

Some Random Thoughts about Randomness

Like probability (which I cover earlier in this chapter), random is a word you use all the time and have an intuitive feel for, but it's hard to pin down in precise language. You can talk about random events and random variables; randomness enters your experiments through the data you acquire. When talking about a sequence of random numbers, random means the absence of any pattern in the numbers that can be used to predict what the next number will be.

remember.eps The important idea is that you can’t predict a specific outcome if a random element is involved. But that doesn’t mean that you can’t make any statements about a collection of random numbers. Statisticians can say a lot about how a group of random numbers behaves collectively.

The first step in analyzing a set of data is to get a good idea of what the data looks like. This is the job of descriptive statistics — to show you how a set of numbers is spread around and to show you the relationship between two or more sets of data. The basic tool for describing the distribution of values for some variable in a sample of subjects is the histogram, or frequency distribution graph (I describe histograms in more detail in Chapter 8). Histograms help you visualize the distributions of two types of variables:

check.png Categorical: For categorical variables (such as gender or race), a histogram is simply a bar chart showing how many observations fall into each category, like the distribution of race in a sample of subjects, as shown in Figure 3-1a.

check.png Continuous: To make a histogram of a continuous variable (such as weight or blood hemoglobin), you divide the range of values into some convenient interval, count how many observations fall within each interval, and then display those counts in a bar chart, as shown in Figure 3-1b (which shows the distribution of hemoglobin for a sample of subjects).
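
If you want to draw histograms like these yourself, here's a minimal sketch in Python using the numpy and matplotlib packages, with simulated data standing in for a real sample (the category proportions and the hemoglobin mean and SD are made-up values, chosen only for illustration):

    # Histograms of a categorical and a continuous variable, on simulated data.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)

    # Categorical: a bar chart of the count in each category.
    race = rng.choice(["White", "Black", "Asian", "Other"], size=100,
                      p=[0.6, 0.2, 0.1, 0.1])
    labels, counts = np.unique(race, return_counts=True)
    plt.subplot(1, 2, 1)
    plt.bar(labels, counts)
    plt.title("Categorical (race)")

    # Continuous: divide the range into intervals (bins), count the
    # observations in each interval, and display the counts as bars.
    hemoglobin = rng.normal(loc=14, scale=1.5, size=100)
    plt.subplot(1, 2, 2)
    plt.hist(hemoglobin, bins=10)
    plt.title("Continuous (hemoglobin)")

    plt.show()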

Figure 3-1: Histograms of categorical (a) and continuous (b) data.

Picking Samples from Populations

The idea of sampling from a population is one of the most fundamental concepts in statistics — indeed, in all of science. For example, you can’t test how a chemotherapy drug will work in all people with lung cancer; you can study only a limited sample of lung cancer patients who are available to you and draw conclusions from that sample — conclusions that you hope will be valid for all lung cancer patients.

In the following sections, I explain how samples are only imperfect reflections of the populations they’re drawn from, and I describe the basics of probability distributions.

Recognizing that sampling isn’t perfect

remember.eps As used in clinical research, the terms population and sample can be defined this way:

check.png Population: All individuals having a precisely defined set of characteristics (for example: human, male, age 18–65, with Stage 3 lung cancer)

check.png Sample: A subset of a defined population, selected for experimental study

Any sample, no matter how carefully it is selected, is only an imperfect reflection of the population, due to the unavoidable occurrence of random sampling fluctuations. Figure 3-2, which shows IQ scores of a random sample of 100 subjects from the U.S. population, exhibits this characteristic. (IQ scores are standardized so that the average for the whole population is 100, with a standard deviation of 15.)

Figure 3-2: Distribution of IQ scores in (a) the population and (b) a random sample of 100 subjects from that population.

The sample is distributed more or less like the population, but clearly it’s only an approximation to the true distribution. The mean and standard deviation (I define those terms precisely in Chapter 8) of the sample are close to, but not exactly equal to, the mean and standard deviation of the population, and the histogram doesn’t have a perfect bell shape. These characteristics are always true of any random sample.
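
You can watch sampling fluctuations happen by simulation. Here's a minimal Python sketch (using numpy; the seed is arbitrary) that draws a random sample of 100 IQ scores from a normal population with mean 100 and standard deviation 15:

    # Draw one random sample of 100 IQ scores and compare it to the population.
    import numpy as np

    rng = np.random.default_rng(42)
    sample = rng.normal(loc=100, scale=15, size=100)

    print(f"sample mean: {sample.mean():.1f}  (population mean: 100)")
    print(f"sample SD:   {sample.std(ddof=1):.1f}  (population SD: 15)")
    # Change the seed and rerun: you get a different mean and SD every time,
    # because every random sample is only an approximation to the population.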

remember.eps Histograms are prepared from data you observe in your sample of subjects, and they describe how the values fluctuate in that sample. A histogram of an observed variable, prepared from a random sample of data, is an approximation to what the population distribution of that variable looks like.

Digging into probability distributions

Samples differ from populations because of random fluctuations. Statisticians understand quantitatively how random fluctuations behave by developing mathematical equations, called probability distribution functions, that describe how likely it is that random fluctuations will exceed any given magnitude. A probability distribution can be represented in several ways:

check.png As a mathematical equation that gives the chance that a fluctuation will be of a certain magnitude. Using calculus, this function can be integrated — turned into another related function that tells the probability that a fluctuation will be at least as large as a certain magnitude.

check.png As a graph of the distribution, which looks and works much like a histogram of observed data.

check.png As a table of values telling how likely it is that random fluctuations will exceed a certain magnitude.

Over the years, hundreds of different probability distributions have been described, but most practical statistical work utilizes only a few of them. You encounter fewer than a dozen probability distributions in this book. In the following sections, I break down two types of distributions: those that describe fluctuations in your data and those that you encounter when performing statistical tests.

Distributions that describe your data

Some distributions describe the random fluctuations you see in your data:

check.png Normal: The familiar, bell-shaped, normal distribution describes (at least approximately) an enormous number of variables you encounter.

check.png Log-normal: The skewed, log-normal distribution describes many laboratory results (enzymes and antibody titers, for example), lengths of hospital stays, and related things like costs, utilization of tests, drugs, and so forth.

check.png Binomial: The binomial distribution describes proportions, such as the fraction of subjects responding to treatment.

check.png Poisson: The Poisson distribution describes the number of occurrences of sporadic random events, such as clicks in a gamma radiation counter or deaths during some period of time.

Chapter 25 describes these and other distribution functions in more detail, and you encounter them throughout this book.
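
Here's a minimal Python sketch of these four distributions, using the scipy.stats package (the parameter values are arbitrary, chosen only for illustration):

    # One representative calculation for each of the four distributions.
    from scipy import stats

    print(stats.norm(loc=100, scale=15).pdf(115))  # normal: density at IQ = 115
    print(stats.lognorm(s=0.5, scale=10).mean())   # log-normal: mean of a skewed lab value
    print(stats.binom(n=100, p=0.8).pmf(80))       # binomial: chance of exactly 80 responders in 100
    print(stats.poisson(mu=3).pmf(5))              # Poisson: chance of 5 events when 3 are expected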

Distributions that come up during statistical testing

Some frequency distributions don’t describe fluctuations in observed data, but rather describe fluctuations in numbers that you calculate as part of a statistical test (described in the later section Homing In on Hypothesis Testing). These distributions include the Student t, chi-square, and Fisher F distributions (see Chapter 25), which are used to obtain the p values (see the later section Getting the language down for a definition of p values) that result from the tests.

Introducing Statistical Inference

Statistical inference is the drawing (that is, inferring) of conclusions about a population based on what you see in a sample from that population. In keeping with the idea that statisticians understand how random fluctuations behave, we can say that statistical inference theory is concerned with how we can extract what’s real in our data, despite the unavoidable random noise that’s always present due to sampling fluctuations or measurement errors. This very broad area of statistical theory is usually subdivided into two topics: statistical estimation theory and statistical decision theory.

Statistical estimation theory

Statistical estimation theory focuses on the accuracy and precision of things that you estimate, measure, count, or calculate. It gives you ways to indicate how precise your measurements are and to calculate the range that’s likely to include the true value. The following sections provide the fundamentals of this theory.

Accuracy and precision

remember.eps Whenever you estimate or measure anything, your estimated or measured value can differ from the truth in two ways — it can be inaccurate, imprecise, or both.

check.png Accuracy refers to how close your measurement tends to come to the true value, without being systematically biased in one direction or another.

check.png Precision refers to how close a bunch of replicate measurements come to each other — that is, how reproducible they are.

Figure 3-3 shows four shooting targets with a bunch of bullet holes from repeated rifle shots. These targets illustrate the distinction between accuracy and precision — two terms that describe different kinds of errors that can occur when sampling or measuring something (or, in this case, when shooting at a target).

Figure 3-3: The difference between accuracy and precision.

You see the following in Figure 3-3:

check.png The upper-left target is what most people would hope to achieve — the shots all cluster together (good precision), and they center on the bull’s-eye (good accuracy).

check.png The upper-right target shows that the shots are all very consistent with each other (good precision), so we know that the shooter was very steady (with no large random perturbations from one shot to the next), and any other random effects must have also been quite small. But the shots were all consistently high and to the right (poor accuracy). Perhaps the gun sight was misaligned or the shooter didn’t know how to use it properly. A systematic error occurred somewhere in the aiming and shooting process.

check.png The lower-left target indicates that the shooter wasn’t very consistent from one shot to another (he had poor precision). Perhaps he was unsteady in holding the rifle; perhaps he breathed differently for each shot; perhaps the bullets were not all properly shaped, and had different aerodynamics; or any number of other random differences may have had an effect from one shot to the next. About the only good thing you can say about this shooter is that at least he tended to be more or less centered around the bull’s-eye — the shots don’t show any tendency to be consistently high or low, or consistently to the left or right of center. There’s no evidence of systematic error (or inaccuracy) in his shooting.

check.png The lower-right target shows the worst kind of shooting — the shots are not closely clustered (poor precision) and they seem to show a tendency to be high and to the right (poor accuracy). Both random and systematic errors are prominent in this shooter’s shooting.

Sampling distributions and standard errors

remember.eps The standard error (abbreviated SE) is one way to indicate how precise your estimate or measurement of something is. The SE tells you how much the estimate or measured value might vary if you were to repeat the experiment or the measurement many times, using a different random sample from the same population each time and recording the value you obtained each time. This collection of numbers would have a spread of values, forming what is called the sampling distribution for that variable. The SE is a measure of the width of the sampling distribution, as described in Chapter 9.

Fortunately, you don’t have to repeat the entire experiment a large number of times to calculate the SE. You can usually estimate the SE using data from a single experiment. In Chapter 9, I describe how to calculate the standard errors for means, proportions, event rates, regression coefficients, and other quantities you measure, count, or calculate.
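
You can check this claim by brute force. Here's a minimal Python sketch (using numpy, and reusing the IQ example) that repeats a simulated experiment 10,000 times and compares the spread of the sample means to the theoretical standard error of the mean, SD divided by the square root of n:

    # Build a sampling distribution of the mean by repeated simulation.
    import numpy as np

    rng = np.random.default_rng(7)
    means = [rng.normal(100, 15, size=100).mean() for _ in range(10_000)]

    print(f"SD of the 10,000 sample means: {np.std(means):.2f}")
    print(f"theoretical SE, 15/sqrt(100):  {15 / np.sqrt(100):.2f}")  # 1.50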

Confidence intervals

Confidence intervals provide another way to indicate the precision of an estimate or measurement of something. A confidence interval (CI) around an estimated value is a range that has a specified probability, called the confidence level (CL), of containing the true value of that variable. If calculated properly, your quoted confidence interval should encompass the true value a percentage of the time at least equal to the quoted confidence level.

Suppose you treat 100 randomly selected migraine headache sufferers with a new drug, and you find that 80 of them respond to the treatment (according to the response criteria you have established). Your observed response rate is 80 percent, but how precise is this observed rate? You can calculate that the 95 percent confidence interval for this 80 percent response rate goes from 70.8 percent to 87.3 percent. Those two numbers are called the lower and upper 95 percent confidence limits around the observed response rate. If you claim that the true response rate (in the population of migraine sufferers that you drew your sample from) lies between 70.8 percent and 87.3 percent, there’s a 95 percent chance that that claim is correct.

How did I get those confidence limits? In Chapter 10, I describe how to calculate confidence intervals around means, proportions, event rates, regression coefficients, and other quantities you measure, count, or calculate.
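
As a preview of Chapter 10, here's a minimal Python sketch for the migraine example, using scipy. One standard method, the exact (Clopper-Pearson) interval, is computed from the beta distribution and gives limits agreeing with the ones quoted above:

    # Exact (Clopper-Pearson) 95% confidence interval for 80 responders out of 100.
    from scipy import stats

    x, n, alpha = 80, 100, 0.05
    lower = stats.beta.ppf(alpha / 2, x, n - x + 1)
    upper = stats.beta.ppf(1 - alpha / 2, x + 1, n - x)
    print(f"95% CI: {100 * lower:.1f}% to {100 * upper:.1f}%")  # about 70.8% to 87.3%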

Statistical decision theory

Statistical decision theory is perhaps the largest branch of statistics. It encompasses all the famous (and many not-so-famous) significance tests — Student t tests (see Chapter 12), chi-square tests (see Chapter 13), analysis of variance (ANOVA; see Chapter 12), Pearson correlation tests (see Chapter 17), Wilcoxon and Mann-Whitney tests (see Chapter 12), and on and on.

remember.eps In its most basic form, statistical decision theory deals with determining whether or not some real effect is present in your data. I use the word effect throughout this book, and it can refer to different things in different circumstances. Examples of effects include the following:

check.png The average value of something may be different in one group compared to another. For example, males may have higher hemoglobin values, on average, than females; the effect of gender on hemoglobin can be quantified by the difference in mean hemoglobin between males and females. Or subjects treated with a drug may have a higher recovery rate than subjects given a placebo; the effect size could be expressed as the difference in recovery rate (drug minus placebo) or by the ratio of the odds of recovery for the drug relative to the placebo (the odds ratio).

check.png The average value of something may be different from zero (or from some other specified value). For example, the average change in body weight over 12 weeks in a group of subjects undergoing physical therapy may be different from zero.

check.png Two numerical variables may be associated (also called correlated). For example, if obesity is associated with hypertension, then body mass index may be correlated with systolic blood pressure. This effect is often quantified by the Pearson correlation coefficient.

Homing In on Hypothesis Testing

The theory of statistical hypothesis testing was developed in the early 20th century and has been the mainstay of practical statistics ever since. It was designed to apply the scientific method to situations involving data with random fluctuations (and almost all real-world data has random fluctuations). In the following sections, I list a few terms commonly used in hypothesis testing; explain the steps, results, and possible errors of testing; and describe the relationships between power, sample size, and effect size in testing.

Getting the language down

Here are some of the most common terms used in hypothesis testing:

check.png Null hypothesis (abbreviated H0): The assertion that any apparent effect you see in your data does not reflect any real effect in the population, but is merely the result of random fluctuations.

check.png Alternate hypothesis (abbreviated H1 or HAlt): The assertion that there really is some real effect in your data, over and above whatever is attributable to random fluctuations.

check.png Significance test: A calculation designed to determine whether H0 can reasonably explain what you see in your data.

check.png Significance: The conclusion that random fluctuations alone can’t account for the size of the effect you observe in your data, so H0 must be false, and you accept HAlt.

check.png Statistic: A number that you obtain or calculate from your data.

check.png Test statistic: A number, calculated from your data, usually for the purpose of testing H0. It’s often — but not always — calculated as the ratio of a number that measures the size of the effect (the signal) divided by a number that measures the size of the random fluctuations (the noise).

check.png p value: The probability that random fluctuations alone, in the absence of any real effect in the population, can produce an observed effect at least as large as the one you observe in your sample. The p value is the probability of random fluctuations making the test statistic at least as large as what you calculate from your data (or, more precisely, at least as far away from H0 in the direction of HAlt).

check.png Type I error: Getting a significant result when, in fact, no effect is present.

check.png Alpha: The probability of making a Type I error.

check.png Type II error: Failing to get a significant result when, in fact, some effect really is present.

check.png Beta: The probability of making a Type II error.

check.png Power: The probability of getting a significant result when some effect is really present.

Testing for significance

remember.eps All the famous statistical significance tests (Student t, chi-square, ANOVA, and so on) work on the same general principle — they evaluate the size of apparent effect you see in your data against the size of the random fluctuations present in your data. I describe individual statistical tests throughout this book — t tests and ANOVAs in Chapter 12, chi-square and Fisher Exact tests in Chapter 13, correlation tests in Chapter 17, and so on. But here I describe the general steps that underlie all the common statistical tests of significance.

1. Boil your raw data down into a single number, called a test statistic.

Each test has its own formula, but in general, the test statistic represents the magnitude of the effect you’re looking for relative to the magnitude of the random noise in your data. For example, the test statistic for the unpaired Student t test for comparing means between two groups is calculated as a fraction:

t = (Mean of Group 1 – Mean of Group 2)/(Standard Error of the Difference)

The numerator is a measure of the effect you’re looking for — the difference between the two groups. And the denominator is a measure of the random noise in your data — the spread of values within each group. The larger the observed effect is, relative to the amount of random scatter in your data, the larger the Student t statistic will be.

2. Determine how likely (or unlikely) it is for random fluctuations to produce a test statistic as large as the one you actually got from your data.

The mathematicians have done the hard work; they’ve developed formulas (really complicated ones) that describe how much the test statistic bounces around if only random fluctuations are present (that is, if H0 is true).
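
Here's how the two steps look in practice: a minimal Python sketch that runs scipy's unpaired Student t test on made-up hemoglobin values for two small groups (the numbers are invented, purely for illustration):

    # Steps 1 and 2 of a significance test, done by scipy in one call.
    from scipy import stats

    males   = [15.1, 14.2, 16.0, 15.3, 14.8, 15.6]
    females = [13.9, 14.5, 13.2, 14.1, 13.7, 14.4]

    # Step 1: boil the data down to a test statistic (signal over noise).
    # Step 2: find how likely random fluctuations alone are to produce a
    #         statistic at least that large; that probability is the p value.
    t_stat, p_value = stats.ttest_ind(males, females)
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")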

Understanding the meaning of “p value” as the result of a test

The end result of a statistical significance test is a p value, which represents the probability that random fluctuations alone could have generated results that differed from the null hypothesis (H0), in the direction of the alternate hypothesis (HAlt), by at least as much as what you observed in your data.

If this probability is too small, then H0 can no longer explain your results, and you’re justified in rejecting it and accepting HAlt, which says that some real effect is present. You can say that the effect seen in your data is statistically significant.

remember.eps How small is too small for a p value? This determination is arbitrary; it depends on how much of a risk you’re willing to take of being fooled by random fluctuations (that is, of making a Type I error). Over the years, the value of 0.05 has become accepted as a reasonable criterion for declaring significance. If you adopt the criterion that p must be less than or equal to 0.05 to declare significance, then you’ll keep the chance of making a Type I error to no more than 5 percent.

Examining Type I and Type II errors

The outcome of a statistical test is a decision to either accept or reject H0 in favor of HAlt. Because H0 pertains to the population, it’s either true or false for the population you’re sampling from. You may never know what that truth is, but an objective truth is out there nonetheless.

The truth can be one of two things, and your conclusion is one of two things, so four different situations are possible; these are often portrayed in a fourfold table, as shown in Figure 3-4 (Chapter 14 has details on these tables).

Figure 3-4: Right and wrong conclusions from a statistical hypothesis test.

remember.eps Here are the four things that can happen when you run a statistical significance test on your data (using an example of testing a drug for efficacy):

check.png You can get a nonsignificant result when there is truly no effect present. This is correct — you don’t want to claim that a drug works if it really doesn’t. (See the upper-left corner of the outlined box in Figure 3-4.)

check.png You can get a significant result when there truly is some effect present. This is correct — you do want to claim that a drug works when it really does. (See the lower-right corner of the outlined box in Figure 3-4.)

check.png You can get a significant result when there’s truly no effect present. This is a Type I error — you’ve been tricked by random fluctuations that made the drug look effective. (See the lower-left corner of the outlined box in Figure 3-4.) Your company will invest millions of dollars into the further development of a drug that will eventually be shown to be worthless. Statisticians use the Greek letter alpha (α) to represent the probability of making a Type I error.

check.png You can get a nonsignificant result when there truly is an effect present. This is a Type II error (see the upper-right corner of the outlined box in Figure 3-4) — you’ve failed to see that the drug really works, perhaps because the effect was obscured by the random noise in the data. Further development will be halted, and the miracle drug of the century will be consigned to the scrap heap, along with the Nobel prize you’ll never get. Statisticians use the Greek letter beta (β) to represent the probability of making a Type II error.

tip.eps Limiting your chance of making a Type I error (falsely claiming significance) is very easy. If you don’t want to make a Type I error more than 5 percent of the time, don’t declare significance unless the p value is less than 0.05. That’s called testing at the 0.05 alpha level. If you’re willing to make a Type I error 10 percent of the time, use p < 0.10 as your criterion for significance. If you’re terrified of Type I errors, use p < 0.000001 as your criterion for significance, and you won’t falsely claim significance more than one time in a million.

Why not use a small alpha level (like p < 0.000001) for your significance testing? Because then you’ll almost never get significance, even if an effect really is present. Researchers don’t like to go through life never making any discoveries. If a drug really is effective, you want to get a significant result when you test it. You need to strike a balance between Type I and Type II errors — between the alpha and beta error rates. If you make alpha too small, beta will become too large, and vice versa. Is there any way to keep both types of errors small? There is, and that’s what I describe next.

Grasping the power of a test

remember.eps The power of a statistical test is the chance that it will come out statistically significant when it should — that is, when the alternative hypothesis is really true. Power is a probability and is very often expressed as a percentage. Beta is the chance of getting a nonsignificant result when the alternative hypothesis is true, so you see that power and beta are related mathematically: Power = 1 – beta.

The power of any statistical test depends on several factors:

check.png The alpha level you’ve established for the test — that is, the chance you’re willing to accept of making a Type I error

check.png The actual magnitude of the effect in the population, relative to the amount of noise in the data

check.png The size of your sample

Power, sample size, effect size relative to noise, and alpha level can’t all be varied independently; they’re interrelated — connected and constrained by a mathematical relationship involving the four quantities.

This relationship is often very complicated, and sometimes it can’t be written down explicitly as a formula, but it does exist. For any particular type of test, you can (at least in theory) determine any one of the four quantities if you know the other three. So there are four different ways to do power calculations, with each way calculating one of the four quantities from arbitrarily specified values of the other three. (I have more to say about this in Chapter 5, where I describe practical issues that arise during the design of research studies.) In the following sections, I describe the relationships between power, sample size, and effect size, and I briefly note how you can perform power calculations.

Power, sample size, and effect size relationships

remember.eps The alpha level of a statistical test is usually set to 0.05, unless there are special considerations, which I describe in Chapter 5. After you specify the value of alpha, you can display the relationship between the other three variables (power, sample size, and effect size) in several ways. The next three graphs show these relationships for the Student t test; graphs for other statistical tests are generally similar to these:

check.png Power versus sample size, for various effect sizes: For all statistical tests, power always increases as the sample size increases, if other things (such as alpha level and effect size) are held constant. This relationship is illustrated in Figure 3-5. “Eff” is the effect size — the between-group difference divided by the within-group standard deviation.

Very small samples very seldom produce significant results unless the effect size is very large. Conversely, extremely large samples (many thousands of subjects) are almost always significant unless the effect size is near zero. In epidemiological studies, which often involve hundreds of thousands of subjects, statistical tests tend to produce extremely small (and therefore extremely significant) p values, even when the effect size is so small that it’s of no practical importance.

Figure 3-5: The power of a statistical test increases as the sample size and the effect size increase.

check.png Power versus effect size, for various sample sizes: For all statistical tests, power always increases as the effect size increases, if other things (such as alpha level and sample size) are held constant. This relationship is illustrated in Figure 3-6. “N” is the number of subjects in each group.

For very large effect sizes, the power approaches 100 percent. For very small effect sizes, you might think the power of the test would approach zero, but you can see from Figure 3-6 that it doesn’t go down all the way to zero; it actually approaches the alpha level of the test. (Keep in mind that the alpha level of the test is the probability of the test producing a significant result when no effect is truly present.)

Figure 3-6: The power of a statistical test increases as the effect size increases.

check.png Sample size versus effect size, for various values of power: For all statistical tests, sample size and effect size are inversely related, if other things (such as alpha level and power) are held constant. Small effects can be detected only with large samples; large effects can often be detected with small samples. This relationship is illustrated in Figure 3-7.

Figure 3-7: Smaller effects need larger samples.

This inverse relationship between sample size and effect size takes on a very simple mathematical form (at least to a good approximation): The required sample size is inversely proportional to the square of the effect size that can be detected. Or, equivalently, the detectable effect size is inversely proportional to the square root of the sample size. So, quadrupling your sample size allows you to detect effect sizes only one-half as large.
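
Here's a minimal Python sketch of that inverse-square rule, using the common normal-approximation formula for the per-group sample size of a two-group Student t test (a rough approximation; dedicated power software refines these numbers slightly):

    # Approximate per-group sample size: n = 2 * (z_alpha/2 + z_power)^2 / effect^2
    from scipy import stats

    def n_per_group(effect_size, alpha=0.05, power=0.80):
        z_a = stats.norm.ppf(1 - alpha / 2)  # critical value for two-sided alpha
        z_b = stats.norm.ppf(power)          # z corresponding to the desired power
        return 2 * (z_a + z_b) ** 2 / effect_size ** 2

    print(round(n_per_group(0.5)))   # about 63 subjects per group
    print(round(n_per_group(0.25)))  # about 251: half the effect, four times the sample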

How to do power calculations

tip.eps Power calculations are a crucial part of the design of any research project. You don’t want your study to be underpowered (with a high risk of missing real effects) or overpowered (larger, costlier, and more time-consuming than necessary). You need to provide a power/sample-size analysis for any research proposal you submit for funding or any protocol you submit to a review board for approval. You can perform power calculations in several ways:

check.png Computer software: The larger statistics packages (such as SPSS, SAS, and R) provide a wide range of power calculations — see Chapter 4 for more about these packages. There are also programs specially designed for this purpose (nQuery, StatExact, Power and Precision, PS-Power & Sample Size, and Gpower, for instance).

check.png Web pages: Many of the more common power calculations can be performed online using web-based calculators. A large collection of these can be found at StatPages.info.

check.png Hand-held devices: Apps for the more common power calculations are available for most tablets and smartphones.

check.png Printed charts and tables: You can find charts and tables in textbooks (including this one; see Chapter 12 and this book's Cheat Sheet at www.dummies.com/cheatsheet/biostatistics). These are ideal for quick and dirty calculations.

check.png Rules of thumb: Some approximate sample-size calculations are simple enough to do on a scrap of paper or even in your head! You find some of these in Chapter 26 and on the Cheat Sheet: Go to www.dummies.com/cheatsheet/biostatistics.

Going Outside the Norm with Nonparametric Statistics

All statistical tests are derived on the basis of some assumptions about your data, and most of the classical significance tests (such as Student t tests, analysis of variance, and regression tests) assume that your data is distributed according to some classical frequency distribution (most commonly the normal distribution; see Chapter 25). Because the classic distribution functions are all written as mathematical expressions involving parameters (like means and standard deviations), they’re called parametric distribution functions, and tests that assume your data conforms to a parametric distribution function are called parametric tests. Because the normal distribution is the most common statistical distribution, the term parametric test is most often used to mean a test that assumes normally distributed data.

But sometimes your data doesn’t follow any parametric distribution. For example, you may not want to assume that your data is normally distributed because it may be very noticeably skewed, as shown in Figure 3-8a.

Sometimes, you may be able to perform some kind of transformation of your data to make it more normally distributed. For example, many variables that have a skewed distribution can be turned into normally distributed numbers by taking logarithms, as shown in Figure 3-8b. If, by trial and error, you can find some kind of transformation that normalizes your data, you can run the classical tests on the transformed data. (See Chapter 8.)
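
Here's a minimal Python sketch of this trick, using numpy and scipy with simulated log-normal data standing in for real lab values. A normality test (Shapiro-Wilk) rejects normality for the raw values but not for their logarithms:

    # Normalize skewed data by taking logarithms, and test normality both ways.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    titers = rng.lognormal(mean=2.0, sigma=0.8, size=200)  # skewed, like antibody titers

    stat_raw, p_raw = stats.shapiro(titers)
    stat_log, p_log = stats.shapiro(np.log(titers))
    print(f"raw data: p = {p_raw:.4f}")  # tiny p: clearly not normal
    print(f"log data: p = {p_log:.4f}")  # large p: consistent with normality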

Figure 3-8: Skewed data (a) can sometimes be turned into normally distributed data (b) by taking logarithms.

But sometimes your data is stubbornly abnormal, and you can’t use the parametric tests. Fortunately, statisticians have developed special tests that don’t assume normally distributed data; these are (not surprisingly) called nonparametric tests. Most of the common classic parametric tests have nonparametric counterparts. As you may expect, the most widely known and commonly used nonparametric tests are those that correspond to the most widely known and commonly used classical tests. Some of these are shown in Table 3-2.

Table 3-2 Nonparametric Counterparts of Classic Tests

Classic Parametric Test                               Nonparametric Equivalent
One-group or paired Student t test (see Chapter 12)   Sign test; Wilcoxon signed-ranks test
Two-group Student t test (see Chapter 12)             Wilcoxon sum-of-ranks test; Mann-Whitney U test
One-way ANOVA (see Chapter 12)                        Kruskal-Wallis test
Pearson Correlation test (see Chapter 17)             Spearman Rank Correlation test

Most nonparametric tests involve first sorting your data values, from lowest to highest, and recording the rank of each measurement (the lowest value has a rank of 1, the next highest value a rank of 2, and so on). All subsequent calculations are done with these ranks rather than with the actual data values.
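
Here's a minimal Python sketch of the rank idea, using scipy on made-up values: rankdata converts the raw values to ranks, and the Mann-Whitney U test then works with those ranks rather than the values themselves:

    # Ranks, and a rank-based test, on two small made-up groups.
    from scipy import stats

    group_a = [2.1, 3.5, 4.0, 9.8, 2.7]
    group_b = [1.2, 1.9, 2.4, 2.2, 3.0]

    print(stats.rankdata(group_a + group_b))  # ranks across both groups combined

    u_stat, p_value = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
    print(f"Mann-Whitney U = {u_stat}, p = {p_value:.3f}")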

Although nonparametric tests don’t assume normality, they do make certain assumptions about your data. For example, many nonparametric tests assume that you don’t have any tied values in your data set (in other words, no two subjects have exactly the same values). Most nonparametric tests incorporate adjustments for the presence of ties, but this weakens the test and makes the results nonexact.

tip.eps Even in descriptive statistics, the common parameters have nonparametric counterparts. Although means and standard deviations can be calculated for any set of numbers, they’re most useful for summarizing data when the numbers are normally distributed. When you don’t know how the numbers are distributed, medians and quartiles are much more useful as measures of central tendency and dispersion (see Chapter 8 for details).
