1.3 Fundamental Elements of Statistics

Statistical methods are particularly useful for studying, analyzing, and learning about populations of experimental units.

An experimental (or observational) unit is an object (e.g., person, thing, transaction, or event) about which we collect data.

A population is a set of all units (usually people, objects, transactions, or events) that we are interested in studying.

For example, populations may include (1) all employed workers in the United States, (2) all registered voters in California, (3) everyone who is afflicted with AIDS, (4) all the cars produced last year by a particular assembly line, (5) the entire stock of spare parts available at Southwest Airlines’ maintenance facility, (6) all sales made at the drive-in window of a McDonald’s restaurant during a given year, or (7) the set of all accidents occurring on a particular stretch of interstate highway during a holiday period. Notice that the first three population examples (1–3) are sets (groups) of people, the next two (4–5) are sets of objects, the next (6) is a set of transactions, and the last (7) is a set of events. Notice also that each set includes all the units in the population.

In studying a population, we focus on one or more characteristics or properties of the units in the population. We call such characteristics variables. For example, we may be interested in the variables age, gender, and number of years of education of the people currently unemployed in the United States.

A variable is a characteristic or property of an individual experimental (or observational) unit in the population.

The name variable is derived from the fact that any particular characteristic may vary among the units in a population.

In studying a particular variable, it is helpful to be able to obtain a numerical representation for it. Often, however, numerical representations are not readily available, so measurement plays an important supporting role in statistical studies. Measurement is the process we use to assign numbers to variables of individual population units. We might, for instance, measure the performance of the president by asking a registered voter to rate it on a scale from 1 to 10. Or we might measure the age of the U.S. workforce simply by asking each worker, “How old are you?” In other cases, measurement involves the use of instruments such as stopwatches, scales, and calipers.

If the population you wish to study is small, it is possible to measure a variable for every unit in the population. For example, if you are measuring the GPA for all incoming first-year students at your university, it is at least feasible to obtain every GPA. When we measure a variable for every unit of a population, it is called a census of the population. Typically, however, the populations of interest in most applications are much larger, involving perhaps many thousands, or even an infinite number, of units. Examples of large populations are those following the definition of population above, as well as all graduates of your university or college, all potential buyers of a new iPhone, and all pieces of first-class mail handled by the U.S. Post Office. For such populations, conducting a census would be prohibitively time consuming or costly. A reasonable alternative would be to select and study a subset (or portion) of the units in the population.

A sample is a subset of the units of a population.

For example, instead of polling all 145 million registered voters in the United States during a presidential election year, a pollster might select and question a sample of just 1,500 voters. (See Figure 1.2.) If he is interested in the variable “presidential preference,” he would record (measure) the preference of each voter sampled.

Figure 1.2

A sample of voter registration cards for all registered voters
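The idea of selecting a sample and then measuring a variable on each sampled unit can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of 100,000 voters, the candidate labels "A" and "B", and the randomly assigned preferences are all invented, and simple random selection is just one possible way to choose the subset.

```python
import random

random.seed(1)  # fixed seed so the illustration is repeatable

# Hypothetical population: 100,000 voters, each a (voter_id, preference) pair.
# The size and the candidate labels "A"/"B" are invented for illustration.
population = [(voter_id, random.choice(["A", "B"])) for voter_id in range(100_000)]

# A sample is a subset of the population's units.
# random.sample selects without replacement, so no voter is picked twice.
sample = random.sample(population, 1_500)

# Measurement: record the variable of interest for each sampled voter.
recorded_preferences = [preference for _, preference in sample]

print(len(sample), "of", len(population), "voters sampled")
print("First five recorded preferences:", recorded_preferences[:5])
```

Here each tuple plays the role of an experimental unit, and the second element of the tuple is the measured variable.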

After the variables of interest for every unit in the sample (or population) are measured, the data are analyzed, either by descriptive or inferential statistical methods. The pollster, for example, may be interested only in describing the voting patterns of the sample of 1,500 voters. More likely, however, he will want to use the information in the sample to make inferences about the population of all 145 million voters.

A statistical inference is an estimate, prediction, or some other generalization about a population based on information contained in a sample.

That is, we use the information contained in the smaller sample to learn about the larger population. Thus, from the sample of 1,500 voters, the pollster may estimate the percentage of all the voters who would vote for each presidential candidate if the election were held on the day the poll was conducted, or he might use the results to predict the outcome on election day.

Example 1.1 Key Elements of a Statistical Problem—Ages of Broadway Ticketbuyers

Problem

  1. According to Variety (Jan. 10, 2014), the average age of Broadway ticketbuyers is 42.5 years. Suppose a Broadway theatre executive hypothesizes that the average age of ticketbuyers to her theatre’s plays is less than 42.5 years. To test her hypothesis, she samples 200 ticketbuyers to her theatre’s plays and determines the age of each.

    1. Describe the population.

    2. Describe the variable of interest.

    3. Describe the sample.

    4. Describe the inference.

Solution

  1. The population is the set of all units of interest to the theatre executive, which is the set of all ticketbuyers to her theatre’s plays.

  2. The age (in years) of each ticketbuyer is the variable of interest.

  3. The sample must be a subset of the population. In this case, it is the 200 ticketbuyers selected by the executive.

  4. The inference of interest involves the generalization of the information contained in the sample of 200 ticketbuyers to the population of all her theatre’s ticketbuyers. In particular, the executive wants to estimate the average age of the ticketbuyers to her theatre’s plays in order to determine whether it is less than 42.5 years. She might accomplish this by calculating the average age of the sample and using that average to estimate the average age of the population.
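The last step of the solution, using the sample average as an estimate of the population average, might look like the following minimal sketch. The eight ages are made up for illustration (the executive's actual sample has 200 ticketbuyers), and the function name `estimate_population_mean` is hypothetical.

```python
import statistics

def estimate_population_mean(sample_values):
    """Use the sample average as the estimate of the population average."""
    return statistics.mean(sample_values)

# Hypothetical ages for a handful of sampled ticketbuyers (not real data).
sample_ages = [34, 51, 28, 45, 39, 62, 30, 41]

estimate = estimate_population_mean(sample_ages)
print(f"Estimated average age: {estimate:.2f} years")
print("Estimate below the reported 42.5 years?", estimate < 42.5)
```

A sample average below 42.5 does not by itself settle the executive's hypothesis, because a different sample would give a different average. The estimate still needs a measure of reliability.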

Look Back

A key to diagnosing a statistical problem is to identify the data set collected (in this example, the ages of the 200 ticketbuyers) as a population or a sample.

Now Work Exercise 1.13

Example 1.2 Key Elements of a Statistical Problem—Pepsi vs. Coca-Cola

Problem

  1. “Cola wars” is the popular term for the intense competition between Coca-Cola and Pepsi displayed in their marketing campaigns, which have featured movie and television stars, rock videos, athletic endorsements, and claims of consumer preference based on taste tests. Suppose, as part of a Pepsi marketing campaign, 1,000 cola consumers are given a blind taste test (i.e., a taste test in which the two brand names are disguised). Each consumer is asked to state a preference for brand A or brand B.

    1. Describe the population.

    2. Describe the variable of interest.

    3. Describe the sample.

    4. Describe the inference.

Solution

  1. Since we are interested in the responses of cola consumers in a taste test, a cola consumer is the experimental unit. Thus, the population of interest is the collection or set of all cola consumers.

  2. The characteristic that Pepsi wants to measure is the consumer’s cola preference, as revealed under the conditions of a blind taste test, so cola preference is the variable of interest.

  3. The sample is the 1,000 cola consumers selected from the population of all cola consumers.

  4. The inference of interest is the generalization of the cola preferences of the 1,000 sampled consumers to the population of all cola consumers. In particular, the preferences of the consumers in the sample can be used to estimate the percentages of cola consumers who prefer each brand.
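As a rough sketch of how the sample responses could be turned into the estimated percentages mentioned in the solution, the code below tallies stated preferences by brand. The response counts (560 for brand A, 440 for brand B) are invented for illustration, and `preference_percentages` is a hypothetical helper name.

```python
from collections import Counter

def preference_percentages(responses):
    """Return the percentage of responses recorded for each brand label."""
    counts = Counter(responses)
    total = len(responses)
    return {brand: 100 * count / total for brand, count in counts.items()}

# Hypothetical taste-test results for 1,000 consumers (not real data).
responses = ["A"] * 560 + ["B"] * 440

print(preference_percentages(responses))
```

The resulting sample percentages serve as estimates of the percentages of all cola consumers who prefer each brand.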

Look Back

In determining whether the study is inferential or descriptive, we assess whether Pepsi is interested in the responses of only the 1,000 sampled customers (descriptive statistics) or in the responses of the entire population of consumers (inferential statistics).

Now Work Exercise 1.16b

The preceding definitions and examples identify four of the five elements of an inferential statistical problem: a population, one or more variables of interest, a sample, and an inference. But making the inference is only part of the story; we also need to know its reliability—that is, how good the inference is. The only way we can be certain that an inference about a population is correct is to include the entire population in our sample. However, because of resource constraints (i.e., insufficient time or money), we usually can’t work with whole populations, so we base our inferences on just a portion of the population (a sample). Thus, we introduce an element of uncertainty into our inferences. Consequently, whenever possible, it is important to determine and report the reliability of each inference made. Reliability, then, is the fifth element of inferential statistical problems.

The measure of reliability that accompanies an inference separates the science of statistics from the art of fortune-telling. A palm reader, like a statistician, may examine a sample (your hand) and make inferences about the population (your life). However, unlike statistical inferences, the palm reader’s inferences include no measure of reliability.

Suppose, like the theatre executive in Example 1.1, we are interested in the error of estimation (i.e., the difference between the average age of a population of ticketbuyers and the average age of a sample of ticketbuyers). Using statistical methods, we can determine a bound on the estimation error. This bound is simply a number that our estimation error (the difference between the average age of the sample and the average age of the population) is not likely to exceed. We’ll see in later chapters that this bound is a measure of the uncertainty of our inference. The reliability of statistical inferences is discussed throughout this text. For now, we simply want you to realize that an inference is incomplete without a measure of its reliability.

A measure of reliability is a statement (usually quantitative) about the degree of uncertainty associated with a statistical inference.
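To make the idea of a bound on the estimation error concrete, the following sketch draws many samples of 200 ages from an invented population and looks at how far each sample average falls from the population average. Everything here is an assumption made for illustration: the population of 50,000 ages, the number of repetitions, and the choice of "exceeded by about 5% of samples" as the meaning of "not likely to exceed." It is a simulation, not the method developed in later chapters, and in practice the population average is unknown.

```python
import random
import statistics

random.seed(7)  # fixed seed so the illustration is repeatable

# Hypothetical population of 50,000 ticketbuyer ages (invented data).
population_ages = [random.randint(18, 75) for _ in range(50_000)]
population_mean = statistics.mean(population_ages)

# Draw many samples of 200 and record each estimation error:
# |average age of the sample - average age of the population|.
errors = []
for _ in range(1_000):
    sample = random.sample(population_ages, 200)
    errors.append(abs(statistics.mean(sample) - population_mean))

# Empirical bound: at least 95% of the simulated errors are no larger than this.
errors.sort()
bound = errors[int(0.95 * len(errors)) - 1]

print(f"Population average age: {population_mean:.2f}")
print(f"Estimation error was at most {bound:.2f} years in about 95% of samples")
```

The printed bound is a statement about the degree of uncertainty in the estimate, which is exactly what a measure of reliability provides.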

Let’s conclude this section with a summary of the elements of descriptive and of inferential statistical problems and an example to illustrate a measure of reliability.

Four Elements of Descriptive Statistical Problems

  1. The population or sample of interest

  2. One or more variables (characteristics of the population or sample units) that are to be investigated

  3. Tables, graphs, or numerical summary tools

  4. Identification of patterns in the data

Five Elements of Inferential Statistical Problems

  1. The population of interest

  2. One or more variables (characteristics of the population units) that are to be investigated

  3. The sample of population units

  4. The inference about the population based on information contained in the sample

  5. A measure of the reliability of the inference

Example 1.3 Reliability of an Inference—Pepsi vs. Coca-Cola

Problem

  1. Refer to Example 1.2, in which the preferences of 1,000 cola consumers were indicated in a taste test. Describe how the reliability of an inference concerning the preferences of all cola consumers in the Pepsi bottler’s marketing region could be measured.

Solution

  1. When the preferences of 1,000 consumers are used to estimate those of all consumers in a region, the estimate will not exactly mirror the preferences of the population. For example, if the taste test shows that 56% of the 1,000 cola consumers preferred Pepsi, it does not follow (nor is it likely) that exactly 56% of all cola drinkers in the region prefer Pepsi. Nevertheless, we can use sound statistical reasoning (which we’ll explore later in the text) to ensure that the sampling procedure will generate estimates that are almost certainly within a specified limit of the true percentage of all cola consumers who prefer Pepsi. For example, such reasoning might assure us that the estimate of the preference for Pepsi is almost certainly within 5% of the preference of the population. The implication is that the actual preference for Pepsi is between 51% [i.e., (56 − 5)%] and 61% [i.e., (56 + 5)%]; that is, (56 ± 5)%. This interval represents a measure of the reliability of the inference.
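The arithmetic behind the interval can be written out as a small sketch. The function name `interval_from_estimate` is hypothetical, and the 5-point limit is simply taken from the example as given. How such a limit is obtained is left to later chapters.

```python
def interval_from_estimate(estimate_pct, limit_pct):
    """Return (lower, upper) endpoints for an estimate plus or minus a limit."""
    return (estimate_pct - limit_pct, estimate_pct + limit_pct)

# Values from the example: a 56% sample preference and a limit of 5.
lower, upper = interval_from_estimate(56, 5)
print(f"Preference for Pepsi is estimated to lie between {lower}% and {upper}%")
```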

Look Ahead

The interval 56±5 is called a confidence interval, since we are confident that the true percentage of cola consumers who prefer Pepsi in a taste test falls into the range (51, 61). In Chapter 7, we learn how to assess the degree of confidence (e.g., a 90% or 95% level of confidence) in the interval.

Statistics in Action Revisited

Identifying the Population, Sample, and Inference

Consider the 2013 Pew Internet & American Life Project survey on social networking. In particular, consider the survey results on the use of social networking sites like Facebook. The experimental unit for the study is an adult (the person answering the question), and the variable measured is the response (“yes” or “no”) to the question.

The Pew Research Center reported that 1,445 adult Internet users participated in the study. Obviously, that number is not all of the adult Internet users in the United States. Consequently, the 1,445 responses represent a sample selected from the much larger population of all adult Internet users.

Earlier surveys found that 55% of adults used an online social networking site in 2006 and 65% in 2008. These are descriptive statistics that provide information on the popularity of social networking in past years. Since 73% of the surveyed adults in 2013 used an online social networking site, the Pew Research Center inferred that usage of social networking sites continues its upward trend, with more and more adults getting online each year. That is, the researchers used the descriptive statistics from the sample to make an inference about the current population of U.S. adults’ use of social networking.
