1.5 Collecting Data: Sampling and Related Issues

Once you decide on the type of data—quantitative or qualitative—appropriate for the problem at hand, you’ll need to collect the data. Generally, you can obtain data in three different ways:

  1. From a published source

  2. From a designed experiment

  3. From an observational study (e.g., a survey)

Sometimes, the data set of interest has already been collected for you and is available in a published source, such as a book, journal, or newspaper. For example, you may want to examine and summarize the divorce rates (i.e., number of divorces per 1,000 population) in the 50 states of the United States. You can find this data set (as well as numerous other data sets) at your library in the Statistical Abstract of the United States, published annually by the U.S. government. Similarly, someone who is interested in monthly mortgage applications for new home construction would find this data set in the Survey of Current Business, another government publication. Other examples of published data sources include The Wall Street Journal (financial data) and Elias Sports Bureau (sports information). The Internet (World Wide Web) now provides a medium by which data from published sources are readily obtained.

A second method of collecting data involves conducting a designed experiment, in which the researcher exerts strict control over the units (people, objects, or things) in the study. For example, an often-cited medical study investigated the potential of aspirin in preventing heart attacks. Volunteer physicians were divided into two groups: the treatment group and the control group. Each physician in the treatment group took one aspirin tablet a day for one year, while each physician in the control group took an aspirin-free placebo made to look like an aspirin tablet. The researchers (not the physicians under study) controlled who received the aspirin (the treatment) and who received the placebo. As you’ll learn in Chapter 10, a properly designed experiment allows you to extract more information from the data than is possible with an uncontrolled study.

Finally, observational studies can be employed to collect data. In an observational study, the researcher observes the experimental units in their natural setting and records the variable(s) of interest. For example, a child psychologist might observe and record the level of aggressive behavior of a sample of fifth graders playing on a school playground. Similarly, a zoologist may observe and measure the weights of newborn elephants born in captivity. Unlike a designed experiment, an observational study is a study in which the researcher makes no attempt to control any aspect of the experimental units.

The most common type of observational study is a survey, where the researcher samples a group of people, asks one or more questions, and records the responses. Probably the most familiar type of survey is the political poll, conducted by any one of a number of organizations (e.g., Harris, Gallup, Roper, and CNN) and designed to predict the outcome of a political election. Another familiar survey is the Nielsen survey, which provides the major networks with information on the most-watched programs on television. Surveys can be conducted through the mail, with telephone interviews, or with in-person interviews. Although in-person surveys are more expensive than mail or telephone surveys, they may be necessary when complex information is to be collected.

A designed experiment is a data collection method where the researcher exerts full control over the characteristics of the experimental units sampled. These experiments typically involve a group of experimental units that are assigned the treatment and an untreated (or control) group.

An observational study is a data collection method where the experimental units sampled are observed in their natural setting. No attempt is made to control the characteristics of the experimental units sampled. (Examples include opinion polls and surveys.)

Regardless of which data collection method is employed, it is likely that the data will be a sample from some population. And if we wish to apply inferential statistics, we must obtain a representative sample.

A representative sample exhibits characteristics typical of those possessed by the target population.

For example, consider a political poll conducted during a presidential election year. Assume that the pollster wants to estimate the percentage of all 145 million registered voters in the United States who favor the incumbent president. The pollster would be unwise to base the estimate on survey data collected for a sample of voters from the incumbent’s own state. Such an estimate would almost certainly be biased high; consequently, it would not be very reliable.

The most common way to satisfy the representative sample requirement is to select a random sample. A simple random sample ensures that every subset of fixed size in the population has the same chance of being included in the sample. If the pollster samples 1,500 of the 145 million voters in the population so that every subset of 1,500 voters has an equal chance of being selected, she has devised a random sample.

A simple random sample of n experimental units is a sample selected from the population in such a way that every different sample of size n has an equal chance of selection.

The procedure for selecting a simple random sample typically relies on a random number generator. Random number generators are available in table form online, and they are built into most statistical software packages. The SAS, MINITAB, and SPSS statistical software packages all have easy-to-use random number generators for creating a random sample. The next example illustrates the procedure.

Example 1.5 Generating a Simple Random Sample—Selecting Households for a Feasibility Study

Figure 1.3 Random Selection of 20 Households Using MINITAB

Problem

  1. Suppose you wish to assess the feasibility of building a new high school. As part of your study, you would like to gauge the opinions of people living close to the proposed building site. The neighborhood adjacent to the site has 711 homes. Use a random number generator to select a simple random sample of 20 households from the neighborhood to participate in the study.

Solution

  1. In this study, your population of interest consists of the 711 households in the adjacent neighborhood. To ensure that every possible sample of 20 households selected from the 711 has an equal chance of selection (i.e., to ensure a simple random sample), first assign a number from 1 to 711 to each of the households in the population. These numbers were entered into MINITAB. Now, apply the random number generator of MINITAB, requesting that 20 households be selected without replacement. Figure 1.3 shows the output from MINITAB. You can see that households numbered 78, 152, 157, 177, 216, . . . , 690 are the households to be included in your sample.
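The same selection can be sketched outside MINITAB. The following is a minimal illustration in Python, not a reproduction of the MINITAB output in Figure 1.3; the function name and the seed value are assumptions made for this sketch, and a different seed (or a different generator) will select different households.

```python
import random

# Households are numbered 1..711, as in the example.
N_HOUSEHOLDS = 711
SAMPLE_SIZE = 20

def draw_simple_random_sample(population_size, sample_size, seed=None):
    """Return sorted unit numbers drawn without replacement.

    The seed is an illustrative assumption, used only so the
    result can be reproduced.
    """
    rng = random.Random(seed)
    household_ids = range(1, population_size + 1)
    # rng.sample draws without replacement, so no household repeats.
    return sorted(rng.sample(household_ids, sample_size))

selected = draw_simple_random_sample(N_HOUSEHOLDS, SAMPLE_SIZE, seed=1)
print(selected)
```

Because the draw is made without replacement, each of the 20 household numbers appears only once in the sample.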

Look Back

It can be shown (proof omitted) that there are over 3 × 10^38 possible samples of size 20 that can be selected from the 711 households. Random number generators guarantee (to a certain degree of approximation) that each possible sample has an equal chance of being selected.
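Although the proof is omitted, the count of possible samples quoted above can be checked directly: it is the number of ways to choose 20 households from 711 when order does not matter. A short sketch in Python, using the standard library's math.comb:

```python
import math

# Number of distinct samples of size 20 from 711 households
# (combinations, since order of selection does not matter).
n_samples = math.comb(711, 20)

# Print in scientific notation for readability.
print(f"{n_samples:.3e}")
```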

In addition to simple random samples, there are more complex random sampling designs that can be employed. These include (but are not limited to) stratified random sampling, cluster sampling, systematic sampling, and randomized response sampling. Brief descriptions of each follow. (For more details on the use of these sampling methods, consult the references at the end of this chapter.)

Stratified random sampling is typically used when the experimental units associated with the population can be separated into two or more groups of units, called strata, where the characteristics of the experimental units are more similar within strata than across strata. Random samples of experimental units are obtained for each stratum, then the units are combined to form the complete sample. For example, if you are gauging opinions of voters on a polarizing issue, like government-sponsored health care, you may want to stratify on political affiliation (Republicans and Democrats), making sure that representative samples of both Republicans and Democrats (in proportion to the number of Republicans and Democrats in the voting population) are included in your survey.
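To make the idea concrete, here is a minimal sketch of stratified random sampling with proportional allocation in Python. The voter list, the stratum sizes (6,000 and 4,000), and the total sample size of 500 are all hypothetical values chosen for illustration; the function name is likewise an assumption of this sketch.

```python
import random

def stratified_sample(strata, total_sample_size, seed=None):
    """Draw a simple random sample within each stratum.

    Each stratum's sample size is proportional to its share of the
    population. Rounding can shift the total by a unit or two when
    the shares do not divide evenly.
    """
    rng = random.Random(seed)
    population_size = sum(len(units) for units in strata.values())
    sample = {}
    for name, units in strata.items():
        n_stratum = round(total_sample_size * len(units) / population_size)
        sample[name] = rng.sample(units, n_stratum)
    return sample

# Hypothetical voter list: 6,000 Democrats and 4,000 Republicans.
voters = {
    "Democrat": [f"D{i}" for i in range(6000)],
    "Republican": [f"R{i}" for i in range(4000)],
}

sample = stratified_sample(voters, total_sample_size=500, seed=1)
print({party: len(chosen) for party, chosen in sample.items()})
```

With these hypothetical sizes, the 500 interviews split 300 to 200, matching the 60/40 split in the population.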

Sometimes it is more convenient and logical to sample natural groupings (clusters) of experimental units first, then collect data from all experimental units within each cluster. This involves the use of cluster sampling. For example, suppose a marketer for a large upscale restaurant chain wants to find out whether customers like the new menu. Rather than collect a simple random sample of all customers (which would be very difficult and costly to do), the marketer will randomly sample 10 of the 150 restaurant locations (clusters), then interview all customers eating at each of the 10 locations on a certain night.
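The restaurant example can be sketched as follows. This is an illustrative Python sketch only: the customer lists (40 customers per location) are hypothetical, and the function name is an assumption. The key feature of cluster sampling is visible in the code: randomness enters only in the choice of clusters, and every unit inside a chosen cluster is included.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly choose clusters, then keep every unit in each chosen cluster."""
    rng = random.Random(seed)
    chosen_ids = rng.sample(sorted(clusters), n_clusters)
    units = [unit for cid in chosen_ids for unit in clusters[cid]]
    return chosen_ids, units

# Hypothetical data: 150 locations, each with 40 customers on the survey night.
restaurants = {
    loc: [f"loc{loc}-cust{c}" for c in range(40)]
    for loc in range(1, 151)
}

chosen_locations, customers = cluster_sample(restaurants, n_clusters=10, seed=1)
print(chosen_locations)
print(len(customers))
```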

Another popular sampling method is systematic sampling. This method involves systematically selecting every kth experimental unit from a list of all experimental units. For example, every fifth person who walks into a shopping mall could be asked whether he or she owns a smartphone. Or a quality control engineer at a manufacturing plant may select every 10th item produced on an assembly line for inspection.
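Selecting every kth unit is simple to express in code. The sketch below, in Python, uses a hypothetical run of 100 items numbered 1 to 100. The optional random starting position is a common refinement that the description above does not mention, so treat it as an assumption of this sketch.

```python
import random

def systematic_sample(units, k, start=None, seed=None):
    """Select every kth unit from an ordered list.

    start is the zero-based position of the first unit selected.
    If it is omitted, a random position among the first k units is
    used (an assumption of this sketch, not part of the text).
    """
    if start is None:
        start = random.Random(seed).randrange(k)
    return units[start::k]

# Hypothetical production run: items numbered 1 through 100.
items = list(range(1, 101))

# Start at the 10th item, then take every 10th item after it.
inspected = systematic_sample(items, k=10, start=9)
print(inspected)
```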

A fourth alternative to simple random sampling is randomized response sampling. This design is particularly useful when the questions of the pollsters are likely to elicit false answers. For example, suppose each person in a sample of wage earners is asked whether he or she cheated on an income tax return. A cheater might lie, thus biasing an estimate of the true likelihood of someone cheating on his or her tax return. To circumvent this problem, each person is presented with two questions, one being the object of the survey and the other an innocuous question, such as:

  1. Did you ever cheat on your federal income tax return?

  2. Did you drink coffee this morning?

The wage earner flips a coin to choose, at random, which of the two questions to answer; the interviewer does not know which question was chosen. In this way, the randomized response method attempts to elicit an honest response to a sensitive question. Sophisticated statistical methods are then employed to derive an estimate of the percentage of “yes” responses to the sensitive question.
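One simple version of the estimate can be sketched under two stated assumptions that go beyond the text: the coin is fair, and the proportion of people who would answer “yes” to the innocuous question is known from another source. Under those assumptions, the overall proportion of “yes” answers is an equal mix of the two questions, and the sensitive proportion can be solved for algebraically. The survey figures in the Python sketch below are hypothetical.

```python
def estimate_sensitive_proportion(yes_fraction, innocuous_yes_rate,
                                  p_sensitive_question=0.5):
    """Estimate the 'yes' proportion for the sensitive question.

    Model (an assumption of this sketch):
        yes_fraction = p * sensitive + (1 - p) * innocuous_yes_rate
    where p is the chance the coin selects the sensitive question.
    Solving for the sensitive proportion gives the expression below.
    """
    p = p_sensitive_question
    return (yes_fraction - (1 - p) * innocuous_yes_rate) / p

# Hypothetical survey: 35% of all answers were "yes", and 60% of
# wage earners are assumed to drink coffee in the morning.
estimate = estimate_sensitive_proportion(0.35, 0.60)
print(round(estimate, 3))
```

With sampling variation, this formula can return a value below 0 or above 1 in a small sample, which is one reason more careful methods are used in practice.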

No matter what type of sampling design you employ to collect the data for your study, be careful to avoid selection bias. Selection bias occurs when some experimental units in the population have less chance of being included in the sample than others. This results in samples that are not representative of the population. Consider an opinion poll that employs either a telephone survey or mail survey. After a random sample of phone numbers or mailing addresses is collected, each person in the sample is contacted via telephone or the mail and a survey is conducted. Unfortunately, these types of surveys often suffer from selection bias due to nonresponse. Some individuals may not be home when the phone rings, or others may refuse to answer the questions or mail back the questionnaire. As a consequence, no data are obtained for the nonrespondents in the sample. If the nonrespondents and respondents differ greatly on an issue, then nonresponse bias exists. For example, those who choose to answer a question on a school board issue may have a vested interest in the outcome of the survey (say, parents with children of school age, schoolteachers whose jobs may be in jeopardy, or citizens whose taxes might be substantially affected). Others with no vested interest may have an opinion on the issue but might not take the time to respond.

Selection bias results when a subset of experimental units in the population has little or no chance of being selected for the sample.

Nonresponse bias is a type of selection bias that results when data on all experimental units in a sample are not obtained.
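A small calculation shows how nonresponse can distort an estimate even when the original sample is random. Every number in the Python sketch below is hypothetical, chosen only to mirror the school board illustration: people with a vested interest are assumed to respond far more often and to hold a different opinion than everyone else.

```python
# All values are hypothetical and for illustration only.
groups = {
    # name: (share of the sample, response rate, fraction in favor)
    "vested interest": (0.30, 0.90, 0.80),
    "no vested interest": (0.70, 0.20, 0.40),
}

def true_and_observed_support(groups):
    """Compare support in the full sample with support among respondents."""
    true_support = sum(share * favor for share, _, favor in groups.values())
    responding = sum(share * rate for share, rate, _ in groups.values())
    favorable_responses = sum(
        share * rate * favor for share, rate, favor in groups.values()
    )
    return true_support, favorable_responses / responding

true_support, observed_support = true_and_observed_support(groups)
print(f"full sample: {true_support:.3f}, respondents only: {observed_support:.3f}")
```

Under these assumed rates, the respondents overstate support because the group that favors the measure is also the group most likely to reply.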

Finally, even if your sample is representative of the population, the data collected may suffer from measurement error. That is, the values of the data (quantitative or qualitative) may be inaccurate. In sample surveys, opinion polls, and so on, measurement error often results from ambiguous or leading questions. Consider the survey question: “How often did you change the oil in your car last year?” It is not clear whether the researcher is wanting to know how often you personally changed the oil in your car or how often you took your car into a service station to get an oil change. The ambiguous question may lead to inaccurate responses. On the other hand, consider the question: “Does the new health plan offer more comprehensive medical services at less cost than the old one?” The way the question is phrased leads the reader to believe that the new plan is better and to a “yes” response—a response that is more desirable to the researcher. A better, more neutral way to phrase the question is: “Which health plan offers more comprehensive medical services at less cost, the old one or the new one?”

Measurement error refers to inaccuracies in the values of the data collected. In surveys, the error may be due to ambiguous or leading questions and the interviewer’s effect on the respondent.

We conclude this section with two examples involving actual sampling studies.

Example 1.6 Method of Data Collection—Internet Addiction Study

Problem

  1. What percentage of Web users are addicted to the Internet? To find out, a psychologist designed a series of 10 questions based on a widely used set of criteria for gambling addiction and distributed them through the Web site ABCNews.com. (A sample question: “Do you use the Internet to escape problems?”) A total of 17,251 Web users responded to the questionnaire. If participants answered “yes” to at least half of the questions, they were viewed as addicted. The findings, released at an annual meeting of the American Psychological Association, revealed that 990 respondents, or 5.7%, are addicted to the Internet.

    1. Identify the data collection method.

    2. Identify the target population.

    3. Are the sample data representative of the population?

Solution

  1. The data collection method is a survey: 17,251 Internet users responded to the questions posed at the ABCNews.com Web site.

  2. Since the Web site can be accessed by anyone surfing the Internet, presumably the target population is all Internet users.

  3. Because the 17,251 respondents clearly make up a subset of the target population, they do form a sample. Whether or not the sample is representative is unclear, since we are given no information on the 17,251 respondents. However, a survey like this one in which the respondents are self-selected (i.e., each Internet user who saw the survey chose whether to respond to it) often suffers from nonresponse bias. It is possible that many Internet users who chose not to respond (or who never saw the survey) would have answered the questions differently, leading to a higher (or lower) percentage of affirmative answers.

Look Back

Any inferences based on survey samples that employ self-selection are suspect due to potential nonresponse bias.

Example 1.7 Method of Data Collection—Study of Susceptibility to Hypnosis

Problem

  1. In a classic study, psychologists at the University of Tennessee investigated the susceptibility of people to hypnosis (Psychological Assessment, Mar. 1995). Each student in a random sample of 130 undergraduate psychology students at the university experienced both traditional hypnosis and computer-assisted hypnosis. Approximately half were randomly assigned to undergo the traditional procedure first, followed by the computer-assisted procedure. The other half were randomly assigned to experience computer-assisted hypnosis first, then traditional hypnosis. Following the hypnosis episodes, all students filled out questionnaires designed to measure a student’s susceptibility to hypnosis. The susceptibility scores of the two groups of students were compared.

    1. Identify the data collection method.

    2. Are the sample data representative of the target population?

Solution

  1. Here, the experimental units are the psychology students. Since the researchers controlled which type of hypnosis—traditional or computer assisted—the students experienced first (through random assignment), a designed experiment was used to collect the data.

  2. The sample of 130 psychology students was randomly selected from all psychology students at the University of Tennessee. If the target population is all University of Tennessee psychology students, it is likely that the sample is representative. However, the researchers warn that the sample data should not be used to make inferences about other, more general populations.

Look Ahead

By using randomization in a designed experiment, the researcher is attempting to eliminate different types of bias, including self-selection bias. (We discuss this type of bias in the next section.)

Now Work Exercise 1.27

Statistics in Action Revisited

Identifying the Data Collection Method and Data Type

In the Pew Internet & American Life Project report, American adults were asked to respond to a variety of questions about social networking site usage. According to the report, the data were obtained through phone interviews in the United States of 1,445 adult Internet users. Consequently, the data collection method is a survey (observational study).

Both quantitative and qualitative data were collected in the survey. For example, the survey question asking adults if they ever use a social networking site is phrased to elicit a “yes” or “no” response. Since the responses produced for this question are categorical in nature, these data are qualitative. However, the question asking for the number of social networking sites used will give meaningful numerical responses, such as 0, 1, 2, and so on. Thus, these data are quantitative.
