CHAPTER 2

Data, Data Everywhere and Not a Drop to Drink

In This Chapter

  • The difference between data and information
  • Where does data come from?
  • What kinds of data and variables can we use?
  • Classifying data by their level of measurement
  • Setting up Excel for statistical analysis

Data is the basic foundation for the field of statistics. The validity of any statistical study hinges on the validity of the data from the beginning of the process. Many things can come into question, such as the accuracy of the data or the source of the data. Without the proper foundation, your efforts to provide a sound analysis will come tumbling down.

The issues surrounding data can be surprisingly complex. After all, aren’t we just talking about numbers here? What could go wrong? Well, plenty can. Because data can be classified in different ways, we need to recognize the differences among those classifications. Data also can be measured in many ways. The data measurement choice we make at the start of the study will determine what kind of statistical techniques we can apply.

The Importance of Data

Data is simply defined as the value assigned to a specific observation or measurement. If Bob is collecting data on his wife’s snoring behavior, he can do so in different ways. He can measure how many times Debbie snores over a 10-minute period. He can measure the length of each snore in seconds. He could also measure how loud each snore is with a descriptive phrase, like “That one sounded like a bear just waking up from hibernation!” or “Wow! That one sounded like an Alaskan seal calling for its young.” (How a sound like that can come from a person who can fit into a pair of size 2 jeans and still be able to breathe, you’ll never know.)

In each case, we’re recording data on the same event in a different form. In the first case, we’re measuring a frequency or number of occurrences. In the second instance, we’re measuring duration or length in time. And the final attempt measures the event by describing volume using words rather than numbers. Each of these cases just shows a different way to use data.

Data provides the building blocks of all statistical studies. You can hire the most expensive, well-known statisticians and provide them with the latest computer hardware and software available, but if the data you provide them is inaccurate or not relevant to the study, then the final results will be worthless.

However, data all by its lonesome is not all that useful. By definition, data is just the raw facts and figures that pertain to a measurement of interest. Information, on the other hand, is derived from the facts for the purpose of making decisions. One of the major reasons to use statistics is to transform data into information. For example, the table that follows shows monthly sales data for a small retail store.

DEFINITION

Data is the value assigned to an observation or a measurement and provides the building blocks of statistical analysis. Information, on the other hand, is data that is transformed into useful facts that can be used for a specific purpose, such as making a decision.

Monthly Sales Data

Month       Sales ($)
January     15,178
February    14,293
March       13,492
April       12,287
May         11,321

Using statistical analysis, we can generate information that may be of interest, such as “Wake up! You are doing something very wrong. At this rate, you will be out of business by early next year.” Based on this valuable information, we can make some important decisions about how to avoid this impending disaster.
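As a rough illustration of turning this data into information, the short Python sketch below computes the month-to-month change in the sales figures from the table and estimates how long it would take sales to reach zero if the decline simply continued. The straight-line projection and the variable names are our own assumptions for illustration, not a forecasting method this chapter prescribes.

```python
# Monthly sales figures from the table above (January through May).
sales = {
    "January": 15178,
    "February": 14293,
    "March": 13492,
    "April": 12287,
    "May": 11321,
}

values = list(sales.values())

# Month-to-month changes: each month's sales minus the previous month's.
changes = [later - earlier for earlier, later in zip(values, values[1:])]

# Average change per month (a negative number means sales are falling).
average_change = sum(changes) / len(changes)

# Assumption: the decline continues in a straight line at the average rate.
months_until_zero = values[-1] / -average_change

print("Monthly changes:", changes)
print("Average monthly change:", average_change)
print("Months until sales hit zero:", round(months_until_zero, 1))
```

Under that simple assumption, sales fall by a little under $1,000 a month and would run out somewhat less than a year after May. That is the kind of conclusion the raw numbers alone don't make obvious.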

TEST YOUR KNOWLEDGE

Is the word “data” singular or plural? In the dictionary, “data” is a plural noun. The singular is “datum.” However, people commonly now use the word “data” as both plural and singular, depending on what it references. If “data” is used as a count noun (nouns that can be counted; describe “how many”), then it is plural. For example, data include names, ages, gender, and race of students. If “data” is used as a mass noun (nouns that cannot be counted; describe “how much”), then people often use it as singular. For example, data is available everywhere.

The Sources of Data—Where Does All This Stuff Come From?

We classify the sources of data into two broad categories: primary and secondary. Secondary data is data that somebody else has collected and made available for others to use. The U.S. government loves to collect and publish all sorts of interesting data, just in case anyone should need it. The Department of Commerce handles census data, and the Department of Labor collects mountains of, you guessed it, labor statistics. The Department of the Interior provides all sorts of data about U.S. resources.

DEFINITION

Primary data is data that you have collected for your own use. Secondary data is data collected by someone else that you are “borrowing.”

The main drawback of using secondary data is that you have no control over how the data was collected. It’s a natural human tendency to believe anything that’s in print (you believe us, don’t you?), and sometimes that requires a leap of faith. The advantage of secondary data is that it’s cheap (sometimes free) and it’s available immediately. That’s called instant gratification.

Primary data, on the other hand, is data collected by the person who eventually uses this data. It can be expensive to acquire, but the main advantage is that it’s your data and you have control over how it’s gathered. Then you have nobody else to blame but yourself if you make a mess of it.

When collecting primary data, you want to ensure that the results will not be biased by the manner in which they are collected. You can obtain primary data in many ways, such as direct observation, surveys, and experiments.

Direct Observation—I’ll Be Watching You

Direct observation involves observing behavior as it occurs, whether the subjects know it or not. Most often, this method focuses on gathering data while the subjects of interest are in their natural environment, oblivious to being watched (called disguised observation). Examples of these studies would be observing wild animals stalking their prey in the forest or teenagers at the mall on Friday night (or is that the same example?). The advantage of this method is that the subjects are unlikely to be influenced by the data collection.

Focus groups are a direct observational technique where the subjects are aware that data is being collected (undisguised observation). Businesses use focus groups to gather information in a group setting controlled by a moderator. The subjects are asked to discuss specific topics and are usually paid for their time.

Experiments—Who’s in Control?

This method is more direct than observation because the subjects participate in an experiment designed to determine the effectiveness of a treatment. An example of a treatment could be the use of a new medical drug. Two groups would be established. The first is the experimental group, whose members receive the new drug, and the second is the control group, whose members think they are getting the new drug but are in fact getting no medication (a placebo). The reactions from each group are measured and compared to determine whether the new drug is effective.

The benefit of experiments is that they allow the statistician who designs the experiment to observe how certain variables, such as the gender, age, and education of the participants, could influence the results. The concern about collecting data through experiments is that the response of the subjects might be influenced by the fact that they are participating in a study. In addition, the claims that the experimental studies are attempting to verify need to be clear and specific. The design of experiments for a statistical study is a very complex topic and goes beyond the scope of this book.

Surveys—Is That Your Final Answer?

This technique of data collection involves asking the subject a series of questions. The questionnaire needs to be carefully designed to avoid any bias (see Chapter 1) or confusion for those participating. Concerns also exist about the influence that the survey will have on the participants’ responses. Some participants tend to adjust their responses to fall in line with what they believe is socially desirable or “the right answer.” The survey can be administered online or by e-mail, snail mail, or telephone. It’s the telephone survey that I’m most fond of, especially when I get the call just as I’m sitting down to dinner, getting into the shower, or finally making some progress on the chapter I’m writing.

Whatever method you employ, your primary concern should always be that the sample is representative of the population in which you are interested.

BOB’S BASICS

Research has shown that the manner in which the questions are asked can affect the responses that a person provides on a questionnaire. A question posed in a positive tone will tend to evoke a more positive response and vice versa. A good strategy is to test your questionnaire with a small group of people before releasing it to the general public.

Data and Types of Variables

Variables are measures of characteristics of interest. The variable is the basic element of algebra, and it can take on any value (numerical or non-numerical). Data are the values that the variable takes. The range of values that each variable can have differs significantly.

Variables can be divided according to different criteria, such as:

  • Quantitative versus qualitative
  • Discrete versus continuous
  • Dependent versus independent

Quantitative vs. Qualitative Variables

A quantitative variable is a variable with numerical values such as age and income. A qualitative variable is a variable with descriptive, non-numerical values, such as gender and race. If I ask you about your gender, you can describe yourself as male or female. The answer is a descriptive value, not a number, so it is a qualitative variable. Whereas if I ask you about your age, you will give me a number so it is a quantitative variable. If you ask me, I’d say I’m 21 years old!

Discrete vs. Continuous Variables

Quantitative variables can be classified as either discrete or continuous. A discrete variable assumes certain values only, usually with gaps between the values. It results from counting or enumeration. For example, the number of cars or children is a discrete value. If I ask you how many cars you have, you could say “one,” “two,” or “three.” But can you say, “I have 1.5 cars”? Or can you say, “I have 2.5 kids”? Of course not! This is a discrete variable, and its value is usually a whole number with no decimal or fraction. A continuous variable, on the other hand, can take on any value within a certain range, such as length, distance, and height. It results from measurements. For example, you can say the distance from your home to the nearest mall is 5.6 miles.
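One way to keep these categories straight is to look at how the values might be stored on a computer. The Python sketch below uses a made-up survey response (the field names and values are hypothetical) and a small helper of our own, `describe_variable`, that labels each value as qualitative, discrete, or continuous based on its type. Treat it as a rule of thumb rather than a rigorous test: a zip code stored as a number, for instance, would fool it.

```python
# A hypothetical survey response; field names and values are invented.
response = {
    "gender": "female",       # a descriptive label
    "number_of_cars": 2,      # a count
    "miles_to_mall": 5.6,     # a measurement
}


def describe_variable(value):
    """Label a value by the kind of variable it most likely represents."""
    if isinstance(value, str):
        return "qualitative"
    if isinstance(value, int):
        return "quantitative, discrete"
    if isinstance(value, float):
        return "quantitative, continuous"
    return "unknown"


for name, value in response.items():
    print(name, "->", describe_variable(value))
```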

Dependent vs. Independent Variables

A dependent variable is a variable whose value is determined by another variable. An independent variable is a variable whose value is unaffected by another variable. For example, your exam grade depends on how long you studied. Here, your exam grade is the dependent variable and your hours of study is the independent variable. The dependent variable responds to the independent variable. That’s why it is called the dependent variable: it depends on the independent variable.

Types of Measurement Scales—a Weighty Topic

An important way to classify data is by the way it is measured. This distinction is critical because it affects which statistical techniques we can use in our analysis of the data. Data can be classified at four levels of measurement: nominal, ordinal, interval, and ratio. Each level adds something to the previous one. Let’s look at each.

Nominal Level of Measurement

A nominal level of measurement deals with qualitative variables. In nominal measurement, names or classifications are used to divide the data into separate categories, with no meaningful order. One example is gender, with the categories being male and female. I could assign the number 0 for males and 1 for females, or I could switch them to 1 for males and 0 for females. But as you can see in this example, the order or value of the number assigned to the category is unimportant. In other words, we cannot reasonably rank-order the numbers from highest to lowest because only the qualitative measurements are meaningful, not the numbers assigned to the categories.

Other examples of nominal data are zip codes (there is no meaningful order or ranking to zip codes), marital status, race, types of dogs, and so forth.

This measurement type does not allow us to perform any mathematical operations, such as adding or multiplying. Data of this type is considered the lowest level of data. As a result, it is the most restrictive when choosing a statistical technique to use for analysis.
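To see why arithmetic on nominal codes is meaningless, consider the Python sketch below. The list of marital-status responses and both sets of code numbers are hypothetical, chosen only for illustration. Counting how often each category appears gives the same answer no matter how the categories are numbered, while the "average" of the code numbers changes with the numbering.

```python
from collections import Counter

# Hypothetical marital-status responses (nominal data).
responses = ["single", "married", "married", "divorced", "married", "single"]

# Counting how often each category occurs is a valid summary.
counts = Counter(responses)
most_common_category = counts.most_common(1)[0][0]

# Two equally arbitrary ways to assign code numbers to the categories.
coding_a = {"single": 0, "married": 1, "divorced": 2}
coding_b = {"single": 1, "married": 2, "divorced": 0}

# The "average code" depends on the numbering, so it carries no meaning.
mean_a = sum(coding_a[r] for r in responses) / len(responses)
mean_b = sum(coding_b[r] for r in responses) / len(responses)

print(counts)
print("Most common category:", most_common_category)
print("Mean under coding A:", mean_a, "| under coding B:", mean_b)
```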

Ordinal Level of Measurement

On the food chain of data measurement, ordinal is the next level up. It has all the properties of nominal data with the added feature that we can rank-order the categories or values from lowest to highest. The ordinal level of measurement can be qualitative or quantitative data. In ordinal measurements, the order of the numbers or values is meaningful while the magnitude of the values is not.

An example of ordinal measurements would be a survey with the options: strongly disagree (1), disagree (2), neutral (3), agree (4), and strongly agree (5). Here, the order of the numbers is important since (4) is better than (2), but the magnitude of the number assigned to the category is not important. I cannot say that (4) is twice as good as (2). Another example of the ordinal level of measurement is movie ratings with 1, 2, 3, or 4 stars. We know a 4-star movie is better than a 1-star movie; however, we cannot claim that a 4-star movie is 4 times as good as a 1-star movie.
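The following Python sketch shows what ordinal data does allow: putting responses in order and picking out the middle one. The scale labels come from the survey example above; the seven responses themselves are hypothetical. Notice that the code never adds or multiplies the rank numbers, because only their order is meaningful.

```python
# The five survey options from the text, listed from lowest to highest.
scale = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
rank = {label: position for position, label in enumerate(scale, start=1)}

# Hypothetical responses from seven people.
responses = ["agree", "neutral", "strongly agree", "disagree",
             "agree", "agree", "neutral"]

# Ranking works: sort the responses by their position on the scale.
ordered = sorted(responses, key=rank.get)

# The middle response depends only on order, so it is a meaningful summary.
middle_response = ordered[len(ordered) // 2]

print(ordered)
print("Middle response:", middle_response)
print("Is 'agree' higher than 'disagree'?", rank["agree"] > rank["disagree"])
```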

Interval Level of Measurement

Moving up the scale of data, we find ourselves at the interval level, which deals with quantitative variables only. With interval measurements, both the order and the magnitude of the numbers are meaningful because the intervals between adjacent values are equal. Now we get to work with the mathematical operations of addition and subtraction when comparing values. For this data, we can measure the difference between the categories with numbers that actually provide meaningful information. An example is temperature: 80°F is 10 degrees warmer than 70°F. However, ratio comparisons don’t make sense, so we cannot perform multiplication and division on this type of data. Why not? Simply because we cannot argue that 80°F is twice as warm as 40°F.

Another characteristic of interval measurements is that there is no true zero point: zero does not mean that there is no quantity. For example, 0°F or 0°C does not mean that there is no temperature (even though they feel very cold!). Other examples of interval data are IQ scores, SAT scores, and GPAs.

Ratio Level of Measurement

The king of measurement types is the ratio level, which applies to quantitative variables. This is as good as it gets as far as data is concerned. Now we can perform all four mathematical operations to compare values with absolutely no feelings of guilt. Ratio data has all the features of interval data (meaningful order and magnitude) with the added benefit of a true zero point. The term “true zero point” means that a 0 data value indicates the absence of the object being measured. An example of ratio level data is weight: 80 pounds is 10 pounds more than 70 pounds, 0 pounds actually indicates the absence of any pounds, and 80 pounds is twice as heavy as 40 pounds (we cannot say this for the interval measurement). Other examples include distance, age, height, time, and salary.

With a true 0 point, ratios make sense and we can use the rules of multiplication and division to compare data values. This allows us to say that a person who is 6 feet in height is twice as tall as a 3-foot person or that a 20-year-old person is half the age of a 40-year-old.

The distinction between interval and ratio data is a fine line. To help identify the proper scale, use the “twice as much” rule. If the phrase “twice as much” accurately describes the relationship between two values that differ by a multiple of 2, then the data can be considered ratio level.
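One way to check the “twice as much” rule is to see whether the ratio survives a change of units. The Python sketch below, with helper functions we've named ourselves, compares 80°F with 40°F and 80 pounds with 40 pounds, first in the original units and then after converting to Celsius and kilograms. It is only an illustration of the idea, not a formal test for the level of measurement.

```python
def fahrenheit_to_celsius(degrees_f):
    """Convert a Fahrenheit temperature to Celsius."""
    return (degrees_f - 32) * 5 / 9


def pounds_to_kilograms(pounds):
    """Convert pounds to kilograms (1 lb = 0.45359237 kg)."""
    return pounds * 0.45359237


# Interval data: the ratio changes when the units change.
ratio_f = 80 / 40
ratio_c = fahrenheit_to_celsius(80) / fahrenheit_to_celsius(40)

# Ratio data: the ratio is the same in any units.
ratio_lb = 80 / 40
ratio_kg = pounds_to_kilograms(80) / pounds_to_kilograms(40)

print("80°F vs. 40°F, in Fahrenheit:", ratio_f)
print("Same temperatures, in Celsius:", round(ratio_c, 2))
print("80 lb vs. 40 lb, in pounds:", ratio_lb)
print("Same weights, in kilograms:", round(ratio_kg, 2))
```

The temperature ratio is 2 in Fahrenheit but about 6 in Celsius, so “twice as warm” depends entirely on where the scale happens to put its zero. The weight ratio stays at 2 in both units because 0 pounds and 0 kilograms mean the same thing: no weight at all.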

WRONG NUMBER

Interval data does not have a true 0 point. For example, 0 degrees Fahrenheit does not represent the absence of temperature, even though it may feel like it. To help explain this, try baking a cake at twice the recommended temperature in half the recommended time. Yuck!

Figure 2.1 summarizes the different data scales and how they relate to one another. As we explore different statistical techniques later in this book, we will revisit these different measurement scales. You will discover that specific techniques require certain types of data.

Figure 2.1

Summary of data measurement scales.
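As a rough companion to this summary, the Python sketch below encodes the cumulative nature of the four scales as described in this chapter: each level keeps the capabilities of the levels beneath it and adds one of its own. The names `LEVELS`, `ADDED_CAPABILITY`, and `capabilities` are our own, and the wording of each capability is a paraphrase of the text.

```python
# Levels of measurement, lowest to highest, with the capability each one adds.
LEVELS = ["nominal", "ordinal", "interval", "ratio"]
ADDED_CAPABILITY = {
    "nominal": "sort into categories",
    "ordinal": "rank in order",
    "interval": "measure meaningful differences",
    "ratio": "express one value as a multiple of another",
}


def capabilities(level):
    """Return everything a level allows, including what it inherits."""
    position = LEVELS.index(level)
    return [ADDED_CAPABILITY[name] for name in LEVELS[: position + 1]]


for level in LEVELS:
    print(level + ":", "; ".join(capabilities(level)))
```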

Computers to the Rescue

As mentioned in Chapter 1, we will explore the use of Excel in solving some of the statistics problems in this book. If you have no interest in using Excel in this manner, just skip this section. We promise you won’t hurt our feelings. The purpose of this last section is to talk about the use of computers with statistics in general and then to make sure that your computer is ready to follow along.

The Role of Computers in Statistics

During the 1970s and 1980s, the only serious statistical analysis was performed on mainframe computers by people with high levels of programming skill. These people were somewhat “different” from the rest of us. Fortunately, we have advanced from the Dark Ages and now have awesome, user-friendly computing power at our fingertips. Powerful programs such as SAS, SPSS, Stata, Minitab, R, and Excel are readily accessible to those of us who don’t know a lick of computer programming and allow us to perform some of the most sophisticated statistical analysis known to humankind.

Parts of this book will demonstrate how to apply some of the statistical techniques using Microsoft Excel. Choosing to skip these parts will not interfere with your grasp of topics in subsequent chapters. This is simply optional material to expose you to statistical analysis on the computer. I also assume you already have a basic working knowledge of how to use Excel.

Installing the Data Analysis Add-In

Our first task is to check whether Excel’s data analysis tool is available on your computer. If you are a Mac user and do not have Excel 2016 for Mac, you will not be able to use this add-in without a third-party installation (so feel free to research your options!). For PC users, follow along:

  • Open an Excel spreadsheet and click on the Data tab at the top of the Excel window. If you see the Data Analysis icon as in Figure 2.2 below, then you are all set and don’t need to install anything. You can safely skip the rest of this section and move on to the practice problems.

Figure 2.2

Excel’s Data Analysis add-in.

  • If you don’t see the Data Analysis icon, as in Figure 2.3, then you need to install it by following the steps below.

Figure 2.3

Excel without the Data Analysis add-in.

1. Click on File then choose Options. Select Add-Ins and click on Analysis ToolPak.

Figure 2.4

Adding Excel’s Options.

2. Click on Go, and then check the boxes for Analysis ToolPak and Analysis ToolPak–VBA. Solver is also useful, so you may want to check it, too. Then click OK.

Figure 2.5

Excel’s Add-Ins dialog box.

3. Click on the Data tab, and you will now see the Data Analysis icon displayed on the right-hand side.

Figure 2.6

Excel with the Data Analysis Add-In.

4. Click on Data Analysis, and you will see the Data Analysis menu, shown in Figure 2.7.

Figure 2.7

Excel’s Data Analysis dialog box.

Your Excel program is now ready to perform all sorts of statistical magic for you as we explore various techniques throughout this book. At this point, you can click Cancel and close out Excel. Each time you open Excel in the future, the Data Analysis icon should be available under the Data tab at the top of the Excel window.

Practice Problems

Classify the following data as nominal, ordinal, interval, or ratio. Explain your choice.

1. Average monthly temperature in Fahrenheit degrees for the city of Wilmington throughout the year

2. Average monthly rainfall in inches for the city of Wilmington throughout the year

3. Education level of survey respondents

Level                Number of Respondents
High school          168
Bachelor’s degree    784
Master’s degree      212

4. Marital status of survey respondents

Status      Number of Respondents
Single      28
Married     189
Divorced    62

5. Age of the respondents in the survey

6. Gender of the respondents in the survey

7. The year in which the respondent was born

8. The voting affiliations of the respondents in the survey, classified as Republican, Democrat, or Undecided

9. The race of the respondents in the survey, classified as White, African American, Asian, or Other

10. The performance rating of employees, classified as Above Expectations, Meets Expectations, or Below Expectations

11. The uniform number of each member on a sports team

12. A list of the graduating high school seniors by class rank

13. The final exam scores for my statistics class on a scale of 0 to 100

14. The state in which the respondents in a survey reside

The Least You Need to Know

  • Data serves as the building blocks for all statistical analysis.
  • Data are the values that variables can take. Variables can be classified as quantitative or qualitative; discrete or continuous; and dependent or independent.
  • Nominal data is assigned to categories with no mathematical comparisons between observations.
  • Ordinal data has all the properties of nominal data with the additional capability of arranging the observations in order.
  • Interval data has all the properties of ordinal data with the additional capability of calculating meaningful differences between the observations.
  • Ratio data has all the properties of interval data with the additional capability of expressing one observation as a multiple of another.