Data Analysis
Statistics is the mathematical approach to organizing and interpreting numerical information. The data collected through survey research are typically interpreted using conventional statistical analyses or other scholarly methods of data analysis. The results of statistical analyses help researchers describe patterns in the data, analyze relationships, make comparisons, and even predict future outcomes.
A variable is any measurable characteristic that may vary or change among population elements. For example, in a study on health habits among young adults, weight may be considered a variable; everyone who weighs 55 kilograms has the same numerical value, but weight differs from person to person across the population. Another example of a variable may be level of satisfaction with sports shoes; however, unlike weight, which has a consistent form of measurement across all circumstances, a scale for measuring satisfaction with sports shoes may not already exist. Therefore, a numerical scale with standard rules must be created to properly and consistently collect this information across customers. For example, in Survey A, product satisfaction may be measured on a scale from 1 to 100, with 100 representing absolute satisfaction. Survey B, however, may measure satisfaction by counting the number of repeat customers. In that case, the rule is that at least 15 percent of all customers must reorder sports shoes within a year to demonstrate satisfaction.
Variables fall into one of two classifications: independent or dependent. Independent variables (i.e., explanatory or predictor variables) help explain or predict a response, outcome, or result; changes in an independent variable correspond to changes in the dependent variable. An independent variable stands alone and is not influenced or changed by any of the other variables you are trying to measure. For example, in a study on employee satisfaction, age might be an independent variable of interest. Other factors (e.g., eating habits, education level, time spent watching TV) are not going to change a person’s age; this variable is unaffected by them. In fact, when looking for some kind of relationship between variables, you are trying to see if the independent variable (e.g., age) influences some change in your other variables of interest (e.g., employee satisfaction levels). These other variables of interest are most often your dependent variables. A dependent variable, as you might guess, is something that depends on other factors. For example, a test score could be a dependent variable because it could change, depending on several factors (e.g., hours spent studying, hours of sleep the night before an exam, or even how hungry you were when you took the test). Usually, when you are looking for a relationship between two things, you are trying to find out which factors influence or change the dependent variable.
The usual symbolism for a dependent variable is “y” and for an independent variable, “x.” Two examples are provided in Exhibit 6.1.
Exhibit 6.1
Example 1
Survey Objective: To assess how employment status (i.e., full-time, part-time, or unemployed) impacts scores on a “Quality of Life Inventory” among men who had surgery for prostate cancer two years ago.
Number of Independent Variables: 1 (employment status)
Number of Dependent Variables: 1 (quality of life)
Example 2
Survey Objective: To compare boys and girls in terms of whether they do or do not support the school’s new dress code.
Number of Independent Variables: 1 (gender)
Number of Dependent Variables: 1 (support for dress code)
Descriptive and inferential statistics: There are two applications of statistics: (1) to describe characteristics of the population or sample (descriptive statistics) and (2) to generalize from the sample to the population (inferential statistics). Descriptive statistics uses the data to describe the population or sample through numerical calculations, graphs, or tables. Inferential statistics, on the other hand, is used to make inferences and predictions about a population based on a sample of data taken from the population in question.
Population parameters and sample statistics: The primary purpose of inferential statistics is to make an inference about the population from a representative subset of the population, called a sample. The sample is a subset of the total number of elements or individuals in the population (see Chapter 5 for more details on sampling). The term “sample statistics” is used to designate variables in the sample or measures computed from the sample data. The term population parameter is used to designate the variables or measured characteristics of the entire population. Generally speaking, statisticians and researchers use Greek letters to denote population parameters and English letters to denote sample statistics.
Values are scores or other numerical ratings of a variable or characteristic (e.g., attitudes, behavior, health status, or demographics such as age, income, etc.). A distribution is a listing that consists of all variable or characteristic values in a dataset, and the frequency of their occurrence. For example, in a survey that produces scores on a scale from 1 to 10, the distribution of scores would consist of the number of people who achieve a score of 1, 2, and so on, to a score of 10.
Making the Data Usable
Once the data have been collected as described in Chapter 5, we need to analyze the data to determine how the information we collected can help us answer our research question(s). In this section, we will cover the key tools for your analysis, including frequency distributions, proportions, measures of central tendency, and measures of dispersion.
It is also quite simple to construct a distribution of relative frequency, or a percentage distribution. In this case, each frequency count is converted to a percentage that shows the relative prevalence of a given value. For example, if we surveyed ice-cream flavor preferences and found that 5 out of 50 people surveyed liked chocolate ice-cream, we could change that frequency count to a percentage and say that 10 percent of the people surveyed like chocolate ice-cream. It is usually clearer to speak in percentages when explaining frequency results.
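The conversion from frequency counts to a percentage distribution can be sketched in a few lines of Python. This is only an illustration: the function name `percentage_distribution` and the vanilla and strawberry counts are hypothetical, and only the 5-out-of-50 chocolate count comes from the example above.

```python
from collections import Counter

# Hypothetical responses from 50 people. Only the chocolate count (5 of 50)
# comes from the text; the other flavors are invented for illustration.
responses = ["chocolate"] * 5 + ["vanilla"] * 30 + ["strawberry"] * 15


def percentage_distribution(values):
    """Return a dict mapping each distinct value to its percentage of all observations."""
    counts = Counter(values)  # frequency distribution: value -> count
    total = len(values)
    return {value: 100 * count / total for value, count in counts.items()}


frequency = Counter(responses)
percentages = percentage_distribution(responses)

print(frequency["chocolate"])    # frequency count: 5
print(percentages["chocolate"])  # percentage: 10.0
```

The percentages across all values add up to 100, which is a quick check that no response was dropped along the way.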
Probability is the long-run relative frequency with which an event occurs. In inferential statistics, we utilize the concept of a probability distribution, which is conceptually the same thing as a percentage distribution, except that the data are converted into probabilities.
There are four ways to measure central tendency, and each has a slightly different meaning/interpretation:
Exhibit 6.2
Calculating the Mean
Customers who took the 21-point Attitude Toward Solar Energy Systems Survey reported these 15 scores
−6, −3, −3, 0, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6
To calculate the mean, first sum the scores:
Step 1
(−6)+(−3)+(−3)+(0)+(2)+(2)+(2)+(3)+(3)+(3)+(3)+(4)+(4)+(5)+(6)= 25
Next, divide the sum of the score by the total number of scores (i.e., 15).
Step 2
25/15= 1.67
The mean is very sensitive to extreme values in a set of observations. Suppose, for example, the 15th customer obtained a score of 20, rather than 6. The mean would then be 39/15, or 2.6. This one customer’s score would substantially change the average, which would no longer give us a realistic understanding of where the majority of the data lie.
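The calculation in Exhibit 6.2 and the effect of one extreme score can be checked with a short Python sketch. It uses the 15 scores from the exhibit and the standard library's `statistics` module; the variable names are illustrative only.

```python
from statistics import mean, median

# The 15 scores reported in Exhibit 6.2.
scores = [-6, -3, -3, 0, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6]

original_mean = mean(scores)  # 25 / 15, about 1.67

# Replace the last score (6) with an extreme value (20).
altered = scores[:-1] + [20]
altered_mean = mean(altered)  # 39 / 15 = 2.6

print(round(original_mean, 2), round(altered_mean, 2))
# The median is the middle (8th) score and is not moved by the extreme value.
print(median(scores), median(altered))
```

Comparing the two lines of output shows why the median is often reported alongside the mean when a few extreme scores are present.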
Skewness has important implications for deciding which measures of central tendency best represent the data. In positively skewed distributions, the mean is usually greater than the median, and is typically greater than the mode; whereas in negatively skewed distributions, the mean is usually less than the median, and is typically less than the mode. For an example of the shapes of positively and negatively skewed distributions, as well as how skewness affects measures of central tendency, see Exhibit 6.3.
There are several methods for calculating dispersion.
While we do not expect all observations to be exactly like the mean, in a clustered (skinny) distribution, the scores will be a short distance from the mean, resulting in a small range; while in a dispersed (fat) distribution, scores will be more spread out, resulting in a large range.
The interquartile range (IQR) is a measure of statistical dispersion or variability in the dataset that is found by first dividing the data into quartiles, or four quarters. The IQR encompasses only the middlemost 50 percent of observations. This range can be useful in reducing the influence of outliers (i.e., extreme data points).
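As a rough illustration, the range and the IQR can be computed for the 15 scores from Exhibit 6.2. Quartile conventions differ between textbooks and software packages, so the IQR below reflects one convention (the default "exclusive" method of Python's `statistics.quantiles`) and may differ slightly from values produced by other tools.

```python
from statistics import quantiles

# The 15 scores from Exhibit 6.2.
scores = [-6, -3, -3, 0, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6]

# Range: distance between the largest and smallest observation.
value_range = max(scores) - min(scores)

# Quartiles: the three cut points that divide the sorted data into four quarters.
q1, q2, q3 = quantiles(scores, n=4)

# IQR: spread of the middle 50 percent of observations.
iqr = q3 - q1

print(value_range)  # 12
print(q1, q3, iqr)  # 0.0 4.0 4.0
```

Here the range (12) is stretched by the single low score of −6, while the IQR (4) describes only the middle half of the scores.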
One of the most commonly used techniques for measuring dispersion is called standard deviation. Standard deviation is, roughly, the typical distance of each score from the mean; it quantifies the amount of variation between data points in a data set. To calculate the standard deviation, first calculate the difference between each observed score and the mean of the data set, and square each difference. Next, sum those squared values, and divide this sum by the total number of values in your data set. (When the data are a sample rather than the entire population, the usual convention is to divide by one less than the number of values.) Last, take the square root of this number. This gives you an idea about the relative dispersion and volatility of the data.
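The steps described above can be sketched in Python as follows. This is a minimal illustration that divides by the total number of values (the population form); the function name `population_sd` is hypothetical, and the scores are those from Exhibit 6.2.

```python
from math import sqrt
from statistics import pstdev


def population_sd(values):
    """Standard deviation computed by dividing by the total number of values."""
    m = sum(values) / len(values)
    # Step 1: squared difference between each score and the mean.
    squared_diffs = [(v - m) ** 2 for v in values]
    # Step 2: sum the squared differences and divide by the number of values.
    variance = sum(squared_diffs) / len(values)
    # Step 3: take the square root.
    return sqrt(variance)


scores = [-6, -3, -3, 0, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6]
sd = population_sd(scores)

print(round(sd, 2))              # about 3.2
print(round(pstdev(scores), 2))  # the standard library's population form agrees
```

The standard library also provides `statistics.stdev`, which divides by one less than the number of values and is the usual choice for sample data.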
Normal Distribution
One of the most useful probability distributions in statistics is the normal distribution, also called the normal curve. The normal curve is a bell-shaped distribution in which almost all (about 99.7 percent) of the values are within + or −3 standard deviations of its mean. The standardized normal distribution is a specific normal curve that is a purely theoretical probability distribution and is especially useful for understanding inferential statistics. The standardized normal distribution has four characteristics: (1) it is symmetrical about its mean; (2) the mean of the normal curve also marks its highest point (the mode), and a vertical line through the mean is the axis about which the curve is symmetrical; (3) the normal curve has an infinite number of cases (it is a continuous distribution), and the total area under the curve is equal to 1.0; and (4) the standardized normal curve has a mean of zero and a standard deviation of one standard unit.
The standardized normal distribution is extremely valuable in translating or transforming any normal variable, X, into a standardized value, Z. This has many pragmatic implications for the business researcher. The standardized normal table, which can be found in the back of most statistics and business research books, allows a researcher to evaluate the probability of the occurrence of certain events. These tables are also easily accessible online. Likewise, Excel can be used to complete this calculation using Data Analysis tools.
Computing a standardized value (i.e., a z-score) allows researchers and professionals to compare any variable to another in the form of a standardized unit. To calculate a z-score, subtract the mean from the value to be transformed, and divide the result by the standard deviation.
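A minimal Python sketch of this calculation follows. The function name `z_score` is illustrative, and the IQ scale is assumed to have a mean of 100 and a standard deviation of 15, which is consistent with roughly 68 percent of scores falling between 85 and 115 in Exhibit 6.4.

```python
def z_score(value, mean, standard_deviation):
    """Convert a raw value into standard units (a z-score)."""
    return (value - mean) / standard_deviation


# Assumed IQ scale: mean 100, standard deviation 15.
IQ_MEAN = 100
IQ_SD = 15

print(z_score(115, IQ_MEAN, IQ_SD))  # 1.0: one standard unit above the mean
print(z_score(85, IQ_MEAN, IQ_SD))   # -1.0: one standard unit below the mean
print(z_score(55, IQ_MEAN, IQ_SD))   # -3.0: three standard units below the mean
```

Once values are expressed in standard units, scores measured on different scales can be compared directly.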
Exhibit 6.4 displays the normal curve with measured units of the variable as the scale. In this case, the scale uses IQ units (55 to 145). Exhibit 6.4 also shows the percent distribution. So, approximately 68 percent of the data fall between 85 and 115 IQ units.
Exhibit 6.5 is the same graph, but this time it displays standard units as the scale along the x-axis. Correspondingly, 68 percent of the data fall between +1 and –1 standard units in this graph.
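The percentages shown in Exhibits 6.4 and 6.5 can be reproduced, approximately, without a printed table by using the `NormalDist` class in Python's standard library. The sketch below again assumes an IQ scale with a mean of 100 and a standard deviation of 15; the helper `share_between` is a hypothetical name introduced for illustration.

```python
from statistics import NormalDist


def share_between(dist, low, high):
    """Proportion of a normal distribution that lies between low and high."""
    return dist.cdf(high) - dist.cdf(low)


# Assumed IQ scale (mean 100, standard deviation 15), in measured units.
iq = NormalDist(mu=100, sigma=15)
within_one_iq = share_between(iq, 85, 115)

# The standardized normal distribution: mean 0, standard deviation 1.
standard = NormalDist()
within_one_std = share_between(standard, -1, 1)
within_three_std = share_between(standard, -3, 3)

print(round(within_one_iq, 4))     # about 0.68
print(round(within_one_std, 4))    # the same share, in standard units
print(round(within_three_std, 4))  # about 0.997
```

The first two results match because 85 and 115 IQ units are exactly −1 and +1 standard units under the assumed scale.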
Choosing the Analysis Methodology
When analyzing data, the resulting calculations are only beneficial if they provide the most relevant pieces of information, and if the data you are analyzing are available and clean. The first step is determining how best to analyze the data. To determine this, you must answer four questions:
In this first set of results, the findings are tallied (or, counted) and reported as percentages. A tally, or frequency count, is a computation of how many people fit into a category (e.g., men or women, participants under/over 70 years of age, those who saw five or more movies last year or those that did not). Counts and frequencies take the form of numbers and percentages.
The following checklist should be used before you decide on an analysis method for your data.
Exhibit 6.8
Summary Checklist
Summary Checklist for Choosing a Method to Analyze Survey Data:
Exhibit 6.9
Survey Analysis Example
A survey is given to 160 people to determine the number and types of movies they watch. The survey is analyzed statistically to accomplish the following:
Exhibit 6.9 provides an example of survey analysis.
Illustrative results of the goals of statistical analysis from Exhibit 6.9 are as follows:
In this set of results, the findings are presented as averages (“on average,” “the typical” moviegoer). When you are interested in the center, such as the average, of a distribution of findings, you are concerned with measures of central tendency (mean, median, mode). Measures of dispersion (range, standard deviation) are often given along with measures of central tendency.
In the third set of results, the survey reports the relationship between number of books read and movies seen. One way of estimating the relationship between two characteristics is by using a correlation coefficient. This type of analysis can be run in Excel using the correlation function under the Data Analysis tab. The resulting statistic is referred to as the coefficient, or correlation coefficient, and indicates how strongly the variables are associated with one another. This same statistic indicates if both variables move in the same direction, or in opposition to one another (as one increases in value, the other decreases).
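For readers working outside Excel, a correlation coefficient of this kind (here, the Pearson correlation coefficient, which is an assumption about the statistic intended) can be sketched in Python. The function name `pearson_r` and all of the data below are hypothetical and chosen only to show the two directions of association; they are not results from the survey.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of paired deviations from each mean.
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the summed squared deviations for each variable.
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_deviation / (spread_x * spread_y)


# Hypothetical data for five respondents (not taken from the survey).
books_read = [1, 2, 3, 4, 5]
movies_same_direction = [2, 4, 6, 8, 10]      # rises as books read rises
movies_opposite_direction = [10, 8, 6, 4, 2]  # falls as books read rises

r_same = pearson_r(books_read, movies_same_direction)
r_opposite = pearson_r(books_read, movies_opposite_direction)
print(round(r_same, 2), round(r_opposite, 2))  # 1.0 -1.0
```

A coefficient near +1 or −1 indicates a strong association, a value near 0 indicates a weak one, and the sign shows whether the variables move together or in opposition.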
In these results, comparisons are made between men and women. The term statistical significance is used to show that the differences between them are statistically meaningful and not likely to be due to chance.
In the final results, survey data are used to “predict” frequent movie-going. Predicting might answer a question such as, “Of all the characteristic data I have collected (e.g., income, education, number of books read, etc.), which are linked to frequent movie-going?” For instance, does income influence movie-going? What about education? What about both of these variables combined?