CHAPTER 6

Data Analysis

Statistics is the mathematical approach to organizing and interpreting numerical information. The data collected through survey research is typically interpreted through the use of conventional statistical analyses, or other scholarly methods of data analysis. The results of statistical analyses help researchers describe patterns in the data, analyze relationships, make comparisons, and even predict future outcomes.

A variable is any measurable characteristic that may vary or change among population elements. For example, in a study on health habits among young adults, weight may be considered a variable; all persons weighing 55 kilograms share the same numerical weight, but weight will differ from person to person across the population. Another example of a variable may be level of satisfaction with sports shoes; however, unlike weight, which has a consistent form of measurement across all circumstances, a scale for measuring satisfaction with sports shoes may not already exist. Therefore, a numerical scale with standard rules must be created to collect this information properly and consistently across customers. For example, in Survey A, product satisfaction may be measured on a scale from 1 to 100, with 100 representing absolute satisfaction. Survey B, however, may measure satisfaction by counting the number of repeat customers, under the rule that at least 15 percent of all customers must reorder sports shoes within a year to demonstrate satisfaction.

Variables fall into one of two classifications: independent or dependent. Independent variables (i.e., explanatory or predictor variables) help explain or predict a response, outcome, or result, that is, the corresponding changes in the dependent variable. An independent variable stands alone and is not influenced or changed by any of the other variables you are trying to measure. For example, in a study on employee satisfaction, age might be an independent variable of interest. Other factors (e.g., eating habits, education level, time spent watching TV) are not going to change a person’s age; this variable is unaffected by them. In fact, when looking for some kind of relationship between variables, you are trying to see if the independent variable (e.g., age) influences some change in your other variables of interest (e.g., employee satisfaction levels). These other variables of interest are most often your dependent variables. A dependent variable, as you might guess, is something that depends on other factors. For example, a test score could be a dependent variable because it could change depending on several factors (e.g., hours spent studying, hours of sleep the night before an exam, or even how hungry you were when you took the test). Usually, when you are looking for a relationship between two things, you are trying to find out which factors influence or change the dependent variable.

The usual symbolism for a dependent variable is “y” and for an independent variable, “x.” Two examples are provided in Exhibit 6.1.

Exhibit 6.1

Example 1

Survey Objective: To assess how employment status (i.e., full-time, part-time, or unemployed) impacts scores on a “Quality of Life Inventory” among men who had surgery for prostate cancer two years ago.

Number of Independent Variables: 1 (employment status)

Number of Dependent Variables: 1 (quality of life)

Example 2

Survey Objective: To compare boys and girls in terms of whether they do or do not support the school’s new dress code.

Number of Independent Variables: 1 (gender)

Number of Dependent Variables: 1 (support for dress code)

Descriptive and inferential statistics: There are two applications of statistics: (1) to describe characteristics of the population or sample (descriptive statistics) and (2) to generalize from the sample to the population (inferential statistics). Descriptive statistics uses the data to provide descriptions of the population, either through numerical calculations, graphs, or tables. Inferential statistics, on the other hand, is used to make inferences and predictions about a population based on a sample of data taken from the population in question.

Population parameters and sample statistics: The primary purpose of inferential statistics is to make an inference about the population from a representative subset of the population, called a sample. The sample is a subset of the total number of elements or individuals in the population (see Chapter 5 for more details on sampling). The term “sample statistics” is used to designate variables in the sample or measures computed from the sample data. The term “population parameters” is used to designate the variables or measured characteristics of the entire population. Generally speaking, statisticians and researchers use Greek letters to denote population parameters (e.g., µ for the population mean) and Roman letters to denote sample statistics (e.g., x̄ for the sample mean).

Values are scores or other numerical ratings of a variable or characteristic (e.g., attitudes, behavior, health status, or demographics such as age, income, etc.). A distribution is a listing that consists of all variable or characteristic values in a dataset, and the frequency of their occurrence. For example, in a survey that produces scores on a scale from 1 to 10, the distribution of scores would consist of the number of people who achieve a score of 1, 2, and so on, to a score of 10.

Making the Data Usable

Once the data have been collected as described in Chapter 5, we need to analyze them to determine how the information we collected can help us answer our research question(s). In this section, we will cover the key tools for that analysis, including frequency distributions, proportions, measures of central tendency, and measures of dispersion.

  • Frequency distributions: Constructing a frequency table or frequency distribution is one of the most common means of summarizing a set of data. This technique simply counts the number of times each value of a given variable occurs, tabulating a “frequency” of occurrence for each observed value.

    It is quite simple to also construct a distribution of relative frequency, or a percentage distribution. In this case, the frequency of occurrence is changed from a frequency count, to a percentage of the relative prevalence of a given value. For example, if we surveyed ice-cream flavor preferences, and found that 5 out of 50 people surveyed liked chocolate ice-cream, we could change that frequency count to a percentage and say that 10 percent of the people surveyed like chocolate ice-cream. It is usually clearer to speak in percentages when explaining frequency results.

    Probability is the long-run relative frequency with which an event is likely to occur. In inferential statistics, we utilize the concept of a probability distribution, which is conceptually the same thing as a percentage distribution, except that the data are converted into probabilities.

  • Proportions: A proportion is a comparison of the number of population elements that share the same value or characteristic, to the whole population; usually, this is expressed as a percentage.
  • Central Tendency: Measures of central tendency are statistics that describe the location of the center of a distribution.

    There are three ways to measure central tendency, and each has a slightly different meaning/interpretation:

    • Mean: The mean is the arithmetic average, that is, the sum of all the observations divided by the total number of observations. Often, we will not have enough data to calculate the population mean, µ (read as “mu”), so we will calculate a sample mean, x̄ (read as “x bar”). Exhibit 6.2 shows the calculation of a mean.
    • Median: The median is the exact midpoint of the distribution when the observations are arranged in ascending or descending order (i.e., the 50th percentile). In other words, the median is the value below which half the values in the sample fall and above which the other half fall. If the distribution contains an odd number of data points, such as 15 respondents, then the median will be the 8th rank-ordered value; however, if the distribution has an even number of data points, such as 16 respondents, then the median will be the average of the 8th and 9th rank-ordered values in the sample.
    • Mode: The mode is the measure of central tendency that identifies the value that occurs most often. It is possible to have more than one mode in a data set. Consider the data (3,3,3,4,4,4,5,6,7). In this case, the modes would be 3 and 4.
  • Skew: Skew is a measure of the symmetry/asymmetry of a distribution of scores about its mean. The skew of a distribution can be positive or negative, or undefined. A negatively skewed distribution occurs when the majority of scores are observed at the high end of the distribution, and there are relatively few low scores (i.e., scores are clustered to the right of the graph). Conversely, a positively skewed distribution occurs when the majority of scores are observed at the low end of the distribution, and there are relatively few high scores (i.e., scores are clustered to the left of the graph).
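As a minimal sketch of the frequency and percentage distributions described in this list, the following Python snippet tallies a set of survey answers. The responses are hypothetical, constructed only to mirror the ice-cream example (5 of 50 respondents choosing chocolate), and the function name `frequency_table` is an illustrative assumption rather than anything defined in this chapter.

```python
from collections import Counter


def frequency_table(responses):
    """Return {value: (count, percent)} for a list of survey responses."""
    counts = Counter(responses)
    total = len(responses)
    return {value: (count, 100 * count / total) for value, count in counts.items()}


# Hypothetical data: 50 respondents, 5 of whom prefer chocolate.
responses = ["chocolate"] * 5 + ["vanilla"] * 30 + ["strawberry"] * 15

table = frequency_table(responses)

# Print one row per flavor: the frequency count and the percentage.
for flavor, (count, percent) in sorted(table.items()):
    print(f"{flavor:<12}{count:>4}{percent:>8.1f}%")
```

Each percentage is also the proportion of respondents sharing that value, so the same table serves for both the frequency and the proportion summaries.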

Exhibit 6.2

Calculating the Mean

Customers who took the 21-point Attitude Toward Solar Energy Systems Survey reported these 15 scores:

−6, −3, −3, 0, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6

To calculate the mean, first sum the scores:

Step 1

(−6)+(−3)+(−3)+(0)+(2)+(2)+(2)+(3)+(3)+(3)+(3)+(4)+(4)+(5)+(6)= 25

Next, divide the sum of the scores by the total number of scores (i.e., 15).

Step 2

25/15= 1.67

The mean is very sensitive to extreme values in a set of observations. Suppose, for example, the 15th customer reported a score of 20 rather than 6. The mean would then be 39/15, or 2.6. This one customer’s score would substantially change the average, so the mean would no longer give us a realistic understanding of where the majority of the data lie.
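The calculation in Exhibit 6.2 and the effect of an extreme value can be checked with a short Python sketch. It relies only on the standard-library `statistics` module (Python 3.8 or later for `multimode`); the variable names are illustrative assumptions.

```python
import statistics

# Scores from Exhibit 6.2.
scores = [-6, -3, -3, 0, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6]

mean = statistics.mean(scores)        # sum of the scores / number of scores
median = statistics.median(scores)    # 8th of the 15 rank-ordered values
modes = statistics.multimode(scores)  # every value tied for most frequent

# Replace the final score of 6 with an extreme value of 20.
with_outlier = scores[:-1] + [20]
mean_outlier = statistics.mean(with_outlier)
median_outlier = statistics.median(with_outlier)

print(f"original:     mean={mean:.2f} median={median} modes={modes}")
print(f"with outlier: mean={mean_outlier:.2f} median={median_outlier}")
```

The mean moves from about 1.67 to 2.6 when the single score changes, while the median stays at 3, which is why the median is often preferred when a distribution contains extreme values.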

Skewness has important implications for deciding which measures of central tendency best represent the data. In positively skewed distributions, the mean is usually greater than the median, and is typically greater than the mode; whereas in negatively skewed distributions, the mean is usually less than the median, and is typically less than the mode. These are rules of thumb rather than guarantees. For an example of the shapes of positively and negatively skewed distributions, as well as how skewness affects measures of central tendency, see Exhibit 6.3.

  • Measures of dispersion are descriptive statistics that depict the spread of numerical data. For example, in a survey that produces scores on a scale from 1 to 10, you calculate measures of dispersion to answer questions such as “Are most of the scores clustered around a single score, say 5?” or “What is the highest score? The lowest?”

    There are several methods for calculating dispersion.

    • Range: The range is the simplest measure of dispersion. It is the distance between the smallest and largest values of the scores that make up a frequency distribution. The range does not consider all observations; it merely tells us the distance between the two most extreme values.

      While we do not expect all observations to be exactly like the mean, in a clustered (skinny) distribution, the scores will be a short distance from the mean, resulting in a small range; while in a dispersed (fat) distribution, scores will be more spread out, resulting in a large range.

      The interquartile range (IQR) is a measure of statistical dispersion or variability in the dataset that is found by first dividing the data into quartiles, or four quarters. The IQR encompasses only the middlemost 50 percent of observations. This range can be useful in reducing the influence of outliers (i.e., extreme data points).

    • Deviation Scores: A deviation of any observation from the mean can be calculated by subtracting the mean from that observation.

      One of the most commonly used measures of dispersion is the standard deviation. The standard deviation describes the typical distance of each score from the mean; it quantifies the amount of variation among the data points in a data set. To calculate the standard deviation, first calculate the difference between each observed score and the mean of the data set, and square each difference. Next, sum those squared differences and divide by the total number of values in your data set (when the data are a sample rather than the whole population, it is conventional to divide by one less than the number of values). Last, take the square root of the result. This gives you an idea of the relative dispersion and volatility of the data.
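The measures of dispersion above can be sketched in Python using the Exhibit 6.2 scores. This is an illustration under stated assumptions: quartiles are computed with `statistics.quantiles` and its default method, and other software may use a slightly different quartile convention and report a slightly different IQR.

```python
import math
import statistics

# Scores from Exhibit 6.2.
scores = [-6, -3, -3, 0, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6]

# Range: distance between the largest and smallest values.
value_range = max(scores) - min(scores)

# Interquartile range: spread of the middle 50 percent (Q3 minus Q1).
q1, _q2, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1

# Standard deviation, following the steps in the text:
# square each deviation from the mean, sum, divide by N, take the square root.
mean = sum(scores) / len(scores)
squared_deviations = [(x - mean) ** 2 for x in scores]
population_sd = math.sqrt(sum(squared_deviations) / len(scores))

# Dividing by N - 1 instead of N gives the sample standard deviation.
sample_sd = statistics.stdev(scores)

print(f"range={value_range} IQR={iqr}")
print(f"population SD={population_sd:.2f} sample SD={sample_sd:.2f}")
```

The step-by-step result matches `statistics.pstdev(scores)`, which divides by N in the same way as the procedure described in the text.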

Normal Distribution

One of the most useful probability distributions in statistics is the normal distribution, also called the normal curve. The normal curve is a bell-shaped distribution in which almost all (about 99.7 percent) of the values are within 3 standard deviations of its mean. The standardized normal distribution is a specific normal curve that is a purely theoretical probability distribution and is especially useful for understanding inferential statistics. The standardized normal distribution has four characteristics: (1) it is symmetrical about its mean; (2) the mean marks its highest point (the mode), and a vertical line through the mean is the axis about which the curve is symmetrical; (3) the normal curve has an infinite number of cases (it is a continuous distribution), and the total area under the curve equals 1.0; and (4) the standardized normal curve has a mean of zero and a standard deviation of one standard unit.

The standardized normal distribution is extremely valuable in translating or transforming any normal variable, X, into a standardized value, Z. This has many pragmatic implications for the business researcher. The standardized normal table, which can be found in the back of most statistics and business research books, allows a researcher to evaluate the probability of the occurrence of certain events. These tables are also easily accessible online. Likewise, Excel can be used to complete this calculation using Data Analysis tools.

The computation of a standardized value (i.e., a z-score) allows researchers and professionals to compare any variable to another in the form of a standardized unit. To calculate a z-score, subtract the mean from the value to be transformed, and divide by the standard deviation: z = (X − µ) / σ.
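As a minimal sketch of the z-score calculation just described, the helper below standardizes a single value. The function name and the example numbers are hypothetical; the IQ mean of 100 and standard deviation of 15 are assumed from the 85-to-115 band discussed with Exhibit 6.4, and the satisfaction scale is invented purely for comparison.

```python
def z_score(value, mean, standard_deviation):
    """Return how many standard deviations `value` lies from `mean`."""
    if standard_deviation <= 0:
        raise ValueError("standard deviation must be positive")
    return (value - mean) / standard_deviation


# Assumed IQ scale: mean 100, standard deviation 15.
iq_z = z_score(130, mean=100, standard_deviation=15)

# Hypothetical satisfaction scale: mean 60, standard deviation 10.
satisfaction_z = z_score(70, mean=60, standard_deviation=10)

print(f"IQ of 130 -> z = {iq_z:.1f}")
print(f"Satisfaction of 70 -> z = {satisfaction_z:.1f}")
```

Because both results are in standard units, they can be compared directly even though the original scales differ: the IQ score sits two standard deviations above its mean, the satisfaction score only one.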

Exhibit 6.4 displays the normal curve with measured units of the variable as the scale. In this case, the scale uses IQ units (55 to 145). Exhibit 6.4 also shows the percent distribution. So, approximately 68 percent of the data fall between 85 and 115 IQ units.

Exhibit 6.5 is the same graph, but this time, displays standard units as the scale along the x axis. Correspondingly, 68 percent of the data fall between +1 and –1 standard units in this graph.
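The percentages shown in Exhibits 6.4 and 6.5 can be approximated in Python with `statistics.NormalDist` (Python 3.8 or later) instead of a printed table. The sketch assumes an IQ distribution with mean 100 and standard deviation 15, inferred from the 85-to-115 band in the text; `share_between` is a hypothetical helper name.

```python
from statistics import NormalDist


def share_between(dist, low, high):
    """Proportion of a normal distribution lying between low and high."""
    return dist.cdf(high) - dist.cdf(low)


# Assumed IQ distribution: mean 100, standard deviation 15.
iq = NormalDist(mu=100, sigma=15)

within_1_sd = share_between(iq, 85, 115)   # roughly 68 percent
within_2_sd = share_between(iq, 70, 130)   # roughly 95 percent
within_3_sd = share_between(iq, 55, 145)   # roughly 99.7 percent

# The same share expressed in standard units (z between -1 and +1).
standard = NormalDist()  # mean 0, standard deviation 1
same_in_z_units = share_between(standard, -1, 1)

print(f"within 1 SD: {within_1_sd:.1%}")
print(f"within 2 SD: {within_2_sd:.1%}")
print(f"within 3 SD: {within_3_sd:.1%}")
```

The share between 85 and 115 IQ units equals the share between −1 and +1 standard units, which is the point of drawing the same curve on two different scales.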

Choosing the Analysis Methodology

When analyzing data, the calculations that result are only beneficial if they provide the most relevant pieces of information, and if the data you are analyzing is available and clean. The first step is determining how to best analyze the data. To determine this, you must answer four questions:

  • What is the measurement scale of the data? (Exhibit 6.6) For more information on scaling, refer to Chapter 3.
  • How many independent and dependent variables are there? What statistical or analytical techniques are available? (Exhibits 6.7, 6.8)
  • Does the survey data fit the requirements of the methods?

The following checklist should be used before you decide on an analysis method for your data.

Exhibit 6.8

Summary Checklist

Summary Checklist for Choosing a Method to Analyze Survey Data:

  • Count the number of independent variables.
  • Determine if the data on the independent variables are nominal, ordinal, or numerical.
  • Count the number of dependent variables.
  • Determine if the data on the dependent variables are nominal, ordinal, or numerical.
  • Choose potential data analysis methods.
  • Screen the survey’s objectives (description, relationship, prediction, comparison) against the analysis method’s assumptions and outcomes.

Exhibit 6.9

Survey Analysis Example

A survey is given to 160 people to determine the number and types of movies they watch. The survey is analyzed statistically to accomplish the following:

  • Describe the backgrounds of the respondents
  • Describe the responses to each of the questions
  • Determine if a relationship exists between the number of movies seen and number of books read during the past year
  • Compare the number of movies watched by men with the number watched by women
  • Find out if gender, education, or income predicts how frequently the respondents watch movies.

Exhibit 6.9 provides an example of survey analysis.

Illustrative results of the goals of statistical analysis from Exhibit 6.9 are as follows:

  • Describe the backgrounds of the respondents: Of the survey’s 160 respondents, 77 (48.1 percent) were men. Seventy two of all respondents (48 percent) earn more than $50,000 per year, and have finished at least two years of college. Of the 150 respondents who answered the question, 32 (21.3 percent) stated they always or nearly always attend movies to escape daily responsibilities.

    In this first set of results, the findings are tallied (or, counted) and reported as percentages. A tally, or frequency count, is a computation of how many people fit into a category (e.g., men or women, participants under/over 70 years of age, those who saw five or more movies last year or those that did not). Counts and frequencies take the form of numbers and percentages.

  • Describe the responses to each of the questions: Respondents were asked how many movies they saw on average in the last six months, and if they prefer action or romance. On average, college graduates saw five or more movies in a six-month period, with a range of two to eight. The typical college graduate prefers action to romance.

    In this set of results, the findings are presented as averages (“on average,” “the typical” moviegoer). When you are interested in the center, such as the average, of a distribution of findings, you are concerned with measures of central tendency (mean, median, mode). Measures of dispersion (range, standard deviation) are often given along with measures of central tendency.

  • Determine if a relationship exists between the number of movies seen and number of books read during the past year: Respondents were asked how many books they read in the past year. The number of books they read and the number of movies they saw were then compared. Respondents who read at least five books in the past year also saw five or more movies.

    In the third set of results, the survey reports the relationship between number of books read and movies seen. One way of estimating the relationship between two characteristics is by using a correlation coefficient. This type of analysis can be run in Excel using the Correlation tool in the Data Analysis add-in. The resulting statistic is referred to as the coefficient, or correlation coefficient; it ranges from −1 to +1 and indicates how strongly the variables are associated with one another. Its sign indicates whether both variables move in the same direction (positive) or in opposition to one another (negative: as one increases in value, the other decreases).

  • Compare the number of movies watched by men with the number watched by women: The percentages of men and women who saw five or more movies in six months were compared, and no differences were found. The average number of movies watched by women during this same time period was higher than that watched by men. These results were statistically significant (p = 0.05). Older men’s scores, on the other hand, were significantly higher (p = 0.05) than those of older women.

    In these results, comparisons are made between men and women. The term statistical significance is used to show that the differences between them are statistically meaningful and not likely to be due to chance.

  • Find out if gender, education, or income predicts how frequently the respondents watch movies: Education and income were found to be the best predictors of how frequently people go to the movies. As level of education and income increased among respondents, movie-going frequency decreased. Phrased differently, respondents with the most education and income saw the fewest movies on average.

    In the final results, survey data are used to “predict” frequent movie-going. Predicting might answer a question such as, “Of all the characteristic data I have collected (e.g., income, education, number of books read, etc.), which are linked to frequent movie-going?” For instance, does income influence movie-going? What about education? What about both of these variables combined?
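The relationship and prediction analyses in the list above can be sketched in plain Python. The source does not say which software or exact model produced its results, so this is only an illustration: it computes Pearson’s correlation coefficient and a one-predictor least-squares line. All data values and function names are hypothetical, chosen to echo the direction of the findings (more books read alongside more movies seen; more education alongside fewer movies).

```python
import math


def _sums_of_squares(xs, ys):
    """Return (mean_x, mean_y, ss_x, ss_y, cross-product sum) for paired data."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return mean_x, mean_y, ss_x, ss_y, cross


def pearson_r(xs, ys):
    """Pearson correlation coefficient, between -1 and +1."""
    _, _, ss_x, ss_y, cross = _sums_of_squares(xs, ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cross / math.sqrt(ss_x * ss_y)


def least_squares_line(xs, ys):
    """Slope and intercept of the line predicting y (dependent) from x."""
    mean_x, mean_y, ss_x, _, cross = _sums_of_squares(xs, ys)
    if ss_x == 0:
        raise ValueError("cannot fit a line when x is constant")
    slope = cross / ss_x
    return slope, mean_y - slope * mean_x


# Hypothetical respondents: books read and movies seen in the past year.
books_read = [1, 2, 4, 5, 6, 8]
movies_seen = [2, 3, 4, 5, 7, 8]
r = pearson_r(books_read, movies_seen)

# Hypothetical respondents: years of education and movies seen per year.
education_years = [12, 14, 16, 18, 20]
movies_per_year = [10, 9, 7, 6, 4]
slope, intercept = least_squares_line(education_years, movies_per_year)

print(f"correlation between books and movies: r = {r:.2f}")
print(f"predicted movies = {intercept:.1f} + ({slope:.2f} x years of education)")
```

A positive r indicates the two counts rise together, and a negative slope indicates that predicted movie-going falls as education rises. Using several predictors at once (e.g., education and income together) would call for multiple regression, which is beyond this sketch.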
