Chapter 3

Univariate Descriptive Statistics

Abstract

This chapter discusses the main concepts of univariate descriptive statistics. Tables, charts, and summary measures make it possible to describe the behavior of each type of variable. Frequency distribution tables show how often each value or category occurs in a dataset. Charts represent the distribution of the variable. Summary measures are subdivided into measures of position or location (central tendency and quantiles), measures of dispersion or variability, and measures of shape (skewness and kurtosis). Measures of position summarize a dataset in a representative value, while measures of dispersion quantify its variability. Measures of skewness and kurtosis, in turn, characterize the shape of the distribution of the sampled elements around the mean. Finally, tables, charts, and summary measures are constructed using Excel, IBM SPSS Statistics Software®, and Stata Statistical Software®.

Keywords

Univariate descriptive statistics; Frequency distribution tables; Charts; Summary measures; Measures of position or location (central tendency and quantiles); Measures of dispersion or variability; Measures of shape (skewness and kurtosis)

Mathematics is the alphabet with which God has written the Universe.

Galileo Galilei

3.1 Introduction

Descriptive statistics describes and summarizes the main characteristics observed in a dataset through tables, charts, graphs, and summary measures, allowing the researcher to better understand the behavior of the data. The analysis is restricted to the dataset being studied (the sample), without drawing any conclusions or inferences about the population.

Researchers can use descriptive statistics to study a single variable (univariate descriptive statistics), two variables (bivariate descriptive statistics), or more than two variables (multivariate descriptive statistics). In this chapter, we will study the concepts of descriptive statistics involving a single variable.

Univariate descriptive statistics covers the following topics: (a) the frequency with which each value or category occurs, shown in frequency distribution tables; (b) the representation of the variable’s distribution through charts; and (c) measures that represent a data series, such as measures of position or location, measures of dispersion or variability, and measures of shape (skewness and kurtosis).

The four main goals of this chapter are: (1) to introduce the most common concepts related to the tables, charts, and summary measures in univariate descriptive statistics, (2) to present their applications in real examples, (3) to construct tables, charts, and summary measures using Excel and the statistical software SPSS and Stata, and (4) to discuss the results achieved.

As described in the previous chapter, before we begin using descriptive statistics, it is necessary to identify the type of variable being studied. The type of variable is essential when calculating descriptive statistics and in the graphical representation of the results. Fig. 3.1 shows the univariate descriptive statistics that will be studied in this chapter, represented by tables, charts, graphs, and summary measures, for each type of variable. Fig. 3.1 summarizes the following information:

  a) The descriptive statistics used to represent the behavior of one qualitative variable’s data are frequency distribution tables and graphs/charts.
  b) The frequency distribution table for a qualitative variable represents the frequency with which each variable category occurs.
  c) The graphical representation of qualitative variables can be illustrated by bar charts (horizontal and vertical), pie charts, and Pareto charts.
  d) For quantitative variables, the most common descriptive statistics are charts and summary measures (measures of position or location, dispersion or variability, and measures of shape). Frequency distribution tables can also be used to represent the frequency with which each possible value of a discrete variable occurs, or to represent the frequency of the data of continuous variables grouped into classes.
  e) Line graphs, dot or dispersion plots, histograms, stem-and-leaf plots, and boxplots (box-and-whisker diagrams) are normally used as the graphical representation of quantitative variables.
  f) Measures of position or location can be divided into measures of central tendency (mean, mode, and median) and quantiles (quartiles, deciles, and percentiles).
  g) The most common measures of dispersion or variability are range, average deviation, variance, standard deviation, standard error, and coefficient of variation.
  h) The measures of shape include measures of skewness and kurtosis.
Fig. 3.1 A brief summary of univariate descriptive statistics. ⁎The mode, which provides the most frequent value of the variable, is the only summary measure that can also be used for qualitative variables.

3.2 Frequency Distribution Table

Frequency distribution tables can be used to represent the frequency with which the values of a qualitative or quantitative variable occur in a dataset.

In the case of qualitative variables, the table represents the frequency with which each variable category occurs. For discrete quantitative variables, the frequency of occurrences is calculated for each discrete value of the variable. Continuous variable data, on the other hand, are first grouped into classes, and the frequency of each class is then calculated.

A frequency distribution table contains the following calculations:

  a) Absolute frequency (Fi): number of times each value i appears in the sample.
  b) Relative frequency (Fri): the absolute frequency expressed as a percentage of the sample size.
  c) Cumulative frequency (Fac): sum of the absolute frequencies of all values equal to or less than the value being analyzed.
  d) Relative cumulative frequency (Frac): the cumulative frequency expressed as a percentage of the sample size (sum of all relative frequencies of values equal to or less than the value being considered).

3.2.1 Frequency Distribution Table for Qualitative Variables

Through a practical example, we will build the frequency distribution table using the calculations of the absolute frequency, relative frequency, cumulative frequency, and relative cumulative frequency for each category of the qualitative variable being analyzed.

Example 3.1

Saint August Hospital provides 3000 blood transfusions to hospitalized patients every month. In order for the hospital to be able to maintain its stocks, 60 blood donations a day are necessary. Table 3.E.1 shows the total number of donors for each blood type on a certain day. Build the frequency distribution table for this problem.

Table 3.E.1

Total Number of Donors of Each Blood Type
Blood Type    Donors
A+            15
A−            2
B+            6
B−            1
AB+           1
AB−           1
O+            32
O−            2

Solution

The complete frequency distribution table for Example 3.1 is shown in Table 3.E.2:

Table 3.E.2

Frequency Distribution of Example 3.1
Blood Type    Fi    Fri (%)    Fac    Frac (%)
A+            15    25         15     25
A−            2     3.33       17     28.33
B+            6     10         23     38.33
B−            1     1.67       24     40
AB+           1     1.67       25     41.67
AB−           1     1.67       26     43.33
O+            32    53.33      58     96.67
O−            2     3.33       60     100
Sum           60    100
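
The four columns of Table 3.E.2 can also be reproduced programmatically. The following is a minimal Python sketch, not part of the original solution: the function name `frequency_table` and the ASCII labels (`A-` instead of A−) are illustrative choices, and the percentages are rounded to two decimals to match the table.

```python
from itertools import accumulate

# Donor counts per blood type, copied from Table 3.E.1.
donors = {
    "A+": 15, "A-": 2, "B+": 6, "B-": 1,
    "AB+": 1, "AB-": 1, "O+": 32, "O-": 2,
}

def frequency_table(counts):
    """Return rows of (category, Fi, Fri %, Fac, Frac %) in the given order."""
    total = sum(counts.values())
    cumulative = accumulate(counts.values())  # running sum of Fi -> Fac
    rows = []
    for (category, fi), fac in zip(counts.items(), cumulative):
        fri = round(100 * fi / total, 2)      # relative frequency (%)
        frac = round(100 * fac / total, 2)    # relative cumulative frequency (%)
        rows.append((category, fi, fri, fac, frac))
    return rows

table = frequency_table(donors)
for row in table:
    print("{:<5}{:>4}{:>8.2f}{:>5}{:>8.2f}".format(*row))
```

The last row should show a cumulative frequency equal to the sample size (60) and a relative cumulative frequency of 100%, which is a quick consistency check for any frequency distribution table.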


3.2.2 Frequency Distribution Table for Discrete Data

Through the frequency distribution table, we can calculate the absolute frequency, the relative frequency, the cumulative frequency, and the relative cumulative frequency for each possible value of the discrete variable.

Unlike qualitative variables, whose table lists the possible categories, the table for a discrete variable lists its possible numeric values. To facilitate understanding, the values must be presented in ascending order.

Example 3.2

A Japanese restaurant is defining the new layout for its tables and, in order to do that, it collected information on the number of people who have lunch and dinner at each table throughout one week. Table 3.E.3 shows the first 40 pieces of data collected. Build the frequency distribution table for these data.

Table 3.E.3

Number of People per Table
2    5     4    7    4    1    6    2    2     5
4    12    8    6    4    5    2    8    2     6
4    7     2    5    6    4    1    5    10    2
2    10    6    4    3    4    6    3    8     4


Solution

In the next table, each row of the first column represents a possible numeric value of the variable being analyzed. The data are sorted in ascending order. The complete frequency distribution table for Example 3.2 is shown below.

Table 3.E.4

Frequency Distribution for Example 3.2
Number of People    Fi    Fri (%)    Fac    Frac (%)
1                   2     5          2      5
2                   8     20         10     25
3                   2     5          12     30
4                   9     22.5       21     52.5
5                   5     12.5       26     65
6                   6     15         32     80
7                   2     5          34     85
8                   3     7.5        37     92.5
10                  2     5          39     97.5
12                  1     2.5        40     100
Sum                 40    100
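
When only the raw observations are available, as in Table 3.E.3, the counting step can be automated. The sketch below is an illustration rather than the book's procedure; it uses Python's standard `collections.Counter`, and the function name `discrete_frequency_table` is hypothetical.

```python
from collections import Counter

# The 40 observations of Table 3.E.3 (number of people per table).
people = [
    2, 5, 4, 7, 4, 1, 6, 2, 2, 5,
    4, 12, 8, 6, 4, 5, 2, 8, 2, 6,
    4, 7, 2, 5, 6, 4, 1, 5, 10, 2,
    2, 10, 6, 4, 3, 4, 6, 3, 8, 4,
]

def discrete_frequency_table(data):
    """Return rows of (value, Fi, Fri %, Fac, Frac %), values in ascending order."""
    counts = Counter(data)           # absolute frequency of each distinct value
    n = len(data)
    rows, fac = [], 0
    for value in sorted(counts):
        fi = counts[value]
        fac += fi                    # running total of absolute frequencies
        rows.append((value, fi, 100 * fi / n, fac, 100 * fac / n))
    return rows

rows = discrete_frequency_table(people)
for value, fi, fri, fac, frac in rows:
    print(f"{value:>3}{fi:>4}{fri:>7.1f}{fac:>4}{frac:>7.1f}")
```

Values that never occur (here 9 and 11) simply do not appear in the output, just as they are absent from Table 3.E.4.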


3.2.3 Frequency Distribution Table for Continuous Data Grouped into Classes

As described in Chapter 2, continuous quantitative variables are those whose possible values are in an interval of real numbers. Therefore, it makes no sense to calculate the frequency for each possible value, since they rarely repeat themselves. It is better to group the data into classes or ranges.

The choice of the interval between the classes is arbitrary. However, we must be careful: if the number of classes is too small, a lot of information can be lost; if the number of classes is too large, the summary of information is compromised (Bussab and Morettin, 2011). The interval between the classes does not need to be constant, but in order to keep things simple, we will assume the same interval.

The following steps must be taken to build a frequency distribution table for continuous data:

Step 1: Sort the data in ascending order.

Step 2: Determine the number of classes (k), using one of the options:

  a) Sturges’ Rule → k = 1 + 3.3 ⋅ log(n)
  b) Through the expression k = √n

where n is the sample size and log is the base-10 logarithm.

The value of k must be an integer.

Step 3: Determine the interval between the classes (h), calculated as the range of the sample (A = maximum value − minimum value) divided by the number of classes:

h = A/k

The value of h is rounded up to the next integer.

Step 4: Build the frequency distribution table (calculate the absolute frequency, the relative frequency, the cumulative frequency, and the relative cumulative frequency) for each class.

The lower limit of the first class corresponds to the minimum value of the sample. To determine the upper limit of each class, we add the value of h to its lower limit. The lower limit of each subsequent class corresponds to the upper limit of the previous class.

Example 3.3

Consider the data in Table 3.E.5 regarding the grades of 30 students enrolled in the subject Financial Market. Elaborate a frequency distribution table for this problem.

Table 3.E.5

Grades of 30 Students Enrolled in the Subject Financial Market
4.2    3.9    5.7    6.5    4.6    6.3    8.0    4.4    5.0    5.5
6.0    4.5    5.0    7.2    6.4    7.2    5.0    6.8    4.7    3.5
6.0    7.4    8.8    3.8    5.5    5.0    6.6    7.1    5.3    4.7


Note: To determine the number of classes, use Sturges’ rule.

Solution

Let’s apply the four steps to build the frequency distribution table of Example 3.3, whose variable is continuous:

Step 1: Let’s sort the data in ascending order, as shown in Table 3.E.6.

Table 3.E.6

Data From Table 3.E.5 Sorted in Ascending Order
3.5    3.8    3.9    4.2    4.4    4.5    4.6    4.7    4.7    5
5      5      5      5.3    5.5    5.5    5.7    6      6      6.3
6.4    6.5    6.6    6.8    7.1    7.2    7.2    7.4    8      8.8


Step 2: Let’s determine the number of classes (k) by using Sturges’ rule:

k = 1 + 3.3 ⋅ log(30) = 5.87 ≅ 6

Step 3: The interval between the classes (h) is given by:

h = A/k = (8.8 − 3.5)/6 = 0.88 ≅ 1

Step 4: Finally, let’s build the frequency distribution table for each class.

The lower limit of the first class corresponds to the minimum grade, 3.5. Adding the interval between the classes (h = 1) to this value gives the upper limit of the first class, 4.5. The second class starts from this value, and so on, until the last class is defined. We use the notation ├ to indicate that the lower limit is included in the class and the upper limit is not. The complete frequency distribution table for Example 3.3 is presented in Table 3.E.7.

Table 3.E.7

Frequency Distribution for Example 3.3
Class        Fi    Fri (%)    Fac    Frac (%)
3.5 ├ 4.5    5     16.67      5      16.67
4.5 ├ 5.5    9     30         14     46.67
5.5 ├ 6.5    7     23.33      21     70
6.5 ├ 7.5    7     23.33      28     93.33
7.5 ├ 8.5    1     3.33       29     96.67
8.5 ├ 9.5    1     3.33       30     100
Sum          30    100
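
The four steps of Example 3.3 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the book's implementation: k is obtained by rounding Sturges' value to the nearest integer (which matches 5.87 ≅ 6 in the worked example), h is rounded up as stated in Step 3, and the function name `grouped_table` is hypothetical.

```python
import math

# Grades from Table 3.E.5.
grades = [
    4.2, 3.9, 5.7, 6.5, 4.6, 6.3, 8.0, 4.4, 5.0, 5.5,
    6.0, 4.5, 5.0, 7.2, 6.4, 7.2, 5.0, 6.8, 4.7, 3.5,
    6.0, 7.4, 8.8, 3.8, 5.5, 5.0, 6.6, 7.1, 5.3, 4.7,
]

n = len(grades)
k = round(1 + 3.3 * math.log10(n))              # Step 2: Sturges' rule, 5.87 -> 6
h = math.ceil((max(grades) - min(grades)) / k)  # Step 3: 0.88 rounded up -> 1

def grouped_table(data, k, h):
    """Rows of (lower, upper, Fi, Fri %, Fac, Frac %) for classes lower |- upper."""
    start = min(data)
    counts = [0] * k
    for x in data:
        # Class index of x; the lower limit is included, the upper is not.
        # Assumes max(data) < start + k * h, which holds for this dataset.
        counts[int((x - start) // h)] += 1
    total = len(data)
    rows, fac = [], 0
    for i, fi in enumerate(counts):
        fac += fi
        lower = start + i * h
        rows.append((lower, lower + h, fi,
                     round(100 * fi / total, 2), fac,
                     round(100 * fac / total, 2)))
    return rows

table = grouped_table(grades, k, h)
for lower, upper, fi, fri, fac, frac in table:
    print(f"{lower:.1f} |- {upper:.1f}{fi:>4}{fri:>8.2f}{fac:>4}{frac:>8.2f}")
```

Sorting the data (Step 1) is not needed for counting, so the sketch skips it; it matters mainly when the table is built by hand.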


3.3 Graphical Representation of the Results

The behavior of qualitative and quantitative variable data can also be represented in a graphical way. Charts are a representation of numeric data, in the form of geometric figures (graphs, diagrams, drawings, or images), allowing the reader to interpret these data quickly and objectively.

In Section 3.3.1, the main graphical representations for qualitative variables are illustrated: bar charts (horizontal and vertical), pie charts, and a Pareto chart.

The graphical representation of quantitative variables is usually illustrated by line graphs, dot plots, histograms, stem-and-leaf plots, and boxplots (or box-and-whisker diagrams), as shown in Section 3.3.2.

Bar charts (horizontal and vertical), pie charts, a Pareto chart, line graphs, dot plots, and histograms will be generated in Excel. The boxplots and histograms will be constructed by using SPSS and Stata.

To build a chart in Excel, the variables’ data and names must first be standardized, codified, and selected in a spreadsheet. The next step consists of clicking on the Insert tab and, in the Charts group, selecting the type of chart we are interested in using (Columns, Lines, Pie, Bar, Area, Scatter, or Other Charts). The chart will be generated automatically on the screen, and it can be personalized according to the preferences of the researcher.

Excel offers a variety of chart styles, layouts, and formats. To use them, the researcher just needs to select the plotted chart and click on the Design, Layout, or Format tab. On the Layout tab, for example, there are many resources available, such as Chart Title; Axis Titles (shows the names of the horizontal and vertical axes); Legend (shows or hides the legend); Data Labels (allows the researcher to insert the series name, the category name, or the values of the labels in the desired place); Data Table (shows the data table below the chart, with or without legend keys); Axes (allows the researcher to personalize the scale of the horizontal and vertical axes); and Gridlines (shows or hides horizontal and vertical gridlines), among others. The Chart Title, Axis Titles, Legend, Data Labels, and Data Table icons are in the Labels group, while the Axes and Gridlines icons are in the Axes group.

3.3.1 Graphical Representation for Qualitative Variables

3.3.1.1 Bar Chart

This type of chart is widely used for nominal and ordinal qualitative variables, but it can also be used for discrete quantitative variables, because it allows us to investigate the presence of data trends.

As its name indicates, this chart uses bars to represent the absolute or relative frequencies of each possible category (or numeric value) of a qualitative (or quantitative) variable. In vertical bar charts, each variable category is shown on the X-axis as a bar with constant width, and the height of the respective bar indicates the frequency of the category on the Y-axis. Conversely, in horizontal bar charts, each variable category is shown on the Y-axis as a bar of constant height, and the length of the respective bar indicates the frequency of the category on the X-axis.

Let’s now build horizontal and vertical bar charts from a practical example.

Example 3.4

A bank created a satisfaction survey, which was used with 120 customers, trying to measure how agile its services were (excellent, good, satisfactory, and poor). The absolute frequencies for each category are presented in Table 3.E.8. Construct a vertical and horizontal bar chart for this problem.

Table 3.E.8

Frequencies of Occurrences per Category
Satisfaction    Absolute Frequency
Excellent       58
Good            18
Satisfactory    32
Poor            12

Solution

Let’s build the vertical and horizontal bar charts of Example 3.4 in Excel.

First, the data in Table 3.E.8 must be standardized, codified, and selected in a spreadsheet. After that, we click on the Insert tab and, in the Charts group, select the option Columns. The chart is automatically generated on the screen.

Next, to personalize the chart, while clicking on it, we must select the following icons on the Layout tab: (a) Axis Titles: let’s select the title for the horizontal axis (Satisfaction) and for the vertical axis (Frequency); (b) Legend: to hide the legend, we must click on None; (c) Data Labels: clicking on More Data Label Options, the option Value must be selected in Label Contains (or we can select the option Outside End).

Fig. 3.2 shows the vertical bar chart of Example 3.4 generated in Excel.

Fig. 3.2 Vertical bar chart for Example 3.4.

Based on Fig. 3.2, we can see that the categories of the variable being analyzed are presented on the X-axis by bars with the same width and their respective heights indicate the frequencies on the Y-axis.

To construct the horizontal bar chart, we must select the option Bar instead of Columns. The other steps follow the same logic. Fig. 3.3 represents the frequency data from Table 3.E.8 through a horizontal bar chart constructed in Excel.

Fig. 3.3 Horizontal bar chart for Example 3.4.

The horizontal bar chart in Fig. 3.3 represents the categories of the variable on the Y-axis and their respective frequencies on the X-axis. For each variable category, we draw a bar with a length that corresponds to its frequency.

Therefore, this chart only describes the behavior of each category of the original variable and supports an initial investigation of the type of distribution. It does not allow us to calculate position, dispersion, skewness, or kurtosis measures, since the variable being studied is qualitative.

3.3.1.2 Pie Chart

Another way to represent qualitative data, in terms of relative frequencies (percentages), is the pie chart. The chart corresponds to a circle of arbitrary radius (the whole) divided into sectors, or slices, of different sizes (parts of the whole).

This chart allows the researcher to visualize the data as slices of a pie or parts of a whole. Let’s now build the pie chart from a practical example.

Example 3.5

An election poll was carried out in the city of Sao Paulo to check voters’ preferences concerning the political parties running in the next elections for Mayor. The percentage of voters per political party can be seen in Table 3.E.9. Construct a pie chart for Example 3.5.

Table 3.E.9

Percentage of Voters per Political Party
Political Party    Percentage
PMDB               18
PSDB               22
PDT                12.5
PT                 24.5
PC do B            8
PV                 5
Others             10

Solution

Let’s build the pie chart for Example 3.5 in Excel. The steps are similar to the ones in Example 3.4. However, we now have to select the option Pie in the Charts group, on the Insert tab. Fig. 3.4 presents the pie chart obtained in Excel for the data shown in Table 3.E.9.

Fig. 3.4 Pie chart of Example 3.5.
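
Excel sizes the slices automatically, but the underlying rule is simple: each sector's central angle is proportional to its relative frequency, so a category with p% of the total occupies p% of the 360 degrees of the circle. The sketch below illustrates this with the data in Table 3.E.9; the function name `sector_angles` is hypothetical, and the angles are computed here, not read from Fig. 3.4.

```python
# Percentage of voters per political party, copied from Table 3.E.9.
shares = {
    "PMDB": 18, "PSDB": 22, "PDT": 12.5, "PT": 24.5,
    "PC do B": 8, "PV": 5, "Others": 10,
}

def sector_angles(percentages):
    """Return the central angle (in degrees) of each slice of the pie."""
    total = sum(percentages.values())  # 100 when the input is in percent
    return {name: 360 * value / total for name, value in percentages.items()}

angles = sector_angles(shares)
for name, angle in angles.items():
    print(f"{name:<8}{angle:>7.1f} degrees")
```

Dividing by the total rather than by 100 means the same function also works with absolute frequencies.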

3.3.1.3 Pareto Chart

The Pareto chart is a quality control tool whose main objective is to investigate the types of problems and, consequently, to identify their respective causes, so that action can be taken to reduce or eliminate them.

The Pareto chart is a chart that contains bars and a line graph. The bars represent the absolute frequencies of occurrences of problems and the lines represent the relative cumulative frequencies. The problems are sorted in descending order of priority. Let’s now illustrate a practical example with a Pareto chart.

Example 3.6

A manufacturer of credit and magnetic cards has as its main objective to reduce the number of defective cards. The quality inspector classified a sample of 1000 cards that were collected during one week of production, according to the types of defects found, as shown in Table 3.E.10. Construct a Pareto chart for this problem.

Table 3.E.10

Frequencies of the Occurrence of Each Defect
Type of Defect        Absolute Frequency (Fi)
Damaged/Bent          71
Perforated            28
Illegible printing    12
Wrong characters      20
Wrong numbers         44
Others                6
Total                 181

Solution

The first step in generating a Pareto chart is to sort the defects in order of priority (from the highest to the lowest frequency). The bar chart represents the absolute frequency of each defect. To construct the line graph, it is necessary to calculate the relative cumulative frequency (%) up to the defect analyzed. Table 3.E.11 shows the absolute frequency for each type of defect, in descending order, and the relative cumulative frequency (%).

Table 3.E.11

Absolute Frequency for Each Defect and the Relative Cumulative Frequency (%)
Type of Defect        Number of Defects    Cumulative %
Damaged/Bent          71                   39.23
Wrong numbers         44                   63.54
Perforated            28                   79.01
Wrong characters      20                   90.06
Illegible printing    12                   96.69
Others                6                    100
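
The sorting and accumulation behind Table 3.E.11 can be sketched in a few lines of Python. This is an illustration, not the book's procedure: the function name `pareto_table` is hypothetical, and the percentages are rounded to two decimals to match the table.

```python
# Absolute frequency of each defect, copied from Table 3.E.10.
defects = {
    "Damaged/Bent": 71, "Perforated": 28, "Illegible printing": 12,
    "Wrong characters": 20, "Wrong numbers": 44, "Others": 6,
}

def pareto_table(counts):
    """Return rows of (defect, Fi, cumulative %) sorted by descending frequency."""
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    total = sum(counts.values())
    rows, running = [], 0
    for name, fi in ordered:
        running += fi  # cumulative frequency up to this defect
        rows.append((name, fi, round(100 * running / total, 2)))
    return rows

pareto = pareto_table(defects)
for name, fi, cumulative in pareto:
    print(f"{name:<20}{fi:>4}{cumulative:>8.2f}")
```

The second column feeds the bars of the Pareto chart and the third column feeds the line. Here the category Others happens to have the lowest frequency, so it naturally ends up last.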

Let’s now build a Pareto chart for Example 3.6 in Excel, using the data in Table 3.E.11.

First, the data in Table 3.E.11 must be standardized, codified, and selected in an Excel spreadsheet. In the Charts group, on the Insert tab, let’s select the option Columns (and the clustered column subtype). Note that the chart is automatically generated on the screen. However, absolute frequency data as well as relative cumulative frequency data are presented as columns. To change the type of chart related to the cumulative percentage, we must click with the right button on any bar of the respective series and select the option Change Series Chart Type, followed by a line graph with markers. The resulting chart is a Pareto chart.

To personalize the Pareto chart, we must use the following icons on the Layout tab: (a) Axis Titles: for the bar chart, we selected the title for the horizontal axis (Type of defect) and for the vertical axis (Frequency); for the line graph, we called the vertical axis Percentage; (b) Legend: to hide the legend, we must click on None; (c) Data Table: let’s select the option Show Data Table with Legend Keys; (d) Axes: the major unit of the vertical axes for both charts is set to 20, and the maximum value of the vertical axis for the line graph is set to 100.

Fig. 3.5 shows the chart constructed in Excel that corresponds to the Pareto chart for Example 3.6.

Fig. 3.5 The Pareto chart for Example 3.6. Legend: A, Damaged/Bent; B, Wrong numbers; C, Perforated; D, Wrong characters; E, Illegible printing; F, Others.

3.3.2 Graphical Representation for Quantitative Variables

3.3.2.1 Line Graph

In a line graph, points are represented by the intersection of the variables involved on the horizontal axis (X) and on the vertical axis (Y), and they are connected by straight lines.

Despite considering two axes, line graphs will be used in this chapter to represent the behavior of a single variable. The graph shows the evolution or trend of a quantitative variable’s data, which is usually continuous, at regular intervals. The numeric variable values are represented on the Y-axis, and the X-axis only shows the data distribution in a uniform way. Let’s now illustrate a practical example of a line graph.

Example 3.7

Cheap & Easy is a supermarket that registered the percentage of losses it had in the last 12 months (Table 3.E.12). After having done that, it will adopt new prevention measures. Build a line graph for Example 3.7.

Table 3.E.12

Percentage of Losses in the Last 12 Months
Month        Losses (%)
January      0.42
February     0.38
March        0.12
April        0.34
May          0.22
June         0.15
July         0.18
August       0.31
September    0.47
October      0.24
November     0.42
December     0.09

Solution

To build the line graph for Example 3.7 in Excel, in the Charts group, on the Insert tab, we must select the option Lines. The other steps follow the same logic of the previous examples. The complete chart can be seen in Fig. 3.6.

Fig. 3.6 Line graph for Example 3.7.

3.3.2.2 Scatter Plot

A scatter plot is very similar to a line graph. The biggest difference between them is in the way the data are plotted on the horizontal axis.

As in a line graph, the points are represented by the intersection of the variables on the horizontal axis (X) and on the vertical axis (Y). However, they are not connected by straight lines.

The scatter plot studied in this chapter is used to show the evolution or trend of a single quantitative variable’s data, similar to the line graph; however, at irregular intervals (in general). Analogous to a line graph, the numeric variable values are represented on the Y-axis and the X-axis only represents the data behavior throughout time.

In the next chapter, we will see how a scatter plot can be used to describe the behavior of two variables simultaneously (bivariate analysis). The numeric values of one variable will be represented on the Y-axis and the other one on the X-axis.

Example 3.8

Papermisto is the supplier of three types of raw materials for the production of paper: cellulose, mechanical pulp, and trimmings. In order to maintain its quality standards, the factory carries out a rigorous inspection of its products during each production phase. At irregular intervals, an operator must verify the esthetic and dimensional characteristics of the product selected with specialized instruments. For instance, in the cellulose storage phase, the product must be piled up in bales of approximately 250 kg each. Table 3.E.13 shows the weight of the bales collected in the last 5 hours, at irregular intervals, varying between 20 and 45 minutes. Construct a scatter plot for Example 3.8.

Table 3.E.13

Evolution of the Weight of the Bales Throughout Time
Time (min)    Weight (kg)
30            250
50            255
85            252
106           248
138           250
178           249
198           252
222           251
252           250
297           245

Solution

To build the scatter plot for Example 3.8 in Excel, in the Charts group, on the Insert tab, we must select the option Scatter. The other steps follow the same logic of the previous examples. The scatter plot can be seen in Fig. 3.7.

Fig. 3.7 Scatter plot for Example 3.8.

3.3.2.3 Histogram

A histogram is a vertical bar chart that represents the frequency distribution of one quantitative variable (discrete or continuous). The variable values being studied are presented on the X-axis (the base of each bar, with a constant width, represents each possible value of the discrete variable or each class of continuous values, sorted in ascending order). On the other hand, the height of the bars on the Y-axis represents the frequency distribution (absolute, relative, or cumulative) of the respective variable values.

A histogram is very similar to a Pareto chart. It is also one of the seven quality tools. A Pareto chart represents the frequency distribution of a qualitative variable (types of problem), whose categories represented on the X-axis are sorted in order of priority (from the category with the highest frequency to the one with the lowest). A histogram represents the frequency distribution of a quantitative variable, whose values represented on the X-axis are sorted in ascending order.

Therefore, the first step to elaborate a histogram is building the frequency distribution table. As presented in Sections 3.2.2 and 3.2.3, for each possible value of a discrete variable or for a class with continuous data, we calculate the absolute frequency, the relative frequency, the cumulative frequency, and the relative cumulative frequency. The data must be sorted in ascending order.

The histogram is then constructed from this table. The first column of the frequency distribution table, which represents the numeric values or the classes with the values of the variable being studied, will be presented on the X-axis, and the column of absolute frequency (or relative frequency, cumulative frequency, or relative cumulative frequency) will be presented on the Y-axis.

Many pieces of statistical software generate the histogram automatically, from the original values of the quantitative variable being studied, without having to calculate the frequencies. Even though Excel has the option of building a histogram from analysis tools, we will show how to build it from the column chart, due to its simplicity.

Example 3.9

In order to improve their services, a national bank is hiring new managers to serve their corporate clients. Table 3.E.14 shows the number of companies dealt with daily in one of their main branches in the capital. Elaborate a histogram from these data using Excel.

Table 3.E.14

Number of Companies Dealt With Daily
13    11    13    10    11    12    8     12    9     10
12    10    8     11    9     11    14    11    10    9


Solution

The first step is building the frequency distribution table (Table 3.E.15).

From the data in Table 3.E.15, we can build a histogram of absolute frequency, relative frequency, cumulative frequency, or relative cumulative frequency using Excel. The histogram generated will be the absolute frequency one.

Thus, we must standardize, codify, and select the first two columns of Table 3.E.15 (except the last row: Sum) in an Excel spreadsheet. In the Charts group, on the Insert tab, let’s select the option Columns.

Let’s click on the chart so that it can be personalized. On the Layout tab, we selected the following icons: (a) Axis Titles: select the title for the horizontal axis (Number of companies) and for the vertical axis (Absolute frequency); (b) Legend: to hide the legend, we must click on None. The histogram generated in Excel can be seen in Fig. 3.8.

Fig. 3.8 Histogram of absolute frequencies elaborated in Excel for Example 3.9.

Table 3.E.15

Frequency Distribution for Example 3.9
Number of Companies    Fi    Fri (%)    Fac    Frac (%)
8                      2     10         2      10
9                      3     15         5      25
10                     4     20         9      45
11                     5     25         14     70
12                     3     15         17     85
13                     2     10         19     95
14                     1     5          20     100
Sum                    20    100


As mentioned, many statistical computer packages, including SPSS and Stata, build the histogram automatically from the original data of the variable being studied (in this example, using the data in Table 3.E.14), without having to calculate the frequencies. Moreover, these packages have the option of plotting the normal curve.

Fig. 3.9 shows the histogram generated using SPSS (with the option of a normal curve) from the data in Table 3.E.14. Sections 3.6 and 3.7 show in detail how it can be constructed using SPSS and Stata, respectively.

Fig. 3.9 Histogram constructed using SPSS for Example 3.9 (discrete data).

Note that the values of the discrete variable are presented in the middle of the base.

For continuous variables, consider the data in Table 3.E.5 (Example 3.3), regarding the grades of the students enrolled in the subject Financial Market. These data were sorted in ascending order, as presented in Table 3.E.6.

Fig. 3.10 shows the histogram generated using SPSS software (with the option of a normal curve) using the data in Table 3.E.5 or Table 3.E.6.

Fig. 3.10 Histogram generated using SPSS for Example 3.3 (continuous data).

Note that the data were grouped using a class interval of h = 0.5, unlike Example 3.3, which used h = 1. The lower limit of each class is represented on the left side of the base of the bar, and the upper limit (not included in the class) on the right side. The height of the bar represents the total frequency in the class. For example, the first bar represents the 3.5 ├ 4.0 class, and there are three values in this interval (3.5, 3.8, and 3.9).
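
The effect of the narrower class interval can be checked by recounting the grades with h = 0.5. The sketch below prints a rough text version of the histogram; it is an illustration only. The function name `class_counts` is hypothetical, the starting point 3.5 is taken from the first class mentioned above, and the remaining bar heights are computed here rather than read from Fig. 3.10.

```python
# Grades from Table 3.E.5.
grades = [
    4.2, 3.9, 5.7, 6.5, 4.6, 6.3, 8.0, 4.4, 5.0, 5.5,
    6.0, 4.5, 5.0, 7.2, 6.4, 7.2, 5.0, 6.8, 4.7, 3.5,
    6.0, 7.4, 8.8, 3.8, 5.5, 5.0, 6.6, 7.1, 5.3, 4.7,
]

def class_counts(data, start, width):
    """Count observations per class [lower, lower + width), keyed by lower limit.

    Values are converted to integer tenths so that boundaries such as 4.0
    are compared exactly (the grades have one decimal place).
    """
    start_t, width_t = round(start * 10), round(width * 10)
    counts = {}
    for x in data:
        index = (round(x * 10) - start_t) // width_t
        lower = (start_t + index * width_t) / 10
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))

counts = class_counts(grades, start=3.5, width=0.5)
for lower, fi in counts.items():
    print(f"[{lower:.1f}, {lower + 0.5:.1f})  {'#' * fi}")
```

Classes with no observations are omitted from the output; in a drawn histogram they would appear as gaps between bars.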

3.3.2.4 Stem-and-Leaf Plot

Both bar charts and histograms represent the shape of the variable’s frequency distribution. The stem-and-leaf plot is an alternative to represent the frequency distributions of discrete and continuous quantitative variables with few observations, with the advantage of maintaining the original value of each observation (it allows the visualization of all data information).

In the plot, the representation of each observation is divided into two parts, separated by a vertical line: the stem is located on the left of the vertical line and represents the observation’s first digit(s); the leaf is located on the right of the vertical line and represents the observation’s last digit(s). The choice of how many initial digits form the stem, or how many complementary digits form the leaf, is arbitrary. The stems usually contain the most significant digits, and the leaves the least significant.

The stems are represented in a single column, with their different values spread over several lines. For each stem represented on the left-hand side of the vertical line, the respective leaves are shown on the right-hand side, spread over several columns. Stems as well as leaves must be sorted in ascending order. When there are too many leaves per stem, we can have more than one line with the same stem. The choice of the number of lines is arbitrary, as is the choice of the interval or the number of classes in a frequency distribution.

To build a stem-and-leaf plot, we can follow the sequence of steps:

Step 1: Sort the data in ascending order, to make the visualization of the data easier.

Step 2: Define the number of initial digits that will form the stem, or the number of complementary digits that will form the leaf.

Step 3: Elaborate the stems, represented in a single column on the left of the vertical line. Their different values are represented throughout many lines, in ascending order. When the number of leaves by stem is very high, we can define two or more lines for the same stem.

Step 4: Place the leaves that correspond to the respective stems, on the right-hand side of the vertical line, throughout many columns (in ascending order).

Example 3.10

A small company collected its employees’ ages, as shown in Table 3.E.16. Build a stem-and-leaf plot.

Table 3.E.16

Employees’ Ages
44  60  22  49  31  58  42  63  33  37
54  55  40  71  55  62  35  45  59  54
50  51  24  31  40  73  28  35  75  48

Solution

To construct the stem-and-leaf plot, let’s apply the four steps described:

  • Step 1

First, we must sort the data in ascending order, as shown in Table 3.E.17.

Table 3.E.17

Employees’ Ages in Ascending Order
22  24  28  31  31  33  35  35  37  40
40  42  44  45  48  49  50  51  54  54
55  55  58  59  60  62  63  71  73  75

  • Step 2

The next step to construct a stem-and-leaf plot is to define the number of initial digits of the observation that will form the stem. The complementary digits will form the leaf. In this example, all of the observations have two digits. The stems correspond to the tens and the leaves correspond to the units.

  • Step 3

The following step is to build the stems. Based on Table 3.E.17, we can see that there are observations that begin with the tens 2, 3, 4, 5, 6, and 7 (stems). The stem with the highest frequency is 5 (8 observations), and all of its leaves fit in a single line. Therefore, we will have a single line per stem. Hence, the stems are presented in a single column on the left of the vertical line, in ascending order, as shown in Fig. 3.11.

Fig. 3.11 Building the stems for Example 3.10.

  • Step 4

Finally, let’s place the leaves that correspond to each stem on the right-hand side of the vertical line. The leaves are represented in ascending order throughout many columns. For example, stem 2 contains leaves 2, 4, and 8. Stem 5 contains leaves 0, 1, 4, 4, 5, 5, 8, and 9, represented throughout 8 columns. If this stem were divided into two lines, the first line would have leaves 0 to 4, and the second line leaves 5 to 9.

Fig. 3.12 illustrates the stem-and-leaf plot for Example 3.10.

Fig. 3.12 Stem-and-Leaf plot for Example 3.10.
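The four steps above can be sketched in a few lines of Python. This is only an illustrative sketch for two-digit integer data such as the ages in Table 3.E.16; the function name `stem_and_leaf` is an assumption of this example, and stems without observations are simply omitted.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return one text line per stem (tens), with its leaves (units) in ascending order."""
    groups = defaultdict(list)
    for v in sorted(values):            # Step 1: sort in ascending order
        stem, leaf = divmod(v, 10)      # Step 2: tens form the stem, units form the leaf
        groups[stem].append(leaf)
    # Steps 3 and 4: stems in a single column, leaves to the right of the bar
    return [f"{stem} | {' '.join(str(leaf) for leaf in groups[stem])}"
            for stem in sorted(groups)]

ages = [44, 60, 22, 49, 31, 58, 42, 63, 33, 37,
        54, 55, 40, 71, 55, 62, 35, 45, 59, 54,
        50, 51, 24, 31, 40, 73, 28, 35, 75, 48]

plot_lines = stem_and_leaf(ages)
print("\n".join(plot_lines))
```

Each printed line corresponds to one stem, so the line for stem 5 lists the eight leaves 0, 1, 4, 4, 5, 5, 8, and 9 described in Step 4.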

Example 3.11

The average temperature, in degrees Celsius, registered over the last 40 days in the city of Porto Alegre can be found in Table 3.E.18. Build the stem-and-leaf plot.

Table 3.E.18

Average Temperature in Celsius
 8.5  13.7  12.9   9.4  11.7  19.2  12.8   9.7  19.5  11.5
15.5  16.0  20.4  17.4  18.0  14.4  14.8  13.0  16.6  20.2
17.9  17.7  16.9  15.2  18.5  17.8  16.2  16.4  18.2  16.9
18.7  19.6  13.2  17.2  20.5  14.1  16.1  15.9  18.8  15.7

Solution

Once again, let’s apply the four steps to construct the stem-and-leaf plot, but now we have to consider continuous variables.

  • Step 1

First, let’s sort the data in ascending order, as shown in Table 3.E.19.

Table 3.E.19

Average Temperature in Ascending Order
 8.5   9.4   9.7  11.5  11.7  12.8  12.9  13.0  13.2  13.7
14.1  14.4  14.8  15.2  15.5  15.7  15.9  16.0  16.1  16.2
16.4  16.6  16.9  16.9  17.2  17.4  17.7  17.8  17.9  18.0
18.2  18.5  18.7  18.8  19.2  19.5  19.6  20.2  20.4  20.5

  • Step 2

In this example, the leaves correspond to the last digit. The remaining digits (to the left) correspond to the stems.

  • Steps 3 and 4

The stems vary from 8 to 20. The stem with the highest frequency is 16 (7 observations), and its leaves can be represented in a single line. For each stem, we place the respective leaves. Fig. 3.13 shows the stem-and-leaf plot for Example 3.11.

Fig. 3.13 Stem-and-Leaf Plot for Example 3.11.

3.3.2.5 Boxplot or Box-and-Whisker Diagram

The boxplot (or box-and-whisker diagram) is a graphical representation of five measures of position or location of a certain variable: minimum value, first quartile (Q1), second quartile (Q2) or median (Md), third quartile (Q3), and maximum value. In a sorted sample, the median corresponds to the central position, and the quartiles divide the sample into four equal parts, each one containing 25% of the data.

Thus, the first quartile (Q1) is the value below which the first 25% of the data lie (organized in ascending order). The second quartile corresponds to the median (50% of the sorted data are located below it and the remaining 50% above it), and the third quartile (Q3) is the value below which 75% of the observations lie. The dispersion measure resulting from these location measures is called the interquartile range (IQR) or interquartile interval (IQI) and corresponds to the difference between Q3 and Q1.

This plot allows us to assess the symmetry and the distribution of the data. It also gives us a visual indication of whether there are discrepant data (univariate outliers), since these observations lie above the upper limit or below the lower limit. A representation of the diagram can be seen in Fig. 3.14.

Fig. 3.14 Boxplot.

Calculating the median, the first, and third quartiles, and investigating the existence of univariate outliers will be discussed in Sections 3.4.1.1, 3.4.1.2, and 3.4.1.3, respectively. In Sections 3.6.3 and 3.7, we will study how to generate the box-and-whisker diagram on SPSS and Stata, respectively, using a practical example.

3.4 The Most Common Summary-Measures in Univariate Descriptive Statistics

Information found in a dataset can be summarized through suitable numerical measures, called summary measures.

In univariate descriptive statistics, the most common summary measures have as their main objective to represent the behavior of the variable being studied through its central and noncentral values, its dispersions, or the way its values are distributed around the mean.

The summary measures that will be studied in this chapter are measures of position or location (measures of central tendency and quantiles), measures of dispersion or variability, and measures of shape, such as skewness and kurtosis.

These measures are calculated for metric or quantitative variables. The only exception is the mode, a measure of central tendency that gives the most frequent value of a certain variable, so it can also be calculated for nonmetric or qualitative variables.

3.4.1 Measures of Position or Location

These measures provide values that characterize the behavior of a data series, indicating the data position or location in relation to the axis of the values assumed by the variable or characteristic being studied.

The measures of position or location are subdivided into measures of central tendency (mean, median, and mode) and quantiles (quartiles, deciles, and percentiles).

3.4.1.1 Measures of Central Tendency

The most common measures of central tendency are the arithmetic mean, the median, and the mode.

3.4.1.1.1 Arithmetic Mean

The arithmetic mean can be a representative measure of a population with N elements, represented by the Greek letter μ, or a representative measure of a sample with n elements, represented by $\bar{X}$.

3.4.1.1.1.1 Case 1: Simple Arithmetic Mean of Ungrouped Discrete and Continuous Data

Simple arithmetic mean, or simply mean, or average, is the sum of all the values of a certain variable (discrete or continuous) divided by the total number of observations. Thus, the sample arithmetic mean of a certain variable X ($\bar{X}$) is:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \tag{3.1}$$

where n is the total number of observations in the dataset and $X_i$, for i = 1, …, n, represents each one of the values of variable X.

Example 3.12

Calculate the simple arithmetic mean of the data in Table 3.E.20, regarding the grades of the graduate students enrolled in the subject Quantitative Methods.

Table 3.E.20

Students’ Grades
5.7  6.5  6.9  8.3  8.0  4.2  6.3  7.4  5.8  6.9

Solution

The mean is simply calculated as the sum of all the values in Table 3.E.20 divided by the total number of observations:

$$\bar{X} = \frac{5.7 + 6.5 + \cdots + 6.9}{10} = 6.6$$

The AVERAGE function in Excel calculates the simple arithmetic mean of the selected set of values. Let’s assume that the data in Table 3.E.20 are available from cell A1 to cell A10. To calculate the mean, we just need to insert the expression =AVERAGE(A1:A10).

Another way to calculate the mean in Excel, as well as other descriptive measures that will also be studied in this chapter, such as the median, mode, variance, standard deviation, standard error, skewness, and kurtosis, is by using the Analysis ToolPak add-in (Section 3.5).
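As a cross-check outside a spreadsheet, Expression (3.1) can also be evaluated in Python. The sketch below uses the grades of Table 3.E.20; the variable names are chosen only for this illustration.

```python
from statistics import mean

# Grades from Table 3.E.20
grades = [5.7, 6.5, 6.9, 8.3, 8.0, 4.2, 6.3, 7.4, 5.8, 6.9]

mean_manual = sum(grades) / len(grades)   # Expression (3.1): sum of values over n
mean_stdlib = mean(grades)                # same quantity via the standard library

print(round(mean_manual, 2), round(mean_stdlib, 2))
```

Both computations agree with the value of 6.6 obtained in Example 3.12.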

3.4.1.1.1.2 Case 2: Weighted Arithmetic Mean of Ungrouped Discrete and Continuous Data

When calculating the simple arithmetic mean, all of the occurrences have the same importance or weight. When we are interested in assigning different weights ($p_i$) to each value i of variable X, we use the weighted arithmetic mean:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i \cdot p_i}{\sum_{i=1}^{n} p_i} \tag{3.2}$$

If the weight is expressed in percentages (relative weight, rw), Expression (3.2) becomes:

$$\bar{X} = \sum_{i=1}^{n} X_i \cdot rw_i \tag{3.3}$$

Example 3.13

At Vanessa’s school, the annual average of each subject is calculated based on the grades obtained throughout all four quarters, with their respective weights being: 1, 2, 3, and 4. Table 3.E.21 shows Vanessa’s grades in mathematics in each quarter. Calculate her annual average in the subject.

Table 3.E.21

Vanessa’s Grades in Mathematics
Period        Grade   Weight
1st Quarter   4.5     1
2nd Quarter   7.0     2
3rd Quarter   5.5     3
4th Quarter   6.5     4

Solution

The annual average is calculated by using the weighted arithmetic mean criterion. Applying Expression (3.2) to the data in Table 3.E.21, we have:

$$\bar{X} = \frac{4.5 \times 1 + 7.0 \times 2 + 5.5 \times 3 + 6.5 \times 4}{1 + 2 + 3 + 4} = 6.1$$

Example 3.14

There are five stocks in a certain investment portfolio. Table 3.E.22 shows the average yield of each stock in the previous month, as well as the respective percentage invested. Determine the portfolio’s average yield.

Table 3.E.22

Yield of Each Stock and Percentage Invested
Stock               Yield (%)   % Investment
Bank of Brazil ON   1.05        10
Bradesco PN         0.56        25
Eletrobras PNB      0.08        15
Gerdau PN           0.24        20
Vale PN             0.75        30

Solution

The portfolio’s average yield (%) corresponds to the sum of the products between each stock’s average yield (%) and the respective percentage invested, and, using Expression (3.3), we have:

$$\bar{X} = 1.05 \times 0.10 + 0.56 \times 0.25 + 0.08 \times 0.15 + 0.24 \times 0.20 + 0.75 \times 0.30 = 0.53\%$$
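Expressions (3.2) and (3.3) can be checked with a short Python sketch. The helper name `weighted_mean` is hypothetical; the data are those of Examples 3.13 and 3.14, with the percentages invested written as fractions that sum to 1.

```python
def weighted_mean(values, weights):
    """Expression (3.2): sum of value * weight divided by the sum of the weights."""
    return sum(x * p for x, p in zip(values, weights)) / sum(weights)

# Example 3.13: quarterly grades with weights 1, 2, 3, and 4
vanessa_average = weighted_mean([4.5, 7.0, 5.5, 6.5], [1, 2, 3, 4])

# Example 3.14: relative weights already sum to 1, so Expression (3.3) applies
yields = [1.05, 0.56, 0.08, 0.24, 0.75]
shares = [0.10, 0.25, 0.15, 0.20, 0.30]
portfolio_yield = sum(x * rw for x, rw in zip(yields, shares))

print(round(vanessa_average, 2), round(portfolio_yield, 2))
```

Because the relative weights sum to 1, calling `weighted_mean(yields, shares)` gives the same portfolio yield, which shows that Expression (3.3) is a special case of Expression (3.2).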

3.4.1.1.1.3 Case 3: Arithmetic Mean of Grouped Discrete Data

When the discrete values of $X_i$ repeat themselves, the data are grouped into a frequency table. To calculate the arithmetic mean, we use the same criterion as for the weighted mean. However, the weight for each $X_i$ is its absolute frequency ($F_i$) and, instead of n observations with n different values, we have n observations with m different values (grouped data):

$$\bar{X} = \frac{\sum_{i=1}^{m} X_i \cdot F_i}{\sum_{i=1}^{m} F_i} = \frac{\sum_{i=1}^{m} X_i \cdot F_i}{n} \tag{3.4}$$

If the frequency of the data is expressed as a proportion of the total number of observations (relative frequency, Fr), Expression (3.4) becomes:

$$\bar{X} = \sum_{i=1}^{m} X_i \cdot Fr_i \tag{3.5}$$

Example 3.15

A satisfaction survey with 120 participants evaluated the performance of a health insurance company through grades ranging from 1 to 10. The survey’s results can be seen in Table 3.E.23. Calculate the arithmetic mean.

Table 3.E.23

Absolute Frequency Table
Grades   Number of Participants
1        9
2        12
3        15
4        18
5        24
6        26
7        5
8        7
9        3
10       1

Solution

The arithmetic mean of Example 3.15 is calculated from Expression (3.4):

$$\bar{X} = \frac{1 \times 9 + 2 \times 12 + \cdots + 9 \times 3 + 10 \times 1}{120} = 4.62$$
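The same calculation can be sketched in Python by storing Table 3.E.23 as a mapping from each grade to its absolute frequency. The dictionary name `frequency` is an assumption of this example.

```python
# Table 3.E.23: grade -> number of participants (absolute frequency F_i)
frequency = {1: 9, 2: 12, 3: 15, 4: 18, 5: 24, 6: 26, 7: 5, 8: 7, 9: 3, 10: 1}

n = sum(frequency.values())                                    # total observations
grouped_mean = sum(x * f for x, f in frequency.items()) / n    # Expression (3.4)

print(n, round(grouped_mean, 2))
```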

3.4.1.1.1.4 Case 4: Arithmetic Mean of Continuous Data Grouped into Classes

To calculate the simple arithmetic mean, the weighted arithmetic mean, and the arithmetic mean of grouped discrete data, Xi represents each i value of variable X.

For continuous data grouped into classes, each class does not have a single value defined, but a set of values. In order to calculate the arithmetic mean in this case, we assume that $X_i$ is the middle or central point of class i (i = 1, …, k), so Expressions (3.4) and (3.5) are rewritten in terms of the number of classes (k):

$$\bar{X} = \frac{\sum_{i=1}^{k} X_i \cdot F_i}{\sum_{i=1}^{k} F_i} = \frac{\sum_{i=1}^{k} X_i \cdot F_i}{n} \tag{3.6}$$

$$\bar{X} = \sum_{i=1}^{k} X_i \cdot Fr_i \tag{3.7}$$

Example 3.16

Table 3.E.24 shows the classes of salaries paid to the employees of a certain company and their respective absolute and relative frequencies. Calculate the average salary.

Table 3.E.24

Classes of Salaries (US$ 1000.00) and Their Respective Absolute and Relative Frequencies
Classes    Fi     Fri (%)
1 ├ 3      240    17.14
3 ├ 5      480    34.29
5 ├ 7      320    22.86
7 ├ 9      150    10.71
9 ├ 11     130    9.29
11 ├ 13    80     5.71
Sum        1400   100

Solution

Considering Xi the central point of class i and applying Expression (3.6), we have:

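$$\bar{X} = \frac{2 \times 240 + 4 \times 480 + 6 \times 320 + 8 \times 150 + 10 \times 130 + 12 \times 80}{1{,}400} = 5.557$$

or using Expression (3.7):

$$\bar{X} = 2 \times 0.1714 + 4 \times 0.3429 + \cdots + 10 \times 0.0929 + 12 \times 0.0571 = 5.557$$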

Therefore, the average salary is US$ 5,557.14.
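A minimal Python sketch of Expression (3.6) follows, assuming each class of Table 3.E.24 is stored as a (lower limit, upper limit) pair together with its absolute frequency; the variable names are illustrative only.

```python
# Table 3.E.24: ((lower limit, upper limit), absolute frequency), in US$ 1000.00
salary_classes = [((1, 3), 240), ((3, 5), 480), ((5, 7), 320),
                  ((7, 9), 150), ((9, 11), 130), ((11, 13), 80)]

n = sum(f for _, f in salary_classes)

# X_i is taken as the midpoint of class i, as assumed in Expression (3.6)
class_mean = sum(((low + high) / 2) * f for (low, high), f in salary_classes) / n

print(n, round(class_mean, 3))
```

The result is an approximation of the true mean, since every observation in a class is treated as if it were equal to the class midpoint.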

3.4.1.1.2 Median

The median (Md) is a measure of location. It locates the center of the distribution of a set of data sorted in ascending order. Its value separates the series into two equal parts, so 50% of the elements are less than or equal to the median, and the other 50% are greater than or equal to the median.

3.4.1.1.2.1 Case 1: Median of Ungrouped Discrete and Continuous Data

The median of variable X (discrete or continuous) can be calculated as follows:

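$$Md(X) = \begin{cases} \dfrac{X_{\frac{n}{2}} + X_{\left(\frac{n}{2}\right)+1}}{2}, & \text{if } n \text{ is an even number} \\[2ex] X_{\frac{(n+1)}{2}}, & \text{if } n \text{ is an odd number} \end{cases} \tag{3.8}$$

where n is the total number of observations and $X_1 \le \dots \le X_n$, considering that $X_1$ is the smallest observation or the value of the first element, and that $X_n$ is the highest observation or the value of the last element.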

Example 3.17

Table 3.E.25 shows the monthly production of treadmills of a company in a given year. Calculate the median.

Table 3.E.25

Monthly Production of Treadmills in a Given Year
Month    Production (units)
Jan.     210
Feb.     180
Mar.     203
April    195
May      208
June     230
July     185
Aug.     190
Sept.    200
Oct.     182
Nov.     205
Dec.     196

Solution

To calculate the median, the observations are sorted in ascending order. Therefore, we have the order of the observations and their respective positions:

180   182   185   190   195   196   200   203   205   208   210   230
1st   2nd   3rd   4th   5th   6th   7th   8th   9th   10th  11th  12th

The median will be the mean between the sixth and the seventh elements, since n is an even number, that is:

$$Md = \frac{X_{\frac{12}{2}} + X_{\left(\frac{12}{2}\right)+1}}{2}$$

$$Md = \frac{196 + 200}{2} = 198$$

Excel calculates the median of a set of data through the MEDIAN function.

Note that the median is not affected by the magnitude of the extreme values of the original variable. If, for instance, the highest value were 400 instead of 230, the median would be exactly the same, whereas the mean would be much higher.

The median is also known as the 2nd quartile (Q2), 50th percentile (P50), or 5th decile (D5). These definitions will be studied in more detail in the following sections.
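Expression (3.8) and the remark about extreme values can be illustrated with a short Python sketch. The function name `median_ungrouped` is hypothetical; the data are the monthly production figures of Table 3.E.25.

```python
from statistics import mean

def median_ungrouped(values):
    """Expression (3.8) applied to a list of observations."""
    x = sorted(values)
    n = len(x)
    if n % 2 == 0:
        # positions n/2 and (n/2)+1 in 1-based terms
        return (x[n // 2 - 1] + x[n // 2]) / 2
    return x[(n + 1) // 2 - 1]

production = [210, 180, 203, 195, 208, 230, 185, 190, 200, 182, 205, 196]
md = median_ungrouped(production)

# Replace the highest value (230) by 400 to mimic an extreme observation
with_extreme = [400 if v == 230 else v for v in production]
md_extreme = median_ungrouped(with_extreme)

print(md, md_extreme)                          # the median does not change
print(mean(production), mean(with_extreme))    # the mean does
```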

3.4.1.1.2.2 Case 2: Median of Grouped Discrete Data

Here, the calculation of the median is similar to the previous case. However, the data are grouped in a frequency distribution table.

Analogous to Case 1, if n is an odd number, the position of the central element will be (n + 1)/2. We can see in the cumulative frequency column the group that has this position and, consequently, its corresponding value in the first column (median).

If n is an even number, we verify the group(s) that contain(s) the central positions n/2 and (n/2) + 1 in the cumulative frequency column. If both positions correspond to the same group, we directly obtain their corresponding value in the first column (median). If each position corresponds to a distinct group, the median will be the average between the corresponding values defined in the first column.

Example 3.18

Table 3.E.26 shows the number of bedrooms in 70 real estate properties in a condominium located in the metropolitan area of Sao Paulo, and their respective absolute and cumulative frequencies. Calculate the median.

Table 3.E.26

Frequency Distribution
Number of Bedrooms   Fi   Fac
1                    6    6
2                    13   19
3                    20   39
4                    15   54
5                    7    61
6                    6    67
7                    3    70
Sum                  70

Since n is an even number, the median will be the average of the values that occupy positions n/2 and (n/2) + 1, that is:

$$Md = \frac{X_{\frac{n}{2}} + X_{\left(\frac{n}{2}\right)+1}}{2} = \frac{X_{35} + X_{36}}{2}$$

Based on Table 3.E.26, we can see that the third group contains all the elements between positions 20 and 39 (including 35 and 36), whose corresponding value is 3. Therefore, the median is:

$$Md = \frac{3 + 3}{2} = 3$$
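The lookup in the cumulative frequency column can be sketched in Python as follows. The function names are assumptions of this illustration; the data are those of Table 3.E.26.

```python
from bisect import bisect_left
from itertools import accumulate

def value_at_position(values, freqs, position):
    """Value of the observation at a 1-based position in a frequency table."""
    cumulative = list(accumulate(freqs))          # cumulative frequency column (Fac)
    # first group whose cumulative frequency reaches the position
    return values[bisect_left(cumulative, position)]

def median_grouped_discrete(values, freqs):
    n = sum(freqs)
    if n % 2 == 0:
        first = value_at_position(values, freqs, n // 2)
        second = value_at_position(values, freqs, n // 2 + 1)
        return (first + second) / 2
    return value_at_position(values, freqs, (n + 1) // 2)

bedrooms = [1, 2, 3, 4, 5, 6, 7]
counts = [6, 13, 20, 15, 7, 6, 3]

md_bedrooms = median_grouped_discrete(bedrooms, counts)
print(md_bedrooms)
```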

3.4.1.1.2.3 Case 3: Median of Continuous Data Grouped into Classes

For continuous variables grouped into classes, in which the data are presented in a frequency distribution table, we apply the following steps to calculate the median:

Step 1: Calculate the position of the median, not taking into consideration if n is an even or an odd number, through the following expression:

$$\text{Pos}(Md) = \frac{n}{2} \tag{3.9}$$

Step 2: Identify the class that contains the median (median class) from the cumulative frequency column.

Step 3: Calculate the median using the following expression:

$$Md = LI_{Md} + \frac{\left(\frac{n}{2} - F_{ac(Md-1)}\right)}{F_{Md}} \times A_{Md} \tag{3.10}$$

where:

  • $LI_{Md}$ = lower limit of the median class;
  • $F_{Md}$ = absolute frequency of the median class;
  • $F_{ac(Md-1)}$ = cumulative frequency of the class immediately before the median class;
  • $A_{Md}$ = range of the median class;
  • n = total number of observations.

Example 3.19

Consider the data in Example 3.16 regarding the classes of salaries paid to the employees of a company and their respective absolute and cumulative frequencies (Table 3.E.27). Calculate the median.

Table 3.E.27

Classes of Salaries (US$ 1000.00) and Their Respective Absolute and Cumulative Frequencies
Classes    Fi     Fac
1 ├ 3      240    240
3 ├ 5      480    720
5 ├ 7      320    1040
7 ├ 9      150    1190
9 ├ 11     130    1320
11 ├ 13    80     1400
Sum        1400

Solution

In the case of continuous data grouped into classes, let’s apply the following steps to calculate the median:

Step 1: First, we calculate the position of the median:

$$\text{Pos}(Md) = \frac{n}{2} = \frac{1400}{2} = 700$$

Step 2: Through the cumulative frequency column, we can see that the median is in the second class (3 ├ 5).

Step 3: Calculating the median:

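$$Md = LI_{Md} + \frac{\left(\frac{n}{2} - F_{ac(Md-1)}\right)}{F_{Md}} \times A_{Md}$$

where:

$LI_{Md} = 3$, $F_{Md} = 480$, $F_{ac(Md-1)} = 240$, $A_{Md} = 2$, $n = 1400$

Therefore, we have:

$$Md = 3 + \frac{(700 - 240)}{480} \times 2 = 4.9167 \quad (\text{US\$ } 4{,}916.67)$$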
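The three steps can be sketched in Python under the assumptions that all classes have the same range and that the table is not empty. The function name `median_classes` is hypothetical; the data are those of Table 3.E.27.

```python
def median_classes(lower_limits, freqs, class_range):
    """Median of data grouped into classes, following Expression (3.10)."""
    n = sum(freqs)
    position = n / 2                       # Step 1: Expression (3.9)
    cumulative_before = 0                  # cumulative frequency before the current class
    for lower, f in zip(lower_limits, freqs):
        if cumulative_before + f >= position:      # Step 2: median class found
            # Step 3: interpolate inside the median class
            return lower + (position - cumulative_before) / f * class_range
        cumulative_before += f
    raise ValueError("frequency table is empty")

lower_limits = [1, 3, 5, 7, 9, 11]
freqs = [240, 480, 320, 150, 130, 80]

md_salary = median_classes(lower_limits, freqs, class_range=2)
print(round(md_salary, 4))
```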

3.4.1.1.3 Mode

The mode (Mo) of a data series corresponds to the observation that occurs with the highest frequency. The mode is the only measure of position that can also be used for qualitative variables, since these variables only allow us to calculate frequencies.

3.4.1.1.3.1 Case 1: Mode of Ungrouped Data

Consider a set of observations X1, X2, …, Xn of a certain variable. The mode is the value that appears with the highest frequency.

Excel gives us the mode of a set of data through the MODE function.

Example 3.20

The production of carrots in a certain company is divided into five phases, including the post-harvest handling phase. Table 3.E.28 shows the average processing time (in seconds) in this phase for 20 observations. Calculate the mode.

Table 3.E.28

Processing Time in the Post-Harvest Handling Phase in Seconds
45.0  44.5  44.0  45.0  46.5  46.0  45.8  44.8  45.0  46.2
44.5  45.0  45.4  44.9  45.7  46.2  44.7  45.6  46.3  44.9

Solution

The mode is 45.0, which is the most frequent value in the dataset (Table 3.E.28). This value could be determined directly in Excel by using the MODE function.
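Outside Excel, the same result can be obtained by counting how often each value occurs. The Python sketch below uses the data of Table 3.E.28 and returns every value tied for the highest frequency, which is a design choice of this illustration.

```python
from collections import Counter

# Table 3.E.28: processing times in seconds
times = [45.0, 44.5, 44.0, 45.0, 46.5, 46.0, 45.8, 44.8, 45.0, 46.2,
         44.5, 45.0, 45.4, 44.9, 45.7, 46.2, 44.7, 45.6, 46.3, 44.9]

counts = Counter(times)                     # value -> absolute frequency
highest = max(counts.values())
modes = sorted(v for v, c in counts.items() if c == highest)

print(modes, highest)
```

Since `Counter` works with any hashable value, the same approach applies to qualitative data such as category labels.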

3.4.1.1.3.2 Case 2: Mode of Grouped Qualitative or Discrete Data

For qualitative data or discrete quantitative data grouped in a frequency distribution table, the mode can be obtained directly from the table. It is the value with the highest absolute frequency.

Example 3.21

A TV station interviewed 500 viewers trying to analyze their preferences in terms of interest categories. The result of the survey can be seen in Table 3.E.29. Calculate the mode.

Table 3.E.29

Viewers’ Preferences in Terms of Interest Categories
Interest Categories   Fi
Movies                71
Soap Operas           46
News                  90
Comedy                98
Sports                120
Concerts              35
Variety               40
Sum                   500

Solution

Based on Table 3.E.29, we can see that the mode corresponds to the category Sports (the highest absolute frequency). This example illustrates that the mode, unlike the other measures of position, can also be used for qualitative variables.

3.4.1.1.3.3 Case 3: Mode of Continuous Data Grouped into Classes

For continuous data grouped into classes, there are several procedures to calculate the mode, such as, Czuber’s and King’s methods.

Czuber’s method has the following phases:

Step 1: Identify the class that has the mode (modal class), which is the one with the highest absolute frequency.

Step 2: Calculate the mode (Mo):

$$Mo = LI_{Mo} + \frac{F_{Mo} - F_{Mo-1}}{2 \cdot F_{Mo} - \left(F_{Mo-1} + F_{Mo+1}\right)} \times A_{Mo} \tag{3.11}$$

where:

  • $LI_{Mo}$ = lower limit of the modal class;
  • $F_{Mo}$ = absolute frequency of the modal class;
  • $F_{Mo-1}$ = absolute frequency of the class immediately before the modal class;
  • $F_{Mo+1}$ = absolute frequency of the class immediately after the modal class;
  • $A_{Mo}$ = range of the modal class.

Example 3.22

A set of continuous data with 200 observations is grouped into classes with their respective absolute frequencies, as shown in Table 3.E.30. Determine the mode using Czuber’s method.

Table 3.E.30

Continuous Data Grouped into Classes and Their Respective Frequencies
Class      Fi
01 ├ 10    21
10 ├ 20    36
20 ├ 30    58
30 ├ 40    24
40 ├ 50    19
Sum        200

Solution

Considering continuous data grouped into classes, we can use Czuber’s method to calculate the mode:

Step 1: Based on Table 3.E.30, we can see that the modal class is the third one (20 ├ 30), since it has the highest absolute frequency.

Step 2: Calculating the mode (Mo):

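$$Mo = LI_{Mo} + \frac{F_{Mo} - F_{Mo-1}}{2 \cdot F_{Mo} - \left(F_{Mo-1} + F_{Mo+1}\right)} \times A_{Mo}$$

where:

$LI_{Mo} = 20$, $F_{Mo} = 58$, $F_{Mo-1} = 36$, $F_{Mo+1} = 24$, $A_{Mo} = 10$

Therefore, we have:

$$Mo = 20 + \frac{58 - 36}{2 \times 58 - (36 + 24)} \times 10 = 23.9$$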
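Expression (3.11) translates directly into a small Python function. The name `czuber_mode` and its argument names are assumptions of this sketch; the values are those used in Example 3.22.

```python
def czuber_mode(lower_limit, f_modal, f_previous, f_next, class_range):
    """Czuber's method, Expression (3.11)."""
    numerator = f_modal - f_previous
    denominator = 2 * f_modal - (f_previous + f_next)
    return lower_limit + numerator / denominator * class_range

# Modal class 20 ├ 30 with F = 58; neighbouring classes have F = 36 and F = 24
mo_czuber = czuber_mode(lower_limit=20, f_modal=58, f_previous=36,
                        f_next=24, class_range=10)
print(round(mo_czuber, 1))
```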

On the other hand, King’s method consists of the following phases:

Step 1: Identify the modal class (the one with the highest absolute frequency).

Step 2: Calculate the mode (Mo) using the following expression:

$$Mo = LI_{Mo} + \frac{F_{Mo+1}}{F_{Mo-1} + F_{Mo+1}} \times A_{Mo} \tag{3.12}$$

where:

  • $LI_{Mo}$ = lower limit of the modal class;
  • $F_{Mo-1}$ = absolute frequency of the class immediately before the modal class;
  • $F_{Mo+1}$ = absolute frequency of the class immediately after the modal class;
  • $A_{Mo}$ = range of the modal class.

Example 3.23

Once again, consider the data from the previous example. Use King’s method to determine the mode.

Solution

In Example 3.22, we saw that:

$LI_{Mo} = 20$, $F_{Mo+1} = 24$, $F_{Mo-1} = 36$, $A_{Mo} = 10$

Applying Expression (3.12):

$$Mo = LI_{Mo} + \frac{F_{Mo+1}}{F_{Mo-1} + F_{Mo+1}} \times A_{Mo} = 20 + \frac{24}{36 + 24} \times 10 = 24$$
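King's method, Expression (3.12), can be sketched in Python in the same way. The function name `king_mode` is hypothetical; note that, unlike Czuber's method, it does not use the frequency of the modal class itself.

```python
def king_mode(lower_limit, f_previous, f_next, class_range):
    """King's method, Expression (3.12)."""
    return lower_limit + f_next / (f_previous + f_next) * class_range

# Same modal class as in Example 3.22 (20 ├ 30)
mo_king = king_mode(lower_limit=20, f_previous=36, f_next=24, class_range=10)
print(round(mo_king, 1))
```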

3.4.1.2 Quantiles

According to Bussab and Morettin (2011), using only measures of central tendency may not be suitable to represent a set of data, since these measures are also affected by extreme values. Moreover, with these measures alone, the researcher cannot get a clear idea of the dispersion and symmetry of the data. As an alternative, we can use quantiles, such as quartiles, deciles, and percentiles. The 2nd quartile (Q2), the 5th decile (D5), and the 50th percentile (P50) correspond to the median; therefore, they are also measures of central tendency.

3.4.1.2.1 Quartiles

Quartiles (Qi, i = 1, 2, 3) are measures of position that divide a set of data, sorted in ascending order, into four equal parts.


Thus, the 1st Quartile (Q1 or the 25th percentile) indicates that 25% of the data are less than Q1, or that 75% of the data are greater than Q1.

The 2nd Quartile (Q2, or the 5th decile, or the 50th percentile) corresponds to the median, indicating that 50% of the data are less than Q2 and the other 50% are greater than Q2.

The 3rd Quartile (Q3 or the 75th percentile) indicates that 75% of the data are less than Q3, or that 25% of the data are greater than Q3.

3.4.1.2.2 Deciles

Deciles (Di, i = 1, 2, ..., 9) are measures of position that divide a set of data, sorted in ascending order, into 10 equal parts.


Therefore, the 1st decile (D1 or 10th percentile) indicates that 10% of the data are less than D1 or that 90% of the data are greater than D1.

The 2nd decile (D2 or 20th percentile) indicates that 20% of the data are less than D2 or that 80% of the data are greater than D2.

And so on, and so forth, until the 9th decile (D9 or 90th percentile), indicating that 90% of the data are less than D9 or that 10% of the data are greater than D9.

3.4.1.2.3 Percentiles

Percentiles (Pi, i = 1, 2, ..., 99) are measures of position that divide a set of data, sorted in ascending order, into 100 equal parts.

Hence, the 1st percentile (P1) indicates that 1% of the data is less than P1 or that 99% of the data are greater than P1.

The 2nd percentile (P2) indicates that 2% of the data are less than P2 or that 98% of the data are greater than P2.

And so on, and so forth, until the 99th percentile (P99), which indicates that 99% of the data are less than P99 or that 1% of the data is greater than P99.

3.4.1.2.3.1 Case 1: Quartiles, Deciles, and Percentiles of Ungrouped Discrete and Continuous Data

If the position of the quartile, decile, or percentile we are interested in is an integer or lies exactly halfway between two positions, calculating the respective quartile, decile, or percentile is straightforward. However, this is not always the case (imagine a sample with 33 elements for which we want to calculate the 67th percentile). Several methods have been proposed for this kind of calculation; they lead to close, but not identical, results.

We will present a simple and generic method that can be applied to calculate any quartile, decile, or percentile of order i, considering ungrouped discrete and continuous data:

Step 1: Sort the observations in ascending order.

Step 2: Determine the position of the quartile, decile, or percentile, of order i, we are interested in:

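Quartile:

$$\text{Pos}(Q_i) = \left[\frac{n}{4} \times i\right] + \frac{1}{2}, \quad i = 1, 2, 3 \tag{3.13}$$

Decile:

$$\text{Pos}(D_i) = \left[\frac{n}{10} \times i\right] + \frac{1}{2}, \quad i = 1, 2, \dots, 9 \tag{3.14}$$

Percentile:

$$\text{Pos}(P_i) = \left[\frac{n}{100} \times i\right] + \frac{1}{2}, \quad i = 1, 2, \dots, 99 \tag{3.15}$$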

Step 3: Calculate the value of the quartile, decile, or percentile that corresponds to the respective position.

Assume that Pos(Q1) = 3.75, that is, the value of Q1 is between the 3rd and 4th positions (75% closer to the 4th position, and 25% to the 3rd position). Therefore, Q1 will be the sum of the value that corresponds to the 3rd position multiplied by 0.25, with the value that corresponds to the 4th position multiplied by 0.75.

Example 3.24

Consider the data in Example 3.20 regarding the average carrot processing time in the post-harvest handling phase, as specified in Table 3.E.28. Determine Q1 (1st quartile), Q3 (3rd quartile), D2 (2nd decile), and P64 (64th percentile).

Solution

For ungrouped continuous data, we must apply the following steps to determine the quartiles, deciles, and percentiles we are interested in:

Step 1: Sort the observations in ascending order.

1st   2nd   3rd   4th   5th   6th   7th   8th   9th   10th
44.0  44.5  44.5  44.7  44.8  44.9  44.9  45.0  45.0  45.0

11th  12th  13th  14th  15th  16th  17th  18th  19th  20th
45.0  45.4  45.6  45.7  45.8  46.0  46.2  46.2  46.3  46.5

Step 2: Calculation of the positions of Q1, Q3, D2, and P64:

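  a) $\text{Pos}(Q_1) = \left[\frac{20}{4} \times 1\right] + \frac{1}{2} = 5.5$
  b) $\text{Pos}(Q_3) = \left[\frac{20}{4} \times 3\right] + \frac{1}{2} = 15.5$
  c) $\text{Pos}(D_2) = \left[\frac{20}{10} \times 2\right] + \frac{1}{2} = 4.5$
  d) $\text{Pos}(P_{64}) = \left[\frac{20}{100} \times 64\right] + \frac{1}{2} = 13.3$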

Step 3: Calculating Q1, Q3, D2, and P64:

a) Pos(Q1) = 5.5 means that the value we are interested in is halfway between positions 5 and 6, that is, Q1 is simply the average of the values that correspond to both positions:

$$Q_1 = \frac{44.8 + 44.9}{2} = 44.85$$

b) Pos(Q3) = 15.5 means that the value we are interested in is halfway between positions 15 and 16, so Q3 can be calculated as follows:

$$Q_3 = \frac{45.8 + 46.0}{2} = 45.9$$

c) Pos(D2) = 4.5 means that the value we are interested in is halfway between positions 4 and 5, so D2 can be calculated as follows:

$$D_2 = \frac{44.7 + 44.8}{2} = 44.75$$

d) Pos(P64) = 13.3 means that the value we are interested in lies between positions 13 and 14, with a weight of 70% for position 13 and 30% for position 14, so P64 can be calculated as follows:

$$P_{64} = (0.70 \times 45.6) + (0.30 \times 45.7) = 45.63$$

Interpretation

Q1 = 44.85 indicates that, in 25% of the observations (the first 5 observations listed in Step 1), the carrot processing time in the post-harvest handling phase is less than 44.85 seconds, or that in 75% of the observations (the remaining 15 observations), the processing time is greater than 44.85.

Q3 = 45.9 indicates that, in 75% of the observations (15 of them), the processing time is less than 45.9 seconds, or that in 5 observations, the processing time is greater than 45.9.

D2 = 44.75 indicates that, in 20% of the observations (4 of them), the processing time is less than 44.75 seconds, or that in 80% of the observations (16 of them), the processing time is greater than 44.75.

P64 = 45.63 indicates that, in 64% of the observations (12.8 of them), the processing time is less than 45.63 seconds, or that in 36% of the observations (7.2 of them) the processing time is greater than 45.63.
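The position method of Expressions (3.13) to (3.15) can be sketched as a single Python function. The name `quantile_by_position` is hypothetical, and clamping the position to the range from 1 to n is an assumption of this sketch for very low or very high orders, which the text does not address.

```python
import math

def quantile_by_position(values, order, parts):
    """Quantile of a given order; parts is 4 (quartiles), 10 (deciles), or 100 (percentiles)."""
    x = sorted(values)                              # Step 1
    n = len(x)
    position = n * order / parts + 0.5              # Step 2: 1-based, may be fractional
    position = min(max(position, 1), n)             # assumption: keep position within 1..n
    lower = math.floor(position)
    fraction = position - lower                     # weight of the next position
    lower_value = x[lower - 1]
    upper_value = x[min(lower, n - 1)]
    return (1 - fraction) * lower_value + fraction * upper_value   # Step 3

times = [45.0, 44.5, 44.0, 45.0, 46.5, 46.0, 45.8, 44.8, 45.0, 46.2,
         44.5, 45.0, 45.4, 44.9, 45.7, 46.2, 44.7, 45.6, 46.3, 44.9]

q1 = quantile_by_position(times, 1, 4)
q3 = quantile_by_position(times, 3, 4)
d2 = quantile_by_position(times, 2, 10)
p64 = quantile_by_position(times, 64, 100)
print(round(q1, 2), round(q3, 2), round(d2, 2), round(p64, 2))
```

Statistical packages may use other interpolation rules, so their results can differ slightly from the ones produced by this sketch.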

Excel calculates the quartile of order i (i = 0, 1, 2, 3, 4) through the QUARTILE function. As arguments of the function, we must define the array or set of data for which we want to calculate the quartile (it does not need to be in ascending order), in addition to the quartile we are interested in (minimum value = 0; 1st quartile = 1; 2nd quartile = 2; 3rd quartile = 3; maximum value = 4).

The k-th percentile (with k between 0 and 1) can also be calculated in Excel through the PERCENTILE function. As arguments of the function, we must define the array we are interested in, in addition to the value of k (for example, in the case of P64, k = 0.64).

The calculation of quartiles, deciles, and percentiles using SPSS and Stata statistical software will be demonstrated in Sections 3.6 and 3.7, respectively.

SPSS and Stata software use two methods to calculate quartiles, deciles, or percentiles. One of them is called Tukey’s Hinges and it is the method used in this book. The other method is related to the Weighted Average, whose calculations are more complex. Excel, on the other hand, implements another algorithm that gets similar results.

3.4.1.2.3.2 Case 2: Quartiles, Deciles, and Percentiles of Grouped Discrete Data

Here, the calculation of quartiles, deciles, and percentiles is similar to the previous case. However, the data are grouped in a frequency distribution table.

In the frequency distribution table, the data must be sorted in ascending order, with their respective absolute and cumulative frequencies. First, we must determine the position of the quartile, decile, or percentile of order i we are interested in through Expressions (3.13), (3.14), and (3.15), respectively. From the cumulative frequency column, we must verify the group(s) that contain(s) this position. If the position is an integer, its corresponding value is obtained directly in the first column. If the position is a fractional number, such as 2.5, and the 2nd and the 3rd positions are in the same group, its respective value is also obtained directly. On the other hand, if the position is a fractional number, such as 4.25, and positions 4 and 5 are in different groups, we must calculate the sum of the value that corresponds to the 4th position multiplied by 0.75 and the value that corresponds to the 5th position multiplied by 0.25 (similar to Case 1).

Example 3.25

Consider the data in Example 3.18 regarding the number of bedrooms in 70 real estate properties in a condominium located in the metropolitan area of Sao Paulo, and their respective absolute and cumulative frequencies (Table 3.E.26). Calculate Q1, D4, and P96.

Solution

Let’s calculate the positions of Q1, D4, and P96 through Expressions (3.13), (3.14), and (3.15), respectively, and their corresponding values:

  a) $\text{Pos}(Q_1) = \left[\frac{70}{4} \times 1\right] + \frac{1}{2} = 18$

Based on Table 3.E.26, we can see that position 18 is in the second group (2 bedrooms), so, Q1 = 2.

  b) $\text{Pos}(D_4) = \left[\frac{70}{10} \times 4\right] + \frac{1}{2} = 28.5$

Through the cumulative frequency column, we can see that positions 28 and 29 are in the third group (3 bedrooms), so, D4 = 3.

  c) $\text{Pos}(P_{96}) = \left[\frac{70}{100} \times 96\right] + \frac{1}{2} = 67.7$

that is, P96 lies between positions 67 and 68, with a weight of 0.70 on position 68 and a weight of 0.30 on position 67. Through the cumulative frequency column, we can see that position 68 is in the seventh group (7 bedrooms) and position 67 is in the sixth group (6 bedrooms), so P96 can be calculated as follows:

$$P_{96} = (0.70 \times 7) + (0.30 \times 6) = 6.7$$

Interpretation

Q1 = 2 indicates that 25% of the real estate properties have less than 2 bedrooms, or that 75% of the real estate properties have more than 2 bedrooms.

D4 = 3 indicates that 40% of the real estate properties have less than 3 bedrooms, or that 60% of the real estate properties have more than 3 bedrooms.

P96 = 6.7 indicates that 96% of the real estate properties have less than 6.7 bedrooms, or that 4% of the real estate properties have more than 6.7 bedrooms.

3.4.1.2.3.3 Case 3: Quartiles, Deciles, and Percentiles of Continuous Data Grouped into Classes

For continuous data grouped into classes in which data are represented in a frequency distribution table, we must apply the following steps to calculate the quartiles, deciles, and percentiles:

Step 1: Calculate the position of the quartile, decile, or percentile, of order i, we are interested in through the following expressions:

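Quartile: $$\text{Pos}(Q_i) = \frac{n}{4} \times i, \quad i = 1, 2, 3$$  (3.16)

Decile: $$\text{Pos}(D_i) = \frac{n}{10} \times i, \quad i = 1, 2, \ldots, 9$$  (3.17)

Percentile: $$\text{Pos}(P_i) = \frac{n}{100} \times i, \quad i = 1, 2, \ldots, 99$$  (3.18)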

Step 2: Identify the class that contains the quartile, decile, or percentile, of order i, we are interested in (quartile class, decile class, or percentile class) from the cumulative frequency column.

Step 3: Calculate the quartile, decile, or percentile, of order i, we are interested in through the following expressions:

Quartile: $$Q_i = LL_{Q_i} + \left(\frac{\text{Pos}(Q_i) - F_{cum(Q_i - 1)}}{F_{Q_i}}\right) \times R_{Q_i}, \quad i = 1, 2, 3$$  (3.19)

where:

  • $LL_{Q_i}$ = lower limit of the quartile class;
  • $F_{cum(Q_i - 1)}$ = cumulative frequency of the class immediately preceding the quartile class;
  • $F_{Q_i}$ = absolute frequency of the quartile class;
  • $R_{Q_i}$ = range (width) of the quartile class.

Decile: $$D_i = LL_{D_i} + \left(\frac{\text{Pos}(D_i) - F_{cum(D_i - 1)}}{F_{D_i}}\right) \times R_{D_i}, \quad i = 1, 2, \ldots, 9$$  (3.20)

where:

  • $LL_{D_i}$ = lower limit of the decile class;
  • $F_{cum(D_i - 1)}$ = cumulative frequency of the class immediately preceding the decile class;
  • $F_{D_i}$ = absolute frequency of the decile class;
  • $R_{D_i}$ = range (width) of the decile class.

Percentile: $$P_i = LL_{P_i} + \left(\frac{\text{Pos}(P_i) - F_{cum(P_i - 1)}}{F_{P_i}}\right) \times R_{P_i}, \quad i = 1, 2, \ldots, 99$$  (3.21)

where:

  • $LL_{P_i}$ = lower limit of the percentile class;
  • $F_{cum(P_i - 1)}$ = cumulative frequency of the class immediately preceding the percentile class;
  • $F_{P_i}$ = absolute frequency of the percentile class;
  • $R_{P_i}$ = range (width) of the percentile class.

Example 3.26

A survey on the health conditions of 250 patients collected information about their weight. The data are grouped into classes, as shown in Table 3.E.31. Calculate the first quartile, the seventh decile, and the 60th percentile.

Table 3.E.31

Absolute and Cumulative Frequency Distribution of Patients’ Weight (kg) Grouped into Classes

Class         Fi    Fac
50 ├ 60       18     18
60 ├ 70       28     46
70 ├ 80       49     95
80 ├ 90       66    161
90 ├ 100      40    201
100 ├ 110     33    234
110 ├ 120     16    250
Sum          250

Solution

Let’s apply the three steps to calculate Q1, D7, and P60:

Step 1: Let’s calculate the position of the first quartile, the seventh decile, and the 60th percentile through Expressions (3.16), (3.17), and (3.18), respectively:

1st Quartile: $$\text{Pos}(Q_1) = \frac{250}{4} \times 1 = 62.5$$

7th Decile: $$\text{Pos}(D_7) = \frac{250}{10} \times 7 = 175$$

60th Percentile: $$\text{Pos}(P_{60}) = \frac{250}{100} \times 60 = 150$$

Step 2: Let’s identify the class that has Q1, D7, and P60 from the cumulative frequency column in Table 3.E.31:

  • Q1 is in the 3rd class (70 ├ 80)
  • D7 is in the 5th class (90 ├ 100)
  • P60 is in the 4th class (80 ├ 90)

Step 3: Let’s calculate Q1, D7, and P60 from Expressions (3.19), (3.20), and (3.21), respectively:

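$$Q_1 = LL_{Q_1} + \left(\frac{\text{Pos}(Q_1) - F_{cum(Q_1 - 1)}}{F_{Q_1}}\right) \times R_{Q_1} = 70 + \left(\frac{62.5 - 46}{49}\right) \times 10 = 73.37$$

$$D_7 = LL_{D_7} + \left(\frac{\text{Pos}(D_7) - F_{cum(D_7 - 1)}}{F_{D_7}}\right) \times R_{D_7} = 90 + \left(\frac{175 - 161}{40}\right) \times 10 = 93.5$$

$$P_{60} = LL_{P_{60}} + \left(\frac{\text{Pos}(P_{60}) - F_{cum(P_{60} - 1)}}{F_{P_{60}}}\right) \times R_{P_{60}} = 80 + \left(\frac{150 - 95}{66}\right) \times 10 = 88.33$$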

Interpretation

Q1 = 73.37 indicates that 25% of the patients weigh less than 73.37 kg, or that 75% of the patients weigh more than 73.37 kg.

D7 = 93.5 indicates that 70% of the patients weigh less than 93.5 kg, or that 30% of the patients weigh more than 93.5 kg.

P60 = 88.33 indicates that 60% of the patients weigh less than 88.33 kg, or that 40% of the patients weigh more than 88.33 kg.
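The three steps of Case 3 can be reproduced with a short Python sketch using the data in Table 3.E.31. The function name and its parameters are illustrative choices, and equal class widths are assumed.

```python
from itertools import accumulate


def class_quantile(lower_limits, width, freqs, order, parts):
    """Quantile of order `order` out of `parts` (4, 10, or 100) for data grouped into classes."""
    cum = list(accumulate(freqs))            # cumulative frequency column
    n = cum[-1]
    pos = n * order / parts                  # Step 1: Expressions (3.16)-(3.18)
    # Step 2: first class whose cumulative frequency reaches the position
    k = next(i for i, c in enumerate(cum) if c >= pos)
    prev_cum = cum[k - 1] if k > 0 else 0    # cumulative frequency of the preceding class
    # Step 3: Expressions (3.19)-(3.21)
    return lower_limits[k] + (pos - prev_cum) / freqs[k] * width


# Table 3.E.31: patients' weight grouped into classes of width 10
lower_limits = [50, 60, 70, 80, 90, 100, 110]
freqs = [18, 28, 49, 66, 40, 33, 16]

q1 = class_quantile(lower_limits, 10, freqs, 1, 4)       # about 73.37
d7 = class_quantile(lower_limits, 10, freqs, 7, 10)      # 93.5
p60 = class_quantile(lower_limits, 10, freqs, 60, 100)   # about 88.33
```

The interpolation assumes that observations are spread evenly within each class, which is the same assumption behind Expressions (3.19) to (3.21).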

3.4.1.3 Identifying the Existence of Univariate Outliers

A dataset can contain observations that are extremely distant from most observations or that are inconsistent. These observations are called outliers or atypical, discrepant, abnormal, or extreme values.

Before deciding what will be done with the outliers, we must know the causes that lead to such an occurrence. In many cases, these causes can determine the most suitable treatment for the respective outliers. The main causes are measurement mistakes, execution/implementation mistakes, and variability inherent to the population.

There are many outlier identification methods: boxplots, discordance models, Dixon’s test, Grubbs’ test, Z-scores, among others. In the Appendix of Chapter 11 (Cluster Analysis), a very efficient method for detecting multivariate outliers will be presented (BACON algorithm—Blocked Adaptive Computationally Efficient Outlier Nominators).

In boxplots (whose construction was studied in Section 3.3.2.5), outliers are identified from the IQR (interquartile range), which corresponds to the difference between the third and first quartiles:

$$IQR = Q_3 - Q_1$$  (3.22)

Note that the IQR is the length of the box. Any value located more than 1.5 ∙ IQR below Q1 or more than 1.5 ∙ IQR above Q3 will be considered a mild outlier and will be represented by a circle. Such values may even be accepted in the population, but with some suspicion. Thus, the value X° of a variable is considered a mild outlier when:

$$X^{\circ} < Q_1 - 1.5 \cdot IQR$$  (3.23)

$$X^{\circ} > Q_3 + 1.5 \cdot IQR$$  (3.24)

Likewise, any value located more than 3 ∙ IQR below Q1 or more than 3 ∙ IQR above Q3 will be considered an extreme outlier and will be represented by an asterisk. Thus, the value X⁎ of a variable is considered an extreme outlier when:

$$X^{*} < Q_1 - 3 \cdot IQR$$  (3.25)

$$X^{*} > Q_3 + 3 \cdot IQR$$  (3.26)

Fig. 3.15 illustrates the boxplot with the identification of outliers.

Fig. 3.15 Boxplot with the identification of outliers.

Example 3.27

Consider the sorted data in Example 3.24 regarding the average carrot processing time in the post-harvest handling phase:

44.0  44.5  44.5  44.7  44.8  44.9  44.9  45.0  45.0  45.0
45.0  45.4  45.6  45.7  45.8  46.0  46.2  46.2  46.3  46.5

where Q1 = 44.85, Q2 = 45, Q3 = 45.9, mean = 45.3, and mode = 45.

Check and see if there are mild and extreme outliers.

Solution

To verify if there is a possible outlier, we must calculate:

$$Q_1 - 1.5 \cdot (Q_3 - Q_1) = 44.85 - 1.5 \cdot (45.9 - 44.85) = 43.275$$

$$Q_3 + 1.5 \cdot (Q_3 - Q_1) = 45.9 + 1.5 \cdot (45.9 - 44.85) = 47.475$$

Since there is no value in the distribution outside this interval, we conclude that there are no mild outliers. Obviously, it is not necessary to calculate the interval for extreme outliers.
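The check in this example can be sketched in Python as follows. The function names are illustrative, the quartiles are taken as given in the example rather than recomputed, and, as an assumption of this sketch, a value beyond the 3 ∙ IQR limits is reported only as extreme, not also as mild.

```python
def boxplot_fences(q1, q3, factor):
    """Lower and upper limits located `factor` * IQR away from the box."""
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr


def find_outliers(data, q1, q3):
    """Split the data into mild and extreme outliers (Expressions 3.23-3.26)."""
    mild_low, mild_high = boxplot_fences(q1, q3, 1.5)
    ext_low, ext_high = boxplot_fences(q1, q3, 3.0)
    extreme = [x for x in data if x < ext_low or x > ext_high]
    mild = [x for x in data
            if (x < mild_low or x > mild_high) and ext_low <= x <= ext_high]
    return mild, extreme


# Carrot processing times from Example 3.27, with the quartiles given there
times = [44.0, 44.5, 44.5, 44.7, 44.8, 44.9, 44.9, 45.0, 45.0, 45.0,
         45.0, 45.4, 45.6, 45.7, 45.8, 46.0, 46.2, 46.2, 46.3, 46.5]
low, high = boxplot_fences(44.85, 45.9, 1.5)      # about 43.275 and 47.475
mild, extreme = find_outliers(times, 44.85, 45.9)  # both lists are empty here
```

Because the extreme limits are always wider than the mild ones, an empty list of mild candidates implies that there are no extreme outliers either.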

If only one outlier is identified in a certain variable, the researcher can treat it through one of the existing procedures, for example, the complete elimination of this observation. On the other hand, if there is more than one outlier for one or more variables individually, eliminating all of these observations can reduce the sample size significantly. To avoid this problem, it is very common to replace the atypical values of the observations considered outliers for a certain variable with the mean of that variable, thus removing the outlying values without discarding the observations (Fávero et al., 2009).

The authors mention other procedures for dealing with outliers, such as replacing them with values from a regression, or winsorization, which treats an equal number of observations on each side of the distribution in an organized way, replacing the most extreme values with the nearest remaining ones.

Fávero et al. (2009) also highlight the importance of dealing with outliers when the researcher is interested in investigating the behavior of a certain variable without the influence of observations with atypical values. On the other hand, if the main goal is to analyze the behavior of these atypical observations or to define subgroups through discrepancy criteria, eliminating these observations or replacing their values may not be the best solution.

3.4.2 Measures of Dispersion or Variability

To study the behavior of a set of data, we use measures of central tendency, measures of dispersion, in addition to the nature or shape of the data distribution. Measures of central tendency determine a value that represents the set of data. In order to characterize the dispersion or variability of the data, measures of dispersion are necessary.

The most common measures of dispersion are the range, average deviation, variance, standard deviation, standard error, and the coefficient of variation (CV).

3.4.2.1 Range

The simplest measure of variability is the total range, or simply range (R), which represents the difference between the highest and lowest value of the set of data:

$$R = X_{max} - X_{min}$$  (3.27)

3.4.2.2 Average Deviation

Deviation is the difference between each observed value and the mean of the variable. Thus, for population data, it would be represented by $(X_i - \mu)$, and for sample data, by $(X_i - \bar{X})$. The modulus or absolute deviation ignores the ± sign and is denoted by $|X_i - \bar{X}|$.

Average deviation, or absolute average deviation, represents the arithmetic mean of absolute deviations.

3.4.2.2.1 Case 1: Average Deviation of Ungrouped Discrete and Continuous Data

The average deviation ($\bar{D}$) is the sum of the absolute deviations of all observations divided by the population size (N) or the sample size (n):

$$\bar{D} = \frac{\sum_{i=1}^{N} |X_i - \mu|}{N} \quad \text{(for the population)}$$  (3.28)

$$\bar{D} = \frac{\sum_{i=1}^{n} |X_i - \bar{X}|}{n} \quad \text{(for samples)}$$  (3.29)

Example 3.28

Table 3.E.32 shows the distances traveled (in km) by a vehicle in order to deliver 10 packages throughout the day. Calculate the average deviation.

Table 3.E.32

Distances Traveled (km)
12.4  22.6  18.9  9.7  14.5  22.5  26.3  17.7  31.2  20.4

Solution

For the data in Table 3.E.32, we have $\bar{X} = 19.62$. Applying Expression (3.29), we get the average deviation:

$$\bar{D} = \frac{|12.4 - 19.62| + |22.6 - 19.62| + \cdots + |20.4 - 19.62|}{10} = 4.98$$

The average deviation can be directly calculated in Excel using the AVEDEV function.
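As an alternative to a spreadsheet, Expression (3.29) can be written directly in Python. The sketch below is only illustrative; the function name is our own choice.

```python
def average_deviation(data):
    """Mean of the absolute deviations around the arithmetic mean (Expression 3.29)."""
    mean = sum(data) / len(data)
    return sum(abs(x - mean) for x in data) / len(data)


# Table 3.E.32: distances traveled (km)
distances = [12.4, 22.6, 18.9, 9.7, 14.5, 22.5, 26.3, 17.7, 31.2, 20.4]
d_bar = average_deviation(distances)   # about 4.98
```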

3.4.2.2.2 Case 2: Average Deviation of Grouped Discrete Data

For grouped data, presented in a frequency distribution table for m groups, the calculation of the average deviation is:

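$$\bar{D} = \frac{\sum_{i=1}^{m} |X_i - \mu| \cdot F_i}{N} \quad \text{(for the population)}$$  (3.30)

$$\bar{D} = \frac{\sum_{i=1}^{m} |X_i - \bar{X}| \cdot F_i}{n} \quad \text{(for samples)}$$  (3.31)

bearing in mind that $\bar{X} = \frac{\sum_{i=1}^{m} X_i \cdot F_i}{n}$.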

Example 3.29

Table 3.E.33 shows the number of goals scored by the D.C. soccer team in their last 30 games, with their respective absolute frequencies. Calculate the average deviation.

Table 3.E.33

Frequency Distribution of Example 3.29
Number of Goals    Fi
0                   5
1                   8
2                   6
3                   4
4                   4
5                   2
6                   1
Sum                30

Solution

The mean is $\bar{X} = \frac{0 \times 5 + 1 \times 8 + \cdots + 6 \times 1}{30} = 2.133$. The average deviation can be determined from the calculations presented in Table 3.E.34:

Table 3.E.34

Calculations of the Average Deviation for Example 3.29

Number of Goals    Fi    $|X_i - \bar{X}|$    $|X_i - \bar{X}| \cdot F_i$
0                   5    2.133                10.667
1                   8    1.133                 9.067
2                   6    0.133                 0.800
3                   4    0.867                 3.467
4                   4    1.867                 7.467
5                   2    2.867                 5.733
6                   1    3.867                 3.867
Sum                30                         41.067

Therefore, $\bar{D} = \frac{\sum_{i=1}^{m} |X_i - \bar{X}| \cdot F_i}{n} = \frac{41.067}{30} = 1.369$.
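The calculation in Table 3.E.34 can be reproduced with a few lines of Python. This is a minimal sketch of Expression (3.31); the function name is illustrative.

```python
def grouped_average_deviation(values, freqs):
    """Average deviation for data in a frequency table (Expression 3.31)."""
    n = sum(freqs)
    mean = sum(x * f for x, f in zip(values, freqs)) / n
    # Each absolute deviation is weighted by the absolute frequency of its group
    return sum(abs(x - mean) * f for x, f in zip(values, freqs)) / n


# Table 3.E.33: goals scored in the last 30 games
goals = [0, 1, 2, 3, 4, 5, 6]
freqs = [5, 8, 6, 4, 4, 2, 1]
d_bar = grouped_average_deviation(goals, freqs)   # about 1.369
```

The same function applies to continuous data grouped into classes if the class midpoints are passed as `values`.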

3.4.2.2.3 Case 3: Average Deviation of Continuous Data Grouped into Classes

For continuous data grouped into classes, the calculation of the average deviation is:

$$\bar{D} = \frac{\sum_{i=1}^{k} |X_i - \mu| \cdot F_i}{N} \quad \text{(for the population)}$$  (3.32)

$$\bar{D} = \frac{\sum_{i=1}^{k} |X_i - \bar{X}| \cdot F_i}{n} \quad \text{(for samples)}$$  (3.33)

Note that Expressions (3.32) and (3.33) are similar to Expressions (3.30) and (3.31), respectively, except that, instead of m groups, we consider k classes. Moreover, $X_i$ represents the middle or central point of each class i, where $\bar{X} = \frac{\sum_{i=1}^{k} X_i \cdot F_i}{n}$, as presented in Expression (3.6).

Example 3.30

To determine the variation in weight due to genetic factors, a survey of 100 newborn babies collected information about their weight. Table 3.E.35 shows the data grouped into classes and their respective absolute frequencies. Calculate the average deviation.

Table 3.E.35

Newborn Babies’ Weight (in kg) Grouped into Classes
Class         Fi
2.0 ├ 2.5     10
2.5 ├ 3.0     24
3.0 ├ 3.5     31
3.5 ├ 4.0     22
4.0 ├ 4.5     13
Sum          100

Solution

First, we must calculate $\bar{X}$:

$$\bar{X} = \frac{\sum_{i=1}^{k} X_i \cdot F_i}{n} = \frac{2.25 \times 10 + 2.75 \times 24 + 3.25 \times 31 + 3.75 \times 22 + 4.25 \times 13}{100} = 3.270$$

The average deviation can be determined from the calculations presented in Table 3.E.36:

Table 3.E.36

Calculations of the Average Deviation for Example 3.30
Class         Fi    Xi      $|X_i - \bar{X}|$    $|X_i - \bar{X}| \cdot F_i$
2.0 ├ 2.5     10    2.25    1.02                 10.20
2.5 ├ 3.0     24    2.75    0.52                 12.48
3.0 ├ 3.5     31    3.25    0.02                  0.62
3.5 ├ 4.0     22    3.75    0.48                 10.56
4.0 ├ 4.5     13    4.25    0.98                 12.74
Sum          100                                 46.60

Therefore, $\bar{D} = \frac{\sum_{i=1}^{k} |X_i - \bar{X}| \cdot F_i}{n} = \frac{46.6}{100} = 0.466$.

3.4.2.3 Variance

Variance is a measure of dispersion or variability that evaluates how much the data are dispersed in relation to the arithmetic mean. Thus, the higher the variance, the higher the data dispersion.

3.4.2.3.1 Case 1: Variance of Ungrouped Discrete and Continuous Data

Instead of considering the mean of absolute deviations, as discussed in the previous section, it is more common to calculate the mean of squared deviations. This measure is known as variance:

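$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} = \frac{\sum_{i=1}^{N} X_i^2 - \frac{\left(\sum_{i=1}^{N} X_i\right)^2}{N}}{N} \quad \text{(for the population)}$$  (3.34)

$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} = \frac{\sum_{i=1}^{n} X_i^2 - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}}{n - 1} \quad \text{(for samples)}$$  (3.35)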

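When both are computed from the same set of observations (so that N = n), the relationship between the sample variance ($S^2$) and the population variance ($\sigma^2$) is given by:

$$S^2 = \frac{N}{n - 1} \cdot \sigma^2$$  (3.36)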

Example 3.31

Consider the data in Example 3.28 regarding the distances traveled (in km) by a vehicle in order to deliver 10 packages throughout the day. Calculate the variance.

Solution

We saw in Example 3.28 that $\bar{X} = 19.62$. Applying Expression (3.35), we have:

$$S^2 = \frac{(12.4 - 19.62)^2 + (22.6 - 19.62)^2 + \cdots + (20.4 - 19.62)^2}{9} = 41.94$$

The sample variance can be directly calculated in Excel using the VAR.S function. To calculate the population variance, we must use the VAR.P function.
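The same calculation can be sketched in Python. The example below evaluates both forms of Expression (3.35) and compares them with the `variance` and `pvariance` functions of Python's standard `statistics` module; the variable names are illustrative.

```python
import statistics

# Table 3.E.32: distances traveled (km)
distances = [12.4, 22.6, 18.9, 9.7, 14.5, 22.5, 26.3, 17.7, 31.2, 20.4]
n = len(distances)
mean = sum(distances) / n

# First form of Expression (3.35): squared deviations around the mean
s2_definition = sum((x - mean) ** 2 for x in distances) / (n - 1)

# Second form of Expression (3.35): sums of X and of X squared only
s2_shortcut = (sum(x * x for x in distances) - sum(distances) ** 2 / n) / (n - 1)

# Library equivalents: divisor n - 1 (sample) and divisor n (population)
s2_library = statistics.variance(distances)
sigma2 = statistics.pvariance(distances)
```

With `sigma2` computed from the same ten observations, multiplying it by n / (n − 1) recovers the sample variance, in line with Expression (3.36).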

3.4.2.3.2 Case 2: Variance of Grouped Discrete Data

For grouped data, represented in a frequency distribution table by m groups, the variance can be calculated as follows:

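$$\sigma^2 = \frac{\sum_{i=1}^{m} (X_i - \mu)^2 \cdot F_i}{N} = \frac{\sum_{i=1}^{m} X_i^2 \cdot F_i - \frac{\left(\sum_{i=1}^{m} X_i \cdot F_i\right)^2}{N}}{N} \quad \text{(for the population)}$$  (3.37)

$$S^2 = \frac{\sum_{i=1}^{m} (X_i - \bar{X})^2 \cdot F_i}{n - 1} = \frac{\sum_{i=1}^{m} X_i^2 \cdot F_i - \frac{\left(\sum_{i=1}^{m} X_i \cdot F_i\right)^2}{n}}{n - 1} \quad \text{(for samples)}$$  (3.38)

where $\bar{X} = \frac{\sum_{i=1}^{m} X_i \cdot F_i}{n}$.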

Example 3.32

Consider the data in Example 3.29 regarding the number of goals scored by the D.C. soccer team in the last 30 games, with their respective absolute frequencies. Calculate the variance.

Solution

As calculated in Example 3.29, the mean is $\bar{X} = 2.133$. The variance can be determined from the calculations presented in Table 3.E.37:

Table 3.E.37

Calculations of the Variance

Number of Goals    Fi    $(X_i - \bar{X})^2$    $(X_i - \bar{X})^2 \cdot F_i$
0                   5     4.551                 22.756
1                   8     1.284                 10.276
2                   6     0.018                  0.107
3                   4     0.751                  3.004
4                   4     3.484                 13.938
5                   2     8.218                 16.436
6                   1    14.951                 14.951
Sum                30                           81.467

Therefore, $S^2 = \frac{\sum_{i=1}^{m} (X_i - \bar{X})^2 \cdot F_i}{n - 1} = \frac{81.467}{29} = 2.809$.
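Table 3.E.37 can be reproduced with a short Python sketch of the first form of Expression (3.38). The function name is our own choice.

```python
def grouped_sample_variance(values, freqs):
    """Sample variance for data in a frequency table (Expression 3.38)."""
    n = sum(freqs)
    mean = sum(x * f for x, f in zip(values, freqs)) / n
    # Squared deviations weighted by the absolute frequencies, divided by n - 1
    return sum((x - mean) ** 2 * f for x, f in zip(values, freqs)) / (n - 1)


# Table 3.E.33: goals scored in the last 30 games
goals = [0, 1, 2, 3, 4, 5, 6]
freqs = [5, 8, 6, 4, 4, 2, 1]
s2 = grouped_sample_variance(goals, freqs)   # about 2.809
```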

3.4.2.3.3 Case 3: Variance of Continuous Data Grouped into Classes

For continuous data grouped into classes, we calculate the variance as follows:

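$$\sigma^2 = \frac{\sum_{i=1}^{k} (X_i - \mu)^2 \cdot F_i}{N} = \frac{\sum_{i=1}^{k} X_i^2 \cdot F_i - \frac{\left(\sum_{i=1}^{k} X_i \cdot F_i\right)^2}{N}}{N} \quad \text{(for the population)}$$  (3.39)

$$S^2 = \frac{\sum_{i=1}^{k} (X_i - \bar{X})^2 \cdot F_i}{n - 1} = \frac{\sum_{i=1}^{k} X_i^2 \cdot F_i - \frac{\left(\sum_{i=1}^{k} X_i \cdot F_i\right)^2}{n}}{n - 1} \quad \text{(for samples)}$$  (3.40)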

Note that Expressions (3.39) and (3.40) are similar to Expressions (3.37) and (3.38), respectively, except that we consider k classes instead of m groups.

Example 3.33

Consider the data in Example 3.30 regarding the weight of newborn babies grouped into classes, with their respective absolute frequencies. Calculate the variance.

Solution

As calculated in Example 3.30, we have $\bar{X} = 3.270$.

The variance can be determined from the calculations presented in Table 3.E.38:

Table 3.E.38

Calculations of the Variance for Example 3.33
Class         Fi    Xi      $(X_i - \bar{X})^2$    $(X_i - \bar{X})^2 \cdot F_i$
2.0 ├ 2.5     10    2.25    1.0404                 10.4040
2.5 ├ 3.0     24    2.75    0.2704                  6.4896
3.0 ├ 3.5     31    3.25    0.0004                  0.0124
3.5 ├ 4.0     22    3.75    0.2304                  5.0688
4.0 ├ 4.5     13    4.25    0.9604                 12.4852
Sum          100                                   34.4600

Therefore, $S^2 = \frac{\sum_{i=1}^{k} (X_i - \bar{X})^2 \cdot F_i}{n - 1} = \frac{34.46}{99} = 0.348$.
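As a cross-check, the second (computational) form of Expression (3.40) can be evaluated in Python from the class limits in Table 3.E.35. This is a sketch under the usual assumption that each class is represented by its midpoint; the variable names are illustrative.

```python
# Table 3.E.35: newborn babies' weight (kg) grouped into classes
classes = [(2.0, 2.5), (2.5, 3.0), (3.0, 3.5), (3.5, 4.0), (4.0, 4.5)]
freqs = [10, 24, 31, 22, 13]

# Each class is represented by its midpoint Xi
midpoints = [(low + high) / 2 for low, high in classes]
n = sum(freqs)

sum_xf = sum(x * f for x, f in zip(midpoints, freqs))        # sum of Xi * Fi
sum_x2f = sum(x * x * f for x, f in zip(midpoints, freqs))   # sum of Xi^2 * Fi

# Second form of Expression (3.40)
s2 = (sum_x2f - sum_xf ** 2 / n) / (n - 1)                   # about 0.348
```

This form needs only two running sums, which is convenient when the table is long, although it agrees with the deviation-based form only up to rounding.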

3.4.2.4 Standard Deviation

Since the variance considers the mean of squared deviations, its value is expressed in the square of the original unit of measurement, which makes it difficult to interpret. To solve this problem, we calculate the square root of the variance. This measure is known as the standard deviation. It is calculated as follows:

$$\sigma = \sqrt{\sigma^2} \quad \text{(for the population)}$$  (3.41)

$$S = \sqrt{S^2} \quad \text{(for samples)}$$  (3.42)

Example 3.34

Once again, consider the data in Examples 3.28 or 3.31 regarding the distances traveled (in km) by the vehicle. Calculate the standard deviation.

Solution

We have $\bar{X} = 19.62$. The standard deviation is the square root of the variance, which has already been calculated in Example 3.31:

$$S = \sqrt{\frac{(12.4 - 19.62)^2 + (22.6 - 19.62)^2 + \cdots + (20.4 - 19.62)^2}{9}} = \sqrt{41.94} = 6.476$$

The standard deviation of a sample can be directly calculated in Excel using the STDEV.S function. To calculate the standard deviation of the population, we use the STDEV.P function.
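In Python, the standard `statistics` module offers `stdev` (sample) and `pstdev` (population). The sketch below applies them to the distances in Table 3.E.32 and, as an illustration of how grouped data can be handled, expands the frequency table of Table 3.E.33 back into individual observations; the variable names are our own.

```python
import statistics

# Table 3.E.32: distances traveled (km)
distances = [12.4, 22.6, 18.9, 9.7, 14.5, 22.5, 26.3, 17.7, 31.2, 20.4]
s = statistics.stdev(distances)        # sample standard deviation, about 6.476
sigma = statistics.pstdev(distances)   # population standard deviation (divisor N)

# Table 3.E.33: repeat each number of goals as many times as its frequency
goals = [0, 1, 2, 3, 4, 5, 6]
freqs = [5, 8, 6, 4, 4, 2, 1]
observations = [g for g, f in zip(goals, freqs) for _ in range(f)]
s_goals = statistics.stdev(observations)   # about 1.676
```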

Example 3.35

Consider the data in Examples 3.29 or 3.32 regarding the number of goals scored by the D.C. soccer team in the last 30 games, with their respective absolute frequencies. Calculate the standard deviation.

Solution

The mean is $\bar{X} = 2.133$. The standard deviation is the square root of the variance, so it can be determined from the calculations of the variance, which has already been calculated in Example 3.32, as demonstrated in Table 3.E.37.

Therefore, $S = \sqrt{\frac{\sum_{i=1}^{m} (X_i - \bar{X})^2 \cdot F_i}{n - 1}} = \sqrt{\frac{81.467}{29}} = \sqrt{2.809} = 1.676$.

Example 3.36

Consider the data in Examples 3.30 or 3.33 regarding the weight of newborn babies grouped into classes, with their respective absolute frequencies. Calculate the standard deviation.

Solution

We have $\bar{X} = 3.270$. The standard deviation is the square root of the variance, so it can be determined from the calculations of the variance, which has already been calculated in Example 3.33, as demonstrated in Table 3.E.38.

Therefore, $S = \sqrt{\frac{\sum_{i=1}^{k} (X_i - \bar{X})^2 \cdot F_i}{n - 1}} = \sqrt{\frac{34.46}{99}} = \sqrt{0.348} = 0.59$.

3.4.2.5 Standard Error

The standard error is the standard deviation of the mean. It is obtained by dividing the standard deviation by the square root of the population or sample size:

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{N}} \quad \text{(for the population)}$$  (3.43)

$$S_{\bar{X}} = \frac{S}{\sqrt{n}} \quad \text{(for samples)}$$  (3.44)

The higher the number of measurements, the better the determination of the average value will be (higher accuracy), due to the compensation of random errors.

Example 3.37

One of the phases in the preparation of concrete is mixing it in a concrete mixer. Tables 3.E.39 and 3.E.40 show the concrete mixing times (in seconds), considering a sample with 10 and 30 elements, respectively. Calculate the standard error for both cases and interpret the results.

Table 3.E.39

Concrete Mixing Time for a Sample With 10 Elements
124  111  132  142  108  127  133  144  148  105

Table 3.E.40

Concrete Mixing Time for a Sample With 30 Elements
125  102  135  126  132  129  156  112  108  134
126  104  143  140  138  129  119  114  107  121
124  112  148  145  130  125  120  127  106  148

Solution

First, let’s calculate the standard deviation for both samples:

$$S_1 = \sqrt{\frac{(124 - 127.4)^2 + (111 - 127.4)^2 + \cdots + (105 - 127.4)^2}{9}} = 15.364$$

$$S_2 = \sqrt{\frac{(125 - 126.167)^2 + (102 - 126.167)^2 + \cdots + (148 - 126.167)^2}{29}} = 14.227$$

To calculate the standard error, we must apply Expression (3.44):

$$S_{\bar{X}_1} = \frac{S_1}{\sqrt{n_1}} = \frac{15.364}{\sqrt{10}} = 4.858$$

$$S_{\bar{X}_2} = \frac{S_2}{\sqrt{n_2}} = \frac{14.227}{\sqrt{30}} = 2.598$$

Despite the small difference between the two standard deviations, we can see that the standard error of the first sample is almost double that of the second sample. Therefore, the higher the number of measurements, the higher the accuracy.
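The comparison between the two samples can be reproduced with the following Python sketch of Expression (3.44). The function name is illustrative; `stdev` comes from the standard `statistics` module.

```python
import math
import statistics


def standard_error(sample):
    """Sample standard deviation divided by the square root of n (Expression 3.44)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


# Table 3.E.39: concrete mixing times (s), 10 elements
sample_10 = [124, 111, 132, 142, 108, 127, 133, 144, 148, 105]

# Table 3.E.40: concrete mixing times (s), 30 elements
sample_30 = [125, 102, 135, 126, 132, 129, 156, 112, 108, 134,
             126, 104, 143, 140, 138, 129, 119, 114, 107, 121,
             124, 112, 148, 145, 130, 125, 120, 127, 106, 148]

se_10 = standard_error(sample_10)   # about 4.858
se_30 = standard_error(sample_30)   # about 2.598
```

Since the two standard deviations are close, most of the gap between the standard errors comes from the divisor: tripling n divides the standard error by roughly the square root of 3.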

3.4.2.6 Coefficient of Variation

The coefficient of variation (CV) is a relative measure of dispersion that provides the variation of the data in relation to the mean. The smaller the value, the more homogeneous the data will be, that is, the smaller the dispersion around the mean will be. It can be calculated as follows:

$$CV = \frac{\sigma}{\mu} \times 100 \; (\%) \quad \text{(for the population)}$$  (3.45)

$$CV = \frac{S}{\bar{X}} \times 100 \; (\%) \quad \text{(for samples)}$$  (3.46)

A CV can be considered low, indicating a set of data that is reasonably homogeneous, when it is less than 30%. If this value is greater than 30%, the set of data can be considered heterogeneous. However, this standard varies according to the application.

Example 3.38

Calculate the coefficient of variation for both samples of the previous example.

Solution

Applying Expression (3.46), we have:

$$CV_1 = \frac{S_1}{\bar{X}_1} \times 100 = \frac{15.364}{127.4} \times 100 = 12.06\%$$

$$CV_2 = \frac{S_2}{\bar{X}_2} \times 100 = \frac{14.227}{126.167} \times 100 = 11.28\%$$

These results confirm the homogeneity of the data of the variable being studied for both samples. We conclude, therefore, that the mean is a good measure to represent the data.
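A minimal Python sketch of Expression (3.46) for the two samples is shown below. The function name is our own, and the 30% reference value is the rule of thumb mentioned in this section, not a universal threshold.

```python
import statistics


def coefficient_of_variation(sample):
    """Sample standard deviation relative to the mean, in percent (Expression 3.46)."""
    return statistics.stdev(sample) / statistics.mean(sample) * 100


# Tables 3.E.39 and 3.E.40: concrete mixing times (s)
sample_10 = [124, 111, 132, 142, 108, 127, 133, 144, 148, 105]
sample_30 = [125, 102, 135, 126, 132, 129, 156, 112, 108, 134,
             126, 104, 143, 140, 138, 129, 119, 114, 107, 121,
             124, 112, 148, 145, 130, 125, 120, 127, 106, 148]

cv_10 = coefficient_of_variation(sample_10)   # about 12.06 (%)
cv_30 = coefficient_of_variation(sample_30)   # about 11.28 (%)

# Rule of thumb from the text: a CV below 30% suggests reasonably homogeneous data
homogeneous = cv_10 < 30 and cv_30 < 30
```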

Let’s now study the measures of skewness and kurtosis.
