This chapter discusses the main concepts of univariate descriptive statistics. Through tables, charts, and/or summary measures, it is possible to describe the behavior of each type of variable. Frequency distribution tables represent the frequencies with which a set of data occurs. Charts can be used to represent the distribution of the variable. Summary measures are subdivided into measures of position or location (central tendency and quantiles), measures of dispersion or variability, and measures of shape (skewness and kurtosis). Measures of position can be used to represent a dataset, while measures of dispersion quantify its variability. Measures of skewness and kurtosis, in turn, characterize the shape of the distribution of the sampled population elements around the mean. Finally, tables, charts, graphs, and summary measures are studied using Excel, IBM SPSS Statistics Software®, and Stata Statistical Software®.
Univariate descriptive statistics; Frequency distribution tables; Charts; Summary measures; Measures of position or location (central tendency and quantiles); Measures of dispersion or variability; Measures of shape (skewness and kurtosis)
Mathematics is the alphabet with which God has written the Universe.
Galileo Galilei
Descriptive statistics describes and summarizes the main characteristics observed in a dataset through tables, charts, graphs, and summary measures, allowing the researcher to have a better understanding of the data behavior. The analysis is based on the dataset being studied (sample), without drawing any conclusions or inferences from the population.
Researchers can use descriptive statistics to study a single variable (univariate descriptive statistics), two variables (bivariate descriptive statistics), or more than two variables (multivariate descriptive statistics). In this chapter, we will study the concepts of descriptive statistics involving a single variable.
Univariate descriptive statistics considers the following topics: (a) the frequency in which a set of data occurs through frequency distribution tables; (b) the representation of the variable’s distribution through charts; and (c) measures that represent a data series, such as measures of position or location, measures of dispersion or variability, and measures of shape (skewness and kurtosis).
The four main goals of this chapter are: (1) to introduce the most common concepts related to the tables, charts, and summary measures in univariate descriptive statistics, (2) to present its applications in real examples, (3) to construct tables, charts, and summary measures using Excel and the statistical software SPSS and Stata, and (4) to discuss the results achieved.
As described in the previous chapter, before we begin using descriptive statistics, it is necessary to identify the type of variable being studied. The type of variable is essential when calculating descriptive statistics and in the graphical representation of the results. Fig. 3.1 shows the univariate descriptive statistics that will be studied in this chapter, represented by tables, charts, graphs, and summary measures, for each type of variable. Fig. 3.1 summarizes the following information:
Frequency distribution tables can be used to represent the frequency in which a set of data with qualitative or quantitative variables occurs.
In the case of qualitative variables, the table represents the frequency in which each variable category happens. For discrete quantitative variables, the frequency of occurrences is calculated for each discrete value of the variable. On the other hand, continuous variable data are first grouped into classes and, afterwards, we calculate the frequencies in which each class occurs.
A frequency distribution table contains the following calculations:
Through a practical example, we will build the frequency distribution table using the calculations of the absolute frequency, relative frequency, cumulative frequency, and relative cumulative frequency for each category of the qualitative variable being analyzed.
Through the frequency distribution table, we can calculate the absolute frequency, the relative frequency, the cumulative frequency, and the relative cumulative frequency for each possible value of the discrete variable.
Unlike the table for qualitative variables, the table here lists the possible numeric values instead of the possible categories. To facilitate understanding, the data must be presented in ascending order.
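As an illustration, the four frequency columns can be computed in a short Python sketch (the function name and dictionary keys here are ours, not part of the chapter's Excel/SPSS/Stata examples):

```python
from collections import Counter

def frequency_table(data):
    """Absolute, relative, cumulative, and relative cumulative
    frequencies for each distinct value, in ascending order."""
    n = len(data)
    counts = Counter(data)
    table = []
    cum = 0
    for value in sorted(counts):
        fi = counts[value]          # absolute frequency
        cum += fi                   # cumulative frequency
        table.append({
            "value": value,
            "Fi": fi,
            "Fri": fi / n,          # relative frequency
            "Fac": cum,             # cumulative frequency
            "Frac": cum / n,        # relative cumulative frequency
        })
    return table

# Example: number of defects found in 10 inspected parts
for row in frequency_table([0, 1, 1, 2, 0, 1, 3, 1, 2, 0]):
    print(row)
```

The same function works for qualitative variables, since only counting is involved; the sorted categories then play the role of the numeric values.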
As described in Chapter 2, continuous quantitative variables are those whose possible values are in an interval of real numbers. Therefore, it makes no sense to calculate the frequency for each possible value, since they rarely repeat themselves. It is better to group the data into classes or ranges.
The interval to be defined between the classes is arbitrary. However, we must be careful: if the number of classes is too small, a lot of information can be lost; on the other hand, if the number of classes is too large, the summary of information is compromised (Bussab and Morettin, 2011). The interval between the classes does not need to be constant but, in order to keep things simple, we will assume the same interval.
The following steps must be taken to build a frequency distribution table for continuous data:
Step 1: Sort the data in ascending order.
Step 2: Determine the number of classes (k), using one of the options:
where n is the sample size.
The value of k must be an integer.
Step 3: Determine the interval between the classes (h), calculated as the range of the sample (A = maximum value − minimum value) divided by the number of classes:
h=A/k
The value of h is rounded up to the nearest integer.
Step 4: Build the frequency distribution table (calculate the absolute frequency, the relative frequency, the cumulative frequency, and the relative cumulative frequency) for each class.
The lowest limit of the first class corresponds to the minimum value of the sample. To determine the highest limit of each class, we must add the value of h to the lowest limit of the respective class. The lowest limit of the new class corresponds to the highest limit of the previous class.
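Steps 1 to 4 can be sketched in Python. The sketch assumes Sturges' rule, k = 1 + 3.322·log₁₀(n), as the choice in Step 2; the function name and tuple layout are ours:

```python
import math

def class_frequency_table(data):
    """Frequency distribution for continuous data grouped into k
    classes of constant width h (Steps 1-4 of the text). Sturges'
    rule is assumed for k; other choices of k work the same way."""
    data = sorted(data)                       # Step 1: ascending order
    n = len(data)
    k = round(1 + 3.322 * math.log10(n))      # Step 2: number of classes
    a = data[-1] - data[0]                    # range A = max - min
    h = math.ceil(a / k)                      # Step 3: round h up
    table, cum, lower = [], 0, data[0]
    for j in range(k):                        # Step 4: tabulate each class
        upper = lower + h
        if j == k - 1:
            fi = sum(x >= lower for x in data)         # last class keeps the max
        else:
            fi = sum(lower <= x < upper for x in data) # upper limit excluded
        cum += fi
        # (lower limit, upper limit, Fi, Fri, cumulative, rel. cumulative)
        table.append((lower, upper, fi, fi / n, cum, cum / n))
        lower = upper
    return table
```

Note how each new lower limit is the previous class's upper limit, exactly as described above.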
The behavior of qualitative and quantitative variable data can also be represented in a graphical way. Charts are a representation of numeric data, in the form of geometric figures (graphs, diagrams, drawings, or images), allowing the reader to interpret these data quickly and objectively.
In Section 3.3.1, the main graphical representations for qualitative variables are illustrated: bar charts (horizontal and vertical), pie charts, and a Pareto chart.
The graphical representation of quantitative variables is usually illustrated by line graphs, dot plots, histograms, stem-and-leaf plots, and boxplots (or box-and-whisker diagrams), as shown in Section 3.3.2.
Bar charts (horizontal and vertical), pie charts, a Pareto chart, line graphs, dot plots, and histograms will be generated in Excel. The boxplots and histograms will be constructed by using SPSS and Stata.
To build a chart in Excel, first, variables’ data and names must be standardized, codified, and selected in a spreadsheet. The next step consists in clicking on the Insert tab and, in the group Charts, selecting the type of chart we are interested in using (Columns, Rows, Pie, Bar, Area, Scatter, or Other Charts). The chart will be generated automatically on the screen, and it can be personalized according to the preferences of the researcher.
Excel offers a variety of chart styles, layouts, and formats. To use them, the researcher just needs to select the plotted chart and click on the Design, Layout, or Format tab. On the Layout tab, for example, many resources are available, such as Chart Title; Axis Titles (shows the names of the horizontal and vertical axes); Legend (shows or hides the legend); Data Labels (allows the researcher to insert the series name, the category name, or the values of the labels in the desired place); Data Table (shows the data table below the chart, with or without legend codes); Axes (allows the researcher to personalize the scale of the horizontal and vertical axes); and Gridlines (shows or hides horizontal and vertical gridlines). The Chart Title, Axis Titles, Legend, Data Labels, and Data Table icons are in the Labels group, while the Axes and Gridlines icons are in the Axes group.
This type of chart is widely used for nominal and ordinal qualitative variables, but it can also be used for discrete quantitative variables, because it allows us to investigate the presence of data trends.
As its name indicates, through bars, this chart represents the absolute or relative frequencies of each possible category (or numeric value) of a qualitative variable (or quantitative). In vertical bar charts, each variable category is shown on the X-axis as a bar with constant width, and the height of the respective bar indicates the frequency of the category on the Y-axis. Conversely, in horizontal bar charts, each variable category is shown on the Y-axis as a bar of constant height, and the length of the respective bar indicates the frequency of the category on the X-axis.
Let’s now build horizontal and vertical bar charts from a practical example.
Another way to represent qualitative data, in terms of relative frequencies (percentages), is through pie charts. The chart corresponds to a circle with an arbitrary radius (the whole) divided into sectors or slices of several different sizes (parts of the whole).
This chart allows the researcher to visualize the data as slices of a pie or parts of a whole. Let’s now build the pie chart from a practical example.
The Pareto chart is a quality control tool whose main objective is to investigate the types of problems and, consequently, to identify their respective causes, so that action can be taken to reduce or eliminate them.
The Pareto chart is a chart that contains bars and a line graph. The bars represent the absolute frequencies of occurrences of problems and the lines represent the relative cumulative frequencies. The problems are sorted in descending order of priority. Let’s now illustrate a practical example with a Pareto chart.
In a line graph, points are represented by the intersection of the variables involved on the horizontal axis (X) and on the vertical axis (Y), and they are connected by straight lines.
Despite considering two axes, line graphs will be used in this chapter to represent the behavior of a single variable. The graph shows the evolution or trend of a quantitative variable’s data, which is usually continuous, at regular intervals. The numeric variable values are represented on the Y-axis, and the X-axis only shows the data distribution in a uniform way. Let’s now illustrate a practical example of a line graph.
A scatter plot is very similar to a line graph. The biggest difference between them is in the way the data are plotted on the horizontal axis.
Similar to a line graph, here the points are also represented by the intersection of the variables along the X-axis and the vertical axis. However, they are not connected by straight lines.
The scatter plot studied in this chapter is used to show the evolution or trend of a single quantitative variable’s data, similar to the line graph; however, at irregular intervals (in general). Analogous to a line graph, the numeric variable values are represented on the Y-axis and the X-axis only represents the data behavior throughout time.
In the next chapter, we will see how a scatter plot can be used to describe the behavior of two variables simultaneously (bivariate analysis). The numeric values of one variable will be represented on the Y-axis and the other one on the X-axis.
A histogram is a vertical bar chart that represents the frequency distribution of one quantitative variable (discrete or continuous). The variable values being studied are presented on the X-axis (the base of each bar, with a constant width, represents each possible value of the discrete variable or each class of continuous values, sorted in ascending order). On the other hand, the height of the bars on the Y-axis represents the frequency distribution (absolute, relative, or cumulative) of the respective variable values.
A histogram is very similar to a Pareto chart. It is also one of the seven quality tools. A Pareto chart represents the frequency distribution of a qualitative variable (types of problem), whose categories represented on the X-axis are sorted in order of priority (from the category with the highest frequency to the one with the lowest). A histogram represents the frequency distribution of a quantitative variable, whose values represented on the X-axis are sorted in ascending order.
Therefore, the first step to elaborate a histogram is building the frequency distribution table. As presented in Sections 3.2.2 and 3.2.3, for each possible value of a discrete variable or for a class with continuous data, we calculate the absolute frequency, the relative frequency, the cumulative frequency, and the relative cumulative frequency. The data must be sorted in ascending order.
The histogram is then constructed from this table. The first column of the frequency distribution table, which represents the numeric values or the classes with the values of the variable being studied, will be presented on the X-axis, and the column of absolute frequency (or relative frequency, cumulative frequency, or relative cumulative frequency) will be presented on the Y-axis.
Many pieces of statistical software generate the histogram automatically, from the original values of the quantitative variable being studied, without having to calculate the frequencies. Even though Excel has the option of building a histogram from analysis tools, we will show how to build it from the column chart, due to its simplicity.
As mentioned, many statistical computer packages, including SPSS and Stata, build the histogram automatically from the original data of the variable being studied (in this example, using the data in Table 3.E.14), without having to calculate the frequencies. Moreover, these packages have the option of plotting the normal curve.
Fig. 3.9 shows the histogram generated in SPSS (with the option of a normal curve) using the data in Table 3.E.14. Sections 3.6 and 3.7 will show in detail how it can be constructed using SPSS and Stata software, respectively.
Note that the values of the discrete variable are presented in the middle of the base.
For continuous variables, consider the data in Table 3.E.5 (Example 3.3), regarding the grades of the students enrolled in the subject Financial Market. These data were sorted in ascending order, as presented in Table 3.E.6.
Fig. 3.10 shows the histogram generated using SPSS software (with the option of a normal curve) using the data in Table 3.E.5 or Table 3.E.6.
Note that the data were grouped considering an interval between classes of h = 0.5, differently from Example 3.3, which considered h = 1. The classes' lower limits are represented on the left side of the base of the bar, and the upper limits (not included in the class) on the right side. The height of the bar represents the total frequency in the class. For example, the first bar represents the 3.5 ├ 4.0 class, and there are three values in this interval (3.5, 3.8, and 3.9).
Both bar charts and histograms represent the shape of the variable’s frequency distribution. The stem-and-leaf plot is an alternative to represent the frequency distributions of discrete and continuous quantitative variables with few observations, with the advantage of maintaining the original value of each observation (it allows the visualization of all data information).
In the plot, the representation of each observation is divided into two parts, separated by a vertical line: the stem is located on the left of the vertical line and represents the observation's first digit(s); the leaf is located on the right of the vertical line and represents the observation's last digit(s). Choosing the number of initial digits that will form the stem (or, equivalently, the number of complementary digits that will form the leaf) is arbitrary. The stems usually contain the most significant digits, and the leaves the least significant.
The stems are represented in a single column and their different values throughout many lines. For each stem represented on the left-hand side of the vertical line, we have the respective leaves shown on the right-hand side throughout many columns. Stems as well as leaves must be sorted in ascending order. In the cases in which there are too many leaves per stem, we can have more than one line with the same stem. Choosing the number of lines is arbitrary, as is defining the interval or the number of classes in a frequency distribution.
To build a stem-and-leaf plot, we can follow the sequence of steps:
Step 1: Sort the data in ascending order, to make the visualization of the data easier.
Step 2: Define the number of initial digits that will form the stem, or the number of complementary digits that will form the leaf.
Step 3: Elaborate the stems, represented in a single column on the left of the vertical line. Their different values are represented throughout many lines, in ascending order. When the number of leaves by stem is very high, we can define two or more lines for the same stem.
Step 4: Place the leaves that correspond to the respective stems, on the right-hand side of the vertical line, throughout many columns (in ascending order).
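The four steps can be sketched in a few lines of Python. The sketch assumes integer observations and one leaf digit; the function name is illustrative:

```python
def stem_and_leaf(data, leaf_digits=1):
    """Stem-and-leaf plot: the last `leaf_digits` digit(s) of each
    observation form the leaf, the remaining digits the stem."""
    div = 10 ** leaf_digits
    stems = {}
    for x in sorted(data):                  # Step 1: ascending order
        stem, leaf = divmod(x, div)         # Steps 2-3: split the digits
        stems.setdefault(stem, []).append(leaf)
    lines = []
    for stem in sorted(stems):              # Step 4: leaves next to each stem
        leaves = " ".join(str(l) for l in stems[stem])
        lines.append(f"{stem} | {leaves}")
    return lines

print("\n".join(stem_and_leaf([12, 15, 21, 21, 27, 33, 34, 38])))
```

Because sorting happens before the split, leaves appear in ascending order within each stem, preserving every original observation.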
The boxplot (or box-and-whisker diagram) is a graphical representation of five measures of position or location of a certain variable: minimum value, first quartile (Q1), second quartile (Q2) or median (Md), third quartile (Q3), and maximum value. In a sorted sample, the median corresponds to the central position, and the quartiles subdivide the sample into four equal parts, each one containing 25% of the data.
Thus, the first quartile (Q1) describes the first 25% of the data (organized in ascending order). The second quartile corresponds to the median (50% of the sorted data are located below it and the remaining 50% above it), and the third quartile (Q3) corresponds to 75% of the observations. The dispersion measure resulting from these location measures is called the interquartile range (IQR) or interquartile interval (IQI) and corresponds to the difference between Q3 and Q1.
This plot allows us to assess the data symmetry and distribution. It also gives us a visual perspective of whether or not there are discrepant data (univariate outliers), since these data are above the upper and lower limits. A representation of the diagram can be seen in Fig. 3.14.
Calculating the median, the first, and third quartiles, and investigating the existence of univariate outliers will be discussed in Sections 3.4.1.1, 3.4.1.2, and 3.4.1.3, respectively. In Sections 3.6.3 and 3.7, we will study how to generate the box-and-whisker diagram on SPSS and Stata, respectively, using a practical example.
Information found in a dataset can be summarized through suitable numerical measures, called summary measures.
In univariate descriptive statistics, the most common summary measures have as their main objective to represent the behavior of the variable being studied through its central and noncentral values, its dispersions, or the way its values are distributed around the mean.
The summary measures that will be studied in this chapter are measures of position or location (measures of central tendency and quantiles), measures of dispersion or variability, and measures of shape, such as, skewness and kurtosis.
These measures are calculated for metric or quantitative variables. The only exception is the mode, which is a measure of central tendency that provides the most frequent value of a certain variable, so, it can also be calculated for nonmetric or qualitative variables.
These measures provide values that characterize the behavior of a data series, indicating the data position or location in relation to the axis of the values assumed by the variable or characteristic being studied.
The measures of position or location are subdivided into measures of central tendency (mean, median, and mode) and quantiles (quartiles, deciles, and percentiles).
The most common measures of central tendency are the arithmetic mean, the median, and the mode.
The arithmetic mean can be a representative measure of a population with N elements, represented by the Greek letter μ, or a representative measure of a sample with n elements, represented by $\bar{X}$.
Simple arithmetic mean, or simply mean, or average, is the sum of all the values of a certain variable (discrete or continuous) divided by the total number of observations. Thus, the sample arithmetic mean of a certain variable X ($\bar{X}$) is:
$\bar{X} = \dfrac{\sum_{i=1}^{n} X_i}{n}$
where n is the total number of observations in the dataset and Xi, for i = 1, …, n, represents each one of variable X’s values.
When calculating the simple arithmetic mean, all of the occurrences have the same importance or weight. When we are interested in assigning different weights (pi) to each value i of variable X, we use the weighted arithmetic mean:
$\bar{X} = \dfrac{\sum_{i=1}^{n} X_i \cdot p_i}{\sum_{i=1}^{n} p_i}$
If the weight is expressed in percentages (relative weight - rw), Expression (3.2) becomes:
$\bar{X} = \sum_{i=1}^{n} X_i \cdot rw_i$
When the discrete values of Xi repeat themselves, the data are grouped into a frequency table. To calculate the arithmetic mean, we have to use the same criterion as for the weighted mean. However, the weight for each Xi will be represented by absolute frequencies (Fi) and, instead of n observations with n different values, we will have n observations with m different values (grouped data):
$\bar{X} = \dfrac{\sum_{i=1}^{m} X_i \cdot F_i}{\sum_{i=1}^{m} F_i} = \dfrac{\sum_{i=1}^{m} X_i \cdot F_i}{n}$
If the frequency of the data is expressed in terms of the percentage relative to the absolute frequency (relative frequency—Fr), Expression (3.4) becomes:
$\bar{X} = \sum_{i=1}^{m} X_i \cdot Fr_i$
To calculate the simple arithmetic mean, the weighted arithmetic mean, and the arithmetic mean of grouped discrete data, Xi represents each i value of variable X.
For continuous data grouped into classes, each class does not have a single value defined, but a set of values. In order for the arithmetic mean to be calculated in this case, we assume that Xi is the middle or central point of class i (i = 1,…,k), so, Expressions (3.4) and (3.5) are rewritten due to the number of classes (k):
$\bar{X} = \dfrac{\sum_{i=1}^{k} X_i \cdot F_i}{\sum_{i=1}^{k} F_i} = \dfrac{\sum_{i=1}^{k} X_i \cdot F_i}{n}$
$\bar{X} = \sum_{i=1}^{k} X_i \cdot Fr_i$
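The different forms of the arithmetic mean collapse into two small functions, since the grouped-data mean is just a weighted mean with frequencies as weights (a minimal sketch; the names are ours):

```python
def simple_mean(x):
    """Simple arithmetic mean: sum of the values divided by n."""
    return sum(x) / len(x)

def weighted_mean(x, w):
    """Weighted arithmetic mean with weights w. Passing absolute
    frequencies Fi as w gives the mean of grouped discrete data."""
    return sum(xi * wi for xi, wi in zip(x, w)) / sum(w)

# Mean of continuous data grouped into classes: use the class
# midpoints as Xi and the class frequencies as weights.
midpoints = [1.0, 3.0, 5.0]   # classes [0,2), [2,4), [4,6)
freqs = [2, 5, 3]
grouped = weighted_mean(midpoints, freqs)
```

For relative weights or relative frequencies that already sum to 1, the denominator of `weighted_mean` equals 1 and the expression reduces to the Σ Xi·rwi form above.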
The median (Md) is a measure of location. It locates the center of the distribution of a set of data sorted in ascending order. Its value separates the series into two equal parts: 50% of the elements are less than or equal to the median, and the other 50% are greater than or equal to it.
The median of variable X (discrete or continuous) can be calculated as follows:
$Md(X) = \begin{cases} \dfrac{X_{\left(\frac{n}{2}\right)} + X_{\left(\frac{n}{2}\right)+1}}{2}, & \text{if } n \text{ is an even number} \\ X_{\left(\frac{n+1}{2}\right)}, & \text{if } n \text{ is an odd number} \end{cases}$
where n is the total number of observations and X1 ≤ … ≤ Xn, considering that X1 is the smallest observation or the value of the first element, and that Xn is the highest observation or the value of the last element.
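The two cases of the expression translate directly into code (a minimal sketch, with 0-based indexing handling the position arithmetic):

```python
def median(x):
    """Median of ungrouped data: the middle element if n is odd,
    the average of the two middle elements if n is even."""
    s = sorted(x)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                   # X_((n+1)/2) in 1-based terms
    return (s[mid - 1] + s[mid]) / 2    # (X_(n/2) + X_((n/2)+1)) / 2
```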
Here, the calculation of the median is similar to the previous case. However, the data are grouped in a frequency distribution table.
Analogous to Case 1, if n is an odd number, the position of the central element will be (n + 1)/2. We can see in the cumulative frequency column the group that has this position and, consequently, its corresponding value in the first column (median).
If n is an even number, we verify the group(s) that contain(s) the central positions n/2 and (n/2) + 1 in the cumulative frequency column. If both positions correspond to the same group, we directly obtain their corresponding value in the first column (median). If each position corresponds to a distinct group, the median will be the average between the corresponding values defined in the first column.
For continuous variables grouped into classes, in which the data are presented in a frequency distribution table, we apply the following steps to calculate the median:
Step 1: Calculate the position of the median, not taking into consideration if n is an even or an odd number, through the following expression:
Pos(Md)=n/2
Step 2: Identify the class that contains the median (median class) from the cumulative frequency column.
Step 3: Calculate the median using the following expression:
$Md = LI_{Md} + \dfrac{\left(\dfrac{n}{2} - F_{ac(Md-1)}\right)}{F_{Md}} \times A_{Md}$
where:
LI_Md is the lower limit of the median class;
F_ac(Md−1) is the cumulative frequency of the class immediately before the median class;
F_Md is the absolute frequency of the median class;
A_Md is the range (width) of the median class.
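The three steps can be illustrated with a short sketch, assuming classes of constant width h (function name and argument layout are ours):

```python
def grouped_median(lower_limits, freqs, h):
    """Median of continuous data grouped into classes of constant
    width h. lower_limits[i] and freqs[i] are the lower limit and
    absolute frequency of class i; symbols follow the text
    (LI, Fac, F, A)."""
    n = sum(freqs)
    pos = n / 2                               # Step 1: Pos(Md) = n/2
    f_ac = 0                                  # cumulative frequency so far
    for li, f in zip(lower_limits, freqs):    # Step 2: find the median class
        if f_ac + f >= pos:
            # Step 3: Md = LI + ((n/2 - Fac) / F) * A
            return li + (pos - f_ac) / f * h
        f_ac += f
```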
The mode (Mo) of a data series corresponds to the observation that occurs with the highest frequency. The mode is the only measure of position that can also be used for qualitative variables, since these variables only allow us to calculate frequencies.
Consider a set of observations X1, X2, …, Xn of a certain variable. The mode is the value that appears with the highest frequency.
Excel gives us the mode of a set of data through the MODE function.
For discrete qualitative or quantitative data grouped in a frequency distribution table, the mode can be obtained directly from the table. It is the value with the highest absolute frequency.
For continuous data grouped into classes, there are several procedures to calculate the mode, such as, Czuber’s and King’s methods.
Czuber’s method has the following phases:
Step 1: Identify the class that has the mode (modal class), which is the one with the highest absolute frequency.
Step 2: Calculate the mode (Mo):
$Mo = LI_{Mo} + \dfrac{F_{Mo} - F_{Mo-1}}{2 \cdot F_{Mo} - (F_{Mo-1} + F_{Mo+1})} \times A_{Mo}$
where:
LI_Mo is the lower limit of the modal class;
F_Mo is the absolute frequency of the modal class;
F_Mo−1 and F_Mo+1 are the absolute frequencies of the classes immediately before and after the modal class;
A_Mo is the range (width) of the modal class.
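Czuber's two steps can be sketched as follows, assuming classes of constant width h and zero frequency outside the table (the function name is ours):

```python
def czuber_mode(lower_limits, freqs, h):
    """Czuber's mode for continuous data grouped into classes of
    constant width h (lower_limits/freqs as in the median case)."""
    i = max(range(len(freqs)), key=lambda j: freqs[j])  # Step 1: modal class
    f_mo = freqs[i]
    f_prev = freqs[i - 1] if i > 0 else 0               # F_(Mo-1)
    f_next = freqs[i + 1] if i < len(freqs) - 1 else 0  # F_(Mo+1)
    # Step 2: Mo = LI + (F_Mo - F_prev) / (2*F_Mo - (F_prev + F_next)) * A
    return lower_limits[i] + (f_mo - f_prev) / (2 * f_mo - (f_prev + f_next)) * h
```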
According to Bussab and Morettin (2011), only the use of measures of central tendency may not be suitable to represent a set of data, since they are also impacted by extreme values. Moreover, only with the use of these measures, it is not possible for the researcher to have a clear idea of the data dispersion and symmetry. As an alternative, we can use quantiles, such as, quartiles, deciles, and percentiles. The 2nd quartile (Q2), 5th decile (D5), or 50th percentile (P50) correspond to the median; therefore, they are measures of central tendency.
Quartiles (Qi, i = 1, 2, 3) are measures of position that divide a set of data into four parts with equal dimensions, sorted in ascending order.
Thus, the 1st Quartile (Q1 or the 25th percentile) indicates that 25% of the data are less than Q1, or that 75% of the data are greater than Q1.
The 2nd Quartile (Q2, or the 5th decile, or the 50th percentile) corresponds to the median, indicating that 50% of the data are less or greater than Q2.
The 3rd Quartile (Q3 or the 75th percentile) indicates that 75% of the data are less than Q3, or that 25% of the data are greater than Q3.
Deciles (Di, i = 1, 2, ..., 9) are measures of position that divide a set of data into 10 equal parts, sorted in ascending order.
Therefore, the 1st decile (D1 or 10th percentile) indicates that 10% of the data are less than D1 or that 90% of the data are greater than D1.
The 2nd decile (D2 or 20th percentile) indicates that 20% of the data are less than D2 or that 80% of the data are greater than D2.
And so on, and so forth, until the 9th decile (D9 or 90th percentile), indicating that 90% of the data are less than D9 or that 10% of the data are greater than D9.
Percentiles (Pi, i = 1, 2, ..., 99) are measures of position that divide a set of data, sorted in ascending order, into 100 equal parts.
Hence, the 1st percentile (P1) indicates that 1% of the data is less than P1 or that 99% of the data are greater than P1.
The 2nd percentile (P2) indicates that 2% of the data are less than P2 or that 98% of the data are greater than P2.
And so on, and so forth, until the 99th percentile (P99), which indicates that 99% of the data are less than P99 or that 1% of the data is greater than P99.
If the position of the quartile, decile, or percentile we are interested in is an integer or falls exactly between two positions, calculating it is straightforward. However, this is not always the case (imagine a sample with 33 elements in which the objective is to calculate the 67th percentile). Many methods have been proposed for this kind of calculation; they lead to close, but not identical, results.
We will present a simple and generic method that can be applied to calculate any quartile, decile, or percentile of order i, considering ungrouped discrete and continuous data:
Step 1: Sort the observations in ascending order.
Step 2: Determine the position of the quartile, decile, or percentile, of order i, we are interested in:
$\text{Quartile} \rightarrow Pos(Q_i) = \left(\dfrac{n}{4} \times i\right) + \dfrac{1}{2}, \quad i = 1, 2, 3$
$\text{Decile} \rightarrow Pos(D_i) = \left(\dfrac{n}{10} \times i\right) + \dfrac{1}{2}, \quad i = 1, 2, \ldots, 9$
$\text{Percentile} \rightarrow Pos(P_i) = \left(\dfrac{n}{100} \times i\right) + \dfrac{1}{2}, \quad i = 1, 2, \ldots, 99$
Step 3: Calculate the value of the quartile, decile, or percentile that corresponds to the respective position.
Assume that Pos(Q1) = 3.75, that is, the value of Q1 is between the 3rd and 4th positions (75% closer to the 4th position and 25% to the 3rd). Therefore, Q1 will be the value in the 3rd position multiplied by 0.25 plus the value in the 4th position multiplied by 0.75.
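Steps 1 to 3 can be combined into a single sketch. It assumes this chapter's position rule; other rules (and most statistical packages) may give slightly different results:

```python
def quantile(data, i, parts=4):
    """Quantile of order i for ungrouped data using the position
    rule Pos = (n/parts)*i + 1/2, with parts = 4, 10, or 100 for
    quartiles, deciles, and percentiles."""
    s = sorted(data)                          # Step 1: ascending order
    pos = (len(s) / parts) * i + 0.5          # Step 2: position
    lo = int(pos)                             # integer part of the position
    frac = pos - lo                           # fractional part
    if frac == 0 or lo >= len(s):
        return s[min(lo, len(s)) - 1]
    # Step 3: weight the two neighbouring values by the fractional part
    return s[lo - 1] * (1 - frac) + s[lo] * frac
```

With 13 sorted values 1..13, Pos(Q1) = 3.75, so Q1 = 3·0.25 + 4·0.75 = 3.75, reproducing the worked example above.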
Here, the calculation of quartiles, deciles, and percentiles is similar to the previous case. However, the data are grouped in a frequency distribution table.
In the frequency distribution table, the data must be sorted in ascending order, with their respective absolute and cumulative frequencies. First, we must determine the position of the quartile, decile, or percentile of order i we are interested in, through Expressions (3.13), (3.14), and (3.15), respectively. From the cumulative frequency column, we must verify the group(s) that contain(s) this position. If the position is an integer, its corresponding value is obtained directly from the first column. If the position is a fractional number, such as 2.5, and the 2nd and 3rd positions are in the same group, its respective value will also be obtained directly. On the other hand, if the position is a fractional number, such as 4.25, and positions 4 and 5 are in different groups, we must calculate the sum of the value that corresponds to the 4th position multiplied by 0.75 and the value that corresponds to the 5th position multiplied by 0.25 (similar to Case 1).
For continuous data grouped into classes in which data are represented in a frequency distribution table, we must apply the following steps to calculate the quartiles, deciles, and percentiles:
Step 1: Calculate the position of the quartile, decile, or percentile, of order i, we are interested in through the following expressions:
$\text{Quartile} \rightarrow Pos(Q_i) = \dfrac{n}{4} \times i, \quad i = 1, 2, 3$
$\text{Decile} \rightarrow Pos(D_i) = \dfrac{n}{10} \times i, \quad i = 1, 2, \ldots, 9$
$\text{Percentile} \rightarrow Pos(P_i) = \dfrac{n}{100} \times i, \quad i = 1, 2, \ldots, 99$
Step 2: Identify the class that contains the quartile, decile, or percentile, of order i, we are interested in (quartile class, decile class, or percentile class) from the cumulative frequency column.
Step 3: Calculate the quartile, decile, or percentile, of order i, we are interested in through the following expressions:
$\text{Quartile} \rightarrow Q_i = LL_{Q_i} + \left(\dfrac{Pos(Q_i) - F_{cum(Q_i-1)}}{F_{Q_i}}\right) \times R_{Q_i}, \quad i = 1, 2, 3$
where:
LL_Qi is the lower limit of the quartile class;
F_cum(Qi−1) is the cumulative frequency of the class immediately before the quartile class;
F_Qi is the absolute frequency of the quartile class;
R_Qi is the range (width) of the quartile class.
$\text{Decile} \rightarrow D_i = LL_{D_i} + \left(\dfrac{Pos(D_i) - F_{cum(D_i-1)}}{F_{D_i}}\right) \times R_{D_i}, \quad i = 1, 2, \ldots, 9$
where the symbols are analogous, now referring to the decile class.
$\text{Percentile} \rightarrow P_i = LL_{P_i} + \left(\dfrac{Pos(P_i) - F_{cum(P_i-1)}}{F_{P_i}}\right) \times R_{P_i}, \quad i = 1, 2, \ldots, 99$
where the symbols are analogous, now referring to the percentile class.
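Since the three expressions share the same structure, one sketch covers quartiles, deciles, and percentiles for data grouped into classes of constant width h (function name and arguments are ours):

```python
def grouped_quantile(lower_limits, freqs, h, i, parts=4):
    """Quantile of order i for continuous data grouped into classes
    of constant width h; parts = 4, 10, or 100 for quartiles,
    deciles, and percentiles. Symbols follow the text (LL, Fcum,
    F, R)."""
    n = sum(freqs)
    pos = (n / parts) * i                     # Step 1: Pos = (n/parts)*i
    f_cum = 0
    for ll, f in zip(lower_limits, freqs):    # Step 2: find the class
        if f_cum + f >= pos:
            # Step 3: LL + ((Pos - Fcum) / F) * R
            return ll + (pos - f_cum) / f * h
        f_cum += f
```

With i = 2 and parts = 4 this reproduces the grouped-data median, as expected, since Q2 = Md.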
A dataset can contain observations that are extremely distant from most observations or that are inconsistent. These observations are called outliers or atypical, discrepant, abnormal, or extreme values.
Before deciding what will be done with the outliers, we must know the causes that lead to such an occurrence. In many cases, these causes can determine the most suitable treatment for the respective outliers. The main causes are measurement mistakes, execution/implementation mistakes, and variability inherent to the population.
There are many outlier identification methods: boxplots, discordance models, Dixon’s test, Grubbs’ test, Z-scores, among others. In the Appendix of Chapter 11 (Cluster Analysis), a very efficient method for detecting multivariate outliers will be presented (BACON algorithm—Blocked Adaptive Computationally Efficient Outlier Nominators).
Outliers can be identified through boxplots (whose construction was studied in Section 3.3.2.5) using the interquartile range (IQR), which corresponds to the difference between the third and the first quartiles:

$IQR = Q_3 - Q_1$
Note that the IQR is the length of the box. Any value located more than 1.5 · IQR below $Q_1$ or above $Q_3$ is considered a mild outlier and is represented by a circle. Such values may even be accepted in the population, but with some suspicion. Thus, the value $X^{\circ}$ of a variable is considered a mild outlier when:

$X^{\circ} < Q_1 - 1.5 \cdot IQR$

$X^{\circ} > Q_3 + 1.5 \cdot IQR$

Similarly, any value located more than 3 · IQR below $Q_1$ or above $Q_3$ is considered an extreme outlier and is represented by an asterisk. Thus, the value $X^{*}$ of a variable is considered an extreme outlier when:

$X^{*} < Q_1 - 3 \cdot IQR$

$X^{*} > Q_3 + 3 \cdot IQR$
Fig. 3.15 illustrates the boxplot with the identification of outliers.
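The boxplot fences can be sketched as a small classifier. The function name and quartile values are hypothetical, chosen to exercise each fence:

```python
def classify(x, q1, q3):
    """Label a value using the 1.5*IQR (mild) and 3*IQR (extreme) fences."""
    iqr = q3 - q1
    if x < q1 - 3 * iqr or x > q3 + 3 * iqr:
        return "extreme"
    if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr:
        return "mild"
    return "typical"

# hypothetical quartiles: Q1 = 10, Q3 = 20, so IQR = 10
print(classify(45, 10, 20))   # mild    (above Q3 + 1.5*IQR = 35)
print(classify(55, 10, 20))   # extreme (above Q3 + 3*IQR   = 50)
print(classify(15, 10, 20))   # typical (inside the box)
```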
To study the behavior of a set of data, we use measures of central tendency and measures of dispersion, as well as the nature or shape of the data distribution. Measures of central tendency determine a value that represents the set of data, while measures of dispersion characterize the variability of the data.
The most common measures of dispersion are the range, average deviation, variance, standard deviation, standard error, and the coefficient of variation (CV).
The simplest measure of variability is the total range, or simply range (R), which represents the difference between the highest and lowest value of the set of data:
$R = X_{\max} - X_{\min}$
Deviation is the difference between each observed value and the mean of the variable. Thus, for population data, it is represented by $(X_i - \mu)$, and for sample data, by $(X_i - \bar{X})$. The modulus or absolute deviation ignores the ± sign and is denoted by $|X_i - \bar{X}|$.
Average deviation, or absolute average deviation, represents the arithmetic mean of absolute deviations.
The average deviation ($\bar{D}$) is the sum of the absolute deviations of all observations divided by the population size (N) or the sample size (n):

$\bar{D} = \dfrac{\sum_{i=1}^{N} |X_i - \mu|}{N}$ (for the population)

$\bar{D} = \dfrac{\sum_{i=1}^{n} |X_i - \bar{X}|}{n}$ (for samples)
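The sample version of the average deviation can be sketched directly from the definition. The function name and data are hypothetical:

```python
def mean_abs_deviation(x):
    """Arithmetic mean of the absolute deviations around the mean."""
    center = sum(x) / len(x)
    return sum(abs(v - center) for v in x) / len(x)

values = [1, 2, 3, 4, 5]            # mean = 3; deviations 2, 1, 0, 1, 2
print(mean_abs_deviation(values))   # 1.2
```

The population version has the same form; only the center ($\mu$ instead of $\bar{X}$) and the divisor ($N$ instead of $n$) change.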
For grouped data, presented in a frequency distribution table for m groups, the calculation of the average deviation is:
$\bar{D} = \dfrac{\sum_{i=1}^{m} |X_i - \mu| \cdot F_i}{N}$ (for the population) (3.30)

$\bar{D} = \dfrac{\sum_{i=1}^{m} |X_i - \bar{X}| \cdot F_i}{n}$ (for samples) (3.31)

bearing in mind that $\bar{X} = \dfrac{\sum_{i=1}^{m} X_i \cdot F_i}{n}$.
For continuous data grouped into classes, the calculation of the average deviation is:
$\bar{D} = \dfrac{\sum_{i=1}^{k} |X_i - \mu| \cdot F_i}{N}$ (for the population) (3.32)

$\bar{D} = \dfrac{\sum_{i=1}^{k} |X_i - \bar{X}| \cdot F_i}{n}$ (for samples) (3.33)

Note that Expressions (3.32) and (3.33) are similar to Expressions (3.30) and (3.31), respectively, except that, instead of m groups, we consider k classes. Moreover, $X_i$ represents the middle or central point of each class i, where $\bar{X} = \dfrac{\sum_{i=1}^{k} X_i \cdot F_i}{n}$, as presented in Expression (3.6).
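For grouped data, the frequency-weighted version can be sketched as follows. The function name and the small table of class midpoints are hypothetical:

```python
def grouped_mean_abs_dev(midpoints, freqs):
    """Frequency-weighted average deviation for grouped data
    (class midpoints for classes, observed values for groups)."""
    n = sum(freqs)
    xbar = sum(x * f for x, f in zip(midpoints, freqs)) / n
    return sum(abs(x - xbar) * f for x, f in zip(midpoints, freqs)) / n

# hypothetical table: class midpoints 5, 15, 25 with frequencies 2, 5, 3
# xbar = (10 + 75 + 75) / 10 = 16; weighted |deviations| = 22 + 5 + 27 = 54
print(grouped_mean_abs_dev([5, 15, 25], [2, 5, 3]))   # 5.4
```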
Variance is a measure of dispersion or variability that evaluates how much the data are dispersed in relation to the arithmetic mean. Thus, the higher the variance, the higher the data dispersion.
Instead of considering the mean of absolute deviations, as discussed in the previous section, it is more common to calculate the mean of squared deviations. This measure is known as variance:
$\sigma^2 = \dfrac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} = \dfrac{\sum_{i=1}^{N} X_i^2 - \dfrac{\left(\sum_{i=1}^{N} X_i\right)^2}{N}}{N}$ (for the population)

$S^2 = \dfrac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1} = \dfrac{\sum_{i=1}^{n} X_i^2 - \dfrac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}}{n-1}$ (for samples)
When the same set of n observations is treated first as a population and then as a sample, the relationship between the sample variance ($S^2$) and the population variance ($\sigma^2$) is given by:

$S^2 = \dfrac{n}{n-1} \cdot \sigma^2$
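The two divisors, and the relationship between them, can be checked numerically. Function names and data are hypothetical:

```python
def pop_variance(x):
    """Variance with divisor N (population formula)."""
    mu = sum(x) / len(x)
    return sum((v - mu) ** 2 for v in x) / len(x)

def sample_variance(x):
    """Variance with divisor n - 1 (sample formula)."""
    xbar = sum(x) / len(x)
    return sum((v - xbar) ** 2 for v in x) / (len(x) - 1)

data = [2, 4, 6, 8]           # mean 5; squared deviations 9, 1, 1, 9
n = len(data)
print(pop_variance(data))     # 5.0
print(sample_variance(data))  # 20/3
# the n/(n-1) relationship between the two divisors on the same data
assert abs(sample_variance(data) - n / (n - 1) * pop_variance(data)) < 1e-12
```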
For grouped data, represented in a frequency distribution table by m groups, the variance can be calculated as follows:
$\sigma^2 = \dfrac{\sum_{i=1}^{m} (X_i - \mu)^2 \cdot F_i}{N} = \dfrac{\sum_{i=1}^{m} X_i^2 \cdot F_i - \dfrac{\left(\sum_{i=1}^{m} X_i \cdot F_i\right)^2}{N}}{N}$ (for the population) (3.37)

$S^2 = \dfrac{\sum_{i=1}^{m} (X_i - \bar{X})^2 \cdot F_i}{n-1} = \dfrac{\sum_{i=1}^{m} X_i^2 \cdot F_i - \dfrac{\left(\sum_{i=1}^{m} X_i \cdot F_i\right)^2}{n}}{n-1}$ (for samples) (3.38)

where $\bar{X} = \dfrac{\sum_{i=1}^{m} X_i \cdot F_i}{n}$.
For continuous data grouped into classes, we calculate the variance as follows:
$\sigma^2 = \dfrac{\sum_{i=1}^{k} (X_i - \mu)^2 \cdot F_i}{N} = \dfrac{\sum_{i=1}^{k} X_i^2 \cdot F_i - \dfrac{\left(\sum_{i=1}^{k} X_i \cdot F_i\right)^2}{N}}{N}$ (for the population) (3.39)

$S^2 = \dfrac{\sum_{i=1}^{k} (X_i - \bar{X})^2 \cdot F_i}{n-1} = \dfrac{\sum_{i=1}^{k} X_i^2 \cdot F_i - \dfrac{\left(\sum_{i=1}^{k} X_i \cdot F_i\right)^2}{n}}{n-1}$ (for samples) (3.40)
Note that Expressions (3.39) and (3.40) are similar to Expressions (3.37) and (3.38), respectively, except that we consider k classes instead of m groups.
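The computational (right-hand) form of the grouped sample variance can be sketched as follows. The function name and frequency table are hypothetical:

```python
def grouped_sample_variance(midpoints, freqs):
    """Computational form: (sum Xi^2*Fi - (sum Xi*Fi)^2 / n) / (n - 1)."""
    n = sum(freqs)
    sum_xf = sum(x * f for x, f in zip(midpoints, freqs))
    sum_x2f = sum(x * x * f for x, f in zip(midpoints, freqs))
    return (sum_x2f - sum_xf ** 2 / n) / (n - 1)

# hypothetical table: midpoints 5, 15, 25, frequencies 2, 5, 3 (n = 10)
# sum Xi^2*Fi = 3050; (sum Xi*Fi)^2 / n = 160^2 / 10 = 2560 -> 490 / 9
print(grouped_sample_variance([5, 15, 25], [2, 5, 3]))
```

This avoids computing the mean explicitly, which is why the shortcut form is often preferred for hand or spreadsheet calculation.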
Since the variance considers the mean of squared deviations, its value tends to be very high and difficult to interpret. To solve this problem, we calculate the square root of the variance. This measure is known as the standard deviation. It is calculated as follows:
$\sigma = \sqrt{\sigma^2}$ (for the population)

$S = \sqrt{S^2}$ (for samples)
The standard error is the standard deviation of the mean. It is obtained by dividing the standard deviation by the square root of the population or sample size:
$\sigma_{\bar{X}} = \dfrac{\sigma}{\sqrt{N}}$ (for the population)

$S_{\bar{X}} = \dfrac{S}{\sqrt{n}}$ (for samples)
The higher the number of measurements, the better the determination of the average value will be (higher accuracy), due to the compensation of random errors.
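The sample standard error can be sketched directly from the two formulas above. The function name and data are hypothetical:

```python
import math

def standard_error(x):
    """Sample standard deviation divided by the square root of n."""
    n = len(x)
    xbar = sum(x) / n
    s = math.sqrt(sum((v - xbar) ** 2 for v in x) / (n - 1))
    return s / math.sqrt(n)

# S = sqrt(20/3), n = 4, so the standard error is sqrt(20/3) / 2
print(standard_error([2, 4, 6, 8]))
```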
The coefficient of variation (CV) is a relative measure of dispersion that provides the variation of the data in relation to the mean. The smaller the value, the more homogeneous the data will be, that is, the smaller the dispersion around the mean will be. It can be calculated as follows:
$CV = \dfrac{\sigma}{\mu} \times 100\ (\%)$ (for the population)

$CV = \dfrac{S}{\bar{X}} \times 100\ (\%)$ (for samples)
A CV can be considered low, indicating a set of data that is reasonably homogeneous, when it is less than 30%. If this value is greater than 30%, the set of data can be considered heterogeneous. However, this standard varies according to the application.
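The sample CV and the 30% rule of thumb can be sketched as follows. The function name and data are hypothetical:

```python
import math

def coefficient_of_variation(x):
    """Sample CV in percent: S / X-bar * 100."""
    n = len(x)
    xbar = sum(x) / n
    s = math.sqrt(sum((v - xbar) ** 2 for v in x) / (n - 1))
    return s / xbar * 100

data = [2, 4, 6, 8]   # mean 5, S = sqrt(20/3)
cv = coefficient_of_variation(data)
print(round(cv, 1))                              # about 51.6
print("heterogeneous" if cv > 30 else "homogeneous")
```

Note that the CV is only meaningful when the mean is not close to zero, since it appears in the denominator.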