This chapter discusses the main concepts of univariate descriptive statistics. Through tables, charts, and/or summary measures, it is possible to describe the behavior of each type of variable. Frequency distribution tables represent the frequencies with which a set of data occurs. Charts can be used to represent the distribution of the variable. Summary measures are subdivided into measures of position or location (central tendency and quantiles), measures of dispersion or variability, and measures of shape (skewness and kurtosis). Measures of position can be used to represent a dataset, while measures of dispersion quantify its variability. Measures of skewness and kurtosis, in turn, characterize the shape of the distribution of the sampled population elements around the mean. Finally, tables, charts, graphs, and summary measures are studied using Excel, IBM SPSS Statistics Software®, and Stata Statistical Software®.
Univariate descriptive statistics; Frequency distribution tables; Charts; Summary measures; Measures of position or location (central tendency and quantiles); Measures of dispersion or variability; Measures of shape (skewness and kurtosis)
Mathematics is the alphabet with which God has written the Universe.
Galileo Galilei
Descriptive statistics describes and summarizes the main characteristics observed in a dataset through tables, charts, graphs, and summary measures, allowing the researcher to have a better understanding of the data behavior. The analysis is based on the dataset being studied (sample), without drawing any conclusions or inferences from the population.
Researchers can use descriptive statistics to study a single variable (univariate descriptive statistics), two variables (bivariate descriptive statistics), or more than two variables (multivariate descriptive statistics). In this chapter, we will study the concepts of descriptive statistics involving a single variable.
Univariate descriptive statistics considers the following topics: (a) the frequency in which a set of data occurs through frequency distribution tables; (b) the representation of the variable’s distribution through charts; and (c) measures that represent a data series, such as measures of position or location, measures of dispersion or variability, and measures of shape (skewness and kurtosis).
The four main goals of this chapter are: (1) to introduce the most common concepts related to the tables, charts, and summary measures in univariate descriptive statistics, (2) to present its applications in real examples, (3) to construct tables, charts, and summary measures using Excel and the statistical software SPSS and Stata, and (4) to discuss the results achieved.
As described in the previous chapter, before we begin using descriptive statistics, it is necessary to identify the type of variable being studied. The type of variable is essential when calculating descriptive statistics and in the graphical representation of the results. Fig. 3.1 shows the univariate descriptive statistics that will be studied in this chapter, represented by tables, charts, graphs, and summary measures, for each type of variable. Fig. 3.1 summarizes the following information:
Frequency distribution tables can be used to represent the frequency in which a set of data with qualitative or quantitative variables occurs.
In the case of qualitative variables, the table represents the frequency in which each variable category happens. For discrete quantitative variables, the frequency of occurrences is calculated for each discrete value of the variable. On the other hand, continuous variable data are first grouped into classes and, afterwards, we calculate the frequencies in which each class occurs.
A frequency distribution table contains the following calculations:
Through a practical example, we will build the frequency distribution table using the calculations of the absolute frequency, relative frequency, cumulative frequency, and relative cumulative frequency for each category of the qualitative variable being analyzed.
Through the frequency distribution table, we can calculate the absolute frequency, the relative frequency, the cumulative frequency, and the relative cumulative frequency for each possible value of the discrete variable.
Unlike the table for qualitative variables, the table here lists the possible numeric values instead of the possible categories. To facilitate understanding, the data must be presented in ascending order.
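As an illustration, the four frequency columns can be computed in a short Python sketch (the function name and dictionary keys here are ours, not part of the chapter's Excel/SPSS/Stata examples):

```python
from collections import Counter

def frequency_table(data):
    """Absolute, relative, cumulative, and relative cumulative
    frequencies for each distinct value, in ascending order."""
    n = len(data)
    counts = Counter(data)
    table = []
    cum = 0
    for value in sorted(counts):
        fi = counts[value]          # absolute frequency
        cum += fi                   # cumulative frequency
        table.append({
            "value": value,
            "Fi": fi,
            "Fri": fi / n,          # relative frequency
            "Fac": cum,             # cumulative frequency
            "Frac": cum / n,        # relative cumulative frequency
        })
    return table

# Example: number of defects found in 10 inspected parts
for row in frequency_table([0, 1, 1, 2, 0, 1, 3, 1, 2, 0]):
    print(row)
```

The same function works for qualitative variables, since only counting is involved; the sorted categories then play the role of the numeric values.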
As described in Chapter 2, continuous quantitative variables are those whose possible values are in an interval of real numbers. Therefore, it makes no sense to calculate the frequency for each possible value, since they rarely repeat themselves. It is better to group the data into classes or ranges.
The interval to be defined between the classes is arbitrary. However, we must be careful: if the number of classes is too small, a lot of information can be lost; on the other hand, if the number of classes is too large, the summary of information is compromised (Bussab and Morettin, 2011). The interval between the classes does not need to be constant but, in order to keep things simple, we will assume the same interval.
The following steps must be taken to build a frequency distribution table for continuous data:
Step 1: Sort the data in ascending order.
Step 2: Determine the number of classes (k), using one of the options:
where n is the sample size.
The value of k must be an integer.
Step 3: Determine the interval between the classes (h), calculated as the range of the sample (A = maximum value − minimum value) divided by the number of classes:
h=A/k
The value of h is rounded up to the nearest integer.
Step 4: Build the frequency distribution table (calculate the absolute frequency, the relative frequency, the cumulative frequency, and the relative cumulative frequency) for each class.
The lowest limit of the first class corresponds to the minimum value of the sample. To determine the highest limit of each class, we must add the value of h to the lowest limit of the respective class. The lowest limit of the new class corresponds to the highest limit of the previous class.
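Steps 1 to 4 can be sketched in Python. The sketch assumes Sturges' rule, k = 1 + 3.322·log₁₀(n), as the choice in Step 2; the function name and tuple layout are ours:

```python
import math

def class_frequency_table(data):
    """Frequency distribution for continuous data grouped into k
    classes of constant width h (Steps 1-4 of the text). Sturges'
    rule is assumed for k; other choices of k work the same way."""
    data = sorted(data)                       # Step 1: ascending order
    n = len(data)
    k = round(1 + 3.322 * math.log10(n))      # Step 2: number of classes
    a = data[-1] - data[0]                    # range A = max - min
    h = math.ceil(a / k)                      # Step 3: round h up
    table, cum, lower = [], 0, data[0]
    for j in range(k):                        # Step 4: tabulate each class
        upper = lower + h
        if j == k - 1:
            fi = sum(x >= lower for x in data)         # last class keeps the max
        else:
            fi = sum(lower <= x < upper for x in data) # upper limit excluded
        cum += fi
        # (lower limit, upper limit, Fi, Fri, cumulative, rel. cumulative)
        table.append((lower, upper, fi, fi / n, cum, cum / n))
        lower = upper
    return table
```

Note how each new lower limit is the previous class's upper limit, exactly as described above.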
The behavior of qualitative and quantitative variable data can also be represented in a graphical way. Charts are a representation of numeric data, in the form of geometric figures (graphs, diagrams, drawings, or images), allowing the reader to interpret these data quickly and objectively.
In Section 3.3.1, the main graphical representations for qualitative variables are illustrated: bar charts (horizontal and vertical), pie charts, and a Pareto chart.
The graphical representation of quantitative variables is usually illustrated by line graphs, dot plots, histograms, stem-and-leaf plots, and boxplots (or box-and-whisker diagrams), as shown in Section 3.3.2.
Bar charts (horizontal and vertical), pie charts, a Pareto chart, line graphs, dot plots, and histograms will be generated in Excel. The boxplots and histograms will be constructed by using SPSS and Stata.
To build a chart in Excel, first, variables’ data and names must be standardized, codified, and selected in a spreadsheet. The next step consists in clicking on the Insert tab and, in the group Charts, selecting the type of chart we are interested in using (Columns, Rows, Pie, Bar, Area, Scatter, or Other Charts). The chart will be generated automatically on the screen, and it can be personalized according to the preferences of the researcher.
Excel offers a variety of chart styles, layouts, and formats. To use them, the researcher just needs to select the plotted chart and click on the Design, Layout, or Format tab. On the Layout tab, for example, many resources are available, such as Chart Title; Axis Titles (shows the names of the horizontal and vertical axes); Legend (shows or hides the legend); Data Labels (allows the researcher to insert the series name, the category name, or the values of the labels in the desired place); Data Table (shows the data table below the chart, with or without legend codes); Axes (allows the researcher to personalize the scale of the horizontal and vertical axes); and Gridlines (shows or hides horizontal and vertical gridlines). The Chart Title, Axis Titles, Legend, Data Labels, and Data Table icons are in the Labels group, while the Axes and Gridlines icons are in the Axes group.
This type of chart is widely used for nominal and ordinal qualitative variables, but it can also be used for discrete quantitative variables, because it allows us to investigate the presence of data trends.
As its name indicates, through bars, this chart represents the absolute or relative frequencies of each possible category (or numeric value) of a qualitative variable (or quantitative). In vertical bar charts, each variable category is shown on the X-axis as a bar with constant width, and the height of the respective bar indicates the frequency of the category on the Y-axis. Conversely, in horizontal bar charts, each variable category is shown on the Y-axis as a bar of constant height, and the length of the respective bar indicates the frequency of the category on the X-axis.
Let’s now build horizontal and vertical bar charts from a practical example.
Another way to represent qualitative data, in terms of relative frequencies (percentages), is through pie charts. The chart corresponds to a circle with an arbitrary radius (the whole) divided into sectors or slices of several different sizes (parts of the whole).
This chart allows the researcher to visualize the data as slices of a pie or parts of a whole. Let’s now build the pie chart from a practical example.
The Pareto chart is a quality control tool whose main objective is to investigate the types of problems and, consequently, to identify their respective causes, so that action can be taken to reduce or eliminate them.
The Pareto chart is a chart that contains bars and a line graph. The bars represent the absolute frequencies of occurrences of problems and the lines represent the relative cumulative frequencies. The problems are sorted in descending order of priority. Let’s now illustrate a practical example with a Pareto chart.
In a line graph, points are represented by the intersection of the variables involved on the horizontal axis (X) and on the vertical axis (Y), and they are connected by straight lines.
Despite considering two axes, line graphs will be used in this chapter to represent the behavior of a single variable. The graph shows the evolution or trend of a quantitative variable’s data, which is usually continuous, at regular intervals. The numeric variable values are represented on the Y-axis, and the X-axis only shows the data distribution in a uniform way. Let’s now illustrate a practical example of a line graph.
A scatter plot is very similar to a line graph. The biggest difference between them is in the way the data are plotted on the horizontal axis.
Similar to a line graph, here the points are also represented by the intersection of the variables along the X-axis and the vertical axis. However, they are not connected by straight lines.
The scatter plot studied in this chapter is used to show the evolution or trend of a single quantitative variable’s data, similar to the line graph; however, at irregular intervals (in general). Analogous to a line graph, the numeric variable values are represented on the Y-axis and the X-axis only represents the data behavior throughout time.
In the next chapter, we will see how a scatter plot can be used to describe the behavior of two variables simultaneously (bivariate analysis). The numeric values of one variable will be represented on the Y-axis and the other one on the X-axis.
A histogram is a vertical bar chart that represents the frequency distribution of one quantitative variable (discrete or continuous). The variable values being studied are presented on the X-axis (the base of each bar, with a constant width, represents each possible value of the discrete variable or each class of continuous values, sorted in ascending order). On the other hand, the height of the bars on the Y-axis represents the frequency distribution (absolute, relative, or cumulative) of the respective variable values.
A histogram is very similar to a Pareto chart. It is also one of the seven quality tools. A Pareto chart represents the frequency distribution of a qualitative variable (types of problem), whose categories represented on the X-axis are sorted in order of priority (from the category with the highest frequency to the one with the lowest). A histogram represents the frequency distribution of a quantitative variable, whose values represented on the X-axis are sorted in ascending order.
Therefore, the first step to elaborate a histogram is building the frequency distribution table. As presented in Sections 3.2.2 and 3.2.3, for each possible value of a discrete variable or for a class with continuous data, we calculate the absolute frequency, the relative frequency, the cumulative frequency, and the relative cumulative frequency. The data must be sorted in ascending order.
The histogram is then constructed from this table. The first column of the frequency distribution table, which represents the numeric values or the classes with the values of the variable being studied, will be presented on the X-axis, and the column of absolute frequency (or relative frequency, cumulative frequency, or relative cumulative frequency) will be presented on the Y-axis.
Many pieces of statistical software generate the histogram automatically, from the original values of the quantitative variable being studied, without having to calculate the frequencies. Even though Excel has the option of building a histogram from analysis tools, we will show how to build it from the column chart, due to its simplicity.
As mentioned, many statistical computer packages, including SPSS and Stata, build the histogram automatically from the original data of the variable being studied (in this example, using the data in Table 3.E.14), without having to calculate the frequencies. Moreover, these packages have the option of plotting the normal curve.
Fig. 3.9 shows the histogram generated in SPSS (with the option of a normal curve) using the data in Table 3.E.14. Sections 3.6 and 3.7 will show in detail how it can be constructed using SPSS and Stata software, respectively.
Note that the values of the discrete variable are presented in the middle of the base.
For continuous variables, consider the data in Table 3.E.5 (Example 3.3), regarding the grades of the students enrolled in the subject Financial Market. These data were sorted in ascending order, as presented in Table 3.E.6.
Fig. 3.10 shows the histogram generated using SPSS software (with the option of a normal curve) using the data in Table 3.E.5 or Table 3.E.6.
Note that the data were grouped considering an interval between classes of h = 0.5, differently from Example 3.3, which considered h = 1. The classes' lower limits are represented on the left side of the base of the bar, and the upper limits (not included in the class) on the right side. The height of the bar represents the total frequency in the class. For example, the first bar represents the 3.5 ├ 4.0 class, and there are three values in this interval (3.5, 3.8, and 3.9).
Both bar charts and histograms represent the shape of the variable’s frequency distribution. The stem-and-leaf plot is an alternative to represent the frequency distributions of discrete and continuous quantitative variables with few observations, with the advantage of maintaining the original value of each observation (it allows the visualization of all data information).
In the plot, the representation of each observation is divided into two parts, separated by a vertical line: the stem is located on the left of the vertical line and represents the observation's first digit(s); the leaf is located on the right of the vertical line and represents the observation's last digit(s). Choosing the number of initial digits that will form the stem (or, equivalently, the number of complementary digits that will form the leaf) is arbitrary. The stems usually contain the most significant digits, and the leaves the least significant.
The stems are represented in a single column and their different values throughout many lines. For each stem represented on the left-hand side of the vertical line, we have the respective leaves shown on the right-hand side throughout many columns. Stems as well as leaves must be sorted in ascending order. In the cases in which there are too many leaves per stem, we can have more than one line with the same stem. Choosing the number of lines is arbitrary, as is defining the interval or the number of classes in a frequency distribution.
To build a stem-and-leaf plot, we can follow the sequence of steps:
Step 1: Sort the data in ascending order, to make the visualization of the data easier.
Step 2: Define the number of initial digits that will form the stem, or the number of complementary digits that will form the leaf.
Step 3: Elaborate the stems, represented in a single column on the left of the vertical line. Their different values are represented throughout many lines, in ascending order. When the number of leaves by stem is very high, we can define two or more lines for the same stem.
Step 4: Place the leaves that correspond to the respective stems, on the right-hand side of the vertical line, throughout many columns (in ascending order).
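The four steps can be sketched in a few lines of Python. The sketch assumes integer observations and one leaf digit; the function name is illustrative:

```python
def stem_and_leaf(data, leaf_digits=1):
    """Stem-and-leaf plot: the last `leaf_digits` digit(s) of each
    observation form the leaf, the remaining digits the stem."""
    div = 10 ** leaf_digits
    stems = {}
    for x in sorted(data):                  # Step 1: ascending order
        stem, leaf = divmod(x, div)         # Steps 2-3: split the digits
        stems.setdefault(stem, []).append(leaf)
    lines = []
    for stem in sorted(stems):              # Step 4: leaves next to each stem
        leaves = " ".join(str(l) for l in stems[stem])
        lines.append(f"{stem} | {leaves}")
    return lines

print("\n".join(stem_and_leaf([12, 15, 21, 21, 27, 33, 34, 38])))
```

Because sorting happens before the split, leaves appear in ascending order within each stem, preserving every original observation.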
The boxplot (or box-and-whisker diagram) is a graphical representation of five measures of position or location of a certain variable: minimum value, first quartile (Q1), second quartile (Q2) or median (Md), third quartile (Q3), and maximum value. In a sorted sample, the median corresponds to the central position, and the quartiles subdivide the sample into four equal parts, each one containing 25% of the data.
Thus, the first quartile (Q1) describes the first 25% of the data (organized in ascending order). The second quartile corresponds to the median (50% of the sorted data are located below it and the remaining 50% above it), and the third quartile (Q3) corresponds to 75% of the observations. The dispersion measure resulting from these location measures is called the interquartile range (IQR) or interquartile interval (IQI) and corresponds to the difference between Q3 and Q1.
This plot allows us to assess the data symmetry and distribution. It also gives us a visual perspective of whether or not there are discrepant data (univariate outliers), since these data are above the upper and lower limits. A representation of the diagram can be seen in Fig. 3.14.
Calculating the median, the first, and third quartiles, and investigating the existence of univariate outliers will be discussed in Sections 3.4.1.1, 3.4.1.2, and 3.4.1.3, respectively. In Sections 3.6.3 and 3.7, we will study how to generate the box-and-whisker diagram on SPSS and Stata, respectively, using a practical example.
Information found in a dataset can be summarized through suitable numerical measures, called summary measures.
In univariate descriptive statistics, the most common summary measures have as their main objective to represent the behavior of the variable being studied through its central and noncentral values, its dispersions, or the way its values are distributed around the mean.
The summary measures that will be studied in this chapter are measures of position or location (measures of central tendency and quantiles), measures of dispersion or variability, and measures of shape, such as, skewness and kurtosis.
These measures are calculated for metric or quantitative variables. The only exception is the mode, which is a measure of central tendency that provides the most frequent value of a certain variable, so, it can also be calculated for nonmetric or qualitative variables.
These measures provide values that characterize the behavior of a data series, indicating the data position or location in relation to the axis of the values assumed by the variable or characteristic being studied.
The measures of position or location are subdivided into measures of central tendency (mean, median, and mode) and quantiles (quartiles, deciles, and percentiles).
The most common measures of central tendency are the arithmetic mean, the median, and the mode.
The arithmetic mean can be a representative measure of a population with N elements, represented by the Greek letter μ, or a representative measure of a sample with n elements, represented by $\bar{X}$.
Simple arithmetic mean, or simply mean, or average, is the sum of all the values of a certain variable (discrete or continuous) divided by the total number of observations. Thus, the sample arithmetic mean of a certain variable X ($\bar{X}$) is:
$\bar{X} = \dfrac{\sum_{i=1}^{n} X_i}{n}$
where n is the total number of observations in the dataset and Xi, for i = 1, …, n, represents each one of variable X’s values.
When calculating the simple arithmetic mean, all of the occurrences have the same importance or weight. When we are interested in assigning different weights (pi) to each value i of variable X, we use the weighted arithmetic mean:
$\bar{X} = \dfrac{\sum_{i=1}^{n} X_i \cdot p_i}{\sum_{i=1}^{n} p_i}$
If the weight is expressed in percentages (relative weight - rw), Expression (3.2) becomes:
$\bar{X} = \sum_{i=1}^{n} X_i \cdot rw_i$
When the discrete values of Xi repeat themselves, the data are grouped into a frequency table. To calculate the arithmetic mean, we have to use the same criterion as for the weighted mean. However, the weight for each Xi will be represented by absolute frequencies (Fi) and, instead of n observations with n different values, we will have n observations with m different values (grouped data):
$\bar{X} = \dfrac{\sum_{i=1}^{m} X_i \cdot F_i}{\sum_{i=1}^{m} F_i} = \dfrac{\sum_{i=1}^{m} X_i \cdot F_i}{n}$
If the frequency of the data is expressed in terms of the percentage relative to the absolute frequency (relative frequency—Fr), Expression (3.4) becomes:
$\bar{X} = \sum_{i=1}^{m} X_i \cdot Fr_i$
To calculate the simple arithmetic mean, the weighted arithmetic mean, and the arithmetic mean of grouped discrete data, Xi represents each i value of variable X.
For continuous data grouped into classes, each class does not have a single value defined, but a set of values. In order for the arithmetic mean to be calculated in this case, we assume that Xi is the middle or central point of class i (i = 1,…,k), so, Expressions (3.4) and (3.5) are rewritten due to the number of classes (k):
$\bar{X} = \dfrac{\sum_{i=1}^{k} X_i \cdot F_i}{\sum_{i=1}^{k} F_i} = \dfrac{\sum_{i=1}^{k} X_i \cdot F_i}{n}$
$\bar{X} = \sum_{i=1}^{k} X_i \cdot Fr_i$
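The different forms of the arithmetic mean collapse into two small functions, since the grouped-data mean is just a weighted mean with frequencies as weights (a minimal sketch; the names are ours):

```python
def simple_mean(x):
    """Simple arithmetic mean: sum of the values divided by n."""
    return sum(x) / len(x)

def weighted_mean(x, w):
    """Weighted arithmetic mean with weights w. Passing absolute
    frequencies Fi as w gives the mean of grouped discrete data."""
    return sum(xi * wi for xi, wi in zip(x, w)) / sum(w)

# Mean of continuous data grouped into classes: use the class
# midpoints as Xi and the class frequencies as weights.
midpoints = [1.0, 3.0, 5.0]   # classes [0,2), [2,4), [4,6)
freqs = [2, 5, 3]
grouped = weighted_mean(midpoints, freqs)
```

For relative weights or relative frequencies that already sum to 1, the denominator of `weighted_mean` equals 1 and the expression reduces to the Σ Xi·rwi form above.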
The median (Md) is a measure of location. It locates the center of the distribution of a set of data sorted in ascending order. Its value separates the series into two equal parts: 50% of the elements are less than or equal to the median, and the other 50% are greater than or equal to it.
The median of variable X (discrete or continuous) can be calculated as follows:
$Md(X) = \begin{cases} \dfrac{X_{\left(\frac{n}{2}\right)} + X_{\left(\frac{n}{2}\right)+1}}{2}, & \text{if } n \text{ is an even number} \\ X_{\left(\frac{n+1}{2}\right)}, & \text{if } n \text{ is an odd number} \end{cases}$
where n is the total number of observations and X1 ≤ … ≤ Xn, considering that X1 is the smallest observation or the value of the first element, and that Xn is the highest observation or the value of the last element.
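The two cases of the expression translate directly into code (a minimal sketch, with 0-based indexing handling the position arithmetic):

```python
def median(x):
    """Median of ungrouped data: the middle element if n is odd,
    the average of the two middle elements if n is even."""
    s = sorted(x)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                   # X_((n+1)/2) in 1-based terms
    return (s[mid - 1] + s[mid]) / 2    # (X_(n/2) + X_((n/2)+1)) / 2
```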
Here, the calculation of the median is similar to the previous case. However, the data are grouped in a frequency distribution table.
Analogous to Case 1, if n is an odd number, the position of the central element will be (n + 1)/2. We can see in the cumulative frequency column the group that has this position and, consequently, its corresponding value in the first column (median).
If n is an even number, we verify the group(s) that contain(s) the central positions n/2 and (n/2) + 1 in the cumulative frequency column. If both positions correspond to the same group, we directly obtain their corresponding value in the first column (median). If each position corresponds to a distinct group, the median will be the average between the corresponding values defined in the first column.
For continuous variables grouped into classes, in which the data are presented in a frequency distribution table, we apply the following steps to calculate the median:
Step 1: Calculate the position of the median, not taking into consideration if n is an even or an odd number, through the following expression:
Pos(Md)=n/2
Step 2: Identify the class that contains the median (median class) from the cumulative frequency column.
Step 3: Calculate the median using the following expression:
$Md = LI_{Md} + \dfrac{\left(\dfrac{n}{2} - F_{ac(Md-1)}\right)}{F_{Md}} \times A_{Md}$
where:
LI_Md is the lower limit of the median class;
F_ac(Md−1) is the cumulative frequency of the class immediately before the median class;
F_Md is the absolute frequency of the median class;
A_Md is the range (width) of the median class.
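The three steps can be illustrated with a short sketch, assuming classes of constant width h (function name and argument layout are ours):

```python
def grouped_median(lower_limits, freqs, h):
    """Median of continuous data grouped into classes of constant
    width h. lower_limits[i] and freqs[i] are the lower limit and
    absolute frequency of class i; symbols follow the text
    (LI, Fac, F, A)."""
    n = sum(freqs)
    pos = n / 2                               # Step 1: Pos(Md) = n/2
    f_ac = 0                                  # cumulative frequency so far
    for li, f in zip(lower_limits, freqs):    # Step 2: find the median class
        if f_ac + f >= pos:
            # Step 3: Md = LI + ((n/2 - Fac) / F) * A
            return li + (pos - f_ac) / f * h
        f_ac += f
```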
The mode (Mo) of a data series corresponds to the observation that occurs with the highest frequency. The mode is the only measure of position that can also be used for qualitative variables, since these variables only allow us to calculate frequencies.
Consider a set of observations X1, X2, …, Xn of a certain variable. The mode is the value that appears with the highest frequency.
Excel gives us the mode of a set of data through the MODE function.
For discrete qualitative or quantitative data grouped in a frequency distribution table, the mode can be obtained directly from the table. It is the value with the highest absolute frequency.
For continuous data grouped into classes, there are several procedures to calculate the mode, such as, Czuber’s and King’s methods.
Czuber’s method has the following phases:
Step 1: Identify the class that has the mode (modal class), which is the one with the highest absolute frequency.
Step 2: Calculate the mode (Mo):
$Mo = LI_{Mo} + \dfrac{F_{Mo} - F_{Mo-1}}{2 \cdot F_{Mo} - (F_{Mo-1} + F_{Mo+1})} \times A_{Mo}$
where:
LI_Mo is the lower limit of the modal class;
F_Mo is the absolute frequency of the modal class;
F_Mo−1 and F_Mo+1 are the absolute frequencies of the classes immediately before and after the modal class;
A_Mo is the range (width) of the modal class.
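Czuber's two steps can be sketched as follows, assuming classes of constant width h and zero frequency outside the table (the function name is ours):

```python
def czuber_mode(lower_limits, freqs, h):
    """Czuber's mode for continuous data grouped into classes of
    constant width h (lower_limits/freqs as in the median case)."""
    i = max(range(len(freqs)), key=lambda j: freqs[j])  # Step 1: modal class
    f_mo = freqs[i]
    f_prev = freqs[i - 1] if i > 0 else 0               # F_(Mo-1)
    f_next = freqs[i + 1] if i < len(freqs) - 1 else 0  # F_(Mo+1)
    # Step 2: Mo = LI + (F_Mo - F_prev) / (2*F_Mo - (F_prev + F_next)) * A
    return lower_limits[i] + (f_mo - f_prev) / (2 * f_mo - (f_prev + f_next)) * h
```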
According to Bussab and Morettin (2011), only the use of measures of central tendency may not be suitable to represent a set of data, since they are also impacted by extreme values. Moreover, only with the use of these measures, it is not possible for the researcher to have a clear idea of the data dispersion and symmetry. As an alternative, we can use quantiles, such as, quartiles, deciles, and percentiles. The 2nd quartile (Q2), 5th decile (D5), or 50th percentile (P50) correspond to the median; therefore, they are measures of central tendency.
Quartiles (Qi, i = 1, 2, 3) are measures of position that divide a set of data into four parts with equal dimensions, sorted in ascending order.
Thus, the 1st Quartile (Q1 or the 25th percentile) indicates that 25% of the data are less than Q1, or that 75% of the data are greater than Q1.
The 2nd Quartile (Q2, or the 5th decile, or the 50th percentile) corresponds to the median, indicating that 50% of the data are less or greater than Q2.
The 3rd Quartile (Q3 or the 75th percentile) indicates that 75% of the data are less than Q3, or that 25% of the data are greater than Q3.
Deciles (Di, i = 1, 2, ..., 9) are measures of position that divide a set of data into 10 equal parts, sorted in ascending order.
Therefore, the 1st decile (D1 or 10th percentile) indicates that 10% of the data are less than D1 or that 90% of the data are greater than D1.
The 2nd decile (D2 or 20th percentile) indicates that 20% of the data are less than D2 or that 80% of the data are greater than D2.
And so on, and so forth, until the 9th decile (D9 or 90th percentile), indicating that 90% of the data are less than D9 or that 10% of the data are greater than D9.
Percentiles (Pi, i = 1, 2, ..., 99) are measures of position that divide a set of data, sorted in ascending order, into 100 equal parts.
Hence, the 1st percentile (P1) indicates that 1% of the data is less than P1 or that 99% of the data are greater than P1.
The 2nd percentile (P2) indicates that 2% of the data are less than P2 or that 98% of the data are greater than P2.
And so on, and so forth, until the 99th percentile (P99), which indicates that 99% of the data are less than P99 or that 1% of the data is greater than P99.
If the position of the quartile, decile, or percentile we are interested in is an integer or falls exactly between two positions, calculating it is straightforward. However, this is not always the case (imagine a sample with 33 elements in which the objective is to calculate the 67th percentile). Many methods have been proposed for this kind of calculation; they lead to close, but not identical, results.
We will present a simple and generic method that can be applied to calculate any quartile, decile, or percentile of order i, considering ungrouped discrete and continuous data:
Step 1: Sort the observations in ascending order.
Step 2: Determine the position of the quartile, decile, or percentile, of order i, we are interested in:
$\text{Quartile} \rightarrow Pos(Q_i) = \left(\dfrac{n}{4} \times i\right) + \dfrac{1}{2}, \quad i = 1, 2, 3$
$\text{Decile} \rightarrow Pos(D_i) = \left(\dfrac{n}{10} \times i\right) + \dfrac{1}{2}, \quad i = 1, 2, \ldots, 9$
$\text{Percentile} \rightarrow Pos(P_i) = \left(\dfrac{n}{100} \times i\right) + \dfrac{1}{2}, \quad i = 1, 2, \ldots, 99$
Step 3: Calculate the value of the quartile, decile, or percentile that corresponds to the respective position.
Assume that Pos(Q1) = 3.75, that is, the value of Q1 is between the 3rd and 4th positions (75% closer to the 4th position and 25% to the 3rd). Therefore, Q1 will be the value in the 3rd position multiplied by 0.25 plus the value in the 4th position multiplied by 0.75.
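Steps 1 to 3 can be combined into a single sketch. It assumes this chapter's position rule; other rules (and most statistical packages) may give slightly different results:

```python
def quantile(data, i, parts=4):
    """Quantile of order i for ungrouped data using the position
    rule Pos = (n/parts)*i + 1/2, with parts = 4, 10, or 100 for
    quartiles, deciles, and percentiles."""
    s = sorted(data)                          # Step 1: ascending order
    pos = (len(s) / parts) * i + 0.5          # Step 2: position
    lo = int(pos)                             # integer part of the position
    frac = pos - lo                           # fractional part
    if frac == 0 or lo >= len(s):
        return s[min(lo, len(s)) - 1]
    # Step 3: weight the two neighbouring values by the fractional part
    return s[lo - 1] * (1 - frac) + s[lo] * frac
```

With 13 sorted values 1..13, Pos(Q1) = 3.75, so Q1 = 3·0.25 + 4·0.75 = 3.75, reproducing the worked example above.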
Here, the calculation of quartiles, deciles, and percentiles is similar to the previous case. However, the data are grouped in a frequency distribution table.
In the frequency distribution table, the data must be sorted in ascending order, with their respective absolute and cumulative frequencies. First, we must determine the position of the quartile, decile, or percentile of order i we are interested in, through Expressions (3.13), (3.14), and (3.15), respectively. From the cumulative frequency column, we must verify the group(s) that contain(s) this position. If the position is an integer, its corresponding value is obtained directly from the first column. If the position is a fractional number, such as 2.5, and the 2nd and 3rd positions are in the same group, its respective value will also be obtained directly. On the other hand, if the position is a fractional number, such as 4.25, and positions 4 and 5 are in different groups, we must calculate the sum of the value that corresponds to the 4th position multiplied by 0.75 and the value that corresponds to the 5th position multiplied by 0.25 (similar to Case 1).
For continuous data grouped into classes in which data are represented in a frequency distribution table, we must apply the following steps to calculate the quartiles, deciles, and percentiles:
Step 1: Calculate the position of the quartile, decile, or percentile, of order i, we are interested in through the following expressions:
$\text{Quartile} \rightarrow Pos(Q_i) = \dfrac{n}{4} \times i, \quad i = 1, 2, 3$
$\text{Decile} \rightarrow Pos(D_i) = \dfrac{n}{10} \times i, \quad i = 1, 2, \ldots, 9$
$\text{Percentile} \rightarrow Pos(P_i) = \dfrac{n}{100} \times i, \quad i = 1, 2, \ldots, 99$
Step 2: Identify the class that contains the quartile, decile, or percentile, of order i, we are interested in (quartile class, decile class, or percentile class) from the cumulative frequency column.
Step 3: Calculate the quartile, decile, or percentile, of order i, we are interested in through the following expressions:
$\text{Quartile} \rightarrow Q_i = LL_{Q_i} + \left(\dfrac{Pos(Q_i) - F_{cum(Q_i-1)}}{F_{Q_i}}\right) \times R_{Q_i}, \quad i = 1, 2, 3$
where:
LL_Qi is the lower limit of the quartile class;
F_cum(Qi−1) is the cumulative frequency of the class immediately before the quartile class;
F_Qi is the absolute frequency of the quartile class;
R_Qi is the range (width) of the quartile class.
$\text{Decile} \rightarrow D_i = LL_{D_i} + \left(\dfrac{Pos(D_i) - F_{cum(D_i-1)}}{F_{D_i}}\right) \times R_{D_i}, \quad i = 1, 2, \ldots, 9$
where the symbols are analogous, now referring to the decile class.
$\text{Percentile} \rightarrow P_i = LL_{P_i} + \left(\dfrac{Pos(P_i) - F_{cum(P_i-1)}}{F_{P_i}}\right) \times R_{P_i}, \quad i = 1, 2, \ldots, 99$
where the symbols are analogous, now referring to the percentile class.
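Since the three expressions share the same structure, one sketch covers quartiles, deciles, and percentiles for data grouped into classes of constant width h (function name and arguments are ours):

```python
def grouped_quantile(lower_limits, freqs, h, i, parts=4):
    """Quantile of order i for continuous data grouped into classes
    of constant width h; parts = 4, 10, or 100 for quartiles,
    deciles, and percentiles. Symbols follow the text (LL, Fcum,
    F, R)."""
    n = sum(freqs)
    pos = (n / parts) * i                     # Step 1: Pos = (n/parts)*i
    f_cum = 0
    for ll, f in zip(lower_limits, freqs):    # Step 2: find the class
        if f_cum + f >= pos:
            # Step 3: LL + ((Pos - Fcum) / F) * R
            return ll + (pos - f_cum) / f * h
        f_cum += f
```

With i = 2 and parts = 4 this reproduces the grouped-data median, as expected, since Q2 = Md.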
A dataset can contain observations that are extremely distant from most observations or that are inconsistent. These observations are called outliers or atypical, discrepant, abnormal, or extreme values.
Before deciding what will be done with the outliers, we must know the causes that lead to such an occurrence. In many cases, these causes can determine the most suitable treatment for the respective outliers. The main causes are measurement mistakes, execution/implementation mistakes, and variability inherent to the population.
There are many outlier identification methods: boxplots, discordance models, Dixon’s test, Grubbs’ test, Z-scores, among others. In the Appendix of Chapter 11 (Cluster Analysis), a very efficient method for detecting multivariate outliers will be presented (BACON algorithm—Blocked Adaptive Computationally Efficient Outlier Nominators).
Outliers can be identified through boxplots (whose construction was studied in Section 3.3.2.5) using the interquartile range (IQR), which corresponds to the difference between the third and the first quartiles:

$IQR = Q_3 - Q_1$
Note that the IQR is the length of the box. Any value located more than 1.5 · IQR below $Q_1$ or above $Q_3$ is considered a mild outlier and is represented by a circle. Such values may even be accepted in the population, but with some suspicion. Thus, the value $X^{\circ}$ of a variable is considered a mild outlier when:

$X^{\circ} < Q_1 - 1.5 \cdot IQR$

$X^{\circ} > Q_3 + 1.5 \cdot IQR$

Similarly, any value located more than 3 · IQR below $Q_1$ or above $Q_3$ is considered an extreme outlier and is represented by an asterisk. Thus, the value $X^{*}$ of a variable is considered an extreme outlier when:

$X^{*} < Q_1 - 3 \cdot IQR$

$X^{*} > Q_3 + 3 \cdot IQR$
Fig. 3.15 illustrates the boxplot with the identification of outliers.
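The boxplot fences can be sketched as a small classifier. The function name and quartile values are hypothetical, chosen to exercise each fence:

```python
def classify(x, q1, q3):
    """Label a value using the 1.5*IQR (mild) and 3*IQR (extreme) fences."""
    iqr = q3 - q1
    if x < q1 - 3 * iqr or x > q3 + 3 * iqr:
        return "extreme"
    if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr:
        return "mild"
    return "typical"

# hypothetical quartiles: Q1 = 10, Q3 = 20, so IQR = 10
print(classify(45, 10, 20))   # mild    (above Q3 + 1.5*IQR = 35)
print(classify(55, 10, 20))   # extreme (above Q3 + 3*IQR   = 50)
print(classify(15, 10, 20))   # typical (inside the box)
```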
To study the behavior of a set of data, we use measures of central tendency and measures of dispersion, as well as the nature or shape of the data distribution. Measures of central tendency determine a value that represents the set of data, while measures of dispersion characterize the variability of the data.
The most common measures of dispersion are the range, average deviation, variance, standard deviation, standard error, and the coefficient of variation (CV).
The simplest measure of variability is the total range, or simply range (R), which represents the difference between the highest and lowest value of the set of data:
$R = X_{\max} - X_{\min}$
Deviation is the difference between each observed value and the mean of the variable. Thus, for population data, it is represented by $(X_i - \mu)$, and for sample data, by $(X_i - \bar{X})$. The modulus or absolute deviation ignores the ± sign and is denoted by $|X_i - \bar{X}|$.
Average deviation, or absolute average deviation, represents the arithmetic mean of absolute deviations.
The average deviation ($\bar{D}$) is the sum of the absolute deviations of all observations divided by the population size (N) or the sample size (n):

$\bar{D} = \dfrac{\sum_{i=1}^{N} |X_i - \mu|}{N}$ (for the population)

$\bar{D} = \dfrac{\sum_{i=1}^{n} |X_i - \bar{X}|}{n}$ (for samples)
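The sample version of the average deviation can be sketched directly from the definition. The function name and data are hypothetical:

```python
def mean_abs_deviation(x):
    """Arithmetic mean of the absolute deviations around the mean."""
    center = sum(x) / len(x)
    return sum(abs(v - center) for v in x) / len(x)

values = [1, 2, 3, 4, 5]            # mean = 3; deviations 2, 1, 0, 1, 2
print(mean_abs_deviation(values))   # 1.2
```

The population version has the same form; only the center ($\mu$ instead of $\bar{X}$) and the divisor ($N$ instead of $n$) change.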
For grouped data, presented in a frequency distribution table for m groups, the calculation of the average deviation is:
$\bar{D} = \dfrac{\sum_{i=1}^{m} |X_i - \mu| \cdot F_i}{N}$ (for the population) (3.30)

$\bar{D} = \dfrac{\sum_{i=1}^{m} |X_i - \bar{X}| \cdot F_i}{n}$ (for samples) (3.31)

bearing in mind that $\bar{X} = \dfrac{\sum_{i=1}^{m} X_i \cdot F_i}{n}$.
For continuous data grouped into classes, the calculation of the average deviation is:
$\bar{D} = \dfrac{\sum_{i=1}^{k} |X_i - \mu| \cdot F_i}{N}$ (for the population) (3.32)

$\bar{D} = \dfrac{\sum_{i=1}^{k} |X_i - \bar{X}| \cdot F_i}{n}$ (for samples) (3.33)

Note that Expressions (3.32) and (3.33) are similar to Expressions (3.30) and (3.31), respectively, except that, instead of m groups, we consider k classes. Moreover, $X_i$ represents the middle or central point of each class i, where $\bar{X} = \dfrac{\sum_{i=1}^{k} X_i \cdot F_i}{n}$, as presented in Expression (3.6).
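For grouped data, the frequency-weighted version can be sketched as follows. The function name and the small table of class midpoints are hypothetical:

```python
def grouped_mean_abs_dev(midpoints, freqs):
    """Frequency-weighted average deviation for grouped data
    (class midpoints for classes, observed values for groups)."""
    n = sum(freqs)
    xbar = sum(x * f for x, f in zip(midpoints, freqs)) / n
    return sum(abs(x - xbar) * f for x, f in zip(midpoints, freqs)) / n

# hypothetical table: class midpoints 5, 15, 25 with frequencies 2, 5, 3
# xbar = (10 + 75 + 75) / 10 = 16; weighted |deviations| = 22 + 5 + 27 = 54
print(grouped_mean_abs_dev([5, 15, 25], [2, 5, 3]))   # 5.4
```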
Variance is a measure of dispersion or variability that evaluates how much the data are dispersed in relation to the arithmetic mean. Thus, the higher the variance, the higher the data dispersion.
Instead of considering the mean of absolute deviations, as discussed in the previous section, it is more common to calculate the mean of squared deviations. This measure is known as variance:
$\sigma^2 = \dfrac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} = \dfrac{\sum_{i=1}^{N} X_i^2 - \dfrac{\left(\sum_{i=1}^{N} X_i\right)^2}{N}}{N}$ (for the population)

$S^2 = \dfrac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1} = \dfrac{\sum_{i=1}^{n} X_i^2 - \dfrac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}}{n-1}$ (for samples)
When the same set of n observations is treated first as a population and then as a sample, the relationship between the sample variance ($S^2$) and the population variance ($\sigma^2$) is given by:

$S^2 = \dfrac{n}{n-1} \cdot \sigma^2$
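The two divisors, and the relationship between them, can be checked numerically. Function names and data are hypothetical:

```python
def pop_variance(x):
    """Variance with divisor N (population formula)."""
    mu = sum(x) / len(x)
    return sum((v - mu) ** 2 for v in x) / len(x)

def sample_variance(x):
    """Variance with divisor n - 1 (sample formula)."""
    xbar = sum(x) / len(x)
    return sum((v - xbar) ** 2 for v in x) / (len(x) - 1)

data = [2, 4, 6, 8]           # mean 5; squared deviations 9, 1, 1, 9
n = len(data)
print(pop_variance(data))     # 5.0
print(sample_variance(data))  # 20/3
# the n/(n-1) relationship between the two divisors on the same data
assert abs(sample_variance(data) - n / (n - 1) * pop_variance(data)) < 1e-12
```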
For grouped data, represented in a frequency distribution table by m groups, the variance can be calculated as follows:
$\sigma^2 = \dfrac{\sum_{i=1}^{m} (X_i - \mu)^2 \cdot F_i}{N} = \dfrac{\sum_{i=1}^{m} X_i^2 \cdot F_i - \dfrac{\left(\sum_{i=1}^{m} X_i \cdot F_i\right)^2}{N}}{N}$ (for the population) (3.37)

$S^2 = \dfrac{\sum_{i=1}^{m} (X_i - \bar{X})^2 \cdot F_i}{n-1} = \dfrac{\sum_{i=1}^{m} X_i^2 \cdot F_i - \dfrac{\left(\sum_{i=1}^{m} X_i \cdot F_i\right)^2}{n}}{n-1}$ (for samples) (3.38)

where $\bar{X} = \dfrac{\sum_{i=1}^{m} X_i \cdot F_i}{n}$.
For continuous data grouped into classes, we calculate the variance as follows:
$\sigma^2 = \dfrac{\sum_{i=1}^{k} (X_i - \mu)^2 \cdot F_i}{N} = \dfrac{\sum_{i=1}^{k} X_i^2 \cdot F_i - \dfrac{\left(\sum_{i=1}^{k} X_i \cdot F_i\right)^2}{N}}{N}$ (for the population) (3.39)

$S^2 = \dfrac{\sum_{i=1}^{k} (X_i - \bar{X})^2 \cdot F_i}{n-1} = \dfrac{\sum_{i=1}^{k} X_i^2 \cdot F_i - \dfrac{\left(\sum_{i=1}^{k} X_i \cdot F_i\right)^2}{n}}{n-1}$ (for samples) (3.40)
Note that Expressions (3.39) and (3.40) are similar to Expressions (3.37) and (3.38), respectively, except that we consider k classes instead of m groups.
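The computational (right-hand) form of the grouped sample variance can be sketched as follows. The function name and frequency table are hypothetical:

```python
def grouped_sample_variance(midpoints, freqs):
    """Computational form: (sum Xi^2*Fi - (sum Xi*Fi)^2 / n) / (n - 1)."""
    n = sum(freqs)
    sum_xf = sum(x * f for x, f in zip(midpoints, freqs))
    sum_x2f = sum(x * x * f for x, f in zip(midpoints, freqs))
    return (sum_x2f - sum_xf ** 2 / n) / (n - 1)

# hypothetical table: midpoints 5, 15, 25, frequencies 2, 5, 3 (n = 10)
# sum Xi^2*Fi = 3050; (sum Xi*Fi)^2 / n = 160^2 / 10 = 2560 -> 490 / 9
print(grouped_sample_variance([5, 15, 25], [2, 5, 3]))
```

This avoids computing the mean explicitly, which is why the shortcut form is often preferred for hand or spreadsheet calculation.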
Since the variance considers the mean of squared deviations, its value tends to be very high and difficult to interpret. To solve this problem, we calculate the square root of the variance. This measure is known as the standard deviation. It is calculated as follows:
$\sigma = \sqrt{\sigma^2}$ (for the population)

$S = \sqrt{S^2}$ (for samples)
The standard error is the standard deviation of the mean. It is obtained by dividing the standard deviation by the square root of the population or sample size:
$\sigma_{\bar{X}} = \dfrac{\sigma}{\sqrt{N}}$ (for the population)

$S_{\bar{X}} = \dfrac{S}{\sqrt{n}}$ (for samples)
The higher the number of measurements, the better the determination of the average value will be (higher accuracy), due to the compensation of random errors.
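The sample standard error can be sketched directly from the two formulas above. The function name and data are hypothetical:

```python
import math

def standard_error(x):
    """Sample standard deviation divided by the square root of n."""
    n = len(x)
    xbar = sum(x) / n
    s = math.sqrt(sum((v - xbar) ** 2 for v in x) / (n - 1))
    return s / math.sqrt(n)

# S = sqrt(20/3), n = 4, so the standard error is sqrt(20/3) / 2
print(standard_error([2, 4, 6, 8]))
```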
The coefficient of variation (CV) is a relative measure of dispersion that provides the variation of the data in relation to the mean. The smaller the value, the more homogeneous the data will be, that is, the smaller the dispersion around the mean will be. It can be calculated as follows:
$CV = \dfrac{\sigma}{\mu} \times 100\ (\%)$ (for the population)

$CV = \dfrac{S}{\bar{X}} \times 100\ (\%)$ (for samples)
A CV can be considered low, indicating a set of data that is reasonably homogeneous, when it is less than 30%. If this value is greater than 30%, the set of data can be considered heterogeneous. However, this standard varies according to the application.
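The sample CV and the 30% rule of thumb can be sketched as follows. The function name and data are hypothetical:

```python
import math

def coefficient_of_variation(x):
    """Sample CV in percent: S / X-bar * 100."""
    n = len(x)
    xbar = sum(x) / n
    s = math.sqrt(sum((v - xbar) ** 2 for v in x) / (n - 1))
    return s / xbar * 100

data = [2, 4, 6, 8]   # mean 5, S = sqrt(20/3)
cv = coefficient_of_variation(data)
print(round(cv, 1))                              # about 51.6
print("heterogeneous" if cv > 30 else "homogeneous")
```

Note that the CV is only meaningful when the mean is not close to zero, since it appears in the denominator.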