Measures of asymmetry (skewness) and kurtosis characterize the shape of the distribution of the population elements sampled around the mean (Maroco, 2014).
Measures of skewness describe the shape of a frequency distribution curve. For a symmetrical curve or frequency distribution, the mean, the mode, and the median are the same. For an asymmetrical curve, the mean gets farther away from the mode, and the median is located in an intermediary position. Fig. 3.16 shows a symmetrical distribution.
On the other hand, if the frequency distribution is more concentrated on the left side, that is, the tail on the right is longer than the tail on the left, we have a positively skewed (right-skewed) distribution, as shown in Fig. 3.17. In this case, the mean is greater than the median, and the latter is greater than the mode ($Mo < Md < \bar{X}$).
Conversely, if the frequency distribution is more concentrated on the right side, that is, the tail on the left is longer than the tail on the right, we have a negatively skewed (left-skewed) distribution, as shown in Fig. 3.18. In this case, the mean is less than the median, and the latter is less than the mode ($\bar{X} < Md < Mo$).
Pearson’s first coefficient of skewness (Sk1) is a measure of skewness given by the difference between the mean ($\bar{X}$) and the mode ($Mo$), divided by a measure of dispersion, the standard deviation ($S$):
$$Sk_1 = \frac{\bar{X} - Mo}{S}$$
which has the following interpretation:
If Sk1 = 0, the distribution is symmetrical;
If Sk1 > 0, the distribution is positively skewed (to the right);
If Sk1 < 0, the distribution is negatively skewed (to the left).
To avoid using the mode to calculate the skewness, we can adopt the empirical relationship between the mean, the median, and the mode, $\bar{X} - Mo = 3(\bar{X} - Md)$, which leads to Pearson’s second coefficient of skewness (Sk2):
$$Sk_2 = \frac{3(\bar{X} - Md)}{S}$$
In the same way, we have:
If Sk2 = 0, the distribution is symmetrical;
If Sk2 > 0, the distribution is positively skewed (to the right);
If Sk2 < 0, the distribution is negatively skewed (to the left).
Pearson’s first and second coefficients of skewness allow us to compare two or more distributions and to evaluate which one is more asymmetrical. Their absolute value indicates the intensity of the skewness: the higher the absolute value of Pearson’s coefficient of skewness, the more asymmetrical the curve. Thus:
If 0 < | Sk | < 0.15, the skewness is weak;
If 0.15 ≤ | Sk | ≤ 1, the skewness is moderate;
If | Sk | > 1, the skewness is strong.
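The two Pearson coefficients and the intensity rule above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the sample values are hypothetical, and we assume that S is the sample standard deviation (denominator n − 1). Sk1 is returned only when the sample has a single mode.

```python
from statistics import mean, median, multimode, stdev


def pearson_skewness(data):
    """Return (Sk1, Sk2); Sk1 is None when the sample has no single mode."""
    x_bar = mean(data)
    s = stdev(data)  # assumption: sample standard deviation (n - 1 denominator)
    modes = multimode(data)
    sk1 = (x_bar - modes[0]) / s if len(modes) == 1 else None
    sk2 = 3 * (x_bar - median(data)) / s
    return sk1, sk2


def skewness_intensity(sk):
    """Classify |Sk| using the thresholds listed above."""
    magnitude = abs(sk)
    if magnitude == 0:
        return "symmetrical"
    if magnitude < 0.15:
        return "weak"
    if magnitude <= 1:
        return "moderate"
    return "strong"


sample = [1, 2, 2, 3, 4, 5, 9]  # hypothetical data, not taken from the text
sk1, sk2 = pearson_skewness(sample)
print(sk1, skewness_intensity(sk1))
print(sk2, skewness_intensity(sk2))
```

For this hypothetical sample both coefficients are positive, which agrees with its longer right tail.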
Another measure of skewness is Bowley’s coefficient of skewness (SkB), also known as the quartile coefficient of skewness, which is calculated from the first quartile ($Q_1$), the third quartile ($Q_3$), and the median ($Md$):
$$Sk_B = \frac{Q_3 + Q_1 - 2 \cdot Md}{Q_3 - Q_1}$$
In the same way, we have:
If SkB = 0, the distribution is symmetrical;
If SkB > 0, the distribution is positively skewed (to the right);
If SkB < 0, the distribution is negatively skewed (to the left).
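As a quick check of Bowley's coefficient, the sketch below takes the quartiles as inputs instead of computing them, since the result depends on the quartile method adopted. The function name is hypothetical; the values Q1 = 18.15, Md = 18.5, and Q3 = 18.8 are the Tukey's Hinges results reported for Example 3.42 later in this chapter.

```python
def bowley_skewness(q1, md, q3):
    """Quartile coefficient of skewness: (Q3 + Q1 - 2*Md) / (Q3 - Q1)."""
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("Q3 and Q1 coincide; the coefficient is undefined")
    return (q3 + q1 - 2 * md) / iqr


# Quartiles reported for Example 3.42 (Tukey's Hinges)
sk_b = bowley_skewness(18.15, 18.5, 18.8)
print(round(sk_b, 4))
```

The result is slightly negative (about −0.08), suggesting a weak skew to the left in the central half of the data.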
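The last measure of skewness we will study is Fisher’s coefficient of skewness (g1), calculated from the third moment around the mean (M3), as presented in Maroco (2014):
$$g_1 = \frac{n^2 \cdot M_3}{(n-1) \cdot (n-2) \cdot S^3}$$
where:
$$M_3 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^3}{n}$$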
which is interpreted the same way as the other coefficients of skewness, that is:
If g1 = 0, the distribution is symmetrical;
If g1 > 0, the distribution is positively skewed (to the right);
If g1 < 0, the distribution is negatively skewed (to the left).
Fisher’s coefficient of skewness can be calculated in Excel using the SKEW function (see Example 3.42) or the Analysis ToolPak add-in (Section 3.5). Its calculation in SPSS will be presented in Section 3.6.
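The coefficient of skewness in Stata is calculated from the second and third moments around the mean, as presented by Cox (2010):
$$Sk = \frac{M_3}{M_2^{3/2}}$$
where:
$$M_2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n}, \qquad M_3 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^3}{n}$$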
which is interpreted the same way as the other coefficients of skewness, that is:
If Sk = 0, the distribution is symmetrical;
If Sk > 0, the distribution is positively skewed (to the right);
If Sk < 0, the distribution is negatively skewed (to the left).
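To compare the two moment-based coefficients, the following sketch computes Fisher's g1 and the moment ratio Sk from the same sample. The function names and data are hypothetical, and we assume S² = n·M2/(n − 1), that is, the sample variance. Under that assumption the two measures differ only by a factor that depends on n: g1 = Sk·√(n(n − 1))/(n − 2).

```python
def central_moment(data, k):
    """k-th moment around the mean, with n in the denominator."""
    n = len(data)
    x_bar = sum(data) / n
    return sum((x - x_bar) ** k for x in data) / n


def fisher_g1(data):
    """Fisher's coefficient: n^2 * M3 / ((n - 1) * (n - 2) * S^3)."""
    n = len(data)
    if n < 3:
        raise ValueError("g1 requires at least three observations")
    m3 = central_moment(data, 3)
    s = (central_moment(data, 2) * n / (n - 1)) ** 0.5  # sample std. deviation
    return n ** 2 * m3 / ((n - 1) * (n - 2) * s ** 3)


def moment_skewness(data):
    """Moment ratio: M3 / M2^(3/2)."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


sample = [1, 2, 2, 3, 4, 5, 9]  # hypothetical data, not taken from the text
g1 = fisher_g1(sample)
sk = moment_skewness(sample)
print(g1, sk)
```

Both values are positive for this right-skewed sample, and g1 is larger in absolute value because of the small-sample factor.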
In addition to measures of skewness, measures of kurtosis can also be used to characterize the shape of the distribution of the variable being studied.
Kurtosis can be defined as the flatness level of a frequency distribution (height of the peak of the curve) in relation to a theoretical distribution that usually corresponds to the normal distribution.
When the shape of the distribution is not very flat, nor very long, similar to a normal curve, it is called mesokurtic, as we can see in Fig. 3.19.
In contrast, when the distribution shows a frequency curve that is flatter than a normal curve, it is called platykurtic, as shown in Fig. 3.20.
Finally, when the distribution presents a frequency curve that is longer (more peaked) than a normal curve, it is called leptokurtic, as shown in Fig. 3.21.
One of the most common coefficients to measure the flatness level or kurtosis of a distribution is the percentile coefficient of kurtosis, or simply coefficient of kurtosis (k). It is calculated from the interquartile range ($Q_3 - Q_1$) and the 10th and 90th percentiles ($P_{10}$ and $P_{90}$):
$$k = \frac{Q_3 - Q_1}{2 \cdot (P_{90} - P_{10})}$$
which has the following interpretation:
If k = 0.263, we say that the curve is mesokurtic;
If k > 0.263, we say that the curve is platykurtic;
If k < 0.263, we say that the curve is leptokurtic.
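The reference value 0.263 can be checked against the theoretical quantiles of the normal distribution. The sketch below is only an illustration: the function name is hypothetical, and the quantiles come from Python's standard-library `NormalDist` instead of sample data.

```python
from statistics import NormalDist


def percentile_kurtosis(q1, q3, p10, p90):
    """Percentile coefficient of kurtosis: (Q3 - Q1) / (2 * (P90 - P10))."""
    return (q3 - q1) / (2 * (p90 - p10))


z = NormalDist()  # standard normal distribution
k_normal = percentile_kurtosis(
    z.inv_cdf(0.25), z.inv_cdf(0.75), z.inv_cdf(0.10), z.inv_cdf(0.90)
)
print(round(k_normal, 3))
```

Since k depends only on differences between quantiles, the same value is obtained for any mean and standard deviation of the normal curve.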
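Another very common measure to determine the flatness level or kurtosis of a distribution is Fisher’s coefficient of kurtosis (g2). It is calculated using the fourth moment around the mean (M4), as presented in Maroco (2014):
$$g_2 = \frac{n^2 \cdot (n+1) \cdot M_4}{(n-1) \cdot (n-2) \cdot (n-3) \cdot S^4} - \frac{3 \cdot (n-1)^2}{(n-2) \cdot (n-3)}$$
where:
$$M_4 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^4}{n}$$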
which has the following interpretation:
If g2 = 0, the curve has a normal distribution (mesokurtic);
If g2 < 0, the curve is very flat (platykurtic);
If g2 > 0, the curve is very long (leptokurtic).
Many statistical software packages, among them SPSS, use Fisher’s coefficient of kurtosis to calculate the flatness level or kurtosis (Section 3.6). In Excel, the KURT function calculates Fisher's coefficient of kurtosis (Example 3.42), and it can also be obtained through the Analysis ToolPak add-in (Section 3.5).
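The coefficient of kurtosis in Stata is calculated from the second and fourth moments around the mean, $M_2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2$ and $M_4 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^4$, as presented by Bock (1975) and Cox (2010):
$$k_S = \frac{M_4}{M_2^2}$$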
which has the following interpretation:
If kS = 3, the curve has a normal distribution (mesokurtic);
If kS < 3, the curve is very flat (platykurtic);
If kS > 3, the curve is very long (leptokurtic).
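The difference between the two reference values (0 for g2 and 3 for kS) can be seen by computing both coefficients for the same sample. The sketch below uses hypothetical function names and data, and again assumes that S² is the sample variance, n·M2/(n − 1).

```python
def central_moment(data, k):
    """k-th moment around the mean, with n in the denominator."""
    n = len(data)
    x_bar = sum(data) / n
    return sum((x - x_bar) ** k for x in data) / n


def fisher_g2(data):
    """Fisher's coefficient of kurtosis (mesokurtic reference value: 0)."""
    n = len(data)
    if n < 4:
        raise ValueError("g2 requires at least four observations")
    m4 = central_moment(data, 4)
    s2 = central_moment(data, 2) * n / (n - 1)  # sample variance
    first = n ** 2 * (n + 1) * m4 / ((n - 1) * (n - 2) * (n - 3) * s2 ** 2)
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return first - correction


def moment_kurtosis(data):
    """Moment ratio M4 / M2^2 (mesokurtic reference value: 3)."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2


sample = [1, 2, 3, 4, 5]  # hypothetical, evenly spread data
g2 = fisher_g2(sample)
k_s = moment_kurtosis(sample)
print(g2, k_s)
```

For these evenly spread values, g2 is negative and kS is below 3, so both coefficients classify the sample as platykurtic.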
Section 3.3.1 showed the graphical representation of qualitative variables through bar charts (horizontal and vertical), pie charts, and the Pareto chart. We demonstrated how each one of these charts can be obtained using Excel. In turn, Section 3.3.2 showed the graphical representation of quantitative variables through line graphs, scatter plots, and histograms, among others. Likewise, we presented how most of them can be obtained using Excel.
Section 3.4 presented the main summary measures, including measures of central tendency (mean, mode, and median), quantiles (quartiles, deciles, and percentiles), measures of dispersion or variability (range, average deviation, variance, standard deviation, standard error, and coefficient of variation), and measures of shape, such as skewness and kurtosis. We also presented how they can be calculated using Excel functions, whenever such functions are available.
This section discusses how to obtain descriptive statistics (such as the mean, standard error, median, mode, standard deviation, variance, kurtosis, and skewness, among others) through the Analysis ToolPak add-in in Excel.
In order to do that, let’s consider the problem presented in Example 3.42, whose data are available in Excel in the file Stock_Market.xls, presented in cells A1:A20, as shown in Fig. 3.22.
To load the Analysis ToolPak add-in in Excel, we must first click on the File tab and on Options, as shown in Fig. 3.23.
Now, the Excel Options dialog box will open, as shown in Fig. 3.24. From this box, we must select the option Add-ins. In Add-ins, we must select the option Analysis ToolPak and click on Go.
Then, the Add-ins dialog box will appear, as shown in Fig. 3.25. Among the add-ins available, we must select the option Analysis ToolPak and click on OK.
The option Data Analysis then becomes available on the Data tab, inside the Analysis group, as shown in Fig. 3.26.
Fig. 3.27 shows the Data Analysis dialog box. Note that several analysis tools are available. Let’s select the option Descriptive Statistics and click on OK.
From the Descriptive Statistics dialog box (Fig. 3.28), we must select the Input Range (A1:A20) and, as Output options, let’s select Summary statistics. The results can be presented in a new worksheet or in a new workbook. Finally, let’s click on OK.
The descriptive statistics generated can be seen in Fig. 3.29 and include measures of central tendency (mean, mode, and median), measures of dispersion or variability (variance, standard deviation, and standard error), and measures of shape (skewness and kurtosis). The range can be calculated from the difference between the sample’s maximum and minimum values. As mentioned in Sections 3.4.3.1 and 3.4.3.2, the measure of skewness calculated by Excel (using the SKEW function or the Descriptive Statistics tool in Fig. 3.28) corresponds to Fisher’s coefficient of skewness (g1), and the measure of kurtosis (using the KURT function or the Descriptive Statistics tool in Fig. 3.28) corresponds to Fisher’s coefficient of kurtosis (g2).
Using a practical example, this section presents how to obtain the main univariate descriptive statistics studied in this chapter with IBM SPSS Statistics Software. These include frequency distribution tables, charts (histograms, stem-and-leaf plots, boxplots, bar charts, and pie charts), measures of central tendency (mean, mode, and median), quantiles (quartiles and percentiles), measures of dispersion or variability (range, variance, standard deviation, and standard error, among others), and measures of shape (skewness and kurtosis). The use of the images in this section has been authorized by the International Business Machines Corporation©.
The data presented in Example 3.42 are used as input in SPSS and are available in the file Stock_Market.sav, as shown in Fig. 3.30.
To obtain such descriptive statistics, we must click on Analyze → Descriptive Statistics. After that, three options can be used: Frequencies, Descriptive, and Explore.
The Frequencies option can be used for qualitative and quantitative variables, and it provides frequency distribution tables, as well as measures of central tendency (mean, median, and mode), quantiles (quartiles and percentiles), measures of dispersion or variability (range, variance, standard deviation, and standard error, among others), and measures of skewness and kurtosis. The Frequencies option also plots bar charts, pie charts, or histograms (with or without a normal curve). To use it, click on Analyze → Descriptive Statistics and select Frequencies..., as shown in Fig. 3.31.
The Frequencies dialog box will then open. The variable being studied (Stock price, called Price) must be selected in Variable(s), and the Display frequency tables option must be activated so that the frequency distribution table can be shown (Fig. 3.32).
The following step consists of clicking on Statistics... to select the summary measures that interest us (Fig. 3.33).
Among the quantiles, let’s select the option Quartiles (which calculates the first and third quartiles, in addition to the median). To get the percentile of order i (i = 1, 2, ..., 99), we must select the option Percentile(s) and add the order desired. In this case, we chose to calculate the percentiles of order 10 and 60. The measures of central tendency that we have to select are the mean, median, and mode. As measures of dispersion, let’s select Std. deviation (standard deviation), Variance, Range, and S.E. mean (standard error). Finally, let’s select both measures of shape of a distribution: Skewness and Kurtosis. To go back to the Frequencies dialog box, we must click on Continue.
Next, let’s click on Charts... and select the chart that interests us. As options, we have Bar charts, Pie charts, or Histograms. Let’s select the last one, with the option of plotting a normal curve (Fig. 3.34). Bar or pie charts can be shown in terms of absolute frequencies (Frequencies) or relative frequencies (Percentages). To go back to the Frequencies dialog box once again, we must click on Continue.
Finally, click on OK. Fig. 3.35 shows the calculations of the summary measures selected in Fig. 3.33.
As studied in Sections 3.4.3.1 and 3.4.3.2, the measure of skewness calculated by SPSS corresponds to Fisher’s coefficient of skewness (g1), and the measure of kurtosis corresponds to Fisher’s coefficient of kurtosis (g2).
Also in Fig. 3.35, note that the percentiles of order 25, 50, and 75 that correspond to the first quartile, median, and third quartile, respectively, were calculated automatically. The method used to calculate the percentiles was the Weighted Average.
The frequency distribution table can be seen in Fig. 3.36.
The first column represents the absolute frequency of each element (Fi), the second and third columns represent the relative frequency of each element (Fri—%), and the last column represents the relative cumulative frequency (Frac—%).
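The columns of such a frequency distribution table can be reproduced with a short sketch. The function name and the data below are hypothetical (they are not the stock prices in Example 3.42); each row holds the value, its absolute frequency (Fi), its relative frequency in percent (Fri), and the cumulative relative frequency in percent (Frac).

```python
from collections import Counter


def frequency_table(values):
    """Return rows of (value, Fi, Fri %, Frac %) sorted by value."""
    counts = Counter(values)
    n = len(values)
    rows = []
    cumulative_count = 0
    for value in sorted(counts):
        fi = counts[value]
        cumulative_count += fi
        rows.append((value, fi, 100 * fi / n, 100 * cumulative_count / n))
    return rows


data = [2, 1, 2, 3]  # hypothetical observations
for row in frequency_table(data):
    print(row)
```

When every value occurs only once, as in the stock price example, each row has Fi = 1 and the table adds little beyond the ordered data.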
Also in Fig. 3.36, we can see that all the values happened only once. Since we have a continuous quantitative variable with 20 observations and no repetitions, constructing bar or pie charts would not give the researcher any additional information, that is, it would not allow a good visualization of how the stock prices behave in terms of bins. Hence, we chose to construct a histogram with previously defined bins. The histogram generated using SPSS with the option of plotting a normal curve can be seen in Fig. 3.37.
Unlike Frequencies..., which also offers the frequency distribution table, as well as bar charts, pie charts, or histograms (with or without a normal curve), Descriptives... only makes summary measures available (therefore, it is recommended for quantitative variables). Nevertheless, measures of central tendency such as the median and mode are not available, nor are quantiles such as quartiles and percentiles. To use it, let’s click on Analyze → Descriptive Statistics and select Descriptives..., as shown in Fig. 3.38.
The Descriptives dialog box will then open. The variable being studied must be selected in Variable(s), as shown in Fig. 3.39.
Let’s click on Options... and select the summary measures that interest us (Fig. 3.40). Note that the same summary measures chosen in Frequencies... were selected, except for the median, the mode, the quartiles, and the percentiles, which are not available, as already mentioned. Let’s click on Continue to go back to the Descriptives dialog box.
Finally, click on OK. The results are available in Fig. 3.41.
Like Descriptives..., Explore... does not provide the frequency distribution table. Regarding the types of chart, unlike Frequencies..., which offers bar charts, pie charts, and histograms, Explore... provides stem-and-leaf plots and boxplots, in addition to histograms. However, it does not have the option of plotting a normal curve. Regarding summary measures, Explore... provides measures of central tendency, such as the mean and median (there is no option for the mode); quantiles, such as percentiles (of order 5, 10, 25, 50, 75, 90, and 95); measures of dispersion, such as the range, variance, and standard deviation, among others (it does not calculate the standard error); and measures of skewness and kurtosis. Therefore, this command is well suited to generating descriptive statistics for quantitative variables. To use it, from Analyze → Descriptive Statistics, select Explore..., as shown in Fig. 3.42.
The Explore dialog box will then open. The variable being studied must be selected from the list of dependent variables (Dependent List), as shown in Fig. 3.43.
Next, we must click on Statistics... to open the Explore: Statistics box, and select the options Descriptives, Outliers, and Percentiles, as shown in Fig. 3.44.
Let’s click on Continue to go back to the Explore box. Next, we must click on Plots... to open the Explore: Plots box and select the charts that interest us, as shown in Fig. 3.45. In this case, we have to select Boxplots: Factor levels together (the resulting boxplots will be together in the same chart), Stem-and-leaf and the histogram (note that there is no option for plotting the normal curve). Once again, we must click on Continue to go back to the Explore dialog box.
Finally, click on OK. The results obtained are presented next.
Fig. 3.46 shows the results obtained from Explore: Statistics, with Descriptives option.
Fig. 3.47 shows the results obtained from Explore: Statistics, with the Percentiles option. The percentiles of order 5, 10, 25 (Q1), 50 (median), 75 (Q3), 90, and 95 were calculated using two methods: the Weighted Average and Tukey’s Hinges. The latter corresponds to the method proposed in this chapter (Section 3.4.1.2, Case 1). Thus, applying the expressions in Section 3.4.1.2 to this example, we get the same results seen in Fig. 3.47 for P25, P50, and P75 under Tukey’s Hinges method. Coincidentally, in this example, the value of P75 was the same for both methods, but they are usually different.
Fig. 3.48 shows the results obtained from the Explore: Statistics, with Outliers option. The extreme values of the distribution are presented here (the highest five and the lowest five), with their respective positions found in the dataset.
Now, the charts constructed from the options selected in Explore: Plots (histograms, stem-and-leaf plots, and boxplots) are presented in Figs. 3.49, 3.50, and 3.51, respectively.
The histogram in Fig. 3.49 is the same as the one generated by Frequencies... (Fig. 3.37), but without the normal curve, since Explore... does not provide this option.
Fig. 3.50 shows that the first two digits of the number (the integers, before the point) form the stem and the decimals correspond to the leaf. Moreover, stem 18 is represented in two lines because it contains several observations.
In Section 3.4.1.3, we learned how to identify an extreme outlier through the expressions X⁎ < Q1 − 3·(Q3 − Q1) and X⁎ > Q3 + 3·(Q3 − Q1). If we consider that Q1 = 18.15 and Q3 = 18.8, we have X⁎ < 16.2 or X⁎ > 20.75. Since there are no observations outside these limits, we conclude that there are no extreme outliers.
Repeating the same procedure for mild outliers, that is, applying the expressions X° < Q1 − 1.5·(Q3 − Q1) and X° > Q3 + 1.5·(Q3 − Q1), we can see that there is one observation with a value of less than 17.175 (20th observation), and another one with a value greater than 19.775 (10th observation). These values are therefore considered mild outliers.
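The limits used above can be reproduced with a small sketch. The function names are hypothetical; the quartiles Q1 = 18.15 and Q3 = 18.8 and the values 16.9 and 19.9 are those reported for Example 3.42, while 18.5 (the median) is included only as a non-outlying reference.

```python
def outlier_fences(q1, q3):
    """Return the mild (1.5 * IQR) and extreme (3 * IQR) limits."""
    iqr = q3 - q1
    return {
        "mild": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        "extreme": (q1 - 3 * iqr, q3 + 3 * iqr),
    }


def classify(value, fences):
    """Label a value as an extreme outlier, a mild outlier, or neither."""
    low_extreme, high_extreme = fences["extreme"]
    low_mild, high_mild = fences["mild"]
    if value < low_extreme or value > high_extreme:
        return "extreme outlier"
    if value < low_mild or value > high_mild:
        return "mild outlier"
    return "not an outlier"


fences = outlier_fences(18.15, 18.8)
for price in (16.9, 18.5, 19.9):
    print(price, classify(price, fences))
```

The check for extreme outliers comes first, so a value beyond the 3·(Q3 − Q1) limits is never reported as merely mild.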
The boxplot in Fig. 3.51 shows that observations 10 and 20, with values 19.9 and 16.9, respectively, are mild outliers (represented by circles). Depending on their research goals, researchers can then decide whether to keep these observations, exclude them (the analysis may be harmed by the reduction in the sample size), or replace their values with the variable mean.
Continuing in Fig. 3.51, the values of Q1, Q2 (Md), and Q3 correspond to 18.15, 18.5, and 18.8, respectively, which are those obtained from Tukey’s Hinges method (Fig. 3.47), considering all of the initial 20 observations. Therefore, the boxplot’s measures of position (Q1, Md, and Q3), except for the minimum and maximum values, are calculated without excluding the outliers.
The same descriptive statistics obtained in the previous section through SPSS will be calculated in this section through Stata Statistical Software. The results will be compared to those obtained algebraically and also by using SPSS. The use of the images in this section has been authorized by StataCorp LP©. The data presented in Example 3.42 are used as input in Stata and are available in the file Stock_Market.dta.
Through the command tabulate, or simply tab, as we will use throughout this book, we can obtain frequency distribution tables for a certain variable. The syntax of the command is:
tab variable⁎
where the term variable⁎ should be replaced with the name of the variable considered in the analysis.
Fig. 3.52 shows the obtained output using the command tab price.
Just as the frequency distribution table obtained through SPSS (Fig. 3.36), Fig. 3.52 provides the absolute, relative, and relative cumulative frequencies for each category of the variable price.
Consider a case with more than one variable being studied in which the objective is to construct univariate frequency distribution tables (one-way tables), that is, one table for each variable being analyzed. In this case, we must use the command tab1, with the following syntax:
tab1 variables⁎
where the term variables⁎ should be replaced with the list of variables being considered in the analysis.
Through the command summarize, or simply sum, as we will use throughout this book, we can obtain summary measures, such as the mean, standard deviation, and minimum and maximum values. The syntax of this command is:
sum variables⁎
where the term variables⁎ should be replaced with the list of variables to be considered in the analysis. If no variable is specified, the statistics will be calculated for all of the variables in the dataset.
Through the option detail, we can obtain additional statistics, such as the coefficient of skewness, the coefficient of kurtosis, the four lowest and highest values, as well as several percentiles. The syntax of this command is:
sum variables⁎, detail
Therefore, for the data in our example, available in the file Stock_Market.dta, first, we must type the following command:
sum price
obtaining the statistics in Fig. 3.53.
To obtain additional descriptive statistics, we must type the following command:
sum price, detail
Fig. 3.54 shows the generated outputs.
As shown in Fig. 3.54, the option detail provides the calculation of the percentiles of order 1, 5, 10, 25, 50, 75, 90, 95 and 99. These results are obtained by Tukey’s Hinges method. We have seen, through Fig. 3.47 on the SPSS software, the results of the percentiles of order 25, 50, and 75 obtained by the same method.
Fig. 3.54 also provides the four lowest and highest values of the sample analyzed, as well as the coefficients of skewness and kurtosis. Note that these values coincide with the ones calculated in Sections 3.4.3.1.5 and 3.4.3.2.3, respectively.
The previous section discussed how to calculate the 1st, 5th, 10th, 25th, 50th, 75th, 90th, 95th, and 99th percentiles through Tukey’s Hinges method.
On the other hand, by using the command centile, we can specify the percentiles to be calculated. The method used in this case is the Weighted Average. The syntax of this command is:
centile variables⁎, centile(numbers⁎)
where the term variables⁎ should be replaced with the list of variables to be considered in the analysis, and the term numbers⁎ with the list of numbers that represent the order of the percentiles to be reported.
Therefore, let’s suppose that we want to calculate the percentiles of order 5, 10, 25, 60, 64, 90, and 95 for the variable price, through the Weighted Average. In order to do that, we must use the following command:
centile price, centile(5 10 25 60 64 90 95)
The results can be seen in Fig. 3.55.
We have seen, through Fig. 3.35, the results of the SPSS software for the percentiles of order 10, 25, 50, 60, and 75 using the same method. Fig. 3.47 on SPSS also provided the calculation of the percentiles of order 5, 10, 25, 50, 75, 90, and 95 through the Weighted Average. The only percentile that had not been specified previously was the one of order 64; the others coincide with the results in Figs. 3.35 and 3.47.
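The text does not spell out the Weighted Average rule, so the sketch below rests on an assumption: the percentile of order p is located at position (n + 1)·p/100 in the ordered sample and obtained by linear interpolation between the two neighboring observations, with positions beyond either end mapped to the minimum or maximum. The function name and data are hypothetical; the exact rule applied by each software package should be checked in its documentation.

```python
import math


def weighted_average_percentile(data, p):
    """Percentile of order p (0 < p < 100), interpolating at position (n + 1) * p / 100.

    Assumption: this positional rule is not stated in the text.
    """
    ordered = sorted(data)
    n = len(ordered)
    position = (n + 1) * p / 100  # 1-based position in the ordered sample
    if position <= 1:
        return ordered[0]
    if position >= n:
        return ordered[-1]
    lower = math.floor(position)
    fraction = position - lower
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


values = [1, 2, 3, 4, 5]  # hypothetical data
for order in (10, 25, 50, 90):
    print(order, weighted_average_percentile(values, order))
```

With only five observations, the percentiles of order 10 and 90 fall outside the interpolation range and are set to the minimum and maximum values.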
Stata makes a series of charts available, including bar charts, pie charts, scatter plots, histograms, stem-and-leaf plots, and boxplots, among others. Next, we will discuss how to obtain histograms, stem-and-leaf plots, and boxplots in Stata, for the data available in the file Stock_Market.dta.
Histograms on Stata can be obtained for continuous and discrete variables. In the case of continuous variables, to obtain a histogram of absolute frequencies, with the option of plotting a normal curve, we must type the following syntax:
histogram variable⁎, normal frequency
or simply:
hist variable⁎, norm freq
as we will use throughout this book. As mentioned before, the term variable⁎ must be replaced with the name of the variable being studied.
For discrete variables, we must include the term discrete:
hist variable⁎, discrete norm freq
Going back to the data in Example 3.42, to obtain a frequency histogram, with the option of plotting a normal curve, we must type the following command:
hist price, norm freq
The obtained output is shown in Fig. 3.56.
The stem-and-leaf plot in Stata can be obtained using the command stem, followed by the name of the variable being studied. For the data in the file Stock_Market.dta, we just need to type the following command:
stem price
The obtained output is shown in Fig. 3.57.
To obtain the boxplot in Stata, we must use the following syntax:
graph box variables⁎
where the term variables⁎ should be replaced with the list of variables to be considered in the analysis, and, for each variable, one chart is constructed.
For the data in Example 3.42, the command is:
graph box price
The chart is shown in Fig. 3.58, which corresponds to the same chart as in Fig. 3.51 generated using SPSS.
In this chapter, we studied descriptive statistics for a single variable (univariate descriptive statistics), in order to acquire a better understanding of the behavior of each variable through tables, charts, graphs and summary measures, identifying trends, variability, and outliers.
Before we start using descriptive statistics, it is necessary to identify the type of variable we will study. The type of variable is essential for calculating descriptive statistics and in the graphical representation of the results.
The descriptive statistics used to represent the behavior of a qualitative variable’s data are frequency distribution tables and charts. The frequency distribution table for a qualitative variable represents the frequency with which each variable category occurs. Qualitative variables can be represented graphically by bar charts (horizontal and vertical), pie charts, and Pareto charts.
For quantitative variables, the most common descriptive statistics are charts and summary measures (measures of position or location, dispersion or variability, and measures of shape). Frequency distribution tables can also be used to represent the frequency with which each possible value of a discrete variable occurs, or to represent the frequency of continuous variables’ data grouped into classes. Line graphs, dot or scatter plots, histograms, stem-and-leaf plots, and boxplots (or box-and-whisker diagrams) are normally used to graphically represent quantitative variables.
Table 3.3
Type of Defect | Total |
---|---|
Lack of Alignment | 98 |
Scratches | 67 |
Deformation | 45 |
Discoloration | 28 |
Oxygenation | 12 |
Total | 250 |
Table 3.5
Class | Fi | Fri (%) |
---|---|---|
30 ├ 60 | 11 | 4.4 |
60 ├ 90 | 29 | 11.6 |
90 ├ 120 | 41 | 16.4 |
120 ├ 150 | 82 | 32.8 |
150 ├ 180 | 54 | 21.6 |
180 ├ 210 | 33 | 13.2 |
Sum | 250 | 100 |
Table 3.6
Stock A | Stock B |
---|---|
31 | 25 |
30 | 33 |
24 | 27 |
24 | 34 |
28 | 32 |
22 | 26 |
24 | 26 |
34 | 28 |
24 | 34 |
28 | 28 |
23 | 31 |
30 | 28 |
31 | 34 |
32 | 16 |
26 | 28 |
39 | 29 |
25 | 27 |
42 | 28 |
29 | 33 |
24 | 29 |
22 | 34 |
23 | 33 |
32 | 27 |
29 | 26 |
Table 3.7
Hospital | Investment |
---|---|
A | 44 |
B | 12 |
C | 6 |
D | 22 |
E | 60 |
F | 15 |
G | 30 |
H | 200 |
I | 10 |
J | 8 |
K | 4 |
L | 75 |
M | 180 |
N | 50 |
O | 64 |