3.4.3 Measures of Shape

Measures of asymmetry (skewness) and kurtosis characterize the shape of the distribution of the population elements sampled around the mean (Maroco, 2014).

3.4.3.1 Measures of Skewness

Measures of skewness describe the shape of a frequency distribution curve. For a symmetrical curve or frequency distribution, the mean, the mode, and the median are the same. For an asymmetrical curve, the mean gets farther away from the mode, and the median is located in an intermediary position. Fig. 3.16 shows a symmetrical distribution.

Fig. 3.16 Symmetrical distribution.

On the other hand, if the frequency distribution is more concentrated on the left side, that is, the tail on the right is longer than the tail on the left, we have a positively skewed distribution (skewed to the right), as shown in Fig. 3.17. In this case, the mean is greater than the median, and the latter is greater than the mode (Mo < Md < X̄).

Fig. 3.17 Skewness to the right or positive skewness.

Conversely, if the frequency distribution is more concentrated on the right side, that is, the tail on the left is longer than the tail on the right, we have a negatively skewed distribution (skewed to the left), as shown in Fig. 3.18. In this case, the mean is less than the median, and the latter is less than the mode (X̄ < Md < Mo).

Fig. 3.18 Skewness to the left or negative skewness.

3.4.3.1.1 Pearson’s First Coefficient of Skewness

Pearson’s first coefficient of skewness (Sk1) is given by the difference between the mean and the mode, divided by a measure of dispersion (the standard deviation):

$$Sk_1 = \frac{\mu - Mo}{\sigma} \quad \text{(for the population)} \tag{3.47}$$

$$Sk_1 = \frac{\bar{X} - Mo}{S} \quad \text{(for samples)}, \tag{3.48}$$

which has the following interpretation:

If Sk1 = 0, the distribution is symmetrical;

If Sk1 > 0, the distribution is positively skewed (to the right);

If Sk1 < 0, the distribution is negatively skewed (to the left).

Example 3.39

From one set of data, we obtained the following measures: X̄ = 34.7, Mo = 31.5, Md = 33.2, and S = 12.4. Determine the type of skewness and calculate Pearson’s first coefficient of skewness.

Solution

Since Mo < Md < X̄, we have a positively skewed distribution (to the right). Applying Expression (3.48), we can determine Pearson’s first coefficient of skewness:

$$Sk_1 = \frac{\bar{X} - Mo}{S} = \frac{34.7 - 31.5}{12.4} = 0.258$$

The value Sk1 > 0 also confirms that the distribution is positively skewed.

3.4.3.1.2 Pearson’s Second Coefficient of Skewness

To avoid using the mode to calculate the skewness, we can adopt the empirical relationship between the mean, the median, and the mode, X̄ − Mo = 3·(X̄ − Md), which leads to Pearson’s second coefficient of skewness (Sk2):

$$Sk_2 = \frac{3 \cdot (\mu - Md)}{\sigma} \quad \text{(for the population)} \tag{3.49}$$

$$Sk_2 = \frac{3 \cdot (\bar{X} - Md)}{S} \quad \text{(for samples)} \tag{3.50}$$

In the same way, we have:

If Sk2 = 0, the distribution is symmetrical;

If Sk2 > 0, the distribution is positively skewed (to the right);

If Sk2 < 0, the distribution is negatively skewed (to the left).

Pearson’s first and second coefficients of skewness allow us to compare two or more distributions and to evaluate which one is more asymmetrical. The absolute value of the coefficient indicates the intensity of the skewness, that is, the higher the absolute value of Pearson’s coefficient of skewness, the more asymmetrical the curve is. Thus:

If 0 < | Sk | < 0.15, the skewness is weak;

If 0.15 ≤ | Sk | ≤ 1, the skewness is moderate;

If | Sk | > 1, the skewness is strong.

Example 3.40

From the data in Example 3.39, calculate Pearson’s second coefficient of skewness.

Solution

Applying Expression (3.50), we have:

$$Sk_2 = \frac{3 \cdot (\bar{X} - Md)}{S} = \frac{3 \cdot (34.7 - 33.2)}{12.4} = 0.363$$

Analogously, since Sk2 > 0, we confirm that the distribution is positively skewed.
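Examples 3.39 and 3.40 can also be checked with a few lines of code. The following is a minimal Python sketch, not part of the original text; the function names are hypothetical, and the intensity labels simply encode the thresholds listed above (weak, moderate, strong).

```python
def pearson_sk1(mean, mode, std):
    """Pearson's first coefficient of skewness, Expression (3.48)."""
    return (mean - mode) / std


def pearson_sk2(mean, median, std):
    """Pearson's second coefficient of skewness, Expression (3.50)."""
    return 3 * (mean - median) / std


def skewness_intensity(sk):
    """Classify |Sk| using the thresholds given in the text."""
    magnitude = abs(sk)
    if magnitude == 0:
        return "symmetrical"
    if magnitude < 0.15:
        return "weak"
    if magnitude <= 1:
        return "moderate"
    return "strong"


# Summary measures from Example 3.39
mean, mode, median, std = 34.7, 31.5, 33.2, 12.4

sk1 = pearson_sk1(mean, mode, std)
sk2 = pearson_sk2(mean, median, std)

print(f"Sk1 = {sk1:.3f} ({skewness_intensity(sk1)})")
print(f"Sk2 = {sk2:.3f} ({skewness_intensity(sk2)})")
```

Both coefficients are positive and fall in the moderate range, in line with the conclusions of the two examples.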

3.4.3.1.3 Bowley’s Coefficient of Skewness

Another measure of skewness is Bowley’s coefficient of skewness (SkB), also known as the quartile coefficient of skewness, which is calculated from quantiles, namely the first and third quartiles, in addition to the median:

$$Sk_B = \frac{Q_3 + Q_1 - 2 \cdot Md}{Q_3 - Q_1} \tag{3.51}$$

In the same way, we have:

If SkB = 0, the distribution is symmetrical;

If SkB > 0, the distribution is positively skewed (to the right);

If SkB < 0, the distribution is negatively skewed (to the left).

Example 3.41

Calculate Bowley’s coefficient of skewness for the following dataset, which has already been sorted in ascending order:

Value     24   25   29   31   36   40   44   45   48   50    54    56
Position  1st  2nd  3rd  4th  5th  6th  7th  8th  9th  10th  11th  12th

Solution

We have Q1 = 30, Md = 42, and Q3 = 49. Therefore, we can determine Bowley’s coefficient of skewness:

$$Sk_B = \frac{Q_3 + Q_1 - 2 \cdot Md}{Q_3 - Q_1} = \frac{49 + 30 - 2 \cdot 42}{49 - 30} = -0.263$$

Since SkB < 0, we conclude that the distribution is negatively skewed (to the left).
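As a quick check of Example 3.41, Expression (3.51) can be evaluated directly. This is a small Python sketch with a hypothetical function name; it takes the quartiles stated in the solution as given and does not recompute them from the dataset.

```python
def bowley_skewness(q1, median, q3):
    """Bowley's (quartile) coefficient of skewness, Expression (3.51)."""
    return (q3 + q1 - 2 * median) / (q3 - q1)


# Quartiles reported in the solution of Example 3.41
q1, md, q3 = 30, 42, 49

sk_b = bowley_skewness(q1, md, q3)
print(f"SkB = {sk_b:.3f}")
```

The negative sign comes from the median lying closer to Q3 than to Q1.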

3.4.3.1.4 Fisher’s Coefficient of Skewness

The last measure of skewness we will study is known as Fisher’s coefficient of skewness (g1), calculated from the third moment around the mean (M3), as presented in Maroco (2014):

$$g_1 = \frac{n^2 \cdot M_3}{(n-1) \cdot (n-2) \cdot S^3} \tag{3.52}$$

where:

$$M_3 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^3}{n} \tag{3.53}$$

which is interpreted the same way as the other coefficients of skewness, that is:

If g1 = 0, the distribution is symmetrical;

If g1 > 0, the distribution is positively skewed (to the right);

If g1 < 0, the distribution is negatively skewed (to the left).

Fisher’s coefficient of skewness can be calculated in Excel using the SKEW function (see Example 3.42) or using the Analysis ToolPak add-in (Section 3.5). Its calculation through SPSS software will be presented in Section 3.6.

3.4.3.1.5 Coefficient of Skewness on Stata

The coefficient of skewness on Stata is calculated from the second and third moments around the mean, as presented by Cox (2010):

$$Sk = \frac{M_3}{(M_2)^{3/2}} \tag{3.54}$$

where:

$$M_2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n} \tag{3.55}$$

which is interpreted the same way as the other coefficients of skewness, that is:

If Sk = 0, the distribution is symmetrical;

If Sk > 0, the distribution is positively skewed (to the right);

If Sk < 0, the distribution is negatively skewed (to the left).

3.4.3.2 Measures of Kurtosis

In addition to measures of skewness, measures of kurtosis can also be used to characterize the shape of the distribution of the variable being studied.

Kurtosis can be defined as the flatness level of a frequency distribution (height of the peak of the curve) in relation to a theoretical distribution that usually corresponds to the normal distribution.

When the shape of the distribution is neither very flat nor very elongated, similar to a normal curve, it is called mesokurtic, as we can see in Fig. 3.19.

Fig. 3.19 Mesokurtic curve.

In contrast, when the distribution shows a frequency curve that is flatter than a normal curve, it is called platykurtic, as shown in Fig. 3.20.

Fig. 3.20 Platykurtic curve.

Finally, when the distribution presents a frequency curve that is more elongated (more peaked) than a normal curve, it is called leptokurtic, as shown in Fig. 3.21.

Fig. 3.21 Leptokurtic curve.

3.4.3.2.1 Coefficient of Kurtosis

One of the most common coefficients to measure the flatness level or kurtosis of a distribution is the percentile coefficient of kurtosis, or simply coefficient of kurtosis (k). It is calculated from the interquartile range, in addition to the 10th and 90th percentiles:

$$k = \frac{Q_3 - Q_1}{2 \cdot (P_{90} - P_{10})}, \tag{3.56}$$

which has the following interpretation:

If k = 0.263, we say that the curve is mesokurtic;

If k > 0.263, we say that the curve is platykurtic;

If k < 0.263, we say that the curve is leptokurtic.
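The reference value 0.263 is what Expression (3.56) yields for a normal distribution. The Python sketch below illustrates this using the theoretical quantiles of the standard normal distribution from the standard library; the function name is hypothetical and the snippet is only a sanity check, not part of the original derivation.

```python
from statistics import NormalDist


def percentile_kurtosis(q1, q3, p10, p90):
    """Percentile coefficient of kurtosis, Expression (3.56)."""
    return (q3 - q1) / (2 * (p90 - p10))


# Theoretical quantiles of the standard normal distribution
z = NormalDist(mu=0, sigma=1)
k_normal = percentile_kurtosis(
    q1=z.inv_cdf(0.25),
    q3=z.inv_cdf(0.75),
    p10=z.inv_cdf(0.10),
    p90=z.inv_cdf(0.90),
)
print(f"k for a normal curve = {k_normal:.3f}")
```

Because k is a ratio of two quantile ranges, the result does not depend on the mean or the standard deviation chosen for the normal curve.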

3.4.3.2.2 Fisher’s Coefficient of Kurtosis

Another very common measure to determine the flatness level or kurtosis of a distribution is Fisher’s coefficient of kurtosis (g2). It is calculated using the fourth moment around the mean (M4), as presented in Maroco (2014):

$$g_2 = \frac{n^2 \cdot (n+1) \cdot M_4}{(n-1) \cdot (n-2) \cdot (n-3) \cdot S^4} - \frac{3 \cdot (n-1)^2}{(n-2) \cdot (n-3)} \tag{3.57}$$

where:

$$M_4 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^4}{n}, \tag{3.58}$$

which has the following interpretation:

If g2 = 0, the curve is mesokurtic (as in a normal distribution);

If g2 < 0, the curve is flatter than the normal curve (platykurtic);

If g2 > 0, the curve is more elongated than the normal curve (leptokurtic).

Many statistical software packages, among them SPSS, use Fisher’s coefficient of kurtosis to calculate the flatness level or kurtosis (Section 3.6). In Excel, the KURT function calculates Fisher’s coefficient of kurtosis (Example 3.42), and it can be calculated through the Analysis ToolPak add-in as well (Section 3.5).

3.4.3.2.3 Coefficient of Kurtosis on Stata

The coefficient of kurtosis on Stata is calculated from the second and fourth moments around the mean, as presented by Bock (1975) and Cox (2010):

$$k_S = \frac{M_4}{(M_2)^2} \tag{3.59}$$

which has the following interpretation:

If kS = 3, the curve is mesokurtic (as in a normal distribution);

If kS < 3, the curve is flatter than the normal curve (platykurtic);

If kS > 3, the curve is more elongated than the normal curve (leptokurtic).

Example 3.42

Table 3.E.41 shows the prices of stock Y throughout a month, resulting in a sample with 20 periods (i.e., business days). Calculate:

a) Fisher’s coefficient of skewness (g1);
b) The coefficient of skewness used on Stata;
c) Fisher’s coefficient of kurtosis (g2);
d) The coefficient of kurtosis used on Stata.

Table 3.E.41 Prices of Stock Y Throughout the Month

18.7  18.3  18.4  18.7  18.8  18.8  19.1  18.9  19.1  19.9
18.5  18.5  18.1  17.9  18.2  18.3  18.1  18.8  17.5  16.9

Solution

The mean and the standard deviation of the data in Table 3.E.41 are X̄ = 18.475 and S = 0.6324, respectively. We have:

a) Fisher’s coefficient of skewness g1:

It is calculated using the third moment around the mean (M3):

$$M_3 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^3}{n} = \frac{(18.7 - 18.475)^3 + \cdots + (16.9 - 18.475)^3}{20} = -0.0788$$

Therefore, we have:

$$g_1 = \frac{n^2 \cdot M_3}{(n-1) \cdot (n-2) \cdot S^3} = \frac{(20)^2 \cdot (-0.079)}{19 \cdot 18 \cdot (0.63)^3} = -0.3647$$

Since g1 < 0, we can conclude that the frequency curve is more concentrated on the right side and has a longer tail to the left, that is, the distribution is asymmetrical to the left or negative.

Excel calculates Fisher’s coefficient of skewness (g1) through the SKEW function. The file Stock_Market.xls contains the data from Table 3.E.41 in cells A1:A20. Thus, to calculate it, we just need to enter the expression =SKEW(A1:A20).

b) The coefficient of skewness used on Stata:

It is calculated from the second and third moments around the mean:

$$M_2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n} = \frac{(18.7 - 18.475)^2 + \cdots + (16.9 - 18.475)^2}{20} = 0.3799$$

$$M_3 = -0.0788$$

Therefore:

$$Sk = \frac{M_3}{(M_2)^{3/2}} = -0.3367,$$

which is interpreted the same way as Fisher’s coefficient of skewness.

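c) Fisher’s coefficient of kurtosis g2:

It is calculated using the fourth moment around the mean (M4):

$$M_4 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^4}{n} = \frac{(18.7 - 18.475)^4 + \cdots + (16.9 - 18.475)^4}{20} = 0.5857$$

Therefore, we calculate g2 as follows:

$$g_2 = \frac{n^2 \cdot (n+1) \cdot M_4}{(n-1) \cdot (n-2) \cdot (n-3) \cdot S^4} - \frac{3 \cdot (n-1)^2}{(n-2) \cdot (n-3)}$$

$$g_2 = \frac{(20)^2 \cdot 21 \cdot 0.5857}{19 \cdot 18 \cdot 17 \cdot (0.6324)^4} - \frac{3 \cdot (19)^2}{18 \cdot 17} = 1.7529$$

Since g2 > 0, we can conclude that the curve is elongated or leptokurtic.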

The KURT function in Excel calculates Fisher’s coefficient of kurtosis (g2). To calculate it from the file Stock_Market.xls, we must enter the expression =KURT(A1:A20).

d) Coefficient of kurtosis on Stata:

It is calculated from the second and fourth moments around the mean:

M2 = 0.3799 and M4 = 0.5857, as already calculated. Thus:

$$k_S = \frac{M_4}{(M_2)^2} = \frac{0.5857}{(0.3799)^2} = 4.0586$$

Since kS > 3, the curve is elongated or leptokurtic.
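The four coefficients in Example 3.42 can be reproduced from the raw data. The following Python sketch is an illustration added for checking the arithmetic, assuming only Expressions (3.52) to (3.59) and the prices in Table 3.E.41; the helper name central_moment is hypothetical.

```python
import math

# Prices of stock Y from Table 3.E.41 (observations 1 to 20)
prices = [18.7, 18.3, 18.4, 18.7, 18.8, 18.8, 19.1, 18.9, 19.1, 19.9,
          18.5, 18.5, 18.1, 17.9, 18.2, 18.3, 18.1, 18.8, 17.5, 16.9]


def central_moment(data, order):
    """Moment of the given order around the mean, dividing by n."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** order for x in data) / n


n = len(prices)
m2 = central_moment(prices, 2)
m3 = central_moment(prices, 3)
m4 = central_moment(prices, 4)

# Sample standard deviation S (divides by n - 1)
s = math.sqrt(m2 * n / (n - 1))

# Fisher's coefficients, Expressions (3.52) and (3.57)
g1 = n**2 * m3 / ((n - 1) * (n - 2) * s**3)
g2 = (n**2 * (n + 1) * m4 / ((n - 1) * (n - 2) * (n - 3) * s**4)
      - 3 * (n - 1)**2 / ((n - 2) * (n - 3)))

# Moment-based coefficients, Expressions (3.54) and (3.59)
sk_stata = m3 / m2**1.5
k_stata = m4 / m2**2

print(f"M2 = {m2:.4f}, M3 = {m3:.4f}, M4 = {m4:.4f}, S = {s:.4f}")
print(f"g1 = {g1:.4f}, Sk = {sk_stata:.4f}")
print(f"g2 = {g2:.4f}, kS = {k_stata:.4f}")
```

The code works with unrounded intermediate values, so small differences may appear if the rounded moments shown in the solution are substituted by hand.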

In the next three sections, we will discuss how to construct tables, charts, graphs, and summary measures in Excel and in the statistical software packages SPSS and Stata, using the data in Example 3.42.

3.5 A Practical Example in Excel

Section 3.3.1 showed the graphical representation of qualitative variables through bar charts (horizontal and vertical), pie charts, and the Pareto chart. We demonstrated how each one of these charts can be obtained using Excel. In turn, Section 3.3.2 showed the graphical representation of quantitative variables through line graphs, scatter plots, and histograms, among others. Analogously, we presented how most of them can be obtained using Excel.

Section 3.4 presented the main summary measures, including measures of central tendency (mean, mode, and median), quantiles (quartiles, deciles, and percentiles), measures of dispersion or variability (range, average deviation, variance, standard deviation, standard error, and coefficient of variation), in addition to measures of shape, such as skewness and kurtosis. We also presented how they can be calculated using Excel functions, whenever such functions are available.

This section discusses how to obtain descriptive statistics (such as the mean, standard error, median, mode, standard deviation, variance, kurtosis, and skewness, among others) through the Analysis ToolPak add-in in Excel.

In order to do that, let’s consider the problem presented in Example 3.42, whose data are available in Excel in the file Stock_Market.xls, presented in cells A1:A20, as shown in Fig. 3.22.

Fig. 3.22 Dataset in Excel—Price of Stock Y.

To load the Analysis ToolPak add-in in Excel, we must first click on the File tab and on Options, as shown in Fig. 3.23.

Fig. 3.23 File tab, with Options highlighted.

Now, the Excel Options dialog box will open, as shown in Fig. 3.24. From this box, we must select the option Add-ins. In Add-ins, we must select the option Analysis ToolPak and click on Go.

Fig. 3.24 Excel Options dialog box.

Then, the Add-ins dialog box will appear, as shown in Fig. 3.25. Among the add-ins available, we must select the option Analysis ToolPak and click on OK.

Fig. 3.25 Add-ins dialog box.

Thus, the option Data Analysis becomes available on the Data tab, inside the Analysis group, as shown in Fig. 3.26.

Fig. 3.26 Availability of the Data Analysis command, from the Data tab.

Fig. 3.27 shows the Data Analysis dialog box. Note that several analysis tools are available. Let’s select the option Descriptive Statistics and click on OK.

Fig. 3.27 Data Analysis dialog box.

From the Descriptive Statistics dialog box (Fig. 3.28), we must select the Input Range (A1:A20) and, as Output options, let’s select Summary statistics. The results can be presented in a new worksheet or in a new workbook. Finally, let’s click on OK.

Fig. 3.28 Descriptive Statistics dialog box.

The descriptive statistics generated can be seen in Fig. 3.29 and include measures of central tendency (mean, mode, and median), measures of dispersion or variability (variance, standard deviation, and standard error), and measures of shape (skewness and kurtosis). The range can be calculated from the difference between the sample’s maximum and minimum values. As mentioned in Sections 3.4.3.1 and 3.4.3.2, the measure of skewness calculated by Excel (using the SKEW function or the Descriptive Statistics tool in Fig. 3.28) corresponds to Fisher’s coefficient of skewness (g1), and the measure of kurtosis calculated (using the KURT function or the Descriptive Statistics tool in Fig. 3.28) corresponds to Fisher’s coefficient of kurtosis (g2).

Fig. 3.29 Descriptive statistics in Excel.
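For readers who want to cross-check the Excel summary outside a spreadsheet, the same kinds of measures can be sketched with Python’s standard library. This is an alternative illustration, not the Analysis ToolPak itself; the dictionary labels are chosen here for readability and are not Excel’s output labels.

```python
import math
import statistics

# Prices of stock Y from Table 3.E.41 (cells A1:A20 in Stock_Market.xls)
prices = [18.7, 18.3, 18.4, 18.7, 18.8, 18.8, 19.1, 18.9, 19.1, 19.9,
          18.5, 18.5, 18.1, 17.9, 18.2, 18.3, 18.1, 18.8, 17.5, 16.9]

n = len(prices)
std_dev = statistics.stdev(prices)  # sample standard deviation (n - 1)

summary = {
    "mean": statistics.mean(prices),
    "standard error": std_dev / math.sqrt(n),
    "median": statistics.median(prices),
    "mode": statistics.mode(prices),
    "standard deviation": std_dev,
    "sample variance": statistics.variance(prices),
    "range": max(prices) - min(prices),
    "minimum": min(prices),
    "maximum": max(prices),
    "count": n,
}

for name, value in summary.items():
    print(f"{name:20s} {value:10.4f}")
```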

3.6 A Practical Example on SPSS

Using a practical example, this section presents how to obtain the main univariate descriptive statistics studied in this chapter by using IBM SPSS Statistics Software. These include frequency distribution tables, charts (histograms, stem-and-leaf plots, boxplots, bar charts, and pie charts), measures of central tendency (mean, mode, and median), quantiles (quartiles and percentiles), measures of dispersion or variability (range, variance, standard deviation, standard error, among others), and measures of shape (skewness and kurtosis). The use of the images in this section has been authorized by the International Business Machines Corporation©.

The data presented in Example 3.42 serve as the input data on SPSS and are available in the file Stock_Market.sav, as shown in Fig. 3.30.

Fig. 3.30 Dataset on SPSS—Price of Stock Y.

To obtain such descriptive statistics, we must click on Analyze → Descriptive Statistics. After that, three options can be used: Frequencies, Descriptives, and Explore.

3.6.1 Frequencies Option

This option can be used for qualitative and quantitative variables, and it provides frequency distribution tables, as well as measures of central tendency (mean, median, and mode), quantiles (quartiles and percentiles), measures of dispersion or variability (range, variance, standard deviation, standard error, among others), and measures of skewness and kurtosis. The Frequencies option also plots bar charts, pie charts, or histograms (with or without a normal curve). To use it, on the menu bar, click on Analyze → Descriptive Statistics and select Frequencies..., as shown in Fig. 3.31.

Fig. 3.31 Descriptive statistics on SPSS—Frequencies Option.

The Frequencies dialog box will then open. The variable being studied (Stock price, called Price) must be selected in Variable(s) and the Display frequency tables option must be activated so that the frequency distribution table can be shown (Fig. 3.32).

Fig. 3.32 Frequencies dialog box: selecting the variable and showing the frequency table.

The next step consists of clicking on Statistics... to select the summary measures that interest us (Fig. 3.33).

Fig. 3.33 Frequencies: Statistics dialog box.

Among the quantiles, let’s select the option Quartiles (which calculates the first and third quartiles, in addition to the median). To get the percentile of order i (i = 1, 2, ..., 99), we must select the option Percentile(s) and add the order desired. In this case, we chose to calculate the percentiles of order 10 and 60. The measures of central tendency that we have to select are the mean, median, and mode. As measures of dispersion, let’s select Std. deviation (standard deviation), Variance, Range, and S.E. mean (standard error). Finally, let’s select both measures of shape of a distribution: Skewness and Kurtosis. To go back to the Frequencies dialog box, we must click on Continue.

Next, let’s click on Charts... and select the chart that interests us. As options, we have Bar charts, Pie charts, or Histograms. Let’s select the last one, with the option of plotting a normal curve (Fig. 3.34). Bar or pie charts can be shown in terms of absolute frequencies (Frequencies) or relative frequencies (Percentages). In order to go back to the Frequencies dialog box once again, we must click on Continue.

Fig. 3.34 Frequencies: Charts dialog box.

Finally, click on OK. Fig. 3.35 shows the calculations of the summary measures selected in Fig. 3.33.

Fig. 3.35 Summary measures obtained from Frequencies: Statistics.

As studied in Sections 3.4.3.1 and 3.4.3.2, the measure of skewness calculated by SPSS corresponds to Fisher’s coefficient of skewness (g1), and the measure of kurtosis corresponds to Fisher’s coefficient of kurtosis (g2), respectively.

Also in Fig. 3.35, note that the percentiles of order 25, 50, and 75 that correspond to the first quartile, median, and third quartile, respectively, were calculated automatically. The method used to calculate the percentiles was the Weighted Average.

The frequency distribution table can be seen in Fig. 3.36.

Fig. 3.36 Frequency distribution.

The first column represents the absolute frequency of each element (Fi), the second and third columns represent the relative frequency of each element (Fri—%), and the last column represents the relative cumulative frequency (Frac—%).

Also in Fig. 3.36, we can see that most values occur only once or twice. Since we have a continuous quantitative variable with 20 observations and few repetitions, constructing bar or pie charts would not give the researcher any additional information, that is, it would not allow a good visualization of how the stock prices behave in terms of bins. Hence, we chose to construct a histogram with previously defined bins. The histogram generated using SPSS with the option of plotting a normal curve can be seen in Fig. 3.37.

Fig. 3.37 Histogram with a normal curve obtained from Frequencies: Charts.

3.6.2 Descriptives Option

Unlike Frequencies..., which also offers the frequency distribution table, as well as bar charts, pie charts, or histograms (with or without a normal curve), Descriptives... only makes summary measures available (therefore, it is recommended for quantitative variables). Nevertheless, measures of central tendency such as the median and mode are not available, nor are quantiles such as quartiles and percentiles. To use it, let’s click on Analyze → Descriptive Statistics and select Descriptives..., as shown in Fig. 3.38.

Fig. 3.38 Descriptive statistics on SPSS—Descriptives Option.

The Descriptives dialog box will then open. The variable being studied must be selected in Variable(s), as shown in Fig. 3.39.

Fig. 3.39 Descriptives dialog box: selecting the variable.

Let’s click on Options... and select the summary measures that interest us (Fig. 3.40). Note that the same summary measures as in Frequencies... were selected, except for the median, the mode, the quartiles, and the percentiles, which are not available, as already mentioned. Let’s click on Continue to go back to the Descriptives dialog box.

Fig. 3.40 Descriptives: Options dialog box.

Finally, click on OK. The results are available in Fig. 3.41.

Fig. 3.41 Summary measures obtained from Descriptives: Options.

3.6.3 Explore Option

Like Descriptives..., Explore... does not provide the frequency distribution table. Regarding the types of chart, unlike Frequencies..., which offers bar charts, pie charts, and histograms, Explore... provides stem-and-leaf plots and boxplots, in addition to histograms. However, it does not have the option of plotting a normal curve. Regarding summary measures, Explore... provides measures of central tendency, such as the mean and median (there is no option for the mode); quantiles, such as percentiles (of order 5, 10, 25, 50, 75, 90, and 95); measures of dispersion, such as the range, variance, and standard deviation, among others (it does not calculate the standard error); and measures of skewness and kurtosis. Therefore, this command is the best one to generate descriptive statistics for quantitative variables. To use it, from Analyze → Descriptive Statistics, select Explore..., as shown in Fig. 3.42.

Fig. 3.42 Descriptive statistics on SPSS—Explore Option.

The Explore dialog box will then open. The variable being studied must be selected from the list of dependent variables (Dependent List), as shown in Fig. 3.43.

Fig. 3.43 Explore dialog box: selecting the variable.

Next, we must click on Statistics... to open the Explore: Statistics box, and select the options Descriptives, Outliers, and Percentiles, as shown in Fig. 3.44.

Fig. 3.44 Explore: Statistics dialog box.

Let’s click on Continue to go back to the Explore box. Next, we must click on Plots... to open the Explore: Plots box and select the charts that interest us, as shown in Fig. 3.45. In this case, we have to select Boxplots: Factor levels together (the resulting boxplots will be together in the same chart), Stem-and-leaf and the histogram (note that there is no option for plotting the normal curve). Once again, we must click on Continue to go back to the Explore dialog box.

Fig. 3.45 Explore: Plots dialog box.

Finally, click on OK. The results obtained are illustrated below.

Fig. 3.46 shows the results obtained from Explore: Statistics, with Descriptives option.

Fig. 3.46 Results Obtained from the Descriptives Option.

Fig. 3.47 shows the results obtained from Explore: Statistics, with the Percentiles option. The percentiles of order 5, 10, 25 (Q1), 50 (median), 75 (Q3), 90, and 95 were calculated using two methods: the Weighted Average and Tukey’s Hinges. The latter corresponds to the method proposed in this chapter (Section 3.4.1.2, Case 1). Thus, applying the expressions in Section 3.4.1.2 to this example, we get the same results seen in Fig. 3.47 for P25, P50, and P75 under Tukey’s Hinges method. Coincidentally, in this example, the value of P75 was the same for both methods, but they are usually different.

Fig. 3.47 Results obtained from the Percentiles option.

Fig. 3.48 shows the results obtained from the Explore: Statistics, with Outliers option. The extreme values of the distribution are presented here (the highest five and the lowest five), with their respective positions found in the dataset.

Fig. 3.48 Results obtained from the Outliers option.

Now, the charts constructed from the options selected in Explore: Plots (histograms, stem-and-leaf plots, and boxplots) are presented in Figs. 3.49, 3.50, and 3.51, respectively.

Fig. 3.49 Histogram constructed from the Explore: Plots dialog box.

Fig. 3.50 Stem-and-leaf chart generated from the Explore: Plots dialog box.

Fig. 3.51 Boxplot generated from the Explore: Plots dialog box.

As expected, the histogram in Fig. 3.49 is the same as the one generated by Frequencies... (Fig. 3.37), but without the normal curve, since Explore... does not provide this option.

Fig. 3.50 shows that the first two digits of the number (the integers, before the point) form the stem and the decimals correspond to the leaf. Moreover, stem 18 is represented in two lines because it contains several observations.

In Section 3.4.1.3, we learned how to calculate an extreme outlier through expressions X⁎ < Q1 − 3.(Q3 − Q1) and X⁎ > Q3 + 3.(Q3 − Q1). If we consider that Q1 = 18.15 and Q3 = 18.8, we have X⁎ < 16.2 or X⁎ > 20.75. Since there are no observations outside these limits, we conclude that there are no extreme outliers.

Repeating the same procedure for mild outliers, that is, applying expressions X° < Q1 − 1.5.(Q3 − Q1) and X° > Q3 + 1.5.(Q3 − Q1), we can see that there is one observation with a value of less than 17.175 (20th observation), and another one with a value greater than 19.775 (10th observation). These values are therefore considered mild outliers.

The boxplot in Fig. 3.51 shows that observations 10 and 20, with values 19.9 and 16.9, respectively, are mild outliers (represented by circles). Depending on their research goals, researchers can then decide whether to keep them, exclude them (the analysis may be harmed because of the reduction in the sample size), or replace their values with the variable mean.

Continuing in Fig. 3.51, the values of Q1, Q2 (Md), and Q3 correspond to 18.15, 18.5, and 18.8, respectively, which are those obtained from Tukey’s Hinges method (Fig. 3.47), considering all of the initial 20 observations. Therefore, the boxplot’s measures of position (Q1, Md, and Q3), except for the minimum and maximum values, are calculated without excluding the outliers.
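The quartiles and outlier limits discussed above can be reproduced with a short script. The Python sketch below assumes the usual definition of Tukey’s Hinges (the medians of the lower and upper halves of the sorted data, with the overall median included in both halves when n is odd); the function name and the variable names are hypothetical.

```python
from statistics import median

# Prices of stock Y from Table 3.E.41 (observations 1 to 20)
prices = [18.7, 18.3, 18.4, 18.7, 18.8, 18.8, 19.1, 18.9, 19.1, 19.9,
          18.5, 18.5, 18.1, 17.9, 18.2, 18.3, 18.1, 18.8, 17.5, 16.9]


def tukey_hinges(data):
    """Return (Q1, Md, Q3) as the lower hinge, median, and upper hinge."""
    xs = sorted(data)
    n = len(xs)
    half = (n + 1) // 2  # size of each half; includes the median if n is odd
    return median(xs[:half]), median(xs), median(xs[n - half:])


q1, md, q3 = tukey_hinges(prices)
iqr = q3 - q1

# Limits for mild (1.5 x IQR) and extreme (3 x IQR) outliers
mild_low, mild_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
extreme_low, extreme_high = q1 - 3 * iqr, q3 + 3 * iqr

# Keep the 1-based position of each observation, as in the SPSS output
extreme = [(i, x) for i, x in enumerate(prices, start=1)
           if x < extreme_low or x > extreme_high]
mild = [(i, x) for i, x in enumerate(prices, start=1)
        if (x < mild_low or x > mild_high) and (i, x) not in extreme]

print(f"Q1 = {q1:.2f}, Md = {md:.2f}, Q3 = {q3:.2f}")
print(f"Mild limits: {mild_low:.3f} and {mild_high:.3f}")
print(f"Extreme limits: {extreme_low:.2f} and {extreme_high:.2f}")
print("Mild outliers:", mild)
print("Extreme outliers:", extreme)
```

Observations 10 and 20 fall outside the mild limits but inside the extreme ones, which matches the circles drawn in the boxplot.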

3.7 A Practical Example on Stata

The same descriptive statistics obtained in the previous section through SPSS software will be calculated in this section through Stata Statistical Software. The results will be compared to those obtained algebraically and also by using SPSS. The use of the images in this section has been authorized by StataCorp LP©. The data presented in Example 3.42 serve as the input data on Stata and are available in the file Stock_Market.dta.

3.7.1 Univariate Frequency Distribution Tables on Stata

Through the command tabulate, or simply tab, as we will use throughout this book, we can obtain frequency distribution tables for a certain variable. The syntax of the command is:

tab variable⁎

where the term variable⁎ should be replaced by the name of the variable considered in the analysis.

Fig. 3.52 shows the obtained output using the command tab price.

Fig. 3.52 Frequency distribution on Stata using the command tab.

Just like the frequency distribution table obtained through SPSS (Fig. 3.36), Fig. 3.52 provides the absolute, relative, and relative cumulative frequencies for each category of the variable price.
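To make the three columns of the frequency table concrete, the sketch below builds them in Python from the data in Table 3.E.41. It is only an illustration of how absolute, relative, and relative cumulative frequencies are obtained; it does not reproduce Stata’s output format.

```python
from collections import Counter

# Prices of stock Y from Table 3.E.41
prices = [18.7, 18.3, 18.4, 18.7, 18.8, 18.8, 19.1, 18.9, 19.1, 19.9,
          18.5, 18.5, 18.1, 17.9, 18.2, 18.3, 18.1, 18.8, 17.5, 16.9]

counts = Counter(prices)  # absolute frequency of each distinct value
n = len(prices)

table = []
cumulative = 0.0
for value in sorted(counts):
    freq = counts[value]
    percent = 100 * freq / n  # relative frequency (%)
    cumulative += percent     # relative cumulative frequency (%)
    table.append((value, freq, percent, cumulative))

print(f"{'price':>6} {'Freq.':>6} {'Percent':>8} {'Cum.':>8}")
for value, freq, percent, cum in table:
    print(f"{value:6.1f} {freq:6d} {percent:8.2f} {cum:8.2f}")
```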

Consider a case with more than one variable being studied in which the objective is to construct univariate frequency distribution tables (one-way tables), that is, one table for each variable being analyzed. In this case, we must use the command tab1, with the following syntax:

tab1 variables⁎

where the term variables⁎ should be replaced by the list of variables being considered in the analysis.

3.7.2 Summary of Univariate Descriptive Statistics on Stata

Through the command summarize, or simply sum, as we will use throughout this book, we can obtain summary measures, such as the mean, standard deviation, and minimum and maximum values. The syntax of this command is:

sum variables⁎

where the term variables⁎ should be replaced by the list of variables to be considered in the analysis. If no variable is specified, the statistics will be calculated for all of the variables in the dataset.

Through the option detail, we can obtain additional statistics, such as the coefficient of skewness, the coefficient of kurtosis, the four lowest and highest values, as well as several percentiles. The syntax of this command is:

sum variables⁎, detail

Therefore, for the data in our example, available in the file Stock_Market.dta, first, we must type the following command:

sum price

obtaining the statistics in Fig. 3.53.

Fig. 3.53 Summary measures using the command sum on Stata.

To obtain additional descriptive statistics, we must type the following command:

sum price, detail

Fig. 3.54 shows the generated outputs.

Fig. 3.54 Additional statistics using the option detail.

As shown in Fig. 3.54, the option detail provides the calculation of the percentiles of order 1, 5, 10, 25, 50, 75, 90, 95 and 99. These results are obtained by Tukey’s Hinges method. We have seen, through Fig. 3.47 on the SPSS software, the results of the percentiles of order 25, 50, and 75 obtained by the same method.

Fig. 3.54 also provides the four lowest and highest values of the sample analyzed, as well as the coefficients of skewness and kurtosis. Note that these values coincide with the ones calculated in Sections 3.4.3.1.5 and 3.4.3.2.3, respectively.

3.7.3 Calculating Percentiles on Stata

The previous section discussed how to calculate the 1st, 5th, 10th, 25th, 50th, 75th, 90th, 95th, and 99th percentiles through Tukey’s Hinges method.

On the other hand, by using the command centile, we can specify the percentiles to be calculated. The method used in this case is the Weighted Average. The syntax of this command is:

centile variables⁎, centile (numbers⁎)

where the term variables⁎ should be replaced by the list of variables to be considered in the analysis, and the term numbers⁎ by the list of numbers that represent the order of the percentiles to be reported.

Therefore, let’s suppose that we want to calculate the percentiles of order 5, 10, 25, 60, 64, 90, and 95 for the variable price, through the Weighted Average. In order to do that, we must use the following command:

centile price, centile (5 10 25 60 64 90 95)

The results can be seen in Fig. 3.55.

Fig. 3.55 Results obtained from the command centile on Stata.

We have seen, through Fig. 3.35, the results of the SPSS software for the percentiles of order 10, 25, 50, 60, and 75 using the same method. Fig. 3.47 on SPSS also provided the calculation of the percentiles of order 5, 10, 25, 50, 75, 90, and 95 through the Weighted Average. The only percentile that had not been specified previously was the one of order 64; the others coincide with the results in Figs. 3.35 and 3.47.
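The Weighted Average method can be illustrated with a short Python sketch. It assumes the common convention in which the percentile of order p sits at position (n + 1)·p/100 in the sorted data and is obtained by linear interpolation between neighboring observations; this convention is an assumption made here for illustration, so the results should be compared with Figs. 3.35, 3.47, and 3.55 before relying on it. The function name is hypothetical.

```python
import math

# Prices of stock Y from Table 3.E.41
prices = [18.7, 18.3, 18.4, 18.7, 18.8, 18.8, 19.1, 18.9, 19.1, 19.9,
          18.5, 18.5, 18.1, 17.9, 18.2, 18.3, 18.1, 18.8, 17.5, 16.9]


def weighted_average_percentile(data, p):
    """Percentile of order p (0 < p < 100) by interpolation at (n + 1) * p / 100.

    Assumed convention: positions below 1 return the minimum and positions
    at or beyond n return the maximum.
    """
    xs = sorted(data)
    n = len(xs)
    position = (n + 1) * p / 100
    lower = math.floor(position)   # 1-based rank of the lower neighbor
    fraction = position - lower    # weight given to the upper neighbor
    if lower < 1:
        return xs[0]
    if lower >= n:
        return xs[-1]
    return xs[lower - 1] + fraction * (xs[lower] - xs[lower - 1])


for order in (5, 10, 25, 50, 60, 64, 75, 90, 95):
    value = weighted_average_percentile(prices, order)
    print(f"P{order:<2d} = {value:.3f}")
```

Under this convention, P75 equals 18.8, the same value given by Tukey’s Hinges, whereas P25 differs slightly between the two methods.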

3.7.4 Charts on Stata: Histograms, Stem-and-Leaf, and Boxplots

Stata makes a series of charts available, including bar charts, pie charts, scatter plots, histograms, stem-and-leaf, and boxplots, among others. Next, we will discuss how to obtain histograms, stem-and-leaf plots, and boxplots on Stata, for the data available in the file Stock_Market.dta.

3.7.4.1 Histogram

Histograms on Stata can be obtained for continuous and discrete variables. In the case of continuous variables, to obtain a histogram of absolute frequencies, with the option of plotting a normal curve, we must type the following syntax:

histogram variable⁎, normal frequency

or simply:

hist variable⁎, norm freq

as we will use throughout this book. As mentioned before, the term variable⁎ must be replaced by the name of the variable being studied.

For discrete variables, we must include the term discrete:

hist variable⁎, discrete norm freq

Going back to the data in Example 3.42, to obtain a frequency histogram, with the option of plotting a normal curve, we must type the following command:

hist price, norm freq

The obtained output is shown in Fig. 3.56.

Fig. 3.56 Frequency histogram on Stata.
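The absolute frequencies behind a frequency histogram can also be counted directly. The sketch below is only an illustration in Python: it assumes equal-width classes, lets the user choose the number of classes (it does not reproduce Stata's default choice), and does not draw the normal curve. The function name and the data are hypothetical.

```python
def histogram_counts(values, bins):
    """Absolute frequency of each of `bins` equal-width classes."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("all values are equal; classes would have zero width")
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        # The maximum is placed in the last class instead of opening a new one
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(bins + 1)]
    return edges, counts


# Hypothetical values, used only to illustrate the counting
sample = [10, 12, 14, 20, 22, 30, 38, 40]
edges, counts = histogram_counts(sample, bins=3)
for i, count in enumerate(counts):
    print(f"{edges[i]:6.1f} to {edges[i + 1]:6.1f} | {'*' * count} ({count})")
```

Each class is closed on the left and open on the right, except the last one, which also includes the maximum.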

3.7.4.2 Stem-and-Leaf

The stem-and-leaf plot on Stata can be obtained using the command stem, followed by the name of the variable being studied. For the data in the file Stock_Market.dta, we just need to type the following command:

stem price

The obtained output is shown in Fig. 3.57.

Fig. 3.57 Stem-and-Leaf plot on Stata.
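The logic of a stem-and-leaf plot is simple enough to sketch in a few lines of Python. This is an illustration under stated assumptions, not a reproduction of Stata's output: the values are assumed to be non-negative integers, the stem is the number of tens, the leaf is the units digit, and stems without leaves are omitted. The function name and the data are hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (tens) to its sorted list of leaves (units)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        plot[stem].append(leaf)
    return dict(plot)


# Hypothetical values, used only to illustrate the construction
prices = [12, 15, 18, 20, 22, 25, 27, 30, 35]
for stem, leaves in stem_and_leaf(prices).items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Because the values are sorted before being split, the stems appear in ascending order and the leaves within each stem are already ordered.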

3.7.4.3 Boxplot

To obtain the boxplot on the Stata software, we must use the following syntax:

graph box variables⁎

where the term variables⁎ should be replaced by the list of variables to be considered in the analysis. One chart is constructed for each variable.

For the data in Example 3.42, the command is:

graph box price

The chart is shown in Fig. 3.58, which corresponds to the chart in Fig. 3.51 generated using SPSS.

Fig. 3.58 Boxplot on Stata.
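The quantities that a boxplot displays can be computed separately, which helps when interpreting the chart. The Python sketch below makes two assumptions that should be checked against the software in use: quartiles are obtained with the standard library function statistics.quantiles and its "exclusive" method, which interpolates at position (n + 1)p, and outliers are flagged by the common rule of 1.5 times the interquartile range. Quartile conventions differ among packages, so the values may not match a given Stata or SPSS chart exactly. The function name and the data are hypothetical.

```python
import statistics


def boxplot_summary(values):
    """Quartiles, fences, and outliers under the 1.5 * IQR rule."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="exclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers,
    }


# Hypothetical values; 90 is deliberately far from the rest
prices = [12, 15, 18, 20, 22, 25, 27, 30, 35, 90]
for name, value in boxplot_summary(prices).items():
    print(name, value)
```

In this example the quartiles are 17.25, 23.5, and 31.25, so the upper fence is 52.25 and the value 90 is flagged as an outlier.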

3.8 Final Remarks

In this chapter, we studied descriptive statistics for a single variable (univariate descriptive statistics), in order to acquire a better understanding of the behavior of each variable through tables, charts, graphs and summary measures, identifying trends, variability, and outliers.

Before we start using descriptive statistics, it is necessary to identify the type of variable we will study. The type of variable is essential both for calculating descriptive statistics and for representing the results graphically.

The descriptive statistics used to represent the behavior of a qualitative variable’s data are frequency distribution tables and charts. The frequency distribution table for a qualitative variable represents the frequency with which each category of the variable occurs. Qualitative variables can be represented graphically by bar charts (horizontal and vertical), pie charts, and Pareto charts.
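As a reminder of how such a table is assembled, the Python sketch below counts the occurrences of each category and derives the relative and cumulative relative frequencies. It is only an illustration: the function name and the categories are hypothetical, and the rows are sorted by decreasing frequency, which is also the ordering a Pareto chart requires.

```python
from collections import Counter


def frequency_table(categories):
    """Rows of (category, absolute, relative %, cumulative %),
    ordered from the most to the least frequent category."""
    counts = Counter(categories)
    total = sum(counts.values())
    rows = []
    cumulative = 0
    for category, absolute in counts.most_common():
        cumulative += absolute
        rows.append(
            (category, absolute, 100 * absolute / total, 100 * cumulative / total)
        )
    return rows


# Hypothetical observations of a qualitative variable
observations = ["scratch"] * 5 + ["dent"] * 3 + ["stain"] * 2
for category, absolute, relative, cumulative in frequency_table(observations):
    print(f"{category:10s} {absolute:3d} {relative:6.1f} {cumulative:6.1f}")
```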

For quantitative variables, the most common descriptive statistics are charts and summary measures (measures of position or location, dispersion or variability, and measures of shape). Frequency distribution tables can also be used to represent the frequency with which each possible value of a discrete variable occurs, or to represent the frequency of continuous variables’ data grouped into classes. Line graphs, dot or scatter plots, histograms, stem-and-leaf plots, and boxplots (or box-and-whisker diagrams) are normally used to graphically represent quantitative variables.

3.9 Exercises

  1) What statistics can be used (and in which situations) to represent the behavior of a single quantitative or qualitative variable?
  2) What are the limitations of only using measures of central tendency in the study of a certain variable?
  3) How can we verify the existence of outliers in a certain variable?
  4) Describe each one of the measures of dispersion or variability.
  5) What is the difference between Pearson’s first and second coefficients used as measures of skewness in a distribution?
  6) What is the best chart to check the position, skewness and discrepancy among the data?
  7) In the case of bar charts and scatter plots, what kind of data should be used?
  8) What are the most suitable charts to represent qualitative data?
  9) Table 3.1 shows the number of vehicles sold by a dealership in the last 30 days. Construct a frequency distribution table for these data.

    Table 3.1

    Number of Vehicles Sold
     7    5    9   11   10    8    9    6    8   10
     8    5    7   11    9   11    6    7   10    9
     8    5    6    8    6    7    6    5   10    8

  10) A survey on patients’ health was carried out and information regarding the weight of 50 patients was collected (Table 3.2). Build the frequency distribution table for this problem.

    Table 3.2

    Patients’ Weight
    60.4   78.9   65.7   82.1   80.9   92.3   85.7   86.6   90.3   93.2
    75.2   77.3   80.4   62.0   90.4   70.4   80.5   75.9   55.0   84.3
    81.3   78.3   70.5   85.6   71.9   77.5   76.1   67.7   80.6   78.0
    71.6   74.8   92.1   87.7   83.8   93.4   69.3   97.8   81.7   72.2
    69.3   80.2   90.0   76.9   54.7   78.4   55.2   75.5   99.3   66.7

  11) At an electrical appliances factory, in the door component production phase, the quality inspector verifies the total number of parts rejected per type of defect (lack of alignment, scratches, deformation, discoloration, and oxygenation), as shown in Table 3.3.

    Table 3.3

    Total Number of Parts Rejected per Type of Defect
    Type of Defect       Total
    Lack of Alignment       98
    Scratches               67
    Deformation             45
    Discoloration           28
    Oxygenation             12
    Total                  250

    We would like you to:
    a) Elaborate a frequency distribution table for this problem.
    b) Construct a pie chart, in addition to a Pareto chart.
  12) To preserve açaí, it is necessary to carry out several procedures, such as whitening, pasteurization, freezing, and dehydration. The files Dehydration.xls, Dehydration.sav, and Dehydration.dta show the processing times (in seconds) in the dehydration phase throughout 100 periods. We would like you to:
    a) Calculate the measures of position regarding the arithmetic mean, the median, and the mode.
    b) Calculate the first and third quartiles and see if there are any outliers.
    c) Calculate the 10th and 90th percentiles.
    d) Calculate the 3rd and 6th deciles.
    e) Calculate the measures of dispersion (range, average deviation, variance, standard deviation, standard error, and coefficient of variation).
    f) Check if the distribution is symmetrical, positively skewed, or negatively skewed.
    g) Calculate the coefficient of kurtosis and determine the flatness level of the distribution (mesokurtic, platykurtic or leptokurtic).
    h) Construct a histogram, a stem-and-leaf plot, and a boxplot for the variable being studied.
  13) In a certain bank branch, we collected the average service time (in minutes) from a sample of 50 customers regarding three types of services. The data can be found in the files Services.xls, Services.sav, and Services.dta. Compare the results of the services based on the following measures:
    a) Measures of position (mean, median, and mode).
    b) Measures of dispersion (variance, standard deviation, and standard error).
    c) First and third quartiles; check if there are any outliers.
    d) Fisher’s coefficient of skewness (g1) and Fisher’s coefficient of kurtosis (g2). Classify the symmetry and the flatness level of each distribution.
    e) For each one of the variables, construct a bar chart, a boxplot, and a histogram.
  14) A passenger collected the average travel times (in minutes) of a bus in the district of Vila Mariana, on the Jabaquara route, for 120 days (Table 3.4).

    Table 3.4

    Average Travel Times in 120 Days
    Time   Number of Days
    30      4
    32      7
    33     10
    35     12
    38     18
    40     22
    42     20
    43     15
    45      8
    50      4

    We would like you to:
    a) Calculate the arithmetic mean, the median, and the mode.
    b) Calculate Q1, Q3, D4, P61, and P84.
    c) Are there any outliers?
    d) Calculate the range, the variance, the standard deviation, and the standard error.
    e) Calculate Fisher’s coefficient of skewness (g1) and Fisher’s coefficient of kurtosis (g2). Classify the symmetry and the flatness level of the distribution.
    f) Construct a bar chart, a histogram, a stem-and-leaf plot, and a boxplot.
  15) In order to improve the quality of its services, a retail company collected the average service time, in seconds, of 250 employees. The data were grouped into classes, with their respective absolute and relative frequencies, as shown in Table 3.5.

    Table 3.5

    Average Service Time
    Class        Fi     Fri (%)
    30 ├ 60      11       4.4
    60 ├ 90      29      11.6
    90 ├ 120     41      16.4
    120 ├ 150    82      32.8
    150 ├ 180    54      21.6
    180 ├ 210    33      13.2
    Sum         250     100

    We would like you to:
    a) Calculate the arithmetic mean, the median, and the mode.
    b) Calculate Q1, Q3, D2, P13, and P95.
    c) Are there any outliers?
    d) Calculate the range, the variance, the standard deviation, and the standard error.
    e) Calculate Pearson’s first coefficient of skewness and the coefficient of kurtosis. Classify the symmetry and the flatness level of the distribution.
    f) Construct a histogram.
  16) A financial analyst wants to compare the price of two stocks throughout the previous month. The data are listed in Table 3.6.

    Table 3.6

    Stock Price
    Stock A   Stock B
    31        25
    30        33
    24        27
    24        34
    28        32
    22        26
    24        26
    34        28
    24        34
    28        28
    23        31
    30        28
    31        34
    32        16
    26        28
    39        29
    25        27
    42        28
    29        33
    24        29
    22        34
    23        33
    32        27
    29        26

    Carry out a comparative analysis of the price of both stocks based on:
    a) Measures of position, such as the mean, median, and mode.
    b) Measures of dispersion, such as the range, variance, standard deviation, and standard error.
    c) The existence of outliers.
    d) The symmetry and flatness level of the distribution.
    e) A line graph, scatter plot, stem-and-leaf plot, histogram, and boxplot.
  17) Aiming to determine the standards of the investments made in hospitals in São Paulo (US$ millions), a state government agency collected data regarding 15 hospitals, as shown in Table 3.7.

    Table 3.7

    Investments in 15 Hospitals in the State of São Paulo
    Hospital   Investment
    A           44
    B           12
    C            6
    D           22
    E           60
    F           15
    G           30
    H          200
    I           10
    J            8
    K            4
    L           75
    M          180
    N           50
    O           64

    We would like you to:
    a) Calculate the sample’s arithmetic mean and standard deviation.
    b) Eliminate possible outliers.
    c) Once again, calculate the sample’s arithmetic mean and standard deviation (without the outliers).
    d) What can we say about the standard deviation of the new sample without the outliers?

References

Bock R.D. Multivariate Statistical Methods in Behavioral Research. New York: McGraw-Hill; 1975.

Bussab W.O., Morettin P.A. Estatística básica. seventh ed. São Paulo: Saraiva; 2011.

Cox N.J. Speaking Stata: the limits of sample skewness and kurtosis. Stata J. 2010;10(3):482–495.

Fávero L.P., Belfiore P., Silva F.L., Chan B.L. Análise de dados: modelagem multivariada para tomada de decisões. Rio de Janeiro: Campus Elsevier; 2009.

Maroco J. Análise estatística com o SPSS Statistics. sixth ed. Lisboa: Edições Sílabo; 2014.
