Computing the measures of central tendency of a dataset in Python

To illustrate, consider the following dataset, which consists of the marks obtained by 15 pupils on a test scored out of 20:

    In [18]: grades = [10, 10, 14, 18, 18, 5, 10, 8, 1, 12, 14, 12, 13, 1, 18]  

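The mean, median, and mode can be obtained as follows. Note that 10 and 18 each occur three times in this dataset, and stats.mode reports only one of the tied values (here, 10):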

    In [29]: # Set output precision to 3 decimal places
             %precision 3
    Out[29]: u'%.3f'
    
    In [30]: import numpy as np
             np.mean(grades)
    Out[30]: 10.933
    
    In [35]: # Reset output precision to the default
             %precision
             np.median(grades)
    Out[35]: 12.0
    
    In [24]: from scipy import stats
             stats.mode(grades)
    Out[24]: (array([ 10.]), array([ 3.]))
    
    In [39]: import matplotlib.pyplot as plt
    
    In [40]: plt.hist(grades)
             plt.title('Histogram of grades')
             plt.xlabel('Grade')
             plt.ylabel('Frequency')
             plt.show()  

The following is the output of the preceding code:

To illustrate how the skewness of data or an outlier value can drastically affect the usefulness of the mean as a measure of central tendency, consider the following dataset, which shows the wages (in thousands of dollars) of the staff at a factory:

    In [45]: %precision 2
             salaries = [17, 23, 14, 16, 19, 22, 15, 18, 18, 93, 95]
    
    In [46]: np.mean(salaries)
    Out[46]: 31.82

Based on the mean, we might assume that the data is centered around 31.82. However, we would be wrong. To see why, let's display the empirical distribution of the data using a bar plot:

    In [59]: fig = plt.figure()
             ax = fig.add_subplot(111)
             # One bin per unit of salary between the minimum and maximum
             n_bins = max(salaries) - min(salaries)
             ax.hist(salaries, bins=n_bins)
             ax.set_xlabel('Salary')
             ax.set_ylabel('# of employees')
             ax.set_title('Bar chart of salaries')
             plt.show()

The following is the output of the preceding code:

From the preceding bar plot, we can see that most of the salaries are far below 30K and that no one is close to the mean of 32K. Now, if we take a look at the median, we will see that it is a better measure of central tendency in this case:

    In [47]: np.median(salaries)
    Out[47]: 18.00
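To make the pull of the two large salaries concrete, the following is a minimal sketch that recomputes both measures with and without them. It uses the standard library's statistics module rather than NumPy, and the cutoff of 90 is a hypothetical threshold chosen only to separate the two extreme values in this particular dataset:

```python
import statistics

# Wages in thousands of dollars, copied from the example above.
salaries = [17, 23, 14, 16, 19, 22, 15, 18, 18, 93, 95]

# Hypothetical cutoff, chosen only to set aside the two extreme values.
cutoff = 90
typical = [s for s in salaries if s < cutoff]

mean_all = statistics.mean(salaries)
median_all = statistics.median(salaries)
mean_typical = statistics.mean(typical)
median_typical = statistics.median(typical)

print(f"all staff:        mean={mean_all:.2f}, median={median_all:.2f}")
print(f"without outliers: mean={mean_typical:.2f}, median={median_typical:.2f}")
```

With the two extreme values set aside, the mean drops from about 31.82 to 18, while the median stays at 18 in both cases. This is the sense in which the median is robust to outliers and the mean is not.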

We can also take a look at a histogram of the data:

    In [56]: plt.hist(salaries, bins=len(salaries))
             plt.title('Histogram of salaries')
             plt.xlabel('Salary')
             plt.ylabel('Frequency')
             plt.show()

The following is the output of the preceding code:

The histogram is actually a better representation of the data as bar plots are generally used to represent categorical data, while histograms are preferred for quantitative data, which is the case for the salaries data. For more information on when to use histograms versus bar plots, refer to http://onforb.es/1Dru2gv.
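The counts behind such a histogram can also be inspected without plotting. The following is a rough sketch, in plain Python, of equal-width binning over the range of the data, which is what passing an integer number of bins asks for. The names n_bins, width, and counts are introduced here for illustration only:

```python
# Wages in thousands of dollars, copied from the example above.
salaries = [17, 23, 14, 16, 19, 22, 15, 18, 18, 93, 95]

n_bins = len(salaries)            # same bin count as the histogram call
lo, hi = min(salaries), max(salaries)
width = (hi - lo) / n_bins        # every bin has the same width

counts = [0] * n_bins
for s in salaries:
    # The largest value belongs to the last bin, so clamp the index.
    idx = min(int((s - lo) / width), n_bins - 1)
    counts[idx] += 1

print(counts)
```

Nine of the eleven salaries fall in the first two bins and the remaining two fall in the last bin, with nothing in between. The gap is exactly where the mean of 31.82 lies.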

If the distribution is symmetrical and unimodal (that is, has only one mode), the three measures (mean, median, and mode) will be equal. This is not the case if the distribution is skewed. In that case, the mean and median will generally differ from each other. As a rule of thumb, with a negatively skewed distribution, the mean will be lower than the median, and vice versa for a positively skewed distribution:

Diagram sourced from http://www.southalabama.edu/coe/bset/johnson/lectures/lec15_files/iage014.jpg.
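The relationship between skewness and the ordering of the mean and median can be checked on the two datasets used in this section. The following is a minimal sketch, assuming the moment-based skewness coefficient (the third central moment divided by the second central moment raised to the power 1.5). The skewness helper is written here for illustration and is not part of any library:

```python
import statistics

def skewness(values):
    """Moment-based skewness coefficient: m3 / m2 ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

grades = [10, 10, 14, 18, 18, 5, 10, 8, 1, 12, 14, 12, 13, 1, 18]
salaries = [17, 23, 14, 16, 19, 22, 15, 18, 18, 93, 95]

for name, data in [("grades", grades), ("salaries", salaries)]:
    print(name,
          "skewness:", round(skewness(data), 3),
          "mean:", round(statistics.mean(data), 3),
          "median:", statistics.median(data))
```

The grades have a negative coefficient and a mean below the median, while the salaries have a positive coefficient and a mean above the median, in line with the rule of thumb above.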