Quartile

A more informative measure of dispersion is the quartile and the related interquartile range. Quartile is short for quarterly percentile: the quartiles are the values on the measurement scale below which 25, 50, 75, and 100 percent of the scores in the sorted dataset fall. The first three of these values are the points that split the dataset into four groups, each containing one-fourth of the data. To illustrate this, suppose we have the following dataset of 20 test scores, which we then sort and rank:

    In [27]: import random
             random.seed(100)
             testScores = [random.randint(0,100) for p in 
                           xrange(0,20)]
             testScores
    Out[27]: [14, 45, 77, 71, 73, 43, 80, 53, 8, 46, 4, 94, 95, 33, 31, 77, 20, 18, 19, 35]
    
    In [28]: #data needs to be sorted for quartiles
             import numpy as np
             sortedScores = np.sort(testScores)

    In [30]: rankedScores = {i+1: sortedScores[i] for i in 
                             xrange(len(sortedScores))}

    In [31]: rankedScores
    Out[31]: {1: 4, 2: 8, 3: 14, 4: 18, 5: 19, 6: 20, 7: 31, 8: 33, 9: 35, 10: 43, 11: 45, 12: 46, 13: 53, 14: 71, 15: 73, 16: 77, 17: 77, 18: 80, 19: 94, 20: 95}

The first quartile (Q1) lies between the fifth and sixth scores, the second quartile (Q2) lies between the tenth and eleventh scores, and the third quartile (Q3) lies between the fifteenth and sixteenth scores. Interpolating linearly to the midpoint of each pair gives the following results:

    Q1 = (19 + 20)/2 = 19.5
    Q2 = (43 + 45)/2 = 44
    Q3 = (73 + 77)/2 = 75
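
The midpoint arithmetic can also be checked with a few lines of plain Python. This is only a minimal sketch: the function name midpoint_quartiles is hypothetical, the scores are copied from Out[31], and the approach assumes the number of scores is a multiple of four, so that each quartile falls exactly between two ranked scores.

```python
# Sorted scores copied from Out[31].
sorted_scores = [4, 8, 14, 18, 19, 20, 31, 33, 35, 43,
                 45, 46, 53, 71, 73, 77, 77, 80, 94, 95]


def midpoint_quartiles(values):
    """Return (Q1, Q2, Q3) as midpoints of the scores around each boundary.

    Hypothetical helper: assumes `values` is sorted and that its length is
    a multiple of 4, as in the 20-score example.
    """
    n = len(values)
    if n == 0 or n % 4 != 0:
        raise ValueError("this sketch needs a non-empty length divisible by 4")
    quarter = n // 4
    # The k-th boundary lies between rank k*quarter and rank k*quarter + 1
    # (1-based), that is, between indices k*quarter - 1 and k*quarter.
    return tuple(
        (values[k * quarter - 1] + values[k * quarter]) / 2.0
        for k in (1, 2, 3)
    )


q1, q2, q3 = midpoint_quartiles(sorted_scores)
print(q1, q2, q3)  # 19.5 44.0 75.0
```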

To see this in IPython, we can use the mquantiles function from scipy.stats.mstats or the numpy.percentile function:

    In [38]: from scipy.stats.mstats import mquantiles
             mquantiles(sortedScores)
    Out[38]: array([ 19.45,  44.  ,  75.2 ])
    
    In [40]: [np.percentile(sortedScores, perc) for perc in [25,50,75]]
    Out[40]: [19.75, 44.0, 74.0]
  

The values don't match our previous calculations exactly because each function uses a different interpolation method. The interquartile range is the third quartile minus the first quartile (Q3 - Q1). It represents the middle 50 percent of the values in a dataset.
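
To make the effect of the interpolation method concrete, the following sketch recomputes the quartiles and the interquartile range in plain Python under three different rules for locating a quartile's fractional rank. The helper names (value_at_rank, quartiles, rank_rules) are hypothetical. The rank formulas are assumptions about the default settings: linear interpolation at rank 1 + (n - 1)p for numpy.percentile, and alphap = betap = 0.4 for mquantiles. They do reproduce the outputs shown above.

```python
import math

# Sorted scores copied from Out[31].
sorted_scores = [4, 8, 14, 18, 19, 20, 31, 33, 35, 43,
                 45, 46, 53, 71, 73, 77, 77, 80, 94, 95]


def value_at_rank(values, rank):
    """Linearly interpolate in sorted `values` at a 1-based fractional rank."""
    n = len(values)
    rank = min(max(rank, 1.0), float(n))   # keep the rank inside [1, n]
    lower = int(math.floor(rank))          # rank of the score just below
    frac = rank - lower                    # how far toward the next score
    if lower >= n:
        return float(values[-1])
    return (1.0 - frac) * values[lower - 1] + frac * values[lower]


def quartiles(values, rank_rule):
    """Return [Q1, Q2, Q3] using rank_rule(n, p) to place each quartile."""
    n = len(values)
    return [value_at_rank(values, rank_rule(n, p)) for p in (0.25, 0.50, 0.75)]


# Each rule maps (n, p) to a 1-based fractional rank (assumed defaults).
rank_rules = {
    "midpoint (by hand)": lambda n, p: n * p + 0.5,
    "numpy.percentile": lambda n, p: 1 + (n - 1) * p,
    "mquantiles": lambda n, p: n * p + 0.4 + 0.2 * p,
}

for name, rule in rank_rules.items():
    q1, q2, q3 = quartiles(sorted_scores, rule)
    print("%-20s Q1=%.2f Q2=%.2f Q3=%.2f IQR=%.2f"
          % (name, q1, q2, q3, q3 - q1))
```

For this dataset, the three rules give interquartile ranges of about 55.5, 54.25, and 55.75 respectively, so the choice of method shifts the result only slightly.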
