Measuring central tendency of noisy data

We can measure central tendency with the mean and the median, both of which use all the available data. A common way to limit the influence of outliers is to discard the values at the lower and upper ends of the data set. The truncated mean (or trimmed mean) and measures derived from it, such as the interquartile mean (IQM) and the trimean, follow this idea. Take a look at the following equations:

$$TM = \frac{Q_1 + 2Q_2 + Q_3}{4} \tag{4.1}$$

$$IQM = \frac{2}{n}\sum_{i=\frac{n}{4}+1}^{\frac{3n}{4}} x_i \tag{4.2}$$

Here, Q1, Q2, and Q3 are the first, second (median), and third quartiles, and the xi are the sorted data values.

The truncated mean discards the data beyond given percentiles, for instance, everything from the lowest value up to the 5th percentile and everything from the 95th percentile up to the highest value. The trimean (4.1) is a weighted average of the median, first quartile, and third quartile. For the IQM (4.2), we discard the lowest and highest quartiles of the data, so it is a special case of the truncated mean. We will calculate these measures with the SciPy tmean() and trima() functions.
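As a quick numeric check of these definitions (the eight-value sample below is made up for illustration), a few lines of NumPy and SciPy reproduce both measures:

    import numpy as np
    from scipy.stats.mstats import trima

    sample = np.array([1, 2, 3, 4, 5, 6, 7, 100])  # made-up data with one outlier

    # Trimean (4.1): weighted average of the first quartile, median, and third quartile
    q1, q2, q3 = np.percentile(sample, [25, 50, 75])
    print('Trimean', (q1 + 2 * q2 + q3) / 4)

    # IQM (4.2): mean of the middle 50 percent of the data;
    # trima() masks values outside the limits and compressed() drops the masked entries
    print('IQM', trima(sample, limits=(q1, q3)).compressed().mean())
    print('Plain mean', sample.mean())  # pulled up by the outlier

The plain mean is dragged up to 16 by the single outlier, while the trimean and IQM both come out at 4.5.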

How to do it...

We will take a look at the central tendency for varying levels of truncation with the following steps:

  1. The imports are as follows:
    import matplotlib.pyplot as plt
    from scipy.stats import tmean
    from scipy.stats.mstats import trima
    import numpy as np
    import dautil as dl
    import seaborn as sns
    from IPython.display import HTML
    
    context = dl.nb.Context('central_tendency')
  2. Define the following function to calculate the interquartile mean:
    def iqm(a):
        return truncated_mean(a, 25)
  3. Define the following function to plot distributions:
    def plotdists(var, ax):
        displot_label = 'From {0} to {1} percentiles'
        cyc = dl.plotting.Cycler()
    
        for i in range(1, 9, 3):
            limits = dl.stats.outliers(var, method='percentiles', 
                                   percentiles=(i, 100 - i))
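            # trima() masks values outside the limits; compressed() drops the masked entries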
            truncated = trima(var, limits=limits).compressed()
            sns.distplot(truncated, ax=ax, color=cyc.color(),
                         hist_kws={'histtype': 'stepfilled', 'alpha': 1/i,
                                   'linewidth': cyc.lw()},
                         label=displot_label.format(i, 100 - i))
  4. Define the following function to compute the truncated mean (a standalone sketch of this logic without the dautil helpers appears after the recipe):
    def truncated_mean(a, percentile):
        limits = dl.stats.outliers(a, method='percentiles', 
                                   percentiles=(percentile, 100 - percentile))
        
        return tmean(a, limits=limits)
  5. Load the data and calculate means as follows:
    df = dl.data.Weather.load().resample('M').mean().dropna()
    x = range(9)
    temp_means = [truncated_mean(df['TEMP'], i) for i in x]
    ws_means = [truncated_mean(df['WIND_SPEED'], i) for i in x]
  6. Plot the means and distributions with the following code:
    sp = dl.plotting.Subplotter(2, 2, context)
    cp = dl.plotting.CyclePlotter(sp.ax)
    cp.plot(x, temp_means, label='Truncated mean')
    cp.plot(x, dl.stats.trimean(df['TEMP']) * np.ones_like(x), label='Trimean')
    cp.plot(x, iqm(df['TEMP']) * np.ones_like(x), label='IQM')
    sp.label(ylabel_params=dl.data.Weather.get_header('TEMP'))
    
    cp = dl.plotting.CyclePlotter(sp.next_ax())
    cp.plot(x, ws_means, label='Truncated mean')
    cp.plot(x, dl.stats.trimean(df['WIND_SPEED']) * np.ones_like(x),
            label='Trimean')
    cp.plot(x, iqm(df['WIND_SPEED']) * np.ones_like(x), label='IQM')
    sp.label(ylabel_params=dl.data.Weather.get_header('WIND_SPEED'))
    
    plotdists(df['TEMP'], sp.next_ax())
    sp.label(xlabel_params=dl.data.Weather.get_header('TEMP'))
    
    plotdists(df['WIND_SPEED'], sp.next_ax())
    sp.label(xlabel_params=dl.data.Weather.get_header('WIND_SPEED'))
    plt.tight_layout()
    HTML(dl.report.HTMLBuilder().watermark())

Refer to the following screenshot for the end result (refer to the central_tendency.ipynb file in this book's code bundle):

(Screenshot: the truncated mean, trimean, and IQM for temperature and wind speed, plus the truncated distributions of both variables)
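The dl.stats.outliers() and dl.stats.trimean() calls in the recipe come from the book's dautil helper package. If you don't have dautil, the following minimal sketch reimplements the same measures with plain NumPy and SciPy; the percentile-limit logic is an assumption about what dl.stats.outliers() does and may differ in detail from the original helpers:

    import numpy as np
    from scipy.stats import tmean

    def percentile_limits(a, percentile):
        # Assumed stand-in for dl.stats.outliers(..., method='percentiles'):
        # the values at the given lower and upper percentiles
        lo, hi = np.percentile(a, [percentile, 100 - percentile])
        return lo, hi

    def truncated_mean(a, percentile):
        # Mean of the values between the two percentile limits
        return tmean(a, limits=percentile_limits(a, percentile))

    def trimean(a):
        # Trimean (4.1): weighted average of the quartiles
        q1, q2, q3 = np.percentile(a, [25, 50, 75])
        return (q1 + 2 * q2 + q3) / 4

    def iqm(a):
        # IQM (4.2): truncated mean with the lowest and highest quartiles discarded
        return truncated_mean(a, 25)

    np.random.seed(0)
    data = np.random.lognormal(size=1000)  # made-up, right-skewed sample
    print('Mean', data.mean())
    print('Truncated mean (5%)', truncated_mean(data, 5))
    print('Trimean', trimean(data))
    print('IQM', iqm(data))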

See also
