Normalizing with the Box-Cox transformation

Data that doesn't follow a well-understood distribution, such as the normal distribution, is often difficult to analyze. A popular strategy for bringing such data closer to normality is to apply the Box-Cox transformation. It is given by the following equation:

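$$
y_i^{(\lambda)} =
\begin{cases}
\dfrac{y_i^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0 \\
\ln y_i & \text{if } \lambda = 0
\end{cases}
$$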

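The scipy.stats.boxcox() function applies the transformation to strictly positive data. When no value is supplied for the transformation parameter, lambda, the function estimates it by maximizing the log-likelihood and returns it alongside the transformed data. We will use the same data as in the Clipping and filtering outliers recipe. With Q-Q plots, we will show that the Box-Cox transformation does indeed make the data appear more normal.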
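
As a minimal sketch of what boxcox() computes, the following example compares its output with the standard Box-Cox formula, (y**lambda - 1) / lambda for nonzero lambda and log(y) for lambda equal to zero. The synthetic lognormal sample and the boxcox_by_hand() helper are assumptions made for illustration and are not part of the recipe:

```python
import numpy as np
from scipy.stats import boxcox

# Synthetic, strictly positive, right-skewed sample (illustration only,
# not the starsCYG data used in the recipe)
rng = np.random.default_rng(42)
y = rng.lognormal(mean=0.0, sigma=0.5, size=500)

# With lmbda omitted, boxcox() returns the transformed data and the
# lambda that maximizes the log-likelihood
transformed, fitted_lambda = boxcox(y)


def boxcox_by_hand(values, lmbda):
    """Hypothetical helper: the standard Box-Cox formula."""
    if lmbda == 0:
        return np.log(values)
    return (values ** lmbda - 1.0) / lmbda


by_hand = boxcox_by_hand(y, fitted_lambda)
print("fitted lambda:", round(float(fitted_lambda), 3))
print("matches scipy:", np.allclose(transformed, by_hand))

# Input containing non-positive values is rejected with a ValueError
try:
    boxcox(np.array([-1.0, 2.0, 3.0]))
    rejected = False
except ValueError as err:
    rejected = True
    print("ValueError:", err)
```

If your data contains zeros or negative values, one common workaround is to add a constant shift before transforming, but the choice of shift affects the result and should be made deliberately.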

How to do it...

The following steps show how to normalize data with the Box-Cox transformation:

  1. The imports are as follows:
    import statsmodels.api as sm
    import matplotlib.pyplot as plt
    from scipy.stats import boxcox
    import seaborn as sns
    import dautil as dl
    from IPython.display import HTML
  2. Load the data and transform it as follows:
    context = dl.nb.Context('normalizing_boxcox')
    
    starsCYG = sm.datasets.get_rdataset("starsCYG", "robustbase", cache=True).data
    
    var = 'log.Te'
    
    # Box-Cox requires strictly positive data; boxcox() also returns
    # the fitted lambda, which is discarded here
    transformed, _ = boxcox(starsCYG[var])
  3. Display the Q-Q plots and the distribution plots as follows:
    sp = dl.plotting.Subplotter(2, 2, context)
    sp.label()
    sm.qqplot(starsCYG[var], fit=True, line='s', ax=sp.ax)
    
    sp.label(advance=True)
    sm.qqplot(transformed, fit=True, line='s', ax=sp.ax)
    
    sp.label(advance=True)
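    # Note: distplot() is deprecated since seaborn 0.11;
    # histplot(..., kde=True) is the suggested replacement
    sns.distplot(starsCYG[var], ax=sp.ax)
    
    sp.label(advance=True)
    sns.distplot(transformed, ax=sp.ax)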
    plt.tight_layout()
    HTML(dl.report.HTMLBuilder().watermark())
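
The recipe discards the fitted lambda. If you need to transform further observations consistently or map values back to the original scale, you can keep it. The following is a sketch under stated assumptions: the data is synthetic, and the variable names are made up for illustration. It relies on scipy.special.inv_boxcox() for the inverse transformation:

```python
import numpy as np
from scipy.special import inv_boxcox
from scipy.stats import boxcox

# Synthetic positive samples (illustration only)
rng = np.random.default_rng(7)
train = rng.lognormal(mean=0.0, sigma=0.6, size=400)
new = rng.lognormal(mean=0.0, sigma=0.6, size=5)

# Keep the fitted lambda instead of discarding it
transformed_train, fitted_lambda = boxcox(train)

# Passing lmbda applies the transformation with that fixed value
# and returns only the transformed array
transformed_new = boxcox(new, lmbda=fitted_lambda)

# inv_boxcox() maps transformed values back to the original scale
restored = inv_boxcox(transformed_train, fitted_lambda)
roundtrip_ok = np.allclose(restored, train)
print("round trip recovers the data:", roundtrip_ok)
```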

Refer to the following screenshot for the end result (the code is in the normalizing_boxcox.ipynb file in this book's code bundle):

[Screenshot: Q-Q plots and distribution plots of log.Te before and after the Box-Cox transformation]

How it works

The Q-Q plots in the previous screenshot graph the theoretical quantiles of the normal distribution against the quantiles of the actual data. To help evaluate conformance to the normal distribution, I displayed a reference line that perfectly normal data would follow. The more closely the points follow the line, the closer the data is to normal. As you can see, the transformed data follows the line more closely and is, therefore, closer to normal. The distribution plots should help you to confirm this.
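
Judging a Q-Q plot by eye is subjective, so it can help to put a number on how well the points follow a straight line. One option, sketched here with synthetic data rather than the starsCYG dataset, is the correlation coefficient of the least-squares fit that scipy.stats.probplot() returns. The qq_correlation() helper is a hypothetical name introduced for this illustration:

```python
import numpy as np
from scipy.stats import boxcox, probplot

# Synthetic, strictly positive, right-skewed sample (illustration only)
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=1.0, sigma=0.8, size=300)
transformed, _ = boxcox(skewed)


def qq_correlation(values):
    """Hypothetical helper: correlation of the normal Q-Q points.

    probplot() returns the Q-Q points and a (slope, intercept, r)
    tuple for the least-squares line fitted through them.
    """
    _, (_, _, r) = probplot(values, dist="norm")
    return r


r_before = qq_correlation(skewed)
r_after = qq_correlation(transformed)
print("Q-Q correlation before:", round(float(r_before), 4))
print("Q-Q correlation after: ", round(float(r_after), 4))
```

A value closer to 1 means the points lie closer to a straight line. This is a rough diagnostic and not a formal normality test.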
