Data that doesn't follow a known distribution, such as the normal distribution, is often difficult to manage. A popular strategy to get control of the data is to apply the Box-Cox transformation. It is given by the following equation:
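For a positive observation $y_i$ and a parameter $\lambda$, the usual one-parameter form of the transformation can be written as follows. The symbols $y_i$ and $\lambda$ are chosen here for illustration and are assumptions about the notation:

```latex
y_i^{(\lambda)} =
\begin{cases}
\dfrac{y_i^{\lambda} - 1}{\lambda}, & \text{if } \lambda \neq 0, \\[1.5ex]
\ln y_i, & \text{if } \lambda = 0.
\end{cases}
```

The logarithm is the limit of the first branch as $\lambda \to 0$, so the transformation is continuous in $\lambda$.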
The scipy.stats.boxcox() function applies the transformation to positive data. We will use the same data as in the Clipping and filtering outliers recipe. With Q-Q plots, we will show that the Box-Cox transformation does indeed make the data appear more normal.
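As a quick illustration of the call itself, separate from the recipe's data, the following minimal sketch applies boxcox() to synthetic log-normal values. The synthetic data, the seed, and the variable names are assumptions made for this example only. When no lambda is passed, the function returns the transformed values together with the lambda it estimated, and scipy.special.inv_boxcox() can undo the transformation.

```python
import numpy as np
from scipy.stats import boxcox
from scipy.special import inv_boxcox

# Synthetic, strictly positive, right-skewed data (an assumption for
# this sketch; the recipe itself uses the starsCYG dataset).
rng = np.random.default_rng(42)
data = rng.lognormal(mean=0.0, sigma=0.5, size=1000)

# With no lambda given, boxcox() estimates it by maximum likelihood
# and returns (transformed values, estimated lambda).
transformed, lmbda = boxcox(data)

# Apply the textbook formula by hand with the estimated lambda.
if lmbda != 0:
    manual = (data ** lmbda - 1) / lmbda
else:
    manual = np.log(data)

# Invert the transformation to get back the original values.
recovered = inv_boxcox(transformed, lmbda)

print("estimated lambda:", lmbda)
print("matches manual formula:", np.allclose(transformed, manual))
print("round trip recovers data:", np.allclose(recovered, data))
```

Because the input must be strictly positive, data containing zeros or negative values would need to be shifted before calling the function.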
The following steps show how to normalize data with the Box-Cox transformation:
import statsmodels.api as sm
import matplotlib.pyplot as plt
from scipy.stats import boxcox
import seaborn as sns
import dautil as dl
from IPython.display import HTML

context = dl.nb.Context('normalizing_boxcox')
starsCYG = sm.datasets.get_rdataset("starsCYG", "robustbase", cache=True).data
var = 'log.Te'

# Data must be positive
transformed, _ = boxcox(starsCYG[var])

sp = dl.plotting.Subplotter(2, 2, context)
sp.label()
sm.qqplot(starsCYG[var], fit=True, line='s', ax=sp.ax)

sp.label(advance=True)
sm.qqplot(transformed, fit=True, line='s', ax=sp.ax)

sp.label(advance=True)
sns.distplot(starsCYG[var], ax=sp.ax)

sp.label(advance=True)
sns.distplot(transformed, ax=sp.ax)
plt.tight_layout()
HTML(dl.report.HTMLBuilder().watermark())
Refer to the following screenshot for the end result (refer to the normalizing_boxcox.ipynb file in this book's code bundle):
The Q-Q plots in the previous screenshot graph the theoretical quantiles of the normal distribution against the quantiles of the actual data. To help evaluate conformance to the normal distribution, I displayed a line that perfectly normal data would follow. The more closely the points follow the line, the more normal the data is. As you can see, the transformed data fits the line better and is, therefore, closer to normal. The distribution plots should help you confirm this.
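The visual comparison can also be backed by a number. The sketch below is not part of the recipe: it uses synthetic log-normal data (an assumption, along with the seed and variable names) and scipy.stats.probplot(), which fits a least-squares line through the Q-Q points and returns the correlation coefficient r of that fit. A value of r closer to 1 means the points lie closer to a straight line.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed data, assumed here purely for illustration.
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=0.75, size=500)
transformed, _ = stats.boxcox(skewed)

# probplot returns ((theoretical quantiles, ordered data),
# (slope, intercept, r)); only r is needed here.
_, (_, _, r_before) = stats.probplot(skewed, dist="norm")
_, (_, _, r_after) = stats.probplot(transformed, dist="norm")

# Sample skewness is another rough indicator; 0 means symmetric.
skew_before = stats.skew(skewed)
skew_after = stats.skew(transformed)

print("Q-Q correlation before:", r_before, "after:", r_after)
print("skewness before:", skew_before, "after:", skew_after)
```

A higher correlation and a skewness nearer to zero after the transformation point in the same direction as the Q-Q plots, although neither is a formal test of normality.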