Winsorizing, named after Charles Winsor, is another technique for dealing with outliers. In effect, Winsorization clips outliers to given percentiles in a symmetric fashion: values below the lower percentile are replaced by the value at that percentile, and values above the upper percentile are replaced by the value at the upper percentile. For instance, we can clip to the 5th and 95th percentiles. SciPy has a winsorize() function, which performs this procedure. The data for this recipe is the same as that for the Clipping and filtering outliers recipe.
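To make the clipping behavior concrete, here is a minimal sketch on a small made-up array (the values are illustrative only and are not part of the recipe's dataset). It assumes that limits is given as the fraction of observations to replace at each end, so 0.1 on ten values replaces one value on each side with its nearest remaining neighbor:

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Hypothetical toy data: nine well-behaved values and one extreme outlier.
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Replace the lowest 10% and the highest 10% of the values
# (one value at each end here) with the nearest remaining value.
winsorized = winsorize(data, limits=(0.1, 0.1))

print(winsorized.tolist())
print(data.mean(), winsorized.mean())
```

Unlike filtering, the array keeps its original length; only the extreme values change, which pulls the mean back toward the bulk of the data.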
Winsorize the data with the following procedure:
from scipy.stats.mstats import winsorize
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
import dautil as dl
from IPython.display import HTML
# Load the starsCYG data and winsorize 15% of the values at each end
starsCYG = sm.datasets.get_rdataset("starsCYG", "robustbase", cache=True).data
limit = 0.15

# Winsorize the log.Te column only
winsorized_x = starsCYG.copy()
winsorized_x['log.Te'] = winsorize(starsCYG['log.Te'], limits=limit)

# Winsorize the log.light column only
winsorized_y = starsCYG.copy()
winsorized_y['log.light'] = winsorize(starsCYG['log.light'], limits=limit)

# Winsorize both columns
winsorized_xy = starsCYG.apply(winsorize, limits=[limit, limit])
# Note: context is not defined in this listing; it is assumed to be
# set up earlier in the notebook.
sp = dl.plotting.Subplotter(2, 2, context)
sp.label()
sns.regplot(x='log.Te', y='log.light', data=starsCYG, ax=sp.ax)

sp.label(advance=True)
sns.regplot(x='log.Te', y='log.light', data=winsorized_x, ax=sp.ax)

sp.label(advance=True)
sns.regplot(x='log.Te', y='log.light', data=winsorized_y, ax=sp.ax)

sp.label(advance=True)
sns.regplot(x='log.Te', y='log.light', data=winsorized_xy, ax=sp.ax)
plt.tight_layout()
HTML(dl.report.HTMLBuilder().watermark())
Refer to the following screenshot for the end result (see the winsorising_data.ipynb file in this book's code bundle):