Robust regression is designed to cope with outliers in data better than ordinary regression. It relies on special robust estimators, several of which are supported by statsmodels. No single estimator is best in every case, so the choice of estimator depends on the data and the model.
In this recipe, we will fit the annual sunspot counts available in statsmodels. We will define a simple model in which the current count depends linearly on the previous value. To demonstrate the effect of outliers, I replaced the first observation with a large value (100); we will then compare the robust regression model with an ordinary least squares model.
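The lag-1 model described above can be sketched with plain NumPy before bringing in statsmodels. The short series and the helper name fit_lag1 are hypothetical and serve only to show how the response and the regressor are built from one array, and how a single large value changes an ordinary least squares fit.

```python
import numpy as np


def fit_lag1(series):
    """Least squares fit of series[t] = a + b * series[t - 1]."""
    y = series[1:]    # current values
    x = series[:-1]   # previous values
    # Design matrix with an intercept column, similar to sm.add_constant.
    X = np.column_stack([np.ones_like(x), x])
    (a, b), *_ = np.linalg.lstsq(X, y, rcond=None)
    return a, b


# Hypothetical short series standing in for the annual counts.
series = np.array([5.0, 11.0, 16.0, 23.0, 36.0, 58.0, 29.0, 20.0])

# Same series with the first value replaced by a large number.
contaminated = series.copy()
contaminated[0] = 100.0

a_clean, b_clean = fit_lag1(series)
a_out, b_out = fit_lag1(contaminated)
print(f'clean:        intercept = {a_clean:.2f}, slope = {b_clean:.2f}')
print(f'contaminated: intercept = {a_out:.2f}, slope = {b_out:.2f}')
```

With so few points, one altered value is enough to move the least squares slope noticeably, which is the effect the recipe sets out to demonstrate on the full dataset.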
The following steps describe how to apply the robust linear model:
import statsmodels.api as sm
import matplotlib.pyplot as plt
import dautil as dl
from IPython.display import HTML
def set_labels(ax):
    ax.set_xlabel('Year')
    ax.set_ylabel('Sunactivity')
def plot_fit(df, ax, results):
    x = df['YEAR']
    cp = dl.plotting.CyclePlotter(ax)
    cp.plot(x[1:], df['SUNACTIVITY'][1:], label='Data')
    cp.plot(x[2:], results.predict()[1:], label='Fit')
    ax.legend(loc='best')
df = sm.datasets.sunspots.load_pandas().data
vals = df['SUNACTIVITY'].values
# Outlier added by malicious person, because no one
# laughs at his jokes.
vals[0] = 100
rlm_model = sm.RLM(vals[1:], sm.add_constant(vals[:-1]),
                   M=sm.robust.norms.TrimmedMean())
rlm_results = rlm_model.fit()
hb = dl.report.HTMLBuilder()
hb.h1('Fitting a robust linear model')
hb.h2('Robust Linear Model')
hb.add(rlm_results.summary().tables[1].as_html())
hb.h2('Ordinary Linear Model')
ols_model = sm.OLS(vals[1:], sm.add_constant(vals[:-1]))
ols_results = ols_model.fit()
hb.add(ols_results.summary().tables[1].as_html())
fig, [ax, ax2] = plt.subplots(2, 1)
plot_fit(df, ax, rlm_results)
ax.set_title('Robust Linear Model')
set_labels(ax)
ax2.set_title('Ordinary Least Squares')
plot_fit(df, ax2, ols_results)
set_labels(ax2)
plt.tight_layout()
HTML(hb.html)
Refer to the following screenshot for the end result (see the rlm_demo.ipynb file in this book's code bundle):