Influence plots, which resemble bubble plots, combine three quantities for each data point: the residual after a fit, the leverage, and the influence. The residuals are plotted on the vertical axis; a large residual can indicate that a data point is an outlier. To understand influence plots, take a look at the following equations:
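In standard regression notation, the quantities numbered (2.1) to (2.5) can be written roughly as follows. The symbols are assumptions based on the usual textbook definitions: \hat{\varepsilon}_i is the residual of observation i, h_{ii} is the i-th diagonal element of the hat matrix, \hat{\sigma}_{(i)} is the standard deviation estimated with observation i left out, s^2 is the mean squared error of the full fit, and \hat{y}_{i(i)} is the prediction for observation i from the fit that excludes it.

```latex
\begin{align}
t_i &= \frac{\hat{\varepsilon}_i}{\hat{\sigma}_{(i)} \sqrt{1 - h_{ii}}} \tag{2.1} \\
\hat{\sigma}_{(i)}^2 &= \frac{1}{n - p - 1} \sum_{j \neq i} \hat{\varepsilon}_j^2 \tag{2.2} \\
H &= X \left( X^{\top} X \right)^{-1} X^{\top} \tag{2.3} \\
D_i &= \frac{\hat{\varepsilon}_i^2}{p \, s^2} \cdot \frac{h_{ii}}{\left( 1 - h_{ii} \right)^2} \tag{2.4} \\
\mathrm{DFFITS}_i &= \frac{\hat{y}_i - \hat{y}_{i(i)}}{\hat{\sigma}_{(i)} \sqrt{h_{ii}}} \tag{2.5}
\end{align}
```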
According to the statsmodels documentation, the residuals are scaled by their standard deviation (2.1). In (2.2), n is the number of observations and p is the number of regressors. The so-called hat matrix is given by (2.3).
The diagonal elements of the hat matrix give a metric called leverage. Leverage is plotted on the horizontal axis and indicates a point's potential influence. The influence itself determines the size of the plotted points. Influential points tend to have both large residuals and high leverage. To measure influence, statsmodels can use either Cook's distance (2.4) or DFFITS (2.5).
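To make these definitions concrete, the following is a minimal NumPy sketch that computes leverage, studentized residuals, Cook's distance, and DFFITS by hand. It is not how statsmodels implements them. The data is synthetic and hypothetical, with one high-leverage point planted deliberately, and p is assumed to count every fitted parameter, including the intercept.

```python
import numpy as np

# Synthetic, hypothetical data: one regressor plus an intercept.
rng = np.random.default_rng(0)
n = 30
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
x[0], y[0] = 6.0, 0.0  # plant one point far from the rest and off the trend

X = np.column_stack([np.ones(n), x])  # design matrix with an intercept column
p = X.shape[1]                        # number of fitted parameters (assumption)

# Hat matrix H = X (X'X)^-1 X'; its diagonal holds the leverage values.
H = X @ np.linalg.solve(X.T @ X, X.T)
leverage = np.diag(H)

# Ordinary residuals and the mean squared error of the full fit.
resid = y - H @ y
s2 = resid @ resid / (n - p)

# Leave-one-out variance estimate, obtained without refitting n times.
s2_loo = ((n - p) * s2 - resid**2 / (1 - leverage)) / (n - p - 1)

# Externally studentized residuals (vertical axis of an influence plot).
student_resid = resid / np.sqrt(s2_loo * (1 - leverage))

# Two influence measures (either can set the point size).
cooks_d = resid**2 / (p * s2) * leverage / (1 - leverage) ** 2
dffits = student_resid * np.sqrt(leverage / (1 - leverage))

worst = int(np.argmax(cooks_d))
print(worst, leverage[worst], student_resid[worst], cooks_d[worst])
```

The planted point combines high leverage with a large residual, which is exactly the combination that the text describes as influential.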
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols
from dautil import data

dawb = data.Worldbank()
countries = dawb.get_countries()[['name', 'iso2c']]

population = dawb.download(indicator=[dawb.get_name('pop_grow'),
                                      dawb.get_name('gdp_pcap'),
                                      dawb.get_name('primary_education')],
                           country=countries['iso2c'],
                           start=2014, end=2014)
population = dawb.rename_columns(population)

population_model = ols("pop_grow ~ gdp_pcap + primary_education",
                       data=population).fit()

%matplotlib inline
fig, ax = plt.subplots(figsize=(19.2, 14.4))
fig = sm.graphics.influence_plot(population_model, ax=ax, criterion="cooks")
plt.grid()
Refer to the following plot for the end result:
The code is in the highlighting_influence.ipynb file in this book's code bundle.
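To read the numbers behind an influence plot rather than only looking at the picture, a fitted statsmodels OLS result exposes them through get_influence(). The sketch below uses hypothetical synthetic data with made-up column names (x1, x2, y) in place of the World Bank indicators, so that it runs without dautil or network access.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Hypothetical stand-in data; the recipe itself uses World Bank indicators.
rng = np.random.default_rng(42)
df = pd.DataFrame({"x1": rng.normal(size=40), "x2": rng.normal(size=40)})
df["y"] = 0.5 + 1.5 * df["x1"] - 0.8 * df["x2"] + rng.normal(scale=0.3, size=40)

model = ols("y ~ x1 + x2", data=df).fit()
influence = model.get_influence()

# Collect the quantities that an influence plot draws.
table = pd.DataFrame({
    "leverage": influence.hat_matrix_diag,                      # horizontal axis
    "student_resid": influence.resid_studentized_external,      # vertical axis
    "cooks_d": influence.cooks_distance[0],  # first element holds the distances
    "dffits": influence.dffits[0],           # first element holds the values
}, index=df.index)

# Show the observations with the largest Cook's distance first.
print(table.sort_values("cooks_d", ascending=False).head())
```

Sorting by cooks_d or by the absolute value of dffits gives a quick shortlist of the points that would appear as the largest bubbles.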