Highlighting data points with influence plots

Influence plots combine three quantities for each data point after a fit: its residual, its leverage, and its influence, in a layout similar to a bubble plot. The residual is plotted on the vertical axis, and a large absolute value can indicate that the data point is an outlier. To understand influence plots, take a look at the following equations:
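The equations can be sketched in their standard textbook forms as follows. This is a hedged reconstruction: the mapping of the numbers (2.1) to (2.5) onto these formulas is an assumption based on how the text refers to them, and the book's own notation may differ in detail. The symbols $\hat{\epsilon}_i$ (the residual of observation $i$), $h_{ii}$ (the $i$-th diagonal element of $H$), and $X$ (the design matrix) are introduced here for illustration.

```latex
\begin{align}
t_i &= \frac{\hat{\epsilon}_i}{\hat{\sigma}\sqrt{1 - h_{ii}}} \tag{2.1} \\
\hat{\sigma}^2 &= \frac{1}{n - p} \sum_{j=1}^{n} \hat{\epsilon}_j^{\,2} \tag{2.2} \\
H &= X \left(X^{\top} X\right)^{-1} X^{\top} \tag{2.3} \\
D_i &= \frac{t_i^{2}}{p} \cdot \frac{h_{ii}}{1 - h_{ii}} \tag{2.4} \\
\mathrm{DFFITS}_i &= t_{(i)} \sqrt{\frac{h_{ii}}{1 - h_{ii}}} \tag{2.5}
\end{align}
```

In this sketch, $p$ counts the columns of $X$ (including the intercept, if there is one), and $t_{(i)}$ is the externally studentized residual: the same ratio as in (2.1), but with $\hat{\sigma}$ estimated from a fit that leaves out observation $i$.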

According to the statsmodels documentation, the residuals are scaled by their standard deviation (2.1). In (2.2), n is the number of observations and p is the number of regressors. The so-called hat matrix is given by (2.3).

The diagonal elements of the hat matrix give a metric called leverage. Leverage is plotted on the horizontal axis of an influence plot and indicates how much potential influence a data point has. The size of each plotted point reflects its influence. Influential points tend to have both large residuals and high leverage. To measure influence, statsmodels can use either Cook's distance (2.4) or DFFITS (2.5).
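As a rough illustration of how leverage, residuals, and influence relate, the following sketch computes them from scratch with NumPy for a small synthetic regression. The function name `influence_measures` and the data are made up for this example, and the formulas are the standard textbook ones; the statsmodels implementation may differ in detail.

```python
import numpy as np


def influence_measures(X, y):
    """Return leverage, externally studentized residuals,
    Cook's distance and DFFITS for an OLS fit of y on X."""
    n, p = X.shape
    # Hat matrix H = X (X'X)^-1 X'; its diagonal is the leverage.
    H = X @ np.linalg.solve(X.T @ X, X.T)
    leverage = np.diag(H)
    resid = y - H @ y
    # Internally studentized residuals use the full-sample variance.
    sigma2 = resid @ resid / (n - p)
    r_int = resid / np.sqrt(sigma2 * (1 - leverage))
    # Externally studentized residuals use a leave-one-out variance.
    sigma2_loo = (resid @ resid - resid**2 / (1 - leverage)) / (n - p - 1)
    r_ext = resid / np.sqrt(sigma2_loo * (1 - leverage))
    cooks = r_int**2 / p * leverage / (1 - leverage)
    dffits = r_ext * np.sqrt(leverage / (1 - leverage))
    return leverage, r_ext, cooks, dffits


# Synthetic data: a clean linear trend plus one planted outlier.
rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=30)
x[0], y[0] = 6.0, -5.0  # far from the other x values and off the trend

X = np.column_stack([np.ones_like(x), x])  # intercept column plus x
leverage, r_ext, cooks, dffits = influence_measures(X, y)
most_influential = int(np.argmax(cooks))
print("Most influential observation:", most_influential)
```

The planted point combines high leverage with a large residual, which is the combination that an influence plot draws as a large bubble toward the right-hand side.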

How to do it...

The imports are as follows:

import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols
from dautil import data

Get the available country codes:

dawb = data.Worldbank()

countries = dawb.get_countries()[['name', 'iso2c']]

Load the data from the World Bank:

population = dawb.download(indicator=[dawb.get_name('pop_grow'),
                                      dawb.get_name('gdp_pcap'),
                                      dawb.get_name('primary_education')],
                           country=countries['iso2c'], start=2014, end=2014)

population = dawb.rename_columns(population)

Define an ordinary least squares model, as follows:

population_model = ols("pop_grow ~ gdp_pcap + primary_education",
                       data=population).fit()

Display an influence plot of the model using Cook's distance:

%matplotlib inline
fig, ax = plt.subplots(figsize=(19.2, 14.4))
fig = sm.graphics.influence_plot(population_model, ax=ax, criterion="cooks")
plt.grid()

Refer to the following plot for the end result:

The code is in the highlighting_influence.ipynb file in this book's code bundle.
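To inspect the numbers behind such a plot, the fitted results object can be asked for its influence measures directly. The following is a minimal sketch on synthetic data, not the recipe's World Bank data: the column names mirror the recipe, but the values are made up, and the names `df`, `summary`, and `top` are chosen here for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic stand-in for the World Bank frame; the values are made up.
rng = np.random.default_rng(42)
n = 50
df = pd.DataFrame({
    "gdp_pcap": rng.uniform(1, 50, size=n),
    "primary_education": rng.uniform(60, 100, size=n),
})
df["pop_grow"] = 3.0 - 0.04 * df["gdp_pcap"] + rng.normal(scale=0.3, size=n)

model = ols("pop_grow ~ gdp_pcap + primary_education", data=df).fit()
influence = model.get_influence()

leverage = influence.hat_matrix_diag         # diagonal of the hat matrix
cooks_d, _ = influence.cooks_distance        # (distances, p-values)
dffits, dffits_cutoff = influence.dffits     # (values, rule-of-thumb cutoff)
student = influence.resid_studentized_external

summary = pd.DataFrame({
    "leverage": leverage,
    "student_resid": student,
    "cooks_d": cooks_d,
    "dffits": dffits,
})
# The observations with the largest Cook's distance come first.
top = summary.sort_values("cooks_d", ascending=False).head()
print(top)
```

Sorting by `dffits` (in absolute value) instead of `cooks_d` gives the ranking that corresponds to the other influence criterion mentioned above.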
