Highlighting data points with influence plots

Influence plots resemble bubble plots and combine three quantities for each data point after a fit: its residual, its leverage, and its influence. The residuals are plotted on the vertical axis, and a large residual can indicate that a data point is an outlier. To understand influence plots, take a look at the following equations:

[Equations (2.1) to (2.5)]

According to the statsmodels documentation, the residuals are scaled by their standard deviation (2.1). In (2.2), n is the number of observations and p is the number of regressors. The so-called hat matrix is given by (2.3).
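For reference, the quantities just mentioned can be written in their usual textbook form. This is a sketch rather than a copy of the numbered equations: the symbols X (the design matrix), e_i (the ith residual), h_ii, s, and r_i are introduced here for illustration, and p is assumed to count every column of X, including the intercept column.

```latex
\begin{align*}
H &= X (X^{\top} X)^{-1} X^{\top}, \qquad h_{ii} = H_{ii} \\
\operatorname{Var}(e_i) &= \sigma^2 (1 - h_{ii}) \\
s^2 &= \frac{1}{n - p} \sum_{j=1}^{n} e_j^2, \qquad
r_i = \frac{e_i}{s \sqrt{1 - h_{ii}}}
\end{align*}
```

Here r_i is the residual scaled by its estimated standard deviation, so points with high leverage h_ii have their raw residuals inflated accordingly.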

The diagonal elements of the hat matrix give a metric called leverage. Leverage is plotted on the horizontal axis of an influence plot and indicates how much potential influence a point has. The influence itself determines the size of each plotted point. Influential points tend to have both large residuals and high leverage. To measure influence, statsmodels can use either Cook's distance (2.4) or DFFITS (2.5).
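To make these measures concrete, the following minimal sketch computes leverage, Cook's distance, and DFFITS by hand for a simple linear regression with one regressor and an intercept. It uses the standard textbook formulas, which may differ in notation from (2.4) and (2.5). The function name influence_measures and the small data set are hypothetical and only serve as an illustration.

```python
import math


def influence_measures(x, y):
    """Return (leverage, Cook's distance, DFFITS) per point for y = a + b*x."""
    n = len(x)
    p = 2  # fitted parameters: intercept and slope (an assumption of this sketch)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
    intercept = y_mean - slope * x_mean
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

    # Diagonal of the hat matrix; closed form for one regressor plus intercept.
    leverage = [1 / n + (xi - x_mean) ** 2 / sxx for xi in x]

    rss = sum(e ** 2 for e in resid)
    mse = rss / (n - p)
    rows = []
    for e, h in zip(resid, leverage):
        # Cook's distance: squared residual scaled by p * MSE, weighted by leverage.
        cooks_d = e ** 2 / (p * mse) * h / (1 - h) ** 2
        # Error variance estimated with the current point left out.
        s2_loo = (rss - e ** 2 / (1 - h)) / (n - p - 1)
        # Externally studentized residual, then DFFITS.
        t = e / math.sqrt(s2_loo * (1 - h))
        dffits = t * math.sqrt(h / (1 - h))
        rows.append((h, cooks_d, dffits))
    return rows


# Hypothetical data: the last point lies far from the others on the x axis.
x = [1, 2, 3, 4, 5, 6, 7, 20]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 6.8, 10.0]

for i, (h, d, f) in enumerate(influence_measures(x, y)):
    print(f"{i}: leverage={h:.3f} cooks_d={d:.3f} dffits={f:.3f}")
```

In this data set the last point has by far the highest leverage and Cook's distance, so in an influence plot it would sit far to the right and be drawn as the largest bubble.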

How to do it...

  1. The imports are as follows:
    import matplotlib.pyplot as plt
    import statsmodels.api as sm
    from statsmodels.formula.api import ols
    from dautil import data
  2. Get the available country codes:
    dawb = data.Worldbank()
    
    countries = dawb.get_countries()[['name', 'iso2c']]
  3. Load the data from the World Bank:
    population = dawb.download(
        indicator=[dawb.get_name('pop_grow'),
                   dawb.get_name('gdp_pcap'),
                   dawb.get_name('primary_education')],
        country=countries['iso2c'], start=2014, end=2014)
    
    population = dawb.rename_columns(population)
  4. Define an ordinary least squares model, as follows:
    population_model = ols("pop_grow ~ gdp_pcap + primary_education",
                           data=population).fit()
  5. Display an influence plot of the model using Cook's distance:
    %matplotlib inline
    fig, ax = plt.subplots(figsize=(19.2, 14.4))
    fig = sm.graphics.influence_plot(population_model, ax=ax, criterion="cooks")
    plt.grid()

Refer to the following plot for the end result:

[Figure: influence plot of the fitted model]

The code is in the highlighting_influence.ipynb file in this book's code bundle.
