Evaluating relations between variables with ANOVA

Analysis of variance (ANOVA) is a statistical data analysis method invented by statistician Ronald Fisher. This method partitions the data of a continuous variable using the values of one or more corresponding categorical variables, and then compares the variance between the resulting groups with the variance within them. ANOVA is a form of linear modeling. If we are modeling with one categorical variable, we speak of one-way ANOVA. In this recipe, we will use two categorical variables, so we have two-way ANOVA. In two-way ANOVA, we create a contingency table, that is, a table containing counts for all combinations of the two categorical variables (we will see a contingency table example soon). The linear model is then given by the equation:

$$\mu_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij}$$

This is an additive model where μij is the mean of the continuous variable corresponding to one cell of the contingency table, μ is the mean for the whole data set, αi is the contribution of the first categorical variable, βj is the contribution of the second categorical variable, and γij is a cross-term (the interaction between the two categorical variables). We will apply this model to weather data.
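To make the decomposition concrete, here is a minimal sketch on a small synthetic data set. The column names `A`, `B`, and `y` and all the values are made up for illustration, and the design is assumed to be balanced (the same number of observations in every cell), so the grand mean equals the average of the cell means. The sketch builds the contingency table of counts, computes the cell means, and splits each cell mean into the four terms of the model.

```python
import numpy as np
import pandas as pd

# Synthetic, balanced data (hypothetical values, not the book's weather data):
# factor A has two levels, factor B has three, with two observations per cell.
df = pd.DataFrame({
    'A': ['a1'] * 6 + ['a2'] * 6,
    'B': ['b1', 'b1', 'b2', 'b2', 'b3', 'b3'] * 2,
    'y': [1.0, 3.0, 2.0, 4.0, 6.0, 8.0,
          5.0, 7.0, 3.0, 5.0, 4.0, 6.0],
})

# Contingency table: number of observations for each (A, B) combination.
counts = pd.crosstab(df['A'], df['B'])

# Cell means mu_ij of the continuous variable, laid out like the contingency table.
cell_means = df.groupby(['A', 'B'])['y'].mean().unstack()

mu = df['y'].mean()                    # grand mean
alpha = cell_means.mean(axis=1) - mu   # contribution of A (one value per row)
beta = cell_means.mean(axis=0) - mu    # contribution of B (one value per column)

# Cross-term: what is left after removing the grand mean and both contributions.
gamma = cell_means.sub(alpha, axis=0).sub(beta, axis=1) - mu

# Putting the four terms back together recovers the cell means.
reconstructed = gamma.add(alpha, axis=0).add(beta, axis=1) + mu

print(counts)
print(cell_means)
print(gamma)
```

With these definitions, the contributions of each variable sum to zero, and so does every row and column of the cross-term table. In an unbalanced design the unweighted average of cell means no longer matches the overall mean, so this simple arithmetic would need to be adapted.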

How to do it...

The following steps apply two-way ANOVA with wind speed as the continuous variable, rain as a binary variable, and wind direction as a categorical variable:

  1. The imports are as follows:
    from statsmodels.formula.api import ols
    import dautil as dl
    from statsmodels.stats.anova import anova_lm
    import seaborn as sns
    import matplotlib.pyplot as plt
    from IPython.display import HTML
  2. Load the data and fit the model with statsmodels:
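    # Load the weather data and drop rows with missing values
    df = dl.data.Weather.load().dropna()
    # Turn RAIN into a boolean: True where the value is positive
    df['RAIN'] = df['RAIN'] > 0
    # C(...) marks a variable as categorical in the model formula
    formula = 'WIND_SPEED ~ C(RAIN) + C(WIND_DIR)'
    lm = ols(formula, df).fit()
    # Collect the ANOVA table in an HTML report
    hb = dl.HTMLBuilder()
    hb.h1('ANOVA Applied to Weather Data')
    hb.h2('ANOVA results')
    hb.add_df(anova_lm(lm), index=True)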
  3. Display a truncated contingency table and visualize the data with Seaborn:
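    # Replace the wind direction with its category
    df['WIND_DIR'] = dl.data.Weather.categorize_wind_dir(df)
    hb.h2('Truncated Contingency table')
    # Count the observations for each combination of rain and wind direction
    hb.add_df(df.groupby([df['RAIN'], df['WIND_DIR']]).count().head(3), index=True)

    # Plot the mean wind speed per wind direction, split by rain
    sns.pointplot(y='WIND_SPEED', x='WIND_DIR',
                  hue='RAIN', data=df[['WIND_SPEED', 'RAIN', 'WIND_DIR']])
    # Render the report in the notebook
    HTML(hb.html)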

Refer to the following screenshot for the end result (see the anova.ipynb file in this book's code bundle):

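[Screenshot: the ANOVA results table, the truncated contingency table, and the Seaborn point plot of wind speed by wind direction and rain]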
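The formula in the recipe contains only the two main effects, while the model described at the start of the recipe also has a cross-term. In the formula notation used by statsmodels, `a * b` expands to `a + b + a:b`, so the cross-term can be added by replacing `+` with `*`. The following sketch illustrates this on synthetic data. The data frame is a made-up stand-in for the weather data (random values with invented effect sizes), not the data set loaded by `dautil`, so the numbers it prints say nothing about real weather.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Synthetic stand-in for the weather data (hypothetical values).
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    'RAIN': rng.random(n) < 0.4,
    'WIND_DIR': rng.choice(['N', 'E', 'S', 'W'], size=n),
})

# Made-up effects: each direction has its own base speed, rain adds 1.5.
base = df['WIND_DIR'].map({'N': 3.0, 'E': 2.5, 'S': 4.0, 'W': 5.0})
df['WIND_SPEED'] = base + 1.5 * df['RAIN'] + rng.normal(0, 1, n)

# Main effects only, as in the recipe.
additive = ols('WIND_SPEED ~ C(RAIN) + C(WIND_DIR)', df).fit()
# Main effects plus the cross-term C(RAIN):C(WIND_DIR).
interaction = ols('WIND_SPEED ~ C(RAIN) * C(WIND_DIR)', df).fit()

# ANOVA table of the model with the cross-term.
table = anova_lm(interaction)
# Compare the two nested models to test whether the cross-term helps.
comparison = anova_lm(additive, interaction)

print(table)
print(comparison)
```

The ANOVA table of the second model has an extra row for the cross-term. Passing both fitted models to `anova_lm` tests whether adding that term reduces the residual sum of squares by more than chance would explain.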

See also
