Correlating a binary and a continuous variable with the point-biserial correlation

The point-biserial correlation correlates a binary variable Y and a continuous variable X. The coefficient is calculated as follows:

    r_{pb} = \frac{M_1 - M_2}{s_n} \sqrt{\frac{n_1 n_2}{n^2}}    (3.21)

The subscripts in (3.21) correspond to the two groups of the binary variable. M1 is the mean of X for the values in group 1 of Y, and M2 is the mean of X for the values in group 0 of Y. Likewise, n1 and n2 are the numbers of observations in the two groups, n is the total number of observations, and s_n is the population standard deviation of X.
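As a rough check of the definition, the following sketch computes the coefficient by hand and compares it with scipy.stats.pointbiserialr(). The data is synthetic and the helper name point_biserial is hypothetical; neither comes from the recipe. The point-biserial coefficient is numerically the same as the Pearson correlation with Y coded as 0 and 1, so the sketch also compares against scipy.stats.pearsonr().

```python
import numpy as np
from scipy import stats


def point_biserial(binary, continuous):
    """Point-biserial correlation computed directly from group means.

    Hypothetical helper for illustration only.
    """
    binary = np.asarray(binary, dtype=bool)
    continuous = np.asarray(continuous, dtype=float)
    m1 = continuous[binary].mean()     # mean of X where Y is 1
    m2 = continuous[~binary].mean()    # mean of X where Y is 0
    n1 = binary.sum()                  # size of group 1
    n2 = (~binary).sum()               # size of group 0
    n = binary.size                    # total number of observations
    s_n = continuous.std()             # population standard deviation (ddof=0)
    return (m1 - m2) / s_n * np.sqrt(n1 * n2 / n ** 2)


# Synthetic stand-in data (an assumption, not the weather data of the recipe)
rng = np.random.default_rng(42)
rain = (rng.random(500) < 0.4).astype(int)
temp = 10 + 5 * rng.standard_normal(500) - 2 * rain

manual = point_biserial(rain, temp)
scipy_r = stats.pointbiserialr(rain, temp)[0]
pearson_r = stats.pearsonr(rain.astype(float), temp)[0]
print(manual, scipy_r, pearson_r)
```

The three printed values should agree up to floating-point rounding.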

In this recipe, the binary variable we will use is rain or no rain. We will correlate this variable with temperature.

How to do it...

We will calculate the correlation with the scipy.stats.pointbiserialr() function. We will also compute the rolling correlation over a 2-year window, using the np.roll() function to select each window. The steps are as follows:

  1. The imports are as follows:
    import dautil as dl
    from scipy import stats
    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd
    from IPython.display import HTML
  2. Load the data and correlate the two relevant arrays:
    df = dl.data.Weather.load().dropna()
    df['RAIN'] = df['RAIN'] > 0
    
    stats_corr = stats.pointbiserialr(df['RAIN'].values, df['TEMP'].values)
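  3. Compute the 2-year rolling correlation as follows:
    N = 2 * 365
    corrs = []
    rain = df['RAIN'].values
    temp = df['TEMP'].values
    
    for i in range(len(df.index) - N):
        # A negative shift brings element i to the front, so [:N]
        # selects the window i..i+N-1 instead of wrapping around
        x = np.roll(rain, -i)[:N]
        y = np.roll(temp, -i)[:N]
        corrs.append(stats.pointbiserialr(x, y)[0])
    
    # Average the daily values per year; recent pandas needs the explicit
    # mean() because resample() alone only returns a Resampler object
    corrs = pd.DataFrame(corrs,
                         index=df.index[N:],
                         columns=['Correlation']).resample('A').mean()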
  4. Plot the results with the following code:
    plt.plot(corrs.index.values, corrs.values)
    plt.hlines(stats_corr[0], corrs.index.values[0], corrs.index.values[-1],
               label='Correlation using the whole data set')
    plt.title('Rolling Point-biserial Correlation of Rain and Temperature with a 2 Year Window')
    plt.xlabel('Year')
    plt.ylabel('Correlation')
    plt.legend(loc='best')
    HTML(dl.report.HTMLBuilder().watermark())

Refer to the following screenshot for the end result (see the correlating_pointbiserial.ipynb file in this book's code bundle):
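The windowing idea can be tried without the weather data. The sketch below is not the recipe's code: it uses synthetic data and a short 30-day window (both assumptions), and shows that a negative np.roll() shift followed by a slice yields an ordinary sliding window. It then compares an explicit loop with the rolling Pearson correlation in pandas, a swapped-in alternative that should give the same numbers because the point-biserial coefficient equals the Pearson correlation for a 0/1 coded variable.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in data (an assumption, not the weather data of the recipe)
rng = np.random.default_rng(0)
n_days, window = 200, 30
index = pd.date_range('2000-01-01', periods=n_days, freq='D')
rain = (rng.random(n_days) < 0.4).astype(float)
temp = 10 + 5 * rng.standard_normal(n_days) - 2 * rain

# Shifting by -start moves element `start` to the front, so the first
# `window` entries of the rolled array are temp[start:start + window]
start = 7
same_window = np.array_equal(np.roll(temp, -start)[:window],
                             temp[start:start + window])

# Explicit loop over every full window
loop_corrs = [
    stats.pointbiserialr(rain[s:s + window], temp[s:s + window])[0]
    for s in range(n_days - window + 1)
]

# Alternative: rolling Pearson correlation in pandas; the first
# window - 1 entries are NaN because those windows are incomplete
rolled = (pd.Series(rain, index=index)
          .rolling(window)
          .corr(pd.Series(temp, index=index)))

print(same_window)
print(np.allclose(loop_corrs, rolled.dropna().to_numpy(), atol=1e-6))
```

Slicing is cheaper than calling np.roll() in every iteration, because np.roll() copies the whole array each time.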
