The Spearman rank correlation is the Pearson correlation applied to the ranks of two variables instead of their raw values. A rank is the position of a value in sorted order. Items with equal values all receive the same rank, which is the average of the positions they occupy. For instance, if two items of equal value occupy positions 2 and 3, both get the rank 2.5. Have a look at the following equations:
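Written out from the standard definitions and from the constants used in get_ci() later in this recipe, the three equations that the text refers to take roughly the following form. This is a reconstruction under those assumptions: the tags follow the references in the text, rg denotes the rank variables, and F(r) is taken to be arctanh(r), as in the code. Equation (3.18) is not referenced in the text and is left out.

```latex
\begin{align}
r_s &= \rho_{\operatorname{rg}_X,\operatorname{rg}_Y}
     = \frac{\operatorname{cov}(\operatorname{rg}_X, \operatorname{rg}_Y)}
            {\sigma_{\operatorname{rg}_X}\,\sigma_{\operatorname{rg}_Y}} \tag{3.17} \\
\sigma &= \frac{0.6325}{\sqrt{n - 1}} \tag{3.19} \\
z &= \sqrt{\frac{n - 3}{1.06}}\; F(r) \tag{3.20}
\end{align}
```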
In these equations, n is the sample size. (3.17) shows how the correlation is calculated. (3.19) gives the standard error. (3.20) defines the z-score, which we assume to be normally distributed. F(r) is the same function as in (3.14), since this is the same correlation, only applied to ranks.
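The relationship between ranks and the Spearman correlation can be checked directly. The following is a minimal sketch on made-up numbers (the arrays x and y are illustrative and not part of the weather data): it ranks both variables with scipy.stats.rankdata(), which averages tied positions by default, and compares the Pearson correlation of the ranks with the result of scipy.stats.spearmanr().

```python
import numpy as np
from scipy import stats

# Illustrative data only; the two 20.0 values in x are tied.
x = np.array([10.0, 20.0, 20.0, 30.0, 50.0])
y = np.array([3.0, 1.0, 4.0, 4.0, 9.0])

# Tied values share the average of the positions they occupy,
# so the two 20.0 values (positions 2 and 3) both get rank 2.5.
rank_x = stats.rankdata(x)
rank_y = stats.rankdata(y)

# Pearson correlation computed on the ranks ...
pearson_on_ranks = np.corrcoef(rank_x, rank_y)[0, 1]

# ... should agree with the Spearman correlation of the raw values.
spearman = stats.spearmanr(x, y)[0]

print(rank_x)
print(pearson_on_ranks, spearman)
```

Both numbers agree up to floating-point error, which is why the formulas for the Pearson correlation carry over once the values are replaced by ranks.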
In this recipe, we calculate the Spearman correlation between wind speed and temperature, both aggregated by the day of the year, together with the corresponding confidence interval. Then, we display the correlation matrix for all the weather data. The steps are as follows:
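import dautil as dl
from scipy import stats
import numpy as np
import math
import seaborn as sns
import matplotlib.pyplot as plt
try:
    import ipywidgets as widgets  # widgets live here in current Jupyter
except ImportError:
    from IPython.html import widgets  # fallback for older IPython releases
from IPython.display import display
from IPython.display import HTML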
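def get_ci(n, corr):
    # z-score of the transformed correlation, see (3.20)
    z = math.sqrt((n - 3)/1.06) * np.arctanh(corr)
    # standard error, see (3.19)
    se = 0.6325/(math.sqrt(n - 1))
    # two-sided 95% bounds, mapped back with tanh
    ci = z + np.array([-1, 1]) * se * stats.norm.ppf((1 + 0.95)/2)

    return np.tanh(ci)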
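get_ci() adds the interval half-width to the z-score before applying tanh. A commonly used alternative builds the interval around F(r) itself, using the standard deviation sqrt(1.06/(n - 3)) that the z-score definition implies, and only then transforms back. The sketch below shows that variant; the name fisher_ci and the level parameter are hypothetical, and this is an illustration under those assumptions rather than the method of this recipe.

```python
import math
import numpy as np
from scipy import stats


def fisher_ci(n, corr, level=0.95):
    """Approximate confidence interval for a Spearman correlation.

    Hypothetical helper: the interval is built on the arctanh scale
    and mapped back to the correlation scale with tanh.
    """
    f = np.arctanh(corr)                    # F(r)
    se = math.sqrt(1.06 / (n - 3))          # spread of F(r) implied by the z-score
    half_width = stats.norm.ppf((1 + level) / 2) * se
    return np.tanh(f + np.array([-1, 1]) * half_width)


# Example: 366 day-of-year averages and a correlation of -0.5
lower, upper = fisher_ci(366, -0.5)
print(lower, upper)
```

Because the bounds are computed on the transformed scale, the resulting interval always stays within -1 and 1 and contains the estimate.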
df = dl.data.Weather.load().dropna()
df = dl.ts.groupby_yday(df).mean()

drop1 = widgets.Dropdown(options=dl.data.Weather.get_headers(),
                         selected_label='TEMP',
                         description='Variable 1')
drop2 = widgets.Dropdown(options=dl.data.Weather.get_headers(),
                         selected_label='WIND_SPEED',
                         description='Variable 2')
display(drop1)
display(drop2)
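dl.ts.groupby_yday() belongs to dautil; judging by its name and its use here, it groups the rows by day of the year so that mean() averages each calendar day across years. A plain pandas sketch of that kind of aggregation might look as follows. The data is synthetic, the column names merely mirror the weather headers, and the exact behavior of groupby_yday() is an assumption.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the weather data: two non-leap years of daily values.
index = pd.date_range('2013-01-01', '2014-12-31', freq='D')
rng = np.random.RandomState(42)
df = pd.DataFrame({'TEMP': rng.normal(10, 5, len(index)),
                   'WIND_SPEED': rng.normal(40, 10, len(index))},
                  index=index)

# Average all observations that fall on the same day of the year.
by_yday = df.groupby(df.index.dayofyear).mean()

print(by_yday.head())
```

After this step, each row represents one day of the year, so the sample size n used for the confidence interval is the number of distinct days, not the number of original observations.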
var1 = df[drop1.value].values
var2 = df[drop2.value].values
stats_corr = stats.spearmanr(var1, var2)
dl.options.set_pd_options()
html_builder = dl.report.HTMLBuilder()
html_builder.h1('Spearman Correlation between {0} and {1}'.format(
    dl.data.Weather.get_header(drop1.value),
    dl.data.Weather.get_header(drop2.value)))
html_builder.h2('scipy.stats.spearmanr()')
dfb = dl.report.DFBuilder(['Correlation', 'p-value'])
dfb.row([stats_corr[0], stats_corr[1]])
html_builder.add_df(dfb.build())
n = len(df.index)
# Pass only the correlation coefficient, not the (correlation, p-value) pair
ci = get_ci(n, stats_corr[0])
html_builder.h2('Confidence interval')
dfb = dl.report.DFBuilder(['2.5 percentile', '97.5 percentile'])
dfb.row(ci)
html_builder.add_df(dfb.build())
corr = df.corr(method='spearman')

%matplotlib inline
plt.title('Spearman Correlation Matrix')
sns.heatmap(corr)
HTML(html_builder.html)
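To see what method='spearman' changes in DataFrame.corr(), the following sketch compares it with the default Pearson method on synthetic data. The columns a, b, and c are made up for illustration: b is a monotonic but nonlinear function of a, and c is unrelated noise.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    'a': x,
    'b': np.exp(x),              # monotonic in a, but not linear
    'c': rng.normal(size=200),   # independent noise
})

# Rank-based correlation only cares about the ordering of the values.
spearman = df.corr(method='spearman')
pearson = df.corr(method='pearson')

print(spearman.round(3))
print(pearson.round(3))
```

The Spearman entry for a and b is 1 because exp() preserves the ordering, whereas the Pearson entry is lower because the relationship is not linear. The same matrix can be passed to sns.heatmap() as in the recipe.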
Refer to the following screenshot for the end result (see the correlating_spearman.ipynb file in this book's code bundle):