Relationships between variables – scatterplots

The real power in data analysis is realized when we study how different variables relate. At the end of the previous section, we related income and gender, that is, a quantitative variable with a categorical one. In this section, we will investigate scatterplots, which are a graphical representation of the relationship between two quantitative variables.

To illustrate how Pandas can be used to explore the relationship between two variables, we will use an important example from the history of astronomy. In 1929, the astronomer Edwin Hubble published a landmark paper reporting an approximately linear relationship between the distance and velocity of extragalactic nebulae. This result became a foundation of what would come to be known as the big bang theory.

A reprint of the article is available at http://apod.nasa.gov/diamond_jubilee/d_1996/hub_1929.html , which is the source of the data. Notice that the dataset is very small and is simply printed in the article itself! Some minor formatting and cleanup were done to the data to make it easier to use. In particular, the velocity values were manually changed to be all positive, as only the magnitude of the velocity matters. To plot a scatterplot of the data, enter and run the following code:

import pandas as pd
hubble_data = pd.read_csv('data/hubble-data.csv')
# Scatterplot of velocity (v) against distance (r)
hubble_data.plot(kind='scatter', x='r', y='v');

In this code, after reading the data using the read_csv() function, we plot the data using the plot() method with the kind='scatter' option. The x='r' and y='v' options tell Pandas which columns to plot on the x and y axes, respectively.

The plot shows a clear relationship between distance and velocity, contradicting the view, prevalent at the time, that the Universe is static. To make the relationship clearer, we can add a trend line to the plot.

Note

A central part of statistical inference is hypothesis testing, where one dataset/sample is tested against another or against a model-generated dataset. Statistical hypothesis testing is then used to investigate a proposed relationship between the two datasets. The proposed relationship is compared to an idealized null hypothesis, that is, that no relationship exists between the two datasets. The null hypothesis is rejected only if the probability of observing data at least as extreme as ours, assuming the null hypothesis is true, falls below a chosen significance level. That is, hypothesis testing only measures how compatible the data are with the null hypothesis; it does not directly give the probability that the proposed model is correct. This is a subtle concept that is usually hard to grasp. We will touch on hypothesis testing in this chapter, and take a deeper dive in Chapter 4, Regression.
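One way to make the idea of a null hypothesis concrete is a permutation test. The following is a minimal sketch, not the method used later in this section: it uses synthetic data (the slope of 450 and the noise level are assumptions chosen for illustration, not values from the Hubble dataset) and estimates how often shuffled data, where any relationship has been destroyed, produces a correlation at least as strong as the observed one.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the Hubble dataset)
rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0, 24)
y = 450.0 * x + rng.normal(0.0, 200.0, size=x.size)


def correlation(u, v):
    """Pearson correlation coefficient of two 1-D arrays."""
    return np.corrcoef(u, v)[0, 1]


observed = correlation(x, y)

# Shuffling y breaks any link between x and y, which simulates the
# null hypothesis that no relationship exists.
n_permutations = 5000
count = 0
for _ in range(n_permutations):
    if abs(correlation(x, rng.permutation(y))) >= abs(observed):
        count += 1

# Fraction of shuffled datasets at least as extreme as the observed one
p_value = (count + 1) / (n_permutations + 1)
print(f"observed r = {observed:.2f}, permutation p-value = {p_value:.4f}")
```

A small p-value here means that data like ours would be rare if the null hypothesis were true. It does not state how probable the linear model itself is.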

We first need to compute the linear regression line for the relationship. We do this using SciPy with the following code:

from scipy.stats import linregress
# Pass distance (r) as x and velocity (v) as y
a, b, r, p, stderr = linregress(hubble_data['r'], hubble_data['v'])
print(a, b, r, p, stderr)

We start by importing the linregress function from the scipy.stats module. This function accepts array-like inputs, so we pass the distance column as x and the velocity column as y. The call to linregress returns the following:

  • a: This is the slope of the regression line
  • b: This is the intercept of the regression line
  • r: This is the correlation coefficient
  • p: This is the two-sided p-value for a hypothesis test whose null hypothesis is that the slope is zero
  • stderr: This is the standard error of the estimated slope

For our example, we get, rounding to two decimals, a=454.16, b=-40.78, r=0.79, p=4.48E-6, and stderr=75.24.
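The same five values can also be read by name from the object that linregress returns, which avoids relying on the order of the tuple. The following sketch uses synthetic data (a true slope of 3 and intercept of 2 are assumptions for illustration) rather than the Hubble dataset:

```python
import numpy as np
from scipy.stats import linregress

# Synthetic data: y = 3x + 2 plus noise (illustrative values only)
rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 50)
y = 3.0 * x + 2.0 + rng.normal(0.0, 0.5, size=x.size)

result = linregress(x, y)

# The result exposes the same five values as named attributes
print(f"slope     = {result.slope:.2f}")
print(f"intercept = {result.intercept:.2f}")
print(f"r         = {result.rvalue:.3f}")
print(f"p-value   = {result.pvalue:.2e}")
print(f"stderr    = {result.stderr:.3f}")
```

With this little noise, the estimated slope and intercept should land close to the values used to generate the data.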

Linear regression is covered in Chapter 4, Regression, but we will interpret the results here. The correlation coefficient of 0.79 indicates a strong positive relationship, and the very small p-value indicates that the null hypothesis of zero slope should be rejected, giving support to the existence of a relationship between the variables. The square of r is 0.62, so 62% of the variability in velocity is explained by the linear model, with the remainder left as unexplained variation.
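The interpretation of the square of r as a fraction of explained variability can be checked numerically. For a least-squares line with an intercept, the square of r equals one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below verifies this on synthetic data (the generating slope, intercept, and noise level are assumptions for illustration); for the values reported above, the arithmetic is simply 0.79 × 0.79 ≈ 0.62.

```python
import numpy as np

# Synthetic data (illustrative values, not the Hubble dataset)
rng = np.random.default_rng(2)
x = np.linspace(0.0, 2.0, 24)
y = 450.0 * x - 40.0 + rng.normal(0.0, 200.0, size=x.size)

# Least-squares line through the data
slope, intercept = np.polyfit(x, y, 1)
predicted = slope * x + intercept

# Variation left unexplained by the line vs. total variation in y
ss_res = np.sum((y - predicted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared_from_fit = 1.0 - ss_res / ss_tot

# Square of the correlation coefficient
r = np.corrcoef(x, y)[0, 1]

print(f"1 - SS_res/SS_tot = {r_squared_from_fit:.4f}")
print(f"r squared         = {r ** 2:.4f}")
```

The two printed values agree up to floating-point rounding.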

All this indicates that a linear model can describe the increase in velocity as a function of distance for galaxies in the Universe. To display this visually, we can plot the regression line together with the data using the following code:

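import matplotlib.pyplot as plt
import numpy as np

hubble_data.plot(kind='scatter', x='r', y='v')
# Evaluate the regression line at 200 equally spaced distances
rdata = hubble_data['r']
rmin, rmax = min(rdata), max(rdata)
rvalues = np.linspace(rmin, rmax, 200)
yvalues = a * rvalues + b
# Draw the line on the same axes as the scatterplot
plt.plot(rvalues, yvalues, color='red', lw=1.5)
plt.locator_params(nbins=5);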

As Pandas does not currently support the option of drawing regression lines on scatterplots, we take advantage of the fact that matplotlib is used by Pandas in the background to construct plots. After graphing the scatterplot, we compute the minimum and maximum values of the distance and generate a NumPy array of equally spaced distances between them with a call to the linspace() function. Then, the yvalues = a * rvalues + b statement computes the corresponding points on the line. Finally, we call matplotlib's plot() function to draw the line. The resulting graph is displayed in the following image:

[Figure: scatterplot of velocity (v) against distance (r) for the Hubble data, with the fitted regression line drawn in red]
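An alternative to drawing on matplotlib's current figure is to keep the Axes object that the Pandas plot() method returns and draw the line on it explicitly. The following is a sketch under stated assumptions: the data is synthetic, the axis labels are made up for illustration, and the Agg backend is selected only so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption so this runs headless
import numpy as np
import pandas as pd

# Synthetic stand-in for the distance/velocity table (illustrative values)
rng = np.random.default_rng(3)
r_values = np.linspace(0.1, 2.0, 24)
v_values = 450.0 * r_values + rng.normal(0.0, 150.0, size=r_values.size)
df = pd.DataFrame({"r": r_values, "v": v_values})

# Least-squares line through the synthetic data
slope, intercept = np.polyfit(df["r"].to_numpy(), df["v"].to_numpy(), 1)

# plot() returns the matplotlib Axes, so the line can be added to it directly
ax = df.plot(kind="scatter", x="r", y="v")
line_x = np.linspace(df["r"].min(), df["r"].max(), 200)
ax.plot(line_x, slope * line_x + intercept, color="red", lw=1.5)
ax.set_xlabel("distance r")  # hypothetical labels
ax.set_ylabel("velocity v")
```

Working with the returned Axes makes it explicit which plot receives the line, which can help when several figures are open at once.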

From this model, Hubble went on to hypothesize that the Universe is expanding, an idea that eventually yielded the model for the Universe currently accepted in cosmology.
