An illustrative example

We will now illustrate all the preceding points with an example. Suppose we are conducting a study to examine the effect of temperature on how often crickets chirp. The data for this example was obtained from the book The Song of Insects, by George W. Pierce, which was written in 1948. Pierce measured the frequency of chirps made by a ground cricket at various temperatures.

We want to investigate the relationship between the frequency of cricket chirps and the temperature, as we suspect that the two are related. The data consists of 15 data points, and we will read it into a DataFrame.

The data is sourced from http://college.cengage.com/mathematics/brase/understandable_statistics/7e/students/datasets/slr/frames/slr02.html. Let's take a look at it:

    In [38]: import pandas as pd
             import numpy as np
             chirpDf = pd.read_csv('cricket_chirp_temperature.csv')

    In [39]: chirpDf
    Out[39]:
             chirpFrequency  temperature
         0        20.000000    88.599998
         1        16.000000    71.599998
         2        19.799999    93.300003
         3        18.400000    84.300003
         4        17.100000    80.599998
         5        15.500000    75.199997
         6        14.700000    69.699997
         7        17.100000    82.000000
         8        15.400000    69.400002
         9        16.200001    83.300003
         10       15.000000    79.599998
         11       17.200001    82.599998
         12       16.000000    80.599998
         13       17.000000    83.500000
         14       14.400000    76.300003
         15 rows × 2 columns

First, let's make a scatter plot of the data, along with a regression line, or line of best fit:

    In [29]: import matplotlib.pyplot as plt
             plt.scatter(chirpDf.temperature, chirpDf.chirpFrequency,
                         marker='o', edgecolor='b', facecolor='none', alpha=0.5)
             plt.xlabel('Temperature')
             plt.ylabel('Chirp Frequency')
             # Fit a degree-1 polynomial (a straight line) to the data
             slope, intercept = np.polyfit(chirpDf.temperature, chirpDf.chirpFrequency, 1)
             plt.plot(chirpDf.temperature, chirpDf.temperature*slope + intercept, 'r')
             plt.show()
  

As the resulting plot shows, there seems to be a linear relationship between temperature and the chirp frequency.

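We can now investigate further by fitting an ordinary least squares (OLS) model with the ols function from statsmodels.formula.api. Note that this model regresses temperature on chirp frequency, which is the reverse of the axes used in the scatter plot: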

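    In [37]: # The formula interface provides ols(); it is aliased as sm here
             import statsmodels.formula.api as sm
             chirpDf = pd.read_csv('cricket_chirp_temperature.csv')
             chirpDf = np.round(chirpDf, 2)
             result = sm.ols('temperature ~ chirpFrequency', chirpDf).fit()
             result.summary()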
    
    Out[37]: OLS Regression Results
       Dep. Variable:     temperature        R-squared:            0.697
       Model:             OLS                Adj. R-squared:       0.674
       Method:            Least Squares      F-statistic:          29.97
       Date:              Wed, 27 Aug 2014   Prob (F-statistic):   0.000107
       Time:              23:28:14           Log-Likelihood:       -40.348
       No. Observations:  15                 AIC:                  84.70
       Df Residuals:      13                 BIC:                  86.11
       Df Model:          1

                          coef  std err      t  P>|t|  [95.0% Conf. Int.]
       Intercept       25.2323   10.060  2.508  0.026     3.499   46.966
       chirpFrequency   3.2911    0.601  5.475  0.000     1.992    4.590

       Omnibus:         1.003   Durbin-Watson:      1.818
       Prob(Omnibus):   0.606   Jarque-Bera (JB):   0.874
       Skew:           -0.391   Prob(JB):           0.646
       Kurtosis:        2.114   Cond. No.           171.

We will ignore most of the preceding results, except for the R-squared, Intercept, and chirpFrequency values.

From the preceding result, we can conclude that the slope of the regression line is 3.29 and that the intercept on the temperature axis is 25.23. Thus, the regression line equation looks like temperature = 25.23 + 3.29 * chirpFrequency.
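To see where these coefficients come from, the slope and intercept can be recomputed with the closed-form least squares formulas. The following is a minimal sketch rather than the method used by statsmodels internally; the arrays are transcribed from the DataFrame shown earlier (rounded to one decimal place), and the names `chirp_frequency`, `temperature`, and `predict_temperature` are illustrative assumptions:

```python
import numpy as np

# Data transcribed from the DataFrame shown earlier, rounded to one decimal.
chirp_frequency = np.array([20.0, 16.0, 19.8, 18.4, 17.1, 15.5, 14.7, 17.1,
                            15.4, 16.2, 15.0, 17.2, 16.0, 17.0, 14.4])
temperature = np.array([88.6, 71.6, 93.3, 84.3, 80.6, 75.2, 69.7, 82.0,
                        69.4, 83.3, 79.6, 82.6, 80.6, 83.5, 76.3])

x_mean = chirp_frequency.mean()
y_mean = temperature.mean()

# Slope = sum of cross-deviations / sum of squared deviations of the predictor.
slope = (np.sum((chirp_frequency - x_mean) * (temperature - y_mean))
         / np.sum((chirp_frequency - x_mean) ** 2))

# The fitted line always passes through the point (x_mean, y_mean).
intercept = y_mean - slope * x_mean


def predict_temperature(chirps):
    """Hypothetical helper: temperature predicted by the fitted line."""
    return intercept + slope * chirps


print(slope, intercept)
print(predict_temperature(16.0))
```

The printed slope and intercept should agree with the chirpFrequency and Intercept rows of the summary, up to rounding.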

This means that as the chirp frequency increases by 1, the temperature increases by about 3.29 degrees Fahrenheit. However, note that the intercept value is not really meaningful, as a chirp frequency of 0 lies outside the bounds of the data. We can also only make predictions for values within the bounds of the data. For example, we cannot predict what chirpFrequency is at 32 degrees Fahrenheit as it is outside the bounds of the data; moreover, at 32 degrees Fahrenheit, the crickets would have frozen to death.

The value of R, that is, the correlation coefficient, is given as follows:

    In [38]: R = np.sqrt(result.rsquared)
             R
    Out[38]: 0.83514378678237422

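Thus, our correlation coefficient is R = 0.835, which indicates a strong positive linear relationship between chirp frequency and temperature. The proportion of variance explained is given by R-squared rather than R: about 70 percent (R-squared = 0.697) of the variation in temperature can be explained by the variation in chirp frequency.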
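As a cross-check, the correlation coefficient can also be computed directly from the data with `np.corrcoef`. For a simple linear regression with a single predictor, its square equals the R-squared value of the fit. This is a minimal sketch, assuming the data transcribed from the DataFrame shown earlier (rounded to one decimal place); the variable names are illustrative:

```python
import numpy as np

# Data transcribed from the DataFrame shown earlier, rounded to one decimal.
chirp_frequency = np.array([20.0, 16.0, 19.8, 18.4, 17.1, 15.5, 14.7, 17.1,
                            15.4, 16.2, 15.0, 17.2, 16.0, 17.0, 14.4])
temperature = np.array([88.6, 71.6, 93.3, 84.3, 80.6, 75.2, 69.7, 82.0,
                        69.4, 83.3, 79.6, 82.6, 80.6, 83.5, 76.3])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the Pearson correlation between the two variables.
r = np.corrcoef(chirp_frequency, temperature)[0, 1]

# With one predictor, R-squared is the square of the correlation.
r_squared = r ** 2

print(r, r_squared)
```

The two printed values should match the R and R-squared figures reported above, up to rounding.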

The book containing this data, The Song of Insects, can be found at http://www.hup.harvard.edu/catalog.php?isbn=9780674420663.

For a more in-depth treatment of single and multivariable regression, refer to the following websites:
