Linear regression

Finally! We will explore our first true machine learning model. Linear regression is a form of regression, meaning that it is a machine learning model that attempts to find a relationship between predictors and a response variable, and that response variable is, you guessed it, continuous! This notion is synonymous with fitting a line of best fit.

In the case of linear regression, we will attempt to find a linear relationship between our predictors and our response variable. Formally, we wish to solve a formula of the following format:

y = β0 + β1x1 + β2x2 + ... + βnxn

Let's look at the constituents of this formula:

  • y is our response variable
  • xi is our ith predictor (the ith column)
  • β0 is the intercept
  • βi is the coefficient for the xi term
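
Putting these pieces together, here is a purely made-up illustration in Python (none of these numbers come from the bike data); it simply shows how the formula turns a row of predictor values into a single prediction:

# toy example: an intercept of 5 and a single predictor with coefficient 2
beta_0 = 5
beta_1 = 2
x_1 = 10
y_hat = beta_0 + beta_1 * x_1   # y_hat is the model's prediction
print(y_hat)
# 25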

Let's take a look at some data before we go in depth. This dataset is publicly available and is used to predict the number of bikes rented in a given hour of a bike sharing program:

# read the data and set the datetime as the index 
# taken from Kaggle: https://www.kaggle.com/c/bike-sharing-demand/data 
import pandas as pd 
import matplotlib.pyplot as plt 
%matplotlib inline 
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/bikeshare.csv' 
bikes = pd.read_csv(url, index_col='datetime', parse_dates=True) 
bikes.head()

Following is the output:

[Figure: the first five rows of the bikes DataFrame, as returned by bikes.head()]

We can see that every row represents a single hour of bike usage. In this case, we are interested in predicting count, which represents the total number of bikes rented in the period of that hour.
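
A quick check of the shape and the response column confirms what we are working with (a minimal sketch, assuming the bikes DataFrame loaded above):

print(bikes.shape)
# 10886 rows, one row per hour of rentals
print(bikes['count'].describe())
# summary statistics (mean, spread, quartiles) for the response variable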

Let's, for example, look at a scatter plot between temperature (the temp column) and count, as shown:

bikes.plot(kind='scatter', x='temp', y='count', alpha=0.2) 

We get the following graph as output:

[Figure: scatter plot of temp against count]

And now, let's use a module called seaborn to draw ourselves a line of best fit, as follows:

import seaborn as sns #using seaborn to get a line of best fit 
sns.lmplot(x='temp', y='count', data=bikes, aspect=1.5, scatter_kws={'alpha':0.2}) 

Following is the output:

[Figure: the same scatter plot of temp against count with a seaborn line of best fit drawn through it]

This line in the graph attempts to visualize and quantify the relationship between temp and count. To make a prediction, we simply find a given temperature and then see where the line would predict the count. For example, if the temperature is 20 degrees (Celsius, mind you), then our line would predict that about 200 bikes will be rented. If the temperature is above 40 degrees, then more than 400 bikes will be needed!

It appears that as temp goes up, our count also goes up. Let's see if our correlation value, which quantifies a linear relationship between variables, also matches this notion:

bikes[['count', 'temp']].corr() 
# the off-diagonal entry of the 2 x 2 correlation matrix is about 0.3944 

There is a (weak) positive correlation between the two variables! Now, let's go back to the form of the linear regression:

y = β0 + β1x1 + β2x2 + ... + βnxn

Our model will attempt to draw a perfect line between all of the dots in the preceding graph but, of course, we can clearly see that there is no perfect line between these dots! The model will then find the best fit line possible. How? We can draw infinite lines between the data points, but what makes a line the best?

Consider the following diagram:

[Figure: observed data points, the line of best fit, and the residuals between them]

In our model, we are given the x and the y and the model learns the beta coefficients, also known as model coefficients:

  • The black dots are the observed values of x and y.
  • The blue line is our line of best fit.
  • The red lines between the dots and the line are called the residuals; they are the distances between the observed values and the line. They are how wrong the line is.

Each data point has a residual or a distance to the line of best fit. The sum of squared residuals is the summation of each residual squared. The best fit line has the smallest sum of squared residual value. Let's build this line in Python:

# create X and y 
feature_cols = ['temp'] # a list of the predictors 
X = bikes[feature_cols] # subsetting our data to only the predictors 
y = bikes['count'] # our response variable 

Note how we made an X and a y variable. These represent our predictors and our response variable. Then, we will import our machine learning module, scikit-learn, as shown:

# import scikit-learn, our machine learning module 
from sklearn.linear_model import LinearRegression 

Finally, we will fit our model to the predictors and the response variable, as follows:

linreg = LinearRegression() #instantiate a new model 
linreg.fit(X, y) #fit the model to our data 
 
# print the coefficients 
print(linreg.intercept_)
print(linreg.coef_)

6.04621295962  # our β0, the intercept 
[ 9.17054048]     # our β1, the coefficient for temp 

Let's interpret this:

  • β0 (6.04) is the value of y when X = 0
  • It is the estimation of bikes that will be rented when the temperature is 0 degrees Celsius
  • So, at 0 degrees, six bikes are predicted to be in use (it's cold!)

Sometimes, it might not make sense to interpret the intercept at all, because a variable might not have a meaningful concept of zero. Recall the levels of data: not all levels have this notion. Temperature is measured at a level that does have an inherent notion of zero, so we are safe interpreting the intercept here. Be careful in the future, though, and always verify that the interpretation makes sense:

  • β1 (9.17) is our temperature coefficient.
  • It is the change in y divided by the change in x1.
  • It represents how x and y move together.
  • An increase of 1 degree Celsius is associated with an increase of about nine bikes rented.
  • The sign of this coefficient is important. If it were negative, that would imply that a rise in temperature is associated with a drop in rentals.

Consider the following representation of the beta coefficients in a linear regression:

[Figure: a representation of the beta coefficients in a linear regression]

It is important to reiterate that these are all statements of correlation and not a statement of causation. We have no means of stating whether or not the rental increase is caused by the change in temperature, it is just that there appears to be movement together.
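
Before we move on, we can also sanity check the earlier claim that the fitted line has the smallest possible sum of squared residuals. The following is a minimal sketch, assuming the bikes DataFrame, X, y, and the fitted linreg from above are still in memory (the perturbed slope is an arbitrary alternative line chosen only for comparison):

# residuals: the observed counts minus the counts predicted by the fitted line
residuals = y - linreg.predict(X)
sse_fitted = (residuals ** 2).sum()   # sum of squared residuals for the fitted line

# a slightly different line: same intercept, slope nudged up by 1
perturbed_predictions = linreg.intercept_ + (linreg.coef_[0] + 1) * bikes['temp']
sse_perturbed = ((y - perturbed_predictions) ** 2).sum()

print(sse_fitted < sse_perturbed)
# True: the fitted line beats the perturbed line on sum of squared residuals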

Using scikit-learn to make predictions is easy:

linreg.predict([[20]])  # scikit-learn expects a 2D array of samples 
# 189.4570 

This means that 190 bikes will likely be rented if the temperature is 20 degrees.
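
We can reproduce that prediction by hand from the two coefficients printed earlier, since the fitted model is nothing more than the equation of the line:

# manual prediction: beta_0 + beta_1 * temp
manual_prediction = linreg.intercept_ + linreg.coef_[0] * 20
print(manual_prediction)
# roughly 189.46, matching linreg.predict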

Adding more predictors

Adding more predictors to the model is as simple as telling the linear regression model in scikit-learn about them!

Before we do, we should look at the data dictionary provided to us to make more sense out of these predictors:

  • season: 1 = spring, 2 = summer, 3 = fall, and 4 = winter
  • holiday: Whether the day is considered a holiday
  • workingday: Whether the day is neither a weekend nor a holiday
  • weather:
    • Clear, Few clouds, Partly cloudy
    • Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    • Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    • Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
  • temp: Temperature in Celsius
  • atemp: Feels like temperature in Celsius
  • humidity: Relative humidity
  • windspeed: Wind speed
  • casual: Number of non-registered user rentals initiated
  • registered: Number of registered user rentals initiated
  • count: Number of total rentals
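
Note that season and weather in the preceding list are stored as integer codes rather than as text labels, so the regression will treat them as ordinary numbers. A quick check against the bikes DataFrame (a minimal sketch) makes that explicit:

# confirm how the categorical-style columns are encoded
print(bikes[['season', 'holiday', 'workingday', 'weather']].dtypes)
# all four columns are stored as integers
print(bikes['season'].unique())
# should show the four season codes 1 through 4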

Now let's actually create our linear regression model. As before, we will first create a list holding the features we wish to look at, create our features and our response datasets (X and y), and then fit our linear regression. Once we fit our regression model, we will take a look at the model's coefficients in order to see how our features are interacting with our response:

# create a list of features 
feature_cols = ['temp', 'season', 'weather', 'humidity'] 
# create X and y 
X = bikes[feature_cols] 
y = bikes['count'] 
 
# instantiate and fit 
linreg = LinearRegression() 
linreg.fit(X, y) 
 
# pair the feature names with the coefficients 
result = list(zip(feature_cols, linreg.coef_)) 
print(result)

This gives us the following output:

[('temp', 7.8648249924774403), 
 ('season', 22.538757532466754), 
 ('weather', 6.6703020359238048), 
 ('humidity', -3.1188733823964974)] 

And this is what that means:

  • Holding all other predictors constant, a 1 unit increase in temperature is associated with a rental increase of 7.86 bikes
  • Holding all other predictors constant, a 1 unit increase in season is associated with a rental increase of 22.5 bikes
  • Holding all other predictors constant, a 1 unit increase in weather is associated with a rental increase of 6.67 bikes
  • Holding all other predictors constant, a 1 unit increase in humidity is associated with a rental decrease of 3.12 bikes

This is interesting. Note that, as weather goes up (meaning that the weather is getting closer to overcast), the bike demand goes up, as is the case when the season variable increases (meaning that we are approaching winter). This is not what I was expecting at all!
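
To make the "holding all other predictors constant" idea concrete, here is a minimal sketch that predicts two hypothetical hours that are identical except for a one degree difference in temperature (the specific values are made up purely for illustration):

# two hypothetical hours that differ only by 1 degree of temperature
hypothetical = pd.DataFrame([[20.0, 2, 1, 50.0],
                             [21.0, 2, 1, 50.0]],
                            columns=feature_cols)
predictions = linreg.predict(hypothetical)
print(predictions[1] - predictions[0])
# about 7.86, which is exactly the temp coefficient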

Let's take a look at the individual scatter plots between each predictor and the response, as illustrated:

feature_cols = ['temp', 'season', 'weather', 'humidity'] 
# multiple scatter plots 
sns.pairplot(bikes, x_vars=feature_cols, y_vars='count', kind='reg')

We get the following output:

[Figure: scatter plots of count against temp, season, weather, and humidity, each with its own regression line]

Note how the weather line is trending downwards, which is the opposite of what the last linear model was suggesting. Now, we have to worry about which of these predictors are actually helping us make the prediction, and which ones are just noise. To do so, we're going to need some more advanced metrics.

Regression metrics

There are three main metrics when using regression machine learning models. They are as follows:

  • The mean absolute error
  • The mean squared error
  • The root mean squared error

Each metric attempts to describe and quantify the effectiveness of a regression model by comparing a list of predictions to the corresponding list of correct answers. Each of these metrics is slightly different from the others and tells a slightly different story:

MAE = (1/n) * Σ |yi - ŷi|
MSE = (1/n) * Σ (yi - ŷi)^2
RMSE = sqrt( (1/n) * Σ (yi - ŷi)^2 ) = sqrt(MSE)

Let's look at the terms in these formulas:

  • n is the number of observations
  • yi is the actual value
  • ŷi is the predicted value

Let's take a look in Python:

# example true and predicted response values 
true = [9, 6, 7, 6] 
pred = [8, 7, 7, 12] 
# note that each value in the list represents a single prediction for a model 
# So we are comparing four predictions to four actual answers 
 
# calculate these metrics by hand! 
from sklearn import metrics 
import numpy as np 
print('MAE:', metrics.mean_absolute_error(true, pred)) 
print('MSE:', metrics.mean_squared_error(true, pred)) 
print('RMSE:', np.sqrt(metrics.mean_squared_error(true, pred))) 

Following is the output:

MAE: 2.0
MSE: 9.5
RMSE: 3.08220700148
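
As a sanity check, the same three numbers can be reproduced by hand from the raw errors (reusing the true and pred lists and the numpy import from the snippet above):

errors = np.array(pred) - np.array(true)
# errors: [-1, 1, 0, 6]
print(np.abs(errors).mean())          # MAE: the average absolute error, 2.0
print((errors ** 2).mean())           # MSE: the average squared error, 9.5
print(np.sqrt((errors ** 2).mean()))  # RMSE: the square root of the MSE, about 3.08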

The breakdown of these numbers is as follows:

  • MAE is probably the easiest to understand, because it's just the average error. It denotes, on average, how wrong the model is.
  • MSE is more effective than MAE, because MSE punishes larger errors, which tends to be much more useful in the real world.
  • RMSE is even more popular than MSE, because it is expressed in the same units as the response variable, which makes it much more interpretable.

RMSE is usually the preferred metric for regression, but no matter which one you choose, they are all loss functions and therefore are something to be minimized. Let's use RMSE to ascertain which columns are helping and which are hurting.

Let's start with only using temperature. Note that our procedure will be as follows:

  1. Create our X and our y variables
  2. Fit a linear regression model
  3. Use the model to make a list of predictions based on X
  4. Calculate RMSE between the predictions and the actual values

Let's take a look at the code:

from sklearn import metrics 
# import metrics from scikit-learn 
 
feature_cols = ['temp'] 
# create X and y 
X = bikes[feature_cols] 
linreg = LinearRegression() 
linreg.fit(X, y) 
y_pred = linreg.predict(X) 
np.sqrt(metrics.mean_squared_error(y, y_pred)) # RMSE 
# Can be interpreted loosely as an average error 
#166.45 

Now, let's try it using temperature and humidity, as shown:

feature_cols = ['temp', 'humidity'] 
# create X and y 
X = bikes[feature_cols] 
linreg = LinearRegression() 
linreg.fit(X, y) 
y_pred = linreg.predict(X) 
np.sqrt(metrics.mean_squared_error(y, y_pred)) # RMSE 
# 157.79 

It got better! Let's try using even more predictors, as illustrated:

feature_cols = ['temp', 'humidity', 'season', 'holiday', 'workingday', 'windspeed', 'atemp'] 
# create X and y 
X = bikes[feature_cols] 
linreg = LinearRegression() 
linreg.fit(X, y) 
y_pred = linreg.predict(X) 
np.sqrt(metrics.mean_squared_error(y, y_pred)) # RMSE 
# 155.75 

Even better! At first, this seems like a major triumph, but there is actually a hidden danger here. Note that we are training the line to fit to X and y and then asking it to predict X again! This is actually a huge mistake in machine learning because it can lead to overfitting, which means that our model is merely memorizing the data and regurgitating it back to us.

Imagine that you are a student, and you walk into the first day of class and the teacher says that the final exam is very difficult in this class. In order to prepare you, she gives you practice test after practice test after practice test. The day of the final exam arrives and you are shocked to find out that every question on the exam is exactly the same as in the practice tests! Luckily, you did them so many times that you remember the answers and get 100% on the exam.

The same thing applies here, more or less. By fitting and predicting on the same data, the model is memorizing the data and getting better at it. A great way to combat this overfitting problem is to use the train/test approach to fit machine learning models, which works as illustrated:

[Figure: the train/test split, where the data is divided into a training set used to fit the model and a test set used to evaluate it]

Essentially, we will take the following steps:

  1. Split up the dataset into two parts: a training and a test set
  2. Fit our model on the training set and then test it on the test set, just like in school, where the teacher would teach from one set of notes and then test us on different (but similar) questions
  3. Once our model is good enough (based on our metrics), we turn our model's attention toward the entire dataset
  4. Our model awaits new data previously unseen by anyone

The goal here is to minimize the out-of-sample errors of our model, which are the errors our model has on data that it has never seen before. This is important because the main idea (usually) of a supervised model is to predict outcomes for new data. If our model is unable to generalize from our training data and use that to predict unseen cases, then our model isn't very good.

The preceding diagram outlines a simple way of ensuring that our model can effectively ingest the training data and use it to predict data points that the model itself has never seen. Of course, as data scientists, we know that the test set also has answers attached to them, but the model doesn't know that.

All of this might sound complicated, but luckily, the scikit-learn package has a built-in method to do this, as shown:

from sklearn.model_selection import train_test_split 
# function that splits data into training and testing sets 
 
 
feature_cols = ['temp'] 
X = bikes[feature_cols] 
y = bikes['count'] 
# setting our overall data X, and y 
# Note that in this example, we are attempting to find an association between the temperature of the day and the number of bike rentals. 
 
X_train, X_test, y_train, y_test = train_test_split(X, y) # split the data into training and testing sets 
# X_train and y_train will be used to train the model 
# X_test and y_test will be used to test the model 
# Remember that all four of these variables are just subsets of the overall X and y. 
 
linreg = LinearRegression() 
# instantiate the model 
 
linreg.fit(X_train, y_train) 
# fit the model to our training set 
 
y_pred = linreg.predict(X_test) 
# predict our testing set 
 
np.sqrt(metrics.mean_squared_error(y_test, y_pred)) # RMSE 
# Calculate our metric: 166.91 

We will spend more time on the reasoning behind this train/test split in Chapter 12, Beyond the Essentials, and look into an even more helpful method, but the main reason we must go through this extra work is because we do not want to fall into a trap where our model is simply regurgitating our dataset back to us and will not be able to handle unseen data points.

In other words, our train/test split is ensuring that the metrics we are looking at are more honest estimates of our out-of-sample performance.

Now, let's try again with more predictors, as follows:

feature_cols = ['temp', 'workingday'] 
X = bikes[feature_cols] 
y = bikes['count'] 
 
X_train, X_test, y_train, y_test = train_test_split(X, y) 
# Pick a new random training and test set 
 
linreg = LinearRegression() 
linreg.fit(X_train, y_train) 
y_pred = linreg.predict(X_test) 
# fit and predict 
 
np.sqrt(metrics.mean_squared_error(y_test, y_pred)) 
# 166.95 

Now our model actually got worse with that addition! This implies that workingday might not be very predictive of our response, the bike rental count.
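
One caveat: train_test_split chooses a random split every time it runs, so these RMSE numbers will move around a little from run to run. If you want a reproducible comparison between feature sets, you can pin the split down with the random_state parameter, as in this minimal sketch (the seed value 123 is arbitrary):

# fix the random seed so the split, and therefore the RMSE, is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)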

Now, all of this is good and well, but how well is our model really doing at predicting? We have our root mean squared error of around 167 bikes, but is that good? One way to discover this is to evaluate the null model.

The null model in supervised machine learning represents effectively guessing the expected outcome over and over, and seeing how you did. For example, in regression, if we only ever guess the average number of hourly bike rentals, then how well would that model do?

First, let's get the average hourly bike rental, as shown:

average_bike_rental = bikes['count'].mean() 
average_bike_rental 
# 191.57 

This means that, overall, in this dataset, regardless of weather, time, day of the week, humidity, and everything else, the average number of bikes that go out every hour is about 192.

Let's make a fake prediction list, wherein every single guess is 191.57. Let's make this guess for every single hour, as follows:

num_rows = bikes.shape[0] 
num_rows 
# 10886 

null_model_predictions = [average_bike_rental] * num_rows 
null_model_predictions

The output is as follows:

[191.57413191254824, 
 191.57413191254824, 
 191.57413191254824, 
 191.57413191254824, 
... 
 191.57413191254824, 
 191.57413191254824, 
 191.57413191254824, 
 191.57413191254824] 

So, now we have 10886 values, all of them equal to the average hourly bike rental number. Now, let's see what the RMSE would be if our model only ever guessed the average hourly bike rental count:

np.sqrt(metrics.mean_squared_error(y, null_model_predictions))

The output is as follows:

181.13613 

Simply by guessing the average, it looks like our RMSE would be 181 bikes. So, even with one or two features, we can beat it! Beating the null model is a kind of baseline in machine learning; if you think about it, why go through any effort at all if your machine learning model is not even better than simply guessing?
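
Strictly speaking, the 181 figure was computed on the full dataset, while our model's RMSE came from a held-out test set. A slightly fairer sketch (reusing y_train and y_test from the last split, plus the numpy and metrics imports from earlier) evaluates the null model on that same test set:

# the null model evaluated on the same test set as the fitted model
null_test_predictions = [y_train.mean()] * len(y_test)
print(np.sqrt(metrics.mean_squared_error(y_test, null_test_predictions)))
# should land in the same ballpark as 181 bikes, still worse than the fitted model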

We've spent a great deal of time on linear regression, but I'd like to now take some time to look at our next machine learning model, which is actually, somewhat, a cousin of linear regression. They are based on very similar ideas but have one major difference—while linear regression is a regression model and can only be used to make predictions of continuous numbers, our next machine learning model will be a classification model, which means that it will attempt to make associations between features and a categorical response.
