Finally! We will explore our first true machine learning model. Linear regression is a form of regression, which means that it is a machine learning model that attempts to find a relationship between predictors and a response variable, and that response variable is, you guessed it, continuous! This notion is synonymous with making a line of best fit.
In the case of linear regression, we will attempt to find a linear relationship between our predictors and our response variable. Formally, we wish to solve a formula of the following format:

y = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ

Let's look at the constituents of this formula:

- y is our response variable
- β₀ is the intercept of the line
- β₁, β₂, …, βₙ are the model's coefficients, one for each predictor
- x₁, x₂, …, xₙ are our predictors
Let's take a look at some data before we go in depth. This dataset is publicly available and attempts to predict the number of bikes needed on a particular day for a bike sharing program:
# read the data and set the datetime as the index
# taken from Kaggle: https://www.kaggle.com/c/bike-sharing-demand/data
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/bikeshare.csv'
bikes = pd.read_csv(url)

bikes.head()
Following is the output:
We can see that every row represents a single hour of bike usage. In this case, we are interested in predicting count, which represents the total number of bikes rented in the period of that hour.
Let's, for example, look at a scatter plot between temperature (the temp column) and count, as shown:
bikes.plot(kind='scatter', x='temp', y='count', alpha=0.2)
We get the following graph as output:
And now, let's use a module called seaborn to draw ourselves a line of best fit, as follows:
import seaborn as sns # using seaborn to get a line of best fit
sns.lmplot(x='temp', y='count', data=bikes, aspect=1.5, scatter_kws={'alpha':0.2})
This line in the graph attempts to visualize and quantify the relationship between temp and count. To make a prediction, we simply find a given temperature and then see where the line would predict the count. For example, if the temperature is 20 degrees (Celsius, mind you), then our line would predict that about 200 bikes will be rented. If the temperature is above 40 degrees, then more than 400 bikes will be needed!
It appears that as temp goes up, our count also goes up. Let's see if our correlation value, which quantifies a linear relationship between variables, also matches this notion:
bikes[['count', 'temp']].corr() # 0.3944
There is a (weak) positive correlation between the two variables! Now, let's go back to the form of the linear regression:
Our model will attempt to draw a perfect line through all of the dots in the preceding graph but, of course, we can clearly see that there is no perfect line through these dots! The model will then find the best fit line possible. How? We can draw infinitely many lines between the data points, but what makes a line the best?
Consider the following diagram:
In our model, we are given the x and the y and the model learns the beta coefficients, also known as model coefficients:
Each data point has a residual, which is its distance to the line of best fit. The sum of squared residuals is the summation of each residual squared, and the best fit line is the line with the smallest sum of squared residuals. Let's build this line in Python:
# create X and y
feature_cols = ['temp'] # a list of the predictors
X = bikes[feature_cols] # subsetting our data to only the predictors
y = bikes['count'] # our response variable
Note how we made an X and a y variable. These represent our predictors and our response variable. Then, we will import our machine learning module, scikit-learn, as shown:
# import scikit-learn, our machine learning module
from sklearn.linear_model import LinearRegression
Finally, we will fit our model to the predictors and the response variable, as follows:
linreg = LinearRegression() # instantiate a new model
linreg.fit(X, y) # fit the model to our data

# print the coefficients
print(linreg.intercept_)
print(linreg.coef_)

6.04621295962   # our Beta_0
[ 9.17054048]   # our beta parameters
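To see the "smallest sum of squared residuals" claim in action, here is a minimal sketch (reusing the bikes DataFrame and the fitted linreg from above; the perturbed slope is our own, chosen only for illustration) that compares our fitted line against a slightly different one. The fitted line should always win:

# compare the sum of squared residuals of two candidate lines
def sum_of_squared_residuals(intercept, slope):
    predictions = intercept + slope * bikes['temp']  # each point's predicted count
    residuals = bikes['count'] - predictions         # each point's distance to the line
    return (residuals ** 2).sum()

# the line scikit-learn found
print(sum_of_squared_residuals(linreg.intercept_, linreg.coef_[0]))
# a slightly perturbed line -- its sum of squared residuals should be larger
print(sum_of_squared_residuals(linreg.intercept_, linreg.coef_[0] + 1))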
Let's interpret this:

- β₀ (6.04) is our intercept: the predicted value of y when X = 0. In other words, it is the number of bikes we would expect to be rented when the temperature is 0 degrees Celsius.
- β₁ (9.17) is our temperature coefficient: for every one-degree increase in temperature, we expect about nine more bikes to be rented.
Sometimes, it might not make sense to interpret the intercept at all because there might not be a concept of zero of something. Recall the levels of data: not all levels have this notion. Temperature exists at a level that has an inherent notion of zero, so we are safe here. Be careful in the future, though, and always verify your results:
Consider the following representation of the beta coefficients in a linear regression:
It is important to reiterate that these are all statements of correlation and not statements of causation. We have no means of stating whether or not the rental increase is caused by the change in temperature; it is just that the two appear to move together.
Using scikit-learn to make predictions is easy:
linreg.predict([[20]]) # predict expects a 2D array: one row, one feature
# 189.4570
This means that 190 bikes will likely be rented if the temperature is 20 degrees.
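We can sanity-check this number by plugging the temperature into the fitted equation ourselves:

# count = Beta_0 + Beta_1 * temp, computed by hand
manual_prediction = linreg.intercept_ + linreg.coef_[0] * 20
manual_prediction
# 189.4570, matching linreg.predict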
Adding more predictors to the model is as simple as telling the linear regression model in scikit-learn about them!
Before we do, we should look at the data dictionary provided to us to make more sense out of these predictors:
- season: 1 = spring, 2 = summer, 3 = fall, and 4 = winter
- holiday: whether the day is considered a holiday
- workingday: whether the day is a working day (that is, neither a weekend nor a holiday)
- weather:
  - 1: Clear, Few clouds, Partly cloudy
  - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
  - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
  - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- temp: temperature in Celsius
- atemp: "feels like" temperature in Celsius
- humidity: relative humidity
- windspeed: wind speed
- casual: number of rentals initiated by non-registered users
- registered: number of rentals initiated by registered users
- count: number of total rentals

Now, let's actually create our linear regression model. As before, we will first create a list holding the features we wish to look at, create our features and our response datasets (X and y), and then fit our linear regression. Once we fit our regression model, we will take a look at the model's coefficients in order to see how our features are interacting with our response:
# create a list of features
feature_cols = ['temp', 'season', 'weather', 'humidity']

# create X and y
X = bikes[feature_cols]
y = bikes['count']

# instantiate and fit
linreg = LinearRegression()
linreg.fit(X, y)

# pair the feature names with the coefficients
result = list(zip(feature_cols, linreg.coef_))
print(result)
This gives us the following output:
[('temp', 7.8648249924774403), ('season', 22.538757532466754), ('weather', 6.6703020359238048), ('humidity', -3.1188733823964974)]
And this is what that means:

- Holding all other predictors fixed, a one-unit increase in temp is associated with about 7.86 more rentals
- Holding all other predictors fixed, a one-unit increase in season is associated with about 22.5 more rentals
- Holding all other predictors fixed, a one-unit increase in weather is associated with about 6.67 more rentals
- Holding all other predictors fixed, a one-unit increase in humidity is associated with about 3.12 fewer rentals
This is interesting. Note that, as weather goes up (meaning that the weather is getting closer to overcast), the bike demand goes up, as is the case when the season variable increases (meaning that we are approaching winter). This is not what I was expecting at all!
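As a quick sanity check on how these four coefficients combine, a multi-feature prediction is just the intercept plus the dot product of the coefficients with the feature values. Here is a minimal sketch using a made-up day (the feature values below are hypothetical, chosen only for illustration):

import numpy as np

# hypothetical day: temp=20, season=2 (summer), weather=1 (clear), humidity=50
new_day = np.array([20, 2, 1, 50])

# intercept + coefficients . features, computed by hand
manual = linreg.intercept_ + np.dot(linreg.coef_, new_day)
print(manual)

# should match scikit-learn's own prediction
print(linreg.predict(new_day.reshape(1, -1))[0])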
Let's take a look at the individual scatter plots between each predictor and the response, as illustrated:
feature_cols = ['temp', 'season', 'weather', 'humidity']

# multiple scatter plots
sns.pairplot(bikes, x_vars=feature_cols, y_vars='count', kind='reg')
We get the following output:
Note how the weather line is trending downwards, which is the opposite of what the last linear model was suggesting. Now, we have to worry about which of these predictors are actually helping us make the prediction, and which ones are just noise. To do so, we're going to need some more advanced metrics.
There are three main metrics when using regression machine learning models. They are as follows:

- The mean absolute error (MAE)
- The mean squared error (MSE)
- The root mean squared error (RMSE)

Each metric attempts to describe and quantify the effectiveness of a regression model by comparing a list of predictions to a list of correct answers. Each of these metrics is slightly different from the rest and tells a different story. For n predictions ŷᵢ and true values yᵢ, the formulas are:

- MAE = (1/n) Σ |yᵢ − ŷᵢ|
- MSE = (1/n) Σ (yᵢ − ŷᵢ)²
- RMSE = √( (1/n) Σ (yᵢ − ŷᵢ)² )
Let's take a look in Python:
# example true and predicted response values
true = [9, 6, 7, 6]
pred = [8, 7, 7, 12]
# note that each value in the list represents a single prediction for a model
# so we are comparing four predictions to four actual answers

# calculate these metrics by hand!
from sklearn import metrics
import numpy as np
print('MAE:', metrics.mean_absolute_error(true, pred))
print('MSE:', metrics.mean_squared_error(true, pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(true, pred)))
Following is the output:
MAE: 2.0
MSE: 9.5
RMSE: 3.08220700148
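To connect these numbers back to the formulas above, here is the same computation done directly from the definitions with numpy:

import numpy as np

true = np.array([9, 6, 7, 6])
pred = np.array([8, 7, 7, 12])

errors = true - pred                           # [1, -1, 0, -6]
print('MAE:', np.abs(errors).mean())           # average absolute error -> 2.0
print('MSE:', (errors ** 2).mean())            # average squared error -> 9.5
print('RMSE:', np.sqrt((errors ** 2).mean()))  # square root of MSE -> ~3.08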
The breakdown of these numbers is as follows:
- MAE is probably the easiest to understand, because it's just the average error. It denotes, on average, how wrong the model is.
- MSE is more effective than MAE, because MSE punishes larger errors, which tends to be much more useful in the real world.
- RMSE is even more popular than MSE, because it is much more interpretable: by taking the square root of MSE, it is expressed in the same units as the response.
RMSE is usually the preferred metric for regression but, no matter which one you choose, they are all loss functions and are therefore something to be minimized. Let's use RMSE to ascertain which columns are helping and which are hurting.
Let's start with only using temperature. Note that our procedure will be as follows:

1. Create our X and our y
2. Fit a linear regression model to the entire dataset
3. Use the model to make predictions for that same dataset
4. Compute the RMSE between those predictions and the actual values
Let's take a look at the code:
from sklearn import metrics # import metrics from scikit-learn

feature_cols = ['temp']
# create X and y
X = bikes[feature_cols]
y = bikes['count']

linreg = LinearRegression()
linreg.fit(X, y)
y_pred = linreg.predict(X)
np.sqrt(metrics.mean_squared_error(y, y_pred)) # RMSE
# Can be interpreted loosely as an average error
# 166.45
Now, let's try it using temperature and humidity, as shown:
feature_cols = ['temp', 'humidity']
# create X and y
X = bikes[feature_cols]

linreg = LinearRegression()
linreg.fit(X, y)
y_pred = linreg.predict(X)
np.sqrt(metrics.mean_squared_error(y, y_pred)) # RMSE
# 157.79
It got better! Let's try using even more predictors, as illustrated:
feature_cols = ['temp', 'humidity', 'season', 'holiday', 'workingday', 'windspeed', 'atemp']
# create X and y
X = bikes[feature_cols]

linreg = LinearRegression()
linreg.fit(X, y)
y_pred = linreg.predict(X)
np.sqrt(metrics.mean_squared_error(y, y_pred)) # RMSE
# 155.75
Even better! At first, this seems like a major triumph, but there is actually a hidden danger here. Note that we are training the line to fit X and y and then asking it to predict X again! This is actually a huge mistake in machine learning because it can lead to overfitting, which means that our model is merely memorizing the data and regurgitating it back to us.
Imagine that you are a student, and you walk into the first day of class and the teacher says that the final exam is very difficult in this class. In order to prepare you, she gives you practice test after practice test after practice test. The day of the final exam arrives and you are shocked to find out that every question on the exam is exactly the same as on the practice tests! Luckily, you did them so many times that you remember the answers and score 100% on the exam.
The same thing applies here, more or less. By fitting and predicting on the same data, the model is memorizing the data and getting better at it. A great way to combat this overfitting problem is to use the train/test approach to fit machine learning models, which works as illustrated:
Essentially, we will take the following steps:

1. Split the dataset into two parts: a training set and a testing set
2. Fit our model on the training set, and then use it to make predictions on the testing set
3. Evaluate the model by comparing its predictions on the testing set to the actual answers
The goal here is to minimize the out-of-sample errors of our model, which are the errors our model has on data that it has never seen before. This is important because the main idea (usually) of a supervised model is to predict outcomes for new data. If our model is unable to generalize from our training data and use that to predict unseen cases, then our model isn't very good.
The preceding diagram outlines a simple way of ensuring that our model can effectively ingest the training data and use it to predict data points that the model itself has never seen. Of course, as data scientists, we know that the test set also has answers attached to them, but the model doesn't know that.
All of this might sound complicated but, luckily, the scikit-learn package has a built-in method to do this, as shown:
# function that splits data into training and testing sets
from sklearn.model_selection import train_test_split
# note: in older versions of scikit-learn, this lived in sklearn.cross_validation

feature_cols = ['temp']
X = bikes[feature_cols]
y = bikes['count'] # setting our overall data X, and y
# Note that in this example, we are attempting to find an association between
# the temperature of the day and the number of bike rentals.

X_train, X_test, y_train, y_test = train_test_split(X, y) # split the data into training and testing sets
# X_train and y_train will be used to train the model
# X_test and y_test will be used to test the model
# Remember that all four of these variables are just subsets of the overall X and y.

linreg = LinearRegression() # instantiate the model
linreg.fit(X_train, y_train) # fit the model to our training set
y_pred = linreg.predict(X_test) # predict our testing set
np.sqrt(metrics.mean_squared_error(y_test, y_pred)) # RMSE
# Calculate our metric: 166.91
We will spend more time on the reasoning behind this train/test split in Chapter 12, Beyond the Essentials, and look into an even more helpful method, but the main reason we must go through this extra work is because we do not want to fall into a trap where our model is simply regurgitating our dataset back to us and will not be able to handle unseen data points.
In other words, our train/test split is ensuring that the metrics we are looking at are more honest estimates of our out-of-sample performance.
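One way to see this honesty in action is to score the same fitted model on both halves of the split; a noticeably lower training error than testing error is the signature of overfitting. A minimal sketch, reusing the X_train, X_test, y_train, and y_test variables from the previous snippet:

linreg = LinearRegression()
linreg.fit(X_train, y_train)  # fit only on the training set

# in-sample error: data the model has already seen
train_rmse = np.sqrt(metrics.mean_squared_error(y_train, linreg.predict(X_train)))
# out-of-sample error: data the model has never seen
test_rmse = np.sqrt(metrics.mean_squared_error(y_test, linreg.predict(X_test)))

print('train RMSE:', train_rmse)
print('test RMSE:', test_rmse)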
Now, let's try again with more predictors, as follows:
feature_cols = ['temp', 'workingday']
X = bikes[feature_cols]
y = bikes['count']

X_train, X_test, y_train, y_test = train_test_split(X, y)
# Pick a new random training and test set

linreg = LinearRegression()
linreg.fit(X_train, y_train)
y_pred = linreg.predict(X_test) # fit and predict

np.sqrt(metrics.mean_squared_error(y_test, y_pred))
# 166.95
Now, our model actually got worse with that addition! This implies that workingday might not be very predictive of our response, the bike rental count.
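If we wanted to repeat this comparison for several candidate feature sets, we could wrap the split-fit-score pattern in a small helper function (a sketch; test_rmse is our own hypothetical name, not a scikit-learn utility). Fixing random_state keeps the split identical across calls, so the feature sets are compared on equal footing:

def test_rmse(feature_cols):
    X = bikes[feature_cols]
    y = bikes['count']
    # same split every call, so each feature set faces the same test data
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    linreg = LinearRegression()
    linreg.fit(X_train, y_train)
    y_pred = linreg.predict(X_test)
    return np.sqrt(metrics.mean_squared_error(y_test, y_pred))

print(test_rmse(['temp']))
print(test_rmse(['temp', 'workingday']))
print(test_rmse(['temp', 'humidity', 'season']))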
Now, all of this is well and good, but how well is our model really doing at predicting? We have our root mean squared error of around 167 bikes, but is that good? One way to discover this is to evaluate the null model.
The null model in supervised machine learning effectively represents guessing the expected outcome over and over, and then seeing how well you did. For example, in regression, if we only ever guess the average number of hourly bike rentals, then how well would that model do?
First, let's get the average hourly bike rental, as shown:
average_bike_rental = bikes['count'].mean()
average_bike_rental
# 191.57
This means that, overall, in this dataset, regardless of weather, time, day of the week, humidity, and everything else, the average number of bikes that go out every hour is about 192.
Let's make a fake prediction list, wherein every single guess is 191.57. Let's make this guess for every single hour, as follows:
num_rows = bikes.shape[0]
num_rows
# 10886

null_model_predictions = [average_bike_rental] * num_rows
null_model_predictions
The output is as follows:
[191.57413191254824, 191.57413191254824, 191.57413191254824, 191.57413191254824, ... 191.57413191254824, 191.57413191254824, 191.57413191254824, 191.57413191254824]
So, now we have 10886 values, all of them the average hourly bike rental number. Now, let's see what the RMSE would be if our model only ever guessed the expected value of the average hourly bike rental count:
np.sqrt(metrics.mean_squared_error(y, null_model_predictions))
The output is as follows:
181.13613
Simply guessing, it looks like our RMSE would be 181 bikes. So, even with one or two features, we can beat it! Beating the null model is a kind of baseline in machine learning. If you think about it, why go through any effort at all if your machine learning model is not even better than just guessing?
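As an aside, scikit-learn also ships a ready-made baseline estimator in sklearn.dummy; here is a minimal sketch of the same null model built with DummyRegressor:

from sklearn.dummy import DummyRegressor

# a baseline model that always predicts the mean of the response it was trained on
null_model = DummyRegressor(strategy='mean')
null_model.fit(bikes[['temp']], bikes['count'])  # the features are ignored by this strategy
null_predictions = null_model.predict(bikes[['temp']])

np.sqrt(metrics.mean_squared_error(bikes['count'], null_predictions))
# ~181, matching our hand-built null model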
We've spent a great deal of time on linear regression, but I'd like to now take some time to look at our next machine learning model, which is, in a way, a cousin of linear regression. The two are based on very similar ideas but have one major difference: while linear regression is a regression model and can only be used to make predictions of continuous numbers, our next machine learning model will be a classification model, which means that it will attempt to make associations between features and a categorical response.