We are familiar with the basic concept of probability: the probability of an event occurring can be modeled as the number of ways the event can occur divided by the number of all possible outcomes. For example, if, out of 3,000 people who walked into a store, 1,000 actually bought something, then we could say that the probability of a single person buying an item is as shown:

probability of buying = 1,000 / 3,000 = 1/3 ≈ 0.33
However, we also have a related concept, called odds. The odds of an outcome occurring is the number of ways that the outcome can occur divided by the number of ways every other outcome can occur, instead of all possible outcomes. In the same example, the odds of a person buying something would be as follows:

odds of buying = 1,000 / 2,000 = 1/2 = 0.5
This means that for every one customer who buys, two customers will not. These concepts are so closely related that there is even a formula to get from one to the other:

odds = probability / (1 - probability)
Let's check this with our example, as illustrated:

odds = (1/3) / (1 - 1/3) = (1/3) / (2/3) = 1/2

It checks out!
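The same check can be run in Python, going from probability to odds and back again (the variable names here are just illustrative):

```python
# probability -> odds and back, using the store example from the text
p = 1000 / 3000              # probability that a customer buys
odds = p / (1 - p)           # odds that a customer buys
p_again = odds / (1 + odds)  # converting odds back to probability

print(odds)     # approximately 0.5
print(p_again)  # approximately 0.333, matching where we started
```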
Let's use Python to make a table of probabilities and associated odds, as shown:
# create a table of probability versus odds
import pandas as pd

table = pd.DataFrame({'probability': [0.1, 0.2, 0.25, 0.5, 0.6, 0.8, 0.9]})
table['odds'] = table.probability / (1 - table.probability)
table
So, we see that as our probabilities increase, so do our odds, but at a much faster rate! In fact, as the probability of an event occurring nears 1, our odds shoot off into infinity. Earlier, we said that we couldn't simply regress to probability because our line would shoot off into positive and negative infinities, predicting improper probabilities, but what if we regress to odds? Well, odds do go off to positive infinity, but on the bottom they merely approach 0 and never go below it. Therefore, we cannot simply regress to probability or to odds. It looks like we've hit rock bottom, folks!
However, wait, the number e and logarithms to the rescue! Think of logarithms as follows:

if log_base(number) = x, then base^x = number

Basically, logarithms and exponents are one and the same. We are just so used to writing exponents in the first way that we forget there is another way to write them. When we take the logarithm of a number, we are asking the question, hey, what exponent would we need to put on the base to produce the given number?
Note that np.log automatically does all logarithms in base e, which is what we want:

import numpy as np

np.log(10)
# == 2.3025850929940459
# meaning that e ^ 2.3025 == 10

# to prove it:
2.71828 ** 2.3025850929940459
# == 9.9999
# e ^ log(10) == 10
Let's go ahead and add the logarithm of odds, or log-odds to our table, as follows:
# add log-odds to the table
table['logodds'] = np.log(table.odds)
table
So, now every row has the probability of a single event occurring, the odds of that event occurring, and now the log-odds of that event occurring. Let's go ahead and ensure that our numbers are on the up and up. Let's choose a probability of .25, as illustrated:

prob = .25

odds = prob / (1 - prob)
odds
# 0.3333333333333333

logodds = np.log(odds)
logodds
# -1.0986122886681098
It checks out! Wait, look! Our logodds variable goes below zero and, in fact, logodds is bounded neither above nor below, which makes it a great candidate for the response variable of a linear regression. In fact, this is where our story of logistic regression really begins.
The long and short of it is that logistic regression is a linear regression between our feature, X, and the log-odds of our data belonging to a certain class that we will call true for the sake of generalization.
If p represents the probability of a data point belonging to a particular class, then logistic regression can be written as follows:

log(p / (1 - p)) = β0 + β1x

If we rearrange our variables and solve this for p, we get the logistic function, which takes on an S shape, where the output, p, is bounded by [0, 1]:

p = e^(β0 + β1x) / (1 + e^(β0 + β1x))
The preceding graph represents the logistic function's ability to map our continuous input, x, to a smooth probability curve that begins at the left near probability 0 and, as we increase x, rises naturally and smoothly to probability 1. In other words, the logistic function maps any value of x to a valid probability of belonging to the true class.
The logistic function has some nice properties, as follows:

- It takes on an S shape
- Its output is bounded by 0 and 1, as a probability should be
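A minimal sketch of this function makes the properties concrete (the name logistic is ours; NumPy has no built-in by that name):

```python
import numpy as np

def logistic(x):
    # the logistic (sigmoid) function: maps any real x into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

print(logistic(-10))  # very close to 0
print(logistic(0))    # exactly 0.5, the middle of the S curve
print(logistic(10))   # very close to 1
```

No matter how extreme the input, the output never escapes the (0, 1) interval, which is exactly what a probability requires.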
In order to interpret the outputs of a logistic function, we must understand the difference between probability and odds. The odds of an event are given by the ratio of the probability of the event to the probability of its complement, as shown:

odds = p / (1 - p)
In linear regression, the coefficient β1 represents the change in the response variable for a unit change in x. In logistic regression, β1 represents the change in the log-odds for a unit change in x. This means that e^β1 gives us the change in the odds for a unit change in x.
Consider that we are interested in mobile purchase behavior. Let y
be a class label denoting purchase/no purchase, and let x
denote whether the phone was an iPhone.
Also, suppose that we perform a logistic regression, and we get β1 = 0.693.
In this case, the odds ratio is np.exp(0.693) = 2, which means that the odds of a purchase are twice as high if the phone is an iPhone.
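This interpretation is quick to verify in code. The baseline odds value below is a made-up number purely for illustration:

```python
import numpy as np

beta_1 = 0.693               # the coefficient from the example above
odds_ratio = np.exp(beta_1)  # approximately 2.0

# if the baseline (non-iPhone) odds of purchase were, say, 0.5
# (a hypothetical number), an iPhone would roughly double them
iphone_odds = 0.5 * odds_ratio
print(odds_ratio)
print(iphone_odds)
```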
Our examples have mostly been binary classification, meaning that we predict only one of two outcomes, but logistic regression can also handle multiple categories in our categorical response using a one-versus-all approach, meaning that it fits a probability curve for each categorical response!
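One way to make the one-versus-all idea explicit in scikit-learn is the OneVsRestClassifier wrapper, which fits one probability curve per class; the toy data below is invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# toy three-class data: one feature, classes centered at 0, 5, and 10
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 50),
                    rng.normal(5, 1, 50),
                    rng.normal(10, 1, 50)]).reshape(-1, 1)
y = np.array([0] * 50 + [1] * 50 + [2] * 50)

# fit one logistic curve per class, then combine the probabilities
clf = OneVsRestClassifier(LogisticRegression()).fit(X, y)

probs = clf.predict_proba([[5.0]])
print(probs.shape)  # (1, 3): one probability per class
print(probs)        # the middle class should dominate at x = 5
```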
Back to our bikes briefly to see scikit-learn's logistic regression in action. I will begin by making a new response variable that is categorical. To make things simple, I made a column, called above_average
, which is true if the hourly bike rental count is above average and false otherwise.
# Make a categorical response
bikes['above_average'] = bikes['count'] >= average_bike_rental
As mentioned before, we should look at our null model. In regression, our null model always predicts the average response, but in classification, our null model always predicts the most common outcome. In this case, we can use a Pandas value count to see that. About 60% of the time, the bike rental count is not above average:
bikes['above_average'].value_counts(normalize=True)
Now, let's actually use logistic regression to try and predict whether or not the hourly bike rental count will be above average, as shown:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

feature_cols = ['temp']  # using only temperature
X = bikes[feature_cols]
y = bikes['above_average']
# make our overall X and y variables; this time our y is
# our binary response variable, above_average

X_train, X_test, y_train, y_test = train_test_split(X, y)
# make our train test split

logreg = LogisticRegression()  # instantiate our model
logreg.fit(X_train, y_train)   # fit our model to our training set

logreg.score(X_test, y_test)
# score it on our test set to get a better sense of out-of-sample performance
# 0.65650257
It seems that by only using temperature, we can beat the null model of guessing false all of the time! This is our first step in making our model the best it can be.
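Beyond a single accuracy score, the fitted model can report a probability for each prediction via predict_proba. Since the bikes DataFrame from earlier isn't reproduced here, this sketch uses a synthetic stand-in with the same shape of relationship:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# synthetic stand-in for the bikes data: whether rentals are above
# average tends to rise with temperature (invented for illustration)
rng = np.random.default_rng(1)
temp = rng.uniform(0, 40, 500).reshape(-1, 1)
above_average = temp.ravel() + rng.normal(0, 5, 500) > 20

logreg = LogisticRegression().fit(temp, above_average)

# predict_proba returns one column per class: P(False), P(True)
probs = logreg.predict_proba([[5.0], [20.0], [35.0]])
print(probs[:, 1])  # probability of above-average rentals rises with temp
```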
Between linear and logistic regression, I'd say we already have a great tool belt of machine learning forming, but I have a question: it seems that both of these algorithms are only able to take in quantitative columns as features, but what if I have a categorical feature that I think has an association with my response?