Probability, odds, and log odds

We are familiar with the basic concept of probability in that the probability of an event occurring can be simply modeled as the number of ways the event can occur divided by all the possible outcomes. For example, if, out of 3,000 people who walked into a store, 1,000 actually bought something, then we could say that the probability of a single person buying an item is as shown:

probability of buying = 1000 / 3000 = 1/3 ≈ 0.33

However, we also have a related concept, called odds. The odds of an outcome occurring is the ratio of the number of ways that the outcome occurs divided by every other possible outcome instead of all possible outcomes. In the same example, the odds of a person buying something would be as follows:

odds of buying = 1000 / 2000 = 1/2 = 0.5

This means that for every one customer who buys something, two customers will not. These concepts are so closely related that there is even a formula to get from one to the other:

odds = probability / (1 - probability)

Let's check this with our example, as illustrated:

odds = (1/3) / (1 - 1/3) = (1/3) / (2/3) = 1/2

It checks out!

Let's use Python to make a table of probabilities and associated odds, as shown:

import pandas as pd

# create a table of probability versus odds
table = pd.DataFrame({'probability': [0.1, 0.2, 0.25, 0.5, 0.6, 0.8, 0.9]})
table['odds'] = table.probability / (1 - table.probability)
table
   probability      odds
0         0.10  0.111111
1         0.20  0.250000
2         0.25  0.333333
3         0.50  1.000000
4         0.60  1.500000
5         0.80  4.000000
6         0.90  9.000000

So, we see that as our probabilities increase, so do our odds, but at a much faster rate! In fact, as the probability of an event occurring nears 1, our odds shoot off into infinity. Earlier, we said that we couldn't simply regress to probability because our line would shoot off into positive and negative infinities, predicting improper probabilities, but what if we regress to odds instead? Well, odds do go off to positive infinity, but alas, on the bottom they merely approach 0 and never go below it, so a regression line would still end up predicting impossible (negative) odds. Therefore, we cannot simply regress to probability or to odds. It looks like we've hit rock bottom, folks!
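Before we dig ourselves out, here is a quick sketch (assuming numpy is imported as np) verifying that behavior numerically, with odds computed for probabilities near 0 and near 1:

import numpy as np

# odds blow up as probability approaches 1, and shrink toward 0 as probability approaches 0
probs = np.array([0.001, 0.01, 0.5, 0.9, 0.99, 0.999])
probs / (1 - probs)
# roughly: 0.001, 0.0101, 1.0, 9.0, 99.0, 999.0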

However, wait: the number e and natural logarithms to the rescue! Think of logarithms as follows:

log_b(x) = y   is just another way of writing   b^y = x

Basically, logarithms and exponents are one and the same; we are just so used to writing things in exponent form that we forget there is another way to write them. How about an example? If we take the logarithm of a number, we are asking the question: what exponent would we need to put on our base to produce that number?

Note that np.log computes all logarithms in base e (the natural logarithm), which is what we want:

import numpy as np

np.log(10)  # == 2.302585...
# meaning that e ^ 2.302585... == 10

# to prove that (2.71828 is only an approximation of e, so we get very close to 10)
2.71828 ** 2.3025850929940459  # == 9.9999...
# e ^ log(10) == 10
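One more property of the logarithm is worth noting before we move on: it maps numbers between 0 and 1 to negative values, maps 1 to exactly 0, and maps numbers greater than 1 to positive values. This is exactly what will let us stretch odds, which live between 0 and positive infinity, into a quantity that is unbounded in both directions. A quick sketch:

np.log(0.5)  # == -0.6931... (numbers between 0 and 1 have negative logs)
np.log(1)    # == 0.0
np.log(2)    # == 0.6931... (numbers greater than 1 have positive logs)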

Let's go ahead and add the logarithm of odds, or log-odds to our table, as follows:

# add log-odds to the table
table['logodds'] = np.log(table.odds)
table
   probability      odds   logodds
0         0.10  0.111111 -2.197225
1         0.20  0.250000 -1.386294
2         0.25  0.333333 -1.098612
3         0.50  1.000000  0.000000
4         0.60  1.500000  0.405465
5         0.80  4.000000  1.386294
6         0.90  9.000000  2.197225

So, now every row has the probability of a single event occurring, the odds of that event occurring, and now the log-odds of that event occurring. Let's go ahead and ensure that our numbers are on the up and up. Let's choose a probability of .25, as illustrated:

prob = .25

odds = prob / (1 - prob)
odds
# 0.33333333

logodds = np.log(odds)
logodds
# -1.09861228

It checks out! Wait, look! Our logodds value dips below zero and, in fact, log-odds are bounded neither above nor below, which makes them a great candidate for the response variable of a linear regression. This is where our story of logistic regression really begins.
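Before we move on, note that we can also go in the other direction. Here is a minimal sketch, building on the prob, odds, and logodds variables we just computed, that recovers the original probability from the log-odds by exponentiating:

# undo the log with np.exp, then convert odds back into a probability
recovered_odds = np.exp(logodds)
recovered_prob = recovered_odds / (1 + recovered_odds)
recovered_prob
# 0.25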

The math of logistic regression

The long and short of it is that logistic regression is a linear regression between our feature, X, and the log-odds of our data belonging to a certain class that we will call true for the sake of generalization.

If p represents the probability of a data point belonging to a particular class, then logistic regression can be written as follows:

log(p / (1 - p)) = β0 + β1x

If we rearrange our variables and solve for p, we get the logistic function, which takes on an S shape and whose output, being a probability, is bounded by [0, 1]:

p = e^(β0 + β1x) / (1 + e^(β0 + β1x))

[Figure: the logistic function's S-shaped curve, mapping the continuous input x to a probability between 0 and 1]

The preceding graph represents the logistic function's ability to map our continuous input, x, to a smooth probability curve that begins at the left, near probability 0, and as we increase x, our probability of belonging to a certain class rises naturally and smoothly up to probability 1. In other words:

  • Logistic regression gives an output of the probabilities of a specific class being true
  • Those probabilities can be converted into class predictions

The logistic function has some nice properties, as follows:

  • It takes on an S shape
  • Output is bounded by 0 and 1, as a probability should be
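To see both properties in action, here is a minimal sketch of the logistic function in Python (the default β0 and β1 values below are made up purely for illustration):

import numpy as np

def logistic(x, beta_0=0.0, beta_1=1.0):
    # the log-odds equation solved for p
    return np.exp(beta_0 + beta_1 * x) / (1 + np.exp(beta_0 + beta_1 * x))

x = np.array([-10, -1, 0, 1, 10])
probs = logistic(x)
probs
# roughly: 0.0000454, 0.269, 0.5, 0.731, 0.99995 -- always between 0 and 1

probs >= 0.5
# converting probabilities into class predictions with a 0.5 threshold
# array([False, False,  True,  True,  True])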

In order to interpret the outputs of a logistic function, we must understand the difference between probability and odds. The odds of an event are given by the ratio of the probability of the event to the probability of its complement, as shown:

odds = p / (1 - p)

In linear regression, the β1 parameter represents the change in the response variable for a unit change in x. In logistic regression, β1 represents the change in the log-odds for a unit change in x. This means that e^β1 gives us the change in the odds for a unit change in x.

Consider that we are interested in mobile purchase behavior. Let y be a class label denoting purchase/no purchase, and let x denote whether the phone was an iPhone.

Also, suppose that we perform a logistic regression, and we get β1 = 0.693.

In this case, the odds ratio is np.exp(0.693) = 2, which means that the odds of purchase are twice as high if the phone is an iPhone.
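As a quick sanity check in Python (the 0.5 baseline odds below are hypothetical, just to make the doubling concrete):

import numpy as np

beta_1 = 0.693
odds_ratio = np.exp(beta_1)
odds_ratio
# 1.9997..., which is roughly 2

# hypothetical baseline: if the odds of purchase on a non-iPhone were 0.5,
# the odds of purchase on an iPhone would be about 0.5 * 2 = 1 (a 50% probability)
0.5 * odds_ratio
# roughly 1.0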

Note

Our examples have mostly been binary classification, meaning that we are only predicting one of two outcomes, but logistic regression can also handle multiple categories in our response using a one-versus-all approach, meaning that it will fit a probability curve for each category, as sketched below.
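For instance, here is a minimal sketch of one-versus-all logistic regression using scikit-learn's built-in iris dataset and its OneVsRestClassifier wrapper (neither is part of our bike example; this is purely for illustration):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

iris = load_iris()
X, y = iris.data, iris.target
# three classes of iris flower in the response

ovr_logreg = OneVsRestClassifier(LogisticRegression(max_iter=1000))
# explicitly fits one logistic regression (one probability curve) per class
ovr_logreg.fit(X, y)

ovr_logreg.predict_proba(X[:3])
# one column of probabilities per class for each row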

Back to our bikes briefly to see scikit-learn's logistic regression in action. I will begin by making a new, categorical response variable. To keep things simple, I will make a column, called above_average, which is true if the hourly bike rental count is above average and false otherwise.

# make a categorical response
average_bike_rental = bikes['count'].mean()  # assuming the "average" here is the mean hourly count
bikes['above_average'] = bikes['count'] >= average_bike_rental

As mentioned before, we should look at our null model. In regression, our null model always predicts the average response, but in classification, our null model always predicts the most common outcome. In this case, we can use a Pandas value count to see that. About 60% of the time, the bike rental count is not above average:

bikes['above_average'].value_counts(normalize=True)
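Since the null model always predicts the most common outcome, its accuracy is simply the relative frequency of that outcome. Here is a quick sketch of pulling that out as a single number:

# the null model's accuracy is the proportion of the most common class
null_accuracy = bikes['above_average'].value_counts(normalize=True).max()
null_accuracy
# roughly 0.60, matching the value counts above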

Now, let's actually use logistic regression to try and predict whether or not the hourly bike rental count will be above average, as shown:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

feature_cols = ['temp']
# using only temperature

X = bikes[feature_cols]
y = bikes['above_average']
# make our overall X and y variables, this time our y is
# our binary response variable, above_average

X_train, X_test, y_train, y_test = train_test_split(X, y)
# make our train test split

logreg = LogisticRegression()
# instantiate our model

logreg.fit(X_train, y_train)
# fit our model to our training set

logreg.score(X_test, y_test)
# score it on our test set to get a better sense of out-of-sample performance

# 0.65650257

It seems that by only using temperature, we can beat the null model of guessing false all of the time! This is our first step in making our model the best it can be.
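We can also carry over our earlier interpretation of the coefficients. Here is a minimal sketch, building on the logreg model we just fit (the exact number will depend on your train/test split), that turns the fitted temperature coefficient into an odds ratio:

import numpy as np

temp_coef = logreg.coef_[0][0]
# the fitted beta_1 for temperature: the change in log-odds per one-degree increase

np.exp(temp_coef)
# e ^ beta_1: the multiplicative change in the odds of an above-average
# hour for each one-degree increase in temperature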

Between linear and logistic regression, I'd say we already have a great machine learning tool belt forming, but I have a question: it seems that both of these algorithms can only take in quantitative columns as features, but what if I have a categorical feature that I think is associated with my response?
