Chapter 6. Logistic Regression with Python

In the previous chapter, we learned about linear regression. We saw that linear regression is one of the most basic models that assumes that there is a linear relationship between a predictor variable and an output variable.

In this chapter, we will be discussing the details of logistic regression. We will be covering the following topics in this chapter:

  • Math behind logistic regression: Logistic regression relies on concepts such as conditional probability and odds ratio. In this chapter, we will understand what they mean and how they are applied. We will also see how the odds ratio is transformed to establish a linear relationship with the predictor variable. We will analyze the final logistic regression equation and understand the meaning of each term and coefficient.
  • Implementing logistic regression with Python: Similar to what we did in the last chapter, we will take a dataset and implement a logistic regression model on it to understand the various nuances of logistic regression. We will use both the statsmodel.api and scikit-learn modules for doing this.
  • Model validation: The model, once developed, needs to be validated to assess the accuracy and efficiency of the model. We will see how a k-fold cross validation works to validate a model. We will also understand concepts such as Sensitivity, Specificity, and the Receiver Operating Characteristic (ROC) curve and see how they are used for model validation.

Linear regression versus logistic regression

One thing to note about the linear regression model is that the output variable is always a continuous variable. In other words, linear regression is a good choice when one needs to predict continuous numbers. However, what if the output variable is a discrete number. What if we want to classify our records in two or more categories? Can we still extend the assumptions of a linear relationship and try to classify the records?

As it happens, there is a separate regression model that takes care of a situation where the output variable is a binary or categorical variable rather than a continuous variable. This model is called logistic regression. In other words, logistic regression is a variation of linear regression where the output variable is a binary or categorical variable. The two regressions are similar in the sense that they both assume a linear relationship between the predictor and output variables. However, as we will see soon, the output variable needs to undergo some transformation in the case of logistic regression.

A few scenarios where logistic regression can be applied are as follows:

  • To predict whether a random customer will buy a particular product or not, given his details such as income, gender, shopping history, and advertisement history
  • To predict whether a team will win or lose a match, given the match and team details such as weather, form of players, stadium, and hours spent in training

    Note how the output variable in both the cases is a binary or categorical variable.

The following table contains a comparison of the two models:

 

Linear regression

Logistic regression

Predictor variables

Continuous numeric/categorical

Continuous numeric/categorical

Output variables

Continuous numeric

Categorical

Relationship

Linear

Linear (with some transformations)

Before we delve into implementing and assessing the model, it is of critical importance to understand the mathematics that makes the foundation of the algorithm. Let us try to understand some mathematical concepts that make the backbone of the logistic regression model.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset