Summary

Logistic regression is a versatile technique widely used when the variable to be predicted is binary (or categorical). This chapter dives deep into the math behind logistic regression and the process of implementing it using the scikit-learn and statsmodels modules. It is important to understand the math behind the algorithm so that the model is not used as a black box, without knowing what is going on under the hood. To recap, the following are the main takeaways from the chapter:

  • Linear regression is not an appropriate model for predicting binary variables, as its predicted values can range from -infinity to +infinity, while the binary variable can only be 0 or 1.
  • The odds of a certain event happening is the probability of that event happening divided by the probability of that event not happening. The higher the odds, the higher the chances of the event happening. The odds can range from 0 to infinity.
  • The final equation for the logistic regression is:
    ln(p / (1 - p)) = b0 + b1x1 + b2x2 + ... + bnxn
    where p is the probability of success, so p / (1 - p) is the odds. Equivalently, p = 1 / (1 + e^-(b0 + b1x1 + ... + bnxn)).
  • The variable coefficients are calculated using maximum likelihood estimation, i.e., by choosing the coefficients that maximize the log-likelihood of the observed data. The resulting equations are often solved numerically using the Newton-Raphson method.
  • Each coefficient estimate has a Wald statistic and p-value associated with it. The smaller the p-value, the more significant the variable coefficient is to the model.
  • The model can be validated using the k-fold cross-validation technique, wherein the logistic regression model is trained and tested k times, each time using a different fold of the overall dataset as the test set and the remaining folds as training data.
  • The model predicts the probability of success for each observation. A threshold probability value is defined to categorize the predicted probabilities as 0 (failures) or 1 (successes).
  • Sensitivity measures what proportion of successes were actually identified as successes, while Specificity measures what proportion of failures were actually identified as failures.
  • An ROC curve is a plot of Sensitivity vs (1-Specificity). A diagonal (y=x) line is a good benchmark for the ROC curve. If the curve lies above the diagonal line, the model is better than a random guess. If the curve lies below, then the model is worse than a random guess.

It will do wonders for your understanding of logistic regression if you take a dataset and try implementing a logistic regression model on it. In the next chapter, we will learn about an unsupervised algorithm called clustering (or segmentation) that is used widely in marketing and the natural sciences.
