We will illustrate how to use logistic regression with statsmodels based on a simple built-in dataset containing quarterly US macro data from 1959–2009 (see the notebook logistic_regression_macro_data.ipynb for details).
The variables and their transformations are listed in the following table:
| Variable | Description | Transformation |
| --- | --- | --- |
| realgdp | Real gross domestic product | Annual Growth Rate |
| realcons | Real personal consumption expenditures | Annual Growth Rate |
| realinv | Real gross private domestic investment | Annual Growth Rate |
| realgovt | Real federal expenditures and gross investment | Annual Growth Rate |
| realdpi | Real private disposable income | Annual Growth Rate |
| m1 | M1 nominal money stock | Annual Growth Rate |
| tbilrate | Monthly 3-month treasury bill rate | Level |
| unemp | Seasonally adjusted unemployment rate (%) | Level |
| infl | Inflation rate | Level |
| realint | Real interest rate | Level |
To obtain a binary target variable, we compute the 20-quarter rolling average of the annual growth rate of quarterly real GDP. We then assign 1 if current growth exceeds the moving average and 0 otherwise. Finally, we shift the indicator variable by one period to align next quarter's outcome with the current quarter's features.
We use an intercept, convert the quarter values to dummy variables, and train the logistic regression model as follows:

```python
import pandas as pd
import statsmodels.api as sm

# drop_cols: columns excluded from the feature set (defined earlier in the notebook)
data = pd.get_dummies(data.drop(drop_cols, axis=1), columns=['quarter'], drop_first=True).dropna()
model = sm.Logit(data.target, sm.add_constant(data.drop('target', axis=1)))
result = model.fit()
result.summary()
```
This produces the following summary for our model with 198 observations and 13 variables, including the intercept:
The summary indicates that the model has been trained using maximum likelihood and provides the maximized value of the log-likelihood function at -67.9.
The LL-Null value of -136.42 is the result of the maximized log-likelihood function when only an intercept is included. It forms the basis for the pseudo-R2 statistic and the Log-Likelihood Ratio (LLR) test.
The pseudo-R2 statistic is a substitute for the familiar R2 available under least squares. It is computed from the ratio of the maximized log-likelihood functions of the null model m0 and the full model m1 as follows:
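In McFadden's formulation, which is the version statsmodels reports, this is:

$$R^2_{\text{pseudo}} = 1 - \frac{\ln L(m_1)}{\ln L(m_0)}$$

With the values reported above, $1 - (-67.9)/(-136.42) \approx 0.50$.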
The values vary from 0 (when the model does not improve the likelihood) to 1 (when the model fits perfectly and the log-likelihood is maximized at 0). Consequently, higher values indicate a better fit.
The LLR test compares a restricted model against a more general alternative and is computed as:
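In terms of the same log-likelihoods, the test statistic is:

$$\text{LLR} = -2\,\bigl(\ln L(m_0) - \ln L(m_1)\bigr)$$

which follows a $\chi^2$ distribution with degrees of freedom equal to the number of restrictions. Here, comparing against the intercept-only model, $\text{LLR} = -2 \times (-136.42 - (-67.9)) \approx 137$.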
The null hypothesis is that the restricted model fits the data just as well, but the low p-value suggests that we can reject this hypothesis and prefer the full model over the null model. This is similar to the F-test for linear regression (where we can likewise use the LLR test when we estimate the model using MLE).
The z-statistic plays the same role as the t-statistic in the linear regression output and is likewise computed as the ratio of the coefficient estimate to its standard error. The p-values indicate the probability of observing the test statistic under the null hypothesis H0 : β = 0 that the population coefficient is zero. We can reject this hypothesis for the intercept, realcons, realinv, realgovt, realdpi, and unemp.