We will illustrate how to use logistic regression with statsmodels based on a simple built-in dataset containing quarterly US macro data from 1959–2009 (see the notebook logistic_regression_macro_data.ipynb for details).
The variables and their transformations are listed in the following table:
| Variable | Description | Transformation |
| --- | --- | --- |
| realgdp | Real gross domestic product | Annual Growth Rate |
| realcons | Real personal consumption expenditures | Annual Growth Rate |
| realinv | Real gross private domestic investment | Annual Growth Rate |
| realgovt | Real federal expenditures and gross investment | Annual Growth Rate |
| realdpi | Real private disposable income | Annual Growth Rate |
| m1 | M1 nominal money stock | Annual Growth Rate |
| tbilrate | Monthly 3-month treasury bill rate | Level |
| unemp | Seasonally adjusted unemployment rate (%) | Level |
| infl | Inflation rate | Level |
| realint | Real interest rate | Level |
To obtain a binary target variable, we compute the 20-quarter rolling average of the annual growth rate of quarterly real GDP. We then assign 1 if current growth exceeds the moving average and 0 otherwise. Finally, we shift the indicator variable by one period to align next quarter's outcome with the current quarter's features.
We use an intercept, convert the quarter values to dummy variables, and train the logistic regression model as follows:

```python
import pandas as pd
import statsmodels.api as sm

# drop_cols: columns excluded from the feature set (defined earlier in the notebook)
data = pd.get_dummies(data.drop(drop_cols, axis=1), columns=['quarter'], drop_first=True).dropna()
model = sm.Logit(data.target, sm.add_constant(data.drop('target', axis=1)))
result = model.fit()
result.summary()
```
This produces the following summary for our model with 198 observations and 13 variables, including the intercept:
The summary indicates that the model has been trained using maximum likelihood and provides the maximized value of the log-likelihood function at -67.9.
The LL-Null value of -136.42 is the result of the maximized log-likelihood function when only an intercept is included. It forms the basis for the pseudo-R2 statistic and the Log-Likelihood Ratio (LLR) test.
The pseudo-R2 statistic is a substitute for the familiar R2 available under least squares. It is computed from the ratio of the maximized log-likelihood functions of the null model m0 and the full model m1 as follows:
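In McFadden's formulation, which is the version statsmodels reports, this is:

$$R^2_{\text{pseudo}} = 1 - \frac{\ln L(m_1)}{\ln L(m_0)}$$

With the values reported above, $1 - (-67.9)/(-136.42) \approx 0.50$.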
The values vary from 0 (when the model does not improve the likelihood) to 1 (when the model fits perfectly and the log-likelihood is maximized at 0). Consequently, higher values indicate a better fit.
The LLR test compares a restricted model against a more general alternative and is computed as:
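In terms of the same log-likelihoods, the test statistic is:

$$\text{LLR} = -2\,\bigl(\ln L(m_0) - \ln L(m_1)\bigr)$$

which follows a $\chi^2$ distribution with degrees of freedom equal to the number of restrictions. Here, comparing against the intercept-only model, $\text{LLR} = -2 \times (-136.42 - (-67.9)) \approx 137$.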
The null hypothesis is that the restricted model fits the data just as well, but the low p-value suggests that we can reject this hypothesis and prefer the full model over the null model. This is similar to the F-test for linear regression (where we can likewise use the LLR test when we estimate the model using MLE).
The z-statistic plays the same role as the t-statistic in the linear regression output and is likewise computed as the ratio of the coefficient estimate to its standard error. The p-values indicate the probability of observing the test statistic under the null hypothesis H0 : β = 0 that the population coefficient is zero. We can reject this hypothesis for the intercept, realcons, realinv, realgovt, realdpi, and unemp.