Logistic regression

We discussed the intuition behind and basics of logistic regression models in Chapter 3, Machine Learning Foundations. To build a model on our training set, we use the following code:

import pandas as pd
from sklearn.linear_model import LogisticRegression

clfs = [LogisticRegression()]

for clf in clfs:
    # Fit the model on the training set and report accuracy on both sets
    clf.fit(X_train, y_train.ravel())
    print(type(clf))
    print('Training accuracy: ' + str(clf.score(X_train, y_train)))
    print('Validation accuracy: ' + str(clf.score(X_test, y_test)))

    # Pair each column name with its fitted coefficient
    coefs = {
        'column': [X_train_cols[i] for i in range(len(X_train_cols))],
        'coef': [clf.coef_[0, i] for i in range(len(X_train_cols))]
    }
    df_coefs = pd.DataFrame(coefs)
    print(df_coefs.sort_values('coef', axis=0, ascending=False))

Prior to the for loop, we import the LogisticRegression class and place a LogisticRegression instance in the clfs list. The training and testing occur inside the for loop. First, we use the fit() method to fit the model (that is, to determine the optimal coefficients) using the training data. Next, we use the score() method to measure the model's accuracy on both the training data and the testing data.

In the second half of the for loop, we print out the coefficient value of each feature. In general, features whose coefficients are farthest from zero are the most strongly positively or negatively associated with the outcome. However, because we did not scale the data prior to training, the coefficient magnitudes are not directly comparable across features, and an important predictor measured on a small numeric scale can end up with a deceptively small coefficient.
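
If we wanted coefficients that are directly comparable, one option is to standardize the features before fitting. The following is only a sketch and is not part of the chapter's pipeline; it assumes the same X_train, y_train, X_test, and y_test objects used above. The output shown next comes from the original, unscaled model.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Standardize each feature to zero mean and unit variance before fitting.
# If X_train is a sparse matrix, use StandardScaler(with_mean=False) instead.
scaled_clf = make_pipeline(StandardScaler(), LogisticRegression())
scaled_clf.fit(X_train, y_train.ravel())
print('Validation accuracy: ' + str(scaled_clf.score(X_test, y_test)))

# The rescaled coefficients live on the final step of the pipeline
print(scaled_clf.named_steps['logisticregression'].coef_[0, :10])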

The output of the code should look like the following:

<class 'sklearn.linear_model.logistic.LogisticRegression'>
Training accuracy: 0.888978581423
Validation accuracy: 0.884261501211
         coef                                             column
346  2.825056                     rfv_Symptoms_of_onset_of_labor
696  1.618454                   rfv_Adverse_effect_of_drug_abuse
95   1.467790                    rfv_Delusions_or_hallucinations
108  1.435026  rfv_Other_symptoms_or_problems_relating_to_psy...
688  1.287535                                rfv_Suicide_attempt
895  1.265043                                          IMMEDR_01
520  1.264023  rfv_General_psychiatric_or_psychological_exami...
278  1.213235                                       rfv_Jaundice
712  1.139245         rfv_For_other_and_unspecified_test_results
469  1.084806                            rfv_Other_heart_disease
...

First, let's discuss the performance on the training and testing sets. The two accuracies are close together, which indicates that the model did not overfit the training set. The accuracy is approximately 88%, which is on par with the performance reported in research studies (Cameron et al., 2015) for predicting hospital admission from the emergency department.

Looking at the coefficients, we can confirm that they make intuitive sense. The feature with the highest coefficient is related to the onset of labor in pregnancy; labor almost always results in a hospital admission. Many of the other high-coefficient features pertain to severe psychiatric illness, which almost always results in an admission because of the risk the patient poses to themselves or to others. The IMMEDR_01 feature also has a high coefficient; remember that this feature corresponds to a value of 1 on the triage scale, which is the most critical value:

...
898 -0.839861                      IMMEDR_04
823 -0.848631                     BEDDATA_03
625 -0.873828           rfv_Hand_and_fingers
371 -0.960739                  rfv_Skin_rash
188 -0.963524              rfv_Earache_pain_
217 -0.968058                   rfv_Soreness
899 -1.019763                      IMMEDR_05
604 -1.075670  rfv_Suture__insertion_removal
235 -1.140021                  rfv_Toothache
30  -1.692650                       LEFTATRI

In contrast, scrolling to the bottom of the printout reveals some of the features that are negatively correlated with an admission. Having a toothache or needing sutures inserted or removed appears here; these complaints are unlikely to result in admission because they are not urgent.
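
Rather than scrolling, you can print just these features directly; a small sketch, reusing the df_coefs DataFrame created in the training loop above:

# Ten most negative coefficients; df_coefs comes from the earlier code
print(df_coefs.sort_values('coef', ascending=False).tail(10))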

We've trained our first model. Let's see whether some of the more complex models can improve on its performance.
