Logistic regression

We discussed the intuition behind and basics of logistic regression models in Chapter 3, Machine Learning Foundations. To build a model on our training set, we use the following code:

import pandas as pd
from sklearn.linear_model import LogisticRegression

clfs = [LogisticRegression()]

for clf in clfs:
    # Fit the model on the training set and report accuracy on both sets
    clf.fit(X_train, y_train.ravel())
    print(type(clf))
    print('Training accuracy: ' + str(clf.score(X_train, y_train)))
    print('Validation accuracy: ' + str(clf.score(X_test, y_test)))

    # Pair each column name with its fitted coefficient
    coefs = {
        'column': [X_train_cols[i] for i in range(len(X_train_cols))],
        'coef': [clf.coef_[0, i] for i in range(len(X_train_cols))]
    }
    df_coefs = pd.DataFrame(coefs)
    print(df_coefs.sort_values('coef', axis=0, ascending=False))

Prior to the for loop, we import the LogisticRegression class and place a LogisticRegression instance in the clfs list. The training and testing occur inside the for loop. First, we use the fit() method to fit the model (that is, to determine the optimal coefficients) using the training data. Next, we use the score() method to measure the model's accuracy on both the training data and the testing data.

In the second half of the for loop, we print out the coefficient value of each feature. In general, features whose coefficients are farthest from zero are the most strongly positively or negatively associated with the outcome. However, because we did not scale the data prior to training, the coefficient magnitudes are not directly comparable across features, and an important predictor measured on a small numeric scale can end up with a deceptively small coefficient.
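
If we wanted coefficients that are directly comparable, one option is to standardize the features before fitting. The following is only a sketch and is not part of the chapter's pipeline; it assumes the same X_train, y_train, X_test, and y_test objects used above. The output shown next comes from the original, unscaled model.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Standardize each feature to zero mean and unit variance before fitting.
# If X_train is a sparse matrix, use StandardScaler(with_mean=False) instead.
scaled_clf = make_pipeline(StandardScaler(), LogisticRegression())
scaled_clf.fit(X_train, y_train.ravel())
print('Validation accuracy: ' + str(scaled_clf.score(X_test, y_test)))

# The rescaled coefficients live on the final step of the pipeline
print(scaled_clf.named_steps['logisticregression'].coef_[0, :10])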

The output of the code should look like the following:

<class 'sklearn.linear_model.logistic.LogisticRegression'>
Training accuracy: 0.888978581423
Validation accuracy: 0.884261501211
         coef                                             column
346  2.825056                     rfv_Symptoms_of_onset_of_labor
696  1.618454                   rfv_Adverse_effect_of_drug_abuse
95   1.467790                    rfv_Delusions_or_hallucinations
108  1.435026  rfv_Other_symptoms_or_problems_relating_to_psy...
688  1.287535                                rfv_Suicide_attempt
895  1.265043                                          IMMEDR_01
520  1.264023  rfv_General_psychiatric_or_psychological_exami...
278  1.213235                                       rfv_Jaundice
712  1.139245         rfv_For_other_and_unspecified_test_results
469  1.084806                            rfv_Other_heart_disease
...

First, let's discuss the performance on the training and testing sets. The two accuracies are close together, which indicates that the model did not overfit the training set. The accuracy is approximately 88%, which is on par with the performance reported in research studies (Cameron et al., 2015) for predicting hospital admission from the emergency department.

Looking at the coefficients, we can confirm that they make intuitive sense. The feature with the highest coefficient is related to the onset of labor in pregnancy; labor almost always results in a hospital admission. Many of the other high-coefficient features pertain to severe psychiatric illness, which almost always results in an admission because of the risk the patient poses to themselves or to others. The IMMEDR_01 feature also has a high coefficient; remember that this feature corresponds to a value of 1 on the triage scale, which is the most critical value:

...
898 -0.839861                      IMMEDR_04
823 -0.848631                     BEDDATA_03
625 -0.873828           rfv_Hand_and_fingers
371 -0.960739                  rfv_Skin_rash
188 -0.963524              rfv_Earache_pain_
217 -0.968058                   rfv_Soreness
899 -1.019763                      IMMEDR_05
604 -1.075670  rfv_Suture__insertion_removal
235 -1.140021                  rfv_Toothache
30  -1.692650                       LEFTATRI

In contrast, scrolling to the bottom of the printout reveals some of the features that are negatively correlated with an admission. Having a toothache or needing sutures inserted or removed appears here; these complaints are unlikely to result in admission because they are not urgent.
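
Rather than scrolling, you can print just these features directly; a small sketch, reusing the df_coefs DataFrame created in the training loop above:

# Ten most negative coefficients; df_coefs comes from the earlier code
print(df_coefs.sort_values('coef', ascending=False).tail(10))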

We've trained our first model. Let's see whether some of the more complex models can improve on its performance.
