The model is trained on the full development set. The scores are computed on the full evaluation set:
              precision    recall  f1-score   support

           0       1.00      0.96      0.98     85296
           1       0.04      0.93      0.08       147

   micro avg       0.96      0.96      0.96     85443
   macro avg       0.52      0.94      0.53     85443
weighted avg       1.00      0.96      0.98     85443
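To make the summary rows of the report easier to read, the following minimal sketch recomputes a few of them from the per-class rows. The helper names (f1_from, macro_avg, weighted_avg) are introduced here for illustration only. Because the inputs are already rounded to two decimals, recomputed values can differ from the report in the last digit.

```python
# Per-class rows copied from the report above: (precision, recall, f1, support).
rows = {
    0: (1.00, 0.96, 0.98, 85296),
    1: (0.04, 0.93, 0.08, 147),
}
PRECISION, RECALL, F1, SUPPORT = 0, 1, 2, 3


def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def macro_avg(column):
    """Unweighted mean over the classes: both classes count equally."""
    return sum(r[column] for r in rows.values()) / len(rows)


def weighted_avg(column):
    """Mean over the classes weighted by support: the large class dominates."""
    total = sum(r[SUPPORT] for r in rows.values())
    return sum(r[column] * r[SUPPORT] for r in rows.values()) / total


minority_f1 = f1_from(rows[1][PRECISION], rows[1][RECALL])
macro_precision = macro_avg(PRECISION)
weighted_precision = weighted_avg(PRECISION)

print("F1 for class 1:     %.2f" % minority_f1)         # 0.08
print("macro precision:    %.2f" % macro_precision)     # 0.52
print("weighted precision: %.2f" % weighted_precision)  # 1.00
```

The weighted average looks almost perfect only because class 0 supplies 85296 of the 85443 samples. The macro average and the class 1 row show that precision on the minority class is very low even though its recall is high.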
We use a grid search to find the value of the hyperparameter C (the inverse regularization strength of LogisticRegression) that gives the best recall:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

def print_gridsearch_scores(x_train_data, y_train_data):
    # Candidate values for C, evaluated with 5-fold cross-validation on recall.
    c_param_range = [0.01, 0.1, 1, 10, 100]
    clf = GridSearchCV(LogisticRegression(), {"C": c_param_range}, cv=5, scoring='recall')
    clf.fit(x_train_data, y_train_data)
    print("Best parameters set found on development set:")
    print(clf.best_params_)
    print("Grid scores on development set:")
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    # Report the mean recall and twice its standard deviation for each candidate.
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
    return clf.best_params_["C"]
We run the search on the undersampled training data and keep the best value of C:
best_c = print_gridsearch_scores(X_train_undersample,y_train_undersample)
The output looks like this:
Best parameters set found on development set:
{'C': 0.01}
Grid scores on development set:
0.916 (+/-0.056) for {'C': 0.01}
0.907 (+/-0.068) for {'C': 0.1}
0.916 (+/-0.089) for {'C': 1}
0.916 (+/-0.089) for {'C': 10}
0.913 (+/-0.095) for {'C': 100}
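Each row of this output is the mean recall across the five folds, followed by twice the standard deviation of the per-fold scores. The sketch below rebuilds one such row by hand. The per-fold recall values are hypothetical (they are not taken from the run above), and the spread is assumed to be a population standard deviation, which is what NumPy's default std computes.

```python
from statistics import pstdev

# Hypothetical per-fold recall scores for one value of C (cv=5 gives five folds).
fold_recalls = [0.90, 0.92, 0.94, 0.88, 0.94]
params = {"C": 0.01}

mean = sum(fold_recalls) / len(fold_recalls)
std = pstdev(fold_recalls)  # population standard deviation (divides by n)

# Same format string as in print_gridsearch_scores.
line = "%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params)
print(line)  # 0.916 (+/-0.047) for {'C': 0.01}
```

Read this way, the candidates in the grid are hard to tell apart: their mean recalls differ by less than 0.01, while the reported spreads are between 0.056 and 0.095.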
Next, we create a function that plots a confusion matrix. Normalization can be applied by setting normalize=True:
import itertools

import matplotlib.pyplot as plt
import numpy as np

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    # Normalize first so that the colors and the cell labels show the same values.
    if normalize:
        # Divide each row by its total so that every row sums to 1.
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    # Use white text on dark cells and black text on light cells.
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
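To see what the normalize=True branch does, here is a minimal sketch of the same row normalization in plain Python, without NumPy. The 2x2 matrix and the helper normalize_rows are hypothetical and used only for illustration. The rows are true labels and the columns are predicted labels, matching the axis labels set by the plotting function.

```python
# Hypothetical 2x2 confusion matrix: rows are true labels, columns are predictions.
cm = [
    [90, 10],  # true class 0: 90 predicted as 0, 10 predicted as 1
    [2, 8],    # true class 1: 2 predicted as 0, 8 predicted as 1
]


def normalize_rows(matrix):
    """Divide every cell by the total of its row, so each row sums to 1."""
    return [[cell / float(sum(row)) for cell in row] for row in matrix]


cm_normalized = normalize_rows(cm)

# After row normalization, the diagonal holds the recall of each class.
recall_per_class = [cm_normalized[i][i] for i in range(len(cm_normalized))]

print(cm_normalized)     # [[0.9, 0.1], [0.2, 0.8]]
print(recall_per_class)  # [0.9, 0.8]
```

Row normalization is convenient for imbalanced data, because the raw counts of the majority class would otherwise dominate the color scale of the plot.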