Ship it!

Let's assume we want to integrate this classifier into our site. In all of the preceding examples, we trained on only 90% of the available data, because we held out the other 10% for testing. Let's also assume that this data is all we have. In that case, we should retrain the classifier on all of the data:

>>> C_best = 0.01 # determined above
>>> clf = LogisticRegression(C=C_best)
>>> clf.fit(X, Y) # now training on all data without cross-validation
>>> print(clf.coef_)
[[ 0.24937413 0.00777857 0.0097297 0.00061647 0.02354386 -0.03715787 -0.03406846]]

Finally, we should store the trained classifier, because we definitely do not want to retrain it each time we start the classification service. Instead, we can simply serialize the classifier after training and then deserialize it on the site:

>>> import pickle
>>> pickle.dump(clf, open("logreg.dat", "wb")) # pickle data is binary, so open with "wb"
>>> clf = pickle.load(open("logreg.dat", "rb")) # and read it back with "rb"
>>> print(clf.coef_) # showing that we indeed got the same classifier again
[[ 0.24937413 0.00777857 0.0097297 0.00061647 0.02354386 -0.03715787 -0.03406846]]
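
In a long-running service we would also want the file handles to be closed reliably. The following is a minimal sketch of the same round trip using context managers. The helper names save_model() and load_model() are our own and not part of any library, and a plain dictionary stands in for the trained classifier so that the snippet runs without scikit-learn:

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Serialize any picklable object to path (binary mode)."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Deserialize the object that save_model() stored at path."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Hypothetical stand-in for the trained classifier; a fitted
# LogisticRegression instance could be passed in the same way.
stand_in = {"C": 0.01, "coef": [0.24937413, 0.00777857]}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "logreg.dat")
    save_model(stand_in, model_path)
    restored = load_model(model_path)

print(restored == stand_in)  # the restored object equals the original
```

Keep in mind that unpickling can execute arbitrary code, so the service should only load files that we have written ourselves.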

Congratulations, the classifier is now ready to be used as if it had just been trained. We can now use the classifier's predict_proba() to calculate the probability of an answer being a good one. We will use the threshold of 0.66, which results in 79% precision at 21% recall, as we determined earlier:

>>> good_thresh = 0.66  
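
As a reminder of what such a precision/recall pair means, here is a small sketch that computes both values for a given threshold. The function name precision_recall_at() and the toy probabilities and labels are made up for illustration; they are not the data behind the 79%/21% figures:

```python
def precision_recall_at(threshold, probs_good, labels):
    """Return (precision, recall) when posts with P(good) >= threshold
    are predicted as good. labels holds 1 for good and 0 for poor."""
    predicted = [p >= threshold for p in probs_good]
    tp = sum(1 for pred, y in zip(predicted, labels) if pred and y == 1)
    fp = sum(1 for pred, y in zip(predicted, labels) if pred and y == 0)
    fn = sum(1 for pred, y in zip(predicted, labels) if not pred and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data (hypothetical): P(good) per post and the true labels.
toy_probs = [0.90, 0.70, 0.68, 0.60, 0.40, 0.20]
toy_labels = [1, 1, 0, 1, 1, 0]

precision, recall = precision_recall_at(0.66, toy_probs, toy_labels)
print(precision, recall)
```

Raising the threshold typically trades recall for precision: fewer posts are flagged as good, but those that are flagged are more likely to deserve it.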

Let's take a look at the features of two artificial posts to show how it works:

>>> # Remember that the features are in this order:
>>> # LinkCount, NumCodeLines, NumTextTokens, AvgSentLen, AvgWordLen,
>>> # NumAllCaps, NumExclams
>>> good_post = (2, 1, 100, 5, 4, 1, 0)
>>> poor_post = (1, 0, 10, 5, 6, 5, 4)
>>> proba = clf.predict_proba([good_post, poor_post])
>>> proba # probabilities (poor, good) per post
array([[ 0.30127876, 0.69872124],
       [ 0.62934963, 0.37065037]])
>>> proba >= good_thresh
array([[False, True],
       [False, False]], dtype=bool)

As expected, we manage to detect the first post as good, but cannot say anything about the second, whose probability of being good (0.37) stays below the threshold. In that case, we would show a nice, motivating message directing the writer to improve the post.
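
Note that only the second column, the probability of the post being good, matters for this decision. A minimal sketch of the check in plain Python, using the two probability rows from above (the helper name is_good() is our own):

```python
GOOD_THRESH = 0.66  # threshold chosen earlier


def is_good(proba_row, thresh=GOOD_THRESH):
    """proba_row is (P(poor), P(good)); compare only P(good)."""
    return proba_row[1] >= thresh


# The probabilities computed for good_post and poor_post.
proba = [(0.30127876, 0.69872124),
         (0.62934963, 0.37065037)]

decisions = [is_good(row) for row in proba]
print(decisions)  # [True, False]
```

With the real classifier, the rows returned by clf.predict_proba() could be passed to the same function.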
