A bit of math with a small example

To get an initial understanding of the way logistic regression works, let's first take a look at the following example, where we have artificial feature values, X, plotted with the corresponding classes, 0 or 1:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

np.random.seed(3)  # for reproducibility
NUM_PER_CLASS = 40
# class 0 is drawn around a feature value of 2, class 1 around 8
X_log = np.hstack((norm.rvs(2, size=NUM_PER_CLASS, scale=2),
                   norm.rvs(8, size=NUM_PER_CLASS, scale=3)))
y_log = np.hstack((np.zeros(NUM_PER_CLASS),
                   np.ones(NUM_PER_CLASS))).astype(int)
plt.xlim((-5, 20))
plt.scatter(X_log, y_log, c=np.array(['blue', 'red'])[y_log], s=10)
plt.xlabel("feature value")
plt.ylabel("class")

Refer to the following graph:

As we can see, the data is so noisy that the classes overlap in the feature value range between 1 and 6. Therefore, it is better not to model the discrete classes directly, but rather the probability that a feature value belongs to class 1, P(X). Once we possess such a model, we can then predict class 1 if P(X) > 0.5, and class 0 otherwise.
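Expressed as code (a purely illustrative sketch with made-up probabilities, not part of the book's example), this decision rule is just a threshold at 0.5:

import numpy as np

# hypothetical predicted probabilities P(class=1) for four data points
p = np.array([0.1, 0.45, 0.62, 0.97])
predicted_class = (p > 0.5).astype(int)
print(predicted_class)  # -> [0 0 1 1]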

Mathematically, it is always difficult to model something that has a finite range, as is the case here with our discrete labels, 0 and 1, and the probability we want to model is itself confined to the range between 0 and 1. We therefore transform the probability into a quantity that can take any real value, and for that, we will need the odds ratio and its logarithm.

Let's say a feature value has a probability of 0.9 of belonging to class 1, that is, P(y=1) = 0.9. The odds ratio is then P(y=1)/P(y=0) = 0.9/0.1 = 9. We could say that the chance is 9:1 that this feature value maps to class 1. If P(y=1) = 0.5, we would consequently have a 1:1 chance that the instance is of class 1. The odds ratio is bounded below by 0 but grows towards infinity (the left graph in the following set of graphs). If we now take the logarithm of it, we can map all probabilities between 0 and 1 to the full range from negative to positive infinity (the right graph in the following set of graphs). The nice thing is that we still maintain the relationship that a higher probability leads to a higher log of odds, just no longer limited to the range between 0 and 1:
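The following little snippet is not part of the book's example; it just illustrates the mapping numerically. Probabilities close to 0 or 1 get pushed far towards negative or positive infinity on the log(odds) scale, while P = 0.5 maps to exactly 0:

import numpy as np

for p in [0.01, 0.1, 0.5, 0.9, 0.99]:
    odds = p / (1 - p)  # bounded below by 0, unbounded above
    print("P=%.2f  odds=%6.2f  log(odds)=%6.2f" % (p, odds, np.log(odds)))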

This means that we can now fit linear combinations of our features (OK, we only have one feature and a constant, but that will change soon) to the log(odds) values. In a sense, we replace the linear model, y = c0 + c1*x, from Chapter 1, Getting Started with Python Machine Learning, with log(odds) = log(p/(1-p)) = c0 + c1*x (replacing y with log(odds)), where p is the probability that an instance belongs to class 1.

We can solve this for p, so that we have p = 1/(1 + e^-(c0 + c1*x)).
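As a quick sanity check (again, just an illustration rather than part of the book's code), applying this logistic function to the log(odds) of a probability gives the original probability back:

import numpy as np

def logistic(z):
    # inverse of the log(odds) transform: maps any real value back into (0, 1)
    return 1 / (1 + np.exp(-z))

p = 0.9
log_odds = np.log(p / (1 - p))
print(logistic(log_odds))  # recovers 0.9 (up to floating-point rounding)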

We simply have to find the right coefficients, c0 and c1, such that the formula gives the lowest error over all the (xi, yi) pairs in our dataset; that part, however, will be done by scikit-learn.

After fitting, the formula will give the probability for every new data point, x, that belongs to class 1:

>>> from sklearn.linear_model import LogisticRegression
>>> clf = LogisticRegression()
>>> print(clf)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
>>> # scikit-learn expects one row per instance, hence the reshape
>>> clf.fit(X_log.reshape(-1, 1), y_log)
>>> print(np.exp(clf.intercept_), np.exp(clf.coef_.ravel()))
[ 0.09437188] [ 1.80094112]
>>> def lr_model(clf, X):
...     return 1 / (1 + np.exp(-(clf.intercept_ + clf.coef_ * X)))
>>> print("P(x=-1)=%.2f P(x=7)=%.2f" % (lr_model(clf, -1),
...                                     lr_model(clf, 7)))
P(x=-1)=0.05 P(x=7)=0.85

You might have noticed that scikit-learn exposes the intercept coefficient, c0, through the special intercept_ field, while the feature coefficient, c1, lives in coef_.
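As a cross-check (this snippet is not part of the original session), the manually coded lr_model should agree with scikit-learn's own predict_proba, whose second column holds P(class=1):

# Not from the book's code: cross-check the manual lr_model against
# scikit-learn's predict_proba; the second column is P(class=1) and should
# match the roughly 0.05 and 0.85 printed above.
print(clf.predict_proba(np.array([[-1.0], [7.0]]))[:, 1])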

If we plot the fitted model, we see that it makes perfect sense given the data:

X_range = np.arange(-5, 20, 0.1)
plt.figure(figsize=(10, 4), dpi=300)
plt.xlim((-5, 20))
plt.scatter(X_log, y_log, c=np.array(['blue', 'red'])[y_log], s=5)
# we use ravel() to get rid of the additional axis
plt.plot(X_range, lr_model(clf, X_range).ravel(), c='green')
plt.plot(X_range, np.ones(X_range.shape[0]) * 0.5, "--")
plt.xlabel("feature value")
plt.ylabel("class")
plt.grid(True)

Refer to the following graph: 
