The mechanics of logistic regression

Now that we have some knowledge of the logistic function, it is easy to map it to the algorithm that stems from it. In logistic regression, the function input z becomes the weighted sum of the features. Given a data sample x with n features, x1, x2, ..., xn (x represents a feature vector and x = (x1, x2, ..., xn)), and the weights (also called coefficients) of the model, w (w represents a vector (w1, w2, ..., wn)), z is expressed as follows:

z = w_1 x_1 + w_2 x_2 + \ldots + w_n x_n = w^{T} x

Sometimes, the model also comes with an intercept (also called bias), w0. In this case, the preceding linear relationship becomes:

z = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_n x_n

As for the output y(z), which is in the range of 0 to 1, in the algorithm it becomes the probability of the target being 1, that is, the positive class:

\hat{y} = P(y = 1 \mid x) = \frac{1}{1 + e^{-z}} = \frac{1}{1 + e^{-w^{T} x}}

Thus logistic regression is a probabilistic classifier, similar to the naive Bayes classifier.
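
To make the mapping concrete, here is a minimal sketch in the interactive style used throughout this chapter; the function names sigmoid and compute_prediction, as well as the toy feature and weight values, are illustrative choices rather than anything prescribed by the text:

>>> import numpy as np
>>> def sigmoid(z):
...     return 1.0 / (1.0 + np.exp(-z))
...
>>> def compute_prediction(x, weights, bias=0.0):
...     # weighted sum of the features, plus the intercept if there is one
...     z = np.dot(x, weights) + bias
...     # squash z into the probability of the positive class
...     return sigmoid(z)
...
>>> x = np.array([0.5, 1.2, -0.3])
>>> weights = np.array([0.4, -0.2, 1.0])
>>> y_hat = compute_prediction(x, weights, bias=0.1)   # roughly 0.44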

A logistic regression model, or specifically its weight vector w, is learned from the training data, with the goal of predicting a positive sample as close to 1 as possible and predicting a negative sample as close to 0 as possible. In mathematical language, the weights are trained so as to minimize the cost defined as the mean squared error (MSE), which measures the average of the squares of the differences between the truth and the prediction. Given m training samples, (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), ..., (x^{(m)}, y^{(m)}), where y^{(i)} is either 1 (positive class) or 0 (negative class), the cost function J(w) regarding the weights to be optimized is expressed as follows:

J(w) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \left( \hat{y}(x^{(i)}) - y^{(i)} \right)^2

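As a quick illustration, the MSE-based cost could be computed along the following lines; compute_mse_cost and the toy training set below are made up for this sketch, and sigmoid and weights come from the earlier snippet:

>>> def compute_mse_cost(X, y, weights, bias=0.0):
...     # average of the squared differences between truth and prediction
...     predictions = sigmoid(np.dot(X, weights) + bias)
...     return np.mean((predictions - y) ** 2 / 2)
...
>>> X_train = np.array([[0.5, 1.2, -0.3],
...                     [1.0, 0.3, 0.6]])
>>> y_train = np.array([1, 0])
>>> cost = compute_mse_cost(X_train, y_train, weights, bias=0.1)
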
However, the preceding cost function is non-convex, which means that, when searching for the optimal w, many local (suboptimal) optimums are found and the function does not converge to a global optimum.

Examples of a convex function and a non-convex function are plotted respectively below:
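
A small plotting sketch along these lines can stand in for the figures; the particular functions chosen here, a parabola as the convex example and an oscillating curve as the non-convex one, are illustrative stand-ins rather than the book's exact plots:

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> z = np.linspace(-5, 5, 500)
>>> plt.subplot(1, 2, 1)
>>> plt.plot(z, z ** 2)                          # convex: one global optimum
>>> plt.title('Convex')
>>> plt.subplot(1, 2, 2)
>>> plt.plot(z, z ** 2 + 10 * np.sin(3 * z))     # non-convex: several local optimums
>>> plt.title('Non-convex')
>>> plt.show()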

To overcome this, the cost function in practice is defined as follows:

J(w) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left( \hat{y}(x^{(i)}) \right) + \left( 1 - y^{(i)} \right) \log\left( 1 - \hat{y}(x^{(i)}) \right) \right]

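A minimal sketch of this cost in code might look as follows; the name compute_log_loss_cost and the small clipping step that guards against taking log(0) are my additions, and sigmoid, weights, X_train, and y_train are reused from the earlier illustrative snippets:

>>> def compute_log_loss_cost(X, y, weights, bias=0.0):
...     predictions = sigmoid(np.dot(X, weights) + bias)
...     # clip so that extreme predictions do not produce log(0)
...     predictions = np.clip(predictions, 1e-15, 1 - 1e-15)
...     return -np.mean(y * np.log(predictions)
...                     + (1 - y) * np.log(1 - predictions))
...
>>> cost = compute_log_loss_cost(X_train, y_train, weights, bias=0.1)
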
We can take a closer look at the cost of a single training sample:

j(w) = -y \log\left( \hat{y}(x) \right) - (1 - y) \log\left( 1 - \hat{y}(x) \right)

If y = 1, when it predicts correctly (the positive class with 100% probability), the sample cost j is 0; the cost keeps increasing as the predicted probability of the positive class decreases; and when it incorrectly predicts that there is no chance of the positive class, the cost is infinitely high:

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> # cost of a positive sample (y = 1) as a function of the prediction
>>> y_hat = np.linspace(0, 1, 1000)
>>> cost = -np.log(y_hat)
>>> plt.plot(y_hat, cost)
>>> plt.xlabel('Prediction')
>>> plt.ylabel('Cost')
>>> plt.xlim(0, 1)
>>> plt.ylim(0, 7)
>>> plt.show()

On the contrary, if y = 0, when it predicts correctly (the positive class with 0 probability, or the negative class with 100% probability), the sample cost j is 0; the cost keeps increasing as the predicted probability of the positive class increases; and when it wrongly predicts that there is no chance of the negative class, the cost goes infinitely high:

>>> # cost of a negative sample (y = 0) as a function of the prediction
>>> y_hat = np.linspace(0, 1, 1000)
>>> cost = -np.log(1 - y_hat)
>>> plt.plot(y_hat, cost)
>>> plt.xlabel('Prediction')
>>> plt.ylabel('Cost')
>>> plt.xlim(0, 1)
>>> plt.ylim(0, 7)
>>> plt.show()

Minimizing this alternative cost function is actually equivalent to minimizing the MSE-based cost function. The advantages of choosing it over the MSE one include the following:

  • It is obviously convex, so that the optimal model weights can be found
  • The summation of the logarithms of the predictions, \hat{y}(x) or 1 - \hat{y}(x), simplifies the calculation of the derivative of the cost with respect to the weights, which we will talk about later

Due to the logarithmic function, the cost function is also called logarithmic loss, or simply log loss.
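
As a sanity check on the earlier sketches, scikit-learn exposes this same quantity as sklearn.metrics.log_loss; assuming scikit-learn is installed, the hand-computed average below should match the library value (the toy labels and predicted probabilities are, again, made up for illustration):

>>> from sklearn.metrics import log_loss
>>> y_true = np.array([1, 0, 1, 1])
>>> y_pred = np.array([0.9, 0.2, 0.7, 0.4])
>>> manual = -np.mean(y_true * np.log(y_pred)
...                   + (1 - y_true) * np.log(1 - y_pred))
>>> library = log_loss(y_true, y_pred)
>>> # both are roughly 0.40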
