Training a logistic regression model using gradient descent

Gradient descent (also called steepest descent) is a procedure for minimizing an objective function by first-order iterative optimization. In each iteration, it takes a step that is proportional to the negative gradient of the objective function at the current point. This means the candidate point iteratively moves downhill toward the minimal value of the objective function. The proportion we just mentioned is called the learning rate, or step size. It can be summarized in a mathematical equation as follows:

$$w := w - \eta \Delta w$$

Here, the left w is the weight vector after a learning step, the right w is the one before moving, η is the learning rate, and ∆w is the first-order derivative, the gradient.
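To make the update rule concrete, here is a minimal sketch that applies it to a one-dimensional toy objective, f(w) = (w - 3)^2, whose gradient is 2(w - 3). The objective, the starting point, the learning rate of 0.1, and the names `gradient` and `gradient_descent` are illustrative assumptions, not part of the logistic regression model itself.

```python
def gradient(w):
    """Gradient of the toy objective f(w) = (w - 3) ** 2."""
    return 2.0 * (w - 3.0)


def gradient_descent(w_init, learning_rate, n_iter):
    """Repeatedly apply the update w := w - learning_rate * gradient(w)."""
    w = w_init
    for _ in range(n_iter):
        w = w - learning_rate * gradient(w)
    return w


# Starting from w = 0, the iterate moves downhill toward the minimizer w = 3
w_final = gradient_descent(w_init=0.0, learning_rate=0.1, n_iter=100)
print(w_final)
```

For this particular objective, a learning rate of 0.1 shrinks the distance to the minimum by a constant factor of 0.8 per step, whereas a rate above 1.0 would make the iterates diverge.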

In our case, let's start with the derivative of the cost function J(w) with respect to w. It might require some knowledge of calculus, but don't worry, we will walk through it step by step:

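  1. We first calculate the derivative of ŷ(z) with respect to w. We herein take the j-th weight, wj, as an example (note z = wTx, and we omit the (i) for simplicity):

$$\frac{\partial \hat{y}(z)}{\partial w_j} = \frac{\partial}{\partial w_j}\frac{1}{1+\exp(-z)} = \frac{\exp(-z)}{(1+\exp(-z))^2}\,x_j = \hat{y}(z)\,(1-\hat{y}(z))\,x_j$$

  1. Then, we calculate the derivative of the sample cost J(w) as follows:

$$\frac{\partial J(w)}{\partial w_j} = -y\,\frac{1}{\hat{y}(z)}\,\frac{\partial \hat{y}(z)}{\partial w_j} + (1-y)\,\frac{1}{1-\hat{y}(z)}\,\frac{\partial \hat{y}(z)}{\partial w_j} = (\hat{y}(z) - y)\,x_j$$

  1. Finally, we calculate the derivative of the entire cost over m samples as follows:

$$\Delta w_j = \frac{\partial J(w)}{\partial w_j} = \frac{1}{m}\sum_{i=1}^{m} -\left(y^{(i)} - \hat{y}(z^{(i)})\right) x_j^{(i)}$$

  1. We then generalize it to ∆w:

$$\Delta w = \frac{1}{m}\sum_{i=1}^{m} -\left(y^{(i)} - \hat{y}(z^{(i)})\right) x^{(i)}$$

  1. Combined with the preceding derivations, the weights can be updated as follows:

$$w := w + \eta\,\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)} - \hat{y}(z^{(i)})\right) x^{(i)}$$

 Here, w gets updated in each iteration.

  1. After a substantial number of iterations, the learned w (with any intercept folded in as w0) is then used to classify a new sample x' by means of the following equation:

$$y' = \frac{1}{1+\exp(-w^T x')}, \qquad \text{class} = \begin{cases} 1, & \text{if } y' \ge 0.5 \\ 0, & \text{if } y' < 0.5 \end{cases}$$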
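The gradient derived in the steps above can be sanity-checked numerically. The following sketch compares the analytic expression, the mean of -(y - ŷ)x over the samples, with a central finite-difference estimate of the cross-entropy cost. The synthetic random data and the helper names `analytic_gradient` and `numeric_gradient` are assumptions made purely for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def cost(X, y, w):
    """Mean cross-entropy cost J(w) over all samples."""
    y_hat = sigmoid(X @ w)
    return np.mean(-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat))


def analytic_gradient(X, y, w):
    """Gradient from the derivation: (1/m) * sum of -(y - y_hat) * x."""
    y_hat = sigmoid(X @ w)
    return -(X.T @ (y - y_hat)) / y.shape[0]


def numeric_gradient(X, y, w, eps=1e-6):
    """Central finite-difference approximation, one weight at a time."""
    grad = np.zeros_like(w)
    for j in range(w.shape[0]):
        step = np.zeros_like(w)
        step[j] = eps
        grad[j] = (cost(X, y, w + step) - cost(X, y, w - step)) / (2 * eps)
    return grad


# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = (rng.random(20) > 0.5).astype(float)
w = rng.normal(size=3)

max_abs_diff = np.max(np.abs(analytic_gradient(X, y, w)
                             - numeric_gradient(X, y, w)))
print(max_abs_diff)
```

If the derivation is right, the two estimates should agree up to a small numerical error.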

The decision threshold is 0.5 by default, but it can be set to other values. When false negatives must be avoided at all costs, for example, when predicting fire occurrence (positive class) for alerts, the decision threshold can be lower than 0.5, such as 0.3, depending on how paranoid we are and how proactively we want to prevent the positive event from happening. On the other hand, when false positives are what should be avoided, for instance, when predicting the product success (positive class) rate for quality assurance, the decision threshold can be greater than 0.5, such as 0.7, depending on how high a standard we set.
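As a small illustration of how the threshold changes the outcome, the sketch below labels the same set of predicted probabilities under thresholds of 0.3, 0.5, and 0.7. The probabilities are hypothetical values chosen for the example, and `classify` is an illustrative helper name.

```python
import numpy as np

# Hypothetical predicted probabilities of the positive class, for illustration only
probabilities = np.array([0.20, 0.35, 0.55, 0.80])


def classify(probabilities, threshold=0.5):
    """Label a sample 1 when its predicted probability reaches the threshold."""
    return (probabilities >= threshold).astype(int)


for threshold in (0.3, 0.5, 0.7):
    print(threshold, classify(probabilities, threshold))
```

Lowering the threshold to 0.3 flags three of the four samples as positive, while raising it to 0.7 flags only one.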

With a thorough understanding of the gradient descent-based training and prediction process, we now implement the logistic regression algorithm from scratch:

  1. We begin by defining the function that computes the prediction ŷ(x) with the current weights:
>>> import numpy as np
>>> def compute_prediction(X, weights):
...     """ Compute the prediction y_hat based on current weights
...     Args:
...         X (numpy.ndarray)
...         weights (numpy.ndarray)
...     Returns:
...         numpy.ndarray, y_hat of X under weights
...     """
...     # Assumes sigmoid(z) = 1 / (1 + exp(-z)) is already defined
...     z = np.dot(X, weights)
...     predictions = sigmoid(z)
...     return predictions

  1. With this, we can continue with the function that updates the weights by one step in a gradient descent manner. Take a look at the following code:
>>> def update_weights_gd(X_train, y_train, weights, learning_rate):
...     """ Update weights by one step
...     Args:
...         X_train, y_train (numpy.ndarray, training data set)
...         weights (numpy.ndarray)
...         learning_rate (float)
...     Returns:
...         numpy.ndarray, updated weights
...     """
...     predictions = compute_prediction(X_train, weights)
...     # X^T (y - y_hat) is the negative gradient scaled by m
...     weights_delta = np.dot(X_train.T, y_train - predictions)
...     m = y_train.shape[0]
...     weights += learning_rate / float(m) * weights_delta
...     return weights
  1. Then, we define the function that calculates the cost J(w):
>>> def compute_cost(X, y, weights):
...     """ Compute the cost J(w)
...     Args:
...         X, y (numpy.ndarray, data set)
...         weights (numpy.ndarray)
...     Returns:
...         float
...     """
...     predictions = compute_prediction(X, weights)
...     cost = np.mean(-y * np.log(predictions)
...                    - (1 - y) * np.log(1 - predictions))
...     return cost
  1. Now, we connect all these functions to the model training function by executing the following:
  • Updating the weights vector in each iteration
  • Printing out the current cost every 100 iterations (or at another interval) to ensure that the cost is decreasing and that things are on the right track

Take a look at the following:

>>> def train_logistic_regression(X_train, y_train, max_iter,
...                               learning_rate, fit_intercept=False):
...     """ Train a logistic regression model
...     Args:
...         X_train, y_train (numpy.ndarray, training data set)
...         max_iter (int, number of iterations)
...         learning_rate (float)
...         fit_intercept (bool, with an intercept w0 or not)
...     Returns:
...         numpy.ndarray, learned weights
...     """
...     if fit_intercept:
...         intercept = np.ones((X_train.shape[0], 1))
...         X_train = np.hstack((intercept, X_train))
...     weights = np.zeros(X_train.shape[1])
...     for iteration in range(max_iter):
...         weights = update_weights_gd(X_train, y_train,
...                                     weights, learning_rate)
...         # Check the cost for every 100 (for example) iterations
...         if iteration % 100 == 0:
...             print(compute_cost(X_train, y_train, weights))
...     return weights
  1. Finally, predict the results of new inputs using the trained model as follows:
>>> def predict(X, weights):
...     # Prepend the intercept column if the weights include w0
...     if X.shape[1] == weights.shape[0] - 1:
...         intercept = np.ones((X.shape[0], 1))
...         X = np.hstack((intercept, X))
...     return compute_prediction(X, weights)
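The listings above rely on a `sigmoid` function that is not defined in this section; presumably it comes from earlier in the chapter. If it is not at hand, a minimal sketch of the assumed helper could look like this:

```python
import numpy as np


def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))


print(sigmoid(np.array([-2.0, 0.0, 2.0])))
```

This plain form is enough for the small example that follows. For inputs with a very large magnitude, `np.exp` may emit an overflow warning, so a numerically hardened variant might be preferable in other settings.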

Implementing logistic regression is very simple, as we just saw. Let's now examine it using a brief example:

>>> X_train = np.array([[6, 7],
...                     [2, 4],
...                     [3, 6],
...                     [4, 7],
...                     [1, 6],
...                     [5, 2],
...                     [2, 0],
...                     [6, 3],
...                     [4, 1],
...                     [7, 2]])
>>> y_train = np.array([0,
...                     0,
...                     0,
...                     0,
...                     0,
...                     1,
...                     1,
...                     1,
...                     1,
...                     1])

Train a logistic regression model for 1000 iterations, at a learning rate of 0.1, with intercept-included weights:

>>> weights = train_logistic_regression(X_train, y_train,
...               max_iter=1000, learning_rate=0.1, fit_intercept=True)
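The training call prints the cost every 100 iterations. As an alternative to reading the printed values by eye, the following self-contained sketch records them in a list and checks programmatically that they decrease. The function `train_with_history` and its `record_every` parameter are hypothetical names introduced for this illustration; the update step and the data mirror the ones used in this section.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def compute_cost(X, y, weights):
    """Mean cross-entropy cost J(w)."""
    predictions = sigmoid(np.dot(X, weights))
    return np.mean(-y * np.log(predictions)
                   - (1 - y) * np.log(1 - predictions))


def train_with_history(X, y, max_iter, learning_rate, record_every=100):
    """Gradient descent with an intercept column; also returns recorded costs."""
    X = np.hstack((np.ones((X.shape[0], 1)), X))
    weights = np.zeros(X.shape[1])
    history = []
    for iteration in range(max_iter):
        predictions = sigmoid(np.dot(X, weights))
        weights = weights + learning_rate / X.shape[0] * np.dot(X.T, y - predictions)
        if iteration % record_every == 0:
            history.append(compute_cost(X, y, weights))
    return weights, history


X_train = np.array([[6, 7], [2, 4], [3, 6], [4, 7], [1, 6],
                    [5, 2], [2, 0], [6, 3], [4, 1], [7, 2]])
y_train = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

weights, history = train_with_history(X_train, y_train,
                                      max_iter=1000, learning_rate=0.1)
# True when every recorded cost is lower than the one before it
is_decreasing = all(a > b for a, b in zip(history, history[1:]))
print(history)
print(is_decreasing)
```

Keeping the history in a list also makes it straightforward to plot the cost curve or to stop early once the improvement becomes negligible.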

The decreasing cost means that the model is being optimized over time. We can check the model's performance on new samples as follows:

>>> X_test = np.array([[6, 1],
...                    [1, 3],
...                    [3, 1],
...                    [4, 5]])
>>> predictions = predict(X_test, weights)
>>> predictions
array([ 0.9999478 , 0.00743991, 0.9808652 , 0.02080847])

To visualize this, execute the following code:

>>> import matplotlib.pyplot as plt
>>> plt.scatter(X_train[:,0], X_train[:,1], c=['b']*5+['k']*5)

Blue dots are training samples from class 0, while black dots are those from class 1. Use 0.5 as the classification decision threshold:

>>> colours = ['k' if prediction >= 0.5 else 'b'
...            for prediction in predictions]
>>> plt.scatter(X_test[:,0], X_test[:,1], marker='*', c=colours)

Blue stars are testing samples predicted as class 0, while black stars are those predicted as class 1:

>>> plt.xlabel('x1')
>>> plt.ylabel('x2')
>>> plt.show()

Refer to the following screenshot for the end result:

The model we trained correctly predicts classes of new samples (the stars).

