5

Predicting Online Ad Click-Through with Logistic Regression

In the previous chapter, we predicted ad click-through using tree algorithms. In this chapter, we will continue our journey of tackling the billion-dollar problem. We will focus on learning a very (probably the most) scalable classification model: logistic regression. We will explore what the logistic function is, how to train a logistic regression model, how to add regularization to the model, and which variants of logistic regression are applicable to very large datasets. Besides its application in classification, we will also discuss how logistic regression and random forest are used to pick significant features. You won't get bored, as there will be lots of implementations from scratch, as well as with scikit-learn and TensorFlow.

In this chapter, we will cover the following topics:

  • Categorical feature encoding
  • The logistic function
  • What is logistic regression?
  • Gradient descent and stochastic gradient descent
  • The implementations of logistic regression
  • Click-through prediction with logistic regression
  • Logistic regression with L1 and L2 regularization
  • Logistic regression for feature selection
  • Online learning
  • Another way to select features—random forest

Converting categorical features to numerical – one-hot encoding and ordinal encoding

In Chapter 4, Predicting Online Ad Click-Through with Tree-Based Algorithms, I mentioned how one-hot encoding transforms categorical features to numerical features in order to use them in the tree algorithms in scikit-learn and TensorFlow. If we transform categorical features into numerical ones using one-hot encoding, we don't limit our choice of algorithms to the tree-based ones that can work with categorical features.

The simplest solution we can think of in terms of transforming a categorical feature with k possible values is to map it to a numerical feature with values from 1 to k. For example, [Tech, Fashion, Fashion, Sports, Tech, Tech, Sports] becomes [1, 2, 2, 3, 1, 1, 3]. However, this will impose an ordinal characteristic, such as Sports being greater than Tech, and a distance property, such as Sports being closer to Fashion than to Tech.

Instead, one-hot encoding converts the categorical feature to k binary features. Each binary feature indicates the presence or absence of a corresponding possible value. Hence, the preceding example becomes the following:

Figure 5.1: Transforming user interest into numerical features with one-hot encoding
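To make the transformation concrete, here is a minimal sketch that builds the one-hot matrix for the preceding example by hand with NumPy. The alphabetical column order is an assumption made for illustration and is not necessarily the order used in the figure.

```python
import numpy as np

# The user-interest feature from the example above
interests = ['Tech', 'Fashion', 'Fashion', 'Sports', 'Tech', 'Tech', 'Sports']

# Assumption: columns are ordered alphabetically by category name
categories = sorted(set(interests))
column_of = {category: i for i, category in enumerate(categories)}

# One row per sample, one binary column per possible value
one_hot = np.zeros((len(interests), len(categories)), dtype=int)
for row, interest in enumerate(interests):
    one_hot[row, column_of[interest]] = 1

print(categories)
print(one_hot)
```

Each row contains exactly one 1, so no order or distance between the categories is implied.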

Previously, we have used OneHotEncoder from scikit-learn to convert a matrix of strings into a binary matrix, but here, let's take a look at another module, DictVectorizer, which also provides an efficient conversion. It transforms dictionary objects (categorical feature: value) into one-hot encoded vectors.

For example, take a look at the following code:

>>> from sklearn.feature_extraction import DictVectorizer
>>> X_dict = [{'interest': 'tech', 'occupation': 'professional'},
...           {'interest': 'fashion', 'occupation': 'student'},
...           {'interest': 'fashion','occupation':'professional'},
...           {'interest': 'sports', 'occupation': 'student'},
...           {'interest': 'tech', 'occupation': 'student'},
...           {'interest': 'tech', 'occupation': 'retired'},
...           {'interest': 'sports','occupation': 'professional'}]
>>> dict_one_hot_encoder = DictVectorizer(sparse=False)
>>> X_encoded = dict_one_hot_encoder.fit_transform(X_dict)
>>> print(X_encoded)
[[ 0.  0. 1. 1.  0. 0.]
 [ 1.  0. 0. 0.  0. 1.]
 [ 1.  0. 0. 1.  0. 0.]
 [ 0.  1. 0. 0.  0. 1.]
 [ 0.  0. 1. 0.  0. 1.]
 [ 0.  0. 1. 0.  1. 0.]
 [ 0.  1. 0. 1.  0. 0.]]

We can also see the mapping by executing the following:

>>> print(dict_one_hot_encoder.vocabulary_)
{'interest=fashion': 0, 'interest=sports': 1,
'occupation=professional': 3, 'interest=tech': 2,
'occupation=retired': 4, 'occupation=student': 5}

When it comes to new data, we can transform it with the following:

>>> new_dict = [{'interest': 'sports', 'occupation': 'retired'}]
>>> new_encoded = dict_one_hot_encoder.transform(new_dict)
>>> print(new_encoded)
[[ 0. 1. 0. 0. 1. 0.]]

We can inversely transform the encoded features back to the original features like this:

>>> print(dict_one_hot_encoder.inverse_transform(new_encoded))
[{'interest=sports': 1.0, 'occupation=retired': 1.0}]

One important thing to note is that if a new (not seen in training data) category is encountered in new data, it should be ignored (otherwise, the encoder will complain about the unseen categorical value). DictVectorizer handles this implicitly (while OneHotEncoder needs the parameter handle_unknown='ignore' to be specified):

>>> new_dict = [{'interest': 'unknown_interest',
...              'occupation': 'retired'},
...             {'interest': 'tech',
...              'occupation': 'unseen_occupation'}]
>>> new_encoded = dict_one_hot_encoder.transform(new_dict)
>>> print(new_encoded)
[[ 0.  0. 0. 0.  1. 0.]
 [ 0.  0. 1. 0.  0. 0.]]

Sometimes, we prefer transforming a categorical feature with k possible values into a numerical feature with values ranging from 1 to k. We conduct ordinal encoding in order to employ ordinal or ranking knowledge in our learning; for example, large, medium, and small become 3, 2, and 1, respectively; good and bad become 1 and 0, while one-hot encoding fails to preserve such useful information. We can realize ordinal encoding easily through the use of pandas, for example:

>>> import pandas as pd
>>> df = pd.DataFrame({'score': ['low',
...                              'high',
...                              'medium',
...                              'medium',
...                              'low']})
>>> print(df)
   score
0     low
1    high
2  medium
3  medium
4     low
>>> mapping = {'low':1, 'medium':2, 'high':3}
>>> df['score'] = df['score'].replace(mapping)
>>> print(df)
  score
0      1
1      3
2      2
3      2
4      1

We convert the string feature into ordinal values based on the mapping we define.

We've covered transforming categorical features into numerical ones. Next, we will talk about logistic regression, a classifier that only takes in numerical features.

Classifying data with logistic regression

In the last chapter, we trained the tree-based models only based on the first 300,000 samples out of 40 million. We did so simply because training a tree on a large dataset is extremely computationally expensive and time-consuming. Since we are now not limited to algorithms directly taking in categorical features thanks to one-hot encoding, we should turn to a new algorithm with high scalability for large datasets. As mentioned, logistic regression is one of the most, or perhaps the most, scalable classification algorithms.

Getting started with the logistic function

Let's start with an introduction to the logistic function (which is more commonly referred to as the sigmoid function) as the algorithm's core before we dive into the algorithm itself. It maps any input to an output value between 0 and 1, and is defined as follows:

$$y(z) = \frac{1}{1 + \exp(-z)}$$

We can visualize what it looks like by performing the following steps:

  1. Define the logistic function:
    >>> import numpy as np
    >>> def sigmoid(input):
    ...     return 1.0 / (1 + np.exp(-input))
    
  2. Input variables from -8 to 8, and the corresponding output, as follows:
    >>> z = np.linspace(-8, 8, 1000)
    >>> y = sigmoid(z)
    >>> import matplotlib.pyplot as plt
    >>> plt.plot(z, y)
    >>> plt.axhline(y=0, ls='dotted', color='k')
    >>> plt.axhline(y=0.5, ls='dotted', color='k')
    >>> plt.axhline(y=1, ls='dotted', color='k')
    >>> plt.yticks([0.0, 0.25, 0.5, 0.75, 1.0])
    >>> plt.xlabel('z')
    >>> plt.ylabel('y(z)')
    >>> plt.show()
    

Refer to the following screenshot for the end result:

Figure 5.2: The logistic function

In the S-shaped curve, all inputs are transformed into the range from 0 to 1. For positive inputs, a greater value results in an output closer to 1; for negative inputs, a smaller value generates an output closer to 0; when the input is 0, the output is the midpoint, 0.5.
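These properties can also be checked numerically. The following is a small sketch that evaluates the sigmoid function at a handful of arbitrarily chosen inputs and verifies the midpoint and the symmetry of the curve around it:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

# A few illustrative inputs, from strongly negative to strongly positive
z = np.array([-8.0, -2.0, 0.0, 2.0, 8.0])
y = sigmoid(z)
print(y)

# The output at z = 0 is the midpoint
print(sigmoid(0.0))

# The curve is symmetric around the midpoint: y(-z) = 1 - y(z)
print(np.allclose(sigmoid(-z), 1 - y))
```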

Jumping from the logistic function to logistic regression

Now that you have some knowledge of the logistic function, it is easy to map it to the algorithm that stems from it. In logistic regression, the function input z becomes the weighted sum of features. Given a data sample x with n features, x1, x2, …, xn (x represents a feature vector and x = (x1, x2, …, xn)), and weights (also called coefficients) of the model (w represents a vector (w1, w2, …, wn)), z is expressed as follows:

$$z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n = w^T x$$

Also, occasionally, the model comes with an intercept (also called bias), w0. In this instance, the preceding linear relationship becomes:

$$z = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n = w_0 + w^T x$$

As for the output y(z) in the range of 0 to 1, in the algorithm, it becomes the probability of the target being 1 or the positive class:

$$\hat{y} = P(y = 1 \mid x) = \frac{1}{1 + \exp(-w^T x)}$$

Hence, logistic regression is a probabilistic classifier, similar to the Naïve Bayes classifier.
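As a quick illustration of how the weighted sum turns into a probability, consider the following sketch. The weights, intercept, and sample values are hypothetical and chosen only to keep the arithmetic easy to follow:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

# Hypothetical values, for illustration only
w0 = -1.0                    # intercept (bias)
w = np.array([0.5, -0.25])   # one weight per feature
x = np.array([4.0, 2.0])     # a single sample with n = 2 features

z = w0 + np.dot(w, x)        # weighted sum of the features plus the intercept
prob_positive = sigmoid(z)   # probability of the positive class
print(z, prob_positive)
```

Here, z is -1 + 0.5 * 4 - 0.25 * 2 = 0.5, which is positive, so the predicted probability of the positive class is above 0.5.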

A logistic regression model or, more specifically, its weight vector w is learned from the training data, with the goal of predicting a positive sample as close to 1 as possible and predicting a negative sample as close to 0 as possible. In mathematical language, the weights are trained so as to minimize the cost defined as the mean squared error (MSE), which measures the average of squares of the difference between the truth and the prediction. Given m training samples, (x(1), y(1)), (x(2), y(2)), …, (x(m), y(m)), where y(i) is either 1 (positive class) or 0 (negative class), the cost function J(w) regarding the weights to be optimized is expressed as follows:

$$J(w) = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}(x^{(i)}) - y^{(i)} \right)^2$$

However, the preceding cost function is non-convex, which means that, when searching for the optimal w, the search can get stuck in one of many local (suboptimal) optimums and fail to reach the global optimum.

Examples of the convex and non-convex functions are plotted respectively below:

Figure 5.3: Examples of convex and non-convex functions

In the convex example, there is only one global optimum, while there are two optimums in the non-convex example. For more about convex and non-convex functions, feel free to check out https://en.wikipedia.org/wiki/Convex_function and https://web.stanford.edu/class/ee364a/lectures/functions.pdf.

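To overcome this, the cost function in practice is defined as follows:

$$J(w) = \frac{1}{m} \sum_{i=1}^{m} -\left[ y^{(i)} \log\left(\hat{y}(x^{(i)})\right) + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}(x^{(i)})\right) \right]$$

We can take a closer look at the cost of a single training sample:

$$j(w) = -y^{(i)} \log\left(\hat{y}(x^{(i)})\right) - \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}(x^{(i)})\right) = \begin{cases} -\log\left(\hat{y}(x^{(i)})\right), & \text{if } y^{(i)} = 1 \\ -\log\left(1 - \hat{y}(x^{(i)})\right), & \text{if } y^{(i)} = 0 \end{cases}$$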

When the ground truth y(i) = 1, if the model predicts correctly with full confidence (the positive class with 100% probability), the sample cost j is 0; the cost j increases when the predicted probability decreases. If the model incorrectly predicts that there is no chance of the positive class, the cost is infinitely high. We can visualize it as follows:

>>> y_hat = np.linspace(0, 1, 1000)
>>> cost = -np.log(y_hat)
>>> plt.plot(y_hat, cost)
>>> plt.xlabel('Prediction')
>>> plt.ylabel('Cost')
>>> plt.xlim(0, 1)
>>> plt.ylim(0, 7)
>>> plt.show()

Refer to the following graph for the end result:

Figure 5.4: Cost function of logistic regression when y=1

On the contrary, when the ground truth y(i) = 0, if the model predicts correctly with full confidence (the positive class with 0 probability, or the negative class with 100% probability), the sample cost j is 0; the cost j increases when the predicted probability increases. When it incorrectly predicts that there is no chance of the negative class, the cost becomes infinitely high. We can visualize it using the following code:

>>> y_hat = np.linspace(0, 1, 1000)
>>> cost = -np.log(1 - y_hat)
>>> plt.plot(y_hat, cost)
>>> plt.xlabel('Prediction')
>>> plt.ylabel('Cost')
>>> plt.xlim(0, 1)
>>> plt.ylim(0, 7)
>>> plt.show()

The following graph is the resultant output:

Figure 5.5: Cost function of logistic regression when y=0

Minimizing this alternative cost function serves the same goal as minimizing the MSE-based cost function: both push the predicted probabilities toward the true labels. The advantages of choosing it over the MSE one include the following:

  • It is convex, so the optimal model weights can be found
  • It is a summation of the logarithms of the predictions, log(ŷ(x(i))) or log(1 − ŷ(x(i))), which simplifies the calculation of its derivative with respect to the weights, which we will talk about later

Due to the logarithmic function, the cost function is also called logarithmic loss, or simply log loss.
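To get a feel for how log loss behaves, the following sketch computes the cost of a single positive sample at three confidence levels and prints the squared error next to it. The helper name sample_log_loss and the clipping value eps are assumptions introduced here for illustration; clipping is one common way to avoid taking the logarithm of 0.

```python
import numpy as np

def sample_log_loss(y_true, y_hat, eps=1e-15):
    """Log loss of one prediction; eps is an illustrative clipping value."""
    y_hat = np.clip(y_hat, eps, 1 - eps)   # keep predictions away from 0 and 1
    return -(y_true * np.log(y_hat) + (1 - y_true) * np.log(1 - y_hat))

# A positive sample (y = 1) predicted with decreasing confidence
for y_hat in [0.99, 0.5, 0.01]:
    squared_error = (1 - y_hat) ** 2
    print(y_hat, float(sample_log_loss(1, y_hat)), squared_error)
```

The squared error of a positive sample can never exceed 1, whereas log loss keeps growing as the prediction approaches 0, so confident mistakes are punished much more heavily.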

Now that we have the cost function ready, how can we train the logistic regression model to minimize the cost function? Let's see in the next section.

Training a logistic regression model

Now, the question is how we can obtain the optimal w such that J(w) is minimized. We can do so using gradient descent.

Training a logistic regression model using gradient descent

Gradient descent (also called steepest descent) is a procedure of minimizing an objective function by first-order iterative optimization. In each iteration, it moves a step that is proportional to the negative derivative of the objective function at the current point. This means the to-be-optimal point iteratively moves downhill toward the minimal value of the objective function. The proportion we just mentioned is called the learning rate, or step size. It can be summarized in a mathematical equation as follows:

$$w := w - \eta \Delta w$$

Here, the left w is the weight vector after a learning step, and the right w is the one before moving, η is the learning rate, and ∆w is the first-order derivative, the gradient.
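Before applying this to logistic regression, it may help to see the update rule at work on a much simpler problem. The following sketch minimizes the one-dimensional function f(w) = (w - 3)^2, which is chosen purely for illustration and has its minimum at w = 3:

```python
def gradient(w):
    # Derivative of the illustrative objective f(w) = (w - 3) ** 2
    return 2 * (w - 3)

w = 0.0                # starting point
learning_rate = 0.1    # eta, the step size
for _ in range(100):
    w = w - learning_rate * gradient(w)   # move against the gradient
print(w)
```

Each step shrinks the distance to the minimum by a constant factor, so w ends up very close to 3.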

In our case, let's start with the derivative of the cost function J(w) with respect to w. It might require some knowledge of calculus, but don't worry, we will walk through it step by step:

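  1. We first calculate the derivative of ŷ(z) with respect to w. We herein take the j-th weight, wj, as an example (note z=wTx, and we omit the (i) for simplicity):

     $$\frac{\partial}{\partial w_j} \hat{y}(z) = \frac{\partial}{\partial w_j} \frac{1}{1 + \exp(-z)} = \frac{\exp(-z)}{\left(1 + \exp(-z)\right)^2} x_j = \hat{y}(z) \left(1 - \hat{y}(z)\right) x_j$$

  2. Then, we calculate the derivative of the sample cost j(w) as follows:

     $$\frac{\partial}{\partial w_j} j(w) = -y \frac{1}{\hat{y}(z)} \frac{\partial}{\partial w_j} \hat{y}(z) + (1 - y) \frac{1}{1 - \hat{y}(z)} \frac{\partial}{\partial w_j} \hat{y}(z) = -y \left(1 - \hat{y}(z)\right) x_j + (1 - y)\, \hat{y}(z)\, x_j = \left(-y + \hat{y}(z)\right) x_j$$

  3. Finally, we calculate the derivative of the entire cost over m samples as follows:

     $$\Delta w_j = \frac{\partial}{\partial w_j} J(w) = \frac{1}{m} \sum_{i=1}^{m} \left(-y^{(i)} + \hat{y}(z^{(i)})\right) x_j^{(i)}$$

  4. We then generalize it to ∆w:

     $$\Delta w = \frac{1}{m} \sum_{i=1}^{m} \left(-y^{(i)} + \hat{y}(z^{(i)})\right) x^{(i)}$$

  5. Combined with the preceding derivations, the weights can be updated as follows:

     $$w := w + \eta \frac{1}{m} \sum_{i=1}^{m} \left(y^{(i)} - \hat{y}(z^{(i)})\right) x^{(i)}$$

     Here, w gets updated in each iteration.

  6. After a substantial number of iterations, the learned w (including the intercept w0, if there is one) is then used to classify a new sample x' by means of the following equation:

     $$y' = \frac{1}{1 + \exp(-w^T x')}, \qquad \text{class} = \begin{cases} 1, & \text{if } y' \geq 0.5 \\ 0, & \text{if } y' < 0.5 \end{cases}$$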
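If you would like to double-check the derivation, one option is to compare the analytic gradient with a numerical approximation. The following sketch does so with central finite differences on a tiny made-up dataset; the helper names, the data, and the step h are all assumptions introduced for this check:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

def cost(X, y, w):
    # Log loss averaged over all samples
    y_hat = sigmoid(X @ w)
    return np.mean(-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat))

def analytic_gradient(X, y, w):
    # (1/m) * sum over samples of (y_hat - y) * x, as derived above
    return X.T @ (sigmoid(X @ w) - y) / y.shape[0]

def numerical_gradient(X, y, w, h=1e-6):
    # Central finite difference, one weight at a time
    grad = np.zeros_like(w)
    for j in range(w.shape[0]):
        step = np.zeros_like(w)
        step[j] = h
        grad[j] = (cost(X, y, w + step) - cost(X, y, w - step)) / (2 * h)
    return grad

# Small made-up dataset and weights
X = np.array([[1.0, 2.0], [2.0, 0.5], [0.5, 1.5], [3.0, 1.0]])
y = np.array([0.0, 1.0, 0.0, 1.0])
w = np.array([0.1, -0.2])

print(analytic_gradient(X, y, w))
print(numerical_gradient(X, y, w))
```

The two printed vectors should agree to several decimal places, which suggests that the closed-form gradient is consistent with the cost function.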

The decision threshold is 0.5 by default, but it can certainly take other values. Where a false negative must be avoided at all costs, for example, when predicting fire occurrence (the positive class) for alerts, the decision threshold can be lower than 0.5, such as 0.3, depending on how paranoid we are and how proactively we want to prevent the positive event from happening. On the other hand, when a false positive is what should be avoided, for instance, when predicting the product success (the positive class) rate for quality assurance, the decision threshold can be greater than 0.5, such as 0.7, depending on how high a standard you set.
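The effect of the threshold is easy to see on a handful of predictions. In the following sketch, the probabilities are hypothetical values chosen for illustration:

```python
import numpy as np

# Hypothetical predicted probabilities of the positive class
probabilities = np.array([0.15, 0.35, 0.55, 0.65, 0.85])

for threshold in (0.3, 0.5, 0.7):
    predicted_classes = (probabilities >= threshold).astype(int)
    print(threshold, predicted_classes)
```

Lowering the threshold labels more samples as positive (fewer false negatives, more false positives), while raising it does the opposite.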

With a thorough understanding of the gradient descent-based training and predicting process, we will now implement the logistic regression algorithm from scratch:

  1. We begin by defining the function that computes the prediction with the current weights:
    >>> def compute_prediction(X, weights):
    ...     """
    ...     Compute the prediction y_hat based on current weights
    ...     """
    ...     z = np.dot(X, weights)
    ...     predictions = sigmoid(z)
    ...     return predictions
    
  2. With this, we are able to continue with the function updating the weights by one step in a gradient descent manner. Take a look at the following code:
    >>> def update_weights_gd(X_train, y_train, weights,
    ...                       learning_rate):
    ...     """
    ...     Update weights by one step
    ...     """
    ...     predictions = compute_prediction(X_train, weights)
    ...     weights_delta = np.dot(X_train.T, y_train - predictions)
    ...     m = y_train.shape[0]
    ...     weights += learning_rate / float(m) * weights_delta
    ...     return weights
    
  3. Then, the function calculating the cost J(w) is implemented as well:
    >>> def compute_cost(X, y, weights):
    ...     """
    ...     Compute the cost J(w)
    ...     """
    ...     predictions = compute_prediction(X, weights)
    ...     cost = np.mean(-y * np.log(predictions)
    ...                    - (1 - y) * np.log(1 - predictions))
    ...     return cost
    
  4. Now, we connect all these functions to the model training function by executing the following:
    • Updating the weights vector in each iteration
    • Printing out the current cost for every 100 (this can be another value) iterations to ensure cost is decreasing and that things are on the right track

    They are implemented in the following function:

    >>> def train_logistic_regression(X_train, y_train, max_iter,
    ...                               learning_rate, fit_intercept=False):
    ...     """ Train a logistic regression model
    ...     Args:
    ...         X_train, y_train (numpy.ndarray, training data set)
    ...         max_iter (int, number of iterations)
    ...         learning_rate (float)
    ...         fit_intercept (bool, with an intercept w0 or not)
    ...     Returns:
    ...         numpy.ndarray, learned weights
    ...     """
    ...     if fit_intercept:
    ...         intercept = np.ones((X_train.shape[0], 1))
    ...         X_train = np.hstack((intercept, X_train))
    ...     weights = np.zeros(X_train.shape[1])
    ...     for iteration in range(max_iter):
    ...         weights = update_weights_gd(X_train, y_train,
    ...                                     weights, learning_rate)
    ...         # Check the cost for every 100 (for example) iterations
    ...         if iteration % 100 == 0:
    ...             print(compute_cost(X_train, y_train, weights))
    ...     return weights
    
  5. Finally, we predict the results of new inputs using the trained model as follows:
    >>> def predict(X, weights):
    ...     if X.shape[1] == weights.shape[0] - 1:
    ...         intercept = np.ones((X.shape[0], 1))
    ...         X = np.hstack((intercept, X))
    ...     return compute_prediction(X, weights)
    

Implementing logistic regression is very simple, as you just saw. Let's now examine it using a toy example:

>>> X_train = np.array([[6, 7],
...                     [2, 4],
...                     [3, 6],
...                     [4, 7],
...                     [1, 6],
...                     [5, 2],
...                     [2, 0],
...                     [6, 3],
...                     [4, 1],
...                     [7, 2]])
>>> y_train = np.array([0,
...                     0,
...                     0,
...                     0,
...                     0,
...                     1,
...                     1,
...                     1,
...                     1,
...                     1])

We train a logistic regression model for 1000 iterations, at a learning rate of 0.1 based on intercept-included weights:

>>> weights = train_logistic_regression(X_train, y_train,
...              max_iter=1000, learning_rate=0.1, fit_intercept=True)
0.574404237166
0.0344602233925
0.0182655727085
0.012493458388
0.00951532913855
0.00769338806065
0.00646209433351
0.00557351184683
0.00490163225453
0.00437556774067

The decreasing cost means that the model is being optimized over time. We can check the model's performance on new samples as follows:

>>> X_test = np.array([[6, 1],
...                    [1, 3],
...                    [3, 1],
...                    [4, 5]])
>>> predictions = predict(X_test, weights)
>>> predictions
array([ 0.9999478 , 0.00743991, 0.9808652 , 0.02080847])

To visualize this, execute the following code:

>>> import matplotlib.pyplot as plt
>>> plt.scatter(X_train[:,0], X_train[:,1], c=['b']*5+['k']*5,
...             marker='o')

Blue dots are training samples from class 0, while black dots are those from class 1. Use 0.5 as the classification decision threshold:

>>> colours = ['k' if prediction >= 0.5 else 'b'
...            for prediction in predictions]
>>> plt.scatter(X_test[:,0], X_test[:,1], marker='*', c=colours)

Blue stars are testing samples predicted from class 0, while black stars are those predicted from class 1:

>>> plt.xlabel('x1')
>>> plt.ylabel('x2')
>>> plt.show()

Refer to the following screenshot for the end result:

Figure 5.6: Training and testing sets of the toy example

The model we trained correctly predicts classes of new samples (the stars).

Predicting ad click-through with logistic regression using gradient descent

After this brief example, we will now deploy the algorithm we just developed in our click-through prediction project.

We herein start with only 10,000 training samples (you will soon see why we don't start with 270,000, as we did in the previous chapter):

>>> import pandas as pd
>>> n_rows = 300000
>>> df = pd.read_csv("train", nrows=n_rows)
>>> X = df.drop(['click', 'id', 'hour', 'device_id', 'device_ip'],
...             axis=1).values
>>> Y = df['click'].values
>>> n_train = 10000
>>> X_train = X[:n_train]
>>> Y_train = Y[:n_train]
>>> X_test = X[n_train:]
>>> Y_test = Y[n_train:]
>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder(handle_unknown='ignore')
>>> X_train_enc = enc.fit_transform(X_train)
>>> X_test_enc = enc.transform(X_test)

Train a logistic regression model over 10000 iterations, at a learning rate of 0.01 with bias:

>>> import timeit
>>> start_time = timeit.default_timer()
>>> weights = train_logistic_regression(X_train_enc.toarray(),
...               Y_train, max_iter=10000, learning_rate=0.01,
...               fit_intercept=True)
0.6820019456743648
0.4608619713011896
0.4503715555130051
…
…
…
0.41485094023829017
0.41477416506724385
0.41469802145452467
>>> print(f"--- {(timeit.default_timer() - start_time):.3f}s seconds ---")
--- 232.756s seconds ---

It takes 232 seconds to optimize the model. The trained model performs on the testing set as follows:

>>> pred = predict(X_test_enc.toarray(), weights)
>>> from sklearn.metrics import roc_auc_score
>>> print(f'Training samples: {n_train}, AUC on testing set: {roc_auc_score(Y_test, pred):.3f}')
Training samples: 10000, AUC on testing set: 0.703

Now, let's use 100,000 training samples (n_train = 100000) and repeat the same process. It will take 5240.4 seconds, which is almost 1.5 hours. It takes 22 times longer to fit data of 10 times the size. As I mentioned at the beginning of the chapter, the logistic regression classifier can be good at training on large datasets. But our testing results seem to contradict this. How could we handle even larger training datasets efficiently, not just 100,000 samples, but millions? Let's look at a more efficient way to train a logistic regression model in the next section.

Training a logistic regression model using stochastic gradient descent

In gradient descent-based logistic regression models, all training samples are used to update the weights in every single iteration. Hence, if the number of training samples is large, the whole training process will become very time-consuming and computationally expensive, as you just witnessed in our last example.

Fortunately, a small tweak will make logistic regression suitable for large-sized datasets. For each weight update, only one training sample is consumed, instead of the complete training set. The model moves a step based on the error calculated by a single training sample. Once all samples are used, one iteration finishes. This advanced version of gradient descent is called stochastic gradient descent (SGD). Expressed in a formula, for each iteration, we do the following:

$$\text{for } i = 1, \dots, m: \quad w := w + \eta \left(y^{(i)} - \hat{y}(z^{(i)})\right) x^{(i)}$$

SGD generally converges in far fewer iterations than gradient descent, which usually needs a large number of iterations.
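The difference between the two schemes can be seen by counting weight updates in a single pass over the data. The following sketch uses a tiny made-up dataset and an arbitrary learning rate; the variable names are assumptions introduced for this comparison:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

# Toy data: 4 samples with 2 features each (made up for illustration)
X = np.array([[1.0, 2.0], [2.0, 1.0], [0.5, 1.5], [1.5, 0.5]])
y = np.array([0.0, 1.0, 0.0, 1.0])
eta = 0.1

# Gradient descent: a single update per pass, computed from all m samples
w_gd = np.zeros(2)
w_gd += (eta / len(y)) * (X.T @ (y - sigmoid(X @ w_gd)))
gd_updates = 1

# SGD: one update per sample, so m updates per pass
w_sgd = np.zeros(2)
sgd_updates = 0
for x_i, y_i in zip(X, y):
    w_sgd += eta * (y_i - sigmoid(x_i @ w_sgd)) * x_i
    sgd_updates += 1

print(gd_updates, w_gd)
print(sgd_updates, w_sgd)
```

After one pass, SGD has already moved the weights m times, which is why it typically needs far fewer passes over the data.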

To implement SGD-based logistic regression, we just need to slightly modify the update_weights_gd function:

>>> def update_weights_sgd(X_train, y_train, weights,
...                        learning_rate):
...     """ One weight update iteration: moving weights by one
...         step based on each individual sample
...     Args:
...     X_train, y_train (numpy.ndarray, training data set)
...     weights (numpy.ndarray)
...     learning_rate (float)
...     Returns:
...     numpy.ndarray, updated weights
...     """
...     for X_each, y_each in zip(X_train, y_train):
...         prediction = compute_prediction(X_each, weights)
...         weights_delta = X_each.T * (y_each - prediction)
...         weights += learning_rate * weights_delta
...     return weights

SGD is then applied in the train_logistic_regression_sgd function:

>>> def train_logistic_regression_sgd(X_train, y_train, max_iter,
...                                   learning_rate, fit_intercept=False):
...     """ Train a logistic regression model via SGD
...     Args:
...     X_train, y_train (numpy.ndarray, training data set)
...     max_iter (int, number of iterations)
...     learning_rate (float)
...     fit_intercept (bool, with an intercept w0 or not)
...     Returns:
...     numpy.ndarray, learned weights
...     """
...     if fit_intercept:
...         intercept = np.ones((X_train.shape[0], 1))
...         X_train = np.hstack((intercept, X_train))
...     weights = np.zeros(X_train.shape[1])
...     for iteration in range(max_iter):
...         weights = update_weights_sgd(X_train, y_train, weights,
...                                      learning_rate)
...         # Check the cost for every 2 (for example) iterations
...         if iteration % 2 == 0:
...             print(compute_cost(X_train, y_train, weights))
...     return weights                   

Now, let's see how powerful SGD is. We will work with 100,000 training samples and choose 10 as the number of iterations, 0.01 as the learning rate, and print out current costs every other iteration:

>>> start_time = timeit.default_timer()
>>> weights = train_logistic_regression_sgd(X_train_enc.toarray(),
...         Y_train, max_iter=10, learning_rate=0.01, fit_intercept=True)
0.4127864859625796
0.4078504597223988
0.40545733114863264
0.403811787845451
0.4025431351250833
>>> print(f"--- {(timeit.default_timer() - start_time):.3f}s seconds ---")
--- 40.690s seconds ---
>>> pred = predict(X_test_enc.toarray(), weights)
>>> print(f'Training samples: {n_train}, AUC on testing set: {roc_auc_score(Y_test, pred):.3f}')
Training samples: 100000, AUC on testing set: 0.732

The training process finishes in just 40 seconds!

As usual, after successfully implementing the SGD-based logistic regression algorithm from scratch, we implement it using the SGDClassifier module of scikit-learn:

>>> from sklearn.linear_model import SGDClassifier
>>> sgd_lr = SGDClassifier(loss='log_loss', penalty=None,
...              fit_intercept=True, max_iter=10,
...              learning_rate='constant', eta0=0.01)

Here, 'log_loss' for the loss parameter indicates that the cost function is log loss (scikit-learn versions before 1.1 name this option 'log'), penalty is the regularization term to reduce overfitting, which we will discuss further in the next section, max_iter is the number of iterations, and the remaining two parameters mean the learning rate is 0.01 and unchanged during the course of training. It should be noted that the default learning_rate is 'optimal', where the learning rate slightly decreases as more and more updates are made. This can be beneficial for finding the optimal solution on large datasets.

Now, train the model and test it:

>>> sgd_lr.fit(X_train_enc.toarray(), Y_train)
>>> pred = sgd_lr.predict_proba(X_test_enc.toarray())[:, 1]
>>> print(f'Training samples: {n_train}, AUC on testing set: {roc_auc_score(Y_test, pred):.3f}')
Training samples: 100000, AUC on testing set: 0.734

Quick and easy!

Training a logistic regression model with regularization

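As I briefly mentioned in the previous section, the penalty parameter in the logistic regression SGDClassifier is related to model regularization. There are two basic forms of regularization, L1 (also called Lasso) and L2 (also called ridge). Either way, the regularization is an additional term on top of the original cost function:

$$J(w) = \frac{1}{m} \sum_{i=1}^{m} -\left[ y^{(i)} \log\left(\hat{y}(x^{(i)})\right) + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}(x^{(i)})\right) \right] + \alpha \|w\|^q$$

Here, α is the constant that multiplies the regularization term, and q is either 1 or 2 representing L1 or L2 regularization where the following applies:

$$\|w\|^1 = \sum_{j=1}^{n} |w_j|, \qquad \|w\|^2 = \sum_{j=1}^{n} w_j^2$$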

Training a logistic regression model is the process of reducing the cost as a function of weights w. If it gets to a point where some weights, such as wi, wj, and wk, are considerably large, the whole cost will be determined by these large weights. In this case, the learned model may just memorize the training set and fail to generalize to unseen data. The regularization term is introduced in order to penalize large weights, as the weights now become part of the cost to minimize. As a result, regularization reduces overfitting. Finally, parameter α provides a trade-off between log loss and generalization. If α is too small, it is not able to compress large weights and the model may suffer from high variance or overfitting; on the other hand, if α is too large, the model may become overgeneralized and perform poorly in terms of fitting the dataset, which is the symptom of underfitting. α is an important parameter to tune in order to obtain the best logistic regression model with regularization.
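To see how α weighs the penalty against log loss, consider the following sketch. The function name regularized_cost, as well as the labels, predictions, and weights, are made up for illustration, and the weight vector is assumed not to include an intercept:

```python
import numpy as np

def regularized_cost(y, y_hat, weights, alpha, q):
    """Log loss plus alpha times the L1 (q=1) or L2 (q=2) term of the weights."""
    log_loss = np.mean(-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat))
    if q == 1:
        penalty = np.sum(np.abs(weights))
    elif q == 2:
        penalty = np.sum(weights ** 2)
    else:
        raise ValueError("q must be 1 or 2")
    return log_loss + alpha * penalty

# Made-up labels, predictions, and weights
y = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.7])
weights = np.array([3.0, -0.5])

for alpha in (0.0, 0.01, 1.0):
    print(alpha,
          regularized_cost(y, y_hat, weights, alpha, q=1),
          regularized_cost(y, y_hat, weights, alpha, q=2))
```

With α = 0, the cost is plain log loss; as α grows, the large weight 3.0 contributes more and more to the total cost, and more strongly so under L2 because it is squared.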

As for choosing between the L1 and L2 form, the rule of thumb is based on whether feature selection is expected. In machine learning classification, feature selection is the process of picking a subset of significant features for use in better model construction. In practice, not every feature in a dataset carries information that is useful for discriminating samples; some features are either redundant or irrelevant and hence can be discarded with little loss. In a logistic regression classifier, feature selection can only be achieved with L1 regularization. To understand this, let's consider two weight vectors, w1 = (1, 0) and w2 = (0.5, 0.5); supposing they produce the same amount of log loss, the L1 and L2 regularization terms of each weight vector are as follows:

$$\|w_1\|^1 = |1| + |0| = 1, \qquad \|w_2\|^1 = |0.5| + |0.5| = 1$$

$$\|w_1\|^2 = 1^2 + 0^2 = 1, \qquad \|w_2\|^2 = 0.5^2 + 0.5^2 = 0.5$$

The L1 term of both vectors is equivalent, while the L2 term of w2 is less than that of w1. This indicates that L2 regularization penalizes weights composed of significantly large and small weights more than L1 regularization does. In other words, L2 regularization favors relatively small values for all weights, and avoids significantly large and small values for any weight, while L1 regularization allows some weights with a significantly small value and some with a significantly large value. Only with L1 regularization can some weights be compressed to close to or exactly 0, which enables feature selection.
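The comparison is easy to reproduce. The following sketch computes both regularization terms for the two weight vectors discussed above; the dictionary name terms is introduced here only to hold the results:

```python
import numpy as np

w1 = np.array([1.0, 0.0])
w2 = np.array([0.5, 0.5])

terms = {}
for name, w in (('w1', w1), ('w2', w2)):
    l1_term = float(np.sum(np.abs(w)))   # sum of absolute values
    l2_term = float(np.sum(w ** 2))      # sum of squares
    terms[name] = (l1_term, l2_term)
    print(name, l1_term, l2_term)
```

Under L1, the sparse vector w1 costs no more than the spread-out vector w2, so there is no pressure against keeping one weight at exactly 0; under L2, w2 is cheaper, so the weights tend to be spread out instead.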

In scikit-learn, the regularization type can be specified by the penalty parameter with the options None (without regularization), "l1", "l2", and "elasticnet" (a mixture of L1 and L2), and the multiplier α can be specified by the alpha parameter.

Feature selection using L1 regularization

We herein examine L1 regularization for feature selection.

Initialize an SGD logistic regression model with L1 regularization, and train the model based on 10,000 samples:

>>> # 'log_loss' is named 'log' in scikit-learn versions before 1.1
>>> sgd_lr_l1 = SGDClassifier(loss='log_loss', penalty='l1', alpha=0.0001,
...                 fit_intercept=True, max_iter=10,
...                 learning_rate='constant', eta0=0.01)
>>> sgd_lr_l1.fit(X_train_enc.toarray(), Y_train)

With the trained model, we obtain the absolute values of its coefficients:

>>> coef_abs = np.abs(sgd_lr_l1.coef_)
>>> print(coef_abs)
[[0. 0.09963329 0. ... 0. 0. 0.07431834]]

The bottom 10 coefficients and their values are printed as follows:

>>> print(np.sort(coef_abs)[0][:10])
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
>>> bottom_10 = np.argsort(coef_abs)[0][:10]

We can see what these 10 features are using the following code:

>>> # get_feature_names_out() replaces get_feature_names() from scikit-learn 1.0
>>> feature_names = enc.get_feature_names_out()
>>> print('10 least important features are:\n',
...       feature_names[bottom_10])
10 least important features are:
 ['x0_1001' 'x8_851897aa' 'x8_85119990' 'x8_84ebbcd4' 'x8_84eb6b0e'
 'x8_84dda655' 'x8_84c2f017' 'x8_84ace234' 'x8_84a9d4ba' 'x8_84915a27']

They are "1001" from column 0 (that is, the C1 column) in X_train, "851897aa" from column 8 (that is, the device_model column), and so on.

Similarly, the top 10 coefficients and their values can be obtained as follows:

>>> print(np.sort(coef_abs)[0][-10:])
[0.67912376 0.70885933 0.79975917 0.8828797 0.98146351 0.98275124
 1.08313767 1.13261091 1.18445527 1.40983505]
>>> top_10 = np.argsort(coef_abs)[0][-10:]
>>> print('10 most important features are:\n', feature_names[top_10])
10 most important features are:
 ['x7_cef3e649' 'x3_7687a86e' 'x18_61' 'x18_15' 'x5_9c13b419' 
'x5_5e3f096f' 'x2_763a42b5' 'x2_d9750ee7' 'x3_27e3c518' 
'x5_1779deee']

They are "cef3e649" from column 7 (that is, app_category) in X_train, "7687a86e" from column 3 (that is, site_domain), and so on.
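Once the weights are ranked, one possible next step is to keep only the features whose coefficient is nonzero. The sketch below assumes a small made-up coefficient array and feature matrix in place of sgd_lr_l1.coef_ and the encoded training data; the names coef, X_enc, selected, and X_selected are illustrative and do not come from the chapter's code.

```python
import numpy as np

# Hypothetical coefficients standing in for sgd_lr_l1.coef_ (shape: 1 x n_features)
coef = np.array([[0.0, -0.8, 0.0, 0.3, 0.0, 1.2]])
# Hypothetical encoded feature matrix with one column per coefficient
X_enc = np.arange(12, dtype=float).reshape(2, 6)

# Indices of the features whose weight was not compressed to exactly 0
selected = np.flatnonzero(np.abs(coef[0]) > 0)
# Keep only the corresponding columns
X_selected = X_enc[:, selected]

print(selected)
print(X_selected.shape)
```

A small positive threshold could replace the exact comparison with 0 if weights that are merely close to 0 should be dropped as well.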

Training on large datasets with online learning

So far, we have trained our model on no more than 300,000 samples. If we go beyond this figure, memory might be overloaded since it holds too much data, and the program will crash. In this section, we will explore how to train on a large-scale dataset with online learning.

SGD evolves from gradient descent by sequentially updating the model with individual training samples one at a time, instead of the complete training set at once. We can scale up SGD further with online learning techniques. In online learning, new data for training is available in sequential order or in real time, as opposed to all at once in an offline learning environment. A relatively small chunk of data is loaded and preprocessed for training at a time, which frees the memory that would otherwise hold the entire large dataset. Besides better computational feasibility, online learning is also used because of its adaptability to cases where new data is generated in real time and is needed to keep the model up to date. For instance, stock price prediction models are updated in an online learning manner with timely market data; click-through prediction models need to include the most recent data reflecting users' latest behaviors and tastes; spam email detectors have to react to ever-changing spammers by considering new features that are dynamically generated.

The existing model trained by previous datasets can now be updated based on the most recently available dataset only, instead of rebuilding it from scratch based on previous and recent datasets together, as is the case in offline learning:

Figure 5.7: Online versus offline learning

In the preceding example, online learning allows the model to continue training with newly arriving data. However, in offline learning, we have to retrain the whole model with the newly arriving data along with the old data.
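To make the idea concrete, here is a minimal NumPy sketch of a logistic regression model that keeps learning from chunks arriving one after another. It is not scikit-learn's implementation: partial_update is a hypothetical helper, the data stream is simulated with random numbers, and the learning rate of 0.05 is an arbitrary choice.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def partial_update(w, X_chunk, y_chunk, learning_rate=0.05):
    """One SGD pass over a single chunk, updating w one sample at a time."""
    for x_i, y_i in zip(X_chunk, y_chunk):
        pred = sigmoid(x_i @ w)
        w = w + learning_rate * (y_i - pred) * x_i
    return w


rng = np.random.default_rng(42)
true_w = np.array([2.0, -1.0, 0.5])  # hypothetical weights generating the labels
w = np.zeros(3)

# Simulate a stream: each chunk is created, used for one update pass, then discarded
for _ in range(10):
    X_chunk = rng.normal(size=(1000, 3))
    y_chunk = (sigmoid(X_chunk @ true_w) > rng.random(1000)).astype(float)
    w = partial_update(w, X_chunk, y_chunk)

print(np.round(w, 2))
```

Only the current chunk and the weight vector are held in memory at any time, which is the property that makes this style of training feasible on very large datasets.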

The SGDClassifier module in scikit-learn implements online learning with the partial_fit method (while the fit method is applied in offline learning, as you have seen). We will train the model with 1,000,000 samples, where we feed in 100,000 samples at a time to simulate an online learning environment. We will then test the trained model on another 100,000 samples as follows:

>>> n_rows = 100000 * 11
>>> df = pd.read_csv("train", nrows=n_rows)
>>> X = df.drop(['click', 'id', 'hour', 'device_id', 'device_ip'],
...             axis=1).values
>>> Y = df['click'].values
>>> n_train = 100000 * 10
>>> X_train = X[:n_train]
>>> Y_train = Y[:n_train]
>>> X_test = X[n_train:]
>>> Y_test = Y[n_train:]

Fit the encoder on the whole training set as follows:

>>> enc = OneHotEncoder(handle_unknown='ignore')
>>> enc.fit(X_train)

Initialize an SGD logistic regression model where we set the number of iterations to 1 in order to partially fit the model and enable online learning:

>>> sgd_lr_online = SGDClassifier(loss='log', penalty=None,
...                               fit_intercept=True, max_iter=1,
...                               learning_rate='constant', eta0=0.01)

Loop over every 100,000 samples and partially fit the model:

>>> start_time = timeit.default_timer()
>>> for i in range(10):
...     x_train = X_train[i*100000:(i+1)*100000]
...     y_train = Y_train[i*100000:(i+1)*100000]
...     x_train_enc = enc.transform(x_train)
...     sgd_lr_online.partial_fit(x_train_enc.toarray(), y_train,
...                               classes=[0, 1])

Again, we use the partial_fit method for online learning. We also specify the classes parameter, which partial_fit requires on its first call because a single chunk is not guaranteed to contain every class. Finally, we print the elapsed training time:

>>> print(f"--- {(timeit.default_timer() - start_time):.3f}s seconds ---")
--- 167.399s seconds ---

Apply the trained model on the testing set, the next 100,000 samples, as follows:

>>> x_test_enc = enc.transform(X_test)
>>> pred = sgd_lr_online.predict_proba(x_test_enc.toarray())[:, 1]
>>> print(f'Training samples: {n_train}, AUC on testing set: {roc_auc_score(Y_test, pred):.3f}')
Training samples: 1000000, AUC on testing set: 0.761

With online learning, training on a total of 1 million samples takes only 167 seconds and yields a better AUC.

We have been using logistic regression for binary classification so far. Can we use it for multiclass cases? Yes. However, we do need to make some small tweaks. Let's see this in the next section.

Handling multiclass classification

One last thing worth noting is how logistic regression algorithms deal with multiclass classification. Although we interact with the scikit-learn classifiers in multiclass cases the same way as in binary cases, it is useful to understand how logistic regression works in multiclass classification.

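Logistic regression for more than two classes is also called multinomial logistic regression, or, as it has become better known more recently, softmax regression. As you have seen in the binary case, the model is represented by one weight vector w, and the probability of the target being 1 or the positive class is written as follows:

$$\hat{y} = P(y = 1 \mid x) = \frac{1}{1 + \exp(-w^T x)}$$

In the K class case, the model is represented by K weight vectors, w1, w2, ..., wK, and the probability of the target being class k is written as follows:

$$\hat{y}_k = P(y = k \mid x) = \frac{\exp(w_k^T x)}{\sum_{j=1}^{K} \exp(w_j^T x)}$$

Note that the term $\sum_{j=1}^{K} \exp(w_j^T x)$ normalizes the probabilities $\hat{y}_k$ (k from 1 to K) so that they total 1. The cost function in the binary case is expressed as follows:

$$J(w) = \frac{1}{m} \sum_{i=1}^{m} -\left[ y^{(i)} \log\left(\hat{y}(x^{(i)})\right) + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}(x^{(i)})\right) \right]$$

Similarly, the cost function in the multiclass case becomes the following:

$$J(w) = \frac{1}{m} \sum_{i=1}^{m} -\left[ \sum_{j=1}^{K} 1\{y^{(i)} = j\} \log\left(\hat{y}_j(x^{(i)})\right) \right]$$

Here, function $1\{y^{(i)} = j\}$ is 1 only if $y^{(i)} = j$ is true, otherwise it's 0.

With the cost function defined, we obtain the step $\Delta w_j$ for the j weight vector in the same way as we derived the step ∆w in the binary case:

$$\Delta w_j = \frac{1}{m} \sum_{i=1}^{m} \left( -1\{y^{(i)} = j\} + \hat{y}_j(x^{(i)}) \right) x^{(i)}$$

In a similar manner, all K weight vectors are updated in each iteration. After sufficient iterations, the learned weight vectors, w1, w2, ..., wK, are then used to classify a new sample x' by means of the following equation:

$$y' = \arg\max_k \hat{y}_k = \arg\max_k P(y = k \mid x')$$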
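The procedure described above (normalized class probabilities, a cross-entropy cost, a simultaneous update of all K weight vectors, and prediction by the most probable class) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the six-sample dataset is made up, the K weight vectors are stored as the columns of a matrix W, and the function names, learning rate, and iteration count are arbitrary choices rather than anything prescribed by the chapter.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax; subtracting the row maximum avoids overflow."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)


def cost(W, X, y):
    """Average cross-entropy over all samples; y holds integer class labels."""
    probs = softmax(X @ W)
    return -np.mean(np.log(probs[np.arange(len(y)), y]))


def gradient_step(W, X, y, learning_rate):
    """Update all K weight vectors (columns of W) with one full-batch step."""
    m, K = X.shape[0], W.shape[1]
    probs = softmax(X @ W)
    one_hot = np.eye(K)[y]               # indicator: 1 where y_i equals class j
    grad = X.T @ (probs - one_hot) / m   # one gradient column per class
    return W - learning_rate * grad


# Hypothetical data: 6 samples, 2 features plus a constant bias feature, 3 classes
X = np.array([[1, 0, 1], [0.9, 0.1, 1], [0, 1, 1],
              [0.1, 0.9, 1], [-1, -1, 1], [-0.9, -1.1, 1]], dtype=float)
y = np.array([0, 0, 1, 1, 2, 2])

W = np.zeros((3, 3))
initial_cost = cost(W, X, y)
for _ in range(500):
    W = gradient_step(W, X, y, learning_rate=0.5)

# Classify each sample as the class with the highest probability
pred = np.argmax(softmax(X @ W), axis=1)
print(initial_cost, cost(W, X, y), pred)
```

With all weights at zero, every class gets probability 1/K, so the starting cost equals log(K); training should push it below that value.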

To get a better sense of this, let's experiment with a classic dataset, the handwritten digits for classification:

>>> from sklearn import datasets
>>> digits = datasets.load_digits()
>>> n_samples = len(digits.images)

As the image data is stored in 8*8 matrices, we need to flatten them, as follows:

>>> X = digits.images.reshape((n_samples, -1))
>>> Y = digits.target

We then split the data as follows:

>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
...                                     test_size=0.2, random_state=42)

We then combine grid search and cross-validation to find the optimal multiclass logistic regression model as follows:

>>> from sklearn.model_selection import GridSearchCV
>>> parameters = {'penalty': ['l2', None],
...               'alpha': [1e-07, 1e-06, 1e-05, 1e-04],
...               'eta0': [0.01, 0.1, 1, 10]}
>>> sgd_lr = SGDClassifier(loss='log', learning_rate='constant',
...                        eta0=0.01, fit_intercept=True, max_iter=10)
>>> grid_search = GridSearchCV(sgd_lr, parameters,
...                            n_jobs=-1, cv=3)
>>> grid_search.fit(X_train, Y_train)
>>> print(grid_search.best_params_)
{'alpha': 1e-07, 'eta0': 0.1, 'penalty': None}

To predict using the optimal model, we apply the following:

>>> sgd_lr_best = grid_search.best_estimator_
>>> accuracy = sgd_lr_best.score(X_test, Y_test)
>>> print(f'The accuracy on testing set is: {accuracy*100:.1f}%')
The accuracy on testing set is: 94.2%

It doesn't look much different from the previous example, since SGDClassifier handles multiclass internally. Feel free to compute the confusion matrix as an exercise. It will be interesting to see how the model performs on individual classes.
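For the confusion matrix exercise, one possible starting point is the NumPy sketch below. The labels are made up for illustration; in the digits example, you would presumably pass Y_test and the output of sgd_lr_best.predict(X_test) with n_classes=10. The helper confusion_matrix is written by hand here, although scikit-learn also offers sklearn.metrics.confusion_matrix.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes):
    """Count samples per (true class, predicted class) pair.

    Rows are true classes, columns are predicted classes.
    """
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(matrix, (y_true, y_pred), 1)
    return matrix


# Hypothetical labels standing in for the true and predicted test labels
y_true = np.array([0, 1, 2, 2, 1, 0, 2])
y_pred = np.array([0, 2, 2, 2, 1, 0, 1])

cm = confusion_matrix(y_true, y_pred, n_classes=3)
# Fraction of each true class that was predicted correctly
per_class_recall = cm.diagonal() / cm.sum(axis=1)

print(cm)
print(per_class_recall)
```

The diagonal holds the correctly classified samples, so dividing it by the row sums shows how the model performs on each individual class.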

The next section will be a bonus section where we will implement logistic regression with TensorFlow and use click prediction as an example.

Implementing logistic regression using TensorFlow

We herein use 90% of the first 300,000 samples for training, the remaining 10% for testing, and assume that X_train_enc, Y_train, X_test_enc, and Y_test contain the correct data:

  1. First, we import TensorFlow, transform X_train_enc and X_test_enc into a numpy array, and cast X_train_enc, Y_train, X_test_enc, and Y_test to float32:
    >>> import tensorflow as tf
    >>> X_train_enc = X_train_enc.toarray().astype('float32')
    >>> X_test_enc = X_test_enc.toarray().astype('float32')
    >>> Y_train = Y_train.astype('float32')
    >>> Y_test = Y_test.astype('float32')
    
  2. We use the tf.data API to shuffle and batch data:
    >>> batch_size = 1000
    >>> train_data = tf.data.Dataset.from_tensor_slices((X_train_enc, Y_train))
    >>> train_data = train_data.repeat().shuffle(5000).batch(batch_size).prefetch(1)
    

    For each weight update, only one batch of samples is consumed, instead of a single sample or the complete training set. The model moves a step based on the error calculated on a batch of samples. The batch size is 1,000 in this example.

  3. Then, we define the weights and bias of the logistic regression model:
    >>> n_features = int(X_train_enc.shape[1])
    >>> W = tf.Variable(tf.zeros([n_features, 1]))
    >>> b = tf.Variable(tf.zeros([1]))
    
  4. We then create a gradient descent optimizer that searches for the best coefficients by minimizing the loss. We herein use Adam as our optimizer, which is an advanced gradient descent with a learning rate (starting with 0.0008) that is adaptive to gradients:
    >>> learning_rate = 0.0008
    >>> optimizer = tf.optimizers.Adam(learning_rate)
    
  5. We define the optimization process where we compute the current prediction and cost and update the model coefficients following the computed gradients:
    >>> def run_optimization(x, y):
    ...     with tf.GradientTape() as g:
    ...         logits = tf.add(tf.matmul(x, W), b)[:, 0]
    ...         cost = tf.reduce_mean(
    ...             tf.nn.sigmoid_cross_entropy_with_logits(
    ...                 labels=y, logits=logits))
    ...     gradients = g.gradient(cost, [W, b])
    ...     optimizer.apply_gradients(zip(gradients, [W, b]))
    

    Here, tf.GradientTape allows us to track TensorFlow computations and calculate gradients with respect to the given variables.

  6. We run the training for 6,000 steps (one step is with one batch of random samples):
    >>> training_steps = 6000
    >>> for step, (batch_x, batch_y) in enumerate(
    ...         train_data.take(training_steps), 1):
    ...     run_optimization(batch_x, batch_y)
    ...     if step % 500 == 0:
    ...         logits = tf.add(tf.matmul(batch_x, W), b)[:, 0]
    ...         loss = tf.reduce_mean(
    ...             tf.nn.sigmoid_cross_entropy_with_logits(
    ...                 labels=batch_y, logits=logits))
    ...         print("step: %i, loss: %f" % (step, loss))
    step: 500, loss: 0.448672
    step: 1000, loss: 0.389186
    step: 1500, loss: 0.413012
    step: 2000, loss: 0.445663
    step: 2500, loss: 0.361000
    step: 3000, loss: 0.417154
    step: 3500, loss: 0.359435
    step: 4000, loss: 0.393363
    step: 4500, loss: 0.402097
    step: 5000, loss: 0.376734
    step: 5500, loss: 0.372981
    step: 6000, loss: 0.406973
    

    Every 500 steps, we compute and print the current cost to check the training performance. Because the loss is measured on a single random batch, it fluctuates from step to step, but it trends downward overall.

  7. After the model is trained, we use it to make predictions on the testing set and report the AUC metric:
    >>> logits = tf.add(tf.matmul(X_test_enc, W), b)[:, 0]
    >>> pred = tf.nn.sigmoid(logits)
    >>> auc_metric = tf.keras.metrics.AUC()
    >>> auc_metric.update_state(Y_test, pred)
    >>> print(f'AUC on testing set: {auc_metric.result().numpy():.3f}')
    AUC on testing set: 0.771
    

    We are able to achieve an AUC of 0.771 with the TensorFlow-based logistic regression model. You can also tweak the learning rate, the number of training steps, and other hyperparameters to obtain a better performance. This will be a fun exercise at the end of the chapter.
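To see what the steps above do without TensorFlow, the following NumPy sketch trains the same kind of model with mini-batches. It rests on several assumptions: the data is synthetic, plain gradient descent replaces Adam to keep the code short, and the function sigmoid_cross_entropy_with_logits is a hand-written stand-in that follows the numerically stable formula documented for the TensorFlow function of the same name.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_cross_entropy_with_logits(labels, logits):
    """Numerically stable per-sample log loss computed from raw logits."""
    return (np.maximum(logits, 0) - logits * labels
            + np.log1p(np.exp(-np.abs(logits))))


rng = np.random.default_rng(0)
# Synthetic data standing in for X_train_enc and Y_train
X = rng.normal(size=(5000, 4))
y = (X[:, 0] - X[:, 1] > 0).astype(float)

W = np.zeros(4)
b = 0.0
learning_rate, batch_size = 0.1, 100

losses = []
for step in range(1, 501):
    # Draw one random batch per step
    idx = rng.integers(0, len(X), size=batch_size)
    batch_x, batch_y = X[idx], y[idx]
    logits = batch_x @ W + b
    error = sigmoid(logits) - batch_y  # derivative of the loss w.r.t. the logits
    W -= learning_rate * (batch_x.T @ error) / batch_size
    b -= learning_rate * error.mean()
    if step % 100 == 0:
        loss = sigmoid_cross_entropy_with_logits(
            batch_y, batch_x @ W + b).mean()
        losses.append(loss)
        print(f"step: {step}, loss: {loss:.4f}")
```

Working from logits rather than from probabilities avoids taking the logarithm of values that round to 0 or 1, which is presumably why the TensorFlow version of the model is written that way as well.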

You have seen how feature selection works with L1-regularized logistic regression in the previous section, Feature selection using L1 regularization, where weights of unimportant features are compressed to close to, or exactly, 0. Besides L1-regularized logistic regression, random forest is another frequently used feature selection technique. Let's see more in the next section.

Feature selection using random forest

To recap, random forest is bagging over a set of individual decision trees. Each tree considers a random subset of the features when searching for the best splitting point at each node. And, in a decision tree, only those significant features (along with their splitting values) are used to constitute tree nodes. Consider the forest as a whole: the more frequently a feature is used in a tree node, the more important it is. In other words, we can rank the importance of features based on their occurrences in nodes among all trees, and select the top most important ones.

A trained RandomForestClassifier module in scikit-learn comes with an attribute, feature_importances_, indicating the feature importance. scikit-learn calculates it as the normalized total reduction in impurity that a feature brings across the trees, so features used in many effective splits score high. Again, we will examine feature selection with random forest on the dataset with 100,000 ad click samples:

>>> from sklearn.ensemble import RandomForestClassifier
>>> random_forest = RandomForestClassifier(n_estimators=100,
...                  criterion='gini', min_samples_split=30, n_jobs=-1)
>>> random_forest.fit(X_train_enc.toarray(), Y_train)

After fitting the random forest model, we obtain the feature importance scores with the following:

>>> feature_imp = random_forest.feature_importances_
>>> print(feature_imp)
[1.60540750e-05 1.71248082e-03 9.64485853e-04 ... 5.41025913e-04
 7.78878273e-04 8.24041944e-03]

Take a look at the bottom 10 feature scores and the corresponding 10 least important features:

>>> feature_names = enc.get_feature_names()
>>> print(np.sort(feature_imp)[:10])
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
>>> bottom_10 = np.argsort(feature_imp)[:10]
>>> print('10 least important features are:\n', feature_names[bottom_10])
10 least important features are:
 ['x8_ea4912eb' 'x8_c2d34e02' 'x6_2d332391' 'x2_ca9b09d0' 
'x2_0273c5ad' 'x8_92bed2f3' 'x8_eb3f4b48' 'x3_535444a1' 'x8_8741c65a' 
'x8_46cb77e5']

And now, take a look at the top 10 feature scores and the corresponding 10 most important features:

>>> print(np.sort(feature_imp)[-10:])
[0.00809279 0.00824042 0.00885188 0.00897925 0.01080301 0.01088246
 0.01270395 0.01392431 0.01532718 0.01810339]
>>> top_10 = np.argsort(feature_imp)[-10:]
>>> print('10 most important features are:\n', feature_names[top_10])
10 most important features are:
 ['x17_-1' 'x18_157' 'x12_300' 'x13_250' 'x3_98572c79' 'x8_8a4875bd' 'x14_1993' 'x15_2' 'x2_d9750ee7' 'x18_33']

In this section, we covered how random forest is used for feature selection.
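The importance scores reported by scikit-learn's feature_importances_ are impurity-based, so it can help to see how much impurity a single split removes. The sketch below is a simplified illustration for one tree node, not scikit-learn's code: the helpers gini and impurity_decrease and the eight-sample node are hypothetical.

```python
import numpy as np


def gini(labels):
    """Gini impurity of a set of class labels."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    fractions = counts / counts.sum()
    return 1.0 - np.sum(fractions ** 2)


def impurity_decrease(labels, left_mask):
    """Gini reduction from splitting labels into left and right children."""
    left, right = labels[left_mask], labels[~left_mask]
    n = len(labels)
    weighted_child = (len(left) / n * gini(left)
                      + len(right) / n * gini(right))
    return gini(labels) - weighted_child


# Hypothetical node with 8 samples and two binary candidate features
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])
informative = np.array([1, 1, 1, 0, 0, 0, 0, 0], dtype=bool)
uninformative = np.array([1, 0, 1, 0, 1, 0, 1, 0], dtype=bool)

print(impurity_decrease(y, informative))
print(impurity_decrease(y, uninformative))
```

A feature that separates the classes well removes a lot of impurity and therefore accumulates importance, while a feature whose split leaves both children as mixed as the parent contributes nothing.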

Summary

In this chapter, we continued working on the online advertising click-through prediction project. This time, we overcame the categorical feature challenge by means of the one-hot encoding technique. We then resorted to a new classification algorithm, logistic regression, for its high scalability to large datasets. The in-depth discussion of the logistic regression algorithm started with the introduction of the logistic function, which led to the mechanics of the algorithm itself. This was followed by how to train a logistic regression model using gradient descent.

After implementing a logistic regression classifier by hand and testing it on our click-through dataset, you learned how to train the logistic regression model in a more advanced manner, using SGD, and we adjusted our algorithm accordingly. We also practiced how to use the SGD-based logistic regression classifier from scikit-learn and applied it to our project.

We then continued to tackle problems we might face in using logistic regression, including L1 and L2 regularization for reducing overfitting, online learning techniques for training on large-scale datasets, and handling multiclass scenarios. You also learned how to implement logistic regression with TensorFlow. Finally, the chapter ended with applying the random forest model to feature selection, as an alternative to L1-regularized logistic regression.

You might be curious about how we can efficiently train the model on the entire dataset of 40 million samples. In the next chapter, we will utilize tools such as Spark and the PySpark module to scale up our solution.

Exercises

  1. In the logistic regression-based click-through prediction project, can you also tweak hyperparameters such as penalty, eta0, and alpha in the SGDClassifier model? What is the highest testing AUC you are able to achieve?
  2. Can you try to use more training samples, for instance, 10 million samples, in the online learning solution?
  3. In the TensorFlow-based solution, can you tweak the learning rate, the number of training steps, and other hyperparameters to obtain a better performance?