Training a logistic regression model with regularization

As we briefly mentioned in the last section, the penalty parameter in the logistic regression SGDClassifier is related to model regularization. There are two basic forms of regularization, L1 and L2. In either case, the regularization is an additional term on top of the original cost function:

$$J(w) = \frac{1}{m}\sum_{i=1}^{m}-\left[y^{(i)}\log\left(\hat{y}(x^{(i)})\right)+\left(1-y^{(i)}\right)\log\left(1-\hat{y}(x^{(i)})\right)\right]+\alpha\|w\|^{q}$$

where $\alpha$ is the constant that multiplies the regularization term, and q is either 1 or 2, representing L1 or L2 regularization, where:

$$\|w\|^{1}=\sum_{j=1}^{n}\left|w_{j}\right|,\qquad \|w\|^{2}=\sum_{j=1}^{n}w_{j}^{2}$$
Training a logistic regression model is a process of reducing the cost as a function of the weights w. If it reaches a point where some weights, such as $w_i$, $w_j$, and $w_k$, are considerably large, the whole cost will be dominated by these large weights. In this case, the learned model may just memorize the training set and fail to generalize to unseen data. The regularization term is introduced here in order to penalize large weights, as the weights now become part of the cost to minimize. Regularization, as a result, reduces overfitting. Finally, the parameter $\alpha$ provides a tradeoff between log loss and generalization. If $\alpha$ is too small, it is not able to suppress large weights and the model may suffer from high variance, or overfitting; on the other hand, if $\alpha$ is too large, the model becomes overly generalized and performs poorly in terms of fitting the dataset, which is the symptom of underfitting. $\alpha$ is an important parameter to tune in order to obtain the best logistic regression model with regularization.
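As a rough sketch of such tuning, $\alpha$ can be searched over with cross-validation. The grid of candidate values below is illustrative only, the estimator settings mirror the SGDClassifier configuration used later in this section, and X_train_10k and y_train_10k are the training arrays introduced in the following steps:

>>> from sklearn.linear_model import SGDClassifier
>>> from sklearn.model_selection import GridSearchCV
>>> # Candidate alpha values to try (illustrative grid)
>>> sgd_lr = SGDClassifier(loss='log', penalty='l2', fit_intercept=True,
...                        n_iter=5, learning_rate='constant', eta0=0.01)
>>> grid_search = GridSearchCV(sgd_lr,
...                            {'alpha': [1e-5, 1e-4, 1e-3, 1e-2]}, cv=3)
>>> grid_search.fit(X_train_10k, y_train_10k)
>>> print(grid_search.best_params_)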

As for choosing between the L1 and L2 form, the rule of thumb is whether feature selection is expected. In machine learning classification, feature selection is the process of picking a subset of significant features for use in better model construction. In practice, not every feature in a dataset carries information useful for discriminating samples; some features are either redundant or irrelevant, and hence can be discarded with little loss.

In the logistic regression classifier, feature selection can be achieved only with L1 regularization. To understand this, consider two weight vectors $w_1 = (1, 0)$ and $w_2 = (0.5, 0.5)$, and suppose they produce the same amount of log loss. The L1 and L2 regularization terms of each weight vector are:

$$\|w_1\|^{1}=|1|+|0|=1,\qquad \|w_2\|^{1}=|0.5|+|0.5|=1$$

$$\|w_1\|^{2}=1^{2}+0^{2}=1,\qquad \|w_2\|^{2}=0.5^{2}+0.5^{2}=0.5$$

The L1 term of both vectors is the same, while the L2 term of $w_2$ is less than that of $w_1$. This indicates that L2 regularization penalizes weight vectors composed of significantly large and significantly small weights more than L1 regularization does. In other words, L2 regularization favors relatively small values for all weights and avoids significantly large or small values for any weight, while L1 regularization allows some weights to take significantly small values and others significantly large values. Only with L1 regularization can some weights be compressed to close to, or exactly, 0, which enables feature selection.
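As a quick numeric check of this comparison, here is a small, self-contained NumPy snippet using the two example vectors above:

>>> import numpy as np
>>> w1 = np.array([1.0, 0.0])
>>> w2 = np.array([0.5, 0.5])
>>> print(np.abs(w1).sum(), np.abs(w2).sum())    # L1 terms: 1.0 and 1.0
>>> print((w1 ** 2).sum(), (w2 ** 2).sum())      # L2 terms: 1.0 and 0.5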

In scikit-learn, the regularization type can be specified by the penalty parameter with the options "none" (no regularization), "l1", "l2", and "elasticnet" (a mixture of L1 and L2), and the multiplier $\alpha$ by the alpha parameter.
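A minimal sketch of how these options might be specified is shown below; apart from penalty, alpha, and l1_ratio, the settings are illustrative only:

>>> from sklearn.linear_model import SGDClassifier
>>> # L2 regularization with multiplier alpha
>>> sgd_l2 = SGDClassifier(loss='log', penalty='l2', alpha=0.0001)
>>> # Elastic net: a mixture of L1 and L2, with l1_ratio controlling the L1 share
>>> sgd_en = SGDClassifier(loss='log', penalty='elasticnet', alpha=0.0001,
...                        l1_ratio=0.5)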

We herein examine L1 regularization for feature selection as follows:

Initialize an SGD logistic regression model with L1 regularization, and train the model on 10,000 samples:

>>> from sklearn.linear_model import SGDClassifier
>>> l1_feature_selector = SGDClassifier(loss='log', penalty='l1',
...                                     alpha=0.0001, fit_intercept=True,
...                                     n_iter=5, learning_rate='constant',
...                                     eta0=0.01)
>>> l1_feature_selector.fit(X_train_10k, y_train_10k)

With the trained model, now select the important features using the transform method:

>>> X_train_10k_selected = l1_feature_selector.transform(X_train_10k)  

The generated dataset contains the 574 most important features only:

>>> print(X_train_10k_selected.shape)
(10000, 574)

As opposed to 2820 features in the original dataset:

>>> print(X_train_10k.shape)
(10000, 2820)
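Note that in more recent scikit-learn releases, the transform method has been removed from estimators such as SGDClassifier. On such a version, the SelectFromModel wrapper provides equivalent functionality; a sketch follows, where the exact number of selected features may differ slightly depending on the selection threshold:

>>> from sklearn.feature_selection import SelectFromModel
>>> # Wrap the already fitted L1 model; prefit=True skips refitting
>>> selector = SelectFromModel(l1_feature_selector, prefit=True)
>>> X_train_10k_selected = selector.transform(X_train_10k)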

Take a closer look at the weights of the trained model:

>>> l1_feature_selector.coef_
array([[ 0.17832874,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ]])

Its bottom 10 weights and the corresponding 10 least important features:

>>> print(np.sort(l1_feature_selector.coef_)[0][:10])
[-0.59326128 -0.43930402 -0.43054312 -0.42387413 -0.41166026
-0.41166026 -0.31539391 -0.30743371 -0.28278958 -0.26746869]
>>> print(np.argsort(l1_feature_selector.coef_)[0][:10])
[ 559 1540 2172 34 2370 2566 579 2116 278 2221]

And its top 10 weights and the corresponding 10 most important features:

>>> print(np.sort(l1_feature_selector.coef_)[0][-10:])
[ 0.27764331 0.29581609 0.30518966 0.3083551 0.31949471
0.3464423 0.35382674 0.3711177 0.38212495 0.40790229]
>>> print(np.argsort(l1_feature_selector.coef_)[0][-10:])
[2110 2769 546 547 2275 2149 2580 1503 1519 2761]

We can also look up what the actual features are, as follows:

>>> dict_one_hot_encoder.feature_names_[2761]
'site_id=d9750ee7'
>>> dict_one_hot_encoder.feature_names_[1519]
'device_model=84ebbcd4'
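Similarly, we can map the whole set of top 10 indices to feature names in one go; this small sketch reuses the dict_one_hot_encoder object fitted earlier in the chapter:

>>> top_10_indices = np.argsort(l1_feature_selector.coef_)[0][-10:]
>>> for index in top_10_indices:
...     print(dict_one_hot_encoder.feature_names_[index])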