Training on large datasets with online learning

So far, we have trained our model on no more than 300,000 samples. If we go much beyond this figure, memory will be overloaded, since it has to hold the entire dataset at once, and the program will crash. In this section, we will present how to train on a large-scale dataset with online learning.

Stochastic gradient descent builds on gradient descent by sequentially updating the model with one training sample at a time, instead of with the complete training set at once. We can scale up stochastic gradient descent further with online learning techniques. In online learning, new data becomes available for training in sequential order or in real time, as opposed to all at once as in an offline learning environment. Only a relatively small chunk of data is loaded and preprocessed for training at a time, which frees the memory that would otherwise hold the entire large dataset. Besides better computational feasibility, online learning is also used for its adaptability to cases where new data is generated in real time and is needed to keep the model up to date. For instance, stock price prediction models are updated in an online learning manner with timely market data; click-through prediction models need to include the most recent data reflecting users' latest behaviors and tastes; and spam email detectors have to react to ever-changing spammers by considering new features that are dynamically generated.
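
To make the per-sample update concrete, here is a minimal sketch of a single stochastic gradient step for logistic regression; the function name sgd_step and the fixed learning rate eta are our own illustration, not part of scikit-learn:

>>> import numpy as np
>>> def sgd_step(w, x, y, eta=0.01):
...     # predicted probability of the positive class for this one sample
...     p = 1.0 / (1.0 + np.exp(-np.dot(x, w)))
...     # move the weights against the gradient of the log loss:
...     # w := w - eta * (p - y) * x
...     return w - eta * (p - y) * x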

The existing model, trained on previous datasets, is now updated using only the most recently available dataset, instead of being rebuilt from scratch on the previous and recent datasets together, as happens in offline learning.
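
In code terms, the contrast looks roughly like the following sketch, where model, X_old, y_old, X_new, and y_new are hypothetical placeholders:

>>> # offline learning: rebuild from scratch on all of the data seen so far
>>> model.fit(np.vstack([X_old, X_new]), np.hstack([y_old, y_new]))
>>> # online learning: update the existing model with the new chunk only
>>> model.partial_fit(X_new, y_new, classes=[0, 1])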

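The code that follows relies on the following imports, which are assumed to have been run in the earlier sections:

>>> import timeit
>>> import pandas as pd
>>> from sklearn.preprocessing import OneHotEncoder
>>> from sklearn.linear_model import SGDClassifier
>>> from sklearn.metrics import roc_auc_score
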
The SGDClassifier module in scikit-learn implements online learning with the partial_fit method (whereas the fit method is applied in offline learning, as we have seen). We will train the model on 1,000,000 samples, feeding in 100,000 samples at a time to simulate an online learning environment, and then test the trained model on the next 100,000 samples as follows:

>>> n_rows = 100000 * 11
>>> df = pd.read_csv("train", nrows=n_rows)
>>> X = df.drop(['click', 'id', 'hour', 'device_id', 'device_ip'],
...             axis=1).values
>>> Y = df['click'].values
>>> n_train = 100000 * 10
>>> X_train = X[:n_train]
>>> Y_train = Y[:n_train]
>>> X_test = X[n_train:]
>>> Y_test = Y[n_train:]

Fit the encoder on the whole training set so that it sees all the categorical values up front (handle_unknown='ignore' makes it encode any value unseen during fitting as all zeros):

>>> enc = OneHotEncoder(handle_unknown='ignore')
>>> enc.fit(X_train)

Initialize an SGD logistic regression model, where we set the number of iterations, max_iter, to 1 in order to partially fit the model and enable online learning:

>>> sgd_lr_online = SGDClassifier(loss='log_loss', penalty=None,
...                               fit_intercept=True, max_iter=1,
...                               learning_rate='constant', eta0=0.01)

Loop over the training data in chunks of 100,000 samples and partially fit the model on each chunk:

>>> start_time = timeit.default_timer()
>>> for i in range(10):
...     x_train = X_train[i*100000:(i+1)*100000]
...     y_train = Y_train[i*100000:(i+1)*100000]
...     x_train_enc = enc.transform(x_train)
...     sgd_lr_online.partial_fit(x_train_enc.toarray(), y_train,
...                               classes=[0, 1])

Again, we use the partial_fit method for online learning. Also, we specify the classes parameter, which is required in the first call to partial_fit, since any individual chunk might not contain samples from every class:

>>> print("--- %0.3fs seconds ---" % (timeit.default_timer() - 
start_time))
--- 167.399s seconds ---

Apply the trained model to the testing set, the next 100,000 samples, as follows:

>>> x_test_enc = enc.transform(X_test)
>>> pred = sgd_lr_online.predict_proba(x_test_enc.toarray())[:, 1]
>>> print('Training samples: {0}, AUC on testing set: {1:.3f}'.format(
...     n_train, roc_auc_score(Y_test, pred)))
Training samples: 1000000, AUC on testing set: 0.761

With online learning, training on a total of 1 million samples only takes about 167 seconds, and yields a better AUC, of 0.761.
