Predicting stock price with four regression algorithms

Now that we've learned four (or five, you could say) commonly used and powerful regression algorithms and performance evaluation metrics, let's utilize each of them to solve our stock price prediction problem.

Earlier, we generated features based on data from 1988 to 2016. We now construct the training set from the 1988 to 2015 data and the testing set from the 2016 data:

>>> data_raw = pd.read_csv('19880101_20161231.csv', index_col='Date')
>>> data = generate_features(data_raw)
>>> start_train = '1988-01-01'
>>> end_train = '2015-12-31'
>>> start_test = '2016-01-01'
>>> end_test = '2016-12-31'
>>> data_train = data.loc[start_train:end_train]
>>> X_train = data_train.drop('close', axis=1).values
>>> y_train = data_train['close'].values
>>> print(X_train.shape)
(6804, 37)
>>> print(y_train.shape)
(6804,)

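All fields in the dataframe data except 'close' are feature columns, and 'close' is the target column. We have 6,804 training samples and each sample is 37-dimensional. The testing set, built the same way, has 252 samples:

>>> data_test = data.loc[start_test:end_test]
>>> X_test = data_test.drop('close', axis=1).values
>>> y_test = data_test['close'].values
>>> print(X_test.shape)
(252, 37)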

We first experiment with SGD-based linear regression. Before training the model, note that SGD-based algorithms are sensitive to features at widely different scales. In our case, for example, the average value of the open feature is around 8,856, while that of the moving_avg_365 feature is around 0.00037. Hence, we need to normalize the features to the same or a comparable scale. We do so by removing the mean and rescaling to unit variance:

>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler()

We rescale both sets with the scaler fitted on the training set:

>>> X_scaled_train = scaler.fit_transform(X_train)
>>> X_scaled_test = scaler.transform(X_test)
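
To make the fit/transform distinction concrete, here is a minimal pure-Python sketch of what the standardization amounts to for a single feature column. The helper names fit_scaler and apply_scaler and the toy values are illustrative assumptions, not part of scikit-learn. The point is only that the mean and standard deviation are learned from the training data and then reused unchanged on the testing data:

```python
from statistics import fmean, pstdev


def fit_scaler(train_column):
    """Learn the mean and population standard deviation from training data only."""
    return fmean(train_column), pstdev(train_column)


def apply_scaler(column, mean, std):
    """Standardize a column with statistics that were learned elsewhere."""
    return [(value - mean) / std for value in column]


# Toy values for one feature (hypothetical, not taken from the stock data)
train_column = [8800.0, 8850.0, 8900.0, 8950.0]
test_column = [9000.0, 9100.0]

mean, std = fit_scaler(train_column)
scaled_train = apply_scaler(train_column, mean, std)
scaled_test = apply_scaler(test_column, mean, std)  # reuses the training statistics

print(scaled_train)
print(scaled_test)
```

The scaled training column has zero mean and unit variance by construction, while the scaled testing column generally does not, which is expected: the testing data must not influence the statistics.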

Now we can search for the SGD-based linear regression model with the optimal set of parameters. We specify l2 regularization and 1,000 iterations, and tune the regularization term multiplier, alpha, and the initial learning rate, eta0:

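>>> from sklearn.linear_model import SGDRegressor
>>> from sklearn.model_selection import GridSearchCV
>>> param_grid = {
...     "alpha": [1e-5, 3e-5, 1e-4],
...     "eta0": [0.01, 0.03, 0.1],
... }
>>> # max_iter replaces the deprecated n_iter argument
>>> lr = SGDRegressor(penalty='l2', max_iter=1000)
>>> grid_search = GridSearchCV(lr, param_grid, cv=5, scoring='r2')
>>> grid_search.fit(X_scaled_train, y_train)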
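
As a rough illustration of how much work the grid search above implies, the following sketch enumerates the parameter combinations by hand. The helper expand_grid is a hypothetical name introduced here for illustration; it mimics the exhaustive enumeration only, not the cross-validated scoring that GridSearchCV performs:

```python
from itertools import product


def expand_grid(param_grid):
    """Return every parameter combination in the grid as a list of dicts."""
    names = sorted(param_grid)
    value_lists = [param_grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


param_grid = {
    "alpha": [1e-5, 3e-5, 1e-4],
    "eta0": [0.01, 0.03, 0.1],
}
candidates = expand_grid(param_grid)
n_folds = 5
n_fits = len(candidates) * n_folds  # cross-validation fits, before the final refit

print(len(candidates), "candidates,", n_fits, "cross-validation fits")
for candidate in candidates[:3]:
    print(candidate)
```

With three values for each of the two parameters and five folds, this gives 9 candidates and 45 model fits, plus one final refit of the best candidate on the whole training set.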

Select the best linear regression model and make predictions on the testing samples:

>>> print(grid_search.best_params_)
{'alpha': 3e-05, 'eta0': 0.03}
>>> lr_best = grid_search.best_estimator_
>>> predictions_lr = lr_best.predict(X_scaled_test)

Measure the prediction performance via the MSE, MAE, and R²:

>>> from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
>>> print('MSE: {0:.3f}'.format(
...     mean_squared_error(y_test, predictions_lr)))
MSE: 18934.971
>>> print('MAE: {0:.3f}'.format(
...     mean_absolute_error(y_test, predictions_lr)))
MAE: 100.244
>>> print('R^2: {0:.3f}'.format(r2_score(y_test, predictions_lr)))
R^2: 0.979

We achieve an R² of 0.979 with a fine-tuned linear regression model.
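
For reference, the three metrics can be written out in a few lines of plain Python. This is a minimal sketch of the standard definitions, with function names suffixed _sketch and toy values that are assumptions made here for illustration; in practice, the scikit-learn functions used above are the ones to call:

```python
def mean_squared_error_sketch(y_true, y_pred):
    """Average of the squared differences between truth and prediction."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error_sketch(y_true, y_pred):
    """Average of the absolute differences between truth and prediction."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score_sketch(y_true, y_pred):
    """1 minus the ratio of residual sum of squares to total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values (hypothetical, not taken from the stock data)
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

mse = mean_squared_error_sketch(y_true, y_pred)
mae = mean_absolute_error_sketch(y_true, y_pred)
r2 = r2_score_sketch(y_true, y_pred)
print('MSE: {0:.3f}, MAE: {1:.3f}, R^2: {2:.3f}'.format(mse, mae, r2))
```

Because the MSE squares each error, a few large misses inflate it far more than they inflate the MAE, which is worth keeping in mind when comparing the models in this section.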

Similarly, we experiment with random forest, where we specify 500 trees to ensemble and tune the maximum depth of the tree, max_depth; the minimum number of samples required to further split a node, min_samples_split; the number of features considered at each split, max_features; and the minimum number of samples required at a leaf node, min_samples_leaf:

>>> from sklearn.ensemble import RandomForestRegressor
>>> param_grid = {
...     'max_depth': [50, 70, 80],
...     'min_samples_split': [5, 10],
...     # 'auto' (all features) was removed in scikit-learn 1.3; use 1.0 there
...     'max_features': ['auto', 'sqrt'],
...     'min_samples_leaf': [3, 5]
... }
>>> rf = RandomForestRegressor(n_estimators=500, n_jobs=-1)
>>> grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='r2',
...                            n_jobs=-1)
>>> grid_search.fit(X_train, y_train)

Note that this may take a while, so we use all available CPU cores for training.

Select the best regression forest model and make predictions on the testing samples:

>>> print(grid_search.best_params_)
{'max_depth': 70, 'max_features': 'auto', 'min_samples_leaf': 3, 'min_samples_split': 5}
>>> rf_best = grid_search.best_estimator_
>>> predictions_rf = rf_best.predict(X_test)

Measure the prediction performance as follows:

>>> print('MSE: {0:.3f}'.format(
...     mean_squared_error(y_test, predictions_rf)))
MSE: 260349.365
>>> print('MAE: {0:.3f}'.format(
...     mean_absolute_error(y_test, predictions_rf)))
MAE: 299.344
>>> print('R^2: {0:.3f}'.format(r2_score(y_test, predictions_rf)))
R^2: 0.706

An R² of 0.706 is obtained with a tweaked forest regressor.

Next, we work with SVR with linear and RBF kernels, and leave the penalty parameter C, ε, and the kernel coefficient of RBF, gamma, for fine-tuning. Similar to SGD-based algorithms, SVR doesn't work well on data with feature scale disparity:

>>> param_grid = [
...     {'kernel': ['linear'], 'C': [100, 300, 500],
...      'epsilon': [0.00003, 0.0001]},
...     {'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
...      'C': [10, 100, 1000], 'epsilon': [0.00003, 0.0001]}
... ]

Again, to work around this, we use the rescaled data to train the SVR model:

>>> from sklearn.svm import SVR
>>> svr = SVR()
>>> grid_search = GridSearchCV(svr, param_grid, cv=5, scoring='r2')
>>> grid_search.fit(X_scaled_train, y_train)
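
To illustrate the role of the epsilon parameter being tuned here, the following is a minimal sketch of the ε-insensitive loss that SVR is built around: errors inside a tube of width ε around the truth cost nothing, and errors outside it are penalized linearly. The function name and the toy values are assumptions made for illustration only:

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Per-sample loss: zero inside the epsilon tube, linear outside it."""
    return [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]


# Toy values (hypothetical, not taken from the stock data)
y_true = [100.0, 100.0, 100.0]
y_pred = [100.05, 100.5, 98.0]

losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1)
print(losses)
```

The first prediction falls inside the tube and incurs no loss, while the other two are penalized by the amount they exceed ε. With ε values as small as those in the grid above and an unscaled price target, nearly every deviation falls outside the tube.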

Select the best SVR model and make predictions on the testing samples:

>>> print(grid_search.best_params_)
{'C': 500, 'epsilon': 3e-05, 'kernel': 'linear'}
>>> svr_best = grid_search.best_estimator_ 
>>> predictions_svr = svr_best.predict(X_scaled_test)
>>> print('MSE: {0:.3f}'.format(mean_squared_error(y_test, predictions_svr)))
MSE: 17466.596
>>> print('MAE: {0:.3f}'.format(mean_absolute_error(y_test, predictions_svr)))
MAE: 95.070
>>> print('R^2: {0:.3f}'.format(r2_score(y_test, predictions_svr)))
R^2: 0.980

With SVR, we're able to achieve an R² of 0.980 on the testing set.

Finally, we experiment with a neural network and search over the following hyperparameter options, which cover the hidden layer sizes, activation function, optimizer, learning rate, penalty factor, and mini-batch size:

>>> from sklearn.neural_network import MLPRegressor
>>> param_grid = {
...     'hidden_layer_sizes': [(50, 10), (30, 30)],
...     'activation': ['logistic', 'tanh', 'relu'],
...     'solver': ['sgd', 'adam'],
...     'learning_rate_init': [0.0001, 0.0003, 0.001, 0.01],
...     'alpha': [0.00003, 0.0001, 0.0003],
...     'batch_size': [30, 50]
... }
>>> nn = MLPRegressor(random_state=42, max_iter=2000)
>>> grid_search = GridSearchCV(nn, param_grid, cv=5, scoring='r2',
...                            n_jobs=-1)
>>> grid_search.fit(X_scaled_train, y_train)

Select the best neural network model and make predictions on the testing samples:

>>> print(grid_search.best_params_)
{'activation': 'relu', 'alpha': 0.0003, 'hidden_layer_sizes': (50, 10), 'learning_rate_init': 0.001, 'solver': 'adam'}
>>> nn_best = grid_search.best_estimator_
>>> predictions_nn = nn_best.predict(X_scaled_test)
>>> print('MSE: {0:.3f}'.format(
...     mean_squared_error(y_test, predictions_nn)))
MSE: 19619.618
>>> print('MAE: {0:.3f}'.format(
...     mean_absolute_error(y_test, predictions_nn)))
MAE: 100.956
>>> print('R^2: {0:.3f}'.format(r2_score(y_test, predictions_nn)))
R^2: 0.978

We're able to achieve an R² of 0.978 with a fine-tuned neural network model.
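
To compare the four models side by side, the testing-set metrics reported above can be collected and ranked. The sketch below simply re-enters those reported numbers by hand; the dictionary layout and model labels are choices made here for illustration:

```python
# Testing-set metrics as reported in the text above
results = {
    "SGD linear regression": {"MSE": 18934.971, "MAE": 100.244, "R2": 0.979},
    "Random forest": {"MSE": 260349.365, "MAE": 299.344, "R2": 0.706},
    "SVR": {"MSE": 17466.596, "MAE": 95.070, "R2": 0.980},
    "Neural network": {"MSE": 19619.618, "MAE": 100.956, "R2": 0.978},
}

# Rank the models from highest to lowest R^2
ranked = sorted(results.items(), key=lambda item: item[1]["R2"], reverse=True)

print("{:<24}{:>12}{:>10}{:>8}".format("Model", "MSE", "MAE", "R^2"))
for name, metrics in ranked:
    print("{:<24}{:>12.3f}{:>10.3f}{:>8.3f}".format(
        name, metrics["MSE"], metrics["MAE"], metrics["R2"]))
```

On these numbers, SVR, linear regression, and the neural network are nearly tied, while the random forest trails well behind on all three metrics.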

We can also plot the predictions generated by each of the four algorithms, along with the ground truth.
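
One way to draw such a plot is sketched below with matplotlib. The helper plot_predictions, its arguments, and the axis labels are assumptions introduced here for illustration, not code taken from the original text:

```python
import matplotlib.pyplot as plt


def plot_predictions(dates, y_true, predictions_by_model):
    """Plot the ground truth and each model's predictions on shared axes."""
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.plot(dates, y_true, color="black", label="Truth")
    for model_name, predictions in predictions_by_model.items():
        ax.plot(dates, predictions, label=model_name)
    ax.set_xlabel("Date")
    ax.set_ylabel("Close price")
    ax.legend()
    return ax
```

With the variables from this section, a call might pass pd.to_datetime(data_test.index) as the dates, y_test as the truth, and a dictionary mapping labels such as 'Linear regression', 'Random forest', 'SVR', and 'Neural network' to predictions_lr, predictions_rf, predictions_svr, and predictions_nn, followed by plt.show().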
