7

Predicting Stock Prices with Regression Algorithms

In the previous chapter, we trained a classifier on a large click dataset using Spark. In this chapter, we will be solving a problem that interests everyone—predicting stock prices. Getting wealthy by means of smart investment—who isn't interested?! Stock market movements and stock price predictions have been actively researched by a large number of financial, trading, and even technology corporations. A variety of methods have been developed to predict stock prices using machine learning techniques. Herein, we will be focusing on learning several popular regression algorithms, including linear regression, regression trees and regression forests, and support vector regression, and utilizing them to tackle this billion (or trillion) dollar problem.

We will cover the following topics in this chapter:

  • Introducing the stock market and stock prices
  • What is regression?
  • Stock data acquisition and feature engineering
  • The mechanics of linear regression
  • Implementing linear regression (from scratch, and using scikit-learn and TensorFlow)
  • The mechanics of regression trees
  • Implementing regression trees (from scratch and using scikit-learn)
  • From regression tree to regression forest
  • The mechanics of support vector regression and implementing it with scikit-learn
  • Regression performance evaluation
  • Predicting stock prices with regression algorithms

A brief overview of the stock market and stock prices

The stock of a corporation signifies ownership in the corporation. A single share of the stock represents a claim on the fractional assets and the earnings of the corporation in proportion to the total number of shares. For example, if an investor owns 50 shares of stock in a company that has, in total, 1,000 outstanding shares, that investor (or shareholder) would own and have a claim on 5% of the company's assets and earnings.

Stocks of a company can be traded between shareholders and other parties via stock exchanges and organizations. Major stock exchanges include New York Stock Exchange, NASDAQ, London Stock Exchange Group, Shanghai Stock Exchange, and Hong Kong Stock Exchange. The prices that a stock is traded at fluctuate essentially due to the law of supply and demand. At any one moment, the supply is the number of shares that are in the hands of public investors, the demand is the number of shares investors want to buy, and the price of the stock moves up and down in order to attain and maintain equilibrium.

In general, investors want to buy low and sell high. This sounds simple enough, but it's very challenging to implement as it's monumentally difficult to say whether a stock price will go up or down. There are two main streams of studies that attempt to understand factors and conditions that lead to price changes or even to forecast future stock prices, fundamental analysis and technical analysis:

  • Fundamental analysis: This stream focuses on the underlying factors that influence a company's value and business, including the overall economy and industry conditions from a macro perspective, and the company's financial conditions, management, and competitors from a micro perspective.
  • Technical analysis: On the other hand, this stream predicts future price movements through the statistical study of past trading activity, including price movement, volume, and market data. Predicting prices via machine learning techniques is an important topic in technical analysis nowadays.

Many quantitative, or quant, trading firms have been using machine learning to empower automated and algorithmic trading. In this chapter, we'll be working as a quantitative analyst/researcher, exploring how to predict stock prices with several typical machine learning regression algorithms.

What is regression?

Regression is one of the main types of supervised learning in machine learning. In regression, the training set contains observations (represented by features) and their associated continuous target values. The process of regression has two phases:

  • The first phase is exploring the relationships between the observations and the targets. This is the training phase.
  • The second phase is using the patterns from the first phase to generate the target for a future observation. This is the prediction phase.

The overall process is depicted in the following diagram:


Figure 7.1: Training and prediction phase in regression

The major difference between regression and classification is that the output values in regression are continuous, while in classification they are discrete. This leads to different application areas for these two supervised learning methods. Classification is basically used to determine desired memberships or characteristics, as you've seen in previous chapters, such as email being spam or not, newsgroup topics, or ad click-through. On the other hand, regression mainly involves estimating an outcome or forecasting a response.

An example of estimating continuous targets with linear regression is depicted as follows, where we try to fit a line against a set of two-dimensional data points:


Figure 7.2: Linear regression example

Typical machine learning regression problems include the following:

  • Predicting house prices based on location, square footage, number of bedrooms, and bathrooms
  • Estimating power consumption based on information about a system's processes and memory
  • Forecasting demand in retail
  • Predicting stock prices

I've talked about regression in this section and will briefly introduce its use in the stock market and trading in the next one.

Mining stock price data

In theory, we can apply regression techniques to predicting prices of a particular stock. However, it's difficult to ensure the stock we pick is suitable for learning purposes—its price should follow some learnable patterns and it can't have been affected by unprecedented instances or irregular events. Hence, we'll herein be focusing on one of the most popular stock indexes to better illustrate and generalize our price regression approach.

Let's first cover what an index is. A stock index is a statistical measure of the value of a portion of the overall stock market. An index includes several stocks that are diverse enough to represent a section of the whole market. And the price of an index is typically computed as the weighted average of the prices of selected stocks.
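For example, as a purely hypothetical illustration, if an index tracked only two stocks priced at $100 and $50 with weights of 0.6 and 0.4, its level would be 0.6 × 100 + 0.4 × 50 = 80.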

The Dow Jones Industrial Average (DJIA) is one of the longest established and most commonly watched indexes in the world. It consists of 30 of the most significant stocks in the U.S., such as Microsoft, Apple, General Electric, and the Walt Disney Company, and represents around a quarter of the value of the entire U.S. market. You can view its daily prices and performance on Yahoo Finance at https://finance.yahoo.com/quote/%5EDJI/history?p=%5EDJI:

Figure 7.3: Screenshot of daily prices and performance in Yahoo Finance

On each trading day, the price of a stock changes and is recorded in real time. Five values illustrating movements in the price over one unit of time (usually one day, but it can also be one week or one month) are key trading indicators. They are as follows:

  • Open: The starting price for a given trading day
  • Close: The final price on that day
  • High: The highest price at which the stock traded on that day
  • Low: The lowest price at which the stock traded on that day
  • Volume: The total number of shares traded before the market closed on that day

Other major indexes besides the DJIA include the following:

  • S&P 500 (short for Standard & Poor's 500): made up of 500 of the largest publicly traded U.S. companies
  • NASDAQ Composite: composed of all the stocks listed on the NASDAQ exchange
  • Russell 2000: a collection of 2,000 small-cap U.S. companies
  • FTSE 100: composed of the 100 largest companies listed on the London Stock Exchange

We will be focusing on the DJIA and using its historical prices and performance to predict future prices. In the following sections, we will explore how to develop price prediction models, specifically regression models, and what can be used as indicators or predictive features.

Getting started with feature engineering

When it comes to a machine learning algorithm, the first question to ask is usually what features are available or what the predictive variables are.

The driving factors used to predict the future price of the DJIA, namely the close price, include historical and current open prices as well as historical performance (high, low, and volume). Note that current or same-day performance (high, low, and volume) shouldn't be included because we simply can't foresee the highest and lowest prices at which the stock will trade, or the total number of shares that will be traded, before the market closes on that day.

Predicting the close price with only those preceding four indicators doesn't seem promising and might lead to underfitting. So, we need to think of ways to generate more features in order to increase predictive power. To recap, in machine learning, feature engineering is the process of creating domain-specific features based on existing features in order to improve the performance of a machine learning algorithm.

Feature engineering usually requires sufficient domain knowledge and can be very difficult and time-consuming. In reality, features used to solve a machine learning problem are not usually directly available and need to be specifically designed and constructed, for example, term frequency or tf-idf features in spam email detection and newsgroup classification. Hence, feature engineering is essential in machine learning and is usually where we spend the most effort in solving a practical problem.

When making an investment decision, investors usually look at historical prices over a period of time, not just the price the day before. Therefore, in our stock price prediction case, we can compute the average close price over the past week (five trading days), over the past month, and over the past year as three new features. We can also customize the time window to the size we want, such as the past quarter or the past six months. On top of these three averaged price features, we can generate new features associated with the price trend by computing the ratios between each pair of average prices in the three different time frames, for instance, the ratio between the average price over the past week and over the past year.

Besides prices, volume is another important factor that investors analyze. Similarly, we can generate new volume-based features by computing the average volumes in several different time frames and the ratios between each pair of averaged values.

Besides historical averaged values in a time window, investors also greatly consider stock volatility. Volatility describes the degree of variation of prices for a given stock or index over time. In statistical terms, it's basically the standard deviation of the close prices. We can easily generate new sets of features by computing the standard deviation of close prices in a particular time frame, as well as the standard deviation of volumes traded. In a similar manner, ratios between each pair of standard deviation values can be included in our engineered feature pool.

Last but not least, return is a significant financial metric that investors closely watch. Return is the percentage of gain or loss in the close price of a stock/index over a particular period. For example, daily return and annual return are financial terms we frequently hear. They are calculated as follows:

return_{i:i-1} = \frac{price_i - price_{i-1}}{price_{i-1}}

return_{i:i-365} = \frac{price_i - price_{i-365}}{price_{i-365}}

Here, price_i is the price on the ith day and price_{i-1} is the price on the day before. Weekly and monthly returns can be computed in a similar way. Based on daily returns, we can also produce a moving average over a particular number of days.

For instance, given the daily returns of the past trading week, return_{i:i-1}, return_{i-1:i-2}, return_{i-2:i-3}, return_{i-3:i-4}, and return_{i-4:i-5}, we can calculate the moving average over that week as follows:

MovingAvg_5 = \frac{return_{i:i-1} + return_{i-1:i-2} + return_{i-2:i-3} + return_{i-3:i-4} + return_{i-4:i-5}}{5}
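As a quick illustration (assuming df is a DataFrame with a Close column, like the one we will load in the next section), the daily return and its five-day moving average can be computed with pandas as follows:

>>> daily_return = df['Close'].pct_change()        # (price_i - price_{i-1}) / price_{i-1}
>>> moving_avg_5 = daily_return.rolling(5).mean()  # five-day moving average of daily returns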

In summary, we can generate the following predictive variables by applying feature engineering techniques:


Figure 7.4: Generated features (1)


Figure 7.5: Generated features (2)

Eventually, we are able to generate 31 features in total, along with the following six original features:

  • OpenPrice_i: This feature represents the open price on day i
  • OpenPrice_i-1: This feature represents the open price on the previous day
  • ClosePrice_i-1: This feature represents the close price on the previous day
  • HighPrice_i-1: This feature represents the highest price on the previous day
  • LowPrice_i-1: This feature represents the lowest price on the previous day
  • Volume_i-1: This feature represents the volume on the previous day

Acquiring data and generating features

For easier reference, we will implement the code for generating features here rather than in later sections. We will start by obtaining the dataset we need for our project.

Throughout the project, we will acquire stock index price and performance data from Yahoo Finance. For example, on the Historical Data page, https://finance.yahoo.com/quote/%5EDJI/history?p=%5EDJI, we can change the Time Period to Dec 01, 2005 – Dec 10, 2005, select Historical Prices in Show, and Daily in Frequency (or open this link directly: https://finance.yahoo.com/quote/%5EDJI/history?period1=1133395200&period2=1134172800&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true), then click on the Apply button. Click the Download data button to download the data and name the file 20051201_20051210.csv.
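If you prefer to pull the data programmatically rather than clicking through the website, a third-party library such as yfinance offers an alternative route. The following is an optional sketch (it assumes yfinance is installed, and the exact column layout of the saved CSV may differ slightly between library versions):

>>> import yfinance as yf
>>> # The end date is exclusive, so use Dec 11 to include Dec 10
>>> mydata = yf.download('^DJI', start='2005-12-01', end='2005-12-11')
>>> mydata.to_csv('20051201_20051210.csv')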

We can load the data we just downloaded as follows:

>>> mydata = pd.read_csv('20051201_20051210.csv', index_col='Date')
>>> mydata
               Open         High         Low          Close 
Date
2005-12-01 10806.030273 10934.900391 10806.030273 10912.570312
2005-12-02 10912.009766 10921.370117 10861.660156 10877.509766
2005-12-05 10876.950195 10876.950195 10810.669922 10835.009766
2005-12-06 10835.410156 10936.200195 10835.410156 10856.860352
2005-12-07 10856.860352 10868.059570 10764.009766 10810.910156
2005-12-08 10808.429688 10847.250000 10729.669922 10755.120117
2005-12-09 10751.759766 10805.950195 10729.910156 10778.580078
              Volume    Adjusted Close
Date
2005-12-01 256980000.0   10912.570312
2005-12-02 214900000.0   10877.509766
2005-12-05 237340000.0   10835.009766
2005-12-06 264630000.0   10856.860352
2005-12-07 243490000.0   10810.910156
2005-12-08 253290000.0   10755.120117
2005-12-09 238930000.0   10778.580078

Note that the output is a pandas DataFrame object. The Date column is the index column, and the remaining columns are the corresponding financial variables. In the following lines of code, you will see how powerful pandas is at simplifying data analysis and transformation of relational (or table-like) data.

First, we implement feature generation by starting with a sub-function that directly creates features from the original six features, as follows:

>>> def add_original_feature(df, df_new):
...     df_new['open'] = df['Open']
...     df_new['open_1'] = df['Open'].shift(1)
...     df_new['close_1'] = df['Close'].shift(1)
...     df_new['high_1'] = df['High'].shift(1)
...     df_new['low_1'] = df['Low'].shift(1)
...     df_new['volume_1'] = df['Volume'].shift(1)

Then we develop a sub-function that generates six features related to average close prices:

>>> def add_avg_price(df, df_new):
...     df_new['avg_price_5'] = 
                     df['Close'].rolling(5).mean().shift(1)
...     df_new['avg_price_30'] =  
                     df['Close'].rolling(21).mean().shift(1)
...     df_new['avg_price_365'] = 
                     df['Close'].rolling(252).mean().shift(1)
...     df_new['ratio_avg_price_5_30'] = 
                 df_new['avg_price_5'] / df_new['avg_price_30']
...     df_new['ratio_avg_price_5_365'] = 
                 df_new['avg_price_5'] / df_new['avg_price_365']
...     df_new['ratio_avg_price_30_365'] = 
                df_new['avg_price_30'] / df_new['avg_price_365']

Similarly, a sub-function that generates six features related to average volumes is as follows:

>>> def add_avg_volume(df, df_new):
...     df_new['avg_volume_5'] = 
                  df['Volume'].rolling(5).mean().shift(1)
...     df_new['avg_volume_30'] =   
                  df['Volume'].rolling(21).mean().shift(1)
...     df_new['avg_volume_365'] = 
                      df['Volume'].rolling(252).mean().shift(1)
...     df_new['ratio_avg_volume_5_30'] = 
                df_new['avg_volume_5'] / df_new['avg_volume_30']
...     df_new['ratio_avg_volume_5_365'] = 
               df_new['avg_volume_5'] / df_new['avg_volume_365']
...     df_new['ratio_avg_volume_30_365'] = 
               df_new['avg_volume_30'] / df_new['avg_volume_365']

As for the standard deviation, we develop the following sub-function for the price-related features:

>>> def add_std_price(df, df_new):
...     df_new['std_price_5'] = 
               df['Close'].rolling(5).std().shift(1)
...     df_new['std_price_30'] = 
               df['Close'].rolling(21).std().shift(1)
...     df_new['std_price_365'] = 
               df['Close'].rolling(252).std().shift(1)
...     df_new['ratio_std_price_5_30'] = 
               df_new['std_price_5'] / df_new['std_price_30']
...     df_new['ratio_std_price_5_365'] = 
               df_new['std_price_5'] / df_new['std_price_365']
...     df_new['ratio_std_price_30_365'] = 
               df_new['std_price_30'] / df_new['std_price_365']

Similarly, a sub-function that generates six volume-based standard deviation features is as follows:

>>> def add_std_volume(df, df_new):
...     df_new['std_volume_5'] = 
                 df['Volume'].rolling(5).std().shift(1)
...     df_new['std_volume_30'] = 
                 df['Volume'].rolling(21).std().shift(1)
...     df_new['std_volume_365'] = 
                 df['Volume'].rolling(252).std().shift(1)
...     df_new['ratio_std_volume_5_30'] = 
                df_new['std_volume_5'] / df_new['std_volume_30']
...     df_new['ratio_std_volume_5_365'] = 
                df_new['std_volume_5'] / df_new['std_volume_365']
...     df_new['ratio_std_volume_30_365'] = 
               df_new['std_volume_30'] / df_new['std_volume_365']

Seven return-based features are generated using the following sub-function:

>>> def add_return_feature(df, df_new):
...     df_new['return_1'] = ((df['Close'] - df['Close'].shift(1))    
                               / df['Close'].shift(1)).shift(1)
...     df_new['return_5'] = ((df['Close'] - df['Close'].shift(5)) 
                               / df['Close'].shift(5)).shift(1)
...     df_new['return_30'] = ((df['Close'] - 
           df['Close'].shift(21)) / df['Close'].shift(21)).shift(1)
...     df_new['return_365'] = ((df['Close'] - 
         df['Close'].shift(252)) / df['Close'].shift(252)).shift(1)
...     df_new['moving_avg_5'] = 
                    df_new['return_1'].rolling(5).mean().shift(1)
...     df_new['moving_avg_30'] = 
                    df_new['return_1'].rolling(21).mean().shift(1)
...     df_new['moving_avg_365'] = 
                   df_new['return_1'].rolling(252).mean().shift(1)

Finally, we put together the main feature generation function that calls all the preceding sub-functions:

>>> def generate_features(df):
...     """
...     Generate features for a stock/index based on historical price and performance
...     @param df: dataframe with columns "Open", "Close", "High", "Low", "Volume", "Adjusted Close"
...     @return: dataframe, data set with new features
...     """
...     df_new = pd.DataFrame()
...     # 6 original features
...     add_original_feature(df, df_new)
...     # 31 generated features
...     add_avg_price(df, df_new)
...     add_avg_volume(df, df_new)
...     add_std_price(df, df_new)
...     add_std_volume(df, df_new)
...     add_return_feature(df, df_new)
...     # the target
...     df_new['close'] = df['Close']
...     df_new = df_new.dropna(axis=0)
...     return df_new

Note that the window sizes here are 5, 21, and 252, instead of 7, 30, and 365 representing the weekly, monthly, and yearly window. This is because there are 252 (rounded) trading days in a year, 21 trading days in a month, and 5 in a week.

We can apply this feature engineering strategy on the DJIA data queried from 1988 to 2019 as follows (or directly download it from this page: https://finance.yahoo.com/quote/%5EDJI/history?period1=567993600&period2=1577750400&interval=1d&filter=history&frequency=1d):

>>> data_raw = pd.read_csv('19880101_20191231.csv', index_col='Date')
>>> data = generate_features(data_raw)

Take a look at what the data with the new features looks like:

>>> print(data.round(decimals=3).head(5))

The preceding command line generates the following output:


Figure 7.6: Screenshot of printing the first five rows of the DataFrame

Since all features and driving factors are ready, we will now focus on regression algorithms that estimate the continuous target variables based on these predictive features.

Estimating with linear regression

The first regression model that comes to our mind is linear regression. Does it mean fitting data points using a linear function, as its name implies? Let's explore it.

How does linear regression work?

In simple terms, linear regression tries to fit as many of the data points as possible with a straight line in two-dimensional space, or with a plane in three-dimensional space. It explores the linear relationship between observations and targets, and the relationship is represented by a linear equation, or weighted sum function. Given a data sample x with n features, x1, x2, …, xn (x represents a feature vector and x = (x1, x2, …, xn)), and the weights (also called coefficients) of the linear regression model w (w represents a vector (w1, w2, …, wn)), the target y is expressed as follows:

y = w_1 x_1 + w_2 x_2 + \ldots + w_n x_n = w^T x

Also, sometimes the linear regression model comes with an intercept (also called bias), w_0, so the preceding linear relationship becomes as follows:

y = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_n x_n = w^T x (with x_0 = 1 prepended to the feature vector)

Does it look familiar? The logistic regression algorithm you learned about in Chapter 5, Predicting Online Ad Click-Through with Logistic Regression, is just the addition of a logistic transformation on top of linear regression, which maps the continuous weighted sum to the 0 (negative) or 1 (positive) class. Similarly, a linear regression model, or specifically its weight vector, w, is learned from the training data, with the goal of minimizing the estimation error, defined as the mean squared error (MSE), which measures the average of the squares of the differences between the truth and the prediction. Given m training samples, (x(1), y(1)), (x(2), y(2)), …, (x(i), y(i)), …, (x(m), y(m)), the cost function J(w) regarding the weights to be optimized is expressed as follows:

J(w) = \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}(x^{(i)}) - y^{(i)} \right)^2

Here, \hat{y}(x^{(i)}) = w^T x^{(i)} is the prediction.

Again, we can obtain the optimal w such that J(w) is minimized using gradient descent. The first-order derivative, the gradient Δw, is derived as follows:

\Delta w = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}(x^{(i)}) - y^{(i)} \right) x^{(i)}

Combined with the gradient and the learning rate η, the weight vector w can be updated in each step as follows:

w := w - \eta \Delta w

After a substantial number of iterations, the learned w is then used to predict a new sample x' as follows:

y' = w^T x'

After learning about the mathematical theory behind linear regression, let's implement it from scratch in the next section.

Implementing linear regression from scratch

Now that you have a thorough understanding of gradient-descent-based linear regression, we'll implement it from scratch.

We start by defining the function that computes the prediction, ŷ, with the current weights:

>>> def compute_prediction(X, weights):
...     """
...     Compute the prediction y_hat based on current weights
...     """
...     predictions = np.dot(X, weights)
...     return predictions

Then, we continue with the function updating the weight, w, with one step in a gradient descent manner, as follows:

>>> def update_weights_gd(X_train, y_train, weights, 
learning_rate):
...     """
...     Update weights by one step and return updated weights
...     """
...     predictions = compute_prediction(X_train, weights)
...     weights_delta = np.dot(X_train.T, y_train - predictions)
...     m = y_train.shape[0]
...     weights += learning_rate / float(m) * weights_delta
...     return weights

Next, we add the function that calculates the cost J(w) as well:

>>> def compute_cost(X, y, weights):
...     """
...     Compute the cost J(w)
...     """
...     predictions = compute_prediction(X, weights)
...     cost = np.mean((predictions - y) ** 2 / 2.0)
...     return cost

Now, put all functions together with a model training function by performing the following tasks:

  1. Update the weight vector in each iteration
  2. Print out the current cost every 100 iterations (or any other interval you choose) to ensure that the cost is decreasing and training is on the right track

Let's see how it's done by executing the following commands:

>>> def train_linear_regression(X_train, y_train, max_iter, learning_rate, fit_intercept=False):
...     """
...     Train a linear regression model with gradient descent, and return trained model
...     """
...     if fit_intercept:
...         intercept = np.ones((X_train.shape[0], 1))
...         X_train = np.hstack((intercept, X_train))
...     weights = np.zeros(X_train.shape[1])
...     for iteration in range(max_iter):
...         weights = update_weights_gd(X_train, y_train, 
                                       weights, learning_rate)
...         # Check the cost for every 100 (for example) iterations
...         if iteration % 100 == 0:
...             print(compute_cost(X_train, y_train, weights))
...     return weights

Finally, predict the results of new input values using the trained model as follows:

>>> def predict(X, weights):
...     if X.shape[1] == weights.shape[0] - 1:
...         intercept = np.ones((X.shape[0], 1))
...         X = np.hstack((intercept, X))
...     return compute_prediction(X, weights)

Implementing linear regression is very similar to logistic regression, as you just saw. Let's examine it with a small example:

>>> X_train = np.array([[6], [2], [3], [4], [1], 
                        [5], [2], [6], [4], [7]])
>>> y_train = np.array([5.5, 1.6, 2.2, 3.7, 0.8, 
                        5.2, 1.5, 5.3, 4.4, 6.8])

Train a linear regression model for 100 iterations, with a learning rate of 0.01 and an intercept included in the weights:

>>> weights = train_linear_regression(X_train, y_train,
            max_iter=100, learning_rate=0.01, fit_intercept=True)

Check the model's performance on new samples as follows:

>>> X_test = np.array([[1.3], [3.5], [5.2], [2.8]])
>>> predictions = predict(X_test, weights)
>>> import matplotlib.pyplot as plt
>>> plt.scatter(X_train[:, 0], y_train, marker='o', c='b')
>>> plt.scatter(X_test[:, 0], predictions, marker='*', c='k')
>>> plt.xlabel('x')
>>> plt.ylabel('y')
>>> plt.show()

Refer to the following screenshot for the end result:


Figure 7.7: Linear regression on a toy dataset

The model we trained correctly predicts new samples (depicted by the stars).

Let's try it on another dataset, the diabetes dataset from scikit-learn:

>>> from sklearn import datasets
>>> diabetes = datasets.load_diabetes()
>>> print(diabetes.data.shape)
(442, 10)
>>> num_test = 30 
>>> X_train = diabetes.data[:-num_test, :]
>>> y_train = diabetes.target[:-num_test]

Train a linear regression model for 5,000 iterations, with a learning rate of 1 and an intercept included in the weights (the cost is displayed every 500 iterations):

>>> weights = train_linear_regression(X_train, y_train, 
              max_iter=5000, learning_rate=1, fit_intercept=True)
2960.1229915
1539.55080927
1487.02495658
1480.27644342
1479.01567047
1478.57496091
1478.29639883
1478.06282572
1477.84756968
1477.64304737
>>> X_test = diabetes.data[-num_test:, :]
>>> y_test = diabetes.target[-num_test:]
>>> predictions = predict(X_test, weights)
>>> print(predictions)
[ 232.22305668 123.87481969 166.12805033 170.23901231 
  228.12868839 154.95746522 101.09058779 87.33631249 
  143.68332296 190.29353122 198.00676871 149.63039042 
   169.56066651 109.01983998 161.98477191 133.00870377 
   260.1831988 101.52551082 115.76677836 120.7338523
   219.62602446 62.21227353 136.29989073 122.27908721 
   55.14492975 191.50339388 105.685612 126.25915035 
   208.99755875 47.66517424]
>>> print(y_test)
[ 261. 113. 131. 174. 257. 55. 84. 42. 146. 212. 233. 
  91. 111. 152. 120. 67. 310. 94. 183. 66. 173. 72. 
  49. 64. 48. 178. 104. 132. 220. 57.]

The estimate is pretty close to the ground truth.

Next, let's utilize scikit-learn to implement linear regression.

Implementing linear regression with scikit-learn

So far, we have been using gradient descent in weight optimization but, as with logistic regression, linear regression is also open to stochastic gradient descent (SGD). To use it, we can simply replace the update_weights_gd function with the update_weights_sgd function we created in Chapter 5, Predicting Online Ad Click-Through with Logistic Regression.
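If you don't have the Chapter 5 code at hand, the following is a minimal sketch of what such an update_weights_sgd function might look like; it reuses the compute_prediction function defined earlier and updates the weights one sample at a time:

>>> def update_weights_sgd(X_train, y_train, weights, learning_rate):
...     """
...     One pass over the training set, updating the weights
...     one sample at a time (stochastic gradient descent)
...     """
...     for X_each, y_each in zip(X_train, y_train):
...         prediction = compute_prediction(X_each, weights)
...         weights_delta = X_each.T * (y_each - prediction)
...         weights += learning_rate * weights_delta
...     return weights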

We can also directly use the SGD-based regression algorithm, SGDRegressor, from scikit-learn:

>>> from sklearn.linear_model import SGDRegressor
>>> regressor = SGDRegressor(loss='squared_loss', penalty='l2',
  alpha=0.0001, learning_rate='constant', eta0=0.01, max_iter=1000)

Here, 'squared_loss' for the loss parameter indicates that the cost function is the MSE; penalty is the regularization term, and it can be None, l1, or l2 (similar to SGDClassifier in Chapter 5, Predicting Online Ad Click-Through with Logistic Regression) in order to reduce overfitting; max_iter is the number of iterations; and the remaining two parameters mean that the learning rate is 0.01 and remains unchanged during the course of training. Train the model and output predictions on the testing set as follows:

>>> regressor.fit(X_train, y_train)
>>> predictions = regressor.predict(X_test)
>>> print(predictions)
[ 231.03333725 124.94418254 168.20510142 170.7056729 
  226.52019503 154.85011364 103.82492496 89.376184 
  145.69862538 190.89270871 197.0996725 151.46200981 
  170.12673917 108.50103463 164.35815989 134.10002755 
  259.29203744 103.09764563 117.6254098 122.24330421
  219.0996765 65.40121381 137.46448687 123.25363156 
  57.34965405 191.0600674 109.21594994 128.29546226 
  207.09606669 51.10475455]

You can also implement linear regression with TensorFlow. Let's see this in the next section.

Implementing linear regression with TensorFlow

First, we import TensorFlow and construct the model:

>>> import tensorflow as tf
>>> layer0 = tf.keras.layers.Dense(units=1, 
                      input_shape=[X_train.shape[1]])
>>> model = tf.keras.Sequential(layer0)

It uses a single linear layer (which you can think of as a linear function) to map the X_train.shape[1]-dimensional input to a one-dimensional output.

Next, we specify the loss function, the MSE, and a gradient descent optimizer Adam with a learning rate of 1:

>>> model.compile(loss='mean_squared_error',
             optimizer=tf.keras.optimizers.Adam(1))

Now we train the model for 100 iterations:

>>> model.fit(X_train, y_train, epochs=100, verbose=True)
Epoch 1/100
412/412 [==============================] - 1s 2ms/sample - loss: 27612.9129
Epoch 2/100
412/412 [==============================] - 0s 44us/sample - loss: 23802.3043
Epoch 3/100
412/412 [==============================] - 0s 47us/sample - loss: 20383.9426
Epoch 4/100
412/412 [==============================] - 0s 51us/sample - loss: 17426.2599
Epoch 5/100
412/412 [==============================] - 0s 44us/sample - loss: 14857.0057
……
Epoch 96/100
412/412 [==============================] - 0s 55us/sample - loss: 2971.6798
Epoch 97/100
412/412 [==============================] - 0s 44us/sample - loss: 2970.8919
Epoch 98/100
412/412 [==============================] - 0s 52us/sample - loss: 2970.7903
Epoch 99/100
412/412 [==============================] - 0s 47us/sample - loss: 2969.7266
Epoch 100/100
412/412 [==============================] - 0s 46us/sample - loss: 2970.4180

This also prints out the loss for every iteration. Finally, we make predictions using the trained model:

>>> predictions = model.predict(X_test)[:, 0]
>>> print(predictions)
[231.52155  124.17711  166.71492  171.3975   227.70126  152.02522
 103.01532   91.79277  151.07457  190.01042  190.60373  152.52274
 168.92166  106.18033  167.02473  133.37477  259.24756  101.51256
 119.43106  120.893005 219.37921   64.873634 138.43217  123.665634
  56.33039  189.27441  108.67446  129.12535  205.06857   47.99469 ]

The next regression algorithm you will be learning about is decision tree regression.

Estimating with decision tree regression

Decision tree regression is also called a regression tree. It is easy to understand a regression tree by comparing it with its sibling, the classification tree, with which you are already familiar.

Transitioning from classification trees to regression trees

In classification, a decision tree is constructed by recursive binary splitting and growing each node into left and right children. In each partition, it greedily searches for the most significant combination of features and its value as the optimal splitting point. The quality of separation is measured by the weighted purity of labels of the two resulting children, specifically via Gini Impurity or Information Gain. In regression, the tree construction process is almost identical to the classification one, with only two differences due to the fact that the target becomes continuous:

  • The quality of the splitting point is now measured by the weighted MSE of two children; the MSE of a child is equivalent to the variance of all target values, and the smaller the weighted MSE, the better the split
  • The average value of targets in a terminal node becomes the leaf value, instead of the majority of labels in the classification tree

To make sure you understand regression trees, let's work on a small example of house price estimation using the features house type and number of bedrooms:


Figure 7.8: Toy dataset of house prices

We first define the MSE and weighted MSE computation functions that will be used in our calculation:

>>> def mse(targets):
...     # When the set is empty
...     if targets.size == 0:
...         return 0
...     return np.var(targets)

Then we define the weighted MSE after a split in a node:

>>> def weighted_mse(groups):
...     """
...     Calculate weighted MSE of children after a split
...     """
...     total = sum(len(group) for group in groups)
...     weighted_sum = 0.0
...     for group in groups:
...         weighted_sum += len(group) / float(total) * mse(group)
...     return weighted_sum

Test things out by executing the following commands:

>>> print(f'{mse(np.array([1, 2, 3])):.4f}')
0.6667
>>> print(f'{weighted_mse([np.array([1, 2, 3]), np.array([1, 2])]):.4f}')
0.5000

To build the house price regression tree, we first exhaust all possible pairs of feature and value, and we compute the corresponding MSE:

MSE(type, semi) = weighted_mse([[600, 400, 700], [700, 800]]) = 10333
MSE(bedroom, 2) = weighted_mse([[700, 400], [600, 800, 700]]) = 13000
MSE(bedroom, 3) = weighted_mse([[600, 800], [700, 400, 700]]) = 16000
MSE(bedroom, 4) = weighted_mse([[700], [600, 700, 800, 400]]) = 17500

The lowest MSE is achieved with the type, semi pair, and the root node is then formed by this splitting point. The result of this partition is as follows:


Figure 7.9: Splitting using (type=semi)

If we are satisfied with a one-level regression tree, we can stop here by assigning both branches as leaf nodes with the value as the average of targets of the samples included. Alternatively, we can go further down the road by constructing the second level from the right branch (the left branch can't be split further):

MSE(bedroom, 2) = weighted_mse([[], [600, 400, 700]]) = 15556
MSE(bedroom, 3) = weighted_mse([[400], [600, 700]]) = 1667
MSE(bedroom, 4) = weighted_mse([[400, 600], [700]]) = 6667

With the second splitting point specified by the bedroom, 3 pair (whether it has at least three bedrooms or not) with the lowest MSE, our tree becomes as shown in the following diagram:


Figure 7.10: Splitting using (bedroom>=3)

We can finish up the tree by assigning average values to both leaf nodes.
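For instance, the detached branch on the left becomes a leaf valued at (700 + 800) / 2 = 750, the semi branch with fewer than three bedrooms becomes a leaf valued at 400 (a single sample), and the semi branch with at least three bedrooms becomes a leaf valued at (600 + 700) / 2 = 650.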

Implementing decision tree regression

Now that you're clear about the regression tree construction process, it's time for coding.

The node splitting utility function we will define in this section is identical to what we used in Chapter 4, Predicting Online Ad Click-Through with Tree-Based Algorithms, which separates samples in a node into left and right branches based on a feature and value pair:

>>> def split_node(X, y, index, value):
...     """
...     Split data set X, y based on a feature and a value
...     @param index: index of the feature used for splitting
...     @param value: value of the feature used for splitting
...     @return: left and right child, a child is in the format of [X, y]
...     """
...     x_index = X[:, index]
...     # if this feature is numerical
...     if type(X[0, index]) in [int, float]:
...         mask = x_index >= value
...     # if this feature is categorical
...     else:
...         mask = x_index == value
...     # split into left and right child
...     left = [X[~mask, :], y[~mask]]
...     right = [X[mask, :], y[mask]]
...     return left, right

Next, we define the greedy search function, trying out all possible splits and returning the one with the least weighted MSE:

>>> def get_best_split(X, y):
...     """
...     Obtain the best splitting point and resulting children for the data set X, y
...     @return: {index: index of the feature, value: feature value, children: left and right children}
...     """
...     best_index, best_value, best_score, children = 
                                     None, None, 1e10, None
...     for index in range(len(X[0])):
...         for value in np.sort(np.unique(X[:, index])):
...             groups = split_node(X, y, index, value)
...             impurity = weighted_mse(
                                [groups[0][1], groups[1][1]])
...             if impurity < best_score:
...                 best_index, best_value, best_score, children 
                                   = index, value, impurity, groups
...     return {'index': best_index, 'value': best_value, 
                'children': children}

The preceding selection and splitting process occurs in a recursive manner on each of the subsequent children. When a stopping criterion is met, the process at a node stops, and the mean value of the sample targets will be assigned to this terminal node:

>>> def get_leaf(targets):
...     # Obtain the leaf as the mean of the targets
...     return np.mean(targets)

And finally, here is the recursive function, split, that links it all together. It checks whether any stopping criteria are met and assigns the leaf node if so, or proceeds with further separation otherwise:

>>> def split(node, max_depth, min_size, depth):
...     """
...     Split children of a node to construct new nodes or assign them terminals
...     @param node: dict, with children info
...     @param max_depth: maximal depth of the tree
...     @param min_size: minimal samples required to further split a child
...     @param depth: current depth of the node
...     """
...     left, right = node['children']
...     del (node['children'])
...     if left[1].size == 0:
...         node['right'] = get_leaf(right[1])
...         return
...     if right[1].size == 0:
...         node['left'] = get_leaf(left[1])
...         return
...     # Check if the current depth exceeds the maximal depth
...     if depth >= max_depth:
...         node['left'], node['right'] = get_leaf(
                             left[1]), get_leaf(right[1])
...         return
...     # Check if the left child has enough samples
...     if left[1].size <= min_size:
...         node['left'] = get_leaf(left[1])
...     else:
...         # It has enough samples, we further split it
...         result = get_best_split(left[0], left[1])
...         result_left, result_right = result['children']
...         if result_left[1].size == 0:
...             node['left'] = get_leaf(result_right[1])
...         elif result_right[1].size == 0:
...             node['left'] = get_leaf(result_left[1])
...         else:
...             node['left'] = result
...             split(node['left'], max_depth, min_size, depth + 1)
...     # Check if the right child has enough samples
...     if right[1].size <= min_size:
...         node['right'] = get_leaf(right[1])
...     else:
...         # It has enough samples, we further split it
...         result = get_best_split(right[0], right[1])
...         result_left, result_right = result['children']
...         if result_left[1].size == 0:
...             node['right'] = get_leaf(result_right[1])
...         elif result_right[1].size == 0:
...             node['right'] = get_leaf(result_left[1])
...         else:
...             node['right'] = result
...             split(node['right'], max_depth, min_size, 
                       depth + 1)

The entry point of the regression tree construction is as follows:

>>> def train_tree(X_train, y_train, max_depth, min_size):
...     root = get_best_split(X_train, y_train)
...     split(root, max_depth, min_size, 1)
...     return root

Now, let's test it with a hand-calculated example:

>>> X_train = np.array([['semi', 3],
...                     ['detached', 2],
...                     ['detached', 3],
...                     ['semi', 2],
...                     ['semi', 4]], dtype=object)
>>> y_train = np.array([600, 700, 800, 400, 700])
>>> tree = train_tree(X_train, y_train, 2, 2)

To verify the trained tree is identical to what we constructed by hand, we write a function displaying the tree:

>>> CONDITION = {'numerical': {'yes': '>=', 'no': '<'},
...              'categorical': {'yes': 'is', 'no': 'is not'}}
>>> def visualize_tree(node, depth=0):
...     if isinstance(node, dict):
...         if type(node['value']) in [int, float]:
...             condition = CONDITION['numerical']
...         else:
...             condition = CONDITION['categorical']
...         print('{}|- X{} {} {}'.format(depth * ' ', 
                  node['index'] + 1, condition['no'], 
                  node['value']))
...         if 'left' in node:
...             visualize_tree(node['left'], depth + 1)
...         print('{}|- X{} {} {}'.format(depth * ' ', 
                 node['index'] + 1, condition['yes'], 
                 node['value']))
...         if 'right' in node:
...             visualize_tree(node['right'], depth + 1)
...     else:
...         print('{}[{}]'.format(depth * ' ', node))
>>> visualize_tree(tree)
|- X1 is not detached
  |- X2 < 3
    [400.0]
  |- X2 >= 3
    [650.0]
|- X1 is detached
  [750.0]

Now that you have a better understanding of the regression tree after implementing it from scratch, we can directly use the DecisionTreeRegressor package (https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html) from scikit-learn. Let's apply it on an example of predicting Boston house prices as follows:

>>> boston = datasets.load_boston()
>>> num_test = 10 # the last 10 samples as testing set
>>> X_train = boston.data[:-num_test, :]
>>> y_train = boston.target[:-num_test]
>>> X_test = boston.data[-num_test:, :]
>>> y_test = boston.target[-num_test:]
>>> from sklearn.tree import DecisionTreeRegressor
>>> regressor = DecisionTreeRegressor(max_depth=10, 
                                      min_samples_split=3)
>>> regressor.fit(X_train, y_train)
>>> predictions = regressor.predict(X_test)
>>> print(predictions)
[12.7 20.9 20.9 20.2 20.9 30.8
 20.73076923 24.3 28.2 20.73076923]

Compare predictions with the ground truth as follows:

>>> print(y_test)
[ 19.7  18.3 21.2  17.5 16.8 22.4  20.6 23.9 22. 11.9]

We have implemented a regression tree in this section. Is there an ensemble version of the regression tree? Let's see next.

Implementing a regression forest

In Chapter 4, Predicting Online Ad Click-Through with Tree-Based Algorithms, we explored random forests as an ensemble learning method by combining multiple decision trees that are separately trained and randomly subsampling training features in each node of a tree. In classification, a random forest makes a final decision by a majority vote of all tree decisions. Applied to regression, a random forest regression model (also called a regression forest) assigns the average of regression results from all decision trees to the final decision.

Here, we will use the regression forest package RandomForestRegressor from scikit-learn and deploy it in our Boston house price prediction example:

>>> from sklearn.ensemble import RandomForestRegressor
>>> regressor = RandomForestRegressor(n_estimators=100, 
                           max_depth=10, min_samples_split=3)
>>> regressor.fit(X_train, y_train)
>>> predictions = regressor.predict(X_test)
>>> print(predictions)
[ 19.34404351 20.93928947 21.66535354 19.99581433 20.873871
  25.52030056 21.33196685 28.34961905 27.54088571 21.32508585] 
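To see the averaging mechanism for yourself, you can optionally verify that the forest's prediction equals the mean of the predictions of its individual trees, which scikit-learn exposes through the estimators_ attribute; a minimal sketch (you should see True):

>>> import numpy as np
>>> avg_predictions = np.mean([tree.predict(X_test)
...                            for tree in regressor.estimators_], axis=0)
>>> print(np.allclose(avg_predictions, predictions))
True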

The third regression algorithm that we want to explore is support vector regression (SVR).

Estimating with support vector regression

As the name implies, SVR is part of the support vector family and a sibling of the support vector machine (SVM) for classification (or we can just call it SVC) that you learned about in Chapter 3, Recognizing Faces with Support Vector Machine.

To recap, SVC seeks an optimal hyperplane that best segregates observations from different classes. Suppose a hyperplane is determined by a slope vector w and intercept b; the optimal hyperplane is picked so that the distance (which can be expressed as 1/||w||) from the nearest points in each of the segregated spaces to the hyperplane is maximized. The optimal w and b can be learned and solved with the following optimization problem:

  • Minimizing ||w||
  • Subject to y(i)(wx(i) + b) ≥ 1, for a training set of (x(1), y(1)), (x(2), y(2)), … (x(i), y(i))…, (x(m), y(m))

In SVR, our goal is to find a decision hyperplane (defined by a slope vector w and intercept b) so that two hyperplanes wx+b=-ε (negative hyperplane) and wx+b=ε (positive hyperplane) can cover most training data. In other words, most of the data points are bounded in the ε bands of the optimal hyperplane. And at the same time, the optimal hyperplane is as flat as possible, which means w is as small as possible, as shown in the following diagram:


Figure 7.11: Finding the decision hyperplane in SVR

This translates into deriving the optimal w and b by solving the following optimization problem:

  • Minimizing ||w||
  • Subject to |y(i) − (wx(i) + b)| ≤ ε, given a training set of (x(1), y(1)), (x(2), y(2)), … (x(i), y(i))…, (x(m), y(m))

The theory behind SVR is very similar to SVM. In the next section, let's see the implementation of SVR.

Implementing SVR

Again, to solve the preceding optimization problem, we need to resort to quadratic programming techniques, which are beyond the scope of our learning journey. Therefore, we won't cover the computation methods in detail and will implement the regression algorithm using the SVR package (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html) from scikit-learn.

Important techniques used in SVM, such as the penalty parameter C as a trade-off between bias and variance, and kernels (RBF, for example) for handling linearly non-separable data, are transferable to SVR. The SVR package from scikit-learn also supports these techniques.

Let's solve the previous house price prediction problem with SVR this time:

>>> from sklearn.svm import SVR
>>> regressor = SVR(C=0.1, epsilon=0.02, kernel='linear')
>>> regressor.fit(X_train, y_train)
>>> predictions = regressor.predict(X_test)
>>> print(predictions)
[ 14.59908201 19.32323741 21.16739294 18.53822876 20.1960847
  23.74076575 22.65713954 26.98366295 25.75795682 22.69805145]

You've learned about three (or four) regression algorithms. So, how should we evaluate regression performance? Let's find out in the next section.

Evaluating regression performance

So far, we've covered three popular regression algorithms in depth and implemented them from scratch by using several prominent libraries. Instead of judging how well a model works on testing sets by printing out the prediction, we need to evaluate its performance with the following metrics, which give us better insights:

  • The MSE, as I mentioned, measures the average squared difference between the true values and the predictions. Sometimes the square root is taken on top of the MSE in order to convert the value back into the original scale of the target variable being estimated. This yields the root mean squared error (RMSE). The RMSE also has the benefit of penalizing large errors more, since the errors are squared before they are averaged.
  • The mean absolute error (MAE), on the other hand, measures the absolute loss. It uses the same scale as the target variable and gives us an idea of how close the predictions are to the actual values.

    For both the MSE and MAE, the smaller the value, the better the regression model.

  • R² (pronounced r squared) indicates the goodness of fit of a regression model. It is the fraction of the variation in the dependent variable that the regression model is able to explain. It ranges from 0 to 1, representing no fit to perfect prediction, respectively. There is a variant of R² called adjusted R², which adjusts for the number of features in the model relative to the number of data points.

Let's compute these three measurements on a linear regression model using corresponding functions from scikit-learn:

  1. We will work on the diabetes dataset again and fine-tune the parameters of the linear regression model using the grid search technique:
    >>> diabetes = datasets.load_diabetes()
    >>> num_test = 30 # the last 30 samples as testing set
    >>> X_train = diabetes.data[:-num_test, :]
    >>> y_train = diabetes.target[:-num_test]
    >>> X_test = diabetes.data[-num_test:, :]
    >>> y_test = diabetes.target[-num_test:]
    >>> param_grid = {
    ...     "alpha": [1e-07, 1e-06, 1e-05],
    ...     "penalty": [None, "l2"],
    ...     "eta0": [0.03, 0.05, 0.1],
    ...     "max_iter": [500, 1000]
    ... }
    >>> from sklearn.model_selection import GridSearchCV
    >>> regressor = SGDRegressor(loss='squared_loss', 
                                 learning_rate='constant',
                                 random_state=42)
    >>> grid_search = GridSearchCV(regressor, param_grid, cv=3)
    
  2. We obtain the optimal set of parameters:
    >>> grid_search.fit(X_train, y_train)
    >>> print(grid_search.best_params_)
    {'alpha': 1e-07, 'eta0': 0.05, 'max_iter': 500, 'penalty': None}
    >>> regressor_best = grid_search.best_estimator_
    
  3. We predict the testing set with the optimal model:
    >>> predictions = regressor_best.predict(X_test)
    
  4. We evaluate the performance on testing sets based on the MSE, MAE, and R2 metrics:
    >>> from sklearn.metrics import mean_squared_error, 
        mean_absolute_error, r2_score
    >>> mean_squared_error(y_test, predictions)
    1933.3953304460413
    >>> mean_absolute_error(y_test, predictions)
    35.48299900764652
    >>> r2_score(y_test, predictions)
    0.6247444629690868
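
    As a side note, we can also derive the RMSE and adjusted R² mentioned earlier from these results; here is a minimal sketch, where n is the number of testing samples and p is the number of features:

    >>> import numpy as np
    >>> rmse = np.sqrt(mean_squared_error(y_test, predictions))
    >>> n, p = X_test.shape   # number of testing samples and features
    >>> r2 = r2_score(y_test, predictions)
    >>> adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)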
    

Now that you've learned about three (or four, you could say) commonly used and powerful regression algorithms and performance evaluation metrics, let's utilize each of them to solve our stock price prediction problem.

Predicting stock prices with the three regression algorithms

Here are the steps to predict the stock price:

  1. Earlier, we generated features based on data from 1988 to 2019, and we will now continue with constructing the training set with data from 1988 to 2018 and the testing set with data from 2019:
    >>> data_raw = pd.read_csv('19880101_20191231.csv', index_col='Date')
    >>> data = generate_features(data_raw)
    >>> start_train = '1988-01-01'
    >>> end_train = '2018-12-31'
    >>> start_test = '2019-01-01'
    >>> end_test = '2019-12-31'
    >>> data_train = data.loc[start_train:end_train]
    >>> X_train = data_train.drop('close', axis=1).values
    >>> y_train = data_train['close'].values
    >>> print(X_train.shape)
    (7558, 37)
    >>> print(y_train.shape)
    (7558,)
    

    All fields in the dataframe data except 'close' are feature columns, and 'close' is the target column. We have 7,558 training samples, and each sample is 37-dimensional. The testing set is constructed in the same way, and it contains 251 samples:

    >>> data_test = data.loc[start_test:end_test]
    >>> X_test = data_test.drop('close', axis=1).values
    >>> y_test = data_test['close'].values
    >>> print(X_test.shape)
    (251, 37)
    
  2. We will first experiment with SGD-based linear regression. Before we train the model, you should realize that SGD-based algorithms are sensitive to data with features at very different scales; for example, in our case, the average value of the open feature is around 8,856, while that of the moving_avg_365 feature is 0.00037 or so. Hence, we need to normalize features into the same or a comparable scale. We do so by removing the mean and rescaling to unit variance:
    >>> from sklearn.preprocessing import StandardScaler
    >>> scaler = StandardScaler()
    
  3. We rescale both sets with the scaler fitted on the training set:
    >>> X_scaled_train = scaler.fit_transform(X_train)
    >>> X_scaled_test = scaler.transform(X_test)
    
  4. Now we can search for the optimal set of parameters for SGD-based linear regression. We specify l2 regularization and 1,000 iterations, and tune the regularization term multiplier, alpha, and the initial learning rate, eta0:
    >>> param_grid = {
    ...     "alpha": [1e-4, 3e-4, 1e-3],
    ...     "eta0": [0.01, 0.03, 0.1],
    ... }
    >>> lr = SGDRegressor(penalty='l2', max_iter=1000, random_state=42
    )
    >>> grid_search = GridSearchCV(lr, param_grid, cv=5, scoring='r2')
    >>> grid_search.fit(X_scaled_train, y_train)
    
  5. Select the best linear regression model and make predictions of the testing samples:
    >>> print(grid_search.best_params_)
    {'alpha': 0.0001, 'eta0': 0.03}
    >>> lr_best = grid_search.best_estimator_
    >>> predictions_lr = lr_best.predict(X_scaled_test)
    
  6. Measure the prediction performance via the MSE, MAE, and R2:
    >>> print(f'MSE: {mean_squared_error(y_test, predictions_lr):.3f}')
    MSE: 41631.128
    >>> print(f'MAE: {mean_absolute_error(y_test, predictions_lr):.3f}')
    MAE: 154.989
    >>> print(f'R^2: {r2_score(y_test, predictions_lr):.3f}')
    R^2: 0.964
    

    We achieve an R2 of 0.964 with a fine-tuned linear regression model.

  7. Similarly, let's experiment with a random forest. We specify 100 trees to ensemble, and we tune the maximum depth of each tree, max_depth; the minimum number of samples required to further split a node, min_samples_split; and the minimum number of samples required at a leaf node, min_samples_leaf, as follows:
    >>> param_grid = {
    ...     'max_depth': [30, 50],
    ...     'min_samples_split': [2, 5, 10],
    ...     'min_samples_leaf': [3, 5]
    ... }
    >>> rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, max_features='auto', random_state=42)
    >>> grid_search = GridSearchCV(rf, param_grid, cv=5, 
                                   scoring='r2', n_jobs=-1)
    >>> grid_search.fit(X_train, y_train)
    

    Note this may take a while, hence we use all available CPU cores for training.

  8. Select the best regression forest model and make predictions of the testing samples:
    >>> print(grid_search.best_params_)
    {'max_depth': 30, 'min_samples_leaf': 3, 'min_samples_split': 2}
    >>> rf_best = grid_search.best_estimator_
    >>> predictions_rf = rf_best.predict(X_test)
    
  9. Measure the prediction performance as follows:
    >>> print(f'MSE: {mean_squared_error(y_test, predictions_rf):.3f}')
    MSE: 404310.522
    >>> print(f'MAE: {mean_absolute_error(y_test, predictions_rf):.3f}')
    MAE: 419.398
    >>> print(f'R^2: {r2_score(y_test, predictions_rf):.3f}')
    R^2: 0.647
    

    An R2 of 0.647 is obtained with a tweaked forest regressor.

  10. Next, we work with SVR with linear and RBF kernels, and leave the penalty parameter C, the margin of tolerance ε (epsilon), and the RBF kernel coefficient γ (gamma) for fine-tuning. Similar to SGD-based algorithms, SVR doesn't work well on data with features at very different scales:
    >>> param_grid = [
    ...     {'kernel': ['linear'], 'C': [100, 300, 500], 
                'epsilon': [0.00003, 0.0001]},
    ...     {'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                 'C': [10, 100, 1000], 'epsilon': [0.00003, 0.0001]}
    ... ]
    
  11. Again, to work around this, we use the rescaled data to train the SVR model:
    >>> svr = SVR()
    >>> grid_search = GridSearchCV(svr, param_grid, cv=5, scoring='r2')
    >>> grid_search.fit(X_scaled_train, y_train)
    
  12. Select the best SVR model and make predictions of the testing samples:
    >>> print(grid_search.best_params_)
    {'C': 500, 'epsilon': 0.0001, 'kernel': 'linear'}
    >>> svr_best = grid_search.best_estimator_ 
    >>> predictions_svr = svr_best.predict(X_scaled_test)
    >>> print(f'MSE: {mean_squared_error(y_test, predictions_svr):.3f}')
    MSE: 29999.827
    >>> print(f'MAE: {mean_absolute_error(y_test, predictions_svr):.3f}')
    MAE: 123.566
    >>> print(f'R^2: {r2_score(y_test, predictions_svr):.3f}')
    R^2: 0.974
    

    With SVR, we're able to achieve an R2 of 0.974 on the testing set.

  13. We also plot the prediction generated by each of the three algorithms, along with the ground truth:

Figure 7.12: Predictions using the three algorithms versus the ground truth

The visualization is produced by the following code:

>>> import matplotlib.pyplot as plt
>>> plt.plot(data_test.index, y_test, c='k')
>>> plt.plot(data_test.index, predictions_lr, c='b')
>>> plt.plot(data_test.index, predictions_rf, c='r')
>>> plt.plot(data_test.index, predictions_svr, c='g')
>>> plt.xticks(range(0, 252, 10), rotation=60)
>>> plt.xlabel('Date')
>>> plt.ylabel('Close price')
>>> plt.legend(['Truth', 'Linear regression', 'Random Forest', 'SVR'])
>>> plt.show()

We've built a stock predictor using three regression algorithms individually in this section. Overall, SVR outperforms the other two algorithms.

Summary

In this chapter, we worked on the last project in this book, predicting stock (specifically stock index) prices using machine learning regression techniques. We started with a short introduction to the stock market and factors that influence trading prices. To tackle this billion-dollar problem, we investigated machine learning regression, which estimates a continuous target variable, as opposed to discrete output in classification. We followed this with an in-depth discussion of three popular regression algorithms, linear regression, regression trees and regression forests, and SVR. We covered their definitions, mechanics, and implementations from scratch with several popular frameworks, including scikit-learn and TensorFlow, along with applications on toy datasets. You also learned the metrics used to evaluate a regression model. Finally, we applied what was covered in this whole chapter to solve our stock price prediction problem.

In the next chapter, we will continue working on the stock price prediction project, but with powerful neural networks. We will see whether they can beat what we have achieved with the three regression models in this chapter.

Exercises

  1. As mentioned, can you add more signals to our stock prediction system, such as the performance of other major indexes? Does this improve prediction?
  2. Recall that I briefly mentioned several major stock indexes besides DJIA. Is it possible to improve on the DJIA price prediction model we just developed by considering the historical prices and performances of these major indexes? It's highly likely! The idea behind this is that no stock or index is isolated and that there are weak or strong influences between stocks and different financial markets. This should be intriguing to explore.
  3. Can you try to ensemble linear regression and SVR, for example, averaging the prediction, and see if you can improve the prediction?