The second half of the chapter is all about neural networks. In the first part, we will build a simple neural network that forecasts only the next time step. Since the spikes in the series are very large, we will work with log-transformed page views as both input and output. We can use the short-term forecast neural network to make longer-term forecasts, too, by feeding its predictions back into the network.
Before we can dive in and start building forecast models, we need to do some preprocessing and feature engineering. The advantage of neural networks is that they can take in a large number of features as well as very high-dimensional data. The disadvantage is that we have to be careful about what features we input. Remember how we discussed look-ahead bias earlier in the chapter: including future data that would not have been available at the time of forecasting is a problem in backtesting.
For each series, we will assemble the following features:

- log_view: The natural logarithm of page views. Since the logarithm of zero is undefined, we will use log1p, which is the natural logarithm of page views plus one.
- days: One-hot encoded weekdays.
- year_lag: The value of log_view from 365 days ago, or -1 if there is no value available.
- halfyear_lag: The value of log_view from 182 days ago, or -1 if there is no value available.
- quarter_lag: The value of log_view from 91 days ago, or -1 if there is no value available.
- page_enc: The one-hot encoded subpage.
- agent_enc: The one-hot encoded agent.
- acc_enc: The one-hot encoded access method.
- year_autocorr: The autocorrelation of the series at a lag of 365 days.
- halfyr_autocorr: The autocorrelation of the series at a lag of 182 days.
- quarter_autocorr: The autocorrelation of the series at a lag of 91 days.
- medians: The median of page views over the lookback period.

These features are assembled for each time series, giving our input data the shape (batch size, lookback window size, 29).
The day of the week matters. Sundays may show different access behavior, when people are browsing from their couch, compared to Mondays, when people may be looking up things for work. So, we need to encode the weekday. A simple one-hot encoding will do the job:
import datetime
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

weekdays = [datetime.datetime.strptime(date, '%Y-%m-%d').strftime('%a')
            for date in train.columns.values[:-4]]
First, we turn the date strings (such as 2017-03-02) into their weekday abbreviation (such as Thu for Thursday). We can then encode the weekdays into integers with the following code:
day_one_hot = LabelEncoder().fit_transform(weekdays)
day_one_hot = day_one_hot.reshape(-1, 1)
We then encode the weekdays into integers. Note that LabelEncoder assigns integers alphabetically over the abbreviated names, so "Fri" becomes 0, "Mon" becomes 1, "Sat" becomes 2, and so on; the exact mapping does not matter, since we one-hot encode the integers next. We reshape the resulting array into a rank-2 tensor with shape (array length, 1) so that the one-hot encoder knows that we have many observations, but only one feature, and not the other way around:
import numpy as np

day_one_hot = OneHotEncoder(sparse=False).fit_transform(day_one_hot)
day_one_hot = np.expand_dims(day_one_hot, 0)
Finally, we one-hot encode the days. We then add a new dimension to the tensor showing that we only have one "row" of dates. We will later repeat the array along this axis:
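To make the shapes concrete, here is the same three-step pipeline on a short, made-up date range (a self-contained sketch; the column slicing of train is replaced by a hard-coded list of dates, and .toarray() is used so the snippet works across scikit-learn versions):

```python
import datetime
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

dates = ['2017-03-02', '2017-03-03', '2017-03-04', '2017-03-05']
weekdays = [datetime.datetime.strptime(d, '%Y-%m-%d').strftime('%a')
            for d in dates]
print(weekdays)                    # ['Thu', 'Fri', 'Sat', 'Sun']

day_int = LabelEncoder().fit_transform(weekdays)
day_int = day_int.reshape(-1, 1)   # rank 2: (observations, 1 feature)

# .toarray() converts the sparse result to dense, avoiding the
# sparse/sparse_output keyword that differs between sklearn versions
one_hot = OneHotEncoder().fit_transform(day_int).toarray()
one_hot = np.expand_dims(one_hot, 0)
print(one_hot.shape)               # (1, 4, 4): one "row" of dates,
                                   # 4 dates, 4 distinct weekdays
```
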
agent_int = LabelEncoder().fit(train['Agent'])
agent_enc = agent_int.transform(train['Agent'])
agent_enc = agent_enc.reshape(-1, 1)
agent_one_hot = OneHotEncoder(sparse=False).fit(agent_enc)

del agent_enc
We will need the encoders for the agents later when we encode the agent of each series.
Here, we first create a LabelEncoder instance that can transform the agent name strings into integers. We then transform all of the agents into integers in order to fit a OneHotEncoder instance that can one-hot encode the agents. To save memory, we then delete the already-encoded agents.
We do the same for subpages and access methods by running the following:
page_int = LabelEncoder().fit(train['Sub_Page'])
page_enc = page_int.transform(train['Sub_Page'])
page_enc = page_enc.reshape(-1, 1)
page_one_hot = OneHotEncoder(sparse=False).fit(page_enc)

del page_enc

acc_int = LabelEncoder().fit(train['Access'])
acc_enc = acc_int.transform(train['Access'])
acc_enc = acc_enc.reshape(-1, 1)
acc_one_hot = OneHotEncoder(sparse=False).fit(acc_enc)

del acc_enc
Now we come to the lagged features. Technically, neural networks could discover on their own which past events are relevant for forecasting. However, this is pretty difficult because of the vanishing gradient problem, which is covered in more detail in the LSTM section of this chapter. For now, let's just set up a little function that creates an array lagged by a number of days:
def lag_arr(arr, lag, fill):
    filler = np.full((arr.shape[0], lag, 1), fill)
    comb = np.concatenate((filler, arr), axis=1)
    result = comb[:, :arr.shape[1]]
    return result
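Before dissecting lag_arr line by line, a quick toy check shows what it produces (a self-contained sketch; the function is repeated so the snippet runs on its own):

```python
import numpy as np

def lag_arr(arr, lag, fill):
    # Prepend `lag` steps of `fill`, then trim back to the original length
    filler = np.full((arr.shape[0], lag, 1), fill)
    comb = np.concatenate((filler, arr), axis=1)
    return comb[:, :arr.shape[1]]

# One series of length 5 with a single feature
arr = np.arange(1, 6, dtype=float).reshape(1, 5, 1)
lagged = lag_arr(arr, 2, -1)
print(lagged[0, :, 0])  # [-1. -1.  1.  2.  3.]
```

Each position t of the result holds the value from position t - lag, with -1 filling the steps where no lagged value exists.
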
This function first creates a new array that will fill up the "empty space" created by the shift. The new array has as many rows as the original array, but its series length, or width, is the number of days we want to lag by. We then attach this array to the front of our original array. Finally, we trim elements off the back of the array to get back to the original series length, or width.

We also want to inform our model about the amount of autocorrelation at different time intervals. To compute the autocorrelation for a single series, we shift the series by the amount of lag we want to measure the autocorrelation for. We then compute the autocorrelation:
R(lag) = Σ (s1 − m1)(s2 − m2) / ( √Σ(s1 − m1)² × √Σ(s2 − m2)² )

In this formula, lag is the lag indicator, s1 is the series with the first lag values removed, s2 is the series with the last lag values removed, and m1 and m2 are their respective means. We do not just use a NumPy function, since there is a real possibility that the divisor is zero. In this case, our function will just return 0:
def single_autocorr(series, lag):
    s1 = series[lag:]
    s2 = series[:-lag]
    ms1 = np.mean(s1)
    ms2 = np.mean(s2)
    ds1 = s1 - ms1
    ds2 = s2 - ms2
    divider = np.sqrt(np.sum(ds1 * ds1)) * np.sqrt(np.sum(ds2 * ds2))
    return np.sum(ds1 * ds2) / divider if divider != 0 else 0
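Two quick checks make the behavior concrete: a series that repeats every 7 steps correlates perfectly with itself at lag 7, and a constant series, which would make the denominator zero, returns 0 instead of NaN (a self-contained sketch; the function is repeated so it runs on its own):

```python
import numpy as np

def single_autocorr(series, lag):
    s1 = series[lag:]
    s2 = series[:-lag]
    ds1 = s1 - np.mean(s1)
    ds2 = s2 - np.mean(s2)
    divider = np.sqrt(np.sum(ds1 * ds1)) * np.sqrt(np.sum(ds2 * ds2))
    return np.sum(ds1 * ds2) / divider if divider != 0 else 0

# A period-7 series shifted by 7 lines up with itself exactly
weekly = np.tile(np.arange(7.0), 10)
print(single_autocorr(weekly, 7))        # 1.0 (up to floating point)

# A constant series has zero variance, so we get 0 rather than NaN
print(single_autocorr(np.ones(100), 7))  # 0
```
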
We can use this function, which we wrote for a single series, to create a batch of autocorrelation features, as seen here:
def batc_autocorr(data, lag, series_length):
    corrs = []
    for i in range(data.shape[0]):
        c = single_autocorr(data[i], lag)
        corrs.append(c)
    corr = np.array(corrs)
    corr = np.expand_dims(corr, -1)
    corr = np.expand_dims(corr, -1)
    corr = np.repeat(corr, series_length, axis=1)
    return corr
Firstly, we calculate the autocorrelations for each series in the batch. Then we fuse the correlations together into one NumPy array. Since autocorrelations are a global feature, we need to create a new dimension for the length of the series and another new dimension to show that this is only one feature. We then repeat the autocorrelations over the entire length of the series.
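The shapes involved can be verified with a small self-contained sketch (both functions are repeated so the snippet runs on its own; note that the i-th series, data[i], is passed to single_autocorr):

```python
import numpy as np

def single_autocorr(series, lag):
    s1, s2 = series[lag:], series[:-lag]
    ds1, ds2 = s1 - np.mean(s1), s2 - np.mean(s2)
    divider = np.sqrt(np.sum(ds1 * ds1)) * np.sqrt(np.sum(ds2 * ds2))
    return np.sum(ds1 * ds2) / divider if divider != 0 else 0

def batc_autocorr(data, lag, series_length):
    # One scalar per series, repeated along the time axis as a feature
    corrs = [single_autocorr(data[i], lag) for i in range(data.shape[0])]
    corr = np.array(corrs)
    corr = np.expand_dims(corr, -1)
    corr = np.expand_dims(corr, -1)
    return np.repeat(corr, series_length, axis=1)

rng = np.random.default_rng(0)
data = rng.random((8, 500))            # 8 series, 500 time steps
feat = batc_autocorr(data, lag=91, series_length=100)
print(feat.shape)  # (8, 100, 1)
```

Each series contributes one constant column, repeated over the 100 time steps of the window.
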
The get_batch function utilizes all of these tools in order to provide us with one batch of data, as can be seen in the following code:
def get_batch(train, start=0, lookback=100):
    #1
    assert((start + lookback) <= (train.shape[1] - 5))

    #2
    data = train.iloc[:, start:start + lookback].values

    #3
    target = train.iloc[:, start + lookback].values
    target = np.log1p(target)

    #4
    log_view = np.log1p(data)
    log_view = np.expand_dims(log_view, axis=-1)

    #5
    days = day_one_hot[:, start:start + lookback]
    days = np.repeat(days, repeats=train.shape[0], axis=0)

    #6
    year_lag = lag_arr(log_view, 365, -1)
    halfyear_lag = lag_arr(log_view, 182, -1)
    quarter_lag = lag_arr(log_view, 91, -1)

    #7
    agent_enc = agent_int.transform(train['Agent'])
    agent_enc = agent_enc.reshape(-1, 1)
    agent_enc = agent_one_hot.transform(agent_enc)
    agent_enc = np.expand_dims(agent_enc, 1)
    agent_enc = np.repeat(agent_enc, lookback, axis=1)

    #8
    page_enc = page_int.transform(train['Sub_Page'])
    page_enc = page_enc.reshape(-1, 1)
    page_enc = page_one_hot.transform(page_enc)
    page_enc = np.expand_dims(page_enc, 1)
    page_enc = np.repeat(page_enc, lookback, axis=1)

    #9
    acc_enc = acc_int.transform(train['Access'])
    acc_enc = acc_enc.reshape(-1, 1)
    acc_enc = acc_one_hot.transform(acc_enc)
    acc_enc = np.expand_dims(acc_enc, 1)
    acc_enc = np.repeat(acc_enc, lookback, axis=1)

    #10
    year_autocorr = batc_autocorr(data, lag=365, series_length=lookback)
    halfyr_autocorr = batc_autocorr(data, lag=182, series_length=lookback)
    quarter_autocorr = batc_autocorr(data, lag=91, series_length=lookback)

    #11
    medians = np.median(data, axis=1)
    medians = np.expand_dims(medians, -1)
    medians = np.expand_dims(medians, -1)
    medians = np.repeat(medians, lookback, axis=1)

    #12
    batch = np.concatenate((log_view,
                            days,
                            year_lag,
                            halfyear_lag,
                            quarter_lag,
                            page_enc,
                            agent_enc,
                            acc_enc,
                            year_autocorr,
                            halfyr_autocorr,
                            quarter_autocorr,
                            medians), axis=2)
    return batch, target
That was a lot of code, so let's take a minute to walk through it step by step, following the numbered comments:

1. Make sure there is enough data to create a lookback window and a target after it.
2. Slice the lookback window out of the training data.
3. The target is the page view value right after the window, log-transformed with log1p.
4. Take log1p of the window and add a feature dimension, giving shape (batch size, lookback, 1).
5. Slice the matching date range out of the pre-computed one-hot weekdays and repeat it for every series in the batch.
6. Create the lagged versions of log_view for the year, half-year, and quarter lags.
7. Encode the agents with the encoders we fit earlier and repeat the encoding over the length of the window.
8. Do the same for the subpages.
9. Do the same for the access methods.
10. Compute the autocorrelation features for the year, half-year, and quarter lags.
11. Compute the median over the window and repeat it over the length of the window.
12. Concatenate all features along the last axis into one batch and return it together with the target.
Finally, we can use our get_batch function to write a generator, just like we did in Chapter 3, Utilizing Computer Vision. This generator loops over the original training set, passes a subset into the get_batch function, and then yields the batch obtained.
Note that we choose random starting points to make the most out of our data:
def generate_batches(train, batch_size=32, lookback=100):
    num_samples = train.shape[0]
    num_steps = train.shape[1] - 5
    while True:
        for i in range(num_samples // batch_size):
            batch_start = i * batch_size
            batch_end = batch_start + batch_size

            seq_start = np.random.randint(num_steps - lookback)
            X, y = get_batch(train.iloc[batch_start:batch_end], start=seq_start)
            yield X, y
This function is what we will train and validate on.