Forecasting with neural networks

The second half of this chapter is all about neural networks. In the first part, we will build a simple neural network that forecasts only the next time step. Since the spikes in the series are very large, we will work with log-transformed page views as both input and output. We can use the short-term forecast neural network to make longer-term forecasts, too, by feeding its predictions back into the network, as sketched below.
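
To make the idea of feeding predictions back into the network concrete, here is a minimal sketch of recursive multi-step forecasting. It assumes a hypothetical trained model, model, that takes a window of log1p page views of shape (1, lookback, 1) and predicts the next value; the model we actually build later uses many more features, but the feedback loop works the same way:

import numpy as np

def forecast_recursive(model, series, horizon=14, lookback=100):
    # Start from the last `lookback` observed values, log1p-transformed
    window = list(np.log1p(series)[-lookback:])
    preds = []
    for _ in range(horizon):
        x = np.array(window[-lookback:]).reshape(1, lookback, 1)
        next_step = float(model.predict(x).ravel()[0])   # one-step-ahead forecast
        preds.append(next_step)
        window.append(next_step)                         # feed the prediction back in
    return np.expm1(np.array(preds))                     # convert back to page views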

Before we can dive in and start building forecast models, we need to do some preprocessing and feature engineering. The advantage of neural networks is that they can take in a high number of features as well as very high-dimensional data. The disadvantage is that we have to be careful about which features we feed in. Remember how we discussed look-ahead bias earlier in the chapter: including future data that would not have been available at the time of forecasting is a problem in backtesting.

Data preparation

For each series, we will assemble the following features:

  • log_view: The natural logarithm of page views. Since the logarithm of zero is undefined, we will use log1p, which is the natural logarithm of page views plus one (see the short sketch below).
  • days: One-hot encoded weekdays.
  • year_lag: The value of log_view from 365 days ago. -1 if there is no value available.
  • halfyear_lag: The value of log_view from 182 days ago. -1 if there is no value available.
  • quarter_lag: The value of log_view from 91 days ago. -1 if there is no value available.
  • page_enc: The one-hot encoded subpage.
  • agent_enc: The one-hot encoded agent.
  • acc_enc: The one-hot encoded access method.
  • year_autocorr: The autocorrelation of the series at a lag of 365 days.
  • halfyr_autocorr: The autocorrelation of the series at a lag of 182 days.
  • quarter_autocorr: The autocorrelation of the series at a lag of 91 days.
  • medians: The median of page views over the lookback period.

These features are assembled for each time series, giving our input data the shape (batch size, look-back window size, 29).
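
Since log1p and its inverse, expm1, will come up repeatedly, both when building features and when turning predictions back into page views, here is a quick illustration on a few raw page-view counts:

import numpy as np

views = np.array([0, 1, 9, 999])
log_view = np.log1p(views)          # log(x + 1) is well defined even for zero views
print(log_view)                     # [0.     0.6931 2.3026 6.9078] (rounded)
print(np.expm1(log_view))           # [  0.   1.   9. 999.] -- recovers the original counts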

Weekdays

The day of the week matters. Sundays may show different access behavior, when people are browsing from their couch, compared to Mondays, when people may be looking up things for work. So, we need to encode the weekday. A simple one-hot encoding will do the job:

import datetime
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

weekdays = [datetime.datetime.strptime(date, '%Y-%m-%d').strftime('%a')
            for date in train.columns.values[:-4]]

Firstly, we turn the date strings (such as 2017-03-02) into their weekday (Thursday); the preceding list comprehension does exactly that. We then label encode the weekdays with the following code:

day_one_hot = LabelEncoder().fit_transform(weekdays)
day_one_hot = day_one_hot.reshape(-1, 1)

We then encode the weekdays into integers, so that each weekday name maps to a distinct integer ("Mon" becomes 1, for instance; the exact mapping does not matter, since the next step one-hot encodes it). We reshape the resulting array into a rank-2 tensor with shape (array length, 1) so that the one-hot encoder knows that we have many observations, but only one feature, and not the other way around:

day_one_hot = OneHotEncoder(sparse=False).fit_transform(day_one_hot)
day_one_hot = np.expand_dims(day_one_hot,0)

Finally, we one-hot encode the days. We then add a new dimension to the tensor showing that we only have one "row" of dates. We will later repeat the array along this axis. Next, we set up encoders for the global features, starting with the agent:

agent_int = LabelEncoder().fit(train['Agent'])
agent_enc = agent_int.transform(train['Agent'])
agent_enc = agent_enc.reshape(-1, 1)
agent_one_hot = OneHotEncoder(sparse=False).fit(agent_enc)

del agent_enc

We will need the encoders for the agents later when we encode the agent of each series.

Here, we first create a LabelEncoder instance that can transform the agent name strings into integers. We then transform all of the agents into such integers in order to set up a OneHotEncoder instance that can one-hot encode the agents. To save memory, we then delete the already-encoded agents.

We do the same for subpages and access methods by running the following:

page_int = LabelEncoder().fit(train['Sub_Page'])
page_enc = page_int.transform(train['Sub_Page'])
page_enc = page_enc.reshape(-1, 1)
page_one_hot = OneHotEncoder(sparse=False).fit(page_enc)

del page_enc

acc_int = LabelEncoder().fit(train['Access'])
acc_enc = acc_int.transform(train['Access'])
acc_enc = acc_enc.reshape(-1, 1)
acc_one_hot = OneHotEncoder(sparse=False).fit(acc_enc)

del acc_enc

Now we come to the lagged features. Technically, neural networks could discover for themselves which past events are relevant for forecasting. However, this is pretty difficult because of the vanishing gradient problem, something that is covered in more detail later, in the LSTM section of this chapter. For now, let's just set up a little function that creates an array lagged by a number of days:

def lag_arr(arr, lag, fill):
    # Filler block that occupies the first `lag` steps of the shifted series
    filler = np.full((arr.shape[0],lag,1),fill)
    # Prepend the filler, then cut back to the original series length
    comb = np.concatenate((filler,arr),axis=1)
    result = comb[:,:arr.shape[1]]
    return result

This function first creates a new array, filled with the fill value (-1 in our case), that will occupy the "empty space" created by the shift. The new array has as many rows as the original array, but its series length, or width, is the number of days we want to lag by. We then attach this array to the front of our original array. Finally, we remove elements from the back of the array in order to get back to the original series length, or width.
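
To make the behavior concrete, here is a quick sanity check of lag_arr on a toy batch (one series of length five, lagged by two days):

import numpy as np

toy = np.arange(5).reshape(1, 5, 1)     # shape (batch=1, length=5, features=1)
lagged = lag_arr(toy, lag=2, fill=-1)
print(lagged[0, :, 0])                  # [-1 -1  0  1  2] -- each position now holds the value from two days earlier
print(lagged.shape)                     # (1, 5, 1), the same shape as the input
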
We want to inform our model about the amount of autocorrelation for different time intervals. To compute the autocorrelation for a single series, we shift the series by the amount of lag we want to measure the autocorrelation for. We then compute the autocorrelation:

$$R(\tau) = \frac{\sum_{t}\left(x_{t+\tau} - \bar{x}_{\tau}\right)\left(x_{t} - \bar{x}\right)}{\sqrt{\sum_{t}\left(x_{t+\tau} - \bar{x}_{\tau}\right)^{2}}\;\sqrt{\sum_{t}\left(x_{t} - \bar{x}\right)^{2}}}$$

In this formula, $\tau$ is the lag indicator, $\bar{x}$ is the mean of the unshifted part of the series, and $\bar{x}_{\tau}$ is the mean of the shifted part. We do not just use a NumPy function since there is a real possibility that the divider is zero. In this case, our function will just return 0:

def single_autocorr(series, lag):
    s1 = series[lag:]
    s2 = series[:-lag]
    ms1 = np.mean(s1)
    ms2 = np.mean(s2)
    ds1 = s1 - ms1
    ds2 = s2 - ms2
    divider = np.sqrt(np.sum(ds1 * ds1)) * np.sqrt(np.sum(ds2 * ds2))
    return np.sum(ds1 * ds2) / divider if divider != 0 else 0
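
As a quick sanity check, the autocorrelation of a synthetic series with a perfect seven-day cycle behaves as expected:

import numpy as np

t = np.arange(365 * 3)
weekly = np.sin(2 * np.pi * t / 7)      # toy series that repeats every 7 days
print(single_autocorr(weekly, 7))       # close to 1.0: the series matches itself at a 7-day lag
print(single_autocorr(weekly, 3))       # strongly negative: a 3-day shift is out of phase with the cycle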

We can use this function, which we wrote for a single series, to create a batch of autocorrelation features, as seen here:

def batc_autocorr(data,lag,series_length):
    corrs = []
    for i in range(data.shape[0]):
        # Autocorrelation of each individual series in the batch
        c = single_autocorr(data[i], lag)
        corrs.append(c)
    corr = np.array(corrs)
    corr = np.expand_dims(corr,-1)                  # add a time dimension...
    corr = np.expand_dims(corr,-1)                  # ...and a feature dimension
    corr = np.repeat(corr,series_length,axis=1)     # repeat the value over the series length
    return corr

Firstly, we calculate the autocorrelations for each series in the batch. Then we fuse the correlations together into one NumPy array. Since autocorrelations are a global feature, we need to create a new dimension for the length of the series and another new dimension to show that this is only one feature. We then repeat the autocorrelations over the entire length of the series.
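
The following quick check confirms the output shape on a toy batch of four random series of length 200:

import numpy as np

toy_batch = np.random.rand(4, 200)
ac = batc_autocorr(toy_batch, lag=7, series_length=200)
print(ac.shape)     # (4, 200, 1): one autocorrelation value per series, repeated along the time axis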

The get_batch function utilizes all of these tools in order to provide us with one batch of data, as can be seen with the following code:

def get_batch(train,start=0,lookback = 100):
    assert((start + lookback) <= (train.shape[1] - 5))        #1
    data = train.iloc[:,start:start + lookback].values        #2
    target = train.iloc[:,start + lookback].values
    target = np.log1p(target)                                 #3
    log_view = np.log1p(data)
    log_view = np.expand_dims(log_view,axis=-1)               #4
    days = day_one_hot[:,start:start + lookback]
    days = np.repeat(days,repeats=train.shape[0],axis=0)      #5
    year_lag = lag_arr(log_view,365,-1)
    halfyear_lag = lag_arr(log_view,182,-1)
    quarter_lag = lag_arr(log_view,91,-1)                     #6
    agent_enc = agent_int.transform(train['Agent'])
    agent_enc = agent_enc.reshape(-1, 1)
    agent_enc = agent_one_hot.transform(agent_enc)
    agent_enc = np.expand_dims(agent_enc,1)
    agent_enc = np.repeat(agent_enc,lookback,axis=1)          #7
    page_enc = page_int.transform(train['Sub_Page'])
    page_enc = page_enc.reshape(-1, 1)
    page_enc = page_one_hot.transform(page_enc)
    page_enc = np.expand_dims(page_enc, 1)
    page_enc = np.repeat(page_enc,lookback,axis=1)            #8
    acc_enc = acc_int.transform(train['Access'])
    acc_enc = acc_enc.reshape(-1, 1)
    acc_enc = acc_one_hot.transform(acc_enc)
    acc_enc = np.expand_dims(acc_enc,1)
    acc_enc = np.repeat(acc_enc,lookback,axis=1)              #9
    year_autocorr = batc_autocorr(data,lag=365,series_length=lookback)
    halfyr_autocorr = batc_autocorr(data,lag=182,series_length=lookback)
    quarter_autocorr = batc_autocorr(data,lag=91,series_length=lookback)    #10
    medians = np.median(data,axis=1)
    medians = np.expand_dims(medians,-1)
    medians = np.expand_dims(medians,-1)
    medians = np.repeat(medians,lookback,axis=1)              #11
    batch = np.concatenate((log_view,
                            days, 
                            year_lag, 
                            halfyear_lag, 
                            quarter_lag,
                            page_enc,
                            agent_enc,
                            acc_enc, 
                            year_autocorr, 
                            halfyr_autocorr,
                            quarter_autocorr, 
                            medians),axis=2)                  #12
    
    return batch, target

That was a lot of code, so let's take a minute to walk through it step by step in order to fully understand it (the numbers refer to the comments in the code):

  1. Ensures there is enough data to create a lookback window and a target from the given starting point.
  2. Separates the lookback window from the training data.
  3. Separates the target and then takes the logarithm of one plus the target.
  4. Takes the logarithm of one plus the lookback window values and adds a feature dimension.
  5. Gets the days from the precomputed one-hot encoding of days and repeats it for each time series in the batch.
  6. Computes the lag features for year lag, half-year lag, and quarterly lag.
  7. Encodes the global features, starting with the agent, using the previously defined encoders, and repeats the encoding over the lookback window. The next two steps play the same role for the other global features.
  8. This step repeats step 7 for the subpage.
  9. This step repeats step 7 for the access method.
  10. Calculates the year, half-year, and quarterly autocorrelation.
  11. Calculates the median for the lookback data.
  12. Fuses all these features into one batch.

Finally, we can use our get_batch function to write a generator, just like we did in Chapter 3, Utilizing Computer Vision. This generator loops over the original training set and passes a subset into the get_batch function. It then yields the batch obtained.

Note that we choose random starting points to make the most out of our data:

def generate_batches(train,batch_size = 32, lookback = 100):
    num_samples = train.shape[0]
    num_steps = train.shape[1] - 5
    while True:
        for i in range(num_samples // batch_size):
            batch_start = i * batch_size
            batch_end = batch_start + batch_size

            seq_start = np.random.randint(num_steps - lookback)
            X,y = get_batch(train.iloc[batch_start:batch_end],start=seq_start)
            yield X,y

This generator is what we will train and validate on.
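
As a minimal sketch of how the generator could be wired into training, assume a compiled Keras model named model and a held-out validation DataFrame named val with the same column layout as train (older Keras versions expose fit_generator; newer ones accept generators directly in fit):

batch_size = 128
train_gen = generate_batches(train, batch_size=batch_size)
val_gen = generate_batches(val, batch_size=batch_size)

model.fit_generator(train_gen,
                    epochs=1,
                    steps_per_epoch=train.shape[0] // batch_size,
                    validation_data=val_gen,
                    validation_steps=val.shape[0] // batch_size)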
