Establishing a training and testing regime

Even with lots of data available, we have to ask ourselves how we want to split the data between training, validation, and testing. This dataset already comes with a test set of future data, so we don't have to worry about the test set. For the validation set, however, there are two ways of splitting: a walk-forward split and a side-by-side split:

Figure: Possible testing regimes

In a walk-forward split, we train on all 145,000 series. To validate, we are going to use more recent data from all the series. In a side-by-side split, we sample a number of series for training and use the rest for validation.

Both have advantages and disadvantages. The disadvantage of walk-forward splitting is that we cannot train on the most recent observations of each series. The disadvantage of side-by-side splitting is that we cannot use all series for training.

If we have few series, but multiple data observations per series, a walk-forward split is preferable. However, if we have a lot of series, but few observations per series, then a side-by-side split is preferable.
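To make the two regimes concrete, the following minimal sketch contrasts them on a toy array; it assumes the data is stored as a 2D NumPy array of shape (num_series, num_days), and all variable names and sizes here are illustrative:

import numpy as np

# Toy stand-in for the real data: one row per series, one column per day
series = np.random.rand(1000, 550)

# Walk-forward split: every series contributes to training,
# but only its earlier observations; the most recent observations
# of every series are held out for validation
walk_train = series[:, :500]
walk_val = series[:, 500:]

# Side-by-side split: some series are used in full for training,
# while the remaining series are held out in full for validation
idx = np.random.permutation(len(series))
side_train = series[idx[:900]]
side_val = series[idx[900:]]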

Walk-forward splitting also aligns more nicely with the forecasting problem at hand: just as in the real test situation, we train on the past and predict the future. In side-by-side splitting, the model might overfit to global events in the prediction period. Imagine that Wikipedia was down for a week in the prediction period used in side-by-side splitting. This event would reduce the number of views for all the pages, and as a result the model would overfit to this global event.

We would not catch this overfitting in our validation set because the prediction period would also be affected by the global event. However, in our case, we have a very large number of time series, but only about 550 observations per series. There seem to be no global events that would have significantly impacted all the Wikipedia pages in that time period.

However, there are some global events that impacted the views of some pages, such as the Winter Olympics. Yet this is a reasonable risk to take, as the number of pages affected by such events remains small. Since we have an abundance of series and only a few observations per series, a side-by-side split is the more feasible choice in our case.

In this chapter, we're focusing on forecasting traffic for 50 days. So, we must first split the last 50 days of each series from the rest, as seen in the following code, before splitting the data into training and validation sets:

from sklearn.model_selection import train_test_split

X = data.iloc[:, :500]   # the first 500 days serve as input features
y = data.iloc[:, 500:]   # the last 50 days are the prediction targets

X_train, X_val, y_train, y_val = train_test_split(X.values, y.values, test_size=0.1, random_state=42)

When splitting, we use X.values to obtain the underlying NumPy arrays rather than DataFrames. After splitting, we are left with 130,556 series for training and 14,507 for validation.
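As a quick sanity check, we can print the resulting shapes; this is a minimal sketch, and the exact counts assume the full dataset of 145,063 series:

print(X_train.shape)  # (130556, 500)
print(X_val.shape)    # (14507, 500)
print(y_train.shape)  # (130556, 50)
print(y_val.shape)    # (14507, 50)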

In this example, we are going to use the mean absolute percentage error (MAPE) as a loss and evaluation metric. MAPE causes division-by-zero errors when the true value of y is zero. To prevent this, we'll add a small value, epsilon, to the denominator:

import numpy as np

def mape(y_true, y_pred):
    eps = 1  # added to the denominator to prevent division by zero
    err = np.mean(np.abs((y_true - y_pred) / (y_true + eps))) * 100
    return err
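To illustrate the effect of epsilon, here is a small example with arbitrarily chosen values; note how the zero entry in y_true stays finite instead of triggering a division by zero:

y_true = np.array([100., 0., 50.])
y_pred = np.array([110., 5., 45.])
print(mape(y_true, y_pred))  # ~173.2; the zero entry contributes 5 / (0 + 1)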