Median forecasting

Medians are a good sanity check and an often underrated forecasting tool. A median is the value that separates the higher half of a distribution from the lower half; it sits exactly in the middle of the distribution. Medians smooth out noise, are less susceptible to outliers than means, and are easy to compute.
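
To make the robustness claim concrete, here is a minimal sketch comparing the mean and the median of a short series containing one extreme spike. The view counts are made up for illustration and are not taken from the dataset used in this chapter.

```python
import numpy as np

# Hypothetical daily view counts with a single extreme spike (illustrative only)
views = np.array([10, 12, 11, 13, 12, 11, 5000])

mean_views = views.mean()        # pulled far upward by the spike
median_views = np.median(views)  # stays at a typical day's value

print(f"mean:   {mean_views:.1f}")
print(f"median: {median_views:.1f}")
```

The single spike drags the mean into the hundreds, while the median stays at the level of an ordinary day.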

To make a forecast, we compute the median over a look-back window in our training data. In this case, we use a window size of 50, but you could experiment with other values. For each series, we select the last 50 values from our X values and take their median as the forecast.
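
As a small illustration of the windowing step, the sketch below applies the same slicing to a toy array. The array `X_toy` and the window size of 4 are assumptions chosen so the result can be checked by hand; they stand in for the real training matrix.

```python
import numpy as np

# Toy stand-in for the training matrix: 3 series (rows), 8 time steps (columns)
X_toy = np.array([
    [1, 2, 3, 4, 5, 6, 7, 8],
    [5, 5, 5, 5, 9, 9, 9, 9],
    [0, 0, 100, 0, 0, 2, 4, 6],
])

lookback = 4                      # hypothetical window size for this toy example
window = X_toy[:, -lookback:]     # last 4 columns of every row
forecast = np.median(window, axis=1)  # one median per series, as a flat array

print(window.shape)
print(forecast)
```

Note that the spike of 100 in the third series falls outside the window and therefore has no influence on that series' forecast.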

Note that in the NumPy median function, we have to set keepdims=True. This ensures that we keep a two-dimensional matrix with one row per series and a single column, rather than a flat array, which is important when computing the error because each median then broadcasts against all target values of its own series. So, to make a forecast, we need to run the following code:

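lookback = 50  # number of most recent time steps to use

# Last `lookback` values of every series
lb_data = X_train[:, -lookback:]

# One median per series; keepdims=True keeps the shape (n_series, 1)
med = np.median(lb_data, axis=1, keepdims=True)

err = mape(y_train, med)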
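
To see why keepdims matters, the following self-contained sketch runs the same steps on synthetic data. The `mape` function here is an assumed definition (mean absolute percentage error with a small constant in the denominator to avoid division by zero) and may differ from the one used elsewhere in this book; `X_toy` and `y_toy` are random stand-ins, so the resulting error says nothing about the real dataset.

```python
import numpy as np

def mape(y_true, y_pred, eps=1.0):
    """Mean absolute percentage error in percent (assumed definition)."""
    return np.mean(np.abs((y_true - y_pred) / (y_true + eps))) * 100

rng = np.random.default_rng(0)
X_toy = rng.poisson(lam=20, size=(4, 60)).astype(float)  # 4 series, 60 past steps
y_toy = rng.poisson(lam=20, size=(4, 10)).astype(float)  # 10 future steps each

lookback = 50
med_2d = np.median(X_toy[:, -lookback:], axis=1, keepdims=True)  # shape (4, 1)
med_1d = np.median(X_toy[:, -lookback:], axis=1)                 # shape (4,)

# (4, 10) against (4, 1): each series' median is compared with its own targets
err = mape(y_toy, med_2d)

# (4, 10) against (4,): the trailing dimensions 10 and 4 cannot be broadcast
try:
    mape(y_toy, med_1d)
    flat_ok = True
except ValueError:
    flat_ok = False

print(med_2d.shape, med_1d.shape, flat_ok)
```

In this sketch the flat array raises an error. If the number of series happened to equal the forecast horizon, the shapes would broadcast without complaint and silently compare medians with the wrong series, which is another reason to keep the two-dimensional shape.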

Evaluated on our training data, this gives an error of about 68.1%, which is not bad given the simplicity of our method. To see how the medians work, let's plot the X values, the true y values, and the predictions for a random page:

idx = 15000  # index of the page to plot

fig, ax = plt.subplots(figsize=(10, 7))

# History, true future values, and the median forecast repeated 50 times
ax.plot(np.arange(500), X_train[idx], label='X')
ax.plot(np.arange(500, 550), y_train[idx], label='True')
ax.plot(np.arange(500, 550), np.repeat(med[idx], 50), label='Forecast')

plt.title(' '.join(train.loc[idx, ['Subject', 'Sub_Page']]))
ax.legend()
ax.set_yscale('log')

As you can see, our plot consists of three lines. For each line, we must specify the X and Y values. For X_train, the X values run from 0 to 499, and for y_train and the forecast they run from 500 to 549. We then select the series we want to plot from our training data. Since we have only one median value per series, we repeat the median forecast of the desired series 50 times in order to draw our forecast.

The output can be seen here:

Median forecast and actual values for accesses to an image file. The True values are on the right-hand side of the plot, and the median forecast is the horizontal line running through the middle of them.

As you can see in the preceding plot, the data for this page, in this case an image of American actor Eric Stoltz, is very noisy, and the median cuts through all the noise. The median is especially useful for pages like this one that are visited infrequently and show no clear trend or pattern.

This is not all you can do with medians. Beyond what we've just covered, you could, for example, use different medians for weekends or use a median of medians from multiple look-back periods. A simple tool, such as median forecasting, is able to deliver good results with smart feature engineering. Therefore, it makes sense to spend a bit of time on implementing median forecasting as a baseline and performing a sanity check before using more advanced methods.
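
As one possible reading of the "median of medians" idea, the sketch below computes a median for several look-back windows and then takes the median of those values. The function name `median_of_medians` and the window lengths are hypothetical choices for illustration, not values from this chapter.

```python
import numpy as np

def median_of_medians(X, lookbacks=(7, 14, 28, 56)):
    """Median across per-window medians; returns shape (n_series, 1).

    The window lengths are illustrative assumptions.
    """
    # One (n_series, 1) column of medians per look-back window
    meds = [np.median(X[:, -lb:], axis=1, keepdims=True) for lb in lookbacks]
    # Stack the columns side by side and take the median across windows
    return np.median(np.hstack(meds), axis=1, keepdims=True)

# Toy data: a steadily rising series and a constant one
X_toy = np.vstack([
    np.arange(1, 61, dtype=float),
    np.full(60, 10.0),
])

forecast = median_of_medians(X_toy)
print(forecast)
```

For the rising series, the short windows give higher medians than the long ones, and the combined forecast lands between them; for the constant series, every window agrees.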
