Autocorrelation

Autocorrelation is the correlation between two elements of a series separated by a given interval. Intuitively, we would assume that knowledge about the last time step helps us forecast the next one. But what about knowledge from 2 time steps ago, or from 100 time steps ago?
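To make the definition concrete, here is a minimal sketch that computes the autocorrelation at a given lag by hand. The helper name autocorr_at_lag and the synthetic weekly series are assumptions made for illustration. The estimator shown is the common sample autocorrelation, which uses one mean and one variance for the whole series.

```python
import numpy as np

def autocorr_at_lag(series, lag):
    """Sample autocorrelation of a 1-D series at a non-negative integer lag."""
    x = np.asarray(series, dtype=float)
    n = len(x)
    mean = x.mean()
    # Lag-0 autocovariance, that is, the variance of the whole series
    c0 = np.sum((x - mean) ** 2) / n
    # Autocovariance between the series and a copy shifted by `lag` steps
    ck = np.sum((x[:n - lag] - mean) * (x[lag:] - mean)) / n
    return ck / c0

# Hypothetical data: a weekly cycle plus a little noise
rng = np.random.default_rng(0)
t = np.arange(365)
views = 100 + 20 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 2, size=t.size)

for lag in (1, 7, 100):
    print(lag, round(autocorr_at_lag(views, lag), 3))
```

For a series with a weekly rhythm like this one, the value at a lag of 7 should be noticeably higher than at a lag of 1, which is the kind of structure an autocorrelation plot makes visible for every lag at once.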

pandas comes with a handy autocorrelation plotting tool, autocorrelation_plot. It plots the correlation between elements of a series at different lag times and can help us see how far back past values remain informative. To use it, we have to pass a series of data. In our case, we pass the page views of a single page, selected at random.

We can do this by running the following code:

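import matplotlib.pyplot as plt
from pandas.plotting import autocorrelation_plot

# Plot the autocorrelation of the page views stored in row 110
autocorrelation_plot(data.iloc[110])
# Use the page's subject and sub-page as the plot title
plt.title(' '.join(train.loc[110, ['Subject', 'Sub_Page']]))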

This will present us with the following diagram:


Autocorrelation of the Oh My Girl Chinese Wikipedia page

The preceding chart shows the autocorrelation of page views for the Wikipedia page of Oh My Girl, a South Korean girl group, within the Chinese Wikipedia.

You can see that shorter time intervals, between 1 and 20 days, show a higher autocorrelation than longer intervals. There are also curious spikes, such as those around 120 days and 280 days. It's possible that annual, quarterly, or monthly events lead to an increase in the frequency of visits to the Oh My Girl Wikipedia page.

We can examine the general pattern of these frequencies by drawing 1,000 of these autocorrelation plots on top of each other. To do this, we run the following code:

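import numpy as np

# Sample 1,000 random row indices (with replacement)
a = np.random.choice(data.shape[0], 1000)

# Overlay the autocorrelation plot of each sampled series
for i in a:
    autocorrelation_plot(data.iloc[i])

plt.title('1K Autocorrelations')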

This code snippet first samples 1,000 random integers between 0 and the number of series in our dataset, which in our case is around 145,000. Because np.random.choice samples with replacement by default, an index can occasionally repeat. We use these numbers as indices to randomly sample rows from our dataset and draw the autocorrelation plot for each of them, which we can see in the following graphic:


Autocorrelations for 1,000 Wikipedia pages

As you can see, autocorrelations can be quite different for different series and there is a lot of noise within the chart. There also seems to be a general trend toward higher correlations at around the 350-day mark.
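Because the overlaid curves are hard to read, it can help to condense them into a few numbers. The following is a rough sketch, not the approach used for the chart above: it computes the autocorrelation of every series at a handful of lags with pandas' Series.autocorr and reports the quartiles across series. The demo table, its size, and the chosen lags are hypothetical stand-ins for the real page view data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the page view table: rows are pages, columns are days
rng = np.random.default_rng(42)
days = np.arange(550)
yearly = np.sin(2 * np.pi * days / 365)
demo = pd.DataFrame(
    [50 + 10 * rng.random() * yearly + rng.normal(0, 3, size=days.size)
     for _ in range(200)]
)

lags = [1, 7, 30, 90, 180, 365]
# One row per series, one column per lag
acf = pd.DataFrame(
    {lag: demo.apply(lambda row: row.autocorr(lag=lag), axis=1) for lag in lags}
)

# Quartiles across series summarize the cloud of overlapping curves
summary = acf.describe().loc[["25%", "50%", "75%"]]
print(summary)
```

A summary like this makes it easier to check whether a bump, such as the one near the one-year mark, holds across many series rather than just a few.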

Therefore, it makes sense to incorporate annual lagged page views as a time-dependent feature, as well as the autocorrelation for one-year time intervals as a global feature. The same is true for quarterly and half-year lags, as these seem to have high autocorrelations, or sometimes strongly negative autocorrelations, which makes them valuable as well.
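As a minimal sketch of what these two kinds of features could look like, the code below builds a lagged copy of each series (time-dependent) and a single autocorrelation value per series (global). The function names lagged_views and autocorr_feature, the synthetic views table, and the choice of 91 days for a quarter are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the page view table: rows are pages, columns are days
rng = np.random.default_rng(0)
n_pages, n_days = 5, 800
views = pd.DataFrame(rng.poisson(100, size=(n_pages, n_days)).astype(float))

def lagged_views(frame, lag):
    """Time-dependent feature: the value `lag` days earlier (NaN where unavailable)."""
    return frame.shift(lag, axis=1)

def autocorr_feature(frame, lag):
    """Global feature: one autocorrelation value per series at the given lag."""
    return frame.apply(lambda row: row.autocorr(lag=lag), axis=1)

year_lag = lagged_views(views, 365)        # same shape as views
year_acf = autocorr_feature(views, 365)    # one value per page
quarter_acf = autocorr_feature(views, 91)  # quarter approximated as 91 days
```

The lagged table lines up each day with the value one year earlier, so it can be fed to a model alongside the raw series, while the per-page autocorrelations stay constant over time for a given page.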

Time series analysis, such as in the examples shown previously, can help us engineer features for our model. Complex neural networks could, in theory, discover all of these features by themselves. However, it is often much easier to help them a bit, especially with information about long periods of time.
