Chapter 4. Understanding Time Series

A time series is a form of data that has a temporal dimension, and it is easily the most iconic form of financial data out there. While a single stock quote is not a time series, take the quotes you get every day and line them up, and you get a much more interesting time series. Virtually all media materials related to finance sooner or later show a stock price graph; not a list of prices at a given moment, but a development of prices over time.

You'll often hear financial commentators discussing the movement of prices: "Apple Inc. is up 5%." But what does that mean? You'll hear absolute values a lot less, such as, "A share of Apple Inc. is $137.74." Again, what does that mean? This occurs because market participants are interested in how things will develop in the future and they try to extrapolate these forecasts from how things developed in the past:

Multiple time series graphs as seen on Bloomberg TV

Most forecasting involves looking at past developments over a period of time. Time series data is therefore central to forecasting; farmers, for example, look at time series datasets when forecasting crop yields. Because of this, a vast body of knowledge and tools for working with time series has developed within the fields of statistics, econometrics, and engineering.

In this chapter, we will be looking at a few classic tools that are still very much relevant today. We will then learn how neural networks can deal with time series, and how deep learning models can express uncertainty.

Before we jump into looking at time series, I need to set your expectations for this chapter. Many of you might have come to this chapter to read about stock market forecasting, but I need to warn you that this chapter is not about stock market forecasting, and neither is any other chapter in this book.

Economic theory shows that markets are somewhat efficient. The efficient market hypothesis states that all publicly available information is included in stock prices. This extends to information on how to process information, such as forecasting algorithms.

If this book were to present an algorithm that could predict prices on the stock market and deliver superior returns, many investors would simply implement this algorithm. Since those algorithms would all buy or sell in anticipation of price changes, they would change the prices in the present, thus destroying the advantage that you would gain by using the algorithm. Therefore, the algorithm presented would not work for future readers.

Instead, this chapter will use traffic data from Wikipedia. Our goal is to forecast traffic for a specific Wikipedia page. We can obtain the Wikipedia traffic data via the wikipediatrend CRAN package.

The dataset that we are going to use here is the traffic data of around 145,000 Wikipedia pages that has been provided by Google. The data can be obtained from Kaggle.

Visualization and preparation in pandas

As we saw in Chapter 2, Applying Machine Learning to Structured Data, it's usually a good idea to get an overview of the data before we start training. You can achieve this for the data we obtained from Kaggle by running the following:

import pandas as pd

# Load the Kaggle training data and replace missing values with zero
train = pd.read_csv('../input/train_1.csv').fillna(0)
train.head()

Running this code will give us the following table:

 

     Page                                       2015-07-01   2015-07-02   ...   2016-12-31
0    2NE1_zh.wikipedia.org_all-access_spider    18.0         11.0         ...   20.0
1    2PM_zh.wikipedia.org_all-access_spider     11.0         14.0         ...   20.0

The data in the Page column contains the name of the page, the language of the Wikipedia page, the type of accessing device, and the accessing agent. The other columns contain the traffic for that page on that date.

So, in the preceding table, the first row contains the page of 2NE1, a Korean pop band, on the Chinese version of Wikipedia, by all methods of access, but only for agents classified as spider traffic; that is, traffic not coming from humans. While most time series work is focused on local, time-dependent features, we can enrich all of our models by providing access to global features.

Therefore, we want to split up the page string into smaller, more useful features. We can achieve this by running the following code:

def parse_page(page):
    # The last three underscore-separated fields are the sub URL, the access
    # type, and the agent; everything before them forms the article's subject
    x = page.split('_')
    return ' '.join(x[:-3]), x[-3], x[-2], x[-1]

We split the string by underscores. The name of a page could also include an underscore, so we separate off the last three fields and then join the rest to get the subject of the article.

As we can see in the following code, the third-from-last element is the sub URL, for example, en.wikipedia.org. The second-from-last element is the access, and the last element the agent:

parse_page(train.Page[0])
Out:
('2NE1', 'zh.wikipedia.org', 'all-access', 'spider')

When we apply this function to every page entry in the training set, we obtain a list of tuples that we can then join together into a new DataFrame, as we can see in the following code:

l = list(train.Page.apply(parse_page))
df = pd.DataFrame(l)
df.columns = ['Subject','Sub_Page','Access','Agent']

Finally, we must add this new DataFrame back to our original DataFrame before removing the original page column, which we can do by running the following:

train = pd.concat([train,df],axis=1)
del train['Page']

As a result of running this code, we have successfully finished loading the dataset. This means we can now move on to exploring it.
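
If you want to double-check the result, you can take a quick look at the new global feature columns. This is just an optional sanity check:

train[['Subject','Sub_Page','Access','Agent']].head()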

Aggregate global feature statistics

After all of this hard work, we can now create some aggregate statistics on global features.

The pandas value_counts() function allows us to plot the distribution of global features easily. By running the following code, we will get a bar chart output of our Wikipedia dataset:

train.Sub_Page.value_counts().plot(kind='bar')

As a result of running the previous code, we will output a bar chart that ranks the distribution of records within our dataset:

Distribution of records by Wikipedia country page

The preceding plot shows the number of time series available for each subpage. Wikipedia has subpages for different languages, and we can see that our dataset contains pages from the English (en), Japanese (ja), German (de), French (fr), Chinese (zh), Russian (ru), and Spanish (es) Wikipedia sites.

In the bar chart we produced, you may have also noticed two sites that are not language-specific Wikipedia editions: commons.wikimedia.org, which hosts media files such as images, and www.mediawiki.org, which hosts the wiki for the MediaWiki software itself.

Let's run that command again, this time focusing on the type of access:

train.Access.value_counts().plot(kind='bar')

After running this code, we'll then see the following bar chart as the output:

Distribution of records by access type

There are two possible access methods: mobile and desktop. There's also a third option, all-access, which combines the statistics for mobile and desktop access.

We can then plot the distribution of records by agent by running the following code:

train.Agent.value_counts().plot(kind='bar')

After running that code, we'll output the following chart:

Distribution of records by agent

There are time series available not only for spider agents, but also for all other types of access. In classic statistical modeling, the next step would be to analyze the effect of each of these global features and build models around them. However, this is not necessary if there's enough data and computing power available.

If that's the case then a neural network is able to discover the effects of the global features itself and create new features based on their interactions. There are only two real considerations that need to be addressed for global features:

  • Is the distribution of features very skewed? If this is the case, then there might only be a few instances that possess a global feature, and our model might overfit on this global feature. Imagine that there were only a small number of articles from the Chinese Wikipedia in the dataset. The model might then rely too heavily on this feature and overfit the few Chinese entries. Our distribution is relatively even, so we do not have to worry about this.
  • Can features be easily encoded? Some global features cannot be one-hot encoded. Imagine that we were given the full text of a Wikipedia article with the time series. It would not be possible to use this feature straight away, as some heavy preprocessing would have to be done in order to use it. In our case, there are a few relatively straightforward categories that can be one-hot encoded; see the sketch after this list. The subject names, however, cannot be one-hot encoded since there are too many of them.
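
As a quick illustration, the following is a minimal sketch of how the three categorical global features could be one-hot encoded with pandas, using the column names of the DataFrame we built above:

# A minimal sketch: one-hot encode the categorical global features.
# Sub_Page, Access, and Agent have only a handful of categories each,
# while Subject has far too many unique values for this approach.
global_features = pd.get_dummies(train[['Sub_Page','Access','Agent']])
global_features.head()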

Examining the sample time series

In addition to the global features of our dataset, we have to look at a few sample time series in order to get an understanding of the challenges that we may face. In this section, we will plot the views for the English language page of Twenty One Pilots, a musical duo from the USA.

We will plot the actual page views together with a 10-day rolling mean. We can do this by running the following code:

import numpy as np
import matplotlib.pyplot as plt

idx = 39457      # Row of the Twenty One Pilots article
window = 10      # Window size of the rolling mean

# Separate the page views and the page name from the global features
data = train.iloc[idx,0:-4]
name = train.iloc[idx,-4]
days = [r for r in range(data.shape[0])]

fig, ax = plt.subplots(figsize=(10, 7))

plt.ylabel('Views per Page')
plt.xlabel('Day')
plt.title(name)

# Plot the raw page views in grey
ax.plot(days, data.values, color='grey')

# Plot the rolling mean in black, computed with a convolution
ax.plot(np.convolve(data, 
                    np.ones((window,))/window, 
                    mode='valid'), color='black')

ax.set_yscale('log')

There is a lot going on in this code snippet, and it is worth going through it step by step. Firstly, we define which row we want to plot. The Twenty One Pilots article is row 39,457 in the training dataset. From there, we then define the window size for the rolling mean.

We separate the page view data and the name from the overall dataset by using the pandas iloc indexer, which allows us to select data by row and column coordinates. Counting the days rather than displaying all the dates of the measurements makes the plot easier to read, so we also create a day counter for the X axis.

Next, we set up the plot and make sure it has the desired size by setting figsize. We also define the axis labels and the title. Next, we plot the actual page views. Our X coordinates are the days, and the Y coordinates are the page views.

To compute the mean, we are going to use a convolve operation, which you might be familiar with as we explored convolutions in Chapter 3, Utilizing Computer Vision. This convolve operation uses a vector of ones divided by the window size, in this case 10. The convolve operation slides this vector over the page views, multiplies each group of 10 page views by 1/10, and then sums the resulting vector up. This creates a rolling mean with a window size of 10. We plot this mean in black. Finally, we specify that we want to use a log scale for the Y axis:

Access statistics for the Twenty One Pilots Wikipedia page with a rolling mean

You can see there are some pretty large spikes in the Twenty One Pilots graph we just generated, even though we used a logarithmic axis. On some days, views skyrocket to 10 times what they were just days before. Because of that, it quickly becomes clear that a good model will have to be able to deal with such extreme spikes.

Before we move on, it's worth pointing out that it's also clearly visible that there are global trends, as the page views generally increase over time.
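
As an aside, the same rolling mean can also be computed with pandas' built-in rolling window functionality rather than a convolution. The following is a minimal sketch that reuses the data and window variables defined above:

# A minimal sketch: the same rolling mean using pandas instead of np.convolve.
# Unlike mode='valid', rolling() keeps the original length and fills the first
# window - 1 positions with NaN.
views = pd.Series(data.values.astype(float))
rolling_mean = views.rolling(window).mean()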

For good measure, let's plot the interest in Twenty One Pilots for all languages. We can do this by running the following code:

fig, ax = plt.subplots(figsize=(10, 7))
plt.ylabel('Views per Page')
plt.xlabel('Day')
plt.title('Twenty One Pilots Popularity')
ax.set_yscale('log')

for country in ['de','en','es','fr','ru']:
    idx = np.where((train['Subject'] == 'Twenty One Pilots') 
                   & (train['Sub_Page'] == '{}.wikipedia.org'.format(country)) 
                   & (train['Access'] == 'all-access') 
                   & (train['Agent'] == 'all-agents'))
    
    # np.where returns an array wrapped in a tuple; extract the integer index
    idx = idx[0][0]
    
    data = train.iloc[idx,0:-4]
    ax.plot(days, data.values, label=country)

ax.legend()

In this snippet, we first set up the graph, as before. We then loop over the language codes and find the index of Twenty One Pilots. The index is an array wrapped in a tuple, so we have to extract the integer specifying the actual index. We then extract the page view data from the training dataset and plot the page views.

In the following chart, we can view the output of the code that we've just produced:

Access statistics for Twenty One Pilots by country

There is clearly some correlation between the time series. The English language version of Wikipedia (the top line) is, not surprisingly, by far the most popular. We can also see that the time series in our datasets are clearly not stationary; they change means and standard deviations over time.

A stationary process is one whose unconditional joint probability distribution stays constant over time. In other words, things such as the series mean or standard deviation should stay constant.

However, as you can see between days 200 and 250 in the preceding graph, the mean views on the page change dramatically. This undermines some of the assumptions that many classic modeling approaches make. Since financial time series are hardly ever stationary, it is worthwhile dealing with these problems; by addressing them, we become familiar with several useful tools that can help us handle nonstationarity.
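
One common, if rough, way to check for stationarity is an augmented Dickey-Fuller test. The following is a minimal sketch, assuming the statsmodels package is installed, applied to the page view series we extracted earlier:

from statsmodels.tsa.stattools import adfuller

# A minimal sketch: augmented Dickey-Fuller test on the page view series.
# A large p-value means we cannot reject the null hypothesis that the series
# has a unit root, that is, that it is nonstationary.
test_statistic, p_value, *_ = adfuller(data.values.astype(float))
print('ADF statistic: {:.2f}, p-value: {:.3f}'.format(test_statistic, p_value))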
