Chapter 11. Time Series and Causality

 

"An economist is an expert who will know tomorrow why the things he predicted yesterday didn't happen today."

 
 --Laurence J. Peter

A univariate time series consists of measurements collected over a regular interval of time, which could be by the minute, hour, day, week, or month. What makes time series more problematic than other data is that the order of the observations matters. This dependency on order can cause standard analysis methods to produce an unnecessarily high bias or variance.

There seems to be a paucity of literature on machine learning with time series data. This is unfortunate, as so much real-world data involves a time component. Furthermore, time series analysis can be quite complicated and tricky. I would say that if you haven't seen a time series analysis done incorrectly, you haven't been looking closely enough.

Another aspect of time series that is often neglected is causality. Yes, we don't want to confuse correlation with causation, but in time series analysis, one can apply the technique of Granger causality to determine whether causality, in the statistical sense, exists.

In this chapter, we will apply time series/econometric techniques to identify univariate forecast models, bivariate regression models, and finally, Granger causality. After completing the chapter, you may not be a complete master of time series analysis, but you will know enough to perform an effective analysis and understand the fundamental issues to consider when building time series models and creating predictive models (forecasts).

Univariate time series analysis

We will focus on two methods to analyze and forecast a single time series: exponential smoothing and Autoregressive Integrated Moving Average (ARIMA) models. We will start by looking at exponential smoothing models.

Exponential smoothing models apply weights to past observations, as a moving average model does, but unlike a moving average, the more recent the observation, the more weight it receives relative to the older ones. There are three possible smoothing parameters to estimate: an overall (level) smoothing parameter, a trend smoothing parameter, and a seasonal smoothing parameter. If no trend or seasonality is present, the corresponding parameters are dropped.

The smoothing parameter produces a forecast with the following equation:

At = αYt + (1 – α)At-1

In this equation, Yt is the observed value at time t, At is the smoothed value, and alpha (α) is the smoothing parameter. Algorithms optimize alpha (and any other parameters) by minimizing an error measure, for example, the sum of squared errors (SSE) or mean squared error (MSE).

The forecast equation along with trend and seasonality equations, if applicable, will be as follows:

  • The forecast, where A is the preceding smoothing equation and h is the number of forecast periods: Yt+h = At + hBt + St+h-m
  • The trend, where beta (β) is the trend smoothing parameter: Bt = β(At – At-1) + (1 – β)Bt-1
  • The seasonality, where m is the number of seasonal periods and gamma (γ) is the seasonal smoothing parameter: St = γ(Yt – At) + (1 – γ)St-m

This formulation is referred to as the Holt-Winters method. The forecast equation is additive in nature with a linear trend. The method also allows the inclusion of a damped trend and multiplicative seasonality, where the seasonality proportionally increases or decreases over time. It has been my experience that the Holt-Winters method provides the best forecasts, even better than the ARIMA models. I have come to this conclusion after having to update long-term forecasts for hundreds of time series based on monthly data, and in roughly 90 percent of the cases, Holt-Winters produced the minimal forecast error. Additionally, you don't have to worry about the assumption of stationarity as in an ARIMA model. Stationarity means that the time series has a constant mean and variance and that the correlation between time periods depends only on the lag between them. Having said this, it is still important to understand the ARIMA models, as there will be situations where they have the best performance.
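To see these equations in action, here is a minimal sketch using the base HoltWinters() function and the forecast package on R's built-in AirPassengers data, which is not part of this chapter's exercise; the three estimated smoothing parameters correspond to the level, trend, and seasonal equations shown previously:

> library(forecast)

> # fit Holt-Winters with multiplicative seasonality to the built-in AirPassengers data
> hw.fit = HoltWinters(AirPassengers, seasonal = "multiplicative")

> # the estimated level, trend, and seasonal smoothing parameters
> hw.fit$alpha; hw.fit$beta; hw.fit$gamma

> # forecast the next 24 months and plot the result
> plot(forecast(hw.fit, h = 24))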

Starting with the autoregressive model, the value of Y at time t is a linear function of the prior values of Y. The formula for an autoregressive lag-1 model, AR(1), is Yt = constant + ΦYt-1 + Et. The critical assumptions of the model are as follows:

  • Et represents errors that are identically and independently distributed with mean zero and constant variance
  • The errors are independent of Yt
  • Yt, Yt-1, …, Yt-n is stationary, which means that the absolute value of Φ is less than one

With a stationary time series, you can examine the Autocorrelation Function (ACF). The ACF of a stationary series gives the correlations between Yt and Yt-h for h = 1, 2, …, n. Let's use R to create an AR(1) series and plot it:

> set.seed(123)

> ar1 = arima.sim(list(order=c(1,0,0), ar=0.5), n=200)

> plot(ar1)

The following is the output of the preceding command:

[Plot: the simulated AR(1) time series]

Now, let's examine the ACF:

> acf(ar1)

The output of the preceding command is as follows:

[Plot: ACF of the simulated AR(1) series]

The ACF plot shows the correlations exponentially decreasing as the lag increases. The dotted blue lines mark the threshold for a significant correlation; spikes extending beyond them are statistically significant. In addition to the ACF, one should also examine the Partial Autocorrelation Function (PACF). The PACF is a conditional correlation, which means that the correlation between Yt and Yt-h is conditional on the observations that come between the two. One way to intuitively understand this is to think of a linear regression model and its coefficients. Let's assume that you have Y = B0 + B1X1 versus Y = B0 + B1X1 + B2X2. The relationship of X1 to Y in the first model is linear with a coefficient, but in the second model, that coefficient will be different because the relationship between Y and X2 is now accounted for as well. Note that in the following PACF plot, the partial autocorrelation value at lag-1 is identical to the autocorrelation value at lag-1, as that is not a conditional correlation.

> pacf(ar1)

The following is the output of the preceding command:

[Plot: PACF of the simulated AR(1) series]

We will assume that the series is stationary, and the preceding time series plot supports this. We'll look at a couple of statistical tests in the practical exercise to ensure that the data is stationary, but most of the time, the eyeball test is sufficient. If the data is not stationary, then it is possible to de-trend the data by taking its differences. This is the Integrated (I) in ARIMA. After differencing, the new series becomes ΔYt = Yt – Yt-1. In most cases, a first-order difference achieves stationarity, but on some occasions, a second-order difference may be necessary. An ARIMA model with AR(1) and I(1) would be annotated as (1,1,0).
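If differencing is needed, R makes it easy. Here is a minimal sketch using the base diff() function and the forecast package's ndiffs() function, applied to the ar1 series simulated earlier; ndiffs() is simply one convenient way to estimate the required order of differencing:

> library(forecast)

> # estimate the number of differences needed to achieve stationarity;
> # for the stationary ar1 series simulated above, this should come back as zero
> ndiffs(ar1)

> # if differencing were needed, diff() creates the new series Yt - Yt-1
> plot(diff(ar1))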

The MA stands for moving average. This is not a simple moving average, such as the 50-day moving average of a stock price, but rather a coefficient that is applied to the lagged errors. The errors are, of course, identically and independently distributed with mean zero and constant variance. The formula for an MA(1) model is Yt = constant + Et + ΘEt-1. As we did with the AR(1) model, we can build an MA(1) in R, as follows:

> set.seed(123)

> ma.sim = arima.sim(list(order=c(0,0,1), ma=-0.5), n=200)

> plot(ma.sim)

The following is the output of the preceding command:

[Plot: the simulated MA(1) time series]

The ACF and PACF plots are a bit different from those of the AR(1) model. Note that there are some rules of thumb for reading the plots in order to determine whether the model has AR and/or MA terms. They can be a bit subjective, so I will leave it to you to learn these heuristics and trust R to identify the proper model. In the following plots, we will see a significant correlation at lag-1 and two significant partial correlations at lag-1 and lag-2:

> acf(ma.sim)

The output of the preceding command is as follows:

[Plot: ACF of the simulated MA(1) series]

The preceding figure is the ACF plot, and now, we will see the PACF plot:

> pacf(ma.sim)
[Plot: PACF of the simulated MA(1) series]

With ARIMA models, it is possible to incorporate seasonality through seasonal autoregressive, integrated, and moving average terms. The nonseasonal ARIMA model notation is commonly (p,d,q). With seasonal ARIMA, assuming the data is monthly, the notation becomes (p,d,q) x (P,D,Q)12. In the packages that we will use, R will automatically identify whether seasonality should be included and, if so, the optimal seasonal terms as well.
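To see this seasonal notation in practice, here is a minimal sketch with the forecast package's auto.arima() function applied to R's built-in monthly AirPassengers data (used here only for illustration); the function searches over both the nonseasonal and seasonal terms:

> library(forecast)

> # auto.arima() selects the (p,d,q) and, if warranted, the (P,D,Q) terms
> arima.fit = auto.arima(AirPassengers)

> # the selected model is reported in (p,d,q)(P,D,Q)[12] form
> summary(arima.fit)

> # forecast the next 24 months
> plot(forecast(arima.fit, h = 24))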

Bivariate regression

Having covered regression way back in Chapter 2, Linear Regression – The Blocking and Tackling of Machine Learning, we can skip many of the basics. However, when doing regression with time series, it is important to understand how the regression may be spurious and/or missing vital information. With time series regression, it is necessary to understand how lagged variables can contribute to the model. Take advertising expenditure and product sales, with the data available on a weekly basis. In many cases, you will need to model the lagged values of an advertising campaign when predicting sales, as it may take time for the campaign to become effective and its impact can last beyond its termination.

In R, you can manually code the lagged values, trends, and seasonality; however, the dynlm package for dynamic linear regression offers tremendous flexibility and ease in doing this type of analysis. In the practical exercise, we will put the package through its paces. To examine the problem of spurious regression, we need to test the assumption of no serial correlation, that is, no autocorrelation of the residuals. Examining the ACF plot of the residuals along with statistical tests will address the question. If autocorrelation exists, you run into the problem of spurious regression: the beta coefficients are no longer the best estimates because important information is being ignored, and the statistical tests on these coefficients are no longer valid, which means that you may overfit your model because some predictor variables will appear important when they actually are not.
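As a preview of the syntax, here is a minimal sketch of a dynamic regression with dynlm; sales.ts and ad.ts are hypothetical weekly time series objects used purely for illustration:

> library(dynlm)

> # sales.ts and ad.ts are hypothetical weekly ts objects;
> # L(ad.ts, 0:3) includes current advertising and its first three lags
> ad.fit = dynlm(sales.ts ~ L(ad.ts, 0:3) + trend(sales.ts))

> summary(ad.fit)

> # check the residuals for serial correlation
> acf(residuals(ad.fit))

> Box.test(residuals(ad.fit), type = "Ljung-Box")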

One method to deal with serial correlation is to fit a regression model with ARIMA errors. A simplified notation for this type of model is Yt = β0 + β1Xt + Nt, where Nt is an ARIMA process for the errors and Et denotes the remaining errors of that ARIMA process, which are uncorrelated and referred to as white noise.

This type of regression can be implemented in R using functions in the forecast or orcutt packages. The interpretation of these methods can get quite complicated and challenging to explain to business partners. Not to fear: in general, if you find the right lag structure, you can forgo incorporating the ARIMA errors in your regression model. We will explore this in detail in the practical exercise.
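For reference, here is a minimal sketch of regression with ARIMA errors using the forecast package; y.ts and x.ts are hypothetical time series objects, and the xreg argument supplies the regressor:

> library(forecast)

> # y.ts and x.ts are hypothetical time series objects;
> # auto.arima() fits the regression and selects an ARIMA model for the errors, Nt
> reg.fit = auto.arima(y.ts, xreg = x.ts)

> summary(reg.fit)

> # the remaining errors, Et, should now resemble white noise
> Box.test(residuals(reg.fit), type = "Ljung-Box")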

Granger causality

With two sets of time series data, x and y, Granger causality is a method that attempts to determine whether one series is likely to influence a change in the other. This is done by taking different lags of one series and using them to model the change in the second series. To accomplish this, we will create two models that predict y: one with only the past values of y (the restricted model) and the other with the past values of both y and x (the full model, π). The models are as follows, where k is the number of lags in the time series:

  • The restricted model: yt = β0 + β1yt-1 + … + βkyt-k + et
  • The full model: yt = β0 + β1yt-1 + … + βkyt-k + α1xt-1 + … + αkxt-k + et

The residual sums of squares (RSS) of the two models are then compared, and an F-test is used to determine whether the nested (restricted) model is adequate to explain the future values of y or whether the full model (π) is better. The F-test evaluates the following null and alternative hypotheses:

  • H0: αi = 0 for each i ∈ [1, k], no Granger causality
  • H1: αi ≠ 0 for at least one i ∈ [1, k], Granger causality

Essentially, we are trying to determine whether we can say that, statistically, x provides more information about the future values of y than the past values of y alone. In this definition, it is clear that we are not trying to prove actual causation, only that the two series are related by some phenomenon. Along these lines, we must also run the model in reverse in order to verify that y does not provide information about the future values of x. If we find that it does, it is likely that there is some exogenous variable, say Z, that needs to be controlled for or that would possibly be a better candidate for Granger causation. To avoid spurious results, the method should be applied to stationary time series. Note that research papers are available that discuss techniques for non-stationary series as well as nonlinear models, but that is outside the scope of this book. There is an excellent introductory paper that revolves around the age-old conundrum of the chicken and the egg (Thurman, 1988).
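The F-test described above is implemented directly in the grangertest() function of the lmtest package, although we will use the vars package in the practical exercise. Here is a minimal sketch with hypothetical stationary series x and y and an arbitrarily chosen lag order of k = 3:

> library(lmtest)

> # x and y are hypothetical stationary time series; order is the number of lags, k
> grangertest(y ~ x, order = 3)   # does x Granger-cause y?

> grangertest(x ~ y, order = 3)   # check the reverse direction as well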

There are a couple of different ways to identify the proper lag structure. Naturally, one can use brute force and ignorance to test all the reasonable lags one at a time. One may have a rational intuition based on domain expertise, or perhaps prior research exists to guide the lag selection. If not, then Vector Autoregression (VAR) can be applied to identify the lag structure with the lowest information criterion, such as Akaike's Information Criterion (AIC) or the Final Prediction Error (FPE). For simplicity, here is the notation for a VAR model with two variables, incorporating only one lag for each variable. This notation can be extended for as many variables and lags as are appropriate.

  • yt = c1 + Φ11yt-1 + Φ12xt-1 + e1t
  • xt = c2 + Φ21yt-1 + Φ22xt-1 + e2t

In R, this process is quite simple to implement as we will see in the following practical problem.
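Before that, here is a minimal sketch of what the process looks like with the vars package; two.series is a hypothetical two-column time series object with columns named y and x:

> library(vars)

> # two.series is a hypothetical ts object with columns "y" and "x"
> lag.info = VARselect(two.series, lag.max = 12, type = "const")

> # lag order chosen by AIC, HQ, SC, and FPE
> lag.info$selection

> # fit the VAR with the AIC-selected lag and test for Granger causality
> var.fit = VAR(two.series, p = lag.info$selection["AIC(n)"])

> causality(var.fit, cause = "x")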

Business understanding

 

"The planet isn't going anywhere. We are! We're goin' away."

 
 --Philosopher and comedian, George Carlin

Climate change is happening. It always has and always will, but the big question, at least from a political and economic standpoint, is whether climate change is man-made. Even Pope Francis and the Vatican have weighed in on the controversy, casting aspersions on man-made climate change deniers. Not one to shy away from a political donnybrook, I will use this chapter to put econometric time series modeling to the test to try and determine whether man-made carbon emissions cause, statistically speaking, climate change, in particular, rising temperatures. Personally, I would like to take a neutral stance on the issue, always keeping in mind the tenets that Mr. Carlin left for us in his teachings on the subject.

The first order of business is to find and gather the data. For temperature, we will use the HadCRUT4 annual time series. This data is compiled by a cooperative effort of the Climatic Research Unit of the University of East Anglia and the Hadley Centre at the UK's Meteorological Office. A full discussion of how the data is compiled and modeled is available at http://www.metoffice.gov.uk/hadobs/index.html.

The data that we will use is provided as an annual anomaly, which is calculated as the difference of the median annual surface temperature for a given year versus the average of the reference years (1961-1990). The annual surface temperature is an ensemble of the temperatures collected globally and blended from the CRUTEM4 surface air temperature and HadSST3 sea-surface datasets. Recently, this data has come under attack as biased and unreliable: http://www.telegraph.co.uk/comment/11561629/Top-scientists-start-to-examine-fiddled-global-warming-figures.html. This is way outside of our scope of effort here, so we must utilize this data as it is.

To read this data in R, which is in a fixed-width format, we will use the read.fwf() function. The first thing to do that helps in the web-scraping process is to specify an object with the appropriate URL as follows:

> url1 = "http://www.metoffice.gov.uk/hadobs/hadcrut4/data/current/time_series/HadCRUT.4.4.0.0.annual_ns_avg.txt"

From this URL, we only want the year and first column, which is the annual anomaly. To do this, we will specify in the function the width of the items that we want. The simplest way to know these widths is to take a few rows of the data, paste it in a text editor, and do some counting. Having done this, we will put in widths as 4 for the year, 3 for the space, and 6 for the anomaly:

> temp = read.fwf(url1, widths=c(4,3,6),sep="")

> head(temp)
    V1     V2
1 1850 -0.376
2 1851 -0.222
3 1852 -0.225
4 1853 -0.270
5 1854 -0.247
6 1855 -0.270

> tail(temp)
     V1    V2
161 2010 0.555
162 2011 0.421
163 2012 0.467
164 2013 0.492
165 2014 0.564
166 2015 0.670

With this data frame created, we can name the columns properly:

> names(temp)=c("Year","Temperature")

Furthermore, this data needs to be converted to a time series object with the ts() function. We will also need to specify the frequency of the data, for example, 1 is annual, 4 is quarterly, 12 is monthly, and the start and end year:

> T = ts(temp$Temperature, frequency=1, start=c(1850), end=c(2015))

Global CO2 emission estimates can be found at the Carbon Dioxide Information Analysis Center (CDIAC) of the US Department of Energy at http://cdiac.ornl.gov/. We will download the data on total emissions from fossil fuel combustion and cement manufacture. We could use the read.fwf() function for this data as well, but let's look at a different method, where we will put the URL in a read.csv() function:

> url2 = "http://cdiac.ornl.gov/ftp/ndp030/CSV-FILES/global.1751_2011.csv"
> co2 = read.csv(file=url2,skip=2,header=FALSE, col.names=c("Year","Total","3","4","5","6","7","8"))

What we did here was specify that we wanted to skip the first two rows of commentary. By stating header=FALSE, we prevent the first row of data from becoming our column names, which we supply instead with the col.names argument.

Looking at the structure, we want to keep only the first two columns (the Year and co2 emissions). Putting this together, we can then look at the first six and last six observations as a double check:

> str(co2)
'data.frame':   261 obs. of  8 variables:
 $ Year : int  1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 ...
 $ Total: int  3 3 3 3 3 3 3 3 3 3 ...
 $ X3   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ X4   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ X5   : int  3 3 3 3 3 3 3 3 3 3 ...
 $ X6   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ X7   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ X8   : num  NA NA NA NA NA NA NA NA NA NA ...

> co2 = co2[,1:2]

> head(co2)
    Year Total
1 1751     3
2 1752     3
3 1753     3
4 1754     3
5 1755     3
6 1756     3
> tail(co2)
    Year Total
256 2006  8363
257 2007  8532
258 2008  8740
259 2009  8700
260 2010  9140
261 2011  9449

Finally, we will put this in a time series:

> E = ts(co2$Total, frequency=1, start=c(1751),end=c(2011))

With our data downloaded and put in time series structures, we can now begin to understand and further prepare it, prior to analysis.

Data understanding and preparation

Only four packages are required for this effort, so ensure that they are installed on your system. Now, let's get them loaded:

> library(dynlm)

> library(forecast)

> library(tseries)

> library(vars)

As always, we will want to plot the data, and we will do so over a reasonable timeframe. Let's look at the data starting shortly after the end of World War I. The CO2 emissions data ends in 2011, so we will truncate the temperature data to match:

> SurfaceTemp = window(T, start=c(1920), end=c(2011))

> Emissions = window(E, start=c(1920), end=c(2011))

Now, combine both the time series to one object and plot it:

> climate = cbind(SurfaceTemp, Emissions)

> plot(climate, main="Temp Anomalies and CO2 Emissions")

The output of the preceding command is as follows:

[Plot: temperature anomalies and CO2 emissions time series]

This plot shows that the temperature anomalies started to increase from the norm roughly around 1970. Emissions seem to begin a slow uptick in the mid-1940s, with a possible acceleration in the trend around 2000. There do not appear to be any cyclical patterns or obvious outliers. The variation over time appears constant. Using the standard procedure, we can see that the two series are highly correlated, as follows:

> cor(climate)
                SurfaceTemp Emissions
SurfaceTemp   1.0000000 0.8264994
Emissions     0.8264994 1.0000000

As discussed earlier, this is nothing to jump for joy about, as it proves absolutely nothing. We will look for structure by plotting the ACF and PACF for both series in the same plot area. You can partition your plot using the par() function and specifying the rows and columns; in our case, a 2 x 2, as follows:

> par(mfrow=c(2,2))

Calling ACF and PACF for each series will automatically populate the plot area:

> acf(climate[,1], main="Temp")

> pacf(climate[,1], main="Temp")

> acf(climate[,2], main="CO2")

> pacf(climate[,2], main="CO2")

The output of the preceding code snippet is as follows:

[Plot: ACF and PACF of the temperature and CO2 series]

With ACF patterns that decay gradually and PACF patterns that drop off rapidly, we can assume that these series are both autoregressive. For temperature, there is a slight spike at lag-4 and lag-5, so there may be an MA term as well. Next, let's have a look at the Cross Correlation Function (CCF). Note that we put our x before our y in the function:

> ccf(climate[,1],climate[,2], main="CCF")
[Plot: cross-correlation function (CCF) of temperature and CO2 emissions]

CCF shows us the correlation between the temperature and lags of CO2. For instance, the correlation between the temperature and CO2 at lag-5 is just over 0.6. If the negative lags of the x variable have a high correlation, we can say that x leads y. If the positive lags of x have a high correlation, we say that x lags y. Here, we can see that CO2 is both a leading and lagging variable. For our analysis, it is encouraging that we see the former, but odd for the latter. We will see during the VAR and Granger causality analysis if this will matter or not.

Additionally, to a calibrated eye, the data is not stationary. We can prove this with the Augmented Dickey-Fuller (ADF) test available in the tseries package, using the adf.test() function, as follows:

> adf.test(climate[,1])

Augmented Dickey-Fuller Test

data:  climate[, 1]
Dickey-Fuller = -1.8429, Lag order = 4, p-value = 0.641
alternative hypothesis: stationary

> adf.test(climate[,2])

Augmented Dickey-Fuller Test

data:  climate[, 2]
Dickey-Fuller = -1.0424, Lag order = 4, p-value = 0.9269
alternative hypothesis: stationary

You can see that, in both cases, the p-values are not significant, so we fail to reject the null hypothesis of the test, which is that the data is not stationary.

Having explored the data, let's begin the modeling process by applying univariate techniques to the temperature anomalies.
