Time series are omnipresent in both industry and research. We can find examples of time series in commerce, tech, healthcare, energy, finance, and so on. We are mostly interested in the last one, as the time dimension is inherent to trading and many financial/economic indicators. However, pretty much every business generates some sort of time series, for example, its profits collected over time or any other measured KPI. That is why the techniques we cover in the following two chapters can be used for any time series analysis task you might encounter in your line of work.
Time series modeling and forecasting can be approached from different angles, the two most popular being statistical methods and machine learning approaches. We will also cover some examples of using deep learning for time series forecasting in Chapter 15, Deep Learning in Finance.
In the past, when we did not have vast computing power at our disposal and time series were not that granular (as data was not collected everywhere and all the time), statistical approaches dominated the domain. Recently, the situation has changed, and ML-based approaches are taking the lead when it comes to time series models running in production. However, that does not mean that the classical statistical approaches are no longer relevant—far from it. They can still produce state-of-the-art results when we have very little training data (for example, 3 years of monthly data) and the ML models simply cannot learn the patterns from it. Also, statistical approaches have won quite a few of the most recent M-Competitions (the biggest time series forecasting competitions, started by Spyros Makridakis).
In this chapter, we introduce the basics of time series modeling. We start by explaining the building blocks of time series and how to separate them using decomposition methods. Later, we cover the concept of stationarity—why it is important, how to test for it, and ultimately, how to achieve it if the original series is not stationary.
Then, we look at two of the most widely used statistical approaches to time series modeling—exponential smoothing methods and ARIMA class models. In both cases, we show you how to fit the models, evaluate their goodness of fit, and forecast the future values of the time series.
We cover the following recipes in this chapter:
One of the goals of time series decomposition is to increase our understanding of the data by breaking the series down into multiple components. It provides insight into the modeling complexity and suggests which approaches to follow in order to accurately capture each of the components.
An example can shed more light on the possibilities. We can imagine a time series with a clear trend, either increasing or decreasing. On one hand, we could use the decomposition to extract the trend component and remove it from our time series before modeling the remaining series. This could help with making the time series stationary (please refer to the following recipe for more details). Then, we can always add it back after the rest of the components have been accounted for. On the other hand, we could provide enough data or adequate features for our algorithm to model the trend itself.
The components of time series can be divided into two types: systematic and non-systematic. The systematic ones are characterized by consistency and the fact that they can be described and modeled. By contrast, the non-systematic ones cannot be modeled directly.
The following are the systematic components:

- Level—the mean value in the series
- Trend—the long-term increase or decrease in the series
- Seasonality—a repeating, periodic pattern connected to the calendar (daily, weekly, yearly, and so on)

The following is the non-systematic component:

- Noise—the random variation in the series
The classical approach to time series decomposition is usually carried out using one of two types of models: additive and multiplicative.
An additive model can be described by the following characteristics:

- The model's form: Y(t) = level + trend + seasonality + noise
- Changes over time are consistently made by the same amount (a linear model)
- The trend is a straight line, and the seasonality has the same frequency (width) and amplitude (height) of cycles over time

A multiplicative model can be described by the following characteristics:

- The model's form: Y(t) = level * trend * seasonality * noise
- Changes over time are made by varying, proportional amounts (a non-linear model)
- The trend is a curved line, and the seasonality can have a varying frequency and/or amplitude of cycles over time
To make things more interesting, we can find time series with combinations of additive and multiplicative characteristics, for example, a series with additive trend and multiplicative seasonality.
Please refer to the following figure for a visualization of the possible combinations. While real-world problems are rarely that simple (noisy data with varying patterns), these abstract models offer a simple framework that we can use to analyze our time series before attempting to model or forecast it.
Figure 6.1: Additive and multiplicative variants of trend and seasonality
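To build intuition about these patterns, we can simulate the two variants on synthetic data. The following is only a sketch—the series, coefficients, and seed are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 120
time = np.arange(n)

# Building blocks of the simulated series
trend = 0.05 * time                            # linear trend component
seasonality = np.sin(2 * np.pi * time / 12)    # 12-period seasonal component
noise = rng.normal(scale=0.1, size=n)          # random component

# Additive model: components are summed, so the seasonal
# swings keep a constant amplitude over time
additive = 10 + trend + seasonality + noise

# Multiplicative model: components are multiplied, so the seasonal
# swings grow together with the level/trend
multiplicative = (
    10
    * (1 + 0.005 * time)
    * (1 + 0.1 * np.sin(2 * np.pi * time / 12))
    * (1 + 0.01 * noise)
)
```

Plotting `additive` and `multiplicative` side by side reproduces the qualitative difference shown in Figure 6.1: constant-amplitude cycles in the former, widening cycles in the latter.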
It can be the case that we do not want to (or simply cannot, due to some models’ assumptions) work with a multiplicative model. One possible solution is to transform the multiplicative model into an additive one using logarithmic transformation:
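Concretely, writing the multiplicative model as $Y_t = T_t \times S_t \times \varepsilon_t$, taking the natural logarithm turns the products into sums:

```latex
\log Y_t = \log T_t + \log S_t + \log \varepsilon_t
```

The logged series can then be decomposed (and modeled) additively, provided all the original values are positive.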
In this recipe, we will present how to carry out time series decomposition of monthly US unemployment rates downloaded from the Nasdaq Data Link.
Execute the following steps to carry out the time series decomposition:
import pandas as pd
import nasdaqdatalink
import seaborn as sns
from statsmodels.tsa.seasonal import seasonal_decompose
nasdaqdatalink.ApiConfig.api_key = "YOUR_KEY_HERE"
df = (
    nasdaqdatalink.get(dataset="FRED/UNRATENSA",
                       start_date="2010-01-01",
                       end_date="2019-12-31")
    .rename(columns={"Value": "unemp_rate"})
)
In Figure 6.2, we can see some clear seasonal patterns in the time series. We did not include more recent data in this analysis, as the COVID-19 pandemic caused quite abrupt changes in any patterns observable in the unemployment rate time series. We do not show the code used for generating the plot, as it is very similar to the one used in Chapter 3, Visualizing Financial Time Series.
Figure 6.2: Seasonal plot of the US unemployment rate in the years 2010 to 2019
WINDOW_SIZE = 12
df["rolling_mean"] = df["unemp_rate"].rolling(window=WINDOW_SIZE).mean()
df["rolling_std"] = df["unemp_rate"].rolling(window=WINDOW_SIZE).std()
df.plot(title="Unemployment rate")
Running the snippet generates the following plot:
Figure 6.3: The US unemployment rate together with the rolling average and standard deviation
From the analysis of Figure 6.3, we can infer that the trend and seasonal components seem to have a linear pattern. Therefore, we will use additive decomposition in the next step.
decomposition_results = seasonal_decompose(df["unemp_rate"],
                                           model="additive")
(
    decomposition_results
    .plot()
    .suptitle("Additive Decomposition")
)
Running the snippet generates the following plot:
Figure 6.4: The seasonal decomposition of the US unemployment rate (using an additive model)
In the decomposition plot, we can see the extracted component series: trend, seasonal, and random (residual). To evaluate whether the decomposition makes sense, we can look at the random component. If there is no discernible pattern (in other words, the random component is indeed random and behaves consistently over time), then the fit makes sense. In this case, it looks like the variance in the residuals is slightly higher in the first half of the dataset. This can indicate that a constant seasonal pattern is not good enough to accurately capture the seasonal component of the analyzed time series.
After downloading the data in Step 2, we used the rolling method of a pandas DataFrame to calculate the rolling statistics. We specified a window size of 12 months, as we are working with monthly data.
We used the seasonal_decompose function from the statsmodels library to carry out the classical decomposition. When doing so, we indicated which kind of model we would like to use—the possible values are additive and multiplicative.
When using seasonal_decompose with an array of numbers, we must specify the frequency of the observations (the period argument, called freq in older versions of statsmodels) unless we are working with a pandas Series object. If we have missing values or want to extrapolate the trend (and thus obtain residuals) for the missing periods at the beginning and the end of the time series, we can pass an extra argument, extrapolate_trend='freq'.
The seasonal decomposition we have used in this recipe is the most basic approach. It comes with a few disadvantages:

- Because a centered moving average is used to estimate the trend, the trend line (and thus the residuals) is missing at the beginning and the end of the series
- The seasonal component is assumed to be constant over the entire series
- The method is not robust to outliers

Over time, a few alternative approaches to time series decomposition were introduced. In this section, we will also cover seasonal and trend decomposition using LOESS (STL decomposition), which is implemented in the statsmodels library.
LOESS stands for locally estimated scatterplot smoothing and it is a method of estimating non-linear relationships.
We will not go into the details of how STL decomposition works; however, it makes sense to be familiar with its advantages over the other approaches:

- The seasonal component is allowed to change over time
- We have more control over the smoothness of the trend and the rate of change of the seasonal component
- It is more robust to outliers, so occasional extreme observations do not distort the trend and seasonal estimates
Naturally, it is not a silver-bullet solution and comes with some drawbacks of its own. For example, STL can only be used with additive decomposition and it does not automatically account for trading days/calendar variations.
There is a recent variant of STL decomposition that can handle multiple seasonalities. For example, a time series of hourly data can exhibit daily/weekly/monthly seasonalities. The approach is called Multiple Seasonal-Trend Decomposition using LOESS (MSTL) and you can find the reference to it in the See also section.
We can carry out the STL decomposition with the following snippet:
from statsmodels.tsa.seasonal import STL
stl_decomposition = STL(df["unemp_rate"]).fit()
stl_decomposition.plot().suptitle("STL Decomposition")
Running the code generates the following plot:
Figure 6.5: The STL decomposition of the US unemployment time series
We can see that the decomposition plots of the STL and classical decompositions are very similar. However, there are some nuances in Figure 6.5, connected to the benefits of STL decomposition over the classical one. First, there are no missing values in the trend estimate. Second, the seasonal component is slowly changing over time. You can see it clearly when looking at, for example, the values for January across the years.
The default value of the seasonal argument in STL is set to 7, but the authors of the approach suggest using larger values (which must be odd integers greater than or equal to 7). Under the hood, the value of that parameter indicates the number of consecutive years used in estimating each value of the seasonal component. The larger the chosen value, the smoother the seasonal component becomes. This, in turn, causes fewer of the variations observed in the time series to be attributed to the seasonal component. The interpretation is similar for the trend argument, though it represents the number of consecutive observations used for estimating the trend component.
We have also mentioned that one of the benefits of STL decomposition is its higher robustness to outliers. We can use the robust argument to switch on a data-dependent weighting function. It re-weights the observations when estimating the LOESS, which becomes LOWESS (locally weighted scatterplot smoothing) in such a scenario. When using robust estimation, the model can tolerate larger errors, which are visible on the residual component's plot.
In Figure 6.6 you can see a comparison of fitting two STL decompositions to the US unemployment data—with and without the robust estimation. For the code used to generate the figure, please refer to the notebook in the book’s GitHub repository.
Figure 6.6: The effect of using robust estimation in the STL decomposition process
We can clearly observe the effects of using robust estimation—larger errors are tolerated and the shape of the seasonal component is different over the first few years of the analyzed time series. There is no clear answer as to whether the robust or non-robust approach is better in this case; it all depends on what we want to use the decomposition for. Seasonal decomposition methods presented in this recipe can also serve as simple outlier detection algorithms. For example, we could decompose the series, extract the residuals, and flag observations as outliers when their residuals fall outside 3 times the interquartile range (IQR). The kats library provides an implementation of such an algorithm in its OutlierDetector class.
Other available approaches to seasonal decomposition include:

- The Hodrick–Prescott filter, implemented in the hpfilter function of statsmodels

Useful references on time series decomposition:
One of the most important concepts in time series analysis is stationarity. Plainly speaking, a stationary time series is a series whose properties do not depend on the time at which the series is observed. In other words, stationarity implies that the statistical properties of the data-generating process (DGP) of a certain time series do not change over time.
Hence, we should not be able to see any trend or seasonal patterns in a stationary time series, as their existence violates the stationarity assumptions. On the other hand, a white noise process is stationary, as it does not matter when we observe it; it will always look pretty much the same at any point in time.
A time series without trend and seasonality but with cyclic behavior can still be stationary because the cycles are not of a fixed length. So unless we explicitly observe a time series, we cannot be sure where the peaks and troughs of the cycles will be located.
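We can see this on simulated data. The quick sketch below (sample size and seed are arbitrary) shows that the statistical properties of a white noise process barely change between sub-periods:

```python
import numpy as np

rng = np.random.default_rng(7)
white_noise = rng.normal(loc=0, scale=1, size=1_000)

# A white noise process "looks the same" whenever we observe it:
# mean and standard deviation are roughly constant across sub-periods
first_half, second_half = white_noise[:500], white_noise[500:]
print(first_half.mean(), second_half.mean())
print(first_half.std(), second_half.std())
```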
To put it more formally, there are multiple definitions of stationarity, some stricter in terms of the assumptions than others. For practical use cases, we can work with the one called weak stationarity (or covariance stationarity). For a time series to be classified as (covariance) stationary, it must satisfy the following three conditions:

- The mean of the series is constant over time
- The variance of the series is finite and constant over time
- The covariance between two observations depends only on the lag between them, not on the point in time at which it is measured
Stationarity is a desired characteristic of time series as it makes modeling and extrapolating (forecasting) into the future more feasible. That is because a stationary series is easier to predict than a non-stationary one, as its statistical properties will be the same in the future as they have been in the past.
Some drawbacks of non-stationary data are:
While stationarity is a desired trait of a time series, this is not applicable to all statistical models. We would want our time series to be stationary when modeling the series using some kind of auto-regressive model (AR, ARMA, ARIMA, and so on). However, there are also models that do not benefit from stationary time series, for example, those that depend heavily on time series decomposition (exponential smoothing methods or Facebook’s Prophet).
In this recipe, we will show you how to test the time series for stationarity. To do so, we employ the following methods:

- The Augmented Dickey–Fuller (ADF) test
- The Kwiatkowski–Phillips–Schmidt–Shin (KPSS) test
- Plots of the autocorrelation function (ACF) and the partial autocorrelation function (PACF)
We will investigate the stationarity of monthly unemployment rates in the years 2010 to 2019.
We will use the same data that we used in the Time series decomposition recipe. In the plot presenting the rolling mean and standard deviation of the unemployment rates (Figure 6.3), we have already seen a negative trend over time, suggesting non-stationarity.
Execute the following steps to test if the time series of monthly US unemployment rates is stationary:
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import adfuller, kpss
def adf_test(x):
    indices = ["Test Statistic", "p-value",
               "# of Lags Used", "# of Observations Used"]
    adf_test = adfuller(x, autolag="AIC")
    results = pd.Series(adf_test[0:4], index=indices)
    for key, value in adf_test[4].items():
        results[f"Critical Value ({key})"] = value
    return results
Having defined the function, we can run the test:
adf_test(df["unemp_rate"])
Running the snippet generates the following summary:
Test Statistic -2.053411
p-value 0.263656
# of Lags Used 12.000000
# of Observations Used 107.000000
Critical Value (1%) -3.492996
Critical Value (5%) -2.888955
Critical Value (10%) -2.581393
The null hypothesis of the ADF test states that the time series is not stationary. With a p-value of 0.26 (or, equivalently, a test statistic greater than the critical value for the selected confidence level), we have no reason to reject the null hypothesis, and we conclude that the series is not stationary.
def kpss_test(x, h0_type="c"):
    indices = ["Test Statistic", "p-value", "# of Lags"]
    kpss_test = kpss(x, regression=h0_type)
    results = pd.Series(kpss_test[0:3], index=indices)
    for key, value in kpss_test[3].items():
        results[f"Critical Value ({key})"] = value
    return results
Having defined the function, we can run the test:
kpss_test(df["unemp_rate"])
Running the snippet generates the following summary:
Test Statistic 1.799224
p-value 0.010000
# of Lags 6.000000
Critical Value (10%) 0.347000
Critical Value (5%) 0.463000
Critical Value (2.5%) 0.574000
Critical Value (1%) 0.739000
The null hypothesis of the KPSS test states that the time series is stationary. With a p-value of 0.01 (or a test statistic greater than the selected critical value), we have reason to reject the null hypothesis in favor of the alternative one, indicating that the series is not stationary.
N_LAGS = 40
SIGNIFICANCE_LEVEL = 0.05
fig, ax = plt.subplots(2, 1)
plot_acf(df["unemp_rate"], ax=ax[0], lags=N_LAGS,
         alpha=SIGNIFICANCE_LEVEL)
plot_pacf(df["unemp_rate"], ax=ax[1], lags=N_LAGS,
          alpha=SIGNIFICANCE_LEVEL)
Running the snippet generates the following plots:
Figure 6.7: Autocorrelation and Partial Autocorrelation plots of the unemployment rate
In the ACF plot, we can see that there are significant autocorrelations (above the 95% confidence interval, corresponding to the selected 5% significance level). There are also some significant partial autocorrelations at lags 1 and 4 in the PACF plot.
In Step 2, we defined a function used for running the ADF test and printing out the results. We specified autolag="AIC" while calling the adfuller function, so the number of considered lags is automatically selected based on the Akaike Information Criterion (AIC). Alternatively, we could select the number of lags manually.
For the kpss function (Step 3), we specified the regression argument. A value of "c" corresponds to the null hypothesis stating that the series is level-stationary, while "ct" corresponds to trend-stationarity (removing the trend from the series would make it level-stationary).
For all the tests and the autocorrelation plots, we selected a significance level of 5%, which indicates the probability of rejecting the null hypothesis (H0) when it is, in fact, true.
In this recipe, we have used the statsmodels library to carry out the stationarity tests. However, we had to wrap its functionalities in custom functions to produce a nicely presented summary. Alternatively, we can use the stationarity tests from the arch library (we will cover the library in more depth when we explore the GARCH models in Chapter 9, Modeling Volatility with GARCH Class Models).
We can carry out the ADF test using the following snippet:
from arch.unitroot import ADF
adf = ADF(df["unemp_rate"])
print(adf.summary().as_text())
Which returns a nicely formatted output containing all the relevant information:
Augmented Dickey-Fuller Results
=====================================
Test Statistic -2.053
P-value 0.264
Lags 12
-------------------------------------
Trend: Constant
Critical Values: -3.49 (1%), -2.89 (5%), -2.58 (10%)
Null Hypothesis: The process contains a unit root.
Alternative Hypothesis: The process is weakly stationary.
The arch library also contains more stationarity tests, including:

- The Phillips–Perron (PP) test (not available in statsmodels)
- The Zivot–Andrews test (also available in statsmodels)
)A potential drawback of the ADF and KPSS tests is that they do not allow for the possibility of a structural break, that is, an abrupt change in the mean or other parameters of the data-generating process. The Zivot-Andrews test allows for the possibility of a single structural break in the series, with an unknown time of its occurrence.
We can run the test using the following snippet:
from arch.unitroot import ZivotAndrews
za = ZivotAndrews(df["unemp_rate"])
print(za.summary().as_text())
Which generates the summary:
Zivot-Andrews Results
=====================================
Test Statistic -2.551
P-value 0.982
Lags 12
-------------------------------------
Trend: Constant
Critical Values: -5.28 (1%), -4.81 (5%), -4.57 (10%)
Null Hypothesis: The process contains a unit root with a single structural break.
Alternative Hypothesis: The process is trend and break stationary.
Based on the test’s p-value, we cannot reject the null hypothesis stating that the process is not stationary.
For more information on the additional stationarity tests, please refer to:
In the previous recipe, we learned how to investigate if a given time series is stationary. In this one, we will investigate how to make a non-stationary time series stationary by using one (or multiple) of the following transformations:

- Deflation (adjusting for inflation using the Consumer Price Index)
- Applying the natural logarithm
- Differencing
For this exercise, we will use monthly gold prices from the years 2000 to 2010. We have chosen this sample on purpose, as over that period the price of gold exhibits a consistently increasing trend—the series is definitely not stationary.
Execute the following steps to transform the series from non-stationary to stationary:
import pandas as pd
import numpy as np
import nasdaqdatalink
import cpi
from datetime import date
from chapter_6_utils import test_autocorrelation
nasdaqdatalink.ApiConfig.api_key = "YOUR_KEY_HERE"
In this recipe, we will be using the test_autocorrelation helper function, which combines the components we covered in the previous recipe: the ADF and KPSS tests, together with the ACF/PACF plots.
df = (
    nasdaqdatalink.get(dataset="WGC/GOLD_MONAVG_USD",
                       start_date="2000-01-01",
                       end_date="2010-12-31")
    .rename(columns={"Value": "price"})
    .resample("M")
    .last()
)
We can use the test_autocorrelation helper function to test if the series is stationary. We have done so in the notebook (available on GitHub), and the time series of monthly gold prices is indeed not stationary.
DEFL_DATE = date(2010, 12, 31)

df["dt_index"] = pd.to_datetime(df.index)
df["price_deflated"] = df.apply(
    lambda x: cpi.inflate(x["price"], x["dt_index"], DEFL_DATE),
    axis=1
)

(
    df.loc[:, ["price", "price_deflated"]]
    .plot(title="Gold Price (deflated)")
)
Running the snippet generates the following plot:
Figure 6.8: Monthly gold prices and the deflated time series
We could also adjust the gold prices to another point in time, as long as it is the same point for the entire series.
WINDOW = 12
selected_columns = ["price_log", "rolling_mean_log",
                    "rolling_std_log"]

df["price_log"] = np.log(df.price_deflated)
df["rolling_mean_log"] = df.price_log.rolling(WINDOW).mean()
df["rolling_std_log"] = df.price_log.rolling(WINDOW).std()

(
    df[selected_columns]
    .plot(title="Gold Price (deflated + logged)",
          subplots=True)
)
Running the snippet generates the following plot:
Figure 6.9: Time series after applying the deflation and natural logarithm, together with its rolling statistics
From the preceding plot, we can see that the log transformation did its job, that is, it made the exponential trend linear.
Use the test_autocorrelation helper function (defined for this chapter) to investigate if the series became stationary:
fig = test_autocorrelation(df["price_log"])
Running the snippet generates the following plot:
Figure 6.10: The ACF and PACF plots of the transformed time series
We also print the results of the statistical tests:
ADF test statistic: 1.04 (p-val: 0.99)
KPSS test statistic: 1.93 (p-val: 0.01)
After inspecting the results of the statistical tests and the ACF/PACF plots, we can conclude that deflation and the natural logarithm were not enough to make the time series of monthly gold prices stationary.
selected_columns = ["price_log_diff", "roll_mean_log_diff",
                    "roll_std_log_diff"]

df["price_log_diff"] = df.price_log.diff(1)
df["roll_mean_log_diff"] = df.price_log_diff.rolling(WINDOW).mean()
df["roll_std_log_diff"] = df.price_log_diff.rolling(WINDOW).std()

df[selected_columns].plot(title="Gold Price (deflated + log + diff)")
Running the snippet generates the following plot:

Figure 6.11: The transformed gold prices, together with their rolling statistics

The transformed gold prices give the impression of being stationary—the series oscillates around 0 with no visible trend and approximately constant variance.
fig = test_autocorrelation(df["price_log_diff"].dropna())
Running the snippet generates the following plot:
Figure 6.12: The ACF and PACF plots of the transformed time series
We also print the results of the statistical tests:
ADF test statistic: -10.87 (p-val: 0.00)
KPSS test statistic: 0.30 (p-val: 0.10)
After applying the first differences, the series became stationary at the 5% significance level (according to both tests). In the ACF/PACF plots, we can see that there were a few significant values of the function at lags 11, 22, and 39. This might indicate some kind of seasonality or simply be a false signal. Using a 5% significance level means that 5% of the values might lie outside the 95% confidence interval—even when the underlying process does not show any autocorrelation or partial autocorrelation.
After importing the libraries, authenticating, and potentially updating the CPI data, we downloaded the monthly gold prices from Nasdaq Data Link. There were some duplicate values in the series. For example, there were entries for 2000-04-28 and 2000-04-30, both with the same value. To deal with this issue, we resampled the data to monthly frequency by taking the last available value.
By doing so, we only removed potential duplicates in each month, without changing any of the actual values.

In Step 3, we used the cpi library to deflate the time series by accounting for inflation in the US dollar. The library relies on the CPI-U index recommended by the Bureau of Labor Statistics. To make it work, we created an artificial index column containing dates as objects of the datetime.date class. The inflate function takes the following arguments:

- value—the dollar value we want to adjust
- year_or_month—the date that the dollar value comes from
- to—optionally, the date we want to adjust to. If we don't provide this argument, the function will adjust to the most recent year.

In Step 4, we applied the natural logarithm (np.log) to all the values to transform what looked like an exponential trend into a linear one. This operation was applied to prices that had already been corrected for inflation.
As the last transformation, we used the diff method of a pandas DataFrame to calculate the difference between the values at time t and time t-1 (the default setting corresponds to the first difference). We can specify a different lag by changing the periods argument.
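For example, on a tiny illustrative series:

```python
import pandas as pd

s = pd.Series([10.0, 12.0, 15.0, 11.0])

# First difference: x_t - x_{t-1}; the first value has no predecessor
print(s.diff(1).tolist())

# The `periods` argument controls the lag, e.g. periods=2 -> x_t - x_{t-2}
print(s.diff(periods=2).tolist())
```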
The considered gold prices do not contain obvious seasonality. However, if the dataset shows seasonal patterns, there are a few potential solutions:

- Adjustment by differencing—instead of the first differences, we can use a higher-order difference corresponding to the length of the seasonal cycle, for example, diff(12) for monthly data.
- Adjustment by modeling—we can extract the seasonal component using the seasonal_decompose function or another, more advanced automatic decomposition algorithm. In this case, we should subtract the seasonal component when using the additive model or divide by it if the model is multiplicative. Another solution would be to use np.polyfit() to fit the best polynomial of a chosen order to the selected time series and then subtract it from the original series.

The Box-Cox transformation is another type of adjustment we can use on time series data. It combines different exponential transformation functions to make the distribution more similar to the Normal (Gaussian) distribution. We can use the boxcox function from the scipy library, which allows us to automatically find the value of the lambda parameter for the best fit. One condition to be aware of is that all the values in the series must be positive, so the transformation should not be used after calculating the first differences or any other transformation that potentially introduces negative values into the series.
A library called pmdarima (more on this library can be found in the following recipes) contains two functions that employ statistical tests to determine how many times we should difference a series in order to achieve stationarity (and also to remove seasonality, that is, to achieve seasonal stationarity).
We can employ the following tests to investigate stationarity—ADF, KPSS, and Phillips–Perron:
from pmdarima.arima import ndiffs, nsdiffs
print(f"Suggested # of differences (ADF): {ndiffs(df['price'], test='adf')}")
print(f"Suggested # of differences (KPSS): {ndiffs(df['price'], test='kpss')}")
print(f"Suggested # of differences (PP): {ndiffs(df['price'], test='pp')}")
Running the snippet returns the following:
Suggested # of differences (ADF): 1
Suggested # of differences (KPSS): 2
Suggested # of differences (PP): 1
For the KPSS test, we can also specify the type of null hypothesis we want to test against. The default is level stationarity (null="level"). The results of the tests—or, more precisely, the indicated need for differencing—suggest that the series is not stationary without any differencing.
The library also contains two tests for seasonal differences:

- The Osborn, Chui, Smith, and Birchenhall (OCSB) test
- The Canova–Hansen (CH) test
To run them, we also need to specify the frequency of our data. In our case, it is 12, as we are working with monthly data:
print(f"Suggested # of differences (OCSB): "
      f"{nsdiffs(df['price'], m=12, test='ocsb')}")
print(f"Suggested # of differences (CH): "
      f"{nsdiffs(df['price'], m=12, test='ch')}")
The output is as follows:
Suggested # of differences (OCSB): 0
Suggested # of differences (CH): 0
The results suggest no seasonality in gold prices.
Exponential smoothing methods are one of the two families of classical forecasting models. Their underlying idea is that forecasts are simply weighted averages of past observations. When calculating those averages, more emphasis is put on recent observations. To achieve that, the weights are decaying exponentially with time. These models are suitable for non-stationary data, that is, data with a trend and/or seasonality. Smoothing methods are popular because they are fast (not a lot of computations are required) and relatively reliable when it comes to forecasts’ accuracy.
Collectively, the exponential smoothing methods can be defined in terms of the ETS framework (Error, Trend, and Seasonality), as they combine the underlying components in the smoothing calculations. As in the case of the seasonal decomposition, those terms can be combined additively, multiplicatively, or simply left out of the model.
Please see Forecasting: Principles and Practice (Hyndman and Athanasopoulos) for more information on the taxonomy of exponential smoothing methods.
The simplest model is called simple exponential smoothing (SES). This class of models is most apt for cases when the considered time series does not exhibit any trend or seasonality. They also work well with series with only a few data points.
The model is parameterized by a smoothing parameter α (alpha) with values between 0 and 1. The higher the value, the more weight is put on recent observations. When α = 0, the forecasts for the future are equal to the average of the training data. When α = 1, all the forecasts have the same value as the last observation in the training set.
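This weighting scheme can be written as the standard SES recursion, where $\hat{y}_t$ denotes the forecast and $y_t$ the observation at time $t$:

```latex
\hat{y}_{t+1} = \alpha y_t + (1 - \alpha)\,\hat{y}_t
```

Expanding the recursion shows that the weight placed on an observation $h$ periods in the past is $\alpha(1-\alpha)^h$, that is, exponentially decaying with the lag.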
The forecasts produced using SES are flat, that is, regardless of the time horizon, all forecasts have the same value (corresponding to the last level component). That is why this method is only suitable for series with neither trend nor seasonality.
Holt’s linear trend method (also known as Holt’s double exponential smoothing method) is an extension of SES that accounts for a trend in the series by adding the trend component to the model’s specification. As a consequence, this model should be used when there is a trend in the data, but it still cannot handle seasonality.
One issue with Holt's model is that the trend is constant in the future, which means that it increases/decreases indefinitely. That is why an extension of the model dampens the trend by adding the dampening parameter, φ (phi). It makes the trend converge to a constant value in the future, effectively flattening it. φ is rarely smaller than 0.8, as the dampening has a very strong effect for smaller values of φ. The best practice is to restrict the values of φ so that they lie between 0.8 and 0.98. For φ = 1, the damped model is equivalent to the model without dampening.
Lastly, we will cover the extension of Holt’s method called Holt-Winters’ seasonal smoothing (also known as Holt-Winters’ triple exponential smoothing). As the name suggests, it accounts for the seasonality in time series. Without going into too much detail, this method is most suitable for data with both trend and seasonality.
There are two variants of this model, with either additive or multiplicative seasonality. In the former, the seasonal variations are more or less constant throughout the time series. In the latter, the variations change in proportion to the level of the series.
In this recipe, we will show you how to apply the covered smoothing methods to monthly US unemployment rates (non-stationary data with trend and seasonality). We will fit the models to the rates from 2010 to 2018 and make forecasts for 2019.
We will use the same data that we used in the Time series decomposition recipe.
Execute the following steps to create forecasts of the US unemployment rate using the exponential smoothing methods:
import pandas as pd
import matplotlib.pyplot as plt
from datetime import date
from statsmodels.tsa.holtwinters import (ExponentialSmoothing,
                                         SimpleExpSmoothing,
                                         Holt)
TEST_LENGTH = 12
df.index.freq = "MS"
df_train = df.iloc[:-TEST_LENGTH]
df_test = df.iloc[-TEST_LENGTH:]
ses_1 = SimpleExpSmoothing(df_train).fit(smoothing_level=0.5)
ses_forecast_1 = ses_1.forecast(TEST_LENGTH)
ses_2 = SimpleExpSmoothing(df_train).fit()
ses_forecast_2 = ses_2.forecast(TEST_LENGTH)
ses_1.params_formatted
Running the snippet generates the following table:
Figure 6.13: The values of the fitted coefficients for the first SES model
We can use the summary method to print a more detailed summary of the fitted model.
ses_df = df.copy()
ses_df["ses_1"] = pd.concat([ses_1.fittedvalues, ses_forecast_1])
ses_df["ses_2"] = pd.concat([ses_2.fittedvalues, ses_forecast_2])
opt_alpha = ses_2.model.params["smoothing_level"]
fig, ax = plt.subplots()
ses_df["2017":].plot(style=["-",":","--"], ax=ax,
title="Simple Exponential Smoothing")
labels = [
    "unemp_rate",
    r"$\alpha=0.5$",
    r"$\alpha={0:.2f}$".format(opt_alpha),
]
ax.legend(labels)
Running the snippet generates the following plot:
Figure 6.14: Modeling time series using SES
In Figure 6.14, we can observe the characteristic of SES that we described in the introduction to this recipe—the forecast is a flat line. We can also see that the optimal value of α selected by the optimization routine is equal to 1. Immediately, we can see the consequences of picking such a value: the fitted line of the model is effectively the line of the observed values shifted to the right, and the forecast is simply the last observed value.
# Holt's model with linear trend
hs_1 = Holt(df_train).fit()
hs_forecast_1 = hs_1.forecast(TEST_LENGTH)
# Holt's model with exponential trend
hs_2 = Holt(df_train, exponential=True).fit()
hs_forecast_2 = hs_2.forecast(TEST_LENGTH)
# Holt's model with exponential trend and damping
hs_3 = Holt(df_train, exponential=True,
            damped_trend=True).fit()
hs_forecast_3 = hs_3.forecast(TEST_LENGTH)
hs_df = df.copy()
hs_df["hs_1"] = pd.concat([hs_1.fittedvalues, hs_forecast_1])
hs_df["hs_2"] = pd.concat([hs_2.fittedvalues, hs_forecast_2])
hs_df["hs_3"] = pd.concat([hs_3.fittedvalues, hs_forecast_3])
fig, ax = plt.subplots()
hs_df["2017":].plot(style=["-",":","--", "-."], ax=ax,
title="Holt's Double Exponential Smoothing")
labels = [
"unemp_rate",
"Linear trend",
"Exponential trend",
"Exponential trend (damped)",
]
ax.legend(labels)
Running the snippet generates the following plot:
Figure 6.15: Modeling time series using Holt’s Double Exponential Smoothing
We can already observe an improvement as, compared to the SES forecast, the lines are not flat anymore.
One additional thing worth mentioning is that while we were optimizing a single parameter alpha (smoothing_level) in the case of SES, here we are also optimizing beta (smoothing_trend) and potentially also phi (damping_trend).
SEASONAL_PERIODS = 12
# Holt-Winters' model with exponential trend
hw_1 = ExponentialSmoothing(df_train,
trend="mul",
seasonal="add",
seasonal_periods=SEASONAL_PERIODS).fit()
hw_forecast_1 = hw_1.forecast(TEST_LENGTH)
# Holt-Winters' model with exponential trend and damping
hw_2 = ExponentialSmoothing(df_train,
trend="mul",
seasonal="add",
seasonal_periods=SEASONAL_PERIODS,
damped_trend=True).fit()
hw_forecast_2 = hw_2.forecast(TEST_LENGTH)
hw_df = df.copy()
hw_df["hw_1"] = pd.concat([hw_1.fittedvalues, hw_forecast_1])
hw_df["hw_2"] = pd.concat([hw_2.fittedvalues, hw_forecast_2])
fig, ax = plt.subplots()
hw_df["2017":].plot(
style=["-",":","--"], ax=ax,
title="Holt-Winters' Triple Exponential Smoothing"
)
phi = hw_2.model.params["damping_trend"]
labels = [
"unemp_rate",
"Seasonal Smoothing",
f"Seasonal Smoothing (damped with $\\phi={phi:.2f}$)"
]
ax.legend(labels)
Running the snippet generates the following plot:
Figure 6.16: Modeling time series using Holt-Winters’ Triple Exponential Smoothing
In the preceding plot, we can see that now the seasonal patterns were also incorporated into the forecasts.
After importing the libraries, we fitted two different SES models using the SimpleExpSmoothing class and its fit method. To fit the model, we only used the training data. We could have manually selected the value of the smoothing parameter (smoothing_level); however, the best practice is to let statsmodels optimize it for the best fit. This optimization is done by minimizing the sum of squared residuals (errors). We created the forecasts using the forecast method, which requires the number of periods we want to forecast (which, in our case, is equal to the length of the test set).
In Step 3, we combined the fitted values (accessed using the fittedvalues attribute of the fitted model) and the forecasts inside of a pandas DataFrame, together with the observed unemployment rate. We then visualized all the series. To make the plot easier to read, we capped the data to cover the last 2 years of the training set and the test set.
In Step 5, we used the Holt class (which is a wrapper around the more general ExponentialSmoothing class) to fit Holt’s linear trend model. By default, the trend in the model is linear, but we can make it exponential by specifying exponential=True and add dampening with damped_trend=True. As in the case of SES, using the fit method with no arguments results in running the optimization routine to determine the optimal values of the parameters. In Step 6, we again placed all the fitted values and forecasts into a DataFrame and then we visualized the results.
In Step 7, we estimated two variants of Holt-Winters’ Triple Exponential Smoothing models. There is no separate class for this model, but we can adjust the ExponentialSmoothing class by adding the seasonal and seasonal_periods arguments. Following the taxonomy of the ETS models, we indicated that the models have an additive seasonal component. In Step 8, we again put all the fitted values and forecasts into a DataFrame and then we visualized the results as a line plot.
When creating an instance of the ExponentialSmoothing class, we can additionally pass in the use_boxcox argument to automatically apply the Box-Cox transformation to the analyzed time series. Alternatively, we could use the log transformation by passing the "log" string to the same argument.
In this recipe, we have fitted various exponential smoothing models to forecast the monthly unemployment rate. Each time, we specified what kind of model we were interested in and most of the time, we let statsmodels
find the best-fitting parameters.
However, we could approach the task differently, that is, using a procedure called AutoETS. Without going into much detail, the goal of the procedure is to find the best-fitting flavor of an ETS model, given some constraints we provide upfront. You can read more about how the AutoETS procedure works in the references mentioned in the See also section.
The AutoETS procedure is available in the sktime library, which is a library/framework inspired by scikit-learn, but with a focus on time series analysis/forecasting.
Execute the following steps to find the best ETS model using the AutoETS approach:
from sktime.forecasting.ets import AutoETS
from sklearn.metrics import mean_absolute_percentage_error
Fit the AutoETS model:
auto_ets = AutoETS(auto=True, n_jobs=-1, sp=12)
auto_ets.fit(df_train.to_period())
auto_ets_fcst = auto_ets.predict(fh=list(range(1, 13)))
auto_ets_df = hw_df.to_period().copy()
auto_ets_df["auto_ets"] = pd.concat([
    auto_ets._fitted_forecaster.fittedvalues,
    auto_ets_fcst["unemp_rate"]
])
fig, ax = plt.subplots()
auto_ets_df["2017":].plot(
style=["-",":","--","-."], ax=ax,
title="Holt-Winters' models vs. AutoETS"
)
labels = [
"unemp_rate",
"Seasonal Smoothing",
f"Seasonal Smoothing (damped with $\\phi={phi:.2f}$)",
"AutoETS",
]
ax.legend(labels)
Running the snippet generates the following plot:
Figure 6.17: The results of the AutoETS forecast plotted over the results of the Holt-Winters’ approach
In Figure 6.17, we can see that the in-sample fits of the Holt-Winters’ model and AutoETS are very similar. When it comes to the forecast, they do differ and it is hard to say which one better predicts the unemployment rate.
That is why in the next step we calculate the Mean Absolute Percentage Error (MAPE), which is a popular evaluation metric used in time series forecasting (and other fields).
fcst_dict = {
"Seasonal Smoothing": hw_forecast_1,
"Seasonal Smoothing (damped)": hw_forecast_2,
"AutoETS": auto_ets_fcst,
}
print("MAPEs ----")
for key, value in fcst_dict.items():
mape = mean_absolute_percentage_error(df_test, value)
print(f"{key}: {100 * mape:.2f}%")
Running the snippet generates the following summary:
MAPEs ----
Seasonal Smoothing: 1.81%
Seasonal Smoothing (damped): 6.53%
AutoETS: 1.78%
We can see that the accuracy scores (measured by MAPE) of the Holt-Winters’ method and the AutoETS approach are very similar.
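As a sanity check, MAPE is simple enough to compute by hand—it is the average of |actual − forecast| / |actual|, which is also what sklearn’s mean_absolute_percentage_error returns (as a raw ratio, not a percentage). A small sketch with made-up numbers:

```python
import numpy as np

def mape(y_true, y_pred):
    # mean of |error| / |actual|; assumes no zeros in y_true
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true))

actual = [4.0, 5.0, 4.0]
forecast = [4.4, 4.5, 4.0]
# percentage errors of 10%, 10% and 0% average out to ~6.67%
print(f"{100 * mape(actual, forecast):.2f}%")  # 6.67%
```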
Please see the following references for more information about the ETS methods:
ARIMA models are a class of statistical models that are used for analyzing and forecasting time series data. They aim to do so by describing the autocorrelations in the data. ARIMA stands for Autoregressive Integrated Moving Average and is an extension of a simpler ARMA model. The goal of the additional integration component is to ensure the stationarity of the series. That is because, in contrast to the exponential smoothing models, the ARIMA models require the time series to be stationary. Below we briefly go over the models’ building blocks.
AR (autoregressive) model: expresses the current value of the series as a linear combination of its p previous values (lags).
I (integration): applies differencing (d times) to the series in order to make it stationary.
MA (moving average) model: expresses the current value of the series as a linear combination of the q previous forecast errors.
All of these components fit together and are directly specified in the commonly used notation: ARIMA(p,d,q). In general, we should try to keep the values of the ARIMA parameters as small as possible to avoid unnecessary complexity and prevent overfitting to the training data. One possible rule of thumb is to keep d <= 2, while p and q should not be higher than 5. Also, most likely one of the terms (AR or MA) will dominate in the model, leading to the other one having a comparatively small value of its parameter.
ARIMA models are very flexible and by appropriately setting their hyperparameters, we can obtain some special cases:

- ARIMA(0,0,0): white noise
- ARIMA(0,1,0) without a constant: a random walk
- ARIMA(p,0,0): an autoregressive AR(p) model
- ARIMA(0,0,q): a moving average MA(q) model
- ARIMA(0,1,1) without a constant: simple exponential smoothing
ARIMA models are still very popular in the industry as they deliver near state-of-the-art performance (mostly for short-horizon forecasts), especially when we are dealing with small datasets. In such cases, more advanced machine and deep learning models are not able to show their true power.
One of the known weaknesses of the ARIMA models in the financial context is their inability to capture volatility clustering, which is observed in most financial assets.
In this recipe, we will go through all the necessary steps to correctly estimate an ARIMA model and learn how to verify that it is a proper fit for the data. For this example, we will once again use the monthly US unemployment rate from the years 2010 to 2019.
We will use the same data that we used in the Time series decomposition recipe.
Execute the following steps to create forecasts of the US unemployment rate using ARIMA models:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.tsa.arima.model import ARIMA
from chapter_6_utils import test_autocorrelation
from sklearn.metrics import mean_absolute_percentage_error
TEST_LENGTH = 12
df_train = df.iloc[:-TEST_LENGTH].copy()
df_test = df.iloc[-TEST_LENGTH:].copy()
We create the train/test split just as we have done in the previous recipe. This way, we will be able to compare the performance of the two types of models.
df_train["unemp_rate_log"] = np.log(df_train["unemp_rate"])
df_train["first_diff"] = df_train["unemp_rate_log"].diff()
df_train.plot(subplots=True,
title="Original vs transformed series")
Running the snippet generates the following figure:
Figure 6.18: Applying transformations to achieve stationarity
fig = test_autocorrelation(df_train["first_diff"].dropna())
Running the function produces the following output:
ADF test statistic: -2.97 (p-val: 0.04)
KPSS test statistic: 0.04 (p-val: 0.10)
By analyzing the test results, we can state that the first differences of the log transformed series are stationary. We also look at the corresponding autocorrelation plots.
Figure 6.19: The autocorrelation plots of the first differences of the log transformed series
arima_111 = ARIMA(
df_train["unemp_rate_log"], order=(1, 1, 1)
).fit()
arima_111.summary()
Running the snippet generates the following summary:
Figure 6.20: The summary of the fitted ARIMA(1,1,1) model

The first model was a vanilla ARIMA(1,1,1). For the second one, we go with ARIMA(2,1,2).
arima_212 = ARIMA(
df_train["unemp_rate_log"], order=(2, 1, 2)
).fit()
arima_212.summary()
Running the snippet generates the following summary:
Figure 6.21: The summary of the fitted ARIMA(2,1,2) model
df["pred_111_log"] = pd.concat([
    arima_111.fittedvalues,
    arima_111.forecast(TEST_LENGTH)
])
df["pred_111"] = np.exp(df["pred_111_log"])
df["pred_212_log"] = pd.concat([
    arima_212.fittedvalues,
    arima_212.forecast(TEST_LENGTH)
])
df["pred_212"] = np.exp(df["pred_212_log"])
df
Running the snippet generates the following table:
Figure 6.22: The predictions from the ARIMA models—raw and transformed back to the original scale
(
df[["unemp_rate", "pred_111", "pred_212"]]
.iloc[1:]
.plot(title="ARIMA forecast of the US unemployment rate")
)
Running the snippet generates the following figure:
Figure 6.23: The forecast and the fitted values from the two ARIMA models
Now we also zoom into the test set, to clearly see the forecasts:
(
df[["unemp_rate", "pred_111", "pred_212"]]
.iloc[-TEST_LENGTH:]
.plot(title="Zooming in on the out-of-sample forecast")
)
Running the snippet generates the following figure:
Figure 6.24: Zooming in on the out-of-sample forecast

In Figure 6.24, we can see that the forecast of the ARIMA(1,1,1) is virtually a straight line, while ARIMA(2,1,2) did a better job at capturing the pattern of the original series.
Now we calculate the MAPEs:
mape_111 = mean_absolute_percentage_error(
df["unemp_rate"].iloc[-TEST_LENGTH:],
df["pred_111"].iloc[-TEST_LENGTH:]
)
mape_212 = mean_absolute_percentage_error(
df["unemp_rate"].iloc[-TEST_LENGTH:],
df["pred_212"].iloc[-TEST_LENGTH:]
)
print(f"MAPE of ARIMA(1,1,1): {100 * mape_111:.2f}%")
print(f"MAPE of ARIMA(2,1,2): {100 * mape_212:.2f}%")
Running the snippet generates the following output:
MAPE of ARIMA(1,1,1): 9.14%
MAPE of ARIMA(2,1,2): 5.08%
preds_df = arima_212.get_forecast(TEST_LENGTH).summary_frame()
preds_df.columns = ["fcst", "fcst_se", "ci_lower", "ci_upper"]
plot_df = df_test[["unemp_rate"]].join(np.exp(preds_df))
fig, ax = plt.subplots()
(
plot_df[["unemp_rate", "fcst"]]
.plot(ax=ax,
title="ARIMA(2,1,2) forecast with confidence intervals")
)
ax.fill_between(plot_df.index,
plot_df["ci_lower"],
plot_df["ci_upper"],
alpha=0.3,
facecolor="g")
ax.legend(loc="upper left")
Running the snippet generates the following figure:
Figure 6.25: The forecast from the ARIMA(2,1,2) model together with its confidence intervals
We can see that the forecast is following the shape of the observed values. Additionally, we can see a typical cone-like pattern of the confidence intervals—the longer the horizon of the forecast, the wider the confidence intervals, which correspond to the increasing uncertainty.
After creating the training and test sets in Step 2, we applied the log transformation and first differences to the training data.
If we want to apply differencing to a given series more than once, we should use the np.diff function, as it implements recursive differencing. Using the diff method of a DataFrame/Series with periods > 1 results in taking the difference between the current observation and the one that many periods before.
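The difference between the two is easy to see on a small example (made-up numbers):

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 4.0, 7.0, 11.0])

# recursive differencing: the first difference of the first difference
print(np.diff(x, n=2))                      # [1. 1. 1.]

# "lagged" differencing: current value minus the one 2 periods before
print(x.diff(periods=2).dropna().tolist())  # [3.0, 5.0, 7.0]
```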
In Step 4, we tested the stationarity of the first differences of the log transformed series. To do so, we used the custom test_autocorrelation
function. By looking at the outputs of the statistical tests, we see that the series is stationary at the 5% significance level.
When looking at the ACF/PACF plots, we can also clearly see the yearly seasonal pattern (at lags 12 and 24).
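The spike at the seasonal lags can be reproduced with a back-of-the-envelope autocorrelation computation. In this sketch (our own helper, not the statsmodels acf function), a pure monthly seasonal pattern produces a strong positive autocorrelation at lag 12 and a negative one at lag 6, half a cycle away:

```python
import numpy as np

def autocorr(x, lag):
    # sample autocorrelation: covariance at the given lag over the variance
    x = np.asarray(x, dtype=float)
    xd = x - x.mean()
    return np.sum(xd[:-lag] * xd[lag:]) / np.sum(xd ** 2)

# ten years of a deterministic monthly seasonal pattern
months = np.arange(120)
seasonal = np.sin(2 * np.pi * months / 12)

print(round(autocorr(seasonal, 12), 2))  # 0.9 — strong spike at the seasonal lag
print(round(autocorr(seasonal, 6), 2))   # -0.95 — negative half a cycle away
```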
In Step 5, we fitted two ARIMA models: ARIMA(1,1,1) and ARIMA(2,1,2). First, the series turned out to be stationary after the first differences, so we knew that the order of integration was d=1. Normally, we can use the following set of “rules” to determine the values of p and q.
Identifying the order of the AR model:
Identifying the order of the MA model:
Regarding the manual calibration of ARIMA’s orders, Hyndman and Athanasopoulos (2018) warned that if both p and q are positive, the ACF/PACF plots might not be helpful in determining the specification of the ARIMA model. In the next recipe, we will introduce an automatic approach to determine the optimal values of ARIMA hyperparameters.
In Step 6, we combined the original series with the predictions from the two models. We extracted the fitted values from the ARIMA models and appended the forecasts for 2019 to the end of the series. Because we fitted the models to the log transformed series, we had to reverse the transformation by using the exponential function (np.exp).
When working with series that can have 0 values, it is safer to use np.log1p and np.expm1. This way, we avoid potential errors when taking the logarithm of 0.
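A quick check of the np.log1p / np.expm1 round trip:

```python
import numpy as np

y = np.array([0.0, 3.5, 10.0])

y_log = np.log1p(y)       # log(1 + y) is well-defined even for y == 0
y_back = np.expm1(y_log)  # exp(y_log) - 1 reverses the transformation

print(np.log1p(0.0))           # 0.0, while np.log(0.0) would return -inf
print(np.allclose(y, y_back))  # True
```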
In Step 7, we plotted the forecasts and calculated the mean absolute percentage error. The ARIMA(2,1,2) provided much better forecasts than the simple ARIMA(1,1,1).
In Step 8, we chained the get_forecast method of the fitted ARIMA model together with the summary_frame method to obtain the forecast and its corresponding confidence intervals. We had to use the get_forecast method, as the forecast method only returns point forecasts without any additional information. Lastly, we renamed the columns and plotted them together with the original series.
We have already fitted the ARIMA models and explored the accuracy of their forecasts. However, we can also investigate some goodness-of-fit criteria of the fitted models. Instead of focusing on the out-of-sample performance, we can dive a bit deeper into how well the models fit the training data. We do so by looking at the residuals of the fitted ARIMA models.
First, we plot diagnostic plots for the residuals of the fitted ARIMA(2,1,2) model:
arima_212.plot_diagnostics(figsize=(18, 14), lags=25)
Running the snippet generates the following figure:
Figure 6.26: The diagnostics plot of the fitted ARIMA(2,1,2) model
Below we briefly interpret each of the plots: the standardized residuals over time (which should resemble white noise), their histogram and kernel density estimate compared to the standard normal density, a normal Q-Q plot of the residuals, and a correlogram (ACF plot) of the residuals.
To continue investigating the autocorrelation of the residuals, we can also apply the Ljung-Box test for no autocorrelation. To do so, we use the test_serial_correlation method of the fitted ARIMA model. Alternatively, we could use the acorr_ljungbox function from statsmodels.
ljung_box_results = arima_212.test_serial_correlation(method="ljungbox")
ljung_box_pvals = ljung_box_results[0][1]
fig, ax = plt.subplots(1, figsize=[16, 5])
sns.scatterplot(x=range(len(ljung_box_pvals)),
y=ljung_box_pvals,
ax=ax)
ax.axhline(0.05, ls="--", c="r")
ax.set(title="Ljung-Box test's results",
xlabel="Lag",
ylabel="p-value")
Running the snippet generates the following figure:
Figure 6.27: The results of the Ljung-Box test for no autocorrelation in the residuals
All of the returned p-values are below the 5% significance level, which means we should reject the null hypothesis stating there is no autocorrelation in the residuals. It makes sense, as we have already observed a significant yearly correlation caused by the fact that our model is missing the seasonal patterns.
What we should also keep in mind is the number of lags to investigate while performing the Ljung-Box test. Different sources suggest different numbers of lags to consider. The default value in statsmodels is min(10, nobs // 5) for non-seasonal models and min(2*m, nobs // 5) for seasonal time series, where m denotes the seasonal period. Other commonly used variants include min(20, nobs - 1) and ln(nobs). In our case, we did not use a seasonal model, so the default value is 10. But as we know, the data does exhibit seasonal patterns, so we should have looked at more lags.
The fitted ARIMA models also contain the test_normality and test_heteroskedasticity methods, which we could use to further evaluate the model’s fit. We leave exploring those as an exercise for the reader.
Please see the following references for more information on fitting ARIMA models and helpful sets of rules for manually picking up the correct orders of the models:
For more information on the Ljung-Box test:
As we have seen in the previous recipe, the performance of an ARIMA model varies greatly depending on the chosen hyperparameters (p, d, and q). We can do our best to choose them based on our intuition, the statistical tests, and the ACF/PACF plots. However, this can prove to be quite difficult to do in practice.
That is why in this recipe we introduce auto-ARIMA, an automated approach to finding the best hyperparameters of the ARIMA class models (including variants such as ARIMAX and SARIMA).
Without going much into the technical details of the algorithm, it first determines the number of differences using the KPSS test. Then, the algorithm uses a stepwise search to traverse the model space, searching for a model that results in a better fit. A popular choice of evaluation metric used for comparing the models is the Akaike Information Criterion (AIC). The metric provides a trade-off between the goodness of fit of the model and its simplicity—AIC deals with the risks of overfitting and underfitting. When we compare multiple models, the lower the value of AIC, the better the model. For a more complete description of the auto-ARIMA procedure, please refer to the sources mentioned in the See also section.
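The AIC trade-off can be illustrated with the formula itself, AIC = 2k − 2·ln(L), where k is the number of estimated parameters and ln(L) is the maximized log-likelihood. In this toy comparison (the numbers are made up), a model needs to improve the log-likelihood by more than one unit per extra parameter to win:

```python
def aic(log_likelihood, n_params):
    # Akaike Information Criterion: penalizes parameters, rewards fit
    return 2 * n_params - 2 * log_likelihood

# a simpler model with a slightly worse fit...
print(aic(log_likelihood=-120.0, n_params=3))  # 246.0
# ...beats a richer model whose extra parameters don't pay for themselves
print(aic(log_likelihood=-118.5, n_params=6))  # 249.0 — higher AIC, worse model
```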
The auto-ARIMA framework also works well with the extensions of the ARIMA model:
In this recipe, we will once again work with the monthly US unemployment rates from the years 2010 to 2019.
We will use the same data that we used in the Time series decomposition recipe.
Execute the following steps to find the best-fitting ARIMA model using the auto-ARIMA procedure:
import pandas as pd
import pmdarima as pm
from sklearn.metrics import mean_absolute_percentage_error
TEST_LENGTH = 12
df_train = df.iloc[:-TEST_LENGTH].copy()
df_test = df.iloc[-TEST_LENGTH:].copy()
auto_arima = pm.auto_arima(df_train,
test="adf",
seasonal=False,
with_intercept=False,
stepwise=True,
suppress_warnings=True,
trace=True)
auto_arima.summary()
Executing the snippet generates the following summary:
Figure 6.28: The summary of the best-fitting ARIMA model, as identified using the auto-ARIMA procedure
The procedure indicated that the best-fitting ARIMA model is ARIMA(2,1,2). But as you can see, the results in Figure 6.28 and Figure 6.21 are different. That is because in the latter case, we have fitted the ARIMA(2,1,2) model to the log transformed series, while in this recipe, we have not applied the log transformation.
Because we indicated trace=True
, we also see the following information about the models fitted during the procedure:
Performing stepwise search to minimize aic
ARIMA(2,1,2)(0,0,0)[0] : AIC=7.411, Time=0.24 sec
ARIMA(0,1,0)(0,0,0)[0] : AIC=77.864, Time=0.01 sec
ARIMA(1,1,0)(0,0,0)[0] : AIC=77.461, Time=0.01 sec
ARIMA(0,1,1)(0,0,0)[0] : AIC=75.688, Time=0.01 sec
ARIMA(1,1,2)(0,0,0)[0] : AIC=68.551, Time=0.01 sec
ARIMA(2,1,1)(0,0,0)[0] : AIC=54.321, Time=0.03 sec
ARIMA(3,1,2)(0,0,0)[0] : AIC=7.458, Time=0.07 sec
ARIMA(2,1,3)(0,0,0)[0] : AIC=inf, Time=0.07 sec
ARIMA(1,1,1)(0,0,0)[0] : AIC=78.507, Time=0.02 sec
ARIMA(1,1,3)(0,0,0)[0] : AIC=60.069, Time=0.02 sec
ARIMA(3,1,1)(0,0,0)[0] : AIC=41.703, Time=0.02 sec
ARIMA(3,1,3)(0,0,0)[0] : AIC=10.527, Time=0.10 sec
ARIMA(2,1,2)(0,0,0)[0] intercept : AIC=inf, Time=0.08 sec
Best model: ARIMA(2,1,2)(0,0,0)[0]
Total fit time: 0.740 seconds
Similar to the ARIMA model estimated with the statsmodels library, with pmdarima (which is in fact a wrapper around statsmodels) we can also use the plot_diagnostics method to analyze the fit of the model by looking at its residuals:
auto_arima.plot_diagnostics(figsize=(18, 14), lags=25)
Executing the snippet generates the following figure:
Figure 6.29: The diagnostic plots of the best-fitting ARIMA model

Similar to the diagnostics plot in Figure 6.26, this ARIMA(2,1,2) model is also struggling with capturing the yearly seasonal patterns—we can clearly see that in the correlogram.
auto_sarima = pm.auto_arima(df_train,
test="adf",
seasonal=True,
m=12,
with_intercept=False,
stepwise=True,
suppress_warnings=True,
trace=True)
auto_sarima.summary()
Executing the snippet generates the following summary:
Figure 6.30: The summary of the best-fitting SARIMA model, as identified using the auto-ARIMA procedure
Just as we have done before, we will also look at the various residual plots:
auto_sarima.plot_diagnostics(figsize=(18, 14), lags=25)
Executing the snippet generates the following figure:
Figure 6.31: The diagnostic plots of the best-fitting SARIMA model
We can clearly see that the SARIMA model results in a much better fit than the ARIMA(2,1,2) model.
df_test["auto_arima"] = auto_arima.predict(TEST_LENGTH)
df_test["auto_sarima"] = auto_sarima.predict(TEST_LENGTH)
df_test.plot(title="Forecasts of the best ARIMA/SARIMA models")
Executing the snippet generates the following plot:
Figure 6.32: The forecasts from the ARIMA and SARIMA models identified using the auto-ARIMA procedure
It should not come as a surprise that the SARIMA model is capturing the seasonal patterns better than the ARIMA model. That is also reflected in the MAPEs, which we calculate below:
mape_auto_arima = mean_absolute_percentage_error(
df_test["unemp_rate"],
df_test["auto_arima"]
)
mape_auto_sarima = mean_absolute_percentage_error(
df_test["unemp_rate"],
df_test["auto_sarima"]
)
print(f"MAPE of auto-ARIMA: {100*mape_auto_arima:.2f}%")
print(f"MAPE of auto-SARIMA: {100*mape_auto_sarima:.2f}%")
Executing the snippet generates the following output:
MAPE of auto-ARIMA: 6.17%
MAPE of auto-SARIMA: 5.70%
After importing the libraries, we created the training and test set just as we have done in the previous recipes.
In Step 3, we used the auto_arima function to find the best hyperparameters of the ARIMA model. While using it, we specified that:

- test="adf" — the ADF test should be used for determining the order of differencing.
- seasonal=False — we are looking for a non-seasonal ARIMA model.
- with_intercept=False — the model should be fitted without an intercept. In statsmodels, the corresponding setting is available under the trend argument in the ARIMA class.
- stepwise=True — the stepwise search should be used. If we set it to False, the function will run an exhaustive grid search (trying out all possible hyperparameter combinations), similarly to the GridSearchCV class of scikit-learn. When using that scenario, we can indicate the n_jobs argument to specify how many models can be fitted in parallel.

There are also many different settings we can experiment with, for example:

- Selecting the best model based on out-of-sample performance (out_of_sample_size). This will make the algorithm fit the models on the data up until a certain point in time (the last observation minus out_of_sample_size) and evaluate them on the hold-out set. This way of selecting the best model might be preferable when we care more about the forecasting performance than the fit to the training data.

In Step 4, we used the auto_arima
function to find the best SARIMA model. To do so, we specified seasonal=True
and indicated that we are working with monthly data by setting m=12
.
Lastly, we calculated the forecasts coming from the two models using the predict
method, plotted them together with the ground truth, and calculated the MAPEs.
We can use the auto-ARIMA framework in the pmdarima
library to estimate even more complex models or entire pipelines, which include transforming the target variable or adding new features. In this section, we show how to do so.
We start by importing a few more classes:
from pmdarima.pipeline import Pipeline
from pmdarima.preprocessing import FourierFeaturizer
from pmdarima.preprocessing import LogEndogTransformer
from pmdarima import arima
For the first model, we train an ARIMA model with additional features (exogenous variables). As an experiment, we try to provide features indicating which month a given observation comes from. If this works, we might not need to estimate a SARIMA model to capture the yearly seasonality.
We create dummy variables using the pd.get_dummies
function. Each column contains a Boolean flag indicating if the observation came from the given month or not.
We also need to drop the first column from the new DataFrame to avoid the dummy-variable trap (perfect multicollinearity). We added the new variables for both the training and test sets:
month_dummies = pd.get_dummies(
df.index.month,
prefix="month_",
drop_first=True
)
month_dummies.index = df.index
df = df.join(month_dummies)
df_train = df.iloc[:-TEST_LENGTH]
df_test = df.iloc[-TEST_LENGTH:]
We then use the auto_arima function to find the best-fitting model. The only thing that changes compared to Step 3 of this recipe is that we had to specify the exogenous variables using the X argument. We indicated all columns except the one containing the target. Alternatively, we could have kept the additional variables in a separate object with identical indices as the target:
auto_arimax = pm.auto_arima(
df_train[["unemp_rate"]],
X=df_train.drop(columns=["unemp_rate"]),
test="adf",
seasonal=False,
with_intercept=False,
stepwise=True,
suppress_warnings=True,
trace=True
)
auto_arimax.summary()
Executing the snippet generates the following summary:
Figure 6.33: The summary of the ARIMA model with exogenous variables
We also look at the residuals plots by using the plot_diagnostics
method. It seems that the autocorrelation issues connected to the yearly seasonality were fixed by including the dummy variables.
Figure 6.34: The diagnostic plots of the ARIMA model with exogenous variables
Lastly, we also show how to create an entire data transformation and modeling pipeline, which also finds the best-fitting ARIMA model. Our pipeline consists of three steps:

- Log transforming the target time series with LogEndogTransformer.
- Creating new features with FourierFeaturizer—explaining Fourier series is outside of the scope of this book. In practice, using them permits us to account for the seasonality in a seasonal time series without using a seasonal model per se. To provide a bit more context, it is something similar to what we have done with the month dummies. The FourierFeaturizer class supplies decomposed seasonal Fourier terms as an exogenous array of features. We had to specify the seasonal periodicity m.
- Finding the best-fitting ARIMA model with the AutoARIMA class instead of the pm.auto_arima function. Those two offer the same functionalities, just this time we had to use a class to make it compatible with the Pipeline functionality.

auto_arima_pipe = Pipeline([
("log_transform", LogEndogTransformer()),
("fourier", FourierFeaturizer(m=12)),
("arima", arima.AutoARIMA(stepwise=True, trace=1,
error_action="warn",
test="adf", seasonal=False,
with_intercept=False,
suppress_warnings=True))
])
auto_arima_pipe.fit(df_train[["unemp_rate"]])
In the log produced by fitting the pipeline, we can see that the following model was selected as the best one:
Best model: ARIMA(4,1,0)(0,0,0)[0] intercept
The biggest advantage of using the pipeline is that we do not have to carry out all the steps ourselves. We just define a pipeline and then provide a time series as input to the fit method. In general, pipelines (also the ones in scikit-learn, as we will see in Chapter 13, Applied Machine Learning: Identifying Credit Default) are a great functionality that helps us keep the entire workflow in a single object, make experiments reproducible, and reduce the risk of mistakes such as applying the transformations in the wrong order.
A potential disadvantage of using pipelines is that some operations are harder to track (the intermediate results are not stored as separate objects) and it is a bit more difficult to access particular elements of the pipeline. For example, we cannot run auto_arima_pipe.summary()
to get the summary of the fitted ARIMA model.
Below, we create forecasts using the predict
method. Some noteworthy things about this step:
- When using the predict method with a fitted ARIMAX model, we also need to provide the required exogenous variables for the predictions. They are passed as the X argument.
- When using the predict method of a pipeline that transforms the target variable, the returned predictions (or fitted values) are expressed on the same scale as the original input. In our case, the following sequence happened under the hood. First, the original time series was log transformed. Then, new features were added. Next, we obtained predictions from the model (still on the log-transformed scale). Finally, the predictions were converted back to the original scale using the exponential function.
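The back-transformation described in the last point can be sketched with numpy. This is a conceptual illustration (with a toy "model" that naively extrapolates a constant log-difference), not the pipeline's internals:

```python
import numpy as np

y = np.array([1.0, 2.0, 4.0, 8.0])  # original scale
y_log = np.log(y)                   # step 1: log transform

# steps 2-3: suppose the model predicts on the log scale, e.g. a naive
# continuation of the last log-difference (here: log 2 per step)
step = y_log[-1] - y_log[-2]
preds_log = y_log[-1] + step * np.arange(1, 3)

# step 4: convert the predictions back to the original scale
preds = np.exp(preds_log)
# → [16.0, 32.0]; the doubling pattern continues on the original scale
```

Because exp is the inverse of log, the pipeline can hand back forecasts on the same scale as the series we originally passed to fit.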
results_df = df_test[["unemp_rate"]].copy()
results_df["auto_arimax"] = auto_arimax.predict(
    TEST_LENGTH,
    X=df_test.drop(columns=["unemp_rate"])
)
results_df["auto_arima_pipe"] = auto_arima_pipe.predict(TEST_LENGTH)
results_df.plot(title="Forecasts of the ARIMAX/pipe models")
Running the code generates the following plot:
Figure 6.35: The forecasts from the ARIMAX model and the ARIMA pipeline
For reference, we also add the scores of those forecasts:
MAPE of auto-ARIMAX: 6.88%
MAPE of auto-pipe: 4.61%
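For reference, MAPE (mean absolute percentage error) can be computed in a few lines of numpy; the function below is our own sketch, not the one used earlier in the chapter:

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, expressed in percent.
    Assumes y_true contains no zeros."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# toy example: errors of 10%, 5%, and 0% average out to 5%
print(f"MAPE: {mape([100, 200, 400], [110, 190, 400]):.2f}%")
```

Note that MAPE is undefined when the true series contains zeros, which is one reason other metrics (such as MAE or RMSE) are sometimes preferred.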
Of all the ARIMA models we have tried in this chapter, the pipeline model performed best. However, it still performed noticeably worse than the exponential smoothing methods.
When using the predict
method of the ARIMA models/pipelines in the pmdarima
library, we can set the return_conf_int
argument to True
. When we do so, the method returns not only the point forecasts but also the corresponding confidence intervals.
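Conceptually, such intervals are built around the point forecast as forecast ± z × standard error, where z comes from the normal distribution and the standard error grows with the forecast horizon. A numpy sketch with hypothetical forecasts and standard errors (not values produced by our models):

```python
import numpy as np

# hypothetical point forecasts and their standard errors,
# which typically widen as the horizon grows
point_fcst = np.array([5.0, 5.2, 5.4])
se = np.array([0.10, 0.15, 0.20])

z = 1.959964  # ~97.5th percentile of the standard normal → 95% interval
lower = point_fcst - z * se
upper = point_fcst + z * se
# intervals widen with the horizon because the standard errors grow
```

This is why the confidence bands on forecast plots fan out as we predict further into the future.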
In this chapter, we have covered the classical (statistical) approaches to time series analysis and forecasting. We learned how to decompose any time series into trend, seasonal, and remainder components. This step can be very helpful in getting a better understanding of the explored time series. But we can also use it directly for modeling purposes.
Then, we explained how to test if a time series is stationary, as some of the statistical models (for example, ARIMA) require stationarity. We also explained which steps we can take to transform a non-stationary time series into a stationary one.
Lastly, we explored two of the most popular statistical approaches to time series forecasting—exponential smoothing methods and ARIMA models. We have also touched upon more modern approaches to estimating such models, which involve automatic tuning and hyperparameter selection.
In the next chapter, we will explore ML-based approaches to time series forecasting.