Time series analysis is the study of the evolution of phenomena over time, in order to predict their future trends. It finds its application in various sectors, such as the price trend of a given product, tourist flows to a given location, and the performance of a product on the stock exchange.
Generally speaking, a time series can be decomposed into different components, such as the trend, which can be increasing, stable, or decreasing; the seasonality, that is, a pattern that repeats over time; and breakpoints, caused by external events, which interrupt its normal behavior.
In this chapter, you will review some basic concepts behind time series analysis, including the concept of stationarity, time series components, and how to check for the presence of breakpoints in a time series.
Over the last few years, different open source tools and libraries have been developed to perform time series analysis, including Prophet, statsmodels, and Kats. In this chapter, we will review the Prophet library and how to integrate it with Comet.
In the last part of this chapter, you will implement a practical use case, which uses the Prophet library, and tracks the result in Comet.
The chapter is organized as follows:
Before starting to review the basic concepts related to time series analysis, let’s install the required software needed to run the examples described in this chapter.
We will run all the experiments and code in this chapter using Python 3.8. You can download it from the official website, https://www.python.org/downloads/, choosing the 3.8 version.
The examples described in this chapter use the following Python packages:
We have already described the comet-ml, matplotlib, NumPy, pandas, and scikit-learn packages and how to install them in Chapter 1, An Overview of Comet, so please refer to that for further details on installation.
In this section, you will see how to install the other required packages.
Prophet is an open source Python package for time series analysis. You can install it as follows:
pip install prophet
For more details about Prophet installation, you can read its official documentation, available at the following link: https://facebook.github.io/prophet/docs/installation.html.
statsmodels is a Python library for statistical analysis. You can install it as follows:
pip install statsmodels
For more details about statsmodels installation, you can read its official documentation, available at the following link: https://www.statsmodels.org/stable/install.html.
Now that you have installed all the software needed in this chapter, let’s move on to how to use Comet for time series analysis, starting with reviewing some basic concepts.
A time series is an ordered sequence of values over time, representing the variation of a certain phenomenon. Examples of time series include the trend of the prices of a certain product and the trend of rainfall in a given region over time. The following figure shows an example of a time series representing the natural gas price from 2000 to 2020:
Figure 11.1 – The natural gas price time series
Data was extracted from the DataHub website and is available at https://datahub.io/core/natural-gas under the public domain and the use of Energy Information Administration (EIA) content license.
Time series analysis, also known as time series forecasting, is the study of the past values of a time series, with the purpose of building a model that predicts its future values.
In this section, you will learn the following basic concepts and aspects related to time series:
Let’s start with the first aspect, loading a time series in Python.
To load a dataset as a time series in Python, you can proceed as follows:
import pandas as pd
df = pd.read_csv('https://datahub.io/core/natural-gas/r/monthly.csv', parse_dates=['Month'])
You should make sure that the column related to dates is parsed as a datetime (parse_dates=['Month']). As an example, we have used the natural gas prices dataset, as described previously.
df = df.set_index('Month')
ts = df['Price']
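Putting the loading steps together, here is a self-contained sketch that uses a small in-memory CSV instead of the remote file (the values are fabricated for illustration):

```python
import io

import pandas as pd

# A tiny stand-in for the natural gas prices CSV
csv_data = "Month,Price\n2000-01-01,2.1\n2000-02-01,2.6\n2000-03-01,2.9\n"

# Parse the date column, index by it, and extract the series
df = pd.read_csv(io.StringIO(csv_data), parse_dates=['Month'])
df = df.set_index('Month')
ts = df['Price']

print(ts.index.dtype)  # datetime64[ns]
```

The same three steps (parse dates, set the index, select the column) apply unchanged to the real dataset.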
Now that you have learned how to load a time series in Python, you are ready to check whether a time series is stationary.
Stationarity is a property of a time series, meaning that the statistical properties of the process generating the time series do not change over time. This property does not mean that the time series is constant over time, just that the way it changes does not itself change over time. For example, the time series of Figure 11.1 is not stationary, because its statistical behavior (for example, its mean level) changes over time.
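Before applying a formal statistical test, a quick informal check is to compare rolling statistics: for a stationary series, the rolling mean stays roughly constant, while for a non-stationary one it drifts. A minimal sketch on synthetic data (the window of 12 observations is an arbitrary choice suited to monthly data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# A stationary series (white noise) and a non-stationary one (random walk)
stationary = pd.Series(rng.normal(0, 1, 240))
random_walk = stationary.cumsum()

def rolling_mean_spread(ts, window=12):
    """Spread (max - min) of the rolling mean: small for stationary data."""
    means = ts.rolling(window).mean().dropna()
    return means.max() - means.min()

# The rolling mean drifts far more for the random walk
print(rolling_mean_spread(stationary) < rolling_mean_spread(random_walk))
```

This is only a heuristic; the formal ADF test discussed in this section gives a principled answer.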
The section is organized as follows:
Let’s start from the first point, the stationarity test.
To check whether a time series is stationary, you can use different methods, including the Augmented Dickey-Fuller (ADF) test and the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test. In this chapter, we will focus on the ADF test. For more details, you can refer to the books contained in the Further reading section.
The ADF test is a statistical test, which is conducted with the following assumptions:
If the test fails to reject the null hypothesis, the series is non-stationary. In the ADF test, there are two conditions to reject the null hypothesis:
The test statistic, the p-value, and the critical value are returned by the test, while the alpha value is usually set to 0.05. If both conditions are satisfied, you can conclude that the time series is stationary. The statsmodels Python package provides a function to perform the ADF test. You will explore it through a practical example in the Using time series analysis from project setup to report building section.
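The decision rule described above can be captured in a small helper that takes the values returned by the test (the helper name and the default alpha are our own choices, not part of statsmodels):

```python
def adf_is_stationary(test_statistic, p_value, critical_value, alpha=0.05):
    """Apply the two ADF rejection conditions: the p-value must fall
    below alpha, and the test statistic must be more negative than
    the critical value."""
    return (p_value < alpha) and (test_statistic < critical_value)

# A strongly negative statistic with a tiny p-value rejects the null
# hypothesis, i.e., the series is stationary
print(adf_is_stationary(-4.2, 0.001, -2.89))  # True
print(adf_is_stationary(-1.1, 0.40, -2.89))   # False
```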
If a time series is stationary, you can build a prediction model that, in theory, can be very accurate. If the time series is not stationary, you can still build a prediction model, but its predictions may be unreliable. Thus, it is preferable for your time series to be stationary.
What should you do with a non-stationary time series? Let’s investigate it in the next section.
If a time series is not stationary, you can perform a transformation to make it stationary. Examples of transformations include differencing, that is, subtracting the previous value from the current one, and applying a logarithmic transformation.
To apply a differencing transformation to the time series, you can proceed as follows:
ts2 = ts.diff()
The diff() method calculates the difference between each value of the time series and the previous one. Obviously, this operation cannot be performed for the first value of the time series, which is set to NaN by the diff() method.
ts2.dropna(inplace=True)
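As a sanity check, the differencing transformation and its inverse can be verified on a toy series; note that, to undo diff(), you need the cumulative sum plus the first original value:

```python
import pandas as pd

# A toy series standing in for the natural gas prices
ts = pd.Series([2.0, 2.5, 3.0, 4.0, 3.5])

# Forward: difference and drop the leading NaN
ts2 = ts.diff()
ts2.dropna(inplace=True)

# Inverse of differencing: cumulative sum plus the first original value
ts3 = ts2.cumsum() + ts.iloc[0]

# ts3 matches ts everywhere except the missing first entry
print(ts3.equals(ts.iloc[1:]))  # True
```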
The following figure shows the effects of the differencing transformation on the time series of Figure 11.1:
Figure 11.2 – The effects of the differencing transformation
In the example, the differencing transformation makes the time series stationary. Thus, you can use the differenced time series to build the model. However, if you want to reconstruct the original time series, you need to perform the inverse operation. In the case of differencing, you calculate the cumulative sum and add back the first value of the original series:
ts3 = ts2.cumsum() + ts.iloc[0]
The reconstructed time series, ts3, matches the original one, except for the first value, which is missing because it is also missing in ts2. Now that we have learned some basic concepts regarding stationarity, we can move on to the next aspect, exploring the time series components.
You can think of a time series as being composed of the following three components:
You can use different techniques to decompose a time series into the previous three components and to deal with seasonality. In this section, you will see the following:
Let’s start from the first point, decomposing a time series in Python.
In Python, you can decompose a time series into its three components through the seasonal_decompose() function, provided by the statsmodels package. Let’s suppose that you want to decompose the time series shown in the following figure:
Figure 11.3 – The air passengers time series
The time series represents the number of air passengers per month. The dataset is available on Kaggle at https://www.kaggle.com/datasets/rakannimer/air-passengers under the Open Data Commons license.
The following piece of code shows how to decompose the time series through the seasonal_decompose() function:
from statsmodels.tsa.seasonal import seasonal_decompose
result = seasonal_decompose(ts, model='additive', period=12)
The function receives as input the time series, the model used for the decomposition (either multiplicative or additive), and the period, that is, the number of observations per seasonal cycle (12 for monthly data with yearly seasonality).
You can also plot the results of decomposition through the result.plot() method:
Figure 11.4 – The decomposed time series
You can easily identify the trend (Trend), the seasonality (Seasonal), and the residuals (Resid).
Now that you have learned how to decompose a time series in Python, we can move on to the next step, dealing with seasonal time series.
When you build a model for time series forecasting, you should take into account whether your time series has a seasonal component or not. In fact, some models perform better if the time series does not present seasonality, and others perform better if it does.
One strategy to use any model regardless of the presence of seasonality is to remove seasonality from the time series and predict future values for the seasonally adjusted time series. Then, you can add the seasonality to the predicted values, to obtain the correct predictions.
Follow these detailed steps:
seasonality = result.seasonal
adjusted_ts = ts - seasonality
Since we have used an additive model, we can remove seasonality simply by calculating the difference between the original time series and the seasonality. The following figure shows the original time series and the seasonally adjusted time series:
Figure 11.5 – The original time series and the seasonally adjusted time series
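The remove-then-add-back strategy relies on simple additive arithmetic, which can be sketched on a fabricated toy series (the seasonal pattern and the "predictions" are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy series: linear trend plus a known additive seasonal pattern
trend = pd.Series(np.arange(24, dtype=float))
seasonality = pd.Series(np.tile([0.0, 5.0, -5.0], 8))
ts = trend + seasonality

# Step 1: remove the seasonality (additive model)
adjusted_ts = ts - seasonality

# Step 2: "predict" on the adjusted series; here we simply reuse it
predicted_adjusted = adjusted_ts

# Step 3: add the seasonality back to obtain the final predictions
predicted = predicted_adjusted + seasonality

print(predicted.equals(ts))  # True
```

With a multiplicative model, the same round trip would use division and multiplication instead of subtraction and addition.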
Now that we have learned about time series components, we can discuss how to identify breakpoints in a time series.
A breakpoint is a structural change in a time series, such as an anomaly or an unexpected event. The following figure shows an example of a time series with two breakpoints:
Figure 11.6 – A time series with at least two breakpoints
The figure shows the time series of the number of new daily COVID-19 infections in Italy, extracted from the Humanitarian Data Exchange, available at https://data.humdata.org/dataset/coronavirus-covid-19-cases-and-deaths under the Creative Commons Attribution for Intergovernmental Organisations license.
You can use the following two main techniques to identify breakpoints:
Now that we have learned the basic concepts of breakpoints in a time series, we can move on to the next step, exploring the Prophet package.
Prophet is an algorithm for time series analysis, released by Facebook’s Core Data Science team. Other algorithms exist for time series analysis, including Autoregressive Integrated Moving Average (ARIMA) and Seasonal Autoregressive Integrated Moving Average (SARIMA). You can refer to the books in the Further reading section if you are interested in them.
In this section, you will investigate Prophet, with a focus on the following aspects:
Let’s start with the first point, introducing the Prophet package.
To build a model using Prophet, you can proceed as follows:
from prophet import Prophet
model = Prophet()
model.fit(df)
future = model.make_future_dataframe(periods=N)
The N variable indicates the number of future periods to predict; the time unit is controlled by the freq parameter of make_future_dataframe() (days by default).
forecast = model.predict(future)
You can customize the Prophet model with different parameters, which permit you to deal with the following main aspects:
model = Prophet(changepoints=['2020-05-06', '2022-07-06'])
We use the changepoints parameter to set the list of changepoints.
model = Prophet(weekly_seasonality=False, yearly_seasonality=False)
model.add_seasonality(name='monthly', period=30.5, fourier_order=3)
We disable the weekly and yearly seasonality (if not needed), and we use the add_seasonality() method to add a custom seasonality. This method receives as input the associated name, the period of seasonality, and the Fourier order. For more details on the Fourier order, you can read the official Prophet documentation, available at the following link: https://facebook.github.io/prophet/docs/seasonality,_holiday_effects,_and_regressors.html.
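Under the hood, Prophet models each seasonality as a partial Fourier sum, and the Fourier order is the number of sine/cosine pairs in that sum. The following NumPy sketch of such a basis is illustrative only and is not Prophet's actual implementation:

```python
import numpy as np

def fourier_basis(t, period, order):
    """Build the 2*order seasonal features for time points t:
    sine and cosine terms at harmonics of the base period."""
    t = np.asarray(t, dtype=float)
    features = []
    for k in range(1, order + 1):
        features.append(np.sin(2 * np.pi * k * t / period))
        features.append(np.cos(2 * np.pi * k * t / period))
    return np.column_stack(features)

# Monthly seasonality (period ~30.5 days) with Fourier order 3
X = fourier_basis(np.arange(61), period=30.5, order=3)
print(X.shape)  # (61, 6)
```

A higher Fourier order produces more features, and hence a more flexible (but more overfitting-prone) seasonal curve.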
from prophet.diagnostics import cross_validation
df_cv = cross_validation(model, initial='365 days', period='30 days', horizon = '100 days')
from prophet.plot import plot_cross_validation_metric
fig3 = plot_cross_validation_metric(df_cv, "mse")
You should pass the following three additional parameters to the cross_validation() function:
The cross_validation() function returns a DataFrame that contains the predicted values and the actual values, which you can use to calculate the classical performance metrics, such as MSE, RMSE, and the Mean Absolute Percentage Error (MAPE).
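If you prefer to compute the metrics yourself rather than relying on the plotting helpers, they can be derived directly from the actual and predicted values. A sketch with fabricated numbers:

```python
import numpy as np

y_true = np.array([100.0, 120.0, 140.0, 160.0])
y_pred = np.array([110.0, 115.0, 150.0, 155.0])

mse = np.mean((y_true - y_pred) ** 2)               # mean squared error
rmse = np.sqrt(mse)                                  # root mean squared error
mape = np.mean(np.abs((y_true - y_pred) / y_true))   # mean absolute pct. error

print(round(mse, 2), round(rmse, 2), round(mape, 4))
```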
For more details on the Prophet package, you can refer to its official documentation, available at the following link: https://facebook.github.io/prophet/.
Now that you have learned some basic concepts of the Prophet package, we can move on to the next step, integrating Prophet with Comet.
Comet is fully integrated with Prophet, so whenever you build a Prophet model, Comet will log the following elements automatically:
You can control which elements you want to log by setting the corresponding parameter, as described in the Comet official documentation, available at the following link: https://www.comet.com/docs/python-sdk/prophet/.
To make automatic logging work, you should make sure that the comet_ml library is imported before the prophet one:
from comet_ml import Experiment
from prophet import Prophet
It may happen that you have a version of Prophet that is not supported by Comet. In this case, you can use configuration parameters to make Comet log the elements. For example, you can set the COMET_AUTO_LOG_FIGURES environment variable to 1 to make Comet log figures. You can set this variable in many ways; for example, through the os library, as follows:
import os
os.environ['COMET_AUTO_LOG_FIGURES'] = '1'
Alternatively, you can log the elements manually, as shown in the following piece of code:
experiment.log_figure(figure_name='forecast', figure=fig1)
The code shows how to log a Matplotlib figure named fig1 to Comet.
Now that we have learned how to integrate Comet with Prophet, we can implement a practical example, which starts from project setup to report building.
In this section, you will implement a practical example that builds two models to predict the future trend of a time series describing arrivals at tourist accommodation establishments. The dataset shows the trend from 1990 to 2022; thus, it contains a breakpoint in April 2020, when the COVID-19 pandemic began.
In the example, you will build two models, one which considers the breakpoint at the beginning of the COVID-19 pandemic and another which does not. You will compare the two models in Comet to establish which one performs better.
The full code of the example described in this section is available at the following link: https://github.com/PacktPublishing/Comet-for-Data-Science/tree/main/11.
You can write the code using the editor or the notebook you prefer. In this example, you will use Deepnote, a popular online notebook, which is fully integrated with Comet.
You will focus on the following aspects:
Let’s start from the first point, configuring the Deepnote environment. If you do not want to use Deepnote for your project, you can skip the next section and jump directly to the Loading and preparing the dataset section.
Deepnote is a platform that permits you to create notebooks over the web. The entry point to the Deepnote website is available at the following link: https://deepnote.com/.
Compared with classical Jupyter notebooks, notebooks created in Deepnote can be shared with your colleagues in real time. In addition, Deepnote is compatible with Jupyter notebooks, so you can easily import notebooks implemented in Jupyter directly into Deepnote, and vice versa. Deepnote also provides you with a virtual machine that you can configure according to your needs. For example, you can configure the Python version, as well as the Python libraries used by a specific project.
Comet is fully integrated with Deepnote, so you can run your Comet experiments directly in Deepnote, and then you can display the Comet dashboard directly in Deepnote.
Let’s investigate how to configure Deepnote to work with Comet:
Figure 11.7 – The Deepnote dashboard
Figure 11.8 – The menu to create a new object in a Deepnote project
pandas
comet_ml
statsmodels
prophet
You do not need to save the file because Deepnote will do it for you. Whenever you run your project, Deepnote will install all the software contained in the requirements.txt file automatically.
We have just configured Deepnote to host our code, so we can move on to the next step, loading and preparing the dataset.
As a use case, you will use the arrivals at tourist accommodation establishments dataset, released by Eurostat under the Creative Commons CC BY 4.0 License and available at https://ec.europa.eu/eurostat/web/tourism/data/database. The dataset contains the tourist arrivals for 42 European countries from January 1990 to April 2022. For each country, the dataset provides details on the type of tourism (foreign, domestic, or both), the type of accommodation, and the unit used to measure the tourist arrivals. In this example, we will select Italian tourism, with a focus on hotels and similar accommodation (code I551). In addition, we will select absolute numbers as the unit to measure the tourist arrivals.
The following figure shows an extract of the dataset:
Figure 11.9 – An extract of the dataset used in the example
The dataset contains 2,358 rows and 389 columns. Each row contains data related to a country, while the columns contain values related to different dates. Before using the dataset, we should prepare it by extracting a subset containing only the Italian time series.
Let’s proceed to extraction:
import pandas as pd
df = pd.read_csv('source/tour_occ_arm.tsv', sep='\t', na_values=': ')
As additional parameters for the read_csv() function, we pass the sep argument to specify the separator character, as well as the na_values parameter to make read_csv() recognize the ':' character as an additional NaN value. The following figure shows an extract of the loaded dataset:
Figure 11.10 – The tourist arrivals dataset loaded as a pandas DataFrame
The first column of the dataset is not parsed properly, since its values are separated by the comma character (',') rather than by the tab character ('\t').
df[["c_resid", "unit", "nace_r2", "geo_time"]] = df["c_resid,unit,nace_r2,geo\time"].str.split(",",expand=True)
We access the string content of the selected column through the str accessor. Then, we use the split() method to extract the tokens in each string and assign each of them to a new column.
The country code is contained in the geo_time column.
df = df[df['unit'] == 'NR']
The unit is contained in the unit column.
df = df[df['c_resid'] == 'TOTAL']
The type of tourists is contained in the c_resid column.
df = df[df['nace_r2'] == 'I551']
The type of accommodation is contained in the nace_r2 column.
The following figure shows the dataset produced by applying the previous operations:
Figure 11.11 – The filtered dataset
Now, we should convert the previous dataset into a time series, where each row represents a different date. You can do it by transposing the DataFrame. However, before doing it, we should remove all the columns that do not refer to dates, as well as columns containing NaN values.
df = df.drop(['c_resid,unit,nace_r2,geo\time', 'c_resid', 'unit', 'nace_r2', 'geo_time'], axis=1)
df = df.dropna(axis=1)
df = df.T
df = df.reset_index()
df['index'] = df['index'].str.replace('M', '-').str.strip() + '-01'
df = df.rename(columns={'index' : 'ds', 1669 : 'y'})
df['ds'] = pd.to_datetime(df['ds'])
df['y'] = pd.to_numeric(df['y'])
df = df.iloc[::-1]
df.reset_index(inplace=True)
df.drop(['index'], axis=1,inplace=True)
The following figure shows the first five rows of the resulting dataset:
Figure 11.12 – The extracted time series
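The whole reshaping pipeline (split the fused metadata column, filter, drop, transpose, rename, and reorder) can be exercised on a tiny fabricated frame that mimics the Eurostat layout; the column names follow the real dataset, but the values are invented:

```python
import pandas as pd

# Two fake rows in the Eurostat-style wide layout: metadata fused in the
# first column, one column per month (most recent first)
meta_col = 'c_resid,unit,nace_r2,geo\\time'
raw = pd.DataFrame({
    meta_col: ['TOTAL,NR,I551,IT', 'FOR,NR,I551,IT'],
    '2021M02': [200.0, 80.0],
    '2021M01': [100.0, 40.0],
})

# Split the fused metadata column and filter to the series of interest
raw[['c_resid', 'unit', 'nace_r2', 'geo_time']] = (
    raw[meta_col].str.split(',', expand=True))
sel = raw[(raw['c_resid'] == 'TOTAL') & (raw['unit'] == 'NR')]

# Drop the metadata columns, transpose so that each row is a date,
# and rename the columns to the ds/y convention expected by Prophet
sel = sel.drop([meta_col, 'c_resid', 'unit', 'nace_r2', 'geo_time'], axis=1)
ts = sel.T.reset_index().rename(columns={'index': 'ds', 0: 'y'})
ts['ds'] = pd.to_datetime(ts['ds'].str.replace('M', '-') + '-01')
ts = ts.iloc[::-1].reset_index(drop=True)  # oldest date first

print(ts['y'].tolist())  # [100.0, 200.0]
```

Note that the column key 0 in the rename() call comes from the pandas row index of the single selected row; on the real dataset, that key is the index of the filtered row (1669 in our example).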
import matplotlib.pyplot as plt
plt.plot(df['ds'], df['y'])
plt.show()
The following figure shows the produced plot:
Figure 11.13 – The time series representing the total number of arrivals at Italian accommodation establishments
The figure clearly shows the breakpoint in April 2020, when the lockdown caused by the COVID-19 pandemic began.
Now that we have loaded and prepared the dataset, we can move on to the next step, checking stationarity in data.
To check stationarity in data, we use the adfuller() function, provided by the statsmodels library. This function performs the ADF test. We proceed as follows:
from statsmodels.tsa.stattools import adfuller
def is_stationary(df):
    df2 = df.set_index('ds')
    ts = df2['y']
    dftest = adfuller(ts)
    adf = dftest[0]
    pvalue = dftest[1]
    critical_value = dftest[4]['5%']
    if (pvalue < 0.05) and (adf < critical_value):
        return True
    else:
        return False
The adfuller() function returns a tuple containing the test statistic (dftest[0]), the p-value (dftest[1]), and other values, including the critical ones.
test_result = is_stationary(df)
if test_result == True:
    print('The series is stationary')
else:
    print('The series is NOT stationary')
In our case, the series is stationary, so we do not need to transform it to make it stationary.
We are ready to build the two prediction models, so let’s proceed to the next step, building the models.
We build two different models; the first one does not consider the COVID-19 breakpoint, and the second one does. In both cases, we perform the following operations:
In the description that follows, we will analyze the model without breakpoints. The procedure adopted for the model with breakpoints is very similar, with only one difference when creating the model.
Let’s start from the first step, building the Comet experiment.
To build the Comet experiment, you can proceed as follows:
from comet_ml import Experiment
from prophet import Prophet
You should make sure that you import the comet_ml library before the Prophet one.
experiment = Experiment()
experiment.set_name('WithoutChangePoints')
To make the preceding code work, you should configure the .comet.config file, as described in Chapter 1, An Overview of Comet.
We are ready to build the model, so let’s proceed.
We split the dataset into training and test sets. The training set contains the first rows, up to the date of 2020-12-01, and the test set contains the remaining rows. We have chosen to keep in the training set some rows related to the effects of the COVID-19 pandemic (from April 2020 to December 2020), to give the model the possibility to learn the presence of the breakpoint:
index = df.index[df['ds'] == '2021-01-01'].tolist()[0]
n = df.shape[0] - index
df_train = df.head(index)
df_test = df.tail(n)
We extract the index used to separate the training and test sets, and then we assign the first index rows to the training set and the remaining n rows to the test set.
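The split logic can be checked on a toy Prophet-style frame (the dates and the cutoff are fabricated, and shorter than in the real example):

```python
import pandas as pd

df = pd.DataFrame({
    'ds': pd.date_range('2020-10-01', periods=6, freq='MS'),
    'y': [10.0, 12.0, 11.0, 13.0, 14.0, 15.0],
})

# Everything before 2021-01-01 goes to training, the rest to test
index = df.index[df['ds'] == '2021-01-01'].tolist()[0]
n = df.shape[0] - index
df_train = df.head(index)
df_test = df.tail(n)

print(len(df_train), len(df_test))  # 3 3
```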
m = Prophet()
m.fit(df_train)
In the case of the model that also considers the breakpoint, we create a different model, as shown here:
m = Prophet(changepoints=['2020-03-01'])
future = m.make_future_dataframe(periods=n,freq='MS')
forecast = m.predict(future)
We predict n values, with a monthly frequency (freq='MS'). The forecast variable is a DataFrame, which contains all the information related to the predicted values. The following figure shows the columns of the forecast DataFrame:
Figure 11.14 – The list of columns of the forecast DataFrame
After building the model, we are ready to log model results in Comet, so let’s proceed.
fig1 = m.plot(forecast)
The following figure shows the produced plot:
Figure 11.15 – The graph produced by the plot() method
The points in the figure indicate the data points used to train the model, the bold line indicates the prediction, and the light area indicates the uncertainty intervals. Note that the model is not able to recognize the breakpoint at the beginning of the lockdown caused by the COVID-19 pandemic.
fig2 = m.plot_components(forecast)
Similar to the plot() method, plot_components() should log the figure in Comet automatically. The following figure shows the produced graphs:
Figure 11.16 – The graphs produced by the plot_components() method
The graph shows the trend line and the yearly seasonality. From the first graph, you can clearly see a change in trend between 2013 and 2017. From the second graph, you see a peak in tourist arrivals in August, which is the hottest month in Italy.
df_cv = cross_validation(m, initial="7300 days", period="365 days", horizon="730 days")
fig3 = plot_cross_validation_metric(df_cv, "rmse")
We use the first 20 years (7,300 days) as historical data for the training set, then we use a cutoff window of 365 days, and we predict the next 2 years (730 days). We also calculate RMSE. The following figure shows the plotted figure:
Figure 11.17 – RMSE over Horizon, as produced by the plot_cross_validation_metric() function
Now that we have logged the model results in Comet, we can move on to the next step, logging performance metrics in Comet.
We log three performance metrics, MAE, MAPE, and RMSE, as follows:
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.metrics import mean_squared_error
def log_metrics(ds, forecast, experiment):
    df_merge = pd.merge(ds[['ds', 'y']], forecast[['ds', 'yhat']], on='ds')
    y_true = df_merge['y'].values
    y_pred = df_merge['yhat'].values
    metrics = {}
    metrics['mae'] = mean_absolute_error(y_true, y_pred)
    metrics['mape'] = mean_absolute_percentage_error(y_true, y_pred)
    metrics['rmse'] = mean_squared_error(y_true, y_pred, squared=False)
    experiment.log_metrics(metrics)
log_metrics(df, forecast, experiment)
Remember that, at the end of your code, you should call the experiment.end() method, since you are using Deepnote.
You can repeat the previous steps for the second Prophet model, which also considers the COVID-19 breakpoint.
Now that we have learned how to log performance metrics in Comet, we can move on to the next step, exploring results in Comet.
After running the experiments, you should see the results in Comet:
Figure 11.18 – The charts produced automatically by Comet
Figure 11.19 – The hyperparameters logged automatically by Comet
Figure 11.20 – The logged metrics in Comet
The table in the preceding screenshot shows the last (Last), minimum (Min), and maximum (Max) values for each metric.
Figure 11.21 – The figures logged in Comet by the Prophet plots
Now that we have explored the results in Comet, we can move on to the final step, building the final report.
Now you are ready to build the final report. In this example, we will build a simple report with the results of both models. As a further exercise, you could improve them by applying the concepts learned in Chapter 5, Building a Narrative in Comet. To create the report, in the Comet dashboard, click on the Panels tab and then select Add | Add to Report | New Report.
You will create a report with the following two sections:
Let’s start with the first section, comparing forecasts visually.
In this section, we add all the graphics available under the Graphics menu:
In this section, we add two panels related to performance metrics:
To add the two panels, you can proceed as follows:
You can repeat steps 1, 2, and 3 to add the MAPE panel.
Your report is ready! You can view the final result directly in Comet at the following link: https://www.comet.com/packt/time-series-analysis-deepnote/reports/time-series-forecasting.
We have just built a time series analysis model in Prophet and tracked it in Comet!
Throughout this chapter, we described some general concepts regarding time series analysis, including stationarity, seasonality, and breakpoints. In addition, you have learned the main concepts behind the Prophet package and how to combine it with Comet.
In the last part of the chapter, you implemented a practical use case that showed you how to track and compare two time series analysis experiments in Comet, as well as how to build a report with the results of the experiments.
The world of data science is very promising and challenging. Both research and industry sectors are constantly trying to improve current knowledge with new algorithms, frameworks, and tools. Throughout this book, you have investigated Comet, one of the promising platforms for experiment tracking and monitoring.
I hope that the concepts you learned in this book will help you build better models, track them with a valid tool, and, eventually, become a better data scientist.