You often hear in the data science industry that a data scientist typically spends close to 80% of their time on getting the data, processing it, cleaning it, and so on. And only then the remaining 20% of the time is actually spent on modeling, which is often considered to be the most interesting part. In the previous chapter, we have already learned how to download data from various sources. We still need to go through a few steps before we can draw actual insights from the data.
In this chapter, we will cover data preprocessing, that is, general wrangling/manipulation applied to the data before using it. The goal is not only to enhance the model’s performance but also to ensure the validity of any analysis based on that data. In this chapter, we will focus on the financial time series, while in the subsequent chapters, we will also show how to work with other kinds of data.
In this chapter, we cover the following recipes:
Many of the models and approaches used for time series modeling require the time series to be stationary. We will cover that topic in depth in Chapter 6, Time Series Analysis and Forecasting, however, we can get a quick glimpse of it now.
Stationarity assumes that the statistics (mathematical moments) of a process, such as the series’ mean and variance, do not change over time. Using that assumption, we can build models that aim to forecast the future values of the process.
However, asset prices are usually non-stationary. Their statistics not only change over time, but we can also observe some trends (general patterns over time) or seasonality (patterns repeating over fixed time intervals). By transforming the prices into returns, we attempt to make the time series stationary.
Another benefit of using returns, as opposed to prices, is normalization. It means that we can easily compare various return series, which would not be that simple with raw stock prices, as one stock might start selling at $10, while another at $1,000.
There are two types of returns:
Rt = (Pt - Pt-1)/Pt-1 = Pt/Pt-1 -1
rt = log(Pt/Pt-1) = log(Pt) - log(Pt-1)
Pt is the price of an asset in time t. In the preceding case, we do not consider dividends, which obviously impact the returns and require a small modification of the formulas.
The best practice while working with stock prices is to use adjusted values as they account for possible corporate actions, such as stock splits.
In general, log returns are often preferred over simple returns. Probably the most important reason for that is the fact that if we assume that the stock prices are log-normally distributed (which might or might not be the case for the particular time series), then the log returns would be normally distributed. And the normal distribution would work well with quite a lot of classic statistical approaches to time series modeling. Also, the difference between simple and log returns for daily/intraday data will be very small, in accordance with the general rule that log returns are smaller in value than simple returns.
In this recipe, we show how to calculate both types of returns using Apple’s stock prices.
Execute the following steps to download Apple’s stock prices and calculate simple/log returns:
import pandas as pd
import numpy as np
import yfinance as yf
df = yf.download("AAPL",
start="2010-01-01",
end="2020-12-31",
progress=False)
df = df.loc[:, ["Adj Close"]]
df["simple_rtn"] = df["Adj Close"].pct_change()
df["log_rtn"] = np.log(df["Adj Close"]/df["Adj Close"].shift(1))
df.head()
The resulting DataFrame looks as follows:
Figure 2.1: Snippet of the DataFrame containing Apple’s adjusted close prices and simple/log returns
The first row will always contain a NaN (not a number) value, as there is no previous price to use for calculating the returns.
In Step 2, we downloaded price data from Yahoo Finance and only kept the adjusted close price for the calculation of the returns.
To calculate the simple returns, we used the pct_change
method of pandas Series/DataFrame. It calculates the percentage change between the current and prior element (we can specify the number of lags, but for this specific case the default value of 1 suffices). Please bear in mind that the prior element is defined as the one in the row above the given row. In the case of working with time series data, we need to make sure that the data is sorted by the time index.
To calculate the log returns, we followed the formula given in the introduction to this recipe. When dividing each element of the series by its lagged value, we used the shift
method with a value of 1
to access the prior element. In the end, we took the natural logarithm of the divided values by using the np.log
function.
When doing different kinds of analyses, especially long-term ones, we might want to consider inflation. Inflation is the general rise of the price level of an economy over time. Or to phrase it differently, the reduction of the purchasing power of money. That is why we might want to decouple the inflation from the increase of the stock prices caused by, for example, the companies’ growth or development.
We can naturally adjust the prices of stocks directly, but in this recipe, we will focus on adjusting the returns and calculating the real returns. We can do so using the following formula:
where Rrt is the real return, Rt is the time t simple return, and stands for the inflation rate.
For this example, we use Apple’s stock prices from the years 2010 to 2020 (downloaded as in the previous recipe).
Execute the following steps to adjust the returns for inflation:
import pandas as pd
import nasdaqdatalink
nasdaqdatalink.ApiConfig.api_key = "YOUR_KEY_HERE"
df = df.resample("M").last()
df_cpi = (
nasdaqdatalink.get(dataset="RATEINF/CPI_USA",
start_date="2009-12-01",
end_date="2020-12-31")
.rename(columns={"Value": "cpi"})
)
df_cpi
Running the code generates the following table:
Figure 2.2: Snippet of the DataFrame containing the values of the Consumer Price Index (CPI)
df = df.join(df_cpi, how="left")
df["simple_rtn"] = df["Adj Close"].pct_change()
df["inflation_rate"] = df["cpi"].pct_change()
df["real_rtn"] = (
(df["simple_rtn"] + 1) / (df["inflation_rate"] + 1) - 1
)
df.head()
Running the code generates the following table:
Figure 2.3: Snippet of the DataFrame containing the calculated inflation-adjusted returns
First, we imported the libraries and authenticated with Nasdaq Data Link, which we used for downloading the inflation-related data. Then, we had to resample Apple’s stock prices to a monthly frequency, as the inflation data is provided monthly. To do so, we chained the resample
method with the last
method. This way, we took the last price of the given month.
In Step 3, we downloaded the monthly Consumer Price Index (CPI) values from Nasdaq Data Link. It is a metric that examines the weighted average of prices of a basket of consumer goods and services, such as food, transportation, and so on.
Then, we used a left join to merge the two datasets (prices and CPI). A left join is a type of operation used for merging tables that returns all rows from the left table and the matched rows from the right table while leaving the unmatched rows empty.
By default, the join
method uses the indices of the tables to carry out the actual joining. We can use the on
argument to specify which column/columns to use otherwise.
Having all the data in one DataFrame, we used the pct_change
method to calculate the simple returns and the inflation rate. Lastly, we used the formula presented in the introduction to calculate the real returns.
We have already explored how to download the inflation data from Nasdaq Data Link. Alternatively, we can use a handy library called cpi
.
import cpi
At this point, we might encounter the following warning:
StaleDataWarning: CPI data is out of date
If that is the case, we just need to run the following line of code to update the data:
cpi.update()
cpi_series = cpi.series.get()
Here we download the default CPI index (CUUR0000SA0: All items in U.S. city average, all urban consumers, not seasonally adjusted
), which will work for most of the cases. Alternatively, we can provide the items
and area
arguments to download a more tailor-made series. We can also use the get_by_id
function to download a particular CPI series.
pandas
DataFrame:
df_cpi_2 = cpi_series.to_dataframe()
df_cpi_2.query("period_type == 'monthly' and year >= 2010")
.loc[:, ["date", "value"]]
.set_index("date")
.head(12)
Figure 2.4: The first 12 values of the DataFrame containing the downloaded values of the CPI
In this step, we used some filtering to compare the data to the data downloaded before from Nasdaq Data Link. We used the query
method to only keep the monthly data from the year 2010 onward. We displayed only two selected columns and the first 12 observations, for comparison’s sake.
We will also be using the cpi
library in later chapters to directly inflate the prices using the inflate
function.
cpi
libraryWhen working with time series, and especially financial ones, we often need to change the frequency (periodicity) of the data. For example, we receive daily OHLC prices, but our algorithm works with weekly data. Or we have daily alternative data, and we want to match it with our live feed of intraday data.
The general rule of thumb for changing frequency can be broken down into the following:
For any process with independent increments (for example, the geometric Brownian motion), the variance of the logarithmic returns is proportional to time. For example, the variance of rt3 - rt1 is going to be the sum of the following two variances: rt2−rt1 and rt3−rt2, assuming t1≤t2≤t3. In such a case, when we also assume that the parameters of the process do not change over time (homogeneity) we arrive at the proportionality of the variance to the length of the time interval. Which in practice means that the standard deviation (volatility) is proportional to the square root of time.
In this recipe, we present an example of how to calculate the monthly realized volatilities for Apple using daily returns and then annualize the values. We can often encounter annualized volatility when looking at the risk-adjusted performance of an investment.
The formula for realized volatility is as follows:
Realized volatility is frequently used for calculating the daily volatility using intraday returns.
The steps we need to take are as follows:
We assume you have followed the instructions from the previous recipes and have a DataFrame called df
with a single log_rtn
column and timestamps as the index.
Execute the following steps to calculate and annualize the monthly realized volatility:
import pandas as pd
import numpy as np
def realized_volatility(x):
return np.sqrt(np.sum(x**2))
df_rv = (
df.groupby(pd.Grouper(freq="M"))
.apply(realized_volatility)
.rename(columns={"log_rtn": "rv"})
)
df_rv.rv = df_rv["rv"] * np.sqrt(12)
fig, ax = plt.subplots(2, 1, sharex=True)
ax[0].plot(df)
ax[0].set_title("Apple's log returns (2000-2012)")
ax[1].plot(df_rv)
ax[1].set_title("Annualized realized volatility")
plt.show()
Figure 2.5: Apple’s log return series and the corresponding realized volatility (annualized)
We can see that the spikes in the realized volatility coincide with some extreme returns (which might be outliers).
Normally, we could use the resample
method of a pandas
DataFrame. Supposing we wanted to calculate the average monthly return, we could use df["log_rtn"].resample("M").mean()
.
With the resample
method, we can use any built-in aggregate function of pandas
, such as mean
, sum
, min
, and max
. However, our case at hand is a bit more complex so we first defined a helper function called realized_volatility
. Because we wanted to use a custom function for aggregation, we replicated the behavior of resample
by using a combination of groupby
, Grouper
, and apply
.
We presented the most basic visualization of the results (please refer to Chapter 3, Visualizing Financial Time Series, for information about visualizing time series).
While working with any time series, it can happen that some data is missing, due to many possible reasons (someone forgot to input the data, a random issue with the database, and so on). One of the available solutions would be to discard observations with missing values. However, imagine a scenario in which we are analyzing multiple time series at once, and only one of the series is missing a value due to some random mistake. Do we still want to remove all the other potentially valuable pieces of information because of this single missing value? Probably not. And there are many other potential scenarios in which we would rather treat the missing values somehow, rather than discarding those observations.
Two of the simplest approaches to imputing missing time series data are:
In this recipe, we show how to use those techniques to easily deal with missing values in the example of the CPI time series.
Execute the following steps to try out different ways of imputing missing data:
import pandas as pd
import numpy as np
import nasdaqdatalink
nasdaqdatalink.ApiConfig.api_key = "YOUR_KEY_HERE"
df = (
nasdaqdatalink.get(dataset="RATEINF/CPI_USA",
start_date="2015-01-01",
end_date="2020-12-31")
.rename(columns={"Value": "cpi"})
)
np.random.seed(42)
rand_indices = np.random.choice(df.index, 5, replace=False)
df["cpi_missing"] = df.loc[:, "cpi"]
df.loc[rand_indices, "cpi_missing"] = np.nan
df.head()
In the following table, we can see we have successfully introduced missing values into the data:
Figure 2.6: Preview of the DataFrame with downloaded CPI data and the added missing values
for method in ["bfill", "ffill"]:
df[f"method_{method}"] = (
df[["cpi_missing"]].fillna(method=method)
)
df.loc[rand_indices].sort_index()
Running the code results in the following output:
Figure 2.7: Preview of the DataFrame after imputing the missing values
We can see that backward filling worked for all the missing values we created. However, forward filling failed to impute one value. That is because this is the first data point in the series, so there is no available value to fill forward.
df.loc[:"2017-01-01"]
.drop(columns=["cpi_missing"])
.plot(title="Different ways of filling missing values");
Running the snippet generates the following plot:
Figure 2.8: The comparison of backward and forward filling on the CPI time series
In Figure 2.8, we can clearly see how both forward and backward filling work in practice.
After importing the libraries, we downloaded the 6 years of monthly CPI data from Nasdaq Data Link. Then, we selected 5 random indices from the DataFrame to artificially create missing values. To do so, we replaced those values with NaNs.
In Step 4, we applied two different imputation methods to our time series. We used the fillna
method of a pandas
DataFrame and specified the method
argument as bfill
(backward filling) or ffill
(forward filling). We saved the imputed series as new columns, in order to clearly compare the results. Please remember that the fillna
method replaces the missing values and keeps the other values intact.
Instead of providing a method of filling the missing data, we could have specified a value of our choice, for example, 0
or 999
. However, using an arbitrary number might not make much sense in the case of time series data, so that is not advised.
We used np.random.seed(42)
to make the experiment reproducible. Each time you run this cell, you will get the same random numbers. You can use any number for the seed and the random choice will be different for each of those.
In Step 5, we inspected the imputed values. For brevity, we have only displayed the indices we have randomly selected. We used the sort_index
method to sort them by the date. This way, we can clearly see that the first value was not filled using the forward filling, as it is the very first observation in the time series.
Lastly, we plotted all the time series from the years 2015 to 2016. In the plot, we can clearly see how backward/forward filling imputes the missing values.
In this recipe, we have explored some simple methods of imputing missing data. Another possibility is to use interpolation, to which there are many different approaches. As such, in this example, we will use the linear one. Please refer to the pandas
documentation (the link is available in the See also subsection) for more information about the available methods of interpolation.
df["method_interpolate"] = df[["cpi_missing"]].interpolate()
df.loc[rand_indices].sort_index()
Running the snippet generates the following output:
Figure 2.9: Preview of the DataFrame after imputing the missing values with linear interpolation
Unfortunately, linear interpolation also cannot deal with the missing value located at the very beginning of the time series.
df.loc[:"2017-01-01"]
.drop(columns=["cpi_missing"])
.plot(title="Different ways of filling missing values");
Running the snippet generates the following plot:
Figure 2.10: The comparison of backward and forward filling on the CPI time series, including interpolation
In Figure 2.10, we can see how linear interpolation connects the known observations with a straight line to impute the missing value.
In this recipe, we explored imputing missing data for time series data. However, these are not all of the possible approaches. We could have, for example, used the moving average of the last few observations to impute any missing values. There are certainly a lot of possible methodologies to choose from. In Chapter 13, Applied Machine Learning: Identifying Credit Default, we will show how to approach the issue of missing values for other kinds of datasets.
pandas
.Another quite common preprocessing step you might encounter while working on financial tasks is converting currencies. Imagine you have a portfolio of multiple assets, priced in different currencies and you would like to arrive at a total portfolio’s worth. The simplest example might be American and European stocks.
In this recipe, we show how to easily convert stock prices from USD to EUR. However, the very same steps can be used to convert any pair of currencies.
Execute the following steps to convert stock prices from USD to EUR:
import pandas as pd
import yfinance as yf
from forex_python.converter import CurrencyRates
df = yf.download("AAPL",
start="2020-01-01",
end="2020-01-31",
progress=False)
df = df.drop(columns=["Adj Close", "Volume"])
CurrencyRates
object:
c = CurrencyRates()
df["usd_eur"] = [c.get_rate("USD", "EUR", date) for date in df.index]
for column in df.columns[:-1]:
df[f"{column}_EUR"] = df[column] * df["usd_eur"]
df.head()
Running the snippet generates the following preview:
Figure 2.11: Preview of the DataFrame containing the original prices in USD and the ones converted to EUR
We can see that we have successfully converted all four columns with prices into EUR.
In Step 1, we have imported the required libraries. Then, we downloaded Apple’s OHLC prices from January 2020 using the already covered yfinance
library.
In Step 3, we instantiated the CurrencyRates
object from the forex-python
library. Under the hood, the library is using the Forex API (https://theforexapi.com), which is a free API for accessing current and historical foreign exchange rates published by the European Central Bank.
In Step 4, we used the get_rate
method to download the USD/EUR exchange rates for all the dates available in the DataFrame with stock prices. To do so efficiently, we used list comprehension and stored the outputs in a new column. One potential drawback of the library and the present implementation is that we need to download each and every exchange rate individually, which might not be scalable for large DataFrames.
While using the library, you can sometimes run into the following error: RatesNotAvailableError: Currency Rates Source Not Ready
. The most probable cause is that you are trying to get the exchange rates from weekends. The easiest solution is to skip those days in the list comprehension/for
loop and fill in the missing values using one of the approaches covered in the previous recipe.
In the last step, we iterated over the columns of the initial DataFrame (all except the exchange rate) and multiplied the USD price by the exchange rate. We stored the outcomes in new columns, with _EUR
subscript.
Using the forex_python
library, we can easily download the exchange rates for many currencies at once. To do so, we can use the get_rates
method. In the following snippet, we download the current exchange rates of USD to the 31 available currencies. We can naturally specify the date of interest, just as we have done before.
usd_rates = c.get_rates("USD")
usd_rates
The first five entries look as follows:
{'EUR': 0.8441668073611345,
'JPY': 110.00337666722943,
'BGN': 1.651021441836907,
'CZK': 21.426641904440316,
'DKK': 6.277224379537396,
}
In this recipe, we have mostly focused on the forex_python
library, as it is quite handy and flexible. However, we might download historical exchange rates from many different sources and arrive at the same results (accounting for some margin of error depending on the data provider). Quite a few of the data providers described in Chapter 1, Acquiring Financial Data, provide historical exchange rates. Below, we show how to get those rates using Yahoo Finance.
df = yf.download("USDEUR=X",
start="2000-01-01",
end="2010-12-31",
progress=False)
df.head()
Running the snippet results in the following output:
In Figure 2.12, we can see one of the limitations of this data source—the data for this currency pair is only available since December 2003. Also, Yahoo Finance is providing the OHLC variant of the exchange rates. To arrive at a single number used for conversion, you can pick any of the four values (depending on the use case) or calculate the mid-value (the middle between low and high values).
forex-python
libraryBefore diving into building a machine learning model or designing a trading strategy, we not only need reliable data, but we also need to aggregate it into a format that is convenient for further analysis and appropriate for the models we choose. The term bars refers to a data representation that contains basic information about the price movements of any financial asset. We have already seen one form of bars in Chapter 1, Acquiring Financial Data, in which we explored how to download financial data from a variety of sources.
There, we downloaded OHLCV data sampled by some time period, be it a month, day, or intraday frequencies. This is the most common way of aggregating financial time series data and is known as the time bars.
There are some drawbacks of sampling financial time series by time:
Iceberg orders are large orders that were divided into smaller limit orders to hide the actual order quantity. They are called “iceberg orders” because the visible orders are just the “tip of the iceberg,” while a significant number of limit orders is waiting, ready to be placed.
To overcome those issues and gain a competitive edge, practitioners also use other kinds of aggregation. Ideally, they would want to have a bar representation in which each bar contains the same amount of information. Some of the alternatives they are using include:
Each of these forms of aggregations has its strengths and weaknesses that we should be aware of.
Tick bars offer a better way of tracking the actual activity in the market, together with the volatility. However, a potential issue arises out of the fact that one trade can contain any number of units of a certain asset. So, a buy order of a single share is treated equally to an order of 10,000 shares.
Volume bars are an attempt at overcoming this problem. However, they come with an issue of their own. They do not correctly reflect situations in which asset prices change significantly or when stock splits happen. This makes them unreliable for comparison between periods affected by such situations.
That is where the third type of bar comes into play—the dollar bars. It is often considered the most robust way of aggregating price data. Firstly, the dollar bars help bridge the gap with price volatility, which is especially important for highly volatile markets such as cryptocurrencies. Then, sampling by dollars is helpful to preserve the consistency of information. The second reason is that dollar bars are resistant to the outstanding amount of the security, so they are not affected by actions such as stock splits, corporate buybacks, issuance of new shares, and so on.
In this recipe, we will learn how to create all four types of bars mentioned above using trade data coming from Binance, one of the most popular cryptocurrency exchanges. We decided to use cryptocurrency data as it is much easier to obtain (free of charge) compared to, for example, equity data. However, the presented methodology remains the same for other asset classes as well.
Execute the following steps to download trade data from Binance and aggregate it into four different kinds of bars:
from binance.spot import Spot as Client
import pandas as pd
import numpy as np
BTCEUR
trades:
spot_client = Client(base_url="https://api3.binance.com")
r = spot_client.trades("BTCEUR")
pandas
DataFrame:
df = (
pd.DataFrame(r)
.drop(columns=["isBuyerMaker", "isBestMatch"])
)
df["time"] = pd.to_datetime(df["time"], unit="ms")
for column in ["price", "qty", "quoteQty"]:
df[column] = pd.to_numeric(df[column])
df
Executing the code returns the following DataFrame:
Figure 2.13: The DataFrame containing the last 500 BTC-EUR transactions
We can see that 500 transactions in the BTCEUR
market happened over a span of approximately nine minutes. For more popular markets, this window can be significantly reduced. The qty
column contains the traded amount of BTC, while quoteQty
contains the EUR price of the traded quantity, which is the same as multiplying the price
column by the qty
column.
def get_bars(df, add_time=False):
ohlc = df["price"].ohlc()
vwap = (
df.apply(lambda x: np.average(x["price"], weights=x["qty"]))
.to_frame("vwap")
)
vol = df["qty"].sum().to_frame("vol")
cnt = df["qty"].size().to_frame("cnt")
if add_time:
time = df["time"].last().to_frame("time")
res = pd.concat([time, ohlc, vwap, vol, cnt], axis=1)
else:
res = pd.concat([ohlc, vwap, vol, cnt], axis=1)
return res
df_grouped_time = df.groupby(pd.Grouper(key="time", freq="1Min"))
time_bars = get_bars(df_grouped_time)
time_bars
Running the code generates the following time bars:
Figure 2.14: Preview of the DataFrame with time bars
bar_size = 50
df["tick_group"] = (
pd.Series(list(range(len(df))))
.div(bar_size)
.apply(np.floor)
.astype(int)
.values
)
df_grouped_ticks = df.groupby("tick_group")
tick_bars = get_bars(df_grouped_ticks, add_time=True)
tick_bars
Running the code generates the following tick bars:
Figure 2.15: Preview of the DataFrame with tick bars
We can see that each group contains exactly 50 trades, just as we intended.
bar_size = 1
df["cum_qty"] = df["qty"].cumsum()
df["vol_group"] = (
df["cum_qty"]
.div(bar_size)
.apply(np.floor)
.astype(int)
.values
)
df_grouped_ticks = df.groupby("vol_group")
volume_bars = get_bars(df_grouped_ticks, add_time=True)
volume_bars
Running the code generates the following volume bars:
Figure 2.16: Preview of the DataFrame with volume bars
We can see that all the bars contain approximately the same volume. The last one is a bit smaller, simply because we did not have enough total volume in the 500 trades.
bar_size = 50000
df["cum_value"] = df["quoteQty"].cumsum()
df["value_group"] = (
df["cum_value"]
.div(bar_size)
.apply(np.floor)
.astype(int)
.values
)
df_grouped_ticks = df.groupby("value_group")
dollar_bars = get_bars(df_grouped_ticks, add_time=True)
dollar_bars
Running the code generates the following dollar bars:
Figure 2.17: Preview of the DataFrame with dollar bars
After importing the libraries, we instantiated the Binance client and downloaded the 500 most recent trades in the BTCEUR
market using the trades
method of the Binance client. We chose this one on purpose, as it is not as popular as BTCUSD
and the default 500 trades actually span a few minutes. We could increase the number of trades up to 1,000 using the limit
argument.
We have used the easiest way to download the 500 most recent trades. However, we could do better and recreate the trades over a longer period of time. To do so, we could use the historical_trades
method. It contains an additional argument called fromId
, which we could use to specify from which particular trade we would like to start our batch download. Then, we could chain those API calls using the last known ID to recreate the trade history from a longer period of time. However, to do so, we need to have a Binance account, create personal API keys, and provide them to the Client
class.
In Step 3, we prepared the data for further analysis, that is, we converted the response from the Binance client into a pandas
DataFrame, dropped two columns we will not be using, converted the time
column into datetime
, and converted to columns containing prices and quantities into numeric ones, as they were expressed as object
type, which is a string.
Then, we defined a helper function for calculating the bars per some group. The input of the function must be a DataFrameGroupBy
object, that is, the output of applying the groupby
method to a pandas
DataFrame. That is because the function calculates a bunch of aggregate statistics:
ohlc
method.np.average
method and using the quantity of the trade as the weights
argument.size
method.All of those are separate DataFrames, which we ultimately concatenated using the pd.concat
function.
In Step 5, we calculated the time bars. We had to use the groupby
method combined with pd.Grouper
. We indicated we want to create the groups on the time
column and used a one-minute frequency. Then, we passed the DataFrameGroupBy
object to our get_bars
function, which returned the time bars.
In Step 6, we calculated the tick bars. The process was slightly different than with time bars, as we first had to create the column on which we want to group the trades. The idea was that we group the trades in blocks of 50 (this is an arbitrary number and should be determined according to the logic of the analysis). To create such groups, we divided the row number by the chosen bar size, rounded the result down (using np.floor
), and converted it into an integer. Then, we grouped the trades using the newly created column and applied the get_bars
function.
In Step 7, we calculated the volume bars. The process was quite similar to the tick bars. The difference was in creating the grouping column, which this time was based on the cumulative sum of the traded quantity. We selected the bar size of 1 BTC.
The last step was to calculate the dollar bars. The process was almost identical to the volume bars, but we created the grouping column by applying a cumulative sum to the quoteQty
column, instead of the qty
one used before.
The list of alternative kinds of bars in this recipe is not exhaustive. For example, De Prado (2018) suggests using imbalance bars, which attempt to sample the data when there is an imbalance of buying/selling activity, as this might imply information asymmetry between market participants. The reasoning behind those bars is market participants either buy or sell large quantities of a given asset, but they do not frequently do both simultaneously. Hence, sampling when imbalance events occur helps to focus on large movements and pay less attention to periods without interesting activity.
In this chapter, we have learned how to preprocess financial time series data. We started by showing how to calculate returns and potentially adjust them for inflation. Then, we covered a few of the popular methods for imputing missing values. Lastly, we explained the different approaches to aggregating trade data and why choosing the correct one matters.
We should always pay significant attention to this step, as we not only want to enhance our model’s performance but also to ensure the validity of any analysis. In the next chapter, we will continue working with the preprocessed data and learn how to create time series visualization.