The first chapter of this book is dedicated to a very important (some may say the most important) part of any data science/quantitative finance project—gathering data. In line with the famous adage “garbage in, garbage out,” we should strive to obtain data of the highest possible quality and then correctly preprocess it for later use with statistical and machine learning algorithms. The reason for this is simple—the results of our analyses are highly dependent on the input data, and no sophisticated model can compensate for poor inputs. That is also why, in our analyses, we should be able to use our (or someone else’s) understanding of the economic/financial domain to motivate the choice of certain data for, for example, modeling stock returns.
One of the most frequently reported issues among the readers of the first edition of this book was getting high-quality data. That is why in this chapter we spend more time exploring different sources of financial data. While quite a few of these vendors offer similar information (prices, fundamentals, and so on), they also offer additional, unique data that can be downloaded via their APIs. An example could be company-related news articles or pre-computed technical indicators. That is why we will download different types of data depending on the recipe. However, be sure to inspect the documentation of the library/API, as most likely its vendor also provides standard data such as prices.
Additional examples are also covered in the Jupyter notebooks, which you can find in the accompanying GitHub repository.
The data sources in this chapter were selected intentionally not only to showcase how easy it can be to gather high-quality data using Python libraries but also to show that the gathered data comes in many shapes and sizes.
Sometimes we will get a nicely formatted pandas DataFrame, while other times it might be in JSON format or even bytes that need to be processed and then loaded as a CSV. Hopefully, these recipes will sufficiently prepare you to work with any kind of data you might encounter online.
Something to bear in mind while reading this chapter is that data differs among sources. This means that the prices we download from two vendors will most likely differ, as those vendors also get their data from different sources and might use different methods to adjust the prices for corporate actions. The best practice is to find a source you trust the most concerning a particular type of data (based on, for example, opinions found online) and then use it to download the data you need. One additional thing to keep in mind is that when building algorithmic trading strategies, the data we use for modeling should align with the live data feed used for executing the trades.
This chapter does not cover one important type of data—alternative data. This could be any type of data that can be used to generate some insights into predicting asset prices. Alternative data can include satellite images (for example, tracking shipping routes, or the development of a certain area), sensor data, web traffic data, customer reviews, etc. While there are many vendors specializing in alternative data (for example, Quandl/Nasdaq Data Link), you can also get some by accessing publicly available information via web scraping. As an example, you could scrape customer reviews from Amazon or Yelp. However, those are often bigger projects and are unfortunately outside of the scope of this book. Also, you need to make sure that web scraping a particular website is not against its terms and conditions!
Using the vendors mentioned in this chapter, you can get quite a lot of information for free. But most of those providers also offer paid tiers. Remember to do thorough research on what the data suppliers actually provide and what your needs are before signing up for any of the services.
In this chapter, we cover the following recipes:
- Getting data from Yahoo Finance
- Getting data from Nasdaq Data Link
- Getting data from Intrinio
- Getting data from Alpha Vantage
- Getting data from CoinGecko
One of the most popular sources of free financial data is Yahoo Finance. It contains not only historical and current stock prices in different frequencies (daily, weekly, and monthly), but also calculated metrics, such as the beta (a measure of the volatility of an individual asset in comparison to the volatility of the entire market), fundamentals, earnings information/calendars, and many more.
For a long period of time, the go-to tool for downloading data from Yahoo Finance was the pandas-datareader library. The goal of the library was to extract data from a variety of sources and store it in the form of a pandas DataFrame. However, after some changes to the Yahoo Finance API, this functionality was deprecated. It is definitely good to be familiar with this library, as it facilitates downloading data from sources such as FRED (Federal Reserve Economic Data), the Fama/French Data Library, or the World Bank. Those might come in handy for different kinds of analyses, and some of them are presented in the following chapters.
As of now, the easiest and fastest way of downloading historical stock prices is to use the yfinance library (formerly known as fix_yahoo_finance).
For the sake of this recipe, we are interested in downloading Apple’s stock prices from the years 2011 to 2021.
Execute the following steps to download data from Yahoo Finance:
import pandas as pd
import yfinance as yf

df = yf.download("AAPL",
                 start="2011-01-01",
                 end="2021-12-31",
                 progress=False)

print(f"Downloaded {len(df)} rows of data.")
df
Running the code generates the following preview of the DataFrame:
Figure 1.1: Preview of the DataFrame with downloaded stock prices
The result of the request is a pandas DataFrame (2,769 rows) containing daily Open, High, Low, and Close (OHLC) prices, as well as the adjusted close price and volume.
Yahoo Finance automatically adjusts the close price for stock splits, that is, when a company divides the existing shares of its stock into multiple new shares, most frequently to boost the stock’s liquidity. The adjusted close price takes into account not only splits but also dividends.
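To make the split adjustment concrete, here is a minimal sketch with made-up numbers: after a hypothetical 4-for-1 split, all prices recorded before the split date are divided by the split ratio, so that the series stays comparable over time.

import pandas as pd

# made-up close prices around a hypothetical 4-for-1 split on 2020-09-01
close = pd.Series(
    [100.0, 104.0, 26.5],
    index=pd.to_datetime(["2020-08-28", "2020-08-31", "2020-09-01"]),
)

split_adjusted = close.copy()
# divide all pre-split prices by the split ratio
split_adjusted.loc[:"2020-08-31"] /= 4
# split_adjusted now holds 25.0, 26.0, and 26.5, comparable across the split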
The download function is very intuitive. In the most basic case, we just need to provide the ticker (symbol), and it will try to download all available data since 1950. In the preceding example, we downloaded daily data from a specific range (2011 to 2021).
Some additional features of the download function are listed below; a short sketch combining a few of them follows the list:
- We can download data for multiple tickers at once by providing either a list (["AAPL", "MSFT"]) or multiple tickers as a string ("AAPL MSFT").
- Setting auto_adjust=True downloads only the adjusted prices.
- We can additionally download dividends and stock splits by setting actions='inline'. Those actions can also be used to manually adjust the prices or for other analyses.
- progress=False disables the progress bar.
- The interval argument can be used to download data in different frequencies. We could also download intraday data as long as the requested period is shorter than 60 days.
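Below is a minimal sketch combining a few of the options above; the date range mirrors the earlier example, and the remaining argument values follow the descriptions in the list:

import yfinance as yf

# weekly, auto-adjusted prices for two tickers, with corporate actions included
df_multi = yf.download(["AAPL", "MSFT"],
                       start="2011-01-01",
                       end="2021-12-31",
                       interval="1wk",
                       auto_adjust=True,
                       actions="inline",
                       progress=False)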
yfinance also offers an alternative way of downloading the data—via the Ticker class. First, we need to instantiate an object of the class:
aapl_data = yf.Ticker("AAPL")
To download the historical price data, we can use the history method:
aapl_data.history()
By default, the method downloads the last month of data. We can use the same arguments as in the download function to specify the range and frequency.
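For instance, a minimal sketch reproducing our earlier request, this time via the Ticker object:

aapl_data.history(start="2011-01-01", end="2021-12-31", interval="1d")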
The main benefit of using the Ticker class is that we can download much more information than just the prices. Some of the available methods include (a short sketch follows the list):
- info—outputs a JSON object containing detailed information about the stock and its company, for example, the company’s full name, a short business summary, which exchange it is listed on, as well as a selection of financial metrics such as the beta coefficient
- actions—outputs corporate actions such as dividends and splits
- major_holders—presents the names of the major holders
- institutional_holders—shows the institutional holders
- calendar—shows the incoming events, such as the quarterly earnings
- earnings/quarterly_earnings—shows the earnings information from the last few years/quarters
- financials/quarterly_financials—contains financial information such as income before tax, net income, gross profit, EBIT, and much more
Please see the corresponding Jupyter notebook for more examples and outputs of those methods.
For a complete list of downloadable data, please refer to the GitHub repo of yfinance (https://github.com/ranaroussi/yfinance).
You can check out some alternative libraries for downloading data from Yahoo Finance:
- yahoofinancials—similarly to yfinance, this library offers the possibility of downloading a wide range of data from Yahoo Finance. The biggest difference is that all the downloaded data is returned as JSON.
- yahoo_earnings_calendar—a small library dedicated to downloading the earnings calendar.
Alternative data can be anything that is considered non-market data, for example, weather data for agricultural commodities, satellite images that track oil shipments, or even customer feedback that reflects a company’s service performance. The idea behind using alternative data is to get an “informational edge” that can then be used for generating alpha. In short, alpha is a measure of performance describing an investment strategy’s, trader’s, or portfolio manager’s ability to beat the market.
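For reference, one common way to formalize this notion is Jensen’s alpha, the return earned in excess of what the CAPM predicts for the portfolio’s level of market risk:

α = R_p − [R_f + β_p(R_m − R_f)]

where R_p is the portfolio’s return, R_f is the risk-free rate, R_m is the market’s return, and β_p is the portfolio’s beta.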
Quandl was the leading provider of alternative data products for investment professionals (including quant funds and investment banks). Recently, it was acquired by Nasdaq and is now part of the Nasdaq Data Link service. The goal of the new platform is to provide a unified source of trusted data and analytics. It offers an easy way to download data, also via a dedicated Python library.
A good starting place for financial data would be the WIKI Prices database, which contains stock prices, dividends, and splits for 3,000 US publicly traded companies. The drawback of this database is that, as of April 2018, it is no longer supported (meaning there is no recent data). However, for the purposes of getting historical data or learning how to access the databases, it is more than enough.
We use the same example that we used in the previous recipe—we download Apple’s stock prices for the years 2011 to 2021.
Before downloading the data, we need to create an account at Nasdaq Data Link (https://data.nasdaq.com/) and then authenticate our email address (otherwise, an exception is likely to occur while downloading the data). We can find our personal API key in our profile (https://data.nasdaq.com/account/profile).
Execute the following steps to download data from Nasdaq Data Link:
import pandas as pd
import nasdaqdatalink
nasdaqdatalink.ApiConfig.api_key = "YOUR_KEY_HERE"
You need to replace YOUR_KEY_HERE with your own API key.
df = nasdaqdatalink.get(dataset="WIKI/AAPL",
                        start_date="2011-01-01",
                        end_date="2021-12-31")
print(f"Downloaded {len(df)} rows of data.")
df.head()
Running the code generates the following preview of the DataFrame:
Figure 1.2: Preview of the downloaded price information
The result of the request is a DataFrame (1,818 rows) containing the daily OHLC prices, the adjusted prices, dividends, and potential stock splits. As we mentioned in the introduction, the data is limited and is only available until April 2018—the last observation actually comes from March 27, 2018.
The first step after importing the required libraries was authentication using the API key. When providing the dataset argument, we used the following structure: DATASET/TICKER.
We should keep the API keys secure and private, that is, not share them in public repositories or anywhere else. One way to make sure that the key stays private is to create an environment variable (how to do it depends on your operating system) and then load it in Python. To do so, we can use the os module. To load the NASDAQ_KEY variable, we could use the following code: os.environ.get("NASDAQ_KEY").
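Putting it together, a minimal sketch of key handling (assuming the NASDAQ_KEY environment variable was set beforehand, for example, with export NASDAQ_KEY="your_key" in the shell):

import os
import nasdaqdatalink

# read the key from the environment instead of hardcoding it in the script
nasdaqdatalink.ApiConfig.api_key = os.environ.get("NASDAQ_KEY")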
Some additional details on the get function are listed below; a short sketch using these arguments follows the list:
- We can download multiple datasets at once using a list such as ["WIKI/AAPL", "WIKI/MSFT"].
- The collapse argument can be used to define the frequency (available options are daily, weekly, monthly, quarterly, or annually).
- The transform argument can be used to carry out some basic calculations on the data prior to downloading. For example, we could calculate the row-on-row change (diff), the row-on-row percentage change (rdiff), the cumulative sum (cumul), or scale the series to start at 100 (normalize). Naturally, we can easily do the very same operations using pandas.
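For illustration, a hedged sketch requesting monthly observations transformed into month-on-month percentage changes (argument values as described above):

df_monthly = nasdaqdatalink.get(dataset="WIKI/AAPL",
                                start_date="2011-01-01",
                                end_date="2021-12-31",
                                collapse="monthly",
                                transform="rdiff")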
Nasdaq Data Link distinguishes two types of API calls for downloading data. The get function we used before is classified as a time-series API call. We can also use the tables API call with the get_table function.
Below, we download the data for multiple tickers using the get_table function:
COLUMNS = ["ticker", "date", "adj_close"]

df = nasdaqdatalink.get_table("WIKI/PRICES",
                              ticker=["AAPL", "MSFT", "INTC"],
                              qopts={"columns": COLUMNS},
                              date={"gte": "2011-01-01",
                                    "lte": "2021-12-31"},
                              paginate=True)
df.head()
Figure 1.3: Preview of the downloaded price data
This function call is a bit more complex than the one we did with the get function. We first specified the table we want to use. Then, we provided a list of tickers. As the next step, we specified which columns of the table we were interested in. We also provided the range of dates, where gte stands for greater than or equal to, while lte stands for less than or equal to. Lastly, we also indicated we wanted to use pagination. The tables API is limited to 10,000 rows per call. However, by using paginate=True in the function call, we extend the limit to 1,000,000 rows.
df = df.set_index("date")
df_wide = df.pivot(columns="ticker")
df_wide.head()
Running the code generates the following preview of the DataFrame:
Figure 1.4: Preview of the pivoted DataFrame
The output of the get_table function is in the long format. However, to make our analyses easier, we might be interested in the wide format. To reshape the data, we first set the date column as an index and then used the pivot method of a pd.DataFrame.
Please bear in mind that this is not the only way to do so, and pandas contains at least a few helpful methods/functions that can be used for reshaping the data from long to wide and vice versa; one alternative is sketched below.
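A minimal sketch of one such alternative, reusing the df from above (which already has the date set as its index):

# pivot_table reshapes directly from the long format and can aggregate duplicates
df_wide_alt = df.reset_index().pivot_table(index="date",
                                           columns="ticker",
                                           values="adj_close")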
You can also refer to the documentation of the nasdaqdatalink library for Python.
Another interesting source of financial data is Intrinio, which offers access to its free (with limits) database. The following list presents just a few of the interesting data points that we can download using Intrinio:
- historical and real-time stock prices
- intraday prices
- company news articles
- earnings records
Most of the data is free of charge, with some limits on the frequency of calling the APIs. Only the real-time price data of US stocks and ETFs requires a different kind of subscription.
In this recipe, we follow the preceding example of downloading Apple’s stock prices for the years 2011 to 2021. That is because the data returned by the API is not simply a pandas DataFrame and requires some interesting preprocessing.
Before downloading the data, we need to register at https://intrinio.com to obtain the API key.
Please see the following link (https://docs.intrinio.com/developer-sandbox) to understand what information is included in the sandbox API key (the free one).
Execute the following steps to download data from Intrinio:
import intrinio_sdk as intrinio
import pandas as pd
intrinio.ApiClient().set_api_key("YOUR_KEY_HERE")
security_api = intrinio.SecurityApi()
You need to replace YOUR_KEY_HERE with your own API key.
r = security_api.get_security_stock_prices(
    identifier="AAPL",
    start_date="2011-01-01",
    end_date="2021-12-31",
    frequency="daily",
    page_size=10000
)

df = (
    pd.DataFrame(r.stock_prices_dict)
    .sort_values("date")
    .set_index("date")
)
print(f"Downloaded {df.shape[0]} rows of data.")
df.head()
The output looks as follows:
Figure 1.5: Preview of the downloaded price information
The resulting DataFrame contains the OHLC prices and volume, as well as their adjusted counterparts. However, that is not all, and we had to cut out some additional columns to make the table fit the page. The DataFrame also contains information, such as split ratio, dividend, change in value, percentage change, and the 52-week rolling high and low values.
The first step after importing the required libraries was to authenticate using the API key. Then, we selected the API we wanted to use for the recipe—in the case of stock prices, it was the SecurityApi.
To download the data, we used the get_security_stock_prices method of the SecurityApi class. The parameters we can specify are as follows:
- identifier—stock ticker or another acceptable identifier
- start_date/end_date—these are self-explanatory
- frequency—which data frequency is of interest to us (available choices: daily, weekly, monthly, quarterly, or yearly)
- page_size—defines the number of observations to return on one page; we set it to a high number to collect all the data we need in one request, with no need for the next_page token (a sketch of paging through smaller pages follows this list)
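If we preferred smaller pages instead, a hedged sketch of cursor-based pagination could look as follows; it assumes, in line with the page_size description above, that each response exposes a next_page token that becomes empty once all pages have been consumed:

r = security_api.get_security_stock_prices(
    identifier="AAPL",
    start_date="2011-01-01",
    end_date="2021-12-31",
    frequency="daily",
    page_size=500
)
all_prices = list(r.stock_prices_dict)

# keep requesting pages until the token runs out
while r.next_page:
    r = security_api.get_security_stock_prices(
        identifier="AAPL",
        start_date="2011-01-01",
        end_date="2021-12-31",
        frequency="daily",
        page_size=500,
        next_page=r.next_page
    )
    all_prices.extend(r.stock_prices_dict)

df_paginated = pd.DataFrame(all_prices).sort_values("date").set_index("date")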
method of a pandas
DataFrame.
In this section, we show some more interesting features of Intrinio.
Not all information is included in the free tier. For a more thorough overview of what data we can download for free, please refer to the following documentation page: https://docs.intrinio.com/developer-sandbox.
You can use the previously defined security_api to get the real-time stock prices:
security_api.get_security_realtime_price("KO")
The output of the snippet is the following JSON:
{'ask_price': 57.57,
'ask_size': 114.0,
'bid_price': 57.0,
'bid_size': 1.0,
'close_price': None,
'exchange_volume': 349353.0,
'high_price': 57.55,
'last_price': 57.09,
'last_size': None,
'last_time': datetime.datetime(2021, 7, 30, 21, 45, 38, tzinfo=tzutc()),
'low_price': 48.13,
'market_volume': None,
'open_price': 56.91,
'security': {'composite_figi': 'BBG000BMX289',
'exchange_ticker': 'KO:UN',
'figi': 'BBG000BMX4N8',
'id': 'sec_X7m9Zy',
'ticker': 'KO'},
'source': 'bats_delayed',
'updated_on': datetime.datetime(2021, 7, 30, 22, 0, 40, 758000, tzinfo=tzutc())}
One of the potential ways to generate trading signals is to aggregate the market’s sentiment on the given company. We could do it, for example, by analyzing news articles or tweets. If the sentiment is positive, we can go long, and vice versa. Below, we show how to download news articles about Coca-Cola:
r = intrinio.CompanyApi().get_company_news(
    identifier="KO",
    page_size=100
)
df = pd.DataFrame(r.news_dict)
df.head()
This code returns the following DataFrame:
Figure 1.6: Preview of the news about the Coca-Cola company
Running the following snippet returns a list of companies that Intrinio’s Thea AI recognized based on the provided query string:
r = intrinio.CompanyApi().recognize_company("Intel")
df = pd.DataFrame(r.companies_dict)
df
As we can see, there are quite a few companies that also contain the phrase “intel” in their names, other than the obvious search result.
Figure 1.7: Preview of the companies connected to the phrase “intel”
We can also retrieve intraday prices using the following snippet:
response = security_api.get_security_intraday_prices(
    identifier="KO",
    start_date="2021-01-02",
    end_date="2021-01-05",
    page_size=1000
)
df = pd.DataFrame(response.intraday_prices_dict)
df
This returns the following DataFrame containing the intraday price data.
Figure 1.8: Preview of the downloaded intraday prices
Another interesting usage of the security_api is to recover the latest earnings record. We can do this using the following snippet:
r = security_api.get_security_latest_earnings_record(identifier="KO")
print(r)
The output of the API call contains quite a lot of useful information. For example, we can see what time of day the earnings call happened. This information could potentially be used for implementing trading strategies that act when the market opens.
Figure 1.9: Coca-Cola’s latest earnings record
Alpha Vantage is another popular data vendor providing high-quality financial data. Using their API, we can download, among others:
- stock prices
- cryptocurrency prices (daily and intraday) and exchange rates
- earnings data and calendars of upcoming earnings announcements and IPOs
In this recipe, we show how to download a selection of crypto-related data. We start with historical daily Bitcoin prices, and then show how to query the real-time crypto exchange rate.
Before downloading the data, we need to register at https://www.alphavantage.co/support/#api-key to obtain the API key. Access to the API and all the endpoints is free of charge (excluding the real-time stock prices) within some bounds (5 API requests per minute; 500 API requests per day).
Execute the following steps to download data from Alpha Vantage:
from alpha_vantage.cryptocurrencies import CryptoCurrencies
ALPHA_VANTAGE_API_KEY = "YOUR_KEY_HERE"
crypto_api = CryptoCurrencies(key=ALPHA_VANTAGE_API_KEY,
                              output_format="pandas")

data, meta_data = crypto_api.get_digital_currency_daily(
    symbol="BTC",
    market="EUR"
)
The meta_data object contains some useful information about the details of the query. You can see it below:
{'1. Information': 'Daily Prices and Volumes for Digital Currency',
'2. Digital Currency Code': 'BTC',
'3. Digital Currency Name': 'Bitcoin',
'4. Market Code': 'EUR',
'5. Market Name': 'Euro',
'6. Last Refreshed': '2022-08-25 00:00:00',
'7. Time Zone': 'UTC'}
The data DataFrame contains all the requested information. We obtained 1,000 daily OHLC prices, the volume, and the market capitalization. What is also noteworthy is that all the OHLC prices are provided in two currencies: EUR (as we requested) and USD (the default one).
Figure 1.10: Preview of the downloaded prices, volume, and market cap
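If we only care about one of the two quote currencies, a minimal sketch for keeping just the EUR-denominated columns (assuming, as in the preview above, that the market code appears in the column names):

data_eur = data.filter(regex="EUR")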
We can also query the real-time crypto exchange rate:

crypto_api.get_digital_currency_exchange_rate(
    from_currency="BTC",
    to_currency="USD"
)[0].transpose()
Running the command returns the following DataFrame with the current exchange rate:
Figure 1.11: BTC-USD exchange rate
After importing the alpha_vantage library, we had to authenticate using the personal API key. We did so while instantiating an object of the CryptoCurrencies class. At the same time, we specified that we would like to obtain the output in the form of a pandas DataFrame. The other possibilities are JSON and CSV.
In Step 3, we downloaded the daily BTC prices using the get_digital_currency_daily method. Additionally, we specified that we wanted to get the prices in EUR. By default, the method will return the requested EUR prices, as well as their USD equivalents.
Lastly, we downloaded the real-time BTC/USD exchange rate using the get_digital_currency_exchange_rate method.
So far, we have used the alpha_vantage library as a middleman to download information from Alpha Vantage. However, the functionalities of the data vendor evolve faster than the third-party library, and it might be interesting to learn an alternative way of accessing their API.
import requests
import pandas as pd
from io import BytesIO
AV_API_URL = "https://www.alphavantage.co/query"
parameters = {
    "function": "CRYPTO_INTRADAY",
    "symbol": "ETH",
    "market": "USD",
    "interval": "30min",
    "outputsize": "full",
    "apikey": ALPHA_VANTAGE_API_KEY
}
r = requests.get(AV_API_URL, params=parameters)
data = r.json()

df = (
    pd.DataFrame(data["Time Series Crypto (30min)"])
    .transpose()
)
df
Running the snippet above returns the following preview of the downloaded DataFrame:
Figure 1.12: Preview of the DataFrame containing Ethereum’s intraday prices
We first defined the base URL used for requesting information via Alpha Vantage’s API. Then, we defined a dictionary containing the additional parameters of the request, including the personal API key. In our function call, we specified that we want to download intraday ETH prices expressed in USD and sampled every 30 minutes. We also indicated we want a full output (by specifying the outputsize parameter). The other option is compact output, which downloads the 100 most recent observations.
Having prepared the request’s parameters, we used the get function from the requests library, providing the base URL and the parameters dictionary as arguments. After obtaining the response to the request, we accessed it in JSON format using the json method. Lastly, we converted the element of interest into a pandas DataFrame.
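At this point, a small amount of post-processing is likely needed, since the values in the JSON payload arrive as strings; a minimal sketch (column layout assumed as in the preview above):

# convert the string values to numbers and the string index to timestamps
df = df.apply(pd.to_numeric)
df.index = pd.to_datetime(df.index)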
Alpha Vantage’s documentation shows a slightly different approach to downloading this data, that is, by creating a long URL with all the parameters specified there. Naturally, that is also a possibility; however, the option presented above is a bit neater. To see the very same request URL as presented by the documentation, you can run r.request.url.
We can use the same approach to download the earnings calendar for the next three months:

AV_API_URL = "https://www.alphavantage.co/query"
parameters = {
    "function": "EARNINGS_CALENDAR",
    "horizon": "3month",
    "apikey": ALPHA_VANTAGE_API_KEY
}
r = requests.get(AV_API_URL, params=parameters)
pd.read_csv(BytesIO(r.content))
Figure 1.13: Preview of a DataFrame containing the downloaded earnings information
While getting the response to our API request is very similar to the previous example, handling the output is much different.
The output of r.content is a bytes object containing the output of the query as text. To mimic a normal file in memory, we can use the BytesIO class from the io module. Then, we can load that mimicked file as usual, using the pd.read_csv function.
In the accompanying notebook, we present a few more functionalities of Alpha Vantage, such as getting the quarterly earnings data, downloading the calendar of the upcoming IPOs, and using alpha_vantage’s TimeSeries module to download stock price data.
The last data source we will cover is dedicated purely to cryptocurrencies. CoinGecko is a popular data vendor and crypto-tracking website, on which you can find real-time exchange rates, historical data, information about exchanges, upcoming events, trading volumes, and much more.
We can list a few of the advantages of CoinGecko:
- it is free to use and, as we will see below, does not require registering for an API key
- aside from prices, it provides a wide range of crypto-related data, such as the trending coins, each coin’s detailed market data, and information about exchanges
In this recipe, we download Bitcoin’s OHLC prices from the last 14 days.
Execute the following steps to download data from CoinGecko:
from pycoingecko import CoinGeckoAPI
from datetime import datetime
import pandas as pd
cg = CoinGeckoAPI()
ohlc = cg.get_coin_ohlc_by_id(
    id="bitcoin", vs_currency="usd", days="14"
)
ohlc_df = pd.DataFrame(ohlc)
ohlc_df.columns = ["date", "open", "high", "low", "close"]
ohlc_df["date"] = pd.to_datetime(ohlc_df["date"], unit="ms")
ohlc_df
Figure 1.14: Preview of the DataFrame containing the requested Bitcoin prices
In the preceding table, we can see that we have obtained the requested 14 days of data, sampled every 4 hours.
After importing the libraries, we instantiated the CoinGeckoAPI object. Then, using its get_coin_ohlc_by_id method, we downloaded the last 14 days’ worth of BTC/USD exchange rates. It is worth mentioning that there are some limitations of the API:
- we can only request data for a predefined number of days back; the accepted values are 1/7/14/30/90/180/365/max
- the sampling frequency of the OHLC candles depends on the requested horizon; as we saw above, the 14-day request returned prices sampled every 4 hours
The output of the get_coin_ohlc_by_id method is a list of lists, which we can convert into a pandas DataFrame. We had to manually create the column names, as they were not provided by the API.
We have seen that getting the OHLC prices can be a bit more difficult using the CoinGecko API as compared to the other vendors. However, CoinGecko has additional interesting information we can download using its API. In this section, we show a few possibilities.
We can use CoinGecko to acquire the top 7 trending coins—the ranking is based on the number of searches on CoinGecko within the last 24 hours. While downloading this information, we also get the coins’ symbols, their market capitalization ranking, and the latest price in BTC:
trending_coins = cg.get_search_trending()
(
    pd.DataFrame([coin["item"] for coin in trending_coins["coins"]])
    .drop(columns=["thumb", "small", "large"])
)
Using the snippet above, we obtain the following DataFrame:
Figure 1.15: Preview of the DataFrame containing the 7 trending coins and some information about them
We can also extract current crypto prices in various currencies:
cg.get_price(ids="bitcoin", vs_currencies="usd")
Running the snippet above returns Bitcoin’s real-time price:
{'bitcoin': {'usd': 47312}}
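In pycoingecko, we can also pass lists of coins and quote currencies, which the library forwards as comma-separated values; a quick sketch:

cg.get_price(ids=["bitcoin", "ethereum"], vs_currencies=["usd", "eur"])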
In the accompanying notebook, we present a few more functionalities of pycoingecko, such as getting the crypto prices in different currencies than USD, downloading the entire list of coins supported on CoinGecko (over 9,000 coins), getting each coin’s detailed market data (market capitalization, 24h volume, the all-time high, and so on), and loading the list of the most popular exchanges.
You can find the documentation of the pycoingecko library here: https://github.com/man-c/pycoingecko.
In this chapter, we have covered a few of the most popular sources of financial data. However, this is just the tip of the iceberg. Below, you can find a list of other interesting data sources that might suit your needs even better.
Additional data sources are:
- IEX Cloud—you can download its data using pyex, the official Python library.
- Tiingo—you can download its data using the tiingo library.
- Shrimpy—you can use shrimpy-python, the official Python library for the Shrimpy Developer API.
In the next chapter, we will learn how to preprocess the downloaded data for further analysis.