The following code downloads and extracts all historical filings contained in the Financial Statement and Notes (FSN) datasets for the given range of quarters (see edgar_xbrl.ipynb for additional details):
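from io import BytesIO
from pathlib import Path
from zipfile import ZipFile

import requests

SEC_URL = 'https://www.sec.gov/files/dera/data/financial-statement-and-notes-data-sets/'
data_path = Path('data')  # assumption: local target directory; adjust as needed

first_year, this_year, this_quarter = 2014, 2018, 3
past_years = range(first_year, this_year)
# all four quarters of each past year, plus the current year's quarters to date
filing_periods = [(y, q) for y in past_years for q in range(1, 5)]
filing_periods.extend([(this_year, q) for q in range(1, this_quarter + 1)])

for i, (yr, qtr) in enumerate(filing_periods, 1):
    filing = f'{yr}q{qtr}_notes.zip'
    path = data_path / f'{yr}_{qtr}' / 'source'
    path.mkdir(parents=True, exist_ok=True)  # create the target directory if needed
    response = requests.get(SEC_URL + filing).content
    with ZipFile(BytesIO(response)) as zip_file:
        for file in zip_file.namelist():
            local_file = path / file
            with local_file.open('wb') as output:
                # write the uncompressed archive member to disk
                output.write(zip_file.read(file))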
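The two steps above, enumerating the quarters and unpacking an archive held in memory, can be tried without network access. The following is a minimal sketch using only the standard library; the helper names `quarterly_periods` and `extract_archive`, as well as the tiny stand-in archive and its contents, are made up for illustration and are not part of the SEC data or the accompanying notebook.

```python
from io import BytesIO
from pathlib import Path
from tempfile import TemporaryDirectory
from zipfile import ZipFile


def quarterly_periods(first_year, last_year, last_quarter):
    """Return (year, quarter) pairs up to and including last_year/last_quarter."""
    periods = [(y, q) for y in range(first_year, last_year) for q in range(1, 5)]
    periods.extend((last_year, q) for q in range(1, last_quarter + 1))
    return periods


def extract_archive(payload, target):
    """Unpack zip bytes held in memory into the target directory."""
    target.mkdir(parents=True, exist_ok=True)
    with ZipFile(BytesIO(payload)) as zip_file:
        zip_file.extractall(target)
    return sorted(p.name for p in target.iterdir())


# Build a small stand-in archive in memory instead of downloading one.
# File names and rows are hypothetical.
buffer = BytesIO()
with ZipFile(buffer, 'w') as zip_file:
    zip_file.writestr('sub.tsv', 'adsh\tcik\nx-001\t1\n')
    zip_file.writestr('num.tsv', 'adsh\ttag\tvalue\nx-001\tAssets\t100\n')

periods = quarterly_periods(2014, 2018, 3)
with TemporaryDirectory() as tmp:
    year, quarter = periods[-1]
    target = Path(tmp) / f'{year}_{quarter}' / 'source'
    extracted = extract_archive(buffer.getvalue(), target)

print(len(periods), periods[0], periods[-1])
print(extracted)
```

Using `extractall` rather than writing each member by hand also preserves any subdirectories inside the archive, which may or may not matter for the FSN files.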
The data is fairly large. To enable faster access than the original text files permit, it is better to convert them to the binary, columnar Parquet format (see the Efficient data storage with pandas section in this chapter for a performance comparison of various data-storage options compatible with pandas DataFrames):
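import pandas as pd

for f in data_path.glob('**/*.tsv'):
    file_name = f.stem + '.parquet'
    path = f.parents[1] / 'parquet'
    path.mkdir(exist_ok=True)  # create the parquet directory next to 'source'
    # the source files are tab-delimited
    df = pd.read_csv(f, sep='\t', encoding='latin1', low_memory=False)
    df.to_parquet(path / file_name)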
For each quarter, the FSN data is organized into eight file sets that contain information about submissions, numbers, taxonomy tags, presentation, and more. Each dataset consists of rows and fields and is provided as a tab-delimited text file:
| File | Dataset | Description |
| --- | --- | --- |
| SUB | Submission | Identifies each XBRL submission by company, form, date, and so on |
| TAG | Tag | Defines and explains each taxonomy tag |
| DIM | Dimension | Adds detail to numeric and plain text data |
| NUM | Numeric | One row for each distinct data point in filing |
| TXT | Plain text | Contains all non-numeric XBRL fields |
| REN | Rendering | Information for rendering on SEC website |
| PRE | Presentation | Detail on the tag and number presentation in primary statements |
| CAL | Calculation | Shows arithmetic relationships among tags |
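Because the datasets are separate files, working with them usually means joining them. The following sketch combines a submission table with a numeric table, assuming both share an accession-number column named `adsh`; the rows, company names, and the reduced set of columns are made up for illustration, and real files contain many more fields.

```python
from io import StringIO

import pandas as pd

# Made-up tab-delimited rows standing in for the SUB and NUM files.
sub_text = 'adsh\tname\tform\nx-001\tExample Corp\t10-K\nx-002\tSample Inc\t10-Q\n'
num_text = (
    'adsh\ttag\tvalue\n'
    'x-001\tAssets\t100\n'
    'x-001\tLiabilities\t60\n'
    'x-002\tAssets\t250\n'
)

sub = pd.read_csv(StringIO(sub_text), sep='\t')
num = pd.read_csv(StringIO(num_text), sep='\t')

# Attach the submission metadata to every numeric data point.
facts = num.merge(sub, on='adsh', how='left')

# Select a single tag across all submissions.
assets = facts.loc[facts['tag'] == 'Assets', ['name', 'form', 'value']]
print(assets.to_dict('records'))
```

A left join keeps every numeric data point even if its submission record is missing, which makes gaps in the metadata easy to spot.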