A key goal in using ML from text data for algorithmic trading is to extract signals from documents. A document is an individual sample from a relevant text data source, such as a company report, a headline or news article, or a tweet. A corpus, in turn, is a collection of documents (plural: corpora).
The following diagram lays out the key steps to convert documents into a dataset that can be used to train a supervised ML algorithm capable of making actionable predictions:
Fundamental techniques extract text features as semantic units called tokens and use linguistic rules and dictionaries to enrich these tokens with linguistic and semantic annotations. The bag-of-words (BoW) model uses token frequency to represent documents as token vectors, which leads to the document-term matrix that is frequently used for text classification.
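As a minimal sketch of the BoW idea, the following pure-Python snippet builds a document-term matrix from a hypothetical two-document corpus (real pipelines would use a proper tokenizer and a library vectorizer rather than whitespace splitting):

```python
from collections import Counter

# Hypothetical mini-corpus; each document is one string.
corpus = [
    "profits rose as revenue grew",
    "revenue fell and profits fell",
]

# Tokenize by lowercasing and splitting on whitespace (real tokenizers
# also handle punctuation, contractions, and so on).
tokenized = [doc.lower().split() for doc in corpus]

# The vocabulary maps each unique token to a column index.
vocab = sorted({tok for doc in tokenized for tok in doc})
col = {tok: i for i, tok in enumerate(vocab)}

# Document-term matrix: one row per document, one column per token,
# entries are token counts within that document.
dtm = []
for doc in tokenized:
    counts = Counter(doc)
    dtm.append([counts.get(tok, 0) for tok in vocab])
```

Each row of `dtm` is the token-vector representation of one document; stacking the rows yields the document-term matrix used as input to a classifier.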
Advanced approaches use ML to refine features extracted by these fundamental techniques and produce more informative document models. These include topic models that reflect the joint usage of tokens across documents and word-vector models that capture the context of token usage.
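The intuition behind context-based word vectors can be sketched without any ML library: count how often tokens co-occur within a small window, then compare the resulting count vectors with cosine similarity. The toy corpus and window size below are illustrative assumptions, not part of any specific model:

```python
import math
from collections import defaultdict

# Hypothetical toy corpus of tokenized sentences.
sentences = [
    ["the", "stock", "price", "rose"],
    ["the", "share", "price", "rose"],
    ["the", "analyst", "wrote", "a", "report"],
]

window = 2  # number of context tokens considered on each side

# Count how often each token co-occurs with each context token.
cooc = defaultdict(lambda: defaultdict(int))
for sent in sentences:
    for i, tok in enumerate(sent):
        lo, hi = max(0, i - window), min(len(sent), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                cooc[tok][sent[j]] += 1

def cosine(u, v):
    # Cosine similarity between two sparse count vectors (dicts).
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in set(u) | set(v))
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)
```

Because "stock" and "share" appear in the same contexts, their co-occurrence vectors are more similar than those of unrelated tokens such as "stock" and "report"; word-vector models like word2vec learn dense embeddings from the same kind of contextual evidence.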
We will review key decisions made at each step and related trade-offs in more detail before illustrating their implementation using the spaCy library in the next section. The following table summarizes the key tasks of an NLP pipeline:
| Feature | Description |
| --- | --- |
| Tokenization | Segments text into words, punctuation marks, and so on. |
| POS tagging | Assigns word types to tokens, such as a verb or noun. |
| Dependency parsing | Labels syntactic token dependencies, such as subject <=> object. |
| Stemming and lemmatization | Assigns the base forms of words: was => be, rats => rat. |
| Sentence boundary detection | Finds and segments individual sentences. |
| Named entity recognition | Labels real-world objects, such as people, companies, and locations. |
| Similarity | Evaluates the similarity of words, text spans, and documents. |