A key goal in using ML from text data for algorithmic trading is to extract signals from documents. A document is an individual sample from a relevant text data source, such as a company report, a headline or news article, or a tweet. A corpus, in turn, is a collection of documents (plural: corpora).
The following diagram lays out the key steps to convert documents into a dataset that can be used to train a supervised ML algorithm capable of making actionable predictions:
Fundamental techniques extract text features as semantic units called tokens and use linguistic rules and dictionaries to enrich these tokens with linguistic and semantic annotations. The bag-of-words (BoW) model uses token frequency to represent documents as token vectors, which leads to the document-term matrix that is frequently used for text classification.
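As a minimal sketch of the BoW idea, the following pure-Python snippet builds a document-term matrix from a hypothetical two-document corpus (real pipelines would use a proper tokenizer and a library vectorizer rather than whitespace splitting):

```python
from collections import Counter

# Hypothetical mini-corpus; each document is one string.
corpus = [
    "profits rose as revenue grew",
    "revenue fell and profits fell",
]

# Tokenize by lowercasing and splitting on whitespace (real tokenizers
# also handle punctuation, contractions, and so on).
tokenized = [doc.lower().split() for doc in corpus]

# The vocabulary maps each unique token to a column index.
vocab = sorted({tok for doc in tokenized for tok in doc})
col = {tok: i for i, tok in enumerate(vocab)}

# Document-term matrix: one row per document, one column per token,
# entries are token counts within that document.
dtm = []
for doc in tokenized:
    counts = Counter(doc)
    dtm.append([counts.get(tok, 0) for tok in vocab])
```

Each row of `dtm` is the token-vector representation of one document; stacking the rows yields the document-term matrix used as input to a classifier.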
Advanced approaches use ML to refine features extracted by these fundamental techniques and produce more informative document models. These include topic models that reflect the joint usage of tokens across documents and word-vector models that capture the context of token usage.
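The intuition behind context-based word vectors can be sketched without any ML library: count how often tokens co-occur within a small window, then compare the resulting count vectors with cosine similarity. The toy corpus and window size below are illustrative assumptions, not part of any specific model:

```python
import math
from collections import defaultdict

# Hypothetical toy corpus of tokenized sentences.
sentences = [
    ["the", "stock", "price", "rose"],
    ["the", "share", "price", "rose"],
    ["the", "analyst", "wrote", "a", "report"],
]

window = 2  # number of context tokens considered on each side

# Count how often each token co-occurs with each context token.
cooc = defaultdict(lambda: defaultdict(int))
for sent in sentences:
    for i, tok in enumerate(sent):
        lo, hi = max(0, i - window), min(len(sent), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                cooc[tok][sent[j]] += 1

def cosine(u, v):
    # Cosine similarity between two sparse count vectors (dicts).
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in set(u) | set(v))
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)
```

Because "stock" and "share" appear in the same contexts, their co-occurrence vectors are more similar than those of unrelated tokens such as "stock" and "report"; word-vector models like word2vec learn dense embeddings from the same kind of contextual evidence.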
We will review key decisions made at each step and related trade-offs in more detail before illustrating their implementation using the spaCy library in the next section. The following table summarizes the key tasks of an NLP pipeline:
| Feature | Description |
| --- | --- |
| Tokenization | Segments text into words, punctuation marks, and so on. |
| POS tagging | Assigns word types to tokens, such as a verb or noun. |
| Dependency parsing | Labels syntactic token dependencies, such as subject <=> object. |
| Stemming and lemmatization | Assigns the base forms of words: was => be, rats => rat. |
| Sentence boundary detection | Finds and segments individual sentences. |
| Named entity recognition | Labels real-world objects, such as people, companies, and locations. |
| Similarity | Evaluates the similarity of words, text spans, and documents. |