Chapter 5. Parsing Textual Data with Natural Language Processing

It's no accident that Peter Brown, Co-CEO of Renaissance Technologies, one of the most successful quantitative hedge funds of all time, had previously worked at IBM, where he applied machine learning to natural language problems.

As we've explored in earlier chapters, in today's world, information drives finance, and the most important source of information is written and spoken language. Ask any finance professional what they actually spend their time on, and you will find that a significant part of it is spent reading. This covers everything from reading headlines on tickers, to reading a Form 10-K, the financial press, or various analyst reports; the list goes on and on. Automatically processing this information can increase the speed of trading and widen the breadth of information considered for trades, while at the same time reducing overall costs.

Natural language processing (NLP) is making inroads into the finance sector. As an example, insurance companies are increasingly looking to process claims automatically, while retail banks try to streamline their customer service and offer better products to their clients. The understanding of text is increasingly becoming the go-to application of machine learning within the finance sector.

Historically, NLP has relied on hand-crafted rules that were created by linguists. Today, the linguists are being replaced by neural networks that are able to learn the complex, and often hard to codify, rules of language.

In this chapter, you will learn how to build powerful natural language models with Keras, as well as how to use the spaCy NLP library.

The focus of this chapter will be on the following:

  • Fine-tuning spaCy's models for your own custom applications
  • Finding parts of speech and mapping the grammatical structure of sentences
  • Using techniques such as Bag-of-Words and TF-IDF for classification
  • Understanding how to build advanced models with the Keras functional API
  • Training models to focus with attention, as well as to translate sentences with a sequence to sequence (seq2seq) model

So, let's get started!

An introductory guide to spaCy

spaCy is a library for advanced NLP. The library, which is fast to run, also comes with a range of useful tools and pretrained models that make NLP easier and more reliable. If you're working in Kaggle kernels, you won't need to install spaCy, as it comes preinstalled together with all the models.

To use spaCy locally, you will need to install the library and download its pretrained models separately.

To install the library, we simply need to run the following command:

$ pip install -U spacy
$ python -m spacy download en

Note

This chapter makes use of the English language models, but more are available. Most features are available in English, German, Spanish, Portuguese, French, Italian, and Dutch. Entity recognition is available for many more languages through the multi-language model. Note that newer versions of spaCy (v3 and later) download models by their full name, for example en_core_web_sm, rather than the en shortcut shown above.

The core of spaCy is made up of the Doc and Vocab classes. A Doc instance contains one document, including its text, tokenized version, and recognized entities. The Vocab class, meanwhile, keeps track of all the common information found across documents.
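To make the Doc and Vocab relationship concrete, here is a minimal sketch. It uses a blank English pipeline created with spacy.blank so that it runs without downloading a pretrained model; with a pretrained model loaded via spacy.load, the Doc would additionally carry part-of-speech tags and recognized entities. The example sentence is purely illustrative.

```python
import spacy

# Create a blank English pipeline; no pretrained model download needed
nlp = spacy.blank('en')

# Processing a text returns a Doc: the text plus its tokenized form
doc = nlp('Apple is looking at buying a U.K. startup.')

# Each token in the Doc is accessible by index or iteration
print([token.text for token in doc])

# The Vocab is shared: every Doc created by this pipeline points to
# the same Vocab object, which stores common lexical information
print(doc.vocab is nlp.vocab)  # True
```

With a blank pipeline, doc.ents will be empty; entity recognition requires the pretrained models downloaded earlier.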

spaCy is useful for its pipeline features, which contain many of the parts needed for NLP. If this all seems a bit abstract right now, don't worry, as this section will show you how to use spaCy for a wide range of practical tasks.
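As a small illustration of the pipeline idea, the following sketch inspects the pipeline's components and processes several texts in a batch with nlp.pipe. It again uses a blank pipeline so it runs without a model download; a pretrained model would typically list components such as a tagger, parser, and entity recognizer in nlp.pipe_names. The example texts are made up.

```python
import spacy

nlp = spacy.blank('en')

# A blank pipeline has no processing components yet; a pretrained
# model would list components such as the tagger, parser, and NER here
print(nlp.pipe_names)

# nlp.pipe processes texts lazily in batches, which is much faster
# than calling nlp() on each text individually
texts = ['Markets rallied on Monday.', 'The Fed raised interest rates.']
docs = list(nlp.pipe(texts))
print(len(docs))  # 2
```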

Tip

You can find the data and code for this section on Kaggle at https://www.kaggle.com/jannesklaas/analyzing-the-news.

The data that we'll use for this first section is from a collection of 143,000 articles taken from 15 American publications. The data is spread out over three files. We are going to load them separately, merge them into one large DataFrame, and then delete the individual DataFrames in order to save memory.

To achieve this, we must run:

import pandas as pd

a1 = pd.read_csv('../input/articles1.csv', index_col=0)
a2 = pd.read_csv('../input/articles2.csv', index_col=0)
a3 = pd.read_csv('../input/articles3.csv', index_col=0)

df = pd.concat([a1,a2,a3])

del a1, a2, a3

As a result of running the preceding code, the data will end up looking like this:

| id    | title                     | publication    | author     | date       | year   | month | url | content                                  |
|-------|---------------------------|----------------|------------|------------|--------|-------|-----|------------------------------------------|
| 17283 | House Republicans Fret... | New York Times | Carl Hulse | 2016-12-31 | 2016.0 | 12.0  | NaN | WASHINGTON — Congressional Republicans... |

After getting our data to this state, we can then plot the distribution of publishers to get an idea of what kind of news we are dealing with.

To achieve this, we must run the following code:

import matplotlib.pyplot as plt
plt.figure(figsize=(10,7))
df.publication.value_counts().plot(kind='bar')

After successfully running this code, we'll see this chart showing the distribution of news sources from our dataset:

News page distribution

As you can see in the preceding graph, the dataset contains no articles from classical financial news media; instead, it mostly consists of articles from mainstream and politically oriented publications.
