Over the course of the last seven chapters we've developed a large toolbox of machine learning algorithms that we could use for machine learning problems in finance. To help round off this toolbox, we're now going to look at what you can do if your algorithms don't work.
Machine learning models fail in the worst way: silently. In traditional software, a mistake usually leads to the program crashing, and while crashes are annoying for the user, they are helpful for the programmer. At least it's clear that the code failed, and often the developer will find an accompanying crash report that describes what went wrong. As you go beyond this book and start developing your own models, you'll sometimes encounter machine learning code that crashes too, for example because the data that you fed into the algorithm had the wrong format or shape.
These issues can usually be debugged by carefully tracking which shape the data had at what point. More often, however, models that fail just output poor predictions. They give no signal that they have failed, to the point that you might not be aware that they've failed at all. At other times, the model might not train well: it won't converge, or it won't achieve a low loss.
In this chapter, we'll be focusing on how you debug these silent failures so that they don't impact the machine learning algorithms that you've created. This will include looking at the following subject areas:
The first step you must take, before even attempting to debug your program, is to acknowledge that even good machine learning engineers fail frequently. There are many reasons why machine learning projects fail, and most have nothing to do with the skills of the engineers, so don't think that just because it's not working, you're at fault.
If these bugs are spotted early enough, then both time and money can be saved. Furthermore, in high-stakes environments, including finance-based situations such as trading, engineers who are aware of these failure modes can pull the plug when they notice their model is failing. This should not be seen as a failure, but as a success in avoiding problems.
You'll remember that back in the first chapter of this book, we discussed how machine learning models are a function of their training data, meaning that, for example, bad data will lead to bad models, or as we put it, garbage in, garbage out. If your project is failing, your data is the most likely culprit. Therefore, in this chapter we will start by looking at the data first, before moving on to look at the other possible issues that might cause our model to fail.
However, even if you have a working model, the real-world data coming in might not be up to the task. In this section, we will learn how to find out whether you have good data, what to do if you have not been given enough data, and how to test your data.
There are two aspects to consider when wanting to know whether your data is up to the task of training a good model:
To find out whether your data does contain predictive information, also called a signal, you could ask yourself the question, could a human make a prediction given this data? It's important for your AI to be given data that can be comprehended by humans, because after all, the only reason we know intelligence is possible is because we observe it in humans. Humans are good at understanding written text, but if a human cannot understand a text, then the chances are that your model won't make much sense of it either.
A common pitfall to this test is that humans have context that your model does not have. A human trader does not only consume financial data, but they might have also experienced the product of a company or seen the CEO on TV. This external context flows into the trader's decision but is often forgotten when a model is built. Likewise, humans are also good at focusing on important data. A human trader will not consume all of the financial data out there because most of it is irrelevant.
Adding more inputs to your model won't make it better; on the contrary, it often makes it worse, as the model overfits and gets distracted by all the noise. On the other hand, humans are irrational; they follow peer pressure and have a hard time making decisions in abstract and unfamiliar environments. Humans would struggle to find an optimal traffic light policy, for instance, because the data that traffic lights operate on is not intuitive to us.
This brings us to the second sanity check: a human might not be able to make predictions, but there might be a causal (economic) rationale. There is a causal link between a company's profits and its share price, the traffic on a road and traffic jams, customer complaints and customers leaving your company, and so on. While humans might not have an intuitive grasp of these links, we can discover them through reasoning.
There are some tasks for which a causal link is required. For instance, for a long time, many quantitative trading firms insisted on their data having a causal link to the predicted outcomes of models. Yet nowadays, the industry seems to have slightly moved away from that idea as it gets more confident in testing its algorithms. If humans cannot make a prediction and there is no causal rationale for why your data is predictive, you might want to reconsider whether your project is feasible.
Once you have determined that your data contains enough signal, you need to ask yourself whether you have enough data to train a model to extract the signal. There is no clear answer to the question of how much is enough, but roughly speaking, the amount needed depends on the complexity of the model you hope to create. There are a couple of rules of thumb to follow, however:
Keep in mind these rules are only rules of thumb and might be very different for your specific application. If you can make use of transfer learning, then you can drastically reduce the number of samples you need. This is why most computer vision applications use transfer learning.
If you have any reasonable amount of data, say, a few hundred samples, then you can start building your model. In this case, a sensible suggestion would be to start with a simple model that you can deploy while you collect more data.
Sometimes, you find yourself in a situation where despite starting your project, you simply do not have enough data. For example, the legal team might have changed its mind and decided that you cannot use the data, for instance due to GDPR, even though they greenlit it earlier. In this case, you have multiple options.
Most of the time, one of the best options would be to "augment your data." We've already seen some data augmentation in Chapter 3, Utilizing Computer Vision. Of course, you can augment all kinds of data in various ways, including slightly changing some database entries. Taking augmentation a step further, you might be able to generate your data, for example, in a simulation. This is effectively how most reinforcement learning researchers gather data, but this can also work in other cases.
The data we used for fraud detection back in Chapter 2, Applying Machine Learning to Structured Data was obtained from simulation. The simulation requires you to be able to write down the rules of your environment within a program. Powerful learning algorithms tend to figure out these often over-simplistic rules, so they might not generalize to the real world as well. Yet, simulated data can be a powerful addition to real data.
Likewise, you can often find external data. Just because you haven't tracked a certain data point, it does not mean that nobody else has. There is an astonishing amount of data available on the internet. Even if the data was not originally collected for your purpose, you might be able to retool data by either relabeling it or by using it for transfer learning. You might be able to train a model on a large dataset for a different task and then use that model as a basis for your task. Equally, you can find a model that someone else has trained for a different task and repurpose it for your task.
Finally, you might be able to create a simple model, which does not capture the relationship in the data completely but is enough to ship a product. Random forests and other tree-based methods often require much less data than neural networks.
It's important to remember that for data, quality trumps quantity in the majority of cases. Getting a small, high-quality dataset in and training a weak model is often your best shot to find problems with data early. You can always scale up data collection later. A mistake many practitioners make is that they spend huge amounts of time and money on getting a big dataset, only to find that they have the wrong kind of data for their project.
If you build a model, you're making assumptions about your data. For example, you assume that the data you feed into your time series model is actually a time series with dates that follow each other in order. You need to test your data to make sure that this assumption is true. This is something that is especially true with live data that you receive once your model is already in production. Bad data might lead to poor model performance, which can be dangerous, especially in a high-stakes environment.
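As a minimal sketch of such an assumption check, the following uses plain pandas to look for missing, duplicated, or out-of-order dates. The function name `check_time_series` and the column name `dates` are hypothetical choices for illustration.

```python
import pandas as pd

def check_time_series(df, date_col="dates"):
    """Return a list of human-readable problems found in the date column."""
    problems = []
    dates = pd.to_datetime(df[date_col])
    if dates.isna().any():
        problems.append("missing dates")
    if dates.duplicated().any():
        problems.append("duplicate dates")
    if not dates.is_monotonic_increasing:
        problems.append("dates out of order")
    return problems

# Illustrative data only: one clean frame and one with a repeated, misordered month.
good = pd.DataFrame({"dates": ["2018-01-01", "2018-02-01", "2018-03-01"]})
bad = pd.DataFrame({"dates": ["2018-01-01", "2018-03-01", "2018-02-01", "2018-02-01"]})

print(check_time_series(good))
print(check_time_series(bad))
```

Running a check like this on every incoming batch, and refusing to predict when the list is not empty, turns a silent data problem into a loud one.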
Additionally, you need to test whether your data is clean from things such as personal information. As we'll see in the following section on privacy, personal information is a liability that you want to get rid of, unless you have good reasons and consent from the user to use it.
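One way to approach this, sketched here under the assumption that the personal information you worry about follows a recognizable text pattern, is to scan every column with a regular expression. The pattern below (US social security numbers written as 123-45-6789) and the function name `columns_with_ssn` are illustrative assumptions, not a complete personal data detector.

```python
import re
import pandas as pd

# Hypothetical pattern: US social security numbers written as 123-45-6789.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def columns_with_ssn(df):
    """Return the names of columns in which any value matches SSN_PATTERN."""
    flagged = []
    for col in df.columns:
        values = df[col].astype(str)
        if values.map(lambda v: bool(SSN_PATTERN.search(v))).any():
            flagged.append(col)
    return flagged

# Made-up records for illustration.
records = pd.DataFrame({
    "note": ["paid invoice", "ssn 123-45-6789 on file"],
    "amount": [10.5, 20.0],
})
print(columns_with_ssn(records))
```

A real check would need patterns for each kind of identifier you expect to encounter, and would still miss free-form personal details, so treat it as a first line of defense only.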
Since monitoring data quality is important when trading based on many data sources, Two Sigma Investments LP, a New York City-based international hedge fund, has created an open source library for data monitoring. It is called marbles, and you can read more about it here: https://github.com/twosigma/marbles. marbles builds on Python's unittest library.
You can install it with the following command:
pip install marbles
Note: You can find a Kaggle kernel demonstrating marbles here: https://www.kaggle.com/jannesklaas/marbles-test.
The following code sample shows a simple marbles unit test. Imagine you are gathering data about the unemployment rate in Ireland. For your models to work, you need to ensure that you actually get the data for consecutive months, and don't count one month twice, for instance.
We can ensure this happens by running the following code:
import marbles.core                                   #1
from marbles.mixins import mixins

import pandas as pd                                   #2
import numpy as np
from datetime import datetime, timedelta

class TimeSeriesTestCase(marbles.core.TestCase,
                         mixins.MonotonicMixins):     #3
    def setUp(self):                                  #4
        self.df = pd.DataFrame(
            {'dates': [datetime(2018, 1, 1),
                       datetime(2018, 2, 1),
                       datetime(2018, 2, 1)],
             'ireland_unemployment': [6.2, 6.1, 6.0]})  #5

    def tearDown(self):
        self.df = None                                #6

    def test_date_order(self):                        #7
        self.assertMonotonicIncreasing(sequence=self.df.dates,
                                       note = 'Dates need to increase monotonically')  #8
Don't worry if you don't fully understand the code. We're now going to go through each stage of the code:
The core module does the actual testing, while the mixins module provides a number of useful tests for different types of data. This simplifies your test writing and gives you more readable and semantically interpretable tests.
Our test class inherits from marbles' TestCase class. This way, our test class is automatically set up to run as a marbles test. If you want to use a mixin, you also need to inherit the corresponding mixin class.
The MonotonicMixins class provides a range of tools that allow you to test for a monotonically increasing series automatically.
The setUp function is a standard test function in which we can load the data and prepare for the test. In this case, we just need to define a pandas DataFrame by hand. Alternatively, you could also load a CSV file, load a web resource, or pursue any other way in order to get your data.
The tearDown method is a standard test method that allows us to clean up after our test is done. In this case, we just free RAM, but you can also choose to delete files or databases that were just created for testing.
Methods containing the actual tests have names that start with test_. marbles will automatically run all of the test methods after setting up.
To run a unit test in a Jupyter Notebook, we need to tell marbles to ignore the first argument; we achieve this by running the following:
if __name__ == '__main__':
    marbles.core.main(argv=['first-arg-is-ignored'], exit=False)
It's more common to run unit tests directly from the command line. So, if you saved the preceding code in a file called marbles_test.py, you could run it with this command:
python -m marbles marbles_test.py
Of course, there are problems with our data. Luckily for us, our test ensures that this error does not get passed on to our model, where it would cause a silent failure in the form of a bad prediction. Instead, the test will fail with the following error output:
F                                                     #1
==================================================================
FAIL: test_date_order (__main__.TimeSeriesTestCase)   #2
------------------------------------------------------------------
marbles.core.marbles.ContextualAssertionError: Elements in 0   2018-01-01
1   2018-02-01
2   2018-02-01                                        #3
Name: dates, dtype: datetime64[ns] are not strictly monotonically increasing

Source (<ipython-input-1-ebdbd8f0d69f>):              #4
     19
 >   20 self.assertMonotonicIncreasing(sequence=self.df.dates,
     21                                note = 'Dates need to increase monotonically')
     22
Locals:                                               #5

Note:                                                 #6
    Dates need to increase monotonically

----------------------------------------------------------------------
Ran 1 test in 0.007s

FAILED (failures=1)
So, what exactly caused the data to fail? Let's have a look:
The test_date_order method of the TimeSeriesTestCase class failed.
The point of unit testing data is to make the failures loud in order to prevent data issues from giving you bad predictions. A failure with an error message is much better than a failure without one. Often, the failure is caused by your data vendor, and testing all of the data that you receive from all of your vendors makes you aware when a vendor makes a mistake.
Unit testing data also helps you to ensure you have no data that you shouldn't have, such as personal data. Vendors need to clean datasets of all personally identifying information, such as social security numbers, but of course, they sometimes forget. Complying with ever stricter data privacy regulation is a big concern for many financial institutions engaging in machine learning.
The next section will therefore discuss how to preserve privacy and comply with regulations while still gaining benefits from machine learning.
In recent years, consumers have woken up to the fact that their data is being harvested and analyzed in ways that they cannot control, and that is sometimes against their own interest. Naturally, they are not happy about it, and regulators have had to come up with new data regulations.
At the time of writing, the European Union has introduced the General Data Protection Regulation (GDPR), but it's likely that other jurisdictions will develop stricter privacy protections, too.
This text will not go into depth on how to comply with this law specifically. However, if you wish to expand your understanding of the topic, then the UK government's guide to GDPR is a good starting place to learn more about the specifics of the regulation and how to comply with it: https://www.gov.uk/government/publications/guide-to-the-general-data-protection-regulation.
This section will outline both the key principles of the recent privacy legislation and some technological solutions that you can utilize in order to comply with these principles.
The overarching rule here is to "delete what you don't need." For a long time, a large percentage of companies have just stored all of the data that they could get their hands on, but this is a bad idea. Storing personal data is a liability for your business. It's owned by someone else, and you are on the hook for taking care of it. The next time you hear a statement such as, "We have 500,000 records in our database," think of it more along the lines of, "We have 500,000 liabilities on our books." It can be a good idea to take on liabilities, but only if there is an economic value that justifies these liabilities. What happens astonishingly often, though, is that you collect personal data by accident. Say you are tracking device usage, but accidentally include the customer ID in your records. You need practices in place that monitor and prevent such accidents. Here are four of the key ones:
Say you went for coffee with a friend, paid by credit card, and posted a picture of the coffee on Instagram. The bank might collect anonymous credit card records, but if someone went to crosscheck the credit card records against the Instagram pictures, there would only be one customer who bought a coffee and posted a picture of coffee at the same time in the same area. This way, all your credit card transactions are no longer anonymous. Consumers expect companies to be mindful of these effects.
Noise, as introduced by obfuscation, is random. When averaged over a large sample of data about a single user, the mean of the noise will be zero as it does not present a pattern by itself. The true profile of the user will be revealed. Similarly, recent research has shown that deep learning models can learn on homomorphically encrypted data. Homomorphic encryption is a method of encryption that preserves the underlying algebraic properties of the data. Mathematically, this can be expressed as follows:
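Writing m_1 and m_2 for two pieces of plain text data (the subscripts are notation assumed here), the property described in the following paragraph can be stated as:

```latex
E(m_1) + E(m_2) = E(m_1 + m_2), \qquad D\big(E(m_1 + m_2)\big) = m_1 + m_2
```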
Here E is an encryption function, m is some plain text data, and D is a decryption function. As you can see, adding the encrypted data is the same as first adding the data and then encrypting it. Adding the data, encrypting it, and then decrypting it is the same as just adding the data.
This means you can encrypt the data and still train a model on it. Homomorphic encryption is still in its infancy, but through approaches like this, you can ensure that in the case of a data breach, no sensitive individual information is leaked.
To avoid the possibility of inference of user data from the gradients, you only upload a few gradients at random. You can then apply the gradients to your master model.
To further increase the overall privacy of the system, you do not need to download all the newly updated weights from the master model to the user's device, but only a few. This way, you train your model asynchronously without ever accessing any data. If your database gets breached, no user data is lost. However, we need to note that this only works if you have a large enough user base.
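The mechanics described above can be sketched in NumPy. This is a toy simulation under stated assumptions, not a production protocol: the linear model, the ten simulated users, the learning rate, and the `share` fraction of gradients and weights that are exchanged are all hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 5
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])  # hypothetical "true" relationship

def make_user_data(n=20):
    """Private data that never leaves one user's device."""
    X = rng.normal(size=(n, n_features))
    y = X @ true_w + rng.normal(scale=0.1, size=n)
    return X, y

users = [make_user_data() for _ in range(10)]
master_w = np.zeros(n_features)               # weights held by the server
local_ws = [master_w.copy() for _ in users]   # each device's copy of the weights

def local_gradient(w, X, y):
    """Mean squared error gradient, computed on the device."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

learning_rate, share = 0.01, 0.5
for _ in range(3000):
    u = rng.integers(len(users))
    # The device downloads only a random subset of the master weights.
    download = rng.random(n_features) < share
    local_ws[u] = np.where(download, master_w, local_ws[u])
    # The gradient is computed locally on private data.
    X, y = users[u]
    grad = local_gradient(local_ws[u], X, y)
    # Only a random subset of gradient entries is uploaded and applied.
    upload = rng.random(n_features) < share
    master_w -= learning_rate * grad * upload

error = np.linalg.norm(master_w - true_w)
print(error)
```

The server only ever sees partial gradients, never the rows in `users`. A real system would also need to handle devices dropping out, secure transport, and a far larger user base than this toy setup.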
In earlier chapters, we saw the benefits of normalizing and scaling features, and we discussed how you should scale all numerical features. There are four common ways of scaling features: standardization, Min-Max scaling, mean normalization, and unit length scaling. In this section we'll break down each one:
Standardization is probably the most common way of scaling features. It's especially useful if you suspect that your data contains outliers, as it is quite robust. On the flip side, standardization does not ensure that your features are between zero and one, which is the range in which neural networks learn best.
If you know for sure that your data contains no outliers, which is the case in images, for instance, Min-Max scaling will give you a nice scaling of values between zero and one.
Mean normalization is done less frequently but, depending on your application, might be a good approach.
The length of the vector usually means the L2 norm of the vector, that is, the square root of the sum of squares. For some applications, the vector length means the L1 norm of the vector, which is the sum of the absolute values of the vector elements.
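The four scaling methods can be written out in a few lines of NumPy, using their standard textbook definitions. The sample values in `x` are made up for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])  # illustrative feature values

# Standardization: zero mean, unit standard deviation.
standardized = (x - x.mean()) / x.std()

# Min-Max scaling: values end up between zero and one.
min_max = (x - x.min()) / (x.max() - x.min())

# Mean normalization: zero mean, spread bounded by the value range.
mean_normalized = (x - x.mean()) / (x.max() - x.min())

# Unit length scaling: divide the vector by its L2 (or L1) norm.
unit_l2 = x / np.linalg.norm(x)
unit_l1 = x / np.abs(x).sum()

for name, values in [("standardized", standardized), ("min_max", min_max),
                     ("mean_normalized", mean_normalized),
                     ("unit_l2", unit_l2), ("unit_l1", unit_l1)]:
    print(name, np.round(values, 3))
```

Note that the first three methods are usually applied per feature across samples, while unit length scaling is usually applied per sample across its features.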
However you scale, it is important to only measure the scaling factors, mean, and standard deviation on the training set. These factors carry a small amount of information about the data. If you measure them over your entire dataset, then the algorithm might perform better on the test set than it will in production, due to this information advantage.
Equally importantly, you should check that your production code has proper feature scaling as well. Over time, you should recalculate your feature distribution and adjust your scaling.
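A minimal sketch of leak-free scaling looks as follows, assuming standardization and a simple 80/20 split of randomly generated data. The statistics are measured on the training rows only and then reused unchanged for the test rows.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50.0, scale=5.0, size=(100, 3))  # made-up features
train, test = data[:80], data[80:]

# Measure the scaling factors on the training set only.
mean = train.mean(axis=0)
std = train.std(axis=0)

# Apply the same factors to both sets.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0).round(3), test_scaled.mean(axis=0).round(3))
```

In production, the stored `mean` and `std` would be applied to incoming data in the same way, and recomputed when the feature distribution drifts.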
Why did your model make the prediction it made? For complex models, this question is pretty hard to answer. A global explanation for a very complex model might in itself be very complex. Local Interpretable Model-Agnostic Explanations (LIME) is a popular algorithm for model explanation that focuses on local explanations. Rather than trying to answer, "How does this model make predictions?" LIME tries to answer, "Why did the model make this prediction on this data?"
Note: The authors of LIME, Ribeiro, Singh, and Guestrin, curated a great GitHub repository around their algorithm with many explanations and tutorials, which you can find here: https://github.com/marcotcr/lime.
On Kaggle kernels, LIME is installed by default. However, you can install LIME locally with the following command:
pip install lime
The LIME algorithm works with any classifier, which is why it is model agnostic. To make an explanation, LIME cuts up the data into several sections, such as areas of an image or utterances in a text. It then creates a new dataset by removing some of these features. It runs this new dataset through the black box classifier and obtains the classifier's predicted probabilities for different classes. LIME then encodes the data as vectors describing what features were present. Finally, it trains a linear model to predict the outcomes of the black box model with different features removed. As linear models are easy to interpret, LIME will use the linear model to determine the most important features.
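To make these steps concrete, here is a simplified from-scratch sketch of the idea for text. It is not the lime library's implementation: the real algorithm also weights perturbed samples by their proximity to the original and selects a subset of features, both of which are omitted here. The toy classifier `black_box_proba` and the function `explain` are hypothetical stand-ins.

```python
import numpy as np

def black_box_proba(texts):
    """Toy stand-in classifier: class 1 is more likely with 'profit', less with 'loss'."""
    scores = np.array([2.0 * t.split().count("profit")
                       - 2.0 * t.split().count("loss") for t in texts])
    p1 = 1.0 / (1.0 + np.exp(-scores))
    return np.column_stack([1.0 - p1, p1])

def explain(text, predict_proba, n_samples=500, seed=0):
    """Rank words by their weight in a linear surrogate of the black box."""
    rng = np.random.default_rng(seed)
    words = text.split()
    # Binary vectors describing which words are present (1) or removed (0).
    masks = rng.integers(0, 2, size=(n_samples, len(words)))
    masks[0] = 1  # keep the unperturbed text as one sample
    perturbed = [" ".join(w for w, keep in zip(words, m) if keep) for m in masks]
    # Query the black box for the probability of class 1.
    target = predict_proba(perturbed)[:, 1]
    # Fit a linear model (with intercept) by ordinary least squares.
    design = np.column_stack([np.ones(n_samples), masks])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    weights = [(w, float(c)) for w, c in zip(words, coef[1:])]
    return sorted(weights, key=lambda wc: -abs(wc[1]))

explanation = explain("the company reported a record profit despite one loss",
                      black_box_proba)
for word, weight in explanation[:3]:
    print(word, round(weight, 3))
```

The words with the largest absolute weights are those the surrogate considers most influential for this one prediction, which is the local explanation LIME is after.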
Let's say that you are using a text classifier, such as one built on TF-IDF features, to classify emails such as those in the 20 newsgroup dataset. To get explanations from this classifier, you would use the following snippet:
from lime.lime_text import LimeTextExplainer               #1
explainer = LimeTextExplainer(class_names=class_names)     #2
exp = explainer.explain_instance(test_example,             #3
                                 classifier.predict_proba, #4
                                 num_features=6)           #5
exp.show_in_notebook()                                     #6
Now, let's understand what's going on in that code snippet:
We pass the classifier's prediction function to the explainer. For some models this is model.predict; for scikit models, we need to use the predict_proba method.
The explanation shows the classes with different features that the text gets classified as most often. It shows the words that most contribute to the classification in the two most frequent classes. Under that, you can see the words that contributed to the classification highlighted in the text.
As you can see, our model picked up on parts of the email address of the sender as distinguishing features, as well as the name of the university, "Rice." It sees "Caused" to be a strong indicator that the text is about atheism. Combined, these are all things we want to know when debugging datasets.
LIME does not perfectly solve the problem of explaining models. It struggles, for instance, if the interaction of multiple features leads to a certain outcome. However, it does well enough to be a useful data debugging tool. Often, models pick up on things they should not be picking up on. To debug a dataset, we need to remove all these "give-away" features that statistical models like to overfit to.
Looking back at this section, you've now seen a wide range of tools that you can use to debug your dataset. Yet, even with a perfect dataset, there can be issues when it comes to training. The next section is about how to debug your model.