Chapter 8. Privacy, Debugging, and Launching Your Products

Over the course of the last seven chapters, we've developed a large toolbox of machine learning algorithms that we can use for machine learning problems in finance. To help round off this toolbox, we're now going to look at what you can do when your algorithms don't work.

Machine learning models fail in the worst way: silently. In traditional software, a mistake usually leads to the program crashing, and while crashes are annoying for the user, they are helpful for the programmer. At least it's clear that the code failed, and often the developer will find an accompanying crash report that describes what went wrong. As you go beyond this book and start developing your own models, you will sometimes encounter machine learning code that crashes too, for example because the data you fed into the algorithm had the wrong format or shape.

These issues can usually be debugged by carefully tracking which shape the data had at what point. More often, however, models that fail simply output poor predictions. They give no signal that they have failed, to the point that you might not even be aware that they have failed at all. At other times, the model might not train well: it won't converge, or it won't achieve a low loss.

In this chapter, we'll be focusing on how you debug these silent failures so that they don't impact the machine learning algorithms that you've created. This will include looking at the following subject areas:

  • Finding flaws in your data that lead to flaws in your learned model
  • Using creative tricks to make your model learn more from less data
  • Unit testing data in production or training to ensure standards are met
  • Being mindful of privacy and regulation, such as GDPR
  • Preparing data for training and avoiding common pitfalls
  • Inspecting the model and peering into the "black box"
  • Finding optimal hyperparameters
  • Scheduling learning rates in order to reduce overfitting
  • Monitoring training progress with TensorBoard
  • Deploying machine learning products and iterating on them
  • Speeding up training and inference

The first step you must take, before even attempting to debug your program, is to acknowledge that even good machine learning engineers fail frequently. There are many reasons why machine learning projects fail, and most have nothing to do with the skills of the engineers, so don't think that just because it's not working, you're at fault.

If these bugs are spotted early enough, then both time and money can be saved. Furthermore, in high-stakes environments, such as trading, engineers who are aware of these failure modes can pull the plug when they notice their model is failing. This should not be seen as a failure, but as a success: a problem was avoided.

Debugging data

You'll remember that back in the first chapter of this book, we discussed how machine learning models are a function of their training data, meaning that, for example, bad data will lead to bad models, or as we put it, garbage in, garbage out. If your project is failing, your data is the most likely culprit. Therefore, in this chapter we will start by looking at the data first, before moving on to the other possible issues that might cause our model to fail.

However, even if you have a working model, the real-world data coming in might not be up to the task. In this section, we will learn how to find out whether you have good data, what to do if you have not been given enough data, and how to test your data.

How to find out whether your data is up to the task

There are two aspects to consider when wanting to know whether your data is up to the task of training a good model:

  • Does the data predict what you want it to predict?
  • Do you have enough data?

To find out whether your data contains predictive information, also called a signal, ask yourself the question: could a human make a prediction given this data? It's important for your AI to be given data that can be comprehended by humans, because, after all, the only reason we know intelligence is possible is that we observe it in humans. Humans are good at understanding written text, but if a human cannot understand a text, then the chances are that your model won't make much sense of it either.

A common pitfall of this test is that humans have context that your model does not have. A human trader does not only consume financial data; they might also have experienced the product of a company or seen the CEO on TV. This external context flows into the trader's decision but is often forgotten when a model is built. Likewise, humans are also good at focusing on important data. A human trader will not consume all of the financial data out there because most of it is irrelevant.

Adding more inputs to your model won't make it better; on the contrary, it often makes it worse, as the model overfits and gets distracted by all the noise. On the other hand, humans are irrational; they follow peer pressure and have a hard time making decisions in abstract and unfamiliar environments. Humans would struggle to find an optimal traffic light policy, for instance, because the data that traffic lights operate on is not intuitive to us.

This brings us to the second sanity check: a human might not be able to make predictions, but there might be a causal (economic) rationale. There is a causal link between a company's profits and its share price, the traffic on a road and traffic jams, customer complaints and customers leaving your company, and so on. While humans might not have an intuitive grasp of these links, we can discover them through reasoning.

There are some tasks for which a causal link is required. For instance, for a long time, many quantitative trading firms insisted on their data having a causal link to the predicted outcomes of models. Yet nowadays, the industry seems to have slightly moved away from that idea as it gets more confident in testing its algorithms. If humans cannot make a prediction and there is no causal rationale for why your data is predictive, you might want to reconsider whether your project is feasible.

Once you have determined that your data contains enough signal, you need to ask yourself whether you have enough data to train a model to extract the signal. There is no clear answer to the question of how much is enough, but roughly speaking, the amount needed depends on the complexity of the model you hope to create. There are a couple of rules of thumb to follow, however:

  • For classification, you should have around 30 independent samples per class.
  • You should have 10 times as many samples as there are features, especially for structured data problems.
  • Your dataset should get bigger as the number of parameters in your model gets bigger.

Keep in mind these rules are only rules of thumb and might be very different for your specific application. If you can make use of transfer learning, then you can drastically reduce the number of samples you need. This is why most computer vision applications use transfer learning.
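To make these rules of thumb concrete, here is a minimal sketch of a check you could run on a pandas DataFrame before modeling. The column name, thresholds, and toy data are purely illustrative assumptions:

import pandas as pd

def rough_data_size_check(df, label_col='label'):
    n_samples = len(df)
    n_features = df.shape[1] - 1                # every column except the label
    class_counts = df[label_col].value_counts()

    if (class_counts < 30).any():               # ~30 independent samples per class
        print('Warning: some classes have fewer than ~30 samples:')
        print(class_counts[class_counts < 30])
    if n_samples < 10 * n_features:             # ~10x as many samples as features
        print('Warning: only {} samples for {} features'.format(n_samples, n_features))

toy = pd.DataFrame({'f1': range(50), 'f2': range(50),
                    'label': ['a'] * 40 + ['b'] * 10})
rough_data_size_check(toy)                      # warns about the small class 'b'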

If you have any reasonable amount of data, say, a few hundred samples, then you can start building your model. In this case, a sensible suggestion would be to start with a simple model that you can deploy while you collect more data.

What to do if you don't have enough data

Sometimes, you find yourself in a situation where despite starting your project, you simply do not have enough data. For example, the legal team might have changed its mind and decided that you cannot use the data, for instance due to GDPR, even though they greenlit it earlier. In this case, you have multiple options.

Most of the time, one of the best options would be to "augment your data." We've already seen some data augmentation in Chapter 3, Utilizing Computer Vision. Of course, you can augment all kinds of data in various ways, including slightly changing some database entries. Taking augmentation a step further, you might be able to generate your data, for example, in a simulation. This is effectively how most reinforcement learning researchers gather data, but this can also work in other cases.

The data we used for fraud detection back in Chapter 2, Applying Machine Learning to Structured Data was obtained from simulation. The simulation requires you to be able to write down the rules of your environment within a program. Powerful learning algorithms tend to figure out these often over-simplistic rules, so they might not generalize to the real world as well. Yet, simulated data can be a powerful addition to real data.
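Coming back to simple augmentation of structured data, the following sketch creates extra training rows by adding a small amount of Gaussian noise to numeric columns. The column names, noise scale, and number of copies are assumptions for the example; whether this kind of augmentation helps should always be validated on a held-out set:

import numpy as np
import pandas as pd

def augment_with_noise(df, numeric_cols, noise_scale=0.01, copies=3, seed=42):
    rng = np.random.default_rng(seed)
    augmented = [df]
    for _ in range(copies):
        jittered = df.copy()
        for col in numeric_cols:
            std = df[col].std()
            # Perturb each value by a fraction of the column's standard deviation
            jittered[col] = df[col] + rng.normal(0, noise_scale * std, size=len(df))
        augmented.append(jittered)
    return pd.concat(augmented, ignore_index=True)

small = pd.DataFrame({'amount': [10.0, 250.0, 42.0], 'label': [0, 1, 0]})
bigger = augment_with_noise(small, numeric_cols=['amount'])
print(len(bigger))                              # 12 rows: the original 3 plus 3 jittered copies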

Likewise, you can often find external data. Just because you haven't tracked a certain data point, it does not mean that nobody else has. There is an astonishing amount of data available on the internet. Even if the data was not originally collected for your purpose, you might be able to retool data by either relabeling it or by using it for transfer learning. You might be able to train a model on a large dataset for a different task and then use that model as a basis for your task. Equally, you can find a model that someone else has trained for a different task and repurpose it for your task.

Finally, you might be able to create a simple model, which does not capture the relationship in the data completely but is enough to ship a product. Random forests and other tree-based methods often require much less data than neural networks.

It's important to remember that for data, quality trumps quantity in the majority of cases. Getting a small, high-quality dataset in and training a weak model is often your best shot to find problems with data early. You can always scale up data collection later. A mistake many practitioners make is that they spend huge amounts of time and money on getting a big dataset, only to find that they have the wrong kind of data for their project.

Unit testing data

If you build a model, you're making assumptions about your data. For example, you assume that the data you feed into your time series model is actually a time series with dates that follow each other in order. You need to test your data to make sure that this assumption is true. This is something that is especially true with live data that you receive once your model is already in production. Bad data might lead to poor model performance, which can be dangerous, especially in a high-stakes environment.

Additionally, you need to test whether your data is clean from things such as personal information. As we'll see in the following section on privacy, personal information is a liability that you want to get rid of, unless you have good reasons and consent from the user to use it.

Since monitoring data quality is important when trading based on many data sources, Two Sigma Investments LP, a New York City-based international hedge fund, has created an open source library for data monitoring. It is called marbles, and you can read more about it here: https://github.com/twosigma/marbles. marbles builds on Python's unittest library.

You can install it with the following command:

pip install marbles

Note

Note: You can find a Kaggle kernel demonstrating marbles here: https://www.kaggle.com/jannesklaas/marbles-test.

The following code sample shows a simple marbles unit test. Imagine you are gathering data about the unemployment rate in Ireland. For your models to work, you need to ensure that you actually get the data for consecutive months, and don't count one month twice, for instance.

We can ensure this happens by running the following code:

import marbles.core                                 #1
from marbles.mixins import mixins

import pandas as pd                                 #2
import numpy as np
from datetime import datetime, timedelta

class TimeSeriesTestCase(marbles.core.TestCase, mixins.MonotonicMixins):    #3
    def setUp(self):                                                        #4
        self.df = pd.DataFrame({'dates': [datetime(2018, 1, 1),
                                          datetime(2018, 2, 1),
                                          datetime(2018, 2, 1)],
                                'ireland_unemployment': [6.2, 6.1, 6.0]})   #5

    def tearDown(self):
        self.df = None                                                      #6

    def test_date_order(self):                                              #7
        self.assertMonotonicIncreasing(sequence=self.df.dates,
                                       note='Dates need to increase monotonically')   #8

Don't worry if you don't fully understand the code. We're now going to go through each stage of the code:

  1. Marbles features two main components. The core module does the actual testing, while the mixins module provides a number of useful tests for different types of data. This simplifies your test writing and gives you more readable and semantically interpretable tests.
  2. You can use all the libraries, like pandas, that you would usually use to handle and process data for testing.
  3. Now it is time to define our test class. A new test class must inherit marbles' TestCase class. This way, our test class is automatically set up to run as a marbles test. If you want to use a mixin, you also need to inherit the corresponding mixin class. In this example, we are working with a series of dates that should be increasing monotonically, and the MonotonicMixins class provides a range of tools that allow you to test for a monotonically increasing series automatically. If you are coming from Java programming, the concept of multiple inheritance might strike you as weird, but in Python, classes can easily inherit multiple other classes. This is useful if you want your class to inherit two different capabilities, such as running a test and testing time-related concepts.
  4. The setUp function is a standard test function in which we can load the data and prepare for the test. In this case, we just need to define a pandas DataFrame by hand. Alternatively, you could also load a CSV file, load a web resource, or pursue any other way to get your data.
  5. In our DataFrame, we have the Irish unemployment rate for two months. As you can see, the last month has been counted twice. As this should not happen, it will cause an error.
  6. The tearDown method is a standard test method that allows us to clean up after our test is done. In this case, we just free RAM, but you can also choose to delete files or databases that were just created for testing.
  7. Methods describing actual tests should start with test_. marbles will automatically run all of the test methods after setting up.
  8. We assert that the time indicator of our data strictly increases. If our assertion had required intermediate variables, such as a maximum value, marbles would display them in the error report. To make our error more readable, we can attach a handy note.

To run a unit test in a Jupyter Notebook, we need to tell marbles to ignore the first argument; we achieve this by running the following:

if __name__ == '__main__':
    marbles.core.main(argv=['first-arg-is-ignored'], exit=False)

It's more common to run unit tests directly from the command line. So, if you saved the preceding code in a file called marbles_test.py, you could run it with this command:

python -m marbles marbles_test.py

Of course, there are problems with our data. Luckily for us, our test ensures that this error does not get passed on to our model, where it would cause a silent failure in the form of a bad prediction. Instead, the test will fail with the following error output:

Note

Note: The preceding test will not pass; what follows is the error output it produces, not code to run.

F                                                     #1
==================================================================
FAIL: test_date_order (__main__.TimeSeriesTestCase)   #2
------------------------------------------------------------------
marbles.core.marbles.ContextualAssertionError: Elements in 0   2018-01-01
1   2018-02-01
2   2018-02-01                                        #3
Name: dates, dtype: datetime64[ns] are not strictly monotonically increasing

Source (<ipython-input-1-ebdbd8f0d69f>):              #4
     19 
 >   20 self.assertMonotonicIncreasing(sequence=self.df.dates,
     21                           note = 'Dates need to increase monotonically')
     22 
Locals:                                               #5

Note:                                                 #6
    Dates need to increase monotonically


----------------------------------------------------------------------
Ran 1 test in 0.007s

FAILED (failures=1)                                   #7

So, what exactly caused the data to fail? Let's have a look:

  1. The top line shows the status of the entire test. In this case, there was only one test method, and it failed. Your test might have multiple different test methods, and marbles would display the progress by showing how tests fail or pass.
  2. The next couple of lines describe the failed test method. This line describes that the test_date_order method of the TimeSeriesTestCase class failed.
  3. marbles shows precisely how the test failed. The values of the dates tested are shown, together with the cause for failure.
  4. In addition to the actual failure, marbles will display a traceback showing the actual code where our test failed.
  5. A special feature of marbles is the ability to display local variables. This way, we can ensure that there was no problem with the setup of the test. It also helps us in getting the context as to how exactly the test failed.
  6. Finally, marbles will display our note, which helps the test consumer understand what went wrong.
  7. As a summary, marbles displays that the test failed with one failure. Sometimes, you may be able to accept data even though it failed some tests, but more often than not you'll want to dig in and see what is going on.

The point of unit testing data is to make failures loud in order to prevent data issues from giving you bad predictions. A failure with an error message is much better than a failure without one. Often, the failure is caused by your data vendor, and by testing all of the data you get from all of your vendors, you will notice quickly when a vendor makes a mistake.

Unit testing data also helps you to ensure you have no data that you shouldn't have, such as personal data. Vendors need to clean datasets of all personally identifying information, such as social security numbers, but of course, they sometimes forget. Complying with ever stricter data privacy regulation is a big concern for many financial institutions engaging in machine learning.
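As a minimal sketch of what such a test could look like, the following uses Python's built-in unittest module (which marbles extends) to assert that no text column contains strings that look like US social security numbers. The sample data, column selection, and regular expression are illustrative assumptions; adapt them to the identifiers that matter for your data:

import re
import unittest

import pandas as pd

SSN_PATTERN = re.compile(r'\b\d{3}-\d{2}-\d{4}\b')

class PIITestCase(unittest.TestCase):
    def setUp(self):
        # In practice, you would load the vendor file here; the second row
        # deliberately contains an SSN-like string, so this test will fail
        self.df = pd.DataFrame({'notes': ['paid on time',
                                          'SSN 123-45-6789 on file']})

    def test_no_ssn_like_strings(self):
        text_cols = self.df.select_dtypes(include='object')
        matches = text_cols.apply(lambda col: col.str.contains(SSN_PATTERN, na=False))
        self.assertFalse(matches.any().any(),
                         'Dataset appears to contain SSN-like strings')

if __name__ == '__main__':
    unittest.main(argv=['first-arg-is-ignored'], exit=False)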

The next section will therefore discuss how to preserve privacy and comply with regulations while still gaining benefits from machine learning.

Keeping data private and complying with regulations

In recent years, consumers have woken up to the fact that their data is being harvested and analyzed in ways that they cannot control, and that is sometimes against their own interest. Naturally, they are not happy about it, and regulators have responded with new data regulations.

At the time of writing, the European Union has introduced the General Data Protection Regulation (GDPR), and it's likely that other jurisdictions will develop stricter privacy protections, too.

This text will not go into depth on how to comply with this law specifically. However, if you wish to expand your understanding of the topic, then the UK government's guide to GDPR is a good starting place to learn more about the specifics of the regulation and how to comply with it: https://www.gov.uk/government/publications/guide-to-the-general-data-protection-regulation.

This section will outline both the key principles of the recent privacy legislation and some technological solutions that you can utilize in order to comply with these principles.

The overarching rule here is: "delete what you don't need." For a long time, a large percentage of companies have simply stored all of the data they could get their hands on, but this is a bad idea. Storing personal data is a liability for your business. It's owned by someone else, and you are on the hook for taking care of it. The next time you hear a statement such as, "We have 500,000 records in our database," think of it more along the lines of, "We have 500,000 liabilities on our books." It can be a good idea to take on liabilities, but only if there is an economic value that justifies them. Astonishingly often, though, you might collect personal data by accident. Say you are tracking device usage, but accidentally include the customer ID in your records. You need practices in place that monitor and prevent such accidents; here are five of the key ones:

  • Be transparent and obtain consent: Customers want good products, and they understand how their data can make your product better for them. Rather than pursuing an adversarial approach in which you wrap all your practices in a very long agreement and then make users agree to it, it is usually more sensible to clearly tell users what you are doing, how their data is used, and how that improves the product. If you need personal data, you need consent. Being transparent will help you down the line as users will trust you more and this can then be used to improve your product through customer feedback.
  • Remember that breaches happen to the best: No matter how good your security is, there is a chance that you'll get hacked. So, you should design your personal data storage under the assumption that the entire database might be dumped on the internet one day. This assumption will help you to create stronger privacy and help you to avoid disaster once you actually get hacked.
  • Be mindful about what can be inferred from data: You might not be tracking personally identifying information in your database, but when combined with another database, your customers can still be individually identified.

    Say you went for coffee with a friend, paid by credit card, and posted a picture of the coffee on Instagram. The bank might collect anonymous credit card records, but if someone went to crosscheck the credit card records against the Instagram pictures, there would only be one customer who bought a coffee and posted a picture of coffee at the same time in the same area. This way, all your credit card transactions are no longer anonymous. Consumers expect companies to be mindful of these effects.

  • Encrypt and obfuscate data: Apple, for instance, collects phone data but adds random noise to the collected data. The noise renders each individual record incorrect, but in aggregate the records still give a picture of user behavior. There are a few caveats to this approach; for example, you can only collect so many data points from a user before the noise cancels out and the individual behavior is revealed.

    Noise, as introduced by obfuscation, is random, so when averaged over a large sample of data about a single user, its mean will be zero and the user's true profile will be revealed; a minimal sketch of this kind of noise addition follows this list. Separately, recent research has shown that deep learning models can learn on homomorphically encrypted data. Homomorphic encryption is a method of encryption that preserves the underlying algebraic properties of the data. Mathematically, this can be expressed as follows:

    E(m1) + E(m2) = E(m1 + m2)

    D(E(m1 + m2)) = m1 + m2

    Here E is an encryption function, m is some plain text data, and D is a decryption function. As you can see, adding the encrypted data is the same as first adding the data and then encrypting it. Adding the data, encrypting it, and then decrypting it is the same as just adding the data.

    This means you can encrypt the data and still train a model on it. Homomorphic encryption is still in its infancy, but through approaches like this, you can ensure that in the case of a data breach, no sensitive individual information is leaked.

  • Train locally, and upload only a few gradients: One way to avoid uploading user data is to train your model on the user's device. The user accumulates data on the device. You can then download your model on to the device and perform a single forward and backward pass on the device.

    To avoid the possibility of inference of user data from the gradients, you only upload a few gradients at random. You can then apply the gradients to your master model.

    To further increase the overall privacy of the system, you do not need to download all of the newly updated weights from the master model to the user's device, but only a few. This way, you train your model asynchronously without ever accessing any data. If your database gets breached, no user data is lost. However, we need to note that this only works if you have a large enough user base.
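The following is a minimal sketch of the noise-obfuscation idea mentioned above: each user's answer is perturbed with Laplace noise, so an individual record reveals little, while the aggregate remains informative. The data, noise distribution, and epsilon value are illustrative assumptions, not a production-ready differential privacy implementation:

import numpy as np

rng = np.random.default_rng(0)
true_answers = rng.integers(0, 2, size=10_000)        # 1 = user clicked, 0 = did not

epsilon = 0.5                                          # smaller = more privacy, more noise
noisy_answers = true_answers + rng.laplace(loc=0.0,
                                           scale=1.0 / epsilon,
                                           size=true_answers.shape)

# Any single noisy answer is unreliable, but the aggregate is close to the truth
print('True click rate:     ', true_answers.mean())
print('Estimated click rate:', noisy_answers.mean())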

Preparing the data for training

In earlier chapters, we saw the benefits of normalizing and scaling features, and we discussed that you should scale all numerical features. There are four common ways of feature scaling: standardization, Min-Max rescaling, mean normalization, and unit length scaling. In this section we'll break down each one; a short code sketch implementing all four follows the list:

  • Standardization ensures that all of the data has a mean of zero and a standard deviation of one. It is computed by subtracting the mean and dividing by the standard deviation of the data:
    x' = (x - μ) / σ

    This is probably the most common way of scaling features. It's especially useful if you suspect that your data contains outliers as it is quite robust. On the flip side, standardization does not ensure that your features are between zero and one, which is the range in which neural networks learn best.

  • Min-Max rescaling does exactly that. It scales all data between zero and one by first subtracting the minimum value and then dividing by the range of values. We can see this expressed in the formula below:
    x' = (x - min(x)) / (max(x) - min(x))

    If you know for sure that your data contains no outliers, which is the case in images, for instance, Min-Max scaling will give you a nice scaling of values between zero and one.

  • Similar to Min-Max, mean normalization ensures your data has values between minus one and one with a mean of zero. This is done by subtracting the mean and then dividing by the range of data, which is expressed in the following formula:
    x' = (x - μ) / (max(x) - min(x))

    Mean normalization is done less frequently but, depending on your application, might be a good approach.

  • For some applications, it is better to not scale individual features, but instead vectors of features. In this case, you would apply unit length scaling by dividing each element in the vector by the total length of the vector, as we can see below:
    x' = x / ||x||

    The length of the vector usually means the L2 norm of the vector, ||x||₂, that is, the square root of the sum of squared elements. For some applications, the vector length means the L1 norm of the vector, ||x||₁, which is the sum of the absolute values of the vector elements.
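Here is a short sketch of all four scaling methods implemented with NumPy; the sample vector is arbitrary:

import numpy as np

x = np.array([1.0, 4.0, 6.0, 9.0])

standardized = (x - x.mean()) / x.std()               # mean 0, standard deviation 1
min_max      = (x - x.min()) / (x.max() - x.min())    # values in [0, 1]
mean_norm    = (x - x.mean()) / (x.max() - x.min())   # roughly [-1, 1], mean 0
unit_length  = x / np.linalg.norm(x)                  # vector with L2 norm 1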

However you scale, it is important to only measure the scaling factors, mean, and standard deviation on the training set. These factors carry information about the data, and if you measure them over your entire dataset, then the algorithm might perform better on the test set than it will in production, due to this information advantage.

Equally importantly, you should check that your production code has proper feature scaling as well. Over time, you should recalculate your feature distribution and adjust your scaling.
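A minimal sketch of this practice, assuming scikit-learn is available: the scaler is fitted on the training data only, reused for the test data, and persisted so that production code applies exactly the same scaling. The file name is an arbitrary choice:

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.random.rand(800, 5)
X_test = np.random.rand(200, 5)

scaler = StandardScaler().fit(X_train)     # mean and std measured on training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # test data reuses the training statistics

# Persist the fitted scaler so production code can load and apply the same scaling
joblib.dump(scaler, 'scaler.joblib')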

Understanding which inputs led to which predictions

Why did your model make the prediction it made? For complex models, this question is pretty hard to answer. A global explanation for a very complex model might in itself be very complex. Local Interpretable Model-Agnostic Explanations (LIME) is a popular algorithm for model explanation that focuses on local explanations. Rather than trying to answer, "How does this model make predictions?", LIME tries to answer, "Why did the model make this prediction on this data?"

Note

Note: The authors of LIME, Ribeiro, Singh, and Guestrin, curated a great GitHub repository around their algorithm with many explanations and tutorials, which you can find here: https://github.com/marcotcr/lime.

On Kaggle kernels, LIME is installed by default. However, you can install LIME locally with the following command:

pip install lime

The LIME algorithm works with any classifier, which is why it is model agnostic. To make an explanation, LIME cuts up the data into several sections, such as areas of an image or words in a text. It then creates a new dataset by removing some of these features. It runs this new dataset through the black box classifier and obtains the classifier's predicted probabilities for the different classes. LIME then encodes the data as vectors describing which features were present. Finally, it trains a linear model to predict the outcomes of the black box model with different features removed. As linear models are easy to interpret, LIME uses this linear model to determine the most important features.

Let's say that you are using a text classifier built on TF-IDF features to classify emails such as those in the 20 Newsgroups dataset. To get explanations from this classifier, you would use the following snippet:

from lime.lime_text import LimeTextExplainer               #1

explainer = LimeTextExplainer(class_names=class_names)     #2

exp = explainer.explain_instance(test_example,             #3
                                 classifier.predict_proba, #4
                                 num_features=6)           #5

exp.show_in_notebook()                                     #6
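The snippet assumes that classifier, class_names, and test_example already exist. A minimal, hypothetical setup on the 20 Newsgroups data using scikit-learn might look like the following; the chosen categories and model are illustrative:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

categories = ['alt.atheism', 'soc.religion.christian']
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)

# TF-IDF features feeding a simple Naive Bayes classifier, wrapped in a pipeline
# so that predict_proba accepts raw text
classifier = make_pipeline(TfidfVectorizer(), MultinomialNB())
classifier.fit(train.data, train.target)

class_names = train.target_names
test_example = test.data[0]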

Now, let's understand what's going on in the LIME snippet above:

  1. The LIME package has several classes for different types of data.
  2. To create a new blank explainer, we need to pass the names of classes of our classifier.
  3. We'll provide one text example for which we want an explanation.
  4. We provide the prediction function of our classifier. We need to provide a function that outputs probabilities. For Keras, this is just model.predict; for scikit-learn models, we need to use the predict_proba method.
  5. num_features sets the maximum number of features to display. We want to show only the importance of the six most important features in this case.
  6. Finally, we can render a visualization of our prediction, which looks like this:
[Figure: LIME text output]

The explanation shows the classes the text is most likely to be classified as, together with the features driving those classifications. It shows the words that contribute most to the classification for the two most likely classes. Below that, you can see the words that contributed to the classification highlighted in the text.

As you can see, our model picked up on parts of the sender's email address as distinguishing features, as well as the name of the university, "Rice." It sees the word "Caused" as a strong indicator that the text is about atheism. Combined, these are all things we want to know when debugging datasets.

LIME does not perfectly solve the problem of explaining models. It struggles, for instance, when the interaction of multiple features leads to a certain outcome. However, it does well enough to be a useful data debugging tool. Often, models pick up on things they should not be picking up on. To debug a dataset, we need to remove all of these "give-away" features that statistical models like to overfit to.

Looking back at this section, you've now seen a wide range of tools that you can use to debug your dataset. Yet, even with a perfect dataset, there can be issues when it comes to training. The next section is about how to debug your model.
