Deployment

Deployment into production is often seen as separate from the creation of models. At many companies, data scientists create models in isolated development environments, using training, validation, and test data that was collected specifically for model development.

Once the model performs well on the test set, it then gets passed on to deployment engineers, who know little about how and why the model works the way it does. This is a mistake. After all, you are developing models to use them, not for the fun of developing them.

Models tend to perform worse over time for several reasons. The world changes, so the data you trained on might no longer represent the real world. Your model might rely on the outputs of other systems that are subject to change. There might be unintended side effects and weaknesses of your model that only show up with extended usage. Your model might even influence the world that it tries to model. The term model decay describes the fact that models have a lifespan after which their performance deteriorates.

Data scientists should have the full life cycle of their models in mind. They need to be aware of how their model works in production in the long run.

In fact, the production environment is the perfect environment to optimize your model. Your datasets are only an approximation for the real world. Live data gives a much fresher and more accurate view of the world. By using online learning or active learning methods, you can drastically reduce the need for training data.
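
As an illustration, here is a minimal sketch of online learning on a stream of live data, using scikit-learn's SGDClassifier and its partial_fit method; the stream_batches generator is a hypothetical stand-in for a production data feed, not part of any real system.

```python
# A minimal sketch of online learning, assuming a binary classification task.
# stream_batches is a hypothetical stand-in for a live production data feed.
import numpy as np
from sklearn.linear_model import SGDClassifier

def stream_batches(n_batches=100, batch_size=32, n_features=10):
    """Hypothetical live data source yielding (features, labels) batches."""
    rng = np.random.default_rng(0)
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, n_features))
        y = (X[:, 0] > 0).astype(int)  # toy labels for demonstration
        yield X, y

# Logistic regression trained with stochastic gradient descent
model = SGDClassifier(loss="log_loss")  # "log_loss" in recent scikit-learn versions

classes = np.array([0, 1])  # partial_fit needs the full class list up front
for X_batch, y_batch in stream_batches():
    model.partial_fit(X_batch, y_batch, classes=classes)
```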

This section describes some best practices for getting your models to work in the real world. The exact method of serving your model can vary depending on your application. See the upcoming section Performance Tips for more details on choosing a deployment method.

Launching fast

The process of developing models depends on real-world data as well as an insight into how the performance of the model influences business outcomes. The earlier you can gather data and observe how model behavior influences outcomes, the better. Do not hesitate to launch your product with a simple heuristic.

Take the case of fraud detection, for instance. Not only do you need to gather transaction data together with information about which transactions actually turned out to be fraudulent, you also want to know how quickly fraudsters find ways around your detection methods. You want to know how customers whose transactions have been falsely flagged as fraud react. All of this information influences your model design and your model evaluation metrics. If you can come up with a simple heuristic, deploy the heuristic and then work on the machine learning approach.
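
As a hypothetical example of such a heuristic, the following sketch flags transactions with a handful of hand-picked rules; the thresholds and field names are invented and would have to be chosen from your own domain knowledge.

```python
# A minimal sketch of a rule-based fraud heuristic; the thresholds and
# field names are hypothetical and not taken from any real system.
def flag_transaction(amount, country, home_country, txs_last_hour):
    """Flag a transaction as potentially fraudulent using simple rules."""
    if amount > 10_000:                              # unusually large amount
        return True
    if country != home_country and amount > 1_000:   # large foreign transaction
        return True
    if txs_last_hour > 10:                           # sudden burst of activity
        return True
    return False

# Example: a 2,000 EUR transaction made abroad at normal frequency
print(flag_transaction(2_000, "US", "DE", txs_last_hour=2))  # True
```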

When developing a machine learning model, try simple models first. A surprising number of tasks can be modeled with simple, linear models. Not only do you obtain results faster, but you can also quickly identify the features that your model is likely to overfit to. Debugging your dataset before working on a complex model can save you many headaches.
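
For instance, a logistic regression baseline can be set up in a few lines with scikit-learn; the synthetic dataset below only stands in for your own data.

```python
# A minimal sketch of a simple linear baseline on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline.fit(X_train, y_train)
print("Baseline accuracy:", baseline.score(X_test, y_test))

# Large coefficients point to features the model leans on heavily,
# which is a good starting point for debugging the dataset.
print(baseline.named_steps["logisticregression"].coef_)
```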

A second advantage of getting a simple approach out of the door quickly is that you can prepare your infrastructure. Your infrastructure team is likely made up of different people than your modeling team. If the infrastructure team does not have to wait for the modeling team but can start optimizing the infrastructure immediately, then you gain a time advantage.

Understanding and monitoring metrics

To ensure that optimizing metrics such as the mean squared error or cross-entropy loss actually leads to a better outcome, you need to be mindful of how your model metrics relate to higher order metrics, which you can see visualized in the following diagram. Imagine you have some consumer-facing app in which you recommend different investment products to retail investors.

Higher order effects

You might predict whether the user is interested in a given product, measured by the user reading the product description. However, the metric you want to optimize in your application is not your model accuracy, but the click-through rate of users going to the description screen. On a higher order, your business is not designed to maximize the click-through rate, but revenue. If your users only click on low-revenue products, your click-through rate does not help you.

Finally, your business's revenue might be optimized to the detriment of society. In this case, regulators will step in. Higher order effects are influenced by your model, but the higher the order of the effect, the harder it is to attribute to a single model. Because higher order effects have large impacts, they effectively serve as meta-metrics for lower-order effects. To judge how well your application is doing, you align its metrics, for example, click-through rates, with the metrics relevant for the higher order effect, for example, revenue. Equally, your model metrics need to be aligned with your application metrics.

This alignment is often an emergent feature. Product managers eager to maximize their own metrics pick the model that maximizes their metrics, regardless of what metrics the modelers were optimizing. Product managers that bring home a lot of revenue get promoted. Businesses that are good for society receive subsidies and favorable policy. By making the alignment explicit, you can design a better monitoring process. For instance, if you have two models, you can A/B test them to see which one improves the application metrics.
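
As a sketch of what such an A/B test could look like, the following compares the click-through rates of two models with a standard two-proportion z-test; the click and impression counts are invented for illustration.

```python
# A minimal sketch of comparing two models on an application metric
# (click-through rate) via a two-proportion z-test; counts are hypothetical.
from math import sqrt
from scipy.stats import norm

clicks_a, impressions_a = 1_200, 40_000   # users served model A
clicks_b, impressions_b = 1_350, 40_000   # users served model B

p_a = clicks_a / impressions_a
p_b = clicks_b / impressions_b
p_pool = (clicks_a + clicks_b) / (impressions_a + impressions_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / impressions_a + 1 / impressions_b))
z = (p_b - p_a) / se
p_value = 2 * norm.sf(abs(z))  # two-sided test

print(f"CTR A: {p_a:.4f}, CTR B: {p_b:.4f}, p-value: {p_value:.4f}")
```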

Often, you will find that to align with a higher order metric, you'll need to combine several metrics, such as accuracy and speed of predictions. In this case, you should craft a formula that combines the metrics into one single number. A single number allows you to unambiguously choose between two models and helps your engineers to create better models.

For instance, you could set a maximum latency of 200 milliseconds and your metric would be, "Accuracy if latency is below 200 milliseconds, otherwise zero." If you do not wish to set one maximum latency value, you could choose, "Accuracy divided by latency in milliseconds." The exact design of this formula depends on your application. As you observe how your model influences its higher order metric, you can adapt your model metric. The metric should be simple and easy to quantify.
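
Translated into code, the two example formulas from above could look like the following sketch; the accuracy and latency values would come from your own evaluation and load tests.

```python
# A minimal sketch of the two combined metrics described above.
def gated_metric(accuracy, latency_ms, max_latency_ms=200):
    """Accuracy if latency stays below the threshold, otherwise zero."""
    return accuracy if latency_ms < max_latency_ms else 0.0

def ratio_metric(accuracy, latency_ms):
    """Accuracy divided by latency in milliseconds."""
    return accuracy / latency_ms

print(gated_metric(0.92, latency_ms=150))  # 0.92
print(gated_metric(0.95, latency_ms=250))  # 0.0
print(ratio_metric(0.92, latency_ms=150))  # ~0.0061
```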

In addition to regularly testing your model's impact on higher order metrics, you should also regularly test your model's own metrics, such as accuracy. To this end, you need a constant stream of ground truth labels together with your data. In some cases, such as detecting fraud, ground truth data is easily collected, although it might come in with some latency. Customers might, for instance, need a few weeks to find out that they have been overcharged.
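
One simple way to do this, sketched below with hypothetical column names, is to log predictions with an identifier and join the delayed ground truth labels back onto them once they arrive.

```python
# A minimal sketch of measuring accuracy on live data once delayed
# ground truth labels arrive; column names are hypothetical.
import pandas as pd

predictions = pd.DataFrame({
    "transaction_id": [1, 2, 3, 4],
    "predicted_fraud": [0, 1, 0, 0],
})
labels = pd.DataFrame({
    "transaction_id": [1, 2, 3, 4],
    "actual_fraud": [0, 1, 1, 0],  # arrives weeks later, e.g. via complaints
})

merged = predictions.merge(labels, on="transaction_id")
accuracy = (merged["predicted_fraud"] == merged["actual_fraud"]).mean()
print(f"Accuracy on labeled live data: {accuracy:.2f}")  # 0.75
```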

In other cases, you might not have ground truth labels. Often, you can hand-label data for which you have no ground truth labels coming in. Through good UI design, the process of checking model predictions can be fast. Testers only have to decide whether your model's prediction was correct or not, something they can do through button presses in a web or mobile app. Even if you have a good review system in place, data scientists who work on the model should regularly check the model's outputs themselves. This way, patterns in failures (for example, our model does poorly on dark images) can be detected quickly, and the model can be improved.

Understanding where your data comes from

More often than not, your data gets collected by some other system that you as the model developer have no control over. Your data might be collected by a data vendor or by a different department in your firm. It might even be collected for different purposes than your model. The collectors of the data might not even know you are using the data for your model.

If, say, the collection method of the data changes, the distribution of your data might change too. This could break your model. Equally, the real world might just change, and with it the data distribution. To avoid changes in the data breaking your model, you first need to be aware of what data you are using and assign an owner to each feature. The job of the feature owner is to investigate where the data is coming from and alert the team if changes in the data are coming. The feature owner should also write down the assumptions underlying the data. In the best case, you test these assumptions for all new data streaming in. If the data does not pass the tests, investigate and eventually modify your model.
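
Such assumption tests do not need to be elaborate; the sketch below checks a few hypothetical assumptions a feature owner might write down for a batch of incoming transaction data.

```python
# A minimal sketch of testing assumptions about incoming data; the column
# names and allowed values are hypothetical examples.
import pandas as pd

def check_assumptions(df: pd.DataFrame) -> list:
    """Return a list of violated assumptions for a batch of incoming data."""
    problems = []
    if df["amount"].isnull().any():
        problems.append("amount contains missing values")
    if (df["amount"] < 0).any():
        problems.append("amount contains negative values")
    if not df["currency"].isin(["EUR", "USD", "GBP"]).all():
        problems.append("unexpected currency codes")
    return problems

batch = pd.DataFrame({"amount": [10.0, -5.0], "currency": ["EUR", "XXX"]})
print(check_assumptions(batch))  # both the range and the currency check fail
```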

Equally, your model outputs might get used as inputs of other models. Help consumers of your data reach you by clearly identifying yourself as the owner of the model.

Alert the users of your model to any changes you make to it. Before deploying a model, compare the new model's predictions to the old model's predictions. Treat models as software and try to identify "breaking changes" that would significantly alter your model's behavior. Often, you might not know who is accessing your model's predictions. Try to avoid this through clear communication and by setting access controls if necessary.
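
A simple proxy for a breaking change, sketched below with made-up prediction arrays, is the share of examples on which the old and the new model disagree.

```python
# A minimal sketch of comparing old and new model predictions; in practice
# the arrays would come from old_model.predict(X) and new_model.predict(X)
# on the same recent data.
import numpy as np

old_preds = np.array([0, 1, 0, 1, 1, 0])
new_preds = np.array([0, 1, 1, 1, 0, 0])

disagreement = np.mean(old_preds != new_preds)
print(f"Models disagree on {disagreement:.0%} of examples")  # 33%
```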

Just as software has dependencies, libraries that need to be installed for the software to work, machine learning models have data dependencies. Data dependencies are not as well understood as software dependencies. By investigating your model's dependencies, you can reduce the risk of your model breaking when data changes.
