Chapter 2. Incident Response

In this world, sometimes bad things happen, even to good data and systems. Disks fail. Files get corrupted. Machines break. Networks go down. API calls return errors. Data gets stuck or changes subtly. Models that were once accurate and representative become less so. The world can also change around us: things that never, or almost never, previously happened can become commonplace; this itself has an impact on our models.

Much of this book is about building ML systems that prevent these things from happening, or, when they do happen - and they will - recognizing the situation correctly and mitigating it. Specifically, this chapter is about how to respond when bad, urgent things happen to ML systems. You may already be familiar with how teams handle systems going down or otherwise having some problem: this is known as incident management, and there are a number of best practices for managing incidents common across lots of different computer systems.1

We’ll cover these generally applicable practices, but our focus will be on how to manage outages for ML systems, and in particular on how those outages and their management differ from other distributed computing system outages.

The main thing to remember is that ML systems have a number of attributes that make resolving their incidents potentially very different from incidents in non-ML production systems. The most important attribute in this context is their strong connection to real-world situations and user behavior. This means that we can see unintuitive effects when there is a disconnect between the ML system and the world or user behavior it is trying to model. We will cover this in detail later, but the major thing to understand now is that troubleshooting ML incidents can involve much more of the organization than standard production incidents do, including finance, supplier and vendor management, PR, legal, and so on. ML incident resolution is not necessarily something that only engineering does.

A final serious point we would like to make here at the beginning: as with other aspects of ML systems, incident management has serious implications for ethics in general and very commonly for privacy. It is a mistake to worry about getting the system working first and to worry about privacy only afterwards. Do not lose sight of this critical part of our work in this section. Privacy and ethics will make an appearance in several parts of the chapter, and they will be addressed directly towards the end, because by then we will be in a better place to draw some clear conclusions about how ML ethics principles interact with incident management.

Incident Management Basics

Three basic concepts for successful incident management are knowing the state the incident is in, establishing the roles, and recording information for follow-up. Many incidents are prolonged because of failures to identify what state the incident is in, and who is responsible for managing which aspects of it. If this continues for long enough, you have an unmanaged incident, which is the worst kind of incident.2 Indeed, if you’ve worked with incidents for long enough, you’ve probably seen one already, and it probably starts something like this: an engineer becomes aware of a problem; they troubleshoot the problem alone, hoping to figure out what the cause is; they fail to assess the impact of the problem on end users; they don’t communicate the state of the problem, either to other members of their team or to the rest of the organization. The troubleshooting itself is typically disorganized, characterized by delays between actions and by delays in assessing the results of those actions. Once the initial troubleshooters realize the scope of the incident, there may be even more delays while they try to figure out what other teams need to be involved and send pages or alerts to track them down. If the problem continues long enough, other parts of the organization notice that something is wrong and independently (sometimes counterproductively) take uncoordinated steps to resolve it.

The key idea here is to actually have a process–a well-rehearsed one–and to apply it reliably and methodically when something bad enough to be called an incident happens. Of course, creating a managed incident has some cost, and formalizing communications, behavior, and follow-up incurs some overhead. So we don’t do it for everything; not every WARNING in our logs warrants a couple of hours of meetings or phone calls. Being an effective oncall engineer involves developing a sense for what is serious and what isn’t, and smoothly engaging incident machinery when required. It is enormously helpful to have clearly defined guidelines ahead of time about when to declare an incident, how to manage it, and how to follow up after it.

For the rest of this section, we assume that we have an idea of what serious incidents are, describe a process for managing them and a process for follow-up, and illustrate with examples how ML specifics apply to the art of incident management.

Life of an Incident

Incidents have distinct phases. Although people of good will may differ on the specifics, most incidents include states such as:

  • Pre-incident: architectural and structural decisions that set the conditions for the outage.

  • Trigger: something happens to create the user-facing impact.

  • Outage begins: our service is affected in a way noticeable to at least some users for at least some functions.

  • Detection: the owners of the service become aware of the problem, either through automated monitoring or through outside users complaining.

  • Troubleshooting: we try to figure out what is going on and devise some means of fixing the problem.

  • Mitigation: we identify the fastest and least risky steps to prevent at least the worst of the problems. This can range from something as mild as posting a notice that some things don’t work right all the way to completely disabling our service.

  • Resolution: we actually fix the underlying problem and the service returns to normal.

  • Follow-up: we conduct a retrospective, learn what we can about the outage, identify a series of things we’d like to fix or other actions we would like to take, and then carry those out.

Computer system outages can roughly be described by these phases. We’ll briefly cover the roles in a typical incident and then we will try to understand what differs in handling an ML incident.

Incident Response Roles

Some companies have thousands of engineers working on systems infrastructure and others might be lucky to have a single person. But whether your organization is large or small, the roles described below need to be filled.

It’s important to note that not every role requires a separate person to fill it, since not all of the responsibilities are equally urgent, and not all incidents demand isolated focus. Also, your organization and your team have a particular size - not every team can fill every position directly. Furthermore, certain problems only emerge at scale: communication costs, in particular, tend to increase in larger orgs, often correlated with the complexity of infrastructure under management. Conversely, smaller engineering teams can suffer from tunnel vision and a lack of diversity of experience. Nothing in our guidance frees you of the necessity of adapting to the situation, and making the right choices - often by first making the wrong ones. But one critical fact remains: you must plan ahead for the organizational capacity to properly support incident management duties. If they are a poorly staffed afterthought, or you assume anyone can jump in when incidents occur with no structure, training, or spare time, the results can be quite bad.

The framework we are most familiar with for incident management derives from the US FEMA National Incident Management System3. In this framework, the minimum viable set of roles is typically:

  • Incident commander: a coordinator who has the high-level state of the incident in their head and is responsible for assigning and monitoring the other roles.

  • Communications lead: responsible for outbound and inbound communication. The actual responsibilities for this role differ significantly based on the system but it may include updating public documents for end users, contacting other internal services groups and asking for help, or answering queries from customer-facing support personnel.

  • Operations lead: approves, schedules and records all production changes related to the outage (including stopping previously scheduled production changes on the same systems even if unrelated to the outage).

  • Planning lead: keeps track of longer term items that should not be lost but do not impact immediate outage resolution. This would include recording work items to be fixed, storing logs to be analyzed, and scheduling time to review the incident in the future. (Where applicable, the planning lead should also order dinner for the team.)

Those roles are the same whether or not you are dealing with an ML incident. The things that do vary are:

  • Detection: ML systems are less deterministic than non-ML systems. As a result, it is harder to write monitoring rules to catch all incidents before a human user detects them.

  • Roles and systems involved in resolution: ML incidents usually involve a broader range of staff in the troubleshooting and resolution, including business/product and management/leadership. ML systems are generally built on, and fed by, multiple complex systems, and they in turn integrate with and modify other parts of your infrastructure, so an ML outage tends to impact multiple systems and to have a diverse set of stakeholders.

  • Unclear timeline/resolution: many ML incidents involve impact to quality metrics that themselves already vary over time. This makes the timeline of the incident and the resolution more difficult to specify precisely.

In order to develop a more intuitive and concrete understanding of why these differences show up in this context, let’s consider a few example outages of ML systems.

Anatomy of an ML-centric Outage

These examples are drawn from the authors’ real experiences but do not correspond to specific, individual incidents that we participated in. Nonetheless, our hope is that many people with experience in running ML systems will see familiar characteristics in at least one of these examples.

As you read them through, pay close attention to some of the following characteristics that may differ substantially from other kinds of outages:

  • Architecture and underlying conditions: What decisions did we make about the system before this point that could have played a role in the incident?

  • Impact start: How do we determine the start of the incident?

  • Detection: How easy is it to detect the incident? How do we do it?

  • Troubleshooting and investigation: Who is involved? What roles do they play in our organization?

  • Impact: What is the “cost” of the outage to our users? How do we measure that?

  • Resolution: How confident are we in the resolution?

  • Follow-up: Can we distinguish between “fixing” and “improving”? How do we know when the follow-up from the incident is done and prospective engineering is taking place?

Keep these questions in mind while you consider these stories.

Terminology Reminder: “Model”

In the Basic Introduction to Models chapter we introduced a distinction between:

  • Model Architecture: the general approach to learning.

  • Model (or Configured Model): the specific configuration of an individual model plus the learning environment, and structure of the data we will train on.

  • Trained Model: a specific instance of one Configured Model trained on one set of data at a point in time.

This distinction matters particularly because we often care about which of these has changed and might therefore be implicated in an incident. We will try to be clear in the following sections which we’re referring to.

Story Time

Introductory note: we tell these stories within the framework of our invented firm, YarnIt, in order to help them resonate with readers. But they are all based on, or at least inspired by, real events that we’ve observed in production. In some cases they are based on a single outage at a single time and in others they are composites.

Story 1: Searching But Not Finding

One of the main ML models that YarnIt uses is a search ranking model. As on most webstores, customers come to the site and click on links offered to them on the front page, but they also search directly for the products they’re looking for. To generate those search results, we first filter our product database for all of the products that roughly match the words that the customer is looking for, and then rank them with an ML model that tries to predict how to order those results, given everything we know about the search at the time it’s performed.
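
To make that two-stage shape concrete, here is a minimal sketch of candidate filtering followed by model ranking. The product_db and ranking_model objects and their methods are illustrative stand-ins, not YarnIt’s actual interfaces:

```python
# Illustrative two-stage search: cheap candidate filtering, then ML ranking.
# `product_db.find_matching` and `ranking_model.score` are hypothetical stand-ins.

def search(query, context, product_db, ranking_model, max_results=20):
    # Stage 1: retrieve every product that roughly matches the query terms.
    candidates = product_db.find_matching(query)

    # Stage 2: ask the trained model to score each candidate, given everything
    # we know about the search (query text, user context, product features).
    scored = [
        (ranking_model.score(query=query, context=context, product=p), p)
        for p in candidates
    ]

    # Show the highest-scoring products first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [product for _, product in scored[:max_results]]
```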

Ariel, a production engineer who works on search system reliability, is working on the backlog of monitoring ideas. One of the things the search team has been wishing they monitored and trended over time is the rate at which a user clicks on one of the first five links in a search result. They hypothesize that this might be a good way to see whether the ranking system is working optimally.

Ariel looks through the available logs and determines an approach for exposing the resulting metric. After doing a week-on-week report for the past 12 weeks to make sure that the numbers look reasonable, Ariel finds the initial results promising. From 12 weeks ago to three weeks ago, Ariel sees that the top five links are clicked on by customers around 62% of the time. Of course that could be better, but a substantial majority of the time we’re finding something that the users are curious about within the first few results.
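
A minimal sketch of how such a metric might be computed from search logs, assuming a simple (invented) log record that carries the position of the clicked result:

```python
from dataclasses import dataclass
from typing import Iterable, Optional

@dataclass
class SearchLogEntry:
    query_id: str
    clicked_position: Optional[int]  # 1-based rank of the clicked result; None if no click

def top5_click_rate(entries: Iterable[SearchLogEntry]) -> float:
    """Fraction of searches where the user clicked one of the first five results."""
    total = 0
    top5_clicks = 0
    for entry in entries:
        total += 1
        if entry.clicked_position is not None and entry.clicked_position <= 5:
            top5_clicks += 1
    return top5_clicks / total if total else 0.0
```

Computed once per week over the trailing 12 weeks, this is the number Ariel trends on the new dashboard.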

Three weeks ago, however, the click-rate on the first five links started going down. In fact, this week it’s only 54% and Ariel notes that it appears to still be dropping. That’s a huge drop in a very short period of time. Ariel suspects that the new dashboard is flawed and asks the search reliability team to take a look. They confirm: the data looks correct and those numbers are really concerning!

Note

Note: detection has occurred.

Ariel declares an incident and notifies the search model team since it might be a problem with the model. Ariel also notifies the retail team, just to check that we’re not suddenly making less money from customers who are searching for products (as opposed to browsing for them), and asks them to check for recent changes to the website that would change the way results are rendered. Ariel then digs into the infrastructure for the search reliability team themselves: what has changed on their end? Ariel finds–and the search model team confirms–that there have been no changes to the model configuration in the past two months. There have also been no big changes in the data, or accompanying metadata used by the model–just the normal addition of customer activity to the logs.

Instead, one of the search model team members notes something interesting: they use a “golden set” of queries to test new models daily, and they’ve noticed that in the past three weeks the golden set is producing incredibly consistent results–consistent enough to be suspicious. The search model is normally updated daily by retraining the same model on searches and resulting clicks from the previous day. This helps keep the model updated with new preferences and new products. It also tends to produce some small instability in the results from the golden set of queries, but that instability is normally within some reasonable bounds. But starting three weeks ago, those results became remarkably stable.

Ariel goes to look at the trained model deployed in production. It’s three weeks old, and has not been updated since that point. This explains the stability of the golden queries. It also explains the drop-off in user click behavior: we’re probably showing fewer good results for new preferences and new products. Of course, if we keep the same stale model indefinitely, we’ll eventually be unable to correctly recommend anything new. So Ariel looks at the search model training system, which schedules the search model training every night. It has not completed a training run in over three weeks, which would definitely explain why there isn’t a new trained model in production.

Note

Note: we have a proximal cause for the outage, but at this point we don’t know the underlying cause and there’s no obvious simple mitigation: without a new trained model in production we cannot improve the situation. This also gives us a rough proximal start time for the impact.

The training system is distributed. There is a scheduler that starts a set of processes to store the state of the model, and another set of processes to read the last day’s search logs and update the model with the newly expressed preferences of the users. Ariel notes that all of the processes trying to read logs from the search system are spending most of their time waiting for those logs to be returned from the logging system.

The logging system accesses raw customer logs via a set of processes called log-feeders that have permission to read the relevant parts of the logs. Looking at those log-feeder processes, Ariel notices that there is a group of 10 of them and that they’re each crashing and exiting every few minutes. Diving into process crash logs, Ariel sees that the log-feeders are running out of memory, and when they can’t allocate more memory, they crash. When they crash, a new log-feeder process is started on a new machine and the training process retries its connection, reads a few bytes and then that process runs out of memory and crashes again. This has been going on for three weeks.

Ariel proposes that they try increasing the number of log-feeder processes from 10 to 20. If that spreads the load from the training jobs across more feeders, it might stop them from crashing. They can also look at allocating more memory to the jobs if needed. The team agrees, Ariel makes the change, the log-feeder jobs stop crashing and the search training run completes a few hours later.

Note

Note: the outage is mitigated as soon as the training run completes and the new trained model is put into production.

Ariel works with the team to double check that the new trained model loads automatically into the serving system. The new model performs differently on the golden query set than the three-week-old one did, but it performs acceptably well. Then they all wait a few hours to accumulate enough logs to generate the data they need to make sure that the updated trained model is really performing well for customers. Later, they analyze the logs and see that the click-through-rate in the first five results is now back to where it should be.

Note

Note: at this point the outage is resolved. Sometimes there is no obvious mitigation stage and mitigation and resolution take place at the same time.

Ariel and the team work on a review of the incident, accumulating some post-outage work they’d like to perform, including:

  • Monitor the age of the model in serving and alert if it’s over some threshold of hours old. Note that “age” here might be wall-clock age (literally, what is the timestamp on the model file) or data age (how old is the data that this model is based on). Both of these are mechanically measurable; a sketch of such a check follows this list.

  • Determine our requirements for having a fresh model and then distribute the available time to the subcomponents of the training process. For example, if we need to get a model updated in production every 48 hours at the most, we might give ourselves 12 hours or so to troubleshoot problems and train a new model, and then allocate the remaining 36 hours to the log processing, log-feeding, training, evaluation, and copying-to-serving portions of the pipeline.

  • Monitor the golden query test and alert if it is unchanged as well as alerting if it’s changed too much.

  • Monitor the training system “training rate” and alert if it falls below some reasonable threshold such that we predict we will miss our deadline for training completion based on the allocated amount of time. Selecting what to monitor is difficult and setting thresholds for those variables is even harder. This is covered briefly in the <ML Incident Management Principles> section below, and in more depth in Chapter 9, Running and Monitoring.

  • Finally, and most importantly: monitor the top-five-results click-through rate and alert if it falls below some threshold. This should catch any problem that affects quality as perceived by users but is not caught by any of the other alerts. Ideally, the metric for this should be available at least hourly so that we can use it while troubleshooting future problems, even if it’s only stable on a day-by-day basis.
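
As a sketch of what the model-age check from the first item might look like, assuming the serving system exposes two timestamps in its model metadata (the field names and thresholds here are invented, with the thresholds drawn from a hypothetical 48-hour freshness budget):

```python
import time
from typing import List, Optional

# Invented thresholds derived from a hypothetical freshness budget.
MAX_WALL_CLOCK_AGE_SECS = 48 * 3600   # age of the serving model itself
MAX_DATA_AGE_SECS = 72 * 3600         # age of the newest data the model was trained on

def model_freshness_alerts(model_built_at: float,
                           newest_training_example_at: float,
                           now: Optional[float] = None) -> List[str]:
    """Return alert messages; an empty list means the serving model looks fresh."""
    now = time.time() if now is None else now
    alerts = []

    wall_clock_age = now - model_built_at
    if wall_clock_age > MAX_WALL_CLOCK_AGE_SECS:
        alerts.append("serving model is %.1f hours old" % (wall_clock_age / 3600))

    data_age = now - newest_training_example_at
    if data_age > MAX_DATA_AGE_SECS:
        alerts.append("newest training data is %.1f hours old" % (data_age / 3600))

    return alerts
```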

With those follow-up items scheduled, Ariel is ready for a break and resolves to stop looking for problems in the future.

Stages of ML Incident Response for Story 1

This outage, although quite simple in cause, can help us start to see the way that ML incidents manifest somewhat differently for some phases of the incident response lifecycle.

  • Pre-incident: The training and serving system was a somewhat typical structure with one system producing a trained model and periodically updating it, and another using that model to answer queries. This architecture can be very resilient, since the live customer-facing system is insulated from the learning system. When it fails it is often because the model in serving is not updated. The underlying logs data is also abstracted away in a clean fashion that protects the logs but still lets the training system learn on them. But this interface to the logs is precisely where the weakness in our system occurred.

  • Trigger: Distributed systems often fail when they pass some threshold of scaling that sharply reduces performance, sometimes referred to as a bottleneck. In this case, we passed the threshold of performance of our log-feeder deployment and did not notice. The trigger was the simple growth of the data, corresponding growth of the training system requirements, and the business need to consume that data.

  • Outage begins: The outage begins three weeks before we notice it. This is unfortunate, and why good monitoring is so important.

  • Detection: ML systems that are not well instrumented often only manifest systems problems as quality problems–they simply start performing less well and get gradually worse over time. Model quality changes are often the only end-to-end signal that something is wrong with the systems infrastructure.4

  • Mitigation/Resolution: The fastest and least risky steps to mitigate the problem, in this case, involved training a new model and successfully deploying it to our production search serving system. For ML systems, especially those that train on lots of data or that produce large models, there might be no such quick resolution available.

  • Follow-up: There’s a rich set of monitoring we can add here, much of which is not easy to implement but which will benefit us during future incidents.

This first story shows a fairly simple ML-associated outage. We can see that outages can present as quality problems where models aren’t quite doing what we expect or need them to do. We can also start to see the pattern of broad organizational coordination that is required in many cases for resolution. Finally, we can see that it is tempting to be ambitious in specifying the follow-up work. Keeping these three themes in mind, consider another outage.

Story 2: Suddenly Useless Partners

At YarnIt we have two different types of business. The first part of our business is a first-party store where we sell knitting and crocheting products. But we also have a marketplace where we recommend products from other partners who sell them through our store. This is a way that we can make a wider variety of products available to our customers without having to invest more in inventory or marketing.

When and how to recommend these marketplace products is a little tricky. We’ll need to incorporate them into our search results and discovery tools on the website as a baseline, but how should we make recommendations? The simplest thing to do would be to list every product in our product database, include all actions that touch them in our logs, and add them to our main prediction model. A notable constraint is that each of these partners requires that we separate their data from every other partner’s–otherwise they won’t let us list their products5. As a result, we’ll have to train a separate model per partner, and extract partner-specific data into isolated repositories, though we can still have a common feature store for shared data.

YarnIt is ambitious enough to plan for a potentially very large number of partners–somewhere between five thousand and five million–and so instead of a setup optimized for a few large models we need a setup optimized for thousands of tiny models. As a result, we built a system that extracts the historical data from each partner and puts it into a separate directory or small feature store. Then at the end of every day we separate out the previous day’s deltas and add them to our stores just before starting training. Now our main models train quickly and our smaller partner models train quickly as well. Best of all, we’re compliant with the access protection demanded by our partners.

Note

Note: the pre-incident phase is complete at this point. The stage is set and the conditions for the outage are in place. It may be obvious by this point that there are several opportunities for things to go wrong.

Sam, a production engineer at YarnIt, works on the partner training system. Sam is asked to produce a report for CrochetStuff, a partner, in advance of a business meeting. When preparing the report, Sam notices that the partner in question has zero recent “conversions” (sales) recorded in the ML training data but that the accounting system reports that they’re selling products every day. Sam writes up the discrepancy and forwards it to colleagues who work on the data extraction and joining jobs for advice. In the meantime, Sam leaves this anomaly out of the report to the partner team and simply includes the sales data from the accounting system.

Note

Note: the detection happens here. No computer system detected the outage, which means that it may have been going on for an indefinite amount of time.

Data discrepancies in counts like this happen all the time, so the data extraction team does not treat Sam’s report as a high priority. Sam is reporting a single discrepancy for a single partner, so they file a bug and plan to get to it in the coming week or so.

Note

Note: the incident is unmanaged and continuing chaotically. It might be small or it might not. No one has determined the extent of the impact of the data problem yet and no one is responsible for coordinating a quick and focused response to it.

At the business meeting, CrochetStuff notes that their sales are down 40% week-on-week and continuing to drop daily. Their reports on page views, recommendations, and user inquiries are all down, even though when users do find the products, the rate at which they purchase continues to be high. CrochetStuff demands to know why YarnIt suddenly stopped recommending all of their products!

Note

Note: so by this point we have had internal detection, an internal partner advocate, customer reports, and a possible lead on what is happening. This is a lot of noise but sometimes we don’t declare an incident until many people independently notice it.

Sam declares an incident and starts working on the problem. The logs of the partner model training system clearly report that the partner models are successfully training every day, and there are no recent changes to either the binaries that carry out the training or the structure and features of the models themselves. Looking at the metrics coming from the models, Sam can see that the predicted value of every product in the CrochetStuff catalog has declined significantly every day for the past two weeks. Sam looks at other partners’ results and sees exactly the same drop.

Sam brings in the ML engineers who built the model to troubleshoot what is happening. They double-check that nothing has changed and then do some aggregate checks on the underlying data. One of the things they notice is what Sam noticed originally: there are no sales for any partners in the last two weeks in the ML training data. The data all comes from our main logs system and is extracted every day to be joined with the historical data we have for each partner. The data extraction team resurrects Sam’s bug from a few days before and starts looking at it.

Sam needs to find a fast mitigation for the problem. Sam notes that the team stores older copies of trained models for as long as several months and asks the ML engineers about the consequences of just loading an old model into serving for now. The team confirms that while the old trained model versions won’t have any information about new products or big changes in consumer behavior, they will have the expected recommendation behavior for all existing products. Since the scope of the outage is so significant, the partner team decides it is worth the risk to roll back the models. In consultation with the partner team, Sam rolls back all of the partner trained models to versions that were created two weeks earlier, since that seems to be before the impact of the outage began. The ML engineers do a quick check of aggregate metrics on the old models and confirm that recommendations should be back to where they were two weeks ago.6

Note

Note: at this point the outage is mitigated but not really resolved. Things are in a pretty unstable state–notably, we cannot build a new model with our accustomed process and have it work well–and we still need to figure out the best full resolution as well as how to avoid getting ourselves into this situation again.

While Sam has been mitigating, the data extraction team has been investigating. They find that while the extractions are working well, the process that merges extracted data into the existing data is consistently finding no merges possible for any partners. This appears to have started about two weeks ago. Further investigation reveals that two weeks ago, in order to facilitate other data analysis projects, the data management team changed the unique partner key used to identify each partner in their log entries. This new unique key was included in the extracted data, and because it differed from previous partner identifiers, the newly extracted logs could not be merged with any data extracted prior to the key change.

Note

Note: this is now a reasonable root cause for the outage.

Sam requests that a single partner’s data be re-extracted and that a model be trained on the new data in order to quickly verify that the system will work correctly end-to-end. Once this is done, Sam and the team are able to verify that the newly extracted data contain the expected number of conversions and that the models are now, again, predicting that these products are good recommendations for many customers. Sam and the data extraction engineers do some quick estimations of how long it will take to re-extract all of the data, and Sam then consults with the ML engineers on how long it will take to retrain all of the models. They arrive at a collective estimate of 72 hours, during which they will continue to serve recommendations from the stale model versions that they restored from two weeks prior. After consulting with the retail product and business team, they all decide to carry out this approach. The partner team drafts some mail to partners to let them know about the problem and a timeline for resolution.

Sam requests that all partner data be re-extracted and that all partner models be retrained. They monitor the process for three days and, once it is done, verify that the new models are recommending not only the older products but also newer products that didn’t exist two weeks prior. After careful checking, the new models are deemed good by the ML engineers and are put into production. Serving results are carefully checked, and many folks do live searches and browse the site to verify that partner listings are actually showing up as expected. Finally, the outage is declared closed and the partner team drafts an update to partners letting them know.

Note

Note: at this point the outage is resolved.

Sam brings the team together to go over the outage and file some follow-up bugs so that they can avoid this kind of outage in the future and detect it more quickly than they did this time. The team considers rearchitecting the whole system so that they can eliminate the problem of having two copies of all of the data, with slightly different uses and constraints, but decides that they still don’t have a good idea about how to meet their performance goals for both systems if they are unified.

They do file a set of bugs related to monitoring the success of data extraction, data copying and data merging. The biggest problem is that they don’t have a good source of truth for the question: how many lines of data should be merged? This failure happened for an entire class of logs and the team was quickly able to add an alert for “log lines merged must be greater than zero”. But during the investigation a series of less catastrophic failures were also found, and in order to catch those, we would need to know the expected number of logs per partner to be merged and then the actual number that were merged.

The data extraction team settles on a strategy where they store the count of merged log lines by partner by day and compare today’s count to the trailing average of the last n days. This will work relatively well when partners are stable but will be noisy when they experience big changes in popularity.
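
A minimal sketch of that comparison, with invented names; the 0.5 threshold is exactly the kind of knob that turned out to be hard to tune:

```python
from statistics import mean
from typing import List

def merged_lines_look_healthy(todays_count: int,
                              recent_counts: List[int],
                              min_fraction: float = 0.5) -> bool:
    """Compare today's merged log line count for one partner against the trailing average.

    `recent_counts` holds the per-day merge counts for the last n days. Too tight a
    threshold pages someone whenever a partner's popularity swings; too loose and
    partial failures slip through.
    """
    if todays_count == 0:
        return False  # the catastrophic case: nothing merged at all
    if not recent_counts:
        return True   # no history yet; nothing to compare against
    return todays_count >= min_fraction * mean(recent_counts)
```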

Two years later, this alerting strategy is still unimplemented because of the difficulty of making it work without unnecessary noise. It may be a good idea in principle, but given the dynamic retail environment it has proven unworkable, and the team still lacks good end-to-end rapid detection of this kind of log extraction and merging failure, except in the catastrophic case. However, a heuristic they did implement a few months in–a hook that triggers on any relevant change to the partner configurations and notifies an engineer to potentially expect breakage–has at least increased ongoing awareness of such a change as a potential trigger for outages.

Stages of ML Incident Response for Story 2

Many of the characteristic stages that this incident went through are similar to those of any distributed systems incident. There are some prominent differences though, and the best way to see those with some context and nuance is to walk through the partner training outage and look at what ML-salient features occur during each section.

  • Pre-incident: Most of what went wrong was already latent in the structure of our system. We have a system with two authoritative sources for the data, one of which is an extracted version of the other, with incremental extracts being applied periodically. Problems with the data, and the metadata, are where ML systems typically fail. We will dig into the tactics for observing and diagnosing outages across systems with coupled data and ML in the <ML Incident Management Principles> section.

  • Trigger: The data schema was changed. It was changed very far away from where we observed the problem, which obviously made it difficult to identify. It is important to think about this outage as a way of identifying what assumptions we have made about our data throughout the processing stack. If we can identify those assumptions and where they are implemented, we can avoid creating data processing systems that can be damaged by changes to those assumptions. In this case, it should have been impossible to change the schema of our main feature store without also modifying or at least notifying all downstream users of that feature store. Explicit data schema versioning is one way to achieve this result (see the sketch after this list).

  • Outage begins: The outage begins when one internal data-processing system consumes data from another in a way that is no longer consistent with that data’s structure. This is a common hazard for any large distributed pipeline system.

  • Detection: ML systems quite commonly fail in ways that are detected first by end users. One challenge with this is that ML systems are often accused of failure, or at least not working as well as we might hope, even under normal operations, and so it may seem reasonable to disregard the complaints of users and customers. The primary method of noticing this particular outage is a common one: the recommendations system wasn’t making recommendations of the same quality as it used to. With ML system monitoring, keeping the high-level, end-to-end, coarse-grained picture in mind is particularly useful–with the central question being: have we substantially changed what the model is predicting over the past short while? These kinds of end-to-end quality metrics are completely independent of the implementation and will detect any kind of outage that substantially damages models. The challenge will be to filter that signal so that there are not too many false positives.

  • Troubleshooting: Sam needs to work with multiple teams to understand the scope and potential causes of the outage. We have commercial and product staff (the partner team), ML engineers who build the model, data extraction engineers who get the data out of the feature store and logs store and ship it to our partner model training environment, and production engineers like Sam coordinating the whole effort. Troubleshooting ML outages really has to start not with the data but with the outside world: what is our model saying and why is that wrong? There is so much data that starting by “just looking through the data” or even “doing aggregate analysis of the data” is likely to be a long and fruitless search. Start with the model’s changed or problematic behavior and it will be much easier to work backwards to why the model is now doing what it is doing.

  • Mitigation: With some services it is possible to simply restore an older version of the software while a fix is prepared. While this may inconvenience any users depending upon new features, everyone else can continue unaffected. ML outages can only sometimes be mitigated in this way because their job is to help computer systems adapt to the world and there’s no way to restore a snapshot of the world as it used to be. Additionally, quickly training up new models often requires more computing capacity than we have available. As was the case with our partner model outage, there is no cost-free quick mitigation. In order to determine which mitigation was the best option, the decision ultimately needed to be made by the product and business staff most familiar with our partners, users and business. This level of escalation to business leaders happens sometimes for non-ML services but much more frequently for ML services. Most organizations who rely upon ML to run important parts of their business will need to cultivate technical leaders who understand the business and business leaders who understand the technology.

  • Resolution: Sam makes sure that the data in the partner training system is correct (at least in aggregate, and spot checks seem to confirm that it looks good). New models are trained. When we are ready to deploy them, there’s actually no simple way to determine whether the new models “fix” the problem. The world continues to change while we are working on resolving this problem. So some previously popular products may be less in vogue now. Some neglected products may have been discovered by our users. We can look at the aggregate metrics to see whether we are recommending partner products at closer to the rate that we did previously, but it won’t be identical. Sometimes people use a golden set of queries here to see if they can produce a “correct” set of recommendations for some pre-canned results. This can increase our confidence somewhat but adds the new problem that we will want to continuously curate this golden set of queries to be representative of what our users search for. Once we do that, we will not necessarily have stable results over very long periods of time.7

  • Follow-up: After-incident work is always difficult. For a start, the people with direct knowledge are tired, and may have been neglecting their other work for some time by this point. We have already paid the price of the outage so we might as well get the value for it. While monitoring bugs are typically included in post-incident follow-up, it is incredibly common for them to languish (in some cases for years) for ML-based systems. The reason is relatively simple: it is extremely difficult to monitor real data and real models in a high-signal, low-noise way. Anything that is overly sensitive will alert all the time – the data is different! But anything that is overly broad will miss complete outages of subsets of our services. These problems exist for most distributed systems but are characteristic of ML systems.
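
As a sketch of the schema-versioning guard mentioned in the Trigger item above, here is one way the merge job could refuse to run against an extract whose schema it does not understand, failing loudly instead of silently joining nothing. The field name and version numbers are invented:

```python
# Invented example of an explicit schema-version check in the merge job.

EXPECTED_SCHEMA_VERSION = 7  # the version this merge job was written against

class SchemaMismatchError(Exception):
    pass

def merge_daily_extract(extract_metadata: dict, do_merge):
    """Run `do_merge()` only if the extract's schema matches what this job expects."""
    actual = extract_metadata["schema_version"]
    if actual != EXPECTED_SCHEMA_VERSION:
        # Someone changed the partner key (or anything else in the schema) upstream
        # and this job has not been updated for it: stop rather than merge nothing.
        raise SchemaMismatchError(
            "extract has schema v%d, merge job expects v%d"
            % (actual, EXPECTED_SCHEMA_VERSION)
        )
    return do_merge()
```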

While this outage was technically complex and somewhat subtle in its manifestation, many ML outages have very simple causes but still show up in difficult-to-correlate ways.

Story 3: Recommend You Find New Suppliers

We have models for several aspects of our business at YarnIt. The recommendations model in particular has an important signal: purchases. Simply put, we will recommend a product in every browsing context where users tend to purchase that product when it is offered to them. This is good for our users, who more quickly find products that they want to buy, and for YarnIt, who will presumably sell more products more quickly.

Gabi is a production engineer who works on the discovery modeling system. One unusually pleasant summer day, Gabi is working through some configuration clean-ups that have been lingering and addressing some requests from other departments. Customer support has sent a note that they have been tracking a theme in website feedback for the past couple of weeks: customers saying that the recommendations are “weird”. Subjective impressions by some customers like this are generally pretty hard to take any concrete action on, but Gabi files the request into a “pending follow-up” queue to look at later.

No spoilers! We definitely cannot say whether or not incident detection has happened yet at this point.

Further into the incoming requests, Gabi spots an unusual problem report. The website payments team tells Gabi that finance is reporting a big drop in revenue. Revenue is down 3% for the past month on the site. That might not seem like a big drop, but after some further digging they find that last week versus four weeks ago is down closer to 15%! The payments team has checked the payments processing infrastructure and found that customers are paying for carts successfully at the same rate they historically have. They note, though, that carts contain fewer products on average than they used to, and in particular fewer people are purchasing products from recommendations than expected. This is why the payments team has contacted Gabi. Seeing numbers this big, Gabi declares an incident.

Incident detected and declared.

Gabi asks the financial team to double-check the week-vs-four-weeks-ago comparison for each of the past several weeks, and also asks for a more detailed timeline of revenue over that period. Finally, she asks for any product, category, or partner breakdowns available. Gabi then asks the payments team to verify their numbers about recommendations added to carts, as well as to provide any breakdowns they can. In particular: do they see some particular type of carts that have fewer recommendations than others or that have changed more recently?

Meanwhile, Gabi starts looking at some aggregate metrics for the application, just trying to figure out some basic questions. Are we showing recommendations at all? Are we showing recommendations as often as we have in the past, and for all the queries and users and products that we did in the past, and in the same proportions across user subpopulations? Are we generating sales from recommendations at the same rate as we typically have? Is there anything else salient about the recommendations that is obviously different?

Gabi also starts doing the normal production investigation, focusing particular attention on what changed in the recommendations stack recently. The results are not promising for finding an obvious culprit: the recommendations models and binaries to train the models are unchanged in the last six weeks. The data for the model is updated daily, of course, so that’s something to look at. But the data schema in the feature store hasn’t changed in several months.

Gabi needs to continue troubleshooting but takes time to compose a quick message to the finance and payments teams that asked for help with this issue. Gabi confirms what is known so far: the recommendations system is running and producing results, there are no recent changes to be found, but the quality of the results has not been verified. Gabi reminds them to inform their department heads if they have not already, which seems wise given the amount of money the company appears to be losing.

There are no obvious software, modeling, or data updates that correlate with the outage so Gabi decides that it’s time to dig into the recommendations model itself. Gabi sends a quick message to Imani, who built the model, asking for help. As Gabi is explaining to Imani what they know so far (fewer products purchased, fewer recommendations purchased per checkout, no system changes to speak of), the note from customer support comes to mind. Customers complaining about “weird” recommendations, if the timeline matches up, certainly seems relevant.

Customer support confirms that they started getting the first sporadic complaints just over three weeks ago but that they have been intensifying and are especially pointed in the last week. Imani thinks there may be something worth investigating and asks Gabi to grab enough data to trend some basic metrics on the recommendations system: number of recommendations per browse page, average hourly delta between expected “value” of all recommendations (probability that a customer will purchase a recommended product times the purchase price) and the observed value (total value of recommended products ultimately purchased). Imani grabs a copy of some recent customer queries and product results in order to use them as a repeatable test of the recommendation system. The recommendation system uses the query that a user made, the page that they are on, and their purchase history (if we know it) to make recommendations, so this is the information that Imani will need to query the recommendation model directly.

Note

Note: Without more information we have to worry that by doing this Imani may have violated the privacy of YarnIt’s customers. Search queries may contain protected information like user IP addresses, and any collection of search queries has the additional problem that, when correlated with each other for a given user, the queries reveal even more private information8. Imani definitely should have consulted with some of the privacy and data protection professionals at YarnIt, or, better yet, should not even have had the direct, unmonitored access to the queries that made this kind of mistake possible.

Imani extracts out about 100,000 queries + page views and sets up a test environment where they can be played against the recommendation model. After a test run through the system, Imani has recommendations for all of the results and has stored a copy of the whole run so that it can be compared to future runs if they need to modify or fix the model itself.

Gabi comes back and reports something interesting. Just over three weeks ago, the number of recommendations per page dropped slowly. For one week, the difference between expected value and observed value of each recommendation only declined a bit. By two weeks ago the recommendation count plateaued at just under 50% lower than it has been. But then the value of the recommendations began to drop significantly compared to the expected value. That decline continued through last week. Two weeks ago the observed value of the recommendations hit a low value of 40% of their expected value. Even more strangely, though, the gap between the expected value and observed value of the recommendations started narrowing a week ago, but at the same time the number of recommendations shown began falling again, so that now we seem to be showing very few recommendations at all, but those that we do show seem to be relatively accurately valued. Something is definitely wrong and it is starting to look like it’s the model, but there’s no clear diagnosis flowing from this set of facts.

Imani continues to build the QA environment to test hypotheses. On a hunch, Gabi and Imani grab another 100,000 queries + page views from a month ago (before there was any evidence of a problem) as well as a snapshot of the model from every week in the last six weeks. The configuration of the model is exactly the same day over day, but since the model retrains daily, each day’s trained model has learned from what users did the preceding day. Imani plans to run the old and new queries against each of the models and see what can be learned.
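
A sketch of that evaluation grid might look like the following; the model-loading and query-replay functions are hypothetical hooks into the QA environment, so they are passed in rather than assumed:

```python
# Run every saved model snapshot against both the old and the new query batches.

def cross_evaluate(model_snapshots, query_batches, load_model, run_queries, summarize):
    """model_snapshots: name -> path; query_batches: name -> list of (query, pageview)."""
    results = {}
    for model_name, snapshot_path in model_snapshots.items():
        model = load_model(snapshot_path)
        for batch_name, batch in query_batches.items():
            recommendations = run_queries(model, batch)
            # e.g., how many recommendations were made, and their mean predicted value
            results[(model_name, batch_name)] = summarize(recommendations)
    return results
```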

Gabi pushes for a quick test first: today’s queries versus a month-old model. Here’s the thinking: if that works, then there’s a quick mitigation (restore the old model to serving) while troubleshooting continues. Gabi is focused on solving the lost-revenue problem as quickly as possible. Imani runs the tests and the results are not that promising, and they are difficult to evaluate. The old model makes different recommendations than the new model and does seem to make slightly more of them. But the old model still makes many fewer recommendations against today’s queries than it did against the queries from a month ago.

Without something quite a bit more concrete, Gabi isn’t comfortable that changing the model to an older one will help. It might even do more damage to our revenue than the current model. Gabi decides to leave the recommendation system in its current state. It’s time to send another note to the folks in finance and payments about the current status of the troubleshooting. The payments and finance contacts both report that their bosses want a lot more information about what’s going on. Gabi’s colleague Yao, who has been shadowing the investigation and is familiar with the recommendations system, is drafted to handle communications. Yao promptly sets up a shared document with the state as it is known so far and links to specific dashboards and reports for more information. Yao also sends out a broad notice to senior folks in the company, notifying them of the outage and the current status of the investigation.

Imani and Gabi finish running the full sweep of old and new queries against older and newer models. The results are different for each pair, but there’s nothing broadly systematic standing out that might explain the differences and the general metrics match the weird pattern described above. Imani decides to forget the model for a second and focus instead on the queries and pageviews themselves. Imani wants to figure out how they have changed in the last month, thinking maybe the problem is with the model’s ability to handle some shift in user behavior rather than something being wrong with the model itself.

Imani spot-checks the queries, but there are 100,000 of them in each of the two batches and it is not exactly obvious what might be substantially different about them. Gabi, meanwhile, produces two different reports. The first looks exclusively at the search queries that customers used to get to the product pages they ended up on. Gabi tokenizes the search queries and just counts the appearances of each word. While that’s running, Gabi takes the product pages the customers ended up on and assigns each of them to a large category (yarn, pattern, needles, accessories, gear) and then to sub-categories within those according to the product ontology (built by another team). Gabi lines up the two pairs of reports and looks for the biggest differences between the user behavior four weeks ago and today.
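
A minimal sketch of the kind of comparison Gabi runs on the queries, using naive whitespace tokenization (real query processing would be more careful):

```python
from collections import Counter
from typing import List

def token_frequencies(queries: List[str]) -> Counter:
    """Naive whitespace tokenization and counting."""
    counts = Counter()
    for query in queries:
        counts.update(query.lower().split())
    return counts

def biggest_shifts(old_queries: List[str], new_queries: List[str], top_n: int = 20):
    """Tokens whose share of all query words changed most between the two periods."""
    old_counts = token_frequencies(old_queries)
    new_counts = token_frequencies(new_queries)
    old_total = sum(old_counts.values()) or 1
    new_total = sum(new_counts.values()) or 1
    shifts = {
        token: new_counts[token] / new_total - old_counts[token] / old_total
        for token in set(old_counts) | set(new_counts)
    }
    return sorted(shifts.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top_n]
```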

The results are shockingly obvious: compared to four weeks ago, users have increasingly been looking for very different products. In particular, they are now looking for lightweight yarns, patterns for vests and smaller items, and smaller gauge needles. Imani and Gabi stare at the results and it suddenly seems so obvious. What happened four weeks ago? It got very hot in the northern hemisphere where the majority of YarnIt’s customers are based. The heat came earlier than usual and significantly decreased the interest most customers had in knitting with chunky, warm wool.

Imani points out, however, that that doesn’t explain the decrease in recommendations, only the change in what the recommendations should be. This still leaves the question of why we aren’t simply recommending good hemp and silk yarn instead of wool. Gabi walks through a few queries to the recommendation engine by hand, using a command-line tool built for troubleshooting like this, and notices something. The recommendation engine test instance is set to log many more details than the production instance. One of the things it’s logging at a pretty high rate is that many candidate recommendations are disqualified from being shown to users because they’re out of stock.

Yao gets an update from Imani and Gabi, updates some of the shared doc and publishes some information to the increasingly large group of people waiting to find out how the company is going to fix this problem. Someone from the retail team sees the note about many recommendations being out of stock and mentions to Yao that YarnIt did lose several important suppliers recently. One of the biggest, KnitPicking, is a popular supplier of fashionable yarns, many of which happen to be lightweight. In fact, KnitPicking was one of the largest suppliers of those weights of yarn at those price points. Yao gets more details on the timing of the supply problems, adds it to the doc and reports back to Gabi.

Note

Note: this is an interesting state for the incident to be in. We have a very likely root cause but no obvious way to mitigate it or resolve it.

Imani and Gabi have a solid hypothesis on the weird recommendations. The recommendations system is configured with a minimum threshold of expected value for each recommendation it shows so that it won’t show terrible recommendations when it doesn’t have any good ones. But it takes a while for a recommendation’s expected value to adjust, especially when it hasn’t been shown very often recently. Imani concludes that the system quickly learned that few people wanted heavyweight wool yarns. But once those were understood to be poor recommendations, it took a while for the system to cycle through many other products until it finally concluded that we really don’t have very much stock in the products that our customers currently want to buy.

Gabi, Imani and Yao schedule a meeting with the heads of retail and finance to discuss what they have learned and ask for guidance on how to proceed. Oddly, the current state seems to be that the recommendations system is now moderately good for current circumstances. It recommends few products for most customers on most pageviews since we don’t have much of what most of our customers want right now. The loss of revenue was as much due to supply problems as it ever was due to the recommendations system. Presented with the facts as they are known, the head of retail asks the team to verify their findings to be certain but agrees that fixing the supply problem is the highest priority. The finance lead nods and goes off to sharply reduce projections for how much money we will make this quarter. There is no obvious change to the recommendations model that can improve the situation given our supply shortfalls and the weather.

Note

Note: at this point the outage is probably over, since we’ve decided not to change the system or model.

The team gathers together the next day to review what happened and what they can learn from it. Some of the proposed follow-up actions include:

  • Monitor and graph the number of recommendations per pageview, revenue per recommendation, and percentage gap between expected value of all recommendations per hour and the actual sales value.

  • Monitor and alert on high rates of candidate product recommendations being unavailable (for whatever reason: out of stock, legal restrictions, etc.). We can also consider monitoring stock levels directly if we can find a high-signal way to do so, although ideally this would be the responsibility of a supply chain or inventory management team. We should be careful not to be over-broad in our monitoring of other teams' work, to avoid burdening our future selves with excessive alerting. A minimal sketch of how these and the previous item's metrics might be computed appears after this list.

  • Monitor user query behavior directly, in aggregate, so that we can detect significant shifts in query topics and distribution. This kind of monitoring is generally good for graphing but not for alerting: it's just too hard to get right.

  • Work more closely with the customer support team to give them tools to investigate user reports like these. If the support team had a query replicator/analyzer/logger, they might have been able to generate a considerably more detailed report than "customers say they get weird recommendations". This kind of "empower another team to be more effective" effort often pays off much more than pure automation.

  • Review ways to get the model to adjust more quickly. The fact that it took the model so many days to converge on the right recommendation behavior isn’t reasonable. The overall stability of the model has been perceived to be of value, but in this case it ended up showing bad recommendations to users for many days and also making it harder for the production team to troubleshoot problems with it. Imani wants to find a way to improve the responsiveness to new situations without making the model overly unstable.

  • We should treat this as an opportunity to think about what the model should do when it doesn’t have any good recommendations. This is fundamentally a product and business problem rather than an ML engineering problem–we need to figure out what behavior we want the model to exhibit and what kinds of recommendations we think we should surface to users under these circumstances. At a high level, we would like to keep making money at a reasonable rate with good margins even when we do not have the products that our customers want the most. Figuring out whether there’s a way to identify a product recommendation strategy to do that is a hard problem.

  • Finally, it’s clear that some exogenous data to the ML system should be always available to make troubleshooting situations like these easier. In particular, the production engineers should have revenue results in aggregate and broken down by product category in the product catalog, by geography and by the original source of the user viewing the product (search result or recommendation or home page).

Many of these follow-ups are quite ambitious and unlikely to be completed in any reasonable amount of time. Some of them, though, can be done fairly quickly and should make our system more resilient to problems like this in the future. As always, figuring out the right balance and understanding the trade-offs of implementing them is precisely the art of good follow-up, though we should favor the items that make problems faster to troubleshoot.

Stages of ML Incident Response for Story 3

Although this incident had a somewhat different trajectory from that of <stories 1 and 2>, we can see many of the same themes appear. Rather than repeat them, let’s try to focus on what additional lessons we can learn from this outage.

  • Pre-incident: There is no obvious significant failure in the architecture or implementation of our system that led to this outage, which is interesting. There are definitely some choices we could have made that would have made the outage progress differently, and more smoothly for our users, but in the end we cannot recommend products we don’t have and sales were going to go down. There may be a model that could produce better recommendations under these circumstances (rapid change in demand combined with an inventory problem) but that falls more under the heading of continuous model improvement rather than incident avoidance.

  • Trigger: The weather changed and we lost a supplier. This is a tough combination of events to detect directly, but we can certainly try with some of the monitoring efforts proposed above.

  • Outage begins: In some ways there is no outage. That is what is most interesting about this incident. An outage can be understood to be a failing of the system such that it yields an incorrect result. It’s appropriate to describe the “weird recommendations” period as an outage, but one with only minimal costs since the main impact was probably to annoy our users a bit. But the loss in revenue wasn’t caused by the recommendations model nor was it preventable by it. Likewise the outage won’t end until the weather changes or we source a new supply of lightweight yarns.

  • Detection: The earliest sign of the outage was the customer complaints about weird recommendations. That's the kind of noisy signal that probably cannot be relied on, but as noted we can give the support team better tools so that they can report problems in more detail. There may be other, less obvious signals with higher accuracy that we could use for detection, but even identifying them is a data science problem in its own right.

  • Troubleshooting: The process of investigating this outage includes some of the hallmarks of many ML-centric outage investigations: detailed probing at a particular model (or set of models or modeling infrastructure) coupled with broad investigation of changes in the world around us. The investigation might have proceeded more quickly if Sam had followed up on the detailed timeline of revenue from the finance team. With the breakdown of revenue changes by product, category, or partner, we should have been able to see a sharp shift in consumer behavior combined with a sharp rise and then drop in sales from KnitPicking (as our stock in their products ran low). It is sometimes difficult to remember that clarity about an outage might come from looking more broadly at the whole situation rather than more carefully at a single part of it.

  • Mitigation/Resolution: Some outages have no obvious mitigation. This is tremendously disappointing, but occasionally there's no quick way to restore the system to whatever properties it previously had. Moreover, the only way to actually resolve the core outage, and get our revenue back on track, is to change what our users want or fix the products that we have available to sell. One thing the team didn't think about, probably in part because they were focused on troubleshooting the model and resolving the ML portion of the outage, was that there may have been other, non-ML ways of mitigating it: what if our system labeled out-of-stock recommendations as such and invited customers to be notified when we had those (or similar) products available? In that case we might have avoided some of the lost revenue by shifting it forward in time, and also reduced the weird recommendations served to customers. Sometimes, mitigations can be found outside of our system.

  • Follow-up: In many cases follow-up from an ML-centric incident evolves into a phase that doesn't resemble "fix the problem" so much as "improve the performance of the model." Post-incident follow-up often devolves into longer-term projects, even for non-ML systems and outages. But the boundary between a "fix" and "ongoing model improvement" is particularly fuzzy for ML systems. One recommendation: define your model improvement process clearly first. Track efforts that are underway and define the metrics you plan to use to guide model quality improvement. Once an incident occurs, take input from the incident to add, update, or reprioritize existing model improvement work. For more on this see chapter 12 on Evaluation and Model Quality.

These three stories, however different in detail, demonstrate some common patterns for ML incidents in their detection, troubleshooting, mitigation, resolution and ultimately post-incident follow-up actions. Keeping these in mind, it is useful to take a broader view of what is happening to make these incidents somewhat different from other outages in distributed computing systems.

ML Incident Management Principles

While each of these stories is specific, many of the lessons from them remain useful across different events. In this section we will try to back away from the immediacy of the stories, and distill what they, and the rest of our experience with ML systems outages, can teach us in the long term. We hope to produce a specific list of recommendations for readers to follow to get ready for and respond to incidents.

Guiding Principles

There are three overarching themes that appear across ML incidents that are so common that we wanted to list them here as “guiding principles”:

  1. Public: ML outages are often detected first by end users, or at least at the very end of the pipeline, all the way out in serving or integrated into an application. This is partly because ML model performance (quality) monitoring is very difficult. Some kinds of quality outages are obvious to end users but not obvious to developers, decision makers, or SREs. Typical examples include anything that affects a small sample of users 100% of the time: those users get terrible performance from our systems all the time, but unless we happen to look at a slice of just those users, aggregate metrics probably won't show anything wrong.

  2. Fuzzy: ML outages are less sharply defined in two dimensions: in impact and in time. With respect to time, it is often difficult to determine the precise start and end of an ML incident. Although there may well be a traceable originating event, establishing a definitive causal chain can be impractical. ML outages are also unclear in impact: it can be hard to see whether a particular condition of an ML system is a significant outage or just a model that is not yet as sophisticated or effective as we would like it to be. One way to think about this is that every model starts out very basic, doing only some portion of what we hope it can do one day. If our work is effective, the model gets better over time as we refine our understanding of how to model the world and improve the data the model uses to do so. But there may be no sharp transition between "bad" and "good" for models; often there is only "better" and "not quite as good". The line between "broken" and "could be better" is not always easy to see.

  3. Unbounded: ML outage troubleshooting and resolution involves a broad range of systems and portions of the organization. This is a consequence of the way that ML systems span more technical, product, and business arms within organizations than non-ML systems. This isn’t to say that ML outages are necessarily more costly or more important than other outages–only that understanding and fixing them usually involves broader organizational scope.

With the three big principles in mind, the rest of this section is organized by role. As we have stated, many people working on ML systems play multiple roles, and it is worth reading the principles for each role whether you expect to do that work or not. But by structuring the lessons by role, we can bring out the perspective and organizational placement particular to each one.

Model Developer or Data Scientist

People working at the beginning of the ML system pipeline sometimes don’t like to think about incidents. To some, that seems like the difficult “operations” work that they would rather avoid. If ML ends up mattering in an application or organization, however, the data and modeling staff will absolutely be involved in incident management in the end. There are things that they can do to get ready for that.

Preparation

Organize and version all models and data.

This is the most important step that data and modeling staff can take to get ready for forthcoming incidents. If you can, put all training data in a versioned feature store with clear metadata spelling out where all the data came from and what code or teams are responsible for its creation and curation. That last part is often skipped: we will end up performing transformations on the data we put into the feature store and it is critical that we track and version the code that performs those transformations.
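
As a sketch of what "clear metadata" might look like, the record below captures the provenance details we would want to be able to look up quickly during an incident. The field names are illustrative, not a prescribed schema.

from dataclasses import dataclass

@dataclass(frozen=True)
class DatasetVersion:
    """Illustrative metadata record for one versioned dataset in a feature store."""
    name: str                   # e.g. "user_query_features"
    version: str                # immutable identifier for this snapshot
    source: str                 # upstream system or raw-data location it came from
    owning_team: str            # who is responsible for (and paged about) this data
    transform_repo: str         # repository holding the transformation code
    transform_commit: str       # exact commit that produced this version
    upstream_versions: tuple = ()  # versions of any datasets this one was derived from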

Specify an acceptable fallback.

When we first start, the acceptable fallback might be “whatever we’re doing now” if we already have a heuristic that works well enough. In a recommendations case this might be “just recommend the most popular products” with little or no personalization. The challenge is that as our model gets better, the gap between that and what we used to do may get so large that the old heuristic no longer counts as a fallback. For example, if our personalized recommendations are good enough, we may start attracting multiple (potentially very different) groups of users to our applications and sites9. If our fallback recommendation is “whatever is popular” then that might produce truly awful recommendations for every different subgroup using the site. If we become dependent on our modeling system, the next step is to save multiple copies of our model and periodically test failing back to them. This can be integrated into our experimentation process by having several versions of the model in use at any one time, with (for example) a primary, a new and an old model.
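
A minimal sketch of how the fallback decision might be wired into serving is shown below, assuming a hypothetical quality_check callable (wrapping whatever offline evaluation or health check we trust) and a non-personalized heuristic as the last resort.

def choose_serving_model(primary, previous, heuristic, quality_check):
    """Pick what to serve from right now.

    `quality_check` is a hypothetical callable; `heuristic` is a
    non-personalized fallback such as "most popular in-stock products".
    """
    for candidate in (primary, previous):
        if candidate is not None and quality_check(candidate):
            return candidate
    # Last resort: accept a quality drop, but keep serving something sensible.
    return heuristic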

Decide on useful metrics.

The final bit of preparation that is most useful is to think carefully about model quality and performance metrics. We need to know if the model is working and model developers will have a set of objective functions that they use to determine this. Ultimately, we want a set of metrics that detect when the model stops working well that are independent of how it is implemented. This turns out to be a more challenging task than it might seem but the closer we can approximate this, the better. Chapter 9 on Running and Monitoring will address the topic of selecting these metrics in a little more detail.

Incident handling

Model developers and data scientists play an important role during incidents: they explain the models as they currently are built. They also generate and validate hypotheses about what might be causing the problems we are seeing.

In order to play that role, model and data folks need to be reachable; that is, available off-hours on an organized schedule such as an on-call rotation or equivalent. They should not expect to be woken up frequently, but they might well be indispensable when they are.

Finally, during incident handling and triage, model and data staff may be called upon to do custom data analysis and even to generate variants of the current model to test hypotheses. They should be ready to do so, but also prepared to push back on any requests that require violating user privacy or other ethics principles. See the <Ethical On-call Engineer Manifesto> section below for some more detail on this idea.

Continuous Improvement

Model and data staff should work to shorten the model quality evaluation loop as a valuable, but not dominant, priority. There will be much more on this in chapter 12 on Model Quality, but the idea here is similar to any troubleshooting: the shorter the delay between a change and an evaluation of that change, the faster we can resolve a problem. This approach will also pay notable benefits to the ongoing development of models, even when we're not having an outage. In order to do this, we'll have to justify the staffing and machine resources for the training iterations, tools, and metrics that we need. It won't be cheap, but if we're investing in ML to create value, this is one of the best ways for this part of our team to deliver that value with the least risk of multi-day outages.

Software Engineer

Some, but not all, organizations have software engineers who implement the systems software to make ML work, glue the parts together, and move the data around. Whoever plays this role can significantly improve the odds that incidents go well.

Preparation

Data handling should be clean, with clear provenance and as few versions of the same data as possible. In particular, when there are multiple "current" copies of the same data, the result can be subtle problems that are detected only as drops in model quality or unexpected errors. Data versioning should be explicit, and data provenance should be clearly labeled and discoverable.

It is helpful if model and binary roll-outs are separate and separable. That is, the binaries that do the inference in serving, for example, and the model that they are reading from should be pushed to production independently, with quality evaluations conducted each time. This is because binaries can affect quality subtly, just as models can. If the rollouts are coupled, troubleshooting can be much more difficult.

Feature handling and use in serving and training should be as consistent as possible. Some of the most common and most basic errors are differences in feature use between training and serving (called training-serving skew). These include simple differences in the quantization of a feature, or even a change in a feature's contents altogether (a feature that used to be income becomes zip code, and chaos ensues immediately, for example).
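
As an illustration, the sketch below compares a single feature's values as seen in training data with the values observed in serving logs. The helper names and the 0.5 threshold are purely illustrative; production skew checks typically compare full distributions per feature, but even something this crude catches gross mismatches.

def mean(values):
    return sum(values) / len(values) if values else float("nan")

def feature_skew(training_values, serving_values):
    """Crude skew score: relative difference of means between training and serving."""
    t, s = mean(training_values), mean(serving_values)
    return abs(t - s) / max(abs(t), abs(s), 1e-9)

# Usage sketch: flag the feature if the score exceeds an illustrative threshold.
if feature_skew([1.2, 0.9, 1.1], [120.0, 95.0, 110.0]) > 0.5:
    print("possible training-serving skew on this feature; investigate")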

Implement or develop as much tooling as possible (sometimes tooling development is done by specialist test engineers, but this is organizationally specific). We will want tooling for model roll-out and roll-back and for binary roll-out and roll-back. We should have tools to show the versions of the data (reading from the metadata) in every environment, and tools for customer support staff or production engineers (SREs) to read data directly for troubleshooting purposes (with appropriate logging and audit trails to respect privacy and data integrity guarantees). Where possible, find tooling that already exists for your framework and environment, but plan to implement at least some. The more working tooling exists, the lower the burden on software engineers during incidents.

Incident Handling

Software engineers should be a point of escalation during incidents, but if they have done their jobs well, they should be alerted only rarely. There will be software failures in the model servers, data synchronizers, data versioners, model learners, model training orchestration software, and feature store. But as our system matures we will be able to treat it as a few large systems that can be well managed: a data system (the feature store), a data pipeline (training), an analytics system (model quality), and a serving system (serving). Each of these is only slightly harder for ML than for non-ML problems, and so software engineers who do this well may have very low production responsibilities.

Continuous Improvement

Software engineers should work regularly with model developers, with SREs/production engineers, and with customer support in order to understand what is missing and how the software should be improved. The most common improvements will involve resilience to big shifts in data and thoughtful exporting of software state for more effective monitoring.

ML SRE or Production Engineer

ML systems are run by someone. At larger organizations there may be dedicated teams of production engineers or SREs who take responsibility for managing these systems in production.

Preparation

Production teams should be staffed with sufficient spare time to handle incidents when they come up. Many production teams fill their plate with projects, ranging from automation to instrumentation. Project work like this is enjoyable and often results in lasting improvements in the system, but if it is high priority and deadline-driven it will always suffer during and after an incident. If we want to do this well, we have to have spare capacity.

We will also need training and practice. Once the system is mature, large incidents may happen infrequently. The only way that our on-call staff will gain fluency, not only with the incident management process itself but also with our troubleshooting tools and techniques, is to practice. Good documentation and tooling help, but not if on-call staff can't understand the docs or find the dashboards.

Production teams should conduct regular architectural reviews of the system to think through the biggest likely weak spots and address them. These might be unnecessary data copies, manual procedures, single points of failure, or stateful systems that cannot easily be rolled back.

Setting up monitoring and dashboards is a topic unto itself and will be covered more extensively in chapter 9 on Running and Monitoring. For now, we should note that monitoring distributed, high-throughput pipelines is extremely difficult. Since progress is not reducible to a single value (oldest data we're still reading, newest data we have read, how fast we're training, how much data is left to read), we need to make decisions based on changes in the distribution of data in the pipeline.

We will need to set up SLOs (service level objectives) and defend them. As noted, our systems will behave in complex ways, with multiple dimensions of variable performance along the "somewhat better" and "somewhat worse" axis. In order to pick some thresholds, the first thing we'll need to do is define the SLIs (service level indicators) that we want to track. In ML these are generally slices (subsets) of the data or model. Then we'll pick some metric for how those are performing. Since these metrics will change over time, if our data are normally distributed we can pick thresholds by how far from the median they are10. If we update those thresholds periodically, but not too often, we will continue to be sensitive to large shifts while ignoring longer-term trends. This may miss outages that happen slowly over weeks or months, but it will not be overly sensitive.
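
A minimal sketch of the threshold-picking step is shown below, assuming we keep a recent history of each SLI: it centers the band on the median and uses the median absolute deviation as a robust spread estimate. The multiplier k and the sample values are illustrative.

import statistics

def threshold_band(history, k=3.0):
    """Pick alerting thresholds for an SLI from its recent history.

    Centers the band on the median and scales it by the median absolute
    deviation; recompute periodically, but not so often that slow drifts
    get absorbed into the baseline.
    """
    med = statistics.median(history)
    mad = statistics.median(abs(x - med) for x in history) or 1e-9
    return med - k * mad, med + k * mad

# Usage sketch with a hypothetical hourly SLI (say, revenue per
# recommendation for one slice of users):
low, high = threshold_band([0.41, 0.39, 0.44, 0.40, 0.42, 0.38, 0.43])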

Production engineering teams should educate themselves about the business that they are in. This seems ancillary but it isn’t. ML systems that work make a difference for the organizations that deploy them. In order to successfully navigate incidents, SREs or production engineers should understand what matters to the business and how ML interacts with that. Does the ML system make predictions, avoid fraud, connect clients, recommend books, or reduce costs? How and why does it do that and why does that matter to our organization? Or even more basically, how is our organization put together? Where is an organizational chart (for a sufficiently large organization)? Answering those questions ahead of time prepares a production engineer for the necessary work of prioritizing, troubleshooting and mitigating ML outages.

Finally, we need as many objective criteria to trigger an incident as possible. The hardest stage of an incident is before it is declared. Often many people are concerned. There is pervasive, and disconnected, evidence that things are not going well. But until someone declares an incident and engages the formal machinery of incident management we cannot manage the incident directly. The clearer the guidelines we determine in advance, the shorter that period of confusion.

Incident Handling

Step back and look at the whole system. ML outages are seldom caused by the system or metric where they manifest. Poor revenue can be caused by missing data (on the other side of the whole system!). Crashes in serving can be caused by changes in model configuration in training, or by errors in the synchronization system connecting training to serving. And, as we've seen, changes in the world around us can themselves be a source of impact. This is a fairly different practice from what production engineers normally employ, but it is required for ML systems outages.

Be prepared to deal with product leaders and business decision makers. ML outages rarely stop at the technical team's edge. If things are going wrong, they usually impact sales or customer satisfaction: the business. Extensive experience interacting with customers or business leaders is not a typical requirement for production engineers, and many would prefer to avoid it. ML production engineers tend to get over that preference quickly.

The rest of incident handling is normal SRE/production incident handling and most production engineers are good at it.

Continuous Improvement

ML production engineers will collect significant numbers of ideas about how the incident could have gone better. These range from monitoring we're not yet doing that would have detected the problem quickly, to system rearchitectures that would have avoided the whole outage in the first place. The role of the production engineer is to prioritize these ideas.

Post-incident followup items have two dimensions of prioritization: value and cost/feasibility of implementation. We should prioritize work on items that are both valuable and easy to implement. Many followup items will fall into the category of “likely valuable but extremely difficult to implement”. These are a separate category that should be reviewed regularly with senior leads but not prioritized alongside other tactical work since they’ll never make sense to start working on in that setting.

Product Manager or Business Leader

Business and product leaders often think that following and tracking incidents is not their problem but rather one for the technical teams. Once you add ML to your environment in all but the most narrow ways, your awareness of it likely becomes critical. Business and product leaders can report on the real-world impact of ML problems, and can also suggest which causes are most likely and which mitigations are least costly. If ML systems matter, then business and product leaders should and will care about them.

Preparation

To the extent possible, business and product leaders should educate themselves about the ML technologies that are being deployed in their organization and products, including and especially the need to responsibly use these technologies. Just as production engineers should educate themselves about the business, business leaders should educate themselves about the technology.

There are two critical things to know: first, how does our system work (what data does it use to make what predictions or classifications) and second, what are its limitations. There are many things that ML systems can do but also many they cannot (yet, perhaps). Knowing what we cannot do is as critical as knowing what we’re trying to do.

Business and product leaders who take a basic interest in how ML works will be astoundingly more useful during a serious incident than those who do not. They will also be able to directly participate in the process of picking ML projects worth investing in.

Finally business leaders should ensure that their organization has the capacity to handle incidents. This largely means that the organization is staffed to a level capable of managing these incidents, has trained in incident management and that we’ve invested in the kind of spare time necessary to make space for incidents. If we have not, it is the job of the business leader to make space for these investments. Anything else creates longer, larger outages.

Incident Handling

It is rare that business leaders have an on-call rotation or other systematized way of reaching them urgently, but the alternative is "everyone is mostly on call most of the time". Culturally, business leaders should consider formalizing these on-call rotations, if only to let themselves take a vacation in peace. The alternative is to empower some other rotation of on-call staff to make independent decisions that can have significant revenue consequences.

During the actual incident, the most common problem that business leaders will face is the desire to lead. For once, they are not the most valuable or knowledgeable person. They have the right to two things: first, to be informed, and second, to offer context about the impact of the incident on the business. They do not generally usefully participate directly in the handling of the incident; they're simply too far removed from the technical systems to do so. Many business leaders should consider proxying their questions through someone else and staying off of direct (IRC, phone, Slack) incident communications as a way of resisting their natural desire to take over.

Continuous Improvement

Business leaders should determine the prioritization of work after outages and should set standards for what completion of those items means. They can do that without having particular opinions about how, exactly, we improve; rather, they can advocate for general standards and approaches. For example, if we rank follow-up work items in priority order (P0 through P3), we can prioritize work on the P0s ahead of the P1s and so on. And we can set guidelines that if all of the P0 items are not done after some period of time, we hold a review to figure out whether anything is blocking them and what we can do, if anything, to speed up implementation.

Similarly, product teams have a huge role in specifying, maintaining, and developing SLOs. SLOs should represent the conditions that will meet a customer’s needs and make them happy. If they do not, we should change them until they do. The people to own the definition and evolution of those values are principally the product management team.

Special Topics

There are two important topics that haven’t been directly addressed yet, that show up during the handling of ML incidents.

Production Engineers and ML Engineering vs Modeling

Given that many ML system problems present as model quality problems, there seems to be a minimum level of ML modeling skill and experience required of ML production engineers. Without knowing something about the structure and functioning of the model, it may be difficult for those engineers to effectively and independently troubleshoot problems and evaluate potential solutions. The converse problem also appears: if there is no robust production engineering group, then we might well end up with modelers responsible for the production serving system indefinitely. While both of these outcomes may be unavoidable, neither is ideal.

This is not completely wrong but it’s also entirely situationally dependent. Specifically, in smaller organizations it will be common to have the model developer, system developer and production engineer be a single person or the same small team. This is somewhat analogous to the model where the developer of a service is also responsible for the production deployment, reliability, and incident response for that service. In these cases, obviously expertise with the model is a required part of the job.

As the organization and services get larger, though, the requirement that production engineers be model developers vanishes entirely. In fact, most SREs doing production engineering on ML systems at large employers never or rarely train models on their own. That is simply not their expertise and is not a required, or even useful, expertise to do their jobs well.

There are ML-related skills and knowledge that ML SREs or ML production engineers do need to be effective. They need some basic familiarity with what ML models are, how they are constructed, and above all the flavor and structure of the interconnected systems that build them. The relationship of components and the flow of data through the system is more important than the details of the learning algorithm.

Let us say, for example, that we have a supervised learning system that uses TensorFlow jobs, scheduled at a particular time of day, to read all of the data from a particular feature store or storage bucket and produce a saved model. This is one completely reasonable way to build an ML training system. In this case, the ML production engineer needs to know something about what TensorFlow is and how it works, how the data is updated in the feature store, how the model training processes are scheduled, how they read the data, what a saved model file looks like, how big it is, and how to validate it. That engineer does not need to know how many layers the model has or how they are updated, although there's nothing wrong with knowing that. They do not need to know how the original labels were generated (unless we plan to generate them again).
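
For instance, a plausible "validate it" step might look like the sketch below: it confirms that the artifact loads as a SavedModel, exposes the usual serving_default signature, and is not wildly larger than expected. The size limit is an arbitrary illustration, and none of this says anything about model quality, which needs its own evaluation step.

import os
import tensorflow as tf

def validate_saved_model(path, max_size_bytes=2 * 1024**3):
    """Illustrative sanity checks on a freshly exported SavedModel directory."""
    # Total on-disk size of the exported model; alert if it balloons unexpectedly.
    total_size = sum(
        os.path.getsize(os.path.join(root, name))
        for root, _, files in os.walk(path)
        for name in files
    )
    if total_size > max_size_bytes:
        raise ValueError(f"model at {path} is {total_size} bytes, larger than expected")

    # Confirm the artifact loads and exposes a serving signature.
    model = tf.saved_model.load(path)
    if "serving_default" not in model.signatures:
        raise ValueError("no serving_default signature; serving will likely fail")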

On the other side of the same coin, suppose we have settled on a delivery pipeline where an ML modeling engineer packages their model into a Docker container, annotates a few configuration details in an appropriate config system, and submits the model for deployment as a microservice running in Kubernetes. The ML modeling engineer may need to understand the implications of how the Docker container is built and how large the container is, how the configuration choices will affect the container (particularly if there are config errors), and how to follow the container to its deployment location and do some cursory log checking or system inspection to verify basic health checks. The ML modeling engineer probably does not, however, need to know about low-level Kubernetes choices like pod-disruption budget settings, DNS resolution of the container’s pod, or the network connectivity details between the Docker container registry and Kubernetes. While those details are important especially in the case where infra components are part of a failure, the ML modeling engineer won’t be well suited to address them and may need to rely on handing those types of errors off to an SRE specialist familiar with that part of the infrastructure.

Detailed knowledge of model building can certainly be extremely helpful. But the biggest reliability problem that most organizations run into is not lack of knowledge about ML. It is rather a lack of knowledge and experience building and productionizing distributed systems. The ML knowledge is a nice addition rather than the most important skill set.

The Ethical On-Call Engineer Manifesto

We’ve written a lot in this chapter about how performing incident response is different and more difficult when ML is involved. Another way in which ML incident response is hard is how to handle customer data when you’re on-call and actively resolving a problem, a constraint we call privacy-preserving incident management. This is a difficult change for some to make, since today (and decades previous), on-call engineers are accustomed to having prompt and unmediated access to systems, configuration, and data in order to resolve problems. Sadly, for most organizations and in most circumstances, this access is absolutely required. We cannot easily remove it, and still allow for fixing problems promptly.

On-call engineers, in the course of their response, troubleshooting, mitigation, and resolution of service outages, need to take extra care to ensure that their actions are ethical. In particular they must respect the privacy rights of users, watch for and identify unfair systems, and prevent unethical uses of ML. This means carefully considering the implications of their actions–not something easy to do during a stressful shift–and consulting with a large and diverse group of skilled colleagues to help make thoughtful decisions.

To help us understand why this should be the case, let's consider the four incident dimensions in which ethical considerations for ML can arise: the impact (severity and type), the cause (or contributing factors), the troubleshooting process itself, and the call to action.

Impact

Model incidents with effects on fairness can inflict truly massive and immediate harm on our users, and of course reputational harm on our organization. It doesn't matter whether the effect is obvious to production dashboards tracking high-level KPIs or not. Think of the case of a bank loan approval program that is accidentally biased. Although the data supplied in applications might omit details on the applicants' race, there are many ways the model could learn race categories from the data that is supplied and from other label data11. If the model then systematically discriminates against some races in approving loans, we might well issue just as many loans, and show about the same revenue numbers on a high-level dashboard, but the result is deeply unfair. Such a model in a user-facing production system could be bad both for our customers and for our organization. In ideal circumstances, no organization would employ ML without undergoing at least a cursory Responsible AI evaluation as part of the design of the system and the model. This evaluation would provide some clear guidelines for the metrics and tools to be used in identifying and mitigating bias that might appear in the model.
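
One concrete form such a guideline might take is a sliced approval-rate check like the minimal sketch below, computed over a recent window of decisions. The group labels and data shape are hypothetical, and a gap between groups is a signal to escalate through the Responsible AI process, not something to patch quietly during the incident.

from collections import defaultdict

def approval_rates_by_group(decisions):
    """Approval rate per demographic group from (group_label, approved) pairs."""
    counts = defaultdict(lambda: [0, 0])  # group -> [approved, total]
    for group, approved in decisions:
        counts[group][0] += int(approved)
        counts[group][1] += 1
    return {group: approved / total for group, (approved, total) in counts.items()}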

Cause

For any incident, the cause or contributing factors to the outage can have consequences for the ethically minded on-call engineer. What if the cause turns out to be a deliberate design decision which is actually hard to reverse? Or the model was developed without enough (or any) attention being paid to ethical and fairness concerns? What if the system will continue to fail in this unfair way without expensive refactoring? Insider threat is real,12 don't forget, but we don't need to imagine that malice aforethought took place for these kinds of things to happen: a homogeneous team, strongly focused on shipping product to the exclusion of all else, can produce them purely by accident. Of course, all of the above is made worse by the current lack of explainability of most ML systems.

Troubleshooting

Ethics concerns, generally privacy, often arise during the troubleshooting phase for incidents. As we saw in Story 3, it is tempting - maybe sometimes even required - to look at raw user data while troubleshooting modeling problems. But doing so directly exposes private customer data. Depending on the nature of the data, it might have other ethical implications as well - consider, for example, the case of a financial system that includes customer investment decisions in the raw data. If a staff member has access to that private information and uses it to direct their own personal investments, this is obviously unethical and in a number of jurisdictions would be seriously illegal.

Solutions and a call to action

The good news is that a lot of these problems have solutions, and getting started can be reasonably cheap. We've already spoken about the generally underweighted role that diverse teams can play in protecting an organization against bad outcomes; fixing those outcomes generally involves fixing the process that produced them rather than mitigating a specific one-time harm. But a diversity of team members is not, by itself, a solution. Teams need to adopt responsible AI practices during the model and system design phase in order to create consistent monitoring of fairness metrics and to provide incident responders a framework to evaluate against. For deliberate or inadvertent access to customer data during incident management, restricting that access by default, with justification, logging, and multiple people sharing responsibility for the data (to act as ethical checks on each other), is a reasonable balance of risk versus reward. Other mechanisms that are useful to avoid the construction of flawed models are outlined in chapter 3 on Fairness.

Finally, though it is not within our remit to declare it unilaterally - nor would we wish to - we strongly believe there is an argument for formalizing such a manifesto, and promoting it industry-wide. The time will come - if it is not already here - when an on-call engineer will discover something vital and worthy of public disclosure, and may be conflicted about what to do. Without a commonly understood definition of what a whistleblower is in the ML world, society will suffer.

Conclusion

An ML model is an interface between the world, as it is and as it changes, on the one hand, and a computer system on the other. The model is designed to represent the state of the world to the computer system and, through its use, allow the computer system to predict and ultimately modify the world. This is true in some sense of all computer systems, but it is true in a semantically higher and broader sense for ML systems.

Think of tasks such as prediction or classification, where an ML model attempts to learn about a set of elements in the world in order to correctly predict or categorize future instances of those elements. The purpose of the prediction or classification is always, and has to be, changing the behavior of a computing system or an organization in response to it. For example, if an ML model determines that an attempted payment is fraudulent, based on characteristics of the transaction and previous transactions that our model has learned from, this fact is not merely silently noted in a ledger somewhere; instead, the system will usually reject the transaction once that categorization is made.

ML failures specifically occur when there is a mismatch between three elements: the world itself and the salient facts about it, the ML system’s ability to represent the world, and the ability of the system as a whole to change the world appropriately. The failures can occur at each of those elements or, most commonly, in combination or at the intersection between them.

ML incidents are just like incidents for other distributed systems, except for all of the ways that they are not. The stories here share several common themes that will help ML production engineers prepare to identify, troubleshoot, mitigate and resolve issues in ML systems as they arise.

Of all of the observations about ML systems made in this chapter, the most significant is that ML models, when they work, matter for the whole organization. Model and data quality therefore have to be the mission of everyone in the organization. When ML models go bad, it sometimes takes the whole organization to fix them as well. ML production engineers who hope to get their organizations ready to manage these kinds of outages would do well to make sure that the engineers understand the business, and the business and product leaders understand the technology.

1 If you’re looking for detailed coverage of general incident management, you may consider instead reviewing <the first SRE book> and the PagerDuty incident response handbook at https://response.pagerduty.com/ .

2 See <SRE Book Chapter 14> for more.

3 See https://www.fema.gov/national-incident-management-system for further details.

4 This drift, as well as the contents of the golden set, are also key places where unfair bias can become a factor for our system. See Chapter 3: Fairness and Privacy for a discussion of these topics at more length.

5 This kind of restriction on data commingling is somewhat common. Companies are sensitive about their commercially valuable data (for example: who bought X after searching for Y) being used to benefit their competitors. In this case these companies may even regard YarnIt as a competitor (although one who sends them significant business that they value).

6 This was a risky way to test this hypothesis. It would have been better to roll out a single model first to validate that the old models performed better and didn’t have some other catastrophic problem. But this is a defensible choice that people make during high-stakes outages.

7 One pair of useful concepts in incident response are RPO (Recovery Point Objective–the point in time that we are able to replicate the system to full functionality after recovering from the outage, ideally very close before the outage) and RTO (Recovery Time Objective–how long it will take to restore the system to functionality from an outage). ML systems certainly have an RTO: retraining a model, copying an old version: these take time. But the problem is that most ML systems have no meaningful notion of an RPO. Occasionally a system runs entirely in the “past” on pre-existing inputs, but most of the time ML systems exist to adapt our responses to current changes in the world. So the only RPO that matters is “now” for ever-changing values of now. A model trained “a few minutes ago” might be good enough for “now” but might not. This significantly complicates thinking about resolution.

8 IP addresses are probably “Personal Information” or “Personal Identifiable Information” in many jurisdictions, so caution must be exercised. This is not always widely understood by systems engineers or operators, especially those who work in countries with looser legal governance frameworks. Additionally, search queries that can be correlated to the same user demonstrably reveal private information by means of the combination of the queries. Famously, see https://en.wikipedia.org/wiki/AOL_search_data_leak for context.

9 Of course, these different recommendations might mean the model is picking up on proxies for unfair bias and model designers and operators should use fairness evaluation tools to regularly look for this bias.

10 The statistics covering this and techniques for setting thresholds are covered well in Mike Julian’s <Practical Monitoring> especially Chapter 4 Statistics Primer.

11 Models can learn race from geographic factors in areas that have segregated housing, from family or first names in places where those are racially correlated, from educational history in places that have segregated education, and even from job titles or job industry. In a racially biased society, many signals correlate with race in a way that models can easily learn even without a race label in the model.

12 See, for example, this incident at Twitter -- https://www.secureworldexpo.com/industry-news/famous-twitter-accounts-hacked-cryptocurrency-details -- though the ones that make the papers are almost by definition a small subset of the ones that actually happen.
