Serving a machine learning model is one of the most complex steps in the machine learning life cycle. In Chapter 1, we saw why serving machine learning models is challenging. Serving machine learning models involves two groups in two different domains: the ML developer develops the model and the software developer serves the model. So, we need to agree upon a common language so that we can be sure how our model will be deployed to solve a particular kind of problem. Patterns in design help software architects systematically solve complicated software engineering problems. Similarly, as we learn about design patterns in model serving, the complicated process of model serving will eventually become a piece of cake. This chapter will build on the ideas of some already used patterns for ML serving. We collect the patterns followed by developers and organize and classify those patterns. This chapter will discuss the following topics:
In the engineering domain, a pattern indicates a common approach or strategy that can be reused. This reuse helps us to understand engineering problems and solve them easily by following the solution pattern that has been made available to us by prior engineers. That’s why, when we need to serve a website, we do not have to go back to the theory and try to reinvent the wheel every time. We know the pattern required to serve the web application, which makes our job easier. Most of the time, an engineering team writes down a runbook/docs to solve a recurring problem that appears. This helps engineers avoid debugging the problem every time, thinking of a solution, designing the solution, and applying the solution.
Design patterns are handy to nail hard software engineering problems.
The Gang of Four book on design patterns in software engineering
You might be interested to learn the software engineering design patterns from the book Design Patterns: Elements of Reusable Object-Oriented Software by Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. This book brought about such a dramatic revolution in enhancing the productivity of software development that these four authors became popularly known as the Gang of Four.
To understand how design patterns help us make better software, let’s consider a hypothetical problem scenario. We want to make software that will help to create supervised ML models based on customer requirements. It currently supports the following models:
A naive solution for this would be the following:
from sklearn import linear_model


class Model:
    def __init__(self, model_name, model_params):
        self.model_name = model_name
        self.model_params = model_params


class ModelTrainer:
    def __init__(self, model):
        self.model = model

    def train(self):
        # Every supported model adds another branch to this method.
        if self.model.model_name == "LinearRegression":
            # https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
            trained_model = linear_model.LinearRegression()
            trained_model.fit(self.model.model_params['X'], self.model.model_params['Y'])
            return trained_model
        elif self.model.model_name == "LogisticRegression":
            trained_model = linear_model.LogisticRegression()
            trained_model.fit(self.model.model_params['X'], self.model.model_params['Y'])
            return trained_model


# Example client calls
model = Model("LinearRegression", {"X": [[1, 1], [0, 0]], "Y": [1, 0]})
model_trainer = ModelTrainer(model)
trained_model = model_trainer.train()
Now, let's see the problems with this call:
However, if we look at what this program is doing, it can be seen as a factory providing different trainers (for different models). Then, it becomes easier for us to visualize the problem and also use this common pattern in other similar problems.
Now, let’s modify the previous program in the following way:
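from sklearn import linear_model


class Model:
    def __init__(self, model_name, model_params):
        self.model_name = model_name
        self.model_params = model_params


class ModelTrainerFactory:
    @classmethod
    def get_trainer(cls, model):
        # The factory only picks the trainer; each trainer owns its training logic.
        if model.model_name == "LinearRegression":
            return LinearRegressionTrainer(model)
        elif model.model_name == "LogisticRegression":
            return LogisticRegressionTrainer(model)
        raise ValueError(f"Unsupported model: {model.model_name}")


class ModelTrainer:
    def __init__(self, model):
        self.model = model

    def train(self):
        # Concrete trainers override this method.
        raise NotImplementedError


class LinearRegressionTrainer(ModelTrainer):
    def train(self):
        trained_model = linear_model.LinearRegression()
        trained_model.fit(self.model.model_params['X'], self.model.model_params['Y'])
        return trained_model


class LogisticRegressionTrainer(ModelTrainer):
    def train(self):
        trained_model = linear_model.LogisticRegression()
        trained_model.fit(self.model.model_params['X'], self.model.model_params['Y'])
        return trained_model


# Example client calls
model = Model("LinearRegression", {"X": [[1, 1], [0, 0]], "Y": [1, 0]})
model_trainer = ModelTrainerFactory.get_trainer(model)
trained_model = model_trainer.train()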
The user now gets the desired model trainer from the factory. Whenever a new trainer is needed, we can start providing that trainer from the factory (such as starting a new product from a factory producing products without hampering other production pipelines) by adding a new trainer with a single responsibility. This program is now very easy to maintain and modify. We can use this as a template and apply it in similar problems where we need a collection of different approaches or objects. This pattern is called the factory pattern. It is a very basic but useful pattern.
Similarly, there are more than 20 different software engineering patterns that help users to approach common and repeating problems following a well-known template.
In a similar way, using patterns in ML model serving can solve recurring ML model serving problems.
In the next section, we will get a high-level overview of the patterns for serving ML models.
Using patterns for ML model serving makes us more productive in bringing our model to clients. If we do not follow any patterns, then we may struggle to find the right tool and strategy needed to serve the model for a particular problem.
Figure 2.1 – Alice needs to perform trial and error with multiple tools to find the right one
Let’s consider the situation of Alice in Figure 2.1. Alice has a problem that involves making a data-driven decision. She needs to create a model to solve the problem and deploy the model using a serving tool. She has thousands of tools on offer, and she needs to study all these solutions to find the best one. There is another challenge in this approach to selecting the right tool: Alice is at risk of making a bad choice, as she is solving an optimization problem manually and can get stuck at a local maximum. This is always an impediment to productivity, as it involves extra manual effort.
Alice might often have to backtrack to find a suitable tool, which creates an exponential search space for her. This situation brings a big tech debt to ML developers because the company might move forward with bad choices of tools that need to be replaced in the near future.
Let’s think about it from the point of view of hiring managers. The hiring manager now needs to solve a difficult hiring problem to find suitable talent. They will have difficulties and challenges finding skilled developers who can come up with a solution within a reasonable amount of time. It might be more intuitive to think mathematically about why finding a skilled developer may be hard. Let’s say that company A usually faces P kinds of problems, each of which needs a different model serving approach. There are N different tools available to serve the model.
So, for each problem, a developer may need to try up to N tools before finding a satisfactory solution. Therefore, for P problems, there will be P × N different choices for the developer, and in the worst-case scenario, the developer might have to try all these choices to find the best option. This creates a big bottleneck in productivity. The learning curve for becoming skilled in these tools is steep, because the developer needs to learn the pros and cons of each tool for a particular problem. Therefore, finding a sufficiently skilled developer who can serve the model efficiently becomes hard. Through experience and observation, developers can build a shortlist of the best tools to avoid trying all the choices, and knowledge of model serving patterns makes building that shortlist much easier.
Figure 2.2 – Bob takes the problem pattern and matches it with a few solution patterns
Conversely, let’s consider Bob in Figure 2.2. As the category of ML problems can be served using only a few recurring patterns, he can quickly map a problem to the serving strategy or pattern needed.
Serving becomes a very easy step in the ML life cycle for Bob. Let’s revisit the same math problem as before. Now, Bob only has to apply a single pattern for a problem. So, for P problems, he only needs to go deeper into a few patterns. This makes the learning curve easier for a new developer and brings benefits to both the developers and the AI industry.
From this hypothetical scenario, we get the idea that we should follow pattern-oriented approaches in model serving instead of following tool-oriented approaches.
Here are some of the reasons why we need to know model serving patterns:
Figure 2.3 – This pattern can help to develop abstract pattern-oriented serving tools
From Figure 2.3, we can see that the user is serving models using a tool based on an online model serving pattern. The user does not have to do a lot of complicated tasks such as data cleaning, feature selection, model selection, training, and choosing serving techniques. The user only plugs the input data into the hypothetical tool and gets the predictions from the APIs exposed by the tool. When we understand the serving patterns for different problem types, developing these tools will be easier.
As Bob is gaining all these advantages from using patterns, he has happy users and a more robust model serving pipeline. On the other hand, as Alice is not following any pattern, she is at high risk of the model experiencing downtime, a sudden drop in performance making clients unhappy, and also dealing with the flaky nature of model inference.
There are patterns that are specific to serving ML models. In this section, we will discuss the patterns for ML model serving. We will see the categories of patterns and describe each of the categories separately.
Model serving patterns can be classified into the following two categories at a high level:
We will look at these two categories in more detail in the following subsections. We will describe the two categories and see the patterns under each of these categories.
In this section, we will learn about the patterns in model serving that describe the state-of-the-art principles in model serving. We place these patterns under the class of patterns for serving philosophies.
These patterns give us ideas about the best practices that we should follow whenever we want to serve models. These patterns, instead of suggesting a particular deployment strategy, provide principles we should follow in all the serving strategies. From these kinds of patterns, we learn to make model serving resilient, available, and consistent, meaning that the responses are the same given the same input. For serving web applications, there are already agreed-upon principles and protocols – for example, communication to a server happens through REST APIs. Similarly, in this section, we will learn some standard principles for ML model serving.
Based on serving philosophies, we can classify serving patterns into three categories, as introduced in the book Machine Learning Design Patterns by Michael Munn, Sara Robinson, and Valliappa Lakshmanan:
We want to avoid stateful serving. That is an anti-pattern and should not be classified as a pattern for serving.
In web serving, the server does not store any state information, meaning any client data needed to serve the calls for that particular client (we will go into more detail on states in Chapter 3, Stateless Model Serving). The user needs to transfer all the necessary states if they want to use the web service using a REST API. Anyone who needs to access a web service needs to provide the state information needed, and the web service will store that state information in the placeholders to return the desired response after processing. This ensures the scalability of web APIs, as they can be deployed to any server on an on-demand basis.
REST APIs
Representational State Transfer (REST) is a set of architectural constraints for designing APIs. For further reading on REST APIs, please read the original thesis by Roy Thomas Fielding (https://www.ics.uci.edu/~fielding/pubs/dissertation/top.htm) that introduced REST, and we can also use the following link to learn about it at a high level: https://www.redhat.com/en/topics/api/what-is-a-rest-api.
Whenever we want to make our application stateful – or the business logic demands the application needs to be stateful – then we need to be very careful. In web serving, there is a lot of talk about stateless and stateful serving. Both might be used depending on the requirements of the application. In web applications, states in stateful serving mainly refer to the states or status information in the previous calls. However, in model serving, most of the states will come from the states of the model. If the model stores state during serving, then the model might give different results for the same input at different times.
Let’s imagine there is a website to check the time at a given location. Whenever a particular user wants to use the application, the user needs to pass the location information along with the API call. Let’s also imagine for a moment that the server stores this state information (location). A user from Los Angeles has made a request to the web service to get the current time. The web service got the location information, stored it within its global state, and returned the information. At the same time, if another user from Sydney makes a request to get the time, they might get the wrong time, as the state in the server points to Los Angeles. Therefore, the application becomes buggy and also not scalable.
We can see graphically in Figure 2.4 that a stateful application that stores states within the server can cause inconsistent results. A call, Call 1, is made to the server, and before it is processed, another call, Call 2, is made. Call 2 will now have access to the states from Call 1 that the server has stored in its different placeholders or variables. Therefore, there might be inconsistent results in both Response 1 and Response 2.
Figure 2.4 – The stateful server stores states
On the contrary, a stateless server requires the client to pass the necessary states. The server is blind to the states of any call and does not store anything related to a call. Each call is served individually and independently. Each of the calls is independent of other calls and does not show any side effect that would result from the mingling of states between the calls.
As shown in Figure 2.5, the server does not store any states, and the parallel calls, Call 1 and Call 2, are independent and do not have access to the states of one another.
Figure 2.5 – Stateless serving requires the client to pass states
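The contrast between Figure 2.4 and Figure 2.5 can be sketched in a few lines of Python. This is only an illustration of the time-lookup example, not a real web service: the class and function names, the hard-coded UTC offsets, and the way the two calls are interleaved are all assumptions made for the sketch.

```python
# Hypothetical UTC offsets in hours, hard-coded for illustration
# (daylight saving time is ignored).
UTC_OFFSETS = {"Los Angeles": -8, "Sydney": 10}


class StatefulTimeServer:
    """Anti-pattern: the caller's location is kept in shared server state."""

    def __init__(self):
        self.location = None  # shared by every client of this server

    def store_location(self, location):
        self.location = location

    def utc_offset(self):
        # Answers with whichever location was stored last.
        return UTC_OFFSETS[self.location]


def stateless_utc_offset(location):
    """Stateless: everything needed to answer arrives with the call."""
    return UTC_OFFSETS[location]


server = StatefulTimeServer()
server.store_location("Los Angeles")  # Call 1 stores its state
server.store_location("Sydney")       # Call 2 arrives before Call 1 is answered
stateful_answer_call_1 = server.utc_offset()  # Call 1 is answered with Sydney's state

stateless_answer_call_1 = stateless_utc_offset("Los Angeles")
```

In the stateful version, the answer to Call 1 depends on what another client did in the meantime. In the stateless version, the answer depends only on the arguments of the call, so the function can run on any server replica.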
You now have an idea of how stateful serving can be problematic. There are many side effects to allowing states within served applications. Some common problems include the following:
In ML, we have different states that are used during training the model. To export the model for serving using the stateless serving principle, we need to avoid exporting these parameters. Due to the probabilistic nature of the model, sometimes we use different random states during training. While serving the application, we need a mechanism to get rid of this. Otherwise, this will create a bad user experience, as the user will keep getting different results for the same call at different times. Clients might become intolerant of the probabilistic nature of the response from the models, as they are more interested in getting consistent results.
As an example, to understand the problem that stateful model serving might create, let’s say a developer has demoed an application to a team manager, and they have seen the result, R1. Now, the demo is given to a program manager, and the result is R2. This will create distrust and a lack of confidence among the production team as well. Also, if the users start using it and they keep getting different results, R1, R2, and R3, at different times, then it will create a bad customer experience and a high churn rate.
Therefore, we need to use a stateless serving pattern as much as we can. This makes client code and responsibility a little complicated because the client side needs to take a lot of responsibility in extracting the states from the model. However, the hope is that, in the future, more tools will come to remove this client burden.
Sometimes, making stateless serving might be really difficult. For example, let’s consider a chatbot application. It needs to store the previous state to make the answers and responses more reasonable.
One of the key differences between model serving and web serving is that ML models evolve based on data. An ML model will become useless quickly as more and more new example cases appear that aren't taken into account by the model.
For example, let’s say that there is a model to predict house prices in a fast-growing city. The model will become stale very soon. Let’s say the hypothetical price of a house today is $300,000. After 2 months, the price may become $500,000. So, the model that is developed today cannot be used to make predictions after 2 months. But in a web application, the functionalities of the existing features do not usually change and the requirements are deployed incrementally, step by step. Usually, the deployed requirements do not change significantly after the User Acceptance Testing (UAT) is completed.
For example, a feature for user registration might remain exactly the same for years in a web application. However, ML models might need frequent upgrades, as data is growing every day. If we continue to use an old model, then it might suffer from the following problems:
So, we should follow the philosophy of evaluating the model continuously and setting a threshold point at which the model needs to be upgraded.
Figure 2.6 – A high-level overview of the continuous evaluation of an ML model
In Figure 2.6, we see a high-level overview of serving following the continuous evaluation pattern. Here, we serve the model, and its prediction performance is evaluated using an evaluation service. The evaluation service determines whether we should retrain the model or not based on the performance of the model.
In summary, we need to evaluate the model even after deployment and retrain it when it no longer performs well. We can’t stay silent after the model is deployed, as the model performance will decay over time.
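The evaluation step described above can be sketched as a small check that compares recent predictions with the values observed later. This is a minimal sketch under stated assumptions: the choice of mean absolute error as the metric, the threshold value, and all the prices are hypothetical and only echo the house price example.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute gap between observed and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def needs_retraining(actual, predicted, threshold):
    """Return True when the error has drifted past the agreed threshold."""
    return mean_absolute_error(actual, predicted) > threshold


# Hypothetical numbers: predictions from the deployed model and the prices
# that were actually observed shortly after deployment and two months later.
predicted_prices = [300_000, 310_000, 305_000]
observed_at_launch = [300_000, 320_000, 300_000]
observed_two_months_later = [500_000, 510_000, 495_000]

RETRAIN_THRESHOLD = 50_000  # assumed tolerance, chosen for illustration

retrain_at_launch = needs_retraining(observed_at_launch, predicted_prices, RETRAIN_THRESHOLD)
retrain_later = needs_retraining(observed_two_months_later, predicted_prices, RETRAIN_THRESHOLD)
```

In a real system, the evaluation service would run this kind of check on a schedule and trigger the training pipeline when the check returns `True`.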
Consider a case where we have a model being served for predicting the house price in a city in the USA. We pass a single input to the model (3 bedrooms and 2 bathrooms), and we get the output of $300,000. This looks very simple for a single-in and single-out case.
However, let’s say we send a batch request to the server and the features are provided as an array. Let’s assume the input feature array is [(3, 2), (2, 2), (4, 2)]. Now, we get an output of [$300k, $250k, $400k]. Now, if you are asked what kind of house has a price of $250,000, you will answer with the house whose features are (2, 2), meaning it has two bedrooms and two bathrooms. This seems a very fair claim. However, there is a problem here. The answer is assuming the requests are processed sequentially and the response array is filled up sequentially, according to the sequence input features are passed in. Let’s pause for a second and think: does this system scale well? If the number of instances in the batch request increases, the time to get a response will keep increasing linearly.
We should rather take advantage of the distributed nature of the servers and parallelize the computation. So, let’s say that for our request, the (3, 2) feature has gone to server S1, the (2, 2) feature has gone to server S2, and the (4, 2) feature has gone to server S3. The prediction by server S3 is completed first, then by server S2, and finally, server S1 completes the prediction for (3, 2). So, the response array is now jumbled, and we get [$400k, $250k, $300k].
Now, if we are asked what kind of house has a price of $400,000 and we answer houses with three bedrooms and two bathrooms, our answer will be wrong.
A keyed prediction model serving pattern now comes into the picture to solve this problem and enable scalability in the serving of an ML model. The client supplies a key along with the features so that the responses can be identified using the key later on. The key can be any value that is distinct and can be used to map to the input instances easily. For example, the key can be as simple as the row number or index of the array element in the input data. The purpose of the key is to be able to match the response against the input instance.
Let’s revisit the preceding problem by passing keys now. The request now contains the following instances: [(k1, 3, 2), (k2, 2, 2), (k3, 4, 2)]. The response now will be [(k3, $400k), (k2, $250k), (k1, $300k)]. Therefore, we can easily identify which response belongs to which feature set, and our problem of leveraging distributed serving is resolved.
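The keyed request above can be sketched with Python's standard `concurrent.futures` module, where worker threads stand in for the servers S1, S2, and S3. The price lookup table is a hypothetical stand-in for a trained model, and the function names are assumptions made for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical stand-in for a trained model: (bedrooms, bathrooms) -> price.
HYPOTHETICAL_PRICES = {(3, 2): 300_000, (2, 2): 250_000, (4, 2): 400_000}


def predict_keyed(instance):
    """Predict for one instance and echo its key back with the result."""
    key, bedrooms, bathrooms = instance
    return key, HYPOTHETICAL_PRICES[(bedrooms, bathrooms)]


def serve_batch(instances):
    """Fan the instances out to workers and collect results as they finish."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [pool.submit(predict_keyed, instance) for instance in instances]
        # as_completed yields in completion order, not in submission order.
        return [future.result() for future in as_completed(futures)]


request = [("k1", 3, 2), ("k2", 2, 2), ("k3", 4, 2)]
responses = serve_batch(request)

# The key makes the arrival order irrelevant.
price_by_key = dict(responses)
```

Whatever order the responses arrive in, `price_by_key["k1"]` always refers to the house with three bedrooms and two bathrooms.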
In this section, we will discuss the serving patterns that give a clear picture of different serving approaches. These patterns describe which strategy should be followed to serve a particular type of model. These patterns are placed under the classification of patterns of serving approaches.
Serving approaches involve well-vetted strategies to serve an ML model to production – for example, whether the model will be served in an online fashion so that the impact of fresh data is immediately visible in the trained model, or in batch mode where the model updates with new training data after some interval. Patterns based on serving approaches categorize different serving strategies.
Some of the main differences between serving philosophy patterns and serving approach patterns include the following:
Based on serving approaches, we can see the following patterns in model serving:
Predictions from an ML model are not often possible instantly in a synchronous fashion. Whenever we need a prediction for a single feature set or a small array of feature sets, we might get the response instantly. However, when we need prediction for a large number of instances, we often need to do it asynchronously in a batch manner because of the following reasons:
For example, let’s consider creating monthly sales predictions for different items at different locations of a retail store. For that, we need predictions for thousands of features (locations and items) every month so that the demand planners can make appropriate monthly estimates of sales. For this, the following steps need to take place:
For example, let’s consider the following hypothetical sample prediction of a retail store, X, in Miami, Florida:
Additionally, for the location in Miami alone, we might need predictions for thousands of items. Considering all the different locations besides Miami, the number of instances needing prediction will be very large, and the prediction needs to happen only after the model is retrained with a new volume of data.
The batch serving pattern deals with this kind of problem, where the model serving solves these batch prediction problems and makes predictions asynchronously in situations when response latency is not a big concern.
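A batch job of this kind can be sketched as a function that scores every location and item pair in one run and stores the results for later lookup. This is a minimal sketch: the forecasting function is a hypothetical placeholder for a retrained model, and the locations, items, and in-memory table are assumptions made for illustration.

```python
from itertools import product


def predict_monthly_sales(location, item):
    """Hypothetical placeholder for a retrained forecasting model."""
    return len(location) * 10 + len(item)


def run_batch_job(locations, items):
    """Score every (location, item) pair once and return a prediction table."""
    return {
        (location, item): predict_monthly_sales(location, item)
        for location, item in product(locations, items)
    }


# The job runs on a schedule (for example, monthly), not per user request.
prediction_table = run_batch_job(["Miami", "Orlando"], ["milk", "bread"])


def lookup_prediction(location, item):
    """Clients read precomputed results instead of invoking the model."""
    return prediction_table[(location, item)]
```

Because clients only read from the stored table, a slow prediction run does not affect the response time they experience.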
In online prediction, the model needs to make a prediction immediately after the request is made. Usually, the model makes predictions for a single instance or a small number of instances that fit within the HTTP payload limit. In this kind of serving, we aim for low latency to provide a better customer experience.
In online model serving, the model is updated with new data each time a user makes a request, using continual (online) learning.
The major advantage of online model serving is that the model stays updated with new data and we can avoid the challenge of retraining.
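The idea of a model that both predicts per request and keeps learning can be sketched with a tiny linear model trained by stochastic gradient descent. This is an illustrative sketch only: the class name, the learning rate, and the toy data are assumptions, and a production system would use a proper online learning library.

```python
class OnlineLinearModel:
    """A tiny linear model that is updated one instance at a time."""

    def __init__(self, n_features, learning_rate=0.01):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, features):
        """Serve a prediction for a single instance."""
        return self.bias + sum(w * x for w, x in zip(self.weights, features))

    def update(self, features, target):
        """One gradient step on the squared error for a single new example."""
        error = self.predict(features) - target
        self.weights = [
            w - self.learning_rate * error * x
            for w, x in zip(self.weights, features)
        ]
        self.bias -= self.learning_rate * error


model = OnlineLinearModel(n_features=1)
prediction_before = model.predict([2.0])

# Each incoming example nudges the model; no separate retraining job is run.
for _ in range(200):
    model.update([2.0], 4.0)

prediction_after = model.predict([2.0])
```

After the updates, `prediction_after` is close to the target value of 4.0, while `prediction_before` was 0.0.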
However, there are some problems that need to be kept in mind while serving the model using the online model serving pattern. Some of these include the following:
In this pattern, two models are served; usually, one model stays on the cloud and the other on the edge devices, but there are other possible differences, including model size or other properties about which we may care. The model on the cloud is often complex and heavy. When handheld edge devices are clients of these models, we have to be aware of the problem that the edge device might be offline or in a weak connection zone. So, we need a lightweight model to be deployed on the edge device to serve the functional requirements within a Service Level Agreement (SLA).
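One possible arrangement of the two models is sketched below: the client tries the heavy cloud model first and falls back to the lightweight edge model when the device is offline. Both models are hypothetical threshold rules, and the function names, the exception, and the `online` flag are assumptions made for this sketch.

```python
class CloudUnavailable(Exception):
    """Raised when the device cannot reach the cloud model."""


def cloud_predict(score, online):
    """Hypothetical heavy model that lives in the cloud."""
    if not online:
        raise CloudUnavailable("device is offline")
    return {"label": "positive" if score > 0.5 else "negative", "source": "cloud"}


def edge_predict(score):
    """Hypothetical lightweight model shipped to the device."""
    return {"label": "positive" if score > 0.6 else "negative", "source": "edge"}


def two_phase_predict(score, online):
    """Prefer the cloud model, but keep serving from the edge when offline."""
    try:
        return cloud_predict(score, online)
    except CloudUnavailable:
        return edge_predict(score)


online_result = two_phase_predict(0.7, online=True)
offline_result = two_phase_predict(0.7, online=False)
```

The edge model may be less accurate, but it lets the application keep answering within its SLA when the connection is weak or missing.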
In serving ML, often a complex task is broken down into multiple steps, and each step focuses on a particular ML task. We can structure these steps in the form of a pipeline.
For example, let’s imagine we have a computer vision model where we identify different objects from an image or video and provide captions to the objects. The different steps involved in this whole process might be as follows:
All these separate steps can reside in a separate block in a pipeline. This will give us some flexibility in restarting the pipeline from a failed step whenever needed and debugging the process in a more granular way.
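Restarting from a failed step can be sketched as follows. The step functions are hypothetical placeholders loosely inspired by the vision example, not the actual steps of such a system, and the `PipelineError` class and the simulated transient failure are assumptions made for this sketch.

```python
class PipelineError(Exception):
    """Carries the index of the failed step and the input it received."""

    def __init__(self, step_index, step_input):
        super().__init__(f"step {step_index} failed")
        self.step_index = step_index
        self.step_input = step_input


def run_pipeline(steps, payload, start_at=0):
    """Run the steps in order, optionally resuming from a given step."""
    for index in range(start_at, len(steps)):
        try:
            payload = steps[index](payload)
        except Exception as exc:
            # payload still holds the input of the step that failed.
            raise PipelineError(index, payload) from exc
    return payload


calls = {"detect": 0}


def normalize(frame_id):
    return frame_id.strip().lower()


def detect_objects(frame_id):
    calls["detect"] += 1
    if calls["detect"] == 1:
        raise RuntimeError("simulated transient failure")
    return ["cat", "ball"]  # hypothetical detections


def caption_objects(objects):
    return [f"a {name}" for name in objects]


steps = [normalize, detect_objects, caption_objects]
try:
    captions = run_pipeline(steps, "  FRAME-001 ")
except PipelineError as err:
    # Resume at the failed step instead of repeating the earlier ones.
    captions = run_pipeline(steps, err.step_input, start_at=err.step_index)
```

Because the error records both the failed step and its input, the retry skips `normalize` and starts again at `detect_objects`.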
An ensemble pattern becomes useful when we need to use predictions from multiple models. This blog gives overviews of ensemble and business logic patterns: https://www.anyscale.com/blog/considerations-for-deploying-machine-learning-models-in-production. Some use cases of using ensemble model serving include the following:
The preceding scenarios require an ensemble serving pattern, as more than one model is ensembled or stacked together. The serving needs to accommodate this logic to support these problem scenarios.
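Two common ways of combining model outputs, majority voting for classification and averaging for regression, can be sketched as follows. The three classifiers and three price estimators are hypothetical threshold rules and constants, used here only to show where the combination logic sits in the serving code.

```python
from collections import Counter
from statistics import mean

# Hypothetical classifiers that disagree near their thresholds.
classifiers = [
    lambda score: "spam" if score > 0.3 else "ham",
    lambda score: "spam" if score > 0.4 else "ham",
    lambda score: "spam" if score > 0.7 else "ham",
]

# Hypothetical regressors that return slightly different price estimates.
regressors = [
    lambda bedrooms: 100_000 * bedrooms,
    lambda bedrooms: 100_000 * bedrooms + 10_000,
    lambda bedrooms: 100_000 * bedrooms - 10_000,
]


def ensemble_classify(score):
    """Return the label predicted by most of the classifiers."""
    votes = [classify(score) for classify in classifiers]
    return Counter(votes).most_common(1)[0][0]


def ensemble_regress(bedrooms):
    """Return the average of the regressors' estimates."""
    return mean(estimate(bedrooms) for estimate in regressors)


label = ensemble_classify(0.5)   # two of three models vote "spam"
price = ensemble_regress(3)
```

The serving layer has to call every member model and then apply the combination rule before it can respond, which is the extra logic the ensemble pattern accounts for.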
Model serving often requires a lot of business logic to be performed before inference can take place. Any logic that takes place other than inference falls into business logic. Some common business logic includes the following:
These business logic functions require expensive I/O operations. Often, the inference server where the model is served is kept separate from the server where business logic is deployed. Only when the business logic is successful can a user invoke the inference API. For example, confidential military-purpose ML models might not be accessible to any users except those authorized. We need to add business logic to check that authorization. We might have to check for sensitive and malicious data in the input and need business logic to do this. We might have to add business logic to prevent Distributed Denial of Service (DDoS) attacks.
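The order of operations described above, business logic first and inference only afterwards, can be sketched as a request handler. The token set, the validation rule, and the `infer` function are hypothetical and stand in for a real authorization service, input screening, and model.

```python
# Hypothetical set of tokens issued to authorized users.
AUTHORIZED_TOKENS = {"token-alice"}


def infer(features):
    """Hypothetical stand-in for the call to the inference server."""
    return sum(features)


def handle_request(token, features):
    """Run the business logic checks, then invoke inference."""
    # Authorization check: reject callers without a known token.
    if token not in AUTHORIZED_TOKENS:
        raise PermissionError("caller is not authorized")
    # Input validation: reject anything that is not a number.
    if not all(isinstance(value, (int, float)) for value in features):
        raise ValueError("features must be numeric")
    return infer(features)


accepted = handle_request("token-alice", [1, 2])

try:
    handle_request("token-mallory", [1, 2])
    rejected = False
except PermissionError:
    rejected = True
```

Keeping the checks in front of `infer` means the model is never invoked for a request that fails authorization or validation.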
In this chapter, we have learned about the patterns in model serving. We have learned that patterns in model serving can be seen from two angles at a high level: serving patterns based on serving philosophies and serving patterns based on serving strategies.
Serving patterns based on serving philosophies involve the best practices in serving models. These patterns help us ensure resilient model serving by ensuring fault-tolerant, scalable processes in model serving.
Serving patterns based on serving strategies involve recurring approaches used for serving models for different business use cases – for example, a batch serving strategy if the predictions are not necessary immediately and online serving if the predictions are needed immediately.
We also discussed a high-level overview of each of the patterns. We saw that serving principles such as stateless serving, continuous model evaluation, and keyed prediction can support the uninterrupted and resilient serving of the model.
The serving strategy patterns such as batch serving, online serving, two-phase model serving, pipeline patterns, ensemble patterns, and business logic patterns can help us to serve models for different business use cases.
These kinds of patterns can help reduce tech debt in model serving and inspire the future development of pattern-oriented model serving tools.
In this section, you can find some further reading that can help you to do further study of the concepts we discussed in this chapter: