Failure patterns

We have already touched upon some of the ways of dealing with failures in microservices in this chapter. There are, however, a couple of more interesting approaches we should consider. The first of these is service degradation.

Service degradation

This pattern could also be called graceful degradation and is related to progressive enhancement. Let us hark back to the example of replacing the Pythagorean distance function with the haversine equivalent. If the haversine service is down for some reason, the less demanding function could be used in its place without a huge impact on users. In fact, they may not notice it at all. It isn't ideal that users get a worse version of the service, but it is certainly more desirable than simply showing the user an error message. Once the haversine service returns to life, we can stop using the less desirable service. We could have multiple levels of fallback, allowing several different services to fail while we continue to present a working application to the end user.
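As a minimal sketch of this fallback idea, the following Python tries each distance function in order of preference and moves on to the next when one fails. The names `with_fallbacks`, `haversine_km`, and `pythagorean_km` are hypothetical, and the haversine function is computed locally here purely as a stand-in for what would be a call to a remote service.

```python
import math
from typing import Callable, Sequence

Point = tuple[float, float]  # (latitude, longitude) in degrees
DistanceService = Callable[[Point, Point], float]

EARTH_RADIUS_KM = 6371.0


def haversine_km(a: Point, b: Point) -> float:
    """Great-circle distance; stands in for the preferred (remote) service."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def pythagorean_km(a: Point, b: Point) -> float:
    """Flat-plane approximation; the cheaper, less accurate fallback."""
    km_per_degree = 111.32  # approximate length of one degree of latitude
    dy = (b[0] - a[0]) * km_per_degree
    dx = (b[1] - a[1]) * km_per_degree * math.cos(math.radians((a[0] + b[0]) / 2))
    return math.hypot(dx, dy)


def with_fallbacks(services: Sequence[DistanceService]) -> DistanceService:
    """Build a function that tries each service in order of preference."""
    def call(a: Point, b: Point) -> float:
        last_error = None
        for service in services:
            try:
                return service(a, b)
            except Exception as error:  # a real system would catch narrower errors
                last_error = error
        # Every level of fallback failed, so the caller must show an error.
        raise RuntimeError("all distance services failed") from last_error
    return call


distance = with_fallbacks([haversine_km, pythagorean_km])
```

Because the preferred service is tried first on every call, the degraded path stops being used as soon as the haversine service recovers. The same shape would fit the case of failing over to a more expensive provider.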

Another good application of this form of degradation is to fall back to more expensive services. I once had an application that sent SMS messages. It was quite important that these messages actually be sent. We used our preferred SMS gateway provider the majority of the time but, if our preferred service was unavailable, something we monitored closely, then we would fail over to using a different provider.

Message storage

We've already drawn a bit of a distinction between services which are query-only and those which actually perform some lasting data change. When one of these updating services fails, there is still a need to run the data change code at some point in the future. Storing these requests in a message queue allows them to be run later without risk of losing any of the ever-so-important messages. Typically, when a message causes an exception, it is returned to the processing queue where it can be retried.

There is an old saying that insanity is doing the same thing over again and expecting a different outcome. However, there are many transient errors which can be solved by simply performing the same action over again. Database deadlocks are a prime example of this. Your transaction may be killed to resolve a deadlock, in which case performing it again is, in fact, the recommended approach. However, one cannot retry messages ad infinitum, so it is best to choose some relatively small number of retry attempts, say three or five. Once this number has been reached, the message can be sent to a dead letter or poison message queue.

Poison messages, or dead letters as some call them, are messages which have actual legitimate reasons for failing. It is important to keep these messages around not only for debugging purposes but because the messages may represent a customer order or a change to a medical record: not data you can afford to lose. Once the message handler has been corrected these messages can be replayed as if the error never happened. A storage queue and message reprocessor can be seen illustrated here:

[Figure: Message storage]
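The retry and dead letter behaviour described above can be sketched in a few lines of Python. This is an in-memory illustration only: the `StoredQueue` class, its method names, and the limit of three attempts are assumptions made for the example, and a production system would rely on a durable message broker rather than a `deque`.

```python
from collections import deque
from dataclasses import dataclass
from typing import Any, Callable

MAX_ATTEMPTS = 3  # assumed retry limit; the text suggests three or five


@dataclass
class Message:
    body: Any
    attempts: int = 0


class StoredQueue:
    """Hypothetical in-memory queue with retries and a dead letter list."""

    def __init__(self, handler: Callable[[Any], None],
                 max_attempts: int = MAX_ATTEMPTS) -> None:
        self.handler = handler
        self.max_attempts = max_attempts
        self.pending: deque[Message] = deque()
        self.dead_letters: list[Message] = []

    def send(self, body: Any) -> None:
        self.pending.append(Message(body))

    def process_all(self) -> None:
        while self.pending:
            message = self.pending.popleft()
            message.attempts += 1
            try:
                self.handler(message.body)
            except Exception:
                if message.attempts >= self.max_attempts:
                    # Out of retries: keep the message rather than lose it.
                    self.dead_letters.append(message)
                else:
                    # Possibly a transient failure: return it to the queue.
                    self.pending.append(message)

    def replay_dead_letters(self) -> None:
        """Re-run dead letters, for example after the handler has been fixed."""
        replay, self.dead_letters = self.dead_letters, []
        for message in replay:
            message.attempts = 0
            self.pending.append(message)
        self.process_all()
```

A message that fails on a transient error, such as a deadlock, succeeds on a later attempt and never reaches the dead letter list. A message that fails every time is parked there until the handler is corrected and `replay_dead_letters` is called.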

Message replay

Although not a real production pattern, a side-effect of having a message-based architecture around all the services which change data is that you can acquire the messages for later replay outside of production. Being able to replay messages is very handy for debugging complex interactions between numerous services as the messages contain almost all the information to set up a tracing environment identical to production. Replay capabilities are also very useful for environments where one must be able to audit any data changes to the system. There are other methods to address such audit requirements but a very solid message log is simply a delight to work with.

Idempotence of message handling

The final failure pattern we'll discuss is idempotence of message handling. As systems grow larger it is almost certain that a microservices architecture will span many computers. This is even more certain due to the growing importance of containers, which can, ostensibly, be thought of as computers. Communicating between computers in a distributed system is unreliable; thus, a message may end up being delivered more than once. To handle such an eventuality one might wish to make message handling idempotent.

For more about the unreliability of distributed computing, I cannot recommend any paper more worth reading than Fallacies of Distributed Computing Explained by Arnon Rotem-Gal-Oz.

Idempotence means that a message can be processed many times without changing the outcome. This can be harder to achieve than one might realize, especially with services which are inherently non-transactional such as sending e-mails. In these cases, one may need to write a record that an e-mail has been sent to a database. There are some scenarios in which the e-mail will be sent more than once, but a service crashing in the critical section between the e-mail being sent and the record of it being written is unlikely. The decision will have to be made: is it better to send an e-mail more than once or not send it at all?
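One way to approximate idempotence for the e-mail case is to record which messages have already been handled and skip any that arrive again. The sketch below assumes that every message carries a unique identifier, which the text does not specify, and it uses an in-memory set where a real service would write to a database. The `IdempotentHandler` name is hypothetical.

```python
from typing import Any, Callable


class IdempotentHandler:
    """Runs an action at most once per message ID, barring a badly timed crash."""

    def __init__(self, action: Callable[[Any], None]) -> None:
        self.action = action
        # Stand-in for a durable record, such as a database table of sent IDs.
        self.processed_ids: set[str] = set()

    def handle(self, message_id: str, payload: Any) -> bool:
        """Return True if the action ran, False if the message was a duplicate."""
        if message_id in self.processed_ids:
            return False
        self.action(payload)
        # Critical section: a crash between the action above and the record
        # below means a redelivered message would run the action a second time.
        self.processed_ids.add(message_id)
        return True
```

Recording the ID after the action favours sending an e-mail twice over not sending it at all. Writing the record first would reverse that trade-off, which is the decision the paragraph above describes.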
