The saga pattern

The saga pattern is a solution adopted by microservice architectures to manage transactions in distributed systems.

A saga is a sequence of operations, each of which represents a unit of work that can be undone by a compensation action. When an operation succeeds, it publishes a message or event to trigger the next local transaction in the saga; otherwise, the saga executes a series of compensating transactions that undo the changes made by the preceding operations. Each operation can be seen as a local transaction: it commits or rolls back against its own data source, and it communicates with the other operations or local transactions that make up the saga.

The saga guarantees that either all of the operations complete successfully, or the corresponding compensation actions are run for all of the executed operations, to cancel partial processing.

This approach is different from the one followed by the 2PC protocol, where a global distributed transaction (XA) is created, involving all of the resources/services that build the business function. The saga pattern implements the concept of divide et impera: each service runs in its own local transaction and provides a compensation action. The individual operations that make up the business functionality communicate either through events (the events/choreography pattern) or through a Saga Execution Coordinator (the command/orchestration pattern).
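The event-based (choreography) variant can be sketched as follows. This is only a minimal in-memory illustration under stated assumptions: the `EventBus` class, the event names, and the order and payment handlers are all hypothetical, and a real system would exchange events through a message broker rather than a local queue.

```python
from collections import defaultdict, deque


class EventBus:
    """Hypothetical in-memory stand-in for a message broker."""

    def __init__(self):
        self._handlers = defaultdict(list)
        self._queue = deque()
        self.log = []  # every event type published, in order

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        self.log.append(event_type)
        self._queue.append((event_type, payload))

    def run(self):
        # Deliver queued events until no service publishes anything new.
        while self._queue:
            event_type, payload = self._queue.popleft()
            for handler in self._handlers[event_type]:
                handler(payload)


def run_order_saga(balance, amount):
    bus = EventBus()
    state = {"order": None, "balance": balance}

    # Order service: a local transaction and its compensation.
    def create_order(payload):
        state["order"] = "PENDING"
        bus.publish("OrderCreated", payload)

    def approve_order(payload):
        state["order"] = "APPROVED"

    def reject_order(payload):  # compensation for create_order
        state["order"] = "REJECTED"

    # Payment service: reacts to the order event and reports the outcome.
    def charge(payload):
        if state["balance"] >= payload["amount"]:
            state["balance"] -= payload["amount"]
            bus.publish("PaymentCompleted", payload)
        else:
            bus.publish("PaymentFailed", payload)

    bus.subscribe("OrderRequested", create_order)
    bus.subscribe("OrderCreated", charge)
    bus.subscribe("PaymentCompleted", approve_order)
    bus.subscribe("PaymentFailed", reject_order)

    bus.publish("OrderRequested", {"amount": amount})
    bus.run()
    return state, bus.log
```

Note that no component in this sketch knows the whole flow; each handler only reacts to the event that concerns it. In the orchestration variant, that knowledge would instead live in a single coordinator.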

It is immediately apparent that the approach implemented by the saga pattern violates the isolation principle of ACID transactions: the ability to commit a partial operation breaks isolation, since it makes the changes of each segment visible before the saga ends. The saga pattern addresses this limitation by using the eventual consistency model which, as I described previously, guarantees that the state will eventually become consistent after the saga completes.

The saga pattern therefore relies on an alternative to ACID, known as BASE. This acronym summarizes the following properties:

Basically available: The system guarantees availability, as defined in the CAP theorem. The theorem, conjectured by Eric Brewer in 2000 and formally proved by Seth Gilbert and Nancy Lynch in 2002, states that a networked shared-data system can guarantee at most two of the following three properties:
- Consistency: Every node in a distributed cluster returns the same, most recent, successful write, so that every client has the same view of the data
- Availability: Every available node returns a response for all read and write requests in a reasonable amount of time
- Partition tolerance: The system continues to function and maintains its consistency guarantees, in spite of network partitions

Soft state: The state may change as time progresses, even without any new modification requests, as a result of eventual consistency.
Eventual consistency: As I described earlier, the system is allowed to be in an inconsistent state for short periods of time. If the system does not receive any new update requests, then it guarantees that the state will eventually become consistent.

One of the key elements of the saga pattern is the compensation. Its role is to undo the work performed by the original operation by executing another operation, rather than by using the usual transaction rollback.

The reason is not only technical, as was described previously, but also functional. The compensation action does not necessarily have to restore the data to its initial state; when the failure of a saga operation prevents the original request from completing, its function is to bring the data to a state that is consistent for the business domain.

The important thing is that the compensation must be idempotent, so that it can be retried safely after a failure; this is what makes robust recovery management possible.

The main reason is that even the compensation may fail: unlike a traditional transaction rollback, which is guaranteed by the database or application server, it is not guaranteed to succeed. The causes of failure are heterogeneous, so it is difficult to implement an algorithm that is able to handle all of them.
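The value of idempotence becomes clearer with a small example. The sketch below assumes a compensation that refunds a payment and a failure in which the refund is applied but its acknowledgement is lost, so the caller has to retry. The `Account` class and the `refund`, `retry`, and `make_flaky_refund` functions are hypothetical names, and the in-memory set of processed IDs stands in for a durable store.

```python
class Account:
    def __init__(self, balance):
        self.balance = balance
        self.refunded = set()  # IDs of payments already compensated


def refund(account, payment_id, amount):
    """Idempotent compensation: repeating it for the same payment is a no-op."""
    if payment_id in account.refunded:
        return False  # already compensated, nothing to do
    account.balance += amount
    account.refunded.add(payment_id)
    return True


def retry(operation, attempts):
    """Call operation until it succeeds or the attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except ConnectionError as error:  # assumed transient failure type
            last_error = error
    raise last_error


def make_flaky_refund(account, payment_id, amount, failures):
    """Simulate a refund whose acknowledgement is lost `failures` times."""
    remaining = {"count": failures}

    def attempt():
        refund(account, payment_id, amount)  # the change itself is applied
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise ConnectionError("acknowledgement lost")
        return "compensated"

    return attempt
```

Because `refund` records which payments it has already handled, retrying it several times credits the account only once. Without that check, every lost acknowledgement would add another refund.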

You can implement two different recovery strategies, as follows:

Backward recovery: This is the most common approach, and it requires that all operations define a compensation handler. In the case of failure, the saga execution component aborts the currently executed operation, and then, for every previous successful operation, in the reverse order of the original execution, it calls its respective compensation action.
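Backward recovery as described above can be sketched in a few lines. This is an illustrative sketch, not a definitive saga execution component: `run_saga`, `SagaAborted`, and `make_step` are hypothetical names, and the failing operation is assumed to roll back its own local transaction before raising.

```python
class SagaAborted(Exception):
    """Raised after all completed steps have been compensated."""


def run_saga(steps):
    """Run (action, compensation) pairs in order, with backward recovery."""
    completed = []  # compensations of the steps that succeeded so far
    for action, compensation in steps:
        try:
            action()
        except Exception as error:
            # Undo the previous successful steps in reverse order.
            for compensate in reversed(completed):
                compensate()
            raise SagaAborted(str(error)) from error
        completed.append(compensation)


def make_step(trace, name, fail=False):
    """Build a demo step that records what it does in `trace`."""

    def action():
        if fail:
            raise RuntimeError(name + " failed")
        trace.append("do " + name)

    def compensation():
        trace.append("undo " + name)

    return action, compensation
```

For example, running the steps `order`, `payment`, and a failing `shipping` built with `make_step` leaves the trace `do order`, `do payment`, `undo payment`, `undo order`, and then raises `SagaAborted`.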

Forward recovery: This strategy requires that the system is able to produce a checkpoint that represents a snapshot of the system state at that particular point in time, to which the system can always be restored. This concept is similar to the one used in business process management applications. This effectively eliminates the need to define any compensation actions, since the system, in the case of a failure, will always have a safe point from which it can try to complete the business operation. Using this approach, the saga execution component is reduced to a basic, persistent transaction executor, losing most of the saga's benefits.
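A rough sketch of forward recovery follows. It assumes that the whole saga state fits in a dictionary and that an in-memory deep copy can stand in for a durable checkpoint; `run_forward` and its `max_attempts` parameter are hypothetical names introduced for illustration.

```python
import copy


def run_forward(steps, state, max_attempts=3):
    """Run steps in order; on failure, restore the last checkpoint and retry.

    Each step is a function that receives the state dict and mutates it.
    """
    checkpoint = copy.deepcopy(state)  # stand-in for a durable snapshot
    for step in steps:
        for attempt in range(1, max_attempts + 1):
            try:
                step(state)
                break
            except Exception:
                # Discard partial changes and return to the safe point.
                state.clear()
                state.update(copy.deepcopy(checkpoint))
                if attempt == max_attempts:
                    raise
        checkpoint = copy.deepcopy(state)  # new safe point after each step
    return state
```

No compensation handlers appear anywhere in this sketch: a failed step is simply retried from the last safe point, which is why the executor ends up being little more than a persistent retry loop.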

It is also possible to combine the two approaches to get the benefits of each one.

The transaction system takes checkpoints at predefined points, which can be periodic or based on other criteria. In the case of a failure, the system performs backward recovery to the last defined checkpoint, and then continues the saga execution in forward recovery mode.
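The combined strategy might look like the following sketch. It assumes that checkpoints are taken after named steps rather than at time intervals, and that a checkpoint only needs to remember where to resume; `run_with_checkpoints`, `checkpoint_after`, and `max_retries` are hypothetical names.

```python
def run_with_checkpoints(steps, checkpoint_after, max_retries=2):
    """Run (name, action, compensation) steps with combined recovery.

    checkpoint_after is the set of step names after which a checkpoint
    is taken.
    """
    since_checkpoint = []  # compensations of steps done after the checkpoint
    checkpoint_index = 0   # index of the first step after the checkpoint
    retries = 0
    i = 0
    while i < len(steps):
        name, action, compensation = steps[i]
        try:
            action()
        except Exception:
            # Backward recovery, but only as far as the last checkpoint.
            for compensate in reversed(since_checkpoint):
                compensate()
            since_checkpoint.clear()
            retries += 1
            if retries > max_retries:
                raise
            i = checkpoint_index  # forward recovery: resume from the checkpoint
            continue
        since_checkpoint.append(compensation)
        if name in checkpoint_after:
            since_checkpoint.clear()
            checkpoint_index = i + 1
        i += 1
```

Steps completed before the checkpoint are never compensated in this sketch; only the work done since the checkpoint is undone and then attempted again.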
