Fault tolerance and failover

In a microservices architecture, a fault can have many causes. It is important to handle faults and fail over gracefully, as follows:

When a request takes too long to complete, enforce a predetermined timeout instead of waiting indefinitely for the service to respond.
When a request fails, identify the failing server instance, notify the service registry, and stop connecting to that instance. This prevents other requests from being routed to it.
When a service instance stops responding, shut it down and start a new instance so that the service as a whole keeps working as expected.
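
The first two practices above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: ServiceRegistry is a hypothetical in-memory stand-in for a real service registry, and send is any caller-supplied function that performs the request against one instance. Neither name comes from a particular library.

```python
import concurrent.futures
import time


class ServiceRegistry:
    """Hypothetical in-memory registry that tracks healthy instances."""

    def __init__(self, instances):
        self._healthy = list(instances)

    def healthy_instances(self):
        # Return a copy so callers can iterate while instances are removed.
        return list(self._healthy)

    def mark_unhealthy(self, instance):
        if instance in self._healthy:
            self._healthy.remove(instance)


def call_with_failover(registry, send, timeout_s=1.0):
    """Try each healthy instance in turn.

    An instance that raises or exceeds the timeout is reported to the
    registry so that later requests are not routed to it.
    """
    for instance in registry.healthy_instances():
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        future = pool.submit(send, instance)
        try:
            # Wait at most timeout_s instead of blocking indefinitely.
            return future.result(timeout=timeout_s)
        except Exception:
            # Covers both a timeout and an error raised by send().
            registry.mark_unhealthy(instance)
        finally:
            # Do not block on a worker that may still be stuck.
            pool.shutdown(wait=False)
    raise RuntimeError("no healthy instance available")
```

Note that the timeout only stops the caller from waiting; the worker thread keeps running until send returns. A real client would usually pass the timeout to the underlying network call as well.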

This can be achieved using the following:

Fault tolerance libraries, which prevent cascading failures by isolating remote instances and services that are not responding or are taking longer to respond than the SLA allows. This prevents other services from calling the failed or unhealthy instances.
Distributed tracing libraries, which trace the timing and latency of the service or system and highlight any deviations from the agreed SLA. They also help you find where the performance bottleneck is so that you can act on it.
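
Fault tolerance libraries typically isolate unhealthy services with a circuit breaker. The following is a minimal sketch of that idea, not the API of any specific library: the class name, the parameters (failure_threshold, reset_after_s, sla_s), and the injectable clock are all assumptions made for illustration. A call counts as a failure if it raises or if it takes longer than the assumed SLA limit.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected unattempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0,
                 sla_s=1.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_after_s = reset_after_s          # cool-down before a retry
        self.sla_s = sla_s                          # assumed latency limit
        self._clock = clock
        self._failures = 0
        self._opened_at = None                      # None means closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                # Fail fast so callers do not pile up on an unhealthy service.
                raise CircuitOpenError("circuit open; call rejected")
            # Cool-down elapsed: let one trial call through (half-open).
        start = self._clock()
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        if self._clock() - start > self.sla_s:
            # The call succeeded but was slower than the SLA allows.
            self._record_failure()
        else:
            # A healthy call closes the breaker and clears the count.
            self._failures = 0
            self._opened_at = None
        return result

    def _record_failure(self):
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

A caller would wrap each remote call, for example breaker.call(fetch_order, order_id), where fetch_order is a hypothetical client function, and treat CircuitOpenError as a signal to use a fallback or another instance.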
