Retry pattern

Problem description: we discussed this pattern briefly earlier. Applications are distributed in the sense that their components are expressed and exposed as services and delivered from different IT environments (private, public, and edge clouds). Typically, IT spans the embedded, enterprise, and cloud domains. With the fast-growing device ecosystem, connectivity now extends to a wide variety of devices at the ground level, which is why we so often hear about, read about, and even experience cyber-physical systems (CPS). Enterprise-scale applications (both legacy and modern) are likewise being modernized and moved to cloud environments to reap the distinct benefits of the cloud, although certain applications are, for specific reasons, kept on enterprise servers or in private clouds. With embedded and networked devices joining mainstream computing, edge/fog devices are being enabled to form ad hoc clouds that facilitate real-time data capture, storage, processing, and decision-making.

The point to note is that application services have to connect both to services in their vicinity and to remotely hosted services over different networks. Faults can occur along the way and disrupt application calls. As articulated previously, some of these faults are temporary (transient) and affect service connectivity, interaction, and execution. Such faults are typically self-correcting: if the action that triggered the fault is repeated after a suitable delay, it is likely to succeed.

The solution approach: in cloud environments, transient faults are common, and an application should be designed to handle them elegantly and transparently. This minimizes the effect that faults have on the business tasks the application is performing. If an application detects a failure when it tries to send a request to a remote service, it can handle the failure using one of the following strategies:

  • Cancellation: If the fault indicates that the failure is not transient (that is, it is likely to persist), or that the operation is unlikely to succeed if repeated, the application should cancel the operation and report an exception.
  • Retry: If the specific fault reported is unusual or rare, it might have been caused by unusual circumstances, such as a network packet becoming corrupted in transit. In this case, the application can retry immediately, because the same failure is unlikely to recur and the subsequent request will probably succeed.
  • Retry after delay: If the fault is caused by one of the more commonplace connectivity or busy failures, the application should wait for a suitable time and then try again.
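The choice among these three strategies can be sketched as a small classification step. The following Python fragment is only an illustration: the exception classes PermanentFault, RareFault, and BusyFault and the function choose_strategy are hypothetical names invented here, and a real application would map the actual errors raised by its client library onto the strategies instead.

```python
from enum import Enum


class Strategy(Enum):
    CANCEL = "cancel"
    RETRY = "retry"
    RETRY_AFTER_DELAY = "retry_after_delay"


# Hypothetical fault types, invented for this illustration only.
class PermanentFault(Exception):
    """For example, invalid credentials: repeating the request will not help."""


class RareFault(Exception):
    """For example, a corrupted packet: an immediate retry may well succeed."""


class BusyFault(Exception):
    """For example, a busy or unreachable service: wait before retrying."""


def choose_strategy(fault: Exception) -> Strategy:
    """Map a fault to one of the three strategies listed above."""
    if isinstance(fault, RareFault):
        return Strategy.RETRY
    if isinstance(fault, BusyFault):
        return Strategy.RETRY_AFTER_DELAY
    # Permanent faults, and anything we do not recognize, are not retried.
    return Strategy.CANCEL
```

Treating unrecognized faults as non-retryable is a deliberately conservative assumption of this sketch; it avoids repeating an operation whose failure mode is not understood.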

The application should wrap all attempts to access a remote service in code that implements a retry policy matching one of the strategies listed previously. Requests sent to different services can be subjected to different policies. Some vendors provide libraries that implement retry policies, where the application can specify the maximum number of retries, the time between retry attempts, and other parameters. An application should log the details of faults and failing operations. This information is useful to operators. If a service is frequently unavailable or busy, it's often because the service has exhausted its resources. We can reduce the frequency of these faults by scaling out the service. For example, if a database service is continually overloaded, it might be beneficial to partition the database and spread the load across multiple servers.
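A minimal sketch of such a wrapper is shown here, assuming a fixed delay between attempts. The names RetryPolicy, TransientFault, and call_with_retry are hypothetical and do not come from any vendor library; they only illustrate how the maximum number of retries, the time between attempts, and fault logging might fit together.

```python
import logging
import time
from dataclasses import dataclass
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")
logger = logging.getLogger("retry")


class TransientFault(Exception):
    """Hypothetical marker for faults that are worth retrying."""


@dataclass(frozen=True)
class RetryPolicy:
    max_retries: int = 3          # retries after the first attempt
    delay_seconds: float = 1.0    # fixed wait between attempts
    retry_on: Tuple[Type[Exception], ...] = (TransientFault,)


def call_with_retry(
    operation: Callable[[], T],
    policy: RetryPolicy,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying the faults named in the policy.

    Faults not listed in policy.retry_on propagate immediately, which
    corresponds to the cancellation strategy.
    """
    retries = 0
    while True:
        try:
            return operation()
        except policy.retry_on as fault:
            retries += 1
            if retries > policy.max_retries:
                # Log the failing operation so operators can investigate.
                logger.error("giving up after %d retries: %r",
                             policy.max_retries, fault)
                raise
            logger.warning("attempt %d failed (%r); retrying in %.1fs",
                           retries, fault, policy.delay_seconds)
            sleep(policy.delay_seconds)
```

Different services can be given different RetryPolicy instances, for example a policy with delay_seconds=0 for rare faults and a longer delay for a service that is often busy. The sleep parameter is injectable only so that the wrapper can be exercised without real waiting.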

In conclusion, the resiliency, robustness, and reliability of next-generation IT systems are strategically significant for fulfilling business and people needs with the expected Quality of Service (QoS) and Quality of Experience (QoE) traits. Having understood this, IT industry professionals, academic professors, and researchers are investing their talents, treasures, and time to unearth easy-to-understand and useful techniques, tips, and tricks that simplify and streamline software and infrastructure engineering tasks. I ask readers to visit https://docs.microsoft.com/en-us/azure/architecture/patterns/category/resiliency for further reading.
