Retrying operations

Sometimes, exceptions are thrown because of unexpected outages or so-called hiccups. This is a common scenario for a system that is highly integrated with other systems or services. For example, the trading system in a stock exchange may need to publish trade execution data to a messaging system for downstream processing. If the messaging system experiences even a momentary outage, the operation could fail. In that case, the most common approach is to sleep for a while and then try again. If the retry also fails, the operation is retried again later, until the system fully recovers.

Such retry logic is not difficult to write. Here, we will play with an example. Suppose that we have a function that fails randomly:

using Dates

function do_something(name::AbstractString)
    println(now(), " Let's do it")
    if rand() > 0.5
        println(now(), " Good job, $(name)!")
    else
        error(now(), " Too bad :-(")
    end
end

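On a good day, the function prints the "Let's do it" line followed by the lovely "Good job" message.

On a bad day, it prints the "Let's do it" line and then throws an ErrorException carrying the "Too bad :-(" message instead.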

Naively, we can develop a new function that incorporates the retry logic:

function do_something_more_robustly(name::AbstractString;
                                    max_retry_count = 3,
                                    retry_interval = 2)
    retry_count = 0
    while true
        try
            return do_something(name)
        catch ex
            retry_count += 1
            # Give up right away once the retries are used up,
            # rather than sleeping one more time before rethrowing.
            retry_count > max_retry_count && rethrow()
            sleep(retry_interval)
        end
    end
end

This function just calls the do_something function. If it encounters an exception, it waits 2 seconds, as specified by the retry_interval keyword argument, and tries again. It keeps track of the number of retries in retry_count, so it retries up to 3 times by default, as indicated by the max_retry_count keyword argument. If the operation still fails after the last retry, the exception is rethrown to the caller.

Of course, this code is fairly straightforward and easy to write. But we will get bored quickly if we do this over and over again for many functions. It turns out that Julia comes with a retry function that solves this problem nicely. We can achieve the exact same functionality with a single line of code:

retry(do_something, delays=fill(2.0, 3))("John")

The retry function takes a function as the first argument. The delays keyword argument can be any object that supports the iteration interface. In this case, we have provided an array of 3 elements, each set to 2.0, which means up to 3 retries with a 2-second wait before each one. The return value of the retry function is an anonymous function that takes any number of arguments. Those arguments are passed to the original function that needs to be called, in this case, do_something.
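Because retry returns a function rather than calling the wrapped function immediately, the result can be stored and called later. The following is a minimal sketch of that behavior. It assumes a hypothetical helper, flaky_greet, which is not part of the example above: it fails on its first two calls and then succeeds, so the outcome is reproducible. The short 0.1-second delays are also an assumption, chosen only to keep the sketch quick to run.

```julia
# Hypothetical helper (not from the text): fails twice, then succeeds.
# A Ref holds the call count so that the function can update it.
attempts = Ref(0)

function flaky_greet(name::AbstractString)
    attempts[] += 1
    attempts[] < 3 && error("Too bad :-(")
    return "Good job, $(name)!"
end

# retry only builds a new function here; flaky_greet has not been called yet.
robust_greet = retry(flaky_greet, delays = fill(0.1, 3))

# The argument is forwarded to flaky_greet on every attempt.
result = robust_greet("John")
println(result, " (after ", attempts[], " attempts)")
```

In this sketch the first two attempts fail and are swallowed by retry, and the third attempt returns normally, so only the successful result reaches the caller.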

Since the delays argument can contain any sequence of values, we could use a strategy in which the waiting time changes from one retry to the next. A common choice is to retry quickly (that is, sleep less) in the beginning but slow down over time. When connecting to a remote system, it is possible that the remote system is just having a short hiccup, or perhaps it is undergoing an extended outage. In the latter scenario, it does not make sense to flood the system with quick requests: doing so wastes resources and only adds load to a system that is already struggling.
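As a rough illustration of this idea, the delays can simply be an array whose values grow. In the sketch below, connect_to_remote is a hypothetical stand-in for a call to a remote system (it fails on the first three attempts and succeeds on the fourth), and the delay values are illustrative assumptions rather than recommendations.

```julia
# Hypothetical stand-in for a remote call: fails three times, then succeeds.
calls = Ref(0)

function connect_to_remote()
    calls[] += 1
    calls[] <= 3 && error("remote system unavailable")
    return :connected
end

# Retry quickly at first, then slow down: 0.1, 0.2, 0.4, and 0.8 seconds.
growing_delays = [0.1 * 2^k for k in 0:3]

status = retry(connect_to_remote, delays = growing_delays)()
println(status, " after ", calls[], " attempts")
```

Each failed attempt consumes the next value from growing_delays, so the waits lengthen as the failures continue.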

In fact, the default value for the delays argument is ExponentialBackOff, which iterates by exponentially increasing the delay time. On a very unlucky day when every attempt fails, retrying with ExponentialBackOff(n = 10) shows the gap between attempts widening quickly at first and then levelling off.

Let's pay attention to the wait time between retries. The result should match the default setting of ExponentialBackOff as seen from its signature:

ExponentialBackOff(; n=1, first_delay=0.05, max_delay=10.0, factor=5.0, jitter=0.1)

The keyword argument, n, indicates the number of retries; the run described above used a value of 10 rather than the default of 1. The first retry comes after 0.05 seconds. Then, for every retry, the delay grows by a factor of 5 until it hits the maximum of 10 seconds. Each growth step may be jittered by up to 10%.
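One way to check this pattern without waiting for real failures is to collect the delays that ExponentialBackOff generates. The sketch below sets jitter to 0.0, which is an assumption made only so that the numbers are repeatable; with the default jitter of 0.1, the values would vary slightly from run to run.

```julia
# Collect the 10 delays instead of sleeping through them.
# jitter = 0.0 disables the random variation (the default is 0.1).
backoff_delays = collect(ExponentialBackOff(n = 10, jitter = 0.0))

for (i, d) in enumerate(backoff_delays)
    println("retry ", i, ": wait ", d, " seconds")
end
```

The values should start at 0.05 seconds, grow roughly fivefold at each step (about 0.25, 1.25, and 6.25), and then stay at the 10-second cap for the remaining retries.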

The retry function is often overlooked but it is a very convenient and powerful way to make the system more robust.

It is easy to throw an exception when something goes wrong. But that's not the only way to handle error conditions. In the next section, we will discuss the concepts of exceptions versus normal negative conditions.
