11. Asynchronous Communication and Message Buses

Asynchronous communication between applications and services has been both the savior and the downfall of many platforms. And the vehicle (pun intended) most often used on this journey to paradise or inferno is the message bus. When implemented properly, asynchronous communication is absolutely a valuable rung in the ladder of near infinite scale. When implemented haphazardly, it merely hides the faults and blemishes of a product and is very much akin to putting “lipstick on a pig.”

As a rule, we favor asynchronous communication whenever possible. As we discuss in this chapter, this favorable treatment requires that one not only communicates in an asynchronous fashion but actually develops the application to be asynchronous in behavior. This means, in a large part, the move away from request/reply protocols—at least those with temporal constraints on responses. At the very least it requires aggressive timeouts and exception handling when responses are required within a specified period.

As the most often preferred implementation of asynchronous communication, the message bus is often underimplemented. In our experience, it is often thrown in as an afterthought without the appropriate monitoring or architectural diligence. The result is often delayed catastrophe; as critical messages back up, the system appears to be operating properly until the entire bus grinds to a halt or crashes altogether. As a critical portion of the product infrastructure, the site goes “off the air.” The purpose of this chapter is to keep such brown, gray, or black-outs from happening.

Rule 43—Communicate Asynchronously As Much As Possible

In general, asynchronous calls, whether within a service or between two different services, are much more difficult to implement than synchronous calls. The reason is that asynchronous calls often require coordination to communicate back to the calling service that the request has been completed. If you're firing and forgetting, there is no requirement for communication or coordination back to the calling method. This can easily be done in a variety of ways, including something as simple as the following PHP function, which makes use of the ampersand (&) to run the process in the background.

function asyncExec($filename, $options = '') {
  // Launch a child PHP process in the background (the trailing &) and
  // redirect both stdout and stderr so the parent never blocks on output.
  exec("php -f {$filename} {$options} > /dev/null 2>&1 &");
}

However, firing and forgetting is not always an option. Often the calling method wants to know when the called method is complete, perhaps because other processing has to happen before results can be returned. We can easily imagine a scenario on an ecommerce platform where the postage needs to be recalculated along with crediting discount codes. Ideally we'd like to perform these two tasks simultaneously instead of first calculating the shipping, which might require a third-party call to a vendor, and only then processing the discount codes on the items in the shopping cart. Either way, we can't send the final results back to the user until both are complete.
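As a sketch of that scenario (the function names and the hard-coded amounts here are hypothetical stand-ins, not a real shipping or discount API), the two calculations can be dispatched concurrently and joined only when both results are needed:

```python
from concurrent.futures import ThreadPoolExecutor

def calculate_postage(cart):
    # Stand-in for a (possibly slow) third-party shipping vendor call.
    return 4.99

def apply_discount_codes(cart):
    # Stand-in for crediting discount codes against the cart items.
    return 1.50

def checkout_totals(cart):
    # Dispatch both calculations at once; block only when both are needed.
    with ThreadPoolExecutor(max_workers=2) as pool:
        postage = pool.submit(calculate_postage, cart)
        discount = pool.submit(apply_discount_codes, cart)
        return postage.result(), discount.result()

print(checkout_totals({"items": []}))  # -> (4.99, 1.5)
```

The user-facing response is delayed only by the slower of the two tasks rather than by their sum.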

In most languages there are mechanisms, called callbacks, designed to allow for the coordination and communication between the parent method and the asynchronous child method. In C/C++, this is done through function pointers; in Java, it is done through object references. There are many design patterns that use callbacks, such as the delegate design pattern and the observer design pattern. But why go to all this hassle to call other methods or services asynchronously?
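In Python, for example, the same callback idea can be expressed with a future's completion hook (the task and callback names below are purely illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

results = []

def on_done(future):
    # Callback invoked once the asynchronous child task completes;
    # the parent never blocks waiting for the result.
    results.append(future.result())

def child_task(x):
    return x * 2

with ThreadPoolExecutor(max_workers=1) as pool:
    f = pool.submit(child_task, 21)
    f.add_done_callback(on_done)

print(results)  # -> [42]
```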

We go through the hassle of making some of our calls asynchronously because when all the methods, services, and tiers are tied together through synchronous calls, a slowdown or failure in one causes a delayed but nevertheless cascading failure in the entire system. As we discussed in Rule 38 (Chapter 9, “Design for Fault Tolerance and Graceful Failure”), putting all your components in series has a multiplicative effect of failure. We covered this concept with availability, but it also works for the risk of a bug per KLOC (thousand lines of code). If methods A, B, and C each have a 99.99% chance of being bug free and one calls the other, which calls the other, all synchronously, the chance of that entire logic stream being bug free is 99.99% × 99.99% × 99.99% = 99.97%.
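That multiplication is easy to verify in a couple of lines:

```python
# Reliability of components in series is the product of the parts.
per_component = 0.9999          # each method: 99.99% chance of being bug free
chain = per_component ** 3      # three methods called in series
print(round(chain, 4))          # -> 0.9997, i.e., 99.97%
```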

The same concept of reducing the risk of propagating failures was covered in Rule 36 (Chapter 9). In that rule we covered the idea of splitting your system’s pools into separate lanes for different sets of customers. The benefit is that a problem in one swimlane will not propagate to the other customers’ lanes, which minimizes the impact. Additionally, fault detection is much easier because there are multiple versions of the same code that can be compared. This ability to more easily detect faults when an architecture has swimlanes also applies to modules or methods that make asynchronous calls.

Asynchronous calls prevent the spreading of failures or slowdowns, and they aid in determining where a bug resides when there is a problem. Most people who have had a database problem have seen it manifest itself in the app or Web tier: a slow query causes connections to back up and then sockets to remain open on the app server. The database monitoring might not complain, but the application monitoring will. In this case you have synchronous calls between the app and database servers, and the problem becomes more difficult to diagnose.

Of course you can’t make all calls between methods and tiers in your system asynchronous, so the real question is which ones should be. To start with, calls that remain synchronous should have timeouts that allow for gracefully handling errors or continuing processing when a synchronously called method or service fails. The way to determine which calls are candidates for asynchronous treatment is to analyze each call against criteria such as the following:

External API/third party— Is the call to a third party or external API? If so, it absolutely should be made asynchronous. Far too many things can go wrong with external calls to make them synchronous. If at all possible, you do not want the health and availability of your system tied to a system that you can’t control.

Long-running processes— Is the process being called notorious for running long? Are the computational or I/O requirements significant? If so, these calls are great candidates for asynchronous treatment. Often the more problematic issues come from slow-running processes rather than outright failures.

Error-prone/frequently changed methods— Is the call to a method that gets changed frequently? The greater the number of changes, the more likely the code is to contain a bug. Avoid tying critical code to code that must change frequently; that is asking for an increased number of failures.

Temporal constraint— When there is no temporal constraint between two processes, consider firing and forgetting the child process. This might be the case when a new registrant receives a welcome e-mail. While your system should care whether the e-mail goes out, the registration results returned to the user should not be stalled waiting for it to be sent.
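For the synchronous calls that remain, the aggressive-timeout advice above can be sketched as follows (the timeout value, the fallback quote, and the function names are illustrative assumptions, not a prescribed implementation):

```python
import concurrent.futures
import time

def fetch_shipping_quote(cart):
    # Stand-in for a slow external/third-party vendor call.
    time.sleep(0.5)
    return 4.99

DEFAULT_QUOTE = 7.99  # Conservative fallback used when the vendor is slow.

def quote_with_timeout(cart, timeout_s=0.05):
    # Bound how long we are willing to block on the external call;
    # degrade gracefully to a default quote instead of hanging.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fetch_shipping_quote, cart)
        try:
            return future.result(timeout=timeout_s)
        except concurrent.futures.TimeoutError:
            return DEFAULT_QUOTE

print(quote_with_timeout({}))  # -> 7.99 (the vendor call exceeds the timeout)
```

The caller stays responsive and the exception path is exercised deliberately rather than discovered in production.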

These are just a few of the most important criteria for determining whether a call should be made asynchronously. A full set of considerations is left as an exercise for you and your team. While we could list another ten of these criteria, as the list grows the entries become more specific to particular systems. Moreover, spending an hour going through the exercise with your team of developers will make everyone aware of the pros and cons of synchronous and asynchronous calls, which does more to help you follow this rule, and thus scale your systems, than any list we could provide.

Rule 44—Ensure Your Message Bus Can Scale

One of the most common failures we identify within technology architectures is a giant single point of failure often dubbed the enterprise service bus or message bus. While the former is typically a message bus on steroids that often includes transformation capabilities and interaction APIs, it is also more likely to have been implemented as a single artery through the technology stack replete with aged messages clinging to its walls like so much cholesterol. When asked, our clients most often claim the asynchronous nature of messages transiting this bus as a reason why time wasn’t spent on splitting up the bus to make it more scalable or highly available. While it is true that applications designed to be asynchronous are often more resilient to failure, and while these apps also tend to scale more efficiently, they are still prone to high demand bottlenecks and failure points. The good news is that the principles you have learned so far in this book will as easily resolve the scalability needs of a message bus as they will solve the needs of a database.

Asynchronous systems tend to scale more easily and tend to be more highly available than synchronous systems. This attribute is due in large part to the component systems and services being capable of continuing to function in the absence or tardiness of certain data. But these systems still need to offload and accept information to function. When the system or service that allows them to “fire and forget,” or to communicate without blocking on a response, becomes slow or unavailable, they are still subject to the problems of having logical ports fill up to the point of system failure. Such failures are absolutely possible in message buses, as the “flesh and blood” of these systems is still the same sort of software and hardware that runs any other system. While in some cases the computational logic that runs the bus is distributed among several servers, systems and software are still required to allow the passing and interpretation of messages sent over the bus.

Having dispelled the notion that message buses somehow defy the laws of physics that bind the rest of our engineering endeavors, we can move on to how to scale them. We know that one of anything, whether it is physical or logical, is a bad idea from both an availability and scalability perspective, so we need to split it up. As you may have already surmised from our previous hint, a good approach is to apply the AKF Scale Cube to the bus. In this particular case, though, we can remove the X axis of scale (see Rule 7, Chapter 2, “Distribute Your Work”) as cloning the bus probably won’t buy us much. By simply duplicating the bus infrastructure and the messages transiting the bus we would potentially raise our availability (one bus fails, the other could continue to function), but we would still be left with a scale problem. Potentially we could send 1/Nth of the messages on each of the N buses that we implement, but then all potential applications would need to listen to all buses, and we would still have a potential reader congestion problem. What we need is a way to separate or differentiate our messages by something unique to the message or data (the Y axis—Rule 8, Chapter 2) or something unique to the customer or user (the Z axis—Rule 9, Chapter 2). Figure 11.1 is a depiction of the AKF Three Axes of Scale repurposed for message buses.

Figure 11.1. AKF Scale Cube for message buses

image

Having discarded the X axis of scale, let’s further investigate the Y axis of scale. There are several ways in which we might discriminate or separate messages by attribute. One easy way is to dedicate buses to particular purposes. For a commerce site, we may choose a resource-oriented approach that transits customer data on one bus (or buses), catalog data on another, purchase data on another, and so on. We may also choose a services-oriented approach and identify affinity relationships between services and implement buses unique to affinity groups. “Hold on,” you cry, “if we choose such an approach of segmentation, we lose some of the flexibility associated with buses. We can’t simply hook up some new service capable of reacting to all messages and adding new value in our product.”

The answer, of course, is that you are absolutely correct. Just as the splitting of databases reduces the flexibility associated with having all of your data comingled in a single place for future activity, so does the splitting of a service bus reduce flexibility in communication. But remember that these splits serve the greater good of enabling hypergrowth and staying in business! Do you want to have a flat business that can’t grow past the limitations of your monolithic bus, or be wildly successful when exponentially increasing levels of demand come flooding into your site?
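The resource-oriented split described above amounts to little more than a routing table from message type to a dedicated bus. Here is a minimal sketch, with in-process queues standing in for real buses and the bus names chosen purely for illustration:

```python
import queue

# One dedicated bus per resource type (a Y axis split).
buses = {
    "customer": queue.Queue(),
    "catalog": queue.Queue(),
    "purchase": queue.Queue(),
}

def publish(message):
    # Route each message to the bus dedicated to its resource type,
    # so congestion on one bus cannot back up the others.
    buses[message["type"]].put(message)

publish({"type": "purchase", "order_id": 42})
print(buses["purchase"].qsize())  # -> 1
```

A new service must now subscribe to the specific buses it cares about, which is exactly the flexibility tradeoff just discussed.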

We have other Y axis options as well. We can look at things we know about the data, such as its temporal qualities. Is the data likely to be needed quickly, or is it just a “for your information” piece of data? This leads us to considerations of quality of service, and segmenting by the required service level for any class of data means that we can build buses of varying quality levels and implied cost to meet our needs. Table 11.1 summarizes some of these Y axis splits, but it is by no means meant to be an all-encompassing list.

Table 11.1. AKF Y Axis Splits of Message Bus

image

Returning to Figure 11.1, we now apply the AKF Z axis of scale to our problem. As previously identified, this approach is most often implemented by splitting buses by customer. It makes the most sense when your implementation has already employed a Z axis split, as each of the swimlanes or pods can have a dedicated message bus. In fact, this would need to be the case if we truly wanted fault isolation (see Chapter 9). That doesn’t mean that we can’t leverage one or more message buses to communicate asynchronously between swimlanes. But we absolutely do not want to rely on a single shared infrastructure among the swimlanes for transactions that should complete within the swimlane.

The most important point in determining how to scale a message bus is to ensure that the approach is consistent with the approach applied to the rest of the technology architecture. If, for instance, you have scaled your architecture along customer boundaries using the AKF Z axis of scale, then it makes the most sense to put a message bus in each of these pods of customers. If you have split up functions or resources as in the Y axis of scale, then it makes sense that the message buses should follow a similar trend. If you have done both Y and Z axis splits and need only one method for the amount of message traffic you experience, the Z axis should most likely trump the Y axis to allow for greater fault isolation.
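Consistent with that guidance, a Z axis sketch simply routes each message by a stable hash of the customer identifier, so that a given customer always lands on the same pod's bus. The pod count and hashing scheme here are illustrative assumptions:

```python
import hashlib

NUM_PODS = 4  # Hypothetical number of customer pods, each with its own bus.

def pod_for_customer(customer_id):
    # A stable (non-randomized) hash ensures a given customer is always
    # routed to the same pod's message bus, preserving fault isolation.
    digest = hashlib.md5(str(customer_id).encode()).hexdigest()
    return int(digest, 16) % NUM_PODS

print(pod_for_customer("customer-123"))  # deterministic value in 0..3
```

Note that Python's built-in hash() is randomized per process, so a cryptographic or other stable hash is the safer choice for routing.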

Rule 45—Avoid Overcrowding Your Message Bus

Nearly anything, if done to excess, can have severe and negative consequences. Physical fitness, for example, if taken to an extreme over long periods of time, can actually depress the immune system of the body and leave the person susceptible to viruses. Such is the case with publishing absolutely everything that happens within your product on one (or, if you follow Rule 44, several) message buses. The trick is to know which messages have value, determine how much value they have, and determine whether that value is worth the cost of publishing the message at volume.

Why, having just explained how to scale a message bus, are we so interested in how much information we send to this now nearly infinitely scalable system? The answer is the cost and complexity of our scalable solution. While we are confident that following the advice in Rule 44 will result in a scalable solution, we want that solution to scale within certain cost constraints. We often see our clients publishing messages for nearly every action taken by every service. In many cases, this publication of information is duplicative of data that their application also stores in some log file locally (as in a Web log). Very often they will claim that the data is useful in troubleshooting problems or in identifying capacity bottlenecks (even while it may actually create some of those bottlenecks). In one case we’ve even had a client claim that we were the reason they published everything on the bus! This client claimed that they took our advice of “Design your systems to be monitored” (see Rule 49 in Chapter 12, “Miscellaneous Rules”) to mean “Capture everything your system does.”

Let’s start with the notion that not all data has equivalent value to your business. Clearly in a for-profit business, the data that is necessary to complete revenue-producing transactions is in most cases more important than data that helps us analyze transactions for future actions. Data that helps us get smarter about what we do in the future is probably more important than data that helps us identify bottlenecks (although the latter is absolutely very important). Clearly most data has some “option value” in that we might find use for it later, but this value is lower than the value of data that has a clear and meaningful impact on our business today. In some cases, having a little bit of data gives us nearly as much value as having all of it, as in the case of statistically significant sampling of lower-value data in a high-transaction system.

In most systems and especially across most message buses (except when we segment by quality of service in Rule 44) data has a somewhat consistent cost. Even though the value of a transaction or data element (datum) may change by the type of transaction or even value of the customer, the cost of handling that transaction remains constant. This runs counter to how we want things to work. Ideally we want the value of any element of our system to significantly exceed the cost of that element or in the worst case do no more than equal the cost. Figure 11.2 shows a simple illustration of this relationship and explains the actions a team should take with regard to the data.

Figure 11.2. Cost/value relationship of data and corresponding message bus action

image

The upper-left quadrant of Figure 11.2 is the best possible case—a case where the value of the data far exceeds the cost of sending it across the bus. In commerce sites clear examples of these transactions would be shopping cart transactions. The lower-right quadrant is an area in which we probably just discard the data altogether. A potential case might be where someone changes his profile picture on a social networking site (assuming that the profile picture change actually took place without a message being produced).

The rate at which we publish something has an impact on its cost on the message bus. As we increase the demand on the bus, we increase the cost of the bus(es) as we need to scale it to meet the new demand. Sampling allows us to reduce the cost of those transactions, and in some cases, as we’ve described previously, may still allow us to retain 100% of the value of those transactions. The act of sampling serves to reduce the cost of the transaction, moving us from right to left, and may allow us to get the value of the data to exceed the cost, thereby allowing us to keep some portion of the data. Reducing the cost of the transaction means we can reduce the size and complexity of our message bus(es) as we reduce the total number of messages being sent.
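As a sketch, sampling low-value messages at publication time might look like this. The 1% rate, the "value" field, and the function names are all illustrative assumptions rather than a prescribed design:

```python
import random

SAMPLE_RATE = 0.01  # Keep roughly 1 in 100 low-value messages.

def maybe_publish(message, publish, rate=SAMPLE_RATE):
    # High-value messages always go out on the bus;
    # low-value messages are sampled to cut bus demand.
    if message.get("value") == "high" or random.random() < rate:
        publish(message)
        return True
    return False

sent = []
maybe_publish({"value": "high", "body": "order placed"}, sent.append)
print(len(sent))  # -> 1 (high-value messages are never dropped)
```

Statistical summaries built on the sampled stream can then be scaled back up by the sampling rate.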

The overall message here is that just because you have implemented a message bus doesn’t mean that you have to use it for everything. There will be a strong urge to send more messages than are necessary, and you should fight that urge. Always remember that not every datum is created equally in terms of value, while its cost is likely equal to that of its peers. Use the technique of sampling to reduce the cost of handling data, and throw away (or do not publish) those things of low value. We return to the notion of value and cost in Rule 47 (Chapter 12) when we discuss storage.

Summary

This chapter is about asynchronous communication, and while it is the preferred method of communication, it is generally more difficult, more expensive (in terms of development and system costs), and can actually be done to excess. We started this chapter by providing an overview of asynchronous communication and then offering a few of the most critical guidelines for when to implement it. We then followed up with two rules dealing with message buses, which are one of the most popular implementations of asynchronous communication.

In Rules 44 and 45, we covered how to scale a message bus and how to avoid overcrowding it. As we mentioned in the introduction to this chapter, the message bus, while often the preferred implementation of asynchronous communication, is also often underimplemented. Thrown in as an afterthought without the appropriate monitoring or architectural diligence, it can turn out to be a huge nightmare instead of an architectural advantage.

Pay attention to these rules to ensure that the communication within and between services can scale effectively as your system grows.
