One of the key contributors to the speed of Storm is its use of the LMAX disruptor in place of conventional queues. We did touch upon this topic in one of the earlier chapters, but now we are going to dive deeper into it. To be able to appreciate the use of LMAX in Storm, we first need to get acquainted with LMAX as an exchange platform.
Just to reiterate what was stated in one of the earlier chapters, this is how internal Storm communication happens:
LMAX is one of the world's most performance-optimized trading platforms, recognized for its low latency, simplified design, and high throughput. The answer to what makes the LMAX disruptor so fast lies in what made traditional data exchange systems slow: queues. Traditionally, data exchange systems passed data between components during the various stages of processing through queues, and it was these queues that introduced the latency. The disruptor optimizes this area by focusing on the contention inherent in queues and eliminating it with a new data structure, the ring buffer. This data structure is the heart of the LMAX framework.
At its core, it's nothing but an open source Java library created by LMAX. It's all about reusing memory, and it's purely sequential in nature. The more we learn about it, the more fascinating it gets in terms of its simplicity and lock-free implementation. The philosophy is straightforward: you need to understand the hardware well to write software that can exploit it and run most efficiently on it. I would like to quote Henry Petroski, from Martin Fowler's blog:
"The most amazing achievement of the computer software industry is its continuing cancellation of the steady and staggering gains made by the computer hardware industry."
Let's revisit the memory organization of our operating system. This is the quick picture that comes to mind:
The preceding diagram depicts the main memory along with the SRAM/cache and its three levels, along with store buffers.
Now, let's go a step further and detail how memory and caching are used by the core, and the data locality around cores. Registers are the closest to the cores for access, and then come the store buffers, which disambiguate memory accesses in order of execution. Often, operands and intermediate results from previous operations are moved out of the registers to accommodate new operands. A trip down memory lane to assembly language would help you recall how we used to load and store data from registers. We have a limited number of registers, and all data that is operated on has to be in registers; thus we have this frequent loading and unloading of intermediate results into store buffers during the execution of instructions. One thing that is certain and universally accepted is that registers are closest to the core and the fastest. The caches let us stage data away from the core at various levels; the caches are faster than main memory, but still not as fast as registers.
Let's examine this efficiency using a simple example. Consider an instruction sequence for an iteration over six items (what happens is that we have four store buffers, so the data for four items is held in store buffers, while the data for the other two has to live outside them). An application generally has access to four store buffers, so iterating over six items in one loop is slower than iterating over three items each in two loops (this has been benchmarked and is available on Martin Thompson's blog, http://mechanical-sympathy.blogspot.in/). I found it shocking to my modular-programming mind as well, but yes, context switching and loading/unloading play havoc with performance. At the hardware level, the preceding example does make sense and has been proven. I suggest that all programmers read more extensively about mechanical sympathy so that we can make use of the efficiency the hardware provides. This has been very fundamental to the LMAX disruptor.
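The two loop shapes compared above can be sketched as follows. This is an illustrative sketch only (the class and method names are mine, not from the cited benchmark), and it demonstrates the code shapes, not the timing difference, which you would need a proper benchmark harness to observe:

```java
// Sketch of the loop-splitting experiment: processing six items in one
// pass versus two passes of three. Both shapes compute the same result;
// only the access pattern presented to the hardware differs.
public class LoopSplitSketch {
    // One loop over all six items.
    static long singlePass(long[] items) {
        long sum = 0;
        for (int i = 0; i < 6; i++) {
            sum += items[i];
        }
        return sum;
    }

    // Two loops of three items each.
    static long twoPasses(long[] items) {
        long sum = 0;
        for (int i = 0; i < 3; i++) {
            sum += items[i];
        }
        for (int i = 3; i < 6; i++) {
            sum += items[i];
        }
        return sum;
    }
}
```

Functionally the two are identical, which is exactly why the benchmarked difference is so counterintuitive.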
Moving from registers to main memory increases latency from the order of nanoseconds to microseconds; thus, code and instructions should be written with a conscious effort to keep the data needed for execution local to the core, which gives lightning-fast execution.
Now, let's move to cache and try to understand a couple of key aspects about it as captured in the following diagram:
We all know and acknowledge that cache misses are expensive operations and should be avoided, but we also know why they are bound to happen: we always have more data than the cache can hold, and however good and efficient our caching strategy is, there are bound to be cache misses. To understand why in more technical terms, we have to touch upon the terms depicted in the preceding diagram:
Another significant point to note is that when we read from the cache, what is returned is always a full cache line, typically 64 bytes (or 32 bytes, depending on the architecture). In a multiprocessor setup, there is contention between processors for updates to cache lines even when they are actually accessing different data on the same cache line, a problem known as false sharing.
For high performance, the following points need to be taken care of:
The people at LMAX have taken care of this and designed an interesting data structure, called disruptor, for us. The following diagram captures some key contributors to the performance of disruptor:
A brief description of the key contributors is as follows:
A memory barrier, also known as a membar, memory fence or fence instruction, is a type of barrier instruction that causes a central processing unit (CPU) or compiler to enforce an ordering constraint on memory operations issued before and after the barrier instruction.
Processors use membars to demarcate sections of code where the ordering of updates to memory matters and must be adhered to. One item to note is that compilers also emit membars; a commonly cited example is the volatile keyword in Java.
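A minimal sketch of volatile acting as a membar follows. The class and field names here are illustrative: the write to the volatile field guarantees that the plain write before it becomes visible to any thread that subsequently reads the volatile field:

```java
// Sketch: the volatile keyword as a memory barrier. The volatile store
// to 'ready' orders the earlier plain write to 'payload' before it, so
// a thread that observes ready == true is guaranteed to also see the
// up-to-date payload.
public class VolatileBarrier {
    private int payload;            // plain field
    private volatile boolean ready; // volatile field acts as the fence

    public void produce(int value) {
        payload = value; // ordinary write
        ready = true;    // volatile write: publishes the preceding write
    }

    // Spins until the producer's volatile write is observed, then reads
    // the payload, which the barrier guarantees is current.
    public int consume() throws InterruptedException {
        while (!ready) {
            Thread.sleep(1);
        }
        return payload;
    }
}
```

Without the volatile modifier, the reader thread could legally observe `ready == true` while still seeing a stale `payload`, or spin forever on a cached `ready`.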
An obvious and enlightening question to ask here is: why not queues? What makes them inefficient compared to the disruptor framework? Here are a few prominent reasons:
They start by allocating memory to create ring buffers with millions of entries, and they reuse these memory spaces as they work their way around the buffer. So, they don't have to worry about GC and compaction.
The sequencing of data around these buffers to make use of the cache lines is a very important aspect and LMAX uses padding to ensure non-sharing of cache lines to avoid contention.
It's an array-based circular data structure that is used as a buffer to pass data between contexts (one thread to another). In very simple terms, it's an array with a pointer to the next available slot. Producers and consumers can write and read data from this data structure without locking or contention. The sequence/pointer keeps wrapping around the ring. What makes it different from other circular buffers is the lack of an end pointer. These data structures are highly reliable because messages, once written, are not removed; they stay there until they are marked to be overwritten. This makes replays very easy in the case of failures. A simple example: if a consumer requests a replay from the message at slot 4 while the producer is writing message 10, all messages from slots 4 to 10 are replayed and retained as non-overwritable until the producer gets an ack from the consumer.
The following diagram captures the significant aspects, as well as giving a common depiction of a ring buffer:
Talking about the implementation and salient features, it's a bounded array that is retraversed and modeled to behave as if it were virtually unbounded. How does this happen? It goes from slot 1 to slot 10 and then starts again at slot 1. It uses an interesting concept of bit-mask modulus: it starts reusing the memory from slot 1 after reaching slot 10, but the sequence numbers aren't restarted from 1. They proceed as 11, 12, 20, 21, and so on, to maintain the notion of virtual unboundedness. It could use a modulus operation to figure out the next available slot, but that is expensive compared to a bit mask. So it uses a bit-mask modulus (the size of the ring buffer minus 1, ANDed with the sequence number) to compute the next available element. This approach is yet another operation that demonstrates the mindset of efficiency-driven programming. This is called smart traversing.
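Smart traversing can be sketched in a few lines. The class name here is illustrative rather than LMAX's actual API; the key point is that for a power-of-two buffer size, `sequence & (size - 1)` gives the same slot as `sequence % size` without a costly division:

```java
// Sketch of smart traversing: mapping an ever-increasing sequence
// number onto a fixed-size ring buffer slot. For a power-of-two size,
// (sequence & (size - 1)) is equivalent to (sequence % size) but
// avoids the more expensive modulus/division operation.
public class RingIndex {
    private final long mask; // size - 1, e.g. 1023 for a 1024-slot ring

    public RingIndex(int size) {
        if (Integer.bitCount(size) != 1) {
            throw new IllegalArgumentException("size must be a power of two");
        }
        this.mask = size - 1;
    }

    // The slot that the given sequence number wraps onto.
    public long slotFor(long sequence) {
        return sequence & mask;
    }
}
```

Note the power-of-two requirement: the bit-mask trick only works when `size - 1` is a string of ones in binary, which is why disruptor ring buffers are sized to powers of two.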
Note that ring buffers aren't doing anything new; they are simply built with a mindset of taking advantage of hardware concepts. Network cards have used a pair of them (one for transmit, one for receive) ever since they were created. LMAX has taken that same hardware concept and packaged it in a software bundle so that software programmers can do efficient programming.
The preceding diagram depicts the typical moving components in a disruptor: producers and consumers.
The producers do basically what the name suggests: they dump data into the ring buffer in sequence, say slots 1, 2, 3, and so on. Obviously, a producer has to be cognizant of the next available slot, to be sure it doesn't overwrite data that hasn't yet been consumed or read. The consumers notify the producer of the sequence slot up to which they have read, so the producer knows how far it may fill the buffer. For example, if consumers have read data only up to slot 5, the producer has to be intelligent enough to ensure it doesn't overwrite the unread data beyond that. This is where the claim strategy comes into play; it can be single-threaded or multithreaded. The single-threaded strategy is absolutely fine in a single-producer model. However, if we have multiple producers in the mix, we need to deploy a multithreaded claim strategy, which helps the producers figure out the next available slot on the buffer. This uses a compare-and-swap (CAS), which is a relatively expensive operation.
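A multithreaded claim strategy boils down to a compare-and-swap loop over the next available sequence. The following sketch uses `java.util.concurrent.atomic.AtomicLong` rather than the actual LMAX claim-strategy classes, and the names are illustrative:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a multi-producer claim strategy: each producer claims the
// next slot with a compare-and-swap loop, so no two producers can ever
// receive the same sequence number.
public class MultiProducerClaim {
    private final AtomicLong cursor = new AtomicLong(-1);

    // Claim and return the next sequence number. The CAS retries on
    // contention, which is why this path is more expensive than the
    // single-producer strategy (a plain, uncontended increment).
    public long claimNext() {
        long current, next;
        do {
            current = cursor.get();
            next = current + 1;
        } while (!cursor.compareAndSet(current, next));
        return next;
    }
}
```

Under a single producer the CAS never fails, but with multiple producers each failed CAS costs a retry, which is the expense the text refers to.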
Before moving to batching, let's talk about claim strategy and write operations in the case of producer contention:
The batching effect is another feature used by the producers to be efficient on latency. Here, the producer knows which slot the consumer is working on. Let's say the consumer is consuming data from slot 9 while the producer has put data up to slot 4. The producer can then write data in a burst up to slot 9, or up to the point where the free slots in the ring buffer run out.
We definitely need timeouts or circuit breakers built into the producers so that they can override and break the circuit in case latency starts to build up in the system.
Consumers are basically the components that read data from the ring. Let's say data has been put into slots 1, 2, 3, 4, and 5 by the producers, and there are two consumers in the system (C1 and C2) sitting behind a single consumer barrier (CB1). They can operate in complete independence of each other until they hit the last slot that has been filled by the producer. A third consumer, C3, sits behind consumer barrier 2 (CB2), which has a dependency on CB1. This consumer can't access any slot until both C1 and C2 are finished with it, so we can establish dependencies judiciously using consumer barriers.
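The gating that CB2 enforces can be sketched as taking the minimum sequence that all upstream consumers have finished with. The names below are illustrative, not the actual LMAX SequenceBarrier API:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a consumer barrier: a downstream consumer (such as C3) may
// only read up to the minimum sequence that every upstream consumer
// (C1 and C2) has finished processing.
public class ConsumerBarrierSketch {
    private final AtomicLong[] upstream;

    public ConsumerBarrierSketch(AtomicLong... upstreamSequences) {
        this.upstream = upstreamSequences;
    }

    // Highest slot the dependent consumer is currently allowed to read.
    public long availableSequence() {
        long min = Long.MAX_VALUE;
        for (AtomicLong seq : upstream) {
            min = Math.min(min, seq.get());
        }
        return min;
    }
}
```

The downstream consumer simply never reads past `availableSequence()`, which is how the dependency graph between consumers is expressed without any locks.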
Batching plays a very important role here as well. If C1 is working on slot 1 and is taking a long time, and the producer has meanwhile started putting elements at slot 6, then when C2 is done with slots 2, 3, and 4, it can simply catch up and read the elements from slot 6 upward in one batch. The consumers are intelligent enough to sense this and act in batches.
We have a single data structure for all components, and the smart batching to catch up, on both the producer and consumer sides, helps beat latency in the most unconventional manner. The disruptor performs better with increased load, and we don't get the J-shaped latency graph that we do with queues under surging loads. The Storm framework internally uses this clever framework for data transfer and communication between workers.