Understanding LMAX

One of the key aspects that contributes to the speed of Storm is its use of the LMAX disruptor rather than queues. We touched upon this topic in an earlier chapter, but now we are going to dive deeper into it. To be able to appreciate the use of LMAX in Storm, we first need to get acquainted with LMAX as an exchange platform.

Just to reiterate what's been stated in one of the earlier chapters, this is how internal Storm communication happens:

  • Communication between different executors running within the same worker (in a way, it's inter-thread communication on a single Storm node); here, the Storm framework uses the LMAX disruptor
  • Communication between different workers, which might be on the same node or on different nodes (here, ZeroMQ or Netty is used)
  • Communication between two topologies is achieved by external, non-Storm mechanisms such as queues (for example, RabbitMQ, Kafka, and so on) or distributed caching mechanisms (for example, Hazelcast, Memcache, and so on)

LMAX is a financial trading platform recognized for its low latency, simplified design, and high throughput; it is widely regarded as one of the most performance-optimized exchanges built so far. The answer to what makes the LMAX disruptor so fast lies in what made traditional data exchange systems slow: queues. Traditionally, data exchange systems passed data between components during the various stages of processing using queues, and those queues were what introduced the latency. The disruptor optimizes this area by focusing on the contention issues inherent in queues and by eliminating them with a new ring buffer data structure. This disruptor data structure is the heart of the LMAX framework.

At its core, the disruptor is an open source Java library created by LMAX. It is all about reusing memory, and it is purely sequential in nature. The more we learn about it, the more fascinating it gets in terms of its simplicity and its avoidance of locks and contention. The philosophy is straightforward: you need to understand the hardware well in order to write software that exploits it most efficiently. I would like to quote Henry Petroski, as cited on Martin Fowler's blog:

"The most amazing achievement of the computer software industry is its continuing cancellation of the steady and staggering gains made by the computer hardware industry."

Memory and cache

Let's revisit the memory organization of a typical machine. This is the quick picture that comes to mind:

Memory and cache

The preceding diagram depicts the main memory, the SRAM cache with its three levels, and the store buffers.

Now, let's go a step further and detail how memory and caching are used by the core, and the data locality around the cores. Registers are the closest to the core for access, and then come the store buffers, which buffer writes so that memory accesses stay consistent with the order of execution. Often, the operands and intermediate results from previous operations are moved out of the registers to make room for new operands. A trip down memory lane to assembly language will help you recall how we used to load and store data to and from the registers. We have a limited number of registers, and all data that is being operated on has to be in registers; thus, we have this frequent loading of operands and offloading of intermediate results/data into the store buffers during the execution of instructions. One thing that is certain and universally accepted is that these registers are the closest to the core and the fastest. All the caches let us stage data away from the core at various levels, so the caches are faster than the main memory, but still not as fast as registers.

Let's understand this efficiency using a simple example. Suppose I write an instruction sequence that iterates over six items: with four store buffers available, the data for four items is held in the store buffers, while the data for the other two has to be kept outside them. An application generally has access to four store buffers, so iterating over six items in one loop turns out to be slower than iterating over three items each in two loops (this has been benchmarked and is available on Martin Thompson's blog, http://mechanical-sympathy.blogspot.in/). This was shocking to my modular programming mind as well, but yes, context switching and loading/unloading play havoc with performance. At the hardware level, the preceding example does make sense and has been proven. I suggest that all programmers read more about mechanical sympathy so that we can make use of the efficiency the hardware provides. This thinking has been fundamental to the LMAX disruptor.

Moving away from the registers out to main memory increases latency by orders of magnitude, and thus code/instructions should be written with a conscious effort to keep the data needed for execution local to the core, which gives lightning-fast execution.
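
To make the locality argument concrete, here is a minimal Java sketch (an illustrative example, not Martin Thompson's benchmark) that sums the same large array once sequentially and once with a large stride; on most hardware the sequential pass is considerably faster because whole cache lines are prefetched ahead of the loop:

    public class LocalityDemo {
        private static final int SIZE = 1 << 24;        // 16M ints, far larger than any cache
        private static final int[] DATA = new int[SIZE];

        // Walk the array in memory order: the prefetcher streams cache lines ahead of us.
        static long sequentialSum() {
            long sum = 0;
            for (int i = 0; i < SIZE; i++) {
                sum += DATA[i];
            }
            return sum;
        }

        // Walk the same array with a large stride: most accesses miss the cache.
        static long stridedSum(int stride) {
            long sum = 0;
            for (int start = 0; start < stride; start++) {
                for (int i = start; i < SIZE; i += stride) {
                    sum += DATA[i];
                }
            }
            return sum;
        }

        public static void main(String[] args) {
            long t1 = System.nanoTime();
            sequentialSum();
            long t2 = System.nanoTime();
            stridedSum(4096);
            long t3 = System.nanoTime();
            System.out.printf("sequential: %d ms, strided: %d ms%n",
                    (t2 - t1) / 1_000_000, (t3 - t2) / 1_000_000);
        }
    }

Both sums touch exactly the same bytes; only the order of access changes, which is the essence of mechanical sympathy.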

Now, let's move on to the cache and try to understand a couple of key aspects about it, as captured in the following diagram:

Memory and cache

We all know and acknowledge that cache misses are expensive operations and should be avoided, but we also know that they are bound to happen. We always have more data than the cache can hold and, however good and efficient our caching strategy is, there are bound to be cache misses. To explain why in more technical terms, we have to touch upon the terms depicted in the preceding diagram:

  • Cold cache: This refers to data that is being accessed for the first time and has never been loaded into the cache, so a miss is unavoidable.
  • Capacity: The cache is limited in size, so entries have to be evicted to accommodate new data; plenty of eviction algorithms are available for this.
  • Conflict for mapping/replacement: This is related to the cache's associativity. Specific data can only be placed in specific locations, so two items that map to the same location keep replacing each other, and we get misses even when the cache is not full.

Another significant point to note is that when we read from the cache, what is returned is always a whole cache line, typically 64 or 32 bytes depending on the architecture. In the case of a multiprocessor setup, there is contention between processors for updates to cache lines even when they are actually accessing different data on the same cache line (this is known as false sharing).
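
To see what contention on a shared cache line costs, here is a small, hedged Java sketch (the class and field names are illustrative and are not the Disruptor's actual padded sequence classes): two threads first increment two counters that very likely sit on the same cache line, and then two counters that are padded onto separate lines:

    public class FalseSharingDemo {

        // Two counters in one object: on most JVMs they end up on the same 64-byte
        // cache line, so writers on different threads keep invalidating each other's
        // copy of that line even though they never touch the same variable.
        static final class SharedPair {
            volatile long a;
            volatile long b;
        }

        // A counter padded with unused longs so that, in practice, no neighbour
        // shares its cache line (the Disruptor uses the same padding trick).
        static final class PaddedCounter {
            volatile long value;
            long p1, p2, p3, p4, p5, p6, p7;   // padding, never read
        }

        static long race(Runnable left, Runnable right) throws InterruptedException {
            Thread t1 = new Thread(left), t2 = new Thread(right);
            long start = System.nanoTime();
            t1.start(); t2.start();
            t1.join(); t2.join();
            return (System.nanoTime() - start) / 1_000_000;
        }

        public static void main(String[] args) throws InterruptedException {
            final int n = 100_000_000;
            SharedPair pair = new SharedPair();
            PaddedCounter pa = new PaddedCounter();
            PaddedCounter pb = new PaddedCounter();

            long shared = race(() -> { for (int i = 0; i < n; i++) pair.a++; },
                               () -> { for (int i = 0; i < n; i++) pair.b++; });
            long padded = race(() -> { for (int i = 0; i < n; i++) pa.value++; },
                               () -> { for (int i = 0; i < n; i++) pb.value++; });
            System.out.printf("same cache line: %d ms, padded: %d ms%n", shared, padded);
        }
    }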

For high performance, the following points need to be taken care of:

  • Do not share cache lines, in order to avoid contention.
  • Be sequential in the data organization and let the hardware prefetch the data. For example, when iterating over an array, the data is sequential by definition and thus it's prefetched into the cache.
  • GC and compaction, especially of the old generation, should be avoided as far as possible; for example, write code that keeps its data in the Eden/young generation only.
  • Restart the process every day so that compaction is never needed.

The people at LMAX have taken care of this and designed an interesting data structure, called the disruptor, for us. The following diagram captures some key contributors to the performance of the disruptor:

Memory and cache

A brief description of the key contributors is as follows:

  • Control the core and never release it to the kernel (thus keeping your L3 cache warm).
  • Write the code in a non-contending manner; locks perform extremely poorly because of context switching into the kernel, which suspends threads waiting on a lock until it is released.

    Note

    Note that during a kernel context switch, the OS may decide to perform other tasks not related to your process, and thus lose valuable cycles.

  • Avoid memory barriers. Before we delve deeper into why, let's recall what memory barriers are, using the classic definition at https://en.wikipedia.org/wiki/Memory_barrier, which is quite self-explanatory:

    A memory barrier, also known as a membar, memory fence or fence instruction, is a type of barrier instruction that causes a central processing unit (CPU) or compiler to enforce an ordering constraint on memory operations issued before and after the barrier instruction.

    Processors use membars to demarcate the sections of code where the ordering of updates to memory matters and must be adhered to. One thing to note is that compilers also add membars; the commonly cited example is the volatile keyword in Java (see the sketch after this list).

  • Preallocate and use sequential memory. The main reasons for this are that sequential data can be prefetched into the cache and cache line contention is avoided, while reuse reduces the cost impact caused by GC and compaction.
  • Thread control is an extremely important concern to be taken care of while developing low-latency, high-throughput systems.
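
As mentioned above, here is a minimal Java illustration of a compiler-inserted membar (the field names are just for this sketch): the volatile keyword on ready prevents the write to payload from being reordered past it, so a reader that observes ready as true is also guaranteed to observe payload as 42:

    public class VolatileBarrierDemo {
        static int payload = 0;                   // plain field: no ordering guarantees on its own
        static volatile boolean ready = false;    // the volatile write/read acts as the barrier

        public static void main(String[] args) throws InterruptedException {
            Thread reader = new Thread(() -> {
                while (!ready) {
                    // spin until the volatile write becomes visible
                }
                // The write to 'payload' cannot be reordered past the volatile write to
                // 'ready', so once 'ready' is seen as true, 42 is guaranteed to be visible.
                System.out.println("payload = " + payload);
            });
            reader.start();

            payload = 42;     // ordinary write
            ready = true;     // volatile write: this is where the membar is inserted
            reader.join();
        }
    }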

An obvious and enlightening question to ask here is: why not queues? What makes them inefficient compared to the disruptor framework? Here are a few prominent reasons:

  • When we get into the realm of unbounded queues, we end up using linked lists, which are not contiguous in memory and thus, as a consequence, do not support striding (prefetching of cache lines).
  • The obvious alternative is bounded queues that rely on arrays. We have a head (from which messages are read off the queue) and a tail (where messages are fed into the queue), which serve the classic enqueue and dequeue operations. But we need to understand that these two basic queue operations are not only the primary points of write contention but also have a tendency to share cache lines. In Java, queues are also known as GC-heavy data structures. For a bounded queue, the memory for the data must be allocated, and, if the queue is unbounded, the linked list nodes must be allocated as well. When dereferenced and no longer reachable, all of these objects must be reclaimed.
  • The disruptor, in contrast, makes smart use of mechanical sympathy in the way it is modeled, both as a data structure and as a framework:
    • It's a system that's designed to utilize the caches as effectively as possible, given the nature of the hardware.
    • It starts by creating a large, array-backed ring buffer at start-up, and data is copied into it as it's received.
    • Sequencing is vital: what you need next is loaded into the cache sequentially, with padding used effectively where needed.
    • It avoids false sharing of cache lines.

At start-up, LMAX allocates memory to create ring buffers with millions of entries, and then reuses those memory slots while working its way around the buffers. As a result, there is no need to worry about GC and compaction.

The sequencing of data around these buffers, so as to make good use of the cache lines, is a very important aspect, and LMAX uses padding to ensure that cache lines are not shared, thereby avoiding contention.

Ring buffer – the heart of the disruptor

It's an array-based circular data structure that is used as a buffer to pass data between contexts (from one thread to another). In very simple terms, it's an array that has a pointer to the next available slot. Producers and consumers can write and read data from this data structure without locking or contention. The sequence/pointer keeps wrapping around the ring. What makes it different from other circular buffers is the lack of an end pointer. These data structures are highly reliable because messages, once written, are not removed; they stay in there until they are marked to be overwritten. In the case of failures, replays are very easy. As a simple example, if a consumer requests a replay from the message at slot 4 while the producer is writing message 10, all the messages from slots 4 to 10 are replayed and retained as non-overwritable until the producer gets an ack from the consumer.

The following diagram captures the significant aspects, as well as giving a common depiction of a ring buffer:

Ring buffer – the heart of the disruptor

Talking about the implementation and its salient features, it's a bounded array that is re-traversed and modeled to behave as if it were virtually unbounded. How does this happen? Say the buffer has ten slots: it goes from slot 1 to slot 10 and then starts again at slot 1. It uses an interesting concept of bit-mask modulus: it starts reusing the memory from slot 1 after reaching slot 10, but the sequence numbers are not restarted from 1; they proceed as 11, 12, 13, and so on, to maintain the notion of virtual unboundedness. A modulus operation could be used to map a sequence number to the next available slot, but it is expensive compared to a bit-mask operation. So the disruptor applies a bit mask (the sequence number ANDed with the ring buffer size minus one, which works because the size is kept a power of two) to compute the next available element. This approach is yet another operation that demonstrates the mindset of efficiency-driven programming. It is called smart traversing.
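
The smart traversing described above can be sketched in a few lines of Java. This is a deliberately stripped-down, single-threaded illustration rather than the LMAX implementation, and the class name is made up for the example: the buffer size is a power of two, the sequence number grows forever, and sequence & (size - 1) maps it back onto the fixed array in place of the costlier modulus:

    public class MiniRingBuffer<T> {
        private final Object[] entries;
        private final int mask;           // size - 1, valid because size is a power of two
        private long nextSequence = 0;    // keeps growing forever, never reset

        public MiniRingBuffer(int sizePowerOfTwo) {
            if (Integer.bitCount(sizePowerOfTwo) != 1) {
                throw new IllegalArgumentException("size must be a power of two");
            }
            this.entries = new Object[sizePowerOfTwo];
            this.mask = sizePowerOfTwo - 1;
        }

        // Claim the next slot: the sequence keeps increasing after a wrap (8, 9, 10, ...),
        // while the bit mask maps it back onto the fixed-size array.
        public long publish(T value) {
            long sequence = nextSequence++;
            entries[(int) (sequence & mask)] = value;   // same result as sequence % size, but cheaper
            return sequence;
        }

        @SuppressWarnings("unchecked")
        public T get(long sequence) {
            return (T) entries[(int) (sequence & mask)];
        }

        public static void main(String[] args) {
            MiniRingBuffer<String> ring = new MiniRingBuffer<>(8);
            for (int i = 0; i < 12; i++) {
                ring.publish("event-" + i);             // wraps around after sequence 7
            }
            System.out.println(ring.get(11));           // prints event-11, which reused slot 3
        }
    }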

Note that ring buffers aren't doing anything new; they are built with a mindset of taking advantage of hardware concepts. The network cards we have been using have had ring buffers ever since they were created, typically one for transmitting and one for receiving. LMAX has taken that same hardware concept and packaged it in a software bundle so that software programmers can do efficient programming.

Ring buffer – the heart of the disruptor

The preceding diagram depicts the typical moving components in a disruptor: producers and consumers.

Producers

The producers basically do what the name suggests: they put data into the ring buffer in sequence, say slots 1, 2, 3, and so on. Obviously, a producer has to be cognizant of the next available slot to be sure that it doesn't overwrite data that still hasn't been consumed or read by the consumers. The consumers can notify the producer of the sequence slot up to which it may fill the buffer. For example, if we have put data all the way up to slot 5 and it has not yet been read, then the producers have to be intelligent enough to ensure they don't end up overwriting that data. Here, the claim strategy comes into play. It can be single-threaded or multithreaded. The single-threaded strategy is absolutely fine in a single-producer model. However, if we have multiple producers in the mix, we need a multithreaded claim strategy to be deployed, which helps the producers figure out the next available slot on the buffer. This uses a compare followed by a swap and is a comparatively expensive operation.

Before moving on to batching, let's talk about the claim strategy and write operations in the case of producer contention (a sketch follows this list):

  • In a multiple-producer environment, we have a CAS (short for compare-and-swap) operation instead of a basic pointer to protect the counter
  • When a race occurs, the producers compete and use the CAS to claim the next available slot
  • When a producer claims a slot in the ring buffer, it copies the data into one of the preallocated elements using a two-step process, as follows:
    • Get the sequence number and copy the data
    • Explicitly commit it via the producer barrier, thus making the data visible
  • Producers can check with consumers to see where they are so they don't overwrite ring buffer slots still in use
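
The following is a rough Java sketch of such a multithreaded claim strategy (the class and method names are illustrative and do not belong to the Disruptor's API): producers race on an atomic cursor with compare-and-swap, and back off while the slot they want has not yet been released by the slowest consumer:

    import java.util.concurrent.atomic.AtomicLong;

    public class MultiProducerClaim {
        private final int bufferSize;
        private final AtomicLong claimCursor = new AtomicLong(-1);     // last sequence claimed by any producer
        private final AtomicLong publishedCursor = new AtomicLong(-1); // last sequence made visible to consumers
        private final AtomicLong consumedCursor = new AtomicLong(-1);  // last sequence fully consumed

        public MultiProducerClaim(int bufferSize) {
            this.bufferSize = bufferSize;
        }

        // Step 1: claim the next sequence with a compare-and-swap loop. Several
        // producers may race here, but exactly one of them wins each sequence number.
        public long claimNext() {
            while (true) {
                long current = claimCursor.get();
                long next = current + 1;
                // Back off while the slot we want is still unread by the slowest consumer,
                // so that a producer never overwrites data that is still in use.
                while (next - consumedCursor.get() > bufferSize) {
                    Thread.yield();                    // simple back-off for the sketch
                }
                if (claimCursor.compareAndSet(current, next)) {
                    return next;                       // this producer now owns 'next' and can copy data in
                }
            }
        }

        // Step 2: after copying the data, make it visible. A real multi-producer
        // implementation must also cope with out-of-order publishes; a single cursor
        // is enough for this sketch.
        public void publish(long sequence) {
            publishedCursor.set(sequence);
        }

        // Consumers call this once they have finished reading a sequence.
        public void markConsumed(long sequence) {
            consumedCursor.set(sequence);
        }
    }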

The batching effect is another feature used by the producers to be efficient with latency. Here, the producer knows which slot the consumer is working on. Let's say the consumer is still consuming data at slot 9 from the previous pass around the ring, while the producer has put data up to slot 4 on the current pass. The producer can then write data in bursts all the way up to slot 9, that is, as far as the empty slots in the ring buffer go.

We definitely need timeouts or circuit breakers built into the producers so that they can override and break the cycle in case latency starts to build up in the system.
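
Such a circuit breaker can be sketched as a claim call with a deadline (again an illustrative, single-producer sketch rather than LMAX code): instead of spinning forever when the ring is full, the producer reports the back-pressure to its caller once the timeout expires:

    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    public class TimedProducer {
        private final AtomicLong producerCursor = new AtomicLong(-1);  // last slot written
        private final AtomicLong consumerCursor = new AtomicLong(-1);  // last slot consumed
        private final int bufferSize;

        public TimedProducer(int bufferSize) {
            this.bufferSize = bufferSize;
        }

        // Try to claim the next slot, but stop waiting once the deadline passes, so that
        // back-pressure surfaces to the caller instead of latency quietly building up.
        // Single-producer variant: only one thread ever advances producerCursor.
        public long tryClaim(long timeout, TimeUnit unit) {
            long deadline = System.nanoTime() + unit.toNanos(timeout);
            long next = producerCursor.get() + 1;
            while (next - consumerCursor.get() > bufferSize) {   // claiming now would lap the consumer
                if (System.nanoTime() > deadline) {
                    return -1;                                   // circuit broken: the ring is full
                }
                Thread.yield();                                  // simple back-off for the sketch
            }
            producerCursor.set(next);
            return next;                                         // the caller may now write slot 'next'
        }

        // Consumers call this after finishing a slot.
        public void markConsumed(long sequence) {
            consumerCursor.set(sequence);
        }
    }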

Consumers

Consumers are basically the components that read data from the ring. Let's say data has been put into slots 1, 2, 3, 4, and 5 by the producers, and there are two consumers in the system (C1 and C2) sitting behind a single consumer barrier (CB1). They can operate completely independently of each other until they hit the last slot that has been filled by the producer. A third consumer, C3, sits behind consumer barrier 2 (CB2), with a dependency on the consumers behind CB1. C3 cannot actually access a slot until both C1 and C2 are finished with it; in this way, we can establish dependencies judiciously using consumer barriers.

Batching plays a very important role here as well; if C1 is working on slot 1 and it's taking a long time, the producer carries on and starts putting elements in at slot 6. When C2 is done with slots 2, 3, and 4, it can simply catch up and read all the newly available elements from slot 5 upward in one go. The consumer is intelligent enough to sense the backlog and act on it in batches, as sketched below.
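
A rough sketch of this catch-up behavior on the consumer side (again with illustrative names rather than the Disruptor API): the consumer reads the producer's cursor once and then drains every sequence that is already available in a single batch:

    import java.util.concurrent.atomic.AtomicLong;
    import java.util.function.LongConsumer;

    public class BatchingConsumer {
        private final AtomicLong producerCursor;   // highest sequence published so far
        private long nextToRead = 0;               // this consumer's own position

        public BatchingConsumer(AtomicLong producerCursor) {
            this.producerCursor = producerCursor;
        }

        // Drain every sequence that is already available in one pass: if the consumer
        // fell behind while handling a slow event, it catches up in a single batch
        // instead of paying the coordination cost once per message.
        public void poll(LongConsumer handler) {
            long available = producerCursor.get(); // one volatile read for the whole batch
            while (nextToRead <= available) {
                handler.accept(nextToRead);        // process the entry at this sequence
                nextToRead++;
            }
        }
    }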

We have a single data structure shared by all the components, and this smart batching to catch up, which applies to both the producer and the consumers, helps beat latency in a rather unconventional manner. The disruptor actually performs better with increased loads, and we don't get the J-shaped latency curve that we do with queues under surging loads. The Storm framework internally uses this clever framework for data transfer and communication within a worker.
