Chapter 7. Other Bottlenecks in Distributed Systems

Introduction

Whereas the previous few chapters have covered specific low-level hardware resources that can cause performance bottlenecks, distributed systems often have a number of components that work together, and any one of these components can become a bottleneck for the entire system. Also, in some cases, what is presented to a programmer or user as a single distributed system can in fact be one distributed system built on top of another—for example, HBase runs on HDFS—and so can suffer from bottlenecks in the underlying system that are not immediately visible and require additional monitoring.

This chapter focuses on performance and management challenges for Hadoop in particular, but many of the same concepts apply to other distributed systems—only the specific components differ.

NameNode Contention

HDFS uses an architecture with a single NameNode that stores the directory tree of all files in the distributed file system and tracks which DataNodes hold the blocks of each file. The NameNode process keeps information in memory about every HDFS file, including the file’s metadata, the blocks it is broken into, and the nodes that store those blocks. As a result, the NameNode on large clusters might store roughly 300 bytes of data for each of many millions of HDFS files and blocks,1 so its memory usage can be large and grow over time. If the memory needed exceeds the available physical memory on the node, the NameNode can start swapping and bring the entire distributed system to a halt.

Applications that create many small HDFS files exacerbate this problem because breaking a large file into many smaller files creates proportionately more metadata for the same underlying data. Changing applications to use larger HDFS files will reduce this impact (along with the other performance impacts of small HDFS files mentioned previously). In some cases, this can be difficult; for example, when applications write log files, which are generally small. One solution is to combine many small files into a single large file (for example, in a .tar file or Hadoop archive2) and store the result in a single HDFS file.
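
As a rough illustration (not code from any particular production system), the following sketch uses Hadoop’s SequenceFile API to pack a directory of small HDFS files into one large file, keyed by the original filenames, so the NameNode tracks a single file instead of thousands; the paths are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inputDir = new Path("/logs/2016-01-01");      // directory of many small files
    Path packed   = new Path("/logs/2016-01-01.seq");  // single large output file

    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(packed),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(BytesWritable.class))) {
      for (FileStatus status : fs.listStatus(inputDir)) {
        byte[] contents = new byte[(int) status.getLen()];
        try (FSDataInputStream in = fs.open(status.getPath())) {
          in.readFully(contents);  // each input file is small, so read it whole
        }
        // One record per original file: key = original filename, value = its bytes
        writer.append(new Text(status.getPath().getName()),
                      new BytesWritable(contents));
      }
    }
  }
}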

Memory isn’t the only potential NameNode bottleneck. Before a container can begin reading from or writing to an HDFS file, the HDFS client code running in that container must exchange information with the NameNode (for example, to find out which DataNodes hold the file’s blocks, or to allocate blocks for a new file). Across a busy cluster, these metadata requests create significant CPU load on the NameNode host, which can affect performance. (They also create network traffic, although this traffic is small compared to the bandwidth used for sending the HDFS data itself between nodes; HDFS data does not pass through the NameNode.)

Because there’s no simple way to protect the NameNode (and the system as a whole) from too many files, operators of large Hadoop clusters closely monitor NameNode behavior. Key metrics include the total size of the file system, the number of files and blocks, the total size (and file and block count) per user, and the number of requests going to the NameNode. In particular, the rate of change of those metrics can signal potential problems before they occur. (These NameNode metrics are examples of the monitoring that will be covered in Chapter 8.)
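
For instance, the NameNode exposes these counters through its built-in JMX servlet over HTTP. The minimal sketch below simply fetches and prints the FSNamesystem metrics (which include FilesTotal and BlocksTotal); the hostname is hypothetical, the port shown (50070) is the default Hadoop 2 NameNode web port, and a real deployment would feed these values into a time-series monitoring system rather than printing them.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class NameNodeMetricsProbe {
  public static void main(String[] args) throws Exception {
    // Query only the FSNamesystem bean, which holds the file and block counts.
    URL url = new URL("http://namenode.example.com:50070/jmx"
        + "?qry=Hadoop:service=NameNode,name=FSNamesystem");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setConnectTimeout(5_000);
    conn.setReadTimeout(5_000);
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream()))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);  // JSON containing FilesTotal, BlocksTotal, etc.
      }
    } finally {
      conn.disconnect();
    }
  }
}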

Thinking more broadly about distributed systems performance, HDFS is a case in which a different architecture can improve performance and reduce the effect of a single hardware bottleneck. Early versions of HDFS were modeled after the first publicly announced version of the Google File System, which had a single master node (like the NameNode). More recently, Google has moved internally to a new version of its file system that stores file metadata in a distributed tabular store (BigTable, roughly equivalent to HBase), so the metadata is spread across an arbitrary number of nodes and the system can scale as needed. HDFS now has a similar feature (HDFS Federation) that uses multiple NameNodes, each serving a different namespace, to reduce the load on any one of them. With federation, each NameNode serves a deterministic subset of HDFS files, so a worker node’s request for a given file goes only to the corresponding NameNode. This access pattern reduces the likelihood of one node becoming a bottleneck, and the distributed system can scale linearly by adding NameNodes.

ResourceManager Contention

A similar bottleneck in Hadoop is the ResourceManager (RM), which is the master that communicates with all of the compute nodes in the cluster to schedule work to run on those nodes. Strictly speaking, the RM communicates with each compute node’s NodeManager (NM) to make containers available to each application’s ApplicationMaster.

With this architecture, the single RM node can become a bottleneck if its CPU, memory, or network becomes overwhelmed, and the entire cluster can slow down as a result because new containers will not be scheduled as quickly as they could be. This can happen in very large clusters (thousands of nodes) because every NM sends a frequent heartbeat message to the RM with information about the resources available on that node. If this occurs, increasing the heartbeat interval for each node (one second by default) can reduce the load on the RM; launching containers slightly more slowly because heartbeats arrive less often is a far better trade-off than leaving the RM overwhelmed, which slows container scheduling much more.
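
As a concrete (but hypothetical) example of that tuning, the heartbeat interval is controlled by the YARN property yarn.resourcemanager.nodemanagers.heartbeat-interval-ms, which defaults to 1,000 ms. The sketch below sets it programmatically only for illustration; in practice the value would normally be set in yarn-site.xml, and the 3,000 ms value is just an example, not a recommendation.

import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class HeartbeatTuning {
  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();
    // Triple the default NodeManager-to-ResourceManager heartbeat interval.
    conf.setLong("yarn.resourcemanager.nodemanagers.heartbeat-interval-ms", 3000L);
    System.out.println("NM heartbeat interval: "
        + conf.getLong("yarn.resourcemanager.nodemanagers.heartbeat-interval-ms", 1000L)
        + " ms");
  }
}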

The RM can also be overwhelmed if the cluster is running many short-lived tasks, due to the extra memory and processing needed to track and schedule many tasks. Applications that run in this way should be rewritten or tuned in any case because running many quick tasks (i.e., those requiring just a few seconds to complete) can cause significant performance problems in other ways—for example, more time and work is spent in the overhead of task setup and teardown.

The RM is another case in which architectural changes over time have improved performance of the distributed system. In earlier versions of Hadoop, a JobTracker (JT) node had to handle all of the work the RM does today, in addition to managing every MapReduce job in detail. Handling so much work meant that the JT could easily become a bottleneck for the entire distributed system, and in fact was often the primary factor limiting the number of nodes that could be run in a single cluster. Splitting up the JT’s work into cluster resource management (which is done by the RM) and application management (which is now done by a separate ApplicationMaster for each application and distributed across compute nodes in the cluster) meant that the RM is much less often a bottleneck and allowed Hadoop clusters to scale more easily, up to many thousands of nodes.

ZooKeeper

Apache ZooKeeper is a centralized service for maintaining configuration information and name registries as well as providing distributed synchronization and various other services used by distributed systems. ZooKeeper is used in many types of distributed systems, including Hadoop.3

ZooKeeper is fast when used in the intended scenario of predominantly read (not write) traffic. It can become a performance bottleneck when used in other ways (for example, if a poorly written application uses ZooKeeper as a messaging system), because every write must be agreed upon by a quorum of the ZooKeeper ensemble before it completes. In addition, each ZooKeeper client (running on the compute nodes, not the ZooKeeper nodes) might have watchers that must be notified every time new ZooKeeper data is written, increasing overhead on those nodes and on the network.

To avoid the performance problems that can occur when ZooKeeper is misused as a messaging system, it’s best to use ZooKeeper only for infrequently changing data such as service locations; message passing should be handled by mechanisms designed for high throughput, such as direct socket connections or the shuffle mechanism built into Hadoop.
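
The sketch below shows the kind of read-mostly access ZooKeeper is designed for, assuming a hypothetical znode path and connect string: a client reads a rarely changing service location once and registers a watch so it is notified if the value ever changes, rather than polling ZooKeeper or pushing messages through it.

import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.ZooKeeper;

public class ServiceLocationReader {
  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 30_000,
        (WatchedEvent event) -> { /* connection state changes arrive here */ });

    // Read the (rarely changing) location of a service and set a watch so
    // ZooKeeper notifies this client only if the value is ever rewritten.
    byte[] data = zk.getData("/services/db-primary",
        event -> System.out.println("Service location changed: " + event),
        null);
    System.out.println("db-primary is at " + new String(data, StandardCharsets.UTF_8));

    zk.close();
  }
}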

Locks

A common performance issue in multithreaded programming is lock contention. If many threads repeatedly try to acquire the same lock, parallelism (the ability of multiple threads to make progress at the same time) decreases. In addition, acquiring and releasing the lock requires the cores to coordinate, through cache-coherency traffic, over which one holds the lock at any given moment, and that coordination can cause CPU pipeline stalls.

Performance degradation from locks can be especially severe for multi-tenant distributed systems, for several reasons:

  • The computing work is significant and is thus scheduled across many cores.

  • Unlike in a custom-designed distributed application, the allocation of work across cores can be ad hoc and suboptimal.

  • Because most developers writing jobs aren’t experts in (or even familiar with) systems performance implications, they might be completely unaware of the impact of lock contention.

(These factors will be familiar from the earlier discussions of hardware bottlenecks.)

A related issue is the cost of sharing data between cores, whether through atomic variables or fields written while holding a lock: if a thread on one core writes a field and a thread on a different core then reads it, the CPU instruction pipeline can stall while the cores synchronize, because the cache line holding that field must move between them.

Careful design of multithreaded programs is necessary to minimize lock contention. Programs should hold locks for as little time as possible, and developers should consider lockless architectures. One technique that can simplify the programming model is Communicating Sequential Processes.4
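
As one small, hypothetical illustration of these ideas in Java, the sketch below contrasts a counter protected by a single lock, on which every thread must contend, with a java.util.concurrent.atomic.LongAdder, which spreads updates across internal cells so that threads rarely touch the same memory; the thread and iteration counts are arbitrary.

import java.util.concurrent.atomic.LongAdder;

public class ContentionDemo {
  private long lockedCount = 0;                       // guarded by "this"
  private final LongAdder relaxedCount = new LongAdder();

  private synchronized void incrementLocked() {       // every call contends for one lock
    lockedCount++;
  }

  public static void main(String[] args) throws InterruptedException {
    ContentionDemo demo = new ContentionDemo();
    Runnable work = () -> {
      for (int i = 0; i < 1_000_000; i++) {
        demo.incrementLocked();          // all threads serialize on one lock
        demo.relaxedCount.increment();   // threads mostly update private cells
      }
    };
    Thread[] threads = new Thread[8];
    for (int t = 0; t < threads.length; t++) {
      threads[t] = new Thread(work);
      threads[t].start();
    }
    for (Thread t : threads) {
      t.join();
    }
    System.out.println(demo.lockedCount + " vs " + demo.relaxedCount.sum());
  }
}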

DNS Servers

One external service that is critical for distributed systems is the Domain Name System (DNS), which provides the mapping from Internet domain names to server IP addresses. Unfortunately, in many cases DNS can become a bottleneck for distributed systems like Hadoop, especially those where the platform and applications are written in Java, because Java’s use of DNS suffers from some key performance limitations, including the following:

  • Java libraries often have long DNS server timeouts (on the order of 30 seconds) before attempting a different DNS server. (They also use very simple retry logic when DNS failures occur, retrying repeatedly at a fixed time interval; contrast this with protocols like TCP that use an exponential backoff algorithm, allowing rapid recovery from transient failures without flooding the network with retries during an ongoing outage. A sketch of such a backoff loop appears after this list.)

  • Java does not make use of all available nameservers, so a single slow or failed DNS server can be particularly damaging.
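
To make the contrast with fixed-interval retries concrete, here is a minimal, hypothetical sketch of wrapping a DNS lookup in an exponential-backoff retry loop: transient failures recover after a short delay, while an ongoing outage produces progressively fewer retries instead of a steady flood. The hostname, starting delay, multiplier, and attempt count are arbitrary choices.

import java.net.InetAddress;
import java.net.UnknownHostException;

public class BackoffLookup {
  // Try to resolve a hostname, waiting twice as long after each failure.
  static InetAddress resolveWithBackoff(String host, int maxAttempts)
      throws UnknownHostException, InterruptedException {
    long delayMs = 250;  // initial delay; doubles after every failed attempt
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return InetAddress.getByName(host);
      } catch (UnknownHostException e) {
        if (attempt == maxAttempts) {
          throw e;  // give up after the final attempt
        }
        Thread.sleep(delayMs);
        delayMs *= 2;  // exponential backoff: 250 ms, 500 ms, 1 s, 2 s, ...
      }
    }
    throw new UnknownHostException(host);  // unreachable; satisfies the compiler
  }

  public static void main(String[] args) throws Exception {
    System.out.println(resolveWithBackoff("namenode.example.com", 5));
  }
}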

Production deployments frequently suffer from performance and reliability problems caused by DNS instability. Examples include ZooKeeper timeouts caused by DNS problems, as well as entire nodes locking up because every thread of a multithreaded application was stuck waiting for DNS.

To reduce the likelihood of performance bottlenecks and instability from DNS, the most straightforward solution is to install more powerful DNS infrastructure. A more extreme solution is to run a DNS resolver on every worker node; these local resolvers provide their own caching and properly load-balance across multiple upstream DNS servers.
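
A complementary, application-level mitigation (not a substitute for solid DNS infrastructure) is to make better use of the JVM’s own name-lookup cache. The sketch below uses the documented networkaddress.cache.ttl and networkaddress.cache.negative.ttl security properties, which must be set before the first lookup; the hostname and the 60- and 10-second values are purely illustrative.

import java.net.InetAddress;
import java.security.Security;

public class DnsCacheTuning {
  public static void main(String[] args) throws Exception {
    Security.setProperty("networkaddress.cache.ttl", "60");           // cache successful lookups for 60 s
    Security.setProperty("networkaddress.cache.negative.ttl", "10");  // cache failed lookups briefly

    // Later lookups for the same name within the TTL are served from the
    // in-process cache instead of going back to the DNS servers.
    InetAddress first = InetAddress.getByName("namenode.example.com");
    InetAddress again = InetAddress.getByName("namenode.example.com");
    System.out.println(first + " / " + again);
  }
}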

Summary

In addition to scheduling inefficiencies and potential bottlenecks from specific hardware resources, distributed systems also suffer from bottlenecks due to performance problems in centralized services. Because there are many such services, and the details change depending on the specific distributed system being used, there is no short list of steps to take to avoid all bottlenecks. A general theme is to monitor centralized services directly, along with the overall distributed system performance, in a way that will quickly detect problems and identify when problems are due to centralized services.
