Chapter 5. Disk Performance: Identifying and Eliminating Bottlenecks

Introduction

Distributed systems frequently use a distributed file system (DFS). Today, such systems most commonly store data on the disks of the compute nodes themselves, as in the Hadoop Distributed File System (HDFS). HDFS generally splits data files into large blocks (typically 128 MB) and then replicates each block on three different nodes for redundancy and performance.
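
To make this layout concrete, the following minimal sketch (assuming the standard Hadoop client libraries on the classpath and a reachable HDFS cluster; the file path passed on the command line is whatever HDFS file you want to inspect) prints each block of a file along with the nodes holding its replicas:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockLayout {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();       // picks up core-site.xml/hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path(args[0]);                  // an HDFS file to inspect
        FileStatus status = fs.getFileStatus(file);
        // Each BlockLocation is one block (typically 128 MB) plus the hosts holding its replicas.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
          System.out.printf("offset=%d length=%d replicas=%s%n",
              block.getOffset(), block.getLength(), String.join(",", block.getHosts()));
        }
      }
    }

With the defaults described above, a 1 GB file would print eight blocks, each listing three hosts.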

Some systems (including some deployments of Hadoop) do not store the distributed file system data directly on the compute nodes and instead rely on cloud data storage (e.g., using Amazon’s S3 as storage for Hadoop and others) or network-attached storage (NAS). Even in these cases, some data is often copied to a compute node’s local disks before processing, or jobs’ intermediate or final output data might be written temporarily to the compute node’s local disks.

Whether using a system like HDFS that stores all of its data on the computing cluster itself or a system that uses separate storage such as S3 or NAS, there are three major conceptual locations of data (see Figure 5-1):

Truly local

Data is written to temporary files (such as spill files that cache local process output before being sent to remote nodes), log files, and so on.

Local but managed via the DFS

The data being read by a process on a node lives in HDFS, but the actual bytes are on a local disk on that node. (Hadoop tries to lay out computation to increase the likelihood that processes will use local data in this way.)

Remote

I/O data is read from or written to a remote disk, either on-cluster (HDFS) or elsewhere (S3, NAS, or on some other system such as a database).

Figure 5-1. Locations of data in a distributed system that uses HDFS for storage. A: truly local disk access. B: Access to HDFS data stored on the local node. C: Access to HDFS data stored on another node, either on the same network rack or on another network rack.

Because of this distribution of data across the cluster, on a given node some disk usage is caused by local processes and some by processes on other nodes (or even off the cluster, as in extract, transform, and load [ETL] to/from some external source). Correspondingly, some disk performance impacts are due to local processes, and some are due to remote processes.

Overview of Disk Performance Limits

A single physical hard drive has two fundamental performance limits:

Disk seek time

The amount of time it takes for the read/write head to move to a new track on the disk to access specific data. This time is typically a few milliseconds.1 (There is additional rotational latency while the platter spins until the desired data is under the head; on average this is about half of the seek time.) The rate of such disk operations is measured in I/O operations per second (IOPS).

Disk throughput

The speed with which data can be read from or written to the disk after the head has moved to the desired track. Modern disks typically have sustained throughput of roughly 200 MBps. (Disk throughput is sometimes also referred to as disk bandwidth.)

A given disk becomes a bottleneck in one of two ways: the processes on the node try to read or write data faster than the disk head or the disk controller can handle, hitting the throughput cap, or they read and write a large number of small files (rather than a few large files read sequentially from the disk), causing too many seeks. The latter case is often due to the “small-files problem,” in which poorly designed applications break their input or output data into too many small chunks. The end result of the small-files problem is that a disk that could otherwise read 200 MBps ends up reading at only a small fraction of that speed.
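
A back-of-the-envelope calculation shows how severe this can be. The sketch below is a simple model, not a benchmark: it assumes roughly 10 ms per seek and 200 MBps of sustained throughput (the ballpark figures quoted above) and estimates how long one disk takes to deliver 1 GB of data as the files get smaller:

    public class SmallFilesModel {
      public static void main(String[] args) {
        double seekMs = 10.0;          // assumed seek plus rotational latency per file
        double throughputMBps = 200.0; // assumed sustained sequential throughput
        double totalMB = 1024.0;       // 1 GB of input data

        double[] fileSizesMB = {128.0, 1.0, 0.1};
        for (double sizeMB : fileSizesMB) {
          double files = totalMB / sizeMB;
          // Total time = one seek per file + time to stream all the bytes.
          double seconds = files * seekMs / 1000.0 + totalMB / throughputMBps;
          System.out.printf("%8.1f MB files: %7.0f files, %7.1f s, effective %6.1f MBps%n",
              sizeMB, files, seconds, totalMB / seconds);
        }
      }
    }

Under these assumptions, 128 MB files keep the disk near its 200 MBps limit, 1 MB files cut effective throughput to roughly a third of that, and 100 KB files reduce it to under 10 MBps.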

Disk Behavior When Using Multiple Disks

Modern servers have several hard drives per node, which generally improves performance. Striping disk reads and writes across multiple disks (which can be done either explicitly by the application, as in HDFS, or at a low level, as in RAID 0) generally increases both the total disk throughput and total IOPS of a node. For example, Hadoop spreads some intermediate temporary files (such as spills from mappers and inputs to reducers that come from the shuffle stage) across disks. Likewise, it’s a good practice to spread frequently flushed log files across disks because they are generally appended in very small pieces and writing/flushing them can cause many IOPS. Splitting the throughput and IOPS across many disks means that the CPU can spend more time doing useful work and less time waiting for input data.
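
As an illustration, spreading Hadoop's storage across drives mostly comes down to listing one directory per physical disk in the relevant properties. The minimal sketch below shows the idea using a Java Configuration object with hypothetical mount points; in practice these properties (yarn.nodemanager.local-dirs for intermediate and temporary data, dfs.datanode.data.dir for HDFS blocks) are normally set in the site XML files rather than in code:

    import org.apache.hadoop.conf.Configuration;

    public class SpreadAcrossDisks {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // One directory per physical disk: spills and shuffle data are striped across them.
        conf.set("yarn.nodemanager.local-dirs",
            "/disk1/yarn/local,/disk2/yarn/local,/disk3/yarn/local");
        // HDFS block storage is likewise spread across all of the node's disks.
        conf.set("dfs.datanode.data.dir",
            "/disk1/hdfs/data,/disk2/hdfs/data,/disk3/hdfs/data");
        System.out.println(conf.get("yarn.nodemanager.local-dirs"));
      }
    }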

However, if a particular disk is slow (due to hardware problems) or is being accessed more than the other disks on the node, that disk can effectively slow down the entire system: if a process must finish reading from or writing to files that are spread across all disks, its performance is constrained by the slowest disk. This can happen, for example, when an HDFS file lives on one particular disk on the node and a task (running locally or on a remote node) needs to read it; that read slows down the use of the disk for every other process on the node, effectively taxing the entire system. Data replication reduces the chance of hitting the slow disk, because the data blocks on the slow disk are also present on other disks in the cluster, but it does not eliminate the problem, especially because HDFS does not consider the performance or load of specific disks when deciding which replica to instruct a process to access.

Although there are no good general solutions to this problem in distributed systems, understanding how applications behave when they use multiple disks at once can help developers and operators diagnose performance issues and ultimately solve them. For example, when analyzing the performance of a particular node, it's important to look at the slowest disk at that time, not at aggregate or average metrics across all disks. Making developers aware of exactly how their applications use disks, along with the general performance characteristics of hard disks, can also lead application owners to reorganize their data to avoid unnecessary seeks. Combining several small files into one in a rollup process (as in the sketch that follows) is a common way to reduce the number of seeks and use disks more efficiently. Alternatively, systems like HBase turn random reads and writes into large sequential blocks (by combining edit logs with RAM buffering), greatly reducing the number of random disk IOPS and speeding up disk access while still presenting what appear to the application to be small, random operations.
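
The following minimal rollup sketch (assuming the Hadoop client libraries; the directory paths are hypothetical) concatenates every small file in a directory into one large, sequentially written file. It is only appropriate for formats in which simple concatenation preserves record boundaries, such as newline-delimited text:

    import java.io.InputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class RollupSmallFiles {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path inputDir = new Path("/data/events/2016-01-01");      // hypothetical directory of small files
        Path rolledUp = new Path("/data/events/2016-01-01.rolled"); // single large output file

        try (FSDataOutputStream out = fs.create(rolledUp, true)) {
          for (FileStatus status : fs.listStatus(inputDir)) {
            if (!status.isFile()) continue;
            try (InputStream in = fs.open(status.getPath())) {
              IOUtils.copyBytes(in, out, 4096, false);             // append this small file to the rollup
            }
          }
        }
      }
    }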

Disk Performance in Multi-Tenant Distributed Systems

As with other hardware resources, disk performance can suffer in a way that is specific to multi-tenant systems: when multiple processes each need to read a different large file, the system can behave as if it has a small-files problem. Because seeking (moving the disk head) is costly, reading one file sequentially and then the other would be much faster than reading both simultaneously. Under low to moderate usage the kernel handles this gracefully, but a multi-tenant system can have several well-written applications (each reading large amounts at once) that together read so much data that the kernel must switch frequently between them, so that from the hardware's point of view the workload looks like many small operations.

In contrast to applications running on a single node, distributed systems sometimes have the potential to use the network to get data out of another node's memory faster than from the local disk. This is particularly true when the amount of data is small; the latency of sending a few packets over the network to a remote node that reads the data from memory and sends the resulting packets back is around 0.5–1.0 ms, roughly one-tenth the time to seek and read from a local disk.2 In addition, using data from another node's memory reduces the impact on overall cluster performance: the minimal network usage and remote memory read do not interfere with anything else, whereas a disk seek can slow down other processes on the node.

Recent versions of HDFS introduced caching of HDFS data, in which a Hadoop application can first request that particular data be cached in memory and then request the scheduler to place the application’s containers on nodes where the data is cached. The result is that containers benefit even more from HDFS data locality by accessing data directly in memory, not just data on the local disk. Applications that support interactive querying on fixed datasets have achieved significant performance gains by exploiting caching.3
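
A minimal sketch of requesting such caching through the HDFS centralized cache management API follows; the pool name and dataset path are hypothetical, and real code would add permissions, cache limits, and error handling:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.CacheDirectiveInfo;
    import org.apache.hadoop.hdfs.protocol.CachePoolInfo;

    public class CacheHotData {
      public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS points at an HDFS cluster.
        DistributedFileSystem dfs =
            (DistributedFileSystem) FileSystem.get(new Configuration());

        dfs.addCachePool(new CachePoolInfo("interactive"));        // pool for the querying application
        dfs.addCacheDirective(new CacheDirectiveInfo.Builder()
            .setPath(new Path("/warehouse/lookup_table"))          // hypothetical hot dataset
            .setPool("interactive")
            .setReplication((short) 1)                             // cache one replica in memory
            .build());
        // Datanodes holding the blocks now pin them in memory; a locality-aware scheduler
        // can then place the application's containers on those nodes.
      }
    }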

Similar to applications that use HDFS caching, Spark also takes great advantage of data stored in memory, either on the local node or on another node. Spark’s resilient distributed datasets (RDDs) allow other Spark executors (processes) to access a RAM cache of data, without needing to go through HDFS to get it.
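
A minimal example of this pattern using Spark's Java API is shown below (the HDFS path is hypothetical): the first action materializes the RDD in executor memory, and subsequent actions are served from that cache rather than rereading HDFS.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.storage.StorageLevel;

    public class CachedRddExample {
      public static void main(String[] args) {
        JavaSparkContext sc =
            new JavaSparkContext(new SparkConf().setAppName("cached-rdd-example"));
        JavaRDD<String> events = sc.textFile("hdfs:///data/events"); // read once from HDFS
        events.persist(StorageLevel.MEMORY_ONLY());                  // keep partitions in executor RAM

        long total = events.count();                                 // first action populates the cache
        long errors = events.filter(line -> line.contains("ERROR")).count(); // served from memory
        System.out.println(total + " events, " + errors + " errors");
        sc.stop();
      }
    }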

Controlling Disk I/O Usage to Improve Performance for High-Priority Applications

Operators often want to prioritize applications against one another. For example, reducing disk read/write latency for a database that serves end-customer queries in a web application might be critical, whereas latency for batch jobs is relatively unimportant. I/O-heavy batch jobs typically behave about the same with 90 percent of a disk's throughput or IOPS as with 100 percent, so slowing them down a little barely affects them, while freeing up that other 10 percent can significantly speed up latency-sensitive applications (which are generally much lighter in their I/O needs).

Some special-purpose systems such as databases support such prioritization among different types of applications and different users. More general distributed systems such as Hadoop (including latency-sensitive applications like HBase running on Hadoop clusters) normally do not; in the case of Hadoop, the scheduler allocates memory and to some extent CPU but disregards disk throughput and disk seeks. As a result, one process on a node can saturate disk throughput or use up nearly all of the available IOPS on the disks on that node, regardless of the scheduler’s plan for allocation and prioritization.

Basic Disk I/O Prioritization Tools and Their Limitations

Operating systems such as Linux include basic tools that offer some control over I/O usage:

  • ionice does for disk I/O what nice does for CPU: it raises or lowers a process's I/O scheduling priority (a minimal launch sketch follows this list).

  • cgroups apply ionice-like control over I/O to whole groups of processes (just as they do for those processes' use of CPU).
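
As a minimal illustration of the first tool, the sketch below launches a hypothetical batch script at the lowest best-effort I/O priority by wrapping it with ionice; cgroups apply the same kind of control to entire groups of processes.

    import java.io.IOException;

    public class LowIoPriorityLauncher {
      public static void main(String[] args) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(
                "ionice", "-c", "2", "-n", "7",     // best-effort class, lowest priority
                "/usr/local/bin/nightly_etl.sh")    // hypothetical batch job
            .inheritIO()                            // share this launcher's stdin/stdout/stderr
            .start();
        System.exit(p.waitFor());
      }
    }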

One fundamental limitation of using ionice (and cgroups) results from the difference between direct I/O, wherein the application essentially interacts directly with the disk hardware controller, and buffered I/O, wherein the kernel is an intermediary. With buffered I/O, instead of communicating with the device directly, the application instructs the kernel “read from or write to file X,” and the kernel interacts with the disk (with kernel memory buffers sometimes substituting for actual disk I/O operations). Most real-world applications use buffered I/O.

The problem is that while cgroups/ionice always work for direct I/O and sometimes work for buffered reads,4 they do not work at all for buffered writes, because those get into the kernel’s write buffer and are eventually written by the kernel itself, rendering the source of each write unknown to ionice.

Distributed multi-tenant systems have an additional problem due to how disk access is implemented by the platform. In Hadoop or any other system using a DFS, a separate daemon generally handles access to data on local and remote disks; tasks running on the node interact with that DFS daemon, not directly with the operating system. For example, HDFS reads and writes are intermediated by a separate datanode process. From the kernel’s point of view, the datanode process is now responsible for all I/O, not the application processes. As a result, kernel tools including cgroups/ionice cannot control throughput or seeks used by the different tenants on the node, nor can they prioritize access. The same is true for access by other nodes or even off-cluster processes (such as ETL to or from a data warehouse) to data stored on the local node; because it’s intermediated by the datanode daemon, it is completely opaque to the kernel.

These limitations of using basic operating system tools to control disk I/O usage mean that operators must use third-party software solutions to effectively prioritize usage among different applications.

Effective Control of Disk I/O Usage

Software like Pepperdata enables operators to prioritize access to disk I/O, ensuring timely disk access to latency-sensitive applications. Such software makes this functionality possible by instrumenting and controlling the disk usage of each application within the process itself rather than externally at the kernel level. As a result, Pepperdata is aware of disk usage for each individual application, whether it is using direct I/O, buffered I/O, or data access via the HDFS daemon. Because Pepperdata also measures the overall usage and capacity of the node, it can dynamically adjust disk access by each individual process to use the full hardware capacity but also prioritize the most important or latency-sensitive applications.

Solid-State Drives and Distributed Systems

Solid-state drives (SSDs) provide much higher throughput and IOPS—and much lower latency—than hard disk drives (HDDs). Gains are particularly significant for random reads/writes (typically on the order of 100–200 times better) but are much smaller, though still significant, for large sequential reads/writes (on the order of 3–6 times).5 SSDs also have much higher reliability compared to HDDs because they have no moving parts.

The performance gains from using SSDs vary by workload type and workload stage. If a job is completely CPU bound and not bottlenecked by disk I/O, the gains from using SSDs are negligible. In contrast, if the job is I/O bound and doing random reads/writes, using SSDs can improve performance dramatically. I/O-bound workloads doing primarily large sequential I/O will still achieve some performance gain from using SSDs, but those gains might not be significant.

For example, in MapReduce, intermediate data such as spill and shuffle output benefits more from SSDs than the phases that read from and write to HDFS do, because intermediate I/O generally involves much smaller files and is more random than HDFS input reading or output writing.6

Hadoop also provides an option to compress the intermediate data of MapReduce jobs, such as map output. When compression is enabled, the intermediate files shrink, and SSDs gain an even larger advantage over HDDs because the resulting I/O is on smaller files and is more random.
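
For example, a job might turn on intermediate compression as in the minimal sketch below. The property names are the standard MapReduce ones; the choice of Snappy as the codec is an assumption (it needs the native library on each node), and in practice these settings often live in mapred-site.xml rather than in code:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;

    public class CompressedIntermediateOutput {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Compress map output (the spill and shuffle data) before it hits local disks and the network.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
            SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compressed-intermediate-example");
        // ... configure mapper, reducer, and input/output paths as usual, then submit the job ...
      }
    }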

Note

Because compression adds CPU overhead, compressing all intermediate data might not result in a significant performance improvement on clusters running jobs that are primarily CPU bound.

In terms of optimizing performance versus cost, it is important to note that SSD costs are around 40–60 times the costs of HDDs. If changing to SSDs from HDDs in a particular cluster would not yield significant performance improvements for the dominant workloads on that particular cluster, it is difficult to justify the steep cost of exchanging HDDs for an equivalent capacity of SSD storage.

To reduce the cost of switching to SSDs while achieving much of the benefit, operators can build a cluster that uses both HDD and SSD storage and then configure the cluster to use HDD or SSD as appropriate for each workload. For example, to speed up production jobs with strict SLAs, operators could configure the cluster to store the HDFS files of production jobs (and only those jobs) on SSDs, and configure production jobs to store intermediate output on SSDs. Nonproduction jobs would use only HDDs, not SSDs, for HDFS files and intermediate output.7 If sufficient SSD capacity is available, operators could also allow nonproduction jobs to store intermediate output on SSDs; in that case, HDDs would be used only for the HDFS files of nonproduction jobs, but the cost savings from using HDDs instead of SSDs for those files would still be significant.
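
One way to express such a split is with HDFS storage policies, as in the minimal sketch below. It assumes a Hadoop release that supports storage policies and datanodes whose dfs.datanode.data.dir entries are tagged with [SSD] and [DISK]; the directory paths are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    public class MixedStorageLayout {
      public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS points at an HDFS cluster with both SSD and DISK volumes.
        DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(new Configuration());
        // Production data with strict SLAs: keep every replica on SSD.
        dfs.setStoragePolicy(new Path("/warehouse/production"), "ALL_SSD");
        // Nonproduction data: spinning disks only (HOT keeps all replicas on DISK).
        dfs.setStoragePolicy(new Path("/warehouse/adhoc"), "HOT");
      }
    }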

Measuring Performance and Diagnosing Problems

Standard operating system tools can identify when a disk is working at maximum capacity and thus acting as a bottleneck. For example, the kernel metrics for that disk might indicate that the disk is servicing requests most of the time. When “most of the time” becomes nearly all the time (e.g., the disk is busy for 900 ms of every second), performance problems result.8
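
The sketch below shows one direct way to observe this on Linux (an assumption about the platform): it samples the “time spent doing I/Os” counter in /proc/diskstats twice, one second apart, and reports how many milliseconds each disk was busy in that second. Values approaching 1,000 ms indicate a saturated disk.

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class DiskBusyTime {
      // Returns device name -> cumulative milliseconds spent doing I/O.
      static Map<String, Long> sample() throws Exception {
        Map<String, Long> busy = new HashMap<>();
        List<String> lines = Files.readAllLines(Paths.get("/proc/diskstats"));
        for (String line : lines) {
          String[] f = line.trim().split("\\s+");
          if (f.length < 13) continue;
          // f[2] = device name, f[12] = time spent doing I/Os (ms); see Documentation/iostats.txt
          busy.put(f[2], Long.parseLong(f[12]));
        }
        return busy;
      }

      public static void main(String[] args) throws Exception {
        Map<String, Long> before = sample();
        Thread.sleep(1000);
        Map<String, Long> after = sample();
        for (Map.Entry<String, Long> e : after.entrySet()) {
          long delta = e.getValue() - before.getOrDefault(e.getKey(), e.getValue());
          // Only report devices (including partitions) that did I/O during the interval.
          if (delta > 0) System.out.printf("%-10s %4d ms busy in the last second%n", e.getKey(), delta);
        }
      }
    }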

It’s possible to detect a disk performance problem by observing particular behavior. For example, a node suffering from a disk working at maximum capacity behaves differently from a node whose physical memory is full and which is thrashing (excessively swapping): a disk at maximum capacity is generally still responsive, whereas a thrashing node can be completely unresponsive. Logging into a node with a disk running at maximum capacity might initially seem a bit slow, but the node behaves normally after login.

In distributed systems it is particularly useful to look for disks that report significantly different read latencies while handling the same amounts of data and performing similar numbers of IOPS. This kind of comparison is much more feasible in a distributed system than on a single node or a set of independent servers, because many disks are frequently performing similar tasks at the same point in time (see Figure 5-2). Operators can use this to detect poorly performing disks, which can be a sign of impending failure.

Figure 5-2. Pepperdata Dashboard view showing disk statistics for a production cluster; note the top few rows, whose values of I/O service time are significantly higher than the others, potentially indicating hardware issues. (source: Pepperdata)
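
A minimal sketch of that comparison follows; the disk names and service times are hypothetical placeholders for whatever per-disk metrics a monitoring system collects, and it simply flags any disk whose average I/O service time is far above the median of its peers:

    import java.util.Arrays;
    import java.util.Map;
    import java.util.TreeMap;

    public class SlowDiskFinder {
      public static void main(String[] args) {
        Map<String, Double> serviceTimeMs = new TreeMap<>();   // disk id -> avg service time (ms)
        serviceTimeMs.put("node12:sdb", 6.1);
        serviceTimeMs.put("node12:sdc", 5.8);
        serviceTimeMs.put("node37:sdb", 48.3);                 // suspiciously slow
        serviceTimeMs.put("node41:sdd", 6.4);

        double[] values = serviceTimeMs.values().stream()
            .mapToDouble(Double::doubleValue).toArray();
        Arrays.sort(values);
        double median = values[values.length / 2];

        serviceTimeMs.forEach((disk, ms) -> {
          if (ms > 3 * median) {                               // simple threshold: 3x the median
            System.out.printf("%s: %.1f ms vs median %.1f ms -- possible failing disk%n",
                disk, ms, median);
          }
        });
      }
    }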

One commonly used tool to examine disk usage is iotop, which is similar to the standard Unix top program for CPU; iotop breaks down disk usage by process and thread. Unfortunately, iotop suffers (in measuring I/O usage) from the same fatal flaw that ionice and cgroups do (when controlling I/O usage)—it only measures local (not DFS) data usage. This is because DFS disk access is handled by a separate process, not the actual worker processes.

Just as it does with control and prioritization, software like Pepperdata avoids the problem of being unable to identify the causes of all disk usage; because such software instruments each application’s usage within the process, it can report on direct I/O, buffered I/O, and DFS usage. This allows distributed-system operators to quickly identify individual applications that are causing performance bottlenecks due to excessive or badly behaved disk usage as well as the applications that are suffering from those performance bottlenecks.

In considering the efficiency of disk access for multi-tenant systems, the ultimate goal is to allow each individual process to maximize useful work while minimizing time waiting for disks; less time waiting equates to more effective disk usage.

Summary

Disk I/O can be a critical bottleneck for multi-tenant distributed systems; contention for disk access frequently causes high-priority jobs to finish late, and inefficient disk usage can reduce overall cluster utilization. Developers should design their applications to use disk I/O efficiently, for example by using fewer, larger files to reduce the number of disk seeks. Operators should consider using SSDs where appropriate to provide increased disk performance. They should also use a software solution that effectively prioritizes disk I/O usage across different users and applications to ensure disk access for the most critical and time-sensitive applications.

1 For some recent statistics, see, for example, http://www.tomsitpro.com/articles/best-enterprise-hard-drives,2-981.html.

2 See “Latency Numbers Every Programmer Should Know” at https://gist.github.com/jboner/2841832, with an interesting visualization at http://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html.

3 See http://blog.cloudera.com/blog/2014/08/new-in-cdh-5-1-hdfs-read-caching/.

4 Some kernel versions support controls for buffered reads, whereas some do not. None support controls for buffered writes; see https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/ch-Subsystems_and_Tunable_Parameters.html.

5 See http://www.thessdreview.com/featured/ssd-throughput-latency-iopsexplained/.

6 See http://blog.cloudera.com/blog/2014/03/the-truth-about-mapreduce-performance-on-ssds/ and Kambatla, Karthik, and Yanpei Chen. “The Truth About MapReduce Performance on SSDs.” 28th Large Installation System Administration Conference (LISA14). 2014. https://www.usenix.org/system/files/conference/lisa14/lisa14-paper-kambatla.pdf.

7 See https://hadoop.apache.org/docs/r2.4.1/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml.

8 See http://www.pcp.io/pcp.git/man/html/howto.diskperf.html for an example of interpreting disk metrics.
