Chapter 3. CPU Performance Considerations

Introduction

Historically, large-scale distributed systems were designed to perform massive amounts of numerical computation, for example in scientific simulations run on high-performance computing (HPC) platforms. In most cases, the work done on such systems was extremely compute intensive, so the CPU was often the primary bottleneck.

Today, distributed systems tend to run applications for which the large scale is driven by the size of the input data rather than the amount of computation needed—examples include both special-purpose distributed systems (such as those powering web search among billions of documents) and general-purpose systems such as Hadoop. (However, even in those general systems, there are still some cases such as iterative algorithms for machine learning where making efficient use of the CPU is critical.)

As a result, the CPU is often not the primary bottleneck limiting a distributed system; nevertheless, it is important to be aware of the impacts of CPU on overall speed and throughput.

At a high level, the effect of CPU performance on distributed systems is driven by three primary factors:

  • The efficiency of the program that’s running, at the level of the code as well as how the work is broken into pieces and distributed across nodes.

  • Low-level kernel scheduling and prioritization of the computational work done by the CPU, when the CPU is not waiting for data.

  • The amount of time the CPU spends waiting for data from memory, disk, or network.

These factors are important for the performance even of single applications running on a single machine; they are just as important, and even more complicated, for multi-tenant distributed systems due to the increased number and diversity of processes running on those systems, and their varied input data sources.

Algorithm Efficiency

Of course, as with any program, when writing a distributed application, it is important to select a good algorithm (such as implementing algorithms with O(N log N) complexity instead of O(N²), using joins efficiently, etc.) and to write good code; the best way to spend less time on the CPU is to avoid computation in the first place. As with any computer program, developers can use standard performance optimization and profiling tools (open source options include gprof, hprof, VisualVM, and Perf4J; Dynatrace is one commercial option) to profile and optimize a single instance of a program running on a particular machine.
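To make the difference concrete, here is a minimal Python sketch (with made-up record lists and hypothetical field names) contrasting a quadratic nested-loop join with a linear-time hash join over the same data:

```python
from collections import defaultdict

def nested_loop_join(orders, customers):
    """O(N * M): compares every order against every customer."""
    results = []
    for order in orders:
        for customer in customers:
            if order["customer_id"] == customer["id"]:
                results.append((order, customer))
    return results

def hash_join(orders, customers):
    """O(N + M): builds an index on one side, then probes it once per order."""
    by_id = defaultdict(list)
    for customer in customers:
        by_id[customer["id"]].append(customer)
    return [(order, customer)
            for order in orders
            for customer in by_id[order["customer_id"]]]

# Example usage with tiny in-memory tables:
orders = [{"customer_id": 1, "amount": 10}, {"customer_id": 2, "amount": 7}]
customers = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]
assert nested_loop_join(orders, customers) == hash_join(orders, customers)
```

On inputs with millions of records per side, the nested-loop version performs trillions of comparisons, while the hash join touches each record only a handful of times.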

For distributed systems, it can be equally important (if not more so) to break down the work into units effectively. For example, with MapReduce programs, some arrangements of map-shuffle-reduce steps are more efficient than others. Likewise, whether using MapReduce, Spark, or another distributed framework, choosing the right level of parallelism is important. Because every map and reduce task requires a nontrivial amount of setup and teardown work, running too many small tasks can mean that overhead dominates the useful work: we’ve seen systems with thousands of map tasks that each require several seconds for setup and teardown but spend less than one second on useful computation.
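A quick back-of-the-envelope calculation, using entirely hypothetical numbers, shows how badly per-task overhead can dominate when work is split into too many small tasks:

```python
def overhead_fraction(num_tasks, overhead_per_task_s, total_useful_work_s):
    """Fraction of total task time spent on setup/teardown rather than useful work."""
    overhead = num_tasks * overhead_per_task_s
    return overhead / (overhead + total_useful_work_s)

# 10,000 tiny tasks, ~3 s of setup/teardown each, sharing 2,500 s of useful work:
print(overhead_fraction(10_000, 3.0, 2_500.0))  # ~0.92 -> ~92% of CPU time is overhead
# The same useful work split into 100 larger tasks:
print(overhead_fraction(100, 3.0, 2_500.0))     # ~0.11 -> ~11% overhead
```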

In the case of Hadoop, open source tools like Dr. Elephant1 (as well as some commercial tools) provide performance measurement and recommendations to improve the overall flow of jobs, identifying problems such as a suboptimal breakdown of work into individual units.

Kernel Scheduling

The operating system kernel (Linux, for example) decides which threads run where and when, distributing a fixed amount of CPU resource across threads (and thus ultimately across applications). Every few milliseconds (on the order of 5 ms), the kernel takes control of a given core and decides which thread’s instructions will run there for the next time slice. For each candidate thread, the kernel’s scheduler must consider several factors (a simplified sketch of this decision follows the list):

  • Is the thread ready to do anything at all (versus waiting for I/O)?

  • If yes, is it ready to do something on this core?

  • If yes, what is its dynamic priority? This computation takes several factors into account, including the static priority of the process, how much CPU time the thread has been allocated recently, and other signals depending on the kernel version.

  • How does this thread’s dynamic priority compare to that of other threads that could be run now?
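The following toy Python sketch models that per-core decision. It is a deliberately simplified illustration of the factors above, not how the Linux scheduler is actually implemented (Linux’s CFS, for instance, uses a weighted virtual-runtime calculation); all field names are hypothetical:

```python
def pick_next_thread(core_id, threads):
    """Toy model of the per-core scheduling decision described above.

    Each thread is a dict with hypothetical fields:
      runnable      -- ready to run (not blocked on I/O)
      allowed_cores -- set of cores the thread may run on (affinity)
      nice          -- static priority set by the operator (lower = higher priority)
      recent_cpu_ms -- CPU time received recently (used to favor starved threads)
    """
    candidates = [t for t in threads
                  if t["runnable"] and core_id in t["allowed_cores"]]
    if not candidates:
        return None  # core goes idle

    # Dynamic priority: favor threads with high static priority that have
    # received little CPU time lately.
    def dynamic_priority(t):
        return t["recent_cpu_ms"] + 10 * t["nice"]

    return min(candidates, key=dynamic_priority)

threads = [
    {"name": "batch",   "runnable": True,  "allowed_cores": {0, 1}, "nice": 10, "recent_cpu_ms": 400},
    {"name": "server",  "runnable": True,  "allowed_cores": {0},    "nice": -5, "recent_cpu_ms": 50},
    {"name": "blocked", "runnable": False, "allowed_cores": {0, 1}, "nice": 0,  "recent_cpu_ms": 0},
]
print(pick_next_thread(0, threads)["name"])  # -> "server"
```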

The Linux kernel exposes several control knobs to affect the static (a priori) priority of a process; nice and control groups (cgroups) are the most commonly used. With cgroups, priorities can be set, and scheduling affected, for a group of processes rather than a single process or thread; conceptually, cgroups divide access to the CPU across the entire group. This division across groups of processes means that applications running many processes on a node do not gain an unfair advantage over applications with just one or a few processes.
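As a concrete illustration of these knobs, the sketch below lowers a process’s static priority via nice and places a set of processes into a CPU cgroup. The group name and share value are assumptions, the paths assume the older cgroup v1 CPU controller mounted at /sys/fs/cgroup/cpu (cgroup v2 uses a unified hierarchy with a cpu.weight file instead), and writing those files normally requires root:

```python
import os

# Lower the static priority of the current process by 10 nice levels,
# equivalent to launching it with `nice -n 10 ...`. Unprivileged processes
# can always lower their own priority (raise the nice value).
os.nice(10)

# Create a CPU cgroup and give it one quarter of the default weight.
group = "/sys/fs/cgroup/cpu/batch_workers"   # hypothetical group name
os.makedirs(group, exist_ok=True)

with open(os.path.join(group, "cpu.shares"), "w") as f:
    f.write("256")   # default is 1024, so this group gets 1/4 the weight

# Move processes into the group by writing their PIDs to the tasks file.
worker_pids = [os.getpid()]                  # hypothetical: just this process
for pid in worker_pids:
    with open(os.path.join(group, "tasks"), "w") as f:
        f.write(str(pid))
```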

In considering the impact of CPU usage, it is helpful to distinguish between latency-sensitive and latency-insensitive applications:

  • In a latency-sensitive application, a key consideration is the timing of the CPU cycles assigned to it. Performance can be defined by the question “How much CPU do I get when I need it?”

  • In a latency-insensitive application, the opposite situation exists: the exact timing of the CPU cycles assigned to it is unimportant; the most important consideration is the total number of CPU cycles assigned to it over time (usually minutes or hours).

This distinction is important for distributed systems, which often run latency-sensitive applications alongside batch workloads, such as MapReduce in the case of Hadoop. Examples of latency-sensitive distributed applications include search engines, key-value stores, clustered databases, video streaming systems, and advertising systems with real-time bidding that must respond in milliseconds. Examples of latency-insensitive distributed applications include index generation and loading for search engines, garbage collection for key-value stores or databases, and offline machine learning for advertising systems.

An interesting point is that even the same binary can have very different requirements when used in different applications. For example, a distributed data store like HBase can be latency-sensitive for reading data when serving end-customer queries, and latency-insensitive when updating the underlying data—or it can be latency-sensitive for writing data streamed from consumer devices, and latency-insensitive when supporting analyst queries against the stored data. The semantics of the specific application matter when setting priorities and measuring performance.

Intentional or Accidental Bad Actors

As is the case with other hardware resources, CPU is subject to either intentional or accidental “bad actors” who can use more than their fair share of the CPU on a node or even the distributed system as a whole. These problems are specific to multi-tenant distributed systems, not single-node systems or distributed systems running a single application.

A common problem case is due to multithreading. If most applications running in a system are single threaded, but one developer writes a multithreaded application, the system might not be tuned appropriately to handle the new type of workload. Not only can this cause general performance problems, but it is also unfair, because that one developer can nearly monopolize the system. Some systems like Hadoop try to mitigate this problem by allowing developers to specify how many cores each task will use (with multithreaded programs specifying multiple cores), but this approach can waste resources: if a task is not fully using the specified number of cores, the reserved cores might simply go unused.
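A small, hedged calculation (with hypothetical task counts and usage figures) illustrates how much capacity such core reservations can strand when tasks reserve more cores than they actually use:

```python
def wasted_core_fraction(tasks):
    """Fraction of reserved cores left idle.

    Each task is a (cores_reserved, cores_actually_used) pair; the numbers
    below are illustrative, not measurements from a real cluster.
    """
    reserved = sum(r for r, _ in tasks)
    used = sum(min(r, u) for r, u in tasks)
    return (reserved - used) / reserved

# 50 tasks each reserve 4 cores (declared multithreaded) but average only
# 1.5 cores of real usage:
tasks = [(4, 1.5)] * 50
print(wasted_core_fraction(tasks))  # -> 0.625, i.e., ~62% of reserved cores sit idle
```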

Applying the Control Mechanisms in Multi-Tenant Distributed Systems

Over time, the kernel has gained additional knobs such as cgroups and CPU pinning, but today there is still no general end-to-end system that makes those mechanisms practical to use. For example, there is no established policy mechanism that requires applications to state their needs, and no distributed system framework connects application policies with kernel-level primitives.

It’s common practice among Unix system administrators to run system processes at a higher priority than user processes, but the desired settings vary from system to system depending on the applications running there. Getting things to run smoothly requires the administrator to have a good “feel” for the way the cluster normally behaves, and to watch and tune it constantly.

In some special cases, software developers have designed their platforms so that they can use CPU priorities to affect overall application performance. For example, the Teradata architecture was designed to make all queries CPU bound, and then CPU priorities can be used to control overall query prioritization and performance. Similarly, HPC frameworks like Portable Batch System (PBS) and Terascale Open-Source Resource and QUEue Manager (TORQUE) support cgroups.

For general-purpose, multi-tenant distributed systems like Hadoop, making effective use of kernel primitives such as cgroups is more difficult, because a given system might be running a multitude of diverse workloads at any given time. Even if CPU were the only limited resource, it would be difficult to adjust the settings correctly in such an environment, because the amount of CPU required by the various applications changes constantly. Accounting for RAM, disk I/O, and network only multiplies the complexity. Further complicating the situation is the fact that distributed systems necessarily divide applications into tens, hundreds, or thousands of processes across many nodes, and giving one particular process a higher priority might not affect the run time of the overall application in a predictable way.

Software such as Pepperdata helps address these complications and other limitations of Hadoop. With Pepperdata, Hadoop administrators set high-level priorities for individual applications and groups of applications, and Pepperdata constantly monitors each process’s use of hardware and responds in real time to enforce those priorities, adjusting kernel primitives like nice and cgroups.

I/O Waiting and CPU Cache Impacts

The performance impact of waiting for disk and network I/O on multi-tenant distributed systems is covered in Chapters 4 and 5 of this book; this section focuses on the behavior of the CPU cache.

In modern systems, CPU chip speeds are orders of magnitude faster than memory speeds, so processors have on-chip caches to reduce the time spent waiting for data from memory (see Figure 3-1). However, because these caches are limited in size, the CPU often incurs cache misses and must wait for data to arrive from slower caches (such as L2 or L3), or even from main memory.2

Figure 3-1. Typical cache architecture for a multicore CPU chip.
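One way to get a feel for the cost of missing these caches is to time random reads against arrays of increasing size; once the working set no longer fits in the L1/L2/L3 caches, the average access time jumps. The following sketch uses Python and NumPy with arbitrary sizes and iteration counts; exact numbers will vary from machine to machine:

```python
import time
import numpy as np

def ns_per_access(size_bytes, n_accesses=2_000_000):
    """Average time for a random 8-byte read from an array of the given size."""
    arr = np.zeros(size_bytes // 8, dtype=np.int64)
    idx = np.random.randint(0, len(arr), size=n_accesses)
    start = time.perf_counter()
    arr[idx].sum()   # gather: roughly one (mostly random) memory access per index
    elapsed = time.perf_counter() - start
    return elapsed / n_accesses * 1e9

# 32 KiB fits comfortably in L1/L2; 8 MiB roughly fills L3; 512 MiB spills
# to main memory, so most accesses become cache misses.
for size in (32 * 1024, 8 * 1024 * 1024, 512 * 1024 * 1024):
    print(f"{size // 1024:>7d} KiB working set: {ns_per_access(size):5.1f} ns per access")
```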

Well-designed programs that are predominantly running alone on one or more cores can often make very effective use of the L1/L2/L3 caches and thus spend most of their CPU time performing useful computation. In contrast, multi-tenant distributed systems are an inherently more chaotic environment, with many processes running on the same machine and often on the same core. In such a situation, each time a different process runs on a core, the data it needs might not be in the cache, so it must wait for data to come from main memory. That new data then replaces what was previously in the cache, so when the CPU switches back to a process it had already been running, that process, in turn, must fetch its data from main memory again. These pathological cache misses can cause most of the CPU time to be wasted waiting for data instead of processing it. (Such situations can be difficult to detect because memory access/wait times show up in most metrics as CPU time.)

Along with the problems due to cache misses, running a large number of processes on a single machine can slow things down because the kernel must spend a lot of CPU time engaged in context switching. (This excessive kernel overhead can be seen in kernel metrics such as voluntary_ctxt_switches and nonvoluntary_ctxt_switches via the /proc filesystem, or by using a tool such as SystemTap.)3
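Those counters are straightforward to read programmatically. For example, this short Python snippet (Linux only) pulls the two context-switch counts for a process from /proc/<pid>/status:

```python
import os

def context_switches(pid):
    """Read voluntary and nonvoluntary context-switch counts from /proc."""
    counts = {}
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith(("voluntary_ctxt_switches",
                                "nonvoluntary_ctxt_switches")):
                key, value = line.split(":")
                counts[key] = int(value.strip())
    return counts

print(context_switches(os.getpid()))
# e.g., {'voluntary_ctxt_switches': 42, 'nonvoluntary_ctxt_switches': 7}
```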

The nature of multi-tenant systems also exacerbates the cache miss problem because no single developer or operator is tuning and shaping the processes within the box: an individual developer has no control over what else is running on the machine, and the environment is constantly changing as new workloads come and go. In contrast, special-purpose systems (even distributed ones) can be designed and tuned to minimize the impact of cache misses and similar performance problems. For example, in a web-scale search engine, each user query requires the system to process different data to produce the search results. A naive implementation would distribute queries randomly across the cluster, resulting in high cache miss rates, with the CPU cache constantly being overwritten. Search engine developers can avoid this problem by assigning particular queries to subsets of the cluster, as sketched below. This kind of careful design and tuning is not possible with general-purpose, multi-tenant distributed systems.
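A minimal sketch of that idea, using hypothetical node names and a made-up routing function rather than any real search engine’s scheme, hashes each query term to a small, stable subset of replicas so that the same data stays warm in the same machines’ caches:

```python
import hashlib

def replica_subset(query_term, replicas, subset_size=3):
    """Map a query term to a small, stable subset of replicas.

    Queries for the same term keep hitting the same machines, so the index
    data for that term stays warm in those machines' caches.
    """
    digest = int(hashlib.md5(query_term.encode()).hexdigest(), 16)
    start = digest % len(replicas)
    return [replicas[(start + i) % len(replicas)] for i in range(subset_size)]

replicas = [f"node-{i:02d}" for i in range(20)]
print(replica_subset("distributed systems", replicas))
# The same term always routes to the same three nodes.
```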

Similarly, another aspect of distributed systems can affect developers’ mindsets: because they have access to much more RAM than they would on a single computer, they might naturally begin to think of RAM as an elastic resource that incurs no performance penalty when accessed. They might thus pay less attention to the details of hardware usage and its performance implications than they otherwise would.

Summary

CPU utilization is a key metric for improving the performance of a distributed system. If CPU utilization is high for the distributed system (overall or on one particular “hotspot” node), it is important to determine whether that CPU utilization reflects useful work (the computation required by the application) or waste due to overhead, such as task startup/teardown, context switching, or CPU cache misses.

Developers can reduce this overhead in some straightforward ways, such as optimizing the division of an application into discrete units. In addition, operators should use tools to control CPU priorities so that CPU cycles are dedicated to the most important workloads when they need them.

1 See https://github.com/linkedin/dr-elephant/wiki.

2 Typically 64-256 KB for the L1 and L2 cache for each core, and a few megabytes for the L3 cache that is shared across cores; see https://en.wikipedia.org/wiki/Haswell_%28microarchitecture%29.

3 See http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html for interesting data on context switch times.
