Chapter 6. Network Performance Limits: Causes and Solutions

Introduction

Networking is fundamental to distributed systems, which are composed of many machines communicating and working together. Networking is also often a bottleneck, so managing network performance is essential.

Network usage is much more like disk than CPU or memory in terms of its impact on multi-tenant distributed systems. Like disk usage, network usage by a given application tends to be extremely “bursty” and can be a bottleneck either before or after the bulk of a process’s useful computation is done. Network usage can also cause significant delays for latency-sensitive applications, even though those applications tend to be much lighter users of network (and disk) than latency-insensitive, data-heavy batch applications.

Bandwidth Problems in Distributed Systems

Whereas disk performance is limited by both total bandwidth and disk seeks (I/O operations per second [IOPS]), network performance is primarily limited by bandwidth alone. However, bandwidth in “the network” is not monolithic; depending on the network topology, bandwidth can be limited at several points, including the following (see Figure 6-1):

  • The network interface card (NIC) or cards on each node

  • The kernel’s ability to access the NIC to send and receive data

  • The switch each node is connected to—generally a top-of-rack (ToR) switch connecting 20–40 nodes together

  • The switch backplane connecting ports on the rack switch

  • The uplink from the rack switch to the core switch that connects many racks

  • The backplane of the core switch

  • For some applications, an external connection to a different data center (for example, if large amounts of data are being transferred across geographically distant clusters)

Network bandwidth bottlenecks (either at the level of an individual node or the rack or core switch) are quite common and impactful in distributed systems, especially multi-tenant ones, which often have mixed workloads with latency-sensitive but low-bandwidth applications competing with batch, high-bandwidth applications.

A single process from a data-intensive distributed application can often saturate the link between the node it’s running on and the ToR switch, which is typically 1 Gbps. Similarly, when many nodes in a rack need to send data to other racks at the same time (a common occurrence for applications like MapReduce, which can spread a compute- and data-intensive job across every node of the cluster), the link between the ToR switch and the core switch can be saturated. That uplink from the ToR switch to the core switch is typically 10 Gbps, but there might be 40 nodes in the rack, each able to send 1 Gbps to the ToR switch to be relayed on to the core switch.

Even making the network faster might not solve the problem, because distributed systems like Hadoop are often intentionally tuned to consume all available bandwidth and maximize raw throughput. For example, a cluster might have 10 Gbps networking hardware on each node and 160 Gbps uplinks from every ToR switch; with 40 nodes per rack, the nodes could collectively send 400 Gbps of data, which would overwhelm the 160 Gbps uplink.
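
The arithmetic behind these examples is simple but worth making explicit. The following Python sketch, using the illustrative numbers from the text rather than measurements from any real cluster, computes how badly a rack uplink is oversubscribed in the worst case:

    def uplink_oversubscription(nodes_per_rack, nic_gbps, uplink_gbps):
        """Worst-case traffic leaving a rack versus the uplink's capacity."""
        worst_case_gbps = nodes_per_rack * nic_gbps
        return worst_case_gbps, worst_case_gbps / uplink_gbps

    # 40 nodes with 1 Gbps NICs behind a 10 Gbps ToR uplink
    print(uplink_oversubscription(40, 1, 10))    # (40, 4.0): uplink 4x oversubscribed

    # 40 nodes with 10 Gbps NICs behind a 160 Gbps uplink
    print(uplink_oversubscription(40, 10, 160))  # (400, 2.5): still 2.5x oversubscribed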

Figure 6-1. Typical network architecture for a Hadoop cluster.

Sometimes, the network-intensive application is known and consistent over time—for example, in the case of a search engine pushing out large search indexes to each node in the cluster, or a regular ETL job loading data onto or off of the cluster. In those cases, it might be possible for the operator to tune the application by artificially reducing its network usage to a level that will leave sufficient bandwidth for the other tenants. Unfortunately, as distributed systems run more and more diverse applications from more and more users, such specific workarounds often become impractical or impossible.

Hadoop’s Solution to Network Bottlenecks: Move Computation to the Data

One of the fundamental design points of Hadoop is to reduce the chance of network bottlenecks by moving the computation closer to the data:

  1. For a given task that needs to use a specific HDFS file as input, first try to schedule the task on a node that has that file on its disks.

  2. If that cannot be done in a reasonable amount of time (i.e., if all nodes storing that HDFS file are working at capacity), try to schedule the task on a node on the same rack as a node that has the data.

  3. If that fails, schedule the task on a node on some other rack.

Having three replicas of every HDFS block, with the replicas on two different racks, increases the chance of success at each step. Even in the common case when step 1 (node locality) fails, step 2 (rack locality) usually succeeds (see Figure 6-2).
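
The three-step preference just described can be summarized in a few lines of code. The following Python sketch is a simplified illustration, not Hadoop’s actual scheduler: the helpers (block_locations, rack_of, has_free_slot) are hypothetical, and the sketch omits the “reasonable amount of time” that a real scheduler waits before falling back to a less local choice.

    def choose_node(block, nodes, block_locations, rack_of, has_free_slot):
        """Pick a node for a task reading `block`, preferring local placements."""
        local_nodes = block_locations[block]                 # nodes holding a replica
        local_racks = {rack_of[n] for n in local_nodes}

        # 1. Node locality: a node that already has the block on its disks.
        for n in local_nodes:
            if has_free_slot(n):
                return n, "node-local"

        # 2. Rack locality: any node on a rack that holds a replica.
        for n in nodes:
            if rack_of[n] in local_racks and has_free_slot(n):
                return n, "rack-local"

        # 3. Otherwise: any node with capacity, on any rack.
        for n in nodes:
            if has_free_slot(n):
                return n, "off-rack"

        return None, "no capacity"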

In a typical system with 1 Gbps links for each node and multiple disks per node, node locality is important; tasks with node locality complete faster than otherwise identical tasks with only rack locality. In systems with somewhat more expensive networks (10 Gbps links from the node to the ToR switch1), node locality is less important, but rack locality remains crucial. Rack locality means that input data needs to pass only through the ToR switch rather than the core switch, and the ToR switch generally has more than enough backplane bandwidth that rack-local reads are not a bottleneck.

In MapReduce, node locality and rack locality can be considered only for the map stage, because each map task reads a specific piece of HDFS data whose block locations are known when the task is scheduled. The shuffle stage (in which the output from all mappers is sorted and sent to reducers) cannot be node- or rack-local because, by definition, it sends different map outputs to different reducers. However, the amount of shuffled data is typically much less than the total amount of HDFS data read by all of the mappers (because mappers generally distill their input), so shuffle bandwidth is generally not a bottleneck.

Figure 6-2. Read volume from HDFS for a representative production cluster, showing a typical breakdown of reads from the local node, from other nodes on the same rack, and from remote racks.
Note

Other systems like Mesos also optimize for data locality by delaying the assignment of a given unit of work if the first node offered for it isn’t optimally located; if an optimal location can’t be found within a timeout, a suboptimal location is used instead.2
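
As a rough illustration of that delay-scheduling idea, the sketch below keeps looking for a preferred (data-local) node until a locality timeout expires and only then accepts whatever node is available. It is an assumption-laden sketch, not the Mesos API; offers(), is_preferred(), and the timeout value are all hypothetical.

    import time

    def place(work_item, offers, is_preferred, locality_timeout_s=5.0):
        """Wait up to locality_timeout_s for a preferred node, then take any node."""
        deadline = time.monotonic() + locality_timeout_s
        while time.monotonic() < deadline:
            for node in offers():                  # nodes currently offering capacity
                if is_preferred(work_item, node):
                    return node                    # optimal location found in time
            time.sleep(0.1)                        # wait briefly and look again
        for node in offers():                      # timeout expired: accept any node
            return node
        return None                                # nothing available at all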

Hadoop’s rack-locality optimization is so effective that it has made networking limits mostly irrelevant for Hadoop workloads like MapReduce and Spark. The story is very different for some other distributed systems; for example, in traditional HPC, the application developer needed to ensure that the network was not a bottleneck. Alternatively, systems could be built with very expensive networks, such as a full mesh topology in which every node is connected directly to every other node; in that case, as much as 20 percent of the cluster’s hardware cost might be spent on networking.

Hadoop also reduces the likelihood of network bottlenecks affecting overall performance by setting a bandwidth limit on HDFS block replication. Otherwise, because HDFS replication always sends one copy of a given HDFS block to a node on a different rack, replicating large files could saturate the core switch. (HDFS replication is a classic example of a bandwidth-heavy but latency-insensitive background batch process.)
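
The general mechanism behind such a limit is a simple rate cap on the sender. The sketch below is a generic token-bucket-style throttle in Python, intended only to illustrate the idea; it is not HDFS’s actual replication throttler, and the 10 MB/s figure is an arbitrary example.

    import time

    class BandwidthThrottle:
        """Cap a sender's rate at bytes_per_sec (assumes each chunk <= the cap)."""
        def __init__(self, bytes_per_sec):
            self.bytes_per_sec = bytes_per_sec
            self.allowance = bytes_per_sec
            self.last = time.monotonic()

        def acquire(self, nbytes):
            """Block until nbytes may be sent without exceeding the cap."""
            while True:
                now = time.monotonic()
                self.allowance = min(self.bytes_per_sec,
                                     self.allowance + (now - self.last) * self.bytes_per_sec)
                self.last = now
                if self.allowance >= nbytes:
                    self.allowance -= nbytes
                    return
                time.sleep((nbytes - self.allowance) / self.bytes_per_sec)

    # Example: cap background replication traffic at roughly 10 MB/s per sender;
    # call throttle.acquire(len(chunk)) before sending each chunk.
    throttle = BandwidthThrottle(10 * 1024 * 1024)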

Why Network Quality of Service Does Not Solve the Problem of Network Bottlenecks

Network Quality of Service (QoS) technologies and techniques make it possible to provide differentiated levels of delay, delay variation (jitter), bandwidth, and packet loss for different types of network traffic. At first glance, QoS would seem to let distributed systems operators reduce the impact of network bottlenecks on latency-sensitive applications by ensuring that latency-insensitive, data-heavy applications do not saturate the network. Unfortunately, this doesn’t work in practice, for several reasons.

Network QoS is unaware of the underlying applications generating traffic, so when network bottlenecks are caused by one particular application, all low-priority packets (not just packets from that offending application) are dropped.

Similarly, when a daemon like a distributed file system sends requests on behalf of multiple different kinds of tasks, all network traffic generated by the DFS is grouped together, without any per-application QoS. (This is similar to the problem described in Chapter 5 with HDFS-intermediated disk access.) In theory, daemons like HBase and HDFS could propagate application priorities (based on the specific application and the user/group) all the way down to the network layer, but they do not. In practice, platform developers address the problem for a few specific operations, such as HBase compactions, by limiting bandwidth from within the application; otherwise, the problem is generally ignored, and all traffic is treated the same.

Finally, network QoS tends to be most effective for packets that absolutely must get through but whose latency is unimportant, such as network control packets. When saturation occurs and network queues fill up, the switch often cannot respond quickly enough to prioritize packets from latency-sensitive applications.

Controlling Network Usage on a Per-Application Basis

Software like Pepperdata gives operators the ability to prioritize network access for latency-sensitive applications by instrumenting and controlling the network usage of each application within the process itself, rather than externally at the OS or switch level. (This behavior is similar to the way Pepperdata handles disk I/O.) As a result, Pepperdata is aware of network usage for each individual application, whether that usage happens directly or via a daemon such as HDFS. Because such software also measures the overall usage and capacity of the network, it can dynamically adjust the usage of each individual process to make full use of hardware capacity while still prioritizing the most important or latency-sensitive applications.
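
As a toy illustration of this kind of feedback loop (and emphatically not Pepperdata’s implementation), the function below recomputes a rate cap for low-priority traffic from measured capacity and high-priority demand, leaving some headroom and moving gradually to avoid oscillation. All names and numbers here are hypothetical.

    def adjust_low_priority_cap(capacity_bps, high_priority_bps, current_cap_bps,
                                headroom_fraction=0.1):
        """Recompute the rate cap for low-priority traffic from measured demand."""
        # Reserve headroom plus whatever the high-priority traffic currently needs.
        available = max(capacity_bps * (1 - headroom_fraction) - high_priority_bps, 0)
        # Move only halfway toward the target each cycle to avoid oscillation.
        return current_cap_bps + 0.5 * (available - current_cap_bps)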

Measuring Network Performance and Debugging Problems

The most obvious symptom of network performance problems is slow jobs. In Hadoop, for example, saturated networks or hardware problems can lead to slow HDFS I/O (which involves the network as well as disk for all nonlocal data) and slow MapReduce shuffle stages (which have no locality and thus always rely on the network, given that shuffling is a many-node-to-many-node operation).

Another telltale sign is easily observed when logged in to a terminal on a machine; in the case of significant network delay, the text typed on the command line can “stutter,” with short bursts of characters appearing after a delay. This behavior is unlike that of thrashing (excessive swapping of memory onto and off of disk), which can freeze up a node entirely, or disk I/O problems, which cause slow initial logins but then allow normal interactive typing on the command line.

When detecting and diagnosing network problems, it’s important to understand the baseline for normal operation of a particular network and distributed system; latency and retransmissions beyond that baseline indicate a problem. It’s also critical to look at percentiles (for example, what is the 95th percentile bandwidth usage for the latest hour, compared to the same hour yesterday?), because network usage tends to be quite spiky and what matters is peak usage relative to capacity.
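
For example, given per-minute bandwidth samples for two comparable hours (how the samples are collected is not shown here), a few lines of Python can compute and compare the 95th percentile:

    def p95(samples):
        """95th percentile by nearest rank on the sorted samples."""
        s = sorted(samples)
        return s[int(0.95 * (len(s) - 1))]

    def compare_hours(latest_hour, same_hour_yesterday):
        today, yesterday = p95(latest_hour), p95(same_hour_yesterday)
        change = (today / yesterday - 1) * 100
        print(f"p95 today: {today:.0f} bps, yesterday: {yesterday:.0f} bps ({change:+.1f}%)")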

In addition to prioritizing network usage by particular applications, software like Pepperdata can help in diagnosing the source of network performance problems. Because Pepperdata gathers metrics at the individual process level, it can connect network usage to the semantics of the platform (e.g., HDFS reads versus HDFS writes versus shuffle) and to particular applications. This level of granularity can be critical in quickly identifying both badly behaved applications (and the victims of those applications) and hardware problems.

ping and mtr

The ping tool is a common first step in diagnosing network problems. Because most switches under saturation drop packets from ping first, if ping returns reasonable results, the likelihood of network problems is low. However, in the case of a minor outage or bursty networking, single pings might not detect a problem; in such cases, a “flood ping” (ping -f) can help because it sends a constant stream of ping packets and so is likely to detect intermittent or partial outages.

A useful tool that combines ping and traceroute is mtr, which continuously monitors the route between two machines and reports packet loss and ping times for each hop.3

Both ping and mtr are more useful for detecting routing problems when traversing wide networks like the Internet; they are less useful within a smaller network like a data center or a distributed system.

Retransmissions

Retransmission timeouts occur when the sender has sent a packet and not received an acknowledgment (ACK) from the receiver within a specified timeout (the retransmission timeout, or RTO, value). The RTO value is adjusted dynamically for each open socket; an increase in the RTO value suggests that a specific connection is saturated or has a hardware issue.4 Operators can see the RTO on the command line via the ss -i command.5
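
For a quick look across all open connections, the sketch below shells out to ss -ti and extracts the per-connection rto values (reported in milliseconds). The exact output format varies across iproute2 versions, so treat the regular expression as an assumption to adjust for your system.

    import re
    import subprocess

    def current_rtos():
        """Return the rto value (ms) for each TCP connection reported by `ss -ti`."""
        out = subprocess.run(["ss", "-ti"], capture_output=True, text=True).stdout
        # Detail lines typically contain fields like "rto:204 rtt:0.5/0.25 ..."
        return [int(m.group(1)) for m in re.finditer(r"\brto:(\d+)", out)]

    rtos = current_rtos()
    if rtos:
        print(f"connections: {len(rtos)}, max rto: {max(rtos)} ms")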

The standard tool netstat reports retransmission counts, along with other useful network statistics.6 Retransmission rates above about 3 percent are a problem even in a relatively chaotic environment like AWS; within a data center, much lower retransmission rates (closer to 0.3 percent) can be a sign of a serious problem. When looking at TCP retransmissions, it’s important to distinguish between SYN packets and data packets, because unacknowledged SYN packets might simply mean that the receiving host could not be reached, rather than indicating network congestion.
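
On Linux, the kernel’s cumulative TCP counters in /proc/net/snmp provide the raw numbers behind these rates. The sketch below computes the lifetime retransmission percentage from the OutSegs and RetransSegs counters; in practice you would sample the counters twice and take the difference to get a rate over a recent window.

    def tcp_retransmit_percent():
        """Percentage of TCP segments retransmitted since boot (Linux only)."""
        with open("/proc/net/snmp") as f:
            lines = [line.split() for line in f if line.startswith("Tcp:")]
        header, values = lines[0], lines[1]        # "Tcp:" header row, then values row
        tcp = dict(zip(header[1:], map(int, values[1:])))
        return 100.0 * tcp["RetransSegs"] / max(tcp["OutSegs"], 1)

    print(f"TCP retransmissions: {tcp_retransmit_percent():.2f}% of segments sent")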

Summary

Network access is generally like disk I/O in its effects on distributed system performance. Even though systems like Hadoop have dramatically reduced the impact of network bottlenecks by moving computation to the same node or rack as the data being processed, network spikes still occur and can affect the timely completion of high-priority applications. As with disk I/O, to ensure network bandwidth for critical and time-sensitive applications, operators should use a software solution that effectively prioritizes network usage across different users and applications.

1 Not surprisingly, a 10 Gbps NIC costs about ten times as much as a 1 Gbps NIC but is still a very small fraction of the total node cost.

2 See http://mesos.apache.org/documentation/latest/architecture/.

3 https://www.digitalocean.com/community/tutorials/how-to-use-traceroute-and-mtr-to-diagnose-network-issues has a useful introduction to mtr.

4 http://sgros.blogspot.com/2012/02/calculating-tcp-rto.html and https://www.extrahop.com/community/blog/2016/retransmission-timeouts-rtos-application-performance-degradation/ give good introductions to RTO.

5 See http://linuxaleph.blogspot.com/2013/07/how-to-display-tcp-rto.html and https://blogs.janestreet.com/inspecting-internal-tcp-state-on-linux/ for information on ss -i and http://www.binarytides.com/linux-ss-command/ for more on ss in general.

6 http://www.netcraftsmen.com/application-analysis-using-tcp-retransmissions-part-1/ and http://www.brendangregg.com/blog/2014-09-06/linux-ftrace-tcp-retransmit-tracing.html have useful guidance for understanding retransmissions and their causes.
