10
Case Study: QOS in the Data Center

QOS in the Data Center—is that an impossible equation? QOS is all about making the best use of the available bandwidth in relation to need, and that sometimes means selectively punishing one party while protecting another. In brief, it is overprovisioning as a service, which has most often been applied to end users. But what constitutes overprovisioning in a Data Center, and which applications or traffic types can be regarded as more or less important than others? Are drops and TCP retransmissions acceptable behaviors in today’s Data Center? And with today’s Data Centers, where most of the traffic volume is actually East–West, that is, between bare-metal servers and virtual machines (VMs) within the same Data Center (rather than the legacy North–South traffic model), as illustrated in Figure 10.1, what is an important flow versus a not-so-important one? One last fact to consider is that end users are hard to identify and the applications inside the Data Center are more or less equally important.


Figure 10.1 North–South versus East–West traffic

10.1 The New Traffic Model for Modern Data Centers

The legacy traffic pattern for Data Centers has been the classical client–server path and model. The end user sends a request to a resource inside the Data Center, and that resource computes and responds to the end user, as illustrated in Figure 10.2. The Data Center is designed more for this expected North–South traffic than for the traffic that traverses between the racks. The design focuses on transport toward the end user, not really on the applications residing inside the Data Center. The result is traffic patterns that do not always follow a predictable path, due to asymmetric bandwidth between layers in the Data Center.


Figure 10.2 North–South traffic model

The evolution that has taken place is the increase in machine-to-machine traffic inside the Data Center. The North–South versus East–West traffic proportion is now at least 80% in favor of East–West. The result is that many traffic patterns exist only within the Data Center itself, such as those shown in Figure 10.3.


Figure 10.3 East–West traffic model

The reasons for this increased East–West traffic are:

  • Applications are much more tiered, with web, database, and storage layers interacting with each other.
  • Increased virtualization, where applications are easily moved to wherever compute resources are available.
  • Increased server-to-storage traffic, due to the separation of compute nodes and storage nodes, demanding much higher bandwidth and scalability.

How do these tiered applications work? Well, if you go to a website with the intention of buying a book like this one, lots of elements pop up in your browser window. There is of course information about the book, but also review information, suggestions of other similar QOS book titles (none as good as this one, of course), shopping carts, location-based advertising, and so on. Your shopping session results in lots of subsessions where pictures are grabbed from one server, book samples from another, and so forth. There is also lots of synchronization traffic between servers running the same applications and data.

Another driver of the East–West traffic model is the increased use of virtualization, which in brief raises the utilization of each piece of hardware in the Data Center. VMs and applications become more mobile, since any free computing resource in the Data Center can be used. One such example is VMware vMotion, which allows a VM to be moved to wherever resources are available, as illustrated in Figure 10.4. The result is that applications are no longer hosted within the same rack; instead, the network has to support any-to-any communication with predictable paths and performance.


Figure 10.4 Applications run where resources are available

Another reason for the increased East–West traffic is the vast crunching of data. For example, when you surf to a website on your laptop, it is not just by happenstance that offers are suddenly presented to you that seem to have nothing to do with your current session. Lots of information is taken from your browser, location, profile, and even prior web-surfing history. This concept is called “big data” and is one of the most bandwidth-intensive workloads in a Data Center of today.

A software framework called Hadoop is commonly used for these big data scenarios. Apache Hadoop is open-source software that allows distributed processing of large data sets across clusters of machines using the same software library. Hadoop is designed to scale from single servers to thousands of servers, where each server does local computation and storage. What makes Hadoop scalable is its architecture. The Hadoop Distributed File System (HDFS) splits files into blocks and distributes them to machines in the cluster called DataNodes. Processing is handled by the MapReduce function, which distributes the processing to where the data resides, allowing highly redundant, parallel processing of large amounts of data at the same time. The MapReduce engine consists of a JobTracker, to which the client applications submit jobs. The JobTracker then pushes the jobs as tasks to the TaskTrackers that reside on each of the DataNodes in the cluster. All the DataNodes are handled by a NameNode that keeps track of the validity of the DataNodes and thereby of the actual data itself. Hadoop is designed to detect and handle failures at the application layer in any of the machines that may be prone to failure: the JobTracker simply assigns the job to other nodes, and the NameNode ensures the data is available on other nodes. The result is a self-healing setup with limited need for, for example, dual-homed links and dedicated hardware.

In big data, jobs get processed as close to the local data as possible (within the rack). However, the distribution and replication of all this data places a heavy burden on the Data Center infrastructure, since all data is replicated and that replication happens more or less all the time in a Hadoop cluster, as illustrated in Figure 10.5.


Figure 10.5 Data replication
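To get a feeling for the volumes involved, here is a minimal back-of-the-envelope sketch. It assumes the common HDFS replication factor of 3, with the first replica written locally and the other two copied to nodes elsewhere in the fabric; the ingest figure is purely hypothetical and only serves to show how replication alone multiplies East–West traffic.

```python
# Rough estimate of East-West replication traffic in an HDFS-style cluster.
# Assumptions (illustrative, not from the text): replication factor 3,
# first replica stored locally, the remaining copies sent across the fabric.

def replication_traffic_gb(ingested_gb: float, replication_factor: int = 3) -> float:
    """Return the amount of data (GB) crossing the fabric for remote replicas."""
    remote_copies = replication_factor - 1  # first copy stays on the local node
    return ingested_gb * remote_copies

if __name__ == "__main__":
    ingested = 500.0  # GB written by jobs during one hour (assumed value)
    east_west = replication_traffic_gb(ingested)
    print(f"{ingested} GB ingested -> ~{east_west} GB of East-West replication traffic")
```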

This is a clear difference from the old model, where one interaction was limited to one North–South exchange. Instead, a great deal of East–West traffic is involved before any message is even delivered back to the end user. This East–West phenomenon is one of the main reasons why QOS in a Data Center is very different from QOS for traditional user–server (North–South) traffic.

10.2 The Industry Consensus about Data Center Design

The increased demands for consistent, predictable performance and latency, combined with the need to be flexible enough to easily grow and scale, put new demands on Data Center topology and structure. A well-thought-out design is one where traffic paths are predictable and resources are well used. There is an industry consensus that a Spine-and-Leaf design, with an ECMP load-balanced mesh, creates a well-built fabric that scales, is efficient, and, most of all, is predictable. That means no ring structures or asymmetric bandwidth clustering. With a two-tier Spine-and-Leaf design, paths are never more than two hops away, resulting in consistent round-trip time (RTT) and low jitter with per-flow/session hashing, as shown in Figure 10.6.


Figure 10.6 ECMP mesh
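The per-flow hashing mentioned above can be illustrated with a small sketch. This is not any vendor’s actual algorithm, just a minimal example assuming a hash over the classic 5-tuple, used to pick one of the equal-cost uplinks so that all packets of a flow follow the same path (preserving ordering) while different flows spread across the mesh.

```python
# Minimal illustration of ECMP per-flow path selection (assumed 5-tuple hash,
# not a specific vendor implementation).
import hashlib

UPLINKS = ["spine1", "spine2", "spine3", "spine4"]  # equal-cost next hops

def pick_uplink(src_ip, dst_ip, proto, src_port, dst_port):
    """Hash the 5-tuple and map it onto one of the equal-cost uplinks."""
    key = f"{src_ip}-{dst_ip}-{proto}-{src_port}-{dst_port}".encode()
    digest = int(hashlib.md5(key).hexdigest(), 16)
    return UPLINKS[digest % len(UPLINKS)]

if __name__ == "__main__":
    # All packets of one flow map to the same uplink; other flows may differ.
    print(pick_uplink("10.0.0.1", "10.0.1.1", "tcp", 40123, 80))
    print(pick_uplink("10.0.0.2", "10.0.1.1", "tcp", 51022, 80))
```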

One Data Center design rule is to keep the Spine as clean as possible, with no resources like servers attached to it. Instead, not just servers but also other resources such as edge routers, load balancers, and firewalls are attached to the Leaf layer, as shown in Figure 10.7. By keeping this clean Spine design, the network stays consistent and predictable regarding behavior and performance, while any failing equipment is easy to isolate and replace.


Figure 10.7 Functions and compute on the Leaf only

Regarding redundancy, the best redundancy is the one that is always in use. If you are using an active–passive design for dual-homed servers, it is a recipe for possible issues once the passive link needs to take over from a faulty active one. If both links are active, then redundancy is verified as part of the daily operation, and bandwidth is also increased. A common feature to achieve this is Multichassis Link Aggregation (MLAG), which allows the same LAG from hosts or switches to be terminated on two switches in an active–active design without any blocked ports or passive state. The same principle goes for the links between Spine and Leaf—if all links are active with the same metric, then redundancy is part of the actual ECMP forwarding mesh. A failed link is just a temporary reduction of bandwidth and nothing else.

One trend in large Data Centers is to deploy IP as the protocol between Leaf and Spine. This limits the Layer 2 domain and the Layer 2 protocols that were not designed for twenty-first-century networking. Spanning Tree Protocol is, by design, a blocking architecture, whatever variation of it is used, and it fits poorly in a Leaf-and-Spine ECMP design. Layer 2 is then confined to the Leaf layer, using MLAG or cluster functionality.

10.3 What Causes Congestion in the Data Center?

Experts and designers vary on the best way to implement QOS functions in the Data Center. However, before discussing a cure, let’s identify the symptoms. One obvious question is: what are the congestion scenarios in the Data Center? Let’s focus on the East–West traffic and identify the QOS scenarios where packets are either dropped or delayed, causing session rates to be severely reduced inside the Data Center itself.

The bottleneck scenarios are:

  • Oversubscribed networks with bursts greater than available bandwidth
  • Multiple nodes trying to read/write to one node resulting in TCP performance challenges
  • Servers sending pause frames to the network layer, which in turn causes congestion (a phenomenon, together with flow control, detailed in Chapter 4)

10.3.1 Oversubscription versus Microbursts

No Data Center is built with the assumption and acceptance of constant bottlenecks. Instead, congestion occurs when packet bursts occasionally go beyond the available bandwidth. It is common practice in the Data Center to oversubscribe the trunks between Leaf and Spine compared to the access port bandwidth—a 2:1 ratio means that the sum of the port bandwidth intended for hosts is double that of the ports primarily designed for trunks. Common oversubscription ratios are 2:1 or 3:1, and of course there are variations with more or less oversubscription. Consider a Leaf switch with 32*10GE access ports to hosts and 4*40GE trunk ports connecting to the Spine. This means a 2:1 oversubscription ratio, by hardware architecture, for traffic leaving the trunk ports, as illustrated in Figure 10.8.


Figure 10.8 Leaf switch oversubscription by design

The way to avoid too much oversubscription by hardware design is, of course, simply not to oversubscribe. If the design uses a 1:1 bandwidth ratio with the hardware described earlier, then no more access ports can be used than the switch’s weakest link, the trunk capacity, allows. If the available trunk bandwidth is 160 Gbps, then the maximum access bandwidth can only be 160 Gbps, that is, one half of the access ports.
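As a quick sanity check, the ratio for the Leaf switch example above can be computed directly. This is a minimal sketch using the port counts from the text (32*10GE access, 4*40GE trunk); nothing here is vendor specific.

```python
# Oversubscription ratio of a Leaf switch: access bandwidth vs. trunk bandwidth.

def oversubscription(access_ports: int, access_gbps: int,
                     trunk_ports: int, trunk_gbps: int) -> float:
    access_bw = access_ports * access_gbps   # total host-facing bandwidth (Gbps)
    trunk_bw = trunk_ports * trunk_gbps      # total uplink bandwidth (Gbps)
    return access_bw / trunk_bw

if __name__ == "__main__":
    ratio = oversubscription(32, 10, 4, 40)  # example from the text
    print(f"Oversubscription ratio: {ratio}:1")  # -> 2.0:1
```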

Another oversubscription scenario is the one called microbursts. The phenomenon has been known for a long time in the mobile networking world but is a relatively recent reality in the IP and Data Center arena. A microburst event is when packet drops occur despite there being no sustained or noticeable congestion on a link or device. The cause of these microbursts tends to be speed mismatch (10GE to 1GE and 40GE to 10GE). The more extreme the speed mismatch, the more dramatic the symptom. Note that it is not congestion due to fan-in or asymmetric-speed oversubscription; it is the simple fact that higher-bandwidth interfaces have higher bit rates than lower-bandwidth interfaces. A typical example in the Data Center world is an asymmetrical speed upgrade: the hosts are on 1GE access links, but the storage has been upgraded to a 10GE link. In theory, this appears not to be a problem, as seen in Figure 10.9.

Schematic of a storage node connected via a 10GE port to a switch with 10*1GE ports; data flows from the storage and is split into ten packet streams.

Figure 10.9 Speed conversion

However, the real challenge here is the bit rate difference between 1GE and 10GE. In brief, the tenfold (10*) difference results in short bursts of packets toward the 1GE interface. If the interface buffers are small, or lots of flows eat up the shared pool of buffering, there will be occasional drops that are not easy to catch unless you poll interface statistics frequently enough. There is only one way to solve this, and that is to be able to buffer these bursts, as shown in Figure 10.10.

Schematic of a switch buffering packets between a 1GE port and a 10GE port; the interpacket gap is 9.6 ns on the 10GE port and 96 ns on the 1GE port.

Figure 10.10 Bit rate difference needs buffering

Another parameter to be aware of is the serialization time difference between 1GE and 10GE, and between 10GE and 40GE. For example, the time it takes to serialize a 100-byte packet on a 1GE interface is 800 ns, while the same 100-byte packet on 40GE takes only 20 ns, which can strain the buffering situation even more in a topology with speed mismatch, as shown in Figure 10.11.

Schematic of serialization times for a 100-byte packet at 1GE, 10GE, and 40GE interface speeds.

Figure 10.11 Serialization difference interface speed 100 byte
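The numbers above follow directly from packet size and line rate, and the short sketch below reproduces them; it is also a handy check for other packet sizes. The packet size and speeds are simply the ones used in the example.

```python
# Serialization time = packet size in bits / line rate in bits per second.

def serialization_ns(packet_bytes: int, speed_gbps: float) -> float:
    bits = packet_bytes * 8
    return bits / speed_gbps  # Gbps equals bits per nanosecond, so result is in ns

if __name__ == "__main__":
    for speed in (1, 10, 40):
        print(f"100-byte packet at {speed}GE: {serialization_ns(100, speed):.0f} ns")
    # -> 800 ns at 1GE, 80 ns at 10GE, 20 ns at 40GE
```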

One way to limit microbursts is to have the same bit rate, and thereby the same serialization time, end to end within the Data Center, that is, to avoid too much difference in interface speeds. For example, in a Hadoop cluster it is suboptimal to run servers with different bandwidth capabilities, since the concept is to have all the data replicated to several DataNodes; if the nodes have different interface speeds, microburst situations can obviously occur. A common-speed topology can be achieved by using multispeed ports over MTP cabling: in brief, a single cable consisting of 12 fibers, so that a 40-Gbps port can be run as 4*10-Gbps channelized interfaces. Thereby the same bit rate can be achieved, as shown in Figure 10.12.


Figure 10.12 Bit rate consistency

Still, the only really efficient solution that can flexibly handle traffic bursts inside the Data Center is to have buffering capabilities in both the Spine and the Leaf layer.

10.3.2 TCP Incast Problem

Some might consider TCP Incast a fancy term for congestion caused by multiple nodes trying to read from or write to the same node. While it is a many-to-one scenario, it is more than that, since it affects the TCP protocol itself. In brief, what happens is:

  1. Many nodes access the same node in a tiered application scenario, or an overutilized DataNode in a Hadoop cluster.
  2. Packets are dropped or delayed upon buffer exhaustion.
  3. Dropped packets, or packets whose RTT exceeds the RTO, cause retransmissions.

Retransmissions are something you do not want for East–West traffic, for all the obvious reasons, but also because the higher utilization leads to unfairness across flows. This is a vicious cycle where throughput suffers due to TCP congestion behavior. The result can be that the network looks underutilized in the statistics from the network equipment, since those counters only see the traffic that actually gets delivered and do not show the TCP backoff symptoms (the details regarding TCP backoff mechanisms are described in depth in Chapter 4).

Let’s briefly review TCP’s congestion behavior. The sender’s TCP congestion window (CWND), described earlier in this book, relies upon receiving TCP acknowledgment (ACK) packets from the receiver in a timely manner in order to adjust the transmission rate. The rate is simply the number of packets transmitted before expecting to receive these TCP ACKs. When a transmitted packet is lost, the retransmission timeout (RTO) setting on the server determines how long TCP waits for the acknowledgment of the transmitted segment; in a nutshell, it is the expected round-trip time (RTT). If the acknowledgment is not received within this time, the segment is deemed lost. This is basically TCP rate control and congestion management applied on top of the available network bandwidth. If there is insufficient bandwidth in the network to handle these bursts, packets are dropped in order to signal the sender to reduce its rate.
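As a reminder of how drastic the backoff is, here is a minimal sketch of the additive-increase/multiplicative-decrease behavior that Chapter 4 covers in depth. The numbers (a 64 KB window ceiling, 1460-byte segments) are illustrative assumptions, not measurements from a real stack.

```python
# Toy model of TCP congestion-window behavior: grow until a loss, then halve.
# Purely illustrative; real stacks (Reno, CUBIC, DCTCP) are far more nuanced.

MSS = 1460            # assumed segment size in bytes
MAX_CWND = 64 * 1024  # assumed window ceiling in bytes

def next_cwnd(cwnd: int, loss: bool) -> int:
    if loss:
        return max(MSS, cwnd // 2)        # multiplicative decrease on loss
    return min(MAX_CWND, cwnd + MSS)      # additive increase per RTT otherwise

if __name__ == "__main__":
    cwnd = 10 * MSS
    for rtt, loss in enumerate([False, False, True, False, False, True, False]):
        cwnd = next_cwnd(cwnd, loss)
        print(f"RTT {rtt}: loss={loss} cwnd={cwnd} bytes")
```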

So what now? Why not just increase the speed of those servers that have to feed others, or at least make sure they are in the same rack? And here is the catch described earlier: in the modern world of virtualization and Hadoop clusters, it is not that trivial to find these many-to-one patterns and thereby the obvious bottlenecks. However, there is a known cure for these bursts—apply buffers on the networking equipment—a lesson that routers in the core of the Internet learned years ago. In the Data Center, however, it has been a bad habit for some time not to have enough buffering capacity.

But why, in the name of Bill Gates, would small buffers on networking equipment create such a wide range of bandwidth outcomes, resulting in lucky versus unlucky flows? When many sessions, and thereby flows, pass through a congested switch with limited buffer resources, which packet is dropped and which flow is impacted is a function of how much packet buffering capacity was available at the moment that packet arrived, and thereby a matter of lucky versus unlucky flows. TCP sessions whose packets are dropped back off and get a smaller share of the overall network bandwidth, with some unlucky flows having their packets dropped much more than others, resulting in the sender’s congestion window being halved or even ending up in the TCP slow-start state. Meanwhile, the lucky sessions, whose packets just happen to arrive when packet buffer space is available, do not drop packets; instead of slowing down, they are able to increase their congestion window and packet rate, unfairly grabbing a larger share of the bandwidth. This behavior is most often referred to as the “TCP/IP Bandwidth Capture Effect,” meaning that in a congested network with limited buffer resources, some sessions’ flows will capture more bandwidth than other flows, as illustrated in Figure 10.13.


Figure 10.13 TCP Incast

So how can this behavior with lucky versus unlucky flows be avoided? Well, one traditional way is to implement the RED schemes described earlier, but ultimately that also results in retransmissions and the vicious cycle of TCP congestion management. RED was originally designed for low-speed connections and is also more suitable for end users than for East–West server-to-server connections. Another solution discussed is to use UDP inside the Data Center, but the result is that delivery assurance moves from the TCP stack to the application itself. Some advocate a new form of TCP, Data Center TCP (DCTCP), which uses the Explicit Congestion Notification (ECN) bits: by marking packets, it detects congestion and speeds up the feedback process in order to reduce the transmission rate before too many drops occur and the TCP mechanics kick in. In brief, it is a Layer 3 version of Layer 2 flow control. However, the true story is that the best way to handle TCP Incast scenarios is with buffering capabilities on the networking equipment, the same conclusion as with microbursts.
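The DCTCP adjustment mentioned above can be sketched roughly as follows: the sender tracks the fraction of ECN-marked packets per window and scales the window reduction to the amount of congestion, instead of always halving. The gain value (g = 1/16) is the default suggested in the DCTCP literature; the rest is an illustrative simplification, not the full algorithm.

```python
# Sketch of DCTCP's window adjustment: the cut is proportional to the fraction
# of ECN-marked packets (alpha), rather than a fixed halving on any loss.

G = 1.0 / 16  # EWMA gain for the marked-packet estimate (commonly cited default)

def update_alpha(alpha: float, marked: int, total: int) -> float:
    fraction = marked / total if total else 0.0
    return (1 - G) * alpha + G * fraction

def adjust_cwnd(cwnd: float, alpha: float) -> float:
    return cwnd * (1 - alpha / 2)  # mild congestion -> mild reduction

if __name__ == "__main__":
    alpha, cwnd = 0.0, 100.0  # cwnd in segments
    for marked, total in [(2, 50), (10, 50), (40, 50)]:
        alpha = update_alpha(alpha, marked, total)
        cwnd = adjust_cwnd(cwnd, alpha)
        print(f"marked {marked}/{total}: alpha={alpha:.3f} cwnd={cwnd:.1f}")
```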

10.4 Conclusions

There is an ongoing discussion that too much buffering is just as bad as dropping packets. The argument is that buffers should be small: you can use larger buffers to mask the issue, but that just increases latency and causes jitter; in extreme cases, you can end up with packets arriving seconds later. Dropping packets is not the end of the world, but it puts a limit on how large the latency and jitter can grow.

There is absolutely some validity in this argument regarding the end user’s quality of experience for applications like VoIP and others. But for East–West server-to-server traffic, it is another story. If you lack buffers and cannot absorb bursts, you end up with retransmissions, and that is really a consequence of inefficient management of resources and of TCP performance. The default RTO on most UNIX/Linux implementations residing in Data Centers is around 200 ms. Tuning the RTO is outside the scope of this book and is not a trivial task, but let’s discuss the ultimate buffer size for networking equipment inside a Data Center if the majority of applications and traffic are TCP based, since congestion results in TCP backoff and a smaller TCP CWND. The perfect scenario is if the switch can buffer an entire TCP window for all possibly congested session flows; then no retransmissions are needed!

Let’s play with numbers and plug them into this equation: number of flows without drops = total packet memory/TCP window (the arithmetic is repeated in the short sketch after the list):

  • Maximum TCP window per flow: 64 KB
  • Servers sustain a maximum of 100 simultaneous flows
  • Packet memory needed per port = 64 KB × 100 = 6400 KB
  • Total packet memory needed (memory per port × number of ports) = 6400 KB × 32 = 204,800 KB ≈ 200 MB
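Here is a minimal sketch of the arithmetic above, assuming the same illustrative values (64 KB windows, 100 flows per port, 32 ports); it is a sizing estimate, not a statement about any particular switch.

```python
# Buffer sizing estimate: enough memory to hold one full TCP window
# for every simultaneously congested flow on every port.

def total_buffer_bytes(window_bytes: int, flows_per_port: int, ports: int) -> int:
    per_port = window_bytes * flows_per_port
    return per_port * ports

if __name__ == "__main__":
    total = total_buffer_bytes(64 * 1024, 100, 32)
    print(f"Per port: {64 * 1024 * 100 / 1024:.0f} KB")               # 6400 KB
    print(f"Total: {total / (1024 * 1024):.0f} MB of packet memory")  # ~200 MB
```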

So if buffering is the name of the game for optimal performance for Data Center East–West traffic, let’s take a look at the most common buffer hardware architectures in networking equipment inside the Data Center:

  • Dedicated per port buffer
  • Shared-pool buffer
  • Virtual output queuing (VOQ)

The architecture with a dedicated buffer allocated per port is a simple construct. In brief, each port is assigned its own share of the memory banks, and thereby each port has its own resource. It is a good design if all the ports are expected to talk to each other simultaneously, since it limits the ability of greedy ports to affect the performance of others. Its drawback is that it drains lots of memory, and there are also some backpressure scenarios that are hard to solve, since they would demand multiple ingress and egress physical buffers. Where memory is generously allocated, it is usually an off-chip design, resulting in possible extra delay and costly extra components, such as extra processors to handle the queuing memory.

The shared-pool buffer architecture assumes that congestion occurs sporadically on only a few egress ports and, realistically, never happens on all ports simultaneously. This allows a more centralized on-chip buffer architecture that is a less costly design compared with dedicated memory. However, the shared-pool architecture demands a complex algorithm so that the buffer is dynamically shareable and weighted toward congested ports when backpressure occurs from those ports due to congestion. No port or queue is allowed to hold on to buffers for too long, or to hold too much, just in case of peaks and bursts. It also demands tuning of thresholds for ingress versus egress buffering and for port versus shared-pool volume.

VOQ is a technique where traffic is separated into queues on ingress for each possible egress port and queue. It addresses a problem discussed in the earlier chapters: head-of-line (HOL) blocking. In a VOQ design, each input port maintains a separate queue for each output port. VOQ does demand some advanced forwarding logic, like cell-based crossbars and advanced scheduling algorithms. The VOQ mechanism provides much more deterministic throughput, at a much higher rate, than crossbar switches running without it, but it demands much more advanced scheduling and queuing algorithms, and the hardware design can be costly. But like most things in life, there are no simple solutions and no such thing as a free lunch, even at the buffer buffet.
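To make the VOQ idea concrete, here is a minimal data-structure sketch with a simple round-robin pick; real implementations use cell-based crossbars and far more sophisticated arbitration. It only shows why per-(ingress, egress) queues avoid head-of-line blocking.

```python
# Minimal virtual output queuing sketch: each ingress port keeps one queue per
# egress port, so a packet stuck behind a congested egress does not block
# packets destined for other egress ports (no head-of-line blocking).
from collections import deque

class VOQSwitch:
    def __init__(self, num_ports: int):
        self.num_ports = num_ports
        # voq[ingress][egress] is an independent FIFO queue
        self.voq = [[deque() for _ in range(num_ports)] for _ in range(num_ports)]

    def enqueue(self, ingress: int, egress: int, packet) -> None:
        self.voq[ingress][egress].append(packet)

    def schedule(self, egress: int):
        """Very simplified round-robin pick of one packet destined for egress."""
        for ingress in range(self.num_ports):
            if self.voq[ingress][egress]:
                return self.voq[ingress][egress].popleft()
        return None

if __name__ == "__main__":
    sw = VOQSwitch(num_ports=4)
    sw.enqueue(0, 3, "pkt-A")  # ingress 0 -> egress 3 (assume congested)
    sw.enqueue(0, 1, "pkt-B")  # ingress 0 -> egress 1 (free); not blocked by pkt-A
    print(sw.schedule(1))      # -> pkt-B
```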

Further Reading

  1. Bechtolsheim, A., Dale, L., Holbrook, H. and Li, A. (2015) Why Big Data Needs Big Buffer Switches, February 2015. https://www.arista.com/assets/data/pdf/Whitepapers/BigDataBigBuffers-WP.pdf (accessed September 8, 2015).
  2. Chen, Y., Griffith, R., Liu, J., Katz, R.H. and Joseph, A. (2009) Understanding TCP Incast Throughput Collapse in Data Center Networks, August 2009. http://yanpeichen.com/professional/TCPIncastWREN2009.pdf (accessed August 19, 2015).
  3. Das, S. and Sankar, R. (2012) Broadcom Smart-Buffer Technology in Data Center Switches for Cost-Effective Performance Scaling of Cloud Applications, April 2012. https://www.broadcom.com/collateral/etp/SBT-ETP100.pdf (accessed August 19, 2015).
  4. Dukkipati, N., Mathis, M., Cheng, Y. and Ghobadi, M. (2011) Proportional Rate Reduction for TCP, November 2011. http://dl.acm.org/citation.cfm?id=2068832 (accessed August 19, 2015).
  5. Hedlund, B. (2011) Understanding Hadoop Clusters and the Network, September 2011. http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/ (accessed August 19, 2015).
  6. Wu, H., Feng, Z., Guo, C. and Zhang, Y. (2010) ICTCP: Incast Congestion Control for TCP in Data Center Networks, November 2010. http://research.microsoft.com/pubs/141115/ictcp.pdf (accessed August 19, 2015).