12
Quality of Service Configuration Tools in Switch/Routers

12.1 Introduction

The Catalyst 6500 has a rich set of QoS features and configuration tools, which makes it a good reference platform for understanding the types of configuration tools available for switch/routers. A typical switch/router from other vendors supports similar features and configuration tools. Furthermore, devices from different vendors now interoperate well, partly because of the similarities in the capabilities and feature sets built into them.

The QoS functions on the Cisco Catalyst 6500 running the Cisco IOS® Software can be configured via the modular QoS command-line interface (MQC) software feature. This toolkit comes as part of the Cisco IOS Software running on Cisco routers and switch/routers. The QoS configuration process can be described by the following steps (Figure 12.1):

  1. Step 1: Construct a class map that references the relevant ACLs identifying the particular traffic to which QoS is to be applied. The class map defines the traffic classification criteria; essentially, it specifies a set of flow-matching criteria for the arriving traffic.
    • Class maps can be used to classify traffic based on, for example, Layer 3 and Layer 4 protocol information (source and destination IP address, transport protocol type, source and destination transport protocol port), Layer 7 protocol information (e.g., FTP request commands, HTTP header, HTTP URL, HTTP cookie, HTTP content), and so on.
  2. Step 2: Create a policy map that references the class map (in Step 1) and defines/specifies a number of actions that are to be performed on the classified traffic. In the QoS context, the policy map includes the QoS policy (priority queuing, bandwidth allocation, drop profile, etc.) that is to be applied to the matched traffic at a switch/router port, group of ports in a VLAN, and so on.
  3. Step 3: Apply the policy map to the device port, logical interface, a specific VLAN interface, and so on (see the sketch following Figure 12.1).

Figure 12.1 QoS command-line interface structure.
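The three-step MQC process above can be illustrated with a small, self-contained Python sketch that models a class map (match criteria), a policy map (actions applied to matched traffic), and the attachment of the policy to a port. This is only a sketch of the structure; the class names, fields, and the FTP/DSCP example are illustrative assumptions and do not correspond to Cisco IOS commands or APIs.

    # Minimal model of the MQC structure: class map -> policy map -> interface.
    # All names and fields here are illustrative assumptions, not IOS syntax.
    class ClassMap:
        def __init__(self, name, match):
            self.name = name
            self.match = match              # function(packet) -> bool (the "ACL")

    class PolicyMap:
        def __init__(self, name):
            self.name = name
            self.rules = []                 # list of (ClassMap, action) pairs

        def add_class(self, class_map, action):
            self.rules.append((class_map, action))

        def apply(self, packet):
            for class_map, action in self.rules:
                if class_map.match(packet):
                    return action(packet)
            return packet                   # unmatched traffic passes unchanged

    # Step 1: classify FTP control traffic (TCP destination port 21).
    ftp_class = ClassMap("ftp-class", lambda p: p["proto"] == "tcp" and p["dport"] == 21)
    # Step 2: the policy map marks matched traffic with DSCP 10 (illustrative action).
    policy = PolicyMap("mark-ftp")
    policy.add_class(ftp_class, lambda p: {**p, "dscp": 10})
    # Step 3: "attach" the policy to a port by running its packets through the policy map.
    packet = {"proto": "tcp", "dport": 21, "dscp": 0}
    print(policy.apply(packet))             # {'proto': 'tcp', 'dport': 21, 'dscp': 10}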

Configuring and implementing QoS features on the Cisco Catalyst 6500 [CISCQOS05] can be greatly simplified through the use of the auto-QoS software tool. Auto-QoS supports a set of QoS software macros that specify a number of QoS functions that can be invoked from the Cisco Catalyst 6500 CLI. Initially, auto-QoS was used to configure QoS features on a given switch/router port that connects to a Cisco IP Phone.

12.2 Ingress QoS and Port Trust Settings

A network device (switch, router, switch/router, etc.) can receive a packet at one of its ports or interfaces with a CoS priority value that has already been set by an upstream device. When this happens, it may be important for the receiving device to determine if the CoS priority setting in the arriving packet is valid. In addition to this, the device may want to know whether the CoS value was set by a trusted or valid upstream application or device according to well-established or understood predefined CoS marking rules.

Checking the trustworthiness of the CoS settings may be necessary because, for instance, the CoS priority setting may have been carried out by a user hoping to get (unauthorized and unfair) better service from the network. Whatever the reason, the receiving device has to determine whether it should accept the CoS priority value as valid or alter it to another value.

The receiving device uses the port “trust” setting to decide whether to accept the CoS value or not (see Figure 12.2). A port trust setting of “untrusted” results in the receiving device clearing (i.e., wiping out) any CoS priority value carried in an arriving packet. The arriving CoS priority setting is not considered trustworthy in this case and has to be rewritten.


Figure 12.2 Switch port trust settings.

In the default QoS configuration of the Catalyst 6500, all ports on the switch/router are set to the untrusted state. In this configuration, a packet arriving with a CoS priority setting on an untrusted port will have its CoS priority value reset to the default CoS value of zero. The network manager is responsible for identifying the ports on the receiving device on which the CoS priority settings of arriving packets should be honored.

For example, the network manager may decide that all connections to well-known and clearly identified servers (e-mail servers, Web servers, etc.), default network gateways, IP telephony call managers, and IP telephones should be set to the trust state where the incoming CoS priority settings are honored. Ports connected to specific VLANs or IP subnets (such as those carrying management traffic, secured traffic, etc.) can also be preset to the trust state.

A port in the Catalyst 6500 can be configured to trust one of the three priority settings: IEEE 802.1Q/p, IP Precedence, or DSCP. When a port is configured to trust the incoming CoS priority value of packets, the configuration also has to specify which of these three priority setting mechanisms (IEEE 802.1Q/p, IP Precedence, or DSCP) will be trusted.
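The effect of the port trust setting can be summarized in a short Python sketch. The function below is a simplified model of the behavior described in this section (an untrusted port resets the priority to the default CoS of zero, while a trusted port honors the selected marking); the field names are assumptions and this is not device code.

    # Simplified model of ingress port trust handling.
    DEFAULT_COS = 0

    def ingress_priority(packet, trust):
        """trust is one of: 'untrusted', 'cos', 'ip-precedence', 'dscp'."""
        if trust == "untrusted":
            return DEFAULT_COS                        # wipe out the received marking
        if trust == "cos":
            return packet.get("cos", DEFAULT_COS)     # honor IEEE 802.1Q/p CoS
        if trust == "ip-precedence":
            return packet.get("ip_prec", DEFAULT_COS) # honor IP Precedence
        if trust == "dscp":
            return packet.get("dscp", DEFAULT_COS)    # honor DSCP
        raise ValueError("unknown trust setting")

    packet = {"cos": 5, "ip_prec": 5, "dscp": 46}
    print(ingress_priority(packet, "untrusted"))      # 0  - marking not honored
    print(ingress_priority(packet, "dscp"))           # 46 - marking honored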

12.3 Ingress and Egress Port Queues

The line cards of the Cisco Catalyst 6500 switch/routers support a number of input (receive) and output (transmit) queues per port. These queues are implemented in the port hardware ASICs and are fixed in number (cannot be reconfigured). Each queue is allocated an amount of buffer memory that is used to temporarily store packets arriving on the port. Some line cards have ports that are given a dedicated amount of memory for exclusive use – this memory is not shared with other ports on the line card.

Other line cards have a shared memory architecture where a pool of memory is supported on the card to be shared among a group of ports. The ports are organized in groups, with each group allocated a pool of memory. Reference [CISCQOS05] provides a summary list of line card types, their queue structures, and the buffers allocated to each port.

Some line cards have, in addition, a strict priority queue that can be used to queue delay-sensitive and network control traffic. This special queue, typically used for latency-sensitive traffic such as streaming voice and video, is designed to allow the traffic scheduler to service the queued data immediately anytime a packet arrives in this queue.

12.4 Ingress and Egress Queue Thresholds

An important characteristic of data transmission using TCP is that if the network drops a packet, that loss will result in the TCP source retransmitting that packet. During times of heavy network load and congestion, retransmissions can add to the load the network is already carrying and can potentially lead to buffer overload, overflows, and data loss. To provide a way of managing network load, and to ensure that buffers do not overflow, the Catalyst 6500 switch/routers support a number of techniques to manage congestion.

Queue thresholds are assigned by the network manager and are predefined queue occupancy limits for a queue. Thresholds define queue fill limits at which the device triggers congestion management mechanisms to start dropping packets from the queue, or initiates QoS control mechanisms such as packet priority marking, remarking, and so on. Typically, the queue thresholds defined in the device can be used for the functions discussed in the following sections.

12.4.1 Queue Utilization Thresholds

These queue thresholds are used to signal when the buffer space used by a queue has reached a certain predefined occupancy level. When the threshold is crossed, the device will initiate the dropping of newly arriving packets to the queue. The two most common packet drop mechanisms used by network devices are tail-drop and WRED.

12.4.2 Priority Discard Thresholds

These thresholds, typically, are configured on a queue holding packets that can have different priority settings or values. When such a threshold is crossed, a congestion management mechanism such as WRED will start dropping low-priority packets first, and if the congestion persists, the device will move progressively to higher priority packets (Figure 12.3).


Figure 12.3 Mapping a packet to a queue or threshold.

These thresholds can also be used to remark packets to lower priority settings when the queue fill exceeds the threshold. A threshold can also be used to trigger dynamic buffer management mechanisms (for a shared memory buffer pool), for example, to move buffers from ports with lower utilization to ports with high traffic loads.

Furthermore, in load balancing across multiple interfaces (with corresponding queues) on a device, queue thresholds can be used to indicate which interface is lightly loaded and can accept newly arriving flows. A queue (and its interface) that has already crossed its threshold will not be assigned new flows.

12.5 Ingress and Egress QoS Maps

The Cisco Catalyst 6500 supports, in addition to class and policy maps, other types of maps that can be used to perform other QoS functions. These maps are described in the following sections.

12.5.1 Mapping a Packet Priority Value to a Queue and Threshold

A switch/router port with multiple priority queues requires a mechanism to determine the priority queue in which an arriving packet should be placed. The queue placement can be accomplished by using a map that relates the packet's priority setting to a priority queue (Figure 12.3). A classifier reads the priority setting in an arriving packet and consults the map to determine which queue is to store the packet.

This map could be structured or organized to have two columns with the first column storing the possible priority values that can be set in the arriving packets and the second column holding the priority queue (and its associated threshold) to which a packet with a particular priority value should be assigned. It is sometimes necessary to implement an additional default queue to which all other packets, or packets that have markings that cannot fully be interpreted, will be placed.
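Such a two-column map can be sketched as a simple lookup table. The queue and threshold assignments below are illustrative assumptions, not the fixed queue structure of any particular line card.

    # Illustrative priority-to-(queue, threshold) map with a default queue.
    cos_to_queue = {
        0: ("queue1", "threshold1"), 1: ("queue1", "threshold1"),
        2: ("queue1", "threshold2"), 3: ("queue1", "threshold2"),
        4: ("queue2", "threshold1"), 5: ("queue2", "threshold2"),
        6: ("queue3", "threshold1"), 7: ("queue3", "threshold1"),
    }
    DEFAULT_QUEUE = ("queue1", "threshold1")   # for unknown or uninterpretable markings

    def select_queue(cos):
        return cos_to_queue.get(cos, DEFAULT_QUEUE)

    print(select_queue(5))      # ('queue2', 'threshold2')
    print(select_queue(None))   # default queue for markings that cannot be interpreted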

12.5.2 Mapping Packet Priority Values to Internal Switch Priority Values

A packet that arrives at a switch/router port can already be marked with a priority value by an upstream device. However, the trust setting configured at the port will determine how the priority setting in an arriving packet (IP Precedence, IEEE 802.1p, or DSCP value) will be handled by the switch/router.

When the packet arrives at the switch/router, it is assigned an internal priority value that is of relevance only within the switch/router. The internal priority value is only used in the device for internal QoS management functions and is referred to as the internal DSCP (which is analogous to the DSCP used in DiffServ networks).

The Catalyst 6500 switch/router uses a map to relate the arriving packet's priority setting to an internal DSCP value. A map is used to select an internal DSCP that is predefined for a particular priority setting of incoming packets. After the packet has been processed (forwarding table lookup, priority queuing, etc.) and transferred to the egress port of the switch/router, a second map is used to derive the appropriate IP Precedence, IEEE 802.1p, or DSCP priority value that will be rewritten into the packet before it is transmitted out the egress port. Table 12.1 shows a summary of the maps that can be used by the switch/router. Figure 12.4 shows two examples of the maps (on ingress) that can be used to derive the internal DSCP value.

Table 12.1 Map Summary

  • IEEE 802.1p CoS to DSCP Map (related trust setting: trust IEEE 802.1p CoS; used on: input): derives the internal DSCP from the incoming IEEE 802.1p CoS value.
  • IP Precedence to DSCP Map (related trust setting: trust IP Precedence; used on: input): derives the internal DSCP from the incoming IP Precedence value.
  • DSCP to IEEE 802.1p CoS Map (related trust setting: not applicable; used on: output): derives the IEEE 802.1p CoS for the outbound packet from the internal DSCP.

Figure 12.4 Mapping priority to internal DSCP.

The IP Precedence is derived from the now deprecated IP ToS definition of a 1-byte field in the IPv4 header. Out of the 8 bits in the IP ToS field, the first 3 bits are used to indicate the priority of the IP packet. These first 3 bits are referred to as the IP Precedence bits and can be set from 0 to 7 in a packet, with 0 being the lowest priority and 7 the highest priority. Cisco IOS has supported setting IP Precedence in Cisco devices for many years.
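The ingress maps of Figure 12.4 can likewise be sketched as lookup tables. The values below follow the common convention of multiplying the 3-bit CoS or IP Precedence value by 8 to obtain a DSCP; the actual default maps on a given platform may differ and are configurable, so these numbers are assumptions used only for illustration.

    # Illustrative ingress maps: 3-bit priority value -> internal DSCP.
    cos_to_internal_dscp = {cos: cos * 8 for cos in range(8)}     # 0->0, 1->8, ..., 7->56
    prec_to_internal_dscp = {prec: prec * 8 for prec in range(8)}

    def internal_dscp(packet, trust):
        if trust == "cos":
            return cos_to_internal_dscp[packet["cos"]]
        if trust == "ip-precedence":
            return prec_to_internal_dscp[packet["ip_prec"]]
        if trust == "dscp":
            return packet["dscp"]        # a trusted DSCP is carried through unchanged
        return 0                         # untrusted port: default internal DSCP

    print(internal_dscp({"cos": 5}, "cos"))                   # 40
    print(internal_dscp({"ip_prec": 3}, "ip-precedence"))     # 24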

12.5.3 Policing Map

Policing is primarily employed to limit the traffic rate at a particular point in a network to a predefined rate. Instead of dropping packets, policing can also be used to lower (mark down) the priority value of arriving packets when the traffic rate exceeds the rate limit. A switch/router can use a policing map to identify the priority values to which the priority settings in arriving packets will be marked down.

The Catalyst 6500 uses a map called the “policed-dscp-map” to perform this priority remarking task. This policing map is a table organized into two columns, with the first column holding the original priority value in a packet and the second column holding the value to which the arriving packet's priority will be marked down.
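The policed-DSCP map is again a simple two-column lookup. The mark-down pairs used below are arbitrary examples chosen only to show the structure, not recommended values.

    # Illustrative policed-DSCP (mark-down) map: original DSCP -> marked-down DSCP.
    policed_dscp_map = {46: 0, 34: 18, 26: 10}

    def police_mark_down(dscp, exceeded):
        """Return the DSCP to write when the policer reports the rate was exceeded."""
        if exceeded:
            return policed_dscp_map.get(dscp, dscp)   # unmapped values are left unchanged
        return dscp

    print(police_mark_down(46, exceeded=True))    # 0
    print(police_mark_down(46, exceeded=False))   # 46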

12.5.4 Egress DSCP Mutation Map

As already described, when a packet arrives at a Catalyst 6500 switch/router port, the port's trust setting plays an important role in determining the internal DSCP value to be assigned to the packet. The switch/router uses this internal DSCP value to assign resources to the packet as it passes through the switch/router. When the packet reaches the egress port and before it is transmitted out of the port, a new DSCP value to be written into the outgoing packet is derived from a map using the internal DSCP value as an index (see Figure 12.5).


Figure 12.5 Egress DSCP mutation.

An egress DSCP mutation map is used to derive the new DSCP value in the outgoing packet. The egress DSCP mutation map contains the information about the DSCP value to be written in the outgoing packet based on the internal DSCP value of the packet. Egress DSCP mutation maps are supported in the Catalyst 6500 with PFC3A, PFC3B, or PFC3BXL modules.

12.5.5 Ingress IEEE 802.1p CoS Mutation Map

On some Catalyst 6500 line cards, an ingress IEEE 802.1p CoS mutation map can be used on a port that is configured as an IEEE 802.1Q trunk port. This mutation map allows the switch/router to change the incoming IEEE 802.1p CoS value in a packet to another predefined IEEE 802.1p CoS value. An IEEE 802.1p CoS mutation map lists, for each possible incoming IEEE 802.1p CoS value, the corresponding CoS value that will be written into the outgoing packet. A network manager can construct an IEEE 802.1p CoS mutation map to suit the traffic management policy requirements of a particular network.

The IEEE 802.1p CoS mutation map feature is supported on some of the Cisco Catalyst 6500 line cards such as the 48-port GETX and SFP CEF720 Series line card, the 4-port 10GE CEF720 Series line card, and the 24-port SFP CEF720 Series line card. These line cards require a Supervisor Engine 720 to be installed on the Catalyst 6500 chassis for them to function.

12.6 Ingress and Egress Traffic Policing

The PFC on the Supervisor Engine and some line card types in the Catalyst 6500 are capable of supporting different policing mechanisms. Policing can be performed on aggregate flows or on microflows passing through the Catalyst 6500. These different policing mechanisms are described in the following sections.

12.6.1 Aggregate Policing

An aggregate policer is a policing mechanism that is used to limit the rate of all traffic that matches a set of classification criteria (defined using an ACL) on a given port or VLAN to a predefined rate limit. The aggregate policer can be applied at a port to rate limit either inbound or outbound traffic. The aggregate policer can also be applied to rate limit traffic in a single VLAN that attaches to multiple ports on a switch/router.

When an aggregate policer is applied to a single port, it meters and rate limits all traffic that matches the classifying ACL passing through the port and the policer. When the aggregate policer is applied to a VLAN, it meters and rate limits all of the matching traffic passing through any of the ports in that VLAN to the predefined rate limit.

Let us assume, for example, that an aggregate policer is applied to a VLAN containing 10 ports on a switch/router to rate limit all traffic to a predefined rate of 50 Mb/s. Then, all traffic entering these 10 ports in the VLAN matching the classification criteria (set by the ACL) would be policed to not exceed 50 Mb/s. The PFC3 on Supervisor Engine 720 can support up to 1023 active aggregate policers in the Catalyst 6500 at any given time.
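An aggregate policer of this kind is commonly built on a token bucket. The following is a minimal single-token-bucket rate limiter in Python, intended only to illustrate the principle; the 50 Mb/s rate and 64 KB burst values are arbitrary, and a real policer could mark excess traffic down (Section 12.5.3) instead of dropping it.

    import time

    class TokenBucketPolicer:
        """Single token bucket: 'rate' in bytes per second, 'burst' in bytes."""
        def __init__(self, rate, burst):
            self.rate, self.burst = rate, burst
            self.tokens = burst
            self.last = time.monotonic()

        def conform(self, packet_len):
            now = time.monotonic()
            self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if packet_len <= self.tokens:
                self.tokens -= packet_len
                return True                # in profile: forward
            return False                   # out of profile: drop (or mark down)

    # A 50 Mb/s aggregate limit expressed in bytes per second, with a 64 KB burst.
    policer = TokenBucketPolicer(rate=50_000_000 / 8, burst=64_000)
    results = [policer.conform(1500) for _ in range(50)]
    print(results.count(True), "of 50 back-to-back packets conformed")   # roughly 42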

12.6.2 Microflow Policing

The microflow policer operates slightly differently from the aggregate policer in that it rate limits traffic belonging only to a discrete flow to a predefined rate limit. A flow can be defined as a unidirectional flow of packets that are uniquely identified by IP packet fields such as source and destination IP addresses, transport protocol type (TCP or UDP), and source and destination transport protocol port numbers. The default configuration of microflow policing in the Catalyst 6500 is based on a unique flow being identified by its source and destination IP addresses, and its source and destination TCP or UDP port numbers.

When applied to a VLAN, the aggregate policer would rate limit the total amount of traffic entering that VLAN at the specified rate limit. The microflow policer, on the other hand, would only rate limit each flow (in a VLAN) to the specified rate. For example, if a microflow policer is applied to a VLAN to enforce a rate limit of 2 Mb/s, then every flow entering any port in the VLAN would be policed to not exceed the specified rate limit of 2 Mb/s. It is important to note that although a microflow policer can be used to rate limit traffic for specific flows in a VLAN, it does not limit the number of flows that can be supported or can be active in the VLAN.

Let us consider, for example, two applications – an e-mail client and an FTP session – creating two unique traffic flows through a switch/router port. If a microflow policer is applied to rate limit each one of these flows to 2 Mb/s, then the e-mail flow would be policed to 2 Mb/s and the FTP flow would also be policed to 2 Mb/s. The result of the microflow policing actions produces a total of not more than 4 Mb/s of traffic at the switch/router port.

However, an aggregate policer applied to the same scenario to rate limit traffic to 2 Mb/s would rate limit the combined traffic rate from the FTP and e-mail flows to 2 Mb/s. The PFC on Supervisor Engine 720 supports a total of 1023 aggregate policers and 63 microflow policers.

Another important difference between the microflow and aggregate policers is the location in the Catalyst 6500 where the policer can be applied. On Supervisor Engines 720 and 32 with a PFC3x present, the microflow policer can only be applied at an ingress port, while the aggregate policer can be applied at an ingress or egress port of the switch/router. Furthermore, in the Catalyst 6500, a microflow policer can support only a single instance of the token bucket algorithm, whereas an aggregate policer can support either a single token bucket or a dual token bucket algorithm.
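Microflow policing can be sketched as one token bucket per flow key rather than one bucket for the whole port or VLAN. The sketch below keeps a small token bucket per flow; the field names, the full flow mask, and the 2 Mb/s rate are illustrative assumptions.

    import time

    class PerFlowPolicer:
        """One token bucket per flow key; 'rate' in bytes per second, 'burst' in bytes."""
        def __init__(self, rate, burst, key_fields=("src", "dst", "proto", "sport", "dport")):
            self.rate, self.burst = rate, burst
            self.key_fields = key_fields          # the "flow mask"
            self.buckets = {}                      # flow key -> (tokens, last_update)

        def conform(self, pkt, length):
            key = tuple(pkt[f] for f in self.key_fields)
            now = time.monotonic()
            tokens, last = self.buckets.get(key, (self.burst, now))
            tokens = min(self.burst, tokens + (now - last) * self.rate)
            ok = length <= tokens
            self.buckets[key] = (tokens - length if ok else tokens, now)
            return ok                              # False -> drop (or mark down)

    mf = PerFlowPolicer(rate=2_000_000 / 8, burst=16_000)    # 2 Mb/s per microflow
    ftp  = {"src": "10.0.0.1", "dst": "10.0.0.9", "proto": "tcp", "sport": 40000, "dport": 21}
    mail = {"src": "10.0.0.1", "dst": "10.0.0.9", "proto": "tcp", "sport": 40001, "dport": 25}
    print(mf.conform(ftp, 1500), mf.conform(mail, 1500))     # each flow has its own bucket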

12.6.3 User-Based Rate Limiting

User-based rate limiting (UBRL) [CISCUBRL06] was first introduced in Supervisor Engine 720 with PFC3 as an enhancement to microflow policing to allow for “macroflow” level policing in a switch/router. UBRL provides a network manager with a configuration mechanism to view and monitor flows bigger than a microflow (which is typically defined based on source and destination IP addresses, transport protocol type (TCP or UDP), and source and destination transport protocol port numbers).

UBRL allows a “macroflow” policer to be applied at a switch/router port to specifically rate limit all traffic to or from a specific IP address (a “macroflow”). In the microflow policing example mentioned earlier, the FTP and e-mail applications create two discrete flows (i.e., microflows) passing through the switch/router port. In this example, each flow is rate limited to the specified 2 Mb/s rate. UBRL slightly enhances the capability of microflow policing in the PFC3 to allow for policing a flow that comprises all traffic originating from a unique source IP address, or destined to a unique destination IP address.

UBRL is implemented as a policer using an ACL entry that has a source IP address only flow mask or a destination IP address only flow mask. With this enhancement, a microflow policer can be applied to a switch/router port to limit the traffic rate originating from or going to a particular IP address or virtual IP address. UBRL allows a network manager to configure policing and ACL filtering rules that enforce policing on a per-user basis.

The UBRL enhancement is beyond what simple microflow policing can do in the Catalyst 6500. With UBRL and a specified rate limit of 2 Mb/s, if each user (i.e., IP address) in a network initiates multiple sessions (e.g., e-mail, FTP, Telnet, HTTP), all data from that user would be policed to 2 Mb/s. Microflow policing, on the other hand, will rate limit each session to 2 Mb/s, resulting in a total of N × 2 Mb/s maximum rate from the single user, if N sessions are created.
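The difference between microflow policing and UBRL is essentially the flow key used to select the policer. The short sketch below contrasts a full flow mask with a source-only flow mask; the field names are assumptions.

    # With a full flow mask the FTP and e-mail sessions are two flows (two policers);
    # with a source-only flow mask they collapse into one "macroflow" (one policer).
    ftp  = {"src": "10.0.0.1", "dst": "10.0.0.9", "proto": "tcp", "sport": 40000, "dport": 21}
    mail = {"src": "10.0.0.1", "dst": "10.0.0.9", "proto": "tcp", "sport": 40001, "dport": 25}

    def flow_key(pkt, fields):
        return tuple(pkt[f] for f in fields)

    full_mask = ("src", "dst", "proto", "sport", "dport")   # microflow policing
    ubrl_mask = ("src",)                                     # UBRL, source-only flow mask

    print(flow_key(ftp, full_mask) == flow_key(mail, full_mask))   # False: two policers
    print(flow_key(ftp, ubrl_mask) == flow_key(mail, ubrl_mask))   # True: one shared policer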

12.7 Weighted Tail-Drop: Congestion Avoidance with Tail-Drop and Multiple Thresholds

As a queue in a network device begins to fill with data when the traffic load increases or during network congestion, congestion avoidance mechanisms can be used to control data loss and queue overflows. Queue thresholds can be used to signal when to drop packets and what traffic to drop when the thresholds are crossed. Packets can be marked with priority values, and the priority values may indicate to a network device which packets to drop when queue thresholds are breached.

In a multiple threshold single queue system, the priority value in a packet may also identify the particular threshold at which the packet becomes eligible to be dropped. When that particular threshold is crossed, the queue will drop packets arriving with that priority value. In such a system, a particular priority value maps to a particular queue threshold value. The queue will continue to drop packets with that priority value as long as the amount of data in the single queue exceeds that threshold. Figure 12.6 illustrates how multiple thresholds can be used in a given queue to selectively discard packets with a particular priority value.


Figure 12.6 Single queue with tail-drop and multiple thresholds – weighted tail-drop.

The Catalyst 6000/6500 supports this enhanced version of the tail-drop congestion avoidance mechanism (“weighted tail-drop”). This mechanism drops packets marked with a certain CoS priority value when a certain percentage of the maximum queue size is exceeded [CISCQoSOS07]. Weighted tail-drop allows a network manager to define a set of packet drop thresholds and assign a packet CoS priority value to each threshold.

In the following example, we consider a queue that supports four packet drop thresholds. Each drop threshold is defined as follows (Figure 12.6):

  • Threshold 1: This threshold is set at 50% of the maximum queue size. CoS priority values 0 and 1 are mapped to this drop threshold.
  • Threshold 2: This threshold is set at 60% of the maximum queue size. CoS priority values 2 and 3 are mapped to this drop threshold.
  • Threshold 3: This threshold is set at 80% of the maximum queue size. CoS priority values 4 and 5 are mapped to this drop threshold.
  • Threshold 4: This threshold is set at 100% of the maximum queue size. CoS priority values 6 and 7 are mapped to this drop threshold.

When weighted tail-drop with the above thresholds is implemented at a port, packets with a CoS priority value of 0 or 1 are dropped if the queue is 50% full. The queue will drop packets with a CoS priority value of 0, 1, 2, or 3 if it is 60% full. Packets with a CoS priority value of 6 or 7 are dropped when the queue is completely filled. However, as soon as the queue size drops below a particular drop threshold, the queue stops dropping packets with the associated CoS priority value(s).
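The drop decision for the four-threshold example above can be written as a short function. The thresholds and the CoS-to-threshold mapping are exactly those listed in the example; everything else is an illustrative sketch.

    # Weighted tail-drop: drop a packet if the queue fill has crossed the drop
    # threshold associated with the packet's CoS value (example of Figure 12.6).
    cos_to_drop_threshold = {0: 50, 1: 50, 2: 60, 3: 60, 4: 80, 5: 80, 6: 100, 7: 100}

    def should_drop(cos, queue_fill_percent):
        return queue_fill_percent >= cos_to_drop_threshold[cos]

    print(should_drop(0, 55))   # True:  CoS 0/1 are dropped once the queue is 50% full
    print(should_drop(4, 55))   # False: CoS 4/5 are dropped only at 80% or more
    print(should_drop(7, 99))   # False: CoS 6/7 are dropped only when the queue is full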

When the queue size reaches the maximum configured threshold, all arriving packets are dropped. The main disadvantage of using tail-drop for TCP data transfer is that it can result in a phenomenon generally referred to as “global TCP synchronization.” When packets of multiple TCP connections are dropped at the same time, the affected TCP connections enter TCP congestion avoidance and slow-start at the same time and reduce their transmitted traffic. The TCP sources then progressively grow their windows at the same time, causing another traffic peak that leads to further data losses.

This synchronized behavior of TCP transmissions causes oscillations in the data transfer load, which show up as repeated peak-load/low-load patterns in the network. Congestion control techniques such as RED and WRED are designed with the goal of minimizing global TCP synchronization, maintaining stable network queues, and maximizing network resource utilization.

12.8 Congestion Avoidance with WRED

The Catalyst 6500 supports WRED in hardware by implementing WRED in the port ASICs in the line cards. WRED provides a less aggressive packet discard function than tail-drop. WRED spreads packet drops in a queue randomly across all flows in the queue, so that a few flows are not penalized much more severely than the others.

The WRED mechanism works by dropping fewer packets (thereby affecting fewer flows) when it initially starts its drop process. The random and probabilistic drop operations in WRED lead to fewer TCP connections going into “global synchronization” state.

WRED employs multiple thresholds (typically two) when applied to a queue. The lower and upper thresholds are configured to ensure reasonable queue and link utilization and to avoid queue overflow. When the lower threshold is crossed, WRED starts to drop, randomly, packets marked with a particular priority value. WRED tries to spread the packet drops so as to minimize penalizing heavily a few select flows.

As the queue size continues to grow beyond the lower threshold and approaches the upper threshold, WRED starts to more aggressively discard arriving packets. Increasingly, more flows are subject to packet drops because WRED increases the probability of packet drops. The goal here is to signal rate adaptive sources such as those using TCP to reduce their rates to avoid further packet losses and better utilize network resources.

A network device can implement RED or WRED at any of its queues where congestion can occur to minimize packet losses, queue overflow, and the global TCP synchronization problem. The network manager sets a lower threshold and an upper threshold for each queue using RED or WRED, and packets in a queue are processed as summarized below:

  • When the queue occupancy is smaller than the lower threshold, no packets are dropped.
  • When the queue occupancy crosses the upper threshold, all arriving packets to the queue are dropped.
  • When the queue occupancy is between the lower threshold and the upper threshold, arriving packets are dropped randomly according to a computed drop probability. The larger the queue size, the higher the packet drop probability used by RED or WRED.
  • Typically, the packet drops are done up to a maximum configured packet drop probability.

If the instantaneous queue size is compared with the configured (lower or upper) queue thresholds to determine when to drop a packet, bursty traffic, potentially, can be unfairly penalized. To address this problem, RED and WRED (and other RED variants) use the average queue size to compare with the queue thresholds to determine when to drop packets. The average queue size is also used in the computation of the drop probabilities used by RED or WRED.

The average queue size captures and reflects the long-term dynamics of the queue size changes and is not sensitive to instantaneous queue size changes and bursty traffic arrivals. This allows the queue to accept instantaneous traffic bursts without unfairly penalizing them with higher packet losses. This also allows both bursty and nonbursty flows to compete fairly for the system resources without suffering unfair data losses.

RED operates without considering the priority settings in packets arriving at the queue. WRED, however, operates while taking into consideration the priority markings in packets. WRED implements differentiated packet drop policies for packets arriving at a queue with different priority markings based on IP Precedence, DSCP, or MPLS EXP values. WRED randomly drops packets marked with a certain CoS priority setting when the queue reaches a threshold.

With WRED, packets marked with a lower priority value are more likely to be dropped when the lower threshold is crossed. RED does not recognize the IP Precedence, DSCP, or MPLS EXP values in arriving packets. In WRED, if the same packet drop policy (priority unaware policy) is configured at a queue for all possible priority values, then WRED behaves the same as RED.
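The WRED behavior summarized above can be sketched as follows. The exponentially weighted moving average of the queue size, the per-priority lower/upper thresholds, and the maximum drop probability are the ingredients described in this section; all numeric values in the sketch are illustrative assumptions.

    import random

    class WredQueue:
        # profiles: priority -> (lower_threshold, upper_threshold, max_drop_probability)
        def __init__(self, profiles, weight=0.002):
            self.profiles = profiles
            self.weight = weight      # EWMA weight used for the average queue size
            self.avg = 0.0

        def update_average(self, current_size):
            self.avg = (1 - self.weight) * self.avg + self.weight * current_size

        def drop(self, priority, current_size):
            self.update_average(current_size)
            lo, hi, max_p = self.profiles[priority]
            if self.avg < lo:
                return False                          # below lower threshold: never drop
            if self.avg >= hi:
                return True                           # above upper threshold: always drop
            p = max_p * (self.avg - lo) / (hi - lo)   # drop probability ramps up linearly
            return random.random() < p

    # Lower-priority traffic starts dropping earlier than higher-priority traffic.
    wred = WredQueue({0: (20, 40, 0.10), 5: (35, 40, 0.05)})
    wred.avg = 30.0                                   # pretend the average queue size is 30
    print(wred.drop(0, 30))   # may be True (drop probability 0.05 at this average size)
    print(wred.drop(5, 30))   # False: the average is below priority 5's lower threshold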

Both RED and WRED help mitigate the global TCP synchronization problem by randomly dropping packets. This is because RED and WRED take advantage of the congestion avoidance mechanism that TCP uses to manage data transmission. RED and WRED avoid the typical queue congestion and data loss that occur on a network device when multiple TCP sessions back off and ramp up in lockstep (as a result of global TCP synchronization) while going through the same device port.

When RED or WRED is used on a device port and some TCP sessions reduce their transmission rates after packets are dropped at a queue, the other TCP sessions passing through the port can continue sending at high rates. Because some TCP sessions feeding the queue maintain high sending rates, system resources, including link bandwidth, are more efficiently utilized. Both RED and WRED also have a smoothing effect on the offered traffic load at the queue (from a data flow and data loss perspective), thereby leading to more stable queue sizes and fewer queue overflows.

12.9 Scheduling with WRR

The WRR algorithm can be used to schedule traffic out of multiple queues on each switch/router port where the configuration of the algorithm allows for weights to be assigned to each queue. The weights determine the amount or percentage of total bandwidth to be allocated to each queue. The queues are serviced in a “round-robin” fashion where each queue is serviced in turn, one after the other. The WRR algorithm transmits a set amount of data from a queue before moving to the next queue.

The simple round-robin algorithm will rotate through the queues transmitting an equal amount of data from each queue before moving to the next queue. The WRR, instead, transmits data from a queue with a bandwidth that is proportional to the weight that has been configured for the queue. The amount of bandwidth allocated to each queue when it is serviced by the WRR algorithm depends on the weight assigned to the queue.

The higher the weight assigned to a queue, the higher the bandwidth allocated. The queues with higher weights can send more traffic per scheduling cycle than the queues with lower weights. This allows a network manager to define specific priority queues and configure how much these queues will have access to the available bandwidth. In this setup, the WRR algorithm will transmit more data from specified priority queues than the other queues, thus providing a preferential treatment for the specified priority queues.
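A byte-based WRR scheduler can be sketched in a few lines. The per-weight byte allowance and the queue contents below are illustrative assumptions; a hardware scheduler typically works on packet or byte counts per scheduling cycle.

    from collections import deque

    def wrr_schedule(queues, weights, bytes_per_weight=1500, rounds=2):
        """queues: name -> deque of packet sizes (bytes); weights: name -> integer weight."""
        sent = []
        for _ in range(rounds):
            for name, q in queues.items():
                allowance = weights[name] * bytes_per_weight   # share proportional to weight
                while q and allowance >= q[0]:
                    allowance -= q[0]
                    sent.append((name, q.popleft()))
        return sent

    queues = {"q1": deque([1500] * 6), "q2": deque([1500] * 6)}
    weights = {"q1": 3, "q2": 1}                # q1 gets roughly three times q2's bandwidth
    order = wrr_schedule(queues, weights)
    print([name for name, _ in order])          # ['q1', 'q1', 'q1', 'q2', 'q1', 'q1', 'q1', 'q2']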

12.10 Scheduling with Deficit Weighted Round-Robin (DWRR)

The deficit round-robin (DRR) scheduling algorithm [SHREEVARG96] was developed as a more effective scheduling algorithm than the simple round-robin algorithm and is used on many switches and routers. DRR is derived from the simpler round-robin scheduling algorithm, which does not service queues fairly when presented with variable packet sizes. With DRR, packets are classified into different queues and a fixed scheduling quantum is associated with each queue. The quantum associated with a queue is the number of bytes a queue can transmit in each scheduling cycle.

The main idea behind DRR is to keep track of which queues were not served in a scheduling cycle (i.e., compute a deficit for each queue) and to compensate for this deficit in the next scheduling round. A deficit counter is used to maintain the credit available to each queue as the scheduling of queues progresses in a round-robin fashion. The deficit counter is updated each time a queue is visited and is used to credit the queue the next time it is revisited and has data to transmit.

The deficit counter for each queue is initialized to a quantum value (which is an initial credit available to each queue). Each time a queue is visited, it is allowed to send a given amount of bytes (quantum) in that round of the round-robin. If the packet size at the head of a queue to be serviced is larger than the size of the quantum, then that queue will not be serviced.

The value of the quantum associated with the queue that was not serviced is added to the deficit counter and will be available as a credit in the next scheduling cycle. To avoid spending valuable processing time examining empty queues (bandwidth wasted), the DRR maintains an auxiliary list called the Active List, which is a list that holds the queues that have at least one packet waiting to be transmitted. Whenever a packet arrives in a queue that was once empty, the index of that queue is added to the Active List.

Packets in a queue visited in a scheduling cycle are served as long as the deficit counter (i.e., the available credit) is greater than zero. Each packet served reduces the deficit counter by an amount equal to the packet's length in bytes. A queue, even if it has packets queued, cannot be served after the deficit counter decreases to zero or below. In each new round, the deficit counter of a nonempty queue is increased by its quantum value before its packets are served.

In general, the DRR quantum size for a queue is selected to be not smaller than the maximum transmission unit (MTU) of the switch or router interface. This ensures that the DRR scheduler always serves at least one packet (up to the MTU of the outgoing interface) from each nonempty queue in the system. The MTU for a standard Ethernet interface is 1500 bytes, but an interface that supports jumbo Ethernet frames can have an MTU of 9000 bytes.
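The deficit counter mechanism is easiest to see in code. The following is a minimal Python sketch of the DRR idea from [SHREEVARG96], with an active list and the quantum set to the interface MTU as suggested above; it is an illustration of the algorithm, not of any particular hardware implementation.

    from collections import deque

    def drr_schedule(queues, quantum=1500, rounds=3):
        """queues: name -> deque of packet sizes (bytes). Returns the transmit order."""
        deficit = {name: 0 for name in queues}
        active = deque(name for name, q in queues.items() if q)    # the Active List
        sent = []
        for _ in range(rounds):
            for _ in range(len(active)):
                name = active.popleft()
                q = queues[name]
                deficit[name] += quantum                  # add this round's credit
                while q and q[0] <= deficit[name]:        # serve while credit remains
                    deficit[name] -= q[0]
                    sent.append((name, q.popleft()))
                if q:
                    active.append(name)                   # still backlogged: revisit later
                else:
                    deficit[name] = 0                     # an emptied queue loses its credit
        return sent

    queues = {"small": deque([300] * 10), "large": deque([1500] * 4)}
    print(drr_schedule(queues))    # each backlogged queue gets roughly 'quantum' bytes per round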

12.10.1 Deficit Weighted Round-Robin

Deficit weighted round-robin is a scheduling mechanism used on egress (transmit) queues in the Cisco Catalyst 6500. Each DWRR queue is assigned a relative weight similar to the WRR algorithm. The DWRR scheduling algorithm is an enhanced version of the WRR algorithm. The weights allow the DWRR algorithm to assign bandwidth relative to the weight given to each queue when the interface is congested.

The DWRR algorithm services packets from each queue in a round-robin fashion (if there is data in the queue to be sent) but with bandwidth proportional to the assigned weight, while also accounting for deficits (or credits) accrued during the scheduling cycle. DWRR keeps track of the excess data transmitted when a queue exceeds its byte allocation and reduces the queue's byte allocation in the next scheduling cycles. This way, the actual amount of data transmitted by a queue matches the amount defined for it by the assigned weight more closely than with simple WRR.

Each time a queue is serviced, a fixed amount of data is transmitted (proportional to its assigned weight) and DWRR then moves to the next queue. When a queue is serviced, DWRR keeps track of the amount of data (bytes) that was transmitted in excess of the allowed amount. In the next scheduling cycle, when the queue is serviced again, less data will be removed to compensate for the excess data that was served in the previous cycle. As a result, the average amount of data transmitted (bandwidth) per queue will be close to the configured weighted bandwidth.

12.10.2 Modified Deficit Round-Robin (MDRR)

The Catalyst 6500 Series also supports a special form of the DRR scheduling algorithm called modified deficit round-robin, which provides relative bandwidth allocation to a number of regular queues, as well as a guaranteed low-latency (high-priority) queue. MDRR supports a high-priority queue plus regular (unprioritized) queues, while DRR supports only the regular queues. In MDRR, the high-priority queue gets preferential service over the regular queues. The regular queues are served one after the other, in a round-robin fashion while recognizing the weight assigned to each queue.

When no packets are queued in the high-priority queue, MDRR services the regular queues in a round-robin fashion, visiting each queue once per scheduling cycle. When packets are queued in the high-priority queue, MDRR uses one of two options to service this high-priority queue when scheduling traffic from all the queues it handles (a sketch of the two modes follows this list):

  • Strict Priority Scheduling Mode: In this mode, MDRR serves the high-priority queue whenever it has data queued. A benefit of this mode is that any high-priority queued traffic always gets serviced regardless of the status of the regular queues. The disadvantage, however, is that this scheduling mode can lead to bandwidth starvation in the regular queues if there is always data queued in the high-priority queue. It can also cause the high-priority queue to consume an unfair and disproportionate amount of the available bandwidth compared to the regular queues, because this queue can be served many times in every scheduling cycle.
  • Alternate Scheduling Mode: In this mode, MDRR serves the high-priority queue in between serving each of the regular queues. This mode does not cause bandwidth starvation in the regular queues because each one of these queues gets served in a scheduling cycle. The disadvantage here is that the alternating serving operations between the high-priority queue and the regular queues can cause delay variations (jitter) and additional delay for the traffic in the high-priority queue, compared to MDRR in the strict priority mode.
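The two service modes can be contrasted with a small sketch of the serving order only (which queue is visited next), leaving out the deficit bookkeeping already shown in the DRR sketch above. The queue names and the fixed number of visits are illustrative assumptions.

    # Serving order of the high-priority (HP) queue versus the regular MDRR queues.
    def strict_priority_order(regular, hp_backlogged, visits=8):
        order, i = [], 0
        for _ in range(visits):
            if hp_backlogged():
                order.append("HP")                    # HP is served whenever it has data
            else:
                order.append(regular[i % len(regular)])
                i += 1
        return order

    def alternate_order(regular, visits=8):
        order = []
        for i in range(visits // 2):
            order.append("HP")                        # HP is served between regular queues
            order.append(regular[i % len(regular)])
        return order

    print(strict_priority_order(["Q1", "Q2", "Q3"], lambda: True))
    # ['HP', 'HP', ...]: regular queues starve if HP always has data queued
    print(alternate_order(["Q1", "Q2", "Q3"]))
    # ['HP', 'Q1', 'HP', 'Q2', 'HP', 'Q3', 'HP', 'Q1']: no starvation, but more HP jitter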

For the regular queues, MDRR transmits packets from a queue until the quantum for that queue has been satisfied. The quantum specifies the amount of data (bytes) allowed for a regular queue and is used in the MDRR just like the quantum in the DRR scheduler. MDRR performs the same process for every regular queue in a round-robin fashion. With this, each regular queue gets some percentage of the available bandwidth in a scheduling cycle.

MDRR treats any extra data (bytes) sent by a regular queue during a scheduling cycle as a deficit as in the DRR algorithm. If an extra amount of data were transmitted from a regular queue, then in the next scheduling round through the regular queues, the extra data (bytes) transmitted by MDRR are subtracted from the quantum.

In other words, if more than the quantum is removed from a regular queue in one cycle, then the quantum minus the excess bytes are transmitted from the affected queue in the next cycle. As a result, the average bandwidth allocation over many scheduling cycles through the regular queues matches the predefined bandwidth allocation to the regular queues.

Each MDRR regular queue can be assigned a weight that determines the relative bandwidth each queue receives. The high-priority queue is not given a weight since it is serviced preferentially as described earlier. The weights assigned to the regular queues play an even more important role when the interface on which the queues are supported is congested. The MDRR scheduler services each regular queue in a round-robin fashion if there are data in a queue to be transmitted.

In addition, the Catalyst 6500 Series supports WRED as a drop policy within the MDRR regular queues. This congestion avoidance mechanism provides more effective congestion control in the regular queues and is an alternative to the default tail-drop mechanism. With WRED, congestion can be avoided in the regular MDRR queues by the controlled but random packet drops WRED provides.

12.11 Scheduling with Shaped Round-Robin (SRR)

SRR is another scheduling algorithm supported in the Cisco Catalyst 6500 Series of switch/routers. SRR was first implemented on the uplink ports of the Cisco Catalyst 6500 Series with Supervisor Engine 32 [CISCQOS05]. Unlike WRR, which operates without traffic shaping capabilities, SRR supports round-robin scheduling plus a mechanism to shape the outgoing traffic from a queue to a specified rate. The operation of SRR has some similarities to a traffic policer, except that data in excess of the specified rate arriving at the SRR scheduler are buffered rather than dropped as in a traffic policer.

The shaper in the SRR scheduler is implemented at the output of each queue and works by smoothing transient traffic bursts passing through the port on which the SRR scheduler is implemented. In SRR, a weight is assigned to each queue, which is used to determine what percentage of output bandwidth the queue should receive. The traffic transmitted from each queue by the SRR scheduler is then shaped to that allocated percentage of the output bandwidth. SRR limits outbound traffic on a queue to the specific amount of bandwidth that its weight allows.

12.12 Scheduling with Strict Priority Queuing

The Cisco Catalyst 6500 also supports strict priority queuing on a per port basis on select line cards. Strict priority queuing is used to service delay-sensitive traffic (like streaming video and voice) that gets queued on the switch/router's line card. In a multiple queue system with a strict priority queue and WRR low-priority queues implemented, when a packet is queued in the strict priority queue, the WRR ceases scheduling of packets from the low-priority queues and transmits the packet(s) in the strict priority queue. The packets from WRR low-priority queues will be served (in a WRR fashion) only when the strict priority queue is empty.

12.13 Netflow and Flow Entries

NetFlow is a collection of functions used for monitoring and gathering information about traffic that passes through a network device. A switch/router can implement NetFlow as part of an architecture that supports a microflow policer and UBRL to have a better view of the flows these mechanisms are applied to. NetFlow can store information about flows passing through the Catalyst 6500 in memory located on the PFC3 on the Supervisor Engine 720.

12.13.1 NetFlow Entry and Flow Mask

A flow mask is used to define what constitutes a flow and what the NetFlow table stores in a flow entry. The flow mask defines the fields in the arriving packets that identify a flow. It can also be used to define what constitutes a flow in a microflow policer and UBRL.

For example, when used in NetFlow or in the context of UBRL, the following three forms of flow masks can be used [CISCUBRL06]:

  • Source-Only Flow Mask: The source-only IP address flow mask identifies packets with a particular source IP address in arriving packets as constituting a distinct flow. When a user (i.e., IP address) initiates a Telnet, HTTP, and e-mail session passing through an interface being monitored, traffic from these three separate sessions would be seen as a single flow. This is because the three sessions, although separate, share a common source IP address. Only the source IP address is used as the flow mask to identify unique flows at the monitoring point.
  • Destination-Only Flow Mask: This flow mask identifies packets with a particular destination IP address as a unique flow. The destination-only IP address flow mask can be used, for example, to identify outbound traffic from an interface to a server (e-mail server, FTP server, Web server, call manager, etc.). This flow mask is used in many cases in conjunction with the source-only IP flow mask.
  • Full Flow Mask: The full flow mask uses a particular set of source and destination IP addresses, transport protocol type (TCP or UDP), and source and destination port numbers to identify a unique flow. A user who initiates a Telnet and e-mail session would be seen to initiate two separate flows. This is because the Telnet and e-mail sessions will each use distinct destination IP addresses and port numbers that allow them to be identified as distinct flows.

Other applications that use flow entries in a NetFlow table (and flow masks) include Network Address Translation (NAT), Port Address Translation (PAT), TCP intercept, NetFlow Data Export, Web Cache Communication Protocol (WCCP), content-based access control (CBAC), Server Load Balancing, and so on.

A full flow mask is the most specific mask among the three mask types described earlier and its use results in more flow entries being created in a NetFlow table. This can have implications on the processing required for table maintenance in the network device and on the memory requirements for the NetFlow table storage [CISCUBRL06].

The PFC1 and PFC2 (on Supervisor Engines 1 and 2 of the Catalyst 6000/6500) can only run a single flow mask at any one time. When a microflow policer (which operates with a full flow mask) is defined on a PFC1 or PFC2, it requires a full flow mask to be applied. However, as the PFC1 and PFC2 can use only a single flow mask at any one time, this means an active microflow policer will require other processes to use the same full flow mask. This limitation restricts the monitoring capabilities of the PFC1 or PFC2.

Supervisor Engine 720 with PFC3, however, incorporates a number of hardware enhancements over the older PFC1 and PFC2, one of which is the ability to store and activate more than one flow mask at any given time. The Supervisor Engine 720 supports a total of four flow masks in hardware [CISCUBRL06]. This capability is available in the PFC3a, PFC3B, and PFC3BXL.

Out of the four flow masks supported, one is reserved for multicast traffic, and another for internal system use. The remaining two flow masks are used for normal operations, such as the masks used in UBRL. The PFC3x also introduces a number of new flow masks as described in Table 12.2 [CISCUBRL06].

Table 12.2 Flow Masks Available on the PFC3x

  • Source Only: a less specific flow mask for identifying flows. The PFC maintains one flow entry for each source IP address identified. All packets from a given source IP address contribute to the information maintained for this entry.
  • Destination Only: also a less specific flow mask for identifying flows. The PFC maintains one flow entry for each destination IP address. All packets to a given destination IP address contribute to this entry.
  • Destination-Source: a more specific flow mask for identifying flows. The PFC maintains one flow entry for each source and destination IP address pair. All packets between the same source and destination IP addresses contribute to this entry.
  • Destination-Source Interface: a more specific flow mask for identifying flows. This flow mask adds the source VLAN SNMP ifIndex to the information in the destination-source flow mask.
  • Full: a full flow entry includes the source IP address, destination IP address, protocol, and protocol ports. The PFC creates and maintains a separate cache entry for each IP flow.
  • Full-Interface: the most specific flow mask for identifying flows. This flow mask adds the source VLAN SNMP ifIndex to the information in the full flow mask.

12.13.2 NetFlow Table

The information collected for the flow entries in the NetFlow table can be stored in memory in the PFC3x. To facilitate high-speed lookups and updates for flow entries, a TCAM, also located on the PFC, is used to store the NetFlow table. On the Supervisor Engine 720, three PFC3x options (PFC3a, PFC3B, and PFC3BXL) can be supported, each having a different storage capacity. The capacities of each PFC3x with respect to the number of flows that can be stored in the TCAM NetFlow table are described in Ref. [CISCUBRL06].

The PFC uses a hash algorithm to locate and store flow entries in the TCAM NetFlow table. The hash algorithm is used together with the flow mask that identifies the fields of interest in the arriving packets. The packet fields specified by the flow mask are used as input to the hash algorithm. The hash algorithm output points to a TCAM location, which contains a key. The key in turn provides the index into the NetFlow table, which contains the actual NetFlow entry. This process is illustrated in Figure 12.7.
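The flow mask and lookup described above can be sketched as a key-extraction step followed by a lookup in a hash-addressed table. In the Python sketch below a dictionary stands in for the TCAM-backed NetFlow table, the indirection through a stored key is not modeled, and the field names are assumptions.

    # Flow masks determine which packet fields form the NetFlow key.
    FLOW_MASKS = {
        "source-only":      lambda p: (p["src"],),
        "destination-only": lambda p: (p["dst"],),
        "full":             lambda p: (p["src"], p["dst"], p["proto"],
                                       p["sport"], p["dport"]),
    }

    def update_netflow(table, packet, length, mask="full"):
        key = FLOW_MASKS[mask](packet)            # fields selected by the flow mask
        entry = table.setdefault(key, {"packets": 0, "bytes": 0})
        entry["packets"] += 1                     # per-flow statistics for this entry
        entry["bytes"] += length
        return table

    table = {}
    packet = {"src": "10.0.0.1", "dst": "10.0.0.9", "proto": "tcp",
              "sport": 40000, "dport": 21}
    update_netflow(table, packet, 1500, mask="full")
    update_netflow(table, packet, 1500, mask="source-only")
    print(table)    # one full-flow entry and one source-only entry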


Figure 12.7 NetFlow hash operation.
