2
Understanding Shared-Bus and Shared-Memory Switch Fabrics

2.1 Introduction

One of the major components that define the performance and capabilities of a switch, switch/router, router, and almost all network devices is the switch fabric. The switch fabric in a network device, whether shared or distributed, strongly influences the following:

  1. The scalability of the device and the network
  2. The nonblocking characteristics of the network
  3. The throughput offered to end users
  4. The quality of service (QoS) offered to users

The type of buffering employed in the switch fabric and its location also play a major role in the aforementioned issues. A switch fabric, in the sense of a network device, refers to a structure that is used to interconnect multiple components or modules in a system to allow them to exchange/transfer information, sometimes simultaneously.

Packets are transferred across the switch fabric from input ports to output ports and are sometimes held in small temporary “queues” within the fabric when contention with other traffic prevents a packet from being delivered immediately to its destination. The switch fabric in a switch/router or router is responsible for transferring packets between the various functional modules (network interface cards, memory blocks, route/control processors, forwarding engines, etc.). In particular, it transports user packets transiting the device from the input modules to the appropriate output modules. Figure 2.1 illustrates the generic architecture of a switch fabric.

Figure 2.1 Generic switch fabric with input/output ports.

There exist many types of standard and user-defined switch fabric architectures, and the choice of architecture for a particular network device usually depends on where the device will be deployed in the network and the amount and type of traffic it will be required to carry. In practice, switch fabric implementations are often a combination of basic or standard well-known architectures. Switch fabrics can generally be implemented as

  • time-division switch fabrics
    • - shared media
    • - shared memory
  • space-division switch fabrics
    • - crossbar
    • - multistage constructions

Time-division switch fabrics in turn can be implemented as

  • shared media
    • - bus architectures
    • - ring architectures
  • shared memory

The switch fabric is one of the most critical components in a high-performance network device and plays an important role in defining the switching and forwarding characteristics of the system. Under heavy network traffic load, and depending on the design, the internal switch fabric paths/channels can easily become the bottleneck, thereby limiting the overall throughput of a switch/router or router operating at the access layer or the core (backbone) of a network.

The design of the switch fabric is often complicated by other requirements such as multicasting and broadcasting, scalability, fault tolerance, and preservation of service guarantees for end-user applications (e.g., data loss, latency, and latency variation requirements). To preserve end-user latency requirements, for instance, a switch fabric may use a combination of fabric speed-up and intelligent scheduling mechanisms to guarantee predictable delays to packets sent over the fabric.

Switch/router and router implementations generally employ variations or various combinations of the basic fabric architectures: shared bus, shared memory, distributed output buffered, and crossbar switch. Most of the multistage switch fabric architectures are combinations of these basic architectures.

Switch fabric design is a very well-studied area, especially in the context of asynchronous transfer mode (ATM) switches [AHMA89,TOBA90]. In this chapter, we discuss the most common switch fabrics used in switch/router and router architectures. There are many different methods and trade-offs involved in implementing a switch fabric and its associated queuing mechanisms, and each approach has very different implications for the overall design. This chapter is not intended to be a review of all possible approaches, but presents only examples of the most common methods that are used.

2.2 Switch Fabric Design Fundamentals

The primary function of the switch fabric is to transfer data between the various modules in the device. To perform this primary function, the other functions described in Figure 2.2 are also required. Switch fabric functions can be broadly separated into control path and data path functionality, as shown in Figure 2.2. The control path functions include data path scheduling (e.g., node interconnectivity, memory allocation), control parameter setting for the data path (e.g., class of service, time of service), and flow and congestion control (e.g., flow control signals, backpressure mechanisms, packet discard). The data path functions include input-to-output data transfer and buffering. Buffering is an essential element for the proper operation of any switch fabric and is needed to absorb traffic when there are mismatches between the input line rates and the output line service rates.

Figure 2.2 Functions and partitioning of functions in a switch fabric.

In an output buffered switch, packets traversing the switch are stored in output buffers at their destination output ports. The use of multiple separate queues at each output port isolates the packet flows assigned to the different port queues from one another and reduces packet loss due to contention at the output port when it is oversubscribed. Thus, when oversubscription occurs, packet loss is confined to the oversubscribed output queues.

By using separate queues and thereby reducing delays due to contention at the output ports, output buffered switches make it possible to control packet latency through the system, which is an important requirement for supporting QoS in a network device. The shared memory switch is one particular example of output buffered switches.

In an input buffered switch, packets are buffered at input ports as they arrive at the switch. Each input port's buffer has a path into the switch fabric that runs at least at line speed. The switch fabric may or may not implement a fabric speed-up. Access to the switch fabric may be controlled by a fabric arbiter that resolves contention for access to the fabric itself and also to output ports. This arbiter may be required to schedule packet transfers across the fabric.

When the switch fabric runs at line speed, the memories used for the input buffering only need to run at the maximum port speed. The memory bandwidth in this case is not proportional to the number of input ports, so it is possible to implement scalable switch fabrics that can support a large number of ports with low-cost, lower speed memories.

An important issue that can severely limit the throughput of input buffered switches is head-of-line (HOL) blocking. If a single first-in first-out (FIFO) queue is used at each input buffer of the input buffered switch, and all input ports are loaded at 100% utilization with uniformly distributed traffic, HOL blocking reduces the overall switch throughput to 58.6% of the maximum aggregate input rate [KAROLM87]. Studies have shown that HOL blocking can be eliminated by using per destination port buffering at each input port (called virtual output queues (VoQs)) and appropriate scheduling algorithms. Using specially designed input scheduling algorithms, input buffered switches with VoQs can eliminate HOL blocking entirely and achieve 100% throughput [MCKEOW96].
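
To make the VoQ idea concrete, the following minimal sketch (in Python, with hypothetical names; real fabrics use iterative matching algorithms such as iSLIP rather than this single greedy pass) keeps one FIFO per output at each input, so a blocked head-of-line packet no longer stalls traffic bound for other outputs:

```python
from collections import deque

class InputPort:
    """Per-input virtual output queues (VoQs): one FIFO per output port."""
    def __init__(self, num_outputs):
        self.voqs = [deque() for _ in range(num_outputs)]
        self.rr_pointer = 0  # round-robin start position across outputs

    def enqueue(self, packet, out_port):
        self.voqs[out_port].append(packet)

def schedule(inputs, num_outputs):
    """One fabric timeslot: match each free output to at most one input.

    A head-of-line blocked flow no longer stalls packets bound for other
    outputs, because each output has its own queue at every input.
    """
    output_busy = [False] * num_outputs
    transfers = []  # (input_index, out_port, packet)
    for i, port in enumerate(inputs):
        for k in range(num_outputs):
            out = (port.rr_pointer + k) % num_outputs
            if port.voqs[out] and not output_busy[out]:
                transfers.append((i, out, port.voqs[out].popleft()))
                output_busy[out] = True
                port.rr_pointer = (out + 1) % num_outputs
                break
    return transfers

ports = [InputPort(4) for _ in range(4)]
ports[0].enqueue("pkt-A", 2)
ports[1].enqueue("pkt-B", 2)   # contends with pkt-A for output 2
ports[1].enqueue("pkt-C", 3)   # ...but pkt-C can still reach output 3
print(schedule(ports, 4))      # [(0, 2, 'pkt-A'), (1, 3, 'pkt-C')]
```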

It is common practice in switch/router and router design to segment variable-length packets into small, fixed-size chunks or units (cells) for transport across the switch fabric and also before writing into memory. This simplifies buffering and scheduling and makes packet transfers across the device more predictable. However, the main disadvantage of a buffer memory that uses fixed-size units is that memory usage can be inefficient when a packet's length is not a multiple of the unit size (e.g., slightly larger than one unit).

The last cell of a packet may not be completely filled with data when the packet is segmented into equal-size cells. For example, if a 64-byte cell size is used, a packet of 65 bytes will require two cells (a first cell carrying 64 bytes of actual data and a second cell carrying 1 byte). This means 128 bytes of memory will be used to store the 65 bytes of actual data, resulting in about 50% memory use efficiency.
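
The arithmetic in this example generalizes directly; the short sketch below (cell size as a parameter) computes the cell count and resulting memory efficiency for any packet length:

```python
import math

def cells_needed(packet_bytes, cell_bytes=64):
    """Number of fixed-size cells required to hold one packet."""
    return math.ceil(packet_bytes / cell_bytes)

def memory_efficiency(packet_bytes, cell_bytes=64):
    """Fraction of the allocated cell memory holding actual packet data."""
    return packet_bytes / (cells_needed(packet_bytes, cell_bytes) * cell_bytes)

print(cells_needed(65))        # 2 cells for a 65-byte packet
print(memory_efficiency(65))   # 65/128 ~ 0.508, i.e., about 50%
print(memory_efficiency(128))  # 1.0: packet is an exact multiple of the cell size
```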

Another disadvantage of using fixed size units is that all cells of a packet in the memory must be appropriately linked so that the cells can be reassembled to form the entire packet before further processing and transmission. The additional storage required for the information linking the cells, and the bandwidth needed to access these data can be a challenge to implement at higher speeds.

We describe below some of the typical design approaches for switch/router and router switch fabrics. Depending on the technology used, a large-capacity switch fabric can either be realized with a single large switch fabric that handles the rated capacity or be implemented with smaller switch fabrics as building blocks. Using building blocks, a large-capacity switch can be realized by connecting a number of such blocks into a network of switch fabrics. Needless to say, endless variations of these designs can be imagined, but the examples presented here are the most common fabrics found in switches/routers and routers.

2.3 Types of Blocking in Switch Fabrics

The following are the main types of data blocking in switch fabrics:

  • Internal Blocking: Internal blocking occurs in the internal paths, channels, or links of a switch fabric.
  • Output Blocking: A switch that is internally nonblocking can be blocking at an output of a switch fabric due to conflicting requests to the port.
  • Head-of-Line Blocking: HOL blocking can occur at input ports that have strictly FIFO queuing, where buffered packets are served in FIFO order. Packets not forwarded due to output conflicts remain buffered, adding data transfer delay. A blocked packet at the front of a queue prevents the next packet in the queue from being delivered to a noncontending output, reducing the throughput of the switch.
Resolving Internal Blocking in Shared Bus and Shared Memory Architectures:

Internal nonblocking in these architectures can be achieved by using a high-capacity switch fabric with bandwidth equal to or greater than the aggregate capacity of the connected network interfaces.

Resolving Output Blocking in Shared Bus and Shared Memory Architectures:
  • Switch fabrics that do not support a scheduler for allocating/dedicating timeslots to packets (at the input interfaces) can have output port conflicts, which means output conflict resolution is needed on a slot-by-slot basis.
  • Output conflicts can be resolved by polling each input one at a time (e.g., round-robin scheduling, token circulation). However, this is not scalable when the system has a large number of inputs. Also, outputs without conflicts (just served) have an unfair advantage in receiving more data (getting a new transmission timeslot).
Resolving HOL Blocking in Shared Bus and Shared Memory Architectures:
  • The system can allow packets behind a HOL blocked packet to contend for outputs.
  • A practical solution is to implement at each input port multiple buffers, called VoQs, one for each output. In this case, if the packet at the head of one VoQ cannot be transmitted, a packet from another VoQ is transmitted instead.

2.4 Emerging Requirements for High-Performance Switch Fabrics

In the early days of networking, network devices were based on shared bus switch fabric architectures. The shared bus switch fabric served its purpose well for the requirements of switches, switch/routers, routers, and other devices at that time. However, based on the demands placed on the performance of networks today, a new set of requirements has emerged for switches, switch/routers, and routers.

  • High Throughput: Switch fabrics are required to sustain very high link utilization under bursty and heavy traffic load conditions. Also, with the advent of 1, 10, 40, and 100 Gb/s Ethernet, network devices now demand correspondingly higher switch fabric bandwidth.
  • Wire-Speed Performance: The switch fabrics are required to deliver true wire-speed performance on any one of their attached interfaces. For high-performance switch fabrics, the design constraints are, typically, chosen to ensure the fabric sustains wire-speed performance even under worst case network and traffic conditions. The switch fabric has to deliver full wire-speed performance even when subjected to the minimum expected packet size (without any typical packet size assumption). Also, the performance of the switch fabric has to be independent of input and output port configuration and assignments (no assumptions about the traffic locality on the switch fabric).
  • Scalability: Switch fabrics are required to support an architecture that scales up in capacity and number of ports. As the amount of traffic in the network increases, the switch fabric must be able to scale up accordingly. The ability to accommodate more slots in a single chassis contributes to overall network scalability.
  • Modularity: Switch/routers and routers are now required to have a modular architecture with flexibility to allow users to add or mix and match the number/type of line modules, as needed.
  • Quality of Service: Users now depend on networks to handle different traffic types with different QoS requirements. Thus, switch fabrics will be required to provide multiple priority queuing levels to support different traffic types.
  • Multicasting: More applications are emerging that utilize multicast transport. These applications include distribution of news, financial data, software, video, audio, and multiperson conferencing. Therefore, the percentage of multicast traffic traversing the switch fabric is increasing over time. The switch fabric is required to support efficient multicasting capabilities which, in some designs, might include hardware replication of packets.
  • High Availability: Multigigabit and terabit switch/routers and routers are being deployed in the core of enterprise networks and the Internet. Traffic from thousands of individual users passes through the switch fabric at any given time. Thus, the robustness and overall availability of the switch fabric become critically important design factors. The switch fabric must enable reliable and fault-tolerant solutions suitable for enterprise and carrier class applications.
  • Product Diversity: Vendors now support a family of products at various price/performance points. Vendors continuously seek to deliver switch/routers and routers with differing levels of functionality, performance, and price. To control expenses while migrating networks to the service-enabled Internet, it is important that service providers have an assortment of products supporting distributed architectures and high-speed interfaces. This breadth of choice gives service providers the flexibility to install the equipment with the mix of network connection types, port capacity and density, footprint, and corresponding cost that best matches the needs of each service provider site. The switch fabric plays an important role here.
  • Low Power Consumption and Smaller Rack Space: In addition to the challenge of designing a scalable system with guaranteed QoS, designers must build switch fabrics to consume minimal power while squeezing them into smaller and smaller amounts of rack space.
  • Hot Swap: The ability to hot swap, that is, to replace or add line cards without interrupting system operations, is particularly important for high-end switch/routers and routers. This capability is obviously an important contributor to the overall uptime and availability of the system.

Like most networking components, switch fabric designs involve trade-offs between performance, complexity, and cost. Today's most common switch designs vary greatly in their ability to handle multiple gigabit-level links.

2.5 Shared Bus Fabric

The simplest shared bus switch fabric comprises a single signal channel (medium) over which all traffic between the system modules is transported. A shared bus is limited by its capacity, its physical length, and the overhead required to arbitrate access to it. The key design constraints here are the bus width (number of parallel bits placed on the bus) and the bus speed (i.e., the rate at which the bus is clocked, in MHz). The difficulty is designing a shared bus and arbitration mechanism that is fast enough to support a large number of multigigabit speed ports with nonblocking performance. Figures 2.3 and 2.4 illustrate high-level architectures of a bus-based switch fabric.

Figure 2.3 Shared bus fabric–single bus system.

Figure 2.4 Shared bus fabric–multiple bus system.

When multiple devices (e.g., network interface cards) simultaneously compete for access to and control of the shared bus, arbitration is the process that determines which of the devices gains access and control. Each device may be assigned a priority level for bus access, known as an arbitration level, which can be used to decide which device gains access to and control of the bus during contention. The switch fabric may have a fairness mechanism, which ensures that each device gets a turn to access and control the bus, even if it has a low arbitration level.

The fairness mechanism ensures that none of the devices is locked out of the shared bus and that each device can gain access to the bus within a given period of time. The central arbitration control point or bus controller (shown in Figures 2.3 and 2.4) is the point in the system to which contending devices send their arbitration signals. A simple bus implementation would use a time-division multiplexed (TDM) scheme for bus arbitration, where each device is given equal access to the bus in a round-robin fashion. Because of its simplicity, the shared bus switch fabric was the most common fabric used in early routers and is still used in current low-end routers. The shared bus architecture presents the simplest and most cost-effective solution for low-speed switching and routing platforms.
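
As an illustration of such a round-robin arbitration scheme, the sketch below (a hypothetical software model, not any particular product's arbiter) grants the bus to at most one requesting device per cycle and resumes the search just past the last grantee, so no device is starved:

```python
class RoundRobinArbiter:
    """Grant the shared bus to one requester per cycle, round-robin.

    Starting the search one position past the last grant guarantees
    fairness: every requester is served within num_devices cycles.
    """
    def __init__(self, num_devices):
        self.num_devices = num_devices
        self.last_grant = -1

    def arbitrate(self, requests):
        """requests: list of booleans, one per device, True if requesting."""
        for k in range(1, self.num_devices + 1):
            candidate = (self.last_grant + k) % self.num_devices
            if requests[candidate]:
                self.last_grant = candidate
                return candidate
        return None  # no requests: bus idle this cycle

arbiter = RoundRobinArbiter(4)
print(arbiter.arbitrate([True, False, True, False]))  # grants device 0
print(arbiter.arbitrate([True, False, True, False]))  # grants device 2 next
```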

A big disadvantage of the shared bus switch fabric is that a low-speed port cannot drive its traffic onto a very high-speed bus at the bus rate. This typically requires intermediate buffering (rate adaptation) at the low-speed port, which further increases both the complexity and the cost of the system. In addition, issues with the hot swappability of network interface cards and fair access to bandwidth (when ports have very different speeds and traffic loads) add further complications to the design.

The typical shared bus often can be defined by the following features:

  • Bus Width: Given that the shared bus is the signal channel/pathway over which the information from the system modules is carried, the wider the shared bus (number of bit lanes), the greater the amount of information carried over the bus. The widths of the control and address buses are generally specified independently of the data bus width. The address bus width defines the number of different memory locations the bus can transfer data to.
  • Bus Speed: The speed of the shared bus defines how many bits of data can be transferred across each lane/line of the bus per second. A simple bus may transfer 1 bit of data per line in a single clock cycle. Some buses may transfer 2 bits of data per line per clock cycle, doubling performance.
  • Bus Bandwidth: The shared bus bandwidth is the total amount of data that can be transferred across the bus in a given interval of time. If the bus width (number of lanes) and the bus speed are known, then the bus bandwidth is the product of the bus width (in bits) and the bus speed (in transfers per second); this defines the amount of data the bus can transfer in a second (see the sketch after this list).
  • Data and Control Buses: The typical shared bus consists of at least two distinct parts: the data bus and the control bus. The data bus consists of the signal channels/lines that actually carry the data being transferred from one module to another. The control bus consists of the signal channels/lines that carry information on how the bus functions (i.e., the control information), and how users of the bus are signaled when data are available on the data bus. An address bus may be included, which is the set of lines that carry information about where in memory the data to be transferred is stored.
  • Burst Mode: Some shared buses can transfer data in a burst mode, where multiple sets of data can be transmitted back-to-back (sequentially in a row).
  • Bus Interfacing: In a system that has multiple different buses, a bus interfacing circuit called a “bridge” can be used to interconnect the buses and allow devices on the different buses to communicate with each other.
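
To make the bus bandwidth calculation above concrete, here is a minimal sketch (the widths and clock rates are purely illustrative):

```python
def bus_bandwidth_gbps(width_bits, clock_mhz, transfers_per_cycle=1):
    """Peak bus bandwidth = width x clock rate x transfers per cycle."""
    return width_bits * clock_mhz * 1e6 * transfers_per_cycle / 1e9

# A 64-bit bus clocked at 100 MHz, one transfer per cycle:
print(bus_bandwidth_gbps(64, 100))      # 6.4 Gb/s
# The same bus transferring 2 bits per line per cycle (double data rate):
print(bus_bandwidth_gbps(64, 100, 2))   # 12.8 Gb/s
```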

The characteristics of the bus-based architecture are summarized as follows [RAATIKP04]:

  • Switching over the bus is done in the time domain, but implementations using both time and space switching are also possible (through the use of multiple buses).
  • Bus fabrics are easy to implement and normally have low cost.
  • Multicasting and broadcasting of traffic are easy to implement on the bus architecture.
  • On the bus fabric, only one transmission (timeslot) can be carried/propagated on the bus at any given time, which can result in limited throughput, limited scalability, and a low number of network interfaces.
  • Achieving internal nonblocking in bus architectures and implementations requires a high-capacity bus with bandwidth equal to or greater than the aggregate capacity of the connected network interfaces.
  • Multiple buses can be used to increase the throughput and improve the reliability of bus-based architectures.

2.6 Hierarchical Bus-Based Architecture

Figure 2.5 shows a high-level view of a hierarchical bus architecture. In this architecture, only packets traveling between local buses cross the backplane bus. In a hierarchical bus architecture, the main backplane bus is typically configured with usable bandwidth less than the aggregate bandwidth of all the ports in the system. In such a configuration, the hierarchical bus-based switch operates well only when most of the traffic traversing the switch can be locally switched, meaning that traffic crossing the backplane bus is limited.
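
A rough way to see when this design works is to assume a fixed fraction of traffic stays local to each bus; the simple model below (illustrative, not from the source) shows how quickly the backplane requirement grows as traffic locality drops:

```python
def required_backplane_gbps(aggregate_port_gbps, locality_fraction):
    """Only traffic that leaves its local bus must cross the backplane."""
    return aggregate_port_gbps * (1.0 - locality_fraction)

# 64 ports x 1 Gb/s = 64 Gb/s aggregate. With 90% of traffic switched
# locally, the backplane sees ~6.4 Gb/s; at 50% locality it needs 32 Gb/s.
print(required_backplane_gbps(64, 0.9))  # 6.4
print(required_backplane_gbps(64, 0.5))  # 32.0
```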

Figure 2.5 Hierarchical bus-based architecture.

A major limitation of the hierarchical bus-based architecture is that when the traffic transiting the switch is not localized (to a local bus), the backplane bus can become a bottleneck, thereby limiting the overall throughput of the system. Furthermore, performing port assignments in order to localize communication to the local buses would introduce unnecessary constraints on the network topology and also make network configuration and management very difficult.

2.7 Distributed Output Buffered Fabric

Figures 2.6 and 2.7 show high-level architectures of the distributed output buffered switch. The switch fabric has separate and independent channels (buses) that interconnect each (input, output) port pair, resulting in N² paths in total. In this architecture, packets that arrive on an input are broadcast on separate buses (channels) that connect to each output port. Each output port has an address filter that allows it to determine which packets are destined to it.

Figure 2.6 Distributed output buffered switch fabric: separate buffers per input port.

Figure 2.7 Distributed output buffered switch fabric: one buffer pool for all input ports.

The packets that are destined to the output are filtered by the address filters into local output queues. This architecture provides many attractive switch fabric capabilities. Obviously, no conflicts exist among the N² independent paths interconnecting the inputs and outputs, and all packet queuing takes place at the output ports.

Another feature is that the fabric operates in a broadcast-and-select manner, allowing it to support the forwarding of multicast and broadcast traffic inherently. Given that no conflicts exist among the paths, the fabric is strictly nonblocking and full input port bandwidth is available for traffic to any output port. With independent address filters at each port, the fabric also allows for multiple simultaneous (parallel) multicast sessions to take place without loss of fabric utilization or efficiency.

In Figure 2.6, the address filters and buffers at each port are separate and independent and need only operate at the input port speed. All of these output port components operate at the same speed. The fabric does not require any speed-up, and scalability is limited only by the bus electronics (operating frequency, signal propagation delay, electrical loading, etc.). For these reasons, this switch fabric architecture has been implemented in some commercial networking products. However, the N² growth of address filters and buffers in the fabric limits the port count N that can be implemented in a practical design.

The distributed output buffered switch fabric shown in Figure 2.7 requires fewer buffers at each port; however, these output buffers must run at a speed greater than the aggregate input port speed to avoid blocking and packet loss. The output buffer memory bandwidth and type limit the rate at which the output buffer can be accessed by the port scheduler. This factor ultimately limits the bandwidth at the output port of the switch fabric.

2.8 Shared Memory Switch Fabric

Figure 2.8 shows a high-level architecture of a typical shared memory fabric. This switch fabric architecture provides a pool of memory buffers that is shared among all input and output ports in the system. Typically, the fabric receives incoming packets and converts the serial bit stream to a parallel stream (over parallel lines of fixed width) that is then written sequentially into a random-access memory (RAM).

Figure 2.8 Shared memory switch fabric.

An internal routing tag (header) is typically attached/prepended to the packet before it is written into the memory. The writes and reads to the memory are governed by a system controller, which determines where in the memory the packet data are written into and retrieved from. The controller also determines the order in which packets are read out of the memory to the ports. The outgoing packet data are read from their memory locations and demultiplexed to the appropriate outputs, where they are converted from a parallel to a serial stream of bits.

A shared memory switch fabric is an output buffered switch fabric, but where the output buffers all physically reside in a common shared buffer pool. The output buffered switch has attractive features because it can achieve 100% throughput under full traffic load [KAROLM87]. A key advantage of having a common shared buffer pool is that it allows the switch fabric to minimize the total amount of buffers it should support to achieve a specified packet loss rate.

The shared buffer pool allows the switch fabric to accommodate traffic with varying dynamics and to absorb large traffic bursts arriving at any port of the system. The key advantage is that a common shared buffer pool is able to take advantage of statistical sharing of the buffers as varying traffic arrives at the system. When an output port is subjected to high traffic, it can utilize more buffers until the common buffer pool is (partially or) completely filled.

Another advantage of a shared memory switch fabric is that it provides low data transfer latencies from input to output port by avoiding packet copying from port to port (only one write and one read are required). There is no need for copying packets from input buffers to output buffers as in other switch fabric architectures. Furthermore, the shared memory allows for the implementation of mechanisms that can be used to perform advanced queue and traffic management functions (priority queuing, priority discard, differentiated traffic scheduling, etc.).

Mechanisms can be implemented that can be used to sort incoming packets on the fly into multiple priority queues. The mechanism may include capabilities for advanced priority queuing and output scheduling. For output scheduling, the switch fabric may implement policies that determine which queues and packets get serviced at the output port. The shared memory architecture can also be implemented where shared buffers are allocated on a per port or per flow basis.

In addition, the switch fabric may implement dynamic buffer allocation policies and user-defined queue thresholds that can be used to manage buffer consumption among ports or flows and for priority discard of packets during traffic overload. For these reasons, the shared memory switch fabric has been very popular for the design of switches, switch/routers, and routers.

The main disadvantage of a shared memory architecture is that bandwidth scalability is limited by the memory access speed (bandwidth). The access speeds of memories have a physical limit, and this limit prevents the shared memory switch architecture from scaling to very high bandwidths and port speeds. Another factor is that the shared memory bandwidth has to be at least two times the aggregate system port speeds for all the ports to run at full line rate.

When the shared memory runs at this rate, all packets can be written into and read out of the memory, resulting in nonblocking operation at the input ports. Depending on how memory is implemented and allocated, the total shared memory bandwidth may actually need to be somewhat higher to accommodate the overhead that comes with storing variable-size packets in fixed basic units of buffering.

The shared memory switch fabric is generally suitable for a network device with a small number of high-speed ports or a large number of low-speed ports. It is challenging to design a shared memory switch fabric when the system has to carry a high number of multigigabit ports. At very high multigigabit speeds, it is very challenging to design the sophisticated controllers required to allocate memory to incoming packets, arbitrate access to the shared memory, and determine which packets will be transmitted next.

Furthermore, depending on the priority queuing, packet scheduling, and packet discard policies required in the system, the memory controller can be very complicated and expensive to implement to accommodate all the high-speed ports. The memory controller can be a potential bottleneck to system performance. The challenge is to implement the controller to be fast enough to manage the shared memory, read the packets on priority, and implement other service policies in the system.

The memory controller may be required to handle multiple priority queues in addition to packet reads for complex packet scheduling. The requirement of multicasting and broadcasting in the switch fabric further increases the complexity of the controller. To build high-performing switching and routing devices, in addition to a high-bandwidth shared memory fabric, a forwarding engine (ASIC or processor) has to be implemented for packet filtering, address lookup, and forwarding operations.

These additional requirements add to the cost and complexity of the shared memory switch fabric. For core networks and as network traffic grows, the wide and faster memories and controllers required for large shared memory fabrics are generally not cost-effective. This is because as the network bandwidth grows beyond the current limits of practical memory pool implementations, a shared memory switch in the core of the network is not scalable and has to be replaced with a bigger unit. Adding a redundant switching plane to a shared memory switch is complex and expensive.

In shared memory fabric implementations, fast memories are used to buffer arriving packets before they are transmitted. The shared memory may be organized in blocks/cells of 64 bytes, a size that matches the 64-byte minimum Ethernet frame and accommodates 53-byte ATM cells. This means an arriving packet bigger than the basic block size has to be segmented into unit-size blocks before storage in the shared memory.

The total shared memory bandwidth is equal to the per-line clock rate (in megahertz, i.e., megabits per second per line) times the number of lanes/lines into the memory (the memory width in bits). The size of the shared memory fabric is normally determined from the bandwidth of the input and output ports and the required QoS for the transiting traffic. Dynamic allocation can be used to improve shared memory buffer utilization and guarantee that data will not be blocked or dropped as the inputs contend for memory space.
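
Putting the two sizing rules together (memory bandwidth = width × per-line clock rate, and at least twice the aggregate port rate for nonblocking operation), a minimal feasibility check might look like the following sketch (numbers illustrative):

```python
def memory_bandwidth_gbps(width_bits, clock_mhz):
    """Shared memory bandwidth = width (bits) x per-line clock rate."""
    return width_bits * clock_mhz * 1e6 / 1e9

def is_nonblocking(width_bits, clock_mhz, aggregate_port_gbps, overhead=1.0):
    """Every packet is written once and read once, hence the factor of 2.

    'overhead' (>= 1) can model segmentation waste from fixed-size cells.
    """
    return memory_bandwidth_gbps(width_bits, clock_mhz) >= 2 * aggregate_port_gbps * overhead

# A 512-bit-wide memory at 200 MHz provides 102.4 Gb/s, enough for
# 48 x 1 Gb/s ports (2 x 48 = 96 Gb/s) with a little headroom:
print(memory_bandwidth_gbps(512, 200))  # 102.4
print(is_nonblocking(512, 200, 48))     # True
```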

The characteristics of the shared memory-based architectures are summarized as follows [RAATIKP04]:

  • The shared memory fabric supports switching in the time domain, but implementations using time and space switching are also possible.
  • Shared memory fabrics are generally easy to implement and have relatively low cost.
  • Every timeslot is carried twice through the shared memory (one write and one read timeslot), which can limit throughput, the number of network interfaces, and scalability.
  • Achieving internal nonblocking in shared memory architectures and implementations requires a high-capacity shared memory with bandwidth equal to or greater than the aggregate capacity of the connected network interfaces.
  • Multicasting and broadcasting of traffic are easy to implement on the shared memory architecture (using one write and multiple read operations from a single buffer location).

The shared memory switch fabric is also very effective in matching the speeds of different interfaces on a network device. However, the higher link speeds and the need to match very different speeds on input and output interfaces require the provisioning of a very big shared memory fabric and buffering.

As discussed earlier, the major problem in shared memory switches is the speed at which the memory can be accessed. One way to overcome this problem is to build memories with very wide buses that can load a large amount of data in a single memory cycle. However, shared memory techniques are usually not very effective in supporting very high network bandwidth requirements due to limitations of the access time of memory, that is, the precharge times, the effective burst size to amortize the storage overhead, and the width of the data bus.

2.8.1 Shared Memory Switch Fabric with Write and Read Controls

As illustrated in Figure 2.9, data transfers from the inputs to the shared memory are time-division multiplexed (TDM), allowing only one input port at a time to store (write) a slice of data (cell) into the shared memory. Figure 2.10 illustrates a generic memory architecture. As the memory write controller receives a cell, it decodes the destination output port information, which is used to write the cell into a memory location belonging to the output port and queue that are to receive the cell.

Figure 2.9 Shared memory switch fabric with write and read controls.

Figure 2.10 Generic memory architecture.

A free buffer address is taken from the free buffer address pool and used as the write address for the cell in the shared memory. In addition, the write controller links the write address to the tail of the output queue (managed by the memory read controller) belonging to the destination port. The read controller is signaled where the written cell is located in the shared memory.

Cells transmitted out of the shared memory are time-division demultiplexed, allowing only one output port at a time to have access to (i.e., read from) the shared memory. The read process to an output port usually involves arbitration, because there may be a number of cells contending for access to the output port. The memory read controller is responsible for determining which of the contending cells wins the arbitration and is transferred to the output port. Once a cell has been forwarded to its output port, the shared memory location it occupied is declared free and its address is returned to the free buffer address pool.
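
The bookkeeping described in this subsection reduces to a free address pool plus one linked queue of cell addresses per output port. The sketch below is a simplified software model of that write/read control (hypothetical names; a hardware implementation would link cells through pointer memories and arbitrate reads among contending outputs):

```python
from collections import deque

class SharedMemorySwitch:
    """Shared-memory cell switch: free-address pool + per-output queues."""

    def __init__(self, num_cells, num_ports):
        self.memory = [None] * num_cells
        self.free_addresses = deque(range(num_cells))  # free buffer address pool
        self.output_queues = [deque() for _ in range(num_ports)]

    def write_cell(self, cell, out_port):
        """Write path: take a free address, store the cell, link the address
        to the tail of the destination port's output queue."""
        if not self.free_addresses:
            return False  # shared buffer pool exhausted: cell is dropped
        addr = self.free_addresses.popleft()
        self.memory[addr] = cell
        self.output_queues[out_port].append(addr)
        return True

    def read_cell(self, out_port):
        """Read path: pop the head-of-queue address, then free it."""
        if not self.output_queues[out_port]:
            return None
        addr = self.output_queues[out_port].popleft()
        cell = self.memory[addr]
        self.memory[addr] = None
        self.free_addresses.append(addr)  # address returns to the free pool
        return cell

switch = SharedMemorySwitch(num_cells=1024, num_ports=4)
switch.write_cell("cell-1", out_port=3)
print(switch.read_cell(3))  # 'cell-1'
```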

2.8.2 Generic Shared-Memory-Based Switch/Router or Router

Figures 2.11 and 2.12 both describe the architecture of a shared-memory-based switch/router or router with distributed forwarding in the line cards. Each line card has a forwarding table, an autonomous processor that functions as the distributed forwarding engine, and a small local packet memory for temporary holding of incoming packets while they are processed. A copy of the central forwarding table maintained by the route processor is propagated to the line cards to allow for local forwarding of incoming packets.

Figure 2.11 Generic shared-memory-based switch/router or router.

Figure 2.12 Generic switch/router or router with inbound and outbound processing components.

After the lookup decision is completed in a line card, the incoming packet is stored in the shared memory in buffer queues corresponding to the destination line card(s). Typically, packets are segmented into smaller, fixed-size units (cells) by the line cards before storage in the shared memory. Various methods are used to signal to the destination line card(s) which buffer queues hold a processed packet to be retrieved. Packets stored in the shared memory can be destined to other line cards or to the route processor. A stored packet can either be unicast, that is, destined to only one line card, or multicast, that is, destined to multiple line cards.

A broadcast packet is destined to all line cards other than the incoming line card. Typically, the system includes mechanisms that allow for a multicast or broadcast packet to be copied multiple times from a single memory location by the destination line cards without the need to replicate the packet multiple times in other memory locations. The system also includes mechanisms that allow for priority queuing of packets at a destination port and also discarding packets when a destination port experiences overload.

The outbound processing at the destination card(s) includes IP TTL (time-to-live) update, IP checksum update, Layer 2 packet encapsulation and address rewrite, and Layer 2 checksum update. The processing could include rewriting/remarking information in the outgoing packet for QoS and security purposes.

The separate route processor is responsible for nonreal-time tasks such as running the routing protocols, sending and receiving routing protocol updates, constructing and maintaining the routing tables, and monitoring network interface status. Other tasks include monitoring system environmental status, handling system configuration and line card initialization, and providing network management functions (SNMP, console/Telnet/Secure Shell interface, etc.).

2.8.3 Example Shared Memory Architectures

In this section, we describe examples of shared memory switch fabrics based on Motorola's NetRAM, which is a dual-port SRAM with configurable input/output data ports [MANDYLA04]. The NetRAM was designed specifically for networking devices that require shared memories with optimal performance for write/read/write cycles. Two implementations of NetRAM shared memory switch fabric are the “one-way” shared memory switch fabric with separate input and output ports (Figure 2.13) and the “snoop” switch fabric with two dual-ports (Figure 2.14).

Figure depicts “One-way” switch fabric implementation using NetRAM.

Figure 2.13 “One-way” switch fabric implementation using NetRAM.

Figure depicts “Snoop” switch fabric using NetRAMs.

Figure 2.14 “Snoop” switch fabric using NetRAMs.

2.8.3.1 “One-Way” Switch Fabric Implementation Using NetRAM

Figure 2.13 shows the architecture of the “one-way” shared memory switch fabric using the NetRAM. The ingress ports of line cards (or input modules) connect to the data inputs (DQY) of the NetRAM, while the egress ports of line cards (or output modules) connect to the data outputs (DQX). The write address port (AY) and read address port (AX) connect to the external control logic ASIC.

The control logic ASIC supplies the free memory address at which the cell is to be written/stored until the destination output port is ready to read/retrieve it. The free memory address is received on port AY of the NetRAM while the cell is written into the input port DQY of the NetRAM immediately after the write enable signal (WY) is activated.

When a destination output port is ready to receive a data cell, the control logic ASIC retrieves the read address of the stored cell's memory location and sends it to port AX of the NetRAM. After the NetRAM memory location is read, the data are sent out of the output port DQX after two internal clock cycles, provided the output enable (GX) signal is activated. The internal clock of the NetRAM runs at two times the external clock. This allows board frequencies to be maintained at values equal to or less than 83 MHz while the NetRAM-based device delivers data transfer performance equivalent to a conventional 166 MHz memory.

The NetRAM operates as a pipeline with reads occurring before writes. With this, if a read and a write are directed at the same memory address on the same clock edge, the cell data previously written in that memory location will be read first, and the new data will then be written into that location after the read has finished. This allows the NetRAM to resolve potential memory address contention problems without the additional requirement of an external arbitration mechanism or special ASICs to ensure the data transferred through the NetRAM are written and read without corruption.

Two additional functions provided by the NetRAM are a pass-through function and a write-and-pass-through function. The pass-through function allows input data to bypass the shared memory and be transferred directly from the input to an output. In situations where data need to be transferred quickly from the input port to an output port, the pass-through function saves two system cycles compared to a RAM without this function.

When data are transferred without this function from the input to the output, the output has to wait for the data to be written through the DQY port of the NetRAM (address supplied through AY) and then read from the DQX port (address supplied through AX). The write-and-pass-through function can be used to transfer data from the input to an output while also allowing the data to be written into the shared memory in the usual way.

The NetRAM is designed to have some advantages over conventional RAMs with a single address port and a common I/O port. Reference [MANDYLA04] states the main advantage to be that reads and writes can be performed at different memory addresses in the same clock cycle. This dual address read/write capability allows the NetRAM in the one-way shared memory fabric to support a very high throughput of 2.98 Gb/s. In addition, implementations can be realized where several memories are banked together in parallel in the fabric. For example, if a fabric has 16 memory banks, the total available bandwidth would be 16 × 2.98 Gb/s (or 47.8 Gb/s).

If the data input and data output widths in an implementation are doubled, and reads can be performed simultaneously from both ports in the same clock cycle, then a bandwidth of approximately 6 Gb/s can be achieved. If an implementation has a 576-bit-wide memory block and 16 NetRAMs connected in parallel, the maximum bandwidth becomes approximately 96 Gb/s.
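
These bandwidth figures follow from simple multiplication. The sketch below reproduces them under the assumption, consistent with the quoted numbers though not stated explicitly here, that the 2.98 Gb/s per-device figure corresponds to an 18-bit data path clocked internally at 166 MHz:

```python
def netram_bandwidth_gbps(width_bits, clock_mhz=166, ports_active=1, banks=1):
    """Throughput = data width x internal clock, per active port, per bank.

    The 18-bit width and 166 MHz internal clock are assumptions inferred
    from the figures quoted in the text, not values stated in the source.
    """
    return width_bits * clock_mhz * 1e6 * ports_active * banks / 1e9

print(netram_bandwidth_gbps(18))                  # ~2.99 Gb/s per device
print(netram_bandwidth_gbps(18, banks=16))        # ~47.8 Gb/s with 16 banks
print(netram_bandwidth_gbps(18, ports_active=2))  # ~6 Gb/s, both ports reading
print(netram_bandwidth_gbps(576))                 # ~95.6 Gb/s, 576-bit-wide block
```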

2.8.3.2 “Snoop” Switch Fabric Using NetRAMs

Figure 2.14 shows a block diagram of the “snoop” shared memory switch fabric using the NetRAM. In this implementation, the NetRAM serves as a common shared memory while the system bus uses port DQX on the NetRAM for writing and reading cells. The ASIC that is connected to the dual port DQY allows the user to “snoop” into any memory address to read any data that need to be verified or modified, such as Ethernet destination MAC addresses and ATM VPI/VCI headers.

The user may want to include error checking bits or other user-defined data that the system requires as soon as the Ethernet or ATM data start to be written into the shared memory. This allows the system to make decisions at the beginning of the data input cycle, such as the next destination of the data or its service priority. Another advantage is the ability to write back data at any memory address. These features eliminate the need for a separate system path to screen and separate out critical information from the data flow, because the entire Ethernet frame or ATM cell is stored in the shared memory. This implementation also reduces the chip count and board space. The “snoop” dual-ported switch fabric can be implemented as a 64 K × 18 or a 32 K × 36 dual-ported device with pipelined reads.

2.8.3.3 Design Rationale of the NetRAM

For many years, burst SRAMs have been very suitable for the write-once, read-many functions of level-2 caches normally used in computing applications. To cut cost and also have access to available commercial memory devices, some designers have used burst SRAMs in the implementation of switch fabrics of network devices. Burst SRAMs generally have a common I/O port and one address port. As a result, the burst SRAM can perform either a read from an address location or a write to an address location in one clock cycle, but not both in the same cycle.

Also, when the common I/O on the burst SRAM has to be turned from a read to a write state, a deselect cycle has to be inserted into the timing, resulting in one wait state. The performance gets worse when turning the common I/O from a write to a read state, since two deselect cycles have to be inserted to ensure there are no conflicts on the burst SRAM's common I/O bus. These factors translate to a clock cycle utilization of between 50 and 70% for data input or output to the burst SRAM, depending on the read/write and write/read patterns used to access the data.

These factors make the burst SRAM inefficient for the write/read/write cycles of network devices. The limitations of burst SRAMs become more critical when they are used for the gigabit and terabit bandwidths seen in today's switches, switch/routers, and routers. The NetRAM can be implemented as a dual I/O device or a separate I/O device, which eliminates the overall system performance penalties seen in the burst SRAM with a common I/O, where deselect cycles need to be inserted.

The NetRAM can perform reads and writes in the same clock cycle, with separate memory addresses provided for each port, allowing a designer to implement higher speed network devices than would be possible with conventional burst SRAMs. Network devices perform write/read/write sequences most of the time, whereas conventional burst SRAMs are suited to the burst reads and writes of computing applications.

In network devices (switches, switch/routers, and routers), which frequently transition from reads to writes, burst SRAMs have relatively lower performance due to the dead bus cycles that occur between operations. Thus, the use of conventional burst SRAMs increases design complexity and reduces overall system performance. The dual port feature of the NetRAM enables the switch fabric in the network device to simultaneously perform reads and writes to different memory addresses in each clock cycle.

2.9 Shared Ring Fabric

The ring fabric connects each node in the system to its two immediately adjacent nodes, forming a single continuous path for data traffic (Figure 2.15). Data placed on the ring travel (unidirectionally) from one node to another, allowing each node along the path to retrieve and process the data. Every packet placed on the ring is visible to the nodes on the ring. Data flow in one direction on the ring, hop by hop, with each node receiving the data and then transmitting them to the next node in the ring.

Figure 2.15 Shared ring switch fabric.

The ring fabric uses controllers in the nodes that attach to the ring to manage the bus segments they connect to. This fabric does not require a central controller to manage how the individual nodes access the ring. Some ring architectures support dual counter-rotating rings to which the controllers connect. The ring fabric consists of a network of bus segments interconnected in a circular fashion.

The fabric is highly scalable because of its efficient use of internode pathways (bus segments). Adding a node to the system requires only two fabric interfaces to connect to the ring. In some ring architectures, due to the point-to-point interconnection of the nodes, it is relatively easy to install a new node since adding the node requires only inserting the node between just two existing connections. The point-to-point connectivity between nodes allows faults on the ring to be easily identified and isolated.

The ring fabric is highly scalable, although it is susceptible to single points of failure and traffic congestion. The ring is susceptible to single points of failure because it supports only one path between any two adjacent nodes. One malfunctioning node or link on the ring can disrupt communication on the entire fabric. Furthermore, adding and removing a node on the ring can disrupt communication on the fabric. Using dual (e.g., bidirectional) or more rings improves the reliability of the ring fabric. Generally, bidirectional ring-based architectures allow very fast reconfiguration around faults, and transmitted traffic does not require rerouting.

The bandwidth of the total ring fabric is limited by the bandwidth of the slowest link segment in the system. Furthermore, data transfer latency can be high if many hops are needed to communicate between two nodes in the system; data transfer delay is directly proportional to the number of nodes on the ring. This fabric architecture is advantageous where economy of system nodes (to which network interfaces are attached) is critical, but availability and throughput are less critical. Most practical systems use dual rings and identical link segments between nodes.
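
To illustrate the proportionality between transfer delay and node count, the small sketch below computes average hop counts under a simple model (one hop of latency per node traversed; illustrative only) and shows how a second, counter-rotating ring roughly halves the average path:

```python
def avg_hops_unidirectional(n):
    """Average hops between two distinct nodes on a one-way ring of n nodes."""
    return sum(range(1, n)) / (n - 1)   # works out to n / 2

def avg_hops_bidirectional(n):
    """With two counter-rotating rings, traffic takes the shorter direction."""
    return sum(min(k, n - k) for k in range(1, n)) / (n - 1)

print(avg_hops_unidirectional(16))  # 8.0 hops on average
print(avg_hops_bidirectional(16))   # ~4.27 hops: roughly halved
```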

The characteristics of the ring-based architectures are summarized as follows [RAATIKP04]:

  • Ring fabrics can generally be categorized into source and destination release rings:
    • - In source release rings, only one switching operation takes place on the ring at a time, which results in limited system throughput (similar to the shared bus).
    • - In destination release rings, multiple timeslots (messages) can be carried on the ring simultaneously, thus allowing spatial reuse of ring resources. This improves the throughput of the system.
  • The ring fabric allows switching in time domain, but implementations using time and space switching are also possible.
  • Service on the ring fabric is very orderly in that every node attached to the ring has an opportunity to access it (e.g., through a circulating token) and transmit data on it. The ring fabric has better performance than a shared bus fabric under heavy traffic load because of its orderly service capabilities.
  • Ring fabrics are generally easy to implement and have relatively low cost (similar to shared bus architectures).
  • Ring-based architectures have better scalability than shared bus ones.
  • Multicasting and broadcasting of traffic are easy to implement on the ring architecture.
  • Achieving internal nonblocking in ring architectures and implementations requires a high-capacity ring with bandwidth equal to or greater than the aggregate capacity of the connected network interfaces.
  • The capacity of the ring architectures can be improved by implementing parallel (multiple) rings. These rings are usually controlled in a distributed manner but Medium Access Control (MAC) implementation on the multiple rings can be difficult.
  • Multiple rings can be used to reduce internal blocking, increase the throughput, and improve the scalability and reliability of ring-based architectures.

2.10 Electronic Design Problems

The major problems encountered during the design of switch fabrics are summarized here [RAATIKP04]:

  • Signal Skew: This happens when a signal placed on a line arrives at different components on the line at different times. It occurs in the switch fabric and/or on circuit boards and is caused by long signal traces/lines with varying capacitive loads.
  • Varying Delay on Bus Lines: This happens when different/separate lines of a bus (with nonuniform capacitive loads) are routed through the switch fabric.
  • Crosstalk: This happens when a signal transmitted on one line in the switch fabric creates an undesired effect on another line (i.e., electromagnetic coupling of signals from adjacent signal lines).
  • Power Supply Feeds and Voltage Swings: When the power source/lines are incorrectly dimensioned, this can cause nonuniform voltage along a signal line. The lack of adequate filtering can also cause voltage fluctuation on the line.
  • Mismatching Timing Signals: Lines with different lengths from a single timing source can cause phase shift in the signal received. Also, these distributed timing signals may make it difficult to have adequate synchronization.
  • Mismatching Line Termination: This happens when terminating inputs with varying (high) bit rates on long signal lines in the fabric.

Other design challenges are as follows:

  • The speed of commercially available components may not necessarily meet the requirements of a particular design/platform (required line speeds, memory bandwidth, etc.).
  • The component packing density can lead to board/circuit space utilization issues. Other problems may include system power consumption, heating, and cooling problems.
  • Another challenge is balancing the maximum practical system fan-out versus the required size of the switch fabric.
  • The required bus length inside a switch fabric is constrained by the following:
    • - Long buses tend to require decreasing the internal speed (clock) of the switch fabric (to prevent excessive signal skew, etc.).
    • - Internal switch fabric diagnostics become difficult to carry out.

Specifically, the bus architecture itself places limits on the following:

  • The operating frequency that can be used.
  • Signal propagation delay that is tolerable.
  • Electrical loading on the signal lines.

Propagation delay limits the physical length of a bus (distance the signal can travel without significantly degrading), while electrical loading limits the number of devices that can be connected to the bus. Pragmatically speaking, these factors are dictated by physics and cannot be easily circumvented.
