One of the major components that define the performance and capabilities of a switch, switch/router, router, and almost any other network device is the switch fabric. The switch fabric (whether shared or distributed) in a network device strongly influences the following:
The type of buffering employed in the switch fabric and its location also play a major role in the aforementioned issues. A switch fabric, in the context of a network device, refers to a structure used to interconnect multiple components or modules in a system, allowing them to exchange/transfer information, sometimes simultaneously.
Packets are transferred across the switch fabric from input ports to output ports and are sometimes held in small temporary “queues” within the fabric when contention with other traffic prevents a packet from being delivered immediately to its destination. The switch fabric in a switch/router or router is responsible for transferring packets between the various functional modules (network interface cards, memory blocks, route/control processors, forwarding engines, etc.). In particular, it transports user packets transiting the device from the input modules to the appropriate output modules. Figure 2.1 illustrates the generic architecture of a switch fabric.
There exist many types of standard and user-defined switch fabric architectures, and deciding on what type of architecture to use for a particular network device usually depends on where the device will be deployed in the network and the amount and type of traffic it will be required to carry. In practice, switch fabric implementations are often a combination of basic or standard well-known architectures. Switch fabrics can generally be implemented as
Time-division switch fabrics in turn can be implemented as
The switch fabric is one of the most critical components in a high-performance network device and plays an important role in defining the switching and forwarding characteristics of the system. Under heavy network traffic load, and depending on the design, the internal switch fabric paths/channels can easily become the bottleneck, thereby limiting the overall throughput of a switch/router or router operating at the access layer or the core (backbone) of a network.
The design of the switch fabric is often complicated by other requirements such as multicasting and broadcasting, scalability, fault tolerance, and preservation of service guarantees for end-user applications (e.g., data loss, latency, and latency variation requirements). To preserve end-user latency requirements, for instance, a switch fabric may use a combination of fabric speed-up and intelligent scheduling mechanisms to guarantee predictable delays to packets sent over the fabric.
Switch/router and router implementations generally employ variations or various combinations of the basic fabric architectures: shared bus, shared memory, distributed output buffered, and crossbar switch. Most of the multistage switch fabric architectures are combinations of these basic architectures.
Switch fabric design is a very well-studied area, especially in the context of asynchronous transfer mode (ATM) switches [AHMA89,TOBA90]. In this chapter, we discuss the most common switch fabrics used in switch/router and router architectures. There are many different methods and trade-offs involved in implementing a switch fabric and its associated queuing mechanisms, and each approach has very different implications for the overall design. This chapter is not intended to be a review of all possible approaches, but presents only examples of the most common methods that are used.
The primary function of the shared switch fabric is to transfer data between the various modules in the device. To perform this primary function, the other functions described in Figure 2.2 are required. Switch fabric functions can be broadly separated into control path and data path functionality as shown in Figure 2.2. The control path functions include data path scheduling (e.g., node interconnectivity, memory allocation), control parameter setting for the data path (e.g., class of service, time of service), and flow and congestion control (e.g., flow control signals, backpressure mechanisms, packet discard). The data path functions include input to output data transfer and buffering. Buffering is an essential element for the proper operation of any switch fabric and is needed to absorb traffic when there are any mismatches between the input line rates and the output line service rates.
In an output buffered switch, packets traversing the switch are stored in output buffers at their destination output ports. The use of multiple separate queues at each output port isolates packet flows from each other and, when the port is oversubscribed, confines packet loss to the oversubscribed queues only.
By using separate queues and thereby reducing delays due to contention at the output ports, output buffered switches make it possible to control packet latency through the system, which is an important requirement for supporting QoS in a network device. The shared memory switch is one particular example of output buffered switches.
In an input buffered switch, packets are buffered at the input ports as they arrive at the switch. Each input port's buffering has a path into the switch fabric that runs at least at line speed. The switch fabric may or may not implement a fabric speed-up. Access to the switch fabric may be controlled by a fabric arbiter that resolves contention for access to the fabric itself and also to the output ports. This arbiter may be required to schedule packet transfers across the fabric.
When the switch fabric runs at line speed, the memories used for the input buffering only need to run at the maximum port speed. The memory bandwidth in this case is not proportional to the number of input ports, so it is possible to implement scalable switch fabrics that can support a large number of ports with low-cost, lower speed memories.
An important issue that can severely limit the throughput of input buffered switches is head-of-line (HOL) blocking. If simple FIFO (first-in, first-out) queuing is used at each input buffer of the input buffered switch, and all input ports are loaded at 100% utilization with uniformly distributed traffic, HOL blocking can reduce the overall switch throughput to about 58.6% of the maximum aggregate input rate [KAROLM87]. Studies have shown that HOL blocking can be eliminated by using per destination port buffering at each input port (called virtual output queues (VOQs)) and appropriate scheduling algorithms. Using specially designed input scheduling algorithms, input buffered switches with VOQs can eliminate HOL blocking entirely and achieve 100% throughput [MCKEOW96].
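The effect can be illustrated with a minimal, hypothetical one-timeslot model (the names and the greedy matching policy below are illustrative, not taken from any real switch): with FIFO queues each input can offer only its head cell, while VOQs let a scheduler reach cells queued behind a blocked head.

```python
# Sketch: head-of-line (HOL) blocking with FIFO input queues vs. VOQs.
from collections import deque

def fifo_grants(queues):
    """One timeslot: each input offers only its head cell; each output
    grants at most one input. Returns the list of (input, output) transfers."""
    granted_outputs, transfers = set(), []
    for inp, q in enumerate(queues):
        if q and q[0] not in granted_outputs:
            granted_outputs.add(q[0])
            transfers.append((inp, q[0]))
    return transfers

def voq_grants(voqs):
    """One timeslot with VOQs: each input may offer a cell from any of its
    per-output queues; each input sends and each output accepts at most one."""
    granted_outputs, busy_inputs, transfers = set(), set(), []
    for inp, per_output in enumerate(voqs):
        for out, q in per_output.items():
            if q and inp not in busy_inputs and out not in granted_outputs:
                busy_inputs.add(inp)
                granted_outputs.add(out)
                transfers.append((inp, out))
                break
    return transfers

# Inputs 0 and 1 both have head cells for output 'A'; input 1 also holds
# a cell for the idle output 'B' behind its head cell.
fifo = [deque(['A']), deque(['A', 'B'])]
print(fifo_grants(fifo))  # one transfer only: input 1's 'B' cell is HOL-blocked

voqs = [{'A': deque([1])}, {'A': deque([1]), 'B': deque([1])}]
print(voq_grants(voqs))   # two transfers: 'B' is reachable via its own VOQ
```

The single-timeslot outcome (one transfer with FIFO, two with VOQs) is the per-slot mechanism behind the aggregate 58.6% versus 100% throughput results cited above.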
It is common practice in switch/router and router design to segment variable-length packets into small, fixed-size chunks or units (cells) for transport across the switch fabric and also before writing into memory. This simplifies buffering and scheduling and makes packet transfers across the device more predictable. However, the main disadvantage of a buffer memory that uses fixed-size units is that memory usage can be inefficient when a packet's length is not a multiple of the unit size (e.g., slightly larger than one unit).
The last cell of a packet may not be completely filled with data when the packet is segmented into equal-size cells. For example, if a 64 byte cell size is used, a packet of 65 bytes will require two cells (the first cell carrying 64 bytes of actual data and the second only 1 byte). This means 128 bytes of memory will be used to store the 65 bytes of actual data, resulting in about 50% efficiency of memory use.
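The arithmetic above can be sketched as a small helper (a hypothetical function, shown only to illustrate the trade-off):

```python
import math

def cell_usage(packet_bytes, cell_bytes=64):
    """Memory consumed and efficiency when a packet is segmented into
    fixed-size cells, illustrating the fixed-unit buffering trade-off."""
    cells = math.ceil(packet_bytes / cell_bytes)  # last cell may be partly empty
    used = cells * cell_bytes
    return cells, used, packet_bytes / used

print(cell_usage(65))   # (2, 128, ~0.508): about 50% memory efficiency
print(cell_usage(64))   # (1, 64, 1.0): an exact multiple wastes nothing
```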
Another disadvantage of using fixed size units is that all cells of a packet in the memory must be appropriately linked so that the cells can be reassembled to form the entire packet before further processing and transmission. The additional storage required for the information linking the cells, and the bandwidth needed to access these data can be a challenge to implement at higher speeds.
We describe below some of the typical design approaches for switch/router and router switch fabrics. Depending on the technology used, a large-capacity switch fabric can either be realized with a single large switch fabric to handle the rated capacity or implemented with smaller switch fabrics as building blocks. Using building blocks, a large-capacity switch can be realized by connecting a number of such blocks into a network of switch fabrics. Needless to say, endless variations of these designs can be imagined, but the examples presented here are the most common fabrics found in switch/routers and routers.
The following are the main types of data blocking in a switch fabric:
Internal nonblocking in these architectures can be achieved by using a high-capacity switch fabric with bandwidth equal to or greater than the aggregate capacity of the connected network interfaces.
In the early days of networking, network devices were based on shared bus switch fabric architectures. The shared bus switch fabric served its purpose well for the requirements of switches, switch/routers, routers, and other devices at that time. However, based on the demands placed on the performance of networks today, a new set of requirements has emerged for switches, switch/routers, and routers.
Like most networking components, switch fabric designs involve trade-offs between performance, complexity, and cost. Today's most common switch designs vary greatly in their ability to handle multiple gigabit-level links.
The simplest shared bus switch fabric comprises a single shared channel (medium) over which all traffic between the system modules is transported. A shared bus is limited by its capacity, its length, and the overhead required for arbitrating access to the shared bus. The key design constraints here are the bus width (number of parallel bits placed on the bus) and speed (i.e., rate at which the bus is clocked, in MHz). The difficulty is designing a shared bus and arbitration mechanism that is fast enough to support a large number of multigigabit speed ports with nonblocking performance. Figures 2.3 and 2.4 illustrate high-level architectures of a bus-based switch fabric.
When multiple devices (e.g., network interface cards) simultaneously compete for access to and control of the shared bus, arbitration is the process that determines which of the devices gains access to and control of the shared bus. Each device may be assigned a priority level for bus access, known as an arbitration level. This can be used to determine which device should gain access to and control of the bus during contention for the shared bus. The switch fabric may have a fairness mechanism, which ensures that each device gets a turn to access and control the bus, even if it has a low arbitration level.
The fairness mechanism ensures that none of the devices is locked out of the shared bus and that each device can gain access to the bus within a given period of time. The central arbitration control point or the bus controller (shown in Figures 2.3 and 2.4) is the point in the system where contending devices send their arbitration signals. A simple bus implementation would use a time-division multiplexed (TDM) scheme for bus arbitration where each device is given equal access to the bus in a round-robin fashion. Because of its simplicity, the shared bus switch fabric was the most common fabric used in early routers and even in current low-end routers. The shared bus architecture presents the simplest and most cost-effective solutions for low-speed switching and routing platforms.
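A round-robin arbiter of the kind described above can be sketched as follows (a simplified, hypothetical model; real bus arbiters also handle priority levels and more elaborate fairness policies):

```python
# Sketch of a round-robin bus arbiter: a rotating pointer guarantees that
# every requesting device is served within one full rotation, so no device
# is locked out of the shared bus.
def round_robin_arbiter(num_devices):
    last = num_devices - 1  # id of the last device granted the bus

    def arbitrate(requests):
        """requests: set of device ids wanting the bus; returns winner or None."""
        nonlocal last
        for offset in range(1, num_devices + 1):
            candidate = (last + offset) % num_devices  # next device after last winner
            if candidate in requests:
                last = candidate
                return candidate
        return None  # no device is requesting the bus this cycle

    return arbitrate

arb = round_robin_arbiter(4)
print(arb({0, 2}))  # 0 wins first (pointer starts just before device 0)
print(arb({0, 2}))  # 2 wins next: device 0 cannot monopolize the bus
print(arb({0, 2}))  # 0 again, and so on in rotation
```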
A big disadvantage of the shared bus switch fabric is that traffic from the slowest speed port in a shared bus system cannot speed up enough to traverse a very high-speed bus. This typically requires intermediate buffering at the slow-speed port, which further increases both the complexity and the cost of the system. In addition, issues with the hot swappability of network interface cards and fair access to bandwidth (when ports have very different speeds and traffic loads) add further complications to the design.
The typical shared bus often can be defined by the following features:
The characteristics of the bus-based architecture are summarized as follows [RAATIKP04]:
Figure 2.5 shows a high-level view of a hierarchical bus architecture. In this architecture, only packets traveling between local buses cross the backplane bus. In a hierarchical bus architecture, the main backplane bus is typically configured to have usable bandwidth less than the aggregate bandwidth of all the ports in the system. In such a configuration, the hierarchical bus-based switch operates well only when most of the traffic traversing the switch can be locally switched, meaning, traffic crossing the backplane bus is limited.
A major limitation of the hierarchical bus-based architecture is that when the traffic transiting the switch is not localized (to a local bus), the backplane bus can become a bottleneck, thereby limiting the overall throughput of the system. Furthermore, performing port assignments in order to localize communication to the local buses would introduce unnecessary constraints on the network topology and also make network configuration and management very difficult.
Figures 2.6 and 2.7 show high-level architectures of the distributed output buffered switch. The switch fabric has separate and independent channels (buses) that interconnect any two (pairs of) input and output ports resulting in N2 paths in total. In this architecture, packets that arrive on an input are broadcasted on separate buses (channels) that connect to each output port. Each output port has an address filter that allows it to determine which packets are destined to it.
Packets destined to an output are passed by its address filter into local output queues. This architecture provides many attractive switch fabric capabilities. Obviously, no conflicts exist among the N2 independent paths interconnecting the inputs and outputs, and all packet queuing takes place at the output ports.
Another feature is that the fabric operates in a broadcast-and-select manner, allowing it to support the forwarding of multicast and broadcast traffic inherently. Given that no conflicts exist among the paths, the fabric is strictly nonblocking and full input port bandwidth is available for traffic to any output port. With independent address filters at each port, the fabric also allows for multiple simultaneous (parallel) multicast sessions to take place without loss of fabric utilization or efficiency.
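The broadcast-and-select behavior, including inherent multicast, can be sketched in a few lines (a toy model with hypothetical names; a real fabric performs this filtering in hardware, one address filter per bus):

```python
# Sketch of broadcast-and-select delivery: every input cell is broadcast on
# its own bus; each output's address filter selects only the cells whose
# destination set includes that port, so multicast needs no extra copies.
def deliver(cells, num_outputs):
    """cells: list of (payload, destination_port_set). Returns per-output queues."""
    output_queues = [[] for _ in range(num_outputs)]
    for payload, dests in cells:          # broadcast phase: seen by every port
        for port in range(num_outputs):   # each port's address filter
            if port in dests:             # select phase: accept if addressed
                output_queues[port].append(payload)
    return output_queues

cells = [("unicast", {1}), ("multicast", {0, 2, 3})]
print(deliver(cells, 4))
# → [['multicast'], ['unicast'], ['multicast'], ['multicast']]
```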
In Figure 2.6, the address filters and buffers at each port are separate and independent and need only operate at the input port speed. All of these output port components operate at the same speed. The fabric does not require any speed-up, and scalability is limited only by the bus electronics (operating frequency, signal propagation delay, electrical loading, etc.). For these reasons, this switch fabric architecture has been implemented in some commercial networking products. However, the N2 growth of address filters and buffers in the fabric limits the port size N that can be implemented in a practical design.
The distributed output buffered switch fabric shown in Figure 2.7 requires fewer buffers at each port; however, these output buffers must run at a speed greater than the aggregate input port speed to avoid blocking and packet loss. The output buffer memory bandwidth and type limit the rate at which the output buffer can be accessed by the port scheduler. This factor ultimately limits the bandwidth at the output port of the switch fabric.
Figure 2.8 shows a high-level architecture of a typical shared memory fabric. This switch fabric architecture provides a pool of memory buffers that is shared among all input and output ports in the system. Typically, the fabric receives incoming packets and converts the serial bit stream to a parallel stream (over parallel lines of fixed width) that is then written sequentially into a random-access memory (RAM).
An internal routing tag (header) is typically attached/prepended to the packet before it is written into the memory. The writes and reads to the memory are governed by a system controller, which determines where in the memory the packet data are written into and retrieved from. The controller also determines the order in which packets are read out of the memory to the ports. The outgoing packet data are read from their memory locations and demultiplexed to the appropriate outputs, where they are converted from a parallel to a serial stream of bits.
A shared memory switch fabric is an output buffered switch fabric, but one where the output buffers all physically reside in a common shared buffer pool. The output buffered switch has attractive features because it can achieve 100% throughput under full traffic load [KAROLM87]. A key advantage of having a common shared buffer pool is that it allows the switch fabric to minimize the total amount of buffering it must provide to achieve a specified packet loss rate.
The shared buffer pool allows the switch fabric to accommodate traffic with varying dynamics and absorb large traffic bursts arriving at the system and any port. The key advantage is that a common shared buffer pool is able to take advantage of statistical sharing of the buffers as varying traffic arrives at the system. When an output port is subjected to high traffic, it can utilize more buffers until the common buffer pool is (partially or) completely filled.
Another advantage of a shared memory switch fabric is that it provides low data transfer latencies from input to output port by avoiding packet copying from port to port (only a write and read required). There is no need for copying packets from input buffers to output buffers as in other switch fabric architectures. Furthermore, the shared memory allows for the implementation of mechanisms that can be used to perform advanced queue and traffic management functions (priority queuing, priority discard, differentiated traffic scheduling, etc.).
Mechanisms can be implemented that can be used to sort incoming packets on the fly into multiple priority queues. The mechanism may include capabilities for advanced priority queuing and output scheduling. For output scheduling, the switch fabric may implement policies that determine which queues and packets get serviced at the output port. The shared memory architecture can also be implemented where shared buffers are allocated on a per port or per flow basis.
In addition, the switch fabric may implement dynamic buffer allocation policies and user-defined queue thresholds that can be used to manage buffer consumption among ports or flows and for priority discard of packets during traffic overload. For these reasons, the shared memory switch fabric has been very popular for the design of switches, switch/routers, and routers.
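One common form of dynamic buffer allocation can be sketched as follows; the threshold rule and the `alpha` parameter here are assumptions for illustration (in the spirit of dynamic-threshold schemes), not a policy taken from the text:

```python
# Sketch of dynamic per-port thresholds over a shared buffer pool: a port may
# keep claiming buffers only while its usage stays below a threshold
# proportional to the remaining free space, so no single port can exhaust
# the pool during overload. `alpha` is an assumed tuning parameter.
class DynamicBufferPool:
    def __init__(self, total_buffers, num_ports, alpha=1.0):
        self.free = total_buffers
        self.used = [0] * num_ports
        self.alpha = alpha

    def admit(self, port):
        """Accept a packet for `port` only if under its dynamic threshold."""
        if self.free > 0 and self.used[port] < self.alpha * self.free:
            self.used[port] += 1
            self.free -= 1
            return True
        return False  # overload discard: threshold reached or pool empty

pool = DynamicBufferPool(total_buffers=8, num_ports=4, alpha=1.0)
accepted = sum(pool.admit(0) for _ in range(10))
print(accepted)  # → 4: port 0 is cut off before exhausting the whole pool
```

As the pool drains, the threshold falls, so a heavily loaded port is throttled while free buffers remain available for the other ports.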
The main disadvantage of a shared memory architecture is that bandwidth scalability is limited by the memory access speed (bandwidth). The access speeds of memories have a physical limit, and this limit prevents the shared memory switch architecture from scaling to very high bandwidths and port speeds. Another factor is that the shared memory bandwidth has to be at least two times the aggregate system port speeds for all the ports to run at full line rate.
When the shared memory runs at this full rate, all packets can be written into and read out of the memory, resulting in nonblocking operation at the input ports. Depending on how memory is implemented and allocated, the total shared memory bandwidth may actually need to be somewhat higher to accommodate the overhead that comes with storing variable-size packets in fixed basic units of buffering.
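A rough sizing sketch of this rule (the overhead factor is an assumed parameter for illustration, not a figure from the text):

```python
def required_memory_bandwidth_gbps(num_ports, port_rate_gbps, overhead=1.0):
    """Shared memory must support one write and one read per packet, i.e.,
    at least 2x the aggregate port rate; overhead > 1.0 models segmentation
    and bookkeeping costs (an assumed illustrative factor)."""
    return 2 * num_ports * port_rate_gbps * overhead

print(required_memory_bandwidth_gbps(16, 10))        # 2 x 16 x 10 = 320 Gb/s
print(required_memory_bandwidth_gbps(16, 10, 1.2))   # about 384 Gb/s with 20% overhead
```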
The shared memory switch fabric is generally suitable for a network device with a small number of high-speed ports or a large number of low-speed ports. It is challenging to design a shared memory switch fabric when the system has to carry a high number of multigigabit ports. At very high multigigabit speeds, it is very challenging to design the sophisticated controllers required to allocate memory to incoming packets, arbitrate access to the shared memory, and determine which packets will be transmitted next.
Furthermore, depending on the priority queuing, packet scheduling, and packet discard policies required in the system, the memory controller can be very complicated and expensive to implement to accommodate all the high-speed ports. The memory controller can be a potential bottleneck to system performance. The challenge is to implement the controller to be fast enough to manage the shared memory, read the packets on priority, and implement other service policies in the system.
The memory controller may be required to handle multiple priority queues in addition to packet reads for complex packet scheduling. The requirement of multicasting and broadcasting in the switch fabric further increases the complexity of the controller. To build high-performing switching and routing devices, in addition to a high-bandwidth shared memory fabric, a forwarding engine (ASIC or processor) has to be implemented for packet filtering, address lookup, and forwarding operations.
These additional requirements add to the cost and complexity of the shared memory switch fabric. For core networks and as network traffic grows, the wide and faster memories and controllers required for large shared memory fabrics are generally not cost-effective. This is because as the network bandwidth grows beyond the current limits of practical memory pool implementations, a shared memory switch in the core of the network is not scalable and has to be replaced with a bigger unit. Adding a redundant switching plane to a shared memory switch is complex and expensive.
In shared memory fabric implementations, fast memories are used to buffer arriving packets before they are transmitted. The shared memory may be organized in fixed-size blocks/cells of, for example, 64 bytes, which accommodates the 64 byte minimum Ethernet frame size as well as the 53 byte ATM cell. This means an arriving packet bigger than the basic block size has to be segmented into unit-size blocks before storage in the shared memory.
The total shared memory bandwidth is equal to the clock speed per memory lane/line (in megahertz or megabits per second) times the number of lanes/lines (in bits) into the memory. The size of the shared memory fabric is normally determined from the bandwidth of the input and output ports and the required QoS for the transiting traffic. Dynamic allocation can be used to improve shared memory buffer utilization and guarantee that data will not be blocked or dropped as the inputs contend for memory space.
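This bandwidth relationship is a one-line calculation (the clock and width values below are illustrative only):

```python
def memory_bandwidth_gbps(clock_mhz, width_bits):
    """Total shared memory bandwidth = per-line clock rate x interface width."""
    return clock_mhz * 1e6 * width_bits / 1e9

# e.g., a 576 bit wide memory interface clocked at 166 MHz:
print(memory_bandwidth_gbps(166, 576))  # ~95.6 Gb/s, i.e., roughly 96 Gb/s
```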
The characteristics of the shared memory-based architectures are summarized as follows [RAATIKP04]:
The shared memory switch fabric is also very effective in matching the speeds of different interfaces on a network device. However, the higher link speeds and the need to match very different speeds on input and output interfaces require the provisioning of a very big shared memory fabric and buffering.
As discussed earlier, the major problem in shared memory switches is the speed at which the memory can be accessed. One way to overcome this problem is to build memories with very wide buses that can load a large amount of data in a single memory cycle. However, shared memory techniques are usually not very effective in supporting very high network bandwidth requirements due to limitations of the access time of memory, that is, the precharge times, the effective burst size to amortize the storage overhead, and the width of the data bus.
As illustrated in Figure 2.9, data from the inputs to the shared memory are time-division multiplexed (TDM), allowing only one input port at a time to store (write) a slice of data (cell) into the shared memory. Figure 2.10 illustrates a generic memory architecture. As the memory write controller receives a cell, it decodes the destination output port information, which is used to write the cell into a memory location that belongs to the output port and queue to receive the cell.
A free buffer address is taken from the free buffer address pool and used as the write address for the cell in the shared memory. In addition, the write controller links the write address to the tail of the output queue (managed by the memory read controller) belonging to the destination port. The read controller is signaled where the written cell is located in the shared memory.
Cells transmitted out of the shared memory are time-division demultiplexed, allowing only one output port at a time to have access to (i.e., read from) the shared memory. The read process to an output port usually involves arbitration, because there may be a number of cells contending for access to the output port. The memory read controller is responsible for determining which one of contending cells wins the arbitration and is transferred to the output port. Once a cell has been forwarded to its output port, the available shared memory location is declared free and its address is returned to the free buffer address pool.
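The write/read flow described above can be sketched as a small model (class and method names are hypothetical; a real controller implements these steps in hardware):

```python
# Sketch of the shared-memory cell flow: the write controller pops a free
# address, stores the cell, and links the address onto the destination
# port's output queue; the read controller pops the queue head and returns
# the address to the free buffer address pool.
from collections import deque

class SharedMemorySwitch:
    def __init__(self, num_cells, num_ports):
        self.memory = [None] * num_cells
        self.free_addresses = deque(range(num_cells))
        self.output_queues = [deque() for _ in range(num_ports)]

    def write_cell(self, cell, out_port):
        if not self.free_addresses:
            return False                      # buffer pool exhausted: cell dropped
        addr = self.free_addresses.popleft()  # take a free buffer address
        self.memory[addr] = cell              # store the cell at that address
        self.output_queues[out_port].append(addr)  # link address to queue tail
        return True

    def read_cell(self, out_port):
        if not self.output_queues[out_port]:
            return None
        addr = self.output_queues[out_port].popleft()
        cell = self.memory[addr]
        self.free_addresses.append(addr)      # address returns to the free pool
        return cell

sw = SharedMemorySwitch(num_cells=4, num_ports=2)
sw.write_cell("cell-A", out_port=1)
sw.write_cell("cell-B", out_port=1)
print(sw.read_cell(1))  # cell-A: FIFO order within the output queue
```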
Figures 2.11 and 2.12 both describe the architecture of a shared-memory-based switch/router or router with distributed forwarding in the line cards. Each line card has a forwarding table, an autonomous processor that functions as the distributed forwarding engine, and a small local packet memory for temporary holding of incoming packets while they are processed. A copy of the central forwarding table maintained by the route processor is propagated to the line cards to allow for local forwarding of incoming packets.
After the lookup decision is completed in a line card, the incoming packet is stored in the shared memory in buffer queues corresponding to the destination line card(s). Typically, packets are segmented into smaller fixed-size units, cells, by the line cards before storage in the shared memory. Various methods are used to signal to the destination line card(s) which buffer queues to retrieve a processed packet from. Packets stored in the shared memory can be destined to other line cards or to the route processor. A stored packet can either be a unicast packet destined to only one line card or a multicast packet, that is, destined to multiple line cards.
A broadcast packet is destined to all other line cards other than the incoming line card. Typically, the system includes mechanisms that allow for a multicast or broadcast packet to be copied multiple times from a single memory location by the destination line cards without the need to replicate the packet multiple times in other memory locations. The system also includes mechanisms that allow for priority queuing of packets at a destination port and also discarding packets when a destination port experiences overload.
The outbound processing at the destination card(s) includes IP TTL (time-to-live) update, IP checksum update, Layer 2 packet encapsulation and address rewrite, and Layer 2 checksum update. The processing could include rewriting/remarking information in the outgoing packet for QoS and security purposes.
The separate route processor is responsible for nonreal-time tasks such as running the routing protocols, sending and receiving routing protocol updates, constructing and maintaining the routing tables, and monitoring network interface status. Other tasks include monitoring system environmental status, system configuration, and line card initialization, providing network management functions (SNMP, console/Telnet/Secure Shell interface, etc.).
In this section, we describe examples of shared memory switch fabrics based on Motorola's NetRAM, which is a dual-port SRAM with configurable input/output data ports [MANDYLA04]. The NetRAM was designed specifically for networking devices that require shared memories with optimal performance for write/read/write cycles. Two implementations of NetRAM shared memory switch fabric are the “one-way” shared memory switch fabric with separate input and output ports (Figure 2.13) and the “snoop” switch fabric with two dual-ports (Figure 2.14).
Figure 2.13 shows the architecture of the “one-way” shared memory switch fabric using the NetRAM. Ingress ports of line cards (or input modules) connect to the data inputs (DQY) of the NetRAM, while the egress of line cards (or output modules) connect to the data outputs (DQX). The write address ports (AY) and read address ports (AX) connect to the external control logic ASIC.
The control logic ASIC supplies the free memory address in which the cell is to be written/stored until the destination output port is ready to read/retrieve it. The free memory address is received by port AY of the NetRAM while the cell is written into the input port DQY of the NetRAM immediately after the write enable signal (WY) is activated.
When a destination output port is ready to receive a data cell, the control logic ASIC retrieves the read address of the stored cell's memory location and sends it to port AX of the NetRAM. After the NetRAM memory location is read, the data are sent out of the output port DQX after two internal clock cycles, provided the output enable (GX) signal is activated. The internal clock of the NetRAM runs at two times the external clock. This allows board frequencies to be maintained at values equal to or less than 83 MHz while the NetRAM-based device delivers data transfer performance equivalent to a conventional 166 MHz memory.
The NetRAM operates as a pipeline with reads occurring before writes. With this, if a read and a write are directed at the same memory address on the same clock edge, the cell data that were previously (written) in that memory location will be read and then the write data will be written into that location after the read has finished. This allows the NetRAM to resolve potential memory address contention problems without the additional requirement of an external arbitration mechanism or special ASICs to ensure the data transferred through the NetRAM are written and read without corruption.
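This read-before-write ordering can be modeled in a few lines (a toy model of the behavior described above, not of the actual NetRAM pipeline timing):

```python
# Toy model of a read-before-write memory: when a read and a write target
# the same address on the same clock edge, the read returns the OLD
# contents and the write lands afterward, so same-address contention
# resolves without external arbitration.
class ReadBeforeWriteRAM:
    def __init__(self, size):
        self.mem = [0] * size

    def cycle(self, read_addr=None, write_addr=None, write_data=None):
        read_value = self.mem[read_addr] if read_addr is not None else None
        if write_addr is not None:
            self.mem[write_addr] = write_data  # write completes after the read
        return read_value

ram = ReadBeforeWriteRAM(8)
ram.cycle(write_addr=3, write_data="old")
# Same-address read and write in one cycle: the read sees the prior data.
print(ram.cycle(read_addr=3, write_addr=3, write_data="new"))  # → old
print(ram.cycle(read_addr=3))                                  # → new
```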
Two additional functions provided by the NetRAM are a pass-through function and a write-and-pass-through function. The pass-through function allows input data to bypass the shared memory and be transferred directly from the input to an output. In situations where data need to be transferred quickly from an input port to an output port, the pass-through function can be used, saving two system cycles compared to a RAM without this function.
When data are transferred without this function from the input to the output, the output has to wait for the data to be written through the DQY port of the NetRAM (address sent through AY) and then read out of the DQX port (address provided through AX). The write-and-pass-through function can also be used to transfer data from the input to an output, while additionally allowing the data to be written into the shared memory in the usual way.
The NetRAM is designed to have some advantages over conventional RAMs with a single address port and a common I/O port. Reference [MANDYLA04] states the main advantage to be that reads and writes can be performed at different memory addresses in the same clock cycle. This dual-address read/write capability allows the NetRAM in the one-way shared memory fabric to support a very high throughput of 2.98 Gb/s. In addition, implementations can be realized where several memories are banked together in parallel in the fabric. For example, if a fabric has 16 memory banks, the total available bandwidth would be 16 × 2.98 Gb/s (about 47.7 Gb/s).
If the data input and data output in an implementation are doubled, and reads can be performed simultaneously from both ports in the same clock cycle, then a bandwidth of approximately 6 Gb/s can be achieved. If an implementation has a 576 bit wide memory block and 16 NetRAMs connected in parallel, the maximum bandwidth becomes approximately 96 Gb/s.
Figure 2.14 shows a block diagram of the “snoop” shared memory switch fabric using the NetRAM. In this implementation, the NetRAM serves as the common shared memory, with the system bus using port DQX on the NetRAM to write and read cells. An ASIC connected to the second port, DQY, allows the user to “snoop” into any memory address to read data that need to be verified or modified, such as Ethernet destination MAC addresses and ATM VPI/VCI headers.
The user may want to include error-checking bits or other user-defined data that the system requires as soon as the Ethernet or ATM data start to be written into the shared memory. This allows the system to make decisions at the beginning of the data input cycle, such as the next destination of the data or its service priority. Another advantage is the ability to write data back at any memory address. Because the entire Ethernet frame or ATM cell is stored in the shared memory, these features eliminate the need for a separate system path to screen and separate out critical information from the data flow. This implementation also reduces chip count and board space. The “snoop” dual-ported switch fabric can be implemented as a 64 K × 18 or 32 K × 36 dual-ported device with pipelined reads.
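The snoop path can be illustrated with a toy model (hypothetical Python; the function names, the 256-byte memory, and the field layout are illustrative assumptions, taking a stored Ethernet frame whose first six bytes are the destination MAC):

```python
# Shared memory modeled as a flat byte array. The system bus writes
# whole frames through one port (DQX), while the snoop ASIC reads or
# rewrites individual fields through the second port (DQY).
shared_mem = bytearray(256)

def bus_write_frame(addr, frame):
    """System-bus side (DQX): store an entire frame in shared memory."""
    shared_mem[addr:addr + len(frame)] = frame

def snoop_read(addr, length):
    """Snoop side (DQY): inspect any field without a separate data path."""
    return bytes(shared_mem[addr:addr + length])

def snoop_write(addr, data):
    """Snoop side (DQY): write a field back at any memory address."""
    shared_mem[addr:addr + len(data)] = data

frame = bytes.fromhex("ffffffffffff") + b"payload"
bus_write_frame(0, frame)
dst_mac = snoop_read(0, 6)                      # destination MAC field
snoop_write(0, bytes.fromhex("0a0b0c0d0e0f"))   # rewrite it in place
```

Because the whole frame sits in the shared memory, the snoop port can examine or modify any field as it arrives, which is the property the text uses to justify dropping the separate screening path.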
For many years, burst SRAMs have been well suited to the write-once, read-many behavior of the level-2 caches normally used in computing applications. To cut costs and take advantage of commercially available memory devices, some designers have used burst SRAMs to implement the switch fabrics of network devices. Burst SRAMs generally have a common I/O port and a single address port. As a result, a burst SRAM can perform either a read or a write to an address location in one clock cycle, but not both in the same cycle.
Also, turning the common I/O on the burst SRAM from a read to a write state requires a deselect cycle to be inserted into the timing, resulting in one wait state. The penalty is worse when turning the common I/O from a write to a read state, since two deselect cycles have to be inserted to avoid conflicts on the burst SRAM's common I/O bus. These factors translate to a clock cycle utilization for data input or output of between 50 and 70%, depending on the read/write and write/read patterns used to access the data in the burst SRAM.
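The effect of these turnaround penalties can be reproduced with a small cycle-counting sketch (a hypothetical model, assuming exactly one dead cycle on a read-to-write turnaround and two on a write-to-read turnaround, as described above):

```python
def burst_sram_utilization(ops):
    """Fraction of clock cycles that actually move data.

    ops: a sequence of 'R'/'W' operations. Each op takes one data
    cycle; an R->W turnaround costs one extra deselect cycle and a
    W->R turnaround costs two, per the model in the text.
    """
    cycles = 0
    prev = None
    for op in ops:
        if prev == 'R' and op == 'W':
            cycles += 1          # one deselect cycle
        elif prev == 'W' and op == 'R':
            cycles += 2          # two deselect cycles
        cycles += 1              # the data transfer itself
        prev = op
    return len(ops) / cycles

# Short write/read bursts, as in a network device:
print(burst_sram_utilization('WWRRWWRRWWRR'))   # 0.6 (60%)
# Strict alternation drives utilization even lower:
print(burst_sram_utilization('WRWRWRWR'))
```

Longer runs of same-direction accesses, as in cache-style workloads, amortize the turnarounds and push utilization back toward 100%, which is why the penalty mainly hurts network-style traffic.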
These limitations make the burst SRAM inefficient for the write/read/write cycles of network devices, and they become more critical at the gigabit and terabit bandwidths seen in today's switches, switch/routers, and routers. The NetRAM can be implemented as a dual-I/O or separate-I/O device, eliminating the system-level performance penalties of a common-I/O burst SRAM, which must insert deselect cycles.
The NetRAM can perform reads and writes in the same clock cycle, with a separate memory address provided for each port, allowing a designer to implement higher speed network devices than would be possible with conventional burst SRAMs. Network devices perform write/read/write cycles most of the time, whereas conventional burst SRAMs are better suited to the burst reads and writes of computing applications.
In network devices (switches, switch/routers, and routers), which frequently transition between reads and writes, burst SRAMs deliver relatively low performance because of the dead bus cycles that occur between operations. Thus, using a conventional burst SRAM increases design complexity and reduces overall system performance. The dual-port feature of the NetRAM enables the switch fabric in the network device to perform simultaneous reads and writes to different memory addresses in each clock cycle.
The ring fabric connects each node in the system to its two immediately adjacent nodes, forming a single continuous path for data traffic (Figure 2.15). Data placed on the ring travel (unidirectionally) from one node to another, allowing each node along the path to retrieve and process the data. Every packet placed on the ring is visible to the nodes on the ring. Data flow in one direction around the ring, hop by hop, with each node receiving the data and then transmitting them to the next node.
The ring fabric consists of a network of bus segments interconnected in a circular fashion. Controllers in the nodes that attach to the ring manage the bus segments they connect to; the fabric does not require a central controller to manage how the individual nodes access the ring. Some ring architectures support dual counter-rotating rings to which the controllers connect.
The fabric is highly scalable because of its efficient use of internode pathways (bus segments): adding a node to the system requires only two fabric interfaces to connect to the ring. In some ring architectures, the point-to-point interconnection of the nodes makes it relatively easy to install a new node, since doing so requires only inserting it between two existing connections. The point-to-point connectivity between nodes also allows faults on the ring to be easily identified and isolated.
Although highly scalable, the ring fabric is susceptible to single points of failure and to traffic congestion. It is susceptible to single points of failure because it supports only one path between any two adjacent nodes; one malfunctioning node or link can disrupt communication on the entire fabric. Furthermore, adding or removing a node can disrupt communication on the fabric. Using dual rings (e.g., bidirectional or counter-rotating rings) or more improves the reliability of the ring fabric. Generally, bidirectional ring-based architectures allow very fast reconfiguration around faults, and transmitted traffic does not require rerouting.
The bandwidth of the overall ring fabric is limited by the bandwidth of the slowest link segment in the system. Furthermore, data transfer latency can be high if a large number of hops is needed to communicate between two nodes; transfer delay is directly proportional to the number of nodes on the ring. This fabric architecture is advantageous where economy of system nodes (to which network interfaces are attached) is critical but availability and throughput are less so. Most practical systems use dual rings and identical link segments between nodes.
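The hop-count argument can be made concrete with a small model (hypothetical Python; assumes a unidirectional ring and a fixed, illustrative per-hop delay):

```python
def ring_hops(src, dst, n_nodes):
    """Hops from src to dst on a unidirectional ring of n_nodes."""
    return (dst - src) % n_nodes

def ring_latency(src, dst, n_nodes, per_hop_ns):
    """Transfer delay: each node receives the data, then retransmits."""
    return ring_hops(src, dst, n_nodes) * per_hop_ns

# Worst case on an 8-node ring: traffic to the upstream neighbor
# must traverse all seven other links.
print(ring_hops(0, 7, 8))          # 7 hops
print(ring_latency(0, 7, 8, 50))   # 350 ns at an assumed 50 ns per hop

# A counter-rotating second ring lets traffic take the shorter way round:
def dual_ring_hops(src, dst, n_nodes):
    h = ring_hops(src, dst, n_nodes)
    return min(h, n_nodes - h)

print(dual_ring_hops(0, 7, 8))     # 1 hop on the reverse ring
```

This shows directly why delay grows with node count on a single ring and why dual rings help both reliability and worst-case latency.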
The characteristics of the ring-based architectures are summarized as follows [RAATIKP04]:
The major problems encountered during the design of switch fabrics are summarized here [RAATIKP04]:
Other design challenges are as follows:
Specifically, the bus architecture itself places limits on the following:
Propagation delay limits the physical length of a bus (the distance a signal can travel without significant degradation), while electrical loading limits the number of devices that can be connected to the bus. Practically speaking, these factors are dictated by physics and cannot easily be circumvented.
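As a rough illustration of the physics, the sketch below estimates the maximum one-way trace length of a synchronous bus. Both numbers fed in are illustrative assumptions, not design rules: signals are assumed to propagate at about 0.6c on a PCB trace, and the signal is assumed to need to settle within half the clock period.

```python
C = 3.0e8                 # speed of light, m/s
v = 0.6 * C               # assumed propagation velocity on a PCB trace

def max_bus_length(clock_hz, settle_fraction=0.5):
    """One-way bus length whose propagation delay fits within a
    fraction of the clock period (illustrative model only; it
    ignores electrical loading, reflections, and skew)."""
    period = 1.0 / clock_hz
    return v * period * settle_fraction

# A slow shared-bus clock vs. a fast one:
print(max_bus_length(66e6))    # roughly 1.4 m
print(max_bus_length(500e6))   # roughly 0.18 m
```

Even this crude model shows why bus length shrinks in inverse proportion to clock rate, which is the constraint the text describes as dictated by physics.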