4.3. Survivable Network Design and Traffic Restoration Concepts

4.3.1. Typical Network Architecture

In terms of studying the survivability of communication networks, it is useful to examine the basic structure of a typical network architecture. Here, we consider the Internet and the technologies used. Note that as traditional telephone and private data services migrate to the shared Internet, the survivability of the architecture becomes more important. Broadly speaking, the Internet consists of a set of interconnected networks that can be categorized based on their geographic size and function as either access networks, metropolitan area networks (MANs), or wide area networks (WANs), as shown in Figure 4.1.

Figure 4.1. Typical network architecture.


Access networks provide the end communication path to and from the users. A wide variety of technologies are utilized in access networks, including cellular networks, wireless LANs, cable modems, DSL, fiber to the home or office, and Ethernet. Access networks typically have a tree or hub-and-spoke type of topology with little or no redundancy provided due to cost constraints, though customers willing to pay for it (generally medium to large commercial customers) can often be provided with dual-homed premises.

MANs provide a local backbone network spanning a city or metro area. Technologies used here include wavelength division multiplexing (WDM) optical fiber, SONET, Ethernet, WiMAX, point-to-point microwave, free space optical, and so on.

WANs, also known as core backbone or long-haul networks, are the most uniform in terms of technology, with almost all now using optical communication links with WDM or dense WDM (DWDM) technology.

Note that the Internet component networks are multilayer in nature, accommodating a wide variety of users and applications. Broadly speaking, each network consists of a top layer where services such as voice, video, data, and multicast and broadcast video are provided. These services are provided over a middle or switched layer (e.g., IP packet switching, MPLS label switching). In turn, the middle layer is provided over a transport layer technology such as DWDM light paths on optical fibers. Note that the transport layer may contain several sublayers (e.g., SONET and DWDM). In the future, each layer is expected to be reconfigurable (e.g., rearrangeable light paths, LSPs). Lastly, it is worth noting that virtualization is a growing trend in all three types of networks, with virtual private networks (VPNs) being deployed within a network (e.g., within an individual MAN or even backbone ISP WAN), across multiple networks, or end-to-end at the application layer. However, virtualization can often lead to poor resilience if the virtual network is not carefully designed to avoid failure propagation, as discussed in Section 4.6.

Network survivability research originated with telecommunications network operators and has been a subject of study for decades [21, 22]. Additionally, some research on survivability in an unfriendly environment was part of the early work on the Internet [23]. In recent years, however, there has been increasing interest in network resilience and survivability. Various journals have dedicated special issues to the survivability of networks and their components (e.g., [24]). There are now specialized conferences focused on the topic (e.g., IEEE Design of Reliable Communication Networks (DRCN) and IEEE Dependable Systems and Networks (DSN)) and several excellent books have been published [7, 9, 12–17, 20]. The current literature tends to focus on providing survivability in a particular technology at a specific layer, such as the application overlay layer [25], switched layer (IP [26], ATM [15], MPLS [27]), or transport layer [16], or in a specific part of the network architecture (e.g., MAN or WAN). Examples include development of techniques for implementation of light path restoration in core DWDM optical backbone sections of the Internet (e.g., a Tier 1 ISP network) [17] or survivable SONET ring techniques to overcome link failures in MANs [14]. While the implementation of survivability in a particular technology or protocol in a component network involves many details particular to the application, the basic techniques and principles used are largely the same in each case. Here, we present a technology-independent discussion of survivable network design concepts and principles, beginning with a discussion of the basic concepts.

4.3.2. Basic Network Survivability Concepts

While a variety of survivability techniques (e.g., multiple homing, trunk diversity, self-healing rings, preplanned backup routes, p-cycles) have been proposed for a range of network technologies, they all work on the concept of redundancy and diversity. First, consider how diversity is utilized in a mesh network where traffic is routed on fixed end-to-end paths (e.g., light paths in a WDM network). An active path (AP), also sometimes called the working path, is the route taken by the traffic under normal operating conditions. For the network to be survivable to failures in the active path, one must be able to find a suitable backup path (BP) (i.e., an alternate path around the failure) in the topology. Obviously, the backup path and the active path must be physically diverse or disjoint so that both paths are not lost at the same time.

Diversity can be achieved in the active and backup paths in several ways. For example, they may be link disjoint, as shown in Figure 4.2a, or node disjoint, as shown in Figure 4.2b. As shown in the figure, the link disjoint BP can potentially recover from any link failure in the AP, whereas the node disjoint BP can potentially recover from any link or relay node failure in the AP. Note that for diverse AP and BP paths to exist, the physical network topology must allow at least two disjoint end-to-end paths between every source-destination pair. However, even though a BP may exist for an AP, restoration cannot proceed unless there are enough spare resources on the BP to carry the AP traffic at the required QoS level. This requires the allocation of redundant resources on the BP, which are typically not used except in the case of failure. The focus of survivable network design is to plan the allocation of diversity and redundancy in the network to support resilience to a set of failure scenarios (e.g., any single link failure). In order to take advantage of the redundancy and diversity in the network, appropriate fault management and traffic restoration procedures need to be in place.

Figure 4.2. Survivable network concepts: (a) link disjoint and (b) node disjoint.
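The difference between link-disjoint and node-disjoint backup paths can be illustrated with a minimal sketch. The topology, node names, and function names below are hypothetical and chosen only for illustration. The sketch finds a fewest-hop AP by breadth-first search and then searches for a BP after banning the AP's links (link disjoint) or the AP's links and relay nodes (node disjoint). Note that this remove-and-search heuristic is an assumption made for brevity: on some topologies it fails to find a disjoint pair that does exist, which is why dedicated algorithms such as Suurballe's are commonly used instead.

```python
from collections import deque

def shortest_path(adj, src, dst, banned_links=frozenset(), banned_nodes=frozenset()):
    """Breadth-first search for a fewest-hop path; returns a node list or None."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in adj[u]:
            if v in parent or v in banned_nodes:
                continue
            if frozenset((u, v)) in banned_links:
                continue
            parent[v] = u
            queue.append(v)
    return None

def backup_path(adj, active, node_disjoint=False):
    """Search for a BP that avoids the AP's links (and relay nodes if requested)."""
    banned_links = {frozenset(link) for link in zip(active, active[1:])}
    banned_nodes = set(active[1:-1]) if node_disjoint else set()
    return shortest_path(adj, active[0], active[-1], banned_links, banned_nodes)

# Hypothetical undirected topology as an adjacency list.
adj = {
    "A": ["B", "C"],
    "B": ["A", "D", "C", "E"],
    "C": ["A", "B", "F"],
    "D": ["B", "E", "G"],
    "E": ["B", "D"],
    "F": ["C", "G"],
    "G": ["F", "D"],
}

active = shortest_path(adj, "A", "D")
link_bp = backup_path(adj, active)                      # may reuse relay node B
node_bp = backup_path(adj, active, node_disjoint=True)  # avoids relay node B
print(active, link_bp, node_bp)
```

In this example the link-disjoint BP still passes through the AP's relay node, so it survives any single AP link failure but not the failure of that node, whereas the node-disjoint BP survives both.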


4.3.3. Basic Network Management Concepts

When studying the survivability of communication networks, it is also useful to look at the general framework in which traffic management is implemented. Network management has become an indispensable tool of communication networks since it is responsible for ensuring continuous and secure functioning of a network. Generally, network management is divided into several functional areas: performance management, configuration management, and fault management. These key areas are implemented by control modules that operate in an integrated way to manage the network, including functions that support traffic management and restoration-based survivability techniques.

As an example, we can look at how fault management fulfills its goals. The key functions of fault management, RRRR [28], in prioritized order are:

  1. Restore services.

  2. Root cause identification of failures.

  3. Repair failed components.

  4. Report the incidents.

To implement the highest priority task, that is, restore services, fault management uses the functionality provided by configuration management and performance management. Upon detection of a network failure by the fault management system or detection of QoS degradation by the performance management system, the failed traffic connections are identified by the configuration management system, new paths are searched for if needed, the best path (i.e., the one with the lowest cost) is selected for each failed connection, and the traffic is rerouted by establishing new connections along the selected paths. In parallel with the service restoration process, a repair process should begin with the network manager performing root cause identification followed by the detailed repair or replacement of the failed components. Once the failed components have been repaired, they can be put back in service and a normalization or reversion process might occur that consists of moving the traffic from its current routes back to its original prefailure routes. Furthermore, all incidents in the process are monitored and reported for billing and management purposes.

The steps in the service recovery process are sometimes denoted as DeNISSE(RN) [29], which is derived from the restoration process’ major steps:

  1. Detection of the failure.

  2. Notification of the failure to the entities responsible for restoration.

  3. Identification of failed connections.

  4. Search for new paths.

  5. Selection of least-cost paths for the failed connections.

  6. Establishment of the new paths by signaling and rerouting.

  7. Report for billing and management.

  8. Normalization.

These steps are summarized in Figure 4.3.

Figure 4.3. Major steps in traffic management during restoration of service.
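The restoration steps listed above can be sketched in code as follows. This is an illustrative sketch only: the function names, the link-cost table, and the connection names are hypothetical. Detection and notification are reduced to passing in the failed link, search and selection are merged into a single least-cost search (Dijkstra's algorithm), and normalization is omitted.

```python
import heapq

def least_cost_path(links, src, dst, failed):
    """Dijkstra over undirected links {(u, v): cost}, skipping failed links."""
    adj = {}
    for (u, v), cost in links.items():
        if frozenset((u, v)) in failed:
            continue
        adj.setdefault(u, []).append((v, cost))
        adj.setdefault(v, []).append((u, cost))
    best = {src: 0}
    parent = {src: None}
    heap = [(0, src)]
    while heap:
        dist, u = heapq.heappop(heap)
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return dist, path[::-1]
        if dist > best[u]:
            continue  # stale heap entry
        for v, cost in adj.get(u, []):
            nd = dist + cost
            if nd < best.get(v, float("inf")):
                best[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return None

def restore(links, connections, failed_link):
    """Reroute every connection that uses failed_link; return a report log."""
    failed = {frozenset(failed_link)}             # Detection and Notification
    log = []
    for name, path in list(connections.items()):
        used = {frozenset(l) for l in zip(path, path[1:])}
        if not used & failed:                     # Identification
            continue
        result = least_cost_path(links, path[0], path[-1], failed)  # Search, Selection
        if result is None:
            log.append((name, "unrestored", None))
            continue
        cost, new_path = result
        connections[name] = new_path              # Establishment
        log.append((name, "restored", cost))      # Report
    return log

# Hypothetical link costs and active connections.
links = {("A", "B"): 1, ("B", "C"): 1, ("A", "D"): 2, ("D", "C"): 2, ("B", "D"): 2}
connections = {"c1": ["A", "B", "C"], "c2": ["A", "D"]}
log = restore(links, connections, ("B", "C"))
print(log, connections)
```

Only the connection that traverses the failed link is rerouted. A normalization step would need to remember each connection's prefailure route so that traffic can be moved back after repair.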


4.3.4. Protection versus Restoration

A quick scan of the literature will reveal the use of the terms protection and restoration, in many cases interchangeably and/or with little differentiation between the two. We define each separately here, but the reader should note that the differentiation we provide is not strictly adhered to by everyone in the industry.

In transport network survivability, protection usually refers to mechanisms where any postfailure switching actions are predefined and utilize spare capacity that is dedicated for a specific set of failure scenarios. In its purest form, the signal is either already duplicated on a backup path, such as in 1 + 1 automatic protection switching, or at the very least the backup path is preconfigured into a pretested and ready-to-use state, such as in p-cycles. Protection also often refers to survivability methods where a protection route is predefined but is not preconfigured, such as in shared backup path protection. In that case, capacity seizure and cross-connection is accomplished postfailure in real time.

Restoration usually refers to survivability mechanisms where backup paths are neither predefined nor preconfigured, and where spare capacity is not dedicated for any specific sets of failure scenarios, but rather is available and configured as needed when failures arise. In its purest form, backup path determination, spare-capacity seizure, and cross-connection are all accomplished postfailure in real time through either a centralized or a distributed protocol. This is often how span or link restoration and shared path restoration are envisioned to occur. In accordance with this, restoration falls within the traffic management category defined in Section 4.1, and protection falls within the network design category. However, depending on the specific implementation, many such restoration mechanisms actually perform route-finding and preplanning exercises, and in some cases, even limited preconfiguration.

As such, most survivability mechanisms are actually not strictly pure protection or pure restoration as we just defined. Rather, they inhabit the space between the two and are only referred to as one or the other depending on historical considerations, convention, and so on.
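One practical consequence of where a mechanism sits between dedicated and shared spare capacity is the amount of capacity that must be reserved on each link. The following sketch compares the two extremes under the assumption of single-link failures. The connections, demands, and function names are hypothetical. With dedicated capacity, every backup path reserves its own capacity on each link it crosses. With sharing, a link needs only enough spare capacity for the worst single failure, because backup paths whose active paths cannot fail together never need the capacity at the same time.

```python
from collections import defaultdict

def links_of(path):
    """Return the set of undirected links traversed by a node list."""
    return {frozenset(link) for link in zip(path, path[1:])}

def spare_capacity(connections):
    """connections: list of (demand, active_path, backup_path) tuples.

    Returns (dedicated, shared), each mapping a backup link to the spare
    capacity it needs, assuming only single-link failures occur.
    """
    dedicated = defaultdict(int)
    per_failure = defaultdict(lambda: defaultdict(int))
    for demand, active, backup in connections:
        for b in links_of(backup):
            dedicated[b] += demand              # every BP reserves its own capacity
            for f in links_of(active):
                per_failure[b][f] += demand     # capacity needed on b if f fails
    shared = {b: max(by_failure.values()) for b, by_failure in per_failure.items()}
    return dict(dedicated), shared

# Two hypothetical connections with disjoint active paths whose backup
# paths both cross link C-D.
connections = [
    (2, ["A", "B"], ["A", "C", "D", "B"]),
    (3, ["E", "F"], ["E", "C", "D", "F"]),
]
dedicated, shared = spare_capacity(connections)
cd = frozenset(("C", "D"))
print(dedicated[cd], shared[cd])
```

On the common link, the dedicated scheme reserves the sum of the two demands, while the shared scheme reserves only the larger of the two.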

4.3.5. Other Issues

Other important factors to consider in determining which mechanisms to use in assuring survivability are which of the network layers will be responsible for restoration, where the rerouting will take place, and what rerouting algorithm will be used (e.g., minimum cost, lowest latency).

The Layer Responsible for Restoration

Internet component networks are multilayer in nature, where lower layers are said to provide services to the layers immediately above them, and the inner workings of each layer are hidden by general interfaces to the layer. Thus, in order to deliver the services in a more reliable way, each layer could implement its own survivability techniques. The end-to-end principle [30] has been used as a guiding principle for allocating responsibility for overall network survivability among various layers in a network. In short, the principle says that one should let the end nodes of a connection (i.e., the higher layer) take care of reliability issues since functions placed at lower layers may be redundant or too costly in terms of performance. However, as demand has shifted to high-speed, multiservice networks with strict QoS requirements, the argument no longer holds. In order to provide guaranteed QoS, each layer should deliver its services according to the QoS committed. Thus, each layer typically includes survivability functions as described further in Section 4.6 on multilayer issues.

The Location of Rerouting

Each step in the restoration process (DeNISSE(RN)) can be carried out by a node close to the failure or by a network node or management node more distant from the failure. Consider the essential step of rerouting a connection, that is, the establishment of a new connection along a new path. There are two major schemes based on how far upstream from the failure the rerouting is performed, namely, path and link (span) restoration. In path restoration, the source nodes of the active path connections that are affected by the failure perform the rerouting (Figure 4.4a). In link restoration, the nodes adjacent to the failure are responsible for rerouting all affected traffic demands around the failed link (Figure 4.4b). The main advantages of path restoration are capacity efficiency, the possibility of rerouting each connection individually, and the ability to respond to a wider range of failure scenarios. For link restoration, the main advantages are speed and simplicity of implementation.

Figure 4.4. (a) Path-based and (b) link-based survivability schemes.
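The two schemes can be contrasted with a small sketch. The topology and function names are hypothetical, and fewest-hop breadth-first search stands in for whatever route search a real implementation would use. Path restoration reroutes end to end from the source, whereas link restoration finds a detour between the two nodes adjacent to the failed link and splices it into the original route.

```python
from collections import deque

def build_adj(links):
    """Build an undirected adjacency list from a list of (u, v) links."""
    adj = {}
    for u, v in links:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj

def bfs_path(adj, src, dst, failed):
    """Fewest-hop path from src to dst that avoids the failed link, or None."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in adj[u]:
            if v not in parent and frozenset((u, v)) != failed:
                parent[v] = u
                queue.append(v)
    return None

def path_restoration(adj, active, failed_link):
    """The source node reroutes the whole connection end to end."""
    return bfs_path(adj, active[0], active[-1], frozenset(failed_link))

def link_restoration(adj, active, failed_link):
    """The nodes adjacent to the failure splice a detour into the route.

    failed_link must list its end nodes in the order they appear on the
    active path.
    """
    u, v = failed_link
    detour = bfs_path(adj, u, v, frozenset(failed_link))
    if detour is None:
        return None
    i = active.index(u)
    return active[:i] + detour + active[i + 2:]

# Hypothetical topology; the active path is A-B-C-D and link B-C fails.
adj = build_adj([("A", "B"), ("B", "C"), ("C", "D"), ("A", "E"),
                 ("E", "F"), ("F", "D"), ("B", "E"), ("C", "F")])
active = ["A", "B", "C", "D"]
by_path = path_restoration(adj, active, ("B", "C"))
by_link = link_restoration(adj, active, ("B", "C"))
print(by_path, by_link)
```

In this example the path-restored route uses three links while the link-restored route uses five, which illustrates the capacity efficiency of path restoration. The link-restored route, on the other hand, is computed from purely local information at the nodes adjacent to the failure.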


Rerouting Algorithms

For dynamic restoration schemes, be they path or link based, the node responsible for restoration typically will have many failed connections that need to be restored simultaneously. Some of the connections may carry traffic that requires retransmission of lost packets. Retransmission introduces a burst not seen under normal operation that may cause transient network congestion. It has been shown that selection of an appropriate rerouting algorithm can control this congestion to some extent and improve the QoS provided. In particular, for lightly loaded networks, algorithms that distribute the load over a large area should be selected. For heavily loaded networks, minimum-delay algorithms should be used, possibly in combination with preemption of some connections, to ensure an acceptable QoS for the surviving connections [31–33].
