CHAPTER 4
Data Networks—The Nervous System of the Cloud

This chapter picks up just at the point where Chapter 3 left off. There we had already mentioned data networking more than once, always with a forward reference. No more postponement!

Data networking refers to a set of technologies that enable computer-to-computer communications. The ultimate result is that two processes located on different computers can talk to one another. This, in turn, supports distributed processing. As you may remember, enabling interprocess communications among processes on the same machine is an operating system task. Transferring this capability across machines—over a network—naturally involves operating systems, too.

In fact, the disciplines of data communications and operating systems have been evolving side by side since the 1960s. In earlier systems, the data communication capabilities were added by means of both (a) new device drivers for physical network access and (b) specialized libraries for interprocess communications across the network (which differed from those provided by the kernel for interprocess communications within the machine). It is interesting that, in these early days, achieving over-the-network interprocess communications was not an end in itself1 but rather the means of accessing remote resources—typically, files. File transfer was a major data networking application. Another one was transaction systems (such as those employed in banking or airline reservations), in which a user would type a command that caused an action in a database located on a remote machine and then get a response from this machine.

Toward the end of the 1980s, the operating systems evolution proceeded along two branches: (1) network operating systems, which provide an environment in which users can access remote resources while being aware of their being remote; and (2) distributed operating systems, which provide an environment in which users access remote resources in exactly the same way they access local resources. This development—which is very interesting in itself—and its results are well described in [1]. In addition, we recommend [2] as an encyclopedic reference to distributed computing. In more than one way, the objectives of these developments were very similar to those of Cloud Computing inasmuch as these objectives included support of migration, which is three-pronged: data migration involves moving a file or its portion to the accessing machine; computation migration involves invoking a computation on a remote machine—where the required data reside—with the result returned to the accessing machine; process migration involves the execution of a program on a remote machine (e.g., for the purposes of load balancing). Cloud Computing, of course, augments the latter capability with full virtual machine (rather than process) migration. Yet, there is an emerging trend in which new applications are run on dedicated inexpensive microprocessors; here, the CPU virtualization component becomes much less relevant in achieving elasticity, while the data networking component remains prominent.

The ever-growing size of the manuscript does not allow us to concentrate on distributed processing in any detail. Fortunately, there are well-known monographs (including the references that we just mentioned) on the subject. In the rest of this chapter, we will deal with distributed computing primarily in the context of the World-Wide Web.

Going back to data networking, it requires, at a minimum, physical interconnection of computers, as depicted in Figure 4.1(a). Naturally, interconnection in this form is an essential aspect of Cloud Computing, as it effects the data exchange (1) within a Cloud; (2) between any two federating Clouds; and (3) between any computer that needs to access the Cloud and the Cloud.


Figure 4.1 Dual aspects of networking in Cloud Computing.

It is interesting to observe a duality present in Figure 4.1(b). (We have actually referred to it more than once in the previous chapter.) Here, as the physical machines “turn” into virtual machines, consolidated on a single physical machine, all the networking software must remain unchanged, and so the “physical network,” now located within the same physical machine, is effectively simulated by the hypervisor.

As far as the introduction to data networking is concerned, we have tried to make this book as self-contained as possible.

We start with an overview of data networking, using the classical Open Systems Interconnection (OSI) model developed by the International Organization for Standardization (ISO). The OSI reference model is the best universal tool for explaining both the principles of data communication and its major issues. All data communication protocols fit (more or less) within this model, and often the very question of where a particular protocol fits leads to a deeper understanding of the issues involved.

Following the discussion of the model, we introduce the Internet Protocol (IP) and review the Internet protocol suite. In the course of this review, it should become apparent that even though the Internet was designed around specific paradigms and with only a few applications in view, it was made fairly extensible. We demonstrate this extensibility with examples—which we will need later—of the standardized means to achieve Quality of Service (QoS), a capability to support “data pipes” of specific capacity.

An important issue, deserving a separate section (although a separate book could easily be dedicated to it), is that of addressing and routing in IP-based networks. Following that we have a section on Multi-Protocol Label Switching (MPLS), which almost leaps out of the IP context by effectively introducing a technology that synthesizes circuit switching with packet switching.

The next step, building on the previously discussed capabilities, introduces a new dimension to virtualization. This dimension deals with network virtualization rather than computer virtualization. Instead of building a separate private network, an enterprise can use a set of mechanisms for carving out from a provider's network—or the Internet—what looks and feels like a dedicated private network, that is, a Virtual Private Network (VPN). This capability has been carried over from public telephone networks (see [3] for the details and history), which provided an enterprise with a unique numbering plan so that telephone calls were made in the same manner whether the two callers were in adjacent offices or in different countries.

In the case of data networks—and at this point it is accurate to say that telephony has become just another application of data networking, so we are approaching a situation in which there are really no other networks except data networks—the VPN variants are depicted in Figure 4.2. A private network, which is completely owned by an enterprise, is shown in Figure 4.2(a). Typically, such a network spans a single enterprise campus. If an enterprise has two campuses, the two networks can be interconnected through a “pipe,” as shown in Figure 4.2(b). (Incidentally, this is one of the means of interconnection provided by the Cloud carrier actor described in the NIST reference architecture discussed earlier.) Figure 4.2(c) depicts a scenario in which different islands of an enterprise (campus networks or even individual users) access a carrier's public network or the Internet to form a VPN.

Diagram shows three different types of network: a private network; two private campus networks interconnected through a pipe; and a scenario in which different islands of an enterprise access a public network or the Internet to form a virtual private network.

Figure 4.2 Private and virtual private networks.

Further development of this topic, undertaken in a separate section, introduces the subject of a Software-Defined Network (SDN), which gives greater control to network providers for centralized decision making about routing and forwarding of data packets.

As is the case with the previous chapter, we conclude the present one with an overview of network security, although this time it is limited in scope to only network layer security.

4.1 The OSI Reference Model

This model (first published in 1984 and subsequently revised 10 years later [4]) has successfully applied a divide-and-conquer approach to the complex problem of making two different processes, executing at different hosts, communicate with each other. Even though many of the actual OSI standards ended up abandoned in favor of the Internet standards (see [5], which explains very well the reasons as well as the history of the effort), the model lives on.2

Figure 4.3 highlights the key aspects of the model. To begin with, the endpoints are structured as seven-layer entities, each layer being an independent module, responsible for a particular communication function.

Diagram shows the OSI reference model with two hosts, each structured as a seven-layer entity comprising the application layer, presentation layer, session layer, transport layer, network layer, link layer, and physical layer, with a network in between.

Figure 4.3 The OSI reference model.

There are two aspects to the model: the first aspect relates to intermachine communications; the second to processing within a single machine. In the remainder of this section we will discuss these aspects, followed by a functional description of each layer.

4.1.1 Host-to-Host Communications

As a separate module, each layer is designed to “talk” directly to its counterpart (the point that the dotted lines in Figure 4.3 illustrate). A set of messages pertinent to a particular layer, combined with the rules defining the order of the messages, is called a protocol. For example, a file transfer (perhaps the earliest data communication application) may require a protocol that contains a message for copying a given file as well as a set of operation status messages—including error messages.

While the actual transfer of a message is relegated to the layer immediately below, terminating with the physical layer, which relies on the physical medium for data transfer, the “direct” peer-layer message (called the protocol data unit) is presented to the receiving module at the endpoint intact.

The endpoint entities are identified by the layer's addressing scheme. Thus, an application layer's endpoints are identified by the application-defined addressing scheme, the session layer's endpoints by the session-layer addressing scheme, and so on. The chief rule for maintaining modularity is that a given layer is not supposed to have any knowledge of what happens below it—beyond what is specified in the interlayer interface. (As often happens with rules, this particular one has been broken, as we will see very soon, when discussing Internet protocols.)

The meaning of endpoints differs among different layers. For the layers from application down to session layer, the interlocutors are application processes, represented in the model through application entities. For the transport layer, the endpoints are the machines themselves. But the network layer is altogether different—as Figure 4.3 demonstrates—in that it relays each protocol data unit through multiple machines until that unit reaches the intended recipient host.

The delivery model deserves special attention, as there are two competing modes for it. In one mode—called connectionless—the network works like a postal system. Once an envelope with the address of the recipient is deposited into the system, it may pass multiple relay stations, until it ends up in the post office that serves the recipient. It used to be (and maybe still is?) that people carried on chess games by correspondence. With each turn in the game, the correspondent who made a move would write down that move in a letter, send it to his or her counterpart, and wait for the latter's move. Note that the route the envelopes travel may differ from one envelope to another, subject to mail pick-up regulations and transport availability. Playing chess by correspondence was a slow affair.

The other mode—called connection-oriented—took after telephony. A typical application here is a telephone conversation. In the classic Public Switched Telephone Network (PSTN) of the previous century, the telephone switches established an end-to-end route (called a circuit) among themselves, connecting the two telephones. Once established, the circuit would last for as long as the conversation continued. In addition, the PSTN could establish, on a customer's request, a semi-permanent circuit, connecting various pieces of the customer's enterprise, thus establishing a type of VPN. This model is natural for physical “on-the-wire” connections, but it has been applied to data communications where virtual circuits are defined in the network layer.3 An immediate advantage of the connection-oriented mode, compared with connectionless, is the relative ease of traffic engineering (i.e., guaranteeing the upper bounds of the end-to-end delay and its variance, called jitter). An immediate disadvantage is the potential waste of resources required to maintain a circuit. Going back to the chess game example, the game would definitely have moved faster if the players had used the telephone instead of mail, but it would have been much more expensive while also tying up telephone lines that would be idle while the players thought about their next moves.

We will return to this topic repeatedly. As we will see, both models are still alive and well, having evolved in synthesis. The evolution of these models in telephone networks and the Internet has been described in [3]. We note finally that the highest layer at which the machines comprising the so-far-observed network operate, for the purposes of the OSI model, is the network layer. We will call these machines network elements. Often, network elements are also called switches or routers—the distinction comes historically from the respective model: connection-oriented or connectionless.

Finally, the link layer and the physical layer both deal with a single physical link and thus connect either (1) the two network elements or (2) a network element and the endpoint host.

4.1.2 Interlayer Communications

Our experience in teaching data communications and operating systems has shown that understanding the computation model (i.e., what is being called within the context of an application process and under what circumstances) is central both to the overall understanding of data communications and to the understanding of each protocol involved.

We should note right away that a given model does not imply an implementation, although any implementation is based on one or another model. Ultimately, thinking about a specific implementation of any abstract idea is a good step toward understanding this idea. We will make one step from what the standard specifies toward developing an actual implementation.

According to the OSI model, each layer provides a well-defined service to the layer above. The service is a set of capabilities, and the capabilities are implemented by passing the information between the layers.

The key questions here are how a communicating process interacts with the layers of Figure 4.3, and how these layers interact with one another on a single machine (host).

The OSI specification states that in order to transmit its protocol data unit to its peer, a layer sends a request to the underlying layer. The layers interact strictly according to the layering structure, without bypassing the ranks. The incoming protocol data (from the peer layer) is carried in an indication from the underlying layer. Consequently, each layer receives a request from the layer above and sends its own request to the layer below; similarly, each layer receives an indication from the layer below and sends its own indication to the layer above. This model is concise, but it is abstract enough to allow drastically different implementations.

Let us start with a straightforward but rather inefficient implementation, in which each layer is implemented as a separate process4 and the messages are passed using interprocess communications. This results in rather slow operation, and there is an inherent difficulty. Since indications and requests are asynchronous,5 a simplistic implementation—possibly dictated by the message-passing capabilities of a given operating system—may end up with at least two processes (or two threads within a single process) for each layer. One process (or thread) would execute a tight loop waiting for the indications and propagating them up, while the other would wait for the requests and propagate them down. A slightly better approach is to wait for any message (either from the layer above or the layer below) and then process it accordingly.
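This last variant, a single loop that waits for any message and dispatches it by direction, can be sketched in Python; the queue names and message format below are ours, purely illustrative:

```python
import queue
import threading

def layer_loop(inbox, up_queue, down_queue, stop):
    # Serve one layer with a single loop: wait for ANY message,
    # then dispatch it according to its direction.
    while not stop.is_set():
        try:
            direction, payload = inbox.get(timeout=0.1)
        except queue.Empty:
            continue
        if direction == "request":       # from the layer above: pass down
            down_queue.put(("request", payload))
        else:                            # "indication" from below: pass up
            up_queue.put(("indication", payload))

# Wire up a single layer and drive it with one request and one indication.
inbox, up_q, down_q = queue.Queue(), queue.Queue(), queue.Queue()
stop = threading.Event()
worker = threading.Thread(target=layer_loop, args=(inbox, up_q, down_q, stop))
worker.start()
inbox.put(("request", b"outgoing data"))
inbox.put(("indication", b"incoming data"))
sent_down = down_q.get()   # the request, propagated down
passed_up = up_q.get()     # the indication, propagated up
stop.set()
worker.join()
```

In a real system the downstream and upstream queues would lead to the adjacent layers' inboxes; here they simply collect what the layer emits.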

One way to describe an efficient computational model, with the help of Figure 4.4, is to define a layer as a class with two methods: request and indication, along with the appropriate data structures and constants dictated by the layer's functional requirements. The request method is to be invoked only by the layer above; the indication method is to be invoked only by the layer below.

Diagram shows the layer as a class with two methods named request and indication. The request method is invoked only by the layer above; the indication method is invoked only by the layer below.

Figure 4.4 Requests and indications as methods of the layer class.

According to the OSI model, for each layer, two objects of the respective class are to be instantiated at the hosts and network elements. The objects exchange protocol data units, as defined in the protocol specification of the layer.

When a layer sends a protocol data unit, it attaches to it a header that further describes the payload that follows. Then it invokes the request method of the layer below, passing the protocol data unit to it. In our computation model, the request action starts with the application process itself, and so all the subsequent calls are performed in the context of this process (the application entity).

An invocation of a layer's request method from the layer above will, at a certain point, result in this layer's invocation of the request method of the layer below. It does not, however, necessarily do so immediately—for instance, the layer's required behavior may be such that it needs to accumulate data to a certain point before invoking the request method of the layer below. Conversely, a single invocation of the request method can result in several request invocations at the layer below—as may be required by the necessity to break a large protocol data unit into smaller ones carried in the layer below. (For example, a huge file may not be sent in a single protocol data unit, and so a request to copy a file from one machine to another would typically result in several data units being sent. We will see more examples later.)

The same consideration applies to invoking the indication method. Just as the messages to be sent out at a higher level may need to be broken into smaller units for processing at the layer below, so the received messages may need to be assembled before they are passed to the layer above.
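A minimal Python sketch of this computational model follows. The layer names and the bracketed text headers are our own illustrative choices; a real stack would carry binary headers and would also segment and reassemble data, which this sketch omits:

```python
class Layer:
    # A layer as a class with two methods (as in Figure 4.4): `request`
    # is invoked only by the layer above, `indication` only by the
    # layer below.
    def __init__(self, name):
        self.name, self.lower, self.upper = name, None, None
        self.delivered = []        # payloads handed to this layer's user

    def request(self, payload):
        pdu = f"[{self.name}]".encode() + payload  # attach this layer's header
        if self.lower:
            self.lower.request(pdu)                # pass the PDU down
        else:
            self.on_wire = pdu                     # bottom layer: "transmit"

    def indication(self, pdu):
        header = f"[{self.name}]".encode()
        payload = pdu[len(header):]                # strip this layer's header
        if self.upper:
            self.upper.indication(payload)         # pass the payload up
        else:
            self.delivered.append(payload)         # top layer: deliver

def build_stack(names):            # names listed top to bottom
    layers = [Layer(n) for n in names]
    for upper, lower in zip(layers, layers[1:]):
        upper.lower, lower.upper = lower, upper
    return layers

sender = build_stack(["transport", "network", "link"])
receiver = build_stack(["transport", "network", "link"])
sender[0].request(b"hello")                   # send from the top layer down
receiver[-1].indication(sender[-1].on_wire)   # receive at the bottom layer
```

Note how the headers nest on the wire: each layer's request prepends its own header, and each peer's indication strips exactly that header on the way up.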

We know that the first request in the chain is invoked by the application process. But who invokes the first indication? Naturally, the one whose job it is to watch the device that receives data from the network! As we know, the CPU already watches for the device signals, which suggests that the first indication method should be invoked by the relevant interrupt service routine, with the rest of the invocations propagating from there.

Now that we get into the subject of devices, it is appropriate to note that the link and physical layers have typically been implemented in hardware, as are parts of the network layer in network elements. Therefore, the actual device-facing interface may need to be maintained in the upper layers. And, as we will see from the Internet discussion, some applications skip a few top layers altogether, so the application process may face a layer that is lower than the application layer. This is reflected in the overall model depicted in Figure 4.5.

Diagram shows a summary of the overall computational model. The main line of the application process passes through the request methods from the application layer down to the network layer. An interrupt service routine, triggered by the networking device, invokes the indication methods, which propagate from the network layer up to the application layer and finally to the application-supplied plug-in routine.

Figure 4.5 Summary of the overall computational model.

Note that this model has a clear place for the indication as a method provided by the application. (We have omitted the word “process,” because the process is defined by the main line of code of the application program.) The indication method is exported by the application program as a plug-in routine, and so it ends up being invoked not by the application process, but by the operating system as part of the interrupt processing.

In conclusion, we repeat that what we have described is just a model rather than a prescription for an implementation. The purpose of any model is to elucidate all aspects of a concept; in the case of OSI interlayer exchanges, the major issue is the asynchronous nature of the indications. As a matter of fact, quite a few years ago, one of the authors was part of a team at Sperry Univac which implemented a variant of the OSI protocol suite on a Varian Data Machines minicomputer in exactly the way described above. To be precise, the link and physical layers were implemented in hardware (a separate board); the rest were object libraries.6 This implementation, however, was dictated by the absence of an interprocess communications mechanism in the operating system.

4.1.3 Functional Description of Layers

We will follow the model top down, starting with the application layer. Not surprisingly, there is not much in the standard description of the latter; however, its major functions—in addition to the data transfer—include identification of the interlocutors via an appropriate addressing scheme, authentication, authorization to enter the communications, and negotiation of the quality-of-service parameters. (We will discuss the last point in detail in the context of IP networks.)

Another important function is the establishment of the abstract syntax context, which refers to a translation of data structures (originally defined in a high-level language) to bit strings. The necessity for this came from the earliest days of data communications, starting with character encodings. The IBM terminals used the Extended Binary Coded Decimal Interchange Code (EBCDIC), which was the lingua franca of IBM operating systems, while the Bell System teletypes used the American Standard Code for Information Interchange (ASCII) whose genes came from telegraphy. Minicomputer operating systems adopted ASCII, which was also the standard for a virtual teletype terminal.

Character representation remains an important issue, and it resulted in new standards related to internationalization of the Internet, but this issue is only the tip of the iceberg when it comes to representation of data structures. For one thing, the machines' architectures differed in how the bits were read within an octet (a byte), with both options—left to right and right to left—prevalent. Furthermore, the order of bytes within a computer word (whether 16 or 32 bit) differs among the architectures, too, as does the representation of signed integers and floating-point numbers. Hence the necessity for a mechanism to construct (by the sender) and interpret (by the receiver) the on-the-wire bit string, which is what the abstract syntax context is for.
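The byte-order problem can be seen directly with Python's struct module, which makes the sender and the receiver state the transfer syntax explicitly:

```python
import struct

value = 0x0A0B0C0D                   # the same 32-bit integer...
big = struct.pack(">I", value)       # ...in big-endian (network) byte order
little = struct.pack("<I", value)    # ...in little-endian byte order

# Both sides must agree on the transfer syntax: the receiver interprets
# the on-the-wire string correctly only if it knows the byte order used.
as_big = struct.unpack(">I", big)[0]     # correct reading
as_wrong = struct.unpack("<I", big)[0]   # misreading under the wrong order
```

The two packed strings contain the same four bytes in opposite orders, which is precisely why a negotiated transfer syntax is needed.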

Finally, the OSI standard distinguishes at the application layer between connectionless and connection-oriented modes of operation. The security functions (oddly)—along with functions for establishing the responsibility for error recovery and synchronization—are assigned to the connection-oriented mode only.

Starting with the presentation layer, all layers provide specific services to the layer directly above. The presentation layer's services are the identification of all available transfer syntaxes (i.e., the layouts of the on-the-wire bit strings) and selection of the transfer syntax to use. Beyond that, the presentation layer provides a pass-through to the session layer, effectively translating the session payload by mapping the transfer syntax into the abstract syntax. Negotiation of the transfer syntax is a function of the presentation layer. It is useful to look at the computation aspect here: each payload-carrying request to the presentation layer results in the invocation of a request to the session layer and, conversely, each payload-carrying indication from the session layer results in the invocation of an indication to the application layer. Both are processed synchronously, with the presentation layer translating the payload.7

The session layer provides a duplex connection between two processes (via their respective presentation entities), identified by session addresses. The session-layer services include the establishment and release of the connection and data transfer, as well as the connection resynchronization, management, and relevant exception reporting. In addition, there is a token management service, which allows the communicating processes to take turns in performing control functions. The session layer permits a special, expedited service for shorter session protocol data units, in support of quality-of-service requirements.8

The above services only relate to the connection-oriented mode. In the connectionless mode, the session-layer services reduce merely to a pass-through to the transport layer.

The transport layer actually implements the capabilities in support of the services defined for the layers above. The stated objective of the OSI design was to optimize “the use of the available network-service to provide the performance required by each session-entity at minimum cost.”

The services that the transport layer provides to the session layer actually do not differ in their description from those defined for the session layer, but the transport layer is the one that is doing the job. One interesting feature is session multiplexing, as demonstrated in Figure 4.6.

Diagram shows host X with session endpoints A, B, and C, and host Y with session endpoints D, E, and F. A is connected to D, B is connected to E, and C is connected to F.

Figure 4.6 Session multiplexing in the OSI transport layer.

Here the three sessions—AD, BE, and CF—are multiplexed into the transport-layer connection between the hosts X and Y. (Unfortunately, from the very onset of the OSI standardization the model has not been implemented consistently, standing in the way of interoperability. As we will see, the Internet model dispensed with the distinction between the session and transport layers altogether.) In turn, the transport layer can multiplex transport connections to network-layer connections. The latter can also be split when necessary.
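Multiplexing itself is conceptually simple: each protocol data unit carries an identifier of the session it belongs to, and the receiving transport entity demultiplexes accordingly. A toy Python sketch (session identifiers and message contents are ours, purely illustrative):

```python
class TransportConnection:
    # Sketch: several sessions share one transport connection; each PDU
    # on the wire is tagged with the identifier of its session.
    def __init__(self):
        self.wire = []                      # tagged PDUs "on the wire"

    def send(self, session_id, payload):
        self.wire.append((session_id, payload))

    def deliver(self):
        # Demultiplex: sort the received PDUs into per-session inboxes.
        inboxes = {}
        for session_id, payload in self.wire:
            inboxes.setdefault(session_id, []).append(payload)
        return inboxes

conn = TransportConnection()
conn.send("s1", b"move: e2-e4")
conn.send("s2", b"file chunk 1")
conn.send("s1", b"move: e7-e5")
inboxes = conn.deliver()   # each session's PDUs, in order, kept apart
```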

Additional unique functions of the transport layer include the transport protocol data unit sequence control, segmenting and concatenation of the transport protocol data units, flow control, and error detection and recovery, all of these augmented by expedited data transfer and quality-of-service monitoring. This set of functions guarantees, in connection-oriented mode, a controllable, end-to-end error-free pipe.9

To achieve this is a major feat, as the connectionless network may deliver the data units out of order or lose some of them altogether. The establishment and tearing down of the transport-layer connection is a non-trivial problem in itself (as can be learned from Chapter 6 of [5]).

Once the connection is established, the transport layer must enumerate all the data units it sends10 and then keep track of the acknowledgments from the peer site that contain the sequence numbers of the received data units. All unacknowledged protocol data units need to be retransmitted. On the receiving side, for every two protocol data units with non-consecutive sequence numbers, the transport layer must wait to collect all the units that fill that gap before passing the data to the session layer. There are complex heuristics involved in tuning multiple parameters, particularly the timer values, and the overall scheme is further complicated by the need for recovery procedures for the cases where one or both of the hosts crash during data transfer.
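The receiver-side bookkeeping described above (holding back out-of-order units until the gap before them is filled) can be sketched as follows; the sequence numbering and names are illustrative, and a real implementation would also bound the buffer and handle sequence-number wraparound:

```python
class Reassembler:
    # Receiver-side sketch: buffer out-of-order PDUs and release them to
    # the layer above only in consecutive sequence-number order.
    def __init__(self):
        self.expected = 0      # next sequence number we can release
        self.buffer = {}       # seq -> data, held while a gap exists
        self.released = []     # what has been passed to the layer above

    def receive(self, seq, data):
        self.buffer[seq] = data
        while self.expected in self.buffer:   # a gap has been filled
            self.released.append(self.buffer.pop(self.expected))
            self.expected += 1

r = Reassembler()
r.receive(2, b"unit-2")     # arrives early: held back
r.receive(0, b"unit-0")     # in order: released at once
r.receive(1, b"unit-1")     # fills the gap: units 1 and 2 both released
```

The value of `expected` doubles as the basis for a cumulative acknowledgment: everything below it has been received and delivered in order.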

Error detection and correction involves computing a standard hash function (such as a checksum) for the transport protocol data unit. The sender sends the computed quantity (typically as part of a header) and the receiver computes the function over the received message and compares the result with the quantity computed by the sender. In the simplest case, where no error correction is involved, if the comparison fails, the received message is discarded and no acknowledgment is sent back. The absence of an acknowledgment eventually causes the sender to retransmit the message.
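As an illustration, here is a simplified 16-bit ones'-complement checksum in the style of the Internet checksum; it is a sketch, not a full implementation of the standard algorithm:

```python
def checksum16(data: bytes) -> int:
    # A 16-bit ones'-complement checksum in the style of the Internet
    # checksum (a simplified sketch).
    if len(data) % 2:
        data += b"\x00"                              # pad to 16-bit words
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)     # wrap the carry around
    return ~total & 0xFFFF

payload = b"transport protocol data unit"
advertised = checksum16(payload)                     # sent in the header

# The receiver recomputes and compares; a corrupted byte makes it fail.
received_ok = checksum16(payload) == advertised
received_bad = checksum16(b"transport protocoX data unit") == advertised
```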

Error correction employs special hash functions whose results carry “redundant” bits so that it is possible to reconstruct corrupted bits in the payload. The cyclic redundancy check (CRC) is a typical class of hash functions used for error detection and correction, but CRC performs best when computed in hardware. Jumping ahead again, we note that in the Internet the Transmission Control Protocol (TCP) uses a simple type of checksum for this purpose, and so error correction is achieved through retransmission.

The basic service that the network layer provides to the transport layer is transfer of data between transport entities. With that, the purpose is to shield the transport layer from the detail of relaying its protocol data units via the network elements. The network layer protocol data units are called packets.

In the network, the packets that arrive from the host may be broken into smaller pieces and sent on their way on different routes. (Imagine taking to the post office a parcel that weighs a ton to be mailed overseas. Most likely, you will be asked to break it into separate, smaller parcels. After that, some parcels may end up going to their destination by boat, while others will be flown.) This is pretty much how the connectionless network layer mode works.
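The parcel analogy corresponds to fragmentation and reassembly, which can be sketched as follows (the offset tagging is a simplification of what real network-layer headers carry):

```python
def fragment(packet: bytes, mtu: int):
    # Break a packet into fragments no larger than the link MTU, tagging
    # each with its byte offset so the receiver can reorder them.
    return [(offset, packet[offset:offset + mtu])
            for offset in range(0, len(packet), mtu)]

def reassemble(fragments):
    # Sort by offset and concatenate; a sketch that assumes no loss.
    return b"".join(data for _, data in sorted(fragments))

packet = b"a parcel that weighs a ton"
frags = fragment(packet, 8)     # four fragments of at most 8 bytes
frags.reverse()                 # simulate out-of-order arrival
restored = reassemble(frags)    # equals the original packet
```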

In the OSI model, the network layer supports both connection-oriented and connectionless modes of operation, but these modes are very different, and this makes the OSI network layer rather complex. (In the Internet model, the connection-oriented mode was initially rejected altogether, but once it was introduced, the network layer became complex, too.)

In the connection-oriented mode, the network connection establishment phase results in the end-to-end virtual circuit traversing the network. In other words, a route from one host in the network is set up once for the duration of the connection, so all packets on the given virtual circuit traverse the same network elements. This helps in engineering the network and guaranteeing—for each connection—its specific quality-of-service parameters, such as throughput and end-to-end transit delay (and its variation). Furthermore, the connection-oriented mode allows the transport layer to outsource the lion's share of its function to the network layer. The main disadvantage of this mode is the expense involved in tying up the resources in the network elements for the duration of the connection. If the endpoints need to exchange a short message only once in a blue moon, there is no need to establish and maintain a connection which would remain idle most of the time. However, if short connections need to be set up often then the connection setup time becomes a factor to consider. Another disadvantage is a single point of failure: if a network element crashes, all virtual circuits that pass through it are broken, and all respective sessions are terminated. These disadvantages were not detrimental to the telecommunications business, as shown by the Public Service Data Networks (PSDN), based on a set of ITU-T X-series recommendations (most notably X.25 [6], which specifies the network layer).

A little historical detour is worthwhile here. The study of connection-oriented services was neither an academic nor a standardization exercise. By the end of the 1960s, there was an urgent need to connect computers in geographically separate parts of an enterprise. At the time, the telecommunications companies that already had vast networks, to which the enterprises were physically connected, were in a unique position to satisfy this business need.

In the 1970s, following the analogy with the Public Switched Telephone Networks (PSTN) service, the Public Data Network (PDN) services were developed. The Datran Data Dial service, which became operational in the USA in 1974 and later was provided by the Southern Pacific Communications Company,11 had a virtual circuit with an error rate no worse than one bit in 10^7 bits transmitted [7]. After that, PDN services continued to grow. In the 1980s, telephone companies started to develop the Integrated Services Digital Network (ISDN), which provides pipes combining circuit-switched voice with X.25-based packet-switched data, the latter to be delivered over the PDN virtual circuits. In the 1990s, ISDN was deployed in several countries in Europe and in Japan, and standardization proceeded to prepare the world for broadband ISDN, in which the network layer employs X.25-like frame relay and Asynchronous Transfer Mode (ATM) protocols, both providing virtual circuits although based on different technologies. These developments culminated in the mid-1990s, but in the late 1990s it was clear that the World Wide Web, and with it the Internet, had won. The network layer in the Internet standards at that time had no place for the ISO-like connection-oriented model, although, as we will see later, it started to develop something similar.

In the connectionless mode, the network layer, in general, is not responsible for the data that are handed to it beyond the best effort to deliver it to its destination. Once a packet is received by a network element, it merely passes it down to its neighbor. The decision to select this neighbor is dictated by the packet's destination address and the network layer's routing table, which can be provisioned or built dynamically using one or another routing algorithm. A routing algorithm computes the shortest path to the destination. “Short” here refers not to the geographic distance, but to that defined by the weight assigned to each link. (The metrics here range from a link's capacity to its current load.) Routing demands truly distributed computation: all routers participate in this computation by advertising their presence and propagating the information about their links. (Once again we refer the reader to [5] for comprehensive coverage of the subject, including a bibliography.) In the simplest case, a technique called flooding, initially considered in ARPANET, was used. With that technique, there are no routing tables: a network element simply resends the packet it received to each of its neighbors—except, of course, the neighbor that was the source of the packet.
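
The shortest-path computation over weighted links can be sketched with Dijkstra's algorithm; the four-router topology and the link weights below are, of course, hypothetical:

```python
import heapq

def shortest_paths(links, source):
    """Compute least-weight distances from source, Dijkstra-style.
    links: dict mapping node -> list of (neighbor, weight) pairs."""
    dist = {source: 0}
    queue = [(0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry; a shorter path was already found
        for neighbor, weight in links[node]:
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(queue, (nd, neighbor))
    return dist

# A toy four-router topology; a weight might reflect load or inverse capacity.
links = {
    "A": [("B", 1), ("C", 4)],
    "B": [("A", 1), ("C", 2), ("D", 5)],
    "C": [("A", 4), ("B", 2), ("D", 1)],
    "D": [("B", 5), ("C", 1)],
}
print(shortest_paths(links, "A"))  # A reaches D via B and C at total weight 4
```

Note that the direct link A–C (weight 4) loses to the two-hop path through B (weight 3), which is exactly why “short” is defined by weights rather than hop counts.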

Whether connectionless or connection-oriented, the network layer is the first, among those considered so far, whose operation is based on relaying rather than point-to-point communications. In fact, in the early networks (and in an early version of the OSI model), the network layer was the only one that was explicitly based on relaying. With that, the OSI model explicitly forbids relaying in the layers above the network layer and below the application layer.

Now that we are ready to discuss the lower layers of the OSI reference model, we will have to go back and forth between the link layer and the physical layer. Here is why. The later version of the OSI reference model standard diplomatically states: “The services of the Physical Layer are determined by the characteristics of the underlying medium … ,” but the link layer remains a catch-all for these services. There is nothing wrong with that per se, as the link layer has a clear and unique function, but for pedagogical reasons it is much better to elucidate the aspects of this function in view of the very characteristics of the underlying medium.

Going back to the early days of data networking, the physical connections (such as a twisted pair, or a circuit leased from a telephone company, or focused microwave links, or a satellite link) were all point to point. With that, data transfer along lines could have a significant error rate, especially when long-distance analog lines were involved. Analog transmission required modems at the endpoints, and it had an error rate two orders of magnitude higher than that of digital transmission.

Again, some historical review is in order. As late as the 1980s and early 1990s, only an analog service was available to individual telephone company subscribers at home; however, digital services were available to corporate subscribers.12 The Western Union Broadband Exchange Service was successfully launched in Texas in 1964. According to [7], at the beginning of the 1970s Multicom and Caducée followed with similar services in Canada and France, respectively, providing “full-duplex, point-to-point switched service at speeds up to 4800 bits per second, while providing low bit error rates which had previously only been generally available on private lines.” These services, in fact, were the beginning of the PDN service—at the physical layer.

The physical-layer service is essentially a raw bit stream. Let us first consider a point-to-point connection. In this case, the function of the link layer is almost identical to that of the transport layer: a connection between two endpoints needs to be established, and then the protocol data units are exchanged between the stations. On the sending side, the link layer encapsulates the network payload into frames, which are fed bit by bit to the physical layer. The payload in the frames is accompanied by error-correcting codes. As we observed, a connection can be faulty, especially on a long-distance analog dial-up line employing modems at both ends. Error correction was thus an essential service provided by the link layer for this type of medium.
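
A minimal framing sketch may help here. Link protocols of that era, such as HDLC, appended a CRC to each frame so that a damaged frame could be detected and retransmitted; the sketch below follows that pattern, and the 2-byte length field is our own simplification rather than any standard frame format.

```python
import struct
import zlib

def frame(payload: bytes) -> bytes:
    """Encapsulate a network-layer payload into a link-layer frame:
    a 2-byte length, the payload, and a 4-byte CRC-32 over both."""
    header = struct.pack("!H", len(payload))
    crc = zlib.crc32(header + payload)
    return header + payload + struct.pack("!I", crc)

def deframe(data: bytes) -> bytes:
    """Verify the CRC and strip the framing; raise on a damaged frame,
    which would trigger a link-layer retransmission."""
    (length,) = struct.unpack("!H", data[:2])
    payload = data[2:2 + length]
    (crc,) = struct.unpack("!I", data[2 + length:])
    if zlib.crc32(data[:2 + length]) != crc:
        raise ValueError("frame corrupted in transit")
    return payload

assert deframe(frame(b"an IP packet")) == b"an IP packet"
```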

Error correction at the link layer was criticized by people who believed that the job done at the transport layer was sufficient; Figure 4.7 illustrates the counter-argument. Here, what can be fixed by local retransmission on a faulty link when error correction is employed has to be made up by retransmission across the network when it is not. The situation gets even worse when there are several faulty links in the path of a message.
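
The counter-argument can be quantified with a back-of-the-envelope computation. Assuming each of n links independently corrupts a frame with probability p (our numbers below are illustrative only), hop-by-hop recovery repeats just the faulty hop, while purely end-to-end recovery repeats the whole path:

```python
def expected_link_traversals(n_links: int, p_error: float) -> tuple:
    """Expected total number of link traversals to deliver one message:
    hop-by-hop recovery vs. purely end-to-end recovery."""
    per_hop = n_links / (1 - p_error)                 # each hop retried independently
    end_to_end = n_links / (1 - p_error) ** n_links   # the whole path retried
    return per_hop, end_to_end

hop, e2e = expected_link_traversals(n_links=10, p_error=0.1)
print(round(hop, 1), round(e2e, 1))  # prints 11.1 28.7
```

With ten links at a 10% loss rate, end-to-end recovery costs more than twice as many transmissions, and the gap widens rapidly with more faulty links, which is precisely the point of Figure 4.7.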


Figure 4.7 The case for error correction at the link layer.

Meanwhile, toward the mid-1970s the physical media changed in a major way.

Starting with the invention of the LAN [8], broadcast media proliferated. Figure 4.8 depicts four types: the bus, star, and wireless LANs actually share a single medium; the ring LAN combines a series of bit repeaters.


Figure 4.8 Broadcast media configurations.

This development changed the requirements for the link layer in more than one way. First, broadcast demands unique link-layer identities, which must be explicitly carried in every frame. Second, LANs operate at much higher bit rates, enabling voice and video traffic, and hence it made sense to develop quality-of-service parameters to allocate bandwidth according to specific traffic priorities. Third, the very nature of the broadcast results in pushing an essential network layer function—locating the receiver of a message (within the LAN perimeter, of course)—down to the link and physical layers. Fourth, while accepting some network-layer function, the link layer could easily delegate the function of maintaining a connection to the transport layer. Indeed, the word “link” in the link layer almost became a misnomer. Fifth, the shared media required a special set of functions for access control, which made a distinct sublayer of the link layer. Sixth, broadcast media—as opposed to point-to-point links—demanded the implementation of security mechanisms to protect data.

We will return to LAN in Chapter 6, when discussing storage.

4.2 The Internet Protocol Suite

A 1974 paper [9] by Vint Cerf and Bob Kahn laid the ground for what became the Internet protocol suite.13 The paper itself, which pre-dates the creation of the ISO OSI standards project by three years, is a model of a crystal-clear and concise technical approach. It enumerates all the aspects of the networking problems present at the time, deduces the requirements for an overarching solution, and produces a solution by systematically applying a minimalistic approach.

It is worth following the original motivation and the resulting requirements as presented in [9]. The paper observes that the data communications protocols that operate within a single packet-switched network already exist, and then enumerates and examines the five internetworking operation issues.

The first issue is addressing. Recognizing that multiple networks with “distinct ways of addressing the receiver” exist, the paper stresses the need for a “uniform addressing scheme … which can be understood by each individual network.”

The second issue is the maximum packet size. Because the maximum size of a single data unit that a given network can accept differs among networks, and the smallest such maximum size “may be impractically small” to agree on as a standard, it offers the alternative of “requiring procedures which allow data crossing a network boundary to be reformatted into smaller pieces.”

The third issue is the maximum acceptable end-to-end delay for an acknowledgment on whose expiration the packet can be considered lost. Such delay values vary among the networks; hence the need for “careful development of internetwork timing procedures to insure that data can be successfully delivered through the various networks.”

The fourth issue is mutation and loss of data, which necessitates “end-to-end restoration procedures” for the recovery.

Finally, the fifth issue is the variation among the networks in their “status information, routing, fault detection, and isolation.” To deal with this, “various kinds of coordination must be invoked between the communicating networks.”

To deal with the first issue, a uniform internetwork address is proposed, in which a “TCP address”14 contains the network identifier and the TCP identifier, which in turn specify the host within that network, and the port—which is a direct pipe to a communicating process.

To deal with the second issue, the concept of a gateway connecting two networks is proposed. Being aware of each of the two networks it connects, the gateway mediates between them by, for example, fragmenting and reassembling the packets when necessary. The rest of the issues are dealt with entirely end to end. The paper specifies in much detail the procedures for retransmission and duplicate detection—based on the sliding windows mechanism already used by the French Cyclades system.

In the conclusion, the paper calls for the production of a detailed protocol specification so that experiments can be performed in order to determine operational parameters (e.g., retransmission timeouts).

In the next six years, quite a few detailed protocol specifications that followed the proposal were developed for the ARPANET. In the process, the protocol was broken into two independent pieces: the Internetworking Protocol (IP) dealt with the packet structure and the procedures at the network layer; the Transmission Control Protocol (TCP) dealt with the end-to-end transport-layer issues. Even though other transport protocols were accepted for the Internet, the term TCP/IP has become the norm when referring to the overall Internet protocol suite.

In 1981, the stable standard IP (IP version 4 or IPv4) was published in [10] as the US Department of Defense Internet Protocol.15 This protocol is widely used today, although a newer standard—IPv6—is being deployed, and the IETF has even been working on developing versions past that.

4.2.1 IP—The Glue of the Internet

Figure 4.9 displays the structure of the IPv4 packet, which—quite amazingly—has remained unchanged even though some fields have been reinterpreted, as we will see soon.


Figure 4.9 The IPv4 packet header.

It is essential to understand the fields, since IPv4 is still the major version of the protocol deployed in the Internet (and consequently used in the Cloud).

We start with the IPv4 solution to the Internet addressing problem, as that problem was listed first in Cerf and Kahn's original vision that we discussed earlier. Each IP packet has a source and a destination address, which are of the same type—a 32-bit string representing a pair:

IP address = ⟨Network Number, Local (Host) Address⟩

It is important to emphasize that by its very definition, an IP address is not an address of a host in that it is not unique to a host. A host may be multi-homed, that is, reside on more than one network; in this case, it will have as many IP addresses as there are networks it belongs to.16 Similarly, a router, whose role is to connect networks, has as many distinct IP addresses as there are networks it interconnects.

In fact, RFC 791 even allows a host attached to only one network to have several IP addresses: “a single physical host must be able to act as if it were several distinct hosts to the extent of using several distinct internet addresses.” Conversely, with anycast addressing (which was not even envisioned at the time of RFC 791, but which we will encounter later), a common IP address is assigned to multiple hosts delivering the same service—for the purpose of load balancing.

Back to the IP address logical structure. With the initial design, the network address Class tag had three values: A, B, and C. To this end, RFC 791 states:

“There are three formats or classes of internet addresses: in class a, the high order bit is zero, the next 7 bits are the network, and the last 24 bits are the local address; in class b, the high order two bits are one-zero, the next 14 bits are the network and the last 16 bits are the local address; in class c, the high order three bits are one-one-zero, the next 21 bits are the network and the last 8 bits are the local address.”

First, we observe that the Class tag has been placed at the very beginning of the address. The parsing of the IP string would start with determining the class. If the leftmost bit of the IP address is “0,” the class is A; if the first two bits are “10,” the class is B; and if the first three bits are “110,” it is C. This encoding convention was easily extensible. (For example, Class D—with the “1110” tag—was later defined for multicast addresses.17)
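
As a sketch, the class of an address can be determined from the first octet alone; the helper below, including the Class E catch-all for the reserved “1111” prefix, is our own illustration:

```python
def address_class(ip: str) -> str:
    """Determine the original class of a dotted-decimal IPv4 address
    from its leading bits, per RFC 791 (plus the later Class D)."""
    first_octet = int(ip.split(".")[0])
    if first_octet < 128:    # leading bit 0
        return "A"
    if first_octet < 192:    # leading bits 10
        return "B"
    if first_octet < 224:    # leading bits 110
        return "C"
    if first_octet < 240:    # leading bits 1110 (multicast)
        return "D"
    return "E"               # leading bits 1111 (reserved)

assert address_class("10.0.0.1") == "A"
assert address_class("171.16.23.42") == "B"
assert address_class("224.0.0.5") == "D"
```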

Second, the idea of the class-based scheme was to have many small (Class C) networks, fewer Class B networks—which have more hosts, and only 128 huge Class A networks. This idea was very reasonable at the time. Figure 4.10 depicts the map of the Internet ca. 1982, as created by the late Jon Postel.18

Figure 4.10 Jon Postel's map of the Internet in February 1982 (the ovals are sites or networks; the rectangles are individual routers). Source: http://commons.wikimedia.org/wiki/File%3AInternet_map_in_February_82.png. By Jon Postel [Public domain], via Wikimedia Commons.

About 10 years later, however, the IETF identified three major problems, all unanticipated and unprecedented, which occurred with the growth of the Internet. First, the address space of mid-sized Class B networks was becoming exhausted. Second, the routing tables had grown much too large for routers to maintain, and—worse—they kept growing. Third, the overall IP address space was on its way to being exhausted.

The last problem, by its very nature, could not be dealt with in the context of the IPv4 design, but the first two problems could be fixed by doing away with the concept of a class. Thus, the Classless Inter-Domain Routing (CIDR) scheme emerged, first as an interim solution (until IPv6 took over) and then (since IPv6 has not taken over) as a more or less permanent solution. After the publication of three successive RFCs, it was published in 2006 as IETF Best Current Practice, in RFC 4632.19

CIDR gets rid of the class tag, replacing it with a specific “network mask” that delineates the prefix—the exact number of bits in the network part of the address. With that, the assignment of prefixes was “intended to follow the underlying Internet topology so that aggregation can be used to facilitate scaling of the global routing system.” The concept of CIDR aggregation is illustrated in Figure 4.11.


Figure 4.11 CIDR aggregation.

Network A, identified by its prefix x, aggregates the address space of its subnets—networks B and C, whose prefixes are, respectively, x0 and x1. Thus, to reach a host in any subnet, the routers on the way only need to find the longest matching prefix.
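
A linear-scan sketch of longest-prefix matching follows; production routers use tries or TCAM rather than a list, and the prefixes below are hypothetical:

```python
import ipaddress

def longest_prefix_match(table, destination):
    """Pick the most specific route: among all prefixes containing
    the destination, choose the one with the longest prefix."""
    addr = ipaddress.ip_address(destination)
    candidates = [net for net in table if addr in net]
    return max(candidates, key=lambda net: net.prefixlen, default=None)

# A hypothetical table: network A (10.0.0.0/8) aggregates its two
# subnets B (10.0.0.0/9) and C (10.128.0.0/9).
table = [ipaddress.ip_network(p)
         for p in ("10.0.0.0/8", "10.0.0.0/9", "10.128.0.0/9")]
route = longest_prefix_match(table, "10.200.1.1")
print(route)  # 10.128.0.0/9 wins over the aggregate 10.0.0.0/8
```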

By 2006, the Internet provider business was sufficiently consolidated, and so it made sense to assign prefixes. Consequently, the “prefix assignment and aggregation is generally done according to provider-subscriber relationships, since that is how the Internet topology is determined.” Of course, the strategy was to be backward-compatible with the class-based system, and, as we will see shortly, this is exactly how the addressing works.

The prefixes are specified by the number of bits in the network portion of an IP address, appended to the address after the “/” character. At this point, it is necessary to reflect on the IP address notation. Even though the semantics of an IP address is a pair of integer numbers, IP addresses have traditionally been spelled out in decimal notation, byte by byte, separated by dots, as in 171.16.23.42. The mask is spelled out the same way. The above address belongs to Class B, so its network mask is 255.255.0.0, and the full address in the prefix notation is 171.16.23.42/16. Similarly, a Class C address 192.168.99.17 has a mask of 255.255.255.0 and is specified as 192.168.99.17/24.
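
The correspondence between prefix lengths and dotted-decimal masks can be sketched with the Python standard library:

```python
import ipaddress

def mask_from_prefix(prefix_len: int) -> str:
    """Dotted-decimal network mask for a given prefix length."""
    return str(ipaddress.ip_network(f"0.0.0.0/{prefix_len}").netmask)

def prefix_from_mask(mask: str) -> int:
    """Prefix length (number of leading 1-bits) of a dotted-decimal mask."""
    return ipaddress.ip_network(f"0.0.0.0/{mask}").prefixlen

assert mask_from_prefix(16) == "255.255.0.0"    # the old Class B boundary
assert prefix_from_mask("255.255.255.0") == 24  # the old Class C boundary
```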

Figure 4.12 demonstrates how an existing Class B network can be broken into four subnets internally.

Figure 4.12 “Subnetting” a Class B network: the subnet mask extends the 16-bit network address to 18 bits, leaving 14 bits for the host address.

At this point, we must recognize that we have veered away from RFC 791 by fast-forwarding 20 years. Now we get back to the discussion of the rest of the IP header fields of Figure 4.9.

The first field is version, and it defines the structure of the rest of the header. It appears to be the most wasteful field in terms of the IP header real estate. The four bits were assigned to indicate the protocol version; as helpful as they were in experimentation, only two versions—IPv4 and IPv6—have been used in all these years. Yet the IP was designed to live forever, and it makes sense to anticipate a couple of new versions becoming widely used every 30 years or so.

The Internet Header Length (IHL) is just what it says it is: the length of the Internet header in 32-bit words. This is, in effect, a pointer to the beginning of the IP payload.

The type-of-service field has been designed to specify the quality of service parameters. It is interesting that even though the only applications explicitly listed in this (1981!) specification are telnet and FTP, the packet precedence had already been thought through and, as the document reports, implemented in some networks. The major choice, at the time, was a three-way tradeoff between low delay, high reliability, and high throughput. These bits, however, have been reinterpreted later as differentiated services; we will discuss the new interpretation in the next section.

Total length specifies the length of the packet, measured in bytes. (The length of the header is also counted.) Even though the field is 16 bits long, it is recommended that the hosts send packets longer than 576 bytes only if they are assured that the destination host is prepared to accept them. At the time of this writing the maximum packet size of 65,535 is already considered limiting, but in 1981 long packets were expected to be fragmented. To this end, the next three fields deal with fragmentation matters.

The identification field provides the value intended to “aid in assembling the fragments of a datagram.” (There were attempts to redefine it.) The flags field (the first bit still left unused) contains the DF and MF 1-bit fields, which stand respectively for Don't Fragment and More Fragments. The former flag instructs the router to drop the packet (and report an error) rather than fragment it; the latter flag indicates that more fragments are to follow. The fragment offset is the pointer to the place of the present fragment in the original payload. (Thus the first fragment has the offset value of 0.) It is important to note that IPv6 has dispensed with fragmentation altogether. Instead, it requires that the packet be sent over a path that accepts it.
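
A fragmentation sketch illustrates why the 13-bit fragment offset field suffices: it counts 8-byte units, so every non-final fragment must carry a multiple of 8 bytes. (The tuple representation below is our own simplification; real fragments, of course, carry full IP headers.)

```python
def fragment(payload: bytes, mtu_payload: int):
    """Split an IP payload into fragments that fit mtu_payload bytes.
    Returns (offset_in_8_byte_units, more_fragments_flag, data) tuples."""
    chunk = (mtu_payload // 8) * 8  # round down to an 8-byte boundary
    fragments = []
    for start in range(0, len(payload), chunk):
        data = payload[start:start + chunk]
        more = start + chunk < len(payload)  # the MF flag
        fragments.append((start // 8, more, data))
    return fragments

frags = fragment(b"x" * 100, mtu_payload=40)
print([(off, mf, len(d)) for off, mf, d in frags])
# offsets 0, 5, 10 in 8-byte units; only the last fragment clears MF
```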

The Time-To-Live (TTL) field is there to prevent infinite looping of a packet. Since the original Internet envisioned no circuits, the Internet routers applied the “best-effort” practice when forwarding a packet toward its destination. Routing protocols make routers react to changes in the network and change the routing tables, but propagation of such changes can be slow. As a result, packets may end up in loops. The value of TTL is used to determine whether a packet is vagrant. This field is “mutable” in that it is updated by each router. Originally, it was set to the upper bound of the time it might take the packet to traverse the network, and each router was supposed to modify it by subtracting the time it took to process the packet. Once the value reached zero, the packet was supposed to be discarded. The semantics of the field has changed slightly: instead of being measured in fractions of a second (which is somewhat awkward to deal with), it is measured in number of hops.

The protocol field specifies the transport-layer protocol employed. Going back to the computational model we discussed at the beginning of this chapter, it is necessary to determine which procedure to call (or which process to send it to) on reception of the packet by the host.

The header checksum is computed at each router by adding all the 16-bit words of the header (presumably while they are arriving), using 1's complement addition, and taking the 1's complement of the sum. When the packet arrives, the checksum computed over the whole header must be equal to zero if there were no errors—because the header already includes the previously computed checksum. If the resulting value is non-zero, then the packet is discarded. (Note that only the header is checked at the network layer; verification of the payload is the job of upper layers.) If the router needs to forward the packet, it recomputes the checksum after the TTL value is changed.
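
The 1's complement arithmetic is easy to get wrong, so a sketch may help. The 20-byte header below is a made-up example with no options; note how verifying the patched header yields exactly zero, as described above.

```python
import struct

def header_checksum(header: bytes) -> int:
    """1's complement sum of the header's 16-bit words, complemented.
    The caller zeroes the checksum field before the initial computation."""
    total = 0
    for (word,) in struct.iter_unpack("!H", header):
        total += word
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF

# A minimal 20-byte header with the checksum field (bytes 10-11) zeroed.
header = bytearray(struct.pack(
    "!BBHHHBBH4s4s",
    0x45, 0, 20,            # version/IHL, type of service, total length
    0, 0,                   # identification, flags/fragment offset
    64, 6, 0,               # TTL, protocol (TCP), checksum placeholder
    bytes([192, 168, 0, 1]), bytes([192, 168, 0, 2])))
struct.pack_into("!H", header, 10, header_checksum(bytes(header)))
# A receiver summing the full header (checksum included) must get zero.
assert header_checksum(bytes(header)) == 0
```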

Only at this point do the source and destination addresses appear in the header, followed by the options field. The latter field is optional; it has not been used much recently. (In the past, one typical use was to spell out explicit routes.) Finally, the packet is padded with a sufficient number of bits to stay on the 32-bit boundary (in case the options don't end with a full 32-bit word).

This completes our overview of the IPv4 header. It should be clear now how IPv4 solves the original problem of network interconnection. In the process, we have also mentioned several new problems, among them IP address space exhaustion. We have not discussed security, which is an elephant in the room and in need of a separate chapter (to fill in all the prerequisites). Another big problem is the absence of clear support for what is called “quality of service.” We will deal with this problem at length in Section 4.3; however, we mention it now because IPv6, which we are going to address next, has provided an improvement over IPv4.

The (basic) IPv6 header is depicted in Figure 4.13.


Figure 4.13 The IPv6 basic packet header (after RFC 2460).

What is perhaps most noticeable from the first, cursory glance at the IPv6 header is that it appears to be simpler than that of IPv4. Indeed, it is! Some fields have been dropped or made optional, which makes it not only easier to comprehend, but also faster to execute. The header is extensible in that new headers (which can be defined separately, thus improving modularity) are linked together in an IP packet by means of the next header field. Hence the word “basic” in the definition of the header.

The major change, of course, is brought about by quadrupling the IP address size—from 32 to 128 bits. Besides the obvious benefit of a much greater number of addressable nodes, this also supports addressing hierarchy. IPv6 improves the scalability of multicast routing by supporting the definition of the scope of a multicast address. IPv6 also formally defines the anycast address.

The IPv4 time-to-live field is now appropriately renamed the hop limit; the type-of-service field is renamed traffic class. The latter field supports quality of service, as does the new field called flow label. Again, we will return to this subject and cover it in detail, but it is worth mentioning right now that this field supports a capability that in effect is equivalent to that of providing a virtual circuit.

RFC 2460,20 which defines the basic IPv6 header, describes this flow labeling capability as follows: “A new capability is added to enable the labeling of packets belonging to particular traffic “flows” for which the sender requests special handling, such as non-default quality of service or ‘real-time’ service.” The main reason for adding this capability was to correct a layering violation committed in an IPv4-based solution, which we will review later.

The rest of the IPv6 header's specifications (beyond the basic header) are referenced in RFC 2460.21 As a final note, even though IPv6 is not fully deployed, there are IPv6 networks as well as IPv6-capable devices. The IPv6 networks can be interconnected through IPv4, by means of tunneling. After all, IP has been developed with the primary goal of interconnecting networks!

So far, we have only briefly mentioned the routing protocols, whose job is to compute routing maps. Let us take a look at Figure 4.14. To begin with, no routing is needed in LANs, even though each host uses IP. (LANs can be interconnected further by Layer-1 and Layer-2 switches to build even larger LANs called metropolitan-area networks. They can further be organized into wide-area networks using circuit-switched technology, such as a leased telephone line, or virtual-circuit-switched technology, such as frame relay, Asynchronous Transfer Mode (ATM), or Multi-Protocol Label Switching (MPLS), which we will discuss later.) The moment routers are involved, however, they need to learn about other routers in the network.


Figure 4.14 Routing protocol classification: (a) LAN, no routing needed; (b) routing within and among autonomous systems.

The question is: “Which network?” For sure, no router can hold the routing map of the whole Internet. Small networks that employ only a few routers can have static routing maps provisioned in them. Routers within a larger network need to implement one or another interior gateway protocol. The constrained space of this book does not allow any further discussion of these.22 The case of exterior routing among Autonomous Systems (ASs), however, is particularly important to Cloud Computing, because it is involved in offering “data pipe” services. The idea here is similar to the development of geographic maps: a country map to show the highways interconnecting cities (exterior routing) and another type of map to show the streets in a city (interior routing).

As Figure 4.15 shows, each AS is assigned—by its respective Regional Internet Authority (RIR)23—a number called the Autonomous System Number (ASN). To the rest of the world, an AS is represented by routers called border gateways, which exchange information as defined by the Border Gateway Protocol (BGP). The BGP (whose current version is 4) is specified in the RFC 1771.24


Figure 4.15 Autonomous systems and border gateways.

In the past, an “autonomous system” meant an Internet Service Provider (ISP), but this has changed with time. Now an ISP may have several separate ASs within it. According to RFC 1930,25 “The classic definition of an Autonomous System is a set of routers under a single technical administration, using an interior gateway protocol and common metrics to route packets within the AS, and using an exterior gateway protocol to route packets to other ASs.”

The networks under a single technical administration have grown and otherwise evolved into using multiple interior gateway protocols and multiple metrics. The quality of sameness, though, has remained as long as two conditions are met: (1) the network appears to other networks to have a coherent interior routing plan (i.e., it can deliver a packet to any destination IP address belonging to it) and (2) it can tell which networks are reachable through it. The latter factor is non-trivial as not every network may allow the traffic from all other networks to flow through it, and, conversely, a given network may not want its traffic to pass through certain networks. We will return to this later.

Taking these changes into account, RFC 1930 redefines an AS as “a connected group of one or more IP prefixes run by one or more network operators.” With that, this group must have “a single and clearly defined routing policy.” The word policy is key here, reflecting a major difference between the interior and exterior routing objectives. In interior routing, the objective is to compute a complete map of the network and—at the same time—to determine the most efficient route to every network element.

Hence, when two interior routers get connected, they exchange information about all reachable nodes they know about (and then, based on this information, each router recomputes its own network map); conversely, when a router loses a connection, it recomputes its map and advertises the change to its remaining neighbors. In both cases, the routers that receive the news compute their respective maps and propagate the news to their remaining neighbors, and so forth.

This is not the case with exterior routing, where decisions on what information to propagate are made based on policies and agreements. Let us return to Figure 4.15. Here, an autonomous system B knows how to reach autonomous systems A, C, D, and E. It is neither necessary nor expected that B advertise to C automatically all the networks it knows. For example, in order for B to advertise E to C:

  1. The policy of E must not exclude C from its transit destinations.
  2. B must agree to route traffic from E to C.
  3. C must agree to accept traffic from E.
  4. E must agree to accept traffic from C (there is symmetry).
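These four conditions amount to a predicate over the policies of the three parties involved. A minimal sketch, with a hypothetical policy representation (the AS names and policy fields are illustrative, not drawn from any real configuration language):

```python
# Sketch of the four conditions under which autonomous system B
# may advertise E to C (policy structures are hypothetical).
def may_advertise(advertiser, advertised, receiver, policies):
    """Return True if `advertiser` may advertise `advertised` to `receiver`."""
    p = policies
    return (
        receiver in p[advertised]["transit_destinations"]         # 1. E's policy admits C
        and (advertised, receiver) in p[advertiser]["will_route"] # 2. B agrees to route E-C
        and advertised in p[receiver]["accepts_from"]             # 3. C accepts traffic from E
        and receiver in p[advertised]["accepts_from"]             # 4. E accepts traffic from C
    )

policies = {
    "B": {"will_route": {("E", "C"), ("C", "E")}},
    "C": {"accepts_from": {"E"}, "transit_destinations": set()},
    "E": {"accepts_from": {"C"}, "transit_destinations": {"C"}},
}
print(may_advertise("B", "E", "C", policies))  # True
```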

There is, in fact, a taxonomy of relationships between ASs, as depicted in Figure 4.16.

Diagram shows four autonomous systems A, B, C, and D. A and B have a transit relationship, B and C have a peering relationship, A and D have a customer relationship.

Figure 4.16 Transit and (settlement-free) peering relationships.

Here we enter the business territory. We should note right away that the Internet is unregulated, and—to a great extent—self-organized. It has been and is being shaped by business. Contrary to popular belief, the Internet has never been and is not free. In its earlier days it was paid for largely by the US government, which explains the flat structure of Figure 4.10. Much has changed since then!

At the bottom of the food chain, the ISP customers pay their respective ISPs for connection to the Internet.

Smaller ISPs achieve interconnection by paying for a transit relationship, which combines two services: (a) advertising the routes to others (which has the effect of soliciting traffic toward the ISP's customers) and (b) learning other ISPs' routes so as to direct to those the traffic from the ISP's customers. The traffic is billed upstream—that is, a transit network bills for the traffic it receives and pays for the traffic that leaves it.

In peering, only the traffic between the networks and their downstream customers is exchanged and neither network can see upstream routes over the peering connection. Some networks are transit free and rely only on peering.

Initially, the word “peering” meant that the two parties involved did not exchange any money—one major difference from the transit relationship. Later the terminology was compromised, and the notion of a settlement in a peering connection appeared. There is a comprehensive article by Geoff Huston (then of Telstra) [11], which describes the business nuances of the matter.26

Ultimately, the initial concept of peering has been reinstated with the modifier "settlement free." The (very large) service providers that have full access to the Internet through settlement-free peering comprise the Tier-1 networks.

Two networks can be interconnected through private switches, but when there are more than two service providers in a region, it is more efficient to interconnect the peering networks over the Internet Exchange Points (IXPs)—which are typically operated independently. At the moment of this writing there are over 300 IXPs.

It is rather instructive to take a look at a specific policy of a Tier-1 network provider.

AT&T has published the respective document, called “AT&T Global IP Network Settlement-Free Peering Policy.”27 This document (we are looking at the October 2012 official version) first lists the company's objective “to interconnect its IP network with other Internet backbone providers on a settlement-free basis when such interconnection provides tangible benefits to AT&T and its customers.” Then it provides the relevant ASNs: AS7018, for private peering in the USA; AS2685, in Canada; and AS2686, in Latin America. The requests for peering by an ISP must be submitted in writing with information on which countries the ISP serves, in which IXPs it has a presence, the set of ASNs and prefixes served, and the description of the traffic.

Specific requirements are spelled out for peering (in the USA), with AS7018. To begin with, the peer operator must have “a US-wide IP backbone whose links are primarily OC192 (10 Gbps) or greater,” and interconnect with AT&T in at least three points in the USA—one on the East Coast, one in the Central region, and one on the West Coast. In addition, the candidate peer must interconnect with AT&T in two “non-US peering locations on distinct continents where peer has a non-trivial backbone network.” A customer of AS7018 may not be a settlement-free peer.

The bandwidth and traffic requirements are spelled out: “Peer's traffic to/from AS7018 must be on-net only and must amount to an average of at least 7 Gbps in the dominant direction to/from AT&T in the US during the busiest hour of the month.” With that, “the interconnection bandwidth must be at least 10 Gbps at each US interconnection point.” The in-AT&T/out-AT&T traffic ratio is limited to no more than 2:1.

One benefit of peering is cooperation in resolving security incidents and other operational problems. To this end, the candidate peer is expected to cooperate in this and back up its ability to do so by having “a professionally managed 24×7 [Network Operations Center] NOC.”

Finally, we must consider the routing policy requirements—which are of special interest to us inasmuch as they illustrate the constraints of exterior routing:

  1. The peer must announce a consistent set of routes at each point of interconnection.
  2. The peer may not announce transit or third-party routes—only its own routes and the routes of its customers. With that, the peer customer's routes must be filtered by prefix.
  3. The forbidden activities include “pointing a default route … or otherwise forwarding traffic for destinations not explicitly advertised, resetting next-hop, selling or giving next-hop to others.”

Taking business concerns and resulting policies into account, BGP is the protocol for exchanging the AS reachability information. Executing BGP results in constructing a directed acyclic (i.e., loop-free) graph of AS connectivity, which represents the combined policies of all ASs involved. To ensure this, RFC 1771 requires that a "BGP speaker advertise to its peers … in neighboring ASs only those routes that it itself uses." This inevitably leads to "hop-by-hop" routing, which has limitations as far as policies are concerned. One such limitation is the inability to enforce source routing (i.e., spelling out a part or all of the routing path). Yet, BGP does support all policies consistent with the "hop-by-hop" paradigm.

Unlike any other routing protocol, BGP needs reliable transport. It may look strange at first glance that a routing protocol is actually an application-layer protocol, but it is! Routing is an application that serves the network layer. (Similarly, network management applications serve the network layer—as they do all other layers as well.) Border gateways do not necessarily need to be interconnected at the link layer, and hence fragmentation, retransmission, acknowledgment, and sequencing—all functions of the reliable transport layer—need to be implemented. Another requirement is that the transport protocol supports a graceful close, ensuring that all outstanding data be delivered before the connection is closed. Perhaps not surprisingly, these requirements have been met by the TCP28 introduced earlier in this chapter, and so BGP is using TCP for transport. To this end, the TCP port 179 is reserved for BGP connections.

And now that we have started to talk about transport protocols, it is high time we move on to discussing the rest of the protocols in the IP suite.

4.2.2 The Internet Hourglass

The metaphor of Figure 4.17 belongs to Steve Deering, then a member of the Internet Architecture Board, who developed it further in his presentation29 on the state of the Internet protocol at the Plenary Session of the 51st IETF meeting in London, on August 8, 2001.

Diagram shows the internet hourglass. It shows the list of protocols for application, transport, network, link, and physical layers. The waist of the hourglass corresponds to network layer protocol IP.

Figure 4.17 The Internet hourglass.

Noting that the Internet had reached middle age just as he himself had, Dr. Deering suggested that at such an age it is appropriate to watch one's waist.

The waist of the Internet is the IP. There are quite a few link-layer protocols—each corresponding to the respective physical medium—that run below the IP, and there are even more protocols that run above, at the transport layer and, especially, the application layers. The IP, however, used to be one protocol, and a very straightforward protocol at that. The Internet works as long as two maxims belonging to Vint Cerf hold: IP on Everything and Everything on IP.30

In the rest of this section, we will take a very brief—and woefully incomplete—tour of the Internet protocol suite, simply to develop the perspective. We will revisit some of these protocols and discuss them in more detail later in the book.

Physical media span both wireline and wireless LANs, copper and optical fiber links, as well as longer-haul wireless broadcast to communicate with satellites. On point-to-point lines, such as twisted pair, we have the IETF Point-to-Point Protocol (PPP), as well as the ISO High-Level Data Link Control (HDLC) protocol. In the deployment of the IBM Systems Network Architecture (SNA), another supported Layer-2 protocol is Synchronous Data Link Control (SDLC)—a progenitor of HDLC and pretty much all other data link control protocols. LANs invariably employ the IEEE standard Logical Link Control (LLC) protocol family, which also carries SDLC genes.

ATM networks are a special matter. As we mentioned earlier in this chapter, ATM networks were positioned to be the Layer-3 networks. In 2001, even though IP had pretty much won the battle, the notion that an ATM switch was a peer of an IP router was still shared by part of the industry. Hence the complaint of Dr. Deering about the growing waist of the network layer. In a dramatic development, the ATM was finally relegated to the link layer as far as the IP is concerned: ATM Adaptation Layer (AAL) protocols were treated as Layer-2 protocols in the hourglass model. IP on everything!

To this end, the IP can run on top of the IP, too. The technique of encapsulating an IP packet within another IP packet (in other words, making an IP packet a payload of another IP packet by attaching a new IP header) is perfectly legal, and has been used effectively in creating a kind of virtual private network. Dr. Deering lovingly referred to this property of the IP as its waist being “supple,” and reflected this in one of his slides by depicting the waist of the hourglass twisted into a spiral.

At the transport layer there are three protocols, all developed by the IETF. In addition to TCP, which effectively combines the session- and transport-layer services, guaranteeing end-to-end in-sequence, error-free byte-stream delivery, and also special mechanisms to detect network congestion (and adjust to it by exponentially reducing the amount of data sent), there are two other protocols: the User Datagram Protocol (UDP) and the Stream Control Transmission Protocol (SCTP).

The UDP, specified in RFC 768,31 is a connectionless protocol, which provides no delivery guarantees whatsoever. Nor does the UDP require any connection setup. UDP is indispensable in implementing fast transactions in which the application layer performs error control. Another core function of the UDP is transport of “streaming media” (i.e., voice or video packets). The UDP is much better at this because losing an occasional video frame has a much less adverse effect on the perception of video service than a frame “saved” by retransmission (which then causes significant variation in the delay).

The SCTP deserves a more detailed description here, because we will need to refer to it later. It is also important to elucidate why the Internet ended up with two different reliable transport-layer protocols. To begin with, the SCTP was developed as part of the IETF Internet/PSTN interworking movement described in [3]. The initial objective of the SCTP was to transport Signaling System No. 7 messages over the Internet, and so the work took place in the IETF SIGTRAN working group. Later, some application protocols—not related to telephony—specified the SCTP as the transport mechanism of choice, primarily because of its built-in reliability. The protocol is defined in RFC 4960,32 and as is often the case, there are other related RFCs.

Just as TCP does, SCTP provides error-free, non-duplicated transfer of payload, employing data fragmentation and network congestion avoidance. The SCTP has also addressed the following TCP limitations:

  • TCP combines reliable transfer with in-sequence delivery, but the need to separate these two features was clear at the end of the last century. SCTP allows an application to choose either of these features or both.
  • TCP treats the payload as a byte stream. This forces an application to use the push facility to force message transfer at the end of the application-defined record. In contrast, SCTP deals with “data chunks” which are transmitted at once. (SCTP does provide an option to bundle chunks.)
  • TCP is not effective with multi-homing (i.e., a host's attachment to more than one network) while SCTP explicitly supports it, as we will demonstrate shortly.
  • TCP has been vulnerable to synchronization denial-of-service attacks. SCTP has addressed this by maintaining the state of initialization in a special data structure (a cookie).

Multi-homing is illustrated by the example of Figure 4.18. Here, process X has an SCTP session with process Y. Just as in the case of TCP, SCTP provides a single port for each process (pX for process X and pY for process Y) as the interface to transport-layer services.

What makes the example of Figure 4.18 very different from earlier scenarios is that the respective hosts on which the processes run are multi-homed. The host where the process X runs is attached to three networks, A, B, and C; the host where the process Y runs is attached to two networks, D and E. Consequently, the former host has three IP addresses (IA, IB, and IC); the latter host has two IP addresses (ID and IE). The unique and new capability that SCTP adds is demultiplexing—over all involved networks—of the stream of packets received from a process combined with multiplexing of the packets received from these networks on their way to the process. This feature improves reliability as well as performance. The effect on improving the performance is clear when considering what happens when one of the available networks is congested.


Figure 4.18 Multi-homing with SCTP.

The SCTP has slightly—but notably—redefined the semantics of a session, which it calls an association. Whereas a TCP session is a quadruple (IS, PS, ID, PD) defined by the source IP address, source port number, destination IP address, and destination port number, in the SCTP association (<IS>, PS, <ID>, PD), the quantities <IS> and <ID> are respectively the lists of source and destination addresses.
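The difference can be captured with two small data structures. The addresses and port numbers below are hypothetical stand-ins for IA, IB, IC, ID, IE, pX, and pY of Figure 4.18:

```python
# Sketch contrasting a TCP session 4-tuple with an SCTP association,
# in which the source and destination are *sets* of IP addresses.
from dataclasses import dataclass
from typing import FrozenSet

@dataclass(frozen=True)
class TcpSession:
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int

@dataclass(frozen=True)
class SctpAssociation:
    src_ips: FrozenSet[str]   # <IS>: all addresses of the multi-homed source host
    src_port: int
    dst_ips: FrozenSet[str]   # <ID>: all addresses of the multi-homed destination
    dst_port: int

# Host X (Figure 4.18) is attached to networks A, B, C; host Y to D and E.
assoc = SctpAssociation(
    src_ips=frozenset({"10.0.1.1", "10.0.2.1", "10.0.3.1"}),  # hypothetical IA, IB, IC
    src_port=5000,                                            # hypothetical pX
    dst_ips=frozenset({"10.0.4.1", "10.0.5.1"}),              # hypothetical ID, IE
    dst_port=6000,                                            # hypothetical pY
)

def belongs(assoc, src_ip, dst_ip, src_port, dst_port):
    # A packet arriving over *any* (source, destination) address pair
    # still belongs to the same association.
    return (src_ip in assoc.src_ips and dst_ip in assoc.dst_ips
            and src_port == assoc.src_port and dst_port == assoc.dst_port)

print(belongs(assoc, "10.0.2.1", "10.0.5.1", 5000, 6000))  # True
```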

To conclude, we note that SCTP has been implemented in major operating systems, and it has been prescribed by 3GPP as the transport-layer protocol of choice for a variety of application protocols that require high reliability. (One example is Diameter—a protocol for authentication, authorization, and accounting.)

A small representative subset of the application-layer protocols is listed in Figure 4.17. These protocols have been developed in response to specific needs: Simple Mail Transfer Protocol (SMTP) for e-mail; telnet for remote connections to mainframe accounts through a virtual terminal; File Transfer Protocol (FTP) for file transfer; Simple Network Management Protocol (SNMP) for remote management of network elements and devices, and so forth.

The Hyper-Text Transfer Protocol (HTTP) not only defined and enabled the World-Wide Web, but, as we will see later, it has also influenced the creation of a new universal style of accessing resources and invoking remote operations on them. This style has become central for accessing Cloud Computing applications. A limitation of HTTP is that only an HTTP client can start a conversation with a server; the latter may only respond to the client. For full-duplex client-to-server channels, the WebSocket protocol and API have been developed and standardized by the IETF and W3C, respectively.

A decade earlier, a similar capability—receipt of server notifications—was built into the IETF Session Initiation Protocol (SIP), which was developed for the creation and management of multimedia sessions. The 3GPP IP Multimedia Subsystem (IMS), which forms the signaling foundation of third- and fourth-generation wireless networks (and also landline next-generation networks), has been based on SIP.

The actual real-time media transfer is performed by the IETF Real-time Transport Protocol (RTP). To this end, the (initially unforeseen) need to support real-time communications has resulted in the QoS development, described in the next section. In his talk, Steve Deering referred to this development as "putting on weight," and the accompanying slide depicted a much wider-waisted hourglass!

4.3 Quality of Service in IP Networks

The somewhat enigmatic term Quality of Service (QoS) actually refers to something rather specific and measurable. For our purpose, this “something” boils down to a few parameters, such as bandwidth, delay, and delay variation (also called jitter). Adherence to certain values of these parameters makes or breaks any service that deals with audio or video.

In the early days of data communications, there was virtually no real-time application in sight that was sensitive to delay or delay variation. Telephony (even though digitized) was in the hands of telephone companies only, and it relied on connection-oriented transport, which provided constant bit-rate transfer. The 1980s vision of telephone companies for the Integrated Services Digital Network (ISDN) (see [3] for history and references) left the voice traffic to traditional switching, while the data traffic was to be offloaded to a packet network. With the bandwidth promise growing, the Broadband ISDN plan envisioned the use of the ATM connection-oriented switches. To this end, a detailed study and standardization of QoS provisioning and guarantees was first performed in the context of ATM, as described in [5].

Standardization of the QoS support for the IP networks, undertaken in the 1990s, was in a way reactive to the ISDN, but it was revolutionary. The very term Integrated Services Packet Network was coined by Jonathan Turner, who wrote in his truly visionary work [12]: "In this paper, I argue that the evolutionary approach inherent in current ISDN proposals is unlikely to provide an effective long-term solution and I advocate a more revolutionary approach, based on the use of advanced packet-switching technology. The bulk of this paper is devoted to a detailed description of an Integrated Services Packet Network (ISPN), which I offer as an alternative to current ISDN proposals." Further theoretical support for this vision has been laid out in [13]. Ultimately, the standards for the integrated services model—in particular the Resource ReSerVation Protocol (RSVP)—were developed by the IETF in the late 1990s. The integrated services model effectively establishes—although only temporarily—a virtual circuit, with the routers maintaining the state for it.

Sure enough, the ISDN did not end up as “an effective long-term solution,” but the ISPN revolution ran into difficulties—which we will explain later when discussing the technology—just at a time when the standard was being developed. The antithesis was the differentiated services movement (the name itself being a pun on “integrated services”), which took over the IETF in March 1998. The differentiated services model implied no virtual circuit. Routers keep no state, but they treat the arriving packets according to their respective classes, as encoded in IP packets themselves.

Ultimately, the integrated services model was found unscalable, and it was not deployed as such. The differentiated services model won; however, this happened because an altogether new virtual-circuit-based network element technology, Multi-Protocol Label Switching (MPLS) won, in itself synthesizing the Internet and ATM approaches. It is with MPLS that the integrated services and differentiated services models found their synthesis: A variant of the RSVP is used to establish the circuit; and differentiated services are used to maintain guaranteed QoS on the circuit.

The rest of this section first describes the traffic model and the QoS parameters, and then explains the integrated and differentiated services models and protocols. The section culminates—and ends—with the description of MPLS.

4.3.1 Packet Scheduling Disciplines and Traffic Specification Models

Just as the cars traveling on a highway maintain constant speed, so do the bits traveling over a wire. Variation in traffic speed takes place when the cars have to slow down or stop—where the roads merge or intersect; similarly, network traffic speed variation takes place at its “intersections”—in the routers.

A router needs to examine an arriving packet to determine the interface on which it needs to be forwarded toward its destination—this adds to the delay already caused by the I/O processing within the router. Given that packets may arrive simultaneously on several interfaces, it is easy to see how a router can become a bottleneck. The packets end up in one or another queue, waiting to be forwarded. As long as the lengths of the queued packets add up to less than the assigned router memory limit, the packets inside the router are only delayed; but when this limit is reached, the router needs to drop them.

A router can shape the traffic it receives by treating packets differently according to their respective types—employing packet scheduling disciplines, which select the order of packet transmission. (At the moment, we have intentionally used the abstract term type without specifying how the type is determined. We will discuss a packet classification for each special case introduced in subsequent sections.)

Figure 4.19 illustrates the three major packet scheduling disciplines:

  • With best effort, the packets are sent on a first-come, first-served basis; they are dropped when queues become too large;
  • With fair queuing, the packets are queued separately for each packet type and transmitted round-robin to guarantee each type an equal share of bandwidth. Here, the traffic type with statistically larger packets will end up letting lighter-type traffic go first; and
  • With weighted fair queuing, round-robin scheduling is also employed, but the bandwidth is allocated according to each type's priority. The difference with fair queuing is that instead of transmitting an equal number n of bytes from each queue, n·wt bytes are transmitted from the queue associated with type t. A larger weight wt is assigned to a higher-priority queue.
Using the movement of airplanes, elephants, and cars, the diagram depicts three major packet scheduling disciplines: best effort, fair queuing, and weighted fair queuing.

Figure 4.19 Packet scheduling disciplines: (a) best effort; (b) fair queuing; (c) weighted fair queuing.
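Weighted fair queuing can be approximated with a byte-budgeted round robin. This is a simplification (real implementations compute virtual finishing times), and the packet sizes and weights below are illustrative:

```python
# Byte-weighted round robin, approximating weighted fair queuing: each
# pass transmits up to n * w_t bytes from the queue of type t.
from collections import deque

def weighted_fair_schedule(queues, weights, n=100):
    """queues: type -> deque of packet sizes (bytes); returns (type, size) order."""
    order = []
    while any(queues.values()):
        for t in queues:
            budget = n * weights[t]   # per-round byte budget for this type
            # Simplification: assumes every packet fits within its budget.
            while queues[t] and queues[t][0] <= budget:
                size = queues[t].popleft()
                budget -= size
                order.append((t, size))
    return order

queues = {
    "voice": deque([60, 60, 60, 60]),   # statistically small packets
    "bulk":  deque([1500, 1500]),       # large packets
}
weights = {"voice": 2, "bulk": 15}
print(weighted_fair_schedule(queues, weights))
```

Note how the smaller voice packets interleave ahead of the large bulk packets, illustrating the observation above that lighter-type traffic tends to go first.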

It turns out (as is proven in [14]) that it is possible to guarantee upper bounds on the end-to-end delay and buffer size for each type.

When enough routers are stressed to their limit, the whole network gets congested. This situation is well known in transportation and crowd control; in both cases admission control is employed at entrances to congested cities or crowded places.

The two traffic specification models used in network admission are based on an even simpler analogy (depicted in Figure 4.20)—a bucket with a small hole in the bottom. The hole symbolizes the entrance to the network. When the bucket is full, the water starts overflowing, never reaching the “network.”

Now envision a host playing the role of the spigot, inserting packets into the access control “bucket,” which is a network-controlled queue of size β bytes (the bucket depth). The queue is processed at a rate ρ bytes/sec.


Figure 4.20 Traffic specification models: (a) leaky bucket; (b) token bucket.

With the leaky bucket model, once the bucket is full, the packets that arrive at a higher rate than ρ and cause overflow end up being dropped. This model eliminates traffic bursts—the traffic can enter the network no faster than at a constant rate.

The token bucket model is designed to allow bursts. Here the flow out of the bucket is controlled by a valve. The state of the valve is determined by the condition of the token bucket. The latter receives quantities called tokens at a rate of r tokens/sec, and it can hold b tokens. (The tokens stop arriving when the bucket is full.) Now, when exiting the token bucket, a token opens the output valve of the main bucket (Figure 4.20(b)), to allow the output of one byte, at which point the valve is closed. Consequently, no traffic is admitted when the token bucket is empty. Yet, if there is no traffic to output, the bottom hole of the token bucket is closed, and so the tokens are saved in the bucket until they start to overflow. The difference from the leaky bucket model is that now bursts—up to the token bucket size—are allowed.

The volume V(t) of the admitted traffic over time t is bounded thus:

V(t) ≤ b + rt

Hence, a maximum burst of M bytes can last for (M − b)/r seconds.

By placing a token bucket after a leaky bucket that enforces rate R, one can shape the traffic never to exceed that rate while allowing controlled bursts.
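The token bucket behavior (bursts of up to b bytes, a long-term rate of r, and the bound V(t) ≤ b + rt) can be checked with a small discrete-time simulation; the rates and bucket depth below are arbitrary illustrative numbers:

```python
# Discrete-time token bucket sketch: tokens arrive at r per second up to
# depth b; a byte may leave only against a token, so the admitted volume
# V(t) never exceeds b + r*t, while bursts up to b bytes are allowed.
def token_bucket(arrivals, r, b):
    """arrivals[t] = bytes offered at second t; returns bytes admitted per second."""
    tokens, admitted = b, []          # the bucket starts full
    for offered in arrivals:
        sent = min(offered, int(tokens))
        tokens -= sent
        tokens = min(b, tokens + r)   # tokens are saved until the bucket overflows
        admitted.append(sent)
    return admitted

out = token_bucket([5000, 0, 0, 5000, 5000], r=1000, b=2000)
print(out)   # [2000, 0, 0, 2000, 1000]: a burst of b at once, then paced at r

# The admitted volume indeed stays within the b + r*t bound:
assert all(sum(out[:t + 1]) <= 2000 + 1000 * (t + 1) for t in range(len(out)))
```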

4.3.2 Integrated Services

The type by which packets are distinguished in the integrated services model is defined by flows. A flow is a quintuple: (Source IP address, Source Port, Protocol, Destination Port, Destination IP address). Protocol designates the transport protocol. At the time the model was defined there were only two such protocols: TCP and UDP. (SCTP was defined later, and—because of its multiplexing feature—it does not fit exactly, unless the definition of a flow is extended to allow the source and destination IP addresses to be sets of IP addresses.) Note that the layering principle gets violated right here and now, since the router has to inspect the IP payload.
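In code, a flow classifier is little more than a lookup keyed by this quintuple. The addresses below come from the documentation ranges, and the class names are illustrative:

```python
# Sketch of per-flow classification: a flow is the quintuple
# (src IP, src port, protocol, dst port, dst IP), so the router must
# look past the IP header into the transport header to build the key.
flow_table = {}   # quintuple -> class of service

def classify(packet):
    flow = (packet["src_ip"], packet["src_port"], packet["proto"],
            packet["dst_port"], packet["dst_ip"])
    # Flows without a reservation fall back to best effort.
    return flow_table.get(flow, "best-effort")

# A reservation installs an entry for one specific flow:
flow_table[("192.0.2.1", 5004, "UDP", 5006, "198.51.100.7")] = "guaranteed"

pkt = {"src_ip": "192.0.2.1", "src_port": 5004, "proto": "UDP",
       "dst_port": 5006, "dst_ip": "198.51.100.7"}
print(classify(pkt))                      # guaranteed
print(classify({**pkt, "src_port": 9}))   # best-effort
```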

In the simplest case, a flow is a simplex (i.e., one-way) end-to-end virtual circuit provided by the network layer to the transport layer; however, the integrated services envisioned multicast, so in the general case the virtual circuit here is a tree, originating at a point whence the packets are to be sent.

The integrated services framework was carefully built to supplement IP, but to support it, the IP routers had to be changed in a major way. The change included the necessity of making reservations—to establish the circuits and then to maintain these circuits for as long as they are needed.

The IETF 1994 integrated services framework, laid out in RFC 1633,33 defined a new router function, called traffic control, which has been implemented with three components: admission control, classifier, and packet scheduler. (Incidentally, the same terminology has largely been reused by the differentiated services framework.)

The role of admission control is to decide whether a new flow can get the requested QoS support. It is possible that several flows belong to the same type and receive the same QoS treatment. The classifier then maps incoming packets to their respective types—also called classes of service. Finally, the packet scheduler manages the class-arranged packet queues on each forwarding interface and serializes the packet stream on the forwarding interfaces.

The above entities are set up in the router and also at the endpoint hosts. The integrated services are brought into motion by the RSVP, which creates and maintains, for each flow, a flow-specific state in the endpoint hosts and routers in between. We will discuss RSVP in more detail shortly, but the general idea is that the receiving application endpoint specifies the QoS parameters in terms of a flow specification or flowspec, which then travels through the network toward the source host.34 While traveling through the routers, the flowspec needs to pass the admission control test in each router; if it does, the reservation is accepted and so the reservation setup agent updates the state of the router's traffic control database.

Figure 4.21, which presents a slight modification of the architectural view of RFC 1633, explains the interworking of the elements above. There is a clear distinction between the control plane, where signaling takes place asynchronously with actual data transfer, and the data plane, where the packets are moving in real time. It is interesting that the authors of RFC 1633 envisioned the reservations to be done by network management, that is, without employing a router-to-router protocol. The routing function of the routers is unrelated to forwarding.

Diagram shows the integrated services model which includes the control plane and data plane. Control plane contains routing agent, routing database, reservation setup agent, admission control, traffic control database, and management agent. Data plane contains classifier and packet scheduler.

Figure 4.21 The integrated services model (after RFC 1633).

The Integrated Services model deals with two end-to-end services: guaranteed service and controlled-load service on a per-flow basis.

The guaranteed service provides guaranteed bandwidth and bounds on the end-to-end queuing delay for conforming flows. The conformance is defined by the traffic descriptor (TSpec), which is the obligation of the application to the network. The obligation of the network to the application is defined in the service specification (RSpec).

TSpec is based on the token bucket model, and contains five parameters:

  • token rate r (bytes/sec);
  • peak rate p (bytes/sec);
  • token bucket depth b (bytes);
  • minimum policed unit m (bytes) (if a packet is smaller, it will still count as m bytes); and
  • maximum packet size M.

RSpec contains two parameters: service rate R (bytes/sec) and—to introduce some flexibility in the scheduling—the slack term S (μsec), which is the delay a node can add while still meeting the end-to-end delay bounds.

Figure 4.22 provides the formula for the worst-case delay for the guaranteed service in terms of the TSpec and RSpec parameters. Two additional router-dependent variables here are

D = (b − M)(p − R) / [R(p − r)] + (M + Ctot)/R + Dtot,  if p > R ≥ r;
D = (M + Ctot)/R + Dtot,  if R ≥ p ≥ r,

where Ctot and Dtot are the sums, over the routers along the path, of the terms Ci and Di defined below.

Figure 4.22 The end-to-end worst-case delay D (after RFC 2212).

  • Ci: the overhead a packet experiences in a router i due to the packet length and transmission rate; and
  • Di: a rate-independent delay a packet experiences in a router i due to flow identification, pipelining, etc.
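Under the RFC 2212 bound, the worst-case queuing delay can be computed directly from the TSpec and RSpec parameters. In the sketch below, the flow numbers are invented for illustration, and C_tot and D_tot denote the sums of Ci and Di along the path:

```python
# The RFC 2212 worst-case queuing delay, computed from the TSpec
# parameters (r, p, b, M) and the RSpec service rate R; C_tot and
# D_tot are the sums of the per-router terms Ci and Di.
def worst_case_delay(r, p, b, M, R, C_tot, D_tot):
    if R < r:
        raise ValueError("service rate R must be at least the token rate r")
    if p > R:   # the peak rate exceeds the service rate: the burst term applies
        return (b - M) * (p - R) / (R * (p - r)) + (M + C_tot) / R + D_tot
    return (M + C_tot) / R + D_tot

# Hypothetical flow: 1 Mbyte/s token rate, 5 Mbyte/s peak rate,
# 100-kbyte bucket, 1500-byte packets, 2 Mbyte/s reserved rate.
d = worst_case_delay(r=1e6, p=5e6, b=1e5, M=1500, R=2e6,
                     C_tot=3000, D_tot=0.002)
print(round(d, 4), "seconds")   # 0.0412 seconds
```

Raising the reserved rate R (at a cost to the network) is thus the application's lever for tightening its delay bound.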

Unlike the guaranteed service, the controlled-load service is best described in terms of what it does not allow to happen, which is visible queuing delay or visible congestion loss. The definition is left quite ambiguous—no quantitative guarantees—because admission control is left to implementation. (Sometimes, this service is called a better-than-the-best-effort service.) With this service, costly reservations are avoided, and routers rely on statistical mechanisms. Consequently, only TSpec (which limits the traffic that an application can insert into the network) but not RSpec (which spells out the network obligation) is required for the controlled-load service.

In 1997, the IETF completed the standards for integrated services and specified the RSVP for that purpose35 in a series of RFCs from RFC 2205 through RFC 2216.

Figure 4.23 provides a simple example of the use of the RSVP. The host on the sending side starts the exchange with the PATH message, which is propagated downstream through the routers toward the receiver. This message contains the flow identification, the sender TSpec, the time value (needed for refreshing, as we will explain in a moment), and other parameters.

Diagram shows an example of the RSVP exchange. It shows the PATH propagation from sender to receiver and RESV propagation from receiver to sender through the routers with a small circle inside it.

Figure 4.23 An example of the RSVP exchange.

The receiving host responds with the RESV message, which is propagated upstream along the reverse path and installs the reservation state in each router. This message carries the flow descriptor, the time value (also needed for refreshing), the destination IP and port addresses, the protocol number, and other parameters. (Incidentally, for the purposes of security, both the RESV and PATH messages contain cryptographic proof of their integrity—that is, proof that they have not been tampered with.) As a result of this exchange, all routers install the state of the reservation (indicated by a small circle in Figure 4.23). The fact that keeping the state in the network has been accepted signifies a considerable compromise of the early Internet principles; however, the state has been declared soft. It is kept only as long as the RESV and PATH messages keep arriving within a specified time period—and indeed the procedure is to keep issuing those for the duration of a session. This should explain the need for the time value in both messages. If the exchange stops, the state is destroyed.
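
A minimal sketch of the soft-state discipline, in which state survives only while refreshes keep re-arming its timer (the class and method names are illustrative):

```python
class SoftState:
    """Reservation state that survives only while refreshes arrive."""

    def __init__(self, lifetime_sec, now=0.0):
        self.lifetime = lifetime_sec          # derived from the "time value"
        self.expires_at = now + lifetime_sec

    def refresh(self, now):
        """Called whenever a PATH or RESV message for the flow arrives."""
        self.expires_at = now + self.lifetime

    def alive(self, now):
        """The state is destroyed once the refreshes stop."""
        return now < self.expires_at
```

For testability the sketch takes the current time as a parameter; a router would use its own clock.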

The session can also be torn down with a PATHTear message. The full set of RSVP messages is contained in the table of Figure 4.24.

Table shows the directions of RSVP messages. PATH, RESVErr, PATHTear, RESVConf are directed downstream. RESV, PATHerr, and RESVTear are directed upstream.

Figure 4.24 Summary of the RSVP messages.

As we will see shortly, RSVP got another life beyond integrated services.

4.3.3 Differentiated Services

Differentiated services were called that to emphasize the difference from the integrated services which, as we have mentioned, had in turn been named after the ISDN. (Hence the etymology here still points to telephone networks.) Nevertheless, the services as such are pretty much the same in both models; only the means of enabling them differ. The major reason for developing a new model was the concern that the integrated services model—with its reservation and state mechanisms—would not scale in large core networks.

Just as in the integrated services model, the service (as far as QoS is concerned) is characterized by its end-to-end behavior. But no services are defined here. Instead, the model supports a pre-defined set of service-independent building blocks. Similarly, the forwarding treatment is characterized by the behavior (packet scheduling) at each router, and the model defines forwarding behaviors (rather than end-to-end behaviors).

Another drastic difference is the absence of any signaling protocol to reserve QoS. Consequently, no state is to be kept in the routers. Furthermore, there is no flow-awareness or any other virtual-circuit construct in this model. The type of traffic to which a packet belongs (and according to which it is treated) is encoded in the packet's IP header field (Type of Service in IPv4; Traffic Class in IPv6). These types are called classes, and the routing resources are allocated to each class.

Instead of signaling, network provisioning is used to supply the necessary parameters to routers. Well-defined forwarding treatments can be combined to deliver new services.

Class treatment is based on service-level agreements between customers and service providers. Traffic is policed at the edge of the network; after that, class-based forwarding is employed in the network itself.

Let us look at some details. Each Per-Hop-Behavior (PHB) is assigned a 6-bit Differentiated Services Codepoint (DSCP). PHBs are the building blocks: all packets with the same codepoint make a behavior aggregate, and receive the same forwarding treatment. PHBs may be further combined into PHB groups.

The treatment of PHBs is identical in all routers within a Differentiated Service (DS) domain. Again, for the purposes of QoS provision, the origination and destination addresses, protocol IDs, and ports are irrelevant—only the DSCPs matter.
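
In the IP header, the DSCP occupies the six high-order bits of the DS field (the former ToS octet), with the two low-order bits left to explicit congestion notification. A sketch of reading and writing it (the function names are, again, illustrative):

```python
def set_dscp(ds_byte, dscp):
    """Write a 6-bit DSCP into the upper bits of the DS field,
    preserving the two low-order (ECN) bits."""
    if not 0 <= dscp < 64:
        raise ValueError("DSCP is a 6-bit value")
    return (dscp << 2) | (ds_byte & 0b11)

def get_dscp(ds_byte):
    """Recover the DSCP from the DS field."""
    return ds_byte >> 2
```

For instance, writing 46 (the well-known Expedited Forwarding codepoint) into a DS byte whose ECN bits are 01 yields 185.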

Services are specified in a Service-Level Agreement (SLA) between a customer and a service provider as well as between two adjacent domains. An SLA specifies the traffic as well as security, accounting, billing, and other service parameters. A central part of an SLA is a Traffic Conditioning Agreement (TCA), which defines traffic profiles and the respective policing actions. Examples of these are token bucket parameters for each class; throughput, delay, and drop priorities; and actions for non-conformant packets.

Figure 4.25 illustrates the concepts. SLAs can be static (i.e., provisioned once) or dynamic (i.e., changed via real-time network management actions).

Diagram shows the traffic conditioning at the egress node of DS domain X, both ingress and egress nodes of DS domain Y, and ingress node of DS domain Z.

Figure 4.25 Traffic conditioning at the edges of DS domains. Source: Reprinted from [3] with permission of Alcatel-Lucent, USA, Inc.

As far as the PHB classification is concerned, the default PHB group corresponds to the good old best-effort treatment. The real thing starts with the Class Selector (CS) PHB group (with codepoints enumerated CS0 through CS7, in order of ascending priority).

The next group is the Expedited Forwarding (EF) PHB group; belonging to it guarantees that the departure rate of the aggregate's packets is no less than the arrival rate. It is assumed that the EF traffic may pre-empt any other traffic.

The Assured Forwarding (AF) PHB group allocates (in ascending order) priorities to four classes of service, and also defines three dropping priorities for each class of service (as the treatment for out-of-profile packets).
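
The AF codepoints follow a regular pattern: for class x and drop precedence y, the DSCP of AFxy works out to 8x + 2y (so, for instance, AF11 is 10 and AF43 is 38). A small helper, with the caveat that the function name is ours:

```python
def af_dscp(af_class, drop_precedence):
    """DSCP for Assured Forwarding AFxy: classes 1-4, drop
    precedences 1 (lowest) through 3 (highest)."""
    if af_class not in (1, 2, 3, 4) or drop_precedence not in (1, 2, 3):
        raise ValueError("AF defines classes 1-4 and drop precedences 1-3")
    return 8 * af_class + 2 * drop_precedence
```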

Figure 4.26 provides a self-explanatory illustration of the semantics of assured forwarding: for each transportation class, its priority as well as the dropping priority for its elements is spelled out.

Diagram shows the Assured Forwarding dropping priorities for four classes of service: class one elephant, class two tractor, class three car, and class four steam engine.

Figure 4.26 An example of an AF specification.

The operations of the model inside a router are explained in Figure 4.27.

Diagram shows the architecture design used for the model inside a router. The classifier routes the packets to corresponding aggregates and defines the respective SLA. Meter defines whether the package is in or out of a class profile, marker writes the DSCP of the package, and shaper which delays non-conformant packets according to the traffic profile.

Figure 4.27 The inside view of DS. Source: RFC 2475.

The major elements of the architecture are: Classifier, Meter, Marker, Dropper, and Shaper.

The Classifier functions are different at the edge of the network and in its interior nodes. In the boundary nodes located at the edge of the network, the Classifier determines the aggregate to which the packet belongs and the respective SLA; everywhere else in the network, the Classifier only examines the DSCP values.

The Meter checks the aggregate (to which the incoming packet belongs) against the Traffic Conditioning Agreement and determines whether it is in or out of the class profile. Depending on particular circumstances, the packet is either marked or just dropped.

The Marker writes the respective DSCP in the DS field. Marking may be done by the host, but it is checked (and is likely to be changed when necessary) by the boundary nodes. In some cases, special DSCPs may be used to mark non-conformant packets. These doomed packets may be dropped later if there is congestion. Depending on the traffic adherence to the SLA profile, the packets may also be promoted or demoted.

The Shaper delays non-conformant packets until it brings the respective aggregate into compliance with the traffic profile. To this end, shaping often needs to be performed between the egress node of one DS domain and the ingress node of the next.
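
Putting the elements together, a toy conditioner might classify a packet by its DSCP, meter it against a grossly simplified profile, and either forward it or re-mark it to best effort. Everything here (the dictionary-based profile, the return strings) is illustrative only:

```python
def condition(packet, profiles):
    """Classify, meter, and mark one packet.

    packet: dict with 'dscp' and 'size' keys.
    profiles: dict mapping a DSCP to its remaining byte budget,
    standing in for the TCA's token bucket.
    """
    dscp = packet["dscp"]            # classifier: the behavior aggregate
    budget = profiles.get(dscp, 0)   # meter: in or out of profile?
    if packet["size"] <= budget:
        profiles[dscp] = budget - packet["size"]
        return "forward"             # in-profile: forwarded unchanged
    packet["dscp"] = 0               # marker: demote to best effort
    return "demoted"                 # (a dropper could discard instead)
```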

To conclude, we provide a long-overdue note on the respective standards. Altogether there are more than a dozen differentiated services RFCs published by the IETF at different times. RFC 247436 maps PHBs into DS codepoints of the IP packets. The idea of combining integrated services, in edge networks and differentiated services, in core networks has been exploited in RFC 2998.37 With the particular importance of network management here, we highly recommend reading RFC 327938 and RFC 328039, which respectively define the management information base (i.e., the set of all parameters) in the routers and explain the network management operation.

4.3.4 Multiprotocol Label Switching (MPLS)

In his IP hourglass presentation, Dr. Deering referred to MPLS—along with several technologies it enabled—as “below-the-waist-bulge,” lamenting that it is “mostly reinventing, badly, what IP already does (or could do, or should do).” Maybe so, but IP did not provide a straightforward virtual-circuit switching solution. This is not surprising, given that such a solution would have contradicted the fundamental principles of the Internet.

Given that in the mid-1990s ATM switches as well as frame relay switches were around, something had to be done to synthesize the connection-oriented and connectionless technologies. The synthesis, then called “tag switching,” was described in [15], which stated in no uncertain terms an objective to simplify “the integration of routers and asynchronous transfer mode switches by employing common addressing, routing, and management procedures.”

The IETF MPLS Working Group has been active since 1997, and it has been among the busiest IETF working groups, having published over 70 RFCs.40

We will start our overview of MPLS by identifying, with the help of Figure 4.28, the difference between routing and switching.

Diagram on the left shows two men standing near a sign board kept at a cross road and making decisions where to turn. Diagram on the right shows how a datagram switching is done through different network nodes. Switching is done based on the locally significant interface number that is a virtual circuit ID of the datagram.

Figure 4.28 Routing and switching.

Imagine walking in a strange city, map in hand, toward a landmark at a given address. To get to the landmark one needs to determine where exactly one is at the moment, find this place as well as the landmark on the map, find a combination of streets that looks like the shortest way to get to the landmark, and start walking, checking street names at corners, as depicted in Figure 4.28(a), and making decisions where to turn. This takes time, and an outdated map may further complicate the affair. This is pretty much what the routing problem is. And if walking is problematic, everyone who has driven under these circumstances knows how ineffective this may be (with other drivers blowing their horns all around!).

However, when all one has to do is follow the signs toward the landmark—whether on foot or driving—finding the destination is straightforward. This is the case with switching over a pre-determined circuit. The circuit does not have to be permanent (consider the case of real-estate signs marking the direction toward an open house or detour signs placed only for the duration of a road repair). Another metaphor is the route designation for US highways. To travel from Miami to Boston, one only needs to follow the Interstate Route 95, which actually extends over several different highways—even named differently. The interchanges are well marked though, and with a car's driver following the signs (rather than looking at the map), the car switches from one highway to another while always staying on the route.

Similarly, with datagram switching, a network node never has to determine the destination of a datagram in order to forward it to the next node. Instead, the datagram presents a (locally significant) interface number—a virtual circuit ID—so that the datagram is forwarded exactly on the interface it names. For example, the node on the left of Figure 4.28(b) has received a datagram to be forwarded on interface 3. The node's switching map indicates that the next node's interface to forward the datagram on is 2, and so the virtual circuit ID 3 is replaced with 2 for the next hop. With that, the datagram follows a pre-determined “circuit.”

This is exactly how MPLS works. First, a Label-Switched Path (LSP)—a virtual end-to-end circuit—is established; then the switches (also called MPLS-capable routers) start switching (rather than routing) IP packets, without ever looking at the packets themselves. Furthermore, an LSP can be established along any path (not necessarily the shortest one), a feature that helps traffic engineering.

The locally significant circuit is designated by a label. (Hence the “L” in MPLS.) As Figure 4.29 demonstrates, the MPLS label prefixes the IP packet within the link-layer frame, sitting between the link-layer header and the IP packet. In terms of the ISO reference model, MPLS thus falls between Layer 2 and Layer 3: above the actual, “real” link layer running over the physical circuit, but below IP. For this reason, MPLS is sometimes referred to as “Layer-2.5 technology.” The situation is further complicated because the labels can be stacked, for the purposes of building virtual private networks. (As is typical with virtual things, recursion comes in.) For the purposes of this section though, we always assume that the stack contains one and only one label.

Diagram represents the format of a link-layer frame. The first cell contains the Layer 2 Header, second cell has MPLS Label Stack, the third cell contains IP Packet and the fourth cell contains Layer 2 trailer. The MPLS label contains four fields; label value, traffic class, S and Time-to-Live.

Figure 4.29 The location and structure of the MPLS label.

The structure of the MPLS label is defined in RFC 3032.41 The label contains four fields:

  • Label Value (20 bits), which is the actual virtual circuit ID;
  • Traffic Class (TC) (3 bits),42 which is used for interworking with diffserv and for explicit congestion marking;
  • S (1 bit) is a Boolean value indicating whether the present label is the last on the label stack; and
  • Time-to-Live (TTL) (8 bits), which has the same purpose and the same semantics as its namesake field in IPv4. (The IP packet is not examined in MPLS.)

The way the label is used is rather straightforward. The label value serves as an index into the internal table (called the Incoming Label Map (ILM)) of the MPLS switch, which, among other things, contains the outgoing label value, the next hop (i.e., the index to the forwarding interface), and the state information associated with the path. Then the new link-layer frame is formed, with the new outgoing label inserted before the IP packet.
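
A sketch of that lookup, with an ILM whose label values loosely echo Figure 4.30 (the interface numbers are made up for illustration):

```python
# The Incoming Label Map (ILM) of one LSR, indexed by incoming label.
ilm = {
    20: {"out_label": 30, "out_interface": 2},
    72: {"out_label": 18, "out_interface": 3},
}

def switch(in_label, payload):
    """Swap the label and pick the outgoing interface in one lookup.

    The payload (an IP packet, an Ethernet frame, ...) is never
    examined; only the label matters."""
    entry = ilm[in_label]
    return (entry["out_interface"], entry["out_label"], payload)
```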

One interesting question here: Where in this framework is it essential that the payload be an IP packet? The answer is that it is not essential at all. For the moment—as we are addressing the use of MPLS within IP networks—we naturally look at everything from the IP point of view, but we will shortly use the fact that any payload—including an ATM cell and the Ethernet frame—can be transferred with MPLS. This, finally, explains the “multi-protocol” part in the name “MPLS.” Yet MPLS was invented with IP networks in mind, and its prevalent use is in IP networks. To emphasize that, the MPLS standards call an MPLS switch a Label-Switched Router (LSR).

While the process of switching is fairly simple, the overall problem of engineering, establishing, and tearing down end-to-end circuits is exceedingly complex. It took the industry (and the IETF in particular) many years to resolve this, and work on many other aspects of it—especially where it concerns optical networking—is continuing.43

Figure 4.30 illustrates the mapping of labels to LSPs along with some of their important properties. To begin with, all LSPs (just like the integrated services flows) are simplex. Our example actually deals only with flows, and we have three to look at:

Diagram shows the mapping of labels to label-switched paths. The data flows across four different hosts and label switched routers. Flow (a) starts from host C and ends at host B, flow (b) starts from host C and ends at host D, flow (c) starts from host A and ends at host D.

Figure 4.30 An example of label assignment to flows and LSPs.

  • Flow (a) originates in host C and terminates in host B. For the purposes of load balancing (an important network function to which we will return more than once in this book) there are two LSPs. LSP I traverses three LSRs—S1, S2, and S4—the last two assigning the flow labels 20 and 30, respectively. Similarly, LSP II, which traverses S1, S3, and S4, has labels 72 and 18 assigned to the same flow.
  • Flow (b) originates in host C and terminates in host D, also traversing LSRs S1, S3, and S4, with labels 25 and 56 assigned to the flow.
  • Flow (c) originates in host A and terminates in host D, traversing LSRs S2 and S4, the latter switch assigning the flow label 71.

Each label in the Label-Switched Path (LSP) has a one-to-one association with a Forwarding Equivalence Class (FEC), which is in turn associated with the treatment the packets receive. An FEC is defined by a set of rules. For instance, a rule may be that the packets in the FEC match a particular IP destination address, or that they are destined for the same egress router, or that they belong to a specific flow (just as in the example of Figure 4.30). Naturally, different FECs have different levels of scalability. The classification of FEC packets is performed by the ingress routers.
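
A toy ingress classifier might express FEC rules as first-match tests on the destination address; the prefixes, FEC names, and labels below are invented for illustration:

```python
import ipaddress

# Each rule: (destination prefix, FEC name, label bound to the FEC).
FEC_RULES = [
    (ipaddress.ip_network("10.1.0.0/16"), "fec-egress-east", 25),
    (ipaddress.ip_network("10.2.0.0/16"), "fec-egress-west", 71),
]

def classify(dst_ip):
    """Map a destination address to its FEC; the first match wins."""
    addr = ipaddress.ip_address(dst_ip)
    for prefix, fec, label in FEC_RULES:
        if addr in prefix:
            return fec, label
    return "best-effort", None     # unlabeled: plain IP forwarding
```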

At this point, we are ready to address—very briefly—the issue of forming LSPs. One simple rule here, which follows from the fact that only a specific LSR knows how to index its own incoming label map, is that the labels are assigned upstream—that is, from the sink to the source, similarly to the way reservations are made with RSVP.

In fact, RSVP in its new incarnation with extension for traffic engineering—called RSVP-TE—has become a standard protocol for label distribution. Because RSVP was originally designed to support multicast, the development of RSVP-TE reignited interest in broadband multicast.

Before RSVP-TE though, another protocol, appropriately called the Label Distribution Protocol (LDP), was designed for label distribution. Later, the designers of LDP had to take traffic engineering into account, too, and the resulting LDP extension was called Constraint-based Routing LDP (CR-LDP). It hardly makes things simpler, but BGP has introduced its own extension for managing LSPs.

Figure 4.31 gives an example of a simple explicit route setup with CR-LDP and RSVP-TE, concentrating on their similarities rather than their differences. The requests (emphasized in bold) flow downstream, while the assignment of labels proceeds upstream. In addition to setting up the LSPs, both protocols support the LSP tunnel rerouting, applying the “make-before-break” principle, and provide pre-emption options.

Diagram on the left shows a simple explicit route setup with CR-LDP. Diagram on the right shows upstream direction of a simple explicit route setup with RSVP-TE. Both the set-ups consist of two hosts A and B along with three label switched routers.

Figure 4.31 Examples of the explicit route setup with (a) CR-LDP and (b) RSVP-TE.

Except for support for multicast, which, again, was the original feature of RSVP-TE, the differences between CR-LDP and RSVP-TE are fairly insignificant. In short, these are manifest in the underlying protocols (CR-LDP needs TCP or UDP, while RSVP-TE runs directly over IP); the state installed in an LSR (CR-LDP installs hard state, while RSVP-TE—true to its original design—installs soft state); LSP refresh (performed only by RSVP-TE); and security options (RSVP-TE has its own authentication mechanism).

To conclude, in less than 20 years since its inception, MPLS has come a long way. First, it enabled peer-to-peer communication with ATM and frame relay switches. It also introduced a means of supporting Internet traffic engineering (e.g., offload, rerouting, and load balancing), accelerated packet forwarding, and—as we will see in the next section—provided a consistent tunneling discipline for the establishment of virtual private networks.

The MPLS technology, by means of its extension called Generalized MPLS (GMPLS), has also evolved to support other switching technologies (e.g., those that involve time-division multiplexing and wavelength switching). There is an authoritative book [17] on this subject. As we move to WAN virtualization technologies—the subject of the next section—it is appropriate to note that MPLS has proven to be the technology of choice in this area.

4.4 WAN Virtualization Technologies

As we mentioned in the Introduction, the need for virtual data networking pre-dated PDN times; PDN was, in fact, created to address the need for companies to have what appears to be their own separate private networks.

To this end, when applied to data networks, the term virtualization means simply an environment in which something is put on top of the existing network—as an overlay—to carve out of it non-intersecting, homogeneous pieces that, for all practical purposes, are private networks. In other words, each of these pieces has an addressing scheme defined by the private network it corresponds to, and it operates according to that network's policies.

The association of endpoints forming a VPN can be achieved at different OSI layers. We consider Layer-1, Layer-2, and Layer-3 VPNs.

The Layer-1 VPN (L1VPN) framework, depicted in Figure 4.32, and standards have been developed by the IETF and ITU-T. The framework is described in RFC 4847,44 which also lists related documents. L1VPN is defined as a “service offered by a core Layer 1 network to provide Layer 1 connectivity between two or more customer sites, and where the customer has some control over the establishment and type of the connectivity.” The model is based on GMPLS-controlled traffic engineering links. With that, the data plane (but not the control plane) is circuit switched.

Diagram shows the framework of a layer 1 VPN. The framework consists of a provider network with two provider edge devices connected using a provider device. On the sender end and the receiver end the PE devices are connected to customer edge devices that act as a VPN endpoint, which in turn is connected to customer devices.

Figure 4.32 Layer-1 VPN framework (after RFC 4847).

The customer devices, C devices, are aggregated at a Customer Edge (CE) device, which is a VPN endpoint. (There are two VPNs, named A and B in the figure.) A CE device can be a Time Division Multiplexing (TDM) switch, but it can also be a Layer-2 switch or even a router. The defining feature of a CE device is that it be “capable of receiving a Layer-1 signal and either switching it or terminating it with adaptation.”

In turn, a CE device is attached to a Provider Edge (PE) device, which is the point of interconnection to the provider network through which the L1VPN service is dispensed. A PE device can be a TDM switch, an optical cross-connect switch, a photonic cross-connect switch, or an Ethernet private line device (which transports Ethernet frames over TDM). PE devices are themselves interconnected by switches, called Provider (P) devices.

The VPN membership information is defined by the set of CE-to-PE GMPLS links. Even though the CE devices belong to a customer, their management can be outsourced to a third party, which is one of the benefits of Layer-1 VPN. Another benefit is “small-scale” use of transmission networks: several customers share the physical layer infrastructure without investing in building it.

Layer-2 VPNs have two species. One species is a simple “pseudo-wire,” which provides a point-to-point link-layer service. Another species is LAN-like in that it provides a point-to-multipoint service, interconnecting several LANs into a WAN. Both species use the same protocols to achieve their goals, so we will discuss the use of these protocols only for the second species—virtual LAN (VLAN).

To begin with, there are two aspects to virtual LANs. The first aspect deals with carving a “real” (non-emulated) LAN into seemingly independent, separate VLANs. The second deals with gluing rather than carving: connecting separate LANs at the link layer.

Figure 4.33 illustrates the VLAN concept. The hosts that perform distinct functions—drawn in Figure 4.33(a) with different shapes—share the same physical switched LAN. The objective is to achieve—by software means—the logical grouping of Figure 4.33(b), in which each function has its own dedicated LAN. To this end, the logical LANs may even have different physical characteristics so that, for example, multimedia terminals get a higher share of bandwidth.

Diagram shows a comparison between a physically configured VLAN and a logically configured VLAN. In physical configuration, different functions are depicted as different shapes and are connected by the same LAN. In logical configuration, logical grouping of functions are done and each of the function will have its own dedicated LAN.

Figure 4.33 The VLAN concept: (a) physical configuration; (b) logical configuration.

This is achieved by means of VLAN-aware switches that meet the IEEE 802.1Q45 standard. The switches recognize the tags that characterize the class (i.e., VLAN) to which the frame belongs and deliver the frame accordingly.
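
To see what the switches act on, here is a sketch of inserting an 802.1Q tag (TPID 0x8100 followed by the 16-bit tag control field: priority, drop-eligible bit, and the 12-bit VLAN ID) into an untagged Ethernet frame; it is an illustration, not a reference implementation:

```python
import struct

def add_vlan_tag(frame, vid, pcp=0):
    """Insert an 802.1Q tag after the destination and source MAC
    addresses (bytes 0-11) of an untagged Ethernet frame."""
    if not 0 < vid < 4095:
        raise ValueError("VLAN ID is a 12-bit value (1-4094)")
    tci = (pcp << 13) | vid                # priority, DEI=0, VLAN ID
    tag = struct.pack("!HH", 0x8100, tci)  # TPID 0x8100 marks 802.1Q
    return frame[:12] + tag + frame[12:]
```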

VLANs can be extended into WAN Layer-2 VPNs by means of pseudo-wire (as defined in RFC 398546), where a bit stream—in the case of LANs, a frame—is transmitted over a packet-switched network.

Figure 4.34, which reuses the terminology of Figure 4.32, demonstrates the framework. The difference here is that the Layer-2 frames are tunneled through the PSN over an emulated (rather than a real) circuit.

Diagram shows the framework of a Pseudo-wire emulation edge-to-edge network. The framework consists of a packet switched network with two interconnected provider edge devices. On the sender end and the receiver end the PE devices are connected to customer edge devices that act as a VPN endpoints.

Figure 4.34 Pseudo-wire emulation edge-to-edge network reference model (after Figure 2 of RFC 3985).

A pseudo-wire can naturally be established over an MPLS circuit, but the presence of MPLS is not a requirement. There is an older—and in a way competing—technology for carrying link-layer packets over an IP network: the Layer Two Tunneling Protocol (L2TP). As described in [3], L2TP was among the first mechanisms developed in support of PSTN/Internet integration, enabling a connection between a host and the Internet router over a telephone line. It was designed to carry PPP frames over a single point-to-point connection. Now in its third version, L2TPv3 as specified in RFC 393147 still retains the call setup terminology, but the scope of its application has been extended to support Ethernet-based VLANs.

Finally, Layer-3 VPNs are not much different conceptually—they also involve tunneling, but at Layer 3. The PEs in this case don't have to deal with the LAN-specific problems, but there is another set of problems that deal with duplicate addresses. Each Layer-2 network interface card has a unique Layer-2 address; however, this is not the case with IP addresses. The reasons for that will become clear later, when we introduce Network Address Translation (NAT) appliances, but the fact is that it is possible for two private networks to have overlapping sets of IP addresses, and hence the provider must be able to disambiguate them.

RFC 254748 describes a set of techniques in which MPLS is used for tunneling through the provider's network and extended BGP advertises routes using route distinguisher strings, which allow disambiguation of duplicate addresses in the same PE.
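
The disambiguation idea can be sketched in a few lines: keying the provider's VPN routing table by the pair (route distinguisher, prefix) keeps identical private prefixes from colliding. The RD strings, next hops, and labels below are invented for illustration:

```python
# The provider's VPN routing table, keyed by (route distinguisher, prefix).
vpn_routes = {}

def advertise(rd, prefix, next_hop, label):
    """Store a route; the RD makes duplicate private prefixes distinct."""
    vpn_routes[(rd, prefix)] = (next_hop, label)

# Two customers happen to use the same private prefix:
advertise("65000:1", "10.0.0.0/24", "pe1", 101)
advertise("65000:2", "10.0.0.0/24", "pe2", 202)
```

Because the key includes the RD, each VPN resolves the shared prefix to its own next hop and label.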

Another approach is the virtual router architecture introduced in RFC 2917.49 Here, MPLS is also used for tunneling, but no specialized route-advertising mechanism is needed. Disambiguation is achieved by using different labels in distinct routing domains.

By now, the role and benefits of MPLS in constructing VPNs, especially in view of its traffic engineering capabilities in support of QoS, should be evident. We can now explain the need for stacking MPLS labels, a need that arose precisely because of VPN requirements.

Figure 4.35(a) depicts two distinct LANs interconnected by an MPLS LSP in a private network. Consider now interworking this arrangement with a VPN, as shown in Figure 4.35(b). As the LSP enters the ingress PE, the latter can push its own label (preserving the original label, which has no significance in the provider network and is thus tunneled through it along with the payload). That label is popped at the egress PE, and so the packet returns to the private network with the original label, which now has a meaning for the rest of the LSP.

Diagram at the top shows a label-switched path in a single network, where two LAN's are connected using MPLS LSP in a private network. Diagram at the bottom shows how the label-switched path in a single network is interworked with a VPN.

Figure 4.35 Label stacking in provider-supported VPN: (a) LSP in a single network; (b) LSP traversing a provider network.

We will return to VPN when discussing modern Cloud data centers.

4.5 Software-Defined Network

The very idea of SDN—a centrally managed network—has been around from the beginning of networking. (One could argue that it has been around from the beginning of mankind, as it involves the benefits of central planning and management versus those of agility and robustness of independent local governments.) It should be clear though that the point of the argument revolving around the idea of SDN is not central management in general, but the real-time central management of routing.

To this end, the initial SDN development is very similar to that of the PSTN Intelligent Network (IN) [18]. Telephone networks were always centrally managed, but the telephone switches established calls through co-operation among themselves. In the 1970s, they evolved so that the call establishment was performed out-of-band of the voice circuitry—via a separate packet network. Toward the 1990s, call control was increasingly separated from all-software service control. The latter started with a simple address translation (not unlike that performed by the Internet domain name servers, which we will address in detail later in this book), but evolved toward executing complex service logic programs.

The trajectory of that evolution was leading toward switches becoming mere “switching fabric,” with all other functions moving into the general computers that would exercise complete control over call and service logic, combining IN and network management. The Telecommunications Information Networking Consortium (TINA-C)50 had worked for seven years (from 1993 to 2000) on this vision, providing the architecture, specifications, and even software. Although the direction was right, the vision never materialized simply because the “switching fabric” dissolved in the Internet—its functions taken on by the LAN switches51 and IP routers. That was the beginning of the end for telephone switches as such.

In the late 1990s, the progress of work (described in detail in [3]) in several IETF working groups resulted in clear standards for telephone switches to be effectively decomposed into pieces that spoke to both the PSTN and IP. The “switching” though was a function of an IP router. Thus, the SoftSwitch concept was born,52 whose very name reflected the idea of its programmability, which came naturally since only general-purpose computers were needed to control it. The Session Initiation Protocol/Intelligent Network (SIN) design team led by Dr. Hui-Lan Lu in the IETF transport area demonstrated how all PSTN-based services can be both reused and enhanced with the soft switch. The result of this work—at the time largely unnoticed, since 3GPP was completing a related but much more powerful effort standardizing the IP Multimedia Subsystem (IMS) for mobile networks—was RFC 3976.53 In contributing to the RFC, its leading author, Dr. Vijay Gurbani, supplied the research that was part of his PhD thesis. As the IMS became a success, the name SoftSwitch disappeared, as did the industry's efforts to enhance or even maintain PSTN switches.

With the telephone switches gone, the attention of researchers turned toward IP routers. The latter started as general-purpose computers (Digital Equipment Corporation PDP-10 minicomputers, for example, which BBN had deployed in 1977 in ARPANET), but, by 2000, they had become complex, specialized hardware devices54 controlled by proprietary operating systems. Then the push came to separate the forwarding part of the router (the “switching fabric” that requires specialized hardware) from the part that performs signaling and builds routing tables.

The IETF started the discussion on the Forwarding and Control Element Separation (ForCES) working group charter in 2000, with the ForCES framework published as RFC 374655 in 2004. Figure 4.36 reproduces the ForCES architecture.

Diagram shows the major elements of forwarding and control element separation (ForCES) working group's architecture. The framework mainly consists of a control plane and a data plane. The control plane consists of a CE manager and two control elements and the data plane consists of a FE manager and two forwarding element blocks.

Figure 4.36 The ForCES architecture (after RFC 3746).

RFC 581056 defines the protocol (or rather protocols) for the interfaces—called, in the ITU-T tradition, reference points—between the pairs of elements. The transaction-oriented protocol messages are defined for the Protocol Layer (PL), while the Protocol Transport Mapping Layer (ForCES TML) “uses the capabilities of existing transport protocols to specifically address protocol message transportation issues.” Specific TMLs are defined in separate RFCs.

By 2010 there were several implementations of the ForCES protocol, three of which are reported on and compared with one another in RFC 6053.57 The ForCES development was taken to new heights with the SoftRouter project at Bell Labs. In the SoftRouter architecture, described in [19], the control plane functions are separated completely from the packet forwarding functions. There is no static association; the FEs and CEs find one another dynamically. When an FE boots, it discovers a set of CEs that may control it, and dynamically binds itself to the “best” CE. A seeming allusion to SoftSwitch is actually a direct reference. As the authors of [19] state: “The proposed network evolution has similarities to the SoftSwitch based transformation of the voice network architecture that is currently taking place. The SoftSwitch architecture … was introduced to separate the voice transport path from the call control software. The SoftRouter architecture is aimed at providing an analogous migration in routed packet networks by separating the forwarding elements from the control elements. Similar to the SoftSwitch, the SoftRouter architecture reduces the complexity of adding new functionality into the network.”
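The dynamic binding just described—an FE discovering a set of CEs and binding itself to the “best” one—can be sketched as follows. This is a hypothetical illustration, not the SoftRouter implementation; the selection criterion (lowest latency among reachable CEs) and all names are assumptions made for the sketch.

```python
# Hypothetical SoftRouter-style binding: a forwarding element (FE)
# discovers available control elements (CEs) and binds to the "best"
# one -- here, simply the reachable CE with the lowest latency.
from dataclasses import dataclass

@dataclass
class ControlElement:
    name: str
    reachable: bool
    latency_ms: float

def bind_to_best_ce(discovered: list) -> ControlElement:
    """Return the CE the FE should bind to, per the toy criterion."""
    candidates = [ce for ce in discovered if ce.reachable]
    if not candidates:
        raise RuntimeError("no reachable control element")
    return min(candidates, key=lambda ce: ce.latency_ms)

ces = [
    ControlElement("ce-east", reachable=True, latency_ms=12.0),
    ControlElement("ce-west", reachable=True, latency_ms=4.5),
    ControlElement("ce-lab", reachable=False, latency_ms=1.0),
]
best = bind_to_best_ce(ces)
print(best.name)  # ce-west: lowest latency among the reachable CEs
```

In the actual architecture the association is re-evaluated dynamically—if the bound CE fails, the FE repeats the discovery and rebinds.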

The next significant step came as an action reflecting common academic sentiments: the Internet architecture had “ossified,” and it was impossible to change it; it was, in fact, hard to teach students the practical aspects of networking when there was no access to real networks on which to experiment at scale. The Global Environment for Network Innovations (GENI) project58 addressed this by establishing a large-scale testbed of programmable networks, which use virtualization—including network overlays—to allow “network slicing,” giving individual researchers what appears to be a WAN to experiment with.

Somewhat downscaling the concept to a campus network, eight researchers issued, in 2008, what they called a “white paper” [20].59 The proposal of the paper was called OpenFlow.

Having observed that the flow tables in the present Ethernet switches and routers have common information, the authors of [20] proposed exploiting this so as to allow flow tables to be programmed directly. Systems administrators could then slice the campus network by partitioning the traffic and allocating the entries in the flow table to different users. In this way, researchers “can control their own flows by choosing the routes their packets follow and the processing they receive. In this way, researchers can try new routing protocols, security models, addressing schemes, and even alternatives to IP.”

According to [20], the ideal OpenFlow environment looks as depicted in Figure 4.37, which we intentionally drew in the context of the previous figure to demonstrate the progress of the development. Here the controller process, which is implemented on a general-purpose computer, maintains a secure transport-layer channel with its counterpart process in the switch. The latter process is responsible for the maintenance of the flow table. The OpenFlow protocol messages, through which the flow table is maintained, are exchanged between these two processes.

Diagram shows the framework of an ideal OpenFlow environment. The framework consists of a control plane and a data plane. The control plane has a controller implemented on a general computer, which maintains a secure transport-layer channel with the flow table in the data plane.

Figure 4.37 The OpenFlow switch.

A flow has a broad definition: it can be a stream of packets that share the same quintuple (as defined in the integrated services) or VLAN tag, or MPLS label; or it can be a stream of packets emanating from a given Layer-2 address or IP address.

In addition to the header that defines the flow, a flow table entry defines the action to be performed on a packet and the statistics associated with each flow (number of packets, number of bytes, and the time since the last packet of the defined flow arrived). The action can be either sending a packet on a specific interface, or dropping it, or—and here is where things become new and interesting—sending it to the controller(!) for further inspection. It is obvious how this feature can benefit researchers, and its benefit for law enforcement is equally obvious.
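The flow-table mechanics just described—a match on header fields, an action, and per-flow statistics, with a table miss punted to the controller—can be sketched in a few lines. This is a toy model, not the OpenFlow wire format; the field names and action strings are illustrative assumptions.

```python
# Illustrative flow table: each entry holds a match on header fields,
# an action, and per-flow statistics (packets, bytes, last arrival).
import time
from dataclasses import dataclass

@dataclass
class FlowEntry:
    match: dict        # e.g. {"dst_ip": ..., "proto": ...}
    action: str        # "forward:<port>", "drop", or "controller"
    packets: int = 0
    bytes: int = 0
    last_seen: float = 0.0

class FlowTable:
    def __init__(self):
        self.entries = []

    def add(self, match, action):
        self.entries.append(FlowEntry(match, action))

    def process(self, packet: dict) -> str:
        for e in self.entries:
            # An entry matches if every field it names agrees with the packet.
            if all(packet.get(k) == v for k, v in e.match.items()):
                e.packets += 1
                e.bytes += packet["len"]
                e.last_seen = time.time()
                return e.action
        return "controller"  # table miss: send the packet to the controller

table = FlowTable()
table.add({"dst_ip": "10.0.0.2", "proto": "tcp"}, "forward:3")
table.add({"src_ip": "10.0.0.66"}, "drop")
print(table.process({"dst_ip": "10.0.0.2", "proto": "tcp", "len": 1500}))  # forward:3
print(table.process({"src_ip": "192.0.2.9", "len": 64}))                  # controller
```

Note how “slicing” falls out naturally: a systems administrator partitions the traffic simply by deciding which match patterns each researcher is allowed to install.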

Thus, the industry moved even further toward transforming the data-networking architecture into a fully programmable entity—coming, within the space of 20 years, to the realization of IN and TINA-C plans. In 2008, the OpenFlow Consortium was formed to develop the OpenFlow switch specifications. The membership of the consortium, as [20] reports, was “open and free for anyone at a school, college, university, or government agency worldwide.” Yet the Consortium restricted, in an effort to eliminate vendor influence, its welcome to those individual members who were “not employed by companies that manufacture or sell Ethernet switches, routers or wireless access points.” The situation changed in 2011, when the work of the Consortium was taken over by the Open Networking Foundation (ONF),60 which has a large and varied enterprise membership including both network operators and vendors. A special program supports research associates whose membership is free of charge.

4.6 Security of IP

There are many aspects to network security, and we will keep returning to this subject throughout the book to introduce new aspects. Just to mention a couple of things that are the subject of later chapters, there are firewalls (which are but police checkpoints for the IP traffic) and there are Network Address Translation (NAT) devices (whose purpose is to hide the internal structure of the network, and which are often combined with firewalls). Then there are mechanisms for access management, and different cryptographic mechanisms for different OSI layers. We highly recommend [21] as a comprehensive—and beautifully written—monograph on network security. For the purposes of this chapter, which is mainly concerned with the network layer, we will deal with one and only one aspect: IP security.

Security was not a requirement when IP was first designed, and so IP packets crisscrossed the Internet from sources to destinations in the clear. As long as the Internet was the network for and by researchers, the situation was (or at least was thought to be) fine.

The fact is though that anyone on a broadcast network (or with access to a switch) can peek inside a packet (if only to learn someone else's password); people with more advanced means can also alter or replay the packets. Furthermore, it is easy for anyone to inject into the network arbitrary packets with counterfeit source IP addresses.

The suite of protocols known as IP security (IPsec) was developed to provide security services at the network layer. An important reason for developing security at the network layer is that existing applications can stay unchanged, while new applications can still be developed by people blissfully unaware of the associated complexity. Addressing security at the layers below provides only a hop-by-hop solution—because of the routers in the end-to-end path. (Addressing it at the transport layer initially appeared redundant, and would have remained redundant had the network not been “broken”—as will become evident later.)

IPsec is specified in a suite of IETF RFCs whose relationship is depicted in Figure 4.38. Understanding the relationship is important to the practical use of IPsec as it provides several services (authentication, confidentiality protection, integrity protection, and anti-replay) which can be mixed and matched to fit specific needs.

Diagram shows the interconnection between the features of IPSec suite. The three levels of the framework are architecture, encapsulating security payload protocol and authentication header protocol, and internet key exchange. The ESP protocol consists of encryption algorithm and combined algorithm. The AH protocol consists of integrity protection algorithm.

Figure 4.38 Relationship among the IPSec specifications (after RFC 6071).

Cryptography (see [22] for not only an explanation of the algorithms but also the code) provides the foundation for these services. To this end, IPsec supports a range of cryptographic algorithms to maximize the chances of the two endpoints involved having a common supported algorithm.61 It also allows new algorithms to be added as they emerge and to supplant those found to be defective. For example, the latest IPsec requirements for cryptographic algorithms, as specified in RFC 7321,62 mandate support for the Advanced Encryption Standard (AES) in light of the weakness of the Data Encryption Standard (DES), and support for Keyed-Hashing for Message Authentication (HMAC) based on Secure Hash Algorithm-1 (SHA-1), because of the weakness of Message Digest 5 (MD5). Such open, full embrace of state-of-the-art cryptography seems natural today, but it actually took many years to achieve because of governmental restrictions that varied across countries; cryptography had invariably been considered part of munitions, and its export was regulated. Commercial and civilian users, if not banned from using cryptographic technology altogether, were restricted to using weak and inadequate algorithms. Some governments even mandated key escrow (i.e., disclosing keys to the authorities). Given the global nature of the Internet, this situation did not help Internet security, and it prompted the IETF to publish an informational RFC to voice its concerns. This RFC63 was assigned the number 198464 intentionally.

Central to IPsec is the notion of a security association. A security association is but a simplex end-to-end session that is cryptographically protected. Hence, a bidirectional session needs at least two security associations. Establishing a security association involves generating cryptographic keys, choosing algorithms, and selecting quite a few parameters. In unicast, a security association is identified by a parameter known as the Security Parameters Index (SPI).
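The bookkeeping around security associations can be made concrete with a minimal sketch of a security association database keyed by SPI. This is an illustration of the concept only—the field names and the algorithm string are assumptions, not the layout any real IPsec stack uses.

```python
# Minimal sketch of a security association database (SAD) keyed by the
# Security Parameters Index (SPI). A bidirectional session requires two
# simplex security associations, one per direction.
from dataclasses import dataclass

@dataclass
class SecurityAssociation:
    spi: int
    peer: str
    protocol: str    # "ESP" or "AH"
    mode: str        # "transport" or "tunnel"
    algorithm: str   # illustrative label, agreed during negotiation
    key: bytes

sad = {}

def install_sa(sa: SecurityAssociation) -> None:
    sad[sa.spi] = sa

# Two simplex SAs for one bidirectional ESP session with the same peer:
install_sa(SecurityAssociation(0x1001, "198.51.100.7", "ESP", "transport",
                               "AES-CBC with HMAC-SHA1", b"outbound-key"))
install_sa(SecurityAssociation(0x2002, "198.51.100.7", "ESP", "transport",
                               "AES-CBC with HMAC-SHA1", b"inbound-key"))

# On receipt, the SPI carried in the packet header selects the SA that
# holds the parameters needed to process the packet:
print(sad[0x2002].algorithm)
```

Note that the keys and algorithms stored here are exactly what the security association management part of IPsec (IKEv2, discussed next) negotiates and installs.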

IPsec consists of two parts: security association management and packet transformation. The security association management part deals with authenticating IPsec endpoints and then setting up and maintaining security associations. An essential procedure here is the negotiation of security services and agreement on the related security parameters. The whole procedure is performed at the beginning of an IPsec session and then periodically afterward for liveness checks. The corresponding protocol is complex, and its definition has gone through several iterations. Its present version, the Internet Key Exchange Protocol version 2 (IKEv2), is specified in RFC 7296.65 For a vivid explanation of how IKE works (and how it should have worked), [21] is again the best if not the only source.

The packet transformation part deals with the actual application of previously agreed on cryptographic algorithms to each packet. There are two distinct protocols for that task (which unfortunately reflects more on the standard's politics than necessity). One protocol, called Authentication Header (AH), is defined in RFC 4302.66 AH provides connectionless integrity protection, data origin authentication, and an anti-replay service. But true to its name, it does not provide confidentiality protection.

In contrast, the second protocol, Encapsulating Security Payload (ESP), defined in RFC 4303,67 does provide confidentiality protection as well as all the other security services. As a result, ESP is far more widely used than AH. ESP is computationally intensive though, because of all the cryptographic work it performs. There is hardware to speed this up, and since operating systems know best how to deal with such hardware, ESP has been implemented as part of the kernel in major operating systems.

Through ESP, secure communication becomes possible for three scenarios, explained below with the help of Figure 4.39.

Diagram describes the three IPsec scenarios: host-to-host, host-to-gateway, and gateway-to-gateway. The gateways front an enterprise network and a data center, respectively.

Figure 4.39 IPSec scenarios.

The first scenario is host-to-host. Two mutually authenticated hosts set up a session with each other using their public IP addresses over the Internet. Outsiders can neither eavesdrop on nor alter the traffic between them—as though a private communication line were in use.

The second scenario is host-to-gateway. A remote host connects to a security gateway of an enterprise network over the Internet. The gateway has access to all other machines in the network, and so once a tunnel is established, the host can communicate with them, too. This scenario is typical of enterprise VPN access. (Returning to VPNs for a moment, it is precisely the mechanism of this scenario that enables a secure IP-over-IP VPN. A private telephone circuit is replaced here by an IPsec tunnel.) As in the first scenario, the traffic is protected, but, unlike in the first scenario, the host may have a second IP address, which is significant only within the enterprise. In this case, the gateway also has two IP addresses—one for the hosts on the rest of the Internet, and one for the enterprise. The significance of this will become clear after we discuss NATs in the next chapter.

The third and last scenario is gateway-to-gateway. Two remote enterprise campuses are stitched over the Internet through their respective security gateways into an integral whole. This scenario is particularly relevant to Cloud Computing, since two data centers can be interconnected in exactly the same way. In essence, this scenario is similar to the second one, the only difference being that the tunnel carries aggregate rather than host-specific traffic.

IPsec supports two operational modes: transport and tunnel. In transport mode, each packet has only one IP header, and the IPsec header (either AH or ESP) is inserted right after it. To indicate the presence of the IPsec header, the protocol field in the IP header takes on the protocol number of AH or ESP. Figure 4.40 depicts the packet structure of IPsec in transport mode for IPv4. For an AH packet, the integrity check covers the immutable fields of the IP header, such as version, protocol, and source address.68 The integrity check value is part of the authentication header, which also carries an SPI and a sequence number. The SPI serves as an index to a security association database entry that contains the parameters (e.g., the message authentication algorithm and key) for validating the AH packet. The purpose of the sequence number, which is unique for each packet, is replay protection. Note that the IPsec header does not contain information about what mode is in use. The information is stored in the security association database.

Diagram shows the packet structure of IPsec in transport mode for IPv4. The cells in the AH packet are IP header, AH, TCP header and payload. The cells in the ESP packet are IP header, ESP, TCP header, payload, trailer and ICV.

Figure 4.40 IPsec in transport mode in IPv4.
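The AH-style processing just described—an ICV computed over the packet, an SPI to locate the parameters, and a sequence number for replay protection—can be sketched as follows. This is illustrative only: real AH computes a truncated HMAC over specific header fields per RFC 4302, whereas here we simply MAC the SPI, sequence number, and payload.

```python
# Toy integrity check and anti-replay logic in the spirit of AH.
# The key would come from the security association identified by the SPI.
import hmac
import hashlib

KEY = b"shared-sa-key"      # illustrative; agreed during SA establishment
seen_sequences = set()

def icv(spi: int, seq: int, payload: bytes) -> bytes:
    """Integrity check value over the protected fields (HMAC-SHA1)."""
    data = spi.to_bytes(4, "big") + seq.to_bytes(4, "big") + payload
    return hmac.new(KEY, data, hashlib.sha1).digest()

def verify(spi: int, seq: int, payload: bytes, received_icv: bytes) -> bool:
    if seq in seen_sequences:
        return False        # replay: this sequence number was already used
    if not hmac.compare_digest(icv(spi, seq, payload), received_icv):
        return False        # integrity check failed: packet was altered
    seen_sequences.add(seq)
    return True

tag = icv(0x1001, 1, b"hello")
print(verify(0x1001, 1, b"hello", tag))   # True: fresh, intact packet
print(verify(0x1001, 1, b"hello", tag))   # False: replayed packet rejected
```

(A real implementation uses a sliding receive window rather than an unbounded set of seen sequence numbers, since the latter would grow without limit.)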

For an ESP packet, the integrity check covers everything beyond the IP header. Furthermore, the integrity check value is not part of the ESP header. It is supplied in a separate field.

In tunnel mode, there are two IP headers: an “inner” IP header carrying the ultimate source and destination IP addresses, and an “outer” IP header carrying the addresses of the IPsec endpoints. Figure 4.41 depicts the respective packet structures. (Figure 4.41(a) is illustrative but not precisely correct. In addition to the immutable fields shown there, there are other immutable fields, such as the source IP address.)

Diagram shows the packet structure of IPsec in tunnel mode for IPv4. The cells in the AH packet are outer IP header, AH, Inner IP header, TCP header and Payload. The cells in the ESP packet are outer IP header, ESP, Inner IP header, TCP header, Payload, Trailer, and ICV.

Figure 4.41 IPsec in tunnel mode in IPv4.
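Tunnel-mode encapsulation can be illustrated schematically: the original packet (inner header plus payload) becomes the protected payload of a new packet whose outer header carries the gateway addresses. The dictionary field names below are illustrative, not wire formats, and the encryption step that real ESP would perform is elided.

```python
# Schematic tunnel-mode encapsulation between two security gateways.
def tunnel_encapsulate(inner_packet: dict, gw_src: str, gw_dst: str) -> dict:
    """Wrap the original packet behind a new outer IP header."""
    return {
        "outer_header": {"src": gw_src, "dst": gw_dst, "proto": "ESP"},
        # In real ESP the inner packet would be encrypted here, so routers
        # on the path see only the gateway addresses in the outer header.
        "protected": inner_packet,
    }

inner = {"inner_header": {"src": "10.1.0.5", "dst": "10.2.0.9"},
         "payload": b"application data"}
outer = tunnel_encapsulate(inner, "203.0.113.1", "203.0.113.2")
print(outer["outer_header"]["dst"])               # 203.0.113.2
print(outer["protected"]["inner_header"]["dst"])  # 10.2.0.9
```

This makes the gateway-to-gateway scenario concrete: the enterprise-private inner addresses never appear on the Internet, which is exactly why the scenario stitches two campuses (or two data centers) into an integral whole.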

One may wonder about the usefulness of AH given that ESP can do everything that AH can but not the other way around. One argument is that AH offers slightly more protection by protecting the IP header itself. Another argument is that AH works better with firewalls (which, again, we will discuss later). There are other arguments, too, but what matters in the end is that the main reason for using IPsec is confidentiality protection, and this is not a service that AH can provide. As far as the IPsec scenarios are concerned, the transport mode applies to scenario 1, while the tunnel mode applies to the rest.

So far, we have addressed the operation of IPsec only with IPv4. The operation of IPsec with IPv6 is not really very different, except, of course, for the headers. In the space of this book we cannot delve into the intricacies of IPv6. Again, [21] has done an excellent job on that, elucidating the aspects shaped by IETF politics.

Notes

References

  1. Tanenbaum, A.S. and Van Steen, M. (2006) Distributed Systems: Principles and Paradigms, 2nd edn. Prentice Hall, Englewood Cliffs, NJ.
  2. Birman, K.P. (2012) Guide to Reliable Distributed Systems: Building High-Assurance Applications and Cloud-Hosted Services. Springer-Verlag, London.
  3. Faynberg, I., Lu, H.-L., and Gabuzda, L. (2000) Converged Networks and Services: Internetworking IP and the PSTN. John Wiley & Sons, New York.
  4. International Organization for Standardization (1994) International Standard ISO/IEC 7498-1: Information Technology—Open Systems Interconnection—Basic Reference Model: The Basic Model. International Organization for Standardization/International Electrotechnical Commission (ISO/IEC), Geneva. (Also published by the International Telecommunication Union—Telecommunication Standardization Sector (ITU-T) as ITU-T Recommendation X.200 (1994 E).)
  5. Tanenbaum, A.S. and Wetherall, D.J. (2011) Computer Networks, 5th edn. Prentice Hall, Boston, MA.
  6. ITU-T (1996) ITU-T Recommendation X.25 (formerly CCITT), Interface between Data Terminal Equipment (DTE) and Data Circuit-terminating Equipment (DCE) for terminals operating in the packet mode and connected to public data networks by dedicated circuit. International Telecommunication Union, Geneva.
  7. Halsey, J.R., Hardy, L.E., and Powning, L.F. (1979) Public data networks: Their evolution, interfaces, and status. IBM Systems Journal, 18(2), 223–243.
  8. Metcalfe, R.M. and Boggs, D. (1976) Ethernet: Distributed packet switching for local computer networks. Communications of the ACM, 19(7), 395–405.
  9. Cerf, V. and Kahn, R. (1974) A protocol for packet network intercommunication. IEEE Transactions on Communications, 4(5), 637–648.
  10. Information Sciences Institute, University of Southern California (1981) DoD Standard Internet Protocol. (Published by the IETF as RFC 791.) Marina del Rey, CA.
  11. Huston, G. (1999) Interconnection, peering and settlements—Part II. The Internet Protocol Technical Journal, 2(2), 2–23.
  12. Turner, J.S. (1986) Design of an integrated services packet network. IEEE Journal on Selected Areas in Communications, SAC-4(A), 1373–1380.
  13. Clark, D.D., Shenker, S.S., and Zhang, L. (1992) Supporting real-time applications in an integrated services packet network: Architecture and mechanism. SIGCOMM ‘92 Conference Proceedings on Communications Architectures & Protocols, pp. 14–26.
  14. Stiliadis, D. and Varma, A. (1998) Latency-rate servers: A general model for analysis of traffic scheduling algorithms. IEEE/ACM Transactions on Networking (TON), 6(5), 611–662.
  15. Rekhter, Y., Davie, B., Rosen, E., et al. (1997) Tag switching architecture overview. Proceedings of the IEEE, 85(12), 1973–1983.
  16. Lu, H.-L. and Faynberg, I. (2003) An architectural framework for support of quality of service in packet networks. Communications Magazine, IEEE, 41(6), 98–105.
  17. Farrel, A. and Bryskin, I. (2006) GMPLS: Architecture and Applications. The Morgan Kaufmann Series in Networking. Elsevier, San Francisco, CA.
  18. Faynberg, I., Gabuzda, L.R., Kaplan, M.P., and Shah, N. (1996) The Intelligent Network Standards: Their Applications to Services. McGraw-Hill, New York.
  19. Lakshman, T., Nandagopal, T., Sabnani, K., and Woo, T. (2004) The SoftRouter Architecture. ACM SIGCOM HotNets, San Diego, CA. http://conferences.sigcomm.org/hotnets/2004/HotNets-III%20Proceedings/lakshman.pdf.
  20. McKeown, N., Anderson, T., Balakrishnan, H., et al. (2008) OpenFlow: Enabling innovation in campus networks. ACM SIGCOMM Computer Communication Review, 38(2), 69–74.
  21. Kaufman, C., Perlman, R., and Speciner, M. (2002) Network Security: Private Communications in a Public World. Prentice Hall PTR, Upper Saddle River, NJ.
  22. Schneier, B. (1995) Applied Cryptography: Protocols, Algorithms, and Source Code in C. John Wiley & Sons, New York.