CHAPTER 6
Cloud Storage and the Structure of a Modern Data Center

Data centers are the workhorses of Cloud Computing.

A data center is where servers, storage, and communication gear reside along with the necessary utilities (e.g., power, cooling, and ventilation equipment). Co-locating equipment this way is natural, since the environmental and physical-security needs are often the same. It also simplifies operations and maintenance. A case in point: ten computers in a single room are easier to safeguard physically than the same ten computers distributed across five rooms.

Most data centers consume vast amounts of energy unnecessarily, wasting 90% or more of the electricity they draw from the grid [1]. One reason for such inefficiency is under-utilization of servers: typical utilization figures range from 6% to 12%. Virtualization of data centers offers a way to increase server utilization and reduce energy consumption. It also loosens the traditional delineation of data centers by hardware, physical casing and wiring, floor space, and other physical attributes.

The resulting virtual data centers no longer have well-defined physical boundaries. It goes without saying that a physical data center may host multiple virtual data centers. If the virtual data centers are intended for more than one organization, the physical data center is multi-tenant.

But it is also possible to have a virtual data center spanning multiple physical data centers. For instance, an organization may outsource part of its IT infrastructure while maintaining a private data center. In this case the two data centers, which are separate geographically and administratively, are stitched together through an appropriate virtual private network (or tunneling) mechanism to ensure isolation.

Naturally, on-demand allocation of virtual resources and dynamic relocation of the allocated resources are two essential features that enable virtualization of a data center. Where to allocate and relocate resources may be based on criteria such as performance requirements, load balancing, improved resilience, disaster recovery, and regulatory compliance. The mapping of physical to virtual resources is a matter of implementation. Overall, in addition to virtualization technology, it is also necessary to have a Cloud management system that manages all the resources across the underlying infrastructure and provides a uniform interface to applications [2]. Our discussion of the Cloud management system takes place in Chapter 7.

This chapter discusses the enablers of data center virtualization. We start with a bird's-eye view of a traditional data center, introducing its high-level functional architecture. The core components of this architecture are the compute (sic), storage, and networking. One characteristic of traditional data centers is the use of a dedicated, specialized network for storage traffic to bolster performance. This is costly. Fortunately, technological advances are making it possible to use a single network for all traffic without causing performance problems. Thus emerges the next-generation data center, which we are going to describe along the way.

Then we zoom in on storage-related matters, since computing and networking each have their own chapters. We draw upon the shared storage model [3] from the Storage Networking Industry Association (SNIA) to introduce the taxonomy of storage. We study three types of storage that are distinguished by how they connect to the host: direct-attached storage, network-attached storage, and the Storage Area Network (SAN). The technology to interconnect the processor and storage devices ranges from the Small Computer System Interface (SCSI), with a maximum cable length on the order of 10 meters, to Fibre Channel (FC) or Ethernet, with a maximum cable length on the order of 10 kilometers. For a long time, SCSI was the dominant technology used for direct-attached storage. Its parallel bus design, however, limits the speed and cable length. It is being replaced by the Serial Attached SCSI (SAS) technology, which makes the interconnection faster and over longer distances. Direct-attached storage, though, is difficult, if not impossible, to share.

Network-attached storage addresses this limitation. In particular, it allows file sharing over an IP network. But file sharing does not quite work for database applications, which require storage access at a lower level. In addition, storage throughput is limited by the underlying networking media.

This is where SAN enters. SAN is tailored to interconnect storage systems at high speed while supporting resource pooling and block-level access to the pooled resources. It is predominantly based on FC, a popular technology in data centers, which combines the qualities of a serial I/O bus and a switching network.

Obviously, deploying and managing a separate, bespoke network just for storage is expensive, as it entails specialized hardware as well as extra staff to operate it. Consequently, there has been constant interest in finding a way to use a single converged network to carry all types of traffic (while doing so effectively for storage-related traffic). We review the development in this area, focusing on the two primary approaches in data centers: FC over Ethernet (FCoE) and Internet SCSI (iSCSI). (It is interesting that at about the same time when iSCSI was being developed, a standard interface for object storage was in the works as well. This happened for a good reason. Object storage, which we discuss next, has been viewed as a missing link toward fulfilling the promise of the shareable storage enabled by iSCSI: unmediated host access with granular access control.)

The next topic discussed in this chapter is storage virtualization, which is a mechanism for shielding applications from the underlying detail of physical storage. Virtualization of network storage is important to Cloud Computing because it enables effective resource pooling and simplifies management tasks, such as snapshots and migration.

Finally, we discuss solid-state storage. In terms of storage media, technologies exist with varied performance and cost. At the one extreme is Random Access Memory (RAM); at the other extreme is magnetic tape. Somewhere in the middle is the relatively new flash-memory-based solid-state technology. Like RAM, flash memory is semiconductor-based, and so no moving parts are involved. It is also faster than hard disk and cheaper than RAM. These qualities make flash memory a viable member of the storage hierarchy and a serious challenger to the hard disk. Moreover, given its superior performance in random read operations, flash memory has become essential to Cloud Computing. The quest for high performance in Cloud Computing also prompts developments such as RAMCloud [4] and Memcached [5], which we examine as well. Both developments exploit RAM but focus on different aspects. RAMCloud in essence aims to build a remote cache of practically infinite capacity, while Memcached supports a simple key-value store for caching arbitrary data within the respective memory units of a pool of commodity computers.

The limits on the size of this book don't allow us to address some important aspects of data centers such as configuration, power, and thermal management. We refer to [6] for a complementary overview.

6.1 Data Center Basics

Figure 6.1 depicts the high-level functional architecture of a traditional data center. The hardware modules are organized into rows of racks of a standard dimension to ease deployment. This section introduces the key components, which, once virtualized, become the building blocks of virtual data centers.

Diagram shows a traditional data center which consists of storage arrays, a storage area network, blade servers, rack-mounted servers, NAS arrays, two ToRs, a ToR or EoR, an access and aggregation network, and a gateway connected to the WAN.

Figure 6.1 Traditional data center.

6.1.1 Compute

The compute components are high-performance computers called servers, which are accessible via a network. They are expected to be reliable and capable of handling large workloads. Servers span a wide range in cost and capabilities. In comparison with desktop computers, they are much more expandable in terms of computing and input/output capacity. Servers also have different form factors than desktop computers. They come in the form of either rack-mounted or blade servers. These forms are optimized to reduce their physical footprint and interconnection complexity (cabling spaghetti). Such optimization is necessary in the face of an ever-increasing number of servers that need to be put in the constrained space of a data center.

A rack-mounted server is inserted horizontally into a rack (typically 19 inches wide). Its size is denoted by its height, which varies in discrete multiples of the rack unit of 1.75 inches (known as RU or simply U), as defined by [7]. Namely, a 1U server is 1U high, a 2U server is 2U high, and so on. Most single- and dual-socket servers are available as 1U servers.

A rack housing rack-mounted servers may be a simple metal enclosure or it can be a complex piece of equipment armed with power distribution, air or liquid cooling, and a keyboard/video/mouse switch that allows a single keyboard, video, and mouse to be shared among servers.

A blade server (or simply a blade) is even more compact than a rack-mounted server. The smaller form factor is achieved by eliminating pieces that are not specific to computing—such as cooling. As a result, a blade may amount to nothing more than a computer circuit board that has a processor, memory, I/O, and an auxiliary interface. Such a blade certainly cannot function on its own. It is operational only when inserted into a chassis that incorporates the missing modules. The chassis accommodates multiple blades. It also provides a switch through which the servers within connect to the external network. Worth noting here is that the chassis also fits into a rack much like a rack-mounted server.

A given rack space can house more blade servers than rack-mounted servers. The chassis–blade arrangement offers other benefits as well: reduced power consumption, simpler cabling, lower cost, and so on. This makes blade servers more attractive in the Cloud Computing environment.

6.1.2 Storage

In terms of how it is connected to servers, storage may be classified as Direct-Attached Storage (DAS), Network-Attached Storage (NAS), and Storage Area Network (SAN). For simplicity, Figure 6.2 depicts only NAS and SAN.

Diagram shows a next-generation data center which consists of storage arrays, blade servers, rack-mounted servers, three ToRs, an access and aggregation network, and a gateway connected to the WAN.

Figure 6.2 Next-generation data center.

DAS, as the term implies, is directly attached to a processor through a point-to-point link. (The dominant technology in this case is the hard-disk drive.) In contrast, NAS and SAN reside across a network. This network is purpose-built for, and dedicated to, storage traffic in the case of SAN. One major difference between NAS and SAN lies in the semantics of the interface. The NAS units are files or objects, while the SAN units are disk blocks. Another key difference lies in the underlying transport. SAN relies on specialized transport, FC, which is optimized for storage traffic. NAS does not require anything special apart from the IP network. We will discuss the integration of both types of access after taking a closer look at the forms of storage.

For now, we note that NAS and SAN are readily applicable to Cloud Computing but DAS has a limitation. An essential feature of Cloud Computing is flexible allocation of virtual machines based on, among other factors, resource availability and geographical location. In the DAS case, when a virtual machine moves to a new physical host, the associated storage needs to move to the same host, too, which is likely to result in consuming both much bandwidth and much time.

Storage is further classified as online storage and offline storage. Online storage is accessible to a server, while offline storage, which is intended for archiving, is not. Magnetic tape libraries and optical jukeboxes are common implementations of offline storage. They usually come with automatic control via a robotic arm that can locate, fetch, mount, dismount, and put back a tape or disc. Google data centers, for example, have employed robotic tape libraries for backup.

Besides magnetic tapes and optical discs, common storage media include magnetic disks and integrated circuits (i.e., solid-state electronics). Among these, magnetic hard disks are most prevalent. Because of their mechanics, they are much more suited to sequential than random access. Here, solid-state storage comes to the rescue. Originally used in mobile devices such as smart phones, digital cameras, and MP3 music players, solid-state storage is faster (from 100 to 1000 times) and sturdier than hard disks, but it is more expensive. As the price of solid-state storage continues to drop, however, it has become a viable option. It is particularly applicable to Cloud Computing, which makes I/O operations more random than ever because of hardware sharing across unrelated applications. We will discuss solid-state storage further in Section 6.2.5.

6.1.3 Networking

The servers of a data center need to be interconnected, and they need to connect to the outside world as well. As the number of servers increases, more cables have to fit into a given space. Top-of-Rack (ToR) and End-of-Row (EoR) are two approaches to connectivity resulting in different cabling options. In the ToR approach, each rack has a switch at the top to which all servers in the rack connect. As a result, the cable connecting a server to the ToR switch does not need to be longer than the height of the rack. A ToR switch typically provides external network access. Normally, it is sufficient for a ToR switch to have just enough ports to support the servers within the same rack.

In the EoR approach, each row (of racks) has a switch at its end to which all nearby servers and switches in other racks in the same row connect. This may require long cables of different lengths running between the servers and switches. Depending on its actual length and required bandwidth, a fiber-optic cable may be needed (where a copper connection would suffice in the ToR case). Here the cost of cabling could exceed the cost of a server supporting multiple links.

An EoR switch is placed in a rack (possibly all by itself due to its size). It may provide network access and aggregation.

Both ToR and EoR switches are typically implemented using Ethernet technology. We will return to this subject later. For now we just note that Ethernet technology is particularly important to data centers because of its potential to eliminate employing separate transport mechanisms (e.g., FC) for storage and interprocessor traffic. Figure 6.2 depicts a next-generation data center with common Ethernet transport.

Finally, note that the data-center aggregation network connects to the Wide Area Network (WAN) through a gateway. On the other side of the WAN may be a single user device or a full-blown data center.

6.2 Storage-Related Matters

As the number of users and devices connecting to the Internet keeps rising, the world is deluged with data. Every day brings with it more than an exabyte (i.e., 10^18 bytes) of traffic on the Internet. In 2011 alone, more than 1.8 exabytes of data were created globally (with about three-quarters of the data created by human users). The total amount of data is staggering, and it is still growing. The need to store and process the data puts an enormous strain on Cloud storage.

There are three aspects to keeping up with this pressure. First, the storage capacity needs to increase constantly; second, the stored data need to be secured; and third, access to the data needs to be made more efficient.

To help examine matters related to Cloud storage, we draw upon the shared-storage model [3] from the SNIA. Initially developed in 2001, the model reflects the trend at the time that storage should be managed as an independent resource shared among multiple computing systems. More than ten years later, Cloud Computing not only continues this trend but also amplifies it in a major way.

As shown in Figure 6.3, the SNIA model consists of multiple layers, with each layer providing certain services in an implementation-neutral manner to the higher layers. As a result, the higher layers are shielded from the implementation details of the lower layers and the design complexity of the system is reduced. In this sense, the model is similar to the OSI model [8].

Diagram shows the SNIA model with four layers. From bottom to top, the layers are block, block aggregation, file or record, and application. The first three layers are in the storage domain. The first layer contains storage devices; the second layer has host, network, and device aggregation; database and file system are in the third layer.

Figure 6.3 SNIA shared storage model.

At the top is the application layer that uses the services provided by the underlying storage domain. Example applications are web servers, search engines, analytics engines, and online transaction engines. The application layer is included in the model only to show the relationship between the storage domain and its clients; storage-specific applications have a special place. The services subsystem (denoted by the services box in Figure 6.3) captures the basic functions such as discovery, management, security, and backup.

Besides the services subsystem, the storage domain is divided into the file/record layer, the block aggregation layer, and the block layer. The file/record layer serves the application layer and is normally implemented in software. It presents data in terms of files, file records, and similar items that are easily accessible to applications. This involves mapping the application-accessible parcels to the underlying logical building blocks (i.e., logical volumes). Database management and file systems in wide use belong here. A file system maps bytes to files to volumes. Similarly, a database management system maps records to tables to volumes. The file/record layer may be implemented in a host alone or as a network file system. The latter is a special case of network-attached storage, to be discussed later.

The block aggregation layer provides services to the file/record layer. It provides block-based aggregation independent of the actual storage devices, how the devices are interconnected, and how storage is distributed among them. To this end, aggregation may be achieved through virtualization at the host, network, or device level, involving tasks such as space management, striping, and mirroring. In particular, the transport of data to and from storage devices is governed by a set of peripheral interface and storage network standards, which we will address later.
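To make the block aggregation layer concrete, here is a minimal sketch (in Python, with invented names and a fixed stripe unit) of how a logical block address of an aggregated volume might map to a physical disk and block under striping, and to multiple copies under mirroring.

```python
# A minimal sketch of block aggregation: striping and mirroring.
# Names (stripe_map, mirror_map) and the fixed stripe unit are illustrative only.

STRIPE_UNIT = 8          # blocks per stripe unit (hypothetical)
NUM_DISKS = 4            # disks in the aggregated volume

def stripe_map(logical_block: int) -> tuple[int, int]:
    """Map a logical block of the volume to (disk index, physical block)."""
    stripe_unit_index = logical_block // STRIPE_UNIT    # which stripe unit
    offset_in_unit = logical_block % STRIPE_UNIT        # offset inside it
    disk = stripe_unit_index % NUM_DISKS                # round-robin across disks
    physical_block = (stripe_unit_index // NUM_DISKS) * STRIPE_UNIT + offset_in_unit
    return disk, physical_block

def mirror_map(logical_block: int) -> list[tuple[int, int]]:
    """Mirroring writes the same block to every disk in the mirror set."""
    return [(disk, logical_block) for disk in range(2)]  # a two-way mirror

if __name__ == "__main__":
    print(stripe_map(0), stripe_map(9), stripe_map(35))  # spread across disks
    print(mirror_map(35))                                # identical copies
```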

The block layer provides services to the block aggregation layer, providing low-level storage of fixed-size blocks and functions such as numbering of logical units, caching, and access control.

6.2.1 Direct-Attached Storage

DAS, the most common storage arrangement, is dedicated to a single host. At least initially, the defining characteristics of this type of storage are that (1) the host and storage devices are interconnected through point-to-point links and (2) the host controls the devices. Figure 6.4 shows an example (in relation to the SNIA model) where a block-oriented protocol is used over direct links, and block aggregation is done by either the host (through a logical volume manager) or the storage array controller. A block-oriented protocol handles data in terms of fixed-size blocks. In contrast, a file-oriented protocol handles data in terms of variable-size files, which are then divided into blocks handled by the storage. Network-attached storage such as the Network File System (NFS) employs a file-oriented protocol, which we will discuss in the next section.

Direct-attached storage diagram shows the storage array in the block layer connected to hosts both with and without a logical volume manager in the file or record layer. The block layer consists of block aggregation, which includes host, network, and device. The top layer is the application layer.

Figure 6.4 An example of direct-attached storage.

Since DAS is not subject to network delay, it is suitable for keeping local data such as boot image and swap space. Depending on the location of the storage device with respect to the host, DAS may be internal or external. The internal hard-disk drive of a host is an example of internal DAS. Naturally, the total capacity of internal DAS is in part constrained by the amount of physical space within a computer enclosure. External DAS is more flexible in this regard. Whether DAS is internal or external, there is the need for an interface for the host and storage device to communicate with each other to carry out I/O operations. Figure 6.5 shows exactly where the interface belongs. The bus adapter on the host and the controller on the storage device implement the interface. The host bus adapter serves as a bridge between the system I/O bus and the direct attachment interface, and shields the host from the details of the storage device. As a result, a storage device with a standard interface can be attached to hosts of distinct processors and architectures. The Small Computer System Interface (SCSI) is an example of this type of standard interface. Applicable to both internal and external DAS, SCSI is in common use in data centers.

Schematic diagram shows direct-attachment interface which includes memory bus, processor, memory, system I/O bus, and storage device controller connected to host bus adapter via direct attachment interface.

Figure 6.5 A schematic direct-attachment interface.

SCSI was first standardized by the American National Standards Institute (ANSI) in 1986 as X3.131-1986. The standard is based on the Shugart Associates System Interface (SASI), introduced around 1979 by Shugart Associates, then a premier disk drive manufacturer. In light of the ensuing developments, the original standard is also known as SCSI-1. It defines a parallel bus for attachment of various types of peripheral devices, including hard-disk drive, tape drive, CD-ROM, scanner, printer, and host bus adapter. Multiple devices (or more precisely their controllers) can be attached to the same bus via daisy-chaining (as depicted in Figure 6.6), resulting in the multi-drop configuration. There is, however, a limit on the number of devices that can be chained. The limit depends on the data bus width. A k-bit-wide bus supports at most k devices, including the host bus adapter. The limitation is due to the underlying design of mapping each single enabled bit to a particular address assignable to a device. Figure 6.7 shows the mapping for an eight-bit data bus. Note that the address (or identifier) of a SCSI device implies a certain priority level, which is an important factor in bus arbitration during contention. The same prioritization scheme is used in cases with wider buses. Namely, the more significant a bit is, the higher the priority assigned to its corresponding address. As it turns out, backward compatibility is important, and so a somewhat convoluted scheme is in place to preserve the priority rankings of the first byte of the bus.

Diagram shows an example SCSI configuration where the host, through the SCSI host adaptor (the initiator), is connected over the SCSI bus to daisy-chained SCSI target devices, with a terminator at the end of the bus.

Figure 6.6 An example SCSI configuration.

SCSI mapping for an eight-bit data bus from bit zero to bit seven. Bit zero stands for SCSI ID equals zero, bit one stands for SCSI ID equals one, and so on up to bit seven, which stands for SCSI ID equals seven. Significance and priority increase from bit zero to bit seven.

Figure 6.7 SCSI addressing for an 8-bit data bus.
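The addressing and arbitration scheme just described can be sketched in a few lines of Python. The helper names are invented, and the wide-bus ordering shown assumes the conventional ranking in which IDs 0 through 7 keep their original priorities ahead of IDs 8 through 15.

```python
# A sketch of SCSI addressing and bus-arbitration priority (8- and 16-bit buses).
# The wide-bus ordering below assumes IDs 0-7 keep their original priorities
# ahead of IDs 8-15; treat it as illustrative.

def select_line(scsi_id: int) -> int:
    """A device with ID n asserts data-bus bit n during selection/arbitration."""
    return 1 << scsi_id

def priority_order(bus_width: int) -> list[int]:
    """IDs from highest to lowest arbitration priority."""
    first_byte = list(range(7, -1, -1))            # 7 is highest, 0 lowest
    if bus_width == 8:
        return first_byte
    upper = list(range(bus_width - 1, 7, -1))      # 15..8 rank below 0
    return first_byte + upper

def arbitrate(contending_ids: list[int], bus_width: int = 8) -> int:
    """Return the ID that wins arbitration among the contenders."""
    order = priority_order(bus_width)
    return min(contending_ids, key=order.index)

if __name__ == "__main__":
    print(bin(select_line(3)))              # 0b1000: bit 3 asserted for ID 3
    print(arbitrate([2, 5, 7]))             # 7 (the usual HBA ID) wins
    print(arbitrate([0, 8], bus_width=16))  # 0 outranks 8 on a wide bus
```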

A device can be further associated with multiple logical units addressable by Logical Unit Numbers (LUNs). To the operating system, a LUN appears as an I/O device. An example of a multiple-LUN device is a CD jukebox, where each CD is a logical unit that can be addressed separately. Naturally, to serve any I/O purposes the devices on the same bus need to be able to communicate with one another as well as with the host. At a high level, SCSI communication is based on the master–slave model shown in Figure 6.8. In SCSI parlance, the entity issuing requests is called the initiator and the entity responding to requests, the target. A request may be for an I/O operation (a command) such as read or write, or for a task management function such as aborting an operation. In the case of an I/O operation, the data transfer takes place between the request and the final response. The direction of the data transfer is from the initiator to the target in a write operation, and the other way around in a read operation.

Diagram shows the SCSI master–slave model in which the left column consists of initiator device, client, and initiator port and the right column consists of target device, server, and target port. There is physical connection between initiator and target ports through service delivery subsystem. Data transfer from client to server occurs through request and from server to client through response.

Figure 6.8 SCSI client–server model.

The SCSI standard supports a set of commands, each command assigned its own operation code. The specifics of a command are communicated to the target via a Command Descriptor Block (CDB) over the bus (which is part of the service delivery subsystem). Because the bus is shared, SCSI specifies how exclusive control of the bus is arbitrated among multiple devices. Usually the device with the highest-priority address wins the arbitration. The winning device becomes the initiator, in a position to select and command a target device to carry out the desired I/O operations. It thus follows that the Host Bus Adaptor (HBA) is assigned the highest-priority identifier to secure the role of the initiator.
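As an illustration of a CDB, the following sketch builds the 10-byte descriptor block for a READ(10) command (operation code 0x28). The field layout follows the common format of SCSI block commands, with the logical block address and transfer length packed in big-endian order; flag fields are simply left at zero here.

```python
import struct

# A minimal sketch of building a 10-byte Command Descriptor Block (CDB) for
# the SCSI READ(10) command (operation code 0x28). Flag bits are left at zero
# for simplicity; a real driver would fill them as needed.

READ_10 = 0x28

def build_read10_cdb(lba: int, num_blocks: int) -> bytes:
    """Return a READ(10) CDB addressing `num_blocks` blocks starting at `lba`."""
    return struct.pack(
        ">BBIBHB",
        READ_10,     # byte 0: operation code
        0,           # byte 1: flags (DPO/FUA etc.), unused here
        lba,         # bytes 2-5: logical block address, big-endian
        0,           # byte 6: group number
        num_blocks,  # bytes 7-8: transfer length in blocks, big-endian
        0,           # byte 9: control
    )

cdb = build_read10_cdb(lba=0x12345678, num_blocks=16)
print(len(cdb), cdb.hex())   # 10 28001234567800001000
```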

Over the years, SCSI has gone through several iterations of improvement in various aspects—including reliability and performance—making SCSI useful to data centers. Figure 6.9 compares different SCSI versions in terms of bus width, clock rate, throughput, and number of devices supported. On top of the various improvements, SCSI has evolved from a self-contained standard addressing a myriad of aspects (such as protocols and cabling) to a family of standards with a layered structure separating physical interconnections from transport protocols and I/O commands. Figure 6.10 is a snapshot of the forever-evolving SCSI family, which covers varied types of I/O device and interconnections. At the center is the SCSI Architecture Model (SAM), the glue that holds the family together. It specifies a functional abstraction of the common behaviors of I/O devices and interconnections in terms of objects, protocol layers, and service interfaces. Adjacent protocol layers, in particular, interact with each other through well-defined service requests, indications, responses, and confirmations (see Figure 6.11). The layered structure allows flexibility in choice of interface hardware, software, and media in actual implementations. It also allows each layer to advance independently.

Table shows the comparison of different SCSI versions in terms of bus width, clock rate, throughput, and number of devices supported. Different versions include SCSI-1, fast SCSI, fast wide SCSI, ultra SCSI, ultra wide SCSI, ultra2 SCSI, ultra2 wide SCSI, ultra3 SCSI, ultra-320 SCSI, and ultra-640 SCSI.

Figure 6.9 Comparison of different SCSI versions.

Diagram shows the organization of SCSI standards, covering various types of commands and interfaces, held together by the SCSI architecture model at the center. Commands include SBC, SSC, OSD, SMC, SCC, and MMC. Interfaces include SPI, SBP, FCP, SSA-S3P, SRP, iSCSI, and SAS.

Figure 6.10 Organization of SCSI standards.

Diagram depicts the interlayer relationship between the SCSI application layer and the SCSI transport layer. The application layer consists of client and server. The layers interact with each other through STP service requests, indications, responses, and confirmations.

Figure 6.11 SCSI interlayer relationship.

One important advance as a result is Serial Attached SCSI (SAS), first published as an ANSI standard in 2003. (The standard is called ANSI INCITS 376-2003.) The shift to serial attachment aims to bypass the progressively intractable problems associated with parallel attachment as throughput increases. The key difference between the two lies in the physical layer. Traditional SCSI uses multiple lines to transfer data in parallel, which is subject to, among other things, cross-talk and inconsistent signal arrival times (i.e., timing skew). In contrast, SAS uses a single line to transfer data sequentially. Free from the problems associated with parallel attachment, SAS can support faster clocks and greater distances. At the time of this writing, 1.5 GBps is already available in the version known as SAS-3 and the throughput is expected to grow. In comparison, the fastest version of traditional SCSI (i.e., Ultra-640 SCSI) supports 640 MBps. SAS also offers other advantages, including better scalability (i.e., the capability to attach tens of thousands of devices) and simplified cabling. Because of its superiority, SAS is supplanting its parallel predecessors.

The evolution of SCSI technology actually follows that of Advanced Technology Attachment (ATA), a low-cost, popular interface used in internal DAS in personal computers and electronic devices. Originally designed as a parallel interface for the IBM PC AT, ATA has given rise to Serial ATA (SATA), which SAS eventually leveraged. Among other things, SAS uses point-to-point interconnection just as SATA does, supports a superset of SATA signals (which are electric patterns sent on power-up for initialization, resetting, and speed-negotiation purposes), and adopts connectors that are compatible with SATA. As a result, it is possible to interconnect a mixed set of SAS and SATA devices, which, in turn, increases the relevance of SATA technology, along with its later extension for external storage (called eSATA), to the needs of data centers.

The point-to-point interconnection is a departure from daisy chaining. It requires a special device (called an expander) when there are more than two SAS devices to be interconnected. The device (distinct from a SAS device, which serves as an initiator, a target, or both) is essentially a virtual-circuit switch allowing an initiator to connect to multiple targets. It does so using three routing methods: direct, table, and subtractive. In direct routing, the expander recognizes that the targets of connection requests are directly attached and routes the requests accordingly. In table routing, the expander routes connection requests to an attached expander based on a routing table (which is provisioned or created through a discovery procedure). In subtractive routing, the expander routes connection requests that it cannot resolve with the other two methods to another expander that may be able to resolve them.

Expanders come in two flavors: edge expanders and fan-out expanders. One difference between them is the number of expanders that can be attached. An edge expander can be attached to multiple SCSI devices but just one other expander. In contrast, a fan-out expander can be attached both to multiple SCSI devices and multiple expanders. Figure 6.12 shows an example configuration involving an HBA, five SAS storage devices, and two expanders. The HBA connects to each storage device point-to-point through the dedicated virtual circuits provided within the expanders. It is straightforward to interconnect more SAS devices by adding extra expanders. The relative flexibility in the use of expanders explains why SAS can support many more devices than traditional SCSI can. The gains, however, come at the expense of simplicity. A set of new needs arises, including those for connection and configuration management, a robust addressing scheme (which is not tied to the physical layout of the parallel bus), and a communication mechanism between a SAS device and an expander.

Diagram shows a SAS configuration which includes a host with a SAS host adaptor and two expanders; one expander is connected to two target devices and the other to three.

Figure 6.12 An example of SAS configuration.
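A simplified sketch of the three routing methods is given below. The Expander class and its attributes are purely illustrative and are not taken from the SAS standard.

```python
# A simplified sketch of how a SAS expander resolves a connection request,
# trying direct routing, then table routing, then subtractive routing.
# Class and attribute names are illustrative, not from the standard.

class Expander:
    def __init__(self, direct_devices, route_table, subtractive_port=None):
        self.direct_devices = direct_devices      # {SAS address: phy} of attached end devices
        self.route_table = route_table            # {SAS address: phy leading to another expander}
        self.subtractive_port = subtractive_port  # phy toward the expander of last resort

    def route(self, destination: str):
        if destination in self.direct_devices:          # direct routing
            return ("direct", self.direct_devices[destination])
        if destination in self.route_table:             # table routing
            return ("table", self.route_table[destination])
        if self.subtractive_port is not None:           # subtractive routing
            return ("subtractive", self.subtractive_port)
        raise LookupError("connection request cannot be routed")

edge = Expander(direct_devices={"disk-1": 4, "disk-2": 5},
                route_table={"disk-3": 8},   # reachable via a neighboring expander
                subtractive_port=0)
print(edge.route("disk-1"))   # ('direct', 4)
print(edge.route("disk-3"))   # ('table', 8)
print(edge.route("disk-9"))   # ('subtractive', 0)
```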

The SAS architecture supports three protocols:

  1. Serial SCSI Protocol (SSP), which is for communication between two SAS devices and between a SAS device and an expander. It preserves the SCSI command set while adding support for multiple initiators and targets. SSP is the primary protocol in SAS;
  2. SATA Tunneled Protocol (STP), which allows a SAS initiator device to communicate with a SATA target device through an expander. The expander, serving as a gateway, speaks STP on the initiator side and SATA on the target side. The protocol extends SATA to support multiple initiators; and
  3. Serial Management Protocol (SMP), which is for communication with expanders. It covers discovery and configuration management.

Figure 6.13 shows the SAS architecture. Again, as in the OSI model, each layer provides services to the layer above and utilizes the services provided by the layer below:

Diagram shows the serial attached SCSI architecture. From bottom to top, the layers are: physical, phy, link, port, transport, and application.

Figure 6.13 Serial attached SCSI architecture.

  • Physical layer deals with the physical and electrical characteristics of cables, connectors, and transceivers.
  • Phy layer deals with line coding, out-of-band signals, and other preparations (e.g., speed negotiation) necessary for serial transmission. The name of the layer reflects the logical construct phy that represents a transceiver (consisting of a transmitter and a receiver) on a device. A phy has an 8-bit identifier that is unique within a device. The identifier is assigned by a management function. Its value is an integer equal to or greater than zero and less than the number of phys on the device. On line coding, SAS prescribes 8b/10b, following the lead of FC (to be discussed later). 8b/10b was originally developed by IBM researchers [9] for high-speed data transmission. The moniker reflects a key characteristic of the coding scheme: the 8-to-10-bit transformation of data blocks before transmission. The transformation is optimized to have enough transitions (i.e., 0 to 1 or 1 to 0) in each encoded block to keep the sender and receiver in sync, and to have the number of 0s and 1s as equal as possible to minimize the direct-current component. The 8-to-10 expansion gives enough room for such optimization while incurring a 25% transmission overhead. In comparison, Manchester encoding, which is used in 10-Mbps Ethernet, has a 100% transmission overhead. The large overhead becomes a problem as the speed increases.
  • Link layer defines the primitives and their encodings on the wire and handles, among other things, connection management and flow control. Three link layers are defined—for SSP, STP, and SMP, respectively.
  • Port layer is primarily responsible for managing the phys on a port. A port contains one or more phys and is assigned a unique identifier by the device manufacturer. The identifier is the address used in all communications. It is 64 bits long in the World Wide Name format, which is also supported in FC (to be discussed later).
  • Transport layer addresses the transport services as defined in SAM and framing (including the frame formats) for SSP, STP, and SMP. In the case of SSP, the frame format incorporates the CDB data structure and other constructs to carry the information related to SCSI operations.
  • Application layer supports SCSI operations, ATA operations, and SAS management. To send commands to a server, for instance, an application client invokes the appropriate transport services (typically implemented as procedural calls).

The monographs [10, 11] provide additional information on ATA, SCSI, and their serial counterpart technologies.

6.2.2 Network-Attached Storage

NAS provides file- or object-level access over a local area network. Placing storage on the network facilitates information sharing among many computers and simplifies the related storage management. In this respect, NAS is a good fit for Cloud Computing, an observation validated by the rapid development of high-capacity, high-availability NAS systems for data centers.

A network file system is the earliest and most well-known manifestation of NAS. It is accessible to an arbitrary number of remote clients through an application-layer protocol as though it were local. Figure 6.14 depicts the arrangement. To help explain NAS, we review the key concepts of file systems and then the widely implemented NFS, originally developed at Sun Microsystems [12].

Diagram depicts the arrangement of network-attached storage. It shows network file system in the block layer connected to two hosts in the application layer through LAN in the file or record layer. Block layer consists of block aggregation which includes host, network, and device.

Figure 6.14 A network file system.

A file system native to a host is maintained by the operating system. At the highest level, a file system appears as a collection of files and directories (or folders). Files and directories can be created, deleted, opened, closed, read, and written. They can also be moved from directory to directory. Most file systems support a hierarchical structure: a directory may have subdirectories; a subdirectory may have sub-subdirectories; and so on. Figure 6.15 shows such a directory.

Diagram shows a hierarchical directory 'My directory'. It has three subdirectories: music, books, and photos. The subdirectories are further divided into sub-subdirectories; music has a subdirectory labelled classical.

Figure 6.15 A hierarchical directory.

A file system manages space in terms of blocks, in step with the back-end storage directly attached to the host. A block is a fixed-size sequence of bytes that is addressable as a whole. A file is implemented as a linked list of blocks. How the file is represented to a user differs from one operating system to another. Earlier operating systems defined files as lists of records of specified format. In contrast, in the Unix operating system—as well as in Linux—a file is merely a sequence of bytes.

The block size is constrained by the logical organization of the underlying storage medium. It is traditionally set to a multiple of the smallest unit that can be handled on a magnetic disk—the dominant storage medium for the last 50 years or so.
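The following toy example illustrates a file stored as a linked list of fixed-size blocks, with the block size chosen as a multiple of a 512-byte sector; all names are invented.

```python
# A toy illustration of a file stored as a linked list of fixed-size blocks.
# The block size is a multiple of the 512-byte sector size; names are invented.

SECTOR_SIZE = 512
BLOCK_SIZE = 8 * SECTOR_SIZE      # 4 KiB blocks

class Block:
    def __init__(self, data: bytes, next_block: "Block | None" = None):
        assert len(data) <= BLOCK_SIZE
        self.data = data
        self.next = next_block        # pointer to the next block of the file, or None

def read_file(first_block: Block) -> bytes:
    """Follow the chain of blocks and reassemble the file as a byte sequence."""
    out, block = bytearray(), first_block
    while block is not None:
        out.extend(block.data)
        block = block.next
    return bytes(out)

# A two-block file: the last block is only partially filled.
tail = Block(b"end of file")
head = Block(b"x" * BLOCK_SIZE, next_block=tail)
print(len(read_file(head)))       # 4096 + 11 bytes
```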

A magnetic disk is a stack of platters, each platter coated with a magnetic material (such as ferric oxide) on both surfaces. It is also known as a hard disk, because the platter is made of a rigid material. Figure 6.16 shows the organization of a typical hard-disk drive. The stack of platters rotates on a common axis at a constant speed, and the disk arm with a set of read/write heads—one head for each surface—moves along parallel radial lines. Reading and writing are achieved by means of electromagnetic interactions between the head and the coated material in a tiny surface area right across it on the associated platter. Obviously, the smaller the area, the higher the disk's overall capacity. Advances in physics and engineering have allowed the areal density to increase steadily. In particular, the Nobel Prize-winning discovery of giant magneto-resistance in 1988 by physicists Albert Fert and Peter Grünberg [13] made possible an areal density of over 100 gigabits per square inch. A case in point is that a 160-gigabyte iPod classic fits into one hand, while the first gigabyte-capacity hard-disk drive (IBM 3380) was as big as a refrigerator.

Diagram shows the structure of a magnetic disk drive. The stack of platters, numbered 0 to m from bottom to top, rotates on a common axis at the center. The mechanical arm at the left, with a set of read/write heads (one head for each surface), moves along parallel radial lines.

Figure 6.16 Structure of a magnetic disk drive.

As shown in Figure 6.17, each platter's surface is divided into concentric circles called tracks. (The set of tracks of the same radius on different surfaces collectively forms a cylinder.) Each track, in turn, is divided into hundreds of pie-shaped sectors. A sector is the smallest unit that is addressable on a disk. As a result, the file block size is a multiple of the sector size. The exact block size used, however, involves trade-offs. For example, if the block size is very large, most files will tie up larger-than-needed blocks and waste storage space. Conversely, if the block size is very small, most files will span many blocks, which are likely non-contiguous. Access to the files, therefore, is subject to multiple seek and rotational delays. In addition, a very small block size may incur large data structures for tracking free blocks. A common scheme for keeping track of free blocks is to employ a characteristic function called a bitmap. The value of the nth bit indicates whether the nth block is free (value 1) or allocated (value 0). Large bitmaps, multiple seeks, and rotational delays all degrade performance. It is worth noting that the Google File System [14] uses 64-megabyte blocks. The large block size is chosen in support of Google's need to process millions of files that are over 100 megabytes in size.

Diagram shows the organization of a platter’s surface. It is divided into concentric circles called tracks and each track is divided into pie-shaped sectors.

Figure 6.17 Organization of a platter's surface.
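A free-block bitmap of the kind just described can be sketched as follows, using the convention in the text that a bit value of 1 marks a free block.

```python
# A minimal free-block bitmap, following the convention in the text:
# bit value 1 means the block is free, 0 means it is allocated.

class BlockBitmap:
    def __init__(self, num_blocks: int):
        self.bits = [1] * num_blocks            # initially every block is free

    def allocate(self) -> int:
        """Find a free block, mark it allocated, and return its number."""
        n = self.bits.index(1)                  # raises ValueError if the disk is full
        self.bits[n] = 0
        return n

    def free(self, n: int) -> None:
        self.bits[n] = 1

bitmap = BlockBitmap(num_blocks=16)
a, b = bitmap.allocate(), bitmap.allocate()
bitmap.free(a)
print(a, b, bitmap.bits[:4])    # 0 1 [1, 0, 1, 1]
```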

For management purposes, an operating system stores information about the file system (e.g., type, layout, and the bitmap for free blocks) and about each file (e.g., the pointer to the first block, the owner, last modification time, and access permissions).

If a system crashes during a file update operation, the file data may become corrupted. Modern file system management is able to repair—to some extent—corrupted data. There are algorithms for block and file consistency, but in a large system these take a substantial time to run. Fortunately, there are alternatives, such as journaling. Here the system keeps a log—appropriately called a journal—of all intended updates in a separate storage area.

The updates that must be completed either in full, or not at all, are grouped together as an atomic transaction. As the actual updates are being made, the file system keeps track of the progress. In the event of a failure, the information contained in the journal can then be used during recovery to fix any inconsistency by redoing the required updates or undoing the incomplete updates. Fittingly, the former is called redo (or new-value) journaling and the latter undo (or old-value) journaling. Journaling is efficient because only the latest log needs to be examined instead of the entire file system. If it also logs the actual file content, file recovery is possible, too. It goes without saying that the extra operations due to journaling may affect performance. Nevertheless, the trade-off is such that most modern file systems support journaling. Doeppner's book [15] discusses journaling in more detail.
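The following highly simplified sketch illustrates redo (new-value) journaling: intended updates are logged and committed in the journal before they are applied in place, so recovery only needs to replay committed transactions. The data structures are illustrative only.

```python
# A highly simplified redo (new-value) journal. All names are illustrative.

journal = []                      # the log, kept in a separate storage area
blocks = {}                       # the "real" on-disk blocks

def write_transaction(updates: dict) -> None:
    journal.append({"updates": updates, "committed": False})
    journal[-1]["committed"] = True            # commit record hits the journal first
    for block_no, data in updates.items():     # then the in-place updates are made
        blocks[block_no] = data

def recover() -> None:
    """After a crash, redo every committed transaction; ignore uncommitted ones."""
    for txn in journal:
        if txn["committed"]:
            for block_no, data in txn["updates"].items():
                blocks[block_no] = data

write_transaction({7: b"new directory entry", 12: b"updated bitmap"})
blocks.pop(12)                     # simulate a crash after a partial in-place write
recover()
print(sorted(blocks))              # [7, 12] -- both updates are back in place
```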

Listed below are examples of common file systems:

  • Unix File System (UFS);
  • Linux extended file system (ext2 or ext3 or ext4);
  • Windows New Technology File System (NTFS);
  • ISO 9660 (also known as Compact Disc File System).

Among these examples, ISO 9660 stands out because, by design, it is not tied to any particular operating system. This leads to a general need for an operating system to support multiple types of file system. Also worth noting is that the image of an ISO 9660 file system can be captured as a file (with the extension .iso) for electronic transfer. Known as an ISO image, this format has been used to distribute software modules and even virtual machine images.

An operating system may support different file systems directly, without making an attempt to integrate them. In this case the presence of different file systems is visible to user processes. Alternatively, the operating system may add an abstraction layer on top of the file systems to hide their differences—such as that shown in Figure 6.18. Many modern operating systems (notably those similar to Unix) implement such a layer as inspired by the Virtual File System (VFS) that Sun Microsystems pioneered [16]. The VFS layer provides a file-system-independent interface to user processes, supporting standard system calls for file operations such as open, read, and write. It also provides an interface to the underlying file systems. As long as the underlying file systems support the interface, the VFS layer is not concerned about their specifics, including where the files are stored. Indeed, Sun's original VFS includes support for remote file systems such as NFS, which we are now ready to discuss.

Chart shows file system abstraction. A user process at the top connects to the virtual file system through the file-system-independent interface. The virtual file system sits on top of three file systems, A, B, and C, and each file system has its own device driver.

Figure 6.18 File system abstraction.
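The spirit of the VFS abstraction can be sketched as follows: user code calls one file-system-independent interface, and concrete file systems (local or remote) plug in underneath. All class names here are invented.

```python
# A sketch of a file-system-independent interface in the spirit of VFS.
# Class names are invented; the "remote" file system stands in for something
# like NFS, discussed below.

from abc import ABC, abstractmethod

class FileSystem(ABC):
    @abstractmethod
    def open(self, path: str) -> int: ...

    @abstractmethod
    def read(self, handle: int, size: int) -> bytes: ...

class LocalFileSystem(FileSystem):
    def open(self, path):
        return hash(path) & 0xFFFF                 # stand-in for a real file handle
    def read(self, handle, size):
        return b"local data"[:size]

class RemoteFileSystem(FileSystem):                # e.g., backed by remote procedure calls
    def open(self, path):
        return 42                                  # handle obtained from the server
    def read(self, handle, size):
        return b"remote data"[:size]

class VFS:
    def __init__(self):
        self.mounts = {}                           # mount point -> file system

    def mount(self, point: str, fs: FileSystem) -> None:
        self.mounts[point] = fs

    def resolve(self, path: str) -> FileSystem:
        point = max((p for p in self.mounts if path.startswith(p)), key=len)
        return self.mounts[point]

vfs = VFS()
vfs.mount("/", LocalFileSystem())
vfs.mount("/music/classical", RemoteFileSystem())
print(vfs.resolve("/music/classical/bach.flac").read(0, 6))   # b'remote'
print(vfs.resolve("/books/novel.txt").read(0, 5))             # b'local'
```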

NFS was designed in the 1980s by Sun Microsystems for file sharing between networked computers with possibly varied operating systems. Figure 6.19 shows Sun's implementation, which most Unix-like operating systems follow closely. As shown, NFS is integrated with the operating system through a virtual file system, which is a natural fit here. When a user process attempts file access through a system call, the VFS determines whether the file is remote or local. If it is remote, the appropriate NFS procedure is invoked. The NFS proper is client/server-based. The NFS client initiates requests through the NFS protocol, which relies on the Remote Procedure Call (RPC) [17]. The NFS server only responds to requests, taking no actions on its own. The use of RPC hides the network-related details. To support machines of different architectures (big- or little-endian), RPC, in turn, needs a presentation-layer protocol. Sun's RPC message protocol [18], as standardized by the IETF, relies on the External Data Representation (XDR) standard. Similar to the Abstract Syntax Notation 1 (ASN.1), XDR [19] is a generic over-the-wire representation of basic data types (e.g., string, integer, Boolean, and array). It defines the size, byte order, and data alignment.

Diagram shows the client machine on the left and the server machine on the right. Both machines contain VFS, file systems and their device drivers, RPC, and network drivers connected to the network. The client machine has the NFS client, and the server machine the NFS server, between VFS and RPC. The client machine has a user process above the VFS, reached via system calls.

Figure 6.19 A functional view of NFS.
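A minimal sketch of XDR-style serialization is shown below: 32-bit big-endian integers and length-prefixed strings padded to a four-byte boundary. The helper names and the sample remote-procedure arguments are illustrative.

```python
import struct

# A minimal sketch of XDR-style encoding: 32-bit big-endian integers and
# length-prefixed strings padded to a four-byte boundary.

def xdr_int(value: int) -> bytes:
    return struct.pack(">i", value)

def xdr_string(value: str) -> bytes:
    data = value.encode("ascii")
    padding = (4 - len(data) % 4) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * padding

# Arguments of a hypothetical remote procedure, serialized for the wire:
payload = xdr_int(3) + xdr_string("My directory/music")
print(payload.hex())
```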

Upon receiving an RPC call, the server invokes the appropriate operation in the VFS on the server machine, which eventually results in local file system operations. To return the result, the path across the network in Figure 6.19 is retraced. An advantage of this architecture is that the client and the server are symmetric. It is, therefore, straightforward to implement both a client and a server on the same machine.

Specifically, to make a file system accessible to remote clients, an NFS server exports it. To access a directory of the remote file system, an NFS client grafts (or mounts in Unix parlance) it to the local file system through the mount protocol. Upon receiving the mount request from the client, the server controls access to the file system based on pre-set policy and responds accordingly. Once the client receives a successful response, the remote directory becomes part of the local file system (such as shown in Figure 6.20) and is accessible to user processes through regular system calls. The actual interactions between the client and the server for file access are through the NFS file protocol. Most of the corresponding RPC routines map well to the regular Unix system calls for file operations. As an example, Figure 6.21 shows the open, read, and close operations of a remote file.

Diagram shows the client machine on the right and the server machine on the left. The server machine has a shared music directory with subdirectories POP and classical. The client machine has the directory 'My directory' with subdirectories music, books, and photos. After the mount protocol is used, the server's classical subdirectory becomes part of the client's music directory.

Figure 6.20 An example remote file system.

Diagram shows vertical lines representing the user process, NFS client, and NFS server. It shows directions of requests and responses between these lines during the open, read, and close operations of a remote file.

Figure 6.21 Examples of remote file operations through NFS.
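The exchanges of Figure 6.21 can be sketched from the client side as follows. The rpc_call helper stands in for the Sun RPC machinery, and the procedure names merely follow the flavor of the NFS file protocol; the details are simplified.

```python
# A client-side sketch of the remote file operations in Figure 6.21.
# rpc_call is a placeholder for real RPC machinery; procedure names are
# simplified and only loosely follow the NFS file protocol.

def rpc_call(procedure: str, **args) -> dict:
    print(f"RPC -> {procedure} {args}")          # placeholder for a real RPC
    return {"handle": "fh-123", "data": b"file contents"}

class NFSClient:
    def open(self, directory_handle: str, name: str) -> str:
        # open() maps to a lookup request that returns a file handle
        return rpc_call("LOOKUP", dir=directory_handle, name=name)["handle"]

    def read(self, file_handle: str, offset: int, count: int) -> bytes:
        return rpc_call("READ", handle=file_handle, offset=offset, count=count)["data"]

    def close(self, file_handle: str) -> None:
        # No RPC: the server is stateless, and nothing was modified
        pass

client = NFSClient()
fh = client.open("fh-root", "bach.flac")
print(client.read(fh, offset=0, count=4096))
client.close(fh)
```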

An interesting nuance: the close system call does not result in an RPC invocation. There are two reasons for this. First, the NFS protocol does not have a close routine because of the original stateless design of servers (which do not keep track of past requests), adopted to facilitate crash recovery. Second, in this case there is no file modification.

A remote file operation, even if it has an RPC counterpart, does not necessarily result in an RPC invocation. No such invocation is needed when the information is stored in the client cache, which reduces the number of remote procedure calls and improves performance. Nevertheless, caching makes it difficult to maintain file consistency. For example, a write operation to a file at one site may not be visible at other sites that have this file open for reading.

NFS has gone through several iterations since its inception. In the process, the constraints of the stateless design have been relaxed, file consistency has been improved, and security strengthened. NFS and its evolution are explained in depth in [15].

Before leaving the topic of NAS, we observe that it is often implemented with Redundant Arrays of Independent Disks (RAIDs). Employing RAID technology allows recovery from a number of individual disk failures and improves overall NAS performance and availability.

6.2.3 Storage Area Network

Historically, DAS has been a stove-piped technology, which makes it difficult to share storage resources and stored information. NAS alleviates the problem but still leaves room for improvement. In particular, storage throughput is limited by the particular networking technology in use and block-level I/O access is unavailable. This is where SAN comes in. In essence, SAN is a high-speed network that is tailored to interconnecting storage systems to allow resource pooling and block-level access to the pooled resources. SAN is predominantly based on FC, a standard technology combining the qualities of a serial I/O bus and a switching network. The bus qualities (reflected in the choice of the word "channel") allow hosts to see storage devices that are attached through FC as locally attached and to have reliable transmission as in SCSI. The network qualities allow flexibility in supporting multiple protocols and dynamic attachment of storage devices over a long distance.

The development of FC standards first started at ANSI in 1988, culminating in the specification ANSI X3.230-1994. It continues in the T11 Technical Committee of INCITS, side by side with T10, which is responsible for the closely related SCSI project.

An open standard, FC has a layered structure as shown in Figure 6.22. FC-0 defines the physical and electrical characteristics of transceivers, connectors, and cables for serial lossless transmission (with a low bit error ratio) at different rates. To date, the fastest FC (known as 32GFC) supports 3.2 GBps per direction. Both fiber-optic and copper types of cabling are supported. It is even possible to have a mix of copper wire and optical fiber in an end-to-end path.

Diagram shows the layered structure of FC. FC-0 covers the physical and electrical characteristics of transmission media and transceivers; FC-1 covers line coding, error control, and transceiver operations; FC-2 is divided into physical, multiplexing, and virtual sublevels; FC-3 provides common services; and FC-4 handles application mapping.

Figure 6.22 FC structure.

FC-1 is concerned with line coding and related transceiver operations. In particular, it defines a set of coding schemes suitable for high-speed data transmission, including 256b/257b, 64b/66b, and 8b/10b. In general, these coding schemes allow clock recovery at the receiver, enable detection of bit errors during transmission and reception, and help achieve transmission block alignment. You might recall that SAS utilizes 8b/10b coding. The coding scheme has a transmission overhead of 25% and is not efficient enough for the faster versions of FC. Figure 6.23 shows the coding scheme used by each of the FC versions available to date. In a nutshell, the 64b/66b coding scheme (also used in 10- and 100-gigabit Ethernet) transforms every 64-bit block to a 66-bit block before transmission. The 256b/257b scheme builds on 64b/66b by further transforming every four 64b/66b blocks to a 257-bit block before transmission. FC-1 also defines a number of ordered sets (i.e., certain encoded bit patterns). Among them are frame delimiters for marking frame boundaries, and primitive signals to signal a port's readiness to transmit and receive.

Table shows the coding schemes used, line rate, and throughput of each of the FC versions 1G, 2G, 4G, 8G, 10G, 16G, and 32G.

Figure 6.23 FC and line coding.
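The impact of the line coding can be quantified with a little arithmetic: coding efficiency is the ratio of data bits to coded bits, and the extra bits constitute the transmission overhead. The throughput estimate at the end is a simplified illustration (line rate times efficiency), ignoring framing and other overheads; the 8.5 Gbaud figure used for 8GFC is an assumption of this example.

```python
# Coding efficiency and transmission overhead of the line codes mentioned above.
# The 8GFC throughput estimate is a simplified illustration that ignores framing.

codes = {"8b/10b": (8, 10), "64b/66b": (64, 66), "256b/257b": (256, 257)}

for name, (data_bits, coded_bits) in codes.items():
    efficiency = data_bits / coded_bits
    overhead = coded_bits / data_bits - 1
    print(f"{name}: efficiency {efficiency:.1%}, overhead {overhead:.1%}")

# e.g., assuming 8GFC signals at 8.5 Gbaud with 8b/10b coding:
print(8.5e9 * (8 / 10) / 8 / 1e9, "GBps")     # 0.85 GBps per direction
```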

FC-2 consists of three sublevels named physical (FC-2P), multiplexing (FC-2M), and virtual (FC-2V), respectively. FC-2P addresses the format of frames (the basic units for carrying information on a physical link), and matters germane to transmitting and receiving frames, such as per-link flow control. The frame format includes a Cyclic Redundancy Check (CRC) field to detect transmission errors.

The flow control mechanism prevents a transmitter from sending to a receiver more frames than the latter can handle. It requires a feedback mechanism to allow the transmitter to regulate its transmission. If overwhelmed, the receiver will drop frames. The dropped frames need to be retransmitted, which worsens the congestion. Flow control is of particular importance in FC, given the requirement for lossless frame transmission. To this end, FC uses a flow control mechanism based on a notion of credits. A credit is the maximum number of buffers available for receiving frames on the receiver. It is negotiated buffer-to-buffer (i.e., per link) or end-to-end between the involved ports (i.e., the transmitter and receiver) during a login procedure (to be discussed later). The transmitter does bookkeeping to ensure that it sends a frame if and only if the receiver has a free buffer. In the case of per-link flow control (which is applicable to fabric ports), the transmitter does so with the help of a primitive signal (i.e., R_RDY) from the receiver. The signal indicates that the receiver is ready to receive with a free buffer. The transmitter tracks the number of available buffers, decrementing it by one upon sending out a frame and incrementing it by one upon receiving an R_RDY.
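The per-link (buffer-to-buffer) credit mechanism can be sketched as follows; the class and method names are illustrative.

```python
# A sketch of buffer-to-buffer credit flow control on a single FC link.
# The transmitter may send only while it holds credit; each R_RDY from the
# receiver returns one credit. Class and method names are illustrative.

class CreditedLink:
    def __init__(self, negotiated_credit: int):
        self.credit = negotiated_credit      # free receive buffers, set at login

    def can_send(self) -> bool:
        return self.credit > 0

    def send_frame(self) -> None:
        if not self.can_send():
            raise RuntimeError("no credit: transmitter must wait, not drop")
        self.credit -= 1                     # one receive buffer is now in use

    def receive_r_rdy(self) -> None:
        self.credit += 1                     # receiver freed a buffer

link = CreditedLink(negotiated_credit=2)
link.send_frame()
link.send_frame()
print(link.can_send())      # False: credit exhausted, transmission pauses
link.receive_r_rdy()
print(link.can_send())      # True again after an R_RDY arrives
```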

FC-2M is concerned with end-to-end connectivity, addressing, and path selection. Three types of connection are supported: point-to-point, fabric, and arbitrated loop. The point-to-point topology is the simplest, with a direct link between two ports (which are analogous to the SAS ports discussed earlier). It has the same effect as DAS, while supporting longer distances and working at a higher speed.

The fabric topology is most flexible. It involves a set of ports attached to a network of interconnecting FC switches through separate physical links, as shown in Figure 6.24. The switching network (or fabric) has a 24-bit address space structured hierarchically, according to domains and areas. An attached port is assigned a unique address during the fabric login procedure (which we will discuss later). The exact address typically depends on the physical port of attachment on the fabric (or switch, to be precise). The fabric routes frames individually based on the destination port address in each frame header.

Diagram shows an example of fabric topology where a rhombus has a switch at each of its four corners and two ports separately linked to each of these switches.

Figure 6.24 An example of fabric topology.

Finally, the arbitrated loop topology allows three or more ports to interconnect without a fabric. Figure 6.25 shows an example together with an alternative using a hub (a simple device without any loop control capabilities) to simplify cabling. On the loop, only two ports can communicate with each other at any given time through arbitration.

Diagram at the left shows the arbitrated loop topology in which four ports are physically interconnected without a hub. Diagram at the right shows four logically interconnected ports with a hub connected between them.

Figure 6.25 An example of the arbitrated loop.

In all three types of topology, communication may be simplex, full-duplex, or half-duplex; and a port may be on an HBA, a storage device controller, a hub, or a switch.

In the case of fabric topology, the Fabric Shortest Path First (FSPF) protocol as defined in ANSI INCITS 461-2010 is used to select a path on a fabric. FSPF is a link-state routing protocol similar to the standard Open Shortest Path First (OSPF) routing protocol23 commonly used in IP networks.

Through FSPF, a switch in a fabric can keep track of the state of all the interswitch links throughout and maintain an up-to-date topology of the fabric consistently. The link state information includes the cost associated with each link, which is inversely proportional to its speed.24 Based on the link state and topology information, each switch computes the respective total costs of all possible paths to other switches and selects those with the least costs. The total cost of a path is simply the sum of the costs of all links therein. Figure 6.26 shows an example, where the least-cost path between switch A and switch C is through switch B.

Diagram shows a quadrilateral with switches A, B, C, and D connected to its four corners. Distances labeled are:  AB equals 1000 units, BC equals 500 units, CD equals 1000 units, DA equals 1000 units, and DB equals 500 units. Ports are linked to switches A and C.

Figure 6.26 A weighted-path network.

Obviously, a selected path is valid only for a given topology. A switch has to redo path computation whenever there is a topology change. Say link A–B in Figure 6.26 is down. Then path recomputation will yield two paths of the same cost. In this case, there is a need for a tie breaker, which could be based on load balancing considerations.
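Because FSPF is a link-state protocol, each switch can compute its least-cost paths with a standard shortest-path algorithm such as Dijkstra's. The following sketch applies the algorithm to the topology of Figure 6.26 and then to the same topology with link A–B removed; the routine is generic and only illustrative of what an FSPF implementation does internally.

import heapq

def least_cost_paths(links, source):
    """Dijkstra's algorithm: links maps a switch to {neighbour: link cost}."""
    dist = {source: 0}
    queue = [(0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if cost > dist.get(node, float("inf")):
            continue
        for neighbour, link_cost in links[node].items():
            new_cost = cost + link_cost
            if new_cost < dist.get(neighbour, float("inf")):
                dist[neighbour] = new_cost
                heapq.heappush(queue, (new_cost, neighbour))
    return dist

# The fabric of Figure 6.26 (link costs are inversely proportional to speed).
fabric = {
    "A": {"B": 1000, "D": 1000},
    "B": {"A": 1000, "C": 500, "D": 500},
    "C": {"B": 500, "D": 1000},
    "D": {"A": 1000, "B": 500, "C": 1000},
}
print(least_cost_paths(fabric, "A")["C"])   # 1500, via switch B

# If link A-B goes down, recomputation yields two equal-cost candidates,
# A-D-C and A-D-B-C, and a tie-breaker is needed.
del fabric["A"]["B"]
del fabric["B"]["A"]
print(least_cost_paths(fabric, "A")["C"])   # 2000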

FC-2V is concerned with classes of service, end-to-end flow control, naming schemes, and segmentation and reassembly in support of upper-layer protocols, among other things. At the time of this writing, three classes of service are specified: acknowledged frame delivery (Class 2), unacknowledged frame delivery (Class 3), and interswitch frame delivery (Class F). Both Class 2 and Class 3 are datagram services: frames are individually routed through the fabric without any guarantee of delivery order. Class 2 supports notification of frame delivery status, while Class 3 does not.

Delivery acknowledgment enables end-to-end flow control and improves error handling. End-to-end flow control is also based on credits, which the communicating ports negotiate at login time. A transmitter may not send additional frames once the number of outstanding delivery acknowledgments has reached the negotiated credit. Note that delivery acknowledgment is done at the frame level (through ACK_1 frames) and carries more overhead than acknowledgment done at the primitive-signal level.

The naming mechanism is similar to that of addressing in FC. Several schemes for identifying ports, nodes (e.g., storage devices and HBAs), fabrics, and other FC entities have been specified. In practice, the 64-bit World Wide Name (WWN) is what is implemented. WWNs are similar to MAC addresses, and they are administered by the IEEE in the same way. WWNs are assigned to FC entities by manufacturers.

All things considered, WWNs could potentially provide the basis for FC routing. But, as already noted, FC routing is based on a special 24-bit port address. The reason is the concern that the much longer WWN might make routing too slow to meet the overall performance objective.

Hence, a port ends up with two identifiers. In FC parlance, the address identifier is 24 bits long; the name identifier is 64 bits long. They serve different purposes. The name identifier is useful in services such as zoning. Based on their name identifiers, FC devices can be grouped into isolated zones so that they cannot see or communicate with each other across zones. Zoning may also be based on address identifiers, but it is then harder to administer because address identifiers, unlike name identifiers, change when a device moves to a different attachment port on the fabric. The restrictiveness of address-identifier-based zoning, however, is a plus from the security point of view.

FC-3 provides link services to FC-4 for managing both the communication between FC devices and the interaction between the fabric and an attached device. The link services are subdivided into basic and extended services.

The set of basic link services is small. It is intended to support aborting a sequence (which is a flow of frames resulting from fragmentation of a single upper-protocol data unit) and notifying a service requestor of the result (i.e., request completion or rejection).

In contrast, the set of extended link services is much larger. It supports, in particular, procedures for login and logout.

The login procedure is mandatory. It consists of two steps: fabric login and port login. A port on an FC device must perform the fabric login step first before attempting to do anything else. The step involves sending a frame carrying the fabric login command and other information to a well-known port address. If it completes successfully, the device port discovers the topology type (e.g., fabric or point-to-point), at which point it is assigned an address (if a fabric is present). The rest of the procedure settles some service parameters (such as supported classes of service and the credits for per-link flow control). Then the port login step is performed, which settles the values of other service parameters (such as port names and the credits for end-to-end flow control).

The login procedure results in a long-lived session. As long as the session is active, I/O operations can be performed between the two ports. Otherwise, a new login session must be established.

The logout procedure simply terminates a login session and frees up its resources. The login procedure does not have in-built authentication, despite the word “login.” Traditionally, FC devices are trusted.

The top layer, FC-4, is concerned with bridging the transport below and applications above to make FC devices accessible to applications in a transparent way. For example, an SCSI-aware application can access an FC device without modification. To this end, FC-4 defines the mapping between application protocols and the underlying FC constructs. The mapping is specific to an application protocol. A set of application-specific mappings has been defined, including, in particular, the Fibre Channel Protocol (FCP) for SCSI. FCP provides transport protocol services, as defined in the SCSI architecture model (discussed earlier). To name a few, it addresses the encapsulation of the information related to SCSI operations (e.g., CDBs), address mapping, and capability discovery. Through FCP, an FC storage device can appear as an SCSI device to an application.

Among other mappings that have been defined are the FC-Virtual Interface (FC-VI) for the Virtual Interface (VI) Architecture [20], and RFC 433825 for IPv4/IPv6. The VI architecture aims to give processes with high performance needs a protected, directly accessible interface to the network hardware that bypasses operating-system services. It supports Remote Direct Memory Access (RDMA) in addition to traditional send and receive messaging constructs. RDMA has been applied to distributed scientific applications and proven effective. The versions used in practice, however, are not identical to RDMA in the VI architecture; InfiniBand™ is a notable example. In light of Cloud Computing, devising effective mechanisms for supporting RDMA in virtual machines is a topic of ongoing research.

6.2.4 Convergence of SAN and Ethernet

FC SAN has been popular in data centers because of its superior performance. But the need to deploy and manage a separate, bespoke network for storage is a drawback. Doing so entails the procurement of specialized hardware (i.e., tailored HBAs, connectors, cables, and switches) as well as the involvement of dedicated operational staff. Convergence of SAN and Ethernet is about using Ethernet networks for all types of traffic, including storage. A key motivation is the prospect of reduced capital and operational expenditure.

As the reader may remember from Chapter 2, the same motivation resulted in IT transformation and network function virtualization.

SAN's layered structure facilitates the approaches to convergence demonstrated in Figure 6.27. These approaches correspond to swapping out the lower-layer modules of the FC protocol. In the extreme approach (i.e., approach 3), no trace of FC26 is left.

The other two approaches, developed by the INCITS T11 Technical Committee, are kinder to FC. The idea is to keep various FC modules to ease the transition in new deployments. Common to the three approaches is the use of the IEEE 802.3 MAC and physical layers in place of FC-1 and FC-0. Approach 1 further calls for replacing FC-2M and FC-2P with an FCoE layer; approach 2 replaces FC-3 to FC-2P with FCIP over TCP/IP; and approach 3 replaces FCP to FC-2P with iSCSI over TCP/IP. The rest of the section focuses on approach 1 (FCoE) and approach 3 (iSCSI), the primary convergence approaches.

Table shows the SCSI commands with three approaches. Approach three includes iSCSI, TCP, and IP. Approach two includes FCP, FCIP, TCP, and IP. Approach one includes FCP, FC-3, FC-2V, and FCoE. Use of IEEE 802.3 MAC and IEEE 802.3 PHY is common to the three approaches.

Figure 6.27 Examples of converged storage protocol options.

The main characteristic of FC is that it incurs no frame loss due to buffer congestion. Approach 1 relies on lossless Ethernet links to preserve this characteristic. To this end, the Ethernet PAUSE mechanism27 is essential. The mechanism allows a congested Ethernet switch to request an adjacent switch (through a PAUSE frame) not to send frames its way for a certain duration. If the congestion persists beyond the valid period of the request, the switch can send a new PAUSE request. Conversely, if the congestion abates sooner than expected, the switch can cancel the outstanding request by sending a new PAUSE request with the duration set to zero. The PAUSE mechanism works on a per-link basis. Its invocation is handled independently by each switch, based on local load conditions. Furthermore, once invoked, PAUSE applies to all traffic on the link.

The way that PAUSE works is not ideal when the converged network is to be shared by different types of traffic. Storage performance will suffer even if PAUSE is caused by the traffic that is unrelated to storage.

Fortunately, there is a remedy: the Priority-based Flow Control (PFC) mechanism specified in IEEE 802.1Qbb.28 PFC allows PAUSE to be applied separately to traffic of different priority classes. Traffic is classified by priority through a 3-bit field in its IEEE 802.1Q tag.29 With this finer control granularity, different classes of traffic (with eight classes the maximum, as limited by the field length) do not interfere with each other. It is also possible to pause non-storage traffic to let storage traffic get through and so help meet the storage performance requirement.
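Conceptually, PFC turns the single pause state of a link into one pause state per priority class. The minimal sketch below illustrates the idea; the class name, the use of seconds instead of the standard pause quanta, and the example priority assignments are all illustrative.

import time

class PfcPort:
    """Sketch of Priority-based Flow Control on one link: pause state is kept
    per 3-bit priority class rather than for the link as a whole.
    (Names and time units are illustrative, not from IEEE 802.1Qbb.)"""

    NUM_CLASSES = 8    # limited by the 3-bit priority field

    def __init__(self):
        self.paused_until = [0.0] * self.NUM_CLASSES

    def on_pfc_frame(self, priority, pause_seconds):
        # A pause time of zero cancels an outstanding request.
        self.paused_until[priority] = time.monotonic() + pause_seconds

    def may_transmit(self, priority):
        return time.monotonic() >= self.paused_until[priority]


port = PfcPort()
port.on_pfc_frame(priority=3, pause_seconds=0.5)   # pause one class only
print(port.may_transmit(3))   # False: class 3 is paused
print(port.may_transmit(4))   # True: traffic in class 4 still flows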

Besides lossless Ethernet, approach 1 requires a new layer, namely FCoE as defined in the Fibre Channel-Backbone-5 (FC-BB-5) specification.30 FCoE's job is to fill the gaps resulting from the use of Ethernet for transport, such as FC frame encapsulation, emulating point-to-point links, and access to common services. An FCoE frame is an Ethernet frame that encapsulates an FC frame in its entirety without any change. Keeping the FC frame intact simplifies integration of FCoE and existing FC SANs. As depicted in Figure 6.28, the encapsulated FC frame has a relatively low overhead, with just additional delimiters to mark the beginning and end of the frame and padding to meet the minimum Ethernet frame size requirement. FCoE frames are distinguishable from other types of frame via a special Ethertype value. (Ethertype is a two-byte header field in the Ethernet frame to indicate the nature of the payload. The Ethertype value determines what the receiver does with a received frame. IPv4 and IPv6 are among the common Ethertype values. To avoid conflict, the Ethertype values are managed by the IEEE Registration Authority31.)

Diagram shows FCoE frame structure which includes ethernet header, encapsulated FC frame, and ethernet trailer. Encapsulated FC frame contains FCoE PDU header, FC frame, and FCoE PDU trailer.

Figure 6.28 FCoE frame structure.
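The following sketch mimics the layout of Figure 6.28 at a very coarse level: the FC frame is carried unchanged between start-of-frame and end-of-frame delimiters and padded to the Ethernet minimum, with the whole payload identified by the FCoE Ethertype. The delimiter values and field sizes are placeholders and deliberately ignore the exact FC-BB-5 layout.

FCOE_ETHERTYPE = 0x8906       # Ethertype value registered for FCoE
ETH_MIN_PAYLOAD = 46          # minimum Ethernet payload size, in bytes

def encapsulate_fc_frame(fc_frame: bytes, src_mac: bytes, dst_mac: bytes) -> bytes:
    """Very simplified FCoE encapsulation: the FC frame is carried unchanged,
    bracketed by delimiters and padded to the Ethernet minimum.  Field sizes
    and delimiter values are placeholders, not the real FC-BB-5 layout."""
    sof = b"\x2e"                        # start-of-frame delimiter (placeholder value)
    eof = b"\x41"                        # end-of-frame delimiter (placeholder value)
    payload = sof + fc_frame + eof
    if len(payload) < ETH_MIN_PAYLOAD:   # pad to meet the minimum frame size
        payload += b"\x00" * (ETH_MIN_PAYLOAD - len(payload))
    header = dst_mac + src_mac + FCOE_ETHERTYPE.to_bytes(2, "big")
    return header + payload              # the frame check sequence would follow on the wire

# Hypothetical 36-byte FC frame between two made-up MAC addresses.
frame = encapsulate_fc_frame(b"\x00" * 36, src_mac=bytes(6), dst_mac=b"\xff" * 6)
print(len(frame))   # 60 bytes before the frame check sequence is appended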

FCoE-aware entities are classified into FCoE Nodes (ENodes) and FCoE Forwarders (FCFs). The former are relatively simple; in practice they take the form of converged network adapters that consolidate FC HBAs and Ethernet NICs. The latter are essentially FC switches in an Ethernet network, supporting both the node and FC switching functions. Their central task is forwarding FCoE frames. For each frame received, an FCF decapsulates the FC frame and determines the forwarding MAC address based on the destination address in the FC frame. Then the FCF re-encapsulates the FC frame in a way suitable for forwarding out of an Ethernet port, with the Ethernet source address set to the FCF MAC address and the Ethernet destination address set to the forwarding MAC address. If the FCF also supports native FC ports (and thus the SAN gateway function), re-encapsulation may not be necessary: the FCF can forward an FC frame bound for an FC SAN out of a native FC port.

To emulate point-to-point links over the Ethernet shared media, FCoE relies on the logical constructs of virtual ports and links.

Virtual ports emulate FC ports. They are created dynamically on ENodes and FCFs. Each virtual port is associated with an element (known as an FCoE Link End Point (FCoE_LEP)), which handles the encapsulation and decapsulation and also deals with the Ethernet transmission. Two virtual ports can be interconnected through a virtual link identified by at least the MAC addresses of the associated FCoE_LEPs. The link serves as a tunnel to transport encapsulated FC frames between the two MAC addresses over an Ethernet network. Figure 6.29 shows virtual ports and links in a conceptual FCoE network. (For simplicity, FCoE_LEPs are not shown.) From the figure, we observe that:

Diagram shows a conceptual FCoE architecture with two FCoE nodes, two FCoE forwarders, and lossless ethernet between them. FCoE nodes and FCoE forwarders have multiple virtual links with other FCoE forwarders or FCoE nodes through their single ethernet port.

Figure 6.29 A conceptual FCoE architecture.

  • An FCoE node may establish multiple virtual links with different FCFs through a single Ethernet port.
  • An FCF may establish multiple virtual links with different ENodes through a single Ethernet port.
  • An FCF may establish multiple virtual links with different FCFs through a single Ethernet port.
  • An ENode may establish multiple virtual links with different ENodes through a single Ethernet port.

Now we can explain how the FCoE entities and virtual ports are discovered and how the virtual links are set up.

In FC, an end device is provisioned with a direct physical link to a switch. To perform fabric login, for example, the device simply sends a request to the corresponding well-known port address over the link. In FCoE, there is no longer a direct physical link between an ENode and an FCF; instead, there are intermediate Ethernet links and switches. To perform fabric login, the end node first has to have an appropriate virtual link set up. If this were done manually, the procedure would be both inefficient and error-prone.

Hence the FCoE Initialization Protocol (FIP). The FIP messages are also carried in Ethernet frames. A special Ethertype value distinguishes these frames from the FCoE frames.

FIP addresses FCoE entity discovery, virtual link instantiation, and virtual link maintenance. Figure 6.30 shows the related interactions between an ENode and FCF. (The interactions between two ENodes or two FCFs are similar.) The entity discovery procedure is typically hinged on FCFs sending, periodically, multicast discovery advertisements to a known multicast address.

Diagram shows lines representing ENode and FCF and the directions of requests and responses during high-level FIP operations such as FCoE entity discovery, virtual link instantiation, and virtual link maintenance.

Figure 6.30 High-level FIP operations.

An ENode selects a compatible FCF based on the advertisement and sends a discovery solicitation, at which point capability negotiation starts. Upon receiving the solicitation, the FCF responds to the ENode with a solicited discovery advertisement, confirming the negotiated capabilities.

Upon receiving the solicited discovery advertisement, the ENode can proceed with setting up a virtual link to the FCF. The procedure here is similar to the fabric login procedure in FC.

Successful completion of the login procedure results in the creation of a virtual port on the ENode, a virtual port on the FCF, and a virtual link between them.

The MAC address of the virtual port on the ENode is typically assigned by the FCF, although it may be assigned by the ENode. A MAC address in the former case is known as a Fabric-Provided MAC Address (FPMA). It is constructed by concatenating the 24-bit MAC prefix of the FCF and the 24-bit address identifier of the virtual port assigned by the FCF. This method ensures that the MAC address is unique within the fabric. Virtual ports and links can be deleted explicitly through a logout procedure. Successful completion of the procedure frees up all related resources, including MAC addresses and virtual port address identifiers.
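Constructing an FPMA is a matter of bit concatenation, as the following sketch shows; the prefix and address-identifier values used in the example are hypothetical.

def fabric_provided_mac(fcf_mac_prefix: int, port_address_id: int) -> str:
    """Build a Fabric-Provided MAC Address by concatenating the 24-bit MAC
    prefix chosen by the FCF with the 24-bit FC address identifier it
    assigned to the virtual port."""
    assert fcf_mac_prefix < 2**24 and port_address_id < 2**24
    mac = (fcf_mac_prefix << 24) | port_address_id
    return ":".join(f"{(mac >> shift) & 0xFF:02x}" for shift in range(40, -8, -8))

# Hypothetical values: a MAC prefix of 0x0EFC00 and an assigned address
# identifier of 0x010203 (domain 01, area 02, port 03).
print(fabric_provided_mac(0x0EFC00, 0x010203))   # 0e:fc:00:01:02:03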

A virtual link may span a series of Ethernet links and switches. Hence, a broken link due to a fault in an intermediate link or switch might not be immediately apparent to the associated FCoE entities. FIP deals with the problem by making the associated entities periodically check the state of a virtual link. An ENode does so by monitoring multicast discovery advertisements and sending keep-alive messages to the FCF. An FCF does so by monitoring the keep-alive messages from the ENode (in addition to its ongoing task of issuing multicast discovery advertisements).

A virtual port on the FCF is considered unreachable if the ENode logs two missing advertisements. In this case, the ENode deletes the associated virtual port and link. Similarly, a virtual port on an ENode is considered unreachable if the FCF logs two missing keep-alive messages. In this case, the FCF deletes the associated virtual port and link.

For good measure, FCoE services are typically provided over VLANs so that storage traffic is isolated appropriately. These VLANs may be pre-provisioned, but if this is not the case, there needs to be a mechanism to discover them. To this end, FIP includes an additional procedure that FCoE entities can perform before anything else. The VLAN discovery procedure is straightforward. An ENode (or FCF) sends a VLAN discovery message to a pre-set multicast address. The FCFs receiving the message respond to the ENode with a list of the identifiers of FCoE-capable VLANs.

iSCSI is a development enabled by the ubiquitous connectivity brought about by the Internet. As demonstrated in Figure 6.27, TCP is leveraged here32 for the features that are essential to SCSI operations: reliable in-order delivery, automatic retransmission of unacknowledged packets, and congestion control.

Initially there was a concern about potential performance problems, but the choice of TCP was validated by the Virtual Internet SCSI Adaptor (VISA) project [21], carried out at the University of Southern California's Information Sciences Institute in the 1990s. The project demonstrated that the TCP/IP overhead was not as great as feared, and could be compensated for by employing more powerful processors. Discussions of the related standardization effort in the IETF followed, and led to the formation of the IP Storage Working Group in the last quarter of 2000. The effort resulted in the publication of a series of RFCs, starting with the core RFC33 specifying the iSCSI protocol in 2004. The paper [22] by K. Meth and J. Satran at IBM Haifa Research Laboratory gives a good explanation of the design of the iSCSI protocol.

To explain how iSCSI works, let us first review the conceptual model in Figure 6.31. The central construct here is the iSCSI node representing an iSCSI communication endpoint (initiator or target). The node is accessible from an IP network through one or more network portals.

Diagram shows the iSCSI conceptual model that includes an IP host with an initiator node, an IP host with two target nodes, and IP network. Communication during iSCSI session between an initiator node and a target node occurs through TCP connections. Nodes are accessible from the IP network through the network portals of the hosts.

Figure 6.31 iSCSI conceptual model.

The node is identified by a globally unique iSCSI name, which depends on neither the node location nor its IP address. Multiple iSCSI nodes may be reachable at the same address, and the same iSCSI node can be reached at multiple addresses. As a result, it is possible to use multiple TCP connections for a communication session between a pair of iSCSI nodes to achieve a higher throughput. We will return to this important aspect later. Figure 6.32 shows the two formats defined for iSCSI names: iSCSI Qualified Name (iqn) and Extensible Unique Identifier (eui). With the iqn format, the names can be issued by any organization that owns a domain name. In contrast, the names in the eui format are assigned by the IEEE Registration Authority.

Two formats defined for iSCSI names: iSCSI Qualified Name or iqn and Extensible Unique Identifier or eui. The iqn format contains the constant string iqn, a date, the reversed domain name of the naming authority, and a string defined by the naming authority. The eui format contains the constant string eui and an EUI-64 identifier.

Figure 6.32 iSCSI names.
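As a simple illustration of the iqn format, the helper below assembles a name from its parts; the naming authority example.com and the suffix are hypothetical.

def make_iqn(year_month: str, domain: str, suffix: str) -> str:
    """Build an iSCSI Qualified Name: the constant string 'iqn', the date the
    naming authority acquired its domain, the reversed domain name, and a
    string chosen by the naming authority."""
    reversed_domain = ".".join(reversed(domain.split(".")))
    return f"iqn.{year_month}.{reversed_domain}:{suffix}"

# Hypothetical naming authority example.com and a target it named storage.disk1:
print(make_iqn("2001-04", "example.com", "storage.disk1"))
# iqn.2001-04.com.example:storage.disk1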

Figure 6.33 outlines the format of the iSCSI PDU. Only the basic header segment field is mandatory. This field carries critical information such as the iSCSI PDU type and SCSI CDB.

Format shows the iSCSI protocol data unit. It has a basic header segment, additional header segments, header digest, data segment, and data digest. Only the first field is mandatory.

Figure 6.33 Format of the iSCSI protocol data unit.

The two optional fields labeled digest carry checksums for detecting changes (e.g., changes caused by noise) to the header and the data.34 To reduce processing cost, the basic header segment has a fixed size of 48 bytes, enough to accommodate a normal SCSI CDB. In the case of a larger CDB, additional header segments are necessary.

The iSCSI PDU type identifies the key function of the PDU. Several PDU types are defined. Naturally some have direct counterparts in SCSI, such as SCSI Command, SCSI Response, SCSI Data-In, and SCSI Data-Out. Those that do not are introduced to provide the necessary adaptation functions to the underlying TCP/IP. Among such PDU types are Login Request, Login Response, Logout Request, and Logout Response to support connection management and capability negotiation; and Ready to Transfer (R2T) to support target-driven flow control.

To illustrate the roles of different types of iSCSI PDUs, Figure 6.34 depicts an example information flow for the write operation.

Diagram shows two vertical lines representing iSCSI Initiator and iSCSI target and the directions of the commands and responses during the write operation.

Figure 6.34 A work flow for the write operation.

Each SCSI command PDU must have a matching response PDU indicating whether the command was carried out successfully. Before the SCSI response is issued, data transfer may be necessary between the initiator and the target. This transfer is carried out by the SCSI Data-In and SCSI Data-Out PDUs. In the example, the initiator, after the write command, sends the intended data to the target in several SCSI Data-Out PDUs until the pre-negotiated cap on unsolicited data35 is reached. After that, the initiator may send more data only when requested by the target.

An R2T PDU tells the initiator which parts of the data to send. The target sends R2T PDUs without waiting for responses to the old ones. Upon receiving an R2T PDU, the initiator sends the requested data in an SCSI Data-Out PDU. The target-driven scheme allows local optimization based on the load and configuration of the target. But the scheme comes at the cost of transmitting extra R2T PDUs, which might become unacceptable when the amount of data is small and the network delay is long. This is why iSCSI allows the initiator to transfer data to the target without solicitation, as seen earlier in the flow. (Also available is another even more efficient scheme known as immediate data, which allows data to be sent as part of the SCSI command PDU.) The maximum amount of unsolicited data is negotiated during a login procedure that takes place after a TCP connection is set up between the initiator and the target. We will discuss the login procedure later.
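The division of labor between unsolicited and solicited data can be seen in the following sketch, which lists the PDUs exchanged for a hypothetical 96-KB write with a 64-KB cap on unsolicited data. The parameter names are only loosely modeled on the negotiated iSCSI parameters, and the target's solicitation policy is simplified to fixed-size bursts.

def simulate_iscsi_write(data: bytes, unsolicited_cap: int, burst_length: int):
    """Sketch of the Data-Out flow for a write command.  The initiator sends
    unsolicited data up to the negotiated cap; the rest is sent only in
    response to R2T PDUs, each naming an offset and a length."""
    pdus = []
    # Unsolicited Data-Out PDUs, allowed up to the negotiated cap.
    offset = min(unsolicited_cap, len(data))
    if offset:
        pdus.append(("Data-Out (unsolicited)", 0, offset))
    # The target solicits the remainder, one R2T at a time.
    while offset < len(data):
        length = min(burst_length, len(data) - offset)   # the target's choice
        pdus.append(("R2T", offset, length))
        pdus.append(("Data-Out (solicited)", offset, length))
        offset += length
    pdus.append(("SCSI Response", None, None))
    return pdus

for pdu in simulate_iscsi_write(bytes(96 * 1024),
                                unsolicited_cap=64 * 1024,
                                burst_length=16 * 1024):
    print(pdu)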

One problem with using TCP/IP as the transport is that a single TCP connection may under-utilize the underlying physical media. As a remedy, the notion of an iSCSI session is introduced. An iSCSI session is a set of TCP connections linking an initiator and a target. This set may grow and shrink over time, allowing multiple TCP connections to be aggregated for a higher throughput.

With the availability of multiple connections comes the problem of using them correctly in the context of carrying out I/O. It is certainly reasonable to use separate connections for control and data transfer to ensure that a connection is always available for task management. Yet such a scheme requires monitoring and coordination across multiple connections, which may even terminate on different adaptors on the initiator or the target.

To avoid this complexity, iSCSI employs a scheme known as connection allegiance: the initiator can use any connection to issue a command but must stick to the same connection for all the ensuing exchanges related to that command.

The iSCSI sessions need to be managed. A big part of session management is handled by the iSCSI login procedure. Successful completion of the login procedure results in a new session or in a new connection being added to an existing session.

A prerequisite for the procedure is that the initiator knows the name and address of the storage device (i.e., the target) to use. One approach is to have such information pre-configured in the initiator. Then any change will require reconfiguration.

An alternative approach is based on the Service Location Protocol.36 It allows the initiator to dynamically discover available targets. To start the login procedure, the initiator first sets up a connection to a known TCP port37 on the target. Once the connection is established, the initiator performs the login steps through Login Request and Login Response PDUs. Mutual authentication may take place through a negotiated authentication method, with the Challenge Handshake Authentication Protocol (CHAP)38 as the default authentication method. As a minimum, the operational parameters are negotiated—among which are the maximum amount of unsolicited data, the maximum size of SCSI Data-In PDUs, and whether to include cyclic redundancy checksums in PDUs for error protection. When everything is in order, the target sends a Login Response PDU with an indication that the login procedure has completed successfully. Only then can the new connection associated with the session be used for SCSI communication.

Effective distribution of load across multiple TCP connections and recovery from errors are also part of session management. iSCSI supports a three-level hierarchy of error recovery, in ascending order of complexity.

At the bottom is session recovery, which rebuilds a defunct session all over again. It involves cleaning up all the associated artifacts (such as closing all TCP connections and aborting all pending SCSI commands with error indications) and then re-establishing a new set of TCP connections.

Next is the digest failure recovery, which, in addition to session recovery, allows the receiver of a PDU with a mismatched data digest to request that the PDU be resent.

Finally, the connection recovery includes the digest failure recovery and also allows a pending command on a broken connection to be transferred to another connection (which may need to be created).

Each recovery procedure is suitable for a specific environment. For example, in a LAN where errors of any kind are rare, it would be sufficient to just have session recovery. Overall, an iSCSI session can remain active for as long as it is possible to have a connection between the initiator and the target. The session terminates when the last connection closes. To make multiple connections appear as a single SCSI interconnect between the initiator and the target, iSCSI employs sequence numbers and tags.

In iSCSI, the identifier of a session consists of an initiator part and a target part. The former (the initiator session ID) is explicitly assigned by the initiator at the session establishment; the latter is implied by the initiator's selection of the TCP endpoint at connection establishment. To ensure that the initiator session ID is unique for every session that an initiator has with a given target (especially when the initiator is distributed), a hierarchical namespace controlled by a registration authority is prescribed.

We must emphasize that the mutual authentication step that may be part of the login procedure is only a one-time affair. It has no bearing on whether the ensuing communication is still between the authenticated nodes. Moreover, iSCSI itself does not provide any mechanisms to protect a connection or a session. All native iSCSI communication is in the clear, subject to eavesdropping and active attacks. In an untrusted environment, iSCSI should be used along with IPsec.39

6.2.5 Object Storage

NAS provides controlled access to shared data at the file level in a manner independent of an operating system. Its design requires every file-related I/O request from a client to go through a file server, which acts as an adaptor to the device storing the file. Thus the file server is a potential bottleneck limiting the I/O throughput. One way to address this limitation is to allow the client to have direct access to the storage device by sharing metadata [23]. The achieved performance gain, however, comes at the expense of security. A traditional block-storage device is relatively simple. It can read and write blocks of data in terms of zeros and ones, but does not have the faculty to understand the meaning of the data or discern constructs such as files or directories. Access control is possible only for the whole device. The client is given access to either everything or nothing at all. This is clearly problematic. Here object storage comes to the rescue.

A comparatively new technology—called object storage—allows data sharing across multiple operating systems securely and at the speed of direct storage access. Its chief characteristics are (a) a new device interface at a higher abstraction level than blocks and (b) additional intelligence in the device itself. Through the new interface (which has been standardized by the INCITS T10 technical committee), a storage device appears as a collection of objects. An object is an ordered set of bytes that is uniquely identifiable in a flat namespace. An object can hold any type of data, be it a file, database, or even an entire file system. What gets to be part of an object is up to the storage application. As objects are being created, deleted, modified, or cloned, the associated tasks of allocating and releasing blocks are handled by the storage device. To keep track of the used and free blocks, the device relies on per-object metadata.

The additional device intelligence refers to the capabilities to understand metadata, manage space, and support granular access control. Such on-device capabilities permit new performance optimization mechanisms (e.g., file pre-fetching and data reorganization), simplify application clustering, and enable automated storage management.

Granular access control is fundamental in Cloud Computing. The access control mechanism as standardized in ANSI INCITS 458-201140 is based on the notion of capability and credential. A capability describes the access rights of a client to an object, such as read, write, create, or delete. A credential is essentially a cryptographically protected tamper-proof capability, involving the keyed-Hash Message Authentication Code (HMAC)41 of a capability with a shared key. More specifically, a credential is a structure:

<capability, object storage identifier, capability key>,

where

capability key = HMAC (secret key, capability∥object storage identifier).

Figure 6.35 depicts the conceptual model. The security manager is responsible for granting credentials according to policy upon a client's request. The secret key for computing the capability key is shared between the security manager and the object storage device. The latter will carry out a command if and only if the client provides proof that it possesses a valid credential. Thus, the storage device serves as the access policy enforcer.

Diagram shows object storage access control model which include the security manager, object storage device, and client device connected to a converged storage network. Security manager grants credentials upon a client's request. Secret key is shared between the security manager and object storage device. Storage device will carry out a command if the client provides the proof.

Figure 6.35 Object storage access control model.
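Computationally, a credential is easy to construct once the shared secret is at hand, as the sketch below shows. The byte-level encoding and the choice of SHA-256 as the underlying hash are illustrative only; the standard defines its own encodings and algorithms.

import hmac, hashlib

def make_credential(secret_key: bytes, capability: bytes, object_storage_id: bytes):
    """Sketch of credential construction: the capability key is the HMAC of
    the capability concatenated with the object storage identifier, keyed
    with the secret shared by the security manager and the storage device."""
    capability_key = hmac.new(secret_key, capability + object_storage_id,
                              hashlib.sha256).digest()
    return capability, object_storage_id, capability_key

# The storage device, holding the same secret key, can recompute the
# capability key and thereby verify any proof derived from it.
secret = b"shared-secret-between-manager-and-device"   # hypothetical value
credential = make_credential(secret, b"READ:object-42", b"osd-0001")
print(credential[2].hex())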

Now the question is what makes reasonable proof. Could a credential itself serve as proof, much like a driver's license that attests to the driver's qualification for driving a certain class of vehicles?

To answer this question, let us consider the basic requirements of a proof of interest. At a minimum, it should be verifiable, tamper-proof, hard to forge, and safe against unauthorized use. A credential meets all but the last requirement; there is no in-built mechanism to bind it to the acquiring client or to the communication channel between the client and the storage device. (In contrast, a driver's license has a photograph of the driver to bind the license to the driver, although such a strong binding is not necessary for the problem at hand.) This is clearly not good, especially if the credential is subject to eavesdropping over an improperly protected storage transport. Thus, another proof scheme is in order.

The standardized scheme derives a proof from the capability key. The proof is a quantity computed with the capability key over selected request components, according to the negotiated security method. The following security methods are possible:

  • NOSEC. That is, no access control whatsoever. In this case, the storage device performs a command without requiring a proof. This method is useful only in a fully isolated environment where the links are secure and there is a trust relationship between the client and the storage device.
  • CAPKEY. In this case, the proof is the integrity-check value of the identifier of the channel between the client and the storage device. As a result, the particular channel in use is pinned. (The channel identifier is assigned by the object storage device, from which the client can obtain the information.) The scheme assumes that the channel itself is secure. It prevents unauthorized use of the credential over a different channel, while allowing delegation, namely forwarding the credential to another client [24]. For a given channel, the scheme is fairly lightweight. The client does not need to request a credential or compute the integrity value for each command separately.
  • CMDRSP. In this case, the proof is the integrity-check value of the command in the request. The scheme is tailored to environments where the channel between the client and the storage device is unsecure but it is impractical to provide integrity protection for the user data. In addition to command origin authentication and integrity protection, it also provides anti-replay protection through the use of a nonce42 in each request. The paper by M. Factor et al. [25] explains the corresponding nonce management mechanics.
  • ALLDATA. In this case, the proof contains multiple integrity-check values, including that for CMDRSP. The additional integrity-check values are computed over the data sent to and received from the storage device, respectively. So on top of what is afforded by CMDRSP, the scheme provides protection against replay and alteration of the data exchanged between the client and the storage device. Like CMDRSP, it is tailored to environments where communication channels are insecure.

Note that the access control mechanism does not involve actual client authentication. The resulting decoupling of the client and the storage device improves scalability, allowing the latter to scale independently of the client specifics. However, it gives rise to the need for a way to revoke credentials when they become accessible to a rogue client. Here, two options are available.

One option is that the security manager and object storage device change the relevant secret keys. Relatively easy to implement, this option, however, has a systemic effect beyond a single object. As soon as the secret keys are changed, all outstanding credentials become invalid.

With the other option, the security manager resets the policy access tag of the problematic object in the storage device. The tag is also part of the capability structure. A valid credential must have a policy access tag matching what is stored in the device.

6.2.6 Storage Virtualization

Storage virtualization [26] is concerned with abstraction of physical storage to shield applications from its underlying details, such as the actual media, access interface, and location.43 As a result, physically dispersed, heterogeneous storage systems can appear as a single aggregated entity, and vice versa. For example, ten 800-gigabyte hard disks can emulate an 8-terabyte virtual disk to the operating system. Conversely, an 8-terabyte hard disk can be partitioned into eight 1-terabyte virtual disks that are allocated to different hosts separately. The flexibility to allocate storage logically also allows dynamic resource management and improves overall resource utilization.

In general, storage virtualization entails (a) management of the metadata that map logical storage into physical devices and (b) translation and redirection of the I/O operations according to the mapping. The file-level virtualization builds on top of the block-level virtualization. The former calls for storage volumes being presented as files. As shown in Figure 6.36, with the use of a virtualization entity multiple file servers can appear as a single virtual file server.

Diagram shows file-level storage virtualization which has three levels; the host with VFS, the virtual file server, and the file server.

Figure 6.36 File-level storage virtualization.

With the block-level storage virtualization, storage appears to the operating system as a set of logical volumes or virtual disks. (A logical volume may combine non-contiguous physical partitions and span multiple physical storage devices.)
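At its core, block-level virtualization is a table lookup: metadata map each logical block to a physical device and block, and I/O requests are redirected accordingly. The sketch below illustrates the idea with a minimal extent-based mapping; all names and numbers are hypothetical.

class LogicalVolume:
    """Minimal sketch of block-level virtualization: a logical volume maps
    logical block numbers onto (physical device, physical block) pairs, so a
    volume may combine non-contiguous partitions and span several devices."""

    def __init__(self):
        self.extents = []   # (logical_start, length, device, physical_start)

    def add_extent(self, logical_start, length, device, physical_start):
        self.extents.append((logical_start, length, device, physical_start))

    def translate(self, logical_block):
        # Redirect an I/O request according to the mapping metadata.
        for logical_start, length, device, physical_start in self.extents:
            if logical_start <= logical_block < logical_start + length:
                return device, physical_start + (logical_block - logical_start)
        raise ValueError("block outside the logical volume")


# A 2000-block logical volume stitched from two non-contiguous regions
# on two different disks (all numbers are made up).
vol = LogicalVolume()
vol.add_extent(0, 1000, "disk0", 5000)
vol.add_extent(1000, 1000, "disk1", 0)
print(vol.translate(1234))   # ('disk1', 234)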

As shown in Figure 6.36, there are three approaches to block-level virtualization depending on where virtualization is done: the host, the network, or the storage device. In the host-based approach, virtualization is handled by a volume manager, which could be part of the operating system. The volume manager is responsible for mapping native blocks into logical volumes, while keeping track of the overall storage utilization. Ideally, the mapping can be adjusted dynamically so that the capacity of virtual storage grows or shrinks with the latest needs of a particular application. A major drawback of the approach is that per-host control is not favorable to optimal storage utilization in a multi-host environment, not to mention that the operational overhead of the volume manager is multiplied.

In the storage device-based approach, virtualization is handled by the controller of a storage system. Because of the close proximity of the controller to physical storage, this approach tends to result in good performance. Nevertheless, it has the drawback of being vendor-dependent and difficult (if not impossible) to work across heterogeneous storage systems.

In the network-based approach, virtualization is handled by a special function in a storage network, which may be part of a switch. The approach is transparent to hosts and storage systems as long as they support the appropriate storage network protocols (such as FC, FCoE, or iSCSI). Depending on how control traffic and application traffic are handled, it can be further classified as in-band (symmetric) or out-of-band (asymmetric).

Figure 6.37 illustrates the in-band approach, where the virtualization function for mapping and I/O redirection is always in the path of both the control and application traffic. Naturally, the virtualization function could become a bottleneck and a single point of failure. Caching and clustering are common techniques to mitigate these problems. On the positive side, the central point of control afforded by the in-band approach simplifies administration and support for advanced storage features such as snapshots, replication, and migration. The snapshot feature is of particular relevance to Cloud Computing. It can be applied to capture the state of a virtual machine at a certain point in time, reflecting the run-time conditions of its components (e.g., memory, disks, and network interface cards). The captured state allows rolling back, for example, after a faulty patch or a failure. Nevertheless, there is a trade-off: the performance of other virtual machines on the same host may suffer while the snapshot of a virtual machine is being taken.

Diagram shows In-band storage virtualization where host, storage network, and storage array are connected through control traffic and application traffic. The virtualization function or metadata controller is in the path of both the control and application traffic.

Figure 6.37 In-band storage virtualization.

Figure 6.38 illustrates the out-of-band approach, where the virtualization function is in the path of the control traffic but not the application traffic. The virtualization function directs the application traffic. In comparison with the in-band approach, the approach results in better performance since the application traffic can go straight to the destination without incurring any processing delay in the virtualization function. But this approach does not lend itself to supporting advanced storage features. More important, it imposes an additional requirement on the host to distinguish the control and application traffic and route the traffic appropriately. As a result, the host needs to add a virtualization adaptor, which, incidentally, may also support caching of both metadata and application data to improve performance. Per-host caching, however, faces the challenging problem of keeping the distributed cache consistent.

Diagram shows Out-of-band storage virtualization where host, storage network, and storage array are connected through control and application traffic. The virtualization function or the metadata controller is in the path of the control traffic but not in the application traffic.

Figure 6.38 Out-of-band storage virtualization.

The network-based approach is most suitable for Cloud Computing, given its relative transparency and flexibility in storage pooling. With this approach, storage can be assigned to VM hosts, which, in turn, can allocate the assigned virtual storage to VMs through their own virtualization facilities as described in [27]. The choice between the in-band and out-of-band approach, however, is not as clear and depends on the application. It would be ideal to have a hybrid approach combining the best of two worlds. Apparently, this is possible with an intelligent switch that, in effect, handles the control traffic out-of-band and the application traffic in-band [28].

6.2.7 Solid-State Storage

Storage technologies vary widely in performance, cost, and other attributes. Figure 6.39 gives a glimpse of the differences. The access time and cost shown there are relative to those of a magnetic disk. The access time of a magnetic disk is in the order of milliseconds. The Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), and flash memory are much faster than the hard disk. With no moving parts and fully electronic processing, they are also superior in other aspects (e.g., shock resistance and energy efficiency). In the case of DRAM, data are stored as electric charge in the capacitors. Since the charge leaks over time, the capacitors need to be refreshed regularly. The need for constant refreshing explains why this type of RAM is named dynamic. With SRAM, data are held in transistors rather than in capacitors. There is no need for refreshing, which makes SRAM faster than DRAM. Nevertheless, both SRAM and DRAM are volatile in that they need power to retain the stored data. In contrast, flash memory is non-volatile.

Table shows the relative access time, relative cost and retention time of storage technologies such as SRAM, DRAM, flash, magnetic disk, and magnetic tape.

Figure 6.39 A comparison of storage technologies.

Flash memory is a kind of Electrically Erasable Programmable Read-Only Memory (EEPROM).44 It keeps data in floating-gate islands, which can retain electric charge for a long period of time (years). The technology was invented by Dr. Fujio Masuoka while he worked at Toshiba in the 1980s. It was named flash because of the capability to erase a big chunk of memory fast. Dr. Masuoka actually devised two types of flash memory in 198445 and 1987, respectively. The first type is known as NOR flash because its basic construct has properties resembling those of a NOR gate. NOR flash is fast (at least faster than a hard disk), and it can be randomly addressed down to a given byte. Its storage density is limited, however.

The later type of flash memory removes this limitation (while also reducing the cost). It is called NAND flash because its basic construct has properties similar to those of a NAND gate. NAND flash, however, allows random access only in units that are larger than a byte. The NAND flash has made a splash in consumer electronics [29], and it is used much more widely than NOR flash—in digital cameras, portable music players, and smart phones.

The way to deal with the wide disparity in storage technologies is to implement storage hierarchically. Figure 6.40 depicts the memory hierarchy, where the lower the level of the memory, the closer is its distance to the processor, the faster its speed, the higher its cost, and the smaller its size. The constraint of the memory size at different levels implies that data can be stored in full only at a high level where the memory has a sufficient size. Furthermore, the data kept at a certain memory level can only be a subset of what is kept at a level above it.46 Overall, the memory hierarchy aims to create an illusion of infinite fast memory. The general strategy for placing data in the memory hierarchy is to keep data items that are more recently accessed closer to the processor and to include their neighbors as well when copying the data items to a lower level. For an in-depth discussion of memory hierarchy, we recommend the textbook on computer architecture by John L. Hennessy and David A. Patterson [30].

Diagram depicts the memory hierarchy that has the shape of a pyramid. As level number increases, the memory speed decreases, the distance to the processor increases, and the memory size increases.

Figure 6.40 The memory hierarchy.

Until recently, a common memory hierarchy has been SRAM for caches, DRAM for main memory, and hard disk for paged memory. The continued advances in solid-state storage technology have introduced other practical options.

In particular, NAND flash has emerged as the first serious challenger to a hard disk. It is increasingly used between SRAM and hard disk in the memory hierarchy. When used this way, flash memory is fashioned into solid-state drives that emulate hard-disk drives to help integration with the existing systems.

While solid-state drives will remain more expensive (in cost per byte) than hard-disk drives in the near future, they can deliver better I/O performance. Solid-state drives outperform hard-disk drives in random read operations by about three orders of magnitude. They are especially useful for applications involving a high percentage of random I/O operations. Web and online transaction processing services are familiar examples of such applications. Cloud services make an even better example, because they multiplex unrelated workloads on the same hardware (the I/O blender effect).

To be deployed in the Cloud, the solid-state drives must overcome three limitations inherent to NAND flash:

  1. A write operation over existing content requires that the content be erased first. (This makes write operations much slower than read operations.)
  2. Erase operations are done on a block basis, while write operations are done on a page basis.47
  3. Memory cells wear out after a limited number of write–erase cycles.

Given the limitations, directly updating the contents of a page in place will cause high latency because of the need to read, erase, and reprogram the entire block. Obviously, this is not desirable, which gives rise to the practice of relocate-on-write (or out-of-place write). Here, a free page is written with the latest data, while the old page is marked as invalid.

Write performance is improved at the expense of an ever-growing number of invalid pages. If not reclaimed, the invalid pages deplete the storage space quickly. To reclaim the storage, garbage collection is necessary.

For purely sequential write operations, garbage collection is straightforward. Blocks can be invalidated and reclaimed one by one as data are written page by page. No extra write–erase operations are incurred. For random write operations, in contrast, the situation is much more complex, and so more sophisticated algorithms are required. For instance, an algorithm could maximize the number of reclaimed pages or minimize the number of additional read and write operations. The effectiveness of a garbage collection algorithm depends on the degree of write amplification that it incurs.

With write amplification, the eventual number of write operations carried out on the NAND flash is greater than the number requested by the host. Naturally, it is desirable to contain write amplification as much as possible for performance and endurance reasons. Such containment is of particular importance in Cloud Computing, since write amplification worsens as the amount of random write traffic increases.

There are two techniques to reduce write amplification.

One technique is over-provisioning, namely limiting the user address space to a fraction of the raw memory capacity. Over-provisioning increases the number of invalid pages in the block selected for reclamation. It is effective and does not depend on special support from the operating system (or the file system).
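The interplay of relocate-on-write, garbage collection, over-provisioning, and write amplification can be seen in a toy simulation like the one below. The flash geometry, the greedy victim selection, and the workload are all made up; the point is only that the flash performs more writes than the host requests, and the printed ratio is the resulting write amplification.

import random

PAGES_PER_BLOCK, NUM_BLOCKS = 32, 32
USER_PAGES = 24 * PAGES_PER_BLOCK   # over-provisioning: only 24 of 32 blocks' worth exposed

class FlashSim:
    """Toy model of relocate-on-write plus greedy garbage collection."""

    def __init__(self):
        self.valid = [set() for _ in range(NUM_BLOCKS)]   # valid logical pages per block
        self.block_of = {}                                # logical page -> block holding it
        self.used = [0] * NUM_BLOCKS                      # programmed pages per block
        self.host_writes = self.flash_writes = 0

    def _program(self, logical):
        open_blocks = [b for b in range(NUM_BLOCKS) if self.used[b] < PAGES_PER_BLOCK]
        if not open_blocks:                               # no free page left anywhere
            self._collect_garbage()
            open_blocks = [b for b in range(NUM_BLOCKS) if self.used[b] < PAGES_PER_BLOCK]
        block = max(open_blocks, key=lambda b: self.used[b])   # keep filling the open block
        if logical in self.block_of:                      # relocate-on-write: invalidate old copy
            self.valid[self.block_of[logical]].discard(logical)
        self.valid[block].add(logical)
        self.block_of[logical] = block
        self.used[block] += 1
        self.flash_writes += 1

    def _collect_garbage(self):
        victim = min(range(NUM_BLOCKS), key=lambda b: len(self.valid[b]))
        survivors = list(self.valid[victim])
        self.valid[victim], self.used[victim] = set(), 0  # erase (a real controller copies first)
        for logical in survivors:                         # rewrite the still-valid pages
            del self.block_of[logical]
            self._program(logical)

    def host_write(self, logical):
        self.host_writes += 1
        self._program(logical)

sim = FlashSim()
for _ in range(50_000):                                   # random-write workload
    sim.host_write(random.randrange(USER_PAGES))
print("write amplification:", round(sim.flash_writes / sim.host_writes, 2))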

The other technique relies on a special ATA command (commonly known as TRIM) that informs the underlying storage which data have been deleted. With hard disks, there is no need for such a command because the storage medium supports in-place writes. The command makes a great difference for NAND flash, as it saves time that would otherwise be wasted during garbage collection.

To extend a NAND flash's lifetime in general, the practice of “wear leveling” is used to spread write–erase operations as evenly as possible across blocks. (Note that out-of-place write intrinsically supports wear leveling.) The practice introduces write amplification of its own. To see the effect, consider the extreme example shown in Figure 6.41.

Image described by surrounding text.

Figure 6.41 Hypothetical state of a NAND flash memory.

In the figure, the cells correspond to pages, which may be valid, invalid, or free (unmarked); the rows correspond to blocks. The shade of each block denotes its age: the darker the shade, the older the block. Block 5 is the oldest, reaching the end of its life, while block 1 is the youngest, still having many write–erase cycles ahead of it. Given this state, the memory will soon become unusable unless another block is vacated to take over from block 5 as the free block.

Assuming the data stored in block 1 are static, it is the best block to vacate, for two reasons: (1) block 5 will not need to be updated once programmed with the data of block 1; and (2) the memory's life is extended by the most cycles. On the other hand, relocating the contents of block 1 to block 5 requires more read and write operations than any other option. Hence, wear leveling needs to weigh not only the write–erase cycles left unused but also the cycles spent moving unchanged data. A good analysis of write amplification together with wear leveling is given in [32].

Despite its high cost, the unique performance needs of Cloud services still leave room for DRAM-based storage devices. To address the volatility problem, such devices come with built-in batteries or other backup power sources. Yet more interesting is the RAMCloud project at Stanford University [4]. The project aims to create a new class of storage for use in a data center that can keep all data in DRAM all the time. Commodity servers are the building blocks. Depending on the scale needed, hundreds or thousands of commodity servers can be clustered to form a single unified large-scale storage system.

Expected to have exceptional performance, such storage systems face several challenges. Among other things (e.g., management of highly distributed storage and low-latency networking), the data stored on RAMCloud ought to be as durable as if they were on hard disk; a power failure must not cause permanent data loss; and failure of a single storage server cannot result in data loss or unavailability for more than a few seconds. These requirements speak to the need for a replication and backup technique that can make use of non-volatile disk storage while maintaining the original performance advantage afforded by DRAM.

“Buffered logging” [33] is one such technique. (Note that buffered logging is related to the file system journaling discussed earlier.) It uses both disk and memory for backup. Data changes on the master storage server are replicated as log entries on backup memories synchronously but on disks asynchronously. The memory copies are temporary: they are buffered and then transferred to disks in batches, and once replicated on disks, the log entries are removed from the backup memories. Buffered logging keeps up performance but leaves behind a potential problem: the buffered data vanish if the master and all backups lose power simultaneously. An obvious solution is to fit each storage server with a small battery so that it can flush buffered log entries to disk after a power loss. Ironically, the special fitting deviates from the original assumption of using commodity servers.

In essence, RAMCloud provides a remote cache of practically infinite capacity to boost application performance. In this respect, it is influenced by Memcached [5], an open-source distributed caching system originally developed by Brad Fitzpatrick to improve the performance of LiveJournal [34]. Memcached supports a simple key-value store for small chunks of arbitrary data in DRAM on commodity computers. It is intended purely for caching, allowing applications to bypass heavy operations such as database queries. Data durability is never part of the equation; each cached item is valid only for a certain period. Memcached is client-server based, employing a request/response protocol (which may run over TCP or UDP).

A server stores data in a hash table. Keys are unique strings used to index into the table. For example, the result of a database query can be cached on a memcached server with the query string as the key. Although each data item has a limited lifetime, memcached does not implement garbage collection to actively reclaim memory. Instead, memory is reclaimed only when an expired item is retrieved or when space is needed for caching a new item. In the latter case, one of the least-recently-used items is subject to eviction: an expired item, if one exists, is reclaimed first; otherwise, a still-valid item is evicted.
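The following toy cache, a sketch rather than memcached's slab-based implementation, illustrates these reclamation rules: expired items are reclaimed lazily when accessed, and an eviction scans from the least-recently-used end, preferring an expired item over a still-valid one.

# Toy cache illustrating lazy expiration and LRU eviction (illustrative only).

import time
from collections import OrderedDict

class ToyCache:
    def __init__(self, capacity=3):
        self.capacity = capacity
        self.items = OrderedDict()   # key -> (value, expiry_time); LRU order

    def get(self, key):
        entry = self.items.get(key)
        if entry is None:
            return None
        value, expiry = entry
        if time.time() >= expiry:          # expired: reclaim lazily on access
            del self.items[key]
            return None
        self.items.move_to_end(key)        # mark as most recently used
        return value

    def set(self, key, value, ttl=60):
        if key not in self.items and len(self.items) >= self.capacity:
            self._evict_one()
        self.items[key] = (value, time.time() + ttl)
        self.items.move_to_end(key)

    def _evict_one(self):
        now = time.time()
        for k, (_, expiry) in self.items.items():   # scan from the LRU end
            if expiry <= now:                       # prefer an expired item
                del self.items[k]
                return
        self.items.popitem(last=False)              # else evict the LRU item

cache = ToyCache(capacity=2)
cache.set("user:42", {"name": "Ada"}, ttl=30)
cache.set("page:/home", "<html>...</html>", ttl=5)
cache.set("query:top10", [1, 2, 3])   # forces an eviction of the LRU item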

Depending on the size of DRAM available on a server, caching the workload data may require more than one server. In this case, the hash table is distributed across multiple servers, which form a cluster with aggregated DRAM. Memcached servers, by design, are neither aware of one another nor coordinated centrally. It is the job of a client to select which server to use, and the client (armed with the knowledge of the servers in use) does so based on the key of the data item to be cached.

How should the hash table be distributed so that the same server is selected for the same key? A naïve scheme might be as follows:

s = H(k) mod n

where H(k) is a hash function, k the key, n the number of servers, and s the server label, which is assigned the remainder of dividing H(k) by n. The scheme works as long as n is constant, but it will most likely yield a different server for the same key when the number of servers grows or shrinks dynamically, as is typically the case in Cloud Computing. As a result, cache misses abound, application performance degrades, and the cached data across the latest cluster essentially have to be rebuilt.
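A quick experiment, with MD5 standing in as the hash function, shows how disruptive the naïve scheme is when a single server is added to a pool of four: only about a fifth of the keys keep their server.

# How many keys keep their server when the pool grows from 4 to 5 servers
# under the naive scheme s = H(k) mod n. MD5 is merely a stand-in hash.

import hashlib

def H(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

keys = [f"key-{i}" for i in range(10000)]
before = {k: H(k) % 4 for k in keys}   # n = 4 servers
after  = {k: H(k) % 5 for k in keys}   # one server added: n = 5

unchanged = sum(1 for k in keys if before[k] == after[k])
print(f"{unchanged / len(keys):.1%} of keys still map to the same server")
# Typically around 20%, so roughly 80% of lookups now miss the cache.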

Obviously this is undesirable, and so another scheme is in order. To this end, memcached implementations usually employ variants of consistent hashing [35] to minimize the updates required as the server pool changes and maximize the chance of having the same server for a given key. The basic algorithm of consistent hashing [36] can be outlined as follows:

  • Map the range of a hash function to a circle, with the largest value wrapping around to the smallest value in a clockwise fashion;
  • Assign a value (i.e., a point on the circle) to each server in the pool as its identifier; and
  • To cache a data item of key k, select the server whose identifier is the smallest value equal to or larger than H(k), wrapping around to the smallest identifier if necessary.

In [36], the server selected for key k is called k's successor; it is responsible for the arc between the identifier of the previous server and its own. As an example, Figure 6.42 shows a circle of three servers, where server 1 is responsible for caching the data items whose keys hash to 6, 7, 0, and 1; server 3 for keys hashed to 2 and 3; and server 5 for keys hashed to 4 and 5.


Figure 6.42 A circle in consistent hashing.

An immediate result of consistent hashing is that the departure or arrival of a server affects only one other server, namely its successor. When a new server p joins the pool, some of the keys previously assigned to p's successor are reassigned to p, while the other servers are not affected. Similarly, when a server p leaves the pool, the keys previously assigned to it are reassigned to p's successor, while the other servers are not affected. In the example of Figure 6.42, adding a new server 7 would reassign keys 6 and 7 to the new server; removing server 3 would reassign keys 2 and 3 to server 5.
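A minimal sketch of the basic algorithm, using the toy values of Figure 6.42 (a hash range of 0 to 7 and servers at points 1, 3, and 5), reproduces both the key assignment and the effect of adding server 7 and removing server 3. A production implementation would use a much larger hash space and, typically, multiple points per server.

# Minimal consistent-hashing ring using the toy values of Figure 6.42.

from bisect import bisect_left, insort

class Ring:
    def __init__(self, server_points):
        self.points = sorted(server_points)        # server identifiers on the circle

    def successor(self, hashed_key):
        i = bisect_left(self.points, hashed_key)   # first identifier >= H(k)
        return self.points[i % len(self.points)]   # wrap around the circle

    def add(self, point):
        insort(self.points, point)

    def remove(self, point):
        self.points.remove(point)

ring = Ring([1, 3, 5])
print([ring.successor(h) for h in range(8)])
# -> [1, 1, 3, 3, 5, 5, 1, 1]: keys 6, 7, 0, 1 go to server 1, as in the text.

ring.add(7)        # only keys 6 and 7 move (from server 1 to server 7)
ring.remove(3)     # only keys 2 and 3 move (to server 5)
print([ring.successor(h) for h in range(8)])
# -> [1, 1, 5, 5, 5, 5, 7, 7]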

The basic algorithm allows the server pool to scale effectively and provides a sound foundation for further enhancements. An enhancement for achieving better load distribution among servers is described in [37].

Overall, memcached proves to be an effective, scalable mechanism to improve application performance. It is widely used by high-traffic websites such as Facebook, Twitter, and YouTube. In particular, Facebook has deployed thousands of memcached servers to support its social networking services, creating the largest key-value store in the world—where over a billion requests per second are processed and trillions of items are stored [38].


References

  1. Glanz, J. (2012) The Cloud factories: Power, pollution and the Internet. The New York Times, September 22. www.nytimes.com/2012/09/23/technology/data-centers-waste-vast-amounts-of-energy-belying-industry-image.html?pagewanted=all.
  2. Pianese, F., Bosch, P., Duminuco, A., et al. (2010) Toward a Cloud operating system. Network Operations and Management Symposium Workshops (NOMS Wksps). IEEE/IFIP, pp. 335–342.
  3. SNIA Technical Council (2003) The SNIA shared storage model. www.snia.org/sites/default/files/SNIA-SSM-text-2003-04-13.pdf.
  4. Ousterhout, J. (n.d.) RAMCloud. Stanford University. https://ramCloud.stanford.edu/wiki/display/ramCloud/RAMCloud.
  5. Dormando (n.d.) What is memcached? http://memcached.org/.
  6. Kant, K. (2009) Data center evolution—a tutorial on state of the art, issues, and challenges. Computer Networks, 53(17), 2939–2965.
  7. Electronic Industries Alliance (1992) EIA-310-D: Cabinets, Racks, Panels, and Associated Equipment. Electronic Industries Alliance, Arlington.
  8. ISO/IEC (1994) ISO/IEC 7498-1: Information Technology—Open Systems Interconnection—Basic Reference Model: The Basic Model. International Organization for Standardization, Geneva.
  9. Widmer, A.X. and Franaszek, P.A. (1983) A DC-balanced, partitioned-block, 8B/10B transmission code. IBM Journal of Research and Development, 27(5), 440–451.
  10. Jacob, B., Ng, S.W., and Wang, D.T. (2008) Memory Systems: Cache, DRAM, Disk. Elsevier Science, Amsterdam.
  11. Paulsen, K. (2011) Moving Media Storage Technologies: Applications & Workflows for Video and Media Server Platforms. Elsevier Science, Amsterdam.
  12. Sandberg, R., Goldberg, D., Kleiman, S., et al. (1985) Design and implementation of the Sun network filesystem. Proceedings of the Summer USENIX Conference. USENIX, the Advanced Computing Systems Association, Berkeley.
  13. Nobel Prize organization (2007) Class for Physics of the Royal Swedish Academy of Sciences. The Nobel Prize in Physics 2007. www.nobelprize.org/nobel_prizes/physics/laureates/2007/advanced-physicsprize2007.pdf.
  14. Ghemawat, S., Gobioff, H., and Leung, S.-T. (2003) The Google File System. SOSP '03, Bolton Landing, NY, pp. 29–43.
  15. Doeppner, T.W. (2011) Operating Systems In Depth: Design and Programming. John Wiley & Sons, Inc, Hoboken.
  16. Kleiman, S.R. (1986) Vnodes: An architecture for multiple file system types in Sun Unix. Proceedings of the Summer USENIX Conference.
  17. Birrell, A.D. and Nelson, B.J. (1984) Implementing remote procedure calls. ACM Transactions on Computer Systems, 2(1), 39–59.
  18. Thurlow, R. (2009) RFC 5531, RPC: Remote Procedure Call Protocol Specification Version 2. Vol. RFC 5531. http://tools.ietf.org/html/rfc5531.
  19. Eisler, M. (2006) RFC 4506, XDR: External Data Representation Standard. http://tools.ietf.org/html/rfc4506.
  20. Dunning, D., Regnier, G., McAlpine, G., et al. (1998) The virtual interface architecture. IEEE Micro, 2(18), 66–76.
  21. Van Meter, R., Finn, G.G., and Hotz, S. (1998) VISA: Netstation's virtual Internet SCSI adapter. ACM SIGOPS Operating Systems Review (ACM), 32(5), 71–80.
  22. Meth, K.Z. and Satran, J. (2003) Design of the iSCSI Protocol. Proceedings of the 20th IEEE/11th NASA Goddard Conference on Mass Storage Systems and Technologies (MSS'03), IEEE, pp. 116–122.
  23. Azagury, A., Dreizin, V., Factor, M., et al. (2003) Towards an object store. Proceedings of the 20th IEEE/11th NASA Goddard Conference on Mass Storage Systems and Technologies (MSS'03), IEEE, San Francisco, CA, pp. 165–176.
  24. Azagury, A., Canetti, R., Factor, M., et al. (2002) A two layered approach for securing an object store network. Proceedings of the First International IEEE Security in Storage Workshop, IEEE, San Francisco, CA, pp. 10–23.
  25. Factor, M., Nagle, D., Naor, D., et al. (2005) The OSD security protocol. Proceedings of the Third International IEEE Security in Storage Workshop, IEEE, San Francisco, CA, pp. 11–23.
  26. Troppens, U., Erkens, R., Mueller-Friedt, W., et al. (2011) Storage Networks Explained: Basics and Application of Fibre Channel SAN, NAS, iSCSI, Infiniband and FCoE. John Wiley & Sons Ltd, Chichester.
  27. Vaghani, S.B. (2010) Virtual machine file system. ACM SIGOPS Operating Systems Review, 44(4), 57–70.
  28. Smoot, S.R. and Tan, N.K. (2012) Private Cloud Computing: Consolidation, Virtualization, and Service-Oriented Infrastructure. Morgan Kaufmann, Waltham, MA.
  29. Harari, E. (2012) Flash memory—the great disruptor! In Winner, L. (ed.), IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC). IEEE, San Francisco, CA.
  30. Hennessy, J.L. and Patterson, D.A. (2012) Computer Architecture: A Quantitative Approach. Morgan Kaufmann, Waltham, MA.
  31. Computer History Museum (2012) Oral history of Fujio Masuoka, September 21. http://archive.computerhistory.org/resources/access/text/2013/01/102746492-05-01-acc.pdf.
  32. Hu, X.-Y., Eleftheriou, E., Haas, R., et al. (2009) Write amplification analysis in flash-based solid state drives. Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference, The Association for Computing Machinery, New York.
  33. Ousterhout, J., Agrawal, P., Erickson, D., et al. (2011) The case for RAMCloud. Communications of the ACM, 54(7), pp. 121–130.
  34. Fitzpatrick, B. (2004) Distributed caching with memcached. Linux Journal, 124, 5.
  35. Karger, D., Lehman, E., Leighton, T., et al. (1997) Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web. Proceedings of the 29th Annual ACM Symposium on Theory of Computing, ACM, New York.
  36. Stoica, I., Morris, R., Karger, D., et al. (2001) Chord: A scalable peer-to-peer lookup service for internet applications. Proceedings of the 2001 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, ACM, New York, 31(4), pp. 149–160.
  37. DeCandia, G., Hastorun, D., Jampani, M., et al. (2007) Dynamo: Amazon's highly available key-value store. Proceedings of the 21st ACM SIGOPS Symposium on Operating Systems Principles, ACM, New York, 41(6), pp. 205–220.
  38. Nishtala, R., Fugal, H., Grimm, S., et al. (2013) Scaling memcache at Facebook. Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation, USENIX Association, Berkeley, CA.