Chapter 4. Hardware

In this chapter, we will review the technology decisions to be made in terms of the hardware required for a RAC cluster. The question of hardware is often neglected by DBAs; however, for a well-tuned application, the potential performance improvements offered by the latest server, interconnect, and storage architectures running current versions of Linux are greater than those achievable by upgrades or software tuning on previous generations of technology. For this reason, hardware is a crucial consideration in designing the best RAC configurations, particularly for understanding how all of the components operate in conjunction for optimal cluster performance.

Knowing your hardware is also essential for assessing the total cost of ownership of a clustered solution. A proper assessment considers not just the costs of hardware alone, but also the significant potential savings offered in software acquisition and maintenance by selecting and sizing the components correctly for your requirements. In a RAC environment, this knowledge is even more important because, with a requirement for increased capacity, you are often presented with a decision between adding another node to an existing cluster and replacing the cluster in its entirety with updated hardware. You also have the choice between a great number of small nodes and a small number of large nodes in the cluster. Related to how you make this choice is how the network communication between the nodes in the cluster is implemented, so this chapter will also discuss the implementation of input/output (I/O) on the server itself in the context of networking. Knowing the hardware building blocks of your cluster is fundamental to making the correct decisions; it can also assist you in understanding the underlying reasons for how the Linux operating system is configured to achieve optimal Oracle RAC performance when installing and configuring Linux (see Chapter 6 for more details).

In addition to the Linux servers themselves, a crucial component of any RAC configuration is the dedicated storage array separate from any of the individual nodes in the cluster, upon which the Oracle database is installed and shared between the nodes (see Chapter 2); therefore, we also cover the aspects of storage I/O relating to RAC. In the context of RAC, I/O relates to the reads and writes performed on a disk subsystem, irrespective of the protocols used. The term storage encompasses all aspects relating to this I/O that enable the server acting as the RAC node to communicate with nonvolatile disk.

Considering hardware presents a challenge in that, given the extraordinary pace of development in computing technology, any snapshot of a particular cluster configuration will soon be made obsolete by the next generation of systems and technologies. So, rather than focus on any individual configuration, we will review the general areas to consider when purchasing hardware for RAC that should remain relevant over time. With this intent, we will also refrain from directly considering different form factors, such as blade or rack-mounted servers. Instead, we will consider the lower-level technology that often lies behind a form factor decision.

The aim of this chapter is to provide a grounding to build a checklist when selecting a hardware platform for RAC on Linux. This will enable you to make an optimal choice. However, one factor that should be abundantly clear before proceeding is that this chapter will not tell you precisely what server, processor, network interconnect, or storage array to purchase. No two Oracle environments are the same; therefore, a different configuration may be entirely applicable to each circumstance.

Oracle Availability

Before beginning the selection of a hardware platform to run Oracle RAC on Linux, your first port of call should be Oracle itself, to identify the architectures on which Oracle releases the Oracle database for Linux with the RAC option.

At the time of writing, the following five architectures using Oracle terminology have production releases of the Oracle Database on Linux:

  • x86: A standard 32-bit Intel compatible x86 processor

  • x86-64: A 64-bit extended x86 processor (i.e., Intel EM64T, AMD64)

  • Itanium: The Intel Itanium processor

  • Power: The IBM Power processor

  • zSeries: The IBM mainframe

In this chapter, our focus is on the first two architectures because of the availability of Oracle Enterprise Linux for these platforms as discussed in Chapter 1, and the additional availability of Oracle Database 11g Release 2.

In addition to simply reviewing the software availability, we also recommend viewing the RAC Technologies Matrix for Linux platforms to identify platform-specific information for running Oracle RAC in a particular environment. With the advent of the most recent generation of online Oracle support called My Oracle Support, Oracle Product Certification Matrices are no longer available for public access without a support subscription. It is therefore necessary to have a valid login at https://support.oracle.com to access the RAC Technologies Matrix.

If you have a valid My Oracle Support login, at the top-level menu click the More... tab, followed by Certifications.

On the Certification Information page, enter the search details in the dropdown menus under the Find Certification Information heading. For example, under Product Line, select Oracle Database Products. Under both Product Family and Product Area, select Oracle Database. Under Product, select Oracle Server – Enterprise Edition. And under Product Release, select 11gR2RAC. Leave the other options at their default selections and press the Search button to display the Certification Information by Product and Platform for Oracle Server - Enterprise Edition. Select the Certified link next to the platform of interest to display the Certification Detail. Now click Certification Notes, and the page displayed will include a link to a RAC Technologies Compatibility Matrix (RTCM) for Linux clusters, which classifies the information into the following four areas:

  • Platform Specific Information on Server/Processor Architecture

  • Network Interconnect Technologies

  • Storage Technologies

  • Cluster File System/Volume Manager

It is important to note that these technology areas should not be considered entirely in isolation. Instead, the technology of the entire cluster should be chosen in a holistic manner. For example, the importance of the storage-compatibility matrix published by the storage vendors you are evaluating is worth stressing. Many storage vendors perform comprehensive testing of server architectures running against their technology, often including Oracle RAC as a specific certification area. To ensure compatibility and support for all of your chosen RAC components, the servers should not be selected completely independently of the storage—and vice versa.

An additional subject you'll want to examine for compatible technology for RAC is cluster software, which, as a software component, is covered in detail in Chapter 8.

In this chapter, we will concentrate on the remaining hardware components and consider the selections of the most applicable server/processor architecture, network interconnect, and storage for your requirements.

Server Processor Architecture

The core components of any Linux RAC configuration are the servers themselves that act as the cluster nodes. These servers all provide the same service in running the Linux operating system, but do so with differing technologies. One of those technologies is the processor, or CPU.

As you have previously seen, selecting a processor on which to run Oracle on an Oracle Linux supported platform presents you with two choices. Table 4-1 shows the information gleaned from the choices' textual descriptions in the Oracle technology matrix.

Table 4.1. Processor Architecture Information

Server/Processor Architecture   Processor Architecture Details
Linux x86                       Support on Intel and AMD processors that adhere to
                                the 32-bit x86 architecture.
Linux x86-64                    Support on Intel and AMD processors that adhere to
                                the 64-bit x86-64 architecture. 32-bit Oracle on
                                x86-64 with a 64-bit operating system is not
                                supported. 32-bit Oracle on x86-64 with a 32-bit
                                operating system is supported.

x86 Processor Fundamentals

Although the processor architecture information in Table 4-1 describes two processor architectures, the second, x86-64, is an extension to the instruction set architecture (ISA) of x86. The x86 architecture is a complex instruction set computer (CISC) architecture and has been in existence since 1978, when Intel introduced the 16-bit 8086 CPU. The de facto standard for Linux systems is x86 (and its extension, x86-64) because it is the architecture on which Linux evolved from a desktop-based Unix implementation to one of the leading enterprise-class operating systems. As detailed later in this section, all x86-64 processors support operation in 32-bit or 64-bit mode.

Moore's Law is the guiding principle for understanding how and why newer generations of servers continue to deliver near exponential increases in Oracle Database performance at reduced levels of cost. Moore's Law is the prediction dating from 1965 by Gordon Moore that, due to innovations in CPU manufacturing process technology, the number of transistors on an integrated circuit can be doubled every 18 months to 2 years. The design of a particular processor is tightly coupled to the silicon process on which it is to be manufactured. At the time of writing, 65nm (nanometer), 45nm, and 32nm processes are prevalent, with 22nm technologies in development. There are essentially three consequences of a more advanced silicon production process:

  • The more transistors on a processor, the greater potential there is for the CPU design to utilize more features: The most obvious examples of this are multiple cores and large CPU cache sizes, the consequences of which you'll learn more about later in this chapter.

  • For the same functionality, reducing the processor die size reduces the power required by the processor: This makes it possible either to increase the processor clock speed, thereby increasing performance, or to lower the overall power consumption for equivalent levels of performance.

  • Shrinking the transistor size increases the yield of the microprocessor production process: This lowers the relative cost of manufacturing each individual processor.

As processor geometries shrink and clock frequencies rise, however, there are challenges that partly offset some of the aforementioned benefits and place constraints on the design of processors. The most important constraint is that the transistor current leakage increases along with the frequency, leading to undesired increases in power consumption and heat in return for gains in performance. Additionally, other constraints particularly relevant to database workloads are memory and I/O latency failing to keep pace with gains in processor performance. These challenges are the prime considerations in the direction of processor design and have led to features such as multiple cores and integrated memory controllers to maintain the hardware performance improvements that benefit Oracle Database implementations. We discuss how the implications of some of these trends require the knowledge and intervention of the DBA later in this chapter.

One of the consequences of the processor fundamentals of Moore's Law in an Oracle RAC environment is that you must compare the gains from adding additional nodes to an existing cluster on an older generation of technology to those from refreshing or reducing the existing number of nodes based on a more recent server architecture. The Oracle DBA should therefore keep sufficiently up-to-date on processor performance that he can adequately size and configure the number of nodes in a cluster for the required workload over the lifetime of the hosted database applications.

Today's x86 architecture processors deliver high performance with features such as being superscalar, being pipelined, and possessing out-of-order execution; understanding some of the basics of these features can help in designing an optimal x86-based Oracle RAC environment.

When assessing processor performance, the clock speed is often erroneously used as a singular comparative measure. The clock speed, or clock rate, is usually measured in gigahertz, where 1GHz represents 1 billion cycles per second. The clock speed determines the speed at which the processor executes instructions. However, the CPU's architecture is absolutely critical to the overall level of performance, and no reasonable comparison can be made based on clock speed alone.

For a CPU to process information, it needs to first load and store the instructions and data it requires for execution. The fastest mode of access to data is to the processor's registers. A register can be viewed as an immediate holding area for data before and after calculations. Register access can typically occur within a single clock cycle. For example, assume you have a 2GHz CPU: retrieving the data in its registers will take one clock cycle of half a billionth of a second (half a nanosecond). A general-purpose register can be used for arithmetic and logical operations, indexing, shifting, input, output, and general data storage before the data is operated upon. All x86 processors have additional registers for floating-point operations and other architecture-specific features.

Like all processors, an x86 CPU sends instructions on a path termed the pipeline through the processor on which a number of hardware components act on the instruction until it is executed and written back to memory. At the most basic level, these instructions can be classified into the following four stages:

  1. Fetch: The next instruction of the executing program is loaded from memory. In reality, the instructions will already have been preloaded in larger blocks into the instruction cache (we will discuss the importance of cache later in this chapter).

  2. Decode: x86 instructions themselves are not executed directly, but are instead translated into microinstructions. Decoding the complex instruction set into these microinstructions may take a number of clock cycles to complete.

  3. Execute: The microinstructions are executed by dedicated execution units, depending on the type of operation. For example, floating-point operations are handled by dedicated floating-point execution units.

  4. Write-back: The results from the execution are written back to an internal register or system memory through the cache.

This simple processor model executes the program by passing instructions through these four stages, one per clock cycle. However, performance potentially improves if the processor does not need to wait for one instruction to complete write-back before fetching another, and a significant amount of improvement has been accomplished through pipelining. Pipelining enables the processor to be at different stages with multiple instructions at the same time; because the clock speed will be limited by the time needed to complete the longest of the pipeline's stages, breaking the pipeline into shorter stages enables the processor to run at a higher frequency. For this reason, current x86 enterprise processors often have pipelines of between 10 and 20 stages. Although each instruction will take more clock cycles to pass through the pipeline, and only one instruction will actually complete on each core per clock cycle, a higher frequency increases the utilization of the processor execution units, and hence the overall throughput.

One of the most important aspects of performance for current x86 processors is out-of-order execution, which adds the following two stages to the simple example pipeline around the execution stage:

  • Issue/schedule: The decoded microinstructions are issued to an instruction pool, where they are scheduled onto available execution units and executed independently. Maintaining this pool of instructions increases the likelihood that an instruction and its input will be available to process on every clock cycle, thereby increasing throughput.

  • Retire: Because the instructions are executed out of order, they are written to the reorder buffer (ROB) and retired by being put back into the correct order intended by the original x86 instructions before write-back occurs.

Further advancements have also been made in instruction-level parallelism (ILP) with superscalar architectures. ILP introduces multiple parallel pipelines to execute a number of instructions in a single clock cycle, and current x86 architectures support a peak execution rate of at least three and more typically four instructions per cycle.

Although x86 processors operate at high levels of performance, all of the data stored in your database will ultimately reside on disk-based storage. Now assume that your hard-disk drives have an access time of 10 milliseconds. If the example 2GHz CPU were required to wait for a single disk access, it would wait for a period of time equivalent to 20 million CPU clock cycles. Fortunately, the Oracle SGA acts as an intermediary resident in random access memory (RAM) on each node in the cluster. Memory access times can vary, and the CPU cache also plays a vital role in keeping the processor supplied with instructions and data. We will discuss the performance potential of each type of memory and the influence of the CPU cache later in this chapter. However, with the type of random access to memory typically associated with Oracle on Linux on an industry-standard server, the wait will be approximately between 60 and 120 nanoseconds. Even the lower figure of 60 nanoseconds represents 120 clock cycles for which the example 2GHz CPU must wait to retrieve data from main memory.

You now have comparative statistics for accessing data from memory and disk. Relating this to an Oracle RAC environment, the most important question to ask is this: "How does Cache Fusion compare to local memory and disk access speeds?" (One notable exception to this question is for data warehouse workloads, as discussed in Chapter 14.) A good average receive time for a consistent read or current block for Cache Fusion will be approximately two to four milliseconds with a gigabit Ethernet-based interconnect. This is the equivalent of 4 to 8 million clock cycles for remote Oracle cache access, compared to 120 clock cycles for local Oracle cache access. Typically, accessing data from a remote SGA through Cache Fusion gives you a dramatic improvement over accessing data from a disk. However, with the increased availability of high-performance solid-state storage, Flash PCIe cards, and enterprise storage that utilizes RAM-based caching (as discussed later in this chapter), it may be possible that disk-based requests could complete more quickly than Cache Fusion in some configurations. Therefore, the algorithms in Oracle 11g Release 2 are optimized so that the highest performing source of data is given preference, rather than simply assuming that Cache Fusion delivers the highest performance in all cases.
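As a rough initial check, you can observe the round-trip latency of the private interconnect with a simple ping between the private interfaces of two nodes. This is a sketch only: the hostname london2-priv is an assumed name for the second node's private interface, the output shown is illustrative, and ICMP round-trip times only loosely approximate Cache Fusion block transfer times:

[root@london1 ˜]# ping -c 3 london2-priv
PING london2-priv (192.168.1.2) 56(84) bytes of data.
64 bytes from london2-priv (192.168.1.2): icmp_seq=1 ttl=64 time=0.211 ms
64 bytes from london2-priv (192.168.1.2): icmp_seq=2 ttl=64 time=0.195 ms
64 bytes from london2-priv (192.168.1.2): icmp_seq=3 ttl=64 time=0.204 ms

--- london2-priv ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 0.195/0.203/0.211/0.012 ms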

Similarly, you may consider supported interconnect solutions with lower latencies such as InfiniBand, which you'll learn more about later in this chapter. In this case, Cache Fusion transfers may potentially reduce measurements from milliseconds to microseconds; however, even a single microsecond latency is the equivalent of 2000 CPU clock cycles for the example 2GHz CPU, and it therefore represents a penalty in performance for accessing a remote SGA compared to local memory.

x86-64

The 64-bit extension of x86 is called x86-64 but can also be referred to as x64, EM64T, Intel 64, and AMD64; however, the minor differences in implementation are inconsequential, and all of these names can be used interchangeably.

Two fundamental differences exist between 32-bit x86 and 64-bit x86-64 computing. The most significant is in the area of memory addressability. In theory, a 32-bit system can address memory up to the value of 2 to the power of 32, enabling a maximum of 4GB of addressable memory. A 64-bit system can address up to the value of 2 to the power of 64, enabling a maximum of 16 exabytes, or 16 billion GB, of addressable memory, which is vastly greater than the amount that could be physically installed into any RAC cluster available today. It is important to note, however, that the practical implementations of the different architectures do not align with the theoretical limits. For example, a standard x86 system actually has 36-bit physical memory addressability behind the 32-bit virtual memory addressability. This 36-bit physical implementation gives a potential to use 64GB of memory with a feature called Physical Address Extension (PAE) to translate the 32-bit virtual addresses to 36-bit physical addresses. Similarly, x86-64 processors typically implement 40-bit or 44-bit physical addressing; this means a single x86-64 system can be configured with a maximum of 1 terabyte or 16 terabytes of memory, respectively. You'll learn more about the practical considerations of the impact of the different physical and virtual memory implementations later in this chapter.
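You can verify the address widths implemented by a particular processor from /proc/cpuinfo: the address sizes entry reports the physical and virtual addressing, and the pae and lm (long mode) flags indicate support for PAE and 64-bit operation, respectively. For example, on the Xeon-based system shown later in this chapter:

[root@london1 ˜]# grep -m1 "address sizes" /proc/cpuinfo
address sizes   : 40 bits physical, 48 bits virtual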

In addition to memory addressability, a further benefit of x86-64 lies in the processor registers themselves. With 64-bit registers, the processor can manipulate high-precision data more quickly by processing more bits in each operation.

For general-purpose applications, x86-64 processors can operate in three different modes: 32-bit mode, compatibility mode, or 64-bit mode. The mode is selected at boot time and cannot be changed without restarting the system with a different operating system. However, it is possible to run multiple operating systems under the different modes simultaneously within a virtualized environment (see Chapter 5). In 32-bit mode, the processor operates in exactly the same way as standard x86, utilizing the standard eight general-purpose registers. In compatibility mode, a 64-bit operating system is installed, but 32-bit x86 applications can run on the 64-bit operating system. Compatibility mode has the advantage of affording the full 4GB of addressability to each 32-bit application. Finally, the processor can operate in 64-bit mode, realizing the full range of its potential for 64-bit applications.

This compatibility is indispensable when running a large number of 32-bit applications developed on the widely available x86 platform while also mixing in a smaller number of 64-bit applications; however, Oracle's published certification information indicates that 32-bit Oracle is not supported on a 64-bit version of Linux on the x86-64 architecture. That said, the architecture may be used for 32-bit Linux with 32-bit Oracle or for 64-bit Linux with 64-bit Oracle. The different versions cannot be mixed, though, and compatibility mode may not be used. To take advantage of 64-bit capabilities, full 64-bit mode must be used with a 64-bit Linux operating system, associated device drivers, and 64-bit Oracle—all must be certified specifically for the x86-64 platform.
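To confirm that an environment complies with these support requirements, you can check both the running kernel architecture and the Oracle executable itself. The following is a sketch: uname -m reports an x86-64 kernel, and the file command reports whether the Oracle binary is 64-bit (the ORACLE_HOME path shown is an example, and the output is abridged):

[root@london1 ˜]# uname -m
x86_64
[root@london1 ˜]# file /u01/app/oracle/product/11.2.0/dbhome_1/bin/oracle
/u01/app/oracle/product/11.2.0/dbhome_1/bin/oracle: ELF 64-bit LSB executable, x86-64 ...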

The single most important factor for adopting 64-bit computing for Oracle is the potential for memory addressability beyond the capabilities of a 32-bit x86 system for the Oracle SGA. The number of users in itself does not directly determine whether a system should be 32- or 64-bit. However, a significantly large number of users also depends on the memory handling of the underlying Linux operating system for all of the individual processes, and managing this process address space also benefits from 64-bit memory addressability. For this reason, we recommend standardizing on an x86-64 processor architecture for Oracle RAC installations. In addition, memory developments such as NUMA features are only supported in the 64-bit Linux kernel; therefore, the advantages of 64-bit computing are significantly enhanced on a server with a NUMA architecture, as discussed later in this chapter.

Multicore Processors and Hyper-Threading

Recall for a moment the earlier discussion of Moore's Law in this chapter. Whereas improvements in manufacturing process have produced successive generations of CPUs with increasing performance, the challenges—especially those related to heat and power consumption—have resulted in the divergence of processor design from a focus on ever-advancing clock speeds. One of the most significant of these developments has been the trend towards multicore processors (see Figure 4-1). A multicore processor is one that contains two or more independent execution cores in the same physical processor package or socket.


Figure 4.1. A multicore processor

For example, a quad-core processor can execute four processes completely independently and simultaneously without contending for CPU resources, such as registers. In other words, the design aims to achieve higher performance and greater power efficiency in the same profile platform, as opposed to utilizing the same manufacturing technology to produce a single-core processor with a higher clock speed. The trend toward multicore processors is but one example of a reduced emphasis on attempting to increase processor performance solely by increasing clock frequency; instead, the multicore approach attempts to achieve greater performance through increased parallelism, leveraging a greater number of shorter pipelines than earlier processor architectures.

It is important to note that, in an Oracle environment, the workload of a typical Oracle user session will be scheduled and executed on one physical core only. Therefore, the best way to achieve better performance is to improve throughput and scalability by making resources available to process the workload of multiple sessions, the Oracle Database, and operating system processes concurrently. This approach improves performance more than processing these tasks more quickly, but in a serial fashion. The notable exception to this occurs with Oracle Parallel Execution, where Parallel Query can take one complex query (almost always one requiring full table scans) and break it into tasks completed by a number of Parallel Execution Servers on multiple cores across all of the CPUs in a RAC environment (as noted previously, we discuss Parallel Execution in more depth in Chapter 14).

From a scalability perspective, it is important to note that not all of the processor resources are entirely independent within a multicore processor. Advantages may be gained by sharing some resources between the cores, such as the CPU LLC (Last Level Cache). Typically, a distinction is drawn between the processor cores themselves, which constitute the core block, and the shared resources, which are identified as the uncore block.

At the time of writing, processors to run Linux with one, two, four, six, and eight cores are available. From a design perspective, multicore processors complement RAC architectures by enabling a greater level of processing power in increasingly lower profile platforms with fewer CPU sockets. When architecting a grid solution based on RAC, multicore processors present more options to increase the parallelism available within an individual server, as well as by adding additional servers across the cluster. This approach presents more design options for a finer level of granularity within the nodes themselves and across the entire cluster.
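The physical id and cpu cores fields of /proc/cpuinfo, shown later in this chapter, can be combined to count the sockets and cores per socket in a given server; for example, on a two-socket quad-core system:

[root@london1 ˜]# grep "physical id" /proc/cpuinfo | sort -u | wc -l
2
[root@london1 ˜]# grep "cpu cores" /proc/cpuinfo | sort -u
cpu cores       : 4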

Simultaneous Multi-Threading (SMT), also known as Hyper-Threading (HT), is a feature distinct from multicore processing that appears on some processors. HT makes a single execution core appear to the operating system as two processors. For example, a quad-core CPU with HT will appear as eight processors. This appearance is a logical representation, and the number of physical processors remains the same. HT enables more efficient usage of the execution units of a single processor core by scheduling two threads onto the same core at the same time. This usage model means that, in a multiprocess environment, the processor is more likely to process two simultaneous threads more rapidly than it would process the same two threads consecutively, resulting in higher performance and throughput. This approach is not to be confused with Switch-on-Event Multi-Threading (SoEMT), which is employed on alternative architectures. In that approach, only one thread executes on the core at any single time; however, upon encountering an event where that thread would stall, such as a read from memory, another scheduled thread is switched in to run on the processor.

HT can be enabled and disabled at the BIOS level of the server. Where this feature is available, it typically proves beneficial to Oracle workloads. We recommend that this feature be enabled by default, unless testing in a particular environment shows it to have a detrimental impact upon performance. The main benefit of HT for Oracle is that throughput will improve by scheduling more processes to run, thus ensuring that the processors are utilized while other processes are waiting to fetch data from memory.

The number of physical and logical CPUs presented to the operating system can be viewed in /proc/cpuinfo. The following extract shows some of the information for the first CPU, processor 0, on a system:

[root@london1 ˜]# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 26
model name      : Intel(R) Xeon(R) CPU           X5570  @ 2.93GHz
stepping        : 5
cpu MHz         : 2933.570
cache size      : 8192 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
apicid          : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall
nx rdtscp lm constant_tsc ida pni monitor ds_cpl vmx est tm2 cx16 xtpr
popcnt lahf_lm
bogomips        : 5871.08
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

To correctly map the processors viewed at the Linux operating system level to the physical sockets, cores, and threads, you can use CPU topology software available from the processor manufacturers. The example output in Table 4-2 is generated on a two-socket Quad-Core Xeon processor-based system with HT enabled.

Table 4.2. Processor Thread Mapping

Socket 0
  OScpu#:  |   0     8   |   1     9   |   2    10   |   3    11   |
  Core:    | c0_t0 c0_t1 | c1_t0 c1_t1 | c2_t0 c2_t1 | c3_t0 c3_t1 |

Socket 1
  OScpu#:  |   4    12   |   5    13   |   6    14   |   7    15   |
  Core:    | c0_t0 c0_t1 | c1_t0 c1_t1 | c2_t0 c2_t1 | c3_t0 c3_t1 |

The output illustrates that, across the 16 logical processors, the operating system maps the Hyper-Threads of the four cores in the first socket to CPUs 0 to 3 and 8 to 11. The four cores in the second socket are mapped to CPUs 4 to 7 and 12 to 15, respectively. As we shall see later in this chapter when we discuss memory performance, understanding the correct mapping on your system may help you create a more efficient scheduling of the workload.
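If vendor tools are not available, the same topology is exposed through sysfs on 2.6 kernels. For example, the thread_siblings_list file identifies which logical CPUs are Hyper-Threads of the same core; the values shown below correspond to the mapping in Table 4-2, where CPUs 0 and 8 share the first core of the first socket:

[root@london1 ˜]# cat /sys/devices/system/cpu/cpu0/topology/physical_package_id
0
[root@london1 ˜]# cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
0,8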

CPU Cache

If the memory, Cache Fusion, or disk access speeds were the maximum at which the CPU could fetch the data and instructions it requires, the CPU would spend many of its clock cycles stalled, waiting for them to be retrieved. Therefore, the crucial components of the processor for Oracle workloads are the faster units of memory on the processor itself (i.e., the CPU cache).

The CPU cache stores the most commonly accessed areas of main memory in terms of data, executable code, and the page table entries in the translation lookaside buffer (TLB). For example, the RDBMS kernel executable size for Oracle 11g on Linux x86-64 is more than 200MB. This means you will gain immediate benefits from a large cache size that stores as much of this executable as possible, while also providing more rapid access to data.
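You can check the size of the RDBMS kernel executable on your own installation; this example assumes the ORACLE_HOME environment variable is set, and the value shown is illustrative:

[oracle@london1 ˜]$ du -m $ORACLE_HOME/bin/oracle
207     /u01/app/oracle/product/11.2.0/dbhome_1/bin/oracle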

Cache is usually (although not always) implemented in a hierarchy with the different levels feeding each other. Typically, Level 1 (L1) cache can be accessed in anything from 1 to 5 clock cycles, or up to 2.5 nanoseconds at 2GHz. L2 cache can be accessed in 5 to 10 clock cycles, or up to 5 nanoseconds at 2GHz. L3 cache, or the LLC in a shared cache processor architecture, can usually be accessed in anything from 10 to 40 clock cycles depending on implementation, which is equivalent to 20 nanoseconds at 2GHz. Therefore, data stored in cache can be accessed at least three times faster than data stored in main memory, providing a considerable performance improvement. Also, as with registers, the more cache there is within the CPU, the more likely it is that it will perform efficiently by having data available to process within its clock cycles. The data in cache is ultimately populated with the data from main memory. When a requested byte is copied from memory with its adjacent bytes as a memory block, it is stored in the cache as a cache line. A cache hit occurs when this data is requested again, and instead of going to memory, the request is satisfied from the cache. The need in this hierarchical system to store the most-requested data close to the processor illustrates one of the reasons why maintaining an optimal Oracle indexing strategy is essential: unnecessary full-table scans on large tables within the database are guaranteed to challenge even the most advanced caching implementations.
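The levels and sizes of the cache hierarchy on a particular system can be read directly from sysfs; index numbers and values vary by processor, and the output below is representative of the quad-core Xeon system shown earlier, with separate 32KB L1 instruction and data caches, a 256KB L2 per core, and a shared 8MB L3:

[root@london1 ˜]# cat /sys/devices/system/cpu/cpu0/cache/index*/size
32K
32K
256K
8192K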

Of course, this view of caching is an oversimplification for the implementation of a multicore or multiprocessor system (and uniprocessor systems using direct memory access [DMA] for I/O) because every read from a memory address must always provide the most up-to-date memory from that address, without fail. When multiple processors all share the same memory, they may all have their own copies of a particular memory block held in cache, and if these are updated, some mechanism must be employed to guarantee consistency between main memory and the cache lines held on each and every single processor. The more processors you have, the greater the workload required for ensuring consistency. This process for ensuring consistency is termed cache coherency. Cache coherency is one of the reasons why features such as hash clusters, where Oracle stores the rows for multiple tables in the same data blocks, or Index-Organized Tables (IOTs), where the data is stored in the same blocks as the index, can bring significant performance benefits to transactional systems.

At the most basic level, CPU caches operate in one of two modes: write-through or write-back. In write-through mode, when the processor modifies a line in the cache, it also writes that data immediately to the main memory. For performance, the processor usually maintains a valid bit to determine whether the cache line is valid at any one time. This valid bit is the simplest method to maintain coherency. In write-back mode, the cache does not necessarily write the value back to another level of the cache hierarchy or memory immediately. This delay minimizes the performance impact, but it also requires a more complex protocol to ensure that all values are consistent across all processors. When processors employ different levels of cache, they do not necessarily employ the same mode or protocol for each level of cache.

All of the architectures of interest implement what is termed a snooping protocol, where the processor monitors the system bus for signals associated with cache reads and writes. If an individual processor observes activity relating to a cache line that it has loaded and that is currently in a valid state, then some form of action must take place to ensure coherency. If a particular cache line is loaded on another processor, but not modified (e.g., if multiple users select the same rows from the Oracle database), the cache line will be valid, but some action must be taken to ensure that each processor is aware that it is not the only one with that particular piece of data. In addition, if a cache line is in a modified state on another processor, then one of two things must happen: either that particular line must be written back to main memory to be reread by the requesting processor, or it must be transferred directly between caches. For RAC, the performance potential for cache-to-cache transfers is an important reason why you're typically better off with fewer cluster nodes with more processors, as opposed to more nodes with fewer processors.

The most basic form of the snooping protocol to implement cache coherency is the Modified Shared Invalid (MSI) protocol. This protocol is applied to each and every line loaded in cache on all of the processors in the system; it is also applied to the corresponding data located in main memory. A cache line can be in a modified state on one processor and one processor only. When it is in this state, the same data in cache on another processor and main memory must always be in an invalid state. The modified state must be maintained until the main memory is updated (or the modified state is transferred to another processor) to reflect the change and make it available to the other processors. In the shared state, one or more caches have the same cache line, all of which are in the valid state, along with the same data in memory. While a cache line is in a shared state, no other cache can have the data in a modified state. If a processor wishes to modify the cache line, then it must change the state to modified and render all of the corresponding cache lines as invalid. When a cache line is in a shared state, the processor is not required to notify the other processors when it is replaced by other data. The shared state guarantees that it has not been modified. The invalid state determines that the cache line cannot be used, and it does not provide any information about the state of the corresponding data in main memory.

Table 4-3 illustrates the permitted states available for an example cache line within two caches. Note that more than a passing similarity exists between the way that cache coherency is maintained at the processor level and the way that Cache Fusion operates between the Oracle RAC nodes. This comparison illustrates how, in a RAC environment, you are in fact operating in an environment that implements multiple levels of coherency at both the individual system and cluster levels.

Table 4.3. MSI Cache States

            INVALID     SHARED          MODIFIED
INVALID     Invalid     Shared          Modified
SHARED      Shared      Shared          Not Permitted
MODIFIED    Modified    Not Permitted   Not Permitted

Although theoretically simple to implement, the MSI protocol would require all state changes to be atomic actions and, therefore, would prove impractical to implement on a real system running software such as Oracle. Additionally, if a point-to-point interconnect is used between processors, as opposed to a shared system bus, there is no longer a single point at which to resolve cache coherency. Instead, additional exclusive, owner, and forwarding states are added for the architectures of interest to realize the Modified Exclusive Shared Invalid (MESI) protocol, the Modified Exclusive Shared Invalid Forwarding (MESIF) protocol, and the Modified Owned Exclusive Shared Invalid (MOESI) protocol. The exclusive state is similar to the shared state, except that the cache line is guaranteed to be present in one cache only. This guarantee enables the processor to change the state of the cache line to modified if required (for example, as the result of an Oracle UPDATE statement) without having to notify the other processors across the system bus. The addition of the exclusive state is illustrated in Table 4-4.

Table 4.4. MESI Cache States

             INVALID     SHARED          EXCLUSIVE       MODIFIED
INVALID      Invalid     Shared          Exclusive       Modified
SHARED       Shared      Shared          Not Permitted   Not Permitted
EXCLUSIVE    Exclusive   Not Permitted   Not Permitted   Not Permitted
MODIFIED     Modified    Not Permitted   Not Permitted   Not Permitted

The owner state signifies that the processor holding a particular cache line is responsible for responding to all requests for that particular cache line, with the data in main memory or other processors being invalid. This state is similar to the modified state; however, the additional state is a necessity when an architecture implements the concept of individual CPUs being privately responsible for a particular section of main memory.

Later in this chapter, we will discuss implementations of memory architectures, including the attributes of cache coherent non-uniform memory architectures (ccNUMA), which are becoming the standard implementation for x86 platforms. A MESI protocol implemented in a ccNUMA architecture would result in a higher level of redundant messaging traffic being sent between the processors, which would impact latency and system performance. For this reason, the MESI protocol has been adapted to both the MESIF and MOESI protocols for efficiency. For example, in the MESIF protocol, an additional Forwarding state is added, and the Shared state is modified. In MESIF, only one cache line may be in the Forwarding state at any one time. Additional cache lines may hold the same data; however, these will be in the Shared state, and the response to any snoop request is satisfied by the cache line in the Forwarding state only. Because the cache lines in a Shared state do not respond to read requests, cache coherency messaging is significantly reduced when the same data is held in multiple cache lines. Once the cache line in the Forwarding state is copied, the new copy is then designated as the sole cache line in the Forwarding state, and the previous copy reverts back to a Shared status. This ensures that the single cache line in the Forwarding state is unlikely to be aged out of an individual cache by other requests. If a particular cache line has multiple requests, the workload and bandwidth to satisfy these requests are evenly distributed across all of the processors in the system.

CPU Power Management

Power management is of increasing importance in a data center environment. As processors and other server components reach greater clock frequencies, they present two challenges: using increased amounts of power and generating increased levels of heat. First, power always comes at a cost, and the levels of power that can be supported in any particular data center environment will always have a finite limit. Second, all of the heat generated must be dissipated to keep all of the computing components within recommended operating temperatures. For these reasons, CPU power utilization is an important component of achieving power management goals. In a RAC environment, however, it is important to balance the power demands of the processor against the power requirements of the entire server. For example, as will be discussed later in this chapter, different implementations of memory and storage technology will also have an impact on the power consumption of a server. You should consider power as a criterion that you measure across an entire cluster, as opposed to focusing strictly on the power consumption of the individual components. Entire server power utilization can then be used as an additional factor in determining the appropriate size and number of nodes in the cluster.

For power management, a standard interface, the Advanced Configuration and Power Interface (ACPI), is implemented across all server architectures to give the operating system a degree of control over the power utilization of the system components. For the CPU itself, a number of technologies are employed to manage power utilization. For example, Intel processors have four power management states, the most important of which to an Oracle installation are the P-states that govern power management when the CPUs are operational and C-states when the CPUs are idle. P-states are implemented by dynamic CPU voltage and frequency scaling that steps processor voltage and frequency up and down in increments. This scaling up and down occurs in response to the demands for processing power. The processing demands are determined by the operating system. Hence, it is Linux that requests the P-states at which the processors operate. If this feature is available on a platform, it can be determined within the CPU information of the basic input/output system (BIOS) with an entry such as P-STATE Coordination Management. The output of the dmesg command should display information regarding the CPU frequency under an ACPI heading, such as the following:

ACPI: CPU0 (power states: C1[C1] C2[C3])
ACPI: Processor [CPU0] (supports 8 throttling states)

Once enabled, the frequency can be monitored and set by the cpuspeed daemon and controlled by the corresponding cpuspeed command based on temperature, external power supplies, or CPU idle thresholds. The processor frequency can also be controlled manually by sending signals to the cpuspeed daemon with the kill command. The current active frequency setting for a particular processor can be viewed from the cpu MHz entry of /proc/cpuinfo. For example, at the highest level of granularity, the cpuspeed daemon can be disabled as follows:

[root@london1 ˜]# service cpuspeed stop
Disabling ondemand cpu frequency scaling:                  [  OK  ]

When the daemon is disabled, the processors run at the level of maximum performance and power consumption:

[root@london1 ˜]# cat /proc/cpuinfo | grep -i MHz
cpu MHz         : 2927.000
cpu MHz         : 2927.000

Conversely, the daemon can be enabled as follows:

[root@london1 ˜]# service cpuspeed start
Enabling ondemand cpu frequency scaling:                   [  OK  ]

By default, the processors run at a reduced frequency; hence, they have reduced performance and power consumption:

[root@london1 ˜]# cat /proc/cpuinfo | grep -i MHz
cpu MHz         : 1596.000
cpu MHz         : 1596.000

From this initial state, the processors are able to respond dynamically to demands for performance by increasing frequency and voltage at times of peak utilization and lowering them when demand is lower. It is important to note that individual cores in a multicore processor can operate at different frequencies depending on utilization of the cores in question. You can see this at work in the following example:

[root@london1 ˜]# cat /proc/cpuinfo | grep -i MHz
cpu MHz         : 1596.000
cpu MHz         : 2927.000

In an HT environment where two threads run on a core simultaneously, the core will run at a frequency determined by the demands of the highest performing thread.
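Where frequency scaling is enabled, the limits and current frequency of each core can also be read directly from the cpufreq interface in sysfs; the file names below are those of the standard 2.6 kernel cpufreq interface, and the values shown correspond to the example system above:

[root@london1 ˜]# cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq
1596000
[root@london1 ˜]# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
1596000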

When a CPU is idle, it can be instructed to operate in a reduced power state. It is possible to halt the clock signal and reduce power or shut down units within the CPU. These idle power states are called C-states. Like P-states, C-states are implemented in a number of steps. A deeper state conserves more energy; however, it also requires additional time for the CPU to re-enter a fully operational state. C-states can also be observed under the ACPI:

[root@london1 processor]# cat /proc/acpi/processor/CPU0/power
active state:            C2
max_cstate:              C8
bus master activity:     00000000
states:
    C1:                  type[C1] promotion[C2] demotion[--] latency[000]
 usage[00011170] duration[00000000000000000000]
   *C2:                  type[C3] promotion[--] demotion[C1] latency[245]
 usage[00774434] duration[00000000001641426228]

It is important to reiterate that, in a RAC environment, power management should be measured holistically across the entire cluster and not on an individual component level, and CPU power management should be handled the same way. Performance demands and power utilization are governed at the operating system level on an individual node; hence, awareness is not extended to the level of the Oracle Database software. For this reason, there is the potential that power saving operations on one node in the cluster may negatively impact the performance of another, and you should therefore monitor the use of power management techniques against desired performance. For example, if levels of utilization are low on some nodes, you should consider at a design level whether reducing the number of nodes in the cluster or using virtualization can increase the utilization of individual CPUs to the level desired without requiring CPU power management to be enabled. If so, you will be able to reduce the number of individual nodes, thereby saving on the overall power demands of the cluster.

Virtualization

All of the latest CPU architectures that support Oracle RAC on Linux implement additional features to support full virtualization of the operating system. This fact enables multiple instances of the operating system and nodes within a RAC cluster to be hosted on a single physical server environment. You can find in-depth details on these additional processor features in the context of virtualization with Oracle VM in Chapter 5.

Memory

As we have progressed through the server architecture, you should clearly see that, in an Oracle RAC context, one of the most important components on each node is the RAM. Memory is where your SGA resides, and it's also the place where Cache Fusion of data blocks between the instances in the cluster takes place. Understanding how this memory is realized in hardware is essential to understanding the potential of cluster performance.

Virtual Memory

When a process is initially created, the Linux kernel creates a set of page tables as virtual addresses that do not necessarily bear any relation to the physical memory addresses. Linux maintains the directory of page table entries for the process directly in physical memory to map the virtual memory addresses to the physical ones. For example, this translation between virtual and physical memory addresses means that, on a 32-bit system, each Linux process has its own 4GB address space, rather than the entire operating system being limited to 4GB. Similarly, on an x86-64 system with 48-bit virtual addressing, the limit is a considerably greater 256TB.
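The virtual address space allocated to an individual process can be inspected through its status file in /proc. A minimal illustration uses the shell's own process ID; VmSize reports the total virtual address space mapped, and VmRSS the physical memory actually resident (values will differ on your system):

[root@london1 ˜]# grep -E "VmSize|VmRSS" /proc/$$/status
VmSize:     4712 kB
VmRSS:      1484 kB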

This directory of page table entries has a number of levels. When a virtual memory access is made, translating the virtual address can result in a number of physical memory accesses to eventually reach the actual page of memory required. To reduce this impact on performance, within the Memory Management Unit (MMU) located on the CPU, a table exists with its own private memory: the TLB. Every request for data goes to the MMU, where the TLB maps the virtual memory addresses to the physical memory addresses based on the tables set up in the TLB. These tables are populated by the kernel according to the most recent memory locations accessed. If the page table entries are not located in the TLB, then the information must still be fetched from the page tables in main memory. Therefore, it's advantageous to ensure that the highest number of memory references possible can be satisfied from the TLB.

The TLB capacity is usually small, and the standard page size on an x86 and x86-64 Linux system is 4KB. Thus, the large amount of memory required by Oracle means that most accesses will not be satisfied from the TLB, resulting in lower-than-optimal performance. Oracle uses a large amount of contiguous memory, so the references to this memory can be more efficiently managed by mapping a smaller number of larger pages. For this reason, on Linux systems implementing the 2.6 kernel, Oracle 11g can take advantage of a huge TLB pool. You will learn how to configure these huge pages, which we strongly recommend using, in Chapter 6. When correctly implemented, huge pages increase the likelihood that an Oracle memory access will be satisfied from the TLB. This stands in contrast to traversing a number of physical memory locations to discover the desired memory address. This approach also saves CPU cycles that would otherwise be spent managing a large number of small pages, and it saves on the physical memory required to provide the address mappings in the first place. Additionally, huge pages are pinned in memory and not selected as candidates to be swapped to disk under conditions of memory contention.
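Once huge pages have been configured, their availability and consumption can be confirmed from /proc/meminfo; on x86-64 the default huge page size is 2MB, and the values below are illustrative of a system with huge pages allocated for the SGA:

[root@london1 ˜]# grep Huge /proc/meminfo
HugePages_Total:  2048
HugePages_Free:    412
HugePages_Rsvd:    380
Hugepagesize:     2048 kB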

Understanding the TLB and memory addressability can assist you in understanding why sizing the SGA too large for requirements can be detrimental, especially in a RAC environment. Sizing the SGA too large means that you could have a large number of address mappings that you do not need, increasing the likelihood that accessing the memory location that you do require will take longer by requiring that you traverse the ones you do not. For RAC, an oversized SGA has the additional impact of requiring an increased number of blocks to be mastered unnecessarily across the entire cluster. The SGA on each node in the cluster should be sized optimally, ensuring that it is not too small, but also not too large.

Physical Memory

Many different physical types of memory can be installed in computer systems, and they can vary significantly in terms of performance and capacity. Physical memory consists of dual in-line memory modules (DIMMs) on which the RAM chips are located.

Most non-enterprise-based computer systems use single data rate (SDR) SDRAM; however, enterprise class-based systems normally use double data rate RAM (DDR), which has been generally available since 2002. This technology has advanced through DDR2, available since 2004; and DDR3, which has been available since 2007.

Memory speed is measured in terms of memory clock performance. In fact, memory clock performance governs the speed of the memory I/O buffers and the rate at which data is pre-fetched, so the clock performance does not necessarily correspond directly with the speed of the memory itself, called the core frequency. DDR technology is extremely similar to SDRAM; however, unlike SDRAM, DDR reads data on both the rising and falling edges of the memory clock signal, so it can transfer data at twice the rate.

When reviewing a hardware specification for memory, the definition will resemble the following:

RAM Type - PC3-10600 DDR3-1333

This definition gives you both the bandwidth and the clock frequency of the memory specified. Most memory buses are 64 bits (8 bytes) wide, so you can multiply the clock frequency by 8 bytes to determine the bandwidth given in the RAM Type definition. Therefore, if you have a bus speed of 1333MHz (which is, in fact, twice the clock frequency of 667MHz, and can also be expressed as megatransfers per second, or MT/s), you can calculate why the following example is named PC3-10600, with the PC3 prefix signifying the memory as DDR3:

2 × 667MHz × 8 bytes (64 bits) = 10667MB/s (or 10.6GB/s)

Similarly, the following memory type is named PC3-12800:

2 × 800MHz × 8 bytes (64 bits) = 12800MB/s (or 12.8GB/s)

Whereas DDR transfers data at twice the core frequency, with the DDR3 examples shown the total data rate is a factor of eight times the memory core frequency: 166MHz and 200MHz for these examples, respectively. DDR3 is able to operate at double the data rate of DDR2 and at four times the core memory frequency. Table 4-5 summarizes the bandwidths available for some common memory types based on DDR, DDR2, and DDR3.

Table 4.5. Common Memory Bandwidths

Bandwidth   Core Frequency   Clock Frequency   Name        Memory Type
1.6GB/s     100MHz           100MHz            PC1600      DDR200
2.1GB/s     133MHz           133MHz            PC2100      DDR266
2.7GB/s     166MHz           166MHz            PC2700      DDR333
3.2GB/s     200MHz           200MHz            PC3200      DDR400
3.2GB/s     100MHz           200MHz            PC2-3200    DDR2-400
4.3GB/s     133MHz           266MHz            PC2-4300    DDR2-533
5.3GB/s     166MHz           333MHz            PC2-5300    DDR2-667
6.4GB/s     200MHz           400MHz            PC2-6400    DDR2-800
6.4GB/s     100MHz           400MHz            PC3-6400    DDR3-800
8.5GB/s     133MHz           533MHz            PC3-8500    DDR3-1066
10.6GB/s    166MHz           667MHz            PC3-10600   DDR3-1333
12.8GB/s    200MHz           800MHz            PC3-12800   DDR3-1600

From the table, you can see that, at the clock frequency of 200MHz, the throughput of 3.2GB/s is the same for both DDR and DDR2. Similarly, at a clock frequency of 400MHz, the throughput of DDR2 and DDR3 is also the same at 6.4GB/s. However, in both cases, the core frequency of the more advanced memory is lower, offering more scope to increase frequencies and bandwidth beyond the limits of the previous generation. The lower core frequency also means that the power consumption is lower, with voltages for DDR3 at 1.5V, DDR2 at 1.8V, and DDR at 2.5V. However, the trade-off is that, with a lower memory core frequency, latency times may be longer for the time taken to set up any individual data transfer.

In choosing memory for a system, you will not achieve the best possible result simply by selecting the highest level of throughput possible. A number of selection criteria must be considered in terms of both the processor and the memory to optimize a configuration for either the highest levels of bandwidth or the largest amounts of capacity. In addition to configuring the correct memory capacity and bandwidth, you also need to transfer the data to the CPU itself. Typically, the memory controller is integrated in the uncore block of a multicore processor, and the uncore supports a different frequency from the processing cores themselves. The uncore frequency will be available with the processor specification from the manufacturer, and it is required to be double that of the memory frequency. For example, DDR3-1333 memory requires an uncore frequency of 2.66GHz. Therefore, it is essential to ensure that the processor itself supports the memory configuration desired.

Do not confuse the role of the memory controller with the MMU discussed earlier in the context of virtual memory. In earlier generations of the architectures we are discussing, the memory controller was located on the Front Side Bus (FSB), between the CPU and main memory, as part of the Memory Controller Hub (MCH), also known as the northbridge of the server chipset. On more recent architectures, it is integrated on the CPU itself (as noted previously). The memory controller provides a translation function similar to that of the MMU, but one of its roles is to map physical addresses to the real memory addresses of the associated memory modules. In addition to the translation role, the memory controller counts read and write references to the real memory pages, averages the access gap for each memory bank, and manages the power states of each individual memory module. The memory controller also provides a degree of error checking and some memory reliability features.

Regardless of the location of the memory controller, the memory configuration supported will be determined by both the memory specification and the number of channels supported by the controller. For the architectures of interest, an integrated memory controller in which there is one per processor socket may support two, three, or more memory channels. For example, Figure 4-2 illustrates a dual multicore processor system with three channels per processor and three DIMMs per channel.

Figure 4.2. A dual, multicore processor system

Given DDR3-1333 memory with the specifications referenced in Table 4-5, and given a processor with sufficient uncore frequency, each processor can support a maximum memory bandwidth of 32GB/s; this is equivalent to the 10.6GB/s supported by each of the three channels, and thus 64GB/s in total for the two-processor configuration. With DDR3-1066, the maximum bandwidth per processor would be 25.5GB/s, equivalent to the 8.5GB/s supported by each channel. Typically, however, memory bus speeds are reduced as additional DIMMs are added to a memory channel, regardless of the speeds supported by the DIMMs themselves; therefore, a fully populated system will deliver lower levels of bandwidth than a sparsely configured system. In this example, assuming DDR3 DIMMs are available in 2GB, 4GB, or 8GB configurations, an optimal maximum-bandwidth configuration would populate one 8GB DDR3-1333 DIMM per channel to deliver 48GB of memory across the system operating at 32GB/s per processor. Alternatively, an optimal maximum-capacity configuration would populate three 8GB DDR3-800 DIMMs per channel to deliver 144GB of memory across the system, but this configuration would operate at a lower bandwidth of 19.2GB/s per processor. All memory channels will operate at the lowest frequency supported by any one channel; therefore, it is good practice to ensure that all memory populated within a system, and ideally across the entire cluster, is of the same type.

To verify the memory configuration of a system, the dmidecode command reports on the hardware as presented by the system BIOS. This report includes details about the system motherboard, processors, and the number and type of DIMMs. The output from dmidecode is extensive, so it is good practice to direct the output to a file. For example, the following extract shows the reported output for an occupied DIMM slot; it shows a DDR3-800 1GB DIMM and its location in the system:

Handle 0x002B, DMI type 17, 27 bytes
Memory Device
    Array Handle: 0x0029
    Error Information Handle: Not Provided
    Total Width: 64 bits
    Data Width: 64 bits
    Size: 1024 MB
    Form Factor: DIMM
    Set: 1
    Locator: A1_DIMM0
    Bank Locator: A1_Node0_Channel0_Dimm0
    Type: DDR3
    Type Detail: Synchronous
    Speed: 800 MHz (1.2 ns)
    Manufacturer: A1_Manufacturer0
    Serial Number: A1_SerNum0
    Asset Tag: A1_AssetTagNum0
    Part Number: A1_PartNum0

In this example, which corresponds to the system and memory configuration described previously, 18 DIMM slots populated with this type of memory verify that the system is configured to provide memory bandwidth of 19.2GB/s per processor, or 38.4GB/s across the entire system. As such, dmidecode output serves as a useful reference for assessing the capabilities of a system.
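Rather than scanning the full report by eye, you can restrict dmidecode to memory devices (DMI type 17) and count the populated slots. The following is a minimal sketch; the file name is illustrative, and the count of 18 assumes the configuration just described:

[root@london1 ˜]# dmidecode --type 17 > /tmp/dmidecode_memory.txt
[root@london1 ˜]# grep -c "Size: 1024 MB" /tmp/dmidecode_memory.txt
18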

Ultimately, the memory configuration limitations should be considered against the throughput and latency of the entire cluster, the private interconnect, the workload, and the level of interconnect traffic expected. We recommend a design where sufficient memory capacity is configured on each of the individual nodes to cache data on a local basis as much as possible. Although performance may be lost against the potential memory bandwidth within the system, this will certainly be more than outweighed by the gains from minimizing Cache Fusion traffic between the nodes.

NUMA

In discussing physical memory, we considered the role of the memory controller and whether it was located on the FSB or integrated on the CPU. In a multiprocessor configuration, if the memory controller is located on the FSB, it is known as a Symmetric Multi-Processing (SMP) system. In this configuration, memory access by all processors is shared equally across the same bus. Things change with an integrated memory controller, however (see Figure 4-3 for a logical representation of the type of memory architecture implemented in a four-processor configuration).

Figure 4.3. A NUMA configuration

You should be able to see how Figure 4-3 extends to four processors the physical implementation shown for two processors in Figure 4-2. For simplicity, each memory link is shown as a logical representation of a number of physical memory channels. The important aspect of this configuration is that it introduces the concept of local and remote memory, with the integrated memory controller on each CPU being responsible for a subset of the memory of the whole system. To this extent, it implements a Non-Uniform Memory Architecture (NUMA), which on all architectures of interest to Oracle on Linux is the same as a Cache Coherent Non-Uniform Memory Architecture (ccNUMA). This means that, as far as the software is concerned, local and remote memory access can be treated the same. It is also worth noting that a NUMA configuration for an Oracle on Linux system does not necessarily mandate an integrated memory controller. For example, some systems are based upon a cell configuration, where cells of four processors connected by an FSB and a number of DIMMs are joined together into a larger system with NUMA at a coarser granularity. These systems are typically implemented in eight-processor socket and above configurations, so they are less common in a RAC on Linux environment than the NUMA configurations on a single system board. For this reason, the focus in this chapter is on NUMA implemented at the CPU level with an integrated memory controller, usually in systems with two, four, or even up to eight processor sockets.

As discussed previously in this chapter, Cache Fusion introduces additional latencies to access data from a remote buffer cache when compared to the data cached in local memory. In a NUMA configuration, a similar concept can be applied; however, it is applied at the CPU level, as opposed to the system level. The communication between the processors is through dedicated point-to-point interconnects that replace the function of the shared FSB. In Intel systems, the interconnect is known as the QuickPath Interconnect (QPI); in AMD systems, it is known as HyperTransport. When a process requires access to remote memory, the communication takes place across the interconnect, and the request is serviced by the remote memory controller.

Determined by settings at the BIOS level of the system, a NUMA server can be booted in NUMA or non-NUMA mode. Typically, in NUMA mode, all of the memory attached to the individual memory controllers is presented in a contiguous manner. For example, in a two-processor configuration, the first half of the system memory will map to the memory on one controller, while the second half will map to the other. In conjunction with a NUMA-aware operating system and application, this enables software to be optimized to show preference to local memory. Consequently, it is important to reiterate, as previously noted when discussing 64-bit computing, that 32-bit Linux operating systems are not NUMA-aware; thus, they cannot take advantage of a server booted in NUMA mode. The alternative is to select a BIOS option that lets you run a NUMA server in non-NUMA mode. Typically, this means that cache-line-sized (64-byte on the server architectures in question) allocations of memory are interleaved between all of the memory controllers in the system. For a process running on a two-processor system booted in non-NUMA mode, this means that half of the memory accesses will be serviced by one memory controller, while the other half will be serviced by the other. A non-NUMA configuration is therefore likely to result in greater CPU interconnect traffic than a NUMA-optimized configuration at the hardware level. Additionally, interleaving memory generally has a greater impact as more processors are added to the configuration. This impact does not necessarily mean that performance will be lower for any given workload because, conversely, NUMA awareness also requires additional software support to implement. Typically, the benefits of NUMA awareness and optimization are realized as the number of processors increases. However, we recommend performance testing to determine the benefits for a particular environment. To determine the correct NUMA settings at the platform level, it is important to be familiar with the NUMA support available at both the Linux operating system and Oracle Database levels.

The 64-bit Linux operating systems with 2.6-based kernels that support Oracle 11g Release 2 include NUMA functionality, and NUMA support is enabled automatically when the operating system is booted on a system where NUMA has been enabled at the BIOS level. Even in this case, however, NUMA can be disabled at the operating system level with a kernel command-line option in the grub.conf file. For example, consider this line:

kernel /vmlinuz-2.6.18-128.el5 ro root=/dev/VolGroup00/LogVol00 numa=off

If NUMA support is disabled, this fact is noted by the kernel:

dmesg | grep -i numa
Command line: ro root=/dev/VolGroup00/LogVol00 numa=off
NUMA turned off

Alternatively, if NUMA support is enabled either at the BIOS or kernel level, then the status is also recorded:

[oracle@london1 ˜]$ dmesg | grep -i numa
NUMA: Using 31 for the hash shift

Application-level Linux NUMA functionality is provided in the numactl RPM package. This package includes a shared object library called libnuma, which is available for software-development-level NUMA configuration. The numactl command provides a NUMA command-line interface and the ability to configure the preferred policy for NUMA memory allocation manually. From an Oracle Database 11g perspective, utilization of the libnuma library is enabled by setting the appropriate NUMA-related init.ora parameters (we will cover this in more detail later in this section). The numactl command can also be used both to view and to set the NUMA configuration, regardless of whether NUMA has been enabled at the Oracle level. For example, the numactl command can be used to view the memory allocation, as in the following example, which shows a system with two memory controllers:

[root@london1 ˜]# numactl --hardware
available: 2 nodes (0-1)
node 0 size: 9059 MB
node 0 free: 8675 MB
node 1 size: 9090 MB
node 1 free: 8872 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

The following output shows the current configuration policy:

[root@london1 ˜]# numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
cpubind: 0 1
nodebind: 0 1
membind: 0 1

The preceding output introduces additional NUMA terminology: nodes and distance. A node refers to a unit of managed memory; in other words, for a processor with an integrated memory controller, an individual CPU. Therefore, it is important not to confuse the finer granularity of a NUMA node with the wider granularity of a RAC node. The distance is defined by the ACPI System Locality Information Table (SLIT), and it specifies the additional time required to access the memory on a remote node against a value of 10 for the local node. The default values will be governed by the architecture of the hardware concerned, but these values may be configured manually at the BIOS level. In the previous example, where you had two nodes, the matrix is simple: it shows that, for NUMA policy, the latency of accessing memory on a remote node should be taken as just over twice that of accessing local memory. Therefore, if the access time to local memory is 60ns, then remote memory will take just over 120ns. With four or more sockets, and depending on the architecture, memory access may require more than one hop to remote memory; for multiple hops, the matrix will be populated by access times that show higher latencies. The operating system NUMA information can also be read directly from the /sys/devices/system/node directory, where NUMA details for the individual nodes are given in directories named after the nodes themselves. For example, the following shows how the distance information is derived for node0:

[root@london1 node0]# more distance
10 21

There is also basic information on NUMA statistics in this directory that records page-level access:

[root@london1 node0]# more numastat
numa_hit 239824
numa_miss 0
numa_foreign 0
interleave_hit 8089
local_node 237111
other_node 2713

These statistics can also be displayed for all nodes with the numastat command:

[oracle@london1 ˜]$ numastat
                          node0           node1
numa_hit                  394822          873142
numa_miss                      0               0
numa_foreign                   0               0
interleave_hit             11826           11605
local_node                386530          854775
other_node                  8292           18367
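Because these counters are cumulative since boot, it can be informative to poll them while the instance is under load; steadily rising numa_miss or other_node values relative to numa_hit indicate that processes are being serviced by remote memory. For example, the following illustrative command refreshes the statistics every five seconds:

[oracle@london1 ˜]$ watch -n 5 numastat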

One of the most important areas of information in the node directory concerns the NUMA memory allocation. The details are the same as in /proc/meminfo; however, here the allocation is shown for the individual nodes, illustrating how the memory allocation is configured and utilized:

[root@london1 node0]$ more meminfo
Node 0 MemTotal:      9276828 kB
Node 0 MemFree:       1465256 kB
Node 0 MemUsed:       7811572 kB
Node 0 Active:         211400 kB
Node 0 Inactive:        77752 kB
Node 0 HighTotal:           0 kB
Node 0 HighFree:            0 kB
Node 0 LowTotal:      9276828 kB
Node 0 LowFree:       1465256 kB
Node 0 Dirty:              28 kB
Node 0 Writeback:           0 kB
Node 0 FilePages:      170280 kB
Node 0 Mapped:          38620 kB
Node 0 AnonPages:      199608 kB
Node 0 PageTables:      13304 kB
Node 0 NFS_Unstable:        0 kB
Node 0 Bounce:              0 kB
Node 0 Slab:            15732 kB
Node 0 HugePages_Total:  3588
Node 0 HugePages_Free:    847

As previously noted, in a non-NUMA configuration all memory is interleaved between memory controllers in 64-byte-sized allocations. Therefore, all memory assigned, whether it's SGA or PGA, will be implicitly distributed in such a manner. With a NUMA-aware environment, however, there are more options available for configuration. By default, memory is allocated from the memory local to the processor where the code is executing. For example, when the Oracle SGA is allocated using the shmget() system call for automatic shared memory management or the mmap() system call for 11g automatic memory management (see Chapter 6 for more details about this), by default the memory is taken contiguously from a single or multiple nodes, according to requirements.

Let's look at how this works in Oracle version 11g Release 2. If the database has been NUMA-enabled (we will cover this in more depth later in this section), and it is able to create multiple shared segments dedicated to individual memory nodes, then contiguous memory allocation at the operating system level is the preferred behavior for optimal scalability under this scheme. However, if NUMA is not enabled, or the version of Oracle creates only a single shared memory segment, then this will potentially result in an unbalanced configuration, with the SGA running entirely within the memory of one processor. If that happens, the system will suffer from the physical memory latency and bandwidth limitations discussed earlier. For example, the following listing shows a default memory configuration after an Oracle instance has been started with the SGA almost entirely resident in the memory on node 1:

[oracle@london1 ˜]$ numactl --hardware
available: 2 nodes (0-1)
node 0 size: 9059 MB
node 0 free: 8038 MB
node 1 size: 9090 MB
node 1 free: 612 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

For general Oracle performance, but especially when the Oracle 11g Database is not configured for NUMA, we recommend the use of huge pages, as discussed in Chapter 6. When huge pages are allocated at the operating system level, the default policy is to allocate memory evenly from all of the available memory controllers until the huge page allocation is complete. For example, consider the following test command:

echo 2048 > /proc/sys/vm/nr_hugepages

This command manually sets the number of huge pages. Their allocation can be viewed in the meminfo files under the /sys/devices/system/node directory:

Node 0 HugePages_Total:  1024
Node 0 HugePages_Free:   1024
Node 1 HugePages_Total:  1024
Node 1 HugePages_Free:   1024

It is clear that this approach is similar to the interleaving of memory in a non-NUMA configuration. However, in this case it has only been done at the coarser granularity of the huge pages from which the SGA will be allocated, rather than the entire underlying system memory map. That said, the Oracle Database NUMA configuration determines whether this allocation will be used on a per memory node basis or a more general interleaved basis.

When using huge pages on a NUMA system, we recommend allocating them at boot time. If the memory on one node has no further pages to contribute, the remainder of the huge page allocation will be taken from the remaining node or nodes, resulting in an unbalanced configuration; allocating at boot time, before memory becomes fragmented, avoids this. Similarly, when freeing huge pages, the requested de-allocation will take place by freeing the maximum amount of memory from each of the nodes in order.
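For example, extending the grub.conf kernel line shown earlier with the hugepages parameter reserves the pages at boot time; the value of 2048 is illustrative and should match your SGA requirements:

kernel /vmlinuz-2.6.18-128.el5 ro root=/dev/VolGroup00/LogVol00 hugepages=2048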

You have seen that there are a number of implications for the correct set up of a NUMA system at both the BIOS and Linux operating system levels before starting Oracle. Also, the chosen settings must be considered in conjunction with the preferred NUMA policy for Oracle.

As noted previously, Oracle 11g Release 2 on Linux includes NUMA awareness and functionality; however, these are controlled and configured with unsupported parameters known as underscore parameters. At the highest level in Oracle 11g Release 2, Oracle NUMA features are determined by the parameter _enable_NUMA_support. Prior to Oracle 11g Release 2, there was also the NUMA-related parameter _enable_NUMA_optimization. The difference between the two is that, even if the latter parameter was manually set to FALSE in 11g Release 1, a degree of NUMA functionality remained within the Oracle 11g software. Therefore, at Oracle 11g Release 1, despite the default setting of the NUMA parameter, Oracle recommended disabling NUMA functionality by applying patch number 8199533, rather than by setting this parameter. This original parameter has been deprecated from Oracle 11g Release 2 onward. The default value of _enable_NUMA_support is FALSE, and it can be viewed with a statement such as the following:

SQL> select a.ksppinm "Parameter", b.ksppstvl "Session Value",
c.ksppstvl "Instance Value"
from x$ksppi a, x$ksppcv b, x$ksppsv c
where a.indx = b.indx
AND a.indx = c.indx
AND ksppinm = '_enable_NUMA_support';

_enable_NUMA_support
FALSE
FALSE

By default, the additional parameter _db_block_numa is also set to the value of 1, which shows that memory will be configured as if for a single memory node. Oracle NUMA support can be enabled on a system with hardware and Linux NUMA support by setting the value of _enable_NUMA_support to TRUE. For example, consider the result of running the following command with sysdba privilege and then restarting the database:

SQL> alter system set "_enable_NUMA_support"=TRUE scope=spfile;

System altered.

If successfully configured, the database alert log will report that a NUMA system has been found and that support has been enabled, as in the following example for a two-socket system:

NUMA system found and support enabled (2 domains - 8,8)

The alert log for a four-socket system would look like this:

NUMA system found and support enabled (4 domains - 16,16,16,16)

Additionally, the parameter _db_block_numa will have been set to the same value as the number of memory nodes or domains reported in the alert log, which will also report that _enable_NUMA_support is set to a non-default value, as in this example:

System parameters with non-default values:
  processes                = 150
  _enable_NUMA_support     = TRUE

Use the following commands to determine Oracle's success at creating a NUMA configuration:

SQL> oradebug setmypid
Statement processed.
SQL> oradebug ipc
Information written to trace file.

The generated trace file contains information on the SGA. The example that follows illustrates how Oracle creates a number of NUMA pools corresponding to the recognized number of memory nodes:

Area #0 'Fixed Size' containing Subareas 0-0
  Total size 000000000021eea0 Minimum Subarea size 00000000
  Owned by:  0,  1
   Area  Subarea    Shmid      Stable Addr      Actual Addr
      0        0   884737 0x00000060000000 0x00000060000000
                              Subarea size     Segment size
                          000000000021f000 0000000044000000
 Area #1 'Variable Size' containing Subareas 3-3
  Total size 0000000040000000 Minimum Subarea size 04000000
  Owned by:  0,  1
   Area  Subarea    Shmid      Stable Addr      Actual Addr
      1        3   884737 0x000000638fd000 0x000000638fd000
                              Subarea size     Segment size
                          0000000040703000 0000000044000000
 Area #2 'NUMA pool 0' containing Subareas 6-6
  Total size 0000000160000000 Minimum Subarea size 04000000
  Owned by:  0
   Area  Subarea    Shmid      Stable Addr      Actual Addr
      2        6   950275 0x00000204000000 0x00000204000000
                              Subarea size     Segment size
                          0000000160000000 0000000160000000
 Area #3 'NUMA pool 1' containing Subareas 5-5
  Total size 000000015c000000 Minimum Subarea size 04000000
  Owned by:  1
   Area  Subarea    Shmid      Stable Addr      Actual Addr
      3        5   917506 0x000000a4439000 0x000000a4439000
                              Subarea size     Segment size
                          000000015fbc7000 0000000160000000
 Area #4 'Redo Buffers' containing Subareas 4-4
  Total size 0000000000439000 Minimum Subarea size 00000000
  Owned by:  1
   Area  Subarea    Shmid      Stable Addr      Actual Addr
      4        4   917506 0x000000a4000000 0x000000a4000000
                              Subarea size     Segment size
                          0000000000439000 0000000160000000

If you look later in the trace file report, or if you run the ipcs command as discussed in Chapter 12, you can observe the multiple shared memory segments created to correspond to the individual memory nodes (IDs 917506 and 950275, in this case). These segments optimize the Oracle configuration to give processes local memory access to the SGA, thereby reducing memory interconnect traffic and increasing scalability as more memory nodes are added to the configuration:

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0xa3c20e68 32768      oracle    660        4096       0
0x00000000 884737     oracle    660        1140850688 28
0x00000000 917506     oracle    660        5905580032 28
0x00000000 950275     oracle    660        5905580032 28
0xfb0938e4 983044     oracle    660        2097152    28

It is important to note that an Oracle NUMA configuration is supported whether you use automatic memory management or automatic shared memory management. Additionally, if huge pages have been configured, this is also compatible with enabling NUMA at the Oracle Database level. Therefore, we continue to recommend the use of huge pages, even when enabling NUMA for performance benefits.
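To confirm that the huge page and NUMA configurations are operating together as intended, you can compare the system-wide huge page totals with the per-node allocations. Both commands below are standard, though the node paths assume the two-node system used in the earlier examples:

[root@london1 ˜]# grep -i hugepages /proc/meminfo
[root@london1 ˜]# grep -i hugepages /sys/devices/system/node/node*/meminfo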

As shown previously, if _enable_NUMA_support has been set to TRUE, the alert log reports that the parameter has a non-default value. If the value reported is FALSE despite your having set it, this indicates that there has been an issue with enabling NUMA support on your system at the Oracle level. In particular, you should verify the presence on the system of the numactl RPM package containing the libnuma library and, optionally, the numactl-devel package. Prior to Oracle version 11.2.0.2, if the numactl-devel package is not installed, it is necessary to create an additional symbolic link to the libnuma library. This enables the Oracle Database to detect the library's presence, which you accomplish as follows:

[root@london1 ˜]# cd /usr/lib64
[root@london1 lib64]# ln -s libnuma.so.1 libnuma.so
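A quick check that both the library and the newly created link are in place can be made by listing the files; the listing should show libnuma.so as a symbolic link pointing at libnuma.so.1:

[root@london1 lib64]# ls -l /usr/lib64/libnuma*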

If NUMA is enabled at the BIOS and Linux operating system levels, but disabled within Oracle, such as with a release prior to Oracle 11g Release 2, it should be clear that how NUMA is configured within the operating system will determine how memory is allocated to the Oracle SGA on startup. If huge pages have been configured then, as you have seen, their allocation is NUMA-aware by default; this means Oracle can take advantage of an interleaved memory configuration implemented on its behalf. If huge pages have not been pre-configured, the default policy is to allocate the SGA from standard-sized memory pages; in this case, the SGA will be allocated contiguously across memory nodes, as previously discussed. To mitigate the potential performance impact of SGA memory not being evenly distributed, it is also possible to set an interleaved memory configuration when starting a database with sqlplus. For example, you might start sqlplus as follows:

[oracle@london1 ˜]$ numactl --interleave=all sqlplus / as sysdba

SQL*Plus: Release 11.1.0.6.0 - Production on Sun Feb 27 00:29:09 2005

Copyright (c) 1982, 2007, Oracle.  All rights reserved.

Connected to an idle instance.

SQL>

The preceding example means that the SGA is interleaved across the available memory nodes, as illustrated here:

[oracle@london1 ˜]$ numactl --hardware
available: 2 nodes (0-1)
node 0 size: 9059 MB
node 0 free: 4096 MB
node 1 size: 9090 MB
node 1 free: 4443 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

However, in a RAC environment, it is not possible to control the NUMA allocation policy manually on all nodes when using the srvctl command to start multiple instances across the cluster. Therefore, in an Oracle 11g Release 2 RAC environment on NUMA systems, we recommend explicitly enabling NUMA support at the instance level. This will be particularly beneficial on systems with four or more sockets and the most memory-intensive workloads. If you do not wish to enable NUMA support, then we recommend an interleaved or evenly distributed SGA configuration, implemented either at the BIOS level or, preferably, with huge pages. In all cases, you should have the same configuration on all systems in the cluster.

Memory Reliability

In the era of 64-bit Linux computing and gigabytes of data held in the buffer cache, the DBA should be fully aware of the technologies and limitations of memory available in enterprise servers, especially in a RAC environment with multiple SGAs holding much of the database in memory at the same time. If there is a memory failure on one of the nodes then, depending on the frequency of checkpointing, you risk a considerable reduction in service while one of the remaining nodes recovers from the redo logs of the failed instance. As SGAs increase in size, the potential recovery time increases. On Oracle 11g, the checkpointing process is self-tuning and does not require the setting of any specific parameters; however, if parameters such as FAST_START_MTTR_TARGET are not set, you have less direct control over the possible recovery time, as will be explained later in the chapter.

To mitigate the possibility of a memory error halting the cluster for a significant period of time, all systems will include a degree of memory error detection and correction. Initially, parity-based checking afforded the detection of single-bit memory errors; however, the common standard for memory is now Error Correction Code (ECC), with an extra data byte lane used to correct single-bit errors, detect multiple-bit errors, and even take failing memory modules offline. Additional memory reliability features include registered DIMMs, which improve memory signal quality by including a register to act as a buffer, at the cost of increased latency. Registered DIMMs are identified by the addition of an R to the specification, such as PC3-10600R; unbuffered DIMMs are identified by a U, such as PC3-10600U. Registered DIMMs can support a larger number of DIMMs per memory channel, which makes them a requirement for high memory capacities.
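The dmidecode command introduced earlier can also help here: on many platforms, the Type Detail field for each memory device reports whether a DIMM is registered or unbuffered, though the exact strings vary by BIOS. A hypothetical check follows:

[root@london1 ˜]# dmidecode --type 17 | grep "Type Detail"
    Type Detail: Synchronous Registered (Buffered)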

Additional memory protection features are also available. For example, to protect against DIMM failure, the feature known as Single Device Data Correction (SDDC) or by the IBM name of Chipkill enables the system to recover from the failure of a single DRAM device on a DIMM. In conjunction with Chipkill, the Memory Sparing feature is supported by some systems. On detection of a failing DRAM device, this feature enables the channel to be mapped to a spare device. To enhance memory protection beyond Memory Sparing, some systems also support Memory Mirroring. In a Memory Mirroring configuration, the same data is written to two memory channels at the same time, thereby increasing reliability and enabling the replacement of failed DIMMs online, but at the cost of lowering the memory utilization to half of the overall capacity.

Additional Platform Features

There are many additional platform features present, on top of the server attributes we have already discussed. These sophisticated features are becoming increasingly important in helping to maintain the highest levels of cluster availability.

The following sections describe many of the features that may be available, depending on the platform chosen and their applicability to deploying a successful RAC environment.

Onboard RAID Storage

Later in this chapter, there is a detailed discussion of RAID storage in general, as opposed to onboard RAID in particular. The value of a server equipped with an onboard RAID storage system is resiliency. By far, the most common fault occurring within a server will be the failure of a hard disk. In a RAC environment, the failure of an unprotected drive containing the operating system or Oracle binaries will cause the ejection of the node from the cluster, as well as a lengthy reconfiguration once the drive is replaced. Wherever possible, all internal disk drives should be protected by an onboard RAID storage system. (Booting from a protected SAN or hosting the Oracle Home directories on a SAN or NAS are also viable options.) If an onboard RAID storage system is not available, then software RAID of the internal disk drives should be configured within the Linux operating system during installation, as detailed in Chapter 6.

Machine Check Architectures

A Machine Check Architecture (MCA) is an internal architecture subsystem that exists to some extent on all x86- and x86-64–based systems. MCA exists to provide detection of and resolution for hardware-based errors. However, not all hardware errors—for example, the failing of a disk—will be reported through the MCA system.

All MCA events fall within the following two categories:

  • CPU errors: These are errors detected within the components of the CPU, such as the following:

  • External bus logic

  • Cache

  • Data TLB

  • Instruction fetch unit

  • Platform errors: These are errors delivered to the CPU regarding non-CPU components, such as memory errors.

Depending on the severity of the error detected and the level of MCA available, the following three resolutions are possible:

  • Continue: The resolution used when an error is detected and corrected by the CPU or the server firmware. This kind of error is transparent to the operating system. Examples of these errors include single- or multiple-bit memory error corrections and cache parity errors. Corrected machine check (CMC) is used to describe an error corrected by the CPU. A corrected platform error (CPE) is an error detected and corrected by the platform hardware.

  • Recover: The resolution used when a process has read corrupted data, also termed poisoned data. Poisoned data is detected by the firmware and forwarded to the operating system, which, through MCA support, terminates the process. However, the operating system remains available.

  • Contain: The resolution used when serious errors, such as system bus address parity errors, are detected, and the server is taken offline to contain them. These noncorrected, or fatal, errors can also be termed MCA, which in this case means Machine Check Abort.

Overstating the importance of MCA features within a RAC environment would be difficult. Within RAC, detecting errors as soon as possible is important, as is either correcting them or isolating the node. The worst possible outcome is for a failing node to corrupt the data stored on disk within the database, which is the only copy of the data in the entire cluster. When monitoring the activity of the MCA, the first indication of activity will be an entry in the system log, as follows:

Jan 17 13:22:29 london1 kernel: Machine check events logged

On observing this message, the DBA should check the machine check event log in the file /var/log/mcelog, where messages are preceded by a notice such as the following, identifying the error detected:

MCE 30
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor

We recommend gaining familiarity with the MCA available on your system.

Remote Server Management and IPMI

In any enterprise-level RAC environment, remote server management is an essential feature for helping DBAs to meet the level of responsiveness required to manage clustered systems. A method must be available to access the system remotely for diagnostics and system management board functions, as well as to administer the system from any location, regardless of the current operating system state.

In a standard environment without remote server management, if the operating system has not started successfully, accessing the system will not be possible. In a RAC environment, being fully in control of all systems that have write access to the disks on the external storage where the database is located is especially important. Unless the DBA is always in a physical location near the RAC cluster, then remote management is a necessity.

Within Oracle 11g Release 2, there exists an integrated solution that implements a standard for remote server management: the Intelligent Platform Management Interface (IPMI). IPMI exposes system-level functionality to the operating system for operations such as monitoring system temperatures, fan speeds, and hardware events, as well as for viewing the console. In particular, IPMI enables rebooting a server either locally or remotely, regardless of the status of the operating system. For this reason, IPMI is an optional configuration method selected during the Grid Infrastructure installation and used within Oracle 11g Release 2 RAC for I/O fencing. Although IPMI is not a mandatory requirement, we do recommend you ensure that IPMI is supported in new hardware you evaluate for RAC because of the advantages inherent in this tool's more advanced I/O fencing. Configuration and management of IPMI using ipmitool and the OpenIPMI packages are discussed in Chapter 6.
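As a flavor of what IPMI offers ahead of the full treatment in Chapter 6, the following illustrative ipmitool command queries the power state of a remote node out-of-band; the BMC address and credentials shown are hypothetical:

[root@london1 ˜]# ipmitool -I lanplus -H 192.168.1.200 -U admin -P password chassis power status
Chassis Power is on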

Network Interconnect Technologies

The private interconnect is an essential component of a RAC installation. It enables high-speed communication between the nodes in a cluster for Cache Fusion traffic. In this section, we will review the hardware availability for implementing the interconnect from the I/O implementation on the server to the available protocols and their related implementations supported on the Linux operating system.

Oracle details the supported network and private interconnect technologies in the RAC Technologies Compatibility Matrix (RTCM) for Linux Clusters. Table 4-6 illustrates the network protocols and supported configurations discussed in this section.

Table 4.6. Network Interconnect Technologies

Network Protocol    Supported Configuration
Ethernet            100Mb/s, 1 Gigabit, or 10 Gigabit Ethernet
Infiniband (IB)     IP over IB; RDS v2 over IB (OFED 1.3.1 and higher)

Based on the available technologies detailed in Table 4-6, our focus in this section is on the Ethernet and Infiniband protocols, as well as the requisite hardware choices.

Server I/O

Previously in this chapter, we walked through all of the components of server architecture relevant to Oracle, from the processor to the system memory. Beyond local memory, you can access your data blocks through Cache Fusion. Before taking into account the available interconnect technologies external to the server, we will consider the fundamentals of the system bus on the server itself, through which all network communication takes place. These input/output (I/O) attributes of the server are also relevant in connecting to the SAN or NAS subsystem, which we will also discuss later in this chapter.

The precise configuration of the I/O connectivity supported by a server will be governed by the server chipset and, in particular, the I/O Controller Hub (ICH) on the system motherboard. The ICH is also known as the southbridge, and it connects to the Memory Controller Hub (MCH) or northbridge, thereby completing the connectivity of the system components previously discussed in this chapter. Therefore, you should review the specifications of the chipset to fully determine the I/O connectivity supporting the protocols discussed in the following sections.

PCI

For nearly all of the architectures available to run Linux, server I/O will be based on the Peripheral Component Interconnect (PCI). PCI provides a bus-based interconnection with expansion slots to attach additional devices, such as Network Interface Cards (NIC) or Fibre Channel host bus adapters (HBAs). For Oracle RAC in an enterprise environment, common requirements include the following: one external network interface; one backup network interface; two teamed network interconnect interfaces; and two teamed storage-based interfaces, which must be either network- or Fibre Channel–based. Of the six connections required, some may be satisfied by dual or quad cards. In some environments, such as a blade-based system, the connections are shared between servers. The expansion slots within a particular server may also have different specifications. Therefore, DBAs need to know whether the connections and bandwidth available from a particular environment will meet their requirements. Also note whether the PCI slots available are hot pluggable, which means that a failed card may be replaced without requiring server downtime.

The original implementation of PCI offered a 32-bit bus running at a frequency of 33MHz. The following calculation shows that this presents 133MB/s of bandwidth:

33MHz × 4 bytes (32 bits) = 133MB/s

PCI-X

The bandwidth shown for PCI is shared between all of the devices on the system bus. In a RAC environment, the PCI bus would be saturated by a single Gigabit Ethernet connection. For this reason, the original 32-bit, 33MHz PCI bus was extended to 64 bits at 66MHz. The 64-bit bus has also been extended to 100MHz and 133MHz, and in these configurations it is referred to as PCI-X. The configurations of PCI are summarized in Table 4-7.

Table 4.7. PCI Configurations

Bus     Frequency   32-bit Bandwidth   64-bit Bandwidth
PCI     33MHz       133MB/s            266MB/s
PCI     66MHz       266MB/s            532MB/s
PCI-X   100MHz      Not applicable     800MB/s
PCI-X   133MHz      Not applicable     1GB/s

PCI-X has sufficient capability to sustain a RAC node with gigabit-based networking and 1Gb- or 2Gb-based storage links. However, like PCI, PCI-X is a shared-bus implementation. With Ethernet, Fibre Channel, and Infiniband standards moving to 10Gb-based implementations and beyond, sufficient bandwidth is likely to be unavailable for all of the connections that an Oracle RAC node requires with PCI-X.

PCI-Express

The consideration of the bandwidth requirements of 10Gb connections should be made in conjunction with selecting a platform that supports the successor generation of PCI, called PCI-Express. PCI-Express should not be confused with PCI-X: in contrast to PCI-X's shared bus, PCI-Express implements a high-speed, point-to-point serial I/O bus.

When reviewing a hardware specification for PCI connectivity, the relevant section will explicitly reference PCI-Express or its abbreviation of PCIe, as follows:

Six (6) available PCI-Express Gen 2 expansion slots

In contrast to PCI, a PCI-Express link consists of dual channels, implemented as a transmit pair and a receive pair, to enable bi-directional transmission simultaneously. The bandwidth of a PCI-Express link can be scaled by adding additional signal pairs for multiple paths between the two devices. These paths are defined as x1, x4, x8, and x16, according to the number of pairs. Like memory configuration, the desired PCIe configuration should also be cross-referenced against the supported configuration of the system chipset. For example, if the chipset is described as supporting 36 PCIe lanes, this means it will support two x16 links and an x4 link; one x16, two x8 links, and an x4 link; or any supported combination that is also dependent on the configuration of the physical slots available on the motherboard.

The unencoded bandwidth is approximately 80% of the raw data transfer rate because PCI-Express uses a form of encoding (8b/10b) that utilizes the remaining 20%; the encoding enables better synchronization, error detection, and error resolution. For this reason, the transfer rate is often written as GT/s (gigatransfers per second) to distinguish the raw transfer rate from the rate at which the data itself is transmitted. Each link consists of dual channels and is bi-directional. In the original implementation, each lane supported 250MB/s. However, the revised PCI Express Base 2.0 specification released in 2007, commonly known as Gen 2, doubles the clock frequency from the maximum of 1.25GHz in the original implementation to a maximum of 2.5GHz. This effectively doubles the bandwidth per lane to 500MB/s. Table 4-8 illustrates the combined bandwidth of both directions, with an x1 link supporting a total bandwidth of 1GB/s.

Table 4.8. PCI-Express Gen 2 Configurations

PCIe Implementation   Transfer Rate   Data Rate   Bandwidth
x1                    10GT/s          8Gb/s       1GB/s
x4                    40GT/s          32Gb/s      4GB/s
x8                    80GT/s          64Gb/s      8GB/s
x16                   160GT/s         128Gb/s     16GB/s

At the time of writing, PCI Express Gen 3 is under development; this specification increases the frequency to 4GHz and the bandwidth per lane to 1GB/s.

Because it is based on a point-to-point architecture, PCI-Express will support 10Gb and higher links to the interconnect and storage without sharing bandwidth. PCI-Express adapters also natively support features that are usually managed at the system-board level with PCI and PCI-X, such as hot plug, hot swap, and advanced power management. Just as you can use the command dmidecode, described previously, to report the memory configuration, you can use the command lspci to query your platform configuration for PCI-Express support. Do so using this format:

[root@london1 ˜]# /sbin/lspci | grep -i express
00:01.0 PCI bridge: Intel Corporation X58 I/O Hub PCI Express Root Port 1 (rev 12)
00:02.0 PCI bridge: Intel Corporation X58 I/O Hub PCI Express Root Port 2 (rev 12)
00:03.0 PCI bridge: Intel Corporation X58 I/O Hub PCI Express Root Port 3 (rev 12)
...
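Beyond simply listing the PCI-Express root ports, lspci in verbose mode also reports the supported and negotiated link parameters for each device. Inspect the LnkCap and LnkSta fields to confirm, for example, that an HBA in an x8 slot has actually trained at x8 rather than a lower width:

[root@london1 ˜]# /sbin/lspci -vv | grep -i "LnkCap\|LnkSta"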

When considering Infiniband as an interconnect technology, it is important to note that Infiniband also implements a serial architecture-based protocol in its own right. Therefore, Infiniband does not, by definition, require PCI-Express connectivity, with the option of Infiniband Landed on Motherboard (LOM) providing a direct system connection. That said, such configurations are not particularly common, and they tend to be focused more in the direction of HPC computing. More common implementations feature Infiniband implemented with PCI-Express HBAs, which are also known as Host Channel Adapters (HCAs). The latter approach enables the selection of an industry-standard server platform, while also adding Infiniband as an additional technology. For this reason, we recommend systems with expansion slots that support PCI and PCI-derived technologies with bandwidth sufficient for your Oracle RAC requirements.

Private Interconnect

Building on the PCI I/O capabilities of the server platform, you can implement the cluster interconnect with a number of different technologies. As detailed previously in this section, the supported configurations are based on Ethernet and Infiniband connectivity.

Standard Ethernet Interconnects

The most popular choice for an Oracle RAC cluster interconnect is a Gigabit Ethernet switched network. Gigabit Ethernet connectivity is available through dedicated HBAs inserted into the PCI slots on a server. However, almost without exception, industry-standard servers will include at least two Gigabit Ethernet ports directly on the server motherboard. These onboard ports provide Ethernet connectivity without requiring additional hardware, thus leaving the PCI slots free for further connectivity.

Gigabit Ethernet running in full duplex mode with a nonblocking, data center class switch should be the minimum interconnect standard applied.
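You can verify the negotiated speed and duplex of each interconnect interface with the ethtool command; here, eth1 is assumed to be a private interconnect port:

[root@london1 ˜]# ethtool eth1 | grep -i "speed\|duplex"
        Speed: 1000Mb/s
        Duplex: Full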

Note

A simple network crossover cable may at first appear to be a low-cost alternative to connect a two-node cluster, enabling you to create a directly linked network between two cluster nodes. However, this alternative is not supported as an interconnect with RAC on Linux. Without the electrical isolation provided by a switch, NIC hardware errors on one of the servers could also cause hardware errors on the other server, rendering the entire cluster unavailable. This configuration also eliminates the option of a redundant networking configuration and prevents the addition of more than two nodes to the cluster.

Gigabit Ethernet theoretically supports transfer rates of 1,000Mb/s (equivalent to 125 megabytes per second) and latencies of approximately 60 to 100 microseconds for shorter packet sizes. A clear difference exists between this latency and the two to four milliseconds we previously specified for the average receive time for a consistent read or current block for Cache Fusion. In addition to the Oracle workload, processing the network stack at the Linux operating system level is a major contributing factor to the latency of Cache Fusion communication.

For standard TCP/IP networking, reading from or writing to a network socket causes a context switch, where the data is copied from user to kernel buffers and processed by the kernel through the TCP/IP stack and the Ethernet driver appropriate to the NIC installed. Each stage in this workload requires a degree of CPU processing, and tuning kernel network parameters (you can learn more about this in Chapter 6) can play a role in increasing network throughput. To minimize the overhead of network processing, the default protocol for Oracle 11g RAC on Linux is the User Datagram Protocol (UDP), as opposed to the Transmission Control Protocol (TCP). UDP is a non-connection-oriented protocol that doesn't guarantee data ordering; it places the responsibility for verifying the data transmitted on the application itself (Oracle, in this case). The benefit of this approach is that UDP reduces the kernel networking overhead.
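UDP throughput on the interconnect depends partly on the kernel socket buffer limits discussed in Chapter 6. You can inspect the current limits with sysctl; the values shown here reflect Oracle's recommended minimums for 11g Release 2 and may differ on your system:

[root@london1 ˜]# sysctl net.core.rmem_max net.core.wmem_max
net.core.rmem_max = 4194304
net.core.wmem_max = 1048576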

At the time of writing, 10 Gigabit Ethernet (10GbE) technology is becoming more widely available and has recently been supported within the RAC Technologies Matrix. As 10GbE adoption increases, driven by reductions in cost and by 10GbE ports available directly on server motherboards, it is likely to become the standard implementation over time, offering improved levels of bandwidth and lower latencies compared to Gigabit Ethernet. Therefore, we recommend monitoring the status of 10GbE closely. Given the required levels of cost, availability, and support for 10GbE, its adoption should be viewed as a progressive development of Ethernet, and it should be adopted where reasonably possible. When implementing 10 Gigabit Ethernet over copper cabling, it is important to also use the correct cabling for such an implementation: Cat 5e or 6 is commonly used for Gigabit Ethernet, whereas 10 Gigabit Ethernet requires Cat 6 or, preferably, Cat 6a cabling.

Fully Redundant Ethernet Interconnects

In a standard environment with a two-node cluster and no resiliency, each node connects to a single interconnect (Oracle does not recommend running the Clusterware and Database interconnect on separate networks) with a single NIC, in addition to the public network and any additional networks, such as the backup network. Figure 4-4 shows the interconnect network for simplicity.

Figure 4.4. Nonredundant interconnect

If a single interconnect NIC fails, Oracle Clusterware will attempt to reconnect for a user-defined period of time before taking action to remove the node from the cluster. By default, the amount of time it will wait is defined by the 11g Clusterware CSS Misscount value as 30 seconds. When this happens, the master node directs the Oracle database Global Cache Service to initiate recovery of the failed instance. However, if the interconnect network fails because, for example, the network switch itself fails, then a scenario will result that is equivalent to the failure of every single node in the cluster, except for the designated master node. The master node will then proceed to recover all of the failed instances in the cluster before providing a service from a single node. This will occur irrespective of the number of nodes in the cluster.

Because the master node must first recover all instances, this process will result in a significant reduction in the level of service available. Therefore, we recommend implementing a fully redundant interconnect network configuration.

A common consideration is to implement the Oracle CLUSTER_INTERCONNECTS parameter. This parameter requires the specification of one or more IP addresses, separated by a colon, to define the network interfaces that will be used for the interconnect. This network infrastructure is configured as shown in Figure 4-5.

Figure 4.5. Clustered interconnects

The CLUSTER_INTERCONNECTS parameter, however, is available to distribute network traffic across one or more interfaces, enabling you to increase the bandwidth available for interconnect traffic; this means it is most relevant in a data warehousing environment. The parameter explicitly does not implement failover functionality; therefore, the failure of any interconnect switch will continue to result in the failure of the entire interconnect network and a reduced level of service. Additionally, Oracle does not recommend setting the CLUSTER_INTERCONNECTS parameter except in specific circumstances; for example, the parameter should not be set for a policy-managed database (you will learn more about workload management in Chapter 11).

Implementing a fully redundant interconnect configuration requires software termed NIC bonding at the operating system level. This software operates at the network driver level to make two physical network interfaces operate as a single logical interface. In its simplest usage, this software provides failover functionality, where one card is used to route traffic for the interface, and the other remains idle until the primary card fails. When a failure occurs, the interconnect traffic is routed through the secondary card. This occurs transparently to Oracle: its availability is uninterrupted, the IP address remains the same, and the software driver also remaps the hardware (MAC) address of the card, so that failover is instantaneous. The failover is usually only detectable within Oracle by a minimal IPC time-out wait event. It is also possible in some configurations to provide increased bandwidth, delivering a similar solution to the CLUSTER_INTERCONNECTS parameter, with the additional protection of redundancy.

When implementing a bonding solution, understanding the implications of NIC failure, as opposed to failure of the switch itself, is important. Consider the physical implementation shown in Figure 4-5, where two nodes are connected separately to two switches. In this scenario, if a primary NIC fails on either of the nodes, that node will switch over to use its secondary NIC. However, this secondary NIC is now operating on a completely independent network from the primary card that still remains operational on the fully functional node. Communication will be lost, and the Oracle Clusterware will initiate cluster reconfiguration to eject the non-master node from the cluster.

Now consider the physical implementation shown in Figure 4-6, where the two nodes are connected separately to the two switches. In this case, the two switches are also connected with an interswitch link or an external network configuration, enabling traffic to pass between the two interconnect switches. As with the previous scenario, when a NIC fails, the network traffic is routed through the secondary interface. However, in this case, the secondary interface continues to communicate with the primary interface on the remaining node across the interswitch link. In failover mode, when no failures have occurred, only the primary switch is providing a service; after a failure, both switches are active and routing traffic between them.

Figure 4.6. Fully redundant clustered interconnects

Crucially, this failover scenario also guards against the failure of a switch itself. While operational, if the secondary switch fails, the network traffic remains with the primary interconnect, and no failover operations are required. However, if the primary switch fails, the driver software reacts exactly as it would if all of the primary NICs on all of the nodes in the cluster had failed at exactly the same time. All of the NICs simultaneously switch from their primary to their secondary interface, so communication continues with all of the network traffic now operating across the secondary switch for all nodes. In this solution, there is no single point of failure because either a single NIC, network cable, switch, or interswitch link can fail without impacting the availability of the cluster. If these components are hot-swappable, they can also be replaced, and the fully redundant configuration can be restored without impacting the availability of the clustered database.

This form of teaming requires interaction with the network driver of the NICs used for bonding. In the widest range of cases, this approach is therefore best implemented with channel bonding in an active-backup configuration (see Chapter 6 for more information on this subject). Implementing NIC bonding in a broadcast configuration is not recommended for the high availability demands of an Oracle RAC configuration. In addition to a high availability configuration, a load balancing configuration may be implemented, depending upon the support for such a configuration at both the network hardware and driver levels.
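For illustration, the following is a minimal sketch of an active-backup channel bonding configuration on a Red Hat-style system; the interface names, addresses, and values shown are examples only, and Chapter 6 covers this configuration in detail:

# /etc/modprobe.conf -- load the bonding driver in active-backup mode (mode=1),
# checking link state every 100 milliseconds
alias bond0 bonding
options bond0 mode=1 miimon=100

# /etc/sysconfig/network-scripts/ifcfg-bond0 -- the logical interconnect interface
DEVICE=bond0
IPADDR=192.168.1.1
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none

# /etc/sysconfig/network-scripts/ifcfg-eth1 -- a physical slave interface
# (eth2 is configured identically)
DEVICE=eth1
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none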

Infiniband

The specifications of Infiniband describe a scalable, switched, fabric-based I/O architecture intended to standardize the communication between the CPU and peripheral devices. The aim of Infiniband was to unify and replace existing standards such as Ethernet, PCI, and Fibre Channel, moving beyond static server configurations to a dynamic, fabric-based data center environment where compute power can be added separately from the devices that provide the data to be processed.

The evolution of technologies such as PCI-Express and 10 Gigabit Ethernet testify to the fact that Infiniband did not wholly succeed in its initial design goal; however, the technology has become established as a standard for the interconnect between compute nodes in high performance computing clusters, and more recently, as the connectivity technology utilized by the Oracle Exadata Storage Server.

Infiniband is similar to PCI-Express in the way it implements a serial architecture that aggregates multiple links. Infiniband supports 2.5Gb/s per link. It also includes support for double data rate (DDR), which increases this to 5Gb/s; and quad data rate (QDR), which increases throughput to 10Gb/s at the same frequency. The most common system interconnect implementation uses 4x links, which results in raw bandwidth of 20Gb/s for DDR and 40Gb/s for QDR; however, as with PCI-Express, the encoding scheme reduces the usable bandwidth to 16Gb/s and 32Gb/s, respectively. In addition to its bandwidth benefits, Infiniband latencies are also measured in microseconds, with a typical value of 10 microseconds. This means Infiniband offers significant latency gains over Ethernet, as well as a reduction in the CPU utilization required for I/O.
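For example, the usable bandwidth of a 4x QDR configuration can be derived from the raw link rate and the encoding overhead, which carries 8 bits of data in every 10 bits transmitted:

4 links * 10Gb/s * (8 / 10) = 32Gb/s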

To implement Infiniband in an Oracle RAC environment, IP over Infiniband is supported. However, the optimal solution with Oracle 11g on Linux is to use the Reliable Datagram Sockets (RDS) protocol, rather than IP over Infiniband, for its lower utilization of CPU. It is important to note that the supported implementation and version of RDS for Linux is the Open Fabrics Enterprise Distribution (OFED) 1.3.1 for RDS version 3, which is not included in either Red Hat or Oracle Enterprise Linux by default. This means it must be downloaded from Oracle as patch 7514146 and installed separately. It is also necessary to relink the Oracle binary to use RDS while both the ASM and database instances are shut down, as shown in the following lines:

[oracle@london1 ˜]$ cd $ORACLE_HOME/rdbms/lib
[oracle@london1 lib]$ make -f ins_rdbms.mk ipc_rds ioracle

Once relinked, Oracle uses RDS for interconnect communication. Note that RDS and UDP cannot both be configured at the same time; therefore, returning to the default UDP settings requires relinking the binary again:

[oracle@london1 lib]$ make -f ins_rdbms.mk ipc_g ioracle

Infiniband fabrics are typically implemented with a focus on providing bandwidth in a clustered environment. For this reason, they are often configured in what is known as a Fat Tree Topology or Constant Bisectional Bandwidth (CBB) network to support a large number of nodes in a non-blocking switch configuration with multiple levels. This is in contrast to the hierarchical configuration of a typical Ethernet-based interconnect. However, it is important to note that the bandwidth requirements of Oracle RAC are typically considerably lower than those of the HPC supercomputing environments where Infiniband is usually deployed. Therefore, we recommend verifying the benefits of such an approach before deploying Infiniband as your interconnect technology.

Private Interconnect Selection Summary

In the absence of compelling statistical evidence that you will benefit from using a higher performance interconnect, we recommend selecting the default UDP protocol running over Ethernet at gigabit or 10 gigabit speeds for a default installation of an Oracle RAC cluster. Ethernet offers a simpler, more cost-effective interconnect method than Infiniband; and where cost is not the decisive factor, implementing Infiniband should be compared against the alternative of 10 Gigabit Ethernet, coupled with high-performance server processors and the RAM to provide an adequate buffer cache on all server nodes. In other words, you need to consider your entire solution, rather than focusing on your method of interconnect entirely in isolation. To that end, we recommend comprehensive testing of your application in a RAC environment with multiple node and interconnect configurations as the basis for moving beyond a default configuration. As you have seen, Ethernet also offers greater potential for building redundancy into the interconnect configuration—and at a significantly lower cost.

There are a handful of exceptions where an Infiniband interconnect may prove a valuable investment. For example, it might make sense when implementing large data warehousing environments where high levels of inter-instance parallel processing capabilities are required. Additionally, Infiniband may prove worthwhile when looking to consolidate both the interconnect and storage fabric into a single technology with sufficient bandwidth to accommodate them both. However, in the latter circumstance, 10 Gigabit Ethernet should also be considered, to see if it can meet the increased bandwidth requirements.

Storage Technologies

In this chapter, we have progressed through the hierarchy of the RAC Technologies Compatibility Matrix (RTCM) for Linux Clusters, taking an in-depth look at the attributes of the server processor architecture, memory, and network interconnect technologies. In this section, we complete that analysis by reviewing the foundation that underpins every RAC implementation: storage technologies. As you learned in Chapter 2, you can have a single logical copy of the database shared between multiple server nodes in a RAC configuration. This logical copy may consist of multiple physical copies replicated either with ASM, alternative volume management software, or at the hardware level. However, maintaining the integrity of this single logical copy of data means that choosing the right storage solution is imperative. As with all architectural designs, if you fail to pay sufficient attention to the foundations of RAC, then it is unlikely that you will succeed in building a solution based on this technology. Table 4-9 details the storage protocols and configurations supported by Oracle in a RAC configuration. In addition to Oracle's level of support, it is also a requirement that the selected servers and storage form a configuration supported by the vendors in question.

Table 4.9. Storage Technologies

Storage Protocol                      Supported Configuration
------------------------------------  ---------------------------------------------------------
Fibre Channel                         Switched configuration or Fibre Channel ports integrated
                                      into storage
Fibre Channel over Ethernet (FCoE)    Cisco FCoE supported
SCSI                                  Support for two nodes
iSCSI                                 Support for up to 30 nodes with gigabit Ethernet storage
                                      network
NFS                                   Supported solutions from EMC, Fujitsu, HP, IBM, NetApp,
                                      Pillar Data Systems, and Sun

In the following sections, we will examine the merits of the supported storage protocols and associated storage technologies. Before doing so, however, we will first put this information into its proper context by examining how Oracle utilizes storage in a clustered environment.

RAC I/O Characteristics

The I/O characteristics of an Oracle database, including a RAC-based one, can be classified into four groups: random reads, random writes, sequential reads, and sequential writes. The prominence of each is dependent on the profile of the application; however, some generalizations can be made.

In a transactional environment, a number of random writes will be associated with the database writer (DBWR) process (we use DBWRn to represent all of the database writer processes). These writes are in addition to the sequential writes associated with the redo logs and the log writer (LGWR) process. A level of random reads would also be expected from index-based queries and undo tablespace operations. Sequential read performance from full-table scans and parallel queries is normally more prominent in data warehouse environments, where there is less emphasis on write activity, except during batch loads.

To lay the foundation for a discussion of storage, we will begin by focusing on the read and write behavior that an Oracle cluster is expected to exhibit.

Read Activity

In this section, we have used the terms random reads and sequential reads from a storage perspective. However, the Oracle wait event usually associated with a sequential read realized on the storage is db file scattered read, whereas the Oracle wait event usually associated with a random read on the storage is db file sequential read. Therefore, clarifying the differing terminology seems worthwhile.

Within Oracle, the reading of data blocks is issued from the shadow process for a user's session. A scattered read is a multiblock read for full-table scan operations where the blocks, if physically read from the disk storage, are normally accessed contiguously. The operations required to read the blocks into the buffer cache are passed to the operating system, where the number of blocks to fetch in a single call is determined by the Oracle initialization parameter DB_FILE_MULTIBLOCK_READ_COUNT. The long-standing methodology for UNIX multiple buffer operations is termed scatter/gather. In Linux, the SCSI generic (sg) packet device driver introduced scatter/gather I/O in the 2.2 kernel. For Oracle on Linux, a similar methodology is employed, where scattered reads are issued with the pread() system call to read multiple buffers from a file descriptor at a given offset into the user session's program global area (PGA). Subsequently, these buffers are written into noncontiguous buffers in the buffer cache in the SGA. The key concept to note is that the data may be retrieved from the disk in a contiguous manner, but it is distributed into noncontiguous, or scattered, areas of the SGA. In addition to db file scattered read, in some circumstances you may also see the wait event direct path read associated with a profile of multiblock reads on the storage. The ultimate destination of the data blocks in a direct path read is the PGA, as opposed to the SGA. This is most evident when utilizing Parallel Query; hence, the typical profile on the storage is one of multiple, contiguous reads with the requests issued asynchronously by the Parallel Execution Servers. Parallel Execution in a RAC environment is discussed in more detail in Chapter 14.
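For example, the multiblock read size configured for an instance can be checked as follows; the value shown here is illustrative only:

SQL> show parameter db_file_multiblock_read_count

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------
db_file_multiblock_read_count        integer     128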

A db file sequential read event, on the other hand, is associated with index-based reads and often retrieves a single block or a small number of blocks that are then stored contiguously within the SGA. A single block read, by definition, is stored in contiguous memory. Therefore, this form of I/O is termed sequential if it's physically read from the disk storage, despite the fact that the file locations for these blocks are accessed randomly.

For the reasons previously described in this chapter, the important aspect to note in terms of read storage activity for RAC is that accessing data blocks through Cache Fusion from other nodes in the cluster is preferable to accessing them from disk. This holds true, no matter which method you employ. However, the buffer cache should always be sufficiently sized on each node to minimize both Cache Fusion and physical disk reads, as well as to ensure that as much read I/O as possible is satisfied logically from the local buffer cache. The notable exception occurs with data warehouse environments, where in-memory parallel execution is not being used. In that case, the direct path reads are not buffered in the SGA, so sufficient and usually considerable read I/O bandwidth must be available to satisfy the combined read demands of the cluster.

Write Activity

In a transactional environment, the most important aspect of storage performance for RAC in an optimally configured system will most likely be the sequential writes of the redo logs. The online redo logs will never be read as part of normal database operations, with a couple of exceptions.

Note

The fact that Oracle redo logs are only ever written in a sequential manner may not hold true when the redo logs are based on a file system where direct I/O has not been implemented. In this case, the operating system may need to read the operating system block before providing it to Oracle to write the redo information, and then subsequently write the entire block back to the storage. Therefore, on a file system, a degree of read activity may be associated with disks where the redo logs are located when observed from Linux operating system utilities.

These exceptions are as follows: when one node in the cluster has failed, so the online redo logs are being read and recovered by another instance; and when the archiver (ARCn) process is reading an online redo log to generate the archive log. These activities, however, will not take place on the current active online redo log for an instance.

To understand why these sequential writes take prominence in a transactional environment, it's important to look at the role of the LGWR process. The LGWR process writes the changes present in the memory-resident redo log buffer, which itself comprises a number of smaller individual buffers, to the online redo logs located on disk. LGWR does not necessarily wait until a commit is issued before flushing the contents of the log buffer to the online logs; instead, by default, it will write when the log buffer is 1/3 full (a threshold that can be altered with the hidden parameter _LOG_IO_SIZE), every three seconds, or if posted by the DBWRn process—whichever occurs first, as long as LGWR is not already writing. Additionally, if a data block requested by another instance through Cache Fusion has associated redo, then LGWR must write this redo to disk on the instance currently holding the block before it is shipped to the requesting instance.

Note that there are two important aspects of this activity in terms of storage utilization. First, in a high-performance OLTP environment, storage performance for redo logs is crucial to overall database throughput. Also, with Oracle 11g Release 2, the log buffer is sized automatically, so it cannot be modified from the default value. Second, issuing a rollback instead of a commit does not interrupt the sequential write activity of the LGWR process to the online redo logs. The rollback operation is completed instead from the information stored in the undo tablespace. Because these undo segments are also stored in the database buffer cache, the rollback operation will also generate redo, resulting in further sequential writes to the online redo logs.

When an Oracle client process ends a transaction and issues a commit, the transaction will not be recoverable until all of the redo information in the log buffer associated with the transaction and the transaction's system change number (SCN) have been written to the online redo log. The client will not initiate any subsequent work until this write has been confirmed; instead, it will wait on a log file sync event until all of its redo information is written.

Given sufficient activity, LGWR may complete a log file sync write for more than one transaction at a time. When this write is complete, the transaction is complete, and any locks held on rows and tables are released.

It is important to note that a log file sync is not composed entirely of I/O-related activity, and it requires sufficient CPU time to be scheduled to complete its processing. For this reason, poorly performing storage will inevitably result in significant times being recorded on log file sync wait events. However, the converse is not necessarily true: high log file sync wait events are not always indicative of poorly performing storage. In scenarios associated with high CPU utilization, moving to a system with a more capable CPU or, to a lesser extent, increasing the scheduling priority of the LGWR process can significantly increase the redo throughput of an apparently disk-bound environment.
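As an illustrative check, the cumulative waits for these events can be compared in V$SYSTEM_EVENT; where log file sync times are high but the underlying log file parallel write times are low, CPU scheduling rather than storage is the more likely constraint:

SQL> select event, total_waits, time_waited_micro
  2  from v$system_event
  3  where event in ('log file sync', 'log file parallel write');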

Note that the modified blocks stored in the database buffer will not necessarily have been written to the storage at this point. In any event, these will always lag behind the redo information to some degree. The redo information is sufficient for database recovery; however, the key factor is the time taken to recover the database in the event of a failure. The more that the SCN in the online redo logs and archive logs is ahead of the SCN in the database datafiles, the longer the time that will be taken to recover.

Oracle RAC has multiple redo log threads—one for each instance. Each instance manages its own redo generation, and it will not read the log buffer or online redo logs of another instance while that instance is operating normally. Therefore, the redo log threads operate in parallel.

For RAC, all of the redo log threads should be placed on storage with the same profile to maintain equilibrium of performance across the cluster. The focus for redo should also be placed on the ability of the storage to support the desired number of I/O operations per second (IOPS) and the latency time for a single operation to complete, with less emphasis placed on the bandwidth available.

In addition to the sequential write performance associated with the redo log threads, other important storage aspects for transactional systems include the random reads and writes associated with the buffer cache and the writes of the DBWRn process. For 11g, the actual number of DBWR processes configured is either 1 or CPU_COUNT/8, whichever is greater. Like the LGWR process, DBWRn has a three-second time-out when idle. And when active, DBWRn may write dirty or modified blocks to disk. This may happen before or after the transaction that modified the block commits. This background activity will not typically have a direct impact on the overall database performance, although the DBWRn process will also be active in writing dirty buffers to disk, if reading more data into the buffer cache is required and no space is available.

For example, when a redo log switch occurs or the limits set by the LOG_CHECKPOINT_TIMEOUT or LOG_CHECKPOINT_INTERVAL parameter are reached for a particular instance, a thread checkpoint occurs, and every dirty or modified block in the buffer cache for that instance only will be written to the datafiles located on the storage. A thread checkpoint can also be manually instigated on a particular instance by the command alter system checkpoint local. A thread checkpoint ensures that all changes to the data blocks from a particular redo log thread on an instance up to the checkpoint SCN (System Change Number) have been written to disk. This does not necessarily mean that the resulting DBWRn activity is limited to the local instance. As discussed in Chapter 2, when another instance requires a data block already in use, that data block is shipped via the interconnect through Cache Fusion to the requesting instance, while the original instance maintains a copy of the original block called a past image. The past image must be held until its block master signals whether it can be written to disk or discarded. In this case, it is the DBWRn process of the instance with the most recent past image prior to the checkpoint SCN that writes the block to disk. From a storage perspective, however, this is not additional disk write activity in a RAC environment. Instead, DBWR Fusion writes are a result of the transfer of current data blocks between instances through Cache Fusion, without the modified blocks having been written to disk beforehand. Therefore, the most recent committed changes to a data block may reside on another instance. Without Cache Fusion, writes of modified blocks to disk to be read by another instance are known as DBWR Cross-Instance writes and involve an increased DBWRn workload. With DBWR Fusion writes, any additional overhead is in interconnect-related messaging, as opposed to at the storage level.

A full checkpoint on all instances, known as a Global Checkpoint, writes out all changes for all threads and requires the command alter system checkpoint global, which is also the default if local or global is not specified. Checkpoint details can be reported in the alert log by setting the parameter log_checkpoints_to_alert to the value of TRUE. Doing so reports that a checkpoint occurred up to a particular redo byte address (RBA) in the current redo log file for that instance.
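For example, checkpoint reporting can be enabled and a global checkpoint requested with the following commands, shown here purely for illustration:

SQL> alter system set log_checkpoints_to_alert=true scope=both sid='*';
SQL> alter system checkpoint global;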

Setting the parameter FAST_START_MTTR_TARGET to a time value in seconds between 0 and 3600, while unsetting LOG_CHECKPOINT_TIMEOUT and LOG_CHECKPOINT_INTERVAL, causes Oracle to calculate the correct checkpoint interval based on the desired recovery time, in a process called fast start checkpointing. Fast start checkpointing has been available since Oracle 9i; therefore, setting LOG_CHECKPOINT_TIMEOUT and LOG_CHECKPOINT_INTERVAL is not recommended. Additionally, since release 10g, automatic checkpoint tuning is enabled if the FAST_START_MTTR_TARGET parameter is not set or is set to a sufficiently large value, which means that potentially no checkpoint-related parameters are required to be set. Note that setting FAST_START_MTTR_TARGET is not mandatory; nevertheless, we recommend setting this parameter in a RAC environment to enable fast start checkpointing. Doing so is preferable to relying entirely upon automatic checkpoint tuning because, in a RAC environment, checkpoint tuning also impacts recovery from a failed instance, in addition to crash recovery from a failed database.

It is important to be aware that the parameter FAST_START_MTTR_TARGET includes all of the timing required for a full crash recovery, from instance startup to opening the database datafiles. However, in a RAC environment, there is also the concept of instance recovery, whereby a surviving instance recovers from a failed instance. To address this issue, the _FAST_START_INSTANCE_RECOVERY_TARGET parameter was recommended by Oracle for RAC at version Oracle 10g Release 2 as a more specific way to determine the time to recover from a failed instance. However, this parameter was only relevant if one instance in the cluster failed, and it is no longer available for Oracle 11g Release 2. Therefore, we recommend setting the parameter FAST_START_MTTR_TARGET in a RAC environment. If you set this value too low for your environment and checkpointing is frequent, then the workload of the DBWRn process will be high, and performance under normal workloads is likely to be impacted; at the same time, the time to recover a failed instance will be comparatively shorter. If the FAST_START_MTTR_TARGET parameter is set too high, then DBWRn activity will be lower, and the opportunity for throughput will be higher; that said, the time taken to recover from a failed instance will be comparatively longer. FAST_START_MTTR_TARGET can be set dynamically, and the impact of different values can be observed in the column ESTIMATED_MTTR in the view V$INSTANCE_RECOVERY. For recovery from a failed instance, the instance startup time and total datafile open time should be deducted from the ESTIMATED_MTTR value; these can be read from X$ESTIMATED_MTTR in the column INIT_TIME_AVG and as the number of datafiles multiplied by FOPEN_TIME_AVG, respectively.
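To illustrate, the following commands set a recovery target of 60 seconds (an example value only) and report the resulting estimate, along with the redo log sizing guidance discussed later in this section:

SQL> alter system set fast_start_mttr_target=60 scope=both sid='*';
SQL> select estimated_mttr, optimal_logfile_size from v$instance_recovery;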

During a recovery operation, the FAST_START_PARALLEL_ROLLBACK parameter determines the parallelism of the transactional component recovery operation after the redo apply. By default, this parameter is set to a value of LOW, which is twice the CPU_COUNT value. Given sufficient CPU resources, this parameter can be set to a value of HIGH to double the number of processes compared to the default value; the aim of changing this parameter is to improve the transaction recovery time.

The differing profiles of the LGWR and DBWRn processes on the nodes of the cluster will dictate the activity observed on the storage. There is more emphasis on LGWR activity in a transactional, as opposed to a data warehouse, environment. Some of the balance in activity between LGWR and DBWRn processes lies in the mean time to recover. But given redo logs of a sufficient size, a checkpoint on each node will most likely occur at the time of a log switch. This checkpoint will allow sufficient time for DBWRn to flush all of the dirty buffers to disk. If this flushing does not complete, then the Checkpoint Not Complete message will be seen in the alert.log of the corresponding instance, as follows:

Thread 1 cannot allocate new log, sequence 2411
Checkpoint not complete
  Current log# 4 seq# 2410 mem# 0: +REDO/prod/onlinelog/group_4.257.679813563
Sun Feb 27 05:51:55 2009
Thread 1 advanced to log sequence 2411 (LGWR switch)
  Current log# 1 seq# 2411 mem# 0: +REDO/prod/onlinelog/group_1.260.679813483
Sun Feb 27 05:52:59 2009
Thread 1 cannot allocate new log, sequence 2412
Checkpoint not complete
  Current log# 1 seq# 2411 mem# 0: +REDO/prod/onlinelog/group_1.260.679813483
Sun Feb 27 05:53:28 2009
Thread 1 advanced to log sequence 2412 (LGWR switch)
  Current log# 2 seq# 2412 mem# 0: +REDO/prod/onlinelog/group_2.259.679813523

When this error occurs, throughput will stall until the DBWRn has completed its activity and enabled the LGWR to allocate the next log file in the sequence. When the FAST_START_MTTR_TARGET parameter is set to a non-zero value, the OPTIMAL_LOGFILE_SIZE column in the V$INSTANCE_RECOVERY view is populated with a value in kilobytes that corresponds to a redo log size that would cause a checkpoint to coincide with a log file switch, according to your MTTR target. Consequently, a longer recovery target will coincide with a recommendation for larger redo log files. This value varies dynamically; therefore, we recommend using it for general guidance, but mainly to help ensure that the log files are not so undersized, in conjunction with expected DBWRn performance, that Checkpoint Not Complete messages are regularly seen in the alert log. Although the precise redo log size is dependent on the system throughput, typically a range from 512MB to 2GB is an acceptable starting value for a RAC environment.

Asynchronous I/O and Direct I/O

Without asynchronous I/O, every I/O request that Oracle makes to the operating system is completed singly and sequentially for a particular process, using the pread() system call for reads and the corresponding pwrite() system call for writes. Once asynchronous I/O has been enabled, multiple I/O requests can be submitted in parallel within a single io_submit() system call, without waiting for the requests to complete; the completed requests are subsequently retrieved from the queue of completed events with io_getevents(). Utilizing asynchronous I/O potentially increases the efficiency and throughput of both the DBWRn and LGWR processes. For DBWRn, it also minimizes the requirement to increase the value of DB_WRITER_PROCESSES beyond the default setting, an approach that may benefit synchronous I/O.

In an Oracle RAC on Linux environment, I/O processing is likely to benefit from enabling asynchronous I/O both in regular operations and for instance recovery. Asynchronous I/O support at the operating system level is mandatory for an Oracle 11g Release 2 installation, and it is a requirement for using ASM.
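The kernel's asynchronous I/O limits and current usage can also be observed through the /proc interface; the values shown here are illustrative:

[oracle@london1 ˜]$ cat /proc/sys/fs/aio-max-nr
1048576
[oracle@london1 ˜]$ cat /proc/sys/fs/aio-nr
12288

The fs.aio-nr value reports the number of asynchronous I/O requests currently allocated system-wide, and fs.aio-max-nr is the ceiling against which it can be compared.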

To utilize asynchronous I/O for Oracle on Linux, you need to ensure that the RPM package libaio has been installed according to the process described in Chapter 6:

[oracle@london1 ˜]$ rpm -q libaio
libaio-0.3.106-3.2

The presence of the libaio package is mandatory, even if you do not wish to use asynchronous I/O, because the Oracle executable is directly linked against it, and it cannot be relinked with the async_off option to break this dependency. The additional libaio-devel RPM package is checked for under package dependencies during the Oracle installation; and in x86-64 environments, both the 32-bit and 64-bit versions of this RPM are required.

Asynchronous I/O can be enabled directly on raw devices. However, with Oracle 11g Release 2, raw devices are not supported by the Oracle Universal Installer (OUI) for new installs. That said, raw devices are supported for upgrades of existing installations, and they may be configured after a database has been installed. Asynchronous I/O can also be enabled on file systems that support it, such as OCFS2, and it is enabled by default on Automatic Storage Management (ASM) instances; we discuss these in Chapters 6 and 9, respectively. For NFS file systems, the Direct NFS Client incorporated within Oracle 11g is required to implement asynchronous I/O against NFS V3 storage devices.

Where asynchronous I/O has not been explicitly disabled, the initialization parameter DISK_ASYNCH_IO is set to TRUE by default. Thus, asynchronous I/O will be used on ASM and raw devices when the instance is started. In fact, it is important to note that ASM is not itself a file system (see Chapter 9 for more details on this topic), so the I/O capabilities are determined by the underlying devices, which are typically raw devices. For this reason, setting asynchronous I/O for ASM and raw devices implies the same use of the technology.

If the DISK_ASYNCH_IO parameter is set to FALSE, then asynchronous I/O is disabled. The column ASYNCH_IO in the view V$IOSTAT_FILE displays the value ASYNC_ON for datafiles where asynchronous I/O is enabled.
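For example, the following query reports the asynchronous I/O status of each datafile; the output (not shown) displays ASYNC_ON where asynchronous I/O is in effect:

SQL> select file_no, filetype_name, asynch_io
  2  from v$iostat_file
  3  where filetype_name = 'Data File';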

If you're using asynchronous I/O on a supporting file system such as OCFS2, however, you also need to set the parameter FILESYSTEMIO_OPTIONS, which by default is set to NONE. Setting FILESYSTEMIO_OPTIONS to anything other than the default value is not required for ASM. Oracle database files are stored directly in ASM; they cannot be stored in an Automatic Storage Management Cluster File System (ACFS), so FILESYSTEMIO_OPTIONS is not a relevant parameter for ACFS.

On a file system, this parameter should be set to the value ASYNCH by executing the following commands on one instance in the cluster:

[oracle@london1 lib]$ srvctl stop database -d PROD
SQL> startup nomount;

SQL>  alter system set filesystemio_options=asynch scope=spfile;

SQL> shutdown immediate;

[oracle@london1 lib]$ srvctl start database -d PROD

Once asynchronous I/O is enabled, there are a number of ways to ascertain whether it is being used. For example, if ASMLIB is not being used, running iterations of the command cat /proc/slabinfo | grep kio shows changing values under kioctx and kiocb, indicating that asynchronous I/O data structures are in use. Additionally, during testing, the output of the operating system strace command run against the LGWR and DBWRn processes can be examined for the use of the io_submit() and io_getevents() system calls. It is important to note, however, that if you're using ASMLIB on 2.6 kernel-based systems, only the read() system call is evident, even for write events. For example, the following snippet determines the process id of the LGWR process:

[oracle@london1 ˜]$ ps -ef | grep -i lgwr
oracle    5484     1  0 04:09 ?        00:00:00 asm_lgwr_+ASM1
oracle    5813     1  1 04:11 ?        00:02:32 ora_lgwr_PROD1
oracle   28506 13937  0 06:57 pts/2    00:00:00 grep -i lgwr

This process can then be traced with the output directed into a file until it is interrupted with Ctrl-C:

[oracle@london1 ˜]$ strace -af -p 5813 -o asynctest
Process 5813 attached - interrupt to quit
Process 5813 detached

Viewing this output in the file shows the use of the read() system call, even for the sequential writes of LGWR:

read(22,"MSA210P260306v3230017q202
327*"..., 80) = 80
read(22, "MSA210P260306v32"
..., 80) = 80

For this reason, when using ASMLIB, the only way to determine whether asynchronous I/O is in use is to check the value of the column ASYNCH_IO in the view V$IOSTAT_FILE. With Linux 2.6-based kernels, the size of the asynchronous I/O operations can no longer be modified with the kernel parameter fs.aio-max-size; instead, the size is determined automatically.

Asynchronous I/O should not be confused with direct I/O. Direct I/O is enabled by an Oracle parameter to avoid file system buffering on compatible file systems, and depending on the file system in question, it may be used either independently or in conjunction with asynchronous I/O. Remember: if you're using ASM, direct I/O is not applicable.

Direct I/O can be enabled in the initialization parameter file by setting the parameter FILESYSTEMIO_OPTIONS to DIRECTIO. If enabling both asynchronous I/O and direct I/O is desired on a compatible file system, then this parameter can alternatively be set to the value of SETALL, as in this example:

SQL>  alter system set filesystemio_options=directIO scope=spfile;

Alternatively, you can set the value like this:

SQL>  alter system set filesystemio_options=setall scope=spfile;

This means a combination of the parameters DISK_ASYNCH_IO and FILESYSTEMIO_OPTIONS can be used to fine-tune the I/O activity for the RAC cluster, where appropriate. However, adding such support should always be referenced against the operating system version and storage.

Hard Disk and Solid State Disk Drive Performance

The previous synopsis of RAC I/O characteristics illustrates that the most important aspects of I/O are twofold. First, it's essential to ensure that all committed data is absolutely guaranteed to be written to the storage. Second, the database must always be recoverable in a timely and predictable manner in the event of a node, cluster, or even site failure. In terms of the permanent storage, the database is recoverable only once the data has actually been written to the disks themselves.

Hard disk drives tend to offer dramatically increased capacity and a degree of improved performance as newer models are introduced; however, the basic technology has remained the same since hard disks were introduced by IBM in the 1950s.

Within each drive, a single actuator with multiple heads reads and writes the data on the rotating platters. The overall performance of the drive is determined by two factors: seek time and latency. Seek time is the time the heads take to move into the desired position to read or write the data. Disks in enterprise storage tend to be available in configurations of up to 10,000 or 15,000 rotations per minute (rpm), with the 15,000 rpm disks utilizing smaller diameter platters. The shorter distance traveled by the heads results in a lower seek time, typically around 3.5 to 4 milliseconds. Latency is the time taken for the platter to rotate to the correct position once the head is in position to read or write the correct sector of data. A full rotation takes approximately four milliseconds on a 15,000 rpm drive and six milliseconds on a 10,000 rpm drive; a full rotation represents the longest latency possible, so the average latency across all reads and writes will be approximately half this amount. The typical quoted value for latency on a 15,000 rpm drive is therefore two milliseconds. Once the head and platter are in the correct position, the time taken to actually transfer the data is negligible compared to the seek and latency times.

An additional important concept for achieving the maximum performance from a drive is that of destroking. Destroking is the process of storing data only on the outer sectors of the disk, leaving the inner sectors unused. The outer sectors store more information, and therefore enable the reading and writing of more data without repositioning the heads, which improves the overall access time at the cost of a reduction in capacity. Additionally, all enterprise class disk drives include a RAM memory-based buffer, typically in the order of 8MB to 16MB. This RAM improves the potential burst rate data transfer, which is the fastest possible data transfer a disk can sustain. This is measured by taking the time to transfer buffered data and excluding actual disk operations.

Hard disk performance seen by the Oracle Database is dependent on multiple factors, such as the drive itself, the controller, cables, and HBA. For this reason, a number of figures and terminologies are quoted in reference to overall performance, and it is beneficial to be able to identify and compare the relevant values. The first of these is the external transfer rate, which is usually given in megabytes per second (MB/s). For example, you might see 400MB/s for a drive with a 4Gb/s Fibre Channel interface. At first, this might seem to indicate that a single drive would be sufficient to sustain a 4Gb/s Fibre Channel HBA. However, this value is based on the burst rate, so it's mostly academic with respect to Oracle database performance. Second, the internal transfer rate is given in megabits, and it's usually shown as a range from the slowest to the fastest sustained performance expected from the drive, such as 1051 to 2225. As this value is shown in megabits, and there are 8 megabits to the megabyte, it is necessary to divide by 8 to reach a figure comparable to the external transfer rate. In this example, that translates to a range of 131MB/s to 278MB/s. With a 25% deduction for the disk controller, this example drive should be able to sustain a minimum transfer rate of 98.25MB/s. Third, the sustained transfer rate is the long-term read or write rate—the read figure being the higher of the two—that can be sustained for a longer period of time than the burst rate. With this example, the sustained transfer rate can be calculated as the average of the internal transfer rate values, as follows:

(((1051 + 2225) / 2) / 8) * 0.75 = 154MB/s

The sustained transfer rate quoted in a specification sheet is typically shown as a range of values around this average, such as 110MB/s to 171MB/s, with the actual value dependent on the overall hardware configuration. Therefore, the average value is the optimal figure for calculations based on sustained transfer rates. At the time of writing, average read performance of up to 150MB/s and average write performance of up to 140MB/s were typical values for best-in-class enterprise drives. Therefore, in an ideal theoretical scenario, a minimum of three such drives would be required to sustain the bandwidth of a single 4Gb/s Fibre Channel HBA on a single node in the cluster. For an additional comparison, more than 425 such drives, coupled with the appropriate PCI and HBA configuration, would be required to approach the bandwidth available within memory of a single two-socket node from the example given previously in this chapter, where the memory bandwidth was 64GB/s. This example is a further illustration of the importance of ensuring as much data as possible is cached in the local memory of the cluster nodes.

Finally, in addition to transfer rates, the value of IOPS (Input/Output Operations Per Second) is also important. Whereas transfer rates measure the data bandwidth of a drive, IOPS measures the number of individual operations that the drive can support within a second. Consequently, the actual IOPS value is dependent upon the workload, although it is typically measured on short random I/O operations, and it can be estimated from the average seek time and the average latency. Hard disk IOPS vary, but can also be improved by queuing technology—Tagged Command Queuing (TCQ) for ATA and SCSI disks and Native Command Queuing (NCQ) for SATA disks—up to values of 300 to 400 IOPS, with similar values for both read and write.
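For example, an estimate of the IOPS achievable by an individual 15,000 rpm drive can be made from the average seek time and average latency quoted previously:

1 / (0.004 + 0.002) = 166 IOPS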

We recommend that even when purchasing high-specification hard-disk drives and enterprise-class storage for an entry-level two-node cluster, a starting disk configuration should be, at a minimum, twelve drives in a RAID configuration (we will cover RAID in greater depth later in this chapter). Taking into account RAID technologies to mitigate the risks of drive failure, as well as the requirement to configure different RAID groups to account for the different storage profiles of the LGWR and DBWRn processes, the number of drives required to prevent performance from being negatively affected is likely to significantly exceed this number.

Solid State disk drives (SSDs) are a technology gaining prominence. They promise to both enhance disk drive performance and reduce power consumption; however, it is important to note that, as opposed to being an emerging technology, it is one that has taken decades to mature. For example, EMC Corporation began reintroducing SSDs in 2008 for the first time since 1987. Current generations of SSDs are based on NAND Flash memory, and they are either single-level cell (SLC) or multi-level cell (MLC) based. SLC stores a single bit per memory cell, whereas MLC applies different levels of voltage to store multiple bits per cell. MLC-based drives offer higher capacity, whereas SLC drives bring higher levels of write performance and durability. At the time of writing, the best-in-class of both SLC and MLC SSDs offer sustained read transfer rates of 250MB/s. However, whereas SLCs also deliver a sustained write transfer rate of 170MB/s, the equivalent value for an MLC is only 70MB/s. Write performance is also superior when SSDs are below full capacity. Additionally, MLCs have a lifespan of 10,000 write-erase cycles; for SLCs, this value is 100,000 write-erase cycles.

In contrast to hard disks, SSD latency varies and is dependent upon the operation. Typical latencies for read, write, and erase are 25 microseconds, 250 microseconds, and 0.5 milliseconds for SLCs; and 50 microseconds, 900 microseconds, and 3.5 milliseconds for MLCs, respectively. Nevertheless, these latencies offer a significant improvement over the seek times associated with hard disks. They also immediately point to the most significant gains being for random read operations, as illustrated by the associated IOPS values. For example, SLCs deliver up to 3,300 write IOPS, but 35,000 read IOPS. SSD IOPS values do depend more upon the transfer size when compared to hard disks, with the higher IOPS values measured at the 2KB and 4KB transfer sizes; SSD IOPS values approach hard disk equivalent IOPS at the larger 64KB and 128KB transfer sizes for both read and write. Queuing technologies can also enhance SSD performance, most significantly for read IOPS values; the addition of TRIM commands, which identify blocks deleted at the OS level, most benefits write IOPS values.

Both SLCs and MLCs have advantages for different areas of utilization. SLCs, for example, with their enhanced durability and higher write bandwidth but lower capacity, would be preferred over MLCs for redo log disks, where performance exceeds that of hard disks most notably due to the higher IOPS at smaller transfer sizes. MLCs, with their higher capacity and SLC-equivalent random read performance, deliver comparative advantages for read-focused datafile storage, such as indexes.

SSDs are a technology experiencing increasing levels of adoption. Where the opportunity arises, we recommend testing to determine whether the performance advantages are realized in your RAC environment; it is important to note that performance characteristics can vary between vendors. To maximize drive lifespan, it is advantageous to spread the write-erase cycles across the entire drive as evenly as possible. For this reason, all SSDs employ a logical block address (LBA) allocation table where each LBA is not necessarily stored in the same physical area of the disk each time it is written. Unlike a hard-disk drive, an SSD may take a period of time to optimize performance and reach a steady state of high sustained transfer rates. For this reason, it is especially important either to ensure that the SSDs you use are incorporated into a certified enterprise storage solution or that the drives are correctly prepared with a low level format (LLF) before doing direct comparisons with hard-disk drive solutions.

In a disk-for-disk comparison with hard-disk technology, SSDs offer improved levels of performance, with the most significant gains delivered in random I/O operations; the gap between the two technologies is narrower for the throughput of sequential operations, which are less dependent on hard disk seek times. Hard-disk drives also support much larger storage capacities and lower costs for that capacity. If you're considering power consumption across the cluster, then it is important to be aware that an active SSD will typically consume 2.4 watts of power, compared to 12.1 watts for a hard-disk drive. Therefore, the impact of your choice of disk on the power consumption of the storage array should be a consideration, in addition to the total power consumption at the server level and the power consumption of the CPU and memory components.

In addition to SSDs, Flash Storage technology can also be deployed in additional form factors, such as on-motherboard flash modules or Flash PCIe cards. It is important to note, from an architectural perspective, that in a RAC environment these devices are local to an individual node within the cluster, whereas an SSD can be viewed as a direct replacement for a hard-disk drive that can be integrated into an enterprise storage array shared between the nodes in the cluster. Flash PCIe cards can also connect directly into the PCIe interface on a server, whereas SSDs, if implemented directly into a host, will connect through the SATA protocol, as discussed later in this chapter. An SSD can also be used within an external storage array connected through an HBA (Host Bus Adapter) and a protocol such as Fibre Channel. Both of these media add latency to the I/O solution, but offer greater flexibility, manageability, and reliability to enterprise environments than a single-card configuration.

Flash PCIe cards in particular have relevance to an Oracle storage feature introduced with Oracle 11g Release 2 called the Database Smart Flash Cache. It is important to note, however, that in its initial release this technology, although supported by the Linux operating system, was disabled; patch 8974084 was released later to resolve this issue. Configured with the parameters DB_FLASH_CACHE_FILE and DB_FLASH_CACHE_SIZE, the Flash Cache is documented to be a second-tier buffer cache residing on Flash Storage that is presented by the operating system as a standard disk device. The Flash Cache can be considered in implementation in a manner similar to the CPU L2 cache, as discussed previously in this chapter. A Flash Cache sized up to a factor of 10 greater than the buffer cache is accessed in a hierarchical manner, supplying the RAM-resident buffer cache with data blocks fetched with access times in the region of 50 microseconds. Although this is an improvement over standard hard disk access times measured in milliseconds, it should be noted that accessing RAM-based memory within 100 nanoseconds completes 500 times faster than the Flash-based access; therefore, the Flash Cache should be considered as complementing the in-memory buffer cache, as opposed to replacing it. Thus the Flash Cache provides an additional storage tier local to each node in a RAC environment, between the RAM, which is also local to each node, and the disk-based storage, which is shared between all nodes.
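For illustration, the Database Smart Flash Cache is configured with the two initialization parameters sketched below; the device path and size shown are hypothetical examples only:

SQL> alter system set db_flash_cache_file='/dev/sdf1' scope=spfile sid='*';
SQL> alter system set db_flash_cache_size=64G scope=spfile sid='*';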

Whereas Flash PCIe cards connect by the PCIe interface through the ICH (both the PCIe interface and ICH were covered previously in this chapter), on-motherboard flash modules offer the potential of bringing Flash memory further up the performance hierarchy, above the PCIe interface. To realize this, a number of technology companies are collaborating on a standard interface to NAND flash chips, termed the ONFI (Open NAND Flash Interface). Through this interface, a server would be equipped with a low latency interface to Flash storage directly on the motherboard of the server.

At the time of writing, testing of the Database Smart Flash Cache was not possible. We see the most relevance for enhancing performance in single-instance environments. With RAC, however, there is already an additional storage tier between local RAM and shared storage—namely, accessing data blocks across the interconnect from a remote instance through Cache Fusion. Therefore, the main consideration is whether accessing a data block stored in the Flash Cache on a remote instance with Cache Fusion would outperform the remote instance accessing the data block directly from disk, such as from an SSD storage tier. The actual benefits would ultimately depend on all of the technology deployed in the solution discussed in this chapter, such as the CPU, memory, and the interconnect. If you're considering a virtualized solution with Oracle VM, which you will learn more about in Chapter 5, it is also worth noting that such a device would be accessed with drivers at the Dom0 layer, in which case accessing the second tier of the buffer cache external to the guest could have a potential impact on performance.

In a RAC environment, storage based on both hard disk and SSD is best evaluated at the shared storage level, and combinations of both SSD and hard disk drives can also be adopted to configure a hybrid storage environment to take advantage of SSD performance in conjunction with hard-disk capacity. However, whether you're adopting hard-disk drives, SSDs, or a combination of both, all RAC cluster solutions require storage solutions based on multiple disk drives, the configuration of which impacts both storage performance and capacity. For this reason, we will discuss how this is achieved with RAID.

RAID

RAID was defined in 1988 as a Redundant Array of Inexpensive Disks by David A. Patterson, Garth Gibson, and Randy H. Katz at the University of California, Berkeley. Later use often sees Inexpensive replaced with Independent; the two terms are interchangeable, without affecting the core concepts.

The original RAID levels defined are 1, 2, 3, 4, and 5 (and 0), with additional terminology, such as RAID 0+1, RAID 10, and RAID 50, added at later points in time.

In general, the term RAID is applied to two or more disks working in parallel, but presented as a logical single disk to the user to provide enhanced performance and/or resilience from the storage. RAID can be implemented in one of three ways:

  • Option 1: This option is referred to as Just a Bunch of Disks (JBOD). The JBOD is presented to volume-management software running on the server, which implements RAID on these devices and presents logical volumes of disks to the database. RAID processing takes place using the server CPU.

  • Option 2: A JBOD is presented to the server, but the RAID processing is done by a dedicated RAID host bus adapter (HBA) or incorporated onto the motherboard of the server itself, also known as RAID on motherboard (ROMB). In this case, RAID processing is done on the server, but not by the server CPU.

  • Option 3: The server is connected to an external RAID array with its own internal RAID controller and disks, and it is presented a configurable logical view of disks. RAID processing is done completely independently from the server.

For RAC, the RAID implementation must be cluster-aware, supporting multiple hosts accessing the same disks at the same time. This restriction rules out the second RAID option and limits the choice in the first category to third-party solutions (from companies such as HP PolyServe and Symantec Veritas) that may be overlaid with cluster file systems and Oracle ASM. A more common RAID solution for RAC is the third option, where you use an external dedicated storage array.

Whichever implementation is used, RAID is based on the same concepts and is closely related to the Oracle stripe-and-mirror-everything (SAME) methodology for laying out a database on storage for optimal performance and resilience. Therefore, in the following sections we will look at the most popular and practical implementations of RAID.

RAID 0 Striping

RAID 0 implements striping across a number of disks by allocating the data blocks across the drives in sequential order, as illustrated in Table 4-10. This table shows a stripe set of eight disks. The blocks from A to X represent not Oracle blocks, but logical blocks of contiguous disk sectors, such as 128KB. In this example, we have one stripe against eight disks, which makes our total stripe size 1MB (8 × 128KB).

Table 4.10. RAID 0

DISK 1    DISK 2    DISK 3    DISK 4    DISK 5    DISK 6    DISK 7    DISK 8
BLOCK-A   BLOCK-B   BLOCK-C   BLOCK-D   BLOCK-E   BLOCK-F   BLOCK-G   BLOCK-H
BLOCK-I   BLOCK-J   BLOCK-K   BLOCK-L   BLOCK-M   BLOCK-N   BLOCK-O   BLOCK-P
BLOCK-Q   BLOCK-R   BLOCK-S   BLOCK-T   BLOCK-U   BLOCK-V   BLOCK-W   BLOCK-X

This implementation does not provide redundancy; therefore, by the strictest definition, it is not RAID. The loss of a single drive within the set results in the loss of the entire RAID group. So, on one hand, this implementation offers a lower level of protection than the presentation of individual drives.

What it does offer, on the other hand, is improved performance in terms of operations per second and throughput over the use of individual drives for both read and write operations. The example in the table would in theory offer eight times the number of operations per second of an individual drive, but only seven times the average throughput for hard disks. The throughput gain is marginally lower with hard-disk drives because the heads on all drives must move in parallel to each stripe, increasing the latency to retrieve all of the blocks in a stripe set. On solid state-based storage, the average throughput gain is also a factor of eight.

For RAC, this implementation of RAID delivers the highest-performing solution; however, the lack of resilience makes it unsuitable for any production implementation. RAID 0 is heavily utilized, mainly for commercial database benchmarks. The combination of RAID 0 and destroking on hard-disk drives is ideal for an environment where data loss or the cost of unused disk space is not a significant consideration.
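As an illustration of the striping concept only, the following hypothetical command creates a software RAID 0 device across eight disks with a 128KB chunk size to match the example above. As discussed previously, standard software RAID of this kind is not cluster-aware, so a sketch like this applies to a single host rather than a RAC configuration:

[root@london1 ˜]# mdadm --create /dev/md0 --level=0 --raid-devices=8 --chunk=128 /dev/sd[b-i]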

RAID 1 Mirroring

In RAID 1, all data is duplicated from one disk or set of disks to another, resulting in a mirrored, real-time copy of the data (see Table 4-11).

Table 4.11. RAID 1

DISK 1      DISK 2     DISK 3      DISK 4     DISK 5      DISK 6     DISK 7      DISK 8
BLOCK-A  =  BLOCK-A    BLOCK-D  =  BLOCK-D    BLOCK-G  =  BLOCK-G    BLOCK-J  =  BLOCK-J
BLOCK-B  =  BLOCK-B    BLOCK-E  =  BLOCK-E    BLOCK-H  =  BLOCK-H    BLOCK-K  =  BLOCK-K
BLOCK-C  =  BLOCK-C    BLOCK-F  =  BLOCK-F    BLOCK-I  =  BLOCK-I    BLOCK-L  =  BLOCK-L

RAID 1 offers a minor performance improvement over the use of single drives because read requests can be satisfied from both sides of the mirror. However, write requests are written to both drives simultaneously, offering no performance gains. The main benefits of this form of RAID are that, in the event of a single drive failure or possibly multiple drive failures, the mirror is broken, but data availability is not impacted. This means that performance is only marginally impaired, if at all, until the drive is replaced. At this point, however, performance is impacted by the resilvering process of copying the entire contents of the good drive to the replacement, a lengthy and intensive I/O operation.

When implemented in software using a volume manager, the CPU load from mirroring is the lowest of any form of RAID, although the server I/O traffic is doubled, which should be a consideration in comparing software- to hardware-based RAID solutions.

The most significant cost of this form of RAID comes from using exactly double the amount of storage as that made available to the database.

RAID 10 Striped Mirrors

RAID 10 offers the advantages and disadvantages of both RAID 0 and RAID 1, first by mirroring all of the disks onto a secondary set, and then by striping across these mirrored sets. Table 4-12 shows this configuration.

Table 4.12. RAID 10

DISK 1      DISK 2     DISK 3      DISK 4     DISK 5      DISK 6     DISK 7      DISK 8
BLOCK-A  =  BLOCK-A    BLOCK-B  =  BLOCK-B    BLOCK-C  =  BLOCK-C    BLOCK-D  =  BLOCK-D
BLOCK-E  =  BLOCK-E    BLOCK-F  =  BLOCK-F    BLOCK-G  =  BLOCK-G    BLOCK-H  =  BLOCK-H
BLOCK-I  =  BLOCK-I    BLOCK-J  =  BLOCK-J    BLOCK-K  =  BLOCK-K    BLOCK-L  =  BLOCK-L

This form of RAID is usually only available with hardware-based RAID controllers. It achieves the same I/O rates that are gained by striping, and it can sustain multiple simultaneous drive failures, provided the failures do not occur on both sides of a mirror. In this example, up to four drives could fail, one from each mirrored pair, with all of the data, and therefore the database, remaining available. RAID 10 is very much the implementation of SAME that Oracle extols; however, like RAID 1, it comes at the significant overhead of an additional 100% requirement in storage capacity above and beyond the database.

RAID 0+1 Mirrored Stripes

RAID 0+1 is a two-dimensional construct that implements the reverse of RAID 10 by striping across the disks and then mirroring the resulting stripes. Table 4-13 shows the implementation of RAID 0 across disks 1, 2, 5, and 6, mirrored against disks 3, 4, 7, and 8.

Table 4-13. RAID 0+1

DISK 1    DISK 2        DISK 3    DISK 4
BLOCK-A   BLOCK-B   =   BLOCK-A   BLOCK-B
BLOCK-E   BLOCK-F   =   BLOCK-E   BLOCK-F
BLOCK-I   BLOCK-J   =   BLOCK-I   BLOCK-J

DISK 5    DISK 6        DISK 7    DISK 8
BLOCK-C   BLOCK-D   =   BLOCK-C   BLOCK-D
BLOCK-G   BLOCK-H   =   BLOCK-G   BLOCK-H
BLOCK-K   BLOCK-L   =   BLOCK-K   BLOCK-L

RAID 0+1 offers identical performance characteristics to RAID 10, and it has exactly the same storage-capacity requirements. The most significant difference occurs in the event of a drive failure. In the RAID 0+1 configuration, if a single drive fails—for example, Disk 1—then you lose access to the entire stripe set on disks 1, 2, 5, and 6. At this point, you then need to lose only one disk in the stripe set on the other side of the mirror to lose access to all of the data and, thus, the database that resides on it.

Therefore, you might reasonably question where a RAID 0+1 configuration would be used instead of RAID 10. If you are implementing RAID in hardware on a dedicated storage array, you would never use this approach; RAID 10 should always be preferred. However, if you are using software RAID, such as with Oracle ASM, in combination with a number of low-cost modular storage arrays, a RAID 0+1 configuration can be used to stripe the disks at the storage level for performance and then mirror them in software between multiple arrays for resilience.

RAID 5

RAID 5 introduces the concept of parity. In Table 4-14, you can see the eight disks striped similarly to RAID 0, except that each stripe also contains a parity block. This parity block is the same size as the data blocks, and it contains the result of the exclusive OR (XOR) operation on all of the bits in every data block in the stripe. The example shows the first three stripes; if the pattern were to continue across all of the disks, you would have the equivalent of seven disks of data and one disk of parity.

Table 4-14. RAID 5

DISK 1    DISK 2    DISK 3    DISK 4
BLOCK-A   BLOCK-B   BLOCK-C   BLOCK-D
BLOCK-I   BLOCK-J   BLOCK-K   BLOCK-L
BLOCK-Q   BLOCK-R   BLOCK-S   BLOCK-T

DISK 5    DISK 6    DISK 7    DISK 8
BLOCK-E   BLOCK-F   BLOCK-G   PARITY
BLOCK-M   BLOCK-N   PARITY    BLOCK-H
BLOCK-U   PARITY    BLOCK-O   BLOCK-P

In this RAID configuration, data can be read directly, just as it can in RAID 0. However, changing a data block requires reading the existing data and parity blocks, recalculating the parity, and then writing both the data block and the parity block back to disk. This additional overhead for writes on RAID 5 is termed the write penalty. Note that, from the properties of the XOR operation, a write needs to touch only the data block being changed and the corresponding parity block; the new parity can be calculated without accessing any of the other blocks in the stripe. When implemented on a hardware-based RAID controller, the impact of the parity calculation itself is negligible compared to the additional read and write operations, and the write penalty will range from 10% to 30%, depending on the storage array used. RAID 5 is less effective when implemented as a software solution because the data and parity information must be read back to the server, the new values calculated, and all of the information written back to the storage, giving a write penalty of approximately 50%.
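Because parity is a pure XOR, both the write path and the recovery path are easy to demonstrate. The following bash sketch uses small integers to stand in for entire blocks (the values are illustrative only) and shows the read-modify-write parity update followed by the reconstruction of a lost block:

# Three data "blocks" in one stripe, plus their parity
d1=$((0xA5)); d2=$((0x3C)); d3=$((0x0F))
p=$(( d1 ^ d2 ^ d3 ))
# Update d2: new parity = old parity XOR old data XOR new data,
# touching only the data block and the parity block
d2_new=$((0x77))
p=$(( p ^ d2 ^ d2_new )); d2=$d2_new
# Recover d2 after a "drive failure" by XORing parity with the survivors
printf "recovered block: 0x%02X (expected 0x77)\n" $(( p ^ d1 ^ d3 ))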

Recalling our discussion of storage fundamentals for RAC, this RAID configuration may appear completely unsuited to LGWR activity from a redo thread, which presents the system with a large number of sequential writes and no reads. From a theoretical standpoint, this unsuitability is true. However, good RAID 5 storage systems can take this sequential stream of writes and calculate the parity for an entire stripe in cache, without first needing to read the existing parity block from disk. All of the data blocks and the parity block are then written to disk in one action, similar to RAID 0; hence, in the example shown, eight write operations are required to commit blocks A to G to disk (seven data blocks plus one parity block), compared to the fourteen required for RAID 10, and these writes can be completed in parallel.

The primary attraction of RAID 5 is that, in the event of a disk failure, the parity block means that the missing data for a read request can be reconstructed by applying the XOR operation to the parity information and the data from the other drives. Therefore, to implement resiliency, the 100% overhead of RAID 10 is reduced to the much lower overhead of the parity disk. Unlike RAID 10, however, the loss of a second drive will lead to total data loss. The loss of a single drive also leads to the RAID 5 group operating in a degraded mode, because the additional load of reading all of the blocks and parity in a stripe to recalculate the data in the failed block increases the workload significantly. Similarly, when the failed drive is replaced, all of the blocks and the parity must be read to regenerate each missing block so that it can be written to the replacement drive.

With parity-based RAID, an important area to consider is the impact of the significantly larger disk media introduced since the original RAID concepts were defined. Despite the increase in disk size, the likelihood of an individual disk failure remains approximately the same; therefore, for a RAID 5 configuration of a given capacity, the chances of operating in degraded mode, or even of losing data, are comparatively higher. This risk has led to the emergence of composite configurations such as RAID 50, which stripes data across multiple RAID 5 groups. However, RAID 50 presents further challenges in terms of its performance impact for RAC.

Storage Cache

In using the term cache, we are referring to the RAM set aside to optimize the transfer of the Oracle data from the storage, and we are primarily concerned with the storage-array controller cache. For data access, other levels of cache exist at the hardware level through which the Oracle data will pass, for example, the multiple levels of CPU cache on the processor discussed previously in this chapter, as well as the buffer cache on the hard drive itself. Note that both are usually measured in a small number of megabytes.

As we have seen previously in this chapter, for a transactional system, the storage will experience a number of random reads and writes, along with a continual stream of shorter sequential writes. Data warehouse activity will either buffer much of the data in the SGA for single threaded queries, or request long, continual streams of reads for parallel queries with minimal write and redo log activity outside of data loading times.

For write operations, the effect of cache on the storage controller can be significant, up to a point. If the controller operates a write-back cache, the write request is confirmed as complete as soon as the data is in cache, before it has been written to disk. For Oracle, this cache should always be mirrored in RAM and protected by a backup battery power supply; in the event of a failure, this backup enables the data within the cache to be written to disk before the storage array shuts down, reducing the likelihood of database corruption. Write cache is also significant in RAID 5 systems because it enables the calculation of parity for entire stripes in cache, and it stores frequently used parity blocks in cache, thus reducing the read operations required.

This benefit from write cache, however, is valuable only to a certain extent in coping with peaks in throughput. In the event of sustained throughput, the data will be written to cache faster than the cached data can be written to disk. Inevitably, the cache will fill and throughput will operate at disk speed. Storage write cache should be seen as an essential buffer to enable the efficient writing of data back to disk. Allocating a large amount of cache, however, will never be a panacea for a slow disk layout or subsystem.

RAID Summary

Of the generally available RAID configurations discussed, a popular choice is RAID 5. The reasons are straightforward: RAID 0 is not practical in terms of redundancy, and RAID 1 is not suitable for performance. Although RAID 10 and RAID 0+1 offer the best performance and reliability, they have the significant cost of requiring a 100% overhead in storage compared to usable capacity. In theory, RAID 5 offers a middle ground, sacrificing a degree of performance for resilience, while also maintaining most of the storage capacity as usable.
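The capacity trade-offs just described reduce to simple arithmetic. The following bash sketch, using a hypothetical set of eight 500GB drives, compares the usable capacity under each of the RAID levels discussed:

disks=8; size_gb=500
echo "RAID 0 usable:        $(( disks * size_gb ))GB"          # no redundancy
echo "RAID 1/10/0+1 usable: $(( disks * size_gb / 2 ))GB"      # 100% mirroring overhead
echo "RAID 5 usable:        $(( (disks - 1) * size_gb ))GB"    # one disk of parity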

In practice, the costs and benefits are not as well defined. With RAID 1, RAID 0, and combinations of these levels, throughput calculations tend to be straightforward. With RAID 5, however, benchmarking and tests are required to clearly establish the performance thresholds. The reason for these testing requirements is that the advantages for RAID 5 tend to be stated from the viewpoint of a single sequential write stream, combined with random reads where performance is predictable. In practice, hosting more than one database environment on a storage system is highly likely; when adding RAC into the equation with multiple redo log streams, the I/O activity tends to be less identifiable than the theory dictates.

For example, as activity on the RAID 5 system increases, you are less likely to be taking direct advantage of storing and calculating whole stripes in cache and writing them to disk in one action. Each single-logical-block update will take at least four I/O operations (reading the old data and parity, and then writing the new data and parity), and the number of I/O operations increases further in the event of stripe crossing; that is, where the cluster file system or ASM stripe is misaligned with the storage stripe. Stripe crossing results in more than one parity operation per write, compounding the effect on throughput still further.

As more systems and activity are added to the RAID 5 storage, the impact becomes less predictable, meaning that RAID 10 is more forgiving of the practice of allocating storage ad hoc, rather than laying it out in an optimal manner. This difference between the theory and practice of RAID 5 and RAID 10 tends to lead to polarity between Oracle DBAs and storage administrators on the relative merits of RAID 5 and RAID 10. Both approaches are, in fact, correct from the viewpoints of their adherents.

In summary, when determining a RAID specification for RAC, RAID 10 with hardware is the optimal choice for fault tolerance, read performance, and write performance. Where RAID 5 is used, the careful planning, layout, and testing of the database across the storage can deliver a cost-effective solution, especially where the workload is predominantly read-only, such as a data warehouse. For a transactional system, RAID 10 for the redo log threads and RAID 5 for the datafiles can provide a practical compromise.

Storage Protocols for Linux

DBAs are presented with a multitude of terminologies and techniques for selecting and configuring storage for RAC. The key determining characteristic is the requirement for simultaneous access to the storage from all of the nodes in the cluster. From a practical standpoint, the options for satisfying this requirement on Linux include technologies such as FireWire, the Small Computer System Interface (SCSI), the Fibre Channel Storage Area Network (SAN), the Internet Protocol (IP) SAN, and Network Attached Storage (NAS).

The correct decision depends on a number of factors and, because no two environments are identical, these factors are likely to differ on a case-by-case basis. It is also important not to consider storage entirely in isolation. As seen previously, crucial functionality, such as RAID, can be implemented at either the hardware or the software level, and the level to be used should be determined before a purchasing decision is made.

An important point to note is that the storage protocol decision is not necessarily a decision between storage vendors. The leading storage vendors offer products that can support many different protocols from the same hardware, according to circumstance.

Though our prime consideration up to this point has been storage performance for RAC, additional, equally important decision criteria include the cost, resilience, manageability, and supportability of the entire database stack, from storage to server hardware and software. For example, ruling out a particular storage protocol because of its CPU overhead on the cluster nodes makes little sense without also taking into account the predicted workload and the CPU capacity available on those nodes.

Although, from a practical point of view, a RAC cluster can be built on low-cost storage with a medium such as FireWire, this configuration is not worthy of consideration in a production environment in terms of both performance and supportability. In reality, the storage and storage infrastructure will most likely be the most costly hardware components of the entire solution.

Despite the number of options available for the configuration of storage for RAC, you have, in fact, two major approaches to implementing I/O for RAC on Linux: block I/O and file I/O. Although these are often categorized as SAN and NAS, respectively, implementations such as the Internet Small Computer System Interface (iSCSI) mean that the distinctions are not always clearly defined. Therefore, we will look at the protocols available, beginning with the primary foundation of block I/O, the SCSI protocol.

SCSI

Some of the confusion regarding I/O in Linux stems from the fact that SCSI defines both a medium—that is, the physical attachment of the server to the storage—and the protocol for communicating across that medium. To clarify, here we refer to the SCSI protocol operating over a standard copper SCSI cable.

The SCSI protocol defines the method by which data is sent from the host operating system to peripherals, usually disk drives. This data is sent in chunks of bits (hence, the term block I/O) in parallel over the physical medium of a copper SCSI cable. Because this SCSI data is transmitted in parallel, all bits must arrive in unison.

The original implementation of SCSI in 1986 utilized an 8-bit data path at speeds of up to 5MB/s, and it enabled up to eight devices (including the host adapter itself) to connect to a single host adapter. Because of limitations in signal strength and the deviation of the signal from the original source, called jitter, the maximum distance a SCSI device could be placed from the host system, using a high-voltage differential, was effectively limited to under 25m.

SCSI has subsequently been revised and updated to speeds of 320MB/s with up to 16 devices, although with a shorter maximum bus length of 12m using a low-voltage differential.

Each SCSI device has an associated target ID, and this target can be further divided into subdevices identified by LUNs. Because a server can have several host adapters, and each one may control one or more SCSI buses, uniquely identifying a SCSI device means that an operating system must account for the controller ID, the channel (or bus) ID, the SCSI ID, and the LUN. This hierarchy is precisely the one implemented in Linux for addressing SCSI devices, and it can be viewed with the following command:

[root@london1 ˜]# cat /proc/scsi/scsi
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: ST3500320NS      Rev: SN05
  Type:   Direct-Access                    ANSI SCSI revision: 05
Host: scsi5 Channel: 00 Id: 00 Lun: 00
  Vendor: Slimtype Model: DVD A  DS8A1P    Rev: C111
  Type:   CD-ROM                           ANSI SCSI revision: 05
Host: scsi6 Channel: 00 Id: 00 Lun: 00
  Vendor: DGC      Model: RAID 0           Rev: 0324
  Type:   Direct-Access                    ANSI SCSI revision: 04
Host: scsi6 Channel: 00 Id: 00 Lun: 01
  Vendor: DGC      Model: RAID 0           Rev: 0324
  Type:   Direct-Access                    ANSI SCSI revision: 04

From this output, you can see that on SCSI host adapter 6, there are two drives on SCSI ID 0 and SCSI ID 1. Thus, this output indicates that the system has a single HBA, from which the two external disks are presented. Although /proc/scsi/scsi can still be used to view the SCSI disk configuration from the Linux 2.6 kernel onward, the SCSI configuration is in fact maintained under the /sys directory. For example, you can view the configured block devices using this snippet:

[root@london1 ˜]# ls -ld /sys/block/sd*
drwxr-xr-x 7 root root 0 Mar  8 23:44 /sys/block/sda
drwxr-xr-x 6 root root 0 Mar  8 23:44 /sys/block/sdb
drwxr-xr-x 6 root root 0 Mar  8 23:44 /sys/block/sdc

Similarly, you can use this snippet to determine the details of SCSI host adapter 6, through which the SCSI disks are connected:

[root@london1 ˜]# cat /sys/class/scsi_host/host6/info
Emulex LPe11000-M4 4Gb 1port FC: PCIe SFF HBA on PCI bus 0e device 00 irq 169

Finally, you can also view the major and minor device numbers for a specific device; in this case, the major number is 8 and the minor number is 32. This will prove useful for cross-referencing the physical storage layout with the configured disks, as discussed later in this section:

[root@london1 ˜]# cat /sys/class/scsi_device/6:0:0:1/device/block:sdc/dev
8:32

Using the /sys interface, you can also initiate a bus rescan to discover new devices, without needing to reload the relevant driver module, as in this snippet:

[root@london1 ˜]# echo "- - -" > /sys/class/scsi_host/host6/scan

The names of the actual SCSI devices, once attached, can be found in the /dev directory. By default, unlike some UNIX operating systems that identify devices by their SCSI bus address, Linux SCSI devices are identified by their major and minor device numbers. In earlier static implementations, there were either 8 or 16 major block numbers for SCSI disks. The initial 8 major block numbers were as follows: 8, 65, 66, 67, 68, 69, 70, and 71. The second set of 8 major block numbers added the following list of major numbers to the initial 8: 128, 129, 130, 131, 132, 133, 134, and 135.

Each major block number has 256 minor numbers, some of which are used to identify the disks themselves, while some others are used to identify the partitions of the disks. Up to 15 partitions are permitted per disk, and they are named with the prefix sd and a combination of letters for the disks and numbers for the partitions.

Taking major number 8 as an example, the first SCSI disk with minor number 0 will be /dev/sda, while minor number 1 will be the first partition, /dev/sda1. The last device will be minor number 255, corresponding to /dev/sdp15. This letter-and-number naming convention continues with /dev/sdq for major number 65 and minor number 0, runs through /dev/sdz15 at major number 65 and minor number 159, and picks up again with /dev/sdaa at major number 65 and minor number 160. Within 2.6 kernels, device configuration is dynamic with udev, based on the major and minor numbers for a device available under /sys (we discuss udev configuration in more detail in Chapter 6). With dynamic configuration, there is no requirement to create a large number of static devices in advance. For example, the following shows all of the created SCSI disk devices on a system:

[root@london1 ˜]# ls -l /dev/sd*
brw-r----- 1 root disk 8,  0 Mar  8 23:44 /dev/sda
brw-r----- 1 root disk 8,  1 Mar  8 23:44 /dev/sda1
brw-r----- 1 root disk 8,  2 Mar  8 23:44 /dev/sda2
brw-r----- 1 root disk 8, 16 Mar  8 23:44 /dev/sdb
brw-r----- 1 root disk 8, 17 Mar  8 23:44 /dev/sdb1
brw-r----- 1 root disk 8, 32 Mar  8 23:44 /dev/sdc
brw-r----- 1 root disk 8, 33 Mar  8 23:44 /dev/sdc1
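The arithmetic behind these names can be illustrated with a short bash sketch. Assuming a device on the first SCSI major number, 8, the sketch derives the name from the minor number alone; for minor number 33, it prints /dev/sdc1, matching the listing above:

minor=33
disk=$(( minor / 16 ))    # 16 minor numbers per disk: the disk itself plus 15 partitions
part=$(( minor % 16 ))    # 0 denotes the whole disk rather than a partition
letter=$(printf "\\x$(printf '%x' $(( 0x61 + disk )))")    # 0x61 is ASCII 'a'
[ "$part" -eq 0 ] && part=""
echo "/dev/sd${letter}${part}"    # prints /dev/sdc1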

It is possible to use any available major number once the original 16 reserved major numbers have been allocated; and, because there are 4,095 major numbers, a 2.6 kernel-based Linux implementation can support many thousands of disks, with the actual total dependent on the sum of all devices on the system. Also, /proc/partitions can be used to show the configured devices and their major and minor device numbers, which correspond with the preceding directory listing:

[root@london1 ˜]# cat /proc/partitions
major minor  #blocks  name

   8     0  488386584 sda
   8     1     104391 sda1
   8     2  488279610 sda2
   8    16  104857600 sdb
   8    17  104856223 sdb1
   8    32  313524224 sdc
   8    33  313516476 sdc1
 253     0  467763200 dm-0
 253     1   20512768 dm-1

Although the device-naming convention detailed here is the default, with udev, device names can be changed according to user-defined rules, and persistent bindings can be created between devices and names. In some respects, this approach offers functionality similar to that of ASMLIB.

SCSI, by itself, as a protocol and medium, is not generally associated with the term SAN. First, the number of devices that can be connected to a single SCSI bus is clearly limited, and a strict SCSI implementation restricts the access to the bus to one server at a time, which makes it unsuitable for RAC. Second, although using link extenders to increase the length of SCSI buses is possible, doing so is impractical in a data-center environment, especially when compared to alternative technologies.

These disadvantages have led to advancements in SCSI, most notably the development of Serial Attached SCSI (SAS) to overcome the limitations of the parallel transmission architecture and to include the support of I/O requests from more than one controller at a time. However, these developments need to be balanced against overcoming the disadvantages of the SCSI medium with alternative technologies, such as Fibre Channel (FC).

Despite these disadvantages, the SCSI protocol itself remains the cornerstone of block-based storage for Linux. Its maturity and robustness have meant that the additional technologies and protocols used to realize SANs are implemented at a lower level than the SCSI protocol; therefore, even though SCSI cabling will rarely be used, the device naming and presentation will remain identical, as far as the operating system and Oracle are concerned.

Fibre Channel and FCoE

FC was devised as a technology to implement networks and overcome the limitations in the standard at the time, which was Fast Ethernet running at 100Mb/s. Although FC is used to a certain extent for networks, these ambitions were never fully realized because of the advent of Gigabit Ethernet. However, the development of FC for networks made it compatible with the requirements for overcoming the limitations of SCSI-based storage as follows: transmission over long distances with a low latency and error rate, and the implementation of a protocol at a hardware level to reduce the complexities and CPU overheads of implementing a protocol at the operating system level.

These features enabled the connection of multiple storage devices within the same network; hence, the terminology, Storage Area Network. Note that although FC and SAN are often used interchangeably, they are not synonymous—a SAN can be implemented without FC, and FC can be utilized for other reasons apart from SANs. Also, similar to SCSI, with FC it's important to distinguish between the medium and the protocol. Despite the name, FC can be realized over copper or fiber-optic cabling, and the name Fibre Channel is correctly used to refer to the protocol for communication over either.

The FC protocol has five levels: FC-0 to FC-4. Levels FC-0 to FC-3 define the protocol from a physical standpoint of connectivity all the way through to communication. FC-4 details how the Upper Layer Protocols (ULPs) interact with the FC network. For example, it defines how the FC Protocol (FCP) implements the SCSI protocol understood by the operating system, with the FCP functionality delivered by a specific device driver. Similarly, IPFC implements the Internet Protocol (IP) over FC. There are also three separate topologies for FC: fabric, arbitrated loop, and point-to-point. By far, the most common of these topologies for RAC is the fabric topology, which defines a switch-based network enabling all of the nodes in the cluster to communicate with the storage at full bandwidth.

Optical cabling for FC uses two unidirectional fiber-optic cables per connection, with one for transmitting and the other for receiving. On this infrastructure, the SCSI protocol is transmitted serially, which means distances of up to 10 km are supported at high transfer rates. As well as supporting greater distances, fiber-optic cabling is also insensitive to electromagnetic interference, which increases the integrity and reliability of the link. The server itself must be equipped with an FC HBA that implements the FC connectivity and presents the disk devices to the host as SCSI disks.

At the time of writing, existing FC products support transfer rates of up to 8Gb/s (800MB/s). However, these available rates must always be considered in conjunction with the underlying disk performance. For example, you must have a minimum of three high-specification hard-disk drives in the underlying storage to outperform SCSI alone.

When implementing a SAN environment in a fabric topology, a method must be in place to distinguish between the servers to present the storage to. RAC is the perfect example to illustrate that no restriction exists in the number of hosts that are presented an identical view of the same storage. To realize this lack of restriction, all devices connected to a fabric are identified by a globally unique 64-bit identifier called a World Wide Name (WWN). When the WWN logs into a fabric, it is assigned a 24-bit identifier that represents the device's position within the topology, and zoning is configured at the switch layer to ensure that only the designated hosts have a view of their assigned storage. This explanation is, to some extent, an oversimplification, because for resilience systems can be equipped with multiple HBAs and connected to the same storage by different paths through separate switches to guard against hardware failure. This configuration is realized at the host level by multipathing software that unifies the multiple paths into a single disk image view (see Chapter 6 for a detailed implementation of I/O Multipathing with a device-mapper).

With these advantages of performance and connectivity over SCSI, FC has become the dominant protocol for SAN. However, it does have some distinct disadvantages that can inhibit realizing these benefits. One of the most significant of these is cost. The components for FC—the server HBA, optical cabling, and infrastructure—are significantly more expensive than their copper equivalents. Concerns have also been voiced about the lack of security implemented within the FC protocol, and the supported distances of up to 10 km are still a significant constraint in some environments.

In addition, arguably the most problematic area for Fibre Channel SAN, especially for RAC, is that of interoperability and support. To Oracle and the operating system, FC means interacting with SCSI devices, possibly with a clustered file system layered on top, and possibly using asynchronous and/or direct I/O. The device driver for the server HBA implements the SCSI interaction with the storage using FCP, and it could be running multipathing software across multiple HBAs. The storage is also likely to be dealing with requests from multiple servers and operating systems at the same time. Coupled with RAC, all of the servers in the cluster are now interacting simultaneously with the same disks; for example, one node in the cluster recovering after a failure could be scanning the storage for available disks while the other nodes are intensively writing redo log information. You can see the exponential increase in the possible combinations and configurations. For this reason, FC storage vendors issue their own compatibility matrices of tested combinations, of which clustering—Oracle RAC, in particular—is often an included category. These vendors often provide detailed information about supported combinations, such as the Oracle and Linux versions, the number of nodes in the cluster, multipathing software support, and HBA firmware and driver versions. In reality, this certification often lies behind the curve in Oracle versions, patches, and features; in Linux releases and support; in server architectures; and in many other factors, which makes planning and architecting the entire stack for compatibility a crucial process when working on FC SAN-based systems. Storage vendors should always be consulted with respect to support at the planning stage of an FC-based RAC solution on Linux.

A recent development is the introduction of Fibre Channel over Ethernet (FCoE), which seeks to mitigate some of the complexities of Fibre Channel while leveraging its advantages and dominance in SAN environments. FCoE replaces the FC-0 and FC-1 layers with Ethernet, thereby enabling Fibre Channel installations to be integrated into an Ethernet-based network. From an Ethernet-based technology perspective, 10GbE is the key driving technology, providing sufficient bandwidth for Ethernet networks to host both networking and storage traffic simultaneously—an architectural approach termed Unified Fabrics. This approach is of particular interest in RAC environments that require a single networking technology for both the cluster interconnect and SAN traffic, thereby increasing the flexibility of a clustered solution. A similar Unified Fabric approach can also be adopted with InfiniBand; however, with Ethernet as its underlying technology, FCoE integrates directly with the SAN at the network layer, as opposed to replacing both the network and SAN with a new protocol. At the time of writing, FCoE is still classed as an emerging technology, albeit one that, given sufficient levels of adoption, has considerable potential to simplify a Grid-based computing approach.

iSCSI

In addition to FCoE, there are three other competing protocols for transporting block-based storage over an IP-based network: Internet Small Computer Systems Interface (iSCSI), Internet FC Protocol (iFCP), and FC over TCP/IP (FCIP).

FCIP supports tunneling of FC over an IP network, while iFCP supports FCP level 4 implemented over the network. Whereas these protocols are often implemented in environments looking to interconnect or interoperate with existing FC-based SANs, the leader is iSCSI, which makes no attempt to implement FC. Instead, it defines a protocol to realize block-based storage by encapsulating SCSI into the Transmission Control Protocol (TCP) to be transmitted over a standard IP network. This native use of TCP/IP means that an existing Gigabit Ethernet infrastructure can be used for storage, unifying the hardware requirements for both storage and networking. The major benefits that stem from this unification include reducing cost of implementation, simplifying the administration of the entire network, and increasing the distance capabilities through the existing LAN or WAN—all while using well-understood IP-based security practices.

In iSCSI terminology, the client systems, which in this case are the RAC cluster nodes, require an iSCSI initiator to connect to the storage target. Initiators can be realized in either software or hardware. The software initiator is, in effect, a driver that pairs the SCSI drivers with the network drivers to translate the requests into a form that can be transferred. This translation can then be used with a standard Ethernet network interface card (NIC) connected to the storage by Category-5 cabling.

Like FC, once successfully implemented, iSCSI lets Oracle and Linux view and interact with the disks at the SCSI protocol level. Therefore, the Oracle installation needs to be based on ASM-configured block devices or on a cluster file system created on the presented storage. iSCSI software drivers for Linux are available as part of the default Linux install, and the configuration of iSCSI on Linux is detailed in Chapter 5, in the context of configuring storage for virtualization.

An alternative for iSCSI is the use of specialized iSCSI HBAs to integrate the entire iSCSI protocol stack of Ethernet, TCP/IP, and iSCSI onto the HBA, removing all of I/O protocol processing from the server. At the time of this writing, these HBAs are available at gigabit speeds over standard Category-5 cabling. Products supporting 10Gb speeds are available with fiber-optic cabling or the copper-based 10GBase-CX4 cable associated with InfiniBand. The downside: The costs approach those of the better-established FC, despite the improved performance.

SATA

Serial Advanced Technology Attachment (SATA) has received a degree of attention as a means of providing low-cost, high-capacity storage in comparison to SCSI- and FC-based disks. This lower cost for SATA, however, comes with some significant disadvantages when deployed with standard hard disks. SATA is based on a point-to-point architecture in which disk requests cannot be queued; therefore the maximum throughput is significantly lower compared to its higher-cost rivals, especially for random I/O. Rotational speeds are also lower, so latency tends to be higher for SATA. In its original specification, SATA supports throughput of 1.5Gb/s (150MB/s).

In addition, the maximum cable length for SATA is 1 m, introducing some physical limitations that mean, in practice, SATA–based arrays need to be connected to the target host systems by existing SAN or NAS technologies.

The most significant disadvantage cited for SATA is reliability: the mean time between failures (MTBF) can be up to three times worse than that of SCSI- or FC-based disks. However, this measurement is based on SATA implementations of hard-disk drives. SATA is also a protocol commonly used for SSDs; therefore, the MTBF values should not be taken as applying to all SATA-based storage. It is also worth noting that, although SATA is commonly used for SSDs, it is not an exclusive requirement; the relationship is primarily driven by the production of SSDs that are widely compatible with notebook, desktop, and server implementations. For example, SSDs are also available with a Fibre Channel interface.

In a RAID configuration with SATA, the practical maximum is 12 to 16 devices per controller. This means that database storage using this type of disk technology often requires the management of multiple storage arrays from the target host level; hence, the applicability of volume-management software such as ASM. As we noted earlier in our discussion of RAID 0 and RAID 1, using multiple storage arrays to achieve satisfactory levels of performance dictates that RAID 0+1 be used, with the disks striped at the array level and mirrored between arrays at the host level.

SATA has advanced to the SATA II specification, which doubled the throughput to 3Gb/s (300MB/s), and a further draft specification, introduced in 2008, increases throughput to 6Gb/s. When used in conjunction with SSDs, this means that SATA can also compete with SCSI and Fibre Channel disks in the performance sector. SSDs take advantage of some of the performance features available with SATA II—in particular, Native Command Queuing (NCQ), which permits the queuing of up to 32 outstanding requests per device.

Using Block-Based Storage

Once the block storage is available on all of the hosts in the cluster and visible in /proc/scsi/scsi, you have a choice about how to configure the devices in a form for use by Oracle on more than one system simultaneously. Standard file systems such as ext2 or ext3 are often used for single-instance environments. However, these file systems are not applicable for use in clustered environments where data may be cached and blocks modified on multiple nodes without synchronization between them, resulting in the corruption of the data.

From a protocol standpoint, the simplest approach is for Oracle to use the devices directly as raw character devices. However, although raw devices are supported for upgrades and can be configured from the command line with SQL*Plus, they are no longer supported within the OUI for new installs, and Oracle has indicated that raw devices will not be supported in releases subsequent to Oracle 11g. Block devices, on the other hand, are the storage underlying Oracle ASM, as discussed in Chapter 9, and ASM must use these devices for storing database files directly, without an intervening file system; this includes ACFS, which is not used for storing the database files themselves.

The alternative to using the block devices configured with ASM is to use the corresponding block-based device with a file system designed specifically for operating in a clustered environment that also supports the storage of database files. In Chapter 5 we look at how to configure OCFS2 in the context of virtualization.

Linux I/O Scheduling

I/O scheduling determines the order in which disk I/O requests are processed. In Linux, I/O scheduling is implemented by the kernel, and with 2.6 Linux kernel releases, four different I/O schedulers are available. Selecting the most appropriate scheduler may influence the performance of your block-based storage. The schedulers available are as follows:

  • Completely Fair Queuing (CFQ)

  • Deadline

  • NOOP

  • Anticipatory

The scheduler can be selected by specifying the elevator option of the Linux boot loader. For example, if you are using GRUB, the following options would be added to the /boot/grub/grub.conf file: CFQ uses elevator=cfq, Deadline uses elevator=deadline, NOOP uses elevator=noop, and Anticipatory uses elevator=as. With 2.6 kernels, it is also possible to change the scheduler for particular devices during runtime by echoing the name of the scheduler into /sys/block/devicename/queue/scheduler. For example, in the following listing we change the scheduler for device sdc to deadline while noting that device sdb remains at the default:

[root@london1 ˜]# cat /sys/block/sdc/queue/scheduler
noop anticipatory deadline [cfq]
[root@london1 ˜]# echo deadline > /sys/block/sdc/queue/scheduler
[root@london1 ˜]# cat /sys/block/sdc/queue/scheduler
noop anticipatory [deadline] cfq
[root@london1 ˜]# cat /sys/block/sdb/queue/scheduler
noop anticipatory deadline [cfq]

It is important to note that such dynamic scheduler changes are not persistent over reboots.
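To make a scheduler change persistent, the elevator option should instead be appended to the kernel line in /boot/grub/grub.conf, as described previously. For example, a kernel line selecting the Deadline scheduler might look like the following, where the kernel version and root device shown are illustrative only:

kernel /vmlinuz-2.6.18-194.el5 ro root=/dev/VolGroup00/LogVol00 elevator=deadline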

CFQ is the default scheduler for Oracle Enterprise Linux releases, and it balances I/O resources across all available resources. The Deadline scheduler attempts to minimize the latency of I/O requests with a round robin–based algorithm for real-time performance. The Deadline scheduler is often considered more applicable in a data warehouse environment, where the I/O profile is biased toward sequential reads. The NOOP scheduler minimizes host CPU utilization by implementing a FIFO queue, and it can be used where I/O performance is optimized at the block-device level. Finally, the Anticipatory scheduler is used for aggregating I/O requests where the external storage is known to be slow, but at the cost of latency for individual I/O requests.

Block devices are also tunable with the parameters in /sys/block/devicename/queue:

[root@london1 queue]# ls /sys/block/sdc/queue
iosched            max_sectors_kb  read_ahead_kb
max_hw_sectors_kb  nr_requests     scheduler

The tunable parameters include nr_requests, which should be set according to the bandwidth capacity of your storage, and read_ahead_kb, which specifies the data prefetching for sequential read access; both have default values of 128.
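For example, read_ahead_kb can be inspected and adjusted at runtime in the same way as the scheduler; the value of 256 used here is purely illustrative, and any change should be validated by benchmarking:

[root@london1 ˜]# cat /sys/block/sdc/queue/read_ahead_kb
128
[root@london1 ˜]# echo 256 > /sys/block/sdc/queue/read_ahead_kb
[root@london1 ˜]# cat /sys/block/sdc/queue/read_ahead_kb
256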

CFQ is usually expected to deliver the optimal level of performance to the widest number of configurations, and we recommend using the CFQ scheduler, unless evidence from your own benchmarking shows another scheduler delivers better I/O performance in your environment.

With the CFQ scheduler, I/O throughput can be fine-tuned for a particular process using the ionice utility to assign the process to a scheduling class, ranging from idle at the lowest through best-effort to real-time at the highest. For example, to change the scheduling class of the LGWR process, you begin by identifying its process ID:

[root@london1 ˜]# ps -ef | grep lgwr | grep -v grep
oracle   13014     1  0 00:21 ?        00:00:00 asm_lgwr_+ASM
oracle   13218     1  0 00:22 ?        00:00:00 ora_lgwr_PROD1

In this case, querying the process shows that no scheduling priority has been set:

[root@london1 ˜]# ionice -p 13218
none: prio 0

The scheduling classes are 1, 2, and 3 for real-time, best-effort, and idle, respectively. For the real-time and best-effort classes, there are eight priority levels, ranging from 0 to 7, that determine the time allocated when the process is scheduled. The following example shows how to set the LGWR process to the real-time class with a priority level of 4:

[root@london1 ˜]# ionice -c1 -n4 -p 13218
[root@london1 ˜]# ionice -p 13218
realtime: prio 4

In any case, we recommend testing your RAC storage configuration and chosen I/O scheduler. If, after testing, you conclude that you would benefit from using a non-default I/O scheduler, then you should ensure that you use the same scheduler on each and every node in the cluster.

NFS and NAS

So far we have covered the realization of block-based storage for Oracle RAC, as well as the technology underlying the implementation of SANs. In contrast to SAN, there is also NAS, which takes another approach in providing storage for RAC.

Although the approach is different, many storage vendors support both SAN and NAS implementations from the same hardware; by definition, iSCSI spans both. To draw the distinction, we will use NAS to define storage presented over a network as a file-based system, as opposed to a block-based one. For Linux and RAC, the only file system that you need to be concerned with is the Network File System (NFS) developed by Sun Microsystems. Oracle supports RAC against NFS solutions from EMC, Fujitsu, HP, IBM, NetApp, Pillar Data Systems, and Oracle Sun StorageTek. In addition, with Oracle 11g, the Direct NFS client was introduced to implement NFS connectivity directly within the Oracle software itself, without requiring the use of the Linux NFS client.

As we have seen with block-level storage, Oracle RAC can either use the block devices directly or with ASM; or, for greater manageability, it can layer a file system such as OCFS2 on top of the block devices at the operating system level. The challenge for such a clustered-file system is to manage the synchronization of access between the nodes to prevent corruption of the underlying storage when a node is ejected from the cluster. Therefore, such an approach requires a degree of communication between the nodes. With a NAS approach, however, the file system itself is presented by the storage device and mounted by the RAC nodes across the network. This means that file-system synchronization is reconciled at the storage level. NAS storage also enables further file system features such as journaling, file-based snapshots, and volume management implemented and managed at the storage level for greater efficiency.

At the hardware level, NAS is realized by a file server, and the RAC nodes communicate with the file server through standard Ethernet connectivity. We want to stress that, for a RAC cluster, this storage network should be an absolutely private, dedicated network carrying strictly no other non-storage-related traffic; the RAC cluster should not simply be connected over the local LAN, and a NAS configuration requires the same respect accorded to FC. Because file servers were originally designed for file sharing, their functionality is compatible with the RAC shared-all database approach. Also, because file servers are designed for this one function only, they tend to be highly optimized and efficient compared to a generic operating system in this role. A RAC cluster should not be based on an NFS file system served by anything but specialized hardware designed for that purpose and supported under the RAC Technologies Compatibility Matrix (RTCM) for Linux Clusters.
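When the Linux NFS client is used (rather than Direct NFS), Oracle specifies particular NFS mount options for database files. As a sketch only, an /etc/fstab entry might resemble the following, where the filer name and directory paths are hypothetical and the exact options for your platform and version should be confirmed with Oracle:

nasfiler1:/vol/oradata  /u02/oradata  nfs  rw,bg,hard,nointr,tcp,vers=3,timeo=600,rsize=32768,wsize=32768,actimeo=0  0 0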

At the connectivity level, as discussed for iSCSI, generally available Ethernet technology currently operates at gigabit speeds below that of the commonly available bandwidth for FC, although increasing adoption of 10GbE provides a more level playing field in this respect. A commonly quoted restriction is that the data is transferred from the nodes through TCP/IP, which is implemented by the server CPU, and the communication between the CPU and the NIC results in a higher number of CPU interrupts. The use of the 11g Direct NFS client and server-based network acceleration technologies such as TOEs can negate some of this impact, but the additional overhead will impact the level of performance available to some degree.

Evaluating Storage Performance

As you have seen, a considerable number of factors affect storage performance in an Oracle RAC environment, from the processor and memory that handle the I/O requests, down through the system to the disk drives themselves. As such, evaluating the potential storage performance of a system from an entirely theoretical standpoint can only be accurate to a certain degree; the optimal way to understand the nuances of each approach is to test and compare the solutions applicable to your environment.

Ultimately, the best way of determining whether a storage configuration is capable of supporting an Oracle RAC installation is to measure that storage under an Oracle Database workload. However, while noting the relevance of measuring actual Oracle Database workloads, Oracle also provides a tool called Orion, available with both the Grid Infrastructure and Oracle Database software installations in Oracle 11g Release 2, which enables the testing of storage performance in isolation.

If you're running against block devices, Orion should be run against the block devices themselves, not against a file in a file system mounted upon them. The user running the software should have permission to read and write to those devices, none of which should be in use by any other software or contain data that requires preserving. Set the names of your disks in a file with the extension .lun, as in this example:

[root@london1 bin]# cat prod.lun
/dev/sdd

When running Orion, you should ensure that the LD_LIBRARY_PATH environment variable is set to include either $ORACLE_HOME/lib or the equivalent directory in the Grid Infrastructure software home; you can find details for how to do this in Chapter 6.
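For example, assuming a standard database software home, a minimal setting might look like this:

[oracle@london1 bin]$ export LD_LIBRARY_PATH=$ORACLE_HOME/lib:$LD_LIBRARY_PATH

With the environment set, you can run the Orion tool with the -help argument to display its usage information: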

[oracle@london1 bin]$ ./orion -help
ORION: ORacle IO Numbers -- Version 11.2.0.1.0

ORION runs IO performance tests that model Oracle RDBMS IO workloads.
It measures the performance of small (2-32K) IOs and large (128K+) IOs
at various load levels.
...

There are a number of options for configuring the type of I/O workload; however, an initial set of data can be collected as follows:

[root@london1 bin]# ./orion -run simple -testname prod
ORION: ORacle IO Numbers -- Version 11.2.0.1.0
prod_20091113_1511
Calibration will take approximately 9 minutes.
Using a large value for -cache_size may take longer.

Within the working directory, a number of data files are produced, including a summary of the performance data that corresponds to the storage-related values discussed previously in this chapter: megabytes per second of throughput, IOPS, and latency:

Maximum Large MBPS=91.09 @ Small=0 and Large=1
Maximum Small IOPS=1454 @ Small=5 and Large=0
Minimum Small Latency=2906.70 usecs @ Small=1 and Large=0

Workloads other than simple include normal, oltp, and dss, with oltp reporting IOPS and latency, and dss focusing on megabytes per second. Some advanced configurations also give more opportunity to fine-tune the I/O workload, such as specifying block sizes different from the default.
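For example, an OLTP-style test against the same .lun file uses the same syntax as the simple run shown previously:

[root@london1 bin]# ./orion -run oltp -testname prod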

I/O calibration is also available within the Oracle 11g database through the CALIBRATE_IO procedure of the DBMS_RESOURCE_MANAGER package. This procedure implements a read-only workload, and it is run as follows by specifying, as input parameters, the number of physical disks on which the Oracle database is installed and the maximum tolerated latency in milliseconds:

SQL> set serveroutput on
declare
  max_iops              integer;
  max_mbps              integer;
  actual_latency        integer;
begin
  dbms_resource_manager.calibrate_io (
    num_physical_disks => 15,
    max_latency        => 10,
    max_iops           => max_iops,
    max_mbps           => max_mbps,
    actual_latency     => actual_latency);
  dbms_output.put_line ('IOPS = ' || max_iops);
  dbms_output.put_line ('MBPS = ' || max_mbps);
  dbms_output.put_line ('Latency = ' || actual_latency);
end;
/

The procedure may take up to 10 minutes to complete, and the status of the calibration process can be viewed as follows:

SQL> select inst_id, status from gv$io_calibration_status;

   INST_ID STATUS
---------- -------------
         1 IN PROGRESS
         2 IN PROGRESS

Note that the preceding runs on one instance in the cluster only, although Oracle recommends that all instances in the cluster remain open to calibrate the read workload across the entire cluster. Finally, the output parameters are reported by the procedure, as in this example:

IOPS = 2130
MBPS = 186
Latency = 9

PL/SQL procedure successfully completed.

This output can also be viewed subsequently in the view DBA_RSRC_IO_CALIBRATE and compared against the start and end times of individual test runs.
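For example, assuming the view's column names match the procedure's output parameters, a query such as the following retrieves the calibration results:

SQL> select max_iops, max_mbps, latency from dba_rsrc_io_calibrate;

  MAX_IOPS   MAX_MBPS    LATENCY
---------- ---------- ----------
      2130        186          9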

Summary

In this chapter, we have looked at the RAC Technologies Compatibility Matrix (RTCM) and investigated the components available to you for building RAC clusters on Linux. This chapter should give you, as an Oracle DBA, the knowledge to intelligently interpret hardware specifications and drill down into the features that are required to create a successful RAC environment. It should also enable you to weigh all of the components, from the processor and memory to the network interconnect and storage, to determine their relative advantages and limitations within a RAC environment. The goal of this approach is to select the optimal configuration of hardware to achieve the desired performance, scalability, and reliability.
