Chapter 2. RAC Concepts

Chapter 1 provided an introduction to Real Application Clusters; in this chapter, we will move on to discuss the concepts behind RAC in more detail. The topics introduced in this chapter form the basis of a successful implementation of a stable and highly available RAC environment.

We will begin with a look at cluster concepts in general, and then see which of them are available in RAC. Next, we will introduce the cluster layer foundation itself—Oracle Grid Infrastructure—before taking a closer look at the clustered database. Towards the end of this chapter, we will introduce some of the more interesting 11 g Release 2 features from a RAC database administrator's point of view.

There is a lot of important information to be absorbed—information that will serve as the foundation for the following chapters, where we will dig deeper into the topics raised here.

Clustering Concepts

A computer cluster hides from its users the fact that it consists of multiple nodes; from an outside point of view, a cluster is really just a single entity. Clustering has a long history, and Digital Equipment Corporation's VAX/VMS operating system is often credited as one of the pioneers in supporting clustering. Clusters address the problem that a single server cannot provide the throughput and performance an enterprise needs. They also prevent a total outage of a service in the event of a node failure. Therefore, clusters are used for reliability, scalability, and availability. Computer clusters exist for many purposes, from providing supercomputing power for the systems in the Top 500 list performing ultra-complex calculations, to protecting a web server from crashes. Before we explore RAC in the broader context of clustering, we will introduce the most common clustering concepts first.

Configuring Active/active Clusters

Clusters configured in an active/active role feature cluster members that all serve user requests—none of them is idly waiting on standby. In most setups, the hardware of the cluster members is identical to prevent skewed performance and to facilitate effective load balancing across the cluster. Active/active clusters require more complex management software to operate because access to all resources, such as disk and memory, needs to be synchronized across all nodes. Most often, a private interconnect will be employed as a heartbeat mechanism. The cluster management software will have to detect problems with nodes, such as node failures and intercluster communication problems.

The so-called "split-brain" is a dreaded condition in clustering where the communication between cluster members is interrupted while all the cluster nodes are still running. Adhering to its programming, the cluster software in each of the cluster halves will try to fail over resources of the nodes it considers crashed. The danger of data corruption looms here: if the application communicates with effectively unsynchronized cluster halves, different data can be written to disk. A split brain scenario is a well-known threat to clustering, and vendors of cluster management software have got the right tools to prevent this from happening.

Oracle's cluster management software is referred to as Cluster Ready Services in 10g Release 1 and Clusterware in 10g Release 2. In 11g Release 2, it was finally rebranded as Grid Infrastructure, and it uses the cluster interconnect and a quorum device, called a voting disk, to determine cluster membership. A voting disk is shared by all nodes in the cluster, and its main use comes into play when the interconnect fails. A node is evicted from the cluster if it fails to send heartbeats through the interconnect and the voting disk. A voting disk can also help in cases where a node can't communicate with the other nodes via the interconnect, but still has access to the voting disk. The subcluster elected to survive this scenario will send a node eviction message to the node. Clusterware performs node eviction using the STONITH algorithm. This is short for shoot the other node in the head—a software request is sent to the node to reboot itself. This can be tricky if the node to be rebooted is hung and cannot respond to a software reset. But luckily, the hardware can assist in these cases, and Grid Infrastructure has support for IPMI (Intelligent Platform Management Interface), which makes it possible to issue a node termination signal. In the event of a node failure or eviction, the remaining nodes should be able to carry on processing user requests. Software APIs should make the node failure transparent to the application where possible.

Implementing Active/passive Clusters

An active/passive cluster operates differently from an active/active cluster. Hardware in an active/passive cluster is also identical or nearly identical, but only one of the two nodes is processing user requests at a time. The cluster management software constantly monitors the health of the resource(s) in the cluster. Should a resource fail, the cluster-management layer can try to restart the failed resource a number of times before it fails it over to the standby node.

Depending on the setup, the cluster resource can be located on shared storage or on a file system that is also failed over as part of the resource failover. Use of a shared file system offers advantages over an unshared file system, which might have to be checked for corruption through fsck(8) before being remounted on the standby node. Veritas Cluster Suite, Sun (Oracle) Cluster, and IBM HACMP are examples of cluster managers that allow the setup of active/passive clusters.

A little known fact is that using Oracle Grid Infrastructure makes it very simple to set up a cost effective active/passive cluster. Leveraging the Grid Infrastructure API and Oracle's Automatic Storage Management as a cluster logical volume manager makes it easy to constantly monitor a single instance Oracle database. In case of a node failure, the database will automatically be relocated to the standby node. Depending on the fast_start_mttr_target initialization parameter and the size of the recovery set, the failover to the standby node can be very quick; however, users will be disconnected from the database as part of the failover process.

Configuring a Shared-All Architecture

A cluster in which all nodes have concurrent access to shared storage and data is termed a shared-all or shared-everything configuration. Oracle RAC is an example of a shared-everything architecture: a single database located on shared storage is accessed by multiple database instances running on cluster nodes. In Oracle terminology, an instance refers to the non-persistent memory structures, such as the shared global area (SGA), and the background and foreground user processes. In contrast, the database is the persistent information stored in data files on disk. With RAC, unlike a shared-nothing design, instance failure doesn't translate into a loss of access to the information mastered by that instance. After an instance failure, one of the remaining instances in the cluster will perform instance recovery, which you'll learn more about in Chapter 3, which covers the RAC architecture. If this happens, all remaining instances will continue to be accessible to users. Using high availability technologies such as Fast Connection Failover for connection pools or Transparent Application Failover can mask the instance failure from the user to varying degrees. The failed instance will eventually rejoin the cluster and reassume its share of the workload.

Configuring a Shared-Nothing Architecture

A shared-nothing database cluster is configured as a number of members in a cluster, where each node has its own private, individual storage that is inaccessible to the others in the cluster. The database is horizontally partitioned between the nodes; the result set of a query is the union of the result sets from the individual nodes. The loss of a single node results in the inability to access the data managed by the failed node; therefore, a shared-nothing cluster is often implemented as a number of individual active/passive or active/active clusters to increase availability. MySQL Cluster is an example of a shared-nothing architecture.

Exploring the Main RAC Concepts

A Real Application Clusters (RAC) database is an active/active cluster employing a shared-everything architecture. To better understand what RAC is, as well as which components are employed, this chapter will introduce each of the following components involved from the bottom up:

  • Cluster nodes

  • Interconnect

  • Oracle Grid Infrastructure

  • Automatic Storage Management

  • Real Application Clusters

  • Global Resource Directory

  • Global Cache Service

  • Global Enqueue Service

  • Cache Fusion

It should be noted at this point that the software stack has evolved, and some of the concepts we discuss in this chapter have changed with Oracle 11g Release 2. We have dedicated a section at the end of this chapter to the new features; this section will provide you with a more complete picture about what's new in the latest release. Before doing that, however, let's have a closer look at each of the concepts just mentioned.

Working with Cluster Nodes

The basic building block of any cluster is the individual cluster node. In Oracle RAC, the number of nodes you can have in your cluster is version dependent. Publicly available documentation states that Oracle 10.2 Clusterware supports 100 nodes in a cluster, while 10.1 Cluster Ready Services supported 63 instances. Even though RAC-based applications continue to be available when individual nodes fail, every effort should be made to ensure that individual components in the database server don't prove to be a single point of failure (SPOF).

Hot swappable components, such as internal disks, fans, and other components should be found on the part list when procuring new hardware. Additionally, power supplies, host bus adapters, network cards, and hard disks should be redundant in the server. Where possible, components should be logically combined, either by hardware in form of RAID controllers or software. Examples of the latter would be software RAID, network bonding, or multipathing to the storage area network. Attention should also be paid to the data center, where the nodes are hosted—you should use an uninterruptible power supply, sufficient cooling, and professional racking of the servers as a matter of course. A remote lights-out management console should also be added to the item list—sometimes a node hangs for whatever reason, and it urgently needs troubleshooting or rebooting.

Leveraging the Interconnect

The cluster interconnect is one of the main features of Oracle RAC. Not only does the interconnect allow the cluster to overcome the limitations of the block-pinging algorithm when transferring data blocks from one instance to another, it can also be used as a heartbeat and general communication mechanism. Interconnect failure will result in a cluster reconfiguration to prevent a split-brain situation: one or more cluster nodes will be rebooted by Grid Infrastructure.

It is possible to have a different interconnect for RAC and Grid Infrastructure, in which case you need to configure RAC to use the correct interconnect. The interconnect should always be private—no other network traffic should ever be sent across the wires. Users of RAC can choose from two technologies to implement the interconnect: Ethernet and Infiniband.
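
If you need to check or change which network Grid Infrastructure treats as the cluster interconnect, the oifcfg utility can be used; the interface names and subnets below are placeholders for illustration only. A database instance can additionally be pinned to a specific private network with the cluster_interconnects initialization parameter, although this overrides the Grid Infrastructure setting and is usually unnecessary.

[grid@node1 ~]$ oifcfg getif
eth0  10.1.1.0       global  public
eth1  192.168.100.0  global  cluster_interconnect

# Register a different private network as the interconnect (example subnet only)
[grid@node1 ~]$ oifcfg setif -global eth2/192.168.200.0:cluster_interconnect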

Using an Ethernet-based Interconnect

The use of Gigabit Ethernet for the cluster interconnect is probably the most common setup in existence. The cluster daemons (we will cover these in detail later in this chapter) use TCP/IP as a means of communication. The Cache Fusion traffic used for cache coherence, which is different from the daemon communication, will make use of the User Datagram Protocol (UDP). UDP is found on the same transport layer as the better known Transmission Control Protocol (TCP). Where the latter is connection-oriented and uses explicit handshaking to guarantee that network packets arrive in order and that lost packets are retransmitted, UDP is stateless: it is a fire-and-forget protocol. UDP simply sends a packet (datagram) to the destination. The main advantage of UDP over TCP is that it's comparatively lightweight, which is the reason it was chosen as the transport protocol.

Note

Please refrain from using cross-over Ethernet cables in a two-node cluster; the cluster interconnect must be switched, so the use of cross-over cables is explicitly not supported!

Efficiency and performance of the cluster interconnect can be increased by using so-called jumbo frames. Ethernet frames come in various sizes, usually limited to 1,500 bytes by what is referred to as the maximum transmission unit (MTU). The frame size determines how much data can be transported in a single Ethernet frame; the larger the frame, the higher the payload. Storing a larger payload inside an Ethernet frame means that less work must be done on the server and switch, providing more efficient communication overall. Many switches allow a larger-than-standard MTU, typically up to 9,000 bytes, to be sent in one frame, making it a jumbo frame. Note that jumbo frames aren't routable; unfortunately, this means they can't be used on a public network. When deciding to use jumbo frames, it is important to ensure that all cluster nodes and the interconnect switches use the same MTU.
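
As a sketch only (the interface name and addresses are made up), on Oracle Enterprise Linux and Red Hat Enterprise Linux the MTU is typically raised in the interface configuration file of the private interconnect. A simple way to verify that jumbo frames really pass end to end is to send a non-fragmentable ICMP packet of close to 9,000 bytes between the private addresses:

[root@node1 ~]# cat /etc/sysconfig/network-scripts/ifcfg-eth1
DEVICE=eth1
BOOTPROTO=none
ONBOOT=yes
IPADDR=192.168.100.1
NETMASK=255.255.255.0
MTU=9000

# 8972 bytes of payload plus 28 bytes of IP/ICMP headers equals 9000; -M do forbids fragmentation
[root@node1 ~]# ping -M do -s 8972 -c 2 192.168.100.2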

We explained in the introduction on individual database server nodes that components should be redundant, and the network interface cards should definitely be among those redundant components. Combining multiple network ports into one logical unit is referred to as bonding in Linux. Unlike on many other operating systems, network cards can be bonded in Linux without having to purchase additional software. A new master device, bond, will be created that enslaves two or more network cards. Once completed, the bonded network device will route all network traffic into and out of the server. You should not bond network ports of the same physical network interface card, because the card itself would remain a single point of failure.
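
The exact bonding configuration differs between distributions and releases; the following sketch shows Red Hat-style configuration files with made-up interface names and addresses, using the active-backup mode, which requires no special switch support:

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
IPADDR=192.168.100.1
NETMASK=255.255.255.0
BOOTPROTO=none
ONBOOT=yes
BONDING_OPTS="mode=active-backup miimon=100"

# /etc/sysconfig/network-scripts/ifcfg-eth1 (repeat analogously for the second slave, e.g., eth3)
DEVICE=eth1
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
ONBOOT=yes

Active-backup mode sacrifices aggregated bandwidth for simplicity; other modes such as 802.3ad link aggregation require matching switch configuration.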

Implementing an Infiniband-based Interconnect

Infiniband is a popular implementation of a remote direct memory access architecture (RDMA). It is a high-speed interconnect commonly associated with high-performance computing (HPC) environments. According to http://openfabrics.org/, the majority of all clusters in the Top500 HPC list use this implementation. RDMA enables parallel, direct, memory-to-memory transfers between the nodes in the cluster, and it requires dedicated RDMA adapters, switches, and software. It also avoids the CPU processing and context switching overheads associated with Ethernet-based implementations.

There are two different ways to implement Infiniband interconnects on Linux. The first way is referred to as IP over Infiniband (IPoIB) and, speaking in greatly simplified terms, it replaces Ethernet at the media access control and link-layer control layers. Since the Internet Protocol remains the network protocol, the use of IPoIB is completely transparent to applications. The IPoIB implementation offers significant performance improvements over Ethernet connections.

Another option is to use Reliable Datagram Sockets over Infiniband. This option is available beginning with Oracle 10.2.0.3. RDS is available through the Open Fabric Enterprise Distribution (OFED) for Linux and Windows, developed by the Open Fabrics Alliance. As of version 2.6.30, RDS has found its way into the Linux kernel, with Oracle playing a major role as a contributor. The key characteristics of RDS are low-latency, low-overhead, and a high bandwidth—basically, all you could ask for in RAC!

You can find a compatibility matrix available online at www.oracle.com/technology/products/database/clustering/certify/tech_generic_linux_new.html.

This matrix lists RDS as supported for Linux with QLogic/SilverStorm switches on Oracle 10.2.0.3 and newer. Open Fabrics Enterprise Distribution (OFED) 1.3.1 and newer for RDS v2 is supported from Oracle 11.1 onwards with QLogic, HP, and Voltaire switches. My Oracle Support note 751343.1 has a direct link to the required patches for Oracle Enterprise Linux/Red Hat Enterprise Linux 5. Unfortunately, there do not seem to be any deployment instructions for RDS on SuSE Linux Enterprise Server. Oracle advises customers to monitor the certification page for updates.

The Oracle Database Machine and Exadata Storage Server drive the use of Infiniband to the extreme, offering up to 40Gb/s for communication within the cluster, which is impossible to beat with Ethernet. An Infiniband solution has the great advantage of higher performance than the ever-present Gigabit Ethernet interconnect, but it also comes at a higher cost and introduces another technology set into the data center.

Clusterware/Grid Infrastructure

Grid Infrastructure is tightly integrated with the operating system and offers the following services to RAC:

  • Internode connectivity

  • Cluster membership

  • Messaging

  • Cluster logical volume manager

  • Fencing

Grid Infrastructure is required to run RAC, and it must be installed in its own, non-shared Oracle home, referred to as ORA_CRS_HOME or CRS_HOME. Oracle Grid Infrastructure will contain the binaries to run the Automatic Storage Management component, as well.

Oracle's Cluster software has gone through a number of name changes, as already indicated in other sections of this chapter, which Table 2-1 details. As we explained in this chapter's introduction, one of the purposes of clustering multiple computers is to make them appear as a single entity to the cluster's users. The cluster management software is responsible for doing exactly this, and Oracle Grid Infrastructure is remarkable because it means Oracle development has come up with unified cluster management software for all RAC-supported platforms.

Table 2.1. Oracle Cluster Software Naming

Product Name             Corresponding Version   Terminal Release   Release Date   Comments
Cluster Ready Services   10g Release 1           10.1.0.5           2003           No patch set update (PSU) is available.
Clusterware              10g Release 2           10.2.0.5           2005           The latest PSU available at the time of writing is 10.2.0.4.3; the terminal release is not out yet.
Clusterware              11g Release 1           11.1.0.7           2007           11.1.0.7.2 is the latest available PSU at the time of writing.
Grid Infrastructure      11g Release 2           Not yet known      2009           11.2.0.1.1 is the latest PSU available at the time of writing for both database and Grid Infrastructure.

Until Oracle 9i, vendor-specific clusterware was used to provide the same services that Oracle Clusterware provides today. So why did Oracle reinvent the wheel? The foreword by one of the ASM architects in the excellent Oracle Automatic Storage Management: Under-the-Hood & Practical Deployment Guide by Nitin Vengurlekar et al (McGraw Hill, 2007) gives you a hint at how Oracle works internally. The foreword writer noted that Oracle was not satisfied by the cluster file systems available from third-party vendors. There was also the problem that any improvement suggested by Oracle would also benefit Oracle's competitors. Hence, Oracle came up with an entirely new product (ASM) that only the Oracle database could use.

Similar thoughts may have played a role with the creation of unified cluster management software. The use of third-party cluster software is still possible, but it's getting less and less common since Oracle's Grid Infrastructure provides a mature framework. Should you decide to use a third-party cluster solution, then Grid Infrastructure must be installed on top of that. Multi-vendor solutions, especially in the delicate area of clustering, can lead to a situation where none of the vendors takes responsibility for a problem, and each blames the other parties involved for any problems that occur. By using only one vendor, this blame-game problem can be alleviated.

Planning a RAC Installation

When planning a RAC software installation, you first install Grid Infrastructure, then the Oracle RDBMS binaries, followed by the application of the latest patches to the new installation. The installation process of all Oracle clustered components, including Grid Infrastructure, follows the same pattern: once the software is made available to one of the cluster nodes, you start Oracle Universal Installer. Its task is to copy and link the software on the local node first. It then copies all required files to all the other cluster nodes specified. This saves the installer from linking components on each node. After the software is distributed to all nodes, a number of scripts need to be executed as root, initializing the Grid Infrastructure stack and optionally starting ASM to mount the predefined disk groups. Once this step is complete, Oracle Universal Installer performs some more internal initialization tasks before running a final verification check. When you later run Oracle Universal Installer (OUI) to install the database binaries, it will prompt you for the installation of a clustered Oracle home only if the Grid Infrastructure installation succeeded.

You must meet a number of prerequisites before you can install Grid Infrastructure. Public and private networks need to be configured, and SSH user equivalence or password-less logins across all cluster nodes must be established. Grid Infrastructure also requires kernel parameters to be set persistently across reboots, and it checks for the existence of required software packages—mainly compiler, linker, and compatibility libraries—before the installation. A great utility, called cluster verification tool or cluvfy, assists system and database administrators in meeting these requirements. In version 11.2 of the software, some of the detected problems are fixable by running a script cluvfy generates.
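
As a rough sketch (node names are placeholders), the following invocations are typically run from the staged installation media before the installation, and from the Grid Infrastructure home afterwards:

[grid@node1 grid]$ ./runcluvfy.sh stage -pre crsinst -n node1,node2 -fixup -verbose
[grid@node1 ~]$ cluvfy stage -post crsinst -n all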

Grid Infrastructure cannot be installed into a shared Oracle home: all nodes of the cluster need their own local installation. The utility cluvfy checks for this, as well as for sufficient disk space. Not being able to use a shared Oracle home for Grid Infrastructure marks a change from previous versions of Clusterware.

With the installation finished, Grid Infrastructure will start automatically with every server restart, unless specifically instructed not to.

Choosing a Process Structure

Once installed, a number of daemons are used to ensure that the cluster works as expected and communication to the outside world is possible. Some of these are started with root privileges if the Linux platform requires it. For example, any change to the network configuration requires elevated rights. The other daemon processes are executed with the Grid software owner's permissions. Table 2-2 introduces the main Clusterware and Grid Infrastructure daemons.

Table 2.2. Main Clusterware/Grid Infrastructure Daemons and Their Use

Oracle High Availability Service (OHAS): The Oracle High Availability Service is the first Grid Infrastructure component started when a server boots. It is configured to be started by init(1), and it is responsible for spawning the agent processes.

Oracle Agent: Two Oracle Agents are used with Grid Infrastructure. The first one, broadly speaking, is responsible for starting a number of resources needed for accessing the OCR and voting files; it is created by the OHAS daemon. The second agent is created by the Cluster Ready Services daemon (CRSD, see below), and it starts all resources that do not require root access on the operating system. This second Oracle Agent runs with the Grid Infrastructure software owner's privileges, and it takes over the tasks previously performed by the racg process in RAC 11.1.

Oracle Root Agent: Similar to the Oracle Agent, two Oracle Root Agents are created. The initial agent is spawned by OHAS, and it initializes resources for which the Linux operating system requires elevated privileges; the main daemons it creates are CSSD and CRSD. The Cluster Ready Services daemon in turn starts another Root Agent, which starts the resources that require root privileges, mainly network-related ones.

Cluster Ready Services Daemon (CRSD): The main Clusterware daemon uses the information stored in the Oracle Cluster Registry to manage resources in the cluster.

Cluster Synchronization Services Daemon (CSSD): The CSS processes manage cluster configuration and node membership.

Oracle Process Monitor Daemon (OPROCD): The oprocd daemon is responsible for I/O fencing in Clusterware 11.1. It was introduced for Linux with the 10.2.0.4 patch set; before that patch set, the kernel hangcheck-timer module was responsible for similar tasks. Interestingly, oprocd had always been used on non-Linux platforms. Grid Infrastructure replaces oprocd with the cssdagent process.

Event Manager Daemon (EVM): The EVM daemon is responsible for publishing events created by Grid Infrastructure.

Cluster Time Synchronization Service (CTSS): The CTSS service is used as an alternative to a Network Time Protocol server for cluster time synchronization, which is crucial for running RAC. The Cluster Time Synchronization Services daemon can run in two modes: observer or active. It runs in observer mode whenever NTP is available; in the absence of NTP, it actively synchronizes the cluster nodes against the master node on which it runs.

Oracle Notification Service (ONS): This is the primary daemon responsible for publishing events through the Fast Application Notification (FAN) framework.

The startup sequence of Grid Infrastructure changed significantly in RAC 11.2; the startup sequence in RAC 11.1 was pretty much identical to the process in Oracle 10.2. Instead of starting cluster ready services, cluster synchronization services, and the event manager directly through inittab(5), the Oracle High Availability Service now takes care of creating the agent processes, monitoring their siblings' health, and spawning the cluster resources.

Among the non-Oracle managed daemons, NTP plays a special role. Clock synchronization is imperative for every cluster, and Grid Infrastructure is no exception.

Tables 2-3 and 2-4 list the main daemons found in Clusterware 11.1 and Grid Infrastructure respectively.

Table 2.3. Clusterware 11.1 Main Daemons

Component   Linux Process                     Comment
CRS         crsd.bin                          Runs as root.
CSS         init.cssd, ocssd, and ocssd.bin   Except for ocssd.bin, all of these components run as root.
EVM         evmd, evmd.bin, and evmlogger     evmd runs as root.
ONS         ons
OPROCD      oprocd                            Runs as root and provides node fencing instead of the hangcheck-timer kernel module.
RACG        racgmain, racgimon                Extends Clusterware to support Oracle-specific requirements and complex resources; also runs server callout scripts when FAN events occur.

Table 2.4. Grid Infrastructure 11.2 Main Daemons

Component                          Linux Process                           Comment
CRS                                crsd.bin                                Runs as root.
CSS                                ocssd.bin, cssdmonitor, and cssdagent
CTSS                               octssd.bin                              Runs as root.
EVM                                evmd.bin, evmlogger.bin
Oracle Agent                       oraagent.bin
Oracle Root Agent                  orarootagent                            Runs as root.
Oracle High Availability Service   ohasd.bin                               Runs as root through init; the mother of all other Grid Infrastructure processes.
ONS/eONS                           ons/eons                                ONS is the Oracle Notification Service; eONS is a Java process.

The following example lists all the background processes initially started by Grid Infrastructure after a fresh installation:

[oracle@node1 ˜]$ crsctl status resource -init -t
------------------------------------------------------------------------------
NAME           TARGET   STATE        SERVER                   STATE_DETAILS
------------------------------------------------------------------------------
Cluster Resources
------------------------------------------------------------------------------
ora.asm
      1        ONLINE   ONLINE       node1            Started
ora.crsd
      1        ONLINE   ONLINE       node1
ora.cssd
      1        ONLINE   ONLINE       node1
ora.cssdmonitor
      1        ONLINE   ONLINE       node1
ora.ctssd
      1        ONLINE   ONLINE       node1            OBSERVER
ora.diskmon
      1        ONLINE   ONLINE       node1
ora.drivers.acfs
      1        ONLINE   ONLINE       node1
ora.evmd
      1        ONLINE   ONLINE       node1
ora.gipcd
      1        ONLINE   ONLINE       node1
ora.gpnpd
      1        ONLINE   ONLINE       node1
ora.mdnsd
      1        ONLINE   ONLINE       node1

You may have noticed additional background processes in the output of this list. We'll discuss those in Chapter 8, which covers Clusterware.

Configuring Network Components

Grid Infrastructure requires a number of network addresses to work correctly:

  • A public network address for each host

  • A private network for each host

  • A virtual (not yet assigned) IP address per host

  • One to three unassigned IP addresses for the Single Client Access Name feature

  • If Grid Plug and Play is used, another non-used virtual address for the Grid Naming Service

Every host deployed on the network should already have a public IP address assigned to it that users can connect to, so that requirement is easy to satisfy. The private network has already been discussed in this chapter's "Interconnect" section, and it's an absolute requirement for Grid Infrastructure. Again, it should be emphasized that the private interconnect is used exclusively for Grid Infrastructure/RAC Cache Fusion traffic; adding iSCSI or NFS traffic over it does not scale well!

Node virtual IP addresses are one of the most useful additions to Oracle clustering. They need to be on the same subnet as the public IP address, and they are maintained as cluster resources within Grid Infrastructure. Let's think back to what it was like in the 9i days: in the case of a node failure, the public node address didn't reply to any connection requests (it couldn't because the node was down). If a client session tried to connect to the failed node, it had to wait for the operating system to time the request out, which can be a lengthy process. With the virtual IP address, things go considerably quicker: when a node fails, Grid Infrastructure fails the node's virtual IP address over to another node of the cluster. When a client connects to the failed over virtual IP address, Grid Infrastructure knows that this particular node is down and can send a reply back to the client, forcing it to try to connect to the next node in the cluster.

Another requirement calls for one to three IP addresses, regardless of the cluster size. This requirement is new to Grid Infrastructure. A new address type, the single client access name (SCAN), abstracts away the number of nodes in the cluster. The SCAN is initiated and configured during Grid Infrastructure installations or upgrades. Before starting the installation, you need to add the IP addresses for the SCAN to DNS so that the single client access name resolves to them in a round-robin fashion.
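
A quick sanity check before the installation is to resolve the SCAN and confirm that the name server returns the expected addresses in round-robin fashion; the cluster name, domain, and IP addresses below are purely illustrative:

[grid@node1 ~]$ nslookup cluster1-scan.example.com
Server:         192.168.0.10
Address:        192.168.0.10#53

Name:   cluster1-scan.example.com
Address: 192.168.0.21
Name:   cluster1-scan.example.com
Address: 192.168.0.22
Name:   cluster1-scan.example.com
Address: 192.168.0.23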

In case you decide to use the Grid Naming Service, you need to allocate a virtual IP address on the public network for it.

Note

The Grid Naming Service is explained in detail in the "11g Release 2 New Features" section later in this chapter.

Setting up Shared Grid Infrastructure Components

In addition to the software daemons mentioned previously, Grid Infrastructure uses two different types of shared devices to manage cluster resources and node membership: the so-called Oracle Cluster Registry (OCR) and the voting disks. Oracle 11.2 introduced a new, local-only file called the Oracle Local Registry (OLR). All these components will be explained in their respective sections later in this chapter.

Implementing the Oracle Cluster Registry and Oracle Local Registry

The first of the shared persistent parts of Grid Infrastructure is the Oracle Cluster Registry. Shared by all nodes, it contains all the information about cluster resources and permissions that Grid Infrastructure needs to operate. To be sharable, the OCR needs to be placed either on a raw device, a shared block device, a cluster file system such as OCFS2, or Automatic Storage Management. With Grid Infrastructure, the use of non-ASM storage (or alternatively, a clustered file system) for the OCR is only supported for upgraded systems. All new installations either have to use a supported clustered file system or ASM. The OCR can have one mirror in RAC 10 and 11.1, and up to five copies can be defined in Grid Infrastructure for added resilience.

The OCR is automatically backed up every four hours by Grid Infrastructure, and a number of backups are retained for recoverability. RAC 11.1 introduced the option to manually back up the Cluster Registry, and additional checks of its integrity are performed when running diagnostic utilities as the root user. Clusterware 11.1 simplified the deployment of the Cluster Registry on shared block devices through Oracle Universal Installer. Prior to this, a manual procedure for moving the OCR to block devices was needed. When using raw devices in RAC 11.1 and Red Hat 5 or SLES 10, manual configuration of the raw devices through udev was necessary. Notes on My Oracle Support explain the procedure, which differs depending upon whether single or multipathing is used to connect to shared storage.
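
The commands below, run as root, illustrate how the OCR integrity and its backups can be checked, and how a manual backup (available from 11.1 onwards) is taken; the output is omitted here for brevity:

[root@node1 ~]# ocrcheck                 # verifies integrity and shows the OCR locations
[root@node1 ~]# ocrconfig -showbackup    # lists the retained automatic (and manual) backups
[root@node1 ~]# ocrconfig -manualbackup  # takes an on-demand backup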

In some rare cases, the OCR can become corrupted, in which case a restore from a backup may be needed to restore service. Depending on the severity of the corruption, it might be sufficient to restore one of the mirrors to the primary location; otherwise, a backup needs to be restored. The administration and maintenance of the OCR is only supported through the use of Oracle-supplied utilities; dumping and modifying the contents of the OCR directly will result in Oracle Support refusing to help you with your configuration problem.

An additional cluster configuration file has been introduced with Oracle 11.2, the so-called Oracle Local Registry (OLR). Each node has its own copy of the file in the Grid Infrastructure software home. The OLR stores important security contexts used by the Oracle High Availability Service early in the start sequence of Clusterware. The information in the OLR and the Grid Plug and Play configuration file is needed to locate the voting disks. If they are stored in ASM, the discovery string in the GPnP profile will be used by the cluster synchronization daemon to look them up. Later in the Clusterware boot sequence, the ASM instance will be started by the cssd process to access the OCR files; however, their location is stored in the /etc/oracle/ocr.loc file, just as it is in RAC 11.1. Of course, if the voting files and OCR are on a shared cluster file system, then an ASM instance is not needed and won't be started unless a different resource depends on ASM.

Configuring Voting Disks

Voting disks are the second means of implementing intercluster communication, in addition to the cluster interconnect. If a node fails to respond to the heartbeat requests of the other nodes within a countdown-threshold, the non-responsive node will eventually be evicted from the cluster.

Similar to the Oracle Cluster Registry, the voting disk and all of its mirrors (up to 15 voting disks are supported in Grid Infrastructure, vs. three in Clusterware 11.1) need to be on shared storage. Raw devices, a clustered file system, or Automatic Storage Management are possible locations for the voting disks. Again, and this is exactly the same as with the OCR, not storing the voting disks in a clustered file system or ASM in Grid Infrastructure is supported only for upgraded systems. Block and raw device support for these files will be deprecated in Oracle 12.

For resilience, Oracle strongly recommends using at least three voting disks, each in a different location. When using ASM to store the voting disks, you need to pay attention to the redundancy level of the disk group and the failure groups available. Note that all copies of the voting disk will be in only one disk group; you can't spread the voting disks over multiple disk groups. With an external redundancy disk group, you can only have exactly one voting disk, a number that can't be increased by specifying multiple disk groups with external redundancy. Disk groups with normal redundancy need at least three failure groups to be eligible for storing exactly three voting disks; high redundancy is more flexible because it lets you have up to five voting disks.
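
To see where the voting disks currently reside, crsctl can be queried from any node; the disk group name and file universal IDs in this sketch are fictitious, and the output layout is approximate:

[grid@node1 ~]$ crsctl query css votedisk
##  STATE    File Universal Id                File Name        Disk group
 1. ONLINE   6bf06ea1b82a4fe3bfca4a23e526c122 (ORCL:OCRVOTE01) [OCRVOTE]
 2. ONLINE   7ac16fb2c93b5ff4cfdb5b34f637d233 (ORCL:OCRVOTE02) [OCRVOTE]
 3. ONLINE   8bd26fc3da4c60f5dfec6c45f748e344 (ORCL:OCRVOTE03) [OCRVOTE]
Located 3 voting disk(s).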

Leveraging Automatic Storage Management

Automatic Storage Management was introduced to the RAC software stack as part of Oracle 10g Release 1. Oracle ASM is a cluster-aware logical volume manager for Oracle's physical database structures. Files that can be stored in ASM include control files, database files, and online redo logs. Until 11g Release 2, it was not possible to store binaries or other types of operating system files in ASM, nor was it a suitable option for installing a shared Oracle home.

Note

The complete list of files supported by ASM changes from release to release. You can find the current list in the "Administering Oracle ASM Files, Directories, and Templates" chapter in the Storage Administrator's Guide. The short version is this: you can store anything but plain text (e.g., traces, logs, and audit files), classic export, and core dump files.

ASM is built around the following central concepts:

  • ASM disk

  • Failure groups

  • ASM disk groups

A number of individual ASM disks, either physical hard drives or external storage provided to the database server, form an ASM disk group. There is an analogy with LVM in that ASM disks correspond to physical volumes (see the PV in Figure 2-1). ASM disks sharing a common point of failure, such as a disk controller, can be grouped into a failure group, for which there is no equivalent in LVM. An ASM disk group can be used to store physical database structures: data files, control files, online redo logs, and other file types. In contrast to Linux's logical volume manager, LVM2, no logical volumes are created on top of a disk group. Instead, all files belonging to a database are logically grouped into a directory in the disk group. Similarly, a file system is not needed; this explains why ASM has a performance advantage over the classic LVM/file system setup (see Figure 2-1 for a comparison of LVM2 and ASM).

Figure 2.1. A comparison between Linux LVM and Automatic Storage Management

The restriction to store general purpose files has been lifted in Grid Infrastructure, which introduces the ASM Cluster File System (ACFS). We will discuss this in more detail in the "11g Release 2 New Features" section later in this chapter. ASM uses a stripe-and-mirror-everything approach for optimal performance.

The use of ASM and ACFS is not limited to clusters; single-instance Oracle can benefit greatly from it, as well. In fact, with Oracle Restart, the standalone incarnation of Grid Infrastructure will provide ASM as an added benefit—there is no longer any reason to shy away from this technology. Initially met with skepticism by many administrators, this technology has (at least partly) moved responsibility for storage from the storage administrators to the Oracle database administrator.

Technically, Oracle Automatic Storage Management is implemented as a specific type of Oracle instance; it has its own SGA, but no persistent dictionary. In RAC, each cluster node has its own ASM instance, and only one. When starting, each instance will detect the ASM disk groups registered as resources in Grid Infrastructure and will then mount them. With the correct permissions granted (ASM 11.2 introduced Access Control Lists, or ACLs), databases can access their data files. Using ASM requires using Oracle Managed Files (OMF), which implies a different way of managing database files. Initialization parameters in the RDBMS instance, such as db_create_file_dest, db_create_online_log_dest_n, and, to an extent, db_recovery_file_dest, define in which disk groups data files, online redo logs, control files, and flash recovery area files are created. When the creation of a new file is requested, Oracle Managed Files will create a file name based on the following format:

+diskGroupName/dbUniqueName/file_type/file_type_tag.file.incarnation

An example data file name might look like this:

+DATA/dev/datafile/users.293.699134381
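
A minimal sketch of how Oracle Managed Files is driven from the instance side follows; the disk group names +DATA and +FRA and the sizes are assumptions. Once the destination parameters are set, file names like the one above are generated automatically when, for example, a tablespace is created:

SQL> alter system set db_create_file_dest = '+DATA' scope=both sid='*';
SQL> alter system set db_recovery_file_dest_size = 200G scope=both sid='*';
SQL> alter system set db_recovery_file_dest = '+FRA' scope=both sid='*';
SQL> create tablespace app_data datafile size 10G;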

ASM is the only supported way of storing the database in Standard Edition RAC. This shows that Oracle is taking ASM seriously. ASM allows you to perform many operations online; as an added benefit, ASM 11.1 and newer can be upgraded in a rolling fashion, minimizing the impact on the database availability.

ASM operates on the raw partition level; LVM2 logical volumes should be avoided for production systems to reduce overhead. ASM over NFS is supported as well. However, the directories exported by the filer are not used directly; instead, zero-padded files created with the dd utility on the NFS mount are presented to ASM as disks. When using NFS, you should also check with your vendor for best practice documents. For reasons of practicality, we recommend Oracle's Direct NFS client over ASM on NFS.

Environments with special requirements, such as very large databases with 10 TB or more of data, can benefit from customizable extent sizes defined at the disk group level. A common storage optimization technique involves using only the outer regions of disk platters, which offer better performance than the rest of the platter. ASM Intelligent Data Placement allows administrators to define hot regions with greater speed and bandwidth; frequently accessed files can be placed specifically into these areas for overall improved performance. Hard disk manufacturers will soon ship hard disks with 4 KB sector sizes, increasing storage density in the race for ever faster and larger disks. ASM is prepared for this, and it provides a new disk group attribute called sector_size that can be set to either 512 or 4,096 bytes.

A typical workflow for most installations begins with the storage administrator presenting the storage to be used for the ASM disks to all cluster nodes. The system administrator then creates partitions on the new block devices and adds the necessary entries to the multipathing configuration. Either ASMLib or udev can be used to mark the new partitioned block device as a candidate disk. After handover to the database team, the Oracle administrator uses the new ASM disk in a disk group. All of these operations can be performed online, without having to reboot the server.
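
The ASMLib part of that workflow, run as root, could look like the following sketch; the disk label and the multipath device name are examples only:

[root@node1 ~]# /usr/sbin/oracleasm createdisk DATA01 /dev/mapper/mpath1p1
# On every other cluster node, pick up the new disk header and verify that it is visible
[root@node2 ~]# /usr/sbin/oracleasm scandisks
[root@node2 ~]# /usr/sbin/oracleasm listdisks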

Working with ASM Disks

An ASM disk is the basic building block of ASM. A disk is a shared block device presented to all nodes of the cluster, and it is made available to the ASM instance either as a raw device or through the services of ASMLib. ASM disks are generally multipathed, but they can also be single-pathed.

The use of raw devices is only supported in Oracle 11.1; the Oracle 11.2 documentation explicitly states that raw devices are deprecated. Oracle here follows the Linux kernel developers, who officially deprecated raw devices with kernel 2.6. Presenting ASM disks to the database servers can be difficult at times; from an administration point of view, ASMLib is far easier to use than raw devices. Regardless of which method is used, the new LUN is recognized as a candidate for addition to an ASM disk group.

As soon as an ASM candidate disk is added to a disk group, meta information is written into its header. This allows the ASM instance to recognize the disk and to mount it as part of a disk group.

Disk failure is a common occurrence when dealing with storage arrays. Individual hard disks are mechanical devices that undergo high levels of usage. It's normal for them to fail. In most cases, a storage array uses protection levels to mirror disk contents or uses parity information to reconstruct a failed disk's data.

Disk failures in ASM are not very common because most deployments will use LUNs with storage array protection. However, should a disk fail in an ASM protected disk group, urgency is required in replacing the failed disk to prevent it from being dropped. A new attribute introduced in ASM 11.1, called disk repair time, allows administrators to fix transient disk failures without incurring a full rebalance operation. A rebalance operation occurs when a disk is added to or (forcibly) removed from an ASM disk group, restriping the contents of the disk group across the new number of ASM disks. Depending on the size of the ASM disk group, this rebalancing can be a lengthy operation. If the administrators are lucky enough to get the ASM disk back into the disk group before rebalancing occurs, the resilvering of the disk group is orders of magnitude faster, because only a dirty region log needs to be applied to the disk instead of a full rebalance operation.
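
The window for such transient failures is controlled per disk group via the disk_repair_time attribute, and a repaired disk is brought back with an online command; the disk group name, disk name, and interval below are arbitrary examples:

SQL> alter diskgroup data set attribute 'disk_repair_time' = '8h';
SQL> alter diskgroup data online disk DATA_0003;   -- or: alter diskgroup data online all;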

Depending on the storage backend used, the LUN can either be protected internally on the array by a RAID level, or it can be an aggregation of unprotected storage (JBOD). This has implications for the protection on the ASM disk group level; we will discuss this topic next.

Exploiting ASM Disk Groups

When all ASM disks are made available to the operating system—for simplicity, we assume this is accomplished through ASMLib—we can use them. Multiple ASM disks can be aggregated to disk groups, just as in classical LVM, where multiple physical volumes are aggregated into volume groups. Like LVM volume groups, disk groups can be named and take attributes. Depending on the protection level of the individual ASM disk/LUN, three different redundancy levels can be defined:

  • External redundancy

  • Normal redundancy

  • High redundancy

When creating a disk group with external redundancy, ASM will assume that the storage array takes care of protection from individual hard disk failure, and it won't perform any mirroring. However, it will stripe extents of a default size of 1M across all ASM disks available to the disk group. A write error to an ASM disk will force the disk to be dismounted. This has severe implications because no copies of the extents stored on the disk are available, and the whole disk group becomes unavailable.

With normal redundancy ASM will stripe and mirror each extent. For each extent written to disk, another extent will be written into another failure group to provide redundancy. With ASM 11.2, individual files can be defined to be triple-mirrored; two-way mirroring is the default. Normal redundancy can tolerate the failure of one ASM disk in the disk group.

Even higher protection is available with high redundancy, which provides triple-mirroring by default. With triple-mirroring, two additional copies of the primary extent are created. The loss of two ASM disks can be tolerated in disk groups with high redundancy.

Configuring Failure Groups

Failure groups are a logical grouping of disks that fail in their entirety if a component fails. For example, the disks belonging to a single SCSI controller form a failure group—should the controller fail, all disks will become unavailable. Failure groups are used by ASM to store mirror copies of data in normal and high redundancy disk groups. If not explicitly configured, each ASM disk forms its own failure group.

Normal redundancy disk groups need to consist of at least two failure groups, while high redundancy disk groups require at least three failure groups. However, it is recommended that you use more than the minimum number of failure groups for additional data protection.
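
Putting redundancy levels and failure groups together, a normal redundancy disk group spanning two (hypothetical) disk controllers could be created as sketched below; the ASMLib disk names and attribute values are assumptions:

SQL> create diskgroup data normal redundancy
  2    failgroup controller1 disk 'ORCL:DATA01', 'ORCL:DATA02'
  3    failgroup controller2 disk 'ORCL:DATA03', 'ORCL:DATA04'
  4    attribute 'compatible.asm' = '11.2', 'compatible.rdbms' = '11.2';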

ASM by default reads the primary extent from an ASM disk group. In extended distance clusters (see Chapter 3), this could cause a performance penalty if the primary extent is on the remote storage array. ASM 11.1 addresses this problem by introducing a preferred mirror read: each ASM instance can be instructed to read the local copy of the extent, regardless of whether it's a primary or copied extent.

Weighing Your ASM Installation and Administration Options

Until Oracle 11.1, the best practice was to install ASM as a separate set of binaries. This offered the advantage of being able to upgrade Clusterware and ASM independently of the database. For example, Clusterware and ASM could be upgraded to 11.1.0.7 while the database remained on the base release. If adhered to, this best practice resulted in a typical three Oracle home installation:

  • Clusterware

  • Automatic Storage Management

  • Database

If required, ASM 11.1 can be installed under a different operating system account than the one used for the installation of the RDBMS binaries. Oracle accounted for the fact that role separation between the database and storage administrators is common practice on many sites.

ASM is administered either through SQL*Plus, Enterprise Manager (dbconsole), or the Database Configuration Assistant (dbca). A major surprise awaited the early adopters of Oracle 11g Release 2: ASM is now part of Grid Infrastructure, both for single-instance and RAC environments. A new configuration assistant, asmca, has taken over and extended the functionality the Database Configuration Assistant offered in 11.1. ASM can no longer be started out of the RDBMS Oracle home, either. The new ASM configuration assistant adds support for another new ASM feature called ASM Cluster File System; we will discuss this feature in more detail in the "11g Release 2 New Features" section.

The introduction of ASM to an Oracle environment, whether single instance or RAC, frequently devolves into a political debate rather than remaining focused on what ASM can do for the business. Storage administrators often struggle to relinquish control over the presentation of storage from the fabric to Oracle DBAs. Role separation has been made possible with the introduction of a new super-user role called SYSASM, as opposed to the SYSDBA role we have been familiar with since Oracle 9i. The SYSASM privilege can be tied to a different operating system group, and granted to different users, than SYSDBA or SYSOPER.
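
As a brief, hedged illustration of that separation, an ASM administrator connects with the SYSASM privilege rather than SYSDBA, and the privilege can be granted to a further, hypothetical user stored in the ASM password file:

[grid@node1 ~]$ sqlplus / as sysasm

SQL> create user storage_admin identified by example_password;
SQL> grant sysasm to storage_admin;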

Installing Real Application Clusters

With the setup and configuration of Grid Infrastructure completed, it is time to turn our attention towards the clustered Oracle software installation. As we just saw, Grid Infrastructure provides the framework for running RAC, including intercluster communication links, node fencing, node-membership services, and much more. Automatic Storage Management is Oracle's preferred way of storing the database. RAC uses all of these concepts and extends the basic services where necessary. In the following sections, we will discuss installation considerations for RAC; the difference between RAC and single-instance Oracle storage considerations for the RAC database; and finally, Cache Fusion and internal RAC metadata information.

Sorting Through Your Installation Options

After a successful Grid Infrastructure/Clusterware installation, Oracle Universal Installer will detect that a cluster is present and will offer the option to install the binaries with the RAC option on the entire cluster or a user-defined subset of the cluster. This is the easy part of the installation process. The Grid Infrastructure installation was the hard part, and now it is time to relax a little. It is good practice to use cluvfy, the cluster-verification tool, to check that all the requirements for the RDBMS installation are fulfilled. Just as with the cluster layer, Oracle Universal Installer will copy and link the software on the first node and then push the Oracle home to the other nodes specified in the installation setup.

Unlike Grid Infrastructure, the Oracle RDBMS binaries can be installed on a shared file system such as OCFS2 or the brand new ASM Cluster File System (ACFS), which has its own set of advantages and disadvantages. The addition of new nodes to the cluster is simplified because no new software has to be installed on the new node. Patching is simplified as well, since only one Oracle home has to be patched. However, patches cannot then be installed in a rolling fashion, so downtime is inevitable.

For some time now, Oracle has used Oracle Configuration Manager (OCM) to simplify the integration of systems with My Oracle Support. The creation of service requests can use the information transmitted by OCM to fill in information about the system; previously, this was tedious at best once the fifth request for the same system had been filed. OCM can operate in offline or online mode, depending on the security policy at the customer's site. Oracle claims that the information transmitted by OCM can speed up the resolution of service requests. In the relevant section of the installer, all you need to supply is a My Oracle Support login and password. It's probably good practice to create a dedicated My Oracle Support account for this purpose.

During the installation process, Oracle Universal Installer will prompt the administrator to create or upgrade a database, or to install just the binaries. If a patch set is to be applied, for instance as soon as Oracle 11.2.0.2 is out, it is prudent to install the binaries only, patch the installation, and then create the database.

Choosing Between a Single Instance and a RAC Database

An Oracle RAC database is different from a single-instance database in a number of key aspects. Remember that in RAC, a single database on shared storage is concurrently accessed by multiple database instances on different server hosts. Database files, online redo log files, control files, and server parameter files must all be shared. Additionally, flashback logs, archived redo logs, Data Pump dump files, and Data Guard broker configuration files can also be shared, depending on your configuration; this is optional, but highly recommended. When using ASM, you will also find a local parameter file (pfile) on each RAC instance pointing to the server parameter file in its respective disk group. Another locally stored file is the Oracle password file. Users of a cluster file system usually keep these files in a shared location with instance-specific symbolic links to $ORACLE_HOME/dbs.

As you will see in the following sections, a RAC database also contains all the structures you expect to find in single-instance Oracle databases, extending them to take the database's special needs for cluster operations into account.

Working with Database Files

Database files contain all the data belonging to the database, including tables, indexes, the data dictionary, and compiled PL/SQL code, to mention just a few. In a RAC database, there is only one copy of each data file that is located on shared storage and can be concurrently accessed by all instances. The data files are identical to those found in a single-instance database. Datafiles are not mirrored by Oracle by default. Most users choose to implement redundancy at the storage level to prevent the loss of datafiles due to media failure. Oracle Automatic Storage Management can provide redundancy in case the storage array does not provide this facility.

Storing Information in Control Files

As you know, the control files store information about the physical structures of the database, as well as their status. Depending on your use of Recovery Manager (RMAN), the control file can also contain information about RMAN backups in the absence of a dedicated RMAN catalog database. In single-instance Oracle and RAC alike, control files should be mirrored to provide protection against corruption and storage failures. When using ASM together with a flash recovery area, this multiplexing is automatically done for you: by default, Oracle multiplexes the control file to the disk group specified by the db_create_file_dest initialization parameter and to the flash recovery area specified by the db_recovery_file_dest parameter. The control_files parameter, which records the location of the control files and all copies, is automatically updated for you in this case when using a server parameter file. Be aware that control files can become a point of contention in RAC because they are frequently updated, so be sure not to mirror too many copies of the control file, and locate them on fast storage.

Leveraging Online Redo Logs and Archiving

In RAC, each database instance has its own set of online redo log files, called a thread. Online redo log files are aggregated into groups within their instance's thread. In case you were wondering: single-instance Oracle behaves the same way, although you do not normally notice it. There, thread 1 belongs to the only instance (there is a one-to-one mapping between instance and database in that setup). Information about the thread number is available in V$LOG and related views.

You need at least two groups of online redo logs per thread, and you should also consider manually multiplexing the groups' members if you are not using ASM and a flash recovery area. The mapping between instance and thread is performed in the server parameter file: the initialization parameter thread maps the thread number to the instance. The convention is to map instance n to thread n, although this is not required; in other words, you will find prod1.thread=1 in the spfile. An additional online redo thread is required when adding an instance to the cluster, and this can be done in one of two ways. First, the administrator can use the alter database add logfile thread y group x SQL command in administrator-managed databases. Second, it can be done automatically by Oracle in policy-managed databases. A thread also needs to be enabled before Oracle can use it.
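
To illustrate, the following sketch adds and enables a redo thread for a hypothetical third instance called prod3 in an administrator-managed database. The group numbers, sizes, and instance name are assumptions, and Oracle Managed Files on ASM are assumed so that no file names need to be specified:

SQL> ALTER DATABASE ADD LOGFILE THREAD 3 GROUP 5 SIZE 512M, GROUP 6 SIZE 512M;
SQL> ALTER DATABASE ENABLE PUBLIC THREAD 3;
SQL> ALTER SYSTEM SET thread=3 SID='prod3' SCOPE=spfile;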

Note

A policy-managed database is a new way of managing a RAC database; it is explained in more detail in the "11.2 New Features" section later in this chapter.

Online redo logs in RAC function just as they do in single-instance Oracle. The log writer process (LGWR) flushes the redo buffer, a part of the SGA, to the online redo logs of the instance on which the transaction was committed. Online redo logs need to be on fast storage, or they might become a point of contention, especially for systems with high commit rates. A common tuning technique for poorly designed applications is to reduce the commit rate and to move at least the online redo logs and control files to fast storage, eliminating some of the worst performance bottlenecks. Systems with a high log switch rate can benefit from additional redo log groups per thread to give the archiver process more time to archive the used online redo log. This is also beneficial in situations where the archiver is responsible for shipping an archived redo log to a standby database; however, most modern systems use the Log Network Service (LNSn) process to asynchronously send redo to the standby database's Remote File Server (RFS) process. There is one LNS process per destination in Oracle 10.2 and 11.1. In Oracle 11.2, the LNSn processes were replaced by the NSSn and NSAn background processes: NSSn is used for synchronous shipping of redo, and NSAn for asynchronous shipping. The size of each individual online redo log should be chosen so that log switches don't occur too frequently; the Automatic Workload Repository or Statspack reports help administrators identify an appropriate size. Oracle 11.2 even allows administrators to choose the block size of online redo logs, accounting for the fact that modern storage units use 4KB sector sizes instead of 512 bytes.
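
The current layout, size, and block size of each thread's groups can be checked with a simple query against V$LOG; note that the BLOCKSIZE column only exists from 11.2 onwards:

SQL> SELECT thread#, group#, bytes/1024/1024 AS mb, blocksize, status
     FROM   v$log
     ORDER  BY thread#, group#;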

In case of an instance failure in RAC, which you'll learn more about in Chapter 3, all threads are merged to help create the recovery set for the roll-forward/roll-back operation performed by the server monitor (SMON) process.

Once filled by the lgwr process, one of the archiver processes will copy the used online redo log to the specified destination.

Note

You are correct if you're thinking that the archiver process copies redo logs only if the database is in archive log mode; however, we can't think of a RAC production database that does not operate in this mode!

The flash recovery area introduced with Oracle 10g Release 1 seems to be the favorite destination for archived redo logs. If you are not using one, we recommend storing the archived redo logs on a shared file system that is accessible to all nodes. As with single-instance Oracle, the archived logs are essential for point-in-time recovery. The difference between RAC and single-instance deployments is that in RAC, Oracle needs the archived logs from all threads of the database. You can verify that Oracle is using the log files of each thread in the alert log of the instance that performs media recovery.
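
Assuming a flash recovery area is configured, pointing the first archive destination at it is a one-liner, and archive log list then confirms the mode and destination. This is a sketch only; the destination and scope you choose are site specific:

SQL> ALTER SYSTEM SET log_archive_dest_1='LOCATION=USE_DB_RECOVERY_FILE_DEST' SID='*';
SQL> archive log list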

Managing Undo Tablespaces

Similar to online redo log threads, each cluster database instance has its own undo tablespace. Again, the 1:1 mapping between the instance and the undo tablespace name is performed in the server parameter file. This mapping doesn't mean that the undo tablespace is permanently tied only to the instance. All other instances can access the undo tablespace to create read-consistent images of blocks.

When adding a new instance to the cluster, an additional undo tablespace needs to be created and mapped to the new instance, the same requirement that applies to the online redo logs. In the case of policy-managed databases, Oracle does this for you; otherwise, the administrator has to take care of this task.
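
For an administrator-managed database, the steps for a hypothetical third instance prod3 might look like the following sketch. Oracle Managed Files are assumed, so no data file name is required, and the tablespace name and size are assumptions:

SQL> CREATE UNDO TABLESPACE undotbs3 DATAFILE SIZE 4G;
SQL> ALTER SYSTEM SET undo_tablespace='UNDOTBS3' SID='prod3' SCOPE=spfile;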

Although it is still possible to use manual undo management, we strongly recommend using Automatic Undo Management (AUM).

Weighing Storage Options for RAC Databases

An Oracle RAC database has to be stored on shared storage, as discussed previously. Administrators can choose from the following options:

  • Automatic Storage Management (ASM): Oracle's preferred storage option and the only supported configuration for RAC Standard Edition.

  • Oracle Cluster File System 2 (OCFS2): A POSIX-compliant cluster file system developed by Oracle to store arbitrary files.

  • Raw devices: We recommend not using raw devices; not only are they deprecated in the Linux kernel, but they are also deprecated with Oracle 11.2.

  • Network File System (NFS): Check the certification matrix on My Oracle Support to confirm that your NFS filer is supported.

  • Red Hat Global File System (GFS): Supported for Red Hat and Oracle Enterprise Linux only; it can be used for the Flash Recovery Area and database files alike.

It should be pointed out that Oracle recommends Automatic Storage Management for use with RAC in all parts of the documentation. Judging by the fact that ASM receives numerous improvements in each new release, it's probably a good time to get started with the technology, if you haven't already done so.

OCFS2 was discussed in more detail in Chapter 1; it is a POSIX-compliant file system for Linux kernel 2.6 and newer. It overcomes many of the initial OCFS limitations and allows users to store binaries and database files needed for RAC. Interestingly, Oracle RAC 11.2 also supports the use of Red Hat's (formerly Sistina's) Global File System.

Note

If you are planning to run RAC over NFS, your filer needs to be explicitly supported by Oracle.

Drilling Down on a RAC Database Instance

A RAC database consists of two or more instances. Each instance usually resides on a different cluster node and consists of a superset of the shared memory structures and background processes used to support single-instance databases.

It is possible to install RAC on only one node for testing, without having to pay for the interconnect hardware or shared storage; however, such a setup is not recommended for production. Even though only a single database instance is used and no data will be shipped over an interconnect, operations routinely performed have to go through additional code paths not required for a single-instance database, which creates overhead.

Note

Not all nodes of a RAC database have to be up and running; the ability to run on a subset of nodes is integral to the instance recovery process and to cluster database startup.

Each instance has its own shared memory area, the System Global Area (SGA), which is allocated at instance startup. The SGA comprises multiple sub-pools and a fixed, platform-dependent portion. The buffer cache, shared pool, log buffer, streams pool, and many other pools take up the memory of the SGA. Over the past several releases, Oracle has introduced technologies to automatically manage and tune the different SGA components. Oracle 10g gave administrators Automatic Shared Memory Management (ASMM) to handle most of the SGA components. Oracle 11g added Automatic Memory Management (AMM), which includes the Program Global Area (PGA) in the set of automatically managed memory. However, AMM is not compatible with Linux huge pages, which can be a problem for systems with large amounts of memory.
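
Which of the two schemes is active on an instance can be seen from the relevant initialization parameters: a non-zero memory_target indicates AMM, whereas sga_target together with pga_aggregate_target indicates ASMM. This is merely a quick check, not a tuning recommendation:

SQL> show parameter memory_target
SQL> show parameter sga_target
SQL> show parameter pga_aggregate_target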

Oracle has to synchronize access to shared memory locally and across the cluster. You might recall that, thanks to the RAC technology stack, all database instances can access other database instances' SGAs.

The methods employed by the Oracle kernel to protect shared memory in RAC are no different from single-instance Oracle: latches and locks are used in both cases. A latch is a low-level, lightweight serialization device. Processes trying to acquire a latch do not queue; if a latch cannot be obtained, the process will spin, which means it enters a tight loop to avoid being taken off the CPU by the operating system's scheduler. If the latch still can't be acquired, the process will eventually sleep and retry at regular intervals. Latches are local to the instance; there are no cluster-wide latches.

On the other hand, locks are obtained for a longer period, and they are more sophisticated than the simple latch we just discussed. Locks can be shared or exclusive—with interim stages—and processes trying to acquire locks will have to wait on a first in, first out (FIFO) basis. Access to locks is controlled by so-called enqueues, which maintain a list of processes waiting for access to a given resource. Unlike latches, enqueues are maintained cluster wide.

Locking and latching are well known and understood in single-instance Oracle; however, the requirements for cache coherency in Oracle RAC mean that locking and latching are significantly more complex in RAC. As with single-instance Oracle, access to data blocks in the buffer cache and enqueues must be managed within the local instance; however, access by remote instances must be managed, as well. For this reason, Oracle uses the Global Resource Directory (GRD) and a number of additional background processes.

Note

Oracle has amended the V$ views for cluster-wide use by adding an instance identifier (the INST_ID column) to each view and prefixing the view name with G, giving GV$. Think of a GV$ view as a global view that encompasses the dynamic performance views from all instances in the cluster.
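
For example, a quick way to see how sessions are spread across the cluster is to aggregate GV$SESSION by instance:

SQL> SELECT inst_id, COUNT(*) AS sessions FROM gv$session GROUP BY inst_id ORDER BY inst_id;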

Using the Global Resource Directory (GRD)

Additional background processes are used in Real Application Clusters for cache synchronization—remember that RAC uses the Cache Fusion architecture to simulate a global SGA across all cluster nodes. Access to blocks in the buffer cache needs to be coordinated for read consistency and write access, and enqueues to shared resources are now global across the cluster. These two main concepts are implemented in the Global Cache Service (GCS), for accessing the common buffer cache; and the Global Enqueue Service (GES), for managing enqueues in the cluster.

Both GCS and GES work transparently to applications. The meta structure used internally is the previously mentioned Global Resource Directory (GRD), which is maintained by the GCS and GES processes. The GRD is distributed across all nodes in the cluster and is part of the SGA, which is why the SGA of a RAC database is larger than its single-instance equivalent. Resource management is negotiated by GCS and GES; as a result, a particular resource is managed by exactly one instance, referred to as the resource master. Resource mastering is not static, however. Oracle 9i Release 2 (to a degree) and subsequent versions implemented dynamic resource mastering (DRM). In releases prior to Oracle 9i Release 2, remastering would only happen during instance failure, when the GRD was reconstructed. In newer releases, remastering can happen if Oracle detects that an instance other than the resource master accesses a certain resource more than a given number of times during a particular interval; in this case, the resource is remastered to the other node. Many users have reported problems with dynamic remastering, which can add undesired overhead if it happens too frequently. Should this happen, DRM can be deactivated.

Note

The GRD also records which resource is mastered by which instance—a fact that will come in very handy during recovery, should an instance fail.

Figure 2-2 provides an overview of how the GCS and GES background processes work hand-in-hand to maintain the GRD; other RAC background processes have been left out intentionally.


Figure 2.2. The Global Resource Directory in the context of Global Enqueue Services and Global Cache Services

Maintaining Cache Coherence with Global Cache Services (GCS)

The Global Cache Service uses the LMSn background processes to maintain cache coherency in the global buffer cache. As we already discussed in Chapter 1, multiple copies (but only one current version!) of the same block can exist in the global SGA. GCS keeps track of the status and the location of data blocks, and it ships blocks across the interconnect to remote instances.

Managing Global Enqueues with Global Enqueue Services (GES)

Similar to GCS, which works on the block level, GES manages global enqueues in the cluster. As a rule of thumb, if an operation doesn't involve mastering/transferring blocks within the global buffer cache, then it is most likely handled by GES. The Global Enqueue Service is responsible for all inter-instance resource operations, such as global management of access to the dictionary and library cache or transactions. It also performs intercluster deadlock detection. It tracks the status of all Oracle enqueue mechanisms for resources that are accessed by more than one instance. The Global Enqueue Service Monitor (LMON) and Global Enqueue Service Daemon (LMD) form part of the Global Enqueue Services. The Lock Process (LCK0) process mentioned in Figure 2-2 is responsible for non-Cache Fusion access, such as library and row cache requests.

Transferring Data Between Instances with Cache Fusion

Cache Fusion is the most recent evolution of inter-instance data transfer in Oracle. Instead of using the block ping mechanism implemented through Oracle 8i, Oracle uses a fast interconnect to transfer data blocks for reads and writes across all cluster nodes.

Note

Cache Fusion was partially implemented in Oracle 8i as well, but completed in 9i Release 1, which also saw the rebranding of Oracle Parallel Server to Real Application Clusters.

The block ping method of transferring blocks between instances was hugely expensive; at the time, it was strongly advised that you tie workloads to instances to ensure a minimum amount of inter-instance block transfer. In Oracle Parallel Server, when an instance other than the one holding the current block requested that block for modification, it signaled its request to the holding instance, which in turn wrote the block to disk and signaled that the block could be read. The amount of communication involved, plus the write to and subsequent read from disk, made this approach highly undesirable.

Cache Fusion block transfers rely on the Global Resource Directory, and they never require more than three hops, depending on the setup and the number of nodes. Obviously, if there are only two cluster nodes, there will always be a two-way cache transfer. With more than two nodes, a maximum of three hops will be necessary. Oracle has instrumented communication involving the cache with dedicated wait events whose names end in either 2-way or 3-way, depending on the scenario.

When an instance requests a data block through Cache Fusion, it first contacts the resource master to ascertain the current status of the resource. If the resource is not currently in use, it can be acquired locally by reading from disk. If the resource is currently in use, the resource master will request that the holding instance pass the resource to the requesting instance. If the resource is subsequently required for modification by one or more instances, the GRD will be modified to indicate that the resource is held globally. The resource master, the requesting instance, and the holding instance can all be different, in which case a maximum of three hops is needed to get the block.

The previously mentioned two-way and three-way block transfers are related to how resources are managed. In case the resource master holds the requested resource, then the request for the block can be satisfied immediately and the block shipped, which is a two-way communication. In a three-way scenario, the requestor, resource master, and holder of the block are different—the resource master needs to forward the request, introducing a new hop.
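
These transfers show up in the global cache wait events. The following query, which can be run on any instance, lists the cumulative two-way and three-way block waits per instance; the LIKE patterns are an assumption about the exact event names on your release:

SQL> SELECT inst_id, event, total_waits, time_waited
     FROM   gv$system_event
     WHERE  event LIKE 'gc%2-way' OR event LIKE 'gc%3-way'
     ORDER  BY inst_id, event;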

From this discussion, you can imagine that the effort to coordinate blocks and their images in the global buffer cache is not to be underestimated. In a RAC database, Cache Fusion usually represents both the most significant benefit and the most significant cost. The benefit is that Cache Fusion theoretically allows scale-up, potentially achieving near-linear scalability. However, the additional workload imposed by Cache Fusion can be in the range of 10% to 20%.

Achieving Read Consistency

One of the main characteristics of the Oracle database is the ability to simultaneously provide different views of data. This characteristic is called multi-version read consistency. Queries will be read consistently; writers won't block readers, and vice versa. Of course, multi-version read consistency also holds true for RAC databases, but a little more work is involved.

The System Change Number (SCN) is an Oracle-internal timestamp that is crucial for read consistency. If the local instance requires a read-consistent version of a block, it contacts the block's resource master to ascertain whether a version of the block with the same or a more recent SCN exists in the buffer cache of any remote instance. If such a block exists, the resource master will ask the relevant remote instance to forward a read-consistent version of the block to the local instance. If the remote instance is holding a version of the block at the requested SCN, it sends the block immediately. If the remote instance is holding a newer version of the block, it creates a copy of the block, called a past image; applies undo to the copy to revert it to the correct SCN; and sends it over the interconnect.

Synchronizing System Change Numbers

System Change Numbers are internal timestamps generated and used by the Oracle database. All events happening in the database, including transactions, are assigned SCNs. The implementation Oracle uses for read consistency relies heavily on SCNs and on information in the undo tablespaces. System change numbers need to be kept in sync across the cluster, and Real Application Clusters uses one of two schemes to keep SCNs current on all cluster nodes: the broadcast-on-commit scheme and the Lamport scheme.

The broadcast-on-commit scheme is the default scheme in 10g Release 2 and newer; it addresses a known problem with the Lamport scheme. Historically, the Lamport scheme was the default scheme—it promised better scalability as SCN propagation happened as part of other (not necessarily related) cluster communication and not immediately after a commit is issued on a node. This was deemed sufficient in most situations by Oracle, and documents available on My Oracle Support seem to confirm this. However, there was a problem with the Lamport scheme: It was possible for SCNs of a node to lag behind another node's SCNs—especially if there was little messaging activity. The lagging of system change numbers meant that committed transactions on a node were "seen" a little later by the instance lagging behind.

On the other hand, the broadcast-on-commit scheme is a bit more resource intensive. The log writer process LGWR updates the global SCN after every commit and broadcasts it to all other instances. The deprecated max_commit_propagation_delay initialization parameter allowed the database administrator to influence the default behavior in RAC 11.1; the parameter has been removed in Oracle 11.2.

Exploring the New Features of 11g Release 2

The final section of this chapter provides an overview of the most interesting new features introduced with Oracle 11g Release 2, both for Grid Infrastructure and for RAC. Many of these have already been mentioned briefly in previous sections in this chapter, as well as in Chapter 1.

Leveraging Grid Plug and Play

Many new features in 11g Release 2 aim to make maintenance tasks in Oracle RAC simpler, and Grid Plug and Play (GPnP) is no exception. It seems to the authors that another design goal of the Oracle software is to provide support for tasks and services that were traditionally performed outside the DBA department. Simplifying the addition of cluster nodes, to provide true grid computing where computing power is a utility, also seems to have been a contributing factor to the design decisions made in RAC 11.2. There is a big demand in the industry for consolidation; making maximum use of existing hardware resources has become very important, and RAC is well suited to this role. Grid Plug and Play helps administrators maintain the cluster. For example, a number of manual steps previously required when adding or removing nodes from a cluster are automated with GPnP.

Note

GPnP focuses on the Grid Infrastructure layer; it does not address the addition of instances to a RAC database. Adding and removing instances to or from a RAC database is done through server pools.

Grid Plug and Play is not a monolithic concept; rather, it relies on several other new features:

  • Storing cluster information in an XML configuration file

  • Cluster time synchronization (CTSS)

  • The Grid Naming Service (GNS)

  • Single Client Access Name (SCAN)

  • Server Pools

Grid Plug and Play works behind the scenes. You will most likely notice the absence of configuration dialogs when performing cluster maintenance. For example, you won't be prompted for information such as a new node's name or its virtual IP address.

GPnP stores a node's metadata: the network interfaces for the public network and the private interconnect, the ASM server parameter file, and the CSS voting disks. The profile, an XML file, is protected against modification by a wallet. If you have to modify the profile manually, it must first be unsigned with $GRID_HOME/bin/gpnptool, modified, and then signed again with the same utility. Don't worry, though: the profile is updated automatically, without administrator intervention, when you use cluster management tools such as oifcfg.
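
The following is a sketch of what this looks like on a running cluster. The interface names and subnets are placeholders, and the exact gpnptool verbs can differ slightly between patch levels, so run gpnptool without arguments to see the usage summary for your release:

$ $GRID_HOME/bin/oifcfg getif
eth0  192.168.1.0    global  public
eth1  192.168.100.0  global  cluster_interconnect
$ $GRID_HOME/bin/gpnptool get      # dumps the signed profile XML to stdout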

The CTSS daemon, part of Grid Infrastructure, synchronizes time between cluster nodes in the absence of a network-accessible Network Time Protocol (NTP) server. This removes the dependency on an NTP server; however, we recommend using NTP wherever possible, because you might otherwise end up with incorrect (but consistent!) system time on all nodes.

Prior to Oracle 11.2, a node's public and virtual IP addresses had to be registered in a Domain Name Server (DNS) for clients to connect to the database correctly and to support connect-time load balancing. Cluster maintenance, such as adding or removing nodes, requires changes in DNS that can be a burden if such maintenance is performed often. The idea behind the Grid Naming Service is to move the mapping of IP addresses to names, and vice versa, out of the DNS service and into Clusterware. Technically, Clusterware runs its own small name server, listening on yet another virtual IP address, using a method called subdomain delegation. In simple terms, you create a new subdomain (e.g., ebsprod) within your domain (example.com) and instruct the name server responsible for your domain to hand off all requests for the subdomain (ebsprod.example.com) to GNS. During subsequent steps of the installation, you won't be asked to supply virtual IP addresses and names; only public and private network information has to be supplied. The addresses GNS uses have to come from a Dynamic Host Configuration Protocol (DHCP) server on the public network. Table 2-5 gives an overview of the addresses and their use, their default names, and which part of the software stack is responsible for resolving them. The only dependency on DNS exists during the initial cluster installation, when the GNS virtual address is allocated and assigned in the DNS.

Table 2.5. Network Addresses with Active GNS

Address used by/for                   Default Name                  Assigned Type  Assigned by                      Resolved by
Grid Naming Service virtual address   clustername-gns.example.com   virtual        DNS administrator                DNS
Public node address                   assigned hostname             public         assigned during OS installation  GNS
Node virtual address                  publicName-vip                virtual        assigned through DHCP            GNS
Node private address                  publicName-priv               private        fixed                            hosts file
SCAN virtual IP 1                     clustername-gns.example.com   virtual        assigned through DHCP            GNS
SCAN virtual IP 2                     clustername-gns.example.com   virtual        assigned through DHCP            GNS
SCAN virtual IP 3                     clustername-gns.example.com   virtual        assigned through DHCP            GNS

We recommend defining the private IP addresses for the cluster interconnect in the /etc/hosts file on each cluster node; this will prevent anyone or anything else from using them.

Warning

The use of GNS is optional; at the time of writing, a number of bugs related to GNS have been reported against the base release.

You choose to use GNS during the installation of Grid Infrastructure by ticking a small box and assigning a subdomain plus a virtual IP address for GNS. The instruction to DNS to perform subdomain delegation needs to be completed prior to the installation.
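
Whether the delegation is in place can be verified with standard DNS tools before you start the installer. The subdomain, cluster name, and address below are placeholders for your own environment:

$ dig +short NS ebsprod.example.com
cluster1-gns.example.com.
$ dig +short cluster1-gns.example.com
192.0.2.155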

The next feature in our list is the Single Client Access Name (SCAN), which we will discuss in more detail later in this chapter. SCAN abstracts the number of cluster nodes from client access. The addition and deletion of nodes are completely transparent to clients because the SCAN relates to the cluster rather than to an individual database.

To minimize additional effort after the addition of a node to the cluster, server pools have been introduced to simplify the addition and removal of database instances to a RAC database. We will discuss server pools next.

Modeling Resources with Server Pools

Server pools are an interesting new feature. They provide a new way of modeling resources in Clusterware. They allow you to subdivide a cluster into multiple logical subunits that can be useful in shared environments. All nodes in Clusterware 11.2 are either implicitly or explicitly part of a server pool. By default, two pools exist after a fresh installation: the free pool and the generic pool. The generic pool is used for backward compatibility, and it stores pre-11.2 databases or administrator-managed 11.2 databases. The free pool takes all non-assigned nodes.

Server pools are mutually exclusive and take a number of attributes, such as a name, the minimum and maximum number of nodes, and an importance. The importance attribute ensures that low-priority workloads don't starve more important ones of resources. It is possible to reallocate servers from one pool to another, which opens up very interesting scenarios in capacity management; Clusterware can automatically move servers from other server pools to meet an important server pool's minimum size requirement.

Server pools go hand-in-hand with a new way of managing RAC databases. Prior to Oracle 11.2, administrators were responsible for adding and removing instances from a RAC database, including the creation and activation of public online redo log threads and undo tablespaces. Server pools, together with Oracle Managed Files as used in ASM, automate these tasks in the form of policy-managed databases. Administrator-managed databases are, as the name suggests, managed entirely by the database administrator; in other words, this is the way RAC databases were managed up to Oracle 11.1. Policy-managed databases use automated features for adding and removing instances and services. The number of instances a policy-managed database starts is determined by its server pool's cardinality; in other words, if you need another instance, all you need to do is assign a new node to the database's server pool, and Oracle will do the rest.
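
As a sketch, creating a server pool and placing a policy-managed database in it could look like the following. The pool name, cardinality, importance, database name, and spfile location are all assumptions; the option letters shown are those of the 11.2 srvctl syntax:

$ srvctl add srvpool -g batchpool -l 2 -u 3 -i 10
$ srvctl add database -d batchdb -o $ORACLE_HOME -g batchpool \
      -p +DATA/batchdb/spfilebatchdb.ora
$ srvctl status srvpool -g batchpool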

Note

See Chapter 11 to learn more about the implications of services on policy-managed databases.

In conjunction with server pools, Grid Infrastructure introduced another feature called Role Separated Management. In shared environments, administrators should be restricted to managing their respective server pool. Access Control Lists are implemented for Clusterware resources, including server pools, to govern access to resources. A new role, the cluster administrator, is introduced. By default, the Grid Infrastructure software owner "grid" and the root user are permanent cluster administrators. Additional operating system accounts can be promoted to cluster administrators, each of which can have a set of permissions on resources, types, and server pools. Separation of duties now seems to be possible on the cluster level—but bear in mind that the grid owner and root users are all powerful.

Ensuring POSIX Compliance with ACFS

ASM has been discussed at great length in this chapter in relation to shared storage for the RAC database. Oracle 11.2 extends ASM so that it not only stores database-related file structures, but also provides a POSIX-compliant cluster file system called ACFS. POSIX compliance means that all the operating system utilities we use with ext3 and other file systems can also be used with ACFS. A space-efficient, read-only, copy-on-write snapshot mechanism is available as well, allowing up to 63 snapshots to be taken. The ASM Cluster File System uses 64-bit structures internally; it is a journaling file system, and it uses metadata checksums to ensure data integrity. ACFS can be resized online, and it uses all the services ASM offers, most importantly I/O distribution.

ACFS solves a problem users had in RAC. Database directory objects and external tables can now point to ACFS file systems, allowing a single view on all the external data in the cluster. In the real world, users of external tables had to make sure to connect to the correct instance to access the underlying data in the external table or directory object. With ACFS, it is no longer necessary to connect to a specific node. Presenting the file system to all nodes solves the problem. ACFS also addresses the fact that the use of an additional cluster file system—OCFS2, for example—could cause problems on RAC systems because it meant the existence of a second set of heartbeat processes.

ACFS is not limited to RAC. Another scenario for ACFS is a web server farm protected by the Grid Infrastructure high availability framework that uses ACFS to export the contents of the web site or application code to the server root directories.

ACFS can store database binaries, BFILEs, parameter files (pfiles), trace files, logs, and, of course, any kind of user application data. Database files are explicitly not supported in ACFS, and Oracle won't allow you to create them in an ACFS mount point. However, Oracle does explicitly support installing the Oracle binaries as a shared home in ACFS; the mounting of the file system can be integrated into Clusterware, ensuring that the file system is available before trying to start the database.

ACFS uses the services of the ASM Dynamic Volume Manager (ADVM), which provides volume management services and a standard device driver interface to its file system clients, such as ACFS and ext3. The ACFS file system is created on top of an ADVM volume, a specific ASM file type that differentiates it from other ASM-managed file types, such as datafiles, controlfiles, and so on. Figure 2-3 details the relationship between ASM disks, the ASM disk group, a dynamic volume, and the ACFS file system:


Figure 2.3. ACFS and ASM dynamic volumes

Management of ACFS and ADVM volumes is tightly integrated into Enterprise Manager, the ASM Configuration Assistant (ASMCA), and command line utilities, as well as into SQL*Plus.
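
Creating an ACFS file system from the command line is a short sequence of steps, sketched below. The disk group, volume name, size, and mount point are assumptions; the asmcmd commands are run as the Grid Infrastructure owner, while mkfs, mount, and the registry update require root:

$ asmcmd volcreate -G DATA -s 10G appvol
$ asmcmd volinfo -G DATA appvol              # note the volume device, e.g. /dev/asm/appvol-123
# mkfs -t acfs /dev/asm/appvol-123
# mkdir -p /u01/app/acfs/appvol
# mount -t acfs /dev/asm/appvol-123 /u01/app/acfs/appvol
# acfsutil registry -a /dev/asm/appvol-123 /u01/app/acfs/appvol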

Using Oracle Restart Instead of RAC

Oracle Restart, or Single Instance High Availability as it was previously called, is the standalone counterpart of Grid Infrastructure for RAC. Similar to its big brother, Oracle Restart keeps track of resources such as (single-instance) databases, the ASM instance and its disk groups, and the listener and other node applications, such as the Oracle Notification Service. It also uses the Oracle High Availability Service (OHAS) and its child processes to monitor the state of registered resources and restarts failed components if necessary. Oracle Restart's metadata is stored in the Oracle Local Registry (OLR), and it also runs a CSSD instance, which you no longer need to create manually through a call to localconfig add, as was required for ASM up to Oracle 11.1. The great thing about Oracle Restart is its integration with ONS and the administration of resources through the commands already known from RAC. Finally, database administrators don't need to worry about startup scripts: Oracle Restart will start a database when the server boots. Dependencies defined in Oracle Restart ensure that a database doesn't start before ASM is up, and so on. This means that mistakes such as forgetting to start the listener should be a problem of the past. Managed service providers will appreciate that Oracle Restart is identical across all platforms, thereby providing a uniform way of managing databases and their startup.

Oracle Restart reduces the number of recommended Oracle homes by one: all you need is a Grid Infrastructure home, which contains the binaries for ASM, and the RDBMS software home. Unlike Grid Infrastructure, Oracle Restart can share the ORACLE_BASE with the database binaries. The installation is very similar to the clustered installation, minus the copying of files to remote nodes. Another benefit is that it doesn't require the definition of OCR or voting disk locations. The execution of the root.sh script initializes the OLR and creates the ASM instance with the disk groups specified.

After installing Oracle Restart, you will see the following list of resources in your system:

[oracle@devbox001 ~]$ crsctl status resource -t
------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
------------------------------------------------------------------------------
Local Resources
------------------------------------------------------------------------------
ora.LISTENER.lsnr
               ONLINE  ONLINE       devbox001
ora.DATA.dg
               ONLINE  ONLINE       devbox001
ora.REDO.dg
               ONLINE  ONLINE       devbox001
ora.FRA.dg
               ONLINE  ONLINE       devbox001
ora.asm
               ONLINE  ONLINE       devbox001            Started
------------------------------------------------------------------------------
Cluster Resources
------------------------------------------------------------------------------
ora.cssd
      1        ONLINE  ONLINE       devbox001
ora.diskmon
      1        ONLINE  ONLINE       devbox001

As you can see, ASM disk groups are resources, both in clustered and single-instance installations, which makes it far easier to mount and unmount them. Disk groups are automatically registered when first mounted; you no longer need to modify the asm_diskgroups initialization parameter.

Once Oracle Restart is configured, you can add databases to it just as you do in RAC, but with one important difference: no instances need to be registered. Databases can be configured with their associated disk groups, creating a dependency: if the disk group is not mounted when the database starts, Oracle Restart will try to mount the disk group(s) first.

Services are defined through a call to srvctl—again, just as in the RAC counterpart. Do not set the initialization parameter service_names to specify a database's services.
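
A minimal sketch of registering a database and a service with Oracle Restart follows; the database name, Oracle home path, disk groups, and service name are assumptions:

$ srvctl add database -d orcl -o /u01/app/oracle/product/11.2.0/dbhome_1 -a "DATA,FRA"
$ srvctl add service -d orcl -s reporting
$ srvctl start database -d orcl
$ srvctl status database -d orcl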

By default, Oracle Restart won't instantiate the Oracle Notification Service. Adding the ONS daemons is useful in conjunction with the Data Guard broker, which can send out FAN events to inform clients of a failover operation. FAN UP and DOWN events are also sent, but the lack of high availability in single-instance Oracle limits the usefulness of this approach.

Simplifying Clustered Database Access with the SCAN Listener

The Single Client Access Name is a new feature that simplifies access to a clustered database. In versions prior to Oracle 11.2, an entry in the tnsnames.ora file for an n-node RAC database always referenced all nodes in the ADDRESS_LIST section, as in the listing that follows:

QA =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (LOAD_BALANCE = ON)
      (FAILOVER = ON)
      (ADDRESS = (PROTOCOL = tcp)(HOST = london1-vip.example1.com)(PORT = 1521))
      (ADDRESS = (PROTOCOL = tcp)(HOST = london2-vip.example1.com)(PORT = 1521))
      (ADDRESS = (PROTOCOL = tcp)(HOST = london3-vip.example1.com)(PORT = 1521))
      (ADDRESS = (PROTOCOL = tcp)(HOST = london4-vip.example1.com)(PORT = 1521))
    )
    (CONNECT_DATA =
      (SERVICE_NAME = qaserv)
    )
  )

Adding and deleting nodes from the cluster required changes in the ADDRESS_LIST. In centrally and tightly managed environments, this might not be much of a problem; however, in environments where clients are distributed across a farm of application servers, this process can take a while and is error prone. The use of a SCAN address removes this problem: instead of addressing every single node as before, the SCAN virtual IP addresses refer to the cluster as a whole. Using the SCAN, the preceding connection entry is greatly simplified. Instead of listing each node's virtual IP address, all we need to do is enter the SCAN scanqacluster.example1.com. Here's the simplified version:

QA =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = tcp)(HOST = scanqacluster.example1.com)(PORT = 1521))
    )
    (CONNECT_DATA =
      (SERVICE_NAME = qaserv)
    )
  )

As a prerequisite to the installation or upgrade of Grid Infrastructure, at least one, but preferably three, previously unused IP addresses in the same subnet as the public network must be allocated and registered in DNS. Alternatively, if you decide to use the Grid Naming Service (GNS), the GNS daemon will allocate three IP addresses from the range offered by the DHCP server. The SCAN name is defined in DNS with Address (A) records that resolve to these IP addresses in a round-robin fashion; you also need to make sure a reverse lookup is possible. Oracle Universal Installer will create new entities called SCAN listeners, along with the SCAN virtual IP addresses. Database instances register their services with the SCAN listeners in addition to their local listeners. A SCAN listener and a SCAN virtual IP address form a resource pair; both will be relocated to a different cluster node if a node fails. If you ever need to, you can use the server control utility to administer the SCAN listeners and their IP addresses.
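
You can verify the round-robin resolution and the reverse lookup with standard DNS tools; the SCAN name and the addresses shown are placeholders:

$ dig +short scanqacluster.example1.com
192.0.2.21
192.0.2.22
192.0.2.23
$ dig +short -x 192.0.2.21
scanqacluster.example1.com.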

The SCAN listeners are responsible for connect time load balancing, and they will hand off connections to the least loaded node offering the service the client requested. Figure 2-4 shows how these listeners fit into the overall RAC picture.


Figure 2.4. SCAN listeners in the context of Real Application Clusters

Figure 2-4 demonstrates the use of the SCAN. Assuming a three-tier application design and a manual (that is, non-GNS) configuration, the application server's sessions connect to the SCAN scanname.example.com on behalf of users. A DNS server is contacted to resolve the SCAN and returns one of the three IP addresses defined; this helps spread the load across all three SCAN listeners. The SCAN listener in turn redirects the request to the local listener of the least loaded node offering the service requested by the client. From this stage on, the process is no different from the pre-SCAN era: the client resolves the virtual IP address of the node and establishes the connection.

The transition to the use of SCAN addresses is not mandatory—you can continue to use the old connection strings unless you would like to connect to a policy-managed database.

Summary

In this chapter, we discussed the concepts behind Real Application Clusters that are necessary for understanding the remaining chapters of the book. We began by introducing the various cluster concepts; clusters differ in how many nodes serve requests, as well as in how the data served to clients is managed in the cluster. RAC takes the best of all worlds: all nodes serve requests, and they all access all the data concurrently.

We also discussed the cluster interconnect used extensively by Clusterware and Grid Infrastructure. Grid Infrastructure is the foundation for RAC databases, providing the necessary infrastructure service for the node membership management and intercluster communication that RAC needs. Next, we looked at Oracle's preferred storage option for the cluster database, Automatic Storage Management. ASM is a mature, cluster aware logical volume manager used to store all physical database structures. Oracle is pushing ASM heavily, and we recommend taking a closer look at this technology if you have not already done so. ASM uses techniques similar to other logical volume managers, making it easy to understand the concepts.

We also described RAC and how this technology is different from single-instance Oracle databases. The Cache Fusion architecture allows the definition of a cluster-wide shared global area. A number of additional structures maintained by each database instance ensure cache coherency in the global buffer cache; enqueue mastering is also globally performed.

Finally, we focused on some of the interesting new features introduced by Oracle 11g Release 2.
