Cluster Software

TABLE 6-3 lists the selected base software stack. The applications that run on top of the base are not described here, since they are not publicly available.

Table 6-3. Oracle 9i RAC Cluster Software Stack
Software                         Version         Description
Oracle RAC                       9i              Database system
VxVM                             3.1.1           VERITAS Volume Manager
Sun Cluster software             3.0 update 1    Cluster platform
Solaris operating environment    8 7/01          Operating environment
Sun Management Center agents     3.0             Agents for managing the cluster with the Sun Management Center software

The Oracle 9i RAC database system allows two or more cluster nodes to simultaneously perform transactions against a single database. These nodes can operate in two modes:

  • Active-active is a type of architecture in which multiple nodes synchronize their accesses to database objects.

  • Active-passive or primary-secondary is a type of architecture in which one node performs work against the database while a second node stands by, ready to take over processing if the first node fails.

The Oracle 9i RAC uses a shared disk architecture (FIGURE 6-3), so each node in the cluster has direct access to the shared disks. All database instances access the data files and control files. The Oracle 9i RAC requires replication of many pieces of the Oracle database architecture across each participating cluster node. For example, each node simultaneously starts an Oracle database instance comprising the necessary background processes: the system monitor (SMON), process monitor (PMON), log writer (LGWR), and database writer (DBWR).

Figure 6-3. Oracle 9i RAC Architecture


A distributed lock manager (DLM) process runs on each instance and coordinates data block synchronization by creating an in-memory database of lock objects that are equally distributed among all instances.

Background processes that support the DLM include the Global Enqueue Service monitor (LMON) and the Global Enqueue Service daemon (LMD). Moreover, each node maintains its own system global area (SGA) in memory, including the database buffer cache, the redo log buffer, and the shared pool.

On disk, the system assigns each node its own redo log files and rollback segments. The system uses these redo log files to return the database to a consistent state following a system crash. These log files record changes made to the blocks of any object, including tables, indices, and rollback segments. The log files guarantee preservation of all committed transactions in the event of a crash, even if the resultant data block changes have not yet been written to the data files. Rollback segments store the “undo information,” that is, the information needed to cancel or roll back a transaction, should the application choose to do so.

Rollback segments also provide a form of SQL statement isolation. A long-running query against a set of tables must see only their contents as they existed at the time the query began. This is known as read consistency. If another transaction later modifies data blocks of the same tables, the rollback segments store the “before images” so that the earlier, long-running query can return a result set isolated to the time at which it began.
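To make read consistency concrete, the following toy sketch (in Python, purely illustrative and not Oracle's implementation) models a single block whose before images are saved when it is updated; a query that started at an earlier SCN reconstructs the block as it looked at that time.

    # Toy model of read consistency through "before images." A query records
    # the SCN at which it started; if the block changed after that SCN, the
    # saved before images are used to roll the contents back.
    class Block:
        def __init__(self, value, scn):
            self.value = value            # current contents
            self.scn = scn                # SCN of the most recent change
            self.undo = []                # (SCN of prior version, before image)

        def update(self, new_value, scn):
            self.undo.append((self.scn, self.value))   # save the before image
            self.value, self.scn = new_value, scn

        def read_consistent(self, query_scn):
            value, scn = self.value, self.scn
            for prior_scn, before_image in reversed(self.undo):
                if scn <= query_scn:
                    break                 # this version predates the query
                value, scn = before_image, prior_scn
            return value

    blk = Block("balance=100", scn=10)
    query_scn = 15                        # a long-running query starts here
    blk.update("balance=50", scn=20)      # a later transaction modifies the block
    assert blk.read_consistent(query_scn) == "balance=100"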

The Oracle 9i RAC package fits into a market segment in which availability takes precedence over scalability. Hence, Oracle 9i RAC is also offered as an active-passive solution. In active-passive mode, all client connections remain on the primary node and fail over to the secondary node only if the primary fails. By itself, this active-passive configuration does not completely address client connection (application) failover.

During a failover, it is preferable for the client to reconnect gracefully to the surviving node without having to authenticate again. Moreover, if the client was processing a query, it is beneficial to restart the query after failover, without user intervention. Oracle RAC Guard provides this application failover mechanism, meeting the requirement of managed failover for an Oracle 9i RAC active-passive configuration. Oracle RAC Guard is an additional layer of software packages that, for architectural reasons, reside on top of the Oracle 9i RAC.
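The sketch below suggests the kind of retry-and-reconnect logic an application would otherwise have to supply itself without a managed failover layer; the connect callable, the run_query method, and the FailoverError exception are hypothetical placeholders, not the RAC Guard or Oracle client API.

    import time

    class FailoverError(Exception):
        """Raised (hypothetically) when the connection to the current node is lost."""

    def query_with_failover(connect, sql, retries=3, delay=5.0):
        conn = connect()                      # initial session
        for attempt in range(retries + 1):
            try:
                return conn.run_query(sql)    # restart the query transparently
            except FailoverError:
                if attempt == retries:
                    raise
                time.sleep(delay)             # give the surviving node time to take over
                conn = connect()              # re-establish the session without user action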

In this active-passive configuration, Oracle 9i RAC is widely adopted by the industry. Applications need not be designed for parallel operations to gain the benefits of a quick failover; this approach lowers the MTTR, thereby improving overall availability. The downside is that the secondary node, which is essentially standing by for a failover with a warm instance, is underused. Gaining full use of the secondary node depends on improvements in synchronization technology.

Arbitration

The Oracle 9i RAC has many performance and availability optimizations. Many of these optimizations center on the architecture, in which internode communication is expensive in terms of latency. For arbitration, Oracle distributes the ownership of locks across nodes in the cluster. Oracle calls this lock mastering. From an arbitration perspective, lock mastering provides the metadata about locks: each node knows which node masters each lock and, therefore, where to send synchronization requests.

The Oracle 9i RAC can move the lock mastery dynamically between nodes. When the cluster is built, the nodes must arbitrate to decide where to master the locks. If a node fails, the surviving nodes must arbitrate to assign new masters for the locks previously held by the failed node.

Performance optimization is possible when a node that dominates access to a set of locks can become the master for those locks. This control reduces the amount of lock traffic by changing global accesses into local accesses.

Lock Mastering

To synchronize requests for resources from multiple instances, the DLM maintains an in-memory database of locks distributed randomly among all active cluster nodes. When the system requests an object for global role access, its object ID is applied to a hashing function, which identifies the instance responsible for coordinating its access. This instance acquires a global cache service (GCS) lock for the object from a reusable pool of locks. All other instances within the cluster can apply the same hashing function to the object and, therefore, know exactly which instance to contact for permission to access the object in a shared or exclusive mode. Such an instance is “mastering the lock” for that object. Randomly assigning lock mastering among all active instances distributes the DLM workload equitably.
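A minimal sketch of this hashing scheme, assuming integer object IDs and instances numbered from zero, is shown below; any instance can evaluate the same function and arrive at the same lock master.

    def lock_master(object_id, active_instances):
        """Return the instance that masters the GCS lock for this object."""
        bucket = hash(object_id) % len(active_instances)
        return active_instances[bucket]

    instances = [0, 1, 2, 3]
    print(lock_master(1008, instances))       # every node computes the same master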

Node Joining the Cluster

When a new node joins the cluster, it starts an instance and mounts the same database as the other instances. Group Services makes GCS aware that a new instance has joined the Oracle 9i RAC and is ready to be assigned its share of the workload of mastering the GCS locks.

To distribute the mastering workload, the Oracle 9i RAC relies on an integer M, which is a multiple of the maximum number of possible instances, as defined by the PARALLEL_SERVER_INSTANCES parameter. Rather than interrupting all access to the database while performing a full remastering, the DLM gradually, over time, migrates some of its resource mastering workload to the joining instance, thereby maximizing availability.
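The sketch below illustrates the idea of gradual remastering under the assumption that lock mastering is organized into buckets: rather than remastering everything at once, the DLM hands a few buckets at a time to the joining instance until the shares are roughly even. The bucket structure and batch size are illustrative assumptions, not Oracle internals.

    def migration_plan(bucket_owner, new_instance, batch):
        """Yield small batches of buckets to hand over to the joining instance."""
        instances = set(bucket_owner.values()) | {new_instance}
        target = len(bucket_owner) // len(instances)      # fair share per instance
        to_move = [b for b, owner in bucket_owner.items()
                   if owner != new_instance][:target]
        for i in range(0, len(to_move), batch):
            yield to_move[i:i + batch]                    # migrate a batch, then resume normal work

    owners = {b: b % 3 for b in range(12)}                # 12 buckets over instances 0-2
    for step in migration_plan(owners, new_instance=3, batch=2):
        for b in step:
            owners[b] = 3                                 # remaster this bucket to the new node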

Node Leaving the Cluster

When a node leaves the cluster, only the resources it was mastering need to be remastered. Locks already mastered to the surviving instances are unaffected.
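A corresponding sketch for node departure, under the same illustrative bucket assumption: only the buckets the failed instance was mastering are reassigned, spread round-robin across the survivors.

    def remaster_after_failure(bucket_owner, failed):
        survivors = sorted(set(bucket_owner.values()) - {failed})
        orphaned = [b for b, owner in bucket_owner.items() if owner == failed]
        for i, bucket in enumerate(orphaned):
            bucket_owner[bucket] = survivors[i % len(survivors)]   # locks mastered elsewhere are untouched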

Once Sun Cluster Group Services notifies the Oracle 9i RAC of the departing node, a software package, such as that found in Oracle RAC Guard, attempts to restart the instance on the failing node. If a restart is not possible, SMON, on the first surviving node to observe the instance failure, performs the instance recovery at the same time the remastering of the locks occurs. All transactions performed on the failed instance are recorded in the redo log files of that instance, but only those transactions committed before the newest checkpoint are guaranteed to have been written out to the data files. Because the redo log files for all instances reside on raw devices, the instance performing recovery can access the redo log of the failed instance. It either applies to the data files the changes of transactions that committed after the final checkpoint (also known as “rolling forward”) or rolls back those that had not committed, by reading the “before images” found in the rollback segments of the failed instance. SMON also frees any resources that those pending transactions may have acquired. During the roll-forward period, the database is only partially available. Other instances can access only the data blocks they currently have buffered; they cannot perform any database I/O, nor can they ask GCS for additional resource locks.

Similar processing occurs when more than one, but not all, of the instances fail simultaneously. The system must remaster the GCS locks of the failed instances to the surviving instances, and the SMON of a surviving instance must perform roll-forward and rollback operations against the redo log files and rollback segments left behind. However, in this case, the Oracle 9i RAC takes advantage of the system change number (SCN), a timestamp known to all instances of the database. SMON reads the redo log files once to identify the “recovery set,” that is, the data blocks that have a sequence of modifications recorded but are missing a subsequent block-written record.

During the second and final pass, SMON issues a sorted merge of the failed redo log files, based on SCN. When SMON identifies a block belonging to the recovery set built in pass one, it constructs the last known version of the block (which may involve applying the redo information from more than one failed instance), writes the version to disk, and frees the lock associated with it. This process avoids repeated writes of the same data block. Note that, if the latest known version of the data block resides on one of the failed instances, the system can obtain a more recent past image version of the block over the cluster interconnect from a surviving node, rather than rereading the block from disk.
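The following toy sketch follows the two-pass description above: pass one has already produced the recovery set, and pass two merges the redo of all failed instances in SCN order, writing each recovered block only once. The record layout and the write_block and free_lock callables are assumptions for illustration, not Oracle internals.

    import heapq

    def second_pass(failed_redo_logs, recovery_set, write_block, free_lock):
        # Merge the per-instance redo streams by SCN (each stream is SCN-ordered).
        merged = heapq.merge(*failed_redo_logs, key=lambda rec: rec["scn"])
        last_version = {}
        for rec in merged:
            if rec["block"] in recovery_set:
                last_version[rec["block"]] = rec["after_image"]   # later SCN wins
        for block, image in last_version.items():
            write_block(block, image)     # write the last known version once
            free_lock(block)              # release the lock associated with the block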

Crash Recovery

In the unlikely event that all instances fail, the first instance restarted after the failure performs instance recovery on the redo log files of all failed instances, including its own, if necessary. This is known as “database crash recovery.” It is also possible to initiate the recovery from an entirely separate instance that was not participating in the cluster before failure. As in a single-instance database recovery, database access begins once the roll-forward phase is complete. Rolling back uncommitted transactions can occur in parallel with the creation of new work. Naturally, no lock remastering occurs during crash recovery, since the DLM is starting over.

Automatic Lock Remastering

To reduce traffic on the cluster interconnect, it is preferable to have the instance that performs the majority of accesses to a given resource also act as the mastering node for that resource. This method reduces the handling of DLM requests across instances. The data structures of the Oracle 9i RAC allow it to pinpoint a given tablespace that is being used exclusively by one instance. In this case, Oracle 9i RAC gradually migrates all lock mastering for that tablespace to the instance that is accessing it. Since the migration happens over a period of time, there is no noticeable slowdown or loss of availability.

The Oracle 9i RAC also performs some heuristics on the DLM traffic. For example, if the DLM notices that one instance repeatedly requests the same lock from a remote mastering instance, and if it is the only instance that is making these requests, the DLM eventually migrates the mastering of that lock to the requesting node. In this case, the notion of each instance mastering an equal share of the GCS locks is lost, but the benefit is reduced traffic over the cluster interconnect.
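A sketch of this kind of heuristic appears below; the counters and the threshold are illustrative, not Oracle's actual policy.

    from collections import Counter

    class LockStats:
        def __init__(self, master):
            self.master = master
            self.requests = Counter()          # requesting instance -> request count

        def record_request(self, instance, threshold=100):
            self.requests[instance] += 1
            top, count = self.requests.most_common(1)[0]
            sole_requester = count == sum(self.requests.values())
            if top != self.master and sole_requester and count >= threshold:
                self.master = top              # migrate mastering to the lone requester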

Synchronization

The Oracle 9i RAC synchronization mechanism is similar to that used by directory-based cache-coherency protocols in microprocessor designs. One significant difference is the fault recovery. Microprocessors encountering faults cannot guarantee that the software is unaffected, so they send an interrupt which, in the case of the Solaris operating environment, panics the operating system. The Oracle 9i RAC must not stop the database when it detects a fault. The Oracle database has features, such as the transaction logs, that enable it to recover the data in case of a failure. When a fault that causes the loss of a node occurs, the Oracle 9i RAC keeps the database up while it arbitrates the new cluster configuration, recovers the data from the failed node, and synchronizes any locks.

Local GCS Lock Mode Versus Global

Versions of the Oracle RAC prior to 9i allowed only one instance at any time to hold a version of a data block considered “dirty,” that is, an image in the buffer cache that does not match the copy on disk. If one instance held an exclusive lock on the data block, all other instances were only allowed to hold null locks on the same block, essentially giving up their locks. Oracle RAC 9i allows the data block to be dirty in more than one instance and hence introduces the concept of a global management role. An exclusive lock in a local role operates somewhat independently of the DLM, permitting the holding instance to write its buffer to the on-disk version without prior consent. However, the same lock in a global role requires coordination from the DLM for this I/O, yet allows other instances to obtain a shared lock on the same block. Accompanying this update functionality is the concept of a past image (PI) block, which adds complexity to both the DLM algorithm and recovery. The role of a data block is promoted from local to global when the block is dirty and another instance requests it for additional write access. This situation results in two distinct versions of the data block in memory, each different from the copy on disk.
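A toy model of the role promotion described above is sketched below; the class and its fields are illustrative stand-ins, not Oracle's data structures.

    class CachedBlock:
        def __init__(self, holder, value):
            self.holder = holder          # instance holding the exclusive lock
            self.value = value
            self.role = "local"
            self.dirty = False
            self.past_images = []         # PI versions retained under the global role

        def modify(self, new_value):
            self.dirty = True
            self.value = new_value

        def grant_write_to(self, other_instance):
            # Another instance wants to write a block that is already dirty here:
            # the role is promoted to global and a past image (PI) is kept.
            if self.dirty:
                self.role = "global"
                self.past_images.append(self.value)
            self.holder = other_instance

        def needs_dlm_for_disk_write(self):
            return self.role == "global"  # a local-role holder may write unilaterally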

Cache Fusion Read-Read Example

As the bandwidth for interconnects increased and the transport mechanism for “content and control” improved, Oracle introduced cache fusion for transferring data across nodes through the interconnect. Cache fusion is aptly named, since it describes an architecture that treats all of the physically distinct RAM of each cluster node logically as one large database SGA, with the interconnect providing the physical transport among them (FIGURE 6-4).

Figure 6-4. Oracle 9i RAC Logical Cache


Before Oracle 9i RAC, transferring a data block from one node to another involved writing the block from the database buffer cache of the holding node to the shared disk storage; the requesting node then read the data block from disk into its own cache. However, passing data through persistent disk writes adds significant latency and a corresponding performance penalty. In some cases, application performance is several orders of magnitude slower than with cache fusion.

Cache fusion implements cache synchronization, using a write-back model. The DLM processes on each node manage the synchronization by using the clustered interconnect for both inter-DLM traffic and data block movement between nodes.

FIGURE 6-5 shows a two-node cluster in which cache fusion performs as follows:

  1. Node B requests data block 1008 from the database.

  2. Node C already has the data block in its buffer (SGA) and is also the manager of the DLM lock on this resource.

  3. Lock information is exchanged over the cluster interconnect through an integrated DLM.

  4. The system transfers the data block from the buffer on node C across the cluster interconnect to node B.

Figure 6-5. Two-Node Cache Fusion


Good scalability is possible because the time required to transfer the data block over the cluster interconnect is significantly less than the time required for node C to write the data block to shared disk and for node B to read it. This model also frees the application from having to resolve all lock management issues in a very fine-grained manner.
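The request path in the example above can be sketched as follows, with node names matching the figure; the classes are illustrative stand-ins rather than Oracle structures.

    class Node:
        def __init__(self, name):
            self.name = name
            self.sga = {}                             # buffer cache: block id -> contents

    class Cluster:
        def __init__(self, nodes, disk):
            self.nodes = nodes
            self.disk = disk                          # shared storage: block id -> contents

        def request_block(self, requester, block_id):
            for node in self.nodes:                   # the DLM knows who buffers the block
                if node is not requester and block_id in node.sga:
                    requester.sga[block_id] = node.sga[block_id]   # interconnect transfer
                    return requester.sga[block_id]
            requester.sga[block_id] = self.disk[block_id]          # fall back to shared disk
            return requester.sga[block_id]

    b, c = Node("B"), Node("C")
    c.sga[1008] = "block 1008 contents"
    cluster = Cluster([b, c], disk={1008: "older on-disk copy"})
    print(cluster.request_block(b, 1008))             # served from node C's cache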

Cache fusion is a good fit for applications that require parallel operations on resources. However, using standard IP-based network technology for the cluster interconnect still causes a small amount of latency. Message-passing on the interconnect usually occurs on the UDP/IP stack, and the system copies data to local buffers in the UDP/IP drivers before transferring it to the remote node. Therefore, one cache fusion operation (a data-block buffer transfer) results in two copying operations, one local and one remote.

Remote Shared Memory (RSM) eliminates the local copying requirement inherent in previous transport implementations. RSM comprises local and remote adapters that provide mapped memory segments to copy message data directly into a remote address space. Effectively, some global memory space is known to all nodes involved.

Once RSM is implemented, Oracle 9i RAC will perform cache fusion by copying buffers directly into the remote memory space of the interconnect adapter card. In effect, the interconnect adapter card becomes a mapped extension of the memory of the remote node. The result is lower latency for data block transfers and improved performance scalability.
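The difference in copy counts can be illustrated with the toy sketch below, in which bytearrays stand in for the SGA buffer, the local UDP/IP driver buffer, and the remote memory; it models only the number of copies, not the actual UDP or RSM mechanisms.

    def udp_style_transfer(block, driver_buf, remote_mem):
        driver_buf[:len(block)] = block                       # copy 1: into the local driver buffer
        remote_mem[:len(block)] = driver_buf[:len(block)]     # copy 2: across to the remote node
        return 2                                              # two copies per cache fusion transfer

    def rsm_style_transfer(block, mapped_remote_segment):
        mapped_remote_segment[:len(block)] = block            # single copy into mapped remote memory
        return 1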
