Arbitration

This section describes how the Sun Cluster 3.0 software architecture handles arbitration problems and related issues, including split brain, multiple instances, and amnesia. These issues are introduced in “Arbitration Schemes” and “Failures in Clustered Systems”.

Cluster Membership

Sun Cluster 3.0 defines the concept of membership as a group of nodes that can successfully communicate with every other node in the group through the private interconnect. This concept is critical to the success of a cluster product that is performing distributed computing operations. The cluster membership monitor (CMM) must ensure that only one cluster incarnation is in progress at a time.

To determine membership and, more importantly, to ensure data integrity, the CMM must achieve the following:

  • Account for a change in cluster membership, such as a node joining or leaving the cluster

  • Ensure that a faulty node leaves the cluster

  • Ensure that the faulty node stays out of the cluster until it is repaired

  • Prevent the cluster from partitioning itself into subsets of nodes

Given these requirements, the Sun Cluster 3.0 CMM protects a cluster against these failures:

  • Split brain— All communication between nodes is lost and the cluster becomes partitioned into subclusters, each of which believes that it is the only partition. See “Split Brain”.

  • Amnesia— The cluster restarts after a shutdown with cluster configuration data that is older than the data at the time of the shutdown. See “Amnesia”.

Sun Cluster 3.0 avoids split brain by using the majority vote principle (see “Majority Voting and Quorum Principles”), coupled with the use of quorum disks to circumvent undesirable situations that would otherwise compromise cluster availability.

Avoiding potential data corruption means that the cluster must ensure that it is using the latest configuration information, held in the cluster configuration repository (CCR). Take, for example, the case in which an administrator shuts down one node, A, of a clustered pair and then changes the cluster configuration on the remaining node, B. If node B is then shut down and node A is brought up, node A will have out-of-date configuration information. This situation is known as amnesia.

As another example, consider the case where the last member of a cluster, A, places SCSI-3 PGRe keys on the quorum disks defined in the CCR. When another node, B, tries to start up a cluster before node A has restarted, the booting process on B stops because B cannot acquire the necessary quorum disk votes to achieve majority—B cannot do so because the reservation that A placed on the disks is persistent and specific to A.

Changes in cluster membership drive the cluster reconfiguration sequence that, in turn, can cause services to migrate from failed or faulty nodes to healthy ones through the resource group manager daemon (rgmd). The process of fencing off a node is vital for ensuring that user data is not corrupted. See “Fault Containment”. This is especially true when parallel services are running and making simultaneous changes to a common set of data files. If the CMM did not actively fence off or shut down errant nodes, they could continue to service user requests and write to the data files in the mistaken belief that they are the only nodes remaining in the cluster. This would inevitably lead to corruption when two nodes update the same data page in an uncontrolled fashion.

CMM Implementation

Sun Cluster 3.0 implements its CMM as a kernel module. Resource starvation is therefore less likely to affect the CMM than it would a user-level daemon, so Sun Cluster 3.0 can support shorter timeouts and thus faster failure detection. Note that these timeouts are typically small relative to the time taken to recover an application, which tends to dominate failover times. The CMM determines connectivity to other nodes through the cluster transport mechanism. Only when the last path to a node is declared down does the CMM fence off the potentially failed node.

Majority Voting and Quorum Principles

The cluster membership model is independent of the cluster storage topology and volume manager employed. The basic principle relies on the concept of a majority, that is, more than half of a particular quantity. Once the initial scinstall(1M) completes on all cluster nodes and scsetup(1M) is run to assign the first quorum disk, the cluster is taken out of “install mode.” Thereafter, each node within the cluster is given one vote. The quantity Vn represents the total number of node votes. For a cluster with no quorum disks configured to continue, a majority of nodes must be able to communicate with each other over the private interconnect. A majority is calculated as int[Vn × 0.5] + 1, in which the int function computes the integer portion of its operand. If this condition is not met, the nodes in the particular subcluster panic because they have lost majority. A subcluster is any subset of the nodes that constitute the entire cluster.

This algorithm has undesirable consequences for a two-node cluster because shutting down one node automatically brings down the other node. To overcome this limitation, a quorum disk is used. A quorum disk is simply a nominated disk somewhere in the shared storage of the cluster. A quorum disk that is multihosted to M nodes is given M-1 votes; a dual-hosted quorum disk therefore receives only one vote. Defining the total number of votes contributed by quorum disks as Vq, the total number of votes available to the cluster, defined as Vt, is therefore Vt = Vn + Vq. A subcluster must still gain a majority of the available votes to continue, but this majority is now calculated as int[Vt × 0.5] + 1.

A quorum disk must be defined for a two-node cluster. This arrangement enables any single node that obtains the vote of the quorum disk to maintain majority and continue as a viable cluster. The CMM forces the losing node out of the cluster.

For clusters with N nodes, where N is greater than two, N-1 quorum disk votes should be configured. In certain topologies, these N-1 votes prevent several nodes from panicking because of loss of majority after N/2 of the cluster nodes have been shut down.
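
The following minimal sketch, in C, works through this vote arithmetic for the two configurations discussed above; the node and quorum disk counts are illustrative examples, not values read from a real cluster.

#include <stdio.h>

/* A majority is int[Vt x 0.5] + 1: the integer part of half the total
 * votes, plus one. */
static int
majority(int total_votes)
{
    return ((total_votes / 2) + 1);
}

int
main(void)
{
    /* Two-node cluster with one dual-hosted quorum disk (one vote). */
    int vn = 2, vq = 1, vt = vn + vq;
    printf("Two-node cluster:  Vt = %d, majority = %d\n", vt, majority(vt));
    /* A single surviving node (1 vote) plus the quorum disk vote reaches
     * the majority of 2 and continues as a viable cluster. */

    /* Four-node pair+M cluster with N-1 = 3 quorum disk votes. */
    vn = 4; vq = 3; vt = vn + vq;
    printf("Four-node cluster: Vt = %d, majority = %d\n", vt, majority(vt));
    /* A three-node partition (3 node votes) needs at least one quorum
     * disk vote to reach the majority of 4. */

    return (0);
}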

CMM Reconfiguration Process

A kernel CMM reconfiguration consists of 11 steps following the begin_state and qcheck_state phases. All cluster nodes execute these steps in order and in lockstep. That is, cluster nodes do not start the next step until all cluster nodes complete the current step. If membership changes during any of these steps, all nodes return to the begin_state phase.
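
The following conceptual sketch, in C, illustrates the lockstep behavior just described. The step count and restart behavior follow the description above; the helper routines are hypothetical placeholders, not Sun Cluster interfaces.

#include <stdio.h>
#include <stdbool.h>

#define NUM_STEPS 11

/* Hypothetical placeholders standing in for the real cluster machinery. */
static void run_local_step(int step)        { printf("step %d done\n", step); }
static bool all_members_completed(int step) { (void)step; return true; }
static bool membership_changed(void)        { return false; }

int
main(void)
{
begin_state:
    /* The begin_state and qcheck_state phases precede the numbered steps. */
    printf("begin_state / qcheck_state\n");

    for (int step = 1; step <= NUM_STEPS; step++) {
        run_local_step(step);

        /* Lockstep: no node starts the next step until every member has
         * completed the current one. */
        while (!all_members_completed(step)) {
            if (membership_changed())
                goto begin_state;   /* any membership change restarts */
        }
        if (membership_changed())
            goto begin_state;
    }
    printf("reconfiguration complete\n");
    return (0);
}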

Once the quorum algorithm decides on a quorum of cluster members, the cluster nodes execute these steps. Thereafter, the ORB and the replica framework, and indirectly the CFS and device configuration service, perform the appropriate actions for a node that is leaving or joining the cluster. See “Disk Fencing”. When a node joins the cluster, the replica framework can add a secondary for a device group, and so forth.

The CMM has two user clients: rgmd and Oracle 8i OPS or Oracle 9i RAC. Each registers its own set of steps, and when a reconfiguration occurs, these clients are driven through their steps.

SCSI-2 and SCSI-3 Command Set Support

To prevent errant cluster nodes from writing to a disk and potentially corrupting user data, Sun Cluster 3.0 uses SCSI reservation. The ability to reserve a disk is part of the SCSI command set used by all cluster storage that Sun supports. For details on how SCSI reservations are used, see “Disk Fencing”.

Most disks in use today support the SCSI-2 command set. SCSI-2 reservations are binary, either allowing or disallowing access to a disk. Therefore, if a host in a cluster with more than two nodes reserves a disk that supports the SCSI-2 command set, all nodes except the reserving host are unable to access the disk once the reservation is in place. If the node or disk is reset, the reservation is lost because it is not persistent.
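
A minimal sketch of such a binary SCSI-2 reservation follows, assuming the Solaris multihost disk ioctls described in mhd(7I); the device path is a hypothetical example, and error handling is reduced to the essentials.

#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/mhd.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    int fd = open("/dev/did/rdsk/d4s2", O_RDWR);  /* hypothetical DID device */
    if (fd < 0) {
        perror("open");
        return (1);
    }

    /* Binary SCSI-2 reservation: while it is held, only this host can
     * access the disk. */
    if (ioctl(fd, MHIOCTKOWN) < 0)
        perror("MHIOCTKOWN");

    /* A real primary would keep holding the reservation; releasing it here
     * simply completes the example.  The reservation is also lost if the
     * node or disk is reset, because SCSI-2 reservations are not persistent. */
    if (ioctl(fd, MHIOCRELEASE) < 0)
        perror("MHIOCRELEASE");

    (void) close(fd);
    return (0);
}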

The SCSI-3 command set enables group reservations for a disk. These reservations permit access by a set of nodes while denying access to others. Group reservations are persistent, so they survive node and drive resets. This feature is called SCSI-3 persistent group reservation (PGR). Sun Cluster 3.0 uses SCSI-3 PGR on disks that support it; on disks that do not, Sun Cluster 3.0 emulates PGR (a scheme referred to as PGRe) using the SCSI-2 Tkown and Release commands, as described below.

SCSI-2 PGRe emulation includes:

  1. Employing the alternate disk cylinders for storing the PGRe keys (distinct from the PGR keys that the drives themselves store) and the reservation key. There are 65 sectors: one for each of the 64 possible nodes and an extra one for the reservation owner. (An illustrative layout sketch follows this list.)

  2. Emulating the SCSI-3 PGR ioctls. PreemptAndAbort uses Lamport's algorithm [Lamport74].

  3. Using SCSI-2 Tkown and Release in conjunction with item 2 to ensure that the loser of the quorum race is removed from the cluster and hits a reservation conflict, and is welcomed back into the cluster once it rejoins.
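
For illustration only, the following C sketch shows one way to picture the 65-sector key area described in item 1. It is not the actual Sun Cluster on-disk format; the sector size and key width are assumptions.

#include <stdint.h>
#include <stdio.h>

#define PGRE_SECTOR_SIZE 512   /* assumed 512-byte sectors */
#define PGRE_MAX_NODES   64    /* one key sector per possible node */
#define PGRE_KEY_SIZE    8     /* assumed 8-byte reservation key */

typedef struct pgre_key_sector {
    uint8_t key[PGRE_KEY_SIZE];
    uint8_t pad[PGRE_SECTOR_SIZE - PGRE_KEY_SIZE];
} pgre_key_sector_t;

typedef struct pgre_key_area {
    pgre_key_sector_t node_key[PGRE_MAX_NODES]; /* sectors 0-63: node keys    */
    pgre_key_sector_t reservation_owner;        /* sector 64: reservation key */
} pgre_key_area_t;                              /* 65 sectors in total        */

int
main(void)
{
    printf("key area spans %zu bytes (65 sectors)\n", sizeof (pgre_key_area_t));
    return (0);
}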

Sun Cluster 3.0 also implements a write-exclusive, registrants-only (WERO) form of SCSI-3 reservation. This type of reservation allows only registered initiators to write to the disk. Fencing is done by ensuring that a WERO reservation is in place and by preempting the registration keys of the initiators on the nodes to be fenced.
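
The following sketch shows how WERO-based fencing might look through the Solaris mhd(7I) MHIOCGRP_* ioctls. The device path and keys are hypothetical, and the structure field names and the WERO type code (0x5, from the SCSI specification) are assumptions based on that interface; check <sys/mhd.h> for the exact definitions.

#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/mhd.h>
#include <fcntl.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    int fd = open("/dev/did/rdsk/d4s2", O_RDWR);  /* hypothetical DID device */
    if (fd < 0) {
        perror("open");
        return (1);
    }

    /* 1. Register this node's (hypothetical) key with the device. */
    mhioc_register_t reg;
    (void) memset(&reg, 0, sizeof (reg));
    (void) memcpy(reg.newkey.key, "NODE0001", sizeof (reg.newkey.key));
    if (ioctl(fd, MHIOCGRP_REGISTER, &reg) < 0)
        perror("MHIOCGRP_REGISTER");

    /* 2. Place a WERO reservation: only registered initiators may write. */
    mhioc_resv_desc_t resv;
    (void) memset(&resv, 0, sizeof (resv));
    (void) memcpy(resv.key.key, "NODE0001", sizeof (resv.key.key));
    resv.type = 0x5;    /* write-exclusive, registrants-only (assumed code) */
    resv.scope = 0x0;   /* logical-unit scope */
    if (ioctl(fd, MHIOCGRP_RESERVE, &resv) < 0)
        perror("MHIOCGRP_RESERVE");

    /* 3. Fence a failed node by preempting (removing) its registration key. */
    mhioc_preemptandabort_t pa;
    (void) memset(&pa, 0, sizeof (pa));
    pa.resvdesc = resv;                         /* our own reservation */
    (void) memcpy(pa.victim_key.key, "NODE0002", sizeof (pa.victim_key.key));
    if (ioctl(fd, MHIOCGRP_PREEMPTANDABORT, &pa) < 0)
        perror("MHIOCGRP_PREEMPTANDABORT");

    (void) close(fd);
    return (0);
}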

Cluster members place keys in the alternate disk cylinder area at the beginning of the drive (mentioned in item 1) when they join the cluster. The CMM removes the keys of removed members as part of the initial qcheck reconfiguration step, which precedes step one of the reconfiguration. This action is taken whether the system uses PGR or PGRe.

Shared quorum disks have the keys of all the members in the last cluster reconfiguration, thus preventing amnesia. If the last cluster configuration had just one member, that key is the only key on the quorum disk. An amnesiac node would attempt to join the cluster and discover that it has been fenced away from the quorum disk.

As nodes join the cluster, they put their keys on the device. As nodes leave the cluster, the current members remove the keys of the departing nodes from the device. If a node panics while preempting a removed member, that is, while removing its key, the CMM attempts another reconfiguration. If the remaining nodes can still form a cluster, they proceed to preempt the previously removed node as well as the node that just panicked.

If the remaining nodes do not form a majority, the cluster aborts. The quorum disk has the keys of these remaining nodes, the node that panicked, and the ousted node because the last successful reconfiguration had exactly those members.

Quorum Disk Vote

When cluster members lose contact with each other, they must attempt to acquire the votes of the quorum disks to maintain a majority. To achieve this, each node issues a SCSI-2 Tkown, or reservation, ioctl. The ioctl is atomic, so only one node is successful. If two nodes share multiple quorum disks, as can happen in a cluster with more than two nodes, both nodes attempt to acquire the votes of the quorum disks in the same order. The loser of the initial race drops out, thereby allowing the winner to obtain all of the necessary quorum disk votes and thus retain a majority. Writing the key of the owner to the alternate disk cylinder space makes the reservation persistent.

Uneven Cluster Partitions

Multinode clusters are subject to numerous failure scenarios, each of which should, ideally, result in the safe reconfiguration of the cluster so that service continues on the remaining nodes. Some failures make it hard for the cluster to determine what the optimal outcome should be. For example, when a four-node pair+M cluster partitions 3 to 1, the three-node partition should acquire the three nominated quorum disk votes and survive. However, the single node could win the race to the first quorum disk and go on to attain majority, which may not be the desired outcome. In an alternative scenario, the single node instantaneously loses communication with the other three nodes, either through a bizarre interconnect failure that leads to a 1:1:1:1 split or through a simultaneous power failure of those nodes; in this case, the single node should get the quorum disk votes and continue.

To resolve these two scenarios, the cluster membership algorithm introduces a staggered start in the race for the quorum disk reservation. Any majority partition is given a head start, on the basis that it should be the desired winner. If it fails to get the first quorum disk, either because its nodes are down or because they are totally unresponsive, the node in the minority partition wins the race and goes on to attain majority.

The formula for this delay, in seconds, is 12 + N - R, where N is the number of nodes not in this partition and R is the number of known failed nodes. In the preceding scenario, the partition with three nodes has a two-second head start over the single node, because the single-node subcluster waits 12 + 3 - 0 = 15 seconds, whereas the three-node subcluster waits only 12 + 1 - 0 = 13 seconds.
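
The worked example below, a minimal C sketch, reproduces this arithmetic for the 3:1 split described above.

#include <stdio.h>

/* Delay (in seconds) before racing for the quorum disk:
 * 12 + N - R, where N is the number of nodes not in this partition and
 * R is the number of known failed nodes. */
static int
race_delay(int nodes_not_in_partition, int known_failed_nodes)
{
    return (12 + nodes_not_in_partition - known_failed_nodes);
}

int
main(void)
{
    /* Four-node cluster split 3:1, with no nodes known to have failed. */
    printf("three-node partition waits %d seconds\n", race_delay(1, 0)); /* 13 */
    printf("single-node partition waits %d seconds\n", race_delay(3, 0)); /* 15 */
    return (0);
}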

If the nodes in a minority partition cannot achieve majority, even with the addition of all of the quorum votes, they decline to participate in the race and panic out of the cluster.

Disk Fencing

Disk fencing, also called failure fencing, protects the data on disk against undesired access. See “Fault Containment”. In addition to protecting the cluster against using out-of-date configuration information, the CMM must prevent excluded members from writing to any shared data disks to which they might be connected.

The device configuration system (DCS) manages the device groups. When the CMM is notified of a membership change, it notifies the DCS if a primary node for a device group has left the cluster. The CMM then chooses a new primary for the device group, using the properties stored in the CCR for that device group, and issues a DCS call to tell the node that it is now the primary node for the device group. During this call, the node fences off any nonquorum disks in the device group (note that quorum disks can still be used to store data). The node also takes the appropriate volume manager action of importing a VxVM disk group or taking ownership of an SVM disk set. The quorum algorithm reserves the quorum disks separately.

The DCS uses SCSI-3 PGR for multihosted devices and SCSI-2 for dual-hosted devices. In contrast, the quorum algorithm uses SCSI-3 PGR for multihosted devices and SCSI-3 PGRe for dual-hosted devices.

As part of the fencing process, MHIOCENFAILFAST is enabled on all disks. Any non-cluster member that tries to access a disk with failfast enabled panics with a reservation-conflict message. See “Failfast Driver”.
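
A minimal sketch of enabling failfast on a device follows, again assuming the mhd(7I) ioctls. The device path is a hypothetical example, and the argument is assumed to be the failfast polling interval in milliseconds; consult mhd(7I) for the exact semantics.

#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/mhd.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    int fd = open("/dev/did/rdsk/d4s2", O_RDWR);  /* hypothetical DID device */
    if (fd < 0) {
        perror("open");
        return (1);
    }

    /* Enable failfast with a one-second polling interval: a reservation
     * conflict on this device now panics the node (see "Failfast Driver"). */
    unsigned int ff_interval_ms = 1000;
    if (ioctl(fd, MHIOCENFAILFAST, &ff_interval_ms) < 0)
        perror("MHIOCENFAILFAST");

    (void) close(fd);
    return (0);
}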

For clusters that run services such as Oracle 8i OPS or Oracle 9i RAC, the system uses the cluster volume manager (CVM) feature of VxVM, which enables concurrent access to raw devices from more than one cluster node. If a cluster node fails, it is fenced off from all disks in shared CVM disk groups, either through SCSI-3 PGRe for dual-hosted SCSI-2 disks or through SCSI-3 PGR for multihosted devices. Thus, Oracle 8i OPS or Oracle 9i RAC clusters with more than two nodes require shared storage that implements SCSI-3 PGR.

Note

You should not attempt to alter any reservations placed on shared storage. The scdidadm(1M) command offers limited access to these reservations, but you should use it carefully.


Failfast Driver

The appropriate Solaris disk driver, either sd or ssd, handles failfast. If a device access returns a reservation conflict error while failfast is enabled, the driver panics the node. With failfast, you can also set up a polling thread that periodically accesses the device; the driver panics the node if it receives a reservation conflict error from the device.

Cluster Reconfiguration

The final job of the CMM is to drive the reconfiguration process. This task includes contacting the replica manager so that it can elect a new node to coordinate the replica managers. The primary I/O paths for the device groups of the cluster may also need to be changed. Changing the primary path requires the takeover of replica manager responsibilities.

If a cluster node fails, it is fenced off from all dual-hosted disks in shared CVM disk groups by simple SCSI-2 reservation. When multihosted storage is supported, the shared disk groups will use SCSI-3 PGR.
