Sun Cluster 2.2 and 3.0 Feature Comparison

Sun Cluster 2.2 and Sun Cluster 3.0 are fundamentally different products, despite having very similar goals—the provision of highly available and scalable application services. The differences stem from the fact that Sun Cluster 2.2 is predominantly a layered product, using a combination of programs, shell scripts, and daemons to achieve its goals. In contrast, Sun Cluster 3.0 is highly integrated with the Solaris 8 operating environment and delivers much of its functionality through kernel modules and agents. One feature the two products do share is their implementation of application fault monitors.

Starting with the low-level components, such as the cluster interconnects and kernel drivers, this section describes the features that Sun Cluster 2.2 delivered and contrasts them with the features offered by Sun Cluster 3.0.

Cluster Interconnects

Both the Sun Cluster 2.2 and Sun Cluster 3.0 products require dedicated networks between the constituent cluster nodes for the exclusive use of the cluster framework. These networks carry the heartbeat messages used to determine connectivity between nodes, application-level messages, and data transfer for the new Sun Cluster 3.0 global features (for example, Oracle 8i OPS DLM messages and Oracle 9i RAC cache fusion data).

Sun Cluster 2.2 requires exactly two private interconnects, the second providing resilience against the failure of the first. Sun Cluster 3.0, however, supports a minimum of two and a maximum of six private interconnects because of the higher demands that the new global features place on the system. This means that, potentially, a Sun Cluster 3.0 system can tolerate more interconnect failures before it has to fence off a node that can no longer communicate with its peers.

FIGURE 3-5, FIGURE 3-6, and FIGURE 3-7 show the types of cluster topologies Sun Cluster 3.0 supports—Clustered Pair, N+1, and Pair+M. Sun Cluster 2.2 supports a ring topology that is not supported by Sun Cluster 3.0. Sun Cluster 2.2 does not support the scalable Pair+M topology.

Switch Management Agent

The Sun Cluster 2.2 switch management agent daemon, smad, is a user-level daemon responsible for managing communication sessions on the private interconnects. It also depends on a loadable kernel module called smak. The smad daemon communicates with its peers by sending and receiving UDP packets at regular intervals, a process often described as the cluster heartbeat. This differs from the lower-level DLPI mechanism that Sun Cluster 3.0 uses. When SCI is used, smad also performs the heartbeat checks on these links.

In Sun Cluster 2.2, the physical network interfaces on the private interconnects are assigned IP addresses in the range of 204.152.65.1 to 204.152.65.20. These are Sun-registered, private, nonroutable network numbers. The smad process then configures an additional logical IP address in the range of 204.152.65.33 to 204.152.65.36 on one of the physical network interfaces. If a physical link fails, smad migrates the logical address to the alternate physical interface, allowing processes that depend on the IP link to continue uninterrupted. If both links fail, smad signals the cluster membership monitor to initiate a cluster reconfiguration.
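
The failover that smad performs is conceptually the same as moving a Solaris logical interface from one physical adapter to another. The following sketch illustrates the idea with hypothetical interface names (hme0 and hme1) and an illustrative netmask; smad performs the equivalent operations internally, so none of these commands are run by the administrator.

    # Minimal sketch only: logical address failover between two private
    # interfaces (interface names and netmask are examples).

    # Logical address initially plumbed on the first physical interface.
    ifconfig hme0:1 plumb
    ifconfig hme0:1 204.152.65.33 netmask 255.255.255.224 up

    # After hme0 fails, the same address is moved to the surviving interface.
    ifconfig hme0:1 unplumb
    ifconfig hme1:1 plumb
    ifconfig hme1:1 204.152.65.33 netmask 255.255.255.224 up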

Membership Monitor

Membership is an important concept for any platform that tries to execute state-based applications in a distributed computing environment. Without careful coordination, applications can be started more than once without each other's knowledge. The resulting uncontrolled access to the underlying data structures causes data corruption.

The cluster membership monitors (CMMs) of the two product releases are substantially different. Both CMMs use the concept of membership to prevent split-brain syndrome. However, the Sun Cluster 2.2 version does not protect against amnesia, as the Sun Cluster 3.0 CMM does. Instead, this function is left to the cluster configuration database (CCD) daemon, described in subsequent paragraphs. Sun Cluster 2.2 also implements the CMM as a user-level daemon known as clustd, whereas Sun Cluster 3.0 implements the CMM as a highly available kernel agent. When the CMM is engaged (the return step of the cluster state machine), all pending I/O to the shared cluster storage is allowed to complete and the system blocks new I/O, pending the outcome of the membership change.

The Sun Cluster 2.2 CMM rules for deciding the outcome of membership changes are fairly complex; see the rules and scenarios in the Sun Cluster Environment Sun Cluster 2.2 book [SunSCE01], page 60 onward. This complexity is due, in part, to the fact that the Sun Cluster 2.2 design uses SCSI-2 disk reservations rather than SCSI-3 persistent group reservations (PGR), restricting its ability to fence off multihosted disks in a scalable cluster storage topology.

To guarantee data integrity, failure fencing is critical. Because SCSI-2 reservations are binary, a valid cluster node with rightful access to the data could otherwise be fenced off in a scalable storage topology. To guarantee that failed nodes in a scalable storage topology are fenced off, the Sun Cluster 2.2 CMM instead uses a “shoot-down” mechanism that connects to the console ports of the failed nodes, through the terminal concentrator, and forcibly halts them. This shoot-down is the only mechanism that ensures failure fencing works in that topology. When nodes are already down, for example because of a power outage, the CMM pauses cluster operation pending authority from the system administrator to continue a partition. The Sun Cluster 3.0 system does not require system administrator intervention under this condition.

The Sun Cluster 2.2 CMM algorithm is driven primarily by cluster membership transitions rather than by an overall vote count, as the Sun Cluster 3.0 algorithm is. Similarly, the disk fencing and quorum disk algorithms depend on the volume management product (Solaris Volume Manager (SVM) or VxVM) and the storage topology used. “Quorum Voting” and “Failure Fencing” describe the potential outcomes in more detail.

To simplify the description of the results of the CMM algorithm, the concept of a partition is introduced here. A partition is a subcluster consisting of any number of nodes from the previous working cluster. The CMM bases its decisions on the concepts of majority and minority partitions. A majority partition contains at least N/2+1 of the N nodes that participated in the previous working cluster, whereas a minority partition contains fewer than N/2 nodes; a partition containing exactly N/2 nodes is the tie case discussed next.
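
As a worked illustration of the arithmetic, the following sketch classifies a partition by comparing its node count with the size of the previous working cluster. The variable names and values are purely illustrative; the real decision is made inside the CMM.

    #!/bin/sh
    # Illustrative only: classify a partition of a former N-node cluster.
    N=4                       # nodes in the previous working cluster (example)
    P=2                       # nodes in this partition (example)

    MAJORITY=`expr $N / 2 + 1`

    if [ "$P" -ge "$MAJORITY" ]; then
        echo "majority partition: continues automatically"
    elif [ `expr $P \* 2` -eq "$N" ]; then
        echo "tie (exactly half the nodes): arbitration required"
    else
        echo "minority partition: aborts, or races after a timeout in scalable topologies"
    fi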

With storage topologies that are not scalable, any partition with fewer than a majority of nodes aborts, and any partition with a majority continues automatically. When a partition contains exactly half the nodes of what was initially a two-node cluster, an arbitration process is required to break the impasse; a quorum disk is used to break the tie.

For Sun Cluster 2.x clusters that use a scalable storage topology, if a partition contains fewer nodes than needed for a majority, the CMM waits for a total of Exclude Timeout + Scalable Timeout seconds before trying to reserve the nominated quorum disks and the terminal concentrator port. This delay allows other, potentially larger, partitions to form a new cluster first and shoot down this partition. If the partition is not shot down and successfully obtains the port and the quorum disk, it continues as a cluster. Otherwise, it aborts.

When a majority partition exists, the partition attempts to acquire a lock on the terminal concentrator port and then resets any nodes in the minority partition. If the partition fails to get the port lock, all of its nodes abort. Finally, if both partitions consist of exactly half the nodes, each partition attempts to obtain the port lock and subsequently shoots down the nodes in the other partition; alternatively, depending on the configured policy, the partitions stall pending operator assistance or use a deterministic policy to decide which partition should continue.

Once the CMM has determined the new cluster members, it fences off any shared storage to prevent potential data corruption. The clustd process then continues to execute the remaining 12 reconfiguration steps, by calls to the reconf_ener program, during which the system reenables the I/O to the shared storage. The Sun Cluster Environment Sun Cluster 2.2 book describes the 12 reconfiguration steps on page 44.

Quorum Voting

For Sun Cluster 2.2 clusters using VxVM as the volume management product, a nominated quorum disk provides the additional vote needed to break the deadlock. For nonscalable architectures, one quorum disk is defined between each pair of nodes. The two nodes connected to the quorum disk each issue a SCSI-2 reservation ioctl on the shared device. Because the ioctl is atomic, only one call succeeds and gains the reservation. The node whose call fails drops out of the cluster.

Cluster nodes use a similar mechanism at startup. When two nodes share a quorum disk, the first node to enter the cluster attempts to reserve the device to ensure that no other cluster is in progress. If this reservation fails, the node aborts out of the cluster. When the reservation succeeds, the node releases the reservation only when it can communicate successfully over the interconnect networks with the peer node that is sharing the quorum disk.

For scalable topologies, the preceding approach does not work. According to the preceding argument, the introduction of the second node would release the reservation, thereby creating the opportunity for an entirely separate cluster to form, by the same route, from the remaining nodes. As a workaround, a telnet process directed to a port on the terminal concentrator locks the port. While the session remains active, a cluster is in progress.
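
The lock itself is nothing more than a TCP session that the cluster software holds open to a nominated port on the terminal concentrator. The sketch below is purely illustrative; the host name (cluster-tc) and TCP port number (5002) are hypothetical, and the session is opened by the framework, not manually.

    # Illustrative only: the port lock is a held telnet session.
    telnet cluster-tc 5002
    # While this session remains open, a connection attempt by a competing
    # partition fails, so only one cluster can claim to be in progress.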

Clusters based on SVM use a different approach that combines failure fencing with split-brain resolution. Disk sets under the control of SVM have a continual SCSI-2 reservation on them. Any attempt to take control of a disk set currently owned by another node results in a reservation conflict and, subsequently, a node panic.

When a split-brain scenario occurs, each node releases the disk sets under its control and attempts to reserve the disk sets of the other node. This action causes at least one node to panic out of the cluster. The remaining node then re-establishes reservations on all disk sets it owns. The outcome of such a race is somewhat nondeterministic, but no more so than the race for the quorum disk under VxVM.
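
In terms of the SVM command set, the race amounts to each node releasing the disk sets it owns and then attempting to take those of its peer. The metaset(1M) release and take options are standard, but the disk set names below are hypothetical and the operations are driven by the framework rather than typed by hand.

    # Illustrative only: split-brain resolution as seen from one node
    # (disk set names nodeA_set and nodeB_set are examples).

    # Release the disk set this node currently owns ...
    metaset -s nodeA_set -r

    # ... then attempt to take the peer's disk set.  If the peer is still
    # alive and holds its SCSI-2 reservation, the resulting reservation
    # conflict panics one of the two nodes out of the cluster.
    metaset -s nodeB_set -t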

Failure Fencing

For nonscalable cluster topologies and topologies that use VxVM for volume management, failure fencing is achieved through SCSI-2 reservations. For the SVM approach, see “Quorum Voting”. The successful partition places a SCSI-2 reservation on all shared storage, protecting it from corruption by failed cluster nodes.

Successful partitions in a scalable topology use the shoot-down method described previously to fence off failed nodes. Once shut down, the fenced nodes are unable to join the cluster until they can communicate with an existing cluster over the private interconnects. Because no SCSI-2 reservations are used, the data is more vulnerable than it might be in a Sun Cluster 3.0 system.

By employing SCSI-2 reservations, the Sun Cluster 2.2 framework precludes the use of technologies such as alternate pathing (AP) and dynamic multipathing (DMP). In contrast, Sun Cluster 3.0 uses SCSI-3 PGR calls and can benefit from, and use, the new Solaris 8 Sun StorEdge Traffic Manager storage multipathing framework. This framework is also called multipathing I/O (MPxIO).

Cluster Configuration Database

Sun Cluster 2.2 stores its configuration information in a cluster configuration database (CCD). Like the Sun Cluster 3.0 CCR, the CCD is implemented as a set of flat text files stored on the root file systems of the respective cluster nodes and updated by using a two-phase commit protocol. The CCD files rarely take up more than a few kilobytes.

Because Sun Cluster 2.2 was not designed to use persistent keys on its quorum disks, the CCD is not completely protected against amnesia, that is, the risk that the cluster configuration information in use is not the latest version. The Sun Cluster 3.0 CCR, in contrast, is completely protected. An explanation of the files and daemons used to implement the CCD provides insight into these limitations.

Nodes participating in a Sun Cluster 2.2 cluster do not boot directly into a cluster. Instead, they have to wait until they are explicitly directed to start, or join, a cluster by the system administrator issuing the scadmin(1M) command. This manual procedure creates a nondeterministic recovery time when the entire cluster must be rebooted.
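
The manual steps involved look roughly like the following; the node and cluster names are placeholders, and the exact scadmin(1M) syntax should be checked against the release in use.

    # Hedged example of the manual start sequence (names are placeholders).

    # On the first node, start a brand-new cluster:
    scadmin startcluster node1 sc-cluster

    # On each remaining node, join the cluster that is now running:
    scadmin startnode sc-cluster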

The CMM of the node that starts a new cluster must first determine a number of facts, including the cluster name, the potential nodes and their IP addresses, and the nominated quorum disk. To find this information, the CMM consults the cluster_name.cdb file in the /etc/opt/SUNWcluster/conf directory. This file is not automatically replicated, so any changes to it must be manually propagated to the other nodes; failure to do so can prevent a node from joining the cluster and introduces the opportunity for latent faults.
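
Because the file is not replicated automatically, a typical precaution is to copy it to every other node after a change and then confirm that all copies match. This is a hedged sketch only; the node name is a placeholder, and the use of rcp, rsh, and cksum is illustrative rather than a prescribed procedure.

    # Illustrative only: propagate the cdb file and verify the copies match
    # (sc-cluster is an example cluster name, node2 a placeholder node).
    CDB=/etc/opt/SUNWcluster/conf/sc-cluster.cdb

    rcp $CDB node2:$CDB

    # The checksums reported on each node should be identical.
    cksum $CDB
    rsh node2 cksum $CDB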

The CCD is stored in two separate files located in the /etc/opt/SUNWcluster/conf directory—a ccd.database.init file containing the static information that allows the CCD to initialize and a ccd.database file for the dynamic data that changes while the cluster is running. A checksum protects the integrity of these files. A node cannot join the cluster if its ccd.database.init file differs from that of the existing members.

The ccd.database.init file effectively provides the Sun Cluster 3.0 CCR read-only mechanism (see “Cluster Configuration Control”); however, the ccd.database.init files are never updated.

The dynamic portion of the CCD, ccd.database, is updated by the user-level CCD daemon (ccdd). This update uses the two-phase commit protocol, implemented as RPCs over the cluster interconnects, to ensure that ccdd makes consistent updates to the cluster node CCD databases.

For any updates to occur, a majority of the CCD copies must be available, which effectively means that a majority of the cluster nodes must be running. If one node of a two-node cluster is not running, this restriction is inconvenient. To overcome it, you can configure a shared CCD on two mirrored disks dedicated to this purpose alone. Of course, this is a huge waste of space for such a small database. This copy of the data is stored in /etc/opt/SUNWcluster/conf/ccdssa/ccd.database.ssa and is only active when one node is not participating in the cluster. Creating this extra copy ensures that the majority requirement is always met, because one node's CCD copy plus the shared CCD copy constitute two of the three existing copies of the data. This facility is only available to clusters that use VxVM for volume management.

Using a shared CCD in a two-node cluster prevents amnesia by ensuring that the latest information is always held in the ccd.database.ssa file. However, clusters with more than two nodes do not support a disk-based, shared CCD. Instead, they must rely solely on the nodes' copies to attain a majority. This still leaves the potential for a cluster to restart with wrong, amnesiac information in the CCD. Consider, for example, a four-node cluster with nodes A, B, C, and D. If changes are made to the CCD while D is shut down, and nodes A, B, and C are then stopped and D is started, nothing prevents D from using its stale CCD data, even though that data cannot be updated because D alone does not constitute a CCD majority.

Consequently, the Sun Cluster 3.0 implementation offers a far stronger guarantee that accurate configuration data is always being used. This assurance also comes without any additional management overhead—all the files it relies on are updated automatically. This method eliminates the manual intervention steps and greatly reduces the opportunities for latent configuration errors.

Data Services

Both cluster products can provide basic failover capabilities for crash-tolerant applications and support for parallel applications. However, Sun Cluster 2.2 lacks the new scalable services enabled by the global features introduced by the Sun Cluster 3.0 framework.

The terminology differs between the two releases because of the inherently different capabilities and design philosophies of the two products. TABLE B-1 contrasts the terms the two releases use.

Table B-1. Sun Cluster 2.2 and 3.0 Terminology Differences

Sun Cluster 2.2                             Sun Cluster 3.0
Data service, for example: Oracle or NFS    Resource type
Data service instance                       Resource
Logical host                                Resource group, but does not contain any disk sets
Disk sets (SVM) or disk group (VxVM),       Device group, managed as a separate resource
  managed as part of the logical host
Sun Cluster 2.2 Logical Hosts

Logical hosts are the basis for all Sun Cluster 2.2 highly available services except Oracle 8i OPS. A logical host contains one or more disk groups or disk sets, a logical IP address per subnet, and one or more applications, for example, Oracle, NFS, or a web server. A logical host is the smallest unit of service migration between nodes.

The most immediately discernible difference between a logical host and a Sun Cluster 3.0 resource group is that the logical host must include disk sets or disk groups in its definition. This requirement stems from the absence of global file service functionality in Sun Cluster 2.2. Instead, the 11th step of the cluster reconfiguration, driven by the CMM, imports the disk groups or disk sets defined in the logical host, checks the file systems on the relevant volumes, and then remounts the volumes on the new host. All of this work is handled by user-level processes such as vxdg(1M), metaset(1M), fsck(1M), and mount(1M).
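
For a VxVM-based logical host, step 11 effectively amounts to the sequence below on the node taking over the service. The disk group, volume, and mount point names are hypothetical, and in practice the reconfiguration framework runs these operations, not the administrator.

    # Illustrative only: what step 11 effectively does for a VxVM logical host
    # (webdg, vol01, and /web are example names).

    vxdg -C import webdg                  # import the disk group, clearing stale host locks
    vxvol -g webdg startall               # start the volumes in the disk group
    fsck -y /dev/vx/rdsk/webdg/vol01      # check the file system on the raw device
    mount /dev/vx/dsk/webdg/vol01 /web    # remount the volume on the new host

    # An SVM-based logical host instead takes ownership of its disk set:
    # metaset -s websds -t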

When several applications share a single logical host, the failure of any one of them, necessitating a logical host migration, results in all of the applications being moved, regardless of their condition. When finer-grained control is required, separate logical hosts, each with its own disk groups or disk sets, must be created to allow independent migration. Sun Cluster 3.0 does not require such an implementation, although without additional device groups, some of the resource groups incur some remote I/O. FIGURE B-1 shows the Sun Cluster 2.2 logical host.

Figure B-1. Sun Cluster 2.2 Logical Host


Availability

By the time Sun Cluster 2.2 7/00 was released, a large collection of data services existed. The Sun and third-party agents included these services:

  • Oracle Enterprise Server and Applications

  • IBM DB2 Enterprise Edition

  • Sybase ASE

  • IBM Informix Dynamic Server

  • Network File System (NFS)

  • Domain name service (DNS)

  • Solstice Internet Mail Server

  • Apache web server

  • iPlanet Enterprise, Messaging, News, and LDAP Servers

  • SunPC™ NetLink

  • SAP R/3

  • Lotus Notes/Domino

  • Tivoli Framework

  • NetBackup

  • BEA Tuxedo

  • Open Market Transact

  • Adabas

Custom agents could be created with the 1.0 HA-API; however, no equivalent of the SunPlex Agent Builder Wizard was available to speed development. Agents that required subtly different settings usually required multiple versions of the same scripts, because the data service infrastructure lacked the Sun Cluster 3.0 resource type registration (RTR) file concept that allows so much of the parameterization of resource types. This shortfall caused considerable management and development overhead when trying to maintain a single code base for an agent.

Control

The data services defined in a logical host are started, stopped, and monitored by a collection of programs and shell scripts that are registered with the cluster framework through the hareg(1M) command. Whenever the cluster is reconfigured, a callback mechanism executes these programs and scripts during steps 11 and 12 of the cluster state machine.
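
As a hedged example, registering and activating the Sun-supplied HA-NFS data service looks roughly like the following. The option letters are recalled from the Sun Cluster 2.2 documentation and the logical host name is a placeholder, so treat the exact flags as an assumption to be verified against hareg(1M).

    # Hedged sketch only: register and activate the Sun-supplied HA-NFS
    # service on an example logical host called hanfs.
    hareg -s -r nfs -h hanfs     # register the data service with the framework
    hareg -y nfs                 # activate it; its callbacks now run during
                                 # steps 11 and 12 of the reconfiguration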

Compared with the two options offered by the Sun Cluster 3.0 release, Sun Cluster 2.2 has no way to enforce strict data service dependency. The illusion of logical host dependency can be created, but only by the registration of the services in a specific order. Any subsequent change in any of these data services can break the artificial ordering.

Cluster Management

The management complexity of Sun Cluster 2.2 is substantially higher than that of Sun Cluster 3.0 because each data service relies on a number of disparate commands, for example, haoracle(1M), hasybase(1M), and hadsconfig(1M), rather than the uniform scrgadm(1M) interface in Sun Cluster 3.0. Sun Cluster 2.2 also has only limited integration with Sun SyMON, the forerunner of the Sun Management Center.

Summary

Sun Cluster 3.0 corrected the deficiencies of Sun Cluster 2.2 with improved failure fencing, cluster membership arbitration, amnesia protection, and resource management, and with the ability to boot nodes directly into an operating cluster. Sun Cluster 3.0 also improved the interface to the system management tools and added a wizard for writing agents.
