Cluster Failures

This section describes how the Sun Cluster 3.0 architecture handles the complex system failures presented in “Failures in Complex Systems” and “Failures in Clustered Systems”.

Failure Detection

No single component within the product is responsible for detecting and recovering from failures. Instead, components such as the public network infrastructure and the applications rely on their own fault probes to determine the condition of their particular service. Sun Cluster 3.0 implements a system of both local and remote fault probes, so you can distinguish connectivity problems from data service problems. Detection of failures in the disk subsystem and recovery from them lie with the volume management products. “Recoverable Failures” and “Unrecoverable Failures” describe the failures that the cluster tolerates and does not tolerate in the disk subsystem.

A Sun Cluster 3.0 system contains completely redundant hardware components. The cluster must have the following:

  • Two or more server nodes, each running a separate copy of the Solaris operating environment

  • Multiple disk arrays, unless the cluster uses a resilient storage unit such as the Sun StorEdge™ A3500, Sun StorEdge A3500FC arrays, or Sun StorEdge T3ES in Partner Pair mode

  • Multiple private interconnects to provide a resilient framework for the cluster kernel infrastructure

  • Multiple public networks to serve the client population

Sun Cluster 3.0 can survive the failure of single components and remain capable of providing a service. In some cases, Sun Cluster 3.0 can survive multiple independent failures without loss of service. For example, the cluster can tolerate the loss of a disk array and a NIC without a service failover, as long as the node has a standby NIC.

Failure Handling and Outage Time

A series of compiled programs and shell scripts makes applications that run on a Sun Cluster 3.0 system highly available. These scripts control the start and stop functions and monitor the health of the application through a series of application-specific probes. For example, appropriate calls to the Oracle svrmgr program start and stop the Oracle database service. The health of the service can then be gauged by both the local and the peer cluster nodes. These nodes can connect to the database and perform a series of database operations—create table, insert row, delete row, drop table, and select from table. Successful completion of these operations indicates that the database is healthy. The efficiency and robustness of any probe and the heuristics it uses depend entirely on its authors and their level of application expertise.
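As an illustration of the idea, the following is a minimal sketch of such a probe, written as a Korn shell script. It is not the fault monitor shipped with any Sun Cluster data service, and the probe account, password, and table name are placeholders.

    #!/bin/ksh
    # Minimal illustrative database probe (not a shipped fault monitor).
    # The probe account, password, and table name are placeholders.
    sqlplus -s probe/probepass > /dev/null 2>&1 <<'EOF'
    whenever sqlerror exit 1
    create table ha_probe_tbl (c number);
    insert into ha_probe_tbl values (1);
    select count(*) from ha_probe_tbl;
    delete from ha_probe_tbl;
    drop table ha_probe_tbl;
    exit 0
    EOF
    exit $?

A zero exit status indicates a healthy database; any other status, or a hang that the calling framework times out, is treated as a failure.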

Clustered systems make applications highly available through the combination of the application-specific probes, described previously, and the “insurance policy” of additional server nodes to host the service if a failure is detected. In general, a clustered system does not confer any new properties on an application beyond automatic restart attempts, either on the same node or a different one. This approach produces two important results. First, the sum of the failure detection time, the cluster reconfiguration time, and the application recovery and restart time determines the outage time of an application. Second, if unrecoverable application data corruption occurs, the application is unable to restart on any cluster node because the corruption carries over to any new host.
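To illustrate with purely hypothetical numbers: a fault monitor that takes 30 seconds to declare a database dead, a cluster reconfiguration of 60 seconds, and a database crash recovery and restart of 4 minutes produce an outage of roughly 30 + 60 + 240 = 330 seconds, no matter how quickly the failed hardware itself is taken out of service.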

Accuracy Versus Speed

The time taken to detect failures is always a trade-off between two factors—the resource overhead of checking the application on a shorter interval and, more importantly, the ability to distinguish between an application that is down and one that is responding slowly. Because many applications are state based, the impact on current users and the time taken to recover and restart the application can outweigh the value of shorter timeouts.

Decisions to fail over an application are always easier to make in retrospect. You should take an iterative approach to tuning fault monitor timeouts, using careful change management control procedures. When detection and recovery are fast enough to meet the service level agreements (SLAs), you should not attempt further tuning. Further tuning risks encountering the problems described in “Timeouts”.
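When tuning is justified, the fault monitor settings are changed with the standard resource administration commands. The sketch below assumes a resource named oracle-server-rs; the property names available, and whether a given property is a standard (-y) or extension (-x) property, vary by data service.

    # Inspect the current fault monitor settings of the resource:
    scrgadm -pvv -j oracle-server-rs

    # Lengthen the thorough probe interval (a standard property) and the
    # probe timeout (an extension property on many data services):
    scrgadm -c -j oracle-server-rs -y Thorough_probe_interval=120
    scrgadm -c -j oracle-server-rs -x Probe_timeout=90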

Public Network Monitoring

Every node in a cluster is connected to one or more public networks through which client applications access the services of the cluster. Like most hardware components, though, these networks can fail; if they are the sole means by which clients access an application, any user connections are lost. To guard against this, you must connect the cluster to a resilient corporate network infrastructure and have multiple switches and routers between the client system and the cluster. Thus, the cluster requires a minimum of two network interface cards (NICs) for each public subnet connection.

With this level of resiliency built in, the cluster can survive the failure of a local NIC, switch, or hub. Then, rather than switching an entire service over from one node to another, with all the delays and disruption to user connectivity that process entails, the service can continue to communicate with its clients through the alternative NIC. Sun Cluster 3.0 migrates all of the logical host IP addresses hosted on the failed NIC to the standby NIC.

The public network monitoring (PNM) daemon process, pnmd(1M), detects the loss of connectivity. Sun Cluster 3.0 arranges public network adapters in network adapter failover (NAFO) groups, creating one NAFO group per subnet. Thus, a NAFO group consists of one or more of the NICs that are connected to the particular subnet. The rules for the NICs in a NAFO group are:

  • A single NIC port can be associated only with a single NAFO group.

  • NIC ports within a NAFO group must be of the same speed, for example, hme and qfe but not hme and ge.
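A NAFO group that obeys these rules might be created as follows; the group name and adapter names are site-specific examples.

    # Group two 100-Mbit adapters on the production subnet into nafo0:
    pnmset -c nafo0 -o create qfe0 qfe1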

The PNM daemon can detect the failure of a cluster NIC and the failure of the hub or switch to which it is connected. PNM monitors connectivity through the kstat(3KSTAT) kernel interface, determining whether packets have been transmitted or received by a specific NIC in the previous time interval. The algorithm is designed to keep monitoring traffic to a minimum by avoiding unnecessary pings; consequently, you cannot tune the timeouts and intervals that the PNM daemon uses.

The key test that the PNM algorithm performs is to determine whether any network traffic has flowed through a particular interface in the previous time interval. If no traffic is flowing, pnmd(1M) waits a short while before first trying to reach a previously contacted host. If this attempt also fails, PNM uses a combination of ping(1M) attempts to the multicast addresses 224.0.0.2 (all routers) and 224.0.0.1 (all hosts) to solicit a response from any host on the particular subnet. When a response to one of these pings is received, that host is used as the target for subsequent pings until it no longer responds.

If traffic was flowing originally, PNM uses a broadcast ping, for example, 192.168.200.255, to solicit a response from a host on that subnet to use as a target for future test pings.
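The following Korn shell fragment approximates this logic for a single adapter. It is a simplified sketch, not the pnmd(1M) implementation, and the interface name, peer address, and sampling interval are invented.

    #!/bin/ksh
    # Simplified approximation of the PNM traffic test for one adapter.
    IF=hme0                      # adapter under test (placeholder)
    PEER=192.168.200.17          # previously contacted host on this subnet

    before=$(kstat -p hme:0:${IF}:ipackets | awk '{print $2}')
    sleep 5
    after=$(kstat -p hme:0:${IF}:ipackets | awk '{print $2}')

    if [ "$after" -gt "$before" ]; then
        echo "${IF}: traffic seen, adapter assumed healthy"
    elif ping ${PEER} 2 > /dev/null 2>&1; then
        echo "${IF}: idle, but a known peer still answers"
    else
        echo "${IF}: suspect"    # pnmd would now fall back to multicast pings
    fi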

When PNM finds that no traffic is flowing, it contacts its peers in the cluster to determine whether the failure is localized or is a more general network problem. If a general network failure occurs, PNM takes no action, because requesting the movement of the cluster services would not improve client connectivity. However, if PNM finds that the problem is localized, it marks the NAFO group as DOUBT and migrates all of the logical host IP addresses to the next free adapter listed in the NAFO group until it finds one that works. If no communication is possible through any of the adapters in the NAFO group, the group is marked DOWN, and PNM requests rgmd to migrate services to a node that can provide greater client connectivity. When a working replacement adapter is found, PNM returns the group to the OK status.
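The state of each NAFO group on a node can be checked at any time from the command line:

    # List all NAFO groups and their status (OK, DOUBT, or DOWN):
    pnmstat -l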

Configuring a logical host on a NIC results in an IP and MAC address pair being broadcast on the network in question. To ensure that other systems on the network pick up the new IP-to-MAC mapping, additional gratuitous ARP packets are broadcast. The Sun Cluster framework does not attempt to migrate MAC addresses between servers. Additionally, the local-mac-address? EEPROM variable must be set to false.
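For example, the variable can be checked and set from a running Solaris system with the eeprom(1M) command; the change takes effect at the next boot.

    # Display the current setting, then force it to false on this node:
    eeprom 'local-mac-address?'
    eeprom 'local-mac-address?=false'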

Native Solaris 8 IP multipathing (IPMP) will replace public network monitoring (PNM) at some future date.

Application Failure

One of the main benefits of the Sun Cluster 3.0 approach to availability, compared with that of many fault tolerant systems, is the ability to run standard off-the-shelf applications. You can make these applications highly available through the programs and shell scripts that constitute the particular resource type or agent.

The ability to distinguish between an application that actually failed and an application that is responding slowly because of excessive workload or system resource constraints governs the efficacy of an application probe. Thus, you should run all shell script fault detection tests under the hatimerun(1M) facility. This facility enables you to run the test against a tunable timeout stored in the CCR.
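For example, a probe script (the path here is hypothetical) can be wrapped so that a hang is converted into a failure after 30 seconds:

    # Run the probe under a 30-second timeout; hatimerun kills the probe
    # and returns a nonzero exit status if it does not complete in time.
    hatimerun -t 30 /opt/SUNWscapp/bin/app_probe.ksh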

A probe can be as simple or as complex as the designer cares to make it. For example, the Sun Oracle database probe tries to connect to the target database and check for database activity by monitoring changes in the statistics (v$sysstat) table. If the database is idle for any reason, for example, because of a problem, the probe uses a more intrusive and expansive set of operations—create a temporary table, insert and delete a row within that table, drop the table, and a commit—to check the health of the database. If all of these operations complete in a timely fashion without error, the database is considered to be working correctly.

A fault probe can be extended almost indefinitely to cover an increasingly esoteric problem, but such extension increases code complexity, development time and costs, and the likelihood that the fault monitor itself can fail because of program bugs.

The user perception of an application failure largely depends on the nature of the application and any intermediate software layers between it and the user. Most modern software packages implement a multitiered approach, which can include a database, web servers, application servers, presentation servers, and integration with legacy systems. Shielding the user from failures in any of these layers can present a considerable challenge. Traditionally, this role is fulfilled by transaction processing (TP) monitors, such as Tuxedo from BEA Systems or CICS from IBM. However, the release of the Java 2 Enterprise Edition (J2EE) standard led to the development of a number of application servers (see “Application Servers”) that can also meet this requirement. These approaches require middle tiers to be able to reconnect to a database after a failure. If an application server fails, the state of a request should be saved to another middle tier server so that the request can be resubmitted.

Without any of these provisions in place, an application failure will disconnect a user from the database or state-based service and require the user to reestablish the connection. Then, when the service restarts, the user must log in to the database and re-create the state of the transaction. However, in some cases the application recovery time can be tuned to minimize the service outage, for example, by increasing the frequency of Oracle checkpoints. Because all Sun Cluster services are addressed through logical rather than physical IP addresses, users need not change the way they access or address the application.
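As a sketch of the checkpoint example, the Oracle 8i-era init.ora parameters below bound how much redo must be replayed after a crash; the values shown are illustrative only and must be weighed against the run-time I/O cost of more frequent checkpoints.

    # Illustrative init.ora settings only:
    log_checkpoint_timeout  = 300      # checkpoint at least every 5 minutes
    log_checkpoint_interval = 100000   # or after 100,000 redo blocks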

File, DNS, and similar stateless services pose less of a problem; a user simply sees a delay while the service responds. Because the NFS protocol is stateless, a server failure during a write operation results in an “NFS server not responding” message, but the write operation completes once the server restarts. Read operations are similarly blocked until service is restored.

Most web interaction results in a series of HTTP requests to the web server. Each request opens and closes a session with the web server. When the web server fails in midsession, the user must resubmit the request. Failures that do not occur within a session are transparent to the user. If the user has a state-based web service, such as HTTPS, a web server failure requires the user to reestablish the transaction in a manner analogous to that of a database failure.

Process Monitoring Facility

The process monitoring facility in Sun Cluster 3.0 provides a mechanism for monitoring processes and their descendents and restarting them on the same node if they fail, without incurring an expensive failover. The start method (see “Data Services and Application Agents”) of the data service registers a process, and an associated tag, with the pmfadm(1M) command. The rpc.pmfd(1M) daemon then restarts the process if it exits unexpectedly. The daemon does this a set number of times within a particular period before the data service fault probe tries to restart the process on an alternate node. You cannot trace processes that are monitored by PMF with the truss(1) command, because truss will not trace a process that is already being controlled by another process through the /proc interface.
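A data service start method might use the facility along the following lines; the tag, retry counts, and daemon path are invented for the example.

    # Start and monitor a daemon under the tag "myapp,srv"; restart it up
    # to 3 times within any 10-minute window before declaring it failed.
    pmfadm -c myapp,srv -n 3 -t 10 /opt/myapp/bin/myappd

    # Query its status, or stop monitoring and send it SIGTERM:
    pmfadm -l myapp,srv
    pmfadm -s myapp,srv TERM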

Recoverable Failures

Sun Cluster is designed to handle single failures and some combinations of double failures. Anything that leads to a panic within a standalone Solaris operating environment results in a panic in a clustered node. These failures include software failures, such as kernel bugs, or hardware failures, such as CPU failure, hard memory errors, and backplane failures. As a result, the resource group manager migrates any applications on these nodes to a functioning node.

Data Storage

All data storage in the cluster must have some form of RAID protection. Performance and cost factors govern the choice of RAID 1 or RAID 5. You can use the Solaris Volume Manager (SVM) or the VERITAS Volume Manager (VxVM) to provide host-based mirroring capabilities within Sun Cluster 3.0. However, RAID 5 protection is usually restricted to hardware RAID controllers, such as the Sun StorEdge T3ES array (in partner pair mode), Sun StorEdge A3500, and Sun StorEdge A3500FC arrays.

When a storage device fails to successfully complete an application I/O request, the volume manager driver (md or vx) or a hardware RAID controller must contain and report the error. The volume manager should then activate a suitable hot spare disk to protect the vulnerable data from a subsequent disk failure.

Similarly, the volume manager should trap the failure of a host bus adapter or storage interconnect as long as the failure does not result in a kernel panic.

If all access paths to the user data fail, the Sun Cluster 3.0 framework does not take any action because this is considered a double failure. Generally, applications must wait until their respective fault monitors detect that they are not responding before the system attempts to restart or move them. This action only succeeds if the resource group has a SUNW.HAStorage resource that can switch control of the device group to its secondary path—assuming that alternative working I/O paths exist and that the prior failure has not compromised their termination or integrity.
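A resource group is tied to its device group through a SUNW.HAStorage resource; a sketch of the registration, with invented resource, resource group, and device group names, follows.

    # Register the resource type once per cluster, then add the resource:
    scrgadm -a -t SUNW.HAStorage
    scrgadm -a -j hastorage-rs -g oracle-rg -t SUNW.HAStorage \
            -x ServicePaths=oradg -x AffinityOn=True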

Private Interconnects

The failure of a private interconnect between nodes results in a cluster reconfiguration only when the private interconnect is the last active path. If additional paths are still operating, the cluster disables the connection and routes traffic through the remaining paths. User applications are unaffected by the failure. When the final connection is lost, the CMM establishes a new cluster membership. “Cluster Membership” describes this process in detail.

Public Networks

For public subnets, the public network monitoring daemon (see “Public Network Monitoring”) handles the failure of inbound or outbound network connectivity from a network adapter card, hub, or switch. During the adapter switchover, applications that use TCP/IP connections can drop packets, but once the system reaches the appropriate timeout, it retransmits these packets as part of the standard TCP recovery mechanism. When UDP is used instead, the application must recover the lost packets. Therefore, any application that relies on UDP must be able to handle such failures.

You should use Sun Management Center 3.0 to monitor the cluster. This software package enables you to identify network component failures and take corrective action to prevent subsequent errors from causing a service outage. For example, after a failed NIC has been replaced, you should switch any IP addresses back to the original adapter, using the pnmset(1M) command. This action prevents potential service outages when both the backup NIC on the node that is hosting the service and the potential primary NICs for the service on other cluster nodes are connected to the same hub and that hub subsequently fails. Under these circumstances, the cluster might decide that a public network is down.
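Assuming the original adapter was qfe0 in NAFO group nafo0 (illustrative names), the switch back might look like this:

    # Move the group's IP addresses back to the repaired adapter:
    pnmset -c nafo0 -o switch qfe0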

Unrecoverable Failures

Regardless of the redundancy in hardware and software components within a cluster, the data, the application, control of the cluster, and any system administration pertaining to it are single points of failure (SPOFs). Corruption and deletion of application data files render a service inoperable because the fault persists, regardless of the node on which the application resides. Similarly, application bugs that result in a crash affect every cluster node equally, with the same subsequent downtime.

When you are considering how to guard against this type of unrecoverable failure, remember that no substitute exists for a well-planned and tested disaster recovery policy. At a minimum, you should implement a mechanism for rapid data restoration. This might include a high-performance tape library to bring the latest copy of the data file back from tape. Alternatively, the Sun StorEdge Instant Image 3.0 package offers a faster route to recovery, albeit at the cost of additional storage space, because it copies back only the changed disk blocks rather than the entire data file. The implementation details of such an approach depend on the specific data layout.

An equivalent, but less severe, problem stems from uncontrolled, untested, or inappropriate changes to the configuration files of an application. If the cluster file system stores a single, centralized copy of the file, the problem is global and can affect all running instances. Although this problem might be faster to correct, it might take longer to diagnose, especially if the ramifications of the change are experienced only during the next major cluster reconfiguration. The alternative approach of having individual copies of the application configuration files on each cluster node simply trades off administrative convenience for protection. See TABLE 3-1.

Failure Reporting

Sun Cluster logs all messages by issuing syslog(3) calls to the /var/adm/messages file. Therefore, this file should be the focus for regular expression and pattern matchers that monitor the system for abnormal conditions or emerging problems. You can use the Sun Management Center 3.0 software to achieve this goal.
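A minimal example of this kind of pattern matching follows; the patterns are illustrative only, and a production monitor such as Sun Management Center applies far more structured rules.

    # Show recent log entries that merit investigation:
    egrep -i 'error|warning|panic' /var/adm/messages | tail -20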

Other user-level cluster processes, such as the SunPlex Manager, have their own log directory hierarchies in /var/cluster. Daemon processes like pnmd produce a log file only if the debug flag is set when the system restarts the process.

Application failure reporting is highly specific to each application. Most applications have error logs and audit files that you can monitor for erroneous conditions, enabling you to catch and remedy problems early rather than leaving them to cause a more significant outage later.
