IBM PowerHA SystemMirror V7.2 for IBM AIX new features
This chapter covers the specific features that are new to IBM PowerHA SystemMirror V7.2 for IBM AIX.
This chapter includes the following topics:
 – Network Failure Detection Tunable per interface
 – Built-in NETMON logic
 – Traffic stimulation for better interface failure detection
 – Quarantine protection against “sick but not dead” nodes
 – NFS Tie Breaker support for split and merge policies
2.1 Resiliency enhancements
Every release of PowerHA SystemMirror aims to make the product even more resilient than its predecessors. PowerHA SystemMirror for AIX V7.2 continues this tradition.
2.1.1 Integrated support for AIX Live Kernel Update
AIX V7.2 introduced a new capability that allows concurrent patching without interruption to the applications. This capability is known as AIX live kernel update (LKU). Initially, this capability is supported only for interim fixes, but it is the foundation for broader patching of service packs and, eventually, technology levels.
 
Tip: More details about LKU can be found on the following website:
A demonstration of performing LKU is available on the following website:
Consider the following key points about PowerHA’s integrated support for live kernel updates:
LKU can be performed on only one cluster node at a time.
Support includes all PowerHA SystemMirror Enterprise Edition Storage replication features including HyperSwap and Geographic Logical Volume Manager (GLVM).
For asynchronous GLVM, you must swap to sync mode before LKU is performed, and then swap back to async mode upon LKU completion.
During LKU operation, enhanced concurrent volume groups cannot be changed.
Workloads continue to run without interruption.
PowerHA scripts and checks during live kernel update
PowerHA provides scripts that are called during different phases of the AIX live kernel update notification mechanism. An overview of the PowerHA operations that are performed in each phase follows:
Check phase
 – Verifies that no other concurrent AIX Live Update is in progress in the cluster
 – Verifies that the cluster is in a stable state
 – Verifies that there are no GLVM active asynchronous mirror pools
Pre-phase
 – Switches the active Enhanced Concurrent volume groups (VGs) into a “silent” mode
 – Stops the cluster services and SRC daemons
 – Stops GLVM traffic
Post phase
 – Restarts GLVM traffic
 – Restarts System Resource Controller (SRC) daemons and cluster services
 – Restores the state of the Enhanced Concurrent volume groups
Enabling and disabling AIX Live Kernel Update support of PowerHA
As is the case for most of the features and functionality of PowerHA, this feature can be enabled and disabled either through the System Management Interface Tool (SMIT) or from the command line with the clmgr command. In either case, it must be set on each node.
When enabling AIX LKU through SMIT, the option is set using either yes or no. However, when using the clmgr command, the settings are true or false. The default is for it to be enabled (yes/true).
To modify using SMIT, perform the following steps, as shown in Figure 2-1:
1. Go to smitty sysmirror → Cluster Nodes and Networks → Manage Nodes → Change/Show a Node.
2. Select the wanted node.
3. Set the Enable AIX Live Update operation field as wanted.
4. Press Enter.
                              Change/Show a Node
 
Type or select values in entry fields.
Press Enter AFTER making all wanted changes.
 
[Entry Fields]
* Node Name                                         Jess
  New Node Name                                     []
  Communication Path to Node                        [Jess]                  +
  Enable AIX Live Update operation                  Yes                     +
 
Figure 2-1 Enabling AIX Live Update operation
An example of how to check the current value of this setting using clmgr follows:
[root@Jess] /# clmgr view node Jess |grep LIVE
ENABLE_LIVE_UPDATE="true"
An example of how to disable this setting using clmgr follows:
[root@Jess] /# clmgr modify node Jess ENABLE_LIVE_UPDATE=false
In order for the change to take effect, the cluster must be synchronized.
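An example of synchronizing the cluster from the command line follows (a minimal sketch; clmgr sync is a short form of clmgr synchronize):
[root@Jess] /# clmgr sync cluster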
Logs generated during AIX Live Kernel Update operation
The two logs used during the operation of an AIX Live Kernel Update are both located in the /var/hacmp/log directory:
lvupdate_orig.log This log file keeps information from the original source system logical partition (LPAR).
lvupdate_surr.log This log file keeps information from the target surrogate system LPAR.
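For example, the logs can be listed and followed during an LKU operation with standard AIX commands:
[root@Jess] /# ls -l /var/hacmp/log/lvupdate_orig.log /var/hacmp/log/lvupdate_surr.log
[root@Jess] /# tail -f /var/hacmp/log/lvupdate_orig.log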
 
Tip: A demo of performing a Live Kernel Update, though on a stand-alone AIX system and not a PowerHA node, is available on the following website:
2.1.2 Automatic repository replacement
Cluster Aware AIX (CAA) detects when a repository disk failure occurs and generates a notification message. The notification messages continue until the failed repository disk is replaced. PowerHA V7.1.1 introduced the ability to define a backup repository disk. However, the replacement procedure was a manual one. Beginning in PowerHA V7.2, combined with AIX V7.1.4 or V7.2.0, Automatic Repository Update (ARU) provides the capability to automatically swap a failed repository disk with the backup repository disk.
A maximum of six repository disks per site can be defined in a cluster. The backup disks are polled once a minute by clconfd to verify that they are still viable for an ARU operation. The steps to define a backup repository disk are the same as in previous versions of PowerHA. These steps and examples of failure situations can be found in 4.2, “Automatic repository update for the repository disk” on page 77.
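A minimal sketch of defining and listing a backup repository disk with clmgr follows (hdisk5 is an example disk name; verify the syntax for your level before using it):
[root@Jess] /# clmgr add repository hdisk5
[root@Jess] /# clmgr query repository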
 
Tip: An overview of configuring and a demonstration of automatic repository replacement can be found on the following website:
2.1.3 Verification enhancements
Cluster verification is the framework for checking environmental conditions across all nodes in the cluster. Its purpose is to help ensure proper operation of cluster events when they occur. Every new release of PowerHA provides more verification checks. PowerHA V7.2 adds both new default checks and a new option for detailed verification checks.
The following new checks run by default:
Verify that the reserve_policy setting on shared disks is not set to single_path.
Verify that /etc/filesystems entries for shared file systems are consistent across nodes.
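For example, these conditions can also be checked manually across the cluster by using the clcmd distributed command (hdisk2 and sharedfs are example names, and the chdev command requires the disk to be closed):
[root@Jess] /# clcmd lsattr -El hdisk2 -a reserve_policy
[root@Jess] /# clcmd chdev -l hdisk2 -a reserve_policy=no_reserve
[root@Jess] /# clcmd grep -p sharedfs /etc/filesystems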
The new detailed verification checks, which run only when explicitly enabled, include the following:
Physical volume identifier (PVID) checks between Logical Volume Manager (LVM) and Object Data Manager (ODM) on various nodes
Use AIX Runtime Expert checks for LVM, and Network File System (NFS)
Checks if network errors exceed a predefined 5% threshold
GLVM buffer size
Security configuration, such as password rules
Kernel parameters, such as network, Virtual Memory Manager (VMM), and so on
Using the new detailed verification checks can add a significant amount of time to the verification process. To enable it, run smitty sysmirror → Custom Cluster Configuration → Verify and Synchronize Cluster Configuration (Advanced), and then set the option of Detailed Checks to Yes, as shown in Figure 2-2 on page 21. This must be set manually each time, because it will always default to No. This option is only available if cluster services are not running.
              PowerHA SystemMirror Verification and Synchronization
 
Type or select values in entry fields.
Press Enter AFTER making all wanted changes.
 
[Entry Fields]
* Verify, Synchronize or Both [Both] +
* Include custom verification library checks [Yes] +
* Automatically correct errors found during [No] +
verification?
 
* Force synchronization if verification fails? [No] +
* Verify changes only? [No] +
* Logging [Standard] +
* Detailed checks Yes +
* Ignore errors if nodes are unreachable ? No +
 
F1=Help F2=Refresh F3=Cancel F4=List
F5=Reset F6=Command F7=Edit F8=Image
F9=Shell F10=Exit Enter=Do
 
Figure 2-2 Enabling detailed verification checking
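Verification can also be run from the command line, as in the following minimal sketch (the detailed checks option might be available only through the SMIT panel shown in Figure 2-2):
[root@Jess] /# clmgr verify cluster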
2.1.4 Use of Logical Volume Manager rootvg failure monitoring
AIX LVM recently added the capability to mark a volume group as a critical volume group. Although PowerHA has supported critical volume groups in the past, that support applied only to non-operating system (data) volume groups. PowerHA V7.2 now also takes advantage of this functionality specifically for rootvg.
If a volume group is marked as critical, any input/output (I/O) request failure triggers a Logical Volume Manager (LVM) metadata write operation to check the state of the disk before the I/O failure is returned. If the critical VG option is set on rootvg and the volume group loses access to the quorum set of disks (or all disks if quorum is disabled), the node is halted and a message is displayed on the console, instead of the VG being moved to an offline state.
You can set and validate rootvg as a critical volume group by running the commands shown in Figure 2-3. The chvg command needs to be run only once because the clcmd CAA distributed command runs it on every node in the cluster.
# clcmd chvg -r y rootvg
# clcmd lsvg rootvg |grep CRIT
DISK BLOCK SIZE: 512 CRITICAL VG: yes
DISK BLOCK SIZE: 512 CRITICAL VG: yes
 
Figure 2-3 Enabling rootvg as a critical volume group
Testing rootvg failure detection
In this environment, rootvg resides on IBM Storwize V7000 logical unit numbers (LUNs) presented to the PowerHA nodes through virtual Fibre Channel (FC) adapters. Loss of a disk can be simulated in multiple ways, but one of the following methods is typically used:
From within the storage management interface, unmap the volume (or volumes) from the host
Unmap the virtual FC adapter from the real adapter on the Virtual I/O Server (VIOS)
Unzone the virtual worldwide port names (WWPNs) from the storage area network (SAN)
We prefer the first option, unmapping from the storage side. The other two options usually affect all of the disks rather than just rootvg, although that is usually acceptable as well.
After the rootvg LUN is disconnected and the failure is detected, a kernel panic ensues. If the failure occurs on a PowerHA node that is hosting a resource group, then a resource group fallover occurs as it would with any unplanned outage.
If you check the error report after restarting the system successfully, it will have a kernel panic entry, as shown in Example 2-1.
Example 2-1 Kernel panic error report entry
---------------------------------------------------------------------------
LABEL: KERNEL_PANIC
IDENTIFIER: 225E3B63
 
Date/Time: Mon Jan 25 21:23:14 CST 2016
Sequence Number: 140
Machine Id: 00F92DB14C00
Node Id: PHA72a
Class: S
Type: TEMP
WPAR: Global
Resource Name: PANIC
 
Description
SOFTWARE PROGRAM ABNORMALLY TERMINATED
 
Recommended Actions
PERFORM PROBLEM DETERMINATION PROCEDURES
 
Detail Data
ASSERT STRING
 
PANIC STRING
Critical VG Force off, halting.
Of course, the cluster would need to be restarted on the previously failed node. If it previously hosted a resource group, then a resource group move back might be desired as well.
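A minimal sketch of restarting cluster services on the recovered node and moving the resource group back follows (PHA72a and RG1 are example node and resource group names):
[root@Jess] /# clmgr start node PHA72a
[root@Jess] /# clmgr move resource_group RG1 NODE=PHA72a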
2.1.5 Live Partition Mobility automation
Performing a Live Partition Mobility (LPM) operation on a PowerHA node has always been supported. However, it is not without risk. Because of the unique nature of LPM, certain events, such as a network loss, could be triggered during the operation. Suggested workarounds have existed in the past, such as unmanaging the node before performing LPM, but many users were unaware of them. As a result, the LPM automation integration feature was created.
PowerHA scripts and checks during Live Partition Mobility
PowerHA provides scripts that are called during different phases of the Live Partition Mobility notification mechanism. An overview of the PowerHA operations that are performed in each phase follows:
Check phase
 – Verifies that no other concurrent LPM is in progress in the cluster
 – Verifies that the cluster is in a stable state
 – Verifies network communications between cluster nodes
Pre-phase
 – If the LPM Node Policy is set to unmanage, or if IBM HyperSwap is used, stop cluster services in unmanaged mode.
 – On local node, and on peer node in two-node configuration:
 • Stop the Reliable Scalable Cluster Technology (RSCT) Dead Man Switch.
 • If HEARTBEAT_FREQUENCY_FOR_LPM is set, change the CAA node timeout.
 • If the per-node CAA deadman_mode is set to a (assert), set it to e (event).
 – Restrict SAN communications across nodes.
Post phase
 – Restart cluster services.
 – On local node, and on peer node in two-node configuration:
 • Restart the RSCT Dead Man Switch.
 • Restore the CAA node timeout.
 • Restore the CAA deadman_mode.
 – Re-enable SAN communications across nodes.
The following new cluster heartbeat settings are associated with the auto handling of LPM:
Node Failure Detection Timeout during LPM
If specified, this timeout value (in seconds) is used during a Live Partition Mobility (LPM) operation instead of the Node Failure Detection Timeout value.
You can use this option to increase the Node Failure Detection Timeout for the LPM duration so that it is greater than the LPM freeze duration, avoiding any risk of unwanted cluster events. Enter a value from 10 to 600.
LPM Node Policy
This specifies the action to be taken on the node during a Live Partition Mobility operation.
If unmanage is selected, the cluster services are stopped with the Unmanage Resource Groups option for the duration of the LPM operation. Otherwise, PowerHA SystemMirror continues to monitor the resource groups and application availability.
As is common, these options can be set using both SMIT and the clmgr command line. To change these options using SMIT, run smitty sysmirror → Custom Cluster Configuration → Cluster Nodes and Networks → Manage the Cluster → Cluster Heartbeat Settings, as shown in Figure 2-4.
                     Cluster heartbeat settings
 
Type or select values in entry fields.
Press Enter AFTER making all wanted changes.
 
[Entry Fields]
 
* Network Failure Detection Time [20] #
* Node Failure Detection Timeout [30] #
* Node Failure Detection Grace Period [10] #
* Node Failure Detection Timeout during LPM [120]                    #
* LPM Node Policy [unmanage]               +
 
 
F1=Help F2=Refresh F3=Cancel F4=List
F5=Reset F6=Command F7=Edit F8=Image
F9=Shell F10=Exit Enter=Do
 
Figure 2-4 Enabling LPM integration
An example of using clmgr to check and change these settings is shown in Example 2-2.
Example 2-2 Using the clmgr command
[root@Jess] /# clmgr query cluster |grep LPM
LPM_POLICY=""
HEARTBEAT_FREQUENCY_DURING_LPM="0"
 
[root@Jess] /# clmgr modify cluster HEARTBEAT_FREQUENCY_DURING_LPM="120"
[root@Jess] /# clmgr modify cluster LPM_POLICY=unmanage
 
[root@Jess] /# clmgr query cluster |grep LPM
LPM_POLICY="unmanage"
HEARTBEAT_FREQUENCY_DURING_LPM="120"
Even with these new automated steps, there are still a few manual steps when using SAN Communications:
Before LPM
Verify that the tme attribute is set to yes on the target system's VIOS Fibre Channel adapters (see the example after this list)
After LPM
Reestablish SAN communication between the VIOS and the client LPAR through the virtual local area network (VLAN) 3358 adapter configuration
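As an illustration of the tme check listed above, the attribute can be verified and set on the VIOS physical FC adapters from the root (oem_setup_env) shell. This is a hedged sketch: fcs0 is an example adapter name, and setting tme with the -P flag requires a restart of the VIOS to take effect:
# lsattr -El fcs0 -a tme
# chdev -l fcs0 -a tme=yes -P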
No matter which method you choose to change these settings, the cluster needs to be synchronized for the change to take effect cluster-wide.
2.2 Cluster Aware AIX (CAA) Enhancements
CAA is updated with every new AIX level. The CAA version typically references the year in which it was released. For example, the AIX V7.2 CAA level is referenced as the 2015 version, also known as release 4. Table 2-1 shows the AIX and PowerHA levels that match the CAA versions. This chapter continues with features that are new to CAA (2015/R4).
Table 2-1 IBM AIX and PowerHA levels to CAA versions

Internal version   External release   AIX level     PowerHA level
2011               R1                 6.1.7/7.1.1   7.1.1
2012               R2                 6.1.8/7.1.2   7.1.2
2013               R3                 6.1.9/7.1.3   7.1.3
2015               R4                 7.1.4/7.2.0   7.2
2.2.1 Network Failure Detection Tunable
PowerHA 7.1 had a fixed network failure detection latency of about 5 seconds. In PowerHA 7.2, the default is 20 seconds, and the value can now be tuned per interface. The tunable is named network_fdt.
 
Note: The network_fdt tunable is also available for PowerHA 7.1.3. To get it for your PowerHA 7.1.3 version, you must open a PMR and request the “Tunable FDT IFix bundle”.
The self-adjusting network heartbeat behavior of CAA, which was introduced with PowerHA 7.1.0, still exists and is still used. It has no impact on the network failure detection time.
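A hedged sketch of inspecting and changing the network_fdt CAA tunable directly with the clctrl command follows. The value is assumed here to be in milliseconds, and changing heartbeat settings through the PowerHA SMIT panels or the clmgr command remains the usual route:
# clctrl -tune -L
# clctrl -tune -o network_fdt=20000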
2.2.2 Built-in NETMON logic
NETMON logic was previously handled by RSCT. Because it was becoming difficult to keep the CAA and RSCT layers synchronized about the adapter state, the NETMON logic has been moved into the CAA layer.
The configuration file remains the same, namely /usr/es/sbin/cluster/netmon.cf. RSCT will eventually disable the NETMON functionality in its code.
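A hedged sketch of a netmon.cf file follows, using the !REQD format that is commonly used in virtualized environments (en0 is an example interface name and the addresses are example ping targets outside the cluster):
!REQD en0 10.10.10.1
!REQD en0 10.10.10.2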
More information about netmon.cf file usage and formatting can be found on the following website:
2.2.3 Traffic stimulation for better interface failure detection
Multicast pings are sent to the all-hosts multicast group just before an interface is marked down. The ping is distributed to the nodes within the subnet. Any node receiving this request replies (even if the node is not part of the cluster), and thus generates incoming traffic on the adapter. The multicast ping uses the address 224.0.0.1, and all nodes register for this multicast group by default. Therefore, there is a good chance that some incoming traffic is generated by this method.
2.3 Enhanced “split brain” handling
Split brain, also known as a partitioned cluster, refers to a situation in which all communication is lost between cluster nodes while the nodes themselves are still running. PowerHA 7.2 supports new policies to quarantine a sick or dead active node. These policies help handle cluster split scenarios and ensure data protection when they occur. The following two new policies are supported:
Disk fencing
Disk fencing uses the Small Computer System Interface 3 (SCSI-3) Persistent Reservation mechanism to fence out the sick or dead node and block future writes from it.
Hardware Management Console (HMC)-based Active node shoot down
With the HMC-based Active node shoot down policy, the standby node works with the HMC to shut down the previously active (sick) node, and only then starts the workload on the standby.
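An example of checking the current split, merge, and quarantine-related cluster settings follows, mirroring the clmgr query idiom used earlier in this chapter (the exact attribute names can vary by release):
[root@Jess] /# clmgr query cluster | grep -E "SPLIT|MERGE|QUARANTINE"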
2.4 Resource optimized high availability (ROHA) fallovers using Enterprise Pools
PowerHA has offered integrated support for dynamic LPAR (DLPAR) operations, including the use of Capacity on Demand (CoD) resources, since IBM HACMP V5.3. However, the types of CoD supported were limited. PowerHA V7.2 extends this support to include Enterprise Pool CoD (EPCoD) and elastic CoD resources. Using these types of resources makes the solution less expensive to acquire and to own.
This support has the following requirements:
PowerHA SystemMirror V7.2, Standard Edition or Enterprise Edition
One of the following AIX levels:
 – AIX V6.1 TL09 SP5
 – AIX V7.1 TL03 SP5
 – AIX V7.1 TL4
 – AIX V7.2 or later
HMC requirement
 – HMC V7.8 or later
 – HMC must have a minimum of 2 gigabytes (GB) of memory
Hardware requirement for using Enterprise Pool CoD license
 – IBM POWER7+: 9117-MMD, 9179-MHD with FW780.10 or later
 – IBM POWER8: 9119-MME, 9119-MHE with FW820 or later
Full details on using this integrated support can be found in Chapter 6, “Resource Optimized High Availability (ROHA)” on page 163.
2.5 Non-disruptive upgrades
PowerHA V7.2 enables non-disruptive cluster upgrades. It allows upgrades from PowerHA V7.1.3 to V7.2 without having to roll over the workload from one node to another as part of the migration. The key requirement is that the existing AIX/CAA levels must be either V6.1.9 or V7.1.3. More information on performing non-disruptive upgrades can be found in 5.3.6, “Non-disruptive migration of PowerHA from 7.1.3 to 7.2.0” on page 153.
 
Tip: A demonstration of performing a non-disruptive upgrade can be found on the following website:
2.6 GLVM wizard
PowerHA V6.1 introduced the first two-site GLVM configuration. However, it was limited to synchronous implementations and still required a number of manual steps. PowerHA V7.2 introduces an enhanced GLVM wizard that not only involves fewer steps but also includes support for asynchronous implementations. More details can be found in Chapter 7, “Using the GLVM Configuration Assistant” on page 261.