Automation to adapt to the Live Partition Mobility (LPM) operation
This chapter introduces a new feature of the PowerHA SystemMirror 7.2 edition: automation to adapt to the Live Partition Mobility (LPM) operation.
Before the PowerHA SystemMirror 7.2 edition, if customers wanted to perform an LPM operation on an AIX LPAR that was running PowerHA services, they had to perform manual operations, which are illustrated on the following website:
The PowerHA SystemMirror 7.2 edition plugs into the LPM infrastructure to listen to LPM events and adjusts the clustering-related monitoring as needed for the LPM operation to succeed without disruption. This reduces the burden on the administrator to perform manual operations on the cluster node during LPM operations. See the following website for more information about this feature:
This chapter introduces the operations that are necessary to ensure that the LPM operation for a PowerHA node completes successfully. This chapter uses both PowerHA 7.1 and PowerHA 7.2 cluster environments to illustrate the scenarios.
This chapter contains the following sections:
8.1 Concept
This section provides an introduction to the Live Partition Mobility concepts.
Live Partition Mobility
Live Partition Mobility (LPM) enables you to migrate LPARs running the AIX operating system and their hosted applications from one physical server to another without disrupting the infrastructure services. The migration operation maintains system transactional integrity and transfers the entire system environment, including processor state, memory, attached virtual devices, and connected users.
LPM eliminates downtime for planned hardware maintenance. However, LPM does not offer the same benefit for software maintenance or unplanned downtime. You can use PowerHA SystemMirror within a partition that is capable of LPM. This does not mean that PowerHA SystemMirror uses LPM in any way; PowerHA SystemMirror is treated as just another application within the partition.
LPM operation time and freeze time
The amount of operational time that an LPM migration requires on an LPAR is determined by multiple factors, such as LPAR’s memory size, workload activity (more memory pages require more memory updates across the system), and network performance.
LPAR freeze time is a part of LPM operational time, and it occurs when the LPM tries to reestablish the memory state. During this time, no other processes can operate in the LPAR. As part of this memory reestablishment process, memory pages from the source system can be copied to the target system over the network connection. If the network connection is congested, this process of copying over the memory pages can increase the overall LPAR freeze time.
Cluster software in a PowerHA cluster environment
In a PowerHA solution, PowerHA is not the only cluster software. Two other kinds of cluster software run behind the PowerHA cluster:
RSCT
CAA
See section 4.4, “IBM PowerHA, RSCT, and CAA” on page 98, which describes their relationship.
PowerHA cluster heartbeating and the Dead Man Switch (DMS)
PowerHA SystemMirror uses constant communication between the nodes to keep track of the health of the cluster, nodes, and so on. One of the key components of communication is the heartbeating between the nodes. Lack of heartbeats forms a critical part of the decision-making process to declare a node to be dead.
The PowerHA 7.2 default node failure detection time is 40 seconds: 30 seconds for the node communication timeout plus a 10-second grace period. Note that these values can be set higher if a customer requires it.
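To confirm the effective values on a running cluster, you can query the CAA tunables directly. The following is a minimal check that uses the clctrl command shown later in this chapter; treating node_down_delay as the tunable behind the grace period is an assumption.
clctrl -tune -x node_timeout      # node communication timeout in milliseconds (30000 = 30 seconds)
clctrl -tune -x node_down_delay   # assumed to back the 10-second grace period (milliseconds)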
Node A declares its partner Node B dead if Node A does not receive any communication or heartbeats from it for more than 40 seconds. This works well when Node B is actually dead (crashed, powered off, and so on). However, there can be scenarios where Node B is not dead, but cannot communicate for long periods.
Some examples of such scenarios are as follows:
1. There is only one communication link between the nodes and it is broken (it is highly recommended to deploy multiple communication links between the nodes to avoid this scenario).
2. Due to a rare situation, the operating system freezes the cluster processes and kernel threads such that the node cannot send any I/O (disk or network) for more than 40 seconds. In this case, Node A does not receive any communication from Node B for more than 40 seconds and therefore declares Node B dead, even though it is alive. This leads to a split-brain condition, which can result in data corruption if the disks are shared across nodes.
Some of these scenarios can be handled in the cluster. For example, in scenario 2, when Node B is allowed to run again after the freeze, it recognizes that it has not been able to communicate with the other nodes for a long time and takes evasive action. These types of actions are called Dead Man Switch (DMS) protection.
DMS involves timers that monitor various activities, such as I/O traffic and process health, to recognize stray cases where there is potential for the node (Node B) to be considered dead by its peers in the cluster. In these cases, the DMS timers trigger just before the node failure detection time and evasive action is initiated. A typical evasive action involves fencing the node.
PowerHA SystemMirror consists of different DMS protections:
Cluster Aware AIX (CAA) DMS protection
When CAA detects that a node is isolated in a multiple-node environment, a DMS is triggered. This timeout occurs when the node cannot communicate with other nodes within the delay that is specified by the node_timeout cluster tunable. The system crashes with a Deadman timer triggered error log entry if the deadman_mode cluster tunable (clctrl -tune) is set to a (assert mode, which is the default), or only logs an event if deadman_mode is set to e (event mode).
This can occur on the node that is undergoing LPM, or on both nodes in a two-node cluster. To prevent a system crash due to this timeout, it is suggested to increase node_timeout to its maximum value of 600 seconds before LPM and to restore it after LPM.
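The following minimal sketch shows how these CAA tunables can be adjusted directly with clctrl on a node where this step is not automated. The 600-second value matches the recommendation above; switching deadman_mode to event mode is shown only as an alternative assumption, not as the documented procedure.
# Before LPM: raise the CAA node timeout to its maximum value (600 seconds)
clctrl -tune -o node_timeout=600000
# Alternative (assumption): log an event instead of asserting when the DMS trips
clctrl -tune -o deadman_mode=e
# After LPM: restore the values used in this cluster
clctrl -tune -o node_timeout=30000
clctrl -tune -o deadman_mode=a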
 
Note: This operation is done manually with a PowerHA SystemMirror 7.1 node. 8.3, “Example: LPM scenario for PowerHA node with version 7.1” on page 291 introduces the operation. This operation is done automatically with a PowerHA SystemMirror 7.2 node, as described in 8.4, “New panel to support LPM in PowerHA 7.2” on page 308.
Group Services DMS
Group Services is a critical component that provides cluster-wide membership and group management. This daemon’s health is monitored continuously. If the process exits or becomes inactive for a long period of time, the node is brought down.
RSCT RMC, ConfigRMC, clstrmgr, and IBM.StorageRM daemons
Group Services monitors the health of these daemons. If they are inactive for a long time or exit, then the node is brought down.
 
Note: The Group Services (cthags) DMS timeout, at the time this publication was written, is 30 seconds. For now, it is hardcoded and cannot be changed.
Therefore, if the LPM freeze time is longer than the Group Services DMS timeout, Group Services (cthags) reacts and halts the node.
Because this timeout cannot be increased, you must disable RSCT critical process monitoring before LPM and enable it again after LPM with the following commands:
 – Disable RSCT critical process monitoring
To disable the RSCT monitoring process, use the following commands:
/usr/sbin/rsct/bin/hags_disable_client_kill -s cthags
/usr/sbin/rsct/bin/dms/stopdms -s cthags
 – Enable RSCT critical process monitoring
To enable the RSCT monitoring process, use the following commands:
/usr/sbin/rsct/bin/dms/startdms -s cthags
/usr/sbin/rsct/bin/hags_enable_client_kill -s cthags
 
Note: This operation is done manually in a PowerHA SystemMirror 7.1 node, as described in 8.3, “Example: LPM scenario for PowerHA node with version 7.1” on page 291. This operation is done automatically in a PowerHA SystemMirror 7.2 node, as described in 8.4, “New panel to support LPM in PowerHA 7.2” on page 308.
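To confirm the current state of the RSCT Dead Man Switch before and after toggling it, you can query the cthags subsystem directly. This is a minimal check; the expected output matches the listdms and lssrc examples later in this chapter.
/usr/sbin/rsct/bin/dms/listdms -s cthags       # reports "Dead Man Switch Enabled" or "Disabled"
lssrc -ls cthags | grep -i "Critical clients"  # reports whether critical clients are terminated if unresponsive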
8.1.1 Prerequisites for PowerHA node support of LPM
This section describes the prerequisites for PowerHA node support for LPM.
8.1.2 Reduce LPM freeze time as much as possible
To reduce the freeze time during an LPM operation, it is suggested to use 10 Gb network adapters and a dedicated network with enough available bandwidth, and to reduce memory activity during the LPM operation.
8.1.3 PowerHA fix requirement
For PowerHA SystemMirror version 7.1 to support changing the CAA node_timeout variable online through the PowerHA clmgr command, the following APARs are required:
PowerHA SystemMirror Version 7.1.2 - IV79502 (in SP8)
PowerHA SystemMirror Version 7.1.3 - IV79497 (in SP5)
Without these APARs, or with PowerHA version 7.1.1, changing the CAA node_timeout variable requires four steps. See “Increase the CAA node_timeout” on page 298 for more information.
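Before relying on the single-step clmgr -f method, you can confirm that the required APAR is installed. The following is a minimal sketch that uses the standard AIX instfix and lslpp commands; the fileset shown is only an example of what to check.
instfix -ik IV79497                 # PowerHA 7.1.3 nodes (use IV79502 on 7.1.2 nodes)
lslpp -l cluster.es.server.rte      # confirm the installed PowerHA SystemMirror level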
8.2 Operation flow to support LPM on PowerHA node
The operation flow includes pre-migration and post-migration stages.
If the PowerHA version is earlier than 7.2, you must perform the operations manually. If the PowerHA version is 7.2 or later, PowerHA performs the operations automatically.
This section introduces the pre-migration and post-migration operation flows during LPM.
8.2.1 Pre-migration operation flow
Figure 8-1 describes the operation flow in a pre-migration stage.
Figure 8-1 Pre-migration operation flow
Table 8-1 shows the detailed information for each step in the pre-migration stage.
Table 8-1 Description of the pre-migration operation flow
Step 1: Check if HyperSwap is used. If YES, go to step 2; otherwise, go to step 1.1.
Step 1.1: Check if LPM_POLICY=unmanage is set. If YES, go to step 2; otherwise, go to step 4:
clodmget -n -f lpm_policy HACMPcluster
Step 2: Change the node to unmanage resource group status:
clmgr stop node <node_name> WHEN=now MANAGE=unmanage
Step 3: Add an entry in the /etc/inittab file, which is useful in case of a node crash before restoring the managed state:
mkitab hacmp_lpm:2:once:/usr/es/sbin/cluster/utilities/cl_dr undopremigrate > /dev/null 2>&1
Step 4: Check if RSCT DMS critical resource monitoring is enabled:
/usr/sbin/rsct/bin/dms/listdms -s cthags | grep -qw Enabled
Step 5: Disable RSCT DMS critical resource monitoring:
/usr/sbin/rsct/bin/hags_disable_client_kill -s cthags
/usr/sbin/rsct/bin/dms/stopdms -s cthags
Step 6: Check if the current node_timeout value is equal to the value that you set:
clodmget -n -f lpm_node_timeout HACMPcluster
clctrl -tune -x node_timeout
Step 7: Change the CAA node_timeout value:
clmgr -f modify cluster HEARTBEAT_FREQUENCY="600"
Step 8: If SAN-based heartbeating is enabled, then disable this function:
echo 'sfwcom' >> /etc/cluster/ifrestrict
clusterconf
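For a PowerHA 7.1 node, where these steps are not automated, the pre-migration flow in Table 8-1 can be strung together in a small script. The following is a minimal sketch, assuming a two-node cluster, that the node name is passed as the first argument, and that the timeout and SAN heartbeating settings match the examples in this chapter; adapt and test it before use.
#!/bin/ksh
# Minimal pre-migration sketch for a PowerHA 7.1 node (see Table 8-1)
NODE=${1:?usage: pre_lpm.sh <node_name>}

# Step 2: put the resource groups into the unmanaged state
clmgr stop node ${NODE} WHEN=now MANAGE=unmanage

# Step 3: temporary inittab entry in case the node crashes before the state is restored
mkitab "hacmp_lpm:2:once:/usr/es/sbin/cluster/utilities/cl_dr undopremigrate > /dev/null 2>&1"

# Steps 4 and 5: disable RSCT critical resource monitoring (run on every cluster node)
if /usr/sbin/rsct/bin/dms/listdms -s cthags | grep -qw Enabled; then
    /usr/sbin/rsct/bin/hags_disable_client_kill -s cthags
    /usr/sbin/rsct/bin/dms/stopdms -s cthags
fi

# Steps 6 and 7: raise the CAA node_timeout to 600 seconds (cluster-wide, run once)
clmgr -f modify cluster HEARTBEAT_FREQUENCY="600"

# Step 8: disable SAN-based heartbeating, if it is configured (run on every cluster node)
echo 'sfwcom' >> /etc/cluster/ifrestrict
clusterconf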
8.2.2 Post-migration operation flow
Figure 8-2 describes the operation flow in the post-migration stage.
Figure 8-2 Post-migration operation flow
Table 8-2 shows the detailed information for each step in the post-migration stage.
Table 8-2 Description of post-migration operation flow
Step 1: Check if the current resource group status is unmanaged. If YES, go to step 2; otherwise, go to step 4.
Step 2: Change the node back to manage resource group status:
clmgr start node <node_name> WHEN=now MANAGE=auto
Step 3: Remove the entry from the /etc/inittab file that was added in the pre-migration process:
rmitab hacmp_lpm
Step 4: Check if the RSCT DMS critical resource monitoring function was enabled before the LPM operation.
Step 5: Enable RSCT DMS critical resource monitoring:
/usr/sbin/rsct/bin/dms/startdms -s cthags
/usr/sbin/rsct/bin/hags_enable_client_kill -s cthags
Step 6: Check if the current node_timeout value is equal to the value that you set before:
clctrl -tune -x node_timeout
clodmget -n -f lpm_node_timeout HACMPcluster
Step 7: Restore the CAA node_timeout value:
clmgr -f modify cluster HEARTBEAT_FREQUENCY="30"
Step 8: If SAN-based heartbeating is enabled, then enable this function:
rm -f /etc/cluster/ifrestrict
clusterconf
rmdev -l sfwcomm*
mkdev -l sfwcomm*
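The corresponding post-migration flow in Table 8-2 reverses those changes. This is a minimal sketch under the same assumptions as the pre-migration script (two-node cluster, node name as the first argument, and a 30-second default node_timeout).
#!/bin/ksh
# Minimal post-migration sketch for a PowerHA 7.1 node (see Table 8-2)
NODE=${1:?usage: post_lpm.sh <node_name>}

# Steps 2 and 3: bring the resource groups back under PowerHA management
clmgr start node ${NODE} WHEN=now MANAGE=auto
rmitab hacmp_lpm

# Steps 4 and 5: re-enable RSCT critical resource monitoring (run on every cluster node)
/usr/sbin/rsct/bin/dms/startdms -s cthags
/usr/sbin/rsct/bin/hags_enable_client_kill -s cthags

# Steps 6 and 7: restore the CAA node_timeout (cluster-wide, run once)
clmgr -f modify cluster HEARTBEAT_FREQUENCY="30"

# Step 8: re-enable SAN-based heartbeating, if it was disabled before LPM
rm -f /etc/cluster/ifrestrict
clusterconf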
8.3 Example: LPM scenario for PowerHA node with version 7.1
This section introduces detailed operations for performing LPM for one node with PowerHA SystemMirror version 7.1.
8.3.1 Topology introduction
Figure 8-3 describes the topology of the testing environment.
Figure 8-3 Testing environment topology
There are two Power Systems 780 servers. The first server is P780_09 (machine serial number 060C0AT), and the second server is P780_10 (machine serial number 061949T). The following list provides additional details about the testing environment:
Each server has one VIOS partition and one AIX partition.
The P780_09 server has VIOSA and AIX720_LPM1 partitions.
The P780_10 server has VIOSB and AIX720_LPM2 partitions.
There is one storage subsystem that can be accessed by the two VIO servers.
The two AIX partitions access the storage through the NPIV protocol.
The heartbeating method includes IP, SAN, and dpcom.
The AIX version is AIX 7.2 SP1.
The PowerHA SystemMirror version is 7.1.3 SP4.
8.3.2 Initial status
This section describes the initial cluster status.
PowerHA and AIX version
Example 8-1 shows the PowerHA and the AIX version information.
Example 8-1 PowerHA and AIX version information
AIX720_LPM1:/usr/es/sbin/cluster # clhaver
Node AIX720_LPM2 has HACMP version 7134 installed
Node AIX720_LPM1 has HACMP version 7134 installed
 
AIX720_LPM1:/usr/es/sbin/cluster # clcmd oslevel -s
-------------------------------
NODE AIX720_LPM2
-------------------------------
7200-00-01-1543
 
-------------------------------
NODE AIX720_LPM1
-------------------------------
7200-00-01-1543
PowerHA configuration
Table 8-3 shows the cluster’s configuration.
Table 8-3 Cluster’s configuration
Cluster name: LPMCluster (cluster type: NSC (No Site Cluster))
Network interfaces:
 – AIX720_LPM1: en1: 172.16.50.21, netmask 255.255.255.0, gateway 172.16.50.1
 – AIX720_LPM2: en0: 172.16.50.22, netmask 255.255.255.0, gateway 172.16.50.1
Network: net_ether_01 (172.16.50.0/24)
CAA: Unicast, primary disk: hdisk1
Shared VG: testVG (hdisk2)
Service IP: 172.16.50.23 (AIX720_LPM_Service)
Resource group: testRG, which includes testVG and AIX720_LPM_Service
 – The node order is: AIX720_LPM1, AIX720_LPM2
 – Startup Policy: Online On Home Node Only
 – Fallover Policy: Fallover To Next Priority Node In The List
 – Fallback Policy: Never Fallback
PowerHA and Resource Group status
Example 8-2 shows the current status of PowerHA and the Resource Group.
Example 8-2 PowerHA and Resource Group status
AIX720_LPM1:/ # clcmd -n LPMCluster lssrc -ls clstrmgrES|egrep "NODE|state"|grep -v "Last"
NODE AIX720_LPM2
Current state: ST_STABLE
NODE AIX720_LPM1
Current state: ST_STABLE
 
 
AIX720_LPM1:/ # clcmd -n LPMCluster clRGinfo
-------------------------------
NODE AIX720_LPM2
-------------------------------
-----------------------------------------------------------------------------
Group Name State Node
-----------------------------------------------------------------------------
testRG ONLINE AIX720_LPM1
OFFLINE AIX720_LPM2
 
-------------------------------
NODE AIX720_LPM1
-------------------------------
-----------------------------------------------------------------------------
Group Name State Node
-----------------------------------------------------------------------------
testRG ONLINE AIX720_LPM1
OFFLINE AIX720_LPM2
CAA heartbeating status
Example 8-3 shows the current CAA heartbeating status and node_timeout parameter.
Example 8-3 CAA heartbeating status and value of node_timeout parameter
AIX720_LPM1:/ # clcmd lscluster -m
 
-------------------------------
NODE AIX720_LPM2
-------------------------------
Calling node query for all nodes...
Node query number of nodes examined: 2
 
Node name: AIX720_LPM1
Cluster shorthand id for node: 1
UUID for node: 112552f0-c4b7-11e5-8014-56c6a3855d04
State of node: UP
Smoothed rtt to node: 7
Mean Deviation in network rtt to node: 3
Number of clusters node is a member in: 1
CLUSTER NAME SHID UUID
LPMCluster 0 11403f34-c4b7-11e5-8014-56c6a3855d04
SITE NAME SHID UUID
LOCAL 1 51735173-5173-5173-5173-517351735173
 
Points of contact for node: 2
-----------------------------------------------------------------------
Interface State Protocol Status SRC_IP->DST_IP
-----------------------------------------------------------------------
sfwcom UP none none none
tcpsock->01 UP IPv4 none 172.16.50.22->172.16.50.21
 
----------------------------------------------------------------------------
 
Node name: AIX720_LPM2
Cluster shorthand id for node: 2
UUID for node: 11255336-c4b7-11e5-8014-56c6a3855d04
State of node: UP NODE_LOCAL
Smoothed rtt to node: 0
Mean Deviation in network rtt to node: 0
Number of clusters node is a member in: 1
CLUSTER NAME SHID UUID
LPMCluster 0 11403f34-c4b7-11e5-8014-56c6a3855d04
SITE NAME SHID UUID
LOCAL 1 51735173-5173-5173-5173-517351735173
 
Points of contact for node: 0
 
-------------------------------
NODE AIX720_LPM1
-------------------------------
Calling node query for all nodes...
Node query number of nodes examined: 2
 
Node name: AIX720_LPM1
Cluster shorthand id for node: 1
UUID for node: 112552f0-c4b7-11e5-8014-56c6a3855d04
State of node: UP NODE_LOCAL
Smoothed rtt to node: 0
Mean Deviation in network rtt to node: 0
Number of clusters node is a member in: 1
CLUSTER NAME SHID UUID
LPMCluster 0 11403f34-c4b7-11e5-8014-56c6a3855d04
SITE NAME SHID UUID
LOCAL 1 51735173-5173-5173-5173-517351735173
 
Points of contact for node: 0
 
----------------------------------------------------------------------------
 
Node name: AIX720_LPM2
Cluster shorthand id for node: 2
UUID for node: 11255336-c4b7-11e5-8014-56c6a3855d04
State of node: UP
Smoothed rtt to node: 17
Mean Deviation in network rtt to node: 13
Number of clusters node is a member in: 1
CLUSTER NAME SHID UUID
LPMCluster 0 11403f34-c4b7-11e5-8014-56c6a3855d04
SITE NAME SHID UUID
LOCAL 1 51735173-5173-5173-5173-517351735173
 
Points of contact for node: 2
-----------------------------------------------------------------------
Interface State Protocol Status SRC_IP->DST_IP
-----------------------------------------------------------------------
sfwcom UP none none none
tcpsock->02 UP IPv4 none 172.16.50.21->172.16.50.22
 
AIX720_LPM2:/ # clctrl -tune -L
NAME DEF MIN MAX UNIT SCOPE
...
node_timeout 20000 10000 600000 milliseconds c n
LPMCluster(11403f34-c4b7-11e5-8014-56c6a3855d04) 30000
...
--> Current node_timeout is 30s
RSCT cthags status
Example 8-4 shows the current RSCT cthags service’s status.
Example 8-4 RSCT cthags service’s status
AIX720_LPM1:/ # lssrc -ls cthags
Subsystem Group PID Status
cthags cthags 13173166 active
5 locally-connected clients. Their PIDs:
9175342(IBM.ConfigRMd) 6619600(rmcd) 14549496(IBM.StorageRMd) 7995658(clstrmgr) 10355040(gsclvmd)
HA Group Services domain information:
Domain established by node 1
Number of groups known locally: 8
Number of Number of local
Group name providers providers/subscribers
rmc_peers 2 1 0
s00V0CKI0009G000001A9UHPVQ4 2 1 0
IBM.ConfigRM 2 1 0
IBM.StorageRM.v1 2 1 0
CLRESMGRD_1495882547 2 1 0
CLRESMGRDNPD_1495882547 2 1 0
CLSTRMGR_1495882547 2 1 0
d00V0CKI0009G000001A9UHPVQ4 2 1 0
Critical clients will be terminated if unresponsive
 
Dead Man Switch Enabled
 
AIX720_LPM1:/usr/sbin/rsct/bin/dms # ./listdms -s cthags
Dead Man Switch Enabled:
reset interval = 3 seconds
trip interval = 30 seconds
LPAR and server location information
Example 8-5 shows the current LPAR’s location information.
Example 8-5 LPAR and server location information
AIX720_LPM1:/ # prtconf
System Model: IBM,9179-MHD
Machine Serial Number: 060C0AT --> this server is P780_09
 
AIX720_LPM2:/ # prtconf
System Model: IBM,9179-MHD
Machine Serial Number: 061949T --> this server is P780_10
8.3.3 Manual operation before LPM
Before performing the LPM operation, several manual operations are required.
Change the PowerHA service to unmanage Resource Group status
There are two methods to change the PowerHA service to Unmanage Resource Group status. The first method is through the SMIT menu, as shown in Example 8-6.
Start smit clstop.
Example 8-6 Change the cluster service to unmanage Resource Groups through the SMIT menu
                      Stop Cluster Services
 
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
 
[Entry Fields]
* Stop now, on system restart or both now
Stop Cluster Services on these nodes [AIX720_LPM1]
BROADCAST cluster shutdown? true
* Select an Action on Resource Groups Unmanage Resource Groups
The second method is through the clmgr command, as shown in Example 8-7.
Example 8-7 Change cluster service to unmanage Resource Group through the clmgr command
AIX720_LPM1:/ # clmgr stop node AIX720_LPM1 WHEN=now MANAGE=unmanage
Broadcast message from root@AIX720_LPM1 (tty) at 23:52:44 ...
PowerHA SystemMirror on AIX720_LPM1 shutting down. Please exit any cluster applications...
AIX720_LPM1: 0513-044 The clevmgrdES Subsystem was requested to stop.
.
"AIX720_LPM1" is now unmanaged.
AIX720_LPM1: Jan 26 2016 23:52:43 /usr/es/sbin/cluster/utilities/clstop: called with flags -N -f
 
AIX720_LPM1:/ # clcmd -n LPMCluster clRGinfo
-------------------------------
NODE AIX720_LPM2
-------------------------------
-----------------------------------------------------------------------------
Group Name State Node
-----------------------------------------------------------------------------
testRG UNMANAGED AIX720_LPM1
UNMANAGED AIX720_LPM2
 
-------------------------------
NODE AIX720_LPM1
-------------------------------
-----------------------------------------------------------------------------
Group Name State Node
-----------------------------------------------------------------------------
testRG UNMANAGED AIX720_LPM1
UNMANAGED AIX720_LPM2
Disable RSCT cthags critical resource monitoring function
Example 8-8 shows how to disable the RSCT cthags critical resource monitoring function to prevent a DMS trigger if the LPM freeze time is longer than its timeout.
 
Note: In this case, there are only two nodes in this cluster, so you need to disable this function on both nodes. Only one node is shown in the example, but the command is run on both nodes.
Example 8-8 Disable RSCT cthags critical resource monitoring function
AIX720_LPM1:/ # /usr/sbin/rsct/bin/hags_disable_client_kill -s cthags
AIX720_LPM1:/ # /usr/sbin/rsct/bin/dms/stopdms -s cthags
 
Dead Man Switch Disabled
DMS Re-arming Thread cancelled
 
AIX720_LPM1:/ # lssrc -ls cthags
Subsystem Group PID Status
cthags cthags 13173166 active
5 locally-connected clients. Their PIDs:
9175342(IBM.ConfigRMd) 6619600(rmcd) 14549496(IBM.StorageRMd) 19792370(clstrmgr) 19268008(gsclvmd)
HA Group Services domain information:
Domain established by node 1
Number of groups known locally: 8
Number of Number of local
Group name providers providers/subscribers
rmc_peers 2 1 0
s00V0CKI0009G000001A9UHPVQ4 2 1 0
IBM.ConfigRM 2 1 0
IBM.StorageRM.v1 2 1 0
CLRESMGRD_1495882547 2 1 0
CLRESMGRDNPD_1495882547 2 1 0
CLSTRMGR_1495882547 2 1 0
d00V0CKI0009G000001A9UHPVQ4 2 1 0
 
Critical clients will not be terminated even if unresponsive
 
Dead Man Switch Disabled
 
AIX720_LPM1:/usr/sbin/rsct/bin/dms # ./listdms -s cthags
 
Dead Man Switch Disabled
Increase the CAA node_timeout
Example 8-9 shows how to increase the CAA node_timeout to prevent a CAA DMS trigger if the LPM freeze time is longer than its timeout. You need to run this command on only one node, because it is cluster aware.
Example 8-9 Increase the CAA node_timeout
AIX720_LPM1:/ # clmgr -f modify cluster HEARTBEAT_FREQUENCY="600"
1 tunable updated on cluster LPMCluster.
 
AIX720_LPM1:/ # clctrl -tune -L
NAME DEF MIN MAX UNIT SCOPE
ENTITY_NAME(UUID) CUR
...
node_timeout 20000 10000 600000 milliseconds c n
LPMCluster(11403f34-c4b7-11e5-8014-56c6a3855d04) 600000
 
Note: With the previous configuration, if the LPM freeze time is longer than 600 seconds, the CAA DMS is still triggered because the CAA deadman_mode parameter is set to a (assert). The node crashes and its resource group is moved to another node.
Note: The -f option of the clmgr command means that the HACMPcluster ODM is not updated; the CAA variable (node_timeout) is updated directly with the clctrl command. This function is included with the following interim fixes:
PowerHA SystemMirror Version 7.1.2 - IV79502 (SP8)
PowerHA SystemMirror Version 7.1.3 - IV79497 (SP5)
If you do not apply one of these interim fixes, then you must perform four steps to increase the CAA node_timeout variable (Example 8-10):
Change the PowerHA service to online status (because cluster sync needs this status)
Change the HACMPcluster ODM
Perform cluster verification and synchronization
Change the PowerHA service to unmanage resource group status
Example 8-10 Detailed steps to change CAA node_timeout variable without PowerHA interim fix
--> Step 1
AIX720_LPM1:/ # clmgr start node AIX720_LPM1 WHEN=now MANAGE=auto
Adding any necessary PowerHA SystemMirror entries to /etc/inittab and /etc/rc.net for IPAT on node AIX720_LPM1.
AIX720_LPM1: start_cluster: Starting PowerHA SystemMirror
...
"AIX720_LPM1" is now online.
Starting Cluster Services on node: AIX720_LPM1
This may take a few minutes. Please wait...
AIX720_LPM1: Jan 27 2016 06:17:04 Starting execution of /usr/es/sbin/cluster/etc/rc.cluster
AIX720_LPM1: with parameters: -boot -N -A -b -P cl_rc_cluster
AIX720_LPM1:
AIX720_LPM1: Jan 27 2016 06:17:04 Checking for srcmstr active...
AIX720_LPM1: Jan 27 2016 06:17:04 complete.
 
--> Step 2
AIX720_LPM1:/ # clmgr modify cluster HEARTBEAT_FREQUENCY="600"
 
--> Step 3
AIX720_LPM1:/ # clmgr sync cluster
Verifying additional pre-requisites for Dynamic Reconfiguration...
...completed.
 
Committing any changes, as required, to all available nodes...
Adding any necessary PowerHA SystemMirror entries to /etc/inittab and /etc/rc.net for IPAT on node AIX720_LPM1.
Checking for added nodes
Updating Split Merge Policies
1 tunable updated on cluster LPMCluster.
Adding any necessary PowerHA SystemMirror entries to /etc/inittab and /etc/rc.net for IPAT on node AIX720_LPM2.
 
Verification has completed normally.
 
--> Step 4
AIX720_LPM1:/ # clmgr stop node AIX720_LPM1 WHEN=now MANAGE=unmanage
Broadcast message from root@AIX720_LPM1 (tty) at 06:15:02 ...
PowerHA SystemMirror on AIX720_LPM1 shutting down. Please exit any cluster applications...
AIX720_LPM1: 0513-044 The clevmgrdES Subsystem was requested to stop.
.
"AIX720_LPM1" is now unmanaged.
 
--> Check the result
AIX720_LPM1:/ # clctrl -tune -L
NAME DEF MIN MAX UNIT SCOPE
ENTITY_NAME(UUID) CUR
...
node_timeout 20000 10000 600000 milliseconds c n
LPMCluster(11403f34-c4b7-11e5-8014-56c6a3855d04) 600000
 
Note: When you stop the cluster with the unmanage option and then start it with the auto option, PowerHA tries to bring the resource group online, which does not cause any problem with the VGs, file systems, and IPs. However, it runs the application controller one more time. If the application controller does not include the appropriate checks before running its commands, this can cause problems with the application. Therefore, the application controller start script should check whether the application is already online before starting it.
Disable SAN heartbeating function
 
Note: In our scenario, SAN-based heartbeating has been configured, so this step is required. You do not need to do this step if SAN-based heartbeating is not configured.
Example 8-11 shows how to disable the SAN heartbeating function.
Example 8-11 Disable SAN heartbeating function
AIX720_LPM1:/ # echo "sfwcom" >> /etc/cluster/ifrestrict
AIX720_LPM1:/ # clusterconf
 
AIX720_LPM2:/ # echo "sfwcom" >> /etc/cluster/ifrestrict
AIX720_LPM2:/ # clusterconf
 
AIX720_LPM1:/ # clcmd lscluster -m
 
-------------------------------
NODE AIX720_LPM2
-------------------------------
Calling node query for all nodes...
Node query number of nodes examined: 2
 
Node name: AIX720_LPM1
Cluster shorthand id for node: 1
UUID for node: 112552f0-c4b7-11e5-8014-56c6a3855d04
State of node: UP
Smoothed rtt to node: 7
Mean Deviation in network rtt to node: 3
Number of clusters node is a member in: 1
CLUSTER NAME SHID UUID
LPMCluster 0 11403f34-c4b7-11e5-8014-56c6a3855d04
SITE NAME SHID UUID
LOCAL 1 51735173-5173-5173-5173-517351735173
 
Points of contact for node: 1
-----------------------------------------------------------------------
Interface State Protocol Status SRC_IP->DST_IP
-----------------------------------------------------------------------
tcpsock->01 UP IPv4 none 172.16.50.22->172.16.50.21
----------------------------------------------------------------------------
 
Node name: AIX720_LPM2
Cluster shorthand id for node: 2
UUID for node: 11255336-c4b7-11e5-8014-56c6a3855d04
State of node: UP NODE_LOCAL
Smoothed rtt to node: 0
Mean Deviation in network rtt to node: 0
Number of clusters node is a member in: 1
CLUSTER NAME SHID UUID
LPMCluster 0 11403f34-c4b7-11e5-8014-56c6a3855d04
SITE NAME SHID UUID
LOCAL 1 51735173-5173-5173-5173-517351735173
 
Points of contact for node: 0
 
-------------------------------
NODE AIX720_LPM1
-------------------------------
Calling node query for all nodes...
Node query number of nodes examined: 2
 
Node name: AIX720_LPM1
Cluster shorthand id for node: 1
UUID for node: 112552f0-c4b7-11e5-8014-56c6a3855d04
State of node: UP NODE_LOCAL
Smoothed rtt to node: 0
Mean Deviation in network rtt to node: 0
Number of clusters node is a member in: 1
CLUSTER NAME SHID UUID
LPMCluster 0 11403f34-c4b7-11e5-8014-56c6a3855d04
SITE NAME SHID UUID
LOCAL 1 51735173-5173-5173-5173-517351735173
 
Points of contact for node: 0
 
----------------------------------------------------------------------------
 
Node name: AIX720_LPM2
Cluster shorthand id for node: 2
UUID for node: 11255336-c4b7-11e5-8014-56c6a3855d04
State of node: UP
Smoothed rtt to node: 18
Mean Deviation in network rtt to node: 14
Number of clusters node is a member in: 1
CLUSTER NAME SHID UUID
LPMCluster 0 11403f34-c4b7-11e5-8014-56c6a3855d04
SITE NAME SHID UUID
LOCAL 1 51735173-5173-5173-5173-517351735173
 
Points of contact for node: 1
-----------------------------------------------------------------------
Interface State Protocol Status SRC_IP->DST_IP
-----------------------------------------------------------------------
tcpsock->02 UP IPv4 none 172.16.50.21->172.16.50.22
 
AIX720_LPM1:/ # lscluster -i
Network/Storage Interface Query
 
Cluster Name: LPMCluster
Cluster UUID: 11403f34-c4b7-11e5-8014-56c6a3855d04
Number of nodes reporting = 2
Number of nodes stale = 0
Number of nodes expected = 2
 
Node AIX720_LPM1
Node UUID = 112552f0-c4b7-11e5-8014-56c6a3855d04
Number of interfaces discovered = 3
Interface number 1, en1
IFNET type = 6 (IFT_ETHER)
NDD type = 7 (NDD_ISO88023)
MAC address length = 6
MAC address = FA:97:6D:97:2A:20
Smoothed RTT across interface = 0
Mean deviation in network RTT across interface = 0
Probe interval for interface = 990 ms
IFNET flags for interface = 0x1E084863
NDD flags for interface = 0x0021081B
Interface state = UP
Number of regular addresses configured on interface = 2
IPv4 ADDRESS: 172.16.50.21 broadcast 172.16.50.255 netmask 255.255.255.0
IPv4 ADDRESS: 172.16.50.23 broadcast 172.16.50.255 netmask 255.255.255.0
Number of cluster multicast addresses configured on interface = 1
IPv4 MULTICAST ADDRESS: 228.16.50.21
Interface number 2, sfwcom
IFNET type = 0 (none)
NDD type = 304 (NDD_SANCOMM)
Smoothed RTT across interface = 7
Mean deviation in network RTT across interface = 3
Probe interval for interface = 990 ms
IFNET flags for interface = 0x00000000
NDD flags for interface = 0x00000009
Interface state = DOWN RESTRICTED SOURCE HARDWARE RECEIVE SOURCE HARDWARE TRANSMIT
Interface number 3, dpcom
IFNET type = 0 (none)
NDD type = 305 (NDD_PINGCOMM)
Smoothed RTT across interface = 750
Mean deviation in network RTT across interface = 1500
Probe interval for interface = 22500 ms
IFNET flags for interface = 0x00000000
NDD flags for interface = 0x00000009
Interface state = UP RESTRICTED AIX_CONTROLLED
 
Node AIX720_LPM2
Node UUID = 11255336-c4b7-11e5-8014-56c6a3855d04
Number of interfaces discovered = 3
Interface number 1, en1
IFNET type = 6 (IFT_ETHER)
NDD type = 7 (NDD_ISO88023)
MAC address length = 6
MAC address = FA:F2:D3:29:50:20
Smoothed RTT across interface = 0
Mean deviation in network RTT across interface = 0
Probe interval for interface = 990 ms
IFNET flags for interface = 0x1E084863
NDD flags for interface = 0x0021081B
Interface state = UP
Number of regular addresses configured on interface = 1
IPv4 ADDRESS: 172.16.50.22 broadcast 172.16.50.255 netmask 255.255.255.0
Number of cluster multicast addresses configured on interface = 1
IPv4 MULTICAST ADDRESS: 228.16.50.21
Interface number 2, sfwcom
IFNET type = 0 (none)
NDD type = 304 (NDD_SANCOMM)
Smoothed RTT across interface = 7
Mean deviation in network RTT across interface = 3
Probe interval for interface = 990 ms
IFNET flags for interface = 0x00000000
NDD flags for interface = 0x00000009
Interface state = DOWN RESTRICTED SOURCE HARDWARE RECEIVE SOURCE HARDWARE TRANSMIT
Interface number 3, dpcom
IFNET type = 0 (none)
NDD type = 305 (NDD_PINGCOMM)
Smoothed RTT across interface = 750
Mean deviation in network RTT across interface = 1500
Probe interval for interface = 22500 ms
IFNET flags for interface = 0x00000000
NDD flags for interface = 0x00000009
Interface state = UP RESTRICTED AIX_CONTROLLED
8.3.4 Perform LPM
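Before running the actual migration, the operation can optionally be validated on the HMC. The following is a minimal sketch that uses the validate operation of the migrlpar command (-o v) with the managed system and partition names from this scenario; it is not part of the PowerHA procedure itself.
hscroot@hmc55:~> migrlpar -o v -m SVRP7780-09-SN060C0AT -t SVRP7780-10-SN061949T -p AIX720_LPM1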
Example 8-12 shows how to perform the LPM operation for the AIX720_LPM1 node. This operation migrates this LPAR from P780_09 to P780_10.
Example 8-12 Performing the LPM operation
hscroot@hmc55:~> time migrlpar -o m -m SVRP7780-09-SN060C0AT -t SVRP7780-10-SN061949T -p AIX720_LPM1
 
real 1m6.269s
user 0m0.001s
sys 0m0.000s
PowerHA service and resource group status
After LPM completes, Example 8-13 shows that the PowerHA services are still stable, and AIX720_LPM1 has been moved to the P780_10 server.
Example 8-13 PowerHA services stable
AIX720_LPM1:/ # clcmd -n LPMCluster lssrc -ls clstrmgrES|egrep "NODE|state"|grep -v "Last"
NODE AIX720_LPM2
Current state: ST_STABLE
NODE AIX720_LPM1
Current state: ST_STABLE
 
AIX720_LPM1:/ # prtconf
System Model: IBM,9179-MHD
Machine Serial Number: 061949T --> this server is P780_10
 
AIX720_LPM2:/ # prtconf|more
System Model: IBM,9179-MHD
Machine Serial Number: 061949T --> this server is P780_10
8.3.5 Manual operation after LPM
After LPM completes, there are several manual operations required.
Enable SAN heartbeating function
Example 8-14 shows how to enable the SAN heartbeating function.
Example 8-14 Enable SAN heartbeating function
AIX720_LPM1:/ # rm /etc/cluster/ifrestrict
AIX720_LPM1:/ # clusterconf
 
AIX720_LPM2:/ # rm /etc/cluster/ifrestrict
AIX720_LPM2:/ # clusterconf
 
AIX720_LPM1:/ # clcmd lscluster -m
 
-------------------------------
NODE AIX720_LPM2
-------------------------------
Calling node query for all nodes...
Node query number of nodes examined: 2
 
Node name: AIX720_LPM1
Cluster shorthand id for node: 1
UUID for node: 112552f0-c4b7-11e5-8014-56c6a3855d04
State of node: UP
Smoothed rtt to node: 7
Mean Deviation in network rtt to node: 3
Number of clusters node is a member in: 1
CLUSTER NAME SHID UUID
LPMCluster 0 11403f34-c4b7-11e5-8014-56c6a3855d04
SITE NAME SHID UUID
LOCAL 1 51735173-5173-5173-5173-517351735173
 
Points of contact for node: 2
-----------------------------------------------------------------------
Interface State Protocol Status SRC_IP->DST_IP
-----------------------------------------------------------------------
sfwcom UP none none none
tcpsock->01 UP IPv4 none 172.16.50.22->172.16.50.21
 
----------------------------------------------------------------------------
 
Node name: AIX720_LPM2
Cluster shorthand id for node: 2
UUID for node: 11255336-c4b7-11e5-8014-56c6a3855d04
State of node: UP NODE_LOCAL
Smoothed rtt to node: 0
Mean Deviation in network rtt to node: 0
Number of clusters node is a member in: 1
CLUSTER NAME SHID UUID
LPMCluster 0 11403f34-c4b7-11e5-8014-56c6a3855d04
SITE NAME SHID UUID
LOCAL 1 51735173-5173-5173-5173-517351735173
 
Points of contact for node: 0
 
-------------------------------
NODE AIX720_LPM1
-------------------------------
Calling node query for all nodes...
Node query number of nodes examined: 2
 
Node name: AIX720_LPM1
Cluster shorthand id for node: 1
UUID for node: 112552f0-c4b7-11e5-8014-56c6a3855d04
State of node: UP NODE_LOCAL
Smoothed rtt to node: 0
Mean Deviation in network rtt to node: 0
Number of clusters node is a member in: 1
CLUSTER NAME SHID UUID
LPMCluster 0 11403f34-c4b7-11e5-8014-56c6a3855d04
SITE NAME SHID UUID
LOCAL 1 51735173-5173-5173-5173-517351735173
 
Points of contact for node: 0
 
----------------------------------------------------------------------------
 
Node name: AIX720_LPM2
Cluster shorthand id for node: 2
UUID for node: 11255336-c4b7-11e5-8014-56c6a3855d04
State of node: UP
Smoothed rtt to node: 16
Mean Deviation in network rtt to node: 14
Number of clusters node is a member in: 1
CLUSTER NAME SHID UUID
LPMCluster 0 11403f34-c4b7-11e5-8014-56c6a3855d04
SITE NAME SHID UUID
LOCAL 1 51735173-5173-5173-5173-517351735173
 
Points of contact for node: 2
-----------------------------------------------------------------------
Interface State Protocol Status SRC_IP->DST_IP
-----------------------------------------------------------------------
sfwcom UP none none none
tcpsock->02 UP IPv4 none 172.16.50.21->172.16.50.22
 
Note: After this step, if the sfwcom interface is still not UP, check the status of the vLAN storage framework communication device. If it is in the Defined state, reconfigure it with the following commands:
AIX720_LPM1:/ # lsdev -C|grep vLAN
sfwcomm1 Defined vLAN Storage Framework Comm
AIX720_LPM1:/ # rmdev -l sfwcomm1; sleep 2; mkdev -l sfwcomm1
sfwcomm1 Defined
sfwcomm1 Available
Then you can check the sfwcom interface’s status again with the lscluster command.
Restore CAA node_timeout
Example 8-15 shows how to restore the CAA node_timeout.
 
Note: In a PowerHA cluster environment, the default value of node_timeout is 30 seconds.
Example 8-15 Restore the CAA node_timeout parameter
AIX720_LPM1:/ # clmgr -f modify cluster HEARTBEAT_FREQUENCY="30"
1 tunable updated on cluster LPMCluster.
 
AIX720_LPM1:/ # clctrl -tune -L
NAME DEF MIN MAX UNIT SCOPE
ENTITY_NAME(UUID) CUR
...
node_timeout 20000 10000 600000 milliseconds c n
LPMCluster(11403f34-c4b7-11e5-8014-56c6a3855d04) 30000
Enable RSCT cthags critical resource monitoring function
Example 8-16 shows how to enable the RSCT cthags critical resource monitoring function.
 
Note: In this case, there are only two nodes in this cluster, so you disabled the function on both nodes before LPM. Only one node is shown in this example, but the command is run on both nodes.
Example 8-16 Enable RSCT cthags resource monitoring
AIX720_LPM1:/ # /usr/sbin/rsct/bin/dms/startdms -s cthags
 
Dead Man Switch Enabled
DMS Re-arming Thread created
 
AIX720_LPM1:/ # /usr/sbin/rsct/bin/hags_enable_client_kill -s cthags
AIX720_LPM1:/ # lssrc -ls cthags
Subsystem Group PID Status
cthags cthags 13173166 active
5 locally-connected clients. Their PIDs:
9175342(IBM.ConfigRMd) 6619600(rmcd) 14549496(IBM.StorageRMd) 19792370(clstrmgr) 19268008(gsclvmd)
HA Group Services domain information:
Domain established by node 1
Number of groups known locally: 8
Number of Number of local
Group name providers providers/subscribers
rmc_peers 2 1 0
s00V0CKI0009G000001A9UHPVQ4 2 1 0
IBM.ConfigRM 2 1 0
IBM.StorageRM.v1 2 1 0
CLRESMGRD_1495882547 2 1 0
CLRESMGRDNPD_1495882547 2 1 0
CLSTRMGR_1495882547 2 1 0
d00V0CKI0009G000001A9UHPVQ4 2 1 0
 
Critical clients will be terminated if unresponsive
 
Dead Man Switch Enabled
AIX720_LPM1:/ # /usr/sbin/rsct/bin/dms/listdms -s cthags
Dead Man Switch Enabled:
reset interval = 3 seconds
trip interval = 30 seconds
Change PowerHA service back to normal status
Example 8-17 shows how to change the PowerHA service back to normal status. There are two methods to achieve this. The first method is through the SMIT menu:
Start smit clstart.
Example 8-17 Change PowerHA service back to normal status
                     Start Cluster Services
 
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
 
[Entry Fields]
* Start now, on system restart or both now
Start Cluster Services on these nodes [AIX720_LPM1]
* Manage Resource Groups Automatically
BROADCAST message at startup? true
Startup Cluster Information Daemon? false
Ignore verification errors? false
Automatically correct errors found during Interactively
cluster start?
 
The second method is through the clmgr command:
AIX720_LPM1:/ # clmgr start node AIX720_LPM1 WHEN=now MANAGE=auto
AIX720_LPM1: start_cluster: Starting PowerHA SystemMirror
...
"AIX720_LPM1" is now online.
 
 
Starting Cluster Services on node: AIX720_LPM1
This may take a few minutes. Please wait...
AIX720_LPM1: Jan 27 2016 01:04:43 Starting execution of /usr/es/sbin/cluster/etc/rc.cluster
AIX720_LPM1: with parameters: -boot -N -A -b -P cl_rc_cluster
AIX720_LPM1:
AIX720_LPM1: Jan 27 2016 01:04:43 Checking for srcmstr active...
AIX720_LPM1: Jan 27 2016 01:04:43 complete.
 
Note: When you stop the cluster with the unmanage option and then start it with the auto option, PowerHA tries to bring the resource group online, which does not cause any problem with the VGs, file systems, and IPs. However, it runs the application controller one more time. If the application controller does not include the appropriate checks before running its commands, this can cause problems with the application. Therefore, the application controller start script should check whether the application is already online before starting it.
Example 8-18 shows that the resource group’s status has been changed to normal.
Example 8-18 Resource Group’s status
AIX720_LPM1:/ # clcmd clRGinfo
 
-------------------------------
NODE AIX720_LPM2
-------------------------------
-----------------------------------------------------------------------------
Group Name State Node
-----------------------------------------------------------------------------
testRG ONLINE AIX720_LPM1
OFFLINE AIX720_LPM2
 
 
-------------------------------
NODE AIX720_LPM1
-------------------------------
-----------------------------------------------------------------------------
Group Name State Node
-----------------------------------------------------------------------------
testRG ONLINE AIX720_LPM1
OFFLINE AIX720_LPM2
8.4 New panel to support LPM in PowerHA 7.2
From version 7.2, PowerHA SystemMirror automates some of the Live Partition Mobility (LPM) steps by registering a script with the LPM framework.
PowerHA SystemMirror listens to LPM events and automates steps in PowerHA SystemMirror to handle the LPAR freeze that can occur during the LPM process. As part of the automation, PowerHA SystemMirror provides a few variables that can be changed based on the requirements for your environment.
You can change the following LPM variables in PowerHA SystemMirror that provide LPM automation:
Node Failure Detection Timeout during LPM
LPM Node Policy
Start smit sysmirror. Select Custom Cluster Configuration → Cluster Nodes and Networks → Manage the Cluster → Cluster heartbeat settings. The next panel is a menu screen with a title menu option and seven item menu options.
Its fast path is cm_chng_tunables (Figure 8-4). This menu is not new, but two items have been added to it to make LPM easier in a PowerHA environment (the last two items are new).
            Cluster heartbeat settings
 
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
[Entry Fields]
 
* Network Failure Detection Time [20]
* Node Failure Detection Timeout [30]
* Node Failure Detection Grace Period [10]
* Node Failure Detection Timeout during LPM [600]
* LPM Node Policy [manage]
Figure 8-4 Cluster heartbeat setting
Table 8-4 describes the context-sensitive help information for the cluster heartbeating setting.
Table 8-4 Context-sensitive help for the Cluster heartbeat setting
Name: Node Failure Detection Timeout during LPM
Context-sensitive help (F1): If specified, this timeout value (in seconds) is used during a Live Partition Mobility (LPM) operation instead of the Node Failure Detection Timeout value. You can use this option to increase the Node Failure Detection Timeout for the duration of the LPM operation so that it is greater than the LPM freeze duration, which avoids any risk of unwanted cluster events. The unit is seconds. For the PowerHA 7.2 GA Edition, you can enter a value of 10 - 600. For PowerHA 7.2 SP1 or later, the default is 600 and cannot be changed.

Name: LPM Node Policy
Context-sensitive help (F1): Specifies the action to be taken on the node during a Live Partition Mobility operation. If unmanage is selected, the cluster services are stopped with the Unmanage Resource Groups option for the duration of the LPM operation. Otherwise, PowerHA SystemMirror continues to monitor the resource groups and application availability. The default is manage.
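After you set these values, you can verify how they are stored in the PowerHA ODM from the command line. This minimal check uses the same clodmget queries that the PowerHA LPM scripts run, as shown in the logs in 8.5.1, “Troubleshooting”.
clodmget -n -f lpm_policy HACMPcluster         # LPM Node Policy (UNMANAGE in this scenario)
clodmget -n -f lpm_node_timeout HACMPcluster   # Node Failure Detection Timeout during LPM (600)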
8.5 PowerHA 7.2 scenario and troubleshooting
This scenario keeps the same hardware and operating system as 8.3, “Example: LPM scenario for PowerHA node with version 7.1” on page 291. This scenario replaces only the PowerHA software with the 7.2 edition.
Example 8-19 shows the PowerHA version.
Example 8-19 PowerHA version
AIX720_LPM1:/ # clhaver
Node AIX720_LPM1 has HACMP version 7200 installed
Node AIX720_LPM2 has HACMP version 7200 installed
Table 8-5 shows the variables of LPM.
Table 8-5 Cluster heartbeating setting
Node Failure Detection Timeout during LPM: 600
LPM Node Policy: unmanage
8.5.1 Troubleshooting
The PowerHA log that is related to the LPM operation is /var/hacmp/log/clutils.log. Example 8-20 and Example 8-21 on page 311 show the information in this log file, including the pre-migration and post-migration stages.
 
Note: During the operation, PowerHA SystemMirror stops the cluster with the unmanage option in the pre-migration stage and automatically starts it with the auto option in the post-migration stage. PowerHA SystemMirror tries to bring the resource group online in the post-migration stage, which does not cause any problem with the VGs, file systems, and IPs. However, it runs the application controller one more time.
If the application controller does not include the appropriate checks before running its commands, this can cause problems with the application. Therefore, the application controller start script should check whether the application is already online before starting it.
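To follow the automated pre-migration and post-migration steps on a node, filter the clutils.log file for the cl_dr and cl_2dr entries, and check the AIX error log for any Deadman timer entry. This is a minimal sketch; the entry names match the examples that follow.
grep -E "cl_dr|cl_2dr" /var/hacmp/log/clutils.log | tail -50   # most recent LPM-related PowerHA actions
errpt -a | grep -i deadman                                     # check whether a CAA DMS timeout was logged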
Example 8-20 Log file of pre-migration operation
...
--> Check whether the PowerHA service needs to be changed to 'unmanage resource group' status
Tue Jan 26 10:57:08 UTC 2016 cl_dr: clodmget -n -f lpm_policy HACMPcluster
Tue Jan 26 10:57:08 UTC 2016 cl_dr: lpm_policy='UNMANAGE'
...
Tue Jan 26 10:57:09 UTC 2016 cl_dr: Node = AIX720_LPM1, state = NORMAL
Tue Jan 26 10:57:09 UTC 2016 cl_dr: Stop cluster services
Tue Jan 26 10:57:09 UTC 2016 cl_dr: LC_ALL=C clmgr stop node AIX720_LPM1 WHEN=now MANAGE=unmanage
...
"AIX720_LPM1" is now unmanaged.
...
--> Add an entry in /etc/inittab to ensure that PowerHA returns to 'manage resource group' status if the node crashes unexpectedly
Tue Jan 26 10:57:23 UTC 2016 cl_dr: Adding a temporary entry in /etc/inittab
Tue Jan 26 10:57:23 UTC 2016 cl_dr: lsitab hacmp_lpm
Tue Jan 26 10:57:23 UTC 2016 cl_dr: mkitab hacmp_lpm:2:once:/usr/es/sbin/cluster/utilities/cl_dr undopremigrate > /dev/null 2>&1
Tue Jan 26 10:57:23 UTC 2016 cl_dr: mkitab RC: 0
...
--> Stop RSCT cthags critical resource monitoring function (for two nodes)
Tue Jan 26 10:57:30 UTC 2016 cl_dr: Stopping RSCT Dead Man Switch on node 'AIX720_LPM1'
Tue Jan 26 10:57:30 UTC 2016 cl_dr: /usr/sbin/rsct/bin/dms/stopdms -s cthags
 
Dead Man Switch Disabled
DMS Re-arming Thread cancelled
 
Tue Jan 26 10:57:30 UTC 2016 cl_dr: stopdms RC: 0
Tue Jan 26 10:57:30 UTC 2016 cl_dr: Stopping RSCT Dead Man Switch on node 'AIX720_LPM2'
Tue Jan 26 10:57:30 UTC 2016 cl_dr: cl_rsh AIX720_LPM2 "LC_ALL=C lssrc -s cthags | grep -qw active"
Tue Jan 26 10:57:31 UTC 2016 cl_dr: cl_rsh AIX720_LPM2 lssrc RC: 0
Tue Jan 26 10:57:31 UTC 2016 cl_dr: cl_rsh AIX720_LPM2 "LC_ALL=C /usr/sbin/rsct/bin/dms/listdms -s cthags | grep -qw Enabled"
Tue Jan 26 10:57:31 UTC 2016 cl_dr: cl_rsh AIX720_LPM2 listdms RC: 0
Tue Jan 26 10:57:31 UTC 2016 cl_dr: cl_rsh AIX720_LPM2 "/usr/sbin/rsct/bin/dms/stopdms -s cthags"
 
Dead Man Switch Disabled
DMS Re-arming Thread cancelled
...
--> Change the CAA node_timeout parameter to 600s
Tue Jan 26 10:57:31 UTC 2016 cl_dr: clodmget -n -f lpm_node_timeout HACMPcluster
Tue Jan 26 10:57:31 UTC 2016 cl_dr: clodmget LPM node_timeout: 600
Tue Jan 26 10:57:31 UTC 2016 cl_dr: clctrl -tune -x node_timeout
Tue Jan 26 10:57:31 UTC 2016 cl_dr: clctrl CAA node_timeout: 30000
Tue Jan 26 10:57:31 UTC 2016 cl_dr: Changing CAA node_timeout to '600000'
Tue Jan 26 10:57:31 UTC 2016 cl_dr: clctrl -tune -o node_timeout=600000
...
--> Disable CAA SAN heartbeating (for two nodes)
Tue Jan 26 10:57:32 UTC 2016 cl_dr: cl_rsh AIX720_LPM1 "LC_ALL=C echo sfwcom >> /etc/cluster/ifrestrict"
Tue Jan 26 10:57:32 UTC 2016 cl_dr: cl_rsh to node AIX720_LPM1 completed, RC: 0
Tue Jan 26 10:57:32 UTC 2016 cl_dr: clusterconf
Tue Jan 26 10:57:32 UTC 2016 cl_dr: clusterconf completed, RC: 0
...
Tue Jan 26 10:57:32 UTC 2016 cl_dr: cl_rsh AIX720_LPM2 "LC_ALL=C echo sfwcom >> /etc/cluster/ifrestrict"
Tue Jan 26 10:57:33 UTC 2016 cl_dr: cl_rsh to node AIX720_LPM2 completed, RC: 0
Tue Jan 26 10:57:33 UTC 2016 cl_dr: clusterconf
Tue Jan 26 10:57:33 UTC 2016 cl_dr: clusterconf completed, RC: 0
...
Example 8-21 shows information in the post-migration operation.
Example 8-21 Log file of post-migration operation
--> Change PowerHA service back to normal status
Tue Jan 26 10:57:52 UTC 2016 cl_2dr: POST_MIGRATE entered
Tue Jan 26 10:57:52 UTC 2016 cl_2dr: clodmget -n -f lpm_policy HACMPcluster
Tue Jan 26 10:57:52 UTC 2016 cl_2dr: lpm_policy='UNMANAGE'
Tue Jan 26 10:57:52 UTC 2016 cl_2dr: grep -w node_state /var/hacmp/cl_dr.state | cut -d'=' -f2
Tue Jan 26 10:57:52 UTC 2016 cl_2dr: Previous state = NORMAL
Tue Jan 26 10:57:52 UTC 2016 cl_2dr: Restarting cluster services
Tue Jan 26 10:57:52 UTC 2016 cl_2dr: LC_ALL=C clmgr start node AIX720_LPM1 WHEN=now MANAGE=auto
AIX720_LPM1: start_cluster: Starting PowerHA SystemMirror
...
"AIX720_LPM1" is now online.
...
--> Remove the entry from /etc/inittab that was added in the pre-migration operation
Tue Jan 26 11:00:27 UTC 2016 cl_2dr: lsitab hacmp_lpm
Tue Jan 26 11:00:27 UTC 2016 cl_2dr: Removing the temporary entry from /etc/inittab
Tue Jan 26 11:00:27 UTC 2016 cl_2dr: rmitab hacmp_lpm
...
--> Enable RSCT cthags critical resource monitoring function (for two nodes)
Tue Jan 26 10:58:21 UTC 2016 cl_2dr: LC_ALL=C lssrc -s cthags | grep -qw active
Tue Jan 26 10:58:21 UTC 2016 cl_2dr: lssrc RC: 0
Tue Jan 26 10:58:21 UTC 2016 cl_2dr: grep -w RSCT_local_DMS_state /var/hacmp/cl_dr.state | cut -d'=' -f2
Tue Jan 26 10:58:22 UTC 2016 cl_2dr: previous RSCT DMS state = Enabled
Tue Jan 26 10:58:22 UTC 2016 cl_2dr: Restarting RSCT Dead Man Switch on node 'AIX720_LPM1'
Tue Jan 26 10:58:22 UTC 2016 cl_2dr: /usr/sbin/rsct/bin/dms/startdms -s cthags
 
Dead Man Switch Enabled
DMS Re-arming Thread created
Tue Jan 26 10:58:22 UTC 2016 cl_2dr: startdms RC: 0
Tue Jan 26 10:58:22 UTC 2016 cl_2dr: cl_rsh AIX720_LPM2 lssrc RC: 0
Tue Jan 26 10:58:22 UTC 2016 cl_2dr: grep -w RSCT_peer_DMS_state /var/hacmp/cl_dr.state | cut -d'=' -f2
Tue Jan 26 10:58:22 UTC 2016 cl_2dr: previous RSCT Dead Man Switch on node 'AIX720_LPM2' = Enabled
Tue Jan 26 10:58:22 UTC 2016 cl_2dr: Restarting RSCT Dead Man Switch on node 'AIX720_LPM2'
Tue Jan 26 10:58:22 UTC 2016 cl_2dr: cl_rsh AIX720_LPM2 "/usr/sbin/rsct/bin/dms/startdms -s cthags"
 
Dead Man Switch Enabled
DMS Re-arming Thread created
...
--> Restore CAA node_timeout value
Tue Jan 26 10:58:22 UTC 2016 cl_2dr: previous CAA node timeout = 30000
Tue Jan 26 10:58:22 UTC 2016 cl_2dr: Restoring CAA node_timeout to '30000'
Tue Jan 26 10:58:22 UTC 2016 cl_2dr: clctrl -tune -o node_timeout=30000
smcaactrl:0:[182](0.009): Running smcaactrl at Tue Jan 26 10:58:22 UTC 2016 with the following parameters:
-O MOD_TUNE -P CHECK -T 2 -c 7ae36082-c418-11e5-8039-fa976d972a20 -t 7ae36082-c418-11e5-8039-fa976d972a20,LPMCluster,0 -i -v node_timeout,600000
...
--> Enable SAN heartbeating (for two nodes)
Tue Jan 26 11:00:26 UTC 2016 cl_2dr: cl_rsh AIX720_LPM1 "if [ -s /var/hacmp/ifrestrict ]; then mv /var/hacmp/ifrestrict /etc/cluster/ifrestrict; else rm -f /etc/cluster/ifrestrict
; fi"
Tue Jan 26 11:00:26 UTC 2016 cl_2dr: cl_rsh to node AIX720_LPM1 completed, RC: 0
Tue Jan 26 11:00:26 UTC 2016 cl_2dr: cl_rsh AIX720_LPM2 "if [ -s /var/hacmp/ifrestrict ]; then mv /var/hacmp/ifrestrict /etc/cluster/ifrestrict; else rm -f /etc/cluster/ifrestrict
; fi"
Tue Jan 26 11:00:26 UTC 2016 cl_2dr: cl_rsh to node AIX720_LPM2 completed, RC: 0
Tue Jan 26 11:00:26 UTC 2016 cl_2dr: clusterconf
Tue Jan 26 11:00:27 UTC 2016 cl_2dr: clusterconf completed, RC: 0
Tue Jan 26 11:00:27 UTC 2016 cl_2dr: Launch the SAN communication reconfiguration in background.
...