Automation to adapt to the Live Partition Mobility (LPM) operation
This chapter introduces a new feature of the PowerHA SystemMirror 7.2 edition: automation to adapt to the Live Partition Mobility (LPM) operation.
Before the PowerHA SystemMirror 7.2 edition, if customers wanted to perform an LPM operation on an AIX LPAR that was running PowerHA services, they had to perform manual operations, which are illustrated on the following website:
The PowerHA SystemMirror 7.2 edition plugs into the LPM infrastructure to listen to LPM events and adjusts the clustering-related monitoring as needed for the LPM operation to succeed without disruption. This reduces the burden on the administrator to perform manual operations on the cluster node during LPM operations. See the following website for more information about this feature:
This chapter introduces the operations that are necessary to ensure that the LPM operation for a PowerHA node completes successfully. This chapter uses both PowerHA 7.1 and PowerHA 7.2 cluster environments to illustrate the scenarios.
This chapter contains the following sections:
8.1 Concept
This section provides an introduction to the Live Partition Mobility concepts.
Live Partition Mobility
Live Partition Mobility (LPM) enables you to migrate LPARs running the AIX operating system and their hosted applications from one physical server to another without disrupting the infrastructure services. The migration operation maintains system transactional integrity and transfers the entire system environment, including processor state, memory, attached virtual devices, and connected users.
LPM eliminates downtime for planned hardware maintenance. However, LPM does not offer the same benefit for software maintenance or unplanned downtime. You can use PowerHA SystemMirror within a partition that is capable of LPM. This does not mean that PowerHA SystemMirror uses LPM in any way; PowerHA SystemMirror is treated as just another application within the partition.
LPM operation time and freeze time
The amount of operational time that an LPM migration requires on an LPAR is determined by multiple factors, such as LPAR’s memory size, workload activity (more memory pages require more memory updates across the system), and network performance.
LPAR freeze time is a part of LPM operational time, and it occurs when the LPM tries to reestablish the memory state. During this time, no other processes can operate in the LPAR. As part of this memory reestablishment process, memory pages from the source system can be copied to the target system over the network connection. If the network connection is congested, this process of copying over the memory pages can increase the overall LPAR freeze time.
Cluster software in a PowerHA cluster environment
In a PowerHA solution, PowerHA is not the only cluster software. Two other kinds of cluster software run behind the PowerHA cluster:
RSCT
CAA
See section 4.4, “IBM PowerHA, RSCT, and CAA” on page 98, which describes their relationship.
PowerHA cluster heartbeating and the Dead Man Switch (DMS)
PowerHA SystemMirror uses constant communication between the nodes to keep track of the health of the cluster, nodes, and so on. One of the key components of communication is the heartbeating between the nodes. Lack of heartbeats forms a critical part of the decision-making process to declare a node to be dead.
The PowerHA 7.2 default node failure detection time is 40 seconds: 30 seconds for the node communication timeout plus a 10-second grace period. Note that these values can be set higher if a customer requires it.
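To confirm the effective values on a running cluster, you can query the CAA tunables directly. The following is a minimal check that uses the clctrl command shown later in this chapter; treating node_down_delay as the tunable behind the grace period is an assumption.
clctrl -tune -x node_timeout      # node communication timeout in milliseconds (30000 = 30 seconds)
clctrl -tune -x node_down_delay   # assumed to back the 10-second grace period (milliseconds)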
Node A declares its partner Node B dead if Node A does not receive any communication or heartbeats from it for more than 40 seconds. This works well when Node B is actually dead (crashed, powered off, and so on). However, there can be scenarios where Node B is not dead, but cannot communicate for long periods.
Some examples of such scenarios are as follows:
1. There is only one communication link between the nodes and it is broken (it is highly recommended to deploy multiple communication links between the nodes to avoid this scenario).
2. Due to a rare situation, the operating system freezes the cluster processes and kernel threads such that the node cannot send any I/O (disk or network) for more than 40 seconds. In this case, Node A does not receive any communication from Node B for more than 40 seconds and therefore declares Node B dead, even though it is alive. This leads to a split-brain condition, which can result in data corruption if the disks are shared across nodes.
Some of these scenarios can be handled in the cluster. For example, in scenario 2, when Node B is allowed to run again after the freeze, it recognizes that it has not been able to communicate with the other nodes for a long time and takes evasive action. These types of actions are called Dead Man Switch (DMS) protection.
DMS involves timers that monitor various activities, such as I/O traffic and process health, to recognize stray cases where there is potential for the node (Node B) to be considered dead by its peers in the cluster. In these cases, the DMS timers trigger just before the node failure detection time and evasive action is initiated. A typical evasive action involves fencing the node.
PowerHA SystemMirror consists of different DMS protections:
Cluster Aware AIX (CAA) DMS protection
When CAA detects that a node is isolated in a multiple-node environment, a DMS is triggered. This timeout occurs when the node cannot communicate with other nodes within the delay that is specified by the node_timeout cluster tunable. The system crashes with a Deadman timer triggered error log entry if the deadman_mode cluster tunable (clctrl -tune) is set to a (assert mode, which is the default), or only logs an event if deadman_mode is set to e (event mode).
This can occur on the node that is undergoing LPM, or on both nodes in a two-node cluster. To prevent a system crash due to this timeout, it is suggested to increase node_timeout to its maximum value of 600 seconds before LPM and to restore it after LPM.
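The following minimal sketch shows how these CAA tunables can be adjusted directly with clctrl on a node where this step is not automated. The 600-second value matches the recommendation above; switching deadman_mode to event mode is shown only as an alternative assumption, not as the documented procedure.
# Before LPM: raise the CAA node timeout to its maximum value (600 seconds)
clctrl -tune -o node_timeout=600000
# Alternative (assumption): log an event instead of asserting when the DMS trips
clctrl -tune -o deadman_mode=e
# After LPM: restore the values used in this cluster
clctrl -tune -o node_timeout=30000
clctrl -tune -o deadman_mode=a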
 
Note: This operation is done manually with a PowerHA SystemMirror 7.1 node. 8.3, “Example: LPM scenario for PowerHA node with version 7.1” on page 291 introduces the operation. This operation is done automatically with a PowerHA SystemMirror 7.2 node, as described in 8.4, “New panel to support LPM in PowerHA 7.2” on page 308.
Group Services DMS
Group Services is a critical component that provides cluster-wide membership and group management. This daemon’s health is monitored continuously. If the process exits or becomes inactive for a long period of time, the node is brought down.
RSCT RMC, ConfigRMC, clstrmgr, and IBM.StorageRM daemons
Group Services monitors the health of these daemons. If they are inactive for a long time or exit, then the node is brought down.
 
Note: The Group Services (cthags) DMS timeout, at the time this publication was written, is 30 seconds. For now, it is hardcoded and cannot be changed.
Therefore, if the LPM freeze time is longer than the Group Services DMS timeout, Group Services (cthags) reacts and halts the node.
Because this timeout cannot be increased, you must disable RSCT critical process monitoring before LPM and enable it again after LPM with the following commands:
 – Disable RSCT critical process monitoring
To disable the RSCT monitoring process, use the following commands:
/usr/sbin/rsct/bin/hags_disable_client_kill -s cthags
/usr/sbin/rsct/bin/dms/stopdms -s cthags
 – Enable RSCT critical process monitoring
To enable the RSCT monitoring process, use the following commands:
/usr/sbin/rsct/bin/dms/startdms -s cthags
/usr/sbin/rsct/bin/hags_enable_client_kill -s cthags
 
Note: This operation is done manually in a PowerHA SystemMirror 7.1 node, as described in 8.3, “Example: LPM scenario for PowerHA node with version 7.1” on page 291. This operation is done automatically in a PowerHA SystemMirror 7.2 node, as described in 8.4, “New panel to support LPM in PowerHA 7.2” on page 308.
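To confirm the current state of the RSCT Dead Man Switch before and after toggling it, you can query the cthags subsystem directly. This is a minimal check; the expected output matches the listdms and lssrc examples later in this chapter.
/usr/sbin/rsct/bin/dms/listdms -s cthags       # reports "Dead Man Switch Enabled" or "Disabled"
lssrc -ls cthags | grep -i "Critical clients"  # reports whether critical clients are terminated if unresponsive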
8.1.1 Prerequisites for PowerHA node support of LPM
This section describes the prerequisites for PowerHA node support for LPM.
8.1.2 Reduce LPM freeze time as much as possible
To reduce the freeze time during an LPM operation, it is suggested to use 10 Gb network adapters and a dedicated network with enough available bandwidth, and to reduce memory activity during the LPM operation.
8.1.3 PowerHA fix requirement
For PowerHA SystemMirror version 7.1 to support changing the CAA node_timeout variable online through the PowerHA clmgr command, the following APARs are required:
PowerHA SystemMirror Version 7.1.2 - IV79502 (in SP8)
PowerHA SystemMirror Version 7.1.3 - IV79497 (in SP5)
Without these APARs, or with PowerHA version 7.1.1, changing the CAA node_timeout variable requires four steps. See “Increase the CAA node_timeout” on page 298 for more information.
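Before relying on the single-step clmgr -f method, you can confirm that the required APAR is installed. The following is a minimal sketch that uses the standard AIX instfix and lslpp commands; the fileset shown is only an example of what to check.
instfix -ik IV79497                 # PowerHA 7.1.3 nodes (use IV79502 on 7.1.2 nodes)
lslpp -l cluster.es.server.rte      # confirm the installed PowerHA SystemMirror level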
8.2 Operation flow to support LPM on PowerHA node
The operation flow includes pre-migration and post-migration stages.
If the PowerHA version is earlier than 7.2, you must perform the operations manually. If the PowerHA version is 7.2 or later, PowerHA performs the operations automatically.
This section introduces the pre-migration and post-migration operation flows during LPM.
8.2.1 Pre-migration operation flow
Figure 8-1 describes the operation flow in a pre-migration stage.
Figure 8-1 Pre-migration operation flow
Table 8-1 shows the detailed information for each step in the pre-migration stage.
Table 8-1 Description of the pre-migration operation flow
Step 1: Check if HyperSwap is used. If YES, go to step 2; otherwise, go to step 1.1.
Step 1.1: Check if LPM_POLICY=unmanage is set. If YES, go to step 2; otherwise, go to step 4:
clodmget -n -f lpm_policy HACMPcluster
Step 2: Change the node to unmanage resource group status:
clmgr stop node <node_name> WHEN=now MANAGE=unmanage
Step 3: Add an entry in the /etc/inittab file, which is useful in case of a node crash before restoring the managed state:
mkitab hacmp_lpm:2:once:/usr/es/sbin/cluster/utilities/cl_dr undopremigrate > /dev/null 2>&1
Step 4: Check if RSCT DMS critical resource monitoring is enabled:
/usr/sbin/rsct/bin/dms/listdms -s cthags | grep -qw Enabled
Step 5: Disable RSCT DMS critical resource monitoring:
/usr/sbin/rsct/bin/hags_disable_client_kill -s cthags
/usr/sbin/rsct/bin/dms/stopdms -s cthags
Step 6: Check if the current node_timeout value is equal to the value that you set:
clodmget -n -f lpm_node_timeout HACMPcluster
clctrl -tune -x node_timeout
Step 7: Change the CAA node_timeout value:
clmgr -f modify cluster HEARTBEAT_FREQUENCY="600"
Step 8: If SAN-based heartbeating is enabled, then disable this function:
echo 'sfwcom' >> /etc/cluster/ifrestrict
clusterconf
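For a PowerHA 7.1 node, where these steps are not automated, the pre-migration flow in Table 8-1 can be strung together in a small script. The following is a minimal sketch, assuming a two-node cluster, that the node name is passed as the first argument, and that the timeout and SAN heartbeating settings match the examples in this chapter; adapt and test it before use.
#!/bin/ksh
# Minimal pre-migration sketch for a PowerHA 7.1 node (see Table 8-1)
NODE=${1:?usage: pre_lpm.sh <node_name>}

# Step 2: put the resource groups into the unmanaged state
clmgr stop node ${NODE} WHEN=now MANAGE=unmanage

# Step 3: temporary inittab entry in case the node crashes before the state is restored
mkitab "hacmp_lpm:2:once:/usr/es/sbin/cluster/utilities/cl_dr undopremigrate > /dev/null 2>&1"

# Steps 4 and 5: disable RSCT critical resource monitoring (run on every cluster node)
if /usr/sbin/rsct/bin/dms/listdms -s cthags | grep -qw Enabled; then
    /usr/sbin/rsct/bin/hags_disable_client_kill -s cthags
    /usr/sbin/rsct/bin/dms/stopdms -s cthags
fi

# Steps 6 and 7: raise the CAA node_timeout to 600 seconds (cluster-wide, run once)
clmgr -f modify cluster HEARTBEAT_FREQUENCY="600"

# Step 8: disable SAN-based heartbeating, if it is configured (run on every cluster node)
echo 'sfwcom' >> /etc/cluster/ifrestrict
clusterconf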
8.2.2 Post-migration operation flow
Figure 8-2 describes the operation flow in the post-migration stage.
Figure 8-2 Post-migration operation flow
Table 8-2 shows the detailed information for each step in the post-migration stage.
Table 8-2 Description of post-migration operation flow
Step 1: Check if the current resource group status is unmanaged. If YES, go to step 2; otherwise, go to step 4.
Step 2: Change the node back to manage resource group status:
clmgr start node <node_name> WHEN=now MANAGE=auto
Step 3: Remove the entry from the /etc/inittab file that was added in the pre-migration process:
rmitab hacmp_lpm
Step 4: Check if the RSCT DMS critical resource monitoring function was enabled before the LPM operation.
Step 5: Enable RSCT DMS critical resource monitoring:
/usr/sbin/rsct/bin/dms/startdms -s cthags
/usr/sbin/rsct/bin/hags_enable_client_kill -s cthags
Step 6: Check if the current node_timeout value is equal to the value that you set before:
clctrl -tune -x node_timeout
clodmget -n -f lpm_node_timeout HACMPcluster
Step 7: Restore the CAA node_timeout value:
clmgr -f modify cluster HEARTBEAT_FREQUENCY="30"
Step 8: If SAN-based heartbeating is enabled, then enable this function:
rm -f /etc/cluster/ifrestrict
clusterconf
rmdev -l sfwcomm*
mkdev -l sfwcomm*
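The corresponding post-migration flow in Table 8-2 reverses those changes. This is a minimal sketch under the same assumptions as the pre-migration script (two-node cluster, node name as the first argument, and a 30-second default node_timeout).
#!/bin/ksh
# Minimal post-migration sketch for a PowerHA 7.1 node (see Table 8-2)
NODE=${1:?usage: post_lpm.sh <node_name>}

# Steps 2 and 3: bring the resource groups back under PowerHA management
clmgr start node ${NODE} WHEN=now MANAGE=auto
rmitab hacmp_lpm

# Steps 4 and 5: re-enable RSCT critical resource monitoring (run on every cluster node)
/usr/sbin/rsct/bin/dms/startdms -s cthags
/usr/sbin/rsct/bin/hags_enable_client_kill -s cthags

# Steps 6 and 7: restore the CAA node_timeout (cluster-wide, run once)
clmgr -f modify cluster HEARTBEAT_FREQUENCY="30"

# Step 8: re-enable SAN-based heartbeating, if it was disabled before LPM
rm -f /etc/cluster/ifrestrict
clusterconf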
8.3 Example: LPM scenario for PowerHA node with version 7.1
This section introduces detailed operations for performing LPM for one node with PowerHA SystemMirror version 7.1.
8.3.1 Topology introduction
Figure 8-3 describes the topology of the testing environment.
Figure 8-3 Testing environment topology
There are two Power Systems 780 servers. The first server is P780_09 (machine serial number 060C0AT), and the second server is P780_10 (machine serial number 061949T). The following list provides additional details about the testing environment:
Each server has one VIOS partition and one AIX partition.
The P780_09 server has VIOSA and AIX720_LPM1 partitions.
The P780_10 server has VIOSB and AIX720_LPM2 partitions.
There is one storage subsystem that can be accessed by the two VIO servers.
The two AIX partitions access the storage through the NPIV protocol.
The heartbeating method includes IP, SAN, and dpcom.
The AIX version is AIX 7.2 SP1.
The PowerHA SystemMirror version is 7.1.3 SP4.
8.3.2 Initial status
This section describes the initial cluster status.
PowerHA and AIX version
Example 8-1 shows the PowerHA and the AIX version information.
Example 8-1 PowerHA and AIX version information
AIX720_LPM1:/usr/es/sbin/cluster # clhaver
Node AIX720_LPM2 has HACMP version 7134 installed
Node AIX720_LPM1 has HACMP version 7134 installed
 
AIX720_LPM1:/usr/es/sbin/cluster # clcmd oslevel -s
-------------------------------
NODE AIX720_LPM2
-------------------------------
7200-00-01-1543
 
-------------------------------
NODE AIX720_LPM1
-------------------------------
7200-00-01-1543
PowerHA configuration
Table 8-3 shows the cluster’s configuration.
Table 8-3 Cluster’s configuration
Cluster name: LPMCluster (cluster type: NSC (No Site Cluster))
Network interfaces:
 – AIX720_LPM1: en1: 172.16.50.21, netmask 255.255.255.0, gateway 172.16.50.1
 – AIX720_LPM2: en0: 172.16.50.22, netmask 255.255.255.0, gateway 172.16.50.1
Network: net_ether_01 (172.16.50.0/24)
CAA: Unicast, primary disk: hdisk1
Shared VG: testVG (hdisk2)
Service IP: 172.16.50.23 (AIX720_LPM_Service)
Resource group: testRG, which includes testVG and AIX720_LPM_Service
 – The node order is: AIX720_LPM1, AIX720_LPM2
 – Startup Policy: Online On Home Node Only
 – Fallover Policy: Fallover To Next Priority Node In The List
 – Fallback Policy: Never Fallback
PowerHA and Resource Group status
Example 8-2 shows the current status of PowerHA and the Resource Group.
Example 8-2 PowerHA and Resource Group status
AIX720_LPM1:/ # clcmd -n LPMCluster lssrc -ls clstrmgrES|egrep "NODE|state"|grep -v "Last"
NODE AIX720_LPM2
Current state: ST_STABLE
NODE AIX720_LPM1
Current state: ST_STABLE
 
 
AIX720_LPM1:/ # clcmd -n LPMCluster clRGinfo
-------------------------------
NODE AIX720_LPM2
-------------------------------
-----------------------------------------------------------------------------
Group Name State Node
-----------------------------------------------------------------------------
testRG ONLINE AIX720_LPM1
OFFLINE AIX720_LPM2
 
-------------------------------
NODE AIX720_LPM1
-------------------------------
-----------------------------------------------------------------------------
Group Name State Node
-----------------------------------------------------------------------------
testRG ONLINE AIX720_LPM1
OFFLINE AIX720_LPM2
CAA heartbeating status
Example 8-3 shows the current CAA heartbeating status and node_timeout parameter.
Example 8-3 CAA heartbeating status and value of node_timeout parameter
AIX720_LPM1:/ # clcmd lscluster -m
 
-------------------------------
NODE AIX720_LPM2
-------------------------------
Calling node query for all nodes...
Node query number of nodes examined: 2
 
Node name: AIX720_LPM1
Cluster shorthand id for node: 1
UUID for node: 112552f0-c4b7-11e5-8014-56c6a3855d04
State of node: UP
Smoothed rtt to node: 7
Mean Deviation in network rtt to node: 3
Number of clusters node is a member in: 1
CLUSTER NAME SHID UUID
LPMCluster 0 11403f34-c4b7-11e5-8014-56c6a3855d04
SITE NAME SHID UUID
LOCAL 1 51735173-5173-5173-5173-517351735173
 
Points of contact for node: 2
-----------------------------------------------------------------------
Interface State Protocol Status SRC_IP->DST_IP
-----------------------------------------------------------------------
sfwcom UP none none none
tcpsock->01 UP IPv4 none 172.16.50.22->172.16.50.21
 
----------------------------------------------------------------------------
 
Node name: AIX720_LPM2
Cluster shorthand id for node: 2
UUID for node: 11255336-c4b7-11e5-8014-56c6a3855d04
State of node: UP NODE_LOCAL
Smoothed rtt to node: 0
Mean Deviation in network rtt to node: 0
Number of clusters node is a member in: 1
CLUSTER NAME SHID UUID
LPMCluster 0 11403f34-c4b7-11e5-8014-56c6a3855d04
SITE NAME SHID UUID
LOCAL 1 51735173-5173-5173-5173-517351735173
 
Points of contact for node: 0
 
-------------------------------
NODE AIX720_LPM1
-------------------------------
Calling node query for all nodes...
Node query number of nodes examined: 2
 
Node name: AIX720_LPM1
Cluster shorthand id for node: 1
UUID for node: 112552f0-c4b7-11e5-8014-56c6a3855d04
State of node: UP NODE_LOCAL
Smoothed rtt to node: 0
Mean Deviation in network rtt to node: 0
Number of clusters node is a member in: 1
CLUSTER NAME SHID UUID
LPMCluster 0 11403f34-c4b7-11e5-8014-56c6a3855d04
SITE NAME SHID UUID
LOCAL 1 51735173-5173-5173-5173-517351735173
 
Points of contact for node: 0
 
----------------------------------------------------------------------------
 
Node name: AIX720_LPM2
Cluster shorthand id for node: 2
UUID for node: 11255336-c4b7-11e5-8014-56c6a3855d04
State of node: UP
Smoothed rtt to node: 17
Mean Deviation in network rtt to node: 13
Number of clusters node is a member in: 1
CLUSTER NAME SHID UUID
LPMCluster 0 11403f34-c4b7-11e5-8014-56c6a3855d04
SITE NAME SHID UUID
LOCAL 1 51735173-5173-5173-5173-517351735173
 
Points of contact for node: 2
-----------------------------------------------------------------------
Interface State Protocol Status SRC_IP->DST_IP
-----------------------------------------------------------------------
sfwcom UP none none none
tcpsock->02 UP IPv4 none 172.16.50.21->172.16.50.22
 
AIX720_LPM2:/ # clctrl -tune -L
NAME DEF MIN MAX UNIT SCOPE
...
node_timeout 20000 10000 600000 milliseconds c n
LPMCluster(11403f34-c4b7-11e5-8014-56c6a3855d04) 30000
...
--> Current node_timeout is 30s
RSCT cthags status
Example 8-4 shows the current RSCT cthags service’s status.
Example 8-4 RSCT cthags service’s status
AIX720_LPM1:/ # lssrc -ls cthags
Subsystem Group PID Status
cthags cthags 13173166 active
5 locally-connected clients. Their PIDs:
9175342(IBM.ConfigRMd) 6619600(rmcd) 14549496(IBM.StorageRMd) 7995658(clstrmgr) 10355040(gsclvmd)
HA Group Services domain information:
Domain established by node 1
Number of groups known locally: 8
Number of Number of local
Group name providers providers/subscribers
rmc_peers 2 1 0
s00V0CKI0009G000001A9UHPVQ4 2 1 0
IBM.ConfigRM 2 1 0
IBM.StorageRM.v1 2 1 0
CLRESMGRD_1495882547 2 1 0
CLRESMGRDNPD_1495882547 2 1 0
CLSTRMGR_1495882547 2 1 0
d00V0CKI0009G000001A9UHPVQ4 2 1 0
Critical clients will be terminated if unresponsive
 
Dead Man Switch Enabled
 
AIX720_LPM1:/usr/sbin/rsct/bin/dms # ./listdms -s cthags
Dead Man Switch Enabled:
reset interval = 3 seconds
trip interval = 30 seconds
LPAR and server location information
Example 8-5 shows the current LPAR’s location information.
Example 8-5 LPAR and server location information
AIX720_LPM1:/ # prtconf
System Model: IBM,9179-MHD
Machine Serial Number: 060C0AT --> this server is P780_09
 
AIX720_LPM2:/ # prtconf
System Model: IBM,9179-MHD
Machine Serial Number: 061949T --> this server is P780_10
8.3.3 Manual operation before LPM
Before performing the LPM operation, several manual operations are required.
Change the PowerHA service to unmanage Resource Group status
There are two methods to change the PowerHA service to Unmanage Resource Group status. The first method is through the SMIT menu, as shown in Example 8-6.
Start smit clstop.
Example 8-6 Change the cluster service to unmanage Resource Groups through the SMIT menu
                      Stop Cluster Services
 
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
 
[Entry Fields]
* Stop now, on system restart or both now
Stop Cluster Services on these nodes [AIX720_LPM1]
BROADCAST cluster shutdown? true
* Select an Action on Resource Groups Unmanage Resource Groups
The second method is through the clmgr command, as shown in Example 8-7.
Example 8-7 Change cluster service to unmanage Resource Group through the clmgr command
AIX720_LPM1:/ # clmgr stop node AIX720_LPM1 WHEN=now MANAGE=unmanage
Broadcast message from root@AIX720_LPM1 (tty) at 23:52:44 ...
PowerHA SystemMirror on AIX720_LPM1 shutting down. Please exit any cluster applications...
AIX720_LPM1: 0513-044 The clevmgrdES Subsystem was requested to stop.
.
"AIX720_LPM1" is now unmanaged.
AIX720_LPM1: Jan 26 2016 23:52:43 /usr/es/sbin/cluster/utilities/clstop: called with flags -N -f
 
AIX720_LPM1:/ # clcmd -n LPMCluster clRGinfo
-------------------------------
NODE AIX720_LPM2
-------------------------------
-----------------------------------------------------------------------------
Group Name State Node
-----------------------------------------------------------------------------
testRG UNMANAGED AIX720_LPM1
UNMANAGED AIX720_LPM2
 
-------------------------------
NODE AIX720_LPM1
-------------------------------
-----------------------------------------------------------------------------
Group Name State Node
-----------------------------------------------------------------------------
testRG UNMANAGED AIX720_LPM1
UNMANAGED AIX720_LPM2
Disable RSCT cthags critical resource monitoring function
Example 8-8 shows how to disable the RSCT cthags critical resource monitoring function to prevent a DMS trigger if the LPM freeze time is longer than its timeout.
 
Note: In this case, there are only two nodes in this cluster, so you need to disable this function on both nodes. Only one node is shown in the example, but the command is run on both nodes.
Example 8-8 Disable RSCT cthags critical resource monitoring function
AIX720_LPM1:/ # /usr/sbin/rsct/bin/hags_disable_client_kill -s cthags
AIX720_LPM1:/ # /usr/sbin/rsct/bin/dms/stopdms -s cthags
 
Dead Man Switch Disabled
DMS Re-arming Thread cancelled
 
AIX720_LPM1:/ # lssrc -ls cthags
Subsystem Group PID Status
cthags cthags 13173166 active
5 locally-connected clients. Their PIDs:
9175342(IBM.ConfigRMd) 6619600(rmcd) 14549496(IBM.StorageRMd) 19792370(clstrmgr) 19268008(gsclvmd)
HA Group Services domain information:
Domain established by node 1
Number of groups known locally: 8
Number of Number of local
Group name providers providers/subscribers
rmc_peers 2 1 0
s00V0CKI0009G000001A9UHPVQ4 2 1 0
IBM.ConfigRM 2 1 0
IBM.StorageRM.v1 2 1 0
CLRESMGRD_1495882547 2 1 0
CLRESMGRDNPD_1495882547 2 1 0
CLSTRMGR_1495882547 2 1 0
d00V0CKI0009G000001A9UHPVQ4 2 1 0
 
Critical clients will not be terminated even if unresponsive
 
Dead Man Switch Disabled
 
AIX720_LPM1:/usr/sbin/rsct/bin/dms # ./listdms -s cthags
 
Dead Man Switch Disabled
Increase the CAA node_timeout
Example 8-9 shows how to increase the CAA node_timeout to prevent a CAA DMS trigger if the LPM freeze time is longer than its timeout. You need to run this command on only one node, because it is cluster aware.
Example 8-9 Increase the CAA node_timeout
AIX720_LPM1:/ # clmgr -f modify cluster HEARTBEAT_FREQUENCY="600"
1 tunable updated on cluster LPMCluster.
 
AIX720_LPM1:/ # clctrl -tune -L
NAME DEF MIN MAX UNIT SCOPE
ENTITY_NAME(UUID) CUR
...
node_timeout 20000 10000 600000 milliseconds c n
LPMCluster(11403f34-c4b7-11e5-8014-56c6a3855d04) 600000
 
Note: With the previous configuration, if the LPM freeze time is longer than 600 seconds, the CAA DMS is still triggered because the CAA deadman_mode parameter is set to a (assert). The node crashes and its resource group is moved to another node.
Note: The -f option of the clmgr command means that the HACMPcluster ODM is not updated; the CAA variable (node_timeout) is updated directly with the clctrl command. This function is included with the following interim fixes:
PowerHA SystemMirror Version 7.1.2 - IV79502 (SP8)
PowerHA SystemMirror Version 7.1.3 - IV79497 (SP5)
If you do not apply one of these interim fixes, then you must perform four steps to increase the CAA node_timeout variable (Example 8-10):
Change the PowerHA service to online status (because cluster sync needs this status)
Change the HACMPcluster ODM
Perform cluster verification and synchronization
Change the PowerHA service to unmanage resource group status
Example 8-10 Detailed steps to change CAA node_timeout variable without PowerHA interim fix
--> Step 1
AIX720_LPM1:/ # clmgr start node AIX720_LPM1 WHEN=now MANAGE=auto
Adding any necessary PowerHA SystemMirror entries to /etc/inittab and /etc/rc.net for IPAT on node AIX720_LPM1.
AIX720_LPM1: start_cluster: Starting PowerHA SystemMirror
...
"AIX720_LPM1" is now online.
Starting Cluster Services on node: AIX720_LPM1
This may take a few minutes. Please wait...
AIX720_LPM1: Jan 27 2016 06:17:04 Starting execution of /usr/es/sbin/cluster/etc/rc.cluster
AIX720_LPM1: with parameters: -boot -N -A -b -P cl_rc_cluster
AIX720_LPM1:
AIX720_LPM1: Jan 27 2016 06:17:04 Checking for srcmstr active...
AIX720_LPM1: Jan 27 2016 06:17:04 complete.
 
--> Step 2
AIX720_LPM1:/ # clmgr modify cluster HEARTBEAT_FREQUENCY="600"
 
--> Step 3
AIX720_LPM1:/ # clmgr sync cluster
Verifying additional pre-requisites for Dynamic Reconfiguration...
...completed.
 
Committing any changes, as required, to all available nodes...
Adding any necessary PowerHA SystemMirror entries to /etc/inittab and /etc/rc.net for IPAT on node AIX720_LPM1.
Checking for added nodes
Updating Split Merge Policies
1 tunable updated on cluster LPMCluster.
Adding any necessary PowerHA SystemMirror entries to /etc/inittab and /etc/rc.net for IPAT on node AIX720_LPM2.
 
Verification has completed normally.
 
--> Step 4
AIX720_LPM1:/ # clmgr stop node AIX720_LPM1 WHEN=now MANAGE=unmanage
Broadcast message from root@AIX720_LPM1 (tty) at 06:15:02 ...
PowerHA SystemMirror on AIX720_LPM1 shutting down. Please exit any cluster applications...
AIX720_LPM1: 0513-044 The clevmgrdES Subsystem was requested to stop.
.
"AIX720_LPM1" is now unmanaged.
 
--> Check the result
AIX720_LPM1:/ # clctrl -tune -L
NAME DEF MIN MAX UNIT SCOPE
ENTITY_NAME(UUID) CUR
...
node_timeout 20000 10000 600000 milliseconds c n
LPMCluster(11403f34-c4b7-11e5-8014-56c6a3855d04) 600000
 
Note: When you stop the cluster with the unmanage option and then start it with the auto option, PowerHA tries to bring the resource group online, which does not cause any problem with the VGs, file systems, and IPs. However, it runs the application controller one more time. If the application controller does not include the appropriate checks before running its commands, this can cause problems with the application. Therefore, the application controller start script should check whether the application is already online before starting it.
Disable SAN heartbeating function
 
Note: In our scenario, SAN-based heartbeating has been configured, so this step is required. You do not need to do this step if SAN-based heartbeating is not configured.
Example 8-11 shows how to disable the SAN heartbeating function.
Example 8-11 Disable SAN heartbeating function
AIX720_LPM1:/ # echo "sfwcom" >> /etc/cluster/ifrestrict
AIX720_LPM1:/ # clusterconf
 
AIX720_LPM2:/ # echo "sfwcom" >> /etc/cluster/ifrestrict
AIX720_LPM2:/ # clusterconf
 
AIX720_LPM1:/ # clcmd lscluster -m
 
-------------------------------
NODE AIX720_LPM2
-------------------------------
Calling node query for all nodes...
Node query number of nodes examined: 2
 
Node name: AIX720_LPM1
Cluster shorthand id for node: 1
UUID for node: 112552f0-c4b7-11e5-8014-56c6a3855d04
State of node: UP
Smoothed rtt to node: 7
Mean Deviation in network rtt to node: 3
Number of clusters node is a member in: 1
CLUSTER NAME SHID UUID
LPMCluster 0 11403f34-c4b7-11e5-8014-56c6a3855d04
SITE NAME SHID UUID
LOCAL 1 51735173-5173-5173-5173-517351735173
 
Points of contact for node: 1
-----------------------------------------------------------------------
Interface State Protocol Status SRC_IP->DST_IP
-----------------------------------------------------------------------
tcpsock->01 UP IPv4 none 172.16.50.22->172.16.50.21
----------------------------------------------------------------------------
 
Node name: AIX720_LPM2
Cluster shorthand id for node: 2
UUID for node: 11255336-c4b7-11e5-8014-56c6a3855d04
State of node: UP NODE_LOCAL
Smoothed rtt to node: 0
Mean Deviation in network rtt to node: 0
Number of clusters node is a member in: 1
CLUSTER NAME SHID UUID
LPMCluster 0 11403f34-c4b7-11e5-8014-56c6a3855d04
SITE NAME SHID UUID
LOCAL 1 51735173-5173-5173-5173-517351735173
 
Points of contact for node: 0
 
-------------------------------
NODE AIX720_LPM1
-------------------------------
Calling node query for all nodes...
Node query number of nodes examined: 2
 
Node name: AIX720_LPM1
Cluster shorthand id for node: 1
UUID for node: 112552f0-c4b7-11e5-8014-56c6a3855d04
State of node: UP NODE_LOCAL
Smoothed rtt to node: 0
Mean Deviation in network rtt to node: 0
Number of clusters node is a member in: 1
CLUSTER NAME SHID UUID
LPMCluster 0 11403f34-c4b7-11e5-8014-56c6a3855d04
SITE NAME SHID UUID
LOCAL 1 51735173-5173-5173-5173-517351735173
 
Points of contact for node: 0
 
----------------------------------------------------------------------------
 
Node name: AIX720_LPM2
Cluster shorthand id for node: 2
UUID for node: 11255336-c4b7-11e5-8014-56c6a3855d04
State of node: UP
Smoothed rtt to node: 18
Mean Deviation in network rtt to node: 14
Number of clusters node is a member in: 1
CLUSTER NAME SHID UUID
LPMCluster 0 11403f34-c4b7-11e5-8014-56c6a3855d04
SITE NAME SHID UUID
LOCAL 1 51735173-5173-5173-5173-517351735173
 
Points of contact for node: 1
-----------------------------------------------------------------------
Interface State Protocol Status SRC_IP->DST_IP
-----------------------------------------------------------------------
tcpsock->02 UP IPv4 none 172.16.50.21->172.16.50.22
 
AIX720_LPM1:/ # lscluster -i
Network/Storage Interface Query
 
Cluster Name: LPMCluster
Cluster UUID: 11403f34-c4b7-11e5-8014-56c6a3855d04
Number of nodes reporting = 2
Number of nodes stale = 0
Number of nodes expected = 2
 
Node AIX720_LPM1
Node UUID = 112552f0-c4b7-11e5-8014-56c6a3855d04
Number of interfaces discovered = 3
Interface number 1, en1
IFNET type = 6 (IFT_ETHER)
NDD type = 7 (NDD_ISO88023)
MAC address length = 6
MAC address = FA:97:6D:97:2A:20
Smoothed RTT across interface = 0
Mean deviation in network RTT across interface = 0
Probe interval for interface = 990 ms
IFNET flags for interface = 0x1E084863
NDD flags for interface = 0x0021081B
Interface state = UP
Number of regular addresses configured on interface = 2
IPv4 ADDRESS: 172.16.50.21 broadcast 172.16.50.255 netmask 255.255.255.0
IPv4 ADDRESS: 172.16.50.23 broadcast 172.16.50.255 netmask 255.255.255.0
Number of cluster multicast addresses configured on interface = 1
IPv4 MULTICAST ADDRESS: 228.16.50.21
Interface number 2, sfwcom
IFNET type = 0 (none)
NDD type = 304 (NDD_SANCOMM)
Smoothed RTT across interface = 7
Mean deviation in network RTT across interface = 3
Probe interval for interface = 990 ms
IFNET flags for interface = 0x00000000
NDD flags for interface = 0x00000009
Interface state = DOWN RESTRICTED SOURCE HARDWARE RECEIVE SOURCE HARDWARE TRANSMIT
Interface number 3, dpcom
IFNET type = 0 (none)
NDD type = 305 (NDD_PINGCOMM)
Smoothed RTT across interface = 750
Mean deviation in network RTT across interface = 1500
Probe interval for interface = 22500 ms
IFNET flags for interface = 0x00000000
NDD flags for interface = 0x00000009
Interface state = UP RESTRICTED AIX_CONTROLLED
 
Node AIX720_LPM2
Node UUID = 11255336-c4b7-11e5-8014-56c6a3855d04
Number of interfaces discovered = 3
Interface number 1, en1
IFNET type = 6 (IFT_ETHER)
NDD type = 7 (NDD_ISO88023)
MAC address length = 6
MAC address = FA:F2:D3:29:50:20
Smoothed RTT across interface = 0
Mean deviation in network RTT across interface = 0
Probe interval for interface = 990 ms
IFNET flags for interface = 0x1E084863
NDD flags for interface = 0x0021081B
Interface state = UP
Number of regular addresses configured on interface = 1
IPv4 ADDRESS: 172.16.50.22 broadcast 172.16.50.255 netmask 255.255.255.0
Number of cluster multicast addresses configured on interface = 1
IPv4 MULTICAST ADDRESS: 228.16.50.21
Interface number 2, sfwcom
IFNET type = 0 (none)
NDD type = 304 (NDD_SANCOMM)
Smoothed RTT across interface = 7
Mean deviation in network RTT across interface = 3
Probe interval for interface = 990 ms
IFNET flags for interface = 0x00000000
NDD flags for interface = 0x00000009
Interface state = DOWN RESTRICTED SOURCE HARDWARE RECEIVE SOURCE HARDWARE TRANSMIT
Interface number 3, dpcom
IFNET type = 0 (none)
NDD type = 305 (NDD_PINGCOMM)
Smoothed RTT across interface = 750
Mean deviation in network RTT across interface = 1500
Probe interval for interface = 22500 ms
IFNET flags for interface = 0x00000000
NDD flags for interface = 0x00000009
Interface state = UP RESTRICTED AIX_CONTROLLED
8.3.4 Perform LPM
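Before running the actual migration, the operation can optionally be validated on the HMC. The following is a minimal sketch that uses the validate operation of the migrlpar command (-o v) with the managed system and partition names from this scenario; it is not part of the PowerHA procedure itself.
hscroot@hmc55:~> migrlpar -o v -m SVRP7780-09-SN060C0AT -t SVRP7780-10-SN061949T -p AIX720_LPM1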
Example 8-12 shows how to perform the LPM operation for the AIX720_LPM1 node. This operation migrates this LPAR from P780_09 to P780_10.
Example 8-12 Performing the LPM operation
hscroot@hmc55:~> time migrlpar -o m -m SVRP7780-09-SN060C0AT -t SVRP7780-10-SN061949T -p AIX720_LPM1
 
real 1m6.269s
user 0m0.001s
sys 0m0.000s
PowerHA service and resource group status
After LPM completes, Example 8-13 shows that the PowerHA services are still stable, and AIX720_LPM1 has been moved to the P780_10 server.
Example 8-13 PowerHA services stable
AIX720_LPM1:/ # clcmd -n LPMCluster lssrc -ls clstrmgrES|egrep "NODE|state"|grep -v "Last"
NODE AIX720_LPM2
Current state: ST_STABLE
NODE AIX720_LPM1
Current state: ST_STABLE
 
AIX720_LPM1:/ # prtconf
System Model: IBM,9179-MHD
Machine Serial Number: 061949T --> this server is P780_10
 
AIX720_LPM2:/ # prtconf|more
System Model: IBM,9179-MHD
Machine Serial Number: 061949T --> this server is P780_10
8.3.5 Manual operation after LPM
After LPM completes, there are several manual operations required.
Enable SAN heartbeating function
Example 8-14 shows how to enable the SAN heartbeating function.
Example 8-14 Enable SAN heartbeating function
AIX720_LPM1:/ # rm /etc/cluster/ifrestrict
AIX720_LPM1:/ # clusterconf
 
AIX720_LPM2:/ # rm /etc/cluster/ifrestrict
AIX720_LPM2:/ # clusterconf
 
AIX720_LPM1:/ # clcmd lscluster -m
 
-------------------------------
NODE AIX720_LPM2
-------------------------------
Calling node query for all nodes...
Node query number of nodes examined: 2
 
Node name: AIX720_LPM1
Cluster shorthand id for node: 1
UUID for node: 112552f0-c4b7-11e5-8014-56c6a3855d04
State of node: UP
Smoothed rtt to node: 7
Mean Deviation in network rtt to node: 3
Number of clusters node is a member in: 1
CLUSTER NAME SHID UUID
LPMCluster 0 11403f34-c4b7-11e5-8014-56c6a3855d04
SITE NAME SHID UUID
LOCAL 1 51735173-5173-5173-5173-517351735173
 
Points of contact for node: 2
-----------------------------------------------------------------------
Interface State Protocol Status SRC_IP->DST_IP
-----------------------------------------------------------------------
sfwcom UP none none none
tcpsock->01 UP IPv4 none 172.16.50.22->172.16.50.21
 
----------------------------------------------------------------------------
 
Node name: AIX720_LPM2
Cluster shorthand id for node: 2
UUID for node: 11255336-c4b7-11e5-8014-56c6a3855d04
State of node: UP NODE_LOCAL
Smoothed rtt to node: 0
Mean Deviation in network rtt to node: 0
Number of clusters node is a member in: 1
CLUSTER NAME SHID UUID
LPMCluster 0 11403f34-c4b7-11e5-8014-56c6a3855d04
SITE NAME SHID UUID
LOCAL 1 51735173-5173-5173-5173-517351735173
 
Points of contact for node: 0
 
-------------------------------
NODE AIX720_LPM1
-------------------------------
Calling node query for all nodes...
Node query number of nodes examined: 2
 
Node name: AIX720_LPM1
Cluster shorthand id for node: 1
UUID for node: 112552f0-c4b7-11e5-8014-56c6a3855d04
State of node: UP NODE_LOCAL
Smoothed rtt to node: 0
Mean Deviation in network rtt to node: 0
Number of clusters node is a member in: 1
CLUSTER NAME SHID UUID
LPMCluster 0 11403f34-c4b7-11e5-8014-56c6a3855d04
SITE NAME SHID UUID
LOCAL 1 51735173-5173-5173-5173-517351735173
 
Points of contact for node: 0
 
----------------------------------------------------------------------------
 
Node name: AIX720_LPM2
Cluster shorthand id for node: 2
UUID for node: 11255336-c4b7-11e5-8014-56c6a3855d04
State of node: UP
Smoothed rtt to node: 16
Mean Deviation in network rtt to node: 14
Number of clusters node is a member in: 1
CLUSTER NAME SHID UUID
LPMCluster 0 11403f34-c4b7-11e5-8014-56c6a3855d04
SITE NAME SHID UUID
LOCAL 1 51735173-5173-5173-5173-517351735173
 
Points of contact for node: 2
-----------------------------------------------------------------------
Interface State Protocol Status SRC_IP->DST_IP
-----------------------------------------------------------------------
sfwcom UP none none none
tcpsock->02 UP IPv4 none 172.16.50.21->172.16.50.22
 
Note: After this step, if the sfwcom interface is still not UP, check the status of the vLAN storage framework communication device. If it is in the Defined state, reconfigure it with the following commands:
AIX720_LPM1:/ # lsdev -C|grep vLAN
sfwcomm1 Defined vLAN Storage Framework Comm
AIX720_LPM1:/ # rmdev -l sfwcomm1; sleep 2; mkdev -l sfwcomm1
sfwcomm1 Defined
sfwcomm1 Available
Then you can check the sfwcom interface’s status again with the lscluster command.
Restore CAA node_timeout
Example 8-15 shows how to restore the CAA node_timeout.
 
Note: In a PowerHA cluster environment, the default value of node_timeout is 30 seconds.
Example 8-15 Restore the CAA node_timeout parameter
AIX720_LPM1:/ # clmgr -f modify cluster HEARTBEAT_FREQUENCY="30"
1 tunable updated on cluster LPMCluster.
 
AIX720_LPM1:/ # clctrl -tune -L
NAME DEF MIN MAX UNIT SCOPE
ENTITY_NAME(UUID) CUR
...
node_timeout 20000 10000 600000 milliseconds c n
LPMCluster(11403f34-c4b7-11e5-8014-56c6a3855d04) 30000
Enable RSCT cthags critical resource monitoring function
Example 8-16 shows how to enable the RSCT cthags critical resource monitoring function.
 
Note: In this case, there are only two nodes in this cluster, so you disabled the function on both nodes before LPM. Only one node is shown in this example, but the command is run on both nodes.
Example 8-16 Enable RSCT cthags resource monitoring
AIX720_LPM1:/ # /usr/sbin/rsct/bin/dms/startdms -s cthags
 
Dead Man Switch Enabled
DMS Re-arming Thread created
 
AIX720_LPM1:/ # /usr/sbin/rsct/bin/hags_enable_client_kill -s cthags
AIX720_LPM1:/ # lssrc -ls cthags
Subsystem Group PID Status
cthags cthags 13173166 active
5 locally-connected clients. Their PIDs:
9175342(IBM.ConfigRMd) 6619600(rmcd) 14549496(IBM.StorageRMd) 19792370(clstrmgr) 19268008(gsclvmd)
HA Group Services domain information:
Domain established by node 1
Number of groups known locally: 8
Number of Number of local
Group name providers providers/subscribers
rmc_peers 2 1 0
s00V0CKI0009G000001A9UHPVQ4 2 1 0
IBM.ConfigRM 2 1 0
IBM.StorageRM.v1 2 1 0
CLRESMGRD_1495882547 2 1 0
CLRESMGRDNPD_1495882547 2 1 0
CLSTRMGR_1495882547 2 1 0
d00V0CKI0009G000001A9UHPVQ4 2 1 0
 
Critical clients will be terminated if unresponsive
 
Dead Man Switch Enabled
AIX720_LPM1:/ # /usr/sbin/rsct/bin/dms/listdms -s cthags
Dead Man Switch Enabled:
reset interval = 3 seconds
trip interval = 30 seconds
Change PowerHA service back to normal status
Example 8-17 shows how to change the PowerHA service back to normal status. There are two methods to achieve this. The first method is through the SMIT menu:
Start smit clstart.
Example 8-17 Change PowerHA service back to normal status
                     Start Cluster Services
 
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
 
[Entry Fields]
* Start now, on system restart or both now
Start Cluster Services on these nodes [AIX720_LPM1]
* Manage Resource Groups Automatically
BROADCAST message at startup? true
Startup Cluster Information Daemon? false
Ignore verification errors? false
Automatically correct errors found during Interactively
cluster start?
 
The second method is through the clmgr command:
AIX720_LPM1:/ # clmgr start node AIX720_LPM1 WHEN=now MANAGE=auto
AIX720_LPM1: start_cluster: Starting PowerHA SystemMirror
...
"AIX720_LPM1" is now online.
 
 
Starting Cluster Services on node: AIX720_LPM1
This may take a few minutes. Please wait...
AIX720_LPM1: Jan 27 2016 01:04:43 Starting execution of /usr/es/sbin/cluster/etc/rc.cluster
AIX720_LPM1: with parameters: -boot -N -A -b -P cl_rc_cluster
AIX720_LPM1:
AIX720_LPM1: Jan 27 2016 01:04:43 Checking for srcmstr active...
AIX720_LPM1: Jan 27 2016 01:04:43 complete.
 
Note: When you stop the cluster with the unmanage option and then start it with the auto option, PowerHA tries to bring the resource group online, which does not cause any problem with the VGs, file systems, and IPs. However, it runs the application controller one more time. If the application controller does not include the appropriate checks before running its commands, this can cause problems with the application. Therefore, the application controller start script should check whether the application is already online before starting it.
Example 8-18 shows that the resource group’s status has been changed to normal.
Example 8-18 Resource Group’s status
AIX720_LPM1:/ # clcmd clRGinfo
 
-------------------------------
NODE AIX720_LPM2
-------------------------------
-----------------------------------------------------------------------------
Group Name State Node
-----------------------------------------------------------------------------
testRG ONLINE AIX720_LPM1
OFFLINE AIX720_LPM2
 
 
-------------------------------
NODE AIX720_LPM1
-------------------------------
-----------------------------------------------------------------------------
Group Name State Node
-----------------------------------------------------------------------------
testRG ONLINE AIX720_LPM1
OFFLINE AIX720_LPM2
8.4 New panel to support LPM in PowerHA 7.2
From version 7.2, PowerHA SystemMirror automates some of the Live Partition Mobility (LPM) steps by registering a script with the LPM framework.
PowerHA SystemMirror listens to LPM events and automates steps in PowerHA SystemMirror to handle the LPAR freeze that can occur during the LPM process. As part of the automation, PowerHA SystemMirror provides a few variables that can be changed based on the requirements for your environment.
You can change the following LPM variables in PowerHA SystemMirror that provide LPM automation:
Node Failure Detection Timeout during LPM
LPM Node Policy
Start smit sysmirror. Select Custom Cluster Configuration → Cluster Nodes and Networks → Manage the Cluster → Cluster heartbeat settings. The next panel is a menu screen with a title menu option and seven item menu options.
Its fast path is cm_chng_tunables (Figure 8-4). This menu is not new, but two items have been added to it to make LPM easier in a PowerHA environment (the last two items are new).
            Cluster heartbeat settings
 
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
[Entry Fields]
 
* Network Failure Detection Time [20]
* Node Failure Detection Timeout [30]
* Node Failure Detection Grace Period [10]
* Node Failure Detection Timeout during LPM [600]
* LPM Node Policy [manage]
Figure 8-4 Cluster heartbeat setting
Table 8-4 describes the context-sensitive help information for the cluster heartbeating setting.
Table 8-4 Context-sensitive help for the Cluster heartbeat setting
Name: Node Failure Detection Timeout during LPM
Context-sensitive help (F1): If specified, this timeout value (in seconds) is used during a Live Partition Mobility (LPM) operation instead of the Node Failure Detection Timeout value. You can use this option to increase the Node Failure Detection Timeout for the duration of the LPM operation so that it is greater than the LPM freeze duration, which avoids any risk of unwanted cluster events. The unit is seconds. For the PowerHA 7.2 GA Edition, you can enter a value of 10 - 600. For PowerHA 7.2 SP1 or later, the default is 600 and cannot be changed.

Name: LPM Node Policy
Context-sensitive help (F1): Specifies the action to be taken on the node during a Live Partition Mobility operation. If unmanage is selected, the cluster services are stopped with the Unmanage Resource Groups option for the duration of the LPM operation. Otherwise, PowerHA SystemMirror continues to monitor the resource groups and application availability. The default is manage.
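After you set these values, you can verify how they are stored in the PowerHA ODM from the command line. This minimal check uses the same clodmget queries that the PowerHA LPM scripts run, as shown in the logs in 8.5.1, “Troubleshooting”.
clodmget -n -f lpm_policy HACMPcluster         # LPM Node Policy (UNMANAGE in this scenario)
clodmget -n -f lpm_node_timeout HACMPcluster   # Node Failure Detection Timeout during LPM (600)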
8.5 PowerHA 7.2 scenario and troubleshooting
This scenario keeps the same hardware and operating system as 8.3, “Example: LPM scenario for PowerHA node with version 7.1” on page 291. This scenario replaces only the PowerHA software with the 7.2 edition.
Example 8-19 shows the PowerHA version.
Example 8-19 PowerHA version
AIX720_LPM1:/ # clhaver
Node AIX720_LPM1 has HACMP version 7200 installed
Node AIX720_LPM2 has HACMP version 7200 installed
Table 8-5 shows the variables of LPM.
Table 8-5 Cluster heartbeating setting
Node Failure Detection Timeout during LPM: 600
LPM Node Policy: unmanage
8.5.1 Troubleshooting
The PowerHA log that is related to the LPM operation is /var/hacmp/log/clutils.log. Example 8-20 and Example 8-21 on page 311 show the information in this log file, including the pre-migration and post-migration stages.
 
Note: During the operation, PowerHA SystemMirror stops the cluster with the unmanage option in the pre-migration stage and automatically starts it with the auto option in the post-migration stage. PowerHA SystemMirror tries to bring the resource group online in the post-migration stage, which does not cause any problem with the VGs, file systems, and IPs. However, it runs the application controller one more time.
If the application controller does not include the appropriate checks before running its commands, this can cause problems with the application. Therefore, the application controller start script should check whether the application is already online before starting it.
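To follow the automated pre-migration and post-migration steps on a node, filter the clutils.log file for the cl_dr and cl_2dr entries, and check the AIX error log for any Deadman timer entry. This is a minimal sketch; the entry names match the examples that follow.
grep -E "cl_dr|cl_2dr" /var/hacmp/log/clutils.log | tail -50   # most recent LPM-related PowerHA actions
errpt -a | grep -i deadman                                     # check whether a CAA DMS timeout was logged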
Example 8-20 Log file of pre-migration operation
...
--> Check whether the PowerHA service needs to be changed to 'unmanage resource group' status
Tue Jan 26 10:57:08 UTC 2016 cl_dr: clodmget -n -f lpm_policy HACMPcluster
Tue Jan 26 10:57:08 UTC 2016 cl_dr: lpm_policy='UNMANAGE'
...
Tue Jan 26 10:57:09 UTC 2016 cl_dr: Node = AIX720_LPM1, state = NORMAL
Tue Jan 26 10:57:09 UTC 2016 cl_dr: Stop cluster services
Tue Jan 26 10:57:09 UTC 2016 cl_dr: LC_ALL=C clmgr stop node AIX720_LPM1 WHEN=now MANAGE=unmanage
...
"AIX720_LPM1" is now unmanaged.
...
--> Add an entry in /etc/inittab to ensure that PowerHA returns to 'manage resource group' status if the node crashes unexpectedly
Tue Jan 26 10:57:23 UTC 2016 cl_dr: Adding a temporary entry in /etc/inittab
Tue Jan 26 10:57:23 UTC 2016 cl_dr: lsitab hacmp_lpm
Tue Jan 26 10:57:23 UTC 2016 cl_dr: mkitab hacmp_lpm:2:once:/usr/es/sbin/cluster/utilities/cl_dr undopremigrate > /dev/null 2>&1
Tue Jan 26 10:57:23 UTC 2016 cl_dr: mkitab RC: 0
...
--> Stop RSCT cthags critical resource monitoring function (for two nodes)
Tue Jan 26 10:57:30 UTC 2016 cl_dr: Stopping RSCT Dead Man Switch on node 'AIX720_LPM1'
Tue Jan 26 10:57:30 UTC 2016 cl_dr: /usr/sbin/rsct/bin/dms/stopdms -s cthags
 
Dead Man Switch Disabled
DMS Re-arming Thread cancelled
 
Tue Jan 26 10:57:30 UTC 2016 cl_dr: stopdms RC: 0
Tue Jan 26 10:57:30 UTC 2016 cl_dr: Stopping RSCT Dead Man Switch on node 'AIX720_LPM2'
Tue Jan 26 10:57:30 UTC 2016 cl_dr: cl_rsh AIX720_LPM2 "LC_ALL=C lssrc -s cthags | grep -qw active"
Tue Jan 26 10:57:31 UTC 2016 cl_dr: cl_rsh AIX720_LPM2 lssrc RC: 0
Tue Jan 26 10:57:31 UTC 2016 cl_dr: cl_rsh AIX720_LPM2 "LC_ALL=C /usr/sbin/rsct/bin/dms/listdms -s cthags | grep -qw Enabled"
Tue Jan 26 10:57:31 UTC 2016 cl_dr: cl_rsh AIX720_LPM2 listdms RC: 0
Tue Jan 26 10:57:31 UTC 2016 cl_dr: cl_rsh AIX720_LPM2 "/usr/sbin/rsct/bin/dms/stopdms -s cthags"
 
Dead Man Switch Disabled
DMS Re-arming Thread cancelled
...
--> Change the CAA node_timeout parameter to 600s
Tue Jan 26 10:57:31 UTC 2016 cl_dr: clodmget -n -f lpm_node_timeout HACMPcluster
Tue Jan 26 10:57:31 UTC 2016 cl_dr: clodmget LPM node_timeout: 600
Tue Jan 26 10:57:31 UTC 2016 cl_dr: clctrl -tune -x node_timeout
Tue Jan 26 10:57:31 UTC 2016 cl_dr: clctrl CAA node_timeout: 30000
Tue Jan 26 10:57:31 UTC 2016 cl_dr: Changing CAA node_timeout to '600000'
Tue Jan 26 10:57:31 UTC 2016 cl_dr: clctrl -tune -o node_timeout=600000
...
--> Disable CAA SAN heartbeating (for two nodes)
Tue Jan 26 10:57:32 UTC 2016 cl_dr: cl_rsh AIX720_LPM1 "LC_ALL=C echo sfwcom >> /etc/cluster/ifrestrict"
Tue Jan 26 10:57:32 UTC 2016 cl_dr: cl_rsh to node AIX720_LPM1 completed, RC: 0
Tue Jan 26 10:57:32 UTC 2016 cl_dr: clusterconf
Tue Jan 26 10:57:32 UTC 2016 cl_dr: clusterconf completed, RC: 0
...
Tue Jan 26 10:57:32 UTC 2016 cl_dr: cl_rsh AIX720_LPM2 "LC_ALL=C echo sfwcom >> /etc/cluster/ifrestrict"
Tue Jan 26 10:57:33 UTC 2016 cl_dr: cl_rsh to node AIX720_LPM2 completed, RC: 0
Tue Jan 26 10:57:33 UTC 2016 cl_dr: clusterconf
Tue Jan 26 10:57:33 UTC 2016 cl_dr: clusterconf completed, RC: 0
...
Example 8-21 shows information in the post-migration operation.
Example 8-21 Log file of post-migration operation
--> Change PowerHA service back to normal status
Tue Jan 26 10:57:52 UTC 2016 cl_2dr: POST_MIGRATE entered
Tue Jan 26 10:57:52 UTC 2016 cl_2dr: clodmget -n -f lpm_policy HACMPcluster
Tue Jan 26 10:57:52 UTC 2016 cl_2dr: lpm_policy='UNMANAGE'
Tue Jan 26 10:57:52 UTC 2016 cl_2dr: grep -w node_state /var/hacmp/cl_dr.state | cut -d'=' -f2
Tue Jan 26 10:57:52 UTC 2016 cl_2dr: Previous state = NORMAL
Tue Jan 26 10:57:52 UTC 2016 cl_2dr: Restarting cluster services
Tue Jan 26 10:57:52 UTC 2016 cl_2dr: LC_ALL=C clmgr start node AIX720_LPM1 WHEN=now MANAGE=auto
AIX720_LPM1: start_cluster: Starting PowerHA SystemMirror
...
"AIX720_LPM1" is now online.
...
--> Remove the entry from /etc/inittab that was added in the pre-migration operation
Tue Jan 26 11:00:27 UTC 2016 cl_2dr: lsitab hacmp_lpm
Tue Jan 26 11:00:27 UTC 2016 cl_2dr: Removing the temporary entry from /etc/inittab
Tue Jan 26 11:00:27 UTC 2016 cl_2dr: rmitab hacmp_lpm
...
--> Enable RSCT cthags critical resource monitoring function (for two nodes)
Tue Jan 26 10:58:21 UTC 2016 cl_2dr: LC_ALL=C lssrc -s cthags | grep -qw active
Tue Jan 26 10:58:21 UTC 2016 cl_2dr: lssrc RC: 0
Tue Jan 26 10:58:21 UTC 2016 cl_2dr: grep -w RSCT_local_DMS_state /var/hacmp/cl_dr.state | cut -d'=' -f2
Tue Jan 26 10:58:22 UTC 2016 cl_2dr: previous RSCT DMS state = Enabled
Tue Jan 26 10:58:22 UTC 2016 cl_2dr: Restarting RSCT Dead Man Switch on node 'AIX720_LPM1'
Tue Jan 26 10:58:22 UTC 2016 cl_2dr: /usr/sbin/rsct/bin/dms/startdms -s cthags
 
Dead Man Switch Enabled
DMS Re-arming Thread created
Tue Jan 26 10:58:22 UTC 2016 cl_2dr: startdms RC: 0
Tue Jan 26 10:58:22 UTC 2016 cl_2dr: cl_rsh AIX720_LPM2 lssrc RC: 0
Tue Jan 26 10:58:22 UTC 2016 cl_2dr: grep -w RSCT_peer_DMS_state /var/hacmp/cl_dr.state | cut -d'=' -f2
Tue Jan 26 10:58:22 UTC 2016 cl_2dr: previous RSCT Dead Man Switch on node 'AIX720_LPM2' = Enabled
Tue Jan 26 10:58:22 UTC 2016 cl_2dr: Restarting RSCT Dead Man Switch on node 'AIX720_LPM2'
Tue Jan 26 10:58:22 UTC 2016 cl_2dr: cl_rsh AIX720_LPM2 "/usr/sbin/rsct/bin/dms/startdms -s cthags"
 
Dead Man Switch Enabled
DMS Re-arming Thread created
...
--> Restore CAA node_timeout value
Tue Jan 26 10:58:22 UTC 2016 cl_2dr: previous CAA node timeout = 30000
Tue Jan 26 10:58:22 UTC 2016 cl_2dr: Restoring CAA node_timeout to '30000'
Tue Jan 26 10:58:22 UTC 2016 cl_2dr: clctrl -tune -o node_timeout=30000
smcaactrl:0:[182](0.009): Running smcaactrl at Tue Jan 26 10:58:22 UTC 2016 with the following parameters:
-O MOD_TUNE -P CHECK -T 2 -c 7ae36082-c418-11e5-8039-fa976d972a20 -t 7ae36082-c418-11e5-8039-fa976d972a20,LPMCluster,0 -i -v node_timeout,600000
...
--> Enable SAN heartbeating (for two nodes)
Tue Jan 26 11:00:26 UTC 2016 cl_2dr: cl_rsh AIX720_LPM1 "if [ -s /var/hacmp/ifrestrict ]; then mv /var/hacmp/ifrestrict /etc/cluster/ifrestrict; else rm -f /etc/cluster/ifrestrict
; fi"
Tue Jan 26 11:00:26 UTC 2016 cl_2dr: cl_rsh to node AIX720_LPM1 completed, RC: 0
Tue Jan 26 11:00:26 UTC 2016 cl_2dr: cl_rsh AIX720_LPM2 "if [ -s /var/hacmp/ifrestrict ]; then mv /var/hacmp/ifrestrict /etc/cluster/ifrestrict; else rm -f /etc/cluster/ifrestrict
; fi"
Tue Jan 26 11:00:26 UTC 2016 cl_2dr: cl_rsh to node AIX720_LPM2 completed, RC: 0
Tue Jan 26 11:00:26 UTC 2016 cl_2dr: clusterconf
Tue Jan 26 11:00:27 UTC 2016 cl_2dr: clusterconf completed, RC: 0
Tue Jan 26 11:00:27 UTC 2016 cl_2dr: Launch the SAN communication reconfiguration in background.
...