Planning considerations
This chapter provides information to help you plan the implementation of IBM PowerHA SystemMirror.
This chapter covers the following topics:
Introduction
Cluster Aware AIX repository disk
Important considerations for virtual input/output server
Network considerations
Network File System tie breaker
3.1 Introduction
There are many different ways to build a highly available environment. This chapter describes a small subset.
3.1.1 Mirrored architecture
In a mirrored architecture, you have identical or nearly identical physical components in each part of the data center. You can have this type of setup in a single room (although this is not recommended), in different rooms in the same building, or in different buildings. The distance between the parts can range from a few kilometers up to 50 km or more, depending on the application latency requirements.
Figure 3-1 shows a high-level diagram of a cluster. In this example, there are two networks, two managed systems, two Virtual Input/Output Servers (VIOS) per managed system, and two storage subsystems. This example also uses Logical Volume Manager (LVM) mirroring to maintain a complete copy of the data within each storage subsystem.
This example also has a logical unit number (LUN) for the Cluster Aware AIX (CAA) repository disk on each storage subsystem. For details about how to set up the CAA repository disk, see 3.2, “Cluster Aware AIX repository disk” on page 36.
Figure 3-1 Cluster with multiple storage subsystems
3.1.2 Single storage architecture
In a single storage architecture, the storage is shared by both the primary and backup logical partition (LPAR). This solution can be used when there are lower availability requirements for the data, and is not uncommon when the LPARs are in the same location.
When it is possible to use a storage-based mirroring feature, such as IBM SAN Volume Controller or a SAN Volume Controller stretched cluster, the layout looks, from a physical point of view, identical or nearly identical to the mirrored architecture that is described in 3.1.1, “Mirrored architecture” on page 32. However, from an AIX and cluster point of view, it is a single storage architecture because AIX is aware of only a single set of LUNs. For more information about the layout in a SAN Volume Controller stretched cluster, see 3.1.3, “Stretched cluster” on page 33.
Figure 3-2 shows such a layout from a logical point of view.
Figure 3-2 Cluster with single storage subsystem
3.1.3 Stretched cluster
A stretched cluster involves separating the cluster nodes into sites. A site can be in a different building within a campus, or it can be separated by a few kilometers. In this configuration, a storage area network (SAN) spans the sites, and storage can be presented across sites.
As with any multi-site cluster, Transmission Control Protocol/Internet Protocol (TCP/IP) communications are essential. Multiple links and routes are suggested so that communication between sites is maintained even if a single network component or path fails.
Another main concern is having redundant storage and verifying that the data within the storage devices is synchronized across sites. The following section presents a method for synchronizing the shared data.
SAN Volume Controller in a stretched configuration
The SAN Volume Controller can be configured in a stretched configuration. In this configuration, the SAN Volume Controller presents two storage devices that are separated by distance as though they are a single SAN Volume Controller device. The SAN Volume Controller keeps the data between the sites consistent through its disk mirroring technology.
The SAN Volume Controller in a stretched configuration allows the PowerHA cluster to provide continuous availability of the storage LUNs even if there is a single component failure anywhere in the storage environment. With this combination, the behavior of the cluster is similar, in terms of function and failure scenarios, to a local cluster (Figure 3-3).
Figure 3-3 SAN Volume Controller stretched configuration
3.1.4 Linked cluster
A linked cluster is another type of cluster that involves multiple sites. In this case, there is no SAN across sites because the distance between sites is often too far or the expense is too great. In this configuration, the repository disk is mirrored across the Internet Protocol network. Each site has its own copy of the repository disk and PowerHA keeps those disks synchronized.
TCP/IP communications are essential, and multiple links and routes are suggested so that communication between sites is maintained even if a single network component or path fails.
For more information about linked clusters, see IBM PowerHA SystemMirror 7.1.2 Enterprise Edition for AIX, SG24-8106.
IBM supported storage that uses copy services
There are several IBM-supported storage devices with copy services capabilities. For the following example, we use one of these devices, the SAN Volume Controller, which can replicate data across long distances with its copy services functions. The data can be replicated in synchronous or asynchronous mode, where synchronous mode provides the most up-to-date data redundancy.
For data replication in synchronous mode, where the write must complete at both sites before an acknowledgment is sent to the application, the distance can greatly affect application performance. Synchronous mode is commonly used for distances of 100 km or less, and asynchronous mode is often used for distances over 100 km. However, these are common baseline recommendations.
If there is a failure that requires moving the workload to the remaining site, PowerHA interacts directly with the storage to switch the direction of the replication. PowerHA then makes the LUNs read/write capable and varies on the appropriate volume groups (VGs) to activate the application on the remaining site.
An example of this concept is shown in Figure 3-4.
Figure 3-4 PowerHA and SAN Volume Controller storage replication
3.2 Cluster Aware AIX repository disk
CAA uses a shared disk to store its cluster configuration information. You must have at least 512 MB and no more than 460 GB of disk space that is allocated for the cluster repository disk. This feature requires that a dedicated shared disk is available to all nodes that are part of the cluster. This disk cannot be used for application storage or any other purpose.
The amount of configuration information that is stored on this repository disk directly depends on the number of cluster entities, such as shared disks, number of nodes, and number of adapters in the environment. You must ensure that you have enough space for the following components when you determine the size of a repository disk:
Node-to-node communication
CAA Cluster topology management
All migration processes
The preferred size for the repository disk in a two-node cluster is 1 GB.
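Before you assign the disk, you can confirm a candidate LUN's size directly from AIX. The following is a minimal sketch that assumes hdisk3 is the candidate repository disk; adjust the disk name for your environment.

# Display the size (in MB) of the candidate repository disk.
bootinfo -s hdisk3

# Confirm that the disk is not already assigned to a volume group.
lspv | grep hdisk3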
3.2.1 Preparing for a Cluster Aware AIX repository disk
A common way to protect the repository disk is to use storage-based mirroring or RAID. One example is the one that is described in 3.1.2, “Single storage architecture” on page 32. In this example, you must make sure that the LUN for the CAA repository disk is visible on all cluster nodes, and that there is a physical volume identifier (PVID) that is assigned to it.
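The following is a minimal sketch of how you might verify the PVID on a node; it assumes hdisk3 is the intended repository LUN, and it must be repeated on every cluster node.

# Assign a PVID to the disk if it does not have one yet.
chdev -l hdisk3 -a pv=yes

# Verify the PVID; the same PVID must be reported on all cluster nodes.
lspv | grep hdisk3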
If you have a multi-storage environment, such as the one that is described in 3.1.1, “Mirrored architecture” on page 32, then see 3.2.2, “Cluster Aware AIX with multiple storage devices” on page 36.
 
Important: Mirroring the repository disk by using LVM is not supported.
3.2.2 Cluster Aware AIX with multiple storage devices
The description here is related to the architecture that is described in 3.1.1, “Mirrored architecture” on page 32. This example uses one backup CAA repository disk. The maximum number of backup disks that you can define is six.
If you plan to use one or more disks as potential backup disks for the CAA repository, it is a preferred practice to rename the disks, as described in “Renaming the hdisk” on page 38. However, this is not possible in all cases.
 
Important: Third-party MultiPath I/O (MPIO) management software, such as EMC PowerPath, uses disk mapping to manage multi-paths. These software programs typically have a disk definition at a higher level, and path-specific disks underneath. Also, these software programs typically use special naming conventions.
Renaming these types of disks by using the AIX rendev command can confuse the third-party MPIO software and create disk-related issues. For more information about any disk renaming tool that is available as part of the vendor’s software kit, see your vendor documentation.
The examples in this section mainly use smitty sysmirror. Using the clmgr command line can be faster, but it can be harder for a novice to use. The examples use the clmgr command where it makes sense or where it is the only option.
Using the standard hdisk name
A current drawback of having multiple LUNs that can be used as repository disks is that they are not clearly identified as such in the lspv output. In this example, hdisk3 and hdisk4 are the LUNs that are prepared for the primary and backup CAA repository disks, and hdisk1 and hdisk2 are used for the application. Example 3-1 shows the output of the lspv command before starting the configuration.
Example 3-1 The lspv output before configuring Cluster Aware AIX
# lspv
hdisk0          00f71e6a059e7e1a                    rootvg          active
hdisk1          00c3f55e34ff43cc                    None
hdisk2          00c3f55e34ff433d                    None
hdisk3          00f747c9b40ebfa5                    None
hdisk4          00f747c9b476a148                    None
hdisk5          00f71e6a059e701b                    rootvg          active
#
After selecting hdisk3 as the CAA repository disk, synchronizing and creating the cluster, and creating the application VG, you get the output that is listed in Example 3-2. The commands that are used for this example are the following ones:
clmgr add cluster test_cl
clmgr sync cluster
As shown in Example 3-2, the problem is that the lspv command does not show that hdisk4 is reserved as the backup disk for the CAA repository.
Example 3-2 The lspv output after configuring Cluster Aware AIX
# lspv
hdisk0          00f71e6a059e7e1a                    rootvg          active
hdisk1          00c3f55e34ff43cc                    testvg
hdisk2          00c3f55e34ff433d                    testvg
hdisk3          00f747c9b40ebfa5                    caavg_private   active
hdisk4          00f747c9b476a148                    None
hdisk5          00f71e6a059e701b                    rootvg          active
#
To see which disk is reserved as a backup disk, use the clmgr -v query repository command or the odmget HACMPsircol command. Example 3-3 shows the output of the clmgr command, and Example 3-4 on page 38 shows the output of the odmget command.
Example 3-3 The clmgr -v query repository output
# clmgr -v query repository
NAME="hdisk3"
NODE="c2n1"
PVID="00f747c9b40ebfa5"
UUID="12d1d9a1-916a-ceb2-235d-8c2277f53d06"
BACKUP="0"
TYPE="mpioosdisk"
DESCRIPTION="MPIO IBM 2076 FC Disk"
SIZE="1024"
AVAILABLE="512"
CONCURRENT="true"
ENHANCED_CONCURRENT_MODE="true"
STATUS="UP"
 
NAME="hdisk4"
NODE="c2n1"
PVID="00f747c9b476a148"
UUID="c961dda2-f5e6-58da-934e-7878cfbe199f"
BACKUP="1"
TYPE="mpioosdisk"
DESCRIPTION="MPIO IBM 2076 FC Disk"
SIZE="1024"
AVAILABLE="95808"
CONCURRENT="true"
ENHANCED_CONCURRENT_MODE="true"
STATUS="BACKUP"
#
In the clmgr output, you can directly see the hdisk name. The odmget command output (Example 3-4) lists only the PVIDs.
Example 3-4 The odmget HACMPsircol output
# odmget HACMPsircol
 
HACMPsircol:
name = "c2n1_cluster_sircol"
id = 0
uuid = "0"
ip_address = ""
repository = "00f747c9b40ebfa5"
backup_repository = "00f747c9b476a148"
#
Renaming the hdisk
To get around the issues that are mentioned in “Using the standard hdisk name” on page 37, rename the hdisks. The advantage is that it is much easier to see which disk is reserved as the CAA repository disk.
There are some points to consider:
Generally, you can use any name, but if it gets too long, you can experience some administration issues.
The name must be unique.
It is preferable not to have the string “disk” as part of the name because some scripts or tools might search for that string.
You must manually rename the hdisks on all cluster nodes.
 
Important: Third-party MultiPath I/O (MPIO) management software, such as EMC PowerPath, uses disk mapping to manage multi-paths. These software programs typically have a disk definition at a higher level, and path-specific disks underneath. Also, these software programs typically use special naming conventions.
Renaming these types of disks by using the AIX rendev command can confuse the third-party MPIO software and create disk-related issues. For more information about any disk renaming tool that is available as part of the vendor’s software kit, see your vendor documentation.
Using a long name
First, we test by using a longer and more descriptive name. Example 3-5 shows the output of the lspv command before we started.
Example 3-5 The lspv output before using rendev
# lspv
hdisk0          00f71e6a059e7e1a                    rootvg          active
hdisk1          00c3f55e34ff43cc                    None
hdisk2          00c3f55e34ff433d                    None
hdisk3          00f747c9b40ebfa5                    None
hdisk4          00f747c9b476a148                    None
hdisk5          00f71e6a059e701b                    rootvg          active
#
Initially we decide to use a longer name (caa_reposX). Example 3-6 shows what we did and what the lspv command output looks like afterward.
 
Important: Remember to do the same on all cluster nodes.
Example 3-6 The lspv output after using rendev (using a long name)
# rendev -l hdisk3 -n caa_repos0
# rendev -l hdisk4 -n caa_repos1
# lspv
hdisk0          00f71e6a059e7e1a                    rootvg          active
hdisk1          00c3f55e34ff43cc                    None
hdisk2          00c3f55e34ff433d                    None
caa_repos0      00f747c9b40ebfa5                    None
caa_repos1      00f747c9b476a148                    None
hdisk5          00f71e6a059e701b                    rootvg          active
#
Next, configure the cluster by using SMIT. Pressing F4 to select the CAA repository disk returns the panel that is shown in Figure 3-5. As you can see, only the first part of the name is displayed, so the only way to identify the disk is to check the PVID.
                   Define Repository and Cluster IP Address
 
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
 
[Entry Fields]
* Cluster Name c2n1_cluster
* Heartbeat Mechanism Unicast +
* Repository Disk [] +
Cluster Multicast Address []
+--------------------------------------------------------------------------+
| Repository Disk |
| |
| Move cursor to desired item and press Enter. |
| |
| caa_rep (00f747c9b40ebfa5) on all cluster nodes                        |
| caa_rep (00f747c9b476a148) on all cluster nodes |
| hdisk1 (00c3f55e34ff43cc) on all cluster nodes |
| hdisk2 (00c3f55e34ff433d) on all cluster nodes |
| |
| F1=Help F2=Refresh F3=Cancel |
F1| F8=Image F10=Exit Enter=Do |
F5| /=Find n=Find Next |
F9+--------------------------------------------------------------------------+
Figure 3-5 SMIT panel that uses long repository disk names
Using a short name
In this case, a short name means a name with a maximum of seven characters. We use the same starting point, as listed in Example 3-5 on page 39. This time, we decide to use a shorter name (caa_rX). Example 3-7 shows what we did and what the lspv command output looks like afterward.
 
Important: Remember to do the same on all cluster nodes.
Example 3-7 The lspv output after using rendev (using a short name)
# rendev -l hdisk3 -n caa_r0
# rendev -l hdisk4 -n caa_r1
# lspv
hdisk0          00f71e6a059e7e1a                    rootvg          active
hdisk1          00c3f55e34ff43cc                    None
hdisk2          00c3f55e34ff433d                    None
caa_r0          00f747c9b40ebfa5                    None
caa_r1          00f747c9b476a148                    None
hdisk5          00f71e6a059e701b                    rootvg          active
#
Now, we start configuring the cluster by using SMIT. Pressing F4 to select the CAA repository disk returns the panel that is shown in Figure 3-6. As you can see, the full name is now displayed.
                   Define Repository and Cluster IP Address
 
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
 
[Entry Fields]
* Cluster Name c2n1_cluster
* Heartbeat Mechanism Unicast +
* Repository Disk [] +
Cluster Multicast Address []
+--------------------------------------------------------------------------+
| Repository Disk |
| |
| Move cursor to desired item and press Enter. |
| |
| caa_r0  (00f747c9b40ebfa5) on all cluster nodes                        |
| caa_r1  (00f747c9b476a148) on all cluster nodes |
| hdisk1 (00c3f55e34ff43cc) on all cluster nodes |
| hdisk2 (00c3f55e34ff433d) on all cluster nodes |
| |
| F1=Help F2=Refresh F3=Cancel |
F1| F8=Image F10=Exit Enter=Do |
F5| /=Find n=Find Next |
F9+--------------------------------------------------------------------------+
Figure 3-6 SMIT panel that uses short names
3.3 Important considerations for virtual input/output server
This section lists some new features of AIX and Virtual I/O Server (VIOS) that help to increase overall availability and that are especially suggested for PowerHA environments.
3.3.1 Using poll_uplink
To use the poll_uplink option, you must have the following versions and settings:
VIOS 2.2.3.4 or later installed on all related VIOS partitions.
The LPAR must be at AIX 7.1 TL3 or later, or AIX 6.1 TL9 or later.
The option poll_uplink must be set on the virtual entX interfaces of the LPAR.
The option poll_uplink can be defined directly on the virtual interface if you are using shared Ethernet adapter (SEA) fallover or the Etherchannel device that points to the virtual interfaces. To enable poll_uplink, use the following command:
chdev -l entX -a poll_uplink=yes -P
 
Important: You must restart the LPAR to activate poll_uplink.
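The following sketch consolidates these steps for an LPAR with two virtual adapters. It assumes that ent0 and ent1 are the virtual interfaces and that the LPAR can be restarted immediately; adjust the interface names and polling interval for your environment.

# Enable uplink polling on the virtual adapters (takes effect at the next restart because of -P).
chdev -l ent0 -a poll_uplink=yes -a poll_uplink_int=1000 -P
chdev -l ent1 -a poll_uplink=yes -a poll_uplink_int=1000 -P

# Restart the LPAR to activate the new settings.
shutdown -Fr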
Figure 3-7 shows an overview of how the option works. In production environments, you normally have at least two physical interfaces on the VIOS, and you can also use a dual-VIOS setup. In a multiple physical interface environment, the virtual link is reported as down only when all physical connections on the VIOS for this SEA are down.
Figure 3-7 Using poll_uplink
The following settings are possible for poll_uplink:
poll_uplink (yes, no)
poll_uplink_int (100 milliseconds (ms) - 5000 ms)
To display the settings, use the lsattr -El entX command. Example 3-8 shows the default settings for poll_uplink.
Example 3-8 The lsattr details for poll_uplink
# lsdev -Cc adapter | grep ^ent
ent0 Available  Virtual I/O Ethernet Adapter (l-lan)
ent1 Available  Virtual I/O Ethernet Adapter (l-lan)
# lsattr -El ent0 | grep "poll_up"
poll_uplink     no    Enable Uplink Polling             True
poll_uplink_int 1000  Time interval for Uplink Polling  True
#
If your LPAR is at least AIX 7.1 TL3 SP3, or AIX 6.1 TL9 SP3 or later, you can use the entstat command to check for the poll_uplink status and if it is enabled. Example 3-9 shows an excerpt of the entstat command output in an LPAR where poll_uplink is not enabled (set to no).
Example 3-9 Using poll_uplink=no
# entstat -d ent0
--------------------------------------------------
ETHERNET STATISTICS (en0) :
Device Type: Virtual I/O Ethernet Adapter (l-lan)
...
General Statistics:
-------------------
No mbuf Errors: 0
Adapter Reset Count: 0
Adapter Data Rate: 20000
Driver Flags: Up Broadcast Running
Simplex 64BitSupport ChecksumOffload
DataRateSet VIOENT
...
LAN State: Operational
...
#
Compared to Example 3-9, Example 3-10 shows the entstat command output on a system where poll_uplink is enabled and where all physical links that are related to this virtual interface are up. The following additional content is displayed:
VIRTUAL_PORT
PHYS_LINK_UP
Bridge Status: Up
Example 3-10 Using poll_uplink=yes when physical link is up
# entstat -d ent0
--------------------------------------------------
ETHERNET STATISTICS (en0) :
Device Type: Virtual I/O Ethernet Adapter (l-lan)
...
General Statistics:
-------------------
No mbuf Errors: 0
Adapter Reset Count: 0
Adapter Data Rate: 20000
Driver Flags: Up Broadcast Running
Simplex 64BitSupport ChecksumOffload
DataRateSet VIOENT VIRTUAL_PORT
PHYS_LINK_UP
...
LAN State: Operational
Bridge Status: Up
...
#
When all physical links on the VIOS are down, the output that is listed in Example 3-11 is displayed. The text PHYS_LINK_UP is no longer displayed, and the Bridge Status changes from Up to Unknown.
Example 3-11 Using poll_uplink=yes when physical link is down
# entstat -d ent0
--------------------------------------------------
ETHERNET STATISTICS (en0) :
Device Type: Virtual I/O Ethernet Adapter (l-lan)
...
General Statistics:
-------------------
No mbuf Errors: 0
Adapter Reset Count: 0
Adapter Data Rate: 20000
Driver Flags: Up Broadcast Running
Simplex 64BitSupport ChecksumOffload
DataRateSet VIOENT VIRTUAL_PORT
...
LAN State: Operational
Bridge Status: Unknown
...
#
3.3.2 Advantages for PowerHA when poll_uplink is used
In PowerHA V7, network down detection is performed by CAA. By default, CAA checks for IP traffic and for the link status of an interface. Therefore, using poll_uplink is advised for PowerHA LPARs because it helps the system make a better decision about whether a given interface is up or down.
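Based on the entstat output that is described in 3.3.1, “Using poll_uplink”, you can check from within the LPAR whether the physical uplink behind a virtual adapter is reported as up. This is a minimal sketch, assuming that ent0 is the virtual adapter and that poll_uplink is already enabled.

# Report the uplink state of the virtual adapter ent0.
if entstat -d ent0 | grep -q PHYS_LINK_UP; then
    echo "ent0: physical uplink is up"
else
    echo "ent0: physical uplink is down (or poll_uplink is not enabled)"
fi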
3.4 Network considerations
This section focuses on the network considerations from a PowerHA point of view only. From this point of view, it does not matter if you have virtual or physical network devices.
3.4.1 Dual-adapter networks
This type of network has historically been the most common since the inception of PowerHA. However, with virtualization, it was largely replaced by single-adapter network solutions, where the “single” adapter is made redundant by using Etherchannel, often combined with SEA.
In PowerHA V7.1, this solution can still be used, but it is not recommended. The cross-adapter checking logic is not implemented in PowerHA V7. The advantage of not having this feature is that PowerHA V7.1 and later versions do not require IP source routing to be enabled.
When you use a dual-adapter network in PowerHA V7.1 or later, you must also use the netmon.cf file in a similar way as for a single-adapter layout. In this case, the netmon.cf file must have a path defined for all potential enX interfaces, as shown in the example that follows.
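The following is a hedged sketch of what such a netmon.cf file (typically /usr/es/sbin/cluster/netmon.cf) might contain. The interface names and target addresses are placeholders: each !REQD line tells netmon that the named interface is considered up only if the target address can be reached, so the targets must be reachable hosts outside the cluster, such as the default gateways. Verify the exact netmon.cf syntax for your PowerHA and RSCT level.

!REQD en0 192.168.100.1
!REQD en1 192.168.200.1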
3.4.2 Single-adapter network
When we describe a single-adapter network, it is from a PowerHA point of view. In a highly available environment, you must always have redundant ways to access the network. This is commonly done today by using SEA fallover, or by using Etherchannel Link Aggregation or Network Interface Backup (NIB). The Etherchannel NIB-based solution can be used in both scenarios, with virtual adapters or with physical adapters. The Etherchannel Link Aggregation-based solution can be used only if you have direct-attached adapters.
 
Note: With a single adapter, you use the SEA fallover or the Etherchannel fallover.
This approach simplifies the setup from a TCP/IP point of view, and it also reduces the content of the netmon.cf file. However, the netmon.cf file must still be used.
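As an illustration of the Etherchannel NIB option that is mentioned in this section, the following hedged sketch creates an Etherchannel device with one primary and one backup adapter. It assumes that ent0 and ent1 are the available adapters and that 192.168.100.1 is a pingable address that is used to detect a path failure; verify the exact device type and attribute names (smitty etherchannel shows them) for your AIX level.

# Create an Etherchannel in Network Interface Backup (NIB) mode:
# ent0 is the primary adapter, ent1 is the backup adapter, and
# netaddr is the address that is pinged to verify the active path.
mkdev -c adapter -s pseudo -t ibm_ech \
      -a adapter_names=ent0 -a backup_adapter=ent1 \
      -a netaddr=192.168.100.1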
3.5 Network File System tie breaker
This section describes the Network File System (NFS) tie breaker.
3.5.1 Introduction and concepts
The NFS tie-breaker function represents an extension of the previously introduced disk tie-breaker feature that relied on a Small Computer System Interface (SCSI) disk that is accessible to all nodes in a PowerHA cluster. The differences between the protocols that are used for accessing the tie-breaker (SCSI disk or NFS-mounted file) favor the NFS-based solution for linked clusters.
Split-brain situation
A cluster split-brain event can occur when a group of nodes cannot communicate with the remaining nodes in a cluster. For example, in a two-site linked cluster, a split occurs if all communication links between the two sites fail. Depending on the communication network topology and the location of the interruption, a cluster split event splits the cluster into two (or more) partitions, each of them containing one or more cluster nodes. The resulting situation is commonly referred to as a split-brain situation.
In a split-brain situation, the two partitions have no knowledge of each other’s status, each of them considering the other as being offline. As a consequence, each partition tries to bring online the other partition’s resource groups (RGs), thus generating a high risk of data corruption on all shared disks. To prevent a split-brain situation, and subsequent potential data corruption, split and merge policies are available to be configured.
Tie breaker
The tie-breaker feature uses a tie-breaker resource to choose a surviving partition that continues to operate when a cluster split-brain event occurs. This feature prevents data corruption on the shared cluster disks. The tie breaker is identified either as a SCSI disk or an NFS-mounted file that must be accessible, under normal conditions, to all nodes in the cluster.
Split policy
When a split-brain situation occurs, each partition attempts to acquire the tie breaker by placing a lock on the tie-breaker disk or on the NFS file. The partition that first locks the SCSI disk or reserves the NFS file wins, and the other loses.
All nodes in the winning partition continue to process cluster events, and all nodes in the losing partition attempt to recover according to the defined split and merge action plan. This plan most often implies either the restart of the cluster nodes, or merely the restart of cluster services on those nodes.
Merge policy
There are situations in which, depending on the cluster split-brain policy, the cluster can have two partitions that run independently of each other. However, most often, it is a preferred practice to configure a merge policy that allows the partitions to operate together again after communications are restored between them.
In this second approach, when partitions that were part of the cluster are brought back online after the communication failure, they must be able to communicate with the partition that owns the tie-breaker disk or NFS file. If a partition that is brought back online cannot communicate with the tie-breaker disk or the NFS file, it does not join the cluster. The tie-breaker disk or NFS file is released when all nodes in the configuration rejoin the cluster.
The merge policy configuration, in this case an NFS-based tie breaker, must be of the same type as that for the split policy.
3.5.2 Test environment setup
The lab environment that we use to test the NFS tie-breaker function consists of a two-site linked cluster, each site having a single node with a common NFS-mounted resource, as shown in Figure 3-8.
Figure 3-8 NFS tie-breaker test environment
Because the goal was to test the NFS tie-breaker function as a method for handling split-brain situations, the additional local nodes in a linked multisite cluster were considered irrelevant, and therefore not included in the test setup. Each node had its own cluster repository disk (clnode_1r and clnode_2r), and both nodes shared a common cluster disk (clnode_12, which is the one that must be protected from data corruption that is caused by a split-brain situation), as shown in Example 3-12.
Example 3-12 List of physical volumes on both cluster nodes
clnode_1:/# lspv
clnode_1r 00f6f5d0f8c9fbf4 caavg_private active
clnode_12 00f6f5d0f8ca34ec datavg concurrent
hdisk0 00f6f5d09570f170 rootvg active
clnode_1:/#
 
clnode_2:/# lspv
clnode_2r 00f6f5d0f8ceed1a caavg_private active
clnode_12 00f6f5d0f8ca34ec datavg concurrent
hdisk0 00f6f5d09570f31b rootvg active
clnode_2:/#
To allow greater flexibility for our test scenarios, we chose to use different network adapters for the production traffic and the connectivity to the shared NFS resource. The network setup of the two nodes is shown in Example 3-13.
Example 3-13 Network settings for both cluster nodes
clnode_1:/# netstat -in | egrep "Name|en"
Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Coll
en0 1500 link#2 ee.af.e.90.ca.2 533916 0 566524 0 0
en0 1500 192.168.100 192.168.100.50 533916 0 566524 0 0
en0 1500 192.168.100 192.168.100.51 533916 0 566524 0 0
en1 1500 link#3 ee.af.e.90.ca.3 388778 0 457776 0 0
en1 1500 10 10.0.0.1 388778 0 457776 0 0
clnode_1:/#
 
clnode_2:/# netstat -in | egrep "Name|en"
Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Coll
en0 1500 link#2 ee.af.7.e3.9a.2 391379 0 278953 0 0
en0 1500 192.168.100 192.168.100.52 391379 0 278953 0 0
en1 1500 link#3 ee.af.7.e3.9a.3 385787 0 350121 0 0
en1 1500 10 10.0.0.2 385787 0 350121 0 0
clnode_2:/#
During the setup of the cluster, the NFS communication network (the en1 network adapters in Example 3-13) was discovered and automatically added to the cluster configuration as a heartbeat network (net_ether_02). However, we manually removed it afterward to prevent interference with the NFS tie-breaker tests, so the cluster eventually had only one heartbeat network: net_ether_01.
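As a hedged sketch (verify the exact clmgr syntax at your PowerHA level), removing the discovered network and checking the remaining topology might look as follows, assuming net_ether_02 is the name that discovery assigned to the NFS network:

# Remove the automatically discovered NFS network from the cluster configuration.
clmgr delete network net_ether_02

# Confirm that only net_ether_01 remains defined.
clmgr query network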
The final cluster topology was reported, as shown in Example 3-14.
Example 3-14 Cluster topology information
clnode_1:/# cltopinfo
Cluster Name: nfs_tiebr_cluster
Cluster Type: Linked
Heartbeat Type: Unicast
Repository Disks:
Site 1 (site1@clnode_1): clnode_1r
Site 2 (site2@clnode_2): clnode_2r
Cluster Nodes:
Site 1 (site1):
clnode_1
Site 2 (site2):
clnode_2
 
There are 2 node(s) and 1 network(s) defined
NODE clnode_1:
Network net_ether_01
clst_svIP 192.168.100.50
clnode_1 192.168.100.51
NODE clnode_2:
Network net_ether_01
clst_svIP 192.168.100.50
clnode_2 192.168.100.52
 
Resource Group rg_IHS
Startup Policy Online On Home Node Only
Fallover Policy Fallover To Next Priority Node In The List
Fallback Policy Never Fallback
Participating Nodes clnode_1 clnode_2
Service IP Label clst_svIP
clnode_1:/#
At the end of our environment preparation, the cluster was active. The RG, which contains an IBM Hypertext Transfer Protocol (HTTP) Server that is installed on the clnode_12 cluster disk in the datavg VG, was online, as shown in Example 3-15.
Example 3-15 Cluster nodes and resource groups status
clnode_1:/# clmgr -cv -a name,state,raw_state query node
# NAME:STATE:RAW_STATE
clnode_1:NORMAL:ST_STABLE
clnode_2:NORMAL:ST_STABLE
 
clnode_1:/#
clnode_1:/# clRGinfo
-----------------------------------------------------------------------------
Group Name Group State Node
-----------------------------------------------------------------------------
rg_IHS ONLINE clnode_1@site1
ONLINE SECONDARY clnode_2@site2
 
clnode_1:/#
3.5.3 NFS server and client configuration
An important prerequisite for deploying the NFS tie-breaker function is that it does not work with the more common NFS version 3.
 
Important: The NFS tie-breaker function requires NFS version 4.
Our test environment used an NFS server that is configured on an AIX 7.1 TL3 SP5 LPAR. This, of course, is not a requirement for deploying an NFS version 4 server.
A number of services must be active to allow NFSv4 communication between clients and servers:
On the NFS server:
 – biod
 – nfsd
 – nfsrgyd
 – portmap
 – rpc.lockd
 – rpc.mountd
 – rpc.statd
 – TCP
On the NFS client (all cluster nodes):
 – biod
 – nfsd
 – rpc.mountd
 – rpc.statd
 – TCP
Most of the previous services are usually active by default; particular attention is required for the setup of the nfsrgyd service. As mentioned previously, this daemon must be running on both the server and the clients (in our case, the two cluster nodes). This daemon provides a name conversion service for NFS servers and clients that use NFS V4.
Starting the nfsrgyd daemon requires that the local NFS domain is set. The local NFS domain is stored in the /etc/nfs/local_domain file and it can be set by using the chnfsdom command, as shown in Example 3-16.
Example 3-16 Setting the local NFS domain
nfsserver:/# chnfsdom nfs_local_domain
nfsserver:/# startsrc -g nfs
[...]
nfsserver:/# lssrc -g nfs
Subsystem Group PID Status
[...]
nfsrgyd nfs 7077944 active
[...]
nfsserver:#
In addition, on the server, you must specify the root node directory (what clients mount as /) and the public node directory from the command-line interface (CLI) by using the chnfs command, as shown in Example 3-17.
Example 3-17 Setting the root and public node directory
nfsserver:/# chnfs -r /nfs_root -p /nfs_root
nfsserver:/#
Alternatively, the root node directory, the public node directory, and the local NFS domain can be set with SMIT. Use the smit nfs command, follow the path Network File System (NFS) → Configure NFS on This System, and then select the corresponding option:
Change Version 4 Server Root Node
Change Version 4 Server Public Node
Configure NFS Local Domain → Change NFS Local Domain
As a final step for the NFS configuration, create the NFS resource, also known as the NFS export. Example 3-18 shows the NFS resource that was created by using SMIT by running the smit mknfs command.
Example 3-18 Creating an NFS v4 export
                               Add a Directory to Exports List
 
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
 
[Entry Fields]
* Pathname of directory to export [/nfs_root/nfs_tie_breaker] /
[...]
Public filesystem? no +
[...]
Allow access by NFS versions [4] +
[...]
* Security method 1 [sys,krb5p,krb5i,krb5,dh] +
* Mode to export directory read-write +
[...]
 
 
F1=Help F2=Refresh F3=Cancel F4=List
F5=Reset F6=Command F7=Edit F8=Image
F9=Shell F10=Exit Enter=Do
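Alternatively, the same export can be created from the command line. The following is a hedged sketch that uses the mknfsexp command; verify the flags against your AIX level before you use them.

# Export /nfs_root/nfs_tie_breaker read/write, NFS version 4 only, with sys security,
# and add the entry to /etc/exports so that it persists (-B).
mknfsexp -d /nfs_root/nfs_tie_breaker -v 4 -S sys -t rw -B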
Test the NFS configuration by manually mounting the NFS export on the clients, as shown in Example 3-19. The date column was removed from the output for clarity.
Example 3-19 Mounting an NFS v4 export
clnode_1:/# mount -o vers=4 nfsserver:/nfs_tie_breaker /mnt
clnode_1:/# mount | egrep "node|---|tie"
node       mounted    mounted over  vfs   options
--------   ----------------  ------------  ----  -------------------------------
nfsserver  /nfs_tie_breaker  /mnt          nfs4  vers=4,fg,soft,retry=1,timeo=10
clnode_1:/#
clnode_1:/# umount /mnt
clnode_1:/#
3.5.4 NFS tie-breaker configuration
The NFS tie-breaker function can be configured either with CLI commands or SMIT.
To configure the NFS tie breaker by using SMIT, complete the following steps:
1. The SMIT menu that enables the configuration of NFS Tie Breaker split policy can be accessed by following the path Custom Cluster Configuration → Cluster Nodes and Networks → Initial Cluster Setup (Custom) → Configure Cluster Split and Merge Policy.
2. Select Split Management Policy, as shown in Example 3-20.
Example 3-20 Configuring the split handling policy
Configure Cluster Split and Merge Policy
 
Move cursor to desired item and press Enter.
 
Split Management Policy
Merge Management Policy
Quarantine Policy
 
 
+-------------------------------------------------------------+
|                    Split Handling Policy                    |
| |
| Move cursor to desired item and press Enter. |
| |
| None |
| TieBreaker |
| Manual |
| |
| F1=Help F2=Refresh F3=Cancel |
| F8=Image F10=Exit Enter=Do |
F1=Help | /=Find n=Find Next |
F9=Shell +-------------------------------------------------------------+
3. Select TieBreaker to open the menu where you choose the method to use for tie breaking, as shown in Example 3-21.
Example 3-21 Selecting the tie-breaker type
Configure Cluster Split and Merge Policy
 
Move cursor to desired item and press Enter.
 
Split Management Policy
Merge Management Policy
Quarantine Policy
 
 
+-------------------------------------------------------------+
|                   Select TieBreaker Type                    |
| |
| Move cursor to desired item and press Enter. |
| |
| Disk |
| NFS |
| |
| F1=Help F2=Refresh F3=Cancel |
| F8=Image F10=Exit Enter=Do |
F1=Help | /=Find n=Find Next |
F9=Shell +-------------------------------------------------------------+
 
4. After selecting NFS as the method for tie breaking, specify the NFS export server, directory, and the local mount point, as shown in Example 3-22.
Example 3-22 Configuring the NFS tie breaker for split handling policy by using SMIT
                         NFS TieBreaker Configuration
 
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
 
[Entry Fields]
Split Handling Policy NFS
* NFS Export Server [nfsserver_nfs]
* Local Mount Directory [/nfs_tie_breaker]
* NFS Export Directory [/nfs_tie_breaker]
 
 
F1=Help F2=Refresh F3=Cancel F4=List
F5=Reset F6=Command F7=Edit F8=Image
F9=Shell F10=Exit Enter=Do
Split and merge policies must be of the same type, and the same rule applies for the tie-breaker type. Therefore, selecting the TieBreaker option for the Split Handling Policy field, and the NFS option for the TieBreaker type for that policy, implies also selecting those same options (TieBreaker and NFS) for the Merge Handling Policy:
1. Configure the merge policy. From the same SMIT menu (Custom Cluster Configuration → Cluster Nodes and Networks → Initial Cluster Setup (Custom) → Configure Cluster Split and Merge Policy), select the Merge Management Policy option (Example 3-23).
Example 3-23 Configuring the merge handling policy
                          Configure Cluster Split and Merge Policy
 
Move cursor to desired item and press Enter.
 
Split Management Policy
Merge Management Policy
Quarantine Policy
 
+-------------------------------------------------------------+
|                    Merge Handling Policy                  |
| |
| Move cursor to desired item and press Enter. |
| |
| Majority |
| TieBreaker |
| Manual |
| Priority |
| |
| F1=Help F2=Refresh F3=Cancel |
| F8=Image F10=Exit Enter=Do |
F1=Help | /=Find n=Find Next |
F9=Shell +-------------------------------------------------------------+
2. Selecting the option of TieBreaker opens the menu that is shown in Example 3-24, where we again choose NFS as the method to use for tie breaking.
Example 3-24 Configuring NFS tie breaker for merge handling policy with SMIT
                         NFS TieBreaker Configuration
 
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
 
[Entry Fields]
Merge Handling Policy NFS
* NFS Export Server [nfsserver_nfs]
* Local Mount Directory [/nfs_tie_breaker]
* NFS Export Directory [/nfs_tie_breaker]
 
 
F1=Help F2=Refresh F3=Cancel F4=List
F5=Reset F6=Command F7=Edit F8=Image
F9=Shell F10=Exit Enter=Do
Alternatively, both split and merge management policies can be configured by CLI by using the clmgr modify cluster SPLIT_POLICY=tiebreaker MERGE_POLICY=tiebreaker command followed by the cl_sm command, as shown in Example 3-25.
Example 3-25 Configuring the NFS tie breaker for the split and merge handling policy by using the CLI
clnode_1:/# /usr/es/sbin/cluster/utilities/cl_sm -s 'NFS' -k'nfsserver_nfs' -g'/nfs_tie_breaker' -p'/nfs_tie_breaker'
The PowerHA SystemMirror split and merge policies have been updated.
Current policies are:
Split Handling Policy : NFS
Merge Handling Policy : NFS
NFS Export Server :
nfsserver_nfs
Local Mount Directory :
/nfs_tie_breaker
NFS Export Directory :
/nfs_tie_breaker
Split and Merge Action Plan : Restart
The configuration must be synchronized to make this change known across the cluster.
clnode_1:/#
 
 
clnode_1:/# /usr/es/sbin/cluster/utilities/cl_sm -m 'NFS' -k'nfsserver_nfs' -g'/nfs_tie_breaker' -p'/nfs_tie_breaker'
The PowerHA SystemMirror split and merge policies have been updated.
Current policies are:
Split Handling Policy : NFS
Merge Handling Policy : NFS
NFS Export Server :
nfsserver_nfs
Local Mount Directory :
/nfs_tie_breaker
NFS Export Directory :
/nfs_tie_breaker
Split and Merge Action Plan : Restart
The configuration must be synchronized to make this change known across the cluster.
clnode_1:/#
At this point, a PowerHA cluster synchronization and restart and a CAA cluster restart are required. Complete the following steps (a consolidated command sketch follows the list):
1. Verify and synchronize the changes across the cluster either by using the SMIT menu (run the smit sysmirror command, then follow the path Cluster Applications and Resources → Resource Groups → Verify and Synchronize Cluster Configuration), or by the CLI by using the clmgr sync cluster command.
2. Stop cluster services for all nodes in the cluster by running the clmgr stop cluster command.
3. Stop the CAA daemon on all cluster nodes by running the stopsrc -s clconfd command.
4. Start the CAA daemon on all cluster nodes by running the startsrc -s clconfd command.
5. Start cluster services for all nodes in the cluster by running the clmgr start cluster command.
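The following is a consolidated sketch of this sequence as it is run from one cluster node, assuming that the stopsrc and startsrc commands are repeated on every node and that a brief cluster-wide outage is acceptable:

# 1. Verify and synchronize the cluster configuration.
clmgr sync cluster

# 2. Stop cluster services on all nodes.
clmgr stop cluster

# 3. Restart the CAA clconfd subsystem (run these two commands on every cluster node).
stopsrc -s clconfd
startsrc -s clconfd

# 4. Start cluster services on all nodes.
clmgr start cluster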
 
Important: Verify all output messages that are generated by the synchronization and restart of the cluster. An error that occurs when the NFS tie-breaker policies are activated does not necessarily produce an error in the overall result of the cluster synchronization action.
When all cluster nodes are synchronized and active, and the split and merge management policies are applied, the NFS resource is accessed by all nodes, as shown in Example 3-26 (the date column removed for clarity).
Example 3-26 Checking for the NFS export that is mounted on clients
clnode_1:/# mount | egrep "node|---|tie"
node     mounted    mounted over   vfs   options
-------------  ---------------  ---------------- ----  ----------------------
nfsserver_nfs  /nfs_tie_breaker  /nfs_tie_breaker  nfs4   vers=4,fg,soft,retry=1,timeo=10
clnode_1:/#
 
 
clnode_2:/# mount | egrep "node|---|tie"
node     mounted    mounted over   vfs   options
-------------  ---------------  ---------------- ----  ----------------------
nfsserver_nfs  /nfs_tie_breaker  /nfs_tie_breaker  nfs4   vers=4,fg,soft,retry=1,timeo=10
clnode_2:/#
3.5.5 NFS tie-breaker tests
A common method to simulate network connectivity loss is to use the ifconfig command to bring network interfaces down. Its effect is not persistent across restarts, so the restart that is induced by the NFS tie breaker has the expected recovery effect. The test scenarios that we used and the actual results that we obtained are presented in the following sections.
Loss of network communication to the NFS server
Because the NFS server resource is a secondary communication means (the primary one is the heartbeat network), the loss of communication between the cluster nodes and the NFS server did not have any visible result other than the expected log entries.
Loss of production/heartbeat network communication on standby node
The loss of the production heartbeat network communication on the standby node triggered no response because no RGs were online on that node at the time the simulated event occurred.
Loss of production heartbeat network communication on active node
The loss of the production heartbeat network communication on the active node triggered the expected fallover action. This occurred because the network service IP and the underlying network (as resources essential to the RG that was online until the simulated event) were no longer available.
This action can be seen on both nodes’ logs, as shown in the cluster.mmddyyy logs in Example 3-27, for the disconnected node (the one that releases the RG).
Example 3-27 The cluster.mmddyyy log for the node releasing the resource group
Nov 13 14:42:13 EVENT START: network_down clnode_1 net_ether_01
Nov 13 14:42:13 EVENT COMPLETED: network_down clnode_1 net_ether_01 0
Nov 13 14:42:13 EVENT START: network_down_complete clnode_1 net_ether_01
Nov 13 14:42:13 EVENT COMPLETED: network_down_complete clnode_1 net_ether_01 0
Nov 13 14:42:20 EVENT START: resource_state_change clnode_1
Nov 13 14:42:20 EVENT COMPLETED: resource_state_change clnode_1 0
Nov 13 14:42:20 EVENT START: rg_move_release clnode_1 1
Nov 13 14:42:20 EVENT START: rg_move clnode_1 1 RELEASE
Nov 13 14:42:20 EVENT START: stop_server app_IHS
Nov 13 14:42:20 EVENT COMPLETED: stop_server app_IHS 0
Nov 13 14:42:21 EVENT START: release_service_addr
Nov 13 14:42:22 EVENT COMPLETED: release_service_addr 0
Nov 13 14:42:25 EVENT COMPLETED: rg_move clnode_1 1 RELEASE 0
Nov 13 14:42:25 EVENT COMPLETED: rg_move_release clnode_1 1 0
Nov 13 14:42:27 EVENT START: rg_move_fence clnode_1 1
Nov 13 14:42:27 EVENT COMPLETED: rg_move_fence clnode_1 1 0
Nov 13 14:42:30 EVENT START: network_up clnode_1 net_ether_01
Nov 13 14:42:30 EVENT COMPLETED: network_up clnode_1 net_ether_01 0
Nov 13 14:42:31 EVENT START: network_up_complete clnode_1 net_ether_01
Nov 13 14:42:31 EVENT COMPLETED: network_up_complete clnode_1 net_ether_01 0
Nov 13 14:42:33 EVENT START: rg_move_release clnode_1 1
Nov 13 14:42:33 EVENT START: rg_move clnode_1 1 RELEASE
Nov 13 14:42:33 EVENT COMPLETED: rg_move clnode_1 1 RELEASE 0
Nov 13 14:42:33 EVENT COMPLETED: rg_move_release clnode_1 1 0
Nov 13 14:42:35 EVENT START: rg_move_fence clnode_1 1
Nov 13 14:42:36 EVENT COMPLETED: rg_move_fence clnode_1 1 0
Nov 13 14:42:38 EVENT START: rg_move_fence clnode_1 1
Nov 13 14:42:39 EVENT COMPLETED: rg_move_fence clnode_1 1 0
Nov 13 14:42:39 EVENT START: rg_move_acquire clnode_1 1
Nov 13 14:42:39 EVENT START: rg_move clnode_1 1 ACQUIRE
Nov 13 14:42:39 EVENT COMPLETED: rg_move clnode_1 1 ACQUIRE 0
Nov 13 14:42:39 EVENT COMPLETED: rg_move_acquire clnode_1 1 0
Nov 13 14:42:41 EVENT START: rg_move_complete clnode_1 1
Nov 13 14:42:42 EVENT COMPLETED: rg_move_complete clnode_1 1 0
Nov 13 14:42:46 EVENT START: rg_move_fence clnode_1 1
Nov 13 14:42:47 EVENT COMPLETED: rg_move_fence clnode_1 1 0
Nov 13 14:42:47 EVENT START: rg_move_acquire clnode_1 1
Nov 13 14:42:47 EVENT START: rg_move clnode_1 1 ACQUIRE
Nov 13 14:42:47 EVENT COMPLETED: rg_move clnode_1 1 ACQUIRE 0
Nov 13 14:42:47 EVENT COMPLETED: rg_move_acquire clnode_1 1 0
Nov 13 14:42:49 EVENT START: rg_move_complete clnode_1 1
Nov 13 14:42:53 EVENT COMPLETED: rg_move_complete clnode_1 1 0
Nov 13 14:42:55 EVENT START: resource_state_change_complete clnode_1
Nov 13 14:42:55 EVENT COMPLETED: resource_state_change_complete clnode_1 0
This action is also shown in Example 3-28 for the other node (the one that acquires the RG).
Example 3-28 The cluster.mmddyyy log for the node acquiring the resource group
Nov 13 14:42:13 EVENT START: network_down clnode_1 net_ether_01
Nov 13 14:42:13 EVENT COMPLETED: network_down clnode_1 net_ether_01 0
Nov 13 14:42:14 EVENT START: network_down_complete clnode_1 net_ether_01
Nov 13 14:42:14 EVENT COMPLETED: network_down_complete clnode_1 net_ether_01 0
Nov 13 14:42:20 EVENT START: resource_state_change clnode_1
Nov 13 14:42:20 EVENT COMPLETED: resource_state_change clnode_1 0
Nov 13 14:42:20 EVENT START: rg_move_release clnode_1 1
Nov 13 14:42:20 EVENT START: rg_move clnode_1 1 RELEASE
Nov 13 14:42:20 EVENT COMPLETED: rg_move clnode_1 1 RELEASE 0
Nov 13 14:42:20 EVENT COMPLETED: rg_move_release clnode_1 1 0
Nov 13 14:42:27 EVENT START: rg_move_fence clnode_1 1
Nov 13 14:42:29 EVENT COMPLETED: rg_move_fence clnode_1 1 0
Nov 13 14:42:31 EVENT START: network_up clnode_1 net_ether_01
Nov 13 14:42:31 EVENT COMPLETED: network_up clnode_1 net_ether_01 0
Nov 13 14:42:31 EVENT START: network_up_complete clnode_1 net_ether_01
Nov 13 14:42:31 EVENT COMPLETED: network_up_complete clnode_1 net_ether_01 0
Nov 13 14:42:33 EVENT START: rg_move_release clnode_1 1
Nov 13 14:42:33 EVENT START: rg_move clnode_1 1 RELEASE
Nov 13 14:42:34 EVENT COMPLETED: rg_move clnode_1 1 RELEASE 0
Nov 13 14:42:34 EVENT COMPLETED: rg_move_release clnode_1 1 0
Nov 13 14:42:36 EVENT START: rg_move_fence clnode_1 1
Nov 13 14:42:36 EVENT COMPLETED: rg_move_fence clnode_1 1 0
Nov 13 14:42:39 EVENT START: rg_move_fence clnode_1 1
Nov 13 14:42:39 EVENT COMPLETED: rg_move_fence clnode_1 1 0
Nov 13 14:42:39 EVENT START: rg_move_acquire clnode_1 1
Nov 13 14:42:39 EVENT START: rg_move clnode_1 1 ACQUIRE
Nov 13 14:42:39 EVENT COMPLETED: rg_move clnode_1 1 ACQUIRE 0
Nov 13 14:42:39 EVENT COMPLETED: rg_move_acquire clnode_1 1 0
Nov 13 14:42:42 EVENT START: rg_move_complete clnode_1 1
Nov 13 14:42:45 EVENT COMPLETED: rg_move_complete clnode_1 1 0
Nov 13 14:42:47 EVENT START: rg_move_fence clnode_1 1
Nov 13 14:42:47 EVENT COMPLETED: rg_move_fence clnode_1 1 0
Nov 13 14:42:47 EVENT START: rg_move_acquire clnode_1 1
Nov 13 14:42:47 EVENT START: rg_move clnode_1 1 ACQUIRE
Nov 13 14:42:49 EVENT START: acquire_takeover_addr
Nov 13 14:42:50 EVENT COMPLETED: acquire_takeover_addr 0
Nov 13 14:42:50 EVENT COMPLETED: rg_move clnode_1 1 ACQUIRE 0
Nov 13 14:42:50 EVENT COMPLETED: rg_move_acquire clnode_1 1 0
Nov 13 14:42:50 EVENT START: rg_move_complete clnode_1 1
Nov 13 14:42:50 EVENT START: start_server app_IHS
Nov 13 14:42:51 EVENT COMPLETED: start_server app_IHS 0
Nov 13 14:42:52 EVENT COMPLETED: rg_move_complete clnode_1 1 0
Nov 13 14:42:55 EVENT START: resource_state_change_complete clnode_1
Nov 13 14:42:55 EVENT COMPLETED: resource_state_change_complete clnode_1 0
Neither log includes split_merge_prompt, site_down, or node_down events.
Loss of all network communication on standby node
The loss of all network communications on the standby node, both the production heartbeat network and the connectivity to the NFS server, triggers a restart of that node. This is in accordance with the split and merge action plan that was defined earlier.
As a starting point, both nodes were operational and the RG was online on node clnode_1 (Example 3-29).
Example 3-29 The cluster nodes and resource group status before the simulated network down event
clnode_1:/# clmgr -cva name,state,raw_state query node
# NAME:STATE:RAW_STATE
clnode_1:NORMAL:ST_STABLE
clnode_2:NORMAL:ST_STABLE
clnode_1:/#
 
 
clnode_1:/# clRGinfo
-----------------------------------------------------------------------------
Group Name Group State Node
-----------------------------------------------------------------------------
rg_IHS ONLINE clnode_1@site1
ONLINE SECONDARY clnode_2@site2
clnode_1:/#
Complete the following steps:
1. Temporarily bring down the network interfaces on the standby node clnode_2, in a terminal console opened by using the Hardware Management Console (HMC), as shown in Example 3-30.
Example 3-30 Simulating a network down event
clnode_2:/# ifconfig en0 down; ifconfig en1 down
clnode_2:/#
2. Within about a minute of the previous step, as a response to the split-brain situation, the node clnode_2 (with no communication to the NFS server) restarted itself. This can be seen on the virtual terminal console opened (by using the HMC) on that node, and is also reflected by the status of the cluster nodes (Example 3-31).
Example 3-31 Cluster nodes status immediately after a simulated network down event
clnode_1:/# clmgr -cva name,state,raw_state query node
# NAME:STATE:RAW_STATE
clnode_1:NORMAL:ST_STABLE
clnode_2:UNKNOWN:UNKNOWN
clnode_1:/#
3. After a restart, the node clnode_2 was functional, but with cluster services stopped (Example 3-32).
Example 3-32 Cluster nodes and resource group status after node restart
clnode_1:/# clmgr -cva name,state,raw_state query node
# NAME:STATE:RAW_STATE
clnode_1:NORMAL:ST_STABLE
clnode_2:OFFLINE:ST_INIT
clnode_1:/#
 
 
clnode_2:/# clRGinfo
-----------------------------------------------------------------------------
Group Name Group State Node
-----------------------------------------------------------------------------
rg_IHS ONLINE clnode_1@site1
OFFLINE clnode_2@site2
clnode_2:/#
4. Manually start the services on the clnode_2 node (Example 3-33).
Example 3-33 Starting cluster services on the recently rebooted node
clnode_2:/# clmgr start node
[...]
clnode_2: Completed execution of /usr/es/sbin/cluster/etc/rc.cluster
clnode_2: with parameters: -boot -N -A -b -P cl_rc_cluster.
clnode_2: Exit status = 0
clnode_2:/#
5. You are now back to the point before the simulated network loss event, with both nodes operational and the RG online on node clnode_1 (Example 3-34).
Example 3-34 Cluster nodes and resource group status after cluster services start
clnode_2:/# clmgr -cva name,state,raw_state query node
# NAME:STATE:RAW_STATE
clnode_1:NORMAL:ST_STABLE
clnode_2:NORMAL:ST_STABLE
clnode_2:/#
 
clnode_2:/# clRGinfo
-----------------------------------------------------------------------------
Group Name Group State Node
-----------------------------------------------------------------------------
rg_IHS ONLINE clnode_1@site1
ONLINE SECONDARY clnode_2@site2
clnode_2:/#
Loss of all network communication on the active node
The loss of all network communications on the active node (the node with the RG online), both the production heartbeat network and the connectivity to the NFS server, triggers the restart of that node. At the same time, the RG is independently brought online on the other node.
The test was performed exactly like the one on the standby node, as described in “Loss of all network communication on standby node” on page 57, and the process was similar. The only notable difference was that the previously active node, now disconnected, restarted, and the other node, previously the standby node, brought the RG online, thus ensuring service availability.
3.5.6 Log entries for monitoring and debugging
As expected, the usual system and cluster log files contain information that is related to the NFS tie-breaker events and actions. However, the particular content of these logs varies between the nodes as each node’s role differs.
Error report (errpt)
The surviving node includes log entries that are presented in chronological order with older entries first, as shown in Example 3-35.
Example 3-35 Error report events on the surviving node
LABEL: CONFIGRM_SITE_SPLIT
Description
ConfigRM received Site Split event notification
 
 
LABEL: CONFIGRM_PENDINGQUO
Description
The operational quorum state of the active peer domain has changed to PENDING_QUORUM. This state usually indicates that exactly half of the nodes that are defined in the peer domain are online. In this state cluster resources cannot be recovered although none will be stopped explicitly.
 
 
LABEL: LVM_GS_RLEAVE
Description
Remote node Concurrent Volume Group failure detected
 
 
LABEL: CONFIGRM_HASQUORUM_
Description
The operational quorum state of the active peer domain has changed to HAS_QUORUM.
In this state, cluster resources may be recovered and controlled as needed by
management applications.
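To list these entries on a node, filter the error report by label. This is a minimal sketch that uses labels from the examples in this section; errpt -a prints the full detail of each matching entry.

# Show the summary entries for the ConfigRM (quorum and split) events.
errpt | grep CONFIGRM

# Show the full detail for a specific label, for example the site split notification.
errpt -a -J CONFIGRM_SITE_SPLIT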
The disconnected or restarted node includes log entries that are presented in chronological order with the older entries listed first, as shown in Example 3-36.
Example 3-36 Error report events on the restarted node
LABEL: CONFIGRM_SITE_SPLIT
Description
ConfigRM received Site Split event notification
 
 
LABEL: CONFIGRM_PENDINGQUO
Description
The operational quorum state of the active peer domain has changed to PENDING_QUORUM. This state usually indicates that exactly half of the nodes that are defined in the peer domain are online. In this state cluster resources cannot be recovered although none will be stopped explicitly.
 
 
LABEL: LVM_GS_RLEAVE
Description
Remote node Concurrent Volume Group failure detected
 
 
LABEL: CONFIGRM_NOQUORUM_E
Description
The operational quorum state of the active peer domain has changed to NO_QUORUM.
This indicates that recovery of cluster resources can no longer occur and that
the node may be rebooted or halted in order to ensure that critical resources
are released so that they can be recovered by another subdomain that may have
operational quorum.
 
 
LABEL: CONFIGRM_REBOOTOS_E
Description
The operating system is being rebooted to ensure that critical resources are
stopped so that another subdomain that has operational quorum may recover
these resources without causing corruption or conflict.
 
 
LABEL: REBOOT_ID
Description
SYSTEM SHUTDOWN BY USER
 
 
LABEL: CONFIGRM_HASQUORUM_
Description
The operational quorum state of the active peer domain has changed to HAS_QUORUM.
In this state, cluster resources may be recovered and controlled as needed by
management applications.
 
 
LABEL: CONFIGRM_ONLINE_ST
Description
The node is online in the domain indicated in the detail data.
Compared to the surviving node's log, the restarted node's log includes additional entries about the loss of quorum and the restart event.
The cluster.mmddyyy log file
For each split-brain situation encountered, the content of the cluster.mmddyyy log file was similar on the two nodes. The surviving node’s log entries are presented in Example 3-37.
Example 3-37 The cluster.mmddyyy log entries on the surviving node
Nov 13 13:40:03 EVENT START: split_merge_prompt split
Nov 13 13:40:07 EVENT COMPLETED: split_merge_prompt split 0
Nov 13 13:40:07 EVENT START: site_down site2
Nov 13 13:40:09 EVENT START: site_down_remote site2
Nov 13 13:40:09 EVENT COMPLETED: site_down_remote site2 0
Nov 13 13:40:09 EVENT COMPLETED: site_down site2 0
Nov 13 13:40:09 EVENT START: node_down clnode_2
Nov 13 13:40:09 EVENT COMPLETED: node_down clnode_2 0
Nov 13 13:40:11 EVENT START: rg_move_release clnode_1 1
Nov 13 13:40:11 EVENT START: rg_move clnode_1 1 RELEASE
Nov 13 13:40:11 EVENT COMPLETED: rg_move clnode_1 1 RELEASE 0
Nov 13 13:40:11 EVENT COMPLETED: rg_move_release clnode_1 1 0
Nov 13 13:40:11 EVENT START: rg_move_fence clnode_1 1
Nov 13 13:40:12 EVENT COMPLETED: rg_move_fence clnode_1 1 0
Nov 13 13:40:14 EVENT START: node_down_complete clnode_2
Nov 13 13:40:14 EVENT COMPLETED: node_down_complete clnode_2 0
The log entries for the same event, but this time on the disconnected or restarted node, are shown in Example 3-38.
Example 3-38 The cluster.mmddyyy log entries on the restarted node
Nov 13 13:40:03 EVENT START: split_merge_prompt split
Nov 13 13:40:03 EVENT COMPLETED: split_merge_prompt split 0
Nov 13 13:40:12 EVENT START: site_down site1
Nov 13 13:40:13 EVENT START: site_down_remote site1
Nov 13 13:40:13 EVENT COMPLETED: site_down_remote site1 0
Nov 13 13:40:13 EVENT COMPLETED: site_down site1 0
Nov 13 13:40:13 EVENT START: node_down clnode_1
Nov 13 13:40:13 EVENT COMPLETED: node_down clnode_1 0
Nov 13 13:40:15 EVENT START: network_down clnode_2 net_ether_01
Nov 13 13:40:15 EVENT COMPLETED: network_down clnode_2 net_ether_01 0
Nov 13 13:40:15 EVENT START: network_down_complete clnode_2 net_ether_01
Nov 13 13:40:15 EVENT COMPLETED: network_down_complete clnode_2 net_ether_01 0
Nov 13 13:40:18 EVENT START: rg_move_release clnode_2 1
Nov 13 13:40:18 EVENT START: rg_move clnode_2 1 RELEASE
Nov 13 13:40:18 EVENT COMPLETED: rg_move clnode_2 1 RELEASE 0
Nov 13 13:40:18 EVENT COMPLETED: rg_move_release clnode_2 1 0
Nov 13 13:40:18 EVENT START: rg_move_fence clnode_2 1
Nov 13 13:40:19 EVENT COMPLETED: rg_move_fence clnode_2 1 0
Nov 13 13:40:21 EVENT START: node_down_complete clnode_1
Nov 13 13:40:21 EVENT COMPLETED: node_down_complete clnode_1 0
This log also includes the information about the network_down event.
The cluster.log file
The cluster.log file includes much of the information that is in the cluster.mmddyyy log file. The notable difference is that cluster.log also includes information about losing and regaining quorum. For the disconnected or restarted node only, the cluster.log file has information about the restart event, as shown in Example 3-39.
Example 3-39 The cluster.log entries on the restarted node
Nov 13 13:40:03 clnode_2 [...] EVENT START: split_merge_prompt split
Nov 13 13:40:03 clnode_2 [...] CONFIGRM_SITE_SPLIT_ST ConfigRM received Site Split event notification
Nov 13 13:40:03 clnode_2 [...] EVENT COMPLETED: split_merge_prompt split 0
Nov 13 13:40:09 clnode_2 [...] CONFIGRM_PENDINGQUORUM_ER The operational quorum state of the active peer domain has changed to PENDING_QUORUM. This state usually indicates that exactly half of the nodes that are defined in the peer domain are online. In this state cluster resources cannot be recovered although none will be stopped explicitly.
Nov 13 13:40:12 clnode_2 [...] EVENT START: site_down site1
Nov 13 13:40:13 clnode_2 [...] EVENT START: site_down_remote site1
Nov 13 13:40:13 clnode_2 [...] EVENT COMPLETED: site_down_remote site1 0
Nov 13 13:40:13 clnode_2 [...] EVENT COMPLETED: site_down site1 0
Nov 13 13:40:13 clnode_2 [...] EVENT START: node_down clnode_1
Nov 13 13:40:13 clnode_2 [...] EVENT COMPLETED: node_down clnode_1 0
Nov 13 13:40:15 clnode_2 [...] EVENT START: network_down clnode_2 net_ether_01
Nov 13 13:40:15 clnode_2 [...] EVENT COMPLETED: network_down clnode_2 net_ether_01 0
Nov 13 13:40:15 clnode_2 [...] EVENT START: network_down_complete clnode_2 net_ether_01
Nov 13 13:40:16 clnode_2 [...] EVENT COMPLETED: network_down_complete clnode_2 net_ether_01 0
Nov 13 13:40:18 clnode_2 [...] EVENT START: rg_move_release clnode_2 1
Nov 13 13:40:18 clnode_2 [...] EVENT START: rg_move clnode_2 1 RELEASE
Nov 13 13:40:18 clnode_2 [...] EVENT COMPLETED: rg_move clnode_2 1 RELEASE 0
Nov 13 13:40:18 clnode_2 [...] EVENT COMPLETED: rg_move_release clnode_2 1 0
Nov 13 13:40:18 clnode_2 [...] EVENT START: rg_move_fence clnode_2 1
Nov 13 13:40:19 clnode_2 [...] EVENT COMPLETED: rg_move_fence clnode_2 1 0
Nov 13 13:40:21 clnode_2 [...] EVENT START: node_down_complete clnode_1
Nov 13 13:40:21 clnode_2 [...] EVENT COMPLETED: node_down_complete clnode_1 0
Nov 13 13:40:29 clnode_2 [...] CONFIGRM_NOQUORUM_ER The operational quorum state of the active peer domain has changed to NO_QUORUM. This indicates that recovery of cluster resources can no longer occur and that the node may be rebooted or halted in order to ensure that critical resources are released so that they can be recovered by another subdomain that may have operational quorum.
Nov 13 13:40:29 clnode_2 [...] CONFIGRM_REBOOTOS_ER The operating system is being rebooted to ensure that critical resources are stopped so that another subdomain that has operational quorum may recover these resources without causing corruption or conflict.
[...]
Nov 13 13:41:32 clnode_2 [...] RMCD_INFO_0_ST The daemon is started.
Nov 13 13:41:33 clnode_2 [...] CONFIGRM_STARTED_ST IBM.ConfigRM daemon has started.
Nov 13 13:42:03 clnode_2 [...] GS_START_ST Group Services daemon started DIAGNOSTIC EXPLANATION HAGS daemon started by SRC. Log file is /var/ct/1Z4w8kYNeHvP2dxgyEaCe2/log/cthags/trace.
Nov 13 13:42:36 clnode_2 [...] CONFIGRM_HASQUORUM_ST The operational quorum state of the active peer domain has changed to HAS_QUORUM. In this state, cluster resources may be recovered and controlled as needed by management applications.
Nov 13 13:42:36 clnode_2 [...] CONFIGRM_ONLINE_ST The node is online in the domain indicated in the detail data. Peer Domain Name nfs_tiebr_cluster
Nov 13 13:42:38 clnode_2 [...] STORAGERM_STARTED_ST IBM.StorageRM daemon has started.
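To search for these events after an incident, you can scan the cluster log directly. This is a minimal sketch that assumes the default log location of /var/hacmp/adm/cluster.log; adjust the path if your cluster logs to a different directory.

# List the split, site, and quorum-related entries from the cluster log.
grep -E "split_merge_prompt|site_down|CONFIGRM" /var/hacmp/adm/cluster.log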
 