Planning considerations
This chapter provides information to consider when planning to implement IBM PowerHA SystemMirror.
This chapter provides information about the following topics:
Introduction
Cluster Aware AIX repository disk
Important considerations for Virtual Input/Output Server
Network considerations
Network File System tie breaker
3.1 Introduction
There are many different ways to build a highly available environment. This chapter describes a small subset.
3.1.1 Mirrored architecture
In a mirrored architecture, you have identical or nearly identical physical components in each part of the data center. You can have this type of setup in a single room (not recommended), in different rooms in the same building, or in different buildings. The distance between each part can be anywhere from a few meters to several kilometers (km) or miles.
Figure 3-1 shows a high-level diagram of such a cluster. In this example, there are two networks, two managed systems, two Virtual Input/Output Servers (VIOS) per managed system, and two storage subsystems. This example also uses Logical Volume Manager (LVM) mirroring to get the data written to each storage subsystem.
This example also has a logical unit number (LUN) for the Cluster Aware AIX (CAA) repository disk on each storage subsystem. For details on how to set up the CAA repository disk, see 3.2, “Cluster Aware AIX repository disk” on page 33.
Figure 3-1 Cluster with multiple storage subsystems
3.1.2 Single storage architecture
In a single storage architecture, you have a single storage subsystem, which is used by both your primary and backup logical partition (LPAR). This solution can be used when you have lower availability requirements for your data, or when it is combined with a geographic solution.
If you can use the mirror feature in IBM SAN Volume Controller (SVC) or an SVC stretched cluster, this can look from a physical point of view identical or nearly identical to the mirrored architecture described in 3.1.1, “Mirrored architecture” on page 30. However, from an AIX and Cluster point of view, it is a Single Storage Architecture. For more details about the layout in an SVC stretched cluster, see 3.1.3, “Stretched cluster” on page 31.
Figure 3-2 shows this kind of layout from a logical point of view.
Figure 3-2 Cluster with single storage subsystem
3.1.3 Stretched cluster
A stretched cluster involves separating the cluster nodes into sites, where a site can be in a different building within a campus or a few miles away. In this configuration, there is a storage area network (SAN) that spans the sites, and disks can be presented to nodes at both sites.
As with any multi-site cluster, Transmission Control Protocol/Internet Protocol (TCP/IP) communications are essential. Multiple links and routes are suggested so that communication between the sites can be maintained even if a single network component or path fails.
A main concern is having redundant storage and verifying that the data within the storage devices is synchronized across sites. The following section presents a method for synchronizing the shared data.
IBM SAN Volume Controller (SVC) in a stretched configuration
The IBM SAN Volume Controller can be configured in a stretched configuration. In the stretched configuration, the IBM SVC can make two storage devices that are separated by some distance look like a single IBM SVC device. The IBM SVC itself keeps the data between the sites consistent through its disk mirroring technology.
The IBM SVC in a stretched configuration allows the PowerHA cluster continuous availability of the storage LUNs even if there is a single component failure anywhere in the storage environment. With this combination, the behavior of the cluster in terms of functionality and failure scenarios is similar to that of a local cluster (Figure 3-3).
Figure 3-3 IBM SVC stretched configuration
3.1.4 Linked cluster
A linked cluster is another type of cluster that involves multiple sites. In this case, there is no SAN network between sites, typically because the distance between sites is too large. In this configuration, the repository disk is mirrored across a network link such that each site has its own copy of the repository disk, and PowerHA keeps those disks synchronized.
TCP/IP communications are essential, and multiple links and routes are suggested so that communication between the sites can be maintained even if a single network component or path fails.
For more information, see IBM PowerHA SystemMirror 7.1.2 Enterprise Edition for AIX, SG24-8106 on the following website:
IBM-supported storage using copy services
There are several IBM-supported storage devices with copy services capabilities. For the following example, we use one of these devices, the IBM SVC, which can replicate data across long distances with the IBM SVC copy services functions. The data can be replicated in synchronous or asynchronous mode, where synchronous mode provides the most up-to-date data redundancy.
Data replication in asynchronous mode is typically used for distances longer than 100 miles, or where data replication in synchronous mode can affect application performance.
If there is a failure that requires moving the workload to the remaining site, PowerHA will interact directly with the storage to switch replication direction. PowerHA will then make the LUNs read/write capable and vary on the appropriate volume groups to activate the application on the remaining site.
An example of this concept is shown in Figure 3-4.
Figure 3-4 PowerHA and SVC storage replication
3.2 Cluster Aware AIX repository disk
Cluster Aware AIX (CAA) uses a shared disk to store its cluster configuration information. You must have at least 512 megabytes (MB) and no more than 460 gigabytes (GB) of disk space allocated for the cluster repository disk. This feature requires that a dedicated shared disk is available to all nodes that are part of the cluster. This disk cannot be used for application storage or any other purpose.
The amount of configuration information that is stored on this repository disk is directly dependent on the number of cluster entities, such as shared disks, number of nodes, and number of adapters in the environment. You must ensure that you have enough space for the following components when you determine the size of a repository disk:
Node-to-node communication
Cluster topology management
All migration processes
The advised size for a two-node cluster is 1 GB.
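To check whether a candidate LUN meets these size requirements, you can query its size before configuring the cluster. The following minimal sketch assumes hdisk3 is the candidate disk (a placeholder name); the bootinfo -s and getconf commands both report the size in MB:

# bootinfo -s hdisk3
1024
# getconf DISK_SIZE /dev/hdisk3
1024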
3.2.1 Preparing for a CAA repository disk
The amount of work you have to do to prepare for a CAA Repository disk depends on your storage architecture. The easiest one is when you have an environment like the one described in 3.1.2, “Single storage architecture” on page 30. In this case, you need to make sure that the LUN for the CAA repository disk is visible on all cluster nodes, and that there is a PVID assigned to it.
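For example, assuming hdisk3 is the designated repository LUN (a placeholder name), you can check the PVID on every cluster node and assign one if it is missing:

# lspv | grep hdisk3
hdisk3          none                                None
# chdev -l hdisk3 -a pv=yes
hdisk3 changed
# lspv | grep hdisk3
hdisk3          00f747c9b40ebfa5                    None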
If you have a multi-storage environment, such as the one described in 3.1.1, “Mirrored architecture” on page 30, then read 3.2.2, “CAA with multiple storage devices” on page 34.
3.2.2 CAA with multiple storage devices
The description here is related to the architecture described in 3.1.1, “Mirrored architecture” on page 30. This example uses one backup CAA repository disk. The maximum number of backup disks you can define is six.
If you plan to use one or more disks that can potentially be used as backup disks for the CAA repository, it is advised to rename the disks, as described in “Rename the hdisk” on page 36. However, this might not be possible in all cases.
 
Important: Note that third-party Multipath I/O (MPIO) management software, such as EMC PowerPath, uses disk mapping to manage multiple paths. These software programs typically have a disk definition at a higher level, and path-specific disks underneath. Also, these software programs typically use special naming conventions.
Renaming these types of disks using the AIX rendev command can confuse the third-party MPIO software and can create disk-related issues. See your vendor documentation for any disk renaming tool available as part of the vendor’s software kit.
The examples that are described in this section mainly use smitty sysmirror to show the interesting parts. Using the clmgr command can be faster, but it can be harder to understand for someone new to this area. Nevertheless, the examples use the clmgr command where it makes sense or where it is the only option.
Using the standard hdisk name
A current drawback of having multiple LUNs that can be used as CAA repository disks is that their role is not visible through normal AIX commands, such as lspv. In this example, hdisk3 and hdisk4 are the LUNs prepared for the primary and backup CAA repository disks, and hdisk1 and hdisk2 are for the application. Example 3-1 shows the output of the lspv command before starting the configuration.
Example 3-1 The lspv output before configuring CAA
# lspv
hdisk0          00f71e6a059e7e1a                    rootvg          active
hdisk1          00c3f55e34ff43cc                    None
hdisk2          00c3f55e34ff433d                    None
hdisk3          00f747c9b40ebfa5                    None
hdisk4          00f747c9b476a148                    None
hdisk5          00f71e6a059e701b                    rootvg          active
#
After creating a cluster, selecting hdisk3 as the CAA repository disk, synchronizing the cluster, and creating the application volume group, you get the output listed in Example 3-2. As you can see in this output, the problem is that the lspv command does not show that hdisk4 is reserved as the backup disk for the CAA repository.
Example 3-2 The lspv output after configuring CAA
# lspv
hdisk0          00f71e6a059e7e1a                    rootvg          active
hdisk1          00c3f55e34ff43cc                    testvg
hdisk2          00c3f55e34ff433d                    testvg
hdisk3          00f747c9b40ebfa5                    caavg_private   active
hdisk4          00f747c9b476a148                    None
hdisk5          00f71e6a059e701b                    rootvg          active
#
To see which disk is reserved as a backup disk, you can use the clmgr -v query repository command or the odmget HACMPsircol command. Example 3-3 shows the output of the clmgr command, and Example 3-4 on page 36 shows the output of the odmget command.
Example 3-3 The clmgr -v query repository output
# clmgr -v query repository
NAME="hdisk3"
NODE="c2n1"
PVID="00f747c9b40ebfa5"
UUID="12d1d9a1-916a-ceb2-235d-8c2277f53d06"
BACKUP="0"
TYPE="mpioosdisk"
DESCRIPTION="MPIO IBM 2076 FC Disk"
SIZE="1024"
AVAILABLE="512"
CONCURRENT="true"
ENHANCED_CONCURRENT_MODE="true"
STATUS="UP"
 
NAME="hdisk4"
NODE="c2n1"
PVID="00f747c9b476a148"
UUID="c961dda2-f5e6-58da-934e-7878cfbe199f"
BACKUP="1"
TYPE="mpioosdisk"
DESCRIPTION="MPIO IBM 2076 FC Disk"
SIZE="1024"
AVAILABLE="95808"
CONCURRENT="true"
ENHANCED_CONCURRENT_MODE="true"
STATUS="BACKUP"
#
In the output of the clmgr command, you can directly see the hdisk name. The odmget command output (Example 3-4) lists only the physical volume identifiers (PVIDs).
Example 3-4 The odmget HACMPsircol output
# odmget HACMPsircol
 
HACMPsircol:
name = "c2n1_cluster_sircol"
id = 0
uuid = "0"
ip_address = ""
repository = "00f747c9b40ebfa5"
backup_repository = "00f747c9b476a148"
#
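Because the odmget output lists only PVIDs, you can map a PVID back to its hdisk name with a simple lspv filter, for example:

# lspv | grep 00f747c9b476a148
hdisk4          00f747c9b476a148                    None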
Rename the hdisk
To get around the issues mentioned in “Using the standard hdisk name” on page 34, it is suggested to rename the hdisks. The advantage of doing this is that it is much easier to see which disks are reserved for the CAA repository.
There are some points to consider:
Generally, you can use any name, but if it gets too long, you can experience some administration issues.
The name must be unique.
It is advised not to have the string disk as part of the name, because there might be scripts or tools that search for the string disk.
You must manually rename the hdisks on all cluster nodes.
 
Important: Note that third-party Multipath I/O (MPIO) management software, such as EMC PowerPath, uses disk mapping to manage multiple paths. These software programs typically have a disk definition at a higher level, and path-specific disks underneath. Also, these software programs typically use special naming conventions.
Renaming these types of disks using the AIX rendev command can confuse the third-party MPIO software and can create disk-related issues. See your vendor documentation for any disk renaming tool available as part of the vendor’s software kit.
Using a long name
First, we tested using a longer and more descriptive name. Example 3-5 shows the output of the lspv command before we started.
Example 3-5 The lspv output before using rendev
# lspv
hdisk0          00f71e6a059e7e1a                    rootvg          active
hdisk1          00c3f55e34ff43cc                    None
hdisk2          00c3f55e34ff433d                    None
hdisk3          00f747c9b40ebfa5                    None
hdisk4          00f747c9b476a148                    None
hdisk5          00f71e6a059e701b                    rootvg          active
#
For the first try we decided to use a longer name (caa_reposX). Example 3-6 shows what we did and what the lspv command output looks like afterward.
 
Important: Remember to do the same on all cluster nodes.
Example 3-6 The lspv output after using rendev (using a long name)
# rendev -l hdisk3 -n caa_repos0
# rendev -l hdisk4 -n caa_repos1
# lspv
hdisk0          00f71e6a059e7e1a                    rootvg          active
hdisk1          00c3f55e34ff43cc                    None
hdisk2          00c3f55e34ff433d                    None
caa_repos0      00f747c9b40ebfa5                    None
caa_repos1      00f747c9b476a148                    None
hdisk5          00f71e6a059e701b                    rootvg          active
#
Now we started to configure the cluster by using the System Management Interface Tool (SMIT). Pressing F4 to select the CAA repository disk returns the screen shown in Figure 3-5. As you can see, only the first part of the name is displayed, so the only way to identify the correct disk is to check the PVID.
                   Define Repository and Cluster IP Address
 
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
 
[Entry Fields]
* Cluster Name c2n1_cluster
* Heartbeat Mechanism Unicast +
* Repository Disk [] +
Cluster Multicast Address []
+--------------------------------------------------------------------------+
| Repository Disk |
| |
| Move cursor to desired item and press Enter. |
| |
| caa_rep (00f747c9b40ebfa5) on all cluster nodes                        |
| caa_rep (00f747c9b476a148) on all cluster nodes |
| hdisk1 (00c3f55e34ff43cc) on all cluster nodes |
| hdisk2 (00c3f55e34ff433d) on all cluster nodes |
| |
| F1=Help F2=Refresh F3=Cancel |
F1| F8=Image F10=Exit Enter=Do |
F5| /=Find n=Find Next |
F9+--------------------------------------------------------------------------+
Figure 3-5 SMIT screen using long names
Using a short name
In this case, a short name means a name with a maximum of seven characters. We used the same starting point, as listed in Example 3-5 on page 36. This time, we decided to use a shorter name (caa_rX). Example 3-7 shows what we did and what the lspv command output looks like afterward.
 
Important: Remember to do the same on all cluster nodes.
Example 3-7 The lspv output after using rendev (using a short name)
# rendev -l hdisk3 -n caa_r0
# rendev -l hdisk4 -n caa_r1
# lspv
hdisk0          00f71e6a059e7e1a                    rootvg          active
hdisk1          00c3f55e34ff43cc                    None
hdisk2          00c3f55e34ff433d                    None
caa_r0          00f747c9b40ebfa5                    None
caa_r1          00f747c9b476a148                    None
hdisk5          00f71e6a059e701b                    rootvg          active
#
Now we started to configure the cluster by using SMIT. Pressing F4 to select the CAA repository disk returns the screen shown in Figure 3-6. As you can see, the full name is now displayed.
                   Define Repository and Cluster IP Address
 
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
 
[Entry Fields]
* Cluster Name c2n1_cluster
* Heartbeat Mechanism Unicast +
* Repository Disk [] +
Cluster Multicast Address []
+--------------------------------------------------------------------------+
| Repository Disk |
| |
| Move cursor to desired item and press Enter. |
| |
| caa_r0  (00f747c9b40ebfa5) on all cluster nodes                        |
| caa_r1  (00f747c9b476a148) on all cluster nodes |
| hdisk1 (00c3f55e34ff43cc) on all cluster nodes |
| hdisk2 (00c3f55e34ff433d) on all cluster nodes |
| |
| F1=Help F2=Refresh F3=Cancel |
F1| F8=Image F10=Exit Enter=Do |
F5| /=Find n=Find Next |
F9+--------------------------------------------------------------------------+
Figure 3-6 SMIT screen using short names
3.3 Important considerations for Virtual Input/Output Server
This section lists some new features of AIX and Virtual I/O Server (VIOS) that help to increase overall availability, and that are especially suggested for PowerHA environments.
3.3.1 Using poll_uplink
To use the poll_uplink option, you must have the following versions and settings:
VIOS 2.2.3.4 or later installed in all related VIO servers.
The LPAR must be at AIX 7.1 TL3, or AIX 6.1 TL9 or later.
The option poll_uplink needs to be set on the LPAR, on the virtual entX interfaces.
The option poll_uplink can be defined directly on the virtual interface if you are using shared Ethernet adapter (SEA) fallover, or on the Etherchannel device that points to the virtual interfaces. To enable poll_uplink, use the following command:
chdev -l entX -a poll_uplink=yes -P
 
Important: You must restart the LPAR to activate the poll_uplink setting.
Figure 3-7 shows how the option works from a simplified point of view. In production environments, you normally have at least two physical interfaces on the VIOS, and you can also use a dual VIOS setup. In a multiple physical interface environment, the virtual link will be reported as down only when all physical connections on the VIOS for this SEA are down.
Figure 3-7 Using poll_uplink
The following settings are possible for poll_uplink:
poll_uplink (yes, no)
poll_uplink_int (100 milliseconds (ms) - 5000 ms)
To display the settings, use the lsattr -El entX command. Example 3-8 shows the default settings for poll_uplink.
Example 3-8 The lsattr details for poll_uplink
# lsdev -Cc adapter | grep ^ent
ent0 Available Virtual I/O Ethernet Adapter (l-lan)
ent1 Available Virtual I/O Ethernet Adapter (l-lan)
# lsattr -El ent0 | grep "poll_up"
poll_uplink no Enable Uplink Polling True
poll_uplink_int 1000 Time interval for Uplink Polling True
#
There is another way to check whether poll_uplink is enabled and what the current link state is. However, this requires at least AIX 7.1 TL3 SP3 or AIX 6.1 TL9 SP3. If your LPAR is at one of these levels or later, you can use the entstat command to check the poll_uplink status.
Example 3-9 shows an excerpt of the entstat command output in an LPAR where poll_uplink is not enabled (set to no).
Example 3-9 Using poll_uplink=no
# entstat -d ent0
--------------------------------------------------
ETHERNET STATISTICS (en0) :
Device Type: Virtual I/O Ethernet Adapter (l-lan)
...
General Statistics:
-------------------
No mbuf Errors: 0
Adapter Reset Count: 0
Adapter Data Rate: 20000
Driver Flags: Up Broadcast Running
Simplex 64BitSupport ChecksumOffload
DataRateSet VIOENT
...
LAN State: Operational
...
#
Compared to Example 3-9, Example 3-10 on page 41 shows the entstat command output on a system where poll_uplink is enabled and where all physical links that are related to this virtual interface are up. The additional content that you get is the following:
VIRTUAL_PORT
PHYS_LINK_UP
Bridge Status: Up
Example 3-10 Using poll_uplink=yes when physical link is up
# entstat -d ent0
--------------------------------------------------
ETHERNET STATISTICS (en0) :
Device Type: Virtual I/O Ethernet Adapter (l-lan)
...
General Statistics:
-------------------
No mbuf Errors: 0
Adapter Reset Count: 0
Adapter Data Rate: 20000
Driver Flags: Up Broadcast Running
Simplex 64BitSupport ChecksumOffload
DataRateSet VIOENT VIRTUAL_PORT
PHYS_LINK_UP
...
LAN State: Operational
Bridge Status: Up
...
#
When all of the physical links on the VIOS are down, you get the output listed in Example 3-11. The text PHYS_LINK_UP is no longer displayed, and the Bridge Status changes from Up to Unknown.
Example 3-11 Using poll_uplink=yes when physical link is down
# entstat -d ent0
--------------------------------------------------
ETHERNET STATISTICS (en0) :
Device Type: Virtual I/O Ethernet Adapter (l-lan)
...
General Statistics:
-------------------
No mbuf Errors: 0
Adapter Reset Count: 0
Adapter Data Rate: 20000
Driver Flags: Up Broadcast Running
Simplex 64BitSupport ChecksumOffload
DataRateSet VIOENT VIRTUAL_PORT
...
LAN State: Operational
Bridge Status: Unknown
...
#
3.3.2 Advantages for PowerHA when poll_uplink is used
In PowerHA V7, the network down detection is performed by CAA. By default, CAA checks for IP traffic and for the link status of an interface. Therefore, using poll_uplink is advised for PowerHA LPARs. It helps the system to make a better decision about whether a given interface is up or down.
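The following sketch is one possible way to quickly verify on a PowerHA node that poll_uplink is enabled and that the bridge is currently reported as up. It assumes that all entX adapters on the LPAR are virtual Ethernet adapters:

# for IF in $(lsdev -Cc adapter -F name | grep "^ent"); do
>   echo "$IF poll_uplink: $(lsattr -El $IF -a poll_uplink -F value)"
>   entstat -d $IF | grep "Bridge Status"
> done
ent0 poll_uplink: yes
Bridge Status: Up
ent1 poll_uplink: yes
Bridge Status: Up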
3.4 Network considerations
This section focuses on network considerations from a PowerHA point of view only. From this point of view, it does not matter whether you have virtual or physical network devices.
3.4.1 Dual adapter networks
This type of network was the most commonly used one in the past. With the introduction of virtualization, it was largely replaced by single adapter network solutions.
In PowerHA 7.1, this solution can still be used, but it is not recommended. The cross-adapter checking logic is not implemented in PowerHA V7. The advantage of not having this feature is that PowerHA 7.1 and later versions do not require that IP source routing is enabled.
If you are using this kind of setup in PowerHA 7.1 or later, you must also use the netmon.cf file in a similar way as for a single adapter layout. In this case, the netmon.cf file must have a ping target defined for each potential enX interface.
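A minimal netmon.cf sketch for such a layout is shown here. The file is /usr/es/sbin/cluster/netmon.cf on each node, and each !REQD entry defines, for one interface, an external address to ping. The IP addresses below are placeholders; replace them with targets (for example, the default gateways) that are reachable in your network:

# cat /usr/es/sbin/cluster/netmon.cf
!REQD en0 192.168.100.1
!REQD en1 10.0.0.254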
3.4.2 Single adapter network
When we describe a single adapter network, it is from a PowerHA point of view. In a highly available environment, you should always have a redundant way to access your network. This is typically done by using SEA fallover or Etherchannel, that is, Link Aggregation or Network Interface Backup (NIB). The Etherchannel NIB-based solution can be used in both scenarios, using virtual adapters or physical adapters. The Etherchannel Link Aggregation-based solution can be used only if you have direct-attached adapters.
 
Note: Keep in mind that with a single adapter, you use the SEA fallover or the Etherchannel fallover.
This approach simplifies the setup from a TCP/IP point of view, and it also reduces the content of the netmon.cf file.
3.5 Network File System tie breaker
This section describes the Network File System (NFS) tie breaker.
3.5.1 Introduction and concepts
NFS tie breaker functionality represents an extension of the previously introduced disk tie breaker feature that relied on a Small Computer System Interface (SCSI) disk accessible to all nodes in a PowerHA cluster. The differences between the protocols that are used for accessing the tie breaker (SCSI disk or NFS-mounted file) favor the NFS-based solution for linked clusters.
Split-brain situation
A cluster split event can occur when a group of nodes cannot communicate with the remaining nodes in a cluster. For example, in a two-site linked cluster, a split occurs if all communication links between the two sites fail. Depending on the communication network topology and the location of the interruption, a cluster split event splits the cluster into two (or more) partitions, each of them containing one or more cluster nodes. The resulting situation is commonly referred to as a split-brain situation.
In a split-brain situation, as its name implies, the two partitions have no knowledge of each other’s status, each of them considering the other as being offline. As a consequence, each partition will try to bring online the other partition’s resource groups, thus generating a high risk of data corruption on all shared disks. To prevent that, split and merge policies are defined as a method to avoid data corruption on the shared cluster disks.
Tie breaker
The tie breaker feature uses a tie breaker resource to choose a surviving partition (that will be allowed to continue to operate) when a cluster split event occurs. This feature prevents data corruption on the shared cluster disks. The tie breaker is identified either as a SCSI disk or an NFS-mounted file that must be accessible (in normal conditions) to all nodes in the cluster.
Split policy
When a split-brain situation occurs, each partition attempts to acquire the tie breaker by placing a lock on the tie breaker disk or on the NFS file. The partition that first locks the SCSI disk or reserves the NFS file “wins”, while the other “loses”.
All nodes in the winning partition continue to process cluster events, while all nodes in the other partition (the losing partition) attempt to recover according to the defined split and merge action plan. This plan most often implies either the restart of the cluster nodes, or merely the restart of cluster services on those nodes.
Merge policy
There are situations in which, depending on the cluster split policy, the cluster can have two partitions that run independently of each other. However, most often, the wanted option is to configure a merge policy that allows the partitions to operate together again after communications are restored between the partitions.
In this second approach, when partitions that were part of the cluster are brought back online after the communication failure, they must be able to communicate with the partition that owns the tie breaker disk or NFS file. If a partition that is brought back online cannot communicate with the tie breaker disk or the NFS file, it does not join the cluster. The tie breaker disk or NFS file is released when all nodes in the configuration rejoin the cluster.
The merge policy configuration (in this case, NFS-based tie breaker) must be of the same type as that for the split policy.
3.5.2 Test environment setup
The laboratory environment that we used to test the NFS tie breaker functionality consisted of a two-site linked cluster (each site having a single node) with a common NFS-mounted resource, as shown in Figure 3-8.
Figure 3-8 NFS tie-breaker test environment
Because the goal was to test the NFS tie breaker functionality as a method for handling split-brain situations, the additional local nodes in a linked multisite cluster were considered irrelevant, and therefore not included in the test setup. Each node had its own cluster repository disk (clnode_1r and clnode_2r), while both nodes shared a common cluster disk (clnode_12, the one that needs to be protected from data corruption caused by a split-brain situation), as shown in Example 3-12.
Example 3-12 List of physical volumes on both cluster nodes
clnode_1:/# lspv
clnode_1r 00f6f5d0f8c9fbf4 caavg_private active
clnode_12 00f6f5d0f8ca34ec datavg concurrent
hdisk0 00f6f5d09570f170 rootvg active
clnode_1:/#
 
clnode_2:/# lspv
clnode_2r 00f6f5d0f8ceed1a caavg_private active
clnode_12 00f6f5d0f8ca34ec datavg concurrent
hdisk0 00f6f5d09570f31b rootvg active
clnode_2:/#
To allow greater flexibility for our test scenarios, we chose to use different network adapters for the production traffic or inter-site connectivity, and the connectivity to the shared NFS resource. The network setup of the two nodes is shown in Example 3-13.
Example 3-13 Network settings for both cluster nodes
clnode_1:/# netstat -in | egrep "Name|en"
Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Coll
en0 1500 link#2 ee.af.e.90.ca.2 533916 0 566524 0 0
en0 1500 192.168.100 192.168.100.50 533916 0 566524 0 0
en0 1500 192.168.100 192.168.100.51 533916 0 566524 0 0
en1 1500 link#3 ee.af.e.90.ca.3 388778 0 457776 0 0
en1 1500 10 10.0.0.1 388778 0 457776 0 0
clnode_1:/#
 
clnode_2:/# netstat -in | egrep "Name|en"
Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Coll
en0 1500 link#2 ee.af.7.e3.9a.2 391379 0 278953 0 0
en0 1500 192.168.100 192.168.100.52 391379 0 278953 0 0
en1 1500 link#3 ee.af.7.e3.9a.3 385787 0 350121 0 0
en1 1500 10 10.0.0.2 385787 0 350121 0 0
clnode_2:/#
During the setup of the cluster, the NFS communication network (with the en1 network adapters in Example 3-13) was discovered and automatically added to the cluster configuration as a heartbeat network (as net_ether_02). However, we manually removed it afterward to prevent interference with the NFS tie breaker tests. Therefore, the cluster eventually had only one heartbeat network: net_ether_01.
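As a sketch, assuming the network name reported by PowerHA is net_ether_02, the removal can be done with the clmgr command (followed by a cluster verification and synchronization):

clnode_1:/# clmgr delete network net_ether_02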
The final cluster topology was reported, as shown in Example 3-14.
Example 3-14 Cluster topology information
clnode_1:/# cltopinfo
Cluster Name: nfs_tiebr_cluster
Cluster Type: Linked
Heartbeat Type: Unicast
Repository Disks:
Site 1 (site1@clnode_1): clnode_1r
Site 2 (site2@clnode_2): clnode_2r
Cluster Nodes:
Site 1 (site1):
clnode_1
Site 2 (site2):
clnode_2
 
There are 2 node(s) and 1 network(s) defined
NODE clnode_1:
Network net_ether_01
clst_svIP 192.168.100.50
clnode_1 192.168.100.51
NODE clnode_2:
Network net_ether_01
clst_svIP 192.168.100.50
clnode_2 192.168.100.52
 
Resource Group rg_IHS
Startup Policy Online On Home Node Only
Fallover Policy Fallover To Next Priority Node In The List
Fallback Policy Never Fallback
Participating Nodes clnode_1 clnode_2
Service IP Label clst_svIP
clnode_1:/#
At the end of our environment preparation, the cluster was up and running. The resource group (IBM Hypertext Transfer Protocol (HTTP) Server, installed on the clnode_12 cluster disk, with the datavg volume group) was online, as shown in Example 3-15.
Example 3-15 Cluster nodes and resource groups status
clnode_1:/# clmgr -cv -a name,state,raw_state query node
# NAME:STATE:RAW_STATE
clnode_1:NORMAL:ST_STABLE
clnode_2:NORMAL:ST_STABLE
 
clnode_1:/#
clnode_1:/# clRGinfo
-----------------------------------------------------------------------------
Group Name Group State Node
-----------------------------------------------------------------------------
rg_IHS ONLINE clnode_1@site1
ONLINE SECONDARY clnode_2@site2
 
clnode_1:/#
3.5.3 NFS server and client configuration
An important prerequisite for deploying the NFS tie breaker functionality is the proper setup of the NFS resource. In this regard, note that the NFS tie breaker functionality does not work with (the more common) NFS version 3.
 
Important: NFS tie breaker functionality requires NFS version 4.
Our test environment used an NFS server configured for convenience on an AIX 7.1 TL3 SP5 LPAR. This, of course, is not a requirement for deploying an NFS version 4 server.
A number of services are required to be active in order to allow NFSv4 communication between clients and servers:
On the NFS server:
 – biod
 – nfsd
 – nfsrgyd
 – portmap
 – rpc.lockd
 – rpc.mountd
 – rpc.statd
 – TCP
On the NFS client (all cluster nodes):
 – biod
 – nfsd
 – nfsrgyd
 – rpc.mountd
 – rpc.statd
 – TCP
While most of the previous services can (by default) already be active, particular attention is required for the setup of the nfsrgyd service. As mentioned previously, this daemon must be running on both the server and the clients (in our case, the two cluster nodes). This daemon provides a name conversion service for NFS servers and clients using NFS v4.
Starting the nfsrgyd daemon requires in turn that the local NFS domain is set. The local NFS domain is stored in the /etc/nfs/local_domain file, and it can be set by using the chnfsdom command, as shown in Example 3-16.
Example 3-16 Setting the local NFS domain
nfsserver:/# chnfsdom nfs_local_domain
nfsserver:/# startsrc -g nfs
[...]
nfsserver:/# lssrc -g nfs
Subsystem Group PID Status
[...]
nfsrgyd nfs 7077944 active
[...]
nfsserver:#
In addition, for the server, you need to specify the root node directory (what clients would mount as /) and the public node directory with the command-line interface (CLI), using the chnfs command, as shown in Example 3-17.
Example 3-17 Setting the root and public node directory
nfsserver:/# chnfs -r /nfs_root -p /nfs_root
nfsserver:/#
Alternatively, the root node directory, the public node directory, and the local NFS domain can be set with SMIT. Use the smit nfs command and follow the path Network File System (NFS) → Configure NFS on This System, then select the corresponding option:
Change Version 4 Server Root Node
Change Version 4 Server Public Node
Configure NFS Local Domain → Change NFS Local Domain
As a final step for the NFS configuration, create the NFS resource (export). Example 3-18 shows the NFS resource, created using SMIT (smit mknfs command).
Example 3-18 Creating an NFS v4 export
                               Add a Directory to Exports List
 
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
 
[Entry Fields]
* Pathname of directory to export [/nfs_root/nfs_tie_breaker] /
[...]
Public filesystem? no +
[...]
Allow access by NFS versions [4] +
[...]
* Security method 1 [sys,krb5p,krb5i,krb5,dh] +
* Mode to export directory read-write +
[...]
 
 
F1=Help F2=Refresh F3=Cancel F4=List
F5=Reset F6=Command F7=Edit F8=Image
F9=Shell F10=Exit Enter=Do
At this point, a good practice is to make sure that the NFS configuration is correct by manually mounting the NFS export on the clients, as shown in Example 3-19 (date column removed for clarity).
Example 3-19 Mounting an NFS v4 export
clnode_1:/# mount -o vers=4 nfsserver:/nfs_tie_breaker /mnt
clnode_1:/# mount | egrep "node|---|tie"
node       mounted    mounted over  vfs   options
--------   ----------------  ------------  ----  -------------------------------
nfsserver  /nfs_tie_breaker  /mnt          nfs4  vers=4,fg,soft,retry=1,timeo=10
clnode_1:/#
clnode_1:/# umount /mnt
clnode_1:/#
3.5.4 NFS tie breaker configuration
NFS tie breaker functionality can be configured either with CLI commands or with SMIT.
To configure the NFS tie breaker using SMIT, complete the following steps:
1. The SMIT menu that enables the configuration of NFS Tie Breaker split policy can be accessed following the path Custom Cluster Configuration → Cluster Nodes and Networks → Initial Cluster Setup (Custom) → Configure Cluster Split and Merge Policy.
2. Once there, select the Split Management Policy option, as shown in Example 3-20.
Example 3-20 Configuring split handling policy
Configure Cluster Split and Merge Policy
 
Move cursor to desired item and press Enter.
 
Split Management Policy
Merge Management Policy
Quarantine Policy
 
 
+-------------------------------------------------------------+
|                    Split Handling Policy                    |
| |
| Move cursor to desired item and press Enter. |
| |
| None |
| TieBreaker |
| Manual |
| |
| F1=Help F2=Refresh F3=Cancel |
| F8=Image F10=Exit Enter=Do |
F1=Help | /=Find n=Find Next |
F9=Shell +-------------------------------------------------------------+
3. Selecting the TieBreaker option leads to the menu where we can choose the method to use for tie breaking, as shown in Example 3-21.
Example 3-21 Selecting the tie breaker type
Configure Cluster Split and Merge Policy
 
Move cursor to desired item and press Enter.
 
Split Management Policy
Merge Management Policy
Quarantine Policy
 
 
+-------------------------------------------------------------+
|                   Select TieBreaker Type                    |
| |
| Move cursor to desired item and press Enter. |
| |
| Disk |
| NFS |
| |
| F1=Help F2=Refresh F3=Cancel |
| F8=Image F10=Exit Enter=Do |
F1=Help | /=Find n=Find Next |
F9=Shell +-------------------------------------------------------------+
 
4. After selecting NFS as the method for tie breaking, we get to the last SMIT menu for our purpose, where we must specify the NFS export server and directory and the local mount point, as shown in Example 3-22.
Example 3-22 Configuring NFS tie breaker for split handling policy using SMIT
                         NFS TieBreaker Configuration
 
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
 
[Entry Fields]
Split Handling Policy NFS
* NFS Export Server [nfsserver_nfs]
* Local Mount Directory [/nfs_tie_breaker]
* NFS Export Directory [/nfs_tie_breaker]
 
 
F1=Help F2=Refresh F3=Cancel F4=List
F5=Reset F6=Command F7=Edit F8=Image
F9=Shell F10=Exit Enter=Do
Split and merge policies must be of the same type, and the same rule applies to the tie breaker type. Therefore, selecting the TieBreaker option for the Split Handling Policy field, and the NFS option as the TieBreaker type for that policy, also implies selecting those same options (TieBreaker and NFS) for the Merge Handling Policy:
1. In a similar manner to the one described previously, we configure the merge policy. From the same SMIT menu mentioned earlier (Custom Cluster Configuration → Cluster Nodes and Networks → Initial Cluster Setup (Custom) → Configure Cluster Split and Merge Policy), we select the Merge Management Policy option (Example 3-23).
Example 3-23 Configuring merge handling policy
                          Configure Cluster Split and Merge Policy
 
Move cursor to desired item and press Enter.
 
Split Management Policy
Merge Management Policy
Quarantine Policy
 
+-------------------------------------------------------------+
|                    Merge Handling Policy                  |
| |
| Move cursor to desired item and press Enter. |
| |
| Majority |
| TieBreaker |
| Manual |
| Priority |
| |
| F1=Help F2=Refresh F3=Cancel |
| F8=Image F10=Exit Enter=Do |
F1=Help | /=Find n=Find Next |
F9=Shell +-------------------------------------------------------------+
2. Selecting the TieBreaker option leads to the menu shown in Example 3-24, where we again choose NFS as the method for tie breaking.
Example 3-24 Configuring NFS tie breaker for merge handling policy with SMIT
                         NFS TieBreaker Configuration
 
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
 
[Entry Fields]
Merge Handling Policy NFS
* NFS Export Server [nfsserver_nfs]
* Local Mount Directory [/nfs_tie_breaker]
* NFS Export Directory [/nfs_tie_breaker]
 
 
F1=Help F2=Refresh F3=Cancel F4=List
F5=Reset F6=Command F7=Edit F8=Image
F9=Shell F10=Exit Enter=Do
Alternatively, both split and merge management policies can be configured by CLI using the clmgr modify cluster SPLIT_POLICY=tiebreaker MERGE_POLICY=tiebreaker command, followed by the cl_sm command, as shown in Example 3-25.
Example 3-25 Configuring NFS tie breaker for split and merge handling policy using the CLI
clnode_1:/# /usr/es/sbin/cluster/utilities/cl_sm -s 'NFS' -k'nfsserver_nfs' -g'/nfs_tie_breaker' -p'/nfs_tie_breaker'
The PowerHA SystemMirror split and merge policies have been updated.
Current policies are:
Split Handling Policy : NFS
Merge Handling Policy : NFS
NFS Export Server :
nfsserver_nfs
Local Mount Directory :
/nfs_tie_breaker
NFS Export Directory :
/nfs_tie_breaker
Split and Merge Action Plan : Restart
The configuration must be synchronized to make this change known across the cluster.
clnode_1:/#
 
 
clnode_1:/# /usr/es/sbin/cluster/utilities/cl_sm -m 'NFS' -k'nfsserver_nfs' -g'/nfs_tie_breaker' -p'/nfs_tie_breaker'
The PowerHA SystemMirror split and merge policies have been updated.
Current policies are:
Split Handling Policy : NFS
Merge Handling Policy : NFS
NFS Export Server :
nfsserver_nfs
Local Mount Directory :
/nfs_tie_breaker
NFS Export Directory :
/nfs_tie_breaker
Split and Merge Action Plan : Restart
The configuration must be synchronized to make this change known across the cluster.
clnode_1:/#
At this point, both a PowerHA cluster synchronization and restart, and a CAA cluster restart, are required. To complete these restarts, the following actions must be performed (a consolidated command sketch follows this list):
1. Verify and synchronize the changes across the cluster. This can be achieved either by the SMIT menu (select the smit sysmirror command, then follow the path: Cluster Applications and Resources → Resource Groups → Verify and Synchronize Cluster Configuration), or by the CLI, using the clmgr sync cluster command.
2. Stop cluster services for all nodes in the cluster by running the clmgr stop cluster command.
3. Stop the Cluster Aware AIX (CAA) daemon on all cluster nodes by running the stopsrc -s clconfd command.
4. Start the Cluster Aware AIX (CAA) daemon on all cluster nodes by running the startsrc -s clconfd command.
5. Start cluster services for all nodes in the cluster by running the clmgr start cluster command.
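As a consolidated sketch, the complete sequence looks like the following. The stopsrc and startsrc commands must be run on every cluster node:

clnode_1:/# clmgr sync cluster
clnode_1:/# clmgr stop cluster
clnode_1:/# stopsrc -s clconfd    # repeat on every cluster node
clnode_1:/# startsrc -s clconfd   # repeat on every cluster node
clnode_1:/# clmgr start cluster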
 
Important: Verify all output messages generated by the synchronization and restart of the cluster, because an error that occurs when activating the NFS tie breaker policies does not necessarily produce an error in the overall result of the cluster synchronization action.
When all cluster nodes are synchronized and running, and the split and merge management policies are applied, the NFS resource is accessed by all nodes, as shown in Example 3-26 (date column removed for clarity).
Example 3-26 Checking for NFS export mounted on clients
clnode_1:/# mount | egrep "node|---|tie"
node     mounted    mounted over   vfs   options
-------------  ---------------  ---------------- ----  ----------------------
nfsserver_nfs  /nfs_tie_breaker  /nfs_tie_breaker  nfs4   vers=4,fg,soft,retry=1,timeo=10
clnode_1:/#
 
 
clnode_2:/# mount | egrep "node|---|tie"
node     mounted    mounted over   vfs   options
-------------  ---------------  ---------------- ----  ----------------------
nfsserver_nfs  /nfs_tie_breaker  /nfs_tie_breaker  nfs4   vers=4,fg,soft,retry=1,timeo=10
clnode_2:/#
3.5.5 NFS tie breaker tests
A number of tests were carried out. As a general method to simulate network connectivity loss, we chose to use the ifconfig command to bring network interfaces down, mainly because its effect is not persistent across restarts, so the restart induced by the NFS tie breaker would have the expected recovery effect. The test scenarios that we used and the actual results that we got are presented in the following sections.
Loss of network communication to the NFS server
Because the use of an NFS server resource was merely a secondary communication means (the primary one being the heartbeat network), the loss of communication between the cluster nodes and the NFS server did not actually have any visible results (other than the expected log entries).
Loss of production/heartbeat network communication on standby node
The loss of the production/heartbeat network communication on the standby node triggered no actual response, because no resource groups were online on that node at the time the simulated event occurred.
Loss of production/heartbeat network communication on active node
The loss of the production/heartbeat network communication on the active node triggered the expected fallover action. This occurred because the network service IP and the underlying network (as resources essential to the resource group that was online until the simulated event) were no longer available.
This action can be seen in both nodes’ logs. Example 3-27 shows the cluster.mmddyyy log for the disconnected node (the one that releases the resource group).
Example 3-27 The cluster.mmddyyy log for the node releasing the resource group
Nov 13 14:42:13 EVENT START: network_down clnode_1 net_ether_01
Nov 13 14:42:13 EVENT COMPLETED: network_down clnode_1 net_ether_01 0
Nov 13 14:42:13 EVENT START: network_down_complete clnode_1 net_ether_01
Nov 13 14:42:13 EVENT COMPLETED: network_down_complete clnode_1 net_ether_01 0
Nov 13 14:42:20 EVENT START: resource_state_change clnode_1
Nov 13 14:42:20 EVENT COMPLETED: resource_state_change clnode_1 0
Nov 13 14:42:20 EVENT START: rg_move_release clnode_1 1
Nov 13 14:42:20 EVENT START: rg_move clnode_1 1 RELEASE
Nov 13 14:42:20 EVENT START: stop_server app_IHS
Nov 13 14:42:20 EVENT COMPLETED: stop_server app_IHS 0
Nov 13 14:42:21 EVENT START: release_service_addr
Nov 13 14:42:22 EVENT COMPLETED: release_service_addr 0
Nov 13 14:42:25 EVENT COMPLETED: rg_move clnode_1 1 RELEASE 0
Nov 13 14:42:25 EVENT COMPLETED: rg_move_release clnode_1 1 0
Nov 13 14:42:27 EVENT START: rg_move_fence clnode_1 1
Nov 13 14:42:27 EVENT COMPLETED: rg_move_fence clnode_1 1 0
Nov 13 14:42:30 EVENT START: network_up clnode_1 net_ether_01
Nov 13 14:42:30 EVENT COMPLETED: network_up clnode_1 net_ether_01 0
Nov 13 14:42:31 EVENT START: network_up_complete clnode_1 net_ether_01
Nov 13 14:42:31 EVENT COMPLETED: network_up_complete clnode_1 net_ether_01 0
Nov 13 14:42:33 EVENT START: rg_move_release clnode_1 1
Nov 13 14:42:33 EVENT START: rg_move clnode_1 1 RELEASE
Nov 13 14:42:33 EVENT COMPLETED: rg_move clnode_1 1 RELEASE 0
Nov 13 14:42:33 EVENT COMPLETED: rg_move_release clnode_1 1 0
Nov 13 14:42:35 EVENT START: rg_move_fence clnode_1 1
Nov 13 14:42:36 EVENT COMPLETED: rg_move_fence clnode_1 1 0
Nov 13 14:42:38 EVENT START: rg_move_fence clnode_1 1
Nov 13 14:42:39 EVENT COMPLETED: rg_move_fence clnode_1 1 0
Nov 13 14:42:39 EVENT START: rg_move_acquire clnode_1 1
Nov 13 14:42:39 EVENT START: rg_move clnode_1 1 ACQUIRE
Nov 13 14:42:39 EVENT COMPLETED: rg_move clnode_1 1 ACQUIRE 0
Nov 13 14:42:39 EVENT COMPLETED: rg_move_acquire clnode_1 1 0
Nov 13 14:42:41 EVENT START: rg_move_complete clnode_1 1
Nov 13 14:42:42 EVENT COMPLETED: rg_move_complete clnode_1 1 0
Nov 13 14:42:46 EVENT START: rg_move_fence clnode_1 1
Nov 13 14:42:47 EVENT COMPLETED: rg_move_fence clnode_1 1 0
Nov 13 14:42:47 EVENT START: rg_move_acquire clnode_1 1
Nov 13 14:42:47 EVENT START: rg_move clnode_1 1 ACQUIRE
Nov 13 14:42:47 EVENT COMPLETED: rg_move clnode_1 1 ACQUIRE 0
Nov 13 14:42:47 EVENT COMPLETED: rg_move_acquire clnode_1 1 0
Nov 13 14:42:49 EVENT START: rg_move_complete clnode_1 1
Nov 13 14:42:53 EVENT COMPLETED: rg_move_complete clnode_1 1 0
Nov 13 14:42:55 EVENT START: resource_state_change_complete clnode_1
Nov 13 14:42:55 EVENT COMPLETED: resource_state_change_complete clnode_1 0
This action is also shown in Example 3-28 for the other node (the one that acquires the resource group).
Example 3-28 The cluster.mmddyyy log for the node acquiring the resource group
Nov 13 14:42:13 EVENT START: network_down clnode_1 net_ether_01
Nov 13 14:42:13 EVENT COMPLETED: network_down clnode_1 net_ether_01 0
Nov 13 14:42:14 EVENT START: network_down_complete clnode_1 net_ether_01
Nov 13 14:42:14 EVENT COMPLETED: network_down_complete clnode_1 net_ether_01 0
Nov 13 14:42:20 EVENT START: resource_state_change clnode_1
Nov 13 14:42:20 EVENT COMPLETED: resource_state_change clnode_1 0
Nov 13 14:42:20 EVENT START: rg_move_release clnode_1 1
Nov 13 14:42:20 EVENT START: rg_move clnode_1 1 RELEASE
Nov 13 14:42:20 EVENT COMPLETED: rg_move clnode_1 1 RELEASE 0
Nov 13 14:42:20 EVENT COMPLETED: rg_move_release clnode_1 1 0
Nov 13 14:42:27 EVENT START: rg_move_fence clnode_1 1
Nov 13 14:42:29 EVENT COMPLETED: rg_move_fence clnode_1 1 0
Nov 13 14:42:31 EVENT START: network_up clnode_1 net_ether_01
Nov 13 14:42:31 EVENT COMPLETED: network_up clnode_1 net_ether_01 0
Nov 13 14:42:31 EVENT START: network_up_complete clnode_1 net_ether_01
Nov 13 14:42:31 EVENT COMPLETED: network_up_complete clnode_1 net_ether_01 0
Nov 13 14:42:33 EVENT START: rg_move_release clnode_1 1
Nov 13 14:42:33 EVENT START: rg_move clnode_1 1 RELEASE
Nov 13 14:42:34 EVENT COMPLETED: rg_move clnode_1 1 RELEASE 0
Nov 13 14:42:34 EVENT COMPLETED: rg_move_release clnode_1 1 0
Nov 13 14:42:36 EVENT START: rg_move_fence clnode_1 1
Nov 13 14:42:36 EVENT COMPLETED: rg_move_fence clnode_1 1 0
Nov 13 14:42:39 EVENT START: rg_move_fence clnode_1 1
Nov 13 14:42:39 EVENT COMPLETED: rg_move_fence clnode_1 1 0
Nov 13 14:42:39 EVENT START: rg_move_acquire clnode_1 1
Nov 13 14:42:39 EVENT START: rg_move clnode_1 1 ACQUIRE
Nov 13 14:42:39 EVENT COMPLETED: rg_move clnode_1 1 ACQUIRE 0
Nov 13 14:42:39 EVENT COMPLETED: rg_move_acquire clnode_1 1 0
Nov 13 14:42:42 EVENT START: rg_move_complete clnode_1 1
Nov 13 14:42:45 EVENT COMPLETED: rg_move_complete clnode_1 1 0
Nov 13 14:42:47 EVENT START: rg_move_fence clnode_1 1
Nov 13 14:42:47 EVENT COMPLETED: rg_move_fence clnode_1 1 0
Nov 13 14:42:47 EVENT START: rg_move_acquire clnode_1 1
Nov 13 14:42:47 EVENT START: rg_move clnode_1 1 ACQUIRE
Nov 13 14:42:49 EVENT START: acquire_takeover_addr
Nov 13 14:42:50 EVENT COMPLETED: acquire_takeover_addr 0
Nov 13 14:42:50 EVENT COMPLETED: rg_move clnode_1 1 ACQUIRE 0
Nov 13 14:42:50 EVENT COMPLETED: rg_move_acquire clnode_1 1 0
Nov 13 14:42:50 EVENT START: rg_move_complete clnode_1 1
Nov 13 14:42:50 EVENT START: start_server app_IHS
Nov 13 14:42:51 EVENT COMPLETED: start_server app_IHS 0
Nov 13 14:42:52 EVENT COMPLETED: rg_move_complete clnode_1 1 0
Nov 13 14:42:55 EVENT START: resource_state_change_complete clnode_1
Nov 13 14:42:55 EVENT COMPLETED: resource_state_change_complete clnode_1 0
Note that neither log includes split_merge_prompt, site_down, or node_down events.
Loss of all network communication on standby node
The loss of all network communication (production/heartbeat and connectivity to the NFS server) on the standby node (the node without any online resource groups) triggered the restart of that node, in accordance with the split and merge action plan defined earlier.
As a starting point, both nodes were operational and the resource group was online on node clnode_1 (Example 3-29).
Example 3-29 Cluster nodes and resource group status before simulated network down event
clnode_1:/# clmgr -cva name,state,raw_state query node
# NAME:STATE:RAW_STATE
clnode_1:NORMAL:ST_STABLE
clnode_2:NORMAL:ST_STABLE
clnode_1:/#
 
 
clnode_1:/# clRGinfo
-----------------------------------------------------------------------------
Group Name Group State Node
-----------------------------------------------------------------------------
rg_IHS ONLINE clnode_1@site1
ONLINE SECONDARY clnode_2@site2
clnode_1:/#
We performed the following steps:
1. First, we temporarily (not persistent across restart) brought down the network interfaces on the standby node clnode_2, in a terminal console opened using the Hardware Management Console (HMC), as shown in Example 3-30.
Example 3-30 Simulating a network down event
clnode_2:/# ifconfig en0 down; ifconfig en1 down
clnode_2:/#
2. Then (in about a minute or less), as a response to the split-brain situation, the node clnode_2 (with no communication to the NFS server) rebooted itself. This can be seen on the virtual terminal console opened (using the HMC) on that node, and is also reflected by the status of the cluster nodes (Example 3-31).
Example 3-31 Cluster nodes status immediately after simulated network down event
clnode_1:/# clmgr -cva name,state,raw_state query node
# NAME:STATE:RAW_STATE
clnode_1:NORMAL:ST_STABLE
clnode_2:UNKNOWN:UNKNOWN
clnode_1:/#
3. After restart, the node clnode_2 was functional, but with cluster services stopped (Example 3-32).
Example 3-32 Cluster nodes and resource group status after node restart
clnode_1:/# clmgr -cva name,state,raw_state query node
# NAME:STATE:RAW_STATE
clnode_1:NORMAL:ST_STABLE
clnode_2:OFFLINE:ST_INIT
clnode_1:/#
 
 
clnode_2:/# clRGinfo
-----------------------------------------------------------------------------
Group Name Group State Node
-----------------------------------------------------------------------------
rg_IHS ONLINE clnode_1@site1
OFFLINE clnode_2@site2
clnode_2:/#
4. We then manually started the cluster services on the clnode_2 node (Example 3-33).
Example 3-33 Starting cluster services on the recently rebooted node
clnode_2:/# clmgr start node
[...]
clnode_2: Completed execution of /usr/es/sbin/cluster/etc/rc.cluster
clnode_2: with parameters: -boot -N -A -b -P cl_rc_cluster.
clnode_2: Exit status = 0
clnode_2:/#
5. Finally, we arrived at the exact initial situation that we had before the simulated network loss event, that is, with both nodes operational and the resource group online on node clnode_1 (Example 3-34).
Example 3-34 Cluster nodes and resource group status after cluster services start-up
clnode_2:/# clmgr -cva name,state,raw_state query node
# NAME:STATE:RAW_STATE
clnode_1:NORMAL:ST_STABLE
clnode_2:NORMAL:ST_STABLE
clnode_2:/#
 
clnode_2:/# clRGinfo
-----------------------------------------------------------------------------
Group Name Group State Node
-----------------------------------------------------------------------------
rg_IHS ONLINE clnode_1@site1
ONLINE SECONDARY clnode_2@site2
clnode_2:/#
Loss of all network communication on active node
The loss of all network communication (production/heartbeat and connectivity to the NFS server) on the active node (the node with the resource group online) triggered the restart of that node. At the same time, the resource group was independently brought online on the other node.
The test was performed just like the one on the standby node (see “Loss of all network communication on standby node” on page 56) and the process was similar. The only notable difference was that while the previously active node (now disconnected) was restarting, the other node (previously the standby node) was now bringing the resource group online, thus ensuring service availability.
3.5.6 Log entries for monitoring and debugging
As expected, the usual system and cluster log files also contain information related to the NFS tie breaker events and actions. However, the particular content of these logs varies significantly depending on the node that records them and its role in such an event.
Error report (errpt)
The surviving node included log entries as presented (in chronological order, older entries first) in Example 3-35.
Example 3-35 Error report events on the surviving node
LABEL: CONFIGRM_SITE_SPLIT
Description
ConfigRM received Site Split event notification
 
 
LABEL: CONFIGRM_PENDINGQUO
Description
The operational quorum state of the active peer domain has changed to PENDING_QUORUM. This state usually indicates that exactly half of the nodes that are defined in the peer domain are online. In this state cluster resources cannot be recovered although none will be stopped explicitly.
 
 
LABEL: LVM_GS_RLEAVE
Description
Remote node Concurrent Volume Group failure detected
 
 
LABEL: CONFIGRM_HASQUORUM_
Description
The operational quorum state of the active peer domain has changed to HAS_QUORUM.
In this state, cluster resources may be recovered and controlled as needed by
management applications.
The disconnected (and then rebooted) node included log entries as presented (again in chronological order, older entries first) in Example 3-36.
Example 3-36 Error report events on the rebooted node
LABEL: CONFIGRM_SITE_SPLIT
Description
ConfigRM received Site Split event notification
 
 
LABEL: CONFIGRM_PENDINGQUO
Description
The operational quorum state of the active peer domain has changed to PENDING_QUORUM. This state usually indicates that exactly half of the nodes that are defined in the peer domain are online. In this state cluster resources cannot be recovered although none will be stopped explicitly.
 
 
LABEL: LVM_GS_RLEAVE
Description
Remote node Concurrent Volume Group failure detected
 
 
LABEL: CONFIGRM_NOQUORUM_E
Description
The operational quorum state of the active peer domain has changed to NO_QUORUM.
This indicates that recovery of cluster resources can no longer occur and that
the node may be rebooted or halted in order to ensure that critical resources
are released so that they can be recovered by another sub-domain that may have
operational quorum.
 
 
LABEL: CONFIGRM_REBOOTOS_E
Description
The operating system is being rebooted to ensure that critical resources are
stopped so that another sub-domain that has operational quorum may recover
these resources without causing corruption or conflict.
 
 
LABEL: REBOOT_ID
Description
SYSTEM SHUTDOWN BY USER
 
 
LABEL: CONFIGRM_HASQUORUM_
Description
The operational quorum state of the active peer domain has changed to HAS_QUORUM.
In this state, cluster resources may be recovered and controlled as needed by
management applications.
 
 
LABEL: CONFIGRM_ONLINE_ST
Description
The node is online in the domain indicated in the detail data.
Note that the rebooted node’s log includes additional information compared to the surviving node’s log, including entries about the restart event.
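These entries can be reviewed on each node with the standard errpt command: errpt shows the summary listing, and errpt -a shows the detailed descriptions quoted above (you can also filter on a specific error label with the -J flag):

clnode_2:/# errpt | more
clnode_2:/# errpt -a | more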
The cluster.mmddyyy log file
For each split-brain situation that occurred, the content of the cluster.mmddyyy log file was similar on the two nodes. The surviving node’s log entries are presented in Example 3-37.
Example 3-37 The cluster.mmddyyy log entries on the surviving node
Nov 13 13:40:03 EVENT START: split_merge_prompt split
Nov 13 13:40:07 EVENT COMPLETED: split_merge_prompt split 0
Nov 13 13:40:07 EVENT START: site_down site2
Nov 13 13:40:09 EVENT START: site_down_remote site2
Nov 13 13:40:09 EVENT COMPLETED: site_down_remote site2 0
Nov 13 13:40:09 EVENT COMPLETED: site_down site2 0
Nov 13 13:40:09 EVENT START: node_down clnode_2
Nov 13 13:40:09 EVENT COMPLETED: node_down clnode_2 0
Nov 13 13:40:11 EVENT START: rg_move_release clnode_1 1
Nov 13 13:40:11 EVENT START: rg_move clnode_1 1 RELEASE
Nov 13 13:40:11 EVENT COMPLETED: rg_move clnode_1 1 RELEASE 0
Nov 13 13:40:11 EVENT COMPLETED: rg_move_release clnode_1 1 0
Nov 13 13:40:11 EVENT START: rg_move_fence clnode_1 1
Nov 13 13:40:12 EVENT COMPLETED: rg_move_fence clnode_1 1 0
Nov 13 13:40:14 EVENT START: node_down_complete clnode_2
Nov 13 13:40:14 EVENT COMPLETED: node_down_complete clnode_2 0
The log entries for the same event, but this time on the disconnected or rebooted node, are shown in Example 3-38.
Example 3-38 The cluster.mmddyyy log entries on the rebooted node
Nov 13 13:40:03 EVENT START: split_merge_prompt split
Nov 13 13:40:03 EVENT COMPLETED: split_merge_prompt split 0
Nov 13 13:40:12 EVENT START: site_down site1
Nov 13 13:40:13 EVENT START: site_down_remote site1
Nov 13 13:40:13 EVENT COMPLETED: site_down_remote site1 0
Nov 13 13:40:13 EVENT COMPLETED: site_down site1 0
Nov 13 13:40:13 EVENT START: node_down clnode_1
Nov 13 13:40:13 EVENT COMPLETED: node_down clnode_1 0
Nov 13 13:40:15 EVENT START: network_down clnode_2 net_ether_01
Nov 13 13:40:15 EVENT COMPLETED: network_down clnode_2 net_ether_01 0
Nov 13 13:40:15 EVENT START: network_down_complete clnode_2 net_ether_01
Nov 13 13:40:15 EVENT COMPLETED: network_down_complete clnode_2 net_ether_01 0
Nov 13 13:40:18 EVENT START: rg_move_release clnode_2 1
Nov 13 13:40:18 EVENT START: rg_move clnode_2 1 RELEASE
Nov 13 13:40:18 EVENT COMPLETED: rg_move clnode_2 1 RELEASE 0
Nov 13 13:40:18 EVENT COMPLETED: rg_move_release clnode_2 1 0
Nov 13 13:40:18 EVENT START: rg_move_fence clnode_2 1
Nov 13 13:40:19 EVENT COMPLETED: rg_move_fence clnode_2 1 0
Nov 13 13:40:21 EVENT START: node_down_complete clnode_1
Nov 13 13:40:21 EVENT COMPLETED: node_down_complete clnode_1 0
Note that this log also includes the information about the network_down event.
The cluster.log file
The cluster.log file included much of the information in the cluster.mmddyyy log file. The notable difference was that the cluster.log file also included information about the quorum status (losing and regaining quorum). For the disconnected or rebooted node only, the cluster.log file has information about the restart event, as shown in Example 3-39.
Example 3-39 The cluster.log entries on the rebooted node
Nov 13 13:40:03 clnode_2 [...] EVENT START: split_merge_prompt split
Nov 13 13:40:03 clnode_2 [...] CONFIGRM_SITE_SPLIT_ST ConfigRM received Site Split event notification
Nov 13 13:40:03 clnode_2 [...] EVENT COMPLETED: split_merge_prompt split 0
Nov 13 13:40:09 clnode_2 [...] CONFIGRM_PENDINGQUORUM_ER The operational quorum state of the active peer domain has changed to PENDING_QUORUM. This state usually indicates that exactly half of the nodes that are defined in the peer domain are online. In this state cluster resources cannot be recovered although none will be stopped explicitly.
Nov 13 13:40:12 clnode_2 [...] EVENT START: site_down site1
Nov 13 13:40:13 clnode_2 [...] EVENT START: site_down_remote site1
Nov 13 13:40:13 clnode_2 [...] EVENT COMPLETED: site_down_remote site1 0
Nov 13 13:40:13 clnode_2 [...] EVENT COMPLETED: site_down site1 0
Nov 13 13:40:13 clnode_2 [...] EVENT START: node_down clnode_1
Nov 13 13:40:13 clnode_2 [...] EVENT COMPLETED: node_down clnode_1 0
Nov 13 13:40:15 clnode_2 [...] EVENT START: network_down clnode_2 net_ether_01
Nov 13 13:40:15 clnode_2 [...] EVENT COMPLETED: network_down clnode_2 net_ether_01 0
Nov 13 13:40:15 clnode_2 [...] EVENT START: network_down_complete clnode_2 net_ether_01
Nov 13 13:40:16 clnode_2 [...] EVENT COMPLETED: network_down_complete clnode_2 net_ether_01 0
Nov 13 13:40:18 clnode_2 [...] EVENT START: rg_move_release clnode_2 1
Nov 13 13:40:18 clnode_2 [...] EVENT START: rg_move clnode_2 1 RELEASE
Nov 13 13:40:18 clnode_2 [...] EVENT COMPLETED: rg_move clnode_2 1 RELEASE 0
Nov 13 13:40:18 clnode_2 [...] EVENT COMPLETED: rg_move_release clnode_2 1 0
Nov 13 13:40:18 clnode_2 [...] EVENT START: rg_move_fence clnode_2 1
Nov 13 13:40:19 clnode_2 [...] EVENT COMPLETED: rg_move_fence clnode_2 1 0
Nov 13 13:40:21 clnode_2 [...] EVENT START: node_down_complete clnode_1
Nov 13 13:40:21 clnode_2 [...] EVENT COMPLETED: node_down_complete clnode_1 0
Nov 13 13:40:29 clnode_2 [...] CONFIGRM_NOQUORUM_ER The operational quorum state of the active peer domain has changed to NO_QUORUM. This indicates that recovery of cluster resources can no longer occur and that the node may be rebooted or halted in order to ensure that critical resources are released so that they can be recovered by another sub-domain that may have operational quorum.
Nov 13 13:40:29 clnode_2 [...] CONFIGRM_REBOOTOS_ER The operating system is being rebooted to ensure that critical resources are stopped so that another sub-domain that has operational quorum may recover these resources without causing corruption or conflict.
[...]
Nov 13 13:41:32 clnode_2 [...] RMCD_INFO_0_ST The daemon is started.
Nov 13 13:41:33 clnode_2 [...] CONFIGRM_STARTED_ST IBM.ConfigRM daemon has started.
Nov 13 13:42:03 clnode_2 [...] GS_START_ST Group Services daemon started DIAGNOSTIC EXPLANATION HAGS daemon started by SRC. Log file is /var/ct/1Z4w8kYNeHvP2dxgyEaCe2/log/cthags/trace.
Nov 13 13:42:36 clnode_2 [...] CONFIGRM_HASQUORUM_ST The operational quorum state of the active peer domain has changed to HAS_QUORUM. In this state, cluster resources may be recovered and controlled as needed by management applications.
Nov 13 13:42:36 clnode_2 [...] CONFIGRM_ONLINE_ST The node is online in the domain indicated in the detail data. Peer Domain Name nfs_tiebr_cluster
Nov 13 13:42:38 clnode_2 [...] STORAGERM_STARTED_ST IBM.StorageRM daemon has started.