Introduction to IBM PowerHA SystemMirror for IBM AIX
This chapter provides an introduction to IBM PowerHA SystemMirror for newcomers to this solution, and serves as a refresher for those who have implemented PowerHA and used it for many years.
This chapter contains the following topics:
1.1 What IBM PowerHA SystemMirror for IBM AIX is
IBM PowerHA SystemMirror for IBM AIX (PowerHA) is the IBM Power Systems data center solution that helps protect critical business applications from outages, planned or unplanned. A major objective of PowerHA is to keep business services available automatically, by providing redundancy that masks individual component failures.
PowerHA depends on Reliable Scalable Cluster Technology (RSCT). RSCT is a set of low-level operating system components that enable the implementation of clustering technologies, such as IBM Spectrum Scale™ (formerly IBM General Parallel File System, IBM GPFS™). RSCT is distributed with AIX. On the current AIX release, AIX V7.1, RSCT is at version 3.1.2.0. After the PowerHA and Cluster Aware AIX (CAA) file sets are installed, the RSCT topology services subsystem is deactivated, and all of its functions are performed by CAA.
PowerHA V7.1 and later rely heavily on the CAA infrastructure that is available in AIX V6.1 TL6 and AIX V7.1. CAA provides communication interfaces and monitoring capabilities for PowerHA, and enables commands to be run on all cluster nodes through the clcmd command.
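For example, after the cluster is active, clcmd can run a command on every node from a single node. The following sketch shows typical checks (the exact output format varies by release):

  # Run 'date' on every node in the cluster
  clcmd date

  # Check the PowerHA cluster manager subsystem on all nodes
  clcmd lssrc -s clstrmgrES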
PowerHA Enterprise Edition also provides disaster recovery functionality, such as cross-site mirroring, IBM HyperSwap®, Geographic Logical Volume Manager (GLVM) mirroring, and many storage-based replication methods. These cross-site clustering methods support PowerHA functionality between two geographic sites. For more information, see IBM PowerHA SystemMirror 7.1.2 Enterprise Edition for AIX, SG24-8106.
For more information about the features that were added in PowerHA V7.1.1 and later, see 1.3, “History and evolution” on page 6.
1.1.1 High availability
In today’s complex environments, providing continuous service for applications is a key component of a successful IT implementation. High availability (HA) is one of the components that contributes to providing continuous service for application clients, by masking or eliminating both planned and unplanned system and application downtime.
A high availability solution ensures that the failure of any component of the solution, whether hardware, software, or system management, does not cause the application and its data to become permanently unavailable to the user. High availability solutions can help to eliminate single points of failure through appropriate design, planning, selection of hardware, configuration of software, control of applications, a carefully controlled environment, and change management discipline.
In short, we can define high availability as the process of ensuring, through the use of duplicated or shared hardware resources, managed by a specialized software component, that an application stays up and available for use.
1.1.2 Cluster multiprocessing
In addition to high availability, PowerHA also provides the multiprocessing component. The multiprocessing capability comes from the fact that in a cluster there are multiple hardware and software resources managed by PowerHA to provide complex application functionality and better resource use.
A short definition of cluster multiprocessing is multiple applications running over several nodes with shared or concurrent access to the data.
Although desirable, the cluster multiprocessing component depends on the application capabilities and system implementation to efficiently use all resources available in a multi-node (cluster) environment. This must be implemented starting with the cluster planning and design phase.
PowerHA is only one of the available HA technologies. It builds on increasingly reliable operating systems, hot-swappable hardware, and increasingly resilient applications by offering monitoring and automated response. A high availability solution based on PowerHA provides automated failure detection, diagnosis, application recovery, and node reintegration. PowerHA can also provide excellent horizontal and vertical scalability when combined with other advanced functions, such as dynamic logical partitioning (DLPAR) and Capacity on Demand (CoD).
1.2 Availability solutions: An overview
Many solutions can provide a wide range of availability options. Table 1-1 lists various types of availability solutions and their characteristics.
Table 1-1 Types of availability solutions
Solution                        Downtime   Data availability        Observations
Stand-alone                     Days       From last backup         Basic hardware and software
Enhanced stand-alone            Hours      Until last transaction   Duplication of most hardware components
High availability clustering    Seconds    Until last transaction   Duplicated hardware and additional software costs
Fault-tolerant                  Zero       No loss of data          Specialized hardware and software; very expensive
High availability solutions, in general, offer the following benefits:
Standard hardware and networking components (can be used with the existing hardware)
Works with nearly all applications
Works with a wide range of disks and network types
Excellent availability at a reasonable cost
The highly available solution for IBM Power Systems offers distinct benefits:
Proven solution with ~26 years of product development
Using “off the shelf” hardware components
Proven commitment for supporting our customers
IP version 6 (IPv6) support for both internal and external cluster communication
Smart Assist technology that enables high availability support for all prominent applications
Flexibility (virtually any application that runs on a stand-alone AIX system can be protected with PowerHA)
When you plan to implement a PowerHA solution, consider the following aspects:
Thorough HA design and detailed planning from end to end
Elimination of single points of failure
Selection of appropriate hardware
Correct implementation (do not take “shortcuts”)
Disciplined system administration practices and change control
Documented operational procedures
Comprehensive test plan and thorough testing
A typical PowerHA environment is shown in Figure 1-1. It includes both IP heartbeat networks and non-IP heartbeating, which is performed through the cluster repository disk.
Figure 1-1 PowerHA cluster example
1.2.1 Downtime
Downtime is the period when an application is not available to serve its clients. Downtime can be classified into two categories, planned and unplanned:
Planned
 – Hardware upgrades
 – Hardware or software repair or replacement
 – Software updates or upgrades
 – Backups (offline backups)
 – Testing (periodic testing is required for cluster validation)
 – Development
Unplanned
 – Administrator errors
 – Application failures
 – Hardware failures
 – Operating system errors
 – Environmental disasters
The role of PowerHA is to maintain application availability through the unplanned outages and normal day-to-day administrative requirements. PowerHA provides monitoring and automatic recovery of the resources on which your application depends.
1.2.2 Single point of failure (SPOF)
A single point of failure is any individual component of a cluster whose failure renders the application unavailable to its users.
Good design can remove single points of failure in the cluster: nodes, storage, and networks. PowerHA manages these, and also the resources required by the application (including the application start/stop scripts).
Ultimately, the goal of any IT solution in a critical environment is to provide continuous application availability and data protection. High availability is just one building block in achieving the continuous operation goal. High availability is based on the availability of the hardware, software (operating system and its components), application, and network components.
To avoid single points of failure, use the following items:
Redundant servers
Redundant network paths
Redundant storage (data) paths
Redundant (mirrored, RAID) storage
Monitoring of components
Failure detection and diagnosis
Automated application fallover
Automated resource reintegration
As previously mentioned, a good design can avoid single points of failure, and PowerHA can manage the availability of the application through downtime. Table 1-2 lists the cluster objects whose failure can result in loss of availability of the application. Each cluster object can be a physical or logical component.
Table 1-2 Single points of failure
Cluster object        SPOF eliminated by
Node (server)         Multiple nodes
Power/power supply    Multiple circuits, power supplies, or an uninterruptible power supply (UPS)
Network               Multiple networks connected to each node, and redundant network paths with independent hardware between each node and the clients
Network adapters      Redundant adapters, and other HA features such as EtherChannel and shared Ethernet adapters (SEA) through the Virtual I/O Server (VIOS)
I/O adapters          Redundant I/O adapters and multipathing software
Controllers           Redundant controllers
Storage               Redundant hardware, enclosures, disk mirroring or Redundant Array of Independent Disks (RAID) technology, and redundant data paths
Application           Application monitoring, and backup nodes that are configured to acquire the application engine and data
Sites                 Use of more than one site for disaster recovery
Resource groups       Use of resource groups to control all resources required by an application
PowerHA also optimizes availability by allowing for dynamic reconfiguration of running clusters. Maintenance tasks such as adding or removing nodes can be performed without stopping and restarting the cluster.
In addition, other management tasks, such as modifying storage and managing users, can be performed on the running cluster by using the Cluster Single Point of Control (C-SPOC) without interrupting user access to the application that runs on the cluster nodes. C-SPOC also ensures that changes that are made on one node are replicated across the cluster in a consistent manner.
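For example, C-SPOC operations can be started from any single node in the cluster. The following sketch shows common entry points (menu layout and output vary by release):

  # Open the C-SPOC (System Management) SMIT menus from any cluster node
  smitty cl_admin

  # Many cluster-wide queries can also be scripted with the clmgr command
  clmgr query cluster            # summary of the cluster definition and state
  clmgr query resource_group     # list the configured resource groups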
1.3 History and evolution
IBM High Availability Cluster Multi-Processing (IBM HACMP™) development started in 1990 to provide high availability solutions for applications that run on IBM RS/6000® servers. We do not provide information about the early releases, which are no longer supported or were not in use at the time this publication was written. Instead, we provide highlights about the most recent versions.
Originally designed as a stand-alone product (known as HACMP classic), HACMP adopted the IBM high availability infrastructure known as Reliable Scalable Cluster Technology (RSCT) when it became available, becoming HACMP Enhanced Scalability (HACMP/ES), which provided performance and functional advantages over the classic version. Starting with HACMP V5.1, the classic version was discontinued. The product was later renamed from HACMP to PowerHA with V5.5, and then to PowerHA SystemMirror with V6.1.
Starting with PowerHA V7.1, the Cluster Aware AIX (CAA) feature of the operating system is used to configure, verify, and monitor the cluster services. This major change improved the reliability of PowerHA because the cluster service functions now run in kernel space rather than user space. CAA was introduced in AIX V6.1 TL6. At the time of writing, the current release is PowerHA V7.2.0 SP1.
1.3.1 PowerHA SystemMirror version 7.1.1
Released in late 2011, PowerHA V7.1.1 introduced improvements to PowerHA in terms of administration, security, and simplification of management tasks.
The following list summarizes the improvements in PowerHA V7.1.1:
Federated security, which allows a cluster-wide single point of control for security features such as these:
 – Encrypted file system (EFS) support
 – Role-based access control (RBAC) support
 – Authentication by using Lightweight Directory Access Protocol (LDAP) methods
Logical Volume Manager (LVM) and C-SPOC enhancements, including the following items:
 – EFS management by C-SPOC
 – Support for mirror pools
 – Disk renaming inside the cluster
 – Support for EMC, Hitachi, and HP disk subsystem multipath logical unit numbers (LUNs) as a cluster repository disk
 – Capability to display disk Universally Unique Identifier (UUID)
 – A file system mounting feature (journaled file system (JFS2) Mount Guard) that prevents simultaneous mounting of the same file system by two nodes, which can cause data corruption (see the example after this list)
Repository resiliency
Dynamic automatic reconfiguration (DARE) progress indicator
Application management improvements, such as a new application startup option
When you add an application controller, you can choose the application startup mode. The default is background startup mode, in which cluster activation moves forward while the application start script runs in the background. Alternatively, you can choose foreground startup mode.
When you choose foreground startup mode, the cluster activation is sequential: cluster events wait for the application startup script to finish. If the application script ends with a failure (a nonzero return code), the cluster activation is also considered failed, as the following sketch illustrates.
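The following minimal start script sketch shows the return code behavior that matters in foreground startup mode. The application path and name are hypothetical:

  #!/bin/ksh
  # Minimal, hypothetical PowerHA application start script.
  # With foreground startup mode, a nonzero exit code causes the
  # cluster activation to be considered failed.

  APP_BIN=/usr/local/myapp/bin/myapp_server   # hypothetical application binary

  ${APP_BIN} -daemon
  if [ $? -ne 0 ]; then
      echo "myapp failed to start" >&2
      exit 1    # nonzero return code: PowerHA treats the startup as failed
  fi

  exit 0        # zero return code: startup succeeded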
New network features, such as the ability to define a network as private, use of the netmon.cf file, and more network tunables
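The Mount Guard feature that is mentioned in the list above can also be set with standard AIX commands outside of C-SPOC. A minimal sketch, with a hypothetical file system name:

  # Enable Mount Guard on a shared JFS2 file system
  chfs -a mountguard=yes /sharedfs

  # Verify the file system attributes, including the mount guard setting
  lsfs -q /sharedfs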
 
Note: More details and examples of implementing these features are found in IBM PowerHA SystemMirror Standard Edition 7.1.1 for AIX Update, SG24-8030, which is available at the following website:
1.3.2 PowerHA SystemMirror version 7.1.2
Released in October 2012, PowerHA V7.1.2 continued to add features and functionality:
Two new cluster types (stretched and linked clusters):
 – Stretched cluster refers to a cluster with sites that are defined in the same geographic location. It uses a shared repository disk. Extended-distance sites with only IP connectivity are not possible with this cluster type.
 – Linked cluster refers to a cluster with only Internet Protocol (IP) connectivity across sites, and is typically used with PowerHA Enterprise Edition.
IPv6 support reintroduced
Backup repository disk
Site support reintroduced with Standard Edition
PowerHA Enterprise Edition reintroduced:
 – New HyperSwap support added for DS88XX
 – All storage replication options that were previously supported in PowerHA V6.1 remain supported:
  • IBM DS8000® Metro Mirror and Global Mirror
  • SAN Volume Controller Metro Mirror and Global Mirror
  • IBM Storwize® V7000 Metro Mirror and Global Mirror
  • EMC Corporation SRDF synchronous and asynchronous replication
  • Hitachi TrueCopy and HUR replication
  • HP Continuous Access synchronous and asynchronous replication
 – Geographic Logical Volume Manager (GLVM)
Note: Additional details and examples of implementing some of these features are found in the IBM PowerHA SystemMirror 7.1.2 Enterprise Edition for AIX, SG24-8106, which is available at the following website:
1.3.3 PowerHA SystemMirror version 7.1.3
Released in October 2013, PowerHA V7.1.3 continued the development of PowerHA SystemMirror, by adding further improvements in management, configuration simplification, automation, and performance areas. The following list summarizes the improvements in PowerHA V7.1.3:
Unicast heartbeat
Dynamic host name change
Cluster split and merge handling policies
clmgr command enhancements:
 – Embedded hyphen and leading digit support in node labels
 – Native Hypertext Markup Language (HTML) report
 – Cluster copying through snapshots
 – Syntactical built-in help (see the example after this list)
 – Split and merge support
CAA enhancements:
 – Scalability up to 32 nodes
 – Support for unicast and multicast
 – Dynamic host name or IP address support
HyperSwap enhancements:
 – Active-active sites
 – One node HyperSwap
 – Auto resynchronization of mirroring
 – Node level unmanage mode support
 – Enhanced repository disk swap management
PowerHA plug-in enhancements for IBM Systems Director:
 – Restore snapshot wizard
 – Cluster simulator
 – Cluster split/merge support
Smart Assist for SAP enhancements
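The clmgr built-in help that is noted in the list above can be explored directly from the command line. A short sketch (the exact help output and invocation details vary by release):

  # Display the top-level clmgr usage summary
  clmgr -h

  # Display contextual help for a specific action and object class
  clmgr query resource_group -h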
 
Note: More details and examples of implementing some of these features are found in the IBM PowerHA SystemMirror 7.1.2 Enterprise Edition for AIX, SG24-8106, which is available at the following website:
1.3.4 PowerHA SystemMirror version 7.2.0
Released in December 2015, PowerHA V7.2 continued the development of PowerHA SystemMirror, by adding further improvements in management, configuration simplification, automation, and performance areas.
The following list summarizes the improvements in PowerHA V7.2:
Resiliency enhancements
 – Integrated support for AIX Live Kernel Update (LKU)
 – Automatic repository replacement
 – Verification enhancements
 – Exploitation of LVM rootvg failure monitoring
 – Live Partition Mobility automation
Cluster Aware AIX (CAA) enhancements
 – Network failure detection tunable per interface
 – Built-in netmon logic
 – Traffic stimulation for better interface failure detection
Enhanced split brain handling
 – Quarantine protection against “sick but not dead” nodes
 – Network File System (NFS) Tie Breaker support for split and merge policies
Resource optimized high availability (ROHA) fallovers using Enterprise Pools
Non-disruptive upgrades
1.4 High availability terminology and concepts
To understand the functionality of PowerHA and to use it effectively, it helps to be familiar with several important terms and concepts.
1.4.1 Terminology
The terminology used to describe PowerHA configuration and operation continues to evolve. The following terms are used throughout this book:
Cluster A loosely coupled collection of independent systems (nodes) or logical partitions (LPARs) that are organized into a network for sharing resources and communicating with each other.
PowerHA defines relationships among cooperating systems where peer cluster nodes provide the services offered by a cluster node if that node is unable to do so. These individual nodes are together responsible for maintaining the functionality of one or more applications in case of a failure of any cluster component.
Node An IBM Power Systems server (or LPAR) that runs AIX and PowerHA and is defined as part of a cluster. Each node has a collection of resources (disks, file systems, IP addresses, and applications) that can be transferred to another node in the cluster if the node or a component fails.
Client A client is a system that can access the application that runs on the cluster nodes over a local area network (LAN). Clients run a client application that connects to the server (node) where the application runs.
Topology Contains the basic cluster components: nodes, networks, communication interfaces, and communication adapters.
Resources Logical components or entities that are being made highly available (for example, file systems, raw devices, service IP labels, and applications) by being moved from one node to another. All resources that together form a highly available application or service, are grouped together in resource groups (RG).
PowerHA keeps the RG highly available as a single entity that can be moved from node to node if a component or node fails. Resource groups can be available from a single node or, for concurrent applications, available simultaneously from multiple nodes. A cluster can host more than one resource group, thus allowing for efficient use of the cluster nodes.
Service IP label A label that maps to a service IP address and is used for communications between clients and the node. A service IP label is part of a resource group, which means that PowerHA can monitor it and keep it highly available.
IP address takeover (IPAT)
The process whereby an IP address is moved from one adapter to another adapter on the same logical network. This adapter can be on the same node, or another node in the cluster. If aliasing is used as the method of assigning addresses to adapters, then more than one address can exist on a single adapter.
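For example, with IPAT through aliasing, the service address appears as an additional (alias) address on the interface that currently hosts it. The following sketch shows how to inspect this on a node; the interface name is an example:

  # List all addresses that are currently configured on interface en0;
  # with aliasing, the service IP appears alongside the base address
  ifconfig en0

  # Show a per-interface summary of configured addresses
  netstat -i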
Resource takeover This is the operation of transferring resources between nodes inside the cluster. If one component or node fails because of a hardware or operating system problem, its resource groups are moved to another node.
Fallover This represents the movement of a resource group from one active node to another node (backup node) in response to a failure on that active node.
Fallback This represents the movement of a resource group back from the backup node to the previous node, when it becomes available. This movement is typically in response to the reintegration of the previously failed node.
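This movement can also be triggered manually, which is a common way to test fallover and fallback behavior. A sketch with hypothetical resource group and node names:

  # Move resource group rg1 to node2 (a controlled fallover, useful for testing)
  clmgr move resource_group rg1 NODE=node2

  # After node1 rejoins the cluster, move the group back (a manual fallback)
  clmgr move resource_group rg1 NODE=node1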
Heartbeat packet A packet that is sent between communication interfaces in the cluster, and is used by the various cluster daemons to monitor the state of the cluster components (nodes, networks, adapters).
RSCT daemons These consist of two types of processes: topology services and group services. PowerHA uses group services, but depends on CAA for topology services. The cluster manager receives event information generated by these daemons and takes corresponding (response) actions in case of any failure.
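On a running cluster node, the RSCT subsystems can be checked with the lssrc command. A quick sketch (subsystem names can vary slightly by release):

  # Check RSCT group services, which PowerHA uses
  lssrc -s cthags

  # Check the RSCT resource monitoring and control subsystem
  lssrc -s ctrmc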
1.5 Fault tolerance versus high availability
Based on the response time and response action to detected system failures, clusters and systems can belong to one of the following classifications:
Fault-tolerant systems
High availability systems
1.5.1 Fault-tolerant systems
Systems provided with fault tolerance are designed to operate virtually without interruption, regardless of the failure that might occur (except perhaps for a complete site outage because of a natural disaster). In such systems, all components, both software and hardware, are at least duplicated.
All components, processors (CPUs), memory, and disks have a special design and provide continuous service, even if one subcomponent fails. Only special software solutions can run on fault-tolerant hardware.
Such systems are expensive and extremely specialized. Implementing a fault tolerant solution requires much effort and a high degree of customization for all system components.
For environments where no downtime is acceptable (life critical systems), fault-tolerant equipment and solutions are required.
1.5.2 High availability systems
The systems configured for high availability are a combination of hardware and software components that are configured to work together to ensure automated recovery in case of failure with a minimal acceptable downtime.
In such systems, the software that is involved detects problems in the environment and manages application survivability by restarting the application on the same machine or on another available machine (taking over the identity of the original node).
Therefore, eliminating all single points of failure (SPOF) in the environment is important. For example, if the machine has only one network interface (connection), provide a second network interface (connection) in the same node to take over in case the primary interface that is providing the service fails.
Another important issue is to protect the data by mirroring and placing it on shared disk areas, accessible from any machine in the cluster.
The PowerHA software provides the framework and a set of tools for integrating applications in a highly available system. Applications to be integrated in a PowerHA cluster can require a fair amount of customization, possibly both at the application level and at the PowerHA and AIX platform level. PowerHA is a flexible platform that allows integration of generic applications that are running on the AIX platform, providing for highly available systems at a reasonable cost.
Remember: PowerHA is not a fault-tolerant solution and should never be implemented as one.
1.6 Additional PowerHA resources
The following list describes more PowerHA resources:
Entitled Software Support (download images)
PowerHA, CAA, and RSCT ifixes
PowerHA wiki
Probably the most comprehensive resource. The wiki contains links to all of the following references and much more. It can be found at the following website:
PowerHA LinkedIn Group
PowerHA V7.2 release notes
Base publications
All of the following PowerHA V7 publications are available on the following website:
 – Administering PowerHA SystemMirror
 – Developing Smart Assist applications for PowerHA SystemMirror
 – Geographic Logical Volume Manager for PowerHA SystemMirror Enterprise Edition
 – Installing PowerHA SystemMirror
 – Planning PowerHA SystemMirror
 – PowerHA SystemMirror concepts
 – PowerHA SystemMirror for IBM Systems Director
 – Programming client applications for PowerHA SystemMirror
 – Quick reference: clmgr command
 – Smart Assists for PowerHA SystemMirror
 – Storage-based high availability and disaster recovery for PowerHA SystemMirror Enterprise Edition
 – Troubleshooting PowerHA SystemMirror
PowerHA and Capacity Backup
Videos
Shawn Bodily has several PowerHA related videos on his YouTube channel:
DeveloperWorks Discussion forum
IBM Redbooks publications
Each IBM PowerHA Redbooks publication has a slightly different focus, but usually covers what is new in a particular release. These publications generally have more details and advanced tips than the base publications.
Each new IBM Redbooks publication is rarely a complete replacement for the previous one. The exception is the IBM PowerHA SystemMirror for AIX Cookbook, which was updated for version 7.1.3 and replaces two previous cookbooks.
It is probably the most comprehensive of the current Redbooks publications specifically regarding PowerHA Standard Edition. Although there is some overlap across the publications, and multiple versions are supported, reference the Redbooks publication that is relevant to the version that you are using.
Figure 1-2 shows a list of relevant PowerHA Redbooks publications. Although the list still includes PowerHA V6.1 Enterprise Edition, which is no longer supported, that publication remains the best reference for configuring EMC SRDF and Hitachi TrueCopy.
Figure 1-2 PowerHA Redbooks cross reference
White papers
 – PowerHA V7.1 quick config guide
 – Implementing PowerHA with Storwize V7000
 – PowerHA with EMC V-Plex
 – Tips and Consideration with Oracle 11gR2 with PowerHA on AIX
 – Tips and Consideration with Oracle 12cR1 with PowerHA on AIX
 – Edison Group Report on the value of deep integration of PowerHA V7.1 and AIX
 – PowerHA Case Study of Robert Wood Johnson University Hospital
 – PowerHA V7 Rapid Deploy worksheets
https://www.ibm.com/developerworks/aix/tutorials/au-ibm-powerha-system-mirror
 – Performance Implications of LVM Mirroring
 – AIX Higher Availability using SAN services
http://www.ibm.com/developerworks/aix/library/au-AIX_HA_SAN/index.html#N10237