Combining Local/Metro continuous availability with out-of-region disaster recovery
In this chapter we discuss the capabilities and considerations for implementing GDPS/Metro Global Mirror (GDPS/MGM) and GDPS/Metro z/OS Global Mirror (GDPS/MzGM). It will be of interest to clients that require both continuous availability locally and out-of-region disaster recovery protection.
GDPS/MGM and GDPS/MzGM combine the continuous availability attributes of GDPS/PPRC with the out-of-region disaster recovery capabilities of GDPS/GM or GDPS/XRC to protect critical business data in the event of a wide-scale disruption, while providing for fast automated recovery for various failure conditions.
 
Note: Both GDPS/PPRC and GDPS/PPRC HyperSwap Manager can be combined with Global Mirror or z/OS Global Mirror, as described in this chapter. To facilitate readability, only GDPS/PPRC will be used in the text for most of the discussions. If a particular function is not supported by GDPS/PPRC HyperSwap Manager, it will be explicitly mentioned.
Functions provided by these technologies include:
Three-copy disk mirroring using GDPS/PPRC to support zero data loss for day-to-day disruptions at metropolitan distances, and GDPS/GM or GDPS/XRC for long distance, out-of-region data protection, with limited data loss in the event of a wide-scale disruption.
Multisite management of the remote copy environment to maintain data integrity and data consistency across all three disk copies.
Transparent switching to secondary disks in the event of a primary disk storage subsystem failure, using GDPS/PPRC with HyperSwap.
 – Ability to incrementally resynchronize the GDPS/GM1 or GDPS/XRC mirror after a PPRC HyperSwap.
Fast automated recovery, with an RTO of less than one hour, for site and regional disasters.
Zero data loss protection for both Open Systems and System z using GDPS/PPRC and GDPS/GM, assuming that only one site is lost in the event of a disaster.
Use of FlashCopy to facilitate nondisruptive functions (such as backups, data mining, application testing, disaster recovery testing), and to provide a consistent copy of the data during remote copy synchronization to ensure disaster readiness is maintained at all times.
Planned switch to running production in the recovery region and return home.
9.1 Introduction
Enterprises running highly critical applications have an increasing need to improve the overall resilience of their business services and functions. Enterprises already doing synchronous replication have become accustomed to the availability benefits of relatively short distance synchronous replication. This is especially true in mainframe environments where the capabilities of HyperSwap provide the ability to handle disk subsystem failures without an outage and to utilize server capacity in both sites.
Regulatory bodies (both governmental and industry-based) in various countries are requiring enterprises to maintain a significant distance between their primary and disaster locations to protect against wide-scale disruptions. For some organizations, this can result in a requirement to establish backup facilities well outside the range of synchronous replication capabilities, thus driving the need to implement asynchronous disk mirroring solutions.
From a business perspective, this could mean compromising continuous availability to comply with regulatory requirements. With a three-copy disk mirroring solution, the availability benefits of synchronous replication can be combined with the distance allowed by asynchronous replication to meet both the availability expectations of the business and the requirements of the regulator.
9.2 Design considerations
In the following sections we describe design considerations to keep in mind, including three-copy solutions versus three-site solutions; multitarget and cascading topologies; and cost considerations.
9.2.1 Three-copy solutions versus three-site solutions
It is not always the case that clients implementing a three-copy mirroring solution will have three independent data centers (shown in Figure 9-1), each with the capability to run production workloads.
Figure 9-1 Three-site solution
Having three distinct locations with both the connectivity required for the replication and connectivity for user access is expensive and might not provide sufficient cost justification. Additionally, as the distance between the locations connected with synchronous mirroring increases, the ability to provide continuous availability features such as cross-site disk access, HyperSwap, or CF duplexing diminishes.
Having a production location with two copies of data within a single data center (shown in Figure 9-2), along with a third copy of the data at a remote recovery location, provides many of the benefits of a full three-site solution at a reduced overall cost. Disk subsystem failures are handled as local failures, and if the single site has some degree of internal resilience, even minor “disaster-type” events can perhaps be handled within the single location.
Figure 9-2 Two-site solution
Another benefit of the two-data center solution, especially in a System z environment, is that you can realize the full benefit of features such as HyperSwap and Coupling Facility Duplexing to provide continuous availability features without provisioning significant additional and expensive cross-site connectivity, or having concerns regarding the impact of extended distance on production workloads.
Figure 9-3 illustrates another variation of this scenario, in which the primary data center is a campus location with separate machine rooms or buildings, each with the ability to run production workloads.
Figure 9-3 Two-site solution - Campus and Recovery site
In the past, clients often used the bunker topology (shown in Figure 9-4) to create a solution that could provide mirroring at extended distances, but still handle a primary site failure without data loss.
Figure 9-4 Two sites and an intermediate bunker
There are a number of arguments against this approach:
1. For guaranteed zero data loss you need a policy in which, if mirroring stops, the production applications are also stopped. Some clients have implemented such a policy, but it is not common. If production is allowed to continue after a local mirroring failure, then zero data loss cannot be guaranteed in all situations.
2. If the disaster event also affects the bunker site or affects the bunker site first, then zero data loss is again not guaranteed. If the reason for the extended distance to the recovery site was to handle regional events, then this possibility cannot be excluded.
3. The networking and hardware costs of the bunker site are probably still considerable despite there being no servers present. Further investment in the availability characteristics of the primary location or in a campus-type solution in which the synchronous secondary disk subsystems can be used for production services might provide a greater return on investment for the business.
9.2.2 Multitarget and cascading topologies
Multitarget and cascading topologies are similar in terms of capabilities in that both provide a synchronous and an asynchronous copy of the production data. Certain failure scenarios are handled more simply by multitarget solutions, and others by cascading solutions.
The key requirements for either topology are:
1. A viable recovery copy/capability is available at all times in a location other than where production is running. It is possible that there will be regulatory requirements that demand this.
2. Any single-site failure results in, at most, a short outage of the replication capability between the surviving sites, minimizing the window during which a second failure could cause additional data loss.
The first requirement implies that there are no situations where both offsite copies will be compromised.
The second requirement makes it extremely desirable to have the capability to perform incremental resynchronization between any two copies of the data. Without this capability, there will be an extended period of exposure to additional data loss if a second failure occurs.
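To illustrate why this matters, the following sketch (Python, purely illustrative; the class, method names, and track granularity are assumptions of ours and do not correspond to any GDPS or disk subsystem interface) models the change-recording idea behind incremental resynchronization: only the tracks that changed since the last common consistency point are recopied, rather than the entire volume.

# Illustrative sketch only: a simplified model of change-recording bitmaps,
# not GDPS or disk subsystem code. Track granularity and names are assumptions.

class ChangeRecordingBitmap:
    """Tracks which extents changed since the last common consistency point."""
    def __init__(self, total_tracks: int):
        self.changed = set()          # indices of changed tracks
        self.total_tracks = total_tracks

    def record_write(self, track: int) -> None:
        self.changed.add(track)       # mark track as out of sync

    def tracks_to_copy(self) -> set:
        return set(self.changed)      # incremental resync copies only these

    def reset(self) -> None:
        self.changed.clear()          # called once the target is back in sync


def resync_cost(bitmap: ChangeRecordingBitmap) -> float:
    """Fraction of the volume that must be recopied (1.0 = full initial copy)."""
    return len(bitmap.changed) / bitmap.total_tracks


# Example: if only 2% of tracks changed while replication was interrupted,
# an incremental resynchronization recopies ~2% of the data instead of 100%.
bm = ChangeRecordingBitmap(total_tracks=100_000)
for t in range(2_000):
    bm.record_write(t)
print(f"incremental copy fraction: {resync_cost(bm):.1%}")   # -> 2.0%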
9.2.3 Cost considerations
The third location is, in many situations, regarded as an insurance copy and as mainly providing regulatory compliance. This might imply that costs for this location are kept to an absolute minimum.
Reducing the network bandwidth to the remote location can significantly reduce the overall cost of the solution. Given that a synchronous copy is already available, trading off the RPO against the cost of the network might be a useful compromise, especially if the periods of increased RPO coincide with batch processing or database maintenance, where the transactional data loss would be smaller.
Using a disaster recovery service provider such as IBM BCRS is one method of reducing the costs of the third location. Shared hardware assets and the removal of the requirement to invest in an additional physical location can provide significant cost benefits, and with the majority of events expected to be handled in the two main locations, the disadvantages of a shared facility are reduced.
9.3 GDPS Metro/Global Mirror solution
GDPS provides two “three-site” solutions:
GDPS Metro/Global Mirror (GDPS/MGM) is a cascading data replication solution for both System z and distributed systems data.
GDPS Metro/z/OS Global Mirror (GDPS/MzGM) is a multitarget data replication solution for System z data.
This section describes the capabilities and requirements of the GDPS Metro/Global Mirror (GDPS/MGM) solution.
GDPS Metro/Global Mirror (GDPS/MGM) is a cascading data replication solution that combines the capabilities of GDPS/PPRC and GDPS/GM.
Synchronous replication between a primary and secondary disk subsystem located either within a single data center, or between two data centers located within metropolitan distances, is implemented with GDPS/PPRC or GDPS/PPRC HyperSwap Manager.
GDPS/GM is used to asynchronously replicate data from the secondary disks to a third disk subsystem located in a recovery site that is typically out of the local metropolitan region. Because both Metro Mirror and Global Mirror are hardware-based remote copy technologies, CKD and FBA devices can be mirrored to the recovery site, thereby protecting both System z and open system data.
For enterprises that require consistency across both distributed systems and System z data, GDPS/MGM provides a comprehensive three-copy data replication strategy to protect against day-to-day disruptions, while protecting critical business data and functions in the event of a wide-scale disruption.
9.3.1 GDPS/MGM overview
The GDPS/MGM configuration shown in Figure 9-5 is a three-site continuous availability and DR solution. In this example, Site1 and Site2 are running an active/active workload (refer to 3.2.3, “Active/active configuration” on page 66 for more information about this topic) and are located within metropolitan distances to ensure optimal application performance. All data required to recover critical workloads is resident on disk and mirrored. Each site is configured with sufficient spare capacity to handle failed-over workloads in the event of a site outage.
The third site, or recovery site, can be located at virtually unlimited distance from Site1 and Site2 to protect against regional disasters. Asynchronous replication is running between Site2 and the recovery site. Redundant network connectivity is installed between Site1 and the recovery site to provide for continued disaster recovery protection in the event of a Site2 disaster, or a failure of the disk subsystems in Site2. See “Incremental resynchronization for GDPS/MGM” on page 245 for more details.
Sufficient CPU capacity is installed to support the R-sys. CBU is installed, and GDPS will invoke CBU on System z to provide the additional capacity needed to support production workloads if disaster recovery is invoked.
Figure 9-5 GDPS Metro Global Mirror configuration
The A disks are synchronously mirrored to the B disks located in Site2 using Metro Mirror. The B disks are then asynchronously mirrored to a third (C) set of disks located in the recovery site using Global Mirror. A fourth (D) set of disks, also located in the recovery site, is the set of FlashCopy targets used to provide the consistent data for disaster recovery. A fifth, optional (F) set of disks is used for stand-alone disaster recovery testing or, in the event of a real disaster, to create a “gold” or insurance copy of the data. For more detailed information about Global Mirror, refer to Chapter 6, “GDPS/Global Mirror” on page 149.
Because there is likely to be some distance between the local sites (Site1 and Site2, which run the PPRC leg of MGM) and the remote recovery site (the GM recovery site), we also distinguish between the local sites and the remote site using Region terminology. Site1 and Site2 are in one region, Region A, and the remote recovery site is in another region, Region B.
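As a reading aid, the following sketch (Python, purely illustrative; the names and structures are assumptions of ours, not a GDPS definition format) summarizes the cascaded MGM topology and the disk roles just described.

# Illustrative sketch only: a minimal model of the GDPS/MGM cascaded topology.

from dataclasses import dataclass

@dataclass
class ReplicationLeg:
    source: str
    target: str
    mode: str    # "synchronous" (Metro Mirror) or "asynchronous" (Global Mirror)
    span: str

MGM_TOPOLOGY = [
    ReplicationLeg("A disks (Site1)", "B disks (Site2)", "synchronous", "within Region A"),
    ReplicationLeg("B disks (Site2)", "C disks (recovery site)", "asynchronous", "Region A to Region B"),
]

# Copies held in the recovery site (Region B):
RECOVERY_SITE_COPIES = {
    "C": "Global Mirror secondary",
    "D": "FlashCopy target providing the consistent DR copy",
    "F": "optional FlashCopy target for stand-alone testing or a 'gold' copy",
}

for leg in MGM_TOPOLOGY:
    print(f"{leg.source} -> {leg.target}: {leg.mode} ({leg.span})")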
Incremental resynchronization for GDPS/MGM
The incremental resynchronization functionality of Metro Global Mirror allows incremental resynchronization between Site1 and the recovery site when the intermediate site, Site2, or the disk subsystems in the intermediate site, are not available.
Without this capability, if the intermediate site becomes unavailable, the data at the recovery site starts to age because data can no longer be replicated. Instead of requiring a new Global Mirror session from the production site to the recovery site (and a full copy), the incremental resynchronization capability of GDPS/MGM supports a configuration where only the incremental changes must be copied from Site1 to the recovery site.
Figure 9-6 GDPS Metro Global Mirror configuration after Site2 outage
Figure 9-6 shows how GDPS/MGM can establish a Global Mirror session between the production site, Site1, and the recovery site when it detects that the intermediate site is unavailable. After the session is established, only an incremental resynchronization of the changed data needs to be performed, thus allowing the disaster recovery capability to be restored in minutes, instead of hours, when the intermediate site is not available.
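The following sketch (Python, purely illustrative; the function name and parameters are assumptions of ours and do not correspond to GDPS script statements or commands) summarizes the decision flow just described: if the intermediate site is unavailable and incremental resynchronization is enabled, the Global Mirror session is re-established from Site1 and only the changed data is sent.

# Illustrative sketch only: the decision flow followed when the intermediate
# site becomes unavailable, expressed as pseudocode-style Python.

def restore_dr_capability(site2_available: bool, ir_enabled: bool) -> str:
    """Return how the Global Mirror session to the recovery site is (re)established."""
    if site2_available:
        return "no action: GM continues to run B -> C"
    if ir_enabled:
        # Incremental resynchronization: start GM from Site1 (A disks) and
        # send only the tracks that changed since the last consistency point.
        return "start GM A -> C and resynchronize incrementally (minutes)"
    # Without IR, a brand-new GM session requires a full initial copy.
    return "start GM A -> C with a full initial copy (hours or days)"

print(restore_dr_capability(site2_available=False, ir_enabled=True))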
GDPS/MGM Incremental Resynchronization Tool
The GDPS/MGM Incremental Resynchronization Tool (IR Tool) is supplied with GDPS/GM. Unlike the add-on GDPS tools we have described for the various GDPS products, IR Tool is a fully supported component of GDPS for use in an IR configuration. This tool can be used for three purposes:
To incrementally reintroduce the Site2 intermediate disk if GM had been incrementally resynchronized from Site1 to the recovery site.
The tool provides the ability to return to an A-B-C configuration when running in an A-C configuration. Without the tool, returning to an A-B-C configuration would require a full initial copy for both PPRC (A-disk to B-disk) and GM (B-disk to C-disk). Thus, the tool provides significant availability and disaster recovery benefits for IR environments.
Note that the tool can only be used for this purpose if the B-disk is returned “intact,” meaning that the metadata on the disk subsystem pertaining to its status as a PPRC secondary and GM primary disk is still available. If you need to introduce a new disk into the configuration, a full initial copy of all the data is required.
To perform a planned toggle between the A-disk and the B-disk.
If you intend to perform periodic “flip/flops” of Site1 and Site2 (or A-disk and B-disk), the tool allows you to go from an A-B-C configuration to a B-A-C configuration and then back to an A-B-C configuration in conjunction with A-disk to B-disk planned HyperSwap and B-disk to A-disk planned HyperSwap.
To incrementally “return home,” reintroducing both the A-disk and the B-disk, after recovering production on the C-disk or after you have switched production to the C-disk.
This is a C to A-B-C (or B-A-C) transformation. It assumes that both the A-disk and the B-disk are returned intact. Although the MGM mirror can be incrementally reinstated, a production outage is necessary to move production from running on the C-disk in the recovery site back to either the A-disk or the B-disk in one of the local sites.
Note that the GDPS/MGM Incremental Resynchronization Tool supports only CKD disks. Incremental Resynchronization is not supported with GDPS/PPRC HM.
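The three transformations described above can be summarized as a small set of configuration transitions. The following sketch (Python, purely illustrative; the transition-table structure and state names are a reading aid of ours, not a GDPS interface) captures them.

# Illustrative sketch only: the configuration transformations supported by the
# GDPS/MGM Incremental Resynchronization Tool, modeled as a transition table.

IR_TOOL_TRANSITIONS = {
    # (current configuration, action)            : resulting configuration
    ("A-C",   "reintroduce intact B-disk")       : "A-B-C",
    ("A-B-C", "planned HyperSwap A -> B")        : "B-A-C",
    ("B-A-C", "planned HyperSwap B -> A")        : "A-B-C",
    ("C",     "reintroduce intact A and B disks"): "A-B-C (or B-A-C)",
}

def next_configuration(current: str, action: str) -> str:
    # Anything outside the table (for example, a replaced, non-intact disk)
    # requires a full initial copy.
    return IR_TOOL_TRANSITIONS.get((current, action), "full initial copy required")

# Example: reintroducing the Site2 disks intact after running A-C.
print(next_configuration("A-C", "reintroduce intact B-disk"))   # -> A-B-C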
9.3.2 GDPS/MGM Site1 failures
The primary role of GDPS is to protect the integrity of the B copy of the data. At the first indication of a failure in Site1, GDPS/PPRC will freeze all B disks, both CKD and FBA, to prevent logical contamination of data residing on the B devices. For a more detailed description of GDPS/PPRC processing, refer to Chapter 3, “GDPS/PPRC” on page 51.
At this point, the GDPS/GM session between Site2 and the recovery site is still running and it is most likely that both locations will have the same set of data after a brief period of time. The business focus is now on restarting the production systems in either Site2 or the recovery site, depending on the failure scenario. If the systems are started in Site2, the GDPS/GM solution is already in place.
9.3.3 GDPS/MGM Site2 failures
In this situation the production systems are still running, so the business requirement is to ensure that disaster recovery capabilities are restored as fast as possible. The GDPS/GM session should be restarted as soon as possible between Site1 and the recovery site using incremental resynchronization. See “Incremental resynchronization for GDPS/MGM” on page 245 for more details. If incremental resynchronization is not configured, a full copy is required.
This scenario possibly has less impact on the business than a failure of the production site, but this depends on the specific environment.
9.3.4 GDPS/MGM region switch and return home
It is possible to switch production from running in Region A (in either Site1 or Site2) to Region B. Many GDPS/MGM customers run Site1 and Site2 in the same physical site, or on a campus where the two sites are separated by little distance. In such configurations there could be planned outage events, such as complete power maintenance, that are likely to affect both sites.
Similarly, an unplanned event that impacts both sites will force recovery in Region B.
While production runs in Region B, the disk subsystems in this region track the updates that are made. When Region A is available again, assuming that all disks configured in the region come back intact, it is possible to return production to Region A using the GDPS/MGM Incremental Resynchronization Tool without requiring a full copy of the data back. Because the updates have been tracked, only the data that changed while Region A was down is sent back to the Region A disks to bring them up to date. Production is then shut down in Region B, the final updates are allowed to drain to Region A, and production can then be restarted in Region A.
Because Region A and Region B are not symmetrically configured, the capabilities and levels of protection offered when production runs in Region B will be different. Most notably, because there is no PPRC of the production data in Region B, there is no HyperSwap protection to provide continuous data access. For the same reason, the various operational procedures for GDPS will also be different when running in Region B. However, even if no outage is planned for Region A, switching production to Region B periodically (for example, once or twice a year) and running live production there for a brief period of time is the best form of disaster testing because it will provide the best indication of whether Region B is properly configured to sustain real, live production workloads.
9.3.5 Scalability in a GDPS/MGM environment
As described in “Addressing z/OS device limits in a GDPS/PPRC environment” on page 25, GDPS/PPRC supports defining the PPRC secondary devices in alternate subchannel set 1 (MSS1), allowing up to nearly 64 K devices to be mirrored in a GDPS/PPRC configuration. The definitions of these devices are in the application site I/O definitions.
Similarly, “Addressing z/OS device limits in a GDPS/GM environment” on page 33 describes how GM supports defining the GM FlashCopy target devices in alternate subchannel set MSS1 in the recovery site I/O definitions, and not defining the practice FlashCopy target devices to the GDPS/GM R-sys at all, again allowing up to nearly 64 K devices to be mirrored in a GDPS/GM configuration.
In a GDPS/MGM environment where the PPRC secondary devices defined in MSS1 are the GM primary devices, there is additional support in GDPS/GM that allows the GM primary devices to be defined in MSS1. With the combined alternate subchannel set support in GDPS/PPRC and GDPS/MGM, up to nearly 64 K devices can be replicated using the MGM technology.
9.3.6 Other considerations in a GDPS/MGM environment
With Global Mirror, it is possible to deliberately underconfigure the bandwidth provided in order to reduce the total cost of the solution. If there are significant write peaks, this saving could be considerable because the network is often a significant portion of ongoing costs. The drawback of underconfiguring bandwidth is that it can impact the recovery point that can be achieved. If a disaster affects the entire production region (both Site1 and Site2) during a peak when the GM mirror is running behind, there is likely to be more data loss.
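The following back-of-the-envelope sketch (Python, purely illustrative; the write rates and link bandwidth are invented numbers, not sizing guidance) shows how a write peak that exceeds the replication bandwidth causes the mirror to run behind, and therefore how much data would be at risk if Region A were lost at the height of the peak.

# Illustrative sketch only: a simple model of unreplicated backlog on an
# underconfigured asynchronous replication link. Real sizing must be based
# on measured write rates.

def gm_backlog(write_rates_mb_s, link_mb_s, interval_s=60):
    """Return the unreplicated backlog (MB) after each interval."""
    backlog, history = 0.0, []
    for rate in write_rates_mb_s:
        backlog = max(0.0, backlog + (rate - link_mb_s) * interval_s)
        history.append(backlog)
    return history

# A batch peak writes 300 MB/s for 10 minutes over a 200 MB/s link,
# then drops back to 100 MB/s while the backlog drains.
peak = [300] * 10 + [100] * 10          # MB/s per one-minute interval
backlog = gm_backlog(peak, link_mb_s=200)
print(f"max backlog: {max(backlog):.0f} MB")   # data at risk if the region is lost at the peak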
9.3.7 Managing the GDPS/MGM environment
GDPS provides a range of solutions for disaster recovery and continuous availability in a System z-centric environment. GDPS/MGM provides support for Metro Global Mirror within a GDPS environment. GDPS builds on facilities provided by System Automation and NetView, and utilizes inband connectivity to manage the Metro Global Mirror relationships.
GDPS/MGM runs two different services to manage Metro Global Mirror, both of which run on z/OS systems, as explained here:
GDPS/PPRC services run on every z/OS image in the production sysplex and the Controlling systems, K1 and K2, located in Site1 and Site2. Each Controlling system is allocated on its own non-mirrored disk and has access to the primary and secondary disk subsystems.
During normal operations, the master function runs in the Controlling system located where the secondary disks reside. This is where the day-to-day management and recovery of the PPRC mirroring environment is performed. If Site1 or Site2 fails, the Master system manages the recovery of the PPRC disks and production systems.
The second Controlling system is an alternate and will take over the master function if the Master Controlling system becomes unavailable.
The GDPS/GM services run in the Kg and R-sys Controlling systems. Kg runs in the production sysplex and is responsible for controlling the Global Mirror environment and sending information to the R-sys running in the recovery site. The R-sys is responsible for carrying out all recovery actions in the event of a wide-scale disruption that impacts both Site1 and Site2.
In addition to managing the operational aspects of Global Mirror, GDPS/GM also provides facilities to restart System z production systems in the recovery site. By providing scripting facilities, it provides a complete solution for the restart of a System z environment, in a disaster situation, without requiring expert manual intervention to manage the recovery process.
GDPS supports both System z and distributed systems devices in a Metro Global Mirror environment.
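As a summary of the management structure just described, the following sketch (Python, purely illustrative; the layout is a reading aid of ours, not a GDPS definition) lists the Controlling systems, where they run, and what they manage.

# Illustrative sketch only: the GDPS/MGM controlling systems described above.

CONTROLLING_SYSTEMS = {
    "K1":    ("Site1", "PPRC leg; Master or alternate (the Master runs where the secondary disks reside)"),
    "K2":    ("Site2", "PPRC leg; Master or alternate (the Master runs where the secondary disks reside)"),
    "Kg":    ("production sysplex, Region A", "controls the Global Mirror environment and sends information to the R-sys"),
    "R-sys": ("recovery site, Region B", "carries out recovery actions if a wide-scale disruption impacts both Site1 and Site2"),
}

for name, (location, role) in CONTROLLING_SYSTEMS.items():
    print(f"{name:6s} ({location}): {role}")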
9.3.8 Flexible testing in a GDPS/MGM environment
To facilitate testing of site failover and failback processing, consider installing additional disk capacity to support FlashCopy in Site1 and Site2. The FlashCopy can be used at both Site1 and Site2 to maintain disaster recovery checkpoints during remote copy resynchronization. This ensures there is a consistent copy of the data available if a disaster-type event should occur while testing your site failover and failback procedures. In addition, the FlashCopy could be used to provide a copy to be used for testing or backing up data without the need for extended outages to production systems.
GDPS/MGM supports additional FlashCopy disk devices, referred to as F disks. These are additional FlashCopy target devices that can optionally be created in the recovery site. The F disks can be used to facilitate stand-alone testing of your disaster recovery procedures while the Global Mirror environment is running, which ensures that a consistent and current copy of the data is available at all times. In addition, the F disks can be used to create a “gold” or insurance copy of the data if a disaster situation occurs.
Currently, GDPS/MGM supports the definition and management of a single F device for each Metro Global Mirror triplet (B, C, and D disk combination) in the configuration. To reduce management and operational complexity, GDPS/GM supports the F disks without requiring them to be defined in the I/O configurations of the GDPS systems that manage them. Known as “No UCB” FlashCopy, this support allows F disks to be defined without the need to define additional UCBs to the GDPS management systems.
9.3.9 GDPS Query Services in a GDPS/MGM environment
GDPS/PPRC provides Query Services, allowing you to query various aspects of the PPRC leg of a GDPS/MGM environment. Similarly, GDPS/GM provides Query Services, allowing you to query various aspects of the GM leg of a GDPS/MGM environment.
The GDPS/GM query services also have awareness of the fact that a particular environment is a GDPS/MGM environment enabled for Incremental Resynchronization (IR) and returns additional information pertaining to the IR aspects of the environment. In a GM environment, at any time, the GM session could be running from Site2 to the recovery site (B disk to C disk) or from Site1 to the recovery site (A disk to C disk). If GM is currently running B to C, this is the Active GM relationship and the A to C relationship is the Standby GM relationship. The GM query services in an MGM IR environment return information about both the active and the standby relationships for the physical and logical control units in the configuration and the devices in the configuration.
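The following sketch (Python, purely illustrative; the field names are assumptions of ours and not the actual query output format) models the active and standby GM relationships that Query Services report in an MGM IR environment.

# Illustrative sketch only: active versus standby GM relationships in an
# MGM IR environment, depending on which site GM is currently running from.

from dataclasses import dataclass

@dataclass
class GMRelationship:
    primary: str     # "A (Site1)" or "B (Site2)"
    secondary: str   # "C (recovery site)"
    role: str        # "active" or "standby"

def gm_relationships(gm_running_from: str):
    """Return the active/standby pair for the current GM direction."""
    if gm_running_from == "B":
        return [GMRelationship("B (Site2)", "C (recovery site)", "active"),
                GMRelationship("A (Site1)", "C (recovery site)", "standby")]
    return [GMRelationship("A (Site1)", "C (recovery site)", "active"),
            GMRelationship("B (Site2)", "C (recovery site)", "standby")]

for rel in gm_relationships("B"):
    print(f"{rel.role:7s}: {rel.primary} -> {rel.secondary}")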
9.3.10 Prerequisites for GDPS/MGM
GDPS/MGM has the following prerequisites:
GDPS/PPRC or GDPS/PPRC HM is required. If GDPS/PPRC HM is used, the Incremental Resynchronization function is not available.
GDPS/GM is required and the GDPS/GM prerequisites must be met.
Consult with your storage vendor to ensure that the required features and functions are supported on your disk subsystems.
 
Important: For the latest GDPS prerequisite information, refer to the GDPS product web site, available at:
9.4 GDPS Metro z/OS Global Mirror solution
GDPS provides two “three-site” solutions:
GDPS Metro/Global Mirror (GDPS/MGM) is a cascading data replication solution for both System z and distributed systems data.
GDPS Metro/z/OS Global Mirror (GDPS/MzGM) is a multitarget data replication solution for System z data.
This section describes the capabilities and requirements of the GDPS Metro/z/OS Global Mirror (GDPS/MzGM) solution.
GDPS Metro/z/OS Global Mirror is a multitarget data replication solution that combines the capabilities of GDPS/PPRC and GDPS/XRC.
GDPS/PPRC or GDPS/PPRC HyperSwap Manager is used to manage the synchronous replication between a primary and secondary disk subsystem located either within a single data center, or between two data centers located within metropolitan distances.
GDPS/XRC is used to asynchronously replicate data from the primary disks to a third disk system located in a recovery site, typically out of the local metropolitan region. Because z/OS Global Mirror (XRC) only supports CKD devices, only System z data can be mirrored to the recovery site.
For enterprises that want to protect System z data, GDPS/MzGM delivers a three-copy replication strategy to provide continuous availability for day-to-day disruptions, while protecting critical business data and functions in the event of a wide-scale disruption.
9.4.1 GDPS/MzGM overview
The solution depicted in Figure 9-7 is an example of a three-site GDPS/MzGM continuous availability and DR implementation. In this example, Site1 and Site2 are running an active/active workload (refer to 3.2.3, “Active/active configuration” on page 66) and located within metropolitan distances to ensure optimal application performance. All data required to recover critical workloads is resident on disk and mirrored. Each site is configured with sufficient spare capacity to handle failed-over workloads in the event of a site outage.
The third site, or recovery site, can be located at a virtually unlimited distance from Site1 and Site2 to protect against regional disasters. Because of the extended distance, GDPS/XRC is used to asynchronously replicate between Site1 and the recovery site.
Redundant network connectivity is installed between Site2 and the recovery site to provide for continued data protection and DR capabilities in the event of a Site1 disaster, or a failure of the disk subsystems in Site1. See “Incremental resynchronization for GDPS/MzGM” on page 251 for more details. Sufficient mainframe resources are allocated to support the SDMs and GDPS/XRC Controlling system. In the event of a disaster situation, GDPS will invoke CBU to provide the additional capacity needed to support production workloads.
Figure 9-7 GDPS z/OS Metro Global Mirror
The A disks are synchronously mirrored to the B disks located in Site2 using Metro Mirror. In addition, A disks are asynchronously mirrored to a third (C) set of disks located in the recovery site using z/OS Global Mirror (XRC). An optional, and highly recommended, fourth (F) set of disks located in the recovery site are used to create FlashCopy of the C disks. These disks can then be used for stand-alone disaster recovery testing, or in the event of a real disaster, to create a “gold” or insurance copy of the data. For more detailed information about z/OS Global Mirror, refer to Chapter 5, “GDPS/XRC” on page 127.
Because there is likely to be some distance between the local sites (Site1 and Site2, which run the PPRC leg of MzGM) and the remote recovery site (the XRC recovery site), we also distinguish between the local sites and the remote site using Region terminology. Site1 and Site2 are in one region, Region A, and the remote recovery site is in another region, Region B.
Incremental resynchronization for GDPS/MzGM
The incremental resynchronization (IR) functionality of Metro z/OS Global Mirror aims to allow you to move the z/OS Global Mirror (XRC) primary disk location from Site1 to Site2 or vice versa without having to perform a full initial copy of all data.
Without incremental resynchronization, if Site1 becomes unavailable and the PPRC primary disk is swapped to Site2, the data at the recovery site starts to age because updates are no longer being replicated. The disaster recovery capability can be restored by establishing a new XRC session from Site2 to the recovery site. However, without incremental resynchronization, a full copy is required, and this could take several hours or even days for significantly large configurations. Incremental resynchronization allows the XRC mirror to be restored using the Site2 disks as primary, sending to the recovery site only the changes that have occurred since the PPRC disk switch.
Figure 9-8 GDPS Metro z/OS Global Mirror configuration after Site1 outage
Figure 9-8 shows how GDPS/MzGM can establish a z/OS Global Mirror session between Site2 and the recovery site when it detects that Site1 is unavailable. After the session is established, only an incremental resynchronization of the changed data needs to be performed, thus allowing the disaster recovery capability to be restored in minutes, instead of hours, when Site1 is not available. GDPS can optionally perform this resynchronization of the XRC session using the swapped-to disks completely automatically, requiring no operator intervention.
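The following sketch (Python, purely illustrative; the function and parameter names are assumptions of ours, not GDPS script statements) summarizes this multitarget flow: after a HyperSwap from the A disks to the B disks, the XRC leg is re-anchored on the B disks and only the updates made since the disk switch are sent to the recovery site.

# Illustrative sketch only: restoring the XRC leg after a PPRC HyperSwap in a
# multitarget (MzGM) configuration, expressed as pseudocode-style Python.

def after_hyperswap(xrc_primary: str, swapped_to: str, ir_enabled: bool) -> str:
    """Describe how the XRC leg is restored after a PPRC HyperSwap."""
    if xrc_primary == swapped_to:
        return "XRC already anchored on the surviving disks; nothing to do"
    if ir_enabled:
        # Re-anchor XRC on the swapped-to (B) disks and send only the updates
        # made since the PPRC disk switch.
        return f"move XRC primary to {swapped_to} and resynchronize incrementally"
    return f"move XRC primary to {swapped_to} with a full initial copy"

# Example: production was swapped from the A disks (Site1) to the B disks (Site2).
print(after_hyperswap(xrc_primary="A (Site1)", swapped_to="B (Site2)", ir_enabled=True))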
9.4.2 GDPS/MzGM Site1 failures
At the first indication of a failure, GDPS will issue a freeze command to protect the integrity of the B copy of the disk. For a more detailed description of GDPS/PPRC processing, refer to Chapter 3, “GDPS/PPRC” on page 51.
If the freeze event is part of a larger problem in which you can no longer use the A-disk or Site1, you must recover the B-disk and restart the production applications using the B-disk. After the production systems are restarted, the business focus will be on establishing z/OS Global Mirror (XRC) replication between Site2 and the recovery site as soon as possible. You can perform incremental resynchronization from the B-disk to the C-disk and maintain disaster recovery readiness.
Note that if the failure was caused by a primary disk subsystem failure, and Site1 systems are not impacted, GDPS/PPRC will use HyperSwap to transparently switch all systems in the production sysplex to the secondary disks in Site2, and the production systems will continue to run. In this case also, GDPS can perform incremental resynchronization from the B-disk to the C-disk and maintain disaster recovery readiness.
9.4.3 GDPS/MzGM Site2 failures
In this situation the production systems in Site1 continue to run and replication to the remote site is still running. GDPS, based on user-defined actions, will restart the Site2 production systems in Site1. No action is required from an application or disaster recovery solution perspective. This scenario has less impact on the business than a failure of the Site1 location. When Site2 is recovered, if the disks have survived, an incremental resynchronization can be initiated to resynchronize the A and B disks.
9.4.4 GDPS/MzGM region switch and return home
It is possible to switch production from running in Region A (in either Site1 or Site2) to Region B. Many GDPS/MzGM customers run Site1 and Site2 in the same physical site, or on a campus where the two sites are separated by little distance. In such configurations there could be planned outage events, such as complete power maintenance, that are likely to affect both sites.
Similarly, an unplanned event that impacts both sites will force recovery in Region B.
When Region A is available again, assuming that all disks configured in the region come back intact, it is possible to return production back to Region A using a GDPS-provided step-by-step procedure to accomplish this return home operation.
To move data back to Region A, the z/OS Global Mirror (XRC) remote copy environment must be designed to allow the mirroring session to be reversed. Production will be running in Region B, and Region A will need to run the GDPS/XRC SDM systems. This means you need to ensure that the proper connectivity and resources are configured in both regions to allow them to assume the recovery region role.
Because Region A and Region B are not symmetrically configured, the capabilities and levels of protection offered when production runs in Region B will be different. Most notably, because there is no PPRC of the production data in Region B, there is no HyperSwap protection to provide continuous data access. For the same reason, the various operational procedures for GDPS will also be different when running in Region B. However, even if no outage is planned for Region A, switching production to Region B periodically (for example, once or twice a year) and running live production there for a brief period of time is the best form of disaster testing because it will provide the best indication of whether Region B is properly configured to sustain real, live production workloads.
9.4.5 Management of the GDPS/MzGM environment
GDPS/MzGM provides management functions for a Metro z/OS Global Mirror configuration in a GDPS environment. The GDPS/PPRC management functions described in 9.3.7, “Managing the GDPS/MGM environment” on page 248, are also provided by GDPS/MzGM.
GDPS/XRC services run on the Kx Controlling system located in the recovery site along with the SDM systems. The SDM and Kx systems must be in the same sysplex. The Kx Controlling system is responsible for managing the z/OS Global Mirror (XRC) remote copy process, and recovering the production systems if a disaster occurs. It has no awareness of what is happening in Site1 and Site2.
If a wide-scale disruption that impacts both Site1 and Site2 occurs, the operator must initiate the recovery action to restart production systems in the recovery site. At this point the Kx system will activate the production LPARs and Coupling Facilities, and is able to respond to certain z/OS initialization messages. However, it cannot automate the complete startup of the production systems. For this, the K1 or K2 systems could be used to automate the application startup and recovery process in the production sysplex.
9.4.6 Flexible testing of the GDPS/MzGM environment
To facilitate testing of site failover and failback processing, consider installing additional disk capacity to support FlashCopy in Site1 and Site2. The FlashCopy can be used at both sites to maintain disaster recovery checkpoints during remote copy resynchronization. This ensures that a consistent copy of the data will be available if a disaster-type event should occur while testing your site failover and failback procedures. In addition, the FlashCopy could be used to provide a copy to be used for testing or backing up data without the need for extended outages to production systems.
By combining z/OS Global Mirror with FlashCopy, you can create a consistent point-in-time tertiary copy of the z/OS Global Mirror (XRC) data sets and secondary disks at your recovery site. The tertiary devices can then be used to test your disaster recovery and restart procedures while the GDPS/XRC sessions between Site1 and the recovery site are running, which ensures that disaster readiness is maintained at all times. In addition, these devices can be used for purposes other than DR testing; for example, nondisruptive data backup, data mining, or application testing.
With the addition of GDPS/XRC Zero Suspend FlashCopy, enterprises are able to create the tertiary copy of the z/OS Global Mirror (XRC) data sets and secondary disks without having to suspend the z/OS Global Mirror (XRC) mirroring sessions. This GDPS function prevents the SDM from writing new consistency groups to the secondary disks while FlashCopy is used to create the tertiary copy of the disks.
The time to establish the FlashCopies will depend on the number of secondary SSIDs involved, the largest number of devices in any SSID, and the speed of the processor. Zero Suspend FlashCopy will normally be executed on the GDPS K-system in the recovery site, where there should be limited competition for CPU resources.
Because SDM processing is suspended while FlashCopy processing is occurring, performance problems in your production environment might occur if the SDM is suspended too long. For this reason, Zero Suspend FlashCopy should be evaluated by testing on your configuration, under different load conditions, to determine whether this facility can be used in your environment.
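The following sketch (Python, purely illustrative; the object and function names are stubs we invented, not SDM or GDPS interfaces) outlines the Zero Suspend FlashCopy sequence described above: the SDM’s writing of new consistency groups to the secondary disks is held only for as long as it takes to establish the FlashCopies.

# Illustrative sketch only: the Zero Suspend FlashCopy sequence, modeled with
# stub objects.

class StubSDM:
    def __init__(self):
        self.applying = True
    def hold_consistency_group_apply(self):
        self.applying = False          # new consistency groups are not written to the secondaries
    def release_consistency_group_apply(self):
        self.applying = True           # held consistency groups drain to the secondaries

def flashcopy(source: str, target: str) -> None:
    print(f"FlashCopy {source} -> {target} (point-in-time copy established)")

def zero_suspend_flashcopy(sdm: StubSDM, pairs) -> None:
    """Create a consistent tertiary copy without suspending the XRC session."""
    sdm.hold_consistency_group_apply()
    try:
        for src, tgt in pairs:
            flashcopy(src, tgt)
    finally:
        # Keep the hold as short as possible: a long hold delays SDM apply
        # and can affect production, as noted above.
        sdm.release_consistency_group_apply()

zero_suspend_flashcopy(StubSDM(), [("C-volume-1", "F-volume-1"), ("C-volume-2", "F-volume-2")])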
Enterprises that have requirements to test their recovery capabilities while maintaining the currency of the replication environment will need to provide additional disk capacity to support FlashCopy. By providing an additional usable copy of the data, you have the flexibility to perform on-demand DR testing and other nondisruptive activities, while maintaining up-to-date DR readiness.
9.4.7 Prerequisites for GDPS/MzGM
GDPS/MzGM has the following prerequisites:
GDPS/PPRC or GDPS/PPRC HM is required.
GDPS/XRC is required and the GDPS/XRC prerequisites must be satisfied.
Consult with your storage vendor to ensure required features and functions are supported on your disk subsystems.
 
Important: For the latest GDPS prerequisite information, refer to the GDPS product web site, available at:
 

1 Incremental Resynchronization of GDPS/GM is not supported in conjunction with GDPS/PPRC HM.