Disaster recovery
This chapter describes the use of the TS7700 in disaster recovery (DR) along with DR testing processes.
12.1 TS7700 disaster recovery principles
To understand the DR capabilities of the TS7700 grid, the following topics are described:
Data availability in the grid
Deferred Copy Queue
Volume ownership
12.1.1 Data availability
The fundamental function of the TS7700 is that all logical volumes are accessible through any of the virtual device addresses on the clusters in the grid configuration. If a copy of the logical volume is not available at that TS7700 cluster (either because it does not have a copy or the copy it does have is inaccessible because of an error), and a copy is available at another TS7700 cluster in the grid, the volume is accessed through the Tape Volume Cache (TVC) at the TS7700 cluster that has the available copy. If a recall is required to place the logical volume in the TVC on the other TS7700 cluster, it is done as part of the mount operation.
Whether a copy is available at another TS7700 cluster in a multi-cluster grid depends on the Copy Consistency Policy that was assigned to the logical volume when it was written. The Copy Consistency Policy is set through the Management Class (MC) storage construct. It specifies if and when a copy of the data is made between the TS7700 clusters in the grid configuration. The following Copy Consistency Policies can be assigned:
Synchronous Copy (Synch): Data that is written to the cluster is compressed and simultaneously written to another specified cluster.
Rewind Unload (RUN): Data that is created on one cluster is copied to the other cluster as part of successful RUN command processing.
Deferred Copy (Deferred): Data that is created on one cluster is copied to the specified clusters after successful RUN command processing.
No Copy (None): Data that is created on one cluster is not copied to the other cluster.
Consider when the data is available on the cluster at the DR site. With Synchronous Copy, the data is written to a secondary cluster as it is created, so if the primary site is unavailable, the volume can be accessed on the cluster that specified Synch. With RUN, unless the Copy Count Override is enabled, any cluster with RUN specified has a copy of the volume available. With None, no copy is written to that cluster. With Deferred, a copy becomes available later, so it might or might not yet be available at the cluster that specified Deferred.
When you enable Copy Count Override, it is possible to limit the number of RUN consistency points that are required before device end is returned to the application, which can result in fewer copies of the data being available than your copy policies specify.
The Volume Removal policy for hybrid grid configurations is available in any grid configuration that contains at least one TS7720 or TS7720T cluster and should be considered as well. The TS7720 Disk-Only solution has a maximum storage capacity that is the size of its TVC, and TS7720T CP0 works like TS7720. Therefore, after the cache fills, this policy enables logical volumes to be removed automatically from cache while a copy is retained within one or more peer clusters in the grid. If the cache is filling up, it is possible that fewer copies of the volume exist in the grid than is expected based on the copy policy alone.
12.1.2 Deferred Copy Queue
Other than a copy policy of No Copy, a Deferred Copy policy has the least impact on the applications running on the host. Immediately after the volume is closed, device end is passed back to the application, and a copy is queued to be made later. These copies are put on the Deferred Copy Queue. With the standard settings, host application I/O always has a higher priority than the Deferred Copy Queue. It is normally expected that the configuration and capacity of the grid allow the entire queue of copies to complete each day; otherwise, the incoming copies cause the Deferred Copy Queue to grow continually and the recovery point objective (RPO) might not be met.
When a cluster becomes unavailable because of broken grid links, an error, or a disaster, the copies on the incoming copy queue might not have completed, and that data might not be available on other clusters in the grid. You can use the Bulk Volume Information Retrieval (BVIR) function to analyze the incoming copy queue, but the possibility exists that some volumes are not available. For backups, this might be acceptable, but for primary data, it might be preferable to use a Synch copy policy rather than Deferred.
12.1.3 Volume ownership
If a logical volume is written on one of the clusters in the grid configuration and copied to the other cluster, the copy can be accessed through the other cluster. This access is subject to volume ownership.
At any time, a logical volume is owned by a single cluster. The owning cluster has control over access to the volume and changes to the attributes that are associated with the volume (such as category or storage constructs). The cluster that has ownership of a logical volume can surrender it dynamically to another cluster in the grid configuration that is requesting a mount of the volume.
When a mount request is received on a virtual device address, the cluster for that virtual device must have ownership of the volume to be mounted or must obtain the ownership from the cluster that owns it. If the clusters in a grid configuration and the communication paths between them are operational (grid network), the change of ownership and the processing of logical volume-related commands are transparent to the operation of the TS7700.
However, if a cluster that owns a volume is unable to respond to requests from other clusters, the operation against that volume fails, unless more direction is given. Clusters will not automatically assume or take over ownership of a logical volume without being directed.
This behavior prevents a failure of the grid network communication paths between the clusters from leaving both clusters thinking that they have ownership of the volume. If more than one cluster has ownership of a volume, the volume's data or attributes might be changed differently on each cluster, resulting in a data integrity issue with the volume.
If a cluster fails or is known to be unavailable (for example, a power fault in the IT center) or must be serviced, its ownership of logical volumes is transferred to the other cluster through one of the following modes.
These modes are set through the Management Interface (MI):
Read Ownership Takeover (ROT): When ROT is enabled for a failed cluster, ownership of a volume is allowed to be taken from a cluster that has failed. Only read access to the volume is allowed through the other cluster in the grid. After ownership for a volume is taken in this mode, any operation attempting to modify data on that volume or change its attributes fails. The mode for the failed cluster remains in place until a different mode is selected or the failed cluster is restored.
Write Ownership Takeover (WOT): When WOT is enabled for a failed cluster, ownership of a volume is allowed to be taken from a cluster that has been marked as failed. Full access is allowed through the other cluster in the grid. The mode for the failed cluster remains in place until a different mode is selected or the failed cluster is restored.
Service prep/service mode: When a cluster is placed in service preparation mode or is in service mode, ownership of its volumes is allowed to be taken by the other cluster. Full access is allowed. The mode for the cluster in service remains in place until the cluster is taken out of service mode.
In addition to the manual setting of one of the ownership takeover modes, an optional automatic method named Autonomic Ownership Takeover Manager (AOTM) is available when each of the TS7700 clusters is attached to a TS3000 System Console (TSSC) and there is a communication path provided between the TSSCs. AOTM is enabled and defined by the IBM Service Support Representative (IBM SSR). If the clusters are near each other, multiple clusters in the same grid can be attached to the same TSSC, and the communication path is not required.
 
Guidance: The links between the TSSCs must not be the same physical links that are also used by cluster grid gigabit (Gb) links. AOTM must have a different network to be able to detect that a missing cluster is down, and that the problem is not caused by a failure in the grid gigabit wide area network (WAN) links.
When AOTM is enabled by the IBM SSR and a cluster cannot obtain ownership from another cluster because it gets no response to an ownership request, a check is made through the TSSCs to determine whether the owning cluster is inoperable or whether only the communication paths to it are not functioning. If the TSSCs determine that the owning cluster is inoperable, they enable either read or write ownership takeover, depending on what was set by the IBM SSR.
AOTM enables an ownership takeover mode only after a grace period, and it can be configured only by an IBM SSR. Therefore, jobs can fail in the interim, with the option to try again, until AOTM enables the configured takeover mode. The grace period is 20 minutes by default and starts when a cluster detects that another remote cluster has failed; that detection itself can take several minutes.
The following OAM messages can be displayed up until the point when AOTM enables the configured ownership takeover mode:
CBR3758E Library Operations Degraded
CBR3785E Copy operations disabled in library
CBR3786E VTS operations degraded in library
CBR3750I Message from library libname: G0013 Library libname has experienced an unexpected outage with its peer library libname. Library libname might be unavailable or a communication issue might be present.
CBR3750I Message from library libname: G0009 Autonomic ownership takeover manager within library libname has determined that library libname is unavailable. The Read/Write ownership takeover mode has been enabled.
CBR3750I Message from library libname: G0010 Autonomic ownership takeover manager within library libname determined that library libname is unavailable. The Read-Only ownership takeover mode has been enabled.
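When these messages appear, you can confirm the library and grid state from the host with the DISPLAY SMS command. The following example is a minimal sketch; GRIDLIB is a placeholder for your own composite or distributed library name:
D SMS,LIBRARY(GRIDLIB),DETAIL
D SMS,LIBRARY(ALL),DETAIL
The DETAIL output shows the operational state of the composite and distributed libraries, which helps determine whether an ownership takeover mode is needed or has already been enabled by AOTM.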
A failure of a cluster causes the jobs that use its virtual device addresses to end abnormally (abend). To rerun the jobs, host connectivity to the virtual device addresses in the other cluster must be enabled (if it is not already), and an appropriate ownership takeover mode selected. If the other cluster has a valid copy of a logical volume, the jobs can be tried again.
If a logical volume is being accessed in a remote cache through the Ethernet link and that link fails, the job accessing that volume also fails. If the failed job is attempted again, the TS7700 uses another Ethernet link. If all links fail, access to any data in a remote cache is not possible.
12.2 Failover scenarios
As part of a total systems design, you must develop business continuity procedures that instruct information technology (IT) personnel in the actions that they need to take in the event of a failure. Test those procedures either during the initial installation of the system or at another time.
The scenarios are described in detail in the IBM Virtualization Engine TS7700 Series Grid Failover Scenarios white paper, which was written to assist IBM specialists and clients in developing such testing plans.
The white paper documents a series of TS7700 Grid failover test scenarios for z/OS that were run in an IBM laboratory environment. Simulations of single failures of all major components and communication links, and some multiple failures, are run.
12.3 Planning for disaster recovery
Although you can hope that a disaster does not happen, planning for such an event is important. Information is provided that can be used in developing a DR plan as it relates to a TS7700.
Many aspects of DR planning must be considered:
Consider DR site connectivity and the input/output definition file (IODF).
How critical is the data in the TS7700?
Can the loss of some of the data be tolerated?
How much time can be tolerated before resuming operations after a disaster?
What are the procedures for recovery and who runs them?
How will you test your procedures?
12.3.1 Disaster recovery site connectivity IODF considerations
If your production hosts have FICON connectivity to the TS7700 clusters at your DR site, you might consider including those virtual device addresses in your production IODF. Having those devices configured but offline to your production hosts makes recovery easier if there is a TS7700 failure that requires FICON access to the DR clusters. This approach is distance-dependent and might not be appropriate for all configurations.
To switch over to the DR clusters, a simple vary online of the DR devices is all that is needed by the production hosts to enable their usage. Another alternative is to have a separate IODF ready that adds the DR devices. However, that alternative requires an IODF activation on the production hosts.
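For example, assuming that the DR cluster's virtual devices are defined in the production IODF as device numbers 9000 - 90FF (a hypothetical range), the operator can bring them online with a standard VARY command:
V 9000-90FF,ONLINE
When the failed cluster is recovered, the same devices can be varied offline again to return to the normal configuration.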
12.3.2 Grid configuration
With the TS7700, two types of configurations can be installed:
Stand-alone cluster
Multi-cluster grid
With a stand-alone system, a single cluster is installed. If the site at which that system is installed is destroyed, the data that is associated with the TS7700 might be lost unless COPY EXPORT was used and the tapes were removed from the site. If the cluster goes out of service because of failures, whether the data is recoverable depends on the type of failure.
The recovery process assumes that the only elements that are available for recovery are the stacked volumes that are produced by COPY EXPORT and removed from the site. It further assumes that only a subset of the volumes is undamaged after the event. If the physical cartridges have been destroyed or irreparably damaged, recovery is not possible, as with any other cartridge types. It is important that you integrate the TS7700 recovery procedure into your current DR procedures.
 
Remember: The DR process is a joint exercise that requires your involvement and that of your IBM SSR to make it as comprehensive as possible.
For many clients, the potential data loss or the recovery time that is required with a stand-alone TS7700 is not acceptable because the COPY EXPORT method might take some time to complete. For those clients, the TS7700 grid provides a near-zero data loss and expedited recovery-time solution. With a multi-cluster grid configuration, up to six clusters are installed, typically at two or three sites, and interconnected so that data is replicated among them. The way that the sites are used then differs, depending on your requirements.
In a two-cluster grid, one potential use case is that one of the sites is the local production center and the other site is a backup or DR center, which is separated by a distance that is dictated by your company’s requirements for DR. Depending on the physical distance between the sites, it might be possible to have two clusters be both a high availability and DR solution.
In a three-cluster grid, the typical use is that two sites are connected to a host and the workload is spread evenly between them. The third site is strictly for DR and there probably are no connections from the production host to the third site. Another use for a three-cluster grid might consist of three production sites, which are all interconnected and holding the backups of each other.
In a four or more cluster grid, DR and high availability can be achieved. The high availability is achieved with two local clusters keeping RUN or SYNC volume copies, with both clusters attached to the host. The third and fourth (or more) remote clusters can hold deferred volume copies for DR. This design can be configured in a crossed way, which means that you can run two production data centers, with each production data center serving as a backup for the other.
The only connection between the production sites and the DR site is the grid interconnection. There is normally no host connectivity between the production hosts and the DR site’s TS7700. When client data is created at the production sites, it is replicated to the DR site as defined through Outboard policy management definitions and storage management subsystem (SMS) settings.
12.3.3 Planning guidelines
As part of planning a TS7700 grid configuration to address this solution, you must consider the following items:
Plan for the necessary WAN infrastructure and bandwidth. You need more bandwidth if you are primarily using a Copy Consistency Point of RUN or SYNC because any delays in copy time that are caused by bandwidth limitations result in longer job run times.
If you have limited bandwidth available between sites, use the Deferred Copy Consistency Point, or copy only the data that is critical to the recovery of your key operations. The amount of data that is sent through the WAN and the distance it is sent might justify the establishment of a separate, redundant, and dedicated network only for the multi-cluster grid. Newer IPEX SAN42B-R switches and IP extension hardware are also available and might help with this issue.
A factor to consider in the implementation of Copy Export for DR is that the export does not capture any volumes in the export pool that are not in the TVC of the export cluster. Any data that is migrated to back-end tape is not going to be on the EXPORT COPY volumes.
Plan for host connectivity at your DR site with sufficient resources to run your critical workloads. If the cluster that is local to the production host becomes unavailable and there is no access to the DR site’s cluster by this host, production cannot run. Optionally, plan for an alternative host to take over production at the DR site.
Design and code the Data Facility System Management Subsystem (DFSMS) automatic class selection (ACS) routines to control which MC on the TS7700 is assigned. These MCs control which Copy Consistency Points are used (see the sketch after this list). You might need to consider MC assignment policies for testing your procedures at the DR site that differ from the production policies.
Prepare procedures that your operators run if the local site becomes unusable. The procedures include various tasks, such as bringing up the DR host, varying the virtual drives online, and placing the DR cluster in one of the ownership takeover modes.
Perform periodic capacity planning of your tape setup and host throughput to evaluate whether the disaster setup can still handle the full production workload in a disaster.
If encryption is used in production, ensure that the disaster site supports encryption. The encryption keys (EKs) must be available at the DR site or the data cannot be read.
Consider how you test your DR procedures. Many scenarios can be set up:
 – Will it be based on all data from an existing TS7700?
 – Will it be based on using the Copy Export function and an empty TS7700?
 – Will it be based on stopping production access to one TS7700 cluster and running production to another cluster?
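The following fragment is a minimal sketch of an MC ACS routine that separates production data from DR test data. The MC names MCPROD and MCDRTEST and the DRTEST data set filter are illustrative assumptions; the Copy Consistency Points behind each MC are defined on the TS7700 MI, not in the routine itself:
PROC MGMTCLAS
 FILTLIST DRTEST INCLUDE(DRTEST.**)
 SELECT
  WHEN (&DSN = &DRTEST)          /* DR test data sets                 */
   SET &MGMTCLAS = 'MCDRTEST'    /* MC with copy modes for DR testing */
  OTHERWISE                      /* all other tape data               */
   SET &MGMTCLAS = 'MCPROD'      /* MC with production copy modes     */
 END
END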
12.4 High availability and disaster recovery configurations
A few examples of grid configurations are addressed. These examples are a small subset of possible configurations, and are only provided to show how the grid technology can be used. With five-cluster or six-cluster grids, there are many more ways to configure a grid.
Two-cluster grid
With a two-cluster grid, you can configure the grid for DR, high availability, or both. Configuration considerations for two-cluster grids are described. The scenarios that are presented are typical configurations. Other configurations are possible, and might be better suited for your environment.
Disaster recovery configuration
This section provides information that is needed to plan for a TS7700 2-cluster grid configuration to be used specifically for DR purposes.
A natural or human-caused event has made the local site’s cluster unavailable. The two clusters are in separate locations, which are separated by a distance that is dictated by your company’s requirements for DR. The only connection between the local site and the DR site are the grid interconnections. There is no host connectivity between the local hosts and the DR site cluster.
Figure 12-1 summarizes this configuration.
Figure 12-1 Disaster recovery configuration
Consider the following information as part of planning a TS7700 grid configuration to implement this solution:
Plan for the necessary WAN infrastructure and bandwidth to meet the copy policy requirements that you need. If you have limited bandwidth available between sites, have the data that is critical copied with a consistency point of RUN, with the rest of the data using the Deferred Copy Consistency Point. RUN or SYNC are acceptable copy policies only for distances less than 100 kilometers. Distances greater than 100 km must rely on the Deferred Copy Consistency Point.
Plan for host connectivity at your DR site with sufficient resources to perform your critical workloads.
Design and code the DFSMS ACS routines to control what MC on the TS7700 is assigned, which determines what data gets copied, and by which Copy Consistency Point.
Prepare procedures that your operators run if the local site becomes unusable. The procedures include various tasks, such as bringing up the DR host, varying the virtual drives online, and placing the DR cluster in one of the ownership takeover modes (unless AOTM is configured).
Configuring for high availability
This section provides the information that is needed to plan for a two-cluster grid configuration to be used specifically for high availability. The assumption is that continued access to data is critical, and no single point of failure, repair, or upgrade can affect the availability of data.
In a high-availability configuration, both clusters are within metro distance of each other. These clusters are connected through a LAN. If one of them becomes unavailable because it has failed, or is undergoing service or being updated, data can be accessed through the other cluster until the unavailable cluster is made available.
As part of planning a grid configuration to implement this solution, consider the following information:
Plan for the virtual device addresses in both clusters to be configured to the local hosts. In this way, a total of 512 or 992 virtual tape devices are available for use (256 or 496 from each cluster).
Set up a Copy Consistency Point of RUN for both clusters for all data to be made highly available. With this Copy Consistency Point, as each logical volume is closed, it is copied to the other cluster.
Design and code the DFSMS ACS routines and MCs on the TS7700 to set the necessary Copy Consistency Points.
Ensure that AOTM is configured for an automated logical volume ownership takeover method in case a cluster becomes unexpectedly unavailable within the grid configuration. Alternatively, prepare written instructions for the operators that describe how to perform the ownership takeover manually, if necessary. See 2.3.34, “Autonomic Ownership Takeover Manager” on page 87 for more details about AOTM.
Figure 12-2 summarizes this configuration.
Figure 12-2 Availability configuration
Configuring for disaster recovery and high availability
You can configure a two-cluster grid configuration to provide both DR and high availability solutions. The assumption is that the two clusters are in separate locations, which are separated by a distance dictated by your company’s requirements for DR. In addition to the configuration considerations for DR, you need to plan for the following items:
Access to the FICON channels on the cluster at the DR site from your local site’s hosts. This can involve connections that use dense wavelength division multiplexing (DWDM) or channel extender, depending on the distance separating the two sites. If the local cluster becomes unavailable, you use this remote access to continue your operations by using the remote cluster.
Because the virtual devices on the remote cluster are connected to the host through a DWDM or channel extension, there can be a difference in read or write performance when compared to the virtual devices on the local cluster.
If performance differences are a concern, consider using only the virtual device addresses in the remote cluster when the local cluster is unavailable. If that is important, you must provide operator procedures to vary online and offline the virtual devices to the remote cluster.
You might want to have separate Copy Consistency Policies for your DR data versus your data that requires high availability.
Figure 12-3 summarizes this configuration.
Figure 12-3 Availability and disaster recovery configuration
Three-cluster grid
With a three-cluster grid, you can configure the grid for DR and high availability or use dual production sites that share a common DR site. Configuration considerations for three-cluster grids are described. The scenarios that are presented are typical configurations. Other configurations are possible and might be better suited for your environment.
The planning considerations for a two-cluster grid also apply to a three-cluster grid.
High availability and disaster recovery
Figure 12-4 on page 744 illustrates a combined high availability and DR solution for a three-cluster grid. In this example, Cluster 0 and Cluster 1 are the high-availability clusters and are local to each other (less than 50 kilometers (31 miles) apart). Cluster 2 is at a remote site that is away from the production site or sites. The virtual devices in Cluster 0 and Cluster 1 are online to the host and the virtual devices in Cluster 2 are offline to the host. The host accesses the virtual devices provided by Cluster 0 and Cluster 1.
Host data that is written to Cluster 0 is copied to Cluster 1 at RUN time, or earlier with Synchronous mode. Host data that is written to Cluster 1 is copied to Cluster 0 at RUN time. Host data that is written to Cluster 0 or Cluster 1 is copied to Cluster 2 on a Deferred basis.
The Copy Consistency Points at the DR site (NNR or NNS) are set to create a copy only of host data at Cluster 2. Copies of data are not made to Cluster 0 and Cluster 1. This enables DR testing at Cluster 2 without replicating to the production site clusters.
Figure 12-4 shows an optional host connection that can be established to the remote Cluster 2 by using DWDM or channel extenders. With this configuration, you must define an extra 256 or 496 virtual devices at the host.
Figure 12-4 High availability and disaster recovery configuration
Dual production site and disaster recovery
Figure 12-5 on page 745 illustrates dual production sites that are sharing a DR site in a three-cluster grid (similar to a hub-and-spoke model). In this example, Cluster 0 and Cluster 1 are separate production systems that can be local to each other or distant from each other. The DR cluster, Cluster 2, is at a remote site at a distance away from the production sites.
The virtual devices in Cluster 0 are online to Host A and the virtual devices in Cluster 1 are online to Host B. The virtual devices in Cluster 2 are offline to both hosts. Host A and Host B access their own set of virtual devices that are provided by their respective clusters. Host data that is written to Cluster 0 is not copied to Cluster 1, and host data that is written to Cluster 1 is not copied to Cluster 0. Host data that is written to Cluster 0 or Cluster 1 is copied to Cluster 2 on a Deferred basis.
The Copy Consistency Points at the DR site (NNR or NNS) are set to create only a copy of host data at Cluster 2. Copies of data are not made to Cluster 0 and Cluster 1. This enables DR testing at Cluster 2 without replicating to the production site clusters.
Figure 12-5 shows an optional host connection that can be established to remote Cluster 2 using DWDM or channel extenders.
Figure 12-5 Dual production site with disaster recovery
Three-cluster high availability production site and disaster recovery
This model has been adopted by many clients. In this configuration, two clusters are in the production site (same building or separate location within metro area) and the third cluster is remote at the DR site. Host connections are available at the production site (or sites).
In this configuration, each TS7720 replicates to both its local TS7720 peer and to the remote TS7740. Optional copies in both TS7720 clusters provide high availability plus cache access time for the host accesses. At the same time, the remote TS7740 provides DR capabilities and the remote copy can be remotely accessed, if needed.
This configuration, which provides high-performance production cache if you choose to run balanced mode with three copies (R-R-D for both Cluster 0 and Cluster 1), is depicted in Figure 12-6.
Figure 12-6 Three-cluster high availability and disaster recovery with two TS7720 and one TS7740 tape drives
Another variation of this model uses a TS7720 and a TS7740 for the production site, as shown in Figure 12-7, both replicating to a remote TS7740.
Figure 12-7 Three-cluster high availability and disaster recovery with two TS7740 and one TS7720 tape drives
In both models, if a TS7720 reaches the upper threshold of usage, the PREFER REMOVE data, which has already been replicated to the TS7740, is removed from the TS7720 cache followed by the PREFER KEEP data. PINNED data is never removed unless AUTOREMOVAL is enabled.
In the example that is shown in Figure 12-7, you can have particular workloads favoring the TS7740, and others favoring the TS7720, suiting a specific workload to the cluster best equipped to perform it.
Copy Export (shown as optional in both figures) can be used to have an additional copy of the migrated data, if required.
Four-cluster grid
This section describes a four-cluster grid in which both sites serve dual purposes. Both sites are equal players within the grid, and either site can play the role of production or DR, as required.
Dual production and disaster recovery at Metro Mirror distance
In this model, you have dual production and DR sites. Although a site can be labeled as a high availability pair or DR site, they are equivalent from a technology standpoint and functional design. In this example, you have two production sites within metro distances and two remote DR sites within metro distances between them. This configuration delivers the same capacity as a two-cluster grid configuration, with the high availability of a four-cluster grid. See Figure 12-8.
Figure 12-8 Four-cluster high availability and disaster recovery
You can have host workload balanced across both clusters (Cluster 0 and Cluster 1 in Figure 12-8). The logical volumes that are written to a particular cluster are only replicated to one remote cluster. In Figure 12-8, Cluster 0 replicates to Cluster 2 and Cluster 1 replicates to Cluster 3. This task is accomplished by using copy policies. For the described behavior, copy mode for Cluster 0 is RDRN or SDSN and for Cluster 1 is DRNR or DSNS.
This configuration delivers high availability at both sites, production and DR, without four copies of the same tape logical volume throughout the grid.
If this example were not within Metro Mirror distances, use copy policies of RDDN on Cluster 0 and DRND on Cluster 1.
Figure 12-9 shows the four-cluster grid reaction to a cluster outage. In this example, Cluster 0 goes down due to an electrical power outage. You lose all logical drives emulated by Cluster 0. The host uses the remaining addresses emulated by Cluster 1 for the entire production workload.
Figure 12-9 Four-cluster grid high availability and disaster recovery - Cluster 0 outage
During the outage of Cluster 0 in the example, new jobs for write only use one half of the configuration (the unaffected partition in the lower part of the picture). Jobs for read can access content in all available clusters. When power is normalized at the site, Cluster 0 starts and rejoins the grid, reestablishing the original balanced configuration.
In a DR situation, the backup host in the DR site operates from the second high availability pair, which is the pair of Cluster 2 and Cluster 3 in Figure 12-11 on page 757. In this case, copy policies can be RNRD for Cluster 2 and NRNR for Cluster 3.
If these sites are more than Metro Mirror distance, you can have Cluster 2 copy policies of DNRD and Cluster 3 policies of NDDR.
12.4.1 Restoring the host and library environments
Before you can use the recovered logical volumes, you must restore the host environment. The following steps are the minimum steps that you need to continue the recovery process of your applications:
1. Restore the tape management system (TMS) CDS.
2. Restore the DFSMS data catalogs, including the tape configuration database (TCDB).
3. Define the I/O gen by using the Library IDs of the recovery TS7700 tape drives.
4. Update the library definitions in the source control data set (SCDS) with the Library IDs for the recovery TS7700 tape drives in the composite library and distributed library definition windows.
5. Activate the I/O gen and the SMS SCDS.
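As an illustration of step 5, both activations can be performed from the z/OS console. The IODF suffix and SCDS name are placeholders only:
ACTIVATE IODF=77
SETSMS SCDS(SYS1.DR.SCDS)
Afterward, verify that the composite and distributed libraries show as online and operational with D SMS,LIBRARY(ALL),DETAIL.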
You might also want to update the library nicknames that are defined through the MI for the grid and cluster to match the library names defined to DFSMS. That way, the names that are shown on the MI windows match those names used at the host for the composite library and distributed library.
To set up the composite name that is used by the host to be the grid name, complete the following steps:
1. Select Configuration → Grid Identification Properties.
2. In the window that opens, enter the composite library name used by the host in the grid nickname field.
3. You can optionally provide a description.
Similarly, to set up the distributed name, complete the following steps:
1. Select Configuration → Cluster Identification Properties.
2. In the window that opens, enter the distributed library name used by the host in the Cluster nickname field.
3. You can optionally provide a description.
These names can be updated at any time.
12.5 Disaster recovery testing basics
The TS7700 grid configuration provides a solution for DR needs when data loss and the time for recovery must be minimized. Although a real disaster is not something that can be anticipated, it is important to have tested procedures in place in case one occurs.
Before Release 3.1, you might decide to run your DR test with Write Protect Mode, and choose whether to define write-protect exclusion categories.
Selective write protect for disaster recovery testing
This function enables clients to emulate DR events by running test jobs at a DR location within a TS7700 grid configuration, enabling volumes only within specific categories to be manipulated by the test application. This function prevents any changes to production-written data, which is accomplished by excluding up to 16 categories from the cluster’s write-protect enablement.
When a cluster is write-protect-enabled, all protected volumes cannot be modified, nor can their category or storage construct names be changed. As with the TS7700 Write Protect setting, the option applies at the scope of a grid partition (a cluster) and is configured through the MI. Settings are persistent, except for DR FLASH, and are saved in a special repository.
Also, the new function enables any volume that is assigned to one of the categories that are contained within the configured list to be excluded from the general cluster’s write-protect state. The volumes that are assigned to the excluded categories can be written to or have their attributes modified.
In addition, those scratch categories that are not excluded can optionally have their Fast Ready characteristics ignored, including Delete Expire and hold processing, enabling the DR test to mount volumes as private that the production environment has since returned to scratch (they are accessed as read-only).
One exception to the write protect is those volumes in the insert category. To enable a volume to be moved from the insert category to a write-protect-excluded category, the source category of insert cannot be write-protected. Therefore, the insert category is always a member of the excluded categories.
When planning for a DR test, be sure that you have enough scratch volumes if Expire Hold processing is enabled to prevent the reuse of production scratched volumes. Suspending the volumes' return-to-scratch processing during the DR test is also advisable.
Because selective write protect is a cluster-wide function, separated DR drills can be conducted simultaneously within one multi-cluster grid, with each cluster having its own independent client-configured settings. Again, DR FLASH is the exception to this statement.
With Release 3.1, a new function, called FlashCopy for DR Testing, was introduced. This feature is a major improvement regarding the DR testing possibilities.
Today, three major alternatives exist:
1. DR test without Write Protect Mode
2. Write Protect Mode or Selective Write Protect Mode
3. FlashCopy for Disaster Recovery Testing
For alternatives 1 and 2, you can also decide whether to break the grid links.
The following considerations apply, based on which alternative method you use for testing.
Alternative 1: Disaster recovery test without Write Protect Mode
The protection is based only on the z/OS (DEVSUPxx) and the TMS capabilities. There is no hardware support to protect the production data unless the grid was partitioned by using SDAC at implementation and dedicated volume serial ranges are used for the DR test.
Do not run Housekeeping processes on either the DR host or the production host during the testing. This method should be selected only if you are running a microcode level that does not support Write Protect Mode.
Consider the following restrictions when DR testing without Write Protect Mode:
Use the production volumes from the DR host. In this approach, you have no protection of your production data at all. Applications might modify the data, and scratch runs delete production data.
Use the production volumes from the DR host as Read only. In this approach, all applications that modify tape content do not run properly during the DR test.
Don’t use the production volumes from the DR host at all. In this approach, you cannot access any production data, and you are not able to test several of your applications.
Therefore, the test capabilities are limited if you use this alternative.
Alternative 2: Write Protect Mode / Selective Write Protect Mode
Write Protect Mode or Selective Write Protect mode is a hardware feature. A cluster is set to a Write Protect mode with the MI.
The Write Protect mode prevents any host action (write data, host command) sent to the test cluster from creating new data, modifying existing data, or changing volume attributes such as the volume category.
The Write Protect mode still enables logical volumes to be copied from the remaining production clusters to the DR cluster.
You can define Write protect excluded media categories, where updates and status changes are allowed.
However, this alternative cannot keep two different instances of the same logical volume on one cluster, so it cannot provide both access to a DR point-in-time copy of the data and the propagation of production updates.
Alternative 3: FlashCopy for disaster recovery testing
With Release 3.1, concurrent DR testing is improved with the FlashCopy for Disaster Recovery Testing function. This enables a DR host to perform testing against a point in time consistency snapshot while production operations and replication continue. With FlashCopy, production data continues to replicate during the entire DR test and the same logical volume can be mounted at the same time by a DR host and a production host.
Used with Selective Write Protect for DR testing, DR test volumes can be written to and read from while production volumes are protected from modification by the DR host. All access by a DR host to write protected production volumes are provided by using a snapshot in time, or flash, of the logical volumes. In addition, a DR host continues to have read access to production original content that has since been returned to scratch.
During a DR test, volumes might need to be mounted from both the DR and production hosts. Before FlashCopy for DR Testing, these mounts were serialized such that one host received an IN USE exception. This was especially painful when the true production host was the instance that failed the mount.
FlashCopy enables logical volumes to be mounted in parallel to a production host and a DR host. Production hosts can scratch volumes, reuse volumes, or modify volumes, but the DR TS7700 provides a snapshot of the logical volumes from time zero of the simulated disaster event or the start of the DR test.
12.5.1 Disaster recovery general considerations
As you design a test involving the TS7700 grid configuration, there are several capabilities that are designed into the TS7700 that you must consider.
The z/OS test environment represents a point in time
The test environment is typically a point in time, which means that at the beginning of the test, the catalog, TCDB, and TMS control databases are all a snapshot of the production systems. Over the duration of the test, the production systems continue to run and make changes to the catalogs and TMS. Those changes are not reflected in the point-in-time snapshot.
The main effect is that a volume that has been returned to SCRATCH status by the production system might be used in a test. The test system's catalogs and TMS do not reflect that change. Depending on your decisions, the data can still be accessed, regardless of whether the logical volume is defined as scratch.
The data that is available in the disaster recovery cluster
In a real disaster, the data that is available in the clusters at your remaining site might not be consistent with the content in your TMS catalog. This situation depends on the selected Copy Modes, and on whether the copies have already been processed.
During your DR test, production data is updated on the remaining production clusters. Depending on your selected DR testing method, this updated data can be copied to the DR clusters. The DR testing method also determines whether this updated data is presented to the DR host, or whether a FlashCopy from Time Zero is available.
Without the FlashCopy option, both alternatives (updating the data versus not updating the data) have advantages and disadvantages. For more information, see 12.5.2, “Breaking the interconnects between the TS7700 grid” on page 755.
Also, the DR host might create some data in the DR clusters. For more information, see “Creating data during the disaster recovery test from the DR host: Selective Write Protect” on page 753.
Protection of your production data
In a real disaster this is not an issue because the remaining systems become your production environment.
During a DR test, you need to ensure that the actions on the DR site do not have an influence on the data from production. Therefore, the DR host must not have any connections to the clusters in production. Ensure that all devices that are attached to the remaining production clusters are offline (if they are FICON attached to the DR site).
The Write Protect mode prevents any host action (write data, host command) sent to the test cluster from creating new data, modifying existing data, or changing volume attributes such as the volume category. The Write Protect mode still enables logical volumes to be copied from the remaining production clusters to the DR cluster.
As an alternative to the Write Protect Mode, if you are at an earlier TS7700 microcode level and want to prevent overwriting production data, you can use the TMS control to enable only read-access to the volumes in the production VOLSER ranges. However, this process does not enable you to write data during the DR testing. For more information, see 12.5.3, “Considerations for DR tests without Selective Write Protect mode” on page 756.
Separating production and disaster recovery hosts: Logical volumes
The DR host is an isolated LPAR that needs to be segregated from the production. To avoid any interference or data loss, complete these optional steps:
1. Define host-specific media categories for Media1/2, Error, and Private.
2. Limit the usage of logical volumes by using the TMS.
3. Define separate logical volume serial ranges (insert process).
To ensure that the inserted volume ranges are not accepted by the production systems, you need to perform the following steps:
Changes on production systems:
 – Use the RMM REJECT ANYUSE(TST*) parameter, which prevents VOLSERs with the TST prefix from being used on the production systems (see the sketch after this list).
Changes on the DR test systems:
 – Use the RMM VLPOOL PREFIX(TST*) TYPE(S) parameter to enable use of these volumes for default scratch mount processing.
 – Change DEVSUPxx to point to different categories, which are the categories used for the TST* volumes.
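The following statements are a minimal sketch of the RMM changes. The TST prefix and the pool description are illustrative assumptions; adapt them to your own VOLSER ranges:
Production system EDGRMMxx member:
REJECT ANYUSE(TST*)
DR test system EDGRMMxx member:
VLPOOL PREFIX(TST*) TYPE(S) DESCRIPTION('DR TEST SCRATCH POOL')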
Figure 12-10 shows the process to insert cartridges in a DR site to perform a DR test.
Figure 12-10 Insertion considerations in a disaster recovery test
After these settings are done, insert the new TST* logical volumes. It is important that the test volumes that are inserted by using the MI are associated with the test system so that the TS7700 at the test site has ownership of the inserted volumes. The DR system must be running before the insertion is run.
 
Important: Ensure that one logical unit has been or is online on the test system before entering logical volumes.
Any new allocations that are performed by the DR test system use only the logical volumes defined for the test. At the end of the test, the volumes can be returned to SCRATCH status and left in the library, or deleted, if you want.
Creating data during the disaster recovery test from the DR host: Selective Write Protect
During the DR test, you might want to write data from the DR host to the DR clusters. These tests typically include running a batch job cycle that creates new data volumes.
This test can be handled in two ways:
Have a different TS7700 available as the output target for the test jobs.
Have a separate logical volume range that is defined for use only by the test system.
The second approach is the most practical in terms of cost. It involves defining the VOLSER range to be used, defining a separate set of categories for scratch volumes in the DFSMS DEVSUP parmlib, and inserting the volume range into the test TS7700 before the start of the test.
 
Important: The test volumes that are inserted by using the MI must be associated with the cluster that is used as DR cluster so that the TS7700 at the test site has ownership of the inserted volumes.
If you require that the test host be able to write new data, you can use the Selective Write Protect for DR testing function, which enables you to write to selected volumes during DR testing.
With Selective Write Protect, you can define a set of volume categories on the TS7700 that are excluded from the Write Protect Mode. This configuration enables the test host to write data onto a separate set of logical volumes without jeopardizing normal production data, which remains write-protected.
This requires that the test host use a separate scratch category or categories from the production environment. If test volumes also must be updated, the test host’s private category must also be different from the production environment to separate the two environments.
You must determine the production categories that are being used and then define separate, not yet used categories on the test host by using the DEVSUPxx member. Be sure that you define a minimum of four categories in the DEVSUPxx member: MEDIA1, MEDIA2, ERROR, and PRIVATE.
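A minimal DEVSUPxx sketch for the DR test host follows. The category values are placeholders only and must not overlap with the categories that the production hosts use:
MEDIA1=0012,
MEDIA2=0013,
ERROR=001E,
PRIVATE=001F
These same categories must then be excluded from Write Protect on the TS7700, as described next.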
In addition to the host specification, you must also define on the TS7700 those volume categories that you are planning to use on the DR host and that need to be excluded from Write-Protect mode.
For more information about the necessary definitions for DR testing with a TS7700 grid that uses Selective Write Protect, see 12.7.1, “TS7700 2-cluster grid that uses Selective Write Protect” on page 773.
The Selective Write Protect function enables you to read production volumes and to write new volumes from the beginning of tape (BOT) while protecting production volumes from being modified by the DR host. Therefore, you cannot modify or append to volumes in the production hosts’ PRIVATE categories, and DISP=MOD or DISP=OLD processing of those volumes is not possible.
At the end of the DR test, clean up the data written during the DR test.
Creating data during the disaster recovery test from the disaster recovery host: Copy policies
If you are using the MCs used in production, the data being created as part of the test might be copied to the production site, wasting space and inter-site bandwidth. This situation can be avoided by defining the copy mode for the MCs differently at the test TS7700 than at the production TS7700.
Using a copy mode of No Copy for the production library site prevents the test TS7700 from making a copy of the test data. It does not interfere with the copying of production data. Remember to set the content of the MCs back to the original contents in the cleanup of the DR test.
Scratch runs during the disaster recovery test from the production host
The scratch runs on the production host set the status of a logical volume to scratch. If a logical volume is in a scratch category, it cannot be read from a host.
With TS7700 using Selective Write Protect, you can use write protected categories to avoid conflicts. Also, see step 6 on page 775. This approach enables the DR host to only read scratched volumes. It does not prevent the production host from using them. Either turning off return-to-scratch or configuring a long expire-hold time can be used, as well.
For scratch runs during the DR test from the production host without using Selective Write Protect, see 12.5.3, “Considerations for DR tests without Selective Write Protect mode” on page 756.
Scratch runs during the disaster recovery test from the disaster recovery host
Depending on the selected method, a scratch that is run on the DR host should be carefully considered. If Write Protect is enabled, and the production category is not set to Write Protect Excluded, you can process a scratch run on the DR host. Generally, limit the scratch run to the volume serial range allowed for the DR host.
If you choose not to use Write Protect, or define the production categories as excluded from write protect, a scratch that is run on the DR host might lead to data loss. Avoid running any housekeeping process.
Cleanup phase of a disaster recovery test
You must clean up your DR test environment at the end of the DR test. In this phase, the data that is written by the DR host is deleted in the TS7700.
If this data is not deleted (set to scratch and housekeeping run) after the DR test, this unneeded data uses cache or tape space. This data never expires because no scratch run will be processed for these volumes. Ensure that a scratch category with an expiration time is used for the DR logical volumes. Otherwise, they also waste space because these logical volumes will not be overwritten.
12.5.2 Breaking the interconnects between the TS7700 grid
Before Release 3.1 FlashCopy for DR testing, you had two options:
The site-to-site links are broken.
The links are left connected.
A test (with or without Write Protect Mode) can be conducted with either approach, but each one has trade-offs.
Breaking the grid links offers the following benefits:
You are sure that only the data that is copied to the TS7700 that is connected to the test system is accessed.
Logical volumes that are returned to scratch by the production system are not seen by the TS7700 under test.
Test data that is created during the test is not copied to the other TS7700.
This approach has the following disadvantages:
If a disaster occurs while the test is in progress, data that was created by the production site after the links were broken is lost.
The TS7700 at the test site must be allowed to take over volume ownership (either read-only or read/write).
The TS7700 under test can select a volume for scratch that has already been used by the production system while the links were broken (only if no different media category was used).
Breaking the grid links must be done by the CE. Do not disable a grid link with the Library Request command. Disabling the grid link with the command does not stop synchronous mode copies or the exchange of status information.
The concern about losing data in a disaster during a test is the major issue with using the break site-to-site links method. The TS7700 has several design features that make valid testing possible without having to break the site-to-site links.
Ownership takeover
If you perform the test with the links broken between sites, you must enable ROT so that the test site can access the data on the production volumes owned by the production site. Because the production volumes are created by mounting them on the production site’s TS7700, that TS7700 has volume ownership.
If you attempt to mount one of those volumes from the test system without ownership takeover enabled, the mount fails because the test site’s TS7700 cannot request ownership transfer from the production site’s TS7700. By enabling ROT, the test host can mount the production logical volumes and read their contents.
The test host is not able to modify the production site-owned volumes or change their attributes. The volume looks to the test host as a write-protected volume. Because the volumes that are going to be used by the test system for writing data were inserted through the MI that is associated with the TS7700 at the test site, that TS7700 already has ownership of those volumes. Also, the test host has complete read and write control of them.
 
Important: Never enable WOT mode for a test. WOT mode must be enabled only during a loss or failure of the production TS7700.
If you are not going to break the links between the sites, normal ownership transfer occurs whenever the test system requests a mount of a production volume.
12.5.3 Considerations for DR tests without Selective Write Protect mode
As an alternative to the Write Protect Mode (with or without FlashCopy), if you are at a lower TS7700 microcode level and want to prevent overwriting production data, you can use the TMS controls to enable only read access to the volumes in the production VOLSER ranges. However, the following process does not enable you to write data during the DR testing.
For example, with DFSMSrmm, you insert these extra statements into the EDGRMMxx parmlib member:
For production volumes in a range of A00000 - A09999, add this statement:
REJECT OUTPUT(A0*)
For production volumes in a range of ABC000 - ABC999, add this statement:
REJECT OUTPUT(ABC*)
With REJECT OUTPUT in effect, products and applications that append data to an existing tape with DISP=MOD must be handled manually to function correctly. If the product is DFSMShsm, tapes that are filling (seen as not full) from the test system control data set (CDS) must be modified to full by running commands. If DFSMShsm then later needs to write data to tape, it requires a scratch volume related to the test system’s logical volume range.
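For example, a DFSMShsm migration tape that the test system sees as partially full can be marked full with the DELVOL command; the volume serial is a placeholder:
DELVOL TST001 MIGRATION(MARKFULL)
This forces DFSMShsm on the test system to call for a scratch volume from the test logical volume range the next time it needs tape output.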
As a result of recent changes in DFSMSrmm, it now is easier to manage this situation:
In z/OS V1R10, the new commands PRTITION and OPENRULE provide flexible and simple control of mixed system environments as an alternative to the REJECT examples used here. These new commands are used in the EDGRMMxx member of parmlib (a hedged sketch follows this list).
You can specify extra EXPROC controls in the EDGHSKP SYSIN file to limit the return-to-scratch processing to specific subsets of volumes. So, you can just EXPROC the DR system volumes on the DR system and the PROD volumes on the PROD system. You can still continue to run regular batch processing, and also run expiration on the DR system.
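As a sketch only, an OPENRULE that rejects output to a production VOLSER prefix on the DR test system might look like the following statement. The operand names are recalled from the EDGRMMxx parmlib documentation and the prefix is a placeholder, so verify the exact syntax against the DFSMSrmm reference for your z/OS release:
OPENRULE VOLUME(A0*) TYPE(RMM) INPUT(ACCEPT) OUTPUT(REJECT)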
Figure 12-11 helps you understand how you can protect your tapes in a DR test while your production system continues running.
Figure 12-11 Work process in a disaster recovery test
Clarification: The term HSKP is used because this term is typically the job name that is used to run the RMM EDGHSKP utility for daily tasks, such as vital records processing, expiration processing, and backup of control and journal data sets. However, it can also refer to the daily process that must be done with other TMSs. This publication uses the term HSKP to mean the daily process on RMM or any other TMS.
Suspending housekeeping includes stopping any automatic short-on-scratch process, if enabled. For example, RMM has an emergency short-on-scratch procedure.
To illustrate the implications of running the HSKP task in a DR test system, see the example in Table 12-1, which displays the status and definitions of one cartridge in a normal situation.
Table 12-1 VOLSER AAAAAA before returned to scratch from the disaster recovery site
Environment   DEVSUP   TCDB      RMM      MI     VOLSER
PROD          0002     Private   Master   000F   AAAAAA
DR            0012     Private   Master   000F   AAAAAA
In this example, cartridge AAAAAA is a master volume in both environments. If, through an error or mistake, it is returned to scratch by the DR system, its status becomes as shown in Table 12-2.
Table 12-2 VOLSER AAAAAA after returned to scratch from the disaster recovery site
Environment   DEVSUP   TCDB      RMM       MI     VOLSER
PROD          0002     Private   Master    0012   AAAAAA
DR            0012     Scratch   Scratch   0012   AAAAAA
Cartridge AAAAAA is now in scratch category 0012, which presents two issues:
If you need to access this volume from the Prod system, you need to change its status to master (000F) in the MI before you can access it. Otherwise, you lose the data on the cartridge, which can have serious consequences if you, for example, return to scratch 1,000 volumes.
The DR RMM rejects the use of Prod cartridges for output activities. If this cartridge is mounted in response to a scratch mount, it is rejected by RMM. Imagine having to mount 1,000 scratch volumes, each rejected by RMM, before one is accepted.
Perform these tasks to protect production volumes from unwanted return to scratch:
Ensure that the RMM HSKP procedure does not run during the test window of the test system. There is a real risk of data loss if the test system returns production volumes to scratch and the TS7700 expiration time for virtual volumes is defined as 24 hours; after this time, the volumes can become unrecoverable.
Ensure that the RMM short-on-scratch procedure does not start. The results can be the same as running an HSKP.
If you are going to perform the test with the site-to-site links broken, you can use the ROT mode to prevent the test system from modifying the production site’s volumes. For more information about ownership takeover, see 2.3.34, “Autonomic Ownership Takeover Manager” on page 87.
In addition to the protection options that are described, you can also use the following RACF commands to protect the production volumes:
RDEFINE TAPEVOL x* UACC(READ) OWNER(SYS1)
SETR GENERIC(TAPEVOL) REFRESH
In the command, x is the first character of the VOLSER of the volumes to protect.
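For example, if the production VOLSERs all begin with A, as in the REJECT examples earlier in this section, the commands would be as follows (the profile prefix is illustrative only):
RDEFINE TAPEVOL A* UACC(READ) OWNER(SYS1)
SETR GENERIC(TAPEVOL) REFRESH
TAPEVOL profiles take effect only if tape volume protection is active (SETROPTS CLASSACT(TAPEVOL)).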
Returning to scratch without using Selective Write Protect
In a test environment where the links are maintained, care must be taken to ensure that logical volumes that are to be in the test are not returned to SCRATCH status and used by production applications to write new data. There are several ways to prevent conflicts between the return-to-scratch processing and the test use of older volumes:
1. Suspend all return-to-scratch processing at the production site. Unless the test is fairly short (hours, not days), this is not likely to be acceptable because of the risk of running out of scratch volumes, especially for native tape workloads. If all tape processing uses logical volumes, the risk of running out of scratch volumes can be eliminated by making sure that the number of scratch volumes available to the production system is enough to cover the duration of the test.
In z/OS V1R9 and later, you can specify more EXPROC controls in the EDGHSKP SYSIN file to limit the return-to-scratch processing to specific subsets of volumes. So, you can just EXPROC the DR system volumes on the DR system and the PROD volumes on the PROD system. Therefore, you can still continue to run regular batch processing and also run expiration on the DR system.
If a volume is returned to a scratch (Fast Ready) category during a DR test, mounting that volume through a specific mount does not recall the previously written data, even though the DR host still considers it private (its TCDB and RMM are a snapshot of the production data). The TS7700 always mounts a blank volume from a scratch (Fast Ready) category. The data can be recovered by assigning the volume back to a private (non-Fast Ready) category, or by taking that category out of the scratch (Fast Ready) list, and then trying the mount again.
Even if the number of volumes in the list is larger than the number of volumes that are needed per day times the number of days of the test, you still need to take steps to make it unlikely that a volume that is needed for test is reused by production.
For more information, see the IBM Virtualization Engine TS7700 Series Best Practices - Return-to-Scratch Considerations for Disaster Recovery Testing with a TS7700 Grid white paper at the following URL:
 
2. Suspend only the return-to-scratch processing for the production volumes that are needed for the test. For RMM, this can be done by using policy management through vital record specifications (VRSs). A volume VRS can be set up that covers each production volume so that it overrides any existing policies for data sets.
For example, the production logical volumes to be used in the test are in a VOLSER range of 990000 - 990999. To prevent them from being returned to scratch, the following subcommand is run on the production system:
RMM AS VOLUME(990*) COUNT(99999) OWNER(VTSTEST) LOCATION(CURRENT) PRIORITY(1)
Then, EDGHSKP EXPROC can be run and not expire the data required for test.
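The EXPROC subsetting that is mentioned in option 1 can be sketched as a SYSIN statement in the EDGHSKP housekeeping job. Here, DR0* is a hypothetical mask that covers only the DR host's own volume range; the exact operand names should be verified against the DFSMSrmm Implementation and Customization Guide for your release:
EXPROC VOLUMES(DR0*)
With a statement like this on the DR system (and a corresponding mask for production volumes on the production system), each system's return-to-scratch processing is confined to its own volume ranges.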
After the test is finished, you have a set of tapes in the TS7700 that belong to test activities. You need to decide what to do with these tapes. As a test ends, the RMM database and VOLCAT will probably be destaged (with all of the data used in the test). However, in the MI database, the tapes remain defined:
 – One is in master status.
 – The others are in SCRATCH status.
If the tapes are not needed anymore, manually release the volumes and then run EXPROC to return the volumes to scratch under RMM control. If the tapes will be used for future test activities, manually release these volumes. The cartridges remain in SCRATCH status and are ready for use. Remember to use a scratch category with an expiration time to ensure that no space is wasted.
 
Important: Although cartridges in the MI remain ready to use, you must ensure that, the next time you create the test environment, these cartridges are defined to RMM and the VOLCAT. Otherwise, you cannot use them.
12.5.4 FlashCopy for disaster recovery testing
When enabled, this function allows two instances of a logical volume to be handled on the same cluster. The DR host accesses the content of a logical volume as it was at time zero, while the live volume can still be updated with new copies pulled from the production cluster. You do not need to break the grid links to ensure that only data from time zero is available to the DR host.
For a detailed technical description, see IBM Virtualization Engine TS7700 Series Best Practices - FlashCopy for Disaster Recovery Testing, which is available at the Techdocs website (search for the term TS7700):
The following terms are newly introduced:
Live Copy: A real-time instance of a virtual tape within a grid that can be modified and replicated to peer clusters. This is the live instance of a volume on a cluster, which is the most recent version of the volume on that cluster. If the Live Copy is also consistent relative to the grid, it can be altered by a production host, or by a DR host when its category is in the write-protect exclusion list.
FlashCopy: A snapshot of a Live Copy at time zero. The content in the FlashCopy is fixed and does not change even if the original copy is modified or if replication events occur. A FlashCopy might not exist at a particular cluster if a live volume was not present within that cluster at time zero. In addition, a FlashCopy does not imply consistency, because the Live Copy might have been down-level relative to the grid, or simply incomplete, at time zero.
DR Family: A set of TS7700 clusters (most likely those at the DR site) that serve the purpose of DR. One to seven clusters can be assigned to a DR family. The DR family is used to determine which clusters should be affected by a flash request or write-protect request by using a host console request command (HCR). A DR Family of one TS7720 cluster is supported.
Write Protect Mode (existing function): When Write Protect Mode is enabled on a cluster, host commands fail if they are sent to logical devices in that cluster and attempt to modify a volume’s data or attributes, and that volume is not excluded from write protect. A FlashCopy is created on a cluster only when that cluster is in Write Protect Mode. Also, only write-protected virtual tapes are flashed; virtual tapes that are assigned to the excluded categories are not flashed.
Time Zero: The time when the FlashCopy is taken within a DR family. The time zero mimics the time when a real disaster happens. Customers can establish the time zero using a host console request command.
Basic requirements and concepts
All clusters in the grid must be running with R3.1 or higher microcode level to enable this function.
The FlashCopy for DR testing function is supported on TS7700 Grid configurations where at least one TS7720 cluster exists within the DR location. The function cannot be supported under TS7740-only grids or where a TS7740 is the only applicable DR cluster. A TS7740 might be present and used as part of the DR test if at least one TS7720 is also present in the DR site.
Volumes in the Write Protect exclusion categories are not subject to the Flash. For these categories, only a Live Copy exists.
During an enabled Flash, the autoremoval process is disabled for the TS7720 member of the DR Family. A TS7720 within a DR location requires extra capacity to accommodate the reuse of volumes and any DR test data that is created within an excluded category. Volumes that are not modified during the test require no additional TS7720 disk cache capacity. The extra capacity requirement must be considered when planning the size of the TS7720 disk cache.
If you are using the Time Delay Replication Policy, also check the cache usage of the remaining production TS7720 clusters. Volumes can be removed from a TS7720 only after the time-delayed ('T') copies are processed (either in the complete grid or in the family).
DR Family
In R3.1, one DR Family can be defined. A DR Family can be defined, modified, and deleted with the Library Request command. After a flash is enabled, a DR Family cannot be modified.
At least one TS7720 must be part of the DR Family. You can optionally include one or more TS7740s. The TS7740 does not have the same functions in a DR Family that the TS7720 has. The Write Protect excluded media categories need to be consistent on all clusters in a DR Family. If they are not consistent, the FlashCopy is not enabled.
Considerations
DR tests have the following restrictions:
There is no autoremoval of data from a TS7720 if the Flash is enabled.
Do not perform the DR testing by using the FlashCopy function when a cluster in the grid is unavailable. An attempt to enable a FlashCopy in this situation results in a failure. You can perform the DR testing by using the FlashCopy function if all clusters in the grid are powered on (they can be in service/offline state).
To perform the FlashCopy function, all clusters in a grid must be reachable through the grid links. Otherwise, host console commands to enable write protect mode/flash copy fail with an internal error.
Write Protect and FlashCopy enablement / disablement
The FlashCopy is based on Write Protect Mode. You can enable Write Protect Mode first and the FlashCopy later, or you can enable them together. To disable, you must first disable the FlashCopy and then Write Protect Mode; alternatively, you can run both actions with a single command.
 
Note: A FlashCopy cannot be enabled if Write Protect Mode was enabled from the MI.
Do not enable the FlashCopy if production hosts with tape processing have device allocations on the clusters where the Flash will be enabled. Failures might occur because the read-only mode does not allow subsequent mounts.
Livecopy enablement on a TS7740 in a DR Family
A DR Family must contain at least one TS7720. A TS7740 can optionally be defined to a DR Family. The TS7740 itself has no flash. To ensure that only data from time zero is used during a DR test, all mounts need to be run on the TS7720. The TS7720 uses the data in its own cache first. If no valid copy exists, the TS7720 identifies whether the TS7740 has a copy from before time zero. If no valid copy from time zero exists, the host mount fails.
A remote mount from the TS7740 can occur only if the livecopy option is enabled. To enable the livecopy option, run this command:
LI REQ, <clib_name>, DRSETUP, <family_name>, LIVECOPY, FAMILY
To disable the livecopy option, run this command:
LI REQ, <clib_name>, DRSETUP, <family_name>, LIVECOPY, NONE
The livecopy setting is persistent. Disabling the FlashCopy does not change the setting. Only a complete deletion of the DR Family can change the setting.
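For example, with the composite library name and DR family that are used in the examples later in this chapter, the enablement command is:
LI REQ,HYDRAG,DRSETUP,DRFAM01,LIVECOPY,FAMILY
The corresponding response is shown in Example 12-3.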
 
Important: Use the TS7740 in a DR Family only for remote mounts. Do not vary online the TS7740 devices directly to the DR host.
12.6 Disaster recovery testing detailed procedures for FlashCopy
Detailed instructions are provided that include all of the necessary steps to perform a DR test, such as pre-test tasks, post-test tasks, production host tasks, and recovery site tasks.
For a detailed description of all commands, see IBM Virtualization Engine TS7700 Series Best Practices - FlashCopy for Disaster Recovery Testing, which is available at the Techdocs website (search for the term TS7700):
12.6.1 Planning your disaster recovery test
Complete these steps to properly plan your DR test:
1. Ensure that the TS7720s in the DR Family have sufficient space to hold the Flash data. The Autoremoval function is not available while the Flash is enabled. Do a temporary autoremoval process in advance if necessary.
2. Define the DR Family name, and which clusters will be members of the DR Family.
3. If a TS7740 is part of the DR Family, define whether Livecopy should be used.
4. Define the Write Protect exclude media categories for the DR host.
5. Define the parameters of the scratch category (Expiration time).
6. Define the logical volume serial range used by the DR host.
7. Define the number of scratch volumes needed in the DR host.
8. Define the cleanup phase (scratch of the DR volume serial range).
9. Plan cache usage from TS7720 DR clusters and TS7720 production clusters during the DR test timeline.
12.6.2 Running Phase 1: Preparation
In this phase, all necessary definitions and actions before the actual enabling of the Flash are processed. The actual shutdown or restart of your DR host is not included because that depends on your situation.
1. Define the DEVSUPxx member in the DR host, and ensure that the new categories are used. Use either the DS QL,CATS command or perform an IPL. If you choose to switch categories with the command, ensure that no tape processing occurs before the switch. (A DEVSUPxx sketch follows this list.)
2. Change the TMS to enable the new volume serial ranges for output processing.
3. Insert the new volume serial ranges on the MI. Remember to have at least one device online to the DR host.
4. Define the Write Protect excluded media categories on all clusters (by using the MI) belonging to the DR Family. Remember, you need the MEDIA1, MEDIA2, and PRIVATE categories.
5. Change the Expiration time on the scratch category for MEDIA1 and MEDIA2 if necessary.
6. Vary all TS7740 devices offline to the DR host.
7. Modify the Automatic Allocation Managers device tables (if necessary).
8. Change the Autoremoval Temporary Threshold on the TS7720 used for DR testing to ensure that enough cache space is available for DR data and production data. Remember, no autoremoval can occur on this TS7720 during the DR test. Wait until the temporary autoremoval process completes.
9. If applicable, change the Autoremoval Temporary Threshold on the remaining production TS7720 and wait until the removal processing completes.
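The DEVSUPxx change in step 1 amounts to a few category statements. The following lines are a minimal sketch that assumes the DR host uses categories X'0031', X'0032', X'003E', and X'003F' (the same values that are used in the example in 12.7.1 later in this chapter); verify the keyword syntax in z/OS MVS Initialization and Tuning Reference:
MEDIA1=0031,
MEDIA2=0032,
ERROR=003E,
PRIVATE=003F
The categories that are in use can then be switched or confirmed with the DS QL,CATS command, as described in step 1.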
12.6.3 Running Phase 2: Enablement
Now, the DR Family is defined, and Write Protect and FlashCopy are enabled in one step. Also, enable the use of Livecopy on a TS7740 (if a TS7740 is a family member):
1. Create a DR Family or add a cluster (remember to add the TS7720 first). See Example 12-1.
LI REQ, <COMPOSITE>,DRSETUP, <FAMILYNAME>, ADD, <CLUSTER ID>
Example 12-1 Create a DR Family and add a cluster
-LI REQ,HYDRAG,DRSETUP,DRFAM01,add,1
CBR1020I Processing LIBRARY command: REQ,HYDRAG,DRSETUP,DRFAM01,ADD,1.
CBR1280I Library HYDRAG request. 939
Keywords: DRSETUP,DRFAM01,ADD,1
----------------------------------------------------------------------
DRSETUP V1 0.0
DR FAMILY DRFAM01 WAS NEWLY CREATED
CLUSTER 1 WAS ADDED TO DR FAMILY DRFAM01 SUCCESSFULLY
2. Add a TS7740 to the DR Family (only if required). See Example 12-2.
LI REQ, <COMPOSITE>,DRSETUP, <FAMILYNAME>, ADD, <CLUSTER ID>
Example 12-2 Add a TS7740 to the DR Family
LI REQ,HYDRAG,DRSETUP,DRFAM01,add,2
CBR1020I Processing LIBRARY command: REQ,HYDRAG,DRSETUP,DRFAM01,ADD,2.
CBR1280I Library HYDRAG request. 946
Keywords: DRSETUP,DRFAM01,ADD,2
----------------------------------------------------------------------
DRSETUP V1 0.0
CLUSTER 2 WAS ADDED TO DR FAMILY DRFAM01 SUCCESSFULLY
3. Define the Livecopy Usage (if needed). See Example 12-3.
LI REQ, <COMPOSITE>,DRSETUP, <FAMILYNAME>, LIVECOPY,FAMILY
Example 12-3 Define the Livecopy Usage
LI REQ,HYDRAG,DRSETUP,DRFAM01,LIVECOPY,FAMILY
CBR1020I Processing LIBRARY command: REQ,HYDRAG,DRSETUP,DRFAM01,LIVECOPY,FAMILY.
CBR1280I Library HYDRAG request. 230
Keywords: DRSETUP,DRFAM01,LIVECOPY,FAMILY
---------------------------------------------------------
DRSETUP V1 0.0
LIVE COPY USAGE HAS BEEN UPDATED TO FAMILY SUCCESSFULLY
4. Check the DR Family settings (Example 12-4).
LI REQ, <COMPOSITE>,DRSETUP, SHOW, <FAMILYNAME>
Example 12-4 Check the DR Family Settings
LI REQ,HYDRAG,DRSETUP,SHOW,DRFAM01
CBR1020I Processing LIBRARY command: REQ,HYDRAG,DRSETUP,SHOW,DRFAM01.
CBR1280I Library HYDRAG request. 302
Keywords: DRSETUP,SHOW,DRFAM01
----------------------------------------------------------------------
DRSETUP V1 0.0
DR FAMILY VIEW
ID FAM NAME FLASH FLASH TIME (UTC) LCOPY MEMBER CLUSTERS
1 DRFAM01 INACTIVE N/A FAMILY - 1 2 - - - - -
----------------------------------------------------------------------
FAMILY MEMBER WRITE PROTECT STATUS VIEW
CLUSTER WRT-PROTECT EXCATS-NUM IGNORE-FR ENABLED-BY
CLUSTER1 DISABLED 3 TRUE N/A
CLUSTER2 DISABLED 3 TRUE N/A
----------------------------------------------------------------------
CATEGORIES EXCLUDED FROM WRITE PROTECTION WITHIN DR FAMILY DRFAM01
CLUSTER ACTIVE EXCLUDED CATEGORIES
CLUSTER1 0092 009F 3002
CLUSTER2 0092 009F 3002
5. Enable the FlashCopy. See Example 12-5 on page 765.
LI REQ, <COMPOSITE>,DRSETUP, <FAMILYNAME>, DOALL,ENABLE
Example 12-5 Enable the FlashCopy
LI REQ,HYDRAG,DRSETUP,DRFAM01,DOALL,ENABLE
CBR1020I Processing LIBRARY command: REQ,HYDRAG,DRSETUP,DRFAM01,DOALL
ENABLE.
CBR1280I Library HYDRAG request. 154
Keywords: DRSETUP,DRFAM01,DOALL,ENABLE
---------------------------------------------------------------------
DRSETUP V1 0.0
WRITE PROTECT STATUS HAS BEEN ENABLED SUCCESSFULLY
FlashCopy HAS BEEN CREATED SUCCESSFULLY
6. Check the DR Family settings again. See Example 12-6.
LI REQ, <COMPOSITE>,DRSETUP, SHOW, <FAMILYNAME>
Example 12-6 Check the DR Family Settings
LI REQ,HYDRAG,DRSETUP,SHOW,DRFAM01
CBR1020I Processing LIBRARY command: REQ,HYDRAG,DRSETUP,SHOW,DRFAM01.
CBR1280I Library HYDRAG request. 758
Keywords: DRSETUP,SHOW,DRFAM01
---------------------------------------------------------------------
DRSETUP V1 0.0
DR FAMILY VIEW
ID FAM NAME FLASH FLASH TIME (UTC) LCOPY MEMBER CLUSTER
1 DRFAM01 ACTIVE 2014-02-24-14.03.35 FAMILY - 1 2 - - - -
---------------------------------------------------------------------
FAMILY MEMBER WRITE PROTECT STATUS VIEW
CLUSTER WRT-PROTECT EXCATS-NUM IGNORE-FR ENABLED-BY
CLUSTER1 ENABLED 3 TRUE LIREQ
CLUSTER2 ENABLED 3 TRUE LIREQ
---------------------------------------------------------------------
CATEGORIES EXCLUDED FROM WRITE PROTECTION WITHIN DR FAMILY DRFAM01
CLUSTER ACTIVE EXCLUDED CATEGORIES
CLUSTER1 0092 009F 3002
CLUSTER2 0092 009F 3002
12.6.4 Running Phase 3: Running the disaster recovery test
During the DR test, you might want to check the status of these logical volumes:
Newly produced volumes from production
Updated volumes from production
Newly produced volumes from DR
You can use the LVOL host console request to identify whether a FlashCopy exists for a specific volume, and to display the status of both the live copy and the FlashCopy.
To do so, use the LI REQ,<composite_library>,LVOL,<volser> command and the LI REQ,<composite_library>,LVOL,<volser>,FLASH command, as shown in the following examples.
If the live copy volume is identical to the FlashCopy volume, the status is ACTIVE. Only if the logical volume was updated from production, so that a second instance exists, does the status change to CREATED (Example 12-7).
Example 12-7 Display of a logical volume after modification from production - Livecopy
LI REQ,HYDRAG,LVOL,A08760
CBR1020I Processing LIBRARY command: REQ,HYDRAG,LVOL,A08760.
CBR1280I Library HYDRAG request. 883
Keywords: LVOL,A08760
-------------------------------------------------------------
LOGICAL VOLUME INFORMATION V3 0.0
LOGICAL VOLUME: A08760
MEDIA TYPE: ECST
COMPRESSED SIZE (MB): 2763
MAXIMUM VOLUME CAPACITY (MB): 4000
CURRENT OWNER: cluster1
MOUNTED LIBRARY:
MOUNTED VNODE:
MOUNTED DEVICE:
TVC LIBRARY: cluster1
MOUNT STATE:
CACHE PREFERENCE: PG1
CATEGORY: 000F
LAST MOUNTED (UTC): 2014-03-11 10:19:47
LAST MODIFIED (UTC): 2014-03-11 10:18:08
LAST MODIFIED VNODE: 00
LAST MODIFIED DEVICE: 00
TOTAL REQUIRED COPIES: 2
KNOWN CONSISTENT COPIES: 2
KNOWN REMOVED COPIES: 0
IMMEDIATE-DEFERRED: N
DELETE EXPIRED: N
RECONCILIATION REQUIRED: N
LWORM VOLUME: N
FlashCopy: CREATED
----------------------------------------------------------------
LIBRARY RQ CACHE PRI PVOL SEC PVOL COPY ST COPY Q COPY CP
cluster1 N Y ------ ------ CMPT - RUN
cluster2 N Y ------ ------ CMPT - RUN
Example 12-8 shows the flash instance of the same logical volume.
Example 12-8 Display of a logical volume after modification from production - Flash volume
LI REQ,HYDRAG,LVOL,A08760,FLASH
CBR1020I Processing LIBRARY command: REQ,HYDRAG,LVOL,A08760,FLASH
CBR1280I Library HYDRAG request. 886
Keywords: LVOL,A08760,FLASH
-----------------------------------------------------------------
LOGICAL VOLUME INFORMATION V3 0.0
FlashCopy VOLUME: A08760
MEDIA TYPE: ECST
COMPRESSED SIZE (MB): 0
MAXIMUM VOLUME CAPACITY (MB): 4000
CURRENT OWNER: cluster2
MOUNTED LIBRARY:
MOUNTED VNODE:
MOUNTED DEVICE:
TVC LIBRARY: cluster1
MOUNT STATE:
CACHE PREFERENCE: ---
CATEGORY: 000F
LAST MOUNTED (UTC): 1970-01-01 00:00:00
LAST MODIFIED (UTC): 2014-03-11 09:05:30
LAST MODIFIED VNODE:
LAST MODIFIED DEVICE:
TOTAL REQUIRED COPIES: -
KNOWN CONSISTENT COPIES: -
KNOWN REMOVED COPIES: -
IMMEDIATE-DEFERRED: -
DELETE EXPIRED: N
RECONCILIATION REQUIRED: N
LWORM VOLUME: -
---------------------------------------------------------------
LIBRARY RQ CACHE PRI PVOL SEC PVOL COPY ST COPY Q COPY CP
cluster2 N Y ------ ------ CMPT - RUN
Only the clusters from the DR Family (in this case only a TS7720 was defined in the DR Family) are shown. This information is also available on the MI.
Figure 12-12 shows the MI display of a volume with an active, created FlashCopy. That means that the logical volume is not only in a write-protected category and part of the flash, but was also updated during the DR test; therefore, the Flash instance was created. The detail for last access by a host is the information from the live copy (even on the DR cluster).
To see the information from the created FlashCopy instance, select the FlashCopy CREATED field. This opens a second view.
Figure 12-12 Display of a logical volume with an active FlashCopy
Figure 12-13 shows the next view, which is opened by clicking Created.
Figure 12-13 Display of the FlashCopy information of a logical volume
Run your DR test. During the execution, monitor the cache usage of your TS7720 clusters. For the TS7720 cluster used as DR, you have two new possibilities.
The following HCR command provides information about the space that is used by the FlashCopy at the bottom of its output. See Example 12-9.
LI REQ,<distributed library name>,CACHE
Example 12-9 Cache Consumption FlashCopy
LI REQ,distributed library name,CACHE
CBR1280I Library VTSDIST1 request.
Keywords: CACHE
----------------------------------------------------------------------
TAPE VOLUME CACHE STATE V3 0.0
PRIMARY TAPE MANAGED PARTITIONS
INSTALLED/ENABLED GBS 0/ 0
CACHE ENCRYPTION STATUS:
PARTITION ALLOC USED PG0 PG1 PMIGR COPY PMT CPYT
0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0
PRIMARY CACHE RESIDENT ONLY INFORMATION
INSTALLED/ENABLED GBS 95834/ 95834
ADJUSTED CACHE USAGE 5172
CACHE ENCRYPTION STATUS: CAPABLE
ALLOCATED USED PIN PKP PRM COPY CPYT
95834 5151 0 5150 0 0 0
FlashCopy INFORMATION
INDEX ENABLED SIZE
1 YES 252
2 NO 0
3 NO 0
4 NO 0
5 NO 0
6 NO 0
7 NO 0
8 NO 0
You can find the same information on the MI by selecting Monitor → Performance → Cache Usage.
Figure 12-14 is an example of Cache Utilization output.
Figure 12-14 Cache usage of FlashCopy data
You can also monitor the usage of your virtual drives. On the MI, select Virtual → Virtual Tape Drives.
Figure 12-15 is an example of virtual tape drive output.
Figure 12-15 Virtual Tape Drive window during a FlashCopy for disaster recovery test
12.6.5 Running Phase 4: Cleaning up
Before you end the DR test, clean up the environment. Depending on your DR test strategy, this might include these steps:
1. Scratch all used logical volumes from the DR host during the DR test.
2. Run a housekeeping job on the DR host that includes only the logical volume serial ranges used by the DR host.
3. Stop the DR host processing.
All data that is created on the DR host is in scratch status after this process. Depending on the definition of the scratch media category, this data expires soon, which ensures that it does not use any cache in the TS7700 clusters.
It is mandatory to run these processes before you disable the Write Protect and FlashCopy.
12.6.6 Running Phase 5: Disabling the Write Protect and FlashCopy
After the cleanup, you can disable the Write Protect and delete the FlashCopy. Example 12-10 shows the disable and delete in one step.
Example 12-10 Disable the Write Protect and FlashCopy
LI REQ,HYDRAG,DRSETUP,DRFAM01,DOALL,DISABLE
CBR1020I Processing LIBRARY command: REQ,HYDRAG,DRSETUP,DRFAM01,DOALL
DISABLE.
CBR1280I Library HYDRAG request. 765
Keywords: DRSETUP,DRFAM01,DOALL,DISABLE
---------------------------------------------------------------------
DRSETUP V1 0.0
WRITE PROTECT STATUS HAS BEEN DISABLED SUCCESSFULLY
FlashCopy HAS BEEN DELETED SUCCESSFULLY
You can now switch back to the original system setup. That means that you can bring all devices from all clusters of the DR Family back to production and change your Automatic Allocation Manager setup.
12.6.7 Expected failures during the DR test
This section covers some expected failures during a DR test.
The messages in Example 12-11 might appear if you try to read a logical volume that was not present at time zero in the DR Family.
Example 12-11 Expected failures during the disaster recovery test
IEF233A M 2500,A08759,,DENEKA1,STEP1,DENEKA.HG.TEST1.DUMP1
CBR4195I LACS retry possible for job DENEKA1: 399
IEE763I NAME= CBRLLACS CODE= 140394
CBR4000I LACS WAIT permanent error for drive 2500.
CBR4171I Mount failed. LVOL=A08759, LIB=HYDRAG, PVOL=??????,RSN=22
The message in Example 12-12 might also appear if you try to modify a volume that is in a write-protected media category.
Example 12-12 Error message for a volume in a write-protected media category
IEF116I DENEKY6 STEP1 - MOUNT OF VOLUME PRIVAT ON DEVICE 2580 FAILED
IEE763I NAME= CBRLLACS CODE= 14017E
CBR4000I LACS MOUNT permanent error for drive 2580.
CBR4126I Library HYDRAG drive is in read only mode.
IEF272I DENEKY6 STEP1 - STEP WAS NOT EXECUTED
The message in Example 12-13 might occur if a job was running on the cluster while the FlashCopy was enabled.
Example 12-13 Message for job running on the cluster while FlashCopy was enabled
IEF233A M 2507,A10088,,DENEKA8,STEP2,DENEKA.HG.TEST1.DUMP1
IEC518I SOFTWARE ERRSTAT: WRITPROT 2507,A10088,SL,DENEKA8,STEP2
IEC502E RK 2507,A10088,SL,DENEKA8,STEP2
IEC147I 613-24,IFG0194F,DENEKA8,STEP2,AUS1,2507,,DENEKA.HG.TEST1.DUMP1
12.7 Disaster recovery testing detailed procedures for alternatives before Release 3.1
Detailed instructions are provided that include all of the necessary steps to run a DR test, such as pre-test tasks, post-test tasks, production host tasks, and recovery site tasks.
The best DR test is a pseudo-real DR test, which means stopping the production site and starting real production at the DR site. However, stopping production is rarely realistic, so the following scenarios assume that production must continue working during the DR test. The negative aspect of this approach is that DR test procedures and real disaster procedures can differ slightly.
 
Tips: In a DR test on a TS7700 grid without using Selective Write Protect, with production systems running concurrently, be sure that no return-to-scratch or emergency short-on-scratch procedure is started on the test systems. For information about returning production tapes to scratch, see “Returning to scratch without using Selective Write Protect” on page 759.
In a DR test on a TS7700 grid that uses Selective Write Protect, with production systems running concurrently, you can use the Ignore fast ready characteristics of write-protected categories option, together with Selective Write Protect, as described in “Creating data during the disaster recovery test from the DR host: Selective Write Protect” on page 753.
Procedures are described for four scenarios, depending on the TS7700 release level, grid configuration, and connection status during the test:
1. TS7700 2-cluster grid that uses Selective Write Protect
This scenario describes the steps for running a DR test by using the Selective Write Protect DR testing enhancements. Whether the links between the clusters are broken is irrelevant. For more information, see 12.7.1, “TS7700 2-cluster grid that uses Selective Write Protect” on page 773.
2. TS7700 2-cluster grid without using Selective Write Protect
This scenario assumes that the DR test is run with production running in parallel on a TS7700 2-cluster grid. The links between both clusters are not broken, and you cannot use the Selective Write Protect DR enhancements. For more information, see 12.7.2, “TS7700 2-cluster grid not using Selective Write Protect” on page 779.
3. TS7700 2-cluster grid without using Selective Write Protect
This scenario assumes that the DR test is run on a TS7700 2-cluster grid without using Selective Write Protect with the links broken between both clusters so the production cannot be affected by the DR test. For more information, see 12.7.2, “TS7700 2-cluster grid not using Selective Write Protect” on page 779.
4. TS7700 3-cluster grid without using Selective Write Protect
This scenario is similar to the TS7700 2-cluster grid without using Selective Write Protect, but production runs in parallel on a three-cluster grid. The links between the clusters are not broken, and you cannot use the Selective Write Protect DR enhancements. See 12.7.3, “TS7700 3-cluster grid not using Selective Write Protect” on page 782.
12.7.1 TS7700 2-cluster grid that uses Selective Write Protect
Figure 12-16 shows a sample multi-cluster grid scenario that uses Selective Write Protect. The left cluster is the Production Cluster, and the right cluster is the DR Cluster.
Figure 12-16 Sample disaster recovery testing scenario with TS7700 by using Selective Write Protect
Clarification: You can also use the steps described in the following procedure when running DR testing on one cluster within a three-cluster or four-cluster grid. To run DR testing on more than one host or cluster, repeat the steps in the procedure on each of the DR hosts and clusters involved in the test.
Perform the following steps to prepare your DR environment:
1. Vary all virtual drives of the DR Cluster offline to the normal production hosts and to the DR hosts.
2. Ensure that the production hosts have access to the Production Cluster so that normal tape processing can continue.
3. On the MI, select Configuration → Write Protect Mode.
The window that is shown in Figure 12-17 opens.
Figure 12-17 TS7700 Write Protect Mode window
4. Click Enable Write Protect Mode to set the cluster in Write Protect Mode.
Be sure to also leave the Ignore fast ready characteristics of write protected categories selected. This setting ensures that volumes in Production scratch (Fast Ready) categories that are write-protected on the DR Cluster are treated differently.
Normally, when a mount occurs to one of these volumes, the TS7700 assumes that the host starts writing at BOT and creates a stub. Also, when Expire Hold is enabled, the TS7700 does not allow any host access to these volumes until the hold period passes.
Therefore, if the production host returns a volume to scratch after time zero, the DR host still believes, based on its catalog, that the volume is private, and it might want to validate the volume’s contents. You cannot afford to have the TS7700 stub the volume or block access if the DR host attempts to mount it.
The Ignore fast ready characteristics of write protected categories option informs the DR Cluster that it must ignore these characteristics and treat the volume as a private volume. The cluster then surfaces the data rather than a stub, and does not prevent access because of Expire Hold states. However, it still prevents write operations to these volumes.
Click Submit Changes to activate your selections.
5. Decide which set of categories you want to use during DR testing on the DR hosts and confirm that no host system is using this set of categories, for example X’0030’ - X’003F’.
You define those categories to the DR host in a later step.
On the DR cluster TS7700 MI, define two scratch (Fast Ready) categories as described in “Defining scratch categories” on page 518. These two categories are used on the DR host as scratch categories, MEDIA1 and MEDIA2 (X’0031’ and X’0032’), and are defined as excluded from Write-Protect mode.
6. In the DR cluster MI, use the Write Protect Mode window (shown in Figure 12-17 on page 774) to define the entire set of categories to be excluded from Write-Protect Mode, including the Error and the Private categories.
On the bottom of the window, click Select Action → Add, and then click Go. The next window opens (Figure 12-18).
Figure 12-18 Add Category window
Define the categories that you have decided to use for DR testing, and ensure that Excluded from Write Protect is set to Yes. In the example, you define volume categories X'0030' - X'003F' or, as a minimum, X'0031' (MEDIA1), X'0032' (MEDIA2), X'003E' (ERROR), and X'003F' (PRIVATE).
7. On the DR Cluster, ensure that no copy is written to the Production Cluster by defining a Copy Consistency Point of No Copy for the Production Cluster in the MC definitions that are used by the DR host.
8. On the DR host, restore your DR system.
9. Change the DEVSUPxx member on the DR host to use the newly defined DR categories. DEVSUPxx controls installation-wide default tape device characteristics, for example:
 – MEDIA1 = 0031
 – MEDIA2 = 0032
 – ERROR = 003E
 – PRIVATE = 003F
Therefore, the DR host is enabled to use these categories that have been excluded from Write-Protect Mode in Step 6.
10. On the DR host, define a new VOLSER range to your TMS.
11. Insert that VOLSER range on the DR Cluster and verify that Volume Insert Processing has assigned them to the correct scratch (Fast Ready) categories.
12. On the DR host, vary online the virtual drives of the DR Cluster, and start DR testing.
12.7.2 TS7700 2-cluster grid not using Selective Write Protect
The standard scenario is a DR test in a DR site while real production occurs. In this situation, the grid links are not broken because the production site is working and it needs to continue copying cartridges to the DR site to be ready if a real disaster happens while you are running the test. The following points are assumed:
The grid links must not be broken.
The production site is running everyday jobs as usual.
The DR site must not affect the production site in any way.
The DR site is ready to start if a real disaster happens.
Figure 12-19 shows the environment and the main tasks to perform in this DR situation.
Figure 12-19 Disaster recovery environment - two clusters and links not broken
Note the following information about Figure 12-19:
The production site can write and read its usual cartridges (in this case, 1*).
The production site can write in any address in Cluster 0 or Cluster 1.
The DR site can read production cartridges (1*), but cannot write to this range. You must create a new range for this purpose (2*) that is not accessible by the production site.
Ensure that no production tapes can be modified in any way by DR site systems.
Ensure that the production site does not rewrite tapes that are needed during the DR test.
Do not waste resources copying cartridges from the DR site to the production site.
Issues
Consider the following issues with TS7700 without using Selective Write Protect environments:
You must not run the HSKP process in the production site unless you can run it without the EXPROC parameter in RMM. In z/OS V1R10, the new RMM parmlib commands PRTITION and OPENRULE provide for flexible and simple control of mixed system environments.
In z/OS V1R9 and later, you can specify extra EXPROC controls in the EDGHSKP SYSIN file to limit the return-to-scratch processing to specific subsets of volumes. Therefore, you can use EXPROC on the DR system volumes on the DR system and use the PROD volumes on the PROD system. You can still continue to run regular batch processing and also run expiration on the DR system.
With other TMSs, you need to stop the return-to-scratch process, if possible. If not, stop the whole daily process. To avoid problems with scratch shortage, you can add more logical volumes.
If you run HSKP with the EXPROC parameter (or the daily process in other TMSs) at the production site, you must not expire volumes that might be needed in the DR test. If you expire them, the TS7700 sees these volumes as scratch. With the scratch (Fast Ready) category set, the TS7700 presents such a volume as a scratch volume, and you lose the data on the cartridge.
Ensure that HSKP or short-on-scratch procedures are deactivated in the DR site.
Tasks before the disaster recovery test
Before running the DR test of the TS7700 grid, prepare the environment and complete tasks that enable you to run the test without any problems or without affecting your production systems.
Perform the following steps:
1. Plan and decide the scratch categories that are needed in the DR site (1*). See “Number of scratch volumes needed per day” on page 162.
2. Plan and decide the VOLSER ranges that will be used to write in the DR site (2*).
3. Modify the production site PARMLIB RMM member EDGRMMxx:
a. Include REJECT ANYUSE(2*) to prevent the production system from using or accepting the insertion of 2* cartridges.
b. If your TMS is not RMM, disable CBRUXENT exit before inserting cartridges in the DR site.
4. Plan and decide the virtual address used in the DR site during the test (BE0-BFF).
5. Insert extra scratch virtual volumes to ensure that, during the DR test, production cartridges can return to scratch but are not rewritten afterward. This must be done at the production site. For more information, see “Physical Tape Drives window” on page 400.
6. Plan and define a new MC with a copy policy of NR (No Copy at the production cluster, RUN at the DR cluster) by using the MI at the DR site, for example, NOCOPY. For more information, see 8.2.8, “The Constructs icon” on page 406.
Tasks during the disaster recovery test
After starting the DR system, but before the real DR test can start, you must change several things to be ready to use tapes from the DR site. Usually, the DR system is started by using a clone image of the production system, so you need to alter certain values and definitions to customize the image for the DR site.
Follow these necessary steps:
1. Modify DEVSUPxx in SYS1.PARMLIB at the DR site and define the scratch category selected for DR.
2. Use the command DEVSERV QLIB,CATS at the DR site to change scratch categories dynamically. See “DEVSERV QLIB,CATS command” on page 591.
3. Modify the test PARMLIB RMM member EDGRMMxx at the DR site:
a. Include REJECT OUTPUT (1*) to enable only read activity against production cartridges.
b. If you have another TMS product, ask your software provider how to use a similar function, if one exists.
4. Modify test PARMLIB RMM member EDGRMMxx at the DR site and delete REJECT ANYUSE(2*) to enable write and insertion activity of 2* cartridges.
5. Define a new SMS MC (NOCOPY) in SMS CDS at the DR site.
6. Modify the MC ACS routine at the DR site. All writes must be directed to MC NOCOPY (see the ACS sketch after this list).
7. Restart the SMS configuration at the DR site.
8. Insert a new range (2*) of cartridges from the MI at the DR site. Ensure that all the cartridges are inserted into the DR TS7700 so that the owner is the TS7700 at the DR site:
a. If you have RMM, your cartridges are defined automatically to TCDB and RMM.
b. If you have another TMS, check with the original equipment manufacturer (OEM) software provider. In general, to add cartridges to other TMSs, you need to stop them.
9. Perform the next modification in DFSMShsm at the DR site:
a. Mark all hierarchical storage management (HSM) Migration Level 2 (ML2) cartridges as full by using the DELVOL MARKFULL HSM command.
b. Run HOLD HSM RECYCLE.
10.  Again, ensure that the following procedures do not run:
 – RMM housekeeping activity at the DR site
 – Short-on-scratch RMM procedures at the DR site
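The MC ACS routine change in step 6 can be as small as the following sketch. The esoteric unit name VTSDR is hypothetical, and the filter is shown only to illustrate limiting the assignment to tape allocations; on a DR-only host, you might simply set the MC unconditionally:
PROC MGMTCLAS
 /* DR host: send every tape allocation to the NOCOPY MC so that  */
 /* nothing written during the test is replicated to production.  */
 IF &UNIT = 'VTSDR'
   THEN SET &MGMTCLAS = 'NOCOPY'
END
After the routine is translated and the SMS configuration is restarted (step 7), all new write allocations on the DR host receive the NOCOPY construct.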
Tasks after the disaster recovery test
After the test is finished, you have a set of tapes in the TS7700 that are used by the test activities. You must decide what to do with these tapes. As the test ends, the RMM database and VOLCAT are destaged (because all the data was used in the test), but in the MI database, the tapes remain defined: One is in master status and the others in SCRATCH status.
What you do with these tapes depends on whether they are no longer needed or if the tapes will be used for future DR test activities.
If the tapes are not needed anymore, complete the following steps:
1. Stop the RMM address space and subsystem, and by using Interactive Storage Management Facility (ISMF) 2.3 (at the DR site), return to scratch all private cartridges.
2. After all of the cartridges are in the SCRATCH status, use ISMF 2.3 again (at the DR site) to eject all the cartridges. The MI can accept only 1,000 eject commands at one time. If you must eject a higher number of cartridges, the process is time-consuming.
In the second case (tapes will be used in the future), run only step 1. The cartridges remain in the SCRATCH status and are ready for future use.
 
Important: Although cartridges in the MI remain ready to use, you must ensure that, the next time you create the test environment, these cartridges are defined to RMM and the VOLCAT. Otherwise, you cannot use them.
TS7700 2-cluster grid not using Selective Write Protect with the links broken
In other situations, you can choose to break grid links, even if your production system is running during a DR test.
Assume that the following information is true:
The grid links are broken.
The production site is running everyday jobs as usual.
The DR site cannot affect the production site.
The DR site is ready for a real disaster.
Do not use logical drives in the DR site from the production site.
If you decide to break links during your DR test, you must review carefully your everyday work. For example, if you have 3 TB of cache and you write 4 TB of new data every day, you are a good candidate for a large amount of throttling, probably during your batch window. To understand throttling, see 10.3.7, “Throttling in the TS7700” on page 611.
After the test ends, you might have many virtual volumes in pending-copy status. When the TS7700 grid links are restored, communication restarts, and the first task that the TS7700 runs is to copy the volumes that were created during the window when the links were broken. This task can affect TS7700 performance.
If your DR test runs over several days, you can minimize the performance degradation by suspending copies by using the GRIDCNTL Host Console command. After your test is over, you can enable the copy again during a low activity workload to avoid or minimize performance degradation. See 9.1.3, “Host Console Request function” on page 576 for more information.
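A hedged sketch of the commands involved follows. GRIDCNTL is the keyword named above; the COPY,DISABLE and COPY,ENABLE operands are shown as commonly documented but should be verified against the TS7700 Host Console Request User’s Guide for your microcode level. The request is typically directed at the distributed library of the DR cluster:
LI REQ,<distributed_library>,GRIDCNTL,COPY,DISABLE
LI REQ,<distributed_library>,GRIDCNTL,COPY,ENABLE
Disable the copies before the test window, and enable them again during a period of low activity after the test.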
Figure 12-20 shows the environment and the main tasks to perform in this DR scenario.
Figure 12-20 Disaster recovery environment - two clusters and broken links
Note the following information about Figure 12-20:
The production site can write and read its usual cartridges (in this case, 1*).
The production site writes only to virtual addresses associated with Cluster 0. The tapes remain in pending-copy state.
The DR site can read production cartridges (1*) but cannot write on this range. You must create a new one for this purpose (2*). This new range must not be accessible by the production site.
Ensure that no production tapes can be modified by the DR site systems.
Ensure that the production site does not rewrite tapes that are needed during the DR test.
Do not waste resources copying cartridges from the DR site to the production site.
Issues
Consider the following items:
You can run the whole HSKP process at the production site. Because communications are broken, the return-to-scratch process cannot be completed in the DR TS7700, so your production tapes never return to scratch in the DR site.
In this scenario, be sure that HSKP or short-on-scratch procedures are deactivated in the DR site.
Tasks before the disaster recovery test
Before you start the DR test for the TS7700 grid, prepare the environment and complete several tasks so that you can run the test without any problems, and without affecting your production site. Perform the following steps:
1. Plan and decide on the scratch categories needed at the DR site (1*). See “Number of scratch volumes needed per day” on page 162 for more information.
2. Plan and decide on the VOLSER ranges that will be used to write at the DR site (2*).
3. Plan and decide on the virtual address used at the DR site during the test (BE0-BFF).
4. Plan and define a new MC with copy policies on NR in the MI at the DR site, for example, NOCOPY. For more information, see 8.2.8, “The Constructs icon” on page 406.
Tasks during the disaster recovery test
After starting the DR system, but before DR itself can start, you must change several things to be ready to use tapes from the DR site. Usually, the DR system is started by using a clone image of the production system, so you need to alter certain values and definitions to customize the DR site.
Perform the following steps:
1. Modify DEVSUPxx in SYS1.PARMLIB at the DR site and define the scratch categories selected for DR.
2. Use the DEVSERV QLIB,CATS command at the DR site to change scratch categories dynamically. See “DEVSERV QLIB,CATS command” on page 591 for more information.
3. Modify the test PARMLIB RMM member EDGRMMxx at the DR site:
a. Include REJECT OUTPUT (1*) to enable only read activity against production cartridges.
b. If you have another TMS product, ask your software provider for a similar function. There might not be similar functions in other TMSs.
4. Define a new SMS MC (NOCOPY) in SMS CDS at the DR site.
5. Modify the MC ACS routine at the DR site. All the writes must be directed to MC NOCOPY.
6. Restart the SMS configuration at the DR site.
7. Insert a new range of cartridges from the MI at the DR site. Ensure that all the cartridges are inserted in the DR TS7700 so that the ownership of these cartridges is at the DR site:
a. If you have RMM, your cartridges are defined automatically to TCDB and RMM.
b. If you have another TMS, check with the OEM software provider. In general, to add cartridges to other TMSs, you need to stop them.
8. Now, you can break the link connection between clusters. If you complete this step before cartridge insertion, the insertion fails.
9. If either of the following conditions applies, skip this step:
 – If you have the Autonomic Ownership Takeover function running.
 – If you usually write in the production site. See “The Service icon” on page 469 for more information.
Otherwise, modify the ownership takeover mode in the MI on the cluster at the production site. Select write ownership takeover (WOT) mode, which is needed only if you are working in balanced mode.
10. Modify the ownership takeover mode in the MI on the cluster at the DR site. Select read-only ownership takeover (ROT) mode because you need only to read production cartridges.
11. Perform the next modification in DFSMShsm at the DR site:
a. Mark all HSM ML2 cartridges as full by using the DELVOL MARKFULL HSM command.
b. Run HOLD HSM RECYCLE.
12. Again, ensure that the following procedures do not run:
 – RMM housekeeping activity at the DR site
 – Short on scratch RMM procedures at the DR site
Tasks after the disaster recovery test
After the test is finished, you have a set of tapes in the TS7700 that belong to test activities. You need to decide what to do with these tapes. As the test ends, the RMM database and VOLCAT are destaged (as is all the data that is used in the test), but in the MI database, the tapes remain defined: One is in master status and the others in SCRATCH status.
What you do with these tapes depends on whether they are not needed anymore, or if the tapes will be used for future DR test activities.
If the tapes are not needed anymore, complete the following steps:
1. Stop the RMM address space and subsystem, and by using ISMF 2.3 (at the DR site), return to scratch all private cartridges.
2. After all of the cartridges are in the SCRATCH status, use ISMF 2.3 again (at the DR site) to eject all the cartridges. MI can accept only 1,000 eject commands at one time. If you must eject a high number of cartridges, the process is time-consuming.
In the second case (tapes will be used in the future), run only step 1. The cartridges remain in the SCRATCH status and are ready for future use.
 
Important: Although cartridges in the MI remain ready to use, you must ensure that, the next time you create the test environment, these cartridges are defined to RMM and the VOLCAT. Otherwise, you cannot use them.
12.7.3 TS7700 3-cluster grid not using Selective Write Protect
This scenario covers a three-cluster grid. In general, two of the clusters are on a production site and have high availability locally. From the DR point of view, this scenario is similar to the two grid procedures described earlier.
Assume that the following information is true:
The grid links are not broken.
The production site will be running everyday jobs as usual.
The DR site must not affect the production site at all.
The DR site is ready to start if a real disaster happens.
Figure 12-21 shows the environment and the major tasks to complete in this DR situation.
Figure 12-21 Disaster recovery environment - three clusters and links not broken
Note the following information about Figure 12-21:
The production site can write and read its usual cartridges (in this case, 1*).
The production site can write in any address in Cluster 0 or Cluster 1.
The DR site can read production cartridges (1*) but cannot write on this range. You need to create a new range for this purpose (2*). This new range must not be accessible by the production site.
Ensure that no production tapes can be modified in any way by DR site systems.
Ensure that the production site does not rewrite tapes that are needed during the DR test.
Do not waste resources copying cartridges from the DR site to the production site.
Issues
Take the following issues into consideration:
You must not run the HSKP process at the production site unless you can run it without the EXPROC parameter in RMM. With other TMSs, stop the return-to-scratch process, if possible. If that is not possible, stop the whole daily process. To avoid problems with scratch shortage, you can add more logical volumes.
If you run HSKP with the EXPROC parameter (or the daily process in other TMSs) at the production site, you must not expire volumes that are needed in the DR test. If you expire them, the TS7700 treats those volumes as scratch (Fast Ready); the TS7700 then presents such a volume as a scratch volume, and you lose the data on the cartridge.
Again, ensure that the HSKP and short-on-scratch procedures are deactivated at the DR site.
Tasks before the disaster recovery test
Before you run a DR test on the TS7700 grid, prepare the environment and complete tasks that enable you to run the test without complications or affecting your production site.
Complete the following steps:
1. Plan and decide upon the scratch categories needed at the DR site (1*).
2. Plan and decide upon the VOLSER ranges that will be used to write at the DR site (2*).
3. Modify the production site PARMLIB RMM member EDGRMMxx:
a. Include REJECT ANYUSE (2*) to prevent the production site from using or accepting the insertion of 2* cartridges.
b. If your TMS is not RMM, disable the CBRUXENT exit before inserting cartridges at the DR site.
4. Plan and decide upon the virtual address used at the DR site (C00-CFF).
5. Insert extra scratch virtual volumes at the production site to ensure that, during the DR test, production cartridges can return to scratch but are not rewritten.
6. Plan and define a new MC with a copy policy of NR (No Copy at the production cluster, RUN at the DR cluster) in the MI at the DR site, for example, NOCOPY.
7. Remove the Fast Ready attribute for the production scratch category at the DR site TS7700. Do this during the DR test.
Tasks during the disaster recovery test
After starting the DR system, but before DR itself can start, you must change several things to be ready to use tapes from the DR site. Usually, the DR system is started by using a clone image of the production system, so you need to alter certain values and definitions to customize the DR site.
Perform the following steps:
1. Modify DEVSUPxx in SYS1.PARMLIB at the DR site and define the scratch category
for DR.
2. Use the DEVSERV QLIB,CATS command at the DR site to change scratch categories dynamically. See “DEVSERV QLIB,CATS command” on page 591 for more information.
3. Modify the test PARMLIB RMM member EDGRMMxx at the DR site:
a. Include REJECT OUTPUT (1*) to enable only read activity against production cartridges.
b. If you have another TMS product, ask your software provider for a similar function. There might not be similar functions in other TMSs.
4. Modify test PARMLIB RMM member EDGRMMxx at the DR site and delete REJECT ANYUSE(2*) to enable write and insertion activity of 2* cartridges.
5. Define a new SMS MC (NOCOPY) in SMS CDS at the DR site.
6. Modify the MC ACS routine at the DR site. All the writes must be directed to MC NOCOPY.
7. Restart the SMS configuration at the DR site.
8. Insert a new range (2*) of cartridges from the MI at the DR site. Ensure that all the cartridges are inserted into the DR TS7700 so that the ownership of these cartridges belongs to the TS7700 at the DR site:
 – If you have RMM, your cartridges are defined automatically to TCDB and RMM.
 – If you have another TMS, check with the OEM software provider. In general, to add cartridges to other TMSs, you need to stop them.
9. Modify the DFSMShsm at the DR site:
a. Mark all HSM ML2 cartridges as full by using the DELVOL MARKFULL HSM command.
b. Run HOLD HSM RECYCLE.
10.  Again, ensure that the following procedures are not running:
 – RMM housekeeping activity at the DR site
 – Short on scratch RMM procedures at the DR site
Tasks after the disaster recovery test
After the test is finished, you have a set of tapes in the TS7700 that belong to test activities. You need to decide what to do with these tapes. As the test ends, the RMM database and VOLCAT are destaged (along with all the data that was used in the test), but the tapes remain defined in the MI database: One is in master status, and the others are in SCRATCH status.
What you do with these tapes depends on whether they are not needed anymore, or if the tapes will be used for future DR test activities.
If the tapes are not needed anymore, complete the following steps:
1. Stop the RMM address space and subsystem, and by using ISMF 2.3 (at the DR site), return to scratch all private cartridges. Be sure that RMM Housekeeping is run against only the VOLSER range that was created and used on the DR site cluster. The VOLUMES parameter can be added to the EXPROC step in the Housekeeping job to process only these volumes. Running EXPROC without this parameter can result in production tapes being returned to scratch.
2. After all of the cartridges are in SCRATCH status, use ISMF 2.3 again (at the DR site) to eject all the cartridges. The MI can accept only 1,000 eject commands at a time, so ejecting a large number of cartridges is time-consuming.
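As an illustration of limiting EXPROC to the test range only, the housekeeping step might look like the following sketch. The data set name and the 2* filter are placeholders, and the exact VOLUMES operand syntax depends on your DFSMSrmm level, so verify it against the DFSMSrmm documentation before use.

   //HSKP     EXEC PGM=EDGHSKP,PARM='EXPROC,VOLUMES(2*)'
   //MESSAGE  DD DISP=SHR,DSN=RMM.DRTEST.MESSAGES
   //SYSPRINT DD SYSOUT=*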
If the tapes will be used for future DR tests, run only step 1. The cartridges remain in SCRATCH status and are ready for future use.
 
Important: Although the cartridges in the MI remain ready to use, you must ensure that, the next time you create the test environment, these cartridges are defined to RMM and the VOLCAT. Otherwise, you cannot use them. A sketch of one way to do this follows.
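One way to make the retained test cartridges known to a freshly cloned DR system is to define them to RMM (for example, with the RMM ADDVOLUME TSO subcommand) and to add matching entries to the VOLCAT. The following IDCAMS sketch shows the idea for a single volume; the VOLSER (A20000), library name (DRLIB), and media type are assumptions, and the operands shown are not complete.

   //DEFVOL   EXEC PGM=IDCAMS
   //SYSPRINT DD SYSOUT=*
   //SYSIN    DD *
     CREATE VOLUMEENTRY (NAME(VA20000) -
       LIBRARYNAME(DRLIB) -
       MEDIATYPE(MEDIA2) -
       USEATTRIBUTE(SCRATCH) -
       LOCATION(LIBRARY))
   /*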
12.8 A real disaster
To clarify what a real disaster means, consider a hardware issue that, for example, stops the TS7700 for 12 hours. Is this a real disaster? It depends.
For a bank in the middle of its batch window, with no alternative way to bypass a 12-hour TS7700 outage, this can be a real disaster. However, if the bank has a three-cluster grid (two local clusters and one remote), the same situation is less dire because the batch window can continue by accessing the second local TS7700.
Because no single set of answers fits all situations, you must carefully and clearly define which events are considered real disasters and which actions to perform in each case.
As explained in 12.7, “Disaster recovery testing detailed procedures for alternatives before Release 3.1” on page 772, several differences exist between a DR test and a real disaster. In a real disaster, you do not have to change anything to be able to use the DR TS7700, which makes your task easier. However, this ease of use does not mean that all the cartridge data has been copied to the DR TS7700.
If your copy mode is RUN, you need to consider only the in-flight tapes that are being created when the disaster happens. You must rerun all of these jobs to re-create the tapes at the DR site. Alternatively, if your copy mode is Deferred, you also have tapes that are not copied yet. To identify which tapes are not copied, go to the MI of the DR TS7700 and find the volumes that are still in the copy queue. With this information, you can use your TMS to discover which data sets are missing, and rerun the jobs to re-create these data sets at the DR site.
Figure 12-22 shows an example of a real disaster situation.
Figure 12-22 Real disaster situation
In a real disaster scenario, the whole primary site is lost. Therefore, you need to start your production systems at the DR site. To do this, you need a copy of all your information at the DR site: not only the tape data, but also all of the DASD data.
After you can start the z/OS partitions, from the TS7700 perspective, you must be sure that your hardware configuration definition (HCD) “sees” the DR TS7700. Otherwise, you cannot bring the TS7700 online.
You must also change the ownership takeover setting. To do so, go to the MI and enable ownership takeover for read and write.
All of the other changes that you made for your DR test are not needed now. Production tape ranges, scratch categories, SMS definitions, the RMM inventory, and so on are already part of the real configuration on the DASD that is copied from the primary site.
Perform the following changes because of the special situation that a disaster creates:
Change your MC to obtain a dual copy of each tape that is created after the disaster.
Depending on the situation, consider using the Copy Export capability to move one of the copies outside the DR site.
After you are in a stable situation at the DR site, you need to start the tasks that are required to recover your primary site or to create a new site. The old DR site is now the production site, so you must create a DR site, which is beyond the scope of this book.
12.9 Geographically Dispersed Parallel Sysplex for z/OS
The z Systems multisite application availability solution, Geographically Dispersed Parallel Sysplex (GDPS), integrates Parallel Sysplex technology and remote copy technology to enhance application availability and improve DR. The GDPS topology is a Parallel Sysplex cluster that is spread across two sites, with all critical data mirrored between the sites. GDPS manages the remote copy configuration and storage subsystems, automates Parallel Sysplex operational tasks, and automates failure recovery from a single point of control, improving application availability.
12.9.1 Geographically Dispersed Parallel Sysplex considerations in a TS7700 grid configuration
A key principle of GDPS is to have all I/O be local to the system running production. Another principle is to provide a simplified method to switch between the primary and secondary sites, if needed. The TS7700 grid configuration provides a set of capabilities that can be tailored to enable it to operate efficiently in a GDPS environment. Those capabilities and how they can be used in a GDPS environment are described in the following sections.
Direct production data I/O to a specific TS7740
The hosts are directly attached to the TS7740 that is local to them, so that is your first consideration in directing I/O to a specific TS7740. Host channels from each site’s GDPS hosts are also typically installed to connect to the TS7740 at the remote site, but only to cover recovery when the TS7740 cluster at the GDPS primary site is down. During normal operation, the remote virtual devices are set offline in each GDPS host.
The default behavior of the TS7740 in selecting which TVC is used for the I/O is to follow the MC definitions and considerations that provide the best overall job performance. However, unless override settings are used on a cluster, the TS7740 uses a logical volume in a remote TS7740’s TVC, if required, to perform a mount operation.
To direct the TS7740 to use its local TVC, complete the following steps:
1. For the MC that is used for production data, ensure that the local cluster has a Copy Consistency Point. If it is important to know that the data is replicated at job close time, specify a Copy Consistency Point of RUN or Synchronous mode copy.
If some amount of data loss after a job closes can be tolerated, a Copy Consistency Point of Deferred can be used. You might have production data with different data loss tolerance. If that is the case, you might want to define more than one MC with separate Copy Consistency Points. In defining the Copy Consistency Points for an MC, it is important that you define the same copy mode for each site because in a site switch, the local cluster changes.
2. Set Prefer Local Cache for Fast Ready Mounts in the MI Copy Policy Override window. This override selects the TVC local to the TS7740 on which the mount was received if it is available and a Copy Consistency Point other than No Copy is specified for that cluster in the MC specified with the mount. The cluster does not have to have a valid copy of the data for it to be selected for the I/O TVC.
3. Set Prefer Local Cache for Non-Fast Ready Mounts in the MI Copy Policy Override window. This override selects the TVC local to the TS7740 on which the mount was received if it is available and the cluster has a valid copy of the data, even if the data is only on a physical tape. Having an available, valid copy of the data overrides all other selection criteria. If the local cluster does not have a valid copy of the data, without the next override, it is possible that the remote TVC is selected.
4. Set Force Volume Copy to Local. This override has two effects, depending on the type of mount requested. For a private mount, if a valid copy does not exist on the cluster, a copy is performed to the local TVC as part of the mount processing. For a scratch mount, it has the effect of OR-ing the specified MC with a Copy Consistency Point of RUN for the cluster, which forces the local TVC to be used. The override does not change the definition of the MC. It serves only to influence the selection of the I/O TVC or to force a local copy.
5. Ensure that these override settings are duplicated on both TS7740 Virtualization Engines.
Switching site production from one TS7700 to another one
The way that data is accessed by either TS7740 is based on the logical volume serial number. No changes are required in tape catalogs, job control language (JCL), or TMSs. If a failure occurs in a TS7740 grid environment with GDPS, three scenarios are possible:
GDPS switches the primary host to the remote location and the TS7740 grid is still fully functional:
 – No manual intervention is required.
 – Logical volume ownership transfer is done automatically during each mount through the grid.
A disaster happens at the primary site, and the GDPS host and TS7740 cluster are down or inactive:
 – Automatic ownership takeover of volumes that are then accessed from the remote host is not possible.
 – Manual intervention is required. Through the TS7740 MI, the administrator must start a manual ownership takeover. To do so, use the TS7740 MI and click Service → Ownership Takeover Mode.
Only the TS7740 cluster at the GDPS primary site is down. In this case, two manual interventions are required:
 – Vary the remote TS7740 cluster devices online from the primary GDPS host (see the example after this list).
 – Because automatic ownership takeover of volumes that are then accessed from the remote host is not possible, manual intervention is required. Through the TS7740 MI, start a manual ownership takeover: click Service → Ownership Takeover Mode.
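Varying the remote devices online (the first of these interventions) is a standard MVS VARY command. The device range shown here is only an illustration; use the addresses that are defined for the remote TS7740 in your HCD.

   V (0C00-0C1F),ONLINE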
12.9.2 Geographically Dispersed Parallel Sysplex functions for the TS7700
GDPS provides TS7700 configuration management and displays the status of the managed TS7700 tape drives on GDPS windows. TS7700 tape drives that are managed by GDPS are monitored, and alerts are generated for abnormal conditions. The capability to control TS7700 replication from GDPS scripts and windows by using TAPE ENABLE and TAPE DISABLE by library, grid, or site is provided for managing the TS7700 during planned and unplanned outage scenarios.
The TS7700 provides a capability called Bulk Volume Information Retrieval (BVIR). If there is an unplanned interruption to tape replication, GDPS uses this BVIR capability to automatically collect information about all volumes in all libraries in the grid where the replication problem occurred. In addition to this automatic collection of in-doubt tape information, you can request GDPS to perform BVIR processing for a selected library at any time by using the GDPS window interface.
GDPS supports a physically partitioned TS7700. For more information about the steps that are required to physically partition a TS7700, see Appendix I, “Case study for logical partitioning of a two-cluster grid” on page 905.
12.9.3 Geographically Dispersed Parallel Sysplex implementation
Before implementing the GDPS support for TS7700, ensure that you review and understand the following topics:
IBM Virtualization Engine TS7700 Series Best Practices Copy Consistency Points, which is available at the following website:
IBM Virtualization Engine TS7700 Series Best Practices Synchronous Copy Mode, which is available at the following website:
The complete instructions for implementing GDPS with the TS7700 can be found in the GDPS manuals.