Introduction to business resilience and the role of GDPS
In this chapter we discuss the objective of this book and briefly introduce the contents and layout. We discuss the topic of business IT resilience (which we refer to as IT resilience for brevity) from a technical perspective.
The chapter includes a general discussion that is not specific to mainframe platforms, although the topics are covered from an enterprise systems and mainframe perspective. Finally, we introduce the members of the Geographically Dispersed Parallel Sysplex (GDPS) family of offerings and provide a brief description of the aspects of an IT resilience solution that each offering addresses.
1.1 Objective
Business IT resilience is a high-profile topic across many industries and businesses. Apart from the business drivers requiring near-continuous application availability, government regulations in various industries now take the decision about whether to have an IT resilience capability out of your hands.
This document was developed to provide an introduction to the topic of business resilience from an IT perspective, and to share how GDPS can help you address your IT resilience requirements.
1.2 Layout of this document
This chapter starts by presenting an overview of IT resilience and disaster recovery. These practices have existed for many years. However, recently they have become more complex due to a steady increase in the complexity of applications, the increasingly advanced capabilities of available technology, competitive business environments, and government regulations.
In Chapter 2, “Infrastructure planning for availability and GDPS” on page 13, we briefly describe the available technologies typically leveraged in a GDPS solution to achieve IT resilience goals. To understand the positioning and capabilities of the various offerings (which encompass hardware, software, and services), it is useful to have at least a basic understanding of the underlying technology.
Following these two introductory chapters and starting with Chapter 3, “GDPS/PPRC” on page 51, we describe the capabilities and prerequisites of each offering in the GDPS family of offerings. Because each offering addresses fundamentally different requirements, each member of the GDPS family of offerings is described in a chapter of its own.
Most enterprises today have a heterogeneous IT environment including a variety of hardware and software platforms. After covering the GDPS family of offerings, Chapter 8, “GDPS extensions for heterogeneous systems and data” on page 213 describes the GDPS facilities that can provide a single point of control to manage data across all the server platforms within an enterprise IT infrastructure.
Finally, we include a section with examples illustrating how the various GDPS offerings can satisfy your requirements for IT resilience and disaster recovery.
1.3 IT resilience
IBM defines IT resilience as the ability to rapidly adapt and respond to any internal or external disruption, demand, or threat, and continue business operations without significant impact.
IT resilience is related to, but broader in scope than, disaster recovery. Disaster recovery concentrates solely on recovering from an unplanned event.
When investigating IT resilience options, two things must be at the forefront of your thinking:
Recovery Time Objective (RTO)
This term refers to how long your business can afford to wait for IT services to be resumed following a disaster.
If this number is not clearly stated now, think back to the last time you had a significant service outage. How long was that outage, and how much pain did your company suffer as a result? This will help you get a sense of whether to measure your RTO in days, hours, or minutes.
Recovery Point Objective (RPO)
This term refers to how much data your company is willing to have to recreate following a disaster. In other words, what is the acceptable time difference between the data in your production system and the data at the recovery site?
As an example, if your disaster recovery solution depends on once-daily full volume tape dumps, your RPO is 24 to 48 hours depending on when the tapes are taken offsite. If your business requires an RPO of less than 24 hours, you will almost certainly be forced to perform some form of offsite real time data replication instead of relying on these tapes alone.
The terms RTO and RPO are used repeatedly in this document because they are core to the methodology that you can use to meet your IT resilience needs.
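The tape example above can be expressed as a small calculation. The following Python sketch is illustrative only: the function name is hypothetical, and it assumes that the worst-case exposure equals the interval between backups plus the time a backup waits before it is taken offsite.

```python
from datetime import timedelta


def worst_case_rpo(backup_interval: timedelta, offsite_delay: timedelta) -> timedelta:
    """Worst-case data loss window for periodic backups that are shipped offsite.

    Assumption (not from the source text): a disaster can strike just before
    the next backup reaches the offsite location, so the newest usable copy is
    as old as one full backup interval plus the offsite delay.
    """
    if backup_interval <= timedelta(0) or offsite_delay < timedelta(0):
        raise ValueError("interval must be positive and delay must not be negative")
    return backup_interval + offsite_delay


# Once-daily dumps taken offsite immediately, or only after a further day.
immediate = worst_case_rpo(timedelta(hours=24), timedelta(hours=0))
next_day = worst_case_rpo(timedelta(hours=24), timedelta(hours=24))
print(immediate, next_day)
```

Under these assumptions the two cases give 24 and 48 hours, which matches the range quoted for once-daily tape dumps.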
1.3.1 Disaster recovery
As mentioned, the practice of preparing for disaster recovery (DR) is something that has been a focus of IT planning for many years. In turn, there is a wide range of offerings and approaches available to accomplish DR. Several options rely on offsite or even outsourced locations that are contracted to provide data protection or even servers in the event of a true IT disaster. Other options rely on in-house IT infrastructures and technologies that can be managed by your own teams.
There is no one correct answer for which approach is better for every business, but the first step in deciding what makes the most sense for you is to have a good view of your IT resiliency objectives; specifically, your RPO and RTO.
Although Table 1-1 does not cover every possible DR offering and approach, it does provide a view of what RPO and RTO might typically be achieved with some common options.
Table 1-1 Typical achievable RPO and RTO for some common DR options
Description | Typically achievable Recovery Point Objective (RPO) | Typically achievable Recovery Time Objective (RTO)
No disaster recovery plan | N/A - all data lost | N/A
Tape vaulting | Measured in days since last stored backup | Days
Electronic vaulting | Hours | Hours (hot remote location) to days
Active replication to remote site (w/o recovery automation) | Seconds to minutes | Hours to days (dependent on availability of recovery hardware)
Active storage replication to remote “in-house” site | Zero to minutes (dependent on replication technology and automation policy) | 1 or more hours (dependent on automation)
Active software replication to remote ‘active’ site | Seconds to minutes | Seconds to minutes (dependent on automation)
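One way to use Table 1-1 is as a lookup: state the RPO and RTO that the business requires and see which options can typically meet both. The Python sketch below is a rough illustration under stated assumptions. The coarse scale (zero, seconds, minutes, hours, days), the choice to record only the pessimistic end of each range in the table, and all names in the code are ours. The row for no disaster recovery plan is omitted because its values are N/A.

```python
from enum import IntEnum
from typing import List, NamedTuple


class Scale(IntEnum):
    """Coarse order-of-magnitude scale (an assumption made for this sketch)."""
    ZERO = 0
    SECONDS = 1
    MINUTES = 2
    HOURS = 3
    DAYS = 4


class DrOption(NamedTuple):
    name: str
    worst_rpo: Scale  # pessimistic end of the RPO range in Table 1-1
    worst_rto: Scale  # pessimistic end of the RTO range in Table 1-1


OPTIONS = [
    DrOption("Tape vaulting", Scale.DAYS, Scale.DAYS),
    DrOption("Electronic vaulting", Scale.HOURS, Scale.DAYS),
    DrOption("Active replication to remote site (no recovery automation)",
             Scale.MINUTES, Scale.DAYS),
    DrOption("Active storage replication to remote in-house site",
             Scale.MINUTES, Scale.HOURS),
    DrOption("Active software replication to remote active site",
             Scale.MINUTES, Scale.MINUTES),
]


def candidates(required_rpo: Scale, required_rto: Scale) -> List[str]:
    """Return the options whose pessimistic RPO and RTO both meet the requirement."""
    return [o.name for o in OPTIONS
            if o.worst_rpo <= required_rpo and o.worst_rto <= required_rto]


# Example: the business tolerates minutes of data loss and hours of downtime.
print(candidates(Scale.MINUTES, Scale.HOURS))
```

Because only the pessimistic end of each range is recorded, the sketch is deliberately conservative: an option that can reach an RPO of zero only with particular replication technology and automation policy is not returned for a zero-RPO requirement.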
Generally a form of real-time software or hardware replication will be required to achieve an RPO of minutes or less, but the only technologies that can provide an RPO of zero (0) are synchronous replication technologies (see 2.3, “Synchronous versus asynchronous data transfer” on page 19) coupled with automation to ensure no data is written to one location and not the other.
The recovery time is largely dependent on the availability of hardware to support the recovery and control over that hardware. You might have real-time software or hardware-based replication in place, but without server capacity at the recovery site you will have hours to days before you can recover this once-current data.
Furthermore, even with all the spare capacity and current data, you might find that you are relying on people to perform the recovery actions. In this case, you will undoubtedly find that these same people are not necessarily available in the case of a true disaster or, even more likely, that the processes and procedures for the recovery are not practiced or accurate. This is where automation comes in to mitigate the risk introduced by the human element and to ensure you actually meet the RTO required of the business.
Finally, you might decide that one DR option is not appropriate for all aspects of the business. Various applications might tolerate a much greater loss of data and might not have an RPO as low as others. At the same time, some applications might not require recovery within hours whereas others most certainly do.
Although there is obvious flexibility in choosing different DR solutions for each application, the added complexity this can bring needs to be balanced carefully against the business benefit. The recommended approach, supported by GDPS, is to provide a single optimized solution for the enterprise. This generally leads to a simpler solution and, because less infrastructure and software might need to be duplicated, often a more cost-effective solution, too. Consider a different DR solution only for your most critical applications, where their requirements cannot be catered for with a single solution.
1.3.2 The next level
In addition to the ability to recover from a disaster, many businesses now look for a greater level of availability covering a wider range of events and scenarios. This larger requirement is called IT resilience. In this document, we concentrate on two aspects of IT resilience: disaster recovery, as discussed previously, and Continuous Availability (CA), which encompasses not only recovering from disasters, but keeping your applications up and running throughout the far more common planned and unplanned outages that do not constitute an actual disaster.
For some organizations, a proven disaster recovery capability that meets their RTO and RPO can be sufficient. Other organizations might need to go a step further and provide near-continuous application availability.
The market drivers behind the need for IT resilience are as follows:
High and constantly increasing client and market requirements for continuous availability of IT processes
Financial loss due to lost revenue, punitive penalties or fines, or legal actions that are a direct result of disruption to critical business services and functions
An increasing number of security-related incidents, causing severe business impact
Increasing regulatory requirements
Major potential business impact in areas such as market reputation and brand image from security or outage incidents
Few events impact a company like having an IT outage, even for a matter of minutes, and then finding the incident splashed across the newspapers and the evening news. Today, your clients, employees, and suppliers expect to be able to do business with you around the clock, and from all corners of the globe.
To help keep business operations running 24x365, you need a comprehensive business continuity plan that goes beyond disaster recovery. Maintaining high availability and continuous operations in normal day-to-day operations is also fundamental for success. Businesses need resiliency to help ensure:
Key business applications and data are protected and available.
If a disaster occurs, business operations continue with a minimal impact.
Regulations
In some countries, government regulations specify how organizations must handle data and business processes. An example is the Health Insurance Portability and Accountability Act (HIPAA) in the United States. This law defines how an entire industry, the US health care industry, must handle and account for patient-related data.
Other well-known examples include the US government-released “Interagency Paper on Sound Practices to Strengthen the Resilience of the U.S. Financial System”1 which loosely drove changes in the interpretation of IT resilience within the US financial industry, and the Basel II rules for the European banking sector, which stipulate that banks must have a resilient back-office infrastructure.
This is also an area that is accelerating as financial systems around the world become more interconnected. Although a set of recommendations published in Singapore (like the S 540-2008 Standard on Business Continuity Management)2 might directly address only businesses in a relatively small area, it is common for companies to do business in many countries around the world, where such recommendations might be requirements for ongoing business operations of any kind.
Business requirements
It is important to understand that the cost and complexity of a solution can increase as you get closer to true continuous availability, and that the value of a potential loss must be borne in mind when deciding which solution you need, and which one you can afford. You do not want to spend more money on a continuous availability solution than the financial loss you can incur as a result of an outage.
A solution must be identified that balances the costs of the solution with the financial impact of an outage. A number of studies have been done to identify the cost of an outage; however, most of them are several years old and do not accurately reflect the degree of dependence most modern businesses have on their IT systems.
Therefore, your company needs to calculate the impact in your specific case. If you have not already conducted such an exercise, you might be surprised at how difficult it is to arrive at an accurate number. For example, if you are a retailer and you suffer an outage in the middle of the night after all the batch work has completed, the financial impact is far less than if you had an outage of equal duration in the middle of your busiest shopping day. Nevertheless, to understand the value of the solution, you must go through this exercise, using assumptions that are fair and reasonable.
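The exercise described above can be sketched as a simple expected-loss calculation. The following Python example is illustrative only. Every figure in it (outage frequencies, durations, hourly losses, and the solution cost) is a made-up placeholder, and the class and function names are hypothetical. The point is the structure: separate scenarios, such as an overnight outage and an outage on the busiest shopping day, contribute very differently to the total.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class OutageScenario:
    label: str
    outages_per_year: float   # expected frequency (placeholder value)
    hours_per_outage: float   # expected duration (placeholder value)
    loss_per_hour: float      # financial impact per hour (placeholder value)

    def annual_loss(self) -> float:
        return self.outages_per_year * self.hours_per_outage * self.loss_per_hour


def expected_annual_loss(scenarios: Iterable[OutageScenario]) -> float:
    """Sum the expected yearly loss over all outage scenarios."""
    return sum(s.annual_loss() for s in scenarios)


def spend_is_justified(annual_solution_cost: float,
                       scenarios: Iterable[OutageScenario]) -> bool:
    """True if the solution costs no more than the loss it is meant to avoid."""
    return annual_solution_cost <= expected_annual_loss(scenarios)


# Hypothetical retailer: same outage duration, very different impact.
retailer = [
    OutageScenario("overnight, after batch work", 2.0, 4.0, 5_000.0),
    OutageScenario("busiest shopping day", 0.5, 4.0, 400_000.0),
]
print(expected_annual_loss(retailer))          # 840000.0
print(spend_is_justified(300_000.0, retailer))  # True
```

A real assessment would also need to account for penalties, reputational damage, and how much of the loss a given solution actually avoids, none of which this sketch models.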
1.3.3 Other considerations
In addition to the increasingly stringent availability requirements for traditional mainframe applications, there are other considerations, including those described in this section.
Increasing application complexity
The mixture of disparate platforms, operating systems, and communication protocols found within most organizations intensifies the already complex task of preserving and recovering business operations. Reliable processes are required for recovering not only the mainframe data, but also perhaps data accessed by multiple flavors of UNIX, Microsoft Windows, or even a proliferation of virtualized distributed servers.
It is becoming increasingly common to have business transactions that span, and update data on, multiple platforms and operating systems. If a disaster occurs, your processes must be designed to recover this data in a consistent manner.
Just as you would not consider recovering half an application’s IBM DB2® data to 8:00 a.m. and the other half to 5:00 p.m., the data touched by these distributed applications must be managed to ensure that all this data is recovered with consistency to a single point in time. The exponential growth in the amount of data generated by today’s business processes and IT servers compounds this challenge.
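To make the single point in time idea concrete, the following Python sketch picks a common recovery point for data that spans several platforms. It is a simplification under stated assumptions: each platform is assumed to be recoverable to any moment up to its own latest recoverable timestamp, so the latest point that all platforms can reach is the earliest of those timestamps. The platform names, timestamps, and function name are hypothetical.

```python
from datetime import datetime
from typing import Mapping


def common_recovery_point(latest_recoverable: Mapping[str, datetime]) -> datetime:
    """Return the latest point in time to which every platform can be recovered.

    Assumption: each platform can be rolled back to any time at or before its
    own latest recoverable timestamp, so the common point is the minimum.
    """
    if not latest_recoverable:
        raise ValueError("at least one platform is required")
    return min(latest_recoverable.values())


# Hypothetical state after a disaster: copies are current to different times.
state = {
    "mainframe DB2 data": datetime(2024, 1, 15, 17, 0),
    "UNIX application data": datetime(2024, 1, 15, 8, 0),
    "Windows file shares": datetime(2024, 1, 15, 12, 30),
}
point = common_recovery_point(state)
print(point.isoformat())
```

In this example all data would be recovered to 8:00 a.m., which shows why the platform with the oldest copy determines the effective RPO for a transaction that spans all of them.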
Increasing infrastructure complexity
Have you looked in your computer room recently? If you have, you probably found that your mainframe systems comprise only a small part of the equipment in that room. How confident are you that all those other platforms can be recovered? And if they can be recovered, will it be to the same point in time as your mainframe systems? And how long will that recovery take?
Figure 1-1 on page 7 shows a typical IT infrastructure. If you have a disaster and recover the mainframe systems, will you be able to recover your service without all the other components that sit between the user and those systems? It is important to remember why you want your applications to be available: so that users can access them.
Therefore, part of your IT resilience solution must include not only addressing the non-mainframe parts of your infrastructure, but also ensuring that recovery is integrated with the mainframe plan.
Figure 1-1 Typical IT infrastructure
Outage types
In the early days of computer data processing, planned outages were relatively easy to schedule. Most of the users of your systems were within your company, so the impact to system availability could be communicated to all users in advance of the outage. Examples of planned outages are software or hardware upgrades that require the system to be brought down. These outages can take minutes or even hours.
The majority of outages are planned, and even among unplanned outages, the majority are not disasters. However, in the current business world of 24x7 Internet presence and web-based services shared across and also between enterprises, even planned outages can be a serious disruption to your business.
Unplanned outages are unexpected events. Examples of unplanned outages are software or hardware failures. Although some of these outages can be recovered from quickly, others might be considered a disaster.
You will undoubtedly have both planned and unplanned outages while running your organization, and your business resiliency processes must cater to both types. You will likely find, however, that coordinated efforts to reduce the number and impact of unplanned outages are often complementary to doing the same for planned outages.
Later in this book we discuss the technologies available to you to make your organization more resilient to outages, and perhaps avoid them altogether.
1.4 Characteristics of an IT resilience solution
As the previous sections demonstrate, IT resilience encompasses much more than the ability to get your applications up and running after a disaster with “some” amount of data loss, and after “some” amount of time.
When investigating an IT resilience solution, keep the following points in mind:
Support for planned system outages
Does the proposed solution provide the ability to stop a system in an orderly manner? Does it provide the ability to move a system from the production site to the backup site in a planned manner? Does it support server clustering, data sharing, and workload balancing, so the planned outage can be masked from users?
Support for planned site outages
Does the proposed solution provide the ability to move the entire production environment (systems, software subsystems, applications, and data) from the production site to the recovery site? Does it provide the ability to move production systems back and forth between production and recovery sites with minimal or no manual intervention?
Support for data that spans more than one platform
Does the solution support data from more systems than just z/OS? Does it provide data consistency across all supported platforms, or only within the data from each platform?
Support for managing the data replication environment
Does the solution provide an easy-to-use interface for monitoring and managing the data replication environment? Will it automatically react to connectivity or other failures in the overall configuration?
Support for data consistency
Does the solution provide consistency across all replicated data? Does it provide support for protecting the consistency of the second copy if it is necessary to resynchronize the primary and secondary copy?
Support for continuous application availability
Does the solution support continuous application availability? From the failure of any component? From the failure of a complete site?
Support for hardware failures
Does the solution support recovery from a hardware failure? Is the recovery disruptive (reboot / re-IPL) or transparent (HyperSwap, for example)?
Support for monitoring the production environment
Does the solution provide monitoring of the production environment? Is the operator notified in case of a failure? Can recovery be automated?
Dynamic provisioning of resources
Does the solution have the ability to dynamically allocate resources and manage workloads? Will critical workloads continue to meet their service objectives, based on business priorities, in the event of a failure?
Support for recovery across database managers
Does the solution provide recovery with consistency independent of the database manager? Does it provide data consistency across multiple database managers?
End-to-end recovery support
Does the solution cover all aspects of recovery, from protecting the data through backups or remote copy, through to automatically bringing up the systems following a disaster?
Cloned applications
Do your critical applications support data sharing and workload balancing, enabling them to run concurrently in more than one site? If so, does the solution support and exploit this capability?
Support for recovery from regional disasters
What distances are supported by the solution? What is the impact on response times? Does the distance required for protection from regional disasters permit a continuous application availability capability?
You then need to compare your company’s requirements in each of these categories against your existing or proposed solution for providing IT resilience.
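The comparison described above amounts to a gap analysis between the characteristics you require and those a candidate solution provides. The following Python sketch shows one minimal way to record it. The characteristic labels are taken from the list in this section; the function name and the example selections are hypothetical.

```python
from typing import Iterable, List


def resilience_gaps(required: Iterable[str], provided: Iterable[str]) -> List[str]:
    """Return the required characteristics that the solution does not provide."""
    return sorted(set(required) - set(provided))


# Hypothetical requirements and a hypothetical candidate solution.
required = [
    "Support for planned site outages",
    "Support for data consistency",
    "Support for data that spans more than one platform",
    "Support for recovery from regional disasters",
]
provided = [
    "Support for planned site outages",
    "Support for data consistency",
    "Support for hardware failures",
]
for gap in resilience_gaps(required, provided):
    print("Not covered:", gap)
```

A simple yes or no per characteristic hides detail, such as whether a recovery is disruptive or transparent, so a list like this is a starting point for discussion rather than a final assessment.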
1.5 GDPS offerings
GDPS is actually a collection of several offerings, each addressing a different set of IT resiliency goals, that can be tailored to meet the RPO and RTO for your business. Each offering leverages a combination of server and storage hardware or software-based replication and automation and clustering software technologies, many of which are described in more detail in Chapter 2, “Infrastructure planning for availability and GDPS” on page 13.
In addition to the infrastructure that makes up a given GDPS solution, IBM also includes services, particularly for the first installation of GDPS and optionally for subsequent installations, to ensure the solution meets and fulfills your business objectives.
The following list provides brief descriptions of each offering, with a view of which IT resiliency objectives it is intended to address. Additional details are included in separate chapters in this book:
GDPS/PPRC
Near-CA or DR solution across two sites separated by metropolitan distances. The solution is based on the IBM PPRC synchronous disk mirroring technology.
GDPS/PPRC HyperSwap Manager
Near-CA solution for a single site or entry-level DR solution across two sites separated by metropolitan distances. The solution is based on the same technology as GDPS/PPRC, but does not include much of the systems automation capability that makes GDPS/PPRC a more complete DR solution.
GDPS/XRC
DR solution across two sites separated by virtually unlimited distance between sites. The solution is based on the IBM XRC asynchronous disk mirroring technology (also branded by IBM as z/OS Global Mirror).
GDPS/Global Mirror
DR solution across two sites separated by virtually unlimited distance between sites. The solution is based on the IBM System Storage® Global Mirror technology, which is a disk subsystems-based asynchronous form of remote copy.
GDPS Metro/Global Mirror
A three-site solution that provides CA across two sites within metropolitan distances and DR to a third site at virtually unlimited distances. It is based on a cascading mirroring technology that combines PPRC and Global Mirror.
GDPS Metro/z/OS Global Mirror
A three-site solution that provides CA across two sites within metropolitan distances and DR to a third site at virtually unlimited distances. It is based on a multitarget mirroring technology that combines PPRC and XRC (also known as z/OS Global Mirror on IBM storage subsystems).
GDPS/Active-Active
A multisite CA/DR solution at virtually unlimited distances. This solution is based on software-based asynchronous mirroring between two active production sysplexes running the same applications with the ability to process workloads in either site.
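The descriptions above can be summarized as a small catalog and filtered by two questions: is a third site required, and must the solution cover virtually unlimited distances? The Python sketch below is only a reading aid built from the descriptions in this list. The field names and function are hypothetical, and GDPS/Active-Active, which is described as multisite between two active sysplexes, is treated here as a two-site offering by assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Offering:
    name: str
    three_site: bool
    unlimited_distance: bool
    replication: str


OFFERINGS = (
    Offering("GDPS/PPRC", False, False, "PPRC synchronous disk mirroring"),
    Offering("GDPS/PPRC HyperSwap Manager", False, False,
             "PPRC synchronous disk mirroring"),
    Offering("GDPS/XRC", False, True, "XRC asynchronous disk mirroring"),
    Offering("GDPS/Global Mirror", False, True,
             "Global Mirror asynchronous remote copy"),
    Offering("GDPS Metro/Global Mirror", True, True,
             "PPRC combined with Global Mirror (cascading)"),
    Offering("GDPS Metro/z/OS Global Mirror", True, True,
             "PPRC combined with XRC (multitarget)"),
    # Assumption: counted as two-site because it mirrors between two sysplexes.
    Offering("GDPS/Active-Active", False, True,
             "software-based asynchronous mirroring"),
)


def shortlist(three_site: bool, unlimited_distance: bool) -> List[str]:
    """Return the offerings that match both criteria exactly."""
    return [o.name for o in OFFERINGS
            if o.three_site == three_site
            and o.unlimited_distance == unlimited_distance]


print(shortlist(three_site=False, unlimited_distance=True))
```

The filter ignores important distinctions, for example that GDPS/PPRC HyperSwap Manager omits much of the systems automation in GDPS/PPRC, so the shortlist only narrows down which chapters to read first.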
As mentioned briefly at the beginning of this section, each of these offerings provides the following benefits:
GDPS automation code
This code has been developed and enhanced over a number of years to exploit new hardware and software capabilities, to reflect best practices based on IBM’s experience with GDPS clients since its inception in 1998, and to address the constantly changing requirements of our clients.
Capabilities to exploit underlying hardware and software capabilities
IBM software and hardware products have support to surface problems that can affect the availability of those components, and to facilitate repair actions.
Services
There is perhaps only one factor in common across all GDPS implementations, namely that each has a unique requirement or attribute that makes it different from every other implementation. The services aspect of each offering provides you with invaluable access to experienced GDPS practitioners.
The amount of service included depends on the scope of the offering. For example, more function-rich offerings such as GDPS/PPRC include a larger services component than GDPS/PPRC HyperSwap Manager.
 
Note: Detailed information about each of the offerings is provided in the following chapters. Each chapter can be read in isolation; that is, it is not necessary to read all chapters if you are only interested in a specific offering. If you do read all the chapters, you might note that some information is repeated in multiple chapters.
1.6 Automation and disk replication compatibility
The GDPS automation code relies on the runtime capabilities of IBM Tivoli NetView® and IBM Tivoli System Automation (SA). Although these products provide tremendous first-level automation capabilities in and of themselves, there are alternative solutions you might already have from other vendors. GDPS continues to deliver features and functions that take advantage of properties unique to the Tivoli products (such as support for alert management through Tivoli IOM), but Tivoli NetView and Tivoli SA also work well alongside other first-level automation solutions. In other words, although there are definitely benefits to having a comprehensive solution from IBM, you do not have to replace your current automation investments before moving forward with a GDPS solution.
Additionally, most of the GDPS solutions rely on the IBM-developed disk replication technologies3 of PPRC for GDPS/PPRC, XRC for GDPS/XRC, and Global Mirror for GDPS/GM. These architectures are implemented on several IBM enterprise storage products. Specifically, PPRC has been implemented and branded as IBM System Storage Metro Mirror for the IBM Enterprise Storage Server (ESS) and the IBM DS8000 family of products. Similarly, the XRC technology has been implemented on the same storage servers under the brand name of IBM System Storage z/OS Global Mirror.
The external interfaces for all of these disk replication technologies (PPRC, XRC, GM, and FlashCopy) have also been licensed by many major enterprise storage vendors. This allows clients the flexibility to select the disk subsystems that best match their requirements and to mix and match disk subsystems from different storage vendors within the context of a single GDPS solution. Indeed, although most GDPS installations do rely on IBM storage products, there are several production installations of GDPS around the world that rely on non-IBM storage vendor products.
Finally, IBM has a GDPS Qualification Program4 for other enterprise storage vendors to validate that their implementation of the advanced copy services architecture meets the GDPS requirements.
The GDPS Qualification Program offers the following arrangement to vendors:
IBM provides the system environment.
Vendors install their disk in this environment.
Testing is conducted jointly.
A qualification report is produced jointly, describing details of what was tested and the test results.
Recognize that this qualification program does not imply that IBM provides defect or troubleshooting support for a qualified vendor’s products. It does, however, indicate at least a point-in-time validation that the products are functionally compatible and demonstrates that they work in a GDPS solution.
Check directly with non-IBM storage vendors if you are considering using their products with a GDPS solution because they can share their own approaches and capability to support the specific GDPS offering you are interested in.
1.7 Summary
At this point we have discussed why it is important to have an IT resilience solution, and have provided information about key objectives to consider when developing your own solution. We have also introduced the GDPS family of offerings with a brief description of which objectives of IT resiliency each offering is intended to address.
In Chapter 2, “Infrastructure planning for availability and GDPS” on page 13, we introduce key infrastructure technologies related to IT resilience focused on the mainframe platform. After that, we describe how the various GDPS offerings exploit those technologies. Finally, we position the various GDPS offerings against typical business scenarios and requirements.
It is our intent to update this document as new GDPS capabilities are delivered.

3 Disk replication technology is independent of the GDPS/Active-Active solution, which exploits software replication.
4 http://www.ibm.com/systems/z/gdps/qualification.html