High Availability and Disaster Recovery environment for Oracle
This chapter is an introduction to planning a highly available and disaster recovery (HADR) environment for Oracle Databases that are running on Linux on System z in a virtualized environment. IBM System z hardware is designed for continuous availability and offers a set of reliability, availability, and serviceability (RAS) features.
Oracle Database is one of the leading technologies with built-in High Availability options. The combination of IBM System z and Oracle Database provides a system that is comprehensive, reliable, and capable of deploying highly available environments that offer varying levels of data, application, and infrastructure resilience.
Many tiers of an HADR solution are possible. Oracle recommends the Maximum Availability Architecture (MAA) as its best-practices blueprint for an HA environment. The right HADR configuration is a balance between recovery time and recovery point requirements and cost.
Based on our experiences in implementing Oracle on Linux on System z, we provide a road map in this chapter to plan an HADR environment for Oracle databases.
A highly available environment is a combination of technology, coordination across multiple teams, change control, skills, enterprise culture, and operational discipline. This chapter is an introduction to the various technology options that are available to users (in-depth information that is necessary to implement the architectures is not included here). We encourage the reader to review Oracle MAA white papers that are available on the Oracle Technology Network website for more in-depth descriptions about implementing the right solutions for complex environments.
This chapter includes the following topics:
9.1, “High Availability”
9.2, “Oracle technologies for High Availability”
9.3, “High Availability with z/VM”
9.4, “Disaster Recovery solutions”
9.5, “Summary”
9.1 High Availability
A highly available system is designed to eliminate or minimize the loss of service that is caused by planned or unplanned outages. High Availability does not equate to continuous availability (that is, a system with nonstop service). In general, High Availability describes the accessibility and uptime of critical business application environments as experienced by their users. Users are frustrated when their application is unavailable; they do not care about the complexity of the application or the reason it is down. Availability is always measured by the perception of an application’s user.
High Availability is a key component of business resiliency. Although hardware such as IBM System z is highly reliable, unplanned outages that are caused by operator errors, software problems, application performance issues, and other non-hardware factors can still make systems unavailable. The term five nines of availability specifies 99.999% uptime, but in a particular user environment it might not be necessary to keep every application available at that level. Some critical applications must always be up and running, while other applications have less stringent availability requirements.
High Availability solutions always involve redundancy, and the basic rule of High Availability is to identify and remove single points of failure in the architecture.
A user environment can have multiple layers, such as cloud, user, application, firewall, security, facility, storage, and database layers. These layers can span one or multiple data centers, and they can run on one or multiple servers under multiple operating systems.
In this section, we describe achieving High Availability for Oracle databases in a Linux on System z environment only. Figure 9-1 on page 182 shows an architecture in which components in an Oracle database environment are running on a Linux on System z environment.
 
Figure 9-1 Components in Oracle environment on Linux on System z
Data Center
The Data Center is the top layer, where all of the components that are needed for Oracle databases on Linux on System z are running. It encompasses all of the servers, storage, network, facilities, human resources, software, and other components that are required to run the databases.
Servers
The System z server is the hardware that provides the computing power (CPU, memory, I/O connections, and power supply).
Logical partitions
System z servers are typically divided into multiple logical partitions (LPARs) to share the System z hardware resources.
z/VM
z/VM is a hypervisor operating system that is running in an LPAR.
Linux guests
A z/VM hypervisor that is running in an LPAR can host one or more Linux operating systems in entities that are known as virtual machines. We refer to a virtual machine that is running an operating system as a guest. The Linux operating system can also run natively in an LPAR.
Oracle instances
In a Linux guest, one or more Oracle instances are running.
Disk Storage
z/VM, Linux, and Oracle need non-volatile storage for their operations; in particular, Oracle instances keep their database files on that storage.
In the environment that is shown in Figure 9-1 on page 182, a failure in any one of the components can cause unavailability. Planned or unplanned downtime is costly. In the following sections, we describe the general causes of planned and unplanned outages in this environment.
9.1.1 Planned downtime events
In general, the planned downtime in an Oracle environment can be started by any of the following events:
Software maintenance and upgrades
Hardware refreshes
Data center relocations
Building maintenance
It might be surprising to note that the largest share of database unavailability is caused by planned maintenance activities.
In many situations, planned downtime activities can be coordinated with users in advance. With an understanding of the business requirements and proper planning, Service Level Agreements (SLAs) can be met and the effect on the user community can be minimized.
9.1.2 Unplanned downtime triggers
In any environment, unplanned downtime can result in prolonged application failures, unsatisfied users, and revenue loss. Any of the reasons that are described in this section might trigger downtime in an Oracle environment and can jeopardize the SLA.
Data Center
The Data Center where the systems are deployed might not be available because of any of the following factors:
Natural disasters
Sabotage
Power failures
Network or firewall failures
Hardware components
The hardware components that are hosting the applications might not be available for any of the following reasons:
Hardware failures (CPU, memory, power supply, cables, storage, or switches)
Bottleneck of resources (CPU, memory, storage, or network)
System software components
Any of the following system software components can fail:
Hypervisor (z/VM)
Operating systems (Linux)
Monitoring software
Application components
The applications, load balancers, web servers, and other associated software components might fail because of the following factors:
Application logic errors
Application overload
Oracle database
Oracle databases might not be available for any of the following reasons:
Instance failures
Components (such as listeners) are not available
Data file corruption or deletion
Logical data corruption
Security violations
Administration issues (for example, files that cannot be extended)
Performance issues
Scalability issues (additional users)
Requirements to apply critical patches
9.1.3 Defining the common requirements for High Availability
Any of the events that were described might result in unavailability. Any HADR solution should ensure the following types of resiliency:
Infrastructure resiliency
Normally, infrastructure resiliency is obtained by reliable, redundant, and clustered components.
Data resiliency
Data is the critical component of any system, and the resiliency of the data availability is a minimum requirement for any HADR solution. Storage mirroring and database technologies drive the resiliency of the data.
Application resiliency
In an ideal HADR design, when the system is recovered for operations, users must be able to continue from where they left off after the recovery is complete. This seamless application resiliency is also a basic requirement.
In this chapter, we focus mostly on data resiliency and, to a lesser extent, on infrastructure resiliency. Application resiliency is beyond the scope of this document.
The effort that is needed to implement a highly available system depends on the following measurements:
Recovery Time Objective (RTO)
RTO is the maximum period for which a disruption to an application or a process can be tolerated before the business or financial effect becomes unacceptable. RTO includes issue identification, response, and issue resolution time.
Recovery Point Objective (RPO)
RPO is the maximum amount of data loss, measured in time, that can be tolerated. For some applications, such as a stock trading application, the requirement can be zero data loss. In other cases, a couple of hours’ worth of data loss can be acceptable.
SLA
The SLA is the downtime for an application that is agreed upon with the user. It might span from system availability to functional availability in the application. Typically, the query response time during online processing or the report creation time during batch processing dictates this requirement.
In business environments, users also divide their applications into multiple tiers. Typically, Tier 1 applications can have the strictest RTO, RPO, and SLA requirements. Tier 2 and Tier 3 have less stringent requirements. The HADR solutions for Tier 1 applications normally are costlier to implement.
 
Note: The right High Availability configuration is a balance between the recovery requirements and cost.
9.2 Oracle technologies for High Availability
In an Oracle environment, the indisputable requirement is data resiliency. The data should always be available, and if a failure occurs, the data loss should be zero or minimal. In this section, we introduce the various technologies that are available from Oracle to build a highly available environment for Oracle databases.
9.2.1 Backup and recovery
The foundation for any robust Oracle highly available environment is having a solid, reliable backup and recovery process. In some situations, as in the following examples, database backups might be the only recovery options:
For a block corruption, media failure, or other physical data failures where there is no Data Guard in the environment, the only solution is to restore from existing backups.
Failures at primary and secondary sites of a Data Guard solution.
Setting up the initial Data Guard environment.
Backups can be performed at the logical and the physical level. An effective backup strategy must be based on physical backups, which allow the database to be restored consistently. Logical backups, such as exports of database objects, are a useful supplement to physical backups but cannot protect the whole database.
The backups can be consistent or inconsistent. During the consistent backup, the database is shut down cleanly and remains closed during the backup. All committed changes in the online redo log files are written to the data files during the shut down process, so the data files are in a transaction-consistent state. The database can be opened immediately after the data files are restored from the consistent backup.
It is not always possible, however, to shut down the database for backups. In that case, by enabling ARCHIVELOG mode for the database, backups can be taken while the database is open; these are called inconsistent backups. In this case, the online redo log files contain changes that are not yet applied to the data files. The online redo log files must be archived and then backed up with the data files to ensure recoverability. If the archived redo log files and data files are restored, media recovery must be performed before the database can be opened.
For Oracle Database backups, Oracle’s Recovery Manager (RMAN) utility is the ideal choice for most users. Many third-party vendors, such as IBM Tivoli®, integrate with RMAN to offer value-added services for backup and recovery.
Using RMAN for backup includes the following advantages:
RMAN automatically determines what files are to be backed up and what files must be used for media-recovery operations.
Online database backups are done without placing tablespaces in backup mode.
Block-level incremental backups and data block integrity checks are done during backup and restore operations.
Automated tablespace point-in-time recovery and block media recovery is available.
Best practices for backup and recovery
The following best practices are available for backup and recovery tasks:
Oracle recommends a Fast Recovery Area to simplify the management of backup and recovery files. This area is an Oracle managed directory, file system, or Oracle Automatic Storage Management (ASM) disk group that provides a centralized disk location for backup and recovery files. Oracle also creates archived logs and flashback logs in the fast recovery area. RMAN can store its backup sets and image copies in the fast recovery area, and RMAN uses it when restoring files during media recovery. The fast recovery area also acts as a disk cache for tape. It is important that the fast recovery area is sized according to the database’s file sizes, transaction rates, and retention policy requirements.
Establish block change tracking, incremental backup processes, and the backup frequency, which should be based on the following factors:
 – The RPO, which depends on the data criticality
 – The RTO, which is dictated by data repair time
 – Database transaction rates for data changes
Establish backup retention policy:
 – A redundancy-based policy, a recovery window-based policy, or both should be established on the basis of data criticality.
 – Legal requirements might also determine the backup file retention policy.
 – The use of an RMAN recovery catalog is a good practice, and it is preferable to keep the catalog schema in a dedicated stand-alone database.
 – RMAN should be configured to automatically back up the control file and the server parameter file (SPFILE), as shown in the sketch that follows this list.
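The following sketch shows how these practices translate into commands. It is an illustration only: it assumes a database that already runs in ARCHIVELOG mode, an ASM disk group named +FRA for the fast recovery area, DB_CREATE_FILE_DEST set so that the change tracking file is Oracle managed, and a 14-day recovery window. Adjust the names and sizes for your environment.
-- SQL*Plus: size and point the fast recovery area, then enable block change tracking
ALTER SYSTEM SET db_recovery_file_dest_size = 200G SCOPE=BOTH;
ALTER SYSTEM SET db_recovery_file_dest = '+FRA' SCOPE=BOTH;
ALTER DATABASE ENABLE BLOCK CHANGE TRACKING;
# RMAN: retention policy, control file and SPFILE autobackup, level 0 incremental backup
CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 14 DAYS;
CONFIGURE CONTROLFILE AUTOBACKUP ON;
BACKUP INCREMENTAL LEVEL 0 DATABASE PLUS ARCHIVELOG;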
HADR applicability
A solid backup and recovery process is the foundation of, and part of, any HADR configuration.
9.2.2 Oracle Flashback Technology
Backup and recovery processes protect against failures such as the physical deletion of database files, media corruption, and the dropping of logical database objects. In any database environment, it is impossible to avoid human errors, such as an authorized user running an incorrect query that deletes rows in tables or corrupts data. Although it is possible to recover the data from the backup files, it can take hours to rebuild the database. Oracle provides Flashback technology to reverse human errors by selectively undoing the effects of a mistake. Flashback supports recovery at the row, transaction, table, and entire database levels.
Facilities provided by Oracle Flashback Technology
Oracle Database 11g Release 2 includes enhancements that enable Flashback Database while the database is open. By using flashback logs, undo data, and the recycle bin, Oracle Flashback Technology can reverse the following types of actions:
Flashback database:
 – Entire database to a specific point-in-time can be restored.
 – Flashback Database is fast compared to the traditional backup and recovery process because it restores only the blocks that changed.
 – The database can be rewound based on a System Change Number (SCN), a time stamp, or restore points.
Flashback drop:
By using the Flashback Drop feature, dropped tables can be recovered. The dropped table, and all of its indexes, constraints, and triggers, are recovered from the Recycle Bin.
Flashback table:
A logically corrupted table can be restored to a specific point in time. The corrupted table can be rewound, undoing any updates that were made to the table between the specified time and the current time.
Flashback Query
By using Oracle Flashback Query, users can query data as it existed at some point in the past. This feature can be used to view and logically reconstruct corrupted data that was deleted or changed inadvertently.
This facility allows identification and resolution of logical data corruption.
Flashback Versions Query
By using Oracle Flashback Versions Query, users can retrieve different versions of a row across a specified time interval instead of a single point. Users also can pinpoint exactly when and how data changed, which enables data repair and application debugging to identify and resolve logical data corruption.
Flashback Transaction Query
A flawed transaction might result in logical data corruption across the tables. Flashback Transaction Query shows the changes that are made by a transaction and also produces the SQL statements necessary to flashback or undo the transaction.
Flashback Transaction
A flawed transaction can result in logical data corruption across the tables. With Flashback Transaction, a single transaction (and optionally, all of its dependent transactions), can be flashed back.
Flashback Transaction relies on undo data and archived redo logs to back out the changes. The sketch that follows illustrates several of these Flashback facilities.
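The following SQL sketch illustrates several of these facilities against a hypothetical hr.employees table. It assumes that sufficient undo retention is configured, that the recycle bin is enabled, and (for the last statement) that Flashback Database is enabled and the database is mounted; the SCN value is a placeholder.
-- Flashback Query: view the data as it was 30 minutes ago
SELECT * FROM hr.employees AS OF TIMESTAMP (SYSTIMESTAMP - INTERVAL '30' MINUTE);
-- Flashback Table: rewind the table itself (row movement must be enabled first)
ALTER TABLE hr.employees ENABLE ROW MOVEMENT;
FLASHBACK TABLE hr.employees TO TIMESTAMP (SYSTIMESTAMP - INTERVAL '30' MINUTE);
-- Flashback Drop: restore a dropped table from the recycle bin
FLASHBACK TABLE hr.employees TO BEFORE DROP;
-- Flashback Database: rewind the entire database to a System Change Number
FLASHBACK DATABASE TO SCN 1234567;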
HADR applicability
HADR applicability includes data corruption that is caused by human errors.
9.2.3 Oracle Automatic Storage Management
Oracle Automatic Storage Management (ASM) is an integrated database file system and disk manager. Oracle ASM groups the disks in the storage system into one or more disk groups and automates the placement of the database files within those disk groups. It reduces the complexity of managing thousands of files in a large environment. Oracle ASM is part of the Oracle Grid Infrastructure (GI) and is installed when the Oracle Grid Infrastructure is installed. Oracle Clusterware and ASM are installed into the same Oracle home.
ASM features
ASM includes the following features:
An ASM disk group is a collection of disks that is managed as a unit. A disk group can have as many as 10,000 disks, and each disk can have a maximum size of 2 TB.
Each disk group is self-contained and has its own ASM metadata. An ASM instance manages that ASM metadata.
In Oracle 11.2, three disk groups are typically specified: one for data; one for the fast recovery area and archived logs; and another for the SPFILE, voting files, and Oracle Cluster Registry (OCR).
In large enterprises, the data disks can be grouped based on storage tiers. The best practice is to use disks of a similar performance level and similar size within a group. The disk size itself is not an influential factor, and a minimum of four disks is recommended per group.
ASM looks for disks in the operating system location that is specified by the ASM_DISKSTRING initialization parameter.
For 11gR2, the SCAN listener runs from the GI home and the database listener runs from the database home.
Oracle recommends using RMAN to back up and transport database files that are stored in ASM. A disk group creation sketch follows this list.
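The following sketch shows how a disk group might be created from the ASM instance. The device paths, disk group name, and disk count are examples only.
-- Tell ASM where to search for candidate disks
ALTER SYSTEM SET asm_diskstring = '/dev/mapper/asm*' SCOPE=SPFILE;
-- Create a normal-redundancy (two-way mirrored) disk group from four disks
CREATE DISKGROUP data NORMAL REDUNDANCY
  DISK '/dev/mapper/asm01', '/dev/mapper/asm02',
       '/dev/mapper/asm03', '/dev/mapper/asm04';
-- Disks can be added later; ASM rebalances the data automatically
ALTER DISKGROUP data ADD DISK '/dev/mapper/asm05';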
ASM benefits
ASM features the following benefits:
ASM spreads data evenly across all disks in a disk group. This software-controlled striping evenly distributes the database files to eliminate the hot spots.
Optionally, ASM supports two-way mirroring, in which each file extent receives one mirrored copy, and three-way mirroring, in which each file extent receives two mirrored copies. ASM mirrors at the file level, and the mirrored copy is kept on a disk other than the disk that holds the original copy. This configuration improves availability.
The ability to dynamically add and remove disks in ASM improves storage availability.
ASM can now store the voting files and OCR for Oracle clusters.
ASM reduces administrative tasks by enabling files that are stored in Oracle ASM disk groups to be Oracle Managed Files. It reduces the complexity of managing thousands of files in a large environment.
HADR applicability
HADR applicability includes the following factors:
Data corruption
Storage failures
9.2.4 Oracle Grid Infrastructure cluster technology
Oracle Grid Infrastructure technology allows clustering of independent servers so that they cooperate as a single system. If a clustered server fails, any managed application can be restarted on the surviving servers. Oracle Grid Infrastructure software integrates Oracle Clusterware and Oracle ASM and provides the infrastructure that is necessary for a High Availability framework. The managed applications can be products such as Siebel, GoldenGate, or WebSphere, or even Oracle databases.
Oracle Grid Infrastructure features
Oracle Grid Infrastructure includes the following features:
Oracle Clusterware provides cluster management capabilities, such as node membership, group services, global resource management, and High Availability functions.
The applications are protected in an active/passive environment.
For High Availability, applications can be placed under the protection of Oracle Clusterware so that they can be restarted on the primary node when the built-in agent detects an application failure.
By using built-in agents, if the primary node fails, Oracle Clusterware can restart the application on another active node in the cluster.
The monitoring frequency, the starting and stopping of the applications, and the application dependencies are all configurable, as shown in the sketch after this list.
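The following sketch registers a generic application with Oracle Clusterware as a cluster resource. The resource name, node names, attribute values, and action script path are examples; the action script must implement start, stop, check, and clean entry points.
# Register the application as a Clusterware-managed resource
crsctl add resource myapp -type cluster_resource -attr "ACTION_SCRIPT=/u01/app/ha/myapp.scr,PLACEMENT=favored,HOSTING_MEMBERS=node1 node2,CHECK_INTERVAL=30,RESTART_ATTEMPTS=2"
# Start the resource and verify its state
crsctl start resource myapp
crsctl status resource myapp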
HADR applicability
HADR applicability includes the following features:
Protection from computer hardware failures
Protection from OS (Linux/z/VM) failures
Protection from Oracle instance failures
Protection from storage failures (if ASM is used)
Active/passive configuration; recovery is not instantaneous
9.2.5 Oracle RAC One Node technology
Oracle Real Application Clusters One Node (Oracle RAC One Node) technology is a new option in 11.2 that provides a failover solution for Oracle databases. Oracle RAC One Node is a single instance of an Oracle RAC database that runs on one node in a cluster. It uses Omotion technology to relocate the instance without any downtime and without manual intervention. During the short period when the instance is moved from one node to another, both instances are active. After all of the connections are migrated, the first instance shuts down. If the active instance fails suddenly, Oracle RAC One Node detects the failure and restarts the failed database or fails it over to another server.
Oracle RAC One Node features
Oracle RAC One Node includes the following features:
A running Oracle instance can be migrated from one server to another without disruption of service.
Better availability than a Clusterware active/passive solution.
Online patching and upgrading of operating system and database software without downtime is possible.
Databases can be consolidated into a single cluster for efficient administration. If a server fails, they can be quickly relocated.
Ready to scale and upgrade to a multi-node Oracle RAC configuration for scalability. A relocation sketch follows this list.
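The following srvctl sketch shows an online relocation of a RAC One Node database. The database name, target node, and timeout are examples; the -w value is the relocation timeout in minutes.
# Check the current node and online relocation status
srvctl status database -d orcl
# Relocate the running instance to node2, allowing up to 30 minutes
# for existing connections to migrate before the original instance stops
srvctl relocate database -d orcl -n node2 -w 30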
HADR applicability
HADR applicability includes the following features:
Protection from computer hardware failures
Protection from OS (Linux/z/VM) failures
Protection from Oracle instance failures
Protection from storage failures (when ASM is used)
For planned outages, it is possible to have continuous availability of Oracle instances
9.2.6 Oracle RAC technology
System z provides a reliable architecture and avoids the server as a single point of failure. However, it is possible for the Oracle instances, operating systems such as Linux, and hypervisors such as z/VM to fail. These components can introduce single points of failure unless they are clustered.
In a typical Oracle environment, an Oracle instance that is running on a Linux guest under the z/VM hypervisor accesses a single database. If the instance stops or the Linux guest where the Oracle instance is running fails, access to the data is impossible. Oracle RAC technology allows multiple Oracle instances that are running across multiple nodes to access the same database and provides a single logical instance view. A cluster in this case can be defined as a pool of independent servers that act as a single system.
Oracle Clusterware technology allows clustering of independent servers so that they cooperate as a single system. Oracle Grid Infrastructure software integrates Oracle Clusterware and Oracle ASM and provides the infrastructure that is necessary for a High Availability framework. With Oracle RAC, all nodes are active, which enables the continuous availability of the Oracle instances.
Oracle RAC features
Oracle RAC includes the following features:
Ability to tolerate and quickly recover from computer and instance failures.
Rolling upgrades for system and hardware changes.
Rolling patch upgrades for some interim patches, security patches, CPUs, and Cluster software.
Scalability by adding more instances (servers); see the sketch after this list.
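As a sketch of the scalability point, after a new node joins the cluster and the Oracle software is provisioned on it (and a redo thread and undo tablespace exist for the new instance), another instance can be registered and started with srvctl. The names are examples.
# Register a third instance of database orcl on the new node
srvctl add instance -d orcl -i orcl3 -n node3
# Start the new instance and confirm the cluster database status
srvctl start instance -d orcl -i orcl3
srvctl status database -d orcl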
Oracle Extended RAC features
Oracle Extended RAC is an architecture in which the nodes in the cluster are separated into different data centers. It provides fast recovery from a site failure and allows all nodes at all sites to actively process transactions as part of a single database cluster, which provides the highest level of availability for server and site failures. It includes the following challenges:
Redundant connections and sufficient bandwidth for public traffic, interconnect, and I/O.
High interconnect and network latency can throttle database performance and response time.
A 10 km distance between nodes might require dark fiber and, therefore, is costly.
Unlike Oracle Data Guard, RAC uses a single database (there is no secondary database), so data corruptions, lost writes, or database-wide failures remain possible.
Storage complexity.
HADR applicability
HADR applicability includes the following features:
Protection from computer hardware failures.
Protection from OS (Linux/z/VM) failures.
Protection from Oracle instance failures.
Protection from storage failures (when ASM is used).
Active/active configuration and hence continuous availability.
Fast Application Notification (FAN) with integrated Oracle client failover.
Server-side callouts to log trouble tickets or page administrators to alert them of a failure.
Complex solution.
 
Note: A full description of RAC architecture is beyond the scope of this document. For more information, see “Related publications” on page 391.
9.2.7 Oracle Application Failover technology
When a planned or unplanned database outage occurs, applications can encounter errors or hangs. Oracle’s High Availability features address these situations by providing APIs that speed up the error response and, in some cases, mask the error from the users. The database and the application tiers should be configured for fast application failover.
At a high level, automating client failover in an Oracle RAC configuration includes the following steps:
1. Relocate the database services to new or surviving instances.
2. Notify the clients that a failure occurred.
3. Redirect the clients to the relocated or a surviving instance.
FAN
FAN emits events when database conditions change, such as when a service, instance, or site goes up or down. The events are propagated by the Oracle Notification Service (ONS) or Streams Advanced Queuing (AQ). Compared to waiting for a TCP/IP timeout, FAN provides fast detection of the condition change and fast notification.
The FAN events can be used by the applications or users that connect to a new primary database upon failover by using Fast Connection Failover (FCF).
FCF
FCF is an Oracle High Availability feature for Java Database Connectivity (JDBC) applications and supports the JDBC Thin and JDBC Oracle Call Interface (OCI) drivers. FCF works with the JDBC connection caching mechanism, FAN, and Oracle RAC.
FCF provides the following High Availability features for client connections in planned and unplanned outages:
Rapid detection of database service, instance, or node failures; invalid connections are then stopped and removed from the pool
Recognition of new nodes that join an Oracle RAC cluster
Load balancing of connection requests across all active Oracle RAC instances (a service configuration sketch follows this list)
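FAN and FCF are driven by database services. The following hedged sketch defines a service on an Oracle RAC database with AQ HA notifications enabled and server-side failover defaults; all names and values are examples.
# Create a service with two preferred instances, AQ HA notifications (-q),
# and failover defaults (SELECT failover, BASIC method, retries and delay)
srvctl add service -d orcl -s oltp_srv -r orcl1,orcl2 -q TRUE -e SELECT -m BASIC -z 10 -w 5
# Start the service and check where it runs
srvctl start service -d orcl -s oltp_srv
srvctl status service -d orcl -s oltp_srv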
Transparent Application Failover
Transparent Application Failover (TAF) is an OCI feature that provides the client recovery capabilities if connections fail. TAF can be used with or without FAN conditions:
SELECT failover
If the connection is lost, Oracle Net establishes a connection to another node and reruns the SELECT statement with the cursor positioned on the row on which it was positioned before the failover. This approach is best for data warehouse systems, where the transactions are big and complex.
SESSION failover
If a user’s connection is lost, SESSION failover establishes a new session that is automatically created for the user on the backup node. This type of failover does not attempt to recover selects. This failover is ideal for Online Transaction Processing (OLTP) systems where transactions are small.
Graceful session migration for planned downtime is also provided; a sample TAF connect descriptor follows.
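Client-side TAF is typically configured in the connect descriptor. The following tnsnames.ora sketch uses example host and service names and enables SELECT failover with the BASIC method:
# tnsnames.ora entry (example names); clients that connect through this alias
# reconnect automatically and resume open SELECT statements after a failover
ORCL_TAF =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = rac-scan.example.com)(PORT = 1521))
    (CONNECT_DATA =
      (SERVICE_NAME = oltp_srv)
      (FAILOVER_MODE = (TYPE = SELECT)(METHOD = BASIC)(RETRIES = 20)(DELAY = 5))
    )
  )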
HADR applicability
A server failure, Linux crash, or other fault can cause an individual Oracle instance in an Oracle RAC database to fail. To maintain availability, application clients that are connected to the failed instance are quickly notified of the failure and immediately establish a new connection to a surviving instance of the Oracle RAC database.
9.2.8 Oracle Data Guard technology
Oracle Data Guard configuration consists of one primary database and one or more (up to 30) standby databases. Oracle Data Guard maintains standby databases as transactionally consistent copies of the primary database. If the primary database becomes unavailable, Oracle Data Guard can switch any standby database to the primary role, which minimizes the downtime that is associated with the outage.
The following standby databases are available:
Physical standby database
Oracle Active Data Guard
Transient logical standby database
Snapshot standby database
Logical standby database
An Oracle Data Guard configuration can include any combination of these types of standby databases.
Physical standby database
Physical standby databases include the following features:
A physically identical copy of the primary database with identical schemas, indexes, and data files.
Is kept synchronized with the primary database through Redo Apply, which recovers the redo data that is received from the primary database and applies the redo data to the physical standby database. This ensures a physical, block-for-block copy of the primary database.
Physical standby database can be opened for read-only access while redo data is applied (Oracle Active Data Guard option or real-time query mode).
A physical standby database can be used for full and incremental backups, report creation, and the creation of clone databases.
Oracle Active Data Guard database
Oracle Active Data Guard databases include the following features:
Is a superset of Data Guard and allows a physical standby database to be open read-only while changes are applied to it from the primary database.
Enables productive use of physical standby databases
Automatically repairs block corruptions that are detected at the primary database.
Transient Logical Standby databases
A current physical standby database can be temporarily converted to a logical standby database and used for rolling database upgrades, as recommended by Oracle MAA best practices.
Snapshot standby databases
Snapshot Standby databases include the following features:
A physical standby database can be temporarily converted into a snapshot standby database that can be updated.
Snapshot standby databases can be used as clones or test databases to validate new functionality and new releases. When finished, it can be converted back into a physical standby.
While running in the snapshot standby role, the database continues to receive and queue redo data so that data protection and the RPO are maintained.
When it is converted back to physical standby database, the changes that are made to the snapshot standby state are discarded. Redo Apply automatically resynchronizes the physical standby database with the primary database by using the redo data that was archived.
Logical standby databases
Logical standby databases include the following features:
A logical standby database contains the same logical information as the primary database, although the physical organization and structure of the data can be different.
The logical standby database is kept synchronized with the primary database through SQL Apply, which transforms the redo data that is received from the primary database into SQL statements and then runs the SQL statements on the standby database.
HADR applicability
HADR applicability includes the following features:
Data Guard technology addresses High Availability and Disaster Recovery requirements.
Data Guard technology complements Oracle RAC.
Provides one or more synchronized standby databases and protects data from failures, disasters, errors, and corruptions. A minimal broker-based configuration sketch follows.
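As an illustration of how a configuration is assembled, the following Data Guard broker sketch registers an existing physical standby and performs a switchover. It assumes that the standby database was already created, that redo transport is configured, and that DG_BROKER_START is TRUE on both databases; prim and stby are example DB_UNIQUE_NAME values that are also used as connect identifiers.
DGMGRL> CREATE CONFIGURATION 'haconf' AS PRIMARY DATABASE IS 'prim' CONNECT IDENTIFIER IS prim;
DGMGRL> ADD DATABASE 'stby' AS CONNECT IDENTIFIER IS stby MAINTAINED AS PHYSICAL;
DGMGRL> ENABLE CONFIGURATION;
DGMGRL> SHOW CONFIGURATION;
DGMGRL> SWITCHOVER TO 'stby';
SHOW CONFIGURATION reports the transport and apply state of each database, and SWITCHOVER performs a planned role change with no data loss.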
9.2.9 Oracle GoldenGate
Oracle GoldenGate is an asynchronous, log-based, real-time data replication technology that includes the following features:
Moves data across heterogeneous database, hardware, and operating system environments.
Supports multi-master replication, hub-and-spoke deployment, and data transformation.
Supports replication that involves a heterogeneous mix of Oracle databases and non-Oracle databases.
Can be deployed for data distribution and data integration. A minimal capture-side sketch follows this list.
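The following capture-side sketch is hypothetical: the process name, trail location, schema, and credentials are placeholders, and a corresponding data pump and Replicat are needed on the target system.
-- GGSCI commands: register a capture (Extract) process and its local trail
ADD EXTRACT ext1, TRANLOG, BEGIN NOW
ADD EXTTRAIL ./dirdat/aa, EXTRACT ext1
-- Parameter file dirprm/ext1.prm: capture all changes to the hr schema
EXTRACT ext1
USERID ogguser, PASSWORD ********
EXTTRAIL ./dirdat/aa
TABLE hr.*;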
HADR applicability
HADR applicability includes the following features:
Maintains transactional integrity; it is resilient against interruptions and failures.
Heterogeneous replication, transformations, subsetting, and multiple topologies.
All sites fully active (read/write).
9.3 High Availability with z/VM
In a typical Oracle on Linux on System z environment, Linux can be implemented to run on a single LPAR or multiple Linux guests can be hosted in an LPAR that is running z/VM. Oracle databases are installed on the Linux guests.
In that environment, unavailability or a single point of failure can result from any of the following situations:
Planned downtime activities:
 – System z hardware upgrades that require Power On Reset (POR).
 – LPAR configuration changes that require reboot of the LPAR.
 – z/VM maintenance.
Unplanned outages:
 – The System z hardware might experience multiple unrecoverable failures, which cause the entire server to fail (although this is not likely to happen).
 – Network or connectivity failures.
 – Disk subsystem I/O channels failures.
 – The LPAR microcode might fail.
 – z/VM failures.
HADR applicability
z/VM offers the following technologies to enhance the High Availability of Oracle databases on Linux on System z environment:
Single points of system failures can be avoided by implementing z/VM multi-system clustering technology, as described in Chapter 6, “Using z/VM Live Guest Relocation to relocate a Linux guest” on page 117.
The ability to use multiple LPARs and to distribute Linux guests across them reduces several potential single points of failure at the system-image level.
The applications that are running on LPARs in a single System z server can communicate with each other by using HiperSockets, which perform memory-to-memory data transfers. This avoids any external network traffic and is a good choice for implementing Oracle RAC interconnect requirements.
Virtual switch (VSWITCH) under z/VM and OSA Channel Bonding under LPAR can be used to avoid network single point of failures.
For ECKD DASD devices that are accessed over FICON® channels, redundant multipathing is provided and handled invisibly to the Linux operating system.
For SCSI (fixed-block) LUNs that are accessed over System z FCP channels, each path to each LUN appears to the Linux operating system as a different device. The Linux kernel (2.6 and above) multipath facility combines these paths and provides High Availability, as sketched in the example that follows.
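For FCP-attached SCSI LUNs, the Linux device-mapper multipath configuration might look like the following sketch. The settings are generic examples; the values that are appropriate for a specific storage server should be taken from its documentation.
# /etc/multipath.conf (minimal example)
defaults {
    user_friendly_names yes        # create stable /dev/mapper/mpathN names
    path_grouping_policy multibus  # spread I/O across all available paths
}
# After multipathd is started, list the paths that were grouped for each LUN
multipath -ll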
 
Tip: For more information about High Availability with System z, see High-Availability of System Resources: Architectures for Linux on IBM System z Servers, ZSW03236-USEN-01, which is available at this website:
http://public.dhe.ibm.com/common/ssi/ecm/en/zsw03236usen/ZSW03236USEN.PDF
9.4 Disaster Recovery solutions
One of the objectives in achieving High Availability is to prevent a site from becoming a single point of failure (SPoF). Disaster Recovery (DR) solutions are an extension of High Availability solutions with the added capability of providing resiliency with geographic dispersion. Current disaster recovery solutions require geographic dispersion and should also meet RTO, RPO, and SLA objectives.
A plan for DR normally includes the following considerations:
Ensuring continuity of operations in the event of various disaster scenarios.
Dual site concept, where two data centers are in different locations. The entire hardware configuration is redundant, and the two systems are connected to each other.
Continuous operations for applications, databases, system, networks, supporting staff, and supporting infrastructure (power, cooling, and space).
Normally DR processes and Business Continuity processes function as a closely coupled set of processes.
A DR configuration that is identical across tiers on the production site and the standby site is called a symmetric site.
Having a DR site is an expensive proposition because of the following costs:
 – Hardware
 – Software
 – Network
 – Site facilities
 – Human Resources
 – Under-used standby resources
 – No immediate ROI until a disaster occurs
An ideal DR solution includes the following features:
 – Highly reliable
 – Low complexity
 – Proven technologies
 – Less expensive to implement
Challenges in a DR solution include the following factors:
 – Expensive, redundant systems that are under-used
 – Difficult to test to determine whether it really works
 – No ROI until a disaster occurs
 – Hardware and software maintenance still needed
 – Distance across data centers creates data synchronization challenges
Many System z customers have well-established business processes for DR scenarios, which usually use the Capacity BackUp (CBU) feature of System z. Their current DR environments also can be easily extended to include Oracle databases that are running on Linux on System z. For Oracle databases, the major requirement for DR is data resiliency, which can be achieved by any of the following technologies:
Storage array-based remote mirroring solutions
Extended cluster solutions (Extended RAC)
Oracle Data Guard-based solutions.
Oracle MAA recommends building a DR solution that is based on Oracle Data Guard technology for Oracle databases for the following reasons:
Automatic and fast failover
Transactionally consistent data
Detection and repair of data corruptions
Application, system vendor, or storage independent
Planned downtime reduction by using database rolling upgrades
9.5 Summary
In this chapter, we described how to plan for a highly available and disaster recovery (HADR) environment for Oracle Databases that are running on Linux on System z in a virtualized environment.
HADR solutions are possible by combining the Oracle MAA blueprint with IBM System z hardware’s proven design for continuous availability, which offers a set of reliability, availability, and serviceability (RAS) features. The right HADR configuration is a balance between recovery time and recovery point requirements and cost.
