Business continuity
In this chapter, we describe how to provide for business continuity with IBM FileNet Content Manager (P8 Content Manager).
We discuss the following topics: defining business continuity, defining high availability (HA), and implementing a high availability solution.
7.1 Defining business continuity
Business continuity is defined as maintaining business services to an organization’s customers despite disruptive events that have the potential to interrupt service. Disruptions range from human errors or component failures to full-scale human-caused or natural disasters. Providing for continued business operations in the event of a local component failure is called high availability. Business continuity in the event of a full-scale disaster is called disaster recovery.
Business continuity is concerned with resuming all critical business functions after disruptive events. High availability and disaster recovery are concerned primarily with the subset of business continuity devoted to keeping information technology (IT) services available during and after disruptions. Besides IT services, business continuity covers all aspects of continuing business operations, including crisis management and communications, alternate work sites for employees, employee disaster assistance, temporary staffing, emergency transportation, physical security, and chain of command.
Business continuity planning (BCP) involves all aspects of anticipating possible disruptions to mission-critical business functions and putting in place plans to avoid or recover from those disruptions. BCP focuses on planning for the successful resumption of all mission-critical business operations after a disruption, not just restoring IT functions. It involves much more than IT professionals. It touches every department in an enterprise from upper management to human resources, to external communications professionals, telecommunications staff, facility management, healthcare services, finance, sales, marketing, and engineering.
Business continuity planning in the limited scope of IT functions will involve the IT department, facility management, telecommunications, and line of business management who can assist in evaluating which IT functions are mission-critical after a disruption or disaster. High availability and disaster recovery plans need to be formally developed and reviewed by all these stakeholders, implemented, and then regularly tested by all staff to be certain that they will function as expected during and after a real disruption.
This chapter covers the part of business continuity that concerns restoring IT functions, in particular P8 Content Manager, after a disruptive event.
7.2 Defining high availability (HA)
What is high availability (HA) and how is it measured? We start by defining availability. A business system is said to be available whenever it is fully accessible by its users. Availability is measured as a percentage of the planned uptime for a system during which the system is available to its users, that is, during which it is fully accessible for all its normal uses.
Planned uptime is the time that the system administrators have agreed to keep the system up and running for its users, frequently in the form of a service level agreement (SLA) with the user organizations. The SLA might allow the system administrators to take the system down nightly or weekly for backups and maintenance, or, in an increasing number of applications, rarely if at all. Certain mission critical systems for around-the-clock operations now need to be available 24 hours a day, 365 days a year.
The concept of high availability roughly equates to system and data available almost all of the planned uptime for a system. Achieving high availability means having the system up and running for a period of time that meets or exceeds the SLA for system availability, as measured as a percentage of the planned uptime for a system.
Table 7-1 helps quantify and classify a range of availability targets for IT systems. At the low end of the availability range, 95% availability is a fairly modest target and therefore is termed basic availability. It can typically be achieved with standard tape backup and restore facilities. The next level up, enhanced availability, requires more robust features, such as a Redundant Array of Independent Disks (RAID) storage system, which prevents data loss in the first place, rather than the more basic mechanisms for recovering from data loss after it occurs. Highly available systems will range from 99.9% to 99.999% availability and require protection from both application loss and data loss. At the high end of this continuum of availability is a fault tolerant system that is designed to avoid any downtime ever, because the system is used in life and death situations.
Table 7-1 Range of availability
Availability percent   Annual downtime            Availability type
100%                   0 minutes                  Fault tolerance for life and death applications
99.999%                5.3 minutes                Five nines (near continuous availability)
99.99%                 53 minutes                 High availability
99.9%                  526 minutes (8.8 hours)    High availability
99%                    88 hours (3.7 days)        Enhanced availability
95%                    18 days (2.6 weeks)        Basic availability
To make this more concrete, consider the maximum downtime that can be absorbed in a year while still achieving 99.999% availability, also called five nines availability. As Table 7-1 on page 219 indicates, five nines availability permits no more than 5.3 minutes of unscheduled downtime per year, or even less if the system is not scheduled for round-the-clock operation. This is near continuous availability, but not strictly fault tolerant. For a three nines target of 99.9%, we can allow 100 times more downtime, or 8.8 hours per year. An availability target of 99%, which still sounds like a high target, can be achieved even if the system is down 88 hours per year, or over three and a half days. So the range of availability is actually quite large.
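The annual downtime figures in Table 7-1 follow directly from the availability percentage. The following Python sketch shows the arithmetic, assuming round-the-clock (24x365) planned uptime; the function name is illustrative, not from any IBM product:

```python
# Convert an availability percentage into maximum annual downtime,
# assuming 24x365 planned uptime (525,600 minutes per year).

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def annual_downtime_minutes(availability_percent):
    """Maximum unscheduled downtime (minutes per year) for a given availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100.0)

for target in (99.999, 99.99, 99.9, 99.0, 95.0):
    print(f"{target}% availability -> {annual_downtime_minutes(target):.1f} minutes/year")
```

Note that 99.999% works out to about 5.3 minutes per year, matching the table; if planned uptime is less than 24x365, the allowed downtime shrinks proportionally.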
You might be asking yourself, “Why not provide for the highest levels of availability on all IT systems?”. The answer, as always, is cost. The cost of providing high availability goes up exponentially as availability approaches 99.9% and higher.
Choosing an appropriate availability target involves analyzing the sources and costs of downtime in order to justify the cost of the availability solution. Industry experts estimate that less than half of system downtime can be attributed to hardware, operating system, or environmental failures. The majority of downtime is the result of people and process problems, which comes down to a mix of operator errors and application errors.
This chapter focuses primarily on how to mitigate downtime due to hardware outages; system and IBM FileNet software problems outside the control of an IBM FileNet client; and environmental failures, such as loss of power, network connectivity, or air conditioning. These causes account for less than half of the sources of downtime. The majority of the sources require people or process changes.
Our advice is to determine what has caused the most downtime in the past for a particular system and focus first on that. Frequently, we have found that stricter change control and better load testing for new applications provide the greatest benefit. Focus on the root causes of outages first and then address the secondary and tertiary causes only after protecting against the root causes.
Here are several examples of best practices for avoiding downtime from people and process problems:
System administrators need to be well-trained and dedicated full-time to their systems so that they are least likely to commit pilot errors.
The applications running on the system must be designed with great care to avoid possible application crashes or other failures.
Exception handling, both by administrators and application programs, must be carefully thought-out so that problems are anticipated and handled efficiently and effectively when they occur.
Comprehensive testing and staging of the system is paramount to avoiding production downtime. Testing of the system under a simulated production workload is critical to avoiding downtime when the system is stressed with a peak load during production. Downtime on a test system does not affect availability of the production system, so make sure to eradicate all the problems before taking a new system, software release, service pack, or even software patch into production.
Deploying a new application into production must likewise be planned and tested carefully to minimize the possibilities of adversely affecting production due to an overlooked deployment complication.
Thorough user training will help keep the system performing well within the bounds for which it was designed. Users who abuse a system due to ignorance can affect overall system performance or even cause a system failure.
Make sure that all sources of downtime are addressed, if high availability is to be achieved. After the fundamental people-related and process-related problems have been addressed, you need to consider hardware and software availability next.
7.3 Implementing a high availability solution
There are a variety of building blocks for high availability, ranging from the most basic backup and restore facilities, to hardened servers and backup servers, to the best practices: server farms and server clusters.
It is important to note that server farms and server clusters, as those terms are used in this chapter, are different solutions. We will explore server farms first, and then explain how clusters differ.
7.3.1 Load-balanced server farms
Server farms are the best practice for web servers. In fact, they are the best practice, in terms of high availability, for all the server tiers in a P8 Content Manager solution where they are supported. The architecture and function of some servers do not lend themselves to a server farm configuration. But, the core P8 5.2 Content Platform Engine, as well as all the P8 web and presentation tier products, supports server farming. In addition, IBM DB2 pureScale® and Oracle Real Application Clusters (RAC) support server farming.
As we have already discussed in 3.2, “Scalability” on page 52, the key concept for a server farm is to distribute the incoming user workload across two or more active, cloned servers. This distribution is commonly called “load balancing,” which can be implemented either in hardware or software.
This is a scalable architecture because servers can be added to the farm to scale it out for greater workloads. It also provides improved availability because the failure of one server in a farm still leaves one or more other servers to handle incoming client requests, keeping the service available to its users.
In a load-balanced server farm, clients of that server see one virtual server, even though there are actually two or more servers behind the load-balancing hardware or software. The applications or services that are accessed by the server’s clients are replicated, or cloned, across all the servers in the farm. And all those servers are actively providing the application or service all the time.
The load-balancing software or hardware receives each request and uses any one of a variety of approaches for distributing the request workload over the servers in the farm. This can be a simple round-robin approach, which sends requests to the servers in a predefined order. A more sophisticated load balancer might use dynamic feedback from the servers in the farm to choose the server with the lightest current load or the fastest average response time, for example.
In any case, the load balancer tracks the state of each server in the farm, so that if a server becomes unavailable, the load balancer can direct all future requests to the remaining servers in the farm and avoid the down server, therefore, masking the failure.
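The round-robin distribution and failure masking described above can be sketched in a few lines of Python. This is a simplified model with illustrative names, not the behavior of any particular load-balancing product:

```python
class RoundRobinBalancer:
    """Round-robin load balancer that skips servers marked as down."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)  # servers currently passing health checks
        self._next = 0                    # index of the next server to try

    def mark_down(self, server):
        """Record that a server has failed; future requests avoid it."""
        self.healthy.discard(server)

    def mark_up(self, server):
        """Record that a repaired server has rejoined the farm."""
        self.healthy.add(server)

    def route(self):
        """Return the next healthy server in round-robin order, masking failures."""
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers in the farm")

farm = RoundRobinBalancer(["cpe1", "cpe2", "cpe3"])
farm.mark_down("cpe2")  # simulate a server failure
# Subsequent requests alternate between cpe1 and cpe3; cpe2 is skipped.
print([farm.route() for _ in range(4)])
```

A more sophisticated balancer would replace the fixed rotation in route() with a choice based on dynamic feedback, such as current load or average response time, as the text describes.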
The key enabler for a server farm is the load balancer. In most cases, IBM FileNet leverages third-party load-balancing hardware and software products, rather than building load balancing into IBM FileNet products themselves.
IBM Content Search Services (CSS) and IBM FileNet Rendition Engine (RE) are two exceptions that provide load balancing on their own, so they do not require any external hardware or software load-balancing solutions.
All the Java application server vendors provide software to balance the Java application workload running in their Java Platform Enterprise Edition (Java EE) environments. For example, the IBM WebSphere Application Server Network Deployment product includes built-in software load balancing for Java applications that are deployed in WebSphere Application Server Network Deployment clusters.
 
Note: The base WebSphere Application Server product bundled with IBM FileNet Content Manager does not include this feature, so WebSphere Application Server Network Deployment must be licensed separately for high availability deployments.
Java EE application server vendors, including IBM, use the term cluster for their load-balancing software feature. The other Java EE application servers, Oracle WebLogic and JBoss, also provide a similar load-balancing software feature.
Java method calls by clients of a clustered Java application, such as the P8 Content Platform Engine, are distributed across all the WebSphere Application Server Network Deployment servers running Content Platform Engine by means of the WebSphere Application Server Network Deployment Workload Management (WLM) component. WLM consists of both a client-side component and a server-side component.
The server side, in conjunction with the WebSphere Application Server Network Deployment High Availability (HA) Manager component, keeps track of the health of each instance of the Java application, and sends that information back to the client-side WLM component on the return from every Java method call from the client.
The client-side WLM component, which is part of the WebSphere Java Runtime Environment (JRE) running on the client server, is responsible for distributing the method calls from local Java applications, such as the IBM Content Navigator, over the servers running the target Java application, such as the P8 Content Platform Engine. When IBM Content Navigator makes a content-related or process-related method call to the P8 Content Platform Engine, the local WLM running on the IBM Content Navigator server will decide which currently active server in the P8 Content Platform Engine cluster to use for that call, effectively load balancing all the calls across the servers in the P8 Content Platform Engine cluster.
Network hardware vendors, such as Cisco and F5 Networks, have implemented load balancing for server farms in several of their network devices. F5 BIG-IP is a popular hardware load-balancing device.
There are also many other vendors that have load balancer products. These products are best for load balancing the HTTP network traffic from web browsers to the web application tier in a P8 system, as well as the SOAP/HTTP network traffic from P8 client applications that use the Web Services interfaces to the P8 Content Platform Engine. However, do not use hardware load balancers in combination with WebSphere Application Server Network Deployment WLM software load balancing for the native Java APIs to the P8 Content Platform Engine, because the WebSphere Application Server Network Deployment WLM load balancing is self-contained and complete on its own.
In the best case, the hardware load balancer affects only the initial Java infrastructure call to locate the instances of the P8 Content Platform Engine in the WebSphere Application Server Network Deployment cluster. After that, the WLM component takes over the routing of all the Java method calls to the P8 Content Platform Engine. In the worst case, a hardware load balancer can compete with and disrupt the software load balancing provided by WebSphere Application Server Network Deployment and cause serious performance problems.
In addition to WLM for Java method call load balancing, WebSphere Application Server Network Deployment provides another load-balancing feature for HTTP traffic: the WebSphere Application Server HTTP plug-in for all the popular HTTP servers. The plug-in intercepts HTTP traffic flowing through the HTTP server to P8 servers, and distributes that traffic over the P8 servers configured for each HTTP function.
For example, HTTP traffic between users’ web browsers and the IBM Content Navigator web application can be load balanced by HTTP servers in front of the IBM Content Navigator server instances, if the WebSphere Application Server HTTP plug-in is installed on the HTTP server and configured for that traffic. Another example is traffic from clients of the P8 Content Platform Engine that use the Web Services interface to the content and process functions of Content Platform Engine, rather than the Java API. The Web Services calls are made outside of the Java infrastructure over the SOAP protocol running on HTTP. These calls can be load balanced by any HTTP load balancer, including the WebSphere Application Server HTTP plug-in, or hardware load balancers from network hardware vendors, such as F5 Networks.
Figure 7-1 shows a logical diagram of a load-balanced server farm. This figure shows a pair of hardware load balancers and multiple servers in the server farm. Redundancy is essential to prevent the failure of one load balancer from taking down the server farm.
Figure 7-1 A load-balanced server farm
This concept of no single point of failure is key to high availability. Every link in the chain, that is, every element in the hardware and software, must have an alternate element available to take over in case the first element fails. Software load balancers, for example, are designed to avoid any single point of failure; therefore, each server in the farm has a copy of the load-balancing software running on it in configurations using software instead of hardware for load balancing.
The software running on each server in a farm is functionally identical. As changes are made to any server in the farm, you must replicate those changes to all the servers in the farm. In this regard, a key benefit of WebSphere Application Server Network Deployment is its facility for rolling out software changes across all the nodes in a WebSphere Application Server Network Deployment cluster, after one of the nodes in the cluster has been updated. So, it facilitates keeping the software the same across all the nodes of a WebSphere Application Server Network Deployment cluster. (Recall that a WebSphere Application Server Network Deployment Java EE cluster is actually what we call an active-active load-balanced server farm in this chapter. We describe next how that differs from the concept of an active-passive server cluster.)
Load balancing offers a good solution: Any client calling into a load-balanced server farm can be directed to any server in the farm. The load can be evenly distributed across all the servers for the best possible response time and server usage. However, load balancing can be a problem if the servers in the farm retain any state between calls. For instance, if a user initiates a session by providing logon credentials, it is beneficial for those credentials to be cached for reuse on all subsequent calls to the server for that user session.
We cannot ask the user to log in over and over every time the application needs to communicate with the server. Therefore, in one solution, the server keeps a temporary copy of the user’s validated credentials in its memory. This works fine if there is only one server, but in a load-balanced server farm, the load balancer can easily direct subsequent calls from the same user session to different servers in the farm. Those other servers will not have the session state in their memory.
Load balancers can be configured for session-based load balancing to solve this session state problem. This is also known as sticky sessions, session affinity, or stateful load balancing. The load balancer keeps track of which server it selected at the beginning of a user session and directs all the traffic for that session to the same physical server. Session-based load balancing is required for the Application Engine, but not for the Content Platform Engine, because the Application Engine caches session state, while the Content Platform Engine does not.
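Session affinity can be illustrated with a small sketch. This is our own simplified model: real load balancers typically key on a session cookie or source address, but the pinning behavior is the same:

```python
import hashlib

class StickyBalancer:
    """Session-affinity (sticky session) load balancer.

    The first request of a session picks a server; every later request
    for that session is routed to the same server, so server-side
    session state (such as cached credentials) remains valid.
    """

    def __init__(self, servers):
        self.servers = list(servers)
        self.affinity = {}  # session_id -> pinned server

    def route(self, session_id):
        if session_id not in self.affinity:
            # First request of the session: choose a server by hashing the
            # session ID, and remember the choice.
            digest = int(hashlib.sha256(session_id.encode()).hexdigest(), 16)
            self.affinity[session_id] = self.servers[digest % len(self.servers)]
        return self.affinity[session_id]

lb = StickyBalancer(["ae1", "ae2"])
first = lb.route("user-session-42")
# All subsequent calls for the same session land on the same server.
assert all(lb.route("user-session-42") == first for _ in range(5))
```

This is why the Application Engine, which caches session state, needs sticky sessions, while the stateless Content Platform Engine can use plain request-by-request load balancing.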
Load-balanced server farms (or Java EE clusters) that manage persistent data stored on disk need a way for all the servers in the farm to share the same set of disks. For data stored in databases, such as DB2, all the database vendors provide interfaces with locking and transaction features that enable multiple database clients in a load-balanced server farm to share read/write access to the same database.
In addition to data housed in databases, the IBM FileNet Content Platform Engine manages data stored in file systems, such as the file storage areas for content objects (documents and annotations). So all the servers in a Content Platform Engine farm must be able to read and write one or more common file systems. The solution is a shared network file system, which network-attached storage (NAS) devices provide natively over the Network File System (NFS) or Common Internet File System (CIFS) protocols. NFS is supported by the UNIX and Linux operating systems, and CIFS is supported by Microsoft Windows. Another option for AIX and Linux based P8 Content Platform Engine servers is the IBM General Parallel File System (GPFS™), which can be deployed with storage area network (SAN) storage devices to provide a shared network file system for P8 servers.
Now, we turn to active-passive server clusters and explore how they differ from active-active load-balanced server farms (or from load-balanced Java EE server clusters, such as WebSphere Application Server Network Deployment clusters).
7.3.2 Active-passive server clusters
Historically, active-passive server clusters were commonly required for the business logic and data tiers beneath the web and presentation layer tier of servers. Examples include business process servers, library or repository servers, and database or file system servers. For instance, IBM FileNet Image Services (IS) is a content repository that requires an active-passive server cluster configuration when deployed for high availability.
Business logic and data tier servers all differ from web and presentation servers in that they directly manage substantial dynamic data, such as content or process data. A stream of dynamic data, by definition, is a stream of new or rapidly changing data. For business logic or data tier server products that have not been specifically designed to allow multiple servers to manage this kind of dynamic data in a safe, cooperative process, a single server must manage the dynamic data set, in order to avoid data inconsistency or corruption from multiple servers trying to make changes to the data simultaneously.
Fortunately, more and more server products, including the IBM FileNet 5.2 Content Platform Engine and its predecessor Content Engine and Process Engine products, make use of transactional software and locking to allow multiple server instances to manage dynamic data sets safely. Those products can take advantage of active-active load balancing, described previously. But other products, notably IBM FileNet Image Services, do not have this capability, so each data set must be managed by only one server.
Because of that single server architecture, a server farm with two or more active servers does not fit well with servers that have not been designed for cooperative data management. Yet a second server is still needed for continued availability, in case the first server fails. The solution in this case is an active-passive server cluster, where the second server stands by until the first server fails, before stepping in to take over the data management.
The second server needs access to the data that was being managed by the first server: either the exact same copy, or a copy of its own. The common solution connects both servers to the same copy of the data, either via a network file share or, more commonly, via a SAN storage device that both servers can access, but only one at a time. The active server owns the SAN storage, and the passive server has no access.
Shared access to SAN storage in this way is an alternative to replicating the data to a second storage device accessed by the second server. However, maintaining a replica of the data, sometimes called a mirror, on a second local storage device is a good practice, as protection against the failure of the primary SAN storage device. Even highly available SAN storage devices, which have internal protection against the loss of a disk drive through redundant copies of the data, have been known to fail completely. Active-passive server clusters can still be configured such that all servers can take over the primary storage in the event of the active server failing, with the local mirror as a standby copy that is used only if the primary storage device failed. The IBM DB2 High Availability Disaster Recovery (HADR) product is an example of a product that provides both active-passive server clustering as well as data replication so that the passive server has its own separate copy of the data.
If there is no local mirror, recovering from the loss of a primary storage device involves either time-consuming restoration from a previous backup, or declaring a disaster and failing everything over to the recovery site, which is also time-consuming. Data updates that have occurred in the time since a backup was taken will necessarily be lost when a backup is restored. If the sources of those updates are still available, the updates can be made a second time to avoid data loss. In comparison to restoring from backup or switching over to a disaster recovery site, switching over to a local replica by reconfiguring the server managing the storage is faster, simpler, and avoids any data loss.
Figure 7-2 on page 228 shows two servers in a server cluster with access to the same shared storage. Recall that server farms do not typically have this requirement for shared storage. DB2 pureScale, Oracle RAC, and the Content Platform Engine are exceptions, in that they exhibit both server farm and server cluster characteristics: they combine load balancing with cooperative data management using storage that is simultaneously shared by all the servers. In a load-balanced server farm with shared storage, all the servers are active and need to access the storage in parallel, so a network file share is required. An active-passive server cluster, however, allows only the active server to access the storage, so the single-owner model of SAN storage works well. The typical server cluster does not support load balancing, but it does support shared storage via SAN. The storage is shared in a server cluster in the sense that both servers are connected to the same storage, but they never access it concurrently in the case of SAN storage.
Figure 7-2 Active-passive server cluster
As with server farms, clients of a server cluster see one virtual server, even though the physical server they interact with will change if the primary server fails. If the primary server fails, a failover occurs, and the second server takes over the data copy and starts the software to manage the stored data. It also takes over the virtual network address, which is shared by the two servers, making the failover transparent to the client of the server cluster.
Both triggering a failover and actually accomplishing the failover cleanly are the responsibility of clustering software running on both servers. This software is configured on the secondary server to monitor the health of the primary server and initiate a failover if the primary server fails. The active server in an active-passive cluster owns the storage resources, commonly called a resource group or shared volume group. The resource group is visible from both cluster nodes but only dedicated to the active node. If the active node fails, the clustering software will move the resource group to the remaining passive node. The passive node sees the resource group but does not write to it until the clustering software ensures consistency. This is called a shared volume group for IBM PowerHA® clusters.
After the failed server is repaired and running again, a failback is initiated via the clustering software to shift the responsibility back to the primary server and put the secondary server in waiting mode again. This failback is necessary to get back to a redundant state that can accommodate another server failure.
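The monitor, failover, and failback sequence that clustering software performs can be sketched as follows. This is a deliberately simplified model of our own; real products such as IBM PowerHA add quorum, fencing, and full resource-group orchestration (moving the storage and virtual IP, then starting the application):

```python
import time

class ClusterMonitor:
    """Active-passive cluster monitor: fails over when heartbeats stop."""

    def __init__(self, heartbeat_timeout=3.0):
        self.heartbeat_timeout = heartbeat_timeout
        self.active = "primary"
        self.last_heartbeat = time.monotonic()

    def heartbeat(self):
        """Called each time the active node's heartbeat is received."""
        self.last_heartbeat = time.monotonic()

    def check(self, now=None):
        """Fail over to the secondary node if the heartbeat has timed out."""
        now = time.monotonic() if now is None else now
        if self.active == "primary" and now - self.last_heartbeat > self.heartbeat_timeout:
            # Failover: in a real cluster, move the resource group and the
            # virtual IP address, then start the application software on
            # the passive node before declaring it active.
            self.active = "secondary"
        return self.active

    def failback(self):
        """Manual failback once the repaired primary node is ready again."""
        self.active = "primary"
        self.last_heartbeat = time.monotonic()
```

Note that check() only flips the active role; the time-consuming part of a real failover is the resource-group transfer and application restart hinted at in the comment, which is why a failover takes minutes rather than seconds.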
In certain cases, intentional failovers can be used to mask planned downtime for software or hardware upgrades or other maintenance. You can upgrade and test the secondary server offline. And then, you can trigger a failover and apply the upgrade to the primary server while the secondary server is standing in for the primary server.
This type of configuration, in which the second server is inactive or passive until it is called to step in for the active server, is called an active-passive server cluster. Several clustering software products also support an active-active cluster configuration, which is similar to a server farm where all servers are active. An active-active cluster configuration is useful for data managing servers that are designed to share the management across more than one server.
However, IBM FileNet products that rely on clustering software for high availability all require an active-passive configuration. IBM FileNet products that support an active-active configuration always use a server farm and load balancing rather than clustering software. (Server farms are always active-active.)
Server cluster software requires agents or scripts that are configured to manage key server processes on a particular server. These agents or scripts allow the cluster software to monitor the health of the application software, as well as start and stop the application software on that server. Cluster software typically comes with predefined agents or scripts for common server types, such as database servers.
A failover in an active-passive server cluster is not instantaneous. It will typically take ten or fifteen minutes or longer, depending on how long it takes the clustering software to stop the failing server, shift the virtual IP address and the storage to the passive server, and start the application software on the passive server. Before the system is accessible again, additional internal steps can take place, such as database transaction recovery. Depending on the state of a database and the number of in-flight transactions at the time of a database server failure, it can take substantially more than fifteen minutes to roll back incomplete transactions before the database is once again online and available.
7.3.3 Geographically dispersed server clusters and server farms
Most server clusters consist of two side-by-side servers. However, certain software vendors also support geographically dispersed clusters. The Symantec Veritas Cluster Server, for instance, supports both stretch clusters and replicated data clusters. A stretch cluster is defined as two servers in a cluster separated by as much as 100 km (62 miles). The distance limitation is due to the requirement to connect both servers via fiber to the same SAN device for shared storage and also due to the maximum amount of time allowed for the heartbeat protocol exchange between the two servers. The two servers in a stretch cluster always share the same SAN storage device, just as though they were side by side, and operate identically with the way a local server cluster operates.
A replicated data cluster is similar to a stretch cluster, but the remote server always has its own replicated copy of the data. In the event of a failover, the second server comes up on its local copy of the data. In certain cases (but not all cases), this capability removes the need for an expensive fiber connection between the two sites, because neither server needs the speed of fiber to access storage at the other site. Data replication can be done over an IP network. There is still a 100 km (62 mile) distance limitation to ensure that the heartbeat between servers will not time out due to transmission delays and to allow for synchronous replication. See 7.5.1, “Replication” on page 236 for an explanation of synchronous and asynchronous replication.
A replicated data cluster cannot provide the same level of availability as a local cluster, because of the additional downtime required for a data resync to the primary site on a site failback. In addition, particularly in stretch clusters, the network connectivity between the two sites is typically much more expensive and substantially more prone to failure than the local network connectivity between two servers in a local cluster.
Similarly, some server farms can be dispersed geographically across multiple sites. In that case, load balancing must be done across sites. Servers that manage persistent data, such as database servers, need to share a single copy of the data, which necessarily must live at just one of the sites. The network connectivity issues with geographically dispersed server clusters apply to server farms as well. Some vendors supporting server farms caution against geographically distributing their farms. Notably, IBM strongly discourages stretching WebSphere Application Server Network Deployment clusters across sites, due to the added risk of communication failures between the sites and the complexity of this kind of deployment.
Because of the availability trade-offs and communication costs, geographically dispersed server clusters and server farms are generally not the best practice for high availability. However, some organizations have chosen to deploy twin data centers within a single metropolitan area, typically less than 40 kilometers (25 miles) apart. The motivation for twin data centers is to reduce the risk of an entire data center becoming unavailable, whether through planned maintenance or an unplanned outage. By limiting the distance between the sites, the networking costs and the risk of network failure are reduced, which makes this approach more feasible.
Still, the simplicity of keeping server clusters and farms local to a single data center is the best practice for high availability, because this minimizes the risk of failure due to the complexity of running these clusters or farms across multiple data centers, and the increased risk of network problems.
As we will see later in the disaster recovery discussion, the best practice solution for the loss of the production data center is to fail over to a standby recovery data center that is typically located hundreds of miles or more away from the production data center.
Twin data centers in a metropolitan area are less attractive for true disaster recovery, because both can be lost in a single local disaster. Even if one of the nearby data centers survives a disaster, the IT staff living in the metropolitan area surrounding the two data centers can effectively become a single point of failure for the two data centers. If their access to the remaining data center is cut off, or they are otherwise unable to work due to effects of the disaster, the remaining data center can be effectively lost without suffering direct damage itself from the disaster.
Some organizations have combined twin nearby data centers with a third recovery center farther away, in order to have both a local disaster recovery option as well as a remote disaster recovery option. That is the best practice when twin nearby production data centers are a company standard. But a more cost-effective and lower-risk solution is to have a single production data center configured for full local high availability, and a remote standby disaster recovery data center located at least a hundred miles away, preferably more.
7.3.4 Server cluster products
All the major server vendors, as well as several software vendors, offer server cluster software products (see Table 7-2).
Table 7-2 Server cluster products

Server and software platform     | Server cluster software products
---------------------------------|---------------------------------
IBM System p® AIX                | PowerHA
Microsoft Windows Server         | Microsoft Cluster Server
Hewlett-Packard (HP) 9000 HP-UX  | HP ServiceGuard
Sun Solaris                      | Sun Cluster
AIX, Solaris, Windows, and Linux | Symantec Veritas Cluster Server (also supports HP-UX) or IBM Tivoli System Automation for Multiplatforms
7.3.5 Comparing and contrasting farms to clusters
Table 7-3 summarizes the differences and similarities between load-balanced server farms and active-passive server clusters.
Table 7-3 Comparison of farms to clusters

Feature | Farms | Clusters
--------|-------|---------
Clients see one virtual IP address and one virtual server | Yes | Yes
All servers active | Yes | No
One server active and one server passive | No | Yes, typically
Capacity and performance scalable by adding servers | Yes | No
Instantaneous failover | Yes, all servers active all the time | No, must wait for software to be started after failover
Shared storage between the servers | Not necessarily, but can include a network file share for parallel accesses from all the servers in the farm | Yes, typically SAN storage, which allows just the active server to access the storage
Used for web servers, presentation tier, and certain services tier servers | Yes | Not usually
Used for data tier servers | No | Yes (except active-active database products such as DB2 pureScale)
Requires hardware or software load balancer | Yes, such as BIG-IP or WebSphere Network Deployment clustering | No
Requires failover cluster software | No | Yes, such as PowerHA, IBM Tivoli System Automation for Multiplatforms, Microsoft Cluster Server, or Symantec Veritas Cluster Server
Now that we have covered the differences between server farms and server clusters, we explore the advantages of farms over clusters and the advantages of clusters over farms. Server farms have no idle servers, by definition, because all servers in a farm are active. Server clusters always have one or more idle servers in a steady state. Even more importantly, you can expand server farms by simply adding a server clone, thereby scaling out the farm to handle larger workloads. This horizontal scalability is not possible with active-passive server clusters. The last advantage of a farm over a cluster is faster recovery time. Server cluster failovers are delayed by the time that it takes to start the software on the passive server on a failover. All the servers in a server farm are active and immediately available to accept work that has been redirected away from failed servers.
There are also some advantages that clusters have over farms, but on balance farms have the advantage. The chief advantage of a cluster over a farm is that the passive server can be configured identically with the active server, guaranteeing no performance drop-off in the event of a failover. With server farms, even if the initial server sizing is done to allow one server in a two-server farm to handle 100% of the workload, the workload can increase over time to the point where a single server is unable to handle the full workload after a failure. Careful capacity monitoring and periodic testing can prevent this problem from occurring with farms, however.
7.3.6 Inconsistent industry terminology
The terminology used in this book to distinguish load-balanced server farms from active-passive server clusters is not unique to this book, but it is also not standard across the industry. As you can see in Table 7-4 on page 234, many vendors use the term “cluster” for both farms and clusters. Microsoft, for example, uses both terms for server farms. Symantec/Veritas uses “failover group” for a cluster and “parallel group” for a farm. Both Oracle and IBM call their Java EE application server farming configurations clusters. As we have seen, farms and clusters, under our definition of those terms, are quite different; hence the emphasis here on distinct terms for these HA approaches.
Table 7-4 Inconsistent industry terminology for HA

Vendor | HA terminology
-------|---------------
Microsoft | “NLB cluster” and “cluster farm” = farm; “server cluster” and “cluster server” = cluster
Symantec Veritas | “Failover group” = cluster; “parallel group” = farm
Oracle | WebLogic “cluster” = farm
IBM | WebSphere “cluster” = farm; PowerHA “cluster” = cluster
7.3.7 Server virtualization and high availability
Chapter 3, “System architecture” on page 37 introduced the concept of server virtualization and its promise of consolidating data center hardware, thus reducing total cost of ownership for the data center. This has considerable appeal, but it can also have a negative impact on availability. If a server farm or server cluster with two physical servers is consolidated into two virtual servers hosted on the same physical server, you must be careful to ensure that the physical server has no single points of failure. Does it have redundant power supplies, network interface cards, processors, memory, and so on? If any single component failure on a server can take down all the virtual servers hosted on it, that server cannot act as host for all the servers in a cluster or farm. In that case, the two virtual servers must be hosted on different physical servers to avoid downtime caused by a single component failure.
7.4 Defining disaster recovery (DR)
Now, we turn from high availability to disaster recovery. How do they differ? Both high availability and disaster recovery are part of business continuity, that is, making sure that critical business systems and processes can continue to operate despite system failures and disruptions. However, disaster recovery and high availability solutions perform under different circumstances that require different solutions.
Disaster recovery concerns restoring service after the loss of an entire business system or data center due to natural or human-made disasters, such as fire, flood, hurricane, earthquake, war, criminal action, or sabotage. In contrast to that, high availability concerns keeping a business system available despite a local component failure – such as a server power supply failure, a network switch failure, or a disk crash – that leaves most of the system untouched.
For recovery from the loss of an entire production system in a disaster, a full remote system with its own up-to-date copy of the data is needed. All users and operations must be switched over to the remote system. Compare that to when just a single component fails in a data center: the optimal solution then is an automated, localized, and limited substitution of a single replacement component for the failed component. Server farms and clusters substitute a single replacement component with minimal disruption to the rest of the system and its users. Disaster recovery solutions are much more drastic, disruptive, time-consuming, and heavyweight, because they have to replace an entire system or data center, not just a single failed component. Therefore, disaster recovery solutions are an inappropriate choice for high availability.
Disasters, such as the World Trade Center destruction on 11 September 2001 (9/11) or Hurricane Katrina in New Orleans and the Mississippi Gulf Coast, can have a devastating effect on businesses in their path. Organizations with business continuity, HA, and DR plans were much more likely to rebound and recover from 9/11 and Katrina than those without such planning. Analysts estimate that a significant number of businesses that suffer an extended IT systems outage due to disaster go out of business within a year or two; other businesses never resume operations at all. The obvious inference is that planning and preparing for disaster recovery is a best practice for businesses of all sizes.
7.4.1 Disaster recovery concepts
There are two key metrics that play important roles in determining an appropriate disaster recovery (DR) solution for a particular business and application. They are Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
In certain cases, the most recent data changes at the production site do not make it to the recovery site, because of a time lag that is inherent in how the data is replicated. The magnitude of this time lag depends on the particular data replication technology that you choose. If a disaster occurs, the recovery point is the point in time before the disaster that the most recently replicated data represents. How far back in time is the business willing to go after the disaster happens? That is, the RPO translates into how much recent data the business is willing to lose in a disaster.
The duration of time that passes before the systems can be made operational at the recovery site is called the recovery time. The RTO is the business’s time requirement for getting the system back online. That is, how much downtime can the business endure?
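The two metrics can be made concrete with a short sketch. This hedged Python example (the function names are ours, not from any product) shows why a daily backup cannot satisfy a four-hour RPO:

```python
from datetime import timedelta

def worst_case_data_loss(replication_interval: timedelta) -> timedelta:
    """Worst-case RPO exposure: a disaster can strike just before the next
    replication or backup cycle, losing almost one full interval of updates."""
    return replication_interval

def meets_objectives(rpo: timedelta, rto: timedelta,
                     data_loss: timedelta, downtime: timedelta) -> bool:
    """True if the measured loss and downtime satisfy the business's RPO and RTO."""
    return data_loss <= rpo and downtime <= rto

# Daily tape backup can lose up to 24 hours of data...
print(worst_case_data_loss(timedelta(hours=24)))
# ...so it cannot satisfy a 4-hour RPO, even with a fast 6-hour recovery.
print(meets_objectives(rpo=timedelta(hours=4), rto=timedelta(hours=8),
                       data_loss=timedelta(hours=24), downtime=timedelta(hours=6)))
```

Note that RPO and RTO are independent: a solution can recover quickly (good RTO) and still lose a day of data (poor RPO), or the reverse.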
RPOs and RTOs for different businesses and industries range from seconds to minutes or days, even to weeks, depending on business requirements.
7.5 Implementing a disaster recovery solution
Disaster recovery can be greatly facilitated by two key technologies. One is data replication to a remote recovery site, and the other is software or scripting that can automate most of a site failover to a recovery site after a disaster takes away the primary site. The RTO and RPO for a particular business determine when these two technologies are required for the disaster recovery solution for that business. With an RPO and RTO measured in days to weeks, that is, if the business is willing to lose days to weeks of data and can wait days to weeks for the system to come back online, tape backup and restore are sufficient. But if an RPO of seconds to hours is desired, a form of data replication is required. If an RTO of hours to weeks is acceptable, replication alone might suffice. But if an RTO of seconds to hours is desired, both replication and automated site failover will be required.
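As a sketch, the mapping from objectives to required technologies described above might look like the following. The function and its 24-hour boundary are illustrative assumptions, standing in for the rough "seconds to hours" versus "days to weeks" division in the text.

```python
def required_dr_technologies(rpo_hours: float, rto_hours: float) -> set:
    """Map a business's RPO and RTO (in hours) to the DR technologies
    described above. The 24-hour threshold is an assumed stand-in for the
    'seconds to hours' versus 'days to weeks' distinction."""
    DAY = 24
    tech = {"tape backup"}            # point-in-time backup is always needed
    if rpo_hours < DAY:
        tech.add("data replication")  # tight RPO requires replication
    if rto_hours < DAY:
        # A tight RTO requires replication plus automated site failover.
        tech.update({"data replication", "automated site failover"})
    return tech

print(sorted(required_dr_technologies(rpo_hours=72, rto_hours=120)))
print(sorted(required_dr_technologies(rpo_hours=1, rto_hours=4)))
```

Tape backup stays in every result because, as noted later, a point-in-time copy is always needed to recover from data corruption or accidental deletion.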
Next, we explore replication and automated site failover in more detail.
7.5.1 Replication
Backing up to tape or other removable media is the minimum for copying data for use after a disaster. You must ship the media off-site to a location outside of the projected disaster impact zone. The greater the distance of the location from the production site, the lower the risk that both production and recovery sites will be affected by the same disaster. One general rule is that a backup tape vault and recovery site must be at least 48.28 km (30 miles) away from the production system, which in most cases is sufficient to avoid a flood or fire disabling both sites. However, sites that are close together can still be in the same impact zone for earthquakes, hurricanes, or power grid failures, so more cautious organizations separate their production and recovery sites by hundreds, if not thousands, of miles.
Companies usually perform backups once a day, which meets only a 24-hour RPO. That means that as much as 24 hours of data can be lost. The recovery time required for data restoration from tape can be days due to the need to restore a series of tapes that represents a full backup and subsequent incremental or differential backups. So, you measure both RPO and RTO in days if the only DR provision is tape backup.
For a better RPO, that is, to reduce the potential data loss in a disaster, you need to replicate the data periodically to a remote disk, because periodic replication can be done more often than tape backup. This effectively reduces the window of data loss. Continuous replication, done in real time, can avoid data loss entirely.
 
Note: When you use continuous data replication products, point-in-time backups, such as tape backup or periodic replication, are still required in order to recover from data corruption or accidental deletion. Continuous replication copies the corruption or deletion to the replica; therefore, you need to be able to fall back on a point-in-time copy prior to when the corruption occurred.
There are several levels at which you can perform replication: the application level, the host level, and the storage level. Database replication is the best example of application-based replication. Host-based replication is beneath the application level, but it still resides on the server host and typically runs at the file system or operating system level. Storage-level replication is implemented by the storage subsystem itself, frequently, a SAN device or a NAS device.
Application-based replication
Application-level software that understands the structure of data and relationships between data elements can copy the data intelligently, so that the structure and relationships are preserved in the replica. Database and object-based replication are examples. Database replication ensures that the replica database is always in a consistent state with respect to database transactions. Object-based replication ensures that content objects that include both content and properties are replicated as an atomic unit, so that the content and properties are always consistent with each other in the replica.
Each database vendor has replication products that replicate just the database, but not other data. Examples include IBM DB2 High Availability and Disaster Recovery (HADR) and Oracle Data Guard. Database replication products are typically based on shipping database logs to the recovery site to be applied to a database copy there. The advantage of these products is that they keep the database replica in a fully consistent state at all times, with no incomplete transactions, which reduces the recovery time required when bringing up the database after a disaster. The disadvantage of these products is that they have no means to replicate anything other than databases. File systems that need to be kept consistent with the database, for instance, have to be replicated by a different replication mechanism, which introduces the possibility of inconsistency between the database and file system replicas.
Host-based replication
In contrast to application-based replication, host-based replication has no understanding of the data content, structure, or interrelationships. It detects when a file or disk block has been modified and copies that file or block to the replica. Symantec Veritas Volume Replicator and Double-Take Software Double-Take are examples of host-based replication products. Unlike application-based replication, they can be used to replicate all forms of data, whether it is in a database, a file system, or even a raw disk partition. Several of these products use the concept of consistency groups, which tie together data in different volumes and allow all the data to be replicated together, therefore maintaining consistency across related data sets, such as databases and file systems. In contrast to application-based replication, however, the replica is not guaranteed to be in a clean transactional state, because the replication mechanism has no visibility into database or file system transactions. Recovery can take longer, because incomplete transactions must be cleaned up prior to making the data available again.
Storage-based replication
All of the storage vendors offer storage-based replication for their SAN and NAS products. The storage products themselves provide storage-based replication and do not use server host resources. Examples include IBM Metro Mirror (PPRC) and Global Mirror (XRC), EMC SRDF and MirrorView, Hitachi Data Systems TrueCopy, and Network Appliance SnapMirror.
NAS products replicate changes at the file level. SAN products replicate block by block. In both NAS and SAN replication, as with host-based replication, there is no knowledge of the structure or semantics of the stored data. So, databases replicated in that way can be in any transient state with regard to database transactions and therefore might require more database recovery time when the replica is brought online. That increases the overall recovery time.
NAS replication covers any data in the file system, whereas SAN replication, which is at the lower level of disk blocks, covers all data stored on the disk.
An emerging specialization of storage-based replication uses a SAN network device to intercept disk writes to SAN storage devices and manage replication independently of both the server host and the storage devices. IBM SAN Volume Controller is an example of this type of product. It has the advantage of being able to span heterogeneous SAN storage devices and replicate data for all those devices in a consistent manner. You can think of the IBM SAN Volume Controller as a new form of storage-based replication, because it resides in the Fibre Channel infrastructure used to access SAN storage. Analysts have a new term for this kind of replication: network-based replication.
Synchronous as opposed to asynchronous replication
Host-based and storage-based replication commonly support two modes of operation: synchronous and asynchronous. Synchronous replication writes new data to both the production storage and the remote recovery site storage before returning success to the operating system at the production site for the disk write. So, when the operating system signals that a disk write is complete, it has actually been completed on both storage devices. You can think of synchronous replication as logically writing the data at both sites at the same time. That means that after a disaster strikes the production system, we know that the recovery site has all the data right up to the last block that was successfully written at the production site. Synchronous replication ensures that no data is lost in a disaster, as long as the recovery site survives the disaster. (However, incomplete transactions can still be rolled back when the recovery system is started, leading to unavoidable loss of the data in those transactions, even with synchronous replication.) But to keep the latency for disk writes short enough, synchronous replication is typically feasible only for sites that are separated by 96.5 km (60 miles) or less. Above that separation, the wait for the write to the recovery site slows the overall system significantly. The wait is a function of the distance between the sites, because signals can travel no faster than the speed of light. At more than 96.5 km (60 miles), the latency becomes too great in many cases, although certain storage vendors are now extending this distance to 290 km (180 miles).
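The propagation delay behind this distance limit is easy to estimate. This is a back-of-envelope Python sketch; the 200,000 km/s figure (roughly two-thirds of the speed of light, typical for optical fiber) is an assumption, and real write latency also includes switching and protocol overhead.

```python
def sync_write_delay_ms(distance_km: float,
                        fiber_speed_km_per_s: float = 200_000) -> float:
    """Minimum added delay per synchronous write: one round trip to the
    remote site, at an assumed signal speed of ~200,000 km/s in fiber."""
    return 2 * distance_km / fiber_speed_km_per_s * 1000

print(sync_write_delay_ms(96.5))   # just under 1 ms of propagation delay
print(sync_write_delay_ms(3000))   # roughly 30 ms: too slow for every disk write
```

At tens of milliseconds per acknowledged write, a write-heavy database would slow to a crawl, which is why long distances force asynchronous replication.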
For sites that are separated by more than 96.5 km (60 miles), asynchronous replication is the choice. Asynchronous replication is not done in lock step, the way that synchronous replication is. Instead, the local disk write is allowed to complete before the write is completed to the second site. The update to the second site is said to be done “asynchronously” from the local update, that is, not in the same logical operation. This method frees the production system from the performance drag of waiting for each disk write to occur at the remote site. However, it opens a time window during which the production site data differs from the recovery site copy. That difference represents data that is lost in a disaster when asynchronous replication is used. In exchange for that data loss, the two sites can be any distance apart, although the further apart they are, the greater the typical data loss.
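The contrast between the two modes can be sketched in a few lines of Python. The `Replicator` class is purely illustrative (no real replication product exposes this interface); it shows why an asynchronous write can be acknowledged before the remote copy exists, creating the window of potential data loss.

```python
import queue

class Replicator:
    """Illustrative contrast between synchronous and asynchronous replication.
    The 'local' and 'remote' lists stand in for the two sites' storage."""
    def __init__(self):
        self.local, self.remote = [], []
        self.pending = queue.Queue()  # async: blocks not yet at the remote site

    def write_sync(self, block):
        # Synchronous: the write completes at BOTH sites before it is
        # acknowledged, so the caller waits out the remote round trip.
        self.local.append(block)
        self.remote.append(block)
        return "ack"

    def write_async(self, block):
        # Asynchronous: acknowledge right after the local write; the remote
        # copy is made later by a background process.
        self.local.append(block)
        self.pending.put(block)
        return "ack"

    def drain(self):
        # Background replication catching up. Anything still pending at the
        # moment of a disaster is exactly the data lost under async mode.
        while not self.pending.empty():
            self.remote.append(self.pending.get())

r = Replicator()
r.write_sync("block-1")
r.write_async("block-2")
print(len(r.local) - len(r.remote))  # 1 block exposed to loss until drain() runs
```

The longer the replication link's lag, the more blocks accumulate in the pending queue, which is why data loss under asynchronous replication tends to grow with distance.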
Storage vendors have devised a way to ensure no data loss over any distance, however, by a configuration involving a third copy as shown in Figure 7-3 on page 240. This solution requires a nearby synchronous replica and a remote asynchronous replica. The data from the production site is replicated synchronously to a backup site within 96.5 km (60 miles), which is Site 2 in Figure 7-3 on page 240, and replicated asynchronously to a remote site, Site 3, any distance away. As long as only one of the three sites is lost in a disaster, it is always possible to recover all the data from the remaining two sites. In the diagram in Figure 7-3 on page 240, if Site 1 is lost in a disaster, the synchronous copy at Site 2 holds all the data up to the moment of the disaster. From there, the data can be replicated asynchronously to Site 3, the actual recovery site, therefore extending zero data loss all the way to Site 3. It works, but the added replica and site can be expensive.
Figure 7-3 Zero data loss replication over any distance
Several vendors support an optimized version of the second site called a “bunker site” where only the blocks not yet replicated are stored and no others. The list of the blocks that have not yet been replicated is typically a small list, so a bunker site can be configured with minimal storage space, which reduces the overall cost of this solution. IBM Asynchronous Cascading Peer-to-Peer Remote Copy (PPRC) is an example of this three-site zero data loss solution.
Comparing the replication options
What makes host-based or storage-based replication better than database-based replication? First, storage-based replication has the advantage that it allows a single replication product to be used for all data. With database-based replication, the database is replicated separately from the rest of the data, which can lead to inconsistency between the databases and the other data stored in a file system, such as content data. Second, using a common replication product for all data also simplifies the DR solution, which leads to less required training of system administrators and less total cost of ownership overall. Third, synchronous storage-based replication prevents any data loss other than incomplete transactions. Database-based replication typically is asynchronous and thus is vulnerable to more data loss in a disaster. Host-based replication shares these three advantages over database-based replication.
Lastly, storage-based replication is implemented entirely by the storage device. Database-based or host-based replication runs on the server and takes up server resources. (Vendors of host-based replication products counter that the load on the server is minimal and just a small percent.)
Why choose database-based replication over storage-based replication after you see these disadvantages? The key reason is the lower recovery time that can result from the database replica being in a cleaner state and therefore requiring less recovery processing. A database replicated via its native replication facility is always in a clean database transaction state, so no incomplete database transactions have to be rolled back when the backup database is activated. This allows the system to recover more quickly, which can be viewed as more critical than a small amount of data inconsistency, when minimal recovery time is of paramount importance. Moreover, if all the content and property data is stored in the database, which is an option with the P8 Content Manager for small deployments, database-based replication has no consistency disadvantage or cost of ownership disadvantage.
7.5.2 Automated site failover
The second key technology that is used in many disaster recovery solutions is automated site failover. Some software vendors offer a product to do this called a global cluster option or a geographic cluster manager. We use the generic term global cluster manager here to distinguish it from geographically dispersed clustering, which we described previously. Recall that geographically dispersed clusters are still clusters in the sense of a heartbeat between the nodes and failover if the active server fails; they just have the servers dispersed over a distance as great as 100 km (62 miles). A global cluster manager, however, extends an ordinary server cluster with the capability to oversee multiple sites that are any distance apart. It manages local server clusters at each site, controls replication between sites, and updates Domain Name System (DNS) servers to redirect users to the recovery site system. Its major function is to automate most or all of the process of failing over from a production site to a recovery site after a disaster.
Most organizations prefer to have at least one manual decision step before declaring a disaster, because of the gravity and cost of switching all operations and users to a recovery site. But after that decision has been made, a global cluster manager can automate the rest of the process. This is advantageous, because automating the process reduces the chances of human error, makes the process repeatable and testable, and thus increases the chances of a successful site failover in the highly stressful period following a disaster. Symantec Veritas Global Cluster Option is one example of a global cluster manager. The Geographic Logical Volume Manager (GLVM) configuration of IBM PowerHA SystemMirror® Enterprise Edition offers some global cluster management features for the AIX platform.
In the absence of a global cluster manager, server command-line scripting is another way to automate key parts of a site failover.
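Such a script might be structured as an ordered list of commands executed until the first failure. This Python sketch is hypothetical: the commands `promote_replica`, `start_stack`, and `update_dns`, and the host name, are placeholders for whatever your replication product, application stack, and DNS provider actually require.

```python
import subprocess

# Hypothetical failover runbook: every command name here is a placeholder,
# not a real CLI. Order matters: promote the replica before redirecting users.
FAILOVER_STEPS = [
    ["promote_replica", "--site", "recovery"],    # make the replica writable
    ["start_stack", "--site", "recovery"],        # databases, app servers
    ["update_dns", "--name", "app.example.com", "--target", "recovery-site"],
]

def fail_over(steps, runner=subprocess.run, dry_run=True):
    """Run the steps in order, stopping at the first failure.
    With dry_run=True (the default here), only the planned commands return."""
    executed = []
    for cmd in steps:
        if not dry_run:
            result = runner(cmd, check=False)  # a real script would log output
            if result.returncode != 0:
                break  # do not redirect users to a half-started site
        executed.append(cmd)
    return executed

plan = fail_over(FAILOVER_STEPS)
print(len(plan))  # 3 planned steps
```

Keeping the runbook as data (a list of commands) makes the failover sequence reviewable and testable in a dry run, which supports the regular testing that DR best practices call for.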
7.5.3 Disaster recovery approaches
IBM ECM Lab Services defines three common approaches for disaster recovery:
Build it when you need it.
Third-party hot site recovery service.
Redundant standby system.
Build it when you need it
The lowest cost approach, but the slowest and the hardest to test, is to build a replacement system after a disaster has occurred. There is nothing in place prior to a disaster, which makes it extremely low cost, but it allows no testing either. This approach has an RTO of days to weeks.
Third-party hot site recovery service
The second approach is to contract with a third party for a hot site recovery service. Third parties, such as SunGard, IBM, and HP, have shared recovery sites around the world that you can reserve by contract for use in the event of a disaster. This approach costs more than the first approach, of course, but it also offers a shorter recovery time, because the site is equipped and hot at the point of disaster. Data has to be restored at the hot site, but no hardware has to be acquired or configured. The third-party providers include regular testing of failover to their site as a part of their service. IBM ECM Lab Services has an offering to assist you in setting up and testing the hot site and activating it in the event of a disaster. This approach has an RTO of hours to days.
Redundant standby system
The third and most frequently chosen approach is a standby redundant system in place at a client-owned and operated remote recovery site or at a third-party site. This approach is the highest cost approach, because the cost of the redundant system is not shared with anyone else. But it offers the shortest recovery time, particularly if the data replica is constantly updated and available for use. It also can be tested on a regular basis, which is in keeping with best practices for ensuring that a disaster recovery plan will actually work as expected when needed. This approach has an RTO of minutes to hours.
Comparing the costs and technologies
No matter which of these DR options you choose, it is essential to have a copy of the data off-site. Table 7-5 on page 244 summarizes the data backup or replication choices and costs, as well as the recovery site choices. Table 7-5 on page 244 shows the relationship between recovery time, recovery point, and the type and cost of data replication required to achieve that recovery time and recovery point. Like high availability choices, the choices for disaster recovery become exponentially more expensive as RTO and RPO approach the minimums of hours to minutes. The cost increase is due to the changes in disaster recovery technologies required to meet increasingly more ambitious recovery times and points.
For an RTO of three days or more, the minimum level of data replication, backing up to tape, is sufficient. As we noted earlier, a form of point-in-time backup, such as tape backup, is always required, regardless of RTO, as a means of recovering from data corruption or accidental deletion. The solution is to retrieve the latest backup tape or other point-in-time backup from the off-site storage location and restore the data to a point in time prior to the corruption or deletion of the data. Full data restoration from tape is a slow and laborious process, which typically involves a full backup tape and a number of incremental backup tapes after that, which takes days for completion. Backups are done periodically, usually once a day, possibly multiple times a day, so the RPO for this minimum solution is hours to days of lost data.
Periodic replication to off-site storage characterizes the next two solutions up the cost curve. They add the cost of communications links but provide an RPO and RTO of hours, not days. Periodic point-in-time backup to remote storage, usually disk storage, is the first step up from standard local tape backup. The next step up consists of shipping database or file system update logs to the remote recovery site, where they are applied to a copy of the data to bring it up-to-date with each log. Both are done on a periodic basis, but as the period shortens, the approach converges on continuous replication, which is the next step up the cost curve.
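The log-shipping idea described above can be sketched as follows. This is a hedged illustration of the concept, not a real replication API: the primary site appends update records to a log, and at each shipping interval the accumulated log segment is sent to the recovery site and applied to the standby copy.

```python
# Conceptual sketch of periodic log shipping; all names are illustrative.

class LogShippingPair:
    def __init__(self):
        self.primary = {}      # live data at the production site
        self.standby = {}      # replica at the recovery site
        self.pending_log = []  # updates not yet shipped

    def write(self, key, value):
        self.primary[key] = value
        self.pending_log.append((key, value))

    def ship_logs(self):
        """Run once per shipping interval: send the pending log segment
        to the recovery site and apply it to the standby copy."""
        for key, value in self.pending_log:
            self.standby[key] = value
        self.pending_log.clear()

pair = LogShippingPair()
pair.write("doc1", "v1")
pair.ship_logs()
pair.write("doc2", "v1")   # not yet shipped: lost if disaster strikes now
print(pair.standby)        # {'doc1': 'v1'}
```

The standby copy always lags by at most one shipping interval, which is why the RPO of log shipping equals the shipping period, and why shortening that period approaches continuous replication.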
Table 7-5 Range of disaster recovery solutions

Recovery time      | Recovery point                | Cost          | Technologies
Minutes to an hour | Zero data loss                | $$$$$$$$$$$$$ | Hot standby site, synchronous replication, or global clustering
1 - 6 hours        | Minutes of data lost          | $$$$$$$$$     | Hot or warm standby site, asynchronous replication, or global clustering
6 - 12 hours       | Hours of data lost            | $$$$$         | Warm standby site, continuous or periodic replication, or log shipping
12 - 24 hours      | Hours to days of data lost    | $$$           | Warm or cold standby site, or periodic backup to remote storage
Days to weeks      | One or more days of data lost | $             | Cold or no standby site, or nightly tape backups shipped off-site
The cost now starts to accelerate upward. As the name implies, continuous replication is the process of replicating data to the recovery site as it changes, that is, on a continuous basis. Near-continuous and continuous replication greatly decrease the potential for data loss compared to periodic replication, bringing the RPO down to seconds' worth of data loss, or even zero data loss with synchronous replication.
Disaster recovery time is similarly decreased with synchronous and asynchronous replication, because the data is kept continuously in sync, or close to it, at both sites. In the event of a disaster, no time is needed to bring the data up-to-date, as there is when restoring from backup, periodic replication, or log shipping, but time might be required to configure and bring up a duplicate of the application environment on the replicated data. The RTO is in the range of hours in that case. If a complete application environment is maintained at all times at the recovery site, and global clustering is used to automate and speed site failover, the RTO can be in the range of just minutes.
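The difference between synchronous and asynchronous replication described above can be sketched in a few lines. This is an illustration of the assumed semantics only, not any vendor's replication API: a synchronous write is acknowledged only after the remote copy has it, while an asynchronous write is acknowledged immediately and replicated in the background.

```python
# Conceptual sketch: synchronous vs. asynchronous replication semantics.

class ReplicatedStore:
    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.primary = {}
        self.replica = {}
        self.in_flight = []   # async writes awaiting replication

    def write(self, key, value):
        self.primary[key] = value
        if self.synchronous:
            # Synchronous: the write completes only after the remote site
            # also has it, so a disaster loses nothing (RPO = 0).
            self.replica[key] = value
        else:
            # Asynchronous: acknowledge immediately, replicate later.
            self.in_flight.append((key, value))

    def drain(self):
        """Background replication catching up."""
        for key, value in self.in_flight:
            self.replica[key] = value
        self.in_flight.clear()

sync_store = ReplicatedStore(synchronous=True)
sync_store.write("doc1", "v1")
print(sync_store.replica)      # {'doc1': 'v1'}: zero data loss

async_store = ReplicatedStore(synchronous=False)
async_store.write("doc1", "v1")
print(async_store.replica)     # {}: in-flight write lost if disaster strikes
```

The trade-off is latency: the synchronous path makes every write wait on the remote site, which is why distance limits synchronous replication in practice.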
7.6 Best practices
Having defined the concepts of high availability and disaster recovery and having detailed the key technologies and approaches used for HA and DR solutions, what are the best practices for configuring P8 Content Manager for high availability and disaster recovery from the available options and approaches?
Best practices for high availability
We start with high availability, which is summarized on the left side of Figure 7-4 on page 246. For P8 systems that will be accessed from an untrusted network, including the public Internet, a DMZ is required for shielding the P8 system from attack. Users or client applications on an untrusted network are shown on the far left, with a DMZ interposed between them and the P8 system. The DMZ consists of an external firewall shielding the DMZ from the untrusted network, a pair of load-balanced HTTP servers behind the external firewall, and an internal firewall shielding the internal, trusted network from the DMZ. The HTTP servers in the DMZ intercept all HTTP traffic from the untrusted network and forward traffic (for authenticated users only) through the internal firewall to the P8 system on the trusted network.
External users typically use a URL address for the P8 web applications that references a public HTTP port, such as port 80, by default. For authenticated users, the HTTP servers in the DMZ map that public port to the specific private port configured for the intended P8 web application, enabling the users’ requests to be forwarded through the internal firewall to the P8 web application server.
The first tier of the P8 architecture, to the right of the DMZ inside the trusted internal network, is the web and presentation tier. For this tier, which hosts the IBM FileNet P8 Application Engine, Workplace, FileNet Workplace XT, and IBM Content Navigator predefined web applications, as well as custom applications, the best practice is load-balanced server farms. All the servers in this tier are active. Incoming user and client HTTP requests are directed to the load balancer via virtual host names mapped to virtual IP addresses assigned to each application, and then distributed by the load balancer across the servers running those applications. IBM FileNet P8 eForms, IBM Enterprise Records, and IBM Case Manager are also hosted on this tier and thus must be deployed in load-balanced server farms for high availability.
Figure 7-4 Recommendation for IBM FileNet P8 5.2
At the business logic tier, sometimes also called the services tier, the HA best practices for the core P8 components shown in Figure 7-4 are all load-balanced server farms. Only a few optional P8 components, not shown in Figure 7-4, require active-passive server clustering because they do not support active-active load balancing. Process Simulator is an example. (Many clients choose not to make Process Simulator highly available, because it does not play a runtime production role.)
Two or more P8 Content Platform Engine servers must be deployed in a load-balanced server farm when high availability is required.1 The Content Platform Engine has been qualified with both hardware and software load balancers.
 
Note: A Content Platform Engine deployment typically requires both Java EE software load balancing via Java application server load balancing (for example, WebSphere Application Server Network Deployment in WebSphere), as well as HTTP load balancing via hardware or software load balancing, when deployed for high availability.
Most client applications, such as IBM Content Navigator and IBM Case Manager, use the Java EE EJB interface and transport when they access the Content Platform Engine. They therefore require the Content Platform Engine servers to be deployed on the clustering version of a Java EE application server (WebSphere Application Server Network Deployment, in the case of WebSphere) to provide the Java EE software load balancing that the EJB transport needs. Other client applications, such as IBM Content Collector, use the content and process web services interface and transport when interacting with the Content Platform Engine, and therefore require HTTP load balancing for the Content Platform Engine. The typical P8 system thus requires both forms of load balancing for a highly available Content Platform Engine: Java EE load balancing, such as WebSphere Application Server Network Deployment, and HTTP load balancing. Figure 7-5 on page 248 shows both, with Java EE load balancing implemented by WebSphere Application Server Network Deployment WLM and HTTP load balancing implemented by a pair of hardware load balancers, for a pair of Content Platform Engine servers.
IBM FileNet Image Services repositories can be federated with the P8 Content Manager via Content Federation Services. Image Services must be deployed in active-passive server clusters for high availability; it does not support being deployed in load-balanced server farms.
At the data tier, all the database servers can be deployed in active-passive server clusters for HA, such as DB2 HADR. In addition, DB2 pureScale and Oracle RAC are active-active load-balanced alternatives.2 The Content Platform Engine makes use of network file shares for file storage areas for content storage and index areas for content-based search indexes, so the network file servers or NAS devices underlying the Content Platform Engine file storage areas and index areas need to be highly available as well. For a network file server, the typical HA configuration is an active-passive server cluster; NAS devices typically have internal support for either active-active or active-passive configurations for HA. NAS devices are purpose-built for high performance and scalability, so they generally scale and perform much better than generic server clusters providing a network share to SAN storage.
Figure 7-5 High availability best practices for P8 5.2 with protocol detail
Best practices for disaster recovery
For disaster recovery, the best practice depends on RTO and RPO. In all cases, point-in-time backup to tape or disk is the best practice for protection against data corruption and accidental or malicious deletion. For any RTO/RPO values less than days to weeks, we suggest data replication to a remote standby site, as shown in Figure 7-4 on page 246. At the high end, with RTO and RPO in the range of minutes to hours, the best practice is a dedicated warm standby recovery site with automated site failover and near-continuous to continuous replication. Zero data loss requires synchronous replication, to a bunker or intermediate site if the distance to the remote recovery site is too great. For the absolute minimum RTO, on the order of minutes, database-based replication, in addition to storage-based or host-based replication for the other data, is the best practice. For the best data consistency after a disaster, at the risk of adding minutes to an hour of database recovery time to the RTO, the best practice is a single replication mechanism for all data, combined with consistency groups.
The best practice for redirecting the user community to the replacement systems at the recovery site is via DNS updates or DNS load balancers, such as F5 BIG-IP Global Traffic Manager, Cisco Global Site Selector, or similar products from other network device vendors. The users' client computers must locate the P8 Content Manager services through DNS aliases (CNAMEs), so that after a disaster the aliases can be redirected through DNS updates or DNS load balancing. This redirection allows reconnection to the recovery site without any client computer changes. The DNS servers or DNS load balancers themselves must be redundant, of course, to avoid becoming a single point of failure.
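The redirection mechanism can be sketched as follows. The alias table stands in for a DNS zone, and the hostnames are hypothetical examples, not names from any real deployment: clients always resolve the same alias, and only the record behind it changes after a disaster.

```python
# Conceptual sketch of DNS-alias (CNAME) redirection for site failover.
# Hostnames are illustrative placeholders.

dns_aliases = {
    # CNAME alias used by all clients -> host currently serving it
    "content.example.com": "p8-prod.example.com",
}

def resolve(alias):
    """Stand-in for a DNS lookup of the alias."""
    return dns_aliases[alias]

# Normal operation: clients connect through the alias.
print(resolve("content.example.com"))   # p8-prod.example.com

# After a disaster, only the DNS record changes; clients keep using the
# same alias and are transparently redirected to the recovery site.
dns_aliases["content.example.com"] = "p8-dr.example.com"
print(resolve("content.example.com"))   # p8-dr.example.com
```

Because no client configuration references the production host directly, no client-side change is needed at failover time; this is exactly why the text insists that clients use the aliases rather than the real host names.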
Combining HA and DR into a single solution
There is a common temptation to try to simplify business continuity by combining high availability and disaster recovery into a single solution. The idea is to locate a second site within the same metropolitan area as the production site and make both sites active with each site having a full copy of the data. This is a workable approach when the data being managed is essentially static, as in a corporate website. Changes to the website are carefully reviewed and managed and then pushed out to multiple hosting sites in parallel, and incoming user requests can be load-balanced across the sites. If one of the sites goes down or even is lost in a disaster, user requests can be directed to the other site for continuous access to the largely static content (assuming the second site is far enough away to be out of the disaster’s impact zone).
Why does this approach not work with Content Manager? The key is the nature of the data and how it must be managed. P8 Content Manager, as the name suggests, is designed to manage rapidly changing and growing collections of data that are being accessed and modified in parallel by users across an enterprise. Unlike the largely static data of a corporate website, which is published or released to the site in a carefully controlled authoring and publication process, content in a typical P8 Content Manager object store is being collaboratively created, authored, enhanced, deleted, and processed in a dynamic manner under transaction control to avoid conflicting changes. As a result, only a single active copy of the data can be online and changeable at any point in time, so that transaction locking can be enforced and changes are saved in a safe, consistent manner. Therefore, the basic idea of two sites, each with an active copy of all the content, is not the best practice for a transactional system, and it is not supported by P8 Content Manager.
A related temptation is to deploy a disaster recovery solution with a standby (inactive) copy of the data at the recovery site and depend on this single solution for both high availability and disaster recovery. This can be done with P8 Content Manager, but there is a clear trade-off that you need to carefully consider. Relying on a disaster recovery configuration for high availability compromises the availability target for the system, because any failure leads to a full site failover as though the entire production site had been lost in a disaster. A site failover is a time-consuming, complicated process that necessarily takes much longer than a single server failing over to a local passive server in a cluster, and far longer than the nearly instantaneous switch to another, already-active server in a server farm. The net result is that high availability (in the range of 99.9% and higher) is not achievable when every local failure triggers a full site failover (and later a full site failback to return to a protected state).
How about using geographically dispersed farms and clusters, that is, with the farms and clusters split between the two sites? If one server fails, the server at the other site takes over, either coming up at the time of failure in an active-passive server cluster or simply taking on redirected client requests in server farms. Again, there is an availability trade-off because of the added risk of communication problems between the two sites. We do not recommend geographically dispersed farms and clusters as best practice because of the added risk and higher networking costs.
So the best practice is to deploy local server farms and clusters for high availability in order to provide for continuing service in the event of local component failures and to deploy a second site with data replication and, optionally, global clustering, to provide for rapid recovery from disasters. The best practice is to locate the recovery site outside the disaster impact zone of the production site.
7.7 Reference documentation
For additional information on high availability, see the IBM FileNet Version 5.2 Information Center section devoted to high availability:
See the IBM Redbooks publication IBM High Availability Solutions for IBM FileNet P8 Systems, SG24-7700, for more details on P8 (and IS) high availability deployments:
See the IBM Redbooks publication Disaster Recovery and Backup Solutions for IBM FileNet P8 Version 4.5.1 Systems, SG24-7744, for more details on P8 (and IS) disaster recovery deployments:
See “Image Services 4.1.2 High Availability Procedures and Guidelines.” This document describes both high availability via clustering software (Microsoft Cluster Server or Veritas Cluster Server) and disaster recovery via data replication software (Veritas Volume Replicator; see Appendix C) for Image Services:
 

1 Prior to P8 4.0, the Content Engine supported both farming and clustering for its Object Store Services component, but only active-passive clustering for its File Store Services component. Starting with P8 4.0, these components were unified and have since supported farming across the board. Prior to P8 4.0, the Process Engine required active-passive server clustering for high availability, but has also supported farming since P8 4.0. In P8 5.2, the Content Engine and Process Engine were merged together into the Content Platform Engine.
2 IBM Case Manager 5.1.1 only supports active-passive database clusters, due to a Business Space constraint.