CHAPTER 4
SELECTING A STRATEGY
Setting the Direction

However beautiful the strategy, you should occasionally look at the results.
—Sir Winston Churchill

INTRODUCTION

With the results of the Business Impact Analysis and risk assessment in hand, it is time to select a recovery strategy. The recovery strategy is the overall direction for planning your recovery. It provides the “what” of your recovery plan; the individual plans provide the “how.” An approved strategy keeps the company’s recovery plans in sync and prevents teams from working at cross purposes.

A recovery strategy is not for restoring things to the way they were before. It is for restoring vital business functions to a minimally acceptable level of service. This minimal level of service enables the company to provide a flow of goods and services to its customers and buys time for planning a permanent recovery. There will be a separate recovery strategy for different parts of the company.

Disaster recovery planning is all defensive. Like insurance, you pay year after year so that if something does occur, you are covered. If nothing happens, then the money is spent with nothing tangible to show for it. The business benefit of disaster recovery planning (a subset of business continuity planning) is that it reduces the risk that a major company catastrophe will close the doors forever.

Another strategy deals with business continuity: keeping the company operating through “in-process disasters.” Facility-destroying disasters are rare. More common are the many local disasters that occur in a process. Business continuity returns value to the company by developing contingency plans in case a vital business function is interrupted. It also forces the company to examine its critical processes and to simplify them for easy recovery. Simpler processes are cheaper to operate, more efficient, and more reliable. The business continuity strategy is in addition to and complementary to the disaster recovery strategy.

SELECTING A RECOVERY STRATEGY

Your recovery strategy determines the future costs and capability of your overall program. All subsequent plans will be written to fulfill the required recovery time and the selected solution. A poorly selected strategy will require all plans to be rewritten when it is replaced.

Companies have long struggled with how much money to spend on a quick recovery that may never be used. A recovery strategy is a tradeoff between time and money. The faster the ability to recover (up to near instantaneous), the higher the expense. The maximum time that a company can tolerate an outage is its recovery time objective (RTO). This was identified by its Business Impact Analysis (BIA). Rapid recoveries are often favored until the initial and ongoing costs are detailed. However, a rapid recovery may also become a marketplace advantage by providing more reliable product delivery.

The RTO is measured from the time when the incident occurs. Hours spent dithering over whether to declare a disaster are hours lost toward your recovery time goal.
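As a minimal sketch of this point, the remaining recovery budget can be computed by starting the clock at the incident rather than at the declaration. The function name, the timestamps, and the 24-hour RTO below are illustrative assumptions, not figures from any particular BIA.

```python
from datetime import datetime, timedelta

def recovery_time_remaining(incident_at, declared_at, rto):
    """Return the time left to recover once the disaster is declared.

    The RTO clock starts at the incident, so any delay before the
    declaration is subtracted from the recovery budget.
    """
    deadline = incident_at + rto
    return deadline - declared_at

# Hypothetical example: incident at 3:00 PM, disaster declared six hours later.
incident = datetime(2024, 3, 4, 15, 0)
declared = datetime(2024, 3, 4, 21, 0)
remaining = recovery_time_remaining(incident, declared, timedelta(hours=24))
print(remaining)  # 18:00:00
```

In this example, six hours of indecision leave the recovery team 18 hours, not 24.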

The classic error is to recover the data center which then sits idle because the various departments that use the IT systems were not recovered. Companies must craft a separate recovery strategy for each significant area:

• Information Technology. Recovering a data center, internal and external network connections, and telecommunications.

• Work Area Recovery. Recovering a place for office workers along with a personal computer, telephone, printer access, etc., all securely connected to the recovered data center.

• Pandemic. Maintaining business during a public health emergency that may run for 18 months or more.

• Business Continuity. Keeping the flow of products and services to customers despite significant failures in company processes.

• Manufacturing. Recovering the flow of products after a crisis.

• Call Centers. Maintaining customer contact throughout the crisis.

Whatever is decided, the recovery strategy must be communicated throughout the recovery project. All team members must understand the company’s timeframe for recovering and the budgeted way to achieve it. It is the starting point for each recovery plan.

Recovery Point Objective

Another important factor is your recovery point objective (RPO). This is the amount of data that may be lost since your last backup. If your IT systems recover to the point of their last backup, perhaps from the night before, and the incident occurred at 3:00 PM the next day, then all of the data changes from the time of the last backup up to that 3:00 PM incident must be re-created after the data center is recovered. If not, the information is lost. Consider how many people take orders over the telephone and enter them directly into the order processing system. How many orders are shipped to customers with only online documentation? How many bank transfers are received in a day? In the past, this data might be reentered from paper documents. However, most of the paper products have been discontinued. Where will this data come from?
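The exposure described above can be estimated with simple arithmetic. The sketch below assumes a nightly backup at 11:00 PM and a flat average of 40 orders entered per hour; both values, and the function names, are hypothetical and should be replaced with your own figures.

```python
from datetime import datetime

def data_loss_window(last_backup_at, incident_at):
    """Return the span of time whose data changes exist only in the lost systems."""
    return incident_at - last_backup_at

def orders_to_recreate(window, orders_per_hour):
    """Estimate how many orders must be re-created, assuming a flat hourly rate."""
    hours = window.total_seconds() / 3600
    return round(hours * orders_per_hour)

# Hypothetical example: backup at 11:00 PM, incident at 3:00 PM the next day.
last_backup = datetime(2024, 3, 3, 23, 0)
incident = datetime(2024, 3, 4, 15, 0)
window = data_loss_window(last_backup, incident)
print(window)                          # 16:00:00
print(orders_to_recreate(window, 40))  # 640
```

A flat hourly rate overstates overnight activity for most businesses, so treat the result as an upper bound for planning purposes.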

Time

We live in a “right now” world. Will the company’s customers wait a week while someone cobbles together a data center to restore the data or for someone to answer customer service questions? The amount of time required to recover a company’s vital business functions is the first question. Can your company survive if it loses a day’s worth of data? The BIA identified your RTO. The recovery strategy for all plans must meet this time goal. The RTO typically drives the cost of the entire program.

Distance

The distance between the primary and backup recovery sites depends on the risk assessment. Wherever you go, the recovery site must be far enough away so that the same catastrophe does not strike both sites. Wide-area disasters, such as floods, earthquakes, and hurricanes, can impact hundreds of square miles. Use your personal experience and that of the BCP team to identify areas that are not likely to be affected by the same risks.

The farther away your recovery site is, the more likely that the team must stay there overnight. This requires additional expense for hotel rooms, catered food, etc. However, there is a point where a recovery site is too far away. It is not unusual for a company to depend on a critical employee who is also a single parent. These people cannot stay away from home for extended periods.

In many cases, the distance is determined by the type of local threats from nature. If your company is located on a seacoast that is susceptible to hurricanes, then the recovery site may be hundreds of miles inland to avoid the same storm disabling both sites. The same would be true in a floodplain such as along the Mississippi River. However, if you are located in the Midwest, then a one-hour distance for a recovery site may suffice.

You cannot foresee everything that might go wrong. After the terrorist attack of September 11, 2001, many New York companies activated their disaster recovery plans. Since their recovery sites were hundreds of miles away, they had planned to fly to them. Who would have predicted that all of the country’s civil air fleet would be ordered to remain grounded for so many days? In the end, driving was the only way to get there, delaying most recoveries by at least a day.

Recovery Options

Recovering a data center is different from recovering a warehouse, which in turn is different from recovering a call center. In the end, all strategies come together to restore a minimal level of service to the company within the RTO.

The primary recovery strategies are to:

• Recover in a Different Company Site. This provides maximum control of the recovery, of testing, and of employees. Some companies split operations so that each facility can cover the essential functions of the other in a crisis. The enemy of this approach is an executive’s desire to consolidate everything into one large building to eliminate redundancies.

• Subscribe to a Recovery Site. This leaves all of the work of building and maintaining the recovery site to others. However, in a wide-area disaster (such as a hurricane), the nearest available recovery site may be hundreds of miles away since other subscribers may have already occupied the nearest recovery sites.

• Wait Until the Disaster Strikes and Then Find Some Empty Space. This approach requires lots of empty office and warehouse space that is already wired, etc. All we need to do is to keep tabs on availability and when needed, take out a lease on short notice. This approach results in a long recovery time but is the least expensive.
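The tradeoff between time and money across these options can be sketched as picking the least expensive option whose recovery time fits within the RTO. The recovery hours and annual costs below are invented placeholders for illustration only, and the `Option` class and function name are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Option:
    name: str
    recovery_hours: float  # estimated time to restore minimal service
    annual_cost: float     # estimated ongoing cost of keeping the option ready

def cheapest_within_rto(options, rto_hours):
    """Return the lowest-cost option that recovers within the RTO, or None."""
    feasible = [o for o in options if o.recovery_hours <= rto_hours]
    if not feasible:
        return None
    return min(feasible, key=lambda o: o.annual_cost)

# All figures are hypothetical placeholders, not benchmarks.
options = [
    Option("second company site", 24, 400_000),
    Option("subscribed recovery site", 72, 150_000),
    Option("find empty space after the disaster", 336, 10_000),
]
print(cheapest_within_rto(options, 96).name)  # subscribed recovery site
```

A result of `None` signals that no option on the list meets the RTO, which means either the budget or the RTO itself must be revisited.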

IT RECOVERY STRATEGY

IT systems were early adopters of disaster recovery planning. However, as technology has evolved, so have expectations for how quickly they must recover. Today’s companies keep almost all of their data in their computer systems. Without this information, they stop working altogether. The time and expense to completely re-create it is unacceptable. Companies examining their alternatives must face up to the high cost of immediate recovery versus the lower cost of slowly rebuilding in a new site. IT recovery steps (even for a temporary facility) include rebuilding:

• Environmental. IT equipment must stay within a specific temperature and humidity range.

• Infrastructure. External network connections from the local service provider into the data center and throughout the recovered data center; critical servers used by application servers, such as a domain controller, DNS, DHCP, etc.

• Applications. Company-specific software used by the business to address customer and internal administrative requirements.

• Data. The information needed by the company’s business departments to support the flow of products and services.

In the past, the issue was to have a standby recovery site ready to go when needed. This model is based on reloading software and data from backup media (typically magnetic tape). However, this recovery strategy takes days. Ideally, when company data is loaded onto backup media, vital data is separated from nonvital data. Few companies bother to do this. The result is shuffling media in and out of a loader to load critical files while the company waits for a recovery. Refer to Figure 4-1 for a list of IT disaster recovery solutions; these fall into several general categories from slowest to fastest.

How much can these solutions cost? A hot-site contract will cost about as much per month as leasing your existing data center equipment. In a crisis, you must pay the monthly fee for each day of use. So, if you use a hot site in a disaster for 12 days, you might pay the same as you would for a year of disaster recovery coverage.
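The arithmetic behind this comparison is shown in the short sketch below. The $50,000 monthly fee is a hypothetical placeholder, and the pricing rule (one monthly fee charged per day of use) is taken from the paragraph above rather than from any specific vendor contract.

```python
def annual_subscription_cost(monthly_fee):
    """Cost of a year of hot-site coverage with no disaster declared."""
    return monthly_fee * 12

def hot_site_usage_cost(monthly_fee, days_used):
    """Cost of occupying the hot site, at one monthly fee per day of use."""
    return monthly_fee * days_used

monthly_fee = 50_000  # hypothetical contract price
print(annual_subscription_cost(monthly_fee))  # 600000
print(hot_site_usage_cost(monthly_fee, 12))   # 600000
```

Under this pricing rule, every 12 days in the hot site costs as much as another full year of coverage.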

Recovery solutions, such as hot sites, are expensive. A popular solution is for a company to establish a second company data center about one hour’s drive from the main data center. This location should use a different power grid and telecommunications company link than the main facility. A one-hour drive enables workers to sleep at home every night. (Remember that some of the employees will live in the opposite direction from the recovery site and the drive might be two hours each way.) This is especially important for single parents. Hopefully this is far enough away that the same wide-area disaster cannot strike both.

To prepare the recovery site, move to the second data center all of the test servers for the critical IT systems. Also move servers for the noncritical systems. Include adequate disk and network support. This provides equipment that is ready in a disaster, but not sitting idle. To save more time on recovery, mirror the critical data between the data center and the recovery site. Data replication requires a high-speed data connection with replication equipment at each end. The costs include data replication controllers at each end and a significant set of disk drives at the recovery site.

Let someone else do it. Application Service Providers (ASPs) provide data processing equipment, software licenses, and services to companies. Instead of operating your own data center, you run on their equipment at their site. Require that they maintain a Business Continuity Program. If this is your strategy, you must witness and audit their tests to ensure they provide the level of protection that you expect. The advantage is that this is their line of business and they will be more efficient at writing these plans and recovering at a different site. Ensure that the ASP is contractually required to meet your RTO irrespective of its commitments to other customers.

FIGURE 4-1: IT disaster recovery solutions.

Recommended IT Recovery Strategy

Establish a second company site at least a one-hour drive away in a place that is on a different power grid and data network. In this site, operate the company’s primary production data center. Ensure this satellite office has telecommunications and network capacity to provide for a 25% surge in employees. Place the “Test” IT systems and noncritical IT equipment in the company headquarters building. The reasons for recommending this option include:

• If the headquarters offices are destroyed, the data center is safe, or vice versa. Then we only have to recover from one disaster at a time.

• We can continue telephone contact with our customers if either office fails and our customers will see only a slight drop in service.

• Using test servers as a backup data center avoids expensive “just-in-case” machines sitting idle. In an emergency, the test servers become the production machines for the applications they already support. Noncritical servers are repurposed for critical systems support.

• The company’s application software is already on disk; we only need to load the current version.

• We know the alternate data center is connected to a live network because we use it daily.

• The company controls security access and facility maintenance of both sites.

• Backup media can be maintained in the headquarters facility (except for archive copies), which means savings on third-party storage for short periods.

• Potential to add data replication to avoid time lost loading data from tape and to minimize data losses.

• Recovery tests can be scheduled whenever we wish.

Example IT Recovery Strategy

The myCompany Data Center Disaster Recovery Strategy provides general guidance for critical system recovery after an incident renders the myCompany Data Center unusable. A recovery site has been prepared at our Shangrila data center that is about a one-hour commute from the existing work site. This recovery site is on a separate power grid and telecommunication connection. This site is also furnished and equipped to accommodate 75 office workers.

To facilitate this recovery, myCompany has located all test servers (and adequate disk storage) at the backup Data Center and keeps production IT equipment at myCompany. The underlying assumption is that the test system hardware is an adequate substitute for the critical systems (CPU & RAM), and that each critical system has a corresponding set of test servers. In this way, myCompany has an operational hot site that is proven to work (idle sites tend to develop unnoticed problems).

Under this approach, servers in the backup Data Center are already loaded with the necessary version and patch level of the operating system. During a disaster, the test system is offloaded to tape or removable media. The equipment is then loaded with the current production version of the application (which should be present on their local disk drives).

All critical data is mirrored between the operational data center and the backup site. The estimated recovery time is in seconds with minimal data loss.

Reasons for this selection are:

• Quick recovery at the lowest cost.

• The recovery site is under myCompany control.

• Segregating test servers facilitates testing of DR plans.

• Keeps production data in myCompany for easier backups.

WORK AREA RECOVERY STRATEGY

The general term for recovering damaged offices is “work area recovery.” A common disaster recovery error is to focus solely on the IT recovery without providing a place from which to access it.

On September 14, 2008, the remnants of Hurricane Ike swept through the Ohio Valley with sustained winds equal to a Category 1 hurricane. This resulted in widespread power outages that lasted for many days. The author worked for an organization whose generator promptly roared to life and kept the data center in operation even though none of the offices were wired for backup power. The people arrived at work to hear the generator running but no lights inside or power for desktop PCs. No one had any place to work, so they were all sent home. After several (expensive) days of running the generator, power was restored to that portion of the city.

Just like your IT recovery strategy, the work area recovery strategy must execute in a prepared site. It does not take that long to run electrical connections down the middle of a conference center, string some network wiring, and erect work tables and chairs. The longest delay is the time required to add adequate bandwidth to the outside world (which includes the data center recovery site). Without this external connection of adequate size, the recovery is hobbled or delayed. If the disaster covers a wide area, it may be weeks before the telecom connection is ready.

In a crisis, only the personnel essential to operating the critical IT systems, required to answer customer calls, or necessary to fulfill legal requirements must be recovered immediately. The rest of the offices can be recovered over time. Employees equipped with Virtual Private Network (VPN) authentication may connect to the data center through secure connections. Scarce work area can be maximized by adding a second shift for staff who do not directly work with customers (such as the Accounting department).

One option for recovering offices is through the use of specially equipped office trailers. These units come with work surfaces, chairs, generators for creating their own electrical power, a telephone switch, and a satellite connection to bypass downed lines. When on site, these trailers are typically parked in the company parking lot to use any surviving services—and then provide the rest.

Beware of counting on hotels as large-scale work area recovery sites for offices. Like everyone else, they watch their costs and do not want a monthly bill for data capabilities far in excess of what is normally used. A T-1 provides sufficient bandwidth for a hotel and its guests but not enough to support 100 office workers filling the conference rooms. Also the hotel switchboard will lack capacity for busy offices.
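A rough calculation shows why a single T-1 falls short. The sketch below uses the standard T-1 line rate of 1.544 Mbit/s and assumes, for simplicity, that the link is divided evenly among the office workers with nothing reserved for hotel guests; the function name is illustrative.

```python
T1_BITS_PER_SECOND = 1_544_000  # standard T-1 line rate

def per_user_kbps(link_bps, users):
    """Return each user's even share of the link in kilobits per second."""
    return link_bps / users / 1000

# 100 office workers sharing one T-1, ignoring hotel guest traffic.
print(round(per_user_kbps(T1_BITS_PER_SECOND, 100), 2))  # 15.44
```

Roughly 15 kbit/s per person is well below even dial-up modem speeds, before the hotel's own guests are counted.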

Setting up a recovery site requires:

• A location far enough away that it is not affected by the same disaster.

• Chairs to sit on and tables for work surfaces.

• Locating together any business teams that frequently interact or exchange documents during business. Otherwise, there may be multiple work area recovery locations.

• Desktop equipment, such as computers and telephones. Loading the company software image on PCs takes time. Also, people will miss their personal data.

• Alternative communications, such as fax and modem.

• Historical documents that must be checked during the course of business.

• Preprinted forms required for legal or other business reasons.

Refer to Figure 4-2 for a list of work area disaster recovery solutions; these fall into several general categories from slowest to fastest.

Example Work Area Recovery Strategy

myCompany’s Work Area Recovery strategy is to use the company’s IT training rooms adjacent to the backup data center as temporary offices in an emergency. These classrooms are equipped with workstations on every table. A telephone switch is online and wires are run to each workstation. In an emergency, telephones can be quickly installed.

This recovery site, approximately 60 minutes of travel from the main office, is used as an off-site conference center and IT training facility. It can accommodate enough of the critical office workers to keep the company operating until permanent facilities have been prepared.

IT staff not involved with the IT recovery plan will work from home via VPN.

Executive staff is to meet in the Sleepy-Head motel conference center until a local office is ready.

The Work Area Recovery Manager is also responsible for the ongoing maintenance of the office recovery site. The recovery site must be tested semi-annually to ensure that the network and telecom connections are functional and available when needed.

FIGURE 4-2: Work area disaster recovery solutions.

PANDEMIC STRATEGY

The goal of the Pandemic Emergency Plan is for the company to continue operations at a level that permits it to remain in business. This requires steps to prevent the spread of disease into and within the organization. Actions to minimize the spread of infection represent an additional cost for the company which must be borne until the danger passes. Unlike other contingency plans, a Pandemic Recovery Plan will be in operation from 18 to 24 months.

In 2003, in response to a local outbreak of severe acute respiratory syndrome (SARS), the World Health Organization urged postponement of nonessential travel to Toronto. Some conferences scheduled for the city were canceled and hotel occupancy rate sank to half of normal. Although reported SARS cases were few, the financial impact was significant.

Pandemic emergency steps require different strategies for major stakeholders:

• Employees

a. Employees who can work from home should use a VPN connection to minimize the amount of time that they spend in the office.

b. The company sick policy must be relaxed so that sick people are not forced to come into the workplace. Anyone who is sick is encouraged to stay home. They should also stay home if they have a sick family member.

c. Areas used by company workers must be periodically cleaned thoroughly to address any infection brought in from the outside.

d. Employees who travel into areas with a high rate of pandemic infection should work from home for the first week of their return.

• Customers

a. Areas where customers enter the facility must be cleaned thoroughly to address any infection brought in from the outside.

b. Provide complimentary hand sanitation at all store entrances.

c. It may be necessary to bring in individual sanitation supplies for an extended period of time.

d. All returned products should be sanitized before examination.

• Vendors

a. Use video conferencing and other electronic tools to meet with vendors.

b. Carefully select meeting places with a low incidence of pandemic.

Example Pandemic Strategy

myCompany’s Pandemic Emergency Plan is designed to contain the potential spread of illness within the company. It is initiated when the public health authorities of the state where the headquarters building is located declare a pandemic. Limitations on the number of sick days provided to each employee in the company’s sick leave policy are suspended. Employees are encouraged to stay home with sick family members.

All company areas where employees are in close physical contact with customers or vendors must be thoroughly sanitized every day. Each employee is provided with personal sanitation gloves and face masks.

Any employee returning from a business trip will work from home via VPN for seven days before entering the office.

BUSINESS CONTINUITY STRATEGY

A successful business continuity strategy is one in which your customers never notice an interruption in service. It is a proactive plan to identify and prevent problems from occurring. To implement this in a company, begin with the list of critical processes identified by the BIA. Develop a process map for each vital process that shows every step along the way. Identify areas of risk such as bottlenecks into a single person or device, limited resources, or legal compliance issues. Mitigate each point of risk by implementing standby equipment, trained backup personnel, etc.

A severe blizzard in Minnesota is not visible to a customer in Arizona who is waiting on a rush order. An instant failover for IT systems is essential for online companies, banking, hospitals, vital government service offices, public utilities, etc. The dollar loss of customer impact is so high that it justifies the high cost. Other companies regret the interruption but are not so real-time with their customers. As a result, they have several days to recover with minimal customer interruption. An example might be a health spa where a one-week interruption in service is overshadowed by the strong customer relationship.

A business continuity strategy deals with your company’s vital processes. It might be anything whose absence disrupts the normal flow of work. For example, many companies have eliminated their company telephone operator and replaced that person with an automated telephone directory. Key in the person’s name, and you are connected. However, if that device fails, the rest of the company is still creating and shipping products to the customer, but no one can call into the facility. A Business Continuity Plan provides information on how to recover that device or quickly replace it.

The strategy is to begin with your list of critical processes. Assign someone to develop a process map of each to identify potential single points of failure or places where the flow of products and services is constrained. Then reduce the likelihood of failure by providing duplicate (or backup) equipment for single-threaded devices, trained backup personnel, etc.

Therefore, a business continuity strategy deals with processes. It might include:

• Identification of vital processes (this list is updated quarterly).

• Drafting a process map to examine each step for single threading or weakness, such as unstable equipment or operators.

• Identification of steps to eliminate (simple processes are easiest to recover).

• Drafting a risk assessment for the process.

• Drafting an end-to-end recovery plan for each remaining step in the process.
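The process-map review described above can be sketched in a few lines: record which people or devices can perform each step, then flag any step that depends on exactly one of them. The process map, step names, and function name below are hypothetical and stand in for whatever your own process map contains.

```python
def single_points_of_failure(process_map):
    """Return the steps that depend on exactly one resource."""
    return [step for step, resources in process_map.items() if len(resources) == 1]

# Hypothetical process map: each step lists the resources able to perform it.
order_flow = {
    "take order": ["phone agent A", "phone agent B"],
    "route call": ["automated telephone directory"],
    "pick and pack": ["warehouse shift 1", "warehouse shift 2"],
    "print label": ["label printer"],
}
print(single_points_of_failure(order_flow))
# ['route call', 'print label']
```

Each flagged step is a candidate for standby equipment, a trained backup person, or elimination from the process.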

CONCLUSION

Selecting a recovery strategy is an important step. Its boundaries are determined by how quickly the company must recover in order to survive. Another factor is the amount of data it can afford to lose. Once the risk from natural disasters has been evaluated, a recovery strategy can be created.

The strategy selected will drive the cost of the company’s recovery plans. Therefore it must be based on the data gathered by the Business Impact Analysis. This focuses efforts on the “vital few” processes. Each strategy selected must be approved by the project’s executive sponsor. Otherwise, most work will be lost when a revised strategy is issued.

A separate strategy must be developed for each plan. The primary plan is for recovering the data center. Next, the strategy for the work area recovery must be based on when the data center will be ready for use. The pandemic plan is different in that the crisis comes on slowly, eventually hits a peak and then gradually fades away.

In the end, it comes down to how much security the company can afford. Where possible, combine existing assets with disaster recovery requirements (such as using “test” IT servers to recover the data center) to reduce the program’s ongoing cost.
