13

TESTING YOUR PLANS

Test, Test, Test

Action is the foundational key to all success.

—PABLO PICASSO

INTRODUCTION

Writing a recovery plan is only half of the challenge. The second half, the real challenge, is to periodically test it. Everyone can relate to writing a recovery plan. “Testing” a plan sounds like you do not trust it. Testing requires expensive technician time, the equipment and facility resources to conduct a test, and the expertise to plan the exercise. Gathering all of this into one place can be difficult.

Technician time was hard enough to secure for writing the plans. The most knowledgeable people are usually the busiest. Getting them to sit down long enough to test a plan is difficult, yet essential. Testing validates that a recovery plan will work. A tested plan is much more likely to succeed than one that has never been proven. The many benefits of testing include:

• Demonstrating that a plan works

• Validating plan assumptions

• Identifying unknown contingencies

• Verifying resource availability

• Training team members for their recovery roles

• Determining the true length of recovery time, and ultimately the ability to achieve the desired company recovery time objective (RTO)

The Many Benefits of Plan Testing

Recovery plans are tested for many business reasons. An untested plan is merely process documentation. Testing a plan ensures that the document provides the desired results. The benefits of testing include the following.

TESTING REVEALS MISSING STEPS

When people write a plan, they think about a process or IT system and then write the plan so that they themselves will understand the explanation and the steps to take. In this sense, the plan is a reflection of their experience. However, in a crisis, they may not be the person who will execute the recovery. Furthermore, some people cannot break down a process into each of its individual steps. When doing the work themselves, they pick up on visual cues that prompt an action the written plan never mentions.

Therefore, the first purpose of a recovery plan test is to ensure that it includes all of the necessary steps to achieve recovery. Missing steps are not unusual in the first draft of a recovery plan. Other missing information may be IT security codes, the location of physical keys for certain offices or work areas, or the location of vendor contact information.

TESTING REVEALS PLAN ERRORS

Writing a plan sometimes introduces misleading, incorrect, or unnecessary steps. Testing the plan helps to uncover such errors.

TESTING UNCOVERS CHANGES SINCE THE PLAN WAS WRITTEN

A plan may have been sitting on the shelf for a period of time without review. Over time, IT systems change server sizes, add disk storage, or are upgraded to new software versions. Business processes move machinery and change the sequence of steps, and key support people leave the company. Testing shows where the plan no longer matches the current environment.

TESTING A PLAN TRAINS THE TEAM

After a plan has been debugged, exercising it teaches everyone participating in the recovery their role during the emergency. It is one thing to read the words on a page and another to actually carry out the steps.

Types of Recovery Plan Tests

Exercises can consist of talking through recovery actions or physically recovering something. Discussion-based tests exercise teamwork in decision making, analysis, communication, and collaboration. Operations-based tests involve physically recovering something, such as a data center, telephone system, office, or manufacturing cell. This type of test uses expensive resources and is more complex to conduct.

Everyone has their own name for the various types of testing. Tests are categorized by their complexity in setting them up and in the number of participants involved. These tests are listed here in a progression from least complex to most difficult to run:

• Standalone Testing is where the person who authored the plan reviews it with someone else with a similar technical background. This may be the manager or the backup support person. This type of testing is useful for catching omissions, such as skipping a process step. It also provides some insight into the process for the backup support person.

• Walk-Through Testing involves everyone mentioned in the plan and is conducted around a conference room table. Everyone strictly follows what is in the plan as they talk through what they are doing. This also identifies plan omissions, as there are now many perspectives examining the same document.

• Integrated System Testing occurs when all of the components of an IT system (database, middleware, applications, operating systems, network connections, etc.) are recovered from scratch. This type of test reveals many of the interfaces between IT systems required to recover a specific IT function. For example, this might be a test of recovering the Accounting department’s critical IT system, the Human Resources IT system, the telephone system, or email.

• Table-Top Exercises simulate a disaster but the response to it is conducted in a conference room. A disaster scenario is provided and participants work through the problem. This is similar to Walk-Through Testing, except that the team responds to an incident scenario. As the exercise progresses, the Exercise Coordinator injects additional problems into the situation.

• Simulation Exercises take a Table-Top exercise one step further by including the actual recovery site and equipment. A simulation is the closest that a company can come to experiencing (and learning from) a real disaster. Simulations provide many dimensions that most recovery plan tests never explore. However, they are complex to plan and expensive to conduct.

Validating the Recovery Time Objective

Testing recovery plans ensures that they can achieve the required recovery time objective. Since plans are tested in small groups, the actual RTO is determined by tracking the amount of time required to recover each IT system and business process. These plans fit into an overall recovery sequence (developed by the Business Continuity Manager). Once in this framework, the time required to complete each plan is added up (many plans execute in parallel) to determine if the RTO can be achieved.
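The arithmetic can be sketched in a few lines of Python. Because many plans execute in parallel, the overall time is not a simple sum but the longest chain of dependent plans in the recovery sequence. The plan names, durations, dependencies, and the 24-hour RTO below are hypothetical and chosen only for illustration; this is one possible way to do the calculation, not a prescribed method.

```python
from graphlib import TopologicalSorter


def overall_recovery_hours(durations, depends_on):
    """Return the elapsed hours needed to finish every plan.

    durations:  plan name -> recovery hours measured during testing
    depends_on: plan name -> set of plans that must finish first
    Assumption: plans with no unfinished prerequisites run in parallel,
    and every prerequisite also appears in `durations`.
    """
    graph = {plan: depends_on.get(plan, set()) for plan in durations}
    finish = {}
    # static_order() yields each plan only after its prerequisites.
    for plan in TopologicalSorter(graph).static_order():
        start = max((finish[d] for d in graph[plan]), default=0.0)
        finish[plan] = start + durations[plan]
    return max(finish.values(), default=0.0)


# Hypothetical test results (hours) and recovery sequence.
durations = {"network": 4.0, "database": 6.0, "order_entry": 3.0, "email": 5.0}
depends_on = {
    "database": {"network"},
    "order_entry": {"database"},
    "email": {"network"},
}

total = overall_recovery_hours(durations, depends_on)
rto_hours = 24.0  # assumed RTO, for illustration only
print(f"Estimated recovery: {total} hours; RTO met: {total <= rto_hours}")
```

In this example the longest chain is network, then database, then order entry, so email recovery does not lengthen the overall time.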

Is it a “test” or an “exercise”? A “test” implies a pass or fail result. An “exercise” implies using something and is less threatening to participants.

WRITING A TESTING STRATEGY

Testing distracts an organization from its mission of returning a profit to shareholders. Everyone is busy meeting their own company objectives. Somehow, time must be found within each department’s busy schedule to test its recovery plans. To maximize the benefit to the company while minimizing cost, develop a testing strategy for your company. This strategy describes the type and frequency of testing for recovery plans. An executive-approved testing strategy provides the top-level incentive for management compliance. The testing strategy is inserted into the administrative plan (see Chapter 7).

The testing calendar should reach out over several years. Keep in mind that different departments have their own “busy season” and trying to test at that time will be difficult. For example, the Accounting department will be occupied before and after the end of the company’s fiscal year. Payroll needs to submit tax forms at the end of the calendar year. By using an annual testing calendar, it is easier to gain commitment from the various departments to look ahead and commit to tests on specific days.

Testing follows a logical progression. It begins with the individual plan. The next level is a grouping of recovery plans to test together. This is followed by a simulation of some sort. Executives become frustrated by the length of time required to properly test all of the plans in this sequence, but they are more disturbed by the cost to test them faster.

Begin by Stating Your Goals

As with all things in the business continuity program, begin writing the testing strategy by referring back to the Business Impact Analysis. If the recovery time objective is brief (measured in minutes or hours), then the testing must be frequent and comprehensive. The longer the recovery time objective, the less frequent and comprehensive the testing may be. Considered from a different angle, the less familiar the current recovery team is with a plan, the longer it will take them to complete it.

Another issue is the severity of an incident. While the overall plan may tolerate a long recovery time, there may be specific processes whose availability is important to the company. This might be the Order Entry IT system or a critical machine tool. Consider testing those few highly critical processes more frequently than the overall plan.

The testing goal may be stated as, “Recovery plans are tested to demonstrate that the company’s approved recovery time objective of (your RTO here) can be achieved and that all participants understand their roles in achieving a prompt recovery.”

Progressive Testing

Testing follows a progression from simple to complex. Once a plan is written, it begins at the Standalone Test level and progresses from there. Any process or IT system that is significantly changed must be retested beginning at the Standalone level. The progression of testing is as follows:

• Standalone Testing is the first action after a plan is written. It reveals the obvious problems.

• Walk-Through Testing exercises a group of related plans at the same time, conducted as a group discussion.

• Integrated System Testing tests a group of related plans at the same time by actually recovering them on spare equipment.

• Table-Top Exercises test a group of related plans at the same time, based on an incident scenario.

• Simulation Exercises combine many groups for an actual recovery at the recovery sites, based on an incident scenario.
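For teams that track plan status in a script or spreadsheet export, the progression above might be modeled as follows. This is a minimal sketch: the `Level` enumeration and the `next_level` helper are hypothetical names, and the rule that a plan at the top level simply repeats a simulation is an assumption rather than something the progression itself states.

```python
from enum import IntEnum


class Level(IntEnum):
    """Test levels in order, from simplest to most complex."""
    STANDALONE = 1
    WALK_THROUGH = 2
    INTEGRATED_SYSTEM = 3
    TABLE_TOP = 4
    SIMULATION = 5


def next_level(last_passed, significantly_changed=False):
    """Return the level a plan should be tested at next.

    last_passed: highest Level passed so far, or None if never tested.
    A new or significantly changed plan restarts at the Standalone level.
    """
    if last_passed is None or significantly_changed:
        return Level.STANDALONE
    if last_passed is Level.SIMULATION:
        return Level.SIMULATION  # assumed: keep exercising at the top level
    return Level(last_passed + 1)


print(next_level(Level.WALK_THROUGH).name)
print(next_level(Level.TABLE_TOP, significantly_changed=True).name)
```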

Creating a Three-Year Testing Roadmap

Some tests only involve two people, while others can include most of the IT department. All tests require preparation time. This is necessary to coordinate schedules for people, exercise control rooms, and equipment. Copies of plans must be printed and distributed and exercise scenarios created. At a minimum, every plan should be tested annually. This can be accomplished by the manager and the process owner performing a Standalone Test to see if anything significant has changed in the process.
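The annual minimum is easy to check mechanically if the date of each plan’s last test is recorded. The sketch below assumes a simple mapping of plan name to last test date; the plan names and dates are hypothetical.

```python
from datetime import date, timedelta


def plans_overdue(last_tested, today, max_age_days=365):
    """Return the sorted names of plans not tested within max_age_days.

    last_tested: plan name -> date of last test, or None if never tested.
    """
    cutoff = today - timedelta(days=max_age_days)
    return sorted(
        name
        for name, tested_on in last_tested.items()
        if tested_on is None or tested_on < cutoff
    )


# Hypothetical records.
last_tested = {
    "Payroll": date(2023, 11, 15),
    "Order Entry": date(2024, 9, 1),
    "Shipping": None,  # never tested
}
print(plans_overdue(last_tested, today=date(2025, 1, 10)))
```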

Few companies halt operations for several days to conduct a complete disaster simulation. Instead, they test “slices” of the recovery program. For example, the test might focus on a recovery of the Operations department or Shipping department. On the IT side, this would be a group of related systems that regularly exchange information, such as order entry, materials management, and billing.

Too much testing can reduce interest in the program. Practically speaking, testing is a preventive measure (all cost and no immediate payback) and does not increase a company’s revenue. Depending on the industry, testing may never progress beyond the Table-Top exercise stage. The Business Continuity Manager works with the program sponsor to identify the adequate level of testing for the organization and then spreads it throughout the year.

When developing a testing calendar, executives will vent their frustration. They will want something that is written, tested, and then set aside as completed. They do not like to consider that completed plans must continue to be exercised regularly. There are many plans and combinations of plans to test: business processes, IT systems, work area recovery, and pandemic emergencies. A typical testing schedule includes:

• Quarterly

Inspect Command Center sites for availability and to ensure their network and telecommunication connections are live.

Data Backups

• Verify that data backups (on each media type) are readable.

• Ensure that every disk in the data center and key personal computers are included in the backups.

• Inspect safe and secure transportation of media to off-site storage.

• Inspect how the off-site storage facility handles and secures the media.

All business process owners verify that their employee recall lists are current.

Issue updated versions of plans.

• Annually (spread throughout the year)

Conduct an IT simulation at the recovery site.

Conduct a work area recovery simulation at the recovery site.

Conduct a pandemic Table-Top exercise.

Conduct an executive recovery plan exercise with all simulations.

Review business continuity plans of key vendors.

All managers submit a signed report that their recovery plans are up to date.

Practice a data backup recall from the secured storage area to the hot site.
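A schedule like the one above can be laid out as a one-year calendar with a short script. The sketch below repeats the quarterly tasks every quarter and spreads the annual tasks across the quarters in round-robin order. The round-robin rule and the shortened task labels are assumptions made for illustration; a real calendar would also steer around each department’s busy season.

```python
QUARTERLY = [
    "Inspect Command Center sites",
    "Verify data backups",
    "Verify employee recall lists",
    "Issue updated versions of plans",
]
ANNUAL = [
    "IT simulation at the recovery site",
    "Work area recovery simulation",
    "Pandemic Table-Top exercise",
    "Executive recovery plan exercise",
    "Review business continuity plans of key vendors",
    "Managers sign off that plans are up to date",
    "Data backup recall to the hot site",
]


def build_calendar(year):
    """Map each quarter label to its list of testing tasks for one year."""
    calendar = {f"{year} Q{q}": list(QUARTERLY) for q in range(1, 5)}
    for i, task in enumerate(ANNUAL):
        quarter = i % 4 + 1  # assumed round-robin spread through the year
        calendar[f"{year} Q{quarter}"].append(task)
    return calendar


for quarter, tasks in build_calendar(2025).items():
    print(quarter)
    for task in tasks:
        print("  -", task)
```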

A partial plan exercise calendar might look like Figure 13-1.


FIGURE 13-1. Example of a three-year testing calendar.

TESTING TEAM

Testing a recovery plan is a team effort. The best results come from a clear explanation of the responsibilities of team members and some training to show them what to do. This enables each person to contribute expertise to the exercise while learning by doing. Training for individual team members is the responsibility of the Business Continuity Manager.

The duties for each of the testing team members will vary according to the type of test, with a full disaster simulation requiring the most time from everyone. Possible team duties include:

• Business Continuity Manager

Develops a long-term testing calendar, updated annually.

Develops or updates a testing strategy.

Schedules tests.

Prepares test areas and participant materials.

Explains testing process to team prior to start of exercise.

Presents scenario.

Logs events during the exercise.

Keeps exercise focused for prompt completion.

Injects variations to scenarios during simulation testing.

Conducts after-action critiques of recovery plan and a separate discussion of the test process.

Provides a written test report to the program sponsor.

• Sponsor

Reviews and approves recovery plan test calendar and testing strategy.

Approves initiation of all tests.

Provides financial support for tests.

Ensures internal support of test program.

Observes tests in progress.

Reviews written report of test results and team critique.

• Exercise Recorder

Records actions and decisions.

Records all assumptions made during the test.

Drafts narrative of what happened during the test for the after-action review.

• Exercise Participants

Prepare for the test by reviewing the recovery plans.

Participate in test by following the plans.

Offer ways to improve the recovery plans during the test.

Participate in the after-action critique of the recovery plans and the testing process.

• Nonemployee Participants

Where practical, include people from other organizations who have a stake in your plans, such as the fire and police departments and the power company.

News reporters should be invited to report on the exercise and to participate in exercising your corporate communications plan.

Visitors, such as customer or supplier representatives, can also participate.

EXERCISE SCENARIOS

A disaster scenario is a hypothetical incident that gives participants a problem to work through. The scenario may describe any disruption to the normal flow of a business process. The scenario should be focused on the type of problem that a particular group of people may face. For example, the problem and its mid-execution “injection of events” should encompass all participants.

Every simulation starts with a scenario, a hypothetical situation for the participants to work through. Scenarios that reflect potential threats also add an air of reality to the exercise. A good place to look for topics is in the plan’s risk analysis section or program assumptions. Another place to look is in the recent national or local news.

For example, who has never experienced a power outage or a loss of data connectivity? How about severe weather like a hurricane or a blizzard? Or consider tornados and earthquakes. Human-created situations, such as fire, loss of water pressure, or a person with a weapon in the building, are also potential scenarios.

The planning expertise of the test coordinator is crucial. The coordinator must devise an exercise schedule to include a detailed timeline of events, coordinate and place the resources involved (people, equipment, facilities, supplies, and information), establish in everyone’s minds their role, and identify interdependencies between individuals and groups.

In theory, a business faces a wide range of threats from people, nature, and infrastructure. In reality, few of these will occur. Some are dependent on the season and changes in the political environment. Whatever the crisis, the recovery steps for many threats are the same. A data center lost to a fire is the same as a data center lost (or made unusable) because of a collapsed roof or a flood. In each of these cases, there will be many steps unique to that event. However, the initial actions in each case will be the same. It is this similarity that enables disaster recovery planning. A disaster plan is most useful in the first few hours when there is limited information, but the greatest benefit comes from containing the damage and restoring minimal service to the company.

Include in the scenario the incident’s day of week and time of day. The weekend response will differ from the work-time response. Consider setting the scenario at the company’s worst time of the year (such as the day before Christmas for a retailer). Also, the severity of the damage can at first appear to be small and then grow through “injects” provided by the exercise controller. Consider the example of a small fire. When a large amount of water is sprayed on the fire, it runs across the floor and saturates the carpet in the nearby retail showroom. It also leaks through the floor into the data center below, soaking the equipment.

Ask the program sponsor to approve the scenario used in an exercise. This will minimize participant discussion during its presentation. It will also help to avoid scenarios that executives feel are too sensitive.

Some potential testing scenarios might be:

• Natural Disasters

Hurricane or heavy downpour of rain

Tornado or high winds

Earthquake

Flood

Pandemic

Fire

Severe snow or ice

• Civil Crises

Labor strike (in company or secondary picketing)

Workplace violence

Serious supplier disruption

Terrorist target neighbor (judiciary, military, federal, or diplomatic buildings)

Sabotage/theft/arson

Limited or no property access

• Location Threats

Nearby major highway, railway, or pipeline

Hazardous neighbor (stores or uses combustibles, chemicals, or explosives)

Offices above 12th floor (limit of fire ladders)

Major political event that may lead to civil unrest

• Network/Information Security Issues

Computer virus

Hackers stealing data

Data communication failure

• Data Operations Threats

Roof collapse (full or partial)

Broken water pipe in room above data center

Fire in data center

Critical IT equipment failure

Environmental support equipment failure

Telecommunications failure

Power failure

Service provider failure

Loss of water pressure that shuts down chilled water coolers

Select scenarios so that the problem exercises multiple plans. Choosing the right scenario can engage the participants’ curiosity and imagination. It converts a dull exercise into a memorable and valuable experience for its participants. A good scenario should:

• Be realistic: no meteors crashing through the ceiling.

• Be broad enough to encompass several teams to test their intergroup communications.

• Have an achievable final solution.

• Include time increments, such as every 10 minutes equals one hour.
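The time increments in the last item amount to a simple scale factor between the exercise clock and the scenario clock. The sketch below uses the 10-minutes-equals-one-hour ratio from the example; the function name and its parameters are hypothetical.

```python
def scenario_minutes(exercise_minutes, real_per_step=10, scenario_per_step=60):
    """Convert minutes on the exercise clock to minutes of scenario time.

    Defaults follow the example ratio: every 10 real minutes
    represent one hour (60 minutes) of the scenario.
    """
    return exercise_minutes * scenario_per_step / real_per_step


elapsed = scenario_minutes(25)
print(f"25 minutes into the exercise = {elapsed / 60:.1f} scenario hours")
```

At this ratio, a two-hour exercise covers twelve hours of scenario time, which is worth knowing when deciding how far into the recovery the scenario can realistically run.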

Prior to the exercise, draft the scenario as a story. It begins with an initial call from the alarm monitoring service with vague information—just like a real incident. To add to the realism, some people will use a bit of photo editing to illustrate the scene. Imposing flames over the top of a picture of your facility may wake some people up!

As the exercise continues, the Exercise Coordinator provides additional information known as “injects.” This predefined information clarifies (or confirms) previous information and also raises other issues incidental to the problem. For example, if there was a fire in the warehouse, an inject later in the exercise may say that the warehouse roof has collapsed injuring several employees or that the fire marshal has declared the warehouse to be a crime scene and the data center is unreachable until the investigation is completed in two days.

Injects, like the scenario, must make sense in the given situation and may also include good news, such as workers missing from the warehouse fire are safe and have been found nearby. Unplanned injects may be made during the exercise if a team is stumped. Rather than end their portion of the test, state an assumption as fact. For example, the Exercise Coordinator could state that, “The data center fire was concentrated in the print room and no servers were damaged.”
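An Exercise Coordinator who wants to prepare injects in advance could keep them in a simple timed list. The following sketch is one hypothetical way to do that: the `Inject` record, the `due_injects` helper, and the delivery times are all assumptions, while the messages are drawn from the warehouse fire example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Inject:
    at_minute: int        # exercise clock time when the inject is delivered
    message: str
    planned: bool = True  # False for injects improvised during the exercise


def due_injects(injects, previous_minute, current_minute):
    """Return injects due after previous_minute and up to current_minute."""
    due = [i for i in injects if previous_minute < i.at_minute <= current_minute]
    return sorted(due, key=lambda i: i.at_minute)


# Hypothetical timeline for the warehouse fire scenario.
injects = [
    Inject(0, "Alarm service reports a fire alarm in the warehouse."),
    Inject(30, "Warehouse roof has collapsed, injuring several employees."),
    Inject(45, "Fire marshal declares the warehouse a crime scene."),
    Inject(60, "Workers missing from the warehouse have been found safe."),
]

for inject in due_injects(injects, previous_minute=0, current_minute=45):
    print(f"[{inject.at_minute:>3} min] {inject.message}")
```

An unplanned inject, such as stating an assumption as fact for a team that is stumped, can be appended during the exercise with `planned=False` so that the after-action review can tell the two kinds apart.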

Try to insert some humor into a tragic situation. For example, state that the fire was started by a lightning strike in the boss’s office or that the Board of Health has condemned the food vending machines.

TYPES OF EXERCISES

There are various types of recovery plan tests. They range from easy to set up and quick to complete to full simulation requiring months of planning. Each plan starts with Standalone Testing. Unfortunately, many companies never test their plans in a full simulation.

Standalone Testing

Standalone Testing is the first level of testing for all recovery plans. It is also required when a significant change has been made to the IT system or business process.

Standalone Testing exercises individual IT components or business processes to estimate the time required for recovery. It provides the first level of plan error checking. The scenario of a Standalone Test is to recover an individual IT component or business process from nothing. (It assumes the process or IT system has been destroyed or rendered totally unusable.) The goal is a recovery plan written so that someone other than the primary support person can understand and follow it. The test also familiarizes at least one other person with the plan’s contents.

Recovering business processes often requires that many plans work together. Standalone Testing examines the individual building blocks of the overall effort. Later tests examine the interactions and interfaces among the individual plans.

The result of a Standalone Test should be a recovery plan that is in the company standard format. This ensures that anyone unfamiliar with this process can find the same type of information in the same place. The plan should be approved as complete and accurate by the plan’s author and plan reviewer. The plan’s author also provides a time estimate as to how long the recovery plan should take to complete.

PREPARATION

Schedule a conference room away from an office’s distractions. If the document is large, break it into one-hour meetings to keep everyone fresh.

MATERIALS TO PROVIDE

A copy of the standard plan format and copies of the plan for each participant.

TESTING TEAM

Consists of the document author and a reviewer (may be the backup support person or the process’s owner).

THE MEETING AGENDA

The agenda should be as follows:

• Review ground rules.

This is a draft document and anyone can suggest changes.

Suggesting a change is not a personal attack.

All comments are focused on the document and not on the author.

• Review document for:

Proper format.

Content.

Clarity.

• Estimate time required to execute each step in this plan and the plan overall.

• Set time for review of changes suggested by this test.

FOLLOW UP

Continue Standalone Test plan reviews until the document conforms to the company standard format and the participants believe that the document reflects the proper recovery process.

Integration Testing

Integration Testing (or Integrated System Testing) exercises multiple plans in a logical group. This might be an IT system with its interdependent components (a database server, an application, special network connections, or unique data collection devices).

The purpose of an Integration Test is to ensure that the data exchanges and communication requirements among individual components have been addressed. These interdependent components require each other to provide the desired business function. This type of test is normally used by IT systems. For example, the Order Entry system may require access to multiple databases, files, and applications. To test the recovery of the Order Entry system, all of those other components must be recovered first.
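The order in which interdependent components must be recovered can be worked out from their dependencies. The sketch below groups components into “waves” that can be recovered at the same time, using the `graphlib` module from the Python standard library. The component names and their dependencies are hypothetical stand-ins for the databases, files, and applications an Order Entry system might need.

```python
from graphlib import TopologicalSorter


def recovery_waves(requires):
    """Group components into waves; each wave depends only on earlier ones.

    requires: component -> set of components that must be recovered first.
    Raises graphlib.CycleError if the dependencies are circular.
    """
    sorter = TopologicalSorter(requires)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a stable listing
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical dependencies for an Order Entry system.
requires = {
    "order_entry": {"customer_db", "price_files", "inventory_app"},
    "inventory_app": {"customer_db"},
}

for number, wave in enumerate(recovery_waves(requires), start=1):
    print(f"Wave {number}: {', '.join(wave)}")
```

A circular dependency reported by the script is itself a useful finding, since it points to an interface that the individual plans have not yet resolved.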

The ideal place to execute this plan is at the IT recovery hot site that the company will use in a crisis. If that is not available, then use equipment that is as close in performance and configuration to the hot site as practical. This will help to identify differences between the hot site and the required data center configuration.

Another advantage to using the hot site is to provide an actual recovery time for validating the recovery time objective (RTO). This result is added to other test results to see what the company can realistically expect for a recovery time, given the current technology.

Integration Testing is usually conducted by the backup support person(s) for each recovery plan. The Business Continuity Manager observes the test and records the actual time required to recover the IT system or business process.

In most IT recoveries, the server administrator builds the basic infrastructure and then provides it to the recovery team. For example, an operating system is loaded onto “blank” servers and then turned over to the recovery team. The time to prepare these devices is part of the RTO calculation.

PREPARATION

Schedule time at the hot site or the use of equipment in the data center.

TESTING TEAM

The testing team should consist of the following people:

• Backup support person(s) for each device to be recovered

• Network support technician to isolate the test network from production and to load the DNS server

• System administrator to load the operating systems and establish the domain controller

• Applications support team to load and test their systems

• Optionally, a database administrator

• Reviewer (IT Manager or Business Continuity Manager)

• Business process owner to validate a good recovery

• Business continuity program sponsor to approve timing of test and required funding

MATERIALS TO PROVIDE

These include the following:

• Nonproduction (spare) IT equipment, based on the list of required equipment as detailed in the recovery plans

• Copies of the recovery plans to be tested

THE TEST PROGRAM

The program should include the following actions:

• Review ground rules.

Write down all corrections as they are encountered.

Record the amount of time required to complete each step in the plan and the total in the plan. This may isolate steps that take a long time as targets for improving the speed of the recovery.

Given the amount of time required to set up an Integration Test, if time permits, rerun it after the plans have been corrected.

Keep the support team (network, database, systems administrators) close at hand to address problems after the applications recovery begins.

• Prepare for the test.

Set up a network that is isolated from the world since some applications may have embedded IP addressing.

• Conduct the test.

Set up the infrastructure.

Set up required infrastructure, such as DNS and domain controllers.

Load a basic operating system on the recovered servers.

Provide adequate servers and disk storage space.

Using the recovery plans, follow each step.

Note all corrections.

Using these corrections, restart the test from the beginning.

• Once the system is ready:

Applications support runs test scripts to ensure the system has been properly recovered.

A business process owner validates that it appears to function correctly.

• Review the results.

Update any plans as required.

Determine a realistic recovery time for this process.

• Conduct the after-action review.

Identify plan improvement needs.

Identify areas to research to reduce the recovery time.

Identify improvements in the testing process.

• Report test results. The Business Continuity Manager writes a report of test results and submits it to the program sponsor.

• Set time for review of changes.
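The step timings recorded under the ground rules lend themselves to a quick summary after the test. The sketch below totals the recorded times and lists the slowest steps as candidates for speeding up the recovery. The step names and minutes are hypothetical.

```python
def slowest_steps(step_minutes, top=3):
    """Return (total minutes, the `top` slowest steps, longest first).

    step_minutes: step description -> minutes recorded during the test.
    """
    total = sum(step_minutes.values())
    ranked = sorted(step_minutes.items(), key=lambda item: item[1], reverse=True)
    return total, ranked[:top]


# Hypothetical timings recorded during an Integration Test.
step_minutes = {
    "Load operating system": 90,
    "Restore database from backup": 240,
    "Configure DNS": 20,
    "Run application test scripts": 45,
}

total, slowest = slowest_steps(step_minutes, top=2)
print(f"Total recovery time: {total} minutes")
for step, minutes in slowest:
    print(f"  {step}: {minutes} minutes ({minutes / total:.0%} of total)")
```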

FOLLOW UP

Collect all plan corrections and reissue updated documents. If a plan required significant changes, then it should be reviewed in a Standalone Test before using it in another Integration Test.

Just as recovery plans are exercised, so is your ability to plan and conduct a test. After the plans are updated, ask the participants to review the planning and testing processes for ways to improve them.

Walk-Through Testing

The purpose of a Walk-Through Test is to test a logical grouping of recovery plans at one time. It is similar to an Integration Test except that no equipment is involved and the recovery is theoretical. A Walk-Through recovery plan exercise familiarizes recovery team members with their roles. It is useful for rehearsing for an Integration Test, for testing when an Integration Test is not practical, and for reviewing business process recovery plans.

Integration Testing is valuable since it involves an actual recovery. A Walk-Through also provides many benefits, but without the expense of actually using equipment.

Walk-Through recovery plan testing is conducted in a conference room. Participants explain their actions as they read through the recovery plan. The goals are to improve plan clarity, identify gaps in the plans, and ensure that all interfaces among individual plans are addressed. These interfaces may be the passing of data from one IT component to another or the passing of a document between workers.

A Walk-Through Test does not provide a real RTO for the collective plans. However, estimates may be provided by the recovery team members.

PREPARATION

Schedule time in a conference room.

TESTING TEAM

The team should consist of the following people:

• Backup support person(s) for each plan to be recovered

• Exercise Coordinator (IT Manager or Business Continuity Manager)

• Business process owner

• Exercise recorder to capture action, decisions, and assumptions as they occur

MATERIALS TO PROVIDE

Include copies of the recovery plans to be tested.

THE TEST PROGRAM

The program should include the following:

• Conduct the test, following the recovery plans.

Talk through setting up the infrastructure (no equipment is actually used).

Describe setting up required infrastructure, such as DNS and domain controllers.

Describe loading a basic operating system on the recovered servers.

Confirm that adequate servers and disk storage space would be available.

Note all corrections.

• Review the results.

Update plans as required.

Ensure team members are now more familiar with their recovery roles.

• Conduct the after-action review.

Identify plan improvement needs.

Identify areas to research to reduce the recovery time.

Identify improvements in the testing process.

• Report test results.

The Business Continuity Manager reports the test result to the program sponsor.

• Set time for review of changes.

FOLLOW UP

Collect all plan corrections and reissue updated documents. If a plan requires significant changes, then it should be reviewed in a Standalone Test before using it in another Walk-Through exercise.

Simulations

Up to this point, all tests have been based on recovering a business process or IT system from scratch. The reasoning is that if the plan has adequate information to recover from nothing, then it will have the information necessary to recover from a partial failure. However, it is this partial failure that is more common.

A simulation test brings all of the plans together. In a real crisis, rarely is the recovery isolated to a single plan. IT systems recover the data center, work area recovery plans recover office processes, and the supporting plans for Human Resources, Corporate Communications, Facilities, Security, and a range of other departments are all in play. A simulation not only invokes these many plans but forces them to work together toward the common goal.

A simulation test begins with a scenario (such as a partial roof collapse from a severe storm or a person entering the building with a gun). In both of these examples, most of the facility is intact yet may be temporarily disabled.

Simulation adds the elements of uncertainty, time pressure, and chaos to plan exercises. No situation comes with complete and verified information, yet managers must react correctly to minimize damage to the company. Chaos comes from inaccurate and incomplete information, and decisions must be made regardless. Unlike the smooth pace of a Walk-Through Test, a simulation surges forward whether or not anyone is ready for it.

Simulation tests can be simplistic Table-Top exercises. They can also be complex (and expensive), such as relocating the entire data center or work area to the recovery site and running the business from there. Most simulations only address a portion of the company, usually a group of related business processes. This keeps the recovery team to a manageable size and the recovery exercise focused on a set of plans.

Make it fun! Send out pre-exercise announcements as if they were news items related to the scenario (clearly marked as exercise notices for training only). At the beginning of the exercise, state the goals to instill a sense of purpose in the group. At the end of the exercise, restate the goals and ask the group how well it measured up. After all, these people gave up some hours of their lives, so show them how important it was to the company!

Simulations can also add the dimension of external agencies to the recovery. Firefighters, reporters, police officers, and other emergency groups can be invited to keep the chaos lively while educating participants about each agency’s role in a crisis. The Exercise Coordinator may also include employees at other company sites via conference call.

The purpose of the exercise is to validate that the plans are workable and flexible enough to meet any challenge. Participants will depend on the plan to identify actions to take during an incident. (However, just as in a real crisis, they are free to deviate from them.) Notes will be collected, and the plan will be updated as a result of the exercise. Participants will note areas for improvement such as corrections, clarity, content, and additional information.

There is no “right” answer to these exercises. The goal is to debug the plans and seek ways to make them more efficient without losing their flexibility (since we never know what sorts of things will arise). Do not “rig the test” to ensure success. Conducting the exercise at the recovery site will minimize distractions from electronic interruptions.

About one week before the exercise, verify that participants or their alternates are available. This is also a good time to rehearse the exercise with the testing team and to handle minor administrative tasks such as making copies of plans and tent cards identifying participants and their roles.

People will react in different ways. If someone on the team will be declared injured or killed during the exercise, ensure that they agree to this prior to the start of the exercise.

Table-Top Testing

A Table-Top Test is a simulated emergency without the equipment. It exercises decision making: Analysis, communication, and collaboration are all part of the plan. Table-Top exercises test an incident management plan using a minimum of resources. The size of the incident is not important. What matters is the fog within which early decisions must be made until the situation becomes clearer.

A Table-Top exercise tests a logical grouping of recovery plans with a realistic disaster scenario. One or more conference rooms are used to control the recovery. Unlike a Walk-Through Test, a Table-Top Test uses a scenario, mid-exercise problem injections, and, often, external resources.

A Table-Top exercise is much less disruptive to a business than a full simulation. A Table-Top exercise typically runs for a half day, whereas a full simulation can run for several days.

The goals are to train the team members, identify omissions in the plans, and raise awareness of the many dimensions of recovery planning. Each participant uses the recovery plan for guidance but is free to choose alternative actions to restore service promptly. The goals are to improve plan clarity, identify gaps in the plans, and ensure that all interfaces between individual plans are addressed. These interfaces may be the passing of data from one IT component to another or the passing of a document between workers.

A Table-Top Test does not provide a real RTO for the collective plans. However, estimates may be provided by the recovery team members.

The Exercise Coordinator keeps the group focused on the test. It works best if someone else is designated as the exercise recorder. The recorder writes down the events, decisions, and reactions during the exercise, freeing the Exercise Coordinator to work with the team. These notes are valuable later when considering ways to improve the recovery plans and the Table-Top exercise process.

PREPARATION:

Schedule a conference room.

TESTING TEAM The team should consist of the following people:

image Backup support person(s) for each plan to be recovered

image Exercise Coordinator (IT Manager or Business Continuity Manager)

image Exercise recorder

image Business process owners

image External resources, such as news reporters, firefighters, or police officers

MATERIALS TO PROVIDE Key materials include:

image Copies of the recovery plans to be tested

image Scenario and incident “injects”

image A clock projected by a PC onto a whiteboard

THE TEST PROGRAM The program should include the following:

image Explain to participants the rules for the exercise.

Time is essential; decisions must be made with incomplete information.

Everyone must help someone if asked.

Everyone takes notes for the after-action critique.

No outside interruptions are permitted—cell phones off.

If an issue is bogging down the exercise, the Exercise Coordinator can announce a decision for the issue or set it aside for future discussion.

image Introduce each of the team members and explain their role in the recovery.

image State the exercise goals (familiarize the team with the plan, gather RTO data, improve the plans, etc.).

image Introduce the scenario to the team.

Clarify group questions about the situation.

Ensure everyone has copies of the appropriate plans.

image Conduct the exercise.

Select several of the primary recovery team members to step out of the exercise; their backup person must continue the recovery.

Inject additional information and complexity into the exercise every 10 minutes.

End the exercise at a predetermined time, or when the company is restored to full service.

image Conduct the after-action review.

Identify plan improvements.

Identify areas to research to reduce the recovery time.

Identify improvements in the testing process.

image Report results. The Business Continuity Manager submits a written report of the test result to the program sponsor.

FOLLOW UP:

Collect all plan corrections and reissue updated documents.
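The injection cadence in the test program above can be laid out ahead of time so the Exercise Coordinator is not improvising under pressure. The following is a minimal sketch only: the function name, the sample injects, and the start time are hypothetical, and the 10-minute interval is taken from the test program.

```python
from datetime import datetime, timedelta


def build_inject_schedule(start, injects, interval_minutes=10):
    """Pair each inject with the clock time at which it is announced.

    The first inject is released one interval after the exercise starts.
    """
    schedule = []
    for position, text in enumerate(injects, start=1):
        when = start + timedelta(minutes=position * interval_minutes)
        schedule.append((when, text))
    return schedule


# Hypothetical scenario injects, for illustration only.
injects = [
    "Roof leak reported over the server room",
    "Local news crew arrives at the front gate",
    "Primary network administrator is unreachable",
]
start = datetime(2024, 1, 15, 9, 0)  # assumed exercise start time
schedule = build_inject_schedule(start, injects)

for when, text in schedule:
    print(when.strftime("%H:%M"), text)
```

A printed schedule like this can sit beside the projected clock so that the coordinator and the exercise recorder share the same timeline.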

Two types of plans are best tested as Table-Top exercises. The first is a crisis management plan, which is easily tested in a conference room because the types of actions required can be discussed rather than acted out. The second is a pandemic plan: a pandemic can range over 18 months, so a full simulation is not practical. Both can be conducted in the Command Center for additional realism.

Disaster Simulation

The purpose of a Disaster Simulation is to test a logical grouping of recovery plans with a realistic scenario. Essentially, a Disaster Simulation is a simulated emergency that includes the people and equipment necessary to recover IT equipment or a wide range of business processes. Running a simulation is expensive in time and equipment, so it should be approved far in advance. A simulation may be disruptive to a company’s normal business and should be planned for the company’s slow time of the year. It may run for several days.

The goals are to train the team members, identify omissions in the plans, and raise awareness of the many dimensions of recovery planning. A key advantage of a simulation is that it provides the actual time required to recover a process. It also adds the pressure of chaos to the recovery.

The Exercise Coordinator keeps the group focused on the test. Appoint an exercise recorder for each recovery team. That person writes down the events, decisions, and reactions during the exercise. These notes are valuable later when considering ways to improve the recovery plans and the exercise process.

Always preannounce a simulation; there should be no surprise alerts. Before engaging outside participants, the disaster recovery (DR) core team should perform the simulation exercise as a dress rehearsal to “polish” the sequence of events.

Real tests provide the most realistic results. Avoid the temptation of the IT team to make a “special” set of backup media just for the test. The true recovery time comes from sifting through the many backup tapes to find the files that you need.

A simulation begins with the initial incident alert by the night watchman or by an alarm that automatically alerts a manager. Full-scale testing involves pulling the plug on some part of the operation and letting the disaster recovery plan kick in. For obvious reasons, this is rarely done.

Simulation tests should be conducted at the recovery site at least once per year. Recovery plans are used as guidelines, but participants are free to deviate from them. The goals are to improve plan clarity, identify gaps in the plans, and ensure that all interfaces between individual plans are addressed. These interfaces may be the passing of data from one IT component to another or the passing of a document between workers.

PREPARATION:

Schedule a conference room. Ensure the participants understand the exercise is a rehearsal and not a test. A rehearsal allows people to play out their actions; a test implies pass or fail. For each recovery team:

image Create a log sheet to document the communication among recovery teams (see Form 13-1 from the companion URL).

image Create an observation log (see Form 13-2 from the companion URL).

TESTING TEAM The testing team should consist of the following people:

image Backup support person(s) for each plan to be recovered

image Exercise Coordinator (IT Manager or Business Continuity Manager)

image Business process owners

image Exercise recorder

image External resources, such as news reporters, firefighters, or police officers

MATERIALS TO PROVIDE Key materials include:

image Copies of the recovery plans to be tested

image Scenario and incident “injects”

image A clock projected by a PC onto a whiteboard

THE TEST PROGRAM The program should include the following:

image Explain to participants the rules for the exercise.

Time is essential; decisions must be made with incomplete information.

Everyone must help someone if asked.

Everyone should take notes for the after-action critique.

No outside interruptions are permitted—cell phones off.

If an issue is bogging down the exercise, the Exercise Coordinator can announce a decision for the issue or can set it aside for future discussion.

image Introduce each of the team members and explain their role in the recovery.

image Introduce the scenario to the team.

Clarify group questions about the situation.

image Conduct the exercise.

Select several of the primary recovery team members to step out; their backup person must continue the recovery.

Inject additional information and complexity into the exercise every 10 minutes.

End the exercise at a predetermined time, or when the company is restored to full service.

image Conduct the after-action review.

Identify plan improvements.

Identify areas to research to reduce the recovery time.

Identify improvements in the testing process.

Collect RTO metrics.

image Report results. The Business Continuity Manager gives a report of the test result to the program sponsor.

image Set time for review of changes.

FOLLOW UP:

Collect all plan corrections and reissue updated documents. If a plan required significant changes, then it should be reviewed in a Standalone Test before using it in another Integration Test. In addition, update the RTO Hour-by-Hour Recovery Plan.

SOMETIMES NATURE TESTS THE PLANS FOR YOU

There are numerous incidents that pop up from time to time that are not significant emergencies but that provide an opportunity to test parts of a plan. For example, if there is a power outage at work, use the plans to minimize the disruption. Do the same for a loss of data communications, a tornado warning, or a snowstorm emergency. Relocating a business process or significant portion of the data center is similar to a disaster.

Another opportunity that can trigger a plan test is facility construction. For example, the electricity to a building may have to be turned off for work on the power main. Use the recovery plans to locate and turn off all of the equipment, noting anything found that was not in the plan. When the work is over, use the plans to turn all of the equipment back on. Then test each critical system to ensure it is operational. Following the plans for restarting equipment may uncover equipment tucked away in offices or closets that is not in the plan.

Relocation to a new facility is a great opportunity to completely test your disaster recovery plan. Many of the activities necessary during relocation are the same as those required in a disaster: New machines may need to be purchased, servers are down for some period of time, new communications infrastructure needs to be built, and data must be restored. In fact, if a relocation project is not done properly, it may turn into a real disaster!

Whenever such a problem occurs:

image Focus people on referring to their recovery plans. The value of a plan is to reduce chaos at the beginning of a crisis. Plans are no good if no one uses them.

image Begin recording what has occurred and people’s reaction to it. These notes are used to improve our plans (and never to criticize anyone).

image Conduct an after-action review the next day to gather everyone’s perspective.

image A plan that is used for a real event has been tested just as surely as a scheduled exercise. Mark that plan as tested for the quarter.

Whenever a significantly disruptive incident occurs, such as a power outage, loss of external network, or a computer virus outbreak, begin taking notes during the event. These notes should be a narrative of times and actions taken—who did what, when, and the result. See if anyone thought to break out the appropriate recovery plans and follow them.

Within two working days after the incident, convene a group to conduct an after-action review. This review is intended to capture everyone’s perspective of the incident to improve plans for future use.
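Marking a plan as tested for the quarter after a real event is simple bookkeeping. The sketch below is one possible way to track it, assuming calendar quarters; the plan names, dates, and function names are all hypothetical.

```python
from datetime import date


def quarter_of(day):
    """Return (year, quarter) for a date, with calendar quarters numbered 1-4."""
    return (day.year, (day.month - 1) // 3 + 1)


def is_tested_this_quarter(last_used, today):
    """A plan counts as tested if it was exercised or used in the current quarter."""
    return last_used is not None and quarter_of(last_used) == quarter_of(today)


# Hypothetical plans and the last date each was exercised or used for real.
last_used = {
    "Power outage plan": date(2024, 5, 2),    # used during a real outage
    "Email recovery plan": date(2024, 2, 20),  # exercised last quarter
    "Payroll recovery plan": None,             # never exercised
}
today = date(2024, 6, 10)

needs_testing = [
    name for name, used in last_used.items()
    if not is_tested_this_quarter(used, today)
]
print(needs_testing)
```

Here the plan used during the real outage drops off the list of plans still to be exercised this quarter.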

DEBRIEFING PARTICIPANTS USING AN AFTER-ACTION REVIEW

Whenever an incident occurs (e.g., a power outage, a fire in the computer room) that is covered by a recovery plan (or should have been covered by a plan), conduct an after-action review on the next work day after the recovery. This is an open discussion of the event and how to improve future reaction.

Someone is appointed as the review coordinator (usually the Business Continuity Manager). It is helpful if someone else records the discussions so that the review coordinator is free to focus on the discussion.

What happened—It is important to gain agreement on what occurred, as further discussion is based on this finding. Each person will define the problem from his or her own perspective. Sometimes agreement on a point takes a lot of discussion.

What should have happened—This is where positive things are listed, such as the recovery plan was easy to find.

What went well—Not all is doom and gloom. Now that the crisis has ended, take credit for the things that went well. Acknowledge those people who contributed to the recovery.

What did not go well—This is the substance of the review. Once you list what did not work out, you can move to the last step. Take care never to personalize the discussions. Focus on the action and not on a person. Otherwise, people become defensive and no one will participate in the discussion.

What will be done differently in the future—List the solution to each item identified in the previous step. Assign action items to specific people, each with a due date.

Here is an example after-action report for a power outage:

What happened? The power for the building went out and everyone stopped working. The office people flooded out to the factory because there was light there through the windows. People milled around outside of the data center to see if they could help. The emergency lights failed in most of the offices and everyone was in the dark.

What should have happened? The emergency lights should have worked. Everyone should have known where to meet for further instructions.

What went well? Nobody panicked. The UPS system kept the data center running until power was restored.

What did not go well? No one knew what to do. Different people were shouting out different directions, trying to help but really confusing everyone. We could not find the system administrators in case the servers needed to be turned off.

What will be done differently in the future? We will identify assembly areas for everyone. Supervisors will be responsible for finding out what has occurred and passing it on to their people. The emergency lights will be checked monthly.
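The five review questions, and the rule that every action item gets an owner and a due date, map naturally onto a small record. This is a sketch under stated assumptions rather than a prescribed format: the class and field names are invented for illustration, and the owners and due dates are hypothetical values added to the power outage example.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ActionItem:
    description: str
    owner: str   # a specific person or role, never "everyone"
    due: date


@dataclass
class AfterActionReview:
    incident: str
    what_happened: list = field(default_factory=list)
    should_have_happened: list = field(default_factory=list)
    went_well: list = field(default_factory=list)
    did_not_go_well: list = field(default_factory=list)
    action_items: list = field(default_factory=list)

    def overdue(self, today):
        """Return the action items whose due date has already passed."""
        return [item for item in self.action_items if item.due < today]


review = AfterActionReview(incident="Building power outage")
review.went_well.append("UPS kept the data center running")
review.did_not_go_well.append("Emergency lights failed in most offices")
# Owners and due dates below are hypothetical.
review.action_items.append(
    ActionItem("Check emergency lights monthly", "Facilities supervisor", date(2024, 7, 1))
)
review.action_items.append(
    ActionItem("Identify assembly areas", "Business Continuity Manager", date(2024, 8, 15))
)

overdue = review.overdue(date(2024, 7, 15))
print([item.description for item in overdue])
```

Keeping the findings about actions rather than people in the record itself matches the guidance never to personalize the discussion.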

DEMONSTRATING RTO CAPABILITY

During the Business Impact Analysis, an RTO was established by the company. It was selected based on the impact to the company, not on what the company was capable of doing. Testing recovery plans and recording the recovery time is the place where the company proves it can meet the RTO. If not, something must change to meet it.

Some RTOs are obvious. A company that expects to recover from tape backup requires days. If the organization requires recovery in a few hours, then stop the testing and rework the data storage strategy. However, if a company has a reasonable strategy based on its RTO, then only testing can prove if it is achievable or not.

Figure 13-2 shows the first page of a possible RTO Hour-by-Hour Recovery Plan for a data center. (A similar chart should be built for work area recovery.) This chart collects the recovery times from plan exercises. In the IT world, most recoveries must wait until the basic infrastructure is in place (network, firewalls, DNS, domain controllers, etc.). The plan for recovering each infrastructure component is placed in sequence at the top of the chart. Below that is the list of applications, databases, etc., that must be recovered in the appropriate sequence. For example, a LAN recovery must be in place before the domain controller can be recovered.

Use this basic plan as an outline for building your own recovery timeline. As you enter the times from actual system recoveries, you can prove or disprove the company’s ability to meet its desired RTO. Actual recovery times are always preferred to estimated values.

If the RTO is not achievable or if you wish to shorten it, use this plan to identify places to make changes. Look for tasks that could run in parallel instead of sequentially. Look for the ones that take a long time and seek ways to shorten them (e.g., use faster technology or redesign the process). Of course, the best way to reduce the time required is to eliminate noncritical steps.
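The effect of running tasks in parallel can be checked with a small calculation. The sketch below assumes each task starts as soon as its prerequisites finish. All durations, the 12-hour RTO, and most of the dependencies are hypothetical; only the rule that the LAN precedes the domain controller comes from the text above.

```python
def earliest_finish(tasks):
    """Compute when each task can finish if it starts as soon as possible.

    tasks maps a task name to (duration_hours, [prerequisite names]).
    Returns a dict of task name -> finish hour, counted from the start
    of the recovery.
    """
    finish = {}

    def resolve(name, visiting=()):
        if name in finish:
            return finish[name]
        if name in visiting:
            raise ValueError(f"circular dependency at {name}")
        duration, prereqs = tasks[name]
        start = max((resolve(p, visiting + (name,)) for p in prereqs), default=0)
        finish[name] = start + duration
        return finish[name]

    for name in tasks:
        resolve(name)
    return finish


# Hypothetical recovery times in hours, as might be collected from exercises.
tasks = {
    "LAN": (2, []),
    "Firewalls": (1, ["LAN"]),
    "DNS": (1, ["LAN"]),
    "Domain controller": (3, ["LAN", "DNS"]),
    "Email": (4, ["Domain controller"]),
    "Billing": (5, ["Domain controller", "Firewalls"]),
}
rto_hours = 12  # assumed RTO for illustration

finish = earliest_finish(tasks)
total = max(finish.values())                      # with parallel work
sequential = sum(d for d, _ in tasks.values())    # one task at a time
print(f"parallel: {total} h, sequential: {sequential} h, meets RTO: {total <= rto_hours}")
```

With these assumed figures, the strictly sequential recovery would miss the RTO while the parallel one meets it, which is the kind of comparison the chart is meant to support.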

During a recovery, company executives can use the RTO Hour-by-Hour Recovery Plan to follow the recovery’s progress. Based on where the team is in the recovery, they can read off the times to see how much longer it should be before, for example, the email system is available or the billing system is operational.

FIGURE 13-2. RTO Hour-by-Hour Recovery Plan for a data center.

Conclusion

No plan can be called complete until it has been tested. Beyond the initial testing, ongoing testing is critical to ensure that the plan is kept up to date. As the organization grows and evolves, the plan must be updated to incorporate the necessary changes. Periodic testing validates these changes and keeps everyone aware of their responsibilities when a disaster strikes.

There are different types of tests, from simple one-on-one Standalone Tests to full simulated disasters. Tests should follow a progression from simple tests to complex. Trying to jump too quickly into simulations will result in people sitting around while muddled plans are worked through. Participants will conclude that the tests themselves are the disaster.

The people participating in the tests are a valuable source of information. After each exercise, promptly gather their ideas in an after-action meeting. They should advise the Exercise Coordinator of ways to improve the plans, communications among the testing teams, and everything that can speed a recovery. In a separate meeting, ask them to critique the testing process. This will improve their participation and cooperation in the future, as well as make your tests run smoother.

There are times when company activities or Mother Nature tests your plans for you. Immediately focus everyone on using their plans. After the event has passed, pull everyone together for an after-action meeting to collect their ideas. (This is also a great time to slip in a plug for the value of plans when disaster strikes.)

The outcome of each test should be used to update the RTO Hour-by-Hour Recovery Plan. It is one thing for a company to declare an RTO; that chart shows whether the company is likely to achieve it.
