CHAPTER 9
TECHNICAL RECOVERY PLAN
Putting Humpty Dumpty Back
Together Again

Any sufficiently advanced technology is indistinguishable from magic.
—Arthur C. Clarke

INTRODUCTION

When people think about disaster recovery, they focus on technical recovery plans. Although these plans offer detailed instructions for how to re-create a technical function for a company, they are really more than instructions for rebuilding a computer server in IT. They can be for the recovery of anything vital to the business, including recovery of a single machine in the factory or an office process. Each company needs to determine which of its processes need a technical recovery plan.

This chapter addresses two types of technical recovery plans. The first explains how to recover something complex. In the examples used, it will be an IT system. The second describes the critical actions to be taken by the technical recovery team leader at the recovery site. For example, it details how to recall recovery media, what to expect from the delivery service, when it should appear, etc.

DETERMINING WHAT NEEDS A PLAN

In terms of the overall disaster recovery, only vital business functions are candidates for a technical recovery plan. This is because the creation and maintenance of these plans is time consuming and, therefore, expensive. However, department managers may find these plans useful for all of their primary business functions, even if the company does not consider all of them vital. The creation of recovery plans beyond the minimum helps to raise a company’s business resilience from disaster recovery to business continuity planning.

Significant company disasters are rare. However, an isolated crisis, such as the failure of a company’s e-mail server, can make a manager’s life very stressful (assuming he or she is responsible for its care and feeding). In this situation, the company is still creating and delivering its products and services to customers, but without the assistance of this business function. A technical recovery plan provides a manager the tool to begin the recovery process while waiting for technical assistance to arrive. Some of the plans most companies require cover:

• All vital IT functions, as identified by the Business Impact Analysis.

• All vital business functions, as identified by the Business Impact Analysis.

• Telephone service, such as the main telephone switch, automatic voice mail routing, etc.

• Essential facilities services, such as water, electric, sanitary, etc.

• The office operations of vital business functions, such as how to recover a warehouse in another location during a disaster, how to maintain the customer service desk during a disaster, etc.

These plans can be written only by the people who support these functions every day. Writing the plans takes time. Because few of these people have ever written one, they will be reluctant to sign up for a task when they cannot gauge how long it will take.

The key to gaining their support is to make the process as easy as possible. Providing a recovery plan format and training them on how to use it relieves some of the anxiety. Beyond that, it is up to the company executives to free up the technician’s time and to identify creating these plans as a priority.

Included on the CD attached to this book is a sample Technical Recovery Plan (Form 9-1). This sample plan is only a starting point. Customize it to meet your own company requirements. For instance, there are example risk assessments and restoration priority charts which you must replace with those based on your own information developed in other chapters.

CREATING RECOVERY PLANS

Many companies assign someone to run their business continuity program and then walk away thinking that the job is done. This person is expected to sit in a back room and create the company’s response to a serious incident. For parts of the plan, this is true. This person can craft the administrative plan and even work through much of the Crisis Management Team’s plan without significant input. However, the Business Continuity Manager cannot write the technical recovery plans. This must be done by the skilled technicians who support these systems. The problem is that these technicians are busy meeting other company priorities. Writing a plan for a disaster that may never happen is not high on their lists. Executive support is critical!

Overcome Objections

Asking technicians to write recovery plans is like asking small children to volunteer for a round of immunization shots. Few will step forward. That is why senior executive support is so important. However, in fairness, the technicians are also expressing their personal concerns, so the person coordinating the creation of these plans must address and eliminate as many of these objections as possible:

• “I can’t write.” Make plan writing easy through an approved “fill-in-the-blanks” plan template (see Form 9-1, Technical Recovery Plan, on the CD) and a brief class to show them what information goes into which space.

• “If I tell you, I lose my job security.” You might explain that strong technical expertise is the best defense for job security. However, managers who permit workers to use this as an excuse are setting themselves up for blackmail at some future date (pay raise time, annual performance reviews, days off, etc.).

• “Someone will fool around with my systems.” This can happen anyhow. Remind them that passwords typically protect systems, and they are not stored with the plans.

• “I know what to do and don’t need a plan.” The plan is a guideline for the technical backup person. Also, some software is so stable that it is rarely touched. This plan reminds the primary support person about its specific requirements for installation.

• “You never know what will happen so any written plan is useless.” A technical recovery plan addresses the worst-case scenario by providing recovery instructions. No matter what the disaster, if the system needs to be rebuilt on different equipment, this plan will save the company a considerable amount of time and effort. If less than a full recovery is needed, then only part of the plan is used.

Passwords

IT systems are typically protected with passwords. In an emergency, the person recovering the IT system will need to know what the passwords are. However, writing down passwords to the company’s most sensitive systems violates information security standards. Once written, they must be protected so that they are only available in a crisis.

Another problem is that passwords change. If your company is serious about its information security program, then all passwords expire at some point in time. With multiple servers, routers, and anything else protected by a password, this expiration date is variable. Passwords are also changed if the company’s security is breached, if a key person leaves the company, etc.

There are as many different solutions to these problems as there are companies. Two of the most common solutions are:

• Keep the passwords in a locked container at the recovery site, and update its contents weekly.

• E-mail all changed passwords to the CIO’s third party e-mail account (such as Yahoo or Hotmail).

Names or Position Titles

Some companies prefer not to include names within plans and instead refer to a system support chart. This makes maintaining the support chart simpler. Companies with lower employee turnover may prefer to have names in the plans, so leaders do not need to look up things in a chart that may be hard to find in a crisis. An example of a recall table is shown in Figure 9-1. This table is a simple way to collect and keep recovery recall information in one place.

PLAN FORMAT

Two important ways to encourage technician plan writing are to offer a plan template and a class explaining how to complete it. Some people find writing very easy. Others stare blankly at an empty sheet of paper with no idea where to begin. Others fear that someone will criticize their writing and are afraid to write anything down. The Business Continuity Manager’s approach to this is to be patient, firm, and helpful!

FIGURE 9-1: Example recall table from a technical recovery plan.

Plan Template

Use Form 9-1, Technical Recovery Plan, included on the CD. The plan template addresses the who, what, where, when, why and, most importantly, how to recover something. Although this template is organized for the recovery of an IT system, it could easily be reworked to recover a telephone switch, the company’s data network, a special machine tool, or simply a process.

The front of the plan has a table of contents. No one reads these plans like a novel (front to back). Instead, they are often looking for something specific. The table of contents quickly points them to where they want to go.

The template steps through the various dimensions of what must be known or possessed to recover a particular IT system. The first part explains how the system supports the business. This is useful when making tradeoff decisions during a recovery. Farther down the list are technical requirements for a successful system recovery:

• Purpose. Set the context in which this system provides value. For example, the purpose of the materials management system is to control the quantity, location, and usage history of the company’s manufacturing materials.

• Scope. What this system does and does not support. For example, this system supports the company headquarters, the Eldorado, Ohio factory, and the Abu Dhabi sales office.

• Background. Explain any business requirements that assist the reader in understanding why this server/application exists.

• Assumptions. A list of things that were assumed when this plan was written, such as the technical qualifications required of the person executing this plan.

• Dependencies. Other servers that must be in place, such as a Microsoft IIS server, a specific Oracle database server, etc. Essential IT servers, such as DHCP, Domain Controllers, etc., should be assumed to be in place. Also skip the environmental concerns of air conditioning, filtered electrical power, etc.

• Tech Support. List the names and 24-hour contact numbers of the primary and secondary support persons for this system. Some companies refer to the employee recall list in the administrative plan for telephone numbers.

• System Users. List the primary end users for this system. These people will be called upon to verify that the system has been successfully recovered. They will more thoroughly exercise the system than will the IT technician.

“Systems Requirements” details specific technical requirements that must be in place before this IT system can achieve its minimal level of service. Since this is an IT example, a different set of criteria should be selected for recovering an office, a piece of machinery, etc.

• Server Requirements in terms of CPU, RAM, “C: drive” size and type, etc. Be clear about what is needed because these specifications may be used to order a replacement server that is the appropriate size.

• Disk Space Requirements lists the total disk storage required for local disks, SAN disks, etc. Any special configuration of these disks is also noted here.

• Connectivity Requirements describes the network configuration, such as VLANs, trusts, opened firewall ports, special firewall rules, etc.

• Support Software lists the many supporting utilities that may be needed.

• Application Requirements are listed in case a software application must be changed during recovery. The appropriate compiler version must be known to implement the repair.

• Database Requirements lists the type and version of the database program supporting this system. This will also include required permissions, databases, and table connections needed.

• Special Input Data beyond what is in the company’s backup media, such as data stored in a different off-site location or an external data feed.

• Licensing Requirements may be relevant since in some cases, loading this system on new hardware may require a license change by the software manufacturer. For example, a license may be tied to a CPU serial number. If applicable, detail the instructions for obtaining it.

• Special Printing Requirements details instructions for setting up printed output for this application to include special forms.

• Service Contracts that support this system’s components to include days and times of coverage, etc. Include the expiration date. Describe how to contact the vendor or whoever provides support. This information should be available through the command center and the administrative plan.

“Detailed Recovery” details the specific steps required to bring this system back up to a minimum level of service. This includes:

• Prerequisite Systems/Applications that are required prior to restoring this application.

• Successor Systems/Applications that are fed by this application.

• Application or Infrastructure Component License Requirements (necessary to accomplish the test).

• Architecture Diagram (insert a diagram that indicates where in storage the application is found, how to start it, how it relates to other systems and passes data between components).
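A template is easier to enforce when a draft can be checked for blank sections before it goes to the proofreader. The short Python sketch below shows one way this might be done. The section names come from the template described above, but the dictionary layout, the function name, and the sample draft are assumptions made purely for illustration; they are not part of Form 9-1.

```python
# Section names follow the template description; the dictionary layout and
# function name are assumptions made for this sketch, not part of Form 9-1.
REQUIRED_SECTIONS = [
    "Purpose", "Scope", "Background", "Assumptions", "Dependencies",
    "Tech Support", "System Users", "Systems Requirements", "Detailed Recovery",
]


def missing_sections(plan):
    """Return the required sections that are absent or left blank in `plan`."""
    return [s for s in REQUIRED_SECTIONS if not str(plan.get(s, "")).strip()]


# An invented, partially completed draft.
draft = {
    "Purpose": "Control quantity, location, and usage history of materials.",
    "Scope": "Headquarters, the Eldorado, Ohio factory, Abu Dhabi sales office.",
    "Background": "",
    "Tech Support": "See the employee recall list in the administrative plan.",
}

print(missing_sections(draft))
```

A check like this only confirms that something was written in each section. Whether the content is usable is still a matter for the proofreader and for plan testing.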

Template Training

The people who know the most about your systems are often the busiest. Time spent writing your document is time not spent on actions the company judges to be productive. Therefore, only a few people at a time will be made available for writing plans. This is why template training is usually conducted in small groups. This allows for one-on-one questions to be quickly addressed. In larger groups, a few strong personalities may disrupt the meeting and many questions will go unanswered.

Walk the team through the template element by element. Ask for their ideas for improving it and seek to identify their challenges in filling it in. The most common problem is writing the step-by-step recovery. The easiest approach is to ask them to explain the recovery steps to you. Writing the steps is the same as speaking them to another person.

All plans are written for someone with at least a basic familiarity with that technology (UNIX, DB2, C++, etc.). To save on words, it is helpful to include screen shots showing what to enter where and which software responses to look for. These screen shots should go into the recovery document. For example, instead of telling someone to look for the small button in the upper left corner and then describing the next field to enter, the author can draw an arrow on the screen shot. This saves time both in reading the plan and in executing it.

Screen shots can make the individual recovery plans rather large. However, a plan with clear illustrations speeds recovery.

Proofing with the Manager

Once the plan is drafted, it needs to be proofread by someone other than its author. This will help to correct grammatical errors and to identify logical gaps in the narrative. There are two logical candidates for this:

• Author’s Team Leader. This is the person in the hot seat when it comes to ensuring that a workable plan is created. Proofreading the plan familiarizes the leader with the system in question and enables him or her to identify “best practices” that can be applied to other plans created by the team. The leader also has a “big picture” view of this system and adjacent software systems to point out connection points between them.

• Backup Support Person. This person may be the one called upon to execute the plan, so he or she has a personal stake in ensuring that it is understandable and complete.

Step-by-Step Specifics

How much detail is necessary? It is not practical to write a plan so detailed that any person walking by can execute it. That would take too long to execute and would be too unwieldy to keep current. Instead, write it at the level of someone familiar with the technology, but not necessarily with that IT system. This enables the use of other company employees or contractors to run with this plan in the event of a serious disaster.

When writing these plans, put yourself in the place of someone asked to recover this unfamiliar system. What would you want to know about? What aspect of the recovery would concern you the most? What is the logical way to sequence the recovery steps for a smooth recovery? How might the technician verify that required predecessors are in place before beginning?
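One possible answer to the last question is a quick connectivity check that the technician runs before starting. The Python sketch below is an assumption-laden illustration rather than a prescribed method: the host names are placeholders, the port numbers are the common defaults for DNS, LDAP, and an Oracle listener, and the function names are invented. It uses only the standard library call `socket.create_connection`.

```python
import socket

# Hypothetical prerequisite services: label -> (host, TCP port).
# Host names are placeholders; take the real ones from the Dependencies
# section of the plan being executed.
PREREQUISITES = {
    "DNS": ("dns.recovery.example", 53),
    "Domain Controller (LDAP)": ("dc1.recovery.example", 389),
    "Database listener": ("db1.recovery.example", 1521),
}


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, or the name did not resolve
        return False


def unmet_prerequisites(prereqs, timeout=3.0):
    """Return the labels of prerequisite services that did not answer."""
    return [label for label, (host, port) in prereqs.items()
            if not is_reachable(host, port, timeout)]
```

Calling `unmet_prerequisites(PREREQUISITES)` returns the labels that failed the check; an empty list means every listed service accepted a connection. A successful connection shows only that something is listening on the port, so it complements the verification steps written into each plan rather than replacing them.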

RECOVERY PLAN FOR THE RECOVERY TEAM LEADER

A plan to manage the plans is important. During a disaster, the person leading the efforts at a recovery site will have his or her own specific recovery information steps and requirements. These should be included in a separate plan. In this case, use Form 9-2, IT Team Leader Recovery Plan, that is included on the CD. The purpose of this plan is to guide the technical recovery team leader on the actions required at the remote technical recovery site. This plan is intended to work together with the Command Center plan to ensure a smooth recovery.

Activity at the recovery site will be hectic with technicians coming and going—each with their own idea of what should be done next. The technical recovery team leader must focus this energy on the recovery effort at the time when it is needed.

Recovery Site Manager

The Recovery Site Manager is the CIO’s representative at the recovery site. This person is charged with providing direction to all employees and contractors onsite. During a recovery, there is no time to argue over the boundaries of job responsibilities. The Recovery Site Manager has the authority to assign any employee or contractor onsite to any recovery task.

Organizing the local recovery efforts means wearing many hats. The Recovery Site Manager is both leading the recovery effort and activating a new company facility. This responsibility runs from reloading software to security to janitorial service and includes everything necessary to ensure a safe, sanitary, and operational facility. Most people assigned to this task have no problem with the technical side—it is the rest of the work that distracts them. If a facilities manager is onsite, that would be the logical person to address the facility issues.

To minimize distractions, the team must also be cared for. The recovery site should be located at least an hour away from the normal working site. This means that the surrounding countryside may not be familiar to team members. To keep the team focused on the recovery, food is usually purchased and brought in so no one needs to go searching for it. Local hotel accommodations may also be needed. Specific responsibilities include:

• Appoint someone as the alternate site manager to answer questions in the Recovery Site Manager’s absence.

• Ensure security of the site to safeguard company assets and data.

• Report progress once an hour to the company recovery Command Center on the IT systems recovery.

• Assign team members and contractors to whatever tasks need to be accomplished.

• Publish a rest plan to ensure the recovery can proceed around the clock.

• Ensure everything in the facility operates safely.

• Coordinate purchasing requirements through the Command Center. However, use the company credit card for small purchases such as food for the staff and miscellaneous supplies.

Several important tools will guide the Recovery Site Manager. These are typically found in the administrative plan as they are useful to more than one team.

Personnel Tracking

The Recovery Site Manager assigns workers to specific tasks as they arrive. Based on the situation, people may be assigned to areas other than their primary specialties. For example, once the networks are operational early in the recovery, the network technicians may be assigned to other duties.

Maintain a log of who arrived, and when. Know who is at the recovery site in case they are needed. When the crisis has passed, this can be used for a number of actions, such as “Thank You” notes, calculating labor used in the recovery toward the company’s losses from the disaster, etc. This log enables:

• Tracking when people have been too long on the job and need to rest. Tired people make mistakes.

• Tracking who is in the disaster site to account for everyone. Know who went where and when so that someone can look for them if they are overdue.

Use Form 9-3, Technician Tracking Log, on the CD to track personnel. This sheet shows who arrived when. Late arrivals may explain some of the delays in starting specific recovery steps. It shows who is still onsite and can also be used to provide personnel status reports. The intention is to avoid time lost looking for someone. If they sign out then they are off-site and no further searching is required. It also serves as a record later for who was onsite, when, and for how long.
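For sites that keep the tracking log electronically, the same questions (who is onsite, and who has been working too long) can be answered with a few lines of code. The Python sketch below is illustrative only. The field names, the sample entries, and the 12-hour rest threshold are all assumptions; Form 9-3 does not prescribe any of them.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class LogEntry:
    name: str
    signed_in: datetime
    signed_out: Optional[datetime] = None  # None means still onsite


def onsite(log):
    """Names of everyone who has signed in and not yet signed out."""
    return sorted(e.name for e in log if e.signed_out is None)


def overdue_for_rest(log, now, limit=timedelta(hours=12)):
    """People still onsite for longer than `limit` (an assumed threshold)."""
    return sorted(e.name for e in log
                  if e.signed_out is None and now - e.signed_in > limit)


# Invented sample entries for illustration.
log = [
    LogEntry("A. Network Tech", datetime(2024, 5, 1, 6, 0)),
    LogEntry("B. DBA", datetime(2024, 5, 1, 14, 30)),
    LogEntry("C. Server Tech", datetime(2024, 5, 1, 7, 0),
             datetime(2024, 5, 1, 15, 0)),
]
now = datetime(2024, 5, 1, 20, 0)

print("Onsite:", onsite(log))
print("Need rest:", overdue_for_rest(log, now))
```

A paper form at the door does the same job. The point is simply that sign-in and sign-out times are enough to answer both questions without searching the building.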

Recovery Activity Log

Use Form 9-4, Recovery Activity Log, on the CD to track recovery activities. It is used to record significant events occurring during the recovery. This document is valuable for later analysis so that recovery performance can be improved. Start this log as soon as the facility is open. Require all technicians to report when they begin their recovery work and when it is complete.

Examples of entries in the activity log are:

• Requests to Purchasing for additional supplies or services.

• Calls to external tech support.

• Status reports to the Command Center.

Recovery Gantt Chart

The Command Center’s question is always, “When will it be ready?” The Recovery Gantt Chart will show where the recovery is in relation to its completion and provide some idea of when a specific application will be available for business use. There is a sequence to the logical recovery of IT systems. The network (internal and external) must be activated, then the supporting servers are restored (Domain Controller, DHCP, DNS, etc.), then the application and database servers are restored, and on and on. A tool to achieve this is a restoration priority list.

The Recovery Gantt Chart is created during plan testing. It is used to add up all of the estimated and actual plan test recovery times to see if the company can achieve its recovery time objective (RTO). The same document is an excellent tool for gauging progress. For example, if the CEO decides that e-mail service must be restored as early as possible, then a review of the Gantt Chart identifies those IT services that must be restored before e-mail recovery can begin. As these services are restored, the remaining time for the rest of this recovery can be estimated.

The recovery timeline is a tool that enables the recovery team leader to monitor progress toward recovery completion and to estimate the remaining time required to complete the recovery. A copy of the timeline is maintained in the Command Center for the same purpose. The timeline helps the CIO project when specific IT functions should be available given recovery progress.

During the development of recovery plans, each author estimated the amount of time required to execute his or her plan. This information from all the plans was then added to the RTO Hour-by-Hour Recovery Plan. Each plan was placed in sequence according to its timing in the recovery. For example, the domain controller is recovered after the network is restored and before starting on application recovery. The sequence of restoration and the lengths of time required are formatted into a Gantt Chart using Microsoft Project.

Form 9-5, Hour-by-Hour Recovery Plan, is an example of a Gantt Chart that can be found on the CD. This is in Microsoft Project format.

To use the plan during recovery, follow the team’s progress on the chart. If someone asks when a particular application will be ready, it can be traced back up the Gantt Chart to the point where the team is currently working. This backward check indicates the amount of additional time needed before that application can be used.

The chart also helps when the recovery of one component is delayed. Looking ahead on the chart shows the additional time required for the recoveries that depend on it. For example, if a domain controller was delayed by a half day, then all subsequent recoveries on the chart would be delayed by that much. The total of the delays is called the “accumulated delay.”
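The arithmetic behind the chart is simple enough to sketch in code. The Python example below is a minimal illustration under stated assumptions: the step names, the durations in hours, the 24-hour RTO, and the 12-hour delay are all invented, and each step is assumed to start as soon as every one of its prerequisites has finished. It is not a substitute for the Microsoft Project file on the CD.

```python
# Hypothetical recovery steps: name -> (estimated hours, prerequisite steps).
# All names and durations are invented for illustration.
PLAN = {
    "network": (4.0, []),
    "domain_controller": (3.0, ["network"]),
    "dns_dhcp": (2.0, ["network"]),
    "database": (6.0, ["domain_controller", "dns_dhcp"]),
    "email": (5.0, ["domain_controller", "dns_dhcp"]),
    "materials_mgmt": (4.0, ["database"]),
}


def finish_times(plan, delays=None):
    """Return the hour, counted from recovery start, at which each step ends.

    A step starts once all of its prerequisites have finished. `delays` maps
    a step name to the extra hours it took beyond its estimate.
    """
    delays = delays or {}
    done = {}

    def finish(step):
        if step not in done:
            hours, prereqs = plan[step]
            start = max((finish(p) for p in prereqs), default=0.0)
            done[step] = start + hours + delays.get(step, 0.0)
        return done[step]

    for name in plan:
        finish(name)
    return done


RTO_HOURS = 24.0  # assumed recovery time objective

baseline = finish_times(PLAN)
delayed = finish_times(PLAN, delays={"domain_controller": 12.0})  # half-day slip

for step in PLAN:
    slip = delayed[step] - baseline[step]
    status = "within RTO" if delayed[step] <= RTO_HOURS else "misses RTO"
    print(f"{step:18} planned {baseline[step]:5.1f}h  now {delayed[step]:5.1f}h  "
          f"slip {slip:4.1f}h  {status}")
```

In this sample data, the half-day slip on the domain controller carries through to every step that depends on it, while the DNS/DHCP step, which depends only on the network, is unaffected. Comparing the delayed finish times with the RTO shows at a glance which applications will now be late.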

Materials to Reference in the Administrative Plan

Keep a copy of the company’s administrative plan in the recovery site. It contains much useful information for the Recovery Site Manager, such as:

• Technical Support Chart. A matrix indicating who is the primary and secondary support for every technology. This is used to determine whom to contact for recovering which system. If the primary support person is not present then the secondary person can be summoned.

• Recall Roster. A complete employee recall roster is located in the Command Center. If a particular person is needed, request him or her through the Command Center. The Human Resources representative in the Command Center also has a matrix of job skills. If the primary and secondary persons are not available, then this chart can be used to identify someone else in the company who is familiar with the technology in question.

• Vendor List. 24-hour contact information for vendors. Although after-hours calls from a customer are strongly discouraged, in a true emergency, a supplier looking to retain a valuable customer will step up and unlock business doors even in the middle of the night.
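The lookup order described for the support chart and the job skills matrix (primary, then secondary, then anyone else with the skill) can be expressed as a short routine. The Python sketch below uses invented names and an assumed data layout. It only illustrates the order of the checks and does not describe any particular company's chart.

```python
# Hypothetical support chart: technology -> (primary, secondary).
SUPPORT_CHART = {
    "e-mail": ("Pat", "Lee"),
    "Oracle database": ("Sam", "Pat"),
}

# Hypothetical job-skills matrix held by Human Resources: person -> skills.
SKILLS = {
    "Pat": {"e-mail", "Oracle database"},
    "Lee": {"e-mail"},
    "Sam": {"Oracle database"},
    "Kim": {"Oracle database", "UNIX"},
}


def who_to_call(technology, available):
    """Pick the primary, then the secondary, then anyone with the skill.

    `available` is the set of people who can currently be reached.
    Returns None when nobody suitable is available.
    """
    for person in SUPPORT_CHART.get(technology, ()):
        if person in available:
            return person
    for person in sorted(SKILLS):
        if technology in SKILLS[person] and person in available:
            return person
    return None


print(who_to_call("Oracle database", {"Kim", "Lee"}))
```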

Emergency purchasing authority must be clearly described in the company’s policies and Crisis Management Plan. No one wants to lose valuable recovery time waiting for a purchasing agent to appear.

Validating a Successful Recovery

When an application is restored, it is first tested by the technician who loaded it. Once the technician is satisfied that the application is ready, then the “power user” for that application should log in and exercise its many options. After that person is satisfied, the application is released for general use. This layering of testing catches errors and missing system interfaces before end users encounter them. In addition, in a busy recovery, technicians are quickly assigned to recover other systems. There is little time to go back and troubleshoot a poorly restored software application.

Once a recovered application has passed all tests, inform the Recovery Site Manager. The result will be added to the Hour-by-Hour chart and reported back to the Command Center.

Security

Most recovery sites are not permanently staffed. When they are opened during a crisis, they need a team with the right mix of skills. The responsibility for ensuring this occurs usually falls to the recovery team leader onsite. This person must establish the basic support functions to make the site run smoothly.

The first person to arrive with a key establishes security for the front door. People will be going in and out for various reasons. Many of the team members will not have access via the electronic locks. Not everyone can be relied upon to verify the identity of someone before admitting them to the site. It is better to post one person at the door as a security check for movement in and out until a company security guard arrives. If the guard needs a break, they must call down to see if someone can watch the door for them.

Electronic locks make it easy to restrict entry to a few people during normal operation, and then to admit many others during an event. If possible, add the recovery team to the locks to minimize delays at the entrance.

Team Support

Identify nearby hotels and places to eat in advance so that team members are not lost to the recovery while they wander around a strange neighborhood looking for an open restaurant. Find out who delivers to the recovery site. Throughout, an up-to-date telephone book or Internet access is essential.

The best time to set this up is during plan testing. When running a test, try out some of the local hotels and restaurants to see which provides the best service. Then when the crisis hits, the team will already know whom and where to call. The company may also keep the contact information for these sites close at hand and open purchase orders to cover requests.

Communications

As soon as the recovery site is opened for the technical team, the recovery team manager establishes communications with the primary Command Center. Report that the recovery has begun. Keep the line available so that the Command Center can call when needed. (IT recovery sites tend to be shielded and it may be difficult to gain cellular telephone reception.) Other important communications issues are listed below.

STATUS REPORTING It is important that the Command Center knows the current status of the recovery. As systems become available, users can be assigned to catch up on work to restore service to customers. Communications is a two-way street. The primary Command Center will provide status of the disaster site’s disaster containment and recovery.

Submit a status report to the Command Center on the hour containing some or all of the following:

• Progress in the restoration priority list and restoration timeline.

• Whether security has been implemented.

• Who is present at the recovery site.

• What resources are needed.

• What purchasing is needed.

The CIO in the Command Center reports to the Recovery Site Manager on:

• Progress in the disaster site assessment, as well as containment and salvage efforts.

• Status of resources at both sites.

• Status of purchasing requests from the recovery site.

COMMUNICATIONS TOOLS IN THE RECOVERY SITE The recovery site must be equipped with a range of communications devices. Given the short timeframe for a recovery, there is no time to work around a communications mismatch. Items needed include:

• Fax line for vendors requiring a faxed order with a signature.

• VPN line for contracted services to connect to the systems with minimal potential for interception.

• External network and modem connection. If the network connection is not normally live, a direct telephone line usually is, so external sources can dial in via modem.

Recovering Backup Media from Storage

The key to a prompt IT recovery is ready access to the most recent copy of the company’s backup media. Every company has its own approach to storing this media. First, it must be stored off-site in a facility whose protection against theft and whose environmental controls are as rigorous as those of the data center. Often this means that a secure courier is used to transport the media.

A part of this security is that only a few people can call out the media containers. Several of these people should be among the first dispatched to the recovery site. (Summoning several people is a good idea in case one or more cannot attend.) Ensure that team members know who is recognized by the off-site storage company as possessing the authority to recall the media.

In the plan, be sure to include:

• Media storage location.

• Who is authorized by the company to withdraw material. This person must be present at the recovery site to receive the material or the courier may not be permitted to leave it.

• Number to call to withdraw material.

• Pass codes to validate identity.

• A reminder to recall the latest weekly container and all daily containers since then.
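The last reminder in the list is easy to get wrong under pressure, so it may help to see the selection rule written out. The Python sketch below assumes an inventory of containers recorded as an identifier, a backup type, and a backup date. The identifiers and dates are invented, and the rule shown (latest weekly plus every daily after it) is taken directly from the reminder above.

```python
from datetime import date

# Hypothetical container inventory: (container id, backup type, backup date).
INVENTORY = [
    ("W-101", "weekly", date(2024, 4, 5)),
    ("W-102", "weekly", date(2024, 4, 12)),
    ("D-210", "daily", date(2024, 4, 11)),
    ("D-211", "daily", date(2024, 4, 13)),
    ("D-212", "daily", date(2024, 4, 14)),
]


def containers_to_recall(inventory):
    """Return the latest weekly container, then every later daily, in date order."""
    weeklies = [c for c in inventory if c[1] == "weekly"]
    if not weeklies:
        raise ValueError("no weekly container in the inventory")
    latest_weekly = max(weeklies, key=lambda c: c[2])
    dailies = sorted(
        (c for c in inventory if c[1] == "daily" and c[2] > latest_weekly[2]),
        key=lambda c: c[2],
    )
    return [latest_weekly] + dailies


for container_id, kind, backed_up in containers_to_recall(INVENTORY):
    print(container_id, kind, backed_up.isoformat())
```

Daily containers older than the latest weekly are left in storage because the weekly backup is assumed to supersede them. If your backup scheme works differently, adjust the rule to match.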

Implement a Rest Plan

Recovering IT systems takes time. Tired people make more mistakes. So if the data center recovery will require more than one working day, the team leader must implement a rest plan for the group. This means that a portion of the techs will be in a quiet rest area (no loud music or anything to distract them). The remainder are working on assigned tasks. Although some people can sleep at anytime and anywhere, others will find it difficult to relax with so much excitement. Still, without a rest plan, everyone will run out of energy at about the same time and the recovery halts.

The rest plan is a published document that lists who is working what hours. Often this is determined by the sequence of events anyhow. The idea is if someone is not needed for a few hours, they should not get underfoot of the technicians working and should be in a separate area.

Janitorial Service

A facility that normally sits idle may not have a custodial staff. Someone is needed to clean the sanitary facilities, empty the trash, etc. Although this may not seem to be a priority for the team leader, think in terms of days. If the recovery is successful, then the recovery site will be the company’s data center until service is restored at the primary site. To keep the recovery site safe and sanitary, a periodic cleaning is needed.

Plan Testing

The Business Continuity Manager must certify at least quarterly that the IT recovery site is capable of supporting this plan. For a recovery site to provide emergency recovery, it must have:

• Sufficient server capacity to load and run identified critical applications.

• The right type of back-end servers (Active Directory, Domain Controllers, Tape Management server, DNS, DHCP, firewalls, etc.) to support the recovered applications.

• Protected network connection to the Internet and Intranet.

CONCLUSION

Technical recovery plans are the heart of the recovery effort. They provide much of the “how” of the recovery. A plan must be written for every vital business function, not just for IT systems. There should be a plan for each business process in a work area recovery plan, telecommunications recovery, and every vital IT system. Companies aspiring to the level of business continuity planning should also provide plans for other important functions so they can be recovered from an incident more quickly.

Technical recovery plans must be tested regularly. Processes change over time and regular testing catches updates that have not made it to the plans. Exercising the plans is also the best way to train recovery team members. In a crisis, they will already be familiar with the content and recovery project workflow.

A separate plan is needed for the person managing the newly opened recovery site. Part of the plan is to open a miniature Command Center to control the site. Another part is to control the recovery team to ensure that the many personalities are working together instead of at cross-purposes. However, the primary purpose is to guide the sequence of recovery plan execution to speed the availability of these services to the company.
