IT Service Continuity Management

As discussed when talking about availability management, a service delivers value only when it is available for use. In addition to the activities carried out under the availability management process, there is a requirement on the IT service provider to ensure that the service is protected from catastrophic events that could prevent it from being delivered at all. Where these cannot be avoided, there is a requirement to have a plan to recover from any such disruption in a timescale and at a cost that meets the business requirement. Ensuring IT service continuity is an essential element of the warranty of the service.

It is important to understand that IT service continuity management (ITSCM) is responsible for the continuity of the IT services required by the business. The business itself should have a business continuity plan to ensure that any potential situations that would impact the ability of the business to function are identified and avoided. Where it is not possible to avoid such an event, the business continuity management (BCM) process should have a plan, which is appropriate and affordable, to both minimize its impact and recover from it. Thus, the ITSCM plan can be seen as one of a number of elements making up a business continuity plan, along with a human resources continuity plan, a financial management continuity plan, a building management continuity plan, and so on.


ITSCM Without BCM
The internal IT service provider for a large insurance company in the United Kingdom had a well-developed IT service continuity plan. This plan was detailed and tested regularly, so staff members were aware of their roles should it ever have to be used.
The business did not have a business continuity plan, however. Despite being urged by the IT director to consider the impact of a disaster on its ability to operate, the business was reluctant to spend any time or money on something that might not happen. The business leaders also assumed that the IT plan would be sufficient on its own.
The IT plan was based upon the need to have critical systems available to staff from a remote recovery site within 12 hours of a disaster that rendered the data center inoperable. Suggested possible events that might have this effect were fire, floods, extreme weather, and so on. Any such event would have meant that the entire head office (where the data center was based) would also be out of commission. The head office employed 600 insurance clerks selling policies and settling claims.
If such an event had actually occurred, the IT plan would have meant the critical services would have been available within hours. However, the staff members who used these services would have nowhere from which to work, because there was no alternative office accommodation planned (because there was no BCM plan). Even if some accommodation had been found, the staff would have been unable to work, because their work was based on paper files, housed in filing cabinets in the head office. Those paper files would not have been available.
Thus, the IT service continuity plan would have been useful only in the very limited circumstances of a major event that affected the data center and no other parts of the building—an unlikely scenario!
A competitor insurance company, based a few blocks away, had a detailed business continuity plan in addition to an IT service continuity plan. Every document entering the building was scanned, and an electronic copy was stored off-site. This company could be secure in the knowledge that any major event affecting the data center would not prevent the business from being able to continue working.

The IT service continuity management process supports the organization’s business continuity management process. It is responsible for identifying and managing the risks to the IT services, agreeing with the business what the minimum requirement for the service would be in the event of a disaster, and ensuring that this agreed level can be provided.

A fundamental objective of the process is to reduce the chance of a disaster occurring at all by identifying the risks to IT services and implementing cost-effective countermeasures to reduce or remove the risk. Should a disaster occur despite these efforts, ITSCM ensures that there is a detailed, tested plan to recover the services to an agreed level within the agreed timescales. Dependent on the business requirement, the service restoration may need to take place within minutes, hours, or one or more days.

What Does ITSCM Aim to Achieve?

ITSCM should develop a number of plans to provide an acceptable level of IT services in the event of a major disruption. Several plans are required to fit the various scenarios involved. The scenarios catered for, and the decision as to what is an acceptable level of service, are arrived at in consultation with the overall business continuity management function.

The service continuity requirement may change over time as the business’s use of and dependence upon the various IT services changes. It is essential that ITSCM carries out regular business impact analysis (BIA) to ensure that the plan still fits the requirement. Should the requirement have changed, the plan must also be changed.

Risks to the IT services may also change over time, so a program of risk assessment exercises must be undertaken to ensure that new risks are identified and mitigated; the level of acceptable risk needs to be agreed on with the business. Risk assessment may require the involvement of availability and information security management, because each of these processes involves identifying and managing particular risks.

The ITSCM manager will be a source of expertise on continuity issues and so may be consulted by the business or by other parts of IT when they need guidance. It is essential that all changes are assessed to understand their impact on the ITSCM plans and procedures. An apparently straightforward change may remove a level of resilience, for example, or a departmental reorganization may split a single role in the plan across a number of individuals, meaning that this responsibility may have to be reassigned.

The major objective of ITSCM is to ensure that solutions have been developed and put in place so that the required level of service (or better) can continue to be provided. Where these solutions involve the use of services supplied by external third-party suppliers, ITSCM will work with supplier management to ensure the necessary contracts are negotiated and agreed.

What Is Included in ITSCM?

Every IT service suffers from failures from time to time. ITSCM is not concerned with these service interruptions, which are handled through the incident management process. Neither does it get involved with managing risks as a result of business changes. Its focus is on the major events that have a catastrophic impact on the ability of the service provider to supply the vital services that enable the business to achieve its aims. The definition of catastrophic failure will vary between organizations. For example, the trading floor of a financial institution will feel a major impact within minutes, but other organizations may not be affected for hours or longer. Damage may be financial, but it may also be legal (failure to submit information in time to an official regulatory or government body). There may be damage to the “brand.” Downtime on a global online book retailer’s website, for example, would cause poor publicity, as well as missed sales opportunities. Undertaking a business impact analysis will help the business and the service provider agree on what the minimum requirements are for a particular organization. They will need to consider the various locations, the business processes carried out there, and the services used at each. From this, an appropriate ITSCM response can be designed to provide the required technical facilities to enable the critical work to continue at the agreed level.

The scope of the ITSCM process includes agreeing on the policies and the services to be included in the plans, carrying out business impact analysis, and assessing and managing likely risks. Managing the risks entails identifying any steps that could be taken to reduce the likelihood of an occurrence or lessen its impact if avoidance is impossible, as long as the cost is justified.

Developing a strategy for service continuity, based on this business impact analysis and the risk management actions and aligned to the business continuity strategy, is a major part of the ITSCM process, shown in Figure 6.11. The strategy includes detailed recovery plans and involves regular testing and adjustments as necessary should requirements change. We will start by looking at the business impact analysis and the risk assessment processes that form part of the requirements and strategy phase of the process.

FIGURE 6.11 The ITSCM lifecycle

Based on Cabinet Office ITIL® material. Reproduced under license from the Cabinet Office.


Assessing Business Impact

The requirements and strategy phase of the ITSCM process—involving a detailed understanding of the requirement, through BIA, and an assessment of likely risks—is crucial. If these stages are rushed or incomplete, there is a real risk that the plans would not fit the business requirement, leading to severe, possibly terminal, business impact should the worst happen. The assessment identifies which are the key services, because it is these services that must continue, despite what has occurred.

BIA also considers various scenarios, because the same event may not have an equivalent impact if it occurs at different times. The failure of financial reporting at year end would have a much greater impact than at another time, for example. The analysis should also consider whether the impact would escalate the longer the service was unavailable, because this would affect the choice of recovery option, favoring a faster recovery even at a greater cost.

ITSCM must understand how long recovery would take and what would be required to enable this recovery to take place. The BIA clarifies the relative business priority for each service. Where an impact would be severe from the start, implementing measures to reduce the chance of a service-affecting failure would be justified (failover, and so on). Where the impact takes some time to build up, a plan to restore the service within hours or days would be sufficient (see Figure 6.12). Each organization is likely to include a variety of recovery requirements.

FIGURE 6.12 Graphical representation of business impacts

Based on Cabinet Office ITIL® material. Reproduced under license from the Cabinet Office.
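
The idea that impact may be severe from the start or may build up over time can be made concrete with a small calculation. The following Python sketch is illustrative only: the function name, the cumulative impact figures, and the tolerable level are all assumptions invented for this example, not values from any real BIA. It finds the first point at which the cumulative impact of an outage exceeds what the business has said it can tolerate, which suggests how quickly the service would need to be recovered.

```python
from typing import Dict, Optional


def hours_until_intolerable(
    cumulative_impact_by_hour: Dict[int, float],
    tolerable_impact: float,
) -> Optional[int]:
    """Return the first recorded hour at which impact exceeds the tolerable level.

    Returns None if the impact never exceeds that level in the data supplied.
    """
    for hour, impact in sorted(cumulative_impact_by_hour.items()):
        if impact > tolerable_impact:
            return hour
    return None


# Hypothetical figures: cumulative financial impact by hours of outage.
trading_floor = {0: 0, 1: 500_000, 4: 2_000_000}            # severe almost at once
payroll = {0: 0, 24: 1_000, 72: 50_000, 168: 400_000}       # builds up slowly

TOLERABLE = 100_000  # assumed level agreed with the business

trading_deadline = hours_until_intolerable(trading_floor, TOLERABLE)
payroll_deadline = hours_until_intolerable(payroll, TOLERABLE)

print("Trading floor must be restored before hour:", trading_deadline)
print("Payroll must be restored before hour:", payroll_deadline)
```

In this sketch the trading service crosses the tolerable level within the first hour, which would argue for preventive measures such as failover, whereas the payroll service could be covered by a plan that restores it over several days.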


Business impact analysis provides a mapping of the critical business processes against the IT components that provide the IT services that support them. Only with this information can a decision be made as to what needs to be recovered and the necessary timescales. It is essential that senior business staff and those who actually carry out the activity are involved in the BIA; IT would otherwise decide this from an entirely technical viewpoint, being unaware that some apparently minor system may actually be required to deliver critical business processes. The business may also decide that the fast recovery options are too expensive and readjust its requirements.
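
The mapping produced by a BIA can be pictured as a simple data structure. The sketch below, in Python, is a minimal illustration under stated assumptions: the class, the function, the process names, the component names, and the outage tolerances are all hypothetical. It records which IT components each business process depends on and derives, for every component used by a critical process, the tightest recovery timescale that any of those processes requires.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class BusinessProcess:
    name: str
    critical: bool                 # as judged by the business, not by IT
    max_outage_hours: float        # longest outage the business says it can tolerate
    it_components: Tuple[str, ...]  # components the supporting IT service relies on


def recovery_targets(processes: Iterable[BusinessProcess]) -> Dict[str, float]:
    """Map each component used by a critical process to its tightest recovery target."""
    targets: Dict[str, float] = {}
    for process in processes:
        if not process.critical:
            continue  # noncritical processes do not drive recovery timescales
        for component in process.it_components:
            current = targets.get(component)
            if current is None or process.max_outage_hours < current:
                targets[component] = process.max_outage_hours
    return targets


# Hypothetical BIA output; every name and figure here is illustrative.
bia = [
    BusinessProcess("Settle claims", True, 12, ("claims_db", "document_store", "lan")),
    BusinessProcess("Sell policies", True, 4, ("quote_engine", "claims_db", "lan")),
    BusinessProcess("Staff newsletter", False, 240, ("intranet_cms", "lan")),
]

targets = recovery_targets(bia)
for component, hours in sorted(targets.items(), key=lambda item: (item[1], item[0])):
    print(f"{component}: recover within {hours} h")
```

A shared component inherits the most demanding requirement of the processes that use it, which is one way an apparently minor system can turn out to need fast recovery.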

Assessing Risk

Although the ITSCM plan provides a level of assurance that critical business processes could be recovered in a suitable timescale should a catastrophic event occur, it is preferable that the event does not occur at all. Many such events cannot be foreseen or prevented, but thorough risk assessment and management of the identified risks greatly reduce the likelihood of a disaster. Risk assessment requires an understanding of likely threats and how vulnerable the organization is to those threats. Risk management then considers suitable cost-justifiable responses to these threats. The aim is to reduce the vulnerability to the risk, making it less likely to occur, or to minimize its impact, should it be unpreventable. As you learned earlier, risk management also takes place in the availability and information security management processes.

Risk assessment will compile a list of evaluated risks—some within an acceptable level of risk, some beyond it. The countermeasures should reduce the likelihood or the impact of a threat, reducing its score to within acceptable levels. Table 6.1 shows an example of the output from an assessment.

TABLE 6.1 Examples of risks and threats

Risk: Loss of internal IT systems/networks, PABXs, ACDs, and so on
Threats:
    Fire
    Power failure
    Arson and vandalism
    Flood
    Aircraft impact
    Weather damage, such as from a hurricane
    Environmental disaster
    Terrorist attack
    Sabotage
    Catastrophic failure
    Electrical damage, such as from lightning
    Accidental damage
    Poor-quality software

Risk: Loss of external IT systems/networks, such as e-commerce servers, cryptographic systems
Threats:
    All of the above
    Excessive demand for services
    Denial-of-service attack, such as against an Internet firewall
    Technology failure, such as cryptographic system

Risk: Loss of data
Threats:
    Technology failure
    Human error
    Viruses, malicious software, such as attack applets

Risk: Loss of network services
Threats:
    Damage or denial of access to network service provider’s premises
    Loss of service provider’s IT systems/networks
    Loss of service provider’s data
    Failure of the service provider

Risk: Unavailability of key technical and support staff
Threats:
    Industrial action
    Denial of access to premises
    Resignation
    Sickness/injury
    Transport difficulties

Risk: Failure of service providers, such as outsourced IT
Threats:
    Commercial failure, such as insolvency
    Denial of access to premises
    Unavailability of service provider’s staff
    Failure to meet contractual service levels
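
The evaluation of risks against an acceptable level can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a prescribed method: scoring a risk as likelihood multiplied by impact, the 1 to 5 scales, the threshold of 8, and the figures assigned to each threat are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Risk:
    threat: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int      # assumed scale: 1 (minor) to 5 (catastrophic)

    @property
    def score(self) -> int:
        # Assumed scoring rule: likelihood multiplied by impact.
        return self.likelihood * self.impact


ACCEPTABLE_SCORE = 8  # assumed threshold, to be agreed with the business


def beyond_acceptable(risks: Iterable[Risk], threshold: int = ACCEPTABLE_SCORE) -> List[Risk]:
    """Return the risks scoring above the threshold, highest score first."""
    flagged = [risk for risk in risks if risk.score > threshold]
    return sorted(flagged, key=lambda risk: risk.score, reverse=True)


# Hypothetical assessment of three threats; the figures are invented.
register = [
    Risk("Power failure", likelihood=4, impact=4),
    Risk("Flood", likelihood=2, impact=5),
    Risk("Aircraft impact", likelihood=1, impact=5),
]

flagged = beyond_acceptable(register)
for risk in flagged:
    print(f"{risk.threat}: score {risk.score} exceeds {ACCEPTABLE_SCORE}")

# A countermeasure (for example, a standby generator) lowers the impact,
# bringing the score back within the acceptable level.
mitigated = Risk("Power failure", likelihood=4, impact=2)
print(f"After countermeasure: {mitigated.threat} scores {mitigated.score}")
```

A countermeasure may act on either factor: it can make the threat less likely, or it can limit the damage when the threat materializes. Either way the aim is to bring the score within the agreed threshold at a justifiable cost.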

Business impact analysis and risk management enable the IT service provider and the business to devise an appropriate ITSCM plan combining risk reduction measures with recovery should an unavoidable event occur. The plan will be cost-effective, because only the critical services will have a full, speedy recovery; other services will have a lower level of protection that fits their lower level of criticality.

Not all risks can be avoided. A disaster affecting a nearby location, such as a gas explosion, would inevitably impact the service being provided; if the interruption to the service is short-lived, it may be decided that invoking the ITSCM plan is not warranted.
