As discussed when talking about availability management, a service delivers value only when it is available for use. In addition to the activities carried out under the availability management process, there is a requirement on the IT service provider to ensure that the service is protected from catastrophic events that could prevent it from being delivered at all. Where these cannot be avoided, there is a requirement to have a plan to recover from any such disruption in a timescale and at a cost that meets the business requirement. Ensuring IT service continuity is an essential element of the warranty of the service.
It is important to understand that IT service continuity management (ITSCM) is responsible for the continuity of the IT services required by the business. The business itself should have a business continuity plan to ensure that any potential situations that would impact the ability of the business to function are identified and avoided. Where it is not possible to avoid such an event, the business continuity management (BCM) process should have a plan, which is appropriate and affordable, to both minimize its impact and recover from it. Thus, the ITSCM plan can be seen as one of a number of elements making up a business continuity plan, along with a human resources continuity plan, a financial management continuity plan, a building management continuity plan, and so on.
The IT service continuity management process supports the organization’s business continuity management process. It is responsible for identifying and managing the risks to the IT services, agreeing with the business what the minimum requirement for the service would be in the event of a disaster, and ensuring that this agreed level can be provided.
A fundamental objective of the process is to reduce the chance of a disaster occurring at all by identifying the risks to IT services and implementing cost-effective countermeasures to reduce or remove the risk. Should a disaster occur despite these efforts, ITSCM ensures that there is a detailed, tested plan to recover the services to an agreed level within the agreed timescales. Depending on the business requirement, the service restoration may need to take place within minutes, hours, or one or more days.
ITSCM should develop a number of plans to provide an acceptable level of IT services in the event of a major disruption. Several plans are required to fit the various scenarios involved. The scenarios catered for, and the decision as to what is an acceptable level of service, are arrived at in consultation with the overall business continuity management function.
The service continuity requirement may change over time as the business’s use of, and dependence upon, the various IT services change. It is essential that ITSCM carries out regular business impact analysis (BIA) to ensure that the plan still fits the requirement. Should the requirement have changed, the plan must also be changed.
Risks to the IT services may also change over time, so a program of risk assessment exercises must be undertaken to ensure that new risks are identified and mitigated; the level of acceptable risk needs to be agreed on with the business. Risk assessment may require the involvement of availability and information security management, because each of these processes involves identifying and managing particular risks.
The ITSCM manager will be a source of expertise on continuity issues and so may be consulted by the business or by other parts of IT when they need guidance. It is essential that all changes are assessed to understand their impact on the ITSCM plans and procedures. An apparently straightforward change may remove a level of resilience, for example, or a departmental reorganization may split a single role in the plan across a number of individuals, meaning that this responsibility may have to be reassigned.
The major objective of ITSCM is to develop and put in place solutions that ensure the required level of service (or better) can continue to be provided. Where these solutions involve the use of services supplied by external third-party suppliers, ITSCM will work with supplier management to ensure the necessary contracts are negotiated and agreed.
Every IT service suffers from failures from time to time. ITSCM is not concerned with these service interruptions, which are handled through the incident management process. Neither does it get involved with managing risks as a result of business changes. Its focus is on the major events that have a catastrophic impact on the ability of the service provider to supply the vital services that enable the business to achieve its aims. The definition of catastrophic failure will vary between organizations. For example, the trading floor of a financial institution will feel a major impact within minutes, but other organizations may not be affected for hours or longer. Damage may be financial, but it may also be legal (failure to submit information in time to an official regulatory or government body). There may be damage to the “brand.” Downtime on a global online book retailer’s website, for example, would cause poor publicity, as well as missed sales opportunities. Undertaking a business impact analysis will help the business and the service provider agree on what the minimum requirements are for a particular organization. They will need to consider the various locations, the business processes carried out there, and the services used at each. From this, an appropriate ITSCM response can be designed to provide the required technical facilities to enable the critical work to continue at the agreed level.
The scope of the ITSCM process includes agreeing on the policies and the services to be included in the plans, carrying out business impact analysis, and assessing and managing likely risks. Managing the risks entails identifying any steps that could be taken to reduce the likelihood of an occurrence or lessen its impact if avoidance is impossible, as long as the cost is justified.
Developing a strategy for service continuity, based on this business impact analysis and the risk management actions and aligned to the business continuity strategy, is a major part of the ITSCM process, shown in Figure 6.11. The strategy includes detailed recovery plans and involves regular testing and adjustments as necessary should requirements change. We will start by looking at the business impact analysis and the risk assessment processes that form part of the requirements and strategy phase of the process.
Based on Cabinet Office ITIL® material. Reproduced under license from the Cabinet Office.
The requirements and strategy phase of the ITSCM process—involving a detailed understanding of the requirement, through BIA, and an assessment of likely risks—is crucial. If these stages are rushed or incomplete, there is a real risk that the plans would not fit the business requirement, leading to severe, possibly terminal, business impact should the worst happen. The assessment identifies which are the key services, because it is these services that must continue, despite what has occurred.
BIA also considers various scenarios; the same event may not have an equivalent impact if it occurs at different times; the failure of financial reporting at year end would have a much greater impact than at another time, for example. The analysis should also consider whether the impact would escalate the longer the service was unavailable, because this would affect the choice of recovery option, favoring a faster recovery even at a greater cost.
ITSCM must understand how long recovery would take and what would be required to enable this recovery to take place. The BIA clarifies the relative business priority for each service. Where an impact would be severe from the start, implementing measures to reduce the chance of a service-affecting failure would be justified (failover, and so on). Where the impact takes some time to build up, a plan to restore the service within hours or days would be sufficient (see Figure 6.12). Each organization is likely to include a variety of recovery requirements.
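As a rough illustration of how the BIA findings might drive the choice of recovery approach, the following Python sketch maps the time a service can be unavailable before the impact becomes severe to a broad recovery category. The function name, the two thresholds (one hour and one day), and the category labels are assumptions made for this example only; they are not taken from ITIL, and each organization would agree on its own values with the business.

```python
from datetime import timedelta

# Hypothetical thresholds; real values are agreed with the business during the BIA.
SEVERE_FROM_START = timedelta(hours=1)
BUILDS_WITHIN_A_DAY = timedelta(days=1)


def recovery_approach(time_to_severe_impact: timedelta) -> str:
    """Suggest a recovery category from how quickly an outage becomes severe."""
    if time_to_severe_impact <= SEVERE_FROM_START:
        # Impact is severe almost immediately, so measures that reduce the
        # chance of a service-affecting failure (failover, and so on) are justified.
        return "preventive measures (failover)"
    if time_to_severe_impact <= BUILDS_WITHIN_A_DAY:
        # Impact builds up over several hours.
        return "restore within hours"
    # Impact takes longer to build up, so a slower, cheaper option is sufficient.
    return "restore within days"


trading_floor = recovery_approach(timedelta(minutes=10))
payroll = recovery_approach(timedelta(days=5))
```

A real assessment would weigh cost against these categories as well, because the business may decide that the fastest option is not affordable.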
Business impact analysis provides a mapping of the critical business processes against the IT components that provide the IT services supporting them. Only with this information can a decision be made as to what needs to be recovered and the necessary timescales. It is essential that senior business staff and those who actually carry out the activity are involved in the BIA; IT would otherwise decide this from an entirely technical viewpoint, being unaware that some apparently minor system may actually be required to deliver critical business processes. The business may also decide that the fast recovery options are too expensive and readjust its requirements.
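The mapping from business processes to IT components can be pictured as a simple data structure. The sketch below is illustrative only: the process names, component names, and outage figures are invented, and the rule used (a component inherits the shortest outage tolerated by any process that depends on it) is one plausible way to derive recovery targets, not a method prescribed by ITIL.

```python
from datetime import timedelta

# Hypothetical BIA output: each business process, the longest outage it can
# tolerate, and the IT components it depends on. All values are invented.
BIA = {
    "order processing": {
        "max_outage": timedelta(hours=4),
        "components": ["web server", "order database", "network"],
    },
    "monthly reporting": {
        "max_outage": timedelta(days=3),
        "components": ["reporting server", "order database", "network"],
    },
}


def component_recovery_targets(bia: dict) -> dict:
    """Return, per component, the shortest outage any dependent process tolerates."""
    targets = {}
    for process in bia.values():
        for component in process["components"]:
            current = targets.get(component)
            if current is None or process["max_outage"] < current:
                targets[component] = process["max_outage"]
    return targets


targets = component_recovery_targets(BIA)
for name, target in sorted(targets.items()):
    print(f"{name}: recover within {target}")
```

In this invented example the order database is shared by both processes, so it takes the stricter four-hour target even though monthly reporting alone could wait three days. This is the kind of dependency that a purely technical view can miss.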
Although the ITSCM plan provides a level of assurance that critical business processes could be recovered in a suitable timescale should a catastrophic event occur, it is preferable that the event does not occur at all. Many such events cannot be foreseen, or prevented, but a thorough risk assessment and management of the identified risks greatly reduces the likelihood. Risk assessment requires an understanding of likely threats and how vulnerable the organization is to those threats. Risk management then considers suitable cost-justifiable responses to these threats. The aim is to reduce the vulnerability to the risk, making it less likely to occur, or to minimize its impact, should it be unpreventable. As you learned earlier, risk management also takes place in the availability and information security management processes.
Risk assessment will compile a list of evaluated risks—some within an acceptable level of risk, some beyond it. The countermeasures should reduce the likelihood or the impact of a threat, reducing its score to within acceptable levels. Table 6.1 shows an example of the output from an assessment.
Risk | Threat |
Loss of internal IT systems/networks, PABXs, ACDs, and so on | Fire |
| Power failure |
| Arson and vandalism |
| Flood |
| Aircraft impact |
| Weather damage, such as from a hurricane |
| Environmental disaster |
| Terrorist attack |
| Sabotage |
| Catastrophic failure |
| Electrical damage, such as from lightning |
| Accidental damage |
| Poor-quality software |
Loss of external IT systems/networks, such as e-commerce servers, cryptographic systems | All of the above |
| Excessive demand for services |
| Denial-of-service attack, such as against an Internet firewall |
| Technology failure, such as cryptographic system |
Loss of data | Technology failure |
| Human error |
| Viruses, malicious software, such as attack applets |
Loss of network services | Damage or denial of access to network service provider’s premises |
| Loss of service provider’s IT systems/networks |
| Loss of service provider’s data |
| Failure of the service provider |
Unavailability of key technical and support staff | Industrial action |
| Denial of access to premises |
| Resignation |
| Sickness/injury |
| Transport difficulties |
Failure of service providers, such as outsourced IT | Commercial failure, such as insolvency |
| Denial of access to premises |
| Unavailability of service provider’s staff |
| Failure to meet contractual service levels |
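The idea of scoring a risk and using countermeasures to bring it within an acceptable level can be sketched in a few lines of Python. The text does not say how the score is calculated, so this example assumes a common convention: likelihood and impact are each rated from 1 to 5 and multiplied together. The threshold of 8, the class and function names, and the ratings given to the sample risk are all hypothetical.

```python
from dataclasses import dataclass

ACCEPTABLE_SCORE = 8  # hypothetical threshold, to be agreed with the business


@dataclass
class Risk:
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int      # assumed scale: 1 (negligible) to 5 (catastrophic)

    @property
    def score(self) -> int:
        # Assumed scoring rule: likelihood multiplied by impact.
        return self.likelihood * self.impact


def needs_countermeasure(risk: Risk) -> bool:
    """True when the risk score is beyond the acceptable level."""
    return risk.score > ACCEPTABLE_SCORE


def apply_countermeasure(risk: Risk, likelihood_reduction: int = 0,
                         impact_reduction: int = 0) -> Risk:
    """Return a new Risk with reduced ratings, never dropping below 1."""
    return Risk(
        risk.name,
        max(1, risk.likelihood - likelihood_reduction),
        max(1, risk.impact - impact_reduction),
    )


# Invented ratings for one threat from the table above.
power_failure = Risk("Power failure", likelihood=3, impact=5)
# A countermeasure such as standby power would mainly reduce the impact.
mitigated = apply_countermeasure(power_failure, impact_reduction=3)
```

Here the unmitigated score of 15 is beyond the assumed threshold, while the mitigated score of 6 falls within it. Whether the countermeasure is worth its cost remains a business decision.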
Business impact analysis and risk management enable the IT service provider and the business to devise an appropriate ITSCM plan combining risk reduction measures with recovery in the event of an unavoidable event. The plan will be cost-effective, because only the critical services will have a full, speedy recovery; other services will have a lower level of protection that fits their lower level of criticality.
Not all risks can be avoided. A disaster affecting a nearby location, such as a gas explosion, would inevitably impact the service being provided; if the interruption to the service is short-lived, it may be decided that invoking the ITSCM plan is not warranted.