Chapter 34
Service Operation Processes: Incident and Problem Management

THE FOLLOWING ITIL INTERMEDIATE EXAM OBJECTIVES ARE DISCUSSED IN THIS CHAPTER:

  • ✓  Incident management and problem management are discussed in terms of
    • Purpose
    • Objectives
    • Scope
    • Value
    • Policies
    • Principles and basic concepts
    • Process activities, methods, and techniques
    • Triggers, inputs, outputs, and interfaces
    • Critical success factors and key performance indicators
    • Challenges
    • Risks

 The syllabus for the intermediate service operation exam covers the managerial and supervisory aspects of service operation processes. It excludes the information management aspects of the processes and the day-to-day operation of each process, including details of the process activities, methods, and techniques. More detailed process operation guidance is covered in the service capability courses. Each process is considered from the management perspective. That means that by the end of the chapters covering the service operation processes (Chapters 34–37), you should know enough about each process and its interfaces to oversee its implementation and judge its effectiveness and efficiency.

Incidents and Problems: Two Key Service Management Concepts

The two processes of incident management and problem management are among the most important of all the ITIL processes. They are often the first to be implemented by an organization that has decided to adopt the ITIL framework. The differentiation between incident management and problem management is important, and an organization that has adopted both of these processes has made a major advance toward improving its services and its service management.

Both these processes are carried out by every IT service provider, whether they are called by these names or not. All service providers fix faults as quickly as possible when they occur (incident management) and try to ascertain why the fault occurred so that it can be prevented from happening again (problem management). Many organizations do not differentiate between the two processes, however, and problem management in particular may not be carried out in a consistent fashion. A failure to appreciate the difference between problems and incidents may result in delayed service restoration following an incident and in allowing incidents to recur, causing business disruption each time.

ITIL provides guidance for the best approach to these two key processes. Effective incident management will improve availability, ensuring that users are able to get back to work quickly following a failure. Problem management will improve the overall quality and availability of services by reducing recurring incidents and preventing incidents from happening in the first place (and as such, works in conjunction with continual service improvement); it also makes best use of skilled IT staff, who are freed from resolving repeat incidents and are able to spend time preventing them instead.

Incident Management

In ITIL terminology, an incident is defined as an unplanned interruption to an IT service, a reduction in the quality of an IT service, or a failure of a configuration item (CI) that has not yet impacted an IT service (for example, failure of one disk from a mirror set). We can easily think of examples of unplanned interruptions to an IT service: server crash, hardware failure, and so on. A good example of a reduction in the quality of the IT service is when a service is running so slowly that it is impacting the organization’s ability to achieve its objectives. An example of a failure that has yet to impact the IT service could be the failure of a disk in a storage device with a RAIDed configuration. This is treated as an incident even though the storage device is still usable. The point is that the failure of one disk means that the storage is no longer protected from the failure of another disk. If another disk fails, the device will fail, and the data on it will be inaccessible.

This definition of an incident is important because the incident is an interruption to service and restoring the service or improving the quality of the service to agreed levels resolves the incident. Note that incident resolution does not include understanding why the fault occurred or preventing its recurrence; these are matters for problem management. By understanding this distinction, you can see that resolving an incident does not require the skills that resolving a problem requires. If the service can be restored by a simple reboot, then the user can be instructed to do this by the service desk staff, and the service desk staff can use known error workarounds and knowledge base articles to resolve commonly occurring incidents. In both of these examples, it is not necessary to involve the more skilled (and therefore more expensive) second-line technicians. From the user and business perspectives, the focus is on being able to get back to work, and there is little interest in the cause of the failure. Repeat occurrences will impact their work and increase the number of calls to the service desk, so an investigation of the cause, and the permanent resolution of the underlying problem, will be required, but this can take place without impacting the users.

The incident management process is responsible for progressing all incidents from when they are first reported until they are closed. Some organizations may have dedicated incident management staff, but the most common approach is to make the service desk responsible for the process.

The Purpose of Incident Management

The purpose of incident management is to restore normal service operation as quickly as possible and minimize the adverse impact on business operations, thus ensuring that agreed levels of service quality are maintained. Normal service operation is defined as an operational state where services and CIs are performing within their agreed service and operational levels, that is, as agreed in the SLA.

As explained, by focusing on service restoration, incident management enables the business to return to work quickly, thus ensuring that the impact on business processes and deadlines is reduced.

The Objectives of Incident Management

The objectives of the incident management process are as follows:

  • Use standardized methods and models for managing incidents. This means that incidents should be handled in a consistent way regardless of service, technology, or support group. This enables the business and the service provider management to have clear expectations of how any particular incident will be handled. Also, in the case of repeat incidents, the logging and resolution can be proceduralized using incident models, thus ensuring both consistency and rapid restoration of service.
  • Increase visibility and communication of incidents. This allows the business to track the progress of the incidents reported and ensures that all IT staff have access to the information relating to an incident.
  • Enhance business perception of IT. All IT service providers will suffer incidents, but the way in which they respond can enhance their reputation; a well-designed incident management process should enhance the reputation of IT to the business by demonstrating a professional, effective approach.
  • Align activities with business needs by ensuring that incidents are prioritized based on their importance to the business.
  • Maintain user satisfaction.

All incidents must be efficiently responded to, analyzed, logged, managed, resolved, and reported. By carrying out these tasks in an efficient and effective manner and by ensuring that affected customers are updated as required, the IT service provider aims to improve customer satisfaction, even though a fault has occurred. At all times during the incident management process, the needs of the business must be considered; business priorities must influence IT priorities.

The Scope of Incident Management

Incident management encompasses all incidents: all events that have a real or potential impact on the quality of the service. Incidents will mostly be logged as the result of a user contacting the service desk, but event management tools may report an incident following an alert (see the discussion of event management in Chapter 36, “Service Operation Processes: Event Management”); often there will be a link between the event system and incident management tool so that events meeting certain criteria can automatically generate an incident log. Third-party suppliers may notify the service desk of a failure, or technical staff may notice that an error condition has arisen and log an incident.

Requests may be logged and managed at the service desk, but it is important to differentiate between these requests and incidents; in the case of requests, no service has been impacted. Problem management seeks to reduce the number of incidents over time, whereas the IT service provider may want to handle increasing numbers of requests through the service desk and the request fulfillment process as a quick, efficient, and customer-focused method of dealing with them.

The Value of Incident Management to the Business

Efficient incident management delivers several benefits to the business. It reduces the cost of incident resolution by resolving incidents quickly, using less-skilled staff. Where incidents need to be escalated, the resolution is faster (because the relevant information required will have been gathered and an initial diagnosis will have been made). Faster incident resolution means a faster return to work for the affected users, who are once again able to exploit the functionality of the service to deliver business benefits. Effective incident management improves the overall efficiency of the organization because nonproductive users are a cost to the company.

Incident prioritization is based on business priorities, which ensures that resources are allocated to maximize the business benefit. The data gathered by the service desk about the numbers and types of incidents can be analyzed to identify training requirements or potential areas for improvement.

As highlighted in the discussion of the service desk in Chapter 39, “Organizing for Service Operation,” incident management is one of the most visible processes as well as one that all users understand the need for. It is therefore one of the easier areas to improve because an improved incident resolution service has easily understood benefits for the business.

Incident Management Policies

Next we consider some of the policies that support effective incident management.

Policy: Incidents and their status must be communicated in a timely and effective way. Users must be kept informed of the progress of incidents that affect them, and the information they are given must make sense to them—most users don’t understand technical jargon, so it should not be used. This policy shows that incident management is not solely about resolving the incident. The business needs information about the status of open incidents so that it can make decisions about what to do to minimize the business impact. This is one of the key responsibilities of the service desk.

Policy: Incidents must be resolved in timescales that are acceptable to the business. We should remember that IT is the servant of the business. Service level management will determine what timescales are acceptable and must ensure that the necessary operational level agreements and underpinning contracts (UCs) are in place to support them. To achieve the resolution within the required timescales requires the allocation of sufficient and appropriate resources to work on the incident, with access to the necessary technology and tools such as the known error database and configuration information.

Policy: Customer satisfaction must be maintained at all times. Customer satisfaction is not only about meeting SLA targets. Users and customers will be dissatisfied if support staff are rude or patronizing, even if IT meets or exceeds all of its incident-related targets.

Policy: Incident processing and handling should be aligned with overall service levels and objectives. Incidents should not be handled to suit the convenience or priorities of IT; incident management activities should support service levels and objectives by prioritizing those activities based on actual business need.

Policy: All incidents should be stored and managed in a single management system. Using a single system provides a definitive recognized source for incident information and supports reporting and investigation efforts. Status and detailed information on incidents should be recorded and updated on a timely basis in incident records.

Policy: All incidents should be categorized in a standard way. Standardized categorization enables useful analysis and reporting; it speeds up troubleshooting by making it easier to find other occurrences of a fault within the incident database or the known error database (KEDB). It makes identification of trends easier. For incident management to be effective, there must be a well-defined and communicated set of incident classification categories. The tool can be programmed to discourage the entry of nonstandard categories.

Policy: Incident records should be audited on a regular basis. The incident database is a rich source of information about what is happening in the infrastructure and about the difficulties being experienced by the business, but its usefulness depends on it being accurate and complete. We have already considered the importance of consistency in data entry; this policy ensures that the guidelines for data entry are being followed by auditing incident records for accuracy and completeness. Any issues discovered should be noted and acted upon.

Policy: All incidents should use a common format. Staff should record the information that the service provider has deemed to be necessary in a standard format. This helps both the management of a live incident and later reporting and analysis.

Policy: Use a common and agreed-upon set of criteria for prioritizing and escalating. This ensures that customer needs are handled consistently across all services and components rather than being dependent on the opinion of whoever logs the incident.

Principles and Basic Concepts for Incident Management

ITIL describes a number of principles and basic concepts to keep in mind when implementing the incident management process. They are covered in the following sections.

Timescales

Time is of the essence in incident management because every incident represents some loss or deterioration of service. Every aspect of the process needs to be optimized to produce the fastest end result. Service level agreements, operational level agreements, and underpinning contracts with measurable targets will define how long a support group or third party has to complete each step. When an incident is passed to a support group for investigation, a clock starts ticking—that group has a certain amount of time, defined in the OLA, to complete its work. If the incident is passed to a third party, the timescales set in the underpinning contracts would apply. These OLA and contract target timescales support the achievement of the SLA target.

Service management toolsets should be configured to capture how long it takes to log and escalate an incident, how many incidents are resolved within the first few minutes without requiring escalation, and how long support teams take to respond to and to fix incidents. These times should be monitored, and steps should be taken to identify bottlenecks or underperforming teams so that improvement actions can be taken. The tools can be used to automate timescales and escalate an incident as required based on predefined rules. The tools can also use alarms to warn when incidents are nearing a breach of the SLA. Care needs to be taken in organizations that cross time zones to make sure the clock is monitoring the appropriate time zone as well as the targeted time.
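The time zone point can be illustrated with a minimal sketch. It assumes the tool stores timezone-aware timestamps; the function name `time_remaining` and the sample offsets are hypothetical and not taken from any particular toolset.

```python
from datetime import datetime, timedelta, timezone


def time_remaining(logged_at, now, target):
    """Return the time left before the target resolution time is breached.

    Both timestamps must be timezone-aware so that the subtraction is done
    on absolute time, whichever local zone each site records in.
    A negative result means the target has already been breached.
    """
    if logged_at.tzinfo is None or now.tzinfo is None:
        raise ValueError("timestamps must be timezone-aware")
    return target - (now - logged_at)


# Hypothetical example: logged at 09:00 at a site on UTC+1,
# checked at 10:00 the same day at a site on UTC-5.
logged = datetime(2024, 1, 15, 9, 0, tzinfo=timezone(timedelta(hours=1)))
checked = datetime(2024, 1, 15, 10, 0, tzinfo=timezone(timedelta(hours=-5)))
remaining = time_remaining(logged, checked, timedelta(hours=8))
```

Comparing the two wall-clock times directly would suggest only one hour had passed; on absolute time, seven hours have elapsed and one hour of an eight-hour target remains.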

Incident Models

Many incidents have happened before and may well happen again. For this reason, many organizations will find it helpful to predefine standard incident models and apply them to appropriate incidents when they occur. A model is a predefined way of carrying out a commonly required task. ITIL recommends models for a number of processes, such as incident, request, and change processes. An incident model describes the steps needed to investigate and resolve a particular type of incident. The benefit of these models is that they speed up the resolution of recurring incidents as well as ensure consistency in handling them. Most service management tools have the capability to store multiple models; when one of these incidents occurs, it is logged using the appropriate prestored model. The incident can then be handled using the model. The tool can also automate many of the model steps, such as automatic assignment to the correct support group, escalations, and so on.

Incidents that would require specialized handling can be treated in this way (for example, security-related incidents can be routed to information security management, and capacity- or performance-related incidents would be routed to capacity management). Using incident models will help ensure consistency of approach and will speed up resolution.

A typical incident model should include the following contents:

  • The steps required to handle the incident, including the timescales, chronological order, and any dependencies they might have
  • Details of who is responsible for each step, and the escalation contacts
  • Precautions to be taken, such as backing up data, or steps required to comply with health- and safety-related guidelines, such as isolating equipment from the power supply
  • In the case of security- and capacity-related incidents, any steps to be taken to preserve evidence
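The contents listed above could be held in a structure along the following lines. This is only a sketch: the class names, field names, and the sample model are hypothetical, and a real service management tool will have its own schema.

```python
from dataclasses import dataclass, field


@dataclass
class ModelStep:
    order: int            # position in the chronological sequence
    action: str
    responsible: str      # person or group responsible for the step
    target_minutes: int   # timescale allowed for the step
    depends_on: list = field(default_factory=list)  # orders of prerequisite steps


@dataclass
class IncidentModel:
    name: str
    steps: list
    escalation_contacts: list
    precautions: list = field(default_factory=list)
    evidence_steps: list = field(default_factory=list)  # security/capacity incidents

    def ordered_steps(self):
        """Return the steps in chronological order."""
        return sorted(self.steps, key=lambda s: s.order)

    def invalid_dependencies(self):
        """Return (step, dependency) pairs that point at a missing or later step."""
        orders = {s.order for s in self.steps}
        return [(s.order, d) for s in self.steps for d in s.depends_on
                if d not in orders or d >= s.order]


# Hypothetical model for a failed disk in a mirror set.
model = IncidentModel(
    name="Failed disk in mirror set",
    steps=[
        ModelStep(2, "Replace the failed disk", "Storage team", 120, depends_on=[1]),
        ModelStep(1, "Confirm a current backup exists", "Service desk", 15),
        ModelStep(3, "Verify the mirror has rebuilt", "Storage team", 240, depends_on=[2]),
    ],
    escalation_contacts=["Storage team leader"],
    precautions=["Back up data before replacing hardware"],
)
```

Checking dependencies when a model is defined, rather than when an incident is in progress, keeps errors in the model from delaying a live resolution.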

Major Incidents

All incidents should be resolved as quickly as possible, but some incidents are so serious, with such an impact on the business, that they require extra attention. The first step is to agree on exactly what constitutes a major incident. Some organizations will define all priority one incidents as major; others may restrict the term to those priority one incidents whose impact will be felt by external customers. Under this narrower definition, an incident with a major impact only within the organization would not normally be classed as major, whereas an incident that (for example) prevents customers from ordering goods from the organization’s website, and that is therefore affecting both revenue and reputation, would be included. The definition must align with the priority scheme to avoid confusion.

The purpose of defining an incident as a major incident is so that it can receive special focus. Specific actions to be undertaken are defined in advance so that when the major incident occurs, everyone knows what they are expected to do. The following typical actions might be defined:

  • Notification of key contacts within the service provider organization and the business as soon as the major incident is declared.
  • Regular updates posted through agreed channels. This would include who should be updated (which could be all users and IT or business management contacts) and the content and frequency of the updates. The agreed channels might include, for example, an intranet announcement, email, and telephone calls.
  • Recorded greeting on the service desk phones to inform callers that the incident has occurred and is being dealt with to reduce the number of calls being handled by the desk.
  • Appointment of a major incident manager (this may be the service desk manager) and a separate team to focus on resolving the incident.

As with any incident, some major incidents can be resolved without understanding the cause (perhaps by restarting a server); some require the underlying cause to be understood. In the second case, problem management would become involved. It is essential, however, that the focus of incident management remains on restoring service as quickly as possible.

As we discuss in Chapter 39, a major responsibility of the service desk is communicating with the users; this is particularly true in the case of major incidents. Regular updates should be provided. The service desk staff members are also accountable for ensuring that the incident record is kept up-to-date throughout the incident, although it may be the technicians in other teams who actually enter the information. An accurate record is essential during an incident so that there is no confusion; it will also be used after the incident is resolved as part of the major incident review. Regular updates showing the steps taken and whether they were successful will allow improvements to be identified for future events.

Incident Status

Incident management tracks incidents through their lifecycle using status codes, from identification through diagnosis and resolution to closure. Incident management will remind resolving groups of the associated target times, making sure no incident is forgotten or ignored.

Most service management toolsets will allow a number of statuses to be defined for each incident to facilitate progress tracking. The following statuses are typical:

Open: The incident has been identified and logged. A service desk analyst may be working on it, or the service desk may be considering which second-line team it should be escalated to. Incidents resolved by the first-line team may move directly from Open to Closed because the service desk analyst obtains the user confirmation that the incident has been satisfactorily resolved.

Assigned: This may mean the incident has been sent to a support team but not allocated to a particular individual.

Allocated or In Progress: This is usually defined as when a support technician has been allocated the call.

On Hold: This status is sometimes used when the user is not available or doesn’t have the time to test the resolution. It is used to “stop the target clock,” because the service provider cannot do anything further to resolve the incident without the user.

Resolved: This status indicates that the technician has completed their work but the customer has not yet confirmed that it was successful. It is common to use the service management tool’s automated email facility to email the user when an incident is resolved, asking for a response within a certain timescale if the user is still not happy. If no reply is received, the incident is automatically closed.

  • If the user is unhappy, the call is put back into In Progress, and further work is carried out to resolve it.
  • The service desk should attempt to contact users to obtain permission to close calls before the automated closure, especially for high-impact incidents where the user may not be aware of the resolution.

Closed: This status confirms that the incident is over to the user’s satisfaction. The incident management process has no further involvement, although problem management may now investigate the underlying cause.
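The statuses above can be read as a simple state machine. The sketch below shows one plausible set of permitted transitions inferred from the descriptions; the exact transitions are an assumption, and each organization will configure its own toolset differently.

```python
from enum import Enum


class Status(Enum):
    OPEN = "Open"
    ASSIGNED = "Assigned"
    IN_PROGRESS = "In Progress"
    ON_HOLD = "On Hold"
    RESOLVED = "Resolved"
    CLOSED = "Closed"


# Assumed transitions, inferred from the status descriptions.
ALLOWED = {
    Status.OPEN: {Status.ASSIGNED, Status.CLOSED},       # first-line fix closes directly
    Status.ASSIGNED: {Status.IN_PROGRESS},
    Status.IN_PROGRESS: {Status.ON_HOLD, Status.RESOLVED},
    Status.ON_HOLD: {Status.IN_PROGRESS},
    Status.RESOLVED: {Status.CLOSED, Status.IN_PROGRESS},  # reopened if user is unhappy
    Status.CLOSED: set(),
}

# Statuses during which the target clock is stopped.
CLOCK_STOPPED = {Status.ON_HOLD}


def move(current, new):
    """Return the new status, or raise if the transition is not permitted."""
    if new not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {new.value}")
    return new
```

Enforcing transitions in the tool prevents, for example, a closed incident from being silently reopened instead of a new incident being logged.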

Expanded Incident Lifecycle

The expanded incident lifecycle is used by the service design availability management process and within CSI. It breaks down and examines each step of the process to understand the reasons for the failed targets. For example, the diagnosis of the incident may ascertain very quickly that the resolution requires the restoration of data, which takes three hours; this information would be used to pinpoint where improvements should be made. Delays in any step of the lifecycle can be analyzed, and improvements can be implemented to speed up resolution; implementing a knowledge base or storing spare parts on site are two typical measures that are taken to shorten the diagnosis and repair steps.
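As a rough illustration of this kind of analysis, the sketch below computes how long each stage of a single incident took from a list of timestamps. The stage labels and times are hypothetical, chosen to mirror the three-hour data restoration example.

```python
from datetime import datetime


def stage_durations(timestamps):
    """Return {stage: duration} from chronological (label, start time) pairs.

    Each stage runs until the start of the next entry; the final entry
    only marks the end of the last stage.
    """
    durations = {}
    for (label, start), (_, end) in zip(timestamps, timestamps[1:]):
        durations[label] = end - start
    return durations


# Hypothetical timeline for one incident.
timeline = [
    ("detection", datetime(2024, 1, 15, 9, 0)),
    ("diagnosis", datetime(2024, 1, 15, 9, 10)),
    ("data restoration", datetime(2024, 1, 15, 9, 25)),
    ("service restored", datetime(2024, 1, 15, 12, 25)),
]

durations = stage_durations(timeline)
slowest = max(durations, key=durations.get)
```

Here the diagnosis took 15 minutes while the data restoration took three hours, so the restoration step is where improvement effort would pay off.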

Incident Management Process Activities, Methods, and Techniques

In the following sections, we’ll take a high-level look at each of the activities in incident management. Figure 34.1 shows a diagram of the process flow.

Flow diagram shows incident identification, logging, categorization, prioritization, major incident procedure, initial diagnosis, functional escalation, investigation and diagnosis, resolution and recovery et cetera.

Figure 34.1 Incident management process flow

Copyright © AXELOS Limited 2010. All rights reserved. Material is reproduced under license from AXELOS.

Step 1: Incident Identification and Logging

First, the incident is identified and then logged. Many incidents will be reported to the service desk, which will log them. But not all calls to the service desk are related to incidents; some will be service requests, which are handled by the request fulfillment process. As Figure 34.1 shows, these are rerouted.

It is essential that incidents are resolved in the shortest possible time because each represents business disruption. Whenever possible, therefore, we should be trying to realize that an incident has occurred before the user notices or, failing that, before they have reported it to the service desk. Chapter 36 shows how monitoring tools can be used to identify failures. The event management process should link directly to incident management so that any incidents spotted are worked on immediately and resolved quickly. Where event management is not in place, incidents will be identified by users contacting the service desk.

The incident record contains all the information concerning a particular incident; details of when it was logged, assigned, resolved, and closed are often required for service level management reporting. Details of symptoms and the affected equipment may be used by problem management. Steps taken to resolve the incident can be used to populate a knowledge base. It is essential, therefore, that all relevant information is added to the record as it progresses through its lifecycle.

A good integrated service management tool makes good recordkeeping much easier because it can automatically populate the record with user details (from Active Directory or a similar tool) and equipment and warranty details (based on the CI number). Automatic date and time stamping of each update and identification of who made the update will both improve the completeness of the information in the record.

Step 2: Incident Categorization

Incidents are categorized during the logging stage. This can be helpful in guiding the service desk agent to the correct known error entry or the appropriate support team for escalation. A simple category structure should be used, however; a scheme that’s too complex leads to incidents all being logged as “other” or “miscellaneous” because the agent does not want to spend the time considering which category is correct. This makes later analysis very difficult. A multilevel scheme, as shown in Figure 34.2, achieves granularity without facing the service desk agent with a long list to choose from. Incidents should be recategorized during investigation and on resolution if the original choice was incorrect.

Diagram shows the incident categorization which includes location impacted, service impacted, system impacted, application impacted, database impacted, server impacted, and disk drive impacted.

Figure 34.2 Multilevel incident categorization

Copyright © AXELOS Limited 2010. All rights reserved. Material is reproduced under license from AXELOS.
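A multilevel scheme can be sketched as a nested structure in which the agent sees only the choices for the current level. The category names below are hypothetical and are not the scheme shown in Figure 34.2.

```python
# Hypothetical category tree; each key maps to its subcategories.
CATEGORIES = {
    "Hardware": {
        "Server": {"Disk drive": {}, "Memory": {}},
        "Printer": {},
    },
    "Software": {
        "Application": {},
        "Database": {},
    },
}


def options(path=()):
    """Return the choices offered at the next level below the given path."""
    node = CATEGORIES
    for level in path:
        node = node[level]
    return sorted(node)


def is_valid(path):
    """Return True if every level of the path exists in the tree."""
    node = CATEGORIES
    for level in path:
        if level not in node:
            return False
        node = node[level]
    return True
```

For example, `options(("Hardware", "Server"))` offers only the two server subcategories, so the agent never faces the full list at once, and `is_valid` lets the tool reject nonstandard entries.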

Step 3: Incident Prioritization

Incidents need to be prioritized to ensure that the most critical incidents are dealt with first. It is often said that all users believe that their own incident is the highest priority, so it is important to agree during service level negotiations on the criteria to be used to decide priority. It is also imperative that incidents are prioritized consistently.

The ITIL framework recommends that two factors should be considered: business impact (the effect the incident is having or will have on the business) and urgency (how quickly the business needs a resolution). Business impact can be assessed by considering a number of factors: the number of people affected, the criticality of the service, the financial loss being incurred, damage to reputation, and so on. Depending on the type of organization, other factors such as health and safety (for a hospital, a railway company, or similar) and potential breach of regulations (for financial institutions and so on) might be considered.

During the life of an incident, it may be necessary to adjust the priority of an incident if the assessment of the impact changes or a resolution becomes more urgent.

Deciding the priority must be a simple process because the incident has to be logged quickly. Employing service desk staff with good business knowledge and ensuring that they are trained to be aware of business impact will help to make a realistic assessment of business impact. Table 34.1 shows a simple but very effective way to determine priority.

Table 34.1 Impact and urgency: a matrix for determining an incident’s priority

                   Impact
Urgency     High    Medium    Low
High        1       2         3
Medium      2       3         4
Low         3       4         5

Table 34.2 shows how the determination of priority made using the matrix in Table 34.1 can in turn be employed to set a target resolution time for the incident.

Table 34.2 Target resolution

Priority code    Description    Target resolution time
1                Critical       1 hour
2                High           8 hours
3                Medium         24 hours
4                Low            48 hours
5                Planning       Planned
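The two tables can be combined into a simple lookup. The sketch below encodes the values from Tables 34.1 and 34.2; the function name `prioritize` and the use of `None` for the "Planned" entry are assumptions made for illustration.

```python
from datetime import timedelta

# Table 34.1: (impact, urgency) -> priority code
PRIORITY_MATRIX = {
    ("high", "high"): 1, ("high", "medium"): 2, ("high", "low"): 3,
    ("medium", "high"): 2, ("medium", "medium"): 3, ("medium", "low"): 4,
    ("low", "high"): 3, ("low", "medium"): 4, ("low", "low"): 5,
}

# Table 34.2: priority code -> (description, target resolution time)
# None stands for "Planned": the work is scheduled rather than given a fixed target.
TARGETS = {
    1: ("Critical", timedelta(hours=1)),
    2: ("High", timedelta(hours=8)),
    3: ("Medium", timedelta(hours=24)),
    4: ("Low", timedelta(hours=48)),
    5: ("Planning", None),
}


def prioritize(impact, urgency):
    """Return (priority code, description, target resolution time)."""
    code = PRIORITY_MATRIX[(impact.lower(), urgency.lower())]
    description, target = TARGETS[code]
    return code, description, target
```

For example, an incident with high impact and medium urgency is priority 2, with a target resolution time of 8 hours.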

Step 4: Initial Diagnosis

The initial diagnosis step refers to the actions taken at the service desk to diagnose the fault and, where possible, to resolve it at this stage. The service desk agent will use the known error database provided by problem management, incident models (covered earlier), and any other diagnostic tools to assist in the diagnosis and possible resolution. Where the service desk is unable to resolve the incident, the initial diagnosis will identify the appropriate support team for escalation.

Part of this stage is the gathering of information to assist the second-line technician in resolving the incident quickly. Again, sufficient time is required for this step; “saving” time by passing the incident to a second-line technician quickly but with sparse details is not helpful. The second-line technician will need to contact the customer to obtain the information, adding delay and frustration. Support teams should provide guidance to the service desk about the type and level of information they should be gathering.

Step 5: Incident Escalation

The ITIL framework describes two forms of escalation that may take place during incident management: functional escalation and hierarchic escalation.

Functional escalation takes place when the service desk is unable to resolve the incident; this may be realized immediately because of the type of incident, such as a server failure, or because the service desk agent may have spent the maximum time allowed under the organization’s guidelines attempting to resolve the incident without success. It then needs to be passed to another group with a greater level of knowledge. The second-line support group that receives this escalated incident will also have a time limit for resolution, after which the incident gets escalated again to the next support level. Sometimes, as with the service desk, it is obvious that the incident will require a high level of technical knowledge, and in such a case the incident would be immediately escalated, without any attempt by second-line support staff to resolve it.

The service desk must know to whom the incident should be escalated to avoid unnecessary delays, so the service desk staff needs to have sufficient technical knowledge to be able to identify which incident goes to which team. Operational level agreements will specify the responsibilities of each group. There may be occasions when cooperation between support groups is required or the incident needs to be referred to third parties such as hardware maintenance companies. The OLAs and UCs should specify what should happen in these situations.

The second type of escalation is hierarchic escalation, which takes place for high-priority or major incidents. Hierarchic escalation consists of informing the appropriate level of management about the incident so that they are aware of it. This ensures that the management is able to make decisions that are required regarding prioritization of work, suppliers, and so on. In the case of a major incident, the IT director may be expected to brief the business directors about the progress of the incident; even if this is not the case, business managers may go directly to senior IT managers when a serious incident has occurred, so it is essential that the IT managers have been thoroughly briefed themselves.

It is also sometimes necessary to use hierarchic escalation when the incident is not progressing as quickly as it should or if there is disagreement among the support groups regarding assignment.

A good service management tool will be able to automatically escalate incidents, based on the SLA targets, and update the record with details. For example, a tool could be set to notify a team leader when 90 percent of the SLA target time had passed and to inform the team leader’s line manager when the incident breached the target.
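
The threshold rule in that example can be sketched in a few lines of Python. This is only an illustration: the function name, the notification wording, and the four-hour target are assumptions made for the sketch and do not describe any particular service management tool.

```python
from datetime import datetime, timedelta

# Assumed threshold, taken from the example in the text:
# warn the team leader once 90 percent of the SLA target time has passed.
WARNING_FRACTION = 0.9


def escalation_action(logged_at: datetime, target: timedelta, now: datetime) -> str:
    """Return who, if anyone, should be notified about an open incident."""
    elapsed = now - logged_at
    if elapsed > target:
        return "notify line manager"   # the SLA target has been breached
    if elapsed >= WARNING_FRACTION * target:
        return "notify team leader"    # 90 percent of the target time is used up
    return "no action"


# Hypothetical incident logged at 09:00 with a four-hour resolution target.
logged = datetime(2024, 1, 1, 9, 0)
four_hours = timedelta(hours=4)
print(escalation_action(logged, four_hours, datetime(2024, 1, 1, 12, 40)))
```

A real tool would run a check like this on a schedule and write each notification back to the incident record.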

Step 6: Investigation and Diagnosis

The major activity that takes place for every incident is investigation and diagnosis. The incident will have undergone the initial diagnosis step (step 4, covered earlier); that step identifies whether the service desk can resolve the incident. The investigation and diagnosis stage here is different; it involves trying to ascertain what has happened and how the incident can be resolved.

The incident record should be updated with an accurate description of the symptoms and of every action taken. This prevents duplication of effort, and the record will also be useful when the incident is reviewed, perhaps as part of problem management. Typical investigation and diagnosis actions include gathering a full description of the issue and its impact and urgency, creating a timeline of events, identifying possible causes (such as recent changes), and interrogating knowledge sources such as the known error database.

Step 7: Resolution and Recovery

Potential incident resolutions should be tested to ensure that they actually resolve the issue completely with no unintended consequences. This testing may involve the user. Other resolution actions might include the service desk agent or technician remotely taking over the user’s equipment to implement a resolution or to show the user what they need to do in the future. Once the incident is resolved, it returns to the service desk for closure.

Step 8: Incident Closure

When the incident has been resolved and the service restored, the service desk will contact the user to verify that the incident can be closed. This is an important step because the fault may appear to be resolved to the IT department, but the user may still be having difficulties, especially if there were actually two incidents, with the symptoms of one being hidden by the other. The second incident would become apparent only after the first was resolved. The service desk may contact the user directly, or an email could be sent with a time limit, stating when the incident will be closed, as described earlier.

If the underlying cause of the incident is still unknown, despite the fact it has been resolved, a problem record may be raised to investigate the underlying cause and to prevent a recurrence. Finally, a user satisfaction survey is carried out.

Triggers for Incident Management

The incident management process is triggered by the notification of an incident. As already discussed, this could be from users, support staff or suppliers, or the event management process.

Inputs

The following information inputs are needed by incident management:

  • Communication of events from event management.
  • Information about CIs from the configuration management system (CMS). This information includes things like the location and configuration of hardware, CI status, software version numbers, and so on.
  • Known errors and their workarounds from the known error database. This database contains details of workarounds that enable the rapid resolution of some recurring incidents.
  • Incidents and their symptoms.
  • Recent changes and releases. Many incidents are related to changes that have unexpected side effects. The change schedule provides information about recent and planned changes that can help with the diagnosis of incidents. Information about planned changes is especially useful if the changes are to implement a fix to a recurring fault.
  • Operational and service level objectives. Information in service level agreements provides resolution targets for incident management and also guidance on the impact and urgency of incidents affecting the service.
  • Customer feedback.
  • Criteria for prioritizing and escalating incidents.

Outputs

Information outputs from incident management are as follows:

  • Resolved incidents and records of the resolution actions.
  • Updated incident records with accurate incident detail and history.
  • Updated incident classification once the cause is known; this is used by problem management.
  • New problem records where the underlying cause of the incident has not been identified.
  • Validation that problem resolution has been effective in stopping recurring incidents.
  • Feedback on incidents related to changes and releases.
  • Identification of affected CIs.
  • Customer satisfaction feedback.
  • Feedback on the effectiveness of event management activities.
  • Incident and resolution history details to assist in assessing overall service quality.
  • Incident models if the incident has been seen before or is likely to recur regularly.
  • Management information.
  • New incident classification types.
  • Feedback on the achievement of SLA targets for incident management.

Interfaces between Incident Management and the Lifecycle Stages

Incident management is a key process that is carried out by all service providers. There are several links between the process and other processes, both within service operation and within the service design and service transition stages.

Service Design

Several of the service design processes interface directly with the incident management process. These processes are among those we discussed previously; many of the process activities take place in the service operation lifecycle stage. Service level management interfaces with incident management because SLAs will contain incident targets; the other service design processes may result in incidents if the processes fail to prevent a security breach, a lack of capacity, or unplanned downtime.

Service Level Management As you have seen, there are numerous links between incident and service level management (SLM). Incidents have target response and resolution times; these targets are set in the SLAs. Incident management in return provides the management information from the service management toolset to enable SLM to report on the success achieved in meeting these targets. Incident reporting enables SLM to identify failing services and to implement service improvement plans for them (in conjunction with continual service improvement).

Information Security Management, Capacity Management, and Availability Management Incident management collects data on the number of security-related incidents and capacity issues. It provides the data on downtime that availability management uses to produce availability reports, and analysis of incident records helps availability management understand the weak points in the infrastructure that need attention. An efficient incident lifecycle also improves overall availability.

Service Transition

The service transition processes of service asset and configuration management (SACM) and change management interface with incident management; SACM provides useful information to the incident process, and changes may be the cause of incidents or the means by which incidents are resolved.

Service Asset and Configuration Management Incident management uses SACM data to understand the impact of an incident because it shows the dependencies on each CI. It provides useful information regarding who supports particular categories of CI. By logging each incident against a CI and checking that the user of the CI is as recorded in the CMS, incident management helps keep the CMS accurate.

Change Management Changes are often implemented to overcome incidents, and incidents are often caused by changes. An important input into incident investigation is the change schedule; asking the question “What changed just before this incident occurred?” can often highlight the cause of incidents. In the case of a major incident caused by a change, the decision may be made to back out the change. Information identifying how many incidents were caused by changes should be fed back to change management to improve future changes.

Also, in some cases a fix to an incident may require an emergency change to be raised, to ensure that the CMS is kept up-to-date.

Service Operation

There is a strong interface between incident management and problem management. Access management issues may also cause incidents.

Problem Management As you will learn when we discuss problem management later in this chapter, incident management and problem management have many links. Incident management provides the data on repeat incidents that problem management uses to identify underlying problems. The permanent resolution of these problems helps incident management by reducing the number of incidents that occur. The incident impact and urgency information helps problem management prioritize between problems.

Problem management provides known error information, which enables incident management to restore service.

Access Management Incident management raises incidents following security breaches or unauthorized access attempts. This information can be used by access management to investigate access breaches. Failure to ensure that users are granted the access they require to do their job may result in incidents being reported because the user is greeted with an error message when attempting to carry out a task.

Critical Success Factors and Key Performance Indicators

Next we look at some examples of critical success factors (CSFs) for incident management and the key performance indicators (KPIs) that measure how successful they are.

Before we look at the CSFs and KPIs, take a minute to understand what these terms mean:

  • A critical success factor is a high-level statement of what a process must achieve if it is to be judged a success. Normally a process would have only three or four CSFs. A CSF cannot be measured directly—that’s what key performance indicators are for.
  • A key performance indicator is a metric that measures some aspect of a CSF. Each CSF will have three or four associated KPIs.

The following CSFs are used for incident management:

  • Critical success factor: “Resolve incidents as quickly as possible, minimizing impacts to the business.”

    Two of the KPIs that measure this are:

    • The mean time to achieve resolution
    • The breakdown of incidents at each stage.
  • Critical success factor: “Maintain user satisfaction with IT services.”

    The key performance indicators that measure its success:

    • The average user/customer survey score (total and by question category)
    • The percentage of satisfaction surveys answered versus total number of satisfaction surveys sent.
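
As a rough illustration of how KPIs such as these turn into numbers, the sketch below computes the mean time to achieve resolution, the average survey score, and the survey response percentage. The sample figures and variable names are invented for the example; in practice the data would come from the service management tool's reports.

```python
from statistics import mean

# Hypothetical sample data for one reporting period.
resolution_hours = [1.5, 4.0, 0.5, 2.0]   # hours taken to resolve each closed incident
survey_scores = [4, 5, 3, 4]              # one score per answered satisfaction survey
surveys_sent = 16

# KPI: the mean time to achieve resolution.
mean_time_to_resolution = mean(resolution_hours)

# KPI: the average user/customer survey score.
average_survey_score = mean(survey_scores)

# KPI: percentage of surveys answered versus surveys sent.
survey_response_rate = 100 * len(survey_scores) / surveys_sent

print(f"Mean time to resolution: {mean_time_to_resolution:.1f} hours")
print(f"Average survey score: {average_survey_score:.1f}")
print(f"Survey response rate: {survey_response_rate:.0f}%")
```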

Challenges

Incident management faces a number of challenges. First is the challenge of detecting incidents as early as possible. Comprehensive monitoring of the infrastructure plays a significant role here, but there are other things to be done. For example, user education—encouraging users to always report incidents—is important.

A very common challenge is that of persuading technical staff to log incidents that they encounter. Their automatic response is to fix the issue and move on, which is good, but information about trouble spots is often lost.

A poor or nonexistent problem management process will cause the workload on incident management to continue to rise because underlying errors are not being found and corrected. Also, the lack of documented known errors can have a negative effect on the service desk’s ability to provide first-line fixes.

The lack of a configuration management system that holds accurate information about CIs and their relationships is another challenge. A comprehensive CMS would support the prioritization of incidents by linking CIs to services and provide valuable diagnostic information.

Finally, integrating the process with service level management would help the process to correctly assess the impact and priority of incidents and to define escalation procedures.

Risks

The risks faced by incident management include the failure to meet the challenges discussed in the preceding section. Another risk is insufficient resources in terms of numbers and capabilities of support staff, meaning that the process will be overwhelmed by incidents that cannot be handled within the agreed timescales.

A backlog of incidents resulting from inadequate support tools is also a risk. It is generally agreed that successful service management in general, and incident management in particular, needs good tools. Inadequate tools also produce the risk of inadequate information.

Finally, there is the risk of failing to meet agreed resolution times because the support staff are not properly incentivized by OLA targets that align with process objectives, or they are simply not aware of targets.

Problem Management

According to ITIL official terminology, a problem is defined as an underlying cause of one or more incidents. Problem management is the process that investigates the cause of incidents and, wherever possible, implements a permanent solution to prevent recurrence. Until such time as a permanent resolution is applied, it will also attempt to provide a workaround in the form of a known error record to enable the service to be restored and the incident to be resolved as quickly as possible. It is important to understand the differences between incidents and problems and to realize that an incident never becomes a problem. Each is held as a separate record. These records may be linked together in the ITSM tool.

Many organizations make the mistake of thinking that problem management is not essential. Until and unless problem management is undertaken, incidents will recur, inconveniencing the business and occupying support staff time. Problem management will reduce incidents, freeing up more time to undertake more problem management. It is a virtuous circle; the more time spent on it, the more time is freed up by it.

Let’s start by considering the definition of a problem. The definition used by ITIL is “the underlying cause of one or more incidents.”

The use of the word problem is often misunderstood. Outside of service management, we might hear a total network failure described as a major problem. But service management would describe it as an incident, possibly a major incident. If we did not know why the network had failed, we would also have a problem. If we did know why the failure occurred, there would be no problem. So, a problem is a mystery—the mystery of why an incident occurred.

Purpose

The purpose of the problem management process is to document, investigate, and remove causes of incidents. It also provides another very useful benefit; by providing workarounds, it reduces the impact of incidents that occur and have no known permanent solution or have a permanent solution that no one is prepared to fund. It proactively identifies errors in the infrastructure that could cause incidents and provides a permanent resolution, thus preventing the incidents.

Objectives

Problem management aims to identify the root cause of incidents, to document known errors, and to take action to remedy them. Problem management has three simple objectives:

  • First, it must prevent problems and resulting incidents from happening.
  • Second, it should eliminate recurring incidents. This is a key objective. There are few things that will do more damage to a service provider’s reputation than recurring incidents.
  • The third objective is to minimize the impact of incidents that can’t be prevented. The underlying fault can’t always be fixed, or it may take some time to implement a fix, but problem management may be able to identify a fast, reliable way of restoring service should the incident recur. This workaround would be documented in the known error database and used by service desk or other support staff.

Scope

The scope of problem management includes diagnosis of the root cause of incidents and taking the necessary action in conjunction with other processes (such as change management and release and deployment management) to permanently remove them.

Problem management is also responsible for compiling information about problems and any associated workarounds or resolutions. By identifying faults, providing workarounds, and then permanently removing these faults, problem management reduces the number and the impact of incidents. It has a strong relationship with knowledge management because it is responsible for maintaining a known error database. Proactive problem management supports continual service improvement because it prevents future incidents from occurring and thus improves the quality of the service delivery.

There are important similarities and differences between the two principal service operation processes. The same service management tool will usually be used to track both incidents and problems, and a good tool will facilitate the linking of incident occurrences to specific problem records. Similar categories can be used, although the prioritization of problems may differ from that of the associated incidents; an incident may be a priority one due to its impact and urgency, but once a workaround is supplied, finding a permanent resolution may not be urgent. Problem management is sometimes a process of which the business is unaware. Once a workaround has been applied and an incident resolved, the user may think no more about it. Meanwhile, the IT service provider uses problem management to prevent recurrence. An effective workaround can take some of the pressure off support staff, allowing them to take the time to investigate the underlying cause, without being chased for a resolution because the service has been restored.

Value

The value of problem management derives from the fact that it will diminish the number of incidents and their impact on the organization. Permanent fixes will tend to reduce the number of incidents, and the workarounds, documented in the known error database, enable fast resolution of recurring incidents. Together these mean higher availability of the services with all that implies.

Problem management improves the productivity of IT staff: they spend less time investigating and resolving repeat incidents because the underlying problems have been permanently resolved.

Problem management identifies the underlying causes of incidents and is therefore able to devise an effective fix, or a workaround that always works. If the underlying cause is not known or is misunderstood, then there is a risk that any fix or workaround won’t work. Last, the process reduces the impact of recurring incidents, which reduces associated costs.

Policies

The ITIL core guidance suggests three policies that support problem management:

Policy: Problems should be tracked separately from incidents. The focus of each of the two processes is very different, especially as regards problem management’s proactive activities.

Policy: All problems should be stored and managed in a single management system. It is not a good idea for each technical team, for example, to have its own database of problems. There are many reasons a single management system is essential. For example, suppose the investigation of problems requires the use of scarce technical resources. To make the best use of those resources, we should prioritize activities by prioritizing problems. This becomes difficult or impossible if information about problems is scattered across many databases or spreadsheets.

Policy: All problems should subscribe to a standard classification schema that is consistent across the business enterprise. The standard classification of problems provides faster access to investigation and diagnostic data and enables more effective analysis, particularly by proactive problem management. Notice that this policy says “across the business enterprise.” In the case of a global enterprise, problems should be categorized in the same way wherever in the world they manifest themselves.

Principles and Basic Concepts

In the following sections, we’ll look at some of the principles and basic concepts of problem management.

Reactive and Proactive Aspects

Unlike incident management, which is entirely reactive (you cannot resolve an incident until it has occurred), problem management has both reactive and proactive features:

  • Problem management will react to incidents and attempt to identify a workaround and a permanent resolution.
  • Problem management will also proactively try to identify potential incidents and take action to prevent them from ever happening. This might include analysis of incident trends, such as intermittent but increasingly frequent complaints about poor response times, to identify a potential capacity issue. By working with capacity management, problem management can take proactive measures to provide sufficient capacity and avoid any major breaks in service. Event management reports can also be analyzed to the same end, in this case to prevent an incident before the user is aware of any issue.
  • Problem management may assist in a major incident review, trying to identify how to prevent a recurrence.

The process steps for managing problems that are raised in reaction to incidents and for managing problems that are proactively identified are broadly similar. The main difference is the trigger for the process. Reactive activities take place as a result of an incident report and help prevent the incident from recurring or provide a workaround if avoidance is impossible; these activities complement the incident management process.

Proactive problem management analyzes incident records to identify underlying causes of incidents. It may be that analysis of previous incidents reveals a trend or pattern that was not apparent when each incident occurred. For example, users may complain of poor response periodically; it is only when all these complaints are analyzed that it becomes apparent that the poor response is always reported against the same module or from the same location. This would trigger a problem record to be raised to identify the common cause linking all these incidents.
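
The kind of trend analysis described here can be sketched as a simple count of incidents that share a category and module. The record layout, the sample data, and the threshold of three are assumptions made for illustration; each organization would set its own criteria.

```python
from collections import Counter

# Hypothetical incident records: (incident number, category, module, location).
incidents = [
    ("INC001", "poor response", "invoicing", "London"),
    ("INC002", "poor response", "invoicing", "Leeds"),
    ("INC003", "printer fault", "n/a", "London"),
    ("INC004", "poor response", "invoicing", "York"),
    ("INC005", "poor response", "payroll", "London"),
]

THRESHOLD = 3  # assumed: consider raising a problem once a pattern recurs this often


def recurring_patterns(records, threshold=THRESHOLD):
    """Count incidents per (category, module) and keep those at or above the threshold."""
    counts = Counter((category, module) for _number, category, module, _location in records)
    return {pattern: n for pattern, n in counts.items() if n >= threshold}


print(recurring_patterns(incidents))
```

With this sample data only the poor response reported against the invoicing module crosses the threshold, which is the kind of pattern that would prompt a problem record.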

Reactive and proactive problem management activities normally take place as part of service operation, but problem management is also closely related to continual service improvement. Where improvement opportunities are identified as a result of problem management, they should be entered into the CSI register.

Problem Models

Most problems are unique and can’t be investigated in a predetermined way. But sometimes there are underlying issues that cannot be resolved—they are too difficult, the resolution is too costly, or they are outside our scope of action. For example, a badly designed application may, over time, experience intermittent incidents, each of which has a different specific cause, a different problem. A problem model might guide the investigation of problems in that application. It might describe how to collect and interpret relevant diagnostic information, for example.

Incidents vs. Problems

As we have said, problem management and incident management are closely related. Although incident management is not concerned with determining the underlying causes of the incidents it investigates, sometimes it is necessary to do that in order to resolve the incident. When that is the case, problem management should get involved. Determining the root cause might provide the workaround needed to close the incident. Potentially, problem management could get involved whenever an incident occurred that had no corresponding problem or known error record, but that would probably be overkill. Every organization should define criteria for determining when it’s appropriate to involve problem management. It is very common, usual even, for problem management to be involved in the handling of major incidents.

Known Errors

When you examine the problem management process in detail, you will see that the process attempts to identify and document known errors, usually when the cause of the problem is known and a workaround is available but the permanent fix has yet to be applied. Known error records are used to document root causes and workarounds, allowing quicker diagnosis and resolution if further incidents do occur. Problem management is not the only source of known errors. Some will come from application development. It’s often the case that a new application will go live in spite of known bugs in the software because they are not considered serious enough to delay the rollout. Where this is the case, the bugs should be documented in the known error database so that the information is available should there be an incident during live operation. These bugs might be discovered in development or during testing in service transition. A final source of known errors is the notifications received from suppliers when they discover issues in their products.

Workaround

A workaround is defined as a means of reducing or eliminating the impact of an incident or problem for which a full resolution is not yet available. Restarting a failed configuration item is an example of a workaround. Workarounds for problems are documented in known error records.
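
To make the idea concrete, a known error lookup might be sketched as follows. The record fields, the sample entries, and the simple substring match are assumptions for illustration only; a real known error database would offer far richer search.

```python
# Hypothetical known error records, each holding a symptom and its documented workaround.
known_errors = [
    {"id": "KE001", "symptom": "frozen screen",
     "workaround": "Reboot the workstation."},
    {"id": "KE002", "symptom": "print queue stalled",
     "workaround": "Restart the print spooler service."},
]


def find_workaround(description):
    """Return the workaround of the first known error whose symptom appears in the text."""
    text = description.lower()
    for error in known_errors:
        if error["symptom"] in text:
            return error["workaround"]
    return None  # no match: the incident needs normal investigation


print(find_workaround("User reports a frozen screen after login"))
```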

Process Activities, Methods, and Techniques

Next we consider some of the main problem management activities and the methods and techniques used.

When Is a Problem Raised?

Sometimes it is helpful to raise a problem record while the incident is still open. Each organization will decide on its own criteria for when a problem should be raised. For example, a problem might be raised when the support teams are sure the incident has been caused by a new problem because the incident appears to be part of a trend or because there is no match with existing known errors. The service desk or support teams may have resolved the incident without knowing the cause, and so there is a risk that the fault may recur. This is particularly true in the case of a major incident; the underlying cause needs to be identified as soon as possible to prevent future disruption to the business. (The problem diagnosis activity may take place in parallel with the incident resolution and may continue after the successful resolution until the underlying cause is identified and removed.) As already explained, it is also possible that suppliers may inform their customers of problems that they have identified.

As stated earlier, problem management is not an optional activity; it is fundamental to providing a consistent service in line with SLA commitments. By providing workarounds to enable resolution of incidents by the first-line staff, the organization makes better use of the more skilled and therefore more expensive second- and third-line staff, who are freed up to use their skills in problem investigation.

Process Activities

Now we are going to examine the problem management process step by step. Refer to Figure 34.3 as we discuss each activity.

Flow diagram shows problem detection, logging, categorization, prioritization, investigation and diagnosis, workaround implementation, incident management, change management, problem resolution et cetera.

Figure 34.3 The problem management process

Copyright © AXELOS Limited 2010. All rights reserved. Material is reproduced under license from AXELOS.

Step 1: Detecting Problems

The first step in the process is to identify that a problem exists. As we discussed previously, problems may be raised either reactively (in reaction to incidents) or proactively. In addition to the triggers identified earlier, a problem may also be identified as a result of alerts received as part of event management. The event monitoring tools may identify a fault before it becomes apparent to users and may automatically raise an incident in response.

As the diagram shows, problems can be identified in a number of ways, but they are all handled by a single problem management process and logged in a single database. A major source is the service desk acting for the incident management process. Any incident whose cause is unknown could trigger a problem investigation. In practice, few organizations will do that because of the very large number of problems that would result. Most organizations are quite selective about the problems they log and investigate. This is supplemented by regular analyses of incident data, searching for recurring incidents that might have been missed. Potential problems can also be identified by event management and suppliers, and of course proactive problem management itself identifies problems.

Step 2: Logging Problems

Having identified that a problem exists, a problem record containing all the relevant information should be logged and time-stamped to provide a complete picture. Typical details would include who reported it and when, the service and equipment used, and a description of the incident or incidents and actions taken. The incident record number(s) and the priority and category would also be required.

Where possible, use the service management tool to link problem records with the associated incident records. Remember that the incident has not “become” a problem; the incident must continue to be managed to resolution whether the problem is resolved or not.
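
The separation between the two record types can be pictured with a minimal data model. The class names, fields, and record numbers below are hypothetical and are not taken from any specific tool; the point is only that linking leaves each record with its own lifecycle.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    number: str
    status: str = "open"


@dataclass
class Problem:
    number: str
    status: str = "open"
    incident_numbers: list[str] = field(default_factory=list)

    def link(self, incident: Incident) -> None:
        """Record the association without changing either record's status."""
        if incident.number not in self.incident_numbers:
            self.incident_numbers.append(incident.number)


# Hypothetical records: the incident is linked to the problem, then resolved on its own.
incident = Incident("INC0042")
problem = Problem("PRB0007")
problem.link(incident)
incident.status = "closed"

print(problem.incident_numbers, incident.status, problem.status)
```

Closing the incident leaves the problem open, reflecting that the incident is managed to resolution independently of the problem investigation.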

Step 3: Categorizing Problems

The problem is then categorized using the same system used by incident management, which allows incidents and problems to be more easily matched. For example, it might be possible to relate a number of types of incidents to a single common problem, which would bring a lot of extra diagnostic information.

The problem manager should emphasize the importance of accurate categorization to the service desk. The use of incident models can be very helpful here because they standardize the way common incidents are recorded. Enforcing categorization on incident resolution, as mentioned earlier, will also help ensure that incident categories are accurate.

Step 4: Prioritizing Problems

As with incidents, the priority of a problem should be based on the business impact of the incidents that it is causing and the urgency with which it needs to be resolved. The problem manager should also consider how frequently the incidents are occurring. A “frozen screen” that can be resolved with a reboot may not be a high-priority incident; however, if it is occurring 100 times a day, the combined impact to the business may be severe, so the problem needs to be allocated a high priority. The impact to the business must always be considered, so factors such as the cost of resolving the incident and the time this is likely to take will be relevant when assessing priority. It is likely that some problems are not considered important enough to spend resources on, especially if a reasonable and effective workaround has been identified.

Analyzing factors such as the number of people affected, the duration of the downtime, and the business cost is known as pain value analysis. This technique can be useful in determining the business impact of a problem. Of course, circumstances change, new information is learned, and further incidents may occur, so the priority of a problem should be kept under review and changed if appropriate.
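
A simple way to picture pain value analysis is to score each candidate problem and rank the results. The scoring formula below (incidents per day × people affected × hours lost × cost per person-hour) and all of the figures are assumptions chosen for illustration; ITIL does not prescribe a specific formula.

```python
# Hypothetical problem candidates: (name, incidents per day, people affected per incident,
# hours lost per person, business cost per person-hour).
candidates = [
    ("report printing failure", 2, 5, 0.5, 40.0),
    ("frozen screen", 100, 1, 0.25, 40.0),
]


def daily_pain(incidents_per_day, people, hours_lost, cost_per_hour):
    """Assumed scoring formula: total business cost per day across all incidents."""
    return incidents_per_day * people * hours_lost * cost_per_hour


# Rank the candidates from most to least painful.
ranked = sorted(
    ((name, daily_pain(*factors)) for name, *factors in candidates),
    key=lambda item: item[1],
    reverse=True,
)

for name, pain in ranked:
    print(f"{name}: {pain:.2f} per day")
```

In this made-up data set the frequent but individually minor frozen screen outranks the rarer printing failure, which mirrors the point made above about frequency.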

Step 5: Investigating and Diagnosing Problems

The next stage in the process is to investigate and diagnose the problem. The purpose of this stage is first to discover the root cause and then to determine a workaround and permanent fix. There may not be the resources to investigate every problem, so the priority level assigned to each will govern which ones get the necessary attention. It is important to allocate resources to problem investigation because until the problem is resolved, the incident will recur and resources will be spent on incident resolution.

Usually the task of investigating and diagnosing the problem is passed to the appropriate technical or application team; if it is unclear where the problem originates, it might be necessary to establish a team of technical specialists to investigate it. Sometimes it will be necessary to escalate a problem to a supplier.

The ITIL framework suggests a number of different problem-solving techniques, which are helpful in approaching the diagnosis logically. The CMS can be very helpful in providing CI information to help identify the underlying cause. It will also help in identifying the point of failure when several incidents are reported; the CMS may identify that all the affected CIs are linked to one “parent” CI. The fault affecting the “parent” configuration item is causing all those CIs attached to it to experience a loss of service. The KEDB may also provide information about previous, similar problems and their causes. Where a test environment exists, this can be used to recreate the fault and to try possible solutions.
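The way a CMS can point to a shared “parent” CI can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the CI names and the single-parent dependency map are invented for the example, and a real CMS would hold far richer relationship data.

```python
from typing import Optional

# Hypothetical CMS extract: each CI mapped to the CI it depends on (its "parent").
PARENT_OF = {
    "pc-101": "switch-a",
    "pc-102": "switch-a",
    "printer-7": "switch-a",
    "switch-a": "router-1",
    "pc-201": "switch-b",
    "switch-b": "router-1",
}


def ancestors(ci: str) -> list[str]:
    """Return the chain of CIs that the given CI depends on, nearest first."""
    chain = []
    while ci in PARENT_OF:
        ci = PARENT_OF[ci]
        chain.append(ci)
    return chain


def common_point_of_failure(affected: list[str]) -> Optional[str]:
    """Return the nearest CI that every affected CI depends on, if any."""
    if not affected:
        return None
    shared = set(ancestors(affected[0]))
    for ci in affected[1:]:
        shared &= set(ancestors(ci))
    # Walk the first CI's chain so that the nearest shared ancestor is found first.
    for candidate in ancestors(affected[0]):
        if candidate in shared:
            return candidate
    return None
```

If incidents are reported against pc-101, pc-102, and printer-7, the function returns switch-a as the likely point of failure; if pc-201 is also affected, it returns router-1 instead.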

Step 6: Identifying a Workaround

Although the aim of problem management is to find and remove the underlying cause of incidents, this may take some time; meanwhile, the incident or incidents continue, and the service is affected. When a user suffers an incident, the first priority is to restore the service so that the user can continue working. A priority of the process, therefore, is to provide a workaround to be used until the problem is resolved. A workaround is a means of reducing the impact of an error. This can be in the form of a circumvention, which is a way of working that avoids triggering incidents, or it may be a way of resolving incidents should they occur. At this point a known error should be raised and stored in the known error database.

The workaround does not fix the underlying problem, but it allows the user to continue working by providing an alternative means of achieving the same result. The workaround can be provided to the service desk to enable them to resolve the incidents while work on a permanent solution continues. The problem record remains open because the fault still exists and is continuing to cause incidents. The details of the workaround are documented within the problem record, and a reassessment of its priority may be carried out.

It is possible that IT or business management may decide to continue to use the workaround and suspend work on a permanent solution if one is not justified. A problem affecting a service that is due to be replaced, for example, may not be worth the effort and risk involved in implementing a permanent solution.

Of course it is not always possible to devise a workaround, and sometimes a workaround is not acceptable to the business—the fault must be fixed.

Step 7: Raising a Known Error Record

When problem management has identified and documented the root cause and workaround, this information is made available to support staff as a known error. Information about all known errors, including the problem record to which each relates, is kept in the known error database (KEDB). When repeat incidents occur, the support staff can refer to the KEDB for the workaround.

Although ITIL defines a known error as a problem with a cause that is known and a workaround that has been provided, in the real world, there may be times when a workaround is available even though the root cause is not yet known (for example, a reboot restores the service, but it’s not known what causes the error). On other occasions, we may know the cause but not have a workaround because a change has to be implemented to fix the fault.

Sometimes a known error is raised before a workaround is available and sometimes even before the root cause has been fully identified. This may be just for information purposes; a workaround may be available that has not been fully proven. Rather than have a rigid rule about when a known error should be raised, a more pragmatic approach is advisable; a known error should be raised as soon as it becomes useful to do so.
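The pragmatic approach described here can be pictured as a known error record in which both the root cause and the workaround are optional. The Python sketch below is a minimal illustration; the record fields, identifiers, and matching rule are assumptions made for the example and do not represent any particular service management tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KnownError:
    problem_id: str            # the problem record this known error relates to
    category: str              # same categorization scheme as incidents
    symptom: str
    root_cause: Optional[str]  # may still be unknown when the record is raised
    workaround: Optional[str]  # may not exist yet


# Hypothetical KEDB contents
KEDB = [
    KnownError("PRB-001", "desktop/display", "frozen screen", None, "Reboot the PC"),
    KnownError("PRB-002", "network/printing", "print job stuck",
               "Faulty driver version", None),
]


def find_workarounds(category: str, description: str) -> list[str]:
    """Return workarounds from known errors that match an incident."""
    return [
        ke.workaround
        for ke in KEDB
        if ke.category == category
        and ke.symptom.lower() in description.lower()
        and ke.workaround is not None
    ]
```

The first record has a workaround but no known root cause, and the second has a known cause but no workaround, mirroring the two real-world situations described above. A service desk analyst searching for a frozen screen incident would be given the reboot workaround; a search for the printing incident would return nothing until a workaround is added.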

Step 8: Problem Resolution

When problem management has identified a solution to the problem, it should be implemented to resolve the underlying fault and thus prevent any further incidents from disrupting the service. Implementing the resolution may involve a degree of risk, however, so the change management process will ensure that the risk and impact assessment of the RFC is satisfactory before allowing the change. The error might be in an application that is scheduled to be phased out in the near future, so the business may choose to accept the likely disruption temporarily rather than accept the cost and risk of a fix. Ultimately, the decision whether to go ahead with the resolution despite the risk is a business decision; the business damage being done by the problem may mean the business is prepared to accept the risk in order to have the fix implemented.

Problem Closure

When a permanent solution to the problem has been identified, tested, and implemented through the change management process, the problem record can be updated and closed. Any open incidents caused by the problem can be closed too. The KEDB should be updated to show that the problem is resolved, so that any future incidents are not attributed to it; however, the information contained within the problem record may prove useful in addressing a future, similar problem.

Major Problem Review

Each organization must determine what constitutes a major problem, based on its own business priorities. Once a major problem is resolved, it is advisable to hold a formal review to learn from the experience. The review should be carried out in the immediate aftermath of events, when memories are fresh. It is led by the problem manager but should include the participation of everyone who played a role in the events under review.

The review is not only concerned with the failure that first triggered the major problem, but also with everything that happened and was done afterward. It takes a holistic view, in other words.

Specifically, the review asks the following questions:

  • What did we do that was right? What did we do that made things worse?
  • What could we do better in future?
  • What can we do to prevent it from happening again?
  • What proposals can we make for improvement?

These could involve making changes to processes or technology, requiring specific training for support or other staff, and requiring action by suppliers or changes to contracts.

A major problem review also has a significant role to play in rebuilding customer confidence, which will probably have suffered due to the major problem. The review should be conducted with this in mind, and it should be reflected in the resulting report. It might be useful to include a representative from the business in the review.

Figure 34.4 illustrates the way incident, problem, and change management activities are linked. It is largely self-explanatory, but note that an incident is closed when service is restored but a problem is closed only when a fix has been applied and is confirmed to be effective. Sometimes problem management cannot supply a workaround, so the incident stays open until the permanent fix is implemented.

Diagram shows the links between incident logging, investigation, workaround implementation, raise problem, resolution identification, workaround identification, raise change, and change implementation.

Figure 34.4 How incidents, problems, and changes are linked

Triggers

Triggers for problem management will vary between organizations and include triggers for both reactive and proactive problem management.

Reactive problem management triggers include the service desk identifying the need to determine the cause of one or more incidents, resulting in a problem record being raised. The incidents may have been resolved, but the cause is unknown. The process of undertaking problem management should enable the identification and removal of the underlying cause, preventing any recurrence. Sometimes it is obvious that an incident, or incidents, has been caused by a major problem, so a problem record will be raised immediately.

Another trigger for reactive problem management is the result of incident analysis by a technical support group showing an underlying problem. Event monitoring tools may raise incidents automatically, and these may require a problem record to be raised. Finally, a supplier may inform the service provider that a problem exists and has to be resolved.

Proactive problem management triggers include analysis of incidents that have occurred, leading to a decision to raise a problem record to investigate what is causing them to occur. An analysis of trends may identify one or more underlying causes. When the cause is removed, recurrence can be prevented.

Another possible proactive trigger may result from continual service improvement; steps taken to improve the quality of a service may result in the need for a problem record to identify further possible improvement actions.

Trend analysis depends on meaningful and detailed categorization of incidents/problems and regular reporting of patterns and repeat occurrences of incidents. This can be helped by regular reporting on the “top ten” incidents.
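A “top ten” report of this kind is simple to produce once incidents are categorized consistently. The following Python sketch assumes that each incident record has already been reduced to its category string; the category names are invented for the example.

```python
from collections import Counter


def top_incident_categories(categories: list[str], n: int = 10) -> list[tuple[str, int]]:
    """Return the n most frequent incident categories with their counts."""
    return Counter(categories).most_common(n)


# Hypothetical extract from the incident database: one category per incident
incident_log = (
    ["desktop/display"] * 5
    + ["network/printing"] * 3
    + ["email/login"] * 1
)

top_two = top_incident_categories(incident_log, n=2)
```

The report is only as good as the categorization behind it: if the same fault is logged under several different categories, it will not rise to the top of the list.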

Inputs

The incident database is a vital source of information for problem management because it contains information about current incidents and problems and is the basis of proactive problem management.

The configuration management system (CMS) provides information about the hardware and software components that underpin the IT services; this is essential for problem investigation and diagnosis. It also helps the process to prioritize problems appropriately by showing the links between components and services.

Problem management must assess whether a business case exists for the change request, which is required to implement a permanent fix to a problem; financial information is needed for this. Input from change management will provide feedback on progress and whether the changes are successful.

Finally in this list comes feedback from customers. If customers are unhappy about repeated recurrences of incidents interrupting their work, this would be both a trigger for and an input into problem management. Building and maintaining customer satisfaction is an essential objective of all service management activity, including problem management.

Outputs

Outputs from problem management include the obvious outputs such as updated problem records, requests for change, workarounds, and known error records. In addition, the process will output various reports. For example, a problem report might be discussed at a service review meeting. Reports of issues found should be referred to design teams and other processes. If reports include recommendations for improvement, they may be logged in the CSI register.

Interfaces between Problem Management and the Lifecycle Stages

Problem management is a key process that must be carried out if the service is to improve over time. There are several links between the process and other processes, both within the service operation area and across the service lifecycle.

Service Strategy

Problem management has a link with financial management for IT services to assess the financial impact of possible solutions or workarounds. This information can be used to decide whether a permanent resolution is financially justified.

Service Design

Problem management interfaces with several of the service design processes:

Availability Management This process and problem management have a similar goal: to prevent downtime. The proactive activities undertaken by availability management are directly related to proactive problem management; availability management attempts to proactively identify risks that could result in a loss of service and to take preventative action. Problem management can supply information to availability management about the success of any measures taken.

Capacity Management Some performance problems can be caused by capacity issues. Capacity management will be involved in resolving these issues and also taking proactive measures to prevent capacity issues. Again, problem management can supply information about the success of any measures taken.

IT Service Continuity Management If a significant problem is causing or will cause major disruption to the business, it may be necessary to invoke the IT service continuity management (ITSCM) plan until the issue is resolved. ITSCM also attempts to proactively identify risks that could result in a major loss of service and to take preventative action.

Service Level Management The service level management process interfaces with problem management in a different way. SLM is dependent on problem management to identify the root cause of incidents and resolve them in order to prevent downtime that could cause a service level target to be breached.

Service Transition

Problem management interfaces with several of the service transition processes:

Change Management As discussed, changes may be the cause of, or the solution to, problems.

Service Asset and Configuration Management Service asset and configuration management provides invaluable information to enable common factors to be identified across multiple incidents.

Release and Deployment Management Release and deployment is involved in contributing to problem management’s known error database, which is also related to knowledge management.

Knowledge Management The last interface with service transition is with knowledge management. The workarounds developed by problem management are examples of the service knowledge that is the core concern of knowledge management. The service knowledge management system can be the basis of the known error database.

Service Operation

As we discussed in the section about incident management, the major relationship that problem and incident management have is with each other.

Continual Service Improvement

Problem management and continual service improvement have closely aligned aims: both seek to drive out errors and improve service quality, and problem management activities can also be seen as CSI activities. As stated earlier, actions identified to resolve or prevent problems may be entered into the CSI register.

The seven-step improvement process can be used by CSI or problem management to identify and resolve underlying problems.

Critical Success Factors and Key Performance Indicators

The next topic that we’ll discuss is that of the critical success factors (CSFs) for effective problem management and the key performance indicators (KPIs) that will show whether these CSFs are being achieved. The ITIL Service Operation publication provides three examples of critical success factors for problem management:

  • Critical success factor: “Minimize the impact to the business of incidents that cannot be prevented.”

    This CSF is concerned with workarounds and their effectiveness, so we need KPIs to evaluate this. Here are a couple of possible KPIs for this:

    • Number of errors added to the known error database. This is measuring the success of the process in identifying workarounds. Identifying workarounds is only beneficial if they are being used.
    • Percentage of incidents resolved by the service desk. The assumption here is that the fix rate of the service desk will increase in line with an increasing number of known errors logged.
  • Critical success factor: “Maintain quality of IT services through elimination of recurring incidents.”

    This is measuring the effectiveness of the process at eliminating the underlying causes of incidents through the provision of permanent fixes. You should be able to devise KPIs that support these CSFs. Two possible KPIs are as follows:

    • The size of the problem backlog
    • The number of repeat incidents
  • Critical success factor: “Provide overall quality and professionalism of problem handling activities to maintain business confidence in IT capabilities.”

    This CSF is very much about the process itself rather than its direct impact on the quality of the IT services. There are many possible KPIs that would be relevant to this; let’s consider just three examples.

    • The percentage of major problem reviews completed. A best practice problem management process would conduct a formal review of every major problem. This KPI indicates whether that is being done. A KPI that tells us whether the reviews were being conducted in a timely manner would also be valuable.
    • The average cost per problem. This refers to the cost of investigation and diagnosis; it would tell us if problem management is efficient.
    • The backlog of outstanding problems and the trend—how effective is the process? This is an example of a KPI that might be used when the process is first established but discarded once the process is mature and has eliminated the backlog.
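To show how KPIs such as these might be calculated in practice, the following Python sketch computes two percentages from a list of incident records: the proportion resolved by the service desk and the proportion resolved using a known error. The record fields and sample data are assumptions made for this illustration; actual KPI definitions should be agreed within each organization.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    resolved_by: str        # for example "service desk" or "second line"
    used_known_error: bool  # True if a KEDB workaround resolved the incident


def percentage(part: int, whole: int) -> float:
    """Return part as a percentage of whole, or 0.0 when whole is zero."""
    return 100.0 * part / whole if whole else 0.0


def service_desk_fix_rate(incidents: list[Incident]) -> float:
    """Percentage of incidents resolved by the service desk."""
    resolved = sum(1 for i in incidents if i.resolved_by == "service desk")
    return percentage(resolved, len(incidents))


def known_error_usage_rate(incidents: list[Incident]) -> float:
    """Percentage of incidents resolved by applying a known error workaround."""
    used = sum(1 for i in incidents if i.used_known_error)
    return percentage(used, len(incidents))


# Hypothetical sample of four resolved incidents
sample = [
    Incident("service desk", True),
    Incident("service desk", True),
    Incident("service desk", False),
    Incident("second line", False),
]
```

Tracking both figures over time would indicate whether a growing KEDB is actually being used to raise the service desk fix rate, which is the assumption behind the second KPI of the first CSF.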

Challenges

Our next topic is the challenges faced by problem management. We’ve seen that problem management is dependent on the incident management process. This poses a number of challenges:

  • The incident process must be mature enough to correctly identify possible problems and to gather sufficient information to enable problem management to diagnose the cause. A critical challenge is ensuring that the two processes have formal interfaces and common working practices.
  • Problem resolution staff must have the skills and capabilities for problem solving.
  • Ideally, a single tool should be used for problem and incident management. The tool should have the ability to relate incidents to problems and enable the determination of relationships between CIs to assist in problem diagnosis.
  • A good working relationship between the second- and third-line staff working on problem support activities and first-line staff must be developed. Each must understand their role in the investigation of problems and that it is very much a team effort.
  • Staff working on problem resolution must understand the impact of problems on the business. Understanding the business and the role of IT in supporting it is a common challenge across the lifecycle.
  • The next challenge is the integration of activities with the CMS, which holds essential information about configuration items, their relationships, and their history.
  • The final challenge is having staff with the necessary technical knowledge to investigate and diagnose problems. This requires the staff to be available to work on the problem. It can be difficult to release staff from other work to do problem management.

Risks

The final topic in this chapter is the risks facing problem management:

  • Failing to meet any of the challenges listed in the preceding section is a risk.
  • There may be simply too many problems to handle due to insufficient resources or because the criteria for raising problems are too loose.
  • There may be a lack of information from incident management or from a CMS.
  • The focus of operational level agreements may be on incident resolution, and so staff do not realize the importance of problem management.

Summary

This chapter explored the first two processes in the service operation stage, incident management and problem management. It covered the purpose, objectives, scope, and value of these processes. We examined how each of these processes supports the other and the importance of these processes to the business and to the IT service provider.

You learned about the high-level process activities, methods, and techniques. We discussed triggers, inputs, outputs, interfaces, critical success factors, and key performance indicators. Finally, we looked at the challenges and risks for these two processes.

In the following chapters, we look at the service operation processes of request fulfilment, event management, and access management and discuss some of the generic concepts and definitions associated with them.

Exam Essentials

Understand the purpose and objectives of incident management in reducing downtime by resolving incidents quickly. Describe the scope of incident management and basic concepts such as major incidents, incident models, and the importance of timely resolution. Identify sources of incident reporting other than users reporting them to the service desk; for example, suppliers, support staff, and event management alerts are all possible sources. Understand that incident management is a reactive process. Be able to list and explain the interfaces that incident management has with other processes, especially problem management and service level management.

Understand that the aim of incident management is to restore service, not to identify the cause. This focus on service restoration means that incidents can be resolved by less-skilled staff than are needed to resolve problems. Be able to describe the differences between an incident, a problem, and a service request.

Explain how priority is calculated using business impact and urgency. Understand what the terms business impact and urgency mean.

Be able to explain the concept of incident and problem models and their use. Be able to describe the lifecycle of an incident and the use of the different statuses assigned to each stage. List the key information that would be recorded in an incident record. Describe the difference between the two types of escalation (hierarchic and functional) and when each is used.

Know the definition of a problem and the purpose, objectives, and scope of problem management. A problem is the unknown, underlying cause of one or more incidents, and the aim of problem management is to find the cause of incidents and remove it to prevent recurrence. Be able to describe the relationship between problem management and other processes.

Understand the value of problem management. List the ways in which problem management benefits the business, reduces downtime, and makes the best use of IT staff.

Understand the relationship between incidents, problems, and changes. Understand that incidents are caused by problems and so will continue to recur until the problem is resolved, usually by implementing a change.

Understand the concepts of a workaround and a known error. Explain why some problems might not be resolved, when it is not cost-effective to implement the fix, and when a workaround exists. Know how the known error database is used.

Review Questions

You can find the answers to the review questions in the appendix.

  1. Which is the best description of an incident?

    1. An event that has significance and impacts the service
    2. An unplanned interruption to an IT service or a reduction in the quality of an IT service
    3. A fault that causes failures in the IT infrastructure
    4. A user error
  2. When should an incident be closed?

    1. When the technical staff members are confident that it will not recur
    2. When desktop support staff members say that the incident is over
    3. When the user confirms that the service has been restored
    4. When the target resolution time is reached
  3. Which of the following is NOT a satisfactory resolution to an incident?

    1. A user complains of poor response; a reboot speeds up the response.
    2. A user complains of poor response; second-line support runs diagnostics to be able to monitor it the next time it occurs.
    3. The service desk uses the KEDB to provide a workaround to restore the service.
    4. The service desk takes control of the user’s machine remotely and shows the user how to run the report they were having difficulty with.
  4. Incident management aims to restore normal service operation as quickly as possible. How is normal service operation defined?

    1. It is the level of service the user requires.
    2. It is the level of service the technical management staff members say is reasonable.
    3. It is the level of service defined in the SLA.
    4. It is the level of service that IT believes is optimal.
  5. A service management tool has the ability to store templates for common incidents that define the steps to be taken to resolve the fault. What are these called?

    1. Major incidents
    2. Minor incidents
    3. Incident models
    4. Incident categories
  6. Which incidents should be logged?

    1. Major incidents
    2. All incidents that resulted from a user contacting the service desk
    3. Minor incidents
    4. All incidents
  7. What factors should be taken into consideration when assessing the priority of an incident?

    1. Impact and cost
    2. Impact and urgency
    3. Urgency and severity
    4. Severity and cost
  8. Which of the following are types of incident escalation defined by ITIL?

    1. Hierarchic
    2. Management
    3. Functional
    4. Technical
      1. 1 and 4
      2. 1 and 3
      3. 1, 2, and 4
      4. All of the above
  9. What is the best definition of a problem?

    1. An incident that the service desk does not know how to fix
    2. The result of a failed change
    3. The cause of one or more incidents
    4. A fault that will require a change to resolve
  10. Problem management can produce which of the following?

    1. Known errors
    2. Workarounds
    3. Resolutions
    4. RFCs
      1. 1 and 4
      2. 1 and 3
      3. 1, 2, and 4
      4. All of the above