Problem Management

According to official ITIL terminology, a problem is defined as an underlying cause of one or more incidents. Problem management is the process that investigates the cause of incidents and, wherever possible, implements a permanent solution to prevent recurrence. Until such time as a permanent resolution is applied, it will also attempt to provide a workaround to enable the service to be restored and the incident to be resolved.

It is important to understand the differences between incidents and problems and to realize that an incident never becomes a problem.


A Mechanical Incident, Problem, and Workaround
One morning, as you leave your house to go to work, you find that your car will not start. You have an incident.
You have little mechanical knowledge, but you do know how to apply a workaround—to use jumper cables. You do this, the car starts, and your incident is over.
Every morning for a week, the same thing happens, and each time you apply the workaround to overcome the incident and restore service. The underlying problem could have several possible causes: a faulty battery, a mechanical fault preventing the engine from charging the battery, a light in the trunk left permanently on, and so on. The problem investigation has to be carried out by someone with a greater mechanical knowledge than you.
On the weekend, you take the car to a mechanic, who diagnoses the root cause and applies a permanent resolution (replaces the battery, fixes the wiring, or whatever is required). Your car will now start each morning!

Many organizations make the mistake of thinking that problem management is not essential. Typically they will state something like “this year we will concentrate on incident management; maybe next year we shall try some problem management.” This is unwise. Until and unless problem management is undertaken, incidents will recur, inconveniencing the business and occupying support staff time. Problem management will reduce incidents, freeing up more time to undertake more problem management. It is a virtuous circle; the more time spent on it, the more time is freed up by it.

The Purpose, Objectives, and Scope of Problem Management

The purpose of the problem management process is to document, investigate, and remove causes of incidents. It also provides another very useful benefit: by providing workarounds, it reduces the impact of incidents that occur. It proactively identifies errors in the infrastructure that could cause incidents and provides a permanent resolution, thus preventing the incidents.

Problem management aims to identify the root cause of incidents, to document known errors, and to take action to remedy them. Problem management has three simple objectives:

  • Prevent problems and resulting incidents from happening
  • Eliminate recurring incidents
  • Minimize the impact of incidents that cannot be prevented

The scope of problem management includes diagnosis of the root cause of incidents and taking the necessary action in conjunction with other processes (such as change management and release and deployment management) to permanently remove them.

Problem management is also responsible for compiling information about problems and any associated workarounds or resolutions. By identifying faults, providing workarounds, and then permanently removing them, problem management reduces the number and the impact of incidents. It has a strong relationship with knowledge management, because it is responsible for maintaining a known error database and could also be said to be part of continual service improvement.

There are important similarities and differences between incident management and problem management, the two principal service operation processes. The same service management tool will usually be used to track both incidents and problems, and a good tool will facilitate the linking of incident occurrences to specific problem records. Similar categories and prioritization classifications may be used. However, problem management may be a process of which the business is unaware. Once a workaround has been applied and an incident resolved, the user may think no more about it. Meanwhile, the IT service provider uses problem management to prevent recurrence. An effective workaround can take some of the pressure off support staff, allowing them to take the time to investigate the underlying cause, without being chased for a resolution, as the service has been restored.

As we have said, an incident is an unplanned interruption to an IT service or reduction in the quality of an IT service. Sometimes an incident cannot be resolved until the cause is known and remedied; a server fails and will not restart, for example, because of a hardware fault.

Unlike incident management, which is entirely reactive (you cannot resolve an incident until it has occurred), problem management has both reactive and proactive features.

  • Problem management will react to incidents and attempt to identify a workaround and a permanent resolution.
  • It will also proactively try to identify potential incidents and take action to prevent them from ever happening. This might include analysis of incident trends, such as intermittent but increasingly frequent complaints about poor response times, to identify a potential capacity issue. By working with capacity management, proactive measures can be taken to provide sufficient capacity and avoid any major breaks in service. Event management reports may also be analyzed to the same end, in this case, preventing an incident before the user is aware of any issue.
  • Problem management may assist in a major incident review, trying to identify how to prevent a recurrence.

Reactive and proactive problem management activities normally take place as part of service operation, but problem management is also closely related to continual service improvement. Where improvement opportunities are identified as a result of problem management, they should be entered into the CSI register.

Problem Management Concepts

As stated earlier, problem management is not an interesting optional activity; it is fundamental to providing a consistent service, in line with SLA commitments. By providing workarounds that enable first-line staff to resolve incidents, better use is made of the more skilled and therefore more expensive second- and third-line staff, who are freed up to use their skills in problem investigation.

Reactive and Proactive Activities

The process steps for managing problems that are raised in reaction to incidents and those that are proactively identified are broadly similar. The main difference is the trigger for the process. Reactive activities take place as a result of an incident report and help prevent the incident from recurring or provide a workaround if avoidance is impossible; these activities complement the incident management process.

Proactive problem management analyzes incident records to identify underlying causes of incidents. It may be that analysis of previous incidents reveals a trend or pattern that was not apparent when each incident occurred. For example, users may complain of poor response periodically; it is only when all these complaints are analyzed that it becomes apparent that the poor response is always reported against the same module or from the same location. This would trigger a problem record to be raised to identify the common cause linking all these incidents. We will look at this in a bit more detail in the following sections.

Proactive problem management depends on the reporting capability of the service management tool; the tool must be able to produce reports that show the trend and allow drilling down into the data to find the connections that explain it. This may require incident reports sorted by category, date, time, location, application, or associated configuration item. Proactive steps are triggered by attempts to identify improvements and as such complement CSI.
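As a rough illustration of the kind of trend report described above, the following sketch counts incident records by a chosen attribute and flags values that recur. The record fields (id, category, location, module), the sample data, and the threshold of three are all assumptions made for this example; a real service management tool will have its own schema and reporting features.

```python
from collections import Counter

# Hypothetical incident records; the field names are assumptions for illustration.
incidents = [
    {"id": "INC001", "category": "performance", "location": "Leeds", "module": "billing"},
    {"id": "INC002", "category": "performance", "location": "London", "module": "billing"},
    {"id": "INC003", "category": "hardware", "location": "Leeds", "module": "payroll"},
    {"id": "INC004", "category": "performance", "location": "York", "module": "billing"},
]


def find_trends(records, field, threshold=3):
    """Return the values of `field` that appear in at least `threshold` records."""
    counts = Counter(record[field] for record in records)
    return {value: n for value, n in counts.items() if n >= threshold}


# Drill down by different attributes to see which one links the incidents.
module_trends = find_trends(incidents, "module")
location_trends = find_trends(incidents, "location")
```

In this sample, the poor-response complaints come from three different locations but all concern the same module, so only the module report shows a pattern that would justify raising a problem record.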

Problem Models

It may be useful to use problem models to handle problems that have not been and will not be resolved, perhaps because the cost or risk is too great or because the technology is due for replacement. These problem models are similar to the incident models described earlier, identifying the steps to take. They are used in addition to entries in the known error database.

When Is a Problem Raised?

Sometimes it is helpful to raise a problem record while the incident is still open. Each organization will decide its own criteria for when a problem should be raised. For example, a problem may be raised when the support teams are sure that the incident has been caused by a new problem, because the incident appears to be part of a trend or because there is no match with existing known errors. The incident may have been resolved by the service desk or support teams without knowing the cause and so there is a risk that the fault may recur. This is particularly true in the case of a major incident; the underlying cause needs to be identified as soon as possible to prevent future disruption to the business. (The problem diagnosis activity may take place in parallel with the incident resolution and may continue after the successful resolution, until the underlying cause is identified and removed.) It is also possible that suppliers may inform their customers of problems that they have identified.

Managing Problems: The Problem Management Process

Now we are going to examine the problem management process step by step. Refer to Figure 11.4 as we discuss each activity.

FIGURE 11.4 The problem management process

Based on Cabinet Office ITIL® material. Reproduced under license from the Cabinet Office.


Step 1: Detecting Problems

The first step in the process is to identify that a problem exists. As we discussed previously, problems may be raised either reactively, in response to incidents, or proactively. In addition to the triggers identified earlier, a problem may also be identified as a result of alerts received as part of event management. The event monitoring tools may identify a fault before it becomes apparent to users and may automatically raise an incident in response.

Step 2: Logging Problems

Having identified that a problem exists, a problem record should be logged. The problem record must contain all the relevant information, time-stamped to provide a complete picture. Wherever possible, the service management tool should be used to link problem records with the associated incident records. Incident details need to be copied into the problem record. Some tool sets enable the creation of a problem record from an incident, with automatic linking between the two. This can be very useful and saves a lot of time cutting and pasting details from one record to another. Be careful, however. Remember that the incident has not “become” a problem; the incident must continue to be managed to resolution whether the problem is resolved or not.

Typical details entered in a problem record and copied from the incident would include details of who reported it and when, details of the service and equipment used, and a description of the incident and actions taken. The incident record number and the priority and category would also be required.
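The logging step can be sketched in code. The example below is a minimal illustration, not the behavior of any particular tool: the Incident and Problem classes, their field names, and the record numbers are hypothetical. It copies the details listed above from the incident into a new problem record and links the two, while leaving the incident itself open.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Incident:
    number: str
    reported_by: str
    reported_at: datetime
    service: str
    description: str
    priority: int
    category: str
    status: str = "open"
    problem_number: Optional[str] = None  # link to a related problem, if any


@dataclass
class Problem:
    number: str
    logged_at: datetime  # time stamp for the problem record itself
    first_reported_by: str
    first_reported_at: datetime
    service: str
    description: str
    priority: int
    category: str
    incident_numbers: List[str] = field(default_factory=list)


def raise_problem_from(incident, problem_number):
    """Create a problem record from an incident and link the two records."""
    problem = Problem(
        number=problem_number,
        logged_at=datetime.now(timezone.utc),
        first_reported_by=incident.reported_by,
        first_reported_at=incident.reported_at,
        service=incident.service,
        description=incident.description,
        priority=incident.priority,
        category=incident.category,
        incident_numbers=[incident.number],
    )
    incident.problem_number = problem.number
    # The incident status is deliberately untouched: the incident has not
    # "become" a problem and must still be managed to resolution.
    return problem


incident = Incident(
    number="INC0042",
    reported_by="A. User",
    reported_at=datetime(2024, 1, 8, 9, 15, tzinfo=timezone.utc),
    service="Email",
    description="Mailbox will not open",
    priority=3,
    category="application",
)
problem = raise_problem_from(incident, "PRB0001")
```

Keeping the link on both records means either one can be used as the starting point when tracing related incidents and problems.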

Step 3: Categorizing Problems

Problems should be categorized in the same way as incidents, and using the same categorization scheme will make linking incidents and related problems together much easier.

An essential prerequisite for identifying trends in incidents is the accurate and consistent categorization of incidents. If every service desk analyst logs the same fault differently, it will be impossible to discern a trend. The example of poor response could be logged as a user complaint, a network issue, an application issue, or even “miscellaneous” or “other.” The problem manager should emphasize the importance of accurate categorization to the service desk. The use of incident models can be very helpful here because they standardize the way common incidents are recorded. Enforcing categorization on incident resolution, as mentioned earlier, will also help ensure incident categories are accurate.

Step 4: Prioritizing Problems

As with incidents, the priority of a problem should be based on the impact to the business of the incidents that it is causing and the urgency with which it needs to be resolved. The problem manager should also consider how frequently the incidents are occurring. A “frozen screen” that can be resolved with a reboot may not be a high-priority incident; however, if it is occurring 100 times a day, the combined impact to the business may be severe, so the problem needs to be allocated a high priority. The impact to the business must always be considered, so factors such as the cost of resolving the incident and the time this is likely to take will be relevant when assessing priority.
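The effect of frequency on priority can be shown with a small worked example. This is only a sketch under stated assumptions: the impact/urgency matrix is illustrative, and the idea of measuring combined impact as minutes lost per day, along with the 60- and 240-minute thresholds, is invented for this example. Each organization will define its own scheme.

```python
# Illustrative impact/urgency matrix: 1 is the highest priority, 5 the lowest.
PRIORITY_MATRIX = {
    ("high", "high"): 1, ("high", "medium"): 2, ("high", "low"): 3,
    ("medium", "high"): 2, ("medium", "medium"): 3, ("medium", "low"): 4,
    ("low", "high"): 3, ("low", "medium"): 4, ("low", "low"): 5,
}


def combined_impact(minutes_lost_per_incident, incidents_per_day):
    """Rate the total daily impact of all incidents caused by one problem.

    The thresholds are assumptions chosen for this example.
    """
    minutes_lost = minutes_lost_per_incident * incidents_per_day
    if minutes_lost >= 240:
        return "high"
    if minutes_lost >= 60:
        return "medium"
    return "low"


def problem_priority(minutes_lost_per_incident, incidents_per_day, urgency):
    """Look up the problem priority from combined impact and urgency."""
    impact = combined_impact(minutes_lost_per_incident, incidents_per_day)
    return PRIORITY_MATRIX[(impact, urgency)]


# A five-minute reboot once a day versus the same fault 100 times a day.
occasional = problem_priority(5, 1, "medium")
frequent = problem_priority(5, 100, "medium")
```

With these assumed thresholds, the occasional frozen screen yields a priority of 4, while the same fault occurring 100 times a day loses 500 minutes and yields a priority of 2.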

Step 5: Investigating and Diagnosing Problems

The next stage in the process is to investigate and diagnose the problem. There may not be the resources to investigate every problem, so the priority level assigned to each will govern which ones get the necessary attention. It is important to allocate resources to problem investigation, because until the problem is resolved, the incident will recur, and resources will be spent on incident resolution.

The ITIL framework suggests a number of different problem-solving techniques, which are helpful in approaching the diagnosis logically. The CMS can be very helpful in providing CI information to help identify the underlying cause. It will also help in identifying the point of failure, where several incidents are reported; the CMS may identify that all the affected CIs are linked to the same CI. The KEDB may also provide information about previous, similar problems and their causes. Where a test environment exists, this can be used to re-create the fault and to try possible solutions.
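To illustrate how CI relationships can point to a shared point of failure, the sketch below takes a set of affected CIs and finds the CIs that all of them depend on. The dependency data and CI names are hypothetical, and a real CMS would expose its relationships through its own query interface rather than a Python dictionary.

```python
# Hypothetical CMS extract: each CI maps to the CIs it directly depends on.
depends_on = {
    "pc-101": ["switch-a"],
    "pc-102": ["switch-a"],
    "pc-203": ["switch-b"],
    "switch-a": ["router-1"],
    "switch-b": ["router-1"],
    "router-1": [],
}


def upstream(ci, graph):
    """Return every CI that `ci` depends on, directly or indirectly."""
    seen = set()
    stack = list(graph.get(ci, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, []))
    return seen


def common_dependencies(affected_cis, graph):
    """Return the CIs that every affected CI depends on."""
    dependency_sets = [upstream(ci, graph) for ci in affected_cis]
    if not dependency_sets:
        return set()
    return set.intersection(*dependency_sets)


same_switch = common_dependencies(["pc-101", "pc-102"], depends_on)
different_switches = common_dependencies(["pc-101", "pc-203"], depends_on)
```

When the affected PCs sit behind different switches, the only shared dependency left is the router, which narrows the investigation considerably.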

Step 6: Identifying a Workaround

Although the aim of problem management is to find and remove the underlying cause of incidents, this may take some time; meanwhile, the incident or incidents continue, and the service is affected. When a user suffers an incident, the first priority is to restore the service so that they can continue working. A priority of the process, therefore, is to provide a workaround to be used until the problem is resolved. The workaround does not fix the underlying problem, but it allows the user to continue working by providing an alternative means of achieving the same result. The workaround can be provided to the service desk to enable them to resolve the incidents, while work on a permanent solution continues. The problem record remains open, because the fault still exists and is continuing to cause incidents. The details of the workaround are documented within the problem record, and a reassessment of its priority may be carried out.

It is possible that IT or business management may decide to continue to use the workaround and suspend work on a permanent solution if one is not justified. A problem affecting a service that is due to be replaced, for example, may not be worth the effort and risk involved in implementing a permanent solution. In the previous example regarding the car that fails to start, the owner may decide not to repair the fault if the car is to be replaced within weeks. Until it is replaced, the owner uses the workaround of jumper cables, rather than pay the mechanic to fix the fault.

Step 7: Raising a Known Error Record

When problem management has identified and documented the root cause and workaround, this information is made available to support staff as a known error. Information about all known errors, including which problem record it relates to, is kept in the known error database (KEDB). When repeat incidents occur, the support staff can refer to the KEDB for the workaround.

There may be times when a workaround is available although the root cause is not yet known (for example, a reboot restores the service, although we do not know what causes the error). On other occasions, we may know the cause but not have a workaround because a change has to be implemented to fix the fault.

Sometimes a known error is raised before a workaround is available and sometimes even before the root cause has been fully identified. This may be just for information purposes; a workaround may be available that has not been fully proven. Rather than have a rigid rule about when a known error should be raised, a more pragmatic approach is advisable; a known error should be raised as soon as it becomes useful to do so.
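A known error record and a simple lookup might be sketched as follows. The KnownError class, its fields, the sample entries, and the matching rule (same category, with the recorded symptom appearing in the incident description) are assumptions for illustration only. Both the root cause and the workaround are optional, reflecting the point that a known error may be raised before either is available.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KnownError:
    problem_number: str  # the problem record this known error relates to
    symptom: str
    category: str
    root_cause: Optional[str] = None  # may not be known yet
    workaround: Optional[str] = None  # may not be available yet


# Hypothetical KEDB contents.
kedb = [
    KnownError("PRB0007", "screen freezes", "desktop",
               workaround="Reboot the PC"),
    KnownError("PRB0011", "report fails to print", "application",
               root_cause="Driver defect; a change is needed to fix it"),
]


def find_known_errors(known_errors, category, incident_description):
    """Return known errors in the same category whose symptom is mentioned."""
    text = incident_description.lower()
    return [
        ke for ke in known_errors
        if ke.category == category and ke.symptom.lower() in text
    ]


matches = find_known_errors(kedb, "desktop", "User says the screen freezes after login")
```

Here the service desk would find the entry for PRB0007 and apply its workaround, even though no root cause has been recorded for it yet.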

Problem Resolution

When problem management has identified a solution to the problem, it should be implemented to resolve the underlying fault and thus prevent any further incidents from disrupting the service. Implementing the resolution may involve a degree of risk, however, so the change management process will ensure the risk and impact assessment of the RFC is satisfactory before allowing the change. Ultimately, the decision whether to go ahead with the resolution despite the risk is a business decision; the business damage being done by the problem may mean the business is prepared to accept the risk in order to have the fix implemented. For more discussion about the acceptance of risk by the business, see Chapter 8.

  • Often a change to resolve a problem will be an emergency change because of the impact of the problem on the business and the urgency with which it needs to be fixed.
  • In the circumstance mentioned earlier, where a permanent resolution is not justified, the KEDB should be used to document the workaround. The entry should state that the problem is not to be resolved to prevent any unnecessary work being done on it.
  • There may be workarounds that mitigate rather than remove the impact of the fault; these should be documented and used until a better resolution is found. Having a workaround like this available, although not entirely satisfactory, may allow the priority to be reassessed.

Problem Closure

When a permanent solution to the problem has been identified, tested, and implemented through the change management process, the problem record can be updated and closed. Any open incidents caused by the problem can be closed too. The KEDB should be updated to show that the problem is resolved, so that support staff do not attribute future incidents to it; however, the information contained within the problem record may prove useful in addressing a future, similar problem.

Major Problem Review

Each organization should define what constitutes a major problem; this may be all priority one problems, anything above a particular priority level that continues, or some other criterion. Once a major problem has been resolved, a review should be held to identify any lessons that can be learned from what occurred. This review should happen close enough to the time of the event that those involved can still remember what happened. The importance of good recordkeeping is apparent at a review; a well-documented problem, with a full history of steps taken, will provide useful information. A problem record with little detail will mean important items will be forgotten, or those present may have differing recollections.

It is important to remember, for this and other reviews, that lessons can be learned from what went well, not just from what went badly. Concentrating on what went wrong may lead to a list of recommendations about what not to do next time but gives little guidance about what should be done instead. Even when a problem causes a lot of disruption, there may still be positive lessons to be learned. For example, the business may comment that it was really helpful to have regular updates from the service desk regarding the status of the incident that resulted from the problem, because this allowed them to plan how to make best use of staff time.

The output from the review should cover what went well, whether anything was done that was against the agreed process, any suggestions for improvements for the future, and any ideas about how the problem could be prevented in the future. Follow-up actions should be assigned to the relevant teams, process owners, or third-party suppliers, and internal improvements can be entered into the CSI register. It is important that action is taken on any improvements identified, whether technical changes to monitoring or logging tools, changes to process activities, or addressing training needs that have been identified.

The review may highlight underlying causes that can be handled as part of proactive problem management. The service-level manager (possibly accompanied by the problem manager) should report to the next service review what improvements have been identified and implemented to help prevent future major incidents. This provides assurance to the business that the IT service provider is not complacent and is making a genuine effort to improve the service.
