THE FOLLOWING ITIL INTERMEDIATE EXAM OBJECTIVES ARE DISCUSSED IN THIS CHAPTER:
Modern infrastructure management depends to a large extent on the use of event monitoring tools. These tools are able to monitor large numbers of configuration items simultaneously, identifying any issues as soon as they arise and notifying technical management staff. The process of event management is responsible for managing events throughout their lifecycle. Event management is one of the main activities of IT operations.
To begin, let’s consider some definitions from the ITIL Service Operation publication. These should be familiar from your Foundation course.
An event can be defined as any change of state that has significance for the management of a configuration item (CI) or IT service. Remember, an event is not necessarily an indication that something is wrong; it can merely be a confirmation that the system is working correctly. Many events are purely informational. Informational events could include notification of a user logging onto an application (significant because the use of the application may be metered) or a transaction completing successfully (significant because the notification of the successful completion may trigger the start of the next transaction).
An event that notifies staff of a failure or that a threshold has been breached is called an alert. An alert could be, for example, notification that a server has failed or a warning that the memory or disk usage on a device has exceeded 75 percent. If you consider these concepts in a non-IT environment, a car console may issue an event to say that the system has successfully connected to a Bluetooth device, or it might raise an alert (together with a beep or flashing light) to warn that a threshold has been breached and the car is now low on gas.
Effective service operation is dependent on knowing the status of the infrastructure and detecting any deviation from normal or expected operation. Event management monitors services for any occurrences that could affect their performance. It also provides information to other processes, including incident, problem, and change management.
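There are two types of event monitoring tools: active monitoring tools, which poll key CIs to determine their status and availability, and passive monitoring tools, which detect and correlate operational alerts or communications generated by CIs.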
The purpose of event management is to detect events, understand what they mean, and take action if necessary.
Many devices are designed to communicate their status, and event monitoring will gather these communications and act upon any that need action. Some communications report operational information, such as “backup of file complete,” “print complete,” and so on. These events show that the service is operating correctly. They can be used to automate routine activities such as submitting the next file to be backed up or the next document to be printed. They may also be used to monitor the load across several devices, issuing automated instructions to balance the load depending on the events received. If the event is an alert, such as “backup failed,” “printer jam,” or “disk full,” the necessary corrective steps will be taken. An incident should be logged in the case of a failure.
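As a minimal sketch of how such communications might drive automation, the following Python fragment maps event messages to handlers. The message strings come from the examples above; the handler names, the backup queue, and the incident list are hypothetical stand-ins for a real scheduler and service desk tool.

```python
from collections import deque

# Hypothetical stand-ins for a backup scheduler and an incident log.
backup_queue = deque(["sales.db", "hr.db"])
incidents = []
actions = []

def submit_next_backup(event):
    """Operational event: start the next file in the queue, if any."""
    if backup_queue:
        actions.append(f"start backup of {backup_queue.popleft()}")

def log_incident(event):
    """Alert: record an incident for the failure."""
    incidents.append(f"incident: {event['message']} on {event['source']}")

# Illustrative rule table: message text -> handler.
HANDLERS = {
    "backup of file complete": submit_next_backup,
    "backup failed": log_incident,
    "disk full": log_incident,
}

def handle(event):
    """Run the handler for a recognized message; ignore anything else."""
    handler = HANDLERS.get(event["message"])
    if handler is not None:
        handler(event)

handle({"source": "srv01", "message": "backup of file complete"})
handle({"source": "srv01", "message": "backup failed"})
```

After these two calls, one backup has been submitted and one incident has been logged. A real tool would hold these rules in its own configuration rather than in code.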
Event management has the following objectives:
Event management can be applied to any aspects of service management that need to be controlled and that could benefit from being automated. For example, the service management toolset automatically logs incidents in response to emails or events being received, escalates incidents when thresholds have been reached, and notifies staff of certain conditions (for example, a priority one incident being logged).
Configuration items can be monitored by event management tools; this monitoring can be for two different reasons:
Tracking licenses is another possible use for event management tools; licenses can be tracked to make sure there is no illegal use of an application by checking to see that the number of people using the software does not exceed the licenses held. This may also save money; by showing that there is less demand for concurrent use than was thought, the number of licenses can be reduced.
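The license check described here comes down to a simple comparison. The sketch below assumes that the monitoring tool can supply samples of concurrent users; the function name, status labels, and figures are illustrative only.

```python
def license_status(concurrent_users, licenses_held):
    """Compare peak concurrent use with the number of licenses held.

    Returns a (status, difference) pair; the status labels are illustrative.
    """
    peak = max(concurrent_users, default=0)
    if peak > licenses_held:
        return ("exceeds licenses", peak - licenses_held)
    return ("within license", licenses_held - peak)

# Hypothetical hourly samples of concurrent users for one application.
samples = [12, 18, 25, 22, 17]
status, spare = license_status(samples, licenses_held=40)
```

With these figures, peak use is 25 against 40 licenses held, so 15 licenses are unused and the organization might consider reducing the number it pays for.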
Monitoring for and responding to security events, such as detecting intruders, is another use; the tools can also be used to detect a denial of service attack or similar event.
Another use is the monitoring of environmental conditions. This might be for detecting a sudden increase in temperature in the server room or for other environmental changes.
Event management offers many benefits to a business:
Next, we consider suggested policies for event management.
The first policy states that event notifications should go to only those who have responsibility for acting on them. This means that a target audience must be identified for every event that we have chosen to handle—it is not acceptable to send a notification to everyone and hope that someone will do something.
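One way to picture this policy is as a lookup that must succeed for every handled event. In the sketch below, the categories and team names are hypothetical; the point is that an event with no defined audience is treated as a configuration error rather than being broadcast.

```python
# Hypothetical mapping of event categories to the team responsible for acting.
AUDIENCE = {
    "network": "network-operations",
    "database": "dba-team",
    "facilities": "facilities-management",
}

def route(event):
    """Return the single team that should receive this notification.

    An event with no defined audience raises an error instead of
    being sent to everyone.
    """
    category = event["category"]
    if category not in AUDIENCE:
        raise ValueError(f"no target audience defined for {category!r}")
    return AUDIENCE[category]

recipient = route({"category": "database", "message": "tablespace nearly full"})
```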
The second policy relates to the centralization of event management. This ensures that notifications are handled consistently, that none are missed, and that none are handled by more than one person or team. It implies that a single rules engine will be used to process notifications, and of course that set of rules should be subject to change management.
The third policy provides guidance and constraints for the designers of new applications. There should be a common set of standards for events generated by applications. This will ensure consistency across applications and, of course, reduce the time to engineer event handling in new applications.
The next policy is that the handling of events should be automated as much as possible. The advantages of automation in general are well known: reduced costs, fewer errors, and so on.
The fifth policy mandates the use of a standard classification scheme for events to ensure that similar types of events are handled in a consistent way.
The last policy states that all recognized events should at the very least be logged. This will provide a source of valuable information that might have a number of uses, for example, in problem investigation. A more sophisticated analysis of logged events might identify patterns of events that can be used to predict failures before they actually occur.
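As a simple illustration of what analysis of logged events could look like, the sketch below counts warning events per CI and flags any CI that recurs. The log layout, the CI names, and the threshold of three are assumptions made for the example; genuine predictive analysis would be considerably more sophisticated.

```python
from collections import Counter

# Hypothetical event log entries: (time, CI name, event type).
event_log = [
    ("09:00", "disk-array-1", "warning"),
    ("09:05", "web-01", "informational"),
    ("09:30", "disk-array-1", "warning"),
    ("10:10", "disk-array-1", "warning"),
    ("10:15", "web-02", "warning"),
]

def recurring_warnings(log, minimum=3):
    """Return the CIs that logged at least `minimum` warning events."""
    counts = Counter(ci for _, ci, kind in log if kind == "warning")
    return sorted(ci for ci, n in counts.items() if n >= minimum)

candidates = recurring_warnings(event_log)
```

Here only disk-array-1 reaches the threshold, which might make it a candidate for problem investigation before it fails.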
It is important to understand the difference between monitoring and event management. The two activities are closely related, but each has a different emphasis.
We need to monitor events, but monitoring covers more than events. Monitoring can be used, for example, to make sure devices are operating correctly, even without any events being generated. In other words, monitoring also looks for conditions that do not generate events.
Event management is about having useful notifications about the status of the IT infrastructure and services. Event management sets up rules to ensure that events are generated so they can be monitored, captured, and acted upon if necessary. Action is the key to event management.
The particular notifications themselves may be vendor specific. However, they are likely to use Simple Network Management Protocol (SNMP), which is an Internet standard protocol for managing devices on IP networks. Devices that typically support SNMP include routers, switches, servers, workstations, printers, modem racks, and more. Because SNMP is an open standard, it makes interaction between different products simpler. Events must generate useful notifications. The time taken to create meaningful descriptions, with suggested actions, will save enormous effort later.
Event management can be enormously useful in managing large and complex infrastructures. It is often the case, however, that the full value of these tools is not realized. This is usually because too little time has been spent configuring them to notify staff only of events that actually need their attention. Failing to specify the correct thresholds, for example, will mean that far too many breaches are reported, causing staff to ignore them because they are seldom significant. As a result, significant events are missed. If events are not filtered properly, the service management tool will be flooded with spurious events, which makes it difficult to use its ability to raise incidents automatically.
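The effect of a poorly chosen threshold is easy to demonstrate with a few numbers. The readings and the two thresholds below are invented for illustration.

```python
# Hypothetical CPU utilization samples (percent) for one server over a day.
samples = [35, 62, 71, 68, 90, 74, 66, 95, 58, 72]

def breaches(readings, threshold):
    """Count the readings that exceed the threshold."""
    return sum(1 for r in readings if r > threshold)

low = breaches(samples, 60)   # threshold set too low: 8 of 10 readings breach
high = breaches(samples, 85)  # threshold set higher: only 2 readings breach
```

With the threshold at 60 percent, eight of the ten readings would be reported; at 85 percent, only the two genuinely unusual peaks are reported.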
Another important definition is that of an alert: An alert is a warning that a threshold has been reached, something has changed, or a failure has occurred. Alerts are often created and managed by system management tools and the event management process. Creating an alert when a disk or mailbox is nearly full is one such example.
Some events indicate a failure that must be fixed, while others simply flag that something has happened and should be recorded. These are two types of event: the first is an exception and the second informational. There is a third type—a warning event. A warning event signifies unusual but not necessarily exceptional behavior. Warning events require further analysis to determine whether any action is required. Events will be handled according to their type.
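The three event types can be sketched as a simple classification. The disk usage example and both threshold values here are assumptions for illustration; each organization would define its own.

```python
from enum import Enum

class EventType(Enum):
    INFORMATIONAL = "informational"
    WARNING = "warning"
    EXCEPTION = "exception"

def classify_disk_usage(percent_used, warning_at=75, exception_at=95):
    """Classify a disk usage reading using two illustrative thresholds."""
    if percent_used >= exception_at:
        return EventType.EXCEPTION
    if percent_used >= warning_at:
        return EventType.WARNING
    return EventType.INFORMATIONAL

reading_types = [classify_disk_usage(p) for p in (40, 80, 97)]
```

The three sample readings are classified as informational, warning, and exception, respectively.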
Here are some examples of each type of event:
Notice that not all of these examples relate to a failure. (Failures would be alerts.) Some simply contain information that is nonetheless important to record. For example, the business might want to maintain a record of who is using an application for audit purposes.
There are no definitive criteria for determining the type of an event; it depends very much on the specific situation of the organization. For example, an event might be that a previously unknown device has been detected on the network. Some organizations allow their staff to attach their own laptops to the corporate network, in which case the event would be informational. In a highly secure organization, it would almost certainly be treated as an exception.
The next topic we’ll examine is event filtering. We don’t have complete control over the notifications that are generated by the configuration items. The manufacturers of hardware will have decided what notifications will be generated, and they may not have provided their customers with the ability to switch them off. A common experience when first beginning to monitor networks is that the monitoring tool is swamped by unrecognized and therefore unneeded notifications.
Filtering discards notifications of events that have no significance to the organization, which prevents the event management system from being overwhelmed.
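A filter of this kind might look something like the following sketch. The set of significant messages and the sample notifications are hypothetical, and suppressing repeat reports of the same fault is an additional assumption about what the filter should do.

```python
# Hypothetical set of messages the organization has decided to handle.
SIGNIFICANT = {"link down", "disk full", "backup failed"}

def filter_notifications(notifications):
    """Drop insignificant messages and repeat reports of the same fault."""
    seen = set()
    kept = []
    for source, message in notifications:
        key = (source, message)
        if message not in SIGNIFICANT or key in seen:
            continue
        seen.add(key)
        kept.append(key)
    return kept

raw = [
    ("sw-01", "link down"),
    ("sw-01", "fan speed changed"),
    ("sw-01", "link down"),
    ("db-02", "disk full"),
]
kept = filter_notifications(raw)
```

Of the four notifications received, two survive: the first "link down" from sw-01 and the "disk full" from db-02.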
There are four possible approaches to the problem:
These approaches are not mutually exclusive; many organizations will adopt some hybrid of them.
Successful event management in service operation requires analyzing and planning for what will be required. This should happen in the service design phase, although it will continue to be adjusted in service operation. Many organizations attempt and abandon event management, or they fail to achieve real value from it because this crucial design phase has been neglected.
The following questions should be asked when designing a service or planning the introduction of new technology:
When event management is first established, these questions should be asked about the existing services and infrastructure. Stakeholders who must be consulted include the business, process owners, and operations management staff. Each of these groups will have monitoring requirements, and each could be involved in handling events when they occur.
Instrumentation refers to specific ways to monitor and control the infrastructure and services. A number of practical issues need to be addressed when designing an event management system:
As events are detected, the event management system must interpret and make decisions about how to handle them. This is done by software known as a correlation engine. The correlation engine allows the creation of rule sets that it will use to process events.
Using a correlation engine will enable the system to determine the significance of each event and also to determine whether there is any predefined response to an event. Patterns of events are defined and programmed into correlation tools for future recognition. The correlation engine can translate component-level events into service impacts and business impacts, as shown in Figure 36.1.
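To make the idea of translating component-level events into service and business impacts more concrete, here is a minimal sketch. The dependency tables are hypothetical; in practice this information would come from the configuration management system, and a commercial correlation engine would hold it in its own rule sets.

```python
# Hypothetical dependency data, as might be drawn from a CMS.
CI_TO_SERVICES = {
    "db-02": ["order entry", "invoicing"],
    "sw-01": ["order entry"],
}
SERVICE_TO_BUSINESS = {
    "order entry": "customers cannot place orders",
    "invoicing": "invoices cannot be issued",
}

def impact_of(ci):
    """Translate a component-level exception into service and business impacts."""
    services = CI_TO_SERVICES.get(ci, [])
    return [(service, SERVICE_TO_BUSINESS[service]) for service in services]

impacts = impact_of("db-02")
```

A failure of db-02 is reported not as "database server down" but as an impact on two services and the business activities that depend on them.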
Next, we take a look at the event management process. The process steps are shown in Figure 36.2.
The initial sequence of activities in the event management process is as follows:
At this point, the event type (exception, warning, or informational) has been identified. No further processing is required for informational events. For exception events, one or more of the service management processes will be triggered. If the event concerns something that has broken and requires restoring to normal service levels, an incident should be raised. A problem record may be updated if another example of a fault under investigation occurs. The automated response to an event may include raising a change. Some events, such as a “toner low” message, may require a service request to be handled by the request fulfilment process.
Warning events will then enter second-level correlation, which identifies how to proceed. In some cases, the event will be treated as informational or as an exception. Other cases will trigger either an automated response or an alert for human intervention, as detailed in the following section.
Let’s look at the initial process activities in a little more detail. Event notification refers to the communication of information about an event. You’ve already seen that some components will generate notifications independently, while others have to be prompted by polling.
Some events will be detected directly by the event management tool. Other events will be detected by a software agent running on the device being monitored. This agent then generates a notification that can be detected by the event management tool. All events are logged.
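The difference between polling a component and receiving its notifications can be sketched as follows. The function and class names, the status values, and the CI names are all illustrative assumptions; real tools would use a protocol such as SNMP rather than in-memory calls.

```python
def poll(cis, get_status):
    """Polling: ask each CI for its status and report any that is not 'ok'."""
    events = []
    for ci in cis:
        status = get_status(ci)
        if status != "ok":
            events.append((ci, status))
    return events

class Listener:
    """Notification: CIs or their agents push messages, and all are logged."""

    def __init__(self):
        self.log = []

    def receive(self, ci, message):
        self.log.append((ci, message))

# Hypothetical current statuses, standing in for real devices.
statuses = {"router-1": "ok", "printer-3": "paper jam"}
polled = poll(list(statuses), statuses.get)

listener = Listener()
listener.receive("srv01", "backup of file complete")
```

Polling finds the printer fault because the tool asked; the backup message arrives because the server, or an agent running on it, sent it.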
In first-level correlation, a decision is made about whether the event has any significance to the organization and whether any further action is required. Correlation will determine whether an event is informational, a warning, or an exception. We discussed filtering earlier; it is needed to stop staff from being overwhelmed by events that do not require any action or by multiple reports of the same fault.
Informational events are closed at this point. Exception events will trigger one of the other service management processes. Warning events will go forward to second-level correlation.
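The outcome of first-level correlation can be summarized as a small routing table. This is only a sketch of the three paths just described; the wording of the outcomes is our own.

```python
# The three paths out of first-level correlation, keyed by event type.
FIRST_LEVEL_OUTCOMES = {
    "informational": "close",
    "exception": "trigger a service management process",
    "warning": "pass to second-level correlation",
}

def first_level_outcome(event_type):
    """Return the next step for an event after first-level correlation."""
    return FIRST_LEVEL_OUTCOMES[event_type]

next_step = first_level_outcome("warning")
```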
Next, we consider the criteria that might be used by the second-level correlation engine to interpret an event:
The correlation engine determines whether the event requires some action or whether it can be treated as informational and closed.
Some events indicate conditions that can be resolved automatically without human intervention. For example, if a file server is detected to be nearly full, then a script could be run that would free up space by archiving old data.
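A script of this kind needs some rule for choosing what to archive. The sketch below assumes one possible rule, archiving the oldest files first until usage falls to a target level; the file names, sizes, and the 70 percent target are invented for the example, and the function only selects files rather than moving them.

```python
def files_to_archive(files, capacity_gb, target_percent=70):
    """Pick the oldest files to archive until usage falls to the target.

    `files` is a list of (name, size_gb, age_days) tuples.
    """
    used = sum(size for _, size, _ in files)
    target = capacity_gb * target_percent / 100
    selected = []
    # Consider the oldest files first.
    for name, size, _ in sorted(files, key=lambda f: f[2], reverse=True):
        if used <= target:
            break
        selected.append(name)
        used -= size
    return selected

files = [
    ("q1-report.zip", 30, 400),
    ("q2-report.zip", 25, 300),
    ("current.db", 35, 2),
]
chosen = files_to_archive(files, capacity_gb=100)
```

The server starts at 90 GB used out of 100 GB. Archiving the oldest file alone brings usage down to 60 GB, which is below the target, so nothing else is selected.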
Some events will require human intervention—an alert from a smoke detector, for example. It’s important that the alert is directed to the right person and that they know what to do.
Exception events will normally trigger the incident management process. Ideally, the event and incident management systems will be integrated so that an incident record can be raised automatically. A word of warning, however: This should be implemented only when you are happy that the filtering of events is working correctly. If this is not the case, your incident management system will be flooded with thousands of spurious incidents!
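One safeguard against flooding the incident management system is to raise only one incident per fault and to record repeats against it. The sketch below assumes that approach; the record layout and the incident numbering are hypothetical and do not represent any particular tool.

```python
# Hypothetical store of open incidents, keyed by (CI, fault).
open_incidents = {}

def raise_incident(ci, fault):
    """Raise one incident per (CI, fault); repeats update the existing record."""
    key = (ci, fault)
    if key in open_incidents:
        open_incidents[key]["occurrences"] += 1
        return open_incidents[key]["id"]
    incident_id = f"INC{len(open_incidents) + 1:04d}"
    open_incidents[key] = {"id": incident_id, "occurrences": 1}
    return incident_id

first = raise_incident("srv01", "disk full")
repeat = raise_incident("srv01", "disk full")
other = raise_incident("sw-01", "link down")
```

The second "disk full" event from srv01 does not create a new incident; it is counted against the one already open.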
The problem management process might be triggered if the organization has a policy of always investigating the root causes of incidents that impact key services. Event management can support such a policy by automatically raising a problem record when it detects such an incident.
Change management can be triggered in two circumstances:
Remember, sometimes it will be necessary to trigger a combination of these responses.
There could be thousands of events each day, so it’s unlikely that every one of them could be reviewed. It’s sensible to review only what the service provider considers to be significant events. It is probably unnecessary to review events that have triggered other service management processes except to ensure that the triggers were effective.
Most events are neither opened nor closed but just logged in management systems or system logs. Many others can be closed automatically. For example, when a script is triggered to respond to an issue, the script itself could check that the corrective action has worked and generate an event to that effect.
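The idea of a script that checks its own result can be sketched as follows. The function name, the event fields, and the restart example are assumptions for illustration.

```python
def run_corrective_action(action, check):
    """Run an action, verify it, and return an event describing the result."""
    action()
    if check():
        return {"type": "informational",
                "message": "corrective action succeeded",
                "status": "closed"}
    return {"type": "exception",
            "message": "corrective action failed",
            "status": "open"}

# Hypothetical example: a service reported as stopped is restarted.
state = {"running": False}
result = run_corrective_action(
    action=lambda: state.update(running=True),
    check=lambda: state["running"],
)
```

If the check passes, the original event can be closed automatically; if it fails, the script generates an exception so that the issue is not silently lost.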
We’ll now consider the event management process triggers, inputs, outputs, and interfaces.
Any type of change in state can trigger event management, and an organization should define which of these state changes need to be acted upon. Some examples are shown here:
Inputs to event management usually come from service design and service transition. They include the examples listed here:
Outputs from event management are usually passed to other service management processes, such as incident management, change management, and request fulfilment. They include the examples listed here:
The most obvious output of the process is the events themselves. These should have been communicated and escalated to the appropriate people. Another output is a chronological event log describing what events took place and any escalation and communication activities undertaken. This may be useful information if further investigation is required or to spot possible improvement opportunities.
Some events output by event management will indicate that an incident has occurred, and others will warn of the potential breach of an SLA or OLA objective. Of course, as we have said, not all events show that something is wrong, and many events will just indicate successful completion of deployment or operational activities. The data output from event management can be used to populate the SKMS with the event information and history.
Finally, let’s consider the interfaces event management has with the other lifecycle stages and their associated processes. Event management can interface with any process that requires monitoring and control, especially those that don’t require real-time monitoring but do require some form of intervention following an event or group of events. First we’ll consider how the process can even help the business directly.
The information provided by event monitoring may be used to help manage unusual occurrences with business processes.
Event management interfaces with a number of service design processes. Examples include the following:
Event management tools may also be used to support service transition processes:
Event management is a service operation process, and it interfaces with the other processes in that lifecycle stage:
The next topics for discussion are critical success factors (CSFs) and key performance indicators (KPIs). Before we look at the CSFs and KPIs relevant to event management, we should take a minute to understand what these terms mean.
Here are some examples of CSFs and KPIs for event management:
Possible associated KPIs for this CSF include the following (notice that the first KPI is trying to gauge the success in detecting faults while the second is trying to measure the scope of the event management implementation):
Associated KPIs might be as follows:
The following KPIs would enable the CSF to be assessed:
The following challenges could be encountered in event management:
The following risks are associated with event management; in many cases the risks are the result of failing to meet the challenges listed above.
If any of these risks are not addressed, they could adversely impact the success of event management.
This chapter explored the next process in the service operation stage, event management. It covered the use of event monitoring to manage large numbers of items and how automated responses to particular events may improve the delivery of services. It also explained the role of events in automating processes.
We discussed the key ITIL concepts of events and alerts and how event management can improve availability by preempting failures or reducing the time taken to identify them. Finally, we considered the technical and staff challenges of implementing this process.
Understand the purpose, objectives, and scope of event management. Describe events (a change of state that has significance for the management of a CI) and alerts (a failure or breach of a threshold) and the difference between them. Be able to give examples of each.
Understand the role of event management in automation. Describe passive and active monitoring and the difference between them. Be able to give examples of each. Understand the importance of filtering events and explain how effective event management can reduce downtime. Be able to explain automatic responses to certain types of events.
Know how event management benefits the customer and the IT department. Understand the efficiency benefits to be gained by being able to have a small number of staff monitor huge numbers of CIs and services. Understand how improved availability through reduced downtime benefits the business.
Understand how event management can be used to monitor business events and environmental conditions. Be able to explain how the process of event management can be applied beyond the technical IT environment.
Review Questions
You can find the answers to the review questions in the appendix.
For which of these situations would implementing automation by using event management be appropriate?
Event management can be used to monitor which of the following?
Which of the following are types of event monitoring?
Which of the following is the best description of an alert?
Which of the following describes an active monitoring tool?
What is the correct way to handle an event?
Which of the following is NOT a type of event defined in ITIL?
Which of the following does NOT describe a correlation engine?
Which of the following describes the correct sequence of initial activities in the event management process?
Which of the following are valid inputs to the event management process?