Modern infrastructure management relies heavily on event monitoring tools. These tools can monitor large numbers of configuration items simultaneously, identifying issues as soon as they arise and notifying technical management staff. The event management process is responsible for managing events throughout their lifecycle, and it is one of the main activities of IT operations.
An event can be defined as any change of state that has significance for the management of a configuration item (CI) or IT service. Note that the definition does not require the change of state to be a failure. Many events are purely informational. Examples of informational events include notification of a user logging onto an application (significant because the use of the application may be metered) or a transaction completing successfully (significant because the notification may trigger the start of the next transaction). An event that notifies staff of a failure or of a breached threshold is called an alert. Examples of alerts include notification that a server has failed or a warning that the memory or disk usage on a device has exceeded 75 percent. To put these concepts in a non-IT setting, a car console may issue an event to say that the system has successfully connected to a Bluetooth device, or it might raise an alert (together with a beep or flashing light) to warn that a threshold has been breached and the car is now low on gas.
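The distinction between an informational event and an alert can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Event` class, its field names, and the `is_alert` function are hypothetical, and the 75 percent default is simply the example figure used in the text, not a recommended setting.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    """A change of state that matters for managing a CI (hypothetical model)."""
    ci: str                                # configuration item that changed state
    message: str                           # human-readable description
    failure: bool = False                  # True if the event reports a failure
    usage_percent: Optional[float] = None  # set only for usage readings


def is_alert(event: Event, threshold_percent: float = 75.0) -> bool:
    """An alert is an event reporting a failure or a breached threshold."""
    if event.failure:
        return True
    # "Exceeded" is read strictly: exactly 75 percent is not yet a breach.
    return event.usage_percent is not None and event.usage_percent > threshold_percent


login = Event("payroll-app", "user logged on")
crash = Event("server-01", "server failed", failure=True)
disk = Event("server-02", "disk usage reading", usage_percent=82.0)

for event in (login, crash, disk):
    label = "alert" if is_alert(event) else "informational"
    print(f"{event.ci}: {event.message} ({label})")
```

Every alert here is still an event; the function only decides which events need to be brought to the attention of staff.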
There are two types of event monitoring tools:
The purpose of event management is to detect events, understand what they mean, and take any necessary action. Many devices are designed to communicate their status, and event monitoring gathers these communications and acts upon any that need action. Some communications report operational information, such as “backup of file complete,” “print complete,” and so on. These events show that the service is operating correctly. They can be used to automate routine activities such as submitting the next file to be backed up or the next document to be printed. They may also be used to monitor the load across several devices, issuing automated instructions to balance the load, depending on the events received. If the event is an alert, such as “backup failed,” “printer jam,” or “disk full,” the necessary corrective steps will be taken. An incident should be logged in the case of a failure.
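As a rough sketch of how operational events can drive routine automation, the following Python function reacts to the two backup messages quoted above. The function name, the queue of pending files, and the incident list are all hypothetical; a real tool would call the backup scheduler and the incident management system instead of using in-memory collections. Whether the next file should still be submitted after a failure is a design decision; this sketch assumes it is not.

```python
from collections import deque
from typing import Deque, List, Optional


def handle_backup_event(message: str,
                        pending: Deque[str],
                        incidents: List[str]) -> Optional[str]:
    """React to one backup event and return the next file submitted, if any."""
    if message == "backup of file complete":
        # Informational event: use it to trigger the next routine activity.
        return pending.popleft() if pending else None
    if message == "backup failed":
        # Alert reporting a failure: log an incident for corrective action.
        incidents.append(f"Incident logged: {message}")
    return None


pending_files = deque(["orders.db", "customers.db"])
incident_log: List[str] = []

print(handle_backup_event("backup of file complete", pending_files, incident_log))
print(handle_backup_event("backup failed", pending_files, incident_log))
print(incident_log)
```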
The objectives of event management include the following:
You do not need to know the process steps in detail for the exam, but an understanding of the key points will help you understand the objectives of the process. The first step is the notification that an event has occurred. This depends on the monitoring tools being configured correctly to filter out notifications that have no significance. Without this filtering, important events can be missed or lost among hundreds of spurious notifications. The event should then be logged; this may be an entry in the event monitoring log, or an automatic link to the incident management tool may raise an incident record. In the latter case, the interface should not be used until the appropriate filtering is in place to prevent spurious incidents from being raised. An analysis of the event should identify its significance: is it informational, a warning, or an exception? Depending on this analysis, any required actions are then taken.
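The steps described above (notification, filtering, logging, analysis of significance, and action) can be sketched as a small pipeline. Everything named here is an assumption made for illustration: the ignored source, the keyword rules, and the action strings are hypothetical, and real monitoring tools use far richer correlation rules than simple keyword matching.

```python
from enum import Enum
from typing import List, Optional, Tuple


class Significance(Enum):
    INFORMATIONAL = "informational"
    WARNING = "warning"
    EXCEPTION = "exception"


# Hypothetical rules; in practice these come from the tool's configuration.
IGNORED_SOURCES = {"test-lab"}
WARNING_KEYWORDS = ("threshold",)
EXCEPTION_KEYWORDS = ("failed",)


def classify(message: str) -> Significance:
    """Analyze an event message and decide its significance."""
    text = message.lower()
    if any(word in text for word in EXCEPTION_KEYWORDS):
        return Significance.EXCEPTION
    if any(word in text for word in WARNING_KEYWORDS):
        return Significance.WARNING
    return Significance.INFORMATIONAL


def process(notification: Tuple[str, str], event_log: List[tuple]) -> Optional[str]:
    """Filter, log, and classify one notification, then choose an action."""
    source, message = notification
    if source in IGNORED_SOURCES:
        return None  # filtered out before logging: no significance
    significance = classify(message)
    event_log.append((source, message, significance))
    if significance is Significance.EXCEPTION:
        return "raise incident"
    if significance is Significance.WARNING:
        return "notify operations staff"
    return "no action"


log: List[tuple] = []
for note in [("server-01", "Backup failed"),
             ("test-lab", "Backup failed"),
             ("server-02", "Disk usage threshold breached"),
             ("payroll-app", "User logged on")]:
    print(note, "->", process(note, log))
```

Note that filtering happens before logging and before any incident is raised, which mirrors the warning in the text about not linking to the incident management tool until filtering is in place.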
Event management can be applied to any aspect of service management that needs to be controlled and that could benefit from being automated. The service management tool set is an example: it can automatically log incidents in response to emails or events being received, escalate incidents when thresholds have been reached, and notify staff of certain conditions (for example, a priority one incident being logged).
Configuration items can be monitored by event management tools; this monitoring can be for two different reasons:
Other areas where event management can be used include the monitoring of environmental conditions. This might be for fire and smoke detection or for other environmental changes.
Tracking license use is another possible use for event management tools; checking that the number of people using the software does not exceed the licenses held ensures that there is no illegal use of an application. This may also save money: if monitoring shows that there is less demand for concurrent use than was thought, the number of licenses can be reduced. Monitoring for and responding to security events, such as detecting intruders, is another use; the tools can also be used to detect a denial-of-service attack or similar event.
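The license check described above comes down to comparing peak concurrent use with the number of licenses held. The following is a minimal sketch, assuming the monitoring tool can supply periodic samples of concurrent user counts; the function name and the report fields are hypothetical.

```python
from typing import Dict, Iterable, Union


def license_report(concurrent_user_samples: Iterable[int],
                   licenses_held: int) -> Dict[str, Union[int, bool]]:
    """Compare peak concurrent use against the licenses held."""
    peak = max(concurrent_user_samples, default=0)
    return {
        "peak_users": peak,
        # More users than licenses would mean unlicensed use.
        "over_limit": peak > licenses_held,
        # Licenses never in use even at peak are candidates for reduction.
        "spare_at_peak": max(licenses_held - peak, 0),
    }


print(license_report([12, 18, 15], licenses_held=25))
print(license_report([22, 30, 27], licenses_held=25))
```

In the first call the peak is 18 against 25 licenses, so 7 licenses were never needed during the sampled period; in the second the peak of 30 exceeds the 25 held.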
In addition to these uses, event management can be used for day-to-day management of the service. This might be monitoring performance of hardware or network equipment or tracking the use of a particular application.
It is important to understand the difference between the two related but distinct activities of monitoring and managing events.
As you have seen, event management can be enormously useful in managing large and complex infrastructures. It is often the case, however, that the full value of these tools is not realized. This is usually because too little time has been spent configuring them to notify staff only of those events that need their attention. Failing to specify the correct thresholds, for example, will mean that far too many breaches are reported. Staff then ignore the events, because they are seldom significant, and as a result the significant events are missed. It is all too common for technical management teams to have impressive plasma screens on the walls with flashing red warnings that everyone ignores. Sometimes the attitude is that the users will call the service desk if there really is an issue. This negates one of the major advantages of using such tools: the ability to detect and respond to incidents before the user is impacted. Failing to filter the events properly also means that the ability to raise incidents automatically cannot be used, because the service management tool would be flooded with spurious events.
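The effect of a badly chosen threshold is easy to demonstrate. The sketch below counts how many breaches would be reported for the same series of readings under two different thresholds; the readings and both threshold values are invented purely for illustration.

```python
from typing import Sequence


def count_breaches(readings: Sequence[float], threshold: float) -> int:
    """Count the readings that would be reported as threshold breaches."""
    return sum(1 for value in readings if value > threshold)


# Invented disk-usage readings (percent) for one device over a day.
readings = [62, 71, 78, 66, 91, 73, 69, 95]

print(count_breaches(readings, threshold=60))  # 8: every reading is reported
print(count_breaches(readings, threshold=90))  # 2: only the 91 and 95 readings
```

With the threshold set too low, every reading is reported and staff learn to ignore the warnings; with a more suitable value, only the two readings that are likely to matter are reported.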