Chapter 38
Common Service Operation Activities

THE FOLLOWING ITIL INTERMEDIATE EXAM OBJECTIVES ARE DISCUSSED IN THIS CHAPTER:

  • ✓  Service operation principles
  • ✓  Techniques
  • ✓  Relationships
  • ✓  Application to the delivery and support of services at agreed levels
  • ✓  To meet the learning outcomes and examination level of difficulty, you must ensure that you are able to understand, describe, identify, demonstrate, apply, distinguish, produce, decide, and analyze
    • How the common activities of service operation are coordinated for the ongoing management of the technology that is used to deliver and support the services
    • How monitoring, reporting, and control of the services contribute to the ongoing management of the services and the technology that is used to deliver and support the services
    • How the operational activities of processes covered in other lifecycle stages contribute to service operation
    • How IT operations staff should look for opportunities to improve the operational activities

 This chapter focuses on a number of operational activities that ensure that technology is aligned with the overall service and process objectives. These are sets of specialized technical activities all aimed at ensuring that the technology required to deliver and support services is operating effectively and efficiently.

They are usually technical in nature, although the exact technology will vary depending on the type of services being delivered. This chapter focuses on the activities required to manage operational day-to-day delivery of IT services.

Common Service Operation Activities

It is important to remember that there is no “right” way of grouping and organizing the departments that perform these activities and services. In looking at these topics, we will not refer to names of departments.

These are typical technical activities involved in service operation. They do not represent any level of maturity but are usually all present in some form at all levels. They are just organized and managed differently at each level.

Sometimes they are carried out by a specialist team, sometimes shared between groups. For simplicity, we have listed the activities under the functional groups most likely to be involved in their operation, but many organizations will do things differently.

Smaller organizations usually assign groups of these activities (if they are needed at all) to single departments or even individuals.

We will not look in detail at the activities because they vary and change according to the technology in use. Instead, we will explore the importance and nature of technology management for IT service management.

Maturing Technology Management

As service operation matures, its focus moves from purely technical management of the infrastructure (as described in level 1 of Figure 38.1) to achieving control (described in level 2), followed next by consolidation and integration (described in level 3) and then mapping service provision to the business requirement (described in level 4).

Diagram shows five levels of technology management written in a stepwise pattern which include technology-driven, technology control, technology integration, service provision, and strategy contribution.

Figure 38.1 Achieving maturity in technology management

Copyright © AXELOS Limited 2010. All rights reserved. Material is reproduced under license from AXELOS.

Finally, it matures to measuring services in terms of value to the business (described in level 5), and it becomes increasingly business-centric.

The diagram illustrates the steps involved in maturing from a technology-centric organization to an organization that harnesses technology as part of its business strategy. It further outlines the role of technology managers in organizations of differing maturity. The diagram is not comprehensive, but it does provide examples of the way in which technology is managed.

The following sections focus on technical management activities, but there is no single way of representing them. A less mature organization will tend to see each activity as an end in itself, not a means to an end. A more mature organization will tend to subordinate these activities to higher-level service management objectives. For example, the server management team will move from an insulated department, focused purely on managing servers, to a team that works closely with other technology managers to find ways of increasing their value to the business.

We begin by considering monitoring and control.

Monitoring and Control

The measurement and control of services is based on a continual cycle of monitoring, reporting, and subsequent action. We are going to explore this cycle because it is fundamental to the delivery, support, and improvement of services.

It is also important to note that, although this cycle takes place during service operation, it provides a basis for setting strategy, designing and testing services, and achieving meaningful improvement. It is also the basis for service level management measurement. Therefore, although monitoring is performed by service operation functions, it should not be seen simply as an operational matter. All stages of the service lifecycle will be involved in ensuring that measures and controls are clearly defined, executed, and acted upon.

Monitoring

Monitoring refers to the activity of observing a situation to detect changes that happen over time. In the context of service operation, this implies that tools will be used to monitor the status of key configuration items (CIs) and key operational activities, checking whether specified conditions are met. Where they are not, an alert is raised to the appropriate group; for example, the network team would be alerted if a key network device failed. Monitoring also assists with managing the performance or utilization of a component or system, ensuring that, for example, disk space or memory utilization stays within a specified range.

Monitoring will also be concerned with the detection of abnormal types or unusual levels of activity in the infrastructure, such as potential security threats due to numerous unrecognized accounts trying to log in, and unauthorized changes, such as introduction of a software version or type not mandated within the organization. Monitoring can facilitate compliance with the organization’s policies (such as flagging inappropriate use of email) and allows the organization to track outputs to the business and ensure that they meet quality and performance requirements. Monitoring data is also used to track any information used to measure key performance indicators (KPIs).
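The threshold checks described above can be sketched in a few lines of Python. This is an illustrative sketch only, not a real monitoring tool; the CI name, metric names, and threshold values are invented, and a real tool would feed alerts into event management.

```python
# Sketch of a monitoring check: compare a CI's measured utilization
# against a specified range and raise an alert for any breach.
# The thresholds and CI names are invented for illustration.

THRESHOLDS = {"disk_pct": 85.0, "memory_pct": 90.0}

def check_ci(ci_name, readings, thresholds=THRESHOLDS):
    """Return a list of alerts for readings outside the specified range."""
    alerts = []
    for metric, value in readings.items():
        limit = thresholds.get(metric)
        if limit is not None and value > limit:
            alerts.append({"ci": ci_name, "metric": metric,
                           "value": value, "limit": limit})
    return alerts

# Disk utilization breaches its threshold; memory does not.
alerts = check_ci("fileserver01", {"disk_pct": 92.5, "memory_pct": 40.0})
```

In practice each alert would be routed to the appropriate group, as the text describes, rather than simply returned.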

Reporting

Reporting refers to the analysis, production, and distribution of the output of the monitoring activity. In the context of service operation, this implies that tools will be used to collate the output of monitoring information that can be disseminated to various groups, functions, or processes.

The reports will be used to support the interpretation of the meaning of the monitoring information, and to determine where that information would best be used. It is an important aspect of monitoring and control to ensure that decision-makers have access to the information that will enable them to make decisions. This requires that the organization is able to route the reported information to the appropriate person, group, or tool.

Control

Control refers to the process of managing the utilization or behavior of a device, system, or service. It is important to note, however, that simply manipulating a device is not the same as controlling it. Control requires three conditions: First, the action must ensure that the outcome conforms to a defined standard or norm; second, the conditions prompting the action must be defined, understood, and confirmed; and third, the action must be defined, approved, and appropriate for these conditions.

When considered in the context of service operation, control usually implies that tools will be used to define what conditions represent normal or abnormal operations. Development of thresholds in toolsets will allow for the regulation of the performance of devices, systems, or services. Controls also support the measurement of availability by initiating corrective action, which could be automated (reboot a device remotely or run a script when triggered by a “self-healing” system) or manual (notification to operations staff of a status that needs attention).
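The three conditions for control can be sketched as a simple decision: a defined condition (a measurement outside its norm) triggers a defined, approved action, automated where a script exists and manual otherwise. The restart action, notification wording, and figures below are invented for illustration.

```python
# Sketch of control: a defined condition (threshold breach) triggers a
# defined, approved action -- automated where possible, manual otherwise.
# The restart stand-in and the example values are invented.

def restart_service(name):
    return f"restarted {name}"      # stand-in for a remote reboot or script

def notify_operations(message):
    return f"notified operations: {message}"

def control_action(ci, metric, value, norm, automated=True):
    """Compare a measurement against its norm and take corrective action."""
    if value <= norm:
        return "within norm"        # outcome conforms to the defined standard
    if automated:
        return restart_service(ci)  # e.g., a "self-healing" scripted recovery
    return notify_operations(f"{ci} {metric}={value} exceeds norm {norm}")

result = control_action("appserver02", "queue_depth", 480, norm=100)
```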

Monitor Control Loops

The most common model for defining control is the monitor control loop. Although it is a simple model, it has many complex applications within IT service management (ITSM). The diagram in Figure 38.2 outlines the basic principles of control.

Diagram shows a control loop which includes an activity block along with its inputs, outputs, and activities such as monitoring, comparing, norm, and controlling.

Figure 38.2 The monitor control loop

Copyright © AXELOS Limited 2010. All rights reserved. Material is reproduced under license from AXELOS.

A single activity and its output are measured and compared using a predefined norm, or standard, to determine whether it is within an acceptable range of performance or quality. If not, action is taken to rectify the situation or to restore normal performance.

Typically, there are two types of monitor control loops: open loop and closed loop.

Open loop systems are designed to perform a specific activity regardless of environmental conditions. For example, a backup can be initiated at a given time and frequency and will run regardless of other conditions.

Closed loop systems monitor an environment and respond to changes in it. An example of this system would be the use of a closed loop in network load balancing. Monitoring will evaluate network traffic across a specific range. If network traffic exceeds this, the control system will begin to route traffic across a backup circuit. The monitor will continue to provide feedback to the control system, which will continue to regulate the flow of network traffic between the two circuits.

This example may help to clarify the difference: an open loop system would address a capacity requirement by simply overprovisioning, regardless of actual demand, whereas a closed loop system would use a load balancer that detects congestion or failure and redirects traffic in response.
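The closed loop network load-balancing example can be sketched as follows. The capacity threshold and traffic figures are invented, and a real control system would act on live feedback from the monitor rather than a prepared list of samples.

```python
# Closed loop sketch: monitor traffic on a primary circuit each cycle and
# regulate flow onto a backup circuit when the threshold is exceeded.
# Threshold and traffic figures (in Mbit/s) are invented.

CAPACITY_THRESHOLD = 100  # load on the primary circuit before overflow

def balance(traffic_samples, threshold=CAPACITY_THRESHOLD):
    """Return (primary, backup) load per monitoring cycle."""
    routing = []
    for load in traffic_samples:    # feedback from the monitor each cycle
        if load > threshold:
            routing.append((threshold, load - threshold))  # overflow to backup
        else:
            routing.append((load, 0))                      # backup idle
    return routing

cycles = balance([60, 110, 140, 80])
```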

Complex Monitor Control Loop

Within the context of ITSM, the monitor control loop is often far more complex. Figure 38.3 illustrates a process consisting of three major activities. Each one has an input and an output, and the output becomes an input for the next activity.

In Figure 38.3, each activity is controlled by its own monitor control loop, using a set of norms for that specific activity. The process as a whole also has its own monitor control loop, which spans all the activities and ensures that all norms are appropriate and are being followed.


Figure 38.3 Complex monitor control loop

Copyright © AXELOS Limited 2010. All rights reserved. Material is reproduced under license from AXELOS.

There is a double feedback loop. One loop focuses purely on executing a defined standard, and the second evaluates the performance of the process and also the standards whereby the process is executed.

This approach can be used to monitor the performance of activities in a process or procedure. Each activity and its related output can potentially be measured to ensure that problems with the process are identified before the process as a whole is completed. For example, in incident management, the service desk monitors whether a technical team has accepted an incident in a specified time. If not, the incident is escalated.
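The service desk example above can be sketched as a small monitor control loop: compare each incident's waiting time against the norm and escalate any breach. The 30-minute norm and the incident records are invented for illustration.

```python
# Sketch of the service desk's monitor control loop on incident handover:
# if a technical team has not accepted an incident within the norm,
# it is escalated. The norm and records below are invented.

ACCEPTANCE_NORM_MIN = 30

def check_acceptance(incidents, norm=ACCEPTANCE_NORM_MIN):
    """Return IDs of incidents waiting longer than the norm without acceptance."""
    return [i["id"] for i in incidents
            if not i["accepted"] and i["waiting_min"] > norm]

queue = [
    {"id": "INC001", "accepted": True,  "waiting_min": 45},
    {"id": "INC002", "accepted": False, "waiting_min": 50},  # breach
    {"id": "INC003", "accepted": False, "waiting_min": 10},
]
to_escalate = check_acceptance(queue)
```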

This can also measure the effectiveness of a process or procedure as a whole. In this case, the activity box represents the entire process as a single entity. For example, change management will measure the success of the process by checking whether a change was implemented on time, to specification, and within budget.

Finally, the monitor control loop can also monitor the performance of a device. For example, the activity box could represent the response time of a server under a given workload.

To recap, we can use control loops to monitor a whole variety of areas—from processes to devices. We need to establish what the norm for output should be, and this is the major consideration for service management. Service management monitoring will be used to support the achievement of specific targets for process effectiveness and process outputs.

To define how to use the concept of monitor control loops in service management, the following questions need to be answered.

  • How do we define what needs to be monitored?
  • What are the appropriate thresholds for each of these?
  • How will monitoring be performed (manual or automated)?
  • What represents normal operation?
  • What are the dependencies for normal-state service operation?
  • What are the dependencies for monitoring and controlling?
  • How frequently should the measurement take place?
  • Do we need to perform active measurement to check whether the item is within the norm or do we wait until an exception is reported (passive measurement)?
  • Is IT operations management the only function that performs monitoring?
  • If not, how are the other instances of monitoring related to operations management?
  • If there are multiple loops, which processes are responsible for each loop?

Although Figure 38.4 represents monitoring and control for the whole of ITSM, the control loop is used in service operation. Some may feel that it would be more suitably covered in ITIL service strategy. However, monitoring and control can effectively be deployed only when the service is operational. This means that the quality of the entire set of ITSM processes depends on how they are monitored and controlled in service operation.

Diagram shows the relationship between different monitor control loops, continual service improvement, service strategy, design, portfolios, standards and policies, technical architecture and performance standards.

Figure 38.4 The ITSM monitor and control loop

Copyright © AXELOS Limited 2010. All rights reserved. Material is reproduced under license from AXELOS.

Internal and External Monitoring and Control

The definition of what needs to be monitored is based on understanding the desired outcome of a process, device, or system. IT should focus on the service and its impact on the business rather than just the individual components of technology. The first question that needs to be asked is, What are we trying to achieve?

If we consider internal monitoring and control, it is clear that most teams or departments are concerned with being able to execute the tasks that have been assigned to them effectively and efficiently. This type of monitoring and control focuses on activities that are self-contained within that team or department. For example, the service desk manager will monitor the volume of calls to determine how many staff need to be available to answer the telephone.

External monitoring and control is concerned with monitoring that addresses wider needs and requirements. Although each team or department is responsible for managing its own area, they do not act independently. Each team or department will also be controlling items and activities on behalf of other groups, processes, or functions. For example, the server management team will monitor the CPU performance on key servers and perform workload balancing so that a critical application is able to stay within performance thresholds set by application management.

If service operation focuses only on internal monitoring, it will have very well-managed infrastructure but no way of understanding or influencing the quality of services. If it focuses only on external monitoring, it will understand how poor the service quality is but will have no idea what is causing it or how to change it. In reality, most organizations have a combination of internal and external monitoring, but in many cases, these are not linked.

Defining Objectives for Monitoring and Control

It is common for organizations to start by asking the question, What are we managing? This will frequently lead to a strong internal monitoring system, with very little linkage to the real outcome or service that is required by the business. This may be because of the overall service management maturity of the organization.

Perhaps a better and more business-focused question would be, What is the end result of the activities and equipment that my team manages? Therefore, the best place to start when defining what to monitor is to determine the required outcome. But this will require a level of maturity, and it should start within the process of service level management.

The definition of monitoring and control objectives should ideally start with the service level requirements documents. These will specify how the customers and users will measure the performance of the service and are used as input into the service design processes. During service design, various processes will determine how the service will be delivered and managed. For example, capacity management will determine the most appropriate and cost-effective way to deliver the levels of performance required. The service design process will help to identify sets of inputs for defining operational monitoring, control norms, and control mechanisms.

All of this means that a very important part of defining what service operation monitors and how it exercises control is to identify the stakeholders of each service and their requirements.

Types of Monitoring Strategies

There are many different types of monitoring strategies and different situations in which each will be used. We have explored the concepts of internal and external monitoring. Now we’ll briefly consider other monitoring types that are complementary to these high-level approaches.

Active monitoring is the ongoing “interrogation” of a device or system to determine its status. Because this type of monitoring is often resource intensive, it may be reserved to proactively monitor the availability of critical devices or systems. It can also be used as a diagnostic step for resolving an incident or diagnosing a problem.

Passive monitoring, which is more common, involves generating and transmitting events to a “listening device” or monitoring agent. Passive monitoring depends on the successful definition of events and instrumentation of the system being monitored. If the tool is not configured correctly, events will not be captured and actions may not be taken.

Reactive monitoring is designed to request or trigger action following a certain type of event or failure. For example, degradation of server performance may trigger a reboot, or a system failure may generate an incident. Reactive monitoring is most commonly used for exceptions. But it can also be used as part of normal operations procedures—for example, a batch job completes successfully, which prompts the scheduling system to submit the next batch job.

Proactive monitoring is used to detect patterns of events that indicate that a system or service may be about to fail. Proactive monitoring is generally used in more mature environments where these patterns have been detected previously, often several times. Reactive and proactive monitoring could be active or passive.

In Table 38.1, you can see some example interactions between active and passive monitoring with reactive and proactive monitoring.

Table 38.1 Active and passive, reactive and proactive monitoring

  • Reactive and active: Used to diagnose which device is causing a failure and under what conditions (e.g., pinging a device, or running and tracking a sample transaction through a series of devices). Requires knowledge of the infrastructure topography, the mapping of services to CIs, and the capability to simulate service workloads and demand volumes.
  • Reactive and passive: Detects and correlates event records to determine the meaning of the events and the appropriate action (e.g., a user logs in three times with the incorrect password, which represents a security exception and is escalated through information security management procedures). Requires detailed knowledge of the normal operation of the infrastructure and services.
  • Proactive and active: Used to determine the real-time status of a device, system, or service, usually for critical components or following the recovery of a failed device to ensure that it is fully recovered (i.e., it is not going to cause further incidents).
  • Proactive and passive: Event records are correlated over time to build trends for proactive problem management. Patterns of events are defined and programmed into correlation tools for future recognition.

Copyright © AXELOS Limited 2010. All rights reserved. Material is reproduced under license from AXELOS.
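The reactive-and-passive combination in Table 38.1, correlating failed-login events into a security exception, can be sketched as follows. The event format and account names are invented for illustration.

```python
# Sketch of passive-reactive event correlation: count failed logins per
# account in a stream of event records and flag a security exception
# after three failures. The event format is an invented example.

from collections import Counter

def security_exceptions(events, limit=3):
    """Return accounts with `limit` or more failed-login events."""
    failures = Counter(e["account"] for e in events
                       if e["type"] == "login_failed")
    return sorted(a for a, n in failures.items() if n >= limit)

stream = [{"type": "login_failed", "account": "jsmith"}] * 3 + \
         [{"type": "login_failed", "account": "akhan"},
          {"type": "login_ok", "account": "akhan"}]
flagged = security_exceptions(stream)
```

In a real toolset, a flagged account would be escalated through information security management procedures, as the table describes.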

Continuous Measurement vs. Exception-Based Measurement

Continuous measurement is focused on monitoring a system in real time to ensure that it complies with a performance norm. An example of this might be the availability of an application server for a specific percentage of agreed service hours. The difference between continuous measurement and active monitoring is that active monitoring does not have to be continuous. Continuous measurement is resource intensive and as a consequence is usually reserved for critical components or services. This may be reduced to a continuous regular sampling of data and extrapolation of the performance achieved over time, depending on the needs of the organization.
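Where continuous measurement is reduced to regular sampling, availability over the period can be extrapolated from the samples. The sampling interval and results below are invented for illustration.

```python
# Sketch of extrapolating availability from regular samples rather than
# truly continuous measurement: the fraction of "up" samples estimates
# the percentage of agreed service hours achieved. Data is invented.

def estimated_availability(samples):
    """Estimate availability (%) from a list of up/down sample results."""
    if not samples:
        raise ValueError("no samples collected")
    up = sum(1 for s in samples if s == "up")
    return round(100.0 * up / len(samples), 2)

# 96 samples at 15-minute intervals over 24 hours, two found down
availability = estimated_availability(["up"] * 94 + ["down"] * 2)
```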

Exception-based measurement, as the name suggests, detects and reports against exceptions. For example, an event is generated if a transaction does not complete or if a predefined performance threshold is reached. This is more cost-effective and easier to measure, but it could result in longer service outages. Exception-based measurement is used for less critical systems or on systems where cost is a major issue. It is also used where IT tools are not able to determine the status or quality of a service and manual intervention is required; for example, a manual safety check of a system would be reported as the percentage of items that failed the check rather than those that passed. It is important that both the OLAs and the SLA for the service being measured reflect that exception-based measurement is used because service outages are more likely to occur and users are often required to report the exception.

Performance vs. Output

There is an important distinction between the reporting used to demonstrate the achievement of service quality objectives and the reporting used to track the performance of components or teams or departments used to deliver a service.

IT managers often confuse these by reporting to the business on the performance of their teams or departments (e.g., number of calls taken per service desk analyst) as if that were the same thing as quality of service (e.g., incidents solved within the agreed time).

Performance monitoring and metrics should be used internally by the service management team to determine whether people, processes, and technology are functioning correctly and to standard.

Users and customers would rather see reporting related to the quality and performance of the service.

Although service operation is concerned with both types of reporting, the primary concern of this chapter is performance monitoring, whereas monitoring of service quality (or output-based monitoring) is discussed in detail in Part 5 of this book, which covers continual service improvement.

Monitoring in Test Environments

It’s important to ensure, as with any IT infrastructure, that the test environment is subject to monitoring and control. A test environment consists of infrastructure, applications, and processes that have to be monitored, managed, and controlled just like any other environment. Like-for-like conditions must be reflected in the test environment, and it is important to define how it will be used, including replicating the monitoring systems and tools used in the operational environment.

It’s also important to monitor items being tested. The results of testing need to be accurately tracked and checked. Any monitoring tools that have been built into new or changed services must also be tested.

Reporting and Action

It has been said that a report alone creates awareness but that a report with an action plan achieves results. Practical experience has shown that there is often more reporting in dysfunctional organizations than in effective organizations. This is because the reports are not being used to initiate predefined action plans but rather to shift blame for an incident or identify responsibility for an action.

Monitoring without control is considered to be irrelevant and ineffective. To be useful, monitoring should always be aimed at ensuring that service and operational objectives are being met. This means that unless there is a clear purpose for monitoring a system or service, it should not be monitored.

This also means that when monitoring is defined, so too should any required actions be defined. For example, being able to detect that a major application has failed is not sufficient. The relevant application management team should also have defined the exact steps that it will take when the application fails so it may be recovered.

In addition, it should be recognized that actions may need to be taken by different people. For example, a single event, such as an application failure, may trigger action by the application management team to restore service, by the users to initiate manual processing, and by management to determine how this event can be prevented in future.

Service Operation Audits

Regular audits must be performed on the service operation processes and activities to ensure that they are being performed as intended and that there is no circumvention. An audit will also establish if the processes are still fit for purpose or identify any required changes or improvements.

In an ideal situation, an organization’s internal audit team should carry out these audits so as to retain an element of independence. If this capability is not available, some organizations may choose to engage third-party consultants or audit and assessment companies so that an entirely independent expert view is obtained. Alternatively, service operation managers may choose to perform such audits themselves.

Service operation audits are part of the ongoing measurement that takes place as part of CSI.

Measurement, Metrics, and Key Performance Indicators

We will now review measurement, metrics, and key performance indicators.

Measurement

Measurement refers to any technique used to evaluate the extent, dimension, or capacity of an item in relation to a standard or unit.

  • Extent refers to the degree of compliance or completion (e.g., are all changes formally authorized by the appropriate authority?).
  • Dimension refers to the size of an item (e.g., the number of incidents resolved by the service desk).
  • Capacity refers to the total capability of an item (e.g., the maximum number of standard transactions that can be processed by a server per minute).

Measurement only becomes meaningful when it is possible to measure the actual output or dimensions of a system, function, or process against a standard or desired level. This should be defined in service design and refined over time through CSI, but the measurement itself takes place during service operation.

Metrics

Metrics refers to the quantitative, periodic assessment of a process, system, or function together with the procedures and tools that will be used to make these assessments and the procedures for interpreting them.

This is an important definition because it not only specifies what needs to be measured, but also how to measure it, what the acceptable range of performance will be, and what action will need to be taken as a result of normal performance or an exception.

Key Performance Indicators

A key performance indicator (KPI) is a specific, agreed level of performance that will be used to measure the efficiency, effectiveness, and cost-effectiveness of a process, IT service, or activity. The KPIs are used to measure the critical success factors (CSFs) for a process. We explored the CSFs for the processes in earlier chapters.
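Measuring a KPI against its agreed level can be sketched as a simple calculation, for example the percentage of incidents resolved within the agreed time. The 90 percent target and the incident figures are invented, not ITIL-defined values.

```python
# Sketch of a KPI measurement: percentage of incidents resolved within
# the agreed time, compared against an agreed target level.
# The target and figures are invented for illustration.

def kpi_resolved_on_time(resolved_in_target, total_resolved, target_pct=90.0):
    """Return (achieved %, target met?) for the agreed KPI level."""
    achieved = 100.0 * resolved_in_target / total_resolved
    return round(achieved, 1), achieved >= target_pct

# 184 of 200 incidents resolved within the agreed time this period
achieved, met = kpi_resolved_on_time(184, 200)
```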

Interfaces to Other Service Lifecycle Practices

There are a number of ways in which service operation activities interface into the other lifecycle stages. We have considered operational monitoring and reporting, but monitoring also forms the starting point for CSI, and there are some key differences in focus when monitoring is being used for CSI rather than service operation.

Quality is the key objective of monitoring for CSI. Monitoring will therefore focus on the effectiveness of a service, process, tool, organization, or CI. The emphasis is not on assuring real-time service performance; rather it is on identifying where improvements can be made to the existing level of service or IT performance.

Monitoring should focus on detecting exceptions and resolutions. For example, the CSI activities are not as interested in whether an incident was resolved but whether it was resolved within the agreed time and whether future incidents can be prevented.

Monitoring data can be voluminous if you’re looking at the entire IT service infrastructure. Because CSI activities are generally focused on targeted service improvements, specific subsets of monitoring data are likely to be needed for this analysis. These subsets can be determined by input from the business or obtained through improvements to technology.

This has two main implications: first, monitoring for CSI will change over time, and second, there needs to be a common process between service operation and CSI to agree on the requirements. For example, there may be interest in monitoring the email service one quarter and then moving on to look at human resources systems in the next quarter.

IT Operations

In the following sections, we are going to review the activities related to IT operations for management of the operational environment.

We will look at the following areas:

  • Server and mainframe management and support
  • Network management
  • Storage and archive, frequently supported by database administration
  • Directory services management
  • Desktop and mobile device support
  • Middleware management
  • Internet or web management
  • Facilities and data center management

Operations Bridge

The console management/operations bridge function handles a structured set of activities that centrally coordinates management of events, incidents, routine operational activities, and reporting on the status or performance of technology components.

Historically, it would have involved monitoring mainframe master consoles, but as technology moves on, we are much more likely to see monitoring of specific configuration items, server farms, and, indeed, virtual operations. As technology changes, so does the nature of this function. Whatever the items being managed and monitored, the operations bridge acts as a central point of observation, receiving feeds from disparate locations and systems.

If this function exists in an organization, it may be used as resilience for service desk functions or other support functions, because it is often a 24-hour operation (if required) and will provide after-hours support. Obviously, this function is separated into a specific team only where the organization warrants it due to size, complexity, or security requirements. Otherwise, this function is carried out within the scope of the technical support and service desk teams.

Job Scheduling

In large organizations with high volumes of data processing to complete each day, one solution is to run batch processing overnight. Should this processing overrun, however, the following day’s online services could be impacted.

Job scheduling aims to maximize overnight capacity and performance and has to factor in many variables, such as time sensitivity, critical and noncritical dependencies, workload balancing, and failure and resubmission. As a result, most operations rely on job scheduling tools that allow IT operations to schedule jobs for the optimal use of technology to achieve service level objectives.

The latest generation of scheduling tools allows for a single toolset to schedule and automate technical and service management process activities (such as change scheduling). While this is a good opportunity for improving efficiency, it also represents a greater single point of failure. Organizations using this type of tool will generally use other nonintegrated solutions as agents and also as a backup in case the main toolset fails.
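To make the dependency handling concrete, the sketch below (illustrative only; the job names are invented and the use of Python's standard library is an assumption, not a reference to any particular scheduling product) derives a valid run order for a set of overnight jobs from their declared dependencies:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical overnight batch jobs; each job lists the jobs that
# must finish before it may start.
jobs = {
    "extract_orders": set(),
    "extract_payments": set(),
    "reconcile": {"extract_orders", "extract_payments"},
    "nightly_report": {"reconcile"},
}

def run_order(dependencies):
    """Return one valid execution order that honours every dependency."""
    return list(TopologicalSorter(dependencies).static_order())

order = run_order(jobs)
# 'reconcile' must follow both extracts, and the report comes last.
assert order.index("reconcile") > order.index("extract_orders")
assert order[-1] == "nightly_report"
```

Real scheduling tools layer calendars, workload balancing, and automatic resubmission on failure on top of this basic ordering.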

Backup and Restore

Regularly backing up and restoring data is an essential service operation activity. It is important to establish what will be backed up, how often, how long it will be retained, and the time frame within which it will be required to be restored. During the earlier stages of the service lifecycle, the business will be asked for the requirements for data management, including backup and restore capability. These are all business decisions, and the backup regime will be designed as part of the overall solution during service design and tested during service transition. Requirements for backup and restoration are often regulatory and subject to audit, so the organization must agree on and document a traceable and defined approach.

Part of this definition and agreement is the recovery point objective. This describes the point to which data will be restored after recovery of an IT service. It may involve loss of data. For example, a recovery point objective of one day means that up to 24 hours of data may be lost. It is also necessary to agree on the recovery time objective. This describes the maximum time allowed for recovery of an IT service following an interruption. These agreements should be made through the service level management process and included in operational level agreements, contracts, and service level agreements.
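The recovery point objective lends itself to a simple automated check. The following sketch (hypothetical timestamps, Python standard library only) flags a backup regime whose newest backup has fallen outside a one-day RPO:

```python
from datetime import datetime, timedelta

def rpo_breached(last_backup: datetime, now: datetime,
                 rpo: timedelta) -> bool:
    """True if the age of the newest backup exceeds the agreed
    recovery point objective, i.e., more data could be lost than agreed."""
    return (now - last_backup) > rpo

# With a one-day RPO, a backup taken 30 hours ago is out of tolerance.
last = datetime(2024, 1, 1, 2, 0)
now = datetime(2024, 1, 2, 8, 0)
assert rpo_breached(last, now, timedelta(days=1)) is True
assert rpo_breached(last, now, timedelta(days=2)) is False
```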

Whatever arrangements are made, backups must be tested regularly to ensure not only that they have worked, but also that the data can be restored as agreed. This should be subject to the correct authorization, defined by the business requirements. This may form part of the continuity plan.

Print and Output Management

Although print management may seem an old-fashioned concept, output management is not, because electronic output is an important part of many organizational services. Many services consist of generating and delivering information in printed or electronic form. Ensuring that the right information gets to the right people, with full integrity, requires formal control and management. Print (physical) and output (electronic) facilities and services need to be formally managed because they often represent the tangible output of a service. The ability to confirm that this output has reached the appropriate destination is therefore important; for example, data transfers between organizations are now the most common way of moving financial information and payments, and their delivery must be verifiable.

Physical and electronic output often contains sensitive or confidential information. It is vital that the appropriate levels of security are applied to both the generation and the delivery of this output.

Those of us who have worked in IT for a number of years will remember times when organizations required centralized bulk printing, which IT operations had to handle. In addition to the physical loading and reloading of paper and the operation and care of the printers, other activities were needed. These included prenotification of large print runs, alerts to prevent excessive printing by rogue print jobs, and the physical control of high-value stationery such as company checks or certificates. Some organizations will still have these requirements, but more commonly it is electronic output that concerns our IT teams.

IT may also be responsible for the management of the physical and electronic storage required to generate the output. In many cases, IT will be expected to provide archives for the printed and electronic materials.

Where appropriate, IT will have control of all printed material to ensure adherence to data protection legislation and regulation, such as the Health Insurance Portability and Accountability Act (HIPAA) in the United States and Financial Conduct Authority (FCA) rules in the United Kingdom.

Where print and output services are delivered directly to the users, it is important that the responsibility for maintaining the printers and storage devices is clearly defined in the SLA.

Server and Mainframe Management and Support

Successful management of servers and mainframes is essential for successful service operation. Servers and mainframes are used in many organizations to provide flexible and accessible services such as hosting applications and databases, operating high-volume transaction systems, running client/server services, providing storage, and print and file management.

The ways in which server and mainframe management teams are organized are quite diverse. In some organizations, mainframe management is a single, highly specialized team, whereas in others the activities are performed by several teams or departments, with engineering and third-level support provided by one set of teams and daily operations combined with the rest of IT operations. The support activities for servers are generally the same as those for mainframes. Although the technologies and skill sets needed to actually perform support activities are different, the types of activities are essentially similar.

The following procedures and activities must be undertaken by mainframe and server teams or departments. Remember that separate teams may be needed where different technology platforms are used (e.g., mainframe OS, UNIX, Wintel).

  • Operating system support and maintenance of the appropriate operating system(s) and related utility software (e.g., failover software), including patch management and involvement in defining backup and restore policies
  • License management for all server CIs, especially operating systems, utilities, and any application software not managed by the application management teams
  • Third-level support for all incidents related to servers and/or server operating systems, including diagnosis and restoration activities. This will also include liaison with third-party hardware support contractors and/or manufacturers as needed to escalate hardware-related incidents.
  • Advice to the business on the selection, sizing, procurement, and usage of servers and related utility software to meet business needs
  • System security, including the control and maintenance of the access controls and permissions within the relevant server environment(s) as well as appropriate system and physical security measures. These include identification and application of security patches, access management, and intrusion detection.
  • Definition and management of virtual servers. Server management will be required to set the standards for load balancing and virtual management and then ensure that workloads are appropriately balanced and distributed. They are also responsible for being able to track which workload is being processed by which server so that they are able to deal with incidents effectively.
  • Provide information and assistance to capacity management to help achieve optimum throughput, utilization, and performance from the available servers
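The virtual server workload balancing and tracking described above can be illustrated with a minimal sketch (the host names and workload costs are invented for the example; a real placement engine would consider far more than a single load figure):

```python
def place_workload(loads: dict, workload: str, cost: int,
                   assignments: dict) -> str:
    """Assign a workload to the least-loaded server and record the
    mapping so incidents can be traced back to the hosting server."""
    server = min(loads, key=loads.get)   # pick the least-loaded host
    loads[server] += cost                # account for the new workload
    assignments[workload] = server       # track workload -> server
    return server

loads = {"vmhost-a": 10, "vmhost-b": 4, "vmhost-c": 7}  # hypothetical hosts
assignments = {}
place_workload(loads, "payroll-batch", 5, assignments)
assert assignments["payroll-batch"] == "vmhost-b"
assert loads["vmhost-b"] == 9
```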

Other activities are also routine for the server and mainframe management team:

  • Defining standard builds for servers as part of the provisioning process, and then building and installing new servers as part of ongoing maintenance or for the provision of new services
  • Setting up and managing clusters, which are aimed at building redundancy, improving service performance, and making the infrastructure easier to manage

There is obviously a need for ongoing maintenance. This typically consists of replacing servers, or “blades,” on a rolling schedule to ensure that equipment is replaced before it fails or becomes obsolete. This results in servers that are not only fully functional, but also capable of supporting evolving services.

Decommissioning and disposal of old server equipment should be carried out in accordance with the organization’s environmental policies for disposal.

There will be a requirement to provide interfacing to hardware (H/W) support, such as arranging maintenance, agreeing on time slots for work to be carried out, identifying H/W failure, meeting with H/W engineering, and providing assistance in writing batch and job scripts.

Network Management

Since almost all IT services require connectivity, network management’s role in managing and maintaining the communications infrastructure is essential. Network management is responsible for the organization’s own local area networks (LANs), metropolitan area networks (MANs), and wide area networks (WANs) and for liaising with third-party network suppliers. They are normally considered experts in the area of network administration.

Their role will include initial planning and installation of new networks or network components and maintenance and upgrades to the physical network infrastructure. This is done through service design and service transition.

Network management will usually provide third-level support for all network-related activities, including investigation of network issues and liaison with third parties as necessary. This also includes the installation and use of sniffer tools, which analyze network traffic, to assist in incident and problem resolution.

In most organizations, this team should also provide maintenance and support of network operating system and middleware software, including patch management and upgrades. They will also be responsible for monitoring network traffic to identify failures or to spot potential performance or bottleneck issues. Network management includes the reconfiguring or rerouting of traffic to achieve improved throughput or better balance in accordance with the definition of rules for dynamic balancing and routing.

Network security (in liaison with the organization’s information security management process), including firewall management, access rights, and password protection, is another key aspect of this area.

Other activities include assigning and managing IP addresses, domain name systems, and dynamic host configuration protocol (DHCP) systems; managing firewalls and secure gateways; and managing Internet service providers (ISPs). Network management will also take on some responsibilities on behalf of information security management by implementing, monitoring, and maintaining intrusion detection systems. They will also be responsible for ensuring that there is no denial of service to legitimate users of the network.
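Assigning and managing IP addresses, as mentioned above, can be sketched with Python's standard `ipaddress` module; the address plan below is purely illustrative:

```python
import ipaddress

# Hypothetical address plan: carve per-site /26 subnets out of a /24.
site_block = ipaddress.ip_network("10.20.30.0/24")
subnets = list(site_block.subnets(new_prefix=26))

assert len(subnets) == 4
assert str(subnets[0]) == "10.20.30.0/26"
# 62 usable host addresses per /26 (network and broadcast excluded).
assert subnets[0].num_addresses - 2 == 62
```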

As with all operational teams, network management will be required to update service asset and configuration management as necessary by documenting CIs, status, relationships, and so on.

Network management is also frequently responsible, often in conjunction with desktop support, for remote connectivity issues such as dial-in, dial-back, and virtual private network facilities provided to home workers, remote workers, or suppliers.

Some network management teams or departments will also have responsibility for voice/telephony, including the provision and support of exchanges, lines, circuits, automatic call distribution (ACD) systems, and statistical software packages, and for Voice over Internet Protocol (VoIP), quality of service, and remote monitoring systems.

At the same time, many organizations see VoIP and telephony as specialized areas and have teams dedicated to managing this technology. Their activities will be similar to those described. If network management is managing VoIP as a service, they will need to be aware of, and mitigate, any variations in bandwidth and utilization to optimize VoIP usage.

Remember, this is not a prescriptive description of a network management team, and each organization will have its own approach to structuring its teams and departments.

Storage and Archive

There is often a requirement to store data, sometimes for many years. This may be a requirement of the business or a legal or regulatory requirement. Managing the safe storage of data, and ensuring that it can be retrieved as required, is a service operation responsibility. One of the key things that needs to be considered is the change in technology over time because information may need to be stored through a number of technology iterations.

There are a variety of possible storage mechanisms that service operation needs to manage, sometimes by a specialized team. Such mechanisms include storage devices such as disks, controllers, tapes, and other media. Specific technologies include network attached storage (NAS), which is storage attached to a network and accessible by several clients. There are also storage area networks (SANs) designed to attach computer storage devices such as disk array controllers and tape libraries. In addition to managing the storage devices, a SAN will require the management of several network components, such as hubs, cables, and other hardware.

Other devices include direct attached storage (DAS), which is a storage device directly attached to a server, and content addressable storage (CAS), which is storage that retrieves information based on its content rather than location. The focus in this type of system is on understanding the nature of the data and information stored rather than on providing specific storage locations.

Regardless of what type of storage systems are being used, storage and archiving will require management of the infrastructure components as well as the policies related to where data is stored, how long it’s stored, what form it’s stored in, and who may access it. Specific responsibilities will include the definition of data storage policies and procedures and the file storage naming conventions, hierarchy, and placement decisions.

As part of service design, there should be involvement with defining an archiving policy and agreeing on the housekeeping practices of all data storage facilities and the data archiving rules and schedules. The storage teams or departments will also provide input into the definition of these rules and will provide reports on their effectiveness as input into future design.

It is important to ensure that the design, sizing, selection, procurement, configuration, and operation of all data storage infrastructures as well as planning for the maintenance and support for all utility and middleware data storage software are included early in the service lifecycle. This will include meeting with information lifecycle management team(s) or governance teams to ensure compliance with freedom of information, data protection, and IT governance regulations.

Retrieval of archived data as needed (e.g., for audit purposes, for forensic evidence, or to meet any other business requirements) is another required activity, as is the management of archiving technologies and, if needed, migration from one (outdated) technology to a newer archiving technology in order to be able to restore data over a long period of time (e.g., 10 years for legal requirements). The teams will also be expected to provide third-line support for storage- and archive-related incidents.

Database Administration

Database administration must work closely with key application management teams or departments—and in some organizations, the functions may be combined or linked under a single management structure. There are a number of different organizational options that can be adopted; for example, database administration can be performed by each application management team for all the applications under its control. Alternatively, there may be a dedicated department that manages all databases regardless of type or application. Another option is to have several departments, each managing one type of database regardless of what application they are part of.

There is no defined approach. It is up to the organization to arrange database administration according to its requirements.

Database administration works to ensure the optimal performance, security, and functionality of databases that they manage. Database administrators (DBAs) typically have responsibilities that include creation and maintenance of database standards and policies and the initial database design, creation, and testing.

DBAs are also responsible for the management of database availability and performance (for example, resilience, sizing, and capacity volumetric information). Resilience may require database replication, which would be the responsibility of the DBAs, as well as ongoing administration of database objects such as indexes, tables, views, constraints, sequences, snapshots, stored procedures, and page locks to achieve optimum utilization. DBAs are responsible for the definition of triggers that will generate events, which in turn will alert DBAs about potential performance or integrity issues with the database. The role is also concerned with performing database housekeeping—the routine tasks that ensure that the databases are functioning optimally and securely (for example, tuning and indexing).
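The database triggers mentioned above can be illustrated with a small SQLite example (the schema and the 500 ms threshold are hypothetical): a trigger records an event row whenever a transaction's response time breaches a limit, which could then alert DBAs to potential performance issues.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE txn_log (id INTEGER PRIMARY KEY, response_ms INTEGER);
CREATE TABLE db_events (txn_id INTEGER, message TEXT);
-- Fire after each insert whose response time exceeds the threshold.
CREATE TRIGGER slow_txn_alert AFTER INSERT ON txn_log
WHEN NEW.response_ms > 500
BEGIN
    INSERT INTO db_events VALUES (NEW.id, 'slow transaction');
END;
""")
db.execute("INSERT INTO txn_log (response_ms) VALUES (120)")  # within limit
db.execute("INSERT INTO txn_log (response_ms) VALUES (900)")  # breaches 500 ms
events = db.execute("SELECT txn_id, message FROM db_events").fetchall()
assert events == [(2, "slow transaction")]
```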

The team will also take responsibility for monitoring usage by keeping track of transaction volumes, response times, and concurrency levels. And they will generate reports; these could be reports based on the data in the database or reports related to the performance and integrity of the database.

The DBAs will assist with the identification, reporting, and management of database security issues, including audit trails and forensics. The DBAs’ assistance will be engaged in designing database backup, archiving, and storage strategy and designing database alerts and event management. DBAs are also the providers of third-level support for all database-related incidents.

This is a particularly specialized area of IT management, but smaller organizations may not be able to justify the cost of such skills in-house, and this capability may be provided through a third party.

Directory Services Management

A directory service is a specialized software application that manages information about the resources available on a network and the users who have access to them. It is the basis for providing access to those resources and for ensuring that unauthorized access is detected and prevented.

A directory service views each resource as an object of the directory server and assigns it a name. Each name is linked to the resource’s network address so that users don’t have to memorize confusing and complex addresses.

Directory services are a good source of data and verification for the CMS because they are usually maintained and kept up-to-date.

Directory services management refers to the process used to manage directory services. Its activities include working as part of service design and service transition and locating resources on a network. It is important to ensure that the status of these resources is tracked. Directory services management will also provide the ability to manage those resources remotely. This includes managing the rights of specific users or groups of users to access resources on a network. It is also concerned with the definition and maintenance of naming conventions to be used for resources on a network, including ensuring consistency of naming conventions and access control on different networks in the organization.
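As a toy illustration of name-to-address resolution with access control (all resource names, addresses, and groups are invented; real directory services such as LDAP-based products are far richer), a directory lookup might behave like this:

```python
# Each directory object maps a friendly resource name to a network
# address plus the groups permitted to use it.
directory = {
    "print-hq-01": {"address": "10.0.5.21", "allowed": {"staff", "admins"}},
    "fileshare-fin": {"address": "10.0.9.3", "allowed": {"finance"}},
}

def resolve(name: str, user_groups: set):
    """Return the resource's address, or None when access is not
    permitted — unauthorized lookups are the events to flag."""
    entry = directory.get(name)
    if entry is None or not (user_groups & entry["allowed"]):
        return None
    return entry["address"]

assert resolve("print-hq-01", {"staff"}) == "10.0.5.21"
assert resolve("fileshare-fin", {"staff"}) is None  # would raise a security event
```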

Directory services management may link different directory services throughout the organization to form a distributed directory service; that is, users will only see one logical set of network resources (this is called distribution of directory services).

This function is also responsible for monitoring events on the directory services, such as unsuccessful attempts to access a resource, and taking the appropriate action where required. It will also maintain and update the tools used to manage directory services.

It is another specialized area of expertise, but it may be combined with other technical management teams. Again, it is important to remember that not all organizations are large or complex enough to justify a team to manage these capabilities separately.

Desktop and Mobile Device Support

Because most users access IT services using desktops, laptops, and mobile computing devices, it is key that these are supported to ensure the agreed levels of availability and performance of services.

Desktop and mobile device support will have overall responsibility for all of the organization’s desktop, laptop, and mobile device hardware, software, and peripherals. They will also manage desktop and mobile computing policies and procedures, such as licensing policies; use of laptops, desktops, and mobile devices for personal purposes; and USB lockdown.

As well as designing and agreeing on standard desktop and device images, this function will be responsible for service maintenance, including deployment of releases, upgrades, patches, and hotfixes (in conjunction with release and deployment management).

The design and implementation of desktop and mobile device archiving and rebuild policies (including policies relating to cookies, favorites, templates, personal data, and security) will also be managed by this function.

They will also provide third-level support of incidents related to desktops and mobile devices, including desk-side visits where necessary or replacing devices with reconfigured images and data when needed.

The function will take responsibility for the support of connectivity issues (in conjunction with network management) to home workers and mobile staff.

It is important to make sure configuration control and auditing of all desktop, laptop, and mobile device equipment is managed and maintained (in conjunction with service asset and configuration management and IT audit).

Middleware Management

Middleware is software that connects or integrates software components across distributed or disparate applications and systems. Middleware enables the effective transfer of data between applications and is therefore key to services that are dependent on multiple applications or data sources.

A variety of technologies are currently used to support program-to-program communication, such as object request brokers, message-oriented middleware, remote procedure calls, and point-to-point web services. Newer technologies are emerging all the time; for example, an enterprise service bus (ESB) enables programs, systems, and services to communicate with each other regardless of the architecture and origin of the applications. ESB technology is especially used in the context of deploying service-oriented architectures (SOAs).

Middleware management can be performed as part of an application management function (where it is dedicated to a specific application) or as part of a technical management function (where it is viewed as an extension to the operating system of a specific platform).

Functionality provided by middleware includes providing transfer mechanisms for data from various applications or data sources and sending work to another application or procedure for processing. It also includes transmitting data or information to other systems, such as sourcing data for publication on websites (e.g., publishing incident status information).

Middleware management will also be engaged in releasing updated software modules across distributed environments. The collation and distribution of system messages and instructions—for example, events or operational scripts that need to be run on remote devices—will also be part of this approach, as well as setting up multicast functionality with networks. Multicast is the delivery of information to a group of destinations simultaneously using the most efficient delivery route. This will require the management of queue sizes.
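The data transfer and queue-size management described above can be sketched with a bounded in-memory queue standing in for the middleware (the application names and queue limit are illustrative; production systems would use a real message broker):

```python
import queue

# A bounded queue models message-oriented middleware between a
# producer application and a consumer; maxsize is the managed limit.
broker = queue.Queue(maxsize=100)

def publish(message: dict) -> None:
    broker.put(message)          # blocks if the queue is full

def consume() -> dict:
    return broker.get()

publish({"source": "orders-app", "payload": "order created"})
msg = consume()
assert msg["source"] == "orders-app"
assert broker.empty()
```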

Middleware management is the set of activities that are used to manage middleware. These include working as part of service design and transition to ensure that the appropriate middleware solutions are chosen and that they can perform optimally when they are deployed. This will lead to the correct operation of middleware through monitoring and control, allowing the detection and resolution of incidents related to middleware.

Middleware management will also be responsible for maintaining and updating middleware, including licensing and installing new versions. This will include defining and maintaining information about how applications are linked through middleware.

Internet/Web Management

The Internet is increasingly important to most organizations as more and more conduct their business through it. A website may need to be updated regularly with prices, special offers, and other important information, and a failure could be catastrophic. (This is especially true for online retailers such as Amazon and Expedia, which have no physical retail presence.)

Organizations with such a heavy dependence on the Internet will usually have a dedicated team for intranet and Internet management. This team will be responsible for defining architectures for Internet and web services and for specifying standards for the development and management of web-based applications, content, websites, and web pages. This will typically be done during service design. The design, testing, implementation, and maintenance of websites will include the architecture of websites and the mapping of content to be made available.

The responsibilities of such a team or department incorporate both intranet and Internet management and are likely to include the management and operation of firewalls, secure gateways, and secured subnetworks (e.g., the DMZ, or demilitarized zone) used to provide a secure perimeter between trusted IT infrastructures and larger untrusted networks.

In many organizations, web management will include editing content to be posted as well as the maintenance of all web development and management applications.

Internet and web management is also responsible for meeting with and giving advice to web-content teams within the business. Content may reside in applications or storage devices, which implies close liaison with application management and other technical management teams.

Liaison with ISPs, hosts, and third-party monitoring or virtualization organizations will also be managed by this team. In many organizations, the ISPs are managed as part of network management.

They will also provide third-level support for Internet-/web-related incidents and support for interfaces with back end and legacy systems. This will often mean working with members of the application development and management teams to ensure secure access and consistent functionality.

Monitoring and management of website performance will include heartbeat testing, user experience simulation, benchmarking, on-demand load balancing, virtualization, website availability, resilience, and security. This will form part of the overall information security management of the organization.
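Heartbeat testing ultimately reduces to classifying each probe against agreed thresholds. The sketch below uses invented thresholds that would, in practice, come from the SLA (the actual HTTP probe is omitted so the logic stands alone):

```python
def heartbeat_status(http_status: int, latency_ms: float,
                     warn_ms: float = 800, fail_ms: float = 2000) -> str:
    """Classify a single heartbeat probe of a website.
    Thresholds are illustrative and would be taken from the SLA."""
    if http_status != 200 or latency_ms >= fail_ms:
        return "down"
    if latency_ms >= warn_ms:
        return "degraded"
    return "ok"

assert heartbeat_status(200, 150) == "ok"
assert heartbeat_status(200, 1200) == "degraded"
assert heartbeat_status(503, 100) == "down"
```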

Facilities and Data Center Management

Facilities management is the management of the physical environment of IT operations, usually located in data centers or computer rooms. In many respects, facilities management could be viewed as a function in its own right. However, in this section we review facilities management specifically as it relates to the management of data centers and as a subset of the IT operations management function.

Although data centers are often managed by general facilities management or office services departments (if these exist), they have specialized requirements regarding layout, heating and air-conditioning, planning the power capacity requirements, and so on. So, although data centers may be facilities owned by an organization, best practice would be to have them managed under the authority of IT operations. Where a general department carries it out, there should be a functional reporting line to IT.

The specific activities include building management, equipment hosting, and power management. It should also cover environmental conditions and safety and physical controls. Supplier management of the providers of the environment (when provided by an external organization) will also be part of this functionality.

Data Center Strategies

Managing a data center is far more than hosting an open space where technical groups install and manage equipment, using their own approaches and procedures. It requires an integrated set of processes and procedures involving all IT groups at every stage of the service lifecycle. Data center operations are governed by strategic and design decisions for management and control and are executed by operators. This requires a number of key factors to be put in place:

  • Data center automation uses specialized automation systems that reduce the need for manual operators and monitor and track the status of the facility and all IT operations at all times.
  • Policy-based management enables the rules of automation and resource allocation to be managed by policy rather than having to go through complex change procedures every time processing is moved from one resource to another.
  • Real-time services should be provided 24 hours a day, 7 days a week.
  • Capacity management of environmental factors covers physical factors such as floor space, cooling, and power, which need to be managed in terms of their available capacities and workloads to ensure that shortfalls in these areas do not create incidents or generate unplanned costs.
  • Standardization of equipment provides greater ease of management, more consistent levels of performance, and a means of providing multiple services across similar technology. Standardization also reduces the variety of technical expertise required to manage equipment in the data center and to provide services.
  • SOAs define where service components can be reused, interchanged, and replaced very quickly and with no impact on the business. This will make it possible for the data center to be highly responsive in meeting changing business demands without having to go through lengthy and involved reengineering and re-architecting.
  • Virtualization means that IT services are delivered using an ever-changing set of equipment geared to meet current demand. For example, an application may run on a dedicated device together with its database during high-demand times but be shifted to a shared device with its database on a remote device during nonpeak times, all automated. This will mean even greater cost savings because any equipment can be used at any time, without any human intervention except to perform maintenance and replace failed equipment. The IT infrastructure is more resilient because components are backed up by any number of similar components, any of which could take over a failed component’s workload automatically. Remote monitoring, control, and management equipment and systems will be essential to manage a virtualized environment because many services will not be linked to any one specific piece of equipment.
  • Unified management systems have become more important as services run across multiple locations and technologies. Today it is important to define what actions need to be taken and what systems will perform that action. This means investing in solutions that will allow infrastructure managers to simply specify what outcome is required and let the management system calculate the best combination of tools and actions to achieve the outcome.
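The environmental capacity point above can be expressed as a simple threshold check; the readings and the 85 percent alert threshold below are illustrative:

```python
def capacity_alerts(utilization: dict, threshold: float = 0.85) -> list:
    """Flag any environmental resource (power, cooling, floor space)
    whose utilization exceeds the alert threshold."""
    return sorted(name for name, used in utilization.items()
                  if used > threshold)

# Hypothetical utilization readings as fractions of capacity.
readings = {"power_kw": 0.91, "cooling": 0.72, "floor_space": 0.88}
assert capacity_alerts(readings) == ["floor_space", "power_kw"]
```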

Operational Activities of Processes Covered in Other Lifecycle Stages

In the previous sections, we looked at service operation activities. In the following sections, we’ll cover the engagement of service operation in other lifecycle processes.

This will include the following processes from service transition:

  • Change management
  • Service asset and configuration management
  • Release and deployment management
  • Knowledge management

In service design, we will consider the following operational activities:

  • Capacity management
  • Availability management
  • IT service continuity management
  • Information security management
  • Service level management

The service strategy processes covered are as follows:

  • Demand management
  • Financial management

Finally, we’ll review the engagement of service operation in continual service improvement.

Change Management

Service operation staff will be involved with change management on a day-to-day basis. This includes using the change management process for standard, operational-type changes by raising and submitting requests for change (RFCs) as needed to address service operation issues. Operational staff will also participate in the change advisory board (CAB) or emergency change advisory board (ECAB) meetings to ensure that service operation risks, issues, and views are taken into account.

Obviously, there will be engagement from operational staff in implementing changes (or backing out changes) as directed by change management where they involve a service operation component or services. They will also assist with activities to move physical assets to their assigned locations within the data center.

Operational staff are also responsible for helping define and maintain change models relating to service operation components or services. They will receive change schedules and ensure that all service operation staff are made aware of and are prepared for all relevant changes. And finally, they will coordinate efforts with design activities to ensure that service operation requirements and concerns are addressed when planning and designing new or changed services.

Service Asset and Configuration Management

Service operation staff will be involved with certain aspects of service asset and configuration management (SACM) on a day-to-day basis. For example, they will inform service asset and configuration management of any discrepancies found between any CIs and the CMS and may be involved with making amendments necessary to correct discrepancies, under the authority of service asset and configuration management. Operational staff may also be tasked with labeling and tagging physical assets (e.g., serial numbers and bar codes) so they can be easily identified as well as assisting with audit activities to validate existence and location of service assets.

The responsibility for updating the CMS remains with service asset and configuration management, but in some cases operations staff might be asked, under the direction of service asset and configuration management, to update relationships, or even to add new CIs or mark CIs as “disposed” in the CMS if the updates are related to operational activities actually performed by operations staff. Operations staff may also assist service asset and configuration management activities by communicating changes in state or status with CIs impacted by incidents.

Release and Deployment Management

Service operation staff will be involved with release and deployment management on a day-to-day basis. They may also be under the direction of release and deployment management and be responsible for actual implementation actions regarding the deployment of new releases where they relate to service operation components or services.

It is important that operational staff participate in the planning stages of major new releases to advise on service operation issues.

Operational staff will manage the physical handling of CIs from/to the definitive media library (DML) as required to fulfill their operational roles, while adhering to relevant release and deployment management procedures, such as ensuring that all items are properly booked out and back in. They will also participate in activities to back out unsuccessful releases when they occur.

Knowledge Management

Relevant information (including data and metrics) should be passed up the management chain to other service lifecycle stages so that it can feed into the knowledge and wisdom layers of the organization’s service knowledge management system (SKMS). It is vitally important that all data and information that can be useful for future service operation activities are properly gathered, stored, and assessed.

Key repositories of service operation, which have been frequently mentioned elsewhere, are the CMS and the KEDB, but the repositories must also include documentation from all of the service operation teams and departments, such as operations manuals, procedures manuals, work instructions, and other operational documentation.

Capacity Management

Although many of the capacity management activities are of a strategic or longer-term planning nature, there are a number of operational capacity management activities that must be performed on a regular ongoing basis as part of service operation.

Capacity and Performance Monitoring

All components of the IT infrastructure should be continually monitored (in conjunction with event management) so that any potential problems or trends can be identified before failures or performance degradation occurs. The components and elements to be monitored will vary depending upon the infrastructure in use. Different kinds of monitoring tools are needed to collect and interpret data at each level. For example, some tools will allow the performance of business transactions to be monitored, while others will monitor CI behavior.

Event management needs to set up and calibrate alarm thresholds so that the correct alert levels are set and filtering is established as necessary to raise only meaningful events. Capacity management should be involved in helping specify and select any such monitoring capabilities and integrating the results or alerts with other monitoring and handling systems.
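As an illustration only (ITIL does not prescribe tooling), the threshold-and-filtering idea described above can be sketched in a few lines of Python. The metric names, threshold values, and severity labels here are hypothetical:

```python
# Minimal sketch of event filtering against calibrated alarm thresholds.
# Only samples that breach a threshold raise a meaningful event; everything
# else is filtered out as informational noise.

# Hypothetical (warning, critical) thresholds per monitored metric:
THRESHOLDS = {
    "cpu_utilization_pct": (75, 90),
    "disk_free_pct": (20, 10),  # lower is worse, handled below
}

def classify(metric, value):
    """Return 'critical', 'warning', or None (filtered) for one sample."""
    warn, crit = THRESHOLDS[metric]
    if warn < crit:  # higher values are worse (e.g., CPU utilization)
        if value >= crit:
            return "critical"
        if value >= warn:
            return "warning"
    else:            # lower values are worse (e.g., free disk space)
        if value <= crit:
            return "critical"
        if value <= warn:
            return "warning"
    return None  # within normal range: filtered, no event raised

def filter_events(samples):
    """Yield only the samples that should raise an alert."""
    for metric, value in samples:
        severity = classify(metric, value)
        if severity:
            yield (severity, metric, value)

samples = [
    ("cpu_utilization_pct", 60),  # normal - filtered out
    ("cpu_utilization_pct", 95),  # breaches the critical threshold
    ("disk_free_pct", 15),        # breaches the warning threshold
]
alerts = list(filter_events(samples))
```

Real event management tools perform this calibration in their own configuration; the point is simply that only samples breaching an agreed threshold become meaningful events.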

Event management must work with all appropriate support groups to make decisions on where capacity and performance alarms are routed and on escalation paths and timescales.

If there is a current or ongoing capacity or performance management issue and an alert is triggered or an incident is raised at the service desk, capacity management support personnel must become involved to identify the cause and find a resolution. Working together with appropriate technical support groups, and alongside problem management personnel, they must perform all necessary investigations to detect exactly what has gone wrong and what is needed to correct the situation.

When a solution, or potential solution, has been found for a capacity- or performance-related problem, any changes necessary to resolve it must be authorized via formal change management before implementation. If the fault is causing serious disruption and an urgent resolution is needed, the emergency change process should be used.

Capacity management has a role to play in identifying capacity or performance trends as they become discernible. Service operation should include activities for logging and collecting performance data and information relating to performance incidents to provide a basis for problem and capacity management trend analysis activities.

Large amounts of data are usually generated through capacity and performance monitoring. In any organization, it is likely that the monitoring tools used will vary greatly. In order to coordinate the data being generated and allow the retention of meaningful data for analysis and trending purposes, some form of central repository for holding this summary data is needed, such as a capacity management information system.
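To make the idea of retaining "meaningful summary data" concrete, here is a minimal sketch of the summarization step. The field names and bucketing scheme are illustrative, not a prescribed CMIS schema:

```python
# Sketch: condensing raw performance samples into summary records suitable
# for retention in a central repository such as a capacity management
# information system (CMIS). Raw samples can then be discarded while the
# summary supports analysis and trending.
from collections import defaultdict
from statistics import mean

def summarize(samples):
    """samples: iterable of (hour, metric, value).
    Returns {(hour, metric): {"avg": ..., "peak": ..., "count": ...}}."""
    buckets = defaultdict(list)
    for hour, metric, value in samples:
        buckets[(hour, metric)].append(value)
    return {
        key: {"avg": round(mean(vals), 1), "peak": max(vals), "count": len(vals)}
        for key, vals in buckets.items()
    }

raw = [
    ("09:00", "cpu_pct", 40), ("09:00", "cpu_pct", 60),
    ("10:00", "cpu_pct", 90),
]
summary = summarize(raw)
```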

Modeling and/or sizing of new services and/or applications must, where appropriate, be done during the design and transition stages. However, the service operation functions have a role to play in evaluating the accuracy of the predictions and feeding back any issues or discrepancies.

Availability Management

During service design and service transition, IT services are designed and tested for availability and recovery. Service operation is responsible for actually making the IT service available to the specified users at the required time and at the agreed levels.

The IT teams, and particularly the users, are often in the best position to detect whether services actually meet the agreed requirements and whether the design of these services is effective. The actual experience of the users and operational functions during service operation can provide primary input into the ongoing improvement of existing services and of their design.

However, there are a number of challenges with gaining access to this knowledge because most of the experiences of the operational teams and users are either informal or spread across multiple sources. The process for collecting and collating this data needs to be formalized.

There are three key opportunities for operational staff to be involved in availability improvement, because these are generally viewed as part of their ongoing responsibility:

  • The review of maintenance activities and regular comparison of actual maintenance activities and times with the service design plans. This will highlight potential areas for improvement.
  • Major problem reviews. Problems could be the result of any number of factors, one of which is poor design. Problem reviews therefore may include opportunities to identify improvements to the design of IT services, which will include availability and capacity improvement.
  • Involvement in specific initiatives using techniques such as service failure analysis (SFA), component failure impact analysis (CFIA), and fault tree analysis (FTA) or as members of technical observation (TO) activities—either as part of the follow-up to major problems or as part of an ongoing service improvement plan (SIP), in collaboration with dedicated availability management staff. (These are fully explained in Chapter 13, “Service Design Processes: Service Level Management and Availability Management.”)

There may be occasions when operational staff themselves need downtime of one or more services to enable them to conduct their operational or maintenance activities; this may have an impact on availability if not properly scheduled and managed. In such cases, they must liaise with SLM and availability management staff, who will negotiate with the business/users to agree on and schedule such activities, often using the service desk to perform this role.

IT Service Continuity Management

Service operation functions are responsible for the testing and execution of system and service recovery plans as determined in the IT service continuity plans for the organization. In addition, managers of all service operation functions must participate in key coordination and recovery teams as they have been outlined in those continuity plans.

Service operation needs to be involved in risk assessment, using its knowledge of the infrastructure and techniques such as CFIA and access to information in the CMS to identify single points of failure or other high-risk situations. It also needs to be involved in the execution of risk management measures, such as implementing countermeasures and increasing the resilience of infrastructure components.

Operational staff will provide assistance in writing the actual recovery plans for systems and services under their control and will participate in testing of the plans (such as involvement in off-site testing, simulations, etc.) on an ongoing basis under the direction of the IT service continuity manager. They will manage the ongoing maintenance of the plans under the control of the IT service continuity management and change management processes.

Operational staff need to participate in training and awareness campaigns to ensure that they are able to execute the plans and understand their roles in a disaster.

The service desk will play a key role in communicating with staff, customers, and users during an actual disaster and should also provide assistance with testing and execution of system and service recovery plans.

Information Security Management

Information security management has the overall responsibility for setting policies, standards, and procedures to ensure the protection of the organization’s assets, data, information, and IT services. Service operation teams play a key role in executing these policies, standards, and procedures. As a consequence, they will work closely with the teams or departments responsible for information security management. It is important to separate the roles between the groups defining and managing the process and the groups executing specific activities as part of ongoing operation.

Key service operation team support activities can include policing and reporting, such as checking system journals, logs, and event/monitoring alerts, as well as intrusion detection and/or reporting of actual or potential security breaches.
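The "policing and reporting" activity can be illustrated with a short sketch that scans a system log for repeated failed logins. The log format, account names, and failure threshold here are hypothetical examples, not a real tool's output:

```python
# Sketch of operational security policing: scan a system log for accounts
# with repeated failed logins that might indicate an attempted breach,
# so they can be reported to information security management.
from collections import Counter

FAILED_LOGIN_LIMIT = 3  # report accounts with this many failures or more

def suspicious_accounts(log_lines, limit=FAILED_LOGIN_LIMIT):
    """Return a sorted list of accounts at or above the failure limit."""
    failures = Counter()
    for line in log_lines:
        # Illustrative line format: "<timestamp> FAILED_LOGIN user=<name>"
        if "FAILED_LOGIN" in line:
            user = line.split("user=")[1].strip()
            failures[user] += 1
    return sorted(u for u, n in failures.items() if n >= limit)

log = [
    "09:01 FAILED_LOGIN user=alice",
    "09:02 LOGIN user=bob",
    "09:03 FAILED_LOGIN user=alice",
    "09:04 FAILED_LOGIN user=alice",
    "09:05 FAILED_LOGIN user=carol",
]
report = suspicious_accounts(log)
```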

Service operation staff are often first to detect security events and are in the best position to be able to shut down and/or remove access to compromised systems. Service operation staff may be required to escort visitors into sensitive areas and/or control their access. This will have to be established according to the requirements of the individual organization because it may not be considered appropriate for operational staff to be utilized in this way. They also have a role to play in controlling network access to third parties, such as hardware vendors dialing in for diagnostic purposes, for example.

Technical advice and assistance may be needed regarding potential security improvements (e.g., setting up appropriate firewalls or access/password controls). Technical support may also need to be provided to IT security staff to assist in investigating security incidents and producing reports or in gathering forensic evidence for use in disciplinary actions or criminal prosecutions. Event, incident, problem, and service asset and configuration management information can be relied on to provide accurate chronologies of security-related investigations.

Service operation staff are often responsible for maintaining operational security control by providing technical staff with privileged access to key technical areas (e.g., root system passwords and physical access to data centers and communications rooms). It is therefore essential that adequate controls and audit trails are kept of all such privileged activities to deter and detect security events.
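A minimal sketch of such an audit trail follows. The record fields and resource names are illustrative, not a prescribed ITIL data model; in practice the trail would live in tamper-evident storage rather than an in-memory list:

```python
# Sketch of an audit trail for privileged access: every use of privileged
# access is recorded so it can later be reviewed to deter and detect
# security events.
from datetime import datetime, timezone

audit_trail = []  # stand-in for tamper-evident audit storage

def record_privileged_access(user, resource, action):
    """Append one timestamped audit record and return it."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "resource": resource,
        "action": action,
    }
    audit_trail.append(entry)
    return entry

def entries_for(resource):
    """Support for audits: all recorded activity against one resource."""
    return [e for e in audit_trail if e["resource"] == resource]

record_privileged_access("jsmith", "db-server-01", "root login")
record_privileged_access("jsmith", "comms-room-a", "physical entry")
matches = entries_for("db-server-01")
```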

All service operation staff should be screened and vetted to a security level appropriate to the organization in question. Suppliers and third-party contractors should also be screened and vetted—both the organizations and the specific personnel involved.

All service operation staff should be given regular and ongoing training in the organization’s security policy and procedures. This should include the details of disciplinary measures in place. In addition, security requirements should be specified in the employee’s contract of employment.

Service operation documented procedures must reference all relevant information relating to security issues extracted from the organization’s overall security policy documents.

Service Level Management

Service level management (SLM) is the process responsible for negotiating SLAs and ensuring that they are enforced. It monitors and reports on service levels and holds regular customer reviews.

Incident management priorities and required resolution targets should be guided by service level targets. Problem management activities contribute to the improved attainment of service level targets by identifying root cause and instigating changes that are needed to improve performance. Service operation teams play a role in executing monitoring activities through the event management process that can provide early detection of service level breaches. SLM also maintains the agreements used by access management to provide access to services (such as the definition of criteria for which business users may be granted access), while request fulfilment activities may be bounded by agreed service targets.

Demand Management

Demand management is the name given to a number of techniques that can be used to modify demand for a particular resource or service. There are aspects of demand management that are of an operational nature, requiring short-term action. This can include controlling and managing access to a specific application with limited licenses.

There may be occasions when optimization of infrastructure resources is needed to maintain or improve performance or throughput. It may require moving a service or workload from one location or set of CIs to another, often to balance utilization or traffic or to carry out technical virtualization. Service operation would be responsible for setting up and using virtualization systems to allow movement of processing around the infrastructure to give better performance/resilience in a dynamic fashion.

It will only be possible to manage demand effectively if there is a good understanding of which workloads exist, so monitoring and analysis of workloads are needed on an ongoing operational basis.
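As a simple illustration of workload analysis, the sketch below identifies the busiest time slots where short-term demand management (such as rescheduling batch work) might be applied. The slot names and transaction counts are invented for the example:

```python
# Sketch of ongoing workload analysis for demand management: given
# transaction counts per time slot, find the peak slots that are
# candidates for short-term demand management measures.

def peak_slots(workload, top_n=2):
    """workload: {slot_name: transaction_count}.
    Return the top_n busiest slot names, busiest first."""
    return sorted(workload, key=workload.get, reverse=True)[:top_n]

workload = {
    "08:00-09:00": 1200,
    "09:00-10:00": 4800,
    "12:00-13:00": 3100,
    "17:00-18:00": 900,
}
busiest = peak_slots(workload)
```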

Financial Management for IT Services

Service operation staff must participate in and support the overall IT budgeting and accounting system and may be actively involved in any charging system that is in place.

The service operation manager must also be involved in regular (at least monthly) reviews of expenditure against budgets as part of the ongoing IT budgeting and accounting process. Care should therefore be taken to ensure that IT is involved in discussing all cost-saving measures and contributes to overall decisions.

Improvement of Operational Activities

All service operation staff should be constantly looking for areas in which process improvements can be made to provide higher IT service quality in a more cost-effective way. Opportunities for improvement will be included in the CSI register for review and prioritization.

This may be covered by a range of different operational activities, such as automation of manual tasks. All tasks should be examined for their potential for automation to reduce effort and costs and to minimize potential errors. A judgment must be made on the costs of the automation and the likely benefits that will result.

It is important to review makeshift activities or procedures that were designed to be short term but that have become the “norm” because there are often efficiencies to be achieved.

Operational audits should be conducted of all service operation processes to ensure that they are working satisfactorily, and used to identify operational improvement opportunities. It is important to include education and training for service operation teams, who should understand the importance of what they do on a daily basis.

Summary

This brings us to the end of this chapter, in which we explored the activities of service operation, providing the knowledge, interpretation, and analysis of service operation principles, techniques, and relationships and their application to the delivery and support of services at agreed levels.

This included monitoring and control as it relates to event management and IT operations for management of the operational environment. We also looked at server and mainframe management and support and network management.

An important operational activity is related to storage and archive, which is often supported by database administration. Directory services management is another key area, as are desktop and mobile device support and middleware management. Many organizations also provide Internet or web management as part of the operational activities. Facilities and data center management are also included in the operational activities.

We looked at the service operation activities throughout the other lifecycle processes. In each lifecycle stage, service operation staff have an important part to play, and we explored some of the key processes in which they will be involved.

Exam Essentials

Understand and explain the uses of monitoring and control in service operation. This includes the concepts of the monitor control loop and open and closed loop systems. Explain and expand on the nature of the various types of monitoring and measurement.

Understand how to apply the IT operations activities for service operation. Understand the following:

  • Server and mainframe management and support
  • Network management
  • Storage and archive
  • Database administration
  • Directory services management
  • Desktop and mobile device support
  • Middleware management
  • Internet/web management
  • Facilities and data center management

Understand the application of operational activities of processes covered in other lifecycle stages. The engagement of service operation in the lifecycle is key for each stage because service operation is involved in all stages and processes. It is important to understand the involvement and be able to understand and apply this to each lifecycle stage.

Understand and expand on the improvement of operational activities. Continual service improvement is based on the identification of improvement, and it is important to understand the mechanisms for managing improvement.

Review Questions

You can find the answers to the review questions in the appendix.

  1. Measurement and control is a continuous cycle. Which of these is NOT part of that cycle?

    1. Monitoring
    2. Reporting
    3. Subsequent action
    4. Restoration
  2. Measurement refers to any technique used to evaluate the extent, dimension, or capacity of an item in relation to a standard or unit. What is the definition ITIL applies to the term extent?

    1. The degree of compliance or completion (e.g., are all changes formally authorized by the appropriate authority?)
    2. The size of an item (e.g., the number of incidents resolved by the service desk)
    3. The total capability of an item (e.g., the maximum number of standard transactions that can be processed by a server per minute)
    4. The cost of the item (e.g., the amount of money spent on each item)
  3. Which of these statements is/are correct?

    1. Monitoring is used to establish whether specific conditions are met or not met.
    2. Monitoring is also concerned with the detection of abnormal types or levels of activity.
      1. Statement 1 only
      2. Statement 2 only
      3. Both statements
      4. Neither statement
  4. There are different types of monitoring. Which of these is the description for active monitoring?

    1. Active monitoring is the ongoing “interrogation” of a device or system to determine its status.
    2. Active monitoring is generating and transmitting events to a “listening device” or monitoring agent.
    3. Active monitoring is designed to request or trigger action following a certain type of event or failure.
    4. Active monitoring is used to detect patterns of events that indicate that a system or service may be about to fail.
  5. In the monitor control loop, an activity and its output are compared to what?

    1. A predefined SLA
    2. A predefined OLA
    3. A predefined norm
    4. A predefined contract
  6. In which stage of the service lifecycle is the ITSM monitor control loop based?

    1. In service operation, because this is where monitoring and control can take place
    2. In service design, because this is where the requirements for monitoring and control are defined
    3. In service strategy, because this is where the controls are agreed upon with the business
    4. In service transition, because this is where changes as a result of monitoring are managed
  7. Which of these are common service operation activities?

    1. IT operations
    2. Design coordination
    3. Server and mainframe management and support
    4. Network management
    5. Service validation and testing
    6. Storage and archive
    7. Database administration
      1. 2, 5, 6, and 7
      2. 1, 3, 4, 6, and 7
      3. 1, 2, 5, and 7
      4. All
  8. Service operation audits are an important part of the service operation lifecycle stage. Which of these statements about audits is/are correct?

    1. Regular audits must be performed on the service operation processes and activities to ensure that they are being performed as intended and that there is no circumvention.
    2. An audit will also establish if the processes are still fit for purpose or identify any required changes or improvements.
      1. Statement 1 only
      2. Statement 2 only
      3. Both statements
      4. Neither statement
  9. True or False? Service operation is involved in activities in all of the other lifecycle stages.

    1. True
    2. False
  10. Improvement is an important part of service operation. Which of these statements is/are correct about improvement in service operation?

    1. All service operation staff should be constantly looking for areas in which process improvements can result in higher IT service quality and/or be more cost-effective.
    2. Opportunities for improvement will be included in the CSI register for review and prioritization.
      1. Statement 1 only
      2. Statement 2 only
      3. Both statements
      4. Neither statement