Chapter 7

Practices for Noticing and Dealing with the Critical. A Case Study from Maintenance of Power Plants

Elizabeth Lay

If some evil genius were given the job of creating an activity guaranteed to produce an abundance of errors, he or she would probably come up with something that involved the frequent removal and replacement of large numbers of varied components, often carried out in cramped and poorly lit spaces … and usually under severe time pressure. …those who started a job need not necessarily be the ones required to finish it … a number of different groups work on the same item of equipment (Reason and Hobbs, 2003: 1).

This is an apt description of turbine maintenance work, which can be fraught with rework and incidents that result from human error. A maintenance service provider’s failure to perform to plan during a maintenance outage can result in high losses for power producers (utilities and independent power providers), because every day the power plant is down for maintenance it is not selling power. For nuclear power plants, this loss can exceed one million US dollars in lost revenue per day. Thus, the ability of a maintenance service provider to perform to plan is critical and can be the differentiating factor in the choice of who will perform the work. This chapter presents a case study in power plant maintenance in which principles of Resilience Engineering were used to design practices that notice risk profile changes and then move people into different actions to reduce the risk, thereby improving performance to plan. Implementation of the concept of ‘pinging’, the proactive probing for risk profile changes, is described. The steps and lessons learned from implementing ‘pinging’ to notice critical situations are shared, along with the design of a menu of actions to prevent such situations from turning into ‘high loss’ events.

Introduction

In high risk, high pressure, complex work such as the maintenance of power plants, quality and safety incidents can occur and sometimes be extremely costly for both the service provider and the customer. Thus, performing work consistently and predictably with few incidents can be the most important differentiating factor in the choice of service provider. Reactive safety and quality programs are often limited in scope and tend to be micro-focused on specific, historical incidents or trends. Principles of Resilience Engineering can be applied to design a broad, proactive strategy for noticing the critical and moving into different actions before high loss situations occur.

Business Background

Siemens is one of the world’s largest companies in the field of electrical engineering and electronics. About 400,000 employees develop and manufacture products, design and create systems and plants, and provide customized services in Industry, Energy, and Healthcare. The practices shared in this chapter were developed in the Energy Service business in the Americas, where maintenance is performed every year on more than 1,000 turbines ranging in size from 50 to 1,000 MW. This maintenance work is performed on nuclear and coal-fuelled steam turbines and on gas turbines that produce electricity.

Loss Control Philosophy

Loss is defined as the avoidable waste of any resource (Bird et al., 2003). Losses can result from safety and quality incidents and from inefficiencies in work. The underlying management systems and possible breakdowns are mostly the same for safety, quality, and efficiency; thus, actions and plans to remedy potential loss situations are designed without differentiating between the three domains. In some companies, loss controls are viewed as adding cost, but any loss that is controlled adds directly to profit, and controlling loss can be an effective way to increase profit.

The field service group began to build the ‘Story of Loss’ specific to the maintenance of turbines about three years ago. The Loss Control Leadership Council was formed and included field engineers, technicians, and craft workers, about 10 people in all. After coming to a common understanding of what loss was and designing a simple method to quantify common types of loss, this council recorded and quantified the loss they saw on the jobs they were on. Other team members visited additional outages and compiled loss reports through observations and interviews with outage workers. After sampling about 40 outages over a one-year period, the types of loss and the average cost per loss-incident type were determined. The teams observed that the most common type of loss, but also the least frequently reported, was rework, defined as something done more than once or any non-value-added activity. The second most common loss, and the one with the highest cost per incident, was waiting: waiting for tools, people, the crane, permits, decisions, parts, and so on. Waiting was the most expensive type of loss event because, when the critical path is impacted, as it is in many waiting-type loss events, the loss accrues at the burn rate (the total cost of being on the power plant site per day, including the crew’s pay and expenses, tools, and temporary offices).

The ‘Story of Loss’ specific to Siemens’ Americas’ Power Generation Field Service organization was shared widely within the organization. This story included a grounded estimate of the total loss being incurred annually by the organization, the types of loss observed during the sampling of outages, the average loss per incident type, how frequently these incidents were occurring, and the total impact. The simple loss-quantification methodology included typical burn rates for field service outages and blended hourly costing rates for different roles, used to calculate losses due to rework; it was accompanied by an effort to build an understanding among all front-line workers of what burn rate and critical path mean.
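
To make the quantification arithmetic concrete, the following minimal sketch (in Python) shows how a waiting loss on the critical path and a rework loss might be costed. All rates and figures are hypothetical placeholders for illustration; the actual burn rates and blended hourly rates used in the field are site- and role-specific and are not reproduced here.

    # Minimal sketch (hypothetical figures) of the loss arithmetic described above.
    # Neither the function names nor the rates come from Siemens material.

    def waiting_loss(delay_hours, burn_rate_per_day, on_critical_path):
        """Waiting loss accrues at the site burn rate when the critical path slips."""
        if not on_critical_path:
            return 0.0  # off-critical-path waits still cost something, but not at burn rate
        return (delay_hours / 24.0) * burn_rate_per_day

    def rework_loss(rework_hours, crew_size, blended_hourly_rate):
        """Rework loss: crew-hours spent redoing work, costed at a blended hourly rate."""
        return rework_hours * crew_size * blended_hourly_rate

    # Example: a 6-hour critical-path wait on a site burning 100,000 USD/day,
    # plus 8 hours of rework by a 4-person crew at a blended 95 USD/hour.
    total = waiting_loss(6, 100_000, True) + rework_loss(8, 4, 95)
    print(f"Estimated loss for this incident: {total:,.0f} USD")  # 28,040 USD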

A simple ‘Story of Business’, Sales - Costs = Profit, was combined with the ‘Story of Loss’ to illustrate how reducing costs (or loss) contributes directly to profit, and to show that a significant number of outages would need to be performed to earn back the reported annual loss in terms of profit. Considering only the reported annual non-conformance costs (which were a fraction of the estimated actual loss), the number of outages that would have to be performed to bring this amount back as profit at current margin rates was calculated. The number surprised workers and helped them see the relevance of the estimated total annual loss.
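
The ‘outages needed’ calculation itself is simple division; the sketch below illustrates it with entirely invented figures (the real loss amounts, outage prices, and margin rates are not disclosed in this chapter).

    # Hypothetical illustration of the 'Story of Business' calculation: how many
    # outages must be performed to earn back an annual loss at a given margin.
    # All figures are invented; they are not Siemens' actual numbers.

    annual_reported_loss = 5_000_000   # reported non-conformance cost, USD (hypothetical)
    revenue_per_outage = 2_000_000     # average outage sales price, USD (hypothetical)
    profit_margin = 0.10               # profit as a fraction of sales (hypothetical)

    profit_per_outage = revenue_per_outage * profit_margin
    outages_to_recover = annual_reported_loss / profit_per_outage
    print(f"Outages needed to recover the loss: {outages_to_recover:.0f}")  # 25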

This story and philosophy were shared at all levels of the field-service organization from front-line worker to management. This foundational philosophy provided a reason for action as we moved into risk and resilience design.

Highly Resilient Organizations

Highly Resilient Organizations can be characterized by the following four behaviors.

•  They anticipate critical disruptions and situations and their consequences.

•  They notice the critical disruptions and situations when they occur.

•  They plan how to respond.

•  They adapt and move into different actions.

Design of the strategy to become more resilient included tactics in each of these domains of action.

Resilience has been defined as a measure of the ability to respond to change. Highly Resilient Organizations are able to recover rapidly when work is disrupted and are able to respond to the unexpected in a way that minimizes loss or increases gain. Being resilient includes the ability to keep working after major disruption or during continuous stress and disturbances, not merely by responding or reacting to what happens but by adjusting how work is done, or moving into different actions.

The response of an organization to stress is strikingly similar to the response of a ductile metal to stress (Figure 7.1) (Woods and Wreathall, 2008). For a ductile metal, as load (or stress) is increased, it is able to recover or return to its original form when the load is removed, up to a point. This point is the yield point. Beyond the yield point, as load is applied, the material begins to permanently deform until it reaches the point where it fractures. For an organization, as demand or load on people increases, they can respond well and handle the stress up to a certain point. Beyond this point, the risk of loss increases as people reach a limit where they are overloaded and working beyond their capacity; their mental functioning is degraded and errors are more likely. They may even reach a point where they are no longer able to cope.

Figure 7.1    A ductile metal stress-strain curve is representative of an organization’s response to stress. Highly Resilient Organizations notice when approaching the yield point or when things are taking a turn for the worse

Highly Resilient Organizations notice when people are approaching a ‘yield point’ and move into different actions (Figure 7.2). It is possible to build the skill of ‘noticing’ and to design processes that improve it; one approach, ‘pinging’, is described later in this chapter. Highly Resilient Organizations move into different actions (adapt) to expand their capacity to react, extending their ability to respond to disruptions. They may remove some of the load, or stress, from the people involved in the critical situation, enabling them to return to a mode where they are able to function effectively. Some methods for doing this are shared in the menu of possible solutions in the ‘Adapting’ section of this chapter.

Anticipate

An ‘outage’ involves crews of 30 to 100 or more people mobilizing to a power plant site to disassemble, inspect, and reassemble a complex machine (a turbine, for example). The work requires many specialty tools, often shipped in on several tractor trailers; involves the assembly of large, expensive, complex parts with very tight clearances and close tolerances; and includes lifting heavy components (a typical turbine rotor can weigh 50 to 80 US tons). The work is often done in extreme heat or cold, during 12-hour shifts, working 7 days a week, under extreme schedule pressure. This business has been in a growth mode, with human resources being a critical factor in staffing the work that is accepted each season. It is common in this business for the teams to be a mix of employees and contractors of varying levels of experience and skill, many of whom meet each other for the first time when arriving for work on the job site. Given the complexity of the work and the mix of workers, there can be many opportunities for incident likely situations to arise.

Work on the operational risk management program, which includes designing practices to become more resilient, began almost two years ago. The initial focus of the resilience work was on noticing outages where the risk profile was changing or had changed, and then moving into different actions before significant loss occurred.

Figure 7.2    Upon noticing people are approaching a ‘yield point,’ Highly Resilient Organizations move into different actions extending their capacity to react

Notice

The first step on the road to becoming more resilient was to improve ‘noticing’ of the critical, looking at both general situations on an outage and specific unexpected situations. One component of this was to implement a ‘pinging’ process. Pinging is the proactive probing for risk profile changes (Wreathall and Merritt, 2003). Through workshops with experienced project managers and operational support staff, signs that an outage could be approaching an out-of-control situation or a risk profile change were hypothesized. Some of these potential risk factors and indicators of risk profile changes are listed below; a simple scoring sketch follows the list:

•  multiple issues taking crew’s attention;

•  progress stalling, schedule impacts, multiple delays;

•  mood of project manager changes;

•  specialty personnel on site longer than anticipated;

•  a sudden need for more people;

•  multiple personnel being changed out;

•  higher than usual amount of emergent work;

•  multiple safety and quality incidents, even if minor; an increase in errors;

•  common tasks not performed or performed late (such as getting permits);

•  special situations with the potential to change workers’ moods (for example, working over Christmas) or the risk level on site (weather: storms, severe heat or cold, snow, wind, ice, hurricanes);

•  decline in communication, such as unreturned calls or emails;

•  longer outage where potential for fatigue level is higher;

•  site housekeeping has slowed or stopped.
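
One lightweight way to operationalize pinging is to turn indicators like these into a weighted checklist that is reviewed at a regular cadence and that flags an outage for follow-up when its score crosses a threshold. The sketch below is only a hypothetical illustration of that idea; the indicator names, weights, and threshold are invented and do not represent the process Siemens actually used.

    # Hypothetical pinging checklist: score an outage's risk-profile indicators and
    # flag it for follow-up when the score crosses a threshold. The indicators,
    # weights, and threshold are invented for illustration only.

    INDICATORS = {
        "multiple_issues_taking_attention": 2,
        "schedule_slipping": 3,
        "specialty_personnel_overstaying": 1,
        "sudden_need_for_more_people": 2,
        "multiple_minor_incidents": 3,
        "communication_declining": 2,
        "severe_weather_on_site": 2,
        "housekeeping_slowed": 1,
    }

    def ping_score(observed):
        """Sum the weights of the indicators observed on this outage."""
        return sum(weight for name, weight in INDICATORS.items() if name in observed)

    def needs_follow_up(observed, threshold=5):
        """Refer the outage to the risk team when the score reaches the threshold."""
        return ping_score(observed) >= threshold

    # Example: delays plus a string of minor incidents and declining communication.
    observed = {"schedule_slipping", "multiple_minor_incidents", "communication_declining"}
    print(ping_score(observed), needs_follow_up(observed))  # 8 True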

Training was conducted with different groups of support professionals (who were not on, but were in frequent contact with, the job site). The thinking was that when you are in the heat of the battle (on the job site), it can be difficult to tell that you are approaching, or are at, the ‘overload point.’ One of our project managers compared this to not always knowing when to call the doctor when you are sick. These off-site teams were asked to refer situations where outages might be facing these types of challenges to the risk-management team. There were conversations with management on recognizing these situations. Case studies of outages that exemplified these situations were developed and shared.

Some outages were raised to the risk team’s attention by operations management. None of these outages were simultaneously raised by the professionals. There are several potential reasons for this: the professionals may not always have been in a position to fully comprehend the challenge, so they did not recognize the situations; or, if they did notice them, they took immediate action and moved into a mode of helping in specific areas aligned with their roles instead of referring the situation. The conclusion was that the targeted training and the added pinging responsibility for specific groups of professionals did not work well in this case.

Pinging Started Narrow then Grew to a Network

The next approach was to train the entire organization, from job-site clerks to front-line workers to project managers to directors, on concepts of resilience, including recognizing risk profile changes, error likely situations, and error likely ‘climates’ that could develop on specific job sites. Reason (2009: 100) noted that ‘climate relates to specific workplaces and to their local management, resources and workforce. Climate is shaped by both upstream cultural factors and the local circumstances. … Unlike cultures, local climates can change quite rapidly.’ Case studies based on actual incidents were developed around common error likely ‘climates’ and showed a direct correlation between specific climates and significant loss incidents and near misses. The error likely ‘climates’ identified for field service work were:

•  leaders who used a ‘top down’ approach or intimidating style;

•  leaders who were closed to listening to concerns of others on site and did not encourage questions;

•  leaders who were not engaged in the work; not on the turbine deck where work was being performed;

•  a site which had unclear roles / responsibilities for outage structure;

•  day versus night shift competition; crew not working as a team;

•  customer directing or overly involved in field service scope of work;

•  leaders not familiar with current practices and cultures (contract employees);

•  leaders who weren’t open to help.

Noticing where these climates may exist and understanding the serious potential consequences was a first step in reducing or eliminating the potential for these climates to exist on outages. Actions the workers could take when one of these ‘error likely’ climates was observed were part of the training. Actions available to workers included referring the situation to management off site (anonymously if requested) and/or requesting a professional trained in risk management, human performance tools, and dealing with error likely situations to visit the site and help with the situation. It should be noted that given the severity of the potential risks associated with turbine maintenance, noticing and addressing error likely climates was not trivial.

Planning

Planning, including designing new actions, is the most challenging part of the journey to increased resilience. Once you notice that you are either in, or approaching, a difficult situation, what action do you move into? There were times when there were no extra resources available or when it was not clear what help to provide. It was determined that a variety of different solutions may be required to improve the resilience of the service business.

There is an adage that you need to accumulate and develop the power, knowledge, and/or resources before you need them, but in order to justify the cost of additional resources, you may need to prove how this ‘buffering’ could really change the outcome. This proof can be difficult to come by, as it is difficult to measure loss that is prevented. Siemens recently began to develop a small team, trained to act in multiple roles (field engineer, risk, safety, and quality), to pilot the buffering concept. Team members are deployed selectively to outages where the overall situation is more complex, or the complexity is increasing, and additional help may be required. If this team proves itself valuable (as grounded in the number of times it is requested to help and the outcomes of the outages it responds to), then a case could be built to expand the team and add buffering capacity, if it is needed. To date, too few outages have had people deployed in this role to conclusively determine how well this is working, but we can say that those outages were completed without significant loss.

Adapting

It is important to note that there may not be a direct cause-and-effect relationship between the risk factors, the subsequent risks, and the mitigating actions. There are limits to the capacity of any human being. As difficult situations stack up, dealing with them uses some of this limited capacity, and stress can further reduce the capacity to act. For example, extreme weather conditions were noticed on a significant percentage of high loss outages two years in a row. There need not be a direct correlation between the weather event and the loss incidents. Consider how working in foul weather can increase the stress levels of the workers and make the work more difficult. Imagine working in frigid conditions: the worker is uncomfortable, perhaps impeded by cold-weather gear, and mental and physical functioning is likely reduced. Adjustments such as slowing down the work, adding stop points to warm up or hydrate, adjusting work hours, adding heaters or coolers, or adding people can help compensate. Siemens outage leadership has implemented ‘stand downs,’ periodic intervals during the work to check in with the crew and hold conversations between site leadership and the crew, for the purpose of bringing continued focus to performing safe, quality work and assessing the needs of workers. Siemens also provides Human Performance Tool Kits.

Thus, when designing actions to mitigate risk profile changes, the entire situation should be considered. When assessing capacity and limits of capacity, consider how people can respond when they are close to their limits. They can become forgetful, start missing things, become more prone to outbursts of anger, show signs of fatigue and stress. Actions can be designed to add capacity (example: add workers), adjust capacity limits (example: remove stressors), or remove load, freeing up existing capacity (example: defer or move work off site).
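
As a way of reasoning about these three levers, the toy model below represents a crew’s load against its effective capacity and shows how each lever restores margin. The class, the numbers, and the units are invented for illustration; this is not a Siemens tool.

    # Toy model of the capacity-and-load reasoning above. The class, numbers, and
    # units are invented for illustration; this is not a Siemens tool.
    from dataclasses import dataclass

    @dataclass
    class CrewState:
        base_capacity: float   # nominal work the crew can absorb (arbitrary units)
        stress_penalty: float  # capacity lost to stressors (weather, fatigue, ...)
        load: float            # demand currently placed on the crew

        @property
        def margin(self) -> float:
            """Headroom left before the crew is overloaded (negative = overloaded)."""
            return (self.base_capacity - self.stress_penalty) - self.load

    crew = CrewState(base_capacity=100, stress_penalty=30, load=80)
    print(crew.margin)         # -10: overloaded

    crew.base_capacity += 20   # add capacity, e.g., add workers
    crew.stress_penalty -= 15  # adjust capacity limits, e.g., remove stressors
    crew.load -= 10            # remove load, e.g., defer or move work off site
    print(crew.margin)         # 35: margin restored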

Questions to consider for designing mitigating actions:

•  What unplanned events or situations are using site leadership or worker’s capacity?

•  What resources or help are needed to add capacity, remove stressors, or free up existing capacity?

•  Where were site personnel already with respect to their capacity limits before the situation changed? Were they already highly loaded and close to their limit?

For work during which a possible change in risk profile was observed, the first recommended action was a call to the district service manager (to whom the outage site project management ultimately reported) for their assessment of the situation. A conversation between the district manager, the regional director, and site leadership would then ensue to explore whether help was needed, what type of help, and where that help might be available. The key here is that the person most familiar with the situation, the customer, and the workers ultimately makes the decision on what actions to take and what help is needed; in Siemens’ case, this is typically a district service manager. Siemens district offices are regionally located across the Americas and Canada. In this situation, a central support group can add value by sharing learning from outages and incidents, observing patterns and trends, and, with front-line workers’ input, designing actions to mitigate risk profile changes. Siemens district service managers typically have a deep and broad background in field service work, with a level of knowledge that makes them well suited to responding to difficult situations during an outage.

The following menu of possible solutions was designed by a group of experienced project managers and site supervisors:

•  Stop and assess the situation. Reassess the plan and consider where additional help is needed. Identify where the issues may be occurring.

•  Better organize site. Evaluate parts management, tool management, roles and responsibilities, work plan/schedule, shift turnovers, procedures, checklists, work instructions, communication plan. Pause and take the time to improve the plan.

•  Perform a Rapid Risk Assessment: a Siemens-designed process wherein the operations risk management team, site leadership, operations leadership, subject matter experts, those who may have been involved in similar situations, and engineering discuss risks and mitigating actions and clarify who the risk decision owner is.

•  Use human performance tools. Evaluate which tools are not currently being used but could be, and whether coaching on their use is needed.

•  Request an Outage Specialist to coach on human performance tools and risk and provide extra safety and quality oversight.

•  Communicate up the chain of command to the district service office, operations management, tooling management, and regional directors, and ask for help per the Risk Escalation policy.

•  Request commercial help: Someone to be on site to deal with commercial situations on some complex jobs. This could give the project manager more time to work on logistics and managing the job.

•  Request logistics help: Someone to help with parts, people, and tools, especially for emergent work or issues.

•  Begin a heightened state of coordination and help; daily calls with those who are helping.

•  Develop resources to provide buffering. Multi-skilled people who can travel to jobs where help is needed and act in a variety of roles.

Conclusion

Field service is already seeing benefit from applying Resilience Engineering concepts, even though it is at the beginning of the journey to become more resilient. It seems that even just building the skill of ‘noticing’ can help reduce loss: noticing triggers action, as people improvise responses even without a prescribed menu of help.

The next steps on the journey to resilience are expected to include:

•  Better incorporating into the work flow practices that increase the ability to notice risk profile changes more consistently, such as adding a risk score to daily status reports.

•  Continuing to improve the risk evaluations at the front of the project by integrating them with project readiness reviews.

•  Continuing to observe issues that may arise and design adapting actions once critical situations are noticed.

•  Considering what other types of buffering capacity could help and designing the processes to bring them in.

As Siemens continues on the journey to becoming increasingly resilient, direct financial benefits, as well as enhanced customer trust and satisfaction, are expected to continue to accrue.
