Chapter 17

Properties of Resilient Organizations: An Initial View

John Wreathall

Concept of Resilience

While formal definitions of resilience, and its associated field of application, resilience engineering, have yet to be developed, one of the simplest explanations is contained in the following description. Other equally valid definitions are discussed in other sections of this book. I do not claim primacy of this over others, but for this discussion, it suits my purpose.

Resilience is the ability of an organization (system) to keep, or recover quickly to, a stable state, allowing it to continue operations during and after a major mishap or in the presence of continuous significant stresses.

The property in question is often safety, but it should also include financial performance and any other goal vital to the organization’s well-being.

From this description, we can see a significant difference from the more traditional techniques of safety management, like probabilistic safety assessment (PSA), accident root-cause investigations, and so on. First, PSA and similar methods are concerned with identifying and defending against a prescribed set of hazards using techniques that have significant limitations in terms of their ability to represent human and organizational influences appropriately – often the most important influences on safety performance. Second, many of the methods involve the analysis of events to identify ‘causal’ factors from root-cause analyses, fractions of events resulting from ‘human errors’ and so on.

While there are several techniques that allow these kinds of analyses, all suffer from various weaknesses. One is that the identification of causes of accidents tends to be a social as well as a technical process (in terms of what causes are considered ‘acceptable’ by the owners of the system). Another is that they are built around rather limited models of safety that ignore the roles of cultural and organizational influences and, many times, rely on only partially recalled knowledge of events by participants. Most importantly, models describing the control of safety are built in isolation from other management activities, as if there were no connection, when in fact the two are integrally entwined.

The concepts of resilience and the anticipated tools for resilience engineering are intended to address these weaknesses head on. Thus, resilience engineering is a new management discipline that encompasses both safety management and other types of management, particularly process and financial management.

Approach of Resilience Engineering

If resilience is to ensure the organization keeps (or recovers to) a safe stable state, there are several processes that must go on to accomplish this goal. The purpose of resilience engineering is to develop and provide the tools for these processes. While the development of these is yet to be specified in detail, the following are the kinds of tools that would be developed.

Tools to Reveal Safety Performance

Organizations constantly struggle to understand how they are performing with regard to safety. ‘Too much’ safety is thought to limit the potential for operational profits, and too little will result in harm to the workers, lost production, and even loss of expensive facilities. In practice, safety and production are not necessarily in opposition: seeking to eliminate unwanted deviations in operation will improve both, for example. Nevertheless, a lack of knowledge of the current levels of safety performance can lead organizations to take more conservative decisions about production activities than would be appropriate. Alternatively, the organization’s safety performance can ‘drift’, with the consequence that surprising accidents happen, as has been cited in the Challenger Shuttle accident.

Of course, organizations try to measure their safety performance, both in terms of industrial (worker) safety and process safety (accidents affecting the public and the environment). While many industries use different processes, some high-performing facilities (for example, in the nuclear industry) use essentially the same approach for both. Regardless of the processes used, both areas of safety require management processes based on performance data. Some of the most common processes involve trending safety outcomes like worker fatal accidents and lost-time injuries, costs of damage to process equipment, releases to the environment, and so on. Other industries, with which I am most familiar, such as nuclear power and chemical process industries, use safety modeling techniques such as probabilistic safety assessments to identify likely contributors to accidents (such as failures of protection equipment) and then trend the frequencies and durations for which such equipment is not operational. Some industries use techniques that involve some kinds of formal performance models and data-based evaluations of safety performance, including the aviation and defense sectors (although PSA is used in both of these, too). However, these are all very limited as sources of operational management information.

First, historic data from accidents will always be out-of-date as a measure of today’s performance since data are from relatively rare events and almost always aggregated over long periods of time. They represent the consequences of decisions usually made significantly (years) earlier. The interpretation of these types of data is often uncertain, often biased by inevitable pressures to find simple explanations that are ‘politically acceptable’ and based on overly simple ‘models of safety’ that seek to identify one or two ‘root causes’ of accidents, neglecting the complexity of workplace pressures – see discussions by Hollnagel (1998; 2004), Dekker (2002) and others.

Second, PSA and similar models are very static interpretations of how accidents occur, focusing usually on simple descriptions of how hardware faults combine to cause bad outcomes and neglecting the complexity and interactions seen in complex accidents. Reviews of PSAs, when compared with major accidents, show that they typically fail to identify underlying organizational processes that override the usual assumptions of independence of failures, and neglect the complexities of human behaviors usually involved in accidents, often treating humans as if they were simple machines.

This is not to say that PSA is of no use; it is often an excellent tool for evaluating designs and selecting between alternatives. Additionally, serious efforts are underway to remove some of the poor modeling of human performance, such as the development of the ATHEANA method (NRC, 2000). However, as a source of information for making operational decisions now, these methods, even the improved ones, are critically limited.

What are required are data that allow the organization’s management to know the current ‘state of play’ of the safety performance within the organization, without suffering the problems outlined above.

Work has started to identify different data types and sources to provide the needed management information. An example is the work to develop leading indicators of organizational performance (see, for example, Wreathall, 1998, 2001; Wreathall & Merritt, 2003) that has been undertaken in several industries, including US nuclear power, aviation and oil exploration. This approach looks for data both at the working level (such as factors causing safety problems now for workers) and in organizational behaviors that can set workers up to have vulnerabilities. These methods help to ensure the convergence of resilience at different levels of the organization in the management of uncertainties.

Two tools are used in a complementary manner in other industries to measure the leading indicators associated with performance at the ‘sharp end.’ The first tool is intended to measure, on a current basis, the kinds of problems commonly found in event investigations and near-miss reports. In most applications we have found that 8–12 workplace and task factors typically encapsulate the dominant contributions to performance problems (Reason et al., 1998); these are identified by reviews of event reports and interviews with front-line workers. Typical examples seen across other industries include interfaces with other groups, lack of (or deficiencies in) relevant and timely input information, shortages of tools or other specific resources, and inadequate staffing. These problems are assessed proactively so that the organization does not have to wait for, and then suffer, the various costs of even partial failures in mission performance. Rather, the accumulation of data allows management to take countermeasures before the problems cause failures. This tool simply solicits data from samples of workers on a periodic basis, via a web server, about how much each of these factors has been a problem in getting work done in a recent period of time.
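The periodic survey described above can be sketched in code. The following is a minimal illustration only, not the tool itself: the factor names are borrowed from the examples in the text, while the 0 to 4 rating scale, the threshold of 2.5 and the function name summarize are assumptions introduced here.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical factor names, borrowed from the examples given in the text.
FACTORS = ("interfaces", "input_information", "tools_resources", "staffing")


def summarize(responses, threshold=2.5):
    """Average each factor's rating and list factors at or above threshold.

    responses: list of dicts mapping factor name -> rating on an ASSUMED
    scale of 0 (no problem) to 4 (severe problem). A factor missing from
    a response is simply skipped for that respondent.
    """
    ratings = defaultdict(list)
    for response in responses:
        for factor in FACTORS:
            if factor in response:
                ratings[factor].append(response[factor])
    means = {factor: mean(values) for factor, values in ratings.items()}
    flagged = sorted(f for f, m in means.items() if m >= threshold)
    return means, flagged


# Illustrative (made-up) answers from three sampled workers for one period.
sample = [
    {"interfaces": 1, "input_information": 3, "tools_resources": 0, "staffing": 4},
    {"interfaces": 2, "input_information": 3, "tools_resources": 1, "staffing": 3},
    {"interfaces": 1, "input_information": 2, "tools_resources": 0, "staffing": 4},
]
means, flagged = summarize(sample)
print(flagged)
```

Repeating the same summary for each survey period would give management a trend per factor, which is the kind of accumulated data that allows countermeasures to be taken before failures occur.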

The second tool is based on models of organizational effectiveness that focus on the core processes by which an organization accomplishes its mission. This will be based on the approach we have developed for other industries, the nuclear industry initially and more recently the rail and medical industries. This approach is based on work by Reason, who performed a review of about 65 different models that describe in various ways the relationship of organizational processes to successful and safe outcomes. The result of this review was the identification of a set of common themes that collectively encompass the kinds of processes that are critical to organizational success in both safety and production through proactive risk management, as described by Wreathall & Merritt (2003).

The themes identified in the review are management commitment, awareness, preparedness, flexibility, reporting culture, learning culture, and opacity. Each of these themes has a particular meaning or significance in a different application domain. By customizing each of these themes for a particular domain, we can identify the potential sources of data from which managers at different levels (but aimed particularly at senior management) can assess the levels of performance and riskiness within their organization. It is important to note that, in most applications, very few (if any) new sources of data are needed; rather, it is a question of selecting the existing data that are particularly cogent for each of the themes and their customized form in a single domain.

The seven themes in highly resilient organizations are:

•  Top-level commitment: Top management recognizes the human performance concerns and tries to address them. It infuses the organization with a sense of the significance of human performance, provides continuous and extensive follow-through to actions related to human performance, and is seen to value human performance, both in word and deed.

•  Just culture: The organization supports the reporting of issues up through the organization, yet does not tolerate culpable behaviours. Without a just culture, the willingness of the workers to report problems will be much diminished, thereby limiting the ability of the organization to learn about weaknesses in its current defences.

•  Learning culture: A shorthand version of this theme is ‘How much does the organization respond to events with denial versus repair or true reform?’

•  Awareness: Data gathering that provides management with insights about what is going on regarding the quality of human performance at the plant, the extent to which it is a problem, and the current state of the defences.

•  Preparedness: ‘Being ahead’ of the problems in human performance. The organization actively anticipates problems and prepares for them.

•  Flexibility: The ability of the organization to adapt to new or complex problems in a way that maximizes its ability to solve the problem without disrupting overall functionality. It requires that people at the working level (particularly first-level supervisors) are able to make important decisions without having to wait unnecessarily for management instructions.

•  Opacity: The organization is aware of the boundaries and knows how close it is to ‘the edge’ in terms of degraded defenses and barriers.

An example is provided at the end of this chapter of how a product of work performed to develop leading indicators of organizational performance can be adapted to reflect the kinds of issues of interest in the development of resilience engineering.

Other techniques have been developed to measure safety culture within organizations and their impact on safety. See, for example, work by Flin et al., relating to such work in the oil industry (Mearns et al., 1998; Mearns et al., 2003). The need is now to tie this approach to the concepts of resilience, to provide knowledge inside the organization about what its levels of safety are now.

Resources and Defenses

As well as knowing the present state of safety in the organization, it is important that the organization has available appropriate levels of resources (particularly reserves) that can react to suddenly increasing challenges or the sudden onset of a major hazard. Reason has referred to this capability as providing ‘harm absorbers’, analogous to shock absorbers in mechanical systems. These resources can be material, such as providing additional staff to cope with significant challenges (e.g., dedicated emergency response teams), or they can be design-oriented, such as building in additional time for people to react (some have called this ‘white time’) so that plant and management personnel have time to reflect on the nature of the challenge and take appropriate responses.

While appealing in the abstract, these concepts need development to answer appropriate management questions, like ‘What kinds of resources do I really need?’, ‘How much in the way of resources, and at what cost?’, ‘When do I decide to deploy these resources?’, ‘What is the trigger?’, ‘How do I know my design ‘white time’ is adequate, and what am I giving up?’

Accompanying these questions is the overriding operational question: ‘When and how do I decide I should sacrifice productivity for safety?’ For example, it is when the production pressures are increasing that the need for greater safety questioning becomes more important. How can this be accomplished appropriately? How can the knowledge gained from the measures discussed above become useful?

On the matter of defenses, of course one class of defenses exists in the form of all the barriers that are built into the design, and are represented in PSA models and the like. These have been extended by people such as Hollnagel (2004) to include more abstract (non-visible) barriers, such as standards, codes of conduct, procedures, and the like. However, equally important in the context of resilience is the role of people in acting as positive promoters of safety. An example that has been identified in the world of medicine is where practices have evolved over time to cope with crises through changes in performance that are subtle (almost non-observable to the untrained eye). For instance, in one observed high-risk operation where the patient was suffering massive bleeding (a liver transplant), the anesthesiology team transformed from a routine of monitoring blood pressure and maintaining a regular blood transfusion supply into a crisis response team. The team added staff (the resources discussed above), coordinated their efforts to add substantial amounts of blood in a short time (without jeopardizing safety by [e.g.,] skipping blood type checks) with virtually no orders or commands, and, when the patient was stable, relaxed back to the earlier behavior. This was all accomplished in a quiet, low-key manner.

From a safety perspective, the behavior avoided what in PSA terms would have been an initiating event. This positive performance was the result of the constant challenges faced in surgery (this procedure is notorious for the amount of blood loss and therefore people are not surprised by the need for response), and involved well practised performance in the face of a substantial challenge (with roles and duties well understood). However, the positive side of human performance, of which this is one example, is rarely or never factored into formal safety analyses and little forethought is given in preparing for such performance. How can the organization prepare and take credit for such performance? What infrastructure is needed? What needs to take place for this behavior to be ‘normal’? Work is needed to provide formalisms to understand how to use this kind of behavior in safety analyses. Resilience provides a framework to explore this approach.

Understanding of Work as Performed, Not as Imagined

All of the above concepts are concerned with designing and monitoring the work in the organization. What seems to be a key factor in each of these examples of resilience engineering is to have a realistic understanding of how work is actually performed, and then to engineer all the tools and processes to exploit the beneficial features of that work (as with the case of the anesthesiologists in the transplant event), and to remove systems, processes and artifacts that get in the way of work being performed safely and effectively. The data gathering from the workers about ‘things that get in the way of working safely’ is one example. This same need applies at the organizational level as well as the workers’ level. How does the organization actually accomplish its work and how does that impact safety? Hence the need for measures of organizational behavior. This was already identified above as a needed area for work in resilience. How can this be accomplished?

Systems engineering techniques exist for describing formally the behavior of organizations and how in reality the organization manages to accomplish its goals. Many of these techniques stem from the work of the soft systems modelers in the late 1970s and 1980s, such as the work by Checkland (Checkland, 1981; Checkland & Scholes, 1990). Key elements of this approach involve systematic analysis of certain key facets of the organization’s behavior, such as how its commercial and regulatory environments affect its processes and standards, and associated decision-making. Leveson, Wreathall and others are looking at how this approach is being connected to resilience, and how it interacts with elements like the use of safety performance measures and cultural dimensions, for example.

At the levels of the individual workers, there is a need to consider the role of new technologies and how they may affect safety, such as creating new forms of hazards. Cook, Woods, Hollnagel, Wreathall and many others have written about examples of where new technologies have been introduced in the belief that they will eliminate known ‘human errors’ only to find that the potential for new error types has been overlooked, and that the new ‘error’ is possibly worse than the ones being eliminated.

Summary

Creating resilience engineering involves the development of several elements to create a set of tools that can, together, be used to enhance safety in the face of constant stresses and sudden threats. Work has already started on many of these tools, though not necessarily within the framework of resilience. This includes:

•  the development of organizational and other performance indicators that provide current and leading information on safety performance;

•  data analysis related to safety culture and climate, and an understanding of how they relate to performance;

•  observations about how work is carried out in the real world, both at the worker level and for the organization as a whole;

•  the timing and extent of resources that are necessary for ‘harm absorption’;

•  how work processes and human behavior act to make safety better through individual and small team activities, as well as act as sources of failures;

•  improved understanding of decision-making when it relates to sacrificing production goals to safety goals, how to accomplish it, and what resources are involved.

What are needed now are multidisciplinary efforts to combine these activities into an integrated body of knowledge and tools that allows management to take advantage of these ideas.

Example: Adaptation of Leading Indicators of Organizational Performance to Resilience Engineering Processes

Concrete examples of how some of the seven top-level issues used in the development of these indicators relate specifically to resilience are:

•  Flexibility: The stiffness of decision-making in the organization, and its failure to respond in a timely manner to an increasing need to revise its response to the pressures of production so as to allow increased protection, are typical of an organization that will have safety-related problems. Such a problem has been seen in several major incidents where the organization has maintained its fixation on production when the indications of a safety concern have been clear. Examples include the repeated violations at the Millstone nuclear power plant that led to the Nuclear Regulatory Commission issuing a fine of over $2 million in 1997 – see NY Times, “Owner of Connecticut Nuclear Plant Accepts a Record Fine,” September 28, 1997.¹

•  Opacity (and its corollary, observability): The extent to which information about safety concerns is kept closely held by a few individuals has been identified by analysts as a characteristic of organizations that are being set up for problems. For instance, Weick et al. (1999) refer to ‘collective mindfulness’ as a characteristic of highly reliable organizations: collective mindfulness includes the fact that safety issues and concerns are widely distributed throughout the organization at all levels.

•  Just Culture (also openness): The degree to which the reporting of safety concerns and problems is open and encouraged provides a significant source of resilience within the organization. The justice embedded in the just culture leads to the organization not penalizing the bearers of bad news – the opposite of what has been seen in the Millstone example above.

•  Management Commitment: The commitment of the management to balance carefully the acute pressures of production with the chronic pressures of protection is a true measure of resilience. Their willingness to invest in safety and to allocate resources to safety improvement in a timely, proactive manner is a key factor in ensuring a resilient organization.

Acknowledgments

The author wishes to acknowledge the support provided by NASA Ames Research Center under grant NNA04CK45A to develop resilience engineering concepts for managing organizational risk, and for the support of many in the nuclear power industry for the evolution of ideas described herein.

Remedies

Yushi Fujita

Knowing the existence of mismatches between reality and formality is the first step towards a better remedy. Enforcing rules without understanding the mismatches is not an effective remedy. Appropriate monitoring mechanisms are a prerequisite for knowing that mismatches exist, as are appropriate evaluation mechanisms for understanding them. These mechanisms should maintain independence from and authority over administrative mechanisms.

Respecting humans at the front end (e.g., operators, maintenance persons) is also a useful step towards a better remedy. They are the ones who best know the demanding reality, and how to cope with it. Overall safety has to be ensured by evaluating their behaviors, and corrective actions must be taken if their reactions are too risky.

1  Available at http://www.state.nv.us/nucwaste/news/nn10210.htm
