Epilogue: Resilience Engineering Precepts

Erik Hollnagel

David D. Woods

Safety is Not a System Property

One of the recurrent themes of this book is that safety is something a system or an organisation does, rather than something a system or an organisation has. In other words, it is not a system property that, once put in place, will remain. It is rather a characteristic of how a system performs. This creates the dilemma that safety is shown more by the absence of certain events – namely accidents – than by the presence of something. Indeed, the occurrence of an unwanted event need not mean that safety as such has failed, but could equally well be due to the fact that safety is never complete or absolute.

In consequence of this, resilience engineering abandons the search for safety as a property, whether defined through adherence to standard rules, in error taxonomies, or in ‘human error’ counts. By doing so it acknowledges the danger of the reification fallacy, i.e., the tendency to convert a complex process or abstract concept into a single entity or thing in itself (Gould, 1981, p. 24). Seeing resilience as a quality of functioning has two important consequences.

•  We can only measure the potential for resilience but not resilience itself. Safety has often been expressed by means of reliability, measured as the probability that a given function or component would fail under specific circumstances. It is, however, not enough that systems are reliable and that the probability of failure is below a certain value (cf. Chapter 16); they must also be resilient and have the ability to recover from irregular variations, disruptions and degradation of expected working conditions (a brief illustration of this distinction follows after the list).

•  Resilience cannot be engineered simply by introducing more procedures, safeguards, and barriers. Resilience engineering instead requires a continuous monitoring of system performance, of how things are done. In this respect resilience is tantamount to coping with complexity (Hollnagel & Woods, 2005), and to the ability to retain control.
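
As a purely illustrative aside – not part of the original argument – reliability is typically quantified as the probability of failure-free operation over some interval, for example under an assumed constant failure rate λ:

\[
R(t) = P(\text{no failure in } [0, t]) = e^{-\lambda t},
\qquad
P(\text{failure by time } t) = 1 - R(t).
\]

A requirement such as \(1 - R(t) < 10^{-4}\) can be stated and verified, yet it says nothing about how the system behaves once the rare failure, or an unanticipated disturbance, actually occurs; that capacity to recover is what the notion of resilience is meant to add.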

Resilience as a Form of Control

A system is in control if it is able to minimise or eliminate unwanted variability, either in its own performance, in the environment, or in both. The link between loss of control and the occurrence of unexpected events is so tight that a preponderance of the latter in practice is a signature of the former. Unexpected events are therefore often seen as a consequence of lost control. The loss of control is nevertheless not a necessary condition for unexpected events to occur. They may be due to other factors, causes and developments outside the boundaries of the system.

An unexpected event can also be a precipitating factor for loss of control and in this respect the relation to resilience is interesting. Knowing that control has been lost is of less value than knowing when control is going to be lost, i.e., when unexpected events are likely. In fact, according to the definition of resilience (Chapter 1), the fundamental characteristic of a resilient organisation is that it does not lose control of what it does, but is able to continue and rebound (Chapter 13).

In order to be in control it is necessary to know what has happened (the past), what happens (the present) and what may happen (the future), as well as knowing what to do and having the required resources to do it. If we consider joint cognitive systems in general, ranging from single individuals interacting with simple machines, such as a driver in a car, to groups engaged in complex collaborative undertakings, such as a team of doctors and nurses in the operating room, it soon becomes evident that a number of common conditions characterise how well they perform and when and how they lose control, regardless of domain. These conditions are lack of time, lack of knowledge, lack of competence, and lack of resources (Hollnagel & Woods, 2005, pp. 75-78).

Lack of time may come about for a number of reasons such as degraded functionality, inadequate or overoptimistic planning, undue demands from higher echelons or from the outside, etc. Lack of time is, however, quite often a consequence of lack of foresight, since that pushes the system into a mode of reactive responding. Knowing what happens and being able to respond are not by themselves sufficient to ensure control, since a system without anticipation is limited to purely reactive behaviour. That inevitably incurs a loss of time, both because the response must come after the fact and therefore be compensatory, and because the resources to respond may not always be ready when needed but first have to be marshalled. In consequence of that, a system that has to rely on feedback alone will in most cases sooner or later fall behind the pace of events and therefore lose control.
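
To make this point concrete, the following is a minimal, self-contained sketch (not taken from the book) of a process disturbed by events that keep growing in size. The purely reactive strategy compensates only for what has already been observed, and its response arrives after a fixed delay; the anticipatory strategy adds a crude forecast of where the disturbance will be when the response takes effect. All names and numbers are illustrative assumptions.

```python
# Toy illustration: why feedback-only control falls behind a growing disturbance.

DELAY = 3          # steps between observing a disturbance and the response taking effect
STEPS = 40

def disturbance(t):
    """A disturbance that keeps growing - the 'pace of events'."""
    return 0.5 * t

def run(anticipate: bool) -> float:
    state = 0.0
    pending = [0.0] * DELAY          # responses already decided but not yet applied
    worst_error = 0.0
    for t in range(STEPS):
        state += disturbance(t)      # the world moves on
        state -= pending.pop(0)      # a response decided DELAY steps ago arrives now
        observed = disturbance(t)    # feedback: what has already happened
        if anticipate:
            # crude foresight: extrapolate the disturbance to the moment
            # when the response will actually take effect
            observed += DELAY * (disturbance(t) - disturbance(t - 1))
        pending.append(observed)     # schedule the compensatory response
        worst_error = max(worst_error, abs(state))
    return worst_error

print("feedback only :", run(anticipate=False))   # residual error keeps growing
print("with foresight:", run(anticipate=True))    # residual error stays bounded
```

In the reactive run the residual error grows without bound, whereas in the anticipatory run it stays within the initial transient – a toy version of the claim that feedback alone sooner or later falls behind the pace of events.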

Knowledge is obviously important both for knowing what to expect (anticipation) and for knowing what to look for or where to focus next (attention, perception). This encapsulated experience is sometimes referred to as the system’s ‘model of the world’ and must as such be dynamic rather than static. Knowledge is, however, more than just experience; it also comprises the ability to go beyond experience, to expect the unexpected and to look for more than just the obvious. This ability, technically known as requisite imagination (Westrum, 1991; Adamski & Westrum, 2003), is a sine qua non for resilience.

Competence and resources are both important for the system’s ability to respond rationally.1 Competence refers to knowing what to do and knowing how to do it, whereas resources refer to the ability actually to do it. That the latter are essential is obvious from the fact that control is easily lost if the resources needed to implement the intended response are missing. This is, for instance, a common condition in the face of natural disasters such as wildfires, earthquakes, and pandemics.

Figure E.1 illustrates three qualities that a system must have to be able to remain in control, and therefore to be resilient, with time as a fourth, dependent quality. The three main qualities are not linked in the sense that anticipation precedes attention, which in turn precedes response. Although this ordering in some sense will be present for any specific instance that is described or analysed, the whole point about resilience is that these qualities must be exercised continuously. The system must constantly be watchful and prepared to respond. Additionally, it must constantly update its knowledge, competence and resources by learning from successes and failures – its own as well as those of others.

Figure E.1: Required qualities of a resilient system

It is interesting to note that Diamond (2005), in his book on how societies collapse, identifies three ‘stops on the road to failure’ (p. 419). These are: (1) the failure to anticipate a problem before it has arrived, (2) the failure to perceive a problem that has actually arrived, and (3) the failure to attempt to solve a problem once it has been perceived (rational bad behaviour). A society that collapses is arguably an extreme case of lack of resilience, yet it is probably no coincidence that we find the positive version of exactly the same characteristics in the general descriptions of what a system – or even an individual – needs to remain in control. A resilient system must have the ability to anticipate, perceive, and respond. Resilience engineering must therefore address the principles and methods by which these qualities can be brought about.

Readiness for Action

It is a depressing fact that examples of system failures are never hard to find. One such case, which fortunately left no one harmed, occurred during the editing of this book. As everybody remembers, a magnitude 9.3 earthquake occurred in the morning of December 26, 2004, off the west coast of northern Sumatra. This earthquake triggered a tsunami that swept across the Indian Ocean, killing more than 200,000 people. One predictable consequence of this most tragic disaster was that coastal regions around the world became acutely aware of the tsunami risk and therefore of the need to implement well-functioning early warning systems. In these cases there is little doubt about what to expect, what to look for, and what to do. So when a magnitude 7.2 earthquake occurred on June 14, 2005, about 140 kilometres off the town of Eureka in California, the tsunami warning system was ready and went into action.

As it happened, not one but two tsunami warning centres reacted. The first warning, covering the US and Canadian west coast, came from a centre in Alaska. Three minutes later a second message was issued by a centre in Hawaii. This second message stated that there was no risk of a tsunami, but it excluded the west coast north of California from that assessment. Rescue workers who missed this small but significant detail were understandably confused (Biever & Hecht, 2005, p. 24).

Tsunami warnings are broadcast via radio by the US National Oceanic and Atmospheric Administration (NOAA). Unfortunately, some locations cannot receive the NOAA radio signals because they are blocked by mountains. They are therefore contacted by phone from Seattle. On the day in question, however, a phone line was down, so the message did not get through, effectively leaving some areas without warning. This glitch was not noticed at the time. As it happened, the earthquake was of a type that could not give rise to a tsunami, and the warning was cancelled after one hour.

This example illustrates a system that was not resilient, despite being able to detect the risk in time. While precautions had been taken and procedures put in place, there was no awareness of whether they actually worked and no understanding of what the actual conditions were. The specific shortcoming was one of communication, in the form of inconsistent warnings and a lack of feedback, and the consequence was a partial lack of readiness to respond. Using the terminology proposed in Chapter 21, the communication failure meant that some districts did not go into the required state of high alert in preparation for an evacuation. While the tsunami warning system was designed to look for specific factors in the environment, it was not designed to look at itself, to ensure that the ‘internal’ functions worked. The system was designed to be safe by means of all the technology and procedures that were put in place, but it was not designed to be resilient.
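
As a purely hypothetical illustration of what ‘looking at itself’ could mean in such a dissemination chain, the sketch below checks the health of each warning channel and requires an acknowledgement, so that a dead phone line becomes visible instead of failing silently. The channel names, the health check and the fallback are assumptions made for illustration only; they do not describe the actual NOAA system.

```python
# Hypothetical sketch: a warning dissemination step that monitors its own
# 'internal' functions (channel health, delivery acknowledgement) rather than
# only watching the environment.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Channel:
    name: str
    send: Callable[[str], bool]      # returns True only if delivery was acknowledged
    is_alive: Callable[[], bool]     # internal health check (e.g., line test, heartbeat)

def disseminate(message: str, channels: List[Channel],
                fallback: Callable[[str, str], None]) -> None:
    for ch in channels:
        if not ch.is_alive():
            # The system notices its own degraded function instead of failing silently.
            fallback(ch.name, message)
            continue
        if not ch.send(message):
            fallback(ch.name, message)

# Toy usage: the radio channel works, the phone line is down.
if __name__ == "__main__":
    radio = Channel("radio", send=lambda m: True, is_alive=lambda: True)
    phone = Channel("phone-seattle", send=lambda m: False, is_alive=lambda: False)

    def escalate(channel_name: str, message: str) -> None:
        print(f"ALERT: channel '{channel_name}' failed; escalate '{message}' by other means")

    disseminate("Tsunami warning: move to high ground", [radio, phone], escalate)
```

The point of the sketch is not the particular mechanism but the design principle: the warning function monitors its own state as well as the environment, which is what the actual system lacked.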

Why Things Go Wrong

It is a universal experience that things sooner or later will go wrong,2 and fields such as risk analysis and human reliability assessment have developed a plethora of methods to help us predict when and how it may happen. From the point of view of resilience engineering it is, however, at least as important to understand why things go wrong. One expression of this is found in the several accident theories that have been proposed over the years (e.g., Hollnagel, 2004), not least the many theories of ‘human error’ and organisational failure. Most such efforts have concentrated on the problems found in technical or socio-technical systems. Yet there have also been attempts to look at the larger issues, most notably Robert Merton’s lucid analysis of why social actions often have unanticipated consequences (Merton, 1936).

It is almost trivial to say that we need a model, or a frame of reference, to be able to understand issues such as safety and resilience and to think about how safety can be ensured, maintained, and improved. A model helps us to determine which information to look for and brings some kind of order into chaos by providing the means by which relationships can be explained. This obviously applies not only to industrial safety, but to every human endeavour and industry. To do so, the model must in practice fulfil two requirements. First, it must provide an explanation or bring about an understanding of an event such that effective mitigating actions can be devised. Second, it must be usable with a reasonable investment of effort – intellectual effort, as well as time and resources. A model that is cumbersome and costly to use will, from a practical point of view, be at a disadvantage from the very start, even if it provides a better explanation.3 The trick is therefore to find a model that is at the same time simple enough to be used without engendering problems or requiring too much specialised knowledge, yet powerful enough to go beneath the often deceptive surface descriptions.

The problem with any powerful model is that it very quickly becomes ‘second nature’, which means that we no longer realise the simplifications it embodies. This should, however, not lead to the conclusion that we must give up on models and try to describe reality as it really is, since that is a philosophically naïve notion. The consequence is rather that we should acknowledge the simplifications that the models bring, and carefully weigh advantages against disadvantages so that a choice of model is made knowingly.

Several models have been mentioned in the chapters of this book. The most important models in the past have been the Domino model and the Swiss cheese model (Chapter 1). Both are easy to comprehend and have been immensely helpful in improving the understanding of accidents. Yet their simplicity also means that some aspects cannot easily be described, or cannot be described at all, and that explanations in terms of the models therefore may be incomplete. (Strictly speaking, the two models are metaphors rather than models. In one, accidents are likened to a row of dominoes falling; in the other, to harmful influences passing through a series of aligned holes.)

In the case of the Domino model, it is clear that the real world has no domino pieces waiting to fall. There may be precariously poised systems or subsystems that may suddenly change from a normal to an abnormal state, but that transition is rarely as simple as a domino falling. Likewise, the linking or coupling between dominoes is never as simple as the model shows. Similarly, the Swiss cheese model does not suggest that we should look for slices of cheese or holes, or measure the size of the holes or the movements of the slices. The Swiss cheese model rather serves to emphasise the importance of latent conditions and to illustrate how these, in combination with active failures, may lead to accidents.

The Domino and Swiss cheese models are useful to explain the abrupt, unexpected onset of accidents, but have problems in accounting for the gradual loss of safety that may also lead to accidents. In order to overcome this problem, a model of ‘drift to danger’ has been used, for example in Chapter 3. Although the metaphor of drift introduces an important dynamic aspect, it should not be taken literally or as a model, for the following reasons:

•  Since the boundaries or margins only exist in a metaphorical sense, or perhaps as emergent descriptions (Cook & Rasmussen, 2005), there is really no way in which an organisation can ‘sail close’ to an area of danger, nor any way in which the ‘distance’ can be measured. ‘Drift’ then only refers to how a series of individual actions or decisions have larger, combined and longer-term impacts on system properties that are missed or underappreciated.

•  The metaphor itself oversimplifies the situation by referring to the organisation as a whole. There is ample practical experience to show that some parts of an organisation may be safe while others may be unsafe. In other words, parts of the organisation may ‘drift’ in different directions. The safety of the organisation can furthermore not be derived from a linear combination of the parts, but rather depends on the ways in which they are coupled and how coordination across these parts is fragmented or synchronised (cf. Perrow, 1984). This is also the reason why accidents in a very fundamental sense are non-linear phenomena.

•  Finally, there are no external forces that, like the wind, push an organisation in some direction, or allow the ‘captain’ to steer it clear of danger. What happens is rather that choices and decisions made during daily work may have long-term consequences that are not considered at the time. There can be many reasons for this, such as the lack of proper ‘conceptual’ tools or a shortage of time.

It is inevitable that organisational practices change as part of daily work, one simple reason being that the environment is partly unpredictable, changing, or semi-erratic. Such changes are needed either for purposes of safety or efficiency, though mostly the latter. Indeed, the most important factor is probably the need to gain time in order to prevent control from being lost, as described by the efficiency-thoroughness trade-off (ETTO; Hollnagel, 2004). There is never enough time to be sufficiently thorough; finishing an activity in time may be important for other actions or events, which in turn cannot be postponed because yet others depend on them, etc. The reality of this tight coupling is probably best illustrated by the type of industrial action that consists in ‘working to rule.’ This also provides a powerful demonstration of how important the everyday trade-offs and shortcuts are for the normal functioning of a system.

Changed practices to improve efficiency often have long-term consequences that affect safety, although for one reason or another they are disregarded when the changes are made. These consequences are usually latent and therefore only show themselves after a while. Drift is therefore nothing more than an accumulated effect of latent consequences, which in turn result from the trade-off or sacrificing decisions that are required to keep the system running.

A Constant Sense of Unease

Sacrificing decisions take place at both the individual and the organisational level – and even at the level of societies. While they are necessary to cope with a partly unpredictable environment, they constitute a source of risk when they become entrenched in institutional or organisational norms. When trade-offs and sacrificing decisions become habitual, they are usually forgotten. Being alert or critical incurs a cost that no individual or organisation can sustain permanently, and is therefore exercised only when necessary. Norms qua norms are for that reason rarely challenged. Yet it is important for resilience that norms remain conspicuous, not in the sense that they must constantly be scrutinised and revised, but in the sense that their existence is not forgotten and their assumptions not taken for granted.

Resilience requires a constant sense of unease that prevents complacency. It requires a realistic sense of abilities, of ‘where we are’. It requires knowledge of what has happened, what happens, and what will happen, as well as of what to do. A resilient system must be proactive, flexible, adaptive, and prepared. It must be aware of the impact of actions, as well as of the failure to take action.

Precepts

The purpose of this book has been to propose resilience engineering as a step forward from traditional safety engineering techniques – such as those developed in risk analysis and probabilistic safety assessment (PSA). Rather than try to force adaptive processes and organisational factors into these families of measures and methods, resilience engineering recognises the need to study safety as a process, to provide new measures, new ways to monitor systems, and new ways to intervene to improve safety. Thinking in terms of resilience shifts inquiry to the nature of the ‘surprises’ or types of variability that challenge control.

•  If ‘surprises’ are seen as disturbances, or disrupting events, which challenge the proper functioning of a process, then inquiry centres on how to keep a process under control in the face of such disrupting events, specifically on how to ensure that people do not exceed given ‘limits.’

•  If ‘surprises’ are seen as uncertainty about the future, then inquiry centres on developing ways to improve the ability to anticipate and respond when so challenged.

•  If ‘surprises’ are seen as recognition of the need constantly to update definitions of the difference between success and failure, then inquiry centres on the kinds of variations which our systems should be able to handle and ways constantly to test the system’s ability to handle these classes of variations.

•  If ‘surprises’ are seen as recognition that models and plans are likely to be incomplete or wrong, despite our best efforts, then inquiry centres on the search for the boundaries of our assessments in order to learn and revise.

Resilience engineering entails a shift from an over-reliance on analysis techniques to adaptive and co-adaptive models and measures as the basis for safety management. Just as it acknowledges and tries to avoid the risks of reification (cf. above), it also acknowledges and tries to avoid the risks of oversimplifications, such as:

•  working from static snapshots, rather than recognising that safety emerges from dynamic processes;

•  looking for separable or independent factors, rather than examining the interactions across factors; and

•  modelling accidents as chains of causality, rather than as the result of tight couplings and functional resonance.

It is fundamental for resilience engineering to monitor and learn from the gap between work as imagined and work as practised. Anything that obscures this gap will make it impossible for the organisation to calibrate its understanding or model of itself and thereby undermine processes of learning and improvement. Understanding what produces the gap can drive learning and improvement and prevent dependence on local workarounds or conformity with distant policies. There was universal agreement among the symposium attendees that previous research supports the above as a critical first principle. The practical problem is how to monitor this gap and how to channel what is learned into organisational practice.
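
One crude, hypothetical way to begin making such a gap visible – assuming, purely for the sake of illustration, that both the prescribed procedure and the observed practice can be written down as ordered lists of steps – is sketched below. Real work is of course far richer than such lists, and the sketch is not a proposal made in the book.

```python
# Illustrative sketch: comparing work-as-imagined with work-as-practised,
# assuming both can be expressed as ordered lists of steps.

from difflib import SequenceMatcher

procedure_as_written = ["verify alarm", "notify supervisor", "start pump", "log event"]
work_as_observed     = ["start pump", "verify alarm", "log event"]

matcher = SequenceMatcher(None, procedure_as_written, work_as_observed)
print(f"overlap ratio: {matcher.ratio():.2f}")   # 1.0 would mean no visible gap

# Show where imagined and practised work diverge.
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag != "equal":
        print(tag,
              "imagined:", procedure_as_written[i1:i2],
              "practised:", work_as_observed[j1:j2])
```

Even such a toy comparison makes the harder questions visible: how observed practice is captured in the first place, and how the differences are interpreted and fed back into organisational learning rather than treated as violations.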

The Way Ahead

This book boldly asserts that sufficient progress has been made on resilience as an alternative safety management paradigm to begin to deploy that knowledge in the form of engineering management techniques. The essential constituents of resilience engineering are already at hand. Since the beginning of the 1990s there has been a steady evolution of the principles of organisational resilience and of the understanding of the factors that determine human and organisational performance. As a result, there is an appreciable basis for how to incorporate human and organisational risk in life cycle systems engineering tools and how to build knowledge management tools that proactively capture how human and organisational factors affect risk.

While additional studies can continue to document the role played by adaptive processes for how safety is created in complex systems, this book marks the beginning of a transition in resilience engineering from research questions to engineering management tools. Such tools are needed to improve the effectiveness and safety of organisations confronted by high hazard and high performance demands. In particular, we believe that further advances in the resilience paradigm should occur through deploying the new measures and techniques in partnership with management for actual hazardous processes. Such projects will have the dual goals of simultaneously advancing the research base on resilience and tuning practical measurement and management tools to function more effectively in actual organisational decision-making.

1  ‘Rational’ is not used here in the traditional, normative meaning, but rather to denote the quality of being anti-entropic, cf. Hollnagel (2005).

2  This is often expressed in terms of Murphy’s law, the common version of which is that ‘everything that can go wrong, will’. A much earlier version is Spode’s law, which says that ‘if something can go wrong, it will.’ It is named after the English potter Josiah Spode (1733–97) who became famous for perfecting the transfer printing process and for developing fine bone china – but presumably not without many failed attempts on the way.

3  ‘Better’ is, of course, a rather dangerous term to use since it implies that some objective criterion or standard is available. Although there is no truth to be used as a point of reference, it is possible to show that one explanation – under given conditions – may be better than another, e.g., in providing more effective countermeasures. Changes are, however, never contemplated sub specie aeternitatis but are always subject to often very mundane or pecuniary considerations.
