Chapter 21

States of Resilience

Erik Hollnagel

Gunilla Sundström

Introduction

A resilient system, or organisation, is able to withstand the effects of stress and strain and to recover from adverse conditions over long time periods. One way of describing that is to think of a system as being in one of several states, where a state can be defined as ‘any well-defined condition or property that can be recognised if it occurs again’ (Ashby, 1956, p. 25). The set of possible states constitutes the so-called state space, and each state has a set of associated conditions that specify whether the system changes to another state. The state space description is ubiquitous and can be applied to almost any type of system. The simplest possible case is, of course, a system that can be in either of two states or conditions. This applies to an uncomplicated device such as an electrical switch, which can be either ‘on’ or ‘off’, but also to a complex system such as a national economy that can be either growing or receding. A car engine might be described by means of three states, namely ‘stopped’, ‘idle’, or ‘running’. Even a mighty nation can – from one perspective – be characterised as being in one of five states according to the threat condition, such as ‘green: low’, ‘blue: guarded’, ‘yellow: elevated’, ‘orange: high’, and ‘red: severe’.

In Chapter 15, a business system was described as being in a healthy, an unhealthy or a catastrophic state with the possible transitions being shown in Figure 15.2. Note that the state space in this case, as well as the case of the five ‘threat’ states, is one-dimensional, which means that it is only possible to make transitions between adjacent states. To take a slightly more complex example, consider the following description of the state space of a power plant. Here there are six different states, as well as the possible transitions between them. Without going into any details, the diagram in Figure 21.1 shows that while some transitions are bi-directional, e.g., between ‘normal operation’ and ‘disturbance’, others are not reversible but require that the system passes through one or more intermediate states. For instance, once the system has gone to a state of ‘emergency’ it can only return to ‘normal operation’ via an ‘idle’ state.

Figure 21.1: State-space diagram for power production

Resilience and State-space Transitions

The terminology used in Figure 21.1 is typical for systems that produce something physical, matter or energy, where the focus is on operations rather than services. The same principles can, however, be used to characterise various kinds of organisations, such as financial services firms, government departments, regulatory entities, rescue organisations, etc. Even if we disregard the extreme periods of a system’s life, hence exclude embryonic and decaying organisations, it is easy to recognise characteristic states that are more detailed than the triad used in Chapter 15 (healthy, unhealthy, catastrophic), cf. Figure 21.2. There must clearly always be a state of normal functioning, where the system provides or produces what it is intended to in a reliable and, if required, profitable manner. There will usually also be a state of regular reduced functioning, for instance during night time or holidays, as well as a state of irregular reduced functioning due to a lack of internal resources (illness among staff, equipment failures, industrial actions, and the like). The transitions between normal and regular reduced functioning are scheduled, whereas the transitions between normal and irregular reduced functioning usually are unexpected and represent a temporary loss of control. Since even the most sanguine managers know that things can go wrong, a state of irregular reduced functioning has usually also been anticipated and adequate recovery functions are therefore hopefully in place.

Figure 21.2: State-space diagram for service organisations

There will, unfortunately, also always be a state of disturbed functioning, corresponding to the unhealthy or even catastrophic states of Chapter 15. Figure 21.2 only shows one state of disturbed functioning, but there may of course be several. Indeed, it may be the mark of a resilient organisation that it has a number of different modes of functioning whenever a disturbance happens. The transition to a state of disturbed functioning clearly represents a loss of control. The return to a state of normal functioning may either be through a direct recovery, or in the case of severe disturbances via a state of repair. In extreme cases there is, of course, the possibility that the system ceases to exist after a severe disturbance, as the case of Barings PLC demonstrated. The state transitions described here are representative but not complete; under unusual circumstances other transitions may become possible.
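The states and transitions described for a service organisation can be sketched as a small transition table. This is only an illustrative reading of the text, not a reproduction of Figure 21.2: the identifier names are hypothetical, and the assumption that a disturbance can strike from either reduced state as well as from normal functioning is ours.

```python
from enum import Enum, auto


class State(Enum):
    NORMAL = auto()             # normal functioning
    REGULAR_REDUCED = auto()    # scheduled reduction (nights, holidays)
    IRREGULAR_REDUCED = auto()  # unexpected lack of internal resources
    DISTURBED = auto()          # loss of control
    REPAIR = auto()             # recovery path after a severe disturbance
    DEFUNCT = auto()            # the system has ceased to exist


# Allowed transitions, read from the prose description.
# Assumption: DISTURBED is reachable from NORMAL and from both reduced states.
TRANSITIONS = {
    State.NORMAL: {State.REGULAR_REDUCED, State.IRREGULAR_REDUCED, State.DISTURBED},
    State.REGULAR_REDUCED: {State.NORMAL, State.DISTURBED},
    State.IRREGULAR_REDUCED: {State.NORMAL, State.DISTURBED},
    State.DISTURBED: {State.NORMAL, State.REPAIR, State.DEFUNCT},
    State.REPAIR: {State.NORMAL},
    State.DEFUNCT: set(),  # no way back once the system ceases to exist
}


def can_transition(src: State, dst: State) -> bool:
    """Return True if the table allows a direct move from src to dst."""
    return dst in TRANSITIONS[src]


def is_valid_path(path: list[State]) -> bool:
    """Return True if every consecutive pair in path is an allowed transition."""
    return all(can_transition(a, b) for a, b in zip(path, path[1:]))
```

With this table, a severe disturbance followed by repair, `[State.NORMAL, State.DISTURBED, State.REPAIR, State.NORMAL]`, is a valid path, whereas any path leaving `State.DEFUNCT` is not. Additional transitions that become possible under unusual circumstances would be added as further entries.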

The Sumatra Tsunami Disaster

To illustrate how a state space description can be used to understand the nature of resilience, consider how an organisation responds to a severe event. A resilient organisation should clearly be able to respond appropriately and as fast as possible in order not to lose control. Examples of how that happens – or rather, how it does not happen – are often seen in the case of major disasters, either natural or man-made. Some of the more spectacular cases in recent years are the fire in the Mont Blanc tunnel on March 24, 1999 and the tsunami in Asia on December 26, 2004. (One might also mention the collapse of Barings PLC described in Chapter 15.) While resilience is a quality that is essential for the response to all disturbances, be they large or small, the determining characteristics are often easier to note in the case of events of an unusual scale or severity. The example chosen here is the response of the Swedish Foreign Department to the tsunami that followed the earthquake off Sumatra on Sunday, December 26, 2004. The reason for using that is not that the Foreign Department was particularly inept in its handling of the situation, but merely that the information about this case is easily accessible. (Although the Swedish Foreign Department in the aftermath of the disaster was severely criticised at home, foreign departments in several other European countries got a similar treatment from their national press.)

A tsunami following a major earthquake is the type of event that requires a fast and appropriate response. It is, however, also a type of event that is very unusual and one that was not specifically foreseen in the procedures of the Foreign Department. Although the Foreign Department received information about the disaster around 5 o’clock in the morning of December 26, it did not begin to respond effectively until the next day. In the aftermath the Secretary of Foreign Affairs openly admitted that no one at the Foreign Department understood the scope of the disaster, and that she herself did not even know where Phuket was. (As it happened, the Phuket area was a popular destination for Swedish travellers. Several thousand visitors from Sweden were there during the 2004 Christmas holidays and 544 of them either died or went missing after the tsunami.) Partly because of that, and partly because of the holidays, the Foreign Department was not fully staffed and therefore responded very slowly. In contrast, the charter companies that were responsible for the tourists in Thailand, the Swedish police, and the Swedish Rescue Services Agency all went into a state of high alert within hours. Indeed, one of the travel companies had posted information about the tsunami around 7 o’clock in the morning, i.e., about two hours after they received the information from Thailand.

In terms of the concepts described above, the Foreign Department remained for too long in the regular reduced functioning state (probably due in part to the Christmas holiday season). There were no clear routines or procedures for what to do, and because of that offers of assistance, e.g., from the police, were turned down. The situation was further aggravated by the department’s utter failure to realise what was going on, despite the fact that the TV channels had extra newscasts throughout the day. When the magnitude of the disaster was finally realised, there was little capability to do anything.

If we consider the experiences from this disaster in terms of a state transition description, resilience can be seen to require three things: an ability to recognise that conditions have changed and that there is a need to respond, a set of transition rules (procedures or routines that allow the system to transition from one state to the other by changing its mode of operation), and a readiness or capability for getting on with the tasks at hand once the new state has been entered. To that may be added a fourth, namely the ability to maintain normal operations throughout. This is, however, not an absolute requirement, but depends on the type of organisation. For a foreign department it is clearly desirable to maintain normal operations even when an emergency arises; for the charter companies it was possible to shift the whole organisation into a new mode and suspend all normal functions as long as it lasted.
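The three requirements can be expressed as a simple checklist that reports which of them is missing for a given transition. The sketch below is only an illustration of that reading; the class, field, and function names are hypothetical, and the example values are our interpretation of the account given above, not measured data.

```python
from dataclasses import dataclass


@dataclass
class TransitionReadiness:
    change_recognised: bool   # was the changed situation recognised?
    procedure_exists: bool    # are there transition rules (procedures, routines)?
    target_state_ready: bool  # is there capability to act once in the new state?


def missing_requirements(r: TransitionReadiness) -> list[str]:
    """List the requirements for a timely state transition that are not met."""
    missing = []
    if not r.change_recognised:
        missing.append("recognition")
    if not r.procedure_exists:
        missing.append("transition rules")
    if not r.target_state_ready:
        missing.append("readiness")
    return missing


# Illustrative reading of the text: the scope of the disaster was not understood,
# there were no clear routines, and there was little capability to act.
foreign_department = TransitionReadiness(
    change_recognised=False,
    procedure_exists=False,
    target_state_ready=False,
)
```

Calling `missing_requirements(foreign_department)` returns all three entries, whereas an organisation that meets every requirement yields an empty list. The optional fourth ability, maintaining normal operations throughout, is left out because it depends on the type of organisation.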

Conclusions

It is a common view that delays in changes or transitions often are due to fossilised organisational structures or inadequate and/or conflicting policies. Many organisations develop a cumbersome bureaucracy with built-in social conflicts where people at various levels are unable to, or incapable of, making the appropriate decisions. Decisions cannot be made on the spot but have to be passed through organisational layers until they reach a proper level, often at the top of the hierarchy. The resulting decisions and their translation into actions then have to go through the reverse process – although this often is considerably faster, partly because the decision resulted in a state change. Decision makers may with hindsight often state that they did not fully appreciate the seriousness of the situation, or that procedures were not in place for the required action. Yet from the point of view of resilience engineering it simply means that the organisation was not sufficiently resilient.

A resilient organisation must, however, not only be able to change from one state to a more appropriate one in time, but also be able to return to normal functioning when the alerting or unusual conditions are over. This does not necessarily mean that it should go back to what were normal procedures before the events, since the world may have changed. But it means that it should be able to resume durable and sustainable performance, in the sense of attaining at least the same quantity and quality of whatever it produces, as it did before the disrupting events.

The ability to rebound or recover again requires that the cessation of the abnormal state is detected, that there are proper procedures for returning or reverting to ‘normal’, and that the functions and capabilities required for a normal – or a revised normal – operation are in place. In cases when the emergency and normal operations run in parallel, returning to normal conditions is fairly simple. The normal system state must, however, allow the organisation somehow to absorb the resources that are released when the emergency operation comes to an end, possibly by keeping them idle.

In summary, the state transition analogy offers a number of concrete (potential) measures and procedures for the resilience engineering toolbox.

•  First, there is the issue of being able to recognise the changed situation, i.e., the triggering conditions for the transition. Here a host of organisational, psychological, and social factors may work against that – ignorance, biases and vested interests – in addition to more ‘simple’ technical problems such as lack of sufficient data. The associated measure is the organisation’s ability to diagnose, assess or understand the situation and anticipate the consequences. This ability is typically documented by rules and procedures, time and resources allocated to this task, etc.

•  Second, there is the issue of procedures or routines (i.e., change management processes) for the transition. Do people in the organisation know what to do? Are command lines defined and generally understood? Is it known who has responsibility, and whether the normal roles change? Is there a script or scenario, a set of guidelines or even a set of instructions that can be applied?

•  Third, does the ‘new’ (i.e., receiving) state provide the resources and capabilities needed for the desired functionality? Is there perhaps a back-up organisation ready? Do physical and communication facilities exist? Are required supplies and resources available? Are there sufficient human resources, human capital and experience? The associated measure is whether the organisation is ready to respond when a new state is entered, i.e., whether strategies, policies, procedures, resources, capabilities, and so on exist. Are there regular exercises? Do people know their roles and responsibilities?

•  A possible fourth issue is whether the transition is directly reversible in principle and/or in practice, or whether the organisation must pass through a repair or reconstruction state. At issue here is whether the organisation is able to stop or dismantle and/or significantly re-engineer an operation, i.e., to return from an abnormal to a normal state. Related to that is the issue of whether the cessation of a condition can be detected when it happens, without being declared too early.

In relation to the first step, recognising the changed situation, one additional issue is how obvious the change must be. This relates to the discussion elsewhere (Chapter 1) about being reactive or proactive. In the reactive case, which is the most common, a time lag necessarily exists between the change and its detection, but a resilient organisation will ensure that it is as small as possible. In the proactive case, the time lag is reduced or perhaps even eliminated. Yet even in this case the response is based on some cues; and the uncertainty of the outcome is increased in the sense that the action(s) taken may be inappropriate.
