Epilogue: RAG – The Resilience Analysis Grid

Erik Hollnagel

Resilience is defined as the intrinsic ability of a system to adjust its functioning prior to, during, or following changes and disturbances, so that it can sustain required operations under both expected and unexpected conditions. Since resilience refers to something a system does (a capability or a process) rather than to something the system has (a product), it cannot be measured by counting specific outcomes such as accidents or incidents. This chapter describes an approach to measuring the resilience of a system that focuses on the four main abilities that together constitute resilience: the ability to respond, the ability to monitor, the ability to anticipate, and the ability to learn. These abilities can be assessed by means of a number of questions, and the answers can be represented in an easily comprehensible graphical form. This form can be used to compare consecutive measurements, and thereby serve as a way to support the management of a system’s resilience.

Introduction

Resilience Engineering is concerned not only with what makes systems resilient and how to make them resilient (i.e., to engineer resilience), but also with how to maintain or manage the resilience of a system. Since resilience refers to a quality rather than a quantity, to something that a system does rather than to something that a system has, managing resilience can be seen as a kind of process control. In order to manage or control a process, whether it is the resilience of a system or the steering of a vessel from point A to point B in an archipelago, three things are necessary. It is first of all necessary to know the current status or present position, that is, where one actually is at the moment. Second, it is necessary to know what the goal is, to have a clear idea about what the future status or position should be and therefore to know in which direction to move (using a Euclidean metaphor). And finally, it is necessary to know how a change can be made, specifically a change in direction, in magnitude, in speed, etc. It is, in other words, necessary to know the means by which a specific change can be brought about. While all three are essential for effectively managing a system’s resilience, this chapter will focus on the first.

Resilience versus Safety

A system is usually considered safe if the number of adverse outcomes can be kept acceptably low. This usually means the accidents and incidents that may happen, but can also include adverse outcomes of other types such as work-time injuries, work-related illnesses, etc. The advantage of defining safety in this manner is that the level of safety can be measured by counting various types of outcomes. The common understanding is, of course, that a higher level of safety corresponds to fewer adverse outcomes – and vice versa. One example of that is the International Civil Aviation Organisation’s definition of safety as ‘the state in which the risk of harm to persons or of property damage is reduced to, and maintained at or below, an acceptable level through a continuing process of hazard identification and risk management’ (ICAO, 2006: 1). In a similar vein, the Patient Safety Indicator guide published by the US AHRQ (Agency for Healthcare Research and Quality) defines safety as ‘freedom from accidental injury’, or ‘avoiding injuries or harm to patients from care that is intended to help them’. In other words, safety is defined by the absence of adverse outcomes.

There is, however, more to safety than just reducing the number of adverse events. Resilience Engineering argues that it is necessary to focus on what can go right as well as on what can go wrong. From a Resilience Engineering perspective, failures arise from the adjustments needed to cope with the underspecification of the real world rather than from a breakdown or malfunctioning of normal system functions. Being a practical discipline, Resilience Engineering therefore looks for ways to enhance the ability of systems to continue to function in as many different situations as possible, and safety is consequently defined as the ability to succeed under varying conditions. This definition includes the traditional meaning of safety, both because failures will adversely affect the ability to succeed and because an increase in the number of things that go right means a decrease in the number of things that go wrong. But the definition also focuses on the system’s ability to function under varying conditions, with consequences for how resilience is measured and for how it is managed.

Reactive and Proactive Adjustments

The key feature of a resilient system is its ability to adjust its performance. Adjustments to how things are done can, in principle, be reactive and take place after something has happened, be concurrent and take place while something happens, or be proactive and take place before something happens.

•  Reactive adjustments are by far the most common. They happen in the aftermath of an event, for instance following the recommendations issued after an accident investigation, or as the ‘lessons learned’ from a major change or disruption. Responding when something has happened cannot, however, guarantee a system’s safety and survivability, even if the response is fast. This is because a system can only be prepared to respond immediately to a limited set of events. Since it will take longer to respond to all other events, the response is less likely to be effective.

•  Concurrent adjustments are basically fast reactive adjustments that take place while a situation is still developing. For instance, if there is a major accident in a community, such as a large fire or an explosion, local hospitals will change their state of functioning and prepare for the rush of people that may have been hurt (Cook and Nemeth, 2006). Concurrent adjustments are the basis for continuous regulation, as described by the common feedback control loop (cf., Chapter 5).

•  Proactive adjustments mean that the system can change from a state of normal operation to a state of heightened readiness, and possibly also act, before something happens. A state of readiness means that resources are allocated to match the needs of the expected event, that special functions are activated, that defences are increased, etc. An everyday example from the world of aviation is fastening the seat belts before take-off and landing, or during turbulence. In these cases, the criteria for changing from a normal state to a state of readiness are clear. In other cases it may be less obvious, either because of a lack of experience or because the validity of indicators is uncertain. Examples of that can be found in financial systems and in earthquake prediction. An obvious advantage of acting before something happens is that fewer resources may be needed, since the conditions may not yet have become critical.

All systems must be able to respond or change the way things are done when something has happened, since they otherwise will become extinct. (The only theoretically possible exception is systems for which the environment is perfectly predictable, for instance because it never changes.) The obvious advantage of proactive adjustments is that they may ‘buy time’, whereas reactive adjustments will always ‘take time’. Proactive adjustments can be strategic or tactical, depending on their time horizon. The potential gain is unfortunately limited by the uncertainty of whether a chosen response is the right one. On the other hand, control that is based exclusively on feedback, that is, on responding when something has happened, may quickly deteriorate into opportunistic ‘fire fighting’ and eventually scrambled responses, leading to a loss of control (Hollnagel, 1993).

The Four Essential Capabilities of Resilience

The above working definition of resilience can be made more concrete by considering four essential capabilities of resilience (cf., Figure E.1), namely:

•  Knowing what to do, or being able to respond to regular and irregular variability, disturbances, and opportunities either by adjusting the way things are done or by activating ready-made responses. This is the capability to address the actual.

•  Knowing what to look for, or being able to monitor that which changes, or may change, so much in the near term that it will require a response. The monitoring must cover the system’s own performance as well as changes in the environment. This is the capability to address the critical.

•  Knowing what to expect, or being able to anticipate developments, threats, and opportunities further into the future, such as potential disruptions or changing operating conditions. This is the capability to address the potential.

•  Knowing what has happened, or being able to learn from experience, in particular to learn the right lessons from the right experience. This is the capability to address the factual.

Figure E.1  The four main capabilities of a resilient system

Resilience Indicators

As mentioned in the Introduction, three things are necessary in order to be able to manage a process. The first requirement is to know what the current status or position is, in other words, to find appropriate indicators or measures – but of resilience rather than of safety. This involves several critical issues:

•  Can the values of the indicators be rendered in a concise manner, either quantitative or qualitative?

•  Are the indicators well defined, reliable and valid?

•  Are the indicators objective, meaning that their interpretation is normative, or are they subjective, meaning that their interpretation depends on who looks at them?

•  Are the indicators sufficiently sensitive to change, i.e., can the effects of a change be seen within a reasonable amount of time? (Another way of putting that is whether the indicators make concurrent control possible.)

•  Are the indicators ‘lagging’, ‘current’, or ‘leading’, that is, do they represent a past state, the present state, or can they be interpreted as indicating a future state or development?

•  Can the indicators be used as a basis for concrete actions within the operational context?

•  Are the indicators easy to use (‘cheap’) or are they difficult to use (‘costly’)?

Measurements of Safety

It is quite understandable that safety indicators or safety measurements traditionally have focused on adverse outcomes, since these represent something that any system would want to avoid. Adverse outcomes also naturally attract attention both in terms of their direct effects (loss of life, property and money) and in terms of their indirect effects (disruption of functions and production, need of recovery operations, redesign, etc.).

If safety is defined by the absence of unwanted events, the level of safety is consequently measured by the relative occurrence of such events. (In fact, the definition of safety is in many cases derived from the ability to make certain measurements.) Consider, for instance, the top five HSE indicators used by the oil industry:

•  (number of) fatal accidents

•  total recordable injury frequency (TRIF)

•  lost-time injury frequency (LTIF)

•  serious HSE incident frequency (SIF)

•  accidental oil spill (number and volume).

Common to these indicators is that they are reasonably objective, easy to quantify, and that they can be used without requiring costly changes to the existing system. They are probably also reliable, but it can be questioned whether they are valid safety indicators. (Another way to look at it is to ask which definition of safety the indicators imply.) They are all lagging indicators, and may be more useful to confirm effects after a while than to manage changes. And since the indicators represent outcomes rather than processes, they hardly provide a useful basis for concrete actions within the operational context.

This approach can be found in other industries as well. In the area of patient safety, for instance, the OECD has proposed sets of indicators for ‘operative and post-operative complications’, ‘sentinel events’, ‘hospital-acquired infections’, ‘obstetrics’ and ‘other care-related adverse events’. Here the first group contains the following indicators:

•  complications of anaesthesia

•  postoperative hip fracture

•  postoperative pulmonary embolism (PE) or deep vein thrombosis (DVT)

•  postoperative sepsis

•  technical difficulty with procedure.

Similar comments can be made as for the off-shore safety indicators. The patient safety indicators all refer to well-defined events, but the problem is that counting how many events there are in each category does not by itself say much about what the level of safety is.

A final example is found in the programme of work for the European Technology Platform on Industrial Safety (ETPIS, 2005). The aim of this group is to implement a new safety paradigm, called an ‘incident elimination culture,’ in European industry by 2020. Safety is highlighted as a key factor for successful business and as an inherent element of business performance. The aim is to demonstrate a measurable improvement of industrial safety performance by a reduction in the number of the following four categories of outcomes:

•  reportable accidents at work

•  occupational diseases

•  environmental incidents

•  accident-related production losses.

The two milestones defined by the European Technology Platform on Industrial Safety are a 25 per cent reduction in accidents by 2020 and that programmes are in place by 2020 to continue accident reduction at a rate of >5 per cent per year. While these milestones have the advantage of looking very concrete and verifiable, they also point to the main problem with commonly used safety indicators, namely that they work best in the beginning when safety is bad, but less well later when safety is good. The reason is simply that if the number of reported events is large, as it typically is when a programme of improvement is begun, then it will be easy to see a reduction in the number of adverse outcomes. But if a programme has been running successfully for some time, then there will be few reportable events to measure. This can be illustrated by Figure E.2. (The effect of a given level of effort in the beginning, Δ1, is much larger than the effect of the same level of effort, Δ2, later on.)

From a control or management point of view the diminishing number of outcomes is a problem, since the absence of measurements means that there is no feedback, hence that the process becomes unmanageable. The logical consequence is to look for measurements that increase rather than decrease as the situation improves.

Figure E.2  The dilemma of basing safety on measuring adverse outcomes

Measurements of Resilience

Since resilience is defined by the system’s ability to adjust the way things are done, it follows that a measure of resilience must be different from the traditional measures of safety. And because resilience refers to a quality rather than a quantity, to something that the system does rather than to something that the system has, it is highly unlikely that it can be represented by a single or simple measurement. A possible solution is instead to consider the four capabilities that together define resilience, and from that basis develop a Resilience Analysis Grid, that is, four sets of questions where the answers can be used to construct a resilience profile. The rest of this chapter will present a general outline of what a Resilience Analysis Grid (RAG) may look like.

The Ability to Respond

No system, organisation, or organism can survive unless it is able to respond to what happens – whether it is a threat or an opportunity. Responses must furthermore be both timely and effective so that they can bring about the desired outcome or change before it is too late. In order to respond, the system must first detect that something has happened, then recognise the event and rate it as being so serious that a response is necessary, and finally know how and when to respond and be capable of responding.

If an event is rated as serious, the response can either be to change from a state of normal operation to a state of readiness or to take specific action in the concrete situation. In order to take action it is necessary either to have prepared responses and the requisite resources, or to be flexible enough to make the necessary resources available when needed. In responding to events, it is essential to be able to distinguish between what is urgent and what is important.

Table E.1  Probing questions for the ability to respond

•  Event list: Is there a list of events for which the system has prepared responses? Do the events on the list make sense and is the list complete?

•  Background: Is there a clear basis for selecting the events? Is the list based on tradition, regulatory requirements, design basis, experience, expertise, risk assessment, industry standard, etc.?

•  Relevance: Is the list kept up-to-date? Are there rules/guidelines for when it should be revised (e.g., regularly or when necessary)? On which basis is it revised (e.g., event statistics, accidents)?

•  Threshold: Are there clear criteria for activating a response? Do the criteria refer to a threshold value or a rate of change? Are the criteria absolute or do they depend on internal/external factors? Is there a trade-off between safety and productivity?

•  Response list: How is it determined that the responses are adequate for the situations they refer to? (Empirically, or based on analyses or models?) Is it clear how the responses have been chosen?

•  Speed: How soon can an effective response begin? How fast can full response capability be established?

•  Duration: For how long can an effective response be sustained? How quickly can resources be replenished? What is the ‘refractory’ period?

•  Resources: Are there adequate resources available to respond (people, materials, competence, expertise, time, etc.)? How many are kept exclusively for the prepared responses?

•  Stop rule: Is there a clear criterion for returning to a ‘normal’ state?

•  Verification: Is the readiness to respond maintained? How and when is the readiness to respond verified?

The Ability to Monitor

A resilient system must be able flexibly to monitor its own performance as well as changes in the environment. Monitoring enables the system to address possible near-term threats and opportunities before they become reality. In order for the monitoring to be flexible, its basis must be assessed and revised from time to time.

Monitoring can be based on ‘leading’ indicators that are bona fide precursors for changes and events that are about to happen. The main difficulty with ‘leading’ indicators is that their interpretation requires an articulated description, or model, of how the system functions. In the absence of that, ‘leading’ indicators are defined by association or spurious correlations. Because of this, most systems rely on current and lagging indicators, such as on-line process measurements and accident statistics. The dilemma of lagging indicators is that while the likelihood of success increases the smaller the lag is (because early interventions are more effective than late ones), the validity or certainty of the indicator increases the longer the lag (or sampling period) is.

Table E.2  Probing questions for the ability to monitor

•  Indicator list: How have the indicators been defined? (By analysis, by tradition, by industry consensus, by the regulator, by international standards, etc.)

•  Relevance: When was the list created? How often is it revised? On which basis is it revised? Is someone responsible for maintaining the list?

•  Indicator type: How appropriate is the mixture of ‘leading’, ‘current’ and ‘lagging’ indicators? Do indicators refer to single or aggregated measurements?

•  Validity: For ‘leading’ indicators, how is their validity established? Are they based on an articulated process model?

•  Delay: For ‘lagging’ indicators, what is the duration of the lag?

•  Measurement type: How appropriate are the measurements? Are they qualitative or quantitative? (If quantitative, is a reasonable kind of scaling used?) Are the measurements reliable?

•  Measurement frequency: How often are the measurements made? (Continuously, regularly, now and then?)

•  Analysis/interpretation: What is the delay between measurement and analysis/interpretation? How many of the measurements are directly meaningful and how many require analysis of some kind? How are the results communicated and used?

•  Stability: Are the effects that are measured transient or permanent? How is this determined?

•  Organisational support: Is there a regular inspection scheme or schedule? Is it properly resourced?

The Ability to Anticipate

While monitoring makes immediate sense, it may be less obvious that it is useful to look at the more distant future as well. The purpose of looking at the potential is to identify possible future events, conditions, or state changes that may affect the system’s ability to function either positively or negatively.

Risk assessment focuses on future threats and is suitable for systems where the principles of functioning are known, where descriptions do not contain too many details, where descriptions can be made relatively quickly, and where the systems – and their environments – are sufficiently stable for their descriptions to remain valid for a reasonable time after they have been made. Many present-day systems where industrial safety is a concern are unfortunately not like that, but are rather underspecified. For such systems the principles of functioning are only partly known, descriptions contain (too) many details, and it takes so long to make them that the system will have changed in the meantime. The systems are consequently intractable. For such systems, established risk assessment methods may be inappropriate.

The anticipation of future opportunities has little support in current methods, although it rightly ought to be considered just as important as the search for threats. This shortcoming is at least acknowledged by Resilience Engineering.

Table E.3  Probing questions for the ability to anticipate

•  Expertise: Is there expertise available to look into the future? Is it in-house or outsourced?

•  Frequency: How often are future threats and opportunities assessed? Are assessments (and re-assessments) regular or irregular?

•  Communication: How well are the expectations about future events communicated or shared within the organisation?

•  Assumptions about the future (model of future): Does the organisation have a recognisable ‘model of the future’? Is this model clearly formulated? Are the model or assumptions about the future explicit or implicit? Is the model articulated or a ‘folk’ model (e.g., general common sense)?

•  Time horizon: How far does the organisation look ahead? Is there a common time horizon for different parts of the organisation (e.g., for business and safety)? Does the time horizon match the nature of the core business process?

•  Acceptability of risks: Is there an explicit recognition of risks as acceptable and unacceptable? Is the basis for this distinction clearly expressed?

•  Aetiology: What is the assumed nature of future threats? (What are they and how do they develop?) What is the assumed nature of future opportunities? (What are they and how do they develop?)

•  Culture: To what extent is risk awareness part of the organisational culture?

The Ability to Learn

It is indisputable that future performance can only be improved if something is learned from past performance. Indeed, learning is generally defined as ‘a change in behaviour as a result of experience’.

The effectiveness of learning depends on the basis for learning, that is, which events or experiences are taken into account, as well as on how the events are analysed and understood.

In order for effective learning to take place there must be sufficient opportunity to learn, events must have some degree of similarity, and it must be possible to confirm that something has been learned. (This is why it is difficult to learn from rare events.) Learning is not just a random change in behaviour but a change that makes certain outcomes more likely and other outcomes less likely. It must therefore be possible to determine whether the learning (the change in behaviour) has the desired effect. If learning has had no effect, then it has probably not happened. And if learning has the opposite effect, then it has certainly been wrong.

In learning from experience it is important to separate what is easy to learn from what is meaningful to learn. Experience is often couched in terms of the number or frequency of occurrence of adverse events. But compiling extensive accident statistics does not mean that anyone will actually learn anything. Furthermore, since the number of things that go right, including near misses, is many orders of magnitude larger than the number of things that go wrong, it makes good sense to try to learn from representative events rather than from failures alone.

Table E.4  Probing questions for the ability to learn

•  Selection criteria: Is there a clear principle for which events are investigated and which are not (severity, value, etc.)? Is the selection made systematically or haphazardly? Does the selection depend on the conditions (time, resources)?

•  Learning basis: Does the organisation try to learn from what is common (successes, things that go right) as well as from what is rare (failures, things that go wrong)?

•  Data collection: Is there any formal training or organisational support for data collection, analysis and learning?

•  Classification: How are the events described? How are data collected and categorised? Does the categorisation depend on investigation outcomes?

•  Frequency: Is learning a continuous or discrete (event-driven) activity?

•  Resources: Are adequate resources allocated to investigation/analysis and to dissemination of results and learning? Is the allocation stable or is it made on an ad hoc basis?

•  Delay: What is the delay between reporting an event, analysis, and learning? How fast are the outcomes communicated inside and outside of the organisation?

•  Learning target: On which level does the learning take effect (individual, collective, organisational)? Is there someone responsible for compiling the experiences and making them ‘learnable’?

•  Implementation: How are ‘lessons learned’ implemented? Through regulations, procedures, norms, training, instructions, redesign, reorganisation, etc.?

•  Verification/maintenance: Are there means in place to verify or confirm that the intended learning has taken place? Are there means in place to maintain what has been learned?

Applying the RAG – Rating Resilience

By considering in detail each of the four capabilities that define resilience, it is possible to propose four sets of issues and four corresponding sets of questions that can serve as a basis for assessing a system’s resilience. The same set of issues can also be the starting point for possible concrete measures to maintain or improve resilience.

The four sets of issues together comprise what is called the Resilience Analysis Grid (RAG). Similarly, the answers to the four sets of questions characterise the resilience of a system and can be used to construct a resilience profile. It is, of course, possible to work just with the answers or ratings, but for many purposes it is useful also to have some kind of pictorial or graphical representation to help communicate and discuss the results. The so-called star chart or radar chart is well suited for this purpose. The star chart is a straightforward way to display multivariate data in the form of a two-dimensional chart where all the variables are represented on axes starting from the same point, cf. Figure E.3.

The procedure for filling out a RAG is quite simple, and can be described in the following steps.

Define and Describe the System for which the RAG is to be Constructed

The first step is, not surprisingly, to provide a clear and concise description of the system for which the RAG is to be filled out. Is the system, for instance, an aircraft crew (pilots plus flight attendants), the flight dispatch service, aircraft maintenance, or the airline as a whole? Is the system the central control room of a power plant, a work shift, the maintenance and repair services or the outage handling? A resilience analysis must always begin by defining as clearly as possible the boundaries of the system being considered, the organisational structure, the people and resources involved, the time horizon for typical activities, etc. Without that it is not possible to know which questions to ask, nor how to rate the answers.
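To make this first step concrete, the information it asks for can be collected in a simple structured record. The sketch below (in Python) is purely illustrative: the field names paraphrase the text and are an assumption, not a prescribed schema.

```python
from dataclasses import dataclass, field

@dataclass
class SystemDescription:
    """Record of the information the first step asks for. The field
    names paraphrase the text; they are illustrative, not prescriptive."""
    name: str
    boundaries: str
    organisational_structure: str
    people_and_resources: list[str] = field(default_factory=list)
    typical_time_horizon: str = ""

# Example: one of the candidate systems mentioned above.
crew = SystemDescription(
    name="aircraft crew",
    boundaries="pilots plus flight attendants on a single flight",
    organisational_structure="captain-led crew within airline operations",
    people_and_resources=["pilots", "flight attendants"],
    typical_time_horizon="a single flight (hours)",
)
```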

Select a Subset of Relevant Questions for Each of the Four Capabilities

The four sets of questions presented in this chapter do not refer to any specific system or domain and should therefore not be used without confirming their relevance. This can be done in two steps. The first is to select four subsets of questions that correspond to the system defined by the first step, that is, the scope of activities and the nature of the core processes. The second is to reformulate individual questions so that they are appropriate for the domain, and possibly add new questions if needed. An investigation of the resilience of a hospital ward should, for instance, not use the same questions and the same formulations as an off-shore drilling rig. Different domains, and different kinds of businesses, may also affect the relative weight or importance of the four capabilities.
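One way to keep this tailoring manageable is to hold the generic grid as structured data and derive domain-specific subsets from it. The following sketch assumes the analysis items from Tables E.1–E.4; the selection helper is a hypothetical convenience, and reformulating or adding questions would then be done on the resulting subset.

```python
# Illustrative encoding of the generic RAG; item names follow Tables E.1-E.4.
GENERIC_RAG = {
    "respond": ["event list", "background", "relevance", "threshold",
                "response list", "speed", "duration", "resources",
                "stop rule", "verification"],
    "monitor": ["indicator list", "relevance", "indicator type", "validity",
                "delay", "measurement type", "measurement frequency",
                "analysis/interpretation", "stability",
                "organisational support"],
    "anticipate": ["expertise", "frequency", "communication",
                   "model of future", "time horizon",
                   "acceptability of risks", "aetiology", "culture"],
    "learn": ["selection criteria", "learning basis", "data collection",
              "classification", "frequency", "resources", "delay",
              "learning target", "implementation",
              "verification/maintenance"],
}

def tailor_rag(generic, keep):
    """Select, per capability, only the items judged relevant for the
    system defined in the first step. Capabilities without an explicit
    selection keep all of their items."""
    return {capability: [item for item in items
                         if item in keep.get(capability, items)]
            for capability, items in generic.items()}

# Hypothetical example: a hospital-ward analysis that trims 'respond'.
ward_rag = tailor_rag(GENERIC_RAG, {
    "respond": ["event list", "threshold", "speed", "duration",
                "resources", "stop rule", "verification"],
})
```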

Rate the Selected Questions for Each Capability

Based on the outcome of the second step, it is now possible to get answers to the four sets of questions and to rate the answers. The answers must come from people who have experience of the domain. Various approaches can be used such as workplace interviews, discussions with experts, focus groups, etc. Since it is important for the proper use of the RAG that the ratings are done repeatedly rather than only once, it may be useful to nominate a number of people in the system who can serve as a pool of respondents.

In order for the RAG to be useful as a tool, it is necessary that the answers to each question are rated using a common terminology. It is proposed to use the following five categories; a minimal encoding of the resulting scale is sketched after the list.

•  Excellent – the system on the whole exceeds the criteria addressed by the specific item.

•  Satisfactory – the system fully meets all reasonable criteria addressed by the specific item.

•  Acceptable – the system meets the nominal criteria addressed by the specific item.

•  Unacceptable – the system does not meet the nominal criteria addressed by the specific item.

•  Deficient – there is insufficient capability to meet the criteria addressed by the specific item.

In addition, a sixth category must be included to account for the situation where the system does not address a capability at all.

•  Missing – there is no capability whatsoever to address the specific item.
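A minimal encoding of this scale, assuming the numeric values suggested later in the chapter (1 for ‘deficient’ up to 5 for ‘excellent’) and adding, as a further assumption, 0 for ‘missing’ (the common origin of the star chart axes), could look as follows.

```python
from enum import IntEnum

class Rating(IntEnum):
    """Rating categories with numeric values. The 1-5 values follow the
    chapter's later suggestion; 0 for 'missing' is an assumption here."""
    MISSING = 0
    DEFICIENT = 1
    UNACCEPTABLE = 2
    ACCEPTABLE = 3
    SATISFACTORY = 4
    EXCELLENT = 5

assert Rating.ACCEPTABLE == 3  # e.g., an 'acceptable' answer scores 3
```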

When answering and rating the individual items in the lists, it should be kept in mind that the rating is not intended to be a ‘scoring’ of recent accidents and incidents. The examples that are used to provide the answers should be of the normal or typical way in which the system functions. If there have been a number of cases where the system has failed to meet the criteria, then this should clearly be taken into account during the rating. But the rating should describe how well the system is able to do something, rather than how badly things can turn out.

Combine the Ratings into a Score for Each Capability, and for the Four Capabilities Combined

Once the rating has been done for each set of items, the results can be shown by means of a star chart. To illustrate this, Figure E.3 shows an empty star chart for the ability to monitor. The star chart has ten axes, corresponding to the ten variables (items) used to rate the ability to monitor, with each axis marked using the five rating categories described above. The sixth category, missing, corresponds to the common starting point of the axes.

Figure E.3  Empty star chart for monitoring

The star chart is used in the following way. If, for instance, all variables were rated as ‘acceptable’, then the result would be a regular polygon (not shown in Figure E.3). If one or more of the variables were rated differently, either better or worse, then the result would be an irregular polygon. The shape of the polygon that is constructed from the ratings therefore provides a convenient visual representation of the ‘balance’ among the ratings. Note, however, that the reference rating for a specific variable will depend on the nature of the system’s activities. A specific system may, for instance, require that the ‘validity’ is excellent, whereas the ‘measurement type’ (i.e., mixture of qualitative and quantitative measures) only has to be acceptable. The star charts for the other abilities are produced in the same straightforward manner. The star charts for the four capabilities will together provide an overall view of how the system’s resilience was rated.
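As a sketch of how such a star chart might be produced, the code below uses matplotlib’s polar axes with the ten monitoring items from Table E.2. The ratings are invented for illustration, and the presentation choices (filled polygon, tick labels) are assumptions, not something the chapter prescribes.

```python
import math
import matplotlib.pyplot as plt

# The ten monitoring items from Table E.2, with invented example ratings
# on the 0-5 scale sketched earlier (0 = missing ... 5 = excellent).
items = ["indicator list", "relevance", "indicator type", "validity",
         "delay", "measurement type", "measurement frequency",
         "analysis/interpretation", "stability", "organisational support"]
ratings = [3, 4, 3, 2, 3, 4, 3, 2, 3, 3]

# One axis per item; repeat the first point to close the polygon.
angles = [2 * math.pi * i / len(items) for i in range(len(items))]
angles += angles[:1]
values = ratings + ratings[:1]

ax = plt.subplot(polar=True)
ax.plot(angles, values)
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(items, fontsize=8)
ax.set_yticks([1, 2, 3, 4, 5])
ax.set_yticklabels(["deficient", "unacceptable", "acceptable",
                    "satisfactory", "excellent"], fontsize=7)
ax.set_ylim(0, 5)
plt.show()
```

If all ten ratings were equal, the plotted polygon would be regular; differing ratings produce the irregular shape described above.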

It is, however, also possible to combine the four star charts into one by making the dimensions (axes) for each capability comparable. The simplest approach is to assign numerical values to the ratings, for instance from 1 to 5 where 1 corresponds to ‘deficient’, 2 to ‘unacceptable’, and so on. It is then straightforward to calculate the value of the rating for each axis and to aggregate them into a single value. This approach can be made more reasonable by assigning appropriate weights to both the ratings and the dimensions. Provided that a procedure can be defined that respects the characteristics of the system and the domain, the RAG can be represented by a four-axis star diagram, as shown in Figure E.4.
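A minimal sketch of this aggregation, assuming a weighted arithmetic mean (the chapter leaves the choice of weights and of the aggregation rule to the analyst), is given below; the example ratings are invented and roughly reproduce the profile described for Figure E.4.

```python
def capability_score(ratings, weights=None):
    """Aggregate per-item ratings (numeric, 0-5) into a single score
    using a weighted arithmetic mean; equal weights by default."""
    weights = weights or [1.0] * len(ratings)
    return sum(w * r for w, r in zip(weights, ratings)) / sum(weights)

# Invented ratings: good respond/monitor, weak anticipate/learn,
# i.e., the kind of system described in connection with Figure E.4.
profile = {
    "respond":    capability_score([4, 4, 3, 4, 3, 4, 3, 4, 3, 4]),  # 3.6
    "monitor":    capability_score([3, 4, 3, 2, 3, 4, 3, 2, 3, 3]),  # 3.0
    "anticipate": capability_score([2, 1, 2, 1, 2, 2, 1, 2]),        # ~1.6
    "learn":      capability_score([2, 2, 1, 2, 2, 1, 2, 1, 2, 2]),  # 1.7
}
```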

The assignments shown in Figure E.4 are for purposes of illustration only. In this example the shape of the polygon is irregular, indicating that all is not well. The figure corresponds to a system that does well in terms of the ability to respond and monitor, but which fails in terms of the ability to anticipate and learn. While such a system may be safe in the short run, it is not resilient.

Figure E.4  An aggregated star diagram

Interdependence of the Four Capabilities

A less simple but more meaningful approach is to consider how the four basic capabilities are coupled. The ability to respond, for instance, can be enhanced by monitoring, which in turn may benefit from learning. While a detailed description of the couplings is beyond the scope of this chapter, a first attempt could be as follows (cf., Figure E.5), using the FRAM representation described in Chapter 13.

Responses can be triggered by external and/or internal events, and this can be facilitated by the output from the monitoring function. The response itself requires that the system is in a state of readiness and that the necessary resources (tools, materials and people) are available. The scheduling of the response is controlled by plans and procedures, predefined or ad hoc, and may require that the scheduling of ongoing actions is flexible so that the normal activities can be resumed when the response has come to an end.

The input to monitoring comes from internal and external developments that provide the raw data, and from the functions of anticipation and learning that provide the background for looking at and interpreting the data. Effective monitoring requires that there is time available (cf., Hollnagel, 2009), that there is a monitoring strategy (i.e., that monitoring is both efficient and thorough), and that the people or operators involved have the requisite skills and knowledge.

Anticipation is heavily influenced by what has been learned from the past, such as suggestions for performance indicators. It is controlled or guided by the ‘model of the future’, in particular the types of threats or opportunities that this model describes. Unlike the other functions, anticipation is not necessarily data-driven. The main resource is competent people, but anticipation is rarely a time-critical function. The pre-condition is the organisational culture or awareness, here described as a ‘constant sense of unease’, cf., Hollnagel et al. (2008).

Learning, finally, makes use of past events and responses, whether in-house or in the general domain of activity (possibly mediated by regulators), including internal or external events that have not resulted in something requiring a response. Learning is ‘controlled’ or guided by the assumptions about why things happen. Here the organisation’s accident model is of particular importance, for instance in the way in which it determines which data and events are considered (Lundberg et al., 2009). Effective learning finally requires some kind of reporting scheme.

Figure E.5  Interdependence of the four resilience capabilities

Single versus Repeated Measures

Since the RAG is intended as a tool to support resilience management rather than resilience measurement, it is essential that it is used regularly. The RAG should not just give a measurement of a system’s resilience at a single point in time, but be used to follow how resilience develops over time. The RAG is thus itself intended as a (composite) current indicator, rather than a simple lagging or a leading indicator. When used in this way it actually becomes less critical how the aggregated star chart is produced, since the relative indications are more important than the absolute. But it is important that the RAG is applied systematically and consistently.
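Under the same assumptions as the aggregation sketch above, consecutive profiles can be compared directly; the helper and the quarterly values below are purely illustrative.

```python
def profile_change(previous, current):
    """Difference between two consecutive RAG profiles, capability by
    capability. Since the RAG is used for relative indications, the
    trend matters more than the absolute numbers."""
    return {cap: round(current[cap] - previous[cap], 2) for cap in current}

q1 = {"respond": 3.6, "monitor": 3.0, "anticipate": 1.6, "learn": 1.7}
q2 = {"respond": 3.7, "monitor": 3.2, "anticipate": 2.1, "learn": 1.9}
print(profile_change(q1, q2))
# {'respond': 0.1, 'monitor': 0.2, 'anticipate': 0.5, 'learn': 0.2}
```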

The frequency of the ratings clearly depends on the characteristics of the system’s core business and on the volatility of the operating environment. It is therefore not possible to provide any strict guidelines for that. But given the dynamics of current societies it does seem sensible to perform a rating for a system or an organisation at least every 2–3 months. (In business it does seem to be a tradition, if not a demand, to produce a report on how well things are going four times a year.)

Summary

The Resilience Analysis Grid presented here shows how it is possible to develop a tool that can support resilience management. It is not a tool that can be used off-the-shelf. It is rather intended as a basis from which more specific grids – or sets of questions – can be developed.

The chapter has presented and discussed the principles for how the dimensions can be rated, and how they can be shown by means of a star diagram. The star diagram is not in itself a measure of resilience, but a compact representation of how the various items were rated. The RAG should also be thought of as a process measure rather than a product measure, since it shows the current level of resilience and how well the system does on each of the four main capabilities.

Resilience Engineering cannot prescribe a certain balance or proportion among the four qualities. For a fire brigade, for instance, it is more important to be able to respond to the actual than to consider the potential, whereas for a sales organisation, the ability to anticipate may be just as important as the ability to respond. But it is clearly necessary for any system to address each of these qualities to some extent in order to be resilient. All systems traditionally put some effort into the ability to respond to the actual. Many also put some effort into the ability to learn from the factual, although it often is in a very stereotypical manner. Fewer systems make a sustained effort to monitor the critical, particularly if there has been a long period of stability. And very few systems put any serious effort into the ability to anticipate the potential.
