Chapter 10

Noticing Brittleness, Designing for Resilience

Elizabeth Lay and Matthieu Branlat

Engineering is the discipline of applying art or science to practical problems. Resilience Engineering is the discipline of applying principles of Highly Reliable Organizing and Resilience Engineering to the design of resilient systems. An assessment of brittleness is the first step in determining which strategies and tactics to deploy, and noticing brittleness (and resilience) is a skill that can be learned. This chapter covers how the skill of noticing brittleness can be developed and then applied in a workshop to assess brittleness and, subsequently, to design strategies and tactics that increase resilience. These topics are explored in the context of maintenance work.

Introduction

Organizations operating in high-risk/high-consequence domains recognize the variability of their work environment and its potential consequences for performance and safety. As a result, they actively seek ways to deal with this variability in order to avoid undesired states and outcomes. Traditional approaches based on risk management aim at anticipating, measuring, and building mechanisms to address specific forms of variability, often with the goal of reducing it.

Although such approaches have shown positive results through building basic adaptive capacity within the systems considered, they are also based on strong assumptions that make the systems ineffective at managing disruptions; they typically overestimate their knowledge of the various forms of variability (oversimplified models of the world), and collapse in the face of surprising events. Resilience Engineering (RE) represents a different type of approach: rather than anticipating specific events, it assumes that the world is variable, that this variability cannot always be known in advance, and that it might even be surprising. As a result, RE aims at describing and designing the mechanisms that will support systems’ adaptive capacity in the face of known and unknown variations in the world: mechanisms that allow for the detection of the variability, for the understanding of its potentially surprising nature or scope, and for the timely reconfiguration of the system to manage it successfully. While characteristics of already existing High Reliability Organizations (HROs) and resilient systems have been thoroughly described, purposefully engineering resilience into a system is uncommon. Questions arise about what practical transformations can be made in support of resilience, and also about how to conduct interventions that introduce RE principles in organizations that have a more traditional risk management culture.

This chapter aims at describing our experience with these very issues, and at illustrating them through specific interventions in the domain of industrial maintenance.

Underlying Principles

Resilience and Brittleness

Resilience is the intrinsic ability of an organization to adjust its functioning prior to, during, or following changes and disturbances, so that it can sustain required operations under both expected and unexpected conditions (Hollnagel, 2012). Resilient systems are agile in the face of change and have buffers (margins of maneuver) to respond to unforeseen demands, thus creating the conditions to avoid or minimize consequences of adverse events; they understand and manage their complexity. Brittle systems, on the other hand, fail to notice warnings and to adjust their behavior in time to prevent collapse; they overlook and fall victim to tight couplings. They may have a system designed around “standard” (low variability) maintenance even though variable scope is the norm (“standard” never happens).

Brittleness and resilience are two sides of the same coin. Just as where there are threats, there are corresponding opportunities; where an organization or system (or process) is brittle, there exists the possibility to increase resilience. Brittleness and resilience are system properties, not outcomes (Cook, 2012). You can have a good outcome with a brittle system or a bad outcome with a resilient system, although the probability of having a desired outcome increases with resilient systems. “Resilience Engineering … agenda is to control or manage a system’s adaptive capacities based on empirical evidence;” “to achieve resilient control … system must have capacity to reflect on how well it is adapted, what it is adapted to, and what is changing in its environment.” Managers need knowledge of how the system is resilient and brittle to make decisions on how to invest resources to increase resilience (Woods, 2006; Hollnagel, 2009). “Resilience/brittleness of a system captures how well it can adapt to handle events that challenge the boundary conditions for its operation” (Woods and Branlat, 2011).

Resilience as Management of Variability and Complexity

Work systems, as Complex Adaptive Systems, require adaptability in order to be resilient when anomalous situations arise, that is, maintain sufficient levels of safety and performance in the face of disruptions. However, adaptive processes can be fallible: systems may fail to adapt in situations requiring new ways of functioning; or the adaptations themselves may produce undesired consequences, especially as a result of unmanaged functional interdependencies. These challenges have been abstracted through three basic patterns in how adaptive systems fail (Woods and Branlat, 2011).

The three basic patterns are:

1.  Decompensation – when the system exhausts its capacity to adapt as disturbances/challenges cascade. This pattern corresponds to situations where the system is unable to transition to new modes of functioning in a timely manner to respond to the disturbances.

2.  Working at cross-purposes – when roles exhibit behaviour that is locally adaptive but globally maladaptive. This pattern is a result of mis-coordination across the system and corresponds to a failure to manage functional interdependencies; issues can arise from interdependencies that remain undetected until they are revealed by incidents following a disturbance.

3.  Getting stuck in outdated behaviours – when the system over-relies on past successes. This pattern results from organizations failing to revise their models and plans in place, often through oversimplifying or disregarding the disturbances they experience.

These patterns of failure propose a description of how an organization is unsuccessful at managing adverse events, and suggest ways to transform the system so that it better manages its complexity in a variable environment. Measures to increase resilience derive directly from the nature of the patterns that they counterbalance. Corresponding forms of improvements include: (1) management of resources, for example, by creating tactical reserves and understanding the conditions for their employment; (2) coordination within the system, for example, by investigating and supporting functional interdependencies; (3) learning mechanisms, for example, by transforming the models underlying the investigation of incidents.

Resilience Compared with Traditional Risk Assessment

Traditional risk assessments typically include identifying risks (specific and detailed), analyzing the risks (qualify, quantify, rank), and designing specific responses for higher ranked risks. Risk assessment often focuses on preventing things that can go wrong.

Resilience Engineering involves designing to ensure things go right with more focus on preparedness and less on prediction. Consideration is given to broad, big picture situations and uncertainties. Responses tend to be general versus specific. Resilience Engineering involves:

•  Bounding uncertainty and possible outcomes: considering properties of systems when analyzing possible failures; understanding fundamental limitations of resources; describing possible big picture outcomes; and seeking to be approximately right across a broad set of eventualities (Taleb, 2010).

•  General response design: making the distinction between positive and negative contingencies; seizing opportunities; investing in preparedness, not prediction (Taleb, 2010); since variability is inevitable, looking for ways to build structure around and plan for variability; and designing general responses that could address a broad set of situations (without focusing on the precise and local).

Uncertainty is defined as the state of having limited knowledge, where it is not possible to exactly describe the existing state or future outcome; it is an unmeasurable risk and includes what we don’t know, ambiguity, and/or variability. In project management, according to De Meyer, there are four types of uncertainty (De Meyer et al., 2002, 61–2):

•  Variation: a range of values on a particular activity.

•  Foreseen uncertainty: identifiable and understood influences that the team cannot be sure will occur.

•  Unforeseen uncertainty: can’t be identified in planning, team is unaware of event’s possibility or considers it unlikely. Also called “unknown unknowns.”

•  Chaos: Even the basic structure of the plan is uncertain. There is constant change, iteration, evolution. Final results may be completely different from original intent.

During risk assessment, the tendency can be to act as if the future can be predicted more accurately than is possible, such as when probabilities are estimated to a high degree of granularity. In risk assessment, uncertainty may be neglected. This is due, in part, to the psychological make-up of humans; studies have shown people are more averse to uncertainty than to risk alone (Platt and Huettel, 2008, 398–403). To be highly resilient is to be prepared for uncertainty. To be highly resilient is to respond robustly to the unexpected.

Observing Brittleness at Play

Noticing brittleness and resilience is a skill that can be learned. One approach to building this skill is through study groups: understanding the principles through literature, observing and recognizing the patterns and characteristics of resilience and brittleness in everyday work, then discussing observations across situations and domains.

Over a period of time, people build the skill to notice brittleness and resilience by employing reciprocation (conversations), recurrence (periodic conversations and observations), and recursion (making repeated observations as knowledge is built to deepen understanding).

According to Weick and Sutcliffe, organizations are brittle if they (Weick et al., 2001):

•  Have little or no reserve

•  Don’t pay attention to or deny small failures

•  Make assumptions, small misjudgments

•  Accept simple diagnosis, don’t question

•  Take frontline operations for granted

•  Defer to authorities rather than experts

•  Keep working as usual upon disruptions.

The following table describes additional and expanded signs of brittleness that have been observed in the context of complex, variable industrial maintenance work situations. Relationships with the three maladaptive patterns (1–decompensation; 2–working at cross-purposes; 3–getting stuck in outdated behaviour) are indicated in parentheses.

Table 10.1  Observations of brittleness at play

| Type of sign | Examples of observations |
|---|---|
| Buffers/reserves | No buffer or contingency plans for critical sequential events (1) |
| | Critical singular resources (only one person with skill, only one tool), more brittle if long lead time to procure (1) |
| | Single point failures, in general (being reliant on one vendor for critical work or resources) and especially serial activities with potential single point failures (1, 3) |
| | Overuse (burn-out) of key personnel (2) |
| Stiffness/rigidity/lacking flexibility | Fixed configuration teams made of highly specialized, singularly skilled workers (especially if there is little reserve and multiple skills are necessary for work) (1, 2, 3) |
| Information and knowledge | Not knowing which resources are critical. Having low reserves of critical resources with a variable demand for those resources (people, tools, materials, supplies, etc.). A “critical resource” is defined as a resource on which critical path work is dependent. |
| | Leadership lacks big picture view (2) |
| | Leadership not having ability to get current information that describes changing situation (1, 2) |
| | Lack of knowledge of how changes (both large and many small) impact the big picture or program (2, 3) |
| | Communication is not frequent and timely beyond yield point (begin to lose control) (1, 2) |
| Variability and uncertainty | Not exploring or planning for where uncertainty lies (3) |
| | Not understanding dependencies and interactions (competition for same resources) (2, 3) |
| | Ungrounded or unlikely assumptions (3) |
| | Lack of bounding of variability (most likely and worst case scenarios) (3) |
| | First or second use of a critical process or supplier; more generally, uncertainty added by lack of experience or history (3) |
| Planning | Lack of a plan for monitoring, including where yield points (begin to lose control) and failure points lie (1) |
| | Not exploring or planning for potentially disruptive risks (technical issues) (1, 3) |
| | Late assignments of resources (no time for them to plan/prepare) (1) |
| | Lack of coordinated planning for big picture or group of projects that require inputs from several organizations or teams (lack shared resource planning model) (1, 2) |
| | Highly likely, potentially disruptive emergent scope not included in plan even though history shows this to be common (3) |
| | Lack of consideration of how the timing of disruptions/issues affects the big picture (Which issues are likely to arise early and significantly disrupt downstream projects? Where is the program at risk for early disruptions?) (2, 3) |
| | Step change in demands on resources with no change in the way planning or managing occurs; operations continue as normal (3) |
| | Over-reliance on “fire-fighting”. Last minute, high levels of change with little time to react (Plan “breaks down” after first tranche, teams start together but are soon fragmented. Effects of fragmentation not taken into account in planning or design of teams. Planning not performed to minimize fragmentation.) (1, 2) |
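One of the signs in Table 10.1, critical singular resources, lends itself to a simple screening check. The following is a minimal sketch only, not a method described in this chapter: the `Resource` fields, the `brittle_resources` helper, and the 30-day lead-time threshold are all hypothetical and would need to be tailored to the work being assessed.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    available: int          # interchangeable people or tools on hand
    on_critical_path: bool  # critical path work depends on this resource
    lead_time_days: int     # time needed to procure or train a replacement


def brittle_resources(resources, long_lead_days=30):
    """Flag critical resources that have no back-up.

    long_lead_days is an illustrative threshold, not a value from the chapter.
    """
    flagged = []
    for r in resources:
        if r.on_critical_path and r.available <= 1:
            if r.lead_time_days >= long_lead_days:
                note = "single point, long lead time"
            else:
                note = "single point"
            flagged.append((r.name, note))
    return flagged
```

A list produced this way is only a conversation starter for the workshop; it cannot replace a facilitator probing why a resource is singular and what would happen if it were lost.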

Implementing Principles of Resilience Engineering

This section describes the structure and content of a workshop that could be held for the purpose of introducing principles of Resilience Engineering in an organization and identifying ways to increase operations’ resilience. This description, which aims at providing a general guideline, is based on experience conducting such events in the context of high-risk/high-consequence industries. One of the central themes of the workshops conducted (and of the example described in this section) is the management of situations where load or demand potentially exceeds capacity of existing resources. Such focus is operationally relevant across industries and resonates with key questions about how to design for resilient control (Woods and Branlat, 2010). The design is based on the following properties for increased resilience (Woods, 2006, 23):

•  buffering capacity: system ability to absorb disruptions without breaking down;

•  flexibility versus stiffness: ability to restructure in response to changes;

•  margin: how close system is operating relative to performance boundaries (operation with little or no margin is precarious);

•  tolerance: system behavior near boundaries; degrades gracefully or collapses.

Workshop Participants

Ultimately, the goal of the workshop is to leverage participants’ work experience to help them diagnose their organizations’ brittleness and resilience, as well as to identify potentially fruitful directions to increase resilience. In this collaborative problem-solving type of task, the general approach is to generate success through diversity rather than through a few high performers (Hong and Page, 2004):

•  cross-domain representation brings diversity in perspectives and a fuller spectrum of relevant concerns;

•  consider including operations, engineering, marketing, project managers, resource planners (parts, people, and tools), and other groups that come in contact with or influence the project.

One of the most important components of the workshop is the facilitator. To help with the design of resilience, the conversation has to be led by a person who has the skills to notice brittleness and resilience, ideally supported by operational experience. Even though the workshop is structured, holding such a workshop is more complex than following a series of steps; it is critical that the facilitator has the abilities to notice and probe risk, uncertainty, and brittleness.

Helping Participants Notice Brittleness

A suggested agenda is to begin by introducing Resilience Engineering, followed by an overview of the situation, the current plan, and identification of key issues and concerns. As issues and concerns are raised, it is the role of the facilitator to begin to identify and note areas of brittleness for the group. This builds the foundation from which to perform a deeper diagnosis of brittleness using questions designed around HRO principles and properties of resilient systems. See Table 10.2 for sample questions. Assumptions in planning are rich fodder for finding brittleness; it is likely that surprises will be found within them.

Diagnosing brittleness involves questioning into and probing:

•  boundary areas and where performance began to degrade (margins/yield points/failure points)

•  specific projects for possible effects on programs

•  areas of variance, ambiguity, and assumptions

•  boundaries, limitations, critical resources, reserves, and other constraints

•  key decisions that can be changed or have not yet been made

•  interactions among risks for possible cascading situations.

Once significant risks, uncertainties, and critical scenarios are identified, these should be elaborated in more detail and bounded such that they can be responded to. The details of which project and exactly what the work is may not be important. It may be enough to know that if some type of emergent work occurs with this general timing, it could have this general impact. That can be enough to enable design of a flexible, general response to reduce disturbances such as moving people in the midst of a project.

Table 10.2  Sample workshop questions

Where do you lack information? Which projects, processes, or plans are not well defined (sources of uncertainty)?

What critical decisions have yet to be made?

What assumptions have you made?

What is new, novel, or different that adds risk or uncertainty?

Has anything changed that makes these issues more likely to cause failure?

Where is there uncertainty due to operation or maintenance history?

Is there anything you are uncomfortable with?

What constrains you in your ability to execute?

What will “stretch” or “stress” our system? Who will be most heavily loaded/stressed?

What combination of small failures could lead to a large problem?

Where can we easily add extra capacity to remove stressors?

What can we put in place to relieve, lighten, moderate, reduce and decrease stress or load?

Will there be times, such as during peak load, when we need to manage or support differently? What is the trigger?

Which support organizations need to be especially sensitive to front line needs and what is our plan to accomplish this?

Pinging is defined as the proactive probing for risk profile changes (Lay, 2011). A “pinging plan” identifies the triggers that prompt moving into different actions and supports monitoring. The plan includes identification of yield (things begin to fall apart) and failure points (run out of what, when) and margin (space between where you are and failure point).

Table 10.3  “Straw man” for pinging design. Indicators that risk level is increasing

| Risk level indicator | Green | Yellow | Red |
|---|---|---|---|
| Scope expands by x | X1 | X2 | X3 |
| Inspection finds | Description 1 | Description 2 | Description 3 |
| Schedule extends | 0 days | 1–2 days | > 2 days |
| Customer relationship | Working as team, good communication | Tense, some communication breakdowns | Conflictive, lack trust, poor communication |
| # of significant issues team is dealing with simultaneously | < 2 | 2–3 | > 3 |
| Human resources | Fully staffed, majority rested, no change out of people | Short 1–2 people, some fatigued, change out 1–2 workers during project | Short > 2 people, many fatigued, change leads out, change out > 2 workers, critical functions missing or late |
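The quantitative rows of Table 10.3 can be turned into a small monitoring check. This is a minimal sketch under stated assumptions: the thresholds for schedule, simultaneous issues, and staffing shortfall are taken from the straw man, while the names `ProjectStatus`, `classify`, and `ping`, and the rule that the overall level equals the worst single indicator, are illustrative choices rather than part of the pinging plan described here.

```python
from dataclasses import dataclass

LEVELS = ("green", "yellow", "red")  # ordered from lowest to highest risk


def classify(value: int, yellow_from: int, red_above: int) -> str:
    """Return green below yellow_from, red above red_above, yellow in between."""
    if value > red_above:
        return "red"
    if value >= yellow_from:
        return "yellow"
    return "green"


@dataclass
class ProjectStatus:
    schedule_slip_days: int  # days the schedule has extended
    open_issues: int         # significant issues being handled simultaneously
    people_short: int        # unfilled positions on the team


def ping(status: ProjectStatus) -> dict:
    """Evaluate each indicator against the Table 10.3 straw-man thresholds."""
    levels = {
        "schedule": classify(status.schedule_slip_days, yellow_from=1, red_above=2),
        "issues": classify(status.open_issues, yellow_from=2, red_above=3),
        "staffing": classify(status.people_short, yellow_from=1, red_above=2),
    }
    # Assumption: the overall level is the worst individual indicator.
    levels["overall"] = max(levels.values(), key=LEVELS.index)
    return levels
```

Qualitative indicators such as customer relationship or inspection findings are left out of the sketch because they require human judgment; in practice the triggers for each project would be agreed in the workshop.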

Designing to Increase Resilience

According to Hollnagel (2012), resilient organizations: learn from history; respond (adapt) to regular and irregular conditions in a flexible, effective manner; monitor short-term threats and opportunities; revise risk models; and anticipate long-term threats and opportunities. Strategies and tactics related to the capacity of human resources being at or near the limit are shared below. They are grouped in tables according to themes. Rather than recipes, these tables provide guidelines that need to be tailored to the specificities of the situations considered. Details are provided to illustrate how specific strategies could be designed. They are based on experience in the domain of industrial maintenance.

Respond (Adapt)

| Type of strategy | Examples |
|---|---|
| Management of deployed resources | Shift goals, shift roles, have critical resources perform only critical tasks. Use less experienced people for less complex work; provide more oversight if needed; experts coach and provide oversight rather than “do” the work. |
| | Add buffer such as a logistics person to manage parts, people, tools, especially for projects with multiple emergent work scopes or issues; commercial or other support to free the project manager to focus on managing the job; or a human performance/risk/safety/quality specialist to perform additional checks and bring outside perspective. |
| Provision of extra resources | “Drop in an Expert”. This concept involves finding a person with deep, relevant knowledge (possibly a retiree) and funding them for a short time period with the mission to assess the situation then make offers to groups who need help. Both the expert and the groups who need help determine where the expert can offer the most help. |
| | Form a crisis management team, typically made up of managers, to bring about a heightened state of coordination and help. Consider the decisions that need to be made and the power needed to remove barriers and expedite solutions in determining team members. The team strengthens leadership’s connection to the front lines and provides a forum for project managers to escalate issues to management’s attention. The team can be more effective with authority to add or move resources as needed. |
| | Form a dedicated rapid response team, typically made up of professionals. As risks and issues multiply, this team can be assigned full time to removing barriers and implementing solutions. A cross-organization group can improve collaboration and hold a neutral position to smooth political tensions that arise during periods of high stress. The focus should be to aggressively address issues that have the potential to delay front-line work. |
| | Increase use of human performance tools. Consider which tools could be deployed that currently are not being used and how tools could be used more effectively, such as a defined plan for peer checks. |
| Management of priorities | Adjust capacity limits by removing stressors from people. Shed tasks: do only what’s necessary, stop unnecessary work/paperwork. |
| | Shed load: move, decline projects. |
| | Manage differently considering how people respond when they are close to their limits (fatigued, stressed); they are more forgetful, less attentive and may miss things. |

The strategies described in the table above relate particularly to the first pattern of adaptive failure (decompensation) described in a previous section. They aim at favouring timely response to events. Cook and Nemeth (2006) have described the successful management of mass casualty events by Israeli hospitals through similar strategies.

Monitor

| Type of strategy | Examples |
|---|---|
| Support of processes of sense-making | Someone steps back from (or out of) their usual role to gain a broader perspective. |
| | Begin a heightened state of coordination and help; possibly daily calls with those who are involved. |
| | Assure communications occur with enough context to allow cross-checking. |
| | Avoid the tendency to handle serially versus holistically. |
| Support reflective processes | Know where yield points are (what % deployed is sustainable?). |
| | Look for signs the mood or situation has changed. |
| | Stop and assess global situation: The water is coming up. Where is dike going to breach? Need to put reinforcements at breach points. Need global assessment of where things are going to come apart. |
| | Identify where the breakdowns are occurring and brainstorm on where further breakdowns are likely to occur. |
| | Query front lines on breakdowns, concerns, and current capacity. |
| | Ask: Who is at the point they can’t keep up? What resources or help is needed to add capacity, remove stressors, or free up capacity? What has affected or is impeding their ability to perform? What can be done to improve this situation? What can be done to unload workers or improve conditions? What is keeping them awake at night? |
| | Continue to search for signs of brittleness, such as incomplete, unclear information or statuses, silo situations where workers were not optimally connected with front lines, communication issues, accuracy of assumptions, and key individuals for whom there is no back-up. |

Strategies described in the table above relate to how organizations assess and understand their situation. They especially address the second and third patterns described previously, through improved coordination (a source of information sharing) across the system and through mechanisms that aim to bridge the gap between a situation as imagined and the actual situation experienced.

Anticipate

| Type of strategy | Examples |
|---|---|
| Anticipate knowledge gaps and needs | Practice and build depth, before it’s needed. |
| | Develop multi-skilled workers. For example, a back office team that is also trained to hold various roles to unload or support front lines. A strategy for building this team is to recruit people with a variety of backgrounds with the understanding that they will periodically work the front lines to keep their skills fresh. Off peak, they could hold various support roles. |
| Anticipate resource gaps and needs | Anticipate losing people and their associated capacity. |
| | Build buffering capacity and develop reserves before needed. |
| | Design reconfigurable teams. This can be implemented by having a larger team that can be split into smaller components depending on the need, such as the entire team working one shift or splitting to cover two shifts. |
| | Pre-assign tactical reserves to planned work to reduce disturbances caused by emergent work. Tactical reserves could be back office personnel with appropriate experience. Assign them to planned work during peak load (giving them time to prepare), leaving active personnel available to respond to unplanned work with their more current skills enabling them to better handle variable situations. |

Strategies described above aim at supporting the processes of monitoring and response. They correspond to longer term learning processes and are responses to conditions experienced in the past that have hindered resilient operations.

Conclusion

Probing interactions and interdependencies along with stepping back to look at the entire system differentiated this workshop from a traditional risk assessment. According to Erik Hollnagel, “… resilience and brittleness probably do not reside in components … but are rather a product of how well the components work together.… this must be understood by trying to comprehend the dynamics of how the system works, from a top-down rather than a bottom-up (component) view.” He suggests that one should look to improve the everyday practices or the ways of working, instead of focusing on components of a system.

The principles of RE and HRO are not complex and are learned fairly quickly in a workshop setting. These principles suggest strategies which serve as a foundation for design of practices. The more difficult part is holding a shifted perspective while viewing issues that may not be new; a facilitator with knowledge in RE and HRO domains is crucial to accomplishing this. Per Erik Hollnagel (2012): “With resilience, noticing is different,” and “A system’s willingness to become aware of problems, is associated with its ability to act on them” (Ron Westrum, 1993). This might be the core of the real value of RE: gaining and holding a shifted perspective that enables us to notice what we could not see before; once we have noticed, we are compelled to act.

Commentary

Classical approaches to safety management accept that the world is probabilistic, but also assume that it is stable in the sense that we can trust probability calculations. Resilience Engineering takes a different view, assuming that the world is variable but that the variability is not always known in advance. The variability is, however, orderly rather than stochastic. It is, therefore, possible to develop ways which will support a system’s adaptive capacity vis-à-vis the variations in the world. By recognizing different types of uncertainty, the chapter illustrates how to go from theory to practice by showing how it was possible to teach people to be flexible and be prepared, and thereby avoid brittle performance.
