Chapter 3

Fundamental on Situational Surprise: A Case Study with Implications for Resilience

Robert L. Wears and L. Kendall Webb

Things that never happened before happen all the time (Sagan, 1993).

Introduction

Surprise is inherently challenging to any activity. It challenges resilient action since, by definition, it cannot be anticipated; and for some types of surprise, monitoring is limited both by lack of knowledge about what to target and by the absence of precursor events or organisational drift (Dekker, 2011; Snook, 2000) that might have provided even soft signals of future problems. It does, however, present opportunities both for responding and for learning. In this chapter we describe a critical incident involving information technology (IT) in a care delivery organisation. The incident was characterised by the co-occurrence of both situational and fundamental surprise (Lanir, 1986), and the responses to it are informative about both specific vulnerabilities and general adaptive capacities of the organisation. We studied this event to gain insight into three aspects of resilience: first, how adaptive capacity is used to meet challenges; second, what barriers to learning are active; and finally, what recommendations for practice can be inferred. We conducted these analyses shortly after the event, and then revisited them in discussions with key participants about three years later. We note that temporal and cross-level factors played important roles in affecting the balance between situational and fundamental learning. Because the situational story (of component failure) developed first, it was difficult for the fundamental story of unknown, hidden hazards to supplant it. In addition, the story of the situational surprise was easily understood by all members of the organisation, but that of the fundamental surprise was difficult for many to grasp, including (especially) senior leadership, who tended to adopt an over-simplified (situational) view of the problem. Finally, over time, the fundamental surprise was virtually forgotten, and those members of the organisation who do remember it have (in effect) gone into self-imposed exile. Thus, although the organisation did learn and adapt effectively from this event, it has become progressively blind to the continuing threat of fundamental surprise in complex technology.

The Nature of Surprise

Analyses of critical incidents often distinguish between situational and fundamental surprise (Lanir, 1986; Woods, Dekker, Cook, Johannesen and Sarter, 2010). Events characteristic of situational surprise might be temporally unexpected, and their specific instantiation might not be known in advance, but their occurrence and evolution are generally explicable and, more importantly, compatible with the ideas generally held by actors in the system about how things do or do not work, and about the hazards they face in ordinary operations. For example, a sudden rain shower on a previously sunny day would be surprising, but generally compatible with our experience of weather. Fundamental surprise, on the other hand, is astonishing, often inexplicable, and forces the abandonment of broadly held notions of both how things work and the nature of the hazards confronted. For example, a volcanic eruption in Paris would challenge basic notions about the geology of volcanism because it is incompatible with prior understandings.

An apocryphal story tells of a famous lexicographer (Noah Webster in America; Samuel Johnson in Britain) being unexpectedly discovered by his wife while locked in a passionate embrace with a housemaid. His wife exclaimed, ‘Oh! I am surprised!’ To which he reportedly replied, ‘No, my dear. I am surprised; you are astonished’.

It should be noted that the situational vs fundamental typology is a relative, not a dichotomous, distinction. In addition, the same event might be a situational surprise for some people but a fundamental surprise for others, depending on their relation to and experience with the domain.

If we think of a system’s resilience as its intrinsic ability ‘to adjust its functioning prior to, during, or following changes or disturbances, so that it can sustain required operations under both expected and unexpected conditions’ (Hollnagel, 2011), then it is clear that surprise creates unexpected demands that call for a resilient response.

Lanir (1986) has identified four characteristics that distinguish situational from fundamental surprise. First, fundamental surprise refutes basic beliefs about ‘how things work’, while situational surprise is compatible with previous beliefs. Second, in fundamental surprise one cannot define in advance the issues for which one must be alert. Third, situational and fundamental surprise differ in the value brought by information about the future: situational surprise can be averted or mitigated by such foresight, while advance information on fundamental surprise actually causes the surprise. (In the preceding examples, advance knowledge that it was going to rain would eliminate the situational surprise and allow mitigation by carrying an umbrella; advance knowledge that a volcano was going to erupt tomorrow in the Jardin du Luxembourg would, in itself, be as astonishing as the actual eruption.) And finally, learning from situational surprise seems easy, but learning from fundamental surprise is difficult.

Resilience and Surprise

Resilience is characterised by four essential capabilities: monitoring, anticipating, responding and learning. While effective management of situational surprise would typically involve all four of these activities, fundamental surprise is clearly a profound challenge for resilience, because one cannot monitor or anticipate items or events that are inconceivable before the fact. Even though one can still monitor the system itself, rather than events, the lack of precursors, leading signals or drift in fundamental surprise severely hampers this modality. This leaves only responding and learning as the immediately available resilience activities in fundamental surprise,1 and explains in part why fundamental surprise is such a challenge to organisational performance. However, fundamental surprise does afford opportunities for deep learning, in particular the development of ‘requisite imagination’, an ability to picture the sorts of unexampled events that might befall (Adamski and Westrum, 2003).

We present a case study of:

•  The catastrophic failure of an information technology system in a healthcare delivery organisation;

•  The organisation’s response to it from the point of view of resilience;

•  And the organisation’s memory of the event years later.

The failure itself involved a combination of both situational and fundamental surprise. As might be expected, the immediate response involved both adaptations of exploitation (that is, consuming buffers and margin for manoeuvre to maintain essential operations) and adaptations of exploration (that is, novel and radical reorganisations of the way work gets done) (Dekker, 2011; March, 1991; Maruyama, 1963). Because fundamental surprise makes the disconnect between self-perception and reality undeniable, it affords the opportunity for a thorough-going reconstruction of views and assumptions about how things work, as effortful and unpleasant as that generally seems. However, in this case the conflation of fundamental and situational surprise led to a classic fundamental surprise response – a reinterpretation of the problem in local and technical terms, which allowed an easy escape from the rigours of fundamental learning.

The Case

In this section we describe the events, the adaptations to them, and the interpretations made of them, based on notes, formal reviews and interviews during and after the incident.

Events

Shortly before midnight on a Monday evening, a large urban academic medical centre suffered a major information technology (IT) system crash which disabled virtually all IT functionality for the entire campus and its regional outpatient clinics (Wears, 2010). The outage persisted for 67 hours, and forced the cancellation of all elective procedures on Wednesday and Thursday and the diversion of ambulance traffic to other hospitals (52 major procedures and numerous minor procedures were cancelled; at least 70 incoming ambulance cases were diverted). There were 4- to 6-hour delays in both ordering and obtaining laboratory and radiology studies, which severely impacted clinical work. The total direct cost (not including lost revenue from cancelled cases or diverted patients) was estimated at close to $4 million. As far as is known, no patients were injured and no previously stored data were lost.

The triggering event was a hardware failure in a network component. This interacted with the unrealised presence of software modules left behind by an incompletely aborted (and ironically named) ‘high availability computing’ project some years previously; this interaction prevented the system from restarting once the network component was replaced. The restart failure could not be corrected initially because of a second, independent hardware failure in an exception processor. Once this was identified and replaced, the system still could not be restarted because, unbeknownst to the IT staff, the permissions controlling the start-up files and scripts had been changed during the same project, so that no one in IT was able to correct them and thus restart the system. This fault had gone undetected because the system had not been subjected to a complete restart (a ‘cold boot’) for several years.
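To make the latent fault concrete, the sketch below shows how a routine audit of boot-critical file permissions might surface this kind of hidden hazard before a cold boot does. It is a minimal illustration on a Unix-like host, not a description of the hospital’s actual systems: the file paths, expected owner and expected mode are hypothetical assumptions.

```python
#!/usr/bin/env python3
"""Minimal sketch of a latent-fault audit for boot-critical files.

Hypothetical illustration only: the paths, expected owner and expected
mode below are assumptions, not any real configuration. The point is
that permission drift on files touched only at cold boot stays
invisible until a full restart is attempted.
"""
import os
import stat

# Hypothetical boot-critical files; a real audit would enumerate these
# from the system's init configuration.
STARTUP_FILES = ["/etc/init.d/app-start.sh", "/opt/app/bin/boot.conf"]

EXPECTED_OWNER_UID = 0   # assume root should own start-up scripts
EXPECTED_MODE = 0o750    # assume owner rwx, group rx, no other access

def audit(path):
    """Report deviations from the expected owner/mode for one file."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return f"{path}: MISSING"
    problems = []
    if st.st_uid != EXPECTED_OWNER_UID:
        problems.append(f"owner uid {st.st_uid} != {EXPECTED_OWNER_UID}")
    mode = stat.S_IMODE(st.st_mode)
    if mode != EXPECTED_MODE:
        problems.append(f"mode {oct(mode)} != {oct(EXPECTED_MODE)}")
    return f"{path}: " + ("; ".join(problems) if problems else "ok")

if __name__ == "__main__":
    for f in STARTUP_FILES:
        print(audit(f))
```

Run periodically, such a check would have flagged the changed permissions years before the restart failure; the difficulty, of course, is that one must already suspect the class of fault in order to write the audit, which is precisely what fundamental surprise precludes.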

Adaptations

After a brief initial delay, the hospital was able to quickly reorganise in multiple ways to keep essential services operating for the duration. Adaptations included exploitation of existing resources or buffers; and exploration of novel, untried ways of working. These adaptations correspond roughly to the first- and second-order resilient responses described by a well-known materials science analogue (Wears and Morrison, 2013; Woods and Wreathall, 2008).

Adaptations of exploitation included deferring elective procedures and speeding discharges of appropriately improving inpatients. The former was limited in scope because the extent of the problem was not realised until Tuesday’s elective cases were well under way. The latter was stymied by the slow delivery of laboratory and imaging results; physicians were reluctant to discharge patients whose results were still pending. This, of course, is one of the classic patterns of failure – falling behind the tempo of operations (Woods and Branlat, 2011).

Several adaptations of exploration were invoked. An incident command team was formed. Because the area experiences frequent hurricanes, the incident command system was well rehearsed and familiar, so it was adapted to manage a different type of threat.

A similar novel use of available techniques evolved dynamically to compensate for the loss of medical record numbers (MRNs) to track patients, orders and results while the system was down. The emergency department had been planning to implement a ‘quick registration’ method, in which only basic patient information is obtained initially to permit earlier orders and treatment, and the remainder of registration is completed at a later time. The IT failure prevented complete registration but was thought to have left the capability for quick registration intact. Because the incident occurred very close to the previously scheduled ‘quick registration’ implementation, the method was pressed into service early. However, its application in this setting uncovered a problem, illustrated below: different organisational units used the same variable to represent different information, and as a result several patients got ‘lost’ in the system. This failure led to an alternative, the use of the mass casualty incident (MCI) system.
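The field collision is worth a concrete rendering. The sketch below is entirely hypothetical (the field names and values are invented), but it shows the pattern: when two units write different meanings into one shared slot, one unit’s records silently become unfindable.

```python
"""Sketch of the 'same variable, different meanings' collision.

The field names are hypothetical; the pattern is what matters: when two
units write different semantics into one shared slot, records become
unfindable by the unit that expects the other meaning.
"""

record = {"mrn": "Q-0001", "location": None}

# The emergency department uses 'location' to mean the treatment bay:
record["location"] = "ED-Bay-12"

# Registration later overwrites it to mean a billing facility code,
# silently destroying the ED's tracking information:
record["location"] = "FAC-031"

# The ED's lookup by bay now fails -- the patient is 'lost':
patients_in_bay_12 = [r for r in [record] if r["location"] == "ED-Bay-12"]
print(patients_in_bay_12)   # [] -- no match

# One remedy is namespacing fields per unit so meanings cannot collide:
safe = {"mrn": "Q-0001", "ed.bay": "ED-Bay-12", "billing.facility": "FAC-031"}
print(safe["ed.bay"], safe["billing.facility"])
```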

In many MCIs, the numbers of arriving patients rapidly exceed the ability to record even their basic information and assign them identifying MRNs, so the organisation maintained a separate system with reserved MCI-MRNs and pre-printed armbands. Although this system was envisioned for use in high-demand situations, in theory it could accommodate any mismatch between demand and available resources. In this case, demand was normal to low, but resources were much lower, so the MCI system was used to identify and track patients and marry them to formal MRNs after the incident had been resolved.
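In computational terms, the MCI scheme decouples identity assignment from registration: identifiers are pre-reserved, can be handed out with no dependency on the failed system, and are married to formal MRNs afterwards. The sketch below is an assumed, simplified rendering of that idea; the prefix, numbering scheme and reconciliation interface are invented for illustration.

```python
"""Sketch of a reserved-identifier scheme like the MCI-MRN pool.

Details (prefix, numbering, reconciliation API) are assumptions; the
idea is that identity assignment is decoupled from registration, and a
mapping marries temporary IDs to formal MRNs after recovery.
"""
from itertools import count

class MciIdPool:
    def __init__(self, prefix="MCI", start=1):
        self._counter = count(start)
        self._prefix = prefix
        self.mapping = {}                 # temporary id -> formal MRN

    def issue(self):
        """Hand out the next pre-reserved temporary identifier."""
        return f"{self._prefix}-{next(self._counter):05d}"

    def reconcile(self, temp_id, formal_mrn):
        """After the outage, marry a temporary id to a formal MRN."""
        self.mapping[temp_id] = formal_mrn

pool = MciIdPool()
tid = pool.issue()                  # e.g. 'MCI-00001', from an armband
# ... care is delivered and tracked under tid while systems are down ...
pool.reconcile(tid, "MRN-8675309")  # once registration is restored
print(pool.mapping)
```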

The most novel adaptation of exploration was the rescheduling of financial staff (who now had nothing to do, since no bills could be produced or charges recorded) as runners, moving orders, materials and results, which had previously been transmitted electronically, around the organisation.

Interpretations

The case was viewed in multiple ways within the organisation, depending on the orientation to situational or fundamental surprise. It should be emphasised that there is not a ‘correct’ interpretation here – these views have both validity and utility, and must be understood and held simultaneously for a full understanding of the case and its implications for organisational resilience.

Situational surprise

Because the triggering event was a hardware failure, and because the organisation had experienced a similar incident leading to total IT failure 13 years previously (Wears, Cook and Perry, 2006), the failure was initially interpreted as a situational surprise. It evinced no fundamental misperception of the world; it was not ‘the tip of the iceberg’ but rather a hazard about whose possibility there had always been some awareness.

However, we should not downplay the importance of the organisation’s situational response, which was in many ways remarkably good. The organisation detected the fault and responded relatively quickly and effectively; the unfolding understanding of the situation and the effectiveness of the response were monitored, and the organisation reconfigured to meet the threat. This reconfiguration involved a mixed control architecture in which a central incident command group set overall goals, made global-level decisions (for example, cancelling elective procedures, reassigning financial staff) and managed communications among the various subunits of the organisation, while allowing functional units (for example, the emergency department, operating room, intensive care units, pharmacy, radiology, laboratory and nursing) to employ a mixture of preplanned and spontaneously developed adaptations to maintain performance.

There was a specific attempt to capture situational learning from the incident. Each major unit conducted its own after-action review to identify performance issues; the incident command group then assembled those and conducted a final, overall review to consolidate the lessons learned. This review obtained broad participation; it resulted in 104 unique items that, while locally oriented and technically specific, form the nidus of organisational memory and could inform the approach to similar future events, which are broadly anticipated in their consequences (that is, another widespread IT failure at some point seems assured) if not in their causes.

One remarkable aspect of the response was the general absence of finger-pointing or accusatory behaviours, witch-hunts or sacrificial firings. An essay on how complex systems fail (Cook, 2010) had been circulated among the senior leaders and the incident command group during the outage, with substantial agreement on how well it described the incident, its origins and consequences; this essay played an important role in minimising the temptation to seek culpability (Dekker, Nyce and Myers, 2012).

Fundamental surprise

However, as a fuller understanding of the incident developed, situational gave way to fundamental surprise. The discovery of the permissions problem refuted taken-for-granted beliefs – that the IT section understood and could maintain its own systems; and in particular, that restrictions to privileged (‘root’) access could not be compromised except by sabotage. It raised the question of what other, previously unknown threats, installed by a parade of vendors and consultants over the years, lay lurking just beneath the surface waiting to be triggered into behaviours both unexpected and unexplainable.

Lanir notes that ‘when fundamental surprises emerge through situational ones, the relation between the two is similar to that between peeled plaster and the exposed cracks in the wall. The plaster that fell enables us to see the cracks, although it does not explain their creation’ (Lanir, 1986). The IT unit recognised this clearly, and were astonished by the ‘hidden time bomb’ whose presence was only fortuitously revealed by the line card failure. This triggered a deeper review of known previous changes, a new commitment to not permitting unmonitored and undocumented changes by vendors or other third parties, and more stringent requirements for ‘as installed’ documentation (including personal identification of involved parties). It led to a general awareness among IT leaders that their knowledge of their own system was incomplete and that they should therefore act in the ‘continuing expectation of future surprise’ (Rochlin, 1999) or as in Sagan’s remark at the beginning of the chapter that ‘things that never happened before happen all the time’ (Sagan, 1993). This fundamental learning, however, did not spread throughout the organisation, but remained mostly encapsulated in IT.

The long view

In the years following the event, key personnel involved in the recovery made career moves that may have been influenced in part by this experience. The IT disaster recovery specialist was shaken by the fundamental surprise of this event, and oscillated between responding with human sacrifice – ‘falling on her sword’ – and resentment that her prior warnings and recommendations had been incompletely heeded. Eventually, she voluntarily left the disaster recovery post to take a position in the implementation group. The then IT director, whose leadership in the crisis was little short of extraordinary, decided to leave the organisation for a more technical and less managerial position in another industry. Neither had been subjected to discipline, threats or recrimination by the organisation, but initiated these changes on their own. Unfortunately, their departure left a void in organisational memory, such that situational surprise became the dominant view of the incident as time passed.

Some beneficial organisational learning did persist. The incident command centre and poly-centric control architecture employed in the management of this event were widely viewed as successful, and so were reused on several occasions subsequently, each time successfully. These occasions included anticipatory use in advance of major system upgrades. Thus the organisation can be said to have learned how to respond better, to have improved its repertoire of possible responses, and to have sharpened its sensitivity in anticipating potentially problematic events. However, the experience of successfully anticipating and managing events over a long period of time risks the development of overconfidence, especially when the impact of fundamental surprise has been forgotten.

Discussion

Critical incidents are ambiguous: managing an event that stops just short of complete breakdown is both a story of success and a harbinger of future failure (Woods and Cook, 2006). Incidents embody a dialectic between resilient adaptation and brittle breakdown. In this case we see successful, resilient adaptation, but the real lesson is not in the success (Wears, Fairbanks and Perry, 2012) but rather in how adaptive capacity was used, how it can be fostered and maintained and how learning occurs. We also see limited fundamental learning, but the real lesson is not the failure of more broadly based learning but rather understanding what made that learning difficult.

Fundamental Surprise as a Challenge to Resilience

Fundamental surprise represents a major challenge to resilient performance. Since by definition, fundamental surprise events are inconceivable before the fact, they cannot be anticipated; since it is unknown whence they come, there can be little guidance on what, exactly, to monitor to facilitate their detection.

Factors Limiting Fundamental Learning

There is a strong tendency to reinterpret fundamental surprise in situational terms (Lanir, 1986). Several factors combined to limit fundamental learning in this case.

Situational surprise

The co-occurrence of a situational surprise (failure secondary to component failure) made it easy to redefine the issues in terms of local technical problems (for example, the lack of available spares). The easy availability of hardware failure as an explanation for the outage limited deeper analysis and understanding. This is likely an expression of an efficiency-thoroughness trade-off (Hollnagel, 2009; Marais and Saleh, 2008); accepting a simple, understandable explanation saves the resources that would be used in developing a deeper, more thorough understanding. In addition, the relative success of the adaptations to the failure paradoxically made deeper understanding seem less important.

Temporal factors

The full understanding of the incident did not develop until roughly 36 hours into the outage, so the initial characterisation of the problem as a hardware issue proved hard to dispel. In addition, the 24 × 7 × 365 nature of healthcare operations required urgent responses to prevent immediate harm to patients. This narrowed the focus of attention to actions that could be taken immediately to manage the disturbance, and moved deeper understanding to a lower priority. Because of this narrowed focus, the major formal opportunity for learning, the after-action review, was limited almost entirely to issues related to the adequacy of the response; little effort outside of the IT section was invested in understanding the causes of the failure, and even less on understanding what the incident revealed about other hidden threats, threats that may not yet have been activated.

Cross-level interactions

Different understandings were held at different levels of the organisation. The technical problem – unauthorised, unrecognised access to critical files – was harder for non-technical leadership to understand, particularly compared to the easily grasped story of component failure. Although one might suspect that the full story, being embarrassing, was obscured or suppressed, this was not the case; the IT leadership was remarkably forthcoming in laying out the full explanation of what was known, as it became known.

In addition, one might question whether it was even pertinent for the clinical arm of the organisation to undergo fundamental learning. Clinical operational units need to be prepared for the consequences of IT failures, but have little role in anticipating or preventing them.

Healthcare-specific factors

IT in healthcare has several unique characteristics that contributed to both the incident and to the difficulty of fundamental learning. In contrast to other hazardous activities, IT in health is subject to no safety oversight whatsoever. The principles of safety-critical computing are virtually unmentioned in a large medical informatics literature (Wears and Leveson, 2008). Thus there is no locus in the organisation responsible for the safety of IT, and no individual or group who might be responsible for deeper learning from the incident.

In addition, IT in healthcare is relatively new compared to other industries. The systems in use today are fundamentally ‘accidental systems’, built for one purpose (billing), and grown by accretion to support other functions for which they were never properly designed. This has led to ‘criticality creep’, in which functions originally thought to be optional gradually come to be used in mission-critical contexts, in which properties that were benign in their original setting have now become hazardous (Jackson, Thomas and Millett, 2007).

Diverting factors

Finally, an external factor diverted at least senior leadership’s attention from a deeper exploration of the vulnerabilities whose presence this incident suggested. Nine months prior to this incident, the larger system of which this hospital is a part had committed to installing a monolithic electronic medical records, order entry and results-reporting system, provided by a different vendor, across the entire system. Although full implementation was planned over a five-year span, major components of the new system were scheduled to go live nine months after the incident. This project gave the (misleading) appearance of a clean replacement of the previous system, a deus ex machina, and thus limited the felt need to understand the vagaries of the existing system more deeply, in addition to consuming a great deal of discretionary energy and resources.

Implications for Practice

Fundamental surprise is fortunately a rare event; its infrequency makes learning not only more difficult but also more important. An important general principle we glean from these events is the advantage of ‘experiencing history richly’ (March, Sproull and Tamuz, 1991) by attending to more aspects of an event. This requires a broader focus for causal investigations: rather than narrowly coning in on the cause(s) of a specific failure (causes that are unlikely to recur in the same configuration), the investigation should broaden its scope, using the incident to identify broad classes of risks to which the organisation is exposed. Similarly, the enumeration of specific errors, failures and faults leading to an event does nothing to illuminate the processes that produce those errors, failures and faults, or that permit their continued existence (Dekker, 2011).

Another way to enrich the learning from fundamental failure is to encourage multiple interpretations and accounts from multiple points of view. People tend to see only those issues they feel capable of managing, so by engaging a variety of disciplines and backgrounds, a team can see more (because they can do more). This runs counter to the felt desire to come up with an agreed consensus account of events, and runs the risk of local agreement within subunits of a complex organisation but global disagreement on hazards; thus, some mechanism for both sustaining a variety of accounts and sharing them broadly across the organisation would be important.

Organisations might also foster the construction of ‘near histories’ or hypothetical scenarios that might have evolved out of the incident in question; this would help develop the capacity for imagination that could help sustain the continuing expectation of surprise and counter overconfidence.

Conclusion

Fundamental surprise is a challenge for organisational resilience because anticipation is not a factor (or is at least severely restricted) and monitoring is typically limited to evaluating the quality of the response. Fundamental surprise also affords great opportunities for deep and fundamental learning, but it is difficult to engage organisations fully in the learning process. In this case, the combination of situational and fundamental surprise blurred the distinction between them; situational adaptation and learning were remarkable, but the ease of reinterpreting fundamental as situational surprise meant that fundamental learning was encapsulated, limited to only parts of the organisation, and subject to gradual attrition through the loss of key personnel.

Commentary

Every now and then (but hopefully rarely) a system may encounter situations that are completely surprising and that challenge preconceived ideas about what may happen and what should be done. For such fundamental surprises, resilience depends more on the abilities to respond, monitor, learn and anticipate internal rather than external developments. This can also be seen as the borderline between traditional safety management and disaster management. While fundamental surprises are challenging, they also create unique possibilities for strengthening all four basic abilities. The boundary between accidents and disasters is breached in the following chapter, which analyses the Fukushima Daiichi disaster.

1  In the strictest sense, a limited variety of anticipation might still be possible, in Rochlin’s sense of the ‘continuing expectation of future surprise’ (Rochlin, 1999), or ‘expecting the unexpected’.
