Chapter 11

Organisational Resilience and Industrial Risk

Nick McDonald

Introduction

Is resilience a useful term for describing how or whether an operational system functions effectively over an extended period of time? First, it is helpful to explore what resilience could mean and how the term could apply to operations that carry the threat of disastrous failure. Examining the normal way in which operational systems (and their host organisations) function uncovers a stark gap between formal requirements and what actually happens and points to the important role of professionalism in compensating for dysfunctional organisational systems. Quality and safety management does not easily ameliorate these complex problems. This, in turn, poses a problem for understanding how improvement and change come about in organisational systems. A more fundamental question concerns how technology innovation processes in complex systems could address the problems of operational life. The notion of resilience could help shape our understanding of how to address these conundrums, but significant gaps limit our knowledge about the resilience of operational systems.

What is the Nature of Resilience?

If we can apply the term resilience to organisations and the operations they support, then it should be clear to what this term refers. There are a number of obvious ways to use the term, for example:

•  To refer to a property of a system,

•  A post-hoc judgement on a successful outcome – so far (the system has survived therefore it must be resilient),

•  A metaphor that maps only loosely onto organisational or system characteristics.

Certainly, if the concept of resilience is to aspire to have some use in diagnosing systems, let alone predicting outcomes, it needs to be anchored in some clearly describable characteristics – characteristics which transcend those that may have an occasional association with organisational success on one occasion or another. Again, using the opposition ‘resilience – brittleness’ as a general characterisation of organisational functioning will not be helpful unless it is more than a general analogy or metaphor drawing comparison with the dynamic characteristics of physical objects.

If resilience is a system property, then it probably needs to be seen as an aspect of the relationship between a particular socio-technical system and the environment of that system. Resilience appears to convey the properties of being adapted to the requirements of the environment, or otherwise being able to manage the variability or challenging circumstances the environment throws up. An essential characteristic is to maintain stability and integrity of core processes, despite perturbation. The focus is on medium to long-term survival rather than short-term adjustment per se. However, the organisation’s capacity to adapt and hence to survive becomes one of the central questions about resilience, because the stability of the environment cannot be taken for granted. It is therefore important to be able to read the environment appropriately and to anticipate, plan and implement appropriate adjustments to address perceived future requirements.

Thus, resilience seems to reflect the tension between stability and change in organisational and operational systems, mediated by the notion of appropriate adaptation. But what is appropriate and to what environment? Within this frame, the notion of environment is potentially very broad, including both local disturbances (which cause, for example, the risk of operational failure) as well as the social and commercial factors (for example, the requirement to compete) that determine what has to be done to ensure survival of the organisation itself.

Commercial and operational risks interact, but the way in which they interact is not necessarily simple. For example, the notion of ‘safety margins’ is often invoked to imply a linear relationship between cost-cutting and the size of a theoretical ‘safety margin’. Safety margins are hard (perhaps impossible) to define, particularly where operational processes are characterised by routine and regular unofficial action. There are also different ways of adapting and cutting costs. For example, lean, low cost approaches to operations are often based on a focused approach to managing processes, which, while having the objective of stripping out redundant or ‘unproductive’ activity, may also direct attention to the fundamental safety requirements of those processes. Thus, there may be no necessary link between the degree of apparent redundancy in operational systems and greater safety. ‘Lean’ operational systems may be safer or less safe than traditionally managed operations and the nature of their vulnerability to risk may be different. It is precisely because there are different ways of designing and managing an operation and different ways of cutting costs or otherwise responding to environmental pressures of one kind or another, that it is necessary to drive down the analysis to ask specifically and precisely what characteristics of operational systems could, in fact, bring benefits in terms of resilience.

This then suggests a provisional definition of resilience. Resilience represents the capacity (of an organisational system) to anticipate and manage risk effectively, through appropriate adaptation of its actions, systems and processes, so as to ensure that its core functions are carried out in a stable and effective relationship with the environment.

System Characteristics of Stability and Change

Adopting the notion of an organisational system brings to the foreground the functional characteristics of systems. Organisational systems (as open systems) comprise inputs, transformation processes and outputs. It is important to show how the manner in which an organisation deals with the physical, social or economic material it encounters in its operating environment, leads to outcomes that maintain a stable (or otherwise positive) relationship with that environment. The real interest in this is where environments are not inherently stable, and where this instability brings a risk of catastrophic failure to the system. Thus, maintaining stability requires the capacity to adjust.

This adjustment can occur at different levels, for example, an operational group, the organisational system, groups of organisations contributing to a single operational system, and the fundamental technologies of such an operational system produced by an industry group. All of these levels imply radically different timescales for both operation and adjustment, from short-term operational decisions to the development of a new technology over a number of years. In principle all of these levels of analysis are relevant to resilience and are interdependent, as if each wider layer of the onion sets the context for how the layer inside can operate. Nevertheless, despite this interdependence, each layer may require quite distinct explanatory mechanisms. There may also be tensions and incompatibilities between the mechanisms that operate at these different levels. In order to understand the nature of resilience it is probably necessary to develop an understanding of these different levels and of the relationships between them.

As Amalberti (Chapter 16) has argued, it seems clear that the relationship between system and environment (i.e., risk) will change according to the stage of development of the technology. The stage of development of the organisation is just as important (i.e., how the human role in managing the technology is itself managed and controlled). These two are closely related.

In the context of organisational systems, resilience would seem, on the one hand, to depend on increasing standardisation. The following are examples of such tendencies:

•  Stronger co-ordination of processes by routinisation of procedures in operations and organisational systems;

•  Increased reliability through the removal of variance due to individual skill, and ensuring the substitutability of different people, through standardised selection and training;

•  Ensuring, through supervision, inspection, auditing, etc. that standardisation of the work-process does control the actual flow of work;

•  Better standardisation of the outputs of the process, made possible through better monitoring and recording of those outputs;

•  Automation of routine or complex functions.

On the other hand, resilience seems also to require a certain flexibility and capacity to adapt to circumstances. Common organisational forms of this are:

•  Informal work-practices, which are often unofficial;

•  Distributed decision systems with local autonomy;

•  Flexible/agile manufacturing systems which can adjust to changing demand;

•  Technologies that enable rather than constrain appropriate human action and modes of control;

•  Organisational systems that can manage feedback, learning and improvement.

It may be therefore that resilience is bound up with being able to successfully resolve apparent contradictions, such as:

•  Formal procedures – local autonomy of action;

•  Centralisation – decentralisation of functions/knowledge/control;

•  Maintaining system/organisation stability – capacity to change;

•  Maintaining quality of product/service – adjusting product/service to demand or changing need;

•  Use well-tested technologies – develop innovative technical systems.

In order to explore how some of these contradictions are managed in practice, we will use some examples from our own research to focus on the following issues: reconciling planning and flexibility in operational systems; the role of quality and safety management in fostering improvement; the problem of organisational change and the role of human operator requirements in industrial innovation.

It should be borne in mind, of course, that the successful reconciliation of these contradictions is not just a matter of internal consistency within the operation or organisation, but crucially concerns enabling particular ways of managing the interaction with the environment. This is a highly complex interaction. Take organisational theory as an example. Contingency theories, which posit an optimal fit between organisational form and environment, have been supplemented by recognition of the importance of choice of managerial strategy. This in turn has been modified by an acknowledgement of the power of large organisations to control and modify their environments, both through the dominance of the markets in which they operate and through their ability to create significance and meaning surrounding their areas of activity (for contrasting analyses see Thompson & McHugh, 1995, chapter 3; and Morgan, 1986, chapter 8). To adapt this line of argument to the notion of resilience, it seems clear that there is unlikely to be ‘one best way’ to achieve resilience, or even several ‘best ways’. It is important however to understand how organisations, and those that manage them make choices, and what the consequences are of these choices. It is also important to recognise that achieving ‘resilience’ (whatever that means) is not just a matter of finding the right technical solution to an operational problem, but of constructing a better way of understanding the operational space. Has the notion of resilience the conceptual potential to sustain a re-engineering of our understanding of the way in which organisations and the people within them cope with danger, difficulty and threat?

Planning and Flexibility in Operational Systems

For the past number of years we have been exploring, through a series of European projects concerning aircraft maintenance, the notion of a symbiotic relationship between the functional characteristics of an organisational/operational system and the behaviour and activity of the actors within the system. While this is a conceptually simple idea, it requires the accumulation of evidence from a diversity of sources in order to substantiate it. These sources include, amongst others, studies of self-reported compliance with procedures in task performance, observations of task performance, analysis of documentation and information systems, case studies of planning and quality systems, and surveys of climate and culture.1

Compliance with Procedures

The idea originated in a study of compliance with documentation and procedures. A survey of aircraft maintenance technicians in four different organisations established that in approximately one third of tasks the technician reported that he or she had not followed the procedure according to the maintenance manual. Most commonly the technicians reported that there were better, quicker, even safer ways of doing the task than following the manual to the letter. This is a paradoxical result. On the one hand, in a highly regulated, safety critical industry like aircraft maintenance it seems surprising that such a high proportion of technicians admit to what is essentially an illegal act, which violates the regulations. On the other hand, virtually no-one who knows the industry well has expressed any surprise at this result. Follow-up case studies demonstrated that it is indeed not very hard to find ‘better’ ways than following the manual exactly. It is also not hard to find examples of where an alternative method to the manual, which has been followed, is clearly not better or carries an undesirable risk. The manuals themselves are not an optimum guide to task performance – they have to fulfil other criteria, such as being comprehensive, fully accurate and up to date. In many organisations the documentation is not presented in such a way as to provide an adequate support to task performance. What this establishes is simply that judgement based on experience is important, along with following procedures, in determining how tasks are done (McDonald, 1999).

So far this is not very different from evidence from other domains concerning compliance with procedures. Indeed the work of Helmreich and his colleagues (e.g. Helmreich, 2001; Klampfer et al., 2001; Klinect et al., 2003; Helmreich & Sexton, 2004) in the development and deployment of LOSA (Line Operations Safety Audit) in a variety of airlines has demonstrated the normality and pervasiveness of error and procedural deviation in flight operations and in medicine. This is an important development for two reasons. First, LOSA represents a system audit procedure for recording ‘normal operational performance’. While Dekker (2005) might call this a contradiction in terms (because, for example, the act of observation transforms what is observed), LOSA represents a serious attempt to get nearer the observation of a normal situation by using elaborate mechanisms to protect flight crew from organisational jeopardy, preserve confidentiality, ensure remote analysis of aggregated data, etc. Thus for the first time there is a serious, if not perfect, window on an aspect of operational life, which, on the one hand, everyone knows about, but, on the other hand, has been hidden from scrutiny and analysis. The second reason follows from this. The evidence from LOSA has been important in contributing to a new view of ‘human error’ in which error and deviations are seen to be routinely embedded in normal professional practice, and in which the practice of professionalism is concerned less with the avoidance of error per se, than with managing a situation in a way that ensures that the envelope of acceptable performance is commensurate with the context of potentially destabilising threats and that any ‘non-standard’ situation is efficiently recoverable. This is the Threat and Error Management model.

Planning and Supply

However, our studies of aircraft maintenance provide a wider context for examining professional behaviour. In some ways the flight deck or the operating theatre may provide a misleading model for contextual influence on behaviour precisely because the immediate context of that behaviour is so clearly physically delineated and apparently separated from the wider system of which it is part – the flight deck is the exemplar par excellence of this. On the contrary, aircraft maintenance operations are clearly intimately located in a wider technical and organisational context, which includes the hangar, airport, maintenance organisation and airline.

Aircraft maintenance is a complex activity. It can involve a maintenance department of an airline performing mainly ‘own account’ work or fully commercial independent maintenance organisations whose work is entirely third party. There are inherent contradictions to be resolved – for example, how to reconcile the requisite inventory of parts and tools to cover all foreseeable planned and unplanned work, while containing capital costs. A major check may involve thousands of tasks and take several weeks to complete.

For many organisations there appears to be an unresolved tension between effective planning and the requirement of flexibility to meet the normal variability of the operational environment. Some of the dimensions of this tension can be illustrated by some of our studies of aircraft maintenance organisations. The ADAMS project included a comparison of the planning systems of two organisations (Corrigan, 2002; McDonald, 1999). In one organisation, planning was done in a very traditional top-down hierarchical manner; in the second there were serious attempts to create a more responsive and flexible planning process. In the former the maintenance plan originates in the engineering department (following the requirements of the manufacturer’s master maintenance plan). The plan moves down through the system through schedulers to the operational team on the hangar floor where it is apportioned out in the form of task cards to the technicians actually carrying out the work. Within this case study, nowhere is there an opportunity for feedback and adjustment of the plan itself in response to unexpected developments, problems or difficulties in the operational group. The second case concerns an organisation in the process of redesigning its planning system along explicit process lines, building in co-ordination mechanisms for the relevant personnel along the process so that plans could be validated and adjusted according to changed circumstances. Comments from those contributing to the case study indicated that, while not all the problems were solved, this was a big improvement on what had gone before.

It is when one examines what happens in the check process itself that one finds the consequences of problems of planning and also of the various supply chains that deliver tools, parts and personnel to the hangar floor. For those at the front-line (skilled technicians and frontline managers) problems of adequate supply of personnel, tools, parts, and time seem to be endemic, according to one study, whereas, ironically, problems of documentation and information are not seen to be a problem (presumably because these are not relied on for task performance). Management appear to spend a lot of time and effort ‘fire-fighting’ in a reactive manner, are not confident in managing their staff, and tend to engage in various protective stratagems (e.g. holding staff or tools) to protect the efficiency of their group but which also exacerbate instability elsewhere. There is also a certain ambiguity in the roles of technical staff – for example in relation to precise understanding of what is involved in supervising and signing off the work of others (only those appropriately qualified can certify that the work has been done according to the required procedures) (Ward et al., 2004).

Professionalism

In organisational analysis and diagnosis often the easy part is to identify the apparent imperfections, deficiencies and inconsistencies to which most organisations are subject. What is less obvious may be what keeps the system going at a high level of safety and reliability despite these endemic problems. In many systems, what is delivering operational resilience (flexibility to meet environmental demands adequately) is the ‘professionalism’ of front-line staff. In this context, professionalism perhaps refers to the ability to use one’s knowledge and experience to construct and sustain an adequate response to varying, often unpredictable and occasionally testing demands from the operational environment. Weick (2001) has written well about this, using the term ‘bricolage’ to convey the improvisational characteristics that are often employed. However, it is important to keep in mind the notion that such behaviour is a function of both the environmental demands on task performance and the ways in which the organisational system supports and guides task accomplishment (or, relatively, fails to do so). The way in which this dynamic relationship between action, system and environment is represented in professional cultural values is illustrated in a meta-analysis of some studies of aircraft maintenance personnel.

Abstracting from a number of surveys of aircraft maintenance personnel in different organisations led to a generalisation that pervasive core professional values included the following characteristics: strong commitment to safety; recognising the importance of team-working and co-ordination; valuing the use of one’s own judgement and not just following rules; being confident in one’s own abilities to solve problems; having a low estimate of one’s vulnerability to stress; and being reluctant to challenge the decisions of others. What does this tell us? The significance of these values suddenly becomes rather more salient when they are counterposed to some of the system characteristics of the maintenance organisations to which many of these personnel belong. As well as the problems of planning and supply outlined above, such organisations were not notable for their facility in solving problems (see below). Thus, these characteristic professional values seem, in many ways, to match the deficiencies found in such organisations. Essentially, and perhaps crudely, the professional values can be seen to reflect a philosophy that says: rely on your own resources, do not expect help or support and do not challenge, but above all get the job done and get it done safely. Seen in this light, professionalism compensates for organisational dysfunction; it provides resilience in a system that is otherwise too rigid. We have adopted the label ‘well-intentioned people in dysfunctional organisations’ to characterise this systematic pattern of organisational life.

Of course it does not always work out as intended. One of the prototypical anecdotal narratives from all around this industry goes like this. A serious incident happens in which an experienced technician is implicated – there was pressure to get the job done and the official rules and procedures were not followed. Management are very surprised that ‘one of the best technicians’ had committed, apparently intentionally, such a serious violation. It seems inexcusable, and incomprehensible, until one reflects on what it is that being ‘one of the best technicians’ actually involves. The role of such people is to use their knowledge and experience in the most judicious but effective way to get the job done. Questions are only asked about precisely what was done, following something going wrong. On other occasions informal, unofficial practices are everywhere practised, universally known about but absolutely deniable and hidden from official scrutiny.

Road Transport

There are many other examples of this syndrome. Some of our recent research concerns road goods delivery in urban environments (Corrigan et al., 2004).2 This urban multi-drop road transport operation is a highly competitive business that operates over a small radius. The multiple stops are arranged according to very tight schedules, which are dictated by the customer’s requirements, rather than the driver’s optimal logistics. Access is a major problem, with a lack of loading bays in the city centre, the absence of design requirements for ensuring good access, and, often, poor conditions at customer premises. The traffic system in which these drivers operate is congested, with complex one-way systems and delivery time restrictions. Clamping, fines and penalties are routine hazards. Coping with the multi-drop system reportedly cannot be done without traffic violations, particularly parking violations and speeding. Interestingly, in many companies, while the driver is liable for speeding penalties the company pays for parking violations. The constant time pressure leads to skipping breaks. Neither the working-time nor the driving-time regulations effectively control long working hours. Of course, the driver has to compensate for every problem, whether it relates to wrong orders, waiting for the customer, or delays due to congestion.

A high proportion of daily tasks are rated as risky by drivers. Risky situations are rated as high risk to self and others, high frequency, and high stress, and drivers have no control over them. For example, unloading while double parking in the street involves risks from other traffic, loading heavy goods, risks to pedestrians, and a rush (to avoid vehicle clampers). Professionalism is managing these risks to ensure as far as possible a good outcome. Again, arguably it is this characteristic of the operating core – the professionalism of drivers – which gives properties of resilience to a rigid system that is not optimally designed to support this operation.

Organisational Differences

This discussion has focused on what may be, to a greater or lesser extent, common characteristics of humans in organisational systems. It is clear however that organisations differ quite markedly. Our research in aircraft maintenance organisations explored some of these differences and, again taking a global view, has given rise to certain intuitive generalisations. For example, certain organisations devote more effort and resources than others to the planning process. It sometimes seems as if this commitment to planning comes with an expectation that the plans will work and a reluctance to acknowledge when things go wrong. On the other hand, those organisations who are less committed to planning, have to learn to be flexible in managing, largely reactively, the normal variation of operational demands. It is tempting to suggest that some of these differences may be influenced by regional organisational cultures – but this is the subject of ongoing research and it is too early to make confident generalisations.

Looking at the wider literature, a contrasting analysis of what may be highly resilient systems has been given by the High Reliability Organisations group. This emphasises, for example, distributed decision-making within a strong common culture of understanding the operational system and mutual roles within it. Descriptions of this work (for example Weick, 2001) emphasise the attributes of organisations and their people, which characterise the positive values of high reliability. But some attributes of organisations may have positive or negative connotations depending on the context or depending on the focus of the investigation. Thus, the notion of informal practices based on strong mutual understanding can be seen in the context of Weick’s analysis to be one of the characteristics of high reliability in organisations. In another context (for example Rasmussen, 1997; Reason, 1997), which emphasises the importance of procedures and standardisation, such practices can be seen to be the basis of systematic violations. A large part of this contrast may be differences in theoretical orientation with respect to the nature of performance, cognition, or intention. However, part of the contrast may also relate to real differences in organisational forms. It is impossible to know how much of this difference is due to fundamental organisational differences and how much is due simply to looking at organisations from a particular point of view.

At the operational level, therefore, resilience may be a function of the way in which organisations approach and manage the contradictory requirements of, on the one hand, good proceduralisation and good planning, and on the other hand, appropriate flexibility to meet the real demands of the operation as they present on any particular day. Different organisations and different industries will have different ways of managing these contradictions. However, from the point of view of our own research with aircraft maintenance organisations, one problem is that the ‘double standard’ of work between formally prescribed and unofficial ways of working is hidden. While everyone knows it is there, the specifics are not transparent or open to scrutiny or any kind of validation. It is clear that the unofficial way of working contains both better (quicker, easier, sometimes safer) and worse ways of working. Sometimes these informal practices serve to cover over and hide systematic deficiencies in the organisation’s processes. Thus, a major issue is the transparency of informal practices – are they open to validation or improvement? It is important that transparency is not just to colleagues but also to the organisation itself. Therefore we need to look at the capacity of organisations to deal with informal and flexible ways of working. One aspect of the necessary change concerns the capacity to change rules and procedures to deal with specific circumstances. There appear to be few examples of this. Hale et al. (Chapter 18) cite an example in a chemical company and Bourrier (1998) describes a process by which the maintenance procedures in one nuclear power plant (in contrast to some others) are revised to meet particular circumstances.

However, the issue is not just one of procedural revision, but of ensuring that all the resource requirements are met for conducting the operation well. Indeed, the adequacy of documentation and information systems may, on occasion, be seen to be the least critical resource. Thus, the issue is not just one of technical revision of documentation but of the capacity to adapt the organisation’s systems (including those of planning and supply) to optimally meet the requirements of the operation. Thus the ‘resilience’ of the operational level needs to be seen in the context of some oversight, together with the possibility of modification, change or development at a wider organisational level. Although this could be done in a variety of different ways it makes sense to ask: what is the role of the quality and safety functions in an organisation in maintaining or fostering resilience?

The Role of Quality and Safety in Achieving Resilience

The argument is, therefore, that to be resilient, the operational system has to be susceptible to change and improvement – but what is the source or origin of that change? From some points of view quality and safety management are about maintaining stability – assuring that a constant standard of work or output (process and product) is maintained. From another point of view these functions are about constant improvement and this implies change. Putting both of these together achieves something close to the provisional definition of resilience suggested earlier – achieving sufficient adaptation in order to maintain the integrity and stability of core functions. How does this actually work in practice? Again some of our studies of aircraft maintenance organisations can help point to some of the dimensions that may underlie either achieving, or not achieving, resilience. The point is not to hold up a model of perfect resilience (perhaps this does not exist), but to use evidence from real experience to try to understand what resilience as a property of an organisational system might involve.

The Role of Quality

The traditional way of ensuring quality in the aircraft maintenance industry has been through direct inspection, and in several countries within the European Joint Aviation Authorities system, and in the United States under the FAA, quality control through independent direct inspection of work done is still the norm. In some companies the quality inspectors or auditors are regarded as policemen, with a quasi-disciplinary function, while in others the role of auditing or inspection is seen to include a strongly supportive advisory role. However, the essence of the current European regulations is a quality assurance function – the signing off for work done (by oneself or others) by appropriately qualified staff. Fundamentally this is the bottom line of a hierarchical system based on the philosophy of self-regulation. In this, the National Authorities, acting under the collective authority of the JAA (and now of the European Aviation Safety Agency – EASA), approve an operator (maintenance organisation) on the basis that it has its internal management systems in place (responsible personnel, documented procedures, etc.). It is this system that is audited by the authorities on a periodic basis. How far does this system of regulation contain mechanisms that can foster change and improvement?

None of the quality systems we have examined contain a systematic way of monitoring or auditing actual operational practice. Thus, the ‘double standard’ between the official and the actual way of doing things becomes almost institutionalised. There is a paper chain of documentation that originates in the manufacturer’s master maintenance plan, which moves through the maintenance organisation (engineering, planning, scheduling, operations, quality) and then is audited by the national authority. This document chain of accountability only partially intersects with what actually happens on the hangar floor or airport ramp, because there is no effective mechanism for making transparent actual operational practice, let alone a system for actually reducing the distance between the formal requirements and actual practice.

It is a requirement for maintenance organisations to provide an opportunity to give feedback on quality problems. While this may be a good idea in theory, our evidence suggests that it is not working in practice. Not all of the organisations we studied had a fully operational quality feedback system in place. However, even where there was a system that gathered thousands of reports per year, there was little evidence that feedback actually influenced the operational processes which gave rise to the reported problems.

Improvement

To address this problem, a specific project was undertaken to examine the improvement process. This project was called AMPOS – Aircraft Maintenance Procedure Optimisation System. This was an action research project based on a simple idea. A generic methodology moved individual cases (suggestions for improvement in procedure or process) through the improvement cycle (involving both maintenance organisation and manufacturer). The methodology was instantiated in a software system and implemented by a team of people who managed these cases at different stages in both organisations. A set of cases was gathered and processed through the system.

Two very broad conclusions came out of this effort. First, that problems involving the human side of a socio-technical system are often complex and, even if conceptually tractable, are organisationally difficult to solve. Although it is nonsensical to separate out problems into ‘technical’ and ‘human’ categories, as a very rough heuristic it makes sense to think of a continuum of problems along a dimension from those that are primarily technical in nature to those that are primarily human. At the latter end, the potential solutions tend to be more multifaceted (for example involving procedure, operational process, training, etc.) and require significant change of the operational system and the supporting organisational processes. Furthermore, it was not always clear exactly how these problems could (or should) be solved or ameliorated at the design end – and it became clear that this was a much more long-term process.

The second conclusion was that each succeeding stage of the improvement cycle was more difficult than the last. Thus, while it was relatively easy to elicit suggestions for improvement, it was rather more difficult to facilitate a thorough analysis that supported a convincing set of recommendations for change. Such recommendations have to be feasible and practicable to implement, which creates yet more profound problems, and even when such recommendations are accepted by those responsible for implementing them, it is even more difficult to get these recommendations implemented, particularly those that were more complex and challenging. As for evaluating the effectiveness of the implementation, this turned out to be beyond the feasible timescale of the project (McDonald et al., 2001).

Response to Incidents

The messages here are much more general than this particular project, however, and similar fundamental problems can be seen in relation to the manner in which organisations manage their response to serious safety incidents. An analysis of a series of incidents of the same technical/operational nature in one airline posed the questions: what was done after each incident in order to prevent such an incident happening again? Why was this not sufficient to prevent the following incident? One interpretation of the changing pattern of response to each succeeding incident was that there seemed to be a gradual shift from a set of recommendations that were technically adequate to recommendations that were both technically and operationally adequate – they actually worked in practice. This case study furthermore demonstrated that it is extremely difficult to get everything right in practice, as unforeseen aspects often come around in unexpected ways to create new possibilities for incidents. In this case, each succeeding incident appeared to have challenged the mindset of the team responsible for developing an appropriate response, disconfirming initial assumptions of what was necessary to solve the problem, and in this case, eventually involving the operational personnel directly in finding an adequate solution (McDonald, 1999).

This particular case concerned an airline with an exemplary commitment to safety and well-developed safety infrastructure (a dedicated professional investigation team, for example). For other organisations who are not as far down the road in developing their safety system, the cycle of organisational activity is a rather shorter one. Some organisations have not been able to satisfactorily reconcile the organisational requirement to establish liability and to discipline those at fault with the idea that prevention should transcend issues of blame and liability. Others have adopted an uneasy compromise, with an official ‘no-blame’ policy being associated with, for example, the routine suspension of those involved in incidents, normally followed by retraining. A lot of organisational effort goes into these different ways of managing incidents. The more apparently serious or frequent the incident the more necessary it is for the organisation to be seen to be doing something serious to address the problem. Such organisational effort rarely seems to tackle the underlying problems giving rise to the incidents. On the contrary such ‘cycles of stability’ appear to be more about reassuring the various stakeholders (including the national authorities) that something commensurate is being done, than with really getting to the bottom of the problem.

The issue for resilience is this. We do not seem to have a strong, empirically based model of how organisations respond, effectively or otherwise, to serious challenges to their operational integrity, such as are posed by serious incidents. The evidence we have gathered seems to suggest that ‘organisational cycles of stability’ (in which much organisational effort goes, effectively, into maintaining the status quo) may be more the norm than the exception. Even in apparently the best organisations it can take several serious incidents before an operationally effective solution is devised and implemented. On a wider scale, there is a disturbing lack of evidence in the public domain about the implementation of recommendations following major public enquiries into large-scale system failures. One exception to this is the register of recommendations arising out of major railway accident enquiries maintained by the UK Health and Safety Executive. In this case, when and how and if such recommendations are implemented are publicly recorded. Thus, by and large, we really do not know how organisations respond to major crises and disasters.

The Management of Risk

The concept of resilience would seem to require both the capacity to anticipate and manage risks before they become serious threats to the operation, as well as being able to survive situations in which the operation is compromised, such survival being due to the adequacy of the organisation’s response to that challenge. We know far too little about the organisational, institutional and political processes that would flesh out such a concept of resilience. More examples and case studies are needed of the processes involved in proactive risk management and safety improvement. Such examples should encompass not only the identification of risk and prioritising different sources of risk, but also the assignment of responsibility for actively managing the risk, the implementation and evaluation of an action plan and monitoring the reduction of risk to an acceptable level.

In a local example of such an initiative, Griffiths & Stewart (2005) have developed a very impressive redesign of an airline rostering system to reduce the risks associated with fatigue and the management of rest and sleep. This case was based on a clear identification and measurement of the extent of the risk, leading to the implementation of a redesigned roster system together with the continued monitoring of a number of indices of risk to ensure that the sources of risk were effectively controlled.

A systemic approach to quality and safety can help address these problems. Analysing the organisational processes of quality and safety as a system which receives inputs, transforms these and produces outputs can help focus attention on what such processes deliver to the organisation in terms of improvement. Following this logic, we have produced generic maps of two quality and safety processes, based on the actual processes of several aircraft maintenance organisations (Pérezgonzález et al., 2004). Further to this, Pérezgonzález (2004) has modelled the contribution of a wide range of organisational processes to the management of safety.

However, these examples do not really encompass the complex processes of organisational change that appear to be necessary to create what one might call an adaptive resilient organisation.

The Problem of Organisational Change

Even apparently simple problems of human and organisational performance require complex solutions. As we have seen, operational environments are not always designed to optimally support the operator’s role. The key to understanding the possibilities of change concerns this relationship between system and action – how the organisation’s systems and processes are understood as constraining or allowing certain courses of action in order to meet the contingencies of a particular situation. However, even if we understand what may need to change, how to change poses a question of a different order of complexity. Even if it is accepted that there is a problem, it cannot be taken for granted that change will occur. The problem of organisational systems which are apparently highly resistant to change despite major failure has been particularly strongly brought to public focus in the establishment of enquiries into the Hatfield train accident in the UK and the Columbia shuttle disaster in the US.

From our evidence, for many organisations, inability to change may be the norm. We have described ‘cycles of stability’ in quality and safety, where much organisational effort is expended but little fundamental change is achieved. Professional and organisational culture, by many, if not most, definitions of culture, reinforces stasis. This active maintenance of stability is conveyed in the popular conception that culture is ‘the way we do things around here’. The complexity of operational systems and the coupling and linkages between different system elements are also factors that would seem to militate against a view that change should be easy or straightforward. From the point of view of resilience, that core of stability in the operational system and its supporting processes is undoubtedly an essential characteristic, enabling it to absorb the various stresses and demands that the environment throws up, without distorting its essential mission and operational integrity. Thus the argument is not that change is necessarily beneficial, per se, rather it is the appropriateness of change that is the issue. Resilience would seem to demand the possibility of adaptation and change to improve the quality and reliability of a system’s interaction with the environment and to meet a variety of challenges: to face new threats or demands from that environment, to incorporate new technologies that change the nature of the operating process, to rise to competitive pressures that constantly alter the benchmark of commercially successful performance, or to respond to changing public expectations of what is an appropriate level of risk to accompany a particular social gain.

The overwhelming weight of evidence we have accumulated in the aircraft maintenance and other industries is that change is necessary, but also that change is inherently difficult. The slow deliberate process of change to create more resilient systems is evident from Hutter’s (2001) analysis of the implementation of safety regulation through effective risk management in the British Railways system, following the major accidents at King’s Cross and Clapham Junction, and just prior to privatisation. Hutter identifies three phases of corporate responsiveness to safety regulation. In the Design and Establishment phase, risk management systems, procedures and rules are set up, committees established and specialists appointed. The main people involved at this stage are senior management and relevant specialists. In the Operational phase, these systems, procedures and rules are operationalised or implemented. Committees meet, audits are undertaken and rules are enforced. Those involved are primarily management at all levels and worker or community representatives. In the third phase – Normalisation – compliance with risk management rules and procedures is part of normal everyday life. Everyone in the corporation is involved, with both corporate understanding and individual awareness of risk (Hutter, 2001, p. 302). What is well described in Hutter’s book is the process of moving from the Design and Establishment Phase to the Operational Phase. However, the critical transition from the operational phase to the normalisation phase is not well understood, and in relation to the British railways, appears to have been made more difficult by the privatisation of a previously nationally owned industry. Thus there seems to be little evidence about the way in which organisations have made a transition towards a fully normalised self-regulatory risk management regime. 
This poses the central problem in our analysis of change in resilience – how to generate effective change to increase resilience in the operating core?

There are few empirically based models that describe the dimensions and dynamics of the organisational change process. One such model was developed by Pettigrew & Whipp (1991) from an analysis of organisational change in a variety of industries in the UK in the 1980s. One of the compelling features of this model is that it puts the processes of organisational change firmly in the context of the competitive pressures in the commercial environment that drive such change. This perspective leads to a powerful analysis that seeks to demonstrate how the various dimensions of the change process contribute to meeting that environmental challenge. It is also a model of quite radical change, which is also appropriate to our analysis of resilience. Indeed it can be seen as a prototypical study of commercial resilience. The five dimensions of change which are highlighted in this analysis are environmental assessment, linking strategic and operational change, leadership, the role of human resources as assets or liabilities and coherence. Interestingly, the first of these, environmental assessment, is perhaps the least satisfactory. Pettigrew and Whipp emphasise the subjective, intuitive way in which senior management assess their commercial environment, which contrasts with the aspirations to rationality of a risk management approach, as outlined above. What is particularly interesting is the analysis of how that strategic imperative, which arises out of environmental assessment, is linked to the particular form of operational change that will deliver competitive advantage. Successfully achieving such operational change requires both intensive and multifaceted leadership support at all levels, and a programme of human resources development to ensure that the people in the equation become assets, rather than liabilities, to the change process. 
The final dimension – coherence – recognises that successful organisational change is complex and requires considerable effort to ensure that the different strands and levels of the process actually do work together to support the same direction of change.

This model poses a core challenge for theories of resilience: to demonstrate not only how particular operational forms deliver outputs that have characteristics of resilience, but also what organisational requirements are necessary to implement these forms where they do not already exist.

Change in Technology

Arguably, in the long term, a more fundamental issue than that of organisational change is the issue of increasing resilience through technological innovation. To take air transport as an example, the impressive record of the aviation industry in dramatically improving its major accident rate per mile travelled over the last fifty years or so is commonly attributed to the advances in technological reliability of aircraft systems. Nevertheless, the human (pilot, air traffic controller) continues to make a predominant contribution to aviation accidents. Therefore it is logical to argue that the next generation of aviation technologies requires a step change in design for human use if a significant reduction is to be made in the persistence of human action as a contributor to accidents. The importance of this is emphasised by the pace of technological change, where the increasing potential for the integration of different subsystems (flight deck, air traffic control, airport ground control, for example) means that the roles of the human operator can potentially be radically re-engineered in a relatively short time. In such complex systems it is the human operator who plays the role of the critical interface between different subsystems. To achieve this, it is necessary to regard the human operator as a core system component (at all stages of the processes that make up the system) and whose role and functions need to be addressed from the very start of the design process.

Achieving a ‘system integration’ model of human factors, however, is difficult in practice as well as in theory. In many systems there is a considerable distance – professional, cultural, geographic – between design and operation. To take one example, those who write or are responsible for revising aircraft maintenance procedures are often surprised when they hear of routine non-compliance with such procedures. Such procedures are written to be followed and the idea that, from the technician’s point of view, there might be ‘better’ ways of doing the task than following the procedure to the letter often seems bizarre and irresponsible. Nevertheless, this behaviour makes sense to the technician. Maintenance manuals are designed to be comprehensive, accurate, up to date, and sufficiently detailed to allow an inexperienced mechanic to complete a task. This design requirement does not include being a user-friendly task support for the average technician. Designing information and documentation systems which more closely meet the way in which the task is actually performed as part of a maintenance process is a considerable challenge. The routine errors and violations uncovered by the LOSA methodology likewise pose a similar challenge to aircraft flight system designers. The same principle is true for those who design the road transport and goods handling infrastructure that is used by urban multi-drop drivers. The examples can be generalised to virtually any context of human activity within large complex technological systems. The question is: How to bring operator requirements to the forefront of the design process?

Currently we have been unable to find any convincing models of how user or operator needs can be constituted and put into play as a driver of innovation in large socio-technical systems. This seems to be a considerable gap in understanding how to develop resilience in new generation systems. This is therefore the topic of two current European projects, which address, amongst other things, the role of human factors in the aviation system.³ The basic idea is as follows. It is important to capture knowledge about the operator in the system, not only about those at the so-called ‘sharp end’ of the operation, but all along the operational process. Much of the relevant knowledge is tacit knowledge about how the system operates and how people act within it. This is knowledge that is rarely made explicit or written down, but which is commonly known. This knowledge has to be transformed to be of relevance as a source of potential design ideas. The model for this knowledge transformation process is the knowledge transformation project teams, described by Nonaka (2002), whose task is to develop new product ideas. Such teams have to have the requisite variety and redundancy of real operational expertise to ensure that good ideas can be generated and thoroughly explored. Knowledge transformation is not sufficient on its own, however. In a large distributed multi-organisation system, like aviation, such knowledge has to be shared between organisations. The model for this process is the theory of industrial innovation pioneered by Porter (1998) and Best (2001), who see innovation as coming from the interaction of clusters of organisations – in this case design and manufacturing organisations, operators and research and development organisations. 
Within our adaptation of this model, human factors research and development (both basic and applied developmental research) is fully incorporated into the innovation cycle, drawing its fundamental research topics from problems of innovation and change in system, process and technology. It is an ecological model of human factors research, which is grounded in operational reality, integrated in the systems, processes and technologies that structure the operation, and is interdisciplinary in setting the human requirements at the centre of a systems engineering perspective. The broad objective of this research is to model a system of innovation that can lead to the design of more resilient systems precisely because they are designed to meet operationally valid requirements derived from the actual practice of users. It is too early to see how far this will in fact be possible.

Conclusions – the Focus on Resilience

Resilience has been defined here in terms of a productive tension between stability and change. The basic stability and integrity of the system is an important dimension, as is the capacity to absorb major disturbances from the operating environment and to recover from failure. The notion of adaptation to the requirements of the operational environment implies the capacity to adapt and change in order to survive in a changing environment. The difficulty of understanding processes of adaptation and change is a recurring theme in this chapter.

The notion of resilience has to work on at least three levels – the operation (the individuals, group or team who work through the task and operational processes, with the relevant technology to produce the required result or output); the organisation (which incorporates, organises, co-ordinates, resources and in other ways supports the operations which produce the outputs that fulfil the organisation’s mission); and the industrial system (which designs and produces the technologies that make the operation possible). A truly resilient system should absorb, adapt, adjust and survive at all three levels. Requirements are different at each of these levels and in some ways may be contradictory. Thus, for example, systems that rely on informal flexibility at an operational level are not always transparent at an organisational level. The independence of quality and safety systems from operational and commercial influence has to be reconciled with the need for quality and safety functions to be actively engaged with improving the operation and its processes. The contradiction has to be resolved between the organisational imperative to change to adapt to new circumstances or new events, and the sheer organisational effort and difficulty of successfully engineering such change. Again, such change should not in turn disrupt the core stability of operational processes, which are the central requirement of resilience. There are also many barriers to sharing, at an industry level, critical knowledge about how an operation really happens. Putting such knowledge to work in improving the next generation of systems poses further, as yet unanswered, questions.

However, resilience is not just about being able to change (on the one hand) or maintaining stability (on the other). It is critically about the appropriateness of stability or change to the requirements of the environment or, more accurately, about the planning, enabling or accommodating of change to meet the requirements of the future environment (as anticipated and construed) in which the system operates. This poses two challenges – both conceptual and practical. How to conceptualise and formulate hypotheses about the relationships between operational systems and the demands of their environments? How to test these hypotheses over the appropriate timeframe with appropriate longitudinal design?

If resilience is to be a useful theoretical concept it has to generate the research that will identify the particular characteristics of the socio-technical system that do in fact give it resilience. This chapter has tried to flesh out some of the relevant dimensions, but at crucial points there is a lack of sound empirical evidence that is grounded in operational reality, is systemic (located in its technical and organisational context), dynamic (i.e., concerns stability and change over time) and ecological (i.e., concerns systems in their environment). Unless this evidence gap is addressed, the concept of organisational resilience is in danger of remaining either a post-hoc ascription of success, or a loose analogy with the domain of the mechanical properties of physical objects under stress, which allows certain insights but falls short of a coherent explanation.

1  The ADAMS, AMPOS, AITRAM and ADAMS2 projects were funded by the European Commission under the 4th and 5th Framework RTD Programmes.

2  Safety and Efficiency in Urban Multi-drop Delivery is part of the Centre for Transportation Research and Innovation for People (TRIP) at Trinity College Dublin, and is funded by the Higher Education Authority.

3  Technologies and Techniques for New Maintenance Concepts (TATEM) and Human Integration into the Life-cycle of Aviation Systems (HILAS) are funded by the European Commission under the 6th Framework Programme.

An Evil Chain Mechanism Leading to Failures

Yushi Fujita

Setting a higher goal for long-term system performance requires a variety of improvements such as improving yield, reducing operation time, and increasing reliability, which in turn leads to a variety of changes such as modifying the artifact (i.e., technical system), changing operating methods (either formally or informally), and relaxing safety criteria. These changes may implant potential hazards such as a more complicated artifact, more difficult operation, and smaller safety margins, which may further result in decreased reliability, decreased safety, and decreased long-term performance. The end results may be long interruptions or even the termination of business. It is important to understand that adaptive and proactive human behaviors often act as risk-taking compensation for the potential hazards behind the scenes, even though they are often admired as professional jobs. A resilient system can cut this evil chain by detecting and eliminating potentially hazardous side effects caused by changes.
