Chapter 5

Reliability Modeling Techniques

Abstract

This chapter explains the techniques of quantified reliability prediction and is condensed from Reliability Maintainability and Risk, 8th edition, David J Smith, Butterworth Heinemann (ISBN 978-0-08-096902-2).

Keywords

Auto/proof test; BETAPLUS; HAZOP; HEART; Modeling; “V” model
 

5.1. Failure Rate and Unavailability

In Chapter 1, we saw that both failure rate (λ) and probability of failure on demand (PFD) are parameters of interest. Since unavailability is the probability of being in a failed state at a randomly chosen moment, it is the same as the PFD.
PFD is dimensionless and is given by:

PFD = Unavailability = λMDT/(1 + λMDT) ≈ λMDT

where λ is the failure rate and MDT is the mean down time (in consistent units). Usually λ × MDT ≪ 1, so PFD ≈ λMDT.
For revealed failures the MDT consists of the active mean time to repair (MTTR) PLUS any logistic delays (e.g., travel, site access, spares procurement, administration). For unrevealed failures the MDT is related to the proof-test interval (T), PLUS the active MTTR, PLUS any logistic delays. The way in which failure is defined determines, to some extent, what is included in the down time. If the unavailability of a process is confined to failures while production is in progress, then outage due to scheduled preventive maintenance is not included in the definition of failure. However, the definition of dormant failures of redundant units affects the overall unavailability (as calculated by the equations in the next section).
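As a simple illustration of these definitions, the following Python sketch (all numerical values are hypothetical) computes the unavailability of a single item for a revealed and an unrevealed failure, taking the unrevealed down time as half the proof-test interval plus the repair and logistic times.

```python
# Minimal sketch (hypothetical values): unavailability (PFD) of a single item.
# Revealed failures:   MDT = active MTTR + logistic delays
# Unrevealed failures: MDT ~ T/2 + active MTTR + logistic delays (T = proof-test interval)

FAILURE_RATE = 5e-6           # per hour (hypothetical)
MTTR = 8.0                    # hours of active repair (hypothetical)
LOGISTIC_DELAY = 16.0         # hours of travel, spares, access (hypothetical)
PROOF_TEST_INTERVAL = 8760.0  # hours (one year)

def unavailability(failure_rate: float, mdt: float) -> float:
    """PFD = lambda*MDT / (1 + lambda*MDT); ~ lambda*MDT when lambda*MDT << 1."""
    return failure_rate * mdt / (1.0 + failure_rate * mdt)

mdt_revealed = MTTR + LOGISTIC_DELAY
mdt_unrevealed = PROOF_TEST_INTERVAL / 2.0 + MTTR + LOGISTIC_DELAY

print("Revealed PFD:  ", unavailability(FAILURE_RATE, mdt_revealed))
print("Unrevealed PFD:", unavailability(FAILURE_RATE, mdt_unrevealed))
```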

5.2. Creating a Reliability Model

For any reliability assessment to be meaningful it is vital to address a specific system failure mode. Predicting the “spurious shutdown” frequency of a safety (shutdown) system will involve a different logic model and different failure rates from predicting the probability of “failure to respond.”
To illustrate this, consider the case of a duplicated shutdown system whereby the voting arrangement is such that shutdown takes place if either subsystem recognizes a valid shutdown requirement (in other words "one out of two" voting).
When modeling the “failure to respond” event the “one out of two” arrangement represents redundancy and the two subsystems are said to be “parallel” in that they both need to fail to cause the event. Furthermore the component failure rates used will be those which lead to ignoring a genuine signal. On the other hand, if we choose to model the “spurious shutdown” event the position is reversed and the subsystems are seen to be “series” in that either failure is sufficient to cause the event. Furthermore the component failure rates will be for the modes which lead to a spurious signal.
The two most commonly used modeling methods are reliability block diagram analysis (RBD) and fault tree analysis.

5.2.1. Block Diagram Analysis

5.2.1.1. Basic equations

Using the above example of a shutdown system, the concept of a series RBD applies to the “spurious shutdown” case (Figure 5.1).
image
Figure 5.1 Series RBD.
The two subsystems (a and b) are described as being “in series” since either failure causes the system failure in question. The mathematics of this arrangement is simple. We ADD the failure rates (or unavailabilities) of series items. Thus:

λ(system) = λ(a) + λ(b)

and

PFD(system) = PFD(a) + PFD(b)

However, the “failure to respond” case is represented by the parallel block diagram model (Figure 5.2) where both units need to fail.
image
Figure 5.2 Parallel (redundant) RBD.
The mathematics is dealt with in "Reliability Maintainability and Risk." However, the traditional results given prior to edition 7 of "Reliability Maintainability and Risk," and in the majority of text books and standards, were challenged in 2002 by K G L Simpson. It is now generally acknowledged that the traditional MARKOV model does not correctly represent the normal repair activities for redundant systems. The Journal of The Safety and Reliability Society, Volume 22, No 2, Summer 2002, published a paper by W G Gulland which agreed with those findings.
Software packages, such as ESC's SILComp®, provide a user-friendly interface for reliability modeling and automatically generate RBDs based on the voting configurations specified by the user. SILComp® also has a sensitivity analysis tool which allows the user to easily optimize test intervals and strategies.
Tables 5.1 and 5.2 provide the failure rate and unavailability equations for simplex and parallel (redundant) identical subsystems for revealed failures having a mean down time of MDT. However, it is worth mentioning that, as with all redundant systems, the total system failure rate (or PFD) will be dominated by the effect of common cause failure dealt with later in this chapter.
The results are as follows:
 

Table 5.1

System failure rates (revealed).

icon
 

Table 5.2

System unavailabilities (revealed).

icon
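Since Tables 5.1 and 5.2 are reproduced as images, the following Python sketch merely illustrates the style of calculation involved, using the series addition rule given in the text and the traditional textbook results for a duplicated (1oo2) pair with revealed failures (system failure rate 2λ²MDT, as quoted in the CCF discussion later in this chapter, and unavailability (λMDT)²). The revised tables may use slightly different coefficients, and the numerical values here are purely hypothetical.

```python
# Minimal sketch (hypothetical values). Revealed failures, identical units, MDT in hours.
LAMBDA = 10e-6   # failure rate per hour (10 per million hours, hypothetical)
MDT = 24.0       # mean down time, hours (hypothetical)

# Series (either failure causes the system failure): rates and PFDs simply add.
lambda_series = LAMBDA + LAMBDA
pfd_series = LAMBDA * MDT + LAMBDA * MDT

# Duplicated 1oo2 pair (both must fail): traditional revealed-failure results.
lambda_1oo2 = 2 * LAMBDA**2 * MDT   # system failure rate
pfd_1oo2 = (LAMBDA * MDT) ** 2      # system unavailability

print(f"Series: rate = {lambda_series:.3e}/h, PFD = {pfd_series:.3e}")
print(f"1oo2:   rate = {lambda_1oo2:.3e}/h, PFD = {pfd_1oo2:.3e}")
```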

Allowing for revealed and unrevealed failures

Unrevealed failures will eventually be revealed by some form of auto test or proof test. Whether manually scheduled or automatically initiated (e.g., auto test using programmable logic) there will be a proof-test interval, T. Tables 5.3 and 5.4 provide the failure rate and unavailability equations for simplex and parallel (redundant) identical subsystems for unrevealed failures having a proof-test interval, T. The MTTR is assumed to be negligible compared with T.
 

Table 5.3

Failure rates (unrevealed).

icon
 

Table 5.4

Unavailabilities (unrevealed).

icon
In IEC 61508 the following nomenclature is used to differentiate between failure rates which are either:
• Revealed or Unrevealed.
• The failure mode in question or some other failure mode.
The term “dangerous failures” is coined for the “failure mode in question” and the practice has spread widely. It is, in the authors' opinion, ambiguous. Whilst it is acknowledged that the term “dangerous” means in respect of the hazard being addressed, it nevertheless implies that the so-called “safe” failures are not hazardous. They may well be (and often are) hazardous in some other respect. A spurious shutdown (so-called safe) failure may well put stress on a plant or even encourage operators to override the safety function.
The practice has become as follows:
λdd to mean failure rate of the revealed “dangerous failures”
λdu to mean failure rate of the unrevealed “dangerous failures”
λsd to mean failure rate of the revealed “safe failures”
λsu to mean failure rate of the unrevealed “safe failures”
Tables 5.1–5.4 model the system assuming that there is only one fault detection mechanism, i.e., self test or proof test. They are slight approximations but are, having regard to the accuracy of reliability assessment, sufficient for most purposes. If, however, for a particular device some faults are found by self test and others by proof test, then in addition to the equations given in Tables 5.1–5.4 there will be additional terms for undetected combined with detected failures, as shown in Tables 5.5 and 5.6. For electronic devices it is quite common for the final result to be dominated by the undetected faults along with the common cause failure (if considering a redundant configuration). The ESC software (SILComp®) models the full equations.

Allowing for “large” values of λT

In the vast majority of cases the value λT is very much less than one, in which case, having regard to the accuracy of this type of assessment, the above equations provide adequate approximations. However, for completeness, the following addresses the occasions when λT is not small.
Assuming failures of each of the devices follow the "normal" (i.e., constant failure rate) distribution, then
Unreliability (PFD) = 1 − e^(−λt)
However, if λt ≪ 1 then this equation simplifies to:
Unreliability (PFD) = λt
This approximation has been assumed in the derivation of the equations given in Tables 5.1–5.6.

Table 5.5

Additional terms (failure rates) to be added to Tables 5.1 and 5.3.

Number of units versus number required to operate:

2 units, 1 required (1oo2): λduλddT + λduλddMDT
3 units, 1 required (1oo3): λdu²λddT² + 3.5λdu²λddT×MDT + 3.5λduλdd²T×MDT + λduλdd²MDT²
3 units, 2 required (2oo3): λduλddMDT + λduλddT
4 units, 1 required (1oo4): 4.8λdu³λddT²×MDT + λdu³λddT³ + 7.7λdu²λdd²T×MDT² + 4.8λdu²λdd²T²×MDT + λduλdd³MDT³ + 7.7λduλdd³MDT²×T
4 units, 2 required (2oo4): 14λdu²λddT×MDT + 14λduλdd²T×MDT + λdu²λddT² + 12λduλdd²MDT²
4 units, 3 required (3oo4): 12λduλddMDT + λduλddT

Table 5.6

Additional terms (PFD) to be added to Tables 5.2 and 5.4.

Number of units versus number required to operate:

2 units, 1 required (1oo2): 1.2λduλddT×MDT
3 units, 1 required (1oo3): 1.2λdu²λddT²×MDT + 1.9λduλdd²T×MDT²
3 units, 2 required (2oo3): 3.5λduλddT×MDT
4 units, 1 required (1oo4): 1.2λdu³λddT³×MDT + 2.7λdu²λdd²T²×MDT² + 2.7λduλdd³T×MDT³
4 units, 2 required (2oo4): 4.8λdu²λddT²×MDT + 7.7λduλdd²MDT²×T
4 units, 3 required (3oo4): λduλddT×MDT

However, if this assumption is not valid then the following applies, where λs is the system failure rate for the configuration in question:

PFDAVG = (1/T) ∫₀ᵀ (1 − e^(−λs·t)) dt

This integrates to the following:

PFDAVG = (1/T) [e^(−λs·t)/λs + t] evaluated between t = 0 and t = T

Substituting the limits provides the following:

PFDAVG = (1/T) {[e^(−λs·T)/λs + T] − [e^(−λs·0)/λs + 0]}

PFDAVG = e^(−λs·T)/(λs·T) + 1 − 1/(λs·T)

PFDAVG = (e^(−λs·T) − 1)/(λs·T) + 1

for 1oo2: λs = λd²T
for 1oo3: λs = λd³T²
for 2oo3: λs = 3λd²T

where:

λd = λdd + λdu

and

T = (λdu/λd)(Tp + MRT) + (λdd/λd)MRT

The ESC software (SILComp®) models the full equations.
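To illustrate when the full expression matters, the short sketch below evaluates the averaged result PFDAVG = (e^(−λsT) − 1)/(λsT) + 1 derived above and compares it with the usual small-λsT approximation of λsT/2, for a range of hypothetical λsT values.

```python
import math

def pfd_avg_exact(lam_s: float, t: float) -> float:
    """Averaged PFD over the interval: (exp(-lam_s*T) - 1)/(lam_s*T) + 1."""
    x = lam_s * t
    return (math.exp(-x) - 1.0) / x + 1.0

def pfd_avg_approx(lam_s: float, t: float) -> float:
    """Usual small lam_s*T approximation: lam_s*T/2."""
    return lam_s * t / 2.0

# Hypothetical values of lam_s*T; T is taken as 1 so that lam_s*T = lam_s.
for x in (0.01, 0.1, 0.5, 1.0):
    print(f"lam_s*T = {x:4}: exact = {pfd_avg_exact(x, 1.0):.4f}, "
          f"approx = {pfd_avg_approx(x, 1.0):.4f}")
```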

Effect of staggered proof test

The equations in Tables 5.1–5.6 assume that, at every proof test, all the elements are tested at the same time. An alternative method of testing is to stagger the test of each redundant element. Assume, for a two-element redundant system, that element "A" starts new at time zero, runs for half the normal proof-test period, then a proof test is completed on element "B," then after a further half of the normal proof-test period element "A" is tested, and so on.
The impact of this, on a system with redundant elements, will be to decrease the time taken to detect both a common cause failure and a coincident failure of the redundant elements (Table 5.7); for example, for a 1oo2 system the average time to detect a common cause failure will be halved.

Table 5.7

Factor to multiply by the PFD.

Configuration: factor for the dangerous undetected CCF PFD / factor for the dangerous undetected system PFD
1oo2: 0.5 / 0.6
1oo3: 0.3 / 0.4

Allowing for imperfect proof tests

Up to this point the formulae given in this chapter assume that 100% of dormant faults are detected by the proof test. However, if this is not the case, then the T in Tables 5.3–5.5 needs to be replaced with TE as follows:
• Let X be the percentage of the dangerous undetected failure rate that is revealed at the proof test interval T1, and
• (100 − X) be the percentage of the dangerous undetected failure rate that is eventually revealed at period T2, where
• T2 might be the period between major overhauls or even the period between unit replacements.
TE = T1 × X/100 + T2 × (100 − X)/100
Example: let T1 = 1 year and T2 = 10 years with X = 90%.
Then TE = 0.9 + 1 = 1.9 years.
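The TE calculation is easily automated; the following minimal sketch simply reproduces the worked example above.

```python
def effective_proof_test_interval(t1: float, t2: float, coverage_pct: float) -> float:
    """TE = T1 * X/100 + T2 * (100 - X)/100, with X the proof-test coverage in percent."""
    return t1 * coverage_pct / 100.0 + t2 * (100.0 - coverage_pct) / 100.0

# Worked example from the text: T1 = 1 year, T2 = 10 years, X = 90%
print(effective_proof_test_interval(1.0, 10.0, 90.0))  # 1.9 years
```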
Typical guidance for the coverage that can be claimed is given in Table 5.8.

Table 5.8

Proof test coverage (i.e., effectiveness), for the process industry.

98%: Detailed written PT procedure for each SIF, process variable manipulated and executive action confirmed, staff training (applied to the whole SIS loop, e.g., sensor and valve).
95%: General written PT procedures, process variable manipulated and executive action confirmed, some staff training (applied to the whole SIS loop, e.g., sensor and valve).
90%: Some written PT procedures, some uncertainty in the degree of full testing, no records of adequate staff training, or the SIF is complex making it difficult to test over the full range of parameters (applied to the whole SIS loop, e.g., sensor and valve).
80%: PT coverage for the valve only, tight shut-off required but not fully confirmed (applied to the valve only).
50%: PT coverage for the sensor only, where only the electrical signal is manipulated (applied to the sensor only).

An estimate of the typical interval (T2) for the non proof-test coverage is 10 years, unless another interval can be supported.

The ESC software package SILComp® will allow all the above factors to be readily accounted for, see end flier.

Partial stroke testing

Another controversial technique is known as partial stroke testing. This arose due to the inconvenience of carrying out a full proof test on some types of large shutdown valve, where the full shutdown causes significant process disruption and associated cost. The partial stroke technique is to begin the closure movement and (by limit switch or pressure-change sensing) abort the closure before the actuator has moved any significant distance. Clearly this is not a full proof test but, it is argued, there is some testing of the valve closure capability.
The controversy arises from two areas:
• Argument about the effectiveness of the test
• Argument concerning possible damage/wear arising from the test itself
The literature reveals that partial stroke testing can reveal approximately 70% (arguably 40–80%) of faults. However, this is widely debated and there are many scenarios and valve types where this is by no means achieved. The whole subject is dealt with in depth (including a literature search) in Technis Guidelines T658.
In brief, Table 5.9 shows the elements of PFD in a series model (an OR gate in a fault tree) where:
• λ is the failure rate of the valve
• PSI is the partial stroke interval (typically 2 weeks)
• PTI is the proof test interval (typically a year)
• MTTR is the mean time to repair
• DI is the real demand interval on the valve function (typically 10 years)
• 75% is one possible value of partial stroke effectiveness (debateable)
• 95% is one possible value of proof test effectiveness (debateable)
• 98% is a credible reliability of the partial stroke test initiating function
Despite optimistic claims by some, obtaining SIL 3 from partial stroke testing of a single valve is not that easy. Technis Guidelines T658 show that SIL 2 is a more likely outcome.

Table 5.9

Partial stroke equations.

No partial stroke test: revealed by partial stroke test (PFD), n/a; revealed by proof test (PFD), 95% × λ × PTI/2; revealed by "demand" (ESD or real) (PFD), (1 − 95%) × λ × DI/2.
Partial stroke test: revealed by partial stroke test (PFD), 75% × 98% × λ × (MTTR + PSI/2); revealed by proof test (PFD), (1 − [75% × 98%]) × 95% × λ × PTI/2; revealed by "demand" (ESD or real) (PFD), remainder × λ × DI/2.

Note: "remainder" is given as 1 − [75% × 98%] − [(1 − [75% × 98%]) × 95%].
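As an illustration, the sketch below evaluates the Table 5.9 contributions with and without partial stroke testing, using the typical intervals and effectiveness figures listed above; the valve failure rate and MTTR used here are hypothetical.

```python
# Minimal sketch of the Table 5.9 contributions (times in hours; values hypothetical).
LAMBDA = 2e-6        # dangerous failure rate of the valve, per hour (hypothetical)
PSI = 2 * 7 * 24.0   # partial stroke interval, ~2 weeks
PTI = 8760.0         # proof test interval, ~1 year
DI = 10 * 8760.0     # real demand interval, ~10 years
MTTR = 8.0           # mean time to repair (hypothetical)
PST_EFF, PT_EFF, PST_INIT = 0.75, 0.95, 0.98  # effectiveness figures from the text

# Without partial stroke testing
pfd_no_pst = PT_EFF * LAMBDA * PTI / 2 + (1 - PT_EFF) * LAMBDA * DI / 2

# With partial stroke testing
pst_cover = PST_EFF * PST_INIT                        # 75% x 98%
remainder = 1 - pst_cover - (1 - pst_cover) * PT_EFF  # left only to a real demand
pfd_pst = (pst_cover * LAMBDA * (MTTR + PSI / 2)
           + (1 - pst_cover) * PT_EFF * LAMBDA * PTI / 2
           + remainder * LAMBDA * DI / 2)

print(f"PFD without partial stroke test: {pfd_no_pst:.2e}")
print(f"PFD with partial stroke test:    {pfd_pst:.2e}")
```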

5.2.2. Common Cause Failure (CCF)

Whereas simple models of redundancy assume that failures are both random and independent, common cause failure (CCF) modeling takes account of failures which are linked, due to some dependency, and therefore occur simultaneously or, at least, within a sufficiently short interval as to be perceived as simultaneous.
Two examples are:
(a) The presence of water vapor in gas causing two valves to seize due to icing. In this case the interval between the two failures might be of the order of days. However, if the proof-test interval for this dormant failure is two months then the two failures will, to all intents and purposes, be simultaneous.
(b) Inadequately rated rectifying diodes on identical twin printed circuit boards failing simultaneously due to a voltage transient.
Typically, causes arise from:
(a) Requirements: incomplete or conflicting
(b) Design: common power supplies, software, emc, noise
(c) Manufacturing: batch-related component deficiencies
(d) Maintenance/operations: human induced or test equipment problems
(e) Environment: temperature cycling, electrical interference, etc.
Defenses against CCF involve design and operating features which form the assessment criteria given in Appendix 3.
CCFs often dominate the unreliability of redundant systems by virtue of defeating the random coincident failure feature of redundant protection. Consider the duplicated system in Figure 5.2. The failure rate of the redundant element (in other words the coincident failures) can be calculated using the formula developed in Table 5.1, namely 2λ²MDT. Typical failure rate figures of 10 per million hours (10⁻⁵ per hr) and 24 hrs down time lead to a failure rate of 2 × 10⁻¹⁰ × 24 = 0.0048 per million hours. However, if only one failure in 20 is of such a nature as to affect both channels and thus defeat the redundancy, it is necessary to add the series element, shown as λ2 in Figure 5.3, whose failure rate is 5% × 10⁻⁵ per hr = 0.5 per million hours, which is two orders of magnitude more frequent. The 5% used in this example is known as a BETA factor. The effect is to swamp the redundant part of the prediction and it is thus important to include CCF in reliability models. This sensitivity of system failure to CCF places emphasis on the credibility of CCF estimation and thus justifies efforts to improve the models.
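The arithmetic in the paragraph above can be checked with the short sketch below, using the same figures.

```python
# Reproducing the arithmetic above (failure rates per hour).
LAMBDA = 10e-6   # 10 per million hours
MDT = 24.0       # hours
BETA = 0.05      # 5% common cause proportion

coincident_rate = 2 * LAMBDA**2 * MDT   # redundant (random coincident) element
ccf_rate = BETA * LAMBDA                # series CCF element

print(f"Coincident failures: {coincident_rate * 1e6:.4f} per million hours")  # 0.0048
print(f"CCF element:         {ccf_rate * 1e6:.2f} per million hours")         # 0.50
print(f"CCF is more frequent by a factor of ~{ccf_rate / coincident_rate:.0f}")
```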
image
Figure 5.3 Reliability block diagram showing CCF.
In Figure 5.3, (λ1) is the failure rate of a single redundant unit and (λ2) is the CCF rate such that (λ2) = β(λ1) for the BETA model, which assumes that a fixed proportion of the failures arise from a common cause. The contributions to BETA are split into groups of design and operating features which are believed to influence the degree of CCF. Thus the BETA multiplier is made up by adding together the contributions from each of a number of factors within each group. This Partial BETA model (as it is therefore known) involves the following groups of factors, which represent defenses against CCF:
- Similarity (Diversity between redundant units reduces CCF)
- Separation (Physical distance and barriers reduce CCF)
- Complexity (Simpler equipment is less prone to CCF)
- Analysis (FMEA and field data analysis will help to reduce CCF)
- Procedures (Control of modifications and of maintenance activities can reduce CCF)
- Training (Designers and maintainers can help to reduce CCF by understanding root causes)
- Control (Environmental controls can reduce susceptibility to CCF, e.g., weather proofing of duplicated instruments)
- Tests (Environmental tests can remove CCF prone features of the design, e.g., emc testing)
The Partial BETA model is assumed to be made up of a number of partial βs, each contributed to by the various groups of causes of CCF. β is then estimated by reviewing and scoring each of the contributing factors (e.g., diversity, separation).
The BETAPLUS model has been developed from the Partial Beta method because:
- It is objective and maximizes traceability in the estimation of BETA. In other words the choice of checklist scores, when assessing the design, can be recorded and reviewed.
- It is possible for any user of the model to develop the checklists further to take account of any relevant failure causal factors that may be perceived.
- It is possible to calibrate the model against actual failure rates, albeit with very limited data.
- There is a credible relationship between the checklists and the system features being analyzed. The method is thus likely to be acceptable to the nonspecialist.
- The additive scoring method allows the partial contributors to β to be weighted separately.
- The β method acknowledges a direct relationship between (λ2) and (λ1) as depicted in Figure 5.3.
- It permits an assumed “nonlinearity” between the value of β and the scoring over the range of β.
The BETAPLUS model includes the following enhancements:

(a). Categories of factors

Whereas existing methods rely on a single subjective judgment of score in each category, the BETAPLUS method provides specific design and operationally related questions to be answered in each category.

(b). Scoring

The maximum score for each question has been weighted by calibrating the results of assessments against known field operational data.

(c). Taking account of diagnostic coverage

Since CCF is not simultaneous, an increase in auto-test or proof-test frequency will reduce β since the failures may not occur at precisely the same moment.

(d). Subdividing the checklists according to the effect of diagnostics

Two columns are used for the checklist scores. Column (A) contains the scores for those features of CCF protection which are perceived as being enhanced by an increase in diagnostic frequency. Column (B), however, contains the scores for those features believed not to be enhanced by an improvement in diagnostic frequency. In some cases the score has been split between the two columns, where it is thought that some, but not all, aspects of the feature are affected (See Appendix 3).

(e). Establishing a model

The model allows the scoring to be modified by the frequency and coverage of diagnostic test. The (A) column scores are modified by multiplying by a factor (C) derived from diagnostic related considerations. This (C) score is based on the diagnostic frequency and coverage. (C) is in the range 1–3. A factor “S,” used to derive BETA, is then estimated from the RAW SCORE:

S = RAW SCORE = (ΣA × C) + ΣB

(f). Nonlinearity

There are currently no CCF data to justify departing from the assumption that, as BETA decreases (i.e., improves), successive improvements become proportionately harder to achieve. Thus the relationship of the BETA factor to the RAW SCORE [(ΣA × C) + ΣB] is assumed to be exponential and this nonlinearity is reflected in the equation which translates the raw score into a BETA factor.

(g). Equipment type

The scoring has been developed separately for programmable and non-programmable equipment, in order to reflect the slightly different criteria which apply to each type of equipment.

(h). Calibration

The model has been calibrated against field data.
Scoring criteria were developed to cover each of the categories (i.e., separation, diversity, complexity, assessment, procedures, competence, environmental control, and environmental test). Questions have been assembled to reflect the likely features which defend against CCF. The scores were then adjusted to take account of the relative contributions to CCF in each area, as shown in the author's data. The score values have been weighted to calibrate the model against the data.
When addressing each question (in Appendix 3) a score less than the maximum of 100% may be entered. For example, in the first question, if the judgment is that only 50% of the cables are separated then 50% of the maximum scores (15 and 52) may be entered in each of the (A) and (B) columns (7.5 and 26).
The checklists are presented in two forms (listed in Appendix 3) because the questions applicable to programmable equipments will be slightly different to those necessary for non-programmable items (e.g., field devices and instrumentation).
The headings (expanded with scores in Appendix 3) are:
(1) Separation/Segregation
(2) Diversity
(3) Complexity/Design/Application/Maturity/Experience
(4) Assessment/Analysis and Feedback of Data
(5) Procedures/Human Interface
(6) Competence/Training/Safety Culture
(7) Environmental Control
(8) Environmental Testing
Assessment of the diagnostic interval factor (C)
In order to establish the (C) score, it is necessary to address the effect of diagnostic frequency. The diagnostic coverage, expressed as a percentage, is an estimate of the proportion of failures which would be detected by the proof test or auto test. This can be estimated by judgment or, more formally, by applying FMEA at the component level to decide whether each failure would be revealed by the diagnostics.
An exponential model is used to reflect the increasing difficulty in further reducing BETA as the score increases. This is reflected in the following equation which is developed in Smith D J, 2000, “Developments in the use of failure rate data”:

β = 0.3 exp(−3.4S/2624)
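A minimal sketch of this calculation is given below, combining the raw score equation S = (ΣA × C) + ΣB with the exponential BETA equation above; the checklist scores and C factor shown are hypothetical and for illustration only.

```python
import math

def betaplus_beta(score_a: float, score_b: float, c_factor: float) -> float:
    """BETAPLUS: S = (sum A x C) + sum B, then beta = 0.3 * exp(-3.4 * S / 2624).

    score_a, score_b: summed checklist scores from columns (A) and (B);
    c_factor: diagnostic-related multiplier in the range 1-3.
    """
    s = score_a * c_factor + score_b
    return 0.3 * math.exp(-3.4 * s / 2624.0)

# Hypothetical scores for illustration only
print(f"beta = {betaplus_beta(score_a=400.0, score_b=600.0, c_factor=2.0):.3f}")
```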

However, the basic BETA model applies to simple “one out of two” redundancy. In other words, with a pair of redundant items the “top event” is the failure of both items. However, as the number of voted systems increases (in other words N > 2) the proportion of common cause failures varies and the value of β needs to be modified. The reason for this can be understood by thinking about two extreme cases:
1 out of 6
In this case only one out of the six items is required to work and up to five failures can be tolerated. Thus, in the event of a common cause failure, five more failures need to be provoked by the common cause. This is less likely than the “one out of two” case and β will be smaller (see tables below).
5 out of 6.
In this case five out of the six items are required to work and only one failure can be tolerated. Thus, in the event of a common cause failure, there are five items to which the common cause failures could apply. This is more likely than the “one out of two” case and β will be greater (see tables below).
This is an area of much debate. There are no empirical data and the models are a matter of conjecture based on the opinions of various contributors; there is not a great deal of consistency between the various suggestions, so it remains a controversial and uncertain area. The original suggestions came from a SINTEF paper (2006), which provided the MooN factors originally used in version 3.0 of the Technis BETAPLUS package. The SINTEF paper was revised in 2010 and again in 2013. The IEC 61508 (2010) guidance is similar but not identical (Table 5.10). The SINTEF (2013) values are shown in Table 5.11. The BETAPLUS (now version 4.0) compromise is shown in Appendix 3.

Table 5.10

BETA(MooN) factor IEC 61508.

        M = 1   M = 2   M = 3   M = 4
N = 2   1
N = 3   0.5     1.5
N = 4   0.3     0.6     1.75
N = 5   0.2     0.4     0.8     2

Table 5.11

BETA(MooN) factor SINTEF(2013).

        M = 1   M = 2   M = 3   M = 4
N = 2   1
N = 3   0.5     2
N = 4   0.3     1.1     2.8
N = 5   0.2     0.8     1.6     3.6
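The sketch below simply applies the IEC 61508 multipliers of Table 5.10 to a base (1oo2) BETA value, as described above; the base β of 5% is hypothetical.

```python
# Applying the IEC 61508 MooN multipliers of Table 5.10 to a base (1oo2) beta.
IEC_61508_MOON_FACTOR = {
    (1, 2): 1.0,
    (1, 3): 0.5, (2, 3): 1.5,
    (1, 4): 0.3, (2, 4): 0.6, (3, 4): 1.75,
    (1, 5): 0.2, (2, 5): 0.4, (3, 5): 0.8, (4, 5): 2.0,
}

def beta_moon(base_beta: float, m: int, n: int) -> float:
    """Scale the 1oo2 beta by the (MooN) factor from Table 5.10."""
    return base_beta * IEC_61508_MOON_FACTOR[(m, n)]

base = 0.05  # hypothetical 1oo2 beta of 5%
for (m, n) in sorted(IEC_61508_MOON_FACTOR, key=lambda mn: (mn[1], mn[0])):
    print(f"{m}oo{n}: beta = {beta_moon(base, m, n):.3f}")
```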

5.2.3. Fault Tree Analysis

Whereas the RBD provides a graphical means of expressing redundancy in terms of "parallel" blocks representing successful operation, fault tree analysis expresses the same concept in terms of paths of failure. The system failure mode in question is referred to as the Top Event and the paths of the tree represent combinations of event failures leading to the Top Event. The underlying mathematics is exactly the same. Figure 5.4 shows the OR gate which is equivalent to Figure 5.1 and the AND gate which is equivalent to Figure 5.2.
image
Figure 5.4 Series and parallel equivalent to AND and OR.
Figure 5.5 shows a typical fault tree modeling the loss of fire water arising from the failure of a pump, a motor, the detection, or the combined failure of both power sources.
In order to allow for common cause failures in the fault tree model, additional gates are drawn as shown in the following examples. Figure 5.6 shows the RBD of Figure 5.3 in fault tree form.
The CCF can be seen to defeat the redundancy by introducing an OR gate along with the redundant G1 gate.
Figure 5.7 shows another example, this time of “two out of three” redundancy, where a voted gate is used.
image
Figure 5.5 Example of a fault tree (A) with equivalent block diagram (B).
A highly cost-effective fault tree package is the Technis TTREE package.
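As noted above, the underlying mathematics is the same as for the block diagrams: inputs to an OR gate (series logic) add, and PFDs at an AND gate (redundancy) multiply. The sketch below illustrates this for a tree in the spirit of Figure 5.5; all event values are hypothetical and the usual approximation of adding PFDs at OR gates is used.

```python
from math import prod

def or_gate(values):
    """Sum of input failure rates or PFDs (series logic, approximate for PFDs)."""
    return sum(values)

def and_gate(pfds):
    """Product of input PFDs (the top event occurs only if all inputs fail)."""
    return prod(pfds)

# Hypothetical example: loss of fire water if the pump OR the motor OR the
# detection fails, OR both power sources fail together.
pfd_top = or_gate([1e-3,                     # pump (hypothetical PFD)
                   5e-4,                     # motor (hypothetical PFD)
                   2e-4,                     # detection (hypothetical PFD)
                   and_gate([1e-2, 1e-2])])  # both power sources (hypothetical PFDs)
print(f"Top event PFD ~ {pfd_top:.2e}")
```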

5.3. Taking Account of Auto Test

The mean down time (MDT) of unrevealed failures is a fraction of the proof-test interval (i.e., for random failures, it is half the proof-test interval as far as an individual unit is concerned) PLUS the actual MTTR.
image
Figure 5.6 CCF in fault trees.
image
Figure 5.7 “2oo3” voting with CCF in a fault tree.
In many cases there is both auto test, whereby a programmable element in the system carries out diagnostic checks to discover unrevealed failures, and a manual proof test. In practice, the auto test will take place at some relatively short interval (e.g., 8 min) and the proof test at a longer interval (e.g., one year).
The question arises as to how the reliability model takes account of the fact that failures revealed by the auto test enjoy a shorter down time than those left for the proof test. The ratio of one to the other is a measure of the diagnostic coverage and is expressed as a percentage of failures revealed by the test.
Consider now a dual redundant configuration (voted one out of two) subject to 90% auto test and the assumption that the manual test reveals 100% of the remaining failures.
The RBD needs to split the model into two parts in order to calculate separately in respect of the auto-diagnosed and manually diagnosed failures.
Figure 5.8 shows the parallel and common cause elements twice and applies the equations from Section 5.2 to each element. The failure rate of the item, for the failure mode in question, is λ. Note: the additional terms, described in Tables 5.5 and 5.6, have not been included.
image
Figure 5.8 Reliability block diagram, taking account of diagnostics.
The equivalent fault tree is shown in Figure 5.9.
image
Figure 5.9 Equivalent fault tree.
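One way of evaluating the two halves of this model is sketched below, splitting the failure rate into the auto-diagnosed (90%) and manually diagnosed (10%) populations and applying a simple 1oo2 unavailability plus a BETA CCF term to each. The exact terms used in Figure 5.8 may differ, and the numerical values here are hypothetical.

```python
# Sketch of the split model (1oo2, 90% diagnostic coverage); values hypothetical.
LAMBDA = 10e-6                   # total failure rate (per hour) for the mode in question
DC = 0.90                        # diagnostic coverage of the auto test
AUTO_TEST_INTERVAL = 8.0 / 60.0  # 8 minutes, in hours
PROOF_TEST_INTERVAL = 8760.0     # one year, in hours
MTTR = 8.0                       # hours (hypothetical)
BETA = 0.05                      # common cause factor (hypothetical)

def pfd_1oo2(lam, mdt, beta):
    """Redundant pair: coincident term (lam*mdt)^2 plus CCF series term beta*lam*mdt."""
    return (lam * mdt) ** 2 + beta * lam * mdt

mdt_auto = AUTO_TEST_INTERVAL / 2 + MTTR      # failures revealed by the auto test
mdt_manual = PROOF_TEST_INTERVAL / 2 + MTTR   # failures left to the proof test

pfd = (pfd_1oo2(DC * LAMBDA, mdt_auto, BETA)
       + pfd_1oo2((1 - DC) * LAMBDA, mdt_manual, BETA))
print(f"PFD ~ {pfd:.2e}")
```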

5.4. Human Factors

5.4.1. Addressing Human Factors

In addition to random coincident hardware failures, and their associated dependent failures (Section 5.2.2), it is frequently necessary to include human error in a prediction model (e.g., fault tree). Specific quantification of human error factors is not a requirement of IEC 61508; however, it is required that human factors are "considered."
It is well known that the majority of major incidents, such as Three Mile Island, Bhopal, Chernobyl, Zeebrugge, Clapham, and Paddington, are related to the interaction of complex systems with human beings. In short, the implication is that human error was involved, to a greater or lesser extent, in these and similar incidents. For some years there has been an interest in modeling these factors so that quantified reliability and risk assessments can take account of the contribution of human error to system failure.
IEC 61508 (Part 1) requires the consideration of human factors at a number of places in the life cycle. The assessment of human error is therefore implied. Table 5.12 summarizes the main references in the Standard.

Table 5.12

Human Factors References

Part 1
Paragraph 1.2 (Scope): makes some reference
Table 1 (Life cycle): several uses of "to include human factors"
Paragraph 7.3.2.1 (Scope): include humans
Paragraph 7.3.2.5 (Definition stage): human error to be considered
Paragraph 7.4 various (Hazard/risk analysis): references to misuse and human intervention
Paragraph 7.6.2.2 (Safety requirements allocation): availability of skills
Paragraphs 7.7.2, 7.15.2 (Ops and maintenance): refers to procedures
Part 2
Paragraph 7.4.10 (Design and development): avoidance of human error
Paragraph 7.6.2.3 (Ops and maintenance): human error a key element
Paragraph 7.7.2.3 (Validation): includes procedures
Paragraph 7.8.2.1 (Modification): evaluate modifications for their effect on human interaction
Part 3
Paragraph 1.1 (Scope): human computer interfaces
Paragraph 7.2.2.13 (Specification): human factors
Paragraph 7.4.4.2 (Design): reference to human error
Annex G (Data driven): human factors

One example might be a process where there are three levels of defense against a specific hazard (e.g., overpressure of a vessel). In this case the control valve will be regarded as the EUC. The three levels of defense are:
(1) the control system maintaining the setting of a control valve;
(2) a shutdown system operating a separate shut-off valve in response to a high pressure; and
(3) human response whereby the operator observes a high-pressure reading and inhibits flow from the process.
The risk assessment would clearly need to consider how independent of each other are these three levels of protection. If the operator action (3) invokes the shutdown (2) then failure of that shutdown system will inhibit both defenses. In either case the probability of operator error (failure to observe or act) is part of the quantitative assessment.
Another example might be air traffic control, where the human element is part of the safety loop rather than an additional level of protection. In this case human factors are safety-critical rather than safety-related.

5.4.2. Human Error Rates

Human error rate data for various forms of activity, particularly in operations and maintenance, are needed. In the early 1960s there were attempts, by UKAEA, to develop a database of human error rates and these led to models of human error whereby rates could be estimated by assessing relevant factors such as stress, training, and complexity. These human error probabilities include not only simple failure to carry out a given task, but diagnostic tasks where errors in reasoning, as well as action, are involved. There is not a great deal of data available due to the following problems:
• Low probabilities require large amounts of experience in order for meaningful statistics to emerge
• Data collection concentrates on recording the event rather than analyzing the causes.
• Many large organizations have not been prepared to commit the necessary resources to collect data.
For some time there has been an interest in exploring the underlying reasons, as well as probabilities, of human error. As a result there are currently several models, each developed by separate groups of analysts working in this field. Estimation methods are described in the UKAEA document SRDA-R11, 1995. The better known are HEART (Human Error Assessment and Reduction Technique), THERP (Technique for Human Error Rate Prediction), and TESEO (Empirical Technique to Estimate Operator Errors).
For the earlier overpressure example, failure of the operator to react to a high pressure (3) might be modeled by two of the estimation methods as follows:

“HEART” method

Basic task "Restore system following checks": error rate = 0.003.
Modifying factors:
Few independent checks: ×3, assessed proportion 50%
No means of reversing the decision: ×8, assessed proportion 25%
An algorithm is provided (not in the scope of this book) and thus:
Error probability = 0.003 × [2 × 0.5 + 1] × [7 × 0.25 + 1] = 1.6 × 10⁻²

“TESEO” method

Basic task "Requires attention": error rate = 0.01.
× 1 for stress
× 1 for operator
× 2 for emergency
× 1 for ergonomic factors
Thus error probability = 0.01 × 1 × 1 × 2 × 1 = 2 × 10⁻²
The two methods are in fair agreement and thus a figure of 2 × 10⁻² might be used for the example.
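The two hand calculations above are easily reproduced in code. The sketch below assumes the standard HEART assessed-effect form, (EPC − 1) × proportion + 1, which matches the bracketed terms above, and simple multiplication of the TESEO factors.

```python
def heart(generic_rate, epcs):
    """HEART estimate: epcs is a list of (multiplier, assessed_proportion) pairs."""
    p = generic_rate
    for epc, proportion in epcs:
        p *= (epc - 1) * proportion + 1  # assessed effect of each error producing condition
    return p

def teseo(base_rate, factors):
    """TESEO estimate: multiply the base rate by each shaping factor."""
    p = base_rate
    for f in factors:
        p *= f
    return p

# HEART: "Restore system following checks" = 0.003, with x3 at 50% and x8 at 25%
print(f"HEART: {heart(0.003, [(3, 0.5), (8, 0.25)]):.2e}")   # ~1.6e-2
# TESEO: "Requires attention" = 0.01; stress x1, operator x1, emergency x2, ergonomics x1
print(f"TESEO: {teseo(0.01, [1, 1, 2, 1]):.2e}")              # 2.0e-2
```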
Figure 5.10 shows a fault tree for the example assuming that the human response is independent of the shutdown system. The fault tree models the failure of the two levels of protection (2) and (3). Typical (credible) probabilities of failure on demand are used for the initiating events. The human error value of 2 × 10⁻² could well have been estimated as above.
image
Figure 5.10 Fault tree involving human error.
Quantifying this tree would show that the overall PFD is 1.4 × 10⁻⁴ (incidentally meeting SIL 3 quantitatively).
Looking at the relative contribution of the combinations of initiating events would show that human error is involved in over 80% of the total. Thus, further consideration of human error factors would be called for.

5.4.3. A Rigorous Approach

There is a strong move to limit the assessment of human error probabilities to 10⁻¹ unless it can be shown that the human action in question has been subject to some rigorous review. The HSE have described a seven-step approach which involves:
STEP 1 Consider main site hazards
e.g., A site HAZOP identifies the major hazards.
STEP 2 Identify manual activities that affect these hazards
The fault tree modeling of hazards will include the human errors which can lead to the top events in question.
STEP 3 Outline the key steps in these activities
Task descriptions, frequencies, task documentation, environmental factors, and competency requirements.
STEP 4 Identify potential human failures in these steps
The HEART and TESEO methodologies can be used as templates to address the factors.
STEP 5 Identify factors that make these failures more likely
Review the factors which contribute (The HEART list is helpful)
STEP 6 Manage the failures using hierarchy of control
Can the hazard be removed, mitigated, etc.
STEP 7 Manage Error Recovery
Involves alarms, responses to incidents, etc.
Anecdotal data as to the number of actions, together with the number of known errors, can provide estimates for comparison with the HEART and TESEO predictions. Good agreement between the three figures helps to build confidence in the assessment.