Chapter 3. Fault-Tolerant Design

Nur A. Touba, University of Texas, Austin, Texas

About This Chapter

Fault tolerance is the ability of a system to continue error-free operation in the presence of unexpected faults. Faults can be either temporary (because of radiation, noise, ground bounce, etc.) or permanent (because of manufacturing defects, oxide breakdown, electromigration, etc.). As technology continues to scale, circuit behavior is becoming less predictable and more prone to failures, thereby increasing the need for fault-tolerant design. Fault tolerance requires some form of redundancy in time, space, or information. When an error occurs, it either needs to be masked/corrected or the operation needs to be retried if it is a temporary fault. If there is a permanent fault, then retrying an operation will not solve the problem. In that case, sufficient redundancy or spare units are required to continue error-free operation, or the part needs to be repaired or replaced.

This chapter gives an overview of a number of fault-tolerant design schemes suitable for nanometer system-on-chip (SOC) applications. It starts with an introduction to the basic concepts in fault-tolerant design and the metrics used to specify and evaluate the dependability of the design. Coding theory is next reviewed and some of the commonly used error detecting and correcting codes are described. Then fault-tolerant design schemes using hardware, time, and information redundancy are discussed. This is followed by some examples of various types of fault-tolerant applications used in industry.

Introduction

Fault tolerance is the ability of a system to continue error-free operation in the presence of an unexpected fault. Fault tolerance has always been important in mission-critical applications, such as medical, aviation, and banking, where errors can be costly [Lala 2001] [Pradhan 1996] [Siewiorek 1998]. As semiconductor manufacturing technology has scaled to a point where circuit behavior is less predictable and more prone to failures, fault tolerance is becoming important even for mainstream applications in order to keep failure rates within acceptable levels.

Faults can be either permanent or temporary. Permanent faults can result from manufacturing defects, early life failures, or wearout failures. Defects introduced during the manufacturing process can cause failures right away or early in the lifetime of the part, which can result in a permanent fault. Wearout failures can occur later in the lifetime of a part because of various mechanisms, such as electromigration, hot-carrier degradation, and time-dependent dielectric breakdown, which can also lead to a permanent fault. Temporary faults, on the other hand, occur because of either transient or intermittent disturbances that are present only for a short period of time. Transient errors are nonrecurring errors caused by a disturbance external to the fault site, such as radiation, noise, or a power disturbance. Intermittent errors are recurring errors caused by marginal design parameters, such as timing problems that result from races, hazards, or skew, or signal integrity problems caused by crosstalk, ground bounce, and other factors. Wearout failures often first present themselves as intermittent errors. Note that a given fault-tolerant design technique may tolerate only permanent faults, only temporary faults, or both.

Fault tolerance requires some form of redundancy in time, hardware, or information [Lala 2001] [Siewiorek 1998]. An example of time redundancy would be to perform the same operation twice and see whether the same result is obtained both times (if not, then a fault has occurred). This can detect temporary faults but not permanent faults, as a permanent fault would affect both computations. The advantage of time redundancy is that it requires little or no extra hardware. However, the drawback is that it impacts the system or circuit performance. Hardware redundancy involves replicating hardware and comparing the outputs from two or more replicated hardware units. This can detect both temporary and permanent faults. The advantage of hardware redundancy is that it allows the redundant computation to be done in parallel and thus has little or no impact on performance. The drawback is the cost in terms of area and power for the redundant hardware. One approach to reduce the hardware requirements is to use information redundancy where the outputs are encoded with error detecting or correcting codes [Lin 1983] [Peterson 1972]. Generally, information redundancy using codes requires less hardware to generate the redundant information bits than hardware redundancy that replicates hardware units. The codes can be selected to minimize the redundancy required for the class of faults that need to be tolerated. The drawback of information redundancy is the added complexity in design compared with simply replicating hardware units.

Fundamentals of Fault Tolerance

The failure rate, λ(t), for a component varies over time. As was shown in Figure 1.2 in Chapter 1, the graph of the failure rate versus time tends to look like a bathtub curve. It is high during the infant mortality period, becomes relatively constant over the normal lifetime of the part, and finally increases again during the wearout period. The failure rate is typically measured in units of FITS, the number of failures per 10^9 hours.

A system is constructed from various components. If no fault tolerance is provided in the system, then if any component fails, the whole system will fail. In that case, the failure rate for a system, λsys, that consists of k components is equal to the sum of the individual failure rates of its components (λc,i):

λsys = λc,1 + λc,2 + ⋯ + λc,k

However, as we will show, fault-tolerant design can be used to construct a more dependable system from less dependable components by using redundancy. Some important measures of dependability that will be defined in this section are reliability, mean time to failure (MTTF), maintainability, and availability.

Reliability

The reliability of a component is defined as the probability that the component will perform its required function for a specified period of time. If the component is working at time t = 0, then R(t) is the probability that the component is still working at time t. If the failure rate for the component is assumed to be constant, which is generally a good approximation for components that have passed the infant mortality period, then the reliability of a component follows the exponential failure law and is given by the following expression:

R(t) = e^(–λt)

where λ is the failure rate.

Consider the system shown in Figure 3.1, which consists of three components: A, B, and C. All three components must be operational in order for the system to be operational. Hence, the system reliability in this case would be equal to the product of the component reliabilities:

Rsys = RARBRC


Figure 3.1. System reliability for series system.

Now consider the system in Figure 3.2, where there are two component B's in parallel. In this case, there is a redundant component B, and the system will be operational if both components A and C are operational and if either of the two component B's is operational. Hence, the system reliability in this case is as shown:

Rsys = RARC[1 – (1 – RB)^2] = RARC(2RB – RB^2)

Here the system is able to tolerate one of the component B’s failing, and hence its reliability is improved. This simple example illustrates the general principle of how redundancy can be used to improve system reliability. A number of schemes for introducing redundancy in a system to improve its reliability are further described in Section 3.4.


Figure 3.2. System reliability with component B in parallel.

Mean Time to Failure (MTTF)

The average time before a component or system fails is called the mean time to failure (MTTF) and is given by:

MTTF = ∫₀^∞ R(t) dt

In other words, it is equal to the area under the reliability curve when plotted against time. This holds for any failure distribution. The units used for MTTF are typically hours.

For the exponential failure law, MTTF is equal to:

MTTF = ∫₀^∞ e^(–λt) dt = 1/λ

Example 3.1

For the system shown in Figure 3.1, if each of the three components A, B, and C has a failure rate (λi) of 10,000 FITS, then the system failure rate (λsys) is (10,000/10^9) × 3 = 3 × 10^(–5) failures/hour. Because MTTFsys = 1/λsys, the mean time before the system fails is 33,333 hours. Given that R(t) = e^(–λt), the probability that the system has not failed after 1 year is equal to:

R(1 year) = e^(–(3 × 10^(–5))(365 × 24)) = 0.769
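These numbers can be checked with a short sketch that simply evaluates the expressions above:

```python
import math

# Example 3.1 sketch: series system of three components,
# each with a failure rate of 10,000 FITS (failures per 10^9 hours).
lam_sys = 3 * (10_000 / 1e9)        # failure rates add for a series system
mttf_sys = 1 / lam_sys              # exponential failure law: MTTF = 1/lambda
r_one_year = math.exp(-lam_sys * 365 * 24)

print(round(mttf_sys))              # 33333 hours
print(round(r_one_year, 3))         # 0.769
```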

Example 3.2

For the system shown in Figure 3.2, if each of the three components A, B, and C has the same failure rate of 10,000 FITS, then they each have the same reliability R = e^(–λt). Hence, Rsys = RARC[1 – (1 – RB)^2] = 2e^(–3λt) – e^(–4λt). MTTFsys can then be computed as:

MTTFsys = ∫₀^∞ (2e^(–3λt) – e^(–4λt)) dt = 2/(3λ) – 1/(4λ) = 5/(12λ) = 41,667 hours

The probability that the system has not failed after 1 year is equal to:

R(1 year) = 2e^(–(3 × 10^(–5))(365 × 24)) – e^(–(4 × 10^(–5))(365 × 24)) = 0.833

Comparing this result with what was computed in Example 3.1 shows the benefit of adding the redundant B component in terms of improving both system reliability and MTTF.
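The parallel-B calculation can be reproduced the same way; this sketch evaluates both the structural form and the closed-form expression for the system reliability:

```python
import math

lam = 10_000 / 1e9                   # 10,000 FITS per component
t = 365 * 24                         # one year in hours

r = math.exp(-lam * t)               # reliability of one component
r_sys = r * r * (1 - (1 - r) ** 2)   # RA * RC * (1 - (1 - RB)^2)
# Closed form from the text: 2e^(-3*lam*t) - e^(-4*lam*t)
r_closed = 2 * math.exp(-3 * lam * t) - math.exp(-4 * lam * t)

# Integrating Rsys(t) gives MTTF = 2/(3*lam) - 1/(4*lam) = 5/(12*lam)
mttf_sys = 5 / (12 * lam)

print(round(r_sys, 3))               # 0.833
print(round(mttf_sys))               # 41667 hours
```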

Maintainability

When a system permanently fails and needs to be manually repaired, the probability that the system will be restored back to operation after a specified period of time is defined as the maintainability, M(t), of the system. Assuming the system is failed at time t = 0, then M(t) is the probability that it is repaired and operational again after time t. The time for repairing a system can be divided into two periods. The first is the passive repair time, which is the time required for a service engineer to travel to the repair site, and the second is the active repair time, which is the time required to locate the failing component(s), repair or replace it (them), and verify that the system is operational again. The active repair time can be improved by designing the system so that it is easy to locate the failed components and verify that the system is working correctly again.

The repair rate, μ, measures the rate at which a system is repaired (analogous to the failure rate λ). Maintainability is often modeled as:

M(t) = 1 – e^(–μt)

The mean time to repair, MTTR, is then equal to 1/μ.

Availability

Suppose the state of system operation is represented as S, where S = 0 means the system is operational, and S = 1 means the system is failed. Then S is a function of time t, as illustrated in Figure 3.3.


Figure 3.3. System operation and repair.

Suppose the system is operational at t0 = 0, fails at t1, and is repaired and brought back to operation at t2 by some software modification, reset, or hardware replacement. Similar failure and repair events happen at t3 and t4. The average duration of the intervals when the system is operational, such as t1 – t0 and t3 – t2, is equal to the MTTF. The average duration of the intervals when the system is not operational, such as t2 – t1 and t4 – t3, is equal to the MTTR.

The fraction of time that a system is operational is called the system availability and is given by:

Availability = MTTF / (MTTF + MTTR)

This formula is widely used in reliability engineering; for example, telephone systems are required to have system availability of 0.9999 (simply called four nines), whereas high-reliability systems may require seven nines or more. Achieving such high system availability requires using fault-tolerant design to construct a dependable system from less reliable components.

Component reliability depends on the robustness of the component design as well as the thoroughness of the reliability screens used during manufacturing test. One of the key reliability screens that is used is burn-in where a component is operated at an elevated temperature to age it past its infant mortality period. This helps to eliminate marginal parts so that the surviving parts that are placed in a system will have a lower failure rate. Another useful reliability screen is IDDQ testing, which detects anomalies in quiescent current, which may indicate a marginal part. Unfortunately, burn-in is expensive, and many of the other reliability screens such as IDDQ testing are becoming ineffective for designs manufactured at 90 nm or below. As indicated in [SIA 2004], fundamentally new long-term solutions must be developed for reliability screening and may include significant on-die hardware for stressing or special reliability measurements.

The bottom line is that component reliability is limited and is becoming lower with increasing IC complexity. Component reliability is often not high enough to construct a system that meets availability requirements without the use of fault-tolerant design.

Example 3.3

Consider a system that consists of 100 components, each with a failure rate of 10,000 FITS. When the system fails, it requires an average of 4 hours to repair. What is the system availability? The system failure rate (λsys) is equal to (10,000/10^9) × 100 = 10^(–3) failures/hour. MTTFsys = 1/λsys, so the mean time to failure of the system is 1000 hours and the mean time to repair is 4 hours. Thus, the system availability is equal to MTTF/(MTTF + MTTR) = 1000/(1000 + 4) = 0.996.
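A minimal sketch of the availability computation in Example 3.3:

```python
lam_sys = 100 * (10_000 / 1e9)       # 100 components at 10,000 FITS each
mttf = 1 / lam_sys                   # 1000 hours
mttr = 4.0                           # average hours to repair
availability = mttf / (mttf + mttr)

print(round(availability, 3))        # 0.996
```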

Fundamentals of Coding Theory

Coding data involves using more bits to represent the data than necessary. This information redundancy provides a way to detect and even sometimes correct errors. Errors occur when the value of one or more bits gets flipped because of noise or other disturbances to the signal values. There are many different error-detecting codes [Lin 1983] [Peterson 1972]. They have different properties in terms of the amount of redundancy that they use, the class of errors that they can detect, and how easy it is to encode and decode the data. Some codes are more suitable for certain applications. In this section, some of the most commonly used codes are described and their properties are discussed.

Linear Block Codes

In a block code, the data being encoded, called the message, is encoded using n-bit blocks. If each n-bit block can encode m distinct messages, then the redundancy of the block code is equal to:

Redundancy = (n – log2(m)) / n

If there is no redundancy, then m messages are encoded using log2(m) bits, which is the minimum number of bits possible. To detect errors, some redundancy is required. In that case, the space of all 2^n distinct blocks is partitioned into codewords and noncodewords, and each of the m messages is assigned a distinct codeword. Errors that cause a codeword to become a noncodeword can be detected. Errors that cause a codeword to become another codeword cannot be detected.

A block code is said to be separable if the n-bit blocks can be partitioned into k information bits and (n – k) check bits, where the k information bits directly represent 2^k distinct messages. The advantage of a separable code is that the k-bit message can be directly extracted from the codeword without the need for any decoding. The rate of a separable block code is defined as k/n.

Example 3.4

An example of a separable block code is a single-bit parity code in which one extra check bit is added to the message to indicate whether it contains an odd or even number of 1’s. For example, if the message is 101, then the corresponding codeword would contain 3 information bits (k = 3), which are the same as the message plus 1 check bit, which is equal to the modulo-2 sum (XOR) of the 3 information bits. Hence, the codeword would have a total of 4 bits (n = 4) and be equal to 1010 where the last 0 is the parity of the 3 information bits. The redundancy and rate of a single-bit parity code is equal to:

Redundancy = 1/n        Rate = (n – 1)/n

For the example here with 3 information bits, the redundancy would be 1/4 and the rate would be 3/4. Separable block codes are typically denoted as (n, k) block codes. So in this example, the single-bit parity code with 3 information bits is a (4,3) block code.
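A single-bit parity encoder and checker can be sketched in a few lines (the function names are illustrative, not from the text):

```python
def parity_encode(info):
    """Append one check bit equal to the XOR (modulo-2 sum) of the info bits."""
    return info + [sum(info) % 2]

def parity_check(word):
    """A valid codeword has an even number of 1's (even parity)."""
    return sum(word) % 2 == 0

codeword = parity_encode([1, 0, 1])
print(codeword)                      # [1, 0, 1, 0] -- the (4,3) codeword 1010
print(parity_check(codeword))        # True
print(parity_check([1, 1, 1, 0]))    # False -- single-bit error detected
```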

Example 3.5

An example of a nonseparable block code is a one-hot code in which every codeword contains a single 1 with the rest of the bits 0. For example, if n = 8, then there are eight codewords (10000000, 01000000, 00100000, 00010000, 00001000, 00000100, 00000010, 00000001). For an n-bit one-hot code, the number of distinct messages that can be represented is n. The redundancy of a one-hot code is equal to:

Redundancy = (n – log2(n)) / n

If there are eight messages, the redundancy would be 5/8. This is higher than the redundancy for a single-bit parity code with eight messages, which is equal to 1/4 (as shown in Example 3.4). Moreover, the drawback of a nonseparable code is that some decoding logic is needed to convert the one-hot codeword back to the corresponding message.

Linear block codes are a special class of block codes in which the modulo-2 sum of any two codewords is also a codeword [Lin 1983]. The codewords for a linear block code correspond to the null space of an (n – k) × n Boolean matrix, called the parity check matrix, H(n–k)×n. In other words,

cH^T = 0

for any n-bit codeword c. The corresponding n-bit codeword c is generated for a k-bit message m by multiplying m with a generator matrix, Gk×n. In other words,

c = mG

and thus

GH^T = 0

A systematic block code is one in which the first k bits correspond to the message and the last n – k bits correspond to the check bits [Lin 1983]. For a systematic code, the generator matrix has the following form: G = [Ik×k : Pk×(n–k)]—that is, it consists of a k×k identity matrix followed by Pk×(n–k). Note that any generator matrix can be put in systematic form by performing Gaussian elimination. If G = [Ik×k : Pk×(n–k)], then H = [P^T : I(n–k)×(n–k)].

Example 3.6

For a single-bit parity code, there is one check bit, which is equal to the modulo-2 sum of all the information bits. The G-matrix and H-matrix for a single-bit parity (4,3) code are shown here:

G = | 1 0 0 1 |
    | 0 1 0 1 |
    | 0 0 1 1 |

H = | 1 1 1 1 |

For the message m = [1 0 1], the corresponding codeword c is obtained as follows:

c = mG = [1 0 1] · G = [1 0 1 0]

The codeword c is in the null space of the H-matrix:

cH^T = (1·1) + (0·1) + (1·1) + (0·1) = 0 (mod 2)

Example 3.7

Consider the code corresponding to the following G-matrix and H-matrix:

G = | 1 0 0 1 0 1 |
    | 0 1 0 1 1 1 |
    | 0 0 1 0 1 1 |

H = | 1 1 0 1 0 0 |
    | 0 1 1 0 1 0 |
    | 1 1 1 0 0 1 |

From the dimensions of the matrices, it can be determined that this is a (6,3) code (i.e., n = 6, k = 3, and n – k = 3). The matrices are in systematic form, so the first three columns of the H-matrix correspond to the 3 information bits, and the last three columns correspond to the 3 check bits. Each check bit is equal to the parity of some subset of the information bits. Each row in the H-matrix corresponds to a check bit and indicates the set of bits for which it is a parity check. In other words, the first row of the H-matrix indicates that the first check bit (corresponding to the fourth column) is equal to the parity of the first two information bits. The second row of the H-matrix indicates that the second check bit (corresponding to the fifth column) is the parity of the second and third information bits. The last row indicates that the last check bit is the parity of all 3 information bits. For the message m = [1 0 1], the corresponding codeword, c, would be computed as follows:

c = mG = [1 0 1] · G = [1 0 1 1 1 0]
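Encoding with G and checking against H are just modulo-2 matrix products. This sketch uses the (6,3) code of Example 3.7, with the matrix values reconstructed from the check-bit definitions given in the text:

```python
G = [[1, 0, 0, 1, 0, 1],
     [0, 1, 0, 1, 1, 1],
     [0, 0, 1, 0, 1, 1]]

H = [[1, 1, 0, 1, 0, 0],
     [0, 1, 1, 0, 1, 0],
     [1, 1, 1, 0, 0, 1]]

def encode(m, G):
    """c = mG over GF(2)."""
    return [sum(m[i] * G[i][j] for i in range(len(m))) % 2
            for j in range(len(G[0]))]

def syndrome(v, H):
    """s = Hv^T over GF(2); all-zero when v is a codeword."""
    return [sum(row[j] * v[j] for j in range(len(v))) % 2 for row in H]

c = encode([1, 0, 1], G)
print(c)                             # [1, 0, 1, 1, 1, 0]
print(syndrome(c, H))                # [0, 0, 0] -- c is in the null space of H
```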

The distance between two codewords is the number of bits in which they differ. The distance of a code is equal to the minimum distance between any two codewords in the code. If n = k (i.e., no redundancy is used), then the distance between any two messages is 1. For single-bit parity, the distance between any two codewords is 2 because all codewords differ in at least 2 bits.

A code with distance d can detect up to d – 1 errors and can correct up to ⌊(d – 1)/2⌋ errors. So a single-bit parity code can detect single-bit errors, but cannot correct them. Consider the codewords 0000 and 0011 (which have even parity). A single-bit error could cause 0000 to become 0010, which is a noncodeword (because it has odd parity), and hence the error would be detected. However, a single-bit error could also cause 0011 to become 0010. Because 0010 can be reached by a single-bit error from more than one codeword, there is no way to correct 0010: it cannot be determined from which original codeword the error occurred. To correct single-bit errors, the code must have a distance of at least 3.

Example 3.8

An example of a code with a distance of 3 is a triplication code in which each codeword contains three copies of the information bits. For k = 3, the triplication code has n = 9. The corresponding H-matrix for a (9,3) triplication code is shown here:

H = | 1 0 0 1 0 0 0 0 0 |
    | 0 1 0 0 1 0 0 0 0 |
    | 0 0 1 0 0 1 0 0 0 |
    | 1 0 0 0 0 0 1 0 0 |
    | 0 1 0 0 0 0 0 1 0 |
    | 0 0 1 0 0 0 0 0 1 |

Consider the codeword 100100100; if there is a single-bit error in the first bit position, it will result in 000100100 which is a noncodeword. There is no single-bit error from any other codeword that can result in 000100100. Consequently, that noncodeword can be corrected back to 100100100 assuming that only a single-bit error occurred.

A code with distance 3 is called a single-error-correcting (SEC) code. A code with distance 4 is called a single-error-correcting and double-error-detecting (SEC-DED) code. A systematic procedure for constructing a SEC code for any value of n was described in [Hamming 1950]. Any H-matrix in which all columns are distinct and no column is all 0 corresponds to a SEC code. Based on this, for any value of n, a SEC code can be constructed by simply setting each column equal to the binary representation of the column number starting from 1. The number of rows in the H-matrix (i.e., the number of check bits) will be equal to ⌈log2(n + 1)⌉.

Example 3.9

A SEC Hamming code for n = 7 will have ⌈log2(n+1)⌉ = 3 check bits. The (7,4) code is shown here:

H = | 0 0 0 1 1 1 1 |
    | 0 1 1 0 0 1 1 |
    | 1 0 1 0 1 0 1 |

It is constructed by making the first column equal to the binary representation of 1, the second column equal to the binary representation of 2, and so forth.

For a Hamming code, correction is done by computing the syndrome, s, of the received vector. The syndrome for vector v is computed as s = HvT. If v is a codeword, then the syndrome will be all 0. However, if v is a noncodeword and only a single-bit error has occurred, then the syndrome will match one of the columns of the H-matrix and hence will contain the binary value of the bit position in error.

Example 3.10

For the (7,4) Hamming code from Example 3.9, suppose the codeword 0110011 had a 1-bit error in the first bit position, which changed it to 1110011. The syndrome would be equal to:

s = Hv^T = (modulo-2 sum of columns 1, 2, 3, 6, and 7 of H) = [0 0 1]^T

The syndrome is the binary representation of 1, and hence it indicates that the first bit is in error. By flipping the first bit, the error can be corrected.
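Syndrome decoding for the (7,4) Hamming code can be sketched as follows (the H-matrix columns are the binary representations of 1 through 7, MSB in the first row):

```python
H = [[0, 0, 0, 1, 1, 1, 1],
     [0, 1, 1, 0, 0, 1, 1],
     [1, 0, 1, 0, 1, 0, 1]]

def hamming_correct(v):
    """Correct a single-bit error; the syndrome is the bit position in error."""
    s = [sum(H[r][j] * v[j] for j in range(7)) % 2 for r in range(3)]
    pos = s[0] * 4 + s[1] * 2 + s[2]   # interpret syndrome as a binary number
    if pos:                            # nonzero syndrome: flip the indicated bit
        v = v.copy()
        v[pos - 1] ^= 1
    return v

received = [1, 1, 1, 0, 0, 1, 1]       # codeword 0110011 with bit 1 flipped
print(hamming_correct(received))       # [0, 1, 1, 0, 0, 1, 1]
```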

An SEC Hamming code can be made into an SEC-DED code by adding a parity check over all bits. The corresponding syndrome bit will be 1 for a single-bit error but 0 for a double-bit error. This makes it possible to detect double-bit errors so that they are not assumed to be a single-bit error and hence miscorrected.

Example 3.11

The (7,4) SEC code from Example 3.9 can be converted into a (7,3) SEC-DED code by adding a parity check over all bits, which corresponds to adding an all 1 row to the H-matrix as shown here:

H = | 0 0 0 1 1 1 1 |
    | 0 1 1 0 0 1 1 |
    | 1 0 1 0 1 0 1 |
    | 1 1 1 1 1 1 1 |

Suppose the codeword 0110011 has a 2-bit error in the first two bit positions, which changes it to 1010011. The syndrome is computed as follows:

s = Hv^T = [0 1 1 0]^T

In this case, the syndrome does not match any column in the H-matrix thereby indicating that it is a double-bit error, which cannot be corrected (although it is detected).

Another method for constructing an SEC-DED code was described in [Hsiao 1970]. The weight of a column is defined as the number of 1's in the column. Any H-matrix in which all columns are distinct and all columns have odd weight corresponds to an SEC-DED code. Based on these criteria, a Hsiao code for any value of n can be constructed by first using all possible weight-1 columns, then all possible weight-3 columns, then all possible weight-5 columns, and so forth, until n columns have been formed. The number of check bits required for an SEC-DED Hsiao code is ⌈log2(n + 1)⌉ + 1. The advantage of a Hsiao code is that it minimizes the number of 1's in the H-matrix, which results in less hardware and less delay for computing the syndrome. The disadvantage is that correcting an error requires some decoding logic to determine which bit is in error because the syndrome is no longer simply the binary value of the bit position in error (i.e., each column in the H-matrix is not necessarily the binary representation of the column number).

Example 3.12

A (7,3) Hsiao SEC-DED code is constructed as shown here:

H = | 0 0 0 1 0 1 1 |
    | 0 0 1 0 1 0 1 |
    | 0 1 0 0 1 1 0 |
    | 1 0 0 0 1 1 1 |

First the four weight-1 column vectors are used, followed by three weight-3 column vectors. This (7,3) Hsiao code has 13 entries in the H-matrix that are 1's, whereas the SEC-DED code in Example 3.11 had 19 entries in the H-matrix that are 1's. Fewer 1's in the H-matrix means fewer XOR gates are needed to generate the syndrome and consequently less propagation delay. Because the syndrome needs to be generated every time to check for an error, having less delay is beneficial for performance. The drawback is that correction is a little more complicated because the syndrome is not the binary representation of the bit position in error; however, correction is rarely required and hence has negligible impact on performance. If a single-bit error occurs that results in the syndrome 0100, determining which bit is in error is done by identifying which column of the H-matrix it matches. In this case, [0 1 0 0]^T would match the third column, so the third bit is in error.

Unidirectional Codes

Unidirectional errors are defined as errors in a block of data that only cause 1→0 errors or only cause 0→1 errors, but not both within the same block of data [Lala 1985]. In other words, there are two possibilities: either all the errors in a block of data are 1’s that become 0’s, or all the errors in a block of data are 0’s that become 1’s. For example, if the correct codeword is 111000, then examples of unidirectional errors would be 001000, 000000, and 101000 (all of which are 1→0 errors) or 111101, 111100, and 111111 (all of which are 0→1 errors). Examples of nonunidirectional errors would be 101001, 011001, and 011011 because both 1→0 errors and 0→1 errors are present at the same time.

Error-detecting codes that are able to detect all unidirectional errors in a codeword (i.e., any number of bits in error provided they are unidirectional) are called all unidirectional error detecting (AUED) codes. Single-bit parity can detect all errors where there is an odd number of bits in error, but it cannot detect errors where there is an even number of bits in error, and hence it is not an AUED code. In fact, no linear code is AUED because all linear codes must contain the all-0 codeword (which always exists in the null space of the H-matrix), and hence there will always be some set of 0→1 errors that will make the all-0 codeword equal to some other codeword. Detecting all unidirectional errors therefore requires a nonlinear code.

Two-Rail Codes

A two-rail code has one check bit for each information bit which is equal to the complement of the information bit [Lala 1985]. For example, if the message is 101, then the check bits for a two-rail code would be the complement which is 010. The set of codewords for a (6,3) two-rail code where the information bits are the first 3 bits and the check bits are the last 3 bits would be 000111, 001110, 010101, 011100, 100110, 101010, 110001, and 111000.

A two-rail code is AUED because no unidirectional error can change the value of both an information bit and its corresponding check bit. 0→1 errors can only affect 0's, and 1→0 errors can only affect 1's. Either an information bit will be affected or its corresponding check bit will be affected, but not both, because they have complementary values. Hence, no codeword can be changed into another codeword through any set of unidirectional errors.

For a two-rail code, n = 2k because there is one check bit for each information bit; thus it requires 50% redundancy, which is quite high. Lower redundancy AUED codes exist and are described next.

Berger Codes

Berger codes [Berger 1961] have the lowest redundancy of any separable code for detecting all possible unidirectional errors. If there are k information bits, then a Berger code requires ⌈log2(k + 1)⌉ check bits, which are equal to the binary representation of the number of information bits that are 0. For example, if the information bits are 1000101, then there are ⌈log2(7 + 1)⌉ = 3 check bits, which are equal to 100, the binary representation of 4, because four of the information bits are 0. The set of codewords for a (5,3) Berger code where the 3 information bits come first followed by the 2 check bits is 00011, 00110, 01010, 01101, 10010, 10101, 11001, and 11100.

Any unidirectional error will be detected by a Berger code. If the unidirectional error contains 1→0 errors, then it will increase the number of 0’s in the information bits, but can only decrease the binary number represented by the check bits thereby causing a mismatch. If the unidirectional error contains 0→1 errors, then it will reduce the number of 0’s in the information bits, but it can only increase the binary number represented by the check bits.

Example 3.13

If there are 8 information bits (k = 8), then a Berger code requires ⌈log2(8 + 1)⌉ = 4 check bits (i.e., n = 12), so its redundancy is:

Redundancy = (12 – 8) / 12 = 1/3

This is much less than a (16,8) two-rail code, which requires 50% redundancy. The redundancy advantage of the Berger code becomes greater as k increases.
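A Berger encoder is simple to sketch: count the 0's among the information bits and append their binary representation (the function name is illustrative):

```python
import math

def berger_encode(info):
    """Append ceil(log2(k+1)) check bits = binary count of 0's in the info bits."""
    k = len(info)
    r = math.ceil(math.log2(k + 1))     # number of check bits
    zeros = info.count(0)
    check = [(zeros >> (r - 1 - i)) & 1 for i in range(r)]  # MSB first
    return info + check

# Info bits 1000101 contain four 0's, so the check bits are 100.
print(berger_encode([1, 0, 0, 0, 1, 0, 1]))
```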

Constant Weight Codes

The codes discussed so far have been separable codes. Constant weight codes are nonseparable codes, which have lower redundancy than Berger codes for detecting all unidirectional errors. Each codeword in a constant weight code has the same number of 1’s. For example, a 2-out-of-3 constant weight code would have three codewords that each have two 1’s in them (i.e., 110, 101, and 011). One type of constant weight code is a one-hot code, which is a 1-out-of-n code where there are n codewords each having a single 1.

Constant weight codes detect all unidirectional errors because unidirectional errors will either increase the number of 1’s or decrease the number of 1’s, thus resulting in a noncodeword.

The number of codewords in an m-out-of-n constant weight code is equal to the binomial coefficient C(n, m) = n!/(m!(n – m)!). The number of codewords is maximized when m is as close to n/2 as possible. Thus, for an n-bit constant weight code, the minimum redundancy code is (n/2)-out-of-n if n is even and (n/2 – 0.5)-out-of-n or (n/2 + 0.5)-out-of-n if n is odd, as that maximizes the number of codewords. For example, if n = 12, then the 6-out-of-12 code has minimum redundancy, and if n = 13, then either the 6-out-of-13 or the 7-out-of-13 code would have the minimum redundancy.

Example 3.14

A 6-out-of-12 constant weight code has C(12,6) = 924 codewords, so its redundancy is:

Redundancy = (12 – log2(924)) / 12 = (12 – 9.85) / 12 = 0.18

In comparison, a 12-bit Berger code has 4 check bits and thus has only 2^8 = 256 codewords and a redundancy equal to:

Redundancy = (12 – log2(256)) / 12 = (12 – 8) / 12 = 0.33

A constant weight code requires less redundancy than a Berger code, but the drawback of a constant weight code is that it is nonseparable, so some decoding logic is needed to convert each codeword back to its corresponding binary message. This is not the case for a Berger code where the information bits are separate from the check bits and hence can be extracted without any decoding.
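The redundancy comparison in Example 3.14 can be reproduced with a short sketch using math.comb:

```python
import math

def redundancy(n, num_codewords):
    """Redundancy = (n - log2(number of codewords)) / n."""
    return (n - math.log2(num_codewords)) / n

print(math.comb(12, 6))                            # 924 codewords for 6-out-of-12
print(round(redundancy(12, math.comb(12, 6)), 2))  # 0.18 for the constant weight code
print(round(redundancy(12, 2 ** 8), 2))            # 0.33 for the 12-bit Berger code
```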

Cyclic Codes

Cyclic codes are a special class of linear codes in which any codeword shifted cyclically (rotating the last bit around to become the first bit) is another codeword. Cyclic codes are used for detecting burst errors, including those that wrap around from the last to the first bit of the codeword. The length of a burst error is the number of bits between the first error and last error, inclusive. For example, if the original codeword is 00000000 and errors occur in the 3rd, 4th, 5th, and 6th bit positions resulting in 00111100, then it is a burst error of length 4. If the errors occurred only in the 3rd, 4th, and 6th bit positions resulting in 00110100, it would still be a burst error of length 4. Any number of errors can occur between the first error and last error. The reason why burst errors are of particular interest is that multi-bit errors tend to be clustered together. Noise sources tend to affect a contiguous set of bus lines in communication channels. Less redundancy is required to detect burst errors than general multi-bit errors. For example, some distance-2 codes can detect all burst errors of length 4, whereas detecting all possible 4-bit errors requires a distance-5 code.

The most widely used cyclic code is a cyclic redundancy check (CRC) code, which uses a binary alphabet and is based on GF(2). A CRC code is an (n, k) block code formed using a generator polynomial, g(x), which is also called the code generator. It is a degree n–k polynomial (i.e., its degree is the same as the number of check bits) and has the following form:

g(x) = g_{n–k} x^{n–k} + ... + g_2 x^2 + g_1 x + g_0

Each coefficient g_i is binary. A codeword, c(x), is formed by Galois (modulo-2) multiplication of the message, m(x), with the generator, g(x):

c(x) = m(x) g(x)

Example 3.15

The codewords for a (6,4) CRC code generated by g(x) = x^2 + 1 are shown in Table 3.1.

Table 3.1. Codewords for (6,4) CRC Code Generated by g(x) = x^2 + 1

Message   m(x)                 g(x)       c(x)                       Codeword
0000      0                    x^2 + 1    0                          000000
0001      1                    x^2 + 1    x^2 + 1                    000101
0010      x                    x^2 + 1    x^3 + x                    001010
0011      x + 1                x^2 + 1    x^3 + x^2 + x + 1          001111
0100      x^2                  x^2 + 1    x^4 + x^2                  010100
0101      x^2 + 1              x^2 + 1    x^4 + 1                    010001
0110      x^2 + x              x^2 + 1    x^4 + x^3 + x^2 + x        011110
0111      x^2 + x + 1          x^2 + 1    x^4 + x^3 + x + 1          011011
1000      x^3                  x^2 + 1    x^5 + x^3                  101000
1001      x^3 + 1              x^2 + 1    x^5 + x^3 + x^2 + 1        101101
1010      x^3 + x              x^2 + 1    x^5 + x                    100010
1011      x^3 + x + 1          x^2 + 1    x^5 + x^2 + x + 1          100111
1100      x^3 + x^2            x^2 + 1    x^5 + x^4 + x^3 + x^2      111100
1101      x^3 + x^2 + 1        x^2 + 1    x^5 + x^4 + x^3 + 1        111001
1110      x^3 + x^2 + x        x^2 + 1    x^5 + x^4 + x^2 + x        110110
1111      x^3 + x^2 + x + 1    x^2 + 1    x^5 + x^4 + x + 1          110011
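The nonsystematic encoding of Table 3.1 can be reproduced with a small sketch that represents each polynomial as an integer whose bit i is the coefficient of x^i (the function name gf2_mul is illustrative, not from the text):

```python
def gf2_mul(a: int, b: int) -> int:
    """Carry-less (GF(2)) polynomial multiplication; bit i is the x^i coefficient."""
    result = 0
    while b:
        if b & 1:
            result ^= a      # add (XOR) a shifted copy of the multiplicand
        a <<= 1
        b >>= 1
    return result

g = 0b101                    # g(x) = x^2 + 1
# Reproduce a couple of rows of Table 3.1: c(x) = m(x) g(x).
assert gf2_mul(0b0111, g) == 0b011011   # m = 0111 -> codeword 011011
assert gf2_mul(0b1101, g) == 0b111001   # m = 1101 -> codeword 111001
```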

A CRC code is a linear block code, so it has a corresponding G-matrix and H-matrix. Each row of the G-matrix is simply a shifted version of the generator polynomial coefficients as shown here:

    | g_{n–k} ...  g_1  g_0   0   ...   0  |
G = |   0   g_{n–k} ...  g_1  g_0  ...  0  |
    |            ...                       |
    |   0   ...   0   g_{n–k} ...  g_1 g_0 |

Example 3.16

A (6,4) CRC code generated by g(x) = x^2 + 1 has the following G-matrix:

    | 1 0 1 0 0 0 |
G = | 0 1 0 1 0 0 |
    | 0 0 1 0 1 0 |
    | 0 0 0 1 0 1 |

The CRC codes shown so far have not been systematic (i.e., the first k bit positions of the codeword do not correspond to the message). To obtain a systematic CRC code, the codewords can be formed as follows:

c(x) = m(x)·x^{n–k} + r(x), where r(x) is the remainder of m(x)·x^{n–k} divided by g(x)

Note that this involves using Galois (modulo-2) division rather than multiplication. This is convenient because Galois division can be performed using a linear feedback shift register (LFSR), which is a compact circuit.

Example 3.17

To illustrate Galois division, consider encoding m(x) = x^2 + x with g(x) = x^2 + 1. This requires dividing m(x)·x^{n–k} = (x^2 + x)(x^2) = x^4 + x^3 by g(x). Note that subtraction in Galois arithmetic is the same as modulo-2 addition.

             x^2 + x + 1
           _______________
x^2 + 1  )  x^4 + x^3
            x^4       + x^2
            ---------------
                  x^3 + x^2
                  x^3       + x
                  -------------
                        x^2 + x
                        x^2     + 1
                        -----------
                              x + 1

The remainder r(x) = x + 1, so the codeword is equal to:

c(x) = m(x)·x^{n–k} + r(x) = (x^2 + x)(x^2) + x + 1 = x^4 + x^3 + x + 1

Consider another example of encoding m(x) = x^2 with g(x) = x^2 + 1. The Galois division is shown next. Notice that in this case, the first bit of the quotient is 1 even though 101 is numerically larger than 100; in Galois division, only the degrees of the terms matter, not their magnitudes. Each bit of the quotient will be a 1 anytime the high-order bit of the remaining dividend is 1 and will be a 0 anytime the high-order bit of the remaining dividend is 0.

              x^2     + 1
           _______________
x^2 + 1  )  x^4
            x^4 + x^2
            ---------
                  x^2
                  x^2 + 1
                  -------
                        1

The remainder r(x) = 1, so the codeword is equal to:

c(x) = m(x)·x^{n–k} + r(x) = (x^2)(x^2) + 1 = x^4 + 1
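Galois division itself can be sketched as repeated aligned XORs of the divisor against the leading term of the dividend (a minimal model; gf2_mod is an illustrative name, and polynomials are integers with bit i as the x^i coefficient):

```python
def gf2_mod(dividend: int, divisor: int) -> int:
    """Remainder of GF(2) polynomial division (bit i is the x^i coefficient)."""
    dlen = divisor.bit_length()
    while dividend.bit_length() >= dlen:
        # Subtract (XOR) the divisor aligned with the dividend's leading term.
        dividend ^= divisor << (dividend.bit_length() - dlen)
    return dividend

g = 0b101                          # g(x) = x^2 + 1
# Example 3.17: m(x) = x^2 + x shifted left by n-k = 2 gives x^4 + x^3.
assert gf2_mod(0b11000, g) == 0b11   # remainder x + 1
# Second example: m(x) = x^2 shifted gives x^4; the remainder is 1.
assert gf2_mod(0b10000, g) == 0b1
```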

Example 3.18

The codewords for a systematic (6,4) CRC code generated by g(x) = x^2 + 1 are shown in Table 3.2.

Table 3.2. Codewords for Systematic (6,4) CRC Code Generated by g(x) = x^2 + 1

Message   m(x)                 g(x)       r(x)     c(x)                       Codeword
0000      0                    x^2 + 1    0        0                          000000
0001      1                    x^2 + 1    1        x^2 + 1                    000101
0010      x                    x^2 + 1    x        x^3 + x                    001010
0011      x + 1                x^2 + 1    x + 1    x^3 + x^2 + x + 1          001111
0100      x^2                  x^2 + 1    1        x^4 + 1                    010001
0101      x^2 + 1              x^2 + 1    0        x^4 + x^2                  010100
0110      x^2 + x              x^2 + 1    x + 1    x^4 + x^3 + x + 1          011011
0111      x^2 + x + 1          x^2 + 1    x        x^4 + x^3 + x^2 + x        011110
1000      x^3                  x^2 + 1    x        x^5 + x                    100010
1001      x^3 + 1              x^2 + 1    x + 1    x^5 + x^2 + x + 1          100111
1010      x^3 + x              x^2 + 1    0        x^5 + x^3                  101000
1011      x^3 + x + 1          x^2 + 1    1        x^5 + x^3 + x^2 + 1        101101
1100      x^3 + x^2            x^2 + 1    x + 1    x^5 + x^4 + x + 1          110011
1101      x^3 + x^2 + 1        x^2 + 1    x        x^5 + x^4 + x^2 + x        110110
1110      x^3 + x^2 + x        x^2 + 1    1        x^5 + x^4 + x^3 + 1        111001
1111      x^3 + x^2 + x + 1    x^2 + 1    0        x^5 + x^4 + x^3 + x^2      111100

Generating the check bits for CRC codes can be performed using an LFSR whose characteristic polynomial, f(x), corresponds to the generator polynomial for the code. To encode a message, n – k 0’s are appended to the end of a message, and then it is shifted into an LFSR to compute the remainder. This is illustrated in the following example.

Example 3.19

To encode the message m(x) = x^2 + x + 1 using a (6,3) CRC code with the generator g(x) = x^3 + x + 1, an LFSR whose characteristic polynomial, f(x), corresponds to the generator polynomial can be used. The initial state of the LFSR is reset to all 0's, and n – k = 3 0's are appended to the end of the message, which is then shifted into the LFSR as shown in Figure 3.4.


Figure 3.4. Using LFSR to generate check bits for CRC code.

After shifting this in, the final state of the LFSR contains the remainder 010 as shown in Figure 3.5. The three appended 0’s are then replaced with the remainder to form the codeword, which is thus equal to 111010.


Figure 3.5. Final LFSR state after shifting in message and appended 0’s.

The CRC codeword can be transmitted across a communication channel to a receiver. To check if there were any errors in transmission, the receiver can simply shift the codeword into an LFSR with the same characteristic polynomial as was used to generate it (as illustrated in Figure 3.6). The final state of the LFSR will be all 0 if there is no error and will be nonzero if there is an error.


Figure 3.6. Using an LFSR to check a CRC codeword.
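The LFSR encoding and checking of Example 3.19 and Figures 3.4 through 3.6 can be modeled in software. This is a behavioral sketch, not a gate-level design; the function name and bit-ordering conventions are illustrative:

```python
def lfsr_remainder(bits, g: int):
    """Shift a bit stream into an LFSR whose taps match g(x); final state = remainder."""
    deg = g.bit_length() - 1
    state = 0
    for b in bits:
        msb = (state >> (deg - 1)) & 1               # bit about to shift out
        state = ((state << 1) | b) & ((1 << deg) - 1)
        if msb:                                       # feed back g(x) minus its top term
            state ^= g & ((1 << deg) - 1)
    return state

g = 0b1011                                            # g(x) = x^3 + x + 1
message = [1, 1, 1]                                   # m(x) = x^2 + x + 1
remainder = lfsr_remainder(message + [0, 0, 0], g)    # append n-k = 3 zeros
assert remainder == 0b010                             # check bits 010 -> codeword 111010

# Receiver side: shifting a valid codeword in leaves the LFSR all zero.
assert lfsr_remainder([1, 1, 1, 0, 1, 0], g) == 0
assert lfsr_remainder([1, 0, 1, 0, 1, 0], g) != 0     # single-bit error detected
```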

One key issue for CRC codes is selecting the generator polynomial. A generator polynomial whose first and last coefficients are 1 will detect all burst errors of length n – k or less. If the generator polynomial is a multiple of (x + 1), then it will detect any odd number of errors. Moreover, if the generator polynomial is equal to:

g(x) = (x+1) p(x)

where p(x) is a primitive polynomial of degree n – k – 1 and n < 2^{n–k–1}, then it will detect all single, double, triple, and odd errors.

There are some commonly used CRC generator polynomials. Some of these are shown in Table 3.3 [Tanenbaum 1981].

Table 3.3. Some Commonly Used CRC Generator Polynomials

CRC Code                         Generator Polynomial
CRC-5 (USB token packets)        x^5 + x^2 + 1
CRC-12 (Telecom systems)         x^12 + x^11 + x^3 + x^2 + x + 1
CRC-16-CCITT (X25, Bluetooth)    x^16 + x^12 + x^5 + 1
CRC-32 (Ethernet)                x^32 + x^26 + x^23 + x^22 + x^16 + x^12 + x^11 + x^10 + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1
CRC-64 (ISO)                     x^64 + x^4 + x^3 + x + 1

Fault Tolerance Schemes

As discussed earlier, without using redundancy, the reliability of a system is limited by the reliability of its components. Moreover, the system is unprotected from transient errors. Adding fault tolerance to a design to improve the dependability of a system requires the use of redundancy. In this section, a number of fault tolerance schemes are described. They are organized based on the type of redundancy used: hardware, time, or information [Lala 2001].

Hardware Redundancy

Hardware redundancy involves replicating hardware units, which can be done at any level of design (gate-level, module-level, chip-level, board-level, or system-level). There are three basic forms of hardware redundancy:

  1. Static (also called passive). Static redundancy masks faults rather than detects them. It prevents faults from resulting in errors at the output. It achieves fault tolerance without any action from the operator or CPU of the system. The interconnections are fixed.

  2. Dynamic (also called active). Dynamic redundancy detects faults and then reconfigures the system to use spare fault-free hardware. It requires the ability to detect, locate, and recover from a fault. The interconnections are reconfigurable.

  3. Hybrid. Hybrid redundancy combines both active and passive approaches. It uses fault masking to prevent errors, but it also uses fault detection to reconfigure the system to replace the failed hardware with a spare.

Schemes using each of the three forms of hardware redundancy are described in this section.

Static Redundancy

Static redundancy masks faults so that they do not produce erroneous outputs. For real-time systems where there is no time to reconfigure the system or retry an operation, the ability to mask faults is essential to permit continuous operation. In addition to allowing uninterrupted operation, the other advantage of static redundancy is that it is simple and self-contained without requiring the need to update or roll back the system state.

Triple Modular Redundancy (TMR)

One well-known static redundancy scheme is triple modular redundancy (TMR). The idea in TMR is to have three copies of a module and then use a majority voter to determine the final output (see Figure 3.7). If an error occurs at the output of one of the modules, the other two error-free modules will outvote it and hence the final output will be correct.


Figure 3.7. Triple modular redundancy (TMR).

A TMR system will work if the voter works and either all three modules are working or any combination of two modules is working. Thus, if the reliability of each module is Rm and the reliability of the voter is Rv, then the reliability for the TMR system is expressed as follows:

R_TMR = R_v(R_m^3 + 3R_m^2(1 – R_m)) = R_v(3R_m^2 – 2R_m^3)

The MTTF for a TMR system is equal to:

MTTF_TMR = ∫0∞ R_TMR dt = 3/(λ_v + 2λ_m) – 2/(λ_v + 3λ_m)

Neglecting the failure rate of the voter (which is generally much lower than that of the module itself), the expression simplifies to:

MTTF_TMR = 3/(2λ) – 2/(3λ) = 5/(6λ) = (5/6)·MTTF_simplex

MTTF_simplex is the MTTF of a system that consists of just the module itself. What this implies is that the MTTF for a TMR system is actually shorter than that for a simplex system. The advantage of TMR is that it can tolerate temporary faults that a simplex system cannot, and it has higher reliability for short mission times. The crossover point can be computed as follows:

R_TMR = R_simplex
3e^{–2λt} – 2e^{–3λt} = e^{–λt}
2e^{–2λt} – 3e^{–λt} + 1 = 0
(2e^{–λt} – 1)(e^{–λt} – 1) = 0
e^{–λt} = 0.5  →  t = 0.7/λ = 0.7·MTTF_simplex

So R_TMR will be greater than R_simplex as long as the mission time is shorter than 70% of MTTF_simplex.

Example 3.20

For a component with a failure rate of 1000 FITS, what is its reliability after 2 years?

R_simplex = e^{–λt} = e^{–(1000×10^–9)(2)(365)(24)} = 0.983

If TMR is used, what is the reliability after 2 years (neglecting the failure rate of the voter)?

R_TMR = 3e^{–2λt} – 2e^{–3λt} = 3e^{–(2)(1000×10^–9)(2)(365)(24)} – 2e^{–(3)(1000×10^–9)(2)(365)(24)} = 0.999
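The arithmetic in Example 3.20 is straightforward to reproduce:

```python
from math import exp

lam = 1000e-9                 # 1000 FITs = 1000e-9 failures per hour
t = 2 * 365 * 24              # two years, in hours

r_simplex = exp(-lam * t)
r_tmr = 3 * exp(-2 * lam * t) - 2 * exp(-3 * lam * t)
print(f"simplex: {r_simplex:.3f}, TMR: {r_tmr:.3f}")
```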

Note that the TMR reliability equations are actually somewhat pessimistic because TMR is able to continue error-free operation in the presence of multiple faults as long as they do not cause simultaneous errors at the outputs of two or more modules. For example, the outputs of modules 1 and 2 can be stuck-at-0 and stuck-at-1, respectively, yet the TMR circuit will continue to function properly; this is referred to as compensating module faults [Stroud 1994] [Siewiorek 1998].

One vulnerable part of TMR is the voter. Errors in the voter are not masked. The voter circuit generally is small though, so the probability of errors originating in the voter is fairly minimal. However, the voter circuit can also be replicated; this is referred to as triplicated TMR in [Lala 1985].

N-Modular Redundancy (NMR)

TMR is actually one case of a general class of error masking circuits referred to as N-modular redundancy (NMR). In NMR circuits, N modules can be used along with a majority voter. In this case, the number of failed modules that can be masked is equal to ⌊(N–1)/2⌋. As N increases, the MTTF of the system decreases, but the reliability for short missions increases. However, if the main goal is only to tolerate temporary faults, then typically TMR is sufficient as temporary faults generally cause an error in only one module at a time.
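The majority voter at the heart of NMR reduces to a simple threshold comparison. The sketch below (a behavioral model, not a hardware description) illustrates the ⌊(N–1)/2⌋ masking limit for N = 5:

```python
def majority_vote(outputs):
    """Majority voter for N single-bit module outputs (N odd)."""
    return 1 if sum(outputs) > len(outputs) // 2 else 0

# A 5MR system masks up to floor((5-1)/2) = 2 failed modules.
assert majority_vote([1, 1, 1, 0, 0]) == 1   # two faulty 0s are outvoted
assert majority_vote([1, 1, 0, 0, 0]) == 0   # three failures: wrong value wins
```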

Interwoven Logic

Another method that uses static redundancy to mask errors is interwoven logic [Pierce 1965]. Interwoven logic is implemented at the gate level by replacing each gate with 4 gates and using an interconnection pattern that automatically corrects error signals. This is illustrated in Figure 3.8. In this example, the functional design shown on the left is implemented with interwoven logic by replacing each NOR gate with the 4 NOR gates shown on the right. So although the original design has a total of 4 NOR gates, the interwoven logic design has a total of 16 NOR gates. The gates are interconnected with a systematic pattern that ensures that any single error will be masked after one or two levels of logic.


Figure 3.8. Example of interwoven logic for design with four NOR gates.

Consider the example shown in Figure 3.9 where the primary inputs X and Y are both 0. In this case, the output of gates 1 and 4 is 1 and the output of gates 2 and 3 is 0. If there were an error on the third Y input that causes it to be 1 instead of a 0, then that error would propagate through the next level of logic and cause the output of gates 1c and 1d to be a 0 instead of a 1. However, the error would not propagate to the level of logic after that because the output of gates 3a, 3b, 3c, 3d, 4a, 4b, 4c, and 4d, would still be 1 (i.e., the errors would all be masked).


Figure 3.9. Example of error on third Y input.

More details about interwoven logic including a systematic design procedure can be found in [Pierce 1965]. Several similar design methodologies for fault masking logic at the gate level have been described in [Tyron 1962], [Schwab 1983], and [Barbour 1989].

Traditionally, interwoven logic has not been as attractive as TMR because of the large area overhead from the additional gates and interconnect. However, it has seen renewed interest from researchers investigating fault-tolerant approaches for emerging nanoelectronic technologies.

Dynamic Redundancy

Dynamic redundancy involves detecting a fault, locating the faulty hardware unit, and reconfiguring the system to use a spare hardware unit that is fault-free.

Unpowered (Cold) Spares

One option is to leave the spares unpowered (“cold” spares). The advantage of using this approach is that it extends the lifetime of the spare. If the spare is assumed to not fail until it is powered, and perfect reconfiguration capability is assumed (i.e., failures in reconfiguring to the spare are neglected), then the reliability and MTTF of a module with one unpowered spare is equal to:

R(t) = e^{–λt}(1 + λt)        MTTF = 2/λ

Note that adding one unpowered spare doubles the MTTF. This is because when the original module fails, the spare powers up and replaces it, thereby doubling the total MTTF (because the spare cannot fail until it is powered up). Using N unpowered spares would increase the MTTF by a factor of N + 1. However, keep in mind that this assumes that faults are always detected and the reconfiguration circuitry never fails.
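Under these assumptions (exponential failures with rate λ, a spare that cannot fail until powered, perfect reconfiguration), the standby reliability with one cold spare is R(t) = e^{–λt}(1 + λt), the probability of at most one failure by time t. Numerically integrating R(t) recovers the doubled MTTF of 2/λ (a sketch with an assumed λ):

```python
from math import exp

lam = 0.001                       # assumed failure rate (per hour)

def r_standby(t):
    # One powered module plus one cold spare with perfect switchover:
    # the system survives if at most one failure has occurred by time t.
    return exp(-lam * t) * (1 + lam * t)

# Numerically integrate R(t) over [0, 100000) hours to estimate the MTTF;
# the result approaches 2/lam, double the simplex MTTF of 1/lam.
dt = 1.0
mttf = sum(r_standby(i * dt) * dt for i in range(100_000))
print(f"estimated MTTF = {mttf:.0f} hours (2/lam = {2 / lam:.0f})")
```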

One drawback of using a cold spare is the extra time required to power and initialize the cold spare when a fault occurs. Another drawback is that the cold spares cannot be used to help in detecting faults. Fault detection in this case requires either using periodic offline testing or using online testing based on time or information redundancy.

Powered (Hot) Spares

Another option is to use powered (“hot” spares). In this case, it is possible to use the spares for online fault detection. One approach is to use duplicate-and-compare as illustrated in Figure 3.10. Two copies of the module are compared. If the outputs mismatch at some point, it indicates that one of them is faulty. At that point, a diagnostic procedure must be run to determine whether module A is faulty or module B is faulty. The faulty module is then replaced by a spare module so that the system can keep operating. In the meantime, the system can notify the operator that it has a failed module so that the operator can manually repair or replace the failed module, hopefully before another failure occurs. Any number of spares can be used with this scheme.


Figure 3.10. Duplicate-and-compare scheme with spares.

One drawback of using a single duplicate-and-compare unit is that a diagnostic procedure has to be executed to determine which module is faulty so that the spare can replace the right one. The system has to halt operation until the diagnostic procedure is complete. To avoid the need for using a diagnostic procedure, a pair of duplicate-and-compare units can be used. This scheme is commonly called “pair-and-a-spare.” This is illustrated in Figure 3.11. In this case, if one duplicate-and-compare unit mismatches, then the system switches over and uses the spare duplicate-and-compare unit. Meanwhile, the system can notify the operator that one duplicate-and-compare unit has failed and needs to be repaired or replaced.


Figure 3.11. Pair-and-a-spare scheme.

TMR/Simplex

As Section 3.4.1.1 demonstrated, the MTTF for a TMR system is lower than for a simplex system. This occurs because once one module fails, there are two remaining modules connected to the voter, and if either of those modules fails, the whole system fails. The idea in TMR/simplex is that once one of the modules in TMR fails, the system is reconfigured as a simplex system using one of the remaining good modules. This improves the MTTF while still retaining all the advantages of a normal TMR system for as long as all three modules are good. The graph in Figure 3.12 compares the reliability of TMR alone, simplex alone, and TMR/simplex versus the normalized mission time (time/MTTF_simplex). As the figure shows, TMR is better than simplex up until 0.7 MTTF_simplex, as demonstrated earlier. However, TMR/simplex is always better than either TMR or simplex alone.


Figure 3.12. Reliability versus normalized mission time for simplex, TMR, and TMR/simplex systems.

Hybrid Redundancy

Hybrid redundancy combines both static and dynamic redundancy. It masks faults like static redundancy while also detecting faults and reconfiguring to use spares like dynamic redundancy.

TMR with Spares

One approach for hybrid redundancy is to use TMR along with spares. If one TMR module fails, it is replaced with a spare, which could be either a hot or a cold spare. As long as the system has at least three working modules, the TMR will provide fault masking for uninterrupted operation in the presence of a fault.

Self-Purging Redundancy

Another approach for hybrid redundancy is self-purging redundancy [Losq 1976]. It is illustrated in Figure 3.13 using five modules. It uses a threshold voter instead of a majority voter. A threshold voter outputs a 1 if the number of its inputs that are 1 is greater than or equal to the threshold value; otherwise it outputs a 0. The threshold voter in the design shown in Figure 3.13 has five inputs and uses a threshold value of 2. This differs from a majority voter, which outputs a 1 only if a majority of the inputs (in this case, three) are 1. The idea in self-purging redundancy is that if only one module fails, its output will differ from the others. If the faulty module outputs a 1 while all the others output a 0, the threshold voter will still output a 0 because only one of its inputs is a 1, which is below the threshold value of 2.


Figure 3.13. Self-purging redundancy.

The design of the elementary switch is shown in Figure 3.14. The elementary switch checks whether a module's output differs from the output of the threshold voter. If it does, the module is assumed to be faulty and its control flip-flop is reset to 0. This permanently masks the output of the module so that its input to the threshold voter is always 0, which is a safe value. The self-purging system in Figure 3.13 will continue to operate correctly as modules are masked out, as long as at least two modules are still working. Once only one module remains unmasked, the system can no longer produce a 1: whenever the correct output should be 1, a single module driving a 1 cannot reach the threshold of 2, so the threshold voter erroneously outputs a 0.


Figure 3.14. Elementary switch.

In comparing the self-purging system in Figure 3.13 with a 5MR system using a majority voter, the self-purging system is able to tolerate up to three failing modules and still operate correctly, whereas a 5MR system can only tolerate two failing modules. The advantage of the 5MR system is that it can tolerate two modules simultaneously failing, whereas a self-purging system with a threshold of two cannot. A self-purging system with a threshold value of T can tolerate up to T – 1 simultaneous failures.
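The behavior of the threshold voter with purged modules can be modeled as follows. This is a behavioral sketch; forcing a purged module's contribution to 0 mirrors the reset control flip-flop in the elementary switch:

```python
def threshold_vote(inputs, threshold=2):
    """Threshold voter: output 1 if at least `threshold` inputs are 1."""
    return 1 if sum(inputs) >= threshold else 0

# Five modules with threshold 2. A purged (faulty) module is forced to 0,
# a safe value that can never push the voter past the threshold.
alive = [1, 1, 1, 1, 1]                      # 1 = not yet purged

def system_output(module_outputs):
    return threshold_vote([m & a for m, a in zip(module_outputs, alive)])

assert system_output([1, 1, 1, 1, 1]) == 1   # all modules healthy
alive = [0, 0, 0, 1, 1]                      # three modules purged
assert system_output([1, 1, 1, 1, 1]) == 1   # two survivors still reach threshold
assert system_output([1, 0, 1, 0, 0]) == 0   # purged modules cannot force a 1
```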

In comparing the self-purging system in Figure 3.13 with a TMR system having two spares, the fault tolerance capability is the same. The advantage of the self-purging system is that the elementary switches that it uses are much simpler than the reconfiguration circuitry that is required to support TMR with two spares. This is important because it reduces the failure rate of the reconfiguration circuitry, which can be a significant factor in high reliability systems having lots of spares. An advantage of TMR with two spares is that it can support unpowered spares, whereas self-purging redundancy uses powered spares.

Time Redundancy

An alternative to hardware redundancy is to use time redundancy. The advantage of time redundancy is that it requires less hardware, but the drawback is that it can only detect temporary faults. It cannot detect permanent faults. However, temporary faults occur much more frequently than permanent faults, and thus for applications where it is desirable to reduce the error rate without using a lot of additional hardware, time redundancy is an attractive approach. If an error is detected, then the system needs to roll back to a known good state before resuming operation.

Repeated Execution

The simplest way to use time redundancy is to repeat an operation twice and compare the results. This detects temporary faults that occur during only one of the executions (but not both), because such a fault causes a mismatch in the results. Repeated execution can reuse the same hardware unit for both executions and thus requires only one copy of the functional hardware. However, it does require some mechanism for storing and comparing the results of the executions. In a processor-based system, this can be done by storing the results in memory or on disk and then comparing them in software. The main cost of repeated execution is the additional time required for the redundant execution and comparison.

In a processor-based system that can run multiple threads (different streams of instructions run in parallel), multithreaded redundant execution can be used [Mukherjee 2002]. In this case, two copies of a thread are executed concurrently and the results are compared when both complete. This takes advantage of a processor system’s built-in capability to exploit available processing resources to reduce execution time for multiple threads. This can significantly reduce the performance penalty.

Multiple Sampling of Outputs

At the circuit level, one attractive way to implement time redundancy with low performance impact is to use multiple sampling of the outputs of a circuit. The idea is that in a clock cycle, the output of the functional circuit is sampled once at the end of the normal clock period and then a second time after a delay of Δt. The two sampled results are then compared, and any mismatch indicates an error. This method is able to detect any temporary fault whose duration is less than Δt, because it would only cause an error in one of the sampled results and not the other one. The performance overhead for this scheme depends on the size of Δt relative to the normal clock period, because the time for each clock cycle is increased by that amount. The duration of most common temporary faults is generally relatively short compared with the clock period, so the value of Δt can be selected to obtain good coverage with relatively small impact on performance.
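The double-sampling idea can be illustrated with a simple behavioral model. All signal timings and names here are hypothetical, and a waveform is modeled as a function of time:

```python
def sample_twice(waveform, t_clock, dt):
    """Sample a signal at the clock edge and again dt later; mismatch flags an error."""
    first, second = waveform(t_clock), waveform(t_clock + dt)
    return first, first != second            # (captured value, error flag)

# Hypothetical output: correct value 1, with a transient 0-glitch between
# t = 10.0 and t = 10.2 (shorter than dt, so it hits at most one sample).
def out(t):
    return 0 if 10.0 <= t < 10.2 else 1

value, error = sample_twice(out, t_clock=10.1, dt=0.5)
assert error                                 # glitch hits one sample but not the other
assert sample_twice(out, t_clock=9.0, dt=0.5) == (1, False)   # clean cycle
```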

A simple approach for implementing multiple sampling of outputs is to use two latches as illustrated in Figure 3.15. It is also possible to implement multiple sampling of outputs using only combinational logic as described in [Franco 1994]. The approach in [Franco 1994] uses a stability checker, which is activated after the normal clock period to monitor an output during the Δt period and give an error indication if the output changes (i.e., transitions) during that time period (Figure 3.16 shows a timing diagram). An example of a gate-level stability checker design is shown in Figure 3.17. If the output makes a transition during the stability checking period, then a pulse whose width is equal to the propagation delay of the inverter will be generated on the error indication signal. One stability checker is needed for each output, and the error signals can be logically ORed together. In [Franco 1994], a transistor-level design that integrates the stability checker into a flip-flop is also given, which significantly reduces the overhead. Other implementations of multiple sampling of the outputs can be found in [Metra 1998], [Favalli 2002], [Austin 2004], and [Mitra 2005].


Figure 3.15. Multiple sampling of outputs.


Figure 3.16. Timing diagram for stability checking.


Figure 3.17. Example of a stability checker.

Diverse Recomputation

Pure time redundancy approaches where the same hardware is used to repeat a computation the same way cannot detect permanent faults. However, if the computation can be performed differently the second time to utilize the hardware in a different manner, then it can be possible to detect permanent faults if the fault causes the two computations to mismatch. One simple approach used for some arithmetic or logic operations is to shift the operands when performing the computation the second time, which yields a shifted version of the result that can be compared with the result obtained the first time [Patel 1982]. With this approach, a permanent fault in the hardware could be detected if it affected the computations differently. For example, a permanent fault affecting only one bit-slice of the hardware could be detected because it would create an error in different bit positions of the normal and shifted versions of the computation. A more general approach would be to have two different versions of a software routine that perform the same function but do it in a different manner thereby exercising different portions of the hardware [Oh 2002]. By running both versions of the routine and comparing the result, it is possible to detect permanent faults in hardware.
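The shifted-operand idea of [Patel 1982] can be illustrated with a toy adder model. The fault-injection helper and function names are illustrative, and the model ignores the carry-chain effects a real bit-slice fault would have:

```python
def add_with_faulty_bitslice(a, b, stuck_bit=None):
    """Adder model; if stuck_bit is set, that bit of the result is stuck-at-0."""
    s = a + b
    if stuck_bit is not None:
        s &= ~(1 << stuck_bit)
    return s

def checked_add(a, b, stuck_bit=None):
    normal = add_with_faulty_bitslice(a, b, stuck_bit)
    shifted = add_with_faulty_bitslice(a << 1, b << 1, stuck_bit)
    # A fault in one bit slice hits different bit positions of the two
    # results, so the shifted result no longer equals the normal result << 1.
    return normal, shifted != (normal << 1)

assert checked_add(5, 3) == (8, False)       # fault-free: the results agree
_, error = checked_add(5, 3, stuck_bit=3)
assert error                                 # permanent fault detected
```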

Information Redundancy

Information redundancy has the advantages of hardware redundancy in that it can detect both temporary and permanent faults with minimal impact on performance, yet it can generally be implemented with less hardware overhead than by using multiple copies of the module. Information redundancy is based on using error detecting and correcting codes. Section 3.2 reviewed the fundamentals of coding theory and described a number of commonly used codes. This section describes fault tolerance schemes that use these codes to detect and correct errors.

Error Detection

Error-detecting codes can be used to detect errors. If an error is detected, then the system needs to roll back to a previous known error-free state so that the operation can be retried. Rollback requires adding some storage to save a previous state of the system so that the system can be restarted from the saved state. The amount of rollback that is required depends on the latency of the error detection mechanism. If errors are detected in the same clock cycle in which they occur, then zero-latency error detection is achieved and the rollback can be implemented by simply preventing the state of the system from updating in that clock cycle. If errors are detected only after an n clock cycle latency, then rollback requires restoring the system to the state it had at least n clock cycles earlier. Often the execution is divided into a set of operations, and before each operation is executed a checkpoint is created where the system’s state is saved. Then if any error is detected during the execution of an operation, the state of the system is rolled back to the last checkpoint and the operation is retried. If multiple retries do not result in an error-free execution, then operation halts and the system flags that a permanent fault has occurred.

The basic idea in information redundancy is to encode the outputs of a circuit with an error-detecting code and have a checker that checks whether the encoded output is a valid codeword [Lin 1983] [Peterson 1972]. A noncodeword output indicates an error.

As discussed in Section 3.2, some error-detecting codes are “separable” and some are “nonseparable.” A separable code is one in which each codeword is formed by appending check bits to the normal functional output bits. The advantage of a separable code is that the normal functional outputs are immediately available without the need for any decoding. A nonseparable code requires some decoding logic in order to extract the normal functional outputs.

Figure 3.18 shows a block diagram for using a separable error-detecting code. A check bit generator is added to generate the check bits. Then there is a checker, which checks if the functional output bits plus the check bits form a valid codeword. If not, then an error indication is given. Typically a self-checking checker is used which is capable of detecting errors in the checker itself as well.


Figure 3.18. Block diagram of error detection using a separable code.

A self-checking checker has two outputs, which in the normal error-free case have opposite values, (1,0) or (0,1). If they are equal to each other, that is, (1,1) or (0,0), then an error is indicated. The reason a self-checking checker cannot have a single output is that a stuck-at-0 fault on that output would permanently indicate no error. By having two outputs that are equal to (1,0) for one or more codeword inputs and (0,1) for one or more codeword inputs, a stuck-at-0 fault on either output will be detected by at least one codeword input. In other words, at least one codeword input will cause the output of the checker to become (0,0), which detects the fault in the checker.

In order for a checker to be totally self-checking, it must possess the following three properties: code disjoint, fault secure, and self-testing [Anderson 1971] [Lala 2001]. A checker is said to be code disjoint if all codeword inputs get mapped to codeword outputs, and all noncodeword inputs get mapped to noncodeword outputs. This property ensures that any error causing a noncodeword input to the checker will be detected by the checker. A checker is said to be fault secure for a fault class F if for all codeword inputs, the checker in the presence of any fault in F will either produce the correct codeword output or will produce a noncodeword output. This property ensures that either the checker will work correctly, or it will give an error indication (noncodeword output). A checker is said to be self-testing for a fault class F if for each fault in F there exists at least one codeword input that causes a noncodeword output (i.e., detects the fault by giving an error indication). Assuming that all codeword inputs occur periodically with a higher frequency than faults occur, this property ensures that if a permanent fault occurs in the checker, it will be detected before a second fault can occur. The reason for this property is that if more than one permanent fault occurs, the checker may no longer be fault secure. The self-testing property ensures that the first permanent fault that occurs in the checker will get flagged with an error indication before subsequent faults can occur and thus prevents the circuit from losing its fault secure property. Alternatively, [Smith 1978] showed that if the checker is designed to be strongly fault secure such that it remains fault secure in the presence of any number of faults in F, then the self-testing property is not necessary.

Schemes using some of the most commonly used error-detecting codes are described next, and the design of self-checking checkers for them is discussed.

Duplicate-and-Compare

The simplest error-detecting code is duplication. A copy of the functional logic is used as the check bit generator, and an equality checker is used to compare the outputs of the duplicate functional logic, as illustrated in Figure 3.19. If the outputs mismatch, then the equality checker indicates an error. With this approach, any fault that causes an error in only one module is guaranteed to be detected because its output will mismatch with the duplicate module’s error-free output. The only way for an undetected error to occur would be a common-mode failure, which causes the exact same error to be produced at the output of both modules [Mitra 2000]. For example, if a particular gate in the module were especially sensitive to power supply noise, then just the right amount of noise in the power rail could cause an error at that gate in both copies of the module, creating identical errors at the outputs of both modules that would go undetected. One way to reduce the chance of a common-mode failure is to use design diversity, in which the duplicated modules are each designed differently [Avizienis 1984] [Mitra 2002]. The two modules implement the same function but with different designs, which reduces the likelihood of their producing identical errors in the presence of coupled disturbances.

Duplicate-and-compare scheme.

Figure 3.19. Duplicate-and-compare scheme.

Note that with a duplicate-and-compare scheme, no fault on the primary input stems or in any circuitry driving the primary inputs can be detected. Only faults that occur after the primary input stems can be detected, because those faults affect only one of the modules. Faults on the primary input stems fan out to both copies of the module, produce identical errors at the outputs of both modules, and thus go undetected. To detect faults on the primary input stems or in the circuitry driving them, a separate checker is required for that portion of the design.

The advantage of a duplicate-and-compare scheme is that the design is simple and modular, and it provides high coverage missing only common-mode failures. The drawback of duplicate-and-compare is the large hardware overhead required. Greater than 100% hardware overhead is needed for adding the extra copy of the module plus the equality checker.
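
The principle can be sketched behaviorally as follows (a minimal software model, not a hardware design; the 4-bit module functions and the injected stuck-at fault are hypothetical):

```python
def duplicate_and_compare(module, module_copy, inputs):
    """Apply the same inputs to both copies and compare the outputs;
    a mismatch raises the error flag (hypothetical helper)."""
    out_a = module(inputs)
    out_b = module_copy(inputs)
    return out_a, out_a != out_b   # (functional output, error indication)

# Hypothetical 4-bit module and a copy with an injected stuck-at-1 fault.
good = lambda x: x & 0xF
faulty = lambda x: (x & 0xF) | 0x1   # bit 0 stuck at 1

print(duplicate_and_compare(good, good, 0b1010))    # (10, False): fault-free
print(duplicate_and_compare(good, faulty, 0b1010))  # (10, True): error detected
```

Note that a common-mode failure would correspond to passing the same faulty function for both copies, in which case the comparison cannot flag the error.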

Single-Bit Parity Code

On the other end of the spectrum from duplication is single-bit parity. It is the simplest error-detecting code, which uses a single check bit to indicate the parity of the functional output bits. The parity check bit is equal to the XOR of all the functional output bits and indicates if an even or odd number of functional output bits is equal to 1. When using single-bit parity, the check bit generator predicts what the parity should be, and then the checker determines whether the functional outputs plus the parity bit form a valid codeword. If not, then an error is indicated.

The checker for single-bit parity is simple: it consists of just an XOR tree. To make the checker totally self-checking, the final gate is removed so that there are two outputs that have opposite values in the fault-free case (see Figure 3.20). The functional circuit has six outputs, and the parity prediction logic predicts the parity bit so that the 7-bit codeword (six outputs plus parity bit) has odd parity (i.e., there is an odd number of 1’s in each codeword). This ensures that the two error indication outputs will have opposite values for valid codeword inputs. If a noncodeword having even parity is applied to the checker, the two error indication outputs will have equal values, thereby indicating an error.

Example of a totally self-checking parity checker.

Figure 3.20. Example of a totally self-checking parity checker.

If there is a fault in the parity prediction logic, it will be detected because the parity will be incorrect. If there is a fault in the checker, it will be detected because the checker is self-checking. If there is a fault in the functional logic that causes an error on an odd number of functional outputs, it will be detected. However, if a fault in the functional logic causes an error on an even number of functional outputs, then it will not be detected because the parity will remain unchanged.

One way to ensure that no even bit errors occur is to design the functional logic such that each output is generated with its own independent cone of logic (i.e., no gate has a structural path to more than one output). In this case, a single point fault could only cause a single-bit error, which is guaranteed to be detected by a single-bit parity code. However, using independent cones of logic with no logic sharing between the outputs generally results in large area, so this approach is typically only feasible for small circuits.

For large circuits that have a lot of logic sharing between the outputs, typically somewhere around 75% of the errors will be odd errors. The reason that it is not evenly distributed around 50% is that single-bit errors typically occur with much higher frequency than multiple bit errors, which skews the distribution toward odd errors. If the error coverage with single-bit parity is not sufficient, then a code with multiple check bits can be used.
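
The scheme above can be sketched in software (a minimal model, assuming a six-output circuit as in Figure 3.20; XORing each half of the codeword separately models removing the final gate of the XOR tree):

```python
from functools import reduce

def parity(bits):
    """XOR of all bits."""
    return reduce(lambda a, b: a ^ b, bits, 0)

def predict_parity(outputs):
    """Check bit generator: choose the parity bit so the codeword
    (outputs plus parity bit) has odd parity."""
    return parity(outputs) ^ 1

def tsc_parity_checker(codeword):
    """Model of the two-output checker: XOR each half of the codeword
    separately. Complementary outputs mean 'valid codeword'; equal
    outputs mean 'error'."""
    half = len(codeword) // 2
    return parity(codeword[:half]), parity(codeword[half:])

outs = [1, 0, 1, 1, 0, 0]            # six functional outputs
cw = outs + [predict_parity(outs)]   # 7-bit odd-parity codeword
print(tsc_parity_checker(cw))        # (0, 1): complementary -> no error

cw[2] ^= 1                           # inject a single-bit error
print(tsc_parity_checker(cw))        # (1, 1): equal -> error indicated
```

Because the full codeword has odd parity, the two half-parities must always differ in the error-free case, which is exactly the two-rail convention described earlier.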

Parity-Check Codes

Single-bit parity is actually a special case of the more general class of parity-check codes. In parity-check codes, each check bit is a parity check for some group of the outputs called a parity group. A parity-check code can be represented by a parity-check matrix, H, in which there is one row for each parity group and each column corresponds to a functional output or a check bit. An entry in the matrix is a 1 if the parity group corresponding to the row contains the output or check bit corresponding to the column. Figure 3.21 shows an example of a parity-check matrix, H, for a circuit with 6 outputs (Z1–Z6) and 3 check bits (c1–c3). There are three parity groups each containing one check bit and a subset of the outputs. For example, check bit c1 is a parity check for outputs Z1, Z4, and Z5.

Example of a parity-check matrix (H matrix).

Figure 3.21. Example of a parity-check matrix (H matrix).

With a parity-check code, an error combination (i.e., a particular combination of functional output bits in error) will be detected provided that at least one parity group has an odd number of bits in error. By increasing the number of check bits (i.e., having more parity groups), the number of error combinations that can be detected will increase. If there are c check bits and k functional outputs, then the number of possible parity-check codes is equal to 2^(ck), because each of the ck entries linking a check bit to a functional output in the parity-check matrix can be either a 1 or 0. Selecting the best parity-check code to use for a particular functional circuit depends on the structure of the circuit and the set of modeled faults because that determines which error combinations can occur. For example, if there is no shared logic between the cones of logic driving outputs Z1 and Z2 (i.e., there is no gate that has a structural path to both Z1 and Z2), then it is not possible for a single point fault to produce an error combination that includes both Z1 and Z2. Consequently, Z1 and Z2 could be in the same parity group with no risk of an even bit error because Z1 and Z2 cannot both have an error at the same time. Given a functional circuit, it can be analyzed to determine what error combinations are possible based on the sensitized paths in the circuit, and a parity-check code with the minimum number of check bits required to achieve 100% coverage can be selected [Fujiwara 1987]. If it is possible to synthesize the functional circuit or to modify or redesign it, then the structure-constrained logic synthesis procedure described in [Touba 1997] can be used to constrain the fanout in the functional circuit to simplify the parity-check code that is needed to achieve 100% coverage. This is done by intelligently factoring the logic in a way that minimizes the number of ways that errors can propagate to the outputs.
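
The detection condition can be sketched directly on an H matrix. Only the c1 row (Z1, Z4, Z5) of Figure 3.21 is specified in the text; the c2 and c3 rows below are illustrative assumptions:

```python
# Hypothetical completion of the H matrix of Figure 3.21.
# Columns: Z1 Z2 Z3 Z4 Z5 Z6 c1 c2 c3.
H = [
    [1, 0, 0, 1, 1, 0, 1, 0, 0],   # c1 checks Z1, Z4, Z5 (from the text)
    [0, 1, 0, 1, 0, 1, 0, 1, 0],   # c2 checks Z2, Z4, Z6 (assumed)
    [0, 0, 1, 0, 1, 1, 0, 0, 1],   # c3 checks Z3, Z5, Z6 (assumed)
]

def detected(error_bits):
    """An error combination (set of flipped columns, 0-based) is detected
    iff at least one parity group covers an odd number of flipped bits."""
    return any(sum(row[i] for i in error_bits) % 2 == 1 for row in H)

print(detected({0}))        # True: any single-bit error is caught
print(detected({0, 3}))     # True: c1 sees Z1 and Z4, but c2 sees only Z4
print(detected({3, 4, 5}))  # False: Z4, Z5, Z6 hit every group an even number of times
```

The last case shows why code selection must consider which error combinations the circuit structure can actually produce: this particular triple error would escape this particular code.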

The checker for a parity-check code can be constructed by using a self-checking single-bit parity checker for each of the parity groups and then combining the two-rail outputs of each self-checking parity checker using a self-checking two-rail checker. A two-rail checker takes two pairs of two-rail inputs and combines them into a single pair of two-rail outputs, where if either or both of the two sets of two-rail inputs is a noncodeword, then the output of the two-rail checker will be a noncodeword. The design of a totally self-checking two-rail checker is shown in Figure 3.22, where the two two-rail input pairs (A0, A1) and (B0, B1) are combined into a single two-rail output pair (C0, C1). Figure 3.23 shows the self-checking checker for the parity-check code shown in Figure 3.21.

A totally self-checking two-rail checker.

Figure 3.22. A totally self-checking two-rail checker.

A self-checking checker for parity-check code in Figure 3.21.

Figure 3.23. A self-checking checker for parity-check code in Figure 3.21.
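
The standard two-level AND-OR realization of a totally self-checking two-rail checker (consistent with, though not necessarily identical to, Figure 3.22) computes C0 = A0·B0 + A1·B1 and C1 = A0·B1 + A1·B0, and can be sketched as:

```python
def two_rail_checker(a0, a1, b0, b1):
    """Totally self-checking two-rail checker (two-level AND-OR form):
    C0 = A0*B0 + A1*B1, C1 = A0*B1 + A1*B0. The output pair is
    complementary iff both input pairs are complementary."""
    c0 = (a0 & b0) | (a1 & b1)
    c1 = (a0 & b1) | (a1 & b0)
    return c0, c1

print(two_rail_checker(1, 0, 0, 1))  # (0, 1): two valid pairs -> valid pair
print(two_rail_checker(1, 1, 0, 1))  # (1, 1): noncodeword in -> noncodeword out
print(two_rail_checker(0, 0, 1, 0))  # (0, 0): noncodeword in -> noncodeword out
```

A tree of such two-rail checkers reduces the outputs of all the per-group parity checkers down to a single two-rail error indication pair.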

Berger Codes

As discussed in Section 3.2, Berger codes have the lowest redundancy of any separable code for detecting all possible unidirectional errors. If there are k functional output bits, then a Berger code requires ⌈log2(k + 1)⌉ check bits.

The only way to get a nonunidirectional error from a single point fault is if there are one or more paths from the fault site to a primary output that have an even number of inversions and one or more paths from the fault site to a primary output that have an odd number of inversions. It was shown in [Jha 1993] that if a circuit is synthesized using only algebraic transformations, then it is possible to transform the circuit so that it is inverter-free (i.e., it has inverters only at the primary inputs), and hence all single point faults will cause only unidirectional errors. Thus, for an inverter-free circuit, a Berger code is guaranteed to give 100% coverage for single point faults.

An ALU design using a Berger code was described in [Lo 1992]. Self-checking checker designs for Berger codes can be found in [Piestrak 1987], [Lo 1988], and [Kavousianos 1998].
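
As a sketch of the encoding (using the common B0 variant, in which the check symbol is the binary count of 0s among the information bits):

```python
import math

def berger_check(bits):
    """Berger check symbol (B0 variant): the binary count of 0s among the
    k information bits, using ceil(log2(k + 1)) check bits."""
    width = math.ceil(math.log2(len(bits) + 1))
    return [int(b) for b in format(bits.count(0), f'0{width}b')]

# A unidirectional error (all flips 1->0, or all 0->1) always changes the
# number of 0s, so the stored check symbol no longer matches.
data = [1, 0, 1, 1, 0, 1, 1, 1]          # k = 8 -> 4 check bits
check = berger_check(data)
print(check)                              # [0, 0, 1, 0]: two 0s

corrupted = [1, 0, 1, 0, 0, 1, 0, 1]     # two additional 1->0 flips
print(berger_check(corrupted) == check)  # False: error detected
```

A nonunidirectional error that flips equal numbers of bits in each direction would leave the count of 0s unchanged, which is why Berger codes target unidirectional errors only.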

Constant Weight Codes

The codes discussed so far are all separable codes. As discussed in Section 3.2, constant weight codes are nonseparable codes that have lower redundancy than Berger codes for detecting all unidirectional errors. The drawback of constant weight codes is that some decoding logic is needed to convert each codeword back to its corresponding binary value. However, one application where constant weight codes are useful is for state encoding of a finite state machine (FSM). The states in an FSM can be encoded with a constant weight code, and consequently there is no need for decoding when generating the next state, so the next state logic can be checked with a constant weight code checker, which is much simpler than a Berger code checker.

Self-checking checker designs for constant weight codes can be found in [Marouf 1978], [Gaitanis 1983], and [Nanya 1983].
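
A minimal sketch of constant weight encoding and checking (the 2-out-of-4 code and the 6-state FSM are illustrative choices, not taken from the text):

```python
from itertools import combinations

def m_out_of_n_codewords(m, n):
    """All n-bit words of weight exactly m (a nonseparable constant
    weight code)."""
    return [tuple(1 if i in ones else 0 for i in range(n))
            for ones in combinations(range(n), m)]

def weight_ok(word, m):
    """Checker model: any unidirectional error changes the weight."""
    return sum(word) == m

# A 2-out-of-4 code has C(4,2) = 6 codewords -- enough to encode the
# states of a hypothetical 6-state FSM with no decoding logic needed.
states = m_out_of_n_codewords(2, 4)
print(len(states))                  # 6
print(weight_ok(states[0], 2))      # True: valid state encoding
print(weight_ok((1, 1, 1, 0), 2))   # False: a 0->1 upset is detected
```

Because the state register feeds the next-state logic directly, checking the weight of the state vector checks the next-state logic without any decode step.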

Error Correction

Information redundancy can also be used to mask (i.e., correct) errors by using an error-correcting code (ECC). For logic circuits, using ECC is not as attractive as using TMR because the logic for predicting the check bits is generally complex, and the number of output bits that can be corrected is limited. However, ECC is commonly used for memories [Peterson 1972]. Errors do not propagate in memories as they do in logic circuits, so single-error-correction and double-error-detection (SEC-DED) codes are usually sufficient.

Memories are dense and prone to errors especially because of single-event upsets (SEUs) from radiation. However, schemes that only detect errors are generally not attractive as there may be a long latency from when the memory was written until the point at which an error is detected, thereby making rollback infeasible. Hence, ECC codes are needed. Each word in the memory is encoded by adding extra check bits that are stored together with the data. SEC-DED codes are commonly used. For a 32-bit word, SEC-DED codes require 7 check bits thereby requiring that the memory store 39 bits per word. For a 64-bit word, 8 check bits are required and thus the memory must store 72 bits per word. So using SEC-DED ECC increases the size of the memory by 7/32 = 21.9% for 32-bit words and 8/64 = 12.5% for 64-bit words. SEC-DED codes are able to correct single errors and detect double bit errors in either the data bits or the check bits.
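
The check-bit counts quoted above follow from the Hamming bound; a small calculation (a sketch, not a library routine) reproduces them:

```python
def secded_check_bits(k):
    """Check bits for a SEC-DED (extended Hamming) code over k data bits:
    the smallest r with 2**r >= k + r + 1 gives SEC; one extra overall
    parity bit extends the distance from 3 to 4 for DED."""
    r = 1
    while 2 ** r < k + r + 1:
        r += 1
    return r + 1

for k in (32, 64):
    c = secded_check_bits(k)
    print(k, c, f"{100 * c / k:.1f}%")   # 32 7 21.9%  then  64 8 12.5%
```

The relative overhead halves as the word width doubles, which is why wide-word memories pay comparatively little for SEC-DED protection.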

The hardware required for implementing memory with SEC-DED ECC is shown in Figure 3.24. When the memory is written to, the check bits are computed for the data word. This is done using the “generate check bits” block. The check bits are then stored along with the data bits in the memory. When the memory is read, the check bits for the data word are computed again using the generate check bits block. This is then XORed with the check bits read from the memory to generate the syndrome. The syndrome indicates which check bits computed from the data word mismatch with the check bits read from the memory. As explained in Section 3.3, if the syndrome is all zero, then no error has occurred and the data word can be used as is. If the syndrome is nonzero, then an error has been detected. If the parity of the syndrome is odd, then a single-bit error has occurred and the value of the syndrome indicates which bit is in error. That bit is then flipped to correct it. Generally the corrected word is written back to the memory to reduce the risk of a second error occurring in that word, which would make it uncorrectable the next time it is read. If the syndrome has even parity, then a double bit error has occurred, and it cannot be corrected. However, the error is flagged so that the system can take appropriate action.

Memory ECC architecture.

Figure 3.24. Memory ECC architecture.
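
The write/read flow of Figure 3.24 can be illustrated with a toy (7,4) Hamming code (the classic textbook layout, not necessarily the code used in the figure):

```python
def hamming74_encode(d):
    """(7,4) Hamming encode: positions 1..7 with check bits at
    positions 1, 2, and 4."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(cw):
    """Regenerate the checks over the stored word; the syndrome is the
    1-based position of a single-bit error (0 means no error)."""
    s1 = cw[0] ^ cw[2] ^ cw[4] ^ cw[6]   # covers positions 1, 3, 5, 7
    s2 = cw[1] ^ cw[2] ^ cw[5] ^ cw[6]   # covers positions 2, 3, 6, 7
    s3 = cw[3] ^ cw[4] ^ cw[5] ^ cw[6]   # covers positions 4, 5, 6, 7
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        cw = cw[:]
        cw[syndrome - 1] ^= 1            # flip the bit the syndrome points to
    return cw, syndrome

cw = hamming74_encode([1, 0, 1, 1])      # "write": store data + check bits
bad = cw[:]
bad[4] ^= 1                              # single-event upset at position 5
fixed, syn = hamming74_correct(bad)      # "read": syndrome locates the error
print(syn, fixed == cw)                  # 5 True
```

In the memory architecture, the same check-bit generation logic runs on write (to produce the stored check bits) and on read (to produce the syndrome by comparison).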

Figure 3.25 shows the application of a Hamming code to construct memory ECC that supports pipelined RAM access. In this example, the Hamming code generation circuitry is inserted at the input to the RAM, and Hamming code detection/correction circuitry is added at the output of the RAM. Here the Hamming code generation circuitry is not shared by both write and read ports as was the case in Figure 3.24. The Hamming check bits are generated by the exclusive-OR of a particular subset of data bits associated with each individual Hamming check bit, as illustrated by the Hamming parity-check matrix example in Figure 3.26 [Lala 1985]. The Hamming check bits are then stored in the RAM along with the data bits. At the output of the RAM, the Hamming check bits are regenerated for the data bits. The regenerated Hamming check bits are compared to the stored Hamming check bits to determine the syndrome, which points to the bit to be inverted to provide SEC. The Hamming code shown in Figure 3.26 has a distance of 3, such that an additional parity bit can be included to obtain a distance of 4 and to provide DED. This additional bit gives the parity across the entire codeword (data bits plus Hamming bits) and is used to distinguish correctable single-bit errors from detectable double-bit errors, as summarized in Table 3.4.

Hamming code generation/regeneration for ECC RAM.

Figure 3.25. Hamming code generation/regeneration for ECC RAM.

Example 8-bit Hamming code parity-check matrix.

Figure 3.26. Example 8-bit Hamming code parity-check matrix.

Table 3.4. Error Indications for ECC RAM

Error Type | Condition
No bit error | Hamming check bits match, no parity error
Single-bit correctable error | Hamming check bits mismatch, parity error
Double-bit error detection | Hamming check bits mismatch, no parity error
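
The decision logic of Table 3.4 can be sketched directly; the final case (syndrome zero but parity error, i.e., an upset of the overall parity bit itself) is not listed in the table and is an added assumption:

```python
def classify(syndrome_nonzero, overall_parity_error):
    """SEC-DED decision logic combining the Hamming syndrome with the
    extra overall-parity check (Table 3.4)."""
    if not syndrome_nonzero and not overall_parity_error:
        return "no bit error"
    if syndrome_nonzero and overall_parity_error:
        return "single-bit correctable error"
    if syndrome_nonzero and not overall_parity_error:
        return "double-bit error detection"
    # Syndrome zero but parity error: assumed to be an error in the
    # overall parity bit itself (case beyond Table 3.4).
    return "parity-bit error"

print(classify(False, False))  # no bit error
print(classify(True, True))    # single-bit correctable error
print(classify(True, False))   # double-bit error detection
```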

Memory SEC-DED ECC is generally effective because memory bit flips tend to be independent and uniformly distributed. If a bit flip occurs, the next time the memory word is read, it will be corrected and written back. The main risk is that if a memory word is not accessed for a long period of time, then the chance for multiple bit errors to accumulate is increased. One way to avoid this is to use memory scrubbing, in which every location in the memory is read on a periodic basis. This can be implemented by having a memory controller that, during idle periods, cycles through the memory reading every location and correcting any single-bit errors that have occurred in a word. This reduces the time period in which multiple bit errors can accumulate in a word, thereby reducing the chance of an uncorrectable or undetectable error.
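
A scrubbing pass can be sketched as follows; the per-word triplication with bitwise majority vote is only a stand-in for a real SEC decode, so that the example stays self-contained:

```python
def vote(copies):
    """Toy per-word correction: bitwise majority over three stored copies
    (a stand-in for a real SEC decode)."""
    a, b, c = copies
    word = (a & b) | (a & c) | (b & c)
    return word, copies != (word, word, word)

def scrub(memory):
    """One scrubbing pass: read every location, correct any error, and
    write the corrected word back so errors cannot accumulate."""
    fixed = 0
    for addr, copies in enumerate(memory):
        word, had_error = vote(copies)
        if had_error:
            memory[addr] = (word, word, word)
            fixed += 1
    return fixed

mem = [(0b1010,) * 3, (0b1110, 0b1010, 0b1010)]   # second location upset
print(scrub(mem))   # 1 location corrected
print(mem[1])       # (10, 10, 10): restored
```

In hardware, a memory controller would run such a pass during idle cycles, bounding the window in which a second error can accumulate in any word.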

Another issue is that some SEUs can affect more than one memory cell, resulting in multiple-bit upsets (MBUs). MBUs typically result in errors in adjacent memory cells. To prevent MBUs from resulting in multiple bit errors in a single word, memory interleaving is typically used. In memory interleaving, the memory has more columns than the number of bits in a single word, and the columns corresponding to each word are interleaved. For example, if the memory has four times as many columns as the width of a word, then every fourth column starting with column 0 is assigned to one word, and every fourth column starting with column 1 is assigned to another word, and so forth. This is illustrated in Figure 3.27. Thus, an MBU affecting up to four adjacent memory cells will cause single-bit errors in separate words, each of which can be corrected with an SEC code.

Bit interleaving with four words.

Figure 3.27. Bit interleaving with four words.
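
The four-way column assignment described above can be sketched as (the 8-bit word width is an illustrative assumption):

```python
def interleaved_column(word_index, bit_index, n_words=4):
    """Physical column holding bit `bit_index` of word `word_index` when
    n_words words are bit-interleaved across a row."""
    return bit_index * n_words + word_index

# An MBU hitting physical columns 0..3 lands in four different words, so
# each word sees only a single-bit (SEC-correctable) error.
hit = {0, 1, 2, 3}
affected = [(w, b) for w in range(4) for b in range(8)
            if interleaved_column(w, b) in hit]
print(affected)   # [(0, 0), (1, 0), (2, 0), (3, 0)]
```

Adjacent physical columns thus always belong to different logical words, which is the property that converts one MBU into several independently correctable single-bit errors.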

Note that memory ECC also helps for permanent faults because faulty memory cells can be tolerated up to the limit of the correcting capability of the code. However, the existence of faulty memory cells increases the vulnerability to additional temporary faults as there is less redundancy remaining to correct them. ECC can be used to identify faulty memory cells by logging the errors that occur and looking for any memory cell that has an unusually high frequency of errors.

Industry Practices

The dependability requirements for industrial designs vary greatly depending on the application. Table 3.5 characterizes different types of applications in terms of their dependability requirements and the fault tolerance schemes that are used.

Table 3.5. Use of Fault Tolerance in Different Types of Systems

Type | Issues | Goal | Examples | Techniques
Long-life systems | Difficult or expensive to repair | Maximize MTTF | Satellites, spacecraft, implanted biomedical | Dynamic redundancy
Reliable real-time systems | Error or delay catastrophic | Fault masking capability | Aircraft, nuclear power plant, air bag electronics, radar | TMR
High-availability systems | Downtime very costly | High availability | Reservation system, stock exchange, telephone systems | No single point of failure; self-checking pairs; fault isolation
High-integrity systems | Data corruption very costly | High data integrity | Banking, transaction processing, database | Checkpointing; time redundancy; ECC; redundant disks
Mainstream low-cost systems | Reasonable level of failures acceptable | Meet failure rate expectations at low cost | Consumer electronics, personal computers | Often none; memory ECC; bus parity; changing as technology scales

  • Long-life systems. Some systems are difficult or expensive to repair (e.g., those used in space or implanted in a human). Consequently, dynamic redundancy is typically used to maximize their MTTF so they will have a long lifetime without the need for repair. Such systems carry spares so that when a module fails, it can be replaced by a spare.

  • Reliable real-time systems. Other systems can be repaired, but they need to be able to tolerate failures when they occur without any delay. Such real-time systems (e.g., aircraft) cannot afford the time required to diagnose a fault and switch to a spare. These systems need to mask faults, which is typically done using TMR. At some safe point in time (e.g., after the aircraft has landed), a failed module can be repaired so that the TMR fault masking capability is restored before the next mission.

  • High-availability systems. For some systems, downtime is costly and may result in losing business or alienating customers. Examples include a reservation system, stock exchange, or telephone systems. In these systems, the goal is to have high availability. To accomplish this objective, it is important to avoid any single point of failure that could take the system down. Everything is typically replicated including the power subsystem and cables along with all boards and buses. Self-checking pairs are often used where if one pair fails, the other pair replaces it. The system can then notify a maintenance engineer, who can then repair or replace the failed pair. Good fault isolation is important in high-availability systems to make them fast and easy to repair.

  • High-integrity systems. In some systems (e.g., banking), data corruption is costly. It is important in these systems to maintain high data integrity. Such systems typically use checkpointing, in which the state of the system is saved before performing an operation so that if a fault occurs during the operation, the operation can be rolled back. Time redundancy is often used to perform an operation multiple times to ensure that it is correct before committing it to the database. ECC and redundant disks are used to protect the database itself. Note that many high-integrity systems are also high-availability systems and vice versa.

  • Mainstream low-cost systems. In the applications mentioned previously, dependability is the main concern, and the user is willing to pay for a highly redundant system to obtain it. However, for most applications, the primary concern is low cost. In such systems, the user is typically willing to accept some reasonable level of failures. For example, in a personal computer, users typically would not be willing to pay twice the price for a redundant system if they can get a reasonably reliable system that does not incorporate fault tolerance. Traditionally, most mainstream low-cost systems have not used any fault tolerance. However, this is changing as technology scales and failure rates increase. Personal computers, for example, now typically use memory ECC because it provides a significant improvement in failure rates at low cost. Including a parity line on a bus is another commonly used low-cost technique. As technology continues to scale, soft error rates are becoming a major concern and meeting users’ expectations for reasonable failure rates will likely require incorporating some form of fault tolerance. We examine this subject in greater detail in Chapter 8.

Concluding Remarks

There are many different fault-tolerant design schemes. Choosing the scheme to use in a particular design depends on what types of faults need to be tolerated (temporary or permanent, single or multiple point failures, etc.) and the constraints on area, performance, and power. As technology continues to scale and circuits become increasingly prone to failure, achieving sufficient fault tolerance will be a major design issue.

Note that this chapter is introductory in that it only discusses fundamental fault tolerance schemes that are applicable for coping with various types of faults in the nanometer design era. Further details can be found in subsequent chapters. In particular, Chapter 8 extensively covers various sources of permanent faults and temporary faults (soft errors) and goes into more detail on techniques for coping with them, and Chapter 12 discusses implementing fault tolerance in field programmable gate arrays (FPGAs).

Exercises

3.1

(Reliability) Assume that the failures of three computers—A, B, and C—are independent, with constant failure rates λA = 1/1000, λB = 1/1200, and λC = 1/900 failures per hour, respectively.

  1. What is the probability that at least one system fails in a 4-week period?

  2. What is the probability that all three systems fail in a 4-week period?

3.2

(Mean Time to Failure) There is one light bulb in your apartment, and it burns out. You are only going to be living in the apartment for another 30 weeks. You go to the store to buy a light bulb, and there are two choices: one has an MTTF of 100 weeks and costs $1, and the other has an MTTF of 300 weeks and costs $1.50.

  1. If you buy the cheaper bulb, what is the probability that you will have to buy another before you move out?

  2. If you buy the more expensive bulb, what is the probability that you will have to buy another before you move out?

  3. Which bulb would you buy?

3.3

(System Reliability) The system in Figure 3.28 functions correctly if all components along at least one path from input to output are functioning. All component failures can be assumed to be independent.

System reliability diagram.

Figure 3.28. System reliability diagram.

  1. Express the system reliability as a function of the component reliabilities. Each component has a reliability Rm.

  2. What is the MTTF of the system if the MTTF of each component is 1 year and each component has a constant failure rate?

3.4

(Availability) Assume you own a computer with a failure rate of one failure per 100 weeks. Computer service company ABC promises to provide repairs fast enough that your MTTR will be 2 weeks. Computer service company XYZ promises to provide repairs fast enough that your availability will be 99%. If both companies charge the same, which would you contract with? Explain.

3.5

(Linear Code) For the following generating matrix:

[Figure: generating matrix G]
  1. List all valid codewords.

  2. Derive the H-matrix.

3.6

(Linear Code) Write the generator matrix for the following codes:

  1. Single-bit even parity for 4 message bits.

  2. Duplication code having 3 message bits and 3 check bits.

3.7

(Code Redundancy) If the number of information bits is 16 (k = 16), calculate the redundancy of the following codes:

  1. Berger code.

  2. 9-out-of-19 constant weight code.

  3. Code in which the information bits are partitioned into two blocks of 8 bits each, and each block is encoded with a 6-out-of-12 constant weight code.

3.8

(CRC Code) For a (6,3) CRC code with G = x3 + x2 + 1:

  1. Write out all the codewords.

  2. Derive the G-matrix for this code.

3.9

(CRC Code) For a systematic (8,4) CRC code with G = x4 + x3 + x2 + 1:

  1. What are the check bits for the message x3 + x2 + 1?

  2. What are the check bits for the message x + 1?

3.10

(ECC) For the following SEC Hamming code, correct the single-bit error in each of the following words:

[Figure: Hamming code parity-check matrix]
  1. 110001111110

  2. 101101110100

3.11

(ECC) Construct an H-matrix for the (14,9) odd-column-weight (Hsiao) code. Try to keep the row weights as equal as possible.

3.12

(TMR) For a TMR system, suppose the three modules have the same function but are designed differently such that the failure rate for each of the three modules is 1000 FITS, 2000 FITS, and 3000 FITS, respectively.

  1. Compute the reliability of the TMR system after 2 years.

  2. Compute the MTTF of the TMR system.

3.13

(NMR) For an NMR system with five modules each with a failure rate of λ:

  1. Derive an expression for the reliability.

  2. Derive an expression for the MTTF.

Acknowledgments

The author wishes to thank Avijit Dutta of the University of Texas, Austin, Professor Kartik Mohanram of Rice University, and Professor Subhasish Mitra of Stanford University for their helpful comments. The author also wishes to thank Professor Charles Stroud of Auburn University for contributing material on ECC RAMs.

References

Books

Fundamentals of Fault Tolerance

Fundamentals of Coding Theory

Fault Tolerance Schemes
