Chapter 8. Coping with Physical Failures, Soft Errors, and Reliability Issues

Laung-Terng (L.-T.) Wang, SynTest Technologies, Inc., Sunnyvale, California

Mehrdad Nourani, University of Texas at Dallas, Richardson, Texas

T. M. Mak, Intel Corporation, Santa Clara, California

About This Chapter

Physical failures caused by manufacturing defects and process variations, as well as soft errors induced by alpha-particle radiation, have been identified as the main sources of faults that lead to chip or system failure. Today, the semiconductor industry relies heavily on two test technologies: scan and built-in self-test (BIST). Existing scan implementations may no longer be sufficient as scaling introduces new failure mechanisms that single-fault-model-based tests cannot adequately capture. BIST also becomes problematic if it cannot achieve sufficient fault coverage in a reasonable time. Faced with significant test problems in the nanometer design era, it is imperative that we seek viable test solutions now to complement the conventional scan and BIST techniques.

In this chapter, we focus on test techniques to cope with physical failures for digital logic circuits. Techniques for improving process yield, silicon debug, and system diagnosis, along with test methods and DFT architectures for testing field programmable gate array (FPGA), microelectromechanical systems (MEMS), and analog and mixed-signal (AMS) circuits are covered in subsequent chapters. In this chapter, we first discuss test techniques to deal with signal integrity problems. We then describe test techniques to screen manufacturing defects and process variations. Finally, we present a number of promising online error-resilient architectures and schemes to cope with soft errors as well as for defect and error tolerance.

Introduction

Since the early 1980s, the complementary metal oxide semiconductor (CMOS) process has become the dominant manufacturing technology. With a new process technology (node) introduced roughly every 2 years, a pace that reflects Moore’s law [Moore 1965], new defect mechanisms have initially caused low manufacturing yield, elevated infant mortality rates, and high defect levels. As CMOS scaling continues (down to a feature size of 65 nanometers and below), additional defect mechanisms caused by new manufacturing defects (such as defects arising from optical effects) and process variations continue to create various failure mechanisms. To meet yield, reliability, and quality goals (referred to as defective parts per million [DPM]), these defects must be screened during manufacturing or be tolerated during system operation.

Defects are physical phenomena that occur during manufacturing and can cause functional or timing failures. Examples of defects are missing conducting material or extra insulating material (possibly causing opens), and the presence of extra conducting material or missing insulating material between two signal lines (possibly causing shorts), among others. A defect does not always manifest itself as a single isolated problem such as an open or a short. When a circuit parameter is out of given specifications, it can also cause a failure or become susceptible to other problems (temperature effects, crosstalk, leakage power, etc.). New manufacturing processes, such as the changeover from aluminum metallization to copper metallization and from SiO2 to low-K interlayer dielectric [Tyagi 2000], have created new defect and fault mechanisms. These defect mechanisms include copper-related defects (an effect of the dual damascene copper deposition process), optically induced defects (an effect of undercorrection or overcorrection), and design-related defects (low threshold voltage and multiple voltages in low-power designs) [Aitken 2004] [Guardiani 2004]. These defect mechanisms, along with their associated potential defects and failure mechanisms, are extensively discussed in [Gizopoulos 2006]. The more recently announced changeover of the gate stack, from polysilicon/SiO2 to high-K/metal gate dielectric, will certainly bring forth new defect mechanisms of its own [Chau 2004].

Broadly speaking, defects can be random or systematic, and they can be functional or parametric. Random defects are caused by manufacturing imperfections and occur in random places. Most of these random defects are relatively easy to find, except for a few that may require a lot of test time to find; in certain cases, 50% of the test time for some devices is spent chasing 0.5% of the defects [Nigh 2007, personal communication]. Systematic defects are caused by process or manufacturing variations (sometimes the result of lithography, planarity, film thickness, etc.). A systematic defect can also include a temporal systematic component (e.g., every 10th wafer). At 65 nm and below, these systematic variations are the greatest cause of catastrophic chip failures and of electrical issues related to timing, signal integrity, and leakage power. Reference [Clear Shape 2007] indicates that at 65 nm, systematic variations of 3 nm on a transistor gate can cause a 20% variation in delay and have a 2× impact on leakage power. For some devices, leakage power can vary by 10× within the process window [Nigh 2007, personal communication]. Leakage power, increasingly dominated by process variations, is becoming a major yield detractor as process technology continues to scale down.

The traditional treatment of defects focuses more on functional random (spot) defects, which lead to existing yield models. Growing process variations and other uncertainty issues require looking at the other types of defects. In a narrow sense, defects are caused by process variations or random localized manufacturing imperfections [Sengupta 1999].

Process variations, such as transistor channel length variation, transistor threshold voltage variation, metal interconnect thickness variation, and interlayer dielectric thickness variation, have a big impact on device speed characteristics. In general, the effect of process variation shows up first in the most critical paths in the design, those with maximum and minimum delays. For instance, a shorter channel length or lower threshold voltage can result in potentially faster device speed but with a significantly higher leakage current. A thinner gate oxide can result in potentially increased device speed at the expense of a significantly increased gate tunneling current and reliability concerns.

Random imperfections, such as resistive shorts between metal lines, resistive opens on metal lines, improper via formations, and shallow trench isolation defects, are yet another source of defects, called random defects. Based on the electrical characteristics of the defect and the neighboring parasitics, the defect may result in a functional (hard) or parametric (delay or marginal) failure. For instance, a missing contact could cause an open fault. A resistive short between two metal lines or an extra contact could cause a bridging fault. A resistive open on a metal line or an improper via formation could cause a delay fault.

Recall that defect level (DL) is a function of process yield (Y) and fault coverage (FC) [Williams 1981]. The authors in [McCluskey 1988] further showed that:

DL = 1 − Y^(1−FC)

This indicates that in order to reduce the defect level to meet a given DPM goal, one can improve the fault coverage of the chips (devices) under test, the process yield, or both at the same time. In reality, not all chips passing manufacturing tests would function correctly in the field. Reports have shown that (1) chips can be exposed to alpha-particle radiation and (2) nonrecurring transient errors caused by single-event upsets, called soft errors, can occur [May 1979] [Baumann 2005]. The chips can also be exposed to noise, such as power supply noise or signal integrity loss, which can cause unrecoverable errors [Dally 1998].
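
As a quick numerical illustration of this relation, the short sketch below (with assumed values for Y and FC, not taken from the chapter) evaluates DL = 1 − Y^(1−FC) and expresses the result in DPM.

```python
# Illustrative only: defect level from DL = 1 - Y^(1-FC), using assumed values
# for process yield and fault coverage.
process_yield = 0.90       # assumed process yield (90%)
fault_coverage = 0.99      # assumed fault coverage (99%)

defect_level = 1 - process_yield ** (1 - fault_coverage)
print(f"DL = {defect_level:.5f}  (about {defect_level * 1e6:.0f} DPM)")
# With these assumed numbers, DL is roughly 0.00105, i.e., about 1050 defective parts per million.
```

Raising the fault coverage toward 100% (or the yield toward 1) drives the exponent term toward Y^0 = 1 and thus the defect level toward zero, which is the tradeoff the rest of this chapter exploits.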

For nanometer system-on-chip (SOC) designs, there is also a growing concern as to whether one can find defect-free or error-free dies [Breuer 2004a]. Advanced design and test technologies are essential now in order to meet yield and DPM goals and to ensure that defective chips function harmlessly in the field.

There are two fundamentally complementary test technologies that can be adopted to meet our goals, similar to the approaches used to improve the reliability of computer systems: design for testability (DFT) [Williams 1983] [Abramovici 1994] [Bushnell 2000] [Mourad 2000] [Jha 2003] [Wang 2006] and fault tolerance [Siewiorek 1998] [Lala 2001]. The fault tolerance approach aims at preventing the chip (computer system) from malfunctioning despite the presence of physical failures (errors), whereas design for testability uses design techniques to reduce defect levels or the probability of chip (system) failures caused by manufacturing faults by screening defective devices more effectively during manufacturing test.

In the following subsections, we first discuss promising test techniques to deal with signal integrity problems induced by physical failures. We next describe promising schemes to screen manufacturing defects and process variations and to improve manufacturing yield. We then discuss a few online test architectures designed to protect against soft errors induced by radiation or other reliability concerns. Finally, promising schemes for defect and error tolerance are presented to ensure that defective chips can still function in nanometer designs.

Signal Integrity

Signal integrity is the ability of a signal to generate correct responses in a circuit. Informally speaking, signal integrity indicates how clean or distorted a signal is. A signal with good integrity stays within safe (acceptable) margins for its voltage amplitude and transition time. For example, an input signal to a flip-flop with good integrity arrives on time to satisfy the setup/hold time requirements and does not have large undershoots that may cause erroneous logic readout or large overshoots that affect the transistor’s lifetime.

Leaving the safe margins may not only cause failure in a system (e.g., unexpected ringing) but also shorten the system’s lifetime. The latter occurs because of the time-dependent dielectric breakdown (TDDB) [Hunter 1999] or injection of high-energy electrons and holes (also called hot carriers) into the gate oxide. Such phenomena ultimately cause permanent degradation of metal oxide semiconductor (MOS) transistors. To quantify these, systematic methods can be employed to perform the lifetime analysis and measure the performance degradation of logic gates under stress (e.g., repeated overshoots) [Fang 1998].

Basic Concept of Integrity Loss

Signal integrity depends on many internal (e.g., interconnects, data, characteristics of transistors, power supply noise, process variations) and external (e.g., environmental noise, interactions with other systems) factors. By using accurate simulation in the design phase, one can apply conservative techniques (e.g., stretched sizing/spacing, shielding) to minimize the effect of integrity loss. There are interdependencies among these parameters, which can result in performance degradation or permanent/intermittent failures. Because of the uncertainty caused by these interdependent parameters, it is impossible (with our current state of knowledge) to have a guaranteed remedy at the design phase. Thus, testing future very-large-scale integration (VLSI) chips for signal integrity seems to be inevitable.

True characteristics of a signal are reflected in its waveform. In practice, digital electronic components can tolerate certain levels of voltage swing and transition/propagation delay. Any portion of a signal that exceeds these levels represents integrity loss (IL). This concept is shown graphically in Figure 8.1, in which the horizontal and vertical shaded strips correspond to the amplitude- and time-safe margins, respectively. The black areas illustrate the time frames in which the signal has left the safe margin and thus integrity loss has occurred.

Figure 8.1. The concept of signal integrity loss.

Any portion of a signal f(t) that exceeds the safe margins contributes to the integrity loss metric, which can be quantified as:

IL = Σi ∫[bi, ei] |f(t) − Vi| dt

where Vi is one of the acceptable amplitude levels (i.e., a border of safe margin) and [bi, ei] is a time frame during which integrity loss occurs.

Figure 8.1 and the preceding formula show the basic concept of integrity loss. Not all signals experience the same fluctuations. The presence or the level of integrity loss (overshoot, ringing, and delay) depends on the technology (e.g., interconnect parasitic R/L/C values), process variations (e.g., changes of threshold voltage, oxide thickness), and the application (e.g., on-chip versus printed circuit board [PCB] wiring). For example, overshoots and ringing are more commonly found in PCBs and chip packaging, where the inductance (L) of wires is not negligible. Delays (transition or settling), on the other hand, are more important for on-chip interconnects because of the larger effect of the wire’s parasitic resistance (R) and capacitance (C).
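
To make the IL metric above concrete, the following minimal sketch (a toy waveform and assumed safe margins, not from the chapter) numerically accumulates |f(t) − Vi| over the intervals where a sampled signal leaves its amplitude-safe band.

```python
# A toy numerical evaluation of the integrity loss (IL) metric for a sampled waveform.
import numpy as np

t = np.linspace(0, 5e-9, 5001)                    # 5-ns window sampled every 1 ps
f = 1.2 * (1 - np.exp(-t / 0.3e-9)) + 0.15 * np.sin(2 * np.pi * 2e9 * t)  # toy signal with ringing
v_low, v_high = -0.1, 1.3                         # assumed safe amplitude margins (volts)

dt = t[1] - t[0]
over = np.clip(f - v_high, 0.0, None)             # excursions above the upper margin
under = np.clip(v_low - f, 0.0, None)             # excursions below the lower margin
il = np.sum(over + under) * dt                    # integrity loss in volt-seconds
print(f"IL = {il:.3e} V*s")
```

A time-safe margin could be handled the same way by integrating only over samples that fall outside the allowed transition window.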

Even with today’s computing power, the computationally intensive analysis/simulation implied by this model is not practical for real-world circuits; nevertheless, the model points to three main requirements in testing VLSI chips for signal integrity: (1) selection of the target sources (locations) at which to stimulate or sample/monitor IL, (2) integrity loss sensors/monitors, and (3) readout circuitry to deliver IL information to an observation/analysis point. Almost all solutions presented in the literature so far point to the necessity of a combination of these three requirements. Next, we briefly discuss these requirements and some of the techniques described in the literature.

Sources of Integrity Loss

To have a practical evaluation of integrity loss, we need to decide where and what to look at. Various sources of signal integrity loss in VLSI chips have been identified. The most important ones are the following:

  • Interconnects, which contribute to crosstalk (signal distortion caused by cross-coupling effects among signals), overshoot (signal rising momentarily above the power supply voltage), and electromagnetic interference (resulting from the antenna properties) [Bai 2000] [Chen 2001] [Nourani 2002].

  • Power supply noise, whose large fluctuations, mainly the result of simultaneous switching, affect the functionality of some gates and eventually may lead to failure [Senthinathan 1994] [Zhao 2000a].

  • Process variations, which are deviations of parameters from their desired values because of the imperfect nature of the fabrication process. Sources of process variations include random dopant fluctuation, annealing effects, and lithographic limitations, among others [Borkar 2004a].

The pattern generation mechanism depends on the source (location) of IL that is the target of testing. For interconnects and power supply noise, deterministic test pattern generation methods are often used. For testing IL caused by process variation, however, pattern generation is replaced by sensing or monitoring devices.

Interconnects

The maximum aggressor (MA) fault model [Cuviello 1999] is a simplified model that many researchers use, mainly for crosstalk analysis and testing. This model, shown in Figure 8.2, assumes that the signal traveling on a victim line V may be affected by signals/transitions on other aggressor line(s) A in its neighborhood. The coupling can be represented by a generic coupling component Z. In general, the result could be noise and excessive delay, which may lead to functional error and performance degradation, respectively. There is, however, controversy as to what patterns trigger maximal integrity loss. Specifically, in the traditional MA model, which takes only the coupling capacitance C into account, all aggressors make the same simultaneous transition in the same direction, whereas the victim line is kept quiescent for maximal ringing (see pattern pair 1) or makes an opposite transition for maximal delay (see pattern pair 2).

Figure 8.2. Signal integrity fault model using concept of aggressor/victim lines.

Figure 8.3 depicts six cases (pattern pairs) for a three-line interconnect, where the middle line is the victim, based on the MA fault model. Test patterns for signal integrity are vector pairs. For example, when the victim line is kept quiescent at 0 or 1 (see columns 1 through 4), four possible transitions on the aggressors are examined.

Figure 8.3. The MA fault model and test patterns.

When mutual inductance comes into play, some researchers have shown that the MA model may not reflect the worst case and have presented other test generation approaches (using pseudo-random, weighted pseudo-random, or deterministic patterns) to stimulate the maximal integrity loss [Chen 1998, 1999] [Attarha 2002].

As reported in [Naffziger 1999], a device failed when the nearest aggressor lines changed in one direction and the other aggressors changed in the opposite direction. The MA fault model does not cover this and many similar scenarios. Exhaustive testing covers all situations, but it is time consuming because of the huge number of test patterns. In [Tehranipour 2004], the authors proposed a multiple transition (MT) fault model assuming a single victim, a limited number of aggressors, a full transition on the victim, and multiple transitions on the aggressors. In the MT model, all possible transitions on the victim and aggressors are applied, whereas in the MA model only a subset of these transitions is generated. Briefly, the MA-pattern set is a subset of the MT-pattern set.
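
The small sketch below (an illustrative enumeration, not code from any cited paper) generates the MA-style pattern pairs for an n-line bus with a chosen victim: the victim is held quiescent for the ringing tests and driven with an opposing transition for the delay tests, while all aggressors switch together.

```python
# Enumerate maximum aggressor (MA) test pattern pairs for a bus with one victim line.
def ma_pattern_pairs(n_lines: int, victim: int):
    pairs = []
    for agg_from, agg_to in [(0, 1), (1, 0)]:          # all aggressors rise or fall together
        for v_from, v_to in [(0, 0), (1, 1),            # victim quiescent -> maximal ringing
                             (agg_to, agg_from)]:       # victim opposes aggressors -> maximal delay
            v1 = [agg_from] * n_lines
            v2 = [agg_to] * n_lines
            v1[victim], v2[victim] = v_from, v_to
            pairs.append((tuple(v1), tuple(v2)))
    return pairs

# 6 pattern pairs for a 3-line interconnect with the middle line as victim (cf. Figure 8.3).
print(ma_pattern_pairs(3, victim=1))
```

An MT-style generator would enumerate all transition combinations on the aggressors instead of only the all-rise/all-fall cases, which is why the MA-pattern set is a subset of the MT-pattern set.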

Power Supply Noise

Power supply is distributed in a VLSI chip through wires that contain parasitic R/L/C elements. Hence, drawing current from a power source produces voltage fluctuations. Noise is created in the power supply lines mainly because of the resistive and inductive parasitic elements. Specifically, the inductive noise, also known as di/dt noise, is caused by the instantaneous change in the current drawn from the power supply. Inductive noise becomes significant in high-frequency designs because of the sharp rise and fall times: high-frequency switching causes the current to be drawn (from the power supply) for very short durations (usually hundreds of picoseconds), causing very high di/dt. Resistive noise (IR drop) depends on the current drawn from the power supply. Hence, the power supply noise, written PSN(t) or simply PSN, is collectively given by:

PSN(t) = L · di(t)/dt + R · i(t)

For an n-input circuit, there are 2^n(2^n − 1) ≈ 2^(2n) possible pattern pairs that may cause internal transitions. Because simulating all possible pairs is unrealistic, it is essential to be able to select a small set of pattern pairs without exhaustive simulation. One approach is to use random-pattern-based simulation. Unfortunately, because of the random nature of these patterns, such approaches cannot guarantee to create maximum PSN in a reasonable amount of simulation time. Researchers have, therefore, applied deterministic or probabilistic heuristics to find such pattern sets with no guarantee of optimality.

The authors in [Lee 2001] presented the generation and characterization of three types of noise induced by electrostatic discharge in power supply systems. These three types are I/O protection induced signal loss, latent damage after electrostatic discharge stress, and power/ground coupling noise. To speed up the power supply noise analysis and test generation process, some works exploited the concept of random search. For example, the authors in [Jiang 1997] and [Bai 2001] used a genetic algorithm (with a random basis) to stimulate the worst-case PSN. Some researchers, such as [Zhao 2000b], precharacterize cells using transistor-level simulators and annotate the information into the PSN analysis phase. A technique for vector generation for power supply noise estimation and verification is described in [Jiang 2001]. The authors used a genetic algorithm to derive a set of patterns producing high power supply noise. A pattern generation method to minimize the PSN effects during testing is presented in [Krstic 2001]. In [Nourani 2005], the authors identified three design metrics (level, fanin, and fanout) that have the maximum effect on PSN. A greedy algorithm and a conventional fault simulator are then used to quickly construct pattern pairs that stimulate the worst-case PSN based on circuit topology, regardless of whether the circuit is in functional mode or in test mode.
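
As a rough illustration of such heuristic searches, the sketch below performs a simple random search for a high-PSN pattern pair. The `estimate_psn` scoring function is a hypothetical placeholder (here just Hamming distance as a crude proxy for simultaneous switching); a real flow would call a circuit-level PSN estimator or fault simulator instead.

```python
# A toy random search for a pattern pair that maximizes an estimated power supply noise.
import random

def estimate_psn(v1, v2):
    # Hypothetical placeholder: score a pattern pair by its Hamming distance,
    # a crude proxy for the amount of simultaneous switching it causes.
    return sum(a != b for a, b in zip(v1, v2))

def search_worst_psn(n_inputs, trials=10_000, seed=0):
    rng = random.Random(seed)
    best_pair, best_score = None, -1
    for _ in range(trials):
        v1 = [rng.randint(0, 1) for _ in range(n_inputs)]
        v2 = [rng.randint(0, 1) for _ in range(n_inputs)]
        score = estimate_psn(v1, v2)
        if score > best_score:
            best_pair, best_score = (v1, v2), score
    return best_pair, best_score

pair, score = search_worst_psn(n_inputs=16)
print(score, pair)
```

A genetic or greedy variant would replace the blind sampling loop with crossover/mutation of the best pairs found so far, or with topology-guided construction as in [Nourani 2005].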

Process Variations

As device technology progresses toward 45 nm and below, the fidelity of the process parameter modeling becomes questionable. For every process, there is some level of uncertainty in its device parameters because of limitations imposed by the laws of physics, imperfect tools, and properties of materials not fully comprehended [Visweswariah 2003]. The deviation of parameters from desired values because of the limited controllability of a process is called process variation (PV). Sources of process variation include random dopant fluctuation, annealing effects, and lithographic limitations. Typical variations are 10% to 30% across wafers and 5% to 20% across dies, and they may change the behavior of devices and interconnects [Borkar 2004a].

Researchers have explored various ways of analyzing and dealing with process variation. The solutions are broadly labeled design for manufacturability (DFM) techniques. DFM strives to quantify the impact of PV on circuits and systems. This role has made DFM techniques of high interest to semiconductor and manufacturing companies [Nassif 2004]. There are tens of factors that affect or contribute to process variation, including interconnects, thermal effects, and gate capacitance [Dryden 2005]. Process variation has been shown to potentially cause a 40% to 60% variation in effective gate length Leff and a 10% to 20% fluctuation in Vt and Tox, which may lead to malfunction and ultimately failure [Borkar 2004a].

The PV monitoring/analysis approaches can be classified using different criteria. From a source point of view, variation can be intradie (within-die) or interdie. The latter can be classified as a die-to-die, center-to-edge (in a wafer), wafer-to-wafer, lot-to-lot, or fab-to-fab variation. From a methodology point of view, the solutions are classified into two broad classes: statistical and systematic. Examples of statistical approaches are PV modeling [Sato 1998], analyzing the impact of parameter fluctuations on critical-path delay [Bowman 2000] [Agarwal 2003], mapping statistical variations into an analytical model [Cao 2005], and addressing the effect of PV on crosstalk delay and noise [Narasimha 2006].

In systematic approaches, because of the complexity of parameters, almost all researchers traced a limited number of PV metrics and their effects on design characteristics. [Orshansky 2000] explored the impact of the gate length on performance. The authors in [Chen 2005] proposed a current monitor component to design PV-tolerant circuits, and [Azizi 2005] analyzed the effect of voltage scaling on making designs more resistant to process variations. Other works that focused on the effect of PV on key design characteristics include [Mehrotra 1998], which studied the effect of manufacturing variation on microprocessor interconnects; [Ghanta 2005], which showed the effect of PV on the power grid; [Agarwal 2005], which presented the failure analysis of memories by considering the effect of process variation; and [Ding 2005], in which the authors investigated the effect of PV on soft error vulnerability.

Because of the nature of parameters (e.g., Vt, Tox, and Leff) affected by process variation, tracing and pinpointing each variation for any realistic circuit is not a viable option. Hence, it becomes necessary to abstract and limit the problem to a specific domain to look at it collectively. Such abstraction can be clearly seen in prior works that consider process variation for clock distribution [Bowman 2001], delay test [Lu 2004], defect detection [Acharyya 2005], PV monitoring techniques [Kim 2005], reliability analysis [Borkar 2004b], and yield prediction [Jess 2003]. The authors in [Mohanty 2007] considered simultaneous variation of Vt, Tox, and Vdd for the transistor’s current characterization and optimization.

Tracing process variations for individual parameters is not possible because of the size, complexity, and unpredictability of the contributing factors. As in conventional testing approaches (e.g., stuck-at fault, path-delay fault), we need a simplified model for PV-faults to be able to devise and apply a PV test methodology. In spite of its simplicity, the model should be generic in concept, straightforward in measurement, and practical in application. A single PV-fault model is defined in [Nourani 2006]. This model assumes that there exists only one grid (unit area) in the layout of the circuit under test where a sensor planted in that region generates a faulty metric zf instead of a fault-free metric z such that Δz = |zf − z| is measurable (in terms of delay, frequency, etc.).

Integrity Loss Sensors/Monitors

The integrity loss metric depends on the signal’s waveform (see Figure 8.1), which is subject to change as it travels through a circuit. For an accurate and uniform measurement, the integrity loss needs to be captured or sampled right after creation. This will be practical only by limiting the observation sites and designing cost-effective sensors and readout circuitry. Various types of on-chip sensors, potentially useful for integrity loss detection, have been reported in the literature. This section explores a few of the on-die environmental sensors that can be used in BIST or scan-based architectures for targeting integrity loss and process variations.

Current Sensor

Current sensors are often used to detect the completion of asynchronous circuits [Lampinen 2002] [Chen 2005]. Figure 8.4 shows a conventional current sensor. The supply current of a logic circuit block is mirrored through a current mirror transistor pair (M0 and M1) to a bias-generation circuit. The bias-generation circuit contains an N-channel metal oxide semiconductor (NMOS) biased as a resistor (M2). If the supply current is high, the voltage drop across M2 is high, which generates Done = 0 indicating the job is not completed. When the circuit operation is completed, only the leakage current flows through the circuit. This makes the voltage drop across M2 quite small, thus producing Done = 1.

Figure 8.4. Conventional current sensor.

In general, designing and fine-tuning the current sensors are challenging tasks. Yet, different versions of current sensors have been used for various monitoring and testing applications. For example, the authors in [Yang 2001] proposed boundary scan combined with transition power supply current (IDDT) for testing interconnect buses; the authors in [Chen 2005] proposed a leakage canceling current sensor (LCCS). The sensor was then recommended for self-timed logic to design process-variation tolerant systems. The self-timed systems can accept input data synchronously and supply their outputs asynchronously (i.e., by a “Done” signal generated by the current sensor).

Power Supply Noise Monitor

A PSN monitor was presented in [Vazquez 2004] with which the authors claimed to detect PSN at the power/ground lines with high (100-ps) resolution. The schematic of this circuit is shown in Figure 8.5. Briefly, the three inverters work as a delay line whose delay depends on its effective supply voltage. The charge supplied to Cx (voltage Vx) is proportional to the propagation delay of the inverter block. The Vx voltage at the end of the sampling period (controlled by the NOR gate) depends on the supply voltage. Thus, the voltage Vx depends on the power/ground bounce: the higher the PSN is, the longer the propagation delay and the higher the voltage Vx will be.

Figure 8.5. Power supply noise monitor.

Noise Detector (ND) Sensor

A modified cross-coupled P-channel metal oxide semiconductor (PMOS) differential sense amplifier is designed to detect integrity loss (noise) relative to voltage violations [Nourani 2002]. Figure 8.6 shows a noise detector (ND) sensor, which sits physically near the receiving Core j for sampling the actual signal plus noise transmitted from Core i. TE is connected to the test mode signal to create a permanent current source in test mode, and the other input of the sense amplifier is connected to VDD to define the threshold level for sensing Vb (i.e., the voltage received at node x). By adjusting the size of the PMOS transistors (i.e., W and L), the current through transistors T1 and T2 and the threshold voltages to turn the transistors on or off can be tuned. A designer uses this tuning technique to set the high and low noise threshold levels (VHthr and VHmin) in the ND sensor. Each time noise occurs (i.e., Vb > VHthr), the ND sensor generates a 0 signal that remains unchanged until Vb drops below VHmin. The ND sensor shows a hysteresis (Schmitt-trigger) property, which implies a (temporary) storage behavior. This property helps to detect the violation of two threshold voltages (VHthr and VHmin) with the same ND sensor.

Figure 8.6. Noise detector (ND) sensor using a cross-coupled PMOS amplifier.

Integrity Loss Sensor (ILS)

The integrity loss sensor (ILS) is a delay violation sensor shown in Figure 8.7 [Tehranipour 2004]. The ILS consists of two parts: the sensor and the detector (XNOR gate). An acceptable delay region (ADR) is defined as the time interval from the triggering clock edge during which all output transitions must occur. The test clock TCK is used to create a delayed signal b, and together they determine the ADR window. The input interconnect signal a is in the acceptable delay period if its transition occurs during the period when b is at logic 0. Any transition that occurs during the period when b is at logic 1 is passed through the transmission gates to the detector composed of an XNOR gate. The XNOR gate is implemented using dynamic precharged logic. Output c becomes 1 when a signal transition occurs during b = 1 and remains unchanged until b = 0, the next precharge cycle. Output c is used to trigger a flip-flop. The minimum detectable delay can be decreased by increasing the delay of inverter 2 (Tinv2) or decreasing the delay of inverter 1 (Tinv1). Tinv2 can be decreased only as long as the duration is enough to precharge the dynamic logic.

Figure 8.7. Integrity loss sensor (ILS).

Jitter Monitor

Jitter is often defined as the deviation of a signal’s timing from its ideal position in time. Jitter characterization is important for phase-locked loops (PLLs) and other circuits with time-sensitive outputs. Jitter measurement of a data signal with sub-gate resolution can be done using two delay lines feeding a series of D latches, as shown in Figure 8.8. Such a structure is known as a Vernier delay line (VDL) [Gorbics 1997]. Assuming the clock signal is jitter-free, the per-stage propagation delays of the data and clock paths differ by ΔT = Td − Tc. The time difference between the two rising edges decreases by ΔT after each stage, and the phase relationship between these two edges is recorded by a D latch in each stage. A counter reads the output of each D latch and counts the number of times the data signal leads the clock signal by a delay difference that depends on the position of the D latch in the chain. Alternatively, the histogram of the jitter (i.e., the jitter’s probability density function) can be directly derived by ORing the outputs of all D latches and counting the number of 1’s over the time period of the clock. The accuracy of jitter measurement using a VDL depends on the matching of delay elements between stages [Chan 2001].

Figure 8.8. Jitter monitor using VDL.
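
The behavioral sketch below (assumed per-stage delays, not from the cited work) mimics the VDL measurement: each stage adds Td to the data path and Tc to the clock path, so the data edge loses ΔT = Td − Tc of its lead per stage, and counting the latches that still see the data edge first yields the lead time with resolution ΔT.

```python
# Behavioral model of a Vernier delay line (VDL) lead-time measurement.
TD, TC = 105e-12, 100e-12          # assumed per-stage delays: resolution dT = 5 ps
N_STAGES = 32

def vdl_measure(lead_s):
    """lead_s: time by which the data edge arrives before the clock edge at the VDL input."""
    latch_outputs = []
    for k in range(1, N_STAGES + 1):
        data_arrival = -lead_s + k * TD      # data edge enters early, then passes k slow stages
        clock_arrival = k * TC               # clock edge passes k fast stages
        latch_outputs.append(1 if data_arrival < clock_arrival else 0)
    return latch_outputs, sum(latch_outputs) * (TD - TC)

latches, measured = vdl_measure(lead_s=42e-12)
print(f"measured lead ~= {measured * 1e12:.0f} ps (resolution {(TD - TC) * 1e12:.0f} ps)")
```

Repeating the measurement over many edges and histogramming the per-edge results gives the jitter probability density mentioned above.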

Process Variation Sensor

Using ring oscillators (ROs) to probe on-die process variation is a long-standing practice. The oscillators inserted into the system are affected by process variation (PV) along with the rest of the system. The variation of delay caused by PV-faults in any of the inverters in the loop results in a deviation in the frequency of the oscillator, which can be detected. Several foundries have already used ring oscillators on wafers to monitor process variations; this often serves as a benchmark performance measure (sanity check) [MOSIS 2007]. Conventionally, several in-wafer ROs are placed on dicing lines (scribe lines) for process parameter monitoring. Unfortunately, this is insufficient for evaluating variations on each die, and there is a growing demand for sensors and methodologies that allow process variations to be evaluated with precision.

Ring oscillators are implemented by cascading an odd number of inverters to form a loop. By using an odd-numbered loop, we ensure that the value fed back to the first inverter is the inverse of its previous input, thus preventing the loop from stabilizing to a steady state. The oscillation period of a ring oscillator equals twice the total delay around the loop, so fRO = 1/(2NinvTinv), where Ninv is the (odd) number of inverters and Tinv is the delay of one inverter. Using standard CMOS inverters, the following equation can be written [Rabaey 1996]:

fRO = 1/(2NinvTinv), with Tinv ≈ CLVDD/(2ID)

This equation is a first-order approximation of the relationship among the current, load capacitance, and frequency of an oscillator. Being an approximation, the formula cannot accurately capture the complex relationships among these factors. Yet, in general, process variation collectively causes a measurable frequency shift (Δf) in the output of a ring oscillator. For example, simulation results using a commercially available SPICE circuit simulator at a TSMC 180-nm process node for a 41-stage ring oscillator show a shift in frequency of 16 MHz and 6 MHz for a 10% variation of threshold voltage (Vt) and transistor oxide thickness (Tox), respectively [Nourani 2006].
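
The sketch below (assumed stage count and nominal delay, not the cited TSMC data) shows how a small shift in per-stage inverter delay, as caused by Vt or Tox variation, maps into a ring-oscillator frequency shift under the first-order model fRO = 1/(2NinvTinv).

```python
# First-order ring-oscillator frequency shift due to a per-stage delay change.
N_INV = 41                 # number of inverter stages (odd)
T_INV_NOM = 60e-12         # assumed nominal per-stage delay: 60 ps

def ro_frequency(t_inv, n_inv=N_INV):
    return 1.0 / (2 * n_inv * t_inv)

f_nom = ro_frequency(T_INV_NOM)
f_slow = ro_frequency(T_INV_NOM * 1.05)    # e.g., stages 5% slower due to a local Vt/Tox shift
print(f"nominal f_RO = {f_nom / 1e6:.1f} MHz, shifted f_RO = {f_slow / 1e6:.1f} MHz, "
      f"delta = {(f_nom - f_slow) / 1e6:.1f} MHz")
```

With these assumed numbers the oscillator runs near 200 MHz, and a 5% delay shift moves it by roughly 10 MHz, a change that is easy to resolve with an on-chip counter.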

Readout Architectures

Both popular test methodologies (i.e., BIST and scan) can be used to coordinate the activities among IL sensors and the readout mechanism in a signal integrity test session. We now briefly address these two basic architectures.

BIST-Based Architecture

In a logic BIST environment, a test pattern generator (TPG) is used to generate pseudo-random patterns for detecting manufacturing faults in a circuit under test (CUT). An output response analyzer (ORA) is used to compact the test responses of the CUT and form a signature. Under the control and coordination of the BIST controller, the final signature is then compared against an embedded golden signature to determine pass/fail of the CUT. Figure 8.9 shows a typical logic BIST architecture to test SOC interconnects for integrity. The integrity loss monitoring cell (IL sensor) can be any of the sensors discussed earlier, such as ND or ILS. The TPG and ORA are located on the two sides of the interconnect under test (IUT). The IUTs can be long interconnects or those suspected of having noise/delay violations as a result of environmental factors (crosstalk, electromagnetic effects, environmental noise, etc.).

Figure 8.9. Typical logic BIST architecture for integrity testing.

The rationale for using pseudo-random patterns for integrity testing is that finding patterns guaranteed to create the worst-case scenarios for integrity loss (e.g., noise and delay) is prohibitively expensive with the current state of knowledge. This is mainly because of the complexity of the distributed resistance-inductance-capacitance (RLC) interconnect model, the parasitic values, and the large number of influential factors.

Detecting signals that leave the noise-safe and time-safe regions is a crucial step in IL monitoring and testing. Various IL sensors may be needed per interconnect to detect noise (crossing the noise threshold voltages VHthr and VHmin) and delay violations. The test architecture used to read out the information stored in these cells is based on a DFT decision, which depends on the overall SOC test methodology, testing objective, and cost considerations. Figure 8.10 shows one such test architecture, given in [Nourani 2002]. IL sensors are pairs of ND and ILS cells, which, in coordination with scan cells, record the occurrence of any noise or delay violation. The results are scanned out via Sout to the scan-out chain for analysis. In test mode, the flag signal is first transmitted through the multiplexer to the test controller. When a noise or delay violation (low-integrity signal) occurs (flag = 1), the contents of all scan cells are then scanned out through Sout for further reliability and diagnosis analysis. Suppose an n-bit interconnect is under test for m cycles (i.e., with m pseudo-random test patterns). The pessimistic worst-case scenario in terms of test time is a case in which all lines are subject to noise in all m test cycles. This situation requires overall m cycles for response capture and mn cycles for readout. In practice, a much shorter time (e.g., kn, where k << m) is sufficient, as the presence of defects or environmental factors causing an unacceptable level of noise/delay (integrity loss) is limited.

Figure 8.10. The readout circuitry.

Scan-Based Architecture

The IEEE 1149.1 boundary-scan test standard [IEEE 1149.1-2001], also known as Joint Test Action Group (JTAG) standard, has been widely accepted and practiced in the electronics industry for testing interconnects between devices and providing external access to a device under test or diagnosis. The standard provides excellent test and diagnosis capabilities for devices mounted on a printed-circuit board (PCB) or embedded in a system with low complexity, but it was not intended to address high-speed testing and signal integrity loss.

To address signal integrity loss, in [Whetsel 1996], the author proposed a method to simplify the development of a mixed-signal test standard by adding the analog interconnect test to 1149.1. The IEEE 1149.4 mixed-signal test bus standard [IEEE 1149.4-1999] was then developed to allow access to the analog pins of a mixed-signal device. In addition to the ability to test interconnects using digital patterns, the 1149.4 standard includes the ability to measure actual passive components, such as resistors and capacitors; however, it cannot support high-frequency phenomena, such as crosstalk on interconnects. To deal with high-speed testing, the IEEE 1149.6 standard provides a solution for testing AC-coupled interconnects between integrated circuits on PCBs or systems [IEEE 1149.6-2003]. Various issues on the extended JTAG architecture to test SOC interconnects for signal integrity are reported in [Ahmed 2003], [Tehranipour 2003a], and [Tehranipour 2003b] where maximum aggressor (MA) and multiple transition (MT) fault models are employed.

Integrating pseudo-random pattern generators and IL sensors within a scan test architecture is a relatively straightforward task. To activate the IL sensors, a separate test mode is needed. For example, the authors in [Tehranipour 2004] proposed modifying the boundary-scan cells (BSCs) for testing integrity loss on interconnects. At the driving side of an interconnect, a modified BSC that generates test patterns, called a pattern generation BSC (PGBSC), is used. At the receiving side of the interconnect, the authors proposed using an observation BSC (OBSC) that includes an integrity loss sensor.

Figure 8.11 shows the overall test architecture with n interconnects between core i and core j in a two-core SOC. The five standard JTAG interfaces (TDI, TCK, TMS, TRST, and TDO) are still used without any modification. Two new instructions (called G-SITEST and O-SITEST) have been defined for signal integrity test, one to activate PGBSCs to generate test patterns and the other to read out the test results. The cells at the output pins of core i are changed to PGBSCs, and the cells at the input pins of core j are changed to the OBSCs. The remaining cells are standard BSCs, which are present in the scan chain during signal integrity test mode. In the case of bidirectional interconnects, boundary-scan cells used at both ends are a combination of PGBSC and OBSC to test the interconnects in both directions. The IL sensing part of OBSCs does not need any special control and automatically captures the occurrence of integrity loss. After all patterns are generated and applied, signal integrity information stored in the IL sensing scan cell is scanned out to determine which interconnect has a problem.

Figure 8.11. Test architecture.
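
The sketch below outlines the test flow just described using a hypothetical JTAG driver; the class and its methods are placeholders for tester or boundary-scan controller software, not a real API, and only the two instruction names follow the G-SITEST/O-SITEST convention above.

```python
# Hedged sketch of a boundary-scan signal-integrity test session (hypothetical driver API).
class JTAGController:
    """Placeholder JTAG driver used only to illustrate the flow."""
    def shift_instruction(self, name):
        print(f"IR <= {name}")
    def run_test_clocks(self, cycles):
        print(f"apply {cycles} TCK cycles in Run-Test/Idle")
    def shift_data_out(self, nbits):
        return [0] * nbits           # placeholder: no integrity-loss flags captured

def signal_integrity_test(jtag, n_interconnects, n_patterns):
    jtag.shift_instruction("G-SITEST")        # PGBSCs start generating pattern pairs
    jtag.run_test_clocks(n_patterns)          # OBSC sensors capture any noise/delay violations
    jtag.shift_instruction("O-SITEST")        # switch to readout of the IL-sensing scan cells
    return jtag.shift_data_out(n_interconnects)   # one flag per interconnect under test

flags = signal_integrity_test(JTAGController(), n_interconnects=16, n_patterns=1000)
print(flags)
```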

PV-Test Architecture

In PV testing, because of the self-generating and on-spot nature of process variation, no fault stimulation is needed. However, the output of a sensor needs to be carried to an observation point for analysis. Researchers have observed that intradie variations are significantly more difficult to predict and deal with than die-to-die variations on a wafer [Bowman 2000] [Nassif 2004]. On-chip ROs with counters, embedded in a test chip, were presented in [Hatzilambrou 1996] to detect process variation by measuring the ROs’ frequency shifts. This frequency variation was then used to grade the die performance or the performance of individual cores. This approach is not intended as a pass/fail test but instead as a grading test.

There are a number of techniques for PV probing and monitoring. Two examples were described in [Ukei 2001] and [Samaan 2003]. The monitor test element group (TEG) proposed in [Ukei 2001] consists of a ring oscillator and control circuitry. Five TEGs are arranged in the four corners and the middle of a die, and their signals are read out one by one for process variation and manufacturing yield analysis. In [Samaan 2003], ROs are distributed over an integrated circuit chip depending on the available layout space. To record each RO’s oscillation frequency unambiguously, only one RO is allowed to operate at any one time. An analog-frequency wire is used to deliver the test data to the counting and monitoring units for analysis. In [Bhushan 2006], the authors employed ring oscillators to evaluate the effect of process variations on key transistor parameters such as switching delay and active/leakage power. These metrics reflect the average behavior of a few hundred MOSFETs embedded within each RO test structure. The authors experimented with IBM’s 90-nm and 65-nm circuits and showed that their test mechanism can be useful for manufacturing test.

In [Nourani 2006], a distributed network of several (extendable to a large number of) ring oscillators per die was presented. The methodology targets detecting those process variation changes that collectively cause a measurable frequency shift (Δf) in the output of a ring oscillator planted in that region. These measurements can be further used to identify the problematic region(s) of the die and even grade the quality of the die or assembled chip. This approach carefully chooses three architectural parameters, namely the types, numbers, and positions of the ring oscillators. The basic architecture is shown in Figure 8.12.

Figure 8.12. Basic concept of PV test architecture with ROs and compactor(s).

The layout under PV test may be a full die or a portion of a die, such as a sensitive core. In this figure, each RO symbolically occupies one or more regions out of 9 × 9 = 81 regions. In practice, the placement/layout generation tools position the ROs automatically. The concept of fault sampling was used to stay practical while collecting a good estimate of coverage. The authors assumed that the area of each RO is a multiple of a grid area and that the size of a die is equivalent to Np grids. Applying sampling theory and trading off accuracy against the number of samples, the authors randomly chose Ns grids (where PV-faults occur) out of Np and planted sensors to collect data on process variations.
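
The sketch below illustrates the sampling step with assumed numbers (an 81-grid die and a 12-sensor budget); it simply picks Ns grids at random, which is the essence of the fault-sampling argument.

```python
# Random selection of Ns sensor grids out of Np layout grids (assumed sizes).
import random

NP = 81            # total number of layout grids (e.g., a 9 x 9 partition of the die)
NS = 12            # assumed number of grids that receive a PV sensor (e.g., a ring oscillator)

rng = random.Random(2007)
sampled_grids = sorted(rng.sample(range(NP), NS))
print(f"sensors planted in grids {sampled_grids}")
print(f"sampled fraction of the die: {NS / NP:.1%}")
```

Sampling theory then bounds how well the frequency shifts observed in these Ns grids estimate the variation over all Np grids, which is the accuracy-versus-cost tradeoff mentioned above.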

Manufacturing Defects, Process Variations, and Reliability

Aside from noise-induced errors such as power supply noise and signal integrity loss, manufacturing faults caused by manufacturing defects and process variations can severely impact device (process) yield and DPM levels. An internal document from a foundry reports that when defect avoidance and defect tolerance schemes were employed for one 50-mm² design at a 130-nm process node on an 8-inch fabrication line, the defect density was reduced from 0.2 to 0.1 defects per square inch, the device yield was increased from 85% to 92%, and there were 40 more good dies per wafer (490 versus 450).

Fault-model-based (structural) tests, such as stuck-at tests and transition tests, have become a requirement for improving a device’s fault coverage during manufacturing test. Studies have shown that stuck-at tests with 100% single stuck-at fault coverage cannot guarantee perfect product quality (i.e., no test escapes) [McCluskey 2000] [Li 2002] [McCluskey 2004]. An investigation by [Ferhani 2006] further revealed that only 6% of the 483 defective ELF18 chips contained defects that acted as single stuck-at faults, whereas 18% of the 205 defective ELF35 chips and 35% of the 116 defective Murphy chips contained such defects. The remaining defects were (1) timing dependent, (2) sequence dependent, or (3) attributed to timing-independent, non-single-stuck-at faults, such as multiple stuck-at faults or nonfeedback bridging faults. A timing-dependent defect is sequence dependent because timing dependence implies that a transition arrives either earlier or later than expected; these transitions are created by a sequence of values applied to the circuit inputs that forms a test for the defect.

Possible causes of timing-dependent defects are resistive opens, connections that have significantly higher resistance than intended, or transistors with lower drive than intended [McCluskey 2004]. Possible causes of sequence-dependent defects are (1) a defect that acts like a stuck-open fault [Li 2002] or (2) one that causes a feedback bridging fault [Franco 1995].

Fault Detection

To detect these manufacturing faults caused by manufacturing defects and process variations, one common approach is to generate multiple test sets, each targeting a different fault model. These fault-model-based tests are commonly referred to as structural tests. Stuck-at tests and delay tests (including transition tests and path-delay tests) belong to this category. There are also structural tests that are not based on any explicit fault model [Boppana 1999]; they were first used for fault diagnosis and later for ATPG. Because these structural tests cannot provide sufficient defect coverage, a conventional approach has been to supplement structural tests with functional tests running at the circuit’s rated speed. As we move toward the nanometer design era, meeting the stringent DPM goal is becoming a serious problem. This has prompted the need to generate defect-based tests by enumerating likely defect sites (failures) from the layout based on physical characteristics [Sengupta 1999] [Segura 2002] [Gizopoulos 2006]. The physical characteristics of defects were studied to find better tests [Hawkins 1994] or to understand yield learning [Maly 2003].

Figure 8.13 shows a defect-based test architecture [Sengupta 1999]. Structural tests are first generated from ATPG based on conventional fault models, such as stuck-at faults and transition faults. These structural tests are then combined with functional tests, and the resulting tests are fault-graded using a defect-based fault simulator for a given logical fault list, such as small delay defects and bridging faults, extracted from the physical layout. The undetected fault list is then sent to a defect-based ATPG for generating additional defect-based tests to meet the product’s fault coverage and DPM goals.

Figure 8.13. A defect-based test architecture.
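
The toy sketch below mirrors the flow of Figure 8.13 with stand-in functions and synthetic data; none of it corresponds to a real ATPG or fault-simulation tool, it only shows the iteration between fault grading and defect-based test generation.

```python
# Toy model of the defect-based test flow: fault-grade existing tests against a
# layout-extracted defect fault list, then top up coverage with defect-based "ATPG".
import random

rng = random.Random(1)

def fault_grade(tests, fault_list):
    """Stand-in for a defect-based fault simulator: each test detects a random subset."""
    detected = set()
    for _ in tests:
        detected |= {f for f in fault_list if rng.random() < 0.02}
    return detected

def defect_based_atpg(undetected, max_new=50):
    """Stand-in for defect-based ATPG: one new test per undetected fault, up to a limit."""
    return [f"test_for_{f}" for f in sorted(undetected)[:max_new]]

structural_tests = [f"struct_{i}" for i in range(200)]     # stuck-at and transition tests
functional_tests = [f"func_{i}" for i in range(20)]
defect_faults = {f"defect_{i}" for i in range(500)}        # small delay defects, bridges, ...

tests = structural_tests + functional_tests
detected = fault_grade(tests, defect_faults)
undetected = defect_faults - detected

while undetected and len(detected) / len(defect_faults) < 0.99:
    new_tests = defect_based_atpg(undetected)
    tests += new_tests
    newly = fault_grade(new_tests, undetected)
    if not newly:
        break                                   # remaining faults treated as untestable here
    detected |= newly
    undetected -= newly

print(f"{len(tests)} tests, defect coverage = {len(detected) / len(defect_faults):.1%}")
```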

Structural Tests

Structural tests are fault-model-based tests that usually include stuck-at tests for detecting stuck-at faults and delay tests for detecting transition faults and path-delay faults. In the 1980s and 1990s, the most commonly used structural tests were stuck-at tests. Because stuck-at tests have difficulty in meeting a product’s DPM goal, functional tests are often used to supplement these structural tests during manufacturing test. An experiment conducted in [Maxwell 1991] showed that a structural test with 92% single stuck-at fault coverage had lower overall defect coverage than a combined structural and functional test with only 82% single stuck-at fault coverage. More recent experiments further confirm that even a structural test with 100% stuck-at fault coverage is inadequate to screen most manufacturing faults [Ferhani 2006].

To further improve a circuit’s defect coverage, at-speed delay tests have been used to supplement stuck-at tests since the 1990s, when process nodes started to move to 180 nanometers and below [Foote 1997]. One study on a 733-MHz PowerPC microprocessor designed at IBM showed that if at-speed delay tests were removed from the test program, the escape rate would rise by nearly 3% [Gatej 2002]. As a result, at-speed delay testing has become mandatory [Iyengar 2006] [Tendolkar 2006] [Vo 2006] for designs manufactured at 90 nanometers and below. These at-speed delay tests can come from scan or BIST [Wang 2005, 2006].

Modern ATPG and logic BIST programs can also take test power consumption into consideration when generating power-aware structural tests. These power-aware structural tests can avoid excessive heat during shift operation and IR-drop-induced yield loss during capture operation [Girard 2002] [Nicolici 2003] [Butler 2004] [Wen 2005] [Remersaro 2006] [Wen 2006]. These techniques have been extensively discussed in Chapter 7.

Defect-Based Tests

Defect-based tests include tests that are generated to target specific manufacturing faults arising from imprecise process technologies such as process variation and lithography. These defect-based test methods have been found crucial in screening additional physical failures during manufacturing test for designs at 130-nm process node or below. To increase manufacturing yield and meet a stringent DPM goal, these defect-based tests must supplement structural tests.

Small Delay Defect Tests

Small delay defect tests are delay tests that take into consideration the timing delay associated with the fault sites and the propagation paths extracted from the layout. Although it is more accurate to compute path delays from the layout, process variation means that the critical path in one chip may differ from that in another chip; one can therefore approximate the longest paths without the layout and target the set of longest paths rather than a single critical path. By targeting the set of longest paths rather than one critical path for each delay defect, small delay defect tests are intended to catch small delay defects that escape traditional transition fault tests [Park 1988] [Williams 1991]. This matters because shrinking feature size, growing circuit scale, increasing clock speed, and decreasing power supply voltage have made small delay defects the dominant failure mechanism in the nanometer design era [Mitra 2004].

One approach to testing small delay defects is to group these structural transition tests into sets of almost equal-length paths [Kruseman 2004] and then test each group at faster than the rated speed. Flip-flops fed by paths that exceed the cycle time or that contain hazards must be masked off. The faults that are not detected because of masking are targeted by patterns run at the next lower clock speed [Barnhart 2004]. Applying transition tests at faster than the rated speed has been shown to catch small delay defects that escape traditional transition fault tests [Kruseman 2004] [Amodeo 2005]. The drawback of this approach is that circuit design may limit how much faster than the rated speed tests may be safely applied. In addition, both hazard [Kruseman 2004] and IR drop [Ahmed 2005] issues need to be considered when applying this approach.
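
The sketch below (assumed path delays and guard band) shows the grouping idea: sensitized paths are binned by length, and each bin is captured with a clock tightened to its longest member rather than the rated period.

```python
# Group sensitized paths into near-equal-length bins and assign a faster capture clock per bin.
import random

RATED_PERIOD_NS = 10.0        # assumed rated clock period
GUARD_NS = 0.5                # assumed timing guard band per bin

rng = random.Random(0)
path_delays_ns = sorted(rng.uniform(2.0, 9.5) for _ in range(40))   # toy sensitized path lengths

bins = {}
for d in path_delays_ns:
    bins.setdefault(int(d), []).append(d)     # 1-ns-wide delay bins

for bin_id, paths in sorted(bins.items()):
    test_period = max(paths) + GUARD_NS       # capture clock tightened to the bin's longest path
    speedup = RATED_PERIOD_NS / test_period
    print(f"bin {bin_id}: {len(paths)} paths, capture period {test_period:.2f} ns "
          f"({speedup:.2f}x rated speed)")
```

Flip-flops whose paths do not fit a bin's tightened clock would be masked, exactly as described above.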

By targeting or favoring the longest delay paths, the authors in [Sato 2005], [Hamada 2006], and [Qiu 2006] have reported that transition fault tests simply running at the circuit’s rated speed can improve the quality of these transition tests and detect small delay defects. When the launch-on-capture clocking scheme is used, it is important to stretch the double-capture clock with an on-chip programmable clock controller that contains delay lines, in order to provide true at-speed delay tests [Rearick 2005].

The statistical delay quality model (SDQM) [Sato 2005] has been proposed for evaluating delay test quality. The SDQM is constructed by first assuming a delay defect distribution that is based on the actual defect probability in a fabrication process and then investigating the sensitized transition paths and calculating their delay lengths. Detectable delay defect sizes are defined as the difference between the test timing and the path lengths. Finally, the probability of missing small delay defects is accumulated over all faults by weighting each detectable defect size range with the assumed defect distribution; the calculated value is called the statistical delay quality level (SDQL).
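
The numerical sketch below follows the spirit of the SDQM formulation: for each fault, defects large enough to fail the system clock but too small to be caught at the test timing are escapes, and their probability mass under an assumed exponential defect-size distribution is accumulated. The distribution, the timings, and the path lengths are all illustrative assumptions.

```python
# Toy SDQL-style accumulation of the probability mass of escaping delay defects.
import math

LAMBDA = 2.0      # assumed parameter of an exponential delay-defect-size distribution (1/ns)
T_CLK = 10.0      # system (rated) clock period, ns
T_TEST = 10.0     # test timing, ns

def escape_mass(func_path_ns, test_path_ns):
    t_margin = T_CLK - func_path_ns      # smallest defect size that causes a functional failure
    t_detect = T_TEST - test_path_ns     # smallest defect size the applied test can detect
    lo, hi = max(t_margin, 0.0), max(t_detect, 0.0)
    if hi <= lo:
        return 0.0                       # the test catches every defect that could cause failure
    return math.exp(-LAMBDA * lo) - math.exp(-LAMBDA * hi)

# (longest functional path, longest test-sensitized path) per fault, in ns -- assumed values.
faults = [(9.6, 8.2), (9.1, 9.1), (8.5, 6.0)]
sdql = sum(escape_mass(f, t) for f, t in faults)
print(f"SDQL (expected escaped delay faults) = {sdql:.3f}")
```

The second fault contributes nothing because its test path equals its longest functional path; the other two show how under-sensitized paths inflate the SDQL.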

Bridging Defect Tests

It is also important to further supplement the defect-based delay tests and the traditional structural tests with bridging defect tests (often simply called bridging tests). One of the most common defect types in CMOS designs is the interconnect bridge. Bridges (shorts) involving many nodes are typically catastrophic, so although they are vital to yield calculation, they are less important for fault modeling. In [Aitken 2004], the author indicated that a bridging fault between two small NAND gates in a particular 130-nm process can result in (1) a dominance fault, where the strongest node always wins; (2) a voting fault, where the winner depends on the relative drive strengths of the opposing values; (3) an analog fault, including the Byzantine Generals behavior (where downstream gates interpret an intermediate shorted voltage differently, some as 0, some as 1); or (4) a delay fault. The bridge behavior depends on the resistance of the defect and on the operating voltage and temperature (see Tables 8.1 and 8.2). Thus, multiple-voltage bridge tests are required to detect delay-independent bridging faults, and delay tests running at different speeds should be used to detect short-induced delay faults [Aitken 2004]. As the number of potential bridges is astronomical, it is more realistic to enumerate likely bridging fault sites (physical bridging faults) from the layout [Stroud 2000] [Stanojevic 2001] and map them to logical bridging faults (realistic bridging faults) for fault simulation, scan ATPG, or fault diagnosis [Stanojevic 2000] [Zou 2005] [Ko 2006] [Wang 2006].

Table 8.1. Example Bridge Behavior at 1.2 V, 25°C

Bridge Resistance     Behavior (1.2 V, 25°C)
<3000 ohms            Dominance or voting fault
3000–3700 ohms        Analog or Byzantine fault
>3700 ohms            Delay fault

Table 8.2. Example Bridge Behavior at 1.32 V, 25°C

Bridge Resistance     Behavior (1.32 V, 25°C)
<2500 ohms            Dominance or voting fault
2500–3000 ohms        Analog or Byzantine fault
>3000 ohms            Delay fault

N-Detect Tests

It has been reported in [Ma 1995] that N-detect stuck-at tests, which detect every stuck-at fault multiple (N) times, are better at closing DPM holes than tests that detect each fault only once. This approach, called N-detect, works because each fault is generally targeted in several different ways, increasing the probability of activating a particular defect when the observation path to the fault site opens up. N-detect at-speed tests can also be used [Pomeranz 1999]. A promising study further shows that transition fault propagation to all reachable outputs (TARO), which generates one transition test for each reachable output of a given transition fault, was able to detect faults on a test chip that other tests could not [Tseng 2001] [McCluskey 2004] [Park 2005]. TARO can be a good candidate for tests that require extreme thoroughness, such as sample-based quality assurance tests, and for logic diagnosis when much better resolution is required.

On the other hand, gate exhaustive tests that apply all possible input combinations to each gate and observe test responses of the gate at a scan cell or primary output have also shown effectiveness in detecting more defective chips than single stuck-at tests [McCluskey 1993] [Cho 2005]. The authors in [Ferhani 2006] further demonstrated that N-detect and gate exhaustive test sets have higher TARO coverage than the transition and single stuck-at test sets; however, they cannot have higher transition coverage than the transition test set because the transition test set has been generated to have the highest possible transition coverage. Therefore, TARO seems to be a better metric than the other three in detecting timing-dependent and sequence-dependent defective chips, though it may require more test patterns.

IDDQ Tests

Normally, the leakage current of CMOS circuits in a quiescent state is small and negligible. When a fault, such as a transistor stuck short or a bridging fault, occurs and causes a conducting path from power to ground, it may draw an excessive supply current [Bushnell 2000] [Jha 2003]. Additionally, a weak CMOS chip can contain flaws or defects that do not cause functional failures in normal operation mode but degrade the chip’s performance, reduce noise margins, or draw excessive supply current. Such flaws or defects must also be screened. IDDQ tests, which monitor the quiescent power supply current (IDDQ), have been shown to detect many of these defect types, including some timing-dependent defects [Levi 1981] [Hawkins 1986] [Nigh 1998, 2000]. Thus, in addition to detecting leakage-current-related failures [SIA 2005], IDDQ tests can also be used for reliability screening. Effective methods to screen IDDQ outliers at the wafer or lot level further include using current ratios [Maxwell 2000] and the statistical outlier screening method [Daasch 2001].
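
As a rough illustration of the current-ratio idea, the sketch below compares the spread of IDDQ readings across vectors on a die rather than their absolute magnitude; the readings and the pass/fail threshold are hypothetical, and this is not the exact statistic of [Maxwell 2000].

    def iddq_current_ratio(readings):
        """Current-ratio screen: the spread of IDDQ readings across vectors
        on one die is examined rather than their absolute magnitude, which
        tolerates die-to-die variation in intrinsic leakage."""
        return max(readings) / min(readings)

    # Hypothetical microamp readings over a small vector set.
    good_die = [210.0, 215.0, 208.0, 212.0]
    suspect_die = [205.0, 640.0, 211.0, 209.0]   # one vector activates a defect

    THRESHOLD = 2.0  # hypothetical pass/fail ratio limit
    for name, die in [("good", good_die), ("suspect", suspect_die)]:
        ratio = iddq_current_ratio(die)
        print(name, round(ratio, 2), "fail" if ratio > THRESHOLD else "pass")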

IDDQ testing became an accepted test method for the IC industry in the 1980s. The small geometry sizes of today’s devices, however, have made normal fault-free IDDQ quite large because of the collective leakage currents of millions of transistors on a chip. This makes the detection of the additional defect-induced IDDQ current difficult; hence, IDDQ testing is becoming less effective, which has caused many companies to abandon or rely less on IDDQ tests [Williams 1996a, 1996b].

A similar approach is transient power supply current (IDDT) testing. When a CMOS circuit switches states, a momentary path is established between the supply lines VDD and VSS that results in a dynamic current IDDT. The IDDT waveform exhibits a spike every time the circuit switches with the magnitude and frequency components of the waveform dependent on the switching activity; therefore, it is possible to differentiate between fault-free and faulty circuits by observing either the magnitude or the frequency spectrum of IDDT waveforms. Monitoring the IDDT of a CMOS circuit may also provide additional diagnostic information about possible defects unmatched by IDDQ and other test techniques [Min 1998]; however, IDDT testing suffers many of the same problems as IDDQ testing as the number of transistors in VLSI devices continues to grow.

MinVDD Tests

Similar to IDDQ testing, which can be used to screen outliers for VLSI designs, traditional minimum supply voltage (MinVDD) testing can also be used to detect manufacturing faults. In general, a datasheet VDD specification is in the range of ±10% of VDD to allow a certain level of power supply tolerance. The device is supposed to still run at the rated speed with the raised/lowered VDD. The MinVDD test technique tries to find the lowest operating voltage at which the device still operates functionally, usually below the rated –10% VDD. Similar to IDDQ testing, the MinVDD test technique can also be used for reliability screening. The observation is that as we push the device to operate at the edge with barely enough voltage, marginal devices will show up as failures by not meeting the rated speed. In MinVDD testing, a minimum pass/fail VDD level is measured, usually by a binary search method over a large sample of dies, and the resulting voltage level is used to test each die by applying a full vector set of MinVDD tests. This screening method has been shown to be effective, although it may miss some MinVDD outliers.

To further look for these MinVDD outliers, it may be more practical to use the statistical outlier screening method [Madge 2002]. The statistical outlier screening method, referred to as feed-forward voltage testing in [Madge 2002], consists of three steps:

  1. The first step is to use a reduced vector set (RVS), typically 3% to 15% of the full vector set (FVS), to search for a minimum passing voltage (the intrinsic defect-free MinVDD) for each die. The reduced vector set can be composed of only stuck-at tests, delay tests, memory BIST tests, logic BIST tests, functional tests, or any combination of the above. A binary search algorithm, which reduces test time during wafer sort, is then employed to find and record the MinVDD at which the die functions correctly at the nominal test frequency.

  2. The second step is to use the full vector set (FVS) to test each die at this recorded minimum passing voltage (MinVDD) plus a small voltage guard-band. This guard-band is user definable and depends on the level of screening required to meet the product’s fault coverage and DPM goals. This calculated voltage is referred to as the feed-forward MinVDD, and the bad dies screened are called feed-forward outliers.

  3. The last step is to “post-process” the RVS binary search data by using complementary statistical post-processing (SPP) methods, namely delta VDD and nearest neighbor residual (NNR), to screen SPP outliers that are not screened by feed-forward testing but are statistical MinVDD outliers identified using the RVS binary search data. Delta VDD uses the MinVDD data collected from different intradie tests (in memory cores, scan cores, functional blocks, etc.) and calculates the delta between the values for each test. The differences between vector values on the same die are then compared against an expected difference to get an intradie residual, which is then measured against a threshold to determine a downgrade. The threshold value is adaptive (i.e., data based) and varies from die to die because of the gradual across-wafer variation of the intrinsic MinVDD data. NNR computes an interdie residual based on the vector value averages of the surrounding sites. The interdie residual is then compared to an adaptive threshold value to flag outlier dies. The concept of SPP outlier screening was mainly borrowed from IDDQ testing, where delta IDDQ and NNR have been shown to be effective in screening IDDQ outliers [Gattiker 1996] [Powell 2000] [Daasch 2001].

The statistical outlier screening or adaptive MinVDD method is most effective when there is a good correlation between RVS MinVDD and FVS MinVDD. The authors applied the method to four ASIC chips designed at LSI Logic using a guard-band of 50 mV at a 180-nm process node [Madge 2002]. Experimental results indicated that MinVDD yield fallout was 0.2%–0.8% depending on product complexity, and around 20% of these outlier dies showed positive or negative VDD shifting during burn-in. Resistive vias and tungsten stringers were identified as the source of defects during failure analysis.
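
The nearest neighbor residual computation can be sketched as follows, assuming a hypothetical wafer map of per-die MinVDD values. The neighborhood, threshold, and data are illustrative, and a production flow such as [Madge 2002] would also apply the delta VDD intradie check and an adaptive, rather than fixed, threshold.

    def nnr_outliers(wafer_map, threshold_v):
        """Nearest neighbor residual (NNR) screen on per-die MinVDD values.

        `wafer_map` is a dict {(x, y): minvdd}; each die's residual is its
        MinVDD minus the mean of its available neighbors, and dies whose
        residual exceeds the threshold are flagged as outliers.
        """
        outliers = []
        for (x, y), v in wafer_map.items():
            neighbors = [wafer_map[p] for p in
                         [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
                         if p in wafer_map]
            if not neighbors:
                continue
            residual = v - sum(neighbors) / len(neighbors)
            if abs(residual) > threshold_v:
                outliers.append((x, y))
        return outliers

    # Hypothetical wafer: a gentle across-wafer gradient plus one outlier die.
    wafer = {(x, y): 0.80 + 0.01 * x for x in range(4) for y in range(4)}
    wafer[(2, 2)] = 0.95
    print(nnr_outliers(wafer, threshold_v=0.05))   # [(2, 2)]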

VLV Tests

As opposed to MinVDD testing, which is performed at the datasheet rated speed and minimum voltage, very-low-voltage (VLV) testing is conducted at a test speed that is below anything guaranteed in the device datasheet (e.g., 600 MHz for a 2-GHz device). The test voltage can be at or below the MinVDD level. VLV testing was first proposed in [Hao 1993] as an alternative to burn-in to detect delay flaws. A delay flaw (nonoperational delay fault) is a defect that causes a local timing failure that is not severe enough to cause circuit malfunction. An example is an NMOS gate oxide short, which can be modeled as a resistive short [Hao 1993]. A delay fault (operational delay fault) is a timing failure that causes the circuit to malfunction at its rated speed, although it is functional when operated at a lower speed. Example timing failures include high-resistance interconnects, via defects, and tunneling opens that can only be detected by at-speed tests [Chang 1996b].

The VLV test technique is based on the observation that delay can increase substantially as VDD is lowered or as the driving strength of a gate (transistor) weakens to a certain level. The authors in [Chang 1996a] and [Chang 1996b] indicated that by setting the supply voltage to a low value between 2 and 2.5 times the threshold voltage (Vt) of the CMOS transistors, VLV testing can detect many types of delay flaws at the transistor level, including (1) transmission gate opens, (2) threshold voltage shifts, (3) diminished-drive gates, (4) gate oxide shorts, (5) metal shorts, and (6) defective interconnect buffers. The authors in [Ali 2006] further claimed that testing at the lowest operating voltages is only required for certain types of design flaws, such as transmission gate opens and bridging faults; weak resistive opens that cause delay faults are best tested at higher operating voltages. To guarantee the quality of dynamic voltage scaling (DVS) systems (see Section 8.3.4), it is necessary to select a number of voltage-specific delay tests.

Although these design flaws do not harm normal circuit operation, they may cause circuit malfunction (intermittent or early-life failures) if the supply voltage changes during operation because of IR drop or simultaneous switching noise. The test speed for VLV testing can be determined using the methods presented in [Chang 1996a] to achieve high design flaw coverage. One potential problem with this setting is that the VLV limit may already have reached the normal operating voltage in 90-nm designs [Gizopoulos 2006].

Though improved methods could be further explored to determine the VLV setting and test speed, the author in [Roehr 2006] showed that intermittent failures caused by an interplay of several factors in different wafer fabrication lots could be screened during burn-in for initial product qualification of the first two 130-nm low power (LP) CMOS products by (1) empirically setting the test speed to 12.5 MHz, (2) using MinVDD testing to search for a VLV setting at 0.9 V, and (3) applying a new VLV ratio (VLVR) limit of 10% to each die. The VLV test at 0.9 V running at 12.5 MHz was able to first eliminate the “tail” dies (the MinVDD outliers beyond 0.9 V). The VLV ratio, defined as (Max - Min)/Min of the two MinVDD values from two VLV tests (in two logic blocks: Registers and SysMem) on the same die and limited to 10%, was then able to eliminate the “flip” dies (the intermittent failures).
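
The VLV ratio check reduces to a one-line computation, sketched below with hypothetical per-die MinVDD values for the two VLV tests.

    def vlv_ratio(minvdd_registers, minvdd_sysmem):
        """VLV ratio (VLVR) = (Max - Min) / Min of two per-die MinVDD values."""
        hi, lo = max(minvdd_registers, minvdd_sysmem), min(minvdd_registers, minvdd_sysmem)
        return (hi - lo) / lo

    # Hypothetical per-die MinVDD measurements (volts) for the two VLV tests.
    print(vlv_ratio(0.82, 0.80) > 0.10)   # False: consistent blocks, keep the die
    print(vlv_ratio(0.95, 0.80) > 0.10)   # True: "flip" die, screen it out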

The differences between MinVDD testing and VLV testing will slowly disappear as CMOS scaling continues. The typical operating VDD will continue to trend down as technology scales (to ensure device reliability). With VDD well below 1 V for some mobile products at 65 nm, there is simply not much room to lower VDD once instrumentation errors and test interface limitations come into play. Moreover, as vendors seek to offer ever lower power products, a reduced-frequency part with lower voltage is already part of the product portfolio. Thus, we have seen vendors start to screen their low-power products with lower voltage at reduced frequencies. VLV testing is becoming mainstream product testing, but it no longer leaves much margin for reliability screening.

Functional Tests

Functional testing, once the sole test method that allows for testing actual functional paths at-speed, has begun to regain its acceptance in the industry. Microprocessor testing is one particular example [Sengupta 1999]. To meet the aggressive DPM goal, functional tests must be added to supplement structural tests (at-speed tests, stuck-at tests, and bridging tests).

Traditionally, functional tests are manually generated. It is a resource-intensive process, and fault coverage can only be assessed through fault simulation (again, a tedious process). Usually, fault coverage with functional tests cannot be improved upon easily, because the circuit could be complex and few commercial tools are available to help with this manual test generation process.

Efforts have been made to utilize validation techniques to tackle this problem. One of the common postsilicon validation techniques is random test generation with random instructions and data [Shen 1998]. With the proper test templates, these procedures can permute instruction sequences, addressing modes, data types, etc., to allow a more diverse set of tests to find potential design errors. These types of tests also help in detecting more manufacturing faults if deployed during manufacturing test. The limitation with doing this in manufacturing has been the immense amount of tester storage required to hold all of these randomly generated test vectors. The authors in [Parvathala 2002] came up with an idea to utilize the large cache that is a common feature of modern-day microprocessors. They shoehorned a version of this random test generator into the cache of a microprocessor. The area overhead required to load the test generator into the cache and run the code without invalidating the cache is considered very small, as it piggybacks on other memory test (DFT) architectures. By running this test generator on-chip repeatedly, a virtually unlimited number of new test vectors can be generated with minimal additional test storage, because the cache memory needs to be loaded only once. Fault detection is through the periodic signature compression of the memory maps (the cached memory), further reducing bandwidth requirements. The additional benefit of this mode of testing is that the test can be run on the processor core at-speed, thus detecting delay faults as well as signal integrity faults. The authors in [Parvathala 2002] provided data showing that this method is effective in catching additional faulty chips and has improved DPM. The only limitation of this test method is that portions of the device dealing with external bus transactions and the cache-miss-related logic cannot be covered.

Another advancement came in the area of actually generating specific instruction sequences targeting manufacturing faults. [Tupuri 1997] and [Tupuri 2002] described such methods to reuse deterministic ATPG for targeted faults within an embedded module while extracting the input/output constraints for delivering those tests to the targeted module. The authors in [Chen 2003] used basic test templates and learned the characteristics (how they can reach control states and how they can propagate key signals, etc.) of these instruction sequences. Once a sufficient number of test templates are learned, this test generator is able to piece together these instruction sequences to activate the nodes (to be tested) and enable the propagation logic to allow them to be observable. Although these are promising early results, these techniques have to be proven on more complicated test cases. More functional test methods can also be found in Chapter 11.

An optimal ordering of these tests plays an important role in minimizing the overall test time while maintaining the fault coverage. The study in [Butler 2000] revealed that functional tests should be placed early in the test flow, before the use of transition tests, stuck-at tests, and IDDQ tests. The authors in [Pouya 2000] and [Ferhani 2006] have also indicated that an optimal test ordering for the shortest test time is to apply path-delay tests, transition tests, and stuck-at tests, in that order. However, finding an optimal test sequence is not a simple process, and the shortest test time is not always the best choice when tests are required for yield, defect, and performance learning. In that case, an ideal test ordering trades off test time against test data collection and picks the best compromise.

The key issue is thus how to generate these manufacturing tests (made up of structural tests, defect-based tests, and functional tests) in a timely manner to best meet time-to-market, DPM, and test budget goals all at the same time. In manufacturing processes at the 65-nm node and below, the test data volume can become so considerable that the cost of testing will soar. Therefore, we expect test compression to become indispensable. Similarly, functional BIST techniques like the cache-resident functional test generator described earlier may become more of a necessity. Active research will be more directed toward reducing scan ATPG time and test power, physical fault modeling, speedup of concurrent fault simulation, coverage enhancement of logic BIST, and logic built-in self-repair (BISR) [SIA 2005, 2006].

Reliability Stress

Some manufacturing defects (such as bridges and opens) are not necessarily screenable, because their defect mechanisms make them too high or too low in resistance (depending on whether the defect is a bridge or an open) to exhibit behavior distinguishable from that of a fault-free circuit. However, they may degrade during normal life and cause the circuit to fail over time. These are categorized as infant mortality failures (see the bathtub curve in Figure 1.2 of Chapter 1). It is highly desirable that these failures be screened during manufacturing test before the devices are shipped out.

The reliability screening methods described previously (monitoring the supply current or lowering the supply voltage) help detect infant mortality failures. Little evidence, however, shows that they can replace the costly reliability stress or burn-in, which itself consumes part of the normal life of the device.

The most common method to screen infant mortality is to burn in the devices, followed by subsequent test screening. Burn-in is essentially a stress process by which devices are aged with proper excitation mechanisms, such as elevated voltage, temperature, and humidity, while the internal nodes are biased alternately (e.g., nodes toggle under some kind of built-in test or externally supplied test). The basic acceleration mechanism is described by the following Arrhenius equation:

ttf = C × exp(EA/kT)

where ttf is the time to failure (hours), C is a constant (hours), EA is the activation energy (eV), k is Boltzmann’s constant (8.617 × 10^-5 eV/K), and T is the absolute temperature (K).
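
Because the constant C cancels when two temperatures are compared, the Arrhenius model is most often used to compute an acceleration factor between use and burn-in conditions, as in the sketch below. The activation energy and temperatures are hypothetical values chosen only for illustration.

    import math

    K_BOLTZMANN = 8.617e-5  # eV/K

    def arrhenius_af(ea_ev, t_use_c, t_stress_c):
        """Acceleration factor implied by the Arrhenius model:
        AF = ttf(T_use) / ttf(T_stress) = exp( (EA/k) * (1/T_use - 1/T_stress) ),
        with temperatures in kelvin; the constant C cancels out."""
        t_use = t_use_c + 273.15
        t_stress = t_stress_c + 273.15
        return math.exp((ea_ev / K_BOLTZMANN) * (1.0 / t_use - 1.0 / t_stress))

    # Hypothetical numbers: 0.7 eV activation energy, 55 C use, 125 C burn-in.
    print(round(arrhenius_af(0.7, 55, 125), 1))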

Essentially, we are using up some lifetime of the product with burn-in so when the product is eventually in the customer’s hands, it has reached the normal life portion of its lifetime, with a low failure rate for the next 7 to 10 years. Some levels of defects may still exist, but they can be covered under the warranty (repair or replacement of products) offered by most manufacturers.

There are limits to how much we can accelerate the defects. For example, if we elevate the voltage during burn-in, it also produces a higher than normal electric field across the drain and source of the MOS transistor. This higher than normal field increases the number of hot electrons, which degrade the gate oxide and cause a larger than normal shift of the threshold voltage (Vt). Stress-induced leakage currents and failures in CMOS devices have also been reported in [Lee 2002] and [Pacha 2004]. This can hasten the wearout process and reduce the normal life of the device, so there is a delicate balance in trading off infant mortality screening against normal life. Increasing the temperature also has its limits, as a higher temperature (and a higher voltage, too) will increase static power consumption dramatically, requiring chips to be cooled during burn-in. Refrigerated cooling during burn-in may sound contradictory, but it is necessary for chips with hundreds of millions of transistors. In the extreme situation, if heat is not removed fast enough, a phenomenon called thermal runaway occurs: increased leakage causes more power consumption, more power consumption causes more thermal buildup, and the cycle continues until something melts.

There are also alternative stress methods, such as the elevated voltage stress, or SHOVE, described in [Chang 1997]. This method is essentially a short-term burn-in, with stress times ranging from a fraction of a second to a few seconds. It has a certain degree of effectiveness for defects that degrade quickly under elevated voltage, such as oxide thinning (which occurs when the oxide thickness of a transistor is less than expected) and via defects [Chang 1997]. In addition, it can avoid the wearout effect, because the stress time is so short, but by the same token it may not be effective in screening all the defects that require both temperature and time to reveal themselves. Nevertheless, it is a common stress method used in some manufacturing test flows. This method may complement burn-in or may lower the failure rate enough to avoid burn-in altogether. Again, there is a limit to how high a voltage one can apply. A voltage that is too high will lead to (gate) oxide breakdown, which may cause irreversible damage if the current is not limited. The maximum stress voltage is typically determined by the process technology of each generation.

Redundancy and Memory Repair

General memory redundancy schemes use spare rows, columns, or blocks. These schemes are good for repairing bad cells, sense amplifiers, and the like that result from manufacturing defects. However, if the defect rate is high, setting aside redundant elements yields diminishing returns, because the redundant elements themselves can also be faulty [Spica 2004]. The yield of the additional redundant elements is subject to the same exponentially diminishing yield, which falls off with area.

Because error-correcting code (ECC) is also generally applied to detect and repair transient errors (e.g., soft errors) in high-reliability memories as well as on-chip caches, one can also consider it a repair mechanism. ECC operates by creating a syndrome (check bits) from the data word when the data are first written. These check bits are stored together with the data themselves. When the data are read, the syndrome is calculated again and compared with the stored syndrome (the details of the mathematics involved are not covered here). To minimize the performance impact of these calculations, the syndrome is sometimes not calculated directly in the read path (as in a flow-through architecture) but in a parallel path in the data pipeline. If the stored syndrome matches the recomputed syndrome, the data flow is not interrupted. If there is a mismatch, the erroneous bit position can be identified from the new syndrome (versus the stored one) and the bit is flipped (if it is not a 1, then it must be a 0, the beauty of a binary system). It takes extra logic processing (perhaps an extra clock cycle) to identify this bit position, so once an error is detected, the data flow has to be halted while the corrected data are generated and sent forth for further consumption.

Theoretically, one can employ ECC to correct hard errors as well. If ECC (such as a Hamming code) is used to correct hard errors, however, it loses its primary function of protecting against transient errors. Because transient errors are expected to occur infrequently, ECC can be the first line of defense (always detect and correct, no matter what the cause is). The system then has to react to the situation (that a hard error has developed) and apply a remedy so that the reliability of the system does not degrade further. The chosen correction scheme has to deal with a convolution of the failure distributions to guarantee that it is sufficient to meet the short-term (manufacturing) yield goals, the infant mortality goals, and the reliability goals; the sum of these distributions must be considered for the overall error correction scheme.
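
A minimal single-error-correcting Hamming(7,4) sketch in Python illustrates the syndrome mechanics described above. A real cache ECC would use a wider SECDED code over 64-bit or 128-bit words, and the bit layout chosen here is only one conventional arrangement.

    def hamming74_encode(d):
        """Encode 4 data bits d[0..3] into a 7-bit Hamming codeword.
        Parity bits sit at positions 1, 2, 4 (1-indexed); data at 3, 5, 6, 7."""
        d1, d2, d3, d4 = d
        p1 = d1 ^ d2 ^ d4
        p2 = d1 ^ d3 ^ d4
        p3 = d2 ^ d3 ^ d4
        return [p1, p2, d1, p3, d2, d3, d4]

    def hamming74_correct(c):
        """Recompute the syndrome on read; a nonzero syndrome is the 1-indexed
        position of a single flipped bit, which is corrected in place."""
        c = list(c)
        s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
        s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
        s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
        syndrome = s1 + 2 * s2 + 4 * s3
        if syndrome:
            c[syndrome - 1] ^= 1          # flip the erroneous bit back
        return c, syndrome

    word = hamming74_encode([1, 0, 1, 1])
    hit = list(word)
    hit[5] ^= 1                            # a particle strike flips one bit
    fixed, syn = hamming74_correct(hit)
    print(syn, fixed == word)              # 6 True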

The Pellston technology is one such scheme [Wuu 2005]. As described previously, this technology essentially makes use of ECC to protect large on-chip caches. Once an ECC event occurs, it issues a machine check architecture (MCA) interrupt and immediately branches to a handler that will pick up the erroneous location. This handler then writes the corrected data into the location and reads them back again. If the error persists, it sets a “not to use” bit for that particular cache line. This bit is similar to the modified/exclusive/shared/invalid (MESI) cache coherency protocol bits, and this cache line will not be used to cache any new data. This scheme again assumes that the amount of field failures (as a result of infant mortality, etc.) will be low. The loss of a few cache lines will have minimal impact on system-level performance. The small loss in performance is more than made up for with the increase in reliability against any soft error as well as hard (infant mortality) errors that might develop with the usage of the system.

There is also research into more elaborate schemes for dealing with the increasing field failures that may be brought about by scaling to ever smaller devices. The authors in [Bhattacharjee 2004] described a scheme whereby the actual erroneous cache lines are mapped to spare cache lines through a mapping table. Memory testing during initial system boot time provides the failing addresses/locations. Conceivably, this can also be coupled with the ECC mechanism, as ECC checking essentially performs a test every time a cache line is read. Yet another scheme, which advocates the recycling of erroneous bits, was described in [Agarwal 2005]. The authors envision that these erroneous cells may not really be hard defects; they may simply fail to operate properly under power (or other) constraints (such as lowered VDD in a mobile application). On-chip self-testing will identify the bad cache lines, and the memory addressing will be reconfigured so that the erroneous locations are not used. Once the operating environment changes again (e.g., VDD returns to normal), the on-chip self-test circuit will provide new failure map information to allow reconfiguration back to a fuller-size cache. Innovations continue in this area to deal with the increasingly complex failure mechanisms brought about as CMOS scaling continues.

Process Sensors and Adaptive Design

Traditionally, process sensors are used only by process and yield engineers to monitor process variation and support yield analysis. These are generally test structures (transistors of various sizes, via chains, contact chains, etc.) placed on the scribe lines (the four sides of the die). Generally, not all test structures are measured; only sample test data (usually called e-test data) are taken from each wafer and stored in a database for later analysis. After dies are assembled into packages, these structures are destroyed by the wafer saw process that singulates the dies, so no additional test data can be measured if one wants to investigate the causes of failure of a particular packaged chip; one can only retrieve the previously stored data from the databases.

With deep submicron scaling, on-die process variation is slowly becoming a more significant component of the overall variation. Data have shown that the on-die process variation can be as large as the overall process variation, so monitoring test structures on the scribe line is of limited value. Embedding additional process sensors on-chip (in the actual product) is essential to understanding on-die process variation. On-die sensors have the additional advantage that they are still available with the separated die and after package assembly. With the appropriate board-level hookup, one can even extract data while the die is running in the end-user application, which makes them very powerful.

Process Variation Sensor

Ring oscillators have long served this purpose. With an odd number of inverter stages, a ring oscillator oscillates naturally, and its frequency can be measured easily with an appropriately clocked counter. Because many factors can affect the frequency of a ring oscillator, it is generally difficult to de-convolute the cause(s) of a frequency variation. The authors in [Samaan 2003], [Stinson 2003], [Nassif 2004], and [Krishnamoorthy 2006] have proposed the use of multiple ring oscillators with varying design parameters. With many of these embedded ring oscillators, de-convolution of process variation, temperature, and even voltage is possible, providing an immense source of information during manufacturing and online operational feedback. Because these sensors are so helpful, hundreds of them can be sprinkled across different areas of the die to provide useful information about on-die process variation, local temperature (hot spots), and local power grid fluctuation, serving multiple purposes with a single set of sensors. The only requirement is that one has to know how to gain access to these sensors, and analysis must be done to de-convolute the underlying changes, which may be difficult to carry out with on-die resources alone.

Another type of process variation sensor utilizes an analog circuit that is sensitive to process parameters; in fact, almost all analog circuits are sensitive to different process parameters [Cherubal 2001]. It is not critical to pick an ideal circuit, as the analog circuit may have multiple specifications and these specifications may in turn be sensitive to different process parameters. These specifications can be measured using any on-die or instrumentation-based test methods. Once enough data are collected on a large sample of sensors (on a die, or even from various locations of the die for across-die process variation) and analyzed with appropriate statistical techniques such as principal component analysis (PCA), the authors claimed that they can de-convolute the various device/process parameters to which the circuit is sensitive. The technique is similar to solving for many unknowns from a large system of equations. In this case, one must be careful about choosing the analog circuit itself: if the circuit is not sensitive to some types of process parameters, the analysis will not reveal any such variation at all. Because the analysis is essentially based on statistical methods, a large sample of data is needed. Unlike ring oscillators, the analog process variation sensor also does not report the process variation at a specific spot on the die (unless enough data are collected from that very spot), and it is unlikely that one can extract and analyze these data in real time.
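
A rough sketch of the statistical step follows: principal component analysis applied to a matrix of measured sensor specifications, with synthetic data standing in for silicon measurements. Mapping the components back to named device/process parameters is not shown and would require the sensitivity analysis discussed above; the function and variable names are assumptions, not part of any published flow.

    import numpy as np

    def principal_components(spec_matrix, k=2):
        """PCA on measured sensor specifications: rows are sensor instances
        (die locations), columns are measured specs.  The leading components
        capture the dominant, correlated process variation; their per-die
        scores are the 'de-convoluted' variation estimates."""
        x = spec_matrix - spec_matrix.mean(axis=0)          # center each spec
        cov = np.cov(x, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)              # ascending order
        order = np.argsort(eigvals)[::-1][:k]
        components = eigvecs[:, order]
        scores = x @ components                             # per-die projections
        return eigvals[order], scores

    # Hypothetical data: 200 sensor sites, 5 specs driven by 2 latent parameters.
    rng = np.random.default_rng(0)
    latent = rng.normal(size=(200, 2))
    mixing = rng.normal(size=(2, 5))
    specs = latent @ mixing + 0.05 * rng.normal(size=(200, 5))
    variances, site_scores = principal_components(specs, k=2)
    print(variances.round(3), site_scores.shape)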

Thermal Sensor

This leads us to the discussion of more mission-critical sensors, such as thermal sensors. As high performance drives higher-frequency operation and higher levels of power consumption, today’s processors and high-end SOC designs can generate a great deal of heat, which needs to be dissipated properly. If, for whatever reason, the heat sink is not properly mounted or the cooling fan (a mechanical device) fails to turn, the device can accumulate enough heat to go into thermal runaway. In an extreme situation, it may cause irreversible damage to the chip. Thus, thermal sensors are extremely important for protecting the chip as well as the rest of the system (power supply, socket, motherboard, etc.). As a result, the industry has started putting thermal sensors on the heat sink or the motherboard itself. Putting thermal sensors on the motherboard or even inside packages, however, may not provide the fast response time needed to control power consumption (thermal conduction is a relatively slow process compared with electronic processes). Often, hot spots are local (such as in the floating point unit [FPU] or integer execution unit [IEU]), and the device can heat up and cool off quickly with code changes. Power control has to happen very fast; otherwise, the accumulated heat may lead to circuit failure or, worse yet, a meltdown as a result of thermal runaway. Designing conservatively so heat will never be a problem can be a solution, but that leaves performance on the table. There is also the inevitable accident in which the heat sink falls off or the fans stop working. All of these require on-chip thermal sensors to act as the last line of defense against a system crash or permanent damage to the chip. Therefore, these are mission-critical sensors.

An example of a thermal sensor is illustrated in Figure 8.14 [Pham 2006]. It is essentially a diode coupled with a current source. The diode current (and voltage) is a function of temperature, so the trip point can be set to detect when a particular temperature is reached. With multiple sensors, one can construct a system in which different temperatures trigger different events: for example, the first trigger slows the clock down; the second trigger takes a more drastic measure, such as stopping all clocks; and the last trigger shuts down the system power supply to protect the chip.

Figure 8.14. Thermal sensor example.

The diode does have its issues (as does any fabricated device), because the diode voltage varies with process variation and not just with its design parameters. In applications where absolute temperature measurement is needed, calibration is necessary. It can be accomplished at manufacturing time, with correction factors burned into on-chip fuses. For protection mechanisms like the one described earlier, calibration may be optional.
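
The graduated response described above can be summarized as a simple policy, sketched below. The trip temperatures and actions are hypothetical, and in practice this logic would be implemented in hardware or firmware rather than software.

    def thermal_policy(temp_c):
        """Graduated on-die thermal response, in the spirit of Figure 8.14:
        each trip point triggers a progressively more drastic action."""
        if temp_c < 85:
            return "normal operation"
        if temp_c < 100:
            return "throttle: reduce clock frequency"
        if temp_c < 115:
            return "stop all clocks"
        return "shut down power supply"

    for t in (70, 92, 108, 121):
        print(t, "->", thermal_policy(t))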

Thermal protection brings us to an important system feature that has made a significant impact on the world of testing. The chips of today are increasingly adaptive because of power savings, maximizing performance within a given power envelope, multiple clock domains with variable frequencies, high-speed IO interfaces, etc. Using the thermal diode as an example, each diode is slightly different from the next because of fabrication variation. Each chip also may consume power differently (varying load capacitance and leakage), so the trip point from one die to the next can be very different. When they trip can also differ, even when two dies run the same patterns. Although this poses few issues for system-level operation (nobody needs to know down to the millisecond when the trigger event occurs), it is totally unacceptable in the digital testing paradigm.

The digital testing paradigm rests on the principle of stored stimuli and stored responses. Automatic test equipment (ATE) essentially stores all test patterns captured during logic or RTL simulation, and the device under test (DUT) is expected to perform exactly as simulated. Any bit coming out in an earlier or later cycle causes an error, and the device is discarded. ATE simply does not have the intelligence to figure out that the clock has changed to a different frequency, especially when the trip point can vary so much in time because of all the variances. Similar nondeterminism occurs with all the conditions mentioned earlier, so our digital chips are increasingly difficult to test. One can implement DFT to turn off this nondeterminism (if one knows about it), but then we are either not testing a particular chip feature or incurring much greater design effort to turn off a natural feature of the chip. The saving grace is that structural tests are not impacted (because logic is tested from latch to latch anyway with structural testing). However, we are losing coverage as we increasingly have to turn an adaptive chip into a nonadaptive chip.

Dynamic Voltage Scaling

Another adaptive feature for power management is dynamic voltage scaling (DVS) [Flautner 2001]. As an efficient power reduction technique, DVS has been implemented in several commercial embedded microprocessors such as the Transmeta Crusoe [Transmeta 2002], Intel XScale [Intel 2003], and ARM IEM [ARM 2007]. DVS exploits the fact that the clock frequency (f) of a processor changes proportionally with the supply voltage (V), whereas the dynamic power (P) is proportional to the product of the frequency and the square of the supply voltage (P ∝ V²f). Thus, to save power, one can simply lower the clock frequency of the processor. If the frequency is lowered, the supply voltage can also be lowered, resulting in a roughly cubic power reduction. When the system goes into a period of low activity (e.g., a user is looking at the screen while thinking), it will signal the circuit to go into a low-power state in which frequency and voltage are scaled down. Once there is a need to process more data (e.g., a key is pressed), the system will return to the higher voltage and frequency, making the system appear to be responsive all the time. An example of the DVS scheme is shown in Figure 8.15.

Figure 8.15. Dynamic voltage scaling scheme.
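
A small worked example of the cubic relationship, under the idealized assumption that the supply voltage can be scaled linearly with frequency:

    def dvs_power_ratio(freq_scale):
        """Relative dynamic power under DVS, assuming the supply voltage can be
        scaled in proportion to frequency (P proportional to V^2 * f), so a
        frequency reduction to alpha gives roughly alpha^3 of the power."""
        voltage_scale = freq_scale          # idealized: V tracks f linearly
        return voltage_scale ** 2 * freq_scale

    # Running at half the clock with half the voltage: ~12.5% of full power.
    print(dvs_power_ratio(0.5))             # 0.125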

Additional adaptability features fit into this category. One example is the use of sleep transistors [Tschanz 2003] (which essentially scale the voltage close to 0 for specific circuits at specific times) and dynamic body biasing (which varies the back bias of the transistor to increase or decrease the threshold voltage) to save power. The back bias affects the threshold voltage Vt of the transistor: forward biasing lowers Vt and speeds up the transistor, at the expense of much more leakage current; reverse biasing increases Vt and lowers the leakage current, but slows down the transistor. The other example is the adaptive test method [Edmondson 2006] for smart binning (not necessarily just speed binning), whereby devices are tested for power level before the optimal voltage/frequency classification of the chip is made. This is done through a voltage identifier (voltage ID), which is supplied by the chip to the voltage regulator module (VRM) on the motherboard during power up. The core frequency is determined by an internal multiplier, which is programmed during test through a fuse (after determining how much power the chip consumes). The voltage ID is also fused at the same time. Both sleep transistors and adaptive testing are unconventional forms of adaptation used to cope with process variations.

Soft Errors

Soft errors are transient single-event upsets (SEUs) caused by various types of radiation. Cosmic radiation has long been regarded as the major source of soft errors, especially in memories [May 1979], and chips used in space applications typically use parity or error-correcting code (ECC) for soft error protection. As circuit features begin to shrink into the nanometer ranges, error-causing activation energies are reduced. As a result, terrestrial radiation, such as alpha particles from the packaging materials of a chip, is also beginning to cause soft errors more frequently. This has created reliability concerns, especially for microprocessors, network processors, high-end routers, and network storage components.

In this section, we first illustrate the sources of soft errors and the soft error rate (SER) trends. Following a discussion of general fault tolerance schemes for soft error protection, we then discuss DIVA [Austin 1999] and Razor [Ernst 2003] [Ernst 2004], two representative error-resilient processor microarchitectures, as well as three soft error mitigation methods through built-in soft-error resilience (BISER) [Mitra 2005] and circuit-level modifications [Almukhaizim 2006] [Zhou 2006]. DIVA and Razor are mainly used for high-performance processor designs. BISER and circuit-level modification methods, however, are applicable to any design for soft error protection.

Sources of Soft Errors and SER Trends

Soft errors are the result of transients induced in the circuit when a radiation particle strikes. This radiation can be of cosmic origin (produced when stars are formed and die) or can come from everyday materials (e.g., lead isotopes) [Ziegler 2004]. When high-energy cosmic rays reach our atmosphere, they collide with air molecules and release neutrons. These neutrons continue their journey and penetrate most types of matter (so shielding is largely out of the question). As these neutrons traverse silicon, they ionize the silicon lattice and leave a trail of holes and electrons behind. These carriers are then moved by the electric fields of the surrounding diffusions and wells, and as the holes and electrons are collected or recombine, they charge or discharge the affected node. From a circuit standpoint, we simply have a glitch.

Radioactive isotopes emit alpha particles as the radioactive decay process occurs. These alpha particles are larger and heavier, so they do not penetrate deeply; however, because they may exist (in stray amounts) in packaging material (e.g., ceramic or solder), they are located close to the die and can lead to a relatively high error rate (in fact, early soft errors were first traced to radioactive elements in packaging material [May 1979]). If such a glitch is induced in a memory element, its state can be reversed. As an example, let us examine an SRAM cell, which has a pair of back-to-back inverters, as shown in Figure 8.16.

Figure 8.16. Induced soft error on a SRAM cell.

When the select transistors are off, this cell holds the state in a stable configuration. If a glitch is introduced at the drain of the PMOS or the source of the NMOS, the other inverter can pick up the glitch and the state of the cell is reversed. A similar problem can occur for all storage elements, such as D latches and D flip-flops (see Figure 8.17). If a glitch strikes the combinational logic elements, the resulting glitch is evaluated and passed on by the succeeding logic.

Figure 8.17. Induced soft error on a D latch.

Soft errors can happen to all memory and storage elements. Sometimes, they can be benign (e.g., the memory elements are not used in the application); other times, they can cause a system crash or, even worse, a silent data corruption (SDC) if they are undetected. That is why we have to devise online detection (or fault tolerance and error mitigation) mechanisms to protect against such transients and cope with soft errors. These kinds of detection and fault tolerance mechanisms are further discussed in the following section. Unlike defects or other fault types, a soft error is transient and is induced by a glitch at one time at one location; it is not repeatable; thus the term “soft error.” This property is well utilized in the solution space.

Logic circuits are less susceptible to these glitches than memories for the following reasons [Mohanram 2003]. First, the glitch must be of sufficient strength to propagate from the location of the strike, through each stage of the logic circuit, until it reaches an output; otherwise, the glitch is attenuated and the transient error is electrically masked. Second, the glitch needs to have a functionally sensitized path to a latch; otherwise, the glitch is logically masked. Finally, the timing of the glitch must be such that the glitch arrives at a latch during its latching window; otherwise, the glitch is latching-window masked (an exception is a domino-type circuit, where its logic states are held at every gate by a feedback element). These three masking factors are illustrated in Figure 8.18; historically they have made soft errors a non-issue for combinational logic circuits. Technology trends, however, are rapidly reducing the effect of these masking factors. For example, reduced logic depth implies a higher probability of having a sensitized path to the output and less attenuation of the glitch; lower supply voltage leads to less noise margin and, hence, smaller glitches may produce an error; higher clock frequency leads to a higher probability that a glitch will be latched, etc. These masking factors are judiciously utilized by circuit-level soft error mitigation techniques that reduce the susceptibility of logic circuits to soft errors.

Figure 8.18. Masking factors of soft errors in combinational logic: (a) electrical masking, (b) logical masking, and (c) latching window masking.

Because the physics of soft errors involves node charging and discharging, the amount of stored charge at a given node determines how sensitive it is to a particle strike. The charge (Q) is represented by the following relationship with voltage (V) and capacitance (C):

Q = CV.

As processing technology scales, the capacitance of a given node decreases. This is good for performance but bad for soft errors. Because of hot-electron degradation, reliability requirements also force VDD to be lowered. This compounded effect decreases the stored charge and increases the soft error rate (SER). With scaling, we also get more transistors (roughly 2×) per chip, resulting in an increase in the per-chip soft error rate [Baumann 2005]. The saving grace is that, because the transistor junction area is also scaled, the ability of a node to collect stray charge is also reduced; however, this is not sufficient to slow the increase in soft error vulnerability. The Moore’s law prediction of doubling transistors with every process generation [Moore 1965] effectively doubles the soft error rate, so not only should SRAM cells (such as caches and registers) be protected, but protection of storage elements and protection against glitches creeping through the combinational logic are also important. These areas are now all hot research topics.
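
A tiny worked example of Q = CV with hypothetical numbers shows how scaled capacitance and supply voltage shrink the charge that must be disturbed to flip a node:

    def stored_charge_fc(node_capacitance_ff, vdd_v):
        """Q = C * V: charge (in femtocoulombs) holding a node's state.
        Smaller C and lower VDD at each new node shrink this charge, so a
        smaller particle-induced charge can upset the node."""
        return node_capacitance_ff * vdd_v

    # Hypothetical scaling step: capacitance and VDD both reduced.
    print(stored_charge_fc(2.0, 1.2))   # 2.4 fC
    print(stored_charge_fc(1.4, 1.0))   # 1.4 fC -> more soft-error susceptible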

The implication of soft errors for chip testing varies. On the surface, soft errors really cannot be tested: even good circuits are susceptible to soft errors, so there is nothing to screen for. Soft errors are also not easily exercised with electrical test stimuli, and naturally occurring radiation strikes are unlikely during the short test time of component testing. What really requires attention is an online detection scheme or a fault tolerance scheme.

Often the three types of redundancy—hardware (spatial), time (temporal), and information—involve extra circuit elements (refer to Chapter 3 for more detailed description). At a minimum, there is a self-checking equality checker, which indicates whether there is any error. With information redundancy, there are extra check bits (or code bits). With hardware redundancy, there is even duplicate circuitry. Each redundancy circuit has to be tested to make sure that the redundancy scheme can detect soft errors and, when found, can signal that there is an error or simply correct the error.

As it is difficult to test all redundancy circuitry, and it is probably not testable without appropriate DFT means, special attention must be given to such circuitry so that it is accounted for in the overall test strategy. If a redundancy scheme is capable of correcting errors by itself, then even manufacturing faults can hide behind the redundancy scheme, as the output results are always correct. Such undiscovered defects consume the correction capability, and any subsequent soft error hit to that functional circuitry or redundancy circuitry may cause an unrecoverable error.

Coping with Soft Errors

As chips are susceptible to soft errors, the growing circuit sizes of these chips have drastically increased their soft error rate (SER). To cope with these soft errors, many soft error protection schemes targeting chip designs have been proposed.

Fault Tolerance

One approach to improve the reliability of a chip is to remove the source of soft errors. Because the early discovery of soft error was the result of contaminated packaging material, the solution was simple: eliminate the radiation contaminant from packaging material [May 1979]. However, trace amounts of radioactive isotopes do exist in common processing and packaging materials—such as the boron in borophosphosilicate glass (BPSG) and ceramics and the lead in solder—and their removal is costly. Their radioactive decay still leads to some level of soft error rate. Because alpha-particle radiation (see the previous section) can be stopped on the surface of the die, a die coating (epoxy resins that are deposited on the surface of the die) was introduced and used for some period of time. Die coating is only effective if the radiation comes from the outside and is of limited value if the alpha emitter is among the materials that transistors or interconnects are made of. As we move from wire bonding packaging to flip-chip solder (tin/lead) joint (or C4) packaging, solders are never far away from the surface of the die and die coatings would have no effect at all. Today, the primary alpha-particle source is solders (lead radiation isotopes). The careful selection of raw material has resulted in low-alpha solder. As a result of environmental and health concerns over lead, tin/lead solder is being phased out in packaging material, which will reduce alpha-induced soft errors. However, the other source of radiation, high-energy neutrons, cannot be stopped by anything associated with packaging.

All of these preventive measures combined with advances in manufacturing process technology have improved the system-level reliability substantially. In the past, soft errors were not critical for most computer systems for terrestrial applications. Thus, traditionally, only high-reliability applications, especially those deployed in the financial transaction, transportation, and aerospace/defense industries, have required fault tolerance to prevent the systems from crashes and silent data corruption errors.

As these measures are not effective enough to prevent soft errors from happening, traditional fault tolerance schemes commonly used for high-reliability applications have started to emerge. There are three fundamental fault tolerance schemes that can be used to protect such systems or devices from hard errors or soft errors: (1) hardware (spatial) redundancy, (2) time (temporal) redundancy, and (3) information redundancy [Pradhan 1996] [Siewiorek 1998] [Lala 2001]:

  1. Hardware (spatial) redundancy relies on the assumption that defects and radiation particles will hit only one specific device and not another (at least not simultaneously). Thus, by duplicating the functional circuitry and comparing the two sets of outputs with a self-checking equality checker (checking circuitry), a mismatch points to an error (hard or soft) (see Figure 3.19 in Chapter 3). This can happen at the circuit level (e.g., one adder is compared with another adder while both are fed the same data) or at the system level (e.g., a processor’s front side bus is compared with another one on the same bus while executing the same code). Because the computation occurs in parallel, there is little or no penalty on the overall system performance, but there has to be hardware duplication and a self-checking equality checker, resulting in higher hardware costs and a much higher level of power consumption.

  2. Time (temporal) redundancy relies on the assumption that even if the functional circuitry receives a radiation strike, it is unlikely that a strike will happen on the same circuitry again at a slightly later time, so the scheme does not require duplicate circuitry. In this case, the same computation is repeated on the same functional circuitry a second time, and the results of the first computation are not committed without comparison to the second computation. This obviously has the benefit of not requiring additional hardware, but the software must be coded to execute the program twice, which means saving the results of the first computation in memory or on disk. One serious problem with this scheme is that it cannot detect any hard error; the same erroneous result will be produced when the computation is repeated. Therefore, time redundancy may give a false sense of security with regard to hard errors that result from physical failures.

  3. Information redundancy uses error-detecting code (EDC) or error-correcting code (ECC) to represent information contents [Peterson 1972]. Some of these coding properties are maintained even after computation, so by checking these codes before and after, one can determine whether a hard or soft error has occurred. Parity is one such code. Parity represents whether the number of ones in a computer word is odd or even. Normally, the parity is computed when the information is generated and stored in the memory system. Upon reading the word, the parity is recalculated and compared against the stored parity bit; a mismatch indicates that an error has occurred. One major benefit of using parity code is that a single parity bit can detect any odd number of bit errors (caused by soft errors and hard errors) in each computer word; however, there is always the danger that a single radiation strike could affect more than a single bit, and when an even number of bits are flipped, the errors escape detection (because parity only counts odd or even). In this case, more sophisticated codes (such as Hamming codes) can be used [Peterson 1972]. Such a code allows detection of 2-bit errors as well as correction of single-bit errors and is, in general, referred to as an error-correcting code. It requires the storage of more check bits (codes) and a computation unit that does the check code generation. It is important to note that properties like parity or ECC are often embedded in arithmetic operations (such as add/subtract/multiply) during normal operation; thus, they are also used for arithmetic computation protection. Because additional information is stored, this scheme can protect against both hard and soft errors.

Having detection capability is only half the story. After an error is detected, some recovery action has to be taken. The most common action is for the operating system to stop the application, generate the necessary error message/log, and close the application. This action has varying degrees of system integrity implications, as partially computed results may already have been written to disk. Of course, this is better than simply letting the system crash, but we need better schemes that can provide more integrity. Checkpointing and rollback constitute one such scheme. Checkpointing is essentially taking a snapshot of the system state, which, when restored (rollback), causes the system to restart from that point without rebooting or terminating the application. However, one has to ensure that no error had occurred (and the system was intact) before the checkpoint. One also has to consider where (and how regularly) checkpoints are taken, to optimize performance (because of the overhead of the checkpointing process) as well as to make sure that system states are not contaminated before the checkpoint.

Thus, having explained the basic principles, what are some of the common fault tolerance schemes used in high-reliability systems? As mentioned before, duplicate-and-compare is one method commonly used in mainframes and high-end servers [Spainhower 1999] [Bartlett 2004]. For systems that cannot fail, a more secure scheme is triple modular redundancy (TMR) [Sklaroff 1976] [Siewiorek 1998] [Lala 2001]. Consider the TMR example shown in Figure 8.19. Here we have three computing units (modules), and their results are constantly compared. Because we have three results, the two matching results outvote the mismatching one, and the result deemed correct is sent on. The aerospace industry particularly favors this approach (for obvious reasons). More recently, because a central processing unit (CPU) is capable of running multiple threads (different streams of instructions that can be run in parallel on a CPU), one can also send a redundant thread through another path (virtual compute units) and check the results before retiring the thread/instructions. This is called redundant multithreading (RMT) [Mukherjee 2002].

Figure 8.19. A TMR example.
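
A bitwise majority voter captures the essence of the TMR scheme in Figure 8.19. The sketch below is a software illustration of what would be a hardware voter, with hypothetical module outputs.

    def tmr_vote(a, b, c):
        """Bitwise 2-out-of-3 majority vote over three module outputs;
        a single erroneous module is outvoted, and the mismatch is flagged."""
        voted = (a & b) | (a & c) | (b & c)
        mismatch = (a ^ voted) | (b ^ voted) | (c ^ voted)
        return voted, mismatch != 0

    # One module (b) delivers a corrupted word; the vote still returns 0b1011.
    result, error_seen = tmr_vote(0b1011, 0b0011, 0b1011)
    print(bin(result), error_seen)   # 0b1011 True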

The incorporation of fault tolerance in a compute system did not begin with the processor. It began with the circuit that had the highest transistor density and the part of the system that had the most transistors: the main memory. The protection scheme was the use of ECC, which was quickly followed by the redundant array of inexpensive disks (RAID). Even though the disk system does not necessarily have the highest number of transistors, disk drives are susceptible to mechanical failures; hence, it is essential to protect the data that they hold. Because of the need to route signals over long wires or traces, buses and backplanes that interconnect various subsystems are protected with parity. Networking communication protocols also contain error codes, such as the cyclic redundancy check (CRC) and checksum (the sum of all the binary numbers in a particular packet of data). In the early 1990s, the importance of fault tolerance to CPUs became apparent, as on-chip cache memories had become large enough to warrant their own ECC protection. On some high-end server CPUs, register files are protected with parity, and duplicated execution blocks help to identify errors [Spainhower 1999]. So the last holdouts seem to be flip-flops and combinational logic. Researchers have come up with hardened latches/flip-flops [Calin 1997], whose circuit design decreases the internal nodes’ vulnerability to a radiation strike. More recently, enhancing the storage elements or scan cells to fill the role of duplication (as the states are already duplicated) has been suggested [Franco 1994] [Austin 1999] [Ernst 2003] [Mitra 2005] [Tamhankar 2005]. These on-chip fault tolerance schemes are referred to as error resilience.

As we move toward the nanotechnology era, more and more system-level functions (in the form of IP cores) will be integrated on a single piece of silicon (or into a single package). This trend exposes the nanometer SOC design to ever more manufacturing faults and soft errors; therefore, as the distinction between computer systems and systems-on-chips (SOCs) narrows in nanometer designs, it is becoming more and more important to embed online error detection or correction schemes in these chips.

Error-Resilient Microarchitectures

Because hardware redundancy requires one or more redundant modules for error detection or correction, the hardware overhead of this fault tolerance scheme is a concern for on-chip use. Information redundancy, on the other hand, can significantly reduce hardware overhead for array logic such as programmable logic arrays (PLAs) and memories, but it is not applicable to random logic. This leaves time redundancy as a viable solution for providing online error detection or correction for random logic on-chip.

These on-chip fault tolerance schemes for error detection or correction, referred to as error resilience, were first proposed for processor designs. While the major purpose of these error-resilient processor microarchitectures is to achieve maximum processor performance and power savings using dynamic voltage scaling (DVS), the online error detection and correction circuits used therein are applicable for soft error protection.

In this section, we discuss only two representative error-resilient processor microarchitectures: DIVA [Austin 1999] and Razor [Ernst 2003] [Ernst 2004]. Both DIVA and Razor rely primarily on time redundancy (recomputation or delayed re-sampling) rather than full hardware duplication. DIVA uses a simpler DIVA checker to recompute and verify results before they commit, while Razor uses a Razor flip-flop, similar to the stability checker proposed in [Franco 1994] (see Figure 3.17 of Chapter 3), to check whether each main flip-flop functions correctly during normal operation. For more information on other DVS schemes that are also applicable to soft error protection, refer to [Ernst 2003] and [Ernst 2004]. Chapter 3 of this book also provides a good reference for the basics of these fault tolerance schemes. Other implementations of error-resilient microarchitectures can be found in [Patel 1982], [Metra 1998], [Oh 2002], [Favalli 2002], and [Tamhankar 2005].

DIVA

One error-resilient processor microarchitecture is the dynamic implementation verification architecture (DIVA) [Austin 1999]. As illustrated in Figure 8.20, DIVA uses a smaller and simpler shadow processor (the DIVA checker) that computes concurrently with the main processor (the DIVA core). Instead of using two large, complex cores as in a double modular redundancy (DMR) scheme, the DIVA checker largely depends on the DIVA core to do the computation while it verifies the correctness of the core's computation with simpler hardware (to save silicon cost).

Figure 8.20. DIVA architecture.

The DIVA core constitutes the entire microprocessor design except the retirement stage. The main processor core fetches, decodes, and executes instructions, holding their speculative results in the reorder buffer (ROB). The DIVA checker contains a functional checker stage (CHK) that verifies the correctness of all core computations, only permitting correct results to pass through to the commit stage (CT) where they are written to architected storage. A watchdog timer (WT) is added to detect faults that can lock up the core processor or put it into a deadlock or livelock state where no instructions attempt to retire. If an error is detected during any core computation, then the DIVA checker will fix the errant computation, flush the processor pipeline, and restart the processor at the next instruction.
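The commit-time checking just described can be summarized in a few lines of control flow. The Python sketch below is a behavioral illustration only; the reorder-buffer entry fields and the `checker_alu`, `commit`, and `flush_and_restart` callables are hypothetical stand-ins for the corresponding DIVA hardware.

```python
from dataclasses import dataclass

@dataclass
class RobEntry:                 # hypothetical reorder-buffer entry
    opcode: str
    operands: tuple
    speculative_result: int
    next_pc: int

def diva_commit(entry, checker_alu, commit, flush_and_restart):
    """Sketch of the DIVA checker at the commit point: the simple checker
    re-derives the instruction's result; matching results commit unchanged,
    while a mismatch commits the checker's value, flushes the pipeline, and
    restarts execution at the next instruction."""
    checked = checker_alu(entry.opcode, entry.operands)
    if checked != entry.speculative_result:
        entry.speculative_result = checked        # fix the errant computation
        flush_and_restart(entry.next_pc)
    commit(entry)                                 # CT stage: architected write

# Toy usage: an ADD whose speculative result was corrupted in the core.
checker = lambda op, args: args[0] + args[1] if op == "add" else None
diva_commit(RobEntry("add", (2, 3), speculative_result=6, next_pc=0x104),
            checker_alu=checker,
            commit=lambda e: print("commit", e.speculative_result),
            flush_and_restart=lambda pc: print("flush, restart at", hex(pc)))
```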

This dynamic verification scheme, a microarchitecture-based technique that can significantly reduce the burden of verifying correctness in microprocessor designs, is only applicable to the processor world where such microarchitectures make sense. Protection is also limited to the portions of the design that are not shared between core and checker; because the bus unit, instruction decoding, and the back end of the pipeline are shared, they are not protected. Like any DMR-style scheme, this technique is good for detecting an error (be it a permanent fault or a soft error); recovery is a separate issue. As with DMR, it can be coupled with checkpointing and rollback as a reasonable error recovery scheme to achieve fault tolerance.

Razor

With the increasing clock frequencies and silicon integration that come with scaling, power-aware computing has become a critical concern in the design of high-performance computing systems, and the energy consumption and dissipation issues have also spread to embedded processors and SOCs. One approach to power conservation is to run the circuit at the lowest VDD at which it still executes correctly. Instead of having a fixed VDD, the voltage is adjusted dynamically, hence dynamic voltage scaling (DVS). DVS is one of the more effective and widely used methods for power-aware computing [Pering 1998]. To obtain maximum power savings from DVS, it is essential to scale the supply voltage as low as possible while still ensuring correct operation of the design. This critical supply voltage is difficult to set correctly under the wide diversity of process and environmental variations; set too aggressively, the design might not operate correctly.

The authors in [Austin 1999] attempted to address power conservation by using a self-tuned clock and voltage scheme in which dynamic verification is used to reclaim frequency and voltage margins. The Razor scheme proposed in [Ernst 2003] and [Ernst 2004], on the other hand, is based on in situ timing error detection and correction, permitting increased energy reduction because voltage margins are eliminated entirely. The key idea of Razor is to tune the supply voltage by monitoring errors during circuit operation, thereby eliminating the need for voltage margins and exploiting the data dependence of circuit delay. As in DIVA, this is accomplished with a shadow unit, but here the shadow unit has been pushed all the way down into a Razor flip-flop. This Razor flip-flop, shown in Figure 8.21a, double-samples pipeline stage values, once with a fast clock, clk, and once with a time-borrowing delayed clock, clk_del. It includes a main flip-flop controlled by the fast clock and a shadow latch controlled by the delayed clock. The value of the shadow latch is assumed to be correct under any operating voltage. A metastability-tolerant comparator then validates the value stored in the main flip-flop.

Figure 8.21. Razor flip-flop: (a) schematic of the Razor flip-flop, (b) reduced overhead Razor flip-flop with metastability detection circuit, and (c) waveform.

Actually, using a delay checker for on-chip error detection was first proposed in [Franco 1994], long before power became a significant issue in chip design. The stability checker proposed in that paper was designed to deal with delay degradation (a reliability concern). The delayed data are sampled and checked against the primary data. A transistor-level design that integrates the stability checker into a flip-flop is also given, which significantly reduces the overhead. The paper also discussed the possibility of moving the checking ahead of the primary latch to detect oncoming timing errors, and of dynamically lowering the operating frequency to cope with a degrading circuit in the field, where repair is not possible. However, that earlier paper does not address (1) power saving, which was not yet an issue at the time, or (2) methods for dealing with an error once it occurs; these are crucial elements in all time redundancy (delay detection) schemes. Razor documents the whole error-handling path in detail.

A reduced-overhead Razor flip-flop with the metastability detection circuit is illustrated in Figure 8.21b. The operating voltage is tuned to a level at which the worst-case delay is guaranteed to meet the shadow latch setup time, even though the main flip-flop could fail. By comparing the values latched by the main flip-flop and the shadow latch during each clock cycle, a delay error in the main flip-flop can be detected, and the value stored in the shadow latch is then used to correct the error.

The operation of the Razor flip-flop is illustrated in Figure 8.21c. In clock cycle 1, the combinational logic L1 meets the setup time at the rising edge of both clocks, so both the main flip-flop and the shadow latch latch the correct data. In this case, the error signal at the output of the XOR gate remains at logic 0, and the operation of the pipeline is unaltered. In clock cycle 2, we show an example of operation in which the delay of the combinational logic L1 violates the setup time of the main flip-flop but not the worst-case setup time of the shadow latch during subcritical voltage scaling. In this case, the data are not correctly latched by the main flip-flop but are successfully latched by the shadow latch, because the shadow latch is controlled by the delayed clock. To guarantee that the shadow latch always latches the input data correctly, the allowable operating voltage must be constrained at design time so that, under worst-case conditions, the logic delay in L1 never exceeds the setup time of the shadow latch. By comparing the valid data in the shadow latch with the incorrect data latched by the main flip-flop, an error signal, Error_L, is generated in clock cycle 3. In clock cycle 4, the valid data in the shadow latch are restored into the main flip-flop and become available to the next pipeline stage L2. The local error signals are OR'ed together to ensure that the contents of all main flip-flops are restored even when only one of the Razor flip-flops generates an error.
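A behavioral model of one Razor flip-flop cycle captures the essence of Figure 8.21c. In the Python sketch below (an illustration, not the transistor-level design), the value seen at the delayed clock edge is assumed correct, mirroring the design constraint described above; a mismatch raises the error signal and the shadow value is restored.

```python
def razor_cycle(data_at_clk_edge, data_at_delayed_edge):
    """One Razor flip-flop cycle, behaviorally: the main flip-flop samples at
    the fast clock edge, the shadow latch at the delayed edge (assumed correct
    by construction). A mismatch flags an error and restores the shadow value."""
    main_ff = data_at_clk_edge
    shadow = data_at_delayed_edge
    error = (main_ff != shadow)        # metastability-tolerant comparator (XOR)
    if error:
        main_ff = shadow               # restore the correct value
    return main_ff, error

# Cycle where L1 misses the main flip-flop's setup time but meets the shadow
# latch: the stale value 0 is caught and corrected to 1, and Error_L is raised.
q, err = razor_cycle(0, 1)
assert (q, err) == (1, True)
```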

Having said that, guaranteeing that the shadow latch always latches the correct data remains a delicate balance. If both the main flip-flop and the shadow latch latch incorrect data because of unpredictable timing mismatches, the corruption will remain undetected and become silent data corruption (SDC). The scheme also involves a complicated pipeline reloading mechanism (see Figure 8.21), and it is unclear what the implications would be for nonpipelined (finite-state machine) applications. The voltage adjustment mechanism also has to be designed so that the voltage can be changed reasonably quickly (within a couple of clock cycles); otherwise the whole system will keep recirculating until the main flip-flop matches the shadow latch. This can be a problem for large circuits, where a power-adjusting circuit may not be able to react fast enough to the heavy load.

Soft Error Mitigation

The objective of soft error mitigation techniques is to provide partial immunity to potential soft errors while keeping the cost significantly below that of full fault tolerance schemes. These methods are therefore highly suitable for mainstream applications, where the SER is reduced in a cost-effective manner without providing complete tolerance. We review three soft error mitigation methods: built-in soft-error resilience (BISER) [Mitra 2005] and two circuit-level modification techniques [Zhou 2006] [Almukhaizim 2006].

Built-In Soft-Error Resilience

The built-in soft-error resilience (BISER) approach proposed in [Mitra 2005] allows a scan design to protect a device from soft errors during normal system operation. BISER is based on the observation that soft errors either (1) occur in memories and storage elements, manifesting themselves by flipping the stored states, or (2) result in a transient fault in a combinational gate, as caused by an ion striking a transistor within the gate, that is then captured by a memory or storage element [Nicolaidis 1999]. Data from [Mitra 2005] show that combinational gates and storage elements contribute a total of 60% of the soft error rate (SER) of a design manufactured in current state-of-the-art technology, versus 40% for memories. Hence, it is no longer enough to consider soft error protection only for memories without also protecting storage elements.

Figure 8.22 shows the BISER scan cell design [Mitra 2005] that reduces the impact of soft errors affecting storage elements by more than 20 times. This scan cell consists of a system flip-flop and a scan portion, each comprising a one-port D latch and a two-port D latch, together with a C-element and a bus keeper. The scan cell supports two operation modes: system mode and test mode.

Figure 8.22. Built-in soft-error resilience (BISER) scan cell.

In test mode, TEST is set to 1, and the C-element acts as an inverter. During the shift operation, a test vector is shifted into latches LA and LB by alternately applying clocks SCA and SCB while keeping CAPTURE and CLK at 0. Then, the UPDATE clock is applied to move the content of LB to PH1. As a result, a test vector is written into the system flip-flop. During the capture operation, CAPTURE is first set to 1, and then the functional clock CLK is applied which captures the circuit response to the test vector into the system flip-flop and the scan portion simultaneously. The circuit response is then shifted out by alternately applying clocks SCA and SCB again.

In system mode, TEST is set to 0, and the C-element acts as a hold-state comparator. The function of the C-element is shown in Table 8.3. When inputs O1 and O2 are unequal, the output of the C-element keeps its previous value. During this mode, a 0 is applied to the SCA, SCB, and UPDATE signals, and a 1 is applied to the CAPTURE signal. This converts the scan portion into a master-slave flip-flop that operates as a shadow of the system flip-flop. That is, whenever the functional clock CLK is applied, the same logic value is captured into both the system flip-flop and the scan portion. When CLK is 0, the outputs of latches PH1 and LB hold their previous logic values. If a soft error occurs either at PH1 or at LB, O1 and O2 will have different logic values. When CLK is 1, the outputs of latches PH2 and LA hold their previous logic values and drive O1 and O2, respectively. If a soft error occurs either at PH2 or at LA, O1 and O2 will have different logic values. In both cases, unless such a soft error occurs after the correct logic value passes through the C-element and reaches the keeper, the soft error will not propagate to the output Q, and the keeper will retain the correct logic value at Q.

Table 8.3. C-Element Truth Table

O1    O2    Q
0     0     1
1     1     0
0     1     Previous value retained
1     0     Previous value retained
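The hold behavior in Table 8.3 is easy to model. The short Python sketch below captures the C-element (with its keeper) as used in system mode, showing why a single upset at O1 or O2 cannot reach the output Q; it is a behavioral illustration, not the actual circuit.

```python
def c_element(o1, o2, q_prev):
    """C-element with keeper (Table 8.3): when O1 and O2 agree, the output is
    their inverted value; when they disagree (e.g., a soft error flipped one
    latch), the keeper retains the previous output value."""
    return 1 - o1 if o1 == o2 else q_prev

# A single upset that makes O2 differ from O1 does not reach Q.
assert c_element(0, 0, q_prev=0) == 1   # agreement: inverter behavior
assert c_element(0, 1, q_prev=1) == 1   # disagreement: previous value retained
```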

The beauty of this scheme is that the BISER scan design has self-correction capability. Each BISER scan cell can still function as a normal scan cell in test mode. Once the chip is in its final application, it can be configured in self-checking (system) mode, and no error will propagate beyond the C-element. No new routing or control signals need to be added beyond the existing scan control signals.

The only shortcoming is that this incurs some additional power and area. The scan portion of a conventional scan cell is typically much weaker than the system flip-flop, because scan routing does not have to go far and each scan chain can run much slower (to ease design complexity) than the functional logic. For the scan portion to latch the same data at speed, it has to be sized up appropriately so that it runs at more or less the same speed as the system flip-flop. This increases loading and power to some degree (up to 2× per protected cell). However, because not all system flip-flops need to be protected to achieve a 20× reduction in SER, the area and power penalty for the whole chip is in the 3% to 5% range, much smaller than that of a conventional SER detection mechanism such as DMR (which costs at least 100% more) [Mitra 2005].

Another important attribute of this scheme is that it is applicable to any latch-based or flip-flop-based logic design. Other architectural-level solutions, such as checkpointing and rollback, are well suited to processor designs but are largely irrelevant to nonprocessor types of logic design.

Circuit-Level Approaches

Circuit-level approaches attempt to increase the ability of logic circuits to mask glitches by increasing the soft error masking factors in the circuit. While the reduction in the ability of a circuit to mask soft errors via latching-window masking is a consequence of the rapid increase in the operating frequency of logic circuits, the electrical and logical masking factors can be improved in a cost-effective manner for a targeted design. We describe two techniques that successfully mitigate soft errors in logic circuits by improving the electrical masking and logical masking factors of a design using gate resizing [Zhou 2006] and netlist transformations [Almukhaizim 2006], respectively.

Gate resizing for soft error mitigation [Zhou 2006] is based on physical-level design modifications, wherein individual transistor characteristics are perturbed to reduce the sensitivity of gates to glitches. Specifically, a select set of gates is resized, by altering the width/length (W/L) ratios of their transistors, to increase their immunity to glitches and, by extension, reduce the SER of the logic circuit. Figure 8.23 illustrates the effect of gate resizing on the amplitude and width of a 0-to-1 transient at the output of a gate. As illustrated in the figure, the magnitude and duration of the transient diminish rapidly as the size of the transistor(s) collecting the deposited charge increases. Because there may be multiple transistor diffusions (capacitors) on any given node, the collected charge tends to redistribute among the nodes once it is collected at any given junction; it also follows an RC discharge curve, as the wires connecting these diffusion areas are resistive. Thus, transistors in highly susceptible gates can be resized to disperse the injected charge as quickly as it is collected, so that the transient does not achieve sufficient magnitude and duration to propagate to the fanout of the gate. SER estimation and assessment of the susceptibility of logic gates to soft errors is performed either through fault injection and simulation [Zhou 2006], wherein the soft error masking factors are evaluated separately, or through symbolic representation [Miskov-Zivanov 2006], wherein all the masking factors are evaluated in a unified approach. In both cases, gate resizing is performed for the most susceptible logic gates (i.e., the ones that contribute the most to the SER of the design). Results in [Zhou 2006] indicate that a 10-fold reduction in the SER of logic circuits implemented in present-day technologies is achievable at an overhead of roughly 30% in area and power consumption.
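As a simplified illustration of the fault-injection style of SER assessment mentioned above, the Python sketch below estimates only the logical-masking factor of a tiny, made-up netlist by flipping the output of a randomly chosen gate under random input vectors. Electrical and latching-window masking require timing and electrical models and are deliberately omitted, so this is not the estimator of [Zhou 2006] or [Miskov-Zivanov 2006].

```python
import random

# Toy combinational netlist, evaluated in topological order.
# Each gate maps its name to (function over known signals, input names).
GATES = {
    "g1":  (lambda v: v["a"] & v["b"], ("a", "b")),
    "g2":  (lambda v: v["b"] | v["c"], ("b", "c")),
    "out": (lambda v: v["g1"] ^ v["g2"], ("g1", "g2")),
}

def evaluate(inputs, flipped_gate=None):
    """Evaluate the netlist; optionally invert one gate output to model a
    single-event transient originating at that gate."""
    values = dict(inputs)
    for name, (fn, _) in GATES.items():
        values[name] = fn(values)
        if name == flipped_gate:
            values[name] ^= 1
    return values["out"]

def logical_masking_factor(trials=100_000):
    """Fraction of injected transients that are logically masked, i.e., that
    do not change the primary output for the applied input vector."""
    masked = 0
    for _ in range(trials):
        inputs = {x: random.randint(0, 1) for x in ("a", "b", "c")}
        victim = random.choice(list(GATES))
        if evaluate(inputs, flipped_gate=victim) == evaluate(inputs):
            masked += 1
    return masked / trials

print("estimated logical masking factor:", logical_masking_factor())
```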

Figure 8.23. Effect of gate resizing on the amplitude/width of SETs [Zhou 2006].

More recent data reported in [Seifert 2006] indicate that SER may flatten or even decline with scaling, attributing this to the reduced charge collection efficiency of the smaller diffusion areas. This is still somewhat controversial, as the same effect has not been reported by the majority of researchers. However, if this theory proves true, the gate resizing method would have to be considered carefully, as it may increase the collection area and hence increase SER.

Netlist transformation for soft error mitigation [Almukhaizim 2006] is based on logic-level design modifications, wherein the logic circuit is modified, while preserving the functionality of the netlist, to reduce the probability of sensitizing glitches to an output of the circuit. The transformation has to take into account the vulnerability of all gates in the circuit, minimizing the overall vulnerability rather than just shifting the SER from one path to another. Design modification is performed using an ATPG-based rewiring method that generates functionally equivalent yet structurally different gate-level implementations. Together with an SER estimation method, the design is iteratively modified through the selection of rewiring operations that minimize the overall SER of the circuit.

Consider, for example, the logic circuit in Figure 8.24. If rewiring is performed on the dashed wire in the circuit in Figure 8.24a, the dashed wire is replaced with input c in the circuit in Figure 8.24b. Performing a second rewiring on the dashed wire in the circuit in Figure 8.24b generates the circuit in Figure 8.24c. The soft error failure rate computed using SERA [Zhang 2006] indicates that the circuit in Figure 8.24c is improved by 5.00% over the circuit in Figure 8.24b and by 9.50% over the original circuit in Figure 8.24a. Because soft errors are mitigated at the logic level, the method proposed in [Almukhaizim 2006] is technology independent and, thus, enables design modifications for SER reduction that are equally effective, independent of the technology to which the circuit will be eventually mapped. Moreover, the mechanisms through which soft errors are mitigated at the logic level are orthogonal to those at the physical level; hence, not only does it provide a better starting point, it may also be applied synergistically with the current state of the art in gate resizing-based soft error mitigation techniques. Results in [Almukhaizim 2006] indicate that it is often possible to reduce the SER of a logic circuit without incurring any overhead in terms of area, delay, or power consumption.

Figure 8.24. Example of rewiring to reduce the soft error failure rate: (a) original circuit, (b) circuit after first rewiring, and (c) circuit after second rewiring.

Defect and Error Tolerance

Two tolerance-related terms have surfaced: defect tolerance [Koren 1998] and error tolerance [Breuer 2004a, 2004b]. Defect tolerance requires inserting redundant circuitry into a circuit so that it can continue correct operation in the presence of defects. An example is adding self-diagnosis and self-repair circuitry or an error-correcting code (ECC) to ensure correct memory operation. Error tolerance, on the other hand, allows the circuit to continue acceptable operation in the presence of errors. An example is an MPEG player in which a pixel may be faulty yet the player can continue acceptable operation.

In the nanometer design era, the International Technology Roadmap for Semiconductors (ITRS) has indicated that the effort required to fabricate a defect-free chip can be extremely expensive [SIA 2006]. Consider random spot defects. Assume a design consists of N submodules, each having n unique positions where a defect would cause it to fail its tests, and that D defects are uniformly distributed over the design such that each defect lands at a unique position. In addition, assume that the number of defects in any submodule is independent of the number of defects in other submodules. The authors in [Breuer 2004a] analyzed the defect probability and showed that the probability that an arbitrary position on a submodule is associated with a defect is p = D/(nN), and the probability of having d defects in a given submodule is:

P(d) = C(n, d) p^d (1 − p)^(n − d)

where C(n,d) = n!/(d!(n – d)!). Because P(d) is binomially distributed, the average number of defects E(d) in an arbitrary submodule is:

E(d) = λ = np = D/N

For large n and small p, the binomial distribution can be approximated by a Poisson distribution:

P(d) = e^(−λ) λ^d / d!

where λ is the failure rate. Assume a submodule is equally likely to be defect-free or defective. We have:

P(d = 0) = e^(−λ) λ^0 / 0! = e^(−λ) = 0.5

Thus, λ = 0.693. Table 8.4 shows some numerical results taken from [Breuer 2004a]. The table shows that for a submodule yield (Y) of 50%, the probability of having exactly one defect in a submodule is 0.35. Even if the submodule yield reaches 80%, the probability of having one or two defects in a submodule is still as high as 20% (= 0.18 + 0.02). This suggests that the effective yield can increase significantly if the system can accept some defective submodules.
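The numbers above are easy to reproduce. The short Python sketch below evaluates the Poisson approximation and recovers λ = 0.693 for Y = 0.5 as well as a few entries of Table 8.4.

```python
import math

def p_defects(d, lam):
    """Poisson approximation: P(d) = e^(-lambda) * lambda^d / d!"""
    return math.exp(-lam) * lam ** d / math.factorial(d)

# Yield is the probability of zero defects: Y = P(0) = e^(-lambda),
# so Y = 0.5 gives lambda = ln 2, about 0.693.
lam = -math.log(0.5)
print(round(lam, 3))                    # 0.693
print(round(p_defects(1, lam), 2))      # 0.35  (Y = 0.50 column, d = 1)
print(round(p_defects(1, 0.223), 2))    # 0.18  (Y = 0.80 column, d = 1)
print(round(p_defects(2, 0.223), 2))    # 0.02  (Y = 0.80 column, d = 2)
```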

Table 8.4. Probability of Having Exactly d Defects at a Submodule as a Function of Yield (Y) for Various Values of Failure Rate λ

d    λ=0.105  λ=0.223  λ=0.357  λ=0.511  λ=0.693  λ=0.916  λ=1.204  λ=1.609  λ=2.303
     Y=0.90   Y=0.80   Y=0.70   Y=0.60   Y=0.50   Y=0.40   Y=0.30   Y=0.20   Y=0.10
0    0.90     0.80     0.70     0.60     0.50     0.40     0.30     0.20     0.10
1    0.09     0.18     0.25     0.31     0.35     0.37     0.36     0.32     0.23
2    -        0.02     0.04     0.08     0.12     0.17     0.22     0.26     0.27
3    -        -        0.01     0.01     0.03     0.05     0.09     0.14     0.20
4    -        -        -        -        -        0.01     0.03     0.06     0.12
5    -        -        -        -        -        -        0.01     0.02     0.05
6    -        -        -        -        -        -        -        -        0.02
7    -        -        -        -        -        -        -        -        0.01

Defect Tolerance

Defect tolerance [Koren 1998] is not new; it used to be called redundancy repair. A typical defect-tolerant design is shown in Figure 8.25, where two spares (identical modules) are provided and a switch selects one of them. This contrasts with the TMR system, where a voter votes on the majority of three identical modules (see Figure 8.19). Defect tolerance increases process yield (the percentage of good manufactured parts). In the late 1980s, redundancy techniques began to be used in the manufacture of DRAMs. By using spare rows, columns, or blocks, defective elements can be identified during the manufacturing test process, and fuses are blown to map in the spare resources to replace those that are defective. The use of these techniques has become mandatory as DRAMs scale to the gigabit level; we would not be able to buy and sell DRAMs at the price levels we enjoy today without the redundancy repair process.

Figure 8.25. A defect-tolerant design with two spares.
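The row/column redundancy repair described above amounts to building a small remap table that is burned into fuses at test time. The Python sketch below is a toy model of that idea, not any vendor's actual fuse-map encoding: defective rows found during test are assigned to spare rows, and later accesses are redirected through the map.

```python
def build_row_remap(defective_rows, num_spare_rows):
    """Assign each defective row to a spare row (the assignment that would be
    burned into fuses). Returns None if there are more defective rows than
    spares, i.e., the die is unrepairable."""
    if len(defective_rows) > num_spare_rows:
        return None
    return {row: spare for spare, row in enumerate(sorted(defective_rows))}

def access_row(row, remap):
    """Redirect an access to a defective row to its assigned spare row."""
    return ("spare", remap[row]) if row in remap else ("regular", row)

remap = build_row_remap(defective_rows={7, 42}, num_spare_rows=4)
print(access_row(42, remap))   # ('spare', 1)
print(access_row(100, remap))  # ('regular', 100)
```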

A similar technique is also used in the manufacture of hard disk drives. During the drive test process, defective sectors are identified, and a map containing those defective sectors is stored permanently in the drive control electronics. These defective sectors are mapped out so the drive will not use them to store data. In both situations, spare elements replace the defective elements. Other circuits that have regular structures can also benefit from these defect tolerance techniques, such as FPGAs, cache memories, and processors. For field programmable gate arrays, testing can identify bad cells or routing resources, and the mapping and routing tools can then work around those obstacles during the mapping process. For processors, it is possible to sell the product with fewer features (e.g., minus the floating-point unit [FPU]) upon detection of a fault during manufacturing test. However, defect tolerance has its limits. Not only can the regular circuit elements become faulty because of defects, but the spare elements themselves can also be defective. As the percentage of spare elements increases, they occupy more of the die, and the larger die area results in even more defects, affecting both the normal circuit elements and the spare elements. There is thus a point at which the law of diminishing returns sets in [Koren 1990] [Hirase 2001].

In chip design, defect tolerance can also be achieved by using defect avoidance in a circuit implementation to improve process yield. These defect avoidance techniques are generally referred to as design for manufacturability (DFM) or design for yield (DFY). Layout as well as circuit design methods (e.g., double vias) are commonly used to reduce the sensitivity of the circuit to fabrication defects and process variations. These DFM and DFY techniques are discussed extensively in Chapter 9.

Error Tolerance

Error tolerance is a different concept. Conventional wisdom would suggest that if an error is injected and trapped in the logic, the circuit will not perform its intended functionality; however, some logic functionality defies that conventional wisdom. An example is the processing or storage of multimedia data (e.g., video, pictures, or music). Compression techniques (e.g., JPEG, MPEG, MP3) are generally used for these types of data, and these compression algorithms are lossy in nature; that is, some details of the raw data are lost in the compression process. A stuck bit in the least significant portion of a data word may or may not be distinguishable from artifacts of the compression process [Breuer 2004a, 2004b]. Also, these kinds of data appeal to the senses, and our senses are usually not keen enough to spot minute variances (unless one observes them with an expert's eyes or ears). This sort of error tolerance is application-specific; general-purpose machines that are tolerant of all kinds of errors have not yet been designed. For example, if an error occurs in the most significant bits of the compressed data, or in the control logic rather than the data, processing may still go wrong and yield an unacceptable picture or sound.

In essence, the concept of error tolerance is also not new. For instance, microprocessors of different speeds have been sold at different prices even though they came from the same production line. RAMs or flash memories that are fault-free can be sold to any customer, whereas those that are slightly defective might go to a manufacturer of digital answering machines or video games. The main objective of error tolerance is to increase the effective yield of a process by identifying defective but acceptable chips. The key lies in developing an accurate method to estimate the error rate with a specified level of confidence and an effective method to predict the yield increase based on the results of the error-rate estimation [Lee 2005] [Hsieh 2006].

In [Lee 2005], the authors proposed a fault-oriented test methodology to enhance effective yield based on error-rate analysis. The main idea is illustrated in Figure 8.26, where a set of fault models is assumed. Before actually carrying out production testing, the error rate of each modeled fault is estimated, and a set of acceptable faults is identified based on their error rates. These acceptable faults are then excluded from ATPG. Because chips containing only acceptable faults can now pass through the manufacturing test floor, the effective yield of these chips increases.

Figure 8.26. Fault-oriented test methodology.

Instead of using fault models to estimate error rates, the authors in [Hsieh 2006] proposed an error-oriented test methodology, where focus is on errors produced by defective chips rather than on modeled faults. Because no fault model is required, the proposed test methodology can be applied to any circuit without knowing its detailed structure. In addition, because the entire circuit is processed at one time, the estimation process can be greatly simplified and becomes easier to carry out. The proposed error-oriented test methodology is illustrated in Figure 8.27. First, a sampling-based method is used to estimate the error rates of these chips. The estimated results are then used to determine the acceptability of the faulty chips. With this test methodology, defective chips with high error rates are considered unacceptable and will be rejected, whereas chips with low error rates can be placed into different acceptable classes depending on their error rates, thus increasing the effective yield of the manufactured chips.
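The error-oriented flow in Figure 8.27 can be sketched at a high level as follows. In this hypothetical Python illustration, `apply_pattern` stands in for the tester interface to a given chip, `expected_response` for a golden (fault-free) model, and the classification thresholds are made up for the example rather than taken from [Hsieh 2006].

```python
import random

def estimate_error_rate(apply_pattern, expected_response, pattern_bits, samples=10_000):
    """Estimate a chip's output error rate by applying randomly sampled input
    patterns and counting mismatches against a fault-free (golden) response."""
    mismatches = 0
    for _ in range(samples):
        pattern = random.getrandbits(pattern_bits)
        if apply_pattern(pattern) != expected_response(pattern):
            mismatches += 1
    return mismatches / samples

def classify(error_rate, thresholds=(1e-4, 1e-2)):
    """Bin a chip by its estimated error rate: fully usable, acceptable only
    for error-tolerant applications, or rejected. Thresholds are illustrative."""
    if error_rate <= thresholds[0]:
        return "pass"
    if error_rate <= thresholds[1]:
        return "acceptable for error-tolerant applications"
    return "reject"
```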

Figure 8.27. Error-oriented test methodology.

As for trends, error tolerance is application-oriented; it is especially useful for audio/video and other multimedia applications. The measures for error tolerance may also include factors other than error rate [Breuer 2004a]. Error tolerance may also play an important role in analog and mixed-signal (AMS) circuits, because these circuits can be naturally graded based on their performance.

Concluding Remarks

Advances in semiconductor manufacturing technology have allowed the integration of a billion transistors in a nanometer design. Such designs face many test challenges [SIA 2005, 2006]. The growing complexity and defect mechanisms of fabricating nanometer designs have made circuits even more vulnerable than earlier generations to physical failures introduced during manufacturing and more susceptible to radiation during system operation. In this chapter, we first presented several promising techniques for testing nanometer designs that cope with physical failures caused by signal integrity problems, defects, and process variations during manufacturing. These test techniques are developed for reliability screening and DPM reduction; a few include on-chip hardware for stressing or special reliability measurements. Defect avoidance techniques that improve process yield at design time are generally referred to as design for manufacturability (DFM).

With process nodes now scaling down to 65 nanometers and below, DFM alone is not enough. Many new defect mechanisms, such as copper-related defects and defects caused by optical effects, become much more likely [Gizopoulos 2006], and single-event upsets increase logic and memory soft error rates. It is thus becoming crucial to ensure that chips can still function in the end system in the presence of these defects and soft errors, especially when the chips are installed in airplanes, pacemakers, or cars with safety-critical concerns. This chapter therefore also covered a number of error-resilient and defect-tolerant designs embedded on-chip to tolerate soft errors and defects. Although these schemes may require additional area, they pave the way toward more advanced error-resilient and defect-tolerant scan and logic BIST architectures that can cope with the physical failures of the nanometer age. These error resilience and defect tolerance schemes are now referred to as design for reliability (DFR).

Exercises

8.1

(Signal Integrity) Figure 8.10 shows one readout circuit for reading the signal integrity loss information through a scan chain. How do you change this architecture to achieve the following?

  1. Less integrity test overhead.

  2. More accuracy in terms of pinpointing the problematic wire/bus.

In each case, draw your architecture and estimate the duration of the signal integrity test session.

8.2

(Signal Integrity) For the equation given in Section 8.2.3 that approximates the frequency of a ring oscillator, do the following:

  1. Use a partial derivative to analytically determine the sensitivity of fRO with respect to Vt and Tox.

  2. Use the library and technology parameters available to your SPICE tool, replace the subset of parameters needed, and find the frequency shift for a 15% increase of Vt and a 20% reduction in Tox.

  3. Run SPICE for the same variations, and compare the frequency shift with the analytical results found in part (b).

8.3

(Memory Repair) What is the minimum cost of using a Hamming ECC (a single-bit error correction and double-bit error detection code) to protect a 64-bit-wide memory word (i.e., the number of check bits relative to the 64-bit data word)?

8.4

(Redundancy) What is the marginal increase of redundancy if the number of redundant elements (rows, columns, or blocks) increases by 2×? (Hint: You have to look at the yield decrease because the area has increased by 2×.)

8.5

(Adaptive Design) What other adaptive mechanisms are available that can reduce power consumption besides the frequency and voltage knobs mentioned in Section 8.3.4?

8.6

(Soft Errors) Consider the original circuit in Figure 8.24a and the final circuit in Figure 8.24c. Verify that the probability of sensitizing an error at the inputs of the gate driving f2 in the final circuit is higher than the probability of sensitizing an error at the inputs of the gate driving f2 in the original circuit. What is the improvement?

8.7

(Fault Tolerance) Calculate the reliability R of the defect-tolerant design with two spares shown in Figure 8.25. Assume that the probability that a module M is nonoperational is p, where 0 < p < 1. Then calculate the reliability of the TMR system given in Figure 8.19, and plot a chart to explain which system yields more reliable operation.

8.8

(Defect and Error Tolerance) Refer to Table 8.4. Assume that there are 10 identical modules in a design. Each module has 10,000 unique positions where a defect can exist and make the module behave imperfectly, and 20 defects are uniformly distributed over the module. Let P(d) be the probability mass function of the number of defects d on an arbitrary module, where d is a random variable. Calculate the probability, p, that an arbitrary position on a module is associated with a defect, P(d = 1), and the probability of having zero defect, P(d = 0), in a given module. What’s the yield of the design?

8.9

(Defect and Error Tolerance) Repeat Exercise 8.8. Because P(d) is binomially distributed, calculate the average number of defects, E(d), in an arbitrary module. For large n and small p, we can approximate the binomial distribution P(d) by a Poisson distribution with failure rate λ. Assume that the design is equally likely to be defect-free or defective. Calculate the failure rate λ and the probability of having one defect, P(d = 1), in a given module. Explain the difference between the approximated probability and the probability derived in Exercise 8.8.

8.10

(A Design Practice) Write a C/C++ program or use the fault simulator on the Web site to estimate the transient failure probability by applying 1000 random patterns to a 4-bit carry-ripple adder implemented in combinational logic gates. Assume soft errors are solely caused by single-event upsets, which can randomly cause one internal net to flip its value from 0 to 1 or 1 to 0 during one clock cycle.

Acknowledgments

The authors wish to thank Professor Yiorgos Makris of Yale University for contributing the Circuit-Level Approaches in the Soft Errors section; Professor Mohammad Tehranipoor of University of Connecticut and Professor Saraju P. Mohanty of University of North Texas for reviewing the Signal Integrity section; Dr. Jonathan T.-Y. Chang of Intel, François-Fabien Ferhani of Stanford University, Professor James C.-M. Li of National Taiwan University, Praveen K. Parvathala and Michael Spica of Intel, Professor Michael S. Hsiao of Virginia Tech, Phil Nigh of IBM, and Dr. Brion Keller of Cadence Design Systems for reviewing the Structural Tests and Defect-Based Tests sections; Praveen K. Parvathala and Dr. Li Chen of Intel for reviewing the Functional Tests section; Dr. Shih-Lien Lu of Intel for reviewing the Process Sensors and Adaptive Design section; Professor Subhasish Mitra of Stanford University for reviewing the Soft Errors section; Professor Kuen-Jong Lee of National Cheng Kung University and Michael Spica of Intel for proofreading the Defect and Error Tolerance section; Professor Shi-Yu Huang of National Tsing Hua University and Dr. Srikanth Venkataraman of Intel for providing helpful comments; as well as Teresa Chang of SynTest Technologies for drawing most of the figures.

References

Books

Introduction

Signal Integrity

Manufacturing Defects, Process Variations, and Reliability

Soft Errors

Defect and Error Tolerance

Concluding Remarks
