Chapter 10. Design for Debug and Diagnosis

T. M. Mak, Intel Corporation, Santa Clara, California

Srikanth Venkataraman, Intel Corporation, Hillsboro, Oregon

About This Chapter

Designers have to be prepared for the scenario where chips do not work as intended or do not meet performance expectations after they are fabricated. Product yield engineers need to know what has caused their product yield to be below expectation. Reliability engineers need to know what circuits or elements have failed from various stresses and which ones customers have returned. It is necessary to debug and diagnose chip failures to find the root cause of these issues, and this is best facilitated if the design has accommodated features for debug and diagnosis.

This chapter focuses on the design features at the architectural, logic, circuit, and layout level that are needed to facilitate silicon debug and defect diagnosis of integrated circuits. These design features are generally referred to as design for debug and diagnosis (DFD). We explain how these DFD features are used effectively in a debug or diagnosis environment for applications ranging from design validation and low-yield analysis to field failure analysis. Common probing techniques used to access internal signal values (logic and timing) and alter the execution and timing behavior of an integrated circuit in a controlled manner are covered. We also describe circuit editing techniques and tools to enable probing, validate hypotheses, and confirm root-cause design fixes. These DFD features are categorized into logic DFD (LDFD) features, which enable the nonintrusive isolation of failures, and physical DFD (PDFD) features, which enable physically intrusive analysis.

Introduction

Few chips ever designed function or meet their performance goal the first time. Many fabricated chips may also fail to function because of defects, circuit/process sensitivity, infant mortality, or wearout throughout their product lifetime. When fabricated integrated circuits (ICs) of a new design do not function correctly or fail to meet their performance goal, the silicon debug process starts immediately to identify the root cause of the failure. For manufacturing defect-induced failures, the cause of failure may not be important as long as the failure rate is comparable to the process norm and is acceptable. However, if the yield is low or if the chip performance varies with process variation, the diagnosis process is utilized to identify the root cause so that corrective actions (with process and/or design changes) can be taken to improve yield. Failing to do so will reduce the profitability of the product or result in the product being unable to meet the volume demands of the market because of availability shortfall. When chips fail in the field and the failure rate is above acceptable levels, customers will typically initiate action focused on the quality of the incoming product. Customers usually send these chips back, and the chips then have to be analyzed for the root cause of the failure. Following this step, corrective actions have to be put into place. If the cause of the problem is not understood and the customer is not convinced, a customer line-down situation or even a product recall may result, which can cripple the product and cause a loss of market share relative to competitors.

In all of the above scenarios, speedy and accurate debug and diagnosis is needed. This chapter focuses on the changes that can make a design more debug-able and diagnosable—hence, the acronym DFD, which stands for design for debug and diagnosis.

What Are Debug and Diagnosis?

Webster’s dictionary defines diagnosis as the investigation or analysis of the cause or nature of a condition, situation, or problem. In the context of very-large-scale integrated (VLSI) circuits and systems, diagnosis is a general term that applies to various phases of the design and manufacturing process that involve isolating failures, faults, or defects. The term debug refers to the process of isolating bugs or errors that cause a design to behave differently from its intended behavior under its specified operating conditions. Debug and diagnosis techniques try to address the following questions:

  • What was wrong with this device?

  • Which chip was bad on this board?

  • Why did this system crash?

  • Why did the simulation of the arithmetic logic unit (ALU) show that 2 + 2 = 5?

  • Why was this signal late?

The aim of debug and diagnosis is to locate the root cause of the device bugs or failures.

Where Is Diagnosis Used?

Electronic systems are diagnosed at different stages of design and manufacturing of the constituent components of the system for several different objectives. However, the common aim of any diagnosis procedure is to locate the root cause of the device failures. Depending on whether the device is a VLSI chip, multichip module (MCM), a board, or a system, diagnosis is performed with different objectives (see Figure 10.1). In the case of MCMs, boards, or systems, diagnosis is intended to identify and subsequently replace the faulty subcircuit (a chip on a board or a board in a system) or to reconfigure the circuit around the failure. In the case of chips or integrated circuits, diagnosis is performed to improve the manufacturing process. Diagnosis followed by failure analysis is vital to IC design and manufacturing. Diagnosis identifies root cause defects for yield enhancement, finds design flaws that hinder circuit operation, and isolates reliability problems that could lead to early product failure. The primary focus of this chapter is on debug and diagnosis usage during the manufacturing and postsilicon phase of digital ICs or VLSI chips.

Figure 10.1. Diagnosis at the chip, board, and system level.

IC-Level Debug and Diagnosis

Figure 10.2 shows the typical life cycle of an integrated circuit or VLSI chip, from the requirements stage to the end of life. Three major phases of the life cycle are shown: (1) design along with validation and verification, (2) ramp-up to production with design revisions or steppings, and (3) volume production. There are three major applications of diagnosis and debug techniques and technologies at the IC level to root-cause issues that are either design or manufacturing process related: (1) design error diagnosis during the design phase, (2) silicon debug during the ramp-up to production, and (3) defect diagnosis through production ramp and volume production. The focus of this chapter is on silicon debug and defect diagnosis, and they are described in more detail in the following sections.

Figure 10.2. Debug and diagnosis applications across an IC life cycle.

Silicon Debug versus Defect Diagnosis

Silicon debug is the first and necessary step in the first/initial silicon stage of a product life cycle for any complex VLSI chip. Silicon debug starts with the arrival of first silicon and continues until volume production. Once first silicon arrives, these early chips are sent to the design team for validation, test, and debugging of bugs that were not uncovered during the presilicon design process. Many bugs and errors, such as logic errors, timing errors, and physical design errors, may be introduced during the design process. These bugs and errors should have been uncovered during the design verification and validation processes that include simulation, timing verification, logic verification, and design rule checking. However, very often bugs and errors still escape these checkpoints and cause silicon not to behave as designed. These errors are often caused by limitations and flaws in the circuit models, simulations, or verification. Problems could range from logic or functional bugs, to circuit sensitivities or marginalities, to timing or critical speed-path issues [van Rootselaar 1999] [Josephson 2001]. Verification tests performed during the design phase may not cover all corner cases encountered in a real application, leading to logic or functional bugs. Static timing analysis and dynamic timing simulations may not accurately and precisely predict every critical path, resulting in timing or speed paths appearing in silicon that were not factored into the design. As an interesting side note, one of the first examples of “debugging” was in 1945, when a computer failure was traced to a moth caught in a relay between contacts [Gizopoulos 2006].

Manufacturing defects are physical imperfections in the manufactured chip. The process of locating manufacturing defects is called defect diagnosis, fault diagnosis, or fault isolation. A key application of defect diagnosis is supporting low-yield analysis and yield enhancement activities.

One application of defect diagnosis to manufacturing yield enhancement is shown in Figure 10.3. Initially the effort is mainly focused on memories, test chips, and inline monitors. The regular and repetitive structure of a memory makes it easier to diagnose or to isolate its faults. Test chips and static random-access memories (SRAMs) are used to bring up a new process, and the large embedded memories in products are usually used to maintain or improve yield [Gangatirkar 1982] [Hammond 1991] [Segal 2001]. However, because of the differences in the possible defect types between logic and memories (layout topology, number of layers, and transistor density/count), memories may not capture all manufacturing process issues, leading to the use of logic diagnosis in yield learning in the intermediate and mature phases of the manufacturing process. A second application of defect diagnosis is dealing with manufacturing excursions or low-yield situations as shown in Figure 10.4. Diagnosis is performed on low-yield chips to isolate the root cause of failure (process abnormality leading to high defect density, process variation, design and process sensitivities, etc.) and follow up with appropriate corrective actions. Other applications of defect diagnosis include analysis of failures from the qualification of the product, which includes reliability testing (both infant mortality and wearout) and failures from the field or customer returns.

Figure 10.3. Defect diagnosis applications to manufacturing yield.

Figure 10.4. Defect diagnosis applications to manufacturing excursions.

Design for Debug and Diagnosis

Debug and diagnosis require a high degree of observability. The ability to observe erroneous events close to when they happen is important. In addition, controllability is needed to further validate what the cause of the problem is and to narrow down the circuits that manifest the symptoms.

Many design for testability (DFT) features (e.g., scan), which enable both controllability and observability, can be reused for debug and diagnosis [Gu 2002]. There are also specific DFD features (e.g., clock controllability or reconfigurable logic) that are developed or tailored for debug and diagnosis.

The DFD features can be broadly bucketed into two categories. Logic DFD structures are added to the design to extract logic information and to manipulate the operation of a chip in a physically nonintrusive manner. In addition, several physical tools exist to extract logic and timing information from a chip in a minimally physically intrusive manner. Circuit or focused ion beam (FIB) editing is performed to either enable probing or make corrections to the circuit to verify root-cause fixes. DFD structures that enable probing and circuit editing are called physical DFD structures [Livengood 1999].

Because probing is also needed to confirm the hypothesis of the cause of failures, we will touch on physical debug tools, and we will also cover some layout related DFD features that enable probing and circuit editing. Speed debug is an extension of logic debug. Throughout the description of these techniques, emphasis is placed on debugging for speed-related problems in addition to logic failures. We have decided not to include memory-related diagnosis techniques that are common for yield analysis purposes. However, references are provided should the reader want to study this area [Gangatirkar 1982] [Hammond 1991] [Segal 2001].

Logic Design for Debug and Diagnosis (DFD) Structures

Logic design for debug and diagnosis (DFD) structures are added to the design to extract logic information and to manipulate the operation of a chip in a physically nonintrusive manner. This section explores key logic DFD features, including scan, observation-only scan, observation points and multiplexers, array dumps, clock control, partitioning and core isolation, as well as reconfigurable logic.

Scan

Scan is a widely used DFT feature. Scan is also useful for debug and diagnosis as a DFD feature (see Chapter 7 of [Wang 2006]). Although scan is typically utilized to enable automatic test pattern generation (ATPG), the increased observability of the design because of the insertion of scan also enables debug and diagnosis. For more information on scan and other DFT features, please refer to DFT textbooks [Abramovici 1994] [Bushnell 2000] [Jha 2003] [Wang 2006].

For ATPG patterns generated automatically and applied using scan, automated analysis of the failures can be performed using diagnosis tools that provide fault candidates for further validation [Waicukauski 1989] [Venkataraman 2001] [Guo 2006]. Modern-day DFT tool vendors also provide diagnosis tools along with their ATPG offerings.

Scan also helps in functional-pattern-based debug and diagnosis. There are two commonly used types of scan design: (1) muxed-scan or clocked-scan for flip-flop–based scan designs [Wang 2006] and (2) level-sensitive scan design (LSSD) for latch-based scan designs [Eichelberger 1977]. In both types of scan design, a functional test can be executed up to a certain clock cycle and then stopped. Once the system flip-flops and latches are reconfigured as scan cells, the functional state of these system flip-flops and latches in the chip can be shifted out for analysis during the shift operation. This process has to be coupled with register-transfer level (RTL) or logic simulation to compare against the expected state. For most scan designs (using muxed-scan, clocked-scan, or LSSD-like), unloading the scan cell contents is destructive in that the functional state is destroyed during shift. Hence, functional execution cannot be resumed after the data are unloaded. If a functional state at a later clock cycle needs to be investigated, then the functional pattern would have to be reexecuted and then stopped at that later cycle. This can be time consuming if the time from reset to the failure point is long, which is common in system-level debugging. An enhancement to this scheme is to wrap the scan chain around from scanout to scanin. If properly accounted for, the system flip-flops and latches will contain exactly the same contents as they did before the shift. Hence, functional execution can continue until the next clock cycle of interest [Hao 1995] [Levitt 1995]. This approach also requires that the scan shift operation be nondestructive with respect to all other storage elements (flip-flops and latches) in the circuit that are not scanned. Alternately, many designs add observation-only scan cells (described next). It is important to note that scan only provides a means to access the internal state of the storage elements that are scanned and helps determine the incorrect state captured during the execution of a failing test. It does not tell exactly what caused the incorrect state. More debugging is necessary from the point of observed failures to determine the root cause.
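
This flow can be pictured with a minimal behavioral sketch in Python (an illustration only, not from the chapter; run_silicon and run_golden_sim are hypothetical callbacks): the failing test is run to a chosen stop cycle, the scan chain is unloaded with scan-out fed back to scan-in so the functional state survives the shift, and the unloaded state is compared against the golden simulation to flag mismatching storage elements.

class ScanChain:
    """Models the scanned storage elements of a design as a single chain."""
    def __init__(self, state_bits):
        self.cells = list(state_bits)              # one bit per scan cell

    def unload_with_wraparound(self):
        """Shift the chain its full length with scan-out fed back to scan-in,
        so the functional state is intact afterward and execution can resume."""
        observed = []
        for _ in range(len(self.cells)):
            bit = self.cells.pop()                 # bit appears at scan-out
            observed.append(bit)
            self.cells.insert(0, bit)              # and wraps back into scan-in
        return observed[::-1]                      # reorder to match cell positions

def debug_at_cycle(run_silicon, run_golden_sim, stop_cycle):
    """run_silicon / run_golden_sim are assumed callbacks that execute the
    failing test up to stop_cycle and return the storage-element state."""
    chain = ScanChain(run_silicon(stop_cycle))
    silicon_state = chain.unload_with_wraparound()
    golden_state = run_golden_sim(stop_cycle)
    return [i for i, (s, g) in enumerate(zip(silicon_state, golden_state)) if s != g]

# example with made-up states: cell 2 captured a wrong value at cycle 487
print(debug_at_cycle(lambda c: [1, 0, 1, 1], lambda c: [1, 0, 0, 1], stop_cycle=487))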

Observation-Only Scan

Observation-only scan is a specialized version of scan that is typically used only for debugging purposes [Carbine 1993, 1997] [Needham 1995] [Josephson 2001] [Gizopoulos 2006]. This is also called scanout or shadow scan. Figure 10.5 shows the schematic of a typical scanout cell. The Sin port is connected to the Sout port of the preceding stage to create a scanout chain, whereas the Data input port is connected to the signal of interest. When the LOAD signal is enabled (set to logic 1), the signal of interest is latched into a separate scanout cell (where a snapshot is taken just for a clock cycle of interest) and can be kept there until it is ready to be shifted out when enabling the SHIFT signal. The whole captured state from all the signals of interest connected to scanout cells can be shifted out through a special scanout chain for analysis. This type of observation-only scan is part of a separate scan chain that is typically clocked by the same system clock capturing the signals of interest at-speed. From a performance penalty standpoint, observation-only scan merely adds some small capacitance to the signal of interest, and the functional signal does not have to pass through any additional multiplexers. Hence, this type of scan is more acceptable for functional debugging for high-performance circuits and systems.

Figure 10.5. A scanout cell.
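
The behavior of such a cell can be approximated with a simple software model. The sketch below is only an illustration of one plausible interpretation of the LOAD/SHIFT controls, not the actual circuit: asserting LOAD snapshots the Data input, asserting SHIFT moves data down the chain toward Sout, and asserting both together gives the signature mode discussed later in this section.

class ScanoutCell:
    def __init__(self):
        self.q = 0                                 # latched value

    def clock(self, data, sin, load, shift):
        """One system-clock tick of the cell."""
        if load and not shift:
            self.q = data                          # snapshot the signal of interest
        elif shift and not load:
            self.q = sin                           # shift from the upstream cell
        elif load and shift:
            self.q = sin ^ data                    # signature mode (one assumed implementation)
        return self.q

class ScanoutChain:
    def __init__(self, length):
        self.cells = [ScanoutCell() for _ in range(length)]

    def capture(self, data_bits):
        """Pulse LOAD for one clock: every cell snapshots its signal of interest."""
        for cell, d in zip(self.cells, data_bits):
            cell.clock(data=d, sin=0, load=1, shift=0)

    def shift_out(self):
        """Pulse SHIFT for len(cells) clocks; bits appear at Sout of the last cell."""
        out = []
        for _ in self.cells:
            out.append(self.cells[-1].q)           # Sout of the chain
            # update from the end of the chain toward the front, so each cell
            # takes the value its upstream neighbor held at this clock edge
            for i in range(len(self.cells) - 1, -1, -1):
                sin = self.cells[i - 1].q if i else 0
                self.cells[i].clock(data=0, sin=sin, load=0, shift=1)
        return out                                 # last cell's snapshot comes out first

chain = ScanoutChain(3)
chain.capture([1, 0, 0])       # LOAD pulsed at the clock cycle of interest
print(chain.shift_out())       # -> [0, 0, 1]: cell 2, then cell 1, then cell 0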

Scanout cells are judiciously placed throughout the design to maximize the observability of key states and signals. Potentially, they can be inserted at any signal line, but that would increase the cost because of the added area. Selective placement of these scanout cells on key signals helps the debugging process tremendously while keeping area overhead low. Because of their unique speed-capturing nature, these cells can also be put on different segments of long signal lines to observe the effects of signal degradation or noise. Very often, the area of the scanout cells can also be hidden (or further reduced) if they are placed under routing channels, where silicon is not utilized. Just like conventional scan design, the placement and routing of these cells can be automated, so the scheme is fully compatible with both custom and application-specific integrated circuit (ASIC) design styles.

As indicated before, observation-only scan is primarily aimed at speed debugging, because it runs concurrently with the system clock. At the specific clock cycle of interest, one triggers the LOAD signal so that the Data signal of interest is captured into the scanout cell. The captured state can either be kept there until the test is completed and then shifted out for analysis, or it can be shifted out while the system continues to run. Repeated capturing and shifting is possible provided that the time to shift the whole chain is shorter than the time it takes to arrive at the next capture point. Again, RTL or logic simulation is executed to allow the comparison of expected states. Figure 10.6 illustrates how the scanout system is used with automatic test equipment (ATE) to allow the capturing of internal state using scanout cells while the chip is being tested with a functional pattern executed on the ATE.

Figure 10.6. Operation of an observation-only scan design.

Additionally, scanout systems typically have a compressed signature mode built into the cell, whereby the content of an upstream cell is XORed with the content of the current scanout cell. This is accomplished by enabling the LOAD and SHIFT signals simultaneously as shown in Figure 10.5. The final signature after the execution of a test can be shifted out of the scanout chain. The advantage of a signature over a single snapshot is efficiency: checking an accumulated signature reveals whether an error was captured during a time interval bounded by one trigger event, which starts recording, and another trigger event, which stops recording. This can narrow down the cycles in which errors occurred during the execution of a test. The reference signature for comparison can also be obtained with the chip running in a passing condition rather than in a failing condition. An alternate observation-only cell design was proposed in [Sogomonyan 2001].
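
One plausible reading of this signature mode (an assumption, since the exact cell implementation is not detailed here) is that each cell latches the XOR of its upstream neighbor and its own Data input on every clock, which the following sketch models. Like any compressor, such a signature can alias, but differing signatures between a failing run and a passing (reference) run confirm that an error was captured somewhere in the recording window.

def accumulate_signature(per_cycle_bits, chain_length):
    """per_cycle_bits: one list of observed bits (one per cell) for every
    clock cycle in the recording window between the two trigger events."""
    sig = [0] * chain_length
    for bits in per_cycle_bits:
        # each cell latches (upstream cell XOR its own Data input); the first
        # cell's upstream input is tied off to 0 in this sketch
        sig = [(sig[i - 1] if i else 0) ^ bits[i] for i in range(chain_length)]
    return sig

# made-up data: the failing run differs from the passing run in cycle 2
passing = accumulate_signature([[0, 1, 1], [1, 0, 1]], 3)
failing = accumulate_signature([[0, 1, 1], [1, 0, 0]], 3)
print(passing, failing)   # -> [1, 0, 0] [1, 0, 1]: the signatures differ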

Observation Points with Multiplexers

Multiplexers (MUXes) are a common alternate way to observe on-chip signals. Once a set of important signals is identified for observation, multiplexers can be added so that individual signals can be mapped out to a test port (e.g., through the Test Data Out [TDO] port defined in the IEEE 1149.1 boundary-scan standard [IEEE 1149.1-2001]). However, the test control signals for the MUXes can add to the overhead, especially when a large set of signals is observed. Also, a signal that one would like to observe during debugging may not have been included in the observation set and would therefore not be available. The authors in [Abramovici 2006] proposed adding a layer of programmable logic that can be programmed to allow specific signals to be observed. Figure 10.7 shows how this can be done by using wrappers around individual blocks of circuits. The MUXes bring various signals of interest to each wrapper and allow signals from various wrappers to be mapped to observation points, thus reducing the amount of overhead required while at the same time maximizing the likelihood that a signal of interest can be observed. However, the overall architecture of placing the programmable logic where it will provide a high degree of observability may still be expensive for certain designs.

Figure 10.7. Programmable observation of internal signals.
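
The wrapper-based observation scheme can be pictured with a small software analogy. In the toy sketch below (hypothetical block and signal names, not from the referenced work), each wrapper is programmed to select one of its internal signals, and a top-level selector maps the chosen wrapper onto the single shared observation pin.

class ObservationWrapper:
    def __init__(self, signals):
        self.signals = signals        # dict: name -> callable returning current value
        self.select = None

    def program(self, name):
        self.select = name            # programmed through the debug/test interface

    def output(self):
        return self.signals[self.select]() if self.select else 0

class ObservationTop:
    def __init__(self, wrappers):
        self.wrappers = wrappers      # dict: block name -> ObservationWrapper
        self.block_select = None

    def program(self, block, signal):
        self.block_select = block
        self.wrappers[block].program(signal)

    def tdo(self):
        return self.wrappers[self.block_select].output()

# usage: route the (hypothetical) ALU carry-out signal to the shared pin
alu = ObservationWrapper({"carry_out": lambda: 1, "zero_flag": lambda: 0})
top = ObservationTop({"alu": alu})
top.program("alu", "carry_out")
print(top.tdo())    # -> 1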

The same programmable logic has an additional function beyond providing observation. Once a hypothesis of the problem is formulated, the erroneous logic can be altered with this programmable logic to provide a soft fix. Because this programmable logic is already part of the chip, this repair method is more flexible than the blue-wire patching method, which we will describe in Section 10.4. Sample or even production units can be supported provided that this logic can be programmed before the chip is initialized for use in the end user system.

Array Dump and Trace Logic Analyzer

In addition to logic states, which can be observed using scan and scanout, other memory states such as those held in embedded arrays (caches, register files, buffers/pointers, etc.) are also important for debug and diagnosis. Because these structures are usually not scanned, scan dump cannot access the information stored in these arrays. So an array dump mechanism is typically designed [Carbine 1997] so that the contents stored in these arrays can be observed after a normal functional operation is stopped with clock control. (Clock control is covered in the next section.) The information can be dumped on an external bus, which can then be captured by the ATE (tester) or a logic analyzer. Alternately, the data may be accessed through other test data pins (e.g., test access port [TAP] as defined in the IEEE 1149.1 boundary-scan standard [IEEE 1149.1-2001]).

The inverse of array dump is to use the existing array(s) to store on-chip activities, such as internal bus traffic and specific architectural state changes. Multiplexers or even programmable interfaces can be designed to redirect these types of information to be stored in specific on-chip arrays (e.g., the L2 cache on a microprocessor). This is called a trace logic analyzer (TLA) in [Pham 2006]. Of course, these arrays have to be resized so that they will not impair existing functionality. Alternatively, we could design in dedicated arrays to capture these traces, but that would be too expensive just for debugging purposes.
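
The following sketch illustrates the idea with an assumed interface (it is not the TLA implementation of [Pham 2006]): bus transactions are written into a circular buffer carved out of an on-chip array between a start trigger and a stop trigger, and the buffer is dumped afterward for off-chip analysis.

from collections import deque

class TraceBuffer:
    def __init__(self, depth):
        self.entries = deque(maxlen=depth)   # repurposed on-chip array, fixed depth
        self.recording = False

    def on_trigger(self, start):
        self.recording = start               # True = start event, False = stop event

    def on_bus_cycle(self, cycle, address, data):
        if self.recording:
            self.entries.append((cycle, address, data))  # oldest entries overwritten

    def dump(self):
        return list(self.entries)

# usage with made-up bus traffic
tla = TraceBuffer(depth=4)
tla.on_trigger(start=True)
for c in range(6):
    tla.on_bus_cycle(cycle=c, address=0x1000 + c, data=c * c)
tla.on_trigger(start=False)
print(tla.dump())    # only the last 4 transactions are retained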

Clock Control

Clock control is an important DFD feature for speed debugging. Internal observation mechanisms like scan and scanout help obtain information in the spatial domain (where did the error originate?), while clock control supplements it by enabling extraction of information in the time domain (when did the error start to occur?).

Most modern VLSI chips with clock frequencies from hundreds of megahertz (MHz) to gigahertz (GHz) have an internally generated clock from a phase-locked loop (PLL). This generated clock is usually synchronized to an external system clock of much lower frequency. High-frequency oscillators are difficult to build, and high-frequency clocks are difficult to route on a board. Thus, the common solution is to have an internal PLL multiply the external clock up to a high-frequency internal clock. Moreover, modern-day VLSI chips also typically contain multiple clock domains that drive specialized circuits or external I/O interfaces.

Starting, stopping, and restarting these internal clocks while keeping them synchronized to specific external and internal events is critical for debug [Josephson 2001]. Being able to start and stop the clock is an important first step in the debugging process. Whereas scan or scanout capture can extract internal logic state, the question of when to take this internal observation is answered by clock control. Stopping the clock at a specific clock cycle and synchronizing it to an internal or external event are the most basic of clock control features. If the clock stops at the wrong cycle, the internal observation results will be wrong, and the analysis will be performed down the wrong path.

Although it is possible to stop the clock after an external event, it is definitely preferable to use internal events to control the clock, as an external event may be too imprecise. This happens because the external event is synchronized to a slower external clock, and there may be an offset between the slow external clock and the faster internal clock.

Specific offset counters are typically added to make the clock stopping points much more flexible. These offset counters are often part of a more comprehensive set of debug registers designed to capture information from or assert control over different parts of the chip specifically for debugging purposes [Pham 2006]. These are also commonly referred to as control registers [Carbine 1997]. For example, one can specify that a scan or scanout capture be taken 487 clock cycles after an exception event has occurred. The programming of these clock-stop event(s) and offset is usually shifted in ahead of time through specific scan chains so that they are ready to execute when the debug tests are run.
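
Behaviorally, such a clock-stop mechanism can be sketched as follows (illustrative only, with assumed names): an offset counter is programmed ahead of time, armed on the trigger event, and the internal clock is frozen once the counter expires so that scan or scanout dumps reflect the cycle of interest.

class ClockStopControl:
    def __init__(self):
        self.armed = False
        self.offset = 0
        self.counting = False
        self.clock_running = True

    def program(self, offset_cycles):
        """Normally shifted in through a debug scan chain before the test runs."""
        self.offset = offset_cycles
        self.armed = True

    def tick(self, trigger_event):
        """Evaluated every internal clock cycle; returns True if a clock pulse is issued."""
        if not self.clock_running:
            return False                      # clock already stopped
        if self.armed and trigger_event:
            self.counting = True              # trigger seen: start counting down
            self.armed = False
        if self.counting:
            if self.offset == 0:
                self.clock_running = False    # freeze state for scan/scanout dump
                return False
            self.offset -= 1
        return True

ctrl = ClockStopControl()
ctrl.program(offset_cycles=487)
# ... ctrl.tick(trigger_event=exception_seen) is then evaluated on every cycle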

Another useful function, in addition to starting and stopping the clocks, is the ability to issue a single clock pulse. This is also sometimes called single stepping. Single stepping allows one to observe the execution of the circuit in a controlled manner. A nondestructive observation capability, such as the scanout or observation-only scan described earlier, is needed to complement single stepping with internal observation of the circuit state.

Besides starting and stopping the internal clocks, the capability of stretching a specific clock cycle is useful in the debugging process. When the clock is supplied externally, this is relatively simple: it can be accomplished by swapping in a different set of timing generators on a specific clock from the ATE. However, for internally generated clocks, circuit modification to the clock circuit is needed. Debugging performance problems (for example, “Why won’t the chip run any faster than 500 MHz?”) requires isolating the offending path or paths in the circuit. By sequentially stretching the clock from the failure point backward, the specific clock cycle where the failure was latched internally can be determined. This provides clues as to where the problem lies, considering the pipelined nature of high-performance systems. Multiple failing paths may exist, requiring multiple iterations of clock stretching to find the root cause in an onion-peeling fashion. Debug flows will be described in more detail in Section 10.6.
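
This onion-peeling search can be expressed as a simple loop, sketched below against an assumed tester interface (run_test_with_stretch is hypothetical): each cycle from the failure point backward is stretched in turn, and the first stretch that makes the test pass identifies the cycle in which the failing path is exercised.

def find_failing_cycle(run_test_with_stretch, failure_cycle, max_lookback=50):
    """run_test_with_stretch(cycle) is assumed to rerun the failing test with
    only that clock cycle stretched and return True if the test now passes."""
    for cycle in range(failure_cycle, max(failure_cycle - max_lookback, -1), -1):
        if run_test_with_stretch(cycle):
            return cycle          # the failing path is launched/captured in this cycle
    return None                   # no single stretched cycle fixed it in this window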

Stretching of individual clock phases can also be implemented to pinpoint the path to a particular clock phase. For flip-flop-based designs, the specific phase may not be important, but for latch-based designs, this phase stretching feature is indispensable.

Figure 10.8 illustrates the clock controls for the Pentium 4 microprocessor [Kurd 2001]. An external system clock is fed into individual Core PLL and IO PLL circuits, where the clock is multiplied. A skew-detect circuit ensures that the skew between the two PLLs can be monitored for deviation. Both clock systems can be adjusted for their duty cycles, while the core clock also allows for skew adjustment between its various distribution buffers, because this clock is distributed all over the chip and is subject to on-die process variation. Finally, this core clock can also be adjusted for its duty cycle by stretching either one phase or the other.

Figure 10.8. Pentium 4’s on-die clock control.

Another clock control feature that can help with debugging is the ability to introduce relative skew between different clock domains. Because a chip can contain tens of millions to hundreds of millions of transistors (and can include tens of thousands to hundreds of thousands of storage elements), the clock has to be distributed from the PLL to all these storage elements through an elaborate clock distribution system. This distribution system, consisting of a series of buffers, must have the ability to deskew the various domains so that all the clocks arrive at their destinations within a tight tolerance. Without this kind of clock distribution system, much performance could be lost because of the natural skews introduced by on-die process variations [Kurd 2001] [Fetzer 2006].

Because the clock distribution system can be deskewed, intentional skews can also be introduced through additional adjustment above and beyond what is needed for deskewing [Josephson 2001] [Josephson 2004] [Mahoney 2005] [Gizopoulos 2006]. This intentional skew, which is illustrated in Figure 10.9, gives more sampling time for the storage elements using the delayed clocks. Of course, the storage elements using this delayed clock also launch their outputs later, so failures may be observed in their downstream logic as well. This feature provides additional information as to where the failing path may lie. If pulling in the clock edges of a clock region causes the test to pass, then the source or the driving storage elements of the failing path can be pinned to that clock region. Similarly, if pulling in the clock edges of a clock region causes the test to fail, then the destination or the receiving storage elements of the failing path can be pinned to that clock region.

Figure 10.9. Introducing intentional clock skew.

Partitioning, Isolation, and De-featuring

VLSI chips, including system-on-chips (SoCs) and microprocessors, are complex devices. They typically consist of multiple heterogeneous subfunctional blocks operating together to deliver the final functionality that the chip is supposed to deliver. Debugging such a chip will be difficult if we cannot isolate or narrow down the problem at the macro level before taking internal observation. Partitioning is one mechanism to determine if the problem still exists after some blocks are separated. This could be accomplished by logic partitioning or by rewriting software so as to confine the execution to a much smaller logic unit. For example, multiple execution units on the processor can be disabled so that the parallel execution of instructions can be restricted to a few or even down to one unit. The instructions can then be directed to individual units by subsequently enabling or disabling specific units. Of course, this ability is limited to units that have some level of redundancy built in. This is also called de-featuring.

Reconfigurable Logic

As mentioned in Section 10.2.3, the authors in [Abramovici 2006] proposed placing programmable logic (also called a reconfigurable fabric) between blocks or interspersing it into the regular logic fabric to aid both debugging and making a fix (see Figure 10.7). Patching can be done by reprogramming the logic to replace or change existing logic functionality. This can be seen as halfway to fully programmable logic. However, it does provide benefits that focused ion beam (FIB) patching cannot achieve—that is, it ensures (1) instantly patched logic and (2) that every chip can be patched because every chip can be reprogrammed. (FIB patching is covered in Section 10.4.) This allows very fast time-to-market by permitting units to be shipped without having to wait for a new design revision and a new set of masks. Should a bug appear, a software download can provide the fix.

Probing Technologies

The logic DFD structures described in the previous section, along with debug and diagnosis techniques to be described later, provide the first level of reasoning and isolation down to the signals and signal paths that failed to capture the correct logic values within the prescribed timing window of a clock cycle under the operating window of supply voltage and temperature. However, little else can be derived about the exact nature of the problem. Was the problem caused by a weakly driven signal somewhere along the combinational path? Did it occur because the signal suffers from coupling noise? Was it caused by a power glitch at the vicinity of the affected logic? Was there charge leakage that is supply voltage sensitive? We need to know what causes the signals or signal paths not to propagate with the correct logic values within the prescribed timing window and under the operating condition window. A fix for the circuit cannot be put into place to solve the issue if the root cause is not verified. Probing technologies (or probing in short) can provide more detailed information to further isolate the issue, come up with root-cause hypotheses, and verify the correct hypothesis.

Probing technologies can be broadly classified as contacting and noncontacting. Contact probing is also called mechanical probing. Noncontact probing can be accomplished using two classes of probing, both of which are minimally invasive. The first class includes techniques and tools that inject beams (e.g., laser or electron beams [e-beams]) and sense the response while the circuit is operational. These can also be categorized as active techniques. The behavior of the chip may be intentionally altered by this injection as well in some cases. The second class involves measuring the emissions (photon or thermal infrared) during the operation of the circuit. These can be categorized as passive techniques. These techniques are described next.

Mechanical Probing

Mechanical probing has been the mainstay of silicon analysis for decades. Essentially, it is a mechanical probe with a tip that can be manipulated to land on metal lines or specific layout pads. At this scale of layout geometry, this is called pico-probing; its predecessor was called micro-probing. Pico-probe tips are on the scale of ~1 μm in diameter, with the most sophisticated versions perhaps a quarter of that. Because of the relatively large capacitance that the probe presents to the signal, probing with today’s technology is pretty much limited to strong signals such as clocks or power buses. Because the pico-probe is usually hooked up to high-frequency, high-impedance oscilloscopes, it serves a function that can never be satisfied with other means—high signal fidelity. This technique potentially gives the most accurate signal extraction, should the additional capacitance not be a problem for the signal being observed.

Figure 10.10 shows a typical mechanical probe station. The platen is for the placement of these probes. The wafer/chip is placed on the circular chuck at the center. The microscope is for optical viewing of the probe placement. Fine adjustment of the probes in x, y, and z is provided by the three tuning knobs (on the top/side and the back) of each probe manipulator.

Figure 10.10. Mechanical probe setup.

Injection-Based Probing

These signal injection-based probing tools are also known as beam-based probing tools. Internal signals and transistors can be probed by using an electron beam (E-beam) or laser to inject a signal and by measuring the characteristics of the returned signal, which is modulated by the actual signal on the probed node while a circuit is in operation under a test. E-beam probing was the mainstay for a long time. However, in the current generation of integrated circuits using multiple metal layers and flip-chip packaging, inspection of the circuits from the front side (top side) of the die or wafer using E-beam probing is difficult. Fortunately, silicon is transparent at infrared wavelengths. Thus, by injecting a laser in the infrared spectrum from the backside of the die or wafer, it is possible to probe circuits in a relatively noninvasive manner. This technique is called laser voltage probing. In the case of E-beam, there is a voltage contrast effect whereby the secondary electron (SE) yield is attenuated by the voltage on the line. For laser (optical) probing, the electric fields modify the phase and amplitude of the returning signal.

E-beam Probing

E-beam (electron beam) probing uses a principle similar to that of a scanning electron microscope (SEM). In a vacuum environment, electron beams are scanned over the area that one wishes to observe. As the beam hits the metal lines, secondary electrons (SEs) are generated. A sensor pickup (SE detector) nearby collects these secondary electrons. For metal lines at a positive potential, the secondary electrons are attracted back onto the metal line itself, resulting in little charge collected by the sensor. If the line is negatively charged, secondary electrons are expelled, and more of them will be collected by the pickup mechanism (SE detector or energy analyzer). Hence, a contrasting image can be developed based on these varying levels of electron pickup. The resulting image is called a voltage contrast image. Figures 10.11 and 10.12 illustrate this principle.

Figure 10.11. E-beam probing technique.

Figure 10.12. Voltage contrast image (with E-beam probing): normal image (left), low potential (0 V) area highlighted (right).

E-beam probing requires a line of sight to the metal line that is to be probed. It also requires a certain clearance from neighboring metal lines so that the reading will not be distorted by the charge on these lines. Thus, certain layout design rules have to be followed (see Section 10.5). The passivation or interlayer dielectrics (insulation) also have to be stripped off completely so that the electron beams can hit the metal line directly and secondary electrons can scatter freely. For signals carried on lower levels of metal, one can only observe them through the spaces between the upper-level metals or by cutting holes through wide power buses (so that the underlying layers can be exposed). E-beam probing is almost impossible for signals carried on deeper layers (e.g., M2 [metal-2] or M3 [metal-3] in a six-layer metal system), so this technology is not effective for modern-day multiple-layer metallization chips. By detecting logic states in time, E-beam probing can be used to recreate the timing waveform on a signal, as shown in Figure 10.13.

Figure 10.13. E-beam probing, which can show logic state map at any clock (left) or timing waveform at any node (right).

Laser Voltage Probing

E-beam probing as a debug technique was impacted by the introduction of the controlled collapse chip connection (C4) flip-chip packaging technology. In flip-chip packaging, solder bumps replace the conventional bond wires to improve performance. This is achieved by forming solder bumps on existing die bond pads and corresponding mirrored pads on the surface of the package. The die is then flipped and placed on top of the corresponding package. Through reheating, the solder on both sides fuses and forms a small solid solder connection, much like surface-mount printed-circuit board (PCB) assembly, albeit with much smaller solder bumps. Besides the lower inductance of each bump, many more such solder bumps can be placed all over the die area, resulting in thousands of parallel paths, which lower the resistance and inductance of the chip’s power grid connections. Also, because all the bumps can be fused in a single reheat process, productivity is superior to serial wire bonding.

With this kind of packaging, the surface of the die is out of view. One can only see the backside of silicon as seen in Figure 10.14. The metal wiring side is hidden between the die and the package. Probing from the backside of the silicon poses a new challenge.

Figure 10.14. Wirebond packaging (left) and flip-chip packaging (right).

With E-beam, backside probing would require thinning and etching the die from the backside to expose M1, as shown in Figure 10.15. However, such removal of material would require the M1 exposure to be reasonably far away from active silicon so as not to impact device and circuit performance. This would require debug design rules that are not density friendly.

Figure 10.15. E-beam probing scheme for backside probing.

Additionally, each M1 (metal-1) exposure would have to be done using a direct write etching technique, such as FIB nano-machining, to expose the lines without damaging adjacent transistors. Such techniques are time consuming and are not conducive to in situ signal probing.

A solution to the problem of probing from the backside came from material characteristics. The technique is called laser voltage probing (LVP) [Paniccia 1998] [Yee 1999]. It was noticed that infrared (IR) light can be observed (i.e., transmitted) from silicon, as illustrated in Figure 10.16, as a result of electron-hole recombination. Because IR can be emitted and observed through silicon, it can also be transmitted into and through the silicon. Not only is silicon transparent to IR emission, but the infrared light energy can also be reflected off a charged object. The phase change of this reflected light can be analyzed and a voltage contrast image reconstructed. Light absorption by silicon does require the silicon to be thinned down to about 50 μm, but that can be supported through general silicon planarization technology (e.g., chemical-mechanical polishing [CMP]) as well as a focused ion beam (FIB).

Figure 10.16. Backside view (left) using IR versus frontside view (right) at the same location of the die.

Figure 10.17 illustrates how a system can be set up to allow a laser pulse to be focused on the device junctions and the reflected energy to be observed with sensitive photodiodes and further processed to extract voltage information. Figure 10.18 illustrates in more detail how the field around a junction will modulate the reflected energy. This variation of reflected energy provides one with the ability to figure out the voltage transitions happening at the junction.

Figure 10.17. IR prober setup.

Figure 10.18. Laser voltage probing (LVP) theory of operation.

Emission-Based Probing

As mentioned earlier, in the current generation of integrated circuits using multiple metal layers and flip-chip packaging, inspection of the circuits using E-beam probing from the front side (top side) of the die or wafer is difficult. Fortunately, silicon is transparent at infrared wavelengths. Thus, by sensing infrared emission from the backside of the die or wafer, it is possible to examine circuits in a noninvasive manner.

Infrared Emission Microscopy (IREM)

Complementary metal oxide semiconductor (CMOS) failures often result in unwanted or unexpected electron-hole recombination. This electron-hole recombination is accompanied by weak photon emission. Light is also emitted by intraband carrier transition in high field situations. Sensitive and long-exposure cameras mounted in dark boxes can register these emissions and the associated emission sites. These emissions can be overlaid on a reference image to determine the location of the lighted object. This technique is also known as emission microscopy (EMMI) [Uraoka 1992].

A sensor array collects near-IR radiation in the 800- to 2500-nm wavelength range. Combined with advanced camera optics and precise stage movement, this emission detection method allows highly accurate pinpointing of the location of emissions. The tester is docked to the infrared emission microscope (IREM), and failing patterns are applied to the circuit under test. Emissions are observed from the backside through a thinned substrate [Liu 1997] [Chew 1999]. A comparison can be made between passing and failing units. An abnormal emission seen on the failing unit but not on the passing unit indicates that a defect is likely associated with the abnormal emission, and localization techniques such as alignment to a layout database can be used to determine the location of the emission [Loh 1999].

The die substrate is thinned, and an antireflective coating (AR coating) is applied to the backside of the die to allow a more uninterrupted light path when the die is observed through the backside. Strong abnormal emission can indicate a variety of circuit or process defects, as shown in Figure 10.19. Saturated transistors, devices with contention (because of improper control), and high-leakage devices give out strong emission.

Figure 10.19. Emissions observed on an IREM.

In addition to observing abnormal emissions, it is also possible to map the normal emissions from transistor diffusions to the logic state as shown in Figure 10.20. N-diffusion emissions can be mapped to logic state 1 and P-diffusion emissions mapped to logic state 0. This is called logic state imaging [Bockelman 2002]. Emissions are observed on the N and P diffusions of inverters. If the input to the inverter is logic 0, the N-diffusion region emits; if the input to the inverter is logic 1, the P-diffusion region emits. The emission mechanism being observed is soft avalanche breakdown, in which a high concentration of electron-hole plasma in the drain region recombines in part through indirect radiative recombination [Bockelman 2002].

Figure 10.20. Emissions observed on an IREM mapped to logic state.

Picosecond Imaging Circuit Analysis (PICA)

When electrons speed through the drain-source region, they emit optical energy (mainly in the infrared band) in the process. This emission occurs during switching [Kash 1997] [Knebel 1998] [Sanda 1999], and by detecting this optical energy in real time it is possible to identify when a transistor switches. With a high-speed simultaneous spatial detection system using an image sensing array, it is possible to detect how the transistors of a local area switch in sequence. After the data are captured, they can be analyzed by replaying them at a much slower speed for more detailed examination. This process is termed picosecond imaging circuit analysis (PICA). For example, if a clock distribution system is observed, the light pulses start appearing at the first stage of the clock driver tree and then spread to different areas of the chip as successive stages of the clock tree fire. Then the process repeats itself with subsequent clock switching. The optical energy from N-transistors is much stronger than that from P-transistors, marking the relative switching between the pulldown and pullup networks. This is useful in identifying relative timing problems, especially when the circuits are nearby and within the field of the detection system.

Because this is optical detection only, it will not disturb the dynamics of the switching event around the device junctions and will preserve the timing behavior of the signals. An additional advantage is that this light detection is possible from both the front side as well as the backside of silicon, making it more versatile for various packaging technologies.

In Figure 10.21, the respective N/P-transistors switch with the input signal. The switching energy of the N-transistor is much stronger than that of the P-transistor, making the detection of individual edges possible.

Figure 10.21. Optical pulses as revealed by PICA.

Time Resolved Emissions (TRE)

The use of an imaging array in PICA produces pictures of how the emission occurs over an area of the circuit. However, an array imaging sensor is slow and requires long sampling times. For repetitive signals such as a clock, this is not an issue. However, it is ineffective for the small emissions from weak transistors with low signal switching rates. Moreover, technology scaling and the associated voltage scaling also reduce emission and shift the emission spectrum to longer wavelengths. Many authors [Bodoh 2002] [Varner 2002] [Vickers 2002] have improved upon the technique with a more sensitive single-element photon-counting detector to detect the emission from a specific node in question; they have named the technique time resolved emissions (TRE). A high-speed, low-noise InGaAs avalanche photodiode with below-breakdown bias and gated operation is typically used. Figure 10.22 shows the waveform caused by the switching activity of a signal using TRE.

Figure 10.22. Switching emissions observed on a TRE.

Circuit Editing

Circuit editing is performed to either enable probing or make corrections to the circuit to verify root-cause fixes. Circuit editing is also called silicon microsurgery (or, more recently, nanosurgery). It involves either removing or adding material. Removal operations typically include cross-sectioning for observation, trenching to access and probe signals, and cutting signals. Short wires can also be deposited to create new connections. This is achieved with a tool called a focused ion beam (FIB). The FIB, the DFD structures that enable circuit editing, and the layout-database-driven navigation systems that support it are described next.

Focused Ion Beam

A focused ion beam (FIB) itself is not a probing tool but rather an enabler for mechanical or optical probing; more importantly, it offers a means to do some patching at the silicon level to confirm a hypothesis of what an error is or how it can be fixed.

A FIB used for circuit rewiring combines a high-energy particle beam (typically gallium) with locally introduced gas species to enable a desired ion beam–assisted chemical etch or ion-beam–induced chemical vapor deposition (CVD) as shown in Figure 10.23. This capability is analogous to a direct-write back end fab, where metal traces and devices can be cut or trimmed and rewired using dielectric and metal deposition.

Figure 10.23. Focused ion beam.

With the right chemistry, an ion beam can also react in a chemical atmosphere to form a solid, resulting in deposited material on the surface of the die. The deposited material is not crystalline and forms a resistive line structure, playing the role of the “blue wires” for patching up circuitry on silicon (i.e., rerouting circuits).

Layout-Database-Driven Navigation System

To navigate around the die, one needs to make use of a layout-database-driven navigation system to position an E-beam, IR beam, or FIB to the location of the devices, wires, or circuits that one would like to observe or repair. Because of the complexity of the chips, manual navigation is almost impossible.

Once certain reference coordinates of the die can be located, all other device or wire locations can be accessed through an automated system so that one can call up any circuit or signal with ease. The reverse process is also possible. If a signal is traced to a different part of the layout, one can also look up the database to find out what the design layout should look like. Figure 10.24 shows a layout-database-driven navigation system. Signals can be looked up in a schematic viewer and mapped to their corresponding layout polygons (see Figure 10.25). These polygons and their coordinates are communicated to the FIB stage driver, which moves the FIB to the right location on the silicon.

Figure 10.24. A layout-database-driven navigation system.

Figure 10.25. Mapping of probed topological information to schematics and then to high-level logic models.

Spare Gates and Spare Wires

Based on the data collected from logic DFD and probing, a hypothesis of what the problem is may be formulated. However, the hypothesis needs confirmation. Further, other problems may lurk beneath the surface, hidden by the problem being debugged. One way to answer these questions would be to implement the circuit with logic fixes and tape out a new chip. However, this would push out any further validation by several months and can cost more than a million dollars in mask and other silicon fabrication expenses. Validating that a hypothesis is correct and continuing further validation requires patching up the circuit and reexecuting all the validation tests. With a FIB, we can cut and repatch wires, but what about transistors? How can we extend the patching concept to circuits in silicon, where adding transistors is impossible, unless we have the transistors already in place? This is where spare gates and spare wires (also called bonus gates and bonus wires) [Livengood 1999] come in handy. If spare transistors or even gates are placed in the layout at strategic locations or even randomly where space is available, one may find the replacement transistors or gates nearby and patch the circuit up so as to perform a fix. Even though it is tedious to cut and reconnect a bunch of wires using FIB, it is still far faster than the redesign-tapeout-fabrication route, and it costs much less. This kind of patching is indispensable because time-to-market or time-to-manufacturing is critical for any product. Ultimately a new tapeout is still needed to get robust fixes and correct silicon, but a temporary fix verifies that the hypothesis is correct and makes sure that the initial bug did not mask any problems.

Figure 10.26 illustrates a preplanned layout with a bonus AND gate on a circuit diagram. The inputs are normally shorted to Vcc and Vss so that it is dormant. If this gate is needed, then the inputs are cut from the tie-downs, and the gate is then connected to the respective signals (X and Y). The output of the gate can be hooked up to several routing wires and will be isolated with more cuts to drive the desired signal.

Figure 10.26. Sample spare gate and spare wire usage with FIB patching.

Physical DFD Structures

The probing technologies outlined in the previous section cannot be successful without the appropriate support from the layout design. We have collectively termed these layout changes physical DFD (PDFD). The term “physical” also refers to “physical design,” which is another name for layout design. Physical DFD features are implemented in the layout design to enable probing and circuit editing.

Physical DFD for Pico-Probing

To facilitate probing, one should designate well-thought-out probe locations with plenty of open space around them to ease probing and to avoid accidental shorting. To facilitate pico-probing, the probe pads have to be reasonably large so that landing the probe does not present a problem. Care must also be taken not to place active signal wires underneath a pad to avoid leakage, should the pressure of the tip crack the interlayer dielectric. To help with planarization, buffer (dummy) metals should be placed under the pad’s footprint.

Physical DFD for E-Beam

Specific design rules have to be developed to space out the metal lines so that they do not interfere with the signal pickup from the lower metal layers. This is because the lower-layer metal lines are only visible through the gaps between metal lines on the upper layers. For fat metal buses (e.g., power buses), specific design rules also have to be developed so that holes of reasonable size can be cut to expose lower-level signal lines for observation.

Figure 10.27 shows all the features that are needed for exposing signals from various layers to the E-beam; the example uses a five-layer metallization. Because the top layers may contain wide power buses, cutting holes or notching is necessary to expose signals on the lower layers. It is also preferable to widen the signal lines at the locations where they are to be probed (the designated probe pads) to allow maximum reflectivity of the electrons.

Figure 10.27. Sample E-beam layout openings illustration.

Physical DFD for FIB and Probing

With flip-chip packaging, the front side of the die is not available (or not visible), and there is no geometry or marking on the backside to indicate relative position. Even for probing, the location where the silicon is to be thinned must be identified so that the probe point is not missed when thinning with a FIB. Specific markers formed by diffusion therefore have to be placed in the layout at the four corners of the die, as shown in Figure 10.28. These markers are visible in the infrared, so they can be used to orient the die precisely.

Figure 10.28. Flip-chip markers on silicon for infrared probing alignment.

In general, the layout rules are driven by the positional precision of the FIB. Spacing between the cut point and the surrounding geometries has to be allocated generously so that the beam does not cut into other, sensitive areas. Sometimes it may be preferable to plan and design dedicated cut sites to facilitate this.

During probing, because infrared energy can disturb the nodes under observation, it is preferable to use a specific test structure away from the transistor junction as the probe point. The plasma protection diode or an additional reverse-biased diode at the input of the gate is typically used. Although this adds a small load, it does not interfere with the switching properties of the transistor.

Diagnosis and Debug Process

Figure 10.29 shows a generic diagnosis flow. The diagnosis process starts with the test results, which capture all the observed failures, and maps them to defects or errors. These defects or errors then serve as the starting point either for repair (replacement or redesign) or for finding the root cause for process improvement, depending on the goals of the diagnosis process. Depending on the objectives of the diagnosis flow and the type of system under diagnosis, the defects or errors may include defective components (for example, a faulty IC on a board, a faulty cell in a random-access memory [RAM], a faulty column of cells in a field programmable gate array [FPGA], or a faulty board in a system), defective interconnections (shorts or opens), logic implementation errors, timing errors, or IC manufacturing defects.

Figure 10.29. A generic diagnosis flow.

The diagnosis process typically uses information about the circuit model and the applied tests, along with any other data that may be available. Fault models [Aitken 1995] [Wang 2006] are also commonly used as a means to arrive at the final defects or errors for consideration, as shown in Figure 10.29.

Defects are fabrication errors caused by problems in the manufacturing process or human errors. They may also be physical failures that are caused by wearout or environmental factors. Manufacturing defects are unpredictable in both location and effect, and the processes that cause them are continuous over a wide range of variables (e.g., a short between two adjacent wires may occur anywhere along their length with a range of possible resistances, capacitances, etc.). Ideally, the test process would take every defect into account, develop a test for each, and apply these tests during manufacturing testing. However, because the space of all possible defects is continuous and unpredictable, there is no way to apply a finite number of tests that are guaranteed to detect everything in that space. The complexity of defect behavior makes a strict defect-based test approach impossible; some simplification is necessary.

The defects themselves can be approximated. This approach is usually taken both during testing and during diagnosis [Abramovici 1994] [Wang 2006]: the infinite defect space is approximated by a finite set of faults. A fault is a deterministic, discrete change in circuit behavior. It is important to stress that the fault is an approximation of defective behavior, not a true representation; by this definition, a fault can never be “found” during diagnosis. As noted earlier, fault models serve as the means to arrive at the final defects or errors for consideration (see Figure 10.29). Faults are often thought of as localized within a circuit (e.g., a particular gate is broken), but they may also be thought of as transformations that change the Boolean function implemented by a circuit. Many fault models are timing independent, whereas a few include timing behavior explicitly. The most commonly used fault models include stuck-at faults, bridging faults, delay faults (including path-delay faults and transition faults), and functional faults. The choice of fault model depends on its intended use (test generation, manufacturing quality prediction, defect diagnosis, characterization for defect tolerance, etc.).

Fault models are an integral part of the fault diagnosis process and thus help find the root cause of the defective device under consideration. In its most basic form, a fault model is used to predict the behavior of faulty circuits, compare these predictions to the actual observed behavior of defective chips, and identify the predicted behavior that most closely matches the observations. An analogy to this process is a detective story. A fault can be likened to a criminal or suspect that exists in the circuit model. The fault may be permanent, intermittent (with alibis most of the time), or transient (hit-and-run) depending on the nature of the fault model. An error (or fault effect) is created by applying stimulus to activate the fault (provoke the criminal) at the fault site. The fault effect may propagate through the circuit and be detected. A detection implies that the fault effect has propagated to an observation point and can be observed as an error or failure. Figure 10.30 illustrates fault effect propagation. The goal of the process is to enable further analysis by identifying promising locations for further study.

Figure 10.30. A fault with its propagated fault effect leading to an observed error.
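
To make the predict-and-match step concrete, here is a minimal, self-contained sketch in Python. The three-input circuit, its node names, and the restriction to single stuck-at faults are illustrative choices, not part of any specific diagnosis tool; the sketch simulates each fault, builds a small cause-effect dictionary, and matches an observed failure signature against it.

```python
# Minimal sketch of cause-effect fault prediction with single stuck-at faults.
# The two-gate circuit (n1 = a AND b, out = n1 OR c) is hypothetical.
from itertools import product

NODES = ("a", "b", "c", "n1", "out")

def simulate(inputs, stuck=None):
    """Evaluate the circuit; 'stuck' optionally forces one node to 0 or 1."""
    def v(node, computed):
        return stuck[1] if stuck and stuck[0] == node else computed
    a, b, c = inputs
    a, b, c = v("a", a), v("b", b), v("c", c)
    n1 = v("n1", a & b)
    return v("out", n1 | c)

tests = list(product([0, 1], repeat=3))            # exhaustive tests for 3 inputs
good = [simulate(t) for t in tests]                # fault-free responses

# Cause-effect dictionary: fault -> set of failing test indices.
dictionary = {}
for fault in ((node, val) for node in NODES for val in (0, 1)):
    failing = {i for i, t in enumerate(tests) if simulate(t, stuck=fault) != good[i]}
    if failing:                                    # keep only detectable faults
        dictionary[fault] = failing

# Pretend the defective chip behaves like n1 stuck-at-0, then match.
observed = dictionary[("n1", 0)]
candidates = [f for f, sig in dictionary.items() if sig == observed]
print("candidate faults:", candidates)             # a/0, b/0, and n1/0 are equivalent
```

Note that the three equivalent faults (a stuck-at-0, b stuck-at-0, n1 stuck-at-0) cannot be distinguished by any test on this circuit, which illustrates why diagnosis often ends with a small candidate set rather than a single fault.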

Diagnosis Techniques and Strategies

The diagnosis process can employ a variety of methodologies. Some common types of diagnosis are enumerated below:

  • One-step or nonadaptive. The test sequence is fixed and does not change during the diagnosis process.

  • Multistep or adaptive. Tests applied in a subsequent step of diagnosis depend on the results of the previous steps (a minimal sketch of this adaptive strategy follows the list).

  • Static. Diagnostic information is precomputed for all possible faults before testing.

  • Dynamic. Some diagnostic information is computed during diagnosis based on the actual device under test (DUT) response.

  • Cause-effect. Computes faulty responses for a set of faults.

  • Effect-cause. Analyzes the actual DUT response to determine compatible faults.

  • One-step. Without replacement.

  • Multistep. Alternate retest and replacement steps.
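
As a concrete illustration of the multistep (adaptive, dynamic) style, the sketch below greedily chooses the next test that best splits the remaining suspects and then discards suspects that are incompatible with the observed pass/fail result. The prediction table, fault names, and the simulated device are all hypothetical; the point is only the shape of the adaptive loop.

```python
# Minimal sketch of adaptive (multistep) diagnosis.
# 'predictions' maps each candidate fault to its predicted failing tests;
# the table and the simulated "true" defect are hypothetical.

predictions = {
    "f1": {"t1", "t3"},
    "f2": {"t1", "t2"},
    "f3": {"t2", "t3"},
    "f4": {"t3"},
}
all_tests = {"t1", "t2", "t3"}
true_defect = "f3"  # stand-in for the device under test

def apply_test(test):
    """Pretend to run a test on the DUT; True means the test failed."""
    return test in predictions[true_defect]

suspects = set(predictions)
remaining_tests = set(all_tests)
while len(suspects) > 1 and remaining_tests:
    # Pick the test that splits the suspect set most evenly (most informative).
    def split_quality(t):
        fails = sum(1 for s in suspects if t in predictions[s])
        return min(fails, len(suspects) - fails)
    test = max(remaining_tests, key=split_quality)
    remaining_tests.discard(test)
    failed = apply_test(test)
    # Keep only suspects whose prediction is compatible with the observed result.
    suspects = {s for s in suspects if (test in predictions[s]) == failed}
    print(f"applied {test}: failed={failed}, suspects={sorted(suspects)}")

print("final suspects:", sorted(suspects))
```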

In line with the detective story, the common elements of most diagnosis methods include the following:

  1. Be prepared.

  2. Assume, suspect, and exclude.

  3. Track them down.

Being prepared involves storing information that may potentially be useful for analysis later. In criminal investigation, this may involve fingerprinting the entire population, particular communities, or just those individuals with a prior criminal record. This is the strategy employed by cause-effect fault dictionaries [Abramovici 1994] [Wang 2006]. However, fingerprinting the entire population is rarely feasible, and deciding which communities to fingerprint is tedious at best. The challenges with this criminal investigation strategy have direct analogies in diagnosis. Which faults or fault models (stuck-at faults, bridging faults, delay faults, etc.) should be considered? Is it feasible to store the responses of all faults?

How do we process all the prestored information during analysis? In criminal investigation, the corresponding questions include how to match fingerprints found at the crime scene, what to do when the criminal has not been previously fingerprinted, and what to do when several criminals committed the crime and their fingerprints are mixed.

Analogous situations occur in diagnosis: using a single-stuck-at fault dictionary when in reality the defect is a short that behaves close to a bridging fault, or using a single-stuck-at fault dictionary when the defect behaves like multiple stuck-at faults. Dealing with these situations requires diagnostic models and algorithms that can deal with partial matches. Several diagnostic models and algorithms have been developed and successfully employed [Waicukauski 1989] [Venkataraman 2001] [Guo 2006]. More diagnosis approaches can also be found in [Wang 2006].
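
A simple way to tolerate such mismatches is to rank dictionary entries by a partial-match score instead of demanding an exact match. The sketch below uses a Jaccard-style score over failing-test sets; the signatures and net names are hypothetical, and real diagnosis tools use considerably more elaborate scoring and per-output matching.

```python
# Minimal sketch of partial-match scoring between an observed failure signature
# and a stuck-at fault dictionary; the signatures and net names are hypothetical.

dictionary = {
    "netA stuck-at-0": {"t1", "t4", "t7"},
    "netB stuck-at-1": {"t2", "t4"},
    "netC stuck-at-0": {"t1", "t4", "t7", "t9"},
}

# A bridging defect may fail a superset or subset of what any single
# stuck-at fault predicts, so no entry matches exactly.
observed = {"t1", "t4", "t7", "t8"}

def match_score(predicted, observed):
    """Jaccard-style score: 1.0 is a perfect match, 0.0 shares nothing."""
    return len(predicted & observed) / len(predicted | observed)

ranking = sorted(dictionary.items(),
                 key=lambda item: match_score(item[1], observed),
                 reverse=True)
for fault, predicted in ranking:
    print(f"{fault}: score={match_score(predicted, observed):.2f}")
```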

The “assume, suspect, and exclude” strategy used in criminal investigation applies equally well to the diagnosis problem. Make some initial assumptions about the criminals (faults in the case of diagnosis). Do they act alone (single faults) or in gangs (multiple faults)? Are they always bad (permanent faults) or only sometimes (intermittent faults)? Where are they coming from (stuck-at faults, bridging faults, functional faults, etc.)? The next step involves rounding up the plausible suspects and then narrowing down the set of suspects. In the words of Sherlock Holmes [Doyle 1905]: “One should always look for a possible alternative and provide against it. It is the first rule of criminal investigation.” In criminal investigation, this involves checking existing data for alibis. In diagnosis, this may require posttest fault simulation on the data collected from the on-chip logic DFD and clock control features discussed in Section 10.2.5. One could also perform experiments to provide alibis (adaptive diagnosis). Quoting Sherlock Holmes again: “When you have eliminated the impossible, whatever remains, however improbable, must be the truth.” This step is performed using the on-chip DFD features and collecting failure information on a tester. This leads to the final step: when there are no more suspects, change the initial assumptions and restart. In diagnosis, this may involve starting over with a new fault model.

The final strategy involves tracking down the suspects, or backtracing in diagnosis. “There is no branch of detective science which is so important and so much neglected as the art of tracing footsteps” [Doyle 1888]. Tracking down or backtracing involves starting from the available facts (observed errors in the device), tracing back through the city (tracing through the circuit) and back in time (tracing through previous test vectors), and following the paths that lead to the criminal(s) (the actual fault[s]). This step uses the physical DFD structures on-chip and the physical tools described earlier to retrieve information from inside the chip.
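
A minimal structural form of backtracing can be sketched as intersecting the fanin cones of the failing observation points: any single fault that explains all of the observed errors must lie where the cones overlap. The toy netlist below is hypothetical, and real backtracing also uses logic values and timing, but the cone intersection conveys the idea.

```python
# Minimal sketch of structural backtracing: starting from failing observation
# points, trace back through the fanin cones and intersect them to narrow the
# suspect region. The netlist (node -> list of fanin nodes) is hypothetical.

fanin = {
    "out1": ["g3", "g4"],
    "out2": ["g4", "g5"],
    "g3":   ["g1", "a"],
    "g4":   ["g1", "g2"],
    "g5":   ["g2", "b"],
    "g1":   ["a", "b"],
    "g2":   ["b", "c"],
    "a": [], "b": [], "c": [],
}

def fanin_cone(node):
    """Return the set of nodes that can influence 'node' (including itself)."""
    cone, stack = set(), [node]
    while stack:
        n = stack.pop()
        if n not in cone:
            cone.add(n)
            stack.extend(fanin[n])
    return cone

failing_outputs = ["out1", "out2"]   # observed errors at these observation points

# A single fault that explains all failures must lie in the intersection of the cones.
suspects = set.intersection(*(fanin_cone(o) for o in failing_outputs))
suspects -= set(failing_outputs)     # exclude the observation points themselves
print("suspect nodes:", sorted(suspects))
```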

Silicon Debug Process and Flow

The silicon debug process starts with the identification of bugs, which can be classified as logic (functional) or electrical. The identification of functional or logic bugs occurs during functional validation on a tester or system platform using architectural and design validation tests. Code generation using pseudo-random instructions and data is also typically used for microprocessors. Operation is performed at safe operating points (nominal voltage, temperature, and frequency) to ensure that only logic bugs are exposed.

Electrical characterization is performed to expose circuit sensitivities or marginalities and timing or critical speed-path issues. Testing is performed at the extremes of the operating region [Josephson 2001] by varying frequency, voltage, and temperature. Characterization with respect to process variation is performed using skewed wafer lots in which the process parameters are intentionally varied.

Debug can start either on a tester or in a system environment. The tester environment is easier to handle and more flexible [van Rootselaar 1999] [Josephson 2001], as there is full control of all pins of the chip and full access to all the DFD features described earlier. Patterns can be looped (to allow for signal sampling) to perform probing using the physical tools described in previous sections. The operation of the chip in this environment is fully deterministic.

In contrast, the system environment is cheaper, and it is easier to generate and apply system-level functional patterns. Furthermore, many issues are only uncovered when running real programs (OS, drivers, and applications). However, nondeterminism caused by system events such as interrupts and memory refreshes needs to be factored in. It is also harder to control the chip in a system environment because of the lack of access to all the debugging features. The typical approach is to quickly locate sections of failing tests on a system and try to replicate the problem on a tester for further debugging [Holbrook 1994] [Hao 1995] [Carbine 1997] [van Rootselaar 1999] [Josephson 2001].

A typical debug flow consists of three steps:

  1. Find a pattern that excites the bug.

  2. Find the root cause of the bug.

  3. Determine a fix to correct the bug.

The first step is finding a test pattern that excites the bug. Such a pattern may be found during system validation, which was described earlier, or in a customer sighting. Very often the pattern is already in use on the tester or pops out of the high-volume manufacturing (HVM) test environment because of process shifts. It is also possible to craft special tests using sequential ATPG and existing scan to generate patterns for embedded arrays and queues [Kwon 1998].

Finding the root cause of the bug requires three steps. The first step involves using the logic DFD features to extract data from the chip. Next, simulation tools and deductive reasoning are used to arrive at hypotheses. This process is described in more detail next. Finally, probing using the physical DFD features and probing tools is employed to confirm a hypothesis while eliminating others. FIB edits and modified tests are then used to verify the hypothesis.

Debug Techniques and Methodology

The debug techniques used to arrive at hypotheses for the observed failures (bugs) involve using the logic DFD features to perform two operations:

  1. Controlled operation of the chip

  2. Getting internal access (to signals and memories) in the chip

Controlled operation of the chip is needed to stop the chip close to the point of the first internal error and place it in a test mode. This is accomplished by (1) trigger mechanisms programmed to events, (2) clock control mechanisms, and (3) clock manipulation.

Trigger mechanisms programmed to events [Carbine 1997] [van Rootselaar 1999] [Josephson 2001] involve using control registers [Carbine 1997], probe modes in the system to monitor states, and matchers and brakes [van Rootselaar 1999] to watch for specific events and stop the chip when they occur. Mechanisms for stopping on trigger events are also called breakpoints.

Clock control mechanisms are used to step one clock at a time or step through a specified number of clocks. Clock manipulation involves skipping cycles or phases, moving clock edges by stretching or shrinking specific clock cycles, and skewing a clock region relative to other clock regions as described in Section 10.2.5.

Internal access (to signals and memories) in the chip typically involves taking scan snapshots (see Section 10.1) and dumps. These may be destructive or nondestructive to the functional state depending on the scan style (destructive or observation-only scan). Sample on the fly (scanout) involves capturing the state and shifting it out while the chip is still in operation. This can be accomplished with the scanout or observation-only scan structure described in Section 10.2. It is possible to restart after reloading the scan state and initializing array states. Of course, absolute full-scan and full array initialization (to restore all the states) is required to accomplish this end. Freezing arrays and dumping observation-only registers are other observation mechanisms that are typically used.
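
The interplay of a programmed trigger (matcher/breakpoint) with a state dump can be sketched behaviorally. The following Python model is purely illustrative (it is not RTL, and the class names, signals, and match condition are invented): the controller clocks a device model until the programmed event fires, then freezes it and returns a snapshot, standing in for an observation-only scan dump.

```python
# Behavioral sketch (not RTL) of a trigger/breakpoint plus scan-dump flow.
# The class names, signals, and match condition are hypothetical.

class DebugController:
    def __init__(self, match_condition, max_cycles):
        self.match_condition = match_condition  # programmed trigger event
        self.max_cycles = max_cycles

    def run(self, chip):
        """Clock the chip until the trigger fires, then freeze and dump state."""
        for cycle in range(self.max_cycles):
            state = chip.step()                 # advance one clock
            if self.match_condition(state):     # matcher fires -> breakpoint
                return cycle, dict(state)       # snapshot of internal state
        return None, None

class ToyChip:
    """Stand-in for the DUT: a counter plus a derived flag of interest."""
    def __init__(self):
        self.state = {"counter": 0, "flag": 0}
    def step(self):
        self.state["counter"] += 1
        self.state["flag"] = int(self.state["counter"] % 7 == 0)
        return self.state

chip = ToyChip()
ctrl = DebugController(match_condition=lambda s: s["flag"] == 1, max_cycles=100)
stop_cycle, snapshot = ctrl.run(chip)
print(f"breakpoint at cycle {stop_cycle}, scan dump: {snapshot}")
```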

A typical debug methodology involves the following steps:

  1. Using a trigger to get close to the cycle of interest

  2. Sampling on the fly to narrow the range of clock cycles (a minimal sketch of this narrowing search follows the list)

  3. Using clock manipulation or scan dumps to isolate the cycle of interest

  4. Taking a complete internal observation at the offending cycle

  5. Using simulation and deductive reasoning to isolate the offending circuitry
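
Steps 2 and 3 of this methodology amount to a search over clock cycles for the first point at which the internal state deviates from expectation. A minimal sketch of that search as a binary search is given below; the observed_state and expected_state functions are hypothetical stand-ins for a scan dump taken on the tester and a reference simulation, and the failing cycle is hard-coded only to fake the data.

```python
# Minimal sketch of narrowing down the first failing cycle by comparing
# sampled internal state against simulation (a binary search over cycles).
# Both state functions are hypothetical stand-ins.

FIRST_BAD_CYCLE = 1342          # unknown to the debugger; used only to fake data

def expected_state(cycle):
    return ("good", cycle)       # pretend reference-simulation state at this cycle

def observed_state(cycle):
    # Pretend scan dump: correct before the first internal error, corrupted after.
    return ("good", cycle) if cycle < FIRST_BAD_CYCLE else ("bad", cycle)

def find_first_failing_cycle(lo, hi):
    """Binary search for the earliest cycle whose dump mismatches simulation."""
    while lo < hi:
        mid = (lo + hi) // 2
        if observed_state(mid) == expected_state(mid):
            lo = mid + 1         # still good at 'mid'; error is later
        else:
            hi = mid             # already bad at 'mid'; error is at or before it
    return lo

print("first failing cycle:", find_first_failing_cycle(0, 10_000))
```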

Concluding Remarks

Silicon debug and diagnosis is a complex field that employs a wide set of technologies (architecture, logic, circuit and layout design, simulation at different levels, instrumentation, optics, metrology, and even chemistries of various sorts). Logical deduction and the process of elimination are of paramount importance and central to debug and diagnosis. Design for debug and diagnosis (DFD) features are critical to enable an efficient debug and diagnosis process.

Debug and diagnosis requires bringing together people with various skills, being open minded, and welcoming or embracing new technologies to keep ahead of challenges. As technology scales and designs grow in complexity, the debug and diagnosis processes are constantly challenged. Although much progress has been made, many challenges lie ahead.

The operating voltage continues to scale down to keep reliability in check and to reduce power; it is already close to 1 V. Further scaling will make circuits operate in the subthreshold region, which will further reduce the emissions from the electron-hole recombination process. Lower voltage combined with smaller capacitance also leads to ever-smaller charge stored at any node. This smaller charge makes every node more sensitive to any kind of injected energy, be it optical or electromagnetic in nature. We are approaching the point where the Heisenberg uncertainty principle applies: the very act of observing something changes its nature.

Power dissipation, and in particular leakage power, has become an important issue for deep-submicron integrated circuits, with leakage contributing more than 30% of the total power dissipation. Thermal density and heat removal need to be carefully considered in the physical debug and probing processes. In addition, designs are becoming more power-aware, and adaptive or dynamic control of voltage and frequency to regulate power is being adopted. Debugging failures in the presence of such adaptive or dynamic control is a major challenge.

Because automatic test equipment (ATE) or testers are barely able to keep up with the high pin count and high frequencies of device operation in functional mode, more on-chip DFD features are expected to be needed in the future. More DFD features that monitor a device actively (e.g., on-chip power droop, device variation across the die, and on-chip temperature) will be needed. On-chip DFD features that facilitate in situ system debug will become increasingly important [Abramovici 2006].

Tools to enable automated diagnosis continue to evolve, while tools for debug are still relatively immature. Tools that identify sensitive areas of the design, add DFD features in an automated manner, generate tests for validation, debug, and diagnosis, and automate the debug process all need to advance.

Technologies presented in this chapter will need to continue to advance to keep up with CMOS scaling. Following Moore’s law, chip complexity will double every 2 years, while we expect ever fewer design resources and ever faster time-to-market and time-to-profitability. Effective and efficient debug and diagnosis are certainly critical to the success of products. Better debug and diagnosis capabilities and faster fixes to problems are constant imperatives. It is expected and hoped that many of the readers will contribute solutions to this challenge.

Exercises

10.1

(Debug versus Diagnosis) What are the differences between silicon debug and low-yield analysis? Give at least three attributes.

10.2

(Logic DFD Structures) What combination of logic design for debug and diagnosis (DFD) features is typically needed to identify slow circuit paths?

10.3

(Logic DFD Structures) The purpose of introducing intentional clock skew to a clock domain is to allow more time for the signal to arrive at the storage element (flip-flop or latch) in that clock domain. What is the potential undesirable side effect of introducing this clock skew?

10.4

(Logic DFD Structures) What are the advantages of using observation-only scan over using a typical scan design for debug purposes?

10.5

(Probing Technologies) Which set of probing tools allows nonintrusive observations of a signal without disturbing the original signal?

10.6

(Probing Technologies) Which set of probing tools allows the injection of a signal that overrides what is already there?

10.7

(Probing Technologies) Optical probing tools are mostly noninvasive. What are their limitations? What signals really require mechanical types of probing?

10.8

(Circuit Editing) In Figure 10.26, which signals are connected to the inputs and outputs of the AND gate shown in the figure after all the new patches and cuts are made?

10.9

(Physical DFD Structures) Why do we need physical design for debug and diagnosis (DFD) for focused ion beam (FIB)? Give at least two reasons.

10.10

(Physical DFD Structures) For a chip with multiple metal layers, what physical DFD features are needed to use E-beam probing to observe signals at lower-level metal lines? Give at least three specific features.

10.11

(Physical DFD Structures) For backside laser voltage probing (LVP), what are the relevant physical DFD features for a successful probing? Give at least three specific features.

Acknowledgments

The authors wish to thank Dr. Rick Livengood of Intel, Professor Irith Pomeranz of Purdue University, and Dr. Franco Stellari of IBM for providing helpful feedback and comments.

References

Books

Introduction

Logic Design for Debug and Diagnosis (DFD) Structures

Probing Technologies

Circuit Editing

Diagnosis and Debug Process

Concluding Remarks
