Chapter 2. Digital Test Architectures

Laung-Terng (L.-T.) Wang, SynTest Technologies, Inc., Sunnyvale, California

About This Chapter

Design for testability (DFT) has become an essential part of designing very-large-scale integration (VLSI) circuits. The most popular DFT techniques in use today for testing the digital logic portion of VLSI circuits are scan and scan-based logic built-in self-test (BIST). Both techniques have proved to be quite effective in producing testable VLSI designs. Additionally, test compression, a supplemental DFT technique to scan, is growing in importance for further reducing test data volume and test application time during manufacturing test.

To provide readers with an in-depth understanding of the most recent DFT advances in scan, logic BIST, and test compression, this chapter covers a number of fundamental and advanced digital test architectures to facilitate the testing of modern digital circuits. These architectures are required to improve the product quality and reduce the defect level, test cost, and test power of a digital circuit while at the same time simplifying the test, debug, and diagnosis tasks.

In this chapter, we first describe fundamental scan architectures followed by a discussion on advanced low-power and at-speed scan architectures. Next we present a number of fundamental and advanced logic BIST architectures that allow the digital circuit to perform self-test on-chip, on-board, or in-system. We then discuss test compression architectures designed to reduce test data volume and test application time. This includes a description of advanced low-power and at-speed test compression architectures practiced in industry. Finally, we explore promising random-access scan architectures devised to further reduce test power dissipation and test application time while retaining the benefits of scan and logic BIST.

Introduction

With advances in semiconductor manufacturing technology, VLSI circuits can now contain tens to hundreds of millions of transistors running in the gigahertz range. The production and usage of these VLSI circuits have run into a variety of test challenges during wafer probe, wafer sort, pre-ship screening, incoming test of chips and boards, test of assembled boards, system test, periodic maintenance, repair test, etc. The semiconductor industry heavily relies on two techniques for testing digital circuits: scan and logic built-in self-test (BIST) [McCluskey 1986] [Abramovici 1994]. Scan converts a digital sequential circuit into a scan design and then uses automatic test pattern generation (ATPG) software [Bushnell 2000] [Jha 2003] [Wang 2006a] to detect faults that are caused by manufacturing defects (physical failures) and manifest themselves as errors, whereas logic BIST uses a portion of the VLSI circuit to test itself on-chip, on-board, or in-system. To keep up with the design and test challenges [SIA 2003, 2006], more advanced design-for-testability (DFT) techniques have been developed to further address the test cost, delay fault, and test power issues [Gizopoulos 2006] [Wang 2006a]. The evolution of important DFT techniques for testing digital circuits is shown in Figure 2.1.

Figure 2.1. Evolution of DFT advances in digital circuit testing.

Scan design is implemented by first replacing all selected storage elements of the digital circuit with scan cells and then connecting them into one or more shift registers, called scan chains, to provide them with external access. With external access, one can control and observe the internal states of the digital circuit simply by shifting test stimuli into, and test responses out of, the shift registers during scan testing. This DFT technique has proved to be quite effective in improving the product quality, testability, and diagnosability of scan designs [Crouch 1999] [Bushnell 2000] [Jha 2003] [Gizopoulos 2006] [Wang 2006a]. Although scan has offered many benefits during manufacturing test, it is becoming inefficient for testing deep submicron or nanometer VLSI designs. The reasons are mainly that (1) traditional test schemes using ATPG software to target single faults have become quite expensive and (2) sufficiently high fault coverage for these deep submicron or nanometer VLSI designs is hard to sustain from the chip level to the board and system levels.

To alleviate these test problems, the scan approach is typically combined with logic BIST that incorporates BIST features into the scan design at the design stage [Bushnell 2000] [Mourad 2000] [Stroud 2002] [Jha 2003]. With logic BIST, circuits that generate test patterns and analyze the output responses of the functional circuitry are embedded in the chip or elsewhere on the same board where the chip resides to test the digital logic circuit itself. Typically, pseudo-random patterns are applied to the circuit under test (CUT) while their test responses are compacted in a multiple-input signature register (MISR) [Bardell 1987] [Rajski 1998a] [Nadeau-Dostie 2000] [Stroud 2002] [Jha 2003] [Wang 2006a]. Logic BIST is crucial in many applications, in particular, for safety-critical and mission-critical applications. These applications commonly found in the aerospace/defense, automotive, banking, computer, health care, networking, and telecommunications industries require on-chip, on-board, or in-system self-test to improve the reliability of the entire system, as well as the ability to perform remote diagnosis.

Since the early 2000s, test compression, a supplemental DFT technique to scan, is gaining industry acceptance to further reduce test data volume and test application time [Touba 2006] [Wang 2006a]. Test compression involves compressing the amount of test data (both test stimulus and test response) that must be stored on automatic test equipment (ATE) for testing with a deterministic (ATPG-generated) test set. This is done by using code-based schemes or adding additional on-chip hardware before the scan chains to decompress the test stimulus coming from the ATE and after the scan chains to compress the test response going to the ATE. This differs from logic BIST in that the test stimuli that are applied to the CUT form a deterministic (ATPG-generated) test set rather than pseudo-random patterns.

Although scan design has been widely adopted for use during manufacturing test to ensure product quality, the continued increase in circuit complexity of scan designs has started reaching the limit of test power dissipation, which in turn threatens to damage the devices under test. As a result, random-access scan (RAS) design, as an alternative to scan design, is gaining momentum in addressing the test power dissipation issue [Ando 1980] [Baik 2005a] [Mudlapur 2005] [Hu 2006]. Unlike scan design, which requires serially shifting data into and out of a scan cell through adjacent scan cells, random-access scan allows each scan cell to be randomly and uniquely addressable, similar to storage cells in a random-access memory (RAM).

In this chapter, we first cover three commonly used DFT techniques: scan, logic BIST, and test compression. For each DFT technique, we present a number of DFT architectures practiced in industry. Fundamental DFT architectures along with advanced DFT architectures suitable for low-power testing and at-speed testing, which are growing in importance for nanometer VLSI designs, are examined. All of these DFT architectures are applicable for testing, debugging, and diagnosing scan designs. Then, we describe some promising DFT architectures using random access scan to reduce test power dissipation and test application time. For more information on basic VLSI test principles and DFT architectures, refer to [Bushnell 2000], [Jha 2003], and [Wang 2006a]. Advances in fault tolerance, at-speed delay testing, low-power testing, and defect and error tolerance are further discussed in Chapters 3, 6, 7, and 8, respectively.

Scan Design

Scan design is currently the most widely used structured DFT approach. It is implemented by connecting selected storage elements of a design into one or more shift registers, called scan chains, to provide them with external access. Scan design accomplishes this task by replacing all selected storage elements with scan cells, each having one additional scan input (SI) port and one shared/additional scan output (SO) port. By connecting the SO port of one scan cell to the SI port of the next scan cell, one or more scan chains are created.

The scan-inserted design, called scan design, is now operated in three modes: normal mode, shift mode, and capture mode. Circuit operations with associated clock cycles conducted in these three modes are referred to as normal operation, shift operation, and capture operation, respectively.

In normal mode, all test signals are turned off, and the scan design operates in the original functional configuration. In both shift and capture modes, a test mode signal TM is often used to turn on all test-related fixes in compliance with scan design rules. A set of scan design rules, which can be found in [Cheung 1996] and [Wang 2006a], is necessary to simplify the test, debug, and diagnosis tasks, improve fault coverage, and guarantee the safe operation of the device under test. These circuit modes and operations are distinguished using additional test signals or test clocks. Fundamental and advanced scan architectures are described in the following subsections.

Scan Architectures

In this subsection, we first describe a few fundamental scan architectures. These fundamental scan architectures include (1) muxed-D scan design, where storage elements are converted into muxed-D scan cells; (2) clocked-scan design, where storage elements are converted into clocked-scan cells; (3) LSSD scan design, where storage elements are converted into level-sensitive scan design (LSSD) shift register latches (SRLs); and (4) enhanced-scan design, where storage elements are converted into enhanced-scan cells each comprised of a D latch and a muxed-D scan cell.

Muxed-D Scan Design

Figure 2.2 shows a sequential circuit example with three D flip-flops. The corresponding muxed-D full-scan circuit is shown in Figure 2.3. An edge-triggered muxed-D scan cell design is shown in Figure 2.3a. This scan cell is composed of a D flip-flop and a multiplexer. The multiplexer uses a scan enable (SE) input to select between the data input (DI) and the scan input (SI). The three D flip-flops, FF1, FF2, and FF3, shown in Figure 2.2, are replaced with three muxed-D scan cells, SFF1, SFF2, and SFF3, shown in Figure 2.3b.

Figure 2.2. Sequential circuit example.

Figure 2.3. Muxed-D scan design: (a) muxed-D scan cell and (b) muxed-D scan design.

In Figure 2.3b, the data input DI of each scan cell is connected to the output of the combinational logic as in the original circuit. To form a scan chain, the scan inputs SI of SFF2 and SFF3 are connected to the outputs Q of the previous scan cells, SFF1 and SFF2, respectively. In addition, the scan input SI of the first scan cell SFF1 is connected to the primary input SI, and the output Q of the last scan cell SFF3 is connected to the primary output SO. Hence, in shift mode, SE is set to 1, and the scan cells operate as a single scan chain, which allows us to shift any combination of logic values into the scan cells. In capture mode, SE is set to 0, and the scan cells are used to capture the test response from the combinational logic when a clock is applied.

In general, combinational logic in a full-scan circuit has two types of inputs: primary inputs (PIs) and pseudo primary inputs (PPIs). Primary inputs refer to the external inputs to the circuit, whereas pseudo primary inputs refer to the scan cell outputs. Both PIs and PPIs can be set to any required logic values. The only difference is that PIs are set directly in parallel from the external inputs, whereas PPIs are set serially through scan chain inputs. Similarly, the combinational logic in a full-scan circuit has two types of outputs: primary outputs (POs) and pseudo primary outputs (PPOs). Primary outputs refer to the external outputs of the circuit, and pseudo primary outputs refer to the scan cell inputs. Both POs and PPOs can be observed. The only difference is that POs are observed directly in parallel from the external outputs, and PPOs are observed serially through scan chain outputs.
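
To make the shift and capture operations concrete, the following Python sketch models the three-cell muxed-D scan chain of Figure 2.3b at the behavioral level. It is an illustrative model only: the signal names (DI, SI, SE) follow the text, but the combinational logic is an arbitrary stand-in of our choosing.

```python
class MuxedDScanCell:
    def __init__(self):
        self.q = 0  # flip-flop state (output Q)

    def clock(self, di, si, se):
        # Multiplexer: SE = 1 selects the scan input, SE = 0 the data input.
        self.q = si if se else di

def shift(cells, stimulus):
    """Shift mode (SE = 1): serially load a stimulus, one bit per clock."""
    for bit in stimulus:
        prev = bit  # primary input SI feeds the first cell
        for cell in cells:
            nxt = cell.q                    # save Q before it is overwritten
            cell.clock(di=0, si=prev, se=1)
            prev = nxt                      # old Q drives the next cell's SI

def capture(cells, comb_logic):
    """Capture mode (SE = 0): load the combinational response in parallel."""
    ppis = [c.q for c in cells]             # pseudo primary inputs
    ppos = comb_logic(ppis)                 # pseudo primary outputs
    for cell, di in zip(cells, ppos):
        cell.clock(di=di, si=0, se=0)

cells = [MuxedDScanCell() for _ in range(3)]  # SFF1, SFF2, SFF3
shift(cells, [1, 0, 1])                       # apply a test stimulus
capture(cells, lambda x: [x[0] & x[1], x[1] | x[2], x[0] ^ x[2]])
print([c.q for c in cells])                   # response, ready to shift out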

Clocked-Scan Design

An edge-triggered clocked-scan cell can also be used to replace a D flip-flop in a scan design [McCluskey 1986]. Similar to a muxed-D scan cell, a clocked-scan cell also has a data input DI and a scan input SI; however, in the clocked-scan cell, input selection is conducted using two independent clocks, data clock DCK and shift clock SCK, as shown in Figure 2.4a.

Figure 2.4. Clocked-scan design: (a) clocked-scan cell and (b) clocked-scan design.

Figure 2.4b shows a clocked-scan design of the sequential circuit given in Figure 2.2. This clocked-scan design is tested using shift and capture operations, similar to a muxed-D scan design. The main difference is how these two operations are distinguished. In a muxed-D scan design, a scan enable signal SE is used, as shown in Figure 2.3a. In the clocked-scan design shown in Figure 2.4, these two operations are distinguished by properly applying the two independent clocks SCK and DCK during shift mode and capture mode, respectively.

LSSD Scan Design

Figure 2.5 shows a polarity-hold shift register latch (SRL) design described in [Eichelberger 1977] that can be used as an LSSD scan cell. This scan cell contains two latches: a master two-port D latch L1 and a slave D latch L2. Clocks C, A, and B are used to select between the data input D and the scan input I to drive +L1 and +L2.

Figure 2.5. Polarity-hold shift register latch (SRL).

LSSD scan designs can be implemented using either a single-latch design or a double-latch design. In the single-latch design [Eichelberger 1977], the output port +L1 of the master latch L1 is used to drive the combinational logic of the design. In this case, the slave latch L2 is used only for scan testing. Because LSSD designs use latches instead of flip-flops, at least two system clocks C1 and C2 are required to prevent combinational feedback loops from occurring. In this case, combinational logic driven by the master latches of the first system clock C1 is used to drive the master latches of the second system clock C2, and vice versa. For this to work, the system clocks C1 and C2 must be applied in a nonoverlapping fashion. Figure 2.6a shows an LSSD single-latch design using the polarity-hold SRL shown in Figure 2.5.

Figure 2.6. LSSD designs: (a) LSSD single-latch design and (b) LSSD double-latch design.

Figure 2.6b shows an example of LSSD double-latch design [DasGupta 1982]. In normal mode, the C1 and C2 clocks are used in a nonoverlapping manner, where the C2 clock is the same as the B clock. The testing of an LSSD scan design is conducted using shift and capture operations, similar to a muxed-D scan design. The main difference is how these two operations are distinguished. In a muxed-D scan design, a scan enable signal SE is used, as shown in Figure 2.3a. In an LSSD scan design, these two operations are distinguished by properly applying nonoverlapping clock pulses to clocks C1, C2, A, and B. During the shift operation, clocks A and B are applied in a nonoverlapping manner, and the scan cells SRL1 ~ SRL3 form a single scan chain from SI to SO. During the capture operation, clocks C1 and C2 are applied in a nonoverlapping manner to load the test response from the combinational logic into the scan cells.

The operation of a polarity-hold SRL is race-free if clocks C and B as well as A and B are nonoverlapping. This characteristic is used to implement LSSD circuits that are guaranteed to have race-free operation in normal mode as well as in test mode.

Enhanced-Scan Design

Testing for a delay fault requires applying a pair of test vectors in an at-speed fashion. This is used to generate a logic value transition at a signal line or at the source of a path, and the circuit response to this transition is captured at the circuit’s operating speed. Applying an arbitrary pair of vectors as opposed to a functionally dependent pair of vectors, generated through the combinational logic of the circuit under test, allows us to maximize the delay fault detection capability. This can be achieved using enhanced scan [Malaiya 1983] [Glover 1988] [Dervisoglu 1991]. The enhanced-scan or hold-scan test circuit was implemented in the 90-nm Intel Pentium 4 processor [Kuppuswamy 2004].

Enhanced scan increases the capacity of a typical scan cell by allowing it to store two bits of data that can be applied consecutively to the combinational logic driven by the scan cells. For a muxed-D scan cell or a clocked-scan cell, this is achieved through the addition of a D latch.

Figure 2.7 shows a general enhanced-scan architecture using muxed-D scan cells. To apply a pair of test vectors < V1, V2 > to the design, the first test vector V1 is shifted into the scan cells (SFF1 ~ SFFs) and then stored into the additional latches (LA1 ~ LAs) when the UPDATE signal is set to 1. Next, the second test vector V2 is shifted into the scan cells while the UPDATE signal is set to 0 to preserve the V1 values in the latches (LA1 ~ LAs). Once V2 has been shifted in, the UPDATE signal is applied to change V1 to V2 at the latch outputs, launching the transitions, and the output response is captured at-speed into the scan cells by applying CK exactly one clock cycle later.

Figure 2.7. Enhanced-scan design.
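
The two-vector sequence can be summarized with a small behavioral sketch in Python. The step ordering follows the description above; the hold latches are modeled as a plain list, the timing is abstracted to discrete steps, and the combinational logic is an arbitrary placeholder, not part of the original design.

```python
def enhanced_scan_apply(v1, v2, comb_logic):
    scan_cells = list(v1)        # step 1: shift V1 into SFF1 ~ SFFs
    hold = list(scan_cells)      # step 2: UPDATE = 1 copies V1 into LA1 ~ LAs
    scan_cells = list(v2)        # step 3: shift V2 in while UPDATE = 0
    launched = [a != b for a, b in zip(hold, scan_cells)]
    hold = list(scan_cells)      # step 4: UPDATE pulse launches V1 -> V2
    response = comb_logic(hold)  # step 5: CK captures the response at-speed
    return launched, response

launched, resp = enhanced_scan_apply(
    [0, 0, 1], [1, 0, 1], lambda x: [x[0] ^ x[1], x[1] & x[2], x[0] | x[2]])
print(launched)  # which combinational inputs see a transition
print(resp)      # captured into the scan cells one cycle later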

The main advantage of enhanced scan is that it allows us to achieve high delay fault coverage by applying any arbitrary pair of test vectors, which would otherwise be impossible. The disadvantages, however, are that each enhanced-scan cell needs an additional scan-hold D latch and that maintaining the timing relationship between UPDATE and CK for at-speed testing may be difficult. An additional disadvantage is that many false paths, rather than functional data paths, may be activated during test, causing an over-test problem. To reduce over-test, the conventional launch-on-shift (also called skewed-load [Savir 1993]) and launch-on-capture (also called broad-side [Savir 1994] or double-capture [Wang 2006a]) delay test techniques using normal scan chains can be used.

Low-Power Scan Architectures

Scan design can be classified as serial scan design, as test pattern application and test response acquisition are both conducted serially through scan chains. The major advantage of serial scan design is its low routing overhead, as scan data are shifted through adjacent scan cells. Its major disadvantage, however, is that individual scan cells cannot be controlled or observed without affecting the values of other scan cells within the same scan chain. High switching activities at scan cells during shift or capture can cause excessive test power dissipation, resulting in circuit damage, low reliability, or even test-induced yield loss.

Low-power scan architectures are scan designs targeting test power reduction. Test power is related to dynamic power. Dynamic power on a circuit node is measured as 0.5 × C × VDD^2 × f, where C is the effective load capacitance, VDD is the supply voltage, and f is the node's switching frequency [Girard 2002] [Jha 2003]. Thus, test power is proportional to VDD^2 × f.
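
A quick numeric check of this relation, using arbitrary illustrative values of our choosing, also previews the voltage and frequency scaling arguments used in the next two subsections:

```python
def dynamic_power(c, vdd, f):
    """P = 0.5 * C * VDD^2 * f (watts, given farads, volts, hertz)."""
    return 0.5 * c * vdd ** 2 * f

p_nom = dynamic_power(1e-12, 1.2, 100e6)  # 1-pF node, 1.2 V, 100-MHz shifting
print(p_nom / dynamic_power(1e-12, 0.6, 100e6))  # 4.0: halving VDD -> 4X less power
print(p_nom / dynamic_power(1e-12, 1.2, 10e6))   # 10.0: 10X slower clock -> 10X less power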

Many approaches can be used to reduce test power [Girard 2002]. Typically, these approaches can result in a reduction of 2X to 10X in test power (shift power, capture power, or both). A number of representative low-power scan architectures are described in this subsection. These scan architectures are all applicable to muxed-D, clocked, and LSSD scan designs. If achieving a 100X reduction in shift power is required, one may consider using random-access scan design given in Section 2.5 or the advanced techniques detailed in Chapter 7.

Reduced-Voltage Low-Power Scan Design

A simple approach to reducing test power is to reduce the supply voltage. By reducing the supply voltage by 2X, a reduction of 4X in test power can be immediately achieved. The problem with this approach is that the circuit may not be designed to function at the reduced supply voltage.

Reduced-Frequency Low-Power Scan Design

Another approach is to slow down the shift clock frequency [Chandra 2001]. By reducing the shift clock frequency by 10X, a reduction of 10X in test power can be immediately achieved. The drawback of this approach is that test application time is increased by 10X, as test application time is mainly dominated by shift clock frequency. This can result in a dramatic increase in test cost.

Multi-Phase or Multi-Duty Low-Power Scan Design

One common approach to reducing test power is to apply shift clocks in a multi-phase (nonoverlapping) or multi-duty (skewed) order [Bonhomme 2001] [Saxena 2001] [Yoshida 2003] [Rosinger 2004]. The multi-phase clocking technique splits the shift clock into a number of nonoverlapping clock phases, each driving a small scan segment of scan cells. Thus, test power is reduced, but test application time may be increased. To avoid increasing test application time, the scan inputs of all scan segments can be tied together and their scan outputs multiplexed [Bonhomme 2001] [Saxena 2001] [Rosinger 2004]. The low-power scan design described in [Yoshida 2003] uses a multi-duty clocking technique to avoid increasing test application time. This is done by adding delays to the shift clock so that a skewed clock phase is applied to each small scan segment of scan cells. This technique also helps reduce peak power, but total energy consumption and heat dissipation may not change. A multi-phase or multi-duty low-power scan design reconfigured from Figure 2.3b is shown in Figure 2.8, where the clock CK of Figure 2.3b is split (or skewed) into three clock phases: CK1, CK2, and CK3. Using this scheme, up to a 3X reduction in test power can be achieved. The disadvantage of this approach is increased routing overhead and complexity during clock tree synthesis (CTS).

Figure 2.8. Multi-phase or multi-duty low-power scan design.

Bandwidth-Matching Low-Power Scan Design

It is also possible to reduce test power by splitting each scan chain into multiple scan chains and reducing the shift clock frequency. This is accomplished by using pairs of serial-in/parallel-out and parallel-in/serial-out shift registers for bandwidth matching [Whetsel 1998] [Khoche 2002]. Consider a design with 16 scan chains running at a shift clock frequency of 10 MHz. Each scan chain is split into 10 subscan chains, with the SI and SO ports of the 10 subscan chains connected to a serial-in/parallel-out shift register and a parallel-in/serial-out shift register, respectively. In this case, the 16 pairs of shift registers run at 10 MHz, whereas all 160 subscan chains can now be shifted at 1 MHz. As a result, because test power is proportional to the shift clock frequency, a reduction of 10X in test power is achieved without a corresponding increase in test time. Figure 2.9 shows the bandwidth-matching low-power scan design. The time-division demultiplexer (TDDM) is a serial-in/parallel-out shift register, whereas the time-division multiplexer (TDM) is a parallel-in/serial-out shift register. The main drawback of this approach is the added area overhead.

Figure 2.9. Bandwidth-matching low-power scan design.
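
The following Python sketch models the TDDM/TDM pair at the bit level to show that the serial ATE stream survives the split into slower subscan chains unchanged. The clocking itself is abstracted away, and the stream contents are arbitrary; this is an illustration of the bandwidth-matching idea, not the circuit of Figure 2.9.

```python
import random

def tddm(fast_stream, n):
    """Serial-in/parallel-out: distribute a fast stream over n subscan chains."""
    chains = [[] for _ in range(n)]
    for i, bit in enumerate(fast_stream):
        chains[i % n].append(bit)   # one bit to each chain per fast-clock round
    return chains

def tdm(chains):
    """Parallel-in/serial-out: re-serialize the n slow response streams."""
    return [chain[i] for i in range(len(chains[0])) for chain in chains]

random.seed(0)
stream = [random.randint(0, 1) for _ in range(20)]  # bits at the ATE (fast) rate
subchains = tddm(stream, n=10)                      # each chain shifts at 1/10 rate
assert tdm(subchains) == stream                     # ordering preserved end to end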

Hybrid Low-Power Scan Design

Any of the above-mentioned low-power scan designs can typically achieve 2X to 10X reduction in test power (either shift power or capture power). When combined, further test power reduction is possible. In cases where a 100X reduction in shift power is required, one can consider using random-access scan designs as detailed in Section 2.5 or resort to a hybrid approach that combines two or more low-power test techniques. These advanced techniques are discussed in Chapter 7.

At-Speed Scan Architectures

Although scan design is commonly used in the industry for slow-speed stuck-at fault testing, its real value is in providing at-speed testing for high-speed and high-performance circuits. These circuits often contain multiple clock domains, each running at an operating frequency that is either synchronous or asynchronous to the other clock domains. Two clock domains are said to be synchronous if the active edges of both clocks controlling the two clock domains can be aligned precisely or triggered simultaneously. Two clock domains are said to be asynchronous if they are not synchronous.

There are two basic capture-clocking schemes for testing multiple clock domains at-speed: (1) skewed-load (also called launch-on-shift) and (2) double-capture (also called launch-on-capture or broad-side). Both schemes can test path-delay faults and transition faults within each clock domain (called intra-clock-domain faults) or across clock domains (called inter-clock-domain faults). Skewed-load uses the last shift clock pulse followed immediately by a capture clock pulse to launch the transition and capture the output test response, respectively. Double-capture uses two consecutive capture clock pulses to launch the transition and capture the output test response, respectively. In both schemes, the interval between the launch and capture clock pulses must correspond to the domain's operating speed, that is, be at-speed. The difference is that skewed-load requires the domain's scan enable signal SE to switch its value between the launch and capture clock pulses, making SE act as a clock signal. Figure 2.10 shows sample waveforms using the basic skewed-load and double-capture at-speed test schemes.

Figure 2.10. Basic at-speed test schemes: (a) skewed-load and (b) double-capture.

Because scan designs typically include many clock domains that do not interact with one another, clock grouping can be used to reduce test application time and test data volume during ATPG. Clock grouping is a process used to analyze all data paths in the scan design in order to determine all independent or noninteracting clocks that can be grouped and applied simultaneously.

An example of the clock grouping process is shown in Figure 2.11. This example shows the results of performing a circuit analysis operation on a scan design to identify all clock interactions, marked with an arrow, where a data transfer from one clock domain to a different clock domain occurs. As Figure 2.11 illustrates, the circuit in this example has seven clock domains (CD1 ~ CD7) and five crossing-clock-domain data paths (CCD1 ~ CCD5). This example shows that CD2 and CD3 are independent of each other; hence, their related clocks can be applied simultaneously during test as CK2. Similarly, clock domains CD4 through CD7 can also be applied simultaneously during test as CK3. Therefore, in this example, three grouped clocks instead of seven individual clocks can be used to test the circuit during the capture operation.

Figure 2.11. Clock grouping example.
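
Clock grouping can be viewed as coloring an interaction graph: clock domains are nodes, crossing-clock-domain paths are edges, and any set of pairwise nonadjacent domains may share one grouped test clock. The Python sketch below uses a simple greedy pass. The domain names echo Figure 2.11, but the specific interactions listed are assumptions for illustration, so the resulting groups need not match the figure's exact CK1 ~ CK3 assignment.

```python
def group_clocks(domains, crossings):
    adjacent = {d: set() for d in domains}
    for a, b in crossings:           # crossing-clock-domain data paths
        adjacent[a].add(b)
        adjacent[b].add(a)
    groups = []                      # each group gets one grouped test clock
    for d in domains:
        for g in groups:
            if not g & adjacent[d]:  # d interacts with nobody already in g
                g.add(d)
                break
        else:
            groups.append({d})
    return groups

domains = ["CD1", "CD2", "CD3", "CD4", "CD5", "CD6", "CD7"]
crossings = [("CD1", "CD2"), ("CD1", "CD3"), ("CD2", "CD4"),
             ("CD3", "CD5"), ("CD1", "CD6")]   # assumed CCD1 ~ CCD5
print(group_clocks(domains, crossings))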

To guarantee the success of the capture operation, additional care must be taken in terms of the way the grouped clocks are applied. This is mainly because the clock skew between different clock domains is typically large. A data path originating in one clock domain and terminating in another might result in a mismatch when both clocks are applied simultaneously and the clock skew between the two clocks is larger than the data path delay from the originating clock domain to the terminating clock domain. To avoid the mismatch, the timing of such a data path must observe the relationship shown in the following equation:

clock skew < data path delay + clock-to-Q delay (originating clock)

If this is not the case, a mismatch may occur during the capture operation. To prevent this from happening, grouped clocks can be applied sequentially (using the staggered clocking scheme [Wang 2005a, 2007]) so that any clock skew that exists between the clock domains can be tolerated during the test generation process. It is also possible to apply only one grouped clock during each capture operation using the one-hot clocking scheme. Most modern ATPG programs can also automatically mask off unknown values (X's) at the originating or receiving scan cells across clock domains. In this case, all grouped clocks can be applied simultaneously using the simultaneous clocking scheme [Wang 2007]. During simultaneous clocking, if the launch clock pulses [Rajski 2003] [Wang 2006a] or the capture clock pulses [Nadeau-Dostie 1994] [Wang 2006a] can be aligned precisely, which applies only to synchronous clock domains, then depending on the ATPG capability there may be no need to mask off unknown values across these synchronous clock domains. These clocking schemes are illustrated in Figure 2.12.

Figure 2.12. At-speed clocking schemes for testing two interacting clock domains: (a) one-hot clocking, (b) staggered clocking, and (c) simultaneous clocking.

In general, one-hot clocking produces the highest fault coverage at the expense of generating many more test patterns than the other two schemes. Simultaneous clocking can generate the smallest number of test patterns but may result in high fault coverage loss because of unknown (X) masking. The staggered clocking scheme is a happy medium, offering a test pattern count close to that of simultaneous clocking and fault coverage close to that of one-hot clocking. For large designs, it is not uncommon for transition fault ATPG to take 2 to 4 weeks or longer to complete. To reduce test generation time while at the same time obtaining the highest fault coverage, modern ATPG programs tend to either (1) run simultaneous clocking followed by one-hot clocking or (2) use staggered clocking followed by one-hot clocking. As a result, modern at-speed scan architectures now support a combination of at-speed clocking schemes for test circuits comprising multiple synchronous and asynchronous clock domains. Some programs can even generate test patterns by mixing the skewed-load and double-capture schemes.

In these modern at-speed scan architectures, the launch and capture clock pulses can be either supplied directly from the tester or generated internally by the phase-locked loop (PLL) associated with each clock domain. Although it is easy to supply the clock pulses directly from the tester, the test cost associated with an expensive tester and its limited high-frequency channels may make the approach impractical. To use internal PLLs, additional on-chip clock controllers are required. When the skewed-load scheme is employed, it may also be necessary to perform clock tree synthesis (CTS) on the scan enable signal SE controlling each clock domain. Alternatively, the SE signal can be pipelined to avoid CTS. An example of a pipelined SE design to drive both positive-edge and negative-edge scan cells is shown in Figure 2.13 [Gizopoulos 2006]. Figure 2.14a shows an on-chip clock controller for generating two capture clock pulses using the double-capture scheme [Beck 2005]. When scan_en is set to 1, scan_clk is directly connected to clk_out; when scan_en is set to 0, the output of the clock-gating cell is directly connected to clk_out. The implementation of the clock-gating cell ensures that no glitches or spikes appear on clk_out. The clock-gating cell is enabled by the signal hs_clk_en, which is generated from a five-bit shift register clocked by pll_clk. According to Figure 2.14b, a single scan_clk pulse is applied after scan_en is set to 0. This clock pulse generates a 1 that is latched by the D flip-flop and shifted through the shift register. After two pll_clk cycles, hs_clk_en is asserted for the next two pll_clk cycles. As the clock-gating cell is enabled during that period, exactly two PLL clock pulses are transmitted from the PLL to clk_out.

Figure 2.13. Pipelined scan enable design.

Figure 2.14. An on-chip clock controller for generating two capture clock pulses: (a) example on-chip clock controller and (b) waveform.
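
A cycle-level Python sketch of this shift-register mechanism is given below. The decode of the two asserted cycles (taps 2 and 3) is an assumption consistent with the waveform described above, not the exact gate-level design of Figure 2.14a.

```python
def capture_pulses(num_pll_cycles=8):
    sr = [0] * 5      # five-bit shift register clocked by pll_clk
    sr[0] = 1         # the 1 launched by the single scan_clk pulse
    clk_out = []
    for _ in range(num_pll_cycles):
        hs_clk_en = sr[2] | sr[3]  # asserted while the 1 is in stage 2 or 3
        clk_out.append(hs_clk_en)  # clock-gating cell passes pll_clk iff enabled
        sr = [0] + sr[:-1]         # shift one stage per pll_clk cycle
    return clk_out

print(capture_pulses())  # [0, 0, 1, 1, 0, 0, 0, 0]: exactly two at-speed pulses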

A test clock controller for detecting inter-clock-domain delay faults by using an internal PLL and the double-capture clocking scheme can also be found in [Furukawa 2006]. The authors in [Iyengar 2006] further presented an on-chip clock controller that can generate high-speed launch-on-capture as well as launch-on-shift clocking without the need to switch SE at-speed.

Logic Built-In Self-Test

Figure 2.15 shows a typical logic built-in self-test (BIST) system. The test pattern generator (TPG) automatically generates test patterns for application to the inputs of the circuit under test (CUT). The output response analyzer (ORA) automatically compacts the output responses of the CUT into a signature. Specific BIST timing control signals, including scan enable signals and clocks, are generated by the logic BIST controller for coordinating the BIST operation among the TPG, CUT, and ORA. The logic BIST controller provides a pass/fail indication once the BIST operation is complete. It includes comparison logic to compare the final signature with an embedded golden signature, and it often encompasses diagnostic logic for fault diagnosis. As compaction is commonly used for output response analysis, it is required that all storage elements in the TPG, CUT, and ORA be initialized to known states before self-test and no unknown (X) values be allowed to propagate from the CUT to the ORA. In other words, the CUT must comply with more stringent BIST-specific design rules [Wang 2006a] in addition to those scan design rules required for scan design.

Figure 2.15. A typical logic BIST system.

For BIST pattern generation, in-circuit TPGs are commonly constructed from linear feedback shift registers (LFSRs) [Golomb 1982] or cellular automata [Hortensius 1989] to generate test patterns or test sequences for exhaustive testing, pseudo-random testing, and pseudo-exhaustive testing [Bushnell 2000] [Wang 2006a]. Exhaustive testing guarantees 100% single-stuck and multiple-stuck fault coverage, but it requires applying all possible 2^n test patterns to an n-input combinational CUT, which can take too long when n is large. Therefore, pseudo-random testing [Bardell 1987] is often used to apply a subset of the 2^n test patterns, with fault simulation used to calculate the exact fault coverage; the TPG is then often referred to as a pseudo-random pattern generator (PRPG). In some cases, this fault simulation might become quite time consuming, if not infeasible. To eliminate the need for fault simulation while at the same time maintaining 100% single-stuck fault coverage, pseudo-exhaustive testing [McCluskey 1986] [Wang 2006a] can be used to generate 2^w or 2^k − 1 test patterns, where w < k < n, when each output of the n-input combinational CUT depends on at most w inputs. For testing delay faults, hazards must also be taken into consideration.
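
A minimal sketch of such a TPG, assuming a 4-stage LFSR with the primitive characteristic polynomial x^4 + x^3 + 1 (our choice for illustration, not a polynomial prescribed by the text), is given below; the assertion confirms the maximum-length property.

```python
def lfsr_states(seed, length):
    """4-stage LFSR; feedback s0 XOR s3 realizes x^4 + x^3 + 1."""
    state = list(seed)
    states = []
    for _ in range(length):
        states.append(tuple(state))
        feedback = state[0] ^ state[3]
        state = [feedback] + state[:-1]
    return states

seq = lfsr_states(seed=[1, 0, 0, 0], length=15)
assert len(set(seq)) == 15  # maximum length: all 2^4 - 1 nonzero states visited
print(seq[:4])              # first few pseudo-random patterns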

For output response compaction, the ORAs are commonly constructed from multiple-input signature registers (MISRs). The MISR is basically an LFSR that uses an extra XOR gate at the input of each LFSR stage for compacting the output responses of the CUT into the LFSR during each shift operation. Oftentimes, to further reduce the hardware overhead of the ORA, a linear phase compactor comprised of a network of XOR gates is connected to the MISR inputs.
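
The sketch below extends the same illustrative 4-stage LFSR into a MISR by XORing one CUT output bit into each stage per cycle; the final state is the signature. The response vectors are arbitrary, and the comparison shows how a single flipped response bit yields a different signature.

```python
def misr_signature(responses):
    """Compact a stream of 4-bit response vectors into a 4-bit signature."""
    state = [0, 0, 0, 0]
    for resp in responses:
        feedback = state[0] ^ state[3]
        shifted = [feedback] + state[:-1]               # ordinary LFSR shift
        state = [s ^ r for s, r in zip(shifted, resp)]  # extra XOR per stage
    return state

good = [(1, 0, 1, 1), (0, 1, 1, 0), (1, 1, 0, 0)]
bad  = [(1, 0, 1, 1), (0, 1, 0, 0), (1, 1, 0, 0)]  # one faulty response bit
print(misr_signature(good), misr_signature(bad))   # signatures differ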

Logic BIST Architectures

Several architectures for incorporating offline BIST techniques into a design have been proposed. These BIST architectures can be classified into two classes: (1) those using the test-per-scan BIST scheme and (2) those using the test-per-clock BIST scheme. The test-per-scan BIST scheme takes advantage of the already built-in scan chains of the scan design and applies a test pattern to the CUT after a shift operation is completed; hence, the hardware overhead is low. The test-per-clock BIST scheme, however, applies a test pattern to the CUT and captures its test response every system clock cycle; hence, it can execute tests much faster than the test-per-scan BIST scheme, but at the expense of higher hardware overhead.

In this subsection, we only discuss two representative BIST architectures, one for each class. Although pseudo-random testing is commonly adopted in both BIST schemes, the exhaustive and pseudo-exhaustive test techniques are applicable for designs using the test-per-clock BIST scheme. For a more comprehensive survey of these BIST architectures, refer to [McCluskey 1985], [Bardell 1987], [Abramovici 1994], and [Wang 2006a].

Self-Testing Using MISR and Parallel SRSG (STUMPS)

A test-per-scan BIST design was presented in [Bardell 1982]. This design, shown in Figure 2.16, contains a PRPG (parallel shift register sequence generator [SRSG]) and a MISR. The scan chains are loaded in parallel from the PRPG. The system clocks are then triggered and the test responses are shifted to the MISR for compaction. New test patterns are shifted in at the same time while test responses are being shifted out. This BIST architecture using the test-per-scan BIST scheme is referred to as self-testing using MISR and parallel SRSG (STUMPS) [Bardell 1982].

Figure 2.16. STUMPS.

Because of the ease of integration with traditional scan architecture, the STUMPS architecture is the only BIST architecture widely used in industry to date. To further reduce the lengths of the PRPG and MISR and improve the randomness of the PRPG, a STUMPS-based architecture that includes an optional linear phase shifter and an optional linear phase compactor is often used in industrial applications [Nadeau-Dostie 2000] [Cheon 2005]. The linear phase shifter and linear phase compactor typically comprise a network of XOR gates. Figure 2.17 shows the STUMPS-based architecture.

Figure 2.17. A STUMPS-based architecture.

Concurrent Built-In Logic Block Observer (CBILBO)

STUMPS is the most widely adopted logic BIST architecture for scan-based designs. The acceptance of the STUMPS architecture is mostly because of the ease with which the BIST circuitry can be integrated into a scan design. The effort required to implement the BIST circuitry and the fault coverage loss caused by using pseudo-random patterns, however, have prevented the STUMPS-based logic BIST architecture from being widely used across all industries.

One solution to the fault coverage loss problem is to use the concurrent built-in logic block observer (CBILBO) approach [Wang 1986]. The CBILBO is based on the test-per-clock BIST scheme and uses two registers to perform test generation and signature analysis simultaneously. A CBILBO design is shown in Figure 2.18, where only three modes of operation are considered: normal, scan, and test generation and signature analysis. When B1 = 0 and B2 = 1, the upper D flip-flops act as a MISR for signature analysis, whereas the lower two-port D flip-flops form a TPG for test generation. Because signature analysis is separated from test generation, an exhaustive or pseudo-exhaustive pattern generator (EPG/PEPG) can now be used for test generation; therefore, no fault simulation is required, and it is possible to achieve 100% single-stuck fault coverage using the CBILBO architectures for testing the designs shown in Figure 2.19. However, the hardware cost associated with using the CBILBO approach is generally higher than for the STUMPS approach.

Figure 2.18. A three-stage concurrent BILBO (CBILBO).

Figure 2.19. CBILBO architectures: (a) for testing a finite-state machine and (b) for testing a pipelined-oriented circuit.

Coverage-Driven Logic BIST Architectures

In pseudo-random testing, the fault coverage is limited by the presence of random-pattern resistant (RP-resistant) faults. If the fault coverage is not sufficient, then four approaches can be used to enhance the fault coverage: (1) weighted pattern generation, (2) test point insertion, (3) mixed-mode BIST, and (4) hybrid BIST. The first three approaches are applicable for in-field coverage enhancement, whereas the fourth approach is applicable for manufacturing coverage enhancement.

Weighted pattern generation inserts a combinational circuit between the output of the PRPG and the CUT to increase the frequency of occurrence of one logic value while decreasing that of the other. Test point insertion adds control points and observation points to provide additional controllability and observability, improving the detection probability of RP-resistant faults so they can be detected during pseudo-random testing. Mixed-mode BIST involves supplementing the pseudo-random patterns with some deterministic patterns that detect RP-resistant faults and are generated using on-chip hardware. When BIST is performed during manufacturing test where a tester is present, hybrid BIST involves combining BIST and external testing by supplementing the pseudo-random patterns with deterministic data from the tester to improve the fault coverage. This fourth option is not applicable when BIST is used in the field, as the tester is not present. Each of these approaches is described in more detail in the following subsections.

Weighted Pattern Generation

Typically, weighted pseudo-random patterns are used to increase the circuit's fault coverage. A weighted pattern generation technique employing an LFSR and a combinational circuit was first described in [Schnurmann 1975]. The combinational circuit inserted between the output of the LFSR and the CUT increases the frequency of occurrence of one logic value while decreasing that of the other. This approach may increase the probability of detecting faults that are difficult to detect using the typical LFSR pattern generation technique.

Implementation methods for realizing this scheme are further discussed in [Chin 1984]. The weighted pattern generation technique described in that paper modifies the maximum-length LFSR to produce an unequally weighted distribution of 0's and 1's at the inputs of the CUT. It skews the LFSR probability distribution of 0.5 to either 0.25 or 0.75 to increase the chance of detecting faults that are difficult to detect with a 0.5 distribution alone. Better fault coverage was also found in [Wunderlich 1987], where probability distributions in multiples of 0.125 (rather than 0.25) are used. For some circuits, several programmable probabilities or weight sets are required to further increase each circuit's fault coverage [Waicukauski 1989] [Bershteyn 1993] [Kapur 1994] [Lai 2005]. Additional discussions on weighted pattern generation can be found in [Rajski 1998a] and [Bushnell 2000]. Figure 2.20 shows a four-stage weighted (maximum-length) LFSR with probability distribution 0.75 [Chin 1984].

Figure 2.20. Example weighted LFSR as PRPG.
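
To illustrate the weighting idea numerically, the Python sketch below ORs two stages of an 8-stage maximum-length LFSR (characteristic polynomial x^8 + x^4 + x^3 + x^2 + 1, a primitive polynomial of our choosing, not the four-stage design of Figure 2.20). Over a full period, the OR output is 1 with probability 192/255 ≈ 0.75; an AND of two stages would give ≈ 0.25 instead.

```python
def weighted_bits(cycles):
    state = [1] + [0] * 7  # 8-stage LFSR, nonzero seed
    out = []
    for _ in range(cycles):
        out.append(state[0] | state[7])  # OR of two stages: weight ~0.75
        fb = state[3] ^ state[4] ^ state[5] ^ state[7]  # x^8+x^4+x^3+x^2+1
        state = [fb] + state[:-1]
    return out

bits = weighted_bits(255 * 4)  # a few full periods
print(sum(bits) / len(bits))   # ~0.753 = 192/255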

Test Point Insertion

Although weighted pattern generation is simple in design, achieving adequate fault coverage for a BIST circuit remains a problem. Test points can then be used to increase the circuit’s fault coverage to a desired level. Figure 2.21 shows two typical types of test points that can be inserted. A control point can be connected to a primary input, an existing scan cell output, or a dedicated scan cell output. An observation point can be connected to a primary output through an additional multiplexer, an existing scan cell input, or a dedicated scan cell input.

Figure 2.21. Typical test points inserted for improving a circuit’s fault coverage: (a) test point with a multiplexer and (b) test point with AND-OR gates.

Figure 2.22b shows an example where one control point and one observation point are inserted to increase the detection probability of the six-input AND gate given in Figure 2.22a. By splitting the six-input AND gate into two fewer-input AND gates and placing a control point and an observation point between them, we can increase the probability of detecting faults in the original six-input AND gate (e.g., output Y stuck-at-0 and any input Xi stuck-at-1), thereby making the circuit more RP testable. After the test points are inserted, the most difficult fault to detect is the bottom input of the four-input AND gate stuck-at-1. In that case, at least one of inputs X1, X2, and X3 must be 0, the control point must be 0, and all of inputs X4, X5, and X6 must be 1, resulting in a detection probability of 7/128 (= 7/8 × 1/2 × 1/2 × 1/2 × 1/2).

Figure 2.22. Example of inserting test points to improve detection probability: (a) an output RP-resistant stuck-at-0 fault and (b) example of inserted test points.
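
The 7/128 figure is easy to confirm with a quick Monte Carlo experiment in Python. The structural assumption below is that the three-input AND is ORed with the control point before feeding the four-input AND, consistent with the AND-OR style test point of Figure 2.21b.

```python
import random
random.seed(0)

trials, hits = 500_000, 0
for _ in range(trials):
    x = [random.randint(0, 1) for _ in range(6)]  # equiprobable X1 ~ X6
    cp = random.randint(0, 1)                     # control point value
    line = (x[0] & x[1] & x[2]) | cp              # line with the stuck-at-1 fault
    # Detect: the faulty line must be 0, and X4 ~ X6 must propagate the effect.
    hits += (line == 0) and x[3] and x[4] and x[5]

print(hits / trials, 7 / 128)  # both ~ 0.0547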

Test Point Placement

Because test points add area and performance overhead, an important issue for test point insertion is where to place the test points in the circuit to maximize the coverage and minimize the number of test points required. Note that it is not sufficient to only use observation points, as some faults require control points in order to be detected. Optimal placement of test points in circuits with reconvergent fanout has been shown to be NP-complete [Krishnamurthy 1987]. Several approximation techniques for placement of test points have been developed using either fault simulation [Iyengar 1989] [Touba 1996] or testability measures to guide them [Seiss 1991] [Tamarapalli 1996] [Zhang 2000]. Timing-driven test point insertion techniques [Tsai 1998] have also been developed to avoid adding delay on a critical timing path. The number of test points that must be added can be reduced by using the almost-full-scan BIST technique proposed in [Tsai 2000] that excludes a small number of scan cells from the scan chains during BIST operation.

Control Point Activation

Once the test points have been inserted, the logic that drives the control points must be designed. When a control point is activated, it forces the logic value at a particular node in the circuit to a fixed value. During normal operation, all control points must be deactivated. During testing, there are different strategies as to when and how the control points are activated. One approach is random activation, where the control points are driven by the pseudo-random pattern generator. The drawback of this approach is that when a large number of control points are inserted, they can interfere with each other and may not improve the fault coverage as much as desired. An alternative to random activation is to use deterministic activation. The technique in [Tamarapalli 1996] divides the BIST into phases and deterministically activates some subset of the control points in each phase. The technique in [Touba 1996] uses pattern decoding logic to activate the control points only for certain patterns where they are needed to detect RP-resistant faults.

Mixed-Mode BIST

A major drawback of test point insertion is that it requires modifying the circuit under test. In some cases this is not possible or not desirable (e.g., for hard cores, macros, hand-crafted designs, or legacy designs). An alternative way to improve fault coverage without modifying the CUT is to use mixed-mode BIST. Pseudo-random patterns are generated to detect the RP-testable faults, and then some additional deterministic patterns are generated to detect the RP-resistant faults. There are a number of ways for generating deterministic patterns on-chip. Three approaches are described next.

ROM Compression

The simplest approach for generating deterministic patterns on-chip is to store them in a read-only-memory (ROM). The problem with this approach is that the size of the required ROM is often prohibitive. Although several ROM compression techniques have been further proposed for reducing the size of the ROM, the industry seems to still shy away from using this approach [Agarwal 1981] [Aboulhamid 1983] [Dandapani 1984] [Edirisooriya 1992].

LFSR Reseeding

Instead of storing the test patterns themselves in a ROM, techniques have been developed for storing LFSR seeds that can be used to generate the test patterns [Könemann 1991]. The LFSR that is used for generating the pseudo-random patterns is also used for generating the deterministic patterns by reseeding it with computed seeds. The seeds can be computed with linear algebra, as described in [Könemann 1991]. Because the seeds are smaller than the test patterns themselves, they require less ROM storage. One problem is that for an LFSR with a fixed characteristic (feedback) polynomial, it may not always be possible to find a seed that will efficiently generate the required deterministic test patterns. A solution to this problem was proposed in [Hellebrand 1995a], in which a multiple-polynomial LFSR (MP-LFSR), as illustrated in Figure 2.23, is used. An MP-LFSR is an LFSR with a reconfigurable feedback network. A polynomial identifier is stored with each seed to select the characteristic polynomial that will be used for that seed. Further reductions in storage can be achieved by using variable-length seeds [Rajski 1998b], a special ATPG algorithm [Hellebrand 1995b], folding counters [Liang 2001], and seed encoding [Al-Yamani 2005].

Figure 2.23. Reseeding with multiple-polynomial LFSR.
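
Because every bit an LFSR shifts into the scan chain is a linear function of its seed over GF(2), seed computation reduces to solving a small linear system for the care bits of a deterministic test cube. The Python sketch below does exactly that for the same illustrative 4-stage LFSR used earlier; the LFSR, scan length, and test cube are all assumptions for illustration, and a fixed polynomial is used (the case where no solution exists is precisely what the MP-LFSR addresses).

```python
WIDTH = 4  # a 4-stage LFSR feeding one scan chain (illustrative)

def scan_bit_masks(length):
    """Mask i of each stage records which seed bits XOR into it."""
    state = [1 << i for i in range(WIDTH)]  # stage i starts as seed bit i
    masks = []
    for _ in range(length):
        masks.append(state[-1])             # bit shifted into the scan chain
        state = [state[0] ^ state[3]] + state[:-1]
    return masks

def solve_gf2(equations, n_vars):
    """Gauss-Jordan over GF(2); equations are (coefficient mask, rhs) pairs."""
    rows, row = list(equations), 0
    where = [-1] * n_vars
    for col in range(n_vars):
        piv = next((r for r in range(row, len(rows)) if rows[r][0] >> col & 1), None)
        if piv is None:
            continue                        # free variable (left at 0)
        rows[row], rows[piv] = rows[piv], rows[row]
        for r in range(len(rows)):
            if r != row and rows[r][0] >> col & 1:
                rows[r] = (rows[r][0] ^ rows[row][0], rows[r][1] ^ rows[row][1])
        where[col], row = row, row + 1
    if any(m == 0 and rhs for m, rhs in rows):
        return None                         # inconsistent: no seed exists
    return [rows[where[c]][1] if where[c] >= 0 else 0 for c in range(n_vars)]

cube = {0: 1, 2: 0, 5: 1, 9: 1}             # care bits of a 10-bit test cube
masks = scan_bit_masks(10)
seed = solve_gf2([(masks[i], v) for i, v in cube.items()], WIDTH)

# Replay the LFSR from the computed seed and confirm the care bits match.
state, out = list(seed), []
for _ in range(10):
    out.append(state[-1])
    state = [state[0] ^ state[3]] + state[:-1]
assert all(out[i] == v for i, v in cube.items())
print(seed, out)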

Embedding Deterministic Patterns

A third approach for mixed-mode BIST is to embed the deterministic patterns in the pseudo-random sequence. Many of the pseudo-random patterns generated during pseudo-random testing do not detect any new faults, so some of those “useless” patterns can be transformed into deterministic patterns that detect RP-resistant faults [Touba 1995]. This can be done by adding mapping logic between the scan chains and the CUT [Touba 1995] or in a less intrusive way by adding the mapping logic at the inputs to the scan chains to either perform bit-fixing [Touba 2001] or bit-flipping [Kiefer 1998]. Figure 2.24 shows a bit-flipping BIST scheme taken from [Kiefer 1998]. A bit-flipping function detects these “useless” patterns and maps them to deterministic patterns through the use of an XOR gate that is inserted between the LFSR and each scan chain.

Figure 2.24. Bit-flipping BIST.

Hybrid BIST

For manufacturing fault coverage enhancement where a tester is present, deterministic data from the tester can be used to improve the fault coverage. The simplest approach is to perform top-up ATPG for the faults not detected by BIST to obtain a set of deterministic test patterns that “top-up” the fault coverage to the desired level and then store those patterns directly on the tester. In a system-on-chip, test scheduling can be done to overlap the BIST run time with the transfer time for loading the deterministic patterns from the tester [Sugihara 1998] [Jervan 2003]. More elaborate hybrid BIST schemes have been developed, which attempt to store the deterministic patterns on the tester in a compressed form and then make use of the existing BIST hardware to decompress them. Such techniques are described in [Das 2000], [Dorsch 2001], [Ichino 2001], [Krishna 2003a], [Wohl 2003a], [Jas 2004], and [Lei 2005]. More discussions on test compression can be found in the following section.

Low-Power Logic BIST Architectures

Test power consumption in logic BIST designs tends to be a more serious concern than in scan designs. One major reason is that, unlike scan designs, in which test power can be reduced simply by using software ATPG approaches [Girard 2002] [Wen 2006], test power in logic BIST designs can only be reduced through hardware.

However, there are still quite a few hardware approaches that can be used to reduce test power. The low-power scan architectures discussed in Section 2.2.2 are mostly applicable to BIST designs. Three approaches are further described next. For more information, refer to Chapter 7.

Low-Transition BIST Design

One simple approach is to design a low-transition PRPG that generates test patterns with low switching activity; [Wang 1999] belongs to this category. The low-transition random test pattern generator (LT-RTPG) described in [Wang 1999], shown in Figure 2.25, inserts an AND gate and a toggle (T) flip-flop at the scan input of the scan chain. The inputs of the AND gate are connected to a few outputs of the LFSR. If the output of the AND gate in the LT-RTPG is 0 for k cycles, then identical values are applied at the scan input for k clock cycles; hence, the switching activity is reduced. This approach is less design-intrusive, entails no performance degradation, and requires low hardware overhead. The drawback is the low fault coverage, or the long test sequence required to achieve adequate fault coverage.

Figure 2.25. Low-transition random test pattern generator (LT-RTPG) as PRPG.
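
The following Python sketch contrasts the serial transition count of a plain LFSR output with that of an LT-RTPG whose T flip-flop toggles only when the AND of two LFSR stages is 1. The LFSR size and tap choices are arbitrary illustrative assumptions; with a 2-input AND, the toggle probability drops to roughly one quarter, and the transition count drops accordingly.

```python
def lfsr_stream(n):
    state, out = [1, 0, 0, 0], []
    for _ in range(n):
        out.append(state[-1])                # plain pseudo-random bit
        state = [state[0] ^ state[3]] + state[:-1]
    return out

def lt_rtpg_stream(n, and_taps=(0, 1)):
    state, t_ff, out = [1, 0, 0, 0], 0, []
    for _ in range(n):
        if all(state[t] for t in and_taps):  # AND gate output is 1
            t_ff ^= 1                        # T flip-flop toggles
        out.append(t_ff)                     # T flip-flop feeds the scan input
        state = [state[0] ^ state[3]] + state[:-1]
    return out

def transitions(bits):
    return sum(a != b for a, b in zip(bits, bits[1:]))

print(transitions(lfsr_stream(1000)), transitions(lt_rtpg_stream(1000)))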

Test-Vector-Inhibiting BIST Design

Another approach is to inhibit LFSR-generated pseudo-random patterns that do not contribute to fault detection from being applied to the circuit under test (CUT). This test-vector-inhibiting technique can reduce test power while achieving the same fault coverage as the original LFSR; [Manich 2000] belongs to this category. A test-vector-inhibiting RTPG (TVI-RTPG) used as a PRPG is shown in Figure 2.26. When a pseudo-random pattern generated by the PRPG does not detect any faults, the pattern is not transmitted to the CUT. For this purpose, decoding logic is connected to the output of the LFSR and outputs a 0 to inhibit the pseudo-random pattern from passing through the transmission gate network to the CUT. A transmission gate can be an XOR gate. Although this approach targets test-per-clock BIST, it is also applicable to test-per-scan BIST designs. The drawback of this approach is high area overhead and impact on circuit performance.

Figure 2.26. Test-vector-inhibiting RTPG (TVI-RTPG) as PRPG.

Modified LFSR Low-Power BIST Design

The third approach is to use a modified LFSR structure, composed of two separate or interleaved n/2-stage LFSRs, to drive the circuit under test (CUT). The two n/2-stage LFSRs activate only one part of the CUT in a given time interval; [Girard 2001] belongs to this category. The paper demonstrated that a shorter test length suffices to reach the target fault coverage with the modified LFSR structure shown in Figure 2.27. A test clock module is used to generate the two nonoverlapping clocks, CK1 and CK2, for driving LFSR-1 and LFSR-2, respectively. Because only one part of the CUT is activated at any given time, this BIST scheme provides a high percentage of power (and energy) reduction and results in no performance degradation or test time increase. The drawback of this approach is the need to construct special clock trees.

Figure 2.27. Two n/2-stage LFSRs as PRPG.

At-Speed Logic BIST Architectures

There are three basic capture-clocking schemes that can be used for testing multiple clock domains: (1) single-capture, (2) skewed-load, and (3) double-capture. We will illustrate with BIST timing control diagrams how to test synchronous and asynchronous clock domains using these schemes. In this section, we first discuss the three basic capture-clocking schemes and then briefly describe the logic BIST architectures practiced by the electronic design automation (EDA) vendors. Throughout this section, we will assume that a STUMPS-based architecture is used and that each clock domain contains one test clock and one scan enable signal. The faults we will consider include structural faults, such as stuck-at faults and bridging faults, as well as timing-related delay faults, such as path-delay faults and transition faults.

Single-Capture

Single-capture is a slow-speed test technique in which only one capture pulse is applied to each clock domain. It is the simplest for testing all intra-clock-domain and inter-clock-domain structural faults. Two approaches can be used: (1) one-hot single-capture and (2) staggered single-capture.

One-Hot Single-Capture

Using the one-hot single-capture approach, a capture pulse is applied to only one clock domain during each capture window, while all other test clocks are held inactive. A sample timing diagram is shown in Figure 2.28. In the figure, because only one capture pulse (C1 or C2) is applied during each capture window, this scheme can only test intra-clock-domain and inter-clock-domain structural faults. The main advantage of this approach is that the designer does not have to worry about clock skews between the two clock domains during self-test, as each clock domain is tested independently. The only requirement is that delays d1 and d2 be properly adjusted; hence, this approach can be used for slow-speed testing of both synchronous and asynchronous clock domains. Another benefit of using this approach is that a single, slow-speed global scan enable (GSE) signal can be used for driving both clock domains, which makes it easy to integrate with scan. A major drawback is longer test time, as all clock domains have to be tested one at a time.

One-hot single-capture.

Figure 2.28. One-hot single-capture.

Staggered Single-Capture

The long test time problem using one-hot single-capture can be solved using the staggered single-capture approach [Wang 2006b]. A sample timing diagram is shown in Figure 2.29. In this approach, capture pulses C1 and C2 are applied in a sequential or staggered order during the capture window to test all intra-clock-domain and inter-clock-domain structural faults in the two clock domains. For clock domains that are synchronous, adjusting d2 will allow us to detect inter-clock-domain delay faults between the two clock domains at-speed. In addition, because d1 and d3 can be as long as desired, a single, slow-speed GSE signal can be used. This significantly simplifies the logic BIST physical implementation for designs with multiple clock domains. There may be some structural fault coverage loss between clock domains if the ordered sequence of capture clocks is fixed for all capture cycles.

Staggered single-capture.

Figure 2.29. Staggered single-capture.

Skewed-Load

Skewed-load is an at-speed delay test technique in which the last shift pulse is followed immediately by a capture pulse running at the test clock's operating frequency; the last shift launches the transition, and the capture pulse captures the output response [Savir 1993]. It is also referred to as launch-on-shift. This technique addresses the intra-clock-domain delay fault detection problem, which cannot be solved using single-capture schemes. Skewed-load uses the value difference between the last shift and the next-to-last shift to launch the transition and uses the capture pulse to capture the output response. For the last shift pulse to launch the transition, the scan enable signal associated with the clock domain must be able to switch from shift to capture operation in one clock cycle. Three approaches can be used: (1) one-hot skewed-load, (2) aligned skewed-load, and (3) staggered skewed-load.

One-Hot Skewed-Load

Similar to one-hot single-capture, the one-hot skewed-load approach tests all clock domains one by one [Bhawmik 1997]. A sample timing diagram is shown in Figure 2.30. The main differences are (1) it applies shift-followed-by-capture pulses (S1-followed-by-C1 or S2-followed-by-C2) to detect intra-clock-domain delay faults, and (2) each scan enable signal (SE1 or SE2) must switch operations from shift to capture within one clock cycle (d1 or d2). Thus, this approach can only be used for at-speed testing of intra-clock-domain delay faults in both synchronous and asynchronous clock domains. The disadvantages are (1) it cannot be used to detect inter-clock-domain delay faults, (2) it has a long test time, and (3) it is incompatible with scan, as a single, slow-speed GSE signal can no longer be used.

One-hot skewed-load.

Figure 2.30. One-hot skewed-load.

Aligned Skewed-Load

The disadvantages of one-hot skewed-load can be resolved by using the aligned skewed-load scheme. One aligned skewed-load approach that aligns all capture edges together is illustrated in Figure 2.31 [Nadeau-Dostie 1994] [Nadeau-Dostie 2000]. The approach is referred to as capture aligned skewed-load. The major advantage of using this approach is that all intra-clock-domain and inter-clock-domain faults can be tested. The arrows shown in Figure 2.31 indicate the delay faults that can be tested. For example, the three arrows from S1 (CK1) to C are used to test all intra-clock-domain delay faults in the clock domain controlled by CK1, and all inter-clock-domain delay faults from CK1 to CK2 and CK3. The remaining six arrows shown from S2 (CK2) to C, and S3 (CK3) to C are used to test all the remaining delay faults.

Capture aligned skewed-load.

Figure 2.31. Capture aligned skewed-load.

Because the active edges (rising edges) of the three capture pulses (see dashed line C) must be aligned precisely, the circuit must contain one reference clock, and the frequencies of all remaining test clocks must be derived from it. In the example given here, CK1 is the reference clock operating at the highest frequency, and CK2 and CK3 are derived from CK1 and designed to operate at 1/2 and 1/4 that frequency, respectively; therefore, this approach is only applicable for at-speed testing of intra-clock-domain and inter-clock-domain delay faults in synchronous clock domains.

A similar aligned skewed-load approach that aligns all last shift edges, rather than capture edges, is shown in Figure 2.32 [Hetherington 1999] [Rajski 2003]. This approach is referred to as launch aligned skewed-load. Similar to capture aligned skewed-load, it is also only applicable for at-speed testing of intra-clock-domain and inter-clock-domain delay faults in synchronous clock domains.

Launch aligned skewed-load.

Figure 2.32. Launch aligned skewed-load.

Consider again the three clock domains driven by CK1, CK2, and CK3. The eight arrows between the dashed line S and the three capture pulses (C1, C2, and C3) indicate the intra-clock-domain and inter-clock-domain delay faults that can be tested. Unlike in Figure 2.31, however, testing the inter-clock-domain delay faults from CK1 to CK3 requires a special shift pulse S1 (issued while SE1 is set to 1). Because this method requires a much more complex timing-control diagram, a clock suppression circuit is used to enable or disable selected shift or capture pulses [Rajski 2003]. The dotted clock pulses shown in the figure indicate the suppressed shift pulses.

Staggered Skewed-Load

Although the aligned skewed-load approaches can test all intra-clock-domain and inter-clock-domain faults in synchronous clock domains, their physical implementation is extremely difficult, for two main reasons. First, to align all active edges at either the capture or the last shift, the circuit must contain a reference clock that operates at the fastest clock frequency, with all other clock frequencies derived from it; such designs rarely exist. Second, for any two edges that cannot be aligned precisely because of clock skews, we must either resort to a one-hot skewed-load approach or add capture-disabling circuitry on the functional data paths of the two clock domains to prevent the cross-domain logic from interacting during capture. This increases the circuit overhead, degrades the functional circuit performance, and reduces the ability to test inter-clock-domain faults.

The staggered skewed-load approach shown in Figure 2.33 relaxes these conditions [Wang 2005b]. For test clocks that cannot be precisely aligned, a delay d3 is inserted to eliminate the clock skew interaction between the two clock domains. The two last shift pulses (S1 and S2) are used to create transitions at the outputs of some scan cells, and the output responses to these transitions are captured by the following two capture pulses (C1 and C2), respectively. Delays d1 and d2 are each set to the operating clock period of the respective clock domain; hence, this scheme can be used to test all intra-clock-domain faults and inter-clock-domain structural faults in asynchronous clock domains. A problem still exists, as each clock domain requires an at-speed scan enable signal, which complicates physical implementation.

Staggered skewed-load.

Figure 2.33. Staggered skewed-load.

Double-Capture

The physical implementation difficulty of skewed-load can be resolved by using the double-capture scheme. Double-capture is another at-speed test technique, in which two consecutive capture pulses are applied to launch the transition and capture the output response. It is also referred to as broad-side [Savir 1994] or launch-on-capture. The double-capture scheme can achieve true at-speed test quality for intra-clock-domain and inter-clock-domain faults in any synchronous or asynchronous design, and it eases physical implementation. Here, true at-speed testing means (1) detecting intra-clock-domain faults within each clock domain at its own operating frequency, as well as inter-clock-domain structural faults or delay faults depending on whether the circuit under test is synchronous, asynchronous, or a mix of both, and (2) easing physical implementation for seamless integration with the conventional scan/ATPG technique.

One-Hot Double-Capture

Similar to one-hot skewed-load, the one-hot double-capture approach tests all clock domains one by one. A sample timing diagram is shown in Figure 2.34. The main differences are (1) two consecutive capture pulses are applied (C1-followed-by-C2 or C3-followed-by-C4) at their respective clock domains’ frequencies (of period d1 or d2) to test intra-clock-domain delay faults, and (2) a single, slow-speed GSE signal is used to drive both clock domains. Hence, this scheme can be used for true at-speed testing of intra-clock-domain delay faults in both synchronous and asynchronous clock domains. Two drawbacks remain: (1) it cannot be used to detect inter-clock-domain delay faults, and (2) it has a long test time.

One-hot double-capture.

Figure 2.34. One-hot double-capture.

Aligned Double-Capture

The drawbacks of the one-hot double-capture scheme can be resolved by using an aligned double-capture approach. Similar to the aligned skewed-load approach, the aligned double-capture scheme allows all intra-clock-domain faults and inter-clock-domain faults to be tested [Wang 2006b]. The main differences are (1) two consecutive capture pulses are applied, rather than shift-followed-by-capture pulses, and (2) a single, slow-speed GSE signal is used. Figures 2.35 and 2.36 show two sample timing diagrams. This scheme can be used for true at-speed testing of synchronous clock domains. One major drawback is that precise alignment of the capture pulses is still required, which complicates physical implementation for designs with asynchronous clock domains.

Capture aligned double-capture.

Figure 2.35. Capture aligned double-capture.

Launch aligned double-capture.

Figure 2.36. Launch aligned double-capture.

Staggered Double-Capture

The capture alignment problem in the aligned double-capture approach can finally be relaxed by using the staggered double-capture scheme [Wang 2005a, 2006b]. A sample timing diagram is shown in Figure 2.37. During the capture window, two capture pulses are generated for each clock domain. The first two capture pulses (C1 and C3) are used to create transitions at the outputs of some scan cells, and the output responses to the transitions are captured by the second two capture pulses (C2 and C4), respectively. Delays d2 and d4 are each set to the operating clock period of the respective domain. Because d1, d3, and d5 can be adjusted to any length, we can simply use a single, slow-speed GSE signal for driving all clock domains; hence, true at-speed testing is guaranteed using this approach for asynchronous clock domains. Because a single GSE signal is used, this scheme significantly eases physical implementation and allows us to integrate logic BIST with scan/ATPG easily to improve the circuit's manufacturing fault coverage.

Staggered double-capture.

Figure 2.37. Staggered double-capture.

Industry Practices

Logic BIST has a history of more than 30 years since its invention in the 1970s. Although it is only a few years behind the invention of scan, logic BIST has yet to gain strong industry support. The worldwide market is estimated to be close to 10% of the scan market. The logic BIST products available in the marketplace include Encounter Test from Cadence Design Systems [Cadence 2007], ETLogic from LogicVision [LogicVision 2007], LBIST Architect from Mentor Graphics [Mentor 2007], and TurboBIST-Logic from SynTest Technologies [SynTest 2007]. The logic BIST product offered in Encounter Test by Cadence currently includes support for test structure extraction, verification, logic simulation for signatures, and fault simulation for coverage. Unlike all three other BIST vendors that provide their own logic BIST structures in their respective products, Cadence offers a service to insert custom logic BIST structures or to use any customer inserted logic BIST structures; the service includes working with the customer to have custom on-chip clocking for logic BIST. A similar case arises in ETLogic from LogicVision when using the double-capture clocking scheme.

All these commercially available logic BIST products support the STUMPS-based architectures. Cadence supports a weighted-random spreading network (XOR network) for STUMPS with multiple weight selects [Foote 1997]. For at-speed delay testing, ETLogic [LogicVision 2007] uses a skewed-load-based at-speed BIST architecture, TurboBIST-Logic [SynTest 2007] implements the double-capture-based at-speed BIST architecture, and LBIST Architect [Mentor 2007] adopts a hybrid at-speed BIST architecture that supports both skewed-load and double-capture. In addition, all products provide inter-clock-domain delay fault testing for synchronous clock domains. On-chip clock controllers for testing these inter-clock-domain faults at-speed can be found in [Rajski 2003], [Furukawa 2006], [Nadeau-Dostie 2006], and [Nadeau-Dostie 2007]. Table 2.1 summarizes the capture-clocking schemes for at-speed logic BIST used by the EDA vendors.

Table 2.1. Summary of Industry Practices for At-Speed Logic BIST

Industry Practices    Skewed-Load        Double-Capture
Encounter Test        Through service    Through service
ETLogic               Yes                Through service
LBIST Architect       Yes                Yes
TurboBIST-Logic       —                  Yes

Test Compression

Test compression can provide 10X to 100X reduction or even more in the amount of test data (both test stimulus and test response) that must be stored on the automatic test equipment (ATE) [Touba 2006] [Wang 2006a] for testing with a deterministic ATPG-generated test set. This greatly reduces ATE memory requirements; even more important, it reduces test time because fewer data have to be transferred across the limited bandwidth between the ATE and the chip. Moreover, test compression methodologies are easy to adopt in industry because they are compatible with the conventional design rules and test generation flows used for scan testing.

Test compression is achieved by adding some additional on-chip hardware before the scan chains to decompress the test stimulus coming from the tester and after the scan chains to compact the response going to the tester. This is illustrated in Figure 2.38. This extra on-chip hardware allows the test data to be stored on the tester in a compressed form. Test data are inherently highly compressible because typically only 1% to 5% of the bits in a test pattern generated by an ATPG program are specified (care) bits. Lossless compression techniques can thus significantly reduce the amount of test stimulus data that must be stored on the tester. The on-chip decompressor expands the compressed test stimulus back into the original test patterns (matching in all the care bits) as they are shifted into the scan chains. The on-chip compactor converts long output response sequences into short signatures. Because the compaction is lossy, some fault coverage can be lost, either because of unknown (X) values that might appear in the output sequence or because of aliasing, in which a faulty output response signature is identical to the fault-free signature. With proper design of the circuit under test (CUT) and the compaction circuitry, however, the fault coverage loss can be kept negligibly small.

Architecture for test compression.

Figure 2.38. Architecture for test compression.

Circuits for Test Stimulus Compression

A test cube is defined as a deterministic test vector in which the bits that are not assigned values by the ATPG procedure are left as “don’t cares” (X’s). Normally, ATPG procedures perform random fill in which all the X’s in the test cubes are filled randomly with 1’s and 0’s to create fully specified test vectors; however, for test stimulus compression, random fill is not performed during ATPG so the resulting test set consists of incompletely specified test cubes. The X’s make the test cubes much easier to compress than fully specified test vectors.

As mentioned earlier, test stimulus compression should be an information lossless procedure with respect to the specified (care) bits in order to preserve the fault coverage of the original test cubes. After decompression, the resulting test patterns shifted into the scan chains should match the original test cubes in all the specified (care) bits.

Many schemes for compressing test cubes have been surveyed in [Touba 2006] and [Wang 2006a]. Two schemes, based on linear decompression and broadcast scan, are described here in greater detail, mainly because the industry has favored both approaches over code-based schemes from the standpoints of area overhead and compression ratio. These industry practices can be found in [Wang 2006a].

Linear-Decompression-Based Schemes

A class of test stimulus compression schemes is based on using linear decompressors to expand the data coming from the tester to fill the scan chains. Any decompressor that consists of only XOR gates and flip-flops is a linear decompressor [Könemann 1991]. Linear decompressors have a very useful property: their output space (i.e., the space of all possible test vectors that they can generate) is a linear subspace that is spanned by a Boolean matrix. In other words, for any linear decompressor that expands an m-bit compressed stimulus from the tester into an n-bit stimulus (test vector), there exists an n × m Boolean matrix A such that the set of test vectors that can be generated by the linear decompressor is spanned by A. A test vector Z can be compressed by a particular linear decompressor if and only if there exists a solution to the system of linear equations AX = Z, where A is the characteristic matrix of the linear decompressor and X is the set of free variables stored on the tester (every bit stored on the tester can be thought of as a "free variable" that can be assigned any value, 0 or 1).
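
Deciding whether a test cube is encodable thus reduces to solving a linear system over GF(2). The following sketch is a generic Gaussian-elimination solver for AX = Z; the small matrix and right-hand side are illustrative and do not correspond to any particular decompressor.

```python
# Solve A x = z over GF(2) by Gaussian elimination.  Rows of A correspond
# to specified (care) bits of the test cube; x is the free-variable vector
# stored on the tester.  Returns one solution, or None if unencodable.

def solve_gf2(A, z):
    A = [row[:] for row in A]
    z = z[:]
    m, n = len(A), len(A[0])
    pivots, row = [], 0
    for col in range(n):
        piv = next((r for r in range(row, m) if A[r][col]), None)
        if piv is None:
            continue
        A[row], A[piv] = A[piv], A[row]
        z[row], z[piv] = z[piv], z[row]
        for r in range(m):
            if r != row and A[r][col]:
                A[r] = [a ^ b for a, b in zip(A[r], A[row])]
                z[r] ^= z[row]
        pivots.append(col)
        row += 1
    if any(z[r] for r in range(row, m)):
        return None                    # inconsistent: cube is unencodable
    x = [0] * n                        # free variables default to 0
    for r, col in enumerate(pivots):
        x[col] = z[r]
    return x

# Illustrative 3-equation, 4-free-variable system
A = [[1, 0, 1, 0],
     [0, 1, 1, 1],
     [1, 1, 0, 0]]
z = [1, 0, 1]
print(solve_gf2(A, z))                 # -> [1, 0, 0, 0]
```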

The characteristic matrix for a linear decompressor can be obtained by symbolic simulation in which each free variable coming from the tester is represented by a symbol. An example is shown in Figure 2.39, where a sequential linear decompressor containing an LFSR is used. The initial state of the LFSR is represented by free variables X1–X4, and the free variables X5–X10 are shifted in from two channels as the scan chains are loaded. After symbolic simulation, the final values in the scan chains are represented by the equations for Z1–Z12. The corresponding system of linear equations for this linear decompressor is shown in Figure 2.40.

Example of symbolic simulation for linear decompressor.

Figure 2.39. Example of symbolic simulation for linear decompressor.

System of linear equations for the decompressor in Figure 2.39.

Figure 2.40. System of linear equations for the decompressor in Figure 2.39.

The symbolic simulation goes as follows. Assume that the initial seed X1–X4 has already been loaded into the flip-flops. In the first clock cycle, the top flip-flop is loaded with the XOR of X2 and X5, the second flip-flop is loaded with X3, the third flip-flop is loaded with the XOR of X1 and X4, and the bottom flip-flop is loaded with the XOR of X1 and X6. Thus, we obtain Z1 = X2 ⊕ X5, Z2 = X3, Z3 = X1 ⊕ X4, and Z4 = X1 ⊕ X6. In the second clock cycle, the top flip-flop is loaded with the XOR of the contents of the second flip-flop (X3) and X7, the second flip-flop is loaded with the contents of the third flip-flop (X1 ⊕ X4), the third flip-flop is loaded with the XOR of the contents of the first flip-flop (X2 ⊕ X5) and the fourth flip-flop (X1 ⊕ X6), and the bottom flip-flop is loaded with the XOR of the contents of the first flip-flop (X2 ⊕ X5) and X8. Thus, we obtain Z5 = X3 ⊕ X7, Z6 = X1 ⊕ X4, Z7 = X1 ⊕ X2 ⊕ X5 ⊕ X6, and Z8 = X2 ⊕ X5 ⊕ X8. In the third clock cycle, the top flip-flop is loaded with the XOR of the contents of the second flip-flop (X1 ⊕ X4) and X9, the second flip-flop is loaded with the contents of the third flip-flop (X1 ⊕ X2 ⊕ X5 ⊕ X6), the third flip-flop is loaded with the XOR of the contents of the first flip-flop (X3 ⊕ X7) and the fourth flip-flop (X2 ⊕ X5 ⊕ X8), and the bottom flip-flop is loaded with the XOR of the contents of the first flip-flop (X3 ⊕ X7) and X10. Thus, we obtain Z9 = X1 ⊕ X4 ⊕ X9, Z10 = X1 ⊕ X2 ⊕ X5 ⊕ X6, Z11 = X2 ⊕ X3 ⊕ X5 ⊕ X7 ⊕ X8, and Z12 = X3 ⊕ X7 ⊕ X10. At this point, the scan chains are fully loaded with a test cube, so the simulation is complete.
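
This symbolic simulation is mechanical enough to automate: represent each flip-flop's contents as a set of free-variable indices and let XOR be the symmetric difference of sets. The sketch below hard-codes the update rule transcribed from the walk-through above (the top flip-flop receives the second flip-flop XORed with channel 1, and so on) and reproduces the equations for Z1 through Z12; treat the wiring as an illustration of Figure 2.39 rather than an exact netlist.

```python
# Symbolic simulation of the decompressor walk-through: each flip-flop
# holds a set of free-variable indices; XOR is set symmetric difference.

def x(*idx):
    return frozenset(idx)

F = [x(1), x(2), x(3), x(4)]          # initial seed X1..X4 (top to bottom)
channels = [(x(5), x(6)), (x(7), x(8)), (x(9), x(10))]

Z = []
for c1, c2 in channels:               # one iteration per clock cycle
    F = [F[1] ^ c1,                   # top:    second flip-flop XOR channel 1
         F[2],                        # second: third flip-flop
         F[0] ^ F[3],                 # third:  first XOR fourth flip-flop
         F[0] ^ c2]                   # bottom: first flip-flop XOR channel 2
    Z.extend(F)

for i, eq in enumerate(Z, 1):
    print(f"Z{i} = " + " xor ".join(f"X{j}" for j in sorted(eq)))
```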

Combinational Linear Decompressors

The simplest linear decompressors use only combinational XOR networks; each scan chain is fed by the XOR of some subset of the channels coming from the tester [Bayraktaroglu 2001, 2003] [Könemann 2003] [Wang 2004] [Mitra 2006]. The advantage compared with sequential linear decompressors is simpler hardware and control. The drawback is that, to encode a test cube, each scan slice must be encoded using only the free variables that are shifted in from the tester in a single clock cycle (whose number equals the number of channels). The most highly specified scan slices tend to limit the amount of compression that can be achieved, because the number of channels from the tester has to be large enough to encode them. Consequently, it is very difficult to obtain a high encoding efficiency (typically it will be less than 0.25); for the other, less specified scan slices, many of the free variables are wasted because those scan slices could have been encoded with far fewer free variables.

One approach for improving the encoding efficiency of combinational linear decompressors, proposed in [Krishna 2003b], is to dynamically adjust the number of scan chains that are loaded in each clock cycle. For a highly specified scan slice, four clock cycles could be used, with 25% of the scan chains loaded in each cycle; for a lightly specified scan slice, a single clock cycle can be used in which 100% of the scan chains are loaded. This allows a better matching of the number of free variables with the number of specified bits to achieve a higher encoding efficiency. Note that it requires the scan clock to be divided into multiple domains.

Sequential Linear Decompressors

Sequential linear decompressors are based on linear finite-state machines such as LFSRs, cellular automata, or ring generators [Mrugalski 2004]. The advantage of a sequential linear decompressor is that it allows free variables from earlier clock cycles to be used when encoding a scan slice in the current clock cycle. This provides greater flexibility than combinational decompressors and helps avoid the problem of the most highly specified scan slices limiting the overall compression. The more flip-flops that are used in the sequential linear decompressor, the greater the flexibility that is provided. [Touba 2006] classified sequential linear decompressors into two classes:

  1. Static reseeding. The earliest work in this area was based on static LFSR reseeding, a technique that computes a seed (an initial state) for each test cube [Touba 2006]. This seed, when loaded into an LFSR and run in autonomous mode, produces the test cube in the scan chains [Könemann 1991]. This technique achieves compression by storing only the seeds instead of the full test cubes.

    One drawback of using static reseeding for compressing test vectors on a tester is that the tester is idle while the LFSR is running in autonomous mode. One way around this is to use a shadow register for the LFSR to hold the data coming from the tester while the LFSR is running in autonomous mode [Volkerink 2003] [Wohl 2003b].

    Another drawback of static reseeding is that the LFSR must be at least as large as the number of specified bits in the test cube. One way around this is to only decompress a scan window (a limited number of scan slices) per seed [Krishna 2002] [Volkerink 2003] [Wohl 2005].

  2. Dynamic reseeding. [Könemann 2001], [Krishna 2001], and [Rajski 2004] proposed dynamic reseeding approaches. Dynamic reseeding calls for the injection of free variables coming from the tester into the LFSR as it loads the scan chains [Touba 2006]. Figure 2.41 shows a generic example of a sequential linear decompressor that uses b channels from the tester to continuously inject free variables into the LFSR as it loads the scan chains through a combinational linear decompressor, which typically is a combinational XOR network. This network expands the LFSR outputs to fill n scan chains. The advantages of dynamic reseeding compared with static reseeding are that it allows continuous flow operation in which the tester is always shifting in data as fast as it can and is never idle, and it allows the use of a small LFSR.

Typical sequential linear decompressor.

Figure 2.41. Typical sequential linear decompressor.
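
A software skeleton of such a continuous-flow decompressor is shown below. Each shift cycle, b = 2 free variables from the tester are XORed into an 8-bit LFSR while a combinational XOR network expands the LFSR bits onto 16 scan chain inputs; the polynomial, injection points, and network wiring are all illustrative assumptions.

```python
# Skeleton of a dynamic-reseeding decompressor: every shift cycle, b free
# variables from the tester are XORed into the LFSR while its outputs are
# expanded through a combinational XOR network onto n scan-chain inputs.

def shift_cycle(lfsr, tester_bits, taps=(0, 3), inject=(1, 2)):
    """Advance an 8-bit LFSR one cycle with b = 2 injected free variables."""
    fb = 0
    for t in taps:
        fb ^= (lfsr >> t) & 1
    lfsr = ((lfsr << 1) | fb) & 0xFF
    for pos, bit in zip(inject, tester_bits):
        lfsr ^= bit << pos            # continuous free-variable injection
    return lfsr

def expand(lfsr, n=16):
    """Combinational XOR network: chain i sees the XOR of two LFSR bits."""
    return [((lfsr >> (i % 8)) ^ (lfsr >> ((i * 3 + 1) % 8))) & 1
            for i in range(n)]

lfsr = 0b10110001
for cycle in range(3):
    lfsr = shift_cycle(lfsr, tester_bits=(cycle & 1, 1))
    print(expand(lfsr))               # one 16-chain scan slice per cycle
```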

In [Rajski 2004], the authors described a methodology for scan vector compression based on a sequential linear decompressor. Instead of an LFSR, this work uses a ring generator [Mrugalski 2004], which improves encoding flexibility and provides performance advantages. A fixed number of free variables is shifted in when decompressing each test cube. In this case, the control logic is simple because the methodology decompresses every test cube in exactly the same way. The ATPG is constrained so that it generates test cubes that are encodable with this fixed number of free variables.

In [Könemann 2001], the authors described a methodology for scan vector compression in which the number of free variables used to encode each test cube varies. The method requires having an extra channel from the tester to gate the scan clock. For a heavily specified scan slice, this extra gating channel stops the scan shifting for one or more cycles, allowing the LFSR to accumulate a sufficient number of free variables from the tester to solve for the current scan slice before proceeding to the next one. This approach makes it easy to control the number of free variables that the decompressor uses to decompress each test cube. However, the additional gating channel uses some test data bandwidth.

Broadcast-Scan-Based Schemes

Another class of test stimulus compression schemes is based on broadcasting the same value to multiple scan chains. This was first proposed in [Lee 1998] and [Lee 1999]. Because of its simplicity and effectiveness, this method has been used as the basis of many test compression architectures, including some commercial design for testability (DFT) tools.

Broadcast Scan

To illustrate the basic concept of broadcast scan, first consider two independent circuits C1 and C2. Assume that these two circuits have their own test sets T1 = <t11, t12, ..., t1k> and T2 = <t21, t22, ..., t2l>, respectively. In general, a test set may consist of random patterns and deterministic patterns. At the beginning of the ATPG process, random patterns are typically used to detect the easy-to-detect faults. If the same random patterns are used when generating both T1 and T2, then we may have t11 = t21, t12 = t22, ..., up to some ith pattern. After most faults have been detected by the random patterns, deterministic patterns are generated for the remaining difficult-to-detect faults. Generally these patterns have many "don't care" bits. For example, when t1(i+1) is generated, many don't-care bits may remain once no more faults in C1 can be detected. Starting from the bits already assigned for C1, we can assign specific values to the don't-care bits of the pattern to detect faults in C2. Thus, the final pattern is effective in detecting faults in both C1 and C2.
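
The don't-care sharing argument can be phrased as merging compatible test cubes. The sketch below, with cubes written as strings over {0, 1, X}, merges a cube for C1 with a cube for C2 whenever their care bits do not conflict; the cube values are made up for illustration.

```python
# Sketch of don't-care sharing across independent circuits: a pattern
# already constrained for C1 has its remaining X's assigned to detect
# faults in C2.  Cubes are strings over {'0', '1', 'X'}.

def merge_cubes(c1, c2):
    """Merge two test cubes bit by bit; return None on a 0/1 conflict."""
    out = []
    for a, b in zip(c1, c2):
        if a == 'X':
            out.append(b)
        elif b == 'X' or a == b:
            out.append(a)
        else:
            return None               # conflicting care bits
    return ''.join(out)

t1 = "1X0XX1"                         # cube detecting a fault in C1
t2 = "XX01X1"                         # cube detecting a fault in C2
print(merge_cubes(t1, t2))            # -> "1X01X1", serves both circuits
```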

The concept of pattern sharing can be extended to multiple circuits as illustrated in Figure 2.42. One major advantage of using broadcast scan for independent circuits is that all faults that are detectable in all original circuits will also be detectable with the broadcast structure. This is because if one test vector can detect a fault in a stand-alone circuit then it will still be possible to apply this vector to detect the fault in the broadcast structure. Thus, the broadcast scan method will not affect the fault coverage if all circuits are independent. Note that broadcast scan can also be applied to multiple scan chains of a single circuit if all subcircuits driven by the scan chains are independent.

Broadcasting to scan chains driving independent circuits.

Figure 2.42. Broadcasting to scan chains driving independent circuits.

Illinois Scan

If broadcast scan is used for multiple scan chains of a single circuit where the subcircuits driven by the scan chains are not independent, then the property of always being able to detect all faults is lost. The reason for this is that if two scan chains are sharing the same channel, then the ith scan cell in each of the two scan chains will always be loaded with identical values. If some fault requires two such scan cells to have opposite values in order to be detected, it will not be possible to detect this fault with broadcast scan.

To address the problem of some faults not being detected when using broadcast scan for multiple scan chains of a single circuit, the Illinois scan architecture was proposed in [Hamzaoglu 1999] and [Hsu 2001]. This scan architecture consists of two modes of operation, namely a broadcast mode and a serial scan mode, illustrated in Figure 2.43. The broadcast mode is first used to detect most faults in the circuit. During this mode, a scan chain is divided into multiple subchains, called segments, and the same vector can be shifted into all segments through a single shared scan-in input. The response data from all subchains are then compacted by a MISR or another space/time compactor. For the remaining faults that cannot be detected in broadcast mode, the serial scan mode is used, in which any possible test pattern can be applied. This ensures that complete fault coverage can be achieved. The extra logic required to implement the Illinois scan architecture consists of several multiplexers and some simple control logic to switch between the two modes. The area overhead of this logic is typically quite small compared with the overall chip area.

Two modes of Illinois scan architecture: (a) broadcast mode and (b) serial chain mode.

Figure 2.43. Two modes of Illinois scan architecture: (a) broadcast mode and (b) serial chain mode.

The main drawback of the Illinois scan architecture is that no test compression is achieved when it is run in serial scan mode. This can significantly degrade the overall compression ratio if many test patterns must be applied in serial scan mode. To reduce the number of patterns that need to be applied in serial scan mode, multiple-input broadcast scan or reconfigurable broadcast scan can be used. These techniques are described next.

Multiple-Input Broadcast Scan

Instead of using only one channel to drive all scan chains, multiple-input broadcast scan can be used, in which there is more than one channel [Shah 2004]. Each channel drives some subset of the scan chains. If two scan chains must be independently controlled to detect a fault, they can be assigned to different channels. The more channels that are used and the shorter each scan chain is, the easier it becomes to detect faults, because fewer constraints are placed on the ATPG. Determining a configuration that requires the minimum number of channels to detect all detectable faults is thus highly desirable with a multiple-input broadcast scan technique.

Reconfigurable Broadcast Scan

The multiple-input broadcast scan may require a large number of channels to achieve high fault coverage. To reduce the number of channels required, a reconfigurable broadcast scan method can be used. The idea is to provide the capability to reconfigure the set of scan chains that each channel drives. Two possible reconfiguration schemes have been proposed, namely static reconfiguration [Pandey 2002] [Samaranayake 2003] and dynamic reconfiguration [Li 2004] [Sitchinava 2004] [Wang 2004] [Han 2005c]. In static reconfiguration, the reconfiguration can only be done when a new pattern is to be applied. For this method, the target fault set can be divided into several subsets, and each subset can be tested by a single configuration. After testing one subset of faults, the configuration can be changed to test another subset of faults. In dynamic reconfiguration, the configuration can be changed while scanning in a pattern. This provides more reconfiguration flexibility and hence can in general lead to better results with fewer channels. This is especially important for hard cores, for which the test patterns provided by the core vendor cannot be regenerated. The drawback of dynamic reconfiguration versus static reconfiguration is that more control information is needed to reconfigure at the right times, whereas static reconfiguration needs much less control information because reconfiguration occurs only a few times (only after all the test patterns using a particular configuration have been applied).

Figure 2.44 shows an example multiplexer (MUX) network that can be used for dynamic reconfiguration. Depending on the value selected on the control line, particular data at the four input pins are broadcast to the eight scan chain inputs. For instance, when the control line is set to 0 (or 1), the input of scan chain 1 receives data directly from pin 4 (or pin 1).

Example MUX network with control line(s) connected only to select pins of the multiplexers.

Figure 2.44. Example MUX network with control line(s) connected only to select pins of the multiplexers.

Virtual Scan

Rather than using MUX networks for test stimulus compression, combinational logic networks can also be used as decompressors. The combinational logic network can consist of any combination of simple combinational gates, such as buffers, inverters, AND/OR gates, MUXs, and XOR gates. This scheme, referred to as virtual scan, differs from reconfigurable broadcast scan and combinational linear decompression, in which only pure MUX networks and pure XOR networks, respectively, are allowed. The combinational logic network can be specified as a set of constraints or simply as an expanded circuit for ATPG. In either case, the test cubes that ATPG generates are themselves the compressed stimuli for the decompressor. There is no need to solve linear equations, and dynamic compaction can be used effectively during the ATPG process.

The virtual scan scheme was proposed in [Wang 2002] and [Wang 2004]. In these papers, the decompressor was referred to as a broadcaster. The authors also proposed to add additional logic, when required, through VirtualScan inputs to reduce or remove the constraints imposed on the decompressor (broadcaster), thereby yielding little or no fault coverage loss caused by test stimulus compression.

In a broad sense, virtual scan is a generalized class of broadcast scan, Illinois scan, multiple-input broadcast scan, reconfigurable broadcast scan, and combinational linear decompression. The advantage of using virtual scan is that it allows the ATPG to directly search for a test cube that can be applied by the decompressor and allows very effective dynamic compaction. Thus, virtual scan may produce shorter test sets than any test stimulus compression scheme based on solving linear equations; however, because this scheme may impose XOR or MUX constraints directly on the original circuit, it may take more time than schemes based on solving linear equations to generate test cubes or compressed stimuli. Two examples of virtual scan decompression circuits are shown in Figure 2.45.

Example virtual scan decompression circuits: (a) broadcaster using an example XOR network with additional VirtualScan inputs to reduce coverage loss and (b) broadcaster using an example MUX network with additional VirtualScan inputs that can be also connected to data pins of the multiplexers.

Figure 2.45. Example virtual scan decompression circuits: (a) broadcaster using an example XOR network with additional VirtualScan inputs to reduce coverage loss and (b) broadcaster using an example MUX network with additional VirtualScan inputs that can be also connected to data pins of the multiplexers.

Comparison

In this section, we compare the encoding flexibility of different types of combinational decompression techniques: Illinois scan using a pure buffer network, reconfigurable broadcast scan using MUX networks, and linear combinational decompression using a single level of two-input XOR gates or a single level of three-input XOR gates [Dutta 2006].

Consider the bits coming from the tester each clock cycle as a tester slice. The tester slice is expanded every clock cycle to fill a scan slice, whose width equals the number of scan chains. The authors in [Dutta 2006] performed experiments to measure the encoding flexibility of different ways of expanding tester slices into scan slices. Figure 2.46 shows, for each scheme, the percentage of all possible scan slices with a given number of specified bits that can be encoded when a 16-bit tester slice is expanded to a 160-bit scan slice, an expansion ratio (or split ratio) of 10. The x-axis is the number of specified bits in the scan slice, and the y-axis is the percentage of all possible combinations of that number of specified bits that can be encoded. As the graph shows, all the decompression networks can always encode one specified bit; however, as the number of specified bits increases, the probability of being able to encode the scan slice drops. Because Illinois scan has the least encoding flexibility, it has the lowest probability of being able to encode a scan slice. The results for using MUXs are shown for two cases. In one, the control and data lines are separated (i.e., one of the tester channels is dedicated to driving the select lines and the other 15 tester channels drive the data lines). In the other, combinations of all 16 tester channels are used to drive either the select or data lines of the MUXs. The results indicate that greater encoding flexibility can be obtained by not having a separate control line. Another interesting result is that using two-input XOR gates is not as good as using MUXs for small numbers of specified bits, but it becomes better than MUXs when the number of specified bits is 10 or more. Using three-input XORs provides considerably better encoding flexibility, although at the cost of adding more complexity to the ATPG than the other schemes.

Encoding flexibility among combinational decompression schemes.

Figure 2.46. Encoding flexibility among combinational decompression schemes.

The experiments indicate that using a combinational XOR network for test stimulus decompression provides the highest encoding flexibility and hence can provide better compression than the other broadcast-scan-based schemes. The more inputs used per XOR gate, the better the encoding flexibility. Better encoding flexibility allows a more aggressive expansion ratio and allows the ATPG to perform more dynamic compaction, resulting in better compression.
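
Experiments of this kind are straightforward to approximate in software. The sketch below builds a random 16-channel-to-160-chain XOR network in which each chain is fed by the XOR of three channels, then uses Monte Carlo sampling to estimate the fraction of scan slices with s specified bits that are encodable (i.e., for which the induced GF(2) system is consistent). The wiring, seed, and sample counts are illustrative, so the percentages will not match Figure 2.46 exactly.

```python
# Monte Carlo estimate of encoding flexibility for a combinational XOR
# decompressor expanding 16 tester channels to 160 scan chains.

import random

random.seed(1)
CHANNELS, CHAINS = 16, 160

# Each scan chain is the XOR of three randomly chosen channels; a chain's
# row is a 16-bit mask over the tester channels.
rows = [sum(1 << c for c in random.sample(range(CHANNELS), 3))
        for _ in range(CHAINS)]

def encodable(cells):
    """cells: (chain index, required value) pairs.  GF(2) consistency
    check by incremental insertion into a linear basis."""
    basis = {}                        # pivot bit -> (row, rhs)
    for idx, val in cells:
        row, rhs = rows[idx], val
        while row:
            piv = row.bit_length() - 1
            if piv not in basis:
                basis[piv] = (row, rhs)
                break
            brow, brhs = basis[piv]
            row ^= brow
            rhs ^= brhs
        else:
            if rhs:                   # row reduced to 0 = 1: inconsistent
                return False
    return True

for s in (2, 6, 10, 14):
    trials = 2000
    ok = sum(encodable([(i, random.getrandbits(1))
                        for i in random.sample(range(CHAINS), s)])
             for _ in range(trials))
    print(f"{s:2d} specified bits: {100 * ok / trials:.1f}% encodable")
```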

Circuits for Test Response Compaction

Test response compaction is performed at the outputs of the scan chains. The purpose is to reduce the amount of test response that needs to be transferred back to the tester. Whereas test stimulus compression must be lossless, test response compaction can be lossy. A large number of different test response compaction schemes have been presented and described to various extents in the literature [Wang 2006a]. The effectiveness of each compaction scheme depends on its ability to avoid aliasing and tolerate unknown test response bits or X’s. These schemes can be grouped into three categories: (1) space compaction, (2) time compaction, and (3) mixed space and time compaction.

Typically, a space compactor comprises XOR gates, whereas a time compactor is a MISR. A mixed space and time compactor typically feeds the outputs of a space compactor into a time compactor. The difference between space compaction and time compaction is that a space compactor compacts an m-bit-wide output pattern to a p-bit-wide output pattern (where p < m), whereas a time compactor compacts n output patterns to q output patterns (where q < n). This section presents some compaction schemes widely used in industry. Promising techniques for tolerating X's are also included.

Space Compaction

A space compactor is a combinational circuit for compacting m outputs of the circuit under test to n test outputs, where n < m. Space compaction can be regarded as the inverse procedure of linear expansion (which was described in Section 2.4.1.2). It can be expressed as a function of the input vector (i.e., the data being scanned out) and the output vector (the data being monitored):

Y = Φ(X)

where X is an m-bit input vector and Y is an n-bit output vector, n < m. Because each output sequence can contain unknown values (X's), the space compaction scheme in use must be able to mask off or tolerate unknowns in order to prevent faults from going undetected.

X-Compact

X-compact [Mitra 2004] is an X-tolerant response compaction technique that has been used in several designs. The combinational compactor circuit designed using the X-compact technique is called an X-compactor. Figure 2.47 shows an example of an X-compactor with eight inputs and five outputs. It is composed of 4 three-input XOR gates and 11 two-input XOR gates.

An X-compactor with eight inputs and five outputs.

Figure 2.47. An X-compactor with eight inputs and five outputs.

The X-compactor can be represented as a binary matrix (matrix with only 0’s and 1’s) with n rows and k columns; this matrix is called the X-compact matrix. Each row of the X-compact matrix corresponds to a scan chain, and each column corresponds to an X-compactor output. The entry in row i and column j of the matrix is 1 if and only if the jth X-compactor output depends on the ith scan chain output; otherwise, the matrix entry is 0. The corresponding X-compact matrix M of the X-compactor shown in Figure 2.47 is as follows:

[The 8 × 5 X-compact matrix M of Figure 2.47 is not reproduced here; every row of M is nonzero, distinct, and contains an odd number of 1's.]

For a conventional sequential compactor, such as a MISR, there are two sources of aliasing: error masking and error cancellation. Error masking occurs when one or more errors captured in the compactor during a single cycle propagate through the feedback path and cancel out errors in later cycles. Error cancellation occurs when an error bit captured in a shift register is shifted and eventually cancelled by another error bit; it is a type of aliasing specific to multiple-input sequential compactors. Because the X-compactor is a combinational compactor, it is subject only to error masking. To handle aliasing, the following theorems provide a basis for systematically designing X-compactors:

Theorem 2.1

If only a single scan chain produces an error at any scan-out cycle, the X-compactor is guaranteed to produce errors at the X-compactor outputs at that scan-out cycle if and only if no row of the X-compact matrix contains all 0’s.

Theorem 2.2

Errors from any one, two, or odd number of scan chains at the same scan-out cycle are guaranteed to produce errors at the X-compactor outputs at that scan-out cycle if every row of the X-compact matrix is nonzero, distinct, and contains an odd number of 1’s.

If all rows of the X-compact matrix are distinct and contain an odd number of 1's, then the bitwise XOR of any two rows is nonzero, as is the bitwise XOR of any odd number of rows. Hence, errors from any one, any two, or any odd number of scan chains at the same scan-out cycle are guaranteed to produce errors at the compactor outputs at that scan-out cycle. Because all rows of the X-compact matrix of Figure 2.47 are distinct and of odd weight, by Theorem 2.2, simultaneous errors from any one, two, or odd number of scan chains at the same scan-out cycle are guaranteed to be detected.

The X-compact technique is nonintrusive and independent of the test patterns used to test the circuit. Insertion of an X-compactor does not require any major change to the ATPG flow; however, the X-compactor cannot guarantee that errors other than those described in Theorem 2.1 and Theorem 2.2 are detectable.
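
The conditions of Theorems 2.1 and 2.2 are easy to check mechanically for a candidate X-compact matrix. The sketch below verifies them for an 8 × 5 matrix of distinct odd-weight rows chosen in the spirit of Figure 2.47; it is not claimed to be the exact matrix M of the figure.

```python
# Check the Theorem 2.1/2.2 conditions on a candidate X-compact matrix:
# every row (one per scan chain) must be nonzero, distinct, and of odd
# weight.  The 8 x 5 matrix below is an illustrative assignment.

M = [
    [1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [1, 1, 0, 0, 1],
    [1, 0, 1, 1, 0],
    [1, 0, 1, 0, 1],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 0, 1],
]

rows = [tuple(r) for r in M]
assert all(any(r) for r in rows), "a row is all 0's (Theorem 2.1 fails)"
assert len(set(rows)) == len(rows), "rows are not distinct"
assert all(sum(r) % 2 == 1 for r in rows), "a row has even weight"
print("Matrix satisfies the Theorem 2.2 conditions")
```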

X-Blocking

Instead of tolerating X’s on the response compactor, X’s can also be blocked before reaching the response compactor. During design, these potential X-generators (X-sources) can be identified using a scan design rule checker. When an X-generator is likely to reach the response compactor, it must be fixed [Naruse 2003] [Patel 2003]. The process is often referred to as X-blocking or X-bounding.

In X-blocking, the output of an X-source can be blocked anywhere along its propagation paths before the X's reach the compactor. If the X-source is already blocked at a nearby location during test so that its X's cannot reach the compactor, no further fix is needed; however, care must be taken to ensure that no observation points are added between the X-source and the location at which it is blocked.

X-blocking can ensure that no X's will be observed; however, it does not provide a means for observing faults that can only propagate to an observable point through the now-blocked X-source. This can result in fault coverage loss. If the number of such faults for a given bounded X-generator justifies the cost, one or more observation points can be added before the X-source to provide an observable point to which those faults can propagate. These X-blocking or X-bounding methods are discussed extensively in [Wang 2006a].

X-Masking

Although it may not result in fault coverage loss, the X-blocking technique adds area overhead and may impact delay because of the inserted logic. In complex designs, it is not surprising to find that more than 25% of scan cycles contain one or more X's in the test response. Because it is difficult to eliminate these residual X's by DFT, a compactor with high X-tolerance is very attractive. Instead of blocking the X's where they are generated, the X's can also be masked off right before the response compactor [Wohl 2004] [Han 2005a] [Volkerink 2005] [Rajski 2005, 2006]. An example X-masking circuit is shown in Figure 2.48. The mask controller applies a logic value 1 at the appropriate times to mask off any scan output that contains an X.

An example X-masking circuit.

Figure 2.48. An example X-masking circuit.

Mask data are needed to indicate when the masking should take place. These mask data can be stored in a compressed format and decompressed using on-chip hardware. Possible compression techniques include weighted pseudo-random LFSR reseeding and run-length encoding [Volkerink 2005].
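
A minimal model of the masking path is sketched below: the mask controller's bit stream, stored run-length encoded, is decoded on chip, and a mask bit of 1 forces the corresponding scan output to a known 0 before it enters the compactor. The encoding format and the sample response are illustrative assumptions.

```python
# Sketch of X-masking at the compactor inputs, with the mask stream
# stored run-length encoded (one of the options mentioned above).

def rle_decode(runs):
    """runs: list of (bit, count) pairs -> flat list of mask bits."""
    return [bit for bit, count in runs for _ in range(count)]

def apply_mask(scan_out, mask):
    """A mask bit of 1 forces the scan output to a known 0 before compaction."""
    return [0 if m else v for v, m in zip(scan_out, mask)]

scan_out = [1, 0, 'X', 1, 'X', 0, 1, 1]          # 'X' marks unknown values
mask = rle_decode([(0, 2), (1, 1), (0, 1), (1, 1), (0, 3)])
print(apply_mask(scan_out, mask))                # X's at cycles 2 and 4 masked
```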

X-Impact

Although X-compact, X-blocking, and X-masking each can achieve a significant reduction in fault coverage loss caused by X’s present at the inputs of a space compactor, the X-impact technique described in [Wang 2004] is helpful in that it simply uses ATPG to algorithmically handle the impact of residual X’s on the space compactor without adding any extra circuitry.

Example 2.1

An example of algorithmically handling X-impact is shown in Figure 2.49. Here, SC1 to SC4 are scan cells connected to a space compactor composed of XOR gates G7 and G8. Lines a, b,..., h are internal signals, and line f is assumed to be connected to an X-source (memory, non-scan storage element, etc.). Now consider the detection of the stuck-at-0 (SA0) fault f1. Logic value 1 should be assigned to both lines d and e in order to activate f1. The fault effect will be captured by scan cell SC3. If the X on f propagates to SC4, then the compactor output q will become X and f1 cannot be detected. To avoid this outcome, ATPG can try to assign either 1 to line g or 0 to line h in order to block the X from reaching SC4. If it is impossible to achieve this assignment, ATPG can then try to assign 1 to line c, 0 to line b, and 0 to line a in order to propagate the fault effect to SC2. As a result, fault f1 can be detected. Thus, X-impact is avoided by algorithmic assignment without adding any extra circuitry.

Handling of X-impact.

Figure 2.49. Handling of X-impact.

Example 2.2

It is also possible to use the X-impact approach to reduce aliasing. An example of algorithmically handling aliasing is shown in Figure 2.50. Here, SC1 to SC4 are scan cells connected to a compactor composed of XOR gates G7 and G8. Lines a, b,..., h are internal signals. Now consider the detection of the stuck-at-1 fault f2. Logic value 1 should be assigned to lines c, d, and e in order to activate f2, and logic value 0 should be assigned to line b in order to propagate the fault effect to SC2. If line a is set to 1, then the fault effect will also propagate to SC1. In this case, aliasing will cause the compactor output p to have a fault-free value, resulting in an undetected f2. To avoid this outcome, ATPG can try to assign 0 to line a in order to block the fault effect from reaching SC1. As a result, fault f2 can be detected. Thus, aliasing can be avoided by algorithmic assignment without any extra circuitry.

Handling of aliasing.

Figure 2.50. Handling of aliasing.

Time Compaction

A time compactor uses sequential logic, whereas a space compactor uses combinational logic, to compact test responses. Because sequential logic is used, one must make sure that no unknown (X) values from the circuit under test can reach the compactor; otherwise, X-bounding or X-masking must be employed.

The most widely adopted time compactor is the multiple-input signature register (MISR). The MISR uses m extra XOR gates to compact each m-bit-wide output slice into the LFSR simultaneously. The final contents stored in the MISR after compaction are called the (final) signature of the MISR. For more information on signature analysis and MISR design, refer to [Wang 2006a].
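
The MISR update is compactly expressed in software: each cycle the register performs its LFSR shift and then XORs in the m-bit response slice. The sketch below assumes an 8-bit register with arbitrarily chosen feedback taps and a made-up response stream.

```python
# Minimal MISR model: each cycle, the m-bit response slice is XORed into
# the register alongside the LFSR-style feedback shift.  The width, taps,
# and response stream are illustrative.

def misr_step(state, slice_bits, width=8, taps=(7, 3, 2, 1)):
    fb = 0
    for t in taps:
        fb ^= (state >> t) & 1
    state = ((state << 1) | fb) & ((1 << width) - 1)
    return state ^ slice_bits         # inject the m-bit response slice

signature = 0
for slice_bits in (0b10110010, 0b00011100, 0b11110000, 0b01010101):
    signature = misr_step(signature, slice_bits)
print(f"final signature: {signature:08b}")
```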

Mixed Time and Space Compaction

The previous two sections introduced different kinds of compactors for space compaction and time compaction independently. This section introduces mixed time and space compactors [Saluja 1983]. A mixed time and space compactor combines the advantages of a time compactor and a space compactor. Many mixed time and space compactors have been proposed in the literature, including OPMISR [Barnhart 2002], convolutional compactor [Rajski 2005], and q-compactor [Han 2003] [Han 2005a,b].

Because the q-compactor is simple, this section uses it to introduce the conceptual architecture of a mixed time and space compactor. Figure 2.51 shows an example of a q-compactor whose inputs come from scan chain outputs. The spatial part of the q-compactor consists of single-output XOR networks (called spread networks) connected to the flip-flops by means of additional two-input XOR gates interspersed between successive storage elements. As the figure shows, every error in a scan cell can reach the storage elements, and then the output, in several possible ways. The spread network that determines this property is defined in terms of spread polynomials indicating how particular scan chains are connected to the register flip-flops.

An example q-compactor with single output.

Figure 2.51. An example q-compactor with single output.

Unlike a conventional MISR, the q-compactor presented in Figure 2.51 does not have a feedback path; consequently, any error or X injected into the compactor is shifted out after at most five cycles. The shifted-out data are compared with the expected data, and any error is thereby detected.

Example 2.3

Figure 2.51 shows an example of a q-compactor with six inputs, one output, and five storage elements (five per output). For the sake of simplicity, the injector network is shown in a linear form rather than as a balanced tree.

Low-Power Test Compression Architectures

The bandwidth-matching low-power scan design given in Figure 2.9 is also applicable for test compression. The general UltraScan architecture shown in Figure 2.52 uses a time-division demultiplexer and a time-division multiplexer (TDDM/TDM) pair, as well as a clock controller to create the UltraScan circuit [Wang 2005b]. The TDDM is typically a serial-input/parallel-output shift register, whereas the TDM is a parallel-input/serial-output shift register. The clock controller is used to derive the scan shift clock, ck2, by dividing the high-speed clock, ck1, by the demultiplexing ratio. The broadcaster can be a general decompressor using any linear-decompression-based scheme or broadcast-scan-based scheme.

UltraScan architecture.

Figure 2.52. UltraScan architecture.

In this UltraScan circuit, assume that eight high-speed input pads are used as external scan input ports, connected to the inputs of the TDDM circuit. The TDDM circuit uses a high-speed clock, provided externally or generated internally using a phase-locked loop (PLL), to demultiplex the high-speed compressed stimuli into compressed stimuli operating at a slower data rate for scan shift. Similarly, the TDM circuit uses the same high-speed clock to capture and shift out the test responses to high-speed output pads for comparison. Assume the demultiplexing ratio, the ratio between the high-speed data rate and the low-speed data rate, is 10. This means that designers can split the eight original scan chains into 1280 internal scan chains for a possible reduction in test data volume of 16X and in test application time of 160X. In this example, for a desired scan shift clock frequency of 10 MHz, the external I/O pads are operated at 100 MHz. The TDDM/TDM circuit does not compress test data volume but reduces test application time or test pin count by an additional 10X. For low-power applications, however, it is possible to use UltraScan as a low-power test compression architecture to reduce shift power dissipation. In such cases, one can reduce shift power dissipation by 10X by slowing the shift clock frequency to 1 MHz and operating the high-speed I/O pads at 10 MHz. Although shift power dissipation is then reduced by 10X, the reduction in test application time can reach a maximum of only 16X.
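
The quoted 16X and 160X figures follow directly from the stated parameters, as the back-of-the-envelope check below shows: 8 pads demultiplexed by 10 yield 80 low-speed broadcaster inputs, the broadcaster expands those to 1280 chains (16X stimulus compression), and time-division demultiplexing contributes a further 10X.

```python
# Back-of-the-envelope check of the UltraScan numbers quoted above,
# under the stated assumptions (8 pads, demultiplexing ratio 10,
# 1280 internal scan chains).

pads, demux_ratio, internal_chains = 8, 10, 1280

low_speed_channels = pads * demux_ratio                        # 80 broadcaster inputs
data_volume_reduction = internal_chains // low_speed_channels  # 1280 / 80 = 16X
test_time_reduction = data_volume_reduction * demux_ratio      # 16 * 10 = 160X
print(data_volume_reduction, test_time_reduction)              # -> 16 160
```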

Industry Practices

Several test compression products and solutions have been introduced by some of the major DFT vendors in the CAD industry. These products differ significantly with regard to technology, design overhead, design rules, and the ease of use and implementation. A few second-generation products have also been introduced by a few of the vendors. This section summarizes a few of the products introduced by companies such as Cadence [Cadence 2007], Mentor Graphics [Mentor 2007], SynTest [SynTest 2007], Synopsys [Synopsys 2007], and LogicVision [LogicVision 2007].

Current industry solutions can be grouped under two main categories for stimulus decompression. The first category uses linear-decompression-based schemes, whereas the second category employs broadcast-scan-based schemes. The main difference between the two categories is the manner in which the ATPG engine is used. The first category includes products such as ETCompression [LogicVision 2007] from LogicVision, TestKompress [Rajski 2004] from Mentor Graphics, XOR Compression [Cadence 2007] from Cadence, and SOCBIST [Wohl 2003b] from Synopsys. The second category includes products such as OPMISR+ [Cadence 2007] from Cadence, VirtualScan [Wang 2004] and UltraScan [Wang 2005b] from SynTest, and DFT MAX [Sitchinava 2004] from Synopsys.

For designs using linear-decompression-based schemes, test compression is achieved in two distinct steps. During the first step, conventional ATPG is used to generate sparse ATPG patterns (called test cubes), in which dynamic compaction is performed in a nonaggressive manner, leaving unspecified bit locations in each test cube as "X." This is accomplished by not performing the random fill operation on the test cubes aggressively; random fill is normally used to increase the coverage of individual patterns and hence reduce the total pattern count. During the second step, a system of linear equations describing the hardware mapping from the external scan input ports to the internal scan chain inputs is solved in order to map each test cube into a compressed stimulus that can be applied externally. If no mapping is found, a new test cube must be generated.
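To illustrate the second step, the sketch below (our own, not any vendor's algorithm) abstracts the decompressor as a binary matrix A over GF(2), where row i indicates which compressed stimulus bits XOR together to form the value shifted into scan cell i. Each specified bit of the test cube contributes one linear equation, and Gaussian elimination either finds a compressed stimulus or reports that the cube is unencodable:

```python
# Minimal sketch of mapping a test cube to a compressed stimulus through
# a linear decompressor (matrix encoding and names are ours). Row i of A
# lists, over GF(2), which compressed stimulus bits XOR into scan cell i.

def solve_gf2(A, cube):
    """Return stimulus x with A.x = cube on all specified bits, or None.

    A    : list of 0/1 rows, one row per scan cell
    cube : list of 0/1/None entries, None marking a don't-care (X)
    """
    # Only specified (non-X) bits contribute equations.
    rows = [(A[i][:], cube[i]) for i in range(len(cube)) if cube[i] is not None]
    n = len(A[0])
    pivots, r = [], 0
    for col in range(n):
        pivot = next((j for j in range(r, len(rows)) if rows[j][0][col]), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for j in range(len(rows)):      # eliminate col from all other rows
            if j != r and rows[j][0][col]:
                rows[j] = ([a ^ b for a, b in zip(rows[j][0], rows[r][0])],
                           rows[j][1] ^ rows[r][1])
        pivots.append(col)
        r += 1
    # A zero row with a nonzero right-hand side means no mapping exists,
    # and the ATPG must generate a different test cube.
    if any(not any(row) and rhs for row, rhs in rows):
        return None
    x = [0] * n                         # free variables default to 0
    for i, col in enumerate(pivots):
        x[col] = rows[i][1]
    return x
```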

For designs using broadcast-scan-based schemes, only a single step is required to perform test compression. This is achieved by embedding the constraints introduced by the decompressor in the ATPG tool, such that the tool operates under much more restricted constraints. Hence, whereas in conventional ATPG each individual scan cell can be set to 0 or 1 independently, for broadcast-scan-based schemes the values to which related scan cells can be set are constrained. Thus, a limitation of this solution is that in some cases the constraints among scan cells can preclude some faults from being tested. These faults are typically tested as part of a later top-up ATPG process if required, as is done with linear-decompression-based schemes.
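The nature of these constraints is easiest to see for plain broadcast scan, in which every scan chain driven by the same external pin receives the same bit stream. The sketch below (our own encoding, not a vendor tool's) checks whether a test cube is compatible with a given pin-to-chain grouping and, if so, derives the per-pin stimulus:

```python
# Sketch of the broadcast-scan constraint (encoding is ours): scan cells
# at the same shift position of chains sharing an external pin must
# agree, with an X (None) compatible with any value.

def broadcast_encode(test_cube, groups, chain_len):
    """test_cube: {(chain, position): 0 or 1} for specified bits only.
    groups: one list of chain indices per external pin.
    Returns per-pin bit streams, or None if the cube is unencodable
    (such faults are left for a later top-up ATPG pass)."""
    streams = []
    for chains in groups:
        stream = [None] * chain_len
        for c in chains:
            for pos in range(chain_len):
                v = test_cube.get((c, pos))
                if v is None:
                    continue
                if stream[pos] is not None and stream[pos] != v:
                    return None   # two chains on one pin need opposite bits
                stream[pos] = v
        streams.append(stream)
    return streams

# Chains 0 and 1 share pin 0; chains 2 and 3 share pin 1.
cube = {(0, 0): 1, (1, 2): 0, (2, 1): 1, (3, 1): 1}
print(broadcast_encode(cube, [[0, 1], [2, 3]], chain_len=5))
```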

On the response compaction side, industry solutions have utilized either space compactors, such as XOR networks, or time compactors, such as MISRs, to compact the test responses. Currently, space compactors have a higher acceptance rate in the industry because, unlike time compactors, they do not require guaranteeing that no unknown (X) values from the circuit under test reach the compactor.
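The difference can be seen in toy models of the two compactor styles (the MISR feedback taps and network sizes below are illustrative only). A MISR folds every response slice into its running state, so a single X corrupts the final signature unless X sources are blocked; a combinational XOR network compacts one slice per cycle, so an X affects only the outputs it reaches in that cycle:

```python
# Toy models (ours) of a time compactor (MISR) and a space compactor
# (XOR network); feedback taps and sizes are illustrative only.

def misr_step(state, response_slice, taps=(0, 2)):
    """One MISR cycle: feed back the tapped bits, shift, and XOR in
    the parallel response slice. An X anywhere poisons the signature."""
    feedback = 0
    for t in taps:
        feedback ^= state[t]
    shifted = [feedback] + state[:-1]
    return [s ^ b for s, b in zip(shifted, response_slice)]

def xor_compact(response_slice, fanins=((0, 1), (2, 3))):
    """Space compaction of one slice; each output XORs a few chains,
    so an X only masks the outputs it feeds during this cycle."""
    return [response_slice[i] ^ response_slice[j] for i, j in fanins]

state = [0, 0, 0, 0]
for response_slice in ([1, 0, 1, 1], [0, 1, 1, 0]):
    state = misr_step(state, response_slice)
print(state)                      # final signature after two cycles
print(xor_compact([1, 0, 1, 1])) # per-cycle compacted outputs
```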

A summary of the different compression architectures used in the commercial products is shown in Table 2.2. Six products from five DFT companies are included. In June 2006, Cadence added XOR Compression as an alternative to the OPMISR+ product described in [Wang 2006a].

Table 2.2. Summary of Industry Practices for Test Compression

Industry Practices           | Stimulus Decompressor                        | Response Compactor
-----------------------------|----------------------------------------------|----------------------------------
XOR Compression or OPMISR+   | Combinational XOR network or fanout network  | XOR network with or without MISR
TestKompress                 | Ring generator                               | XOR network
VirtualScan                  | Combinational logic network                  | XOR network
DFT MAX                      | Combinational MUX network                    | XOR network
ETCompression                | (Reseeding) PRPG                             | MISR
UltraScan                    | TDDM                                         | TDM

It is evident that the solutions offered by the current EDA DFT vendors are quite diverse with regard to stimulus decompression and response compaction. For stimulus decompression, OPMISR+, VirtualScan, and DFT MAX are broadcast-scan-based, whereas TestKompress and ETCompression are linear-decompression-based. For response compaction, OPMISR+ and ETCompression can include MISRs, whereas the other solutions adopt only (X-tolerant) XOR networks. The UltraScan TDDM/TDM architecture can be implemented on top of any test compression solution to further reduce test application time and test pin count. What all six products have in common is that each provides its own diagnostic solution.

Generally speaking, any modern ATPG compression program supports at-speed clocking schemes used in its corresponding at-speed scan architecture. For at-speed delay fault testing, ETCompression currently uses a skewed-load based at-speed test compression architecture for ATPG. The product can also support the double-capture clocking scheme through service. All other ATPG compression products—including OPMISR+, TestKompress, VirtualScan, DFT MAX, and UltraScan—support the hybrid at-speed test compression architecture by using both skewed-load (a.k.a. launch-on-shift) and double-capture (a.k.a. launch-on-capture). In addition, almost every product supports inter-clock-domain delay fault testing for synchronous clock domains. A few on-chip clock controllers for detecting these inter-clock-domain delay faults at-speed have been proposed in [Beck 2005], [Furukawa 2006], [Nadeau-Dostie 2006], and [Nadeau-Dostie 2007].

The clocking schemes used in these commercial products are summarized in Table 2.3. It should be noted that compression schemes may be limited in effectiveness if there are a large number of unknown response values, a problem that can be exacerbated during at-speed testing when many paths fail to meet the timing used for test.

Table 2.3. Summary of Industry Practices for At-Speed Delay Fault Testing

Industry Practices           | Skewed-Load | Double-Capture
-----------------------------|-------------|-----------------
XOR Compression or OPMISR+   | Yes         | Yes
TestKompress                 | Yes         | Yes
VirtualScan                  | Yes         | Yes
DFT MAX                      | Yes         | Yes
ETCompression                | Yes         | Through service
UltraScan                    | Yes         | Yes

Random-Access Scan Design

Our discussions in previous sections have mainly focused on serial scan design, which requires shifting data into and out of a scan cell through adjacent scan cells. Although serial scan design has been one of the most successful DFT techniques in use and has minimum routing overhead, one inherent drawback of this architecture is its test power dissipation. Test power consists of shift power and capture power. Because of the serial shift nature, excessive heat can accumulate and damage the circuit under test. Excessive dynamic power during capture can also cause IR drop and induce yield loss. In addition, any fault present in a scan chain makes fault diagnosis difficult, because the fault can mask out all scan cells in the same scan chain. When scan chain faults are combined with combinational logic faults, the fault diagnosis process becomes even more complex.

All of these problems result from the underlying architecture used for serial scan design. Random-access scan (RAS) [Ando 1980] offers a promising solution. Rather than using various hardware and software approaches to reduce test power dissipation in serial scan design [Girard 2002], random-access scan attempts to alleviate these problems by making each scan cell randomly and uniquely addressable, similar to storage cells in a random-access memory (RAM). Because each scan cell is randomly and uniquely addressable, random-access scan design can reduce shift power dissipation, at the cost of an increase in routing overhead. In addition, because there are no scan chains, scan chain diagnosis is no longer an issue; one can simply apply combinational logic diagnosis techniques to locate faults within the combinational logic [Wang 2006a]. What has to be explored next is whether random-access scan can further reduce capture power dissipation.

In this section, we first introduce the basic concepts of random-access scan design. Next, RAS architectures along with their associated scan cell designs to reduce routing overhead are presented. As these RAS architectures do not specifically target test cost reduction, we then examine test compression RAS architectures that further reduce test application time and test data volume. At-speed RAS architectures are discussed last.

Random-Access Scan Architectures

Traditional RAS design [Ando 1980] is illustrated in Figure 2.53. All scan cells are organized into a two-dimensional array, where they can be accessed individually for observing (reading) or updating (writing) in any order. This full random-access capability is achieved by decoding a full address with a row (X) decoder and a column (Y) decoder. A ⌈log2 n⌉-bit address shift register, where n is the total number of scan cells, is used to specify which scan cell to access. A scan-in port SI is connected to all scan cells, and a scan-out port SO is used to observe the state of each scan cell.

Traditional random-access scan architecture.

Figure 2.53. Traditional random-access scan architecture.

Therefore, the RAS design can access any selected scan cell without changing the states of other scan cells. This significantly reduces shift power dissipation, because there is no need to shift data into and out of the selected scan cell through scan chains; data in each scan cell can be directly observed and updated through the SO and SI ports, respectively. As opposed to serial scan design, however, there is no guarantee that the RAS design can further reduce the test application time or test data volume if a large number of scan cells has to be updated for each test vector or the addresses of scan cells to be accessed consecutively have little overlap.
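A toy software model (ours, not from [Ando 1980]) makes this concrete: an access decodes a row/column address and touches exactly one cell, whereas a serial scan shift would clock every cell in a chain:

```python
# Toy model (ours) of random-access scan: each scan cell is selected by
# a decoded (row, column) address, so a read or write toggles at most
# one cell instead of clocking an entire scan chain.

import math

class RASArray:
    def __init__(self, rows, cols):
        self.cells = [[0] * cols for _ in range(rows)]
        # Width of the address shift register in the traditional design.
        self.addr_bits = math.ceil(math.log2(rows * cols))

    def read(self, row, col):         # observe one cell via SO
        return self.cells[row][col]

    def write(self, row, col, bit):   # update one cell via SI
        self.cells[row][col] = bit

ras = RASArray(rows=32, cols=32)      # 1024 scan cells, 10 address bits
ras.write(3, 17, 1)
print(ras.read(3, 17), ras.addr_bits)
```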

Although RAS design can easily reduce shift power dissipation and simplify fault diagnosis, a major disadvantage of this architecture is its high area and routing overhead, which unfortunately kept the approach from practical application after it was invented in the 1980s. Only recently, as silicon gates have become inexpensive in nanometer VLSI designs, has RAS design started to regain momentum.

A traditional RAS scan cell design proposed in [Wagner 1984] is shown in Figure 2.54a. An additional multiplexer is placed at the SI port of the muxed-D scan cell to either update data from the external SI port or keep its current state. This is controlled by the address select signal AS. Each scan cell output Q is directly fed to a multiple-input signature register (MISR) for output response compaction. Because it is required to broadcast the external SI port to all scan cells and connect each scan cell output to the MISR, routing becomes a serious problem. A toggle scan cell design proposed in [Mudlapur 2005], illustrated in Figure 2.54b, eliminates the external SI port and connects selected scan cell outputs to a bus leading to an external SO port. Because this scheme eliminates the global SI port, a clear (reset) mechanism is required to initialize all scan cells before testing, which introduces additional area and routing overhead.

Traditional random-access scan cell designs: (a) traditional RAS scan cell design and (b) toggle scan cell design.

Figure 2.54. Traditional random-access scan cell designs: (a) traditional RAS scan cell design and (b) toggle scan cell design.

Progressive Random-Access Scan Design

A progressive random-access scan (PRAS) design [Baik 2005a] was proposed in an attempt to alleviate the problems associated with the traditional serial scan design. The PRAS scan cell, as shown in Figure 2.55a, has a structure similar to that of a static random access memory (SRAM) cell or a grid-addressable latch [Susheel 2002], which has significantly smaller area and routing overhead than the traditional scan cell design [Ando 1980]. In normal mode, all horizontal row enable (RE) signals are set to 0, forcing each scan cell to act as a normal D flip-flop. In test mode, to capture the test response from D, the RE signal is set to 0 and a pulse is applied on clock Φ, which causes the value on D to be loaded into the scan cell. To read out the stored value of the scan cell, clock Φ is held at 1, the RE signal for the selected scan cell is set to 1, and the content of the scan cell is read out through the bidirectional scan data signals SD and SD̄. To write or update a scan value into the scan cell, clock Φ is held at 1, the RE signal for the selected scan cell is set to 1, and the scan value and its complement are applied on SD and SD̄, respectively.

Progressive random-access scan design: (a) PRAS scan cell design, (b) PRAS architecture, and (c) PRAS test procedure.

Figure 2.55. Progressive random-access scan design: (a) PRAS scan cell design, (b) PRAS architecture, and (c) PRAS test procedure.

The PRAS architecture is shown in Figure 2.55b, where rows are enabled in a fixed order, one at a time, by rotating a 1 in the row enable shift register. That is, it is only necessary to supply a column address to specify which scan cell in an enabled row to access. The length of the column address, which is ⌈log2 m⌉ for a circuit with m columns, is considerably shorter than a full (row and column) address; therefore, the column address is provided in parallel in one clock cycle instead of providing a full address in multiple clock cycles. This reduces test application time. To minimize the need to shift out test responses, the scan cell outputs are compressed with a multiple-input signature register (MISR).

The test procedure of the PRAS design is shown in Figure 2.55c. For each test vector, test stimulus application and test response compression are conducted in an interleaved manner while the test mode signal TM is enabled. That is, all scan cells in a row are first read into the MISR simultaneously for compression, and then each scan cell in the row is checked and updated if necessary. Repeating this operation for all rows compresses the test response to the previous test vector into the MISR and loads the next test vector into the scan cells. Next, TM is disabled and the normal clock is applied to capture the test response. As the figure shows, the smaller the number of scan cells to be updated for each row, the shorter the test application time. This can be achieved by reducing the Hamming distance between the next test vector and the test response to the previous test vector; possible solutions include test vector reordering and test vector modification [Baik 2004; 2005a,b; 2006] [Le 2007].
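The interleaved procedure can be rendered as the loop below (our sketch, not the authors' code); the cycle count it returns grows with the Hamming distance between the captured response and the next stimulus, which is exactly what vector reordering and modification try to minimize:

```python
# Sketch (ours) of the PRAS per-vector test loop: each row is read into
# the MISR in one cycle, then only mismatching cells are rewritten, so
# the cost per vector tracks the Hamming distance between the captured
# response and the next test stimulus.

def pras_apply_vector(array, next_vector, misr_compress):
    """array, next_vector: lists of rows, each row a list of 0/1 bits."""
    cycles = 0
    for r, row in enumerate(array):
        misr_compress(row)            # read the whole row into the MISR
        cycles += 1
        for c, bit in enumerate(row):
            if bit != next_vector[r][c]:
                array[r][c] = next_vector[r][c]   # column-addressed write
                cycles += 1
    return cycles   # afterward, TM is disabled and a normal clock captures

signature = []                        # stand-in for a real MISR
state = [[0, 0, 1], [1, 0, 0]]        # captured response to vector i
nxt = [[0, 1, 1], [1, 0, 1]]          # stimulus for vector i+1
print(pras_apply_vector(state, nxt, lambda row: signature.append(tuple(row))))
# 2 row reads + 2 cell updates = 4 cycles
```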

Shift-Addressable Random-Access Scan Design

The PRAS design has demonstrated that it can reduce shift power dissipation by 100X while keeping routing overhead within 10%. One difficulty is the complexity of the test control logic needed to update the selected scan cells one at a time. In fact, when a PRAS design contains 100 or more rows of scan cells, the reduction in shift power dissipation can still reach 100X even if all columns are updated simultaneously [Wang 2006c].

The shift-addressable random-access scan (STAR) architecture proposed in [Wang 2006c] uses only one row (X) decoder and supports two or more SI and SO ports. All rows are enabled (selected) in a fixed order, one at a time, by rotating a 1 through the row enable shift register. When a row is enabled, all columns (scan cells) associated with the enabled row are selected at the same time; therefore, there is no need to provide a column address. This reduces test application time compared with traditional RAS designs, which require a column address and write selected scan cells one at a time [Ando 1980] [Baik 2005a] [Mudlapur 2005] [Hu 2006]. The STAR architecture and its associated test procedure are shown in Figure 2.56. The STAR architecture can use any RAS scan cell design proposed in [Wagner 1984], [Baik 2005a], or [Mudlapur 2005].

Test procedure for a shift-addressable random-access scan (STAR) design: (a) STAR architecture and (b) STAR test procedure.

Figure 2.56. Test procedure for a shift-addressable random-access scan (STAR) design: (a) STAR architecture and (b) STAR test procedure.
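The row-selection mechanism is simple enough to model in a few lines (our sketch): a one-hot value rotates through the row enable shift register, and every scan cell in the enabled row is accessed in the same cycle, so no column address is ever supplied:

```python
# Sketch (ours) of STAR row selection: a single 1 rotates through the
# row enable shift register, enabling one full row of scan cells per
# cycle with no column address.

def star_rows(num_rows):
    enable = [1] + [0] * (num_rows - 1)    # one-hot row enable register
    while True:
        yield enable.index(1)              # index of the enabled row
        enable = enable[-1:] + enable[:-1] # rotate the 1 to the next row

rows = star_rows(4)
print([next(rows) for _ in range(6)])      # [0, 1, 2, 3, 0, 1]
```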

It has been reported in [Baik 2005a] and [Mudlapur 2005] that RAS design can easily provide a 100X reduction in shift power dissipation. Because each scan cell is updated only when needed, a 2X to 3X reduction in test data volume and test application time is also achieved. These results indicate that RAS design achieves a significant reduction in shift power dissipation, as well as a good reduction in test data volume and test application time. Whether RAS design can further reduce capture power dissipation remains a research topic.

Test Compression RAS Architectures

Although RAS design has proven to be effective in reducing shift power dissipation at the cost of an increased area and routing overhead, reduction in test data volume and test application time is not significant. Since 2000, many test compression schemes have been developed to drastically reduce test data volume and test application time [Wang 2006a]. Even though these schemes are not aimed at reducing test power and mainly target serial scan design, they are applicable for use in RAS design.

All of these test compression schemes require that the design contain many short scan chains and data in the scan cells located on the same row be shifted in and out of the scan cells simultaneously within one shift clock cycle. Because most RAS architectures adopt the traditional RAS design architecture given in [Ando 1980] that updates the states of scan cells one at a time, they could substantially increase test application time.

The STAR architecture shown in Figure 2.56a overcomes the problem by allowing all scan cells on the same row to be accessed simultaneously. A general test compression RAS architecture based on the STAR architecture, called STAR compression architecture [Wang 2006c], is shown in Figure 2.57.

STAR compression architecture.

Figure 2.57. STAR compression architecture.

A decompressor is used to decompress the ATE-supplied stimuli, and a compactor is used to compact the test responses. In principle, the decompressor can be a pure buffer network as used in broadcast scan [Lee 1999] or Illinois scan [Hamzaoglu 1999], a MUX network as proposed in reconfigurable broadcast scan [Pandey 2002] [Sitchinava 2004], a broadcaster as practiced in virtual scan [Wang 2004], a linear decompressor as used in [Wohl 2003b] and [Rajski 2004], or a coding-based decompressor [Hu 2005]. The compactor can be a MISR, an XOR network, or an X-tolerant XOR network [Mitra 2004].

One important feature of the STAR compression architecture is its ability to reconfigure the RAS scan cells into a serial scan mode. The purpose is to uncover faults that go undetected because of the decompression and compaction process. Unlike serial scan design, where multiplexers are inserted to merge two or more short scan chains into a long scan chain, the reconfiguration in RAS design is accomplished by adding a multiplexer at the scan input of each column (short scan chain) and an AND gate at the scan output of each column. The multiplexer allows transmitting the scan-in stimulus from one column to another, whereas the AND gate enables or disables the scan-out test response of the column being fed to the compactor in serial scan mode. One or more additional pins may be required to support the reconfiguration. Figure 2.58 shows the reconfigured STAR compression architecture. This architecture is also helpful for fault diagnosis, failure analysis, and yield enhancement.

Reconfigured STAR compression architecture.

Figure 2.58. Reconfigured STAR compression architecture.

At-Speed RAS Architectures

In addition to the major advantages of providing significant shift power reduction and facilitating fault diagnosis, RAS design offers an additional benefit for at-speed delay fault testing. Typically, the launch-on-shift (also known as skewed-load) or launch-on-capture (also known as double-capture) capture-clocking scheme is employed for at-speed testing of path-delay faults and transition faults in serial scan design. Testing for a delay fault requires applying a pair of test vectors in an at-speed fashion. Either scheme requires generating a logic value transition at a signal line or at the source of a path in order to be able to capture the circuit response to this transition at the circuit’s operating frequency.

In random-access scan design, these delay tests can be easily generated and applied using an enhanced-scan scheme [Malaiya 1983] [Glover 1988] [Dervisoglu 1991] [Kuppuswamy 2004] [Le 2007]. Rather than generating a functionally dependent pair of vectors, a single-input-change pair of vectors can be easily generated by combinational ATPG. [Gharaybeh 1997] showed that any testable path can be tested by a single-input-change vector pair; hence, an enhanced-scan-based at-speed RAS architecture allows RAS design to maximize delay fault detection capability. This is in sharp contrast to using launch-on-shift or launch-on-capture in serial scan design, which relies on scan chains to shift in the initialization vector. In enhanced scan, the second test vector can be applied by simply flipping the state of the selected scan cell. Moreover, no additional hardware is required, as opposed to applying the enhanced-scan scheme to serial scan design [Dervisoglu 1991] [Wang 2006a] [Le 2007].
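In a RAS design, applying such a pair therefore reduces to loading the first vector, toggling the single addressed scan cell to launch the transition, and capturing at speed. The sketch below (our abstraction, with the circuit under test reduced to a callback) shows the sequence:

```python
# Abstract sketch (ours) of an enhanced-scan delay test in a RAS design:
# the second vector of a single-input-change pair is obtained by merely
# toggling the addressed scan cell, which launches the transition; the
# at-speed capture then samples the circuit's response.

def apply_delay_test(cells, flip_addr, capture_at_speed):
    """cells: dict mapping (row, col) -> bit, preloaded with vector V1.
    flip_addr: address of the one scan cell whose toggle forms V2."""
    cells[flip_addr] ^= 1              # launch: V2 differs in a single bit
    return capture_at_speed(cells)     # capture the response at speed

v1 = {(0, 0): 0, (0, 1): 1, (1, 0): 1}
print(apply_delay_test(v1, (0, 0), lambda c: sorted(c.items())))
```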

Although enhanced scan offers these benefits, the vector count can be a problem. A design can contain millions of delay faults and hundreds of thousands of scan cells, and generating single-input-change vector pairs may not yield a sufficiently compacted test set. For RAS designs containing many noninteracting clock domains, the enhanced-scan scheme also fails to generate a single vector set that tests these clock domains simultaneously.

One approach to overcoming the high vector count problem is to use the enhanced-scan cell, which adds a latch to a storage element, as shown in Figure 2.7 [Dervisoglu 1991] [Kuppuswamy 2004]. This allows the application of two independent test vectors in two successive clock cycles. The drawback is that this enhanced-scan-based at-speed RAS architecture adds more hardware overhead to the RAS design.

Another approach is to employ the conventional launch-on-capture scheme. The launch-on-capture-based at-speed RAS architecture allows applying multiple transitions with the initialization vector, thereby reducing the vector count. Because RAS design does not contain scan chains, the launch-on-shift clocking scheme is not applicable to RAS design. One promising hybrid at-speed RAS architecture would first support launch-on-capture and then supplement it with enhanced scan when required to maximize the delay fault coverage. To ease silicon debug and failure analysis, it may be advantageous to use a faster-than-at-speed RAS architecture, which applies delay tests to the clock domain being tested at faster than its operating speed. This can catch small delay defects that escape traditional transition fault tests [Kruseman 2004] [Amodeo 2005].

Concluding Remarks

Scan and logic built-in self-test (BIST) are currently the two most widely used design-for-testability (DFT) techniques for ensuring circuit testability and product quality. For completeness, we first covered a number of fundamental scan and logic BIST architectures in use today [Wang 2006a]. Because a scan design can now contain tens to hundreds of millions of transistors and a test set with 100% single stuck-at fault coverage generated by scan ATPG can no longer guarantee adequate product quality, at-speed delay testing and test compression have rapidly become a requirement for designs at 90 nanometers and below. Many physical failures manifest themselves as delay faults, requiring at-speed delay test patterns for their detection [Ferhani 2006]. As the need for additional test sets to detect manufacturing faults grows, test compression is becoming crucial for containing the explosive growth in test data volume and test application time.

Because scan ATPG assumes a single-fault model, physical failures that cannot be modeled as single faults can potentially escape detection [Gizopoulos 2006]. To detect these physical failures, logic BIST is of growing importance in VLSI manufacturing, given its major advantages of performing on-chip self-test and in-system remote diagnosis. We anticipate that for VLSI designs at 65 nanometers and below, logic BIST and low-power testing will gain more industry acceptance. Although the STUMPS-based architecture [Bardell 1982] is the most popular logic BIST architecture now practiced for scan-based designs, the effort required to implement the BIST circuitry and the fault coverage loss from using pseudo-random patterns have prevented the architecture from being widely used across all industries.

As semiconductor manufacturing technology moves into the nanometer design era, it remains to be seen how the CBILBO-based architecture proposed in [Wang 1986], which always guarantees 100% single stuck-at fault coverage and can run 10 times more BIST patterns than the STUMPS-based architecture, will perform. Challenges lie ahead as to whether pseudo-exhaustive testing will become a preferred BIST pattern generation technique and whether random-access scan will prove to be a promising DFT technique for test power reduction.

Exercises

2.1

(Muxed-D Scan Cell) Show a possible CMOS implementation of the muxed-D scan cell shown in Figure 2.3a.

2.2

(Low-Power Muxed-D Scan Cell) Design a low-power version of the muxed-D scan cell given in Figure 2.3a by adding gated-clock logic, which includes a lock-up latch to control the clock port.

2.3

(At-Speed Scan) Assume that a scan design contains three clock domains running at 100 MHz, 200 MHz, and 400 MHz, respectively. In addition, assume that the clock skew between any two clock domains is manageable. List all possible at-speed scan ATPG methods, and compare their advantages and disadvantages in terms of fault coverage and test pattern count.

2.4

(At-Speed Scan) Describe two major capture-clocking schemes for at-speed scan testing, and compare their advantages and disadvantages. Also discuss what will happen if three or more captures are used.

2.5

(BIST Pattern Generation) Implement a period-8 in-circuit test pattern generator (TPG) using a binary counter. Compare its advantages and disadvantages with using a Johnson counter (twisted-ring counter).

2.6

(BIST Pattern Generation) Implement a period-31 in-circuit test pattern generator (TPG) using a modular linear feedback shift register (LFSR) with characteristic polynomial f(x) = 1 + x^2 + x^5. Convert the modular LFSR into a muxed-D scan design with minimum area overhead.

2.7

(BIST Pattern Generation) Implement a period-31 in-circuit test pattern generator (TPG) using a five-stage cellular automaton (CA) with construction rule = 11001, where “0” denotes a rule 90 cell and “1” denotes a rule 150 cell. Convert the CA into an LSSD design with minimum area overhead.

2.8

(Cellular Automata) Derive a construction rule for a cellular automaton of length 54, and then derive construction rules up to length 300 in order to match the list of primitive polynomials up to degree 300 reported in [Bardell 1987].

2.9

(Test Point Insertion) For the circuit shown in Figure 2.22, calculate the detection probabilities, before and after test point insertion, for a stuck-at-0 fault present at input X3 and a stuck-at-1 fault present at input X6 simultaneously.

2.10

(BIST Response Compaction) Discuss in detail what errors can be and cannot be detected by a MISR.

2.11

(STUMPS Versus CBILBO) Compare the performance of a STUMPS design and a CBILBO design. Assume that both designs operate at 400 MHz and that the circuit under test (CUT) has 100 scan chains each having 1000 scan cells. Calculate the test time required to test each design when 100,000 test patterns are to be applied. In general, the scan shift frequency is much slower than a circuit’s operating speed. Assuming the scan shift frequency is 50 MHz, calculate the test time for the STUMPS design again. Explain further why the STUMPS-based architecture is gaining more industry acceptance than the CBILBO-based architecture.

2.12

(Scan Versus Logic BIST Versus Test Compression) Compare the advantages and disadvantages of a scan design, a logic BIST design, and a test compression design, in terms of fault coverage, test application time, test data volume, and area overhead.

2.13

(Test Stimulus Compression) Given a circuit with four scan chains, each having five scan cells, and with a set of test cubes listed as follows:

[Test cube table not reproduced here.]
  1. Design the multiple-input broadcast scan decompressor that fulfills the test cube requirements.

  2. Explain the compression ratio.

  3. The assignment of X’s will affect the compression performance dramatically. Give one X-assignment example that will unfortunately lead to no compression with this multiple-input broadcast scan decompressor.

2.14

(Test Stimulus Compression) Derive mathematical expressions for the following in terms of the number of tester channels, c, and the expansion ratio, k.

  1. The probability of encoding a scan slice containing 2 specified bits with Illinois scan.

  2. The probability of encoding a scan slice containing 3 specified bits where each scan chain is driven by the XOR of a unique combination of 2 tester channels, such that there are a total of c(c-1)/2 scan chains.

2.15

(Test Stimulus Compression) For the sequential linear decompressor shown in Figure 2.39 whose corresponding system of linear equations is shown in Figure 2.40, find the compressed stimulus X1 – X10 necessary to encode the following test cube: < Z1,..., Z12 > = < 0-0-1-0--011 >.

2.16

(Test Stimulus Compression) For the MUX network shown in Figure 2.44 and then the XOR network shown in Figure 2.45a, find the compressed stimulus at the network inputs necessary to encode the following test cube: <1-0---01>.

2.17

(Test Response Compaction) Determine how many errors and how many unknowns (X's) can be detected or tolerated by the X-tolerant compactor and the q-compactor shown in Figures 2.47 and 2.51, respectively.

2.18

(Test Response Compaction) For the X-compact matrix of the X-compactor given as follows:

[X-compact matrix not reproduced here.]
  1. What is the compaction ratio?

  2. Which outputs after compaction are affected by the second scan chain output?

  3. How many errors can be detected by the X-compactor?

2.19

(Random-Access Scan) Assume that a sequential circuit with n storage elements has been reconfigured as a scan design as shown in Figure 2.3b and two random-access scan designs as shown in Figures 2.53 and 2.56. In addition, assume that the scan design has m balanced scan chains and that a test vector vi is currently loaded into the scan cells of the three scan designs. Now consider the application of the next test vector vi+1. Assume that vi+1 and the response of vi are different in d bits. Calculate the number of clock cycles required for applying vi+1 to each of the three designs.

2.20

(A Design Practice) Write a C/C++ program to find the smallest number of clock groups in clock grouping.

2.21

(A Design Practice) Assume that a scan clock and a PLL clock operate at 20 MHz and 50 MHz, respectively. Write RTL code in the Verilog hardware description language (HDL) to design the on-chip clock controller shown in Figure 2.14. Revise the RTL code to generate two pairs of staggered double-capture clock pulses, each pair controlling one clock domain.

2.22

(A Design Practice) Design a STUMPS-based logic BIST system in Verilog RTL code using staggered double-capture for a circuit with two interacting clock domains, one with 10 inputs and 20 outputs, the other with 16 inputs and 18 outputs. Report the circuit's BIST fault coverage in increments of 10,000 patterns up to 100,000 pseudo-random patterns.

2.23

(A Design Practice) Repeat Exercise 2.22, but design a CBILBO-based logic BIST system. Compare the observed BIST fault coverage with the BIST fault coverage obtained in Exercise 2.22, and explain why the two methods produce the same or different fault coverage numbers.

2.24

(A Design Practice) Use the ATPG programs and user's manuals contained on the Companion Web site to generate test patterns for the three largest ISCAS-1985 combinational circuits and record the number of test patterns needed for each circuit. Then combine the three circuits into one circuit by connecting their inputs in such a way that the first inputs of the three circuits are connected to the first shared input of the combined circuit, the second inputs of the three circuits are connected to the second shared input, etc. Use the ATPG tool again to generate test patterns for this combined circuit. Compare the number of test patterns generated for the combined circuit with that generated for each individual circuit.

2.25

(A Design Practice) Repeat Exercise 2.24, but this time try to use different input connections so as to reduce the number of test patterns for the combined circuit as much as you can. What is the least number of test patterns you can find?

Acknowledgments

The author wishes to thank Professor Xinghao Chen of The City College and Graduate Center of The City University of New York for contributing a portion of the Scan Architectures section; Professor Nur A. Touba of the University of Texas at Austin for contributing a portion of the Coverage-Driven Logic BIST Architectures section; Professor Xiaowei Li of the Chinese Academy of Sciences, Professor Kuen-Jong Lee of National Cheng Kung University, and Professor Nur A. Touba of the University of Texas at Austin for contributing a portion of the Circuits for Test Stimulus Compression and Circuits for Test Response Compaction sections; and Professor Xiaoqing Wen of the Kyushu Institute of Technology and Shianling Wu of SynTest Technologies for contributing a portion of the Random-Access Scan Architectures section. The author also would like to express his gratitude to Claude E. Shannon Professor John P. Hayes of the University of Michigan, Professor Kewal K. Saluja of the University of Wisconsin-Madison, Professor Yinhe Han of the Chinese Academy of Sciences, Dr. Patrick Girard of LIRMM, Dr. Xinli Gu of Cisco Systems, Dr. Rohit Kapur and Khader S. Abdel-Hafez of Synopsys, Dr. Brion Keller of Cadence Design Systems, Anandshankar S. Mudlapur of Intel, Dr. Benoit Nadeau-Dostie of LogicVision, Dr. Peilin Song of IBM, Dr. Erik H. Volkerink of Verigy US, Inc., and Dr. Seongmoon Wang of NEC Labs for reviewing the text and providing valuable comments; and Teresa Chang of SynTest Technologies for drawing most of the figures.

References

Books

Introduction

Scan Design

Logic Built-In Self-Test

Test Compression

Random-Access Scan Design

Concluding Remarks
