Chapter 11. Software-Based Self-Testing

Jiun-Lang Huang, National Taiwan University, Taipei, Taiwan

Kwang-Ting (Tim) Cheng, University of California, Santa Barbara, California

About This Chapter

With the advances in semiconductor manufacturing technology, the concept of system-on-chip (SOC) has become a reality. An SOC device can contain a large number of complex, heterogeneous components, including digital, analog, mixed-signal, radiofrequency (RF), and micromechanical systems, on a single piece of silicon. The increasing heterogeneity and programmability of the SOC, along with ever-increasing operating frequencies and continual technology changes, pose serious challenges to manufacturing test and on-chip self-test. Scan testing is the most commonly used design-for-testability (DFT) technique for addressing fault coverage and test cost concerns; however, it provides no self-test capability in the field. Hardware-based structural self-test techniques, such as logic built-in self-test (BIST), offer a feasible solution, and many schemes have been developed over the years to increase a circuit's fault coverage while reducing area overhead and development time. However, structural BIST places the circuit in a specific, nonfunctional self-test mode and, like scan testing, may cause excessive test power consumption, overtesting, and yield loss.

Software-based self-testing (SBST) has attracted much attention as a way to address these problems caused by structural BIST. The idea is to utilize on-chip programmable resources, such as embedded processors, for on-chip functional (rather than structural) test pattern generation, test data transportation, response analysis, and even diagnosis. In this chapter, we present a variety of methods for SBST. We start by discussing processor self-test techniques, followed by a brief description of a processor self-diagnosis method. We continue with methods for self-testing global interconnects as well as other nonprogrammable SOC cores. We also describe instruction-level DFT methods based on the insertion of test instructions into the data stream; these methods are intended to increase the circuit's fault coverage and reduce its test application time and program size. Finally, we summarize methodologies for digital signal processing-based (DSP-based) self-test of analog and mixed-signal components.

Introduction

There is little doubt that the SOC has become the logical choice for modern chip design, driven by the stringent demands placed on designers for short time-to-market, rich functionality, high portability, and low power consumption. A typical SOC device contains a large number of complex, heterogeneous components, including digital, analog, mixed-signal, RF, and micromechanical systems, on a single piece of silicon. As the trends of limited accessibility to individual components, increased operating frequencies, and shrinking feature sizes continue, testing faces a whole new set of challenges.

The cost of silicon manufacturing versus that of testing given in the International Technology Roadmap for Semiconductors (ITRS) [SIA 1997, 1999] is illustrated in Figure 11.1 where the top and bottom curves show the fabrication and test capital per transistor, respectively. The trend clearly shows that unless some fundamental changes are made in the test process, it may eventually cost more to test a chip than to manufacture it [SIA 2003].


Figure 11.1. Fabrication versus test capital (based on 1997 SIA and 1999 ITRS roadmap data).

Also depicted in Figure 11.1 is the test paradigm shift. The difficulty of generating functional test patterns that reduce a chip's defect level and test cost contributes to this trend. DFT and BIST have been regarded as the solutions for reversing the test cost trend in Figure 11.1. BIST empowers the integrated circuit (IC) by enabling at-speed test signals to be analyzed on-chip using an embedded hardware tester. By implementing BIST, not only is the need for high-speed external testers eliminated, but greater testing accuracy is also achieved. Existing BIST techniques are based on structural BIST. Although the most commonly adopted scan-based BIST techniques [Cheng 1995] [Lin 1995] [Tsai 1998] [Wang 2006] offer good test quality, the circuitry required to realize the embedded hardware tester (including full scan, linear-feedback shift registers [LFSRs], multiple-input signature registers [MISRs], and BIST controllers) incurs nontrivial area, performance, and design-time overheads. Furthermore, structural BIST suffers from the elevated test power consumption inherent in structural testing: test patterns are less correlated, spatially and temporally, than functional patterns, resulting in higher switching activity. Another serious limitation of existing structural BIST is the complexity of applying at-speed patterns for timing-related faults; complex timing issues related to multiple clock domains, multiple frequencies, and test clock skews must be resolved for such testing to be effective.

This chapter discusses the concept of SBST, a promising solution for alleviating the problems inherent in external testers and structural BIST, and outlines the enabling techniques. In the new SBST paradigm, memory blocks that facilitate SBST are tested first. Then the on-chip programmable components such as processors and digital signal processors are self-tested. Finally, these programmable components are configured as an embedded software tester to test on-chip global interconnects and other nonprogrammable components. The SBST paradigm is sometimes referred to as functional self-testing, instruction-based self-testing, or processor-based self-testing.

Software-Based Self-Testing Paradigm

The SBST concept is depicted in Figure 11.2 using a bus-based SOC. In this illustration, the central processing unit (CPU) accesses the system memory via a shared bus, and all the intellectual property (IP) cores are connected to the system bus via a virtual component interface (VCI) [VCI 2000]. Here, the VCI simply acts as the standard communication interface between the core and the shared bus. To support the self-test methodology, each core is surrounded by a test wrapper that contains the test support logic needed to control scan chain shifting as well as buffers to store scan data and support at-speed testing.


Figure 11.2. A software-based self-testable SOC.

Self-Test Flow

The overall SBST methodology consists of the following steps:

  1. Memory self-testing. The memory block (either the system or processor cache memory) that stores the test programs, test responses, and signatures is tested and repaired if necessary. A good reference for learning more about memory BIST and built-in self-repair (BISR) techniques can be found in [Wang 2006].

  2. Processor self-testing. During processor self-testing, the external tester first loads the memory with the test program and the signatures. Then the processor tests itself by executing the test program, aiming at the fault models of interest. The test program responses are written to the memory and later compared to the stored signatures to make the pass/fail decision.

  3. Global interconnect testing. To validate the global interconnect functionality, the embedded processor runs the corresponding test programs. Predetermined patterns that activate the defects of interest are transmitted between pairs of cores among which data or address transmission exists. The responses captured in the destination cores are then compared to the stored signatures.

  4. Testing nonprogrammable cores. When testing the remaining non-self-testable cores, the embedded processor runs the test pattern generation, test data transportation, and response analysis programs. For analog/mixed-signal cores, processor and DSP cores may be employed to perform the required pre- and postprocessing DSP procedures.
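The dependency order of the four steps above can be sketched in a few lines of Python. MockSoc and its method names are purely illustrative stand-ins for the actual BIST/SBST routines, not a real SOC test API:

```python
# Illustrative sketch of the SBST flow ordering; MockSoc and its
# methods are hypothetical stand-ins, not a real SOC test interface.

class MockSoc:
    def __init__(self):
        self.log = []                      # records the order of test steps

    def memory_self_test(self):            # step 1: memory BIST and repair
        self.log.append("memory"); return True

    def processor_self_test(self):         # step 2: CPU runs its test program
        self.log.append("processor"); return True

    def interconnect_test(self):           # step 3: core-to-core patterns
        self.log.append("interconnect"); return True

    def nonprogrammable_core_test(self):   # step 4: CPU drives remaining cores
        self.log.append("cores"); return True

def sbst_flow(soc):
    """Each step relies on the previous ones passing: the memory must hold
    the test program before the processor can test itself, and the processor
    must be known-good before it can act as the embedded tester."""
    steps = (soc.memory_self_test, soc.processor_self_test,
             soc.interconnect_test, soc.nonprogrammable_core_test)
    return all(step() for step in steps)   # all() stops at the first failure
```

The short-circuiting of all() mirrors the methodology: a failure at any step makes the later steps meaningless, since each step trusts the components validated before it.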

Comparison with Structural BIST

The fundamental difference between SBST and structural BIST is that the former handles testing as a system application whereas the latter places the system in nonfunctional test mode. Handling testing as an application has the following advantages:

  1. The need for DFT circuitry is minimized. In structural BIST, one may have to add wrapper cells as well as necessary logic to control or observe the system’s external input/output (I/O) ports that are not connected to low-cost testers. In SBST, additional DFT techniques can be employed if SBST alone does not achieve the desired fault coverage or diagnosis resolution.

  2. The performance requirement for the external tester is reduced as all high-speed transactions occur on-chip and the tester’s main responsibility is to upload the test program to the system memory and download test responses after self-testing.

  3. Performing test pattern application and response capture on-chip achieves greater accuracy than that obtainable with a tester, which reduces the yield loss owing to tester inaccuracy.

  4. Because the system is operated in functional mode while executing the test programs, excessive test power consumption and potential overkill problems inherent in structural BIST are eliminated.

One major concern regarding SBST is its fault detection efficiency. In a programmable component such as a processor or DSP core, some structural faults cannot be detected using instructions, but this does not necessarily translate into low fault coverage. If a fault can be detected only by test patterns that cannot be realized by any instruction sequence, then the fault is, by definition, redundant in the normal operating mode of the programmable component. Thus, there is no need to test for this type of fault during manufacturing test, even though we may still want to detect and locate such faults during silicon debug and diagnosis for manufacturing process improvement.

For an IP core that is not self-testable, one may rely on structural BIST techniques to reach the desired level of fault coverage. In this case, using the processor as the test pattern generator and output response analyzer (i.e., an embedded software tester) gives one the flexibility of combining multiple test strategies to achieve the desired fault coverage—one just has to alter the corresponding programs or parameters without any hardware modification. In [Hellebrand 1996], the authors discuss mixed-mode pattern generation for random and deterministic patterns using embedded processors. After identifying the best pattern generation scheme—including random and deterministic techniques to meet the desired fault coverage goal, test time budget, or available memory storage—the test program is synthesized accordingly without the need to alter any BIST hardware.
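A software LFSR of the kind used for the random portion of such mixed-mode pattern generation takes only a few lines. The 16-bit Galois tap mask 0xB400 (polynomial x^16 + x^14 + x^13 + x^11 + 1) is one well-known maximal-length choice, used here purely as an example:

```python
def lfsr_step(state, taps=0xB400, width=16):
    """One right-shift step of a Galois LFSR (tap mask is an example)."""
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= taps
    return state & ((1 << width) - 1)

def lfsr_patterns(seed, count):
    """Return `count` pseudo-random test words starting from `seed`."""
    state, out = seed, []
    for _ in range(count):
        out.append(state)
        state = lfsr_step(state)
    return out
```

Any nonzero seed yields a sequence of period 2^16 - 1 for this tap mask. Deterministic top-off patterns for the remaining hard-to-detect faults would simply be stored in memory alongside the test program, so switching between random and deterministic generation is a software change only.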

Processor Functional Fault Self-Testing

The complexity of processors, together with the limited accessibility of their internal logic, makes them extremely difficult to test. From the point of view of SBST, the situation is even worse for users or test engineers because the processors’ internal design details are often unavailable and difficult to comprehend.

To resolve these problems, processor functional-level fault models and their corresponding test generation methods have been proposed in [Thatte 1980] and [Brahame 1984]. Because only knowledge of the processor’s instruction set and the functions it performs is needed, these techniques can provide a testing solution for general-purpose processors. In the following subsections, we describe the fault models and test generation procedures for functional faults associated with register decoding, instruction decoding and control, data storage and transfer, and data manipulation as presented in [Thatte 1980].

Processor Model

To facilitate functional-level test generation, the processor is modeled at the register-transfer level (RTL) by a system graph (S-graph), which is based on its instruction set and the functions it performs. In the S-graph, each register that can be explicitly modified by an instruction is represented by a node. Two additional nodes, IN and OUT, are also added to the graph to cover the main memory and I/O devices. Nodes in the S-graph are connected by directed edges. There exists a labeled directed edge from node A to node B if data flow occurs from A to B during the execution of any instruction. An example S-graph is illustrated in Figure 11.3. For convenience, only a subset of instructions and those registers in the processor that are directly involved in carrying out these instructions are shown and summarized in Table 11.1.


Figure 11.3. An example S-graph.

Table 11.1. Summary of the Registers and Instructions in Figure 11.3

R1: Accumulator (ACC)
R2: General-purpose register
R6: Program counter (PC)

I1: Load R1 from the main memory using immediate addressing. (T)
I2: Load R2 from the main memory using immediate addressing. (T)
I4: Add the contents of R1 and R2 and store the result in R1. (M)
I7: Store R1 into the main memory using implied addressing. (T)
I8: Store R2 into the main memory using implied addressing. (T)
I9: Jump instruction. (B)
I11: Left shift R1 by one bit. (M)

As in [Flynn 1974], the instructions are classified as transfer (class T), manipulation (class M), or branch (class B). In Table 11.1, I4 and I11 are class M instructions, I9 is a class B instruction, and the others are class T instructions. Note that multiple edges may be associated with one instruction (e.g., I4 and I9). However, an instruction is allowed to have multiple destination registers only if it involves a data transfer between the main memory (or an I/O device) and registers during its execution.[1] Also, it is assumed that any register can be written (implicitly or explicitly) as well as read (implicitly or explicitly) using a sequence of class T or B instructions.

After constructing the S-graph, one assigns integer labels to the nodes using the node labeling algorithm in Figure 11.4. Within the S-graph, a node's label is its shortest distance to the OUT node. On the processor, the node label corresponds to the minimum number of class T or B instructions that must be executed to read the contents of the register represented by that node. The nodes of the S-graph in Figure 11.3 are labeled as follows: the OUT node is labeled 0; R1, R2, and R6 are labeled 1; and the IN node is labeled 2.

Figure 11.4. The node labeling algorithm.

1. assign label 0 to the OUT node;
2. K ← 0;
3. while there exist unlabeled nodes
4.   assign label K + 1 to unlabeled nodes whose contents can be transferred to any register(s) labeled K by executing a single class T or B instruction;
5.   K ← K + 1;
6. end while

The edges in the S-graph are then assigned labels in the following way. First, for any instruction that implicitly or explicitly reads a register to the OUT node during its execution, all the edges involved in its execution are labeled 1 (e.g., the edges of I7, I8, and I9). In this way, the edges of class B instructions are all assigned the label 1. The edges of the remaining instructions are labeled as follows: if the destination register of the instruction is labeled K, all the edges of that instruction are labeled K + 1. For the S-graph in Figure 11.3, the edges of instructions I7, I8, and I9 are labeled 1, and those of I1, I2, I4, and I11 are labeled 2.
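Both labeling rules are mechanical enough to sketch in Python. The edge list below is a reconstruction of Figure 11.3 from the surrounding text; the exact edges of I4 and I9 are inferred, so treat them as illustrative:

```python
# (instruction, source, destination, class) edges reconstructed from the
# description of Figure 11.3; the I4 and I9 edges are inferred.
EDGES = [("I1", "IN", "R1", "T"), ("I2", "IN", "R2", "T"),
         ("I4", "R1", "R1", "M"), ("I4", "R2", "R1", "M"),
         ("I7", "R1", "OUT", "T"), ("I8", "R2", "OUT", "T"),
         ("I9", "IN", "R6", "B"), ("I9", "R6", "OUT", "B"),
         ("I11", "R1", "R1", "M")]

def label_nodes(edges):
    """Breadth-first labeling: a node gets label K + 1 if a single class T
    or B instruction moves its contents into a node already labeled K."""
    labels, k = {"OUT": 0}, 0
    while True:
        newly = {src for _, src, dst, cls in edges
                 if cls in "TB" and labels.get(dst) == k and src not in labels}
        if not newly:
            return labels
        labels.update({n: k + 1 for n in newly})
        k += 1

def label_edges(edges, labels):
    """An instruction's edges get label 1 if it reads a register to OUT
    (all class B instructions do); otherwise destination label + 1."""
    out = {}
    for instr, _, dst, cls in edges:
        if dst == "OUT" or cls == "B":
            out[instr] = 1
        elif instr not in out:
            out[instr] = labels[dst] + 1
    return out
```

Running this on the reconstructed edges reproduces the labels quoted in the text: OUT gets 0; R1, R2, and R6 get 1; the edges of I7, I8, and I9 get 1; and those of I1, I2, I4, and I11 get 2.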

The purpose of the node and edge labels is to facilitate the test generation process. Tests are generated in a way such that the knowledge gained from the correct execution of tests used to check the decoding of registers and instructions with lower labels is utilized in generating tests to check the decoding of registers and instructions with higher labels.

Functional-Level Fault Models

Processor functional-level fault models are developed at a higher level of abstraction independent of the details of their implementation. The functional-level fault models described include register decoding fault, instruction decoding and control fault, data storage fault, data transfer fault, and data manipulation fault.

  • Register decoding fault model. Under the register decoding fault, the decoded address of the register(s) is incorrect. As a result, the wrong register(s) may be accessed, or else no register is accessed at all. The resultant outcome retrieved when no or multiple registers are accessed is technology dependent.

  • Instruction decoding and control fault model. When there is a fault in the instruction decoding and control function, the processor may execute a wrong instruction, execute some other instructions (including the original one), or execute no instruction at all.

  • Data storage fault model. Single stuck-at faults may occur in any number of cells in any number of registers.

  • Data transfer fault model. The possible faults include a line in the data transfer path stuck at 1 or 0; or two lines in the data transfer path are coupled.

  • Data manipulation fault model. No specific fault models are proposed for data manipulation units. Instead, it is assumed that the required test patterns can be derived according to the implementation.

  • Processor functional-level fault model. The processor may possess any number of faults that belong to the same fault model.

Test Generation Procedures

This section describes the test generation procedures for the modeled functional faults. Note that the S-graph as well as the node and edge labels facilitate the test generation process.

Test Generation for Register Decoding Fault

The goal is to validate that the register decoding function fD : R → R, where R is the set of all registers, is correct. That is, fD(Ri) = Ri for every register Ri.

The test generation flow is illustrated in Figure 11.5. At the beginning of the process, the first-in first-out queue Q is initialized with the set of all registers, such that registers with smaller labels are at the front of Q. A, the set of processed registers, is initialized with the first register dequeued from Q. Test generation is then performed, one register at a time; each iteration consists of a write phase and a read phase. In the write phase, all the registers in A are written with ONE (all ones), and the first register in Q, denoted Rnext, is written with ZERO (all zeros). The write operation for each register is the shortest sequence of class T or class B instructions that writes the target register. In the read phase, the registers in A are read in order of ascending labels, and then the content of Rnext is read out. Similarly, the read operation for each register is the shortest sequence of class T or class B instructions that reads the target register.

Figure 11.5. Test generation for register decoding faults.

1. Q ← sort(R);
2. Rnext ← dequeue(Q);
3. A ← {Rnext};
4. while Q ≠ φ
5.   foreach Ri ∊ A
6.     append write(Ri, ONE) to test program;
7.   end foreach
8.   Rnext ← dequeue(Q);
9.   append write(Rnext, ZERO) to test program;
10.  Q′ ← sort(A);
11.  while Q′ ≠ φ
12.    append read(dequeue(Q′)) to test program;
13.  end while
14.  append read(Rnext) to test program;
15.  A ← A ∪ {Rnext};
16. end while
17. repeat steps 1–16 with complementary data;

The test generation algorithm in Figure 11.5 assures that all the registers have disjoint image sets under the register mapping function fD, thus establishing that fD is one-to-one. The generated test program is capable of finding any detectable fault in the fault model for the register decoding function. One example of a possible undetectable fault is the concurrent occurrence of fD(Ri) = Rj and fD(Rj) = Ri.
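The write/read schedule of Figure 11.5 can be emitted abstractly in Python. Here write() and read() are represented as tuples rather than actual instruction sequences, and step 17 (the complemented-data pass) is omitted for brevity:

```python
from collections import deque

def register_decoding_tests(label_of):
    """Emit the Figure 11.5 schedule as ("write", reg, value) and
    ("read", reg) records; label_of maps register name to node label.
    The complemented-data pass (step 17) is omitted for brevity."""
    q = deque(sorted(label_of, key=label_of.get))   # ascending labels
    r_next = q.popleft()
    a, program = [r_next], []
    while q:
        for r in a:                                  # write phase
            program.append(("write", r, "ONE"))
        r_next = q.popleft()
        program.append(("write", r_next, "ZERO"))
        for r in sorted(a, key=label_of.get):        # read phase
            program.append(("read", r))
        program.append(("read", r_next))
        a.append(r_next)                             # A <- A U {Rnext}
    return program
```

For the three labeled registers of Figure 11.3 this produces ten operations: each iteration writes all processed registers with ONE, writes the next register with ZERO, and reads everything back, so a decoding fault that aliases two registers corrupts an expected value.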

Test Generation for Instruction Decoding and Control Fault

Let Ij be the instruction to be executed. The faults in which no instruction is executed at all, the wrong instruction Ik is executed instead of Ij, or some other instruction Ik is executed in addition to Ij are denoted by f(Ij/φ), f(Ij/Ik), and f(Ij/Ij + Ik), respectively. To simplify test generation for the instruction decoding and control functions, it is assumed that the labels of class M instructions are no greater than 2 and that all class B instructions have label 1; only class T instructions can have labels greater than 2.

To ensure fault coverage, the order in which faults are detected is crucial. Figure 11.6 depicts the order in which the tests are applied. Note how the tests are applied in such a way that the knowledge gained from testing instructions of lower labels is utilized for testing instructions with higher labels.

Figure 11.6. The order of test generation for the instruction decoding and control function.

1. K ← 1;
2. for K = 1 to Kmax
3.   apply tests to detect f(Ij/φ), f(Ij/Ik), and f(Ij/Ij + Ik), where label(Ij) = label(Ik) = K;
4.   apply tests to detect f(Ij/Ij + Ik), where 1 ≤ label(Ij) ≤ K, label(Ik) = K + 1, and K < Kmax;
5.   apply tests to detect f(Ij/Ij + Ik), where K + 1 ≤ label(Ij) ≤ Kmax, and label(Ik) = K;
6. end for

Consider the instruction Ij, label (Ij) = 2. To detect the f(Ij/φ) fault, one first writes O1 to Ij’s destination register Rd using one class T or class B instruction. Then, proper operand(s) are written to Ij’s source registers,[2] such that when Ij is executed, it produces O2 (O2O1) in its destination register Rd. Ij is then executed and Rd is read where the expected output is O2.

To detect the f(Ij/Ik) faults, consider the case in which (1) label (Ij) = label (Ik) = K ≥ 3 and (2) Ij and Ik have the same destination register Rd. Under the previous assumptions, both Ij and Ik are class T instructions, and each has only one destination register. The test procedure first writes O1 and O2 (O1 ≠ O2) to the source registers of Ij and Ik, respectively. Then, Ij is executed and Rd is read, K times, with the expected output O1. Finally, Ik is executed and Rd is read, with the expected output O2. It is interesting to note that Ij is executed K times. If O2 was indeed stored in Ik's source register at the beginning of the procedure, f(Ij/Ik) will be detected the first time Ij is executed and Rd read out. However, because of faults in the instructions used to write O2 into Ik's source register, O1 may have been stored in Ik's source register instead; in that case, f(Ij/Ik) would not be detected on the first attempt. In the worst case, Ij has to be executed K times to guarantee the detection of f(Ij/Ik).

As for f(Ij/Ij + Ik), consider the case in which 1 ≤ label (Ij) ≤ K, label (Ik) = K + 1, and K ≥ 2. Note that Ik is a class T instruction and that the destination registers of Ij and Ik, denoted Rj and Rk, respectively, are different. When the label of Ik's source register is less than K, different operands O1 and O2 are first written to Ik's source and destination registers, respectively. Then, Ik's source register is read, with expected output O1. Finally, Ij is executed and Ik's destination register is read; the expected output is O2.

Test Generation for Data Transfer and Storage Function

Depending on the class to which the involved instruction belongs (i.e., class T, B, or M), different test generation procedures are applied.

Consider a sequence of class T instructions Ij1, Ij2,..., Ijk whose associated edges form a directed path from the IN node to the OUT node in the S-graph. All such paths should be tested. Testing the data transfer faults of the instructions along such a path starts by executing Ij1 with operand O1. Then Ij2, Ij3,..., Ijk are executed, and the expected output is O1. Let the width of the data transfer path be w; the procedure is repeated for the following O1 configurations:

O1 = alternating blocks of 1s and 0s of block widths w, w/2,..., 2, 1, each in both uncomplemented and complemented form (for w = 8: 1111 1111, 1111 0000, 1100 1100, 1010 1010, and their complements).

For class M instructions, use instruction I4, which adds the contents of R1 and R2 together and stores the sum in R1, as an example. (The involved edges and nodes of I4 can be found in Figure 11.3.) The test of I4 consists of testing the path from the arithmetic logic unit (ALU) to R1 and the paths from R1 and R2 to the ALU. For the former, I1 and I2 are utilized to load R1 with O1 and R2 with all zeros. Then I4 is executed, which stores the result in R1, followed by I7, which stores R1 to memory, where the result is read. Assuming an 8-bit processor, to fully test the path, the procedure is repeated for the following O1 configurations (complemented and uncomplemented): 1111 1111, 1111 0000, 1100 1100, and 1010 1010. For the path from R1 to the ALU, R1 is loaded with O1, and R2 with all zeros. Then, I4 and I7 are executed, and the expected output is O1. The procedure is repeated for the following O1 configurations: 0000 0001, 0000 0010,..., 1000 0000. Testing the path from R2 to the ALU is similar and not repeated here.
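Both pattern families used above are easy to generate for an arbitrary path width w (assumed here to be a power of two); this sketch generalizes the 8-bit lists in the text:

```python
def block_patterns(w):
    """Alternating blocks of 1s and 0s of widths w, w/2, ..., 1, each in
    uncomplemented and complemented form (for w = 8: FF/00, F0/0F,
    CC/33, AA/55); exercises stuck-at and adjacent-line coupling faults."""
    mask, pats, b = (1 << w) - 1, [], w
    while b >= 1:
        bits = "".join("1" if (i // b) % 2 == 0 else "0" for i in range(w))
        val = int(bits, 2)
        pats += [val, val ^ mask]      # pattern and its complement
        b //= 2
    return pats

def walking_one(w):
    """Single-1 patterns 0...01 through 10...0 for the register-to-ALU paths."""
    return [1 << i for i in range(w)]
```

For w = 16 the block patterns also reproduce the jump-address list used for the class B test below, so one generator serves all three cases.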

Finally, the transfer paths and registers involved in class B instructions must be tested. Take the JUMP instruction I9 for example. Assume that the address bus width is 16. Then, I9 should be executed with the following jump addresses (in both complemented and uncomplemented forms):

0000 0000 0000 0000,

0000 0000 1111 1111,

0000 1111 0000 1111,

0011 0011 0011 0011,

0101 0101 0101 0101

 

Test Generation for Data Manipulation Function

No specific fault model is proposed for the data manipulation functions. Instead, given the test patterns for a data manipulation unit (ALU, shifter, etc.), the desired operands can be delivered to its input(s) and the results can be delivered to the output(s) using class T instructions.

Test Generation Complexity

The complexity of the test sequences, in terms of the number of instructions used, depends on the number of registers and the number of instructions, denoted by nR and nI, respectively. For register decoding faults, the complexity is found to be O(nR²). For instruction decoding and control faults, the complexity is O(nI²) if the instruction labels do not exceed two.

The technique was applied to an 8-bit microprocessor, with 2200 single stuck-at faults simulated. About 90% of the faults were detected by the test sequences for the register decoding, data storage, data transfer, and data manipulation functions; these sequences comprised about 1000 instructions. The test sequences for the faults that caused simultaneous execution of multiple instructions comprised about 8000 instructions; many of these faults were subtle and required very elaborate test sequences to detect. The remaining faults (about 4%) could not be detected with valid instructions and are thus redundant in normal operation; for this particular processor, therefore, the effective fault coverage was excellent.

Processor Structural Fault Self-Testing

In this section, recent advances in processor SBST techniques that target structural faults, including stuck-at and delay faults, will be discussed.

Test Flow

In general, the structural-fault-oriented processor SBST techniques [Lai 2000a, 2000b] [Chen 2001] [Kranitis 2003] [Chen 2003] [Bai 2003] [Paschalis 2005] [Kranitis 2005] [Psarakis 2006] consist of two phases: the test preparation phase and the self-testing phase.

Test Preparation

In the test preparation phase, instruction sequences that deliver structural test patterns to the inputs of the processor component under test and transport the output responses to observable outputs are generated.

One challenge for processor component test generation is the instruction-imposed I/O constraints. For a processor component, the input constraints define the input space of the component that is realizable by processor instructions; a fault is undetectable if none of its test patterns lies in this input space. The output constraints, on the other hand, define the subset of component outputs observable by instructions; a fault goes undetected at the chip level if its resulting errors fail to propagate to any observable output. Without incorporating these instruction-imposed I/O constraints, component test generation may produce test patterns that cannot be delivered by processor instructions.

In [Lai 2000a, 2000b], [Chen 2001, 2003], [Bai 2003], the extracted component I/O constraints are expressed in the form of Boolean expressions or hardware description language (HDL) descriptions and are fed to automatic test pattern generation (ATPG) for constrained component test generation. Next, the test program synthesis procedure maps the constrained test patterns to processor instructions. In addition to the test application instruction sequence, test supporting instruction sequences may be added (in front of or after the test application sequence) to set up the required processor state (e.g., register values) and to transport the test responses to main memory.

It is interesting to note that the SBST approach offers great flexibility in test pattern generation and response analysis. For example, depending on which method is more efficient for a particular case, the test patterns may be loaded directly to the data memory or generated on-chip using a test pattern generation program (e.g., a software-based LFSR). Similarly, the captured responses may be compressed on-chip using a software version MISR.
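A software MISR of the kind mentioned here folds each captured response word into an LFSR state, and the final state serves as the test signature. The 16-bit tap mask 0xB400 is again just an illustrative choice:

```python
def misr_compact(responses, width=16, taps=0xB400, seed=0):
    """Software MISR sketch: XOR each response word into the state, then
    perform one Galois LFSR shift; the tap mask is an example choice."""
    state, mask = seed, (1 << width) - 1
    for word in responses:
        state ^= word & mask        # inject the captured response word
        lsb = state & 1             # one LFSR shift step
        state >>= 1
        if lsb:
            state ^= taps
    return state
```

Because the shift mixes earlier words into later state, the signature is sensitive to both the values and the order of the responses, which is what makes comparing one final word against a stored signature an effective compaction scheme (up to aliasing).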

Self-Testing

The processor self-testing setup is illustrated in Figure 11.7. Because on-chip system memory (or cache memory) is utilized to store the test program(s) and responses, it has to be tested with standard techniques such as memory BIST [Wang 2006] and repaired if necessary to ensure that it is functioning. Then, a low-cost external tester can be used to load the test program(s) and the test data to the on-chip memory.


Figure 11.7. Processor self-testing setup.

To apply the structural tests, the processor is set up to properly execute the loaded test program. Finally, the test signatures are downloaded to the external tester for pass/fail decision or diagnosis.

In the following subsections, we describe the SBST methods for processor stuck-at faults [Chen 2001, 2003] and path delay faults [Lai 2000a, 2000b].

Stuck-At Fault Testing

The technique reported in [Chen 2001] targets processor stuck-at faults. Details of its test preparation step and the results of a case study will be illustrated.

Instruction-Imposed I/O Constraint Extraction

To reduce the test generation complexity, constrained test generation is performed for subcomponents instead of the full processor. To facilitate constrained component test generation, the instruction-imposed I/O constraints are extracted first. The constraints can be divided into input constraints and output constraints, which are determined by the instructions for controlling the component inputs and the instructions for observing the component outputs. Furthermore, a constraint may be a spatial constraint or a temporal constraint.

Take the PARWAN processor’s shifter unit (SHU) in Figure 11.8a as an example. The instructions that control the SHU inputs are lda (load accumulator), and (logical AND), add, sub, asl (arithmetic shift left), and asr (arithmetic shift right); the corresponding spatial constraints on its inputs are listed in Table 11.2. For instance, if the executed instruction is sub, then both the asl and asr control inputs must be 0, the input flag v equals 1 exactly when flags c and s (the most significant bit of the data input, i.e., data_in[7]) differ, and the z flag equals 1 exactly when the data input is 0 (i.e., data_in[i] = 0, i = 0...7). As for data_in and the carry flag c, there is no spatial constraint with respect to the sub instruction.


Figure 11.8. The SHU I/O and test application sequence: (a) SHU I/O and (b) SHU test application sequence.

Table 11.2. Spatial Constraints at SHU Inputs
(Control inputs: asl, asr; flag inputs: v, c, z, n; data inputs: s = data_in[7] and data_in[6:0])

Instruction | asl | asr | v   | c | z           | n | s (data_in[7]) | data_in[6:0]
lda         | 0   | 0   | 0   | 0 | data_in ≡ 0 | s | X              | X
and         | 0   | 0   | 0   | 0 | data_in ≡ 0 | s | X              | X
add         | 0   | 0   | c⊕s | X | data_in ≡ 0 | s | X              | X
sub         | 0   | 0   | c⊕s | X | data_in ≡ 0 | s | X              | X
asl         | 1   | 0   | 0   | 0 | data_in ≡ 0 | s | X              | X
asr         | 0   | 1   | 0   | 0 | data_in ≡ 0 | s | X              | X

The temporal constraints on SHU, on the other hand, are imposed by the sequence of instructions that applies tests to SHU (Figure 11.8b). The sequence consists of three steps: (1) loading data from memory (MEM) into the accumulator (AC), (2) shifting the data stored in AC and storing the result in AC, and (3) storing the result in memory for later analysis. The corresponding temporal constraint model is illustrated in Figure 11.9 and summarized as follows:

  1. The SHU inputs are connected to the primary inputs only in the first phase.

  2. The SHU data outputs are connected to the primary outputs only in the third phase.

  3. The shifting signals, asl and asr, are set to 0 in the first and third phases.

  4. The v and c flags are set to 0 in the second and third phases because neither the shift nor the store instruction can set them to 1.


Figure 11.9. The SHU temporal constraint model.

Constrained Component Test Generation

Once the component spatial and temporal constraints are derived, the constrained component test generation algorithm in Figure 11.10 is utilized to generate component structural test patterns.

Figure 11.10. The constrained component test generation algorithm.

1.  I ← IC; F ← FC; VC ← φ; TC ← φ;
2.  while I ≠ φ and F ≠ φ
3.    pick i from I;
4.    if not Vi,C ⊆ VC
5.      (Ti,C, Fdet) ← constrainedTG(F, Vi,C);
6.      TC ← TC ∪ Ti,C;
7.      F ← F – Fdet;
8.      VC ← VC ∪ Vi,C;
9.    end if
10.   I ← I – {i};
11. end while

In the initialization step (line 1), I is initialized to be the set of instructions for controlling the component under test C (i.e., IC) and F the fault list of C (i.e., FC). VC and TC, the covered input space and generated tests up to now, are initialized to the empty set φ. The test generation process then repeats until all the instructions are processed or the fault list is empty (line 2). In each iteration, an instruction i is selected for test generation (line 3). If the input space associated with i, denoted by Vi,C, is covered by previous instructions, i is skipped (line 4). Otherwise, constrained test pattern generation is performed. In line 5, Ti,C is the set of generated test patterns and Fdet is the set of newly detected faults by Ti,C. In lines 6 to 8, the test set, the remaining fault set, and the covered input space are updated.
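
The loop of Figure 11.10 can be sketched in Python as follows; the constrained ATPG step is stubbed out by a caller-supplied function, since the real step is a constrained test generation run (all names are illustrative):

```python
def constrained_component_tg(instructions, faults, input_space, constrained_tg):
    """Greedy constrained test generation loop (sketch of Figure 11.10).

    instructions: ordered list of instructions that control component C (IC).
    faults: set of component faults (FC).
    input_space: maps instruction i -> frozenset of allowed inputs (Vi,C).
    constrained_tg: callable(faults, Vi,C) -> (tests, detected_faults);
        stands in for a real constrained ATPG run.
    """
    covered = set()            # VC: input space covered so far
    tests = []                 # TC: accumulated component tests
    remaining = set(faults)    # F: still-undetected faults
    for i in instructions:
        if not remaining:                 # fault list empty: done
            break
        v_i = input_space[i]
        if set(v_i) <= covered:           # Vi,C already covered: skip i
            continue
        t_i, detected = constrained_tg(remaining, v_i)
        tests.extend(t_i)                 # TC <- TC U Ti,C
        remaining -= detected             # F <- F - Fdet
        covered |= set(v_i)               # VC <- VC U Vi,C
    return tests, remaining

# toy ATPG stand-in: each allowed input vector "detects" the like-named fault
def toy_tg(faults, vectors):
    hits = {v for v in vectors if v in faults}
    return sorted(hits), hits

tests, undetected = constrained_component_tg(
    ["lda", "add"], {"f1", "f2", "f3"},
    {"lda": frozenset({"f1"}), "add": frozenset({"f1", "f2"})}, toy_tg)
assert tests == ["f1", "f2"] and undetected == {"f3"}
```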

The resulting test set TC has two important properties. First, if the tests generated under the constraints imposed by any single instruction i achieve the maximum possible fault coverage in the functional mode allowed by i, TC can achieve the maximum possible fault coverage in the functional mode allowed by IC. That is, TC detects any faults detectable by VC. Second, any test vector in TC can be realized by at least one instruction in IC.

In practice, the algorithm faces some challenges. First, determination of IC is nontrivial. One may utilize simulation-based approaches to identify the set of instructions that affect the inputs of C. Also, the instructions in IC may be ordered according to the simplicity of instruction-imposed I/O constraints (instructions with simpler constraints first). Second, determining whether Vi,C ⊆ VC is true is a co-NP-complete problem. This step can be relaxed to screen out only the instructions that obviously cover the same input space as any previously processed instruction.

To better illustrate the constrained component test generation process, the component test preparation results for the PARWAN processor [Navabi 1997] ALU are shown in Table 11.3. The ALU contains two 8-bit data inputs (in_1 and in_2) and one 3-bit control input (alu_code). Input in_1 is connected to the data bus between the memory and the processor, whereas in_2 is connected to the output of the accumulator. In Table 11.3, column 1 lists the instructions that control the ALU inputs. In columns 2 to 4, the input constraints imposed by the instructions as well as the generated test patterns are shown. The constrained inputs are expressed in the form of fixed values (e.g., the alu_code field and the all-Z entries in in_1 and in_2). The unconstrained inputs, on the other hand, are to be generated by a software LFSR procedure; therefore, they are expressed by a self-test signature (S,C,N), where S and C are the seed and configuration of the pseudo-random pattern generator, and N is the number of pseudo-random patterns used.
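
A self-test signature (S,C,N) can be expanded into patterns by a software LFSR routine along these lines (an 8-bit Fibonacci LFSR sketch; the chapter does not specify the tap encoding, so interpreting C as feedback taps is an assumption):

```python
def expand_signature(seed_bits, config_bits, n):
    """Expand a self-test signature (S, C, N) into N pseudo-random patterns.

    seed_bits / config_bits: bit strings such as "11111111" / "01100011".
    The configuration is interpreted here as Fibonacci LFSR feedback taps
    (an assumption; the chapter leaves the encoding unspecified).
    """
    width = len(seed_bits)
    state = int(seed_bits, 2)
    taps = int(config_bits, 2)
    patterns = []
    for _ in range(n):
        patterns.append(state)
        fb = bin(state & taps).count("1") & 1       # XOR of tapped bits
        state = ((state << 1) | fb) & ((1 << width) - 1)
    return patterns

# the lda row of Table 11.3: seed 11111111, config 01100011, 82 patterns
pats = expand_signature("11111111", "01100011", 82)
assert len(pats) == 82 and pats[0] == 0xFF
assert all(0 <= p < 256 for p in pats)
```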

Table 11.3. Component Tests for PARWAN ALU
(The alu_code, in_1, and in_2 columns together form the test pattern.)

Instruction | alu_code | in_1                        | in_2
lda         | 100      | (11111111,01100011,82)      | ZZZZZZZZ
sta         | 110      | ZZZZZZZZ                    | (11111111,01100011,82)
cma         | 001      | ZZZZZZZZ                    | (11111111,01100011,35)
and         | 000      | (11111111,01100011,98): odd | (11111111,01100011,98): even
sub         | 111      | (11111111,01100011,24): odd | (11111111,01100011,24): even
add         | 101      | (11111111,01100011,26): odd | (11111111,01100011,26): even

In [Kranitis 2003], [Paschalis 2005], and [Kranitis 2005], a different approach to component test generation is adopted based on the following observations:

  1. The functional components such as ALU, multiplexers, registers, and register files should be given the highest priority for test development because their size dominates the processor area and thus they have the largest contribution to the overall processor fault coverage.

  2. The majority of these functional components have a regular or semiregular structure. They can be efficiently tested with small and regular test sets that are independent of the gate-level implementation.

Thus, instead of using gate-level ATPG, a component test library of test algorithms that generate small deterministic test sets and provide very high fault coverage for most types and architectures of functional components is developed. Compact loop-based test routines are utilized to deliver these tests to the functional components and the test responses to observable outputs or registers.

Test Program Synthesis

Because the component tests are developed under the processor instruction-imposed I/O constraints, it will always be possible to find instructions for applying the component tests. On the output end, however, special care must be taken when collecting the component test responses. Inasmuch as data outputs and status outputs have different observability, they should be treated differently during response collection. In general, although there are no instructions for storing the status outputs of a component directly to memory, an image of the status outputs can be created in memory using conditional instructions. Following the PARWAN ALU example, which has an 8-bit data output (data_out) and a 4-bit status output (out_flag = vczn), the test program that observes the ALU status outputs after executing the add instruction is shown in Figure 11.11.
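
The net effect of the branch-and-mask sequence in Figure 11.11 can be emulated behaviorally: start from 11111111 and clear one bit for every flag whose branch is not taken. The v/c/z/n-to-bit mapping below follows the and masks in the figure and is otherwise an assumption:

```python
def flag_image(v, c, z, n):
    """Build the memory image of the ALU status flags (vczn) the way the
    branch/and sequence of Figure 11.11 does: start from 11111111 and
    clear a bit whenever the corresponding branch is NOT taken."""
    acc = 0b11111111
    if not v:                   # brav not taken -> and 11110111
        acc &= 0b11110111
    if not c:                   # brac not taken -> and 11111011
        acc &= 0b11111011
    if not z:                   # braz not taken -> and 11111101
        acc &= 0b11111101
    if not n:                   # bran not taken -> and 11111110
        acc &= 0b11111110
    return acc

assert flag_image(1, 0, 1, 0) == 0b11111010
assert flag_image(0, 0, 0, 0) == 0b11110000
```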

Figure 11.11. Test program for observing ALU status outputs.

1.        lda   addr(y)     //load AC
2.        add   addr(x)
3.        sta   data-out    //store AC
4.        lda   11111111
5.        brav  ifv         //branch if overflow
6.        and   11110111
7.  ifv:  brac  ifc         //branch if carry
8.        and   11111011
9.  ifc:  braz  ifz         //branch if zero
10.       and   11111101
11. ifz:  bran  ifn         //branch if negative
12.       and   11111110
13. ifn:  sta   flag-out

Processor Self-Testing

The processor self-testing flow is illustrated in Figure 11.12. First, the unconstrained test patterns are generated on-chip using the self-test signatures and the test generation program, which is a software version LFSR. Because all the self-test signatures are the same except in the N field, the program constructs and stores a shared array of test patterns.


Figure 11.12. Microprocessor self-testing flow.

Once the test patterns are ready, the test application program is executed and the test responses stored. If desired, the captured responses may be analyzed or compressed (e.g., using a software MISR) before being delivered to the external tester.
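
A software MISR is essentially an LFSR that XORs each response word into its state before shifting; a hypothetical 8-bit sketch (the tap choice is illustrative):

```python
def software_misr(responses, width=8, taps=0b01100011, seed=0):
    """Compress a stream of response words into one signature, MISR-style:
    XOR each word into the state, then advance the LFSR one step.
    Tap encoding is an illustrative assumption, not the chapter's."""
    mask = (1 << width) - 1
    state = seed
    for word in responses:
        state ^= word & mask                       # fold in the response word
        fb = bin(state & taps).count("1") & 1      # XOR of tapped bits
        state = ((state << 1) | fb) & mask         # shift with feedback
    return state

sig_good = software_misr([0x12, 0x34, 0x56])
sig_bad = software_misr([0x12, 0x35, 0x56])   # one corrupted response
assert sig_good != sig_bad                     # the corruption is visible
```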

The preceding technique was applied to the PARWAN processor. Although component tests are generated only for a subset of components (ALU, SHU, and program counter [PC]) that are easily accessible through instructions, other components, such as the instruction decoder, are expected to be tested intensively during the application of the self-test program. The overall fault coverage was 91.42%.

Test Program Synthesis Using Virtual Constraint Circuits (VCCs)

One major challenge of the test program synthesis process is to efficiently extract the instruction-imposed I/O constraints.

Tupuri et al. proposed an approach for generating functional tests for processors by using a gate-level sequential ATPG tool [Tupuri 1997]. It attempts to generate tests for all detectable stuck-at faults under the functional constraints, and then applies these functional test vectors at the system’s operational speed. The key idea of this approach lies in the synthesized logic embodying the functional constraints, also known as virtual constraint circuits (VCCs). The extracted functional constraints are described in HDL and synthesized into a gate-level network. Then a commercial ATPG is used to generate module-level vectors with such constraint circuitry imposed. These module-level vectors are translated to processor-level functional vectors and fault simulated to verify the fault coverage.

Based on the VCC concept but with a different utilization, [Chen 2003] performed module-level test generation such that the generated component test patterns can be directly plugged into the settable fields (e.g., the operand and the source/destination register fields) in test program templates. This utilization simplifies the automated generation of test programs for embedded processors. Figure 11.13 illustrates the overall test program synthesis process proposed in [Chen 2003], in which the final self-test program can be synthesized directly from (1) a simulatable HDL processor design at RTL level and (2) the instruction set architecture specification of the embedded processor. The goals and details of each step are discussed next.

Figure 11.13. VCC-based test program synthesis flow.

1.  M ← partitioning();
2.  T ← extractTemplate();
3.  foreach m ∈ M
4.    Tm ← rankTemplate();
5.    while Tm ≠ φ and fault coverage not acceptable
6.      t ← highest ranked template in Tm;
7.      F ← deriveMappingFunction(t, m);
8.      generateVCC();
9.      Pm,t ← constrainedTG();
10.     TPm,t ← synthesizeTestProgram(m, t);
11.     processor-level fault simulation;
12.     Tm ← Tm – {t};
13.   end while
14. end foreach

  • Processor partitioning. The first step (line 1) involves partitioning the processor into a collection of modules-under-test (MUTs), denoted by M. The test program for each MUT will be synthesized separately.

  • Test template construction. This step (line 2) systematically constructs a comprehensive set of test program templates. The test program templates can be classified into single-instruction and multi-instruction templates. A single-instruction template is built around one key instruction, whereas a multi-instruction template includes additional supporting instructions, for example, to trigger pipeline forwarding. To exhaust all possibilities in generating test program templates would be impossible, but generating a wide variety of templates is necessary to achieve high fault coverage.

  • Test template ranking. Templates are ranked according to a controllability/observability-based testability metric through simulation (line 4). Templates at the top of Tm have high controllability (meaning that it is easy to set specific values at the inputs of the MUT) or high observability (meaning that it is easy to propagate the values at the outputs of the MUT to data registers or to observation points, which can be mapped onto and stored in the memory).

  • Mapping function derivation. For each MUT, both input mapping functions and output mapping functions are derived in this step (line 7). The input mapping function models the circuits between the instruction template’s settable fields, including operands, source registers, or destination registers, and the inputs of the MUT. It is derived by simulating a number of instances of template t to obtain traces followed by regression analysis to construct the mapping function between the settable fields and the MUT inputs. The output mapping function models the circuit between the outputs of the MUT and the system’s observation points. It is derived by injecting the unknown X value at the outputs of the MUT for simulation, followed by observing the propagation of the X values to the specified template destinations.

  • VCC generation. The derived mapping functions are synthesized into VCCs (line 8). As will be explained later, the utilization of VCCs not only enforces the instruction-imposed I/O constraints, but also facilitates the translation from module-level test patterns to instruction-level test programs.

  • Module-level test generation. Module-level test generation is performed for the composite circuit of the MUT sandwiched between the input and output VCCs (line 9). An illustration of the composite circuit is shown in Figure 11.14. During the constrained test generation, the test generator sees the circuit including MUT m and the two VCCs (i.e., the shaded area in Figure 11.14). Note that faults within the VCCs will be eliminated from the fault list and will not be considered for test generation. With this composite model, the pattern generator can generate patterns with values directly specified at the settable fields in instruction template t.


    Figure 11.14. Constrained module-level test generation using VCCs.

  • Test program synthesis. The test program for MUT m is synthesized using the module-level test patterns generated in the previous step (line 10). Note that the module-level test patterns assign values in some of the settable fields of each instruction template t. The other settable fields without value assignments would be filled with random values. The test program is then synthesized by converting the values of each settable field into its corresponding position in the instruction template t. An example of the target program synthesis flow is given in Figure 11.15. In step 1, the values assigned to the settable fields by the generated test patterns Pm,t are identified. Then, in step 2, pseudo-random patterns are assigned to the other settable fields in t. In step 3, t is analyzed to identify the positions of the settable fields (nop stands for the “no operation” instruction). Finally, in step 4, the test program TPm,t is generated by filling the values assigned to the settable fields in their corresponding placeholders in t.


    Figure 11.15. A test program synthesis example.

  • Processor-level fault simulation. Processor-level fault simulation is performed on the synthesized test program segment to identify the set of newly detected faults (line 11). The achieved fault coverage is updated accordingly.
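
The template-filling steps of Figure 11.15 can be sketched as follows (the template representation, field names, and 8-bit operand format are illustrative assumptions):

```python
import random

def synthesize_test_program(template, atpg_values, rng=random.Random(0)):
    """Fill a test-program template's settable fields (sketch of Figure 11.15).

    template: list of (mnemonic, field_name_or_None) pairs; a field name
        marks a settable position (operand / register field), None marks a
        fixed instruction such as nop.
    atpg_values: field values fixed by the module-level test patterns.
    Unassigned settable fields receive pseudo-random 8-bit values.
    """
    program = []
    for mnemonic, field in template:
        if field is None:
            program.append(mnemonic)                          # fixed filler
        elif field in atpg_values:                            # ATPG-assigned
            program.append(f"{mnemonic} {atpg_values[field]:08b}")
        else:                                                 # random fill
            program.append(f"{mnemonic} {rng.randrange(256):08b}")
    return program

tp = synthesize_test_program(
    [("lda", "op1"), ("nop", None), ("add", "op2"), ("sta", "dst")],
    {"op1": 0xA5, "op2": 0x3C})
assert tp[0] == "lda 10100101" and tp[1] == "nop"
assert tp[2] == "add 00111100"
assert tp[3].startswith("sta ") and len(tp[3]) == 12   # dst filled randomly
```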

Delay Fault Testing

To ensure that the processor meets its performance specifications requires the application of delay tests. These tests should be applied at-speed and contain two-vector patterns, applied to the combinational portion of the circuit under test, to activate and propagate the fault effects to registers or other observation points. Compared to structural BIST, which needs to resolve various complex timing issues such as multiple clock domains, multiple frequency domains, and test clock skews, processor delay fault self-testing using instruction sequence is a more natural application of at-speed tests.

As in the case of stuck-at faults, not all delay faults in the microprocessor can be tested in the functional mode (i.e., by any instruction sequence). This is simply because no instruction sequence can produce the desired test sequence that can sensitize the path and capture the fault effect into the destination output or flip-flop at-speed. A fault is said to be functionally testable if there exists a functional test for that fault; otherwise, it is functionally untestable. In practice, one may apply the path classification algorithm [Lai 2000a] to identify a tight superset of the set of functionally testable paths in a microprocessor. Then test program synthesis algorithms [Lai 2000b] [Singh 2006] can be applied to generate test programs for these functionally testable paths.

Functionally Untestable Delay Faults

To illustrate the concept of functionally untestable delay faults, consider the datapath of the PARWAN processor (Figure 11.16), which contains an 8-bit ALU, an accumulator (AC), and an instruction register (IR). The data inputs of the ALU, A7–A0 and B7–B0, are connected to the internal data bus and AC, respectively. The control inputs of the ALU are S2–S0, which instruct the ALU to perform the desired arithmetic or logic operation. The outputs of the ALU are connected to the inputs of AC and the inputs of IR.


Figure 11.16. Datapath of the PARWAN processor.

Assuming the use of enhanced scan, paths that start from A7–A0, B7–B0, or S2–S0 and end at inputs of IR or AC are structurally testable if we can find a vector pair to test them. However, some of the paths may be functionally untestable. For example, it can be shown that for all possible instruction sequences, whenever a rising transition occurs on signal S1 at the beginning of a clock cycle, AC and IR can never be enabled at the end of that cycle. Therefore, paths that start at S1 and end at the inputs of IR or AC are functionally untestable because delay fault effects on them can never be captured by IR or AC immediately after the vector pair has been applied.

Constraint Extraction

Different methodologies are applied to extract the constraints associated with the datapath and the control logic. For datapath logic, all the vector pairs that can be applied to the datapath at speed are symbolically simulated. Constraints are then extracted from the simulation results.

For example, the symbolic simulation results of the instruction sequence, NOP followed by add, applied to the datapath in Figure 11.16 are listed in rows 2 and 3 of Table 11.4. In the first cycle, the ALU data inputs at A7–A0 and B7–B0 are V1A and V1B, respectively, and the ALU executes the NOP operation because its control inputs S2–S0 are set to 100. Note that neither IR nor AC is in the latching mode in this cycle. In the second cycle, the ALU executes an addition and AC will latch the result. The constraints extracted from the simulation results are listed in the bottom row of Table 11.4. Because inputs A7–A0 change from V1A to V2A, they can be assigned arbitrary vector pairs (i.e., there is no temporal constraint, which is denoted by all X’s). Inputs B7–B0, on the other hand, must remain the same in both cycles; therefore, the associated constraint is that they have to be constant, denoted by all C’s. The constraint on the control inputs S2–S0 is apparent. The symbols O, Z, and R here denote 11, 00, and 01, respectively. Because AC is latching the ALU addition result in the second cycle, it is labeled “care” to indicate that it stores cared output.

Table 11.4. Datapath Constraints for the NOP; add; Sequence

           | A7–A0 | B7–B0 | S2–S0     | IR | AC
Cycle 1    | V1A   | V1B   | 100 (NOP) |    |
Cycle 2    | V2A   | V1B   | 101 (add) |    | Latch
Constraint | X...X | C...C | OZR       |    | Care

Note that there exist covering relationships among the extracted constraints. A constraint α covers another constraint β if it is possible to set α exactly the same as β by assigning any of C, Z, O, F, or R to the X terms in α. In such a case, the constraint β can be removed. For the DLX [Gumm 1995] processor, after the reduction process, only 24 constraints remain to be considered.
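
This covering check is simple to express if constraints are written as strings over the symbols {X, C, Z, O, F, R} (a sketch consistent with the definition above):

```python
def covers(alpha, beta):
    """Constraint alpha covers constraint beta if assigning symbols from
    {C, Z, O, F, R} to the X positions of alpha can reproduce beta.
    Constraints are equal-length strings over {X, C, Z, O, F, R}."""
    if len(alpha) != len(beta):
        return False
    # each position must already match, or be a free X in alpha
    return all(a == b or a == "X" for a, b in zip(alpha, beta))

assert covers("XXX", "OZR")       # X...X covers any concrete constraint
assert covers("XCX", "ZCO")
assert not covers("CCX", "ZCO")   # a fixed C cannot become Z
```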

The controller constraints can be easily identified from the controller’s RTL description, including the input transition constraint, the input spatial constraint, and the output constraint. The input transition constraint specifies the necessary conditions on the control inputs to those registers that connect to the controller’s inputs. The registers must be set to be in the latching or reset mode for transitions to occur at the control inputs. An input spatial constraint specifies all the legitimate input patterns to the controller. An output constraint records the necessary signal assignments for an output to be latched by a register.

Test Program Generation

The flow of delay fault test program generation in [Lai 2000b] is shown in Figure 11.17. In step 1, given the instruction set architecture and the microarchitecture of the processor, the spatial and temporal constraints, between and at the registers and control signals, are first extracted. In step 2, the path classification algorithm, extended from [Cheng 1996] [Krstic 1996], implicitly enumerates and examines all paths and path segments with the extracted constraints imposed. If a path cannot be sensitized with the imposed extracted constraints, the path is functionally untestable and thus is eliminated from the fault universe. Identifying the functionally untestable faults helps reduce the computation effort of the subsequent test generation process. As the preliminary experimental results shown in [Lai 2000a] indicate, a nontrivial percentage of the paths in simple processors (such as PARWAN and DLX) are functionally untestable but structurally testable.


Figure 11.17. Delay fault test program generation flow.

In step 3, constrained delay fault test generation is performed for a set of long paths selected from the functionally testable paths. A gate-level ATPG for path delay faults is extended to incorporate the extracted constraints into the test generation process, where it is used to generate test vectors for each target path delay fault. If the test is successfully generated, it not only sensitizes the path but also meets the extracted constraints. Therefore, it is most likely to be deliverable by processor instruction sequences. (If the complete set of constraints has been extracted, the delivery by instructions could be guaranteed.)

Finally, in the test program synthesis process that follows, the test vectors specifying the bit values at internal flip-flops are first mapped back to word-level values in registers and values at control signals. These mapped value requirements are then justified at the instruction level. A predefined propagating routine is used to propagate to the memory the fault effects captured in the registers or flip-flops of the path delay fault. This routine compresses the contents of some or all registers in the processor, generates a signature, and stores it in memory. The procedure is repeated until all target faults have been processed. The test program, which is generated offline, will be used to test the processor at-speed.

The test program generation algorithm was applied to PARWAN and DLX processors. On the average, 5.3 and 5.9 instructions were needed to deliver a test vector, and the achieved fault coverage for testable path delay faults was 99.8% for PARWAN and 96.3% for DLX.

Functional Random Instruction Testing

One of the major challenges of the SBST methodology is the functional constraint extraction process. The process is time-consuming, usually manually done or partially automated, and in general, only a subset of the functional constraints is extracted, which complicates the succeeding test program synthesis process.

In [Parvathala 2002], the authors proposed a technique called functional random instruction testing at speed (FRITS). FRITS is basically software BIST and is applicable to devices like microprocessors that have an extensive instruction set that can be utilized to realize the software that enables the functional BIST. The FRITS tests (kernels) are different from normal functional test sequences. Once these kernels are loaded into the processor cache, they repeatedly generate and execute pseudo-random or directed sequences of machine codes.

An example FRITS kernel execution sequence is shown in Figure 11.18. After proper initialization, the kernel generates a sequence of machine instructions and the data to be used by these instructions. Note that these instructions as well as the data, not the FRITS kernel, constitute the functional test. The kernel then branches to execute the generated functional test. The test responses, including the register files and memory locations that are modified by the test, are compressed. To enhance the fault coverage, one may choose to use multiple test data sets for the same random instruction sequence. The test generation and execution process continues until the desired number of test sequences has been generated and executed.
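
The generate-execute-compress loop of a FRITS kernel can be sketched behaviorally as below. This is not Intel's code; the execute and compress callables and the opcode list are placeholders for the processor's actual instruction execution and response compression:

```python
import random

def frits_kernel(num_sequences, seq_len, execute, compress,
                 rng=random.Random(1)):
    """FRITS-style kernel loop (behavioral sketch):
    repeatedly generate a pseudo-random instruction/data sequence,
    execute it, and compress the modified architectural state.

    execute: callable(sequence, data) -> modified state (list of ints);
    compress: callable(state) -> signature. Both are assumptions here.
    """
    opcodes = ["add", "sub", "and", "lda", "sta"]   # illustrative ISA subset
    signatures = []
    for _ in range(num_sequences):
        seq = [rng.choice(opcodes) for _ in range(seq_len)]   # random code
        data = [rng.randrange(256) for _ in range(seq_len)]   # random data
        state = execute(seq, data)                  # run the generated test
        signatures.append(compress(state))          # fold responses in
    return signatures

# toy execute/compress just to exercise the loop shape
sigs = frits_kernel(3, 4,
                    execute=lambda seq, data: data,
                    compress=lambda st: sum(st) & 0xFF)
assert len(sigs) == 3 and all(0 <= s < 256 for s in sigs)
```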


Figure 11.18. The FRITS kernel execution flow.

The FRITS technique was applied to a series of Intel microprocessors. Although it is not the primary test generation and test method used on these processors, it provides significant complementary coverage over existing structural and functional test content on these processors.

Processor Self-Diagnosis

Besides enabling at-speed self-test with low-cost testers, SBST eliminates the use of scan chains and the associated test overhead, making it an attractive solution for testing high-end microprocessors. However, the elimination of scan chains poses a significant challenge for accurate fault diagnosis. Deterministic methods for generating diagnostic tests are available for combinational circuits [Grüning 1991], but sequential circuits are much too complex to be handled by the same approach. There have been several proposals related to generating diagnostic tests for sequential circuits by modifying existing detection tests [Pomeranz 2000] [Yu 2000]. However, the success of these methods depends on the success of the sequential test generation techniques. As a result, the existence of scan chains is still crucial for sequential circuit diagnosis [Venkataraman 2001].

Though current sequential ATPG techniques are not sufficiently practical for handling large sequential circuits, SBST methods are capable of successfully generating tests for a particular type of sequential circuits—microprocessors. If properly modified, these tests might possibly achieve high diagnostic capabilities. Functional information (instruction set architecture and microarchitecture) can be used to guide and facilitate diagnosis.

Challenges to SBST-Based Processor Diagnosis

Because diagnostic programs rely on instructions to detect and distinguish between faults, SBST-based microprocessor diagnosis may suffer low diagnostic resolution for the following reasons:

  1. Faults that are located on the functionality critical nodes in the processor, such as in the instruction decode unit and tristate buffers that control the buses, tend to fail all diagnostic test programs.

  2. The test program that targets one module could also activate a large number of faults in other modules.

  3. Some faults cannot be detected by SBST at all.

To achieve a high diagnostic resolution, a great number of carefully constructed diagnostic test programs are generated. Each program is designed to cover as few faults as possible, while the union of all test programs covers as many faults as possible. The diagnostic test program construction principles are as follows:

  1. Reduce the variety of instructions in each test program. If possible, use one type of instruction only.

  2. Reduce the number of instructions in each test program. Each test program should contain only the essential instructions needed for test data transportation to and from the target internal module.

  3. Create multiple copies of the same test program. Each instance is designed to observe the test response on different outputs from the target module. This allows the user to differentiate faults that will cause errors to propagate to different nodes.

Diagnostic Test Program Generation

Initial investigations for the diagnostic potential of SBST were reported in [Chen 2002] and [Bernardi 2006]. Whereas the former attempted to generate test programs that were geared toward diagnosis, the latter adopted an evolutionary approach to refine existing postproduction SBST programs to achieve the desired diagnosis resolution.

Figure 11.19 illustrates the proposed algorithm flow in [Bernardi 2006]. The process starts from a set of test programs for postproduction testing. These test programs could be hand-written by the designers or test engineers to cover the corner cases, or by automatic test program synthesis approaches, and their diagnosis capability is enhanced by the following sporing, sifting, and evolutionary improvement processes.


Figure 11.19. Refining postproduction test programs for diagnosis.

  • Sporing. In the sporing process, each program belonging to the initial set of SBST programs is fragmented into a huge number of spores. Each spore represents a completely independent program that is able to excite some processor function, observe the response, and possibly signal the occurrence of faults. A first fault simulation is performed to determine the fault coverage figure for each generated spore.

  • Sifting. The sifting process is intended to reduce the number of diagnostic test programs obtained after the sporing process. First, each spore is assigned a fitness value that indicates its diagnosis capability. The fitness value depends on both the number of faults the spore can detect and the number of spores that detect these faults. Then, only the spores that possess high fitness values and detect faults not covered by other spores are retained in the basic diagnostic test program set.

  • Evolutionary improvement. When there are still unsatisfactorily large equivalence classes,[3] the evolutionary improvement process is applied to generate new test programs that are able to split them.
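
The sifting step can be sketched as follows. The fitness formula here (rewarding faults detected by few spores) is an illustrative stand-in; the chapter only states that fitness depends on how many faults a spore detects and how many other spores detect them:

```python
def sift(spores):
    """Sifting sketch: keep spores with high diagnostic fitness plus any
    spore that is the sole detector of some fault.
    spores: maps spore name -> set of detected faults."""
    detectors = {}                         # fault -> spores that detect it
    for name, faults in spores.items():
        for f in faults:
            detectors.setdefault(f, set()).add(name)
    # fitness: sum of 1/(number of detecting spores) over detected faults
    fitness = {name: sum(1.0 / len(detectors[f]) for f in faults)
               for name, faults in spores.items()}
    cutoff = sorted(fitness.values())[len(fitness) // 2]  # keep upper half
    kept = {n for n, v in fitness.items() if v >= cutoff}
    # always retain a spore that uniquely detects some fault
    kept |= {next(iter(s)) for s in detectors.values() if len(s) == 1}
    return kept

kept = sift({"s1": {"f1", "f2"}, "s2": {"f2"}, "s3": {"f3"}})
assert "s3" in kept          # sole detector of f3
assert "s1" in kept          # highest fitness
```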

The diagnosis technique was applied to the Intel i8051 processor core. The processor core’s parallel outputs are fed to a 24-bit MISR, and the stored signature is read out at the end of each test program execution. Table 11.5 summarizes the experimental results. The original postproduction test program set consists of eight programs, and the test set size is 4 KB. The percentages of uniquely classified faults D(1) and correctly classified faults D(10)[4] are 11.56% and 32.90%, respectively. After the sporing process, about 60,000 test programs were generated. This test set is reduced to 7231 test programs after the sifting process. The resulting basic diagnosis program set is able to uniquely classify 35.70% and correctly classify 58.02% of faults. Thirty-five new test programs were added in the evolutionary improvement process. The percentages of uniquely and correctly classified faults by the final test set are enhanced to 61.93% and 84.30%, respectively.

Table 11.5. Summary of the Diagnostic Test Program Generation Results

                   | Postproduction Test Set | Basic Test Set | Final Test Set
Number of Programs | 8                       | 7231           | 7266
Test Set Size (KB) | 4                       | 165            | 177
D(1) (%)           | 11.56                   | 35.70          | 61.93
D(10) (%)          | 32.90                   | 58.02          | 84.30

Testing Global Interconnect

In SOC designs, a device must be capable of performing core-to-core communications across long interconnects. As we find ways to decrease gate delay, the performance of interconnects is becoming increasingly important for achieving high overall performance [SIA 2003]. Increases in cross-coupling capacitance and mutual inductance mean that signals on neighboring wires may interfere with each other, thus causing excessive delay or loss of signal integrity. Although many techniques have been proposed to reduce crosstalk, owing to the limited design margins and unpredictable process variations, crosstalk must also be addressed during manufacturing test.

Because of their impact on circuit timing, testing for crosstalk effects may have to be conducted at the rated speed of the circuit under test. At-speed testing of GHz systems, however, may require prohibitively expensive high-speed testers. With external testing, hardware access mechanisms are required for applying tests to interconnects deeply embedded in the system, which may incur unacceptable area or performance overhead.

In [Bai 2000], a BIST technique in which an SOC tests its own interconnects for crosstalk defects using on-chip hardware pattern generators and error detectors has been proposed. Although the amount of area overhead may be amortized for large systems, for small systems, the amount of relative area overhead may be unacceptable. Because this method falls into the category of structural BIST techniques, utilizing this particular technique may cause overtesting and yield loss as not all test patterns generated are valid when the system is operated in normal mode.

For SOCs with embedded processors, utilizing the processor itself to execute interconnect self-test programs is a viable solution because in such SOCs most of the system-level interconnects, such as on-chip buses, are accessible to the embedded processor core(s). During interconnect self-test program execution, test vector pairs can be applied to the appropriate bus in normal functional mode of the system. In the presence of crosstalk-induced glitches or delay effects, the second vector in the vector pair becomes distorted at the receiver end of the bus. The processor, however, can store this error effect in memory as a test response. This can be unloaded later by an external tester and used for off-chip analysis. In this section, the maximum aggressor (MA) bus fault model is introduced first. Then two software-based interconnect self-test techniques [Lai 2001] [Chen 2001] for MA faults are described.

Maximum Aggressor (MA) Fault Model

The MA fault model proposed in [Cuviello 1999] is suitable for modeling crosstalk defects on interconnects. It abstracts the crosstalk defects on global interconnects by using a linear number of faults.

The MA fault model defines faults based on the resulting crosstalk error effects, including positive glitch (gp), negative glitch (gn), rising delay (dr), and falling delay (df). For a set of N interconnects, the MA fault model considers the collective aggressor effects on a given victim line Yi, whereas all other N – 1 wires act as aggressors. The required transitions on the aggressor/victim lines to excite the four error types are shown in Figure 11.20. The test for positive glitch (gp) on a victim line Yi, as shown in the first column, would require that line Yi hold a constant 0 value, for example, while the other N – 1 aggressor lines have a rising transition. Under this condition, the victim line Yi would have a positive glitch created by the crosstalk effect. If excessive, the glitch would result in errors. These patterns, collectively called MA tests, excite the worst-case crosstalk effects on the victim line Yi. For a set of N interconnects, there are 4N MA faults, requiring 4N MA tests. It has been shown that these 4N faults cover all crosstalk defects on any of the N interconnects.

Maximum aggressor tests for victim line Yi.

Figure 11.20. Maximum aggressor tests for victim line Yi.
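The 4N MA vector pairs are regular enough to enumerate mechanically. A sketch in Python (the bit-string encoding and the two-letter error-type labels are conventions chosen here for illustration, not notation from [Cuviello 1999]):

```python
def ma_tests(n):
    """Generate the 4N maximum-aggressor (MA) vector pairs for an n-bit bus.

    Each test is (v1, v2, error_type): the victim line Yi holds a constant
    value (glitch tests) or makes a transition (delay tests) while all
    n-1 aggressor lines switch together.  Vectors are bit strings with
    position i corresponding to line Yi.
    """
    tests = []
    for i in range(n):
        for kind, (vict1, vict2, aggr1, aggr2) in {
            "gp": ("0", "0", "0", "1"),  # positive glitch: victim at 0, aggressors rise
            "gn": ("1", "1", "1", "0"),  # negative glitch: victim at 1, aggressors fall
            "dr": ("0", "1", "1", "0"),  # rising delay: victim rises, aggressors fall
            "df": ("1", "0", "0", "1"),  # falling delay: victim falls, aggressors rise
        }.items():
            v1 = aggr1 * i + vict1 + aggr1 * (n - 1 - i)
            v2 = aggr2 * i + vict2 + aggr2 * (n - 1 - i)
            tests.append((v1, v2, kind))
    return tests

for v1, v2, kind in ma_tests(4):
    print(kind, v1, "->", v2)
```

For a 4-bit bus this yields the 16 pairs predicted by the model; the positive-glitch test for victim Y0, for instance, is 0000 → 0111.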

Processor-Based Address and Data Bus Testing

In a core-based SOC, the address, data, and control buses are the main types of global interconnects with which the embedded processors communicate with memory and other cores of the SOC via memory-mapped I/O. The proposed technique in [Chen 2001] concentrates on testing the data and address buses in a processor-based SOC. The crosstalk effects on the interconnects are modeled using the MA fault model.

Data Bus Testing

For a bidirectional bus, such as a data bus, crosstalk effects vary as the bus is driven from different sources. This requires crosstalk tests to be conducted in each direction [Bai 2000]. However, to apply a pair of vectors (v1, v2) in a particular bus direction, the direction of v1 is irrelevant, as long as the logic value at the bus is held at v1. Only v2 needs to be applied in the specified bus direction. This is because the signal transition triggering the crosstalk effect takes place only when v2 is being applied to the bus.

To apply a test vector pair (v1, v2) to the data bus from an SOC core to the processor, the processor first exchanges data v1 with the core. The direction of this data exchange is irrelevant; if the core being tested is the memory, for example, the processor may either read v1 from the memory or write v1 to the memory. The processor then requests data v2 from the core (a memory read if the core being tested is memory). When the data v2 is obtained, the processor writes v2 to memory for later analysis. To apply a test vector pair (v1, v2) to the data bus from the processor core to an SOC core, the processor first exchanges data v1 with the core and then sends data v2 to the core (a memory write if the core being tested is memory). If the core is memory, v2 can be directly stored at an appropriate address for later analysis; otherwise, the processor must execute additional instructions to retrieve v2 from the core and store it to memory.

Address Bus Testing

To apply a test vector pair (v1, v2) to the address bus, which is a unidirectional bus from the processor to an SOC core, the processor first requests data from two addresses (v1 and v2) in consecutive cycles. In the case of a nonmemory core, because the processor addresses the core via memory-mapped I/O, v2 must be the address corresponding to the core. If v2 is distorted by crosstalk, the processor would be receiving data from a wrong address, v2′, which may be a physical memory address or an address corresponding to a different core. By keeping different data at v2 and v2′ (i.e., mem[v2] ≠ mem[v2′]), the processor is able to observe the error and store it in memory for analysis.

Figure 11.21 illustrates the address bus testing process for a processor communicating with a memory core. To apply the test (0001, 1110) on the address bus from the processor to the memory core, the processor first reads data from address 0001 and then from address 1110. In a system with a faulty address bus, the second address may become 1111. If different data are stored at addresses 1110 and 1111, say mem[1110] = 0100 and mem[1111] = 1001, the processor receives a faulty value from memory, 1001 instead of 0100. This error response can be stored in memory for later analysis.

Address bus testing.

Figure 11.21. Address bus testing.
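The read sequence above can be modeled behaviorally. In this Python sketch, the fault map that distorts the second address is a stand-in assumption for the physical crosstalk mechanism, not a circuit model:

```python
def address_bus_test(mem, read_addr, fault=None):
    """Simulate one address-bus MA test.

    The processor issues reads to v1 and then v2 in consecutive cycles;
    a crosstalk fault may distort the second address.  `fault` maps an
    intended address to the distorted address actually seen by the
    memory.  Returns the data the processor receives for the v2 read.
    """
    v1, v2 = read_addr
    _ = mem[v1]                       # first read sets up the bus transition
    actual = fault.get(v2, v2) if fault else v2
    return mem[actual]                # second read: may come from a wrong address

# The Figure 11.21 example: mem[1110] = 0100 and mem[1111] = 1001.
mem = {0b0001: 0b0000, 0b1110: 0b0100, 0b1111: 0b1001}
good = address_bus_test(mem, (0b0001, 0b1110))
bad = address_bus_test(mem, (0b0001, 0b1110), fault={0b1110: 0b1111})
print(bin(good), bin(bad))  # the mismatch exposes the fault
```

Because the two addresses hold different data, comparing the returned value against the expected mem[v2] reveals the distortion, which is exactly why the test procedure requires distinct contents at the intended and distorted addresses.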

The feasibility of this method has been demonstrated by applying it to test the interconnects of a processor-memory system, and the defect coverage was evaluated using a system-level crosstalk-defect simulation method.

Processor-Based Functional MA Testing

Even though the MA tests have been proven to cover all physical defects related to crosstalk between interconnects, Lai et al. observe that many of these types of defects can never occur during normal system operations because of system constraints [Lai 2001]. Therefore, testing buses using MA tests might screen out chips that are functionally operational under any pattern produced under normal system operation.

To resolve the overtesting issue associated with the MA fault model, functionally maximal aggressor (FMA) tests that meet the system constraints and can be conducted while operating in the functional mode were proposed [Lai 2001]. These tests provide complete coverage of all crosstalk-induced logical and delay faults that can cause errors during operation in the functional mode.

Given the timing diagrams of all bus operations, the spatial and temporal constraints imposed on the buses can be extracted and FMA tests can be generated. A covering relationship between vectors, extracted from the timing diagrams of the bus commands, is used during the FMA test generation process. Because the resulting FMA tests are highly regular, they can be clustered into a few groups. The tests in each group are highly similar except for the victim lines. Similar to a March-test sequence, which is an algorithm commonly used for testing memory, the tests in each group can be synthesized by a software routine. The test program is highly modularized and very small. Experimental results have shown that a test program as small as 3000 to 5000 bytes can detect all crosstalk defects on the bus from the processor core to the target core.

The synthesized test program is applied to the bus from the processor core, and the input buffers of the destination core capture the responses at the other end of the bus. The processor core should read back such responses to determine whether any faults occur on the bus. However, because the processor core cannot read the input buffers of a nonmemory core, a DFT scheme is suggested to allow the processor core to directly observe the input buffers. The DFT circuitry consists of bypass logic added to each I/O core to improve its testability. With DFT support on the target I/O core, the test generation procedure first synthesizes instructions to set the target core to the bypass mode, and then it continues synthesizing instructions for the FMA tests. The test generation procedure does not depend on the functionality of the target core.

Testing Nonprogrammable Cores

Testing nonprogrammable SOC cores is a complex problem with many unresolved issues [Huang 2001]. Standards such as IEEE 1500 were created with the intent of relieving these core test problems; however, the test provisions within the standard do not reduce the complexity of the test generation and response analysis problems. Furthermore, the requirement of at-speed testing is not addressed.

A self-testing approach was proposed in [Huang 2001]. In it, the embedded processor running the test program serves as an embedded software tester that performs test pattern generation, test pattern application, response capture, and response analysis. The advantages of this approach are as follows:

  1. The need for dedicated test circuitry as found in traditional BIST techniques (i.e., embedded hardware tester) is eliminated.

  2. Tremendous flexibility in the type and quality of patterns that can be generated is provided. One simply has to use a different test pattern generation software procedure or modify the program parameters.

  3. The approach is scalable to large IP cores with available structural netlists.

  4. Patterns are delivered at the SOC operation speed. Hence, it supports delay testing.

To facilitate core testing using the embedded software tester, a test wrapper is placed around each core to support pattern delivery. It contains the test support logic needed to control scan chain shifting, buffers to store scan data, buffers to support at-speed test, and so on.

The test flow based on the embedded software tester methodology is illustrated in Figure 11.22. It is divided into a preprocessing and a core test phase.

The test flow of nonprogrammable cores.

Figure 11.22. The test flow of nonprogrammable cores.

Preprocessing Phase

In the preprocessing phase, a test wrapper is automatically inserted around the IP core under test. The test wrapper is configured to meet the specific testing needs for the IP core. The IP core is then subjected to fault simulation by using different sets of patterns. Either weighted random patterns generated with multiple weight sets or multiple capture cycles [Tsai 1998] after each scan sequence could be used.

Next, a high-level test program is generated that synchronizes tasks including software pattern generation, starting the test, applying the test, and analyzing the test response. The program can also synchronize testing multiple cores in parallel. The test program is then compiled to generate a processor specific binary code.

Core Test Phase

In the core test phase, the test program is run on the processor core to test various IP cores. A test packet is sent to the IP core test wrapper informing it about the test application scheme (single- or multiple-capture cycle). Data packets are then sent to load the scan buffers and the I/O buffers. The test wrapper applies the required number of scan shifts and captures the test response for a preprogrammed number of functional cycles. Test results are stored in the I/O buffers and the scan buffers and then are read by the processor core.

Instruction-Level DFT

Self-testing manufacturing defects in an SOC by running test programs on a programmable core has several potential benefits, including the ability to conduct at-speed testing, low DFT overhead because dedicated test circuitry is eliminated, and better power and thermal management during testing. Such a self-test strategy, however, might require a lengthy test program and still may not achieve sufficiently high fault coverage. One solution is a DFT methodology based on adding test instructions to the processor core, called instruction-level DFT.

Instruction-Level DFT Concept

Instruction-level DFT inserts test circuitry into the design in the form of test instructions. It is a less intrusive testing approach than gate-level DFT techniques, which attempt to create a separate test mode somewhat orthogonal to the functional mode. Compared with existing logic BIST approaches, the instruction-level methodology is also more attractive for applying at-speed tests and for power/thermal management during testing.

When adding new instructions, existing hardware should be reused as much as possible to reduce area overhead. If the test instructions are carefully designed such that their microinstructions reuse the datapath of the functional instructions and do not require a new datapath, then the controller overhead should be relatively low. In general, adding extra buses or registers to implement new instructions can often be avoided; in most cases, a new instruction can be added by introducing new control signals to the datapath rather than by adding hardware.

In [Shen 1998] and [Lai 2001], the authors propose instruction-level DFT methods to improve fault coverage. The approach in [Shen 1998] adds instructions for testing exceptions such as microprocessor interrupts and reset. With these new instructions, the test program can achieve a stuck-at fault coverage close to 90%. However, this approach cannot achieve higher coverage because the test program is synthesized by a random approach and cannot effectively control or observe some internal registers with low testability.

The DFT methodology proposed in [Lai 2001], on the other hand, systematically adds test instructions to an on-chip processor core to improve its self-testability, to reduce the size of the self-test program, and to reduce the test application runtime. The experimental results of two processors (PARWAN and DLX) show that test instructions can reduce the program size and program runtime by about 20% at the cost of about a 1.6% increase in area overhead. The following discussion elaborates on the instruction-level DFT techniques presented in [Lai 2001], including testability instructions and test optimization instructions.

Testability Instructions

DFT instructions of this type are added to enhance the processor’s testability, including controllability and observability of registers and the processor I/O. To determine which instructions to add, the testability of the processor is analyzed first.

A register’s testability can be determined based on the availability of data movement instructions between the memory and the register that is targeted for testing. A fully controllable register is one for which there exists a sequence of data movement instructions that can move the desired data from memory to that register. Similarly, a fully observable register is one for which there exists a sequence of data movement instructions to propagate the register data to memory. Given the microarchitecture of a processor core, it is possible to identify the set of fully controllable registers and fully observable registers. For registers that are not fully controllable/observable, new instructions can be added to improve their accessibility. In [Lai 2001], four instructions are added to enhance the register accessibility in Figure 11.23:

  • s2r (move SR to Rn). This instruction is intended to improve the observability of the status register SR. It moves the data from the status register to any general-purpose register Rn. Note that data in SR are propagated through an existing data path from SR to ALU, to register C, and, finally, to the target register Rn.

  • r2s (move A to SR). This instruction aims to improve the controllability of SR. It moves the data from a general-purpose register A to SR. Again, an existing path (from A, to ALU, to SR) is utilized.

  • Read exception signals from Rn. This instruction allows the processor to take the values of the exception signals from a general-purpose register Rn rather than from external devices.

An example processor to demonstrate instruction-level DFT.

Figure 11.23. An example processor to demonstrate instruction-level DFT.

To support this instruction, extra hardware is added as shown in Figure 11.24a. Without loss of generality, R27 is selected as the source register of the controller exception signals. By setting the 1-bit register T, the processor can switch the exception signal source between R27 and the external devices using the two multiplexers (MUX). Under this scheme, the added instruction simply needs to be able to set the value of T.

DFT instructions for testability enhancement: (a) DFT for exception signals and (b) DFT for pipeline registers.

Figure 11.24. DFT instructions for testability enhancement: (a) DFT for exception signals and (b) DFT for pipeline registers.

  • Pipeline register access. Instructions can be added in pipeline designs to manage the difficult-to-control registers buried deeply in the pipeline. To this end, extra hardware is added. An example of such a pipeline DFT is depicted in Figure 11.24b. To enhance the controllability of pipeline register B, we can add a test instruction, an extra bus (bus D), a multiplexer (MUX C), and a MUX control signal to enable loading data directly from a general-purpose register to the register B. When the test instruction is decoded and its operands become available on bus D, the test instruction will enable MUX C to select bus D as the signal source for B.
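The controllability/observability analysis that motivates these instructions reduces to reachability over data-movement edges. A sketch assuming a hypothetical set of transfer paths loosely modeled on Figure 11.23 (the edge list is illustrative, not the actual microarchitecture):

```python
def fully_accessible(moves, start):
    """Registers reachable from `start` via data-movement instructions.

    `moves` is a set of (src, dst) register-transfer edges.  Reachability
    from MEM over the edges gives the fully controllable registers;
    reachability from MEM over reversed edges gives the fully observable
    ones.
    """
    reached, frontier = {start}, [start]
    while frontier:
        node = frontier.pop()
        for src, dst in moves:
            if src == node and dst not in reached:
                reached.add(dst)
                frontier.append(dst)
    return reached - {start}

# Hypothetical transfer edges; SR has no data-movement edge at all,
# which is precisely the deficiency the s2r/r2s instructions repair.
moves = {("MEM", "A"), ("A", "ALU"), ("ALU", "C"), ("C", "Rn"), ("Rn", "MEM")}
controllable = fully_accessible(moves, "MEM")
observable = fully_accessible({(d, s) for s, d in moves}, "MEM")
print(sorted(controllable))  # SR is absent: r2s adds the missing path in
print(sorted(observable))    # SR is absent: s2r adds the path out
```

Running the analysis flags SR as neither fully controllable nor fully observable, matching the justification given above for adding the s2r and r2s instructions.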

Test Optimization Instructions

The test optimization instructions aim to optimize the test program in terms of its size and application time. The need to implement such DFT instructions is based on the observation that, in the synthesized self-test program, some code segments (called hot segments) appear repeatedly. Therefore, the addition of test instructions that reduce the size of hot segments will help to lower the test program size. In addition to program size reduction, DFT instructions may be added to speed up the processes of test vector preparation, response transportation, and response analysis. In [Lai 2001], the authors proposed two test optimization instructions:

  • load2 (consecutive load to Ri and Rj). This instruction can read two (or more) consecutive words from a memory address, which is stored in another register Rk, and load them to registers Ri and Rj. Whereas a consecutive load needs three words in memory (one for the instruction itself and two for the operands), two load instructions require four words (two for the instruction and two for the operands). Thus, inclusion of the load2 instruction reduces the test program size.

  • xor_all (signature computation). This instruction performs a sequence of exclusive-OR (XOR) operations on the processor register files (Figure 11.23) and stores the final result in register C. Although replacing a sequence of XOR instructions in the response analysis subroutine with xor_all helps reduce the test program run time, it does not significantly reduce its size because there is only one copy of the signature analysis subroutine in the program.

It is interesting to note that although adding test instructions to the programmable core does not improve the testability of other nonprogrammable cores on the SOC, the programs for testing the nonprogrammable cores can also be optimized with the added test optimization instructions, such as the load2 instruction.
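The effect of xor_all can be modeled in software as an XOR fold over the register file; the register values below are arbitrary placeholders:

```python
from functools import reduce

def xor_all(regs):
    """Software model of the xor_all instruction: XOR-fold the register
    file contents into a single signature word (deposited in register C)."""
    return reduce(lambda acc, r: acc ^ r, regs, 0)

# A response-analysis subroutine compacts captured responses this way;
# a single corrupted register value flips bits in the signature.
good = xor_all([0x3A, 0x5C, 0x81, 0x07])
faulty = xor_all([0x3A, 0x5C, 0x81, 0x06])
print(hex(good), hex(faulty))  # 0xe0 0xe1 — the fault is visible
```

Replacing the equivalent sequence of individual XOR instructions with one xor_all shortens the response-analysis subroutine's runtime, which is the optimization the instruction targets.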

DSP-Based Analog/Mixed-Signal Component Testing

For mixed-signal systems that integrate both analog and digital functional blocks onto the same chip, testing of analog/mixed-signal modules has become a production testing bottleneck. Because most analog/mixed-signal circuits are functionally tested, their testing necessitates the use of expensive automatic test equipment for analog stimulus generation and response acquisition. One promising solution is BIST. It utilizes on-chip resources that are shared with either functional blocks or dedicated BIST circuitry to perform on-chip stimulus generation and response acquisition. Under the BIST approach, the demands on the external test equipment are less stringent. Furthermore, stimulus generation and response acquisition are less vulnerable to environmental noise and less limited by I/O pin bandwidth than external tester based testing.

With the advances in CMOS technology, DSP-based BIST becomes a viable solution for analog/mixed-signal components—the signal processing needed to make a pass/fail decision can be implemented in the digital domain with digital resources. In DSP-based BIST schemes [Toner 1995] [Pan 1995], on-chip digital-to-analog converters (DACs) and analog-to-digital converters (ADCs) are used for stimulus generation and response acquisition, and DSP resources (such as processor or DSP cores) are used for the required signal synthesis and response analysis. The DSP-based BIST scheme is attractive because of its flexibility—various tests, such as AC, DC, and transient tests, can be performed by modifying the software routines without needing to alter the hardware. However, on-chip ADCs and DACs are not always available in analog/mixed-signal SOC devices.

In [Huang 2000], the authors proposed using a 1-bit first-order delta-sigma modulation ADC as a dedicated BIST module for on-chip response acquisition when an on-chip ADC is unavailable. Owing to its oversampling nature, the delta-sigma modulation ADC can tolerate relatively large process variations and matching inaccuracy without functional failure, making it particularly suitable for VLSI implementation. This solution is suitable for low- to medium-frequency applications such as audio. Figure 11.25 illustrates the overall delta-sigma modulation-based BIST architecture, which employs the delta-sigma modulation technique for both stimulus generation [Dufort 1997] and response analysis. The test process consists of test pattern generation, stimulus application, response digitization, and response analysis.

DSP-based self-test for analog/mixed-signal cores.

Figure 11.25. DSP-based self-test for analog/mixed-signal cores.

  • Test pattern generation. A software delta-sigma modulation ADC routine is executed to convert the desired test stimulus (e.g., a 1-V amplitude sinusoid at 500 kHz) to a 1-bit digital stream. For periodic test stimuli, a segment of the delta-sigma modulation ADC output bit stream that contains an integer number of signal periods is stored in on-chip memory.

  • Stimulus application. To transform the stored 1-bit stream segment to the specified analog test stimulus, the stored pattern is repeatedly applied to the 1-bit DAC, which translates the digital 1’s and 0’s to two discrete analog levels. The succeeding low-pass filter then removes the out-of-band high-frequency modulation noise and restores the original analog waveform.

  • Response digitization. The 1-bit delta-sigma modulation ADC is dedicated to converting the analog component output response into a 1-bit stream that will be stored in on-chip memory. Here, the first-order delta-sigma modulation ADC is utilized because it is more stable and has a larger input dynamic range than higher-order delta-sigma modulation ADCs. However, it is not quite practical for high-resolution applications because a rather high oversampling rate will be needed, and it suffers from intermodulation distortion (IMD). Compared to the first-order configuration, the second-order configuration has a smaller dynamic range but is more suitable for high-resolution applications.

  • Response analysis. The stored test responses are then analyzed by software DSP routines (e.g., decimation filter and FFT) to derive the desired performance specifications. Note that the software part of this technique (i.e., the software delta-sigma modulation ADC routine and the response analysis routines) can be performed either by on-chip DSP or processor cores when abundant on-chip digital programmable resources are available (as indicated in Figure 11.25) or by external digital test equipment.
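The test pattern generation step above — a software first-order delta-sigma modulator producing a 1-bit stream whose low-frequency content tracks the stimulus — can be sketched as follows (the oversampling ratio, amplitude, and sample counts are illustrative):

```python
import math

def delta_sigma_1bit(signal):
    """Software first-order delta-sigma modulator.

    Converts a sampled waveform with values in [-1, 1] to a 1-bit
    stream: the integrator accumulates the error between the input and
    the 1-bit feedback level, and the comparator quantizes its sign.
    """
    bits, integrator, feedback = [], 0.0, 0.0
    for x in signal:
        integrator += x - feedback      # accumulate quantization error
        bit = 1 if integrator >= 0 else 0
        feedback = 1.0 if bit else -1.0  # 1-bit DAC levels
        bits.append(bit)
    return bits

# One full period of a sinusoid, heavily oversampled.
n, osr = 64, 32
samples = [0.8 * math.sin(2 * math.pi * i / (n * osr)) for i in range(n * osr)]
bits = delta_sigma_1bit(samples)

# The stream's average tracks the input's average: near zero here,
# because the segment spans an integer number of signal periods.
dc = sum(2 * b - 1 for b in bits) / len(bits)
print(round(dc, 3))
```

Storing exactly an integer number of periods, as the test pattern generation step requires, is what makes the stored segment repeatable: looping it through the 1-bit DAC and low-pass filter reproduces the periodic stimulus without discontinuities.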

Concluding Remarks

Embedded software-based self-testing (SBST) has the potential to alleviate many problems associated with current scan test and structural BIST techniques. These problems include excessive test power consumption, overtesting, and yield loss. This chapter has summarized the recently proposed techniques on this subject. On-chip programmable resources such as embedded processors are used to test processor cores, global interconnects, nonprogrammable cores, and analog/mixed-signal components. The major challenge in using these techniques is extracting functional constraints imposed by the processor instruction set. These extracted constraints are crucial during test program synthesis to ensure that the derived tests are delivered using processor instruction sequences.

Future research in this area must address the problem of automating constraint extraction to make the SBST methodology fully automatic for use in testing general embedded processors. Also, the SBST paradigm should be further generalized for analog/mixed-signal components through the integration of DSP-based test techniques, delta-sigma modulation principles, and some low-cost analog/mixed-signal DFT methods.

Exercises

11.1

(Functional Fault Testing) Assign labels to the nodes and edges shown in the following S-graph. The instructions and registers are summarized in Table 11.6.


S-graph for Exercise 11.1.

Table 11.6. Summary of the Registers and Instructions for Exercise 11.1

R1: Accumulator (ACC)
R2: General purpose register
R3: Scratch-pad register
R6: Program counter (PC)

I1: Load R1 from the main memory using immediate addressing. (T)
I2: Load R2 from the main memory using immediate addressing. (T)
I3: Transfer the contents of R1 to R2. (T)
I4: Add the contents of R1 and R2 and store the results in R1. (M)
I5: Transfer the contents of R1 to R3. (T)
I6: Transfer the contents of R3 to R1. (T)
I7: Store R1 into the main memory using implied addressing. (T)
I8: Store R2 into the main memory using implied addressing. (T)
I9: Jump instruction. (B)
I11: Left shift R1 by one bit. (M)

11.2

(Functional Fault Test Generation) Generate tests to detect the register decoding faults of the simplified processor described in Figure 11.3 and Table 11.1.

11.3

(Structural Fault Testing) Consider the test program in Figure 11.11. Where will the status flags be stored? How do you interpret the stored results?

11.4

(Software LFSR) Using bit-wise shift and logic operations, write a program that generates m state transitions of an LFSR of degree n.

11.5

(Data Bus Self-Testing) Assume that the data bus is 4 bits wide. What vector pairs are needed to detect all the MA faults?

11.6

(Address Bus Self-Testing) Assume that both the address and data buses are 4 bits wide. What data should you store in the memory in order to test the address bus MA faults by reading the memory contents?

11.7

(Instruction-Level DFT) Consider the example processor in Figure 11.23. The s2r (move SR to Rn) DFT instruction is realized by propagating the contents of SR to Rn through the ALU. Another possible way to realize s2r is to connect SR to the bus directly. What are the advantages and disadvantages of this second approach?

Acknowledgments

The authors wish to thank Dr. Jayanta Bhadra of Freescale, Professor Dimitris Gizopoulos of University of Piraeus (Greece), and Professor Charles E. Stroud of Auburn University for reviewing the text and providing helpful comments.

References




[1] The destination registers of an instruction are defined to be the set of registers that are changed by that instruction during its execution. For example, the destination registers of instruction I9 in Figure 11.3 are {R6, OUT}.

[2] The set of source registers for an instruction is defined to be the set of registers that provide the operands for that instruction during its execution.

[3] An equivalence class is a set of faults that cause exactly the same faulty behavior for each applied pattern.

[4] D(n) is defined as the percentage of faults that are classified into equivalence classes of size n or less by the diagnostic test program set. D(10) is regarded as the percentage of correctly classified faults because exact analysis of fault equivalence cannot be performed for medium or large sequential circuits.
