Chapter 4. System/Network-on-Chip Test Architectures

Chunsheng Liu, University of Nebraska-Lincoln, Omaha, Nebraska

Krishnendu Chakrabarty, Duke University, Durham, North Carolina

Wen-Ben Jone, University of Cincinnati, Cincinnati, Ohio

About This Chapter

The popularity of system-on-chip (SOC) integrated circuits has led to an unprecedented increase in test costs. This increase can be attributed to the difficulty of test access to embedded cores, long test development and test application times, and high test data volumes. The on-chip network is a natural evolution of traditional interconnects such as shared bus. An SOC whose interconnection is implemented by an on-chip network is called a network-on-chip (NOC) system. Testing an NOC is challenging, and new test methods are required.

This chapter presents techniques that facilitate low-cost modular testing of SOCs. Topics discussed here include techniques for wrapper design, test access mechanism optimization, test scheduling, and applications to mixed-signal and hierarchical SOCs. Recent work on the testing of embedded cores with multiple clock domains and on wafer sort of core-based SOCs is also discussed. Together, these techniques offer SOC integrators the necessary means to manage test complexity and reduce test cost. The discussion is then extended to the testing of NOC-based designs. Topics discussed here include reuse of the on-chip network for core testing, test scheduling, test access methods and interfaces, efficient reuse of the network, power-aware and thermal-aware testing, and on-chip network (including interconnects, routers, and network interface) testing. Finally, two case studies, one for SOC testing and the other for NOC testing, are presented based on industrial chips developed by Philips.

Introduction

Shrinking process technologies and increasing design sizes have led to billion-transistor system-on-chip (SOC) integrated circuits. SOCs are crafted by system designers who purchase intellectual property (IP) circuits, known as embedded cores, from core vendors and integrate them into large designs. Embedded cores are complex, predesigned, and preverified circuits that can be purchased off the shelf and reused in designs. Although SOCs have become popular as a means to integrate complex functionality into designs in a relatively short amount of time, there remain several roadblocks to rapid and efficient system integration. Primary among these is the lack of design-for-testability (DFT) tools upon which core design and system development can be based. Importing core designs from different IP sources and stitching them into designs often entails cumbersome format translation. A number of SOC and core development working groups have been formed, notable among these being the Virtual Socket Interface Alliance (VSIA) [VSIA 2007]. The IEEE 1500 standard has also been introduced to facilitate SOC testing [IEEE 1500-2005].

The VSIA was formed in September 1996 with the goal of establishing a unifying vision for the SOC industry and the technical standards required to facilitate system integration. VSIA specifies interface standards, which will allow cores to fit quickly into “virtual sockets” on the SOC, at both the architectural level and the physical level [VSIA 2007]. This will allow core vendors to produce cores with a uniform set of interface features, rather than having to support different sets of features for each customer. SOC integration is in turn simplified because cores may be imported and plugged into standardized “sockets” on SOCs with relative ease. The IEEE 1500 standard includes a standardized core test language (CTL) and a test wrapper interface from cores to on-chip test access mechanisms.

An SOC test is essentially a composite test comprising the individual tests for each core, the user-defined logic (UDL) tests, and interconnect tests. Each individual core or UDL test may involve surrounding components and may imply operational constraints (e.g., safe mode, low power mode, bypass mode), which necessitate special isolation modes.

SOC test development is especially challenging for several reasons. Embedded cores represent intellectual property, and core vendors are reluctant to divulge structural information about their cores to users. Thus, users cannot access core netlists and insert design-for-testability (DFT) hardware that can ease test application from the surrounding logic. Instead, the core vendor provides a set of test patterns that guarantees a specific fault coverage. These test patterns must be applied to the cores in a given order, using a specific clocking strategy. Care must often be taken to ensure that undesirable patterns and clock skews are not introduced into these test streams. Furthermore, cores are often embedded in several layers of user-designed or other core-based logic and are not always directly accessible from chip I/Os. Propagating test stimuli to core inputs may therefore require dedicated test transport mechanisms. Moreover, it is necessary to translate test data at the inputs and outputs of the embedded-core into a format or sequence suitable for application to the core.

A conceptual architecture for testing embedded core-based SOCs is shown in Figure 4.1 [Zorian 1999]. It consists of three structural elements:

  1. Test pattern source and sink. The test pattern source generates the test stimuli for the embedded cores, and the test pattern sink compares the response(s) to the expected response(s).

  2. Test access mechanism (TAM). The TAM transports test patterns. It is used for the on-chip transport of test stimuli from the test pattern source to the core under test, and for the transport of test responses from the core under test to a test pattern sink.

  3. Core test wrapper. The core test wrapper forms the interface between the embedded core and its environment. It connects the terminals of the embedded core to the rest of the integrated circuit and to the TAM.

Figure 4.1. Overview of the three elements in an embedded-core test approach: (1) test pattern source and sink, (2) test access mechanism, and (3) core test wrapper [Zorian 1999].

Once a suitable test data transport mechanism and test translation mechanism have been designed, the next major challenge confronting the system integrator is test scheduling. This refers to the order in which the various core tests and tests for user-designed interface logic are applied. A combination of BIST and external testing is often used to achieve high fault coverage [Sugihara 1998] [Chakrabarty 2000a], and tests generated by different sources may therefore be applied in parallel, provided resource conflicts do not arise. Effective test scheduling for SOCs is challenging because it must address several conflicting goals: (1) SOC test time minimization, (2) resource conflicts between cores arising from the use of shared TAMs and on-chip BIST engines, (3) precedence constraints among tests, and (4) power constraints.

Finally, analog and mixed-signal cores are increasingly being integrated onto SOCs with digital cores. Testing mixed-signal cores is challenging because their failure mechanisms and testing requirements are not as well modeled as they are for digital cores. It is difficult to partition and test analog cores, because they may be prone to crosstalk across partitions. Capacitance loading and complex timing issues further exacerbate the mixed-signal test problem.

Section 4.2 presents a survey of modular test methods for SOCs that enhance the utilization of test resources, such as test data and test hardware, to reduce test costs such as test time.

Current core-based SOCs represent the success of the reuse paradigm in the industry. The design effort of a complex and mixed system can be significantly reduced by the reuse of predesigned functional blocks. However, for future SOCs with a large number of cores and increased interconnection delay, the implementation of an efficient and effective communication architecture among the blocks is becoming the new bottleneck in the performance of SOCs. It has been shown that the conventional point-to-point or bus-based communication architectures can no longer meet the system requirements in terms of bandwidth, latency, and power consumption [Vermeulen 2003] [Zeferino 2003].

Some works [Dally 2001] [Benini 2002] have proposed the use of integrated switching networks as an alternative approach to interconnect cores in SOCs. Such networks rely on a scalable and reusable communication platform, the so-called network-on-chip (NOC) system, to meet the two major requirements of current systems: reusability and scalable bandwidth. A conceptual architecture of an NOC system based on a 2-D mesh network is shown in Figure 4.2. Cores are connected to the network by routers or switches. Data are organized in the form of packets and transported through the interconnection links. Various network topologies and routing algorithms can be adopted to meet the requirements for performance, area overhead, and power consumption, among other factors.

Figure 4.2. Conceptual architecture of a 2-D mesh NOC system.

Reusing the on-chip network as a test access mechanism for embedded cores in an SOC has been proposed in [Cota 2003]. Further results presented in [Cota 2004] show that test time can be reduced by network reuse even under power constraints, while other cost factors such as pin count and area overhead are significantly reduced. The reuse method assumes that most or all of the embedded cores are connected to or accessible through the on-chip communication network. This ease of access to the cores and the high parallelism of communications make the network a cost-effective test access mechanism (TAM).

The main advantage of on-chip network reuse is the availability of several accesses to each core, depending on the number of system input and output ports used during testing. Therefore, more cores can be tested in parallel as more access paths are available. In [Cota 2003], the idea of network parallelism is explored so that all available communication resources (channels, routers, and interfaces) are used in parallel to transmit test data. Test data are organized into test packets, which are scheduled so that the network usage is maximized to reduce the system test time.

For the rest of this chapter, we use the term NOC to denote an on-chip interconnection network (in general, a packet-switching network) consisting of switches/routers, channels, and other network components, and we use the term NOC-based system or NOC-based design to denote the entire SOC consisting of an NOC and the embedded cores.

Section 4.3 presents advances in cost-efficient testing of embedded cores and interconnection fabrics in NOC-based systems.

System-on-Chip (SOC) Testing

This section first explains the importance of modular testing of SOCs through the design of test wrappers and test access mechanisms (TAMs) that are used to transport test data. Optimizations of wrapper and TAM designs are thoroughly reviewed and discussed. Test scheduling with power and resource constraints is also elaborated for the cost-efficient testing of SOCs. Because modern SOCs contain mixed-signal circuits, analog wrapper design and its TAM support are also included. Finally, advanced topics such as hierarchical core-based testing and wafer-sort optimization are discussed.

Modular Testing of SOCs

Modular testing of embedded cores in a system-on-chip (SOC) is being increasingly advocated to simplify test access and test application [Zorian 1999]. To facilitate modular test, an embedded core must be isolated from surrounding logic, and test access must be provided from the I/O pins of the SOC. Test wrappers are used to isolate the core, whereas test access mechanisms (TAMs) transport test patterns and test responses between SOC pins and core I/Os [Zorian 1999].

Effective modular testing requires efficient management of the test resources for core-based SOCs. This involves the design of core test wrappers and TAMs, the assignment of test pattern bits to automatic test equipment (ATE) channels, the scheduling of core tests, and the assignment of ATE channels to SOCs. The challenges involved in the optimization of SOC test resources for modular testing can be divided into three broad categories:

  1. Wrapper/TAM co-optimization. Test wrapper design and TAM optimization are of critical importance during system integration because they directly impact hardware overhead, test time, and tester data volume. The issues involved in wrapper/TAM design include wrapper optimization, core assignment to TAM wires, sizing of the TAMs, and routing of TAM wires. As shown in [Chakrabarty 2001], [Iyengar 2002d], and [Marinissen 2000], most of these problems are NP-hard. Figures 4.3a and b illustrate the position of TAM design and test scheduling in the SOC DFT and test generation flows.

    Figure 4.3. The (a) DFT generation flow and (b) test generation flow for SOCs [Iyengar 2002c].

  2. Constraint-driven test scheduling. The primary objective of test scheduling is to minimize test time while addressing one or more of the following issues: (1) resource conflicts between cores arising from the use of shared TAMs and BIST resources, (2) precedence constraints among tests, and (3) power dissipation constraints. Furthermore, test time can often be decreased further through the selective use of test preemption [Iyengar 2002a]. As discussed in [Chakrabarty 2000a] and [Iyengar 2002a], most problems related to test scheduling for SOCs are also NP-hard.

  3. Minimizing ATE reload under memory depth constraints. Given test data for the individual cores, the entire test suite for the SOC must be made to fit in a minimum number of ATE memory loads (preferably one memory load). This is important because, whereas the time required to apply digital vectors is relatively small, the time required to load several gigabytes of data to the ATE memory from workstations is significant [Barnhart 2001] [Marinissen 2001]. Therefore, to avoid splitting the test into multiple ATE load–apply sessions, the number of bits required to be stored on any ATE channel must not exceed the limit on the channel’s memory depth.

In addition, the rising cost of ATE for SOC devices is a major concern [SIA 2005]. Because of the growing demand for pin counts, speed, accuracy, and vector memory, the cost of high-end ATE for full-pin, at-speed functional test, is predicted to be excessively high [SIA 2005]. As a result, the use of low-cost ATE that will perform structural rather than at-speed functional testing is increasingly being advocated for reducing test costs. Multisite testing, in which multiple SOCs are tested in parallel on the same ATE, can significantly increase the efficiency of ATE usage, as well as reduce test time for an entire production batch of SOCs. The use of low-cost ATE and multisite testing involves test data volume reduction and test pin count (TAM width) reduction, such that multiple SOC test suites can fit in ATE memory in a single test session [Marinissen 2001] [Volkerink 2002].

As a result of the intractability of the problems involved in test planning, test engineers adopted a series of simple ad hoc solutions in the past [Marinissen 2001]. For example, the problem of TAM width optimization is often simplified by stipulating that each core on the SOC have the same number of internal scan chains, say W; thus, a TAM of width W bits is laid out and cores are simply daisy-chained to the TAM. However, with the growing size of SOC test suites and the rising cost of ATE, the application of more aggressive test resource optimization techniques that enable effective modular test of highly complex next-generation SOCs using current-generation ATE is critical.

Wrapper Design and Optimization

A core test wrapper is a layer of logic that surrounds the core and forms the interface between the core and its SOC environment. Wrapper design is related to the well-known problems of circuit partitioning and module isolation and is therefore a more general test problem than its current instance (SOC test using TAMs). For example, earlier forms of circuit isolation (precursors of test wrappers) include boundary scan and the built-in logic block observer (BILBO) [Abramovici 1994].

The test wrapper and TAM model of SOC test architecture was presented in [Zorian 1999]. Three mandatory wrapper operation modes were listed: (1) normal operation, (2) core-internal tests, and (3) core-external tests. Apart from these three mandatory modes, two optional modes are "core bypass" and "detach."

Two proposals for test wrappers have been the “test collar” [Varma 1998] and TestShell [Marinissen 1998]. The test collar was designed to complement the test bus architecture [Varma 1998] and the TestShell was proposed as the wrapper to be used with the TestRail architecture [Marinissen 1998]. In [Varma 1998], three different test collar types were described: combinational, latched, and registered. For example, a simple combinational test collar cell consisting of a 2-to-1 multiplexer can be used for high-speed signals at input ports during parallel, at-speed test. The TestShell described in [Marinissen 1998] is used to isolate the core and perform TAM width adaptation. It has four primary modes of operation: function mode, IP test mode, interconnect test mode, and bypass mode. These modes are controlled using a test control mechanism that receives two types of control signals: pseudo-static signals (that retain their values for the duration of a test) and dynamic control signals (that can change values during a test pattern).

An important function of the wrapper is to adapt the TAM width to the core’s I/O terminals and internal scan chains. This is done by partitioning the set of core-internal scan chains and concatenating them into longer wrapper scan chains, equal in number to the TAM wires. Each TAM wire can now directly scan test patterns into a single wrapper scan chain. TAM width adaptation directly affects core test time and has been the main focus of research in wrapper optimization. Note that to avoid problems related to clock skew, either internal scan chains in different clock domains must not be placed on the same wrapper scan chain or anti-skew (lock-up) latches must be placed between scan flip-flops belonging to different clock domains.

The issue of designing balanced scan chains within the wrapper was addressed in [Chakrabarty 2000b] (see Figure 4.4). The first techniques to optimize wrappers for test time reduction were presented in [Marinissen 2000]. To solve the problem, the authors proposed two polynomial-time algorithms that yield near-optimal results. The largest processing time (LPT) algorithm is taken from the multiprocessor scheduling literature and solves the wrapper design problem in short computation times. At the expense of a slight increase in computation time, the COMBINE algorithm yields even better results. It uses LPT as a start solution, followed by a linear search over the wrapper scan chain length with the First Fit Decreasing heuristic.

Figure 4.4. Wrapper chains: (a) unbalanced and (b) balanced.
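The LPT heuristic described above can be sketched in a few lines; the following is a minimal illustration (function and variable names are our own, not taken from [Marinissen 2000]), in which core-internal scan chains are assigned, longest first, to the currently shortest wrapper scan chain:

```python
import heapq

def lpt_wrapper_design(scan_chain_lengths, tam_width):
    """Largest Processing Time (LPT) sketch: partition core-internal scan
    chains into `tam_width` wrapper scan chains so that the longest wrapper
    chain, which dominates scan-in time, is near-minimal."""
    # Each wrapper chain is a heap entry (current_length, index, members);
    # the unique index keeps tuple comparisons well defined.
    heap = [(0, i, []) for i in range(tam_width)]
    heapq.heapify(heap)
    # Assign scan chains in decreasing order of length, always to the
    # currently shortest wrapper chain.
    for length in sorted(scan_chain_lengths, reverse=True):
        total, i, members = heapq.heappop(heap)
        members.append(length)
        heapq.heappush(heap, (total + length, i, members))
    wrapper_chains = [members for _, _, members in heap]
    max_len = max((sum(c) for c in wrapper_chains), default=0)
    return wrapper_chains, max_len
```

For example, five scan chains of lengths 32, 32, 16, 16, and 8 on a 2-bit TAM are grouped as {32, 16, 8} and {32, 16}, giving a longest wrapper chain of 56 cells, which is optimal here.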

To perform wrapper optimization, the authors in [Iyengar 2002d] proposed design wrapper, an algorithm based on the best fit decreasing heuristic for the bin packing problem. The algorithm has two priorities: (1) minimizing core test time and (2) minimizing the TAM width required for the test wrapper. These priorities are achieved by balancing the lengths of the wrapper scan chains designed and identifying the number of wrapper scan chains that actually need to be created to minimize test time. Priority (2) is addressed by the algorithm, as it has a built-in reluctance to create a new wrapper scan chain, while assigning core-internal scan chains to the existing wrapper scan chains [Iyengar 2002d].
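The "reluctance" to open a new wrapper scan chain can be illustrated with a simplified best fit decreasing sketch. This is our own simplification of the idea, not the published design wrapper algorithm of [Iyengar 2002d]; the bound used here (the longest single scan chain) is one natural lower bound on wrapper chain length:

```python
def bfd_wrapper_design(scan_chain_lengths, max_wrapper_chains):
    """Best fit decreasing sketch: assign scan chains (longest first) to the
    fullest existing wrapper chain that stays within the length bound; open
    a new wrapper chain only when no existing one fits, so that fewer TAM
    wires are consumed (priority 2 in the text)."""
    wrapper = []  # each entry is [total_length, member_chains]
    bound = max(scan_chain_lengths, default=0)  # longest chain: a lower bound
    for length in sorted(scan_chain_lengths, reverse=True):
        # Best fit: the fullest existing wrapper chain that stays <= bound.
        fits = [w for w in wrapper if w[0] + length <= bound]
        if fits:
            best = max(fits, key=lambda w: w[0])
        elif len(wrapper) < max_wrapper_chains:
            wrapper.append([length, [length]])  # reluctantly open a new chain
            continue
        else:
            # No chain fits under the bound and no new chain is allowed:
            # fall back to the shortest existing chain.
            best = min(wrapper, key=lambda w: w[0])
        best[0] += length
        best[1].append(length)
    return wrapper
```

With scan chains of lengths 10, 10, 5, and 5 and up to four wrapper chains allowed, the sketch builds only three wrapper chains ({10}, {10}, {5, 5}), leaving one TAM wire free without increasing the longest chain.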

Wrapper design and optimization continue to attract considerable attention. Work in this area has focused on “light wrappers”—that is, the reduction of the number of register cells [Xu 2003]—and the design of wrappers for cores and SOCs with multiple clock domains [Xu 2004].

TAM Design and Optimization

Many different TAM designs have been proposed in the literature. TAMs have been designed based on direct access to cores multiplexed onto the existing SOC pins [Immaneni 1990], reusing the on-chip system bus [Harrod 1999], searching transparent paths through or around neighboring modules [Ghosh 1998] [Nourani 2000] [Chakrabarty 2003], and 1-bit boundary scan rings around cores [Touba 1997] [Whetsel 1997].

The most popular appear to be the dedicated, scalable TAMs such as test bus [Varma 1998] and TestRail [Marinissen 1998]. Despite the fact that their dedicated wiring adds to the area costs of the SOC, their flexible nature and guaranteed test access have proven successful. Three basic types of such scalable TAMs have been described in [Aerts 1998] (Figure 4.5): (1) the multiplexing architecture, (2) the daisy-chain architecture, and (3) the distribution architecture. In the multiplexing and daisy-chain architectures, all cores get access to the total available TAM width, while in the distribution architecture, the total available TAM width is distributed over the cores.

Figure 4.5. The (a) multiplexing, (b) daisy-chain, and (c) distribution architectures [Aerts 1998, Iyengar 2002c].

In the multiplexing architecture, only one core wrapper can be accessed at a time. Consequently, this architecture only supports serial schedules, in which the cores are tested one after the other. An even more serious drawback of this architecture is that testing the circuitry and wiring in between cores is difficult; interconnect test requires simultaneous access to multiple wrappers. The other two basic architectures do not have these restrictions; they allow for both serial as well as parallel test schedules, and they also support interconnect testing.

The test bus architecture [Varma 1998] (see Figure 4.6a) is a combination of the multiplexing and distribution architectures. A single test bus is in essence the same as what is described by the multiplexing architecture; cores connected to the same test bus can only be tested sequentially. The test bus architecture allows for multiple test buses on one SOC that operate independently, as in the distribution architecture. Cores connected to the same test bus suffer from the same drawback as in the multiplexing architecture (i.e., their wrappers cannot be accessed simultaneously), making core-external testing difficult or impossible.

Figure 4.6. The (a) fixed-width test bus architecture, (b) fixed-width TestRail architecture, and (c) flexible-width test bus architecture [Iyengar 2003c].

The TestRail architecture [Marinissen 1998] (see Figure 4.6b) is a combination of the daisy-chain and distribution architectures. A single TestRail is in essence the same as what is described by the daisy-chain architecture: scan-testable cores connected to the same TestRail can be tested simultaneously as well as sequentially. A TestRail architecture allows for multiple TestRails on one SOC, which operate independently, as in the distribution architecture. The TestRail architecture supports serial and parallel test schedules, as well as hybrid combinations of those.

In most TAM architectures, the cores assigned to a TAM are connected to all wires of that TAM. These are referred to as fixed-width TAMs. A generalization of this design is one in which the cores assigned to a TAM each connect to a (possibly different) subset of the TAM wires [Iyengar 2003c]. The core–TAM assignments are made at the granularity of TAM wires, instead of considering the entire TAM bundle as one inseparable entity. These are referred to as flexible-width TAMs. This concept can be applied to both test bus and TestRail architectures. Figure 4.6c shows an example of a flexible-width test bus architecture.

Most SOC test architecture optimization algorithms proposed have concentrated on fixed-width test bus architectures and assume cores with fixed-length scan chains. In [Chakrabarty 2001], the author described a test bus architecture optimization approach that minimizes test time using integer linear programming (ILP). ILP is replaced by a genetic algorithm in [Ebadi 2001]. In [Iyengar 2002b], the authors extend the optimization criteria of [Chakrabarty 2001] with place-and-route and power constraints, again using ILP. In [Huang 2001] and [Huang 2002], test bus architecture optimization is mapped to the well-known problem of two-dimensional bin packing, and a best fit algorithm is used to solve it. Wrapper design and TAM design both influence the SOC test time, hence their optimization needs to be carried out in conjunction in order to achieve the best results. The authors in [Iyengar 2002d] were the first to formulate the problem of integrated wrapper/TAM design; despite its NP-hard character, it is addressed using ILP and exhaustive enumeration. In [Iyengar 2003b], the authors presented efficient heuristics for the same problem.

Idle bits exist in test schedules when parts of the test wrapper and TAM are underutilized, leading to idle time in the test delivery architecture. In [Marinissen 2002a], the authors first formulated the test time minimization problem both for cores with fixed-length scan chains as well as for cores with flexible-length scan chains. Next, they presented lower bounds on the test time for the test bus and TestRail architectures and then examined three main reasons for underutilization of TAM bandwidth, leading to idle bits in the test schedule and test times higher than the lower bound [Marinissen 2002a]. The problem of reducing the amount of idle test data was also addressed in [Gonciari 2003].
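One such bound, the simple bandwidth bound, can be computed directly; the sketch below is an illustration of the idea only, as the bounds derived in [Marinissen 2002a] are tighter and architecture-specific:

```python
from math import ceil

def tam_lower_bound(core_test_bits, tam_width):
    """Bandwidth lower bound on SOC test time (in clock cycles) for a
    fixed-width TAM: the total number of test data bits that must cross the
    TAM, divided by its width.  Idle bits in a real schedule push the
    achieved test time above this bound."""
    return ceil(sum(core_test_bits) / tam_width)
```

For instance, 1500 total test bits on a 16-bit TAM cannot be delivered in fewer than 94 cycles, no matter how cleverly the schedule is packed.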

The optimization of a flexible-width multiplexing architecture (i.e., for one TAM only) was proposed in [Iyengar 2002d]. This work again assumes cores with fixed-length scan chains. The paper describes a heuristic algorithm for the co-optimization of wrappers and test buses based on rectangle packing. In [Iyengar 2002a], the same authors extended this work by including precedence, concurrency, and power constraints while allowing a user-defined subset of the core tests to be preempted.

Fixed-width TestRail architecture optimization was investigated in [Goel 2002]. Heuristic algorithms have been developed for the co-optimization of wrappers and TestRails. The algorithms work both for cores with fixed-length and flexible-length scan chains. TR-ARCHITECT, the tool presented in [Goel 2002], is currently in actual industrial use.

Test Scheduling

Test scheduling for SOCs involving multiple test resources and cores with multiple tests is especially challenging, and even simple test scheduling problems for SOCs have been shown to be NP-hard [Chakrabarty 2000a]. In [Sugihara 1998], a method for selecting tests from a set of external and BIST tests (that run at different clock speeds) was presented. Test scheduling was formulated as a combinatorial optimization problem. Reordering tests to maximize defect detection early in the schedule was explored in [Jiang 1999]. The entire test suite was first applied to a small sample population of ICs. The fault coverage obtained per test was then used to arrange tests that contribute to high fault coverage earlier in the schedule. The authors used a polynomial-time algorithm to reorder tests based on the defect data as well as execution time of the tests [Jiang 1999]. A test scheduling technique based on the defect probabilities of the cores has been reported [Larsson 2004].
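The defect-based reordering idea can be captured by a one-line greedy rule: run the tests with the highest defect detection per unit of test time first. This is a simplification of [Jiang 1999], which works from measured sample-population data; the numbers in the usage example are hypothetical:

```python
def reorder_tests(tests):
    """Greedy defect-based reordering sketch: sort tests by defects
    detected per unit of test time, in decreasing order, so that average
    defect coverage rises as early as possible in the schedule.
    `tests` is a list of (name, defects_detected, test_time) tuples."""
    return sorted(tests, key=lambda t: t[1] / t[2], reverse=True)
```

With hypothetical data, a test detecting 9 defects in 3 time units (3 defects/unit) is scheduled ahead of one detecting 10 defects in 5 time units (2 defects/unit).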

Macro testing is a modular testing approach for SOC cores in which a test is broken down into a test protocol and list of test patterns [Beenker 1995]. A test protocol is defined at the terminals of a macro and describes the necessary and sufficient conditions to test the macro [Marinissen 2002b]. The test protocols are expanded from the macro level to the SOC pins and can be either applied sequentially to the SOC or scheduled to increase parallelism. In [Marinissen 2002b], a heuristic scheduling algorithm based on pair-wise composition of test protocols was presented. The algorithm determines the start times for the expanded test protocols in the schedule, such that no resource conflicts occur and test time is minimized [Marinissen 2002b].

SOCs in test mode can dissipate up to twice the amount of power they do in normal mode, because cores that do not normally operate in parallel may be tested concurrently [Zorian 1993]. Power-constrained test scheduling is therefore essential to limit the amount of concurrency during test application to ensure that the maximum power budget of the SOC is not exceeded. In [Chou 1997], a method based on approximate vertex cover of a resource-constrained test compatibility graph was presented. In [Muresan 2000], the use of list scheduling and tree-growing algorithms for power-constrained scheduling was discussed. The authors presented a greedy algorithm to overlay tests such that the power constraint is not violated. A constant additive model is employed for power estimation during scheduling [Muresan 2000]. The issue of reorganizing scan chains to tradeoff test time with power consumption was investigated in [Larsson 2001b]. The authors presented an optimal algorithm to parallelize tests under power and resource constraints. The design of test wrappers to allow for multiple scan chain configurations within a core was also studied.
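The greedy, power-constrained overlay of tests with a constant additive power model can be sketched as follows; this is an illustrative list-scheduling loop in the spirit of [Muresan 2000], not the published algorithm, and the test tuples are hypothetical:

```python
def power_constrained_schedule(tests, p_max):
    """Greedy list-scheduling sketch with a constant additive power model:
    at each step, start every waiting test whose rated power, added to the
    power of the tests currently running, stays within the budget p_max.
    `tests` is a list of (name, test_time, power) tuples; returns a map
    from test name to start time."""
    pending = sorted(tests, key=lambda t: t[1], reverse=True)  # longest first
    running = []   # (end_time, name, power)
    now = 0
    schedule = {}
    while pending or running:
        used = sum(p for _, _, p in running)
        started = []
        for name, time, power in pending:
            if used + power <= p_max:          # additive power check
                running.append((now + time, name, power))
                schedule[name] = now
                used += power
                started.append((name, time, power))
        for t in started:
            pending.remove(t)
        if not running:
            break  # nothing fits: p_max is too small for the remaining tests
        # Advance time to the next completion and retire finished tests.
        running.sort()
        end = running[0][0]
        now = end
        while running and running[0][0] == end:
            running.pop(0)
    return schedule
```

For example, with a power budget of 5, tests a (time 10, power 3) and b (time 5, power 2) start together at time 0, while c (time 5, power 2) is deferred to time 5 when b finishes and frees power budget.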

In [Iyengar 2002a], an integrated approach to test scheduling was presented. Optimal test schedules with precedence constraints were obtained for reasonably sized SOCs. For precedence-based scheduling of large SOCs, a heuristic algorithm was developed. The proposed approach also includes an algorithm to obtain preemptive test schedules in O(n³) time, where n is the number of tests [Iyengar 2002a]. Parameters that allow only a certain number of preemptions per test can be used to prevent excessive BIST and sequential circuit test preemptions. Finally, a new power-constrained scheduling technique was presented, whereby power constraints can be easily embedded in the scheduling framework in combination with precedence constraints, thus delivering an integrated approach to the SOC test scheduling problem.

Integrated TAM Optimization and Test Scheduling

Both TAM optimization and test scheduling significantly influence the test time, test data volume, and test cost for SOCs. Furthermore, TAMs and test schedules are closely related. For example, an effective schedule developed for a particular TAM architecture may be inefficient or even infeasible for a different TAM architecture. Integrated methods that perform TAM design and test scheduling in conjunction are therefore required to achieve low-cost, high-quality test.

In [Larsson 2001a], the authors presented an integrated approach to test scheduling, TAM design, test set selection, and TAM routing. The SOC test architecture was represented by a set of functions involving test generators, response evaluators, cores, test sets, power and resource constraints, and start and end times in the test schedule modeled as Boolean and integral values [Larsson 2001a]. A polynomial-time algorithm was used to solve these equations and determine the test resource placement, TAM design and routing, and test schedule, such that the specified constraints are met.

The mapping between core I/Os and SOC pins during the test schedule was investigated in [Huang 2001]. TAM design and test scheduling were modeled as two-dimensional bin-packing, in which each core test is represented by a rectangle. The height of each rectangle corresponds to the test time, the width corresponds to the core I/Os, and the weight corresponds to the power consumption during test. The objective is to pack the rectangles into a bin of fixed width (SOC pins) such that the bin height (total test time) is minimized while power constraints are met. A heuristic method based on the best fit algorithm was presented to solve the problem [Huang 2001]. The authors next formulated constraint-driven pin mapping and test scheduling as the chromatic number problem from graph theory and as a dependency matrix partitioning problem [Huang 2002]. Both problem formulations are NP-hard. A heuristic algorithm based on clique partitioning was proposed to solve the problem.
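The rectangle-packing view can be made concrete with a small skyline-style best-fit heuristic: each test is a rectangle (width = TAM wires, height = test time) packed into a bin whose fixed width is the number of SOC test pins, so that the bin height (overall test time) stays small. This is an illustrative sketch, not the algorithm of [Huang 2001], and it omits the power weights; core names and sizes are made up.

```python
# Skyline best-fit sketch of TAM rectangle packing: place each test
# rectangle on the contiguous run of TAM wires that becomes free
# earliest, then raise those wires' free times by the test length.

def pack_tests(rects, total_width):
    """rects: list of (name, width, height). Returns placements
    {name: (first_wire, start_time)} and the resulting bin height."""
    free = [0.0] * total_width          # per-wire earliest free time
    placement = {}
    # Packing taller (longer) tests first is a common heuristic order.
    for name, w, h in sorted(rects, key=lambda r: -r[2]):
        best_start, best_pos = None, None
        for pos in range(total_width - w + 1):
            start = max(free[pos:pos + w])
            if best_start is None or start < best_start:
                best_start, best_pos = start, pos
        for i in range(best_pos, best_pos + w):
            free[i] = best_start + h
        placement[name] = (best_pos, best_start)
    return placement, max(free)
```

For a 4-wire TAM, tests of width 2 can run side by side, while a width-4 test must wait for the full TAM to become free.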

The problem of TAM design and test scheduling with the objective of minimizing the average test time was formulated in [Koranne 2002a]. The problem was reduced to one of minimum-weight perfect bipartite graph matching, and a polynomial-time optimal algorithm was presented. A test planning flow was also presented.
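To show what a minimum-weight perfect bipartite matching computes in this setting, here is a deliberately tiny brute-force matcher over a cost matrix (rows = cores, columns = TAM partitions, entries = resulting test times). The polynomial-time algorithm of [Koranne 2002a] would be used on real instances; this exhaustive version only illustrates the objective, and the numbers are invented.

```python
# Brute-force minimum-weight perfect matching on a small square cost
# matrix: try every assignment of cores to TAM partitions and keep
# the one with the smallest total cost. Exponential; for sketching only.
from itertools import permutations

def min_weight_matching(cost):
    n = len(cost)
    best, best_perm = None, None
    for perm in permutations(range(n)):
        w = sum(cost[i][perm[i]] for i in range(n))
        if best is None or w < best:
            best, best_perm = w, perm
    return best, best_perm
```

For cost matrix [[4, 2], [1, 3]], assigning core 0 to partition 1 and core 1 to partition 0 gives total weight 3, the minimum.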

The power-constrained TAM design and test scheduling problem was studied in [Zhao 2003]. The problem was formulated in graph-theoretic terms: the goal is to provide test access from system-level pins to core-level test terminals and to route wrapper-configured cores efficiently on TAMs so as to reduce the total test application time. Seed sets selected from a conflict graph initiate the scheduling, and compatible test sets derived from a power-constrained test compatibility graph facilitate test concurrency and dynamic TAM width distribution.

In [Iyengar 2003c], a new approach for wrapper/TAM co-optimization and constraint-driven test scheduling using rectangle packing was described. Flexible-width TAMs that are allowed to fork and merge were designed. Rectangle packing was used to develop test schedules that incorporate precedence and power constraints, while allowing the SOC integrator to designate a group of tests as pre-emptable. Finally, the relationship between TAM width and tester data volume was studied to identify an effective TAM width for the SOC.

The work reported in [Iyengar 2003c] was extended in [Iyengar 2003a] to address the minimization of ATE buffer reloads and include multisite test. The ATE is assumed to contain a pool of memory distributed over several channels, such that the memory depth assigned to each channel does not exceed a maximum limit. Furthermore, the sum of the memory depth over all channels equals the total pool of ATE memory. Idle bits appear on ATE channels whenever there is idle time on a TAM wire. These bit positions are filled with don’t-cares if they appear between useful test bits; however, if they appear only at the end of the useful bits, they are not required to be stored in the ATE.

The SOC test resource optimization problem for multisite test was stated as follows. Given the test set parameters for each core, and a limit on the maximum memory depth per ATE channel, determine the wrapper/TAM architecture and test schedule for the SOC, such that (1) the memory depth required on any channel is less than the maximum limit, (2) the number of TAM wires is minimized, and (3) the idle bits appear only at the end of each channel. A rectangle packing algorithm was developed to solve this problem.

A new method for representing SOC test schedules using k-tuples was discussed in [Koranne 2002b]. The authors presented a p-admissible model for test schedules that is amenable to several solution methods, such as local search, two-exchange, simulated annealing, and genetic algorithms, that cannot be used in a rectangle-representation environment. The proposed approach provides a compact, standardized representation of test schedules. This facilitates fast and efficient evaluation of SOC test automation solutions to reduce test costs.

Finally, work on TAM optimization has focused on the use of ATEs with port scalability features [Sehgal 2003a, 2004a, 2004c]. To address the test requirements of SOCs, automatic test equipment (ATE) vendors have announced a new class of testers that can simultaneously drive different channels at different data rates. Examples include the Agilent 93000 series tester based on port scalability and the test processor-per-pin architecture [Agilent] and the Tiger system from Teradyne [Teradyne 2007] in which the data rate can be increased through software for selected pin groups to match SOC test requirements. However, the number of tester channels with high data rates may be constrained in practice because of ATE resource limitations, the power rating of the SOC, and scan frequency limits for the embedded cores. Optimization techniques have been developed to ensure that the high data rate tester channels are efficiently used during SOC testing [Sehgal 2004a].

The availability of dual-speed ATEs was also exploited in [Sehgal 2003a] and [Sehgal 2004c], where a technique was presented to match ATE channels with high data rates to core scan chain frequencies using virtual TAMs. A virtual TAM is an on-chip test data transport mechanism that does not directly correspond to a particular ATE channel. Virtual TAMs operate at scan-chain frequencies; however, they interface with the higher frequency ATE channels using bandwidth matching. Moreover, because the virtual TAM width is not limited by the ATE pin count, a larger number of TAM wires can be used on the SOC, thereby leading to lower test times. A drawback of virtual TAMs, however, is the need for additional TAM wires on the SOC as well as frequency division hardware for bandwidth matching. In [Sehgal 2004a], the hardware overhead is reduced through the use of a smaller number of on-chip TAM wires; ATE channels with high data rates directly drive SOC TAM wires, without requiring frequency division hardware.
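The bandwidth-matching arithmetic behind virtual TAMs is simple: one high-rate ATE channel, demultiplexed through a frequency divider, can feed several on-chip TAM wires running at the lower scan frequency. The sketch below only captures this ratio; the frequencies are hypothetical and the function name is invented for illustration.

```python
# Bandwidth matching for virtual TAMs (sketch): an ATE channel at a
# high data rate drives n on-chip TAM wires at the scan frequency,
# where n is the (integral) ratio of the two rates.

def virtual_tam_width(ate_rate_mhz, scan_rate_mhz):
    """Number of virtual TAM wires one ATE channel can feed; the
    ratio must be integral for a simple frequency divider."""
    if ate_rate_mhz % scan_rate_mhz != 0:
        raise ValueError("rates must divide evenly for this scheme")
    return ate_rate_mhz // scan_rate_mhz
```

For example, a 400-MHz channel driving 50-MHz scan chains can feed 8 virtual TAM wires, at the cost of the extra on-chip wires and the divider hardware noted above.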

Modular Testing of Mixed-Signal SOCs

Prior research on modular testing of SOCs has focused almost exclusively on the digital cores in an SOC. However, most SOCs in use today are mixed-signal circuits containing both digital and analog cores [Liu 1998] [Kundert 2000] [Yamamoto 2001]. Increasing pressure on consumer products for small form factors and extended battery life is driving single chip integration and blurring the lines between analog and digital design types. As indicated in the 2005 International Technology Roadmap for Semiconductors [SIA 2005], the combination of these circuits on a single die compounds the test complexities and challenges for devices that fall in an increasing commodity market. Therefore, an effective modular test methodology should be capable of handling both digital and analog cores, and it should reduce test cost by enabling test reuse for reusable embedded modules.

In traditional mixed-signal SOC testing, tests for analog cores are applied either from chip pins through direct test access methods, such as via multiplexing or through a dedicated analog test bus [Sunter 1996] [Cron 1997], which requires the use of expensive mixed-signal testers. For mid- to low-frequency analog applications, the data are often digitized at the tester, where it is affordable to incorporate high-quality data converters. In most mixed-signal ICs, analog circuitry accounts for only a small part of the total silicon (“big-D/small-A”). However, the total production testing cost is dominated by analog testing costs. This is because expensive mixed-signal testers are employed for extended periods of time resulting in high overall test costs. A natural solution to this problem is to implement the data converters on-chip. Because most SOC applications do not push the operational frequency limits, the design of such data converters on-chip appears to be feasible. Until recently, such an approach has not been deemed desirable because of its high hardware overhead. However, as the cost of on-chip silicon is decreasing and the functionality and the number of cores in a typical SOC are increasing, the addition of data converters on-chip for testing analog cores now promises to be cost-efficient. These data converters eliminate the need for expensive mixed-signal test equipment.

Results have been reported on the optimization of a unified test access architecture that is used for both digital and analog cores [Sehgal 2003b]. Instead of treating the digital and analog portions separately, a global test resource optimization problem is formulated for the entire SOC. Each analog core is wrapped by a DAC-ADC pair and a digital configuration circuit. Results show that for “big-D/small-A” SOCs, the test time and test cost can be reduced considerably if the analog cores are wrapped and the test access and test scheduling problems for the analog and digital cores are tackled in a unified manner.

Each analog core is provided a test wrapper where the test information includes only digital test patterns, clock frequency, the test configuration, and pass/fail criteria. This analog test wrapper converts the analog core to a virtual digital core with strictly sequential test patterns, which are the digitized analog signals. To utilize test resources efficiently, the analog wrapper needs to provide sufficient flexibility in terms of required resources with respect to all the test needs of the analog core. One way to achieve this uniform test access scheme for analog cores is to provide an on-chip ADC-DAC pair that can serve as an interface between each analog core and the digital surroundings, as shown in Figure 4.7.

On-chip digitization of analog test data for uniform test access [Sehgal 2003b].

Figure 4.7. On-chip digitization of analog test data for uniform test access [Sehgal 2003b].

Analog test signals are expressed in terms of a signal shape, such as sinusoidal or pulse, and signal attributes, such as frequency, amplitude, and precision. The core vendor provides these tests to the system integrator. In the case of analog testers, these signals are digitized at the high precision ADCs and DACs of the tester. In the case of on-chip digitization, the analog wrapper needs to include the lowest cost data converters that can still provide the required frequency and accuracy for applying the core tests. Thus, on-chip conversion of each analog test to digital patterns imposes requirements on the frequency and resolution of the data converters of the analog wrapper. These converters need to be designed to accommodate all the test requirements of the analog core.

Analog tests may also have a high variance in terms of their frequency and test time requirements. Whereas tests involving low-frequency signals require low bandwidth and long test times, tests involving high-frequency signals require high bandwidth and short test times. Keeping the bandwidth assigned to the analog core constant results in underutilization of precious test resources. The variance of analog test needs has to be fully exploited to achieve an efficient test plan. Thus, the analog test wrapper has to be designed to accommodate multiple configurations with varying bandwidth and frequency requirements.

Figure 4.8 shows the block diagram of an analog wrapper that can accommodate all the abovementioned requirements. The figure highlights the control and clock signals generated by the test control circuit. The registers at each end of the data converters are written and read in a semiserial fashion depending on the frequency requirement of each test. For example, for a digital TAM clock of 50 MHz, 12-bit DAC and ADC resolution, and an analog test requirement of an 8-MHz sampling frequency, the input and output registers can be updated with a serial-to-parallel ratio of 6. Thus, the bandwidth requirement of this particular test is only 2 bits. The digital test control circuit selects the configuration for each test. This configuration includes the divide ratio of the digital TAM clock, the serial-to-parallel conversion rate of the input and output registers of the data converters, and the test modes.

Block diagram of the analog test wrapper [Sehgal 2003b].

Figure 4.8. Block diagram of the analog test wrapper [Sehgal 2003b].

Analog Test Wrapper Modes

In the normal mode of operation, the analog test wrapper is completely bypassed; the analog circuit operates on its analog input/output pins. During testing, the analog wrapper has two modes, a self-test mode and a core-test mode. Before running any tests on the analog core, the wrapper data converters have to be characterized for their conversion parameters, such as the nonlinearity and the offset voltage. The self-test mode is enabled through the analog multiplexer at the input of the wrapper ADC, as shown in Figure 4.8. The parameters of the DAC-ADC pair are determined in this mode and are used to calibrate the measurement results. Once the self-test of the test wrapper is complete, core testing can be enabled by turning off the self-test bits.

For each analog test, the encoder has to be set to the corresponding serial-to-parallel conversion ratio (cr), where it shifts the data from the corresponding TAM inputs into the register of the ADC. Similarly, the decoder shifts data out of the DAC register. The update frequency of the input and output registers, fupdate = fs × cr, is always less than the TAM clock rate, fTAM. For example, if the test bandwidth requirement is 2 bits and the resolution of the data converters is 12 bits, the input and output registers of the data converters are clocked at a rate six times less than the clock of the encoder, and the input data are shifted into the encoder and out of the decoder at a 2-bits/cycle rate. The complexity of the encoder and the decoder depends on the number of distinct bandwidth and TAM assignments (the number of possible test configurations). For example, for a 12-bit resolution, the bandwidth assignments may include 1, 2, 3, 4, 6, and 12 bits, where in each case the data may come from distinct TAMs. Clearly, to limit the complexity of the encoder-decoder pair, the number of such distinct assignments has to be limited. This requirement can be imposed in the test scheduling optimization algorithm.
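The configuration choice described above can be sketched as a small search: pick the smallest bandwidth (bits per TAM cycle) whose serial-to-parallel ratio still sustains the core's required sampling rate. This follows the worked numbers in the text (12-bit converters, 50-MHz TAM clock, 8-MHz sampling); the function itself is an illustrative assumption, not part of [Sehgal 2003b].

```python
# Analog-wrapper configuration sketch: shifting a `resolution`-bit
# word at b bits/cycle takes cr = resolution / b TAM cycles per
# sample, so the achievable sampling rate is f_tam / cr. Pick the
# smallest b (narrowest TAM slice) that still meets the requirement.

def wrapper_bandwidth(resolution, f_tam_mhz, f_s_mhz):
    """Return (bandwidth_bits, serial_to_parallel_ratio) for the
    cheapest configuration meeting sampling rate f_s."""
    for b in range(1, resolution + 1):
        if resolution % b:
            continue                  # keep the conversion ratio integral
        cr = resolution // b          # serial-to-parallel ratio
        if f_tam_mhz / cr >= f_s_mhz:
            return b, cr
    raise ValueError("TAM clock too slow for this sampling rate")
```

For the example in the text, a 1-bit slice would only sustain 50/12 ≈ 4.2 MHz, so the 2-bit configuration with conversion ratio 6 is the cheapest that meets 8 MHz.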

The analog test wrapper transparently converts the analog test data to the digital domain through efficient utilization of the resources, thus obviating the need for analog testers. The processing of the collected data can be done in the tester by adding appropriate algorithms, such as the FFT algorithm. Further details and experimental results can be found in [Sehgal 2003b], [Sehgal 2005], and [Sehgal 2006b].

Modular Testing of Hierarchical SOCs

A hierarchical system-on-chip (SOC) is designed by integrating heterogeneous technology cores at several layers of hierarchy [SIA 2005]. The ability to reuse embedded cores in a hierarchical manner implies that today’s SOC is tomorrow’s embedded core [Gallagher 2001]. Two broad design transfer models are emerging in hierarchical SOC design flows.

  1. Noninteractive. The noninteractive design transfer and hand-off model is one in which there is limited communication between the core vendor and the SOC integrator. The hard cores are taken off-the-shelf and integrated into designs as optimized layouts.

  2. Interactive. The interactive design transfer model is typical of larger companies where the business units producing intellectual property (IP) cores may be part of the same organization as the business unit responsible for system integration. Here, there is a certain amount of communication between the core vendor and core user during system integration. The communication of the core user’s requirements to the core vendor can play a role in determining the core specifications.

Hierarchical SOCs offer reduced cost and rapid system implementation; however, they pose difficult test challenges. Most TAM design methods assume that the SOC hierarchy is flattened for the purpose of test. However, this assumption is often unrealistic in practice, especially when older-generation SOCs are used as hard cores in new SOC designs. In such cases, the core vendor may have already designed a TAM within the “megacore” that is provided as an optimized and technology-mapped layout to the SOC integrator.

A megacore is defined as a design that contains nonmergeable embedded cores. To ensure effective testing of an SOC based on megacores, the top-level TAM must communicate with lower level TAMs within megacores. Moreover, the system-level test architecture must be able to reuse the existing test architecture within cores; redesign of core test structures must be kept to a minimum, and it must be consistent with the design transfer model between the core designer and the core user [Parulkar 2002].

A TAM design methodology that closely follows the design transfer model in use is necessary because if the core vendor has implemented “hard” (i.e., nonalterable) TAMs within megacores, the SOC integrator must take into account these lower-level TAM widths while optimizing the widths and core assignment for higher-level TAMs. On the other hand, if the core vendor designs TAMs within megacores in consultation with the SOC integrator, the system designer’s TAM optimization method must be flexible enough to include parameters for lower-level cores. Finally, multilevel TAM design for SOCs that include reused cores at multiple levels is needed to exploit “TAM reuse” and “wrapper reuse” in the test development process.

It is only recently that the problem of designing test wrappers and multilevel TAMs for the “cores within cores” design paradigm has been investigated [Iyengar 2003a] [Chakrabarty 2005]. Two design flows have been considered for the scenario in which megacores are wrapped by the core vendor before delivery. In an alternative scenario, the megacores can be delivered to the system integrator in an unwrapped fashion, and the system integrator appropriately designs the megacore wrappers and the SOC-level TAM architecture to minimize the overall test time.

Figure 4.9 illustrates a megacore that contains four embedded cores and additional logic external to the embedded cores. The core vendor for this megacore has wrapped the four embedded cores, and implemented a TAM architecture to access the embedded cores. The TAM architecture consists of two test buses of widths 3 bits and 2 bits, respectively, that are used to access the four embedded cores. It is assumed here that the TAM inputs and outputs are not multiplexed with the functional pins. Next, Figure 4.10 shows how a two-part wrapper (wrapper 1 and wrapper 2) for the megacore can be designed not only to drive the TAM wires within the megacore but also to test the logic that is external to the embedded cores. In this design, the TAM inputs for wrapper 1 and wrapper 2 are multiplexed in time such that the embedded cores within the megacore are tested before the logic external to them, or vice versa. Test generation for the top-level logic is done by the megacore vendor with the wrappers for the embedded cores in functional mode. During the testing of the top-level logic in the megacore using wrapper 1, the wrappers for the embedded cores must therefore be placed in functional mode to ensure that the top-level logic can be tested completely through the megacore I/Os and scan terminals.

An illustration of a megacore with a predesigned TAM architecture.

Figure 4.9. An illustration of a megacore with a predesigned TAM architecture.

An illustration of a two-part wrapper for the megacore that is used to drive the TAMs in the megacore and to test the logic external to the embedded cores.

Figure 4.10. An illustration of a two-part wrapper for the megacore that is used to drive the TAMs in the megacore and to test the logic external to the embedded cores.

Megacores may be supplied by core vendors in varying degrees of readiness for test integration. For example, the IEEE 1500 standard on embedded core test defines two compliance levels for core delivery: 1500-wrapped and 1500-unwrapped [IEEE 1500-2005]. Here we describe three other scenarios, based in part on the 1500 compliance levels. These scenarios refer to the roles played by the system integrator and the core vendor in the design of the TAM and the wrapper for the megacore. For each scenario, the design transfer model refers to the type of information about the megacore that is provided by the core vendor to the system integrator. The term wrapped is used to denote a core for which a wrapper has been predesigned, as in [Marinissen 2002b]. The term TAM-ed is used to denote a megacore that contains an internal TAM structure.

  1. Scenario 1. Not TAM-ed and not wrapped: In this scenario, the system integrator must design a wrapper for the megacore as well as TAMs within the megacore. The megacores are therefore delivered either as soft cores or before final netlist and layout optimization such that TAMs can be inserted within the megacores.

  2. Scenario 2. TAM-ed and wrapped: In this scenario, we consider TAM-ed megacores for which wrappers have been designed by the core vendor. This scenario is especially suitable for a megacore that was an SOC in an earlier generation. It is assumed that the core vendors wrap such megacores before design transfer, and test data for the megacore cannot be further serialized or parallelized by the SOC integrator. This implies that the system integrator has less flexibility in top-level TAM partitioning and core assignment. At the system level, only structures that facilitate normal/test operation, interconnect test, and bypass are created. This scenario includes both the interactive and noninteractive design transfer models.

  3. Scenario 3. TAM-ed but not wrapped: In this scenario, the megacore contains lower-level TAMs, but it is not delivered in a wrapped form; therefore, the system integrator must design a wrapper for the megacore. To design a wrapper as sketched in Figure 4.10, the core vendor must provide information about the number of functional I/Os, the number and lengths of top-level scan chains in the megacore, the number of TAM partitions and the size of each partition, and the test time for each TAM partition. Compared to the noninteractive design transfer model in scenario 2, the system integrator in this case has greater flexibility in top-level TAM partitioning and core assignment. Compared to the interactive design transfer model in scenario 2, the system integrator here has less influence on the TAM design for a megacore; however, this loss of flexibility is somewhat offset by the added freedom of being able to design the megacore wrapper. Width adaptation can be carried out in the wrapper for the megacore such that a narrow TAM at the SOC-level can be used to access a megacore that has a wider internal TAM.

Optimization techniques for these scenarios are described in detail in [Iyengar 2003a], [Chakrabarty 2005], [Sehgal 2004b], and [Sehgal 2006a]. As hierarchical SOCs become more widespread, it is expected that more research effort will be devoted to this topic.

Wafer-Sort Optimization for Core-Based SOCs

Product cost is a major driver in the consumer electronics market, which is characterized by low profit margins and the use of SOC designs. Packaging has been recognized as a significant contributor to the product cost for such SOCs [Kahng 2003]. To reduce packaging cost and the test cost for packaged chips, the semiconductor industry uses wafer-level testing (wafer sort) to screen defective dies [Maxwell 2003]. However, because test time is a major practical constraint for wafer sort, even more so than for package test, not all the scan-based digital tests can be applied to the die under test. An optimal test-length selection technique for wafer-level testing of core-based SOCs has been developed [Bahukudumbi 2006]. This technique, which is based on a combination of statistical yield modeling and integer linear programming, allows us to determine the number of patterns to use for each embedded core during wafer sort such that the probability of screening defective dies is maximized for a given upper limit on the SOC test time.
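The structure of the test-length selection problem can be sketched as a multiple-choice knapsack: for each core, pick one pattern count (with its time cost and defect-screening value) so that the total time fits the wafer-sort budget and the summed value is maximal. [Bahukudumbi 2006] solves this with ILP; the tiny exhaustive version below only illustrates the objective, and all the per-core numbers are invented.

```python
# Exhaustive multiple-choice knapsack sketch of test-length selection
# for wafer sort: one (patterns, time, screening_value) option is
# chosen per core, subject to an overall test-time budget.
from itertools import product

def select_test_lengths(cores, time_budget):
    """cores: list of option lists [(patterns, time, value), ...],
    one list per core. Returns (best_value, chosen pattern counts)."""
    best = (0.0, None)
    for choice in product(*cores):
        t = sum(opt[1] for opt in choice)
        v = sum(opt[2] for opt in choice)
        if t <= time_budget and v > best[0]:
            best = (v, tuple(opt[0] for opt in choice))
    return best
```

With a tight budget, the optimizer typically truncates the longer test sets rather than dropping a core entirely, mirroring the pattern-count trade-off described above.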

One SOC test scheduling method attempted to minimize the average test time for a packaged SOC, assuming an abort-on-first fail strategy [Larsson 2004] [Ingelsson 2005]. The key idea in this work is to use defect probabilities for the embedded cores to guide the test scheduling procedure. These defect probabilities are used to determine the order in which the embedded cores in the SOC are tested, as well as to identify the subsets of cores that are tested concurrently. The defect probabilities for the cores were assumed in [Larsson 2004] to be either known a priori or obtained by binning the failure information for each individual core over the product cycle [Ingelsson 2005]. In practice, however, short product cycles make defect estimation based on failure binning difficult. Moreover, defect probabilities for a given technology node are not necessarily the same for the next (smaller) technology node. Therefore, a yield modeling technique is needed to accurately estimate these defect probabilities.

In [Bahukudumbi 2006], the researchers show how statistical yield modeling for defect-tolerant circuits can be used to estimate defect probabilities for embedded cores in an SOC. The test-length selection problem for wafer-level testing of core-based SOC is next formulated. Then, integer linear programming is used to obtain optimal solutions for the test-length selection problem. The application of this approach to reduced pin-count testing is presented in [Bahukudumbi 2007].

Network-on-Chip (NOC) Testing

Testing an NOC-based system includes testing of the embedded cores and testing of the on-chip network. The former is similar to conventional SOC testing; the latter targets the NOC itself, including the interconnects, switches/routers, input/output ports, and other mechanisms apart from the cores. Because a dedicated test infrastructure would add excessive routing overhead, all tests in the NOC domain should be performed in a cost-efficient manner, which can be achieved by reusing the NOC itself as a test access mechanism (TAM).

NOC Architectures

A typical packet-switching network model called the system-on-chip interconnection network (SOCIN) [Zeferino 2003], implemented on a two-dimensional (2-D) mesh topology, is shown in Figure 4.11. Here the d695 circuit from the ITC-2002 SOC benchmarks [ITC 2002] is stitched into the network for illustration. The bidirectional communication channels are 32 bits wide in each direction, and the packets have unlimited length. SOCIN uses credit-based flow control and XY routing—a deadlock-free, deterministic, source-based approach in which a packet is first routed in the X direction and then in the Y direction before reaching its destination. Switching is based on the wormhole approach, in which a packet is broken up into flits (flow-control units, the smallest units over which flow control is performed), and the flits follow the header in a pipelined fashion. The flit size equals the channel width. This platform is assumed in the rest of this section for illustration.
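XY routing is simple enough to sketch directly: a packet moves along the X dimension until its column matches the destination, then along Y. The helper below is an illustration of the routing rule described above, using (x, y) router coordinates on the mesh.

```python
# XY routing on a 2-D mesh (sketch): deterministic and deadlock-free
# because every path turns at most once, from the X to the Y dimension.

def xy_route(src, dst):
    """Return the sequence of routers visited from src to dst."""
    x, y = src
    dx, dy = dst
    path = [(x, y)]
    step = 1 if dx > x else -1
    while x != dx:                 # route along X first
        x += step
        path.append((x, y))
    step = 1 if dy > y else -1
    while y != dy:                 # then along Y
        y += step
        path.append((x, y))
    return path
```

For example, a packet from router (0, 0) to router (2, 1) crosses (1, 0) and (2, 0) before turning into the Y dimension.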

System d695 implemented in the SOCIN NOC.

Figure 4.11. System d695 implemented in the SOCIN NOC.

NOCs typically use the message-passing communication model, and the processing cores attached to the network communicate by sending and receiving request and response messages. To be routed by the network, a message is composed of a header, a payload, and a trailer. The header and the trailer frame the packet, and the payload carries the data being transferred. The header also carries the information needed to establish the path between the sender and the receiver. Depending on the network implementation, messages can be split into packets, which have the same format as a message and are individually routed. Packet-based networks achieve better resource utilization, because packets are short and reserve only a small number of channels during their transportation.

Besides its topology, an NOC can be described by the approaches used to implement its mechanisms for flow control, routing, arbitration, switching, and buffering, as follows. Flow control deals with data traffic on the channels and inside the routers. Routing is the mechanism that defines the path a message takes from a sender to a receiver. Arbitration schedules access when two or more messages request the same resource. Switching is the mechanism that takes an incoming message at a router and places it on an output port of the router. Finally, buffering is the strategy used to store messages when a requested output channel is busy. Current embedded cores usually need wrappers to adapt their interfaces and protocols to those of the target NOC. Such wrappers pack and unpack the data exchanged by the processing cores.

Testing of Embedded Cores

Testing of the embedded cores in NOC-based systems poses considerable challenges. In a traditional SOC, test data are transported through a dedicated TAM. This strategy, however, could lead to difficulty of routing in an NOC-based system, because the network (routers, channels, etc.) has already imposed significant routing overhead. Therefore, many current approaches for testing NOC-based systems rely on the reuse of the existing on-chip communication infrastructure as a TAM [Cota 2003, Cota 2004] [Liu 2004].

Reuse of On-Chip Network for Testing

To reuse the on-chip network as an access mechanism, the test vectors and test responses of each core are first organized into sets of packets that can be transmitted through the network. To keep the original wrapper design of each core unchanged and to minimize test application time and cost, the test packets are defined in such a way that each flit arriving from the network can be unpacked in one cycle. Each bit of a packet flit fills exactly one bit of a scan chain of the core. Functional inputs and outputs of the core, as well as the internal scan chains, are concatenated into wrapper (external) scan chains of similar length such that the channel width is enough to transport one bit for each wrapper scan chain. Control information, such as scan shift and capture signals, is also delivered in packets, either in the test header (to be interpreted by the wrapper) or as specific bits in the payload (for direct connection to the target pins). This concept will not considerably affect the core wrapper design, which can still follow the conventional SOC wrapper design methodology. A wrapper configuration during test and normal operation is depicted in Figure 4.12 [Cota 2003].
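The flit-to-scan-chain mapping above can be sketched concretely: with w balanced wrapper scan chains, bit i of each payload flit feeds chain i, so exactly one flit is consumed per scan-shift cycle. The helper below is an illustrative assumption, using bit strings for scan data.

```python
# Packing one test vector into payload flits (sketch): flit j carries
# the j-th bit of every wrapper scan chain, so unpacking a flit takes
# a single shift cycle, as required by the scheme in [Cota 2003].

def vector_to_flits(chains):
    """chains: list of w equal-length bit strings (one per wrapper
    scan chain). Returns the payload flits, one per shift cycle."""
    length = len(chains[0])
    assert all(len(c) == length for c in chains), "chains must be balanced"
    return ["".join(c[j] for c in chains) for j in range(length)]
```

For two wrapper scan chains loaded with "101" and "010", three flits are produced, one per shift cycle, each two bits wide to match a 2-bit channel slice.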

Wrapper configurations of cores in NOC-based system [Cota 2003]: (a) test mode and (b) function mode.

Figure 4.12. Wrapper configurations of cores in NOC-based system [Cota 2003]: (a) test mode and (b) function mode.

Test packets can be injected from an external tester and routed to the core under test, and corresponding test responses will be assembled into packets and routed back to the tester. This requires some test interfaces, which can be either dedicated I/O ports or reused embedded cores. Figure 4.13 [Liu 2004] illustrates the scenario using the NOC-based system shown in Figure 4.11. Note that there are two input ports and two output ports, each associated with a core by reusing the core’s input/output or wrapper (if the core is wrapped) infrastructure. In a BIST environment, test data packets can be generated on-chip, and responses can also be analyzed via MISRs or comparators. In practice, each core can have several BIST and external test sets simultaneously.

Figure 4.13. System d695 with test ports and routing paths [Liu 2004].

Test Scheduling

Test scheduling is one of the major problems in embedded core testing. A general form of this problem, S, can be formulated as follows: given an NOC-based system, a maximal number of input/output ports, a set of test sets (each core may have multiple test sets, which may be deterministic, random, or functional tests for processor cores), a set C of constraints, and a set of test resources (dedicated TAMs, BIST engines, etc.), determine a selection of input/output ports, an assignment of test resources to cores, and a schedule of test sets such that the optimization objective(s) is (are) minimized and all constraints are met. Note that the objective can be any cost factor, such as test time or hardware overhead. A basic subset of problem S is problem S0, in which only core tests are optimized under some constraints; S0 was proved NP-complete in [Liu 2004]. S and other subsets of S can be similarly proved NP-complete.

In an NOC-based system, test scheduling can be done in either a preemptive or a nonpreemptive manner. As in a general SOC, an optimized schedule should maximize test data parallelism. In an NOC-based system, this is done by exploiting network parallelism so that all available communication resources (channels, routers, and interfaces) are used in parallel to transmit test data. In preemptive test scheduling, test data are transformed into test packets, which are transmitted through the network in such a way that one test packet contains one test vector or one test response. Because each test vector or test response can be scheduled individually, network parallelism takes precedence over the core test pipeline, and the test of a core can be interrupted. As a result, the pipeline of the core's scan-in and scan-out operations cannot be maintained.

Preemptive testing is not always desirable in practice, especially for BIST and sequential circuit testing [Iyengar 2002a]. In addition, it is always desirable that the test pipeline of a core not be interrupted; that is, the nth test vector is shifted into the scan chains while the (n - 1)th test response is shifted out, so that test time is minimized. With preemption, however, the test pipeline must be halted whenever a test vector or test response packet cannot be scheduled because of the unavailability of test resources (i.e., channels and input/output ports). This not only increases the complexity of wrapper control but can also increase test time.

A nonpreemptive schedule maintains the test pipeline so that the wrapper can remain unchanged and the test time can be potentially reduced. In this approach, the scheduler will assign each core a routing path, including an input port, an output port, and the corresponding channels that transport test vectors from the input to the core and the test responses from the core to the output in the form of packets. Once the core is scheduled on this path, all resources (input, output, channels) on the path will be reserved for the test of this core until the entire test is completed. Test vectors will be routed to the core and test responses to the output in a pipelined fashion. Therefore, in this nonpreemptive schedule, the test of a core is identical to a normal test and the flow control becomes similar to circuit switching. Note that in Figure 4.13, cores 8 and 10 are scheduled on two I/O pairs in a nonpreemptive manner. It has been shown that usually nonpreemptive scheduling can yield shorter test time compared to preemptive scheduling [Liu 2004]. This is because the test pipeline can be maintained, scan-in and scan-out can be overlapped, and, hence, test time is reduced. It can also avoid the possibility of resource conflict. The complexity of a nonpreemptive scheduling algorithm is much lower than that of a preemptive scheduling algorithm because the minimum manageable unit in scheduling is a set of test packets in the former, instead of a single packet in the latter.
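
A nonpreemptive scheduler of the kind described above can be sketched with a simple greedy list-scheduling heuristic (this simplification is an assumption; the algorithm in [Liu 2004] also models channel conflicts along routing paths, which are abstracted here as independent I/O-port pairs):

```python
# Minimal nonpreemptive scheduling sketch: each core is assigned to the
# earliest-available routing path (I/O-port pair) and holds it for its
# entire test, so the scan-in/scan-out pipeline is never interrupted.
# Modeling paths as independent resources is a simplifying assumption.

import heapq

def schedule(test_lengths, num_paths):
    """test_lengths: {core: test cycles}. Returns per-core start times
    and the overall test time."""
    # Longest tests first tends to shorten the overall schedule.
    order = sorted(test_lengths, key=test_lengths.get, reverse=True)
    paths = [0] * num_paths            # time at which each path frees up
    heapq.heapify(paths)
    start = {}
    for core in order:
        t = heapq.heappop(paths)       # earliest-available routing path
        start[core] = t                # path reserved for the whole test
        heapq.heappush(paths, t + test_lengths[core])
    return start, max(start[c] + test_lengths[c] for c in start)

start, total = schedule({'c1': 500, 'c2': 300, 'c3': 300, 'c4': 100}, 2)
print(total)  # overall test time: 600 cycles
```

Because the minimum schedulable unit is a core's entire packet set rather than a single packet, the scheduler's search space is far smaller than in the preemptive case, which is the complexity advantage noted above.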

In practice, it is more realistic to have both preemptive and nonpreemptive test configurations when testing a system. It is also necessary to consider various constraints, such as power, precedence, and multiple test sets (e.g., external test and BIST). A preemptive schedule can be useful under these requirements. For instance, excessive power dissipation and inadequate heat convection/conduction can cause some cores to become significantly hotter than others, creating so-called hot spots. Applying the entire test suite continuously can lead to dangerously high temperatures in these cores. In this case, the test suite can be split into several test sessions (or even single test vectors in the extreme case) that are scheduled individually in a preemptive manner, with sufficient time between test sessions for the hot spots to cool down.

Test Access Methods and Test Interface

Test access and test interface design in an NOC-based system need techniques different from those in conventional SOCs. A typical instance is the multiprocessor system discussed in [Aktouf 2002]. Each node in such an NOC-based system can contain a processor core and the corresponding router or switch, buffers, etc. Testing a processor-based system usually mandates both deterministic and functional tests. Most functional test approaches, such as software-based BIST [Chen 2003], can be applied; however, all test patterns and test responses must be organized into sets of packets, and all necessary network interfaces need to be added. For deterministic tests, nodes can be organized into groups, and each group can be tested using a conventional boundary scan approach. Figure 4.14 shows the test configuration in this scheme when nodes are organized into 1×1 and 2×2 groups, respectively. Nodes in the same group share the test infrastructure for boundary scan (test access port [TAP] controller, I/O cells, etc.) and are tested in parallel. Because the nodes are identical, each bit of test data is broadcast to all nodes in a group. Test responses from all nodes in a group can be processed on-chip by feeding them to a comparator, as shown in Figure 4.14.

Figure 4.14. Testing identical nodes in NOC-based multiprocessor system using boundary scan in group of 1×1 (left) and 2×2 (right) [Aktouf 2002].

To reuse the network to transport test data, a test interface has to be established to handle both the functional protocol from the network and test application to the core. A wrapper is therefore needed for each core as an interface. Because the core and the network may use different protocols, the wrapper must be extended while still incorporating the standard IEEE 1500 test modes. This includes modifications to the test wires, wrapper cells, and test control logic. The TAM port in the 1500 wrapper should be replaced by a port connecting to the network, and the control logic should include processing of the network protocol. Further, the wrapper cells should be modified accordingly to implement the protocol. An instance of such a wrapper cell, alongside a standard 1500 wrapper cell, is shown in Figure 4.15 [Amory 2006]. Note that both cells have functional/scan input/output terminals and a few control terminals for the MUXes. Compared with the traditional 1500 wrapper cell, the modified cell has an additional MUX. Terminal prot_in receives the required values for protocol operation from the control logic. In test mode, terminal prot_mode is asserted to "1" to ensure that test signals do not interfere with the functional protocol. The actual protocol is implemented in the control logic. The other modes of the 1500 wrapper are unaffected.

Figure 4.15. Standard 1500 wrapper cell (right) and modified wrapper cell for NOC-based system (left) [Amory 2006].
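
A behavioral sketch of the modified wrapper cell may make the MUX arrangement concrete. The exact MUX ordering is an assumption inferred from the description above, not the [Amory 2006] schematic: a standard 1500 cell selects between the functional and scan inputs, and the extra MUX overrides the result with the protocol value while prot_mode is asserted:

```python
# Behavioral sketch of the modified wrapper cell (MUX ordering is an
# assumption based on the text, not the exact [Amory 2006] design).

def modified_wrapper_cell(func_in, scan_in, prot_in,
                          shift_mode, prot_mode):
    # Standard 1500 cell: shift_mode selects the scan path vs. the
    # functional path.
    cell_out = scan_in if shift_mode else func_in
    # Extra MUX: in test mode (prot_mode = 1) the network-facing output
    # carries the protocol value so test data cannot disturb the protocol.
    return prot_in if prot_mode else cell_out

# Functional mode: the functional input passes through unchanged.
print(modified_wrapper_cell(1, 0, 0, shift_mode=0, prot_mode=0))  # 1
# Test mode: the protocol value from the control logic wins.
print(modified_wrapper_cell(1, 0, 0, shift_mode=1, prot_mode=1))  # 0
```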

Efficient Reuse of Network

Test efficiency is also critical in future massive NOC-based systems containing a large number of cores. One challenge in this reuse-based approach is that the channel width is determined by system performance requirements during the design process and hence cannot be optimized for test purposes. This can be illustrated by the example shown in Figure 4.16 [Liu 2005]. A core test wrapper is usually designed using balanced wrapper scan chains. Figure 4.16a shows a core with two internal scan chains of lengths 4 and 8, two primary inputs, and two primary outputs. In Figure 4.16b, two balanced wrapper scan chains are designed for the core. Note that eight test cycles are needed to scan in a test vector. If this core is used in an NOC with a channel width of 2, each test vector can be loaded using eight payload flits (in eight clock cycles) with 2 bits per flit. If, instead, the number of wrapper scan chains is increased from two to four, the longest wrapper scan chain still has length 8, and the test time is unchanged. Therefore, two wrapper scan chains are sufficient for minimizing the test time.

Figure 4.16. An example of (a) an unwrapped core and (b) a balanced wrapper scan chain design.
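
Balanced wrapper scan chain design can be sketched with a simple longest-first assignment heuristic (a sketch of the general idea; the exact algorithm used for Figure 4.16 is not specified in the text). Using the Figure 4.16 core, two wrapper chains already reach the minimum scan-in time of eight cycles:

```python
# Sketch of balanced wrapper scan chain design: assign items longest
# first to the currently shortest wrapper chain. The heuristic choice
# is an assumption; the data match the Figure 4.16 example.

def balance(internal_chains, num_io_cells, num_wrapper_chains):
    """Assign internal scan chains and I/O wrapper cells (length 1 each)
    to wrapper chains, minimizing the longest chain (scan-in time)."""
    items = sorted(internal_chains, reverse=True) + [1] * num_io_cells
    lengths = [0] * num_wrapper_chains
    for item in items:                  # longest item to shortest chain
        shortest = lengths.index(min(lengths))
        lengths[shortest] += item
    return max(lengths)                 # scan-in cycles per test vector

# Two wrapper chains: {8} and {4 + four I/O cells}, both of length 8.
print(balance([8, 4], num_io_cells=4, num_wrapper_chains=2))  # 8
# Four wrapper chains do not help: the length-8 chain still dominates.
print(balance([8, 4], num_io_cells=4, num_wrapper_chains=4))  # 8
```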

However, in the context of network reuse in NOC test, the available TAM or channel width for wrapper scan chain design is already determined by the bandwidth requirements of the cores in mission mode, not in test mode. If the channel width is predesigned to be 4, then half of the channel wires will be idle during the core test. When legacy cores with test wrappers designed for minimal dedicated TAM widths are integrated into an NOC, a significant part of the network channel width can therefore remain idle during test. Table 4.1 [Liu 2005] illustrates how NOC channels can remain idle during the test of some cores in the d695 benchmark shown in Figure 4.11 by presenting test data statistics for the 10 cores. Column 2 lists the total number of test data packets (test vectors and the corresponding test responses) for each core. Note that test data are organized into packets in both preemptive and nonpreemptive scheduling because of the nature of the network. Columns 3 and 4 list the number of payload flits per packet and the total number of test cycles needed to test each core using a channel width of 16. Columns 5 and 6 list the corresponding numbers of flits and test cycles when the channel width equals 32. Test time is calculated assuming full scan with balanced scan chain design and a full test pipeline. Note that when the channel width increases from 16 to 32, the test times of cores 3, 4, and 8 do not decrease. Hence, a channel width of 16 is already adequate for optimized test time for these three cores; using a channel width of 32 leaves channel wires idle during test data transportation.

Table 4.1. Test Data Statistics for Cores in d695 of Figure 4.11 [Liu 2005]

                                 Channel Width = 16          Channel Width = 32
Core   Number of Test Patterns   Flits/Packet  Test Cycles   Flits/Packet  Test Cycles
1      24                        2             38            1             25
2      146                       13            1029          7             588
3      150                       32            2507          32            2507
4      210                       54            5829          54            5829
5      220                       109           12192         55            6206
6      468                       50            11978         41            9869
7      190                       43            4219          34            3359
8      194                       46            4605          46            4605
9      24                        128           1659          64            836
10     136                       109           7586          55            3863

To fully utilize the channel width to reduce test time, the on-chip clocking scheme presented in [Gallagher 2001] can be used to provide multiple test clocks to different cores such that test data throughput can be maximized [Liu 2005]. This can be done through a combination of on-chip clocking and parallel-serial conversion. Let the channel width for the NOC be w and the number of predesigned wrapper scan chains for a legacy core in the system be w′, where w′<w. Further, let n = ⌊w/w′⌋. The channel width w can be used to transport n flits in parallel to the wrapper, and the wrapper can serially scan in each flit to the core. To synchronize the core test wrapper operation with test data transportation on the network channel, the frequency of the on-chip clock supplied to the wrapper must be n times the frequency of the slower tester clock supplied to the network. This fast clock is generated by an on-chip phase-locked loop (PLL) [Gallagher 2001]. Additionally, a multiplexer controlled by the on-chip clock is used to select between the flits on the network channel during one tester clock cycle.
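
The parallel-serial conversion can be sketched as follows (the structure is an assumption based on the description above): during one tester clock cycle the channel presents n = ⌊w/w′⌋ flits, and the n-times-faster on-chip clock steps the multiplexer through them, one flit per fast cycle:

```python
# Sketch of the parallel-serial conversion: each w-bit channel word,
# stable for one slow tester cycle, carries n = w // w' flits that the
# fast-clocked MUX delivers to the w' wrapper scan chains one at a time.
# The bit ordering within a word is an assumption.

def deliver(channel_words, w, w_prime):
    """channel_words: list of w-bit tuples, one per tester clock cycle.
    Returns the sequence of w'-bit flits seen by the wrapper."""
    n = w // w_prime                     # flits carried per tester cycle
    flits = []
    for word in channel_words:           # one word per slow tester cycle
        for i in range(n):               # MUX steps n times per slow cycle
            flits.append(word[i * w_prime:(i + 1) * w_prime])
    return flits

# w = 4, w' = 2: each 4-bit channel word carries two 2-bit flits.
words = [(1, 0, 1, 1), (0, 1, 0, 0)]
print(deliver(words, w=4, w_prime=2))  # [(1, 0), (1, 1), (0, 1), (0, 0)]
```

The wrapper sees exactly the flit stream it was designed for, which is why no wrapper changes are needed.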

A possible test architecture using on-chip clocking is illustrated in Figure 4.17 with a simple instance where n = 2. Test data for two flits at a time are presented to the core test wrapper on the network channels. These test data remain stable on the network channel for the period of the slow tester clock. Test data are loaded into the wrapper scan chains one flit at a time every on-chip clock cycle. No changes are required to the core test wrapper, thereby protecting the core vendor’s IP as well as easing the core’s integration into the system.

Figure 4.17. Test architecture using on-chip clocking [Liu 2005].

Another method of fully utilizing the channel width is to use multiple data flit formats [Li 2006]: the data flit format is dynamically reconfigured to adapt to the number of unfilled wrapper scan chains. As shown in Figure 4.18, when both shorter scan chains are filled, the data flit can be reformatted to send more test data to the two longer scan chains. A wrapper scan chain configuration method was proposed in [Li 2006] to minimize the waste of NOC channel capacity. The basic idea is to organize the internal scan chains and functional I/Os of a core into factorial wrapper scan chain groups (FSCGs) with balanced-length chains within each group, as shown in Figure 4.19. For example, if the channel (flit) width equals 32, then FSCG1 and FSCG2 each contain 2 wrapper scan chains, and FSCG3 (FSCG4) contains 4 (8) wrapper scan chains. The first data flit format thus involves 16 wrapper scan chains (scan chains 1 to 16), and each wrapper scan chain receives 2 bits in each data flit. When the wrapper scan chains in FSCG4 are filled, the second data flit format, involving 8 wrapper scan chains (scan chains 1 to 8), takes over, and each wrapper scan chain receives 4 bits in each data flit. This significantly reduces the waste of channel capacity. Among 40 SOC benchmark cores [ITC 2002], 20 can be handled by equal-length configuration methods for traditional SOC testing without any waste, but the remaining 20 waste significant channel capacity. For these 20 cores, the proposed wrapper scan chain configuration reduces the channel capacity waste to zero for 14 cores and achieves a slight improvement for the other 6. Only two or three data flit formats are required in all of these cases, so the hardware overhead is quite small. Note that this wrapper scan chain configuration can be applied only when the core wrapper can be redesigned.

Figure 4.18. (a) Data flit format with each flit containing 8 bits/chain; (b) data flit format with each flit containing 16 bits/chain [Li 2006].

Figure 4.19. Factorial scan chain groups with channel width equal 32 [Li 2006].
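
The FSCG flit-format progression for the Figure 4.19 example can be sketched as follows (a sketch only; in practice a core typically needs just the first two or three formats, as noted above):

```python
# Sketch of factorial scan chain groups (FSCGs) for channel width 32,
# following the Figure 4.19 example: once the chains of the last group
# are filled, the flit is reformatted so the remaining chains get more
# bits per flit and no channel bits are wasted.

def flit_formats(channel_width, group_sizes):
    """group_sizes: chains per FSCG, e.g. [2, 2, 4, 8]. Returns a list
    of (active_chains, bits_per_chain) as trailing groups drop out."""
    formats = []
    active = sum(group_sizes)
    sizes = list(group_sizes)
    while sizes:
        formats.append((active, channel_width // active))
        active -= sizes.pop()        # the last group's chains get filled
    return formats

# First format: 16 chains x 2 bits; after FSCG4 fills: 8 chains x 4 bits;
# then 4 chains x 8 bits, and finally 2 chains x 16 bits.
print(flit_formats(32, [2, 2, 4, 8]))
```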

Power-Aware and Thermal-Aware Testing

On-chip power and thermal management is critical for future NOC-based systems. High power consumption has been a critical problem in system testing, and it is exacerbated when faster test clocks are used. Therefore, test scheduling needs to be power-aware. This is usually done by setting power constraints during test scheduling [Cota 2004] [Liu 2004]. However, stringent power constraints can make some high-power cores impossible to schedule. In this case, the test clocks need to be slowed down by scaling down the tester clock; a frequency divider can generate the slower test clock. If the slower clock rate is a factor 1/n of the tester clock rate, then no change is needed in the core test wrapper. Similar to the virtual channel routing method in [Duato 1997], each NOC channel can be viewed as n virtual channels, and each core using a slower clock occupies only one of them. Therefore, time-division scheduling of test packets is required. A conceptual timing diagram for a specific channel, where n = 3, is shown in Figure 4.20 [Liu 2005]. The figure shows that during one test clock cycle of core A, three test packets are routed through the channel to cores A, B, and C, respectively.

Figure 4.20. Slower on-chip clock using time-division scheduling [Liu 2005].
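
The time-division use of one channel in Figure 4.20 can be sketched as a round-robin slot assignment (the round-robin choice is an assumption for illustration):

```python
# Sketch of time-division scheduling on one NOC channel: with a divided
# clock at 1/n of the tester rate, each of n cores sees the channel as
# a virtual channel and receives one packet every n tester cycles.
# Round-robin slot assignment is an illustrative assumption.

def time_division(cores, tester_cycles):
    """Return which core's packet occupies the channel in each cycle."""
    n = len(cores)                      # channel divided into n slots
    return [cores[t % n] for t in range(tester_cycles)]

# n = 3: cores A, B, and C each get every third tester cycle, so each
# effectively runs at one third of the tester clock rate.
print(time_division(['A', 'B', 'C'], 6))  # ['A', 'B', 'C', 'A', 'B', 'C']
```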

The on-chip variable clocking scheme can also be used to remove hot spots on chip by assigning different test clocks to cores during test scheduling. This is particularly suitable for NOC-based systems because the cores are globally asynchronous and communicate by sending and receiving messages in the form of packets via the network. Each core can receive several test clocks generated by an on-chip clock generator. During test application, a core can vary its power dissipation, and hence its temperature, by choosing a different clock based on the test control information carried in the test packets. Slower clocks are used to reduce temperature, whereas faster clocks are used to reduce test time. This dynamic clock scaling scheme not only guarantees thermal safety but also achieves thermal balance, efficiently removing hot spots while optimizing test time. Test scheduling using this method can be found in [Liu 2006].

Testing of On-Chip Networks

An NOC usually consists of routers, network interfaces (between routers and embedded cores), and interconnects (channels). Although much work has been done (as discussed in Section 4.3.2) on core testing by transporting test data through the on-chip network fabric, work on testing the on-chip network itself has been limited [Amory 2005] [Grecu 2006] [Stewart 2006]. Unless the on-chip network of an NOC-based system has been thoroughly tested, it cannot be used to support the testing of embedded cores.

Testing of Interconnect Infrastructures

Interconnect testing has been discussed in many works [Singh 2002], and most of these techniques can be applied directly in the NOC domain. A major difference in interconnect testing between SOCs and NOCs is that, in the latter case, test patterns can be delivered using packets. Hardware overhead can be greatly reduced if the test patterns and responses are transported over the existing on-chip network.

Based on the well-known maximal aggressor fault (MAF) model [Cuviello 1999], an efficient built-in self-test methodology for testing the interconnects of an on-chip network has been proposed in [Grecu 2006]. For a set of interconnects with N wires, the MAF model assumes the worst-case situation with one victim and (N - 1) aggressors. The MAF model covers six crosstalk errors: rising/falling delay, positive/negative glitch, and rising/falling speed-up. Although it has been found that the MAF model may not always capture the worst case [Attarha 2001], its test coverage is generally high. Using this model, a total of 6N faults are to be tested, and 6N two-vector test patterns are required. These patterns also cover traditional faults such as stuck-at, stuck-open, and bridging faults [Grecu 2006]. Both unidirectional and bidirectional transactions are considered.
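
MAF test generation can be sketched as follows: each wire in turn is the victim and all others are aggressors, with six error types per victim, giving the 6N two-vector patterns. The transition assignments follow the usual MAF formulation; treat the exact encodings as an assumption rather than the [Grecu 2006] tables:

```python
# Sketch of MAF two-vector test generation for an N-wire channel.
# Transition encodings per error type are the commonly cited ones and
# should be treated as an assumption, not the [Grecu 2006] tables.

# (victim v1->v2, aggressors a1->a2) per crosstalk error type
MAF_ERRORS = {
    'rising_delay':    ((0, 1), (1, 0)),  # aggressors switch opposite way
    'falling_delay':   ((1, 0), (0, 1)),
    'positive_glitch': ((0, 0), (0, 1)),  # victim held low, aggressors rise
    'negative_glitch': ((1, 1), (1, 0)),
    'rising_speedup':  ((0, 1), (0, 1)),  # aggressors switch the same way
    'falling_speedup': ((1, 0), (1, 0)),
}

def maf_patterns(n_wires):
    """Yield (error, victim, vector1, vector2) for all 6N MAF faults."""
    for victim in range(n_wires):
        for error, ((v1, v2), (a1, a2)) in MAF_ERRORS.items():
            vec1 = [v1 if w == victim else a1 for w in range(n_wires)]
            vec2 = [v2 if w == victim else a2 for w in range(n_wires)]
            yield error, victim, vec1, vec2

patterns = list(maf_patterns(4))
print(len(patterns))  # 6 * 4 = 24 two-vector patterns
```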

A self-test structure that does not exploit the special properties of NOCs is shown in Figure 4.21. Here, a test data generator (TDG) and a test error detector (TED) are inserted to generate all MAF test patterns and observe the test responses. As shown in Figure 4.21, test patterns are launched before the line drivers and sampled after the receiver buffers; this exercises the crosstalk effects, which depend strongly on driver size and load. The self-test circuit also allows one clock cycle for data to travel from the transmitters to the receivers; multiple clock cycles are allowed if the interconnects are pipelined. Detailed designs of the TDG and TED can be found in [Grecu 2006]. This self-test structure can be inserted into the interconnects between each pair of routers (called point-to-point MAF self-test), and all interconnects can be tested in parallel as long as the power budget is not exceeded.

Figure 4.21. Test data generator (TDG) and test error detector (TED) [Grecu 2006].

By taking advantage of the on-chip network, the MAF test patterns can be broadcast to all interconnects in the form of test packets using only one TDG. Note that one TED is still required for the interconnects between each pair of routers, as shown in Figure 4.22. Here, the test packets are broadcast in a unicast manner (i.e., the interconnects between one pair of routers are tested for each test packet broadcast). A global test controller (GTC) is also designed to inject test patterns for testing the routers. A more powerful form of test packet broadcasting is the multicast MAF self-test shown in Figure 4.23. The major difference between unicast and multicast is that in the latter, test packets are broadcast to the interconnects of different pairs of routers to achieve maximum test parallelism. Detailed designs of the test packets, TDG, TED, and the multidestination broadcasting that supports the interconnect BIST infrastructure can be found in [Grecu 2006]. The proposed interconnect test approach was validated using a 64-router NOC. The results demonstrate that the point-to-point method has the smallest test application time but the largest hardware overhead, whereas the unicast method has the largest test time but the smallest overhead. The multicast method offers a good compromise between test application time and hardware overhead [Grecu 2006].

Figure 4.22. Interleaved unicast MAF test [Grecu 2006].

Figure 4.23. Interleaved multicast MAF test [Grecu 2006].

Testing of Routers

Routers are used to implement the functions of flow control, routing, switching, and buffering of packets for an on-chip network. Figure 4.24 shows a typical organization of a router, whereas Figure 4.25 shows a typical structure of an NOC-based design [Amory 2005]. As shown in Figure 4.25, router testing can be considered the same way as testing a sequential circuit. However, a special property of an on-chip network is its regularity. Based on this property, the idea of test pattern broadcasting discussed in Section 4.3.3.1 can be applied to reduce the test application time. In [Amory 2005], an efficient test method has been developed based on partial scan and an IEEE 1500-compliant test wrapper by taking advantage of the NOC regularity.

Figure 4.24. A typical organization of a router [Amory 2005].

Figure 4.25. An NOC-based system [Amory 2005].

In [Amory 2005], router testing is handled in three parts: testing each router, testing all routers (without considering network interfaces and interconnects), and designing the test wrapper. Testing a router consists of testing the control logic (routing, arbitration, and flow control modules) and the input first-in first-out buffers (FIFOs) shown in Figure 4.24. Control logic testing can be done using traditional sequential circuit testing methods such as scan. A smart way to test each FIFO is to configure its first register as part of a scan chain; the other registers can then be tested through this scan chain. Because FIFOs are generally not deep, this method proves very efficient [Amory 2005]. Because the routers are identical, all of them can be tested in parallel by test pattern broadcasting, as shown in Figure 4.26. A comparator implemented with XOR gates can be used to evaluate the output responses; the comparator logic also supports diagnosis.

Figure 4.26. Testing multiple identical routers [Amory 2005].

To support the proposed test strategy, an IEEE 1500-compliant test wrapper is designed to support test pattern broadcasting and test response comparison, as shown in Figure 4.27. For example, all SC1 scan chains of these routers share the same set of test patterns. Similarly, all Din[0] (i.e., Din-R0[0], ..., Din-Rn[0]) data inputs of these routers share the same set of test patterns. As Figure 4.27 shows, this wrapper also supports test response comparison for scan chains and data outputs. Finally, the diagnosis control block can activate diagnosis. Simulation results demonstrate that the proposed router test method achieves small hardware overhead (about 8.5% relative to the router hardware), a small number of test patterns (several hundred) thanks to test pattern broadcasting, and a small test application time (several tens of thousands of test cycles) through multiple balanced scan chains and test pattern broadcasting. Most important, the method is scalable: the test volume and test time increase at a much lower rate than the NOC size. More details on Figure 4.27 can be found in [Amory 2005].

Figure 4.27. Test wrapper design [Amory 2005].

Testing of Network Interfaces and Integrated System Testing

A network interface (NI) receives data bits from its corresponding IP core (or router), packetizes (depacketizes) the bits, and performs clock domain conversion between the router and the IP core. NIs may be the most difficult components to test in an on-chip network, because clock domain conversion introduces nondeterministic device behavior, which is detrimental to conventional stored-response testing. New structural test solutions must be developed for NI testing. In [Stewart 2006], based on the AEthereal [Goossens 2005] and Nostrum [Wiklund 2002] architectures, functional testing is used to detect faults in NIs, routers, and the corresponding interconnects. The following discussion is mainly based on the work in [Stewart 2006], with AEthereal as the target NOC architecture.

The AEthereal NI in [Stewart 2006] is outlined in Figure 4.28. The NI faults in AEthereal are represented with the four-tuple NI(c1, c2, o1, o2), where c1 identifies the NI under test, and c2 indicates whether the NI works as a source (S) or destination (D) during testing. The optional field o1 represents the transmission mode of the NI, BE (best effort) or GT (guaranteed throughput), and the optional field o2 represents the connection type of the NI: U (unicast), N (narrowcast), or M (multicast). Details of the transmission modes and connections of AEthereal can be found in [Radulescu 2005]. Each NI must be tested under all relevant combinations of these tuple values. For example, each NI must be tested both as a source (master) and as a destination (slave); in each case, the NI must be tested in both the BE and GT transmission modes. Two additional tests are required to test the narrowcast (N) and multicast (M) connections of the NI. Consequently, a total of six faults must be handled to thoroughly test each NI. Note that unicast (U) need not be added to the last two tests, because it is already exercised during the first four. Following the same process, 10 functional faults can be identified for each router. Test patterns must be generated to detect all six faults for each NI and all ten faults for each router.
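
The six-fault count per NI can be sketched by enumerating the tuple combinations (the role and mode assigned to the narrowcast and multicast tests are assumptions for illustration; [Stewart 2006] does not fix them in the text above):

```python
# Sketch enumerating the six NI functional faults: source/destination
# roles crossed with BE/GT modes (unicast implied), plus one narrowcast
# and one multicast test. The role/mode chosen for the last two tests
# is an illustrative assumption.

from itertools import product

def ni_faults(ni_id):
    faults = [(ni_id, role, mode, 'U')            # four unicast tests
              for role, mode in product('SD', ('BE', 'GT'))]
    faults += [(ni_id, 'S', 'GT', 'N'),           # narrowcast test
               (ni_id, 'S', 'GT', 'M')]           # multicast test
    return faults

faults = ni_faults('NI3')
print(len(faults))  # six faults per NI
```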

Figure 4.28. The AEthereal NI [Stewart 2006].

It is important to develop an efficient method that can generate test patterns shared between NI faults and router faults. A test scheduling method is also proposed in [Stewart 2006] to interleave the testing of router functional faults (ten per router) and NI functional faults (six per NI) such that the total test application time is minimized. Initially, a preprocessing step broadcasts data packets (GT data and BE data) from the I/O pins to the local memory of each core. During the test phase, an instruction packet is sent from the input port of the NOC to the source router using the GT transmission mode. The instruction packet contains information about the destination core, the transmission path, and the time at which the test pattern should be applied. After the test pattern is applied, the destination node generates a signature packet indicating whether a fault was detected [Stewart 2006].

The NI is indeed a complex device: it implements arbitration between channels, transaction ordering, end-to-end flow control, packetization and depacketization, and a link protocol with the router. Moreover, the NI must be designed to adapt to existing on-chip communication protocols. Functional testing alone is therefore not sufficient for NIs, and efficient structural test methods must be investigated. Testing an on-chip network piece by piece is inadequate; likewise, testing an NOC-based system by separating core testing from on-chip network testing is inadequate. The interactions between the cores and the on-chip network, as well as those among the network components (routers, interconnects, and NIs), must be tested thoroughly using extensive functional testing. Reliable NOC design is discussed in [Tamhankar 2005] [Pande 2006].

Design and Test Practice: Case Studies

Although multicore platforms are now being designed by a number of semiconductor companies (AMD, Intel, Sun), and incorporated in products such as servers and desktops, not much has been published in the open literature about testing and design-for-testability techniques adopted for these platforms. The literature on the testing of industrial SOCs and NOCs has largely been limited to publications from Philips. Hence, this chapter presents case studies based on Philips chips. For the SOC testing example, we consider the PNX8550 chip, which is used in the well-known Nexperia Home Platform [Goel 2004]. The example for NOC testing is a companion chip of PNX8550 [Steenhof 2006]. Philips developed both chips for its high-end (digital) TVs.

SOC Testing for PNX8550 System Chip

PNX8550 is used in the well-known Nexperia digital video platform (Figure 4.29) developed by Philips [Goel 2004]. The chip is fabricated in a 0.13 μm process with six metal layers and a 1.2 V supply voltage; it is packaged in a PBGA564 package, and the die size is 100 mm². The chip contains 62 logic cores (five hard cores and 57 soft cores), 212 memory cores, and 94 clock domains. The five hard cores are one MIPS CPU, two TriMedia CPUs, a custom analog block containing the PLLs and the delay-locked loops (DLLs), and a digital-to-analog converter. The 62 logic cores are partitioned into 13 chiplets (a chiplet is a group of cores placed together) because the cores in a chiplet are either synchronous to each other or not timing critical. Each chiplet is treated as an independent block connected to a specific set of TAM wires. As shown in Figure 4.29, PNX8550 contains one or more 32-bit MIPS CPUs (hard cores) to control the entire chip and one or more 32-bit TriMedia VLIW processors (hard cores) for streaming data. Other modules include the MPEG decoder, UART, and PIC 2.2 bus interface. The CPUs and many of the modules access external memory via a high-speed memory access network. Two device control and status (DCS) networks enable each processor to control or observe the on-chip modules, and a bridge allows the two DCS networks to communicate [Goel 2004].

Nexperia Home Platform [Goel 2004].

Figure 4.29. Nexperia Home Platform [Goel 2004].

PNX8550 inherits the requirement for a modular test strategy that allows tests to be reused, through the use of wrappers (the so-called TestShell [Marinissen 2000]) and TAMs (the so-called TestRails [Marinissen 2000]). Full-scan design is used for logic core testing and achieves 99% stuck-at fault coverage for all logic cores. Small embedded memories are also tested using scan chains, whereas large memories are tested using BIST. The design team of PNX8550 decided to allocate 140 TAM wires (i.e., 280 chip pins) for core testing. The design issues are how to assign these TAM wires to different cores and how to design the wrapper for each core. Both issues must be considered carefully so that the data-volume limit per test channel (28M) can be met and the overall test cost (mainly test application time) can be minimized. To solve these problems, Philips developed a tool, called TR-ARCHITECT, to handle these core-based testing requirements [Goel 2004]. TR-ARCHITECT supports three test architectures: daisy chain (see Figure 4.5b), distribution (see Figure 4.5c), and hybrid (a combination of daisy chain and distribution), as discussed in Section 4.2. It requires two kinds of inputs: an SOC data file and a list of user options. The SOC data file describes SOC parameters such as the number of cores in the SOC and the number of test patterns and scan chains in each core. The user options give the test choices, such as the number of SOC test pins, the type of each module (hard or soft), the TAM type (test bus/TestRail), the architecture type (daisy chain, distribution, or hybrid), the test schedule type (serial or parallel for daisy chain), and whether each module has an external bypass (yes/no).
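To make the architectural tradeoff concrete, the test times of the two basic architectures can be sketched with a toy model. The numbers below are illustrative, not PNX8550 data, and the model simplifies heavily: a core's test time is approximated as its test data volume divided by its TAM width, cores on a daisy chain are tested serially over the full TAM width, and cores in a distribution architecture each own a slice of the TAM and are tested in parallel.

```python
import math

def distribution_time(core_bits, widths):
    """Distribution architecture: each core owns a private slice of the
    TAM and all cores are tested in parallel; the slowest core dominates."""
    return max(math.ceil(b / w) for b, w in zip(core_bits, widths))

def daisy_chain_time(core_bits, width):
    """Daisy-chain architecture (serial schedule): every core sees the
    full TAM width, but the cores are tested one after another."""
    return sum(math.ceil(b / width) for b in core_bits)

# Two hypothetical cores with 1200 and 400 bits of test data, 4 TAM wires:
t_dist = distribution_time([1200, 400], [3, 1])   # well-balanced 3+1 split
t_daisy = daisy_chain_time([1200, 400], 4)
t_bad = distribution_time([1200, 400], [2, 2])    # poorly balanced split
```

With the well-balanced split, the two architectures tie at 400 cycles in this example, whereas the poorly balanced 2+2 split costs 600 cycles; this sensitivity to the width assignment is precisely why tools such as TR-ARCHITECT optimize it.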

As discussed previously, 140 TAM wires are used and 13 chiplets are identified. The distribution of the 140 TAM wires over the 13 chiplets was done manually, based on test data from a predecessor version and on engineering judgment. The number of TAM wires assigned to a chiplet ranges from 2 (for chiplet UCLOCK) to 21 (for chiplet UQVCP5L). Once this step is done, the next task is to design the test architecture inside each chiplet. The distribution architecture is chosen for all chiplets except two: UMDCS and UTDCS. For these two chiplets (hybrid test architecture), some TAM wires are shared by two or more cores (using a daisy chain), because there are more cores than wires, while other cores are connected in a distribution architecture. Test architecture design is trivial if the chiplet under consideration contains only one core (e.g., chiplets UMCU, UQVCP5L, UCLOCK, MIPS, TM1, and TM2): given the number of TAM wires and the core parameters, the wrapper design method presented in [Marinissen 2000] can be applied directly to design the core wrapper.
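The wrapper design step itself can be illustrated with a small sketch of the scan-chain balancing idea: the internal scan chains are partitioned over the available wrapper scan chains so that the longest wrapper chain, which determines the per-pattern scan time, is as short as possible. This longest-chain-first heuristic is only an approximation of the procedure in [Marinissen 2000], which also places the wrapper input and output cells; the chain lengths below are made up for illustration.

```python
def partition_scan_chains(chain_lengths, tam_width):
    """Assign internal scan chains to `tam_width` wrapper scan chains,
    always placing the next-longest internal chain on the currently
    shortest wrapper chain (longest-processing-time-first heuristic)."""
    wrapper = [[] for _ in range(tam_width)]
    totals = [0] * tam_width
    for length in sorted(chain_lengths, reverse=True):
        i = totals.index(min(totals))  # currently shortest wrapper chain
        wrapper[i].append(length)
        totals[i] += length
    return wrapper, max(totals)

# A hypothetical core with five internal scan chains on a 3-bit TAM:
wrapper, max_len = partition_scan_chains([12, 8, 8, 6, 4], 3)
```

Here `max_len` bounds the scan-in/scan-out length per test pattern, before wrapper boundary cells are appended to the chains.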

For a chiplet containing multiple cores and using the distribution test architecture, TR-ARCHITECT is used to determine the number of TAM wires assigned to each core. TR-ARCHITECT applies the idea in [Goel 2002] to determine the number of TAM wires assigned to each individual core and to design the wrapper for the core. For the chiplets UMDCS (22 soft cores) and UTDCS (17 soft cores) that have a hybrid test architecture, TR-ARCHITECT can be applied to determine the number of TAM-wire groups, the width assigned to each group, and the assignment of cores to each group [Goel 2002].
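For chiplets with a distribution architecture, the wire-assignment problem can be sketched as a greedy loop. This is an illustrative simplification of the [Goel 2002] algorithm, which uses exact wrapper-based test times rather than the bits-divided-by-width approximation used here; the core data below is hypothetical.

```python
import math

def distribute_tam_wires(core_test_bits, total_wires):
    """Start with one wire per core, then repeatedly award a spare wire
    to the current bottleneck core (the one with the longest test time).
    Test time is approximated as ceil(test bits / assigned width)."""
    n = len(core_test_bits)
    wires = [1] * n
    time = lambda i: math.ceil(core_test_bits[i] / wires[i])
    for _ in range(total_wires - n):
        wires[max(range(n), key=time)] += 1  # widen the bottleneck core
    return wires, max(time(i) for i in range(n))

# Three hypothetical cores sharing an 8-wire chiplet TAM:
wires, chiplet_time = distribute_tam_wires([1200, 400, 200], 8)
```

The chiplet's test time is set by its slowest core, so each spare wire goes where it shortens that bottleneck.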

TR-ARCHITECT contains four major procedures: Create-Start-Solution, Optimize-BottomUp, Optimize-TopDown, and Reshuffle [Goel 2002]. As its name suggests, Create-Start-Solution is used initially to assign at least one TAM wire to each core. If cores are left unassigned, they are added to the least occupied TAMs; conversely, if TAM wires are left unassigned, they are added to the most occupied TAMs. After the initial assignment, Optimize-BottomUp merges the TAM (possibly consisting of several wires) with the shortest test time into another TAM, so that the wires freed up in this process can be used for an overall test time reduction. For example, suppose TAM-1 contains three wires with 500 test cycles for Core-1, TAM-2 contains four wires with 200 test cycles for Core-2, and TAM-3 contains two wires with 100 test cycles for Core-3. Core-1 is then the bottleneck core, and the whole system needs 500 test cycles. Now, if Core-3 is merged onto TAM-2, the overall test time is not increased (200 + 100 = 300 test cycles, still below 500). The two wires freed up from TAM-3 can then be given to TAM-1. Assume that adding the two extra wires to TAM-1 reduces the test time of Core-1 from 500 to 350 test cycles (this may not always be the case). The overall test time is thereby reduced from 500 to 350 test cycles. The procedures Optimize-TopDown and Reshuffle follow the same idea and can be found in [Goel 2002]. Note that each procedure in TR-ARCHITECT requires wrapper-design and test-time information for each assignment of TAM wires, which can be provided by the method in [Marinissen 2000].
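The worked Optimize-BottomUp example above can be replayed in a few lines. Note that the drop from 500 to 350 cycles on the widened TAM is taken from the text as an assumption, not computed from pattern counts:

```python
# TAM name -> (wires, test cycles), as in the worked example.
tams = {"TAM-1": (3, 500), "TAM-2": (4, 200), "TAM-3": (2, 100)}

# Merge the shortest-time TAM (TAM-3) onto TAM-2: cores sharing a TAM
# are tested serially, so the merged time is 200 + 100 = 300 cycles,
# still below the 500-cycle bottleneck on TAM-1.
merged = tams["TAM-2"][1] + tams["TAM-3"][1]
assert merged <= tams["TAM-1"][1]  # merge is "free": bottleneck unchanged
freed_wires, _ = tams.pop("TAM-3")
tams["TAM-2"] = (tams["TAM-2"][0], merged)

# Hand the freed wires to the bottleneck TAM. The resulting 350-cycle
# figure is the text's assumption; widening a TAM does not always help.
tams["TAM-1"] = (tams["TAM-1"][0] + freed_wires, 350)

overall = max(cycles for _, cycles in tams.values())
```

After one merge step the overall test time is the new bottleneck, 350 cycles, down from 500.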

As described earlier, PNX8550 uses 140 TAM wires, and the distribution of the wires over the chiplets was done manually, because TR-ARCHITECT became available only halfway through the PNX8550 design process. In this case, the total test time is dominated by chiplet UTDCS with 3,506,193 test cycles. If the 140 TAM wires are instead distributed over the 13 chiplets by TR-ARCHITECT and a hybrid test architecture is used, the overall test time is reduced by 29% [Goel 2004]. TR-ARCHITECT assigns UTDCS three more TAM wires (to reduce its test time), and the chiplet dominating the test time becomes UMCU with 2,494,687 test cycles. This demonstrates the effectiveness and efficiency of TR-ARCHITECT. If the designer can further modify and optimize the number and lengths of the internal scan chains of all cores (except the TriMedia and MIPS cores), the test time can be further decreased, to 50% below the manual result; in this case, the dominating chiplet is TM2 with 1,766,095 test cycles. The test data volume fits onto the Agilent 93000-P600 test system with 28M-deep vector memories. The computational complexity of TR-ARCHITECT is low, and its computing time is negligible. More SOC test strategies can be found in [Vermeulen 2001] for PNX8525, a predecessor of PNX8550. The work in [Wang 2005] presents a BIST scheme for at-speed testing of multiclock designs, based on an SOC chip developed by Samsung.
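The reported percentages can be checked against the cycle counts, assuming (as the text implies) that both reductions are measured relative to the manual 3,506,193-cycle baseline:

```python
t_manual = 3_506_193  # manual wire distribution; bottleneck chiplet UTDCS
t_tool = 2_494_687    # TR-ARCHITECT distribution; bottleneck UMCU
t_scan = 1_766_095    # plus re-optimized internal scan chains; bottleneck TM2

reduction_tool = (t_manual - t_tool) / t_manual * 100
reduction_scan = (t_manual - t_scan) / t_manual * 100
assert round(reduction_tool) == 29  # matches the reported 29%
assert round(reduction_scan) == 50  # matches the reported 50%
```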

NOC Testing for a High-End TV System

At Philips, the NOC has been proven to be a mature technology by applying it to an existing SOC architecture for picture improvement in high-end TVs [Steenhof 2006]. The traditional interconnect design of several chips has been replaced by the AEthereal NOC architecture [Goossens 2005]. Figure 4.30 outlines the main TV chip (the PNX8550 discussed in Section 4.4.1), a companion chip, and external memories. The main chip, the master of the entire system, contains 61 IP blocks responsible for interacting with the user, the TV source, the TV display, and peripherals, and for configuring the companion chip. The companion chip contains 9 IP blocks for enhancing video quality; a companion chip usually implements the more advanced technologies that will not be released to competitors. The main and companion chips, each implemented with dedicated interconnect structures, are connected by a high-speed external link (HSEL). The dashed lines in Figure 4.30 represent a task that involves 11 IP blocks in the main chip, the companion chip, and two memories; the functionality of the whole system usually comprises hundreds of such tasks. I (O) stands for input (output), H for horizontal scaler, and C for control processor. Partitioning a complex system into multiple chips, including main and companion chips, has many advantages: for example, it reduces development risk (because smaller and less complex chips are implemented), manages the different innovation rates in different market segments, and encapsulates differentiating functionality [Steenhof 2006]. To further enhance flexibility, the dedicated interconnect of the companion chip has been replaced by an NOC structure [Steenhof 2006].

The main TV chip and companion chip [Steenhof 2006].

Figure 4.30. The main TV chip and companion chip [Steenhof 2006].

Figure 4.31 shows a detailed diagram of the on-chip network of the companion chip in Figure 4.30. The on-chip network contains routers (R), interconnects, and network interfaces (NIs). Each NI contains one kernel (K), one shell (S), and several ports. The functions of each on-chip network component can be found in [Radulescu 2005]. The numbers of master (M) and slave (S) ports are indicated in each NI; for example, the box labeled 2M, 2S is an NI with two master ports and two slave ports. The ports are connected to IP blocks such as microprocessors, DSPs, or memory arrays. The HSEL IO is the high-speed external link used to connect the main chip and the companion chip, whereas the new HSEL is used to attach another companion chip (e.g., an FPGA) to this chip. The on-chip network in the companion chip is thus basically a 2×2 mesh. The AEthereal NOC is configured at run time for the required set of tasks. The NOC structure offers great flexibility and reuse potential for the companion chip at the price of increased area (4%), power consumption (12%), and latency (10%), which is viewed as tolerable [Steenhof 2006].

On-chip network architecture of the companion chip [Steenhof 2006].

Figure 4.31. On-chip network architecture of the companion chip [Steenhof 2006].

Although no test strategy is proposed in [Steenhof 2006], a test method for Philips's AEthereal NOC architecture has been presented in [Vermeulen 2003]. It has been suggested that the on-chip network shown in Figure 4.31 be treated as a core, and that knowledge about the NOC be used to modify the standard core-based test approach into a better-suited test. For example, to test the NOC-based system in Figure 4.31, all identical blocks (e.g., all routers) can reuse the same test data via test broadcasting, as described in Section 4.3.3.2; the responses are compared to each other, and any mismatch is sent off-chip. Timing testing is emphasized as particularly important for two reasons: (1) many long wires exist in an NOC-based design, and they can cause crosstalk errors; and (2) all clock boundaries between cores lie inside the NIs, where timing errors can easily occur. The interconnect testing method discussed in Section 4.3.3.1 can be applied to address (1); multiple-clock-domain testing for the NIs, however, is still under investigation. Once the NOC-based system in Figure 4.31 has been structurally tested, the network can be used to transfer test data for all cores, as discussed in Section 4.3.2, in a flexible way: no new TAM wires need to be added to the design, and the NOC is fully reused for testing. The NOC structure also enables parallel testing of multiple cores, provided the channel capacity can support their test data transportation under a given power budget [Vermeulen 2003].
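The broadcast-and-compare idea for identical routers can be sketched as follows. Modeling a router's test response as a plain function of the input pattern is a large simplification, and comparing every copy against the first one assumes that faults do not corrupt all copies identically; real schemes implement pairwise or majority comparators in hardware.

```python
def broadcast_test(routers, patterns):
    """Apply each test pattern to all identical routers in lockstep and
    flag any response that disagrees with the first router's response.
    Returns a list of (pattern, router index) mismatches."""
    mismatches = []
    for p in patterns:
        responses = [r(p) for r in routers]
        for i, resp in enumerate(responses[1:], start=1):
            if resp != responses[0]:
                mismatches.append((p, i))
    return mismatches

# Toy router models: a response is just a function of the input pattern.
good = lambda p: p ^ 0b1010          # fault-free behavior (placeholder)
bad = lambda p: (p ^ 0b1010) | 0b1   # stuck-at-1 on the lowest output bit

clean = broadcast_test([good, good, good], [0b0000, 0b0110])
faulty = broadcast_test([good, bad], [0b0000])
```

A fault-free set of routers produces no mismatches, while the stuck-at router is flagged on any pattern whose fault-free response has a 0 in the lowest bit; only a mismatch indication, not the full responses, needs to go off-chip.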

Concluding Remarks

Rapid advances in test development techniques are needed to reduce the test cost of million-gate SOC devices. This chapter has described a number of state-of-the-art techniques for reducing test time, thereby decreasing test cost. Modular test techniques for digital, mixed-signal, and hierarchical SOCs must develop further to keep pace with design complexity and integration density. The test data bandwidth needs of analog cores are significantly different from those of digital cores; therefore, unified top-level testing of mixed-signal SOCs remains a major challenge. Most SOCs include embedded cores that operate in multiple clock domains. Because the IEEE 1500 standard does not address wrapper design for at-speed testing of such cores, research is needed to develop wrapper design techniques for multifrequency cores. There is also a pressing need for test planning methods that can efficiently schedule tests for these multifrequency cores; the work reported in [Xu 2004] is a promising first step in this direction. Meanwhile, building on emerging RF interconnect technology, integrated wireless test frameworks for heterogeneous nanometer SOC testing are being developed to address future global routing needs and sustain performance improvement [Zhao 2006].

The NOC has become a promising design paradigm for future SOC design, and cost-efficient test methodologies are urgently needed. This chapter has surveyed advances in testing NOC-based systems, covering both core testing and network testing. The focus has been on how to efficiently utilize the on-chip network as a test access mechanism without compromising fault coverage or test time. Case studies have demonstrated some initial efforts in testing real-world NOC designs. Research on NOC testing is still immature relative to industrial needs, because of the limited support for various network topologies, routing strategies, and other factors. Future research and development are needed to provide an integrated platform and a set of methodologies suitable for various networks, so that the design and test cost of the overall NOC-based system (both cores and network) can be reduced. Wrapper design techniques for SOC testing can also be adopted by NOC-based systems.

Exercises

4.1

(SOC Test Infrastructure Design) Consider an embedded core, referred to as C in a core-based SOC. C has 8 functional inputs a[0:7], 11 functional outputs z[0:10], 9 internal scan chains of lengths 12, 12, 8, 8, 8, 6, 6, 6, and 6 flip-flops, respectively, and a scan enable control signal SE. The test wrapper for C is to be connected to a 4-bit TAM. Design the wrapper for this core, and present your results in the form of the following table for each wrapper scan chain n (1 to 4):

 

Wrapper scan chain n
  Internal scan chains: which internal scan chains are included? Provide the number of scan elements.
  Wrapper input cells: how many wrapper input cells are included?
  Wrapper output cells: how many wrapper output cells are included?
  Scan-in length: number of bits
  Scan-out length: number of bits

4.2

(SOC Test Infrastructure Design) Next consider the same embedded core C as in Exercise 4.1. The test wrapper for C is to be connected to two TAMs: a 4-bit TAM and a 6-bit TAM. Design a reconfigurable wrapper for this scenario, and compute the savings in test time (as a percentage) if a 6-bit-wide TAM is used.

4.3

(NOC Infrastructure) Refer to the ITC-2002 SOC benchmarks [ITC 2002], and verify the data illustrated in Table 4.1 using the methodology introduced in Section 4.2.

4.4

(NOC Test Scheduling) Using the system d695 shown in Figure 4.13 and data in Table 4.1, develop a nonpreemptive schedule for cores in system d695. Based on this result, assume all routers are identical and the test time of a router equals the average test time of embedded cores. Calculate the test time for testing all routers.

4.5

(Integrated NOC Test Scheduling for Test Time Reduction) In the previous problem, what method can be used to reduce the test time of all routers in the NOC? What method can be used to reduce the overall test time of cores and routers?

4.6

(NOC Test Using On-Chip Clocking) Now assume an on-chip clocking scheme is used and the operating clock of the NOC during testing is CLK. Also assume each core can be tested using CLK/2, CLK, or CLK*2 under certain constraints. Using the data shown in Table 4.1, design the wrapper architecture for each core in d695, and then extend your scheduling method for embedded cores (without consideration of routers) to incorporate these multiple clocks. Note that cores tested using slow clocks may share a physical channel in a time-multiplexed manner. Observe the test time, and compare it with the result you obtained in Exercise 4.4.

4.7

(NOC Interconnect Test Application) Figure 4.22 shows an example of interleaved unicast MAF testing of an NOC. Assume interconnects between each pair of routers are unidirectional (i.e., the wires for data transmission from router R1 to router R2 are different from those from router R2 to router R1). Find test configurations (paths) that apply the MAF test to all wires without traversing any wire redundantly. Show your answer using a mesh of size M×N, where M = 3, N = 4; M = 4, N = 4; and M = 5, N = 4. Note that M is the number of routers in the x direction, and N is the number of routers in the y direction.

To further simplify the analysis, you can replace each (bidirectional) interconnect in Figure 4.22 with two unidirectional interconnects, and assume there are only two unidirectional interconnects between each pair of routers.

Acknowledgments

The authors wish to thank Erik Jan Marinissen of NXP Semiconductors; Dr. Anuja Sehgal of AMD; Dr. Vikram Iyengar of IBM; Dr. Ming Li of Siemens, Ltd., China; Ke Li of University of Cincinnati; C. Grecu of University of British Columbia; Professor Erika Cota and Alexandre de Morais Amory of UFRGS, Brazil; and Professor Dan Zhao of University of Louisiana for their help and contribution during the preparation of this chapter. Thanks are also due to Professor Partha Pande of Washington State University; Dr. Ming Li of Siemens, Ltd., China; Professor Erika Cota of UFRGS, Brazil; Professor Tomokazu Yoneda of Nara Institute of Science and Technology, Japan; and Professor Erik Larsson of Linköping University, Sweden, for reviewing the text and providing valuable comments.

References

Books

Introduction

System-on-Chip (SOC) Testing

Network-on-Chip (NOC) Testing

Design and Test Practice: Case Studies

Concluding Remarks
