The demand for embedded systems in mission-critical applications is increasing drastically, and we are completely dependent on the reliable functioning of these systems. A failure in an automotive, aerospace or nuclear application might result in serious problems with consequences for life, health, ecology or expensive technology. The growing complexity and shrinking feature sizes of microelectronic circuits, the basis of embedded systems, have made them susceptible to new error sources such as radiation, electromagnetic interference, temperature variation and noise. These influences now more than ever have to be taken into account when designing reliable and dependable embedded systems.
The almost exponential growth of microelectronic circuits and systems in complexity and performance over the last 40 years is a unique success story in the history of technology in general. While this fact is well perceived by the public, little attention is usually paid to the history of electronic design automation (EDA) - the art of designing electronic circuits and systems with specific software. Likewise, the art of testing large-scale integrated circuits and systems before system assembly and shipment to customers is not well known to the public, though it is a critical “enabling technology” with respect to the quality and the dependability of circuits and systems.
The first challenge for the emerging EDA science and technology, which was soon followed by the foundation of EDA companies from the early 1980s onwards, was the correctness of design for an integrated circuit. This challenge seemed to be met in the early 1990s, when reports became known of automatically designed ICs working well after their first production run. The second challenge was design for testability, because all too soon ICs with a complexity of about 10 000 transistors proved to be difficult or even impossible to test. From about 1990, the EDA industry therefore gave special attention to design for testability and test-supporting EDA tools.
More recently, five trends have exerted a massive influence on the further direction of integrated systems design and implementation.
First, physical design, which had been well understood in the 1970s and 1980s, became a problem again, since shrinking feature sizes and reduced voltage levels promoted several parasitic effects such as signal line delay, capacitive coupling, and voltage drops on supply networks. Such problems became dominant with the arrival of nano-electronics at minimum feature sizes of 50 nanometers and below.
Second, large-scale integrated systems were increasingly built around “embedded” processors, with key functions implemented in software. System design thus eventually became a complex task comprising mixed hardware and software design, verification and validation, also including analog and mixed-signal circuits.
Third, large-scale systems would often be composed of pre-designed building blocks imported from external sources as “components off the shelf” (COTS), often with external intellectual property rights attached. Typically, such blocks, often embedded processor cores, would not even be known to the system designer in all their details, making a design with proven correctness very difficult at best.
Fourth, hardware is becoming soft, due to architectures such as field-programmable gate arrays (FPGAs). A basic piece of hardware is programmed by the system designer to fit the application at hand. Even re-programming an FPGA-based system in the field is an option that is gaining acceptance and usage. FPGA-based systems have long since begun to replace application-specific ICs in low-volume applications. The essential bottleneck is the validation of correctness down to the physical design, since many details of the FPGA are systematically hidden from the designer.
Fifth and worst of all, the underlying hardware itself became a problem. ICs fabricated in traditional technologies with feature sizes above 300 nanometers used to have a very long reliable lifetime once they had passed production tests successfully, and their operation would be rather stable. This has changed dramatically with IC technologies at feature sizes in the range of 100 nanometers and below. One set of problems is associated with transient fault effects, which, for unavoidable physical reasons, can and will harm even correctly designed circuits and systems. Systems therefore have to be designed to be fault-tolerant. Even worse, ICs with feature sizes of 50 nanometers and below seem to exhibit gradual parameter degradation and eventual failure due to wear-out.
Essentially, electronic systems of today and tomorrow will have to be designed to work in a dependable manner for a specific period of time, based on unreliable basic elements.
The book covers aspects of system design and efficient modelling, and also introduces various fault models and fault mechanisms associated with digital circuits integrated into a System-on-Chip (SoC), Multi-Processor System-on-Chip (MPSoC) or Network-on-Chip (NoC). Finally, the book gives an insight into refined “classical” design and test topics and solutions of IC test technology and fault-tolerant systems development, targeted at applications in safety-critical as well as general-purpose systems. Aspects of pure software design, test and verification, and the special problems of analogue and mixed-signal design and test, however, are beyond the scope of this book.
As the primary audience of the book we see practitioners and researchers in the area of SoC design and testing. They can become acquainted with the state of the art and with details of existing mainstream research achievements, including open questions, in the field of design and test technology for dependable embedded systems. As the secondary prospective audience we see undergraduate and graduate students with a basic understanding of electronics and computer engineering, who can become familiar with the problems and solutions associated with the basic tasks of designing dependable systems from not-so-reliable basic components.
We hope that readers will gain a good insight into design and test technology that can yield dependable and fault-tolerant embedded systems.
Organization of the book
The book is organized into 5 sections and 22 chapters. Each chapter is written by a different team of authors, experts in the specific topic; each chapter is therefore unique in this book.
Section 1 is targeted at digital system design problems, the mathematical background, and advanced approaches to digital system modelling at different levels of abstraction, formal verification and debugging. Section 2 describes different types of faults (transient and permanent) and repair technologies and techniques for logic structures, memories and interconnects. Section 3 is aimed at fault simulation in digital systems at different levels of abstraction, and at fault injection for analysing the dependability of electronic system designs. Section 4 is concerned with test technology for SoCs as well as test techniques for timing, low-power and thermal parameters. The last section is targeted at reducing test length and cost by suitable test planning and by using efficient test compression and compaction techniques for SoCs.
The complexity and communication requirements of SoCs are increasing, making the design of a fault-free system a very difficult task. The Network-on-Chip (NoC) has been proposed as one alternative for solving some of the on-chip communication problems and for addressing dependability at various levels of abstraction. Chapter 1 presents system-level design techniques for NoC-based systems. The NoC architecture has been utilized to address the on-chip communication problems of complex SoCs; it can also be used to deal with faults, as it exhibits a natural redundancy. The chapter presents an interesting system-level design framework to explore the large and complex design space of dependable NoC-based systems.
Chapter 2 deals with the design and optimization of embedded applications with soft and hard real-time processes. The hard processes must always complete on time, while a soft process may complete after its deadline, its completion time being associated with a quality of service (QoS). Deadlines for the hard processes must be guaranteed even in the presence of transient and intermittent faults, and the QoS should be maximized. The chapter presents a novel quasi-static scheduling strategy, where a set of schedules is synthesized off-line and, at run time, the scheduler selects the appropriate schedule based on the occurrence of faults and the actual execution times of processes.
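The off-line/on-line split of quasi-static scheduling can be sketched in a few lines; the process names, schedules and fault scenarios below are purely illustrative assumptions, not taken from the chapter:

```python
# Off-line phase: one schedule is synthesized per anticipated fault
# scenario (hypothetical processes P1..P3; P3 is the soft process).
schedules = {
    (): ["P1", "P2", "P3"],       # fault-free: soft process P3 runs too
    ("P1",): ["P1", "P1", "P2"],  # fault in P1: re-execute it, drop P3
    ("P2",): ["P1", "P2", "P2"],  # fault in P2: re-execute it, drop P3
}

def select_schedule(observed_faults):
    """On-line phase: pick the precomputed schedule that matches the
    observed fault occurrences; fall back to the fault-free schedule
    for scenarios that were not synthesized off-line."""
    return schedules.get(tuple(observed_faults), schedules[()])
```

The point of the quasi-static approach is that the expensive schedule synthesis happens off-line, so the run-time step reduces to a table lookup.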
The rapid development of semiconductor technologies makes it possible to manufacture ICs with multiple processors, so-called Multi-Processor Systems-on-Chip (MPSoC). Chapter 3 deals with the fault-tolerant design of MPSoCs for general-purpose applications, where the main concern is to reduce the average execution time (AET). It presents a mathematical framework for the analysis of AET and an integer linear programming model for minimizing AET that also takes communication overhead into account. It also describes an interesting approach to estimating the error probability and adjusting the fault-tolerance scheme dynamically during the operation of an MPSoC.
To cope with the complexity of today’s digital systems in test generation, fault simulation and fault diagnosis, hierarchical multi-level formal approaches should be used. Chapter 4 presents a unified diagnostic modelling technique based on Decision Diagrams (DD), which can be used to capture a digital system design at different levels of abstraction. Two new types of DDs, the logic-level structurally synthesized binary DDs (SSBDD) and the high-level DDs (HLDD), are defined together with several subclasses. Methods for the formal synthesis of both types of DDs are described, and it is shown how the DDs can be used in a design environment for dependable systems.
Chapter 5 deals with techniques for formal hardware verification. An enhanced formal verification flow integrating debugging and coverage analysis is presented. In this flow, a debugging tool locates the source of a failure by analyzing the discrepancy between the property and the circuit behavior. A technique for analyzing the functional coverage of the proven Bounded Model Checking properties is then used to determine whether the property set is complete and, if it is not, to return the coverage gaps. The technique can be used to ensure the correctness of a design, which consequently facilitates the development of dependable systems.
Transient faults have become an increasing issue in the past few years, as the smaller geometries of newer, highly miniaturized silicon manufacturing technologies brought to the mass market failure mechanisms traditionally confined to niche markets such as electronic equipment for avionics, space or nuclear applications. Chapter 6 presents and discusses the origin of transient faults, fault propagation mechanisms, and the state-of-the-art design techniques that can be used to detect and correct transient faults. The concepts of hardware, data and time redundancy are presented, and their implementations to cope with transient faults affecting storage elements, combinational logic and IP cores (e.g., processor cores) typically found in a System-on-Chip are discussed.
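As a minimal illustration of the hardware-redundancy concept, a triple modular redundancy (TMR) majority voter can be sketched as follows; the bit width and fault pattern are arbitrary examples, not a scheme from the chapter:

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority voter for triple modular redundancy (TMR):
    each output bit takes the value held by at least two replicas."""
    return (a & b) | (a & c) | (b & c)

# A transient fault (single-event upset) flips one bit in one replica;
# the majority vote masks it as long as the other two copies agree.
golden = 0b1011
upset = golden ^ 0b0100  # bit flip in one copy only
assert tmr_vote(golden, golden, upset) == golden
```

The same voting expression is what a hardware TMR voter computes per output bit; in silicon it costs roughly a tripling of area plus the voter itself.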
Memories are very dense structures, and therefore the probability of defects is higher than in logic and analogue blocks, which are not so densely laid out. Embedded memories are the largest components of a typical SoC and thus dominate the yield and reliability of the chip. Chapter 7 gives a summary of static and dynamic fault models, effective test algorithms for memory fault (defect) detection and localization, built-in self-test, and a classification of advanced built-in self-repair techniques supported by different types of repair allocation algorithms.
Chapter 8 deals with the problem of how to design fault-tolerant or fail-safe systems in programmable hardware (FPGAs) for use in mission-critical applications. RAM-based FPGAs are usually considered unreliable, due to the high probability of transient faults, and therefore inapplicable in this area. But FPGAs can easily be reconfigured if an error is detected. It is shown how to utilize an appropriate type of FPGA reconfiguration and combine it with fail-safe and fault-tolerant design. The main property and advantage of the presented methodology is the trade-off between the requested level of dependability of a designed system and the area overhead with respect to possible FPGA faults.
The reliability of interconnects on ICs has become a major problem in recent years, due to rising complexity, low-k insulating materials with reduced stability, and wear-out effects caused by high current density. The total reliability of a system on a chip depends more and more on the reliability of its interconnects. Chapter 9 presents an overview of the state of the art in fault-tolerant interconnects. Most published techniques are aimed at the correction of transient faults; built-in self-repair has not been discussed as much. This chapter fills that gap by discussing how to use built-in self-repair in combination with other established solutions to achieve fault tolerance with respect to all kinds of faults.
For several years, it has been predicted that nano-scale ICs will have a rising sensitivity to both transient and permanent fault effects. Most of the effort has so far gone into the detection and compensation of transient fault effects. More recently, the possibility of repairing permanent faults, due either to production flaws or to wear-out effects, has also received considerable attention. While built-in self-test (BIST) and even built-in self-repair (BISR) for regular structures such as static memories (SRAMs) are well understood, concepts for the in-system repair of irregular logic and interconnects are few and mainly based on FPGAs as the underlying implementation. Chapter 10 describes and analyzes different schemes of logic (self-) repair that are not based on FPGAs, with respect to their cost and limitations. It can be shown that such schemes are feasible, but they need a lot of attention in terms of hidden single points of failure.
Statically scheduled superscalar processors (e.g. very long instruction word, VLIW, processors) are characterized by multiple parallel execution units and small control logic. This makes them easily scalable and therefore attractive for use in embedded systems as application-specific processors. Chapter 11 deals with the fault tolerance of VLIW processor architectures. If one or more components in the data path of a processor become permanently faulty, it becomes necessary to reconfigure either the hardware or the executed program such that operations are scheduled around the faulty units. The reconfiguration of the program is done either dynamically by the hardware or permanently by self-modifying code. In both cases a delay may occur during the execution of the application, and this graceful performance degradation may become critical for real-time applications. A framework to overcome this problem by using scalable algorithms is provided.
Simulation of faults has two important areas of application: on the one hand, fault simulation is used for the validation of test patterns; on the other hand, simulation-based fault injection is used for the dependability assessment of systems. Chapter 12 describes the simulation of faults in electronic systems using SystemC. Two areas of operation are targeted: fault simulation for detecting fabrication faults, and fault injection for analyzing the dependability of electronic system designs for safety-critical applications under fault conditions. The chapter discusses the possibilities of using SystemC to simulate such designs and presents state-of-the-art applications for this purpose. It is shown how simulation with fault models can be implemented by several injection techniques, approaches that help to speed up simulations are presented, and some practical simulation environments are shown.
Chapter 13 deals with high-level fault simulation for design error diagnosis. High-level decision diagrams (HLDD) are used for high-level fault reasoning, which allows efficient algorithms for locating design errors to be implemented. HLDDs can also be used efficiently to determine the critical sets of soft errors to be injected when evaluating the dependability of systems. A holistic diagnosis approach is presented, based on high-level critical path tracing, for design error location and for generating the critical fault lists used to assess a design’s vulnerability to soft errors by means of fault injection.
Chapter 14 is devoted to logic-level fault simulation. A new approach based on exact critical path tracing is presented. To speed up backtracing, the circuit is represented as a network of subcircuits modeled with structurally synthesized BDDs, which compress the gate-level structural details. The method can be used for simulating permanent faults in combinational circuits, and transient or intermittent faults in both combinational and sequential circuits, with the goal of selecting critical faults for fault injection for dependability analysis purposes.
In recent years, the use of embedded microprocessors in complex SoCs has become common practice. Testing them is often a challenging task, due to their complexity and to the strict constraints imposed by the environment and the application. Chapter 15 focuses on the test of microprocessors and microcontrollers within a SoC. These modules often come from third parties, and the SoC designer is often not allowed to know the internal details of the module, nor to change or redesign it for test purposes. For this reason, an emerging solution for processor testing within a SoC is based on developing suitable test programs. This test technique, known as Software-Based Self-Test, is introduced, and the main approaches for test program generation and application are discussed.
Testing complex SoCs with up to billions of transistors has been a challenge to IC test technology for more than a decade. Most research work has focused on problems of production testing, while the problem of self-test in the field of application has received much less attention. Chapter 16 addresses this issue, describing a hierarchical HW/SW-based self-test solution built around a test processor that orchestrates the test activities and controls the test of the different modules within the SoC.
SoC devices are among the most advanced devices currently manufactured; consequently, their test must take into consideration some crucial issues that can often be neglected in other devices manufactured with more mature technologies. One of these issues relates to delay faults: we are forced to check not only whether the functionality of SoCs is still guaranteed, but also whether they are able to work correctly at the maximum frequency they have been designed for. New semiconductor technologies tend to introduce new kinds of faults that cannot be detected unless the test is performed at speed and specifically targets these kinds of faults. Chapter 17 focuses on delay faults: it provides an overview of the most important fault models introduced so far, as well as a presentation of the key techniques for detecting them.
Another increasingly important issue in SoC testing is power consumption, which is becoming critical not only for low-power devices. In general, a test tends to excite the device under test as much as possible; unfortunately, this normally results in higher-than-usual switching activity, which is strictly correlated with power consumption. Therefore, test procedures may consume more power than the device is designed for, creating severe problems in terms of reliability and device lifetime. Chapter 18 deals with power issues during test, clarifying where the problem comes from and which techniques can be used to circumvent it.
The high degree of integration of SoC devices, combined with the already mentioned power consumption, may raise issues in terms of the temperature of the different parts of the device. In general, problems stemming from the fact that some part of the circuit reaches a critical temperature during the test can be solved by letting this part cool before the test is resumed, but this obviously conflicts with the common goal of minimizing test time. Chapter 19 discusses thermal issues during test and proposes solutions to minimize their impact by identifying optimal strategies for fulfilling thermal constraints while still minimizing test time.
Test-data volume and test execution time are both costly commodities. To reduce the cost of test, previous studies have used test-data compression techniques at the system level to reduce the test-data volume, or employed test architecture design for module-based SoCs to enable test schedules with low test execution time; test-data compression for non-modular SoCs and test planning for modular SoCs have thus been proposed separately, and research on combining the two approaches is lacking. Chapter 20 studies how core-level test-data compression can be combined with test architecture design and test planning to reduce test cost.
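Run-length encoding is one of the simplest code-based test-data compression schemes; the toy encoder below (an illustrative sketch, not the specific scheme of the chapter) shows the basic idea of trading tester storage for on-chip decompression work:

```python
def rle_compress(bits):
    """Run-length encode a test-data bit string: each maximal run of
    identical symbols becomes one (symbol, length) pair."""
    runs = []
    i = 0
    while i < len(bits):
        j = i
        while j < len(bits) and bits[j] == bits[i]:
            j += 1               # extend the current run
        runs.append((bits[i], j - i))
        i = j
    return runs

def rle_decompress(runs):
    """Expand (symbol, length) pairs back into the original bit string,
    as an on-chip decompressor would."""
    return "".join(ch * n for ch, n in runs)

vector = "0000001111100000"
assert rle_decompress(rle_compress(vector)) == vector
```

Test cubes with long runs of identical (or don't-care) bits compress well under such codes, which is why filling strategies and compression are usually co-designed.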
Chapter 21 addresses the bandwidth problem between the external tester and the device under test (DUT). While the previous chapter assumes deterministic tests, this chapter suggests combining deterministic patterns stored on the external tester with pseudorandom patterns generated on chip. The chapter details ad-hoc compression techniques for deterministic tests and a mixed-mode approach that combines deterministic test vectors with pseudorandom test vectors generated by on-chip automata.
Chapter 22 continues along the lines of the previous chapter and discusses embedded self-test. Instead of transporting test data to the DUT, the approach in this chapter is to use a fully embedded test solution, where the test data is generated by on-chip linear feedback shift registers (LFSRs). While LFSRs are usually considered to deliver lower-quality tests than deterministic ATPG tests, the chapter demonstrates that the test quality can be made high by careful planning of LFSR re-seeding.
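A Fibonacci LFSR and the effect of re-seeding can be sketched as follows; the register width, tap positions and seeds are arbitrary illustrative choices, not values from the chapter:

```python
def lfsr_patterns(seed, taps, width, count):
    """Generate `count` pseudorandom test patterns from a Fibonacci
    LFSR: the feedback bit is the XOR of the tapped state bits, and
    the register shifts left by one position each cycle."""
    assert seed != 0, "an all-zero state would lock the LFSR"
    state, patterns = seed, []
    for _ in range(count):
        patterns.append(state)
        feedback = 0
        for t in taps:
            feedback ^= (state >> t) & 1
        state = ((state << 1) | feedback) & ((1 << width) - 1)
    return patterns

# Re-seeding: loading a new seed steers the same hardware through a
# different segment of the state sequence, so seeds can be chosen such
# that the runs also cover hard-to-detect deterministic patterns.
run_a = lfsr_patterns(0b1001, taps=[3, 2], width=4, count=5)
run_b = lfsr_patterns(0b0111, taps=[3, 2], width=4, count=5)
```

Only the seeds need to be stored or transported, which is why re-seeding drastically reduces test-data volume compared with storing full deterministic pattern sets.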
Concluding this overview of the book’s content, we hope that this collection of presentations of hot problems in the field of fault tolerance and test of digital systems can serve as a source of fresh ideas and inspiration for engineers, scientists, teachers and doctoral students. Engineers and scientists in the application fields of test and system reliability may find useful ideas and techniques that can help them solve related problems in industry. Educators may find discussions of novel ideas and examples from industrial contexts that could be useful for updating curricula in the field of dependable microelectronics design and test.
Tallinn - Cottbus