Optimizing Fault Tolerance for Multi-Processor System-on-Chip

Dimitar Nikolov, Mikael Väyrynen, Urban Ingelsson, Virendra Singh, Erik Larsson
Copyright: © 2011 | Pages: 26
ISBN13: 9781609602123 | ISBN10: 1609602129 | EISBN13: 9781609602147
DOI: 10.4018/978-1-60960-212-3.ch003
MLA

Nikolov, Dimitar, et al. "Optimizing Fault Tolerance for Multi-Processor System-on-Chip." Design and Test Technology for Dependable Systems-on-Chip, edited by Raimund Ubar, et al., IGI Global, 2011, pp. 66-91. https://doi.org/10.4018/978-1-60960-212-3.ch003

APA

Nikolov, D., Väyrynen, M., Ingelsson, U., Singh, V., & Larsson, E. (2011). Optimizing Fault Tolerance for Multi-Processor System-on-Chip. In R. Ubar, J. Raik, & H. Vierhaus (Eds.), Design and Test Technology for Dependable Systems-on-Chip (pp. 66-91). IGI Global. https://doi.org/10.4018/978-1-60960-212-3.ch003

Chicago

Nikolov, Dimitar, et al. "Optimizing Fault Tolerance for Multi-Processor System-on-Chip." In Design and Test Technology for Dependable Systems-on-Chip, edited by Raimund Ubar, Jaan Raik, and Heinrich Theodor Vierhaus, 66-91. Hershey, PA: IGI Global, 2011. https://doi.org/10.4018/978-1-60960-212-3.ch003

Abstract

Rapid development in semiconductor technologies makes it possible to manufacture integrated circuits (ICs) with multiple processors, so-called Multi-Processor Systems-on-Chip (MPSoCs). At the same time, ICs manufactured in recent semiconductor technologies are becoming increasingly susceptible to transient faults, which makes fault tolerance necessary. Work on fault tolerance has mainly focused on safety-critical applications; however, as semiconductor technologies develop, fault tolerance is also needed for general-purpose systems. Unlike safety-critical systems, where meeting hard deadlines is the main requirement, general-purpose systems benefit more from minimizing the average execution time (AET). The contribution of this chapter is two-fold. First, the authors present a mathematical framework for the analysis of AET. The analysis covers voting, rollback recovery with checkpointing (RRC), and the combination of RRC and voting (CRV): for a given job and soft (transient) error probability, the authors define mathematical formulas for each fault-tolerant technique with the objective of minimizing AET while taking bus communication overhead into account. In addition, for a given number of processors and jobs, the authors define integer linear programming models that minimize AET, including communication overhead. Second, because the error probability is not known at design time and can change during operation, they present two techniques, periodic probability estimation (PPE) and aperiodic probability estimation (APE), to estimate the error probability and adjust the fault-tolerant scheme while the IC is in operation.
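
To make the trade-off behind RRC concrete, the following is a minimal Python sketch under a simplified model that is assumed here for illustration and is not taken from the chapter. The assumptions: a job with fault-free execution time T runs on two processors and is split into n equal segments; after each segment the two processor states are compared at a checkpoint costing tau time units (standing in for bus communication overhead); a single processor completes the whole job without error with probability p_job, so both complete one segment without error with probability p_job^(2/n); and a segment is re-executed until the comparison succeeds. The function names and parameters are hypothetical.

```python
def aet_rrc(T, tau, p_job, n):
    """Expected execution time with n checkpoints under the assumed model.

    T     : fault-free execution time of the whole job
    tau   : overhead paid at every checkpoint (assumed constant)
    p_job : probability that one processor runs the whole job error-free
    n     : number of equal-length segments (one checkpoint per segment)
    """
    # Probability that both processors finish a single segment error-free.
    seg_ok = p_job ** (2.0 / n)
    # Each segment costs T/n + tau per attempt and needs 1/seg_ok attempts
    # on average (geometric retries); summing over n segments gives:
    return (T + n * tau) / seg_ok


def best_checkpoints(T, tau, p_job, n_max=1000):
    """Brute-force search for the n in 1..n_max with the smallest expected time."""
    return min(range(1, n_max + 1), key=lambda n: aet_rrc(T, tau, p_job, n))


if __name__ == "__main__":
    T, tau, p_job = 100.0, 5.0, 0.25  # illustrative values, not from the chapter
    n_opt = best_checkpoints(T, tau, p_job)
    print(n_opt, aet_rrc(T, tau, p_job, n_opt))
```

In this sketch, too few checkpoints make every retry expensive, while too many let the per-checkpoint overhead dominate, so the expected time has a minimum at an intermediate n. A brute-force search is used rather than a closed-form optimum to keep the example independent of any particular derivation.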
