Reliability Aware Performance and Power Optimization in DVFS-Based On-Chip Networks

Aditya Yanamandra (The Pennsylvania State University, USA), Soumya Eachempati (The Pennsylvania State University, USA), Vijaykrishnan Narayanan (The Pennsylvania State University, USA) and Mary Jane Irwin (The Pennsylvania State University, USA)
DOI: 10.4018/978-1-61520-807-4.ch011


Recently, chip multiprocessors (CMPs) have emerged as a way to fully utilize increasing transistor counts within stringent power budgets. Transistor scaling has also led to more error-prone and defective components. Static and run-time variations in the circuit reduce yield and reliability. Providing reliability at low overhead, specifically in terms of power, is a challenging task that requires innovative solutions for building future integrated chips. Static variations have been studied previously; in this chapter, we study the impact of run-time variations on reliability. The on-chip interconnection network, which forms the communication fabric of the CMP, plays a crucial role in determining the performance, power consumption and reliability of the system. We focus on protecting data in a network on chip (NoC) from transient errors induced by voltage fluctuations. Variations in operating conditions cause significant variation in the reliability of the system, motivating the need for tunable levels of data protection. For example, the Dynamic Voltage and Frequency Scaling (DVFS) technique used in most CMPs today causes supply voltages to differ across the chip, giving rise to variable error rates across the chip. We investigate the design of a dynamically reconfigurable error protection scheme in a NoC that achieves a desired level of reliability while minimizing the power and performance overhead incurred. With the proposed reconfigurable error-correcting code (ECC), we obtain up to 55% savings in the power expended on error protection in the network while maintaining constant reliability. Further, we observe a 35% reduction in the average message latency in the network, making a case for tunable error protection in the on-chip network fabric.
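The core idea of tunable protection, picking the cheapest code that still meets a fixed reliability target as the local error rate changes, can be illustrated with a minimal sketch. It assumes independent bit flips (a binomial error model), a hypothetical menu of codes described only by how many bit errors per flit each corrects (`ECC_SCHEMES`), and a hypothetical voltage-to-bit-error-rate table (`VOLTAGE_TO_BER`) whose numbers are placeholders, not measurements. The function names `p_uncorrectable` and `select_ecc` are illustrative; this is not the chapter's actual scheme or its error model.

```python
from math import comb

# Hypothetical ECC menu ordered from cheapest to strongest:
# (name, bit errors correctable per flit). Illustrative, not from the chapter.
ECC_SCHEMES = [("none", 0), ("SEC", 1), ("DEC", 2), ("TEC", 3)]

# Assumed bit error rate at each DVFS supply voltage (volts -> BER).
# Placeholder numbers chosen only to show the trend: lower voltage, more errors.
VOLTAGE_TO_BER = {1.0: 1e-12, 0.9: 1e-10, 0.8: 1e-8, 0.7: 1e-6}


def p_uncorrectable(ber: float, flit_bits: int, t: int) -> float:
    """Probability that a flit has more than t bit errors, assuming
    independent bit flips (binomial model). The tail is summed directly
    to avoid the cancellation in 1 - P(at most t errors)."""
    return sum(
        comb(flit_bits, k) * ber**k * (1.0 - ber) ** (flit_bits - k)
        for k in range(t + 1, flit_bits + 1)
    )


def select_ecc(ber: float, flit_bits: int, target: float) -> tuple[str, int]:
    """Return the cheapest scheme whose per-flit probability of an
    uncorrectable error does not exceed the reliability target."""
    for name, t in ECC_SCHEMES:
        if p_uncorrectable(ber, flit_bits, t) <= target:
            return name, t
    raise ValueError("no available ECC scheme meets the reliability target")


if __name__ == "__main__":
    # Same reliability target everywhere; protection strength follows voltage.
    for volts, ber in VOLTAGE_TO_BER.items():
        name, _ = select_ecc(ber, flit_bits=128, target=1e-9)
        print(f"{volts:.1f} V -> {name}")
```

With a 128-bit flit and a per-flit target of 1e-9, this sketch leaves high-voltage regions unprotected and moves to stronger codes only as the assumed error rate rises, which is the kind of power saving a fixed worst-case code cannot offer.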
Chapter Preview


Advancements in semiconductor technology have led to ever-smaller transistor feature sizes. This has led to a dramatic increase in the number of transistors available on a modern chip. To take advantage of this ever-increasing transistor budget, there has been a paradigm shift towards placing multiple processors on a chip [Nayfeh, 1999]. Chip multiprocessors (CMPs) are now found in embedded markets, mainstream laptop and desktop computers, and high-end servers. Networks on chip (NoCs) have been proposed as a solution to the worsening global wire delay problem in newer technology generations and as a way to scale to larger numbers of cores. Commercially designed NoC topologies currently include the ring (Intel's Larrabee), the mesh (Tilera) and clustered networks. The on-chip interconnect has become an important focus of research as CMPs and systems-on-chip scale to hundreds of cores. The on-chip network fabric plays a vital role in the performance and power consumption of such a system. The significance of the underlying communication architecture has long been understood in multiprocessor design. In CMPs, however, the small transistor size and high transistor density make power consumption a first-order design metric, whereas pin bandwidth is the major limiting factor in off-chip networks. With technology scaling, NoC power is predicted to become a significant part of chip power, reaching 40 to 60 watts for a mesh-based network with 128 nodes [Borkar, 2007]. Some commercial designs already reflect this trend, devoting up to 28% of the entire chip power to the interconnect [Hoskote, Vangal, Singh, & Borkar, 2007]. Thus, designing on-chip interconnects that optimize both performance and power poses intriguing research challenges.
This is evident from the large body of literature covering multiple facets of NoC design [Kim, Davis, Oskin, & Austin, 2008; Kim, Dally, Scott, & Abts, 2008; Muralimanohar & Balasubramonian, 2007].

Reliability has been identified as one of the key limiters to future transistor scaling. Process variation, wear-out and transient errors are the major contributors to faults in chips. Process variation poses a major yield challenge for the semiconductor industry. In addition, dynamic variations caused by non-uniform workload activity across the chip can also lead to errors. The on-chip network, which is the basic medium of communication among on-chip components, is a distributed resource and will therefore play a major role in determining overall system reliability. The impact of process and temperature variations on the on-chip interconnect has been studied in [Bin, Peh & Patra, 2008; Nicopoulos, Yanamandra, Srinivasan, Narayanan & Irwin, 2007].

Recent work by Kim et al. [Kim, W., Gupta, M. S., Wei, G.-Y. & Brooks, D., 2008] demonstrates the use of on-chip voltage regulators for per-core Dynamic Voltage and Frequency Scaling (DVFS) in a CMP. With such technologies, DVFS could become an integral part of the on-chip network as well. DVFS can be applied to routers both to save power and to control congestion, i.e., to increase system throughput. Under congestion, techniques such as throttling the upstream routers and/or increasing the frequency of the congested router can help relieve the congestion. Supporting such a variable-frequency system would require micro-architectural changes, such as using dual-clock I/O buffers as the router input buffers.
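The second congestion-control option above, raising the frequency of a congested router, can be sketched as a simple per-router controller. This is a minimal illustration under stated assumptions: the discrete frequency levels (`FREQ_LEVELS_GHZ`), the buffer-occupancy watermarks, and the functions `next_level` and `run` are all hypothetical, and the sketch models neither upstream throttling nor the dual-clock buffering. Two watermarks instead of one threshold give hysteresis, so a router whose occupancy hovers near a single cutoff does not oscillate between levels every interval.

```python
# Hypothetical discrete DVFS operating points for one router, in GHz.
FREQ_LEVELS_GHZ = [1.0, 1.5, 2.0]

HIGH_WATERMARK = 0.75  # assumed: step up when input buffers are >75% full
LOW_WATERMARK = 0.25   # assumed: step down when input buffers are <25% full


def next_level(level: int, occupancy: float) -> int:
    """Choose the router's next frequency-level index from its average
    input-buffer occupancy (0.0 to 1.0) over the last interval."""
    if occupancy > HIGH_WATERMARK and level < len(FREQ_LEVELS_GHZ) - 1:
        return level + 1  # congested: run faster to drain buffers
    if occupancy < LOW_WATERMARK and level > 0:
        return level - 1  # lightly loaded: slow down to save power
    return level  # between watermarks: hold the current level


def run(occupancies: list[float], start_level: int = 0) -> list[float]:
    """Apply the controller to a trace of per-interval occupancies and
    return the frequency chosen after each interval."""
    level = start_level
    trace = []
    for occ in occupancies:
        level = next_level(level, occ)
        trace.append(FREQ_LEVELS_GHZ[level])
    return trace


if __name__ == "__main__":
    print(run([0.9, 0.9, 0.9, 0.5, 0.1, 0.1]))
```

Raising a router's frequency generally also means raising its supply voltage, so in a reliability-aware design each change of operating point would also shift the local error rate that the data protection has to cover.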
