Dynamic Thermal Management for Multi-/Many-Core Systems

Yang Ge, Qinru Qiu
Copyright: © 2012 |Pages: 24
DOI: 10.4018/978-1-4666-1842-8.ch004

Abstract

High chip complexity and power consumption raise chip temperature, reduce lifetime, affect reliability, and increase cooling cost. Dynamic Thermal Management (DTM) techniques are designed to control the chip temperature and tackle thermal-related issues. In this chapter, the authors introduce the working principles and implementation details of some state-of-the-art DTM techniques, in order to boost thermal awareness in the green computing community. They first give the motivation for dynamic thermal management and divide existing DTM approaches into categories based on their characteristics. They then discuss the detailed design and implementation issues of these techniques. Finally, the authors share future research directions in this area.
Chapter Preview

1. Introduction

Moore’s law states that the number of transistors on a chip doubles roughly every two years. As chip feature sizes continue to shrink and more performance is extracted at the cost of higher power consumption, the ever-increasing chip complexity and power density elevate the peak temperatures of the chip and worsen its thermal gradients. Raised peak temperatures reduce the lifetime of the chip, degrade its performance, affect its reliability, and increase the cooling cost (Skadron, Stan, Sankaranarayanan, Huang, Velusamy, & Tarjan, 2004). The positive feedback between leakage power and raised temperature creates the potential for thermal runaway. When applications are mapped onto a multi- or many-core system, their diverse workloads may lead to power and temperature imbalance among different cores. Such temporal and spatial variation in temperature creates local temperature maxima on the chip, called hotspots (Donald & Martonosi, 2006). An excessive spatial temperature variation, also referred to as the thermal gradient, increases clock skew and decreases performance and reliability. Elevated temperatures also require more cooling effort; to cool down the processor, a typical cooling fan can consume up to 51% of a server's power budget (Lefurgy, Rajamani, Rawson, Felter, Kistler, & Keller, 2003; Ayoub, Sharifi, & Rosing, 2010).

Dynamic Thermal Management (DTM) techniques are designed to tackle the aforementioned problems and control the chip temperature as well as power consumption. As long as the temperature is regulated, system reliability can be improved significantly. It has been pointed out that a moderate temperature reduction of 10 °C to 15 °C can extend the lifespan of an electronic device by a factor of two (Kursun, Cher, Buyuktosunoglu, & Bose, 2006), and that a 10 °C decrease in the magnitude of thermal cycles can yield a 16-fold increase in mean time to failure for metallic structures. Leakage power also drops significantly as temperature falls: for every 9 °C reduction in temperature, there is a 50% reduction in leakage power (Liu, Dick, Shang, & Yang, 2007). This reduction is particularly important for future System-on-Chip designs, because leakage power is estimated to account for more than 50% of total chip power consumption (Semiconductor Industry Association, 2001). Regulated temperature not only safeguards system reliability and reduces leakage power consumption, but also boosts performance, because transistors switch faster at lower temperatures (Pamula & Chakrabarty, 2003). A balanced spatial temperature gradient can also noticeably mitigate the clock skew problem.
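
To make the leakage rule of thumb concrete, the short sketch below estimates how leakage power scales with a given temperature reduction. It assumes that the "50% per 9 °C" figure quoted above compounds exponentially over larger temperature drops, which is an interpretation of the rule rather than a model taken from the cited work. The function name `leakage_scale` and its parameters are hypothetical and serve only as an illustration.

```python
def leakage_scale(delta_t_c: float, halving_step_c: float = 9.0) -> float:
    """Return the factor by which leakage power is multiplied after
    cooling the chip by delta_t_c degrees Celsius.

    Assumption (not from the chapter): the quoted 50% reduction per
    9 degrees C compounds exponentially, so the factor is
    0.5 ** (delta_t_c / halving_step_c).
    """
    return 0.5 ** (delta_t_c / halving_step_c)


if __name__ == "__main__":
    # Illustrative temperature drops; the values are arbitrary examples.
    for drop in (9.0, 13.5, 18.0):
        factor = leakage_scale(drop)
        print(f"{drop:4.1f} C cooler: leakage power scaled by {factor:.3f}")
```

Under this assumption, a 9 °C reduction halves leakage power and an 18 °C reduction cuts it to one quarter. Because temperature in turn depends on total power, a real estimate would have to solve for both quantities together.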

The goal of this book chapter is to give readers a thorough understanding of the working principles and implementation details of some state-of-the-art DTM techniques, and to boost thermal awareness in the green computing community. We first give the motivation for dynamic thermal management, then present a detailed survey of existing DTM approaches, with discussion of common design and implementation issues, and finally share our view of future research directions in this area.
