Improving Energy-Efficiency of Scientific Computing Clusters

Tapio Niemi (Helsinki Institute of Physics, Finland), Jukka Kommeri (Helsinki Institute of Physics, Finland) and Ari-Pekka Hameri (University of Lausanne, Switzerland)
DOI: 10.4018/978-1-4666-1945-6.ch103

Abstract

The authors applied operations management principles of scheduling and allocation to scientific computing clusters in order to decrease energy consumption and increase throughput. Using parallel processing and bottleneck management, they challenged the traditional scheduling method of one job per processor core that is commonly used in scientific computing. The authors tested the effect of increased parallelism with different test applications related to high-energy physics computing. The test results showed that, at best, their methods both decreased energy consumption down to 40% and increased throughput up to 100%, compared to the standard one task per CPU core method. The trade-off is that the processing times of individual tasks get longer, but in scientific computing the overall throughput and energy efficiency are often more important.

Introduction

Scientific computing clusters are widely used in many research fields, especially in experimental physics, astronomy, and the biosciences. Computing-intensive research easily deploys thousands, even hundreds of thousands, of CPUs to analyze various data sets and models. These clusters can be viewed as production resources that process jobs consisting of numerous tasks. The tasks can be processed by different resources, and the jobs are finally assembled and delivered back to the cluster's customers. Operating such a cluster has its own cost structure related to capital invested, energy consumed, maintenance work, and facility-related costs. In all, a computing cluster closely resembles an industrial production unit. The working hypothesis for this chapter is therefore that operations management principles used in manufacturing can be applied to improve computing cluster productivity and overall efficiency.

Green computing, as it is now often called, seeks to improve the efficiency of computing centers. It is a wide topic that covers issues such as locating data centers near cheap energy sources (Brown & Reams, 2010), minimizing so-called e-waste (Hanselman & Pegah, 2007), and designing optimal cooling infrastructure and running the center in an optimal way (Marwah et al., 2009). Energy and resource optimization in scientific computing has mostly focused on hardware and infrastructure issues, for example the development of more efficient hardware or the optimization of cooling in computer centers. It has focused less on operational methods such as workload management, and even less on optimizing operating systems and application software for energy efficiency.

The best-known method for comparing the energy efficiency of data centers is the Power Usage Effectiveness (PUE) metric (Belacy, 2008), defined as the ratio of total facility power to IT equipment power. It indicates how much of the energy is lost in cooling, power distribution, and other infrastructure, but it does not indicate how efficiently the IT resources themselves are operated. These limitations of PUE have been recognized by the Green IT Promotion Council (2010), which proposes three additional metrics: ITEU (IT Equipment Utilization), the IT equipment usage in the data center; ITEE (IT Equipment Energy Efficiency), the total rated capacity of IT equipment divided by the total rated energy consumption of IT equipment; and GEC (Green Energy Coefficient), the green (natural) energy divided by the total energy consumption of the data center. Based on these metrics, the Datacenter Performance Per Energy (DPPE) metric can be calculated as follows: DPPE = (ITEU x ITEE x 1/PUE) / (1 - GEC). Further, the report gives four methods for improving energy efficiency: 1) operating the data center efficiently, i.e., reducing the amount of hardware and increasing its utilization; 2) installing energy-efficient hardware; 3) improving the energy efficiency of non-IT infrastructure; and 4) using renewable energy.
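
The two formulas above can be written out as a minimal sketch. The function and parameter names are illustrative assumptions and do not come from the chapter or the Green IT Promotion Council report; only the arithmetic follows the definitions given in the text.

```python
def pue(total_facility_power, it_equipment_power):
    """Power Usage Effectiveness: total facility power / IT equipment power.

    Both arguments must use the same unit (for example kW).
    """
    if it_equipment_power <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_power / it_equipment_power


def dppe(iteu, itee, pue_value, gec):
    """DPPE = (ITEU x ITEE x 1/PUE) / (1 - GEC), as stated in the text.

    iteu      -- IT Equipment Utilization
    itee      -- IT Equipment Energy Efficiency
    pue_value -- Power Usage Effectiveness
    gec       -- Green Energy Coefficient, a fraction in [0, 1)
    """
    if pue_value <= 0:
        raise ValueError("PUE must be positive")
    if not 0 <= gec < 1:
        # gec == 1 would make the denominator zero
        raise ValueError("GEC must be in the range [0, 1)")
    return (iteu * itee * (1.0 / pue_value)) / (1.0 - gec)
```

A lower PUE and a higher GEC both raise DPPE, which matches the intent of the four improvement methods listed above. The input values have to be measured for a specific data center; none are assumed here.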

In this work, we focus on a typical computing problem in high-energy physics: how to process a large set of jobs efficiently. While most existing work on high-performance computing focuses on optimizing the processing time of individual computing jobs, we try to optimize the energy consumption per computing job and the total processing time of the whole set of jobs by choosing an optimal scheduling policy.
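
The two optimization targets named above, energy per job and total processing time of the job set, can be made concrete with a small sketch. The function names and the numbers in the example are hypothetical, chosen only to illustrate the bookkeeping; they are not measurements from this chapter.

```python
def energy_per_job_wh(avg_power_w, makespan_h, n_jobs):
    """Average energy per job in watt-hours.

    avg_power_w -- mean electrical power drawn while the job set runs (W)
    makespan_h  -- total time to finish the whole job set (h)
    n_jobs      -- number of jobs in the set
    """
    if n_jobs <= 0:
        raise ValueError("n_jobs must be positive")
    return avg_power_w * makespan_h / n_jobs


def throughput_jobs_per_h(n_jobs, makespan_h):
    """Jobs completed per hour over the whole job set."""
    if makespan_h <= 0:
        raise ValueError("makespan_h must be positive")
    return n_jobs / makespan_h


# Hypothetical comparison of two scheduling policies on the same 100 jobs.
# All numbers below are made up for illustration.
policies = {
    "one task per core": {"avg_power_w": 300.0, "makespan_h": 10.0},
    "several tasks per core": {"avg_power_w": 330.0, "makespan_h": 6.0},
}
for name, p in policies.items():
    e = energy_per_job_wh(p["avg_power_w"], p["makespan_h"], 100)
    t = throughput_jobs_per_h(100, p["makespan_h"])
    print(f"{name}: {e:.1f} Wh/job, {t:.1f} jobs/h")
```

The sketch shows why a policy that draws more power can still use less energy per job: what matters is the product of average power and the time needed to finish the whole set.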

We study this problem using real experimental physics data, computing jobs, and dedicated computing clusters. CERN, the European Organization for Nuclear Research in Geneva, provided us with a unique opportunity to experiment and test the hypothesis. The Large Hadron Collider (LHC) experiment at CERN produces about 15 petabytes of data per year. The overall computing infrastructure comprises numerous computing clusters of various sizes, with a total of over 100,000 CPUs in more than 140 computing centers. Efficient management of these computing resources is vital for the success of the project, which is foreseen to be active for the next 20 years.
