Analysis of Frequently Failing Tasks and Rescheduling Strategy in the Cloud System

Hongyan Tang (School of Software and Microelectronics, Peking University, Beijing, China), Ying Li (National Engineering Center of Software Engineering, Peking University, Beijing, China), Tong Jia (School of Software and Microelectronics, Peking University, Beijing, China), Xiaoyong Yuan (Department of Computer and Information Science and Engineering, University of Florida, Florida, USA) and Zhonghai Wu (National Engineering Center of Software Engineering, Peking University, Beijing, China)
Copyright: © 2018 | Pages: 23
DOI: 10.4018/IJDST.2018010102

Abstract

To better understand task failures in cloud computing systems, the authors analyze the failure frequency of tasks in the Google cluster dataset and find that some frequently failing tasks suffer from long-term failures and repeated rescheduling. These are called killer tasks because they can be a major concern for cloud systems. Hence, there is a need to analyze killer tasks thoroughly and recognize them precisely. In this article, the authors first investigate the resource usage patterns of killer tasks and analyze the rescheduling strategies applied to killer tasks in the Google cluster, finding that repeated rescheduling wastes a large amount of resources. Based on these observations, they then propose an online killer task recognition service that recognizes killer tasks at a very early stage of their occurrence so as to avoid unnecessary resource waste. The experimental results show that the proposed service achieves 93.6% accuracy in recognizing killer tasks, with an 87% timing advance and 86.6% resource saving for the cloud system on average.

Introduction

With its on-demand access and pay-as-you-go model, cloud computing has become a new paradigm for providing computing resources. More and more organizations and companies in different areas are choosing to run their applications on cloud systems. High dependability and availability of the cloud platform are essential to ensure the quality of service of these applications. However, because cloud computing systems are increasingly heterogeneous, distributed and large in scale, failures are the norm rather than the exception, which creates a high demand for fault tolerance and fast recovery. A deep understanding of failure patterns can therefore lead not only to effective solutions, but also to better fulfillment of the objective of high dependability.

Among the different types of failures in cloud computing systems, task failure is the most basic and common one, because a task is the minimal scheduling unit running on a single machine. One of the most effective methods for dealing with task failures is rescheduling (Soualhia, Khomh & Tahar, 2015). However, some tasks behave like “crash-loops”: they deterministically fail shortly after execution begins and are rescheduled repeatedly (Reiss, Tumanov, Ganger, Katz & Kozuch, 2012b). These tasks may do great harm to the cloud computing system, because they continually compete for resources and then waste those resources when they fail. They also significantly increase the scheduling workload. For this reason, we call tasks of this type “killer tasks” in this paper.

To better understand killer tasks, we conduct a study on the Google cluster workload traces, which contain workload measurements for 25M tasks of 650K jobs on more than 12,000 nodes over a one-month period. In the Google cluster, each task is the execution and resource allocation unit of a job and is scheduled onto a single machine. We analyze the failure frequency of tasks and find that some tasks suffer from continual failures and repeated rescheduling. For example, we find 1,151 tasks that fail and are rescheduled more than 500 times during their lifetime. After analyzing the characteristics and rescheduling strategies of these tasks, we propose an online killer task recognition service that helps cloud systems recognize killer tasks automatically and proactively, in order to save computing resources and improve the stability of the cloud system.
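The failure-frequency count described above can be illustrated with a minimal Python sketch. It is not the authors' analysis code. It assumes each event record is a tuple of (timestamp, job ID, task index, event type), that a task is identified by its job ID and task index, and that the event code 3 denotes a failure; the function names and the sample records are hypothetical and should be checked against the trace documentation before use.

```python
from collections import Counter

# Assumption: event code 3 marks a task failure in the event records.
FAIL = 3


def count_failures(events):
    """Count failure events per task, keyed by (job_id, task_index)."""
    counts = Counter()
    for _timestamp, job_id, task_index, event_type in events:
        if event_type == FAIL:
            counts[(job_id, task_index)] += 1
    return counts


def frequently_failing(counts, threshold=500):
    """Return the tasks that failed strictly more than `threshold` times."""
    return {task: n for task, n in counts.items() if n > threshold}


# Tiny synthetic sample, not real trace data:
# (timestamp, job_id, task_index, event_type)
events = [
    (0, 1, 0, 1), (5, 1, 0, 3),   # task (1, 0): scheduled, then fails
    (6, 1, 0, 1), (9, 1, 0, 3),   # rescheduled, fails again
    (0, 2, 0, 1), (50, 2, 0, 4),  # task (2, 0): scheduled, then finishes
]
counts = count_failures(events)
print(frequently_failing(counts, threshold=1))
```

With the default threshold of 500, the same filter would select tasks comparable to the 1,151 frequently failing tasks mentioned above, assuming the event semantics match.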

Prior studies on failures in cloud computing systems focus on the characterization of job failures (Chen, Lu & Pattabiraman, 2014a), server failures (Garraghan, Townend & Xu, 2014; Birke, Giurgiu, Chen, Wiesmann & Engbersen, 2014) and failure prediction (Chen, Lu & Pattabiraman, 2014b; Rosa, Chen & Binder, 2015a). To the best of our knowledge, we are the first to identify killer tasks and to recognize them online at an early stage. We make the following contributions in this paper:

  • Exploring killer tasks by analyzing the failure frequency of tasks in the Google cluster and revealing their characteristics;

  • Investigating the resource usage patterns of killer tasks and non-killer tasks and identifying the differences between them;

  • Disclosing the rescheduling strategies applied to killer tasks in the Google cluster and quantitatively evaluating their positive and negative impacts;

  • Proposing a framework for an online killer task recognition service that makes use of resource usage time series to recognize killer tasks at an early stage of their occurrence;

  • Implementing a prototype of the proposed service and verifying it with experiments. The results show that it recognizes killer tasks with 93.6% accuracy and gives the cloud system an average of 633 minutes to take proactive action, with 86.6% resource saving.
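One simple way to picture the resource waste caused by repeated rescheduling is to add up the resources held by every attempt that ends in failure. The Python sketch below is a hedged illustration only; the metric the authors actually use is not given in this excerpt. It assumes a time-ordered list of (timestamp, event type) pairs for a single task, hypothetical event codes 1 for schedule, 3 for failure and 4 for finish, and a constant CPU request per attempt.

```python
# Assumed event codes; verify against the trace documentation.
SCHEDULE, FAIL, FINISH = 1, 3, 4


def wasted_cpu_time(events, cpu_request):
    """Sum cpu_request * runtime over all attempts that end in a failure.

    `events` is a time-ordered list of (timestamp, event_type) pairs
    for one task.
    """
    wasted = 0.0
    started = None  # start time of the attempt currently running, if any
    for timestamp, event_type in events:
        if event_type == SCHEDULE:
            started = timestamp
        elif started is not None:
            if event_type == FAIL:
                wasted += cpu_request * (timestamp - started)
            started = None  # the attempt ended, whether it failed or not
    return wasted


# Synthetic example: two failed attempts (10 and 8 time units), then success.
task_events = [(0, SCHEDULE), (10, FAIL), (12, SCHEDULE), (20, FAIL),
               (25, SCHEDULE), (100, FINISH)]
waste = wasted_cpu_time(task_events, cpu_request=0.5)
print(waste)
```

Only the failed attempts contribute to the total; the final successful attempt is excluded, which reflects the idea that resources spent on attempts that fail are lost to the system.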
