Introduction
With its on-demand access and pay-as-you-go model, cloud computing has become a new paradigm for providing computing resources. More and more organizations and companies in different areas are gradually moving their applications to cloud systems. High dependability and availability of the cloud platform are essential to ensure the quality of service of these applications. However, because of the ever-increasing heterogeneity, distribution, and scale of cloud computing systems, failures are the norm rather than the exception, which creates a high demand for fault tolerance and fast recovery. Therefore, a deep understanding of failure patterns can lead not only to effective solutions, but also to better fulfillment of the goal of high dependability.
Among the different types of failures in cloud computing systems, task failure is the most basic and common one, because a task is the minimal scheduling unit and runs on a single machine. One of the most effective methods for dealing with task failures is rescheduling (Soualhia, Khomh & Tahar, 2015). However, some tasks behave like “crash loops”: they deterministically fail shortly after starting execution and are rescheduled repeatedly (Reiss, Tumanov, Ganger, Katz & Kozuch, 2012b). These tasks can do great harm to a cloud computing system, because they continually compete for resources and then waste those resources by failing. They also significantly increase the scheduling workload. In this paper, we therefore call this type of task a “killer task.”
To better understand killer tasks, we study the Google cluster workload traces¹, which contain workload measurements for 25M tasks from 650K jobs on more than 12,000 nodes over a one-month period. In the Google cluster, each task is the execution and resource allocation unit of a job and is scheduled onto a single machine. We analyze the failure frequency of tasks and find that some tasks suffer from continual failures and repeated rescheduling; for example, 1,151 tasks experience more than 500 failures and reschedulings during their lifetime. After analyzing the characteristics and rescheduling strategies of these tasks, we propose an online killer task recognition service that helps cloud systems recognize killer tasks automatically and proactively, saving computing resources and improving the stability of the cloud system.
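As a rough illustration of the failure-frequency analysis described here, the following minimal sketch counts failure events per task and keeps the tasks that exceed a threshold. It is not the method used in this paper: the tuple layout `(job_id, task_index, event_type)`, the function names, and the toy data are all hypothetical, and the event code `FAIL = 3` is assumed to match the task-events schema of the public trace.

```python
from collections import Counter

# Assumption: event-type code 3 denotes FAIL, as in the public trace's
# task-events schema. Adjust if your data uses a different encoding.
FAIL = 3


def count_failures(events):
    """Count FAIL events per task, keyed by (job_id, task_index)."""
    counts = Counter()
    for job_id, task_index, event_type in events:
        if event_type == FAIL:
            counts[(job_id, task_index)] += 1
    return counts


def frequent_failers(events, threshold=500):
    """Return tasks whose number of failures is strictly above threshold."""
    counts = count_failures(events)
    return {task: n for task, n in counts.items() if n > threshold}


# Toy data (hypothetical): task (1, 0) fails three times, task (1, 1)
# fails once and then finishes (code 4), task (2, 0) is only scheduled (code 1).
events = [(1, 0, 3)] * 3 + [(1, 1, 3), (1, 1, 4), (2, 0, 1)]

print(frequent_failers(events, threshold=2))  # prints {(1, 0): 3}
```

The default threshold of 500 mirrors the example figure quoted above; it is only a candidate filter, since a high failure count alone does not establish that a task is a killer task.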
Prior studies on failures in cloud computing systems focus on the characterization of job failures (Chen, Lu & Pattabiraman, 2014a), server failures (Garraghan, Townend & Xu, 2014; Birke, Giurgiu, Chen, Wiesmann & Engbersen, 2014), and failure prediction (Chen, Lu & Pattabiraman, 2014b; Rosa, Chen & Binder, 2015a). To the best of our knowledge, we are the first to identify killer tasks and to recognize them online at an early stage. We make the following contributions in this paper:
- Exploring killer tasks by analyzing the failure frequency of tasks in the Google cluster and revealing their characteristics;
- Investigating the resource usage patterns of killer tasks and non-killer tasks and identifying the differences between them;
- Disclosing the rescheduling strategies applied to killer tasks in the Google cluster and quantitatively evaluating their positive and negative impacts;
- Proposing a framework for an online killer task recognition service that uses resource usage time series to recognize killer tasks at an early stage of their occurrence;
- Implementing a prototype of the proposed service and verifying it with experiments. The results show that it recognizes killer tasks with 93.6% accuracy and gives the cloud system an average of 633 minutes to take proactive action, with 86.6% resource saving.