Fault Tolerant and Optimal Task Clustering for Scientific Workflow in Cloud

Nagaraj V. Dharwadkar (Rajarambapu Institute of Technology, Uran Islampur, India), Shivananda R. Poojara (Rajarambapu Institute of Technology, Uran Islampur, India) and Priyanka M. Kadam (Rajarambapu Institute of Technology, Uran Islampur, India)
Copyright: © 2018 | Pages: 19
DOI: 10.4018/IJCAC.2018070101


Scientific workflows are complex, large-scale applications that require substantial computational power for data transmission and execution. In this article, the authors address the problem of scheduling a scientific workflow on a number of virtual machines (VMs) with the objective of reducing both the total makespan of the workflow and the number of failures. The article implements checkpointing and replication strategies together with the parallel task execution (PTE) algorithm to schedule scientific workflows for minimum time and cost. To reduce execution overhead and improve the performance of the scientific application, task clustering methods are used. Specifically, the Horizontal Reclustering (HR) method was implemented to reduce failures and scheduling overhead. The authors combined the checkpointing, replication and PTE algorithms and applied them to the HR method. Results show that the proposed strategies and method work efficiently in terms of reducing failures, makespan and execution cost compared to existing methods.


A scientific workflow is a collection of dependent and interdependent tasks. Scientific workflows are used to model scientific applications in fields such as astronomy, physics and bioinformatics. Because a workflow consists of a large number of computational tasks, it requires considerable computational power to run. Estimating the performance of a workflow technique in the real world is time-consuming and complex because of the large system overhead. Task clustering (Singh et al., 2008) is a technique that consolidates short tasks into jobs to reduce this system overhead. Existing clustering techniques do not consider the impact of failures in the system, despite their strong effect on distributed environments such as clouds and grids (Sahoo et al., 2004; Schroeder et al., 2010; Gupta et al., 2016; Mohammad et al., 2017; Nasr & Ouf, 2015). Researchers (Zhang et al., 2004; Tang et al., 1990) have stressed the significance of fault tolerance planning and shown that the failure rate in distributed environments is significant. Deploying workflows in a cloud environment is largely efficient, but fault tolerance requires more attention (Bagui & Nguyen, 2015; Bhushan & Gupta, 2017). Workflow tasks can fail for many reasons and in many ways. The proposed work focuses on reducing transient failures, that is, failures that are recoverable rather than permanent (Zhang et al., 2004). Transient failures are divided into two types: job failure and task failure. Task clustering produces multiple clustered jobs, each containing many tasks. If one task in a job fails because of an unpredictable event during computation, the whole job is placed in the failed state, even though the other tasks in the same clustered job executed successfully. Different techniques have been designed to reduce the impact of job failure on scientific workflow execution. One of these techniques is to retry the failed job, but this is expensive because tasks that already executed successfully are also recomputed (Zhang et al., 2004).
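
The cost of retrying a whole clustered job can be illustrated with a minimal Python sketch. The `Task` class, the task names and the runtimes are hypothetical values chosen for illustration, and re-running only the failed tasks is shown purely as a point of comparison, not as the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    runtime: float  # hypothetical runtime in seconds


def retry_whole_job(tasks):
    """Re-execution cost when the entire clustered job is retried."""
    return sum(t.runtime for t in tasks)


def retry_failed_only(tasks, failed):
    """Re-execution cost when only the failed tasks are run again."""
    return sum(t.runtime for t in tasks if t.name in failed)


# One clustered job with four tasks; only t3 suffers a transient failure.
job = [Task("t1", 10.0), Task("t2", 20.0), Task("t3", 30.0), Task("t4", 40.0)]
failed = {"t3"}

whole = retry_whole_job(job)              # every task is recomputed
partial = retry_failed_only(job, failed)  # only t3 is recomputed
wasted = whole - partial                  # work repeated unnecessarily

print(f"retry whole job: {whole}s, failed tasks only: {partial}s, wasted: {wasted}s")
```

With these assumed runtimes, retrying the whole job repeats 70 seconds of work that had already completed successfully.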

Task failure is one of the most important and most studied problems in workflow execution. Many investigators (Sahoo et al., 2004; Schroeder et al., 2010; Zhang et al., 2004; Tang et al., 1990) have pointed out the failure problem and its significance. There are different types of failure, such as job failure, task failure, VM failure and many more. This research focuses on failures that are not permanent and can be recovered from (Zhang et al., 2004). The Pegasus Workflow Management System monitors scientific workflows at the task level and re-runs the job when tasks fail. Failure data is analyzed to determine the cause of failure (Deelman et al., 2015). CloudSim (Deelman et al., 2009) is a framework for simulating and modeling cloud computing services. It supports the single execution of workloads.

In the literature, a few authors have focused on the problem of mapping workflows represented as Directed Acyclic Graphs (DAGs) (Blythe et al., 2005; Wieczorek et al., 2005; Kalayci et al., 2010). Some of them considered a matchmaking technique for scheduling tasks onto distributed resources (Blythe et al., 2005; Kalayci et al., 2010). A similar approach is taken by Chen et al. (2016) to run scientific workflows on computing nodes with a high failure rate. Furthermore, the authors concentrated on improving the performance of task clustering methods by changing the cluster size to reduce both the cost of re-executing failed tasks and the makespan of the workflow.
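
To make the role of cluster size concrete, the following Python sketch groups the tasks at the same level of a DAG into jobs of at most `cluster_size` tasks, which is the general idea behind horizontal clustering. The function names and the example workflow are hypothetical, and the sketch does not reproduce the Horizontal Reclustering method or any cited system.

```python
from collections import defaultdict


def task_levels(parents):
    """Map each task to its level: entry tasks are level 0, and any other
    task sits one level below its deepest parent."""
    levels = {}

    def level(task):
        if task not in levels:
            preds = parents.get(task, [])
            levels[task] = 0 if not preds else 1 + max(level(p) for p in preds)
        return levels[task]

    for task in parents:
        level(task)
    return levels


def horizontal_clusters(parents, cluster_size):
    """Group tasks of the same level into jobs of at most cluster_size tasks."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    by_level = defaultdict(list)
    for task, lv in task_levels(parents).items():
        by_level[lv].append(task)
    jobs = []
    for lv in sorted(by_level):
        tasks = sorted(by_level[lv])
        for i in range(0, len(tasks), cluster_size):
            jobs.append(tasks[i:i + cluster_size])
    return jobs


# Hypothetical fork-join workflow: each task maps to the list of its parents.
workflow = {
    "split": [],
    "a1": ["split"], "a2": ["split"], "a3": ["split"], "a4": ["split"],
    "merge": ["a1", "a2", "a3", "a4"],
}

jobs_k2 = horizontal_clusters(workflow, 2)
jobs_k4 = horizontal_clusters(workflow, 4)
print(jobs_k2)
print(jobs_k4)
```

A smaller cluster size yields more jobs and therefore more scheduling overhead, but a single task failure then forces fewer co-clustered tasks to be re-executed. That trade-off is what adjusting the cluster size is meant to balance.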
