The Optimal Checkpoint Interval for the Long-Running Application

Yongning Zhai (Jiangsu Automation Research Institute, Lianyungang, China) and Weiwei Li (Jiangsu Automation Research Institute, Lianyungang, China)
DOI: 10.4018/IJAPUC.2017040103

Abstract

In a distributed computing system, both excessive and deficient checkpointing result in severe performance degradation. To minimize the expected execution time of a long-running application under a general failure distribution, this paper analyzes and derives the optimal equidistant checkpoint interval for fault-tolerant performance optimization. More precisely, an optimal checkpointing period that determines the proper checkpoint sequence is proposed, and the expected effective rate of a defined computation cycle is derived. Maximizing the expected effective rate yields a constraint on the optimal checkpoint sequence. From this optimality constraint, the optimal equidistant checkpoint interval is obtained by minimizing the fault-tolerant overhead ratio. Numerical results show that the proposal is practical for determining a proper equidistant checkpoint interval for fault-tolerant performance optimization.

1. Introduction

According to Kuang (2014), checkpointing and rollback recovery schemes are well-known backward fault-tolerant techniques for minimizing the execution time of long-running applications, such as scientific computing and telecommunication applications. The system takes checkpoints according to a specified policy and recovers automatically from transient faults if they occur. According to Meroufel (2014), a checkpoint is a saved state of the process, which reduces the number of logs that must be replayed during rollback recovery. The time between two consecutive checkpoints during failure-free execution is referred to as the checkpoint interval according to Islam (2014). The checkpoint interval is one of the major factors influencing the performance of a fault-tolerant scheme according to Mendizabal (2014) and Awasthi (2014). As the checkpoint interval decreases, the computation lost when a failure occurs decreases; however, the more frequent checkpointing operations incur high overhead during normal failure-free execution and may cause severe performance degradation. Conversely, as the checkpoint interval increases, the checkpointing overhead during failure-free execution decreases; however, the computation lost to each failure increases, and deficient checkpointing may incur an expensive rollback recovery overhead. Therefore, a trade-off must be made to determine a proper checkpoint interval for high fault-tolerant performance according to Elnozahy (2002) and Treaster (2005).
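
The trade-off described above can be illustrated with a minimal sketch. It is not the derivation of this paper, which treats a general failure distribution. It assumes instead the common first-order textbook model in which failures are rare relative to the interval and a failure costs, on average, half an interval of rework. The names overhead_ratio, best_interval, checkpoint_cost and mtbf are hypothetical and chosen only for illustration.

```python
import math


def overhead_ratio(interval, checkpoint_cost, mtbf):
    """Approximate fault-tolerance overhead per unit of useful work.

    Assumed first-order model (not the paper's formula):
      checkpoint_cost / interval : time spent taking checkpoints
      interval / (2 * mtbf)      : expected rework, assuming a failure
                                   lands halfway through an interval
    All arguments are in the same time unit, e.g. seconds.
    """
    if interval <= 0 or checkpoint_cost < 0 or mtbf <= 0:
        raise ValueError("interval and mtbf must be positive, cost non-negative")
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


def best_interval(checkpoint_cost, mtbf):
    """Interval minimizing overhead_ratio under the assumed model.

    Setting the derivative of C/T + T/(2M) to zero gives T = sqrt(2*C*M).
    """
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


if __name__ == "__main__":
    cost, mtbf = 50.0, 10_000.0  # illustrative values, in seconds
    t_star = best_interval(cost, mtbf)
    # Compare a too-short, the model-optimal, and a too-long interval.
    for t in (t_star / 2, t_star, t_star * 2):
        print(f"interval={t:8.1f} s  overhead={overhead_ratio(t, cost, mtbf):.4f}")
```

With a checkpoint cost of 50 s and a mean time between failures of 10,000 s, this sketch gives an interval of 1,000 s and an overhead ratio of 0.1. Halving or doubling the interval raises the ratio to 0.125, which matches the qualitative behavior described above: checkpointing too often wastes time on checkpoints, and checkpointing too rarely wastes time on rework.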
