Introduction
Nowadays, cloud computing can effectively reconfigure various resources to offer composite services that meet the dynamic needs of users. Users care not only about the functional properties of composite services but also about their non-functional properties (Grunske & Zhang, 2009; Zhang, Li, Wan, & Grunske, 2011), such as reliability, availability, and security. Because the cloud environment is complex, a variety of uncertainties may affect the quality of cloud services. For example, individual cloud services that are distributed across the Internet, provided by different organizations, and run on different system platforms may behave anomalously. Unpredictable faults can also prevent composite services from running correctly. Ensuring that composite services remain in a consistent state even in the presence of failures is therefore a challenging problem.
Several approaches have been proposed to recover from failures of cloud services. However, current work lacks a comprehensive understanding of the causes and effects of faults in the complex cloud computing environment. Some approaches propose service recovery strategies only for specific types of faults in a single layer of cloud computing and do not consider failures in all three layers at the same time (Juhnke, Dornemann, & Freisleben, 2009; Mdhaffar, Halima, Juhnke, Jmaiel, & Freisleben, 2011; Nallur & Bahsoon, 2013; Ramakrishnan, Koelbel, Kee, et al., 2009). For example, Mdhaffar et al. (2011) recover SaaS services from failures with the Aop4csm approach, and Juhnke et al. (2009) recover from IaaS failures with a policy-based approach. In addition, these recovery approaches do not take service granularity into account: they are suitable only for basic services, not for composite services. Thus, they offer no comprehensive recovery framework for failures that occur in different cloud layers.
In this paper, we first identify the causes and effects of faults in the cloud computing environment and analyze the relationship between faults and failures. We present a unified fault taxonomy for the three layers of cloud computing, in which faults are related to the infrastructure layer, the platform layer, and the software layer. We then propose a hierarchical recovery framework in which a series of recovery strategies are applied to these failures. The recovery strategies depend on service granularity, that is, on whether the failed service is a basic service or a composite service. Four recovery strategies are defined for basic services, selected according to the fault cause: undo, redo, substitute with undo, and substitute without undo. For composite services, the recovery strategy is recompose.
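To make the granularity-based selection concrete, the following is a minimal Python sketch and not the paper's algorithm. The three layers and the five strategy names come from the text above. The `Failure` fields `transient`, `replaceable`, and `has_side_effects`, and the rules in `choose_strategy` that pick among the four basic-service strategies, are hypothetical placeholders, since this section does not state which fault cause maps to which strategy.

```python
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    """The three cloud layers named in the fault taxonomy."""
    INFRASTRUCTURE = "infrastructure"
    PLATFORM = "platform"
    SOFTWARE = "software"


class Granularity(Enum):
    BASIC = "basic"
    COMPOSITE = "composite"


class Strategy(Enum):
    """The five recovery strategies listed in the text."""
    UNDO = "undo"
    REDO = "redo"
    SUBSTITUTE_WITH_UNDO = "substitute with undo"
    SUBSTITUTE_WITHOUT_UNDO = "substitute without undo"
    RECOMPOSE = "recompose"


@dataclass(frozen=True)
class Failure:
    layer: Layer
    granularity: Granularity
    # The three flags below are illustrative assumptions, not from the paper.
    transient: bool = False         # fault is expected to clear on retry
    replaceable: bool = False       # an equivalent service is available
    has_side_effects: bool = False  # partial work must be rolled back


def choose_strategy(failure: Failure) -> Strategy:
    """Pick a recovery strategy; only the composite rule is from the text."""
    if failure.granularity is Granularity.COMPOSITE:
        return Strategy.RECOMPOSE
    # Hypothetical rules for basic services.
    if failure.transient:
        return Strategy.REDO
    if failure.replaceable:
        if failure.has_side_effects:
            return Strategy.SUBSTITUTE_WITH_UNDO
        return Strategy.SUBSTITUTE_WITHOUT_UNDO
    return Strategy.UNDO
```

For instance, `choose_strategy(Failure(Layer.SOFTWARE, Granularity.COMPOSITE))` returns `Strategy.RECOMPOSE`, which is the only mapping this section fixes. The remaining branches show just one way the dispatch on fault cause could be organized.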
The contributions of the paper are summarized as follows:
- The relationship between faults and failures is analyzed, and a taxonomy of faults in the three layers of cloud computing is presented;
- Five recovery strategies are proposed according to service granularity: four for basic services and one for composite services;
- A simulation system for failure recovery of cloud service composition, named CSFRS (Cloud Service Failure Recovery System), is developed. Experiments on the simulation system are performed to validate our proposed recovery algorithms.