Improved Checkpoint Using the Effective Management of I/O in a Cloud Environment


Copyright: © 2018 |Pages: 13
DOI: 10.4018/978-1-5225-2255-3.ch096

Chapter Preview



The emergence of cloud computing has brought a new dimension to the world of information technology. Although cloud computing offers several advantages, such as virtualization, cost reduction, and multi-tenancy, there are risks and failures associated with it (Yang et al., 2014). A key challenge for research in cloud computing is to ensure the reliability of the system without reducing overall system performance. Checkpointing is one of the main fault tolerance strategies. Its major problem is the overhead caused by writing checkpoint files to stable storage: storage is estimated to account for 70% of the checkpointing process time (Ouyang et al., 2009a; Cornwell & Kongmunvattana, 2011a). Figure 1 shows the main phases of the checkpointing process, which are: i) suspend communication between processes and ensure a consistent state; ii) use the checkpointing library to create and store checkpoints; iii) reconnect the processes and continue execution.

Figure 1. The time of the phases of the checkpointing process
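To see why the storage phase is the natural target, the 70% figure quoted above can be plugged into an Amdahl-style estimate. The following is a minimal sketch, not part of the chapter's method: the function name `checkpoint_speedup` and the storage speedup factors 2 and 4 are hypothetical, and only the 0.7 storage share comes from the text.

```python
def checkpoint_speedup(storage_fraction: float, storage_speedup: float) -> float:
    """Overall speedup of the checkpointing process when only the storage
    phase is made `storage_speedup` times faster (Amdahl-style estimate)."""
    if not 0.0 <= storage_fraction <= 1.0:
        raise ValueError("storage_fraction must be in [0, 1]")
    if storage_speedup <= 0:
        raise ValueError("storage_speedup must be positive")
    # Time left after the improvement, as a fraction of the original time.
    remaining = (1.0 - storage_fraction) + storage_fraction / storage_speedup
    return 1.0 / remaining


# 0.7 is the storage share quoted in the text; the factors are assumptions.
for factor in (2.0, 4.0):
    print(f"storage {factor}x faster -> checkpoint "
          f"{checkpoint_speedup(0.7, factor):.2f}x faster")
```

Under this simple model, even an arbitrarily fast storage phase cannot make the whole checkpoint more than 1 / 0.3, or roughly 3.3 times, faster, because the suspend and reconnect phases remain.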


The aim of our work is to minimize the overhead of checkpointing by minimizing its storage time. To achieve this goal, we improve I/O management and propose a three-phase checkpointing strategy:

  1. The construction of the VRbIO (Virtual RbIO) topology: RbIO, proposed in (Lui et al., 2010), is a virtual hierarchical topology that minimizes checkpointing time and I/O time at the same time. In our system, each VM has a reactive agent responsible for local I/O management; at the end of this phase, some of these reactive agents are activated to manage the I/O of a group of VMs on the server. The I/O thus becomes hierarchical.

  2. Creating the checkpoint files using a coordinated checkpointing protocol.

  3. Ensuring lightweight and fault-tolerant storage of these files by using Collective and Selective Data Sieving input/output (CSDS I/O), which is executed only by the active agents. CSDS improves on the ROMIO I/O strategy, which has several problems and limitations (Fu et al., 2011).
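The hierarchical idea behind these phases can be sketched as follows. This is an illustration under stated assumptions, not the VRbIO or CSDS algorithm itself: the function names, the fixed `group_size`, and the rule that the first VM of each group hosts the active agent are all hypothetical, and the checkpoint contents stand in for files produced by the coordinated protocol.

```python
from collections import defaultdict


def build_groups(vm_to_server, group_size):
    """Group the VMs of each server and map each active agent to its group.

    Assumption: the first VM of every group hosts the active agent.
    """
    per_server = defaultdict(list)
    for vm in sorted(vm_to_server):
        per_server[vm_to_server[vm]].append(vm)
    groups = {}
    for vms in per_server.values():
        for i in range(0, len(vms), group_size):
            members = vms[i:i + group_size]
            groups[members[0]] = members
    return groups


def aggregate_writes(groups, checkpoints):
    """Each active agent gathers its group's checkpoints into one write."""
    return {agent: b"".join(checkpoints[vm] for vm in members)
            for agent, members in groups.items()}


# Hypothetical placement of five VMs on two servers.
placement = {"vm0": "s0", "vm1": "s0", "vm2": "s0", "vm3": "s1", "vm4": "s1"}
checkpoints = {vm: vm.encode() for vm in placement}  # stand-in checkpoint data
groups = build_groups(placement, group_size=2)
writes = aggregate_writes(groups, checkpoints)
print(len(checkpoints), "checkpoint files ->", len(writes), "storage writes")
```

Only the active agents touch stable storage, so the number of writers shrinks from one per VM to one per group.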

Our algorithm, with its three phases, provides solutions for most issues raised by the use of classical checkpointing with ROMIO as an I/O strategy. The rest of the chapter is organized as follows: Section 2 presents the background in the field of I/O aggregation techniques, with a comparative study. ROMIO and its features are illustrated in Section 3. Section 4 presents our contribution: each service of this contribution is described in detail, and all the problems cited in the previous section are solved. Section 5 presents some experimental results, followed by a conclusion and future research directions.



An important reason for the limitations of I/O systems is that applications often issue many small, disjoint requests. This access pattern first adds cost through the large number of requests sent over the various transmission channels and, more significantly, increases the time needed to process them (Sadiku et al., 2014). To deal with this problem, several “aggregation” methods have been proposed. We can distinguish two types of aggregation strategies: independent and collective.

Independent I/O is a straightforward form of I/O and is widely used in parallel applications. This form of I/O can be called independently by an individual process or any subset of processes of a parallel application. The advantage of independent I/O is that users have the freedom to perform I/O for each individual process or any subset of the processes that open the file.
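A minimal sketch of independent I/O, assuming each process owns one fixed-size block of a shared file: the processes are simulated by sequential calls, and `BLOCK_SIZE`, `independent_write`, and `run_demo` are hypothetical names chosen for illustration. In a real MPI program each rank would issue its own file write instead.

```python
import os
import tempfile

BLOCK_SIZE = 8  # hypothetical number of bytes owned by each process


def independent_write(path, rank, payload):
    """One process writes its own block at its own offset, with no
    coordination with any other process."""
    with open(path, "r+b") as f:
        f.seek(rank * BLOCK_SIZE)
        f.write(payload)


def run_demo(num_procs=4):
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # The ranks run in reverse order to show that ordering does not matter.
        for rank in reversed(range(num_procs)):
            payload = bytes([ord("A") + rank]) * BLOCK_SIZE
            independent_write(path, rank, payload)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)


print(run_demo())
```

Each call is a separate request to the file system, which is exactly the cost that aggregation strategies try to reduce.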

Key Terms in this Chapter

Checkpointing: A technique to add fault tolerance into computing systems. It basically consists of saving a snapshot of the application's state, so that it can restart from that point in case of failure.

ROMIO: A portable MPI-IO implementation that works on many different machines and file systems.

Data Sieving: To reduce the effect of high I/O latency in parallel applications, it is critical to make as few requests to the file system as possible. Data sieving is a technique that enables an implementation to make a few large, contiguous requests to the file system even if the user’s request consists of several small, noncontiguous accesses.

Cloud Computing: Refers to applications and services offered over the Internet. These services are offered from data centers all over the world, which collectively are referred to as the “cloud”.

Collective I/O: In many parallel applications, although each process may need to access several noncontiguous portions of a file, the requests of different processes are often interleaved and may together span large contiguous portions of the file. The collective I/O procedure significantly improves I/O performance by merging the requests of different processes and servicing the merged request.
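The data sieving and collective I/O definitions above can be illustrated with a small sketch. It is not ROMIO's implementation: the function names are hypothetical, requests are modeled as `(offset, length)` pairs, and an in-memory `io.BytesIO` object stands in for the file system.

```python
import io


def merge_requests(requests):
    """Collective I/O idea: merge the (offset, length) requests of all
    processes into as few contiguous extents as possible."""
    merged = []
    for offset, length in sorted(requests):
        end = offset + length
        if merged and offset <= merged[-1][1]:
            # Overlaps or touches the previous extent: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([offset, end])
    return [(start, end - start) for start, end in merged]


def sieved_read(f, requests):
    """Data sieving idea: issue one large contiguous read that covers all
    requests, then extract the requested pieces in memory."""
    start = min(offset for offset, _ in requests)
    end = max(offset + length for offset, length in requests)
    f.seek(start)
    buffer = f.read(end - start)
    return [buffer[offset - start:offset - start + length]
            for offset, length in requests]


data = io.BytesIO(bytes(range(32)))

# Two processes whose interleaved requests together cover bytes 0..15.
process_0 = [(0, 4), (8, 4)]
process_1 = [(4, 4), (12, 4)]
print(merge_requests(process_0 + process_1))

# Three small noncontiguous reads served by a single read of bytes 2..23.
pieces = sieved_read(data, [(2, 3), (10, 2), (20, 4)])
print([list(piece) for piece in pieces])
```

The trade-off in data sieving is that the single large read also transfers the unwanted bytes between the requested pieces, so it pays off mainly when the gaps are small relative to the request latency.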
