Transparent Throughput Elasticity for Modern Cloud Storage: An Adaptive Block-Level Caching Proposal


Bogdan Nicolae (Argonne National Laboratory, USA), Pierre Riteau (StackHPC, UK), Zhuo Zhen (University of Chicago, USA) and Kate Keahey (Argonne National Laboratory, USA)
DOI: 10.4018/978-1-5225-8295-3.ch007


Storage elasticity on the cloud is a crucial feature in the age of data-intensive computing, especially when considering fluctuations of I/O throughput. In this chapter, the authors explore how to transparently boost the I/O bandwidth during peak utilization to deliver high performance without over-provisioning storage resources. The proposal relies on short-lived virtual disks with better performance characteristics (and a higher price) that act during peaks as a caching layer for the persistent virtual disks where the application data is stored during runtime. The authors first show how this idea can be achieved efficiently at the block-device level, using a caching mechanism that leverages iterative behavior and learns from past experience. Second, they introduce a corresponding performance and cost prediction methodology. They demonstrate the benefits of the proposal both for micro-benchmarks and for two real-life applications using large-scale experiments. They conclude with a discussion of how these techniques can be generalized to the increasingly complex landscape of modern cloud storage.

1. Introduction

Elasticity (i.e., the ability to acquire and release resources on demand in response to changes in application requirements during runtime) is a fundamental feature that drives the popularity of cloud architectures. To date, much effort has been dedicated to studying the elasticity of computational resources, which mostly revolves around how to acquire and release virtualization units that provide performance isolation and multi-tenancy for computations, such as virtual machines (VMs) or containers (Tesfatsion, Klein, & Tordsson, 2018). Elasticity of storage has received comparatively little attention, despite the fact that applications are becoming increasingly data-intensive and thus need cost-effective means to store and access data.

An important aspect of storage elasticity is the management of I/O access throughput. Traditional clouds offer little support to address this aspect: users have to manually provision raw virtual disks of predetermined capacity and performance characteristics (i.e., latency and throughput) that can be freely attached to and detached from VM instances (e.g., Amazon Elastic Block Store (EBS) (AmazonEBS, n.d.)). Naturally, provisioning a slower virtual disk incurs lower costs when compared with using a faster disk; however, this comes at the expense of potentially degraded application performance because of slower I/O operations.

This trade-off has important consequences in the context of large-scale distributed scientific applications that exhibit an iterative behavior. Such applications often interleave computationally intensive phases with I/O intensive phases. For example, a majority of high-performance computing (HPC) numerical simulations model the evolution of physical phenomena in time by using a bulk synchronous approach. This involves a synchronization point at the end of each iteration in order to write intermediate output data about the simulation, as well as periodic checkpoints that are needed for a variety of scenarios (Nicolae & Cappello, 2013) such as migration, debugging, and fault tolerance. Since many processes share the same storage (e.g., all processes on the same node share the same local disks), this behavior translates to periods of little I/O activity that are interleaved with periods of highly intensive I/O peaks.

Since time to solution is an important concern, users often over-provision faster virtual disks to achieve the best performance during I/O peaks, leaving this expensive throughput under-used outside the I/O peaks. Moreover, scientific applications tend to run in configurations that include a large number of VMs/containers and virtual disks, which means the waste is quickly multiplied with scale, prompting the need for an elastic solution.

In our previous works (Nicolae, Riteau, & Keahey, 2014b, 2015) we introduced an elastic disk throughput solution that can deliver high performance during I/O peaks while minimizing costs related to storage. Our proposal focuses on the idea of using small, short-lived, and fast virtual disks to temporarily boost the maximum achievable throughput during I/O peaks by acting as a caching layer for larger but slower virtual disks that are used as primary storage. In this context, we developed both a strategy and a performance model to decide when to boost I/O throughput with temporary virtual disks and how large they need to be in order to deliver optimal performance with minimal cost. We have shown how this approach can be efficiently achieved for HPC applications in a completely transparent fashion by exposing a specialized block device inside the guest operating system that hides all details of virtual disk management at the lowest level, effectively casting the throughput elasticity as a block-device caching problem where performance is complemented by cost considerations.
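The cost trade-off behind this idea can be illustrated with a back-of-the-envelope calculation. The Python sketch below is not the performance model from the cited works; it is a deliberately simplified illustration that compares provisioning a single slow disk, a single fast disk, and a slow disk boosted by a small fast cache disk that is attached only around I/O peaks. All names (Disk, Workload, single_disk, boosted), throughputs, prices, and sizes are hypothetical, and the sketch assumes per-second billing, that each iteration's output fits in the cache, and that the cache can be flushed to the slow disk before the next peak.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Disk:
    throughput_mb_s: float   # sustained throughput (hypothetical value)
    price_gb_hour: float     # price per GB per hour (hypothetical value)


@dataclass(frozen=True)
class Workload:
    iterations: int
    compute_s: float         # length of the compute phase of one iteration
    output_mb: float         # data written at the end of each iteration


def single_disk(disk, capacity_gb, w):
    """Runtime (s) and cost when one disk serves all I/O for the whole run."""
    runtime = w.iterations * (w.compute_s + w.output_mb / disk.throughput_mb_s)
    cost = capacity_gb * disk.price_gb_hour * runtime / 3600
    return runtime, cost


def boosted(slow, fast, capacity_gb, cache_gb, w):
    """Runtime (s) and cost for a slow primary disk plus a short-lived fast cache.

    Writes are absorbed by the cache at the fast disk's throughput and flushed
    to the slow disk in the background. The cache is billed only for the time
    needed to flush each iteration's output.
    """
    flush_s = w.output_mb / slow.throughput_mb_s
    if w.output_mb > cache_gb * 1000 or flush_s > w.compute_s:
        raise ValueError("sketch assumes output fits in cache and flush "
                         "finishes within one compute phase")
    runtime = w.iterations * (w.compute_s + w.output_mb / fast.throughput_mb_s)
    primary_cost = capacity_gb * slow.price_gb_hour * runtime / 3600
    cache_cost = cache_gb * fast.price_gb_hour * w.iterations * flush_s / 3600
    return runtime, primary_cost + cache_cost


# Hypothetical disks and workload, chosen only for illustration.
SLOW = Disk(throughput_mb_s=100, price_gb_hour=0.0001)
FAST = Disk(throughput_mb_s=500, price_gb_hour=0.0005)
WORK = Workload(iterations=100, compute_s=300, output_mb=10_000)

results = {
    "slow only": single_disk(SLOW, 1000, WORK),
    "fast only": single_disk(FAST, 1000, WORK),
    "boosted": boosted(SLOW, FAST, 1000, 10, WORK),
}
for name, (runtime, cost) in results.items():
    print(f"{name:10s} runtime={runtime:8.0f} s  cost={cost:.4f}")
```

Under these assumptions, the boosted configuration matches the runtime of the fast disk while paying the fast-disk price only for a small cache over short intervals. A realistic model would also have to account for reads, cache misses, attach and detach latency, and the provider's actual billing granularity.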

Despite the increasing complexity of the cloud landscape, the general principles of our proposal can be applied to provide transparent I/O throughput elasticity for cloud storage in a general fashion, beyond the specific context of HPC applications that use virtual disks. Therefore, in this chapter we refine our proposal with a discussion in this direction. We place the discussion in the context of modern applications, cloud infrastructures, and virtualization technologies, which introduce new challenges and perspectives (e.g., new cost models).
