Data Partitioning for Highly Scalable Cloud Applications

Robert Neumann (Otto-von-Guericke-Universität Magdeburg, Germany), Matthias Baumann (Otto-von-Guericke-Universität Magdeburg, Germany), Reiner Dumke (Otto-von-Guericke-Universität Magdeburg, Germany) and Andreas Schmietendorf (Berlin School of Economics and Law, Germany)
DOI: 10.4018/978-1-4666-0957-0.ch016


Cloud computing has brought new challenges, but also exciting opportunities, to developers. With the illusion of an infinite supply of computing resources, even individual developers are now in a position to create applications that scale out worldwide and reach millions of people. One difficulty in developing such mega-scale Cloud applications is keeping the storage backend scalable. In this chapter, the authors detail ways of partitioning non-relational data among thousands of physical storage nodes, emphasizing the peculiarities of tabular Cloud storage. They give recommendations on how to establish a sustainable, scalable data management architecture that continues to provide the same high throughput as the data volume grows. Finally, they support their theoretical discussion with insights gained from a real-world Cloud project in which they have been involved.
Chapter Preview

Cloud Computing

The economics of Cloud Computing have been discussed by several researchers and practitioners. Essentially, the three core benefits are (Armbrust et al., 2009):

  1. Infinite computing resources available on demand
  2. No up-front commitment by Cloud users
  3. Pay for computing resources as needed

For application developers, eliminating the up-front commitment and being able to adapt the resources in use dynamically to the system load make it possible to give formerly risky ideas a chance. If such an idea turns out to be successful, the application must have been designed to accommodate a rapidly increasing load. Keeping an application scalable means keeping not only its business logic scalable, but also its data logic.

Mega-scale applications such as Facebook, Flickr, YouTube, or Yelp present new challenges for storing and querying large amounts of data. SQL databases have dominated application development for decades because their ACID guarantees benefit data integrity and safety. Performance and throughput, on the other hand, suffer under ACID, a cost that engineers of mega-scale applications could no longer accept. The advent of lightweight, non-relational key-value stores, such as Apache Cassandra (originally developed by Facebook) or membase, has facilitated the shift from ACID to BASE, which stands for “Basically Available, Soft-state, Eventually consistent”. Although BASE’s eventual consistency complicates data integrity, it allows for greater concurrency and scale-out, traits that are crucial when serving millions of queries every second (e.g., Facebook: 13,000,000 queries per second).
