NoSQL Databases

Ganesh Chandra Deka

Source Title: Handbook of Research on Cloud Infrastructures for Big Data Analytics

DOI: 10.4018/978-1-4666-5864-6.ch008

Abstract

NoSQL databases are designed to meet the huge data storage requirements of cloud computing and big data processing. They offer many advanced features in addition to conventional RDBMS features, which is why “NoSQL” databases are popularly known as “Not only SQL” databases. A variety of NoSQL databases, with different features for dealing with exponentially growing data-intensive applications, are available as open source and proprietary products. This chapter discusses some of the popular NoSQL databases and their features in the light of the CAP theorem.

Introduction

“NoSQL” is a breed of databases that has appeared in response to the limitations of existing relational databases (RDBMS). NoSQL databases are capable of handling large amounts of structured, unstructured, semi-structured and hybrid data with high performance at reduced complexity and cost.

The foundation of NoSQL movement was laid by the following three major research papers:

1. Google Bigtable
2. Dynamo paper of Amazon (gossip protocol, distributed key-value data store, and eventual consistency)
3. CAP Theorem

Table 1 shows the chronology of the NoSQL movement (Noller, 2013; Vasiliev, 2013).

Table 1. Chronology of Development of NoSQL

Year	Development
1998	Carlo Strozzi introduced the term NoSQL to name his lightweight, open-source relational database, which does not expose the standard SQL interface.
2000	Graph database Neo4j introduced.
2004	Google Bigtable project started; the first paper was published in 2006.
2005	CouchDB launched.
2007	Research paper on Amazon Dynamo released.
2008	Facebook open-sourced the Cassandra project. Project Voldemort started.
2009	The document database MongoDB started as part of an open source cloud computing stack; the first standalone version was released.
2009	The term NoSQL was reintroduced in early 2009. Many commercial and open source NoSQL databases were developed and brought to market by various vendors and communities.

This chapter discusses NoSQL database features in general and the features of the 10 most widely used NoSQL databases in the light of the CAP theorem (http://nosql-database.org/, 2011). Apart from these 10 NoSQL databases, Microsoft Azure (SQL based) and IBM DB2 are also discussed with a focus on big data. Sufficient references are given for the benefit of readers. The important technical terms related to NoSQL are explained at the end of the chapter for ready reference.

NoSQL Features

NoSQL databases provide:

• Scalability (can be scaled horizontally)
• High availability
• Optimized resource allocation and utilization
• Virtually unlimited data store capacity
• Multitenancy

Key Terms in this Chapter

Shard: Sharding physically breaks a large data set into smaller pieces (shards). A database shard is a horizontal partition in a database or search engine; each individual partition is referred to as a shard or database shard. Sharding is an application-managed scaling technique that uses many (hundreds or thousands of) independent databases. Vertical scaling is limited by cost and implementation, since it is difficult to scale vertically in the cloud. Horizontal partitioning is a database design principle whereby rows of a database table are held separately, rather than the table being split into columns. Each partition forms part of a shard, which may in turn be located on a separate database server or physical location. The features of sharding are as follows: 1) data is split across multiple databases (shards); 2) each database holds a subset of the data, selected by either key range or hash; 3) shards are split as data volume or access grows; 4) shards are replicated for availability and scalability; and 5) sharding is the dominant approach for scaling massive websites. Sharding is used in custom applications that require extreme scalability and whose designers are willing to make a number of tradeoffs to achieve it.
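
The two ways of choosing a subset of the data mentioned above (hash or range) can be illustrated with a minimal Python sketch. The function names, the choice of MD5 as the hash, and the example boundary keys are assumptions made for illustration; they are not taken from any particular NoSQL product.

```python
import hashlib
from bisect import bisect_right


def hash_shard(key, num_shards):
    """Pick a shard by hashing the key (stable across processes and machines)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def range_shard(key, upper_bounds):
    """Pick a shard by key range.

    upper_bounds is a sorted list; upper_bounds[i] is the exclusive upper
    bound of shard i, and the last shard takes every remaining key.
    """
    return bisect_right(upper_bounds, key)


# Hypothetical boundaries: shard 0 holds keys < "h", shard 1 holds
# "h" <= key < "p", and shard 2 holds keys >= "p".
BOUNDS = ["h", "p"]
```

Hash placement spreads keys evenly but scatters neighbouring keys across shards; range placement keeps neighbouring keys together, which helps range scans but can create hot spots.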

Resource Description Framework (RDF): RDF is a general method of decomposing any type of data into small pieces, with some rules about the semantics of those pieces. The point is to have a method so simple that it can express any fact, and yet structured enough that computer applications can do useful things with it.
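
In RDF the "small pieces" are subject-predicate-object triples. The following Python sketch, which uses plain tuples rather than an RDF library, shows how a few facts decompose into triples and how an application can query them by pattern. The `ex:` names and the `match` helper are hypothetical and serve only as an illustration.

```python
# Each fact is one (subject, predicate, object) triple.
TRIPLES = {
    ("ex:Cassandra", "rdf:type", "ex:Database"),
    ("ex:Cassandra", "ex:dataModel", "column family"),
    ("ex:MongoDB", "rdf:type", "ex:Database"),
    ("ex:MongoDB", "ex:dataModel", "document"),
}


def match(triples, s=None, p=None, o=None):
    """Return, in sorted order, all triples that agree with the given parts.

    A part left as None acts as a wildcard.
    """
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )
```

For example, `match(TRIPLES, p="rdf:type", o="ex:Database")` lists every subject declared to be a database.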

Memtable: A memtable is essentially a write-back cache of data rows that can be looked up by key. Unlike a write-through cache, writes are batched in the memtable until it is full; a full memtable is then written to disk as an SSTable. The memtable is an in-memory cache whose content is stored as key/column pairs, sorted by key. Each ColumnFamily has a separate memtable, from which column data are retrieved by key. Cassandra first writes each update to the CommitLog and then writes the data to the memtable.
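
The write path described above (commit log first, then memtable, then a flush to an SSTable when the memtable is full) can be sketched in a few lines of Python. This is a toy model under stated assumptions: the class name `TinyStore`, the entry-count limit, and the in-memory lists standing in for disk files are all illustrative and do not reflect Cassandra's actual implementation.

```python
class TinyStore:
    """Toy model of a commit log + memtable + SSTable write path."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # in-memory key -> value cache
        self.sstables = []     # flushed tables: sorted, immutable tuples
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.commit_log.append((key, value))   # 1) log the write first
        self.memtable[key] = value             # 2) then update the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                       # 3) flush when full

    def flush(self):
        # Write the memtable out sorted by key, then start a fresh one.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:               # newest data first
            return self.memtable[key]
        for table in reversed(self.sstables):  # then newest SSTable first
            for k, v in table:
                if k == key:
                    return v
        return None
```

Reads consult the memtable before the flushed tables so that the most recent write for a key wins.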

HTTP Daemon (HTTPD): HTTPD is a software program that runs in the background of a Web server and waits for incoming server requests. The daemon answers the requests automatically and serves the hypertext and multimedia documents over the Internet using HTTP.

InfiniBand Network: A high-speed switched fabric network. In SQL Server PDW, InfiniBand is used for private communication inside a SQL Server Parallel Data Warehouse (PDW) appliance. InfiniBand, which delivers 40 Gbit/s connectivity with application-to-application latency as low as 1 µs, has become a dominant fabric for high-performance enterprise clusters. Its ultra-low latency and near-zero CPU utilization for remote data transfers make InfiniBand ideal for high-performance clustered applications.

Commodity Server Hardware: Commodity server hardware makes cost-effective massively parallel processing (MPP) possible. The hardware configuration of a typical commodity server might include: 1) CPU: 16 cores; 2) RAM: 1 terabyte; 3) disk: 500 terabytes; and 4) Ethernet: 1 Gbit.

Social Computing: Social computing is essentially a cloud computing application through which people share information among themselves. Social networking sites such as Twitter, LinkedIn, and Facebook have recently shown phenomenal popularity. They have become platforms for exchanging views and ideas on issues of common interest in order to reach consensus, as well as for debating issues of conflict. They play a vital role on the political front as well.

MapReduce: MapReduce is an easy-to-use programming model that supports parallel design, since it is highly scalable and works in a distributed way. It is also helpful for huge data processing, large-scale searching, and data analysis within the cloud. It provides its abstraction through two functions, the “mapper” and the “reducer”. The mapper is applied to each input key-value pair to generate an arbitrary number of intermediate key-value pairs. Map: produce a list of (key, value) pairs from an input structured as a (key, value) pair of a different type, i.e. (k1, v1) → list(k2, v2). The reducer is applied to all values associated with the same intermediate key to generate output key-value pairs. Reduce: produce a list of values from an input that consists of a key and a list of values associated with that key, i.e. (k2, list(v2)) → list(v2). MapReduce is capable of supporting many real-world algorithms and tasks. It can divide the input data, schedule the execution of programs over a set of machines, handle machine failures, and manage inter-machine communication. Map/Reduce: 1) is a programming model derived from Lisp and other functional languages; 2) can express many problems; 3) is easy to distribute across nodes; and 4) has good retry/failure semantics. MapReduce provides: 1) automatic parallelization and distribution; 2) fault tolerance; 3) I/O scheduling; and 4) monitoring and status updates. The limitations of MapReduce are: 1) an extremely rigid data flow; 2) operations such as Join, Union, and Split must constantly be hacked in; 3) common operations must be coded by the user; and 4) semantics are hidden inside the map and reduce functions, which makes programs difficult to maintain, extend, and optimize.
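
The map and reduce signatures above can be made concrete with the classic word-count example. The sketch below runs in a single Python process and only imitates the grouping ("shuffle") step that a real framework performs across machines; the function names and the sample input are illustrative assumptions, not part of any MapReduce API.

```python
from collections import defaultdict


def mapper(doc_id, text):
    """(k1, v1) -> list(k2, v2): emit (word, 1) for every word in a document."""
    for word in text.lower().split():
        yield word, 1


def reducer(word, counts):
    """(k2, list(v2)) -> list(v2): sum all counts emitted for one word."""
    return [sum(counts)]


def map_reduce(inputs, mapper, reducer):
    """Run map, group intermediate values by key, then run reduce per key."""
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)          # the "shuffle": group by k2
    return {k2: reducer(k2, values) for k2, values in groups.items()}


DOCS = [("d1", "to be or not to be"), ("d2", "to do")]
```

Because each mapper call sees only one input pair and each reducer call sees only one key, both phases can be distributed across nodes and retried independently after a failure.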

Gossip Protocol: A gossip protocol is a style of computer-to-computer communication protocol inspired by the form of gossip seen in social networks. Modern distributed systems often use gossip protocols to solve problems that might be difficult to solve in other ways, either because the underlying network has an inconvenient structure or is extremely large, or because gossip solutions are the most efficient ones available. Gossip protocols are probabilistic in nature: a node randomly chooses the partner node with which it communicates. They are scalable because each node sends only a fixed number of messages, independent of the number of nodes in the network. In addition, a node does not wait for acknowledgments, nor does it take any recovery action should an acknowledgment not arrive. They achieve fault tolerance because a node receives copies of a message from different nodes. No node has a specific role to play, so a failed node will not prevent other nodes from continuing to send messages. Hence, there is no need for failure detection or specific recovery actions.
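
A small simulation can illustrate how a message spreads when every informed node forwards it to a fixed number of randomly chosen peers per round. This is a simplified push-style sketch under stated assumptions: rounds are synchronous, no messages are lost, and the names `gossip_rounds`, `fanout`, and `max_rounds` are hypothetical.

```python
import random


def gossip_rounds(num_nodes, fanout=1, seed=0, max_rounds=1000):
    """Simulate push gossip starting from node 0.

    Returns (rounds_run, nodes_informed). fanout must be smaller than
    num_nodes, because a node never picks itself as a partner.
    """
    rng = random.Random(seed)       # seeded so that runs are repeatable
    informed = {0}
    rounds = 0
    while len(informed) < num_nodes and rounds < max_rounds:
        newly_informed = set()
        for node in informed:
            peers = [n for n in range(num_nodes) if n != node]
            # Fixed number of messages per node, partners chosen at random,
            # no acknowledgments and no retries.
            newly_informed.update(rng.sample(peers, fanout))
        informed |= newly_informed
        rounds += 1
    return rounds, len(informed)
```

With a fanout of 1 the number of informed nodes can at most double per round, so reaching 64 nodes takes at least 6 rounds; random collisions usually make it take somewhat longer.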

CAP Theorem: The CAP theorem has been widely accepted as a significant driver of NoSQL technology since it was introduced in 2000. It can be explained by considering two nodes on opposite sides of a network partition, where one node is permitted to manipulate data. This leads to the following three situations. First, if one node updates the data, the two nodes become inconsistent; Consistency is forfeited by choosing AP (availability, partition tolerance). Second, to preserve Consistency, the other side of the partition has to act as if it were unavailable, thus forfeiting A (availability). Third, Consistency and Availability can both be preserved only when the nodes are able to communicate with each other, which forfeits P (partition tolerance).
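
The first two situations can be acted out with a toy two-node store. The sketch below is an illustration under simplifying assumptions, not a model of any real database: `TwoNodeStore`, its `mode` flag, and the `Unavailable` exception are hypothetical names, and in CP mode node 1 is arbitrarily chosen as the side that refuses requests during a partition.

```python
class Unavailable(Exception):
    """Raised when a node refuses a request to protect consistency."""


class TwoNodeStore:
    def __init__(self, mode):
        if mode not in ("AP", "CP"):
            raise ValueError("mode must be 'AP' or 'CP'")
        self.mode = mode
        self.values = [None, None]   # one copy of the data per node
        self.partitioned = False     # True while the nodes cannot talk

    def _check(self, node):
        # CP: the far side of the partition acts as if it were unavailable.
        if self.partitioned and self.mode == "CP" and node == 1:
            raise Unavailable("node 1 refuses requests during the partition")

    def write(self, node, value):
        self._check(node)
        self.values[node] = value
        if not self.partitioned:
            self.values[1 - node] = value   # replicate while connected

    def read(self, node):
        self._check(node)
        return self.values[node]
```

In AP mode both nodes keep answering during the partition, so node 1 may return a stale value; in CP mode node 1 raises `Unavailable` instead of returning data that might be out of date.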

Unstructured Data: Data stored in files of different types, for which metadata are either unavailable or incomplete, are termed unstructured data.

BASE: The elasticity of storage and server resources is at the crux of the BASE (Basically Available, Soft state, Eventual consistency) paradigm. BASE databases use strategies to achieve Consistency, Atomicity and Partition tolerance “eventually”. BASE does not flout the CAP theorem but works around it. In BASE, availability and scalability get the highest priority and consistency is weak. BASE is simple and fast. In an asynchronous model, where no clocks are available, it is impossible to provide consistent data, even if stale data may be returned when messages are lost. However, in partially synchronous models it is possible to achieve a practical compromise between consistency and availability.

SSTable: An SSTable provides a persistent, ordered, immutable map from keys to values, where both keys and values are arbitrary byte strings. Operations are provided to look up the value associated with a specified key and to iterate over all key/value pairs in a specified key range. Internally, each SSTable contains a sequence of blocks; typically each block is 64 KB in size, but this is configurable. A block index stored at the end of the SSTable is used to locate blocks; the index is loaded into memory when the SSTable is opened. The features of SSTables are: 1) SSTables are immutable; 2) immutability simplifies caching, sharing across GFS, etc.; 3) there is no need for concurrency control; 4) the SSTables of a tablet are recorded in the METADATA table; 5) garbage collection of SSTables is done by the master; and 6) on a tablet split, split tables can start off quickly on shared SSTables, splitting them lazily.
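
The role of the block index can be shown with a minimal in-memory sketch: rows are sorted once, cut into fixed-size blocks, and a lookup first searches the index to find the single block that could hold the key. The class below is an assumption-laden illustration; it counts rows per block instead of bytes, keeps everything in memory rather than on disk, and its names are not taken from Bigtable or Cassandra.

```python
from bisect import bisect_right


class SSTable:
    """Immutable, sorted key/value map with a per-block index."""

    def __init__(self, items, block_size=2):
        rows = tuple(sorted(items, key=lambda kv: kv[0]))
        self._blocks = tuple(
            rows[i:i + block_size] for i in range(0, len(rows), block_size)
        )
        # The index holds the first key of every block.
        self._index = tuple(block[0][0] for block in self._blocks)

    def get(self, key):
        """Look up one key by searching the index, then one block."""
        pos = bisect_right(self._index, key) - 1
        if pos < 0:
            return None                 # key sorts before the first block
        for k, v in self._blocks[pos]:
            if k == key:
                return v
        return None

    def scan(self, start, end):
        """Yield (key, value) pairs with start <= key < end, in key order."""
        for block in self._blocks:
            for k, v in block:
                if start <= k < end:
                    yield k, v
```

Because the table never changes after construction, the index never needs to be updated and readers need no locking.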

Atomicity, Consistency, Isolation, and Durability (ACID): In computer science, ACID is a set of properties that guarantee that database transactions are processed reliably. Atomicity: Either all of the tasks within a transaction are performed or none of them are. This is the all-or-none principle: if one element of a transaction fails, the entire transaction fails. Consistency: The transaction must comply with all protocols or rules defined by the system at all times. The transaction does not violate those rules, and the database must be in a consistent state at the beginning and the end of a transaction; there are never any half-completed transactions. Isolation: No transaction has access to any other transaction that is in an intermediate or unfinished state; thus, each transaction is independent unto itself. This is required for both performance and consistency of transactions within a database. Durability: Once a transaction is complete, it persists as complete and cannot be undone; it survives system failure, power loss, and other types of system breakdown. ACID provides strong consistency of transactions, with availability being less important.
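
Atomicity and consistency can be observed directly with Python's built-in `sqlite3` module. In the sketch below, the `accounts` table, the `transfer` helper, and the sample balances are made up for illustration; the CHECK constraint plays the part of a system-defined rule, and a transfer that would break it is rolled back as a whole.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER CHECK (balance >= 0))"   # rule: no negative balances
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between accounts; either both updates apply or neither."""
    try:
        with conn:   # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst))
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

The credit is deliberately issued before the debit: when the debit violates the constraint, the already-applied credit is undone too, which is the all-or-none behaviour described above.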

Soft State: In the soft state, a database provides a relaxed view of data in terms of consistency. Information in soft state expires if it is not refreshed. A value stored in soft state may not be up to date, but it is handy for approximations. Because of eventual consistency, soft state data may change over time without user intervention or input. When soft state is lost or made unavailable due to service instance crashes or overloads, reconstructing it through user interaction or third-tier re-access can be expensive in terms of time and resources.
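
The "expires if it is not refreshed" behaviour can be sketched as a small time-to-live cache. The class name, the `ttl_seconds` parameter, and the injectable `clock` argument are assumptions made for this illustration; a real system would also have to decide how to rebuild an entry once it has expired.

```python
import time


class SoftStateCache:
    """Entries disappear unless they are refreshed within ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so tests can control time
        self._entries = {}           # key -> (value, expiry time)

    def put(self, key, value):
        self._entries[key] = (value, self._clock() + self._ttl)

    def refresh(self, key):
        """Extend the lifetime of a live entry; return False if it is gone."""
        value = self.get(key)
        if value is None:
            return False
        self.put(key, value)
        return True

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]   # the soft state has expired
            return None
        return value
```

A returned value may be stale by up to `ttl_seconds`, which is the relaxed view of consistency that the definition describes.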

Hortonworks Data Platform: The Hortonworks Data Platform, powered by Apache Hadoop, is a massively scalable and completely open source platform for storing, processing and analyzing large volumes of data. It is designed to deal with data from many sources and formats in a very quick, easy and cost-effective manner. The Hortonworks Data Platform consists of the essential set of Apache Hadoop projects, including MapReduce, Hadoop Distributed File System (HDFS), HCatalog, Pig, Hive, HBase, Zookeeper and Ambari. Hortonworks is the major contributor of code and patches to many of these projects. These projects have been integrated and tested as part of the Hortonworks Data Platform release process, and installation and configuration tools have also been included. Yahoo! is a development partner of Hortonworks (Hortonworks, n.d.). The objectives of Hortonworks are: 1) making Apache Hadoop projects easier to install, manage and use; 2) making Apache Hadoop more robust; and 3) making Apache Hadoop easier to integrate and extend.
