NoSQL Databases

Mainak Adhikari (IMPS College of Engineering and Technology, India) and Sukhendu Kar (IMPS College of Engineering and Technology, India)
DOI: 10.4018/978-1-4666-6559-0.ch006

Abstract

NoSQL databases provide a mechanism for storing and accessing data across multiple storage clusters. NoSQL databases are finding significant and growing industry use in meeting the huge data storage requirements of Big Data, real-time applications, and Cloud Computing. NoSQL databases have many advantages over conventional RDBMSs. NoSQL systems are also referred to as “Not only SQL” to emphasize that they may in fact allow structured languages like SQL and, additionally, support semi-structured as well as unstructured data. A variety of NoSQL databases with different features for dealing with exponentially growing data-intensive applications are available as open source and proprietary options, mostly prompted by, and used by, social networking sites. This chapter discusses some features and challenges of NoSQL databases, and some of the popular NoSQL databases with their features in the light of the CAP theorem.

Introduction

“NoSQL”, or “Not Only SQL”, refers to non-relational database management systems, largely distributed, that differ from relational database systems in some meaningful ways. The data structures of a NoSQL system differ from those of a traditional relational database management system (RDBMS), and for this reason some operations are faster in NoSQL. NoSQL is designed for extremely high volumes of disparate data types, i.e. for storing and retrieving data at a very large scale (for example, Google, Facebook, Google+, Google Bigtable, Amazon Dynamo, and Twitter FlockDB collect and store terabits of data for their users every day). NoSQL database management systems started off with a different set of goals and evolved in a different environment. They are operationally different and probably provide more graceful solutions to today’s data storage problems. “NoSQL” is a breed of databases that has appeared in response to the limitations of existing relational databases. NoSQL databases are capable of handling large amounts of structured, unstructured, semi-structured, and hybrid data with high performance at reduced complexity and cost. This chapter discusses NoSQL database features in general and the features of the most widely used NoSQL databases in the light of the CAP theorem.

NoSQL Database Types

  • Key-Value Stores: These are the simplest NoSQL databases (shown in Figure 1).
    ◦ A system that stores values indexed for retrieval by keys;
    ◦ Structured or unstructured data can be held by the system;
    ◦ Data model: (key, value) pairs;
    ◦ Operations: Insert(key, value), Fetch(key), Update(key), Delete(key);
    ◦ Efficiency, scalability, fault tolerance;
    ◦ Replication, single-record transactions, “eventual consistency”;
    ◦ Examples: Voldemort, Dynamo.

  • Column-Oriented Databases (shown in Figure 2):
    ◦ Column-oriented databases contain one extendable column of closely related data.
    ◦ Facebook created the high-performance Cassandra to help power its website.
    ◦ Examples: HBase, Cassandra, Hypertable, Bigtable.

  • Document-Based Databases: Each key is paired with a complex data structure (shown in Figure 3).
    ◦ Data are stored and organized as a collection of documents, rather than as structured tables with uniform-sized fields for each record.
    ◦ Users can add any number of fields of any length to a document.
    ◦ Like key-value stores, except that the value is a document.
    ◦ Data model: (key, document) pairs.
    ◦ Document: JSON, XML, or other semi-structured formats.
    ◦ Operations: Insert(key, document), Fetch(key), Update(key), Delete(key).
    ◦ Documents can also be fetched based on their contents.
    ◦ Examples: CouchDB, MongoDB, etc.

  • Graph Databases: These are designed for data whose relations are well represented as a graph, and are used to store information about networks (shown in Figure 4).
    ◦ Interfaces and query languages vary.
    ◦ Queries range from single-step traversals to “path expressions” to full recursion.
    ◦ RDF “triple stores” can be mapped to a graph database.
    ◦ Examples: Neo4j, FlockDB, HyperGraphDB.
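
The operations listed above for key-value and document stores can be illustrated with a minimal in-memory sketch. It is not the API of any of the named products: the class names `KeyValueStore` and `DocumentStore`, the `find` method, and the choice to pass the new value to `update` are assumptions made for illustration only.

```python
from typing import Any, Callable, Dict, List, Optional


class KeyValueStore:
    """Toy in-memory (key, value) store; the value is opaque to the store."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def insert(self, key: str, value: Any) -> None:
        if key in self._data:
            raise KeyError(f"key already exists: {key}")
        self._data[key] = value

    def fetch(self, key: str) -> Optional[Any]:
        # Returns None when the key is absent.
        return self._data.get(key)

    def update(self, key: str, value: Any) -> None:
        # Assumption: an update replaces the whole value for an existing key.
        if key not in self._data:
            raise KeyError(f"no such key: {key}")
        self._data[key] = value

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


class DocumentStore(KeyValueStore):
    """Same operations, but values are documents that can be queried by content."""

    def find(self, predicate: Callable[[dict], bool]) -> List[str]:
        # Content-based fetch: return the keys of all matching documents.
        return sorted(k for k, doc in self._data.items() if predicate(doc))


kv = KeyValueStore()
kv.insert("session:42", b"opaque-bytes")
kv.update("session:42", b"new-bytes")

docs = DocumentStore()
# Documents need not share the same fields.
docs.insert("u1", {"name": "Ada", "city": "London"})
docs.insert("u2", {"name": "Lin", "city": "Paris", "langs": ["fr", "en"]})
london_keys = docs.find(lambda d: d.get("city") == "London")
```

The only difference between the two classes is that the document store can look inside its values, which is what makes a fetch based on document contents possible.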

Figure 1. Diagram of key-value database

Figure 2. Diagram of column-oriented database

Figure 3. Diagram of document-based database

Figure 4. Diagram of graph database

The foundation of the NoSQL movement was laid by the following four major research papers:

  1. Google Bigtable,
  2. The Dynamo paper of Amazon (gossip protocol, distributed key-value data store, and eventual consistency),
  3. The CAP theorem,
  4. BASE transactions.

Key Terms in this Chapter

Memtable: A memtable is basically a write-back cache of data rows that can be looked up by key; i.e., unlike in a write-through cache, writes are batched up in the memtable until it is full, and when a memtable is full it is written to disk as an SSTable. A memtable is an in-memory cache with content stored as key/column pairs. Memtable data are sorted by key; each ColumnFamily has a separate memtable, and column data are retrieved by key. Cassandra writes are first written to the CommitLog; after writing to the CommitLog, Cassandra writes the data to the memtable.
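
The write path described in this entry (commit log first, then memtable, then a flush to a sorted table once the memtable is full) can be sketched as follows. This is a simplified illustration and not Cassandra code: the class `TinyWritePath`, the entry-count limit, the in-memory stand-ins for on-disk files, and the read path are all assumptions.

```python
from typing import Dict, List, Optional, Tuple


class TinyWritePath:
    """Toy model of a commit log, a memtable, and flushed sorted tables."""

    def __init__(self, memtable_limit: int = 3) -> None:
        self.commit_log: List[Tuple[str, str]] = []
        self.memtable: Dict[str, str] = {}
        # Each flushed table is an immutable tuple of (key, value) sorted by key.
        self.sstables: List[Tuple[Tuple[str, str], ...]] = []
        self.memtable_limit = memtable_limit

    def write(self, key: str, value: str) -> None:
        self.commit_log.append((key, value))  # 1. record the write in the log
        self.memtable[key] = value            # 2. then apply it to the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                      # 3. batch-write when full

    def flush(self) -> None:
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable.clear()
        # Simplification: flushed writes no longer need the log for recovery.
        self.commit_log.clear()

    def read(self, key: str) -> Optional[str]:
        # Assumed read path: newest data first (memtable, then newest table).
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None


db = TinyWritePath(memtable_limit=2)
db.write("b", "2")
db.write("a", "1")  # memtable is now full and is flushed as one sorted table
db.write("c", "3")  # stays in the fresh memtable
```

Batching writes in memory and flushing them in key order is what distinguishes this write-back behaviour from a write-through cache, where every write would go to disk immediately.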

SSTable: A Sorted String Table (SSTable) is a file of key/value string pairs, sorted by key. An SSTable can be completely mapped into memory, which allows lookups and scans to be performed without touching the disk. An SSTable provides a persistent, ordered, immutable map from keys to values, where both keys and values are arbitrary byte strings. Operations are provided to look up the value associated with a specified key and to iterate over all key/value pairs in a specified key range. The SSTable is a simple abstraction for efficiently storing large numbers of key-value pairs while optimizing for high-throughput, sequential read/write workloads. Internally, each SSTable contains a sequence of blocks; typically each block is 64 KB in size, but this is configurable. A block index stored at the end of the SSTable is used to locate blocks; the index is loaded into memory when the SSTable is opened. The features of SSTables are as follows: SSTables are immutable, which simplifies caching and sharing across GFS and removes the need for concurrency control; the index and the data are serializable; a bloom filter is used; the SSTables of a tablet are recorded in the METADATA table; garbage collection of SSTables is done by the master; and on a tablet split, the split tablets can start off quickly on shared SSTables, splitting them lazily. SSTables are also used in Cassandra (data format, indexing, serialization, searching).
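
The two operations named in this entry, key lookup and key-range iteration over a block-indexed sorted table, can be sketched in memory as follows. The class `TinySSTable` is hypothetical, blocks are sized by entry count rather than by bytes, and nothing is written to disk, so this shows only the lookup logic and not a real file format.

```python
import bisect
from typing import Dict, Iterator, List, Optional, Tuple

Pair = Tuple[bytes, bytes]


class TinySSTable:
    """Immutable, sorted key/value table split into fixed-size blocks."""

    def __init__(self, items: Dict[bytes, bytes], block_size: int = 2) -> None:
        ordered = sorted(items.items())
        self._blocks: List[List[Pair]] = [
            ordered[i:i + block_size] for i in range(0, len(ordered), block_size)
        ]
        # Block index: the first key of each block, kept in memory.
        self._index: List[bytes] = [block[0][0] for block in self._blocks]

    def get(self, key: bytes) -> Optional[bytes]:
        # Binary-search the index for the only block that could hold the key.
        pos = bisect.bisect_right(self._index, key) - 1
        if pos < 0:
            return None
        for k, v in self._blocks[pos]:
            if k == key:
                return v
        return None

    def scan(self, start: bytes, end: bytes) -> Iterator[Pair]:
        """Yield pairs with start <= key < end, in key order."""
        pos = max(bisect.bisect_right(self._index, start) - 1, 0)
        for block in self._blocks[pos:]:
            for k, v in block:
                if k >= end:
                    return
                if k >= start:
                    yield k, v


table = TinySSTable({b"d": b"4", b"a": b"1", b"c": b"3", b"b": b"2", b"e": b"5"})
value_d = table.get(b"d")
in_range = list(table.scan(b"b", b"e"))
```

Because the table never changes after construction, the index can be built once and shared by any number of readers without locking.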

Commodity Server Hardware: A commodity server is a commodity computer that is dedicated to running server programs and carrying out associated tasks. Commodity hardware can arise from any technologically mature product in a mature market. Commodity server hardware makes cost-effective massively parallel processing (MPP) possible. The hardware configuration of a typical commodity server might comprise a 16-core CPU, 1 terabyte of RAM, 500 terabytes of disk, and 1 Gbit Ethernet.

Infiniband Network: InfiniBand is a high-speed switched-fabric computer network communications link used in high-performance computing and enterprise data centers. In SQL Server PDW, InfiniBand is used for private communication inside a SQL Server Parallel Data Warehouse (PDW) appliance. InfiniBand, which delivers 40 Gb/s connectivity with application-to-application latency as low as 1 µs, has become a dominant fabric for high-performance enterprise clusters. Its ultra-low latency and near-zero CPU utilization for remote data transfers make InfiniBand ideal for high-performance clustered applications. The features of an InfiniBand network are high throughput, low latency, quality of service, and failover, and it is designed to be scalable.

Softstate: Soft state is state that is useful for efficiency but not essential, as it can be regenerated or replaced if needed. In networking, soft state referred to state that could be discarded by routers as a local decision (e.g. because a router was out of room), while the user would continue to receive service, albeit perhaps at a degraded level. In a database, soft state provides a relaxed view of data in terms of consistency. Information in soft state expires if it is not refreshed. A value stored in soft state may not be up to date, but it is handy for approximations. Because of eventual consistency, soft-state data may change over time without user intervention or input.
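
The expiry behaviour described in this entry (information lapses unless it is refreshed) can be illustrated with a small time-to-live table. The class `SoftStateTable`, its `ttl` parameter, and the injectable `clock` are assumptions chosen to keep the example deterministic; they do not come from any particular system.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class SoftStateTable:
    """Entries silently expire unless they are refreshed within `ttl` seconds."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}

    def refresh(self, key: str, value: str) -> None:
        # Installing or re-announcing a value restarts its lifetime.
        self._entries[key] = (value, self._clock() + self._ttl)

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Discarding is a purely local decision; the owner can regenerate it.
            del self._entries[key]
            return None
        return value


now = [0.0]  # fake clock so the example does not depend on real time
routes = SoftStateTable(ttl=10.0, clock=lambda: now[0])
routes.refresh("route:A", "via-B")
now[0] = 5.0
fresh = routes.get("route:A")    # within the lifetime, still available
now[0] = 12.0
expired = routes.get("route:A")  # never refreshed, so it has lapsed
```

A reader of such a table has to tolerate both stale values and missing ones, which is the relaxed view of consistency the entry refers to.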

Resource Description Framework (RDF): RDF is a general method of decomposing any type of data into small pieces, with some rules about the semantics of those pieces. The point is to have a method so simple that it can express any fact, and yet structured enough that computer applications can do useful things with it. NoSQL data management systems have emerged as a commonly used infrastructure for handling big data outside the RDF space. Although NoSQL systems are increasingly used to manage RDF data, it is still difficult to grasp their key advantages and drawbacks in this context.
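
In RDF the "small pieces" are subject-predicate-object triples. The sketch below decomposes two records into triples and answers a simple pattern query. The helper names `to_triples` and `match` and the `ex:` identifiers are hypothetical, and plain strings stand in for the IRIs and literals that real RDF uses.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def to_triples(subject: str, record: Dict[str, str]) -> List[Triple]:
    """Decompose one record into one triple per field."""
    return [(subject, predicate, obj) for predicate, obj in sorted(record.items())]


def match(
    triples: List[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> List[Triple]:
    """Return the triples that agree with every position that is not None."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


triples = to_triples("ex:alice", {"ex:name": "Alice", "ex:knows": "ex:bob"})
triples += to_triples("ex:bob", {"ex:name": "Bob"})
who_alice_knows = match(triples, s="ex:alice", p="ex:knows")
```

Each triple can also be read as a labelled edge from subject to object, which is why triple stores map naturally onto graph databases.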

Social Computing: Social computing is an area of computer science that is concerned with the intersection of social behavior and computational systems. Social computing is basically a Cloud computing application through which the masses share information among themselves. Social computing research works on models of information processing that approach the subject from both angles. Social networking sites such as Twitter, LinkedIn, and Facebook have recently shown phenomenal popularity. They have become platforms for exchanging views and ideas on issues of common interest in order to reach a consensus, as well as for debating issues of conflict.

Httpd: Httpd is the Apache Hypertext Transfer Protocol (HTTP) server program. It is designed to be run as a standalone daemon process. Httpd should not be invoked directly, but rather should be invoked via apachectl on Unix-based systems, or as a service on Windows NT, 2000, and XP and as a console application on Windows 9x and ME. Httpd stands for “HTTP daemon”, a software program that runs in the background of a Web server and waits for incoming server requests. The daemon answers the requests automatically and serves hypertext and multimedia documents over the Internet using HTTP.

Unstructured Data: Data stored in files of different types, for which metadata are either unavailable or incomplete, are termed unstructured data. (Metadata are data about data: they describe other data and provide information about the content of a given item.) Unstructured data refers to information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may also contain data such as dates, numbers, and facts.
