Database Systems for Big Data Storage and Retrieval

Database Systems for Big Data Storage and Retrieval

Venkat Gudivada, Amy Apon, Dhana L. Rao
DOI: 10.4018/978-1-5225-3142-5.ch003
(Individual Chapters)
No Current Special Offers


Special needs of Big Data applications have ushered in several new classes of systems for data storage and retrieval. Each class targets the needs of a category of Big Data application. These systems differ greatly in their data models and system architecture, approaches used for high availability and scalability, query languages and client interfaces provided. This chapter begins with a description of the emergence of Big Data and data management requirements of Big Data applications. Several new classes of database management systems have emerged recently to address the needs of Big Data applications. NoSQL is an umbrella term used to refer to these systems. Next, a taxonomy for NoSQL systems is developed and several NoSQL systems are classified under this taxonomy. Characteristics of representative systems in each class are also discussed. The chapter concludes by indicating the emerging trends of NoSQL systems and research issues.
Chapter Preview

Functional Inadequacy Of Relational Database Management Systems (Rdbms) For Big Data Applications

Underlying all the RDBMS (Relational Database Management Systems) is the relational data model for structuring data and the ISO/ANSI standard SQL for data manipulation and querying. The relational data model is based on first-order predicate logic and lends itself naturally for providing a declarative method for specifying queries on the database (Codd, 1970). The SQL language is originally based on relational algebra and tuple relational calculus (Chamberlin, 1974).

RDBMS is inherently inadequate to address the data management needs of Big Data applications. Because semantically related data is fragmented across various tables, it requires several joins to process typical queries. Query latency times are simply too high for these applications. Moreover, impedance mismatch problems arise in RDBMS due to the difference between the relational data model structures on the disk and in-memory data structures of applications. Often Object Relational Mapping (ORM) frameworks such as Hibernate are used to automatically generate the code needed to map relational structures to in-memory application data structures. Though these frameworks help in ORM code generation, the same code exacerbates the query latency times.

Key Terms in this Chapter

MapReduce: Is a computational paradigm for processing massive datasets in parallel if the computation fits a three-step pattern: map, shard and reduce. The map process is a parallel one. Each process executes on a different part of data and produces (key, value) pairs. The shard process collects the generated pairs, sorts and partitions them. Each partition is assigned to a different reduce process which produces a single result.

Memory-Mapped File: A memory-mapped file is a segment of virtual memory which is associated with an operating system file or file-like resource (e.g., a device, shared memory) on a byte-for-byte correspondence. Memory-mapped files increase I/O performance especially for large files.

Hashing: Generating a fixed length output as a unique and shortened representation for a given piece of data. The fixed length output is called the hash code and the algorithm that generates the output is called a hash function .

Client-Server Architecture: A computing model for distributed applications. Workload is divided between a server and client. Typically client and server communicate over a computer network, though a client and server may reside on the same computer. Server provides a service which clients can seek by requesting.

Vector Clock: An algorithm for synchronizing data in a distributed system. It is used to determine which version of data is the most up-to-date by reasoning about events based on event timestamps.

Replication: Multiple copies of the same data are stored on different computers (aka nodes) to improve data availability and query performance. When a data item is updated at one node, its copies at other nodes are either updated simultaneously (synchronous replication) or at a later time (asynchronous replication). Replication can be continuous or done according to a schedule.

JavaScript Object Notation (JSON): Is a lightweight, text-based, open standard format for exchanging data between applications. Though it is originally derived from the JavaScript language, it is a language-neutral data format.

Commodity Hardware: Devices or components that are relatively inexpensive, widely available and easily interchangeable with other hardware of their type.

Protobuf: Protocol buffers (Protobuf) is a method for efficiently serializing structured data from an application so that it can be stored in a database. Protobuf is also used in applications that perform Remote Procedure Calls (RPC).

Shared-Nothing Architecture: A computer architecture where each processor (aka node) is self-sufficient and acts independently to remove single point of resource contention or failure. Nodes share neither memory nor disk storage. Variations include master-master (aka multi-master) and master-slave.

Binary JavaScript Object Notation: ( BSON) : Is a format for binary-coded serialization of JSON-like documents. See JSON.

Horizontal Scaling: Database systems based on shared-nothing architecture can be made to accommodate increased workloads by simply adding new nodes that are built from commodity hardware. No other changes are required to the application. Horizontal scaling is also known as scaling out. Also see Vertical Scaling.

Database Normalization: Designing tables in a relational database in a way to eliminate insertion, update and deletion anomalies. It increases the number of joins required to process a query.

Vertical Scaling: Adding more processing power to the same computer (aka scaling up). Compared to horizontal scaling, vertical scaling is more expensive and limiting.

Sharding: Distributing data across the nodes in a non-overlapping manner. When this task is done by a system in a manner transparent to the user, it is called auto-sharding. Client-managed sharding refers to data distribution specified through application logic.

Partitioned Data: In the context of a distributed system, data that is distributed in a non-overlapping manner across different processors or compute nodes.

Partition Tolerance: Ensures that a distributed system works well with data that is partitioned across a physical network.

Create, Read, Update, and Delete (CRUD) Operations: For any database system, Create (aka insert), Read (aka retrieve), Update and Delete operations are considered fundamental and minimal. CRUD is an acronym for these four operations.

CAP Theorem: Consistency, Availability and Partition tolerance (CAP) are the three primary concerns that determine which data management system is suitable for a given application. The CAP theorem states that it is impossible for any system to achieve all these three features at the same time. For example, to achieve partition tolerance, a system may need to give up consistency or availability. The degree to which these three features are available in a system should be viewed as spanning a spectrum rather than as binary values.

Query Latency: Is the amount of time it takes to execute a query and receive the result.

Database as a Service: A cloud operator provides traditional database administration functions for a fee. This enables applications of an organization to use database services without the need for local technical expertise. The cloud operator provisions resources on demand to meet the applications workload.

Basic Availability, Soft State and Eventual Consistency (BASE): ACID properties are to RDBMS as BASE properties are to NoSQL systems. BASE refers to Basic availability, Soft state and Eventual consistency. Basic availability implies continuous system availability despite network failures and tolerance to temporary inconsistency. Eventual consistency means that if no further updates are made to a given updated database item for long enough period of time , all users will see the same value for the updated item. Soft state refers to state change without input which is required for eventual consistency.

Atomicity, Consistency, Isolation and Durability (ACID): Atomicity, Consistency, Isolation and Durability (ACID) comprise a set of properties that characterize database transaction execution. Atomicity refers to ensuring that all the steps in a transaction are executed as a unit step. The consistency property guarantees that execution of a transaction takes the database from one valid state to another valid state. The isolation property guarantees that the effect of a set of concurrently executing transactions on the database is same as executing them in some serial order. The durability property ensures that the effects of a committed transaction are permanent against any type of failure.

Thrift: A Remote Procedure Call (RPC) framework and an Interface Definition Language (IDL) for cross-language services and API development.

Web 2.0 Applications: A new generation of Web applications which allow users to interact and collaborate with each other and contribute content. Examples include social media applications, blogs, wikis, folksonomies, image and video sharing sites and crowdsourcing.

Versioning: Maintaining multiple time-stamped copies of a data item.

Hash tree: A tree in which every non-leaf node is labeled with the hash of the labels of its child nodes. Hash trees (aka Merkle trees) enable efficient and secure verification of data transmitted between computers for veracity.

Structured Query Language (SQL): Is an ANSI/ISO standard declarative language for querying and manipulating relational databases.

Representational State Transfer (REST): API: Is a minimal overhead Hypertext Transfer Protocol (HTTP) API for interacting with independent software systems. REST uses four HTTP methods -- GET (for reading data), POST (for writing data), PUT (for updating data) and DELETE (for removing data).

Application Programming Interface (API): Is a specification for interacting with software libraries or systems.

Database Schema: A roadmap of the database that depicts structure of tables, relationships between tables and data integrity constraints.

Consistent Hashing: In traditional hashing, a change in the number of slots in the hash table results in nearly all keys remapped to new slots. If $K$ is the number of keys and $n$ is the number of slots, consistent hashing guarantees that on average no more than $K/n$ keys are remapped to new slots.

High Availability: Assurance that all clients of an application can always read data from and write data to the application.

Database Transaction: A unit of work in RDBMS. It is an all or nothing proposition. All subtasks of a transaction are executed as a unit. If a subtask fails, all of the subtasks preceding the failed subtask are undone. A transaction execution takes the database from one valid state to another valid state. See ACID.

Complete Chapter List

Search this Book: