New Trends in Databases to NonSQL Databases

Antonio Sarasa-Cabezuelo (Facultad de Informática, Universidad Complutense de Madrid, Spain)
DOI: 10.4018/978-1-7998-3479-3.ch054

Abstract

The emergence of the “big data” phenomenon has changed storage and information processing needs. This new context is characterized by three features: 1) enormous amounts of information are available in heterogeneous formats and types, 2) information must be processed almost in real time, and 3) data models evolve periodically. Relational databases are limited in their ability to meet these needs optimally. For these reasons, companies such as Google and Amazon decided to create new database models (different from the relational model) that meet the needs of big data without the limitations of relational databases. These new models are the origin of the so-called NoSQL databases. Today, NoSQL databases are an established alternative to the relational model and are widely used. The main objective of this chapter is to introduce NoSQL databases.

Introduction

In recent decades, the relational model has been the most widely used persistence mechanism in computer applications. It provides a mature and stable model for representing data and offers a widely used standard language for querying and manipulating data (SQL).

However, the emergence of the “Big Data” phenomenon has changed storage and information processing needs. This new context is characterized by three features: a) enormous amounts of information are available in heterogeneous formats and types, b) information must be processed almost in real time, and c) data models evolve periodically.

This requires greater processing capacity, greater flexibility in the types of data that can be stored, and more flexible data models that are easier to adapt.

Relational databases are limited in their ability to meet these needs optimally, given that:

    • The way to obtain more processing capacity is to use more powerful machines (vertical scaling), which is not economically advantageous.

    • The evolution of data models requires modifying the database schemas, which causes maintenance problems and can compromise the consistency of previously stored information.

    • Storing heterogeneous and complex data requires transforming it so that it fits the basic data types, which are the only ones that relational databases manage.
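
The last limitation can be illustrated with a minimal sketch. It takes a nested record and shows two ways of persisting it: flattened into rows of basic types spread over two tables, as a relational schema would require, and kept whole as a single JSON document, as a document-oriented store would typically allow. The record, the table names, and the helper functions are hypothetical and chosen only for illustration; they do not come from the chapter.

```python
import json

# Hypothetical nested record; every field name here is an illustrative assumption.
order = {
    "order_id": 1,
    "customer": {"name": "Ana", "city": "Madrid"},
    "items": [
        {"sku": "A1", "qty": 2},
        {"sku": "B7", "qty": 1},
    ],
}


def to_relational_rows(doc):
    """Flatten the nested record into rows of basic types, grouped by table."""
    order_row = (doc["order_id"], doc["customer"]["name"], doc["customer"]["city"])
    # Each item becomes its own row, linked back to the order by order_id.
    item_rows = [(doc["order_id"], item["sku"], item["qty"]) for item in doc["items"]]
    return {"orders": [order_row], "order_items": item_rows}


def to_document(doc):
    """Serialize the whole record as one self-describing JSON document."""
    return json.dumps(doc, sort_keys=True)


rows = to_relational_rows(order)      # two tables, structure lives in the schema
stored = to_document(order)           # one document, structure travels with the data
restored = json.loads(stored)         # reading it back needs no joins
```

In the flattened form, adding a new nested field means changing the table definitions and the flattening code. In the document form, the new field is simply stored with the records that have it.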

For these reasons, companies such as Google and Amazon decided to create new database models (different from the relational model) that meet the needs of Big Data without the limitations of relational databases. These new models are the origin of the so-called NoSQL databases.

Today, NoSQL databases are an established alternative to the relational model and are widely used.

The main objective of this chapter is to introduce NoSQL databases. Specifically:

  • To show the information processing and persistence needs that arise in Big Data.

  • To describe the main characteristics and the conceptual basis of NoSQL databases.

  • To show how NoSQL databases meet the needs raised by Big Data.

  • To describe the main families of NoSQL databases and the data models on which they are based.

The chapter is structured as follows. Section 1 describes the storage needs that arise in Big Data. Section 2 presents the limitations of relational databases in meeting those needs. Next, NoSQL databases and their main characteristics are introduced. The following section discusses persistence and distribution models. The CAP theorem is then presented. Finally, trends in the NoSQL world are discussed and a set of conclusions is drawn.

Background

Big Data is a technological challenge that arises from the confluence of two phenomena (Chen et al., 2014). On the one hand, hardware storage capacity has increased rapidly while its cost has fallen (in the 1980s it was possible to handle data sizes of 0.02 EB, in 2013 sizes of 4.4 ZB were handled, and it is estimated that in 2020 it will be possible to handle sizes of 44 ZB). On the other hand, some companies and institutions have begun to generate large amounts of data very quickly (for example, Google generates around 20 PB of data daily and Facebook 500 TB). Because storage with sufficient capacity is available, companies begin to exploit the stored data for economic or strategic purposes and to benefit from the results of the analyses. This starts a self-reinforcing cycle: more and more data are generated, which forces the development of media with greater storage capacity. Once the initial storage needs have been met, that capacity can no longer meet current needs (since the amount of data generated grows exponentially), which makes it necessary to create new storage media with even greater capacity.

Key Terms in this Chapter

Replication: It consists of keeping copies of the data on more than one machine of a distributed system, so that the data is not lost if any of the machines fails.

CAP Theorem: It is a theorem stating that a distributed system cannot simultaneously guarantee all three of the following properties: consistency of the data, availability, and tolerance to partitioning of the system.

Big Data: It is a phenomenon that involves a set of tools, technologies, and techniques for processing huge amounts of data of all types in order to exploit the information they contain and obtain some kind of advantage.

Scaling: It refers to the way in which the processing capacity of a system can be increased. There are two types: vertical scaling and horizontal scaling. The first consists of acquiring a machine with better specifications than the previous one, while the second consists of creating a cluster of machines that collaborate with each other, so that the processing power of all the machines is combined.

Relational Database: It is a database designed on the basis of the relational model. Its main characteristic is that it organizes information in tables as the unit of storage, and it uses SQL as its query language.

SQL Language: It is a declarative language used in relational databases to query and manipulate data.

Sharding: It is a way of organizing information that consists of distributing the data among a set of machines. The machines can take different roles, with one machine acting as a server and the rest as clients, or all machines can act with the same role.

Consistency: It refers to the mechanism that ensures that the data retrieved from a database is valid and up to date.

NoSQL Database: It is a database that is not designed on the basis of the relational model. Among other features, it does not use SQL as its query language.
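
The terms sharding and replication defined above can be made concrete with a minimal sketch. It assumes a hash-based placement rule, which is only one of several possible strategies and is not prescribed by the chapter; the node names, the replica count, and the `Cluster` class are hypothetical. Each key is assigned to a node by hashing (sharding), and a copy is also written to the next node in the list (replication), so a read still succeeds when the first node is unavailable.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical machine names
REPLICAS = 2                            # copies kept of each record (assumption)


def shard_index(key, n_nodes):
    """Sharding: map a key to a node index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_nodes


def placement(key, nodes=NODES, replicas=REPLICAS):
    """Replication: the key lives on its shard node and on the nodes that follow it."""
    start = shard_index(key, len(nodes))
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]


class Cluster:
    """In-memory stand-in for a set of machines, one dict per node."""

    def __init__(self, nodes=NODES, replicas=REPLICAS):
        self.nodes = list(nodes)
        self.replicas = replicas
        self.storage = {name: {} for name in self.nodes}

    def put(self, key, value):
        # Write the value to every node chosen by the placement rule.
        for name in placement(key, self.nodes, self.replicas):
            self.storage[name][key] = value

    def get(self, key, down=()):
        # Read from the first replica that is not marked as failed.
        for name in placement(key, self.nodes, self.replicas):
            if name not in down:
                return self.storage[name][key]
        raise KeyError(key)


cluster = Cluster()
cluster.put("user:42", {"name": "Ana"})
primary = placement("user:42")[0]
# Simulate the failure of the first node holding the key.
value_after_failure = cluster.get("user:42", down={primary})
```

The sketch ignores what happens when replicas disagree or when nodes are added or removed, which is where the trade-offs described by the CAP theorem become relevant.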
