Role of Open Source Software in Big Data Storage

Rupali Ahuja (University of Delhi, India), Jigyasa Malik (University of Delhi, India), Ronak Tyagi (University of Delhi, India) and R. Brinda (University of Delhi, India)
DOI: 10.4018/978-1-5225-3142-5.ch005


Today, the world revolves around Big Data. Every organization is exploring ways to derive value from the huge volume of data generated each moment. Open Source Software is being widely adopted by academics, researchers and industry practitioners to handle various Big Data needs because of its easy availability, flexibility, affordability and interoperability. As a result, several open source Big Data tools have been developed. This chapter discusses the role of Open Source Software in Big Data Storage and how various organizations have benefited from its use. It provides an overview of the popular Open Source Big Data Storage technologies existing today. Distributed File Systems and NoSQL databases meant for storing Big Data are discussed, along with their features, applications and a comparison.
Chapter Preview


The amount of data generated each second is growing at an exponential rate. Facebook, a social networking website, is home to 40 billion photos, and more than 100 hours of video are uploaded to YouTube every minute. These figures are rising rapidly in almost every field, increasing the interest in and demand for Big Data Storage and management technologies. A forecast from International Data Corporation (IDC) sees the Big Data technology and services market growing at a Compound Annual Growth Rate (CAGR) of 23.1% over the 2014-2019 forecast period, with annual spending reaching $48.6 billion in 2019 (IDC, 2016).

Open Source tools are playing a prominent role in managing Big Data Storage issues. The most dominant technologies in the Big Data world, Hadoop and Apache Spark, are Open Source tools. The most popular Big Data software distribution companies, such as Cloudera and Hortonworks, have built their businesses around open source technologies. Open Source is the platform best suited for Big Data solutions. Almost all Big Data solutions run on top of UNIX-like operating systems such as Linux, which is open source. Without open source tools, the Big Data world would not have grown so rapidly. According to Talend’s CEO, Mike Tuchen, “the entire next-generation data platform will be open source” (Noyes, 2016).

Key Terms in this Chapter

Distributed File Systems (DFS): A file system in which files are distributed across multiple storage resources but appear to users as if they exist in a single location.

MVCC (Multi-Version Concurrency Control): A concurrency control method that allows concurrent access to the database without using a locking mechanism, by maintaining different versions of the same data.

Sharding: A database partitioning scheme in which datasets are distributed across nodes to balance load and improve performance.

Inode: In UNIX, an inode is a data structure used to represent a file system object. It stores the attributes of the object and the disk block locations of its data.

Multi-Master Replication: A method of database replication in which a group of computers store and update data. All members can handle client requests, and each is responsible for transmitting its modifications to the rest of the group.

Master-Slave Replication: A method of database replication in which data is stored by a group of computers but can be updated by only one member, the “master” of the group. The master is in charge of the group, while the other database servers (the “slaves”) keep copies of all the data written to the master and can be queried. Data cannot be written to the slaves directly.

Geographic Replication: A replication system in which data is replicated across servers that are geographically far apart, so that clients can be served from a nearby copy and network performance improves.
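
Two of the key terms above, Sharding and MVCC, can be illustrated with a small sketch. The Python below is only a toy model under stated assumptions, not the implementation used by any particular Big Data store: the names shard_for and VersionedStore are hypothetical, shards are chosen by a simple hash of the key modulo the shard count, and versions are stamped by a single in-memory counter instead of real transaction IDs.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index (hypothetical helper).

    A stable hash is used instead of Python's built-in hash(), which is
    salted per process and would send the same key to different shards
    across restarts.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


class VersionedStore:
    """Toy multi-version store: writes append versions, reads use a snapshot."""

    def __init__(self):
        self._versions = {}  # key -> list of (commit_timestamp, value)
        self._clock = 0      # logical clock; assumption: one writer at a time

    def write(self, key, value):
        # Never overwrite in place: add a new version with a newer timestamp.
        self._clock += 1
        self._versions.setdefault(key, []).append((self._clock, value))
        return self._clock

    def snapshot(self):
        # A reader remembers the clock value at the moment it starts.
        return self._clock

    def read(self, key, snapshot_ts):
        # Return the newest version committed at or before the snapshot,
        # so later writes do not disturb a reader that is already running.
        for ts, value in reversed(self._versions.get(key, [])):
            if ts <= snapshot_ts:
                return value
        return None


if __name__ == "__main__":
    # Sharding: each key is routed to one of four hypothetical nodes.
    for key in ("user:1", "user:2", "user:3"):
        print(key, "-> shard", shard_for(key, 4))

    # MVCC: a reader holding an old snapshot still sees the old value.
    store = VersionedStore()
    store.write("balance", 100)
    snap = store.snapshot()
    store.write("balance", 250)
    print(store.read("balance", snap), store.read("balance", store.snapshot()))
```

Real systems differ in important ways: production sharding schemes often use range partitioning or consistent hashing so that adding a node does not remap most keys, and real MVCC implementations also garbage-collect old versions and handle conflicting concurrent writers.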
