Data Lake Architecture: A New Repository for Data Engineer

Arvind Panwar (GGSIP University, Delhi, India) and Vishal Bhatnagar (Department of Computer Science and Engineering, Ambedkar Institute of Advanced Communication Technologies and Research, New Delhi, India)
Copyright: © 2020 |Pages: 13
DOI: 10.4018/IJOCI.2020010104

Abstract

Data is the biggest asset after people for businesses, and it is a new driver of the world economy. The volume of data that enterprises gather every day is growing rapidly. This rapid growth of data in terms of volume, variety, and velocity is known as Big Data. Big Data is a challenge for enterprises, and the biggest challenge is how to store it. In the past, and in some organizations still today, data warehouses have been used to store Big Data. Enterprise data warehouses work on the concept of schema-on-write, but Big Data analytics requires data storage that works on the schema-on-read concept. To meet this market demand, researchers are working on a new data repository system for Big Data storage known as a data lake. The data lake is defined as a data landing area for raw data from many sources. There are points of confusion and open questions about data lakes that must be answered. The objective of this article is to reduce the confusion and address some questions about data lakes with the help of architecture.

1. Introduction

Data is the biggest asset after people for businesses, and it is a new driver of economic and social change in today’s world. The volume of data that enterprises gather every day is growing rapidly (Bala, Boussaid, & Alimazighi, 2017; Hefer, 2007). Organizations typically maintain their own data warehouse to store large amounts of business data. A data warehouse is designed to capture and store business data from other enterprise systems, for example an inventory system, a supply chain management system, or a customer relationship management system. A data warehouse system allows business users and data analysts to derive value from data and make important decisions to grow their business.

Technology is changing rapidly, and new technologies have come to market for data storage, data processing, and data analysis. New technologies, including streaming data, data from connected devices on the Internet of Things, cloud computing, social media, and high-tech power grids, are driving a much greater volume of data (CITO research, 2014; Hortonworks, 2014). This greater volume of data is driving higher user expectations and the globalization of economies. Data generated from the sources above is not only huge in terms of volume but is also generated at high velocity and in a variety of forms: structured, unstructured, and semi-structured. Data of this kind is known as Big Data. The traditional data warehouse is not suitable for processing and analyzing Big Data. Organizations now understand that traditional data warehouse technologies cannot meet their business needs if they are to compete in an ever-growing market.

As a result, organizations are turning toward Apache Hadoop to store Big Data and gain insights from it. Hadoop is open-source software used for distributed processing and distributed storage of very large data sets on computer clusters built from commodity hardware. Apache Hadoop provides many services, such as data storage, data processing, data access, data governance, data security, data visualization, and operations. Adoption of Hadoop in organizations is growing rapidly: according to a Gartner survey in mid-2015, 26% of enterprises were already deploying and piloting Hadoop as a next-generation data storage and processing framework. According to the same survey, 12% were planning to deploy very soon, and 7 to 10 percent planned to deploy within a year.

Many organizations have experienced success and business growth from these early mainstream Hadoop deployments in the healthcare, retail, financial, and e-commerce sectors. Initially, Hadoop was used as a tactical tool rather than a strategic one, because many were opposed to replacing the data warehouse. They had questions and doubts about whether Hadoop could match their enterprise requirements for scalability, security, performance, and availability. But organizations know that they cannot continue with the data warehouse, because of challenges that come with advances in technology.

As technology advances, the enterprise data warehouse is no longer suitable for the data storage that the current market demands. The enterprise data warehouse works on a schema-on-write architecture: to get data into the warehouse, an extraction, transformation, and loading (ETL) process is required (Cha, Park, Kim, Pan, & Shin, 2018; Khine & Wang, 2018). With this architecture, an organization designs a data model and prepares an analytic plan before loading data. In other words, the organization must know up front, before loading data, how it plans to use that data, and this is very limiting. Big Data analytics requires data storage that works on the schema-on-read concept, in which data is stored in raw format as it is generated. In other words, there is no need to prepare an analytic plan before loading data, and no need to know ahead of time how the data will be used.
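
The contrast between schema-on-write and schema-on-read can be illustrated with a minimal Python sketch. It is not taken from the article: the sample records, the `WAREHOUSE_SCHEMA` mapping, and all function names are hypothetical, and in-memory lists stand in for what would in practice be a warehouse database or a distributed file system.

```python
import json

# Hypothetical raw events, as they might arrive from different sources.
RAW_EVENTS = [
    '{"order_id": 1, "amount": "19.99", "channel": "web"}',
    '{"order_id": 2, "amount": "5.00", "coupon": "SPRING"}',
    '{"sensor": "t-17", "reading": 21.5}',
]

# Schema-on-write: the columns and types are decided before any data is loaded.
WAREHOUSE_SCHEMA = {"order_id": int, "amount": float}


def load_schema_on_write(lines):
    """ETL-style load: keep only the predefined columns; reject records that do not fit."""
    table, rejected = [], []
    for line in lines:
        record = json.loads(line)
        try:
            row = {col: cast(record[col]) for col, cast in WAREHOUSE_SCHEMA.items()}
        except (KeyError, ValueError, TypeError):
            rejected.append(line)  # record does not match the planned model
            continue
        table.append(row)
    return table, rejected


def load_schema_on_read(lines):
    """Data-lake-style load: store every record exactly as it arrived."""
    return list(lines)


def read_with_schema(raw_store, schema):
    """Apply a schema only at query time, skipping records that lack its fields."""
    for line in raw_store:
        record = json.loads(line)
        if all(col in record for col in schema):
            yield {col: cast(record[col]) for col, cast in schema.items()}


if __name__ == "__main__":
    table, rejected = load_schema_on_write(RAW_EVENTS)
    print(len(table), "rows loaded,", len(rejected), "rejected")

    lake = load_schema_on_read(RAW_EVENTS)
    # A question nobody planned for at load time can still be answered.
    print(list(read_with_schema(lake, {"sensor": str, "reading": float})))
```

In this sketch the schema-on-write load discards the `coupon` field and the sensor record because neither was part of the planned model, while the schema-on-read store keeps everything and lets each later query bring its own schema.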
