Towards Extract-Transform-Load Operations in a Big Data context

Hana Mallek, Faiza Ghozzi, Faiez Gargouri
Copyright: © 2020 | Pages: 19
DOI: 10.4018/IJSKD.2020040105


Big Data emerged from the explosion of data generated by Web 2.0, digital sensors, and social media applications such as Facebook and Twitter. This constant growth of data affects many domains, especially decision support systems, where integration processes must be adapted to handle such volumes of data and to improve analysis. The purpose of this research article is to adapt extract-transform-load (ETL) processes to Big Data technologies in order to support both this growth of data and knowledge discovery. The article proposes a new approach, Big Dimensional ETL (BigDimETL), which handles the basic ETL operations while taking the multidimensional structure into account. To accelerate data handling, the MapReduce paradigm is used to enhance data warehousing capabilities, with HBase as the distributed storage mechanism. Experimental results confirm that ETL performs well, especially with the adapted operations.

1. Introduction

Nowadays, the availability of smart devices and communication systems used by billions of end users has led to the emergence of the notion of Big Data, which is characterized by four Vs as presented by Gartner: volume, velocity, variety, and veracity. From this perspective, statistics reported by IBM analytics indicate that 6 billion of the world's 7 billion people have at least one smartphone. This has greatly increased the number of people connected to the internet, which multiplied the volume of data 300 times from 2005 to 2020, amounting to about 40 zettabytes of data. Hence, Big Data is a term applied to gigantic, unstructured, and heterogeneous datasets whose size and type exceed the ability of traditional relational databases to capture, manage, and process data. This explosion of data and technologies is therefore a major challenge for multiple domains, particularly Decision Support Systems (DSS) (Kimball & Caserta, 2011) and especially data mining and knowledge discovery activities (Storey & Song, 2017). Within this frame of reference, many researchers have focused on this evolution by handling and analyzing sensor data (Trauth & Browning, 2018), logistic data (AlShaer et al., 2019), social media data (Gupta & Aluvalu, 2019), etc. The significance of analyzing this evolution makes it essential to consider the core component of a DSS, Extract-Transform-Load (ETL). According to (Vassiliadis et al., 2002), ETL implementation may take up to 80% of a data warehouse (DW) project. Typically, ETL is composed of several operations, such as selection, conversion, filtering, and join, executed sequentially in order to capture, integrate, and filter data before it is loaded into the DW. In fact, these classical operations cannot cope with the growth of data.
Moreover, Relational Database Management Systems (RDBMS) are not suitable for distributed databases, as argued in (Boussahoua et al., 2017). For this reason, the ETL process needs to be adapted to deal with the explosion of data so that it can deliver processed data to the DW.
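
To make the classical, sequential style of ETL concrete, the following is a minimal single-machine sketch of the four operations named above (selection, conversion, filtering, join) chained one after another. It is not taken from the article: the sample rows, the field names, and all function names are hypothetical and chosen only for illustration.

```python
# Hypothetical source rows and dimension data; names are illustrative only.
orders = [
    {"order_id": 1, "customer_id": 10, "amount": "25.50", "status": "paid", "note": "x"},
    {"order_id": 2, "customer_id": 11, "amount": "n/a", "status": "paid", "note": "y"},
    {"order_id": 3, "customer_id": 10, "amount": "40.00", "status": "cancelled", "note": "z"},
]
customers = {10: {"name": "Ada"}, 11: {"name": "Linus"}}


def select(rows, fields):
    """Selection: keep only the named fields of each row."""
    return [{f: row[f] for f in fields} for row in rows]


def convert(rows):
    """Conversion: cast 'amount' to float; unparseable values become None."""
    out = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            amount = None
        out.append({**row, "amount": amount})
    return out


def filter_rows(rows):
    """Filtering: drop rows that are not paid or have no valid amount."""
    return [r for r in rows if r["status"] == "paid" and r["amount"] is not None]


def join(rows, dim):
    """Join: enrich each row with the matching customer name."""
    return [
        {**r, "customer_name": dim[r["customer_id"]]["name"]}
        for r in rows
        if r["customer_id"] in dim
    ]


def run_pipeline():
    """Run the four operations one after another, as in a classical ETL flow."""
    rows = select(orders, ["order_id", "customer_id", "amount", "status"])
    rows = convert(rows)
    rows = filter_rows(rows)
    return join(rows, customers)


result = run_pipeline()
```

Each step here consumes the full output of the previous one in memory on a single machine, which is the property that stops scaling as the input grows.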

Several technologies have appeared with the emergence of Big Data, such as MapReduce (Dean & Ghemawat, 2008), introduced by Google to process large amounts of data. In addition, NoSQL (Not only SQL) databases have appeared to store unstructured data; they may be column-oriented, such as HBase (George, 2011), document-oriented, such as MongoDB (Chodorow & Dirolf, 2010), or graph-oriented (Pokorný, 2015).
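
As an illustration of the MapReduce paradigm mentioned above, the sketch below simulates its three stages (map, shuffle, reduce) in plain Python on one machine. It does not use Hadoop or any real MapReduce API; the function names and the sample sales records are assumptions made for the example. In an actual cluster, the map and reduce calls would run in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and collect the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer once per key to the list of values for that key."""
    return {key: reducer(key, values) for key, values in groups.items()}


def sales_mapper(record):
    """Hypothetical mapper: emit one (region, amount) pair per sales record."""
    yield (record["region"], record["amount"])


def sum_reducer(key, values):
    """Hypothetical reducer: total the amounts for one region."""
    return sum(values)


# Illustrative input only.
sales = [
    {"region": "north", "amount": 10},
    {"region": "south", "amount": 5},
    {"region": "north", "amount": 20},
]

totals = reduce_phase(shuffle(map_phase(sales, sales_mapper)), sum_reducer)
```

Because each mapper call looks at a single record and each reducer call at a single key, both stages can be distributed without coordination beyond the shuffle.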

This paper presents the BigDimETL approach, which applies Big Data technologies that support scalability and performance in order to adapt the extraction and transformation phases of ETL. Within this context, the adaptation of data processing is based on adding parallelism through the Hadoop (White, 2012) ecosystem. The latter is an open-source framework for handling unstructured data using a parallel processing technique, the MapReduce paradigm (Dean & Ghemawat, 2008), in order to reduce processing time. In addition, BigDimETL uses the HBase database as a column-oriented data store in order to support complex data, instead of a classical relational database. Moreover, the goal of the proposed approach is to reformulate ETL processes while retaining the specificities of the multidimensional structure of the DW. The latter is considered a high-level, DW/ETL-specific construct (Liu et al., 2013) dedicated to online analytical processing and business intelligence applications. The central focus of this research work is on modeling ETL operations at the formal level for the extraction and transformation phases. Accordingly, in the extraction phase, conversion and vertical partitioning methods are employed to reduce the load on the transformation and loading phases. The transformation phase, in turn, supports the most commonly used operations for processing and filtering data.
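
The general idea of vertical partitioning in a column-oriented store can be sketched as follows. This is not the BigDimETL algorithm itself, whose details are not given in this section: it only shows a flat row being split into groups of columns (analogous to HBase column families) under a row key, so that a later operation reads only the group it needs. The family layout, field names, and function names are hypothetical, and plain dictionaries stand in for the data store.

```python
# Hypothetical mapping from column family to the source columns it holds.
FAMILIES = {
    "customer": ["customer_name", "customer_city"],
    "product": ["product_name", "product_category"],
    "measures": ["quantity", "amount"],
}


def vertical_partition(row, key_field="sale_id", families=FAMILIES):
    """Split one flat row into {family: {column: value}} under its row key."""
    row_key = str(row[key_field])
    cells = {
        family: {col: row[col] for col in cols if col in row}
        for family, cols in families.items()
    }
    return row_key, cells


def read_family(store, family):
    """Read a single column family for every row, ignoring the other families."""
    return {row_key: cells[family] for row_key, cells in store.items()}


# Illustrative input only.
rows = [
    {"sale_id": 1, "customer_name": "Ada", "customer_city": "Sfax",
     "product_name": "pen", "product_category": "office",
     "quantity": 3, "amount": 6},
    {"sale_id": 2, "customer_name": "Linus", "customer_city": "Tunis",
     "product_name": "ink", "product_category": "office",
     "quantity": 1, "amount": 9},
]

# An in-memory stand-in for a column-oriented table keyed by row key.
store = dict(vertical_partition(row) for row in rows)
measures = read_family(store, "measures")
```

With the data laid out this way, an aggregation over the measures touches only the "measures" group and never loads the customer or product attributes.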
