Extracting-Transforming-Loading Modeling Approach for Big Data Analytics

Mahfoud Bala (Department of Informatics, Saad Dahleb University, Blida 1, Algeria), Omar Boussaid (Laboratory ERIC, University of Lyon 2, Lyon, France) and Zaia Alimazighi (Department of Computer Science, University of Science and Technology Houari Boumediene, Bab Ezzouar, Algeria)
Copyright: © 2016 |Pages: 20
DOI: 10.4018/IJDSST.2016100104


Due to their widespread use, the Internet, Web 2.0, and digital sensors create data in non-traditional volumes, at the terabyte and petabyte scale. Big data, characterized by the four V's, has brought with it new challenges given the limited capabilities of traditional computing systems. This paper aims to provide solutions that can cope with very large data in Decision-Support Systems (DSSs). In the data integration phase, specifically, the authors propose a conceptual modeling approach for parallel and distributed Extracting-Transforming-Loading (ETL) processes. Among the complexity dimensions of big data, this study focuses on data volume to ensure good performance for ETL processes. The authors' approach allows anticipating parallelization/distribution issues at the early stage of Data Warehouse (DW) projects. They have implemented an ETL platform called Parallel-ETL (P-ETL for short) and conducted some experiments. Their performance analysis reveals that the proposed approach speeds up ETL processes by up to 33%, with the improvement rate being linear.
Article Preview

1. Introduction

Big data, often characterized by the so-called “four V’s” (Mohanty, Jagadeesh & Srivatsa, 2013), has brought with it new challenges given the limited capabilities of traditional computing systems. Fortunately, distributing data processing across clusters is a promising solution. Indeed, new paradigms have emerged, such as cloud computing (Sosinsky, 2010), MapReduce (Dean & Ghemawat, 2010), and Not Only SQL (NoSQL) data models (Han, Haihong, Le & Du, 2011). This paper aims to provide solutions that can cope with very large data in Decision-Support Systems (DSSs). More specifically, we propose a novel conceptual Extracting-Transforming-Loading (ETL) modeling approach, devoted to the big data era, which defines parallel/distributed ETL processes at the early stage of Data Warehouse (DW) projects.

In the early 2000s, ETL attracted significant interest from the DSS community. Regarding modeling specifically, we cite the following examples: Vassiliadis, Simitsis and Skiadopoulos (2002), Trujillo and Luján-Mora (2003), (Vassiliadis, Simitsis, Georgantas & Terrovitis, 2003; Vassiliadis, Simitsis, Georgantas, Terrovitis & Skiadopoulos, 2005; Simitsis, Vassiliadis, Terrovitis & Skiadopoulos, 2005; Vassiliadis, Simitsis, Terrovitis & Skiadopoulos, 2005), El Akkaoui & Zimányi (2009), and Deufemia et al. (2014). Some works, such as (Simitsis, 2005), (Simitsis & Vassiliadis, 2008), (Skoutas & Simitsis, 2006; Skoutas & Simitsis, 2007a), and (Skoutas & Simitsis, 2007b), were interested in the semantics of the ETL process. Vassiliadis, Karagiannis, Tziovara, Simitsis and Hellas (2007) and Simitsis, Vassiliadis, Dayal, Karagiannis and Tziovara (2008) have introduced ETL benchmark approaches. In view of the increasing complexity of data and of ETL tasks, ETL is nowadays considered one of the most important issues in the DSS field. Recently, the emergence of big data has generated much interest in the research community. Some authors, such as (Liu, Thomsen & Pedersen, 2011), (Liu, Thomsen & Pedersen, 2014), and (Misra, Saha & Mazumdar, 2013), have proposed interesting ETL approaches. Indeed, our study is motivated by the fact that existing conceptual modeling approaches such as (El Akkaoui & Zimányi, 2009), (Trujillo & Luján-Mora, 2003), and (Vassiliadis et al., 2002) are not suitable for big data environments. On the other hand, the prior parallel/distributed processing approaches — CloudETL (Liu et al., 2014), ETLMR (Liu et al., 2011), and the MapReduce paradigm (Dean & Ghemawat, 2010), for instance — and commercial tools such as the Talend Big Data Integration Platform (“Talend Big Data”, 2016) and Pentaho PDI (“PDI”, 2016), are defined at the implementation stage of the project, i.e., at the physical level.
Admittedly, the conceptual model is, first, the means of communication between the parties involved in the DW project. Further, it enables highlighting the main shortcomings, difficulties, and risks at the earliest stages, before tackling the implementation step, which costs 60%, and can rise up to 80%, of the DW development project time (Demarest, 1997). In the big data era, particularly, conceptual modeling offers better visibility for dealing with the “4 V’s” of big data (Embley & Liddle, 2013). Commonly, the MapReduce model is considered only at the physical level. Yet MapReduce is not merely a programming model; it is a “paradigm”. In big data environments, specifying parallel/distributed aspects at the early stage becomes worthwhile, as all the processes run in a parallel/distributed manner. Thus, we propose to anticipate the parallelization/distribution issues and model them at a conceptual level.
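To make the MapReduce paradigm referred to above concrete, the following is a minimal, self-contained sketch of a MapReduce-style ETL aggregation step. It is illustrative only and does not reproduce the authors' P-ETL platform: the input records, the store/sales names, and the single-process execution are all assumptions; in a real deployment the map and reduce phases would run in parallel across cluster nodes.

```python
from collections import defaultdict

# Hypothetical extracted source data: partitions of (store_id, sale_amount)
# records, as a distributed ETL job might receive them.
partitions = [
    [("s1", 10.0), ("s2", 5.0)],
    [("s1", 7.5), ("s3", 2.5)],
]

def map_phase(partition):
    # Transform step: emit (key, value) pairs for each input record.
    # Here the record is already in key/value form, so it passes through.
    return [(store, amount) for store, amount in partition]

def shuffle(mapped_outputs):
    # Group all emitted values by key, as the MapReduce runtime would
    # between the map and reduce phases.
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Aggregate each group independently; here, total sales per store.
    return {key: sum(values) for key, values in groups.items()}

totals = reduce_phase(shuffle([map_phase(p) for p in partitions]))
print(totals)  # {'s1': 17.5, 's2': 5.0, 's3': 2.5}
```

Because each partition is mapped independently and each key is reduced independently, both phases parallelize naturally; this independence is precisely what a conceptual model of a parallel/distributed ETL process can capture before any physical implementation is chosen.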
