1. Introduction
Nowadays, the availability of smart devices and communication systems used by billions of end users has led to the emergence of the Big Data notion, characterized by the 4Vs presented by Gartner: volume, velocity, variety, and veracity. From this perspective, statistics reported by IBM analytics indicate that 6 of the 7 billion people in the world own at least one smartphone. This has greatly increased the number of people connected to the internet, multiplying the volume of data 300 times between 2005 and 2020, to about 40 zettabytes. Hence, Big Data is a term applied to gigantic, unstructured, and heterogeneous datasets whose size and type exceed the ability of traditional relational databases to capture, manage, and process them. This explosion of data and technologies therefore poses a major challenge for multiple domains, particularly for Decision Support Systems (DSS) (Kimball & Caserta, 2011) and especially for data mining and knowledge discovery activities (Storey & Song, 2017). Within this frame of reference, many researchers have focused on this evolution by handling and analyzing sensor data (Trauth & Browning, 2018), logistics data (AlShaer et al., 2019), social media data (Gupta & Aluvalu, 2019), etc. The significance of analyzing this evolution makes it essential to consider the basic element of a DSS, namely the Extract-Transform-Load (ETL) process. According to Vassiliadis et al. (2002), ETL implementation may take up to 80% of a data warehouse (DW) project. Typically, ETL is composed of several operations, such as selection, conversion, filtering, and join, executed sequentially in order to capture, integrate, and filter data before loading it into the DW. In fact, these classical operations cannot cope with the big evolution of data.
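To make the classical operations concrete, the following is a minimal illustrative sketch (not the paper's implementation) of selection, conversion, filtering, and join applied sequentially to small in-memory datasets; all field names and values are invented for illustration.

```python
# Toy source data standing in for extracted records.
orders = [
    {"order_id": 1, "customer_id": 10, "amount": "250.0"},
    {"order_id": 2, "customer_id": 11, "amount": "80.5"},
    {"order_id": 3, "customer_id": 10, "amount": "15.0"},
]
customers = [
    {"customer_id": 10, "country": "FR"},
    {"customer_id": 11, "country": "TN"},
]

# Selection: keep only the attributes needed downstream.
selected = [{"order_id": o["order_id"],
             "customer_id": o["customer_id"],
             "amount": o["amount"]} for o in orders]

# Conversion: cast textual amounts to numeric values.
converted = [{**o, "amount": float(o["amount"])} for o in selected]

# Filtering: discard rows below a threshold.
filtered = [o for o in converted if o["amount"] >= 50.0]

# Join: enrich each order with its customer before loading into the DW.
by_id = {c["customer_id"]: c for c in customers}
joined = [{**o, **by_id[o["customer_id"]]} for o in filtered]

print(joined)
```

Each step consumes the full output of the previous one, which is precisely the sequential behavior that becomes a bottleneck at Big Data scale.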
Moreover, Relational Database Management Systems (RDBMS) are not suitable for distributed databases, as argued in (Boussahoua et al., 2017). For this reason, the ETL process requires particular attention so that it can be adapted to cope with the explosion of data and still deliver processed data into the DW.
Several technologies have appeared with the emergence of Big Data, such as Google's MapReduce (Dean & Ghemawat, 2008) for processing large amounts of data. In addition, NoSQL (Not only SQL) databases have appeared to store unstructured data, whether column-oriented such as HBase (George, 2011), document-oriented such as MongoDB (Chodorow & Dirolf, 2010), or graph-oriented (Pokornỳ, 2015).
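The MapReduce paradigm itself can be sketched in plain Python without Hadoop: a map function emits key/value pairs, a shuffle step groups them by key, and a reduce function aggregates each group. The word-count example below is the canonical illustration, not part of BigDimETL.

```python
from collections import defaultdict

def map_phase(record):
    # Emit (word, 1) for every word in one line of text.
    return [(word, 1) for word in record.split()]

def shuffle(pairs):
    # Group all emitted values by key, as the framework would
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Aggregate one key's values into a single result.
    return key, sum(values)

lines = ["big data etl", "big data"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'big': 2, 'data': 2, 'etl': 1}
```

In a real Hadoop job, the map and reduce functions run in parallel across the cluster's nodes, which is what minimizes processing time on large datasets.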
Indeed, this paper presents the BigDimETL approach, which applies Big Data technologies that support scalability and performance in order to adapt the Extraction and Transformation phases of ETL. Within this context, the adaptation of data processing is based on adding parallelism through the Hadoop ecosystem (White, 2012), an open-source framework for handling unstructured data by means of a parallel processing technique called the MapReduce paradigm (Dean & Ghemawat, 2008), thereby minimizing time consumption. Besides, BigDimETL adopts the HBase database as a column-oriented data store, in place of a classical relational database, in order to support complex data. Moreover, the goal of the proposed approach is to reformulate the ETL processes while retaining the specificities of the multidimensional structure of the DW. The latter is considered a high-level DW/ETL-specific construct (Liu et al., 2013), dedicated to online analytical processing and business intelligence applications. The central focus of this research work is on modeling ETL operations at the formal level of the extraction and transformation phases. Accordingly, in the extraction phase, conversion and vertical partitioning methods are employed to minimize the overload on the transformation and loading phases, while the transformation phase supports the most commonly used operations for treating and filtering data.
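As a hypothetical illustration of vertical partitioning in the spirit of a column-oriented store such as HBase, the sketch below splits each flat record by column family so that later phases read only the families they need. The family names, attributes, and row-key format are assumptions made for this example, not the paper's schema.

```python
# Assumed column-family layout (illustrative only).
COLUMN_FAMILIES = {
    "identity": ["name", "email"],
    "activity": ["last_login", "purchases"],
}

def vertical_partition(row_key, record):
    """Split one flat record into per-family cells keyed by row."""
    partitions = {}
    for family, columns in COLUMN_FAMILIES.items():
        partitions[family] = {
            (row_key, col): record[col]
            for col in columns if col in record
        }
    return partitions

record = {"name": "Alice", "email": "a@x.org",
          "last_login": "2020-01-05", "purchases": 12}
parts = vertical_partition("user#1", record)
print(parts["identity"])
```

A transformation that only aggregates purchase activity would then scan the `activity` family alone, which is the overload reduction the extraction phase aims for.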