Optimizing ETL by a Two-Level Data Staging Method

Xiufeng Liu (Department of Management Engineering, Technical University of Denmark, Kongens Lyngby, Denmark), Nadeem Iftikhar (University College of Northern Denmark, Aalborg, Denmark), Huan Huo (Shanghai University of Science and Technology, China) and Per Sieverts Nielsen (Technical University of Denmark, Kongens Lyngby, Denmark)
Copyright: © 2016 |Pages: 19
DOI: 10.4018/IJDWM.2016070103

Abstract

In data warehousing, the data from source systems are populated into a central data warehouse (DW) through extraction, transformation, and loading (ETL). The standard ETL approach usually uses sequential jobs to process data with dependencies, such as dimension and fact data. It is a non-trivial task to process the so-called early-/late-arriving data, which arrive out of order. This paper proposes a two-level data staging area method to optimize ETL. The proposed method is an all-in-one solution that supports processing different types of data from operational systems, including early-/late-arriving data and fast-/slowly-changing data. The additional staging area decouples the loading process from data extraction and transformation, which improves ETL flexibility and minimizes intervention in the data warehouse. This paper evaluates the proposed method empirically and shows that it is more efficient and less intrusive than the standard ETL method.

1. Introduction

A data warehouse is a decision-making database that holds data extracted from transactional systems, operational data stores, or other external sources. The data are processed by ETL from the source systems into a central data warehouse, a practice traditionally referred to as data warehousing (Kimball & Caserta, 2004). The transformed data in the data warehouse (Inmon, 2002) are typically structured according to a star schema and accessed by decision-support systems, such as Online Analytical Processing (OLAP) tools and Business Intelligence (BI) applications (March & Hevner, 2007). Data warehousing systems run ETL jobs at a regular interval, such as daily, weekly, or monthly. Operational data management systems create dynamic data through transactions. The data might change during day-to-day business operations, e.g., adding new orders or updating or canceling existing ones. These changes are propagated to the data warehouse to support decision making. A data warehouse that keeps the detailed transaction history can trace back the changes of the data in a transactional processing system, and is then called a System of Record (SOR) (Inmon, 2003). The data fed to an ETL typically follow a certain order, such as according to data timestamps (e.g., transaction time) and/or dependencies. A typical scenario is that dimension data have to be loaded into the data warehouse before fact data, due to the foreign-key constraints between the fact and dimension tables. A fact record consists of dimension keys and measures, where the dimension keys are obtained from the referenced dimension tables by a lookup operation. If the fact record arrives first, the lookup operation fails. The standard approach to this problem is that an ETL first loads the parent tables (the referenced tables, e.g., dimension tables) and then loads the child tables (the tables with the foreign keys, e.g., a fact table).
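The dependency-ordered load described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the table names, in-memory dictionaries, and the pending-facts queue are all assumptions made for the example. It shows why a fact record that arrives before its dimension row ("early-arriving") must be put on hold until the dimension key lookup can succeed.

```python
# Illustrative sketch of dimension-before-fact loading (not the authors' code).
# Dimension rows must exist before the fact rows that reference them,
# otherwise the surrogate-key lookup fails.

dim_customer = {}   # business key -> surrogate key (stands in for a dimension table)
fact_sales = []     # loaded fact rows: (surrogate key, measure)
pending_facts = []  # early-arriving facts put on hold

def load_dimension(business_key):
    """Insert a dimension row if absent and return its surrogate key."""
    if business_key not in dim_customer:
        dim_customer[business_key] = len(dim_customer) + 1
    return dim_customer[business_key]

def load_fact(business_key, measure):
    """Look up the dimension key; hold the fact if it arrived early."""
    surrogate = dim_customer.get(business_key)
    if surrogate is None:
        pending_facts.append((business_key, measure))  # early-arriving fact
        return False
    fact_sales.append((surrogate, measure))
    return True

# A fact arrives before its dimension row: the lookup fails and the row waits.
load_fact("cust-42", 99.95)     # held in pending_facts
load_dimension("cust-42")       # dimension loaded afterwards

# Retry the held facts once the dimensions are in place.
for bk, m in list(pending_facts):
    if load_fact(bk, m):
        pending_facts.remove((bk, m))
```

The sketch makes the weaknesses discussed next concrete: `pending_facts` is the extra storage for early-arriving data, and the retry loop is the waiting that delays the overall process.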
This approach, however, has to respect the loading dependency, which brings several weaknesses: extra space is required for storing the early-arriving data; the fact data are put on hold until the dimension data have been processed; the waiting delays the whole ETL process; and the dependency precludes parallelism. Furthermore, from the perspective of ETL complexity, the ETL process is typically complicated when handling dynamic data that requires keeping track of change history, e.g., tracing shipping status and customer information changes in an online shopping store. This involves dealing with the fast-/slowly-changing data issue. A classic example is processing slowly changing dimensions (SCDs) (Kimball & Caserta, 2004). An ordinary ETL process checks for the existence of a previous record in the DW to decide whether to update or insert the record. Innovative approaches for easing ETL complexity, maintenance, and optimization are therefore desirable.
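The existence check that an ordinary ETL performs for slowly changing dimensions can be sketched as below. This is a hedged illustration of the common type-2 SCD technique (expire the old row, insert a new version), not the paper's method; the row layout and helper name are assumptions made for the example.

```python
from datetime import date

# Illustrative type-2 SCD handling: check whether a current record exists,
# then decide between insert, versioned update, or no-op.
dim_rows = []  # each row: {"key", attribute columns, "valid_from", "valid_to"}

def scd2_upsert(business_key, attrs, load_date):
    """Hypothetical helper: apply one source record to the dimension."""
    current = next((r for r in dim_rows
                    if r["key"] == business_key and r["valid_to"] is None), None)
    if current is None:
        # No previous record in the DW: plain insertion.
        dim_rows.append({"key": business_key, **attrs,
                         "valid_from": load_date, "valid_to": None})
    elif any(current[k] != v for k, v in attrs.items()):
        # Attribute changed: expire the old version and insert a new one.
        current["valid_to"] = load_date
        dim_rows.append({"key": business_key, **attrs,
                         "valid_from": load_date, "valid_to": None})
    # Otherwise the record is unchanged and nothing is written.

scd2_upsert("cust-42", {"city": "Aalborg"}, date(2016, 1, 1))  # insert
scd2_upsert("cust-42", {"city": "Lyngby"}, date(2016, 6, 1))   # new version
```

Each incoming record thus triggers a lookup against the warehouse before any write, which is part of the per-record overhead the ETL complexity discussion refers to.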
