1. Introduction
A data warehouse is a decision-making database that holds data extracted from transactional systems, operational data stores, and other external sources. The data are moved from the source systems into a central data warehouse by an Extract-Transform-Load (ETL) process, an activity traditionally referred to as data warehousing (Kimball & Caserta, 2004). The transformed data in the data warehouse (Inmon, 2002) are typically structured according to a star schema and accessed by decision-support systems, such as Online Analytical Processing (OLAP) tools and Business Intelligence (BI) applications (March & Hevner, 2007). Data warehousing systems run ETL jobs at regular intervals, such as daily, weekly, or monthly. Operational data management systems create dynamic data through transactions, and the data may change during everyday business operations, e.g., new orders are added and existing orders are updated or canceled. These changes are propagated to the data warehouse to support decision making. A data warehouse that keeps the detailed transaction history can trace the changes of the data back to the transactional processing system, which is then called a System of Record (SOR) (Inmon, 2003). The data fed to an ETL process typically follow a certain order, for example according to data timestamps (e.g., transaction time) and/or dependencies. A typical scenario is that dimension data have to be loaded into the data warehouse before fact data because of the foreign-key constraints between the fact and dimension tables. A fact record consists of dimension keys and measures, where the dimension keys are obtained from the referenced dimension tables by a lookup operation. If the fact record arrives first, the lookup operation fails. The standard approach to this problem is for the ETL to first load the parent tables (the referenced tables, e.g., dimension tables) and then the child tables (the tables with the foreign keys, e.g., fact tables).
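The parent-then-child loading order and the lookup failure for early-arriving facts can be sketched as follows. This is a minimal illustration with hypothetical table and column names, not part of the paper's proposed method; the dimension is modeled as an in-memory mapping from business keys to surrogate keys.

```python
# Hypothetical dimension table: business key -> surrogate key.
dimension_customer = {}

def load_dimension(rows):
    """Load dimension rows first (parent table) so fact lookups can succeed."""
    for surrogate_key, business_key in rows:
        dimension_customer[business_key] = surrogate_key

def load_fact(rows):
    """Resolve each fact row's dimension key via lookup.

    An early-arriving fact (its dimension row not yet loaded) fails the
    lookup and must be put on hold, which costs extra space and delays
    the ETL process.
    """
    loaded, on_hold = [], []
    for business_key, measure in rows:
        surrogate = dimension_customer.get(business_key)
        if surrogate is None:
            on_hold.append((business_key, measure))  # early-arriving fact
        else:
            loaded.append((surrogate, measure))
    return loaded, on_hold

# Dimension data are loaded before fact data.
load_dimension([(1, "CUST-A"), (2, "CUST-B")])
facts, held = load_fact([("CUST-A", 100.0), ("CUST-C", 55.0)])
# facts -> [(1, 100.0)]; held -> [("CUST-C", 55.0)] (no dimension row yet)
```

The on-hold list makes the weaknesses discussed next concrete: held fact rows occupy extra storage and cannot be loaded until a later dimension load supplies the missing key.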
This approach, however, has to respect the loading dependency, which brings several weaknesses: extra space is required for storing the early-arriving data; the fact data are put on hold until the dimension data have been processed; the waiting delays the whole ETL process; and the dependency precludes parallel loading. Furthermore, from the perspective of ETL complexity, an ETL process is typically complicated when it handles dynamic data that requires keeping track of change history, e.g., tracing shipping status and customer information changes in an online shopping store. This involves dealing with fast- and slowly-changing data. A classic example is processing slowly changing dimensions (SCDs) (Kimball & Caserta, 2004). An ordinary ETL process checks for the existence of a previous record in the data warehouse to decide whether to update the record or insert a new one. Innovative approaches for easing ETL complexity, maintenance, and optimization are therefore desirable.
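The existence check behind SCD processing can be sketched as a Type-2 update, where a changed dimension record expires the current version and inserts a new one. This is a simplified illustration with a hypothetical schema (`business_key`, `city`, validity dates), not the paper's own implementation:

```python
from datetime import date

# Hypothetical SCD Type-2 dimension: each row is one version of a customer.
dim_customer = []  # rows: business_key, city, valid_from, valid_to, current

def apply_change(business_key, city, as_of):
    """Check for an existing current version to decide update vs. insert.

    If a current version exists and an attribute changed, expire it
    (set valid_to, clear the current flag) and insert a new version;
    otherwise just insert.
    """
    for row in dim_customer:
        if row["business_key"] == business_key and row["current"]:
            if row["city"] == city:
                return  # no attribute change, nothing to do
            row["valid_to"] = as_of    # expire the old version
            row["current"] = False
    dim_customer.append({
        "business_key": business_key, "city": city,
        "valid_from": as_of, "valid_to": None, "current": True,
    })

apply_change("CUST-A", "Aalborg", date(2024, 1, 1))     # first load: insert
apply_change("CUST-A", "Copenhagen", date(2024, 6, 1))  # change: expire + insert
# dim_customer now holds two versions; only the latest is flagged current
```

Every incoming record thus triggers a lookup against the existing dimension rows, which is part of what makes ETL for change-history data complicated to write and maintain.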