On-Demand ELT Architecture for Right-Time BI: Extending the Vision

Florian Waas, Robert Wrembel, Tobias Freudenreich, Maik Thiele, Christian Koncilia, Pedro Furtado
Copyright: © 2013 | Pages: 18
DOI: 10.4018/jdwm.2013040102

Abstract

In a typical BI infrastructure, data extracted from operational data sources is transformed, cleansed, and loaded into a data warehouse by a periodic ETL process, typically executed on a nightly basis, i.e., a full day’s worth of data is processed and loaded during off-hours. However, fresher data is desirable for business insights in near real-time. To this end, the authors propose to leverage a data warehouse’s capability to directly import raw, unprocessed records and to defer transformation and data cleansing until the data is needed by pending reports. At that time, the database’s own processing mechanisms can be deployed to transform the data on demand. Event-processing capabilities are seamlessly woven into the proposed architecture. Besides outlining the overall architecture, the authors also present a roadmap for implementing a complete prototype using conventional database technology in the form of hierarchical materialized views.
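As a rough illustration of this on-demand ELT idea, the sketch below loads raw records as-is and expresses cleansing and typing as a view that the database evaluates only when a report queries it. SQLite (which offers plain rather than materialized views) stands in for the warehouse engine here; the table, columns, and cleansing rules are illustrative assumptions, not taken from the article.

```python
# Minimal sketch of on-demand ELT: raw records are imported unprocessed,
# and transformation/cleansing is deferred to query time via a view.
# SQLite stands in for the warehouse; a real prototype would use
# (hierarchical) materialized views instead. All names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")

# "Load" first: raw, unprocessed records go straight into a staging table.
conn.execute("CREATE TABLE raw_sales (sale_id TEXT, amount TEXT, sold_at TEXT)")
conn.executemany(
    "INSERT INTO raw_sales VALUES (?, ?, ?)",
    [("1", " 19.99 ", "2013-04-01"),
     ("2", "n/a",     "2013-04-01"),   # corrupted record, cleansed away later
     ("3", "5.00",    "2013-04-02")],
)

# "Transform" on demand: cleansing and typing live in a view, so the
# database's own processing mechanisms run them only when a report asks.
conn.execute("""
    CREATE VIEW clean_sales AS
    SELECT CAST(sale_id AS INTEGER)   AS sale_id,
           CAST(TRIM(amount) AS REAL) AS amount,
           sold_at
    FROM   raw_sales
    WHERE  TRIM(amount) GLOB '[0-9]*'  -- drop records that fail cleansing
""")

# A pending report triggers the deferred transformation.
for day, total in conn.execute(
        "SELECT sold_at, SUM(amount) FROM clean_sales GROUP BY sold_at"):
    print(day, total)
```

Stacking such views, e.g., cleansing on top of the staging table and aggregation on top of cleansing, roughly corresponds to the hierarchy of materialized views the prototype roadmap refers to.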

Introduction

Business Intelligence (BI) has long been considered an integral part of any successful enterprise’s data processing and analysis strategy (Chaudhuri et al., 2011). BI analysts inspect and query the data made available through a data warehouse to gain insight into sales data or other business facts that will aid them in making business decisions.

Data warehouses are periodically populated or refreshed with data from Operational Data Stores (ODS), e.g., front-end transaction databases. In most businesses, the freshness of the information available in the data warehouse translates directly into more timely business decisions and competitive advantage. Therefore, it is highly desirable to have data available for analysis in real-time or near real-time, i.e., to provide data so quickly that no delay is discernible. The acceptable degree of delay depends on the specific application scenario, and actual real-time processing in the sense of sub-second delays is generally not needed. This subjective timeliness requirement is sometimes referred to as right-time BI (Davis, 2006).

The biggest hurdle to satisfying right-time BI latency requirements is the data processing needed to make data available in a data warehouse: the data coming from the ODS infrastructure needs to be processed before it is suitable for BI, for a variety of reasons. For example, a data warehouse typically consolidates a multitude of different ODS with different schemas and metadata; hence, all incoming data must be normalized. Also, the ODS may contain erroneous or corrupted data that needs to be cleaned and reconciled. This preprocessing is commonly known as Extract-Transform-Load (ETL): data are first extracted from the original data source, then transformed, including normalization and cleansing, and finally loaded into the data warehouse. For simplicity, we refer to the entire ETL process as loading in the following, unless indicated otherwise. Figure 1 depicts a typical architecture, including various data sources, an ETL layer, and components of the reporting pipeline. A minimal sketch of such a periodic ETL batch is shown after the figure.
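For concreteness, the following is a minimal sketch of a conventional nightly ETL batch; the export file, schema, and cleansing rules are hypothetical and stand in for whatever a concrete ODS export would provide.

```python
# Hedged sketch of a conventional nightly ETL batch: extract from an ODS
# export, transform (normalize and cleanse), then load into the warehouse.
# File names, schema, and cleansing rules are illustrative assumptions only.
import csv
import sqlite3

def extract(path):
    """Extract: read raw records from an operational source export."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: normalize types and discard records that fail cleansing."""
    clean = []
    for r in rows:
        try:
            clean.append((int(r["sale_id"]), float(r["amount"].strip()), r["sold_at"]))
        except (KeyError, ValueError):
            continue  # erroneous or corrupted record
    return clean

def load(rows, conn):
    """Load: append the cleansed records to the warehouse fact table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales "
                 "(sale_id INTEGER, amount REAL, sold_at TEXT)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

if __name__ == "__main__":
    # Create a tiny sample export so the sketch runs end to end.
    with open("ods_export.csv", "w", newline="") as f:
        f.write("sale_id,amount,sold_at\n1, 19.99 ,2013-04-01\n2,n/a,2013-04-01\n")
    warehouse = sqlite3.connect(":memory:")
    load(transform(extract("ods_export.csv")), warehouse)
    print(warehouse.execute("SELECT COUNT(*) FROM sales").fetchone()[0], "rows loaded")
```

The point of the sketch is that all transformation work happens outside the warehouse and before loading, which is exactly the step the on-demand approach defers.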

Figure 1. Typical DW architecture

While database technology for data warehousing has seen tremendous performance and scalability enhancements over the past decade in the form of massively parallel database architectures, ETL has improved in scalability and performance to a much lesser degree. As a result, most BI infrastructures are increasingly experiencing an ingest bottleneck: data cannot be furnished to the data warehouse at the necessary pace and freshness. Clearly, in order to provide near real-time or right-time BI this bottleneck needs to be resolved.

A natural approach would be to scale the different components involved in ETL individually. In particular, parallelizing the transformation phase is instrumental in achieving better overall throughput. However, a parallel ETL infrastructure turns out to be a double-edged sword: while the processing time of daily loads may be reduced, the cost of the initial investment and, more importantly, the continual maintenance of a complex parallel system quickly outweigh its benefits.

Instead, we propose the following three major building blocks to address real-time/right-time data acquisition:
