Introduction
Business Intelligence (BI) has long been considered an integral part of any successful enterprise’s data processing and analysis strategy (Chaudhuri et al., 2011). BI analysts inspect and query data made available through a data warehouse to gain insight into sales figures or other business facts that will aid them in making business decisions.
Data warehouses are periodically populated or refreshed with data from Operational Data Stores (ODS), e.g., front-end transaction databases. In most businesses, the freshness of the information available in the data warehouse translates directly into more timely business decisions and competitive advantage. It is therefore highly desirable to have data available for analysis in real time or near real time, i.e., with no discernible delay. The degree of delay that is acceptable depends on the specific application scenario; true real-time processing in the sense of sub-second delays is generally not needed. This subjective timeliness requirement is sometimes referred to as right-time BI (Davis, 2006).
The biggest hurdle to satisfying right-time BI latency requirements is the data processing needed to make data available in a data warehouse: the data coming from the ODS infrastructure must be processed before it is suitable for BI, for a variety of reasons. For example, a data warehouse typically consolidates a multitude of different ODS with different schemas and metadata; hence, all incoming data must be normalized. Also, the ODS may contain erroneous or corrupted data that needs to be cleaned and reconciled. This preprocessing is commonly known as Extract-Transform-Load (ETL): data are first extracted from the original data source, then transformed, including normalization and cleansing, and finally loaded into the data warehouse. For simplicity, we refer to the entire ETL process as loading in the following, unless indicated otherwise. Figure 1 depicts a typical architecture, including various data sources, an ETL layer, and components of the reporting pipeline.
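To make the three ETL phases concrete, the following is a minimal sketch of the extract-normalize-cleanse-load flow described above. The record fields, schemas, and cleansing rule are illustrative assumptions, not part of the architecture in Figure 1:

```python
# Minimal ETL sketch: extract records from two hypothetical operational
# stores with differing schemas, normalize them to one target schema,
# drop corrupted rows, and "load" the result into an in-memory table
# standing in for the data warehouse.

def extract():
    # Two ODS with different schemas (illustrative data only).
    ods_a = [{"sku": "A1", "amount_usd": "19.99"},
             {"sku": "A2", "amount_usd": "bad-value"}]  # corrupted record
    ods_b = [{"product": "B7", "price_cents": 500}]
    return ods_a, ods_b

def transform(ods_a, ods_b):
    # Normalize both schemas to (product_id, price_cents);
    # cleansing drops rows whose amount cannot be parsed.
    rows = []
    for r in ods_a:
        try:
            cents = int(round(float(r["amount_usd"]) * 100))
        except ValueError:
            continue  # cleansing: skip the corrupted record
        rows.append({"product_id": r["sku"], "price_cents": cents})
    for r in ods_b:
        rows.append({"product_id": r["product"],
                     "price_cents": int(r["price_cents"])})
    return rows

def load(rows, warehouse):
    # In a real system this would issue bulk inserts into the warehouse.
    warehouse.extend(rows)

warehouse = []
load(transform(*extract()), warehouse)
# warehouse now holds two normalized rows; the corrupted A2 record was dropped.
```

In production, each phase is of course a full subsystem (connectors, staging areas, bulk loaders); the sketch only fixes the terminology used in the rest of the article.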
While database technology for data warehousing has seen tremendous performance and scalability enhancements over the past decade in the form of massively parallel database architectures, ETL has improved in scalability and performance to a much lesser degree. As a result, most BI infrastructures are increasingly experiencing an ingest bottleneck: data cannot be furnished to the data warehouse at the necessary pace and freshness. Clearly, this bottleneck must be resolved in order to provide near real-time or right-time BI.
A natural approach would be to scale the different components involved in ETL individually. In particular, parallelizing the transformation phase is instrumental in achieving better overall throughput. However, a parallel ETL infrastructure turns out to be a double-edged sword: while the processing time of daily loads may be reduced, the cost of the initial investment and, more importantly, the continual maintenance of a complex parallel system quickly outweigh its benefits.
Instead, we propose the following three major building blocks to address real-time/right-time data acquisition: