Data Warehouse Benchmarking with DWEB


Jérôme Darmont (University of Lyon (ERIC Lyon 2), France)
DOI: 10.4018/978-1-60566-232-9.ch015


Performance evaluation is a key issue for designers and users of Database Management Systems (DBMSs). Performance is generally assessed with software benchmarks that help, for example, to test architectural choices, compare different technologies, or tune a system. In the particular context of data warehousing and On-Line Analytical Processing (OLAP), although the Transaction Processing Performance Council (TPC) aims at issuing standard decision-support benchmarks, few benchmarks actually exist. We present in this chapter the Data Warehouse Engineering Benchmark (DWEB), which allows generating various ad-hoc synthetic data warehouses and workloads. DWEB is fully parameterized to fulfill various data warehouse design needs. However, two levels of parameterization keep it relatively easy to tune. We also expand on our previous work on DWEB by presenting its new Extract, Transform, and Load (ETL) feature, as well as its new execution protocol. A Java implementation of DWEB is freely available online and can be interfaced with most existing relational DBMSs. To the best of our knowledge, DWEB is the only easily available, up-to-date benchmark for data warehouses.
Chapter Preview


Performance evaluation is a key issue for both designers and users of Database Management Systems (DBMSs). It helps designers select among alternative software architectures and performance optimization strategies, or validate or refute hypotheses regarding the actual behavior of a system. Thus, performance evaluation is an essential component in the development process of efficient and well-designed database systems. Users may also employ performance evaluation, either to compare the efficiency of different technologies before selecting one, or to tune a system. In many fields, including databases, performance is generally assessed with the help of software benchmarks. The main components of a benchmark are its database model and its workload model (the set of operations to execute on the database).

Evaluating data warehousing and decision-support technologies is a particularly intricate task. Although pertinent general advice is available, notably on-line (Pendse, 2003; Greenfield, 2004a), more quantitative elements regarding sheer performance, including benchmarks, are few. In the late nineties, the OLAP (On-Line Analytical Processing) APB-1 benchmark was very popular. Since then, the Transaction Processing Performance Council (TPC) (1), a non-profit organization, has defined standard benchmarks (including decision-support benchmarks) and published objective and verifiable performance evaluations to the industry.

Our own motivation for data warehouse benchmarking was initially to test the efficiency of performance optimization techniques (such as automatic index and materialized view selection techniques) that we have been developing for several years. None of the existing data warehouse benchmarks suited our needs. APB-1’s schema is fixed, while we needed to test our performance optimization techniques on various data warehouse configurations. Furthermore, APB-1 is no longer supported and is somewhat difficult to find. The TPC currently supports the TPC-H decision-support benchmark (TPC, 2006). However, its database schema is inherited from the older, obsolete TPC-D benchmark (TPC, 1998), and is not a dimensional schema such as the typical star schema and its derivatives that are used in data warehouses (Inmon, 2002; Kimball & Ross, 2002). Furthermore, TPC-H’s workload, though decision-oriented, does not include explicit OLAP queries either. TPC-H is implicitly considered obsolete by the TPC, which has issued draft specifications for its successor, TPC-DS (TPC, 2007). However, TPC-DS, which is very complex, especially at the ETL (Extract, Transform, and Load) and workload levels, has been under development since 2002 and is not yet complete.

Furthermore, although the TPC decision-support benchmarks are scalable according to Gray’s (1993) definition, their schema is also fixed. For instance, TPC-DS’s constellation schema cannot easily be simplified into a simple star schema. It must be used “as is”; different ad-hoc configurations are not possible. Moreover, there is only one parameter to define the database, the Scale Factor (SF), which sets its size (from 1 to 100,000 GB). Users cannot control the size of dimensions and fact tables separately, for instance. Finally, users have no control over workload definition: the number of generated queries depends directly on SF.

Ultimately, in a context where data warehouse architectures and decision-support workloads depend heavily on the application domain, it is very important that designers who wish to evaluate the impact of architectural choices or optimization techniques on global performance can choose among, and compare, several configurations. The TPC benchmarks, which aim at standardized results and propose only one configuration of warehouse schema, are ill-suited to this purpose. TPC-DS is indeed able to evaluate the performance of optimization techniques, but it cannot test their impact on various choices of data warehouse architectures. Generating particular data warehouse configurations (e.g., large-volume dimensions) or ad-hoc query workloads is not possible either, although it could be an interesting feature for a data warehouse benchmark.
