Range-Sum Queries over High Dimensional Data Cubes Using a Dynamic Grid File

Range-Sum Queries over High Dimensional Data Cubes Using a Dynamic Grid File

Feng Yu (Southern Illinois University, USA), Cheng Luo (Coppin State University, USA), Xiaoguang Yu (Southern Illinois University, USA), Wen-Chi Hou (Southern Illinois University, USA), Chih-Fang Wang (Southern Illinois University, USA) and Michael Wainer (Southern Illinois University, USA)
DOI: 10.4018/978-1-60960-521-6.ch014
OnDemand PDF Download:
$30.00
List Price: $37.50

Abstract

In this chapter, the authors propose to use the grid file to store multi-dimensional data cubes and answer range-sum queries. The grid file is enhanced with a dynamic splitting mechanism to accommodate insertions of data. It overcomes the drawback of the traditional grid file in storing uneven data while enjoying its advantages of simplicity and efficiency. The space requirement grows linearly with the dimension of the data cube, compared with the exponential growth of conventional methods that store pre-computed aggregate values for range-sum queries. The update cost is O(1), much faster than the pre-computed data cube approaches, which generally have exponential update cost. The grid file structure can also respond to range queries quickly. They compare it with an approach that uses the R*-tree structure to store the data cube. The experimental results show that the proposed method performs favorably in file size, update speed, construction time, and query response time for both evenly and unevenly distributed data.
Chapter Preview
Top

Introduction

A data warehouse is a large collection of integrated data, built to assist knowledge workers, such as executives, managers, analysts, etc., to make better and faster decisions. It is often required that data be summarized at various levels of detail and on various attributes to allow knowledge workers to analyze the data through a variety of views in on-line analytical processing (OLAP). Typical OLAP applications include product performance and profitability, effectiveness of sales programs or marketing campaigns, sales forecasting, capacity planning, etc. Data warehousing and OLAP have increasingly become a focus of the database industry.

OLAP systems generally support a multidimensional data model, which is also known as the data cube (Gray, 1997). Construction of a data cube is based on the set of selected attributes of the database. Certain attributes are chosen to be the measure attributes, i.e., attributes whose values are of interest, while some others are selected as dimension or functional attributes (Geffner, 1999). The values of the measure attributes are often aggregated according to the dimension attributes for analysis. The size of a data cube can be huge when the number of combinations of dimension attribute values is large.

The storage of data cubes is essential to OLAP. Much research (Agarwal, 1996; Beyer, 1999; Han, 2001; Morfonios, 2006; Xin, 2003; Zhao, 1997) has focused on the materialization of data cubes, that is, to pre-compute and store all possible combinations of multi-dimensional aggregates for fast multi-dimensional analysis. Some notable cube materialization algorithms proposed include ROLAP-based multi-dimensional aggregate computation (Agarwal, 1996, Morfonios, 2006), multi-way array aggregation (Beyer, 1999), BUC (Han, 2001), H-cubing (Xin, 2003), Star-cubing (Zhao, 1997), Minimal cubing (Li, 2004), etc. Since materializing data cubes are generally computationally intensive and space consuming, much effort has been devoted to reducing the computation and storage space of data cubes. These efforts include partial materialization of data cubes (Harinarayan, 1996), iceberg cube computation (Han, 2001; Xin, 2003; Zhao, 1997), computation of condensed, dwarf, and quotient cubes (Lakshmanan, 2002; Lakshmanan, 2003; Sismanis, 2002; Beyer, 1999; Wang, 2002), and computation of approximate cubes (Barbara, 1997; Cuzzocrea, 2006; Shanmugasundaram, 1999). While these pre-computed data cubes can be used to answer queries quickly, tremendous overhead is incurred in maintaining these pre-computed aggregate values as updates can propagate to a large number of relevant cells.

Complete Chapter List

Search this Book:
Reset