Data Cube Compression Techniques: A Theoretical Review


Alfredo Cuzzocrea (University of Calabria, Italy)
Copyright: © 2009 |Pages: 7
DOI: 10.4018/978-1-60566-010-3.ch058


Introduction

OnLine Analytical Processing (OLAP) research issues (Gray, Chaudhuri, Bosworth, Layman, Reichart & Venkatrao, 1997) such as data cube modeling, representation, indexing, and management have traditionally attracted considerable attention from the Data Warehousing research community. Among these, the problem of efficiently compressing the data cube plays a leading role, and it has driven vibrant research activity in both academia and industry over the last fifteen years.

Basically, the problem consists of dealing with massive data cubes that make access and query evaluation costs prohibitive. The widely accepted solution is to generate a compressed representation of the target data cube, with the goal of reducing its size (given an input space bound) while admitting a loss of information and a degree of approximation that are considered irrelevant for OLAP analysis goals (e.g., see (Cuzzocrea, 2005a)). Compressed representations are also referred to in the literature as “synopsis data structures”, i.e., succinct representations of the original data cubes that introduce a limited loss of information. The benefits of the data cube compression approach can be summarized as a significant reduction of the computational overheads needed both to represent the data cube and to evaluate resource-intensive OLAP queries. Such queries constitute a wide class that allows us to extract useful knowledge from huge multidimensional data repositories in the form of summarized information (e.g., aggregates based on popular SQL operators such as SUM, COUNT, and AVG), which would otherwise be infeasible by means of traditional OLTP approaches. Among such queries, we recall: (i) range queries, which extract a sub-cube bounded by a given range; (ii) top-k queries, which extract the k data cells (where k is an input parameter) having the highest aggregate values; (iii) iceberg queries, which extract the data cells having aggregate values above a given threshold.
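The three query classes above can be sketched over a toy two-dimensional cube. This is a minimal illustrative example, assuming a cube stored as a mapping from dimension coordinates to SUM aggregate values; all names and data are hypothetical, not from the chapter.

```python
# Toy 2-D data cube: (dim1, dim2) coordinates -> SUM aggregate value.
cube = {
    (0, 0): 10, (0, 1): 4,  (0, 2): 7,
    (1, 0): 3,  (1, 1): 12, (1, 2): 5,
    (2, 0): 8,  (2, 1): 1,  (2, 2): 9,
}

def range_query(cube, lo, hi):
    """Range query: SUM over the sub-cube bounded by lo..hi in each dimension."""
    return sum(v for cell, v in cube.items()
               if all(l <= c <= h for c, l, h in zip(cell, lo, hi)))

def top_k(cube, k):
    """Top-k query: the k data cells with the highest aggregate values."""
    return sorted(cube.items(), key=lambda kv: kv[1], reverse=True)[:k]

def iceberg(cube, threshold):
    """Iceberg query: the data cells with aggregate values above the threshold."""
    return {cell: v for cell, v in cube.items() if v > threshold}

print(range_query(cube, (0, 0), (1, 1)))  # 10 + 4 + 3 + 12 = 29
print(top_k(cube, 2))                     # [((1, 1), 12), ((0, 0), 10)]
print(iceberg(cube, 8))                   # {(0, 0): 10, (1, 1): 12, (2, 2): 9}
```

On a real cube each function would run over a compressed synopsis rather than the exact cell store; the point here is only the shape of the three query classes.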

This evidence has given rise to the proliferation of a number of Approximate Query Answering (AQA) techniques which, based on data cube compression, aim at providing approximate answers to resource-intensive OLAP queries instead of computing exact answers, as full decimal precision is usually negligible in OLAP query and reporting activities (e.g., see (Cuzzocrea, 2005a)).
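The AQA idea can be sketched with the simplest of the three compression techniques the article surveys, sampling: estimate a SUM range query from a uniform random sample of cells and scale by the inverse sampling rate. This is a hypothetical sketch; the synthetic data, the 10% rate, and all variable names are assumptions for illustration.

```python
import random

random.seed(42)

# Synthetic 100x100 cube: (i, j) cell -> SUM aggregate value (illustrative).
cells = {(i, j): (i * 7 + j * 3) % 20 for i in range(100) for j in range(100)}

rate = 0.1  # retain a uniform 10% sample of the cells as the synopsis
sample = dict(random.sample(sorted(cells.items()), int(len(cells) * rate)))

def approx_range_sum(sample, rate, lo, hi):
    """Estimate a SUM range query from the sample, scaled by 1/rate."""
    in_range = [v for (i, j), v in sample.items()
                if lo[0] <= i <= hi[0] and lo[1] <= j <= hi[1]]
    return sum(in_range) / rate

exact = sum(v for (i, j), v in cells.items()
            if 20 <= i <= 60 and 20 <= j <= 60)
approx = approx_range_sum(sample, rate, (20, 20), (60, 60))
print(exact, round(approx))  # the estimate tracks the exact sum, up to sampling error
```

The synopsis is ten times smaller than the cube, and the answer carries a sampling error that OLAP reporting can typically tolerate, which is exactly the trade-off AQA techniques exploit.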

Starting from these considerations, in this article we first provide a comprehensive, rigorous survey of the main data cube compression techniques, i.e., histograms, wavelets, and sampling. Then, we complete our analytical contribution with a detailed theoretical review focused on the spatio-temporal complexities of these techniques.


Background

A data cube L is a tuple L = 〈C,J,H,M〉, such that: (i) C is the data domain of L, containing the (OLAP) data cells, i.e., the basic SQL aggregations of L computed over the relational data source S feeding L; (ii) J is the set of dimensions of L, i.e., the functional attributes (of S) with respect to which the underlying OLAP analysis is defined (in other words, J is the set of attributes by which relational tuples in S are aggregated); (iii) H is the set of hierarchies associated with the dimensions of L, i.e., hierarchical representations of the functional attributes shaped as generic trees; (iv) M is the set of measures of L, i.e., the attributes of interest (of S) for the underlying OLAP analysis (in other words, M is the set of attributes over which the SQL aggregations stored in the data cells of L are computed).
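The tuple L = 〈C,J,H,M〉 can be made concrete with a minimal sketch: a toy relational source S, from which the data cells C are computed as SUM aggregations grouped by the dimensions J. The source table, the hierarchy levels, and all names are hypothetical assumptions, not the chapter's notation for any real dataset.

```python
# Relational source S feeding the cube: (store, month, sales) tuples.
S = [
    ("Rome",  "Jan", 100), ("Rome",  "Jan", 50),
    ("Rome",  "Feb", 70),  ("Milan", "Jan", 30),
    ("Milan", "Feb", 90),  ("Milan", "Feb", 10),
]

J = ("store", "month")   # dimensions: functional attributes of S used for grouping
H = {                    # hierarchies: tree levels over each dimension (sketched)
    "store": ["store", "city", "country"],
    "month": ["month", "quarter", "year"],
}
M = ("sales",)           # measures: the attributes being aggregated

# C: one data cell per distinct combination of dimension values,
# holding the SUM aggregation of the measure (SQL: GROUP BY store, month).
C = {}
for store, month, sales in S:
    key = (store, month)
    C[key] = C.get(key, 0) + sales

L = (C, J, H, M)
print(C[("Rome", "Jan")])   # 150
print(C[("Milan", "Feb")])  # 100
```

Each entry of C corresponds to one row of `SELECT store, month, SUM(sales) FROM S GROUP BY store, month`; the hierarchies in H would additionally allow roll-up from, e.g., month to quarter.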
