View Selection in DW and OLAP: A Theoretical Review

View Selection in DW and OLAP: A Theoretical Review

Alfredo Cuzzocrea
Copyright: © 2009 |Pages: 8
DOI: 10.4018/978-1-60566-010-3.ch313
OnDemand:
(Individual Chapters)
Available
$37.50
No Current Special Offers
TOTAL SAVINGS: $37.50

Abstract

Data Warehousing (DW) systems store materialized views, data marts and data cubes, and provide nicely data exploration and analysis interfaces via OnLine Analytical Processing (OLAP) (Gray et al., 1997) and Data Mining (DM) tools and algorithms. Also, OnLine Analytical Mining (OLAM) (Han, 1997) integrates the previous knowledge discovery methodologies and offers a meaningfully convergence between OLAP and DM, thus contributing to significantly augment the power of data exploration and analysis capabilities of knowledge workers. At the storage layer, the mentioned knowledge discovery methodologies share the problem of efficiently accessing, querying and processing multidimensional data, which in turn heavily affect the performance of knowledge discovery processes at the application layer. Due to the fact that OLAP and OLAM directly process data cubes/marts, and DM is more and more encompassing methodologies that are interested to multidimensional data, the problem of efficiently representing data cubes by means of a meaningfully selected view set is become of relevant interest for the Data Warehousing and OLAP research community. This problem is directly related to the analogous problem of efficiently computing the data cube from a given relational data source (Harinarayan et al., 1996; Agarwal et al., 1996; Sarawagi et al., 1996; Zhao et al., 1997). Given a relational data source R and a target data cube schema W, the view selection problem in OLAP deals with how to select and materialize views from R in order to compute the data cube A defined by the schema W by optimizing both the query processing time, denoted by TQ, which models the amount of time required to answer a reference query-workload on the materialized view set, and the view maintenance time, denoted by TM, which models the amount of time required to maintain the materialized view set when updates occur, under a given set of constraints I that, without any loss of generality, can be represented by a space bound constraint B limiting the overall occupancy of the views to be materialized (i.e., I = ). It has been demonstrated (Gupta, 1997; Gupta & Mumick, 2005) that this problem is NP-hard, thus heuristic schemes are necessary. Heuristics are, in turn, implemented in the vest of greedy algorithms (Yang et al., 1997; Kalnis et al., 2002). In this article, we focus the attention on state-ofthe- art methods for the view selection problem in Data Warehousing and OLAP, and complete our analytical contribution with a theoretical analysis of these proposals under different selected properties that nicely model spatial and temporal complexity aspects of the investigated problem.
Chapter Preview
Top

Introduction

Data Warehousing (DW) systems store materialized views, data marts and data cubes, and provide nicely data exploration and analysis interfaces via OnLine Analytical Processing (OLAP) (Gray et al., 1997) and Data Mining (DM) tools and algorithms. Also, OnLine Analytical Mining (OLAM) (Han, 1997) integrates the previous knowledge discovery methodologies and offers a meaningfully convergence between OLAP and DM, thus contributing to significantly augment the power of data exploration and analysis capabilities of knowledge workers. At the storage layer, the mentioned knowledge discovery methodologies share the problem of efficiently accessing, querying and processing multidimensional data, which in turn heavily affect the performance of knowledge discovery processes at the application layer. Due to the fact that OLAP and OLAM directly process data cubes/marts, and DM is more and more encompassing methodologies that are interested to multidimensional data, the problem of efficiently representing data cubes by means of a meaningfully selected view set is become of relevant interest for the Data Warehousing and OLAP research community.

This problem is directly related to the analogous problem of efficiently computing the data cube from a given relational data source (Harinarayan et al., 1996; Agarwal et al., 1996; Sarawagi et al., 1996; Zhao et al., 1997). Given a relational data source R and a target data cube schema W, the view selection problem in OLAP deals with how to select and materialize views from R in order to compute the data cube A defined by the schema W by optimizing both the query processing time, denoted by TQ, which models the amount of time required to answer a reference query-workload on the materialized view set, and the view maintenance time, denoted by TM, which models the amount of time required to maintain the materialized view set when updates occur, under a given set of constraints I that, without any loss of generality, can be represented by a space bound constraint B limiting the overall occupancy of the views to be materialized (i.e., I = 〈B〉). It has been demonstrated (Gupta, 1997; Gupta & Mumick, 2005) that this problem is NP-hard, thus heuristic schemes are necessary. Heuristics are, in turn, implemented in the vest of greedy algorithms (Yang et al., 1997; Kalnis et al., 2002).

In this article, we focus the attention on state-of-the-art methods for the view selection problem in Data Warehousing and OLAP, and complete our analytical contribution with a theoretical analysis of these proposals under different selected properties that nicely model spatial and temporal complexity aspects of the investigated problem.

Top

Background

Before going into details, in this Section we provide the conceptual basis of our work, which is mainly related to the representation of multidimensional data cubes according to the ROLAP storage model. Let R = 〈{R0, R1, …, RM-1}, S〉 be a relational data source, such that (i) Ri, with 0 ≤ iM – 1, is a relational table of form Ri(Ai,0, Ai,1, …, Ai,|Ri|-1), such that Ai,j, with 0 ≤ j ≤ |Ri| – 1, is an attribute of Ri, and (ii) S it the relational schema that models associations among relational tables in {R0, R1, …, RM-1}. Let W be the goal data cube schema, which, without any loss of generality, can alternatively be modeled as a star schema, where a central fact table F is connected to multiple dimensional tables Tj, or a snowflake schema, where dimensional tables are also normalized across multiple tables (Gray et al., 1997). At the conceptual level, the goal data cube A can also be modeled as a tuple A = 〈D, H, M〉, such that: (i) D is the set of dimensions of A, (ii) H is the set of hierarchies associated to dimensions of A, and (iii) M is the set of measures of A. Dimensions model the perspective of analysis of the actual OLAP model. Hierarchies are hierarchical structures (e.g., trees) that capture hierarchical relationships among attributes of dimensional tables. Measures model the analysis goals of the actual OLAP model.

Complete Chapter List

Search this Book:
Reset