Empowering the OLAP Technology to Support Complex Dimension Hierarchies

Svetlana Mansmann
Copyright: © 2009 |Pages: 21
DOI: 10.4018/978-1-60566-098-1.ch022

Abstract

Comprehensive data analysis has become indispensable in a variety of domains. OLAP (On-Line Analytical Processing) systems tend to perform poorly or even fail when applied to complex data scenarios. The restriction of the underlying multidimensional data model to admit only homogeneous and balanced dimension hierarchies is too rigid for many real-world applications and therefore has to be overcome in order to provide adequate OLAP support. We present a framework for classifying and modeling complex multidimensional data, with the major effort at the conceptual level directed at transforming irregular hierarchies so that they can be navigated in a uniform manner. The properties of various hierarchy types are formalized and a two-phase normalization approach is proposed: heterogeneous dimensions are reshaped into a set of well-behaved homogeneous subdimensions, followed by the enforcement of summarizability in each dimension's data hierarchy. Mapping the data to a visual data browser relies solely on metadata, which captures the properties of facts, dimensions, and relationships within the dimensions. The navigation is schema-based; that is, users interact with dimensional levels, with data displayed on demand. The power of our approach is exemplified using a real-world case study from the domain of academic administration.

Introduction

Data Warehouses (DWs) are repositories that typically store large amounts of data extracted and integrated from transactional systems and various other operational sources. These repositories support online analytical processing (OLAP) and data mining. Typical queries include both standard reporting and ad hoc analysis; they are usually complex, access very large volumes of data, and perform time-consuming aggregations. Although data warehouses can easily reach many gigabytes or terabytes, users still require fast answers to their analyses, so performance becomes a major concern in these systems. Although structures such as materialized views and specialized indexes improve response times for predicted queries, parallel processing can be used alone or in conjunction with those structures to offer a major performance boost and to guarantee speedup and scale-up, even for unpredicted ad hoc queries.

Parallel database systems are implemented using one of the following parallel architectures: shared-memory, shared-disk, shared-nothing, hierarchical, or NUMA (Valduriez & Ozsu, 1999). Each choice has implications for parallel query processing algorithms and data placement. In practice, parallel environments involve several extra overheads related to data and control exchanges between processing units, as well as to storage, so all components of the system need to be designed to avoid bottlenecks that would compromise overall processing efficiency. Some parts of the system may even have to accommodate the aggregate flow into and from all units. For instance, in shared-disk systems, the storage subsystem, including controllers and connections to storage, has to be fast enough to handle the aggregate of all accesses without becoming a significant bottleneck for I/O-bound applications. Handling such bottlenecks typically requires specialized, fast, fully dedicated parallel hardware and interconnects. An attractive alternative is to use a number of low-cost computer nodes in a shared-nothing environment, possibly on a non-dedicated local network, and to design the system with special care for partitioning and processing. In such an environment, each node runs a basic database engine, and the system includes a middle layer that provides parallelism to the whole environment. The Node Partitioned Data Warehouse (NPDW) is a generic architecture for partitioning and processing query-intensive data in such an environment. One of its objectives is to minimize the dependency on very fast, dedicated computing and data exchange infrastructures by optimizing partitioning and making use of replication whenever useful.
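The middle-layer idea can be sketched in a few lines of Python. This is a hypothetical illustration, not the NPDW implementation: each shared-nothing node runs an independent local "engine" over its fragment of the data, and a coordinator merges the partial results. The names `run_local` and `merge_partials` are illustrative assumptions.

```python
# Hypothetical sketch of a shared-nothing middle layer: each node computes a
# partial aggregate over its local fragment; a coordinator merges the partials.
from collections import Counter

def run_local(fragment):
    """Local engine: partial SUM(amount) grouped by region on one node."""
    partial = Counter()
    for row in fragment:
        partial[row["region"]] += row["amount"]
    return partial

def merge_partials(partials):
    """Coordinator: additive aggregates merge by simple summation."""
    result = Counter()
    for p in partials:
        result.update(p)  # Counter.update adds counts, it does not overwrite
    return dict(result)

# Three shared-nothing nodes, each holding a fragment of the fact data.
fragments = [
    [{"region": "EU", "amount": 10}, {"region": "US", "amount": 5}],
    [{"region": "EU", "amount": 7}],
    [{"region": "US", "amount": 3}, {"region": "EU", "amount": 1}],
]
partials = [run_local(f) for f in fragments]  # would run in parallel in practice
print(merge_partials(partials))               # {'EU': 18, 'US': 8}
```

This pattern works for distributive aggregates such as SUM and COUNT, which is one reason query-intensive warehouse workloads parallelize well over shared-nothing nodes.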

DeWitt and Gray (1992) review the major issues in parallel database systems implemented over conventional shared-nothing architectures. One of the major concerns with such an architecture is deciding how to partition or cluster relations across nodes, which raises the issue of determining the most appropriate partitioning and placement choice for a given schema. Data warehouses are a specialized type of database with specific characteristics and requirements that can inform the partitioning and placement decision: they are mostly read-only, periodically loaded, centralized repositories of data, so replication-related consistency issues are minor compared to those of full-blown transactional systems.

The star schema (Kimball, 1996) is the typical data organization in a data warehouse, representing multidimensional data with a large central fact table and smaller dimension tables. Fact tables are typically very large relations holding hundreds of gigabytes of historical detail. Dimensions are smaller relations that identify entities by means of several descriptive properties.
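To make the fact/dimension split concrete, here is a minimal star schema sketched as plain Python relations. The table and column names (`sales_fact`, `date_dim`, and so on) are hypothetical, not taken from the chapter.

```python
# Illustrative star schema: a fact relation keyed by surrogate keys into
# small descriptive dimension relations.
date_dim = {1: {"year": 2008, "month": "Jan"}, 2: {"year": 2008, "month": "Feb"}}
product_dim = {10: {"name": "Widget", "category": "Tools"}}

# Fact rows: foreign keys into the dimensions plus numeric measures.
sales_fact = [
    {"date_key": 1, "product_key": 10, "amount": 120.0},
    {"date_key": 2, "product_key": 10, "amount": 80.0},
]

# A typical OLAP query joins the fact with a dimension and aggregates:
# here, total sales amount per month.
totals = {}
for row in sales_fact:
    month = date_dim[row["date_key"]]["month"]
    totals[month] = totals.get(month, 0.0) + row["amount"]
print(totals)  # {'Jan': 120.0, 'Feb': 80.0}
```

The asymmetry shown here (a huge, append-mostly fact versus small, descriptive dimensions) is exactly what the placement strategy below exploits.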

In this context, a basic placement strategy for the simple star schema replicates the dimensions and horizontally partitions the large central fact table at random. Figure 1 illustrates this simple placement strategy: the large fact table F is partitioned into node fragments Fi, and the dimensions D are replicated on all nodes. Very small dimensions can even be cached in memory for faster access and join processing.
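The placement strategy above can be sketched as follows. This is a simplified illustration under assumed names: round-robin assignment stands in for random placement of fact rows, and every node receives a full copy of the dimensions.

```python
# Sketch of the star-schema placement strategy: the fact table is horizontally
# partitioned across nodes (round-robin here, standing in for random
# placement) while every dimension is replicated on every node.
import itertools

NUM_NODES = 3

def place(fact_rows, dimensions, num_nodes=NUM_NODES):
    nodes = [{"fact_fragment": [], "dimensions": dict(dimensions)}
             for _ in range(num_nodes)]                  # dimensions replicated
    for row, node_id in zip(fact_rows, itertools.cycle(range(num_nodes))):
        nodes[node_id]["fact_fragment"].append(row)      # fact partitioned
    return nodes

fact = [{"sale_id": i, "amount": i * 1.0} for i in range(7)]
dims = {"store": {1: "Lisbon", 2: "Coimbra"}}
nodes = place(fact, dims)
print([len(n["fact_fragment"]) for n in nodes])          # [3, 2, 2]
```

Because each node holds all dimensions locally, a fact-to-dimension join never requires moving dimension data between nodes; only the partial aggregates over the fact fragments need to be exchanged.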

Figure 1. Partitioning the star schema
