Article Preview
TopIntroduction
Data warehouse applications store large amounts of data in various tables, where each instance is described using several attributes. Usually, these applications are modeled with relational database schemas like star or snowflake schemas. A star schema contains a large fact table and various dimension tables. A star schema is usually queried in various combinations involving many tables. The most used operations are joins, aggregations and selections (Papadomanolakis & Ailamaki, 2004). Joins are well known to be expensive operations, especially when the involved relations are substantially larger than the size of the main memory (Lei & Ross, 1999) which is usually the case of business intelligence applications. The typical queries defined on the star schema are commonly referred to as star join queries, and exhibit the following two characteristics: (1) there is a multi-table join among the large fact table and multiple smaller dimension tables, and (2) each of the dimension tables involved in the join operation has multiple selection predicates on its descriptive attributes. To speed up star join queries, many optimization techniques were proposed. In (Bellatreche, Moussaoui, Necir & Drias, 2008), we classified them into two main categories: redundant techniques such as materialized views, advanced index schemes, vertical partitioning, parallel processing with replication and non redundant techniques like ad-hoc joins, where joins are performed without additional data structures like indexes (hash join, nested loop, sort merge, etc.), horizontal partitioning, parallel processing without replication. In this article, we concentrate only on horizontal partitioning, since it is more adapted to reduce the cost of star join queries.
Horizontal partitioning has been mainly used in logical distributed and parallel databases design in last decades (Özsu & Valduriez, 1999; Sacca & Wiederhold, 1985). Recently, it becomes a crucial part of physical database design (Sanjay, Narasayya & Yang, 2004; Papadomanolakis & Ailamaki, 2004; Eadon et al., 2008), where most of today's commercial DBMS offer native DDL (data definition language) support for defining horizontal partitions (fragments) of a table (Sanjay et al., 2004). In context of relational warehouses, horizontal partitioning allows tables, indexes and materialised views to be partitioned into disjoint sets of rows that are physically stored and accessed separately. Contrary to materialised views and indexes, horizontal data partitioning does not replicate data, thereby reducing space requirements and minimising the update overhead. The main characteristic of data partitioning is its ability to be combined with some redundant optimization techniques such as indexes and materialized views. It also affect positively (1) query performance, by performing partition elimination: if a query includes a partition key as a predicate in the WHERE clause, the query optimizer will automatically route the query to only relevant partitions and (2) database manageability, for instance by allocating partitions in different machines (Stöhr, Märtens & Rahm, 2000).
Two versions of horizontal partitioning exist (Ceri, Negri & Pelagatti, 1982; Özsu & Valduriez, 1999): primary and derived (known as referential partitioning in Oracle 11g (Eadon et al., 2008)). Primary horizontal partitioning of a table is performed using attributes defined on that table. Derived horizontal partitioning, on the other hand, is the fragmentation of a table using attribute(s) defined on another table(s). In other word, the derived horizontal partitioning of a table is based on the fragmentation schema of another table (the fragmentation schema is the result of the partitioning process of a given table). The derived partitioning of a table R according a fragmentation schema of S is feasible if and only if there is a join link between R and S (R contains a foreigner key of S). It has been used to optimize data transfer when executing queries in the distributed database environment.