Data warehousing is a popular technology that aims to improve decision-making ability. In an increasingly competitive environment, many companies are adopting a “bottom-up” approach to constructing a data warehouse, since it is more likely to be completed on time and within budget. However, multiple independent data marts/cubes can easily become inconsistent through anomalous update transactions, which leads to biased decision-making. This research focuses on the data inconsistency problem and proposes a temporal-based data consistency mechanism (TDCM) to maintain data consistency. From a relative time perspective, we use an active rule (a standard ECA rule) to monitor the user query event and a metadata approach to record related information. This both builds relationships between the different data cubes and allows a user to define a VIT (valid interval temporal) threshold, which identifies the interval within which data are regarded as valid and serves as the threshold for maintaining data consistency. Moreover, we propose a consistency update method to update inconsistent data cubes, which ensures that all pieces of information are temporally consistent.
Designing and constructing a data warehouse for an enterprise is a very complicated and iterative process, since it involves aggregating data from many different departments and extract, transform, load (ETL) processing (Bellatreche et al., 2001). Currently, there are two basic strategies for implementing a data warehouse, “top-down” and “bottom-up” (Shin, 2002), each with its own strengths, weaknesses, and appropriate uses.
Constructing a data warehouse system using the bottom-up approach is more likely to be on time and within budget. However, because the data marts or data cubes are independent (e.g. each data cube has its own update time), inconsistent and irreconcilable results may be transmitted from one data mart to the next (Inmon, 1998). Thus, inconsistent data in the recognition of events may require a number of further considerations to be taken into account (Shin, 2002; Bruckner et al., 2001; Song & Liu, 1995):
· Data Availability: Typical update patterns for a traditional data warehouse, on a weekly or even monthly basis, delay discovery, so up-to-date information is unavailable to knowledge workers or decision makers.
· Data Comparability: In order to analyze data from different perspectives, or to go a step further and look for more specific information, data comparability is an important issue.
Real-time updating (e.g. a real-time data warehouse) might be a solution, since it enables a data warehouse to react “just-in-time” and also provides the best consistency (Bruckner et al., 2001). However, not everyone needs or can benefit from a real-time data warehouse. In fact, it is highly possible that only a relatively small portion of the business community will realize a justifiable ROI (return on investment) from a real-time data warehouse (Vandermay, 2001). Real-time data warehouses are expensive to build, requiring a significantly higher level of support and significantly greater investment in infrastructure than a traditional data warehouse. In addition, real-time updates incur a high response-time cost and require a large amount of storage space for aggregation.
As a result, it is desirable to find an alternative solution for data consistency in a data warehouse system (DWS) that can achieve near-real-time results without incurring a high cost.
Motivation And Objective
Integrating active rules and data warehouse systems has been one of the most important trends in data warehousing (DM Review, 2001). Active rules have also been used in databases for several years (Paton & Díaz, 1999; Roddick & Schrefl, 2000), and much research has been done in this field. It is possible to construct relations between different data cubes or even between data marts. However, anomalous updates can occur when each data mart obtains data from the same source at its own timestamp. This raises the problem of controlling data consistency across data marts and data cubes.
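The anomalous-update situation described above can be made concrete with a small sketch. It is not the TDCM itself, only an illustration of the underlying check: each cube records when it last loaded data from the shared source, and a cube is flagged when it lags the most recently refreshed cube by more than a user-defined validity threshold. The names `CubeMeta` and `stale_cubes`, the cube names, and the timestamps are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class CubeMeta:
    """Hypothetical metadata record for one data cube."""
    name: str
    last_refresh: datetime  # when the cube last loaded data from the shared source


def stale_cubes(cubes, threshold):
    """Return names of cubes whose refresh lags the newest cube by more than threshold."""
    newest = max(c.last_refresh for c in cubes)
    return [c.name for c in cubes if newest - c.last_refresh > threshold]


# Illustrative timestamps only: three cubes refreshed on different days.
cubes = [
    CubeMeta("sales", datetime(2024, 1, 8, 6, 0)),
    CubeMeta("inventory", datetime(2024, 1, 1, 6, 0)),
    CubeMeta("finance", datetime(2024, 1, 7, 6, 0)),
]

# With a 2-day threshold, only the inventory cube (7 days behind) is flagged.
print(stale_cubes(cubes, timedelta(days=2)))
```

In this sketch a query that joins the sales and inventory cubes would combine data that are a week apart, which is the kind of temporal inconsistency the rest of the paper sets out to control.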
Numerous studies have discussed the maintenance of data cubes with respect to the space problem and retrieval efficiency, either by pre-computing a subset of the “possible group-bys” (Harinarayan et al., 1996; Gupta et al., 1997; Baralis et al., 1997), by estimating the values of the group-bys using approximation (Gibbons & Matias, 1998; Acharya et al., 2000), or by using online aggregation techniques (Hellerstein et al., 1997; Gray et al., 1996). However, these solutions focus on the consistency of a single data cube, not on the perspective of the overall data warehouse environment. Thus, each department in the enterprise will still face problems of temporal inconsistency over time.