Current data warehouses deal mostly with numerical data. However, decision makers also need to analyze data in many other formats, which we qualify as complex data. Warehousing complex data is a new challenge for the scientific community. Indeed, it requires revisiting the whole warehousing process to take the complex structure of the data into account; many data warehousing concepts must therefore be redefined. In particular, modeling complex data in a single format for analysis purposes is a challenge. In this chapter, the authors present a complex data warehouse model at both the conceptual and logical levels. They show how XML is suitable for capturing the main concepts of their model, and present the main issues related to these data warehouses.
A commonly accepted definition of a data warehouse is the one given by Inmon (2002): “A data warehouse is a subject-oriented, integrated, non-volatile, and time variant collection of data in support of management’s decisions.” Traditional data warehouses apply to structured data and have gained maturity, as witnessed by the number of related tools. As structured data are not the only data needed for decision making, new generations of data warehouses have emerged that take different data structures into account. In this chapter, we focus on one kind of these data warehouses: XML warehouses. As this is a new research field, there is no common definition of an XML warehouse. This is because XML is used differently in different contexts: as a format for data sources, as a means of data integration or exchange between traditional data warehouses, or as a language to describe the warehouse itself. In this section, we present the research work on XML warehouses. For the sake of clarity, we group this work according to the following data warehousing concepts: data preparation, data modeling, data storage, data exchange, and data analysis.
In a data warehouse, data are physically integrated from different sources. Because the sources are usually independent of each other, integration may cause many problems, such as data redundancy and inconsistency. Thus, data need to be cleaned before being integrated. In this context, Rusu et al. (2005) consider XML sources and provide a method for cleaning XML data. Their method consists of four steps: correcting XML schemas, eliminating redundancy, eliminating inconsistency, and eliminating errors.
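To make the redundancy-elimination step concrete, the following is a minimal sketch in Python, using the standard `xml.etree.ElementTree` module. It is illustrative only and not Rusu et al.'s actual algorithm: duplicate sibling elements are detected simply by comparing their full serializations.

```python
# Illustrative sketch of redundancy elimination when cleaning XML data
# (an assumption for exposition, not Rusu et al.'s method): sibling
# elements whose serialization repeats an earlier one are removed.
import xml.etree.ElementTree as ET

def deduplicate_children(parent):
    """Remove sibling elements that are exact duplicates of an earlier one."""
    seen = set()
    for child in list(parent):
        key = ET.tostring(child)          # serialization used as identity key
        if key in seen:
            parent.remove(child)          # redundant copy: drop it
        else:
            seen.add(key)
            deduplicate_children(child)   # recurse into kept subtrees

doc = ET.fromstring(
    "<orders>"
    "<order id='1'><item>book</item></order>"
    "<order id='1'><item>book</item></order>"  # exact duplicate
    "<order id='2'><item>pen</item></order>"
    "</orders>"
)
deduplicate_children(doc)
print(len(doc.findall("order")))  # 2 orders remain after deduplication
```

A real cleaning step would of course need a more refined notion of duplication (e.g., matching on keys rather than whole subtrees), which is precisely what makes this phase non-trivial.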
Golfarelli et al. (2001) also consider XML data sources with associated DTDs. Prior to the modeling activity, the authors require the DTDs to be simplified by flattening their element definitions, grouping same-named sub-elements, and reducing nested unary operators to a single one. A similar approach is found in (Vrdoljak et al., 2003), but it applies to XML Schemas rather than DTDs.
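The grouping and operator-reduction steps can be sketched on a toy content model. The representation below, a list of (name, operators) pairs standing for a DTD element declaration, and the merging rules are illustrative assumptions, not Golfarelli et al.'s exact procedure.

```python
# Hedged sketch of two DTD simplification steps on a toy content model:
# collapsing nested unary operators (?, *, +) and grouping same-named
# sub-elements. Representation and rules are illustrative assumptions.

def collapse_operators(ops):
    """Reduce a stack of unary operators to a single equivalent one."""
    if not ops:
        return ""
    if "*" in ops or ("+" in ops and "?" in ops):
        return "*"            # mixing optionality and repetition gives 0..n
    return ops[0]             # all '?' -> '?', all '+' -> '+'

def simplify(model):
    """Group same-named sub-elements, merging their cardinalities."""
    grouped = {}
    for name, ops in model:
        op = collapse_operators(ops)
        if name in grouped:
            # Merged element repeats; it is optional only if both were.
            both_optional = op in ("?", "*") and grouped[name] in ("?", "*")
            grouped[name] = "*" if both_optional else "+"
        else:
            grouped[name] = op
    return list(grouped.items())

# Toy declaration: <!ELEMENT section (title, para+, ((figure*)+), para?)>
model = [("title", ""), ("para", "+"), ("figure", "*+"), ("para", "?")]
print(simplify(model))  # [('title', ''), ('para', '+'), ('figure', '*')]
```

Here the nested `(figure*)+` collapses to `figure*`, and the two `para` occurrences are grouped into a single repeatable `para+`, mirroring the flattening the authors require before modeling.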