Article Preview
TopIntroduction
A Data Warehouse (DW) integrates data coming from different sources (Vaisman & Zimányi, 2014) and supply it to various systems, including Business Intelligence (BI) applications. Data warehouses change in their content and schema. Typically, content changes are due to routine business operations or the correction of existing data. An example of a content change is a modification in the price of a product.
In practice, changes to a DW schema result from (1) the evolution of external data sources, (2) changes of the real-world represented by a DW, (3) new user requirements, and (4) the creation of simulation environments to list the most common causes. An example of a schema change is a change in the geographical hierarchy of a sales network. As reported in (Moon, Curino, Deutsch, Hou, & Zaniolo, 2008; Qiu, Li, & Su, 2013; Sjøberg, 1993; Vassiliadis, Zarras, & Skoulis, 2017), the schemas of data sources change frequently. For example, the Wikipedia schema changed on average every 9-10 days during the 4.5 years of its lifetime. As a result of the changes in data sources, the content and schema of the related DWs must also change. Real-world examples of scenarios leading to changes in DWs can be found in (Eder, Koncilia, & Kogler, 2002; Rundensteiner, Koeller, & Zhang, 2000).
A DW should keep track of the evolution of its content and schema to reconstruct the state of the world under consideration at any instant without losing data. A temporal data warehouse (TDW) (Golfarelli & Rizzi, 2009) keeps track of the evolution of its contents whereas, a multiversion data warehouse (MVDW), based on multischema data management principles (Roddick, 1995; Herrmann, Voigt, Pedersen, & Lehner, 2018), handles content and schema changes by creating multiple and persistent DW versions.
Even though version management in databases has been researched for over 30 years (in the context of object databases, relational databases, data warehouses, XML databases), it is still an active research field. This research is regaining its importance in the context of NoSQL storage and data lakes. Support for version management was explicitly stated as a requirement for data lakes management systems (Nargesian, Zhu, J. Miller, Pu, & Patricia C., 2019).
The temporal DW-based approaches profit from the support of various temporal query languages to analyze changing data and from the existence of multiple index structures. Such approaches are suitable for representing historical data versions and not for representing and managing schema changes. On the other hand, the MV approaches allow managing both data and schema changes. However, the implementation of such approaches is more complicated. Moreover, they possess the limited capabilities of querying DW versions.