Big Data Warehouse Automatic Design Methodology

Francesco Di Tria, Ezio Lefons, Filippo Tangorra
Copyright: © 2014 |Pages: 35
DOI: 10.4018/978-1-4666-4699-5.ch006


Traditional data warehouse design methodologies are based on two opposite approaches. The first is data oriented and aims to build the data warehouse mainly by reengineering well-structured data sources alone, while minimizing the involvement of end users. The second is requirement oriented and aims to build the data warehouse solely on the basis of the business goals expressed by end users, with no regard to the information obtainable from the data sources. Since neither approach can address the problems that arise when dealing with big data, the need has emerged for hybrid methodologies, which define multidimensional schemas by considering user requirements and reconciling them against non-structured data sources. On the other hand, hybrid methodologies may require a more complex design process. For this reason, current research is devoted to introducing automation in order to reduce the design effort and to support the designer in creating the big data warehouse. In this chapter, the authors present a methodology based on a hybrid approach that adopts a graph-based multidimensional model. In order to automate the whole design process, the methodology has been implemented using logic programming.

1. Introduction

Big data warehousing commonly refers to the activity of collecting, integrating, and storing very large volumes of data coming from data sources that may contain both structured and unstructured data. However, volume alone does not imply big data. Further specific issues relate to the velocity at which data are generated, and to their variety and complexity.

The increasing volume of data stored in data warehouses is mainly due to their role in preserving historical data, which are used to perform statistical analyses and to extract significant information, hidden relationships, and regular patterns. Other factors that affect the growth in size derive from the need to integrate several data sources, each of which provides a different variety of data and so enriches the types of analyses by allowing a large set of parameters to be correlated. Furthermore, some data sources, such as Internet transactions, networked devices, and sensors, generate billions of data items very quickly. These data should update the data warehouse as soon as possible, in order to provide fresh information and support timely decisions (Helfert & Von Maur, 2001).

These issues affect the design process, because big data warehouses must integrate heterogeneous data that are used to perform analyses from many points of view, and this produces complex schemas whose cubes have a high number of dimensions. Furthermore, they must be capable of quickly integrating new data sources through a minimal data modelling process.

To summarize, new requirements for data warehouses that support the analysis of big data were stated in Cohen et al. (2009). Big data warehouses have to be (i) magnetic, in that they must attract all the data sources available in an organization; (ii) agile, in that they should support continuous and rapid evolution; and (iii) deep, in that they must support analyses more sophisticated than traditional OLAP functions.
