A Framework for Evaluating Design Methodologies for Big Data Warehouses: Measurement of the Design Process

Francesco Di Tria (Department of Computer Science, University of Bari Aldo Moro, Bari, Italy), Ezio Lefons (Department of Computer Science, University of Bari Aldo Moro, Bari, Italy) and Filippo Tangorra (Department of Computer Science, University of Bari Aldo Moro, Bari, Italy)
Copyright: © 2018 | Pages: 25
DOI: 10.4018/IJDWM.2018010102

Abstract

This article describes how the evaluation of modern data warehouses must consider the new solutions adopted to cope with the radical changes caused by the need to reduce storage volume while increasing the velocity of multidimensional design and data processing, even in the presence of unstructured data that provide useful qualitative information. The aim is to set up a framework for evaluating the physical and methodological characteristics of a data warehouse, built by considering the factors that affect the data warehouse's lifecycle when the Big Data issues (Volume, Velocity, Variety, Value, and Veracity) are taken into account. The contribution is the definition of a set of criteria for classifying Big Data Warehouses on the basis of their methodological characteristics. Based on these criteria, the authors define a set of metrics for measuring the quality of Big Data Warehouses with reference to the design specifications. They show through a case study how the proposed metrics can check the eligibility of methodologies that fall into different classes in the Big Data context.

1. Introduction

The design of data warehouses in the context of Big Data requires new solutions for meeting the challenges and taking advantage of the opportunities introduced by novel data sources, such as social networks, that also provide companies with qualitative information (Value issue) about user preferences (Waters & Jamal, 2011). Indeed, these data are generated daily (Velocity issue) in massive quantities (Volume issue) (Chen et al., 2014) and usually appear in both structured and unstructured forms (Variety issue) (Buneman et al., 1997; Rehman et al., 2012). To be used effectively for business analytics and decision making, these data must be validated against a data quality model that checks their degree of reliability (Veracity issue). Each of these issues is addressed by emerging methods for data warehouse design.

First, the Value issue concerns the realization of a good-quality schema, where all the data sources contribute to the data warehouse in the same way. A good-quality schema is one that allows the extraction of all the information the decision makers are interested in and that presents no violations of the constraints in the data sources. To achieve this, hybrid methodologies are adopted, because they combine the best features of the traditional methodologies. By applying such methodologies, the designer produces a multidimensional schema that not only agrees with the data sources but also misses no requirement and discards no data source. On the other hand, the workflow of these methodologies is quite complex, because they integrate and reconcile both the requirement-oriented and the data-oriented approaches (Mazón & Trujillo, 2009; Mazón et al., 2007; Di Tria et al., 2015; Di Tria et al., 2012).
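
The reconciliation idea described above can be illustrated with a minimal Python sketch. It is not the workflow of any of the cited methodologies; the `reconcile` function, the `Reconciliation` type, and the sample requirement and source names are all hypothetical. It only checks the two properties named in the text: that no requirement is missed and that no data source is discarded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reconciliation:
    supported: set       # requirements backed by at least one source attribute
    unsupported: set     # requirements that no source can satisfy
    unused_sources: set  # sources that contribute to no requirement


def reconcile(requirements, sources):
    """Compare required concepts with the attributes offered by each source.

    requirements: set of concept names requested by decision makers.
    sources: dict mapping a source name to the set of attributes it exposes.
    """
    available = set().union(*sources.values()) if sources else set()
    unused = {name for name, attrs in sources.items()
              if not (attrs & requirements)}
    return Reconciliation(
        supported=requirements & available,
        unsupported=requirements - available,
        unused_sources=unused,
    )


# Hypothetical requirements and sources, for illustration only.
requirements = {"revenue", "product", "customer_sentiment"}
sources = {
    "erp": {"revenue", "product", "store"},
    "social": {"customer_sentiment", "post_date"},
    "legacy_hr": {"employee", "salary"},
}
result = reconcile(requirements, sources)
```

Here every requirement is supported, but the `legacy_hr` source contributes nothing, which a hybrid approach would flag for the designer rather than silently drop.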

The Velocity issue relates to the need to integrate new data sources as soon as possible and to accept new business requirements without performing a complete redesign. The aim is to modify an existing schema quickly, so that it provides timely, updated, and accurate information with reference to the most recent business goals. This aim can be reached using automatic and agile techniques: the former simulate the reasoning of an expert designer, avoiding repetitive tasks and human errors (Di Tria et al., 2014; Phipps & Davis, 2002), while the latter introduce adjustments to a multidimensional schema, letting the data warehouse evolve as business requirements change (Corr with Stagnitto, 2011).

The Volume issue concerns the problem of realizing a data warehouse without importing, replicating, and storing tens of terabytes through the ETL process. The solution is based on a virtual data warehouse, where the movement of data among systems is avoided. As a further consequence, the delays of the import phase that feeds the data warehouse are eliminated (Farooq & Sarwar, 2010), and the data to be used in the analytical phase are immediately available. As an alternative to the virtual data warehouse approach, the emerging non-relational models adopted in NoSQL databases provide more flexibility, because they allow denormalized, join-less schemas that can be exploited for analysing data according to novel paradigms, in addition to the traditional OLAP operators (Dehdouh et al., 2014). As a result, non-relational models are replacing traditional logical models (viz., ROLAP and MOLAP) (Chevalier et al., 2015).
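
As a rough illustration of the virtual approach, the following Python sketch uses the standard `sqlite3` module: two attached in-memory databases stand in for independent source systems, and a temporary view plays the role of the fact table, so no rows are copied. The database, table, and column names are assumptions made for this example, and a real virtual data warehouse would federate remote systems rather than local SQLite databases.

```python
import sqlite3

# The main connection plays the role of the (empty) virtual warehouse.
conn = sqlite3.connect(":memory:")

# Two separate in-memory databases stand in for independent source systems.
conn.execute("ATTACH DATABASE ':memory:' AS sales_src")
conn.execute("ATTACH DATABASE ':memory:' AS crm_src")
conn.execute("CREATE TABLE sales_src.orders (customer_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE crm_src.customers (id INTEGER PRIMARY KEY, region TEXT)")
conn.executemany("INSERT INTO sales_src.orders VALUES (?, ?)",
                 [(1, 10.0), (1, 5.0), (2, 7.5)])
conn.executemany("INSERT INTO crm_src.customers VALUES (?, ?)",
                 [(1, "north"), (2, "south")])

# The "fact table" is only a view over the sources: nothing is imported.
# A TEMP view is used because it may reference tables in attached databases.
conn.execute("""
    CREATE TEMP VIEW fact_sales AS
    SELECT c.region AS region, o.amount AS amount
    FROM sales_src.orders AS o
    JOIN crm_src.customers AS c ON c.id = o.customer_id
""")

QUERY = "SELECT region, SUM(amount) FROM fact_sales GROUP BY region"
totals_before = dict(conn.execute(QUERY))

# A new row in the source is visible to analysis at once, with no load step.
conn.execute("INSERT INTO sales_src.orders VALUES (2, 2.5)")
totals_after = dict(conn.execute(QUERY))
```

The second query sees the new order without any import phase, which is the property the text attributes to the virtual approach; the cost is that every analytical query is executed against the sources themselves.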

To address the Variety issue, recent papers have introduced a semantic level in multidimensional design, based on an ontological approach (He et al., 2011; Vranesic & Rovan, 2009; Di Tria et al., 2013; Khouri & Bellatreche, 2011; Thenmozhi & Vivekanandan, 2013). Since an ontology is a machine-processable conceptual representation of a domain of interest, it is used for automatically resolving syntactic and semantic inconsistencies in the schema integration process, even in the presence of unstructured data.
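
A very reduced sketch of this idea, in Python, is given next. It assumes a toy "ontology" that is just a mapping from a concept to the labels that denote it, so it resolves only synonyms; the approaches cited above rely on much richer ontological reasoning. The `ONTOLOGY` contents, the function names, and the sample schemas are hypothetical.

```python
# Toy ontology: each concept lists the labels (synonyms) that denote it.
ONTOLOGY = {
    "Customer": {"customer", "client", "cust"},
    "Revenue": {"revenue", "turnover", "sales_amount"},
}


def concept_of(attribute, ontology=ONTOLOGY):
    """Return the concept an attribute name denotes, or None if unknown."""
    name = attribute.strip().lower()
    for concept, labels in ontology.items():
        if name in labels:
            return concept
    return None


def integrate(schemas, ontology=ONTOLOGY):
    """Group attributes from several source schemas under shared concepts.

    schemas: dict mapping a source name to a list of attribute names.
    Returns (merged, unresolved): merged maps each concept to the list of
    (source, attribute) pairs that denote it; unresolved lists the pairs
    the ontology could not classify.
    """
    merged, unresolved = {}, []
    for source, attributes in schemas.items():
        for attribute in attributes:
            concept = concept_of(attribute, ontology)
            if concept is None:
                unresolved.append((source, attribute))
            else:
                merged.setdefault(concept, []).append((source, attribute))
    return merged, unresolved


# Hypothetical source schemas that name the same concept differently.
schemas = {"crm": ["Client", "turnover"], "web": ["customer", "clicks"]}
merged, unresolved = integrate(schemas)
```

In this example `Client` and `customer` are recognized as the same concept, while `clicks` is reported as unresolved so that the ontology can be extended instead of the attribute being dropped.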
