1. Introduction
The design of data warehouses in the context of Big Data requires new solutions to overcome the challenges and take advantage of the opportunities introduced by novel data sources, such as social networks, which also provide companies with qualitative information (Value issue) about user preferences (Waters & Jamal, 2011). Indeed, these data are generated daily (Velocity issue) in massive amounts (Volume issue) (Chen et al., 2014) and usually appear in both structured and unstructured forms (Variety issue) (Buneman et al., 1997; Rehman et al., 2012). To be used effectively for business analytics and decision making, these data must be validated against a data quality model that checks their degree of reliability (Veracity issue). Each of these issues is addressed by emerging methods for data warehouse design.
First, the Value issue concerns the production of a good-quality schema, to which all the data sources contribute in the same way. A good-quality schema is one that allows decision makers to extract all the information they are interested in and that presents no violations of the constraints in the data sources. To achieve this, hybrid methodologies are adopted, because they combine the best features of traditional methodologies. Applying such methodologies, the designer produces a multidimensional schema that not only agrees with the data sources but also misses no requirement and discards no data source. On the other hand, the workflow of these methodologies is quite complex, because they integrate and reconcile both the requirement-oriented and the data-oriented approaches (Mazón & Trujillo, 2009; Mazón et al., 2007; Di Tria et al., 2015; Di Tria et al., 2012).
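As a minimal illustrative sketch of the reconciliation step in a hybrid methodology, the multidimensional elements requested by decision makers (requirement-driven) can be intersected with those supported by the data sources (data-driven), while unmet requirements are flagged instead of being silently dropped. All measure and dimension names below are hypothetical.

```python
# Hypothetical sketch of requirement/data reconciliation in a hybrid
# design methodology; element names are illustrative assumptions.
required = {"measures": {"revenue", "discount"},
            "dimensions": {"product", "store", "promotion"}}
available = {"measures": {"revenue", "discount", "quantity"},
             "dimensions": {"product", "store", "date"}}

# Keep only elements that are both requested and supported by the sources.
schema = {kind: required[kind] & available[kind] for kind in required}

# Requirements the sources cannot satisfy: these prompt the designer to
# look for further data sources rather than discard the requirement.
unmet = {kind: required[kind] - available[kind] for kind in required}
```

In this toy run, `schema` retains the shared elements, while `unmet` reports that the `promotion` dimension has no supporting source yet.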
The Velocity issue relates to the need to integrate new data sources as soon as possible and to accept new business requirements without performing a complete redesign process. The aim is to quickly modify an existing schema so as to provide timely, updated, and accurate information with reference to the most recent business goals. This aim can be reached using automatic and agile techniques: the former simulate the reasoning of an expert designer, avoiding repetitive tasks and human errors (Di Tria et al., 2014; Phipps & Davis, 2002), while the latter introduce adjustments to a multidimensional schema, letting the data warehouse evolve as business requirements change (Corr & Stagnitto, 2011).
The Volume issue addresses the problem of building a data warehouse without importing, replicating, and storing tens of terabytes through the ETL process. The solution is based on a virtual data warehouse, where the movement of data among systems is avoided. As a further consequence, the delays of the import phase for feeding the data warehouse are eliminated (Farooq & Sarwar, 2010) and the data to be used in the analytical phase are immediately available. As an alternative to the virtual data warehouse approach, the emerging non-relational models adopted in NoSQL databases provide more flexibility, for they allow denormalized, join-less schemas that can be exploited for analysing data according to novel paradigms, beyond the traditional OLAP operators (Dehdouh et al., 2014). Thus, non-relational models are actually replacing traditional logical models (viz., ROLAP and MOLAP) (Chevalier et al., 2015).
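As a minimal sketch of the contrast between a relational star schema and a denormalized, join-less NoSQL schema, the same sale can be stored once as a fact row referencing dimension tables, and once as a self-contained document whose aggregation requires no join. All table, field, and value names below are hypothetical.

```python
from collections import defaultdict

# Hypothetical star schema: the fact references dimensions by surrogate
# keys, so analysis requires joining fact and dimension tables.
dim_product = {1: {"name": "laptop", "category": "electronics"}}
dim_store = {10: {"city": "Bari", "country": "Italy"}}
fact_sales = [{"product_id": 1, "store_id": 10, "amount": 1200.0}]

# Denormalized document, as in NoSQL document/column stores: dimension
# attributes are embedded in the fact, so no join is needed.
sales_docs = [
    {"product": {"name": "laptop", "category": "electronics"},
     "store": {"city": "Bari", "country": "Italy"},
     "amount": 1200.0},
]

# Roll-up by product category, reading each document directly.
totals = defaultdict(float)
for doc in sales_docs:
    totals[doc["product"]["category"]] += doc["amount"]
```

The trade-off is the classic one: the embedded form duplicates dimension data across documents but makes each aggregation a single scan, which is what join-less analytical paradigms exploit.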
To face the Variety issue, recent papers have introduced a semantic level into multidimensional design, on the basis of an ontological approach (He et al., 2011; Vranesic & Rovan, 2009; Di Tria et al., 2013; Khouri & Bellatreche, 2011; Thenmozhi & Vivekanandan, 2013). Since an ontology is a machine-processable conceptual representation of a domain of interest, it can be used to solve, in an automatic way, syntactic and semantic inconsistencies in the schema integration process, even in the presence of unstructured data.
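To illustrate how an ontology can resolve semantic inconsistencies, a tiny synonym map to shared concepts can reconcile attribute names coming from two heterogeneous source schemas before they are merged. The ontology, attribute names, and helper function below are all hypothetical, shown only to make the idea concrete.

```python
# Hypothetical mini-ontology: each source attribute name maps to a
# shared domain concept, so synonymous attributes become one concept.
ontology = {
    "client": "customer", "customer": "customer",
    "cost": "price",      "price": "price",
    "town": "city",       "city": "city",
}

def canonicalize(schema):
    """Map each source attribute to its ontology concept (illustrative)."""
    return {ontology.get(attr, attr) for attr in schema}

# Two source schemas that describe the same entities with different names.
source_a = {"client", "cost", "town"}
source_b = {"customer", "price", "city"}

# After ontology-driven canonicalization, the schemas coincide and can
# be merged without duplicate, synonymous attributes.
merged = canonicalize(source_a) | canonicalize(source_b)
```

In a real ontological approach the mapping would be derived by reasoning over a formal ontology (e.g., expressed in OWL) rather than hand-written, but the integration principle is the same.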