1. Introduction
Typically, a decision support system is based on two components: a Data Warehouse (DW) and one or more Data Marts (DMs) (Figure 1). The DW is a database used for decision-making, where data are either gathered from existing sources or directly entered to meet the needs of a decision-support application (for the latter case, see the medical application presented in Section Motivation). Starting from the DW, we extract subsets, called Data Marts, on which we apply OLAP operations. DMs are designed according to a multidimensional model (star schema, snowflake schema or fact constellation schema) (Teste, 2010) in order to meet the particular demands of a specific group of decision makers. In contrast, the DW is not directly accessible to decision makers, so there is no need to describe it with a multidimensional model; the relational model has traditionally been the most effective choice for this purpose.
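To make the multidimensional vocabulary concrete, the following is a minimal sketch of a star schema, using plain Python dictionaries and hypothetical sales data (the table names, attributes, and the `rollup_by` helper are all illustrative, not taken from the paper):

```python
# Minimal star schema sketch: one fact table referencing two
# dimension tables by key (hypothetical sales data, for illustration).

# Dimension tables: descriptive attributes, one row per key.
dim_time = {
    1: {"month": "Jan", "year": 2024},
    2: {"month": "Feb", "year": 2024},
}
dim_product = {
    10: {"name": "aspirin", "category": "pharmacy"},
    11: {"name": "bandage", "category": "first-aid"},
}

# Fact table: foreign keys into each dimension plus a numeric measure.
fact_sales = [
    {"time_id": 1, "product_id": 10, "amount": 120.0},
    {"time_id": 1, "product_id": 11, "amount": 40.0},
    {"time_id": 2, "product_id": 10, "amount": 75.0},
]

def rollup_by(dim, dim_key, attr, facts):
    """A simple OLAP-style roll-up: sum the 'amount' measure,
    grouped by one attribute of one dimension."""
    totals = {}
    for row in facts:
        group = dim[row[dim_key]][attr]
        totals[group] = totals.get(group, 0.0) + row["amount"]
    return totals

print(rollup_by(dim_product, "product_id", "category", fact_sales))
# {'pharmacy': 195.0, 'first-aid': 40.0}
```

A snowflake schema would further normalize the dimension tables (e.g. splitting product category into its own table), and a fact constellation would share these dimensions across several fact tables.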
The influence of Big Data has challenged this traditional approach of using relational databases for data warehousing. This is primarily because data have become highly distributed and loosely structured, and are growing at exponential rates. The concept of Big Data is usually characterized by Volume, Variety and Velocity, known as the 3Vs (Douglas, 2001). Volume is the size of the data set that needs to be processed; Variety describes the different data types involved, including factors such as format, structure and source; and Velocity refers to the speed at which data must be analyzed and processed. Most organizations need to improve their decision-making process using Big Data. To achieve this, they have to store Big Data, analyze it, and transform the results into useful and valuable information. Supporting these storage and analytical processes raises new challenges in designing and creating a DW.
Indeed, the database used for data warehousing must now satisfy some new requirements. It should be able to: (1) integrate all possible data structures, (2) combine multiple data sources, (3) scale at relatively low cost, and (4) analyze large volumes of data. Relational warehouses are a mature data management technology. However, with the rise of Big Data, these systems have become unfit for large, distributed data management. The major problems of relational technologies are: (1) horizontal scaling: relational databases were mainly designed for single-server configurations. To scale a relational database, it has to be distributed across multiple powerful servers, which are expensive; furthermore, handling tables split across different servers is difficult. (2) a strict data model that must be designed prior to data processing: in a Big Data context, it should be easy to add and analyze new data regardless of its type (structured, semi-structured or unstructured), but relational models are hard to change incrementally without impacting performance or taking the database offline. As a result, a new kind of DBMS, known as “NoSQL” (Cattell, 2011), has appeared. NoSQL databases are well suited to managing large volumes of data and maintain good performance when scaled (Angadi, 2013). Using NoSQL for data warehousing has become a necessity for a number of reasons, mainly relating to the high performance provided by these systems (Herrero, 2016).
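The schema-flexibility argument can be sketched as follows, with plain Python dictionaries standing in for NoSQL documents (the `patients` collection and its attributes are hypothetical examples, not the paper's data):

```python
# Sketch of the schema-flexibility argument: in a document store,
# records of one collection need not share a fixed set of columns,
# so new kinds of data can be added without an offline schema change.
# (Plain Python dicts stand in for NoSQL documents here.)

patients = [
    {"id": 1, "name": "Doe", "visits": 3},                # structured
    {"id": 2, "name": "Roe", "notes": "free-text note"},  # semi-structured
]

# Adding a record with a brand-new attribute requires no ALTER TABLE,
# unlike a relational schema change:
patients.append({"id": 3, "name": "Poe", "radiology": ["report-17"]})

# In exchange, queries must tolerate missing attributes instead of
# relying on a schema enforced by the database:
with_notes = [p["id"] for p in patients if "notes" in p]
print(with_notes)  # [2]
```

This flexibility is exactly what requirement (1) above asks of a Big Data warehouse; the trade-off is that consistency checks move from the schema into the application code.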
This work deals with creating a DW in a Big Data context, and is motivated by the needs of a medical application. This application generates a continuous stream of complex data (patient histories, visit summaries, paper prescriptions, radiology reports, etc.) that will be directly entered into a DW (§2). To describe this DW, a conceptual data model closer to human thinking is required; our choice for such a model is UML (Abello, 2015). Our purpose is to assist developers in creating the DW on a NoSQL database. To this end, we propose an automatic process that transforms the UML conceptual model describing the DW into a NoSQL model.
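As a rough illustration of the kind of model-to-model transformation involved, the sketch below maps a conceptual star model (a fact class plus dimension classes; the class and attribute names are hypothetical) to a nested document schema in which each fact embeds its dimension attributes. This is only one simple mapping rule among several possible, not the paper's actual transformation:

```python
# Hedged sketch: transform a conceptual star model (fact class +
# dimension classes, all names hypothetical) into a NoSQL document
# schema where each dimension becomes an embedded sub-document.

conceptual_model = {
    "fact": {"name": "Hospitalization", "measures": ["cost", "duration"]},
    "dimensions": {
        "Patient": ["id", "name", "age"],
        "Diagnosis": ["code", "label"],
    },
}

def to_document_schema(model):
    """Map the fact class to one document collection; each measure
    becomes a numeric field and each dimension an embedded object."""
    doc = {measure: "number" for measure in model["fact"]["measures"]}
    for dim, attrs in model["dimensions"].items():
        doc[dim.lower()] = {attr: "value" for attr in attrs}
    return {model["fact"]["name"]: doc}

schema = to_document_schema(conceptual_model)
print(schema["Hospitalization"]["patient"])
# {'id': 'value', 'name': 'value', 'age': 'value'}
```

Embedding dimensions denormalizes the star schema into self-contained documents, which suits the distributed, join-averse query model of most NoSQL systems.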