Multidimensional Model Design using Data Mining: A Rapid Prototyping Methodology

Sandro Bimonte (IRSTEA, Clermont Ferrand, France), Lucile Sautot (TETIS, AgroParisTech, Montpellier, France), Ludovic Journaux (LE21, AgroSupDijon, Dijon, France) and Bruno Faivre (University of Burgundy Franche-Comté, Dijon, France)
Copyright: © 2017 | Pages: 35
DOI: 10.4018/IJDWM.2017010101


Designing and building a Data Warehouse (DW) and its associated OLAP cubes is a long process in which decision-maker requirements play an important role. However, decision-makers are not OLAP experts and can find it difficult to deal with the concepts behind DWs and OLAP. To support DW design in this context, we propose: (i) a new rapid prototyping methodology that integrates two different Data Mining (DM) algorithms to define dimension hierarchies according to decision-maker knowledge; (ii) a complete UML Profile to define a DW schema that integrates both DM algorithms; (iii) a mapping process to transform multidimensional schemata according to the results of the DM algorithms; (iv) a tool implementing the proposed methodology; and (v) a full validation based on a real case study concerning bird biodiversity. In conclusion, we confirm the rapidity and efficacy of our methodology and tool in providing a multidimensional schema that satisfies decision-maker analytical needs.

1. Introduction

Business Intelligence technology provides tools, such as Data Warehouses (DWs), On-Line Analytical Processing (OLAP), and Data Mining (DM), that allow decision-makers to explore huge volumes of data, in order to discover patterns and knowledge, and thus confirm their hypotheses.

DWs are large data repositories that support the decision-making process through flexible, interactive data analysis (Kimball, 1996). Warehoused data are built according to a multidimensional model that defines the concepts of facts and dimensions. Facts represent the objects of analysis and are described by numerical attributes, called measures. Facts are analyzed along dimensions, which represent the axes of analysis and are organized in hierarchies. Measures are aggregated along hierarchical levels with classical SQL aggregation functions (e.g. SUM, MIN, MAX), using OLAP operators (Inmon, 2005). OLAP systems allow decision-makers to visualize and explore facts during query sessions by applying these operators: Slice selects a subset of warehoused data; Roll-Up aggregates measures by moving up through a hierarchy; Drill-Down is the opposite of Roll-Up, etc.

A basic Relational OLAP (ROLAP) system architecture consists of: (i) a relational Data Base Management System (DBMS), which stores data in accordance with a multidimensional paradigm; (ii) an OLAP server, which implements the multidimensional model and OLAP operators on top of the DBMS; (iii) an OLAP client, which combines and synchronizes tabular and graphical displays, and allows DW queries; (iv) an ETL tool, which extracts data from multiple heterogeneous sources, then transforms and loads them into the DW.

The classic development cycle of DWs includes several steps, among which ETL design is typically the most time-consuming (Bimonte, Edoh-Alove et al., 2013). DW design methodologies can be characterized by the relative importance they give to user requirements (Romero & Abelló, 2009; Kimball, 1996): in requirement-driven approaches, the conceptual DW schema is based primarily on user requirements; in source-driven approaches, the conceptual DW schema is (semi-automatically) derived from the schemata of the data sources; in mixed approaches, these two processes are carried out in parallel.
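To make the multidimensional vocabulary above more concrete, the following minimal sketch stores one fact table and one dimension in a relational DBMS, as a ROLAP system would, and expresses Slice and Roll-Up as SQL queries. It is an illustration only, not the tool described in this article: the table names (`dim_location`, `fact_observation`), the `site` and `region` levels, the `abundance` measure, the `aggregate` helper, and all sample values are assumptions made up for the example.

```python
import sqlite3

# Hypothetical star schema: one fact table and one dimension with a
# two-level hierarchy (site -> region). All names and values are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_location (
        site_id INTEGER PRIMARY KEY,
        site    TEXT NOT NULL,   -- finest hierarchy level
        region  TEXT NOT NULL    -- coarser hierarchy level
    );
    CREATE TABLE fact_observation (
        site_id   INTEGER REFERENCES dim_location(site_id),
        year      INTEGER NOT NULL,
        abundance INTEGER NOT NULL   -- the measure
    );
""")
conn.executemany(
    "INSERT INTO dim_location VALUES (?, ?, ?)",
    [(1, "S1", "North"), (2, "S2", "North"), (3, "S3", "South")],
)
conn.executemany(
    "INSERT INTO fact_observation VALUES (?, ?, ?)",
    [(1, 2015, 10), (2, 2015, 5), (3, 2015, 7), (1, 2016, 4), (3, 2016, 8)],
)


def aggregate(level, year=None):
    """SUM the measure at one hierarchy level; `year` optionally slices the data."""
    if level not in ("site", "region"):  # whitelist: level is spliced into the SQL
        raise ValueError(f"unknown level: {level}")
    where = "WHERE f.year = ?" if year is not None else ""
    params = (year,) if year is not None else ()
    sql = f"""
        SELECT d.{level}, SUM(f.abundance)
        FROM fact_observation AS f
        JOIN dim_location AS d USING (site_id)
        {where}
        GROUP BY d.{level}
        ORDER BY d.{level}
    """
    return conn.execute(sql, params).fetchall()


by_site = aggregate("site")                    # finest level (Drill-Down view)
by_region = aggregate("region")                # Roll-Up: site -> region
north_south_2015 = aggregate("region", 2015)   # Slice on year, then Roll-Up
```

Moving from `aggregate("site")` to `aggregate("region")` is a Roll-Up, the reverse direction is a Drill-Down, and the `year` filter plays the role of a Slice. In a full ROLAP architecture these queries would be generated by the OLAP server rather than written by hand.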
Rapid DW prototyping is crucial when dealing with complex applications, and has therefore been the object of several studies (Bimonte, Edoh-Alove et al., 2013; Golfarelli & Rizzi, 2011; Huynh & Schiefer, 2001). The Bimonte et al. study presented a rapid, requirement-driven design methodology and tool, called ProtOLAP. Their methodology is based on conceptual DW models, which are then implemented automatically. After DW implementation, decision-makers must manually feed sample data into the prototype, dimension by dimension and level by level, for each hierarchy, to simulate an ETL process in the context of a requirement-driven methodology. However, feeding DWs with sample data is not always easy and, in some cases, dimensional data lack the hierarchical structure necessary to fit the user’s requirements.

Data Mining (DM) is the data exploration phase of the Knowledge Discovery in Databases (KDD) process (Fayyad et al., 1996). DM is a set of descriptive and predictive methods that aim to explore data by discovering a priori unknown links between data attributes (Tufféry, 2011). DM is at the interface between machine learning and statistics, and includes automatic and semi-automatic approaches. DM offers three main techniques:

  1. Clustering, or unsupervised classification: this approach organizes a data collection, in which each item is represented by a vector (a point in a multidimensional space), into classes (groups or clusters), based on the similarity between group members according to a mathematical indicator (Jain et al., 1999). Classes are not defined by analysts but discovered during the clustering process.

  2. Supervised classification: this approach assigns an item to a class chosen from a set of classes predetermined by analysts.

  3. Association rule learning: this approach discovers rules from data.
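As an illustration of the first technique, the sketch below groups items described by numeric vectors using a simple agglomerative procedure: every item starts in its own cluster, and the two clusters with the closest centroids are merged until `k` clusters remain. This is one possible clustering scheme chosen for brevity, not necessarily the algorithm used in this article; the function name `agglomerative`, the sample points, and the choice of Euclidean centroid distance as the similarity indicator are all assumptions.

```python
from itertools import combinations
from math import dist


def centroid(cluster):
    """Component-wise mean of the vectors in a cluster."""
    return tuple(sum(component) / len(cluster) for component in zip(*cluster))


def agglomerative(points, k):
    """Merge the two clusters with the closest centroids until k clusters remain."""
    clusters = [[p] for p in points]  # start with one cluster per item
    while len(clusters) > k:
        # Pick the pair of cluster indices (i < j) with the smallest
        # Euclidean distance between centroids.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: dist(centroid(clusters[ij[0]]), centroid(clusters[ij[1]])),
        )
        clusters[i].extend(clusters[j])
        del clusters[j]  # safe: j > i, so index i is unaffected
    return clusters


# Made-up items, each a point in a two-dimensional attribute space.
items = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.1), (5.2, 4.9), (9.0, 1.0)]
clusters = agglomerative(items, k=3)

# Each discovered cluster could then serve as a member of a coarser
# hierarchy level: map every item to the label of its cluster.
level_of = {p: f"group_{n}" for n, c in enumerate(clusters) for p in c}
```

The classes are not supplied by the analyst; only the number of clusters `k` is. The resulting item-to-group mapping has the same shape as a parent level in a dimension hierarchy, which is the kind of structure a Roll-Up operator aggregates over.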
