Discovering Higher Level Correlations from XML Data

Luca Cagliero (Politecnico di Torino, Italy), Tania Cerquitelli (Politecnico di Torino, Italy) and Paolo Garza (Politecnico di Milano, Italy)
Copyright: © 2012 | Pages: 28
DOI: 10.4018/978-1-61350-356-0.ch013


This chapter proposes the XML-GERMI framework to support XML data analysis by automatically extracting generalized association rules (i.e., higher level correlations) from XML data. The proposed approach extends the concept of multiple-level association rules. To drive the generalization phase of the extraction process, a taxonomy is exploited to aggregate features at different granularity levels. Experiments performed on both real and synthetic datasets show the adaptability and the effectiveness of the proposed framework in discovering higher level correlations from XML data.
Chapter Preview


During recent years, the eXtensible Markup Language (XML) has become a standard for representing, exchanging, and publishing information on the Web, ensuring interoperability among software systems. XML data is commonly represented either as graphs, e.g., GraphLog (Consens & Mendelzon, 1990), OEM (Papakonstantinou, Garcia-Molina, & Widom, 1995), G-Log (Paredaens, Peelman, & Tanca, 1995), and UnQL (Bunemann, Davidson, Fan, Hara, & Tan, 1996), or as labeled trees (Damiani, Oliboni, Quintarelli, & Tanca, 2003; Baralis, Garza, Quintarelli, & Tanca, 2007). These representations can store data that cannot be easily represented with traditional data models. Since a large amount of data is now stored in XML, novel and more efficient data mining techniques are needed (1) to analyze large collections of semistructured data and (2) to discover useful and interesting knowledge. For example, discovering association rules (Agrawal & Srikant, 1994) allows the identification of hidden and interesting correlations among data. Originally introduced in the context of market basket analysis, this mining activity now finds applications in a wide range of contexts, e.g., network traffic characterization (Baldi, Baralis, & Risso, 2005) and context-aware applications (Baralis, Cagliero, Cerquitelli, Garza, & Marchetti, 2009). However, the suitability of data mining approaches for business decisions strictly depends on the granularity level of the available information. Traditional association rule mining algorithms (e.g., Agrawal & Srikant, 1994) are sometimes not effective in mining valuable knowledge because the mined information is excessively detailed. Consider, for instance, a business-oriented scenario in which orders submitted by customers are stored in an XML data repository. Traditional rule mining may discover occurrences concerning specific customers and orders.
A minimum support threshold is enforced to discriminate among patterns by their strength (i.e., how frequently patterns occur). The discovery of higher level correlations regarding well-known categories of interest (e.g., order priority classes) may better support business decisions by both (1) providing a higher level view of the analyzed data, and (2) preventing relevant but infrequent knowledge from being discarded. For instance, a valuable occurrence about a low priority order category may be discovered even if each individual low priority order is infrequent with respect to the minimum support threshold (i.e., it would be discarded by a traditional rule miner). Indeed, our study is mainly focused on aggregating knowledge to discover and exploit valuable correlations, hidden in the analyzed data, at different levels of abstraction. To this aim, the mining task is driven by a conceptual taxonomy (i.e., an is-a hierarchy defined over items) to allow for discovering associations among data items at any aggregation level. This chapter presents one of the first attempts to exploit generalized mining in a data mining framework oriented to XML data.
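
The effect of a taxonomy on the support threshold can be illustrated with a minimal Python sketch. It is not the XML-GERMI implementation: the item names, the taxonomy, the transactions, and the helper functions `extend` and `frequent_items` are all hypothetical, and the sketch only counts single items (not full rules). It uses the common idea of extending each transaction with the ancestors of its items before counting support.

```python
from collections import Counter

# Hypothetical is-a taxonomy: each specific order maps to its priority class.
TAXONOMY = {
    "order:o1": "priority:low",
    "order:o2": "priority:low",
    "order:o3": "priority:low",
    "order:o4": "priority:high",
}

# Hypothetical transactions, one set of items per submitted order.
TRANSACTIONS = [
    {"order:o1", "region:EU"},
    {"order:o2", "region:EU"},
    {"order:o3", "region:US"},
    {"order:o4", "region:EU"},
]


def extend(transaction, taxonomy):
    """Return the transaction plus every ancestor of its items."""
    extended = set(transaction)
    for item in transaction:
        parent = taxonomy.get(item)
        while parent is not None:  # climb the hierarchy up to the root
            extended.add(parent)
            parent = taxonomy.get(parent)
    return extended


def frequent_items(transactions, taxonomy, min_support):
    """Map each item (specific or generalized) to its support if frequent."""
    counts = Counter()
    for transaction in transactions:
        counts.update(extend(transaction, taxonomy))
    total = len(transactions)
    return {
        item: count / total
        for item, count in counts.items()
        if count / total >= min_support
    }


frequent = frequent_items(TRANSACTIONS, TAXONOMY, min_support=0.5)
print(frequent)
```

In this toy dataset every individual order has support 0.25 and falls below the 0.5 threshold, whereas the generalized item `priority:low` reaches 0.75 and is retained.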

This chapter thoroughly describes the problem of generalized association rule mining from XML data. After a brief overview of related work that addresses the mining of hidden correlations from XML data, we present our framework called XML-GERMI (XML-GEneralized Rule MIner). The proposed framework first performs XML data pre-processing to tailor XML data to a transactional data format. Then, it performs the extraction of generalized association rules through the evaluation of a taxonomy built over data items. Although XML-GERMI is flexible in that it can easily integrate any rule mining algorithm, currently two algorithms, Cumulate (Agrawal & Srikant, 1995) and GenIO (Baralis, Cagliero, Cerquitelli, D’Elia, & Garza, 2010), are available in XML-GERMI to efficiently extract high level correlations. To effectively support end-users in exploring the extracted knowledge, mined patterns are stored in an XML data repository that can be queried by means of the XQuery language (XQuery, 2010). Experiments performed on both real and synthetic datasets show the effectiveness of the XML-GERMI framework in discovering higher level and interesting correlations from XML data.
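
The pre-processing step described above, which tailors XML data to a transactional format, might look roughly like the following Python sketch. The XML layout, the element names, and the function `xml_to_transactions` are assumptions made for illustration only; the chapter preview does not specify how XML-GERMI actually performs this conversion.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML repository of customer orders.
XML_DOC = """
<orders>
  <order>
    <customer>C1</customer>
    <priority>low</priority>
  </order>
  <order>
    <customer>C2</customer>
    <priority>high</priority>
  </order>
</orders>
"""


def xml_to_transactions(xml_text, record_tag):
    """Turn each <record_tag> element into a set of 'tag=value' items."""
    root = ET.fromstring(xml_text.strip())
    transactions = []
    for record in root.iter(record_tag):
        items = set()
        for child in record:
            value = (child.text or "").strip()
            if value:  # skip empty elements
                items.add(f"{child.tag}={value}")
        transactions.append(items)
    return transactions


transactions = xml_to_transactions(XML_DOC, "order")
print(transactions)
```

Each resulting transaction is a flat set of items, which is the input shape that itemset and rule mining algorithms generally expect.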
