A Framework for Mining and Querying Summarized XML Data through Tree-Based Association Rules

Mirjana Mazuran (Politecnico di Milano, Italy), Elisa Quintarelli (Politecnico di Milano, Italy), Angelo Rauseo (Politecnico di Milano, Italy) and Letizia Tanca (Politecnico di Milano, Italy)
Copyright: © 2012 | Pages: 24
DOI: 10.4018/978-1-61350-356-0.ch012


In this work we describe the TreeRuler tool, which lets inexperienced users access huge XML (or relational) datasets. TreeRuler has two main features: (1) it mines all the frequent association rules from the input documents without any a priori specification of the desired results, and (2) it uses the previously mined knowledge to provide quick, summarized, and thus often approximate, answers to users’ queries. TreeRuler was developed within the Odyssey EU project, which deals with information about crimes, for both the relational and the XML data models. In this chapter we focus mainly on the objectives, strategies, and difficulties encountered in the XML context.
Chapter Preview


One of the trickiest problems in finding information in large datasets is providing fast and concise answers. This is a consequence of the enormous amount of available data: useful information lies behind a thick, opaque wall, namely the “noise” generated by all the uninteresting data around it.

An experienced user, with a good understanding of the document structure, can obtain what s/he seeks by carefully selecting the dataset content. Inexperienced users, by contrast, need the support of a knowledge discovery system able to search, retrieve and “highlight” information starting from simple inputs.

Data mining techniques can be successfully applied to face the challenges of this scenario. They offer a privileged way to deal with the information overload problem by extracting frequent patterns and providing intensional, often approximate, information both about the content and the structure of a document. An intensional representation of a dataset is a set of patterns (e.g., association rules, clusters, etc.) describing the most relevant properties of the dataset. Intensional information is thus a summarized representation of the original document, which means that less space is required to store it and less time is required to query it.
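To make the notion of an intensional pattern concrete, the following minimal sketch computes the two standard measures of an association rule, support and confidence, over a toy dataset. The records, item names, and function names are hypothetical and chosen only for illustration; they are not taken from TreeRuler or from the Odyssey data.

```python
from typing import FrozenSet, List

# Hypothetical toy dataset: each record is a set of "attribute=value" items.
RECORDS: List[FrozenSet[str]] = [
    frozenset({"type=theft", "city=Milan", "time=night"}),
    frozenset({"type=theft", "city=Milan", "time=day"}),
    frozenset({"type=fraud", "city=Rome", "time=day"}),
    frozenset({"type=theft", "city=Milan", "time=night"}),
]


def support(itemset: FrozenSet[str], records: List[FrozenSet[str]]) -> float:
    """Fraction of records that contain every item of `itemset`."""
    return sum(1 for r in records if itemset <= r) / len(records)


def confidence(body: FrozenSet[str], head: FrozenSet[str],
               records: List[FrozenSet[str]]) -> float:
    """Among the records that contain `body`, the fraction that also contain `head`."""
    body_support = support(body, records)
    if body_support == 0:
        return 0.0
    return support(body | head, records) / body_support


# The rule "type=theft => city=Milan" summarizes three of the four records.
body = frozenset({"type=theft"})
head = frozenset({"city=Milan"})
rule_support = support(body | head, RECORDS)
rule_confidence = confidence(body, head, RECORDS)
```

A single rule with its two measures stands in for several records, which is the sense in which the intensional representation needs less space and less time to query than the original data.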

The extraction of intensional information through the use of data mining techniques has been proposed in the literature, both with respect to the relational model (Agrawal, Imieliński, & Swami, 1993; Agrawal & Srikant, 1995) and to the XML format (Braga, Campi, Klemettinen, & Lanzi, 2002; Oliboni, Combi, & Rossato, 2005; Youn, Paik, & Kim, 2005; Weigand, Feng, Dillon, & Chang, 2003; Liu & Zeleznikow, 2005; Wan & Dobbie, 2005; Wang & Liu, 2000). However, while many algorithms (downloadable from Goethals & Zaki, 2004) and tools (e.g., Weka) have been proposed in the relational context, the literature on this topic is not as rich in the XML context. The major difficulty is that XML is more expressive than the relational format and represents both the structure and the content of information in a different (i.e., hierarchical) way. This has made it difficult to agree on a generally accepted definition of what an association rule or a cluster should look like in the XML context.

Nevertheless, given the tree-based nature of XML documents, there have been a number of attempts to use data mining to extract frequent tree-shaped XML patterns (Li, Xiao, Yao, & Dunham, 2003; Berzal, Jiménez, & Cubero, 2008; Termier, Rousset, & Sebag, 2004; Kawasoe, Arimura, Sakamoto, Asai, Abe, & Arikawa, 2002; Yang, Xia, Chi, & Muntz, 2004; Zaki, 2005). More information about frequent subtree mining can be found in Chi, Muntz, Nijssen, and Kok (2004); see also the chapter “Frequent Pattern Discovery and Association Rule Mining of XML Data”. Moreover, the Background section highlights the differences between our approach and the literature.

The research presented in the chapter addresses the problems of: (1) extracting intensional information from XML datasets without guiding the mining process, (2) representing it by means of appropriate association rules, and (3) allowing users to use such information in the query-answering process.

We describe the TreeRuler tool, which helps casual (and possibly inexperienced) users manage huge amounts of XML data easily. Our main objectives are:

  1. applying efficient data mining techniques to extract a summarized view of both the content and the structure of huge XML documents;

  2. using the extracted information to provide users with intensional query-answering capabilities, that is, the ability to query the extracted knowledge rather than the original dataset.
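The two objectives can be illustrated with a deliberately simplified sketch. It flattens each XML record into (tag, value) pairs and mines only one-to-one rules, whereas the rules discussed in this chapter are tree-shaped; the sample document, the thresholds, and all function names are assumptions made for illustration and do not reproduce TreeRuler's algorithm or interface.

```python
import xml.etree.ElementTree as ET
from itertools import permutations

# Hypothetical sample document; element names are illustrative only.
XML = """
<crimes>
  <crime><type>theft</type><city>Milan</city></crime>
  <crime><type>theft</type><city>Milan</city></crime>
  <crime><type>theft</type><city>Milan</city></crime>
  <crime><type>fraud</type><city>Rome</city></crime>
</crimes>
"""


def record_items(record):
    """Flatten one record element into a set of (tag, text) pairs."""
    return frozenset((child.tag, (child.text or "").strip()) for child in record)


def mine_rules(records, min_support=0.5, min_confidence=0.8):
    """Mine one-to-one rules body => head that meet both thresholds."""
    n = len(records)
    items = set().union(*records)
    rules = []
    for body, head in permutations(items, 2):
        body_count = sum(1 for r in records if body in r)
        both_count = sum(1 for r in records if body in r and head in r)
        if body_count == 0:
            continue
        sup, conf = both_count / n, both_count / body_count
        if sup >= min_support and conf >= min_confidence:
            rules.append((body, head, sup, conf))
    return rules


def answer(rules, tag, value):
    """Answer a query from the mined rules only, without touching the document."""
    return [rule for rule in rules if rule[0] == (tag, value)]


# Objective 1: extract a summarized view of the document.
records = [record_items(crime) for crime in ET.fromstring(XML.strip())]
rules = mine_rules(records)

# Objective 2: query the extracted knowledge rather than the original dataset.
summary = answer(rules, "type", "theft")
```

Here the query about thefts is answered from the rule set alone: the result states that thefts occur in Milan with the mined support and confidence, which is a summarized and possibly approximate answer rather than a list of matching records.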
