Mining XML Documents
Laurent Candillier (Université Charles de Gaulle, France), Ludovic Denoyer (Université Pierre et Marie Curie, France), Patrick Gallinari (Université Pierre et Marie Curie, France), Marie Christine Rousset (LSR-IMAG, France), Alexandre Termier (Institute of Statistical Mathematics, Japan) and Anne-Marie Vercoustre (INRIA, France)
Copyright: © 2008
XML documents are becoming ubiquitous because of their rich and flexible format that can be used for a variety of applications. Given the increasing size of XML collections as information sources, mining techniques that traditionally exist for text collections or databases need to be adapted, and new methods need to be invented, to exploit the particular structure of XML documents. XML documents can be seen as trees, which are well known to be complex structures. This chapter describes various ways of using and simplifying this tree structure to model documents and support efficient mining algorithms. We focus on three mining tasks: classification and clustering, which are standard for text collections, and the discovery of frequent tree structures, which is especially important for heterogeneous collections. The chapter presents some recent approaches and algorithms to support these tasks, together with experimental evaluations on a variety of large XML collections.