Data mining is concerned with the general problem of digging out “the hidden gold”, in the form of knowledge patterns, from massive amounts of data. The information overload that characterizes the digital era we are living in is further exacerbated by the textual nature of the majority of data available in existing information sources. Moreover, text data have a “semistructured” nature in most such sources, primarily on the Web but also in digital libraries, company data repositories, and scientific databases.
Semistructured text data is the connection point between natural-language text and rigidly structured tuples of typed data. For example, a news article may contain a few structured fields (such as news channel, headline, author, location, and publication date) but also a largely unstructured text component (the article body). Semistructured data also enables the representation and description of complex real-life objects and their relationships, thus opening up a wide range of possibilities for human-machine-human communication.
XML is the preeminent form of representation of semistructured data. In contrast to most Web pages, which are encoded as HTML documents, XML is well-defined and flexible; its markup is used to structure and model data and to encode semantics, rather than to address presentation and layout. While HTML is designed primarily for human-readable documents, XML supports the exchange of machine-readable data. With XML, information representation is separated from information rendering, so that the same document can be presented through different views.
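To make the news-article example concrete, the following minimal sketch separates the structured fields of an article from its unstructured body. The element names (`channel`, `headline`, `body`, and so on) and the sample values are hypothetical and chosen only for illustration; real news feeds use their own vocabularies.

```python
import xml.etree.ElementTree as ET

# Hypothetical news article: a few typed fields plus a free-text body.
ARTICLE_XML = """
<article>
  <channel>World News</channel>
  <headline>Example headline</headline>
  <author>Jane Doe</author>
  <location>Rome</location>
  <date>2011-05-01</date>
  <body>Free-running article text with no fixed internal structure.</body>
</article>
"""


def split_article(xml_text):
    """Return (structured_fields, body_text) for one article document."""
    root = ET.fromstring(xml_text.strip())
    # Every direct child except <body> is treated as a structured field.
    fields = {
        child.tag: (child.text or "").strip()
        for child in root
        if child.tag != "body"
    }
    body = (root.findtext("body") or "").strip()
    return fields, body


fields, body = split_article(ARTICLE_XML)
```

Here `fields` behaves like a tuple of typed attributes, whereas `body` remains plain text that would need text-mining techniques to be analyzed further.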
XML makes it possible to define complex document structures, such as unbounded nesting and object-oriented hierarchies, and to specify not only data but also the data structures, how elements are nested, and their content models. The flexible nature of XML syntax simplifies the definition and deployment of arbitrary languages for domain-specific markup, enabling automatic authoring and processing of networked data. It has been recognized that an important role of XML vocabularies is the ability to model a large variety of data types and their many interrelationships, while being flexible enough to support new information as it is discovered. XML is indeed conceived to couple data with its context (metadata) through an extensible, hierarchical tag structure, which is essential to handle taxonomies (as in life sciences) or other conceptual structures. As a consequence, XML has rapidly become the preferred meta-language for disseminating information in on-line databases, digital libraries, scientific and financial data repositories, multimedia, and many other domains. All these features, and many more, have made the impact of XML significant not only in research contexts but also in industry: publishing information sources in XML is increasingly attractive for organizations that want to interoperate easily and provide their information in a format that other applications can process, especially on the Web. XML technologies have also been coupled with relational databases to solve business problems. As a matter of fact, encoding data into XML helps decrease the translation overheads of communication within and between organizations.
The widespread use of XML has prompted the development of methodologies, techniques, and systems for effectively and efficiently mining XML data. Since the early years of XML, database vendors have responded to the information management needs raised by XML, and the current scenario offers a variety of approaches, which include object-oriented database systems and native XML database systems. All of this has increasingly attracted the attention of various research communities, including databases, information retrieval, Web intelligence, machine learning, and data mining, from which myriad proposals have been offered to address problems in XML data management and knowledge discovery.
Mining XML data has its roots in semistructured data management. Therefore, most of the application domains of interest, such as integration of data sources and query processing, initially focused on the structure information available from XML data. In this context, these domains raised the demand for effective and efficient solutions to any problem concerned with structural comparison of semistructured data. For this purpose, a body of work has been borrowed from related research problems; important research contributions have especially concerned pattern matching, change detection, similarity search and detection, and document summarization. Schema matching was initially an important issue in relational models, but has rapidly gained momentum in the XML context. For instance, schema matching algorithms have been developed to support the clustering of structurally similar DTDs, the classification of XML documents with respect to a set of XML Schemas, and the evolution of a schema based on structural information extracted from classes/clusters of XML documents. Identifying structural similarities between XML documents, rather than XML schemas, is also central in semistructured data mining, in part because most real-life XML documents are schema-less. Defining a suitable distance or similarity measure between XML documents requires considering at least the features upon which documents are identified as similar and the method of pattern matching that is used according to the model chosen for representing an XML document. Early approaches to structural document similarity detection are based on tree edit distances, whereas more recent approaches range from vectorial or transactional representations to probabilistic models of XML data.
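As a rough illustration of a non-edit-distance structural measure, the sketch below represents each document by the set of its root-to-node tag paths and compares two documents with the Jaccard coefficient. This is a deliberately simple stand-in chosen for brevity, not one of the specific methods surveyed in this book; the function names and the sample documents are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def tag_paths(xml_text):
    """Return the set of root-to-node tag paths of an XML document."""
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, root.tag)]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return paths


def structural_similarity(xml_a, xml_b):
    """Jaccard coefficient between the tag-path sets of two documents."""
    a, b = tag_paths(xml_a), tag_paths(xml_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical documents: doc2 adds one element to doc1; doc3 is unrelated.
doc1 = "<paper><title/><author><name/></author></paper>"
doc2 = "<paper><title/><author><name/><email/></author></paper>"
doc3 = "<recipe><title/><step/></recipe>"

sim_close = structural_similarity(doc1, doc2)  # 4 shared paths out of 5
sim_far = structural_similarity(doc1, doc3)    # no shared paths
```

A path-set representation ignores element order and repetition, which makes it cheap to compute but coarser than a tree edit distance; it only needs the documents themselves, so it applies equally to schema-less data.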
The detection of structural similarities can help to recognize different sources that provide the same kind of information, and can support query processing in semistructured data; in addition, the capability of summarizing sets of XML documents can further help to estimate the selectivity of path expressions and to devise indexing techniques for clusters, eventually improving the construction of query plans. Also, mining frequent patterns and association rules in XML data is useful to explore relationships between structure elements (e.g., tag names), such as their frequency of co-occurrence within the same collection of XML documents.
Mining XML data according to both structure and content information has become essential in an increasing number of tasks for which it is relevant to consider the types of structure in the documents as well as the topics that can be inferred from the textual content of XML elements. However, the definition of content features makes any mining task inherit some problems faced in traditional knowledge discovery from text data, while new ones arise because XML content is contextually dependent on the XML (logical) structure. The increase in volume and heterogeneity of XML-based application scenarios also makes XML data sources exhibit different ways of semantically annotating their information. Owing to the inherent subjectivity in the definition of markup tags, XML documents that differ in structure and content may nevertheless be semantically related to a certain degree. Several “semantic questions” arise in this context, and hence there is an urgent need for developing semantics-aware representation models and devising suitable notions of semantic features and semantic relatedness measures for XML data. In this respect, semantic organization of XML data has become one of the hardest challenges in XML knowledge discovery and management.
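The co-occurrence of tag names mentioned above can be sketched as follows: for every pair of tags, count the fraction of documents in a collection that contain both. The function names and the three sample documents are hypothetical; a real frequent-pattern miner would also handle larger itemsets and prune by a minimum support threshold.

```python
from collections import Counter
from itertools import combinations
import xml.etree.ElementTree as ET


def tag_set(xml_text):
    """Return the set of distinct tag names occurring in a document."""
    return {node.tag for node in ET.fromstring(xml_text).iter()}


def tag_pair_support(documents):
    """Map each unordered tag pair to the fraction of documents containing both."""
    counts = Counter()
    for xml_text in documents:
        # Sorting makes each pair appear in one canonical order.
        counts.update(combinations(sorted(tag_set(xml_text)), 2))
    n = len(documents)
    return {pair: count / n for pair, count in counts.items()}


# Hypothetical collection of three small documents.
docs = [
    "<book><title/><author/><isbn/></book>",
    "<book><title/><author/></book>",
    "<book><title/><publisher/></book>",
]
support = tag_pair_support(docs)
```

In this toy collection the pair `("book", "title")` occurs in every document, while `("author", "title")` occurs in two out of three.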
Objectives and Mission
This book is intended to collect and distil the knowledge of experts from the database systems, information retrieval, machine learning, Web intelligence, and knowledge management communities in developing models, methods, and systems for XML data mining. Within this view, the book addresses key issues and challenges in XML data mining, offering insights into the various existing solutions and best practices for modeling, processing, and analyzing XML data. It explores algorithmic, theoretical, and practical issues regarding mining tasks specific to XML data, and is also concerned with XML-based data mining applications.
At the time of writing, this book also represents the first editorial opportunity to provide a single reference focused on Data Mining & XML. Indeed, most notable book publications have separately addressed such XML-related fields as semistructured data management and relational databases, XML data management and XML-enabled database systems, and XML and the Semantic Web, while none of the existing references on knowledge discovery and data mining focuses on semistructured data and XML. In some cases, the latter are at most treated marginally as related technologies for Web structure and content mining; analogously, no book reference on XML and related technologies focuses on the need for organizations to efficiently access and share data, extract information from data, and make competitive use of it by resorting to knowledge discovery and data mining solutions. Therefore, the intended mission of this book is to fill this gap by addressing data mining and XML in a unified way.
Prospective Audience and Potential Uses
This book is targeted at both researchers and practitioners in XML data mining and related fields, including Web mining, information retrieval, and knowledge management.
As a textbook, it aims to provide a guide to and through classic and challenging topics in knowledge discovery and data mining which are particularly concerned with the realm of XML. Therefore, the book could be used as a supplement to basic courses on any of the aforementioned disciplines, or as a reference for upper-level courses on advances in databases, information retrieval, and machine learning.
From an industry perspective, the book would be a reference for professionals in XML/database technologies for e-business and e-commerce. In this respect, the book provides insights into benefits, issues, and challenges of data mining solutions for developing XML-based intelligent management and analysis systems.
Organization of the Book
The book is laid out as follows:
Section 1: Models and Measures
Section 2: Clustering and Classification
Section 3: Association Mining
Section 4: Semantics-Aware Mining
Section 5: Applications
Each of the parts comprises a set of chapters that are coherent with respect to a major topic of interest in XML data mining. Although the parts are self-contained, the reader may find it useful to go beyond this underlying main classification of the chapters, and follow the cross-references between chapters (possibly located in different parts) to obtain further information about a topic.
The content of the parts is summarized as follows.
Section 1: Models and Measures. The success of an XML data mining method strongly relies on the choice of a model for representing XML data. This in turn influences the choice for the proximity measures and methods that are essential to compare XML data. The first part of this book is hence concerned with representation models (Chapters 1 and 2) and proximity measures (Chapters 3 and 4) that are well-suited for XML data.
Chapter 1 by Kutty, Nayak, and Tran systematizes the many concepts surrounding XML models along three main lines: the representations, the ways these representations are used for the various mining tasks, and the issues and challenges. The chapter also provides interesting pointers to future directions of research in XML modeling for data mining.
Kharlamov and Senellart address the very challenging problem of mining uncertain XML data in Chapter 2, where uncertainty essentially refers to the inherent imprecision of automatic processes, and is usually represented as the probability that the data is correct. The authors discuss how uncertainty is modeled in relational data before moving on to presenting uncertain XML models. Starting from lessons learned from probabilistic XML querying, the authors propose specific probabilistic models for XML documents and show how these models can be applied to XML mining tasks.
Madria and Viyanon overview similarity detection in XML documents in Chapter 3. Their description not only covers similarity and distance measures according to structure and content features of XML documents for mining, but also discusses the challenging problem of detecting semantically similar XML documents.
In Chapter 4, Wang, Li, and Li focus on effective and scalable solutions to approximate search and join on ordered trees for purposes of similarity detection in large and high-dimensional XML datasets. Within this view, they improve upon the known pq-gram method by introducing a randomized data structure and effective approximate join strategies.
Section 2: Clustering and Classification. Two of the most attractive research topics in XML data mining are considered in this part. Chapter 5, by Xing, presents solutions for computing approximate matching between XML documents and schemas to support the tasks of XML classification and clustering. The author investigates tree-edit-based measures and related algorithms to determine how well an XML document conforms to a schema, and demonstrates how the proposed methods can effectively be applied to structural classification and clustering of XML data.
As in Chapter 5, both XML schemas and instances are considered in Chapter 6, by De Meo, Nocera, Ursino, and Fiumara. The authors focus, however, on aspects related specifically to the clustering of XML structures, and their discussion of popular methods for structural clustering is organized to distinguish between the intensional data level (DTDs and XML Schemas) and the extensional data level (XML document structures).
Chapter 7 by Antonellis overviews the main literature on XML clustering algorithms by taking into account documents only. The author explores the various facets of XML document clustering according to structure and/or content information, also including the more recent semantic XML clustering.
Collectively, Chapters 6 and 7 cover a breadth of approaches and algorithms for XML clustering. Fuzzy clustering has not been widely investigated so far in the XML domain. The study presented in Chapter 8 by Kozielski is a first attempt to draw the attention of researchers in the clustering field to the potential of fuzzy methods for grouping structurally similar XML documents. In particular, the author stresses the importance of adopting a fuzzy approach for encoding XML structures and a “multilevel” clustering method that makes it possible to better handle the various hierarchy levels when clustering XML documents.
In Chapter 9, Bifet and Gavaldà propose a framework for the classification of XML trees in the challenging domain of data streams. The evolving nature of such data is handled by adaptively mining closed tree patterns from the streams and combining them with a classifier to reduce the dimensionality in the classification task.
In Chapter 10, Hagenbuchner, Tsoi, Kc, and Zhang address the novel problem of link prediction in XML document sets. The key idea of their work is to encode graph-structured XML information using an unsupervised learning approach based on one particular class of neural networks, and exploit it for prediction in an inter-linked domain. The authors argue that self-organizing maps are well-suited for the challenging task at hand, and therefore propose the first extension of self-organizing map algorithms for predicting incoming and outgoing links in XML document collections.
Section 3: Association Mining. The third part of this book is concerned with frequent pattern and association rule mining in XML data. State-of-the-art research on these closely related problems is overviewed in Chapter 11, by Ding and Sundarraj. They introduce the challenges in content-based and structure-based mining of XML frequent patterns and association rules, discuss the various existing approaches, and finally highlight open issues in the field.
In Chapter 12, Mazuran, Quintarelli, Rauseo, and Tanca describe their experience in applying tree mining techniques to extract summarized views of content and structure in XML documents. Here the objective is to facilitate the query-answering process by enabling the user to query the extracted, concise tree-shaped XML patterns in addition to the original dataset.
Chapter 13 by Cagliero, Cerquitelli, and Garza is focused on the problem of extracting generalized association rules to mine higher-level correlations from XML data. The proposed approach entails the conversion of XML data to a transactional format, and the use of a taxonomy to organize item features at different granularity levels; this taxonomy is then evaluated to guide the generalization of the extracted association rules.
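The general idea of converting XML data to transactions and generalizing items through a taxonomy can be sketched as below. This is a minimal illustration of the classic "extend each transaction with its ancestors" strategy, not the algorithm of Chapter 13; the element names, the taxonomy, and the helper functions are all hypothetical, and the taxonomy is assumed to be acyclic.

```python
import xml.etree.ElementTree as ET

# Hypothetical taxonomy: item -> more general item (assumed acyclic).
TAXONOMY = {
    "city=Turin": "country=Italy",
    "city=Milan": "country=Italy",
    "country=Italy": "region=Europe",
}

ORDERS_XML = """
<orders>
  <order><city>Turin</city><product>laptop</product></order>
  <order><city>Milan</city><product>laptop</product></order>
  <order><city>Turin</city><product>phone</product></order>
</orders>
"""


def to_transactions(xml_text, record_tag):
    """Turn each <record_tag> element into a set of 'tag=value' items."""
    root = ET.fromstring(xml_text.strip())
    return [
        {f"{field.tag}={(field.text or '').strip()}" for field in record}
        for record in root.iter(record_tag)
    ]


def generalize(transaction, taxonomy):
    """Add every taxonomy ancestor of each item to the transaction."""
    extended = set(transaction)
    for item in transaction:
        while item in taxonomy:
            item = taxonomy[item]
            extended.add(item)
    return extended


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


transactions = [generalize(t, TAXONOMY) for t in to_transactions(ORDERS_XML, "order")]
specific = support({"city=Turin", "product=laptop"}, transactions)
general = support({"country=Italy", "product=laptop"}, transactions)
```

In this toy dataset the city-level itemset is supported by one order out of three, whereas its country-level generalization is supported by two, which is the kind of higher-level correlation that generalized rules are meant to expose.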
Section 4: Semantics-Aware Mining. This part is devoted to the presentation of works that, in one way or another, address semantic aspects in XML data mining. The first chapter in this part bridges the studies on Semantic Web and mining XML data, whereas the other two chapters respectively focus on exploiting interschema knowledge for XML integration and exploration, and on knowledge matching for the detection of structural/conceptual relationships in XML knowledge representations.
In Chapter 14, Berlanga and Nebot discuss the importance and feasibility of combining knowledge resources in data mining processes towards semantics-aware knowledge discovery. They discuss the benefits provided by semantic annotations and knowledge representation formalisms as a common layer for integrating heterogeneous data sources. Through an exhaustive review of the literature, the authors describe how semantic features have been incorporated and dealt with in mining complex structured and semistructured data.
Chapter 15 by De Meo, Nocera, and Ursino presents a framework that extracts interschema properties from XML sources, builds up a hierarchy to represent the sources at different abstraction levels, and then exploits this hierarchy to organize and explore the sources. Although the authors describe a general, component-based framework with the aforementioned characteristics, they also provide implementations of the three layers of the framework that are focused on the intensional aspect of XML data.
Pan, Hadzic, and Dillon address the challenging problem of conceptual and structural matching in heterogeneous knowledge representations in Chapter 16. They show that frequent tree mining algorithms, being capable of efficiently extracting common substructures in tree-structured knowledge representations, are useful to automatically model the knowledge shared by different XML data for a specific domain.
Section 5: Applications. The final part of this book contains an interesting mix of application-oriented studies: content characterization of geographical maps based on tag annotations provided by social network users, collaborative document clustering in P2P networks, and frequent subtree mining applied to credit risk assessment data.
In Chapter 17, Roglia, Meo, and Ponassi present an appealing application scenario for online mapping services like OpenStreetMap and Google Maps. By exploiting information in tag-based cartographic annotations, the authors propose an approach based on a statistical test on the frequency of the annotations to characterize the map contents in a concise yet meaningful way.
Gullo, Ponti, and Greco describe a collaborative distributed framework for clustering XML documents in Chapter 18. They focus on distributed environments implemented as P2P systems to introduce the novel element of collaboration in the task of XML document clustering. The proposed centroid-based partitional clustering method is shown to be suitable for efficiently organizing XML documents distributed across peers.
Chapter 19 by Ikasari, Hadzic, and Dillon aims to address the lack of mining approaches that exploit qualitative information in credit risk assessment. XML is used in this endeavour to model quantitative as well as qualitative information about small and medium enterprises in loan applications. Frequent tree mining algorithms are then applied to the resulting XML data to discover potentially useful associations for supporting loan-granting decisions.