Approximate Matching Between XML Documents and Schemas with Applications in XML Classification and Clustering

Approximate Matching Between XML Documents and Schemas with Applications in XML Classification and Clustering

Guangming Xing (Western Kentucky University, USA)
Copyright: © 2012 |Pages: 26
DOI: 10.4018/978-1-61350-356-0.ch005


Classification/clustering of XML documents based on their structural information is important for many tasks related with document management. In this chapter, we present a suite of algorithms to compute the cost for approximate matching between XML documents and schemas. A framework for classifying/clustering XML documents by structure is then presented based on the computation of distances between XML documents and schemas. The backbone of the framework is the feature representation using a vector of the distances. Experimental studies were conducted on various XML data sets, suggesting the efficiency and effectiveness of our approach as a solution for structural classification/clustering of XML documents.
Chapter Preview


The eXtensible Markup Language (XML) (Bray, Paoli, Sperberg-McQueen, Maler, & Yergeau, 2004) has become the standard format for data exchange on the Internet, providing interoperability between different business applications. Such wide use results in large volumes of heterogeneous XML data, i.e., XML documents conforming to different schemas. XML documents are naturally tree structured, and may contain atomic and complex structures. XML documents are also semistructured as they incorporate both structural information and content (Abiteboul, 1997). Dealing with structure information is important to XML data storage and management (Jagadish et al., 2002), and the presence of a schema can significantly simplify the processing of XML documents. For example, elements from different documents with similar structures can be stored together, resulting in reduced storage and faster query processing. However, processing XML documents based on the schemas may not be feasible in practice, since most XML documents are schema-less.

Studying the problem of approximate matching between an XML document and a schema and its applications in XML data management and mining is not only theoretically interesting, but also critical to many real-world scenarios. For instance, similarly structured XML documents can be stored in relational databases more efficiently than documents with arbitrary structures. As discussed in (Bertino, Castano, Ferrari, & Mesiti, 2004), the distance between an XML document and a DTD can be effective for classifying Web documents. When schemas are not available, meaningful schemas can be inferred from XML documents as explained in the XTRACT system (Garofalakis, Gionis, Rastogi, Seshadri, & Shim, 2000). The framework to compute edit distances between XML documents and inferred schemas provides an alternative to the traditional algorithms for tree-to-tree edit distance in evaluating structural similarity between XML documents (Nierman & Jagadish, 2002). Moreover, with the rapid growth of XML documents on the Internet, disseminating XML documents to a large group of information consumers is becoming more and more important. This typically involves applying XSLT to present related information to the end users, which requires that the XML documents are grouped based on the similarity of their tree structures. The approximate matching between XML documents and schemas is also useful for evaluating schema-based queries over sources of XML documents. Other applications in literature include selective dissemination of XML documents (Stanoi, Mihaila, & Padmanabhan, 2003), information extraction from Web documents (Reis, Golgher, Silva, & Laender, 2004), database integration (Parent & Spaccapietra, 1998), stream processing and document integration (Garofalakis & Kumar, 2005), and protection of XML documents (Bertino, Castano, Ferrari, & Mesiti, 2002).

Complete Chapter List

Search this Book: