XML Similarity Detection and Measures

Sanjay Kumar Madria (Missouri University of Science and Technology, USA) and Waraporn Viyanon (Missouri University of Science and Technology, USA)
Copyright: © 2012 | Pages: 25
DOI: 10.4018/978-1-61350-356-0.ch003


XML similarity detection plays an important role in many applications, such as data integration, document classification and clustering, querying, and change management. In this chapter, we present an overview of syntactic and semantic similarity/distance measures for XML documents, along with existing research related to XML similarity detection. The measures are classified into two main categories: structural similarity, and combined structural and content similarity. We review similarity detection approaches proposed in the literature and discuss some of the challenges and future directions for research on XML similarity detection and related fields.


Views of XML

XML documents can be classified as having either a document-centric (text-centric) view or a data-centric view (Bourret, 2005).

Data-centric documents are used to transport data. They are highly structured and marked up with XML tags, and most are generated from structured sources such as relational database management systems (RDBMSs). The data-centric view emphasizes XML structure, since the meaning of a data-centric XML document depends only on the structured data it contains; such documents are usually used to exchange data in structured form.

Document-centric documents focus on application-relevant objects. They are loosely structured documents marked up with XML tags, and their meaning depends on the document as a whole. Their structure is more irregular, and their data are heterogeneous. Such documents might not even have a document type definition (DTD) or XML schema. In this view, text takes priority over structure. Figure 1 shows examples of both document-centric and data-centric documents.

Figure 1. Two types of XML documents: (a) text-centric document, and (b) data-centric document
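
The distinction between the two views can also be illustrated with a small sketch. The two sample documents and the mixed_content_ratio heuristic below are hypothetical illustrations, not taken from the chapter or its figure. The sketch assumes that document-centric XML tends to interleave text with child elements (mixed content), while data-centric XML keeps text in leaf elements only.

```python
import xml.etree.ElementTree as ET

# Hypothetical data-centric sample: regular structure, text only in leaves.
DATA_CENTRIC = """<order id="1001">
  <customer>Ann Lee</customer>
  <item><sku>A-17</sku><qty>2</qty><price>9.50</price></item>
  <item><sku>B-03</sku><qty>1</qty><price>24.00</price></item>
</order>"""

# Hypothetical document-centric sample: prose with inline markup.
DOCUMENT_CENTRIC = """<article>
  <title>XML in practice</title>
  <para>XML is used for <em>data exchange</em> and for
  <em>publishing</em>; see <link>the specification</link> for details.</para>
</article>"""


def has_mixed_content(elem):
    """Return True if elem has child elements and non-whitespace text around them."""
    if len(elem) == 0:
        return False
    if elem.text and elem.text.strip():
        return True
    # Text that follows a child element is stored in that child's tail.
    return any(child.tail and child.tail.strip() for child in elem)


def mixed_content_ratio(xml_text):
    """Fraction of non-leaf elements that interleave text with child elements."""
    root = ET.fromstring(xml_text)
    parents = [e for e in root.iter() if len(e) > 0]
    if not parents:
        return 0.0
    return sum(has_mixed_content(e) for e in parents) / len(parents)


if __name__ == "__main__":
    print("data-centric:", mixed_content_ratio(DATA_CENTRIC))
    print("document-centric:", mixed_content_ratio(DOCUMENT_CENTRIC))
```

A ratio near zero suggests a data-centric layout, and a higher ratio suggests a document-centric one. This is only a rough indicator under the stated assumption; real collections often mix both styles.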
