The ability to provide a “standardized, extensible means of coupling semantic information within documents describing semistructured data” (Chaudhri, Rashid, & Zicari, 2003) has led to steady growth in XML (Extensible Markup Language) data sources, and XML is touted as the driving force for representing and exchanging data on the Web. The motivation behind any clustering problem is to find an inherent structure of relationships in the data and to expose this structure as a set of clusters, where objects within the same cluster are highly similar to each other but very dissimilar from objects in other clusters. Text databases are a fruitful research area for the clustering problem. Because semistructured text data has become more prevalent on the Web, and XML is the de facto standard for such data, clustering XML documents has attracted increasing attention. Any application domain that needs to organize complex document structures (e.g., hierarchical structures with unbounded nesting, object-oriented hierarchies), as well as data containing a few structured fields together with largely unstructured text components, can benefit from an XML document clustering task.
Several approaches and methodologies have been proposed in recent years to address the problem of clustering XML documents, and major differences can be found in how the following issues have been settled:
Key Terms in this Chapter
XML Cluster Representative: A prototype XML document capturing the most relevant features of the XML documents assigned to a cluster.
Document Clustering: The task of organizing a collection of documents, whose classification is unknown, into meaningful groups (clusters) that are homogeneous according to some notion of proximity (distance or similarity) among documents.
Clustering XML Documents by Structure: The task of clustering XML documents according to features based on information available from XML structure, that is, elements and their (hierarchical) relationships.
Semantic Relatedness: The state of being related by semantic affinities beyond syntactic correspondence or similarity.
Schema-less Document: A document with no explicitly associated document type definition (schema).
XML Transactional Model: A representation model that allows for mapping XML data to variable-length sequences of items, where each item may bear upon structure information, content information, or both.
Clustering XML Documents by Content: The task of clustering XML documents according to features based on information available from XML content, that is, textual elements and attribute values.
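To make the transactional view of XML data more concrete, the following is a minimal sketch, not the model of any particular approach surveyed in the chapter. It assumes that each item pairs a root-to-leaf tag path (structure) with a text or attribute value (content), that a document is simplified to an unordered set of such items rather than a sequence, and that proximity is measured with the Jaccard coefficient. The names `xml_to_transaction`, `structure_only`, and `jaccard` are hypothetical and introduced only for illustration.

```python
import xml.etree.ElementTree as ET


def xml_to_transaction(xml_text):
    """Map an XML document to a set of (tag path, value) items.

    Assumption: an item couples structure (the tag path from the root)
    with content (leading element text or an attribute value).
    Text that follows a child element (mixed content) is ignored here.
    """
    root = ET.fromstring(xml_text)
    items = set()

    def visit(node, path):
        current = path + "/" + node.tag
        # Attribute values are recorded as content under "<path>/@<name>".
        for name, value in node.attrib.items():
            items.add((current + "/@" + name, value.strip()))
        text = (node.text or "").strip()
        if text:
            items.add((current, text))
        for child in node:
            visit(child, current)

    visit(root, "")
    return items


def structure_only(transaction):
    """Keep only the tag paths, discarding content values."""
    return {path for path, _ in transaction}


def jaccard(a, b):
    """Jaccard similarity of two item sets (defined as 1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


doc1 = "<paper><title>XML clustering</title><year>2003</year></paper>"
doc2 = "<paper><title>XML clustering</title><year>2005</year></paper>"

t1 = xml_to_transaction(doc1)
t2 = xml_to_transaction(doc2)

print(sorted(t1))
print(jaccard(t1, t2))                                  # structure and content
print(jaccard(structure_only(t1), structure_only(t2)))  # structure only
```

In this example the two documents share the same tag paths but differ in the value of `year`, so the structure-only similarity is 1.0 while the combined structure-and-content similarity is 1/3. This mirrors the distinction drawn above between clustering XML documents by structure and by content.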