Organizing XML Documents on a Peer–to–Peer Network by Collaborative Clustering

Francesco Gullo (University of Calabria, Italy), Giovanni Ponti (ENEA, Italy) and Sergio Greco (University of Calabria, Italy)
Copyright: © 2012 | Pages: 18
DOI: 10.4018/978-1-61350-356-0.ch018


In this chapter we address the problem of clustering XML documents in a collaborative distributed environment. We developed a clustering framework for XML sources distributed on a P2P network. XML documents are modeled based on a transactional representation which uses both XML structure and content information. The clustering method employs a centroid-based partitional scheme suitably adapted to work on a P2P network. Each peer is enabled to compute a clustering solution over its local repository and to exchange the resulting cluster representatives with the other peers. The exchanged cluster representatives are hence used to compute the global clustering solution in a collaborative way. Effectiveness and efficiency of the framework were evaluated on real XML document collections varying the number of peers. Experimental results have shown significant improvements of our collaborative distributed algorithm with respect to the centralized clustering setting in terms of execution time, achieving clustering solutions that still remain accurate with a moderately low number of nodes in the network.
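The collaborative scheme outlined above can be illustrated with a minimal sketch. It is not the chapter's algorithm: it assumes documents have already been mapped to plain numeric vectors and uses squared Euclidean distance, whereas the chapter relies on a transactional XML representation and its own notion of cluster representative. The exchange step is also simplified to a distributed k-means-style merge in which each peer shares per-cluster sums and counts. All names (`local_summaries`, `merge_summaries`, `collaborative_kmeans`) are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]
Summary = Tuple[List[Vector], List[int]]  # per-cluster sum vectors and counts


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance (a stand-in for an XML similarity measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point: Vector, centroids: List[Vector]) -> int:
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))


def local_summaries(points: List[Vector], centroids: List[Vector]) -> Summary:
    """Work done by one peer: assign local points, summarize each cluster."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in points:
        j = nearest(p, centroids)
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts


def merge_summaries(summaries: List[Summary], old: List[Vector]) -> List[Vector]:
    """Combine the summaries exchanged by all peers into global centroids."""
    k, dim = len(old), len(old[0])
    merged = []
    for j in range(k):
        total = sum(counts[j] for _, counts in summaries)
        if total == 0:
            merged.append(list(old[j]))  # empty cluster: keep the old centroid
        else:
            merged.append(
                [sum(sums[j][d] for sums, _ in summaries) / total for d in range(dim)]
            )
    return merged


def collaborative_kmeans(
    peers: List[List[Vector]], init: List[Vector], rounds: int = 10
) -> List[Vector]:
    """Alternate local summarization and global merging until stable."""
    centroids = [list(c) for c in init]
    for _ in range(rounds):
        summaries = [local_summaries(pts, centroids) for pts in peers]
        updated = merge_summaries(summaries, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids
```

In this sketch only the per-cluster summaries leave a peer, never the documents themselves. In a real P2P deployment the merge would be carried out by the peers rather than by a single loop, and the distance function would be replaced by a measure that accounts for both XML structure and content.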
Chapter Preview


The extensibility of its markup functionalities, along with its natural capability of representing complex real-world objects and their relationships, is the key to the success of XML in enabling the development of domain-specific markup languages. This has had a strong impact on the role of XML on the Internet, where XML languages have been developed for a large variety of application domains, ranging from multimedia and networking to Web content syndication and rendering, and from scientific data representation and literature to business processes.

In recent years, the use of XML for data representation and exchange has become central in high-demand environments. On the one hand, the growing availability of large XML document repositories has raised the need for fast and accurate organization of such data. In this respect, research on XML document clustering has produced a variety of approaches and methods, with different focuses on aspects such as the structure and/or content type of XML features, the XML data representation and summarization model, the XML similarity measures, and the clustering strategy, which has to meet such special requirements as dealing with large document collections and high dimensionality, ease of browsing, and meaningfulness of cluster descriptions (Candillier, Tellier, & Torre, 2005; Denoyer & Gallinari, 2008; Doucet & Lehtonen, 2006; Kutty, Tran, Nayak, & Li, 2008; Lian, Cheung, Mamoulis, & Yiu, 2004; Nayak & Xu, 2006; Tran, Nayak, & Bruza, 2008; Tagarelli & Greco, 2010).

On the other hand, the inherently distributed nature of XML repositories also calls for adequate distributed processing techniques that can aid the efficient management and mining of XML data. As an example, consider Web news services that are in charge of very frequently gathering up-to-date information from thousands of news sources: if such services aim to highlight (new) hot topics through the news channels or provide users with a (personalized) view of the news headlines, they might be required to apply clustering algorithms to the news articles every few minutes.

Clustering XML documents in such high-demand environments is hence challenging, since the algorithms must cope with tight constraints on both processing power and space resources. Existing methods for clustering XML documents are, however, designed as centralized systems, mainly because most clustering strategies are difficult to decentralize and, additionally, because of a number of issues that arise in developing a convenient yet effective summarization of both the structure and the content of XML documents.

Distributed XML applications are increasingly in demand in several domains, such as software and multimedia sharing, product rating, personal profiling, and many others. Much of the credit for this extensive use of XML in distributed applications goes to the popularity of peer-to-peer (P2P) networks. A P2P network is a distributed system with the following main properties (Rodrigues & Druschel, 2010). It has a high degree of decentralization, since processing power, bandwidth and space resources are contributed by the peers, which implement both the client and the server functionalities of the system. In general, there can be a high heterogeneity of resources in terms of hardware and software architecture, power supply, and geographic location. A P2P network is mostly self-organizing, since any newly introduced peer node requires little or no manual configuration for system maintenance. Multiple administrative domains usually characterize the system, as the peers are not owned or controlled by a single organization. The deployment costs of a P2P system are typically lower than those of client-server systems, thanks to its independence from dedicated infrastructure, and upgrading the system components is easier. Because there are few if any peers with centralized state, a P2P system is also more resilient to faults and attacks.
