Introduction
Peer Data Management Systems (PDMSs) (Kantere et al., 2009; Tatarinov et al., 2003) are advanced P2P applications in which each peer represents an autonomous data source that makes an exported schema available. This schema represents the data to be shared with other peers. Peers communicate through an overlay network, i.e., a virtual (logical) network that runs on top of a physical network (Doval & O’Mahony, 2003). According to the overlay topology employed, P2P systems are classified into three categories (Androutsellis-Theotokis & Spinellis, 2004): unstructured, structured, and hybrid. Some works also consider a fourth category, called super-peer (Yang & Garcia-Molina, 2003).
One of the most studied data management issues in PDMSs is query answering (Hose et al., 2008; Montanelli & Castano, 2008), which consists in propagating a query, submitted at any of the peers, along paths of limited depth in the corresponding overlay network (Lodi et al., 2008). At each routing step, the query is reformulated in terms of the exported schema of its new host, based on the respective schema mappings (Souza et al., 2009).
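The propagation scheme described above can be illustrated with a minimal sketch. It assumes the overlay is given as a plain adjacency dictionary and that reformulation is supplied as a callable. The names `propagate_query`, `overlay`, `max_depth`, and `reformulate` are hypothetical and chosen only for illustration, and the breadth-first order is one possible choice, not a strategy prescribed by the cited works.

```python
from collections import deque


def propagate_query(overlay, start, query, max_depth, reformulate):
    """Propagate a query breadth-first over an overlay, at most max_depth hops.

    overlay     -- dict mapping a peer id to a list of neighbor peer ids
                   (hypothetical representation of the overlay network)
    reformulate -- callable (query, source_peer, target_peer) -> query,
                   standing in for mapping-based query reformulation
    Returns a dict mapping each reached peer to the query as it sees it.
    """
    received = {start: query}
    frontier = deque([(start, 0)])
    while frontier:
        peer, depth = frontier.popleft()
        if depth == max_depth:
            continue  # depth limit reached: do not forward any further
        for neighbor in overlay.get(peer, []):
            if neighbor in received:
                continue  # each peer handles the query at most once
            # Rewrite the query against the neighbor's exported schema.
            received[neighbor] = reformulate(received[peer], peer, neighbor)
            frontier.append((neighbor, depth + 1))
    return received
```

With a chain-shaped overlay `A -> B -> C -> D` and `max_depth=2`, the query reaches `A`, `B`, and `C` but is never forwarded to `D`, which is the effect of limiting path depth.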
Query answering in PDMSs can be improved if peers are arranged in the overlay network according to the similarity of the content they are willing to share (Castano et al., 2004). The set of peers can then be partitioned into clusters so as to maximize the semantic similarity among the peers within the same cluster. Peer clustering has several benefits, the most important being that queries are answered by only a few (but relevant) peers (Raftopoulou et al., 2009).
Owing to the large number of peers, their autonomous nature, and the heterogeneity of their schemas, the creation and maintenance of clusters is considered a challenging aspect in the current stage of development of PDMSs (Kantere et al., 2008). This work proposes an incremental process for clustering peers in a PDMS. To this end, we present a PDMS architecture designed mainly to facilitate the connection of new peers according to their exported schemas. In this architecture, peers are organized in semantic communities (Castano & Montanelli, 2005), and within a community peers are grouped into semantically related clusters. Because each schema is represented as an OWL ontology (OWL, 2011), the clustering process makes intensive use of ontology management services, such as matching (Castano et al., 2006; Euzenat & Shvaiko, 2007), merging, and summarization (Pires et al., 2010).
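To make the idea of incremental clustering concrete, the following sketch shows one way a joining peer could be assigned to a cluster. It is not the process proposed in this work: it assumes each schema is reduced to a set of concept names, uses Jaccard similarity as a stand-in for ontology matching, and uses set union as a stand-in for ontology merging. The function names, the cluster representation, and the `threshold` value are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (a stand-in for ontology matching)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def join_peer(clusters, peer_id, schema_terms, threshold=0.3):
    """Assign a joining peer to the most similar cluster, or open a new one.

    clusters     -- list of dicts {"peers": [...], "terms": set(...)};
                    "terms" is a hypothetical cluster representative
    schema_terms -- concept names taken from the peer's exported schema
    threshold    -- assumed minimum similarity for joining a cluster
    Returns the index of the cluster the peer ends up in.
    """
    terms = set(schema_terms)
    best_index, best_score = None, 0.0
    for index, cluster in enumerate(clusters):
        score = jaccard(terms, cluster["terms"])
        if score > best_score:
            best_index, best_score = index, score
    if best_index is not None and best_score >= threshold:
        clusters[best_index]["peers"].append(peer_id)
        # Set union stands in for merging the schema into the representative.
        clusters[best_index]["terms"] |= terms
        return best_index
    # No sufficiently similar cluster exists: the peer starts a new one.
    clusters.append({"peers": [peer_id], "terms": terms})
    return len(clusters) - 1
```

Only the representatives of existing clusters are compared with the new schema, so a join does not require re-clustering the peers already in the network. That is the sense in which the sketch is incremental.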
Several approaches have been proposed for the problem of peer clustering. One of the first solutions was proposed for P2P file sharing systems (Yang & Garcia-Molina, 2003). This approach can be applied when the peers have the same structure and vocabulary, which is not the case in our setting. Some of the existing solutions (Castano & Montanelli, 2005; Doulkeridis et al., 2006; Kantere et al., 2008) assume that the P2P network is already populated with a predetermined number of peers and that the clustering process is done in an ad hoc manner. Few solutions (Li & Vuong, 2005; Lodi et al., 2008) consider the problem of forming clusters from scratch. In Li and Vuong (2005), a simple, asymmetric global measure is used to compute the semantic similarity between two peers’ schemas; the authors assume that peers in a PDMS share exactly the same ontology. The PDMS proposed in Lodi et al. (2008) concentrates all efforts related to peer clustering in a centralized structure called the Access Point Structure (APS), which is updated whenever a peer joins or leaves the PDMS. Updates to the APS can be frequent and may consequently cause scalability problems for the system.