This work proposes and evaluates distributed algorithms for data clustering in self-organizing ad-hoc sensor networks with computational, connectivity, and power constraints. Self-organization is essential in environments with a large number of devices, because the resulting system cannot be configured and maintained by specific human adjustments on its single components. One of the benefits of in-network data clustering algorithms is the capability of the network to transmit only relevant, high level information, namely models, instead of large amounts of raw data, also reducing drastically energy consumption. For instance, a sensor network could directly identify or anticipate extreme environmental events such as tsunami, tornado or volcanic eruptions notifying only the alarm or its probability, rather than transmitting via satellite each single normal wave motion. The efficiency and efficacy of the methods is evaluated by simulation measuring network traffic, and comparing the generated models with ideal results returned by density-based clustering algorithms for centralized systems.
Distributed and automated recording of data generated by high-speed, high-volume information sources is becoming common practice in scientific research, environmental monitoring, as well as in medium sized and large organizations and enterprises. Whereas distributed core database technology has been an active research area for decades, distributed data analysis and mining have been dealt with only more recently (Kargupta & Chan, 2000; Zaki & Ho, 2000) motivated by issues of scalability, bandwidth, privacy, and cooperation among competing data owners.
A common scheme underlying all approaches is to first locally extract suitable aggregates, then send the aggregates to a central site where they are processed and combined into a global approximate model. The kinds of aggregates and combination algorithm depend on the data types and distributed environment under consideration, e.g., homogeneous or heterogeneous data, and numeric or categorical data.
Among the various distributed computing paradigms, self-administered, massive-scale networks, like sensor networks and peer-to-peer (P2P) computing networks, are currently the topic of large bodies of both theoretical and applied research. In P2P computing networks, all nodes (peers) cooperate with each other to perform a critical function in a decentralized manner, and all nodes are both users and providers of resources (Milojicic et al., 2002). In sensor networks, small devices equipped with a sensing unit and a transceiver and, possibly, a limited computing architecture, are deployed in an environment to be monitored continuously or at fixed intervals.
Applications of both computing paradigms are rapidly maturing. In data management applications, deployed peer-to-peer systems have proven to be able to manage very large databases made up by thousands of personal computers. Many proposals in the literature have significantly improved existing P2P systems from several viewpoints, such as searching performance and query expressivity, resulting in concrete solutions for the forthcoming new distributed database systems to be used in large grid computing networks and in clustering database management systems.
Sensor technology is overcoming many functional limitations of early devices. State-of-the-art sensors are equipped with memory and processors capable of executing moderately demanding algorithms, enabling the deployment of sensor networks capable of processing the data in-network, at least partially, without transmission to a sink or gateway node.
In light of the foregoing, it is natural to foresee an evolution and convergence of sensor and P2P networks towards supporting advanced data processing services, such as distributed data mining services, by which many nodes cooperatively perform a distributed data mining task. In particular, the data clustering task matches well the features of self-organizing networks, since clustering models mostly account for local information, and consequently carefully designed distributed clustering algorithms can be effective in handling topological changes and frequent data updates.
In this chapter, we describe an approach to cluster multidimensional numeric data that are distributed across the nodes of a sensor network by using the data partitioning strategies of multidimensional peer-to-peer systems with some revisions, namely without requiring any costly reorganization of data, which would be infeasible under the rigid energy constraints enforced in a sensor networks, and without reducing the performance of the nodes in message routing and query processing. We evaluate the data clustering accuracy of the proposed clustering approach by comparison to a well-known traditional density-based clustering algorithm. The comparisons have been done by conducting extensive experiments on the decentralized wireless sensor network and on the algorithms we have fully implemented.