Resource Constrained Data Stream Clustering with Concept Drifting for Processing Sensor Data

Gansen Zhao (School of Computer Science, South China Normal University, Guangzhou, China), Zhongjie Ba (School of Software, Sun Yat-sen University, Guangzhou, China), Jiahua Du (School of Computer Science, South China Normal University, Guangzhou, China), Xinming Wang (School of Computer Science, South China Normal University, Guangzhou, China), Ziliu Li (Microsoft Search Technology Center Asia, Beijing, China), Chunming Rong (Centre of Innovation Technology, University of Stavanger, Stavanger, Norway) and Changqin Huang (School of Information Technology in Education, South China Normal University, Guangzhou, China)
Copyright: © 2015 | Pages: 19
DOI: 10.4018/IJDWM.2015070103

Abstract

Wireless sensors and mobile devices are widely deployed to collect data for monitoring real-world systems. They generate large volumes of stream data in real time, and that data has to be processed in real time as well. A common processing operation is clustering, which automatically groups the elements of a stream into a number of clusters so that elements of the same cluster have maximum similarity and elements of different clusters have minimum similarity. This paper proposes an on-demand framework (SRAStream) based on a concept drifting detection mechanism. The concept drifting detection algorithm measures the distance between the new clusters for the current data and the existing clusters. Re-clustering is performed to identify new clusters only when a concept drifting occurs, so SRAStream avoids unnecessary, computation-intensive re-clustering. Experiments suggest that the proposed framework works well and greatly improves processing speed in data stream clustering.

1. Introduction

Cyber-physical systems deploy a great number of sensors and various mobile devices to monitor real-world systems. Each sensor and device senses and measures objects at regular intervals, generating new values over time and hence producing a continuous data stream (shown in Figure 1). In such a scenario, data streams need to be processed to identify events and extract useful information. For example, in the context of smart cities, mining the data stream of users' daily behavior can identify their daily commuting patterns.

Figure 1. Clustering real-time data generated by mobile devices

Clustering is the process of grouping a set of objects. In general, clustering puts similar objects in the same group and dissimilar objects in different groups, so that similarity is highest within a group and lowest across groups. Clustering stream data is one of the most common operations for mining real-time data in many scenarios.

Data streams differ from static data sets in several respects (Aggarwal & Yu, 2008).

  • Data arrives as one or more continuous, unbounded data streams. Processing them requires a large amount of resources, while most processing systems have only limited resources available.

  • Data streams are usually not available for random access, and most items in a data stream can be processed only once during the entire computation. This is because of the great volume of data, some of which may be discarded after being processed.

  • The arrival rate of a data stream varies over time and can be difficult to predict.

Data stream clustering poses a few major challenges, including but not limited to the following:

  • Data streams are dynamic, with new data arriving continuously. It is not possible to wait until all data has been received before clustering. Hence it is necessary to cluster without having the whole data set and to update the clustering result continuously as new data arrives.

  • More data keeps arriving while earlier data is being processed and clustered, and results are expected within a short time. The clustering process needs to be fast enough to respond in real time.

  • Data streams are unbounded, so a very large volume of data accumulates while the amount of available resources is limited. It is necessary to find an efficient way to cluster a large amount of data under resource constraints.
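The first and third challenges (a single pass over the data and bounded memory) can be made concrete with a small sketch. The code below uses sequential k-means, a standard online update that moves the nearest center toward each arriving point; it is an illustrative assumption and not the method of this paper. The class name `OnlineCentroids` and its methods are hypothetical.

```python
import math


class OnlineCentroids:
    """Sequential k-means: keeps only k centers and k counters in memory."""

    def __init__(self, initial_centers):
        # Copy the starting centers; each is treated as one observed point.
        self.centers = [list(c) for c in initial_centers]
        self.counts = [1] * len(self.centers)

    def update(self, point):
        """Assign one arriving point and shift its nearest center toward it."""
        idx = min(
            range(len(self.centers)),
            key=lambda i: math.dist(point, self.centers[i]),
        )
        self.counts[idx] += 1
        rate = 1.0 / self.counts[idx]  # running-mean step size
        center = self.centers[idx]
        for d, x in enumerate(point):
            center[d] += rate * (x - center[d])
        return idx
```

Each point is seen once and then discarded, so memory use depends on the number of clusters rather than on the length of the stream. The trade-off is that the centers adapt slowly once the counters grow large, which is one reason an explicit re-clustering trigger can be useful.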

To address the above issues, this paper proposes the SRAStream clustering framework, which clusters data streams based on a concept drifting detection model. The general idea is that an initial set of cluster centers is calculated from the small set of data available at the beginning. The clustering error is then continuously monitored and measured as more data arrives. The cluster centers are re-calculated only when the clustering error exceeds a certain threshold, an event that is called a concept drifting.
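The general idea can be illustrated with a minimal Python sketch. This excerpt does not give the paper's drift detection model, error measure, or threshold, so everything concrete here is an assumption: the error is taken to be the mean distance from each point in a fixed-size window to its nearest center, plain Lloyd's k-means stands in for the clustering step, and the names `nearest`, `kmeans`, `cluster_stream`, `window`, and `threshold` are hypothetical.

```python
import itertools
import math
import random


def nearest(point, centers):
    """Return (index, distance) of the center closest to point."""
    best_i, best_d = 0, math.dist(point, centers[0])
    for i in range(1, len(centers)):
        d = math.dist(point, centers[i])
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means (an assumed stand-in for the clustering step)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)[0]].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers


def cluster_stream(stream, k, init_size, window, threshold):
    """Cluster once on the first init_size points, then re-cluster only when
    the mean error of a full window exceeds threshold (assumed drift test)."""
    stream = iter(stream)
    centers = kmeans(list(itertools.islice(stream, init_size)), k)
    recluster_count = 0
    recent = []
    for p in stream:
        recent.append(p)
        if len(recent) < window:
            continue
        error = sum(nearest(q, centers)[1] for q in recent) / len(recent)
        if error > threshold:  # treated here as a concept drifting
            centers = kmeans(recent, k)
            recluster_count += 1
        recent = []  # discard the window to keep memory bounded
    return centers, recluster_count
```

Calling `cluster_stream(points, k=2, init_size=50, window=50, threshold=3.0)` on an iterable of coordinate tuples returns the final centers and the number of re-clusterings performed. While the error stays below the threshold, each window costs only one nearest-center lookup per point, which is where the savings over periodic re-clustering would come from.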

This paper presents a comprehensive literature review of related work. For efficient clustering of real data streams, the SRAStream system framework is devised and the related concept drifting detection algorithms are proposed. The proposed algorithms are analyzed and experiments are conducted. The experimental results indicate that the proposed framework and algorithms achieve a good level of performance in both efficiency and accuracy.

The contributions of this work are as follows. Firstly, we identify the need for clustering data streams in cyber-physical systems and conduct a comprehensive literature review of related work. Secondly, we develop a clustering framework for tackling the data stream clustering issues. Thirdly, we propose the corresponding concept drifting detection model and concept drifting detection algorithms for data streams. Lastly, we analyze the algorithms and conduct experiments to evaluate the performance of the proposed mechanism.
