1. Introduction
In recent years, many new applications that generate data streams have emerged, including financial applications, network monitoring, web applications, and sensor networks (Tjioe & Taniar, 2005; Goh & Taniar, 2004). Unlike traditional static databases, data streams are continuous, unbounded in size, and usually arrive at a high rate.
The nature of data streams imposes requirements on any algorithm designed to mine them, such as one that finds frequent subsequences. For example, since data streams are unbounded in size and arrive at a high rate, algorithms are allowed only one look at the data; they do not get the chance to revisit it. To address this, a buffer is used to collect the data temporarily for processing: a sliding window model (Zhu & Shasha, 2002) can be used to buffer the latest n values of a data stream. The algorithm should also be incremental, meaning that it does not recompute its results after every window but instead only updates and builds on the results computed for previous windows.
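To make the sliding window model concrete, the following is a minimal sketch (not the paper's implementation) of buffering the latest n values of a stream, with old values evicted automatically as new ones arrive; the function name and the overlapping-window behavior are illustrative assumptions:

```python
from collections import deque

def sliding_windows(stream, n):
    """Buffer the latest n values of a data stream and yield each full window.

    `stream` is any iterable of values; `n` is the window size.
    Illustrative sketch only: window contents overlap by n - 1 values.
    """
    window = deque(maxlen=n)  # oldest value is evicted automatically
    for value in stream:
        window.append(value)
        if len(window) == n:
            yield tuple(window)

# Toy stream of five readings with a window of size 3:
print(list(sliding_windows([1, 2, 3, 4, 5], 3)))
# -> [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
```

Because the buffer holds only n values at a time, the stream itself is scanned exactly once, which matches the one-look constraint described above.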
In this paper, we investigate finding frequent subsequences in multiple data streams. The proposed algorithm finds frequent subsequences by clustering the subsequences of a data stream; a subsequence is considered frequent if the number of similar subsequences in its cluster is above a threshold value called the support. Because of the challenging characteristics of data streams (continuous, unbounded in size, and usually arriving at a high rate), the proposed algorithm is incremental, efficient, and an any-time algorithm. It is incremental because, at the end of every window, it does not recompute the clustering of similar subsequences but instead updates the previous clustering results; it employs a decay value for each discovered cluster to determine when to remove old clusters and retain recent ones. It is efficient because it scans the data streams only once, and it is an any-time algorithm because the frequent subsequences are ready at the end of every window.
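The update-and-decay step described above can be sketched as follows. This is an illustrative toy, not the paper's algorithm: the cluster representation, the `similar` predicate, and the `decay` and `min_weight` parameters are all assumptions introduced for the example.

```python
def update_clusters(clusters, new_subsequences, similar, decay=0.5, min_weight=0.25):
    """One incremental update at the end of a window (illustrative sketch).

    `clusters` maps a representative subsequence to a weight (its count of
    similar members, decayed over time). `similar(a, b)` is an assumed
    similarity predicate. All names and parameters here are hypothetical.
    """
    # Age existing clusters so stale patterns fade out; drop fully decayed ones.
    for rep in list(clusters):
        clusters[rep] *= decay
        if clusters[rep] < min_weight:
            del clusters[rep]
    # Assign each new subsequence to the first similar cluster, or start a new one.
    for seq in new_subsequences:
        for rep in clusters:
            if similar(rep, seq):
                clusters[rep] += 1
                break
        else:
            clusters[seq] = 1.0
    return clusters

def frequent(clusters, support):
    """Subsequences whose cluster weight meets the support threshold."""
    return [rep for rep, w in clusters.items() if w >= support]

# Toy run: subsequences are tuples; "similar" is plain equality for simplicity.
clusters = {}
update_clusters(clusters, [(1, 2), (1, 2), (3, 4)], similar=lambda a, b: a == b)
print(frequent(clusters, support=2))  # -> [(1, 2)]
```

Only the previous window's cluster summaries are carried forward, never the raw data, which is what makes the update incremental and keeps memory bounded.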
Finding frequent subsequences, or clusters of subsequences, is useful in many applications: in network monitoring, to discover common usage patterns; in financial markets, to explore common stock trends, which supports prediction of their future behavior; in web analysis, where discovering click patterns would help website administrators buffer and pre-fetch busy web pages more efficiently and place advertisements; and on busy servers, where finding the load pattern would help system administrators design a more effective load-balancing scheme. Applications like these, together with the lack of efficient, incremental algorithms for finding frequent subsequences, motivated this work.
Although there is much work on mining frequent itemsets over transactional data streams, little has been done on mining frequent subsequences over streaming real-valued data. Moreover, most existing work deals with a single data stream, while the proposed algorithm handles multiple data streams. The main contributions of this paper are:
- The proposed algorithm is incremental: the clustering results of the current window build on the results of previous windows, and a decay value is employed to remove old frequent subsequences and retain the most recent ones.
- The proposed algorithm is an any-time algorithm: the clustering results of frequent subsequences are readily available at the end of every window.
- The proposed algorithm is exact: no approximation of the data is used.
- The proposed algorithm is designed to be executed in parallel over multiple data streams.
The rest of this paper is organized as follows. Section 2 discusses the related work. In Section 3, we present some background information and formally define the problem and propose a solution. The proposed algorithm is presented in Section 4. In Section 5, we discuss the results of our experiments and show the feasibility of our approach. Finally, we conclude the paper in Section 6.