Data Stream Mining

Jesse Read (Universidad Carlos III, Spain) and Albert Bifet (Yahoo! Research Barcelona, Spain)
Copyright: © 2014 | Pages: 3
DOI: 10.4018/978-1-4666-5202-6.ch061

Nowadays, the quantity of data created every day is growing fast. Indeed, it has been estimated that 2007 was the first year in which it was no longer possible to store all the data we produce. This massive amount of data opens new and challenging discovery tasks, which are the topic of this chapter.

Real-time data stream analytics (Masud, 2013) are needed to manage the data currently generated, at an ever-increasing rate, by applications such as sensor networks, measurements in network monitoring and traffic management, log records or click-streams in Web browsing, manufacturing processes, call detail records, email, blogs, Twitter posts, and others. In fact, all generated data can be considered streaming data, or a snapshot of streaming data, since it is obtained over an interval of time.

In the data stream model, data arrive at high speed, and algorithms that process them must do so under very strict constraints of space and time. Consequently, data streams pose several challenges for data mining algorithm design. First, algorithms must make use of limited resources (time and memory). Second, they must deal with data whose nature or distribution changes over time.

We need to manage these resources in an efficient and low-cost way (Gaber, 2005). In data stream mining, we are interested in three main dimensions:

  • Accuracy.

  • Amount of space (computer memory) necessary.

  • The time required to learn from training examples and to predict.

These dimensions are typically interdependent: adjusting the time and space used by an algorithm can influence accuracy. By storing more pre-computed information, such as look-up tables, an algorithm can run faster at the expense of space. An algorithm can also run faster by processing less information, either by stopping early or by storing less, thus having less data to process. The more time an algorithm has, the more likely it is that accuracy can be increased.
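As an illustrative sketch of the constant-space constraint (our example, not from the chapter), Welford's online algorithm maintains the mean and variance of a stream in O(1) memory, updating one example at a time rather than storing the stream:

```python
# Constant-space stream summary via Welford's online algorithm: memory use
# is O(1) regardless of how many examples arrive. (Illustrative sketch.)

class RunningStats:
    def __init__(self):
        self.n = 0        # examples seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # Population variance of the examples seen so far
        return self.m2 / self.n if self.n else 0.0

stats = RunningStats()
for x in [2.0, 4.0, 6.0, 8.0]:
    stats.update(x)
print(stats.mean)        # 5.0
print(stats.variance())  # 5.0
```

The same single-pass, bounded-memory pattern underlies stream learning algorithms: each example updates sufficient statistics and is then discarded.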


Main Focus

The most important challenge in data stream mining is how to perform low-cost data mining analysis in real time. In evolving data streams we are concerned with:

  • Evolution of accuracy.

  • Probability of false alarms.

  • Probability of true detections.

  • Average delay time in detection.

Some learning methods do not have change detectors built in, and for these it may be hard to measure false-positive and false-negative rates or the average delay time in detection. In such cases, learning curves may be a useful alternative for observing the evolution of accuracy in changing environments.
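The idea behind an explicit change detector can be sketched as follows. This is a deliberate simplification of detectors such as DDM or ADWIN, which use statistical tests; the fixed window size and threshold here are our illustrative assumptions:

```python
from collections import deque

# Window-based change detection sketch: compare the error rate in the most
# recent window against the window just before it, and flag drift when the
# gap exceeds a fixed threshold. (Simplified illustration; real detectors
# such as DDM or ADWIN use principled statistical tests instead.)

def detect_drift(errors, window=20, threshold=0.3):
    """errors: iterable of 0/1 prediction mistakes, in arrival order.
    Returns the index at which drift is first flagged, or None."""
    older = deque()   # the `window` errors preceding the recent window
    recent = deque()  # the most recent `window` errors
    for i, e in enumerate(errors):
        recent.append(e)
        if len(recent) > window:
            older.append(recent.popleft())
            if len(older) > window:
                older.popleft()
        if len(older) == window:
            gap = sum(recent) / window - sum(older) / window
            if gap > threshold:
                return i
    return None

# A stable stream followed by a jump in error rate triggers detection:
stream = [0] * 60 + [1] * 60
print(detect_drift(stream))  # flags drift shortly after position 60
```

When drift is flagged, a typical response is to reset or replace the current model so that learning restarts on the new distribution.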

The main requirements of an ideal learning method for mining evolving data streams are the following: high accuracy and fast adaptation to change, low computational cost in both space and time, theoretical performance guarantees, and a minimal number of parameters.

Key Terms in this Chapter

Online Boosting: Ensemble of classifiers for evolving data streams that gives more weight to misclassified examples and reduces the weight of correctly classified ones.

Data Stream Mining: Process for obtaining useful information from data that arrives continuously in real time.

Hoeffding Tree: A decision tree designed for mining data streams. It has theoretical guarantees that the output of a Hoeffding tree is asymptotically nearly identical to that of a non-incremental learner using infinitely many examples.
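The guarantee behind the Hoeffding tree rests on the Hoeffding bound, which is simple to compute. A small sketch (the parameter names are ours, not the chapter's):

```python
import math

# Hoeffding bound: after n independent observations of a variable with
# range R, the observed mean is within epsilon of the true mean with
# probability at least 1 - delta. A Hoeffding tree splits on the best
# attribute once the gap in observed information gain between the two best
# attributes exceeds epsilon.

def hoeffding_bound(R, delta, n):
    return math.sqrt((R * R * math.log(1.0 / delta)) / (2.0 * n))

# With two classes, information gain has range R = 1 bit; the bound
# tightens as more examples arrive:
for n in (100, 1000, 10000):
    print(n, round(hoeffding_bound(1.0, 1e-7, n), 4))
```

Because the bound depends only on n, R, and delta, the tree can decide when to split from sufficient statistics alone, without revisiting past examples.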

Online Bagging: Ensemble of classifiers for evolving data streams, where each classifier has a different bootstrap sample.
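The bootstrap simulation in online bagging (due to Oza and Russell) can be sketched in a few lines. The trivial majority-class base learner and the `partial_fit` interface below are our illustrative assumptions, not part of the chapter:

```python
import math
import random

class MajorityClass:
    """Trivial incremental base learner: predicts the label seen most often."""
    def __init__(self):
        self.counts = {}
    def partial_fit(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1
    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else None

def poisson1(rng):
    """Draw from Poisson(lambda=1) using Knuth's method."""
    L, k, p = math.exp(-1.0), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

class OnlineBagging:
    """Each incoming example is shown to each base model k ~ Poisson(1)
    times, approximating a bootstrap sample without storing the stream."""
    def __init__(self, n_models=10, seed=0):
        self.models = [MajorityClass() for _ in range(n_models)]
        self.rng = random.Random(seed)
    def partial_fit(self, x, y):
        for m in self.models:
            for _ in range(poisson1(self.rng)):
                m.partial_fit(x, y)
    def predict(self, x):
        votes = {}
        for m in self.models:
            label = m.predict(x)
            if label is not None:
                votes[label] = votes.get(label, 0) + 1
        return max(votes, key=votes.get) if votes else None
```

In practice the base learners would be Hoeffding trees or other incremental classifiers rather than the toy model used here.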
