Data Stream Mining

Jesse Read (Universidad Carlos III, Spain) and Albert Bifet (Yahoo! Research Barcelona, Spain)
Copyright: © 2014 | Pages: 3
DOI: 10.4018/978-1-4666-5202-6.ch061


The quantity of data created every day is growing fast. It has been estimated that 2007 was the first year in which it was no longer possible to store all the data being produced. This massive volume of data opens up challenging new discovery tasks, and the goal of this chapter is to discuss them.

Real-time data stream analytics (Masud, 2013) is needed to manage the data currently generated, at an ever-increasing rate, by applications such as sensor networks, network monitoring and traffic management measurements, log records and click-streams from Web browsing, manufacturing processes, call detail records, email, blogs, and Twitter posts. In fact, all generated data can be regarded either as streaming data or as a snapshot of streaming data, since it is collected over some interval of time.

In the data stream model, data arrive at high speed, and algorithms that process them must do so under very strict constraints of space and time. Consequently, data streams pose several challenges for data mining algorithm design. First, algorithms must make use of limited resources (time and memory). Second, they must deal with data whose nature or distribution changes over time.
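As a small illustration of working within these constraints, the sketch below maintains the mean and variance of a stream in a single pass and in constant memory, using Welford's update. The class name RunningStats and the sample values are hypothetical and chosen only for illustration; they do not come from the chapter.

```python
class RunningStats:
    """Single-pass mean and variance in O(1) memory (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        # Each example is seen once and then discarded.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance; returns 0.0 when fewer than two values were seen.
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)
```

The memory used does not depend on the number of examples processed, which is the property a stream algorithm needs when the data cannot be stored.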

We need to deal with resources in an efficient and low-cost way (Gaber, 2005). In data stream mining, we are interested in three main dimensions:

  • Accuracy.

  • Amount of space (computer memory) necessary.

  • The time required to learn from training examples and to predict.

These dimensions are typically interdependent: adjusting the time and space used by an algorithm can influence its accuracy. By storing more pre-computed information, such as lookup tables, an algorithm can run faster at the expense of space. An algorithm can also run faster by processing less information, either by stopping early or by storing less and thus having less data to process. Conversely, the more time an algorithm has, the more likely it is that accuracy can be increased.


Main Focus

The most important challenge in data stream mining is how to perform low-cost data mining analysis in real time. In evolving data streams, we are concerned with the following measures:

  • Evolution of accuracy.

  • Probability of false alarms.

  • Probability of true detections.

  • Average delay time in detection.
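When the true change points of a stream are known, for example in a synthetic benchmark, the detection measures listed above can be estimated roughly as in the following sketch. The function name, the matching rule (the first alarm after a change counts as its detection, every other alarm as a false alarm), and the example time steps are all assumptions made for illustration. The sketch also reports a count of false alarms rather than a probability.

```python
def detection_metrics(true_changes, detections, stream_length):
    """Summarise a change detector's alarms against known change points.

    true_changes: sorted time steps at which the distribution really changed.
    detections:   sorted time steps at which the detector raised an alarm.
    Assumed rule: the first alarm between a change and the next change is
    a true detection; all other alarms are false alarms.
    """
    boundaries = list(true_changes) + [stream_length]
    delays = []
    # Alarms raised before the first real change are false alarms.
    false_alarms = sum(1 for d in detections if d < boundaries[0])
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        in_segment = [d for d in detections if start <= d < end]
        if in_segment:
            delays.append(in_segment[0] - start)
            false_alarms += len(in_segment) - 1
    n_changes = len(true_changes)
    return {
        "true_detection_rate": len(delays) / n_changes if n_changes else 0.0,
        "false_alarms": false_alarms,
        "mean_delay": sum(delays) / len(delays) if delays else None,
    }


# Hypothetical run: real changes at t=1000 and t=2000, four alarms raised.
metrics = detection_metrics([1000, 2000], [400, 1030, 1500, 2100], 3000)
```

In this hypothetical run both changes are detected, two alarms are spurious, and the delays of 30 and 100 steps average to 65.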

Some learning methods do not include a built-in change detector, so it may be hard to define their false positive and false negative rates or their average detection delay. In these cases, learning curves may be a useful alternative for observing the evolution of accuracy in changing environments.
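One common way to obtain such a learning curve is test-then-train (prequential) evaluation with a sliding window; the text above does not name this scheme, so it is an assumption here. In the sketch below, MajorityClass is a hypothetical toy learner, and the synthetic stream and the window size of 100 are likewise illustrative.

```python
from collections import Counter, deque


class MajorityClass:
    """Toy incremental learner: predicts the most frequent label so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn(self, x, y):
        self.counts[y] += 1


def prequential_curve(stream, model, window=100):
    """Test-then-train: predict each example first, then learn from it.

    Returns the accuracy over the most recent `window` examples at
    every time step, which can be plotted as a learning curve.
    """
    recent = deque(maxlen=window)
    curve = []
    for x, y in stream:
        recent.append(1 if model.predict(x) == y else 0)
        model.learn(x, y)
        curve.append(sum(recent) / len(recent))
    return curve


# Synthetic stream whose label flips halfway through (an abrupt change).
stream = [(None, 0)] * 500 + [(None, 1)] * 500
curve = prequential_curve(stream, MajorityClass(), window=100)
```

Because this toy learner never forgets old examples, its windowed accuracy drops after the change at step 500 and does not recover within the stream, which is the kind of behaviour a learning curve makes visible without any change detector.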

The main requirements for an ideal learning method for mining evolving data streams are the following: high accuracy and fast adaptation to change, low computational cost in both space and time, theoretical performance guarantees, and a minimal number of parameters.

Key Terms in this Chapter

Online Boosting: An ensemble of classifiers for evolving data streams that gives more weight to misclassified examples and reduces the weight of correctly classified ones.

Data Stream Mining: The process of obtaining useful information from data that arrives continuously in real time.

Hoeffding Tree: A decision tree designed for mining data streams. It comes with theoretical guarantees that its output is asymptotically nearly identical to that of a non-incremental learner using infinitely many examples.

Online Bagging: An ensemble of classifiers for evolving data streams, in which each classifier is trained on a different bootstrap sample.
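
To make the Online Bagging entry above more concrete: on a stream, a bootstrap sample cannot be drawn in advance, so a common approximation (due to Oza and Russell, and not spelled out in the definition above) presents each arriving example to each ensemble member k times, with k drawn from a Poisson(1) distribution. The sketch below follows that idea. CountingLearner, OnlineBagging, poisson1, and the toy stream are hypothetical names and data used only for illustration.

```python
import math
import random
from collections import Counter


def poisson1(rng):
    """Draw from Poisson(lambda=1) using Knuth's multiplication method."""
    threshold, k, p = math.exp(-1.0), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


class CountingLearner:
    """Hypothetical base learner: weighted majority-class predictor."""

    def __init__(self):
        self.counts = Counter()

    def learn(self, x, y, weight):
        self.counts[y] += weight

    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None


class OnlineBagging:
    def __init__(self, make_learner, n_models=10, seed=0):
        self.rng = random.Random(seed)
        self.models = [make_learner() for _ in range(n_models)]

    def learn(self, x, y):
        # Each member sees the example k ~ Poisson(1) times, which
        # approximates drawing its own bootstrap sample of the stream.
        for model in self.models:
            k = poisson1(self.rng)
            if k > 0:
                model.learn(x, y, weight=k)

    def predict(self, x):
        # Majority vote over the members' predictions.
        votes = Counter(m.predict(x) for m in self.models)
        return votes.most_common(1)[0][0]


ensemble = OnlineBagging(CountingLearner, n_models=5, seed=42)
for i in range(200):
    # Toy stream: label 0 on every fourth example, label 1 otherwise.
    ensemble.learn(None, 1 if i % 4 else 0)
```

Each member ends up with slightly different class counts because its Poisson draws differ, which is what gives the ensemble its diversity.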
