Synopsis Data Structures for Representing, Querying, and Mining Data Streams

Synopsis Data Structures for Representing, Querying, and Mining Data Streams

Alfredo Cuzzocrea
DOI: 10.4018/978-1-60566-242-8.ch075
OnDemand:
(Individual Chapters)
Available
$37.50
No Current Special Offers
TOTAL SAVINGS: $37.50

Abstract

Data-stream query processing and mining is an emerging challenge for the database research community. This issue has recently gained the attention from the academic as well as the industrial world. Data streams are continuously flooding data produced by untraditional information sources. They are generated by a growing number of data entities, among all we recall performance measurements in network monitoring and traffic management systems, call details records in telecommunication systems, transactions in retail chains, ATM operations, log records generated by Web servers, sensor network data, RFID- (radio frequency identification) based readings, and so forth.
Chapter Preview
Top

Introduction

Data-stream query processing and mining is an emerging challenge for the database research community. This issue has recently gained the attention from the academic as well as the industrial world. Data streams are continuously flooding data produced by untraditional information sources. They are generated by a growing number of data entities, among all we recall performance measurements in network monitoring and traffic management systems, call details records in telecommunication systems, transactions in retail chains, ATM operations, log records generated by Web servers, sensor network data, RFID- (radio frequency identification) based readings, and so forth.

The most distinctive characteristics of data streams are (a) massive volumes of data (e.g., terabytes) and (b) tuples arriving rapidly; the latter make DBMS- (database management system) inspired computational models, which are typically memory bounded, inappropriate for efficiently processing such kinds of data. More specifically, as regards the model used to represent and perform computation over data streams, we can identify the following widely accepted properties characterizing data streams (Babcock, Babu, Datar, Motawani, & Widom, 2002). First, data items of the stream arrive online; typically a time stamp is associated with each, and, as a consequence, the reading of a stream can be modeled as a tuple of kind 〈id, val, ts〉 such that id is the identifier of the reading (e.g., the RFID tag), val is the value of the reading (e.g., the bar code of the product identified by the RFID tag), and ts is the time stamp of the data item (e.g., the time instant in which the value val was produced). Second, the stream processor has no control over the order in which data items arrive (and, thus, over the order in which data items can be processed). Next, the data stream is potentially unbounded. Finally, once an item of the data stream has been processed, it must be discarded so there is no possibility of it being reprocessing more times. For what regards queries, there are two distinct classes of queries that are of interest for extracting useful information and knowledge from data streams (Babcock et al., 2002): (a) one-time queries, which are evaluated once over a point-in-time snapshot of the stream (e.g., an aggregate operator, such as SUM and COUNT, computed over the collection of items belonging to the target data stream and having a time stamp contained within a given time interval [ti, tj], with ti < tj and tj smaller than the current time t), and (b) continuous queries, which produce a stream of answers over time, reflecting the evolution of the target data stream; in other words, a continuous query Q with frequency 978-1-60566-242-8.ch075.m01 produces, at each time t, an answer At(Q) computed by considering the items of the stream whose time stamp is contained within the interval [t-TQ, t], with TQ being an input parameter of Q. It should be noted that, while one-time queries are meaningful evolutions of conventional DBMS queries targeted at the particular application context (i.e., data-stream query processing), continuous queries represent a novel and very interesting class of queries that allow us to think of new scenarios of continuous, useful information and knowledge delivery beyond traditional client-server computational schemes, and also pose important challenges for the database and data warehouse research community. In order to efficiently support continuous query answering, new models and algorithms able to fit the novel requirements dictated by the computational model underlying such queries are needed.

Contrary to the above-described characteristics, data-stream-based applications and systems require processing queries and mining patterns over continuously evolving data in real time, thus involving the need for innovative models and algorithms for representing, querying, and mining data streams. As a consequence, there is a stringent need for designing and developing the so-called data-stream management system (DSMS), that is, a system that efficiently supports representation, indexing, query, and mining functionalities over data streams by overcoming the limitations of traditional DBMS.

Key Terms in this Chapter

Query Optimizer: The DBMS component devoted to optimize, based on relational algebra rules, the input query in order to find the optimal query plan with respect to spatial and temporal constraints.

Continuous Query: Query defined over the flooding data stream and producing in output a stream of answers.

Time Window: A fixed interval of time when the data stream is processed for query and mining purposes.

Data-Stream Pattern: A repetitive, regular sequence of items of the data stream.

Selectivity of a Query: The number of tuples involved by evaluating the query.

Predicate of a Query: The condition expressed by the query in its where clause.

Data Stream: An unconventional information source that produces unbounded, continuously flooding data.

On-Time Query: Query defined over a fixed temporal snapshot of the data stream.

Time Stamp: The time instant related to a stream reading and reporting the (absolute) time in which the reading was produced.

Reading: The point value produced by a stream source.

Complete Chapter List

Search this Book:
Reset