OLAP Over Probabilistic Data


Alfredo Cuzzocrea (ICAR-CNR, Italy & University of Calabria, Italy)
Copyright: © 2014 | Pages: 12
DOI: 10.4018/978-1-4666-5202-6.ch148

Chapter Preview



Probabilistic data (e.g., (Barbarà et al., 1992; Cheng et al., 2003; Dalvi & Suciu, 2004; Dalvi & Suciu, 2007; Ré & Suciu, 2008; Benjelloun et al., 2009; Agrawal et al., 2006; Sarma et al., 2008)) are becoming one of the most attractive kinds of data for database researchers, since this format/formalism perfectly captures two novel, interesting classes of datasets that occur very often in modern database application scenarios, namely uncertain and imprecise data (e.g., (Ge et al., 2013)). Uncertain and imprecise data are indeed very popular, as uncertainty and imprecision affect the very processes devoted to collecting data from input data sources and using these data to populate the target database (e.g., (Balcan et al., 2013)). Consider, for instance, the simple case of a sensory database (Bonnet et al., 2001) populated by a sensor network monitoring the temperature T of a given geographic area S. Since T is a natural, real-life measure, one is likely to retrieve an uncertain and imprecise estimate of T, denoted by T̂, within a given confidence interval (Papoulis, 1994), denoted by [Tl, Tu], such that Tl < Tu, holding with a certain probability pT, such that 0 ≤ pT ≤ 1, rather than to obtain the exact value of T. The semantics of this confidence-interval-based model states that the (estimated) value T̂ of T ranges between Tl and Tu with probability pT. In popular probabilistic database models (e.g., (Benjelloun et al., 2009; Agrawal et al., 2006; Sarma et al., 2008)), confidence intervals and the related probabilities are embedded directly into the probabilistic tables, thus originating probabilistic attributes storing probabilistic (attribute) values, which compose probabilistic tuples.
The physical reasons for uncertainty and imprecision of data are manifold, and they can be found in the inherent randomness and incompleteness of data, sampling errors, human errors, instrument errors, data unavailability, delayed data updates, and so forth.
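As a minimal sketch of the confidence-interval-based model described above, the fragment below represents a probabilistic attribute value as an interval [Tl, Tu] holding with probability pT, and derives a simple probabilistic OLAP-style aggregate (an expected SUM) over a set of sensor readings. The class name `ProbabilisticValue`, the sample readings, and the uniform-distribution assumption inside each interval are all illustrative, not part of the chapter's model.

```python
from dataclasses import dataclass

@dataclass
class ProbabilisticValue:
    """A confidence-interval-based probabilistic attribute value:
    the estimate lies in [low, high] with probability p (0 <= p <= 1)."""
    low: float
    high: float
    p: float

    def expected_value(self) -> float:
        # Illustrative assumption: the value is uniformly distributed
        # within [low, high], so its conditional expectation is the
        # interval midpoint, weighted by the probability p that the
        # value falls inside the interval at all.
        return self.p * (self.low + self.high) / 2.0

# A probabilistic table fragment: temperature readings over an area S
readings = [
    ProbabilisticValue(19.5, 20.5, 0.90),
    ProbabilisticValue(21.0, 23.0, 0.80),
    ProbabilisticValue(18.0, 19.0, 0.95),
]

# A simple probabilistic OLAP aggregate: the expected SUM of T
expected_sum = sum(r.expected_value() for r in readings)  # ≈ 53.175
```

Richer models would carry full probability distributions per attribute rather than a single interval, but the same pattern applies: aggregates over probabilistic tuples yield expected (rather than exact) answers.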

Key Terms in this Chapter

Data Warehousing: A central repository of current and historical data, built by integrating data from heterogeneous sources.

Confidence Interval: The range of plausible values for a parameter. In statistics, it is used to specify the reliability of an estimate.

OLAP: On-Line Analytical Processing, or OLAP, designates a set of software techniques for the interactive analysis of large amounts of multidimensional data from multiple perspectives.

Probabilistic Database: A collection of data whose correctness is uncertain and known only with some probability.

Data Cube: A multidimensional dataset used to explore and analyze business data from many different perspectives.

Uncertain and Imprecise Data: Data whose correctness is uncertain due to a variety of causes, such as measurement errors or network delays.
