Introduction
Probabilistic data (e.g., (Barbarà et al., 1992; Cheng et al., 2003; Dalvi & Suciu, 2004; Dalvi & Suciu, 2007; Ré & Suciu, 2008; Benjelloun et al., 2009; Agrawal et al., 2006; Sarma et al., 2008)) are becoming one of the most attractive kinds of data for database researchers, since this format/formalism naturally captures two novel, interesting classes of datasets that frequently occur in modern database application scenarios, namely uncertain and imprecise data (e.g., (Ge et al., 2013)). Uncertain and imprecise data are indeed very common, as uncertainty and imprecision affect the very processes devoted to collecting data from input data sources and using these data to populate the target database (e.g., (Balcan et al., 2013)).

Consider, for instance, the simple case of a sensory database (Bonnet et al., 2001) populated by a sensor network monitoring the temperature T of a given geographic area S. Here, since T is a natural, real-life measure, it is likely that one retrieves an uncertain and imprecise estimate of T, denoted by T̃, with a given confidence interval (Papoulis, 1994), denoted by [Tl, Tu], such that Tl < Tu, holding with a certain probability pT, such that 0 ≤ pT ≤ 1, rather than the exact value of T. The semantics of this confidence-interval-based model states that the (estimated) value T̃ of T ranges between Tl and Tu with probability pT. In popular probabilistic database models (e.g., (Benjelloun et al., 2009; Agrawal et al., 2006; Sarma et al., 2008)), confidence intervals and their associated probabilities are embedded directly into the probabilistic tables, thus originating probabilistic attributes storing probabilistic (attribute) values, which compose probabilistic tuples.
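The confidence-interval-based model above can be illustrated with a minimal sketch. The following is not the data model of any specific probabilistic database system; it simply assumes a probabilistic attribute value is represented as a triple (lower bound, upper bound, probability), as in the temperature example, and the record schema shown is hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProbabilisticValue:
    """A confidence-interval-based probabilistic attribute value:
    the true value lies in [lo, hi] with probability p."""
    lo: float   # lower bound of the confidence interval
    hi: float   # upper bound of the confidence interval
    p: float    # probability that the true value falls in [lo, hi]

    def __post_init__(self):
        # Enforce the model's constraints: lo < hi and 0 <= p <= 1.
        if not self.lo < self.hi:
            raise ValueError("interval bounds must satisfy lo < hi")
        if not 0.0 <= self.p <= 1.0:
            raise ValueError("probability must lie in [0, 1]")

    def contains(self, value: float) -> bool:
        """True if a candidate exact value falls inside the interval."""
        return self.lo <= value <= self.hi

# A probabilistic tuple for one sensor reading (hypothetical schema):
# the temperature attribute stores an interval estimate, not an exact value.
reading = {
    "sensor_id": "s17",
    "temperature": ProbabilisticValue(lo=21.5, hi=23.0, p=0.95),
}
```

In this sketch the interval and its probability travel together as one attribute value, mirroring how probabilistic tables embed confidence intervals directly alongside the data they qualify.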
The physical causes of uncertainty and imprecision in data are manifold: they include the inherent randomness and incompleteness of data, sampling errors, human errors, instrument errors, data unavailability, delayed data updates, and so forth.