Bayesian Belief Networks for Data Cleaning
Enrico Fagiuoli (Università degli Studi di Milano-Bicocca, Italy), Sara Omerino (ETNOTEAM S.p.A., Italy) and Fabio Stella (Università degli Studi di Milano-Bicocca, Italy)
Copyright: © 2008
The importance of data cleaning and data quality is becoming increasingly clear, as evidenced by the surge in software, tools, consulting companies and seminars addressing data quality issues. In this contribution the authors describe how Bayesian computational techniques can be exploited for data cleaning, with the aim of reducing the time needed to clean and understand the data. The proposed approach relies on the Bayesian belief network, a general statistical model that allows the efficient description and treatment of joint probability distributions. The work describes a conceptual framework that maps Bayesian belief networks to some of the most difficult tasks in data cleaning, namely imputing missing values, completing truncated datasets and detecting outliers. The framework is supported by a set of numerical experiments performed with the Bayesian belief network programming suite HUGIN.
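To make the idea of imputing a missing value with a Bayesian belief network concrete, the following is a minimal sketch in plain Python. It is not the authors' framework and does not use the HUGIN suite. The three-variable network (region as the parent of income and owner), its probability tables, and the function names are all hypothetical and chosen only for illustration. The joint distribution is factored along the network, and the missing field is filled with its most probable value given the observed fields. The low-probability flag at the end is one simple, assumed way to score unusual records, not necessarily the method used in the chapter.

```python
from itertools import product

# Hypothetical network: region -> income, region -> owner.
# All probabilities below are invented for illustration.
DOMAINS = {
    "region": ["north", "south"],
    "income": ["high", "low"],
    "owner": ["yes", "no"],
}
P_REGION = {"north": 0.6, "south": 0.4}
P_INCOME = {"north": {"high": 0.7, "low": 0.3},
            "south": {"high": 0.2, "low": 0.8}}
P_OWNER = {"north": {"yes": 0.5, "no": 0.5},
           "south": {"yes": 0.1, "no": 0.9}}


def joint(region, income, owner):
    """Joint probability, factored along the network structure."""
    return P_REGION[region] * P_INCOME[region][income] * P_OWNER[region][owner]


def posterior(target, record):
    """Posterior of the missing field `target` given the observed fields.

    Fields set to None are treated as missing; missing fields other
    than `target` are summed out by brute-force enumeration.
    """
    free = [k for k in DOMAINS if k != target and record.get(k) is None]
    scores = {}
    for value in DOMAINS[target]:
        total = 0.0
        for combo in product(*(DOMAINS[k] for k in free)):
            full = {**record, target: value, **dict(zip(free, combo))}
            total += joint(full["region"], full["income"], full["owner"])
        scores[value] = total
    norm = sum(scores.values())
    return {value: score / norm for value, score in scores.items()}


def impute(target, record):
    """Fill the missing field with its most probable value."""
    post = posterior(target, record)
    return max(post, key=post.get)


def is_unlikely(record, threshold=0.01):
    """Assumed outlier score: flag complete records with low joint probability."""
    return joint(record["region"], record["income"], record["owner"]) < threshold


incomplete = {"region": None, "income": "high", "owner": "yes"}
print(posterior("region", incomplete))
print(impute("region", incomplete))
```

Enumeration is exponential in the number of missing fields, so it is only practical for toy networks like this one; dedicated tools use more efficient inference algorithms.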