An Efficient Algorithm for Data Cleaning

Payal Pahwa, Rajiv Arora, Garima Thakur
Copyright: © 2011 | Pages: 16
DOI: 10.4018/ijkbo.2011100104

Abstract

The quality of real-world data fed into a data warehouse is a major concern. Because the data comes from a variety of sources, it must be checked for errors and anomalies before it is loaded into the data warehouse. The source data may contain exact duplicate records or approximate duplicate records. The presence of incorrect or inconsistent data can significantly distort the results of analyses, often negating the potential benefits of information-driven approaches. This paper addresses the detection and correction of such duplicate records. It also analyzes data quality and the factors that degrade it. Existing work is briefly analyzed and its major limitations are pointed out. A new framework is then proposed that improves on the existing technique.

Introduction

Data warehousing is the process of transforming data into information and making it available to users in a timely manner.

A data warehouse is a central repository of an organization's electronically stored data (http://en.wikipedia.org). Our approach focuses on identifying approximate duplicate records before they are loaded into the data warehouse. We therefore present a brief overview of the various sources of errors that arise from machine or human intervention (Hernandez & Stolfo, 1995, 1998).
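
To make the distinction between exact and approximate duplicate records concrete, the following minimal Python sketch flags both kinds in a small batch of records before loading. It is an illustration only and is not the framework proposed in this paper: the function names, the sample records, the use of `difflib.SequenceMatcher` as the similarity measure, and the 0.85 threshold are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def normalize(record):
    # Lowercase each field and collapse runs of whitespace so that
    # purely cosmetic differences do not hide an exact duplicate.
    return tuple(" ".join(str(field).lower().split()) for field in record)


def similarity(a, b):
    # Similarity ratio in [0, 1] between two normalized records,
    # computed on their space-joined fields.
    return SequenceMatcher(None, " ".join(a), " ".join(b)).ratio()


def find_duplicates(records, threshold=0.85):
    # Compare every pair of records (quadratic; shown only for clarity).
    # threshold is an assumed cut-off, not a value taken from the paper.
    normalized = [normalize(r) for r in records]
    exact, approximate = [], []
    for i in range(len(normalized)):
        for j in range(i + 1, len(normalized)):
            if normalized[i] == normalized[j]:
                exact.append((i, j))
            elif similarity(normalized[i], normalized[j]) >= threshold:
                approximate.append((i, j))
    return exact, approximate


# Hypothetical source records: (name, address, city)
records = [
    ("John Smith", "12 Park Street", "Delhi"),
    ("john smith", "12  Park Street", "Delhi"),
    ("Jon Smith", "12 Park St.", "Delhi"),
    ("Asha Verma", "4 Lake Road", "Mumbai"),
]

exact, approximate = find_duplicates(records)
print("exact duplicates:", exact)
print("approximate duplicates:", approximate)
```

In this sketch, records 0 and 1 become identical after normalization and are reported as exact duplicates, while record 2 differs by a misspelling and an abbreviation and can only be caught by the similarity test. The all-pairs loop is suitable only for small inputs.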
