A Machine Learning Approach to Data Cleaning in Databases and Data Warehouses

Hamid Haidarian Shahri (University of Maryland, USA)
Copyright: © 2008 | Pages: 15
DOI: 10.4018/978-1-59904-853-6.ch030

Abstract

Entity resolution (also known as duplicate elimination) is an important part of the data cleaning process, especially in data integration and warehousing, where data are gathered from distributed and inconsistent sources. Learnable string similarity measures are an active area of research on the entity resolution problem. Our proposed framework builds upon our earlier work on entity resolution, in which fuzzy rules and membership functions are defined by the user. Here, we exploit neuro-fuzzy modeling for the first time to produce a unique adaptive framework for entity resolution, which automatically learns and adapts to the specific notion of similarity at a meta-level. This framework encompasses much of the previous work on trainable and domain-specific similarity measures. By employing fuzzy inference, it removes the repetitive task of hard-coding a program based on a schema, which is usually required in previous approaches. In addition, our extensible framework is very flexible for the end user. Hence, it can be utilized in the production of an intelligent tool to increase the quality and accuracy of data.

Key Terms in this Chapter

Data Warehouse: A data warehouse is a database designed for the business intelligence requirements and managerial decision making of an organization. The data warehouse integrates data from the various operational systems and is typically loaded from these systems at regular intervals. It contains historical information that enables the analysis of business performance over time. The data are subject oriented, integrated, time variant, and nonvolatile.

Mamdani Method of Inference: Mamdani’s fuzzy inference method is the most commonly seen fuzzy methodology. It was proposed in 1975 by Ebrahim Mamdani as an attempt to control a steam engine and boiler combination. Mamdani-type inference expects the output membership functions to be fuzzy sets. After the aggregation process, there is a fuzzy set for each output variable that needs defuzzification. It is possible, and in many cases much more efficient, to use a single spike as the output membership function rather than a distributed fuzzy set. This type of output is sometimes known as a singleton output membership function, and it can be thought of as a “predefuzzified” fuzzy set. It enhances the efficiency of the defuzzification process because it greatly simplifies the computation required by the more general Mamdani method, which finds the centroid of a two-dimensional function. Rather than integrating across the two-dimensional function to find the centroid, you use the weighted average of a few data points. Sugeno-type systems support this type of model.
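The efficiency argument in this definition can be illustrated with a small Python sketch. It is not taken from the chapter: the triangular output sets, the rule strengths, and the function names (`triangle`, `mamdani_centroid`, `singleton_weighted_average`) are all hypothetical choices made for illustration. The first function clips each output set at its rule strength, aggregates with max, and integrates numerically to find the centroid; the second replaces each output set with a single spike and takes a weighted average.

```python
def triangle(x, a, b, c):
    """Triangular membership function with feet at a and c and peak at b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def mamdani_centroid(rule_strengths, output_sets, lo, hi, steps=1000):
    """Centroid defuzzification of clipped, max-aggregated output sets.

    output_sets holds hypothetical (a, b, c) triangles, one per rule.
    The centroid is approximated by sampling steps + 1 points on [lo, hi].
    """
    dx = (hi - lo) / steps
    num = 0.0
    den = 0.0
    for i in range(steps + 1):
        x = lo + i * dx
        # Clip each set at its rule strength (min), then aggregate (max).
        mu = max(
            min(w, triangle(x, *abc))
            for w, abc in zip(rule_strengths, output_sets)
        )
        num += x * mu
        den += mu
    return num / den if den else None


def singleton_weighted_average(rule_strengths, singletons):
    """Defuzzify singleton outputs: one multiply-add per rule, no integration."""
    total = sum(rule_strengths)
    if total == 0:
        return None
    return sum(w * z for w, z in zip(rule_strengths, singletons)) / total


# Assumed example: two rules firing with strengths 0.2 and 0.8.
strengths = [0.2, 0.8]
low_high_sets = [(0.0, 0.25, 0.5), (0.5, 0.75, 1.0)]  # "low", "high"
spikes = [0.25, 0.75]  # singletons placed at the triangle peaks

centroid = mamdani_centroid(strengths, low_high_sets, 0.0, 1.0)
average = singleton_weighted_average(strengths, spikes)
```

The two results need not coincide, since clipping changes the shape of each set, but the singleton version touches only as many points as there are rules, whereas the centroid loop evaluates every rule at every sample point.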

Machine Learning: Machine learning is an area of artificial intelligence concerned with the development of techniques that allow computers to learn. Learning is the ability of the machine to improve its performance based on previous results.

OLTP (Online Transaction Processing): OLTP involves operational systems for collecting and managing the base data in an organization specified by transactions, such as sales order processing, inventory, accounts payable, and so forth. It usually offers little or no analytical capabilities.

OLAP (Online Analytical Processing): OLAP involves systems for the retrieval and analysis of data to reveal business trends and statistics that are not directly visible in the raw data retrieved from a database. It provides multidimensional, summarized views of business data and is used for reporting, analysis, modeling, and planning to optimize the business.

Data Cleaning: Data cleaning is the process of improving the quality of the data by modifying their form or content, for example, removing or correcting erroneous data values, filling in missing values, and so forth.

Sugeno Method of Inference: Introduced in 1985, it is similar to the Mamdani method in many respects. The first two parts of the fuzzy inference process, fuzzifying the inputs and applying the fuzzy operator, are exactly the same. The main difference between Mamdani and Sugeno is that the Sugeno output membership functions are either linear or constant.
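As a rough illustration of the definition above, the following Python sketch runs a first-order Sugeno inference over two similarity scores, in the spirit of the entity resolution setting described in the abstract. The membership functions, the four rules, and their coefficients are invented for this example and are not the chapter's actual rule base. Inputs are fuzzified, the fuzzy AND is taken as min, each rule's output is a linear or constant function of the inputs, and the final score is the weighted average of the rule outputs.

```python
def high(x):
    """Membership of a similarity score in [0, 1] in the fuzzy set 'high'."""
    return max(0.0, min(1.0, x))


def low(x):
    """Membership in the fuzzy set 'low', taken as the complement of 'high'."""
    return 1.0 - high(x)


# Hypothetical rule base. Each rule is
# (membership for name similarity, membership for address similarity, (p, q, r))
# with output z = p * name_sim + q * addr_sim + r.
RULES = [
    (high, high, (0.5, 0.5, 0.0)),  # linear output
    (low, low, (0.0, 0.0, 0.0)),    # constant output 0.0
    (high, low, (0.0, 0.0, 0.4)),   # constant output 0.4
    (low, high, (0.0, 0.0, 0.4)),   # constant output 0.4
]


def sugeno_duplicate_score(name_sim, addr_sim, rules=RULES):
    """Return a crisp duplicate score via Sugeno-type inference."""
    num = 0.0
    den = 0.0
    for mf_name, mf_addr, (p, q, r) in rules:
        # Fuzzify both inputs and apply the fuzzy AND operator (min).
        w = min(mf_name(name_sim), mf_addr(addr_sim))
        # Linear (or constant) output membership function.
        z = p * name_sim + q * addr_sim + r
        num += w * z
        den += w
    return num / den if den else None


score_match = sugeno_duplicate_score(1.0, 1.0)
score_mixed = sugeno_duplicate_score(1.0, 0.0)
score_none = sugeno_duplicate_score(0.0, 0.0)
```

Because each output is already a number rather than a fuzzy set, no separate defuzzification step is needed; the weighted average is the final result.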

Complete Chapter List

Editorial Advisory Board
Program Committee
Table of Contents
Foreword
Maria Amparo Vila, Miguel Delgado
Preface
José Galindo
Acknowledgment
Chapter 1
José Galindo
This chapter presents an introduction to fuzzy logic and to fuzzy databases. With regard to the first topic, we have introduced the main concepts in...
Introduction and Trends to Fuzzy Logic and Fuzzy Databases
Chapter 2
Slawomir Zadrozny, Guy de Tré, Rita de Caluwe, Janusz Kacprzyk
In reality, a lot of information is available only in an imperfect form. This might be due to imprecision, vagueness, uncertainty, incompleteness...
An Overview of Fuzzy Approaches to Flexible Database Querying
Chapter 3
Balazs Feil, Janos Abonyi
This chapter aims to give a comprehensive view about the links between fuzzy logic and data mining. It will be shown that knowledge extracted from...
Introduction to Fuzzy Data Mining Methods
Chapter 4
Didier Dubois, Henri Prade
The chapter advocates the interest of distinguishing between negative and positive preferences in the processing of flexible queries. Negative...
Handling Bipolar Queries in Fuzzy Information Processing
Chapter 5
Noureddine Mouaddib, Guillaume Raschia, W. Amenel Voglozin, Laurent Ughetto
This chapter presents a discussion on fuzzy querying. It deals with the whole process of fuzzy querying, from the query formulation to its...
From User Requirements to Evaluation Strategies of Flexible Queries in Databases
Chapter 6
P Bosc, A Hadjali, O Pivert
The idea of extending the usual Boolean queries with preferences has become a hot topic in the database community. One of the advantages of this...
On the Versatility of Fuzzy Sets for Modeling Flexible Queries
Chapter 7
Guy De Tré, Marysa Demoor, Bert Callens, Lise Gosseye
In case-based reasoning (CBR), a new untreated case is compared to cases that have been treated earlier, after which data from the similar cases (if...
Flexible Querying Techniques Based on CBR
Chapter 8
Gloria Bordogna, Giuseppe Psaila
In this chapter, we present the Soft-SQL project whose goal is to define a rich extension of SQL aimed at effectively exploiting flexibility offered...
Customizable Flexible Querying in Classical Relational Databases
Chapter 9
Cornelia Tudorie
The topic presented in this chapter refers to qualifying objects in some kinds of vague queries sent to relational databases. We want to compute a...
Qualifying Objects in Classical Relational Database Querying
Chapter 10
Ludovic Liétard, Daniel Rocacher
This chapter is devoted to the evaluation of quantified statements which can be found in many applications as decision making, expert systems, or...
Evaluation of Quantified Statements Using Gradual Numbers
Chapter 11
Angélica Urrutia, Leonid Tineo, Claudia Gonzalez
Actually, FSQL and SQLf are the main fuzzy logic based proposed extensions to SQL. It would be very interesting to integrate them with a standard...
FSQL and SQLf: Towards a Standard in Fuzzy Databases
Chapter 12
Rallou Thomopoulos, Patrice Buche, Ollivier Haemmerlé
Within the framework of flexible querying of possibilistic databases, based on the fuzzy set theory, this chapter focuses on the case where the...
Hierarchical Fuzzy Sets to Query Possibilistic Databases
Chapter 13
Troels Andreasen, Henrik Bulskov
The use of taxonomies and ontologies as a foundation for enhancing textual information base access has recently gained increased attention in the...
Query Expansion by Taxonomy
Chapter 14
Mohamed Ali Ben Hassine, Amel Grissa Touzi, José Galindo, Habib Ounelli
Fuzzy relational databases have been introduced to deal with uncertain or incomplete information demonstrating the efficiency of processing fuzzy...
How to Achieve Fuzzy Relational Databases Managing Fuzzy Data and Metadata
Chapter 15
Geraldo Xexéo, André Braga
We present CLOUDS, which stands for C++ Library Organizing Uncertainty in Database Systems, a tool that allows the creation of fuzzy reasoning...
A Tool for Fuzzy Reasoning and Querying
Chapter 16
Aleksandar Takaci, Srdan Škrbic
This chapter introduces a way to extend the relational model with mechanisms that can handle imprecise, uncertain, and inconsistent attribute values...
Data Model of FRDB with Different Data Types and PFSQL
Chapter 17
Carlos D. Barranco, Jesús R. Campaña, Juan M. Medina
This chapter introduces a fuzzy object-relational database model including fuzzy extensions of the basic object-relational databases constructs, the...
Towards a Fuzzy Object-Relational Database Model
Chapter 18
Radim Belohlavek
Formal concept analysis is a particular method of analysis of relational data. Also, formal concept analysis provides elaborate mathematical...
Relational Data, Formal Concept Analysis, and Graded Attributes
Chapter 19
Markus Schneider
Spatial database systems and geographical information systems are currently only able to support geographical applications that deal with crisp...
Fuzzy Spatial Data Types for Spatial Uncertainty Management in Databases
Chapter 20
Yauheni Veryha, Jean-Yves Blot, Joao Coelho
There are many well-known applications of fuzzy sets theory in various fields of science and technology. However, we think that the area of maritime...
Fuzzy Classification in Shipwreck Scatter Analysis
Chapter 21
Yan Chen, Graham H. Rong, Jianhua Chen
A Web-based fabric database is introduced in terms of its physical structure, software system architecture, basic and intelligent search engines...
Fabric Database and Fuzzy Logic Models for Evaluating Fabric Performance
Chapter 22
R. A. Carrasco, F. Araque, A. Salguero, M. A. Vila
Soaring is a recreational activity and a competitive sport where individuals fly un-powered aircrafts known as gliders. The soaring location...
Applying Fuzzy Data Mining to Tourism Area
Chapter 23
Andreas Meier, Günter Schindler, Nicolas Werro
In practice, information systems are based on very large data collections mostly stored in relational databases. As a result of information...
Fuzzy Classification on Relational Databases
Chapter 24
Shyue-Liang Wang, Ju-Wen Shen, Tzung-Pei Hong
Mining functional dependencies (FDs) from databases has been identified as an important database analysis technique. It has received considerable...
Incremental Discovery of Fuzzy Functional Dependencies
Chapter 25
Radim Belohlavek, Vilem Vychodil
This chapter deals with data dependencies in Codd’s relational model of data. In particular, we deal with fuzzy logic extensions of the relational...
Data Dependencies in Codd's Relational Model with Similarities
Chapter 26
Awadhesh Kumar Sharma, A. Goswami, D. K. Gupta
In this chapter, the concept of fuzzy inclusion dependencies (FIDas) in fuzzy databases is introduced and inference rules on such FIDas are derived....
Fuzzy Inclusion Dependencies in Fuzzy Databases
Chapter 27
Wai-Ho Au
The mining of fuzzy association rules has been proposed in the literature recently. Many of the ensuing algorithms are developed to make use of only...
A Distributed Algorithm for Mining Fuzzy Association Rules in Traditional Databases
Chapter 28
Yi Wang
This chapter applies fuzzy logic to a dynamic causal mining (DCM) algorithm and argues that DCM, a combination of association mining and system...
Applying Fuzzy Logic in Dynamic Causal Mining
Chapter 29
Céline Fiot
The explosive growth of collected and stored data has generated a need for new techniques transforming these large amounts of data into useful...
Fuzzy Sequential Patterns for Quantitative Data Mining
Chapter 30
Hamid Haidarian Shahri
Entity resolution (also known as duplicate elimination) is an important part of the data cleaning process, especially in data integration and...
A Machine Learning Approach to Data Cleaning in Databases and Data Warehouses
Chapter 31
Malcolm Beynon
The general fuzzy decision tree approach encapsulates the benefits of being an inductive learning technique to classify objects, utilising the...
Fuzzy Decision-Tree-Based Analysis of Databases
Chapter 32
Malcolm Beynon
Outranking methods are a family of techniques concerned with ranking the preference for alternatives based on the criteria values that describe...
Fuzzy Outranking Methods Including Fuzzy PROMETHEE
Chapter 33
J. I. Peláez, J. M. Doña, D. La Red
Missing data is often an actual problem in real data sets, and different imputation techniques are normally used to alleviate this problem....
Fuzzy Imputation Method for Database Systems
Chapter 34
Safìye Turgay
In this chapter, an agent-based fuzzy data mining structure was developed to process and evaluate data with an enlargement in the knowledge...
Intelligent Fuzzy Database Management Systems
About the Editor
About the Contributors