Entity Resolution on Single Relation

Entity Resolution on Single Relation

Copyright: © 2014 |Pages: 36
DOI: 10.4018/978-1-4666-5198-2.ch005
OnDemand PDF Download:
$30.00
List Price: $37.50

Abstract

A basic work of entity resolution is to detect duplicate records in single relation. To address this problem, many different approaches for different areas are proposed. The basic process of entity resolution is attribute similarity computation. Based on the attribute similarity computation methods, many techniques for different areas are proposed to fulfill the process of entity resolution. Rule-based approach is one of the main techniques for entity resolution. To speed up the process of duplicate record detecting, the authors use techniques such as canopy and blocking. In this chapter, the authors focus on the record similarity computation, rule-based approach, similarity threshold computation, and blocking.
Chapter Preview
Top

Record Similarity Computation

In this section, we focus on the problem of lexical heterogeneity and explore various techniques for addressing this problem. We introduce basic record matching criteria in section 1.1, and approaches for detecting duplicate record in section 1.2, and two basic and traditional techniques to improve the efficacy of detection are presented in section 1.3.

Establishing Record Matching Criteria

Before we start to detect duplicate record, we should establish record matching criteria. Approaches for computing similarity of two records include character-based, token-based, phonetic and numeric computation.

Character-Based Similarity Computation

Exact Match

The basic way in record matching is the exact matching. For example, if a relation schema contains student ID and student name, we can perform exact matching on the student ID attribute. The data set should be clean enough to perform exact matching, i.e., two tuples are free from data entry errors (e.g., Google versus Googel).

Distance Match

One of the methodologies to compute similarity between character strings is the distance match method. There are different similarity metrics of distance match method; here we cover the following computation methods

Edit Distance

Edit distance, specifically refer to Levenshtein distance, is defined as the smallest number of insertions, deletions, and substitutions required to change one string or tree into another. For example, the Levenshtein distance between “kitten” and “sitting” is 3, since the following three edits change one into the other, and there is no way to do it with fewer than three edits:

Kitten → sitten (substitution of “s” for “k”) sitten → sittin (substitution of “i” for “e”) sittin → sitting (insertion of “g” at the end)

The basic algorithm to compute edit distance between two strings and takes time complexity. Landau et al. presented an algorithm in which can determine whether the edit distance between two strings is less than k.

Complete Chapter List

Search this Book:
Reset