In this section, we focus on the problem of lexical heterogeneity and explore various techniques for addressing it. We introduce basic record matching criteria in section 1.1, approaches for detecting duplicate records in section 1.2, and two basic, traditional techniques for improving the efficiency of detection in section 1.3.
Establishing Record Matching Criteria
Before we can detect duplicate records, we must establish record matching criteria. Approaches for computing the similarity of two records include character-based, token-based, phonetic, and numeric similarity computation.
Character-Based Similarity Computation
Exact Match
The most basic form of record matching is exact matching. For example, if a relation schema contains a student ID and a student name, we can perform exact matching on the student ID attribute. The data set must be clean enough for exact matching to work, i.e., the tuples must be free from data entry errors (e.g., Google versus Googel).
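To make the idea concrete, the following minimal sketch performs exact matching on a key attribute. The attribute names (student_id, student_name) and the sample records are hypothetical, chosen only to mirror the schema described above.

```python
# A minimal sketch of exact matching on a key attribute.
# The schema (student_id, student_name) and sample tuples are hypothetical.

def exact_match(record_a: dict, record_b: dict, key: str = "student_id") -> bool:
    """Two records match iff their key attribute values are identical."""
    return record_a[key] == record_b[key]

r1 = {"student_id": "S1001", "student_name": "Alice Liu"}
r2 = {"student_id": "S1001", "student_name": "Alice Lu"}  # name contains a typo

print(exact_match(r1, r2))  # True: the IDs agree despite the typo in the name
```

Note that exact matching on the name attribute would fail here, which is precisely why character-based similarity measures such as edit distance are needed.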
Distance Match
One methodology for computing the similarity between character strings is the distance match method. There are different similarity metrics within this family; here we cover the following computation methods.
Edit Distance
Edit distance, which here refers specifically to the Levenshtein distance, is defined as the smallest number of insertions, deletions, and substitutions required to change one string into another. For example, the Levenshtein distance between “kitten” and “sitting” is 3, since the following three edits change one into the other, and there is no way to do it with fewer than three edits:
kitten → sitten (substitution of “s” for “k”)
sitten → sittin (substitution of “i” for “e”)
sittin → sitting (insertion of “g” at the end)
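The Levenshtein distance is typically computed with dynamic programming over prefixes of the two strings. The following Python sketch illustrates one standard row-by-row formulation; the function and variable names are ours, not taken from any particular library.

```python
# A standard dynamic-programming sketch of Levenshtein distance.

def levenshtein(s: str, t: str) -> int:
    """Smallest number of insertions, deletions, and substitutions
    needed to turn s into t."""
    m, n = len(s), len(t)
    # prev[j] holds the distance between the current prefix of s and t[:j].
    prev = list(range(n + 1))      # row for the empty prefix of s
    for i in range(1, m + 1):
        curr = [i] + [0] * n       # turning s[:i] into "" costs i deletions
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion from s
                          curr[j - 1] + 1,     # insertion into s
                          prev[j - 1] + cost)  # substitution (or match)
        prev = curr
    return prev[n]

print(levenshtein("kitten", "sitting"))  # 3, matching the example above
```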
The basic dynamic-programming algorithm for computing the edit distance between two strings $\sigma_1$ and $\sigma_2$ takes $O(|\sigma_1| \cdot |\sigma_2|)$ time. Landau et al. presented an algorithm that runs in $O(\max(|\sigma_1|, |\sigma_2|) \cdot k)$ time and can determine whether the edit distance between two strings is at most $k$.
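The intuition behind such $k$-bounded algorithms can be illustrated with a banded version of the dynamic program, which fills only the table cells within $k$ of the main diagonal and therefore does $O(\max(|\sigma_1|, |\sigma_2|) \cdot k)$ work. The sketch below shows this classic band-restricted formulation; it is an illustration of the bound, not a reconstruction of Landau et al.'s actual (suffix-tree-based) algorithm.

```python
# A banded dynamic-programming sketch that decides whether the edit
# distance between s and t is at most k, computing only cells within
# k of the diagonal. Illustrative only; not Landau et al.'s algorithm.

def within_edit_distance(s: str, t: str, k: int) -> bool:
    m, n = len(s), len(t)
    if abs(m - n) > k:           # length difference alone forces distance > k
        return False
    INF = k + 1                  # any value above k can be capped safely
    prev = [j if j <= k else INF for j in range(n + 1)]
    for i in range(1, m + 1):
        lo, hi = max(1, i - k), min(n, i + k)   # band around the diagonal
        curr = [INF] * (n + 1)
        if i <= k:
            curr[0] = i
        for j in range(lo, hi + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution (or match)
        prev = curr
    return prev[n] <= k

print(within_edit_distance("kitten", "sitting", 3))  # True
print(within_edit_distance("kitten", "sitting", 2))  # False
```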