Basic Data Operators for Entity Resolution

Copyright: © 2014 |Pages: 24
DOI: 10.4018/978-1-4666-5198-2.ch011


This chapter focuses on the basic data operators for entity resolution: similarity search, similarity join, and clustering on sets or strings. These three problems are of increasing complexity, and the solutions to the simpler problems serve as building blocks for the harder ones. The authors first introduce solutions to similarity search, covering gram-based and sketch-based algorithms. The chapter then turns to similarity join, covering both exact and approximate algorithms. Finally, the authors address the problem of clustering similar strings in a set, which can be applied to duplicate detection in databases.
Chapter Preview

The goal of similarity search is to find the strings that are similar to a given query string. As readers might have noticed, this problem arises throughout entity resolution, and its solution serves as a powerful tool for tackling more complicated problems, such as similarity join and clustering.

Find the strings D similar to a given query string Q, that is, those satisfying dist(Q, D) <= δ for a distance threshold δ.

Example: Find strings similar to “hadjeleftheriou”
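The threshold query above can be sketched as a simple linear scan. This is only an illustrative baseline, not one of the gram-based or sketch-based algorithms the chapter covers. It assumes that dist is edit distance; the function names, the candidate strings, and the threshold δ = 2 are hypothetical choices made for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or match
            ))
        prev = curr
    return prev[-1]


def similarity_search(query: str, strings: list[str], delta: int) -> list[str]:
    """Return every string D with edit_distance(query, D) <= delta."""
    return [s for s in strings if edit_distance(query, s) <= delta]


# Hypothetical candidate collection, chosen only for illustration.
candidates = [
    "hadjieleftheriou",
    "hadjeleftheriou",
    "hadjelefteriou",
    "chatzieleftheriou",
    "smith",
]
matches = similarity_search("hadjeleftheriou", candidates, delta=2)
print(matches)
```

A scan like this compares the query against every string, so its cost grows with the size of the collection; it is useful mainly as a correctness reference for faster, index-based methods.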

Similarity Measures and Distances (Xiao, Wang, Lin, Yu & Wang, 2011)

  • Jaccard similarity of two sets x and y is defined as J(x, y) = |x ∩ y| / |x ∪ y|

  • Cosine similarity is defined as C(x, y) = |x ∩ y| / sqrt(|x| · |y|)

  • Overlap similarity is defined as O(x, y) = |x ∩ y|

  • Hamming distance between x and y is defined as the size of their symmetric difference: H(x, y) = |(x − y) ∪ (y − x)|

  • Edit distance, also known as Levenshtein distance, measures the minimum number of edit operations needed to make two strings identical.
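The set-based measures in this list can be illustrated with a minimal sketch. It assumes the standard set formulations (Jaccard as intersection size over union size, cosine as intersection size over the square root of the product of the set sizes, overlap as intersection size, Hamming as the size of the symmetric difference). Representing each string as its set of 2-grams is also an assumption made here for illustration; the helper names are hypothetical.

```python
import math


def qgrams(s: str, q: int = 2) -> set[str]:
    """Set of length-q substrings of s (duplicate grams collapse)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard(x: set, y: set) -> float:
    """|x ∩ y| / |x ∪ y|; two empty sets are treated as identical."""
    union = x | y
    return len(x & y) / len(union) if union else 1.0


def cosine(x: set, y: set) -> float:
    """|x ∩ y| / sqrt(|x| * |y|); 0.0 if either set is empty."""
    if not x or not y:
        return 0.0
    return len(x & y) / math.sqrt(len(x) * len(y))


def overlap(x: set, y: set) -> int:
    """|x ∩ y|"""
    return len(x & y)


def hamming(x: set, y: set) -> int:
    """Size of the symmetric difference of x and y."""
    return len(x ^ y)


x = qgrams("abcd")  # {"ab", "bc", "cd"}
y = qgrams("abce")  # {"ab", "bc", "ce"}
print(jaccard(x, y), cosine(x, y), overlap(x, y), hamming(x, y))
```

For this pair the two sets share two of four distinct grams, so the Jaccard similarity is 0.5, the overlap is 2, and the Hamming distance is 2. The handling of empty sets is a convention picked for this sketch, not something the chapter preview specifies.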

