Abstract
This chapter focuses on the basic data operators for entity resolution, which include similarity search, similarity join, and clustering on sets or strings. These three problems are of increasing complexity, and the solution of simpler problems is the building blocks for the harder problem. The authors first introduce the solution of similarity search, covering gram-based algorithms and sketch-based algorithms. Then the chapter turns to the solution of similarity join, covering both exact and approximate algorithms. At last, the authors deal with the problem of clustering similar strings in a set, which can be applied to duplicate detection in databases.
TopSimilarity Search
The goal of similarity search is to find similar strings for a given query string. As the readers might have noticed, this problem exists ubiquitously entity resolution, and solution of this problem serves as a powerful weapon for tackling more complicated problems, such as similarity join and clustering.
Find strings similar to a given string: dist (Q, D) <= δ
Example: Find strings similar to “hadjeleftheriou”
Similarity Measures and Distances (Xiao, Wang, Lin, Yu & Wang, 2011)
- •
Jaccard Similarity is defined as 
- •
Cosine similarity is defined as 
- •
Overlap similarity is defined as 
- •
Hamming distance between x and y is defined as the size of their symmetric difference: 
- •
Edit distance, also known as Levenshtein distance, measures the minimum number of edit operations needed to make two strings identical.