Measures of Entity Resolution Result

Copyright: © 2014 |Pages: 25
DOI: 10.4018/978-1-4666-5198-2.ch002

Abstract

In this chapter, the authors introduce how to measure the result of Entity Resolution (ER). Once an entity resolution process has been carried out, one needs to know how good its result is. This is usually done by comparing the ER result with the ground truth. First, the chapter presents precision and recall, the two parameters most commonly used to measure ER results today. It then discusses three categories of measures: pairwise, distance-based, and cluster-based. Section two concentrates on the distance-based measures. Section three compares these methods and discusses how to choose among them for a specific application or circumstance.
Chapter Preview

Introduction

Entity Resolution (ER) is a fundamental problem in data cleaning. In real-world applications, we may want to merge records from one or more data sources. In this process, a crucial task is to identify the data records in the whole dataset that represent the same entity in different ways. An entity, here, may be a person, an item, or anything else that is described by data. This process of identifying and merging is called entity resolution, or ER for short. For example, in natural language processing, before analyzing sentence structure, the first task is to find the expressions that denote the same thing in reality, e.g. “the United States of America” and “USA” both refer to the same country. Another example from real applications is data merging. Suppose that we need to merge financial records from different banks. We must then identify each person, who may appear under different names in the different datasets. We use ER algorithms to identify each person and to bring together all of that person's accounts from the different banks. Because the same entity is so often represented in different forms, the ER problem arises in many domains, such as natural language processing (NLP), machine learning, and databases.

Because of its importance and practical value, this problem has been studied in depth for a long time. Many algorithms focus on how to identify records of the same entity. However, as more and more algorithms appear, a new problem emerges. Different methods solve the ER problem in different ways and produce different results, so the question becomes which result is the “best” among all the candidates. To answer this question, we proceed in the usual way of selecting among candidates: we define measures, examine each candidate with them, and choose the one with the highest score. Here, the measures are defined on ER results. Each ER result is evaluated to produce a score, and we then choose the best one.
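The selection procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the chapter's own method: it assumes that each ER result and the ground truth are given as lists of record-ID sets, that a result is scored by pairwise precision and recall, and that the F1 score is used as the single number for ranking. The names `cluster_pairs`, `pairwise_scores`, `algorithm_a`, and `algorithm_b`, as well as the record IDs, are hypothetical.

```python
from itertools import combinations


def cluster_pairs(clusters):
    """Return the set of unordered record pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        # Sorting gives every pair one canonical (a, b) order.
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pairwise_scores(result, truth):
    """Pairwise precision, recall and F1 of an ER result against the truth."""
    predicted = cluster_pairs(result)
    actual = cluster_pairs(truth)
    correct = len(predicted & actual)
    # Assumption: an empty pair set counts as perfect precision or recall.
    precision = correct / len(predicted) if predicted else 1.0
    recall = correct / len(actual) if actual else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical ground truth and two hypothetical candidate ER results.
truth = [{"r1", "r2", "r3"}, {"r4"}, {"r5", "r6"}]
candidates = {
    "algorithm_a": [{"r1", "r2"}, {"r3"}, {"r4"}, {"r5", "r6"}],
    "algorithm_b": [{"r1", "r2", "r3", "r4"}, {"r5", "r6"}],
}

# Score every candidate, then pick the one with the highest F1 (assumed criterion).
scores = {name: pairwise_scores(res, truth) for name, res in candidates.items()}
best = max(scores, key=lambda name: scores[name][2])

for name, (p, r, f1) in scores.items():
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
print("best:", best)
```

In this toy data the first candidate splits a true cluster (high precision, lower recall) and the second merges two true clusters (lower precision, full recall), which shows why a single ranking criterion has to be chosen deliberately.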

Measures on ER results are very important because they tell us directly which of the results produced by different algorithms is the best. In short, they evaluate the existing algorithms, and they also give direction to future work on improving the quality of ER results. Many methods for evaluating ER results have been proposed. In this chapter, we give an overview of them and discuss some representative ones in detail. We also contrast these measures and offer guidance on how to choose the proper one for a specific circumstance.
