Overview of Entity Resolution

Copyright: © 2014 |Pages: 14
DOI: 10.4018/978-1-4666-5198-2.ch001

Abstract

Entity resolution is one of many important operations in data quality management, information retrieval, and data management. It has wide applications in Web search, e-commerce search, data cleaning, and information integration. Because of its importance, entity resolution has been studied by researchers in multiple fields, including databases, machine learning, information retrieval, and high-performance computing. The chapters of this book were carefully chosen to cover the broad research issues in entity resolution, and a number of its important applications are covered as well. The purpose of this chapter is to provide an overview of the concepts, applications, and research topics of entity resolution, and of how these topics are covered in this book.

Basic Concepts Of Entity Resolution

Entity resolution is the task of identifying the representations that refer to the same real-world entity in one or more databases, and of recognizing all the distinct real-world entities in those databases.

Entity resolution plays an important role in data management. It is one of the major research problems in data quality management.

According to the form of its result, entity resolution can be classified into two types. One is pair-wise entity resolution, whose result is a set of pairs of data objects that refer to the same real-world entity. The other is group-wise entity resolution, whose result is a family of clusters, each containing the data objects that refer to the same real-world entity.
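The two result forms can be related with a short Python sketch. It assumes that matches are transitive, so that a pair-wise result can be turned into a group-wise result by taking connected components with a union-find structure. This is only one common way to connect the two forms, not a method taken from the chapter, and the record identifiers and the function name `pairs_to_groups` are hypothetical.

```python
def pairs_to_groups(records, matched_pairs):
    """Turn a pair-wise result into a group-wise result.

    Assumption (not from the chapter): matching is transitive, so the
    groups are the connected components of the match graph.
    """
    parent = {r: r for r in records}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for r in records:
        groups.setdefault(find(r), set()).add(r)
    return list(groups.values())


# Hypothetical record identifiers and a hypothetical pair-wise result.
records = ["r1", "r2", "r3", "r4", "r5"]
matched_pairs = [("r1", "r2"), ("r2", "r3"), ("r4", "r5")]
groups = pairs_to_groups(records, matched_pairs)
```

Here the pair-wise result lists three matching pairs, while the group-wise result consists of two clusters, one holding r1, r2 and r3 and the other holding r4 and r5.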

Entity resolution has wide applications in many steps in data management and data quality management. We use two examples to explain entity resolution and its applications.

  • Example 1: In the management information system of an enterprise, different departments such as marketing, sales and service may maintain autonomous databases. These databases may be of different types, such as relational databases, XML documents and OO databases, and their data may follow different schemas. The same attribute of an entity may be named and represented in different ways. For example, a customer named “Wei Wang” may be represented as “Wang Wei”, “W Wang”, the pair (Wei, Wang), or the XML fragment <Customer><FamilyName>Wang</FamilyName><GivenName>Wei</GivenName></Customer> in different databases. Acquisitions and reorganizations of enterprises produce more such instances, since the databases of the enterprises involved may contain many different representations of the same real-world entity. Information integrated from such databases may mislead decisions. For example, when counting customers, if the same customer appearing in several databases is treated as different customers, the count is larger than the true number. To support decisions with a management information system, it is necessary to correctly detect the data objects in different databases that refer to the same real-world entity. Additionally, the quantity of enterprise data has become very large. According to a panel at VLDB 2002 (42), the data volume of manufacturing enterprises had reached 100 TB by 2002 and was growing by 20% each year. Therefore, enterprise data management demands entity resolution techniques for massive, frequently updated data in various structures.

  • Example 2: Web sites on the Internet are autonomous, and the information on Web 2.0 sites is entered by many non-expert users. Therefore, one real-world entity may have different descriptions on different web sites, or even in different parts of the same web site, and search results from the Internet may contain various descriptions of the same real-world entity. On the one hand, such duplicated results force users to browse a great deal of similar information, which wastes their time. On the other hand, inconsistent information and wrong statistics derived from retrieval results may lead to wrong decisions. If entity resolution is applied to the retrieval results, clustering them so that the data objects in each cluster refer to the same real-world entity, users receive retrieval results of higher quality and can use the information more effectively. However, entity resolution on the Internet brings challenges. The first challenge is the sheer quantity of information: the number of pages indexed by Google exceeds one trillion (1T). In addition, because so many users are involved, information on the Internet is updated frequently and comes in various types, including XML, relational databases, RDF in graph structure, and HTML. An Internet information collection and retrieval system with quality assurance therefore requires entity resolution on dynamic, massive data of various types.
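The customer-counting problem in Example 1 can be illustrated with a minimal Python sketch. It assumes a deliberately crude matching rule: two name representations match when their lower-cased name parts can be paired so that each pair is equal or one part is the initial of the other, regardless of order. This rule is a toy assumption, not the chapter's method (in practice “Wei Wang” and “Wang Wei” may well be different people), and the helper names and the second customer “Li Na” are hypothetical.

```python
from itertools import permutations
import xml.etree.ElementTree as ET


def tokens(rep):
    """Extract lower-cased name parts from a string, a tuple, or an XML fragment."""
    if isinstance(rep, tuple):
        parts = rep
    elif rep.lstrip().startswith("<"):
        root = ET.fromstring(rep)
        parts = [root.findtext("GivenName", ""), root.findtext("FamilyName", "")]
    else:
        parts = rep.split()
    return sorted(p.strip().lower() for p in parts if p.strip())


def compatible(a, b):
    """Two name parts agree if they are equal or one is the other's initial."""
    return (a == b
            or (len(a) == 1 and b.startswith(a))
            or (len(b) == 1 and a.startswith(b)))


def same_person(rep1, rep2):
    """Toy rule: the parts can be paired up, in some order, so that all pairs agree."""
    t1, t2 = tokens(rep1), tokens(rep2)
    if len(t1) != len(t2):
        return False
    return any(all(compatible(x, y) for x, y in zip(t1, p))
               for p in permutations(t2))


def count_customers(reps):
    """Count representations that match no earlier representative (greedy)."""
    representatives = []
    for rep in reps:
        if not any(same_person(rep, seen) for seen in representatives):
            representatives.append(rep)
    return len(representatives)


reps = [
    "Wei Wang",
    "Wang Wei",
    "W Wang",
    ("Wei", "Wang"),
    "<Customer><FamilyName>Wang</FamilyName><GivenName>Wei</GivenName></Customer>",
    "Li Na",  # hypothetical second customer
]
naive_count = len(reps)                   # every representation counted separately
resolved_count = count_customers(reps)    # count after the toy resolution step
```

Counting every representation separately gives six customers, whereas the toy resolution step finds two, which is the kind of over-count the example describes.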

Other examples of entity resolution include finding special structures in networks and IP alias discovery (Getoor & Machanavajjhala, 2012).

These examples show that entity resolution is important for data quality management and data management. Formally, the two kinds of entity resolution, pair-wise and group-wise, are defined as follows.
