
Entity Resolution on Names

Source Title: Innovative Techniques and Applications of Entity Resolution

DOI: 10.4018/978-1-4666-5198-2.ch003


Abstract

Errors with names occur frequently. “California” and “CA” refer to the same US state, yet both may appear as separate records in the same database. Several techniques are needed to solve such problems. In this chapter, the authors introduce methods of entity resolution on names, covering three approaches. Similarity measures between names are a fundamental technique and account for much of what textual similarity can capture. String transformations handle some situations that lie beyond textual similarity. More recently, learning algorithms on string transformations have been proposed to make matching robust to such variations. Examples illustrate the benefits of each approach.

Chapter Preview


Introduction

In the real world, names are recorded with many kinds of errors. Record matching, the well-known problem of identifying records that refer to the same entity, addresses this situation. Most approaches to record matching rely solely on the textual similarity of each pair of records. Applications include entity resolution in e-commerce (Chapter 16), bibliography (Chapter 15), and medical health information systems (Chapter 17).

In section 1, we introduce several metrics computed by a similarity function. Levenshtein (1966) proposed edit distance, a metric for the distance between two sequences. Edit distance metrics work well for catching typographical errors, but they are typically ineffective for other types of mismatches. As described in Elmagarmid, Ipeirotis, and Verykios (2007), Smith and Waterman proposed an extension of edit distance and affine gap distance in which mismatches at the beginning and the end of strings have lower costs than mismatches in the middle. Their method is a well-known algorithm for local sequence alignment, that is, for determining similar regions between two nucleotide or protein sequences. Instead of looking at the total sequence, the Smith-Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure (Smith & Waterman, 1981). The Jaro-Winkler distance (Jaro, 1989; Winkler, 1990) is a string comparison algorithm used in record linkage (duplicate detection). The higher the Jaro-Winkler score of two strings, the more similar the strings are. The metric is designed for, and best suited to, short strings such as person names. These three algorithms are character-based similarity metrics designed to handle typographical errors.

Token-based similarity metrics are widely used in information retrieval. TF-IDF is a numerical statistic that reflects how important a word is to a document in a data collection (Manning, Raghavan, and Schütze, 2008). Cosine similarity (Tan, 2007) is often used together with TF-IDF. Cohen (1998) described a system named WHIRL that adapts cosine similarity combined with the TF-IDF weighting scheme from information retrieval to compute the similarity of two fields (Elmagarmid, Ipeirotis, & Verykios, 2007).

Character-based and token-based similarity metrics focus on the string-based representation of the database records. However, strings may be phonetically similar even if they are not similar at the character or token level. Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling (R.C. Russell Index).
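
The contrast between character-based and phonetic matching can be illustrated with a minimal Python sketch. It is not taken from the chapter: the function names `levenshtein` and `soundex` are chosen here for illustration, the edit distance uses unit costs for insertion, deletion, and substitution, and the Soundex variant assumed is the common four-character American Soundex.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


# Consonant groups of American Soundex (assumed variant, see lead-in).
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Four-character Soundex code; returns "" if the name has no letters."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded = [letters[0]]
    last_digit = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        digit = _CODES.get(ch, "")
        if digit and digit != last_digit:
            encoded.append(digit)
        last_digit = digit  # a vowel resets the run of equal codes
    return "".join(encoded)[:4].ljust(4, "0")


if __name__ == "__main__":
    # A one-letter spelling variant: small edit distance, same phonetic code.
    print(levenshtein("Smith", "Smyth"), soundex("Smith"), soundex("Smyth"))
    # An abbreviation: large edit distance, which is the kind of mismatch
    # that character-based metrics handle poorly.
    print(levenshtein("California", "CA"))
```

Under these assumptions, "Smith" and "Smyth" differ by one edit and share one Soundex code, whereas "California" and "CA" are far apart by edit distance even though they name the same state.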

Conventional record matching infrastructure offers no flexible way to account for synonyms, where the same name appears in different forms, or for string transformations such as abbreviations. In section 2, we introduce a transformation-based framework for record matching. We begin with preliminary knowledge about context-free grammars (CFGs) (Hopcroft, 2008), which are widely used in compiler theory. A parse tree built from a CFG is one of the most efficient ways to analyze how a string is derived from a grammar: it is easy to construct and lends itself to semantic actions. Two frameworks have been proposed for transformation-based entity representation (Arasu, Chaudhuri & Kaushik, 2008; Arasu & Kaushik, 2009). In this section, we mainly introduce the grammar-based entity representation framework. First, the framework generates productions based on real-world knowledge to construct a context-free grammar. Then it uses parse trees to analyze how a record is generated by the extended grammar of the CFG, and attaches semantic actions to the parse tree in order to determine whether two records are the same. For example, “Dr Andrew J. Smith” is analyzed as first name Andrew and last name Smith. “Smith, Andy J.” is likewise analyzed as first name Andrew and last name Smith. Because both records yield the same first and last name, the framework can recognize them as the same entity.
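
The name example above can be sketched in Python as a toy illustration of the idea. This is not the framework of Arasu and Kaushik: the two productions, the `TITLES` set, the `NICKNAMES` table, and the functions `parse_name` and `same_entity` are all hypothetical, and the "semantic action" is reduced to lowercasing and a nickname lookup.

```python
import re
from typing import List, NamedTuple, Optional

# Hypothetical terminals and synonym table, assumed for this sketch only.
TITLES = {"dr", "mr", "mrs", "ms", "prof"}
NICKNAMES = {"andy": "andrew", "bob": "robert"}


class Name(NamedTuple):
    first: str
    last: str


def _tokens(record: str) -> List[str]:
    # Alphabetic words and the comma are the terminals; periods are dropped.
    return re.findall(r"[A-Za-z]+|,", record)


def _canonical_first(token: str) -> str:
    # Semantic action: map a nickname to its canonical first name.
    lowered = token.lower()
    return NICKNAMES.get(lowered, lowered)


def parse_name(record: str) -> Optional[Name]:
    """Parse a record with two assumed productions:

        Name -> Last ',' First Middle?       (inverted form)
              | Title? First Middle? Last    (natural form)

    Returns None when the record matches neither production.
    """
    toks = _tokens(record)
    if "," in toks:
        well_formed = (toks.count(",") == 1 and toks.index(",") == 1
                       and 3 <= len(toks) <= 4)
        if not well_formed:
            return None
        return Name(first=_canonical_first(toks[2]), last=toks[0].lower())
    if toks and toks[0].lower() in TITLES:
        toks = toks[1:]  # the optional title carries no identity information
    if not 2 <= len(toks) <= 3:
        return None
    return Name(first=_canonical_first(toks[0]), last=toks[-1].lower())


def same_entity(a: str, b: str) -> bool:
    """Two records match if both parse and yield the same (first, last)."""
    parsed_a, parsed_b = parse_name(a), parse_name(b)
    return parsed_a is not None and parsed_a == parsed_b


if __name__ == "__main__":
    print(parse_name("Dr Andrew J. Smith"))
    print(parse_name("Smith, Andy J."))
    print(same_entity("Dr Andrew J. Smith", "Smith, Andy J."))
```

The sketch deliberately ignores the middle initial and handles only single-token first and last names. A fuller grammar would need further productions for cases such as hyphenated or multi-word surnames.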
