The Personal Name Problem and a Data Mining Solution

Clifton Phua (Monash University, Australia), Vincent Lee (Monash University, Australia) and Kate Smith-Miles (Deakin University, Australia)
Copyright: © 2009 | Pages: 8
DOI: 10.4018/978-1-60566-010-3.ch234

Abstract

Almost every person has a life-long personal name which is officially recognised and has only one correct version in their language. Each personal name typically has two components/parts: a first name (also known as given, fore, or Christian name) and a last name (also known as family name or surname). Both of these name components are strongly influenced by cultural, economic, historical, political, and social backgrounds. In most cases, each of these two components can have more than a single word, and the first name is usually gender-specific (see Figure 1). There are three important practical considerations for personal name analysis: • Balance between manual checking and analytical computing. Intuitively, only a small proportion of names should be manually reviewed, the result has to be reasonably accurate, and each personal name should not take too long to process. • Reliability of the verification data has to be examined. Keeping the updating process of the name verification database separate from incoming names can prevent possible data manipulation/corruption over time. However, incompatibility between names in databases can also have genuine causes, such as cultural and historical traditions, translation and transliteration, reporting and recording variations, and typographical and phonetic errors (Borgman and Siegfried, 1992). • Domain knowledge has to be incorporated into the entire process.

Introduction

Almost every person has a life-long personal name which is officially recognised and has only one correct version in their language. Each personal name typically has two components/parts: a first name (also known as given, fore, or Christian name) and a last name (also known as family name or surname). Both of these name components are strongly influenced by cultural, economic, historical, political, and social backgrounds. In most cases, each of these two components can have more than a single word, and the first name is usually gender-specific (see Figure 1).

Figure 1. Hierarchy chart of the inputs, process, and outputs of the name verification task.

There are three important practical considerations for personal name analysis:

  • Balance between manual checking and analytical computing. Intuitively, only a small proportion of names should be manually reviewed, the result has to be reasonably accurate, and each personal name should not take too long to process.

  • Reliability of the verification data has to be examined. Keeping the updating process of the name verification database separate from incoming names can prevent possible data manipulation/corruption over time. However, incompatibility between names in databases can also have genuine causes, such as cultural and historical traditions, translation and transliteration, reporting and recording variations, and typographical and phonetic errors (Borgman and Siegfried, 1992).

  • Domain knowledge has to be incorporated into the entire process. Within the Australian context, the majority of names will be Anglo-Saxon, but the minority will come from many very diverse cultures and nationalities. Therefore, the name verification database has to include a significant number of popular Asian, African, Middle Eastern, and other names.

Figure 1 illustrates the input, process, and output sections. Input refers to the incoming names and the names in the verification database (which acts like an external dictionary of legal names). Process/program refers to the four possible approaches to personal name analysis: exact, phonetic, and similarity matching are the existing, traditional approaches, while classification and hybrids are newer techniques applied to names-only data. Output refers to the insights correctly provided by the process. For simplicity, this paper uses first name to denote both first and middle names, and culture to represent both culture and nationality. The scope here is explicitly to extract first/last name and gender information from a personal name; in addition, culture can be inferred to a large extent (Levitt and Dubner, 2005), and authenticity and age group to a limited extent.
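The three traditional approaches named above can be illustrated with a minimal Python sketch. It is not the chapter's implementation: the tiny dictionary VERIFIED_FIRST_NAMES, the function match_first_name, and the 0.85 similarity threshold are all hypothetical, American Soundex stands in for phonetic matching in general, and the standard library's difflib ratio stands in for similarity matching.

```python
from difflib import SequenceMatcher

# Hypothetical verification database: first name -> gender (illustrative only).
VERIFIED_FIRST_NAMES = {"catherine": "F", "stephen": "M", "robert": "M"}

# Letter-to-digit table for American Soundex.
_SOUNDEX_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _SOUNDEX_CODES[_ch] = _digit


def soundex(name):
    """Return the American Soundex code (one letter plus three digits)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "hw":  # h and w do not separate letters with the same code
            prev = digit
    return (code + "000")[:4]


def match_first_name(name, threshold=0.85):
    """Try exact, then phonetic, then similarity matching (assumed order)."""
    key = name.strip().lower()
    if key in VERIFIED_FIRST_NAMES:
        return ("exact", key)
    code = soundex(key)
    for candidate in VERIFIED_FIRST_NAMES:
        if soundex(candidate) == code:
            return ("phonetic", candidate)
    best = max(VERIFIED_FIRST_NAMES,
               key=lambda c: SequenceMatcher(None, key, c).ratio())
    if SequenceMatcher(None, key, best).ratio() >= threshold:
        return ("similarity", best)
    return ("unmatched", None)
```

In this sketch, "Steven" reaches "stephen" only through the phonetic step, while "Katherine" fails the phonetic step (Soundex keeps the differing first letter) and is caught by similarity matching instead. A name with no close entry in the dictionary falls through all three steps.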

Background

In this paper, we argue that there are four main explanations when the incoming first and last names do not exactly match any name in the verification database. First, the personal name is not authentic and should be manually checked. Second, and most likely, the white list is incomplete: it is impossible to have a name verification database which contains every possible name, especially rare ones. Third, none of the variant spellings of the incoming name is in the database (e.g. Western European last names). Fourth, there are virtually millions of potential name combinations or forms (e.g. East Asian first names).

The last three reasons are problems which prevent incoming personal names from being verified correctly by the database. The personal name problem in this paper refers to the scenario where, without an exact match in the name verification database, ordering and gender (and possibly culture, authenticity, and age group) cannot be determined correctly and automatically for every incoming personal name. Therefore, additional processing is required.
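The scenario described above can be made concrete with a small Python sketch. It assumes, purely for illustration, a two-word personal name and two hypothetical lookup tables (FIRST_NAMES and LAST_NAMES); the function verify and its return format are likewise assumptions and not part of the chapter. Ordering and gender are resolved only when both words match exactly, and anything else is flagged for additional processing.

```python
# Hypothetical verification database (illustrative entries only).
FIRST_NAMES = {"mary": "F", "john": "M"}
LAST_NAMES = {"smith", "nguyen"}


def verify(personal_name):
    """Resolve ordering and gender by exact lookup, or flag the name."""
    tokens = personal_name.lower().split()
    if len(tokens) != 2:
        # Simplification: multi-word components are not handled here.
        return {"status": "needs_processing", "reason": "not a two-word name"}
    a, b = tokens
    if a in FIRST_NAMES and b in LAST_NAMES:
        first, last = a, b          # first-name-first ordering
    elif b in FIRST_NAMES and a in LAST_NAMES:
        first, last = b, a          # last-name-first ordering
    else:
        return {"status": "needs_processing", "reason": "no exact match"}
    return {"status": "verified", "first": first, "last": last,
            "gender": FIRST_NAMES[first]}
```

For example, verify("Nguyen John") resolves the reversed ordering, whereas a rare or variant first name such as "Jon" leaves the whole name unverified even though the last name is known, which is the case that calls for the additional processing discussed here.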

There are three different and broad application categories of related work in name matching (Borgman and Siegfried, 1992):
