Information Retrieval by Linkage Discovery

Richard S. Segall (Arkansas State University, USA) and Shen Lu (Soft Challenge LLC, USA)
DOI: 10.4018/978-1-4666-5888-2.ch387

Chapter Preview



This section discusses the concepts of information retrieval, knowledge extraction, and record linkage (RL), as well as the foundations of record matching and linkage. The latter include parameter estimation and knowledge discovery involving comparisons of semantic similarity between pieces of textual information within and among documents.

Information retrieval is the activity of obtaining information resources relevant to an information need from a collection of information resources; searches can be based on metadata or on full-text indexing. Through knowledge extraction, information retrieval can lead to new knowledge or knowledge discovery (Wikipedia, 2012a).
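The two search styles mentioned above can be contrasted in a minimal Python sketch. The document collection, the field names (`title`, `year`, `text`), and the helper functions are all hypothetical and chosen only for illustration; the sketch assumes whitespace tokenization and exact-term matching, which real retrieval systems refine considerably.

```python
from collections import defaultdict

# Hypothetical toy collection: metadata fields plus a full-text body.
DOCS = {
    1: {"title": "Record linkage basics", "year": 2012,
        "text": "Record linkage finds records that refer to the same entity."},
    2: {"title": "Knowledge extraction", "year": 2012,
        "text": "Knowledge extraction creates knowledge from text and databases."},
    3: {"title": "Topic segmentation", "year": 2013,
        "text": "Segmentation divides text into sentences or topics."},
}

def build_index(docs):
    """Full-text indexing: map each lowercased token to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for token in doc["text"].lower().split():
            index[token.strip(".,;:()")].add(doc_id)
    return index

def search_text(index, query):
    """Return ids of documents whose text contains every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

def search_metadata(docs, **criteria):
    """Return ids of documents whose metadata fields equal the given values."""
    return {doc_id for doc_id, doc in docs.items()
            if all(doc.get(key) == value for key, value in criteria.items())}

INDEX = build_index(DOCS)
text_hits = search_text(INDEX, "knowledge text")
metadata_hits = search_metadata(DOCS, year=2012)
```

Here `search_text` answers a query from document content alone, while `search_metadata` ignores the content and filters on descriptive fields.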

Knowledge extraction is the creation of knowledge from structured (relational databases, XML) and unstructured (text, documents, images) sources. The resulting knowledge needs to be in a machine-readable and machine-interpretable format and must represent knowledge in a manner that facilitates inferencing (Wikipedia, 2012b).

Record linkage (RL) refers to the task of finding records in a data set that refer to the same entity across different data sources. Record linkage is necessary when joining data sets based on entities that may or may not share a common identifier, as may be the case due to differences in record shape, storage location, and/or curator style or preference. A data set that has undergone RL-oriented reconciliation may be referred to as being cross-linked (Wikipedia, 2012c).

Fellegi and Sunter (1969) provided a statistical model for record linkage and discussed different solutions associated with this model for different situations. They concluded that linkage rules can be defined from the observed data. With linkage rules, we can determine whether a pair of records is a link, a non-link, or a possible link.
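A three-way linkage rule of this kind can be sketched in Python as follows. The sketch uses the agreement weights commonly associated with the Fellegi and Sunter model: for each field, m is the probability that the field agrees when two records truly match, and u is the probability that it agrees when they do not. The field names, the m and u values, and both thresholds are illustrative assumptions rather than values from the chapter; in practice they would be estimated from the observed data and from the error rates one is willing to accept.

```python
import math

# Hypothetical per-field probabilities (illustrative values only):
#   m = P(field agrees | the two records are a true match)
#   u = P(field agrees | the two records are not a match)
FIELD_PROBS = {
    "surname": (0.95, 0.01),
    "birth_year": (0.90, 0.05),
    "city": (0.80, 0.10),
}

def match_weight(rec_a, rec_b, field_probs=FIELD_PROBS):
    """Sum the log2 likelihood ratios over all compared fields."""
    total = 0.0
    for field, (m, u) in field_probs.items():
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)              # agreement raises the weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement lowers it
    return total

def classify(weight, lower=0.0, upper=8.0):
    """Apply the linkage rule; both thresholds are assumed, not estimated."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"

a = {"surname": "smith", "birth_year": 1970, "city": "jonesboro"}
b = {"surname": "smith", "birth_year": 1970, "city": "jonesboro"}
c = {"surname": "smith", "birth_year": 1982, "city": "memphis"}
d = {"surname": "lee", "birth_year": 1982, "city": "memphis"}

decisions = [classify(match_weight(a, other)) for other in (b, c, d)]
```

Pairs that fall between the two thresholds are the "possible links", which are typically set aside for clerical review rather than decided automatically.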

The Stanford Entity Resolution Framework (SERF) (2009) provided a general framework for when and how to identify and match a pair of records. It is a linkage process that can be used to match and merge records in two steps: record matching and record merging. In the matching step, SERF defines a black-box mechanism: every record pair passes through the black box and receives similarity values for its different attributes. In the merging step, similar records are merged into one.
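The match-then-merge process described above can be sketched generically in Python. This is not SERF's own code or API: the record layout (each attribute holds a set of values), the Jaccard-based `similarities` black box, the 0.5 threshold, and the sample records are all assumptions made for illustration.

```python
def similarities(rec_a, rec_b):
    """Black-box comparison: one Jaccard similarity value per shared attribute."""
    scores = {}
    for attr in rec_a.keys() & rec_b.keys():
        union = rec_a[attr] | rec_b[attr]
        scores[attr] = len(rec_a[attr] & rec_b[attr]) / len(union) if union else 0.0
    return scores

def match(rec_a, rec_b, threshold=0.5):
    """Assumed rule: records match if any attribute is similar enough."""
    return any(score >= threshold for score in similarities(rec_a, rec_b).values())

def merge(rec_a, rec_b):
    """Merge two records by taking the union of the values of each attribute."""
    return {attr: rec_a.get(attr, set()) | rec_b.get(attr, set())
            for attr in rec_a.keys() | rec_b.keys()}

def resolve(records):
    """Repeatedly match and merge until no two remaining records match."""
    pending = list(records)
    resolved = []
    while pending:
        current = pending.pop()
        partner = next((r for r in resolved if match(current, r)), None)
        if partner is None:
            resolved.append(current)
        else:
            # A merged record may match records it did not match before,
            # so it goes back into the pending list for re-comparison.
            resolved.remove(partner)
            pending.append(merge(current, partner))
    return resolved

RECORDS = [
    {"name": {"j. doe"}, "email": {"jd@example.org"}},
    {"name": {"jane doe"}, "email": {"jd@example.org"}},
    {"name": {"jane doe"}, "email": {"doe@example.com"}},
    {"name": {"a. roe"}, "email": {"roe@example.org"}},
]

entities = resolve(RECORDS)
```

In this sample, the first and third records share no value directly, yet they end up in the same entity because each matches the second record; re-queuing merged records is what lets such indirect links surface.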

Key Terms in this Chapter

Semantic Similarity: A concept whereby a set of documents or terms within term lists are assigned a metric based on the likeness of their meaning/semantic content (Wikipedia, 2012e).

Topic or Text Segmentation: The process of dividing written text into meaningful units, such as words, sentences, or topics (Wikipedia, 2013).

Linkage Discovery: Discovery of connections between or among related topic segments located in different sections of electronic publications.

Semantic Analysis: The process of relating syntactic structures, from the levels of phrases, clauses, sentences, and paragraphs to the level of the writing as a whole, to their language-independent meanings (Wikipedia, 2012d).

Latent Semantic Analysis: A statistical model of word usage that permits comparisons of semantic similarity between pieces of textual information (Foltz, 1996).

Information Retrieval: The activity of obtaining information resources relevant to an information need from a collection of information resources. Searches can be based on metadata or on full-text indexing.
