Multi-Distribution Characteristics Based Chinese Entity Synonym Extraction from The Web

Xiuxia Ma (School of Computer Engineering and Science, Shanghai University, Shanghai, China), Xiangfeng Luo (School of Computer Engineering and Science, Shanghai University, Shanghai, China), Subin Huang (School of Computer Engineering and Science, Shanghai University, Shanghai, China), and Yike Guo (Imperial College London, London, UK)
Copyright: © 2019 | Pages: 22
DOI: 10.4018/IJIIT.2019070103

Abstract

Entity synonyms play an important role in natural language processing applications, such as query expansion and question answering. Synonyms exhibit three main distribution characteristics in web texts: 1) appearing in parallel structures; 2) occurring with specific patterns in sentences; and 3) being distributed in similar contexts. The first and second characteristics rely on reliable prior knowledge and are susceptible to data sparseness, so they yield high accuracy but low recall in synonym extraction. The third may lead to high recall but low accuracy, since it identifies only a somewhat loose semantic similarity. Existing methods, such as context-based and pattern-based methods, consider only one characteristic for synonym extraction and rarely take their complementarity into account. To increase recall, this article proposes a novel extraction framework that combines the three characteristics for extracting synonyms from the web, in which an Entity Synonym Network (ESN) is built to incorporate synonymous knowledge. To improve accuracy, the article treats synonym detection as a ranking problem and uses the Spreading Activation model as a ranking means to detect hard noise in the ESN. Experimental results show that the proposed method achieves better accuracy and recall than state-of-the-art methods.

1. Introduction

For an entity, its synonyms are terms that can be used as alternative names to describe it (e.g., the entity “America” can be referred to as “USA”). Entity synonyms play a vital role in many natural language processing applications, such as query expansion (Krishnan, Deepak, Ranu, & Mehta, 2018; Singh & Kumar, 2017), question answering (Dinevski, Kastrin, Hristovski, & Rindflesch, 2015), and automatic text summarization (Alguliyev, Isazade, Abdi, & Idris, 2017). The distribution of synonyms on the web has three main characteristics: 1) appearing in parallel structures such as tables and lists; 2) occurring with specific patterns in sentences; and 3) being distributed in similar contexts.

A direct method for extracting semantic relations is to exploit parallel structures such as tables and lists in web pages (Song, Gu, & Zhou, 2015; Subramaniyaswamy, 2013). Because these structures are created manually by editors, the knowledge they contain may be incomplete (Wu & Weld, 2007). Pattern-based and context-based methods have been proposed to extract synonyms automatically from web texts. Pattern-based methods infer the relation between two words by analysing specific patterns in sentences. For example, from the sentence “America is commonly referred to as The United States.”, it can be inferred that “America” and “The United States” are synonymous, whereas “America is adjacent to Canada” may imply that “America” and “Canada” are not. Pattern-based methods treat patterns as concrete evidence for discovering synonyms in sentences, which makes them more accurate and interpretable. However, many synonymous terms are never co-mentioned in any sentence, which leads to low recall. Context-based methods, which rely on the third characteristic, assume that words which often appear in similar contexts may be synonyms. For example, the words “America” and “USA” are usually mentioned in similar contexts, and both are synonyms for the country USA. Based on this assumption, context-based methods often represent words by their distributional features and train classifiers to predict whether a pair of terms are synonyms. Since most synonyms appear in similar contexts, this strategy usually has high recall. However, these methods may introduce noise, since some non-synonymous words, such as “USA” and “Canada”, also share similar contexts and could be incorrectly treated as synonyms. Existing methods mostly consider only one of these characteristics to extract synonyms, although the characteristics are largely complementary (Mirkin, Dagan, & Geffet, 2006).
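The trade-off between the two families of methods can be illustrated with a minimal sketch. It is not the authors' implementation: the cue phrases in `CUE_PATTERN`, the toy English corpus, and the function names are all assumptions made for illustration, and the article itself targets Chinese text. The pattern matcher fires only when a cue phrase is present (precise, but it misses pairs that are never co-mentioned), while the context comparison scores any pair of words, including non-synonyms that share contexts.

```python
import math
import re
from collections import Counter

# Hypothetical cue phrases; a real system would use many more patterns.
CUE_PATTERN = re.compile(
    r"^(?P<a>.+?) is (?:commonly |also )?"
    r"(?:referred to as|known as|called) (?P<b>.+?)\.?$"
)


def pattern_synonyms(sentence):
    """Return an (entity, synonym) pair if the sentence matches a cue pattern."""
    match = CUE_PATTERN.match(sentence.strip())
    return (match.group("a"), match.group("b")) if match else None


def context_vector(target, sentences, window=2):
    """Count the words occurring within `window` tokens of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token == target.lower():
                lo, hi = max(0, i - window), i + window + 1
                counts.update(
                    t for j, t in enumerate(tokens[lo:hi], lo) if j != i
                )
    return counts


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


# Toy corpus (an assumption, for illustration only).
corpus = [
    "the president of america signed the bill",
    "the president of usa signed the treaty",
    "the prime minister of canada signed the treaty",
]

america = context_vector("america", corpus)
usa = context_vector("usa", corpus)
canada = context_vector("canada", corpus)

print(pattern_synonyms("America is commonly referred to as The United States."))
print(pattern_synonyms("America is adjacent to Canada"))
print(cosine(america, usa), cosine(usa, canada))
```

In this toy setting the pattern matcher accepts the first sentence and rejects the second, whereas the context score for “usa” and “canada” is still well above zero, which is the kind of noise the context-based methods are prone to.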

This article proposes a novel extraction framework that combines the three characteristics. Since encyclopedias on the web cover topics in nearly every imaginable domain, entity synonyms can be extracted from the web pages of Chinese encyclopedias, such as BaiduBaike and HuDongBaike. For the first characteristic, encyclopedia web pages contain parallel structures, called infoboxes, which present tabular summaries of entries’ attributes. Synonyms of entries may appear under specific attribute names in these structures. For the pattern characteristic, since cue words are key elements of patterns, cue words are generated automatically using information entropy and word vectors and are then used to construct patterns. Moreover, because not all cue words can be generated this way, synonyms are expanded through a bootstrapping process over the patterns. For the third characteristic, this paper uses synonymous relations to retrofit word vectors and applies relative cosine similarity to obtain high-quality candidates.
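As a rough sketch of the last step, relative cosine similarity normalises the cosine between a word and a candidate by the summed cosine over the word's n nearest neighbours, i.e. rcs_n(w, c) = cos(w, c) / Σ cos(w, w') for w' in the top-n neighbours of w. The code below assumes that standard definition; the toy vectors, the function names, and the choice n = 2 are illustrative assumptions, not values from the article, and retrofitting is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def relative_cosine_similarity(word, candidate, vectors, n=10):
    """cos(word, candidate) divided by the summed cosine of word's top-n neighbours."""
    sims = sorted(
        (cosine(vectors[word], vec) for other, vec in vectors.items() if other != word),
        reverse=True,
    )
    denom = sum(sims[:n])
    return cosine(vectors[word], vectors[candidate]) / denom if denom else 0.0


# Hypothetical 3-dimensional word vectors, for illustration only.
vectors = {
    "america": [1.0, 0.1, 0.0],
    "usa": [0.9, 0.2, 0.0],
    "canada": [0.5, 0.8, 0.1],
    "banana": [0.0, 0.1, 1.0],
}

rcs_usa = relative_cosine_similarity("america", "usa", vectors, n=2)
rcs_canada = relative_cosine_similarity("america", "canada", vectors, n=2)
print(rcs_usa, rcs_canada)
```

Because the scores of a word's top-n neighbours sum to 1, a candidate scoring above 1/n is closer than an average top-n neighbour, which gives a natural cut-off for keeping only the strongest candidates.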
