Low-Quality Error Detection for Noisy Knowledge Graphs

Chenyang Bu, Xingchen Yu, Yan Hong, Tingting Jiang
Copyright: © 2021 |Pages: 17
DOI: 10.4018/JDM.2021100104

Abstract

The automatic construction of knowledge graphs (KGs) from multiple data sources has received increasing attention. The automatic construction process inevitably brings considerable noise, especially when KGs are built from unstructured text. The noise in a KG can be divided into two categories: factual noise and low-quality noise. Factual noise refers to plausible triples that meet the requirements of ontology constraints. For example, the plausible triple (New_York, IsCapitalOf, America) satisfies the constraints that the head entity "New_York" is a city and the tail entity "America" is a country, even though the fact itself is wrong. Low-quality noise denotes the obvious errors commonly created in information extraction processes. This study focuses on entity type errors. Most existing approaches concentrate on refining an existing KG, assuming that the type information of most entities or the ontology information in the KG is known in advance. However, such methods may not be suitable at the start of a KG's construction. Therefore, the authors propose an effective framework to eliminate entity type errors. The experimental results demonstrate the effectiveness of the proposed method.
Article Preview

Introduction

In the era of big data (Wu et al., 2015), automatic or semi-automatic acquisition of knowledge from big data and establishing a knowledge-based system to provide Internet-oriented intelligent knowledge services have become the common demands of knowledge-driven applications (Jiang et al., 2018; Lu et al., 2019; Polleres et al., 2010; Wu et al., 2014). Knowledge graphs (KGs) have an increasingly important application value in this context, and the goal is to transform the content of the World Wide Web into knowledge that can be understood and calculated by machines for intelligent applications (Wu & Wu, 2019).

A KG describes the concepts, entities, and their relationships in the objective world in a structured form, typically as a triple (head entity, relation, tail entity) (Nickel et al., 2015). For example, the triple (Hamlet, Author, William Shakespeare) implies that William Shakespeare is the author of Hamlet. Early KGs were usually constructed manually or through crowdsourcing, which is time-consuming and inefficient. Thus, automatic KG construction from various data sources, such as unstructured text data, has attracted increasing attention. Owing to the difficulty of language understanding in unstructured text, current information extraction (IE) systems inevitably generate noise (Nickel et al., 2015; Paulheim & Bizer, 2013, 2014). Detecting such noise in data obtained from various information sources remains a challenge (Batini et al., 2015).
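The triple representation above can be sketched in a few lines of code. The following is a minimal, illustrative example (the entities and the `tails` helper are invented for this sketch, not part of the paper's framework):

```python
# A minimal sketch of a knowledge graph stored as a set of
# (head entity, relation, tail entity) triples. Data is illustrative.
kg = {
    ("Hamlet", "Author", "William Shakespeare"),
    ("New_York", "IsCityOf", "America"),
}

def tails(kg, head, relation):
    """Return every tail entity linked to `head` by `relation`."""
    return {t for h, r, t in kg if h == head and r == relation}

print(tails(kg, "Hamlet", "Author"))  # {'William Shakespeare'}
```

Real KGs use the same (head, relation, tail) shape, only at the scale of millions of triples and with persistent storage and indexing.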

Currently, only a few studies focus on KG denoising (Wang et al., 2017; Xie et al., 2018). Most studies on the application of KGs assume that existing KGs are completely correct, ignoring the fact that the automatic mechanisms involved in KG construction inevitably bring considerable noise. Moreover, most existing studies focus only on "factual errors," for example, an incorrect capital for a country (Paulheim & Bizer, 2013). However, a different kind of error (called a low-quality error here) often occurs during automatic construction. For example, the triple (Chang'an, IsCapitalOf, LiBai), extracted from Li Bai's Baidu encyclopedia homepage, is low-quality noise (Hong et al., 2020): the correct relation is that Li Bai lived in the city of Chang'an, not that Chang'an is the capital of Li Bai. This type of noise is common during the construction of a KG (Paulheim & Bizer, 2013).
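The distinction between the two noise categories can be made concrete with a simple ontology check. The sketch below (with an invented `relation_schema` and type table, purely for illustration) flags the low-quality triple because its tail entity has the wrong type, while the factual-noise triple (New_York, IsCapitalOf, America) passes, which is exactly why factual noise is harder to catch:

```python
# Hypothetical domain/range check for a relation's head and tail types.
# Entity types and the schema table are made up for illustration.
entity_type = {"Chang'an": "City", "LiBai": "Person",
               "New_York": "City", "America": "Country"}
relation_schema = {"IsCapitalOf": ("City", "Country")}  # (head type, tail type)

def violates_schema(triple):
    """Return True if the triple breaks its relation's type constraints."""
    h, r, t = triple
    dom, rng = relation_schema[r]
    return entity_type.get(h) != dom or entity_type.get(t) != rng

print(violates_schema(("Chang'an", "IsCapitalOf", "LiBai")))    # True: Person tail
print(violates_schema(("New_York", "IsCapitalOf", "America")))  # False: types fit
```

Note that such a check presupposes known entity types and relation schemas, which is precisely the information that is missing at the start of KG construction.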

The existing studies on noisy KGs mostly focus on KG refinement (Paulheim, 2017). That is, they assume the presence of a large-scale knowledge base (such as Wikipedia), and the type information of most entities in the knowledge base is known (Paulheim & Bizer, 2013). For example, Paulheim and Bizer (Paulheim & Bizer, 2013) adopted the statistical distribution of types in the position of the head and tail entities for each relation to predict the instance type. This method requires the type information of most entities. However, many knowledge bases constructed from unstructured texts do not contain type information. Therefore, existing methods may not be directly applied to this type of problem. Even if we handle the entity classification problem separately, we still need considerable labeled data or a large knowledge corpus, and the accuracy of the classification may also be insufficient. To the best of our knowledge, not many studies focus on denoising in the KG construction stage without a large knowledge base.
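The statistical idea behind Paulheim and Bizer's refinement approach can be sketched roughly as follows: for each relation, tally the types observed at the head (or tail) position among entities whose types are known, then predict an unknown entity's type by voting across the relations it participates in. All names and data below are invented for illustration, and this is a simplification of the published method:

```python
from collections import Counter, defaultdict

# Illustrative data: two entities of known type, one ("Othello") unknown.
known_types = {"Hamlet": "Play", "Macbeth": "Play",
               "William Shakespeare": "Person"}
triples = [
    ("Hamlet", "Author", "William Shakespeare"),
    ("Macbeth", "Author", "William Shakespeare"),
    ("Othello", "Author", "William Shakespeare"),
]

# For each relation, count which known types appear in the head position.
head_type_counts = defaultdict(Counter)
for h, r, t in triples:
    if h in known_types:
        head_type_counts[r][known_types[h]] += 1

def predict_head_type(entity):
    """Vote over the head-position type distributions of the entity's relations."""
    votes = Counter()
    for h, r, t in triples:
        if h == entity:
            votes.update(head_type_counts[r])
    return votes.most_common(1)[0][0] if votes else None

print(predict_head_type("Othello"))  # 'Play'
```

The sketch makes the paper's point visible: the prediction only works because most entities already carry type labels, which is unavailable when a KG is being built from unstructured text without a large labeled knowledge base.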

In this study, we propose a framework based on human intelligence, artificial intelligence, and organizational intelligence (HAO intelligence) (Wu & Wu, 2019) for low-quality error detection. This framework is aimed at error detection in scenarios where no large knowledge base or corpus is available during knowledge graph construction. Its advantage is that very little information needs to be manually labeled. Experiments on two large-scale datasets demonstrate its effectiveness.
