Introduction
In the era of big data (Wu et al., 2015), automatically or semi-automatically acquiring knowledge from big data and building knowledge-based systems that provide Internet-oriented intelligent knowledge services have become common demands of knowledge-driven applications (Jiang et al., 2018; Lu et al., 2019; Polleres et al., 2010; Wu et al., 2014). Knowledge graphs (KGs) have increasingly important application value in this context; their goal is to transform the content of the World Wide Web into knowledge that machines can understand and compute for intelligent applications (Wu & Wu, 2019).
A KG describes the concepts, entities, and relationships of the objective world in a structured form, typically as triples (head entity, relation, tail entity) (Nickel et al., 2015). For example, the triple (Hamlet, Author, William Shakespeare) states that William Shakespeare is the author of Hamlet. Early KGs were usually constructed manually or through crowdsourcing, which is time-consuming and inefficient. Thus, automatic KG construction from various data sources, such as unstructured text, has attracted increasing attention. Because language understanding over unstructured text is difficult, current information extraction (IE) systems inevitably generate noise (Nickel et al., 2015; Paulheim & Bizer, 2013, 2014). Detecting such noise in data obtained from heterogeneous information sources remains a challenge (Batini et al., 2015).
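To make the triple representation concrete, the minimal sketch below (illustrative only; the second triple and the lookup helper are our own additions) shows how a small KG can be stored and queried as a set of (head, relation, tail) triples:

```python
from typing import NamedTuple, Set

class Triple(NamedTuple):
    """A knowledge-graph fact: (head entity, relation, tail entity)."""
    head: str
    relation: str
    tail: str

# A tiny illustrative KG; the first triple follows the example in the text,
# the second is hypothetical.
kg: Set[Triple] = {
    Triple("Hamlet", "Author", "William Shakespeare"),
    Triple("William Shakespeare", "BornIn", "Stratford-upon-Avon"),
}

# Simple lookup: who is the author of Hamlet?
authors = [t.tail for t in kg if t.head == "Hamlet" and t.relation == "Author"]
print(authors)  # ['William Shakespeare']
```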
Currently, only a few studies focus on KG denoising (Wang et al., 2017; Xie et al., 2018). Most studies on KG applications assume that existing KGs are completely correct, ignoring that the automatic mechanisms involved in KG construction inevitably introduce considerable noise. Moreover, most existing studies address only "factual errors," for example, an incorrect capital for a country (Paulheim & Bizer, 2013). However, other types of errors (called low-quality errors here) often arise during automatic construction. For example, the triple (Chang'an, IsCapitalOf, LiBai), extracted from Li Bai's Baidu encyclopedia page, is low-quality noise (Hong et al., 2020): Li Bai lived in the city of Chang'an, but Chang'an is not the "capital" of Li Bai, since a person cannot have a capital. This type of noise is common during KG construction (Paulheim & Bizer, 2013).
Existing studies on noisy KGs mostly focus on KG refinement (Paulheim, 2017). That is, they assume the presence of a large-scale knowledge base (such as Wikipedia) in which the type information of most entities is known (Paulheim & Bizer, 2013). For example, Paulheim and Bizer (2013) used the statistical distribution of types at the head and tail positions of each relation to predict instance types. This method requires type information for most entities. However, many knowledge bases constructed from unstructured text do not contain type information, so existing methods cannot be directly applied to this problem. Even if entity classification is handled separately, it still requires considerable labeled data or a large knowledge corpus, and the resulting classification accuracy may be insufficient. To the best of our knowledge, few studies focus on denoising during the KG construction stage when no large knowledge base is available.
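As a rough illustration of this type-distribution idea (a simplified sketch under our own assumptions, not the authors' actual implementation; the triples and type labels below are hypothetical), an untyped entity's type can be estimated by soft votes from the head-type distributions of the relations it participates in:

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# Hypothetical inputs: extracted triples and partial entity-type labels.
triples: List[Tuple[str, str, str]] = [
    ("Berlin", "capitalOf", "Germany"),
    ("Paris", "capitalOf", "France"),
    ("Madrid", "capitalOf", "Spain"),  # Madrid has no known type below
]
entity_types: Dict[str, str] = {
    "Berlin": "City", "Paris": "City",
    "Germany": "Country", "France": "Country", "Spain": "Country",
}

# Step 1: per relation, estimate the type distribution of head entities.
head_type_counts: Dict[str, Counter] = defaultdict(Counter)
for h, r, t in triples:
    if h in entity_types:
        head_type_counts[r][entity_types[h]] += 1

def predict_head_type(entity: str) -> Counter:
    """Vote on an untyped entity's type from the relations it appears in as head."""
    votes: Counter = Counter()
    for h, r, _ in triples:
        if h == entity:
            total = sum(head_type_counts[r].values())
            if total == 0:
                continue
            for typ, cnt in head_type_counts[r].items():
                votes[typ] += cnt / total  # relative frequency as a soft vote
    return votes

print(predict_head_type("Madrid"))  # Counter({'City': 1.0})
```

Such an approach only works when most entities already carry type labels, which is precisely the assumption that does not hold for KGs built from unstructured text.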
In this study, we propose a framework based on human intelligence, artificial intelligence, and organizational intelligence (HAO intelligence) (Wu & Wu, 2019) for low-quality error detection. The framework targets error detection in scenarios where no large knowledge base or corpus is available during KG construction. Its advantage is that very little information needs to be labeled manually. Experiments on two large-scale datasets demonstrate its effectiveness.