Computing Semantic Relatedness from Human Navigational Paths: A Case Study on Wikipedia

Philipp Singer, Thomas Niebler, Markus Strohmaier, Andreas Hotho
Copyright: © 2013 | Pages: 30
DOI: 10.4018/ijswis.2013100103

In this article, the authors present a novel approach for computing semantic relatedness and conduct a large-scale study of it on Wikipedia. Unlike existing semantic analysis methods that utilize Wikipedia’s content or link structure, the authors propose to use human navigational paths on Wikipedia for this task. The authors obtain 1.8 million human navigational paths from a semi-controlled navigation experiment: a Wikipedia-based navigation game in which users are required to find short paths between two articles in a given Wikipedia article network. The authors’ results are intriguing. They suggest that (i) semantic relatedness computed from human navigational paths may be more precise than semantic relatedness computed from Wikipedia’s plain link structure alone and (ii) not all navigational paths are equally useful: intelligent selection based on path characteristics can improve accuracy. The authors’ work makes an argument for expanding the existing arsenal of data sources for calculating semantic relatedness and for considering the utility of human navigational paths for this task.

1. Introduction

Computing semantic relatedness between concepts is a fundamental challenge on the way to a semantically enabled web. Common-sense knowledge in the form of semantic relatedness is of particular interest for, e.g., improving information retrieval or language processing. To obtain a judgement of the semantic relatedness of two terms or concepts, the idea is to rely on accumulated or common knowledge. Rubenstein & Goodenough (1965) pointed out that there is a positive relationship between the degree of semantic relatedness of a pair of terms and the degree to which their contexts are similar. Hence, the idea is that a semantic relatedness score captures this common-sense knowledge over a set of contexts, abstracting and generalizing it.

Psychological experiments (Tversky, 1977; Medin, Goldstone & Gentner, 1993) have shown that semantic relatedness is both context dependent and asymmetric. Context dependency means that the determined relatedness is influenced by the context in which the words appear. Asymmetry means that people may provide different ratings depending on the order in which the words are presented. Nevertheless, Aguilar & Medin (1999) showed that this asymmetry occurs only on special occasions, and Medin et al. (1993) also showed that the difference in ratings for a given word pair is less than five percent. Hence, we focus on symmetric semantic relatedness in this work, as we believe that it is sufficient for the investigations we want to conduct and that these small differences can be ignored.

Recent approaches to identifying semantic associations between concepts exploit the rich fabric of emerging information networks such as Wikipedia. Existing semantic analysis methods such as those by Gabrilovich & Markovitch (2007), Ponzetto & Strube (2007) or Yeh, Ramage, Manning, Agirre & Soroa (2009) have shown great potential by using textual or structural (link) information on Wikipedia. While these methods have produced promising results, they only capture semantics from a limited set of people (e.g., Wikipedia editors) and they mostly neglect pragmatics (i.e., how Wikipedia is used). At the same time, millions of web users navigate Wikipedia daily to find information, to educate themselves or for research purposes. When navigating a set of articles on Wikipedia, users typically need to tap into their intuitions about real-world concepts and the perceived relationships between them in order to progress towards their targeted articles. Humans tend to find intuitive paths rather than necessarily short ones, whereas an automatic algorithm would try to find a shortest path between two concepts, which may not be as semantically rich and intuitive as a navigational path produced by a human.

A great advantage of such human navigational paths is that they can be captured in a very simple way. The only prerequisite is a group of users who navigate a system. Furthermore, many existing methods only work well if the system at hand provides high-quality content that can be leveraged for calculating semantic relatedness. In contrast, our approach is independent of the content of a resource. It also offers the opportunity to calculate semantic relatedness between different kinds of resources. For example, suppose we want to calculate semantic relatedness between images and textual pages of a website. This would be a very difficult task for content-based approaches, as the two kinds of resources exhibit different features. The method proposed in this work, by contrast, would work on any type of resource as long as it is navigated by users.
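To make the content-independence argument concrete, the following is a minimal sketch of how relatedness could be derived from navigation alone. It is not the authors' method, which this preview does not specify. It assumes that each path is a list of opaque resource identifiers, that two resources count as co-occurring when they lie within a fixed number of steps of each other on a path, and that relatedness is the cosine similarity of the resulting co-occurrence vectors. The function names, the `window` parameter and the toy paths are all hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt


def cooccurrence_vectors(paths, window=1):
    """Count, for every resource, which resources appear within `window`
    steps of it on any path. The window size is an assumption of this sketch."""
    vectors = defaultdict(Counter)
    for path in paths:
        for i, resource in enumerate(path):
            start = max(0, i - window)
            stop = min(len(path), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    vectors[resource][path[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse count vectors (Counter objects)."""
    dot = sum(count * v[key] for key, count in u.items())
    norm_u = sqrt(sum(count * count for count in u.values()))
    norm_v = sqrt(sum(count * count for count in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def relatedness(vectors, a, b):
    """Symmetric relatedness score between resources a and b."""
    return cosine(vectors.get(a, Counter()), vectors.get(b, Counter()))


# Hypothetical toy paths; identifiers could equally be image or page IDs.
toy_paths = [
    ["Cat", "Mammal", "Dog"],
    ["Cat", "Pet", "Dog"],
    ["Paris", "France", "Europe"],
]
toy_vectors = cooccurrence_vectors(toy_paths, window=1)
print(relatedness(toy_vectors, "Cat", "Dog"))
print(relatedness(toy_vectors, "Cat", "Paris"))
```

Because the sketch only ever compares identifiers and their positions on paths, it never inspects what a resource contains, which is the property the paragraph above relies on. The score is symmetric by construction, in line with the focus on symmetric relatedness stated earlier.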

While such data about navigational paths could potentially represent a profoundly rich resource for calculating semantic relatedness between concepts, it has not yet received much attention from the research community.
