Identification and Classification of Health Queries: Co-Occurrences vs. Domain-Specific Terminologies

Carla Teixeira Lopes (Department of Informatics and Engineering, University of Porto, Porto, Portugal) and Cristina Ribeiro (Department of Informatics and Engineering, University of Porto, Porto, Portugal and INESC Technology and Science, Institute for Systems and Computer Engineering of Porto, Porto, Portugal)
DOI: 10.4018/ijhisi.2014070104


Identifying the user's intent behind a query is a key challenge in Information Retrieval. This information may be used to contextualize the search and provide better search results to the user. The automatic identification of queries targeting a search for health information allows the implementation of retrieval strategies specifically focused on the health domain. In this paper, two kinds of automatic methods to identify and classify health queries based on domain-specific terminology are proposed. Besides evaluating these methods, we compare them with a method that is based on co-occurrence statistics of query terms with the word “health”. Although the best overall result was achieved with a variant of the co-occurrence method, the method based on domain-specific frequencies that generates a continuous output outperformed most of the other methods. Moreover, this method also allows the association of queries to the semantic tree of the Unified Medical Language System and thereafter their classification into appropriate subcategories.
Article Preview

Manually classifying web queries is straightforward. Usually several assessors are involved and, to reduce subjectivity, more than one person is typically asked to classify the same query. When the assessors do not initially agree, either another assessor is added to ease the classification or a discussion among the adjudicators is held to reach a consensus. In a study of the queries that users submit to search engines, Spink, Wolfram, Jansen, and Saracevic (2001) manually classified a sample of 2,414 queries submitted to the Excite search engine into 11 categories. Focusing on health queries submitted to search engines, Spink et al. (2004) also manually classified queries to select those related to health. Although popular, manual classification is slow and tedious, and it requires the availability of one or more human classifiers. In some cases, the sheer volume of queries may even make the task impracticable. For these reasons, automatic methods have been proposed.

In Information Retrieval (IR), several approaches for detecting topics in documents and document collections have emerged. Some are based on mathematical models, such as Latent Semantic Analysis, which uses the co-occurrence of terms in the collection to represent documents in a reduced semantic space (Landauer, Foltz, & Laham, 1998). However, because web queries are typically short, these methods are not the most appropriate for them.
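
To make the limitation concrete, here is a minimal sketch of Latent Semantic Analysis on a toy collection, assuming NumPy is available. The four documents, the choice of k = 2 latent dimensions, and the helper names (`term_vector`, `fold_in`, `cosine`) are illustrative assumptions, not taken from the article or from Landauer et al.

```python
import numpy as np

# Toy collection (hypothetical): two health documents, two travel documents.
docs = [
    "flu symptoms fever cough",
    "fever cough treatment flu",
    "cheap flights hotel porto",
    "hotel porto flights booking",
]

vocab = sorted({term for doc in docs for term in doc.split()})
index = {term: i for i, term in enumerate(vocab)}


def term_vector(text):
    """Bag-of-words count vector; terms outside the vocabulary are ignored."""
    vec = np.zeros(len(vocab))
    for term in text.split():
        if term in index:
            vec[index[term]] += 1
    return vec


# Term-document matrix: one row per term, one column per document.
A = np.column_stack([term_vector(doc) for doc in docs])

# Truncated SVD: keep only the k strongest latent dimensions.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k = U[:, :k]

# Coordinates of each document in the latent space (one row per document).
doc_coords = (np.diag(s[:k]) @ Vt[:k, :]).T


def fold_in(text):
    """Project a new text into the same latent space as the documents."""
    return term_vector(text) @ U_k


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


for query in ["flu", "aspirin"]:
    sims = [round(cosine(fold_in(query), d), 3) for d in doc_coords]
    print(query, sims)
```

In this toy setting, a one-word query such as `flu` is placed next to the health documents because its single term co-occurs with theirs. A query whose only term never appears in the collection (`aspirin` here) folds in as the zero vector and cannot be related to any topic. With so few terms per query, there is little co-occurrence evidence to work with, which is the difficulty the paragraph above points to.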
