Clustering of Relevant Documents Based on Findability Effort in Information Retrieval

Prabha Rajagopal, Taoufik Aghris, Fatima-Ezzahra Fettah, Sri Devi Ravana
Copyright: © 2022 | Pages: 18
DOI: 10.4018/IJIRR.315764

Abstract

A user expresses their information need in the form of a query on an information retrieval (IR) system, which retrieves a set of articles related to the query. The performance of the retrieval system is measured by the relevance of the retrieved content to the query, as judged by expert topic assessors who are trained to find relevant information. However, real users do not always succeed in finding relevant information in the retrieved list because of the time and effort required. This paper aims 1) to use findability features to determine the amount of effort needed to find information in relevant documents using a machine learning approach, and 2) to demonstrate how IR systems' performance changes when effort is included in the evaluation. The study applies a natural language processing technique and an unsupervised clustering approach to group documents by the amount of effort needed. The results show that relevant documents can be clustered using the k-means clustering approach and that retrieval system performance varies by 23%, on average.
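As an illustrative sketch only (not the authors' exact pipeline), documents described by hypothetical effort-related findability features, here assumed to be document length, the offset of the first relevant passage, and sentence count, could be grouped with k-means using scikit-learn:

# Minimal sketch: cluster relevant documents by effort-related features.
# The feature names below (document length, offset of the first relevant
# passage, sentence count) are illustrative assumptions, not the paper's
# exact findability features.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Each row: [document length in words, offset of first relevant passage, sentence count]
features = np.array([
    [1200,   50,  60],
    [3400,  900, 180],
    [ 800,   10,  35],
    [5100, 2400, 260],
])

# Scale features so no single one dominates the distance computation.
scaled = StandardScaler().fit_transform(features)

# Group documents into low/medium/high effort clusters (k = 3 is an assumption).
kmeans = KMeans(n_clusters=3, random_state=0, n_init=10).fit(scaled)
print(kmeans.labels_)  # cluster label per document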

Introduction

Information retrieval (IR) is the science of searching for information relevant to a given query within large stored document collections. The fundamental challenge of an information retrieval system (IRS) lies in matching an information requirement statement, typically expressed as a user’s query, against a collection of documents and ranking each document according to its relevance to the query.

Over the past decades, a large amount of research has gone into building ranking models that retrieve the most relevant documents. A ranking model is generally constructed with either probabilistic methods or modern machine learning methods. The algorithm is based on word frequencies, treating a document as a set of words, often called a bag of words. With these models, if a user enters a simple query such as “what is information retrieval” in a given IRS, hundreds of thousands, if not millions, of results are retrieved and ranked. However, a large amount of time is sometimes spent just to extract a small piece of information from the documents that are considered relevant. The amount of effort the user must put in determines whether the user is satisfied or dissatisfied in gaining the necessary information. It has been noted that real users tend to give up easily when searching for information in the retrieved documents (Verma et al., 2016). Therefore, relevance is no longer just a matter of ensuring that relevant information is available in a document, but also of the amount of effort needed to find it (Yilmaz et al., 2014).
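To make the word-frequency idea concrete, the following sketch ranks a small toy collection against a query with a plain bag-of-words TF-IDF model; the documents and query are made up for illustration:

# Minimal bag-of-words ranking sketch: score documents against a query by
# cosine similarity of their TF-IDF vectors. The toy documents and query
# are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "information retrieval finds relevant documents for a query",
    "machine learning models rank documents by predicted relevance",
    "cooking recipes for a quick dinner",
]
query = "what is information retrieval"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)
query_vector = vectorizer.transform([query])

# Higher cosine similarity -> higher rank in the result list.
scores = cosine_similarity(query_vector, doc_vectors).ravel()
ranking = scores.argsort()[::-1]
print([(documents[i], round(float(scores[i]), 3)) for i in ranking])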

Two methods are widely used to evaluate the effectiveness of information retrieval systems. The first is the collection-based method, often referred to as the Cranfield approach (Cleverdon, 1991). This approach relies on a document collection (corpus); a set of topics, each containing a query, title, and description that define a user’s need; and a set of relevance judgments indicating which documents in the collection are relevant to each topic, usually judged by topic experts. To evaluate the effectiveness of an IRS, scores are generated from the ranked list of documents retrieved by the system and the relevance judgments. The scores are calculated using evaluation measures such as precision, recall, and mean average precision (Clough & Sanderson, 2013). The second method is user-based evaluation. This approach is based on the interaction between the user and the IRS, which is shaped by the user’s environment, such as educational background, context, and subject expertise, and by the user’s perspective, such as the search goal (Park, 1994).
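A small sketch of the collection-based scoring step, using made-up document IDs and judgments, shows how precision, recall, and average precision are computed from a system's ranked list and the relevance judgments:

# Minimal sketch of collection-based scoring: given a ranked list returned
# by a system and the set of documents judged relevant for the topic,
# compute precision, recall, and average precision. IDs are illustrative.
ranked_list = ["d3", "d7", "d1", "d9", "d4"]   # system output, best first
relevant = {"d3", "d1", "d8"}                  # relevance judgments (qrels)

retrieved_relevant = [d for d in ranked_list if d in relevant]
precision = len(retrieved_relevant) / len(ranked_list)
recall = len(retrieved_relevant) / len(relevant)

# Average precision: sum of precision@k at each rank k where a relevant
# document is retrieved, divided by the total number of relevant documents.
hits, precision_sum = 0, 0.0
for k, doc in enumerate(ranked_list, start=1):
    if doc in relevant:
        hits += 1
        precision_sum += hits / k
average_precision = precision_sum / len(relevant)

print(precision, recall, round(average_precision, 3))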

Comparing the two evaluation methods, system-based and user-based evaluations can match each other’s results (Al-Maskari, 2008). However, previous research has shown a broad gap between the two approaches, given that the collection-based method makes many assumptions about what a real user looks for to satisfy his/her information needs, along with further assumptions that simplify the relevance evaluation (Allan et al., 2005). The mismatch between the two evaluation methods therefore stems from the disagreement between what the expert judges consider relevant documents and what real users need to satisfy their information demand. The user’s need is characterized as document utility (Turpin & Hersh, 2001). Evaluating IR relevance by document utility from a semantic and pragmatic view was argued by Saracevic (1979), building on earlier research (Saracevic, 1975), as follows: “it is fine for IR systems to provide relevant information, but the true role is to provide information that has utility-information that helps to directly resolve given problems, that directly bears on given actions, and/or that directly fits into given concerns and interests. Thus, it was argued that relevance is not a proper measure for a true evaluation of IR systems. A true measure should be utilitarian.” Following that, Yilmaz et al. (2014) stated that relevance is about how useful the documents found by the retrieval system are.
