A Novel Hybrid Correlation Measure for Query Expansion-Based Information Retrieval

Ilyes Khennak (University of Science and Technology Houari Boumediene, Algeria) and Habiba Drias (University of Science and Technology Houari Boumediene, Algeria)
Copyright: © 2020 | Pages: 19
DOI: 10.4018/978-1-7998-1021-6.ch001

Abstract

Query expansion (QE) is one of the most effective techniques for improving retrieval performance and retrieving more relevant information. It builds more useful queries by enriching the original queries with additional expansion terms that best characterize the users' information needs. In this chapter, the authors propose a new correlation measure for query expansion that evaluates the degree of similarity between the expansion term candidates and the original query terms. The proposed measure is a hybrid of two correlation measures: an external correlation based on term co-occurrence, and an internal correlation based on term proximity. Extensive experiments were performed on MEDLINE, a real dataset from a large online medical database. The results show the effectiveness of the proposed approach compared to prior state-of-the-art approaches.

Introduction

The volume of textual content available on the Web is growing exponentially, and the number of new websites is increasing rapidly. For instance, the total number of websites grew from 900 million in 2014 to 1.6 billion in 2018. Moreover, the volume of user-created content posted on online platforms is expanding considerably, especially on social media websites. Every day, 4 million blog posts are published on the Internet and over 500 million tweets are submitted by users. The number of Google search requests has also increased significantly: in 2012, Google handled more than 2 billion queries per day, and this number exceeded 4 billion in 2018. In addition, Internet traffic has grown dramatically. According to the latest Cisco report, Internet data traffic reached 1.5 ZB in 2017 and is expected to exceed 4 ZB by 2022. This explosive growth of the World Wide Web has led to the following findings:

  • New terms are constantly created on the Internet. According to Williams and Zobel (2005), there is one new term in every two hundred words. Prior work (Eisenstein et al., 2012; Sun, 2010) has demonstrated that this is primarily due to neologisms, acronyms, abbreviations, emoticons, URLs and typographical errors.

  • Internet users are increasingly using these new terms in their search queries. In their study, Chen et al. (2007) stated that more than 17% of query terms are out of dictionary: 45% of them are E-speak (e.g., 'lol'), 18% are companies and products, 16% are proper names, and 15% are misspellings and foreign words (Subramaniam et al., 2009; Ahmad & Kondrak, 2005).

The new terms that users employ to express their needs are often ambiguous and imprecise. Hence, they degrade the quality of search queries and fail to characterize the information needs satisfactorily. As a result, retrieving relevant information has become a serious and challenging issue. Many retrieval approaches and techniques have been proposed and studied to overcome this shortcoming and return more relevant information. One well-known technique is query expansion (QE), which augments the user's original query with expansion terms that best describe the actual user intent. QE is widely used in many applications, including multimedia information retrieval (Wie et al., 2014), question answering (Park & Croft, 2015) and information filtering (Leturia et al., 2013), and has been applied to various areas such as sport (Al Kabary & Schuldt, 2014), health (Khennak & Drias, 2017), e-commerce (Lee & Chau, 2011) and mobile search (Gao et al., 2013).

The process of generating the most relevant and related terms to be used as expansion features is the key step in query expansion. Numerous concepts such as proximity, co-occurrence, association, closeness, relatedness and relationship have been introduced and discussed in order to express the strength of correlation between an expansion term candidate and the query keywords (Carpineto & Romano, 2012).
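As a rough illustration of two of these notions, the following Python sketch scores an expansion term candidate against the query terms by co-occurrence and by proximity, then combines the two scores. It is not the measure proposed in this chapter: the function names, the Jaccard-based co-occurrence, the inverse-distance proximity, and the mixing weight `alpha` are all assumptions chosen only to make the idea concrete.

```python
def cooccurrence_score(candidate, query_terms, docs):
    """Mean Jaccard overlap between the set of documents containing the
    candidate and the set containing each query term (assumed definition)."""
    def doc_ids(term):
        return {i for i, tokens in enumerate(docs) if term in tokens}

    cand_docs = doc_ids(candidate)
    overlaps = []
    for q in query_terms:
        q_docs = doc_ids(q)
        union = cand_docs | q_docs
        overlaps.append(len(cand_docs & q_docs) / len(union) if union else 0.0)
    return sum(overlaps) / len(overlaps) if overlaps else 0.0


def proximity_score(candidate, query_terms, docs):
    """Mean of 1 / (smallest token distance) over every (query term, document)
    pair in which both terms appear (assumed definition)."""
    values = []
    for q in query_terms:
        if q == candidate:
            continue  # a term is not an expansion candidate for itself
        for tokens in docs:
            cand_pos = [i for i, t in enumerate(tokens) if t == candidate]
            q_pos = [i for i, t in enumerate(tokens) if t == q]
            if cand_pos and q_pos:
                gap = min(abs(a - b) for a in cand_pos for b in q_pos)
                values.append(1.0 / gap)
    return sum(values) / len(values) if values else 0.0


def hybrid_score(candidate, query_terms, docs, alpha=0.5):
    """Weighted sum of the two scores; alpha is a hypothetical mixing weight."""
    return (alpha * cooccurrence_score(candidate, query_terms, docs)
            + (1 - alpha) * proximity_score(candidate, query_terms, docs))


# Toy corpus: each document is a list of lowercase tokens.
docs = [
    ["ijirr", "journal", "retrieval"],
    ["journal", "of", "information", "retrieval", "ijirr"],
    ["football", "journal"],
]
query = ["journal", "retrieval"]

# Rank two candidate expansion terms by their hybrid score.
ranked = sorted(["football", "ijirr"],
                key=lambda term: hybrid_score(term, query, docs),
                reverse=True)
print(ranked)
```

In this toy corpus, a candidate that shares many documents with the query terms and also appears close to them is ranked above a candidate that is close to only one query term in a single document.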

To generate the most relevant expansion terms, we propose in this work a new correlation measure that evaluates the relatedness between the expansion term candidates and the original query terms. The proposed measure is a hybrid of two correlation measures: an external correlation based on term co-occurrence, and an internal correlation based on term proximity. The hybrid correlation measure gives importance to terms that frequently occur in the same context during the search process. For example, the term 'IJIRR' is often found on the same sites where the words 'Journal', 'IGI Global', and 'Retrieval' occur. The main contributions of our work are the following:
