Introduction
Over the years, many retrieval models have been proposed and studied, including vector space models (Salton et al., 1975; Salton & Buckley, 1988), classic probabilistic models (Robertson et al., 1995; Turtle & Croft, 1991; Fuhr, 1992), and statistical language models (Ponte & Croft, 1998; Lavrenko & Croft, 2001; Zhai & Lafferty, 2001a). All of them address the problem of finding, in a large data source, the relevant documents that satisfy users’ information needs (Van Rijsbergen, 1979). Nevertheless, developing Information Retrieval Systems (IRSs) that are robust, effective, and efficient remains a great challenge.
The ineffectiveness of IRSs is predominantly caused by the ambiguity, incompleteness, and imprecision of the keywords that users choose to express their genuine information need. One well-known technique for overcoming this shortcoming is to expand the original user query with extra terms that better characterize the actual user intent. In this regard, various approaches dealing with the proximity and interdependence of words have been implemented and tested. They assess the strength of the relationship between a candidate word and the user query in order to find the most important terms to use as expansion features (Carpineto & Romano, 2012).
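As a rough illustration of the query expansion idea described above, the following Python sketch ranks candidate expansion terms by how often and how closely they co-occur with the query terms. It is not the correlation measure proposed in this work: the function name `expansion_candidates`, the whitespace tokenization, the `window` size, and the 1/distance weighting are all assumptions made for the example.

```python
from collections import defaultdict


def expansion_candidates(query_terms, documents, window=5, top_k=3):
    """Rank candidate expansion terms by windowed co-occurrence with the query.

    Hypothetical scoring (an assumption, not the paper's measure): each time a
    candidate appears within `window` tokens of a query term, it earns
    1 / distance, so closer pairs contribute more than distant ones.
    """
    query = {term.lower() for term in query_terms}
    scores = defaultdict(float)

    for doc in documents:
        tokens = doc.lower().split()  # naive whitespace tokenization
        for i, token in enumerate(tokens):
            if token not in query:
                continue
            # Look at the tokens surrounding this occurrence of a query term.
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                candidate = tokens[j]
                if j != i and candidate not in query:
                    scores[candidate] += 1.0 / abs(i - j)

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:top_k]]
```

On a toy collection in which ‘ijirr’ appears next to ‘journal’ in two documents and two tokens away from ‘retrieval’ in one of them, this sketch ranks ‘journal’ above ‘retrieval’, since the former co-occurs both more often and more closely.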
The main goal of this work is therefore to propose a robust correlation measure that evaluates the relatedness of words based on the co-occurrence and closeness of terms. This principle gives importance to words that frequently occur in the same context during the search process. For example, the term ‘IJIRR’ is often found on the same sites where the words ‘Journal’, ‘IGI Global’, and ‘Retrieval’ occur. This choice was not a coincidence; it follows from recent studies on the growth of the World Wide Web, all of which have demonstrated an exponential growth of the Web and a rapid increase in the number of new pages created. Ranganathan (2011) estimated that the volume of online data indexed by Google increased from 5 exabytes in 2002 to 280 exabytes in 2009. According to Zhu et al. (2009), this volume is expected to double every 18 months. Ntoulas et al. (2004) interpreted these statistics in terms of the number of new pages created and indicated that this number is increasing by 8% a week. Bharat and Broder (1998) went further and estimated that the World Wide Web is growing at a rate of 7.5 pages every second. This rapid expansion of the Web raises two points:
- The first point is the entry of new words into the Web, which Williams and Zobel (2005) estimate at about one new word in every two hundred words. Studies (Williams & Zobel, 2005; Eisenstein et al., 2012; Sun, 2010) have shown that this influx is mainly due to neologisms, acronyms, abbreviations, emoticons, URLs, and typographical errors.
- The second point is that users employ these new words when searching. Chen et al. (2007) indicated that more than 17% of query terms are non-dictionary words; of these, 45% are E-speak (e.g., ‘lol’), 18% are company and product names (e.g., ‘Google’), 16% are proper names, and 15% are misspellings and foreign words (Subramaniam et al., 2009; Ahmad & Kondrak, 2005).