An Automatic User Interest Mining Technique for Retrieving Quality Data

Shilpa Sethi (Department of Computer Engineering, YMCA University of Science and Technology, Faridabad, India) and Ashutosh Dixit (Department of Computer Engineering, YMCA University of Science and Technology, Faridabad, India)
Copyright: © 2017 |Pages: 18
DOI: 10.4018/IJBAN.2017040104


A search engine acts as an intermediary between the user and the web. It takes the user query as input and retrieves pages matching the query terms from its database, which is populated in advance from the World Wide Web. It then applies a ranking algorithm to sort the retrieved pages and presents the results to the user, often as a list of millions of web pages. Most of the pages in the result, however, are not useful to the user. This problem arises because the search engine retrieves results based on query keywords only and pays no attention to user interest during the ranking process. Because there is no automatic mechanism for tracking user browsing patterns, the user seldom gets relevant results in the top ten links. To cater to the needs of the individual user, an automatic user interest mining technique for retrieving quality data is proposed here. The mechanism provides satisfactory results because each user's interest is maintained separately, without any effort at the user end.
1. Introduction

The WWW is a large repository of interconnected web documents that contain text, images, multimedia and many other items of information, referred to as information resources (Sethi & Dixit, 2015). Statistics from authoritative web sites show that the indexed web contained at least 4.78 billion web pages as recorded on 27 March 2016, and many more lie in the hidden web. The collection is growing exponentially, at a rate of 25% per year. People use information retrieval tools such as search engines to get information from such a huge collection of documents.

A basic search engine has five main components: the user interface, the crawler (also known as a spider), the indexing module, the query processing module and the ranking module, as shown in Figure 1 in the Appendix (Mudgil, Sharma, & Gupta, 2013).

When the user submits an information need at the user interface in the form of a set of keywords, referred to as a query, the search engine takes a few seconds to retrieve the web pages and present the result list to the user. This short retrieval time is possible because the engine retrieves documents from its own database, which the crawling and indexing modules have built and maintained locally well before the actual request arrives. The crawler is the program that traverses the web at specified intervals and downloads web documents from different web servers (Sethi & Dixit, 2015). These documents are then parsed to extract text and hyperlinks, which are stored separately in different files. The extracted hyperlinks are used by the crawler to download further web pages, and the text is stored in a repository. The indexing module takes the text from the repository and constructs the inverted index of terms belonging to each document (Hao, Guolian, & Lizhu, 2013; Bilimoria & Patel, 2015; Kalra, 2012). The index is essentially a list of terms in which each term is linked to multiple postings. The number of postings for a term equals the number of documents containing that term. Each document posting stores the document ID, the number of incoming links, the number of outgoing links from the document, the depth, and the frequency of the term in the document. This list is in turn attached to a third list that records the exact position of every occurrence of the term in the document. The query processor executes the user query on this inverted index and retrieves the matching documents.
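The index structure described above can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not the indexing module of any particular search engine: each posting keeps only the document ID and the term positions (the term frequency is the length of the position list), while link counts and depth are omitted. The names `tokenize`, `build_inverted_index`, `search` and the sample documents are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to its postings: {doc_id: [positions of the term]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(position)
    return index


def search(index, query):
    """Return the IDs of documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    matched = set(index.get(terms[0], {}))
    for term in terms[1:]:
        matched &= set(index.get(term, {}))
    return matched


# Hypothetical sample collection, for illustration only.
docs = {
    1: "Java is a programming language",
    2: "Java coffee beans are grown in Indonesia",
    3: "Python is a programming language",
}
index = build_inverted_index(docs)
```

With this sample collection, `search(index, "java")` matches documents 1 and 2 regardless of whether the user means the language or the coffee, which is the pure keyword-matching behaviour the following paragraph discusses.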

This set of documents is then sorted by the ranking module using content and link structure mining mechanisms. Finally, the sorted list is presented to the user in response to the query. In short, information retrieval is based purely on keyword matching. But the users of these search engines have varying skill levels in retrieving information, ranging from novice user to computer specialist. As a result, the keywords entered by a user are sometimes not enough to reflect the information need clearly, or are too ambiguous to point to a distinct need. Moreover, different users use the same word to get different information. For example, for the query JAVA, some users may be interested in documents related to the programming language Java, whereas others may be looking for Java coffee beans. Traditional search engines nevertheless provide the same ranked list to all users, regardless of whether they are interested in the programming language or the coffee. Hence, it becomes difficult for a novice user to get relevant information.

To predict such information needs precisely, web usage mining can be considered as a solution. It can be defined as the collection of techniques that analyze user access patterns with the aim of inferring the user's search needs. Many algorithms have been proposed in the past, based on explicit user feedback forms, collaborative filtering (Ekstrand, Riedl & Konstan, 2010), click history (Leung, Ng & Lee, 2008), session usage (Duhan & Sharma, 2010), etc. To mine the user interest, all of the above-mentioned approaches require the involvement of the user to some extent. This paper proposes a novel, hassle-free user interest learning mechanism that dynamically evaluates the user interest factor in different domains; this factor can then be used in the ranking process to sort the results according to user expectations.
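How a per-domain interest factor might feed into ranking can be sketched as follows. This is only an illustrative assumption, not the scoring formula of the proposed system, which this preview does not give: the function `rerank`, the multiplicative boost `base * (1 + interest)`, the domain labels and all scores are hypothetical.

```python
def rerank(results, interest):
    """Sort results by keyword score boosted by the user's domain interest.

    results:  list of (doc_id, domain, base_score) tuples
    interest: {domain: interest factor in [0, 1]} for one user
    """
    def personalized_score(item):
        _doc_id, domain, base_score = item
        # Assumed boost: domains missing from the profile get no boost.
        return base_score * (1.0 + interest.get(domain, 0.0))

    return sorted(results, key=personalized_score, reverse=True)


# Hypothetical keyword-based results for the query "java".
results = [
    ("java-coffee-beans", "food", 0.9),
    ("java-language-tutorial", "programming", 0.8),
    ("java-island-travel", "travel", 0.7),
]

# Hypothetical interest profiles for two different users.
programmer = {"programming": 0.9, "food": 0.1}
coffee_lover = {"food": 0.9}

for_programmer = [doc_id for doc_id, _, _ in rerank(results, programmer)]
for_coffee_lover = [doc_id for doc_id, _, _ in rerank(results, coffee_lover)]
```

Under these assumed numbers the same keyword results are ordered differently for the two users: the language tutorial moves to the top for `programmer`, while the coffee page stays first for `coffee_lover`.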

The rest of the paper is structured as follows: Section 2 discusses related work in this area. Section 3 describes the proposed user interest mining system in detail, with illustrative examples. In Section 4, a sample query set is analyzed to verify that user profile information can be utilized for the retrieval of relevant pages from the search engine database. Section 5 concludes the paper.
