1. Introduction
Information Retrieval (IR) is a field of study concerned with extracting relevant information from a large collection of text documents on the web. Web search is one of the most important applications of IR, and its central challenge is to retrieve high-quality web pages in response to a user query. When users search with popular search engines (SEs) such as Google and Yahoo, they often glance through many web pages before finding the required one, or they cannot find all the relevant information they are looking for (Fan et al., 2009).
The reasons can be explained from a number of perspectives. Existing search engines learn the surfing habits of users through Analytics and AdSense code, which is embedded in web pages to track users' interests. This information is sold to companies for their own development or used for targeted advertising, which allows businesses to advertise against popular keywords and on particular sites. Both AdSense and AdWords increase the revenue of search engines and create a profitable business for advertisers. This revenue model is likely to place business and marketing web pages among the top-k results, although they are irrelevant to the users.
For proprietary reasons, the exact algorithms used by commercial SEs are not publicly known. However, a correlation study has recently been carried out to infer how SE algorithms work and to identify the factors that affect ranking. These factors are mainly based on backlinks, social networking websites, on-page technical features, and on-page content. For backlinks, various strategies are followed to raise the rank of a particular website: links should point to an inner page, links should come from the site's own country, offending links should be removed, links should be analyzed to decide which to keep and which to discard, and so on. Usually, webmasters and search engine optimizers (SEOs) carefully assess these factors and get pages ranked higher than they deserve. Hence, SEOs earn money for better placement of websites in the result list. As a result of all these factors, commercial, social networking, fake, personal, and advocacy web pages are ranked among the top-k web pages. Users must therefore navigate repeatedly to find pages of interest, which increases browsing time. A recent study has shown that a common characteristic of top-ranked pages on the web is that they are social network pages. However, social network pages are used mainly by people of certain age groups. As a result, personal and advocacy web pages appear among the top-k retrieved results. Further, in the TREC competition (Fan et al., 2009), it was stated that using link information alone does not improve performance much compared with using content information. Ranking functions based on content alone are still very successful.
Meanwhile, the on-page technical and on-page content factors give special attention to the text associated with web pages. Typically, a web page has two characteristics: the content of the document and the structure of the document. Content-based ranking orders web pages using various lexical and syntactic statistics of the words they contain. Structure-based ranking considers the structural properties of web pages and assigns a weight to each word according to the structural position in which it occurs. This weighting heuristic improves ranking performance on the web. Each web page is an HTML document in which tags are used for design purposes. HTML tags help in understanding a web page; for example, the <title> tag represents the title of the document. SEs interpret a web page using a few tags from the head and body sections and assign them equal priority, which leads to limited improvement. In addition to text, image, audio, and video content is also embedded in pages, and it can likewise be used to understand the web content and thereby improve retrieval accuracy. For the text content of images, it is assumed that this text is similar to the text annotations and close to the semantic interpretation of the image. Recent works have failed to interpret the image exactly using text. A novel approach therefore needs to be proposed to improve the top-k retrieved results.
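The structure-based weighting heuristic described above can be illustrated with a minimal Python sketch. It is not the ranking function of any search engine or of this article: the tag weights, the names TAG_WEIGHTS, WeightedTermParser, term_weights, and score_page, and the rule of taking the highest weight among the enclosing tags are all assumptions chosen for illustration. Each word receives a weight that depends on the HTML tag it appears in, and a page is scored by summing the weights of the query terms.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical per-tag weights; the article does not specify any values.
TAG_WEIGHTS = {"title": 5.0, "h1": 3.0, "h2": 2.0, "b": 1.5, "strong": 1.5}
DEFAULT_WEIGHT = 1.0
SKIPPED_TAGS = {"script", "style"}            # text here is not page content
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # never closed


class WeightedTermParser(HTMLParser):
    """Accumulates a weight for each word based on its enclosing tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []       # stack of currently open tags
        self.scores = Counter()   # word -> accumulated weight

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed inner tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        if any(t in SKIPPED_TAGS for t in self.open_tags):
            return
        # Assumption: a word takes the highest weight among its open tags.
        weight = max(
            (TAG_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in self.open_tags),
            default=DEFAULT_WEIGHT,
        )
        for word in re.findall(r"[a-z0-9]+", data.lower()):
            self.scores[word] += weight


def term_weights(html):
    """Return a Counter mapping each word to its structure-based weight."""
    parser = WeightedTermParser()
    parser.feed(html)
    parser.close()
    return parser.scores


def score_page(html, query):
    """Score a page as the sum of the weights of the query terms."""
    weights = term_weights(html)
    return sum(weights[w] for w in re.findall(r"[a-z0-9]+", query.lower()))


page = (
    "<html><head><title>Image retrieval</title></head>"
    "<body><h1>Retrieval</h1>"
    "<p>Image search and <b>retrieval</b> basics</p></body></html>"
)
print(score_page(page, "image retrieval"))  # prints 15.5
```

In this sketch the word "retrieval" scores higher than "search" because it occurs in the title, a heading, and bold text. Assigning every tag the same weight would reduce the score to a plain term count, which corresponds to the equal-priority case mentioned above.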