Efficacious Hyperlink Based Similarity Measure Using Heterogeneous Propagation of PageRank Scores

Vasantha Thangasamy (Department of Computer Science, Fatima College, Madurai, India)
Copyright: © 2019 | Pages: 14
DOI: 10.4018/IJIRR.2019100104

Abstract

Information available on the internet is vast, diverse, and dynamic. Because an enormous amount of information is available online, finding similarity between webpages using efficient hyperlink analysis is a challenging task. In this article, the researcher proposes an improved PageSim algorithm which measures the importance of a webpage based on the PageRank values of its connected webpages. Accordingly, the proposed algorithm uses heterogeneous propagation of the PageRank score, based on the prestige measure of each webpage. The existing and the improved PageSim algorithms are implemented with a sample web graph. Real-time citation networks, namely the ZEWAIL Citation Network and the DBLP Citation Network, are used to test and compare the existing and improved PageSim algorithms. With the proposed algorithm, the similarity score between two different webpages is found to increase significantly with common information features and to decrease significantly with distinct factors.

2. Review Of Literature

Most researchers describe similarity measures as scoring functions that determine the relationship between a pair of webpages. The scores usually lie between 0 and 1: a lower value indicates that two webpages are dissimilar, and a higher value indicates that they are similar or identical, as described by Smucker et al. (2007). Calado et al. (2006) proposed link-based techniques, in which hyperlinks are used to find webpage similarity. Wan (2008) stated that similarity measures are central to many important applications such as searching, clustering, classification, and recommendation.
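To make the idea of a link-based score in the range 0 to 1 concrete, the following is a minimal sketch of one simple possibility: the Jaccard overlap of the sets of pages that link to each of two webpages. This is an illustrative assumption only; it is not the PageSim algorithm, and it is not necessarily the measure used in any of the works cited above. The function name `inlink_jaccard` and the graph representation (a dictionary from each page to the set of pages it links to) are hypothetical.

```python
def inlink_jaccard(graph, a, b):
    """Illustrative link-based similarity in [0, 1].

    graph: dict mapping each page to the set of pages it links to
           (a hypothetical representation chosen for this sketch).
    Returns the Jaccard overlap of the pages linking to a and to b.
    """
    # Collect the pages whose outgoing links include a (respectively b).
    in_a = {src for src, targets in graph.items() if a in targets}
    in_b = {src for src, targets in graph.items() if b in targets}
    union = in_a | in_b
    if not union:
        # Neither page has any incoming link: treat as dissimilar.
        return 0.0
    return len(in_a & in_b) / len(union)


# Small hypothetical web graph: p1 and p2 link to both a and b, p3 only to a.
sample_graph = {
    "p1": {"a", "b"},
    "p2": {"a", "b"},
    "p3": {"a"},
}
score = inlink_jaccard(sample_graph, "a", "b")
```

Two pages that share every incoming link score 1, and two pages with no common incoming link score 0, which matches the score range described above.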

Lin et al. (2006) stated that similarity measures are broadly classified into two categories, namely the text-based approach and the hyperlink-based approach. In the text-based approach, the similarity between webpages is evaluated from the webpages' contents. Turney and Pantel (2010) suggested that the most widely used content-based similarity measure in Information Retrieval is cosine TFIDF, which has several issues when applied to the web, since the web consists of billions of webpages. Scalability is a major issue, because comparing the full text requires a large amount of storage and long computation time. In addition, the similarity measure is not accurate, since most webpages are not properly edited.
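The cosine TFIDF measure mentioned above can be sketched in a few lines. The version below is a minimal illustration using only the Python standard library. The whitespace tokenizer, the smoothed IDF formula, and the function names `tfidf_vectors` and `cosine` are assumptions made for this sketch; several TFIDF weighting variants exist, and this is not necessarily the one used in the cited work.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    # Naive whitespace tokenization (an assumption for this sketch).
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            # Relative term frequency times a smoothed IDF (one common variant).
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical page texts, used only to exercise the functions.
pages = [
    "web graph link analysis",
    "web graph link analysis",
    "web page ranking",
    "cooking pasta recipes",
]
vectors = tfidf_vectors(pages)
```

Every page must be tokenized and every pair of vectors compared over the full text, which gives a sense of why storage and computation time become a concern at web scale.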
