Detecting Webspam Beneficiaries Using Information Collected by the Random Surfer

Thomas Largillier, Sylvain Peyronnet
DOI: 10.4018/joci.2011040103

Abstract

Search engines use several criteria to rank webpages and choose which pages to display when answering a request. Those criteria can be separated into two notions, relevance and popularity. The notion of popularity is calculated by the search engine and is related to links made to the webpage. Malicious webmasters want to artificially increase their popularity; the techniques they use are often referred to as Webspam. It can take many forms and is in constant evolution, but Webspam usually consists of building a specific dedicated structure of spam pages around a given target page. It is important for a search engine to address the issue of Webspam; otherwise, it cannot provide users with fair and reliable results. In this paper, the authors propose a technique to identify Webspam through the frequency language associated with random walks among those dedicated structures. The authors identify the language by calculating the frequency of appearance of k-grams on random walks launched from every node.

Introduction

The Web has grown so large (Alpert & Hajhaj, 2008) that users cannot go through all of it, or even through a rather small part of it. Except for their favorite sites, they have to (and they do) use search engines, which answer billions of requests per day. Search engines thus face the challenge of providing their users with good results as quickly as possible, the results being web pages taken from a huge index (billions of web pages) and served under a tremendous load (billions of requests each day).

To ensure good results for a particular request, search engines mostly use a relevance metric between web pages and requests. Being relevant for a certain query, however, does not guarantee a place at the top of a search engine result list. To arbitrate between equally relevant pages, search engines have to use other metrics. One of them is popularity. Most search engines use a popularity mechanism that is not content related. Indeed, with a popularity measure independent of the content of webpages, the computational issues linked to ranking web pages are less prevalent. Thus, popularity often depends on the links a webpage receives from other web pages. Google's PageRank algorithm (Brin, Page, Motwani & Winograd, 1999) computes the popularity of all pages independently of the query, while Kleinberg's HITS algorithm (Kleinberg, 1999) has a query-dependent version of popularity called the authority score.
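To make the query-independent popularity score concrete, here is a minimal power-iteration sketch of PageRank on a toy link graph. The graph, damping factor, and iteration count are illustrative assumptions, not values taken from the paper.

```python
# Minimal PageRank power iteration on a toy link graph.
# Graph, damping factor, and iteration count are illustrative choices.

def pagerank(graph, damping=0.85, iterations=50):
    """graph: dict mapping each page to the list of pages it links to."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, out_links in graph.items():
            if out_links:
                share = damping * rank[node] / len(out_links)
                for target in out_links:
                    new_rank[target] += share
            else:
                # Dangling node: redistribute its rank uniformly.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank

toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(toy_graph))
```

The score depends only on the link structure, not on the pages' content, which is exactly why link manipulation is the natural attack surface for spammers.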

The economic model sustaining most websites is such that webmasters want their sites to appear among the first results of a search engine for specific requests. Indeed, the income of a website is directly correlated with the number of unique visitors it receives. Since search engines are the major source of visitors on the web, attracting as many visitors as possible requires maximizing one's exposure on them. Being close to the top will redirect a huge amount of traffic if the targeted request is wisely chosen.

Artificially increasing the relevance of a page for a request without quickly resorting to spamming techniques, and thus being spotted by search engines, is almost impossible. Moreover, spammers often want to boost a legitimate page, so they do not need to manipulate its relevance. Spammers therefore aim to increase their popularity to move to the top of the list. The most effective way to increase the popularity of a given webpage is to create a set of dull pages organized in a specific architecture whose goal is to boost the target page. This is a borderline technique that falls far outside the guidelines of most search engines. Structures intended to maximize the PageRank of one specific page are well known (Gyongyi & Garcia-Molina, 2005). Those structures can no longer be used efficiently, since it is hard for them to avoid automatic detection. Webspammers therefore slightly modify those structures to increase their rank while avoiding automatic detection.
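To illustrate why such dedicated structures work, here is a hedged toy sketch (a simplified construction of our own, not one of the structures studied by Gyongyi & Garcia-Molina, 2005): a handful of dull boosting pages all link to the target, and the target links back so the rank keeps circulating inside the farm. It reuses the `pagerank` sketch above; the farm size and page names are illustrative assumptions.

```python
# Toy spam-farm sketch: k boosting pages all point to the target,
# and the target links back to recycle the rank inside the farm.
# Page names and farm size are illustrative assumptions.

def spam_farm(k):
    graph = {"target": [f"boost{i}" for i in range(k)]}
    for i in range(k):
        graph[f"boost{i}"] = ["target"]
    return graph

farm = spam_farm(10)
scores = pagerank(farm)   # reuses the pagerank sketch above
print(scores["target"])   # the target captures a large share of the total rank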

There is thus an arms race between search engines and spammers. It is crucial for the former to deal with Webspam, since it can pollute their results; without fair results they will lose users' confidence and visits. With fewer visitors, search engines will then lose the income they derive from advertising and sponsored links. Fighting spam is therefore an economic necessity for search engines. For spammers the problem is symmetric: their income is correlated with their exposure in the SERPs (Search Engine Results Pages). So each time a search engine adapts itself to lower the incidence of Webspam, spammers have to renew their techniques to stay one step ahead.

In this paper we present a method whose goal is to identify malicious structures among web pages. The intuition behind our method is that spammers use specific page architectures to route the PageRank around the target page in order to maximize its score while avoiding automatic detection. Since PageRank can be seen as modeling the behavior of a random surfer, using random walks to reproduce that behavior should expose the paths created by spammers to manipulate and increase their PageRank.
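As a rough sketch of this intuition, and not the authors' exact algorithm: launch short random walks from every node, record each walk as a sequence of page labels, and count the k-grams (windows of k consecutive pages) that appear. Heavily repeated k-grams point to small cycles that keep routing the surfer around the same few pages, which can hint at rank-routing structures. All parameter values and helper names below are illustrative assumptions.

```python
import random
from collections import Counter

# Hedged sketch: launch random walks from every node and count k-grams
# of visited pages. Walk length, number of walks, and k are illustrative.

def random_walks_kgram_counts(graph, k=3, walks_per_node=10, walk_length=8, seed=0):
    rng = random.Random(seed)
    counts = Counter()
    for start in graph:
        for _ in range(walks_per_node):
            walk = [start]
            node = start
            for _ in range(walk_length - 1):
                out_links = graph.get(node, [])
                if not out_links:
                    break  # dead end: the surfer stops here
                node = rng.choice(out_links)
                walk.append(node)
            # Count every k-gram (window of k consecutive pages) on the walk.
            for i in range(len(walk) - k + 1):
                counts[tuple(walk[i:i + k])] += 1
    return counts

toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(random_walks_kgram_counts(toy_graph).most_common(5))
```

In this toy run, k-grams that recur far more often than the rest correspond to the tight loops a spam farm relies on to recycle rank toward its target.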

The main results of this paper are:
