Design of a Migrating Crawler Based on a Novel URL Scheduling Mechanism using AHP

Deepika Punj, Ashutosh Dixit

Source Title: International Journal of Rough Sets and Data Analysis (IJRSDA) 4(1)

DOI: 10.4018/IJRSDA.2017010106

Abstract

Crawlers play a significant role in managing the vast amount of information available on the web. A crawler should be optimized to retrieve as much unique information as possible from the World Wide Web. This paper proposes an architecture for a migrating crawler based on URL ordering, URL scheduling, and a document redundancy elimination mechanism. The proposed ordering technique is based on URL structure, which plays a crucial role in utilizing the web efficiently. Scheduling ensures that each URL goes to the optimum agent for downloading; to this end, the characteristics of both agents and URLs are taken into consideration. Duplicate documents are removed so that the database holds only unique documents. To reduce matching time, documents are matched on the basis of their meta information only. By ordering and scheduling URLs, the agents of the proposed migrating crawler work more efficiently than a traditional single crawler.

Introduction

As the web grows, it becomes necessary to ensure the richness and uniqueness of the information available on it. In the era of ICT, search engines play a vital role in finding relevant information (Sharma & Sharma, 2011). Two factors are significant for relevance: users' behaviour (Deepika and Dixit, 2015c) and a rich database. The present paper focuses on keeping the database rich. The crawler is one of the main modules of a search engine; it is responsible for gathering information from the web and storing it in the database. Designing a crawler raises many issues, such as uniqueness and richness of the database, cooperation between crawling agents, and fast, efficient crawling (Deepika and Dixit, 2012). Many types of crawler exist, such as the Parallel Crawler, Migrating Crawler, Focussed Crawler, Hidden Crawler, and Incremental Crawler. In the proposed work, the capabilities of the migrating crawler (Singhal et al., 2012) are used to design an efficient crawling system. The proposed crawler tries to address all the design issues relevant to efficient crawling. URLs play an important role in the design of a migrating crawler: in a migrating crawler environment, a properly scheduled URL improves the richness of the database with the least communication overhead on the network.

A URL is composed of five components, namely the scheme, authority, path, query, and fragment (Lee et al., 2005).
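
For illustration, the five components can be read off a sample URL with Python's standard `urllib.parse.urlsplit`. This is only a sketch: the helper name `url_components` and the sample URL are assumptions made for this example and do not come from the paper.

```python
from urllib.parse import urlsplit


def url_components(url: str) -> dict:
    """Split a URL into the five generic components.

    Hypothetical helper for illustration; urlsplit calls the
    authority component 'netloc'.
    """
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "authority": parts.netloc,
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


# Made-up sample URL, used only to show where each component sits.
components = url_components(
    "http://www.example.com:8042/over/there?name=ferret#nose"
)
for name, value in components.items():
    print(f"{name:10s} {value}")
```

Here the authority is `www.example.com:8042` (host plus port), the path is `/over/there`, the query is `name=ferret`, and the fragment is `nose`.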

Several distinct URLs may point to the same page. This problem is known as DUST (Schonfeld et al., 2007), i.e. different URLs with similar text. DUST affects the whole working of a search engine: crawling, indexing, ranking, and so on. To remove this duplication at the URL level, appropriate processing is applied. Before URLs are checked for duplicates, the standard URL normalization process (Lee et al., 2005) is applied to them. This process eliminates syntactically similar URLs. There are three types of normalization:

1. Case normalization
   - The letters of the scheme component and the hostname are converted to lowercase.
2. Percent-encoding normalization
   - Percent-encoded octets that correspond to unreserved characters such as ~ and _ are decoded.
3. Path segment normalization
   - Remove all ‘.’ and ‘..’ segments from the path component of the URL.
   - Remove the fragment component from the URL, i.e. the part after ‘#’.
   - Eliminate the default port number, such as 80.
   - Remove the trailing ‘/’ from the path, and add ‘/’ as the path if it is empty.

With the help of these normalization processes, syntactically similar URLs are identified.
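
The normalization steps listed above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, only the path is percent-decoded, userinfo and IPv6 host literals are ignored, and empty path segments are dropped as a simplification.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Unreserved characters: letters, digits, '-', '.', '_', '~'.
UNRESERVED = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)
DEFAULT_PORTS = {"http": 80, "https": 443}


def _decode_unreserved(text: str) -> str:
    """Decode %XX escapes of unreserved characters; uppercase the rest."""
    def repl(match):
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else match.group(0).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", repl, text)


def normalize_url(url: str) -> str:
    """Hypothetical helper applying the three normalization types."""
    parts = urlsplit(url)
    # 1. Case normalization: lowercase scheme and hostname.
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Drop the port when it is the default for the scheme.
    port = parts.port
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    # 2. Percent-encoding normalization (path only in this sketch).
    path = _decode_unreserved(parts.path)
    # 3. Path segment normalization: resolve '.' and '..'.
    segments = []
    for seg in path.split("/"):
        if seg in ("", "."):
            continue  # simplification: also drops empty segments
        if seg == "..":
            if segments:
                segments.pop()
            continue
        segments.append(seg)
    # No trailing '/', and an empty path becomes '/'.
    path = "/" + "/".join(segments)
    # The fragment is discarded; the query is kept unchanged.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


print(normalize_url("HTTP://WWW.Example.COM:80/a/./b/../%7Euser/index.html/#top"))
```

Two URLs that normalize to the same string can then be treated as syntactically similar and crawled only once.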

A sitemap (Deepika and Dixit, 2015a) is also used here to obtain information about all the links present in a web page. Essentially, the sitemap gives the number of links present in a web page, which helps in ordering URLs before they are downloaded. To keep the database unique and rich, a hashing algorithm is used: hashing helps eliminate duplicate documents from the database.
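
As a rough sketch of hash-based duplicate elimination, the example below fingerprints a document from its meta information only, as the abstract describes. The preview does not say which hashing algorithm or which meta fields the authors use, so SHA-256, the `title` and `description` fields, and the names `meta_fingerprint` and `DocumentStore` are all assumptions made for illustration.

```python
import hashlib


def meta_fingerprint(meta: dict) -> str:
    """Hash a document's meta information (assumed: SHA-256).

    Keys and values are lowercased and whitespace-collapsed first so
    that trivial formatting differences do not change the digest.
    """
    pairs = sorted(
        (str(key).strip().lower(), " ".join(str(value).split()).lower())
        for key, value in meta.items()
    )
    canonical = "\n".join(f"{key}={value}" for key, value in pairs)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class DocumentStore:
    """Hypothetical in-memory database keyed by meta fingerprint."""

    def __init__(self):
        self._url_by_digest = {}

    def add(self, url: str, meta: dict) -> bool:
        """Store the document; return False if it is a duplicate."""
        digest = meta_fingerprint(meta)
        if digest in self._url_by_digest:
            return False
        self._url_by_digest[digest] = url
        return True


store = DocumentStore()
first = store.add(
    "http://a.example/x",
    {"title": "Home Page", "description": "Welcome"},
)
duplicate = store.add(
    "http://a.example/x/index.html",
    {"Description": "welcome ", "Title": "Home  page"},
)
print(first, duplicate)
```

Comparing fixed-length digests of the meta fields avoids a full-text comparison of every pair of documents, which is the matching-time saving the abstract refers to.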

Related Work

Some of the prevalent work in the related area is discussed below. Agarwal et al. (2009) worked on crawl logs and generated rules from them for finding duplicate web pages. They formed clusters of similar pages from the crawl logs and used these clusters to generate rules for detecting duplicate pages. The rules were then generalized so that identical pages could be detected from their URLs alone. They also showed that their proposal improves the effectiveness of crawling and indexing.
