An Approach for Focused Crawler to Harvest Digital Academic Documents in Online Digital Libraries

Sumita Gupta (YMCAUST, Faridabad, India), Neelam Duhan (J C Bose University of Science and Technology YMCA, Faridabad, India) and Poonam Bansal (MSIT, Delhi, India)
Copyright: © 2019 | Pages: 25
DOI: 10.4018/IJIRR.2019070103

With the rapid growth of digital information and user needs, it has become imperative to quickly retrieve relevant domain- or topic-specific documents in response to a user query. A focused crawler plays a vital role in digital libraries by crawling the web so that researchers can easily explore a domain-specific list of search results and find the desired content for a query. In this article, a focused crawler is proposed for online digital library search engines. It uses the meta-data of the query to retrieve the corresponding document or other relevant but missing information (e.g. paid publications from ACM, IEEE, etc.). Different query strategies are built from the meta-data and submitted to different search engines in order to find the relevant information that is missing. The results returned by these search engines are filtered and then used for further crawling of the Web.
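
The abstract does not spell out the query strategies or the filtering rule, so the following is only a minimal sketch of the general idea, under stated assumptions. The `PaperMetadata` record, the `build_queries` and `filter_results` helpers, the particular field combinations, and the host-based filter are all hypothetical and are not taken from the article.

```python
from dataclasses import dataclass, field
from typing import List
from urllib.parse import urlparse


@dataclass
class PaperMetadata:
    """Hypothetical metadata record; the field names are illustrative only."""
    title: str
    authors: List[str] = field(default_factory=list)
    venue: str = ""


def build_queries(meta: PaperMetadata) -> List[str]:
    """Combine metadata fields into several query strings (assumed strategies)."""
    queries = [f'"{meta.title}"']  # exact-title query
    if meta.authors:
        # exact title plus the first author
        queries.append(f'"{meta.title}" {meta.authors[0]}')
        # loose title plus all authors
        queries.append(f'{meta.title} {" ".join(meta.authors)}')
    if meta.venue:
        # loose title plus the venue name
        queries.append(f"{meta.title} {meta.venue}")
    return queries


def filter_results(urls: List[str], blocked_hosts: List[str]) -> List[str]:
    """Drop duplicate URLs and URLs on hosts assumed to be inaccessible."""
    seen = set()
    kept = []
    for url in urls:
        host = urlparse(url).netloc.lower()
        blocked = any(host == b or host.endswith("." + b) for b in blocked_hosts)
        if url in seen or blocked:
            continue
        seen.add(url)
        kept.append(url)  # surviving URLs could serve as crawl seeds
    return kept
```

In this sketch, each query string would be sent to a search engine by some client that is not shown here, and the URLs that survive `filter_results` would be handed to the crawler as seeds.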
Article Preview

2. Review Work

Today, collecting domain- or topic-specific information from the Web with a focused crawler is becoming a challenging task. Nowadays, only 20% of websites or digital libraries are open to crawlers. Most digital libraries and publication venues, such as ACM, IEEE and Springer, make their documents and publications available online but require access permission (i.e. they are paid). Therefore, to provide researchers with the desired and relevant information in such cases, many digital libraries harvest online academic documents through focused crawling of open-access archives, authors' personal websites, institutional websites, etc. Many digital library search engines have been proposed for finding scientific literature. The Research Index (formerly CiteSeer) (Li et al., 2006; McKiernan, 2000) is the most commonly used open-access search system for retrieving computer science research papers. It uses full-text search and handles citations through Autonomous Citation Indexing. However, its main disadvantage is that the only seed URL, or starting point, for crawling is the keyword search. Hoff and Mundhenk (2001) proposed a system, HPSearch and Mops, that searches for author homepages. Homepages generally contain information about a researcher's interests, publication details, work status, phone details and sometimes the publications themselves. This type of information is useful for tasks such as crawling for new literature and meta-data extraction. In this system, the HomePageSearch component searches for researcher homepages by name, and the second component, Mops, is responsible for finding publications close to the homepages.
