Introduction
With the increase in the size of the web, it is necessary to ensure the richness and uniqueness of the information available on it. In the era of ICT, the search engine plays a vital role in finding relevant information (Sharma & Sharma, 2011). For relevant information, both parameters, namely users' behaviour (Deepika and Dixit, 2015c) and a rich database, are significant. The focus of the present paper is on keeping the database rich. The crawler is one of the main modules of a search engine; it is responsible for gathering information from the web and storing it in the database. There are many issues in designing a crawler, such as the uniqueness and richness of the database, cooperation between crawling agents, and fast and efficient crawling (Deepika and Dixit, 2012). Many types of crawlers exist, such as the parallel crawler, migrating crawler, focused crawler, hidden web crawler, and incremental crawler. In the proposed work, the capabilities of the migrating crawler (Singhal et al., 2012) are utilized to design an efficient crawling system. The proposed crawler tries to address all the design issues needed for efficient crawling. Here, the URL plays an important role in designing a migrating crawler. In a migrating crawler environment, if the URLs are scheduled properly, the richness of the database improves with the least communication overhead on the network.
A URL is composed of five components, namely the scheme, authority, path, query, and fragment (Lee et al., 2005), as shown below:
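For illustration only, the following minimal sketch (not part of the original paper, and using a hypothetical example URL) splits a URL into these five components with Python's standard urllib.parse module:

```python
# Hypothetical example: splitting a URL into its five components.
from urllib.parse import urlsplit

url = "http://www.example.com:8080/docs/page.html?id=42#section2"
parts = urlsplit(url)

print("scheme   :", parts.scheme)    # http
print("authority:", parts.netloc)    # www.example.com:8080
print("path     :", parts.path)      # /docs/page.html
print("query    :", parts.query)     # id=42
print("fragment :", parts.fragment)  # section2
```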
There are URLs that point to the same page. This problem is designated as DUST (Schonfeld et al., 2007), i.e. different URLs with similar text. DUST affects the whole working of a search engine, i.e. crawling, indexing, ranking, etc. In order to remove this duplicity at the URL level, proper processing has been adopted. Before the URLs are checked for duplicity, the standard URL normalization process (Lee et al., 2005) is applied to them. This process eliminates syntactically similar URLs. There are three types of normalization processes.
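As a general illustration of such syntax-level normalization, the sketch below applies a few common steps (case-folding the scheme and host, dropping default ports, resolving dot-segments, and removing the fragment). The exact rules used in the proposed crawler are an assumption here, not taken from the paper.

```python
# Hedged sketch of syntax-based URL normalization; the rules applied in the
# proposed crawler may differ from this illustration.
from urllib.parse import urlsplit, urlunsplit
import posixpath

DEFAULT_PORTS = {"http": 80, "https": 443}

def normalize(url: str) -> str:
    parts = urlsplit(url)
    scheme = parts.scheme.lower()                     # case-normalize the scheme
    host = parts.hostname.lower() if parts.hostname else ""
    netloc = host
    port = parts.port
    if port is not None and port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{port}"                     # keep only non-default ports
    path = posixpath.normpath(parts.path or "/")      # resolve '.' and '..' segments
    if parts.path.endswith("/") and path != "/":
        path += "/"                                   # preserve a trailing slash
    return urlunsplit((scheme, netloc, path, parts.query, ""))  # strip the fragment

# e.g. normalize("HTTP://Example.COM:80/a/./b/../c") -> "http://example.com/a/c"
```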
The sitemap (Deepika and Dixit, 2015a) is also used here to obtain information about all the links present in a webpage. Basically, the sitemap gives the number of links present in a web page, which helps in ordering the URLs before downloading them. To keep the database unique and rich, a hashing algorithm is used; hashing helps in eliminating duplicate documents from the database.
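As an illustrative sketch only (the paper does not specify the hash function or storage used), the snippet below uses a content hash to reject a document whose text has already been stored:

```python
# Illustrative duplicate-elimination sketch, not the authors' exact scheme.
import hashlib

seen_hashes = set()   # in practice this set would be backed by the crawler's database

def is_duplicate(document_text: str) -> bool:
    """Return True if an identical document has already been stored."""
    digest = hashlib.sha256(document_text.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False

# The second, identical page is rejected before it reaches the database.
print(is_duplicate("<html>same content</html>"))  # False -> store it
print(is_duplicate("<html>same content</html>"))  # True  -> discard
```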
Some of the prevalent work in the related area is discussed below. Agarwal et al. (2009) worked on crawl logs and generated rules from them; these rules were then utilized for finding duplicate web pages. They formed clusters of similar pages from the crawl logs, and these clusters were used to generate rules for detecting duplicate pages. The rules were then generalized so that, with their help, identical pages could be detected from the URLs alone. They also showed that their proposal affects crawling and indexing in an effective way.