Introduction
Web scraping involves extracting the enormous amount of data embedded in semi-structured HTML pages. The amount of information available in the deep web is several orders of magnitude larger than that of the surface web. The surface web refers to web pages indexed by search engines such as Google and Yahoo. The deep web refers to pages that are generated dynamically by querying a back-end database and embedding the resulting data records in server-side templates. The deep web is also referred to as the dark web, since it is not indexed by search engines. Recovering the data records is not straightforward, because the web pages are intended for human consumption. Getting data from the deep web is easy if the owner of the web site provides an API for accessing it, but this is rarely the case: building an API requires technical expertise, and some owners are unwilling to share their data. For this reason, web scraping is often the only way to obtain data from the deep web.
Data from the deep web acts as a complementary source of information for many data-analytics applications, such as opinion mining, sentiment analysis, product intelligence, and customer intelligence. Much of the information these applications need resides in the deep web. The first step is to collect the data from deep web pages. Manual copy-and-paste is practically infeasible because the number of web pages to be processed is huge; the only viable option is an automated system that can identify target pages and perform the extraction. The problem of web data extraction can be stated as follows:
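To illustrate why automated extraction from template-generated pages is feasible, the following minimal Python sketch (standard library only; the page markup, class names, and record layout are all hypothetical) pulls the repeated records out of an embedded listing page:

```python
from html.parser import HTMLParser

# Hypothetical sample of a template-generated product-listing page.
SAMPLE = """
<html><body>
<div class="record"><span class="name">Laptop</span><span class="price">999</span></div>
<div class="record"><span class="name">Phone</span><span class="price">499</span></div>
</body></html>
"""

class RecordParser(HTMLParser):
    """Collect <span> text, keyed by class attribute, inside each record <div>."""
    def __init__(self):
        super().__init__()
        self.records = []      # list of {attribute: value} dicts
        self.current = None    # record under construction
        self.field = None      # class of the <span> currently open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "record":
            self.current = {}
        elif tag == "span" and self.current is not None:
            self.field = attrs.get("class")

    def handle_data(self, data):
        if self.field is not None and data.strip():
            self.current[self.field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self.field = None
        elif tag == "div" and self.current is not None:
            self.records.append(self.current)
            self.current = None

parser = RecordParser()
parser.feed(SAMPLE)
print(parser.records)
# → [{'name': 'Laptop', 'price': '999'}, {'name': 'Phone', 'price': '499'}]
```

Real deep web pages lack such convenient class names, which is precisely why the automatic techniques surveyed below are needed; this sketch only shows the shape of the problem and its output.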
Let web site S consist of a collection of template-generated web pages P = {p1, p2, p3, …, pi, …, pm}, where each web page pi contains a set of data objects D = {d1, d2, d3, …, dr}. Each data object dj in D is a set of attribute-value pairs {<x1, y1>, <x2, y2>, …, <xn, yn>}. The problem of web data extraction is to extract D from every page pi in P belonging to S.
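This problem statement transcribes directly into code. In the illustrative Python sketch below, the site S, pages pi, data objects dj, and attribute-value pairs <x, y> are modeled with plain containers; all names and sample values are hypothetical:

```python
# A data object d_j is a set of attribute-value pairs <x, y>: model it as a dict.
# A page p_i holds a set of data objects D: model it as a list of dicts.
# A site S is a collection of template-generated pages P: a list of pages.

def extract_site(site):
    """Web data extraction: collect D from every page p_i in P belonging to S."""
    return [obj for page in site for obj in page]

# Hypothetical site with two pages of product records:
site = [
    [{"title": "Book A", "price": "10"}],
    [{"title": "Book B", "price": "12"}, {"title": "Book C", "price": "8"}],
]
print(len(extract_site(site)))  # → 3 data objects in total
```

The hard part, of course, is producing the per-page object lists from raw HTML in the first place; that is what the systems discussed next attempt to automate.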
The design of a web data extraction system must handle several challenges: heterogeneity in the structure of web pages across different web sites, missing attributes, multiple levels of nesting within the templates in which data records are embedded, identification of the extraction target, semantic representation of the extracted data, automatic labeling, and so on. Although commercial tools such as Lixto (Baumgartner, Gatterbauer, & Gottlob, 2009) and import.io (https://www.import.io/) are available for web data extraction, using them requires understanding the site map and manually selecting extraction targets. Many automatic approaches, such as ExAlg (Arasu & Garcia-Molina, 2003), RoadRunner (Crescenzi, Mecca, & Merialdo, 2002), FiVaTech (Kayed & Chang, 2010), and Trinity (Sleiman & Corchuelo, 2014), exist in the literature. Semantic Scraper departs from these techniques in the following ways:
1. Automatic identification of the data-rich section
2. Automatic labeling of extracted data records
3. Ability to extract from a single input page
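Point 1, identifying the data-rich section, is commonly approached by locating the page region whose children repeat the same structure. The sketch below implements that generic heuristic (not necessarily the Semantic Scraper algorithm) on a hypothetical, well-formed page, using only the standard library:

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical well-formed page: navigation noise plus a repetitive listing region.
PAGE = """<html><body>
<ul id="nav"><li>Home</li><li>About</li></ul>
<div id="products">
  <div><b>Item 1</b><i>$5</i></div>
  <div><b>Item 2</b><i>$7</i></div>
  <div><b>Item 3</b><i>$9</i></div>
</div>
</body></html>"""

def shallow_signature(elem):
    # Structural fingerprint of a subtree: its tag plus the tags of its children.
    return (elem.tag, tuple(child.tag for child in elem))

def data_rich_section(root):
    """Return the element whose children repeat the same structure most often."""
    best, best_score = None, 0
    for elem in root.iter():
        counts = Counter(shallow_signature(child) for child in elem)
        score = max(counts.values(), default=0)
        if score > best_score:
            best, best_score = elem, score
    return best

root = ET.fromstring(PAGE)
section = data_rich_section(root)
print(section.get("id"))  # → products (3 repeats beat the 2-item nav list)
```

Real systems refine this idea with deeper tree alignment and similarity measures, since template variations and optional attributes make exact signature matches too brittle on live pages.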
Section 2 reviews the state-of-the-art approaches in the literature. Section 3 explains the architecture of the Web Data Extraction System (WDES) based on semantic labeling, Section 4 presents experimental results and a comparison with other state-of-the-art techniques, and Section 5 concludes and outlines future work.