A Framework for Automated Scraping of Structured Data Records From the Deep Web Using Semantic Labeling: Semantic Scraper

Umamageswari Kumaresan, Kalpana Ramanujam
Copyright: © 2022 |Pages: 18
DOI: 10.4018/IJIRR.290830

Abstract

The intent of this research is to develop an automated web scraping system capable of extracting structured data records embedded in semi-structured web pages. Most automated extraction techniques in the literature capture repeated patterns among a set of similarly structured web pages, deduce the template used to generate those pages, and then extract the data records. These techniques rely on computationally intensive operations such as string pattern matching or DOM tree matching, and they require manual labeling of the extracted data records. The technique discussed in this paper departs from state-of-the-art approaches by identifying informative sections of a web page through the repetition of informative content rather than of syntactic structure. The experiments show that the system identified the data-rich region with 100% precision for web sites belonging to different domains; experiments on real-world web sites demonstrate the effectiveness and versatility of the proposed approach.

Introduction

Web scraping involves extracting enormous amounts of data embedded in semi-structured HTML pages. The amount of information available on the deep web is several orders of magnitude greater than that on the surface web. The surface web refers to web pages indexed by search engines such as Google and Yahoo. The deep web refers to pages generated dynamically by querying a back-end database and embedding the resulting data records in server-side templates. Because it is not indexed by search engines, the deep web is sometimes loosely referred to as the dark web. Recovering the data records is not straightforward, since the pages are designed for human consumption. Obtaining data from the deep web is easy if the owner of the web site provides an API for accessing it, but this is rarely the case: offering an API requires technical expertise, and some owners are unwilling to expose their data. For this reason, web scraping is often the only way to obtain data from the deep web.

Data from the deep web serves as a complementary source of information for many data analytics applications, such as opinion mining, sentiment analysis, product intelligence, and customer intelligence. The first step for such applications is to collect the data from the deep web pages. Manual copy-and-paste is practically infeasible because the number of web pages to be processed is huge; the only viable option is an automated system that can identify target pages and perform extraction. The problem of web data extraction can be stated as follows:

Let a web site S consist of a collection of template-generated web pages P = {p1, p2, p3, …, pi, …, pm}, where each web page pi contains a set of data objects D = {d1, d2, d3, …, dr}. Each data object dj in D is a set of attribute-value pairs {<x1, y1>, <x2, y2>, …, <xn, yn>}. The problem of web data extraction is to extract D from every pi in P belonging to S.
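For concreteness, the formulation above can be illustrated with a small Python sketch. All values here are hypothetical and serve only to show the shape of the data; note how a missing attribute (a challenge discussed below) appears naturally as an absent key:

```python
# A data object d_j is a set of attribute-value pairs <x, y>,
# modeled here as a dictionary (all values are invented examples).
d1 = {"title": "USB-C Cable", "price": "$7.99", "rating": "4.5"}
d2 = {"title": "HDMI Adapter", "price": "$12.49"}  # 'rating' attribute missing

# A page p_i holds the set of data objects D embedded in it.
p1 = [d1, d2]

# The site S is the collection of its template-generated pages P.
S = [p1]

# Web data extraction: recover every D from every page p_i in S.
extracted = [obj for page in S for obj in page]
print(len(extracted))  # number of data objects recovered
```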

The design of a web data extraction system must address several challenges: heterogeneity in the structure of web pages across different web sites, missing attributes, multiple levels of nesting within the templates in which data records are embedded, identification of the extraction target, semantic representation of the extracted data, automatic labeling, and so on. Although many commercial tools, such as Lixto (Baumgartner, Gatterbauer, & Gottlob, 2009) and import.io (https://www.connotate.com/), are available for web data extraction, their use requires an understanding of the site map and manual selection of extraction targets. Many automatic approaches, such as ExAlg (Arasu & Garcia-Molina, 2003), RoadRunner (Crescenzi, Mecca, & Merialdo, 2002), FiVaTech (Kayed & Chang, 2010), and Trinity (Sleiman & Corchuelo, 2014), exist in the literature. Semantic Scraper departs from these techniques in the following ways:

  1. Automatic identification of the data-rich section
  2. Automatic labeling of extracted data records
  3. Ability to extract from a single input page
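The first capability, locating the data-rich section by repetition of informative content rather than syntactic structure, can be sketched as follows. This is a hedged illustration, not the paper's actual algorithm: it simply credits each DOM path for every descendant text node it encloses and picks the most-credited region. The sample page and all tag names are invented for the example.

```python
# Minimal sketch, assuming the data-rich region is the element whose
# children most often carry informative (non-whitespace) text.
from html.parser import HTMLParser
from collections import Counter

class RegionFinder(HTMLParser):
    """Counts, per DOM path, how many enclosed text nodes carry content."""
    def __init__(self):
        super().__init__()
        self.stack = []                 # current open-tag path
        self.text_children = Counter()  # path -> informative text count

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip() and len(self.stack) >= 2:
            # Credit the container's parent: regions whose children
            # repeatedly carry text accumulate a high score.
            self.text_children[tuple(self.stack[:-1])] += 1

def data_rich_region(html):
    finder = RegionFinder()
    finder.feed(html)
    # The path with the most informative descendants is the candidate region.
    return max(finder.text_children, key=finder.text_children.get)

page = """<html><body>
<div id="nav"><a>Home</a></div>
<ul id="results">
  <li><b>Item 1</b> $9.99</li>
  <li><b>Item 2</b> $4.50</li>
  <li><b>Item 3</b> $2.25</li>
</ul>
</body></html>"""
print(data_rich_region(page))  # path into the repeated-content region
```

On this toy page the navigation bar contributes a single text node while the result list contributes many, so the returned path points into the `ul` of results rather than the `div` of links.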

Section 2 discusses the state-of-the-art approaches in the literature. Section 3 explains the architecture of the Web Data Extraction System (WDES) based on semantic labeling. Section 4 presents experimental results and a comparison with other state-of-the-art techniques, and Section 5 concludes and outlines future work.
