Harvesting Deep Web Data through Produser Involvement

Tomasz Kaczmarek, Dawid Grzegorz Węckowski
Copyright: © 2014 | Pages: 22
DOI: 10.4018/978-1-4666-4313-0.ch013


Acquiring data from the deep Web is a complex process that requires an understanding of Website navigation, data extraction, and integration techniques. Existing solutions for automating it are not ready to cover the whole deep Web, and applying them in practice requires skills and knowledge. However, several systems have been created that approach the problem by involving end users, who are able to bring data from the deep Web to the surface while creating solutions for their own information needs. In this chapter, the authors study these systems from the end-user perspective, investigating their interfaces, the languages they expose to end users, and the platforms that accompany the systems to involve end users and allow them to share the results of their work.
Chapter Preview

Deep Web Structure

The notion of the Deep Web (a.k.a. the Hidden Web) refers to Web pages that are not directly accessible through URLs but are instead generated dynamically upon HTML form submission (Madhavan et al., 2008). Retrieving a Web page from the Deep Web involves filling in the form with the desired values, which determine the content of the delivered Web page. Apart from processing the results, the challenging part is automatically determining HTML form values that generate a useful outcome.
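
To make the form-filling step concrete, the following minimal sketch enumerates the requests that a crawler could issue against a single form. It is an illustration only, not the method of Madhavan et al. (2008): the form URL, the field names `make` and `year`, and their candidate values are hypothetical, and the form is assumed to use the GET method so that each filled-in form corresponds to one URL.

```python
from itertools import product
from urllib.parse import urlencode


def enumerate_form_queries(action_url, candidate_values):
    """Yield one GET URL per combination of candidate form values.

    candidate_values maps each form field name to a list of values to try.
    """
    names = sorted(candidate_values)  # fixed field order for reproducible URLs
    for combo in product(*(candidate_values[name] for name in names)):
        yield action_url + "?" + urlencode(dict(zip(names, combo)))


# Hypothetical search form with two fields (assumed, not from the chapter).
queries = list(enumerate_form_queries(
    "http://example.com/search",
    {"make": ["audi", "bmw"], "year": ["2003", "2004"]},
))

for url in queries:
    print(url)
```

The number of requests grows multiplicatively with the number of fields and candidate values, which is one reason why choosing form values that yield useful results is the hard part of the task.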

As shown by He et al. (2007), the Deep Web is very extensive and diverse. In 2007 it was estimated to comprise over 300,000 sites, 450,000 databases, and 1,250,000 interfaces, and it was still expanding at a high rate, having grown 3 to 7 times between 2000 and 2004. Deep Web pages are distributed across a wide range of subject areas, with a significant share of e-commerce sites, although non-commerce sites are also gradually being hidden behind HTML forms. Deep Web pages are mostly structured and present data objects as attribute-value pairs. This feature comes from the back-end structure of Deep Web sites, which use databases built on the relational or object-oriented paradigm. Because the generated Web pages are the result of database queries, which return data in a highly structured manner, the design of the pages is noticeably influenced by the underlying data structures. This is reflected in table-like layouts or database-style tuples on the Web pages. The structure of Deep Web sites also tends to be quite shallow (He et al., 2007): about 94% of Deep Web databases are located no deeper than the third level of a Website.
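
The table-like layout described above is what makes extraction of attribute-value pairs feasible. The sketch below, which uses only the Python standard library, turns a simple result table into a list of records. The sample HTML, the column names, and the helper names `RowParser` and `table_to_records` are assumptions made for illustration. Real result pages are far less regular, and the sketch assumes a single table whose first row holds the attribute names.

```python
from html.parser import HTMLParser


class RowParser(HTMLParser):
    """Collect the text of each table cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # cells of the row currently being parsed
        self._cell = None  # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_records(html):
    """Map each data row to a dict keyed by the header row's cell texts."""
    parser = RowParser()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical result page fragment (assumed, not from the chapter).
SAMPLE = """
<table>
  <tr><th>Title</th><th>Price</th></tr>
  <tr><td>Book A</td><td>10</td></tr>
  <tr><td>Book B</td><td>12</td></tr>
</table>
"""

records = table_to_records(SAMPLE)
print(records)
```

Each resulting dictionary corresponds to one database-style tuple, with the header cells acting as attribute names.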
