Traditionally, a great deal of research has been devoted to data extraction on the web (Crescenzi, et al, 2001; Embley, et al, 2005; Laender, et al, 2002; Hammer, et al, 1997; Ribeiro-Neto, et al, 1999; Huck, et al, 1998; Wang & Lochovsky, 2002, 2003) from areas where data is easily indexed and extracted by a search engine, the so-called Surface Web. There are, however, other sites, larger and potentially more vital, that contain information which cannot be readily indexed by standard search engines. These sites, which have been designed to require some level of direct human participation (for example, to issue queries rather than simply follow hyperlinks), cannot be handled using the simple link traversal techniques used by many web crawlers (Rappaport, 2000; Cho & Garcia-Molina, 2000; Cho et al, 1998; Edwards et al, 2001). This area of the web, which has been operationally off-limits for crawlers using standard indexing procedures, is termed the Deep Web (Zillman, 2005; Bergman, 2000). Much work remains to be done, as Deep Web sites have only recently begun to be explored for potential uses.
Key Terms in this Chapter
Digit Content: In our context, this refers to the presence of digit characters (0-9) within a cell.
HTML-Encoded Tables: Sections of HTML code that are delimited by the HTML <table> tag. The data within these sections are not always tables in the logical sense of the word.
Dynamic Web Page: A Web page that is created on-the-fly from a back-end database when a user interactively issues a query on a Web site. Sometimes the presence of a question mark “?” in the body of a URL indicates that dynamic content will be sent instead of a static HTML page.
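The question-mark heuristic mentioned above can be sketched in a few lines. This is only an illustration in Python (the function name looks_dynamic and the example URLs are hypothetical), and, as the definition says, the test is suggestive rather than conclusive: a static page can carry a query string, and a dynamic page can be served without one.

```python
def looks_dynamic(url: str) -> bool:
    """Heuristic only: report whether the URL contains a '?'.

    Any fragment (the part after '#') is ignored, because it is never
    sent to the server and so cannot trigger dynamic content.
    """
    without_fragment = url.split("#", 1)[0]
    return "?" in without_fragment


if __name__ == "__main__":
    for candidate in (
        "http://example.com/search?q=deep+web",
        "http://example.com/index.html",
    ):
        print(candidate, looks_dynamic(candidate))
```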
Character-to-Digit Ratio (CDR): This is a narrowly defined ratio obtained by dividing the number of characters by the number of digits. In the case where there are no digits the CDR is set to the number of characters, and in the case where there are no characters the CDR is set to zero. It gives a sense of the character content versus digit content of a cell.
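As a minimal sketch of this definition, the function below computes the CDR for the text of a single cell. It assumes that "characters" means the alphabetic characters A-Z and a-z (as in the Character Content entry) and that "digits" means 0-9; the function name is ours, and Python is used purely for illustration.

```python
def character_to_digit_ratio(cell_text: str) -> float:
    """Compute the CDR of a cell's text, following the two special cases."""
    # Assumption: only ASCII letters count as "characters" and only
    # ASCII digits count as "digits"; everything else is ignored.
    letters = sum(1 for ch in cell_text if ch.isascii() and ch.isalpha())
    digits = sum(1 for ch in cell_text if ch.isascii() and ch.isdigit())

    if letters == 0:
        # No characters: the CDR is defined as zero.
        return 0.0
    if digits == 0:
        # No digits: the CDR is defined as the number of characters.
        return float(letters)
    return letters / digits
```

For instance, a cell containing "Price 1999" has five letters and four digits, giving a CDR of 1.25, while a purely numeric cell such as "2005" gives zero.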
HTML Tag: HTML consists of elements, which control how HTML-encoded data is displayed. Tags start with a “<” and end with a “>.” For example, <table>, <tr>, and <td> are three distinct tags in HTML. Tags may also contain attributes, which modify the default behaviour of the tag. For example, the tag <table border="1"> contains the border attribute for this table. The lettering inside the angle brackets is not case sensitive.
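To make tag names, attributes, and case insensitivity concrete, the following minimal sketch uses Python's standard html.parser module, which reports tag and attribute names in lower case regardless of how they were written. The class name, helper function, and sample markup are hypothetical and serve only as an illustration.

```python
from html.parser import HTMLParser


class StartTagLister(HTMLParser):
    """Record every start tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # html.parser has already lower-cased the tag and attribute names.
        self.tags.append((tag, dict(attrs)))


def list_start_tags(html: str):
    """Return a list of (tag_name, attributes) pairs found in the markup."""
    parser = StartTagLister()
    parser.feed(html)
    parser.close()
    return parser.tags


if __name__ == "__main__":
    sample = '<TABLE BORDER="1"><tr><Td>x</Td></tr></TABLE>'
    print(list_start_tags(sample))
```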
General Table Recognition: A complex field of study that seeks to identify tables within documents, typically by a pixel-by-pixel analysis of an image file. Borders and other repeating visual distinctions may indicate the presence of a table.
Character Content: In our context, this refers to the presence of alphabetic characters (A-Za-z) within a cell.
Cell: A region within an HTML-encoded table, which is delimited by an HTML <td> tag. Cells may contain a rich variety of HTML tags and markup in addition to raw data in the form of text.
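One way to separate a cell's raw text from the markup it contains is sketched below using Python's standard html.parser module. This is not the chapter's own tooling; the class and function names are hypothetical. The sketch assumes that every cell has an explicit closing tag and treats only td elements as cells.

```python
from html.parser import HTMLParser


class CellTextCollector(HTMLParser):
    """Collect the plain text of each td cell, discarding nested markup."""

    def __init__(self):
        super().__init__()
        self.cells = []    # finished cell texts, in closing order
        self._open = []    # stack of text buffers, one per open cell

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._open.append([])

    def handle_data(self, data):
        # Text belongs to the innermost open cell, if there is one.
        if self._open:
            self._open[-1].append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._open:
            raw = "".join(self._open.pop())
            # Collapse runs of whitespace left behind by the removed markup.
            self.cells.append(" ".join(raw.split()))


def extract_cell_texts(html: str):
    """Return the text content of every td cell in the markup."""
    collector = CellTextCollector()
    collector.feed(html)
    collector.close()
    return collector.cells
```

Because a stack is used, text inside a nested table is attributed to the innermost cell only, and inner cells appear in the result before the outer cell that contains them.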
Perl Module: A special-purpose, pre-built section of Perl code that is freely available from CPAN.org or other standard Perl-coding Web sites. Modules act as code libraries and allow extended functionality to be added simply and easily to Perl programs. One example is the HTML::TableContentParser module.
Deep Web: A largely untapped region of cyberspace in which Web data is indirectly accessible through the use of query-type human readable interfaces. Typically, the user must enter log-on information or select options before being granted access to the information from the Web site. The need for human interaction restricts the ability of search engines and Web bots to index these sites. The terms invisible Web and hidden Web are also loosely used to describe these regions of cyberspace.