Introduction
The World Wide Web is traditionally viewed as a web of linked documents. A great deal of research effort has been devoted to analyzing and modeling the relationships among individual documents (Page et al., 1999), and even to analyzing semantic relationships between documents in order to obtain more information about the organization of the web (Luo et al., 2009; Luo et al., 2011). From this point of view, documents are usually regarded as atomic units with certain properties, such as individual keyword frequencies. On the other hand, remarkably less effort has been devoted to modeling the information structure within the documents themselves.
WWW documents often present structured information that consists of multiple pieces of data of different kinds together with certain relationships among them. A typical example is a conference programme that consists of speech titles, times, places and author names. However, the relationships are often not explicitly described in the document content. They are expressed by different, mostly visual, means, and the human reader is expected to interpret the visual presentation of the content appropriately in order to assign, for example, the appropriate author and time to a speech title.
Existing approaches to identifying structured information in web documents are usually based on the analysis of a larger set of documents that follow the same presentation guidelines. From a set of sample documents, a set of rules is inferred that may later be applied to other documents following the same guidelines. However, this does not address the very frequent situation in which each document in a set comes from a different author and follows a different presentation style.
Let us consider the two conference programmes presented in Figure 1. Both documents provide information about conference sessions, starting times, and the titles and authors of the individual presentations. However, this information is presented differently in terms of content layout, the order of the individual data fields, colors and other properties, and only a single exemplar of each such document is available. Moreover, the document formats may differ (for example, HTML or PDF documents may be used).
Figure 1. Different presentation styles of conference programmes
Despite the different formats, for a human reader the presented relationships between the content elements remain the same, and they correspond to the structure shown in Figure 2. In this example, both documents assign times and sessions to the individual speech titles and authors. These relationships are presented visually by different font properties, indentation and other means that allow the reader to interpret the relationships without reading the text, or even without understanding the language used. We may expect these logical relationships to be similar in all conference programmes, independently of how they are actually presented. More generally, we may expect that documents presenting data on the same topic will share the same logical relationships between the individual content elements, although they are presented in different ways. To give more examples: published articles present the relationships between their title, authors, date of publication, and even their sections and subsections; timetables represent the relationships between lines, places and times.
Figure 2. Expected logical structure of a conference programme
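To make the idea of a shared logical structure more concrete, the following minimal sketch represents a conference programme as a tree of content elements. It is only an illustration under stated assumptions and not the model proposed in this paper: the class `LogicalNode`, the helper `context_of`, the role labels, the nesting order (session, time, title, author) and all sample values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional


@dataclass
class LogicalNode:
    """One content element placed in a logical hierarchy (hypothetical class)."""
    text: str
    role: str  # assumed label, e.g. "session", "time", "title", "author"
    children: List["LogicalNode"] = field(default_factory=list)
    # The parent link is excluded from repr and comparison to avoid cycles.
    parent: Optional["LogicalNode"] = field(default=None, repr=False, compare=False)

    def add(self, child: "LogicalNode") -> "LogicalNode":
        """Attach a child element and return it, so calls can be chained."""
        child.parent = self
        self.children.append(child)
        return child

    def ancestors(self) -> Iterator["LogicalNode"]:
        """Yield the enclosing elements, from the nearest one up to the root."""
        node = self.parent
        while node is not None:
            yield node
            node = node.parent


def context_of(node: LogicalNode) -> Dict[str, str]:
    """Map the role of each enclosing element to its text."""
    return {a.role: a.text for a in node.ancestors()}


# Made-up sample data; the nesting order is an assumption for illustration.
programme = LogicalNode("Conference programme", "programme")
session = programme.add(LogicalNode("Session 1", "session"))
slot = session.add(LogicalNode("09:00", "time"))
talk = slot.add(LogicalNode("An Example Talk", "title"))
talk.add(LogicalNode("A. Author", "author"))

print(context_of(talk))
```

Under these assumptions, the session and time that belong to a speech title are found by walking up the tree, regardless of how the source document laid them out visually.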
In this paper, we propose a hierarchical logical relationship model that explicitly models the intra-document logical relationships that may be obtained by interpreting the visual presentation of the content. This model has applications in information extraction, retrieval and other areas. Moreover, we address the problem of automatically discovering the logical relationships in a document. We analyze the visual presentation and content features that can be examined in order to obtain the logical relationship model. Lastly, we evaluate the proposed approach on real-world documents and show that it gives comparable results for different documents from various sources.