Using Logic Programming and XML Technologies for Data Extraction from Web Pages

Amelia Badica (University of Craiova, Romania), Costin Badica (University of Craiova, Romania) and Elvira Popescu (University of Craiova, Romania)
Copyright: © 2009 | Pages: 31
DOI: 10.4018/978-1-59904-576-4.ch002

Abstract

The Web is designed as a major information provider for the human consumer. However, information published on the Web is difficult for a machine to understand and reuse. In this chapter, we show how well-established intelligent techniques based on logic programming and inductive learning, combined with more recent XML technologies, might help to improve the efficiency of data extraction from Web pages. Our work can be seen as a necessary step toward solving the more general problem of Web data management and integration.

Introduction

The Web is extensively used for information dissemination to humans and businesses. For this purpose, Web technologies are used to convert data from internal formats, usually specific to database management systems, into presentations suited to attracting human users. However, interest has rapidly shifted toward making that information available for machine consumption, following the realization that Web data can be reused for various problem-solving purposes, including common tasks like searching and filtering as well as more complex tasks like analysis, decision making, reasoning and integration.

For example, in the e-tourism domain one can note an increasing number of travel agencies offering online services through online transaction brokers (Laudon & Traver, 2004). They provide useful information to human users about hotels, flights, trains or restaurants, in order to help them plan their business or holiday trips. Travel information, like most of the information published on the Web, is heterogeneous and distributed, and there is a need to gather, search, integrate and filter it efficiently (Staab et al., 2002) and ultimately to enable its reuse for multiple purposes. In particular, personal assistant agents can integrate travel and weather information to assist and advise humans in planning their weekends and holidays. Another interesting, recently proposed use of data harvested from the Web (Gottlob, 2005) is to feed business intelligence tasks in areas like competitive analysis and intelligence.

Two emergent technologies that have been put forward to enable automated processing of information published on the Web are semantic markup (W3C Semantic Web Activity, 2007) and Web services (Web Services Activity, 2007). However, most current practices in Web publishing are still based on the combination of traditional HTML, the lingua franca of Web publishing (W3C HTML, 2007), with server-side dynamic content generation from databases. Moreover, many Web pages use HTML elements that were originally intended to structure content (e.g., the elements related to tables) for layout and presentation effects, even though this practice is discouraged in theory. Therefore, techniques developed in areas like information extraction, machine learning and wrapper induction are still expected to play a significant role in tackling the problem of Web data extraction.

Data extraction is related to the more general problem of information extraction, which is traditionally associated with artificial intelligence and natural language processing. Information extraction was originally concerned with locating specific pieces of information in text documents written in natural language (Lehnert & Sundheim, 1991) and then using them to populate a database or structured document. The field then expanded to cover extraction tasks from Web documents represented in HTML and attracted other communities, including databases, electronic documents, digital libraries and Web technologies. The content of these data sources can usually be characterized as neither natural language nor fully structured, and therefore the term semi-structured data is used. For these cases, we consider the term data extraction more appropriate than information extraction, and consequently we shall use it in the rest of this chapter.

A wrapper is a program that performs the data extraction task. On one hand, manual creation of Web wrappers is a tedious, error-prone and difficult task because of Web heterogeneity in both structure and content. On the other hand, construction of Web wrappers is a necessary step toward more complex tasks like decision making and integration. Therefore, many techniques for (semi-)automatic wrapper construction have been proposed. Wrapper induction for Web data extraction can be described as a success story for machine learning technologies. For a recent overview of state-of-the-art approaches in the field, see Chang, Kayed, Girgis, and Shaalan (2006).
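
To make the notion of a wrapper concrete, the following is a minimal sketch of a hand-written wrapper in Python. It is an illustration only and is not taken from the chapter: the page fragment, the table id "hotels", the record fields and the function name extract_hotels are all hypothetical, and the fragment is assumed to be well-formed XML so that the standard library parser and its limited XPath support can be used directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML fragment standing in for a travel Web page.
PAGE = """
<html>
  <body>
    <table id="hotels">
      <tr><th>Hotel</th><th>City</th><th>Price</th></tr>
      <tr><td>Hotel Alpha</td><td>Craiova</td><td>55</td></tr>
      <tr><td>Hotel Beta</td><td>Bucharest</td><td>80</td></tr>
    </table>
  </body>
</html>
"""


def extract_hotels(page):
    """Hand-written wrapper: turn the rows of the hotels table into records."""
    root = ET.fromstring(page.strip())
    records = []
    # Select every row of the table whose id attribute is "hotels".
    for row in root.findall(".//table[@id='hotels']/tr"):
        cells = [(td.text or "").strip() for td in row.findall("td")]
        if len(cells) != 3:
            continue  # skip the header row, which contains th cells only
        name, city, price = cells
        records.append({"name": name, "city": city, "price": float(price)})
    return records


hotels = extract_hotels(PAGE)
for hotel in hotels:
    print(hotel)
```

Even this small example shows why manual wrapper construction is fragile: the extraction rule is tied to one table id and one column order, so a change in the page layout breaks it. Real Web pages are also frequently not well-formed XML and would first have to be cleaned up with an HTML parser.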
