Search Engine-Based Web Information Extraction

Gijs Geleijnse (Philips Research, The Netherlands)
Copyright: © 2009 | Pages: 34
DOI: 10.4018/978-1-60566-112-4.ch009

Abstract

In this chapter we discuss approaches to find, extract, and structure information from natural language texts on the Web. Such structured information can be expressed and shared using the standard Semantic Web languages, and hence can be interpreted by machines. We focus on two tasks in Web information extraction: the first part addresses mining facts from the Web, and the second presents an approach to collecting community-based metadata. A search engine is used to retrieve potentially relevant texts, from which instances and relations are extracted. The proposed approaches are illustrated with various case studies, which show that information can be reliably extracted from the Web using simple techniques.
Chapter Preview

Introduction

Suppose we are interested in ‘the countries where Burger King can be found’, ‘the Dutch cities with a university of technology’, or perhaps ‘the genre of the music of Miles Davis’. For such diverse factual information needs, the World Wide Web in general, and a search engine in particular, can provide a solution. Experienced users of search engines are able to construct queries that are likely to retrieve documents containing the desired information. However, current search engines retrieve Web pages, not the information itself. We have to search within the search results in order to acquire the information. Moreover, we make implicit use of our knowledge (e.g. of the language and the domain) to interpret the Web pages.
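The gap between retrieving pages and retrieving information can be illustrated with a minimal sketch. It assumes that an information need such as ‘the countries where Burger King can be found’ is turned into a textual pattern, and that terms following the pattern in the retrieved texts are candidate answers. The snippets, the pattern, and the function name `extract_instances` are hypothetical and serve only as an illustration; no real search engine is queried, and this is not necessarily the method developed in the chapter.

```python
import re

# Hypothetical snippets standing in for search engine results (not real data).
SNIPPETS = [
    "Visit the new Burger King in Germany this weekend.",
    "There is a Burger King in Spain near the station.",
    "Burger King announced a new menu today.",
    "We ate at a Burger King in Germany last year.",
]


def extract_instances(snippets, pattern):
    """Count capitalized terms that directly follow `pattern` in the snippets."""
    # The pattern is matched literally; the candidate is a run of capitalized words.
    regex = re.compile(
        re.escape(pattern) + r"\s+([A-Z][a-z]+(?:\s[A-Z][a-z]+)*)"
    )
    counts = {}
    for snippet in snippets:
        for match in regex.finditer(snippet):
            term = match.group(1)
            counts[term] = counts.get(term, 0) + 1
    return counts


counts = extract_instances(SNIPPETS, "Burger King in")
print(counts)
```

Counting how often a candidate occurs gives a crude indication of how much support the retrieved texts provide for it. The knowledge of language and domain that a human reader applies implicitly is reduced here to a single capitalization heuristic, which shows how much interpretation such a simple sketch leaves out.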

Apart from factual information, the Web is the de facto source for gathering community-based data, as people with numerous backgrounds, interests, and ideas contribute to its content. Hence the Web is a valuable source from which to extract opinions, characterizations, and perceived relatedness between items.

In this chapter, the focus is on gathering and structuring information from the ‘traditional’ Web. This structured information can be represented (and shared) using the standard Semantic Web (SW) languages. Hence, this chapter focuses on the automatic creation of content for the SW. For simplicity, we abstract from the SW standards RDF(S)/OWL.
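As a rough illustration of what such structured information might look like, the sketch below stores extracted facts as (subject, predicate, object) triples and writes them out as N-Triples-style lines. The namespace `http://example.org/`, the predicate names, and the function `to_ntriples` are assumptions made for this example only; the chapter itself abstracts from the concrete SW standards.

```python
# Illustrative namespace and identifiers; none of these come from the chapter.
BASE = "http://example.org/"

# Facts as (subject, predicate, object) triples.
facts = [
    ("Miles_Davis", "hasGenre", "Jazz"),
    ("Burger_King", "locatedIn", "Germany"),
]


def to_ntriples(triples, base=BASE):
    """Serialize triples as N-Triples-style lines, one statement per line."""
    return [f"<{base}{s}> <{base}{p}> <{base}{o}> ." for s, p, o in triples]


lines = to_ntriples(facts)
for line in lines:
    print(line)
```

Keeping the facts as plain tuples separates the extraction step from the choice of serialization, so the same triples could later be exported in whichever SW language is preferred.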
