Utilizing Past Web for Knowledge Discovery

Adam Jatowt (Kyoto University, Japan), Yukiko Kawai (Kyoto Sangyo University, Japan) and Katsumi Tanaka (Kyoto University, Japan)
Copyright: © 2009 | Pages: 19
DOI: 10.4018/978-1-59904-576-4.ch017

Abstract

The Web is a useful data source for knowledge extraction, as it provides diverse content on virtually any topic. Hence, much recent research has aimed at improving Web mining. However, relatively little of this research directly takes into account the temporal aspects of the Web. In this chapter, we analyze data stored in Web archives, which preserve the content of the Web, and investigate the methodology required for successful knowledge discovery from this data. We call the collection of such Web archives the past Web: a temporal structure composed of past copies of Web pages. First, we discuss the character of the data and explain some concepts related to utilizing the past Web, such as data collection, analysis and processing. Next, we introduce two example applications: temporal summarization and a browser for the past Web.
Chapter Preview

Introduction

As the Web changes continuously, it is necessary to preserve the past content of pages for future reuse. The Internet Archive is the best-known and largest public Web archive, containing data crawled since 1996. Other Web archives also exist, for example, ones containing Web pages from particular countries (e.g., Arvidson, Persson, & Mannerheim, 2000; Hallgrimsson & Bang, 2003). In addition, there are numerous other repositories of past copies of pages, such as caches, site archives, personal page repositories or search engine caches.

Web archives provide a view of the history of the Web, reflecting past states of society. The past content of pages can reveal the histories of the underlying elements that these pages represent, such as institutions, companies, people or other entities. For example, one could approximately detect when a particular member left a laboratory by finding the time point at which her or his name was removed from the laboratory’s list of personnel. In general, the use of Web archives can greatly benefit researchers and practitioners in many areas, such as history, sociology or marketing.
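The laboratory example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used in the chapter: the snapshots are assumed to be already available as (capture date, page text) pairs, the function name `removal_interval` is hypothetical, and matching is done by a plain case-insensitive substring test. Because archives capture pages only at intervals, the result is a time window rather than an exact date.

```python
from datetime import date
from typing import Optional, Sequence, Tuple

# Hypothetical snapshot layout: (capture date, plain text of the archived page).
Snapshot = Tuple[date, str]


def removal_interval(
    snapshots: Sequence[Snapshot], name: str
) -> Optional[Tuple[date, date]]:
    """Return (last capture containing `name`, first later capture without it).

    Returns None if the name never appears or is never removed. If the name
    disappears more than once, only the first removal is reported.
    """
    needle = name.lower()
    last_seen: Optional[date] = None
    # Process captures in chronological order, whatever order they arrive in.
    for captured, text in sorted(snapshots, key=lambda s: s[0]):
        if needle in text.lower():
            last_seen = captured
        elif last_seen is not None:
            # The name was present earlier and is now gone: removal detected.
            return last_seen, captured
    return None


# Made-up snapshots of a personnel page, for illustration only.
example = [
    (date(2005, 1, 10), "Staff: Alice, Bob"),
    (date(2005, 6, 2), "Staff: Alice, Bob"),
    (date(2006, 1, 15), "Staff: Bob"),
]
print(removal_interval(example, "Alice"))
```

The returned pair bounds the removal between two consecutive captures; the denser the archive's captures of the page, the tighter that bound becomes.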

Furthermore, analyzing information from the past can help us understand not only the history of our society but also its present state. This is because Web archives can provide contextual information about Web pages and about the objects or concepts discussed on them, as well as their inter-relations. For example, we can analyze information from Web archives concerning a given company and use it as context for better understanding the present information about this company. In general, mining past Web content has the potential to stimulate and improve the traditional Web mining process, in the sense that it provides contextual information and sheds new light on present data.

The past Web is considered here as the part of the WWW space where pages no longer have any potential to change; they are “frozen” past snapshots of pages. The live Web, on the other hand, is the present Web, containing pages that we can currently view online. These pages may be changed or updated, and they usually provide full interaction capabilities.

In the past Web, each page has its own history and lifetime. Links between the old content of pages can be reactivated. In this way, a temporal structure can be obtained that reflects the connectivity between pages in the past. Another aspect of the past Web is missing data. Content that has been deleted from a page may never be recovered if it has not been preserved in any repository. In addition, due to the rapid growth of the Web, archiving often has to be selective.

In this chapter, we approach the problem of discovering knowledge from the past Web. First, we discuss the character of the data that is used and the methods for acquiring and processing it. We propose techniques for analyzing and selecting candidate Web pages for mining. This approach is based on analyzing long-term characteristics of pages, with a special focus on their content changes, as these are most interesting from the viewpoint of the pages’ evolution. Next, we introduce temporal summarization, which is an adaptation of a traditional text mining task to the past Web scenario. We propose summarizing the histories of Web pages to generate an abstraction of the events and salient concepts described in selected portions of the past Web. We also discuss the possibility of discovering object histories in the past content of Web documents. Finally, we describe an application for browsing and navigating the past Web. We show an implementation that is similar to traditional browsers for the live Web and to video players.

The rest of this chapter is organized as follows. The next section discusses related research and attempts to place this work in the wider context of text and Web mining. The two sections after that describe data accumulation, preparation and analysis. We then discuss temporal summarization and investigate the possibility of object history detection from the past Web. This is followed by a section describing a browser for the past Web, and the last section concludes the chapter with a brief summary.

Complete Chapter List

Table of Contents

Preface
Dariusz Król, Ngoc Thanh Nguyen

Chapter 1: Logical Inference Based on Incomplete and/or Fuzzy Ontologies
Juliusz L. Kulikowski
In this chapter, a concept of using incomplete or fuzzy ontologies in decision making is presented. A definition of ontology and of ontological...

Chapter 2: Using Logic Programming and XML Technologies for Data Extraction from Web Pages
Amelia Badica, Costin Badica, Elvira Popescu
The Web is designed as a major information provider for the human consumer. However, information published on the Web is difficult to understand and...

Chapter 3: A Formal Analysis of Virtual Enterprise Creation and Operation
Andreas Jacobsson, Paul Davidsson
This chapter introduces a formal model of virtual enterprises, as well as an analysis of their creation and operation. It is argued that virtual...

Chapter 4: Application of Uncertain Variables to Knowledge-Based Resource Distribution
Donat Orski
The chapter concerns a class of systems composed of operations performed with the use of resources allocated to them. In such operation systems...

Chapter 5: A Methodology of Design for Virtual Environments
Clive Fencott
This chapter undertakes a methodological study of virtual environments (VEs), a specific subset of interactive systems. It takes as a central theme...

Chapter 6: An Ontological Representation of Competencies as Codified Knowledge
Salvador Sanchez-Alonso, Dirk Frosch-Wilke
In current organizations, the models of knowledge creation include specific processes and elements that drive the production of knowledge aimed at...

Chapter 7: Aspects of Openness in Multi-Agent Systems: Coordinating the Autonomy in Agent Societies
Marcos De Oliveira, Martin Purvis
In the distributed multi-agent systems discussed in this chapter, heterogeneous autonomous agents interoperate in order to achieve their goals. In...

Chapter 8: How Can We Trust Agents in Multi-Agent Environments? Techniques and Challenges
Kostas Kolomvatsos, Stathes Hadjiefthymiades
The field of Multi-agent systems (MAS) has been an active area for many years due to the importance that agents have to many disciplines of research...

Chapter 9: The Concept of Autonomy in Distributed Computation and Multi-Agent Systems
Mariusz Nowostawski
The concept of autonomy is one of the central concepts in distributed computational systems, and in multi-agent systems in particular. With diverse...

Chapter 10: An Agent-Based Library Management System Using RFID Technology
Maryam Purvis, Toktam Ebadi, Bastin Tony Roy Savarimuthu
The objective of this research is to describe a mechanism to provide an improved library management system using RFID and agent technologies. One of...

Chapter 11: Mechanisms to Restrict Exploitation and Improve Societal Performance in Multi-Agent Systems
Sharmila Savarimuthu, Martin Purvis, Maryam Purvis, Mariusz Nowostawski
Societies are made of different kinds of agents, some cooperative and uncooperative. Uncooperative agents tend to reduce the overall performance of...

Chapter 12: Norm Emergence in Multi-Agent Societies
Bastin Tony Roy Savarimuthu, Maryam Purvis, Stephen Cranefield
Norms are shared expectations of behaviours that exist in human societies. Norms help societies by increasing the predictability of individual...

Chapter 13: Multi-Agent Systems Engineering: An Overview and Case Study
Scott A. DeLoach, Madhukar Kumar
This chapter provides an overview of the Multi-agent Systems Engineering (MaSE) methodology for analyzing and designing multi-agent systems. MaSE...

Chapter 14: Modeling, Analysing, and Control of Agents Behaviour
František Capkovic
An alternative approach to modeling and analysis of agents’ behaviour is presented in this chapter. The agents and agent systems are understood here...

Chapter 15: Using Fuzzy Segmentation for Colour Image Enhancement of Computed Tomography Perfusion Images
Martin Tabakov
This chapter presents a methodology for an image enhancement process of computed tomography perfusion images by means of partition generated with...

Chapter 16: Fuzzy Mediation in Shared Control and Online Learning
Giovanni Vincenti, Goran Trajkovski
This chapter presents an innovative approach to the field of information fusion. Fuzzy mediation differentiates itself from other algorithms, as...

Chapter 17: Utilizing Past Web for Knowledge Discovery
Adam Jatowt, Yukiko Kawai, Katsumi Tanaka
The Web is a useful data source for knowledge extraction, as it provides diverse content virtually on any possible topic. Hence, a lot of research...

Chapter 18: Example-Based Framework for Propagation of Tasks in Distributed Environments
Dariusz Król
In this chapter, we propose a generic framework in C# to distribute and compute tasks defined by users. Unlike the more popular models such as...

Chapter 19: Survey on the Application of Economic and Market Theory for Grid Computing
Xia Xie, Jin Huang, Song Wu, Hai Jin, Melvin Koh, Jie Song, Simon See
In this chapter, we present a survey on some of the commercial players in the Grid industry, existing research done in the area of market-based Grid...

About the Contributors