Supporting Text Retrieval by Typographical Term Weighting

Lars Werner (University of Paderborn, Germany)
DOI: 10.4018/978-1-60566-144-5.ch011

Abstract

Text documents stored in information systems usually consist of more information than the pure concatenation of words, i.e., they also contain typographic information. Because conventional text retrieval methods evaluate only the word frequency, they miss the information provided by typography, e.g., regarding the importance of certain terms. In order to overcome this weakness, we present an approach which uses the typographical information of text documents and show how this improves the efficiency of text retrieval methods. Our approach uses weighting of typographic information in addition to term frequencies for separating relevant information in text documents from the noise. We have evaluated our approach on the basis of automated text classification algorithms. The results show that our weighting approach achieves very competitive classification results using at most 30% of the terms used by conventional approaches, which makes our approach significantly more efficient.

Introduction

Text documents combine textual and typographical information. However, since Luhn (1958), information retrieval (IR) algorithms have used only term frequency to measure the significance of text; most common IR methods ignore the typographic information that the texts also contain. Typographic information includes the choice of character fonts, character sizes and styles, line length, text alignment, and the type area within the paper format.

Authors use typographical information in their texts to make them more readable. Therefore, we follow the arguments of Apté et al. (1994), Cutler et al. (1997), Kim and Zhang (2000), and Kwon and Lee (2000) that typographical information may help to classify or to better understand the meaning of texts, which results in the following hypothesis that can be regarded as an extension to Luhn’s thesis:

The justification of measuring word significance by typography is based on the fact that a writer normally uses certain typographic styles to clarify his argumentation and the description of certain facts.

In order to verify our hypothesis, we have implemented our ideas within the VKC1 document management system. To evaluate the classification quality of our approach, we have used two public data sets of the World Wide Knowledge Base (Web-Kb) project2, which contain HTML documents with typographical information, and our own selection of publications in PDF format from the ACM Digital Library3. The evaluation shows that classification algorithms that consider typographic information allow the considered term set to be reduced, thereby significantly improving the efficiency of automated document classification.

The remainder of the article is organized as follows. The second section describes related work. The third section outlines our previous HTML tag-based typographical weighting approach, and the fourth section describes our catalogue evaluation scenario and summarizes the performance results of the tag-based approach. In the fifth section we describe our new general typography-based weighting approach, which we evaluate in the sixth section. The seventh section gives a summary and the conclusions.

Related Work

Apté, Damerau and Weiss (1994) presented the first typographic term weighting approach for text classification. They measured the classification quality on the “Reuters-21578 text categorization test collection”4 and demonstrated that counting the terms of the news titles twice yields an improvement of nearly 2% (precision/recall break-even point).
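The title-boosting idea can be illustrated with a minimal Python sketch. It is not the implementation of Apté, Damerau and Weiss; the tokenizer, the function name `title_boosted_counts`, the `title_factor` parameter, and the assumption that the title is not repeated in the body are all illustrative choices.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def title_boosted_counts(title, body, title_factor=2):
    """Count term frequencies, counting each title occurrence `title_factor` times.

    With title_factor=2, every term occurrence in the title counts twice,
    while every occurrence in the body counts once.
    """
    counts = Counter(tokenize(body))
    for term in tokenize(title):
        counts[term] += title_factor
    return counts


if __name__ == "__main__":
    counts = title_boosted_counts(
        "Oil prices rise", "Prices of crude oil rose again"
    )
    for term, freq in counts.most_common():
        print(term, freq)
```

The resulting counts can be passed to any frequency-based classifier in place of plain term frequencies; the boost factor is the only extra parameter.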

Cutler, Shih and Meng (1997) were the first to suggest an absolute weighting scheme for HTML tags. By weighting words according to the weight of their enclosing tag (cf. Table 1), they increased the average precision of their IR system by nearly 7%.

Table 1. Absolute term weighting table by Cutler, Shih and Meng

HTML Tag                                             Tag Weight
<a>                                                  1
<h1>, <h2>                                           8
<h3>, <h4>, <h5>, <h6>                               1
<strong>, <b>, <em>, <i>, <u>, <dl>, <ol>, <ul>      1
<title>                                              0
Remaining tags and normal text                       1
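An absolute tag weighting scheme of this kind can be sketched in Python as follows. This is a hedged illustration, not the system of Cutler, Shih and Meng: the weights are copied from Table 1 as printed, while the class name `TagWeightedCounter`, the function `weighted_term_scores`, the tokenizer, and the rule that a term takes the weight of its innermost enclosing tag listed in the table are assumptions made for the example.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Weights copied from Table 1 as printed; anything not listed gets DEFAULT_WEIGHT.
TAG_WEIGHTS = {
    "a": 1,
    "h1": 8, "h2": 8,
    "h3": 1, "h4": 1, "h5": 1, "h6": 1,
    "strong": 1, "b": 1, "em": 1, "i": 1, "u": 1, "dl": 1, "ol": 1, "ul": 1,
    "title": 0,
}
DEFAULT_WEIGHT = 1


class TagWeightedCounter(HTMLParser):
    """Accumulate a weighted frequency for every term in an HTML document."""

    def __init__(self, tag_weights=TAG_WEIGHTS, default_weight=DEFAULT_WEIGHT):
        super().__init__()
        self.tag_weights = tag_weights
        self.default_weight = default_weight
        self.open_tags = []               # stack of currently open tags
        self.scores = defaultdict(float)  # term -> weighted frequency

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the innermost matching tag and anything left open inside it.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i:]
                break

    def handle_data(self, data):
        # Assumption: the innermost enclosing tag listed in the table decides the weight.
        weight = self.default_weight
        for tag in reversed(self.open_tags):
            if tag in self.tag_weights:
                weight = self.tag_weights[tag]
                break
        for term in re.findall(r"[a-z0-9]+", data.lower()):
            self.scores[term] += weight


def weighted_term_scores(html):
    """Return a dict mapping each term to its tag-weighted frequency."""
    parser = TagWeightedCounter()
    parser.feed(html)
    parser.close()
    return dict(parser.scores)


if __name__ == "__main__":
    page = "<html><body><h1>Typography</h1><p>typography helps</p></body></html>"
    print(weighted_term_scores(page))
```

Because the weights live in a plain dictionary, other weighting tables can be tried by passing a different `tag_weights` mapping to `TagWeightedCounter`.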
