A Cognitive-Based Approach to Identify Topics in Text Using the Web as a Knowledge Source

Louis Massey, Wilson Wong
DOI: 10.4018/978-1-60960-625-1.ch004

Abstract

This chapter explores the problem of topic identification from text. It is first argued that the conventional representation of text as bag-of-words vectors will have only limited success in arriving at the underlying meaning of text until the more fundamental issues of feature independence in vector space and the ambiguity of natural language are addressed. Next, a groundbreaking approach to text representation and topic identification is proposed, one that deviates radically from current techniques used for document classification, text clustering, and concept discovery. This approach is inspired by human cognition, and allows ‘meaning’ to emerge naturally from the activation and decay of unstructured text information retrieved from the Web. This paradigm shift exploits, rather than avoids, the dependence between terms to derive meaning without the complexity introduced by conventional natural language processing techniques. Using the unstructured text in Web pages as a source of knowledge alleviates the laborious handcrafting of formal knowledge bases and ontologies required by many existing techniques. Some initial experiments have been conducted, and the results are presented in this chapter to illustrate the power of this new approach.
Chapter Preview

Introduction

It has become something of a cliché to say that a large quantity of human knowledge is stored as unstructured electronic text. The cliché nevertheless reflects reality in corporations, governments and even everyday life. Indeed, we are plagued by an increasing dependence on an ever-growing body of information on the Web. Common means to date for managing this information explosion include online directories and automated search engines, all of which rely heavily on the notion of topics. In this chapter, topics are keywords that represent and convey the themes or concepts addressed in a text document. In this regard, topics can be seen as lexical manifestations of the general meaning of documents.

The main issues that prevent existing computational methods from generating content-representative topics for managing information on a Web scale are: (1) computational inefficiency; (2) the knowledge acquisition and training data bottleneck; and (3) the inherent challenges of processing natural language, such as handling ambiguity and metaphor. Existing computational methods bridge the semantic gap between the words of a text and its meaning by exploiting knowledge handcrafted by human experts. Natural language processing, for example, depends on linguistic and encyclopedic knowledge for syntactic and semantic processing, while supervised learning techniques rely on human guidance to classify documents. This acquisition bottleneck in turn leads to major scalability and robustness issues. Ideally, one would like a computational method that identifies topics without depending on any form of human intervention. In this regard, the desirable properties of such a system are autonomy and adaptability.

In this chapter, we present a computational method that is free of any dependence on expert-crafted knowledge resources or training data. This cognition-inspired paradigm for generating topics takes the stream of words from a single document and determines the main themes addressed in that document based on the overlapping activation and decay of unstructured lexical information. The lexical information is retrieved from the Web by querying Web search engines. The approach exploits the information embedded in the ordering of words, but without traditional syntactic processing.
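
The idea of activation and decay over a word stream can be illustrated with a small sketch. This is not the authors' model, whose details are not given in this preview. It is a minimal toy under stated assumptions: the function names, the `decay` and `boost` parameters, and the update rule are all hypothetical, and a hard-coded dictionary stands in for the lexical information that the chapter retrieves from Web search engines.

```python
from collections import defaultdict

# ASSUMPTION: toy stand-in for lexical information retrieved from the Web.
# A real system would query a search engine for each word instead.
TOY_RELATED_TERMS = {
    "bank": ["money", "river", "loan"],
    "loan": ["money", "interest", "bank"],
    "interest": ["money", "rate", "loan"],
    "river": ["water", "bank"],
}


def related_terms(word):
    """Hypothetical lookup: terms associated with `word` (empty if unknown)."""
    return TOY_RELATED_TERMS.get(word, [])


def identify_topics(words, decay=0.5, boost=1.0, top_n=3):
    """Rank terms by activation after reading `words` in order.

    Before each new word is read, every existing activation is multiplied
    by `decay`; every term related to the new word then gains `boost`.
    Terms that keep being re-activated by successive words therefore
    accumulate the highest scores, and word order affects the result.
    """
    activation = defaultdict(float)
    for word in words:
        # Decay: older evidence fades as the stream advances.
        for term in list(activation):
            activation[term] *= decay
        # Activation: terms related to the current word are reinforced.
        for term in related_terms(word):
            activation[term] += boost
    # Sort by descending activation, breaking ties alphabetically.
    ranked = sorted(activation.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_n]]


if __name__ == "__main__":
    print(identify_topics(["bank", "loan", "interest"]))
```

In this toy run, "money" is related to all three input words, so its activation overlaps across the stream and it ranks first, while "river" is activated only once by the ambiguous word "bank" and fades. No term is treated as independent of its neighbours, which is the contrast with a bag-of-words vector that the chapter draws.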

The chapter is organized as follows. Section 2 presents a case study that illustrates some of the problems with existing topic identification methods and with the vector representation of documents. In Section 3, we introduce the fundamentals of the proposed approach to representing text and identifying topics in documents. We then present and discuss the results obtained using a prototype computational model in Section 4. We conclude the chapter with an outlook on future work in Section 5.
