Information Fusion for Scientific Literature Classification

Gary G. Yen (Oklahoma State University, USA)
Copyright: © 2009 | Pages: 11
DOI: 10.4018/978-1-60566-010-3.ch159

Abstract

Scientific literature can be organized to serve as a roadmap for researchers, showing where the scientific community has been and where it is heading. It presents historic and current state-of-the-art knowledge in a given area of study. It also documents valuable information, including author lists, affiliated institutions, citation information, and keywords, which can be used to extract further information that assists in analyzing the content of documents and their relationships with one another. However, the rapidly growing volume of literature and the increasing diversity of research fields have become a major concern, especially for the organization, analysis, and exploration of such documents. This chapter proposes an automatic scientific literature classification method (ASLCM) that uses different kinds of information extracted from the literature to organize and present it in a structured manner. In the proposed ASLCM, multiple kinds of similarity information are extracted from all available sources and fused, using a genetic algorithm, to give an optimized and more meaningful classification. The final result is used to identify the different research disciplines within the collection, their emergence and termination, major collaborators, centers of excellence, their influence, and the flow of information among the multidisciplinary research areas.

Background

In addition to the body content, which is sometimes hard to analyze by computer, scientific literature incorporates essential information such as title, abstract, authors, references, and keywords that can be exploited to assist in the analysis and organization of a large collection (Singh, Mittal & Ahmad, 2007; Guo, 2007). Such analysis and organization is helpful when dealing with a large collection of articles, where the goal is efficient presentation, visualization, and exploration in order to find hidden information and useful connections within the collection. It can also serve as a historic roadmap for sketching the past flow of information and as a tool for forecasting possible emerging technologies. The ASLCM proposed in this study uses the above-mentioned types of information, which are available in most scientific literature, to achieve an efficient classification and presentation of a large collection.

Many digital libraries and search engines use title, author, keyword, or citation information for indexing and cataloging. Word-hit-based cataloging and retrieval using such information tends to miss related literature that does not contain the specified phrase or keyword, requiring the user to try several different queries to obtain the desired search result. In this chapter, these different information sources are fused to give an optimized, well-rounded view of a particular literature collection so that related literature can be grouped and identified easily.

Title, abstract, author, keyword, and reference list are among the most common elements documented in typical scientific literature, such as a journal article. These sources of information can be used to characterize or represent a document in a unique and meaningful way when performing computation for information retrieval tasks such as search, cataloging, or organization. However, most of the methods that have been developed (Lawrence, Giles & Bollacker, 1999; Morris & Yen, 2004; White & McCain, 1998; Berry, Drmac & Jessup, 1999) use only one of these sources, producing results that focus on only one aspect of the collection. For example, using reference or citation information leads to a good understanding of the flow of information within the literature collection, because most documents link, through their reference lists, to the base knowledge they build on. Similarly, information extracted from author lists can lead to a good understanding of the author collaboration groups within the community, along with their areas of expertise. This concept extends analogously to the other types of information provided by scientific literature.
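As a rough illustration of how more than one information source might contribute to a single view of a collection, the following Python sketch computes one pairwise similarity matrix from reference lists and another from author lists, then combines them. It is only a sketch under stated assumptions and not the chapter's method: the Jaccard measure, the weighted-average fusion rule, the hand-picked weights, and the toy `papers` data are all hypothetical choices made for this example. In the ASLCM the fusion is optimized with a genetic algorithm, which is not reproduced here.

```python
from typing import Dict, List, Set

Matrix = List[List[float]]


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_matrix(items: List[Set[str]]) -> Matrix:
    """Pairwise Jaccard similarity for one information source."""
    n = len(items)
    return [[jaccard(items[i], items[j]) for j in range(n)] for i in range(n)]


def fuse(matrices: Dict[str, Matrix], weights: Dict[str, float]) -> Matrix:
    """Weighted average of per-source similarity matrices.

    Assumption: a fixed linear combination stands in for the
    genetic-algorithm-based fusion mentioned in the chapter.
    """
    total = sum(weights[name] for name in matrices)
    n = len(next(iter(matrices.values())))
    return [
        [
            sum(weights[name] * m[i][j] for name, m in matrices.items()) / total
            for j in range(n)
        ]
        for i in range(n)
    ]


# Hypothetical toy collection: three papers with author and reference sets.
papers = [
    {"authors": {"A", "B"}, "refs": {"r1", "r2", "r3"}},
    {"authors": {"B", "C"}, "refs": {"r2", "r3", "r4"}},
    {"authors": {"D"}, "refs": {"r9"}},
]

sims = {
    "authors": similarity_matrix([p["authors"] for p in papers]),
    "refs": similarity_matrix([p["refs"] for p in papers]),
}
weights = {"authors": 0.4, "refs": 0.6}  # hand-picked, for illustration only

fused = fuse(sims, weights)
for row in fused:
    print([round(value, 3) for value in row])
```

In this toy data the first two papers share one author and two references, so they score as similar under both sources, while the third paper shares nothing with either. A clustering step applied to `fused` would therefore group the first two papers together.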

The theory behind the proposed ASLCM can be summarized and stated as follows:
