Dynamic Clustering Based on Minimum Spanning Tree and Context Similarity for Enhancing Document Classification

Anirban Chakrabarty, Sudipta Roy

Source Title: International Journal of Information Retrieval Research (IJIRR) 4(1)

DOI: 10.4018/ijirr.2014010103

Abstract

Document classification is the task of assigning a text document to one or more predefined categories according to its content and the labeled training samples. Traditional classification schemes use all training samples for classification, thereby increasing storage requirements and computational complexity as the number of features increases. Moreover, commonly used classification techniques assume that the number of categories is known in advance, which may not hold in practice. In practical scenarios, it is essential to determine the number of clusters for an unknown data set dynamically. To address these limitations, the proposed work develops a text clustering algorithm in which clusters are generated dynamically from a minimum spanning tree that incorporates semantic features. The proposed model can efficiently find the significant matching concepts between documents and can perform multi-category classification. The formal analysis is supported by applications to email and cancer data sets. Cluster quality and accuracy were compared with those of several widely used text clustering techniques, and the comparison showed the efficiency of the proposed approach.

Introduction

In the digital era, the rapid growth in the volume of text documents available from sources such as the Internet, digital libraries, and medical records has spurred the need to retrieve, navigate, and organize information effectively. The ultimate goal is to help users find what they are looking for effortlessly and make suitable decisions. In this context, fast and high-quality document clustering algorithms play a major role. Most common techniques in text retrieval are based on the statistical analysis of terms, i.e., words or phrases. Such text retrieval methods rely on the vector space model (VSM), a widely used data representation. The VSM represents each document as a feature vector of its terms, expressed as term frequencies or term weights (Salton et al., 1975). The similarity between documents is measured by one of several similarity measures defined on these feature vectors. Examples include the cosine measure and the Jaccard measure (Schaeffer, 2007). Metric distances such as Euclidean distance are not appropriate for high-dimensional, sparse domains. Most conventional measures estimate the surface overlap between documents based on the words they mention and ignore deeper semantic connections. To achieve a more accurate analysis, the underlying model should capture the semantics of the text. Conceptual information retrieval processes each document at the semantic level to form a concept base and then retrieves relevant information from it to provide search results.
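The vector space model and the two similarity measures named above can be illustrated with a minimal sketch. The whitespace tokenizer, the function names, and the two sample sentences are assumptions made for this example and do not come from the paper. The example also shows the limitation the paragraph describes: "tumor" and "cancer" contribute nothing to the score because only surface terms are compared.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Build a sparse term-frequency vector (naive lowercase whitespace tokenizer)."""
    return Counter(text.lower().split())

def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    shared = tf_a.keys() & tf_b.keys()
    dot = sum(tf_a[t] * tf_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def jaccard_similarity(tf_a, tf_b):
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    terms_a, terms_b = set(tf_a), set(tf_b)
    union = terms_a | terms_b
    if not union:
        return 0.0
    return len(terms_a & terms_b) / len(union)

# Hypothetical documents: three of five terms match in each sentence.
doc_a = term_frequencies("the tumor responded to treatment")
doc_b = term_frequencies("the cancer responded to therapy")

print(cosine_similarity(doc_a, doc_b))   # approximately 0.6
print(jaccard_similarity(doc_a, doc_b))  # 3 shared terms out of 7 distinct terms
```

Both scores depend only on the terms the two documents literally share, which is why a concept-level representation is needed to recognize that the two sentences describe nearly the same situation.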

Clustering methods can be used to automatically group the retrieved documents into a list of meaningful categories. Methods used for text clustering include decision trees, contextual clustering, clustering based on data summarization, statistical analysis, neural nets, and rule-based systems, among others (Nahm & Mooney, 2000; Talavera & Bejar, 2001; Jin et al., 2005). Most clustering techniques assume that the number of clusters is fixed, which can result in poor-quality clustering. Dynamic document clustering is the process of inserting newly arrived documents into the appropriate existing cluster, so that existing clusters need not be relocated and the time and effort required for clustering are drastically reduced (Wang et al., 2011; Nadig et al., 2008). Figure 1 depicts a model for dynamic document clustering.

Figure 1. A model for dynamic document clustering

A spanning tree is an acyclic subgraph of a graph G that contains all vertices of G and is also a tree. The minimum spanning tree (MST) of a weighted graph is the spanning tree of that graph with minimum total weight (Edla & Jana, 2013). The MST clustering algorithm is known to be capable of detecting clusters with irregular boundaries. Moreover, the MST is relatively insensitive to small amounts of noise spread over the field (Zahn, 1971). Thus the shape of a cluster boundary has little impact on the performance of the algorithm. The proposed approach does not require a preset number of clusters. Edges that satisfy a predefined inconsistency measure are removed from the tree. The process is iterated until the edge list no longer changes and all data are clustered.
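The general idea of MST-based clustering can be sketched as follows. This is not the authors' algorithm: the inconsistency measure used here (an edge is inconsistent when its weight exceeds the mean MST edge weight plus `factor` population standard deviations) is an assumption chosen for illustration, the sketch makes a single removal pass rather than iterating, and all function names and the sample data are hypothetical.

```python
import statistics

def minimum_spanning_tree(dist):
    """Prim's algorithm on a dense symmetric distance matrix; returns (u, v, weight) edges."""
    n = len(dist)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [float("inf")] * n  # cheapest known edge linking each vertex to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the vertex outside the tree that is cheapest to attach.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] != -1:
            edges.append((parent[u], u, dist[parent[u]][u]))
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u
    return edges

def mst_clusters(dist, factor=1.0):
    """Cut inconsistent MST edges (assumed rule: weight > mean + factor * std) and return components."""
    n = len(dist)
    edges = minimum_spanning_tree(dist)
    weights = [w for _, _, w in edges]
    if len(weights) > 1:
        threshold = statistics.mean(weights) + factor * statistics.pstdev(weights)
    else:
        threshold = float("inf")
    root = list(range(n))  # union-find forest over the vertices

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]  # path halving
            x = root[x]
        return x

    for u, v, w in edges:
        if w <= threshold:  # keep only consistent edges
            root[find(u)] = find(v)
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

# Hypothetical one-dimensional data with two well-separated groups.
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
dist = [[abs(a - b) for b in points] for a in points]
print(mst_clusters(dist))  # [[0, 1, 2], [3, 4, 5]]
```

The number of clusters is never passed in: it falls out of how many edges the inconsistency rule removes, which is the property the paragraph above relies on. For documents, `dist` could hold a dissimilarity such as one minus a similarity score, though that choice is likewise an assumption here.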

The paper proposes a context-based retrieval method that operates at the sentence, document, and corpus levels to enhance the quality of text retrieval. More specifically, it quantifies how closely concepts relate to each other and integrates this into a document similarity measure. As a result, documents do not have to mention the same words to be judged similar. To demonstrate its feasibility, the suggested clustering technique is applied to two different data sets: email messages and cancer data. A major contribution of this work is that clusters are developed dynamically based on the areas of interest of email users; when applied to cancer data sets, the technique can classify patients into different treatment clusters based on age group. The work introduces a text classification algorithm that allows incremental and multi-label classification by comparing each document with the context pool, i.e., the most significant concepts in the cluster.
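A rough sketch of incremental, multi-label assignment against per-cluster context pools is given below. The preview does not specify how the context pool is built or compared, so everything here is an assumption for illustration: pools are plain sets of terms, matching is a simple term-overlap count, and the names `assign`, `pools`, and `min_overlap` are hypothetical.

```python
def assign(document, pools, min_overlap=2):
    """Return every cluster label whose context pool shares enough terms with the document.

    If no pool matches, a new cluster is created from the document's own terms,
    so the number of clusters grows as unfamiliar documents arrive.
    """
    terms = set(document.lower().split())
    labels = sorted(
        label for label, pool in pools.items()
        if len(terms & pool) >= min_overlap
    )
    if not labels:
        new_label = f"cluster_{len(pools)}"  # hypothetical naming scheme
        pools[new_label] = terms
        labels = [new_label]
    return labels

# Hypothetical context pools for two existing email clusters.
pools = {
    "billing": {"invoice", "payment", "refund"},
    "travel": {"flight", "hotel", "booking"},
}

print(assign("refund the payment for my hotel booking", pools))  # ['billing', 'travel']
print(assign("quarterly sales report", pools))                   # ['cluster_2']
```

The first message lands in two clusters at once (multi-label), and the second opens a new cluster without touching the existing ones (incremental). A concept-level comparison, as the paper describes, would replace the raw term overlap used in this sketch.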
