Multi-Step Iterative Algorithm for Feature Selection on Dynamic Documents

Prafulla Bharat Bafna, Shailaja Shirwaikar, Dhanya Pramod

Source Title: International Journal of Information Retrieval Research (IJIRR) 6(2)

DOI: 10.4018/IJIRR.2016040102

Abstract

The authors propose a clustering-based multistep iterative algorithm. Its key step groups terms by synonyms, taking advantage of a semantic relativity measure between the terms. The term frequency of a synonym group is computed by considering, for each term of the group that appears in the document, its relativity measure with respect to the parent term of the group. This increases the importance of terms that individually appear infrequently but together show a strong presence. The authors ran experiments on different real and artificial datasets, such as NEWS 20, Reuters, emails, and research papers on different topics. The resulting entropy shows that the algorithm gives improved results on well-articulated documents, such as research papers. The improvement is marginal on documents in which the message is emphasized by repetition of terms, specifically rapidly generated documents such as emails. The authors also observed that newly arrived documents are appropriately mapped based on their proximity to the semantic group.

Introduction

Managing growing repositories of unstructured or semi-structured documents in an organization is becoming increasingly difficult. The size and number of online and offline documents are increasing exponentially. The need to identify groups of similar documents has also increased, either to get rid of multiple versions of the same document or to extract a relevant set of documents from huge document repositories. Such grouping benefits many applications, such as finding near-duplicate web pages, finding replicated web collections, and detecting plagiarism. Web search engines benefit in particular, since grouping can be used for focused crawling. Forming groups of documents is not the only challenge; there is also a need to identify the relevant group for a newly arrived document. This can be achieved through a feature selection mechanism: if the features of the newly arrived document can be identified and matched against the feature set of each group of documents in the existing corpus, the new document is placed in its relevant group depending on the match found (Moon et al., 2013).
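The matching step described above can be illustrated with a minimal sketch. It assumes that each group is summarized by a set of feature terms and that the match is scored with Jaccard overlap; the function name, the group names, and the choice of Jaccard are illustrative assumptions, not the authors' method.

```python
def assign_to_group(doc_features, group_features):
    """Return the name of the group whose feature set best overlaps the document.

    doc_features: set of terms extracted from the new document.
    group_features: non-empty dict mapping group name -> set of feature terms.
    Jaccard overlap is an assumed similarity measure, chosen for simplicity.
    """
    def jaccard(a, b):
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    return max(group_features,
               key=lambda name: jaccard(doc_features, group_features[name]))


# Hypothetical groups and a hypothetical new document.
groups = {
    "sports": {"match", "team", "score"},
    "finance": {"market", "stock", "price"},
}
best = assign_to_group({"stock", "price", "team"}, groups)
```

Here the new document shares two terms with the finance group and only one with the sports group, so it is placed with the finance documents.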

Clustering techniques are available and can be readily applied to data in flat file format. Document data is converted to flat file format by extracting the terms from the documents, so that documents form the rows and terms form the columns (Gulic et al., 2013). The number of terms is large, which causes the curse of dimensionality and decreases algorithm efficiency.
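As a rough illustration of this flat file representation, the sketch below builds a small document-term count matrix using only the standard library. Whitespace tokenization and the sample documents are simplifying assumptions; the paper's actual preprocessing is not shown in this preview.

```python
from collections import Counter


def build_term_matrix(docs):
    """Return (vocabulary, rows), where rows[i][j] counts term j in document i."""
    # Assumed preprocessing: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


# Hypothetical toy corpus: three documents, six distinct terms.
docs = ["cats chase mice", "dogs chase cats", "mice eat cheese"]
vocab, matrix = build_term_matrix(docs)
```

Even in this toy corpus most cells are zero; with a realistic vocabulary the number of columns grows into the thousands, which is the dimensionality problem the paragraph describes.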

To reduce the number of terms, a feature selection technique is needed. A commonly used one is TF-IDF (Term Frequency-Inverse Document Frequency), which keeps only the most relevant terms based on how frequently they occur and eliminates the terms that are most common across the corpus (Albitar et al., 2014).
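A minimal sketch of TF-IDF scoring follows, assuming one common variant: relative term frequency multiplied by the natural logarithm of (number of documents / document frequency). The paper may use a different weighting; the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical corpus in which three documents describe the same thing.
sample = ["the car is fast", "the automobile is quick", "the vehicle is fast"]
weights = tf_idf(sample)
```

Terms that occur in every document, such as "the", receive a score of zero and are effectively eliminated, while rarer terms score higher. Note that "car", "automobile", and "vehicle" are scored as three unrelated terms.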

The problem with TF-IDF is that it does not consider synonyms of terms (Gulic et al., 2013). Synonyms play an important part in how a document is written. Well-written documents avoid repeating the same word and instead use several synonyms of it. As a result, the frequency of each individual term is reduced and the term is not selected by the TF-IDF method. Important terms thus get ignored, and logically relevant documents, including documents on the same topic, may fall into different clusters.

The proposed approach is similar to that of other researchers in the way the document set is preprocessed to remove noisy and less useful data. Some researchers extend the bag of words (Jashki et al., 2009) by adding synonyms, which affects cluster quality because it increases the number of dimensions. In the proposed approach, the synonyms of each term are treated as a group of terms, and the group frequency is computed as the sum, over the terms of the group that occur in the document, of each term's degree of similarity to the parent attribute. This changes which terms emerge as major features, because the count of a group is higher than the individual term counts used by TF-IDF. By applying a hierarchical agglomerative clustering algorithm, clusters are obtained at several levels of the hierarchy. The feature sets of the documents in the clusters at the lowest level of the hierarchy are extracted iteratively, and the newly discovered features are added to extend the feature set at the topmost level.
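The group frequency described above can be sketched as follows. The synonym group, the similarity values, and the decision to weight every occurrence of a term (rather than each distinct term once) are assumptions made for illustration; the preview does not specify them.

```python
from collections import Counter

# Hypothetical synonym group: parent term -> {member term: similarity to parent}.
SYNONYM_GROUPS = {
    "car": {"car": 1.0, "automobile": 0.9, "vehicle": 0.7},
}


def group_frequencies(text, groups):
    """Return {parent: group frequency} for one document.

    Each occurrence of a member term contributes its similarity to the parent,
    so close synonyms count almost as much as the parent term itself.
    """
    counts = Counter(text.lower().split())
    return {
        parent: sum(sim * counts[term] for term, sim in members.items())
        for parent, members in groups.items()
    }


freq = group_frequencies(
    "the car dealer sold one automobile and one vehicle", SYNONYM_GROUPS
)
```

In this example each of the three synonyms occurs only once, so none would stand out individually, but the group as a whole accumulates a frequency of about 2.6 under the assumed similarity values.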

This results in an extended set of features for each group, and even a wrongly placed document ends up in the right cluster. After some iterations, the algorithm converges and produces a stable set of features. The number of iterations and the number of features depend on the variance in the data set.
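The convergence behavior can be illustrated with a generic fixed-point loop that keeps extending a feature set until no new features are discovered. This is only a sketch: `extract_new_features` is a hypothetical callback standing in for the clustering and feature extraction steps, and the toy extractor below is invented purely to make the loop runnable.

```python
def iterate_features(initial_features, extract_new_features, max_iterations=50):
    """Extend the feature set until it is stable.

    Returns (features, iterations_used). extract_new_features is a
    hypothetical stand-in for clustering plus feature extraction.
    """
    features = set(initial_features)
    for iteration in range(1, max_iterations + 1):
        discovered = set(extract_new_features(features)) - features
        if not discovered:
            # No new features were found: the set has converged.
            return features, iteration
        features |= discovered
    return features, max_iterations


# Toy extractor (assumption): each known feature reveals fixed related terms.
RELATED = {"engine": ["motor"], "motor": ["piston"], "piston": []}


def toy_extractor(features):
    return [term for f in features for term in RELATED.get(f, [])]


stable, rounds = iterate_features({"engine"}, toy_extractor)
```

With the toy extractor, the set grows from one feature to three and the loop stops on the third pass, when nothing new is found. A more varied data set would need more passes, in line with the dependence on variance noted above.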

The paper is organized as follows. The background section presents the relevant work of other researchers on the topic. The next section presents our approach. The experimental setup and the results obtained are then presented to validate the efficacy of the method. Lastly, a real-world application is presented to highlight the applicability of the method. The paper ends with a conclusion and future directions.
