Multi-Step Iterative Algorithm for Feature Selection on Dynamic Documents

Multi-Step Iterative Algorithm for Feature Selection on Dynamic Documents

Prafulla Bharat Bafna (Symbiosis International University, Pune, India), Shailaja Shirwaikar (Nowrosjee Wadia College, Pune, India) and Dhanya Pramod (Symbiosis International University, Pune, India)
Copyright: © 2016 |Pages: 17
DOI: 10.4018/IJIRR.2016040102
OnDemand PDF Download:
$30.00
List Price: $37.50

Abstract

The authors propose clustering based multistep iterative algorithm. The important step is where terms are grouped by synonyms. It takes advantage of semantic relativity measure between the terms. Term frequency is computed of the group of synonyms by considering the relativity measure of the terms appearing in the document from the parent term in the group. This increases the importance of terms which though individually appear less frequently but together show their strong presence. The authors tried experiments on different real and artificial datasets such as NEWS 20, Reuters, emails, research papers on different topics. Resulted entropy shows that their algorithm gives improved result on certain set of documents which are well-articulated, such as research papers. The results are marginal on documents where the message is emphasized by repetitions of terms specifically the documents that are rapidly generated such as emails. The authors also observed that newly arrived documents get appropriately mapped based on proximity to the semantic group.
Article Preview

Introduction

Managing growing repositories of unstructured or semi structured documents in an organization is becoming increasingly difficult. The size and number of online and offline documents is increasing exponentially. The need for identifying groups of similar documents has also increased for either getting rid of multiple versions of same document or extracting relevant set of documents from huge document repositories. It benefits many applications such as finding near duplicated web pages, replicated web collections, detecting plagiarism. Web search engines are highly benefited as it can be used for focused Crawling. Forming group of documents is not the only challenge, but there is need to identify the relevant group for a newly arrived document. It can be achieved through feature selection mechanism. It means that if the features of newly arrived document can be identified and matched with the feature set for each group of documents from the existing corpus, the new document gets placed in its relevant group depending on the match found (Moon et al., 2013).

Clustering techniques are available and can be readily applied on data in flat file format. Document data is converted to flat file format by extracting the terms from the documents, so documents represent rows and terms are placed in columns (Gulic et al., 2013). The terms are in large number which causes the problem of dimension curse and decreases algorithm efficiency.

To reduce these terms some feature selection technique should be used. TF-IDF (Term Frequency-Inverse Document Frequency) technique is used which extracts only most relevant terms depending on term occurrence frequency and eliminates the most common terms in the corpus (Albitar, et al., 2014).

Problem with TF-IDF is that, it does not consider synonyms of terms (Gulic et al., 2013). Synonyms are important part while presenting a document. Ideal documents do not repeat the same word. In fact multiple synonyms of a single word are mostly used. So term frequency of the term gets reduced and term is not selected by TF-IDF method. Important terms thus get ignored and logically relevant documents fall into different clusters. Sometimes documents on the same topic may get grouped in different clusters.

The proposed approach is similar to other researchers in the way document set is preprocessed to remove noisy and less useful data. Some researchers extend the bag of words (Jashki et al., 2009) by adding the synonyms which affects cluster quality due to increase in dimensions. In the proposed approach synonyms of each term are treated as a group of terms and group frequency is computed as the sum of the degree of similarity to the parent attribute of each term in the group occurring in the document. It changes the major terms/features because individual count increases unlike in TF-IDF. By applying hierarchical agglomerative clustering algorithm, the clusters are obtained at several levels of hierarchy. The feature set of the documents in clusters at the lowest level of hierarchy are iteratively extracted. The newly discovered features are added to extend the feature set at the topmost level.

It results into extended set of features with respect to each group and even wrongly placed document gets the right cluster. After some iteration, algorithm converges and produces stable set of features. Number of iterations and features depend on variance in data set.

The paper is organized as follows. The background presents the relevant work of other researchers on the topic. The next section presents our approach. The experimental set up and results obtained are presented to validate efficacy of the method. Lastly a real world application is presented to highlight applicability of the method. The paper ends with conclusion and future directions.

Complete Article List

Search this Journal:
Reset
Open Access Articles: Forthcoming
Volume 8: 4 Issues (2018): 1 Released, 3 Forthcoming
Volume 7: 4 Issues (2017)
Volume 6: 4 Issues (2016)
Volume 5: 4 Issues (2015)
Volume 4: 4 Issues (2014)
Volume 3: 4 Issues (2013)
Volume 2: 4 Issues (2012)
Volume 1: 4 Issues (2011)
View Complete Journal Contents Listing