Dealing with Relevance Ranking in Cross-Lingual Cross-Script Text Reuse

Aarti Kumar (Department of Computer Applications, Maulana Azad National Institute of Technology, Bhopal, India) and Sujoy Das (Department of Computer Applications, Maulana Azad National Institute of Technology, Bhopal, India)
Copyright: © 2016 | Pages: 20
DOI: 10.4018/IJIRR.2016010102

Abstract

The proliferation of multilingual content on the web has paved the way for text reuse to become cross-lingual and also cross-script. Identifying cross-language text reuse becomes harder when the languages involved are cross-script and less resourced. This paper focuses on identifying text reuse between English and Hindi news articles and improving their relevance ranking in two phases: (i) a heuristic retrieval phase that reduces the search space and (ii) a post-processing phase that improves the relevance ranking. A dictionary-based Cross-Language Information Retrieval strategy is used for heuristic retrieval, and the Parse Feature Vector Model (PFVS) is proposed for post-processing to improve the relevance ranking. The model has been successful in tackling the obfuscation problems of synonymy, hyponymy, hypernymy, antonymy, sentence addition/deletion and word inflection. Instead of traditional approaches, Parse Feature Vectors are explored to detect the reused documents; to the best of the authors' knowledge, this is a novel contribution for this language pair.
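
As a rough illustration of the first phase only, the following minimal sketch shows how a dictionary-based heuristic retrieval step might narrow the search space. It is not the authors' implementation: the toy dictionary `EN_HI`, the function names, whitespace tokenisation and cosine scoring over term counts are all assumptions made for this example, and the PFVS post-processing phase is not shown.

```python
import math
from collections import Counter

# Hypothetical toy English-to-Hindi dictionary, for illustration only.
EN_HI = {
    "minister": ["मंत्री"],
    "election": ["चुनाव"],
    "rain": ["बारिश", "वर्षा"],
    "flood": ["बाढ़"],
}


def translate_query(english_text, dictionary):
    """Replace each English token with all of its dictionary translations."""
    terms = []
    for token in english_text.lower().split():
        # Tokens missing from the dictionary are simply dropped.
        terms.extend(dictionary.get(token, []))
    return terms


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def heuristic_retrieval(english_text, hindi_docs, dictionary, top_k=2):
    """Return ids of the top_k Hindi documents that share translated terms."""
    query = Counter(translate_query(english_text, dictionary))
    scored = []
    for doc_id, text in hindi_docs.items():
        score = cosine(query, Counter(text.split()))
        if score > 0:  # keep only candidates with some term overlap
            scored.append((score, doc_id))
    # Highest score first; ties broken by document id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_k]]


hindi_docs = {
    "d1": "मंत्री चुनाव चुनाव",
    "d2": "बारिश बाढ़",
    "d3": "क्रिकेट मैच",
}
candidates = heuristic_retrieval("election minister", hindi_docs, EN_HI)
print(candidates)
```

In this sketch only documents with non-zero overlap survive, so a later re-ranking step would operate on a much smaller candidate set than the full collection.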
Article Preview

Detecting cross-lingual reuse has long been an area of research interest. Stephan Vogel, Hermann Ney and Christoph Tillmann (1996) used a Hidden Markov Model to align words between English and French in statistical translation. In contrast to the common approach, where alignment probabilities depend on the absolute position of words, they made these probabilities depend on relative position.
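
The relative-position idea can be written compactly. The notation below is an assumption chosen for this sketch rather than taken from the preview: $f_1^J$ is the source sentence, $e_1^I$ the target sentence, $a_j$ the target position aligned to source word $f_j$, and $c(\cdot)$ a non-negative weight on jump widths.

```latex
% HMM alignment model: each alignment depends on the previous one
\Pr(f_1^J \mid e_1^I) = \sum_{a_1^J} \prod_{j=1}^{J}
    p(a_j \mid a_{j-1}, I)\; p(f_j \mid e_{a_j})

% Transition probabilities depend only on the jump width i - i'
p(i \mid i', I) = \frac{c(i - i')}{\sum_{i''=1}^{I} c(i'' - i')}
```

Because the transition term depends only on the difference $i - i'$, a shift of a whole phrase within the sentence leaves the alignment probabilities of its words largely unchanged, which a model based on absolute positions cannot capture.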

Noah A. Smith (2002) devised an approach, adaptable to any multilingual corpus, for classifying document pairs as either translationally equivalent or not.
