Document Alignment for Generation of English-Punjabi Comparable Corpora from Wikipedia

Vishal Goyal (Punjabi University, India), Ajit Kumar (Multani Mal Modi College, India) and Manpreet Singh Lehal (Punjabi University)
Copyright: © 2020 | Pages: 10
DOI: 10.4018/IJEA.2020010104


Comparable corpora serve as an alternative to parallel corpora for languages where parallel corpora are scarce. Models trained on comparable corpora are less effective than those trained on parallel corpora, but they help compensate for the shortage of parallel data in machine translation. In this article, the authors explore Wikipedia as a potential source and describe the process of aligning documents, which will then be used for extracting parallel data. The parallel data thus extracted will help to enhance the performance of statistical machine translation.
2. Literature Review

The web can be seen as a large comparable corpus, and many studies have constructed parallel corpora from it. Smith, Quirk, and Toutanova (2010) and Otero and López (2010) explored Wikipedia as a source of comparable corpora. A more recent study by Bouamor and Sajjad (2018) uses sentence embeddings and neural machine translation to identify parallel data in a French-English corpus. Because parallel sentences tend to appear in similar article pairs, many studies first align articles within the comparable corpora and then identify parallel sentences in the aligned article pairs.
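
The two-stage approach described above (article alignment first, then sentence identification) can be illustrated with a minimal sketch. This is not the method of any of the cited studies. It assumes a title mapping is already available (for example, one derived from Wikipedia interlanguage links) and it substitutes a simple bilingual-lexicon overlap score for the embedding-based or translation-based scoring used in the literature. The function names, the lexicon, and the threshold are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple


def align_articles(en_titles: List[str],
                   title_map: Dict[str, str]) -> List[Tuple[str, str]]:
    """Stage 1: pair each English title with its Punjabi counterpart.

    `title_map` is an assumed English-to-Punjabi title mapping; titles
    without an entry are skipped.
    """
    return [(t, title_map[t]) for t in en_titles if t in title_map]


def overlap_score(en_sentence: str, pa_sentence: str,
                  lexicon: Dict[str, str]) -> float:
    """Toy similarity: share of tokens matched through a bilingual lexicon.

    A stand-in for a real scoring model, assumed for illustration only.
    """
    en_tokens = set(en_sentence.lower().split())
    pa_tokens = set(pa_sentence.split())
    if not en_tokens or not pa_tokens:
        return 0.0
    # Translate the English tokens that the lexicon covers.
    translated = {lexicon[t] for t in en_tokens if t in lexicon}
    return len(translated & pa_tokens) / max(len(en_tokens), len(pa_tokens))


def candidate_pairs(en_sents: List[str], pa_sents: List[str],
                    lexicon: Dict[str, str],
                    threshold: float = 0.5) -> List[Tuple[str, str, float]]:
    """Stage 2: within one aligned article pair, keep sentence pairs
    whose score reaches the (hypothetical) threshold."""
    pairs = []
    for e in en_sents:
        for p in pa_sents:
            score = overlap_score(e, p, lexicon)
            if score >= threshold:
                pairs.append((e, p, score))
    return pairs
```

Restricting the sentence comparison to aligned article pairs keeps the search space small: sentences are compared only within a pair of articles on the same topic rather than across the whole corpus.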
