Distributional Semantic Model Based on Convolutional Neural Network for Arabic Textual Similarity


Adnen Mahmoud, Mounir Zrigui
DOI: 10.4018/IJCINI.2020010103

Abstract

This article addresses the problem of developing a model that can reliably identify whether a previously unseen document pair is paraphrased. Paraphrase detection in Arabic documents is challenging because of the variability of the language's features and the lack of publicly available corpora. To address these problems, the authors propose a semantic approach. At the feature extraction level, they use the global vectors (GloVe) representation, which combines global co-occurrence counts with a contextual skip-gram model. At the paraphrase identification level, they apply a convolutional neural network model to learn more contextual and semantic information between documents. For the experiments, the authors use the Open Source Arabic Corpora as the source corpus and collect different datasets to create a vocabulary model. To construct the paraphrased corpus, they replace each word in the source corpus with its most similar word of the same grammatical class, using the word2vec algorithm and part-of-speech annotation. Experiments show that the model achieves promising results in terms of precision and recall compared to existing approaches in the literature.
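
The corpus construction step described in the abstract (replacing each word with its most similar word of the same grammatical class) can be illustrated with a minimal sketch. This is not the authors' implementation: the tiny embedding table, the part-of-speech tags, the English toy words and the function names below are all hypothetical stand-ins for a trained word2vec model and a real Arabic part-of-speech tagger.

```python
import math

# Hypothetical toy embeddings standing in for a trained word2vec model.
EMBEDDINGS = {
    "big": [0.90, 0.10, 0.00],
    "large": [0.85, 0.15, 0.05],
    "grow": [0.80, 0.20, 0.10],
    "house": [0.10, 0.90, 0.20],
    "home": [0.12, 0.88, 0.25],
}

# Hypothetical part-of-speech tags standing in for a real tagger's output.
POS_TAGS = {
    "big": "ADJ",
    "large": "ADJ",
    "grow": "VERB",
    "house": "NOUN",
    "home": "NOUN",
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar_same_pos(word):
    """Return the nearest vocabulary word sharing the input word's POS tag.

    Words outside the vocabulary, or with no same-class candidate,
    are returned unchanged.
    """
    if word not in EMBEDDINGS:
        return word
    candidates = [
        w for w in EMBEDDINGS
        if w != word and POS_TAGS[w] == POS_TAGS[word]
    ]
    if not candidates:
        return word
    return max(candidates, key=lambda w: cosine(EMBEDDINGS[word], EMBEDDINGS[w]))


def paraphrase(tokens):
    """Substitute every token with its most similar same-class word."""
    return [most_similar_same_pos(t) for t in tokens]


print(paraphrase(["the", "big", "house"]))
```

The part-of-speech filter is what keeps the substitution grammatical: without it, the nearest neighbour of an adjective could be a verb that happens to occur in similar contexts.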

2. Problem Statement

The amount of textual information available and stored electronically has grown at a staggering rate, which has greatly increased the potential sources of paraphrase. More formally, given two sentences IJCINI.2020010103.m01 and IJCINI.2020010103.m02, such that IJCINI.2020010103.m03, when IJCINI.2020010103.m04 and IJCINI.2020010103.m05 convey the same meaning and are semantically equivalent, they are said to be paraphrased (Agarwal et al. 2017). Much research on paraphrase detection has focused on English, but little effort has so far been devoted to other languages such as Arabic. Paraphrase detection in Arabic is considered a complex problem because of the challenging features of the language (Mohamed et al. 2015). Arabic is a Semitic language spoken by more than 330 million people, with an alphabet of 28 letters written from right to left. In addition, Arabic script is morphologically rich, a richness accentuated by dots, diacritics and stacked letters (Hkiri et al. 2017, Mansouri et al. 2018, Mahmoud et al. 2018). It is highly inflectional, derivational and non-concatenative compared to other languages (Batita et al. 2018, Mahmoud et al. 2017). To address these gaps, recent research has proposed semantic-similarity-based approaches, which are more flexible and expressive than syntactic ones. The main objective is to measure the degree of relatedness between textual units while covering as many Arabic specificities as possible in terms of word construction and diversity of meaning.
