Deep Neural Models and Retrofitting for Arabic Text Categorization

Fatima-Zahra El-Alami (Laboratory of Informatics and Modeling, Sidi Mohamed Ben Abdellah University, Fez, Morocco), Said Ouatik El Alaoui (Ibn Tofail University, National School of Applied Sciences, Kenitra, Morocco) and Noureddine En-Nahnahi (Laboratory of Informatics and Modeling, Sidi Mohamed Ben Abdellah University, Fez, Morocco)
Copyright: © 2020 | Pages: 13
DOI: 10.4018/IJIIT.2020040104

Abstract

Arabic text categorization is an important task in text mining, particularly given the fast-increasing quantity of Arabic online data. Deep neural network models have shown promising performance and great data modeling capacity on large and substantial datasets. This article investigates convolutional neural networks (CNNs), long short-term memory (LSTM) networks, and their combination for Arabic text categorization. This work additionally handles the morphological variety of Arabic words by exploring a word embeddings model that uses position weights and subword information. To guarantee that related words receive nearby vector representations, this article adopts a strategy for refining Arabic vector space representations using semantic information embedded in lexical resources. Several experiments with different architectures have been conducted on the OSAC dataset. The obtained results show the effectiveness of CNN-LSTM, both without and with retrofitting, for Arabic text categorization in comparison with major competing methods.

Introduction

Over the last decades, we have been experiencing an explosion of textual information, including social media posts, digital books, and online encyclopedias. Natural Language Processing (NLP) techniques have been designed to help users analyze and extract insights from huge amounts of textual data. Innovative Machine Learning (ML) approaches, such as neural networks and deep learning models, have shown significant improvements in many NLP applications (information retrieval, document clustering, etc.). Text categorization (TC) is a fundamental task in diverse text mining applications such as sentiment analysis (Kim, 2014), question classification (Alami, En-Nahnahi, Zidani, & Ouatik, 2019), information filtering, and topic classification (El-Alami & El Alaoui, 2018). This process consists of assigning a predefined label or category to a textual document. However, building a TC system remains a challenging task for two main reasons: (1) the high dimensionality of the feature space, which degrades the performance of the categorization system; and (2) the existence of redundant and noisy features that mislead the TC results. To address these issues, various feature representation methods have been proposed. The best-known representations are Bag-of-Words (BoW), PLSA, LDA, word embeddings, and doc2vec. BoW (Wang & Manning, 2012) extracts patterns such as unigrams, bigrams, and higher-order n-grams as features by treating text as a collection of independent tokens. However, this method cannot capture the semantics within texts and fails to reflect similarities among words. PLSA (Cai & Hofmann, 2003) and LDA (Hingmire, Chougule, Palshikar, & Chakraborti, 2013) are topic modeling methods that are generally applied to select more discriminative features, but they suffer from an intractable inference problem.
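As a toy illustration (not taken from the article) of how BoW treats text as independent tokens, the following sketch counts unigram and bigram features from a whitespace-tokenized string; real pipelines would add normalization and vocabulary pruning:

```python
from collections import Counter

def bow_features(text, n=1):
    """Count n-gram features from whitespace-tokenized text (toy BoW)."""
    tokens = text.lower().split()
    ngrams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(ngrams)

doc = "the cat sat on the mat"
unigrams = bow_features(doc, n=1)   # e.g. "the" occurs twice
bigrams = bow_features(doc, n=2)    # e.g. "the cat", "cat sat", ...
```

Note that under this representation "cat" and "feline" are entirely unrelated features, which is exactly the lack of semantics mentioned above.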
More efficient representations, such as word embeddings (Mikolov, Sutskever, Chen, Corrado, & Dean, 2013b) and document embeddings (Le & Mikolov, 2014), are language modeling techniques that represent vocabulary words or entire texts as low-dimensional vectors of real numbers learned by neural language models. These representations have shown good performance in Arabic text categorization. However, they ignore the information embedded in lexical databases.
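To illustrate what "low-dimensional vectors" buys over BoW, the sketch below compares hypothetical embeddings by cosine similarity; the 4-dimensional values are made up for the example, whereas real models learn hundreds of dimensions from corpora:

```python
import math

# Toy 4-dimensional embeddings with hypothetical values; real word
# embeddings (e.g., word2vec) are learned from large corpora.
embeddings = {
    "king":  [0.8, 0.6, 0.1, 0.0],
    "queen": [0.7, 0.7, 0.2, 0.0],
    "apple": [0.0, 0.1, 0.9, 0.8],
}

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_royal = cosine(embeddings["king"], embeddings["queen"])  # high
sim_fruit = cosine(embeddings["king"], embeddings["apple"])  # low
```

Because similar words end up with nearby vectors, such similarities are measurable directly in the representation space, which BoW cannot offer.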

While many TC systems have been proposed for other languages (English, French, etc.), Arabic TC still faces numerous difficulties in addition to the challenges discussed above. This can be explained by the complexity of the Arabic language, which is both inflectional and derivational.

In this paper, we explore deep neural models and retrofitting for Arabic text categorization to address the aforementioned shortcomings, namely the lack of semantics, the high dimensionality of the representation space, and the complexity of the Arabic language. Retrofitting is a graph-based learning technique that exploits lexical relational resources to produce higher-quality semantic vectors; we employ it as a further refinement step. Deep neural networks, for their part, achieve strong results in many NLP tasks. Convolutional Neural Networks (CNNs) are effective at extracting the most salient features and enable deeper models (Kim, 2014; Kalchbrenner, Grefenstette, & Blunsom, 2014). Long Short-Term Memory networks (LSTMs) are a kind of Recurrent Neural Network (RNN) whose connections between cells form a directed graph along a sequence; they have demonstrated great capability in capturing the dynamic behavior of sequential data (Hochreiter & Schmidhuber, 1997). The main contributions of this work can be summarized as follows:
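The graph-based retrofitting idea described above can be sketched with the standard iterative update in the style of Faruqui et al. (2015): each vector is pulled toward its neighbors in a lexical graph while staying close to its original position. The uniform edge weights and alpha = 1 here are illustrative assumptions, not necessarily the article's exact configuration:

```python
def retrofit(vectors, lexicon, iterations=10, alpha=1.0):
    """Retrofit word vectors to a lexical graph (Faruqui et al., 2015 style).

    vectors: {word: [float, ...]} original distributional vectors
    lexicon: {word: [neighbour, ...]} semantic links, e.g. synonym pairs
    """
    new_vecs = {w: list(v) for w, v in vectors.items()}
    for _ in range(iterations):
        for word, neighbours in lexicon.items():
            nbrs = [n for n in neighbours if n in new_vecs]
            if word not in new_vecs or not nbrs:
                continue
            beta = 1.0 / len(nbrs)  # uniform edge weight (assumption)
            for d in range(len(new_vecs[word])):
                num = alpha * vectors[word][d] + beta * sum(new_vecs[n][d] for n in nbrs)
                new_vecs[word][d] = num / (alpha + beta * len(nbrs))
    return new_vecs

# Hypothetical example: two "synonyms" with distant original vectors.
vectors = {"happy": [1.0, 0.0], "glad": [0.0, 1.0], "sad": [1.0, 1.0]}
lexicon = {"happy": ["glad"], "glad": ["happy"]}
retro = retrofit(vectors, lexicon)
```

After a few iterations the vectors of the linked words move toward each other, while words absent from the lexicon are left untouched, which is how lexical-resource knowledge gets injected into the embedding space.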
