Latent Topic Model for Indexing Arabic Documents

Rami Ayadi (LaTice Lab, Faculty of Economics and Management of Sfax, University of Sfax, Sfax, Tunisia), Mohsen Maraoui (LaTice Lab, Faculty of Sciences of Monastir, University of Monastir, Monastir, Tunisia) and Mounir Zrigui (LaTice Lab, Faculty of Sciences of Monastir, University of Monastir, Monastir, Tunisia)
Copyright: © 2014 | Pages: 17
DOI: 10.4018/ijirr.2014010102


In this paper, the authors present a latent topic model to index and represent Arabic text documents in a way that reflects more of their semantics. Text representation in a language with highly inflectional morphology, such as Arabic, is not a trivial task and requires special treatment. The authors describe their approach for analyzing and preprocessing Arabic text, and then describe the stemming process. Finally, the latent model (LDA) is adapted to extract Arabic latent topics: the authors extract the significant topics of all texts, each topic is described by a particular distribution of descriptors, and each text is then represented as a vector over these topics. The classification experiment is conducted on an in-house corpus; latent topics are learned with LDA for different numbers of topics K (25, 50, 75, and 100), and the authors compare the results with classification in the full word space. The results show that classification in the reduced topic space outperforms, in terms of precision, recall, and F-measure, both classification in the full word space and classification using LSI reduction.

2. Text Representation

A feature selection algorithm seeks to retain the characteristics that optimize classification performance by removing noise and redundancy. Feature extraction, in contrast, represents the original features in another space through some transformation, mapping a high-dimensional vector space to a low-dimensional one.

The vector space model (VSM) (Salton, 1975) is still the most popular method for text representation; it reduces each document in the corpus to a vector of real numbers. Related research focuses on which terms are most appropriate for document representation and how to calculate the weights of these terms. Many studies adopt “word” or “n-gram” as terms and tf*idf as the weight.
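
As an illustration of the vector space model with tf*idf weights, the following is a minimal sketch and not the authors' implementation. It assumes documents are already tokenized, uses relative term frequency for tf, and uses log(N / df) for idf, which is only one of several common variants. The function name `tfidf_vectors` and the toy English tokens are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) for a list of tokenized documents.

    Assumed weighting (one common variant, not taken from the paper):
    tf = term count / document length, idf = log(N / document frequency).
    """
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # One real-valued weight per vocabulary term; empty documents map to zeros.
        vectors.append([
            (counts[t] / len(doc)) * math.log(n_docs / df[t]) if doc else 0.0
            for t in vocab
        ])
    return vocab, vectors


# Toy corpus (hypothetical tokens standing in for stemmed Arabic terms).
docs = [
    ["topic", "model", "arabic"],
    ["arabic", "stemming", "arabic"],
    ["topic", "index"],
]
vocab, vectors = tfidf_vectors(docs)
```

Each document becomes a vector whose length equals the vocabulary size, which is why the representation grows quickly with the corpus and motivates the dimensionality reduction methods discussed next.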

Although the tf*idf reduction has some attractive features, including the identification of words that are discriminative for documents in the collection, the approach provides a relatively small reduction in description length and reveals little of the inter- or intra-document statistical structure. To overcome these shortcomings, researchers have proposed several methods for dimensionality reduction, including latent semantic indexing (LSI) (Deerwester, 1990).

LSI uses a singular value decomposition of the tf*idf matrix X to identify a subspace in the space of tf*idf features that captures most of the variance in the collection. This approach can achieve significant compression in large collections. In fact, according to Deerwester et al., the derived features of LSI, which are linear combinations of the original tf*idf features, can capture some aspects of basic linguistic notions such as synonymy and polysemy.
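
The reduction performed by LSI can be written out as follows. This is a sketch under stated assumptions: X is taken to be the m × n term-by-document tf*idf matrix (m terms, n documents), k is the chosen number of latent dimensions, and the symbols U, Σ, V, and the reduced vector \hat{d} are notation introduced here, not taken from the source.

```latex
% Full singular value decomposition of the term-by-document matrix
X = U \Sigma V^{\top}, \qquad
\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r), \quad
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0

% Rank-k truncation: keep only the k largest singular values
X_k = U_k \Sigma_k V_k^{\top}, \qquad k \le r

% X_k is the best rank-k approximation of X in the Frobenius norm
X_k = \arg\min_{\operatorname{rank}(Y) \le k} \lVert X - Y \rVert_F

% A document vector d (length m) is mapped to k latent dimensions
\hat{d} = \Sigma_k^{-1} U_k^{\top} d
```

Each coordinate of \hat{d} is a linear combination of the original tf*idf features of d, which is the sense in which the derived LSI features combine terms that tend to co-occur.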
