To improve the Recovery of an Arab Stemmer for Information Retrieval

Khaireddine Bacha (LaTICE Laboratory, University of Tunisia, Tunisia)
Copyright: © 2018 | Pages: 9
DOI: 10.4018/IJDAI.2018010102

Abstract

The automatic processing of the Arabic language is a growing discipline, with an increasing body of research and technology devoted to examining the specific features of this language and to providing the tools needed for its automatic processing. Older root extraction (stemming) techniques have limitations that weaken the root extraction process. In this article, the author proposes a new approach to root extraction based on two finite state automata, with the aim of minimizing the error rate and the ambiguity that usually result from the removal of affixes. The author is currently developing and improving the technique while trying to overcome the various problems encountered, and is compiling an evaluation corpus that will make it possible to evaluate this approach and compare it with others.

2. Schemas And Their Importance To Processing

In the field of computer processing of Arabic, research began in the 1970s, even before the problems of Arabic text editing were fully under control. The earliest work dealt mainly with lexicons. Over the last ten years, the internationalization of the Web and the proliferation of means of communication in Arabic have revealed the usefulness of a large number of potential applications of the automatic processing of Arabic natural language.

The first aspect of this processing is based on analysis and consists of a series of treatments: morphological, syntactic, semantic, pragmatic, and so on. Here, analysis consists of constructing a formal representation of the input text that is easy for the machine to manipulate (Zeroual and Lakhouaja, 2016). The second aspect concerns generation-based processing, which is the reverse of analysis: it consists of generating texts from an internal representation. Formalizing the search for the root of an Arabic word has been one of the most interesting areas of Arabic natural language processing. Today, the schema method is used by linguists and by many computer scientists in processing the Arabic language.

The schema is a word composed of the three consonants ف [f], ع [ʿ] and ل [l] (El-Haj, Mahmoud, and Koulali, 2013), which are vocalized and can be augmented by other letters (prefix, suffix and infix) (Hadni, Said Ouatik, Lachkar, and Meknassiv, 2013). The schema plays a very important role in generating derived forms from a root. Generation consists of replacing the consonants of the schema with those of the root in question, while keeping the same vowels and the same added letters and respecting the order of the consonants; in other words, the schema can be considered a mold into which the root is cast. In total, there are 19 verbal schemas, which can be either bare or augmented. They are derived from the three consonants of the root by changing the vowels, by doubling the second letter of the root, or by adding and even intercalating affixes (prefix, infix, suffix) (Beesely, 1996). Augmented verbs are conjugated with the same prefixes and suffixes as the unaugmented verb. As a result, a root can generate up to 19 verbs, and the corresponding schemas can yield 22 different conjugation patterns. The Arabic schema carries grammatical information that is passed on to the words it forms. It can be quantitative or qualitative, with or without gemination of one of its radical consonants.
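The idea of the schema as a mold can be illustrated with a minimal Python sketch. It is only an illustration of the substitution described above, not the automaton-based method the author proposes. The function names `apply_schema` and `extract_root` are hypothetical, and the sketch assumes triliteral roots and schemas in which the letters ف, ع and ل appear only as placeholders, never as affix letters.

```python
from typing import Optional

# Placeholder consonants of the schema: fa', 'ayn, lam (assumed to occur
# in a schema only as slots for the root, never as affix letters).
PLACEHOLDERS = ("ف", "ع", "ل")


def apply_schema(root: str, schema: str) -> str:
    """Generate a derived form by casting a triliteral root into a schema."""
    if len(root) != 3:
        raise ValueError("this sketch handles triliteral roots only")
    mapping = dict(zip(PLACEHOLDERS, root))
    # Vowel marks and affix letters are copied unchanged, in the same order.
    return "".join(mapping.get(ch, ch) for ch in schema)


def extract_root(word: str, schema: str) -> Optional[str]:
    """Return the root if the word fits the schema exactly, else None."""
    if len(word) != len(schema):
        return None
    slots = {}
    for w, s in zip(word, schema):
        if s in PLACEHOLDERS:
            # A repeated placeholder must always bind to the same letter.
            if slots.setdefault(s, w) != w:
                return None
        elif w != s:
            # A fixed letter of the schema does not match the word.
            return None
    if len(slots) != 3:
        return None
    return "".join(slots[p] for p in PLACEHOLDERS)


if __name__ == "__main__":
    print(apply_schema("كتب", "فاعل"))    # كاتب
    print(apply_schema("كتب", "مفعول"))   # مكتوب
    print(extract_root("مكتوب", "مفعول"))  # كتب
```

Matching a word against a single schema in this way ignores clitics, weak radicals and unvocalized ambiguity, so it should be read as a picture of the mold metaphor rather than as a usable stemmer.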
