Combining Indexing Units for Arabic Information Retrieval

Souheila Ben Guirat, Ibrahim Bounhas, Yahya Slimani
DOI: 10.4018/978-1-5225-5191-1.ch029

Abstract

Using either stems or roots as index terms has considerably improved the performance of Arabic Information Retrieval (IR) systems compared with indexing surface words. Many comparative studies have tried to determine which of these two indexing approaches is better, but so far neither has clearly outperformed the other: each index type performs better under different test conditions in terms of recall and precision. In this paper, the authors propose a hybrid approach that combines the two indexing units so as to exploit the advantages of both and overcome their shortcomings. Then, using several combination techniques, the authors assign a weight to each indexing unit and search for the best weighting values.

1. Introduction

Choosing the indexing unit is still a challenging problem in Arabic IR (Elayeb & Bounhas, 2015). Indexing surface words yields high precision but low recall, because Arabic is highly derivational and agglutinative. Furthermore, this choice requires more resources in terms of storage space and processing time (Aljlayl & Frieder, 2002). Other indexing units, such as stems and roots, have therefore given Arabic IR systems better performance. To better understand this problem, we present an example in Section 1.1. We summarize our contribution in section 2.

1.1. Motivation

Let us consider the Arabic root “kassama-قسم” and some related words (cf. Table 1). On the one hand, root-based methods map many derived words to the same form, which causes ambiguity and reduces precision. In our example, “inkassama-انقسم” (was divided) and “akssama-اقسم” (swear) will be represented by the same index term, i.e. “kassama-قسم” (divide).

On the other hand, stem indexes reduce ambiguity (Ayed, 2014; Bounhas et al., 2015) but also reduce recall: some studies (Al-Kabi et al., 2011) showed that, in most cases, morphological variants of a word have similar semantic interpretations, so different word forms may bear similar meanings. When we apply light stemming to the same example, “alinkissamat-الانقسامات” (the divisions) and “inkissam-انقسام” (division) will have the same stem, but their semantic relation with “kisma-قسمة” (division) will be ignored.

Table 1. Some Arabic words related to the root “kassama-قسم”

Arabic word | Transliteration | Meaning
انقسم | inkassama | Was divided
استقسم | istakssama | Conjure
اقتسم | iktassama | Share somebody in something
اقسم | akssama | Swear
قاسم | kaassama | Share something with somebody
انقسام | inkissam | Division
قسمة | kisma | Division
الانقسامات | alinkissamat | The divisions

Indeed, light stemming methods offer lower recall and higher precision, whereas lemmatization helps to achieve better recall but reduces precision. Combining these techniques therefore seems promising, as it strikes a compromise between precision and recall. In this way, we ensure that “alinkissamat-الانقسامات” (the divisions) is treated as equivalent to “inkissam-انقسام” (division), close to “inkassama-انقسم” (was divided), slightly different from “iktassama-اقتسم” (share somebody in something) but not totally different from “kaassama-قاسم” (share something with somebody), and that “akssama-اقسم” (swear) and “istakssama-استقسم” (conjure) are somehow related (cf. Figure 1).
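To make the idea of graded relatedness concrete, the following minimal Python sketch scores two words by a weighted sum of a stem match and a root match. It is only an illustration under stated assumptions, not the authors' method: the function name `combined_similarity`, the weights 0.6 and 0.4, and the hand-assigned stem and root of each word are all hypothetical, and the word “kataba” (wrote) is added purely for contrast.

```python
# Hand-assigned (stem, root) pairs in transliteration. These are illustrative
# assumptions, not the output of a real Arabic stemmer or root extractor.
ANALYSES = {
    "alinkissamat": ("inkissam", "kassama"),   # the divisions
    "inkissam":     ("inkissam", "kassama"),   # division
    "inkassama":    ("inkassama", "kassama"),  # was divided
    "kisma":        ("kisma", "kassama"),      # division
    "kataba":       ("kataba", "kataba"),      # wrote (added for contrast)
}


def combined_similarity(a, b, w_stem=0.6, w_root=0.4):
    """Weighted sum of stem agreement and root agreement.

    The default weights are arbitrary placeholders; finding good
    values is exactly the question the chapter sets out to study.
    """
    stem_a, root_a = ANALYSES[a]
    stem_b, root_b = ANALYSES[b]
    stem_match = 1.0 if stem_a == stem_b else 0.0
    root_match = 1.0 if root_a == root_b else 0.0
    return w_stem * stem_match + w_root * root_match


if __name__ == "__main__":
    for pair in [("alinkissamat", "inkissam"),
                 ("inkissam", "kisma"),
                 ("inkissam", "kataba")]:
        print(pair, combined_similarity(*pair))
```

With these toy weights, words sharing both stem and root receive the highest score, words sharing only the root receive a partial score, and unrelated words receive zero. This binary two-level scheme is coarser than the gradation described above, which also distinguishes among words that share only a root.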
