Negation Handling in Machine Learning-Based Sentiment Classification for Colloquial Arabic

One crucial aspect of sentiment analysis is negation handling: the occurrence of negation can flip the sentiment of a sentence and thereby degrade machine learning-based sentiment classification. The role of negation in Arabic sentiment analysis has been explored only to a limited extent, especially for colloquial Arabic. In this paper, the author addresses the negation problem in machine learning-based sentiment classification for colloquial Arabic. To this end, we propose a simple rule-based algorithm for handling the problem; the rules were crafted by observing many cases of negation. Additionally, simple linguistic knowledge and a sentiment lexicon are used for this purpose. The author also examines the impact of the proposed algorithm on the performance of different machine learning algorithms. The results given by the proposed algorithm are compared with three baseline models. The experimental results show that using the proposed algorithm improves the classifiers' accuracy, precision, and recall compared to the baselines.

sentence like "ما في ازعاج بالعكس هادئة جدا" (there is no noise; on the contrary, it was very quiet), all the words after "ما" would be marked as negated, although the only word that is supposed to be affected is (ازعاج) "noise". This would create new features that negatively affect the performance of the sentiment classification. Nevertheless, we present these models as baselines to be compared with the proposed algorithm. In contrast, the proposed algorithm aims to detect only the affected words, as some opinionated words might not be affected even though they are within the scope of a negation term. Another approach to capturing negation is to use higher-order n-grams, such as the bigrams used by Pang, Lee, and Vaithyanathan (2002). Although this approach is convenient, it fails when the affected words are at a distance from the negation words. For instance, in a sentence like "لا يوجد بهذا المطعم اي شيء زاكي" (there is nothing delicious in this restaurant), a 7-gram would be needed to capture the negation (لا ... زاكي) "no ... delicious", because the negation word and the affected word are six positions apart; such high-order n-grams lead to a very sparse representation that makes learning from the training data harder.
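To make the over-marking problem concrete, the following minimal sketch tags every token that follows a negation term. It is a hypothetical fixed-scope baseline written for illustration, not necessarily one of the three baselines evaluated in the paper; the small term set and the function name are assumptions, and the `_!` suffix is borrowed from the tagging example given later in the paper.

```python
# Illustrative subset of negation terms (assumption, not the paper's full list).
NEGATION_TERMS = {"ما", "لا", "مش", "مو"}
NEG_TAG = "_!"  # tag suffix taken from the paper's tagging example


def mark_all_after_negation(tokens):
    """Naive baseline: tag every token that follows a negation term."""
    marked, negating = [], False
    for tok in tokens:
        if tok in NEGATION_TERMS:
            negating = True          # everything after this point gets tagged
            marked.append(tok)
        elif negating:
            marked.append(tok + NEG_TAG)
        else:
            marked.append(tok)
    return marked


review = "ما في ازعاج بالعكس هادئة جدا".split()
marked = mark_all_after_negation(review)
print(" ".join(marked))
```

All five tokens after the negation term receive the tag, including the positive word for "quiet", even though only the word for "noise" is actually negated in this review.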
As a main contribution, we propose an algorithm that can detect and handle the negation problem in colloquial Arabic reviews to improve the performance of machine learning-based sentiment classification. The author also examines the effect of the proposed algorithm on four of the most common classifiers used in sentiment analysis: Support Vector Machine (SVM), Naïve Bayes (NB), k-nearest neighbor (KNN), and Logistic Regression. Additionally, the classifiers using our algorithm are compared against three baseline models that differ in their methods of determining the negation scope. The proposed algorithm uses crafted rules, linguistic knowledge, and a sentiment lexicon. The rules were crafted by observing many cases of negation in colloquial reviews. The algorithm detects negation words, such as (لا، مو، مش) "no, not, not", and then marks the opinionated words that might be affected within a predefined window of words. These rules do not rely on grammatical knowledge about the relationships between different constituents, as there are no standard grammatical rules for dialectal texts. A major challenge in this respect is determining the sequence of words in the sentence that might be affected by a negation term (the negation scope). For English, unlike Arabic, several approaches based on various aspects of content have been presented to address this issue. These approaches require an annotated negation dataset, which is not available for colloquial Arabic. This issue is beyond the scope of this paper; therefore, we simply use a predefined window of the five words that directly follow a negation word.
The work of Duwairi and Alshboul (2015) also focused on the negation problem. They introduce an unsupervised sentiment analysis approach for Modern Standard Arabic (MSA) that includes a morphological framework for negation. They propose treating negation with a set of rules derived from formal linguistic knowledge. The negation words are categorized into two groups: the first includes (ما، لا، لم، لن), which affect only the verb that appears immediately after them, and the second contains (ليس), which affects only the two nouns following it. The ArSenL lexicon and an Arabic morphological analyzer were used to assign the sentiment value and POS of the terms, respectively. Unfortunately, these rules cannot be applied to the texts in our work because of the presence of dialect, which does not abide by the same linguistic rules. Furthermore, the authors did not provide any details about the experiment or the evaluation results for their approach.

Pre-Processing
The pre-processing stage included noise removal, normalization, and tokenization. Noise removal covers misspellings, repeated letters, diacritics, punctuation, numerals, English words, and elongation. After that, normalization was applied to particular letters: the letters (أ, إ, آ) were converted to (ا), the letters (ى, ئ) were converted to (ي), the letter (ة) was converted to (ه), and the letter (ؤ) was converted to (و). Tokenization is the process of dividing a given text into a set of words (tokens) separated by spaces.
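The pre-processing steps described above can be sketched as follows. This is a minimal illustration rather than the author's implementation: the function name, the Unicode ranges, and the rule that three or more repetitions of a letter collapse to one are assumptions, and spelling correction is omitted because the paper does not say how misspellings were handled.

```python
import re

DIACRITICS = re.compile(r"[\u064B-\u0652]")     # fathatan through sukun
TATWEEL = "\u0640"                               # elongation character
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")   # punctuation, digits, Latin letters
REPEATS = re.compile(r"(.)\1{2,}")               # same character 3+ times (assumed threshold)

# Letter normalization exactly as listed in the text.
LETTER_MAP = str.maketrans({
    "أ": "ا", "إ": "ا", "آ": "ا",
    "ى": "ي", "ئ": "ي",
    "ة": "ه",
    "ؤ": "و",
})


def preprocess(text):
    """Remove noise, normalize letters, and split on whitespace."""
    text = DIACRITICS.sub("", text)     # delete diacritics without splitting words
    text = text.replace(TATWEEL, "")    # delete elongation
    text = NON_ARABIC.sub(" ", text)    # blank out punctuation, numerals, English words
    text = REPEATS.sub(r"\1", text)     # collapse repeated letters
    text = text.translate(LETTER_MAP)   # normalize letter variants
    return text.split()                 # tokenization on spaces


print(preprocess("الأكل حلوووو ومكان هادئ!! 100% good"))
```

Diacritics and elongation are deleted rather than replaced with a space so that they do not split a word into two tokens.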

Negation Terms List
The author manually collected the most common negation terms used in the reviews and stored them in a list, including different morphological forms of some words. The negation list contains 50 terms, covering both types of Arabic: MSA terms such as (لم، ليس) and dialectal words. In the Jordanian dialect, negation is expressed with terms different from those of MSA. For example, the terms (مو، مش، مهوش، فش، مفيش، مهو) were used in the collected texts. Negation is also expressed with terms like (مافي، مابيها، مافيها، مافيو، لايمت), because people tend not to put a space between the negation term and the following word. We treated each such case as a single expression belonging to the negation words. In this work, if a negation term is detected in the review, the following words within a window of five words are checked against the sentiment lexicon to decide whether they need to be reversed by marking them as negated. On the other hand, there are several cases in which a negation term is detected but not followed by sentimental words, for instance the review (مافي قسم الأدوات المنزلية) "there is no home appliances section"; in these cases, the algorithm does not mark any word within the scope with a negation tag.
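The lookup described above can be illustrated with a short sketch. The term set below contains only the terms quoted in this paper (the full list has 50 entries), the lexicon is a toy stand-in, and the function names are hypothetical; the input is assumed to be already normalized and tokenized.

```python
# Negation terms quoted in the paper; fused forms such as "مافي" are
# treated as single expressions. The complete 50-term list is not shown.
NEGATION_TERMS = {
    "لم", "ليس", "لا", "ما", "مو", "مش", "مهوش", "فش", "مفيش", "مهو",
    "مافي", "مابيها", "مافيها", "مافيو", "لايمت",
}
WINDOW = 5  # number of words after a negation term that form its scope

# Toy sentiment lexicon (assumption): word -> polarity.
LEXICON = {"زاكي": 1, "وسخ": -1}


def negation_scopes(tokens):
    """Return (index, scope) for each negation term found in the tokens."""
    return [
        (i, tokens[i + 1 : i + 1 + WINDOW])
        for i, tok in enumerate(tokens)
        if tok in NEGATION_TERMS
    ]


def scope_has_sentiment(scope, lexicon):
    """True if at least one word in the scope is in the sentiment lexicon."""
    return any(tok in lexicon for tok in scope)


review = "مافي قسم الادوات المنزليه".split()
for idx, scope in negation_scopes(review):
    print(idx, scope, scope_has_sentiment(scope, LEXICON))
```

For the "no home appliances section" review, the negation term is found but its scope contains no lexicon word, so nothing would be tagged.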

Negation Handling
The main objective of the paper is to address negation in colloquial Arabic reviews to improve sentiment classification. This section describes the proposed algorithm for handling this problem. The algorithm was developed in the Python 3.0 programming language (see Figure 1). The input to the algorithm is a review with one or more occurrences of negation terms, and the output is the review with the polarity words marked as negated if they are detected within the negation scope. First, we introduce the mechanism for detecting negation terms and the negation scope, which simply traces the negation terms within a given review based on the predefined list of negation terms. Then, if sentimental words are detected within the negation scope, they are marked with a negation tag, for instance (لا <أحب_!> هذا المطعم) "I don't like_! this restaurant". Each negation term is assumed to have a scope of negation effect. In this work, the negation scope is the five words that directly follow the negation term. Identifying negation terms is not an easy task, particularly in Arabic, since a negation term in a review sometimes does not carry the negation sense, or might affect words of one sentiment polarity but not the other. The fact that no morpho-syntactic tools are available for colloquial Arabic makes detecting such exceptions even more complicated. To this end, many cases were analyzed to come up with rules that can detect such exceptions. In this section, we summarize several cases of how negation terms are used in colloquial Arabic reviews, from which we crafted the rules required to detect negation properly.
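Before the exception rules are considered, the basic input-to-output behavior can be sketched as below. This is a simplified illustration under stated assumptions, not the algorithm of Figure 1: the function name and parameters are hypothetical, the lexicon is a one-word toy, and the example review is written in its normalized form.

```python
def mark_negated(tokens, negation_terms, lexicon, window=5, tag="_!"):
    """Tag lexicon words that fall within `window` words after a negation term."""
    out = list(tokens)
    for i, tok in enumerate(tokens):
        if tok not in negation_terms:
            continue
        # The scope is the `window` words that directly follow the negation term.
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            if tokens[j] in lexicon:
                out[j] = tokens[j] + tag
    return out


review = "لا احب هذا المطعم".split()
tagged = mark_negated(review, negation_terms={"لا"}, lexicon={"احب": 1})
print(" ".join(tagged))
```

Only the sentiment-bearing word receives the tag; the remaining words in the scope are left untouched, which keeps the number of new features small.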

Case1:
A sentence has a negation word followed by an exceptive word (إلا) "but, except" and a polarity expression within the negation scope, where the index of the exceptive word is greater than the index of the negation word and less than the index of the polarity term, as in the sentence (بصراحه ما لقيت الا المعامله الكويسه والاحتراف) "Frankly, we did not find anything but proper treatment and professionalism". In this case, the negation word emphasizes whatever polarity comes after the exceptive word, which in this sentence is the positive polarity expressed by (الكويسه، الاحتراف) "proper, professionalism". Therefore, the algorithm does not mark the polarity words as negated.
Case2:
Another phenomenon common in the texts is the use of superlative and comparative words preceded by negation words to express sentiment, as in the sentence (تجربه مافي احلى منها نظيف واستقبال جيد) "There is nothing more beautiful than this experience; it was clean and the reception was good". The negation word (مافي) "there is no" followed by the word (احلى) "more beautiful" expresses positive sentiment, so any sentimental term that comes after them is expected to agree with the same polarity, as is evident from the words (نظيف، جيد) "clean, good", which also express positive sentiment. Another example, with negative sentiment, is (مافي اسوأ من هيك ناس كذابين) "there is nothing worse than such people, liars", where the polarity of the word (كذابين) "liars" agrees with the polarity of the word (أسوأ) "worse", so negation would not be appropriate here. As can be noted, in this case the index of the superlative or comparative word is always greater than the index of the negation word and less than the index of the polarity term. The algorithm therefore refrains from negating the polarity word. To do so, given that we decided not to use any morphological analyzer, we collected and stored the most commonly used comparative and superlative words, such as (أجمل، أحسن، أفخم، أرقى، أسوأ، ألعن، أروع، أحلى، أفضل).
Case3:
A sentence has two or more sentimental words with different polarities (positive and negative) that fall within the negation scope, as in the sentence (مش حلو المكان وسخ بالمرة) "Not a lovely place, it is very filthy". The presence of a negation term in a sentence does not mean that all its polarity words should be affected. In the example, there are two sentimental words within the negation scope: (حلو) "lovely", which expresses positive sentiment, and (وسخ) "filthy", which expresses negative sentiment. In this case, the algorithm detects the polarity of the first sentimental word that occurs after the negation term, which is (حلو) in the above sentence, and then negates only the words of the same polarity within the scope, leaving words of the other polarity unchanged.
Case4:
A sentence has the term (ما) carrying a sense other than negation, such as an interrogative or a relative pronoun. For instance, in (كل ما نروح عليهم نتنكد ونغير المكان) "Every time we visit them, we get miserable, and we then change the place", the discourse context shows that the word (ما) is a relative pronoun that has no negation effect on the negative sentiment of the word (نتنكد) "get miserable"; however, recognizing such cases is hard without a morpho-syntactic analyzer. As mentioned before, we cannot use such an analyzer, since the available ones have been trained only on MSA. Therefore, we collected and stored all the words that are frequently used before or after (ما) when it does not express the negation sense. Table 1 shows most of the cases in which (ما) is not a negation term; whenever one of these cases is detected, the algorithm refrains from negating any polarity term within the scope.
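The index conditions of Cases 1 to 3 can be expressed compactly. The sketch below is one possible reading of those rules, not the author's code: the lexicon, the word sets, and the function name are hypothetical, polarity is encoded as +1 or -1 for convenience, and the example sentences are tokenized on spaces exactly as quoted.

```python
# Hypothetical toy resources; the paper's lexicon and word lists are larger.
LEXICON = {"الكويسه": 1, "والاحتراف": 1, "نظيف": 1, "جيد": 1, "حلو": 1, "وسخ": -1}
EXCEPTIVE_WORDS = {"الا", "إلا"}                          # Case 1
COMPARATIVES = {"احلى", "اسوأ", "أجمل", "أحسن", "أفضل"}   # Case 2 (subset)
BLOCKERS = EXCEPTIVE_WORDS | COMPARATIVES
WINDOW = 5


def indices_to_negate(tokens, neg_idx):
    """Indices of the polarity words to tag for the negation term at neg_idx."""
    scope = range(neg_idx + 1, min(neg_idx + 1 + WINDOW, len(tokens)))
    polar = [j for j in scope if tokens[j] in LEXICON and tokens[j] not in BLOCKERS]
    if not polar:
        return []
    first = polar[0]
    # Cases 1 and 2: an exceptive or comparative word lies between the
    # negation term and the first polarity term, so nothing is negated.
    if any(tokens[j] in BLOCKERS for j in range(neg_idx + 1, first)):
        return []
    # Case 3: negate only words sharing the polarity of the first polarity term.
    polarity = LEXICON[tokens[first]]
    return [j for j in polar if LEXICON[tokens[j]] == polarity]


case1 = "بصراحه ما لقيت الا المعامله الكويسه والاحتراف".split()
case2 = "تجربه مافي احلى منها نظيف واستقبال جيد".split()
case3 = "مش حلو المكان وسخ بالمرة".split()
print(indices_to_negate(case1, 1), indices_to_negate(case2, 1), indices_to_negate(case3, 0))
```

In the third sentence only the word for "lovely" is selected, so the word for "filthy" keeps its negative polarity.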
Case5:
A sentence has the negation term (غير), which in some cases has no negation effect on the following words, as in the sentence (أماكن مناسبة للعائلات غير عن الأماكن المزعجة) "These places are suitable for families; they are different from the noisy places". The word (غير) in this sentence means "different from", and it cannot play the role of negation on the polarity word (المزعجة) "noisy". Such a use is hard to recognize without morphological knowledge; however, the proposed algorithm can handle this case based on knowledge of the words frequently used before or after (غير). Those words were observed in and collected from the dataset to be fed to the algorithm; Table 2 shows them.
Case6:
Other cases were observed in which the negation terms do not have the negation sense. To enable the algorithm to detect such cases, we collected the words that frequently appear before or after the negation terms in these cases, as knowledge that guides the algorithm in deciding whether a term is a negation word or not. Table 3 shows the cases we collected along with examples.
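Cases 4 to 6 share one mechanism: a negation term is ignored when a known context word appears immediately before or after it. A minimal sketch of such a lookup is given below. The table layout and function name are assumptions, and the two entries are inferred from the example sentences only; the actual contents of Tables 1 to 3 are not reproduced here.

```python
# Hypothetical context table: for each ambiguous term, the neighboring words
# that signal a non-negation sense. Entries are inferred from the examples.
NON_NEGATION_CONTEXT = {
    "ما": {"before": {"كل"}, "after": set()},    # "كل ما" = "every time"
    "غير": {"before": set(), "after": {"عن"}},   # "غير عن" = "different from"
}


def is_real_negation(tokens, i):
    """False if the term at index i appears in a known non-negation context."""
    ctx = NON_NEGATION_CONTEXT.get(tokens[i])
    if ctx is None:
        return True  # no exception recorded for this term
    prev_tok = tokens[i - 1] if i > 0 else None
    next_tok = tokens[i + 1] if i + 1 < len(tokens) else None
    return prev_tok not in ctx["before"] and next_tok not in ctx["after"]


case4 = "كل ما نروح عليهم نتنكد ونغير المكان".split()
case5 = "اماكن مناسبه للعائلات غير عن الاماكن المزعجه".split()
plain = "بصراحه ما لقيت".split()
print(is_real_negation(case4, 1), is_real_negation(case5, 3), is_real_negation(plain, 1))
```

A lookup of this kind stays independent of any morphological analyzer, at the cost of covering only the contexts that were observed and listed by hand.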
In future work, we plan to enable the algorithm to deal with implicit negation, which can also negatively affect polarity classification. Another problem that needs to be addressed is the usage of intensifiers and diminishers, which can change the polarity of words or phrases.