Introduction
Arabic Natural Language Processing (NLP) is a growing field with many interesting and challenging problems. Two types of Arabic are usually considered in Arabic NLP papers: Modern Standard Arabic (MSA) and dialects (vernaculars). MSA is derived from Classical Arabic. It is the official form of Arabic used in media, education, culture, literature, official documents, old books and most new books throughout the Arab world, which spans the Middle East and North Africa (MENA) as well as parts of East Africa (the Horn of Africa). Arabic is one of the six official languages of the United Nations and is the native language of 420 million people (Hmeidi et al., 2015b).
Classical Arabic and MSA remained the only documented versions of Arabic until the mid-1990s, when the dawn of Internet services and mobile communication pushed for the documentation of the different Arabic dialects (vernaculars). The widespread use of email, SMS, blogs and, later, social media helped document these Arabic vernaculars. It also gave birth to a new version of Arabic called Arabizi, in which Arabic words are transliterated using the Roman alphabet (Habash, 2010).
A number of specialists, such as Habash (2010), consider the Arabic vernaculars to be the true native language forms, since they are used in daily informal communication among people who live in the Arab world. Although these vernaculars lack standardization, are not generally found in written form and are not officially taught, they can be found in TV shows, movies, songs, theater, etc. Linguists classify Arabic vernaculars into seven main regional language groups: Maghrebi, Egyptian, Mesopotamian, Arabian Peninsula, Sudanese, Levantine, and Andalusian (now extinct) (Ta’amneh et al., 2014; Faqeeh et al., 2014).
Sentiment analysis (SA), also known as opinion mining (OM), is a growing field of study concerned with automatically determining people's opinions, sentiments, attitudes, and emotions from written text or speech excerpts (Liu, 2012). It is the focus of a large number of research projects for several reasons: the availability of good machine learning methods, the availability of huge corpora and, most importantly, the realization of the intellectual challenges and commercial applications of SA (Pang & Lee, 2008). This field of study is active in many research areas such as NLP, data mining, Web mining, and text mining (Liu, 2012). Due to its vast range of applications, SA has spread from computer science to the management sciences, political science, economics, and social sciences (Liu, 2012).
Most work on SA focuses on sentence-level or document-level SA. A very interesting version of SA known as aspect-based SA (ABSA) is less studied in the literature despite its great importance, possibly because of the several challenges it poses. This is the case even for the well-studied English language. The situation is worse for other languages such as Arabic, where dozens of papers on SA have been published in the past few years but, as far as we know, only two (Al-Smadi et al., 2015a, 2015b) address ABSA.
Researchers in the field of SA usually depend on lexicons as essential resources for identifying the polarity of different sentiments. Lexicons used in SA consist of a list of sentiment words (also called opinion words, polar words, or opinion-bearing words) and sentiment phrases that are used to express positive or negative sentiments. In his book, Liu (2012) presents the major challenges facing the use of such lexicons.
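To make the role of a sentiment lexicon concrete, the following is a minimal sketch of lexicon-based polarity scoring. It is not the method of any cited work: the lexicon entries, their scores, and the function name `lexicon_polarity` are hypothetical choices made only for illustration, and whitespace tokenization is a simplifying assumption.

```python
# Hypothetical toy lexicon: the words and scores below are illustrative
# assumptions, not entries from any published Arabic sentiment lexicon.
TOY_LEXICON = {
    "جميل": 1,    # "beautiful"
    "ممتاز": 1,   # "excellent"
    "رائع": 1,    # "wonderful"
    "ممل": -1,    # "boring"
    "قبيح": -1,   # "ugly"
}


def lexicon_polarity(text, lexicon=TOY_LEXICON):
    """Label a text by summing the scores of the lexicon words it contains.

    Tokens are obtained by splitting on whitespace (an assumption made for
    brevity); tokens missing from the lexicon contribute a score of zero.
    """
    score = sum(lexicon.get(token, 0) for token in text.split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for sentence in ["الفيلم جميل", "الكتاب ممل", "ذهبت الى السوق"]:
        print(sentence, "->", lexicon_polarity(sentence))
```

A counter this simple ignores negation, Arabic morphology, and dialectal spelling variation, so it should be read only as an illustration of how sentiment words and their polarities are looked up.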
Many studies have presented algorithms for compiling lexicons of sentiment-bearing words for the English language. By contrast, only a few studies (Abbasi, Chen, & Salem, 2008; Abdul-Mageed, Diab, & Korayem, 2011; Abdul-Mageed & Diab, 2012) have presented algorithms for constructing lexicons of Arabic words. A number of SA studies of Arabic are based on manually constructed lexicons (e.g., Abdulla et al., 2013). Manually constructed lexicons are characterized by their quality, but they are limited in size.