Extracting Sentiment Patterns from Syntactic Graphs

Alexander Pak and Patrick Paroubek (Université Paris-Sud, France)
Copyright © 2013 | Pages: 18
DOI: 10.4018/978-1-4666-2806-9.ch001

Abstract

Sentiment analysis and opinion mining have become key topics in research on social media and social networks. Polarity classification, i.e., determining whether a text expresses a positive or a negative attitude, is a basic task of sentiment analysis. Building on traditional information retrieval techniques, such as topic detection, many researchers use a bag-of-words or an n-gram model to represent the analyzed text. For all its simplicity, such a representation loses the latent information contained in the relations between words in a sentence. The authors consider this information important for sentiment analysis and thus propose a novel method for representing a text, based on graphs extracted from sentence linguistic parse trees. The new method preserves information about word relations and can replace a standard n-gram model. In this chapter, the authors describe their approach and present the results of experimental evaluations that demonstrate the benefits of their text representation. In their experiments, the authors work with English and French; however, the approach is generic and can easily be adapted to other languages.
Chapter Preview

Introduction

The increase of interest in sentiment analysis is associated with the rise of weblogs and social networks, where users post and share information about their likes and dislikes, preferences, and lifestyle. Many websites let users leave their opinion about a certain object or topic. For example, users of the IMDb website can write a review of a movie they have watched and rate it on a 5-star scale. As a result, given a large number of reviews and rating scores, IMDb reflects the general opinions of Internet users on movies. Many related Web resources, such as cinema schedule websites, use information from IMDb to provide details about movies, including the average rating. Thus, IMDb reviews influence the choices of other users, who tend to select movies with higher ratings.

Another example is social networks. Users of Twitter or Facebook commonly post messages, visible to their friends, with opinions on various consumer goods, such as electronic products and gadgets. Jansen (2009) called Twitter an "electronic word of mouth." Companies that produce or sell those products are interested in monitoring the current trend and analyzing people's interest. Such information can influence their marketing strategy or bring changes to the product design to meet customers' needs.

Therefore, there is a need for algorithms and methods for the automatic collection and processing of opinionated texts. Such methods are expected to classify texts by their polarity (positive or negative), estimate the sentiment, and determine the opinion target and holder, where the target is the object or subject of the opinion and the holder is usually, but not necessarily, the author of the text (Toprak, et al., 2010).

Bag-of-words is one of the earliest models of text representation and is still often used in sentiment analysis. In this approach, a text is usually represented as a set of unigrams (or bigrams), disregarding their order and relations within the text. Common machine learning techniques, such as Naive Bayes or SVM, are then used to classify the sentiment of the given text. Although the accuracy of such approaches can be quite high, especially when using advanced feature selection techniques and additional opinion lexicons, we think this model should be improved or replaced by one that can identify more complex sentiment expressions rather than only simple ones such as "good movie" or "bad acting."
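To make this baseline concrete, the sketch below (not the authors' implementation) builds bag-of-words counts and classifies them with a minimal multinomial Naive Bayes using Laplace smoothing; the training data is invented for the example:

```python
import math
from collections import Counter

def bag_of_words(text):
    """A text as an unordered bag of unigrams."""
    return Counter(text.lower().split())

def train(docs):
    """Count word occurrences per class. docs: (text, label) pairs."""
    counts = {"pos": Counter(), "neg": Counter()}
    for text, label in docs:
        counts[label].update(bag_of_words(text))
    vocab = set(counts["pos"]) | set(counts["neg"])
    return counts, vocab

def classify(text, counts, vocab):
    """Multinomial Naive Bayes with Laplace (add-one) smoothing."""
    scores = {}
    for label in ("pos", "neg"):
        total = sum(counts[label].values())
        score = 0.0
        for word, freq in bag_of_words(text).items():
            # Smoothing keeps unseen words from zeroing the probability.
            score += freq * math.log((counts[label][word] + 1)
                                     / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Invented toy data, for illustration only.
docs = [("good movie great acting", "pos"),
        ("excellent film good story", "pos"),
        ("bad acting boring movie", "neg"),
        ("terrible film bad plot", "neg")]
counts, vocab = train(docs)
print(classify("good acting", counts, vocab))  # -> pos
```

In practice, feature selection and opinion lexicons are layered on top of exactly this kind of model.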

One problem with the bag-of-words representation is the information lost when a text is reduced to a collection of unrelated terms. These relations are often important and may change the degree and the polarity of the sentiment expressed in the text. We illustrate this with a simple example. Consider the phrase: "This book is bad." Its sentiment is obviously negative, and a standard classifier based on a unigram model will easily classify this sentence correctly, given a good training dataset. Now let us make the sentence a little more complex: "This book is not bad." In this case, a simple unigram model will probably fail. However, a bigram model will still work, capturing "not bad" as a term with a positive polarity. If we make the sentence more complex still: "This book is surprisingly not that bad," both the unigram and the bigram model will fail. To make them work, a more sophisticated treatment of negations is needed.
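One such treatment, sketched below, is the common workaround of prefixing tokens in the scope of a negator so that "bad" and "NOT_bad" become distinct features; this is a generic technique from the literature, not necessarily the authors' method:

```python
def ngrams(tokens, n):
    """Contiguous n-grams of a token list, joined as strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# A small, illustrative set of negation words.
NEGATORS = {"not", "no", "never"}

def mark_negation(tokens):
    """Prefix every token after a negator with 'NOT_' until the end of
    the sentence, so negated and plain words become distinct features."""
    out, in_scope = [], False
    for tok in tokens:
        if tok in NEGATORS:
            in_scope = True
            out.append(tok)
        else:
            out.append("NOT_" + tok if in_scope else tok)
    return out

tokens = "this book is surprisingly not that bad".split()
print(ngrams(tokens, 2))      # plain bigrams miss 'not bad'
print(mark_negation(tokens))  # ... 'not', 'NOT_that', 'NOT_bad'
```

Even this heuristic is fragile: it guesses the negation scope from surface order rather than from the sentence's actual structure.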

Beyond negations, the n-gram model has trouble capturing long-distance dependencies. A bigram model will capture "I like" as a positive pattern in a sentence such as "I like fish," but not in "I really like fish." If we move to a more fine-grained polarity classification, i.e., identifying not only the polarity of a text (positive or negative) but also its degree (low/high or even more precise), the n-gram model cannot provide sufficient information.
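The effect of a single intervening word on a fixed bigram window can be seen directly; this is a minimal illustration:

```python
def bigrams(tokens):
    """All adjacent word pairs in a token list."""
    return [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]

print(bigrams("I like fish".split()))         # ('I', 'like') is present
print(bigrams("I really like fish".split()))  # ('I', 'like') is gone
```

Widening the window to trigrams only postpones the problem: any fixed n can be defeated by one more intervening modifier.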

To address these shortcomings of the n-gram model, we propose using the dependency parse tree of a sentence to generate the text representation. A dependency tree is a graphical representation of a sentence in which nodes correspond to its words and edges represent syntactic relations between them, such as object, subject, or modifier. Figure 1 depicts the dependency parse tree of the sentence "I do not like fish very much."
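To show why such a representation helps, the sketch below hand-codes a plausible dependency graph for the example sentence (the edges and relation labels are illustrative assumptions; in practice a syntactic parser, such as the one behind Figure 1, supplies them) and extracts head-dependent word pairs that a fixed n-gram window would miss:

```python
# Hand-coded dependency edges for "I do not like fish very much".
# (head, dependent, relation) -- labels are illustrative only.
edges = [
    ("like", "I",    "subject"),
    ("like", "do",   "auxiliary"),
    ("like", "not",  "negation"),
    ("like", "fish", "object"),
    ("like", "much", "modifier"),
    ("much", "very", "modifier"),
]

def edge_features(edges):
    """Turn each syntactic relation into a head_dependent feature,
    regardless of how far apart the two words are in the sentence."""
    return ["%s_%s" % (head, dep) for head, dep, _ in edges]

features = edge_features(edges)
print(features)  # 'like_not' and 'like_fish' survive despite the distance
```

Note that the negation attaches directly to the verb it modifies, so "like_not" is captured no matter how many words separate them on the surface.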
