Sentiment Analysis on Movie Reviews Dataset Using Support Vector Machines and Ensemble Learning

Razia Sulthana A., Jaithunbi A. K., Haritha Harikrishnan, Vijayakumar Varadarajan
DOI: 10.4018/IJITWE.311428

Abstract

The internet makes it easier for people to connect with each other and has become a platform for expressing ideas and sharing information with the world. The growth of the internet has indirectly led to the development of social networking sites. The reviews people post on these sites convey their opinions, and analysis of those reviews is required to understand their intent. In this paper, natural language processing techniques and machine learning algorithms are applied to classify text data. The contributions of the proposed approach are three-fold: 1) a chi-square selector is applied to select the k best features, 2) a support vector machine (SVM) is used to classify the reviews, with the hyperparameters of the SVM classifier tuned using a grid search, and 3) a bagging algorithm is applied with the newly built SVM classifier as its base classifier. The number of base classifiers in the bagging algorithm is varied. The results of the proposed approach are compared with those of similar existing work and are found to be better than those of the existing systems.
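
As an illustration of the first contribution, the sketch below scores terms with a chi-square statistic and keeps the k best. It is a minimal sketch under stated assumptions, not the authors' implementation: it uses the common 2x2 presence/absence form of the statistic, and the toy reviews, the labels, and the function names `chi_square_scores` and `select_k_best` are hypothetical.

```python
def chi_square_scores(docs, labels):
    """Score each term with a 2x2 chi-square statistic (term presence vs. class).

    Assumption: labels are 1 (positive) or 0 (negative), and a term counts
    once per document no matter how often it occurs there.
    """
    n = len(docs)
    n_pos = sum(labels)
    n_neg = n - n_pos
    doc_terms = [set(d.lower().split()) for d in docs]
    vocab = set().union(*doc_terms)
    scores = {}
    for term in vocab:
        a = sum(1 for s, y in zip(doc_terms, labels) if term in s and y == 1)
        b = sum(1 for s, y in zip(doc_terms, labels) if term in s and y == 0)
        c = n_pos - a  # positive documents without the term
        d = n_neg - b  # negative documents without the term
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        # A term that appears in every document carries no class signal.
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_k_best(scores, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical toy corpus, for illustration only.
docs = ["good fun film", "good great film", "bad dull film", "bad boring film"]
labels = [1, 1, 0, 0]

scores = chi_square_scores(docs, labels)
best = select_k_best(scores, 2)
print(best)
```

In this toy corpus, "film" occurs in every review and scores zero, while "good" and "bad" separate the two classes perfectly and rank highest. The selected terms would then form the feature set passed to the classifier.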

1. Introduction

Sentiment analysis, also called opinion mining, applies natural language processing techniques to systematically recognize, extract, quantify, and study subjective information. It is widely applied to customer reviews, survey responses, reviews in online and social media, and product reviews in online stores, as well as in building AI-based bots and assistants. Sentiment analysis classifies an opinion as positive or negative. Lexicon-based and machine-learning (ML) based approaches are used to identify the sentiment of a sentence. The former uses a vocabulary of predefined positive and negative words, while the latter uses training and testing data to learn which words are positive and which are negative. Sentiment analysis can also be applied to classify emotions based on subjective parameters (Liu, 2010). It is also known as emotion AI and serves a variety of purposes in different fields, such as analyzing sentiment in emails, comments, and survey feedback. It plays an important role in the domain of artificial intelligence (Mäntylä et al., 2018; Poria et al., 2018).
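
The lexicon-based approach described above can be sketched in a few lines. This is a minimal illustration, not a production method: the two word lists are a hypothetical toy lexicon, the function name `lexicon_sentiment` is ours, and the "neutral" fallback for texts with no net lexicon evidence is an added assumption.

```python
import re

# Hypothetical toy lexicon; a real system would load a curated vocabulary.
POSITIVE_WORDS = {"good", "great", "excellent", "enjoyable", "brilliant"}
NEGATIVE_WORDS = {"bad", "boring", "terrible", "awful", "dull"}


def lexicon_sentiment(text):
    """Label a text by counting matches against the two word lists."""
    words = re.findall(r"[a-z']+", text.lower())
    positive_hits = sum(w in POSITIVE_WORDS for w in words)
    negative_hits = sum(w in NEGATIVE_WORDS for w in words)
    score = positive_hits - negative_hits
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"  # assumption: no net lexicon evidence either way


print(lexicon_sentiment("A great cast and an excellent script"))
print(lexicon_sentiment("Boring plot, terrible pacing"))
```

A machine-learning approach replaces the hand-written word lists with weights learned from labeled training reviews.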

The textual datasets used for sentiment analysis are first subjected to preprocessing. Most datasets require the removal or correction of missing values, null values, or redundant values. The data preprocessing step includes sampling, cleaning, and transformation of the data. The kind of preprocessing a particular dataset needs depends on its type (textual, image, or numerical). In the proposed approach, the dataset is textual.

Movies are one of the most popular forms of entertainment, and people commonly watch movies and share their opinions on social media platforms. By analyzing movie reviews, the positive and negative opinions about a movie can be identified. Thus, sentiment analysis helps reveal public opinion of that movie. Twitter is another platform where a large volume of user opinions is posted every day, and these opinions can concern any kind of content. A few recent research articles focus on detecting hateful words in tweets. A number of emotional labels are widely used in tweets and are given in Figure 1.

Figure 1. Labels used to classify the sentiments of the comments

The remainder of this paper is organized as follows: Section 2 details the terminologies, tasks, levels, and open challenges in sentiment analysis; Section 3 reviews the literature in this field; Section 4 explains the step-by-step procedure for implementing review analysis using SVM; Section 5 tabulates the experimental outcomes and compares them with the results of existing works; Section 6 concludes the research work.

2. Terminologies

  • Natural language processing (NLP): NLP is applied in sentiment analysis to evaluate marketing strategies and has reshaped business approaches. The steps of applying NLP (Chowdhury, 2003) to analyze a review include tokenization, part-of-speech (POS) tagging, text lemmatization, and stop-word identification, among others.

  • Tokenization: Tokenization (Webster & Kit, 1992) is the splitting of a phrase, sentence, paragraph, or entire text document into smaller units or terms. Each of these smaller units is called a token. Tokenization is important because the meaning of a text can be interpreted by analyzing the words it contains. It is a critical step in NLP, and model building cannot begin until it has been applied (Pentheroudakis et al., 2006).

  • Bag-of-words: A way of representing text data as a collection of words, without regard to their order. The bag-of-words model is applied in language and document classification (Voorhees, 1999).
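
The tokenization and bag-of-words terms above can be illustrated with a short sketch. It assumes a simple regular-expression tokenizer and a plain word-count representation; the function names `tokenize` and `bag_of_words` and the sample review are hypothetical, and real pipelines typically add steps such as stop-word removal and lemmatization.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (a simple regex tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def bag_of_words(tokens):
    """Count how often each token occurs; word order is discarded."""
    return Counter(tokens)


# Hypothetical sample review, for illustration only.
review = "The movie was good, and the acting was good too."
tokens = tokenize(review)
bow = bag_of_words(tokens)

print(tokens)
print(bow["good"], bow["movie"])
```

Because the bag-of-words representation keeps only counts, two reviews containing the same words in a different order map to the same representation.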

A general model for sentiment analysis is given in Figure 2.

Figure 2. General model of sentiment analysis
