Effectiveness of Normalization Over Processing of Textual Data Using Hybrid Approach Sentiment Analysis

Sukhnandan Kaur Johal (Thapar Institute of Engineering and Technology, India) and Rajni Mohana (Jaypee University of Information Technology, India)
Copyright: © 2020 |Pages: 14
DOI: 10.4018/IJGHPC.2020070103

Abstract

Various natural language processing tasks feed into computerized decision support systems. Among these, sentiment analysis is gaining increasing attention. Most sentiment analysis relies on social media content, which is highly un-normalized. This hinders the performance of decision support systems, so the data must be processed efficiently to improve performance. This article proposes a novel method for normalizing web data during the pre-processing phase, with the aim of improving results across different natural language processing tasks. The technique is applied here to data for sentiment analysis. The performance of different learning models is analysed using precision, recall, F-measure, and fallout for sentiment analysis on normalized and un-normalized data. The results show that, after normalization, some documents shift polarity, e.g. from negative to positive. Experimental results show that processing normalized data outperforms processing un-normalized data, with better accuracy.

1. Introduction

Natural language processing is a field of computational linguistics and artificial intelligence. It is key to deriving decisions from narrative web content. The automation of decision support systems relies heavily on the performance of natural language processors. Data is available on the web in various forms, such as text, audio, video, and pictures. Due to the arbitrary nature of language, this data is unstructured. Processing unstructured data affects the efficiency of a decision support system; in particular, it can hinder the performance of the sentiment analyzer and thereby the decision support system that depends on it. As shown in Figure 1, data is first collected from various social sites for the automation of decision support systems. The data is then pre-processed to obtain structured content, which includes removing redundant content, cleaning, and normalization. Next, various language processing tasks are carried out. Depending on the requirement, the results of the language processor are filtered for the automation of the decision support system. In this work, the result of the sentiment analyzer (SA) is considered.

Figure 1. Automation of decision support system

The proliferation of web data, primarily as a communication medium, has given rise to unstructured content in the form of posts, blogs, reviews, etc. This web data is a rich indicator of people’s reactions to any entity. Analyzing these reactions is termed sentiment analysis in the field of natural language processing.

Classifying this web data into predefined categories, i.e. positive, negative, or neutral, is the task of the sentiment analyzer. The web content taken as input by the sentiment analyzer is usually raw data. To reduce performance degradation, the data must be pre-processed efficiently. Given the importance of minimizing human intervention in sentiment analysis and of obtaining better results, systematized and efficient mechanisms are needed. Normalization is the basic task for handling performance degradation in various natural language processing tasks. In the past, the term normalization meant simply putting content into a well-structured format. Today it has a broader meaning in natural language processing: it includes handling slang, spelling correction, finding missing words, cleaning the text, etc. In this manuscript, the presented system design and algorithm are used to handle unstructured or noisy data for sentiment analysis.
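To make the broader sense of normalization concrete, the following is a minimal sketch of a cleaning and slang-replacement step. It is not the algorithm proposed in this article: the lookup tables `SLANG` and `EMOTICONS`, the function name `normalize`, and the specific cleaning rules are assumptions chosen for illustration only.

```python
import re

# Hypothetical lookup tables for illustration; not the authors' dictionary.
SLANG = {"gr8": "great", "u": "you", "luv": "love", "thx": "thanks"}
EMOTICONS = {":)": "happy", ":(": "sad"}

def normalize(text: str) -> str:
    """Clean a raw post and replace known slang and emoticons."""
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    tokens = []
    for tok in text.split():
        # Emoticons are matched before punctuation is stripped.
        if tok in EMOTICONS:
            tokens.append(EMOTICONS[tok])
            continue
        word = tok.lower().strip(".,!?;:\"'")
        if not word:
            continue
        # Squeeze runs of three or more identical characters to two.
        word = re.sub(r"(.)\1{2,}", r"\1\1", word)
        tokens.append(SLANG.get(word, word))
    return " ".join(tokens)

print(normalize("I luv this phone :) gr8 battery!!! http://x.co/abc"))
```

Spelling correction and recovery of missing words, which the text also lists under normalization, are not covered by this sketch.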

1.1. Motivation and Contribution

The most important source of texts is undoubtedly the Web. Web content is full of unstructured text and slang. The motivation behind our work is to process semantically correct and methodologically useful content for sentiment analysis. The key concern of the work presented is to find the significant meaning of, or a replacement for, every slang term. It is a general methodology that can be embedded into various natural language applications to enhance their performance.

The proposed technique is generic in nature. It can be applied to the pre-processing of any textual data for a language processing task, which helps enhance the performance of the automatic decision support system. The hybrid system for sentiment analysis comprises two modules: corpus-based and dictionary-based. The corpus-based approach is characterized by the maximum likelihood ratio along with point-wise mutual information for normalization. The dictionary-based approach consists of a crossword dictionary for slang and emoticons. The development of the hybrid system stems from the failure of any single technique to achieve a satisfactory level of accuracy in sentiment analysis.
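As an illustration of the corpus-based statistic named above, point-wise mutual information between two words x and y is commonly defined as log2(p(x, y) / (p(x) p(y))). The sketch below estimates it from document-level co-occurrence counts. How the article combines this score with the likelihood ratio is not described in this preview, so the function name `pmi_table`, the choice of document-level co-occurrence, and the toy corpus are all assumptions.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_table(docs):
    """Return PMI for every word pair that co-occurs in at least one document.

    Probabilities are estimated as document frequencies divided by the
    number of documents (an assumption made for this sketch).
    """
    n = len(docs)
    word_df = Counter()  # number of documents containing each word
    pair_df = Counter()  # number of documents containing each word pair
    for doc in docs:
        words = sorted(set(doc.lower().split()))
        word_df.update(words)
        pair_df.update(combinations(words, 2))
    return {
        (x, y): math.log2((c / n) / ((word_df[x] / n) * (word_df[y] / n)))
        for (x, y), c in pair_df.items()
    }

# Toy corpus, invented for illustration.
docs = ["good phone", "good camera", "bad phone", "good phone"]
table = pmi_table(docs)
for pair, score in sorted(table.items()):
    print(pair, round(score, 3))
```

Pair keys are stored in alphabetical order, so a lookup for two words should sort them first. A positive score means the pair co-occurs more often than independence would predict.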

The paper is structured as follows. Section 2 covers the state-of-the-art algorithms for normalization, summarizing the work of various researchers in the same field. It is followed by the design and algorithm of the proposed hybrid method for handling un-normalized data in Sections 3 and 4. The experimental results and the evaluation of the system are presented in Sections 5 and 6. Lastly, the conclusion is presented in Section 7.
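For reference, the four evaluation measures named in the abstract (precision, recall, F-measure, and fallout) can be computed from the counts of a binary confusion matrix using their standard definitions. The sketch below assumes a single positive class and the balanced F1 form of the F-measure. The function name `binary_metrics` and the example counts are hypothetical and are not results from this article.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard measures from true/false positive and negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Balanced F-measure (F1): harmonic mean of precision and recall.
    f_measure = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    # Fallout (false positive rate): share of negatives labelled positive.
    fallout = fp / (fp + tn) if fp + tn else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "fallout": fallout,
    }

# Invented counts, for illustration only.
print(binary_metrics(tp=40, fp=10, fn=20, tn=30))
```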
