Learning Algorithms of Sentiment Analysis: A Comparative Approach to Improve Data Goodness

Suania Acampa, Ciro Clemente De Falco, Domenico Trezza
DOI: 10.4018/978-1-7998-8473-6.ch012

Abstract

The uncritical application of automatic text-analysis techniques can be insidious, which is why the scientific community has taken a strong interest in supervised approaches. But is supervision enough? This chapter addresses these issues by comparing three machine learning approaches to measuring sentiment. The case study is the sentiment expressed by Italians on Twitter during the first post-lockdown day. To initialise the supervised models, a stratified daily sample of tweets was built and classified manually. The analysis model provides for a further step at the end of the process, useful for comparing the three approaches: an index is built on the processed tweets with the aim of assessing the goodness of the results produced. Comparing the three algorithms helps the authors understand not only which approach works best for the Italian language, but also which strategy can verify the quality of the data obtained.
Chapter Preview

Introduction

Big Corpora and Digital Methods: A Critical Approach to Improve Data Goodness

The ubiquity of digital technologies and the popularity of opinion-rich platforms such as social media and review sites generate a large and rapidly growing volume of user-generated data encoded in natural language every day. Reviews, tweets, likes, links, shares, texts, posts, tags, etc. are only part of the billions of digital traces that we leave on the web daily, through which it is possible to accurately trace the tastes, opinions, and attitudes of everyone. Big corpora represent a profitable empirical basis for all those who investigate social phenomena on the net. The production and increasing availability of data offer new possible forms of knowledge of social complexity that social researchers cannot ignore. The data revolution is considered as “the sum of the disruptive social and technological changes that are transforming the routine of construction, management and analysis of data consolidated within the various scientific disciplines” (Amaturo and Aragona, 2017, p. 1). New digital technologies and big data allow social research to move from constructing empirical bases through interrogation to constructing them through survey. Big data allow us to measure complex phenomena in detail and in real time, thanks to the evolution of IT tools and techniques such as artificial intelligence, machine learning, and natural language processing. This promotes interdisciplinarity between different scientific areas and provides social researchers with solid empirical bases for experimenting with and integrating new and traditional approaches to social research. These technologies push the social sciences into a scenario in which “web-mediated research [...] is already transforming the way researchers practice traditional research methods transposed to the web” (Amaturo and Punziano, 2016, pp. 35-36).

To describe and analyse this wealth of information, social scientists have also begun to use computational analytical methods to assemble, filter and interpret user-generated data encoded in natural language. Text mining belongs to this context: a branch of data mining that makes it possible to analyse vast textual corpora in different languages, extracting high-quality information with very limited manual intervention. Natural language processing (NLP) is the area of machine learning dedicated to extracting meaning from the written word.

A very profitable branch of natural language processing is sentiment analysis: the extraction and analysis of the opinions that users express on the web towards products, services, topics or public figures. Combining language processing and text analysis, sentiment analysis identifies subjective information in sources. The main objective is to determine the general polarity of a text (whether a review or a comment) and classify it into one of three categories: positive, negative or neutral. Sentiment analysis techniques are divided according to the type of approach used: lexicon-based or machine learning. The machine learning approach treats sentiment classification as a general text-classification problem and is in turn divided into unsupervised and supervised learning models. In supervised models it is necessary to prepare a training set labelled with the polarity of the sentiment (negative, positive, neutral), which the algorithm uses to predict the polarity of the other textual content contained in the test set. The machine learning approach has the advantage of not depending on the availability of dictionaries, but the accuracy of the classification methods depends heavily on the correct labelling of the texts used for training and on a careful selection of features by the algorithm. The results of the three supervised algorithms were adopted and compared through an analysis model that involved the construction of the labelled training set on which the three models were tested to evaluate the accuracy of each. The next step involved recoding the processed tweets based on their agreement or discrepancy with the output returned by each of the three algorithms. The tweet analysis allows us to define the components (text, sentiment, and other features) that suggest a plausible relationship with the functioning of the algorithm.
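The supervised workflow described above, a manually labelled training set from which an algorithm learns to predict the polarity of unseen texts, can be sketched with a minimal Naive Bayes classifier. This is a toy illustration with hypothetical example texts, not the chapter's actual corpus or pipeline:

```python
import math
from collections import Counter, defaultdict

# Tiny hand-labelled training set (hypothetical stand-in tweets):
# each text is tagged with one of the three polarity classes.
train = [
    ("great day finally outside again", "positive"),
    ("so happy the lockdown is over", "positive"),
    ("terrible crowds everywhere today", "negative"),
    ("still afraid this reopening is a mistake", "negative"),
    ("shops reopened this morning", "neutral"),
    ("phase two starts today", "neutral"),
]

def train_nb(examples):
    """Collect class counts and per-class word counts (bag of words)."""
    class_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)   # label -> word frequencies
    vocab = set()
    for text, label in examples:
        for w in text.split():
            word_counts[label][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab

def predict(text, class_counts, word_counts, vocab):
    """Return the class with the highest log posterior (add-one smoothing)."""
    total = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total)          # class prior
        denom = sum(word_counts[label].values()) + len(vocab)  # smoothed denominator
        for w in text.split():
            if w in vocab:                                     # skip unseen words
                score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train_nb(train)
print(predict("so great to be outside", *model))  # prints "positive"
```

In a real setting the training set would be the stratified, manually classified sample the chapter describes, and the features would go beyond raw word counts.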
Because these algorithms work through learning, they open up interesting developments for defining data-accuracy parameters against validated benchmarks. Our work examines sentiment in a one-day sample of tweets in Italian (May 4, 2020) related to phase 2 of the post-lockdown. The tweets were processed with the three algorithms most widely used in the literature for this type of analysis (Naive Bayes, Decision Tree and Logistic Regression). The results of the three supervised algorithms were adopted and compared on the basis of the accuracy and predictive ability of each. To check whether there were latent differences in the corpus, a lexical correspondence analysis (ACL) was used, which allowed us to define the components (text, sentiment, and other characteristics) that give us information about the functioning of the algorithms. Although these techniques are advancing rapidly and their performance improves year by year, the analysis shows that the chosen algorithms still present various limits for the Italian language.
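The comparison step can be illustrated with a short sketch that fits the three algorithms named above and reports held-out accuracy for each. It uses scikit-learn and a small stand-in labelled corpus; both are our assumptions for illustration, not the chapter's actual toolchain or data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# Hypothetical labelled tweets (Italian), repeated so the split is non-trivial.
texts = [
    "finalmente fuori che bella giornata", "sono felice il lockdown e finito",
    "troppa gente in giro che disastro", "riaprire ora e un errore",
    "oggi riaprono i negozi", "inizia la fase due",
] * 10
labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"] * 10

# Bag-of-words features, then a stratified train/test split.
X = CountVectorizer().fit_transform(texts)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, labels, test_size=0.25, random_state=0, stratify=labels)

models = {
    "Naive Bayes": MultinomialNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
scores = {}
for name, clf in models.items():
    clf.fit(X_tr, y_tr)                               # train on the labelled set
    scores[name] = accuracy_score(y_te, clf.predict(X_te))
    print(f"{name}: {scores[name]:.2f}")
```

On real Italian tweets the three accuracies would diverge, and that divergence, together with the agreement/discrepancy recoding described above, is what the comparison is designed to surface.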
