Natural Language Processing in Online Reviews

Gunjan Ansari, Shilpi Gupta, Niraj Singhal
Copyright: © 2021 | Pages: 25
DOI: 10.4018/978-1-7998-4240-8.ch003


The analysis of online data posted on various e-commerce sites is required to improve consumer experience and thus enhance global business. The increase in the volume of social media content in recent years has led to the problem of overfitting in review classification. Thus, there arises a need to select relevant features to reduce computational cost and improve classifier performance. This chapter investigates various statistical feature selection methods that are time efficient but result in the selection of a few redundant features. To overcome this issue, wrapper methods such as sequential feature selection (SFS) and recursive feature elimination (RFE) are employed to select an optimal feature set. The empirical analysis was conducted on a movie review dataset using three different classifiers, and the results show that SVM could achieve an F-measure of 96% with only 8% of the features, selected using the RFE method.
Chapter Preview


With the rise of various e-commerce sites, 72% of buyers rely on online reviews before purchasing any product or service. Online review statistics show that 85% of consumers prefer to buy products from sites with reviews, and that users trust customer reviews 12 times more than descriptions given by product manufacturers. Reviews are the third most significant factor used by Google for ranking e-commerce sites. Facebook review statistics reveal that four out of five users rely on local businesses that have positive reviews. However, one negative review may adversely impact 35% of customers. Twitter statistics showed that reviews shared through tweets in 2019 increased sales on e-commerce sites by 6.46% (Galov et al., 2020).

With the remarkable rise in social media content in the past few years, there arises a need to analyze this online data to enhance the user experience, which will in turn improve the local and global business of e-commerce sites. Owing to the availability of annotated datasets of product, movie, and restaurant reviews, among others, researchers have been developing various supervised learning approaches in recent years for extracting useful patterns from online content. Although supervised learning approaches are found to be quite useful, they suffer from the curse of dimensionality because the vast amount of online content generates a very large feature space. Selecting relevant and non-redundant features from the extracted features has been shown to achieve promising results in terms of accuracy and time.

The chapter will provide a theoretical and empirical study of different filter-based (Yang & Pederson, 1997; Chandrashekhar & Sahin, 2014) and wrapper-based (Zheng et al., 2003) feature selection methods for improving classification. The filter-based feature selection methods rank each feature based on the correlation between the feature and the class using various statistical tests. The top-ranked features are then selected for training the classification model. Although the filter-based methods are computationally fast, they result in the selection of redundant features. To overcome this drawback, wrapper-based feature selection methods such as Recursive Feature Elimination and Sequential Feature Selection are employed in this study. They evaluate each feature subset based on the performance of the classifier trained on it. The features selected by wrapper methods are more relevant and less redundant than those selected by filter methods, thus leading to better performance of the classifier.
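The contrast between the two families can be illustrated with a minimal sketch. It is not the chapter's experimental setup: the toy data, the chi-square filter, the forward variant of sequential selection, and the lookup-table classifier scored by leave-one-out accuracy are all assumptions chosen to keep the example small and dependency-free. Feature 1 is deliberately an exact copy of feature 0, and feature 2 is only useful in combination with feature 0.

```python
def chi2_score(column, labels):
    """Chi-square statistic of a binary feature against a binary class (2x2 table)."""
    pairs = list(zip(column, labels))
    a = pairs.count((1, 1))  # feature present, class 1
    b = pairs.count((1, 0))  # feature present, class 0
    c = pairs.count((0, 1))  # feature absent, class 1
    d = pairs.count((0, 0))  # feature absent, class 0
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def filter_select(X, y, k):
    """Filter method: score every feature on its own and keep the top k."""
    scores = [chi2_score([row[j] for row in X], y) for j in range(len(X[0]))]
    ranked = sorted(range(len(scores)), key=lambda j: -scores[j])
    return ranked[:k]


def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of a toy classifier (assumption, not from the chapter):
    predict the majority class among the other samples with the same feature pattern."""
    correct = 0
    for i, row in enumerate(X):
        key = [row[j] for j in features]
        votes = [y[m] for m, other in enumerate(X)
                 if m != i and [other[j] for j in features] == key]
        predicted = 1 if 2 * sum(votes) > len(votes) else 0
        correct += predicted == y[i]
    return correct / len(X)


def forward_select(X, y, k):
    """Wrapper method: greedily add the feature that most improves the classifier."""
    selected = []
    while len(selected) < k:
        candidates = [j for j in range(len(X[0])) if j not in selected]
        best = max(candidates, key=lambda j: loo_accuracy(X, y, selected + [j]))
        selected.append(best)
    return selected


# Hypothetical toy data: columns are [feature 0, copy of feature 0, feature 2].
X = [[1, 1, 0]] * 4 + [[0, 0, 1]] * 2 + [[0, 0, 0]] * 4 + [[1, 1, 1]] * 2
y = [1] * 4 + [1] * 2 + [0] * 4 + [0] * 2

print("filter :", filter_select(X, y, 2))
print("wrapper:", forward_select(X, y, 2))
```

On this toy data the filter keeps features 0 and 1, because each scores well in isolation even though one duplicates the other. The wrapper keeps features 0 and 2, because the duplicate adds nothing once feature 0 is already in the subset. The price is that the wrapper retrains and re-evaluates the classifier for every candidate subset.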

The first section of the chapter will introduce elementary Natural Language Processing (NLP) tasks related to online review classification. This section will also give an insight into a few tools used for scraping data (Mitchell, 2015) from online review sites. The reviews posted on these sites are generally noisy and contain misspelt words, abbreviations, etc. To handle these issues, pre-processing of reviews (Kowsari et al., 2019) is required, which converts raw data into an appropriate format for the implementation of the machine learning model. A few parsing techniques, such as Parts-of-Speech (PoS) tagging and dependency parsing, are the primary tasks required for extracting opinions from reviews in applications such as sentiment analysis (Liu, 2012) and named entity recognition (Hanafiah & Quix, 2014).
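As a rough illustration of what such pre-processing can involve, the sketch below strips HTML remnants, lowercases the text, tokenizes it, expands abbreviations, and removes stop words. The abbreviation table and stop-word list are hypothetical placeholders, and the specific steps are an assumption rather than the chapter's pipeline. Spelling correction and PoS tagging are left out because they need external resources.

```python
import re

# Hypothetical lookup tables; a real pipeline would use much larger resources.
ABBREVIATIONS = {"gr8": "great", "u": "you", "pls": "please"}
STOPWORDS = {"a", "an", "the", "is", "was", "it", "this", "and", "to", "of"}


def preprocess(review):
    """Turn one raw review string into a list of cleaned tokens."""
    text = re.sub(r"<[^>]+>", " ", review)             # drop HTML tags left over from scraping
    text = text.lower()                                # normalize case
    tokens = re.findall(r"[a-z0-9']+", text)           # keep word-like runs, drop punctuation
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens] # expand known abbreviations
    return [t for t in tokens if t not in STOPWORDS]   # remove stop words


print(preprocess("This movie was <b>GR8</b>, u must watch it!!!"))
```

The order of the steps matters in this sketch: abbreviations are expanded before stop-word removal so that an expansion which is itself a stop word would still be filtered out.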

After pre-processing of reviews, each review document needs to be represented as a vector before any machine learning model can be designed. The section will also provide a review of elementary feature representation models used in various applications of text classification (Ahuja et al., 2019), such as Term Frequency (TF) or Bag-of-Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) (Qaiser & Ali, 2018). Although these schemes are easy to implement, their drawback is that they ignore the position of a feature and its semantic relationship with other features in the given review document. This issue can be resolved by using a model (Uchida et al., 2018) that converts each document of the given corpus into a low-dimensional embedding vector using deep learning and neural network-based techniques. The Doc2vec model for representing feature vectors will also be covered in the section.
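The BoW and TF-IDF representations can be sketched in a few lines. The weighting used here, raw term count multiplied by log(N / document frequency), is one common variant and an assumption; the chapter may use a different normalization. The three tokenized reviews are made-up examples.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return the sorted vocabulary and one raw term-count vector per document."""
    vocab = sorted({token for doc in docs for token in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tf_idf(docs):
    """Weight each term count by log(N / df), where df is the term's document frequency."""
    vocab, counts = bag_of_words(docs)
    n_docs = len(docs)
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / df_j) for df_j in df]
    weights = [[row[j] * idf[j] for j in range(len(vocab))] for row in counts]
    return vocab, weights


# Hypothetical, already tokenized reviews.
docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]
vocab, counts = bag_of_words(docs)
_, weights = tf_idf(docs)
print(vocab)
print(counts)
print(weights)

# Word order is lost: both token orders map to the same count vector.
print(bag_of_words([["not", "good"], ["good", "not"]])[1])
```

The last line shows the limitation noted above: two documents with the same words in a different order receive identical vectors, which is the gap that embedding models such as Doc2vec aim to close.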

Key Terms in this Chapter

Unsupervised Learning: In unsupervised machine learning algorithms, the model learns from unlabeled data instances by finding the similarity or association between them.

Filter-Based Feature Selection: It filters irrelevant features from the extracted features on the basis of their association with the output class.

Wrapper-Based Feature Selection: This method selects the most useful and non-redundant features from the extracted features on the basis of their performance on the classifier.

Supervised Learning: It is a machine learning approach in which the model learns from an ample amount of available labeled data to predict the class of unseen instances.

Deep Learning: It is a subarea of machine learning, where the models are built using multiple layers of artificial neural networks for learning useful patterns from raw data.

Feature Selection: It is used to select appropriate features from the available data for improving efficiency of machine learning algorithms.

Semi-Supervised Learning: It is a machine learning approach in which the model learns from both labeled and unlabeled instances in order to predict the class of unlabeled instances.
