Automatic Categorization of Reviews and Opinions of Internet: E-Shopping Customers

Jan Žižka, Vadim Rukavitsyn
Copyright: © 2011 | Pages: 10
DOI: 10.4018/ijom.2011040105


E-shopping customers, blog authors, reviewers, and other web contributors can express their opinions of a purchased item, film, book, and so forth. Typically, many opinions center on one topic (e.g., a commodity or a film). From the Business Intelligence viewpoint, such entries are very valuable; however, they are difficult to process automatically because they are written in a natural language. Human beings can distinguish the various opinions, but given the very large data volumes, could a machine do the same? The suggested method applies a machine-learning (ML) approach to this classification problem and demonstrates on real-world data that a machine can learn from examples relatively well. The classification accuracy is better than 70%; it is not perfect because of typical problems associated with processing unstructured textual items in natural languages. The data characteristics and experimental results are shown.
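
To make the idea of "learning from examples" concrete, the following is a minimal sketch of a bag-of-words text classifier. It is not the authors' method: the abstract does not name the algorithm, so multinomial Naive Bayes with add-one smoothing is used here purely as an illustrative assumption. The function names and the toy reviews are invented for this example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Collect per-label document and word counts from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary


def classify(model, text):
    """Return the most probable label, scoring in log space with add-one smoothing."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in label_counts.items():
        score = math.log(doc_count / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # words never seen in training carry no evidence
            smoothed = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, examples):
    """Fraction of (text, label) pairs that the model labels correctly."""
    correct = sum(1 for text, label in examples if classify(model, text) == label)
    return correct / len(examples)


# Invented toy data, not taken from the article's data-sets.
TRAIN = [
    ("works great love it", "positive"),
    ("excellent product very happy", "positive"),
    ("great value works fine", "positive"),
    ("broke after one week", "negative"),
    ("terrible quality very disappointed", "negative"),
    ("stopped working waste of money", "negative"),
]
TEST = [
    ("love it works great", "positive"),
    ("terrible waste of money", "negative"),
]

model = train_naive_bayes(TRAIN)
print(accuracy(model, TEST))
```

On real reviews, accuracy would be measured on held-out examples in the same way; the toy data above is far too small and clean to say anything about the accuracy reported in the abstract.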
Article Preview

Data Description

To investigate the possibilities of automatic data mining from customer comments written in a free, unstructured form in natural language (Berry & Kogan, 2010; Konchady, 2006), the authors collected publicly accessible textual data from an Internet web-site. The main intention was to get comments about various consumer goods, with at least 100 different opinions per item provided by purchasers. In their reviews, customers describe experiences that are good, bad, or something in between. A scale can also serve as a kind of classification, or rating: from one star (the worst experience) up to five stars (the best one). The reviews are expected to explain the reasons for their ratings, and they are usually relatively short: tens or hundreds of words. Typically, the language is English, although with many mistypings, grammar errors, and so forth. In addition, the English used is very "international": the customers are not only people whose native language is one of the existing varieties of English, which can themselves differ more or less in grammar and vocabulary. A reader of the reviews can also sometimes see non-standard interjections and onomatopoeic words.
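
As a rough illustration of how such reviews might be prepared for a classifier, the sketch below maps a star rating to a class label and turns a noisy review text into word counts. The three-class split, the thresholds, and both function names are assumptions made for this example; the preview does not state how the authors grouped the ratings or tokenized the text.

```python
import re
from collections import Counter


def stars_to_label(stars):
    """Map a 1-5 star rating to a coarse class (thresholds are assumed)."""
    if not 1 <= stars <= 5:
        raise ValueError("rating must be between 1 and 5 stars")
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"


def bag_of_words(text):
    """Lowercase the text and count runs of letters and apostrophes."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Invented review text with the kind of noise described above.
review = "Grrreat charger!!! Works GREAT, realy."
print(stars_to_label(5), bag_of_words(review))
```

Note that the misspelled "realy" and the stretched "Grrreat" become features of their own, separate from "really" and "great". This is one reason why free-form reviews are harder to process than edited text.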

Table 1 lists the nine commodities whose reviews were used in the research. Interestingly, the average customer rating is typically close to five stars, which suggests that most customers were satisfied.

Table 1. The basic features of the nine Amazon data-sets

Data type                                                        | Average rating (stars) | Number of reviews
Battery Tender Junior 12V Battery Charger                        | 4.5                    | 229
Coffee People, Donut Shop K-Cups for Keurig Brewers              | 4.5                    | 186
Crocs Cayman Sandal                                              | 4.5                    | 192
Eureka 4870MZ Vacuum Cleaner                                     | 4.0                    | 230
Men's Health 1-Year Magazine Subscription                        | 3.5                    | 106
Timex Men's Ironman Triathlon 42 Lap Analog/Digital Dress Watch  | 4.0                    | 172
Toshiba 640 GB USB 2.0 Portable External Hard Drive              | 4.5                    | 156
Twilight: The Complete Illustrated Movie Companion               | 4.5                    | 298
Wii Nunchuk Controller                                           | 4.5                    | 228
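
The figures in Table 1 can be summarized with a few lines of code. The sketch below computes the total number of reviews and the review-weighted mean of the average ratings. The variable names are chosen for this example, and the weighted mean is only an approximation of the overall rating because the table reports per-product averages in half-star steps.

```python
# (product, average rating in stars, number of reviews), copied from Table 1
TABLE_1 = [
    ("Battery Tender Junior 12V Battery Charger", 4.5, 229),
    ("Coffee People, Donut Shop K-Cups for Keurig Brewers", 4.5, 186),
    ("Crocs Cayman Sandal", 4.5, 192),
    ("Eureka 4870MZ Vacuum Cleaner", 4.0, 230),
    ("Men's Health 1-Year Magazine Subscription", 3.5, 106),
    ("Timex Men's Ironman Triathlon 42 Lap Analog/Digital Dress Watch", 4.0, 172),
    ("Toshiba 640 GB USB 2.0 Portable External Hard Drive", 4.5, 156),
    ("Twilight: The Complete Illustrated Movie Companion", 4.5, 298),
    ("Wii Nunchuk Controller", 4.5, 228),
]

total_reviews = sum(count for _, _, count in TABLE_1)
weighted_mean = sum(rating * count for _, rating, count in TABLE_1) / total_reviews
smallest_set = min(count for _, _, count in TABLE_1)

print(total_reviews)            # 1797 reviews in total
print(round(weighted_mean, 2))  # 4.33 stars
print(smallest_set)             # 106, so every data-set has at least 100 reviews
```

The weighted mean of about 4.33 stars is consistent with the observation that ratings cluster near the top of the scale, which also means that examples of negative reviews are comparatively scarce.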
