Experimenting Language Identification for Sentiment Analysis of English Punjabi Code Mixed Social Media Text

Neetika Bansal (College of Engineering & Management, India), Vishal Goyal (Punjabi University, India) and Simpel Rani (Yadavindra College of Engineering, India)
Copyright: © 2020 |Pages: 11
DOI: 10.4018/IJEA.2020010105

Abstract

People do not always write in Unicode; rather, they mix multiple languages. Processing code-mixed data is challenging because of its linguistic complexity, and noisy text makes language identification harder still. The dataset used in this article contains Facebook and Twitter messages collected through the Facebook Graph API and the Twitter API. Models were trained on the annotated English-Punjabi code-mixed dataset using a pipeline that combines a dictionary vectorizer with an n-gram approach and some additional features. Logistic Regression, Decision Tree, and Gaussian Naïve Bayes classifiers are used to perform language identification at the word level. The results show that Logistic Regression performs best, with an accuracy of 86.63% and an F1 measure of 0.88. The success of machine learning approaches depends on the quality of the labeled corpora.

Literature Review

Language identification is an essential prerequisite for automatic text processing and a preprocessing step for computational work on code switching. It is considered an almost solved problem for monolingual text, where n-gram approaches, character encoding detection, or stop word lists can reach up to 100% accuracy. Researchers use simple dictionary methods or machine learning techniques such as Naive Bayes, Support Vector Machines (SVM), Conditional Random Fields (CRF), and Convolutional Neural Networks (CNN). Language identification systems nevertheless fail because of the style of writing and the brevity of texts. In such settings, language detection remains a difficult and unsolved problem owing to Anglicisms, code mixing, code switching, and lexical borrowing (terms that are often used interchangeably).

Beesley (1988) developed a prototype language identifier for online text based on cryptanalysis. Cavnar and Trenkle (1994) used character n-gram frequency lists to determine the language of a new piece of text; this is the underlying algorithm of TextCat, an automatic language identification (LID) system developed by van Noord. They reported 99.8% accuracy for language models of more than 300 n-grams.
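
To make the n-gram frequency list idea concrete, the following is a minimal sketch of rank-based profile matching in the spirit of Cavnar and Trenkle (1994), not a reproduction of TextCat. Several details are assumptions made for illustration: the function names, the restriction to n-grams of length 1 to 3, the profile size of 300, the penalty given to unseen n-grams, and the two toy training sentences (the romanized Punjabi sentence is invented and far too small for real use).

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Yield character n-grams of each whitespace token, padded with '_'."""
    for token in text.lower().split():
        padded = f"_{token}_"
        for n in range(n_min, n_max + 1):
            for i in range(len(padded) - n + 1):
                yield padded[i:i + n]


def build_profile(text, top_k=300):
    """Map the top_k most frequent n-grams to their rank (0 = most frequent)."""
    counts = Counter(char_ngrams(text))
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:top_k]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams absent from lang_profile get a fixed penalty."""
    max_penalty = len(lang_profile)  # assumed penalty for unseen n-grams
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def identify(text, lang_profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc_profile = build_profile(text)
    return min(
        lang_profiles,
        key=lambda lang: out_of_place(doc_profile, lang_profiles[lang]),
    )


# Toy training text, invented for illustration only.
TRAINING = {
    "english": "the weather is very nice today and we are going to the market with our friends",
    "punjabi_roman": "ajj mausam bahut vadhia hai te assi apne dostan naal bazaar ja rahe haan",
}
PROFILES = {lang: build_profile(text) for lang, text in TRAINING.items()}

if __name__ == "__main__":
    print(identify("we are going to the market", PROFILES))
```

A profile built this way works best on longer, monolingual passages. On short code-mixed messages the n-gram ranks become unstable, which is one reason word-level classifiers with richer features are explored for social media text.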
