An Extensive Text Mining Study for the Turkish Language: Author Recognition, Sentiment Analysis, and Text Classification

Durmuş Özkan Şahin, Erdal Kılıç
Copyright: © 2021 | Pages: 35
DOI: 10.4018/978-1-7998-4240-8.ch012

Abstract

In this study, the authors give both theoretical and experimental information about text mining, one of the topics of natural language processing. Three text mining problems are discussed for Turkish: news classification, sentiment analysis, and author recognition. The aim is to reduce the running time and increase the performance of machine learning algorithms. Four machine learning algorithms and two feature selection metrics are used to solve these text classification problems. The classification algorithms are random forest (RF), logistic regression (LR), naive Bayes (NB), and sequential minimal optimization (SMO). Chi-square and information gain are used as the feature selection metrics. The highest classification performance achieved in this study is an F-measure of 0.895, obtained with the SMO classifier and the information gain metric on news classification. The study is valuable as a comparison of the performance of classification algorithms and feature selection methods.
Chapter Preview

Introduction

With the proliferation of the internet, the use of computers, mobile phones and tablets is increasing, and the amount of data is growing day by day. One major source of this growth is unstructured textual documents: the amount of data produced and stored in textual format has increased significantly. For this reason, processing this data automatically and extracting meaningful information from it will help researchers to develop new products. Against this background, text mining, a sub-branch of data mining, has emerged, and researchers use its techniques to solve a range of problems.

Text categorization can include supervised and unsupervised learning problems (Aggarwal and Zhai, 2012; Kadhim, 2019; Dasgupta and Ng, 2009; Shafiabady et al., 2016). There is no training stage in unsupervised learning; clustering algorithms are examples of this approach. In supervised learning, on the other hand, there is a training stage. A classification algorithm builds a model, in effect a mathematical function, from the training data, and classification is then carried out with that model. In a supervised text classification approach, the texts are divided into two sets, namely training and test. Each classifier learns rules from the training set according to its own working principle, then applies these rules to the texts in the test set to classify them. There are many studies on text classification in the published literature (Sebastiani, 2002). Examples include:

  • Machine learning-based and text mining-based automatic electronic mail filtering (Clark et al., 2003)

  • Classification of webpages (Sun et al., 2002)

  • Author recognition (Stamatatos et al., 2000)

  • Automatic extraction of text summary (Salton et al., 1997)

  • Automatic question–answer system (Soricut and Brill, 2006)

  • Sentiment analysis on texts (Dos Santos and Gatti, 2014)

  • Document language identification (Artemenko et al., 2006)
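The supervised workflow described above (split the texts into training and test sets, learn from the first, classify the second) can be sketched in a few lines. This is a minimal illustration only: the toy corpus, the function names `train_nb` and `predict_nb`, and the fixed 6/2 split are assumptions made for the example, and the hand-written multinomial naive Bayes stands in for the classifiers used in the chapter rather than reproducing them.

```python
import math
from collections import Counter, defaultdict

# Toy labelled corpus (hypothetical examples, not the chapter's data sets).
DOCS = [
    ("takım maçı kazandı", "spor"),
    ("futbol maçı bu akşam", "spor"),
    ("takım yeni futbol sezonu", "spor"),
    ("borsa bugün yükseldi", "ekonomi"),
    ("dolar ve borsa düştü", "ekonomi"),
    ("merkez bankası faiz kararı", "ekonomi"),
    ("futbol takım maçı", "spor"),
    ("borsa faiz dolar", "ekonomi"),
]

def train_nb(samples):
    """Learn class counts and per-class word counts from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        class_counts[label] += 1
        for word in text.split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def predict_nb(model, text):
    """Return the label with the highest log-posterior (Laplace smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.split():
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# First six texts form the training set, the last two the test set.
train_set, test_set = DOCS[:6], DOCS[6:]
model = train_nb(train_set)
predictions = [predict_nb(model, text) for text, _ in test_set]
accuracy = sum(p == y for p, (_, y) in zip(predictions, test_set)) / len(test_set)
```

In practice the split would be randomized or cross-validated, and the test set must never be used while fitting the model.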

In this study, three different Turkish text classification applications were performed: news classification, author recognition and sentiment analysis. All operations needed to solve these problems, from the pre-processing step to obtaining the classification performance, are explained in detail. In this way, the reader is shown how to build a Turkish text classification application in any programming language. The chapter also explains which methods are used to improve the running time and performance of the classification algorithms. To increase classification performance, TF-IDF, a popular term weighting method, and classification algorithms with different working principles were used. Two different feature selection metrics were used to try to reduce the running time of the algorithms. In addition, the keywords extracted by the feature selection methods were compared and interpreted. By applying many methods to different text classification problems, this study contributes to the existing published literature.
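As a rough illustration of TF-IDF term weighting, the sketch below computes a weight for every term in every document. TF-IDF has several variants, and the preview does not say which one the chapter uses; the version here (raw term count multiplied by log(N / document frequency)), the function name `tf_idf` and the toy documents are assumptions made for the example.

```python
import math
from collections import Counter

def tf_idf(tokenized_docs):
    """Return one {term: weight} dict per document.

    Assumed variant: weight = raw term count * log(N / document frequency).
    """
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        weights.append(
            {term: count * math.log(n_docs / doc_freq[term])
             for term, count in counts.items()}
        )
    return weights

# Hypothetical pre-tokenized documents.
docs = [
    ["borsa", "faiz", "borsa"],
    ["futbol", "maç", "faiz"],
    ["futbol", "takım"],
]
weights = tf_idf(docs)
```

A term that is frequent in one document but rare across the collection ("borsa" in the first document) receives a higher weight than a term spread over several documents ("faiz").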

Key Terms in this Chapter

Feature Extraction: A method frequently used in machine learning and image processing applications. In the field of text mining, it can be thought of as obtaining the words in a document.

Stop Words: Words that occur so often in a language that they contribute little to understanding or distinguishing a text, and are therefore usually removed during pre-processing.
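A minimal sketch of stop-word removal follows. The short Turkish stop list, the function name `remove_stop_words` and the sample tokens are illustrative assumptions; a real application would use a much fuller list.

```python
# Small illustrative Turkish stop list (an assumption, not the chapter's list).
STOP_WORDS = {"ve", "bir", "bu", "ile", "için", "de", "da"}

def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop list."""
    return [token for token in tokens if token not in STOP_WORDS]

tokens = ["bu", "film", "çok", "güzel", "ve", "etkileyici", "bir", "yapım"]
filtered = remove_stop_words(tokens)
```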

Tokenization: Dividing a sentence into smaller meaningful units called tokens. Words and idioms are examples of tokens.
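One simple way to tokenize at the word level is a regular expression, sketched below. The function name `tokenize` and the sample sentence are assumptions; the pattern treats every run of letters or digits as a token, which is a simplification (for instance, it splits suffixes attached with an apostrophe and does not recognize multi-word idioms).

```python
import re

def tokenize(text):
    """Split text into word tokens; in Python 3, \\w also matches Turkish letters."""
    return re.findall(r"\w+", text)

tokens = tokenize("Bu film çok güzel, kesinlikle izleyin!")
```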

N-Gram: A contiguous sequence of n items, such as characters or words, taken from a text. For n equal to 1, 2, and 3, the n-gram is called a unigram, bigram, and trigram, respectively.
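The sliding-window idea behind n-grams can be sketched as follows. The function name `ngrams` and the sample inputs are assumptions for illustration; the same function works for word sequences and for the characters of a single word.

```python
def ngrams(items, n):
    """Return all contiguous n-item windows of a sequence as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

# Word bigrams of a tokenized sentence.
word_bigrams = ngrams(["bu", "film", "çok", "güzel"], 2)

# Character trigrams of a single word, joined back into strings.
char_trigrams = ["".join(gram) for gram in ngrams("kitap", 3)]
```

A sequence shorter than n yields an empty list.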

Feature Selection: Finding and selecting the most useful features in a data set. In other words, instead of using all the features in a data set, a subset of them is obtained and used. It can also be considered a dimension reduction technique.
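As an illustration of metric-based feature selection, the sketch below scores each term with the chi-square statistic computed from a 2x2 term/class contingency table and keeps the k best terms. The function names `chi_square` and `select_top_k`, the toy documents and the single-target-class setup are assumptions for the example, not the chapter's implementation.

```python
def chi_square(a, b, c, d):
    """Chi-square score of a term for a class from a 2x2 contingency table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class without the term
    d: documents outside the class without the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom

def select_top_k(docs, labels, target, k):
    """Rank all terms by chi-square against the target class; keep the top k."""
    vocab = sorted({term for doc in docs for term in doc})
    scores = {}
    for term in vocab:
        pairs = list(zip(docs, labels))
        a = sum(1 for doc, y in pairs if term in doc and y == target)
        b = sum(1 for doc, y in pairs if term in doc and y != target)
        c = sum(1 for doc, y in pairs if term not in doc and y == target)
        d = sum(1 for doc, y in pairs if term not in doc and y != target)
        scores[term] = chi_square(a, b, c, d)
    # Stable sort: ties keep alphabetical order.
    return sorted(vocab, key=lambda term: -scores[term])[:k]

# Hypothetical documents, each represented as a set of terms.
docs = [
    {"futbol", "maç", "bugün"},
    {"futbol", "takım"},
    {"borsa", "faiz", "bugün"},
    {"borsa", "dolar"},
]
labels = ["spor", "spor", "ekonomi", "ekonomi"]
selected = select_top_k(docs, labels, "spor", 2)
```

Note that chi-square measures dependence in either direction: "borsa" scores as high as "futbol" for the class "spor" because its absence is just as informative, while "bugün", which appears equally in both classes, scores zero.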

Unsupervised Learning: A machine learning technique used to infer an unknown structure from unlabeled data.

Feature: A property that characterizes a system, an object, or a class and distinguishes it from others.

Supervised Learning: A machine learning technique that learns, from labeled examples, a function mapping inputs to the desired outputs.
