An Extensive Text Mining Study for the Turkish Language: Author Recognition, Sentiment Analysis, and Text Classification

Durmuş Özkan Şahin (Ondokuz Mayıs University, Turkey) and Erdal Kılıç (Ondokuz Mayıs University, Turkey)

Source Title: Natural Language Processing for Global and Local Business

DOI: 10.4018/978-1-7998-4240-8.ch012

Abstract

In this study, the authors give both theoretical and experimental information about text mining, one of the topics of natural language processing. Three text mining problems are discussed for Turkish: news classification, sentiment analysis, and author recognition. The aim is to reduce the running time and increase the performance of machine learning algorithms. Four machine learning algorithms and two feature selection metrics are used to solve these text classification problems. The classification algorithms are random forest (RF), logistic regression (LR), naive Bayes (NB), and sequential minimal optimization (SMO). The chi-square and information gain metrics are used as the feature selection methods. The highest classification performance achieved in this study is 0.895 according to the F-measure metric. This result is obtained by using the SMO classifier and the information gain metric for news classification. The study is important in that it compares the performance of classification algorithms and feature selection methods.

Introduction

With the proliferation of the internet, the use of computers, mobile phones, and tablets is increasing, and the amount of data is growing day by day. One source of this growth is unstructured textual documents: the amount of data produced and stored in textual format has increased significantly. For this reason, automatically processing this data with computers and extracting meaningful information from it will help researchers develop new products. This need gave rise to text mining, a sub-branch of data mining, whose techniques researchers use to solve a range of problems.

Text categorization can be approached as a supervised or an unsupervised learning problem (Aggarwal and Zhai, 2012; Kadhim, 2019; Dasgupta and Ng, 2009; Shafiabady et al., 2016). Unsupervised learning has no training stage; clustering algorithms are examples of this approach. Supervised learning, by contrast, has a training stage: a classification algorithm builds a mathematical model from the training data, and classification is then carried out with that model. In a supervised text classification approach, the texts are divided into two parts, a training set and a test set. Each classifier learns rules from the training set according to its own working principle and then applies these rules to the texts in the test set to classify them. There are many studies in the published literature on text classification (Sebastiani, 2002). Examples include:

• Machine learning-based and text mining-based automatic electronic mail filtering (Clark et al., 2003)
• Classification of webpages (Sun et al., 2002)
• Author recognition (Stamatatos et al., 2000)
• Automatic extraction of text summary (Salton et al., 1997)
• Automatic question–answer system (Soricut and Brill, 2006)
• Sentiment analysis on texts (Dos Santos and Gatti, 2014)
• Document language identification (Artemenko et al., 2006)
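
The supervised workflow described before the list (learn from a training set, then classify a test set) can be illustrated with a minimal sketch. It assumes a multinomial naive Bayes classifier with Laplace smoothing written in plain Python; the function names and the four toy headlines are invented for illustration and are not the chapter's data or implementation.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Learn class frequencies and per-class word counts from tokenized texts."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict(model, tokens):
    """Return the class with the highest log-posterior (Laplace smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok in vocab:  # words never seen in training are ignored
                count = word_counts[label][tok]
                score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy, invented data (assumption): two news classes, already tokenized.
train_docs = [
    "takım maçı kazandı".split(),
    "futbol maçı berabere bitti".split(),
    "borsa bugün yükseldi".split(),
    "dolar ve borsa düştü".split(),
]
train_labels = ["spor", "spor", "ekonomi", "ekonomi"]
test_docs = ["takım futbol maçı".split(), "dolar yükseldi".split()]
test_labels = ["spor", "ekonomi"]

model = train_naive_bayes(train_docs, train_labels)      # training stage
predictions = [predict(model, d) for d in test_docs]     # testing stage
accuracy = sum(p == t for p, t in zip(predictions, test_labels)) / len(test_labels)
print(predictions, accuracy)
```

The test texts are never seen during training, which is what makes the measured accuracy an estimate of how the learned rules generalize.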

In this study, three Turkish text classification applications were carried out: news classification, author recognition, and sentiment analysis. All operations needed to solve these problems, from the pre-processing step to obtaining the classification performance, are explained in detail. In this way, the reader is shown how to build a Turkish text classification application in any programming language. The study also explains which methods are used to improve the running time and performance of the classification algorithms. To increase classification performance, TF-IDF, a popular term weighting method, was used together with classification algorithms that have different working principles. Two feature selection metrics were used to try to reduce the running time of the algorithms. In addition, the keywords extracted by the feature selection methods were compared and interpreted. By applying many methods to different text classification problems, this study contributes to the existing published literature.
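
As a rough illustration of the two ideas in this paragraph, the sketch below weights terms with TF-IDF and ranks them with the chi-square statistic. It assumes the common `tf * log(N / df)` form of TF-IDF and a 2x2 contingency table for chi-square; the chapter's exact variants are not given in this preview, and the function names and toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Weight each term of each tokenized document by tf * log(N / df)."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    return [
        {term: count * math.log(n_docs / df[term]) for term, count in Counter(doc).items()}
        for doc in docs
    ]

def chi_square(docs, labels, term, target):
    """Chi-square statistic of the 2x2 table (term present?, class == target?)."""
    a = b = c = d = 0
    for doc, label in zip(docs, labels):
        has_term, in_class = term in doc, label == target
        if has_term and in_class:
            a += 1
        elif has_term:
            b += 1
        elif in_class:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

# Toy, invented data (assumption): four tokenized headlines in two classes.
docs = [
    "takım maçı kazandı".split(),
    "futbol maçı berabere bitti".split(),
    "borsa bugün yükseldi".split(),
    "dolar ve borsa düştü".split(),
]
labels = ["spor", "spor", "ekonomi", "ekonomi"]

weights = tf_idf(docs)
vocab = sorted({term for doc in docs for term in doc})
# With only two classes the statistic is the same for either target class.
scores = {term: chi_square(docs, labels, term, "spor") for term in vocab}
# Keep the two highest-scoring terms; ties are broken alphabetically.
top_terms = sorted(vocab, key=lambda t: (-scores[t], t))[:2]
print(top_terms)
```

Keeping only the highest-scoring terms shrinks the feature space the classifier has to process, which is how feature selection can reduce running time.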

Key Terms in this Chapter

Feature Extraction: A method frequently used in machine learning and image processing applications. In the field of text mining, it can be thought of as obtaining the words in the document.

Stop Words: Words that are used so often that they contribute little to understanding the content of a text.

Tokenization: Dividing a sentence into smaller meaningful units called tokens. Words and idioms are examples of tokens.

N-Gram: A contiguous sequence of n items, such as the characters of a word or the words of a sentence. For n equal to 1, 2, and 3, an n-gram is called a unigram, bigram, and trigram, respectively.

Feature Selection: Finding and selecting the most useful features in a data set. In other words, instead of using all the features in a data set, a subset of the features is obtained and used. It can also be considered a dimension reduction technique.

Unsupervised Learning: A machine learning technique used to estimate an unknown structure from unlabeled data.

Feature: A structure that characterizes a system, an object, or a class and makes it distinct.

Supervised Learning: A machine learning technique that generates a function to match the inputs to the desired outputs.
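
The tokenization, stop-word, and n-gram terms above can be made concrete with a short sketch. Everything in it is an assumption for illustration: the regular-expression tokenizer, the six-word stop list, the helper names, and the example sentence are not taken from the chapter.

```python
import re

# Tiny illustrative stop-word list (assumption); real systems use a fuller Turkish list.
STOP_WORDS = {"ve", "bir", "bu", "ile", "de", "da"}

def turkish_lower(text):
    """Lowercase with Turkish casing: 'I' maps to dotless 'ı' and 'İ' maps to 'i'."""
    return text.replace("İ", "i").replace("I", "ı").lower()

def tokenize(text):
    """Split text into runs of letters; digits, underscores, and punctuation separate tokens."""
    return re.findall(r"[^\W\d_]+", turkish_lower(text))

def remove_stop_words(tokens):
    """Drop very frequent words that carry little content."""
    return [tok for tok in tokens if tok not in STOP_WORDS]

def ngrams(items, n):
    """Return all contiguous n-item windows of a sequence as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

sentence = "Bu takım İstanbul'da bir maç kazandı ve kutladı."
tokens = remove_stop_words(tokenize(sentence))
word_bigrams = ngrams(tokens, 2)                                   # word-level bigrams
char_trigrams = ["".join(g) for g in ngrams(list(tokens[0]), 3)]   # character trigrams

print(tokens)
print(word_bigrams)
print(char_trigrams)
```

The explicit `turkish_lower` helper is there because Python's default `str.lower()` follows language-neutral casing rules, which do not map `I` to `ı` as Turkish does.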
