Research on Text Classification Based on Automatically Extracted Keywords

Pin Ni, Yuming Li, Victor Chang
Copyright: © 2020 |Pages: 16
DOI: 10.4018/IJEIS.2020100101

Abstract

Automatic keyword extraction and classification are important research directions in the domains of NLP (natural language processing), information retrieval, and text mining. As fine-grained units abstracted from text data, keywords are also the most important feature of text data and have great practical and potential value in document classification, topic modeling, information retrieval, and other areas. Keywords carry a large amount of significant information, so they provide a compact representation of documents. Therefore, they may be quite advantageous for text classification in a high-dimensional feature space. For this reason, this study designed a supervised keyword classification method based on TextRank automatic keyword extraction and optimized the model with a genetic algorithm, contributing to topic keyword modeling for text classification.

1. Introduction

Keywords, key sentences, and key paragraphs are important features of text data that can reflect the topic of a document to a certain extent (Beliga, Meštrović, & Martinčić-Ipšić, 2015). Automated keyword extraction can extract the most significant key information from specific documents, thus speeding up the abstraction of specific descriptive instances from massive text data. In addition, these keywords, as fine-grained text, can be grouped at a more macroscopic level into different categories. This classification method can be used not only for keyword topic modeling but also for text categorization tasks based on high-dimensional feature space (Onan, Korukoğlu, & Bulut, 2016), (Lautenbacher, Bauer, Sieber, & Cabral, 2010). It could enable more accurate word-of-mouth text classification (Jansen, Zhang, Sobel, & Chowdury, 2009), (Hung & Lin, 2013), topic feature analysis of social relationships (Hauffa, Lichtenberg, & Groh, 2012), keyword classification (Fernando, 2018), document classification (Onan et al., 2016), (Hu et al., 2018), (Puri & Singh, 2019), recommendation based on user interest features (W. Wu, Zhang, & Ostendorf, 2010), (Meng & Gao, 2019), etc.

Text categorization is a modeling method for categorizing documents according to preset categories (Liu & Wang, 2007), (Schütze, Manning, & Raghavan, 2007), (Y.-C. Wu, 2015). It has been widely used in text mining (Al-Thuhli, Al-Badawi, Baghdadi, & Al-Hamdani, 2017), covering information retrieval (Boughareb & Farah, 2013), (Ghnemat & Shaout, 2016), sentiment analysis (Jain, Kumar, & Mahanti, 2018), topic mining, document organization, spam filtering, news classification, etc. (Aggarwal & Zhai, 2012). However, there are still many difficulties in text categorization in high-dimensional feature space (Joachims, 2002). When all the words in a document serve as training features, the computational complexity increases greatly, turning text categorization into a computationally intensive task (Onan et al., 2016). Therefore, as the most relevant features of documents and, to a certain extent, a reasonable form of dimensionality reduction, keywords can be relatively ideal feature candidates in classification modeling (Liu & Wang, 2007), (Rossi, Marcacini, & Rezende, 2014). From the perspective of classification accuracy, text classification with keywords as features may be an effective approach that is worth exploring. From the perspective of practical application, moving from the micro to the macro level, the keyword-based approach is also better suited to real information retrieval scenarios, where the user inputs fine-grained features (e.g., words, characters, punctuation marks) to match the corresponding text instance more accurately.
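The dimensionality argument above can be illustrated with a toy bag-of-words comparison. This is only a sketch: the sample sentence, the hand-picked `keyword_vocabulary`, and the helper `bag_of_words` are hypothetical and are not taken from the study.

```python
def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary term occurs in the token list."""
    index = {term: i for i, term in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for token in tokens:
        if token in index:
            vector[index[token]] += 1
    return vector


# Hypothetical one-sentence "document", tokenized on whitespace.
doc = "the genetic algorithm tunes the weights of the keyword classifier".split()

# Feature space built from every distinct word in the document.
full_vocabulary = sorted(set(doc))

# Feature space restricted to (hypothetical) extracted keywords.
keyword_vocabulary = ["algorithm", "classifier", "genetic", "keyword"]

full_vector = bag_of_words(doc, full_vocabulary)
keyword_vector = bag_of_words(doc, keyword_vocabulary)

print(len(full_vector), len(keyword_vector))
```

On a real corpus the full vocabulary typically grows to many thousands of terms, while a keyword vocabulary stays far smaller, which is the reduction in training cost the paragraph refers to.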

For this reason, this study designed a supervised keyword classification method based on TextRank automatic keyword extraction and optimized the model with the Genetic Algorithm, contributing to text classification by modeling the features of the topic. The method improved accuracy compared with conventional classification and clustering methods. It also addressed the problem that conventional methods lack mechanisms for self-renewing keywords and self-adjusting classification weights, so that the keyword topic model can gradually improve as new data are input. The study also compared the performance of other commonly used classification methods on keyword classification. Experiments show that the proposed method achieves ideal performance on test datasets of the ACM collection (Table 5).
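The TextRank extraction step named above could be sketched as follows. This is a minimal illustration of the general TextRank idea (a PageRank-style score computed over a word co-occurrence graph), not the authors' implementation. The stopword list, window size, damping factor, and iteration count are assumptions chosen for the example, and the supervised classifier and genetic-algorithm optimization are not shown.

```python
import re
from collections import defaultdict

# Assumed minimal stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for",
             "on", "with", "as", "can", "be", "from"}


def textrank_keywords(text, top_k=5, window=2, damping=0.85, iterations=50):
    """Rank words by a PageRank-style score on a co-occurrence graph."""
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOPWORDS]

    # Link each word to the distinct words that follow it within the window.
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != word:
                neighbors[word].add(words[j])
                neighbors[words[j]].add(word)

    # Iteratively redistribute each word's score evenly among its neighbors.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        updated = {}
        for word in neighbors:
            incoming = sum(scores[n] / len(neighbors[n])
                           for n in neighbors[word])
            updated[word] = (1 - damping) + damping * incoming
        scores = updated

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


sample = "keyword extraction graph ranking graph classification graph model"
print(textrank_keywords(sample, top_k=3))
```

In a pipeline like the one described, the returned keywords would then serve as the features passed to a supervised classifier.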
