1. Introduction
Keywords, key sentences, and key paragraphs are important features of text data that reflect the topic of a document to a certain extent (Beliga, Meštrović, & Martinčić-Ipšić, 2015). Automated keyword extraction identifies the most significant information in a given document, which speeds up the abstraction of specific descriptive instances from massive text data. In addition, keywords, as fine-grained units of text, can themselves be grouped at a coarser level into different categories. Such classification can be used not only for keyword topic modeling but also for text categorization tasks based on a high-dimensional feature space (Onan, Korukoğlu, & Bulut, 2016), (Lautenbacher, Bauer, Sieber, & Cabral, 2010). It can support more accurate word-of-mouth text classification (Jansen, Zhang, Sobel, & Chowdury, 2009), (Hung & Lin, 2013), topic feature analysis of social relationships (Hauffa, Lichtenberg, & Groh, 2012), keyword classification (Fernando, 2018), document classification (Onan et al., 2016), (Hu et al., 2018), (Puri & Singh, 2019), recommendation based on user interest features (W. Wu, Zhang, & Ostendorf, 2010), (Meng & Gao, 2019), etc.
Text categorization is a modeling method for assigning documents to preset categories (Liu & Wang, 2007), (Schütze, Manning, & Raghavan, 2007), (Y.-C. Wu, 2015). It has been widely used in text mining (Al-Thuhli, Al-Badawi, Baghdadi, & Al-Hamdani, 2017), covering information retrieval (Boughareb & Farah, 2013), (Ghnemat & Shaout, 2016), sentiment analysis (Jain, Kumar, & Mahanti, 2018), topic mining, document organization, spam filtering, news classification, etc. (Aggarwal & Zhai, 2012). However, text categorization in a high-dimensional feature space still poses many difficulties (Joachims, 2002). When every word in a document serves as a training feature, the computational complexity increases greatly, turning text categorization into a computationally intensive task (Onan et al., 2016). Keywords are the features most relevant to a document and, to a certain extent, a reasonable form of dimensionality reduction, so they are relatively ideal feature candidates for classification modeling (Liu & Wang, 2007), (Rossi, Marcacini, & Rezende, 2014). From the perspective of classification accuracy, text classification with keywords as features may be an effective approach that is worth exploring. From the perspective of practical application, moving from the micro to the macro level, the keyword-based approach is also better suited to real information retrieval scenarios, in which the user inputs fine-grained features (e.g., words, characters, punctuation marks) to match the corresponding text instance more accurately.
For this reason, this study designed a supervised keyword classification method based on TextRank automatic keyword extraction and optimized the model with a Genetic Algorithm, in order to contribute to text classification by modeling topic features. The method improves accuracy compared with conventional classification and clustering methods. It also addresses the problem that conventional methods lack mechanisms for self-renewing keywords and self-adjusting classification weights, so that the keyword topic model can gradually improve as new data are input. The study also compared the performance of other commonly used classification methods on keyword classification. Experiments show that the proposed method performs well on test datasets from the ACM collection (Table 5).
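As a rough illustration of the TextRank extraction step mentioned above, the following is a minimal sketch of the standard TextRank idea: words become nodes in a co-occurrence graph and are ranked by a PageRank-style iteration. It is not the implementation used in this study. The function name `textrank_keywords` and the parameters `window`, `damping`, `iterations`, and `top_k` are assumptions chosen for illustration, and the supervised classification and Genetic Algorithm optimization steps are not shown.

```python
from collections import defaultdict


def textrank_keywords(tokens, window=2, damping=0.85, iterations=50, top_k=3):
    """Rank words with a PageRank-style iteration over a co-occurrence graph.

    Illustrative sketch only: names and default values are assumptions,
    not settings taken from the study.
    """
    # Build an undirected graph: link each word to the distinct words
    # that appear within the next `window` positions.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != word:
                neighbors[word].add(tokens[j])
                neighbors[tokens[j]].add(word)

    # Start every word with the same score, then repeatedly let each word
    # collect an equal share of the score of each of its neighbors.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        updated = {}
        for word in neighbors:
            received = sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            updated[word] = (1 - damping) + damping * received
        scores = updated

    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


if __name__ == "__main__":
    sample = "text classification uses keyword features and keyword extraction".split()
    print(textrank_keywords(sample))
```

In practice the token list would first be filtered (for example by removing stop words or keeping only certain parts of speech), and the top-ranked words would then serve as candidate features for a downstream classifier.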