A Comparative Evaluation of Different Keyword Extraction Techniques


Raj Kishor Bisht
Copyright: © 2022 | Pages: 17
DOI: 10.4018/IJIRR.289573

Abstract

Retrieving keywords from a text has attracted researchers for a long time, as it forms the basis for many natural language applications such as information retrieval, text summarization, and document categorization. A text is a collection of words that naturally represent its theme, and bringing this naturalness under formal rules is itself a challenging task. In the present paper, the authors evaluate different spatial-distribution-based keyword extraction methods available in the literature on three standard scientific texts. To reduce complexity, the authors restrict the evaluation to the first few high-frequency words, as all the methods are to some extent frequency-based. The authors find that these methods do not provide good results, particularly for the first few retrieved words. They therefore propose a new measure based on frequency, inverse document frequency, variance, and Tsallis entropy. The different methods are evaluated on the basis of precision, recall, and F-measure. Results show that the proposed method provides improved results.

1. Introduction

A text is a collection of words. A major part of any text consists of function words, which are necessary to make sentences meaningful and grammatically correct. Many other words in the text relate to its theme. These words carry important information about the text, and this information is useful in many tasks such as information retrieval, natural language processing, text summarization, and document categorization. Such words can be described as keywords. Thus, the automatic extraction of keywords is an important research direction in the field of text mining. The goal of keyword extraction is to find the words that are sufficiently informative to represent the text. Defining a generalized rule for every text is challenging, as different texts may have different linguistic features. To address these challenges, researchers have made continuous efforts to establish relationships between linguistic features and the laws of mathematics and physics. Keyword extraction methods fall into three broad categories: linguistic, machine learning, and statistical methods. Linguistic methods focus on syntactic and semantic aspects of words, morphological features, and linguistic relationships among words such as synonymy, hypernymy, and hyponymy. In machine learning methods, a learning algorithm is first trained on a tagged training set and its performance is then evaluated on a tagged test set. The weighting of words in a text plays an important role in information retrieval. Early weighting schemes were defined in terms of the frequency of words in a text; term frequency (tf) and inverse document frequency (idf) were the first weighting schemes used for weighting words.
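The tf-idf weighting mentioned above can be sketched as follows. This is a minimal illustration of the standard scheme (raw term frequency normalized by document length, multiplied by the logarithmic inverse document frequency), not the authors' specific formulation; the function name and input format are chosen for illustration only.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Compute tf-idf weights for every term in every document.

    documents: list of token lists (pre-tokenized texts).
    Returns one {term: weight} dict per document.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in documents:
        df.update(set(doc))

    weights = []
    for doc in documents:
        tf = Counter(doc)
        weights.append({
            # tf (normalized by document length) times idf.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights
```

A term that occurs in every document gets idf = log(1) = 0 and is weighted out, which is how the scheme suppresses function words that appear everywhere.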

Luhn (1958) introduced an early notion of the importance of words in a text, building on Zipf's analysis of word frequency. Since then, a number of approaches for measuring the importance of words in a text have appeared in the literature. Details of weighting schemes in information retrieval can be found in the books by Dominich (2008) and Manning and Schütze (1999). Earlier methods were based on the frequency of words in a text; later, many other aspects were considered by different researchers. Turney (2000) applied a supervised learning approach to keyword extraction. Ortuño et al. (2002) used the standard deviation of the distance between successive occurrences of a word as a parameter for extracting keywords; they found that relevant words have a greater standard deviation, as their spatial distribution is more inhomogeneous than that of irrelevant words. Hulth (2003) suggested a keyword extraction method based on linguistic knowledge such as syntactic features. Studies on the fractal structure of text can be found in Andres et al. (2010) and Andres et al. (2011). Yang et al. (2013) used the Shannon entropy difference between the intrinsic and extrinsic modes to determine the relevance of words in a text. Najafi and Darooneh (2015) used the concept of fractal dimension for keyword extraction. Jamaati and Mehri (2018) used Tsallis entropy to rank the relevance of terms, taking advantage of the spatial correlation length. Mehri et al. (2019) used distorted entropy for word ranking.
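The spatial-distribution idea attributed to Ortuño et al. above can be sketched as follows: collect the positions of a word, take the gaps between successive occurrences, and measure how unevenly they are spread. This is a simplified illustration (the normalization and boundary treatment in the original paper differ); the function name is chosen for illustration only.

```python
import statistics

def spatial_sigma(tokens, word):
    """Standard deviation of the gaps between successive occurrences
    of `word`, normalized by the mean gap. Clustered (inhomogeneous)
    words score high; evenly spread words score near zero.
    """
    positions = [i for i, t in enumerate(tokens) if t == word]
    if len(positions) < 2:
        return 0.0  # Not enough occurrences to measure a spread.
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    mean_gap = statistics.mean(gaps)
    # Dividing by the mean gap makes words of different
    # frequencies comparable.
    return statistics.pstdev(gaps) / mean_gap
```

A function word tends to appear at roughly regular intervals (near-equal gaps, sigma near zero), while a topical word clusters in the passages that discuss it, producing a mix of short and long gaps and hence a large sigma.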
