1. Introduction
A text is a collection of words. A large part of any text consists of function words, which are necessary to make sentences meaningful and grammatically correct. Beyond these, an author uses many other words related to the theme of the text. Such words carry important information about the text, and this information is useful in many tasks such as information retrieval, natural language processing, text summarization, and document categorization. These words can be described as keywords. The automatic extraction of keywords is therefore an important research direction in the field of text mining.

Keyword extraction is the task of finding the words that are sufficiently informative to represent a text. It is challenging to define a rule that generalizes to every text, since different texts may have different linguistic features. To address these challenges, researchers have made continuous efforts to establish relationships among linguistic features and the laws of mathematics and physics. Keyword extraction methods fall under three broad categories: linguistic, machine learning, and statistical methods. Linguistic methods focus on the syntactic and semantic aspects of words, their morphological features, and linguistic relationships among words such as synonymy, hypernymy, and hyponymy. In machine learning methods, a learning algorithm is first trained on a tagged training set and then evaluated on a tagged test set.

The weighting of words in a text plays an important role in information retrieval. Weighting schemes were initially defined in terms of the frequency of words in a text. Term frequency (tf) and inverse document frequency (idf) were the first weighting schemes used for the weighting of words.
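As an illustration, the classical tf-idf weighting mentioned above can be sketched in a few lines of Python. This is a minimal sketch, not any particular paper's implementation; the toy corpus and whitespace tokenization are invented for the example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute tf-idf weights for each word in each tokenized document.

    tf(w, d) = count of w in d / total words in d
    idf(w)   = log(N / number of documents containing w)
    """
    n_docs = len(docs)
    # document frequency: in how many documents each word appears
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            w: (c / total) * math.log(n_docs / df[w])
            for w, c in counts.items()
        })
    return weights

# toy corpus, tokenized by whitespace (illustrative only)
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "entropy measures information".split(),
]
w = tf_idf(docs)
```

A rare, topical word such as "entropy" (appearing in one document) receives a higher weight than a function word such as "the" (appearing in most documents), which is exactly the intuition behind idf.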
Luhn (1958) introduced an early idea of the importance of words in a text, building on Zipf's analysis of word frequency. Since then, a number of approaches for measuring the importance of words in a text have appeared in the literature; detailed accounts of weighting schemes in information retrieval can be found in the books of Dominich (2008) and Manning and Schütze (1999). Earlier methods were based on the frequency of words in a text; later, many other aspects were considered by different researchers. Turney (2000) applied a supervised learning approach to keyword extraction. Ortuño et al. (2002) used the standard deviation of the distance between successive occurrences of a word as a parameter to extract keywords; they found that relevant words have a greater standard deviation, since their spatial distribution is more inhomogeneous than that of irrelevant words. Hulth (2003) suggested a keyword extraction method based on linguistic knowledge such as syntactic features. Studies of the fractal structure of text can be found in Andres et al. (2010) and Andres et al. (2011). Yang et al. (2013) used the difference in Shannon entropy between the intrinsic and extrinsic modes of a word to determine its relevance in a text. Najafi and Darooneh (2015) used the concept of fractal dimension for keyword extraction. Jamaati and Mehri (2018) used Tsallis entropy to rank the relevance of terms, taking advantage of the spatial correlation length. Mehri et al. (2019) used distorted entropy for word ranking.
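The spacing-based idea attributed above to Ortuño et al. (2002) can be illustrated with a toy sketch: collect the positions of a word, take the gaps between successive occurrences, and measure how inhomogeneous those gaps are. Normalizing the standard deviation by the mean gap, so that frequency alone does not dominate, is one common variant of this measure; the exact formula of the original paper may differ.

```python
import statistics

def spacing_sigma(tokens, word):
    """Normalized standard deviation of the gaps between successive
    occurrences of `word` in `tokens`. Clustered (relevant) words have
    inhomogeneous spacings and hence a larger value than words spread
    uniformly through the text."""
    positions = [i for i, t in enumerate(tokens) if t == word]
    if len(positions) < 3:
        return 0.0  # too few occurrences to estimate a spread
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    mean_gap = statistics.mean(gaps)
    return statistics.pstdev(gaps) / mean_gap if mean_gap else 0.0

# synthetic text: "uniform" occurs at regular intervals,
# "clustered" bunches up early and reappears once near the end
tokens = ["x"] * 40
for i in (0, 10, 20, 30):
    tokens[i] = "uniform"
for i in (1, 2, 3, 35):
    tokens[i] = "clustered"
```

Both words occur four times, yet the clustered word scores higher, mirroring the observation that topically relevant words tend to appear in bursts while irrelevant words are spread evenly.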