Unsupervised Keyword Extraction Methods Based on a Word Graph Network

Hongbin Wang, Jingzhen Ye, Zhengtao Yu, Jian Wang, Cunli Mao
Copyright: © 2020 | Pages: 12
DOI: 10.4018/IJACI.2020040104

Abstract

Supervised keyword extraction methods usually require a large human-annotated corpus to train the model. The expense of manual labeling has made unsupervised techniques based on word graph networks attractive. Traditional word graph networks consider only the co-occurrence relationships between words or the topological structure of the network, ignoring the influence of semantic relations between words on keyword extraction. To address this problem, an unsupervised keyword extraction method based on word graph networks is proposed for both Chinese and English. The method uses word embeddings to compute a “word attraction score” that captures the semantic relevance between words in a document. The bias weight of each node is then combined with a weighted PageRank algorithm to compute the final word scores. The experimental results demonstrate that the method is more effective than traditional methods.
Introduction

The term “keyword” refers to a key word or phrase that is extracted directly from the title or content of a document. Because they are simple and objective, keywords give a concise representation of a text and effectively reflect its theme. Keywords currently provide the foundation for many natural language processing sub-fields, such as text classification, clustering, information extraction, recommendation systems, and automatic text summarization (Chen, Jiang, & Bian, 2014). Before the advent of automatic keyword extraction technology, the task was performed manually, which is both inefficient and time-consuming. Furthermore, when a large corpus is processed by several people at once, the way keywords are extracted varies from person to person, which enlarges the set of labels and reduces the accuracy with which they describe the text. Many keyword extraction algorithms based on supervised or unsupervised learning have been proposed (Hasan & Ng, 2014).

In supervised learning, keyword extraction is treated as a binary classification problem: each candidate keyword is classified as either true (i.e., keyword) or false (i.e., non-keyword) (Hulth, 2003). This approach uses text with manually tagged keywords as training data and applies classification algorithms, such as decision trees, support vector machines, and logistic regression, to extract keywords (Jiang, Hu, & Li, 2009). Although supervised methods often outperform unsupervised methods, they require a large manually annotated corpus. Research into improving unsupervised methods is therefore attractive because no manual annotation is needed (Florescu & Caragea, 2017).

In unsupervised learning, keyword extraction algorithms can be divided into three categories: those based on (1) statistical features, (2) topic models, and (3) graph models.

Statistics-based methods do not require a labeled training corpus. Document keywords are usually extracted using features such as the frequency of a word in the document, the length of the word, and its position. The drawback of this approach is that in some specialized documents, such as biology and medicine journal articles, a keyword may appear only once. In this case, the statistical model considers the word unimportant and therefore ignores it (Chen & Lin, 2010).
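The limitation described above can be illustrated with a minimal sketch. The scoring function below is a hypothetical one invented for illustration (relative frequency multiplied by an early-position bonus); it is not the formula of any cited method, and the function name and toy sentence are assumptions.

```python
from collections import Counter


def statistical_scores(tokens):
    """Score each word by relative frequency times an early-position bonus.

    Hypothetical scoring for illustration only, not a published formula.
    """
    counts = Counter(tokens)
    total = len(tokens)
    first_pos = {}
    for i, tok in enumerate(tokens):
        first_pos.setdefault(tok, i)  # remember the first occurrence only
    scores = {}
    for word, count in counts.items():
        tf = count / total
        position_bonus = 1.0 - first_pos[word] / total  # earlier words score higher
        scores[word] = tf * (1.0 + position_bonus)
    return scores


# Toy document: "kinase" may be the domain keyword, but it occurs only once.
tokens = "protein binds receptor protein signals cell protein kinase".split()
scores = statistical_scores(tokens)
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy example the frequent word "protein" ranks first and the single-occurrence word "kinase" ranks last, which is the behavior the paragraph above identifies as a weakness for specialized documents.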

In keyword extraction methods based on topic models, a document is viewed as a mixture of topics, and the probability that a word appears differs from topic to topic. Therefore, once the document topics are determined, the representative words of each topic capture the core content of the document and can be taken as keywords. TopicRank (Bougouin, Boudin, & Daille, 2013) and hierarchical clustering are used to group candidate words, and document keywords are then obtained through algorithms such as PageRank. However, because they rely on a mixture of topics, these methods perform well on long documents but are difficult to extend to short documents.

Keyword extraction based on a graph model constructs a semantically weighted network of the document and then, by analyzing this word graph network, identifies its important nodes as keywords. This method takes into account the relationships between words (e.g., co-occurrence frequency) together with other statistical features, which leads to better extraction results (Chang, Zhang, Wang, Wan, & Xiao, 2018). TextRank demonstrates the scalability and accuracy of the word graph network (Mihalcea & Tarau, 2004).
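As a rough sketch of the graph-based idea, the code below builds a co-occurrence word graph with a sliding window and ranks the nodes with a weighted PageRank iteration in the spirit of TextRank. The window size, damping factor, iteration count, and function names are assumptions chosen for illustration, and a real pipeline would normally filter stop words and parts of speech first.

```python
from collections import defaultdict


def build_cooccurrence_graph(tokens, window=2):
    """Undirected graph whose edge weight counts co-occurrences within `window` tokens."""
    graph = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            v = tokens[j]
            if v != w:  # skip self-loops
                graph[w][v] += 1.0
                graph[v][w] += 1.0
    return graph


def weighted_pagerank(graph, damping=0.85, iterations=50):
    """Power iteration; each node passes rank to neighbors in proportion to edge weight."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    strength = {n: sum(graph[n].values()) for n in nodes}
    for _ in range(iterations):
        new_rank = {}
        for n in nodes:
            incoming = sum(rank[m] * graph[m][n] / strength[m] for m in graph[n])
            new_rank[n] = (1 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank


tokens = "graph model keyword graph network keyword graph node".split()
graph = build_cooccurrence_graph(tokens, window=2)
rank = weighted_pagerank(graph)
top_words = sorted(rank, key=rank.get, reverse=True)[:2]
```

Here the word "graph" receives the highest score because it has the strongest co-occurrence links. Edge weights in this sketch come from counts alone, so semantic relations between words play no role.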

In this research, the word graph network is built first. The word attraction score, which combines Word2vec with the Dice coefficient (Dice, 1945), is then used to capture the semantic features between words. Next, the bias weight of each node is added to the PageRank algorithm (Brin & Page, 1998). Finally, the words are ranked by the PageRank scores obtained after multiple iterations, and the top-K are extracted as the document keywords.
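The pipeline outlined above can be sketched as follows. The preview does not give the exact formulas, so several parts are assumptions: the way the Dice coefficient is combined with embedding distance (an inverse-square "force" term), the use of word frequency as the node bias, and the toy two-dimensional vectors standing in for Word2vec embeddings. Only the Dice coefficient itself, 2 * cooc / (freq_i + freq_j), is standard.

```python
import math


def dice(cooc, freq_i, freq_j):
    """Dice coefficient of two words from co-occurrence and individual counts."""
    return 2.0 * cooc / (freq_i + freq_j)


def attraction_score(cooc, freq_i, freq_j, vec_i, vec_j):
    """Assumed combination: Dice coefficient times an inverse-square 'force'."""
    dist = max(math.dist(vec_i, vec_j), 1e-9)  # guard against identical vectors
    force = freq_i * freq_j / dist ** 2
    return dice(cooc, freq_i, freq_j) * force


def biased_pagerank(weights, bias, damping=0.85, iterations=100):
    """Weighted PageRank whose teleport step follows a node bias (assumed form)."""
    nodes = list(weights)
    total_bias = sum(bias[n] for n in nodes)
    teleport = {n: bias[n] / total_bias for n in nodes}
    strength = {n: sum(weights[n].values()) for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        rank = {
            n: (1 - damping) * teleport[n]
            + damping * sum(rank[m] * weights[m][n] / strength[m] for m in weights[n])
            for n in nodes
        }
    return rank


# Toy inputs (all hypothetical): word counts, 2-D stand-in embeddings, pair counts.
freq = {"keyword": 3, "extraction": 2, "graph": 2}
vectors = {"keyword": (0.0, 1.0), "extraction": (0.0, 2.0), "graph": (3.0, 1.0)}
cooc = {("keyword", "extraction"): 2, ("keyword", "graph"): 1, ("extraction", "graph"): 1}

weights = {w: {} for w in freq}
for (a, b), count in cooc.items():
    score = attraction_score(count, freq[a], freq[b], vectors[a], vectors[b])
    weights[a][b] = score
    weights[b][a] = score

rank = biased_pagerank(weights, bias=freq)
top_k = sorted(rank, key=rank.get, reverse=True)[:2]
```

In this sketch, words that co-occur often and lie close together in the embedding space are joined by heavier edges, and the bias shifts the teleport probability toward the nodes it favors. Replacing the toy vectors with trained Word2vec embeddings would be the natural next step.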
