Document keyphrases provide semantic metadata that characterizes a document and gives an overview of its content. This chapter describes the Keyphrase Identification Program (KIP), which extracts document keyphrases by using prior positive samples of human-identified domain keyphrases to assign weights to candidate keyphrases. The logic of the algorithm is that the more keywords a candidate keyphrase contains, and the more significant those keywords are, the more likely the candidate is a keyphrase. To obtain human-identified positive inputs, KIP first populates its glossary database with manually identified keyphrases and keywords. It then checks the composition of every noun phrase extracted from a document, looks them up in the database, and calculates a score for each noun phrase. The noun phrases with the highest scores are extracted as keyphrases. KIP’s learning function can enrich the glossary database by automatically adding newly identified keyphrases to it.
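The scoring idea described above can be illustrated with a minimal Python sketch. It is not KIP's actual implementation: the weighting scheme (a keyword's weight is the number of glossary keyphrases that contain it), the `phrase_bonus` parameter, and all function names are assumptions made for illustration. Candidate noun phrases are assumed to have been extracted already.

```python
from collections import defaultdict


def build_glossary(domain_keyphrases):
    """Build a toy glossary from human-identified keyphrases.

    Assumption: a keyword's weight is the number of glossary
    keyphrases that contain it. The chapter abstract does not give
    KIP's real weighting formula.
    """
    phrases = {p.lower() for p in domain_keyphrases}
    keyword_weights = defaultdict(int)
    for phrase in phrases:
        for word in set(phrase.split()):
            keyword_weights[word] += 1
    return phrases, dict(keyword_weights)


def score_phrase(phrase, phrases, keyword_weights, phrase_bonus=2.0):
    """Score a candidate: more (and heavier) keywords give a higher score.

    phrase_bonus is a hypothetical extra weight for candidates that
    already appear in the glossary as complete keyphrases.
    """
    normalized = phrase.lower()
    score = sum(keyword_weights.get(w, 0) for w in normalized.split())
    if normalized in phrases:
        score += phrase_bonus
    return score


def extract_keyphrases(candidates, phrases, keyword_weights, top_n=3):
    """Return the top_n candidates with a positive score, best first."""
    scored = [(score_phrase(c, phrases, keyword_weights), c) for c in candidates]
    scored = [(s, c) for s, c in scored if s > 0]
    # Sort by descending score, then alphabetically for stable ties.
    scored.sort(key=lambda sc: (-sc[0], sc[1]))
    return [c for _, c in scored[:top_n]]


def learn(new_keyphrases, phrases):
    """Toy 'learning' step: rebuild the glossary with newly identified keyphrases."""
    return build_glossary(phrases | {p.lower() for p in new_keyphrases})


if __name__ == "__main__":
    phrases, weights = build_glossary(
        ["information retrieval", "text mining", "information extraction"]
    )
    candidates = ["text mining", "information retrieval system",
                  "coffee break", "data mining"]
    print(extract_keyphrases(candidates, phrases, weights, top_n=2))
```

In this sketch a candidate with no glossary keywords scores zero and is never returned, which mirrors the reliance on prior positive samples; a real system would also need a noun phrase extractor and a principled weighting scheme.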
Several automatic keyphrase extraction techniques have been proposed in previous studies. Krulwich and Burkey (1996) use heuristics to extract significant topical phrases from a document. The heuristics are based on a document’s structural features, such as the presence of phrases in section headers, the use of italics, and different formatting structures. This approach is easy to implement, but it is limited because not every document has explicit structural features.
Key Terms in this Chapter
Keyphrase Assignment: Assigning keyphrases to a document from a predefined list.
Keyphrase Extraction: Methods for extracting important topical phrases from a document.
Document Keyphrases: Important topical phrases which, when combined, describe the main theme of a document.
Text Mining: Methods to distil useful information from large bodies of text.
Content Metadata: A special piece of metadata which describes the content of a document.
Domain-Specific Keyphrase Extraction: Keyphrase extraction methods designed for documents in a specific domain.
Document Metadata: Data about a document, e.g. title, author, source, etc.