Document keyphrases provide semantic metadata that characterizes a document and gives an overview of its content. This chapter describes the Keyphrase Identification Program (KIP), which extracts document keyphrases by using prior positive samples of human-identified domain keyphrases to assign weights to candidate keyphrases. The logic of the algorithm is that the more keywords a candidate keyphrase contains, and the more significant those keywords are, the more likely the candidate is a keyphrase. To obtain human-identified positive inputs, KIP first populates its glossary database with manually identified keyphrases and keywords. It then checks the composition of every noun phrase extracted from a document, looks up its parts in the database, and calculates a score for each noun phrase. The noun phrases with the highest scores are extracted as keyphrases. KIP’s learning function can enrich the glossary database by automatically adding newly identified keyphrases to it.
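The scoring idea described above can be illustrated with a minimal sketch. It is not KIP's actual implementation: the glossary contents, the weights, the additive scoring formula, and the names `score_phrase`, `rank_candidates`, and `learn` are all assumptions made for illustration, and noun phrase extraction is assumed to have happened upstream.

```python
from typing import Dict, List, Tuple

# Hypothetical glossary database. The entries and weights are
# illustrative only; they are not taken from KIP.
KEYWORD_WEIGHTS: Dict[str, float] = {
    "keyphrase": 0.9,
    "extraction": 0.7,
    "document": 0.4,
}
KEYPHRASE_WEIGHTS: Dict[str, float] = {
    "keyphrase extraction": 1.0,
}


def score_phrase(phrase: str) -> float:
    """Score a candidate noun phrase.

    Assumed formula: the sum of the weights of the known keywords in
    the phrase, plus the phrase's own weight if the whole phrase is
    already in the glossary.
    """
    normalized = phrase.lower().strip()
    keyword_score = sum(KEYWORD_WEIGHTS.get(w, 0.0) for w in normalized.split())
    phrase_score = KEYPHRASE_WEIGHTS.get(normalized, 0.0)
    return keyword_score + phrase_score


def rank_candidates(noun_phrases: List[str], top_n: int = 3) -> List[Tuple[str, float]]:
    """Return the top_n candidates, highest score first."""
    scored = [(p, score_phrase(p)) for p in noun_phrases]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


def learn(phrase: str, weight: float) -> None:
    """Add a newly identified keyphrase to the glossary if it is absent."""
    KEYPHRASE_WEIGHTS.setdefault(phrase.lower().strip(), weight)


if __name__ == "__main__":
    candidates = ["document structure", "keyphrase extraction", "weather"]
    for phrase, score in rank_candidates(candidates, top_n=2):
        print(phrase, round(score, 2))
```

In this sketch a phrase containing more known keywords, or keywords with larger weights, receives a higher score, which mirrors the stated logic of the algorithm. The `learn` helper stands in for the learning function by adding an extracted phrase to the glossary.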
Several automatic keyphrase extraction techniques have been proposed in previous studies. Krulwich and Burkey (1996) use heuristics to extract significant topical phrases from a document. The heuristics are based on a document’s structural features, such as the presence of phrases in section headers, the use of italics, and other formatting structures. This approach is not difficult to implement, but its limitation is that not every document has explicit structural features.
Key Terms in this Chapter
Keyphrase Assignment: Assigning keyphrases to a document from a predefined list.
Keyphrase Extraction: Methods for extracting important topical phrases from a document.
Document Keyphrases: Important topical phrases which, when combined, describe the main theme of a document.
Text Mining: Methods to distil useful information from large bodies of text.
Content Metadata: A special piece of metadata which describes the content of a document.
Domain-Specific Keyphrase Extraction: Keyphrase extraction methods designed for documents in a specific domain.
Document Metadata: Data about a document, such as its title, author, and source.