Unsupervised Automatic Keyphrases Extraction on Italian Datasets

Isabella Gagliardi (IMATI-CNR, Italy) and Maria Teresa Artese (IMATI-CNR, Italy)
Copyright: © 2021 | Pages: 20
DOI: 10.4018/978-1-7998-3479-3.ch009

Abstract

Keyword/keyphrase extraction is an important research activity in text mining, natural language processing, and information retrieval. A large number of algorithms, both supervised and unsupervised, have been designed and developed to solve the problem of automatic keyphrase extraction. The aim of the chapter is to critically discuss the unsupervised automatic keyphrase extraction algorithms, analyzing their characteristics in depth. The methods presented will be tested on different datasets, with a detailed description of the data, the algorithms, and the different options tested in the runs. Most studies and experiments have been conducted on texts in English, while few experiments concern other languages, such as Italian. Particular attention will therefore be paid to evaluating the results of the methods in two different languages, English and Italian.

Background

To identify the most relevant keywords for a text, the following pipeline, which mimics the Information Retrieval one, has to be performed:

  • Pre-process the data.

  • Apply unsupervised automatic keyword extraction algorithms:

      • Extract a list of candidate keywords/keyphrases using some heuristics,

      • Score each candidate keyword/keyphrase according to different criteria and methods,

      • Select the first m keywords/keyphrases.

  • Evaluate the results.
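
The pipeline above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the chapter's method: the tiny stopword list, the stopword-delimited candidate heuristic, the raw-frequency score, and all function names are hypothetical choices made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real run would use a
# complete, language-specific list for English or Italian).
STOPWORDS = {"the", "of", "and", "a", "an", "in", "is", "to", "for", "on"}

def preprocess(text):
    """Pre-process: lowercase, then split into words and punctuation marks."""
    return re.findall(r"[^\W\d_]+|[^\w\s]", text.lower())

def extract_candidates(tokens, max_len=3):
    """Candidate heuristic: runs of consecutive words that are not stopwords.
    Stopwords and punctuation both end the current run."""
    candidates, run = [], []
    for tok in tokens + [None]:  # None is a sentinel that flushes the last run
        if tok is not None and tok.isalpha() and tok not in STOPWORDS:
            run.append(tok)
            continue
        if run and len(run) <= max_len:
            candidates.append(" ".join(run))
        run = []
    return candidates

def score_candidates(candidates):
    """Scoring: raw frequency, a stand-in for statistical or graph-based scores."""
    return Counter(candidates)

def select_top(scores, m):
    """Selection: keep the m highest-scoring candidates."""
    return [phrase for phrase, _ in scores.most_common(m)]

def precision_recall(predicted, gold):
    """Evaluation: exact-match precision and recall against reference keyphrases."""
    hits = len(set(predicted) & set(gold))
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

text = ("Keyphrase extraction is the task of text mining, "
        "and keyphrase extraction is a task for Italian datasets.")
top = select_top(score_candidates(extract_candidates(preprocess(text))), m=2)
print(top, precision_recall(top, ["keyphrase extraction", "text mining"]))
```

Each function corresponds to one step of the list, so any single step (for example the scoring criterion) can be swapped without touching the others.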

Each step will be described in detail below.

Key Terms in this Chapter

Keywords/Keyphrases: Single or compound words able to characterize the content of the documents, useful for identifying the documents relevant to a given query.

Lemmatization: The process of reducing the inflected forms of a word to its dictionary form, or lemma.

Information Retrieval: The activity of searching a collection of documents for information relevant to a user's needs. Documents are usually texts, but can also be images, videos, or sounds.

Bag of Words Model: A representation of a text as a multiset of its words, which keeps multiplicity (how many times each word appears) but loses position and other related information.
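
The representation defined above can be illustrated with a short Python sketch. The function name and the simple letter-based tokenizer are assumptions made for the example.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Map each word to the number of times it appears; word order is discarded."""
    return Counter(re.findall(r"[^\W\d_]+", text.lower()))

bag = bag_of_words("La casa è grande, la casa è bella")
print(bag["casa"], bag["grande"])
```

Two texts containing the same words in a different order yield the same bag, which is exactly the positional information the model gives up.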

POS (Part of Speech) Tagging: The process of assigning a part of speech, such as verb, noun, adjective, or adverb, to each word of a text. Different languages need different POS tagging tools.

Dataset: A collection of data. The term usually refers to a coherent set of data, such as the records of a single database, or even of a single table.

Stemming: The process of reducing the inflected forms of a word to its root, or stem.

Automatic Keyword Extraction Algorithms: Algorithms able to extract keywords from texts automatically and in an unsupervised manner. They are usually based on probability, graphs, or clusters.
