Domain Adaptation in Part-of-Speech Tagging

Miriam Lúcia Domingues, Eloi Luiz Favero
DOI: 10.4018/978-1-4666-2169-5.ch003

Abstract

Many Natural Language Processing (NLP) applications rely on the accuracy of part-of-speech taggers. Although many taggers achieve good accuracy in the domain in which they were trained, that accuracy typically does not carry over to new domains because of problems such as different linguistic structures or the presence of new words. The need for domain adaptation has emerged as a new challenge for part-of-speech tagging and for most NLP tasks. The goals of this chapter are to highlight solutions that handle labeled and unlabeled data, to describe methods that use such data to solve the domain adaptation problem, and to present a case study that has achieved significant accuracy rates on tagging journalistic and scientific texts.

Introduction

Many state-of-the-art Natural Language Processing (NLP) applications based on supervised learning have good accuracy for the domain or genre in which they were trained; however, most of them exhibit a lack of portability to new domains due to problems such as different linguistic structures or the presence of new words. As a result, domain adaptation, which is the ability to exhibit good performance on both the training (source) and the new (target) domains, has emerged as a new challenge. This challenge arises in many NLP tasks, such as Part-Of-Speech (POS) tagging, Named Entity (NE) recognition, parsing, Word Sense Disambiguation (WSD), and relation extraction.

Published literature has addressed the importance of domain adaptation in NLP tasks by applying machine learning methods, such as supervised (Chelba & Acero, 2006; Daumé, 2007), unsupervised (Blitzer, McDonald, & Pereira, 2006; Jiang & Zhai, 2007; Huang & Yates, 2010), and ensemble methods (Daumé III & Marcu, 2006).

Jiang and Zhai (2007) cited several examples of domain adaptation problems. The first example is POS tagging, where the source domain being tagged is journalistic data and the target domain is scientific data. The second example is NE recognition, where the source domain being annotated is news articles and the target domain is personal blogs. The third example is personalized spam filtering, where many labeled spam and ham emails from publicly available sources must be adapted to an individual user’s inbox because of the specificities of the user’s distribution of emails and the individual notions of what constitutes spam.

The objective of this chapter is to present the state of the art in domain adaptation, focusing on solutions for POS tagging, an important preprocessing task in many NLP applications. Specifically, we present experiments with the adaptation of a hybrid POS tagger, which improves tagging accuracy by reducing errors on new or Out-Of-Vocabulary (OOV) words and by adjusting the tagger to handle different data distributions in the source and target domains. This tagger has been trained with Portuguese texts to reach similar levels of accuracy on texts from two different domains: journalistic and scientific.
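
To make the notion of OOV words concrete, the following minimal sketch measures the share of target-domain tokens that never occur in the source-domain training data. It is an illustration only, not the procedure used in the chapter's experiments: the function name `oov_rate`, the lowercasing step, and the two toy sentences are assumptions introduced here.

```python
def oov_rate(source_tokens, target_tokens):
    """Fraction of target tokens that do not occur in the source vocabulary.

    Assumption: tokens are compared case-insensitively; a real tagger may
    normalize words differently.
    """
    if not target_tokens:
        return 0.0
    source_vocab = {token.lower() for token in source_tokens}
    unseen = sum(1 for token in target_tokens if token.lower() not in source_vocab)
    return unseen / len(target_tokens)


# Toy data made up for illustration (not taken from the chapter's corpora).
source = "a casa é grande".split()       # stand-in for the source (journalistic) domain
target = "a proteína é grande".split()   # stand-in for the target (scientific) domain

rate = oov_rate(source, target)  # one of the four target tokens is unseen
```

A high OOV rate on the target domain is one simple signal that a tagger trained on the source domain may need adaptation.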

In the following sections, we first describe basic concepts of POS tagging and its main approaches. Then, we present the current state of the art in domain adaptation, including any related issues and problems. We highlight solutions using NLP systems that handle labeled and unlabeled data, taking the perspectives adopted by researchers working on NLP. There is also a brief overview of domain adaptation solutions in POS tagging. We then present a case study with a Portuguese POS tagger, followed by a discussion of future research directions and the conclusions of this chapter.

Part-Of-Speech Tagging

POS tagging is the basic task of labeling a word or a token in a sentence with its grammatical category, such as noun, adjective, or verb. Punctuation marks are usually tagged as well. When a suitable automatic tagging algorithm is given a string of words and a specified tag set, the tagger outputs annotated results such as the following:

  • A/ART casa/N é/V grande/ADJ ./. (The house is big.)

  • Maria/NPROP casa/V hoje_à_noite/ADV ./. (Maria marries tonight.)

The tags in the examples come from the Mac-Morpho tag set (Aluísio, et al., 2003) and are defined as follows: ART=article, N=noun, V=verb, ADJ=adjective, NPROP=proper noun, ADV=adverb, and .=punctuation mark (the full stop is tagged as itself).
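
The word/TAG notation used in the examples can be read back into (word, tag) pairs with a few lines of code. The sketch below is illustrative: the helper name `parse_tagged` and the dictionary `TAG_NAMES` are assumptions introduced here, and the dictionary covers only the tags mentioned above, not the full Mac-Morpho tag set.

```python
# Only the tags described in the text; the full tag set is larger.
TAG_NAMES = {
    "ART": "article",
    "N": "noun",
    "V": "verb",
    "ADJ": "adjective",
    "NPROP": "proper noun",
    "ADV": "adverb",
    ".": "punctuation mark",
}


def parse_tagged(sentence):
    """Split a 'word/TAG word/TAG ...' string into (word, tag) pairs."""
    pairs = []
    for token in sentence.split():
        # Split on the LAST slash so the tag is always the final field.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


pairs = parse_tagged("A/ART casa/N é/V grande/ADJ ./.")
for word, tag in pairs:
    print(f"{word}\t{tag}\t{TAG_NAMES.get(tag, 'unknown tag')}")
```

Splitting on the last slash is a design choice made for this sketch; it keeps the tag separable even if a word itself were to contain a slash.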

A word is ambiguous when it can belong to more than one grammatical category, such as the word “casa” in the examples above. (In Portuguese, the word “casa” may refer to the noun house or to the verb to marry.) The tag with the correct grammatical category is assigned according to the context of the word in the sentence. For disambiguation, taggers use a wide range of methods and techniques, following different approaches, to tag words as accurately as possible. Tags may also include further lexical attributes, such as gender, number, verbal mood, tense, and person. For example, the word “casa” may be tagged as NFS, a noun (N) that is feminine (F) in gender and singular (S) in number.
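
The role of context in disambiguation can be illustrated with a deliberately simple rule-based sketch. It is not the hybrid tagger discussed in this chapter: the lexicon `LEXICON`, the rule table `PREFERRED_AFTER`, the function `tag_sentence`, and the noun default for unknown words are all assumptions made up for this example, which only resolves “casa” using the tag of the preceding word.

```python
# Toy lexicon: each known word maps to its possible tags (illustrative only).
LEXICON = {
    "a": ["ART"],
    "casa": ["N", "V"],  # ambiguous: noun "house" or verb "marries"
    "é": ["V"],
    "grande": ["ADJ"],
    "maria": ["NPROP"],
    "hoje_à_noite": ["ADV"],
    ".": ["."],
}

# Hand-written context rules: the tag preferred right after a given tag.
PREFERRED_AFTER = {
    "ART": "N",    # an article is usually followed by a noun
    "NPROP": "V",  # in these examples, a proper noun is followed by a verb
}


def tag_sentence(words):
    """Tag words left to right, using the previous tag to resolve ambiguity."""
    tagged = []
    previous_tag = None
    for word in words:
        # Assumption: unknown words default to noun.
        candidates = LEXICON.get(word.lower(), ["N"])
        if len(candidates) == 1:
            tag = candidates[0]
        else:
            preferred = PREFERRED_AFTER.get(previous_tag)
            # Fall back to the first listed tag when no rule applies.
            tag = preferred if preferred in candidates else candidates[0]
        tagged.append((word, tag))
        previous_tag = tag
    return tagged


house = tag_sentence(["A", "casa", "é", "grande", "."])
marry = tag_sentence(["Maria", "casa", "hoje_à_noite", "."])
```

Here “casa” receives N after the article in the first sentence and V after the proper noun in the second. Real taggers replace such hand-written rules with statistical or learned models over much richer context.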
