Migrating from ISO/IEC 9126 to SQUARE: A Case Study on the Evaluation of Medical Speech Translation Systems

Paula Estrella, Nikos Tsourakis
DOI: 10.4018/978-1-5225-1724-5.ch010

Abstract

In the evaluation of natural language systems, it is widely acknowledged that common evaluation methodologies are lacking, which makes a fair comparison of such systems difficult. Many attempts to standardize this process have used a quality model based on the ISO/IEC 9126 standards. The authors have also used these standards to define a weighted quality model for the evaluation of a medical speech translator, showing the relative importance of the system's features depending on the potential user (patient, doctor, or developer). More recently, ISO/IEC 9126 has been replaced by a new series of standards, the 25000 or SQuaRE series, which means the model should be migrated to the new series in order to remain compliant with current standards. This chapter demonstrates how to migrate from ISO/IEC 9126 to ISO/IEC 25000, using the authors' previous work as a use case.

Introduction

Normalizing evaluations for specific contexts of use has gained increasing importance as software systems become widespread among global organizations and professionals. In particular, Natural Language Processing (NLP) systems are varied in nature and purpose, ranging from low-level applications intended to be used by developers (such as part-of-speech taggers, syntactic parsers or morphological analyzers) to complex systems targeting human end-users (such as voice-commanded booking systems (Jiao et al., 2015) or eye-commanded interfaces (Soltani et al., 2016)). During the development lifecycle, these systems are periodically evaluated in order to assess the level of improvement achieved, to detect, classify and recover from errors, or to quantify user acceptability and satisfaction. Although numerous authors have reported evaluation results over the last decades using various computer-centered and human-centered metrics, it is widely acknowledged that no methodology yet provides a fair comparison framework for different NLP systems.

Based on the conviction that user needs and the specific context of use of NLP systems cannot be omitted from an evaluation, several initiatives emerged that decompose quality into several dimensions. One of these initiatives was the International Standards for Language Engineering project (Calzolari et al., 2002), which aimed at standardizing the evaluation of language engineering systems by relating a customized quality model based on the ISO/IEC 9126 standard (ISO/IEC, 2001) to the purpose and context of use of the system, following the ISO/IEC 14598 standard (ISO/IEC, 1999). According to these standards, software quality results in general from six categories of quality characteristics (namely functionality, reliability, usability, efficiency, maintainability and portability) that can be particularized to a given software domain and context of use. The resulting hierarchy is called a quality model, and its terminal nodes must be features of the software that can be measured using one or more metrics. Additionally, ISO proposes models to evaluate internal or external quality as well as quality in use.
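
To make the notion of a quality model more concrete, the hierarchy described above can be sketched as a small weighted tree. This is only an illustrative sketch, not the authors' model: the class name `QualityNode`, the weights and the scores are all hypothetical, and a normalized weighted average is assumed as the aggregation rule. For brevity, most characteristics are scored directly instead of being decomposed down to measurable features.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class QualityNode:
    """One node of a hypothetical weighted quality model."""
    name: str
    weight: float = 1.0             # relative importance among siblings
    score: Optional[float] = None   # metric result in [0, 1], leaves only
    children: List["QualityNode"] = field(default_factory=list)

    def evaluate(self) -> float:
        """Return the leaf score, or the weighted average of the children."""
        if not self.children:
            if self.score is None:
                raise ValueError(f"leaf '{self.name}' has no measured score")
            return self.score
        total_weight = sum(child.weight for child in self.children)
        weighted_sum = sum(child.weight * child.evaluate()
                           for child in self.children)
        return weighted_sum / total_weight


# Hypothetical weights and scores, chosen only for illustration.
model = QualityNode("quality", children=[
    QualityNode("functionality", weight=3, children=[
        QualityNode("accuracy", score=0.9),
        QualityNode("suitability", score=0.7),
    ]),
    QualityNode("reliability", weight=2, score=0.5),
    QualityNode("usability", weight=2, score=0.9),
    QualityNode("efficiency", weight=1, score=0.6),
    QualityNode("maintainability", weight=1, score=0.4),
    QualityNode("portability", weight=1, score=0.7),
])

print(f"overall quality: {model.evaluate():.2f}")
```

Changing the weights while keeping the scores fixed is one way to express that the same system can rate differently for different users, such as a patient, a doctor or a developer.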

These standards have recently been replaced by the new 25000 series (ISO, 2014), named SQuaRE, which implies that quality models based on the 9126 standard are outdated and should be migrated to the new series. This chapter builds on previous work, in which the authors applied the 9126 series to the evaluation of NLP systems in the specific area of medical speech translation, from the perspective of doctors, patients and developers (Tsourakis & Estrella, 2013). The objective of this chapter is to propose a mapping from the previous weighted quality model to a new weighted quality model based on SQuaRE, so as to reuse as much as possible of the previous evaluations, given the complex and laborious work that applying international standards entails.
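
As a rough illustration of what such a migration involves at the top level, the sketch below renames the six ISO/IEC 9126 characteristics to their closest counterparts in the ISO/IEC 25010 product quality model and flags the two characteristics that are new at the top level (security and compatibility). It is not the mapping proposed in the chapter: the function name `migrate_weights` and the example weights are hypothetical, and sub-characteristics and quality in use are left out entirely.

```python
# Closest top-level counterparts between ISO/IEC 9126 and ISO/IEC 25010.
# In 9126, security and interoperability were sub-characteristics of
# functionality; in 25010, security is a top-level characteristic and
# interoperability belongs to compatibility.
ISO9126_TO_25010 = {
    "functionality": "functional suitability",
    "reliability": "reliability",
    "usability": "usability",
    "efficiency": "performance efficiency",
    "maintainability": "maintainability",
    "portability": "portability",
}
NEW_IN_25010 = ("security", "compatibility")


def migrate_weights(old_weights):
    """Carry weights over to the 25010 names (hypothetical helper).

    Returns the renamed weights and the list of 25010 characteristics
    that still need a weight assigned by the evaluator.
    """
    new_weights = {ISO9126_TO_25010[name]: weight
                   for name, weight in old_weights.items()}
    unassigned = [name for name in NEW_IN_25010 if name not in new_weights]
    return new_weights, unassigned


# Hypothetical weights for one user profile.
old = {"functionality": 3, "reliability": 2, "usability": 2,
       "efficiency": 1, "maintainability": 1, "portability": 1}
new, todo = migrate_weights(old)
print(new)
print("still to be weighted:", todo)
```

The weights that carry over unchanged are the part of a previous evaluation that can be reused directly, whereas the flagged characteristics require new decisions from the evaluator.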
