Speech Technologies for Augmented Communication

Gérard Bailly (CNRS/Universities of Grenoble, France), Pierre Badin (CNRS/Universities of Grenoble, France), Denis Beautemps (CNRS/Universities of Grenoble, France) and Frédéric Elisei (CNRS/Universities of Grenoble, France)
DOI: 10.4018/978-1-61520-725-1.ch007

Abstract

The authors introduce here an emerging technological and scientific field. Augmented speech communication (ASC) aims at supplementing human-human communication with enhanced or additional modalities. ASC improves human-human communication by exploiting a priori knowledge of the multimodal coherence of speech signals, of user/listener voice characteristics, or of the more general linguistic and phonological structure of the spoken language or vocabulary being exchanged. The nature of this a priori knowledge, the quantitative models that implement it and their capability to enhance the available input signals determine the precision and robustness of the perceived signals. After a general overview of the possible input signals characterizing speech production activity and of the available technologies for mapping between these various speech representations, three ASC systems developed at GIPSA-Lab are described in detail. Preliminary evaluation results for these three systems are presented and discussed. A discussion of the scientific and technological challenges and limitations of ASC concludes the chapter.

Introduction

Speech is very likely the most natural means of communication for humans. However, there are various situations in which audible speech cannot be used, because of disabilities or adverse environmental conditions. Resorting to alternative methods such as augmented speech is therefore an attractive approach. This chapter presents computer-mediated communication technologies that enable such an approach (see Figure 1). Speech of the emitter may in fact:

  • not be captured by the available hardware communication channels – camera, microphone

  • be impoverished by the quality of the hardware or of the communication channel

  • be impoverished because of environmental conditions or because of motor impairments of the interlocutor

Figure 1.

Computer-mediated communication consists in driving an artificial agent from signals captured on the source speaker. The embodiment of the agent may be quite diverse: from pure audio, through audiovisual rendering of speech by avatars, to more videorealistic animation by means of virtual clones of the source speaker or anthropoid robots – here the animatronic talking head Anton developed at the University of Sheffield (Hofe & Moore, 2008). The control signals of these agents can encompass not only the audible and visible consequences of articulation but also posture, gaze, facial expressions and head/hand movements. Signals captured on the source speaker provide partial information on speech activity, such as brain or muscular activity, articulatory movements, speech sounds or even scripts produced by the source speaker. Such systems exploit a priori knowledge of the mapping between the captured and synthesized signals, labelled here as “virtual human” and “phonological representation”: these resources, which encode the coherence between observed and generated signals, can be either statistical or procedural.
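To make the data flow sketched in Figure 1 concrete, the minimal Python sketch below models the capture → mapping → rendering chain. All names here (CapturedSignals, AgentControls, mediate) are hypothetical scaffolding for illustration, not interfaces from the chapter; the mapping resource is deliberately abstracted as a plain callable, since the caption notes it may be either statistical or procedural.

```python
# Illustrative skeleton of the Figure 1 pipeline; all names are hypothetical.
from dataclasses import dataclass
from typing import Callable, Optional

import numpy as np


@dataclass
class CapturedSignals:
    """Partial observations of speech activity on the source speaker."""
    articulatory: np.ndarray            # e.g. sensor trajectories, frames x dims
    audio: Optional[np.ndarray] = None  # murmured or degraded audio, if any


@dataclass
class AgentControls:
    """Control parameters for the rendering agent (avatar, clone or robot)."""
    audio: np.ndarray      # synthesized acoustic frames
    face: np.ndarray       # visible articulation / expression parameters
    gaze_head: np.ndarray  # posture, gaze and head-movement parameters


# The "virtual human" / "phonological representation" resource, abstracted as
# a mapping from captured signals to agent controls; it may be backed by a
# trained statistical model or by procedural rules.
MappingResource = Callable[[CapturedSignals], AgentControls]


def mediate(capture: CapturedSignals, mapping: MappingResource) -> AgentControls:
    """One pass of computer-mediated communication: map, then render."""
    return mapping(capture)
```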

On the reception side, Augmented Speech Communication (ASC) may also compensate for perceptual deficits of the user by enhancing the captured signals or by adding multimodal redundancy, i.e. synthesizing new perceptual channels or adding new features to existing channels. In order to improve human-human communication, ASC can make use of a priori knowledge of the multimodal coherence of speech signals, of user/listener voice characteristics, or of the more general linguistic and phonological structure of the spoken language or vocabulary being exchanged. The nature of this a priori knowledge, the quantitative models that implement it and their capability to enhance the available communication signals determine the precision and robustness of the communication.

The chapter will first present:

  • the signals that can characterise speech production activity, from the electromagnetic signals of brain activity, through articulatory movements, to their audiovisual traces

  • the devices that can capture these signals, with their varying impact on articulation and constraints on usage

  • the technologies that have been proposed for mapping between these various speech representations, i.e. “virtual human” models, direct statistical mapping, or speech technologies using a phonetic pivot obtained by speech recognition techniques (the last two strategies are contrasted in the toy sketch after this list)
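As a toy contrast between direct statistical mapping and a phonetic pivot, the sketch below trains both strategies on synthetic data: a per-frame phone classifier whose symbolic output indexes a synthesis codebook, versus a regressor that maps input features straight to output features. The feature dimensions, the phone set and all variable names are invented for illustration and do not reflect the chapter's actual systems.

```python
# Toy contrast between the two mapping strategies above, on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge

rng = np.random.default_rng(0)
phones = ["a", "i", "u"]                     # toy phonetic pivot inventory
X = rng.normal(size=(300, 8))                # captured features (e.g. sensor data)
pivot = rng.integers(len(phones), size=300)  # frame-level phone labels
Y = rng.normal(size=(300, 12))               # target acoustic features

# Strategy 1: phonetic pivot. Recognize a phone per frame, then drive the
# synthesis stage from the symbol (here: the mean target features per phone).
recognizer = LogisticRegression(max_iter=1000).fit(X, pivot)
phone_codebook = np.stack(
    [Y[pivot == k].mean(axis=0) for k in range(len(phones))]
)

def pivot_mapping(x_frames):
    """Input frames -> recognized phone symbols -> codebook features."""
    return phone_codebook[recognizer.predict(x_frames)]

# Strategy 2: direct statistical mapping, skipping symbols entirely.
direct = Ridge().fit(X, Y)

x_new = rng.normal(size=(5, 8))
print(pivot_mapping(x_new).shape)   # (5, 12): via discrete phone decisions
print(direct.predict(x_new).shape)  # (5, 12): continuous feature regression
```

The pivot route makes hard symbolic decisions, which discards paralinguistic detail but lets the synthesis stage regenerate clean speech; the direct route preserves continuous variation at the cost of propagating input noise.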

Three ASC systems developed in the MPACIF Team at GIPSA-Lab will then be described in detail:

  • a. a system that converts non-audible murmur into audiovisual speech for silent speech communication (Tran, Bailly, & Loevenbruck, submitted; Tran, Bailly, Loevenbruck, & Toda, 2008)

  • b. a system that converts silent cued speech (Cornett, 1967) into audiovisual speech and vice versa, aimed at computer-assisted audiovisual telephony for deaf users (Aboutabit, Beautemps, & Besacier, accepted; Beautemps et al., 2007)

  • c. a system that computes and displays virtual tongue movements from audiovisual input for pronunciation training (Badin, Elisei, Bailly, & Tarabalka, 2008; Badin, Tarabalka, Elisei, & Bailly, 2008).
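System (a) builds on statistical voice conversion of the kind developed by Toda and colleagues (cited above), where a Gaussian mixture model (GMM) trained on joint source/target feature vectors maps murmured input features to audible speech features. Below is a minimal sketch of the core conditional-expectation step of such GMM-based mapping, on synthetic data; the dimensions, component count and variable names are placeholders, and a real system would add dynamic features and a vocoder for waveform generation.

```python
# Minimal sketch of GMM-based feature mapping, the family of techniques used
# for statistical conversion such as murmur-to-speech; data are synthetic.
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
dx, dy = 6, 6
X = rng.normal(size=(500, dx))  # source features (e.g. murmur spectra)
Y = X @ rng.normal(size=(dx, dy)) + 0.1 * rng.normal(size=(500, dy))

# Train one GMM on joint source/target vectors z = [x; y].
gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0)
gmm.fit(np.hstack([X, Y]))

def convert(x):
    """Map one source frame to E[y | x] under the joint GMM."""
    mu_x = gmm.means_[:, :dx]
    mu_y = gmm.means_[:, dx:]
    Sxx = gmm.covariances_[:, :dx, :dx]
    Syx = gmm.covariances_[:, dx:, :dx]
    # Responsibilities p(k | x) from the marginal GMM over x.
    like = np.array(
        [multivariate_normal(mu_x[k], Sxx[k]).pdf(x)
         for k in range(gmm.n_components)]
    )
    post = gmm.weights_ * like
    post /= post.sum()
    # Weighted sum of per-component conditional means.
    y = np.zeros(dy)
    for k in range(gmm.n_components):
        y += post[k] * (mu_y[k] + Syx[k] @ np.linalg.solve(Sxx[k], x - mu_x[k]))
    return y

print(convert(X[0]))  # one converted target-feature frame
```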
