GTM User Modeling for aIGA Weight Tuning in TTS Synthesis

Lluís Formiga (Universitat Ramon Llull, Spain) and Francesc Alías (Universitat Ramon Llull, Spain)
Copyright: © 2009 | Pages: 8
DOI: 10.4018/978-1-59904-849-9.ch117

Abstract

Unit Selection Text-to-Speech Synthesis (US-TTS) systems produce synthetic speech by retrieving previously recorded speech units from a speech database (corpus), driven by a weighted cost function (Black & Campbell, 1995). To obtain high-quality synthetic speech, these weights must be optimized efficiently. To that end, previous work introduced a weight-tuning technique based on evolutionary perceptual tests by means of Active Interactive Genetic Algorithms (aiGAs) (Alías, Llorà, Formiga, Sastry & Goldberg, 2006). aiGAs mine models that map users' subjective preferences by means of partial ordering graphs, synthetic fitness and Evolutionary Computation (EC) (Llorà, Sastry, Goldberg, Gupta & Lakshmi, 2005). Although aiGAs offer an effective method for mapping the preferences of a single user, as far as we know, no methodology has yet been proposed for extracting common solutions among different individual preferences (hereafter denoted as common knowledge). Furthermore, an ambiguity problem arises when different users evolve to different weight configurations. In this review, Generative Topographic Mapping (GTM) is introduced as a method to extract common knowledge from aiGA models obtained from user preferences.

Background

Weight Tuning in Unit-Selection Text-to-Speech Synthesis

The aim of US-TTS is to generate synthetic speech by concatenating the sequence of units that best fits the requirements derived from the input text. The speech units are retrieved from a database (speech corpus) that stores speech units previously recorded, typically by a professional speaker.

The text-to-speech workflow is generally modelled as two independent blocks that convert written text into a speech signal. The first block, Natural Language Processing (NLP), is followed by the Digital Signal Processing (DSP) block. In its first stage, the NLP block preprocesses the text (e.g. converting digits or acronyms to words) and then converts graphemes to phonemes. In its last stage, the NLP block assigns quantified prosody parameters to each phoneme, which guide the way each phoneme is converted to signal. Generally, these quantified prosody parameters involve duration, pitch and energy. Next, the DSP block retrieves from a recorded database (speech corpus) the sequence of units that best matches the target requirements (the phonemes and their prosody). Finally, the speech units are concatenated to obtain the output speech signal.
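The NLP stages described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the names `normalize`, `to_phonemes`, `assign_prosody` and `nlp_block` are hypothetical, the digit table and the one-letter-per-phoneme rule are toy stand-ins for real text normalization and grapheme-to-phoneme models, and the flat prosody values are placeholders rather than predictions.

```python
from dataclasses import dataclass

# Hypothetical toy table; a real NLP block uses far richer normalization rules.
DIGIT_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
               "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}


@dataclass
class Target:
    """One target unit: a phoneme plus its quantified prosody parameters."""
    phoneme: str
    duration_ms: float
    pitch_hz: float
    energy: float


def normalize(text):
    """Text preprocessing: spell out each digit as a word (toy rule)."""
    words = []
    for token in text.lower().split():
        if token.isdigit():
            words.extend(DIGIT_WORDS[d] for d in token)
        else:
            words.append(token)
    return words


def to_phonemes(words):
    """Grapheme-to-phoneme conversion: one letter per 'phoneme' (toy rule)."""
    return [ch for word in words for ch in word if ch.isalpha()]


def assign_prosody(phonemes):
    """Attach placeholder (flat) duration, pitch and energy to every phoneme."""
    return [Target(p, duration_ms=80.0, pitch_hz=120.0, energy=1.0)
            for p in phonemes]


def nlp_block(text):
    """Chain the three NLP stages; the result is what the DSP block would match."""
    return assign_prosody(to_phonemes(normalize(text)))


targets = nlp_block("room 42")
print([t.phoneme for t in targets])
```

The list of `Target` objects plays the role of the target requirements that the DSP block tries to match against the speech corpus.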

The retrieval process is carried out by a dynamic programming algorithm (e.g. Viterbi or A* (Formiga & Alías, 2006)) driven by a cost function. The cost function computes the cost of selecting a unit within a sequence as the sum of two weighted subcosts (see equation (1)): the target subcost (Ct) and the concatenation subcost (Cc). In this work, Ct is considered as a weighted linear combination of the normalized prosody distances between the prosody vector predicted by the NLP block for the target and the prosody vector of the candidate unit (see equation). Similarly, Cc is computed as a weighted linear combination of the distances between the feature vectors of the speech signal around the concatenation point (see equation).

[Equations (1) to (5) are missing from this extract.]

In equations (1) to (3), the target unit sequence is {t1, t2, ..., tn} and the candidate unit sequence is {u1, u2, ..., un}.
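The cost-driven retrieval described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the dictionary layout of a unit (`prosody`, `left`, `right`), the use of absolute differences as distances, and the top-level weights `w_target` and `w_concat` are all hypothetical, and the feature values are assumed to be already normalized.

```python
def weighted_distance(weights, a, b):
    """Weighted linear combination of per-feature absolute distances."""
    return sum(w * abs(x - y) for w, x, y in zip(weights, a, b))


def target_cost(target, unit, w_t):
    """Ct: distance between the target prosody and the candidate's prosody."""
    return weighted_distance(w_t, target, unit["prosody"])


def concat_cost(prev_unit, unit, w_c):
    """Cc: distance between the features on either side of the join."""
    return weighted_distance(w_c, prev_unit["right"], unit["left"])


def select_units(targets, candidates, w_t, w_c, w_target=1.0, w_concat=1.0):
    """Viterbi search for the candidate sequence with the lowest total cost."""
    best = [[w_target * target_cost(targets[0], u, w_t) for u in candidates[0]]]
    back = [[-1] * len(candidates[0])]
    for i in range(1, len(targets)):
        row, ptr = [], []
        for u in candidates[i]:
            # Cheapest way to reach u from any candidate at position i - 1.
            joins = [best[i - 1][k] + w_concat * concat_cost(p, u, w_c)
                     for k, p in enumerate(candidates[i - 1])]
            k_best = min(range(len(joins)), key=joins.__getitem__)
            row.append(joins[k_best] + w_target * target_cost(targets[i], u, w_t))
            ptr.append(k_best)
        best.append(row)
        back.append(ptr)
    # Trace the best path backwards from the cheapest final candidate.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][j]
    path = [j]
    for i in range(len(targets) - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    return path[::-1], total


# Toy data: two targets, two candidates each; prosody = (duration, pitch, energy).
targets = [(0.5, 0.5, 0.5), (0.2, 0.2, 0.2)]
candidates = [
    [{"prosody": (0.5, 0.5, 0.5), "left": (0.0,), "right": (0.0,)},
     {"prosody": (0.9, 0.9, 0.9), "left": (0.0,), "right": (1.0,)}],
    [{"prosody": (0.2, 0.2, 0.2), "left": (1.0,), "right": (0.0,)},
     {"prosody": (0.3, 0.2, 0.2), "left": (0.0,), "right": (0.0,)}],
]
w_t = (1.0, 1.0, 1.0)
w_c = (1.0,)

path, total = select_units(targets, candidates, w_t, w_c)
print(path, total)  # path is [0, 1]
```

In this toy example the second position has a candidate whose prosody matches the target exactly, yet the search prefers the other candidate because it joins smoothly with its predecessor. Setting the concatenation weight to zero reverses that choice, which is the kind of trade-off that weight tuning has to settle.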

Appropriate design of the cost function by means of weight training is crucial to obtaining high-quality synthetic speech (Black, 2002). Nevertheless, there is no single agreed approach to this problem. Several techniques have been suggested for weight tuning, which may be split into three families: i) manual tuning, ii) computationally driven, purely objective methods and iii) perceptually optimized techniques (Alías, Llorà, Formiga, Sastry & Goldberg, 2006). The present review focuses on techniques that incorporate human feedback into the training process, following previous work (Alías, Llorà, Formiga, Sastry & Goldberg, 2006), which is outlined in the next section.

Key Terms in this Chapter

Surrogate Fitness: A synthetic fitness measure that tries to evaluate an evolutionary solution in the same terms as a human user would in a perceptual test.

Unit Selection Synthesis: A synthesis technique where appropriate units are retrieved from large databases of natural speech so as to generate synthetic speech.

Diphone: A sound consisting of two phonemes: one that leads into the sound and one that finishes the sound. For example, "hello" is split into silence-h, h-eh, eh-l, l-oe, oe-silence.

Mel Frequency Cepstral Coefficients (MFCC): The MFCC are the coefficients of the Mel cepstrum. The Mel cepstrum is the cepstrum computed on the Mel bands (scaled to the human ear) instead of the Fourier spectrum.

Correlation: A statistical measurement of the interdependence or association between two or more variables. A typical calculation would be performed by multiplying a signal by either another signal (cross-correlation) or by a delayed version of itself (autocorrelation).

Evolutionary Algorithms: Collective term for all variants of (probabilistic) optimization and approximation algorithms that are inspired by Darwinian evolution. Optimal states are approximated by successive improvements based on the variation-selection paradigm.

Pitch: A measure of intonation at a given time in the signal.

Generative Topographic Mapping (GTM): A technique for density modelling and data visualisation inspired by Self-Organizing Maps (see the Self-Organizing Maps definition).

Unsupervised Learning: Learning techniques that group instances without a pre-specified dependent attribute. Clustering algorithms are usually unsupervised methods for grouping data sets.

Digital Signal Processing (DSP): The processing of signals by digital means, that is, by performing numerical calculations on a digital signal.

Self-Organizing Maps: Self-organizing maps (SOMs) are a data visualization technique that reduces the dimensions of data through the use of self-organizing neural networks.

Natural Language Processing (NLP): Computer understanding, analysis, manipulation, and/or generation of natural language.

Text Normalization: The process of converting abbreviations and non-word written symbols into the words that a speaker would say when reading them aloud.

Prosody: A collection of phonological features, including pitch, duration, and stress, which define the rhythm of spoken language.
