GTM User Modeling for aIGA Weight Tuning in TTS Synthesis

Lluís Formiga, Francesc Alías
Copyright © 2009 | Pages: 8
DOI: 10.4018/978-1-59904-849-9.ch117

Abstract

Unit Selection Text-to-Speech Synthesis (US-TTS) systems produce synthetic speech by retrieving previously recorded speech units from a speech database (corpus), driven by a weighted cost function (Black & Campbell, 1995). To obtain high-quality synthetic speech, these weights must be optimized efficiently. To that end, previous work introduced a weight-tuning technique based on evolutionary perceptual tests by means of Active Interactive Genetic Algorithms (aiGAs) (Alías, Llorà, Formiga, Sastry & Goldberg, 2006). aiGAs mine models that map subjective user preferences through partial ordering graphs, synthetic fitness and Evolutionary Computation (EC) (Llorà, Sastry, Goldberg, Gupta & Lakshmi, 2005). Although aiGAs offer an effective method for mapping single-user preferences, as far as we know, a methodology to extract solutions common to different individual preferences (hereafter denoted as common knowledge) has not yet been tackled. Furthermore, an ambiguity problem must be solved when different users evolve toward different weight configurations. In this review, Generative Topographic Mapping (GTM) is introduced as a method to extract common knowledge from the aiGA models obtained from user preferences.

Background

Weight Tuning in Unit-Selection Text-to-Speech Synthesis

The aim of US-TTS is to generate synthetic speech by concatenating the sequence of units that best fits the requirements derived from the input text. The speech units are retrieved from a database (speech corpus) which stores speech units previously recorded, typically by a professional speaker.

The text-to-speech workflow is generally modelled as two independent blocks that convert written text into a speech signal: the Natural Language Processing (NLP) block, followed by the Digital Signal Processing (DSP) block. In the first stage, the NLP block carries out text preprocessing (e.g. conversion of digit numbers or acronyms to words) and then converts graphemes to phonemes. In the last stage, the NLP block assigns quantified prosody parameters to each phoneme, guiding the way each phoneme is converted to signal. These quantified prosody parameters generally involve duration, pitch and energy. Next, the DSP block retrieves from a recorded database (speech corpus) the sequence of units that best matches the target requirements (the phonemes and their prosody). Finally, the speech units are assembled to obtain the output speech signal.
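The two-block flow above can be sketched in a few lines of Python. Everything here (the toy grapheme-to-phoneme lookup, the flat prosody values, the greedy retrieval) is an illustrative assumption, not the chapter's actual system; the DSP stand-in uses a greedy match where a real system would use the cost-driven search described next.

```python
# Minimal sketch of the two-block TTS workflow: NLP (text -> phonemes +
# prosody targets) followed by DSP (unit retrieval + concatenation).
# All names and data are hypothetical illustrations.

def nlp_block(text):
    """NLP block: normalize text, convert graphemes to phonemes, and
    attach quantified prosody (duration, pitch, energy) to each phoneme."""
    # Text preprocessing: expand digits (toy example).
    normalized = text.replace("2", "two")
    # Grapheme-to-phoneme conversion (toy lookup table).
    g2p = {"two": ["t", "uw"], "cats": ["k", "ae", "t", "s"]}
    phonemes = [p for word in normalized.split() for p in g2p.get(word, [])]
    # Assign target prosody per phoneme: duration (ms), pitch (Hz), energy.
    return [(p, {"dur": 80.0, "f0": 120.0, "energy": 0.7}) for p in phonemes]

def dsp_block(targets, corpus):
    """DSP block: retrieve the best-matching unit for each target from the
    corpus (greedy pitch match here, standing in for the cost function)."""
    selected = []
    for phoneme, prosody in targets:
        candidates = [u for u in corpus if u["phoneme"] == phoneme]
        best = min(candidates, key=lambda u: abs(u["f0"] - prosody["f0"]))
        selected.append(best)
    return selected

corpus = [
    {"phoneme": "t", "f0": 118.0}, {"phoneme": "uw", "f0": 131.0},
    {"phoneme": "t", "f0": 95.0},
]
targets = nlp_block("2")
units = dsp_block(targets, corpus)
print([u["f0"] for u in units])  # -> [118.0, 131.0]
```

The greedy per-unit choice ignores joins between consecutive units; that is precisely what the concatenation subcost of the next section adds.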

The retrieval process is carried out by a dynamic programming algorithm (e.g. Viterbi or A* (Formiga & Alías, 2006)) driven by a cost function. The cost function computes the cost of selecting a unit within a sequence as the sum of two weighted subcosts (see equation (1)): the target subcost (Ct) and the concatenation subcost (Cc). In this work, Ct is computed as a weighted linear combination of the normalized prosody distances between the NLP-predicted target prosody vector and the candidate unit prosody vector (see equation (2)). In turn, Cc is computed as a weighted linear combination of the distances between the feature vectors of the speech signal around the concatenation point (see equation (3)).

C(t, u) = \sum_{i=1}^{n} \left[ w_t \, C_t(t_i, u_i) + w_c \, C_c(u_{i-1}, u_i) \right]    (1)

C_t(t_i, u_i) = \sum_{j=1}^{P} w_j^t \, d_j^t(t_i, u_i)    (2)

C_c(u_{i-1}, u_i) = \sum_{k=1}^{Q} w_k^c \, d_k^c(u_{i-1}, u_i)    (3)

where t = {t1, t2, ..., tn} represents the target unit sequence and u = {u1, u2, ..., un} represents the candidate unit sequence; d_j^t are the normalized prosody distances between the NLP-predicted target prosody vector and the candidate unit prosody vector, and d_k^c are the distances between the speech-signal feature vectors around the concatenation point.
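The weighted cost function above (total cost as a sum of weighted target and concatenation subcosts, minimized by dynamic programming) can be sketched as follows. The feature names, weight values, and toy candidate lattice are illustrative assumptions, not the chapter's configuration.

```python
# Hedged sketch: per-unit cost = w_t * Ct + w_c * Cc, where Ct is a weighted
# combination of prosody distances to the target and Cc a weighted distance
# across the join; a Viterbi-style search minimizes the summed cost.

def target_subcost(target, unit, wt):
    # Ct: weighted linear combination of prosody distances.
    return sum(wt[k] * abs(target[k] - unit[k]) for k in wt)

def concat_subcost(prev_unit, unit, wc):
    # Cc: weighted distance between features around the join point.
    if prev_unit is None:
        return 0.0
    return sum(wc[k] * abs(prev_unit[k] - unit[k]) for k in wc)

def viterbi_select(targets, candidates, wt, wc, w_t=1.0, w_c=1.0):
    """Dynamic-programming search for the candidate sequence that
    minimizes the accumulated weighted cost."""
    # best[j] = (accumulated cost, path) ending at candidate j of step i.
    best = [(w_t * target_subcost(targets[0], u, wt), [u])
            for u in candidates[0]]
    for i in range(1, len(targets)):
        new_best = []
        for u in candidates[i]:
            ct = w_t * target_subcost(targets[i], u, wt)
            cost, path = min(
                ((c + w_c * concat_subcost(p[-1], u, wc), p)
                 for c, p in best),
                key=lambda x: x[0])
            new_best.append((cost + ct, path + [u]))
        best = new_best
    return min(best, key=lambda x: x[0])

wt = {"dur": 0.5, "f0": 0.5}   # target-subcost weights (assumed values)
wc = {"f0": 1.0}               # concatenation-subcost weights (assumed)
targets = [{"dur": 80, "f0": 120}, {"dur": 90, "f0": 125}]
candidates = [
    [{"dur": 82, "f0": 118}, {"dur": 70, "f0": 140}],
    [{"dur": 88, "f0": 126}, {"dur": 95, "f0": 100}],
]
cost, path = viterbi_select(targets, candidates, wt, wc)
print(cost)  # -> 11.5
```

Tuning the subcost weights (wt, wc and the global w_t, w_c) is exactly the optimization problem that the perceptual, aiGA-based approach addresses.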

Appropriate design of the cost function by means of weight training is crucial to obtain high-quality synthetic speech (Black, 2002). Nevertheless, this concern has motivated approaches with no unique solution. Several techniques have been suggested for weight tuning, which may be split into three families: i) manual tuning, ii) purely objective, computationally driven methods, and iii) perceptually optimized techniques (Alías, Llorà, Formiga, Sastry & Goldberg, 2006). The present review focuses on the techniques that introduce human feedback into the training process, following previous work (Alías, Llorà, Formiga, Sastry & Goldberg, 2006), which is outlined in the next section.

Key Terms in this Chapter

Surrogate Fitness: Synthetic fitness measure that tries to evaluate an evolutionary solution in the same terms as a perceptual user would.

Unit Selection Synthesis: A synthesis technique where appropriate units are retrieved from large databases of natural speech so as to generate synthetic speech.

Diphone: A sound consisting of two phonemes: one that leads into the sound and one that finishes the sound. e.g.: “hello” silence-h h-eh eh-l l-oe oe-silence.
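The "hello" segmentation in the entry above can be reproduced with a small helper; the function and phoneme labels are a toy illustration of the idea, not a tool from the chapter.

```python
# Toy diphone segmentation: each diphone spans the transition between two
# adjacent phonemes, with silence padding at both ends.

def to_diphones(phonemes):
    """Turn a phoneme sequence into the corresponding diphone sequence."""
    padded = ["silence"] + phonemes + ["silence"]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

print(to_diphones(["h", "eh", "l", "oe"]))
# -> ['silence-h', 'h-eh', 'eh-l', 'l-oe', 'oe-silence']
```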

Mel Frequency Cepstral Coefficients (MFCC): The MFCC are the coefficients of the Mel cepstrum. The Mel cepstrum is the cepstrum computed on the Mel bands (a frequency scale matched to the human ear) instead of the Fourier spectrum.
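The recipe in that definition (power spectrum, Mel filterbank, log, cepstral transform) can be sketched with plain NumPy. The frame size, Mel formula, filter count and number of coefficients are common textbook choices, not the chapter's settings.

```python
import numpy as np

# Hedged single-frame MFCC sketch: power spectrum -> triangular Mel
# filterbank energies -> log -> DCT-II, keeping the low coefficients.

def mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_inv(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr, n_filters=26, n_coeffs=13):
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame)) ** 2           # power spectrum
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    # Triangular filters equally spaced on the Mel scale.
    edges = mel_inv(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    energies = np.empty(n_filters)
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        up = np.clip((freqs - lo) / (mid - lo), 0.0, 1.0)
        down = np.clip((hi - freqs) / (hi - mid), 0.0, 1.0)
        energies[i] = np.sum(power * np.minimum(up, down))
    log_e = np.log(energies + 1e-10)                  # log Mel energies
    # DCT-II of the log energies gives the Mel cepstrum.
    k = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_coeffs), k + 0.5) / n_filters)
    return basis @ log_e

sr = 16000
t = np.arange(512) / sr
frame = np.sin(2 * np.pi * 440.0 * t)                 # toy 440 Hz frame
print(mfcc_frame(frame, sr).shape)  # -> (13,)
```

MFCC vectors of this kind are typical spectral features compared around the concatenation point in the Cc subcost.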

Correlation: A statistical measure of the interdependence or association between two or more quantitative variables. A typical calculation is performed by multiplying a signal either by another signal (cross-correlation) or by a delayed version of itself (autocorrelation).
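The autocorrelation case from that definition is easy to demonstrate on a periodic signal, where the autocorrelation peaks at the signal's period; the sampling rate and test frequency below are arbitrary illustrative choices.

```python
import numpy as np

# Autocorrelation of a 50 Hz sine sampled at 1000 Hz: multiplying the
# signal by delayed copies of itself peaks at the 20-sample period.

sr = 1000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 50.0 * t)     # 50 Hz -> 20-sample period

# Non-negative-lag half of the full autocorrelation sequence.
ac = np.correlate(signal, signal, mode="full")[len(signal) - 1:]

# Locate the first non-zero-lag peak (searching a plausible lag range).
lag = 10 + int(np.argmax(ac[10:31]))
print(lag, sr / lag)  # -> 20 50.0
```

This is also the core of classical autocorrelation-based pitch estimation: the best lag gives the fundamental period, and sr/lag the pitch in Hz.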

Evolutionary Algorithms: Collective term for all variants of (probabilistic) optimization and approximation algorithms that are inspired by Darwinian evolution. Optimal states are approximated by successive improvements based on the variation-selection-paradigm.
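The variation-selection paradigm named in that entry can be shown with a minimal evolutionary loop. The one-dimensional toy objective stands in for something like a weight-configuration fitness; it is not the chapter's aiGA, which replaces such an objective with a synthetic fitness mined from user preferences.

```python
import random

# Minimal evolutionary algorithm: repeated selection of the fitter half
# plus Gaussian mutation (variation), maximizing a toy objective.

random.seed(0)

def fitness(x):
    return -(x - 0.3) ** 2            # toy objective, optimum at x = 0.3

population = [random.random() for _ in range(20)]
for generation in range(50):
    # Selection: keep the better half as parents (elitist).
    population.sort(key=fitness, reverse=True)
    parents = population[:10]
    # Variation: Gaussian mutation of each parent, clamped to [0, 1].
    children = [min(1.0, max(0.0, p + random.gauss(0.0, 0.05)))
                for p in parents]
    population = parents + children

best = max(population, key=fitness)
print(best)  # close to 0.3
```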

Pitch: The intonation measure at a given time in the signal.

Generative Topographic Mapping (GTM): A technique for density modelling and data visualisation inspired by the SOM (see Self-Organizing Maps).

Unsupervised Learning: Learning techniques that group instances without a pre-specified dependent attribute. Clustering algorithms are usually unsupervised methods for grouping data sets.

Digital Signal Processing (DSP): The processing of signals by digital means, performed through numerical calculations.

Self-Organizing Maps: Self-organizing maps (SOMs) are a data visualization technique which reduces the dimensionality of data through the use of self-organizing neural networks.
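A compact SOM can be written in a few lines of NumPy: a 1-D grid of prototype vectors learns to spread over 2-D data, reducing it to one dimension. Grid size, learning-rate and neighborhood schedules below are arbitrary illustrative choices (GTM, described above, replaces these heuristics with a probabilistic model).

```python
import numpy as np

# Minimal 1-D self-organizing map over 2-D data: for each sample, find the
# best-matching unit (BMU) and pull it and its grid neighbors toward the
# sample, with a neighborhood that shrinks over training.

rng = np.random.default_rng(0)
data = rng.random((200, 2))                 # 2-D points in [0, 1)^2
n_units = 10
weights = rng.random((n_units, 2))          # one prototype per grid node
grid = np.arange(n_units)

for epoch in range(20):
    lr = 0.5 * (1.0 - epoch / 20)           # decaying learning rate
    sigma = 3.0 * (1.0 - epoch / 20) + 0.5  # decaying neighborhood width
    for x in data:
        bmu = np.argmin(np.sum((weights - x) ** 2, axis=1))
        # Gaussian neighborhood on the grid pulls nearby units toward x.
        h = np.exp(-((grid - bmu) ** 2) / (2 * sigma ** 2))
        weights += lr * h[:, None] * (x - weights)

print(weights.shape)  # -> (10, 2)
```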

Natural Language Processing (NLP): Computer understanding, analysis, manipulation, and/or generation of natural language.

Text Normalization: The process of converting abbreviations and non-word written symbols into words that a speaker would say when reading that symbol out loud.
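A toy normalization pass makes that definition concrete; the tiny lookup tables are illustrative assumptions, far from a production normalizer.

```python
import re

# Toy text normalization: expand a few abbreviations and digits into the
# words a speaker would read aloud.

NUMBERS = {"1": "one", "2": "two", "3": "three"}
ABBREVS = {"Dr.": "Doctor", "St.": "Street"}

def normalize(text):
    for abbr, word in ABBREVS.items():
        text = text.replace(abbr, word)
    # Expand single digits via the lookup table.
    return re.sub(r"\d", lambda m: NUMBERS.get(m.group(), m.group()), text)

print(normalize("Dr. Smith lives at 2 Oak St."))
# -> Doctor Smith lives at two Oak Street
```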

Prosody: A collection of phonological features including pitch, duration, and stress, which define the rhythm of spoken language
