Unit-Selection Speech Synthesis Method Using Words as Search Units

Hiroyuki Segi (Department of Computer and Information Science, Seikei University, Tokyo, Japan)
DOI: 10.4018/IJMDEM.2016040104

Abstract

Most unit-selection speech-synthesis systems use rather short search units, such as syllables, phonemes and diphones. However, when applied to large speech databases, shorter units produce more voice-waveform candidates, so a large speech database cannot be used in practice without narrow pruning, and narrow pruning impairs the quality of the synthesized speech. Here the author examined the possibility of using words as search units. Subjective evaluations indicated that 70% of the speech synthesized by the proposed method sounded more natural than that synthesized by a conventional method. The five-point mean opinion score of the synthesized speech was 3.5, and 21% of it was judged to sound as natural as human speech. These results demonstrate the effectiveness of unit-selection speech synthesis using words as search units.

1. Introduction

There is a strong need for higher-quality Text-To-Speech (TTS) conversion in broadcasting services. The development of a TTS system that can generate synthesized speech that sounds similar to a human voice could improve access to text information in both data broadcasts (Sakai, 2007) and broadband content (Baba, 2012) for visually impaired and mobile receivers. Moreover, a high-quality TTS system could facilitate the development of automatic spoken broadcasts, such as weather reports (Segi, 2013), and even automatic television broadcasts, by combining speech with computer-graphic animations generated from a script (Hayashi, 2013; Doke, 2012).

Several types of TTS system have been reported to date. One group compiles recorded speech sounds, as in airport and train announcements (Demeur, 1987). Although the speech synthesized by this method has not yet been evaluated, it is widely considered to achieve human voice quality based on its use in broadcast systems. However, the content of the speech synthesized by this method is limited to combinations of recorded phrases connected by silent sections. Thus, it cannot be utilized for the speech synthesis of arbitrary input sentences. Moreover, this method does not take coarticulation into account, suggesting that the naturalness of the synthesized speech is degraded without sufficient silent sections. Indeed, 91% of synthesized speech with coarticulation was evaluated as more natural than synthesized speech without coarticulation in a previous study (Segi, 2010).

A second group of TTS systems employs Hidden Markov Models (HMMs) (Zen, 2009; Toda, 2007). This method analyzes the speech data, extracts the prosody and voice-quality components separately, and allows them to be controlled independently (Kawahara, 1999). For example, HMM TTS systems can extract the feature parameters of the phoneme “a” from a speech database and use them to synthesize speech. This method has several advantages: it is easy to use for voice conversion, performs well with small speech databases, and does not require a high-performance Central Processing Unit (CPU) or large memory. However, the naturalness of the speech synthesized using this method is not very high (Zen, 2008; Takaki, 2011; Nose, 2013).
