Speech Synthesis of Emotions Using Vowel Features

Kanu Boku, Taro Asada, Yasunari Yoshitomi, and Masayoshi Tabuse (Graduate School of Life and Environmental Sciences, Kyoto Prefectural University, Sakyo-ku, Kyoto, Japan)
Copyright: © 2013 | Pages: 14
DOI: 10.4018/ijsi.2013010105

Abstract

Recently, methods for adding emotion to synthetic speech have received considerable attention in the field of speech synthesis research. Generating emotional synthetic speech requires controlling the prosodic features of the utterances. The authors propose a case-based method for generating emotional synthetic speech by exploiting the maximum amplitude and the utterance duration of vowels, together with the fundamental frequency, of emotional speech. As an initial investigation, they adopted utterances of Japanese names, which are semantically neutral. Using the proposed method, emotional synthetic speech made from the emotional speech of one male subject was discriminated with a mean accuracy of 70% when ten subjects listened to emotional synthetic utterances of the Japanese name "Taro" expressing "angry," "happy," "neutral," "sad," or "surprised."
Article Preview

Proposed Method

In the first stage, we record emotional speech as a WAV file while a subject speaks with each of the intentional emotions of "angry," "happy," "neutral," "sad," and "surprised." Then, for each kind of emotional speech, we measure the duration of each vowel utterance and the maximum amplitude of the waveform during the vowel.
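The first-stage measurements can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the vowel boundaries within the recording are already known (e.g. from manual labelling), and it uses NumPy purely for array handling; the function name `vowel_features` is hypothetical.

```python
import numpy as np

def vowel_features(samples, rate, start, end):
    """Return (duration_s, max_amplitude) for one vowel segment.

    samples: 1-D array of waveform samples (e.g. decoded from a WAV file).
    rate: sampling rate in Hz.
    start, end: vowel boundaries in seconds (assumed known).
    """
    seg = samples[int(start * rate):int(end * rate)]
    # Duration of the vowel utterance and the peak absolute amplitude
    # of the waveform during the vowel.
    return end - start, float(np.max(np.abs(seg)))

# Toy example: a 220 Hz tone standing in for a vowel in a 1 s, 16 kHz signal.
rate = 16000
t = np.arange(rate) / rate
signal = 0.5 * np.sin(2 * np.pi * 220 * t)
dur, amp = vowel_features(signal, rate, 0.4, 0.6)
```

In practice the samples would be read from the recorded WAV file (e.g. with Python's standard `wave` module) rather than synthesized as above.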

The second stage synthesizes the phoneme sequence uttered by the subject. This stage consists of the following four steps:

  • Step 1: For a vowel preceded by a consonant in synthetic speech with neutral emotion, the total phonation duration of the consonant and vowel is transformed to match that of the subject's neutral speech. The synthetic speech obtained by this processing is hereinafter called "neutral synthetic speech";

  • Step 2: For a vowel preceded by a consonant in synthetic speech with one of the intentional emotions of "angry," "happy," "sad," and "surprised," the total phonation duration of the consonant and vowel is set so that its ratio to the corresponding duration in neutral synthetic speech equals the ratio of the vowel's phonation duration in emotional speech to its phonation duration in neutral speech;

  • Step 3: The fundamental frequency of synthetic speech, obtained by the processing up to Step 2, is adjusted based on the fundamental frequency of the emotional speech;

  • Step 4: For a vowel preceded by a consonant in synthetic speech obtained by the processing up to Step 3, the amplitudes are transformed into their final values by multiplying once or twice by the ratio A_e / A_n, where A_e and A_n denote the maximum amplitude of the vowel in emotional speech and in neutral speech, respectively. The synthetic speech obtained by the processing up to Step 4 is hereinafter called "emotional synthetic speech."

If no consonant appears just before a vowel, the process described in Steps 1, 2, and 4 applies to just the vowel.
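The duration scaling of Step 2 and the amplitude scaling of Step 4 reduce to two ratio computations, sketched below. This is an illustrative helper under stated assumptions, not the authors' code: the function name `emotional_scaling` and its parameter names are hypothetical, and `amp_passes` stands for the "once or twice" multiplication by the amplitude ratio described in Step 4.

```python
def emotional_scaling(neutral_dur, vowel_dur_emotional, vowel_dur_neutral,
                      amp_emotional, amp_neutral, amp_passes=1):
    """Illustrate Steps 2 and 4 of the proposed method.

    neutral_dur: consonant+vowel duration in neutral synthetic speech (s).
    vowel_dur_emotional / vowel_dur_neutral: vowel phonation durations
        measured in the subject's emotional and neutral speech (s).
    amp_emotional / amp_neutral: maximum vowel amplitudes in the subject's
        emotional and neutral speech.
    amp_passes: how many times (1 or 2) the amplitude ratio is applied.
    Returns (target_duration, amplitude_gain).
    """
    # Step 2: scale the consonant+vowel duration by the emotional/neutral
    # vowel-duration ratio.
    target_duration = neutral_dur * (vowel_dur_emotional / vowel_dur_neutral)
    # Step 4: multiply the amplitude once or twice by the ratio A_e / A_n.
    amplitude_gain = (amp_emotional / amp_neutral) ** amp_passes
    return target_duration, amplitude_gain

# Example: the emotional vowel is twice as long and twice as loud as the
# neutral one, with the amplitude ratio applied twice.
d, g = emotional_scaling(0.3, 0.24, 0.12, 0.8, 0.4, amp_passes=2)
```

The resulting gain would then be applied to the vowel samples of the synthetic waveform (e.g. `samples * g`), while the duration change would be realized by the time-scaling machinery of the underlying synthesizer.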
