Audio-Visual Speech Emotion Recognition

Oryina Kingsley Akputu (Sunway University, Malaysia), Kah Phooi Seng (Edith Cowan University, Australia) and Yun Li Lee (Sunway University, Malaysia)
Copyright: © 2015 | Pages: 11
DOI: 10.4018/978-1-4666-5888-2.ch011
Background

Since the early 19th century, researchers have sought models for analysing and interpreting subtly complex human emotions along three common dimensions (Scherer, 2000): continuous abstract dimensions, appraisal dimensions, and discrete categories. In the continuous approach, a multidimensional space is defined in which emotion categories are represented as points along underlying dimensions such as valence (V), arousal (A), and control or dominance (C). Appraisal dimensions describe the emotional process and its related events; the central idea is to specify a set of criteria presumed to underlie the emotional constituents of an appraisal process. Discrete categories, on the other hand, entail selecting a set of word labels to represent emotions. In computer vision and HCI, the prototypical (archetypal) or discrete emotions include anger, disgust, fear, joy, sadness, and surprise (Cowie et al., 2001).
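The continuous and discrete views above can be connected computationally: a point in valence-arousal-dominance space can be mapped to the nearest discrete label. The following sketch uses illustrative coordinates chosen for this example (they are not specified in the chapter) and a simple nearest-neighbour rule:

```python
import math

# Illustrative (not chapter-specified) valence-arousal-dominance coordinates,
# each axis normalized to [-1, 1], for the six archetypal emotions.
VAD_POINTS = {
    "anger":    (-0.6,  0.8,  0.5),
    "disgust":  (-0.7,  0.2,  0.1),
    "fear":     (-0.7,  0.7, -0.6),
    "joy":      ( 0.8,  0.6,  0.4),
    "sadness":  (-0.7, -0.5, -0.4),
    "surprise": ( 0.3,  0.8, -0.1),
}

def nearest_emotion(point):
    """Map a continuous (V, A, D) point to the closest discrete label."""
    return min(VAD_POINTS, key=lambda e: math.dist(point, VAD_POINTS[e]))

# A high-valence, high-arousal point lands closest to "joy".
print(nearest_emotion((0.7, 0.5, 0.3)))  # → joy
```

Real systems would instead regress continuous scores from audio-visual features, but the mapping between the two representations follows the same idea.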

Nevertheless, emotional expressions drawn from any of the above dimensions are known for their multimodal correlation and complexity. Traditionally, researchers have employed either a single-modality or a multimodal approach to audio-visual emotion recognition. Information processed by a single sensor (modality) is limited to a single sensory cue, for instance, using facial expression videos or the audio signal of an utterance separately for emotion recognition. Multimodal speech approaches, by contrast, combine affective cues from audio and visual signals; the two are integrated with the aim of harnessing the individual advantages inherent in each modality. A basic audio-visual speech emotion recognition system is composed of four components: audio feature extraction, visual feature extraction, feature selection, and classification. The structure of a standard audio-visual emotion recognition system is illustrated in Figure 1; the most important of these components are discussed in the following sections of this article.
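The four-component structure described above can be sketched as a simple pipeline. All of the extractors, the selection rule, and the classifier below are hypothetical toy stand-ins (a deployed system would use, e.g., prosodic/spectral audio features, geometric lip features, and a trained classifier such as an SVM or HMM), shown only to make the data flow between the components concrete:

```python
# Minimal structural sketch of a four-component audio-visual emotion
# recognition pipeline: audio features -> visual features -> feature
# selection -> classification. All functions are illustrative stand-ins.

def extract_audio_features(audio):
    # Toy summary statistics standing in for prosodic/spectral features.
    n = len(audio)
    mean = sum(audio) / n
    energy = sum(x * x for x in audio) / n
    return [mean, energy]

def extract_visual_features(frames):
    # Toy per-dimension average over frames, standing in for geometric
    # lip/face measurements tracked through a video.
    return [sum(d) / len(d) for d in zip(*frames)]

def select_features(features, k):
    # Placeholder selection rule: keep the k largest-magnitude features.
    return sorted(features, key=abs, reverse=True)[:k]

def classify(features, threshold=0.5):
    # Toy rule-based decision standing in for a trained classifier.
    return "positive" if sum(features) > threshold else "negative"

def recognize(audio, frames, k=3):
    # Feature-level fusion: concatenate audio and visual features,
    # then select and classify the fused vector.
    fused = extract_audio_features(audio) + extract_visual_features(frames)
    return classify(select_features(fused, k))

print(recognize([0.2, 0.4, 0.6], [[0.1, 0.9], [0.3, 0.7]]))
```

This sketch performs feature-level fusion (concatenation before classification); decision-level and hybrid fusion schemes instead combine per-modality classifier outputs.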

Key Terms in this Chapter

Human-Computer Interaction (HCI): The study, planning, and design of the interaction between people (users) and computers (machines).

Affective Computing: The study and development of systems and devices that can recognize, interpret, process, and simulate human affect (e.g. emotions).

Emotions: A generic term referring to subjective, conscious experiences characterized by psychophysiological expressions, biological reactions, and mental states. Emotions are often associated with other human affective dimensions such as mood and personality.

Classification: The assignment of a given instance to a category according to common traits, behaviours, and structural features.

Hybridization: The combination of two techniques, for example phonetic cues (e.g., isolated vowel sounds) and image cues (e.g., lip features), using hybridized fusion techniques.
