Comparison of Several Acoustic Modeling Techniques for Speech Emotion Recognition


Imen Trabelsi (Sciences and Technologies of Image and Telecommunications (SETIT), University of Sfax, Tunisia) and Med Salim Bouhlel (Sciences and Technologies of Image and Telecommunications (SETIT), University of Sfax, Tunisia)
Copyright: © 2016 |Pages: 11
DOI: 10.4018/IJSE.2016010105

Abstract

Automatic Speech Emotion Recognition (SER) is an active research topic in the field of Human-Computer Interaction (HCI) with a wide range of applications. The purpose of a speech emotion recognition system is to automatically classify a speaker's utterances into emotional states such as disgust, boredom, sadness, neutral, and happiness. The speech samples in this paper are taken from the Berlin emotional database. Mel Frequency Cepstral Coefficients (MFCC), Linear Prediction Coefficients (LPC), Linear Prediction Cepstral Coefficients (LPCC), Perceptual Linear Prediction (PLP), and Relative Spectral Perceptual Linear Prediction (Rasta-PLP) features are used to characterize the emotional utterances through a combination of Gaussian Mixture Models (GMM) and Support Vector Machines (SVM) based on the Kullback-Leibler divergence kernel. The effects of feature type and feature dimension are investigated comparatively. The best results are obtained with 12-coefficient MFCC, achieving a recognition rate of 84%, which is close to human performance on this database.

Introduction

Human emotion recognition is one of the major challenges in Human-Computer Interaction (Grunberg, 2012), owing to its wide range of applications and complex tasks: agent-customer communication, speech-driven facial animation, e-tutoring applications, etc.

The classification of emotions has been researched from two fundamental viewpoints: first, that emotions are basic, distinct, and fundamentally different constructs (e.g., fear and anger); second, that emotions can be characterized on a dimensional space (Cowie, McKeown, & Douglas-Cowie, 2012; Hudlicka, 2011; Gunes, 2010). Speech emotion recognition (SER) systems use several types of databases of acted, simulated, or real emotions. A wide range of pattern recognition methods exists, spanning the two major paradigms of machine learning: generative and discriminative methods. A generative method builds a full probabilistic model over all variables, as in Bayes decision theory and related density estimation techniques. A discriminative method learns the conditional probability distribution directly, as in nearest-neighbor classification and support vector machines.
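The generative/discriminative contrast above can be illustrated with a minimal NumPy sketch (a toy construction, not from the paper): a generative classifier that fits a class-conditional Gaussian density per class and applies Bayes' rule, versus a discriminative logistic regression trained by gradient descent on the same data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-class data: two well-separated Gaussian clouds.
X0 = rng.normal(loc=-2.0, scale=1.0, size=(200, 2))
X1 = rng.normal(loc=+2.0, scale=1.0, size=(200, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

# --- Generative: estimate a class-conditional density p(x|c) per class
# (here a diagonal-covariance Gaussian), then classify via Bayes' rule.
def fit_gaussian(Xc):
    return Xc.mean(axis=0), Xc.var(axis=0) + 1e-6

def log_gauss(X, mu, var):
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mu) ** 2 / var, axis=1)

params = [fit_gaussian(X[y == c]) for c in (0, 1)]
scores = np.column_stack([log_gauss(X, mu, var) for mu, var in params])
gen_pred = scores.argmax(axis=1)

# --- Discriminative: logistic regression models p(y|x) directly,
# with no density estimate at all.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

w, b = np.zeros(2), 0.0
for _ in range(500):
    p = sigmoid(X @ w + b)
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)
disc_pred = (sigmoid(X @ w + b) > 0.5).astype(int)

gen_acc = np.mean(gen_pred == y)
disc_acc = np.mean(disc_pred == y)
```

On cleanly separable data both routes perform equally well; the practical difference shows up in how each uses training data and how each can be combined, which motivates the hybrid GMM-SVM system discussed next.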

Integrating generative machine learning models such as the Gaussian Mixture Model (GMM) with discriminative models such as the Support Vector Machine (SVM) in a hybrid system has shown great success. Favorable properties of the SVM, such as non-linear kernels and an inherently class-discriminative model structure, represent an attractive way to enhance the GMM. Accordingly, a combination of these two powerful models based on the Kullback-Leibler divergence kernel is presented in this paper. The speech signal contains a large number of parameters that reflect its emotional characteristics, and different parameters capture different emotional variations. Thus, the most important challenge in speech emotion recognition is determining the feature parameters that best express the emotional characteristics of speech. A great deal of research has been done on speech parameterization, resulting in many different feature methods. These methods can be grouped into three broad categories: (1) spectral features, (2) prosodic features, and (3) high-level features. They can also be divided into utterance-level features and frame-level features (Trabelsi & Bouhlel, 2016). In this paper, frame-level features are explored, in particular spectral features such as Mel Frequency Cepstral Coefficients (MFCC), Linear Prediction Coefficients (LPC), Linear Prediction Cepstral Coefficients (LPCC), Perceptual Linear Prediction (PLP), Relative Spectral Perceptual Linear Prediction (Rasta-PLP), and formants.
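A sketch of the Kullback-Leibler divergence kernel idea may be helpful. The preview does not give the exact formulation used in the paper, so the following assumes the common GMM-supervector setting: each utterance is represented by a GMM that shares weights and variances with a background model (means-only adaptation), KL divergence between two such GMMs is approximated by the matched-pair upper bound, and the symmetrized divergence is exponentiated into an RBF-style kernel for the SVM. All names (`kl_diag_gauss`, `kl_kernel`, etc.) are illustrative, not from the paper.

```python
import numpy as np

def kl_diag_gauss(mu_a, var_a, mu_b, var_b):
    """Closed-form KL(N_a || N_b) for diagonal-covariance Gaussians."""
    return 0.5 * np.sum(
        np.log(var_b / var_a) + (var_a + (mu_a - mu_b) ** 2) / var_b - 1.0
    )

def kl_gmm_upper(w, mu_a, mu_b, var):
    """Matched-pair upper bound on KL between two GMMs that share
    weights and variances and have aligned components (as with
    means-only adaptation from a common background model)."""
    return sum(w[i] * kl_diag_gauss(mu_a[i], var[i], mu_b[i], var[i])
               for i in range(len(w)))

def kl_kernel(w, mu_a, mu_b, var, gamma=1.0):
    """Symmetrized-KL kernel between two adapted GMMs: exponentiating
    the divergence turns a dissimilarity into an SVM-usable similarity."""
    d = kl_gmm_upper(w, mu_a, mu_b, var) + kl_gmm_upper(w, mu_b, mu_a, var)
    return np.exp(-gamma * 0.5 * d)

# Toy demo: three "utterances", each an adapted set of component means.
rng = np.random.default_rng(1)
M, D = 4, 13                        # mixture components, feature dimension
w = np.full(M, 1.0 / M)             # shared background-model weights
var = np.ones((M, D))               # shared background-model variances
utts = [rng.normal(size=(M, D)) for _ in range(3)]

# Gram matrix that an SVM would consume as a precomputed kernel.
K = np.array([[kl_kernel(w, a, b, var) for b in utts] for a in utts])
```

Because KL divergence is non-negative and zero only for identical models, the resulting Gram matrix is symmetric with ones on the diagonal, which is the shape an SVM with a precomputed kernel expects.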

The paper is organized as follows. Section 2 reviews the related literature. Section 3 describes the proposed method. Section 4 discusses the results, and finally Section 5 concludes the work.
