Audio-Visual Emotion Recognition System Using Multi-Modal Features

Anand Handa, Rashi Agarwal, Narendra Kohli
DOI: 10.4018/IJCINI.20211001.oa34


Facial Expression Recognition (FER) remains a challenging problem due to highly variable face geometry and appearance. Since CNNs are well suited to characterizing 2-D signals, the authors propose, for emotion recognition in video, a feature selection model within the AlexNet architecture that automatically extracts and filters facial features. For emotion recognition in audio, they use a deep LSTM-RNN. Finally, they propose a probabilistic model for the fusion of the audio and visual models using a subject's facial features and speech. The model combines all the extracted features and uses them to train linear SVM (Support Vector Machine) classifiers. The proposed model outperforms existing models and achieves state-of-the-art performance for the audio, visual, and fusion models. It classifies the seven known facial expressions, namely anger, happy, surprise, fear, disgust, sad, and neutral, on the eNTERFACE'05 dataset with an overall accuracy of 76.61%.
Article Preview


Computer vision has, in recent years, produced outstanding and productive results on tasks such as face recognition, emotion recognition, and speech recognition, largely through the adoption of high-end machine learning techniques. However, human expression recognition remains an onerous task. The first Emotion Recognition in the Wild (EmotiW) challenge (Dhall et al., 2013) was held in 2013. Since then, classification accuracy has increased considerably from the baseline figure of 38%, but there is still scope for improvement. Several factors have historically kept accuracy low: labeled video datasets are scarce, facial expressions are ambiguous in nature, and methods for extracting facial expression features have limited effectiveness. In recent years, Deep Convolutional Neural Networks (DCNNs) (Schmidhuber, 2015) have proven outstanding at extracting features from images, and Long Short-Term Memory (LSTM) networks have proven highly effective at analyzing sequential data (Sak et al., 2014). Applying and combining these recent and effective methods may therefore classify human facial expressions more accurately. The main contributions of this paper can be summarized as follows:

  • A separate feature selection model is introduced into the AlexNet architecture that automatically filters the most prominent facial features, improving the overall accuracy of the model.

  • Separate models for audio and visual emotion recognition with improved classification accuracy.

  • A probabilistic audio-visual fusion model using an SVM classifier that classifies the emotions with better accuracy.
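The probabilistic fusion idea behind the third contribution can be illustrated with a minimal sketch: each unimodal model emits per-class probabilities, and the fused decision is taken over a weighted combination. The weights, probability values, and `fuse` helper below are illustrative assumptions, not the paper's actual parameters.

```python
import numpy as np

# The seven expression classes used in the paper
EMOTIONS = ["anger", "happy", "surprise", "fear", "disgust", "sad", "neutral"]

def fuse(p_audio, p_visual, w_audio=0.4, w_visual=0.6):
    """Weighted-sum decision fusion of two per-class probability vectors.

    The modality weights are hypothetical; a real system would tune them
    on a validation set.
    """
    p = w_audio * np.asarray(p_audio) + w_visual * np.asarray(p_visual)
    p = p / p.sum()  # renormalize to a valid distribution
    return EMOTIONS[int(np.argmax(p))], p

# Hypothetical outputs from the audio and visual models for one clip
p_a = [0.05, 0.60, 0.05, 0.05, 0.05, 0.10, 0.10]
p_v = [0.10, 0.40, 0.10, 0.10, 0.10, 0.10, 0.10]
label, p = fuse(p_a, p_v)
print(label)  # happy
```

Both modalities favor "happy" here, so the fused decision agrees; the interesting cases are disagreements, where the modality weights decide.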

The rest of the paper is organized as follows: Section 2 discusses the related work. Section 3 presents the multi-modal emotion recognition framework, including the datasets, multi-modal features, and network architecture. Section 4 presents the experimental setup for audio and visual emotion recognition. Section 5 discusses the experimental results of the audio, visual, and audio-visual fusion-based recognition models separately, and Section 6 concludes the paper.


A multi-modal approach to emotion recognition is more powerful and efficient than bimodal and unimodal approaches because human emotions depend on both audio and visual information. In recent years, many studies based on audio-visual recognition of human emotions have appeared, and they also show the fusion of audio and visual information to be advantageous for emotion recognition. In this section, the authors discuss a few of them.

M. Mansoorizadeh et al. (Mansoorizadeh and Charkari, 2010) propose a fusion-based approach to emotion recognition that uses both decision-level and feature-level fusion. Features related to the same emotion have a higher chance of overlapping. The proposed framework combines features from the different modalities to generate a hybrid feature space. Experiments are performed on two audio-visual emotion databases with 42 and 12 subjects, respectively. The proposed model's accuracy is higher than that of the unimodal and bimodal face- and speech-based individual systems.
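Feature-level fusion of the kind described above can be sketched as concatenating the unimodal feature vectors into one hybrid space, typically after per-modality normalization so neither modality dominates by scale. The feature dimensions and the `hybrid_features` helper below are illustrative assumptions, not the cited paper's actual design.

```python
import numpy as np

def hybrid_features(audio_feat, visual_feat):
    """Concatenate z-scored unimodal feature vectors into one hybrid space.

    Z-scoring each modality separately is one common (assumed) way to put
    audio and visual features on a comparable scale before fusion.
    """
    z = lambda v: (v - v.mean()) / (v.std() + 1e-8)
    return np.concatenate([z(np.asarray(audio_feat, dtype=float)),
                           z(np.asarray(visual_feat, dtype=float))])

a = np.random.randn(13)   # e.g. a vector of audio (prosodic/cepstral) features
v = np.random.randn(128)  # e.g. a vector of facial features
h = hybrid_features(a, v)
print(h.shape)  # (141,)
```

A classifier trained on `h` then sees both modalities jointly, which is the essence of feature-level (as opposed to decision-level) fusion.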

An audio-visual recognition system based on the fusion of features is proposed by R. Gajsek et al. (Štruc et al., 2010). For the audio-based recognition model, cepstral and prosodic coefficients are extracted, and for the video-based recognition model, Gabor wavelets serve as features. Finally, a multi-class classifier combines the outputs.

International Journal of Cognitive Informatics and Natural Intelligence

In (Avots et al., 2019), the authors present an analysis of an audio-visual model for emotion recognition. They train the models on three databases, SAVEE, eNTERFACE'05, and RML, and use the AFEW database as the test set. Mel-frequency cepstral coefficients (MFCCs) represent the emotional speech, and an SVM classifier performs the classification. The proposed multimodal emotion recognition system is a decision-based fusion model, with facial image classification performed using AlexNet. The reported accuracy on eNTERFACE'05 is 48.2%.
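The MFCC representation mentioned above can be sketched end to end: window the signal, take the power spectrum, apply a triangular mel filterbank, log-compress, and decorrelate with a DCT. The sample rate, filterbank size, and coefficient count below are common defaults chosen for illustration, not the cited paper's settings.

```python
import numpy as np
from scipy.fft import dct

def mfcc_frame(signal, sr=16000, n_fft=512, n_mels=26, n_ceps=13):
    """MFCCs for a single frame of audio (illustrative sketch)."""
    # Hamming-windowed power spectrum
    frame = signal[:n_fft] * np.hamming(n_fft)
    power = np.abs(np.fft.rfft(frame)) ** 2 / n_fft
    # Triangular mel filterbank between 0 Hz and the Nyquist frequency
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    inv_mel = lambda m: 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(mel(0), mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * inv_mel(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log filterbank energies, then DCT to decorrelate
    energies = np.log(fbank @ power + 1e-10)
    return dct(energies, norm='ortho')[:n_ceps]

coeffs = mfcc_frame(np.random.randn(512))
print(coeffs.shape)  # (13,)
```

In practice these per-frame vectors are computed over overlapping frames of an utterance and aggregated (or fed as a sequence to an LSTM) before classification.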
