Introduction
In recent years, computer vision has produced outstanding results on tasks such as face recognition, emotion recognition, and speech recognition, largely because of the adoption of advanced techniques such as machine learning. However, human expression recognition remains a difficult task. The first Emotion Recognition in the Wild (EmotiW) challenge (Dhall et al., 2013) was held in 2013. Since then, classification accuracy has risen considerably from the baseline figure of 38%, but there is still scope for improvement. Several factors have kept accuracy low in the past: labeled video datasets are scarce, facial expressions are inherently ambiguous, and the methods for extracting facial expression features have been of limited effectiveness. In the last few years, Deep Convolutional Neural Networks (DCNN) (Schmidhuber, 2015) have proven highly effective at extracting features from images, and Long Short-Term Memory (LSTM) networks have proven highly effective at analyzing sequential data (Sak et al., 2014). Combining these recent methods may therefore improve the accuracy of classifying human facial expressions. The main contributions of this paper can be summarized as follows:
• A separate feature selection model is introduced into the AlexNet architecture. It automatically filters the most prominent facial features, which improves the overall accuracy of the model.
• Separate models for audio and visual emotion recognition, each with improved classification accuracy.
• A probabilistic audio-visual fusion model that uses an SVM classifier to classify emotions with improved accuracy.
The rest of the paper is organized as follows. Section 2 discusses related work. Section 3 presents the multi-modal emotion recognition framework, including the datasets, multi-modal features, and network architecture. Section 4 presents the experimental setup for audio and visual emotion recognition. Section 5 discusses the experimental results of the audio, video, and audio-visual fusion-based recognition models separately, and Section 6 concludes the paper.
Related Work
A multi-modal approach to emotion recognition is more powerful and efficient than bimodal and unimodal approaches because human emotions depend on both audio and visual information. In recent years, many studies on audio-visual recognition of human emotions have appeared, and they show that fusing audio and visual information is advantageous for emotion recognition. In this section, the authors discuss a few of them.
Mansoorizadeh and Charkari (2010) propose a fusion-based approach to emotion recognition that uses both decision-level and feature-level fusion. Features related to the same emotion have a higher chance of overlapping. The proposed framework combines the features of the different modalities and generates a hybrid feature space. The experiments are performed on two audio-visual emotion databases with 42 and 12 subjects, respectively. The accuracy of the proposed model is higher than that of the individual unimodal and bimodal face- and speech-based systems.
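The difference between the two fusion levels can be illustrated with a minimal sketch. It is not the method of Mansoorizadeh and Charkari: the function names, the three-emotion label set, and the weighted-average rule for decision-level fusion are assumptions chosen only for illustration.

```python
from typing import List, Sequence

# Hypothetical label set, used only for this illustration.
EMOTIONS = ["angry", "happy", "sad"]


def feature_level_fusion(audio_features: Sequence[float],
                         visual_features: Sequence[float]) -> List[float]:
    """Early fusion: concatenate per-modality features into one hybrid vector.

    A single classifier would then be trained on the hybrid vector.
    """
    return list(audio_features) + list(visual_features)


def decision_level_fusion(audio_probs: Sequence[float],
                          visual_probs: Sequence[float],
                          audio_weight: float = 0.5) -> List[float]:
    """Late fusion: weighted average of the class probabilities that two
    separately trained classifiers output (an assumed, simple fusion rule).
    """
    if len(audio_probs) != len(visual_probs):
        raise ValueError("both modalities must score the same classes")
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    fused = [audio_weight * a + (1.0 - audio_weight) * v
             for a, v in zip(audio_probs, visual_probs)]
    total = sum(fused)
    if total <= 0.0:
        raise ValueError("fused scores must have a positive sum")
    return [p / total for p in fused]  # renormalise so the scores sum to 1


def predict(probs: Sequence[float], labels: Sequence[str]) -> str:
    """Return the label with the highest fused probability."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best]


# Made-up inputs: two audio features, three visual features.
hybrid = feature_level_fusion([0.1, 0.4], [0.9, 0.2, 0.7])

# Made-up classifier outputs: audio favours "angry", video favours "happy".
fused = decision_level_fusion([0.5, 0.3, 0.2], [0.2, 0.7, 0.1])
label = predict(fused, EMOTIONS)
```

With equal weights the visual classifier's stronger vote prevails in this toy case. A framework that uses both levels, as described above, would combine the hybrid-vector classifier with the per-modality decisions.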
R. Gajsek et al. (Štruc et al., 2010) propose an audio-visual recognition system based on feature fusion. For the audio-based recognition model, cepstral and prosodic coefficients are extracted; for the video-based recognition model, Gabor wavelets are used as features. Finally, a multi-class classifier combines the outputs.
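As a rough illustration of the Gabor wavelets mentioned above, the sketch below builds the real part of a 2-D Gabor kernel (a Gaussian envelope multiplied by a cosine carrier) and a small bank over several orientations. The source does not give the kernel size, wavelengths, orientations, or bandwidth used in the cited work, so every parameter value here, including the choice of sigma as half the wavelength, is an assumption.

```python
import math
from typing import List, Sequence

Kernel = List[List[float]]


def gabor_kernel(size: int, wavelength: float, theta: float,
                 sigma: float, gamma: float = 0.5, psi: float = 0.0) -> Kernel:
    """Real part of a 2-D Gabor filter sampled on a size x size grid.

    wavelength: period of the cosine carrier in pixels
    theta:      orientation of the carrier in radians
    sigma:      standard deviation of the Gaussian envelope
    gamma:      spatial aspect ratio of the envelope
    psi:        phase offset of the carrier
    """
    if size < 1 or size % 2 == 0:
        raise ValueError("size must be a positive odd integer")
    half = size // 2
    kernel: Kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate the coordinates into the filter's orientation.
            x_r = x * math.cos(theta) + y * math.sin(theta)
            y_r = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(x_r ** 2 + (gamma * y_r) ** 2)
                                / (2.0 * sigma ** 2))
            carrier = math.cos(2.0 * math.pi * x_r / wavelength + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


def gabor_bank(size: int, wavelengths: Sequence[float],
               n_orientations: int) -> List[Kernel]:
    """One kernel per (wavelength, orientation) pair.

    sigma = wavelength / 2 is an arbitrary illustrative choice.
    """
    return [gabor_kernel(size, lam, k * math.pi / n_orientations,
                         sigma=lam / 2.0)
            for lam in wavelengths
            for k in range(n_orientations)]


# Hypothetical bank: 2 wavelengths x 4 orientations = 8 kernels of 7 x 7.
bank = gabor_bank(size=7, wavelengths=[4.0, 8.0], n_orientations=4)
```

In a feature extractor, each kernel would be convolved with the face image and the filter responses collected into a feature vector; that step is omitted here.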
International Journal of Cognitive Informatics and Natural Intelligence
Avots et al. (2019) present an analysis of an audio-visual model for emotion recognition. They use three databases, SAVEE, eNTERFACE’05, and RML, to train the models, and the AFEW database as the test set. Mel-frequency cepstral coefficients (MFCC) represent the emotional speech, and an SVM classifier is used for classification. The proposed multimodal emotion recognition system is a decision-based fusion model. Facial image classification is performed with AlexNet. The reported accuracy for eNTERFACE’05 is 48.2%.
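MFCC features rest on the mel scale, which spaces frequency bands roughly the way human hearing does. The sketch below shows only that first ingredient: converting between hertz and mel and placing filter centers evenly on the mel axis. It uses one common variant of the mel formula; the source does not say which variant, how many filters, or which frequency range Avots et al. used, so those values are assumptions.

```python
import math
from typing import List


def hz_to_mel(freq_hz: float) -> float:
    """Hertz to mel, using the common 2595 * log10(1 + f / 700) variant."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_filters: int, f_min: float,
                           f_max: float) -> List[float]:
    """Center frequencies (in Hz) of n_filters bands spaced evenly in mel."""
    if n_filters < 1 or not 0.0 <= f_min < f_max:
        raise ValueError("need n_filters >= 1 and 0 <= f_min < f_max")
    mel_min, mel_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (mel_max - mel_min) / (n_filters + 1)
    return [mel_to_hz(mel_min + step * (i + 1)) for i in range(n_filters)]


# Hypothetical setup: 26 filters between 0 Hz and 8 kHz.
centers = mel_center_frequencies(n_filters=26, f_min=0.0, f_max=8000.0)
```

The centers crowd together at low frequencies and spread out at high ones. A full MFCC pipeline would go on to apply triangular filters at these centers to a power spectrum, take logarithms, and apply a discrete cosine transform.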