I. Introduction
Speech is the fastest and most natural means of communication between humans. For efficient interaction between human and machine, the machine must have sufficient intelligence to understand the information content present in speech (Ayadi et al., 2011). This understanding can be further improved if the emotional state of the speaker is also known (Nwe et al., 2003).
The semantics of spoken words vary with emotional context, so identifying the speaker's emotional state is significant for speech technology. Speech Emotion Recognition (SER) is especially useful for applications that require natural man-machine interaction, such as computer tutoring applications and interactive speech bots like Alexa; such systems respond more productively when they understand the user's emotions. SER is also useful in automatic translation systems, where the emotional state of the speaker plays a vital role in identifying the exact meaning of a phrase. SER has further proved helpful in call centre and mobile communication applications (Petrushin, 1999), where client satisfaction feeds into employee appraisals.
Research on Speech Emotion Recognition (SER) has been evolving since the late nineties (Dellaert et al., 1995), and SER has many socially relevant applications. SER-based systems can serve users according to their emotional state on various occasions, for example by playing music that matches the listener's mood. From a psychological point of view, SER can be used to monitor a person's homeostatic balance and give feedback accordingly (Lee & Busso, 2013). The performance of dialogue-based applications and question-answering systems may be improved by incorporating emotion into the conversation (Burkhardt et al., 2009). In usability analysis, an interactive application can capture the speaker's feelings about the product and the user-friendliness of the application. Interactive gaming applications have been developed to record and analyze the emotions evoked during play, which supports research into the role particular games play in eliciting user emotions and can be used by psychologists for various analyses.
SER is affected by various factors such as the recording environment, acoustic and cultural background, and the age and gender of the speaker. According to the literature, culture and gender roles have a strong impact on emotional expression (Wester et al., 2002; Wang, 2018; Kamaruddin et al., 2012). Researchers have used gender information to enhance emotion recognition accuracy (Devika et al., 2016; Fu & Wang, 2010) and emotion information for gender recognition (Chen, Gu, Lu, & Ke, 2012; Safavi et al., 2018). One possible explanation for gender differences in emotional expressiveness is social: men and women are taught by social and cultural standards to express emotions differently (Derks et al., 2008). In many places, empirical evidence suggests that girls are socialized to be emotional, non-aggressive, nurturing, and obedient, whereas boys are socialized to be unemotional, aggressive, achievement-oriented, and self-reliant. In many countries, women are often presumed to express happiness, while men are not expected to be expressive (Wester et al., 2002).
Moreover, the physical characteristics of the male and female sound-production systems differ: variations in vocal tract length and in vocal fold size and dimensions change the glottal closure period and the formant frequencies of men and women. The acoustic characteristics of emotional speech therefore differ between male and female speakers, owing to differences in the ranges of acoustic features such as pitch, intensity, energy, and formant frequencies. Thus, a single emotion model trained on both genders may not provide accurate results, since the training data then contain parameter variations due to gender as well.
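As a rough illustration of one such gender-dependent acoustic feature, the sketch below estimates fundamental frequency (pitch) from a signal's autocorrelation. The synthetic tones, sampling rate, search bounds, and function name are illustrative assumptions introduced here, not part of any SER system described in this article; the tones are placed at pitches commonly cited as typical of adult male (~120 Hz) and adult female (~220 Hz) voices.

```python
import numpy as np

def estimate_f0(signal, sr, fmin=60.0, fmax=400.0):
    """Estimate fundamental frequency (Hz) by autocorrelation peak picking."""
    sig = signal - signal.mean()
    # Autocorrelation for non-negative lags 0..N-1.
    ac = np.correlate(sig, sig, mode="full")[len(sig) - 1:]
    lo = int(sr / fmax)   # smallest lag considered (highest pitch)
    hi = int(sr / fmin)   # largest lag considered (lowest pitch)
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag

sr = 16000
t = np.arange(4000) / sr                 # 0.25 s of signal
male = np.sin(2 * np.pi * 120 * t)       # ~120 Hz tone (typical male pitch)
female = np.sin(2 * np.pi * 220 * t)     # ~220 Hz tone (typical female pitch)

print(estimate_f0(male, sr), estimate_f0(female, sr))
```

On these clean tones the estimates fall close to 120 Hz and 220 Hz; real speech is noisier and harmonically richer, which is why the pitch ranges of male and female speakers overlap and motivate gender-aware emotion modelling.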