Introduction
Speech is the natural form of communication between humans and provides a great deal of information about the speaker, the language, and emotions. This fact has motivated researchers to find a fast and efficient method of natural interaction between man and machine. The presence of emotions makes speech more natural. This has introduced a relatively new research area, namely speech emotion recognition (SER), which is defined as extracting the emotional state of a speaker from his or her speech. This challenging task has several applications in day-to-day life, such as agent-customer interactions, call-center applications (Herm, 2008), web movies, on-board car driving systems (Hu et al., 2013), medical diagnostic tools, and E-tutoring systems (Trabelsi & Bouhlel, 2016a). As in any pattern recognition problem, the performance of emotion recognition from speech depends on the labeling, organization, representation, and evaluation of the training data. A significant challenge for emotion research lies in establishing what emotion is and in finding appropriate emotional labels. Three labeling methods can be distinguished: (1) the categorical approach, (2) the dimensional approach, and (3) the appraisal-based approach (Cowie, McKeown, & Douglas-Cowie, 2012; Hudlicka, 2011). In the first, each emotion is described as a discrete class that is explicitly distinct from, and mutually exclusive with, the others. In the second, emotion is described as a continuous process that changes dynamically over time, using a multi-dimensional emotion model. The appraisal-based approach, by contrast, introduces the role of time into the comprehension of emotions (Mortillaro, Meuleman, & Scherer, 2012; De Vries, 2015). A critical research challenge in speech emotion recognition systems is how to encode the spoken emotion with suitable features (Maji et al., 2015; Saba et al., 2016). This step, called feature extraction, is of great importance in SER.
However, having a large number of potential features increases the complexity of the system and normally results in longer training times. Therefore, a popular approach is to start with a larger set of features and then remove the irrelevant ones, reducing the dimensionality of the training data and generating a more compact and robust feature set. Another important issue in the evaluation of an emotional speech system is the choice of emotional corpus. Existing emotional databases can be divided into three classes: simulated (actor), elicited (induced), and spontaneous (natural) speech databases. For a more detailed description, the reader may refer to Koolagudi & Rao (2012).
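As an illustration of the feature-reduction idea mentioned above (start with a large feature set, then discard irrelevant features), the following minimal sketch ranks features by a Fisher-style score and keeps the top k. It is not the method used in this article: the scoring criterion, the function names `fisher_scores` and `select_top_k`, and the synthetic data standing in for acoustic features are all assumptions made for the example.

```python
import numpy as np


def fisher_scores(X, y):
    """Per-feature ratio of between-class to within-class variance."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for label in np.unique(y):
        Xc = X[y == label]
        # Spread of the class mean around the global mean, weighted by class size.
        between += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        # Spread of the samples inside the class, weighted by class size.
        within += len(Xc) * Xc.var(axis=0)
    return between / (within + 1e-12)  # small constant avoids division by zero


def select_top_k(X, y, k):
    """Return the sorted indices of the k highest-scoring features and the reduced matrix."""
    scores = fisher_scores(X, y)
    keep = np.sort(np.argsort(scores)[::-1][:k])
    return keep, np.asarray(X)[:, keep]


# Synthetic stand-in for an utterance-by-feature matrix (hypothetical data):
# 150 utterances, 20 features, 3 emotion classes.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 50)
X = rng.normal(size=(150, 20))
X[:, 3] += 2.0 * y   # feature 3 is made class-dependent
X[:, 7] -= 1.5 * y   # feature 7 is made class-dependent

keep, X_small = select_top_k(X, y, k=2)
print(keep, X_small.shape)
```

A filter criterion like this scores each feature independently, so it is cheap but ignores redundancy between features; wrapper or embedded selection methods are common alternatives when that matters.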