Detecting emotion from voice is a complex task, whereas detecting it from facial expressions is comparatively easy. In this chapter, an attempt is made to detect emotion by machine using neural network-based models, and the results are compared. As no complete database is available for different age groups, a small database has been generated: to study the emotions of different age groups substantially, three groups of 20 subjects each were formed. Efficient prosodic features are considered first, and then combinations of those features are taken. Each feature set is fed to the models for classification and detection. Three emotions, angry, happy, and sad, are verified for the different groups of subjects. The classifier is found to provide 96% accuracy. Earlier work verified cluster-based techniques with the simple features pitch, speech rate, and log energy; as an extension, this work verifies combinations of features together with machine learning models.
Introduction
Speech is a linguistic act that conveys information. Human thoughts and ideas carry both implicit and explicit information during conversations. Intentions, emotions, and cognitive states are also conveyed, often significantly, during interaction. Considerable information about a speaker's mood, mental state, environment, and demographics can be extracted and fed as input for behavioral studies. This information can be used to understand people's needs, preferences, desires, and intentions when planning human resources. The emotions most commonly used in the literature are anger, disgust, happiness, sadness, fear, boredom, and neutral (Loizou, 2007) (Zeng, Raisman, and Huang, 2009) (Ramakrishnan, 2012) (Ververidis and Kotropoulos, 2006) (Schuller et al., 2007). Three primary factors tend to drive this field: emotional speech databases, feature extraction, and classification methods; all three are discussed in this chapter.
The collection of databases of unambiguous emotional speech utterances plays a major role in authenticating research directions in this field. A database selected for speech emotion detection must take into account the type of speaker, the language used, gender, age, real or acted utterances, balanced or unbalanced voices, and so on. The reliability of emotional databases remains a challenge in terms of detection authenticity. However, the generation and use of standard databases have reduced the complexity of research by clarifying the taxonomy of speech emotions. A review of emotional speech databases, with information on the number of subjects (male, female), age of speakers, duration of utterances, categories of emotions, language, purpose of collection, size, type of utterances involved (natural, acted, elicited, and simulated), and their availability, has been presented in (Palo et al., 2015). However, most of the emotional speech databases reviewed by these authors are not open to public access. The database used here is in a regional Indian language (Oriya) and has been recorded under suitable circumstances.
Emotional speech features can be broadly segregated into acoustic (prosodic and spectral), linguistic, contextual, nonlinear, statistical, and hybrid features (Wu, Falk, and Chan, 2011) (Lee, Narayanan, and Pieraccini, 2001) (Chandrasekar, Chapaneri, and Jayaswal, 2014). The literature reports efficient characterization of emotional speech samples by removing redundant features using feature reduction and selection techniques (Vogt and Andre, 2006) (Seehapoch and Wongthanavasu, 2013). Different feature reduction and selection techniques have been attempted, such as Principal Component Analysis (PCA), Greedy Feature Selection (GFS), Sequential Floating Forward Selection (SFFS), Sequential Floating Backward Selection (SFBS), Forward Selection (FS) combined with PCA, the elastic net, and the fast correlation-based filter. These feature reduction and selection algorithms have their own limitations. For example, to recognize different emotional classes, PCA works on a transformed set of the original features rather than selecting subsets of them; it is therefore better suited to unsupervised learning techniques.
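The PCA-based reduction described above can be illustrated with a minimal sketch. The feature matrix below is synthetic with an assumed low-rank structure; in practice each row would hold prosodic or spectral features (e.g. pitch, speech rate, log energy) extracted from one utterance, and the sizes are hypothetical.

```python
# Minimal sketch of PCA feature reduction for emotional speech features.
# Assumption: a synthetic low-rank feature matrix stands in for real
# prosodic/spectral features extracted per utterance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_utterances, n_features = 60, 12          # hypothetical dataset sizes

# Build correlated features: a few latent factors mixed into many features,
# mimicking the redundancy that motivates feature reduction.
latent = rng.standard_normal((n_utterances, 3))
mixing = rng.standard_normal((3, n_features))
X = latent @ mixing + 0.1 * rng.standard_normal((n_utterances, n_features))

# Keep just enough principal components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                      # far fewer columns than n_features
print(pca.explained_variance_ratio_.sum())  # at least 0.95
```

Note that the reduced columns are linear combinations of all original features, not a selected subset, which is exactly the PCA limitation mentioned above.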