1. Introduction
Facial expressions, gestures, and verbal communication are among the modalities used to recognize gender- and age-related individual patterns. In telephonic conversation, however, speech is the sole available medium, which makes the detection task particularly difficult. The objective here is to determine a speaker's emotional conversation pattern on the basis of his or her age. Such a determination would benefit law enforcement agencies in studying criminal psychology and in further investigation. In particular, knowledge of a speaker's state of mind and emotional attributes can help establish the condition of both the victim and the culprit during court hearings and prevent confusion. Systems of this kind can also make it possible to identify intimidating calls, false alarms, and kidnappings involving influential people, fanatic religious groups, radicals, etc. (Hämäläinen et al., 2011). Further, such a recognition system can help in implementing corrective measures when negative emotional attributes are manifested in children, before it is too late. Emotion- and age-aware processing of a speaker's utterances can also benefit human–robot interfaces, telecommunications, intelligent tutoring, smart call-center applications, and similar systems.
The vocal tract and vocal folds of the human speech production mechanism keep growing until a child reaches adolescence. Selecting suitable features to represent the age of a speaker therefore remains a persistent challenge. Recognition systems trained on adult speakers often prove inefficient when applied to children's utterances (Tanner & Tanner, 2004), because the core features representing the speech and emotional content of an utterance vary with the age and gender of the speaker. In particular, the fundamental frequency (F0), formants, speech rate, energy, etc., differ drastically between a child and an adult (Lyakso et al., 2015). Acoustic models built for research or business purposes thus become ineffective when the emotional utterances belong to a different age group. Speaker age and gender have been addressed in the literature over the last decades, although those studies placed little emphasis on the emotional content of speech (Feld, Burkhardt, & Müller, 2010; Porat, Lange, & Zigel, 2010). These authors applied Gaussian weight supervectors with a support vector machine (SVM) classifier to age and gender identification, but made no precise study across age groups or their emotional states. Mel-frequency cepstral coefficients (MFCCs) combined with feature selection algorithms such as principal component analysis (PCA) and supervised PCA (SPCA) have been attempted for different age groups (Chaudhari & Kagalkar, 2015). However, the prominent prosodic features representing the speech emotion of children and adults could not be identified in these works. The absence of a clear age-based boundary among emotions has motivated the authors to undertake this novel effort.
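The MFCC-plus-PCA pipeline cited above can be illustrated with a minimal sketch. The feature matrix below is purely synthetic random data standing in for real MFCC frames (which would come from a speech front-end); the dimensions and the choice of 5 retained components are illustrative assumptions, not values from the cited work.

```python
import numpy as np

# Hypothetical MFCC feature matrix: rows = utterance frames, cols = 13
# coefficients.  Random data stands in for real MFCCs here.
rng = np.random.default_rng(0)
mfcc = rng.normal(size=(200, 13))

# PCA via SVD of the mean-centred feature matrix
centred = mfcc - mfcc.mean(axis=0)
_, s, vt = np.linalg.svd(centred, full_matrices=False)

k = 5                                  # keep the top-5 principal components
reduced = centred @ vt[:k].T           # (200, 5) compressed feature vectors
explained = float((s[:k] ** 2).sum() / (s ** 2).sum())  # variance retained
```

Supervised PCA would additionally use class labels (e.g. age group) when choosing the projection; plain PCA, as here, is purely variance-driven.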
The objective is to cluster the features representing emotional utterances of different age groups. Clustering approaches such as fuzzy c-means (FCM), hierarchical, partitioning, density-based, grid-based, model-based, and K-means clustering have been applied to recognize human emotions (Kaur & Vashish, 2013; Trabelsi, Ayed, & Ellouze, 2016), and the classification accuracies of these methods have been compared across different emotional states. FCM has been reported to provide an accuracy of 63.97% on the SROL emotional database using statistical parameters (Zbancioc & Ferarua, 2012). Of these techniques, two, K-means and FCM, are applied in this work to cluster the emotional speech utterances of different age groups. The speech emotions boredom, sadness, and anger are chosen and analyzed separately. K-means is a hard clustering algorithm: it is simple, solves known clustering problems through unsupervised learning, and is faster than hierarchical clustering while producing tighter clusters. FCM, by contrast, is well suited to recognizing patterns whose clusters overlap, i.e., when the features of a pattern are associated with more than one cluster. The scarcity of prior work on speech emotion recognition using FCM has been a further motivating factor for the authors to choose this technique.
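The contrast between the two algorithms can be sketched in a few lines. The function below is a minimal, generic FCM implementation (not the authors' code), and the two-dimensional "feature vectors" are synthetic blobs standing in for real emotional-speech features; taking the argmax of the fuzzy membership matrix recovers a K-means-style hard assignment, which is the sense in which K-means is the hard-clustering counterpart of FCM.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters=2, m=2.0, n_iter=150, seed=0):
    """Minimal FCM: returns cluster centres and the membership matrix U
    (n_samples x n_clusters, each row summing to 1).  m > 1 sets fuzziness."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)            # random normalised memberships
    for _ in range(n_iter):
        Um = U ** m                              # fuzzified memberships
        centres = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # distance of every sample to every centre
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        d = np.fmax(d, 1e-10)                    # guard against division by zero
        # standard FCM update: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))
        U = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))).sum(axis=2)
    return centres, U

# Synthetic stand-in for two age groups' feature vectors (illustration only)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, (20, 2)),
               rng.normal(10.0, 0.5, (20, 2))])
centres, U = fuzzy_c_means(X, n_clusters=2)
hard = U.argmax(axis=1)      # hard (K-means-style) labels from fuzzy memberships
```

For overlapping clusters, the rows of U carry graded memberships rather than a single label, which is precisely the property that makes FCM attractive for emotion features shared across age groups.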