The ability to communicate in social and public environments can influence an individual’s career prospects, help build relationships, and help resolve conflict. Public speaking performance is characterized not only by the content presented but also by the presenter’s nonverbal behavior, such as gestures and facial expressions. Nonverbal communication, expressed through various types of behavior, is a key aspect of successful public speaking and interpersonal communication. However, public speaking skills can be difficult to master and require extensive training. Moreover, in practice, the evaluation of public speaking can be subjective because it tends to rely heavily on human judgment. Thus, a system for the automatic assessment of public speaking is needed for training.
Strangert and Gustafson (2008) used a political speech dataset to show that vocal variety is correlated with the human perception of a good speaker. Koppensteiner and Grammer (2010) used videos of political speakers to investigate different complex motion features and identified a correlation between gesturing and personality ratings. Scherer, Layher, Kane et al. (2012) used a large publicly available dataset to investigate the effect of audiovisual features on the perception of speaking style and the performance of politicians. They conducted a human perception experiment using eye-tracker data to evaluate human performance ratings and behavior through two separate media: audiovisual and video only. They concluded that several statistically significant features, such as pausing, voice quality measures, and motion, correlate strongly, either positively or negatively, with certain human approval ratings for speaking style. Fuyuno, Yamashita, Kawase et al. (2014) collected multimodal data of English public speaking by Japanese EFL (English as a foreign language) learners and analyzed speech-pause distributions and facial movement patterns. They found characteristic facial movement patterns in their datasets. However, facial movement was measured using a single feature point set on the speaker’s nose.
Recently, several interactive virtual audience systems for public speaking training have been proposed (Pertaub, Slater, & Barker, 2002; Batrinca, Stratou, Shapiro et al., 2013; Tudor, Poeschl, & Doering, 2013; Chollet, Stefanov, Prendinger et al., 2015). Batrinca, Stratou, Shapiro et al. (2013) developed a public speaking skill training system, Cicero, using a combination of advanced multimodal sensing and virtual human technologies. This system used three kinds of sensors: a Microsoft Kinect sensor, two webcams, and a lapel microphone. Chollet, Stefanov, Prendinger et al. (2015) developed an interactive virtual audience platform for public speaking training. Their system integrated a depth sensor, an audio sensor, a video camera, and a physiological sensor, and used these multimodal sensors to detect different types of behavior. However, these systems require several special devices, such as a head-mounted display (HMD), a Microsoft Kinect sensor, and various physiological sensors. Therefore, an efficient but technologically simple training system is needed. Takahashi, Takayashiki, and Kitahara (2016) proposed a support system for improving speaking skills during job interviews, focusing on the skills needed for this specific type of presentation. Although this system did not target public speaking, it required only a web camera and a microphone.
Chen, Leong, Feng et al. (2015) proposed an automated scoring model for evaluating public speaking using multimodal cues. In their research, data on two types of public speaking tasks, informative and impromptu presentations, were collected using a Kinect sensor. They computed Kinect features, head pose, eye gaze, facial expression, lexical features, and speech features as multimodal features. The computed values were then fed into three regression models: a support vector machine, glmnet, and random forest. Ramanarayanan, Leong, Chen et al. (2015) used a similar approach. These methods are useful for developing an automatic scoring system; however, from a teaching viewpoint, a score alone makes it difficult to give feedback on how to improve public speaking performance. A system that gives detailed feedback on how to improve would be the most helpful for the learner.