Uncanny Speech

Angela Tinwell (University of Bolton, UK), Mark Grimshaw (University of Bolton, UK) and Andrew Williams (University of Bolton, UK)
DOI: 10.4018/978-1-61692-828-5.ch011
With increasing sophistication of realism for human-like characters within computer games, this chapter investigates player perception of audio-visual speech for virtual characters in relation to the Uncanny Valley. Building on the findings from both empirical studies and a literature survey, a conceptual framework for the uncanny and speech is put forward which includes qualities of speech sound, lip-sync, human-likeness of voice, and facial expression. A cross-modal mismatch for the fidelity of speech with image can increase uncanniness, and game developers should give as much attention to speech sound qualities as to aesthetic visual qualities to control how uncanny a character is perceived to be.
As technological advancements allow for the representation of high fidelity, realistic, human-like characters within computer games, aspects of a character’s appearance and behaviour are being associated with the Uncanny Valley phenomenon. (A definition of the Uncanny Valley is provided in the first section of this chapter.) It seems that one of the main factors contributing to a character being regarded as lifeless as opposed to lifelike is the character’s speech. In 2006, Quantic Dream revealed a tech demo (The Casting) for the computer game Heavy Rain (2006), in which the main character, Mary Smith, evoked a somewhat negative response from the audience (Gouskos, 2006). Mary Smith’s speech was criticised as uncanny in that it sounded strange and out of context with the facial expression and emotion portrayed by this character. A closer inspection of the video showed that not only were there errors in the sound recording (disparities between the acoustics and the volume and materials of the room, with excessive plosives contradicting the distant camera and microphone), but a lack of correct pitch and intonation for speech and a lack of synchronization of speech with lip movement also reduced the overall believability of this character (Tinwell & Grimshaw, 2010). A mismatch between the emotion conveyed by Mary Smith’s voice and her gestures and posture exacerbated how unnatural and odd the character was perceived to be. MacDorman (quoted in Gouskos, 2006) observed that a perceived asynchrony of lip movement with speech was one of the factors that people found disturbing about Mary Smith:

In addition, there is sometimes a lack of synchronization with her speech and lip movements, which is very disturbing to people. People 'hear' with their eyes as well as their ears. By this, I mean that if you play an identical sound while looking at a person's lips, the lip movements can cause you to hear the sound differently.

Since Mary Smith was revealed in 2006, increasing technological sophistication for computer games has allowed for heightened realism of human-like characters. Cinematic animation is achieved not only for cut scenes and trailers containing full motion video (FMV) but also for animation during in-game play. One example is Faceposer, the phoneme extractor and facial expression tool designed by Valve for titles such as Left 4 Dead (2008) and Half Life 2 (2008). However, it would seem that speech, as a factor integral to the uncanny phenomenon, is often overlooked when compared to the aesthetic visual qualities of behaviour of a human-like character. So far there have been limited studies to ascertain which factors contribute to the uncanny for virtual characters. In response to the hearsay in mass media raised by characters such as Mary Smith, Tinwell and Grimshaw (2010) conducted a study to investigate how the cross-modality of image and sound might exaggerate the uncanny. This study is referred to throughout this chapter as the Uncanny Modality (UM) study, and results reported here are drawn from it unless attributed to another study. Prior to this, much of the work on the uncanny had been visually based, excluding sound as a factor. As a step towards building a conceptual framework for the uncanny and virtual characters in immersive 3D environments, this chapter defines how characteristics of a character’s speech may exaggerate the uncanny by considering aspects such as synchronization of audio and video streams, articulation, and qualities of speech.

The first section provides an exposition of the Uncanny Valley describing how the theory came about, previous investigation into the theory and potential limitations of the theory in relation to virtual characters.

