Over the years the term “audio-visual” (AV) has given way to the more modern “multimedia” (MM), reflecting not only the incorporation of various I/O modalities but also the implication of interactivity between user and system. Nevertheless, the older term still reflects the primary focus of MM systems today: sight and sound, the two primary modalities. Barfield (2004) views sound as the “forgotten child” of MM, characterizing it as active, non-localized, transient, and dynamic, in contrast to graphics, which he characterizes as passive, localized, permanent, and static. Speech is inherently temporal in nature, whereas vision is spatial; synchronization is therefore a fundamental consideration in multimodal interfaces, so that users do not suffer cognitive overload.
Ordinarily, when we speak of multimodal interfaces, we mean the concurrent arrival of user input via more than one modality (sense). In some situations, however, sequential operation is more appropriate; in other words, the user switches from one modality to another as needed to make the input to the system clearer.
AV speech is inherently bi-modal in nature, which means that visual cues (such as eye/lip movement, facial expression, and so on) play an important role in automatic speech recognition, or ASR (Potamianos, Neti, Gravier, Garg, & Senior, 2003). The key issues in this context are (a) which features to extract from lip movements, and (b) how to fuse (and synchronize) audio and visual cues.
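Issue (b) can be illustrated with a minimal sketch of feature-level ("early") fusion, in which each audio feature frame is paired with a visual feature vector interpolated to the same instant. This is only one possible approach, not the method of the cited authors. The function names, the frame rates (100 audio frames and 25 video frames per second), and the feature values are assumptions made purely for illustration.

```python
def interpolate_visual(visual, pos):
    """Linearly interpolate a visual feature vector at fractional frame index pos."""
    i = int(pos)
    if i >= len(visual) - 1:
        # At or past the last video frame: hold the last vector.
        return list(visual[-1])
    frac = pos - i
    return [(1 - frac) * a + frac * b for a, b in zip(visual[i], visual[i + 1])]


def fuse_early(audio, visual, audio_rate=100.0, visual_rate=25.0):
    """Concatenate each audio frame with the time-aligned visual features.

    audio  : list of audio feature vectors, one per audio frame
    visual : list of visual (e.g., lip-shape) feature vectors, one per video frame
    The default rates are assumed values, not taken from the text.
    """
    if not audio or not visual:
        raise ValueError("both feature streams must be non-empty")
    fused = []
    for n, a in enumerate(audio):
        # Position of audio frame n on the video frame axis.
        pos = n * visual_rate / audio_rate
        fused.append(list(a) + interpolate_visual(visual, pos))
    return fused


# Hypothetical data: five audio frames (1-D features), two video frames (2-D features).
audio_feats = [[0.0], [1.0], [2.0], [3.0], [4.0]]
visual_feats = [[0.0, 10.0], [4.0, 14.0]]
fused_feats = fuse_early(audio_feats, visual_feats)
print(fused_feats[2])
```

The fused vectors could then be passed to a single recognizer. The alternative, decision-level ("late") fusion, would instead run separate audio and visual recognizers and combine their outputs.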
Another example of a bi-modal user interface is the speech-gesture interface developed by Sharma et al. (2003) for crisis management in emergency situations. Taylor, Neel, and Bouwhuis (2000) likewise discuss combining voice and gesture, whereas Smith and Kanade (2004) focus specifically on extracting information from video. Many other multimodal interfaces are described in the literature (e.g., Booher, 2003; Bunt & Bevin, 2001; Jacko & Sears, 2003; McTeal, 2002; Yuen, Tang, & Wang, 2002).
Figure 1 shows a system that incorporates three input modalities—(1) sight (hand gesture + scanned text and/or images), (2) sound (voice input via microphone), and (3) touch (via mouse + keyboard), together with the two most commonly used (i.e., AV) output modalities (sight and sound).
Figure 1. Common interface modalities
As noted above, data (information) fusion needs to be considered in bi-modal and multimodal user interfaces. Fusion is especially important in lip synching, which combines the two dominant modalities (sight and sound). Lip synching is a common post-production activity in film production, as well as a factor that needs to be taken into account in designing multi-user role-playing games for the Internet.
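One simple way to check lip synchronization is to estimate the time offset between the sound track and the visible mouth movement. The sketch below searches for the frame shift that maximizes the cross-correlation between an audio energy envelope and a per-frame mouth-opening measurement. It assumes both signals have already been extracted and resampled to the same frame rate. The function name, the signals, and the search range are hypothetical, and production tools may work quite differently.

```python
def estimate_lag(audio_env, mouth_open, max_lag):
    """Return the shift (in frames) of mouth_open relative to audio_env
    that gives the highest cross-correlation.

    A positive result means the mouth movement trails the sound.
    Assumes both sequences are sampled at the same frame rate.
    """
    def centered(xs):
        mean = sum(xs) / len(xs)
        return [x - mean for x in xs]

    a = centered(audio_env)
    v = centered(mouth_open)
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Sum products over the frames where the shifted signals overlap.
        score = 0.0
        for i, ai in enumerate(a):
            j = i + lag
            if 0 <= j < len(v):
                score += ai * v[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Hypothetical signals: a sound burst at frame 2, the mouth opening at frame 4.
audio_envelope = [0, 0, 1, 0, 0, 0, 0, 0]
mouth_opening = [0, 0, 0, 0, 1, 0, 0, 0]
lag_frames = estimate_lag(audio_envelope, mouth_opening, max_lag=3)
print(lag_frames)
```

Shifting the video earlier by the estimated number of frames (or delaying the audio by the same amount) would bring the two streams back into alignment. The score is not normalized by the length of the overlap, which is adequate for a sketch but would bias the estimate for short signals and large search ranges.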