Much has changed in computer interfacing since the early days of computing, or has it? Admittedly, gone are the days of punched cards and paper tape readers as input devices; likewise, monitors (displays) have superseded printers as the primary output device. Nevertheless, the QWERTY keyboard shows little sign of falling into disuse. It is essentially the same input device as that used on the earliest (electromechanical) TeleTYpewriters, in which, by popular account, the “worst” key layout was deliberately chosen to slow down fast typists. The three major advances since the 1950s have been (1) the rise of low-cost (commodity off-the-shelf) CRT monitors in the 1960s (and, more recently, LCD ones), (2) the replacement of (text-based) command line interfaces with graphical user interfaces in the 1980s, and (3) the rise of the Internet/World Wide Web during the 1990s. In recent times, while speech recognition (and synthesis) has made some inroads (e.g., McTeal, 2002; O’Shaughnessy, 2003), the QWERTY keyboard and mouse remain the dominant input modalities.
Over the years the term “audio-visual” (AV) has segued into the more modern one of “multimedia” (MM), reflecting not only the incorporation of various I/O modalities, but also the implication of interactivity between user and system. Nevertheless, the older term still reflects the main focus of MM systems today, namely sight and sound, the two primary modalities. Barfield (2004) views sound as being the “forgotten child” of MM, characterizing it as active, non-localized, transient, and dynamic, in contrast to graphics, which he characterizes as passive, localized, permanent, and static. Speech is inherently temporal in nature, whereas vision is spatial; hence synchronization is a fundamental consideration in multimodal interfaces, so that users do not suffer cognitive overload.
Ordinarily when we speak of multimodal interfaces, we mean the concurrent arrival of user input via more than one modality (sense). In some situations, however, sequential operation is more suitable; that is, the user switches modalities whenever doing so makes the input to the system clearer.
AV speech is inherently bi-modal in nature, which means that visual cues (such as eye/lip movement, facial expression, and so on) play an important role in automatic speech recognition (ASR) (Potamianos, Neti, Gravier, Garg, & Senior, 2003). The key issues in this context are (a) which features to extract from lip movements, and (b) how to fuse (and synchronize) audio and visual cues.
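Issue (b) can be made concrete with a small sketch of feature-level ("early") fusion. This is only an illustration under stated assumptions, not the method of any system cited here: the function names `interpolate` and `fuse_early` are hypothetical, the audio and visual feature streams are assumed to be time-stamped in seconds with strictly increasing timestamps, and the visual stream is assumed to be sampled more coarsely than the audio stream. Visual features are linearly interpolated to each audio timestamp (the synchronization step) and then concatenated with the audio features (the fusion step).

```python
from bisect import bisect_right


def interpolate(times, values, t):
    """Linearly interpolate a feature vector at time t (clamped at both ends).

    Assumes `times` is strictly increasing and parallel to `values`.
    """
    if t <= times[0]:
        return list(values[0])
    if t >= times[-1]:
        return list(values[-1])
    hi = bisect_right(times, t)  # first sample strictly later than t
    lo = hi - 1                  # last sample at or before t
    w = (t - times[lo]) / (times[hi] - times[lo])
    return [(1 - w) * a + w * b for a, b in zip(values[lo], values[hi])]


def fuse_early(audio_times, audio_feats, video_times, video_feats):
    """Return one fused vector per audio frame: audio features followed by
    the visual features interpolated to that frame's timestamp."""
    return [
        list(a) + interpolate(video_times, video_feats, t)
        for t, a in zip(audio_times, audio_feats)
    ]


if __name__ == "__main__":
    # Hypothetical data: three audio frames, two (coarser) video frames.
    fused = fuse_early(
        [0.0, 0.02, 0.04], [[1.0], [2.0], [3.0]],
        [0.0, 0.04], [[0.0, 10.0], [4.0, 20.0]],
    )
    for row in fused:
        print(row)
```

Concatenation is only one possible design choice; an alternative is to recognize each stream separately and combine the decisions afterwards, which avoids resampling but postpones the interaction between the two cues.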
Another example of a bi-modal user interface is the speech-gesture interface developed by Sharma et al. (2003) for crisis management in emergency situations. Taylor, Neel, and Bouwhuis (2000) likewise discuss combining voice and gesture, whereas Smith and Kanade (2004) focus specifically on extracting information from video. Many other multimodal interfaces are described in the literature (e.g., Booher, 2003; Bunt & Bevin, 2001; Jacko & Sears, 2003; McTeal, 2002; Yuen, Tang, & Wang, 2002).
Figure 1 shows a system that incorporates three input modalities—(1) sight (hand gesture + scanned text and/or images), (2) sound (voice input via microphone), and (3) touch (via mouse + keyboard), together with the two most commonly used (i.e., AV) output modalities (sight and sound).
Common interface modalities
We have just seen how data (information) fusion needs to be considered in bi- (multi)modal user interfaces. Fusion is especially important in lip synching, which combines the two dominant modalities (sight and sound). Lip synching is a common post-production activity in film production, and it also needs to be taken into account when designing multi-user role-playing games for the Internet.
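As a rough illustration of what checking lip synchronization involves, the following sketch estimates the offset between an audio track and a video track by cross-correlating two per-frame signals. Everything here is an assumption made for the example: the function name `estimate_av_lag` is hypothetical, both signals are assumed to be already sampled at the same frame rate, the audio signal is taken to be a loudness envelope, and the video signal is taken to be some measure of mouth opening. It is not a description of how production tools or the cited systems work.

```python
def estimate_av_lag(audio_env, mouth_open, max_lag):
    """Return the lag (in frames) at which the audio envelope best matches
    the mouth-opening signal. A positive result means the audio trails
    the video by that many frames; a negative result means it leads."""
    n = min(len(audio_env), len(mouth_open))
    # Remove the mean so that constant offsets do not dominate the score.
    a_mean = sum(audio_env[:n]) / n
    m_mean = sum(mouth_open[:n]) / n
    a = [x - a_mean for x in audio_env[:n]]
    m = [x - m_mean for x in mouth_open[:n]]

    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Pair video frame i with audio frame i + lag where both exist.
        pairs = [(a[i + lag], m[i]) for i in range(n) if 0 <= i + lag < n]
        if not pairs:
            continue
        # Average the products so shorter overlaps are not penalized.
        score = sum(x * y for x, y in pairs) / len(pairs)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


if __name__ == "__main__":
    # Hypothetical signals: the audio is the mouth signal delayed by 2 frames.
    mouth = [0, 0, 1, 3, 1, 0, 0, 0, 0, 0]
    audio = [0, 0, 0, 0, 1, 3, 1, 0, 0, 0]
    print(estimate_av_lag(audio, mouth, max_lag=4))
```

Once the lag is known, one stream can be shifted by that many frames before the two are presented together.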
Key Terms in this Chapter
Tactile Interface: User interface that measures the pressure or force exerted by a user upon an object. Tactile devices receive and react to forces applied by the user.
Laparoscopy: A surgical technique in which several small incisions are made. A camera is inserted through one of these incisions and surgical instruments through the others. The surgery is guided by the camera, allowing for smaller instruments and surgical incisions to be used. Laparoscopic surgeries are generally less traumatic than traditional open surgeries.
Trauma-Reach: A VR program designed to train physicians to manage trauma patients.
SimMentor: A VR program which trains surgeons in laparoscopic techniques.
Virtual Reality: An interactive artificially created environment primarily involving the senses of vision, hearing, and touch but which may include all five senses. The artificial computer-generated environment may be manipulated and feedback given, allowing numerous scenarios to be enacted.
AccuTouch: A VR simulator designed to teach colonoscopy.
Haptic Feedback Device: A system with which the user interfaces via the sense of touch, through the application of forces, vibrations, or motions. These devices allow people to experience tactile stimuli in VR settings. Haptic devices apply forces to the user.
Augmented Reality: A computer-generated program that combines elements from the real world with elements of computer-generated scenarios.