Multimodal Fission

An important issue for communication processes in general, and for multimodal interaction in particular, is how output information is arranged and organized (multimodal fission). Fission involves considering information structure, intonation, and emphasis for speech output; coordinating pieces of information in space and time for visual output (video, graphics, images, and text); designing the output for each modality; and synchronizing the different output modalities. It is one of the most relevant challenges of the multimodal interaction design process. This challenge is becoming increasingly important as a wide range of interaction devices, from laptops to mobile phones and smartphones, is used in different contexts. This chapter presents some basic concepts involved in the design of fission processes and describes some of the most relevant approaches discussed in the literature.
Designing how to combine outputs from different modal channels for different pieces of information is a critical issue in multimodal interaction systems. This is the fission process: it takes the pieces of information and determines how to structure and present them. Foster (2002) defines fission as “the process of realising an abstract message through output on some combination of the available channels”.

This process can be conceived (Foster, 2002) as consisting of three main steps: (1) content selection and structuring, (2) modality selection, and (3) output coordination. The first step selects and organizes the content to be included in the presentation; the second step specifies the modalities that can be associated with the different contents produced by the first step; and the third step coordinates the outputs on each channel in order to form a coherent presentation.
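The three steps can be sketched as a small pipeline. The following Python fragment is only an illustrative sketch, not Foster's implementation: the names `ContentItem`, `OutputEvent`, `select_content`, `select_modalities`, `coordinate_outputs`, and `fission` are hypothetical, and the rules used (focus items first, focus on the first available channel, a fixed duration per item) are simplifying assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ContentItem:
    text: str
    is_focus: bool  # True for new information, False for background


@dataclass
class OutputEvent:
    channel: str
    payload: str
    start: float  # seconds from the start of the presentation


def select_content(facts, new_facts):
    """Step 1: keep non-empty facts and place focus items first."""
    items = [ContentItem(f, f in new_facts) for f in facts if f]
    # sorted() is stable, so the original order is kept within each group.
    return sorted(items, key=lambda item: not item.is_focus)


def select_modalities(items, available):
    """Step 2: pair each item with one available channel (toy rule)."""
    focus_channel = available[0]        # assumption: first channel carries the focus
    background_channel = available[-1]  # assumption: last channel carries the background
    return [
        (item, focus_channel if item.is_focus else background_channel)
        for item in items
    ]


def coordinate_outputs(assignments, seconds_per_item=2.0):
    """Step 3: schedule outputs so each channel plays its items in sequence."""
    next_free = {}  # channel -> time at which the channel becomes free
    events = []
    for item, channel in assignments:
        start = next_free.get(channel, 0.0)
        events.append(OutputEvent(channel, item.text, start))
        next_free[channel] = start + seconds_per_item
    return events


def fission(facts, new_facts, available):
    """Run the three steps in order on an abstract message."""
    items = select_content(facts, new_facts)
    assignments = select_modalities(items, available)
    return coordinate_outputs(assignments)


events = fission(
    facts=["You are on Main Street", "Turn left in 100 m"],
    new_facts={"Turn left in 100 m"},
    available=["speech", "graphics"],
)
for event in events:
    print(event)
```

In this toy run the new instruction is assigned to speech and the given location to graphics, and both start at time 0 because they occupy different channels. A real system would replace each rule with a model of the content, the devices, and the user.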

The fission process and, more generally, information presentation activities are closely connected with information structure, independently of the modalities involved. The notion of information structure was introduced by Halliday (1967) and was initially used to divide a sentence into parts such as focus, background, and topic. Focus identifies “information that is new or at least expressed in a new way” (Steedman, 2000). Background expresses old or given information. Lambrecht (1994) defines information structure as “a component of GRAMMAR, more specifically of SENTENCE GRAMMAR”.

Starting from natural language, the notion of information structure has been extensively applied to different interaction approaches and to visual information as well, independently of the channel used; in natural language it can refer to both written text and speech. Syntactic structure, word order, intonation and prosody in speech, and layout in visual communication all contribute to identifying the information structure. The concepts of focus and background (or similar ones, specified below) have been introduced by considering the informativeness of the phrases that compose a sentence or of the visual elements that compose an image.

Indeed, structuring visual information (images, graphics, video, text) requires spatial and temporal coordination of the different pieces of information. The spatial level usually organizes the layout of the information, while the temporal level (considered only for movies) organizes the pieces of information over time.

The notions of focus and background can be extended to the information structure of a multimodal utterance. Indeed, when two or more modalities are used jointly, some of them usually provide the new information while others provide its context. The modality that usually expresses the focus is the prevalent modality, i.e. the modality that can most meaningfully express the information content. The prevalent modality should be chosen according to the users and the context. For example, it is not a good idea to choose a prevalent output modality that uses the visual channel for systems used by visually impaired people, or speech when the environment is noisy.
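The choice of a prevalent modality according to user and context can be illustrated with a minimal rule-based sketch. It encodes only the two examples given above (visual output is unsuitable for visually impaired users, speech is unsuitable in a noisy environment); the names `InteractionContext` and `choose_prevalent_modality`, and the default preference order, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class InteractionContext:
    visually_impaired: bool = False   # property of the user
    noisy_environment: bool = False   # property of the surroundings


def choose_prevalent_modality(context, preferred=("visual", "speech")):
    """Return the first preferred modality that suits the context, or None."""
    unsuitable = set()
    if context.visually_impaired:
        unsuitable.add("visual")
    if context.noisy_environment:
        unsuitable.add("speech")
    for modality in preferred:
        if modality not in unsuitable:
            return modality
    return None  # no suitable modality among the preferred ones


print(choose_prevalent_modality(InteractionContext(visually_impaired=True)))
print(choose_prevalent_modality(InteractionContext(noisy_environment=True)))
```

Under these assumptions the first call selects speech and the second selects the visual modality. When both conditions hold, the function returns `None`, which signals that another output channel would have to be considered.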

The literature proposes many definitions of context and, in particular, of interaction context. Dey et al. (2001) define the interaction context as “any information that can be used to characterize the situation of an entity. An entity is a person or object that is considered relevant to the interaction between a user and an application, including the user and application themselves”.

