This chapter describes multimodality as a means of augmenting information retrieval activities in multimedia digital libraries. Multimodal interaction systems combine visual information with voice, gestures, and other modalities to provide flexible and powerful dialogue approaches. Integrating multiple input modes enables users to benefit from the natural approach used in human communication, improving the usability of the systems. However, natural interaction approaches may introduce interpretation problems, and because a system’s usability depends directly on users’ satisfaction, such problems must be addressed. To improve multimedia digital library usability, users can express their queries by means of a multimodal sentence. The authors propose a new approach that matches a multimodal sentence against templates stored in a knowledge base in order to interpret the sentence, and they define a measure of similarity between multimodal templates.
Background: A Brief State-Of-The-Art Of Sentence Similarity Issues
There is an extensive literature on measuring the similarity between documents or long texts (e.g., Allen, 1995; Meadow, Boyce, & Kraft, 2000), but only a few works address the measurement of similarity between very short texts (Foltz, Kintsch, & Landauer, 1998) or sentences (Li, McLean, Bandar, O’Shea, & Crockett, 2006).
According to Li et al. (2006), sentence similarity methods can be classified into three categories: methods based on the analysis of the degree of words co-occurrence, corpora analysis methods, and descriptive features-based methods.
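The first category above, word co-occurrence analysis, can be illustrated with a minimal sketch (not taken from the chapter or from Li et al.): two short sentences are compared by the overlap of the words they share, here via the Jaccard coefficient. The sample sentences are invented for illustration.

```python
# Illustrative sketch: word co-occurrence similarity between two short
# sentences, measured as the Jaccard overlap of their word sets.

def jaccard_similarity(s1: str, s2: str) -> float:
    """Ratio of shared words to all distinct words in the pair."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    if not (w1 | w2):
        return 0.0
    return len(w1 & w2) / len(w1 | w2)

print(jaccard_similarity("show me images of dogs",
                         "show me photos of dogs"))  # 4 shared words out of 6 distinct
```

Such overlap measures work reasonably for longer texts but degrade on sentences, where two semantically close queries may share almost no surface words; this is precisely the gap the corpora-based and features-based methods aim to fill.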
Key Terms in this Chapter
Modality: The term is used to describe a distinct method of operation within a computer system, in which the same user input can produce different results depending on the state of the computer. It also defines the mode of communication according to human senses or type of computer input devices. In terms of human senses, the categories are sight, touch, hearing, smell, and taste. In terms of computer input devices, we have modalities that are equivalent to human senses: cameras (sight), haptic sensors (touch), microphones (hearing), olfactory sensors (smell), and even taste sensors. In addition, however, there are input devices that do not map directly to human senses: keyboard, mouse, writing tablet, motion input (e.g., the device itself is moved for interaction), and many others.
Usability: The term identifies the quality of a system that makes it easy to learn, easy to use, and encourages the user to regard the system as a positive help in getting the job done. Usability is defined by five quality components: Learnability, which defines how easy it is for users to accomplish basic tasks the first time they encounter the design; Efficiency, which defines how quickly users can perform tasks; Memorability, which defines how easily users can reestablish proficiency when they return to the design after a period of not using it; Errors, which defines how many errors users make, how severe these errors are, and how easily users can recover from them; and Satisfaction, which defines how pleasant the system is to use.
Knowledge Base: A knowledge base is a special kind of database for knowledge management. It is the base for the collection of knowledge. Normally, the knowledge base consists of the explicit knowledge of an organization, including troubleshooting guides, articles, white papers, user manuals, and other documents. A knowledge base should have a carefully designed classification structure, content format, and search engine.
Semantic Lexicon: A lexicon is a vocabulary, containing an alphabetical arrangement of the words in a language or of a considerable number of them, with the definition of each. A semantic lexicon is a dictionary of words labelled with semantic classes so associations can be drawn between words that have not previously been encountered.
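The idea of a semantic lexicon can be sketched with a toy example (the entries and class labels below are invented for illustration): each word carries a semantic class, so two words never seen together can still be associated through their shared class.

```python
# Toy semantic lexicon: words labelled with semantic classes.
# Entries and class names are hypothetical, for illustration only.
SEMANTIC_LEXICON = {
    "dog": "animal",
    "cat": "animal",
    "oak": "plant",
    "rose": "plant",
}

def same_class(w1: str, w2: str) -> bool:
    """Two words are associated if the lexicon assigns them one class."""
    c1, c2 = SEMANTIC_LEXICON.get(w1), SEMANTIC_LEXICON.get(w2)
    return c1 is not None and c1 == c2

print(same_class("dog", "cat"))  # True: both labelled "animal"
print(same_class("dog", "oak"))  # False: different classes
```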
NLP: The term is the acronym of natural language processing (NLP). NLP is a range of computational techniques for analyzing and representing naturally occurring text (free text) at one or more levels of linguistic analysis (e.g., morphological, syntactic, semantic, or pragmatic) for the purpose of achieving human-like language processing for knowledge-intensive applications.
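One of the levels listed above, morphological analysis, can be sketched in a few lines (a naive suffix stripper invented for illustration, not a real stemming algorithm such as Porter’s):

```python
# Illustrative sketch of morphological analysis: strip a common
# English inflectional suffix to approximate the word stem.
SUFFIXES = ("ing", "ed", "s")

def strip_suffix(word: str) -> str:
    """Remove the first matching suffix, keeping a minimal stem length."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(strip_suffix("walking"))  # walk
print(strip_suffix("dogs"))     # dog
```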
WordNet: WordNet is a semantic lexicon for the English language. It groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synonym sets. The purpose is twofold: to produce a combination of dictionary and thesaurus that is more intuitively usable, and to support automatic text analysis and artificial intelligence applications. WordNet was developed by the Cognitive Science Laboratory (http://www.cogsci.princeton.edu/) at Princeton University under the direction of Professor George A. Miller (Principal Investigator). WordNet is considered to be the most important resource available to researchers in computational linguistics, text analysis, and many related areas. Its design is inspired by current psycholinguistic and computational theories of human lexical memory.
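The synset structure can be sketched self-containedly (the entries below are illustrative, not actual WordNet data, and the lookup function is hypothetical):

```python
# Minimal sketch of WordNet-style synsets: synonym sets with a gloss.
# Entries are invented for illustration, not real WordNet records.
SYNSETS = [
    {"lemmas": {"car", "auto", "automobile"},
     "gloss": "a motor vehicle with four wheels"},
    {"lemmas": {"dog", "domestic dog"},
     "gloss": "a domesticated carnivorous mammal"},
]

def synonyms(word: str) -> set:
    """All lemmas that share a synset with the given word."""
    result = set()
    for synset in SYNSETS:
        if word in synset["lemmas"]:
            result |= synset["lemmas"] - {word}
    return result

print(sorted(synonyms("car")))  # ['auto', 'automobile']
```

In practice one would query the real WordNet database (e.g., via the NLTK interface) rather than a hand-built table; the sketch only shows the data model.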
User: A person, organization, or other entity that employs the services provided by an information processing system for transfer of information. A user functions as a source or final destination of user information.
Semantic Similarity: Semantic similarity, variously also called “semantic closeness/proximity/nearness,” is a concept whereby a set of documents or terms within term lists are assigned a metric based on the likeness of their meaning/semantic content. We define two entities to be similar if: (i) both belong to the same class, (ii) both belong to classes that have a common parent class, or (iii) one entity belongs to a class that is a parent of the class to which the other entity belongs. Furthermore, two relationships are similar if (i) both belong to the same class, (ii) both belong to classes that have a common parent class, or (iii) one relationship belongs to a class that is a parent of the class to which the other relationship belongs.
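The three rules for entities can be sketched directly, assuming a hypothetical class hierarchy given as entity-to-class and child-to-parent mappings (all names below are invented for illustration):

```python
# Sketch of the three class-based similarity rules, over a toy hierarchy.
CLASS_OF = {"fido": "dog", "rex": "dog", "whiskers": "cat"}  # entity -> class
PARENT_OF = {"dog": "mammal", "cat": "mammal"}               # class -> parent

def similar(e1: str, e2: str) -> bool:
    """Apply rules (i)-(iii); unknown names are treated as class names."""
    c1, c2 = CLASS_OF.get(e1, e1), CLASS_OF.get(e2, e2)
    if c1 == c2:                                      # (i) same class
        return True
    if PARENT_OF.get(c1) == c2 or PARENT_OF.get(c2) == c1:
        return True                                   # (iii) parent/child classes
    p1, p2 = PARENT_OF.get(c1), PARENT_OF.get(c2)
    return p1 is not None and p1 == p2                # (ii) common parent class

print(similar("fido", "rex"))       # True by rule (i)
print(similar("fido", "whiskers"))  # True by rule (ii)
```

The same three rules apply unchanged to relationships, with relationship classes substituted for entity classes.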
Multimodality: By definition, “multimodal” should refer to using more than one modality, regardless of the nature of the modalities. However, many researchers use the term “multimodal” to refer specifically to modalities that are commonly used in communication between people, such as speech, gestures, handwriting, and gaze. Multimodality seamlessly combines graphics, text, and audio output with speech, text, and touch input to deliver a dramatically enhanced end-user experience. Compared to a single-mode interface, in which the user can use only either voice/audio or visual modes, multimodal applications give users multiple options for inputting and receiving information.