Multimodal Information Fusion for Semantic Video Analysis

Elvan Gulen (Department of Computer Engineering, Middle East Technical University, Ankara, Turkey), Turgay Yilmaz (Department of Computer Engineering, Middle East Technical University, Ankara, Turkey and Institute of Industrial Science, University of Tokyo, Tokyo, Japan) and Adnan Yazici (Department of Computer Engineering, Middle East Technical University, Ankara, Turkey)
DOI: 10.4018/jmdem.2012100103

Multimedia data by its very nature contains multimodal information. For a successful analysis of multimedia content, all available multimodal information should be utilized. Additionally, since concepts can contain valuable cues about other concepts, concept interaction is a crucial source of multimedia information and helps to increase fusion performance. The aim of this study is to show that integrating the existing modalities along with concept interactions can yield better performance in detecting semantic concepts. Therefore, in this paper, the authors present a multimodal fusion approach that integrates semantic information obtained from various modalities along with additional semantic cues. The experiments conducted on the TRECVID 2007 and CCV Database datasets validate the superiority of this combination over the best single modality and over alternative modality combinations. The results show that the proposed fusion approach provides a 16.7% relative performance gain on the TRECVID dataset and a 47.7% relative performance improvement on the CCV database over the results of the best unimodal approaches.

In recent years, developments in multimedia technologies have led to rapid growth in the usage of multimedia data, especially videos. This growing use of digital video data has exposed the need for content-based video retrieval systems. Content-based retrieval of videos requires extracting the semantic content of videos. Nevertheless, the ‘semantic gap’ between the low-level features of multimedia data and the high-level semantic information is still a challenging problem. Thus, semantic content extraction remains a compelling research topic (Datta, Li, & Wang, 2005).

Videos, by their very nature, comprise different types of data such as text, audio, and images. Correspondingly, the semantic information to be extracted is directly connected to these separate sources. Therefore, in order to provide an effective semantic content extraction solution, this nature of multimedia data should be analyzed carefully and the information it contains should be used thoroughly. Video data is unstructured and introduces several complexities such as lighting variations, camera motion, occlusion, viewpoint changes, and noise in the sensed data. However, video data has another important characteristic that can help to overcome most of these challenges: its multimodal content. Integrating the information obtained from multiple modalities is an empirically validated approach to increasing retrieval accuracy (Atrey, Hossain, El-Saddik, & Kankanhalli, 2010). Besides, integration minimizes the dependence on any single modality, which yields a more robust system. Consider a people-marching event as an example, where the event can be recognized using any of the visual, audio, and textual modalities. The video can include people as visual objects, a shouting sound, and some lyrics of a march in the closed caption text. A combination of these modalities can provide higher detection accuracy for the people-marching event and is less dependent on potential problems in any one of the modalities.

The information fusion literature contains a significant number of studies on multimodal information fusion. However, most of these studies do not take advantage of all available modalities. Instead, they focus on particular modality pairs, especially ‘audio-visual’ and ‘visual-textual’ (Maragos, Gros, Katsamanis, & Papandreou, 2008; Atrey, Hossain, El-Saddik, & Kankanhalli, 2010). In this study, we aim to incorporate as much information as possible from the existing modalities for the purpose of semantic concept detection. Here, we consider a ‘modality’ to be a set of information that is complementary to the other included modalities (Wu, Chang, Chang, & Smith, 2004), and we divide the three information channels in video (visual, audio, textual) into the following complementary modalities: Visual-Color, Visual-Region, Visual-Texture, Audio-Perceptual, Audio-Cepstral, and Textual. Thus, we try to benefit from any useful information included in the video data and increase the retrieval accuracy of the concepts. The concepts to be predicted are referred to as semantic concepts; each constitutes a class of elements that share the essential characteristics identifying the class. These semantic concepts include visual objects such as Car and Bird, events such as Biking, and other semantic concepts such as Soccer and Basketball.
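
To make the idea of combining the six modalities concrete, the following is a minimal sketch of a generic score-level (late) fusion step. It is not the fusion method proposed in the article, which this preview does not describe. The function name `fuse_scores`, the weighted-average rule, and the example weights are all assumptions introduced for illustration; only the modality names come from the text above.

```python
# Illustrative sketch only: a generic weighted-average late fusion.
# This is NOT the authors' method; `fuse_scores` and the weights are hypothetical.

# Modality names as listed in the article.
MODALITIES = [
    "Visual-Color", "Visual-Region", "Visual-Texture",
    "Audio-Perceptual", "Audio-Cepstral", "Textual",
]


def fuse_scores(scores, weights=None):
    """Return the weighted average of per-modality confidence scores.

    `scores` maps a modality name to a detector confidence in [0, 1].
    Modalities absent from `scores` are skipped, so the fused score does
    not depend on any single modality being available.
    """
    if weights is None:
        # Assumption: equal weights when none are supplied.
        weights = {name: 1.0 for name in MODALITIES}

    available = [name for name in MODALITIES if name in scores]
    if not available:
        raise ValueError("no modality scores available")

    total_weight = sum(weights[name] for name in available)
    if total_weight <= 0:
        raise ValueError("weights of available modalities must sum to > 0")

    weighted_sum = sum(weights[name] * scores[name] for name in available)
    return weighted_sum / total_weight
```

For the people-marching example, a call such as `fuse_scores({"Visual-Region": 0.7, "Audio-Perceptual": 0.9})` would still produce a fused confidence even though no closed caption text is available, which illustrates the reduced dependence on any single modality.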
