Introduction
In recent years, developments in multimedia technologies have led to rapid growth in the use of multimedia data, especially video. This growing demand for digital video data has exposed the need for content-based video retrieval systems. Content-based retrieval of videos requires extracting the semantic content of the videos. Nevertheless, the 'semantic gap' between the low-level features of multimedia data and the high-level semantic information remains a challenging problem. Thus, semantic content extraction is still a compelling research topic (Datta, Li, & Wang, 2005).
Videos, by their very nature, comprise different types of data, such as text, audio, and images. Correspondingly, the semantic information to be extracted is directly connected to these separate sources. Therefore, to provide an efficient semantic content extraction solution, this nature of multimedia data should be analyzed carefully and the contained information used thoroughly. Video data is unstructured, which leads to several complexities such as lighting variations, camera motion, occlusion, viewpoint changes, and noise in the sensed data. However, video data has another important characteristic that can help overcome most of these challenges: its multimodal content. Integrating the information obtained from multiple modalities is an empirically validated approach to increasing retrieval accuracy (Atrey, Hossain, El-Saddik, & Kankanhalli, 2010). Moreover, integration minimizes the dependence on any single modality, which yields a more robust system. Consider a people-marching event as an example, where the event can be recognized using any of the visual, audio, and textual modalities: the video can include people as visual objects, shouting sounds, and lyrics of a march in the closed-caption text. A combination of these modalities provides higher detection accuracy for the people-marching event and is less dependent on potential problems in any single modality.
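The combination described above can be illustrated with a minimal sketch of decision-level (late) fusion, in which each modality independently produces a confidence score for the event and the scores are combined by a weighted average. The function name, modality labels, weights, and scores below are illustrative assumptions, not taken from the article:

```python
def fuse_scores(scores, weights):
    """Weighted average of per-modality confidence scores.

    Modalities absent from `scores` (e.g. no closed-caption text is
    available) are simply skipped, so no single channel is
    indispensable -- the robustness property discussed above.
    """
    total = sum(weights[m] for m in scores)
    return sum(weights[m] * s for m, s in scores.items()) / total

# Illustrative weights for three information channels.
weights = {"visual": 0.5, "audio": 0.3, "textual": 0.2}

# All three modalities contribute evidence for the event.
all_three = fuse_scores(
    {"visual": 0.8, "audio": 0.7, "textual": 0.9}, weights)

# The textual modality is unavailable; detection still proceeds
# from the remaining channels.
no_text = fuse_scores({"visual": 0.8, "audio": 0.7}, weights)
```

Real fusion schemes are typically learned rather than fixed, but the sketch captures why a missing or degraded modality degrades the decision gracefully instead of breaking it.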
The information fusion literature contains a significant number of studies on multimodal information fusion. However, most of these studies do not take advantage of all available modalities; instead, they focus on particular modality pairs, especially 'audio-visual' and 'visual-textual' (Maragos, Gros, Katsamanis, & Papandreou, 2008; Atrey, Hossain, El-Saddik, & Kankanhalli, 2010). In this study, we aim to incorporate as much information as possible through the existing modalities for the purpose of semantic concept detection. Here, we consider a 'modality' to be a set of information that is complementary to the other included modalities (Wu, Chang, Chang, & Smith, 2004), and we elaborate the three information channels in video (visual, audio, textual) into the following complementary modalities: Visual-Color, Visual-Region, Visual-Texture, Audio-Perceptual, Audio-Cepstral, and Textual. Thus, we try to benefit from any useful information included in the video data and increase the retrieval accuracy of the concepts. The concepts to be predicted are referred to as semantic concepts, which constitute a class of elements that together share essential characteristics identifying the class. These semantic concepts include visual objects such as Car and Bird, events such as Biking, and other semantic concepts such as Soccer and Basketball.