Content-Based Multimedia Retrieval Using Feature Correlation Clustering and Fusion

Hsin-Yu Ha (School of Computing and Information Sciences, Florida International University, Miami, FL, USA), Fausto C. Fleites (School of Computing and Information Sciences, Florida International University, Miami, FL, USA), and Shu-Ching Chen (School of Computing and Information Sciences, Florida International University, Miami, FL, USA)
DOI: 10.4018/jmdem.2013040103

Processing visual features alone is no longer sufficient for multimedia semantic retrieval, because multimedia data typically involve a variety of modalities, e.g., graphics, text, speech, and video. It becomes crucial to fully utilize the correlation between each feature and the target concept, the feature correlation within modalities, and the feature correlation across modalities. In this paper, the authors propose a Feature Correlation Clustering-based Multi-Modality Fusion Framework (FCC-MMF) for multimedia semantic retrieval. Features from different modalities are combined into one feature set with a common representation via a normalization and discretization process. Within and across modalities, multiple correspondence analysis (MCA) is utilized to obtain the correlation between feature-value pairs, which are then projected onto the first two principal components. The K-medoids algorithm, a widely used partitioning clustering algorithm, is selected to minimize the Euclidean distance within the resulting clusters and produce highly intra-correlated feature-value pair clusters. A majority vote is then applied to decide which cluster each feature belongs to. Once the feature clusters are formed, one classifier is built and trained per cluster. The correlation and confidence of each classifier are considered while fusing the classification scores, and mean average precision is used to evaluate the final ranked classification scores. Finally, the proposed framework is applied to the NUS-WIDE-Lite data set to demonstrate its effectiveness in multimedia semantic retrieval.
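The clustering step described above can be sketched in code. The following is a minimal, illustrative implementation (not the authors' original code): feature-value pairs are assumed to already be projected onto two principal components by MCA, K-medoids groups the resulting 2-D points under Euclidean distance, and a majority vote assigns each original feature to the cluster containing most of its pairs. The function names and the `pair_to_feature` mapping are hypothetical.

```python
import numpy as np

def k_medoids(points, k, max_iter=100, seed=0):
    """PAM-style K-medoids with Euclidean distance.

    points: (n, d) array of MCA-projected feature-value pairs.
    Returns (medoid_indices, labels).
    """
    rng = np.random.default_rng(seed)
    n = len(points)
    # Precompute the full pairwise Euclidean distance matrix.
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(max_iter):
        # Assign each point to its nearest medoid.
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if len(members) == 0:
                continue
            # New medoid: the member minimizing total intra-cluster distance.
            costs = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(costs)]
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
            break  # converged
        medoids = new_medoids
    labels = np.argmin(dist[:, medoids], axis=1)
    return medoids, labels

def assign_features_by_majority_vote(pair_labels, pair_to_feature, k):
    """Each feature contributes several feature-value pairs; the feature
    joins the cluster that contains the majority of its pairs."""
    votes = {}
    for label, feat in zip(pair_labels, pair_to_feature):
        votes.setdefault(feat, np.zeros(k, dtype=int))[label] += 1
    return {feat: int(np.argmax(v)) for feat, v in votes.items()}
```

In the full framework, one classifier would then be trained on each feature cluster, with the per-cluster correlation and classifier confidence used as fusion weights for the final ranked scores.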
Article Preview

1. Introduction

As a result of rapid advances in contemporary technology, people routinely use smartphones to capture images, record videos, and instantly share multimedia content with accompanying descriptions over social networks, a trend that has caused multimedia data to propagate expeditiously around the world. A study by IDC and EMC stated that 1,800 EB (1 EB = 1,000 PB) of digital information were produced in 2011, a tenfold increase from 2005 to 2011 (Gantz et al., 2008). To manage this enormous volume of multimedia data, i.e., images, videos, texts, and audio, effectively retrieving data from different modalities and bridging the gap between low-level features and diverse semantic concepts becomes increasingly essential. Many researchers have investigated multi-modal fusion for multimedia analysis, e.g., video retrieval (Yan, Yang & Hauptmann, 2004; McDonald & Smeaton, 2005), speech recognition (Metallinou, Lee & Narayanan, 2010; Papandreou, Katsamanis, Pitsikalis & Maragos, 2009), and event detection (Jiang et al., 2010; Mertens, Lei, Gottlieb, Friedland & Divakaran, 2011). However, because of the multiple modalities involved, multi-modal fusion faces several challenges: coping with different feature formats, capturing correlation and independence among modalities at multiple levels, and determining the confidence level of each model for the task at hand.

To address these challenges, Atrey et al. (2010) posed several key questions for multimedia analysis, some of which are particularly relevant to content-based multimedia retrieval:
