Introduction
Multimedia retrieval has become a popular research area due to the explosive growth of digital image and video collections and the widespread accessibility of media on social networks and the Internet. The demand for solutions and tools that search and retrieve information of interest effectively and efficiently is increasing. Meanwhile, the volume of multimedia data keeps growing, and at an accelerating pace. For instance, it is now more appropriate to measure the sizes of video collections in terabytes (TB) than in gigabytes (GB). Hence, managing and retrieving the desired information from such huge amounts of multimedia data remains a challenge for researchers in the multimedia area (Chen, 2010).
Concept-based retrieval (Snoek & Worring, 2008) aims to detect the existence of objects (such as bus and hand), the meaning of scenes (such as cityscape and nighttime), and the occurrence of events (such as airplane flying and people dancing). It enables users to utilize multimedia data for entertainment, distance education, commerce and business, social communication, navigation, security, surveillance, and so on. For example, a user who loves music may enjoy watching video segments that contain singing, while a user who is interested in politics may seek news videos with protest content. Correctly detecting a classroom setting in videos would help information search for educational applications, and retrieving bridges and mountains would assist users who are planning a trip. High-level concepts such as doorway and street from video games could be used for navigation, while emergency vehicle and traffic intersection from video surveillance and security cameras could be used for tracking.
Most of the existing search and retrieval approaches are either restricted to textual information, that is, metadata such as surrounding text and closed captions, or dependent on an interactive framework that requires users' feedback and log files. Advances in database and data warehouse technologies provide a proper way to manage these textual data, and they appear to be efficient tools that help users access the data on demand. However, challenges arise because heavy human effort is demanded for annotation, for correcting the textual information, and for evaluating the retrieved results. To address these issues, content-based multimedia retrieval has emerged in recent years. Most content-based frameworks utilize support vector machine (SVM) detectors trained on scale-invariant feature transform (SIFT) descriptors and rank the retrieved results based on the scores obtained from the classifiers. However, SVM is very time consuming and has a huge demand for space. Moreover, classification-based ranking methods rely on an ad hoc mechanism to determine the threshold for class labels. Therefore, these frameworks are not suitable for real-time online searching.
In addition to efficiency, another important consideration for a retrieval system is effectiveness. The overall retrieval performance is usually evaluated through the mean average precision (MAP) of the retrieved results obtained from the ranking algorithm. To allow a fair comparison of the effectiveness of different approaches, the benchmark video concepts provided by the TREC Video Retrieval Evaluation (TRECVID) community (Smeaton, Over, & Kraaij, 2006) are the most commonly used testbed, offering large-scale standardized data sets. In 2008 and 2009, there were 30 concepts in total for the high-level feature extraction task and 219 annotated videos for training purposes (Divakaran, 2009).
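To make the MAP measure concrete, the following is a minimal sketch of how it could be computed from ranked results. It is illustrative only and is not the official TRECVID evaluation procedure. The function names and the input layout (one list of relevance flags per concept, in ranked order) are assumptions introduced here, and the sketch further assumes that every relevant item for a concept appears somewhere in its ranked list.

```python
from typing import Dict, Sequence


def average_precision(ranked_relevance: Sequence[bool]) -> float:
    """Average precision of one ranked list.

    ranked_relevance[i] is True when the item at rank i + 1 is relevant.
    Assumption: all relevant items appear in the list, so the sum of
    precisions is divided by the number of relevant items found.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision at this rank: relevant items so far / items so far.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(per_concept: Dict[str, Sequence[bool]]) -> float:
    """Mean of the per-concept average precisions (0.0 for empty input)."""
    if not per_concept:
        return 0.0
    total = sum(average_precision(flags) for flags in per_concept.values())
    return total / len(per_concept)
```

For example, a hypothetical ranked list whose first and third items are relevant, `[True, False, True]`, has precisions 1/1 and 2/3 at the two relevant ranks, which gives an average precision of 5/6. Averaging such values over all concepts yields the MAP.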