Conventional approaches to content-based image retrieval exploit low-level visual information to represent images and relevance feedback techniques to incorporate human knowledge into the retrieval process, which can only alleviate the semantic gap to some extent. To further boost retrieval performance, a Bayesian framework is proposed in which information independent of the visual content of images is utilized and integrated with the visual information. Two particular instances of the general framework are studied. First, context, i.e., the statistical relation among images, is integrated with visual content so that the framework can extract information from both the images themselves and past retrieval results. Second, the characteristic sounds made by different objects are utilized along with their visual appearance. The proposed framework is evaluated under various performance criteria on two databases, one for each instance. The results demonstrate the advantage of integrating information from multiple sources.
I. Introduction
Human beings have witnessed and experienced the relentless growth of multimedia information since the beginning of the information era. An immediate challenge posed by this information explosion is how to intelligently manage and enjoy multimedia databases. Over the course of technological development in multimedia information retrieval, various approaches have been proposed with the ultimate goal of enabling semantic-based search and browsing. Among the intensively explored topics, content-based image retrieval (CBIR), born at the crossroads of computer vision, machine learning, and database technologies, has been studied for more than a decade, yet it remains difficult (Smeulders, Worring, Santini, Gupta, & Jain, 2001; Datta, Joshi, Li, & Wang, 2008). In a nutshell, content-based approaches to image retrieval rely primarily on pictorial information, i.e., low-level visual features such as color, texture, shape, and layout, which can be automatically extracted from images for similarity measurement. The essential challenge is that low-level visual features that accurately characterize the semantic meaning of images are difficult to discover. Consequently, semantically relevant images may lie far apart in the feature space, a discrepancy referred to as the semantic gap.

To narrow the semantic gap, human knowledge has been utilized to help refine the representation of the semantic meaning of a user's query. To this end, relevance feedback (RF), a technique originally proposed for traditional document retrieval, was adapted to the problem of image retrieval (Crucianu, Ferecatu, & Boujemaa, 2004; Zhou & Huang, 2003). A common aspect of most RF techniques is that the learned knowledge is not propagated to future retrieval sessions; they can therefore be regarded as short-term relevance feedback (STRF). STRF techniques alleviate the semantic gap by incorporating users' knowledge through the labeling of training samples, yet they still suffer from sample sparseness, since average users are typically willing to label only a few relevant and irrelevant images. In addition, because irrelevant images may differ from the relevant ones in many different ways, the training samples of the two categories are likely to be imbalanced in the STRF setting. Together with the demand for real-time performance in a practical search engine, these problems can be considered the major factors behind the current performance bottleneck.
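As a concrete illustration of the conventional pipeline sketched above, the following minimal Python example extracts a joint color histogram as the low-level feature, ranks images by Euclidean distance, and applies one Rocchio-style query update as a stand-in for a single round of short-term relevance feedback. This is a sketch under stated assumptions: the function names, the Rocchio weights (alpha, beta, gamma), and the random toy data are illustrative, not the Bayesian method proposed in this paper.

```python
import numpy as np

def color_histogram(image, bins=8):
    """Quantize an RGB image (H x W x 3, values in 0..255) into a
    normalized joint color histogram -- one common low-level feature."""
    quantized = (image // (256 // bins)).reshape(-1, 3)
    hist = np.zeros((bins, bins, bins))
    for r, g, b in quantized:
        hist[r, g, b] += 1
    return hist.ravel() / hist.sum()

def retrieve(query_feat, db_feats, k=5):
    """Rank database images by Euclidean distance in feature space
    and return the indices of the k nearest neighbors."""
    dists = np.linalg.norm(db_feats - query_feat, axis=1)
    return np.argsort(dists)[:k]

def rocchio_update(query_feat, relevant, irrelevant,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """One round of STRF: move the query toward the mean of user-labeled
    relevant features and away from the irrelevant ones (Rocchio-style
    query-point movement); the weights here are illustrative assumptions."""
    moved = (alpha * query_feat
             + beta * np.mean(relevant, axis=0)
             - gamma * np.mean(irrelevant, axis=0))
    clipped = np.clip(moved, 0.0, None)
    return clipped / clipped.sum()  # renormalize to a valid histogram

# Toy usage with random "images" standing in for a real collection.
rng = np.random.default_rng(0)
database = [rng.integers(0, 256, (32, 32, 3)) for _ in range(100)]
db_feats = np.stack([color_histogram(img) for img in database])
query_feat = color_histogram(rng.integers(0, 256, (32, 32, 3)))

top = retrieve(query_feat, db_feats)  # initial ranking
# Suppose the user marks the first two hits relevant, the rest irrelevant.
query_feat = rocchio_update(query_feat,
                            db_feats[top[:2]], db_feats[top[2:]])
print(retrieve(query_feat, db_feats))  # refined ranking
```

Note that this sketch only moves the query point within a single session; other STRF variants reweight features or train a classifier instead, but they share the same limitation discussed above, namely that the knowledge gathered is discarded once the session ends.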