INTRODUCTION
The retrieval of images from multimedia collections is one of the most active topics in computer science. In many cases, e.g., a raw collection of images from a photo repository, the only information that can be extracted is the visual content itself. In these cases the retrieval task must rely only on low-level features of the images and, as a consequence, it becomes more difficult due to the semantic gap between the visual content (i.e., the data used for the retrieval process) and the semantic concepts (i.e., the goal of the retrieval process) (Smeulders et al., 2000; Lew et al., 2006; Datta et al., 2008). All the techniques that are based on the visual content of the images fall within the field of Content-Based Image Retrieval (CBIR).
It has been shown that the effectiveness of a CBIR technique strongly depends on the choice of the set of visual features, and on the choice of the metric used to model the users’ perception of image similarity. Unfortunately, it is very hard to assess which visual features are the “best” for a given retrieval task, and which similarity metric is suitable. Consequently, the set of retrieved images often fits the users’ needs only partly. To overcome this problem, Relevance Feedback has been widely studied as a tool that allows users to refine the results by submitting feedback on the relevance of the retrieved images (Huang et al., 2008; Zhou & Huang, 2003; Rui & Huang, 2001; Giacinto, 2007).
Relevance Feedback (RF) is a mechanism that directly involves the user by allowing her to refine the retrieval results by marking the retrieved images as relevant or non-relevant to the visual query. This feedback is then exploited to “adjust” the retrieval mechanism, and a new set of images, deemed relevant according to the given feedback, is proposed to the user. This “adjustment” can be made in different ways, but the approaches can be broadly subdivided into two methodologies: one based on a transformation of the visual feature space, and the other based on a modification of the similarity metric, both aiming to attain higher similarity between relevant images, and lower similarity between relevant and non-relevant images. Relevance Feedback approaches are usually based on the formulation of a two-class problem where the retrieved images are labeled as being either relevant or non-relevant to the query. The behavior of RF techniques following this problem formulation is usually evaluated on image datasets where each image is associated with a single tag (i.e., the label associated with a visual concept), even though the images typically contain more than one visual concept.
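As a concrete illustration of the second methodology, the following is a minimal sketch of a classical similarity-metric adaptation scheme (standard-deviation-based feature re-weighting), assuming that images are represented by fixed-length feature vectors and that similarity is modeled by a weighted Euclidean distance; it is offered only as an example of the general mechanism, not as the specific technique evaluated in this paper.

```python
import numpy as np

def reweight_features(relevant, eps=1e-6):
    """Standard-deviation-based feature re-weighting: dimensions along which
    the relevant images agree (low variance) receive a higher weight in the
    similarity metric. 'relevant' is an (n_relevant, n_features) array of the
    feature vectors marked as relevant by the user."""
    std = np.std(relevant, axis=0)
    return 1.0 / (std + eps)

def rank_images(query, database, weights, k=20):
    """Rank database images by weighted Euclidean distance to the query and
    return the indices of the k nearest ones."""
    dists = np.sqrt(((database - query) ** 2 * weights).sum(axis=1))
    return np.argsort(dists)[:k]
```

After each feedback round, the weights are recomputed from the enlarged set of relevant images and the ranking is refreshed, so that the metric progressively reflects the user's notion of similarity.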
It is the opinion of the authors that the assessment of the capabilities of RF techniques carried out on this kind of dataset is somewhat limited. This limitation is related to the fact that only a single tag/concept is associated with each image in the dataset. Commonly, this “limitation” is accepted for two reasons: first, the single tag is associated with the most significant visual concept of the image, as it is the easiest to identify; second, during the testing phase, it is easier to artificially simulate the behavior of users who submit feedback with respect to only one concept. Although these reasons allow a simple evaluation scenario to be set up, they do not allow a full evaluation of the potential of RF techniques, because an image is typically associated with a number of concepts, and because this modality of simulated feedback is not a good model of the behavior of a real user.
The problem of multi-tagged datasets has been addressed in different fields (Tsoumakas et al., 2010), such as protein function classification, music categorization, and semantic scene classification (Boutell et al., 2004), but unfortunately it remains almost untouched in the field of Relevance Feedback for Content-Based Image Retrieval. Probably, the main reason is that simulating a real user’s feedback in a real-case scenario (i.e., when different visual concepts are associated with a single image) is quite difficult. Following the query-by-example paradigm, the user submits an image to the CBIR system as a query, and the system retrieves the images that are most similar to the query from the visual point of view. When asked to mark the images as relevant or not, a real user may be interested in retrieving images related to more than just one of the visual concepts represented in the query image. Thus, different retrieval tasks can be performed starting from a given query image, as many as the possible combinations of visual concepts contained in the image itself. This kind of behavior is quite difficult to simulate, unless a toy dataset is used or a large number of real users agrees to perform a live experimentation. A sketch of how such multi-concept feedback could be simulated is given below.
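The snippet below is a minimal sketch of this kind of simulation: retrieved images are marked as relevant according to a hypothetical target set of concepts drawn from the query image’s tags. The “and”/“or” modes are illustrative stand-ins for possible user behaviors, not the specific scenarios defined later in the paper.

```python
def simulate_feedback(retrieved_tags, target_concepts, mode="and"):
    """Simulate relevance judgments on a multi-tagged dataset.

    retrieved_tags:  list of tag sets, one per retrieved image.
    target_concepts: set of concepts the simulated user is looking for,
                     e.g. a subset of the tags of the query image.
    mode:            'and' marks an image relevant only if it carries all
                     target concepts; 'or' if it carries at least one.
    """
    if mode == "and":
        return [target_concepts <= tags for tags in retrieved_tags]
    return [bool(target_concepts & tags) for tags in retrieved_tags]

# Example: the simulated user searches for images containing both concepts.
feedback = simulate_feedback(
    retrieved_tags=[{"beach", "sunset"}, {"beach", "people"}, {"mountain"}],
    target_concepts={"beach", "sunset"},
    mode="and",
)  # -> [True, False, False]
```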
In this paper we propose a “new” modus operandi for evaluating a Relevance Feedback methodology on multi-tagged datasets. To this aim, we propose different ways to simulate some of the possible behaviors of real users in different scenarios, i.e., the logic employed by the user to mark the images as relevant or not. In our opinion, these scenarios can be used to attain a more thorough assessment of RF techniques. Moreover, we propose some novel measures of concept correlation, aimed at assessing whether the results obtained by RF techniques in the different scenarios are reliable, or whether they are biased by a single concept. In fact, while in the case of a single-tagged dataset the retrieval process is clearly driven by just one concept, in the case of a multi-tagged dataset it is of interest to assess whether the search for multiple concepts is actually driven by only a few of them, or whether these multiple concepts actually represent a higher-level concept.
The paper is organized as follows. First, we present our view and ideas on evaluating a relevance feedback methodology on multi-tagged datasets. Then, we briefly review the RF techniques used in the experimental phase. After these sections, we give the details of the experiments, and finally the conclusions are outlined.