In content-based image retrieval (CBIR), a set of low-level features is extracted from an image to represent its visual content. Retrieval is performed by example: the user supplies a query image, and an appropriate similarity measure is used to find the best matches in the corresponding feature space. This approach suffers from the large discrepancy between the low-level visual features that can be extracted from an image and the semantic interpretation of the image’s content that a particular user may have in a given situation. That is, users seek semantic similarity, but we can only provide similarity based on low-level visual features extracted from the raw pixel data, a situation known as the semantic gap. The selection of an appropriate similarity measure is thus an important problem. Since visual content can be represented by different attributes, the combination and importance of each set of features vary according to the user’s semantic intent. Thus, the retrieval strategy should be adaptive so that it can accommodate the preferences of different users. Relevance feedback (RF) learning has been proposed as a technique for reducing the semantic gap. It works by gathering semantic information from user interaction: based on the user’s feedback on the retrieval results, the retrieval scheme is adjusted. By learning an image similarity measure that reflects human perception, RF learning can be seen as a form of supervised learning that finds relations between high-level semantic interpretations and low-level visual properties. That is, the feedback obtained within a single query session is used to personalize the retrieval strategy and thus enhance retrieval performance. In this chapter we present an overview of CBIR and related work on RF learning. We also present our own previous work on an RF-learning-based probabilistic region relevance learning algorithm for automatically estimating the importance of each region in an image according to the user’s semantic intent.
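As a minimal illustration of how feedback can adjust the retrieval scheme, the Python sketch below implements one classic RF heuristic: inverse-standard-deviation feature reweighting, in the spirit of early RF systems. The function name, data layout, and the choice of this particular reweighting rule are illustrative assumptions, not the algorithm presented later in this chapter.

```python
import numpy as np

def rerank_with_feedback(features, query_idx, relevant_idx, eps=1e-6):
    """Re-rank a collection after one round of relevance feedback.

    features:     (n_images, n_dims) array of low-level feature vectors
    query_idx:    index of the query image
    relevant_idx: indices of images the user marked as relevant

    Hypothetical sketch of a classic heuristic: give more weight to
    feature dimensions on which the relevant images agree (low variance)
    and less to those on which they disagree (high variance).
    """
    relevant = features[relevant_idx]
    weights = 1.0 / (relevant.std(axis=0) + eps)  # inverse std-dev reweighting
    weights /= weights.sum()                      # normalize the weights

    diff = features - features[query_idx]
    dists = np.sqrt((weights * diff ** 2).sum(axis=1))  # weighted Euclidean
    return np.argsort(dists)                      # best matches first
```

Each feedback round thus reshapes the similarity measure itself, which is what makes the retrieval strategy adaptive to a particular user’s intent.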
Introduction
In recent years, the rapid development of information technologies and the advent of the Web have accelerated the growth of digital media and, in particular, of image collections. To realize the full potential of these technologies, effective mechanisms for searching large image collections are needed. The management of text information has been studied thoroughly, and many successful approaches for handling text databases exist (see Salton, 1986). Progress in the research and development of multimedia database systems, however, has been slower because of the inherent difficulty of the problem.
The development of concise representations of images that capture the essence of their visual content is an important task. However, as the saying “A picture is worth a thousand words” suggests, representing visual content is very difficult. The human ability to extract semantics from an image by using knowledge of the world is remarkable, and probably very difficult to emulate.
At present, the most common way to represent the visual content of an image is to assign a set of descriptive keywords to it. Image retrieval is then performed by matching the query text against the stored keywords (Rui, 1998). However, this simple keyword-matching approach has several problems. First, a few keywords usually cannot capture all the information contained in an image. Furthermore, assigning keywords to every image in a large database requires a great deal of manual effort. Also, because different people may interpret an image’s content differently, the annotations will be inconsistent (Rui, 1998). Consider the image in Figure 1. One might describe it as “mountains”, “trees”, and “lake”. That description, however, could not answer user queries for “water”, “landscape”, “peaceful”, or “water reflection”.
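To make the limitation concrete, the following minimal Python sketch (with hypothetical file names and annotations) shows how keyword-based retrieval reduces to a set-membership test between query terms and stored keywords, which is exactly why an unanticipated query such as “water reflection” returns nothing.

```python
# Minimal sketch of keyword-based image retrieval (illustrative data).
annotations = {
    "img_001.jpg": {"mountains", "trees", "lake"},
    "img_002.jpg": {"city", "night", "skyline"},
}

def keyword_search(query):
    """Return images whose stored keywords contain every query term."""
    terms = set(query.lower().split())
    return [name for name, keywords in annotations.items()
            if terms <= keywords]  # subset test: all terms must match

print(keyword_search("lake trees"))        # ['img_001.jpg']
print(keyword_search("water reflection"))  # [] -- semantics never annotated
```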
To alleviate some of the problems associated with text-based approaches, content-based image retrieval (CBIR) was proposed (see Faloutsos, 1993, for examples of early approaches). The idea is to search the images directly: a set of low-level features (such as color, texture, and shape) is extracted from each image to characterize its visual content. In traditional approaches (Faloutsos, 1993; Gupta, 1997; Hara, 1997; Kelly, 1995; Mehrotra, 1997; Pentland, 1994; Samadani, 1993; Sclaroff, 1997; Smith, 1996; Smith, 1997; Stone, 1996; Wang, 1998), each image is represented by a set of global features (e.g., color and texture statistics) computed uniformly over the entire image. These features form the components of a feature vector, so that each image corresponds to a point in a feature space.
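The sketch below illustrates this global-feature pipeline under stated assumptions: NumPy and Pillow are available, a normalized global color histogram stands in for the feature vector, and plain Euclidean distance serves as the similarity measure. These are illustrative choices, not the actual features or metrics of the systems cited above.

```python
import numpy as np
from PIL import Image

def global_color_histogram(path, bins=8):
    """Map an image to a point in feature space: a normalized
    color histogram with `bins` levels per RGB channel."""
    pixels = np.asarray(Image.open(path).convert("RGB")).reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins,) * 3, range=[(0, 256)] * 3)
    hist = hist.ravel()
    return hist / hist.sum()  # normalize so histograms are comparable

def retrieve(query_path, collection_paths, k=5):
    """Rank collection images by Euclidean distance to the query image."""
    q = global_color_histogram(query_path)
    feats = np.array([global_color_histogram(p) for p in collection_paths])
    dists = np.linalg.norm(feats - q, axis=1)  # distance in feature space
    order = np.argsort(dists)[:k]              # k nearest neighbors
    return [(collection_paths[i], dists[i]) for i in order]
```

Because every image is reduced to a single point computed uniformly over all its pixels, such global representations cannot distinguish which parts of the image the user actually cares about, a limitation the region-based approach discussed later is designed to address.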