Introduction
In an image, certain visual stimuli are particularly effective at attracting human attention. It is therefore important to detect such salient visual content before conducting complex segmentation and recognition tasks. In this manner, limited computational resources can be assigned to salient visual content with high priority, so that images can be processed efficiently, much as humans process them.
Inspired by various psychological and neurobiological theories (e.g., Guided Search Model (Wolfe et al., 1989) and Feature Integration Theory (Treisman et al., 1980)), numerous saliency models have been proposed in the past two decades. In these models, a common solution is to divide an image into non-overlapping rectangular patches (i.e., macro-blocks) at a single scale or multiple scales. The saliency of a patch is then measured by its rarity in fixed local and/or global contexts. For example, such visual rarity can be computed as local contrast (Itti et al., 1998), surprise (Itti & Baldi, 2005), or coding length increment (Hou & Zhang, 2009), as well as global viewing time (Harel et al., 2007), entropy rate (Wang et al., 2010), and co-occurrence frequency (Lu et al., 2013). In particular, some approaches (e.g., Hou & Zhang, 2007; Fang et al., 2012; Li et al., 2013; Li et al., 2015) first transform the image into the frequency domain and then measure patch rarity via spectrum analysis. Moreover, since rarity can be measured simultaneously from multiple feature channels, some researchers have proposed to derive patch saliency by combining various kinds of rarity with a heuristic framework (Borji & Itti, 2012) or a “feature-saliency” mapping function learned from user data (Judd et al., 2009; Borji, 2012; Zhao & Koch, 2012).
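To make the patch-based scheme concrete, the following is a minimal sketch of rarity measured in a fixed global context. It is not the method of any model cited above: the single feature (mean patch intensity), the function names `split_into_patches` and `global_rarity`, and the toy image are all assumptions chosen for illustration.

```python
def split_into_patches(image, size):
    """Split a 2-D grayscale image (a list of rows) into non-overlapping
    size x size patches. Rows and columns that do not fill a complete
    patch are dropped. Returns {(patch_row, patch_col): flat pixel list}.
    """
    rows = len(image) // size
    cols = len(image[0]) // size
    patches = {}
    for r in range(rows):
        for c in range(cols):
            patches[(r, c)] = [
                image[r * size + i][c * size + j]
                for i in range(size)
                for j in range(size)
            ]
    return patches


def global_rarity(image, size=8):
    """Score each patch by the mean absolute difference between its mean
    intensity and that of every other patch (hypothetical rarity measure).
    """
    patches = split_into_patches(image, size)
    means = {k: sum(v) / len(v) for k, v in patches.items()}
    n = len(means)
    scores = {}
    for key, m in means.items():
        # The patch's difference to itself is zero, so dividing by n - 1
        # averages over the other patches only.
        total = sum(abs(m - other) for other in means.values())
        scores[key] = total / (n - 1) if n > 1 else 0.0
    return scores


# Toy 16x16 image: dark background with one bright 8x8 block (top right).
image = [[0.0] * 16 for _ in range(16)]
for y in range(8):
    for x in range(8, 16):
        image[y][x] = 1.0

scores = global_rarity(image, size=8)
most_salient = max(scores, key=scores.get)  # patch with the rarest mean
```

In this toy case the bright block receives the highest score because it differs from every other patch. Note that both the patch size and the context (all other patches) are fixed in advance in this sketch.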
Generally speaking, these saliency models achieve impressive performance in many cases. However, they still have two drawbacks. First, these models often embed an image patch into fixed local and/or global contexts, whereas each patch actually appears alongside different neighbors when an image is free-viewed by different subjects (i.e., flexible contexts). Second, many models directly measure the saliency values of small patches with fixed sizes (e.g., 8×8 blocks), whereas human attention in the free-viewing process is often attached to regions of varying sizes (i.e., flexible regions). It is believed that the performance of saliency estimation can be greatly improved if the elementary saliency unit and its context are both flexibly selected.
Figure 1. In an eye-tracking experiment, different subjects may free-view different regions in the same image. Their fixations are recorded to form a fixation density map that reflects the ground-truth saliency.