Introduction
The human visual system has a remarkable ability to pay more attention to salient regions or objects in natural scenes, because these regions or objects differ conspicuously from their surroundings in color, intensity, gradient, and so on. This is the visual attention mechanism of human beings, and many studies have tried to build computational models that simulate it (Borji & Itti, 2013). Visual attention has many applications, for example automatic image cropping (Santella, Agrawala, DeCarlo, Salesin, & Cohen, 2006), adaptive image display on small devices (Chen, Xie, Fan, Ma, Zhang, & Zhou, 2003), image/video compression, advertising design (Itti, 2000), and image collection browsing (Rother, Bordeaux, Hamadi, & Blake, 2006). Recent studies (Navalpakkam & Itti, 2006; Rutishauser, Walther, Koch, & Perona, 2004) have demonstrated that visual attention also helps object recognition, tracking, and detection.
There are two types of methods for detecting salient regions: those based on the spatial domain and those based on the frequency domain. One of the earliest computational models of visual attention in the spatial domain was proposed by Itti, Koch, and Niebur (1998). The algorithm obtains a saliency map from intensity, color, and orientation conspicuity maps. These conspicuity maps are attained by across-scale addition of feature maps, where the feature maps capture center-surround differences between various scales of Gaussian and oriented pyramids. The saliency map of this method is useful for locating important regions in a given visual scene, but its resolution is very low. Achanta, Estrada, Wils, and Süsstrunk (2008) proposed a salient region detection method (AC) in which a difference-of-means filter is used to estimate center-surround contrast. The lowest frequencies retained depend on the size of the largest surround filter and the highest frequencies on the size of the smallest center filter, so the AC method effectively operates at full resolution. Several other saliency models study saliency using a graph representation of images (Harel, Koch, & Perona, 2007; Gopalakrishnan, Hu, & Rajan, 2009). In the Graph-Based Visual Saliency (GBVS) algorithm (Harel, Koch, & Perona, 2007), the edges of a graph denote the similarity between two nodes (pixels). Random walks are then performed on these nodes, and the more often a node is visited, the more salient it is considered to be. Goferman, Zelnik-Manor, and Tal (2012) proposed a context-aware (CA) saliency computation approach that employs the color and position information of each image pixel, which can extract the salient objects while also preserving their surrounding regions. However, it computes the saliency of each pixel from the dissimilarity to its K most similar patches, which leads to high time complexity.
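The center-surround contrast idea shared by the Itti and AC models can be sketched in a few lines. The following is a minimal, hypothetical Python illustration (not the authors' implementations, and with made-up function names): the saliency of a pixel is the absolute difference between the mean intensity of a small center window and that of a larger surround window, evaluated at every pixel so the map keeps full resolution.

```python
def window_mean(img, r, c, radius):
    """Mean intensity of the square window of the given radius centered
    at (r, c), clipped at the image borders."""
    rows, cols = len(img), len(img[0])
    total, count = 0.0, 0
    for i in range(max(0, r - radius), min(rows, r + radius + 1)):
        for j in range(max(0, c - radius), min(cols, c + radius + 1)):
            total += img[i][j]
            count += 1
    return total / count

def center_surround_saliency(img, center_radius=1, surround_radius=3):
    """Full-resolution saliency map: |center mean - surround mean| per pixel,
    in the spirit of a difference-of-means filter."""
    rows, cols = len(img), len(img[0])
    return [[abs(window_mean(img, r, c, center_radius)
                 - window_mean(img, r, c, surround_radius))
             for c in range(cols)] for r in range(rows)]

# Toy grayscale image: a bright 2x2 blob on a dark background.
img = [[0] * 8 for _ in range(8)]
for i in (3, 4):
    for j in (3, 4):
        img[i][j] = 255

sal = center_surround_saliency(img)
# Pixels inside the blob score much higher than background pixels.
```

With a single filter-size pair this captures only one band of spatial frequencies; the Itti model repeats the comparison across pyramid scales, and the AC method across several surround sizes, to cover a wider band.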
The basic pitfall of the above-mentioned methods is that they must compute a global or local saliency for every pixel, so they suffer from high computational complexity, ad hoc design choices, and over-parameterization, and they often produce saliency maps of lower resolution than the original images. These drawbacks often arise from failing to exploit the appropriate spatial frequency content of the original image, as analyzed by Achanta, Hemami, Estrada, and Süsstrunk (2009).
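For contrast with the local-window methods above, the frequency-tuned analysis of Achanta et al. (2009) leads to a very simple full-resolution alternative: score each slightly smoothed pixel by its deviation from the global image mean. A minimal sketch on a grayscale image, using a 3x3 box blur as a stand-in for the Gaussian smoothing of the original method:

```python
def box_blur(img):
    """3x3 box blur to suppress high-frequency noise (a simple stand-in
    for Gaussian smoothing)."""
    rows, cols = len(img), len(img[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total, count = 0.0, 0
            for i in range(max(0, r - 1), min(rows, r + 2)):
                for j in range(max(0, c - 1), min(cols, c + 2)):
                    total += img[i][j]
                    count += 1
            out[r][c] = total / count
    return out

def frequency_tuned_saliency(img):
    """Saliency = |global image mean - blurred pixel value| per pixel."""
    mean = sum(map(sum, img)) / (len(img) * len(img[0]))
    blurred = box_blur(img)
    return [[abs(mean - v) for v in row] for row in blurred]

# Toy grayscale image: a bright 2x2 blob on a dark background.
img = [[0] * 8 for _ in range(8)]
for i in (3, 4):
    for j in (3, 4):
        img[i][j] = 255

sal = frequency_tuned_saliency(img)
# The blob deviates strongly from the global mean, so it dominates the map.
```

Because the only statistic is the global mean, there is no per-pixel neighborhood search, which is why this family of methods avoids the complexity and resolution loss criticized above.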