1. Introduction
The detection of video foreground objects is one of the key problems in video processing, as it directly facilitates applications such as video object recognition (Prest, Leistner, Civera, Schmid, & Ferrari, 2012), action segmentation (Lu & Jason, 2015) and action recognition (Wang & Schmid, 2013). In recent years, many successful video foreground segmentation approaches (Grundmann, Kwatra, Han, & Essa, 2010; Jang, Lee, & Kim, 2016) and object detectors (Cho, Kwak, Schmid, & Ponce, 2015; Girshick, 2015) have been proposed that leverage an understanding of high-level video content.
However, building a fully unsupervised model to segment the foreground object in unconstrained videos remains challenging, as no additional information about the foreground is provided, and the videos may be affected by factors such as dynamic backgrounds, motion blur, lighting changes, or even editing artefacts (e.g. subtitles or flying logos) (Papazoglou & Ferrari, 2013). The cues that segmentation relies on therefore become more important, namely motion cues and appearance cues. Optical flow is a popular example of a motion cue (Figure 1).
Figure 1. An illustration of the proposed motion-based likelihood: (a) the original frame; (b) the actual edges of the current frame (green) and the motion boundaries (red); (c) the HOG affinity map, where motion boundaries that agree more with the edge responses receive higher scores; (d) the edges aligned by motion; (e) the initial foreground prediction by the inside-outside map (Papazoglou & Ferrari, 2013); (f) the motion-based likelihood obtained by accumulating the masks from step (e)
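As a concrete illustration of the motion cue in Figure 1 (b)-(c), the following minimal Python/OpenCV sketch derives motion boundaries as the gradient magnitude of a dense Farnebäck flow field and scores their agreement with image edges. It is only a sketch of the general idea under our own assumptions: the function names are hypothetical, and it uses Canny edges rather than the HOG-based edge responses of the actual affinity map.

```python
import cv2
import numpy as np

def motion_boundary_magnitude(prev_gray, curr_gray):
    """Gradient magnitude of a dense optical flow field, used as a
    rough motion-boundary map (cf. Figure 1 (b), red)."""
    # Farneback dense flow; arguments are (pyr_scale, levels,
    # winsize, iterations, poly_n, poly_sigma, flags).
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    # Large spatial gradients of the flow indicate motion
    # discontinuities, i.e. likely object boundaries.
    du_dx = cv2.Sobel(flow[..., 0], cv2.CV_32F, 1, 0, ksize=3)
    du_dy = cv2.Sobel(flow[..., 0], cv2.CV_32F, 0, 1, ksize=3)
    dv_dx = cv2.Sobel(flow[..., 1], cv2.CV_32F, 1, 0, ksize=3)
    dv_dy = cv2.Sobel(flow[..., 1], cv2.CV_32F, 0, 1, ksize=3)
    mag = np.sqrt(du_dx**2 + du_dy**2 + dv_dx**2 + dv_dy**2)
    return mag / (mag.max() + 1e-8)  # normalise to [0, 1]

def edge_agreement_score(curr_gray, boundary_mag, thresh=0.2):
    """Fraction of motion-boundary pixels that coincide with actual
    image edges (cf. Figure 1 (c): higher = better agreement)."""
    edges = cv2.Canny(curr_gray, 100, 200) > 0
    boundaries = boundary_mag > thresh
    return np.logical_and(edges, boundaries).sum() / (boundaries.sum() + 1e-8)
```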
With adequate alignment, the motion boundary given by optical flow should be more effective, since it is often related to the actual edges of the foreground objects (Li, Kim, Humayun, Tsai, & Rehg, 2013). On the other hand, unsupervised models such as PHM (Cho, Kwak, Schmid, & Ponce, 2015) can serve as good appearance models, while the colour distribution is a simpler but efficient option (Koh, Jang, & Kim, 2016). In fact, a more precise foreground estimate can also significantly improve the foreground prediction of a colour model (Stretcu & Leordeanu, 2015).
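The colour-distribution appearance model mentioned above can be as simple as a pair of foreground/background colour histograms built from an initial (possibly rough) mask. The following NumPy sketch illustrates one plausible form; the function name, bin count and uniform-prior ratio are our assumptions, not the exact formulation of Koh, Jang, & Kim (2016).

```python
import numpy as np

def colour_likelihood(frame_bgr, init_mask, bins=16):
    """Per-pixel foreground likelihood from quantised colour
    histograms over an initial foreground mask."""
    # Quantise each 8-bit channel into `bins` levels and flatten
    # the three channels into a single histogram cell index.
    quant = frame_bgr.astype(np.int32) // (256 // bins)
    idx = (quant[..., 0] * bins + quant[..., 1]) * bins + quant[..., 2]

    fg = init_mask.astype(bool)
    n_cells = bins ** 3
    # Laplace-smoothed, normalised histograms for both regions.
    fg_hist = np.bincount(idx[fg], minlength=n_cells) + 1.0
    bg_hist = np.bincount(idx[~fg], minlength=n_cells) + 1.0
    fg_hist /= fg_hist.sum()
    bg_hist /= bg_hist.sum()

    # Posterior-style ratio with a uniform prior: values near 1
    # indicate colours seen mostly inside the initial foreground.
    p_fg, p_bg = fg_hist[idx], bg_hist[idx]
    return p_fg / (p_fg + p_bg)
```

A more precise initial mask sharpens both histograms, which is exactly why better motion-based predictions also benefit the colour model.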
In this paper, we address the problem of automatically locating and segmenting the foreground object in unconstrained videos and propose a fully unsupervised approach based on both the motion and appearance cues of the object. The key contributions of our work are: (1) we propose a fully unsupervised approach for video foreground object segmentation that achieves competitive performance on three datasets; (2) we obtain more precise motion-based foreground predictions with a novel HOG affinity map; and (3) we show that shallow image processing algorithms are still capable of handling complex vision tasks such as video foreground segmentation. The experimental results show that our approach achieves competitive results on the YouTube-Obj, J-HMDB and VOS datasets.
2. Related Work
Though classic background subtraction methods perform well at foreground segmentation with stationary cameras or slow background motions, in unconstrained videos the background is more complex and harder to analyse (Koh, Jang, & Kim, 2016). In fact, locating the foreground object in videos is a difficult task, and many efforts have been made in the past decade.
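For contrast, a minimal OpenCV sketch of such a classic background-subtraction baseline is shown below (MOG2 here; the input path is hypothetical). It works for a stationary camera precisely because it models each pixel's background appearance over time, which is the assumption that unconstrained videos break.

```python
import cv2

cap = cv2.VideoCapture("input.mp4")  # hypothetical input path
# Per-pixel Gaussian-mixture background model; assumes the
# background seen at each pixel is roughly stationary over `history`.
subtractor = cv2.createBackgroundSubtractorMOG2(history=500,
                                                varThreshold=16,
                                                detectShadows=True)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg_mask = subtractor.apply(frame)  # 255 = foreground, 127 = shadow
cap.release()
```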
Given an unconstrained video, no prior knowledge (such as colour or location) about the objects is provided (Lee, Kim, & Grauman, 2011); thus, many approaches adopt saliency-based measures to find the most salient object in the frame, such as (Li, Zheng, Chen, & Zhou, 2017) and (Li, Xia, & Chen, 2018). Another popular approach is spatio-temporal saliency, whose key concern is the object saliency map (Qiu, Gu, Chen, Chen, & Wang, 2007; Guo, 2008; Liu, 2009).