Introduction
Person re-identification (re-id), which aims at identifying a person of interest across different cameras, has become increasingly popular in the community due to its critical role in many surveillance, security and multimedia applications. Currently, major efforts towards this problem focus on the still-image-based scenario, in which each person has only one image available per camera view. Many methods have been developed to either extract discriminative features (Liao et al., 2015; Matsukawa et al., 2013; Satta et al., 2013; Shen et al., 2013) or learn effective distance metrics (Hirzer et al., 2012; Köstinger et al., 2012; Liang et al., 2015; Liao et al., 2015; Yang et al., 2016) for this problem.
In spite of the great research progress achieved on the still-image-based task, real-world re-id performance is hindered by the limited information that can be extracted from a single image. Still-image-based person re-id ignores the temporal information among person images, which leads to poor feature representations of persons. In practical surveillance systems, persons are usually recorded by videos, which means that multiple consecutive frames are available for an individual in each camera's field of view. Thus, it is intuitive to use such sequential images to improve re-id performance, which directly motivates the investigation of video-based person re-id.
Figure 1. An example illustrating that person images are often highly noisy in practical situations
Recently, impressive research progress has been reported in video-based person re-id. However, most existing approaches assume that all images in a sequence are of equal importance, overlooking the differences among frames caused by various kinds of noise. Take the iLIDS-VID dataset (Wang et al., 2014) shown in Figure 1 as an example: person images are frequently corrupted by noise such as object occlusion or background clutter, resulting in highly noisy, unregulated sequences. In our preliminary comparative experiment conducted on the 199 pairs of unregulated person sequences in the iLIDS-VID dataset, the average matching accuracy on the original unregulated video sequences is only 7%, ten percentage points lower than that obtained on filtered clean video sequences.
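To make the matching-accuracy comparison above concrete, the following is a minimal sketch (not the paper's actual protocol) of how rank-1 matching accuracy can be computed for sequence-level re-id. It assumes each sequence is represented by average pooling its per-frame feature vectors and that probe and gallery sequences with the same index share the same identity; the feature dimensions and noise levels are illustrative only.

```python
# Hypothetical sketch: rank-1 matching accuracy for video-based re-id,
# with each sequence represented by the mean of its frame features.
# All data here is synthetic; it only illustrates the evaluation metric.
import numpy as np

rng = np.random.default_rng(0)

def sequence_feature(frames):
    """Average pooling over per-frame feature vectors (frames: [T, D])."""
    return np.mean(frames, axis=0)

def rank1_accuracy(probe_feats, gallery_feats):
    """Fraction of probes whose nearest gallery entry (Euclidean
    distance) has the same index, i.e. the same identity."""
    correct = 0
    for i, p in enumerate(probe_feats):
        dists = np.linalg.norm(gallery_feats - p, axis=1)
        if np.argmin(dists) == i:
            correct += 1
    return correct / len(probe_feats)

# Toy setup: 5 identities, 8 frames per sequence, 16-D features.
identities = [rng.normal(size=16) for _ in range(5)]
gallery = np.stack([sequence_feature(v + 0.1 * rng.normal(size=(8, 16)))
                    for v in identities])
probe = np.stack([sequence_feature(v + 0.1 * rng.normal(size=(8, 16)))
                  for v in identities])
print(rank1_accuracy(probe, gallery))
```

Under this view, noisy frames (occlusion, background clutter) perturb the pooled sequence feature and degrade the nearest-neighbor match, which is the effect the preliminary experiment quantifies.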
Table 1. Occlusion condition investigation
| Dataset   | Occlusion | Long-Term Occlusion | Temporary Occlusion |
|-----------|-----------|---------------------|---------------------|
| ETHZ      | 45.21%    | 15.15%              | 84.85%              |
| OTB       | 58.05%    | 10.03%              | 89.97%              |
| iLIDS-VID | 66.33%    | 10.05%              | 89.95%              |