Tracking Persons: A Survey

Christine Leignel, Jean-Michel Jolion
DOI: 10.4018/978-1-61692-857-5.ch002


This chapter surveys methods for tracking in video sequences, with a focus on tracking persons. We introduce three main approaches. First, we present graph-based tracking, in which the sequence of tracked objects is embodied in a graph structure. Then we introduce tracking based on features extracted from the images and on matching with a model. We survey the main primitives and emphasize the approaches based on 2D and 3D body models. We present the particular case of tracking in a network of cameras with the particle filtering method. Finally, as a generalization, we compare single-camera and stereo approaches.
Chapter Preview


Interest in computer vision is growing as the cost of increasingly capable new technology falls. Video streams, traditionally monitored by a human operator, are gradually being processed automatically, either to detect abnormal events or to track a person in a scene for teleconferencing applications.

The pioneers in the field of tracking people are (Siebel, 2003), (O’Rourke & Badler, 1980) and (Hogg, 1983). In the area of surveillance, there are many tracking algorithms. The first step in any image sequence processing system for tracking people is to detect the moving regions in the image. Such regions are classified as individuals, groups, or other classes of objects, and are organized in a tracking graph to facilitate the tracking of individuals over a long period (for example, when persons join or leave a group). We can classify the tracking methods into six categories:

  • Category 1: region-based or "blob"-based trackers (a blob being a set of connected pixels grouped according to a criterion), sometimes without a model, relying on color, texture, point primitives, and contours ((Bremond, 1997), (Cai, Mitiche, & Aggarwal, 1995), (Khan, Javed, Rasheed, & Shah, 2001), (Lipton, Fujiyoshi, & Patil, 1998), (Wren, Azarbayejani, Darrell, & Pentland, 1997));

  • Category 2: methods using a 2D appearance model of the human body ((Baumberg, 1995), (Haritaoglu, Harwood, & Davis, 2000), (Johnson, 1998)), with or without an explicit model of the shape;

  • Category 3: methods with a 3D articulated model ((Gavrila & Davis, 1996), (Sidenbladh, Black, & Fleet, 2000));

  • Category 4: methods based on background subtraction ((Haritaoglu, Harwood, & Davis, 1998), (Wren, Azarbayejani, Darrell, & Pentland, 1997)). The system can be made more robust in textured environments by combining color, texture, and motion to segment the foreground;

  • Category 5: temporal differencing over two or three images (Anderson, Burt, & Van Der Wal, 1985), which yields a binary motion map (as in category 4) whose motion pixels are grouped into "blobs" ((Haritaoglu, Harwood, & Davis, 2000), (Jabri, Duric, Wechsler, & Rosenfeld, 2000), (Zhao, Nevatia, & Lv, 2001)). The movements and interactions of individuals are obtained by tracking the blobs;

  • Category 6: complementary to category 5, the differential approach estimates the velocity field at every point of the image and is also referred to as motion detection. It computes the velocity vector in the scene under the assumption that intensity is invariant between t and t+d. It defines an error function called the DFD (Displaced Frame Difference) and seeks to minimize it over all points of the image at time t. This family includes the optical flow methods (Barron, Fleet, & Beauchemin, 1994). Estimating motion by optical flow, in terms of the spatial and temporal variations of the intensity function, is a way of understanding the movement in a scene. Motion detection highlights the moving regions in the current image.
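
Categories 5 and 6 can be illustrated with a minimal Python sketch. It is not taken from any of the cited systems: the function names (`motion_mask`, `label_blobs`, `dfd_energy`, `estimate_translation`), the threshold value, the 4-connectivity rule, and the restriction to a single global integer translation are all assumptions made for illustration. Frames are plain lists of grayscale rows.

```python
from collections import deque


def motion_mask(prev, curr, threshold=20):
    """Temporal difference: 1 where the intensity change exceeds the threshold."""
    return [[1 if abs(c - p) > threshold else 0 for p, c in zip(prow, crow)]
            for prow, crow in zip(prev, curr)]


def label_blobs(mask):
    """Group 4-connected motion pixels into blobs (lists of (row, col) pixels)."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    blobs = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            queue, blob = deque([(y, x)]), []
            seen[y][x] = True
            while queue:  # breadth-first flood fill from the seed pixel
                cy, cx = queue.popleft()
                blob.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            blobs.append(blob)
    return blobs


def dfd_energy(prev, curr, dy, dx):
    """Mean squared displaced frame difference for one candidate displacement.

    Assumption: every pixel moves by the same integer vector (dy, dx).
    Pixels whose displaced position leaves the image are ignored.
    """
    h, w = len(prev), len(prev[0])
    total, count = 0, 0
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                total += (curr[ny][nx] - prev[y][x]) ** 2
                count += 1
    return total / count if count else float("inf")


def estimate_translation(prev, curr, radius=2):
    """Exhaustive search for the displacement that minimizes the DFD energy."""
    candidates = [(dy, dx)
                  for dy in range(-radius, radius + 1)
                  for dx in range(-radius, radius + 1)]
    return min(candidates, key=lambda d: dfd_energy(prev, curr, *d))
```

Optical flow methods estimate a velocity vector per pixel rather than one global vector; the exhaustive search here only shows what minimizing the DFD means in the simplest possible setting.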

Key Terms in this Chapter

Graphical Model: A graph is composed of nodes connected by links. In a probabilistic graphical model, each node represents a random variable, and the links represent probabilistic relationships between these variables.

Particle Filtering: Particle filtering is a sampling-based method that makes inference tractable by approximating the posterior distribution with a set of weighted samples (particles).
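
As a rough illustration of particle filtering, the following Python sketch tracks a one-dimensional position. It is a generic bootstrap filter, not the method of any particular cited work; the random-walk motion model, the Gaussian observation model, the noise levels, the uniform initialization range, and the names `particle_filter_step` and `track` are all assumptions.

```python
import math
import random


def particle_filter_step(particles, observation, rng,
                         process_std=0.2, obs_std=0.5):
    """One predict / weight / resample cycle of a bootstrap particle filter."""
    # Predict: move each particle with the (assumed) random-walk motion model.
    predicted = [p + rng.gauss(0.0, process_std) for p in particles]

    # Weight: Gaussian likelihood of the observation given each particle.
    weights = [math.exp(-0.5 * ((observation - p) / obs_std) ** 2)
               for p in predicted]
    total = sum(weights)
    if total == 0.0:  # all likelihoods underflowed: fall back to uniform weights
        weights = [1.0] * len(predicted)
        total = float(len(predicted))
    weights = [w / total for w in weights]

    # Estimate: weighted mean of the particle set.
    estimate = sum(w * p for w, p in zip(weights, predicted))

    # Resample: draw particles with probability proportional to their weight.
    resampled = rng.choices(predicted, weights=weights, k=len(predicted))
    return resampled, estimate


def track(observations, n_particles=500, seed=0, lo=0.0, hi=10.0):
    """Run the filter over a sequence of observations; returns one estimate per step."""
    rng = random.Random(seed)
    # Assumption: no prior knowledge, so particles start uniformly in [lo, hi].
    particles = [rng.uniform(lo, hi) for _ in range(n_particles)]
    estimates = []
    for z in observations:
        particles, estimate = particle_filter_step(particles, z, rng)
        estimates.append(estimate)
    return estimates
```

Calling `track([5.0] * 20)` returns one position estimate per observation; in a person tracker the state would typically be a pose or a bounding box rather than a scalar.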

Intelligent Agents: Intelligent agents are independent modules that combine information from several cameras (or several levels of information) and incrementally construct the model of the scene.

Bayesian Network: A Bayesian network is a directed graphical model, which is useful when one wants to express causal relationships between variables.

Occlusion: In outdoor scenes, objects can be occluded by external elements such as trees and buildings.

Scene Analysis: Scene analysis builds a model of the scene that is useful for re-identifying people from one camera to another.

Weak Limbs: A body model with weak limbs is composed of limbs that are not rigidly connected but are instead attracted to one another.

Re-Identification: A network of cameras has to deal with the problem of re-identifying people as they move from one camera to another. People tracking and re-identification are often more complex than object tracking because people are articulated and move freely.

3D/2D Articulated Model: 3D models represent the articulated structure in three dimensions, removing the pose-dependent ambiguities of 2D models.

Belief Propagation: The belief propagation algorithm performs exact inference on directed graphs without loops.
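
To make the belief propagation entry concrete, here is a minimal Python sketch of sum-product message passing on the simplest loop-free graph, a chain of discrete variables. The potential tables (`unary`, `pairwise`), the function names, and the assumption that every variable has the same number of states are illustrative choices, not part of the chapter. A brute-force enumeration is included only to check that the messages give exact marginals.

```python
from itertools import product


def chain_marginals(unary, pairwise):
    """Exact marginals on a chain by forward and backward message passing.

    unary[i][s]       : local evidence for node i being in state s
    pairwise[i][s][t] : compatibility of node i in state s with node i+1 in state t
    """
    n, k = len(unary), len(unary[0])

    # Forward messages: fwd[i] summarizes everything to the left of node i.
    fwd = [[1.0] * k for _ in range(n)]
    for i in range(1, n):
        for t in range(k):
            fwd[i][t] = sum(fwd[i - 1][s] * unary[i - 1][s] * pairwise[i - 1][s][t]
                            for s in range(k))

    # Backward messages: bwd[i] summarizes everything to the right of node i.
    bwd = [[1.0] * k for _ in range(n)]
    for i in range(n - 2, -1, -1):
        for s in range(k):
            bwd[i][s] = sum(pairwise[i][s][t] * unary[i + 1][t] * bwd[i + 1][t]
                            for t in range(k))

    # Belief at each node: incoming messages times local evidence, normalized.
    marginals = []
    for i in range(n):
        belief = [fwd[i][s] * unary[i][s] * bwd[i][s] for s in range(k)]
        z = sum(belief)
        marginals.append([b / z for b in belief])
    return marginals


def brute_force_marginals(unary, pairwise):
    """Reference marginals obtained by enumerating every joint configuration."""
    n, k = len(unary), len(unary[0])
    totals = [[0.0] * k for _ in range(n)]
    for states in product(range(k), repeat=n):
        weight = 1.0
        for i, s in enumerate(states):
            weight *= unary[i][s]
        for i in range(n - 1):
            weight *= pairwise[i][states[i]][states[i + 1]]
        for i, s in enumerate(states):
            totals[i][s] += weight
    return [[v / sum(row) for v in row] for row in totals]
```

The message-passing version costs a number of operations linear in the chain length, whereas the enumeration grows exponentially, which is why the absence of loops matters in practice.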
