Tracking and recognition of human motion has become an important research area in computer vision. In real-world conditions it is a difficult problem, owing to cluttered backgrounds, gross illumination variations, occlusions, self-occlusions, varying clothing and multiple moving objects. These ill-posed problems are usually tackled by making simplifying assumptions regarding the scene or by imposing constraints on the motion. Constraints such as high contrast between the moving people and the background, or a scene that is static except for the target person, are often introduced in order to achieve accurate segmentation. Moreover, the motion of the target person is often confined to simple movements with limited occlusions. In addition, assumptions such as a known initial position and posture of the person are usually imposed in tracking processes.
The first step towards human tracking is the segmentation of human figures from the background. This problem is usually addressed by exploiting the temporal relation between consecutive frames (e.g., background subtraction (Sato & Aggarwal, 2001), optical flow (Okada, Shirai & Miura, 2000)), by modeling the image statistics of human appearance (Wren, Azarbayejani, Darrell & Pentland, 1997), or by exploiting the human shape (Leibe, Seemann & Schiele, 2005). Efficient texture-based methods for modeling the background and detecting moving objects from a video sequence have been developed as well (Heikkila & Pietikainen, 2006), while other recent research copes with the problem of occlusions (Capellades, Doermann, DeMenthon & Chellappa, 2003). The output of the segmentation, which could be edges, silhouettes, blobs, and so forth, forms the basis for feature extraction.
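To make the segmentation step concrete, the following is a minimal sketch of background subtraction with a running-average background model; it is an illustration only, not the specific method of any of the works cited above, and the function names, threshold, and learning rate are hypothetical choices.

```python
import numpy as np

def update_background(background, frame, alpha=0.05):
    """Running-average background model: slowly adapt to scene changes."""
    return (1 - alpha) * background + alpha * frame.astype(float)

def segment_foreground(background, frame, threshold=25):
    """Pixels that differ from the background model by more than
    `threshold` intensity levels are labelled foreground (moving person)."""
    diff = np.abs(frame.astype(float) - background)
    return diff > threshold

# Toy example: a dark static background with a bright moving "blob".
background = np.zeros((10, 10))
frame = background.copy()
frame[3:6, 3:6] = 200                    # the moving object
mask = segment_foreground(background, frame)
print(int(mask.sum()))                   # 9 foreground pixels (the 3x3 blob)
```

The binary mask produced here would then be post-processed (e.g., into silhouettes or blobs) before feature extraction; in practice per-pixel statistical models are more robust to illumination change than a single running average.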
Feature correspondence is established in order to track the subject. Tracking through consecutive frames commonly incorporates prediction of movement, which ensures continuity of motion, especially when some body parts are occluded. For example, when a person is walking there are moments when one leg occludes the other. Furthermore, there are scenes with multiple persons occluding one another. Depending on the scene and the chosen methodology, some techniques try to determine the precise movement of each body part (Sidenbladh, Black, & Sigal, 2002), while others focus on tracking the human body as a whole (Okada, Shirai & Miura, 2000). Tracking may be classified as 2D or 3D. 2D tracking follows the motion in the image plane, either by exploiting low-level image features or by using a 2D human model. 3D tracking aims at obtaining the parameters that describe body motion in three dimensions. The 3D tracking process, which estimates the motion of the body parts, is inherently connected to 3D human pose recovery.
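The prediction-then-association step described above can be sketched with a constant-velocity motion model and nearest-neighbour data association. This is a simplified stand-in for the Kalman or particle filters used in the cited work; the state layout and helper names are assumptions for illustration.

```python
import numpy as np

def predict_state(x, dt=1.0):
    """Constant-velocity prediction: state x = [px, py, vx, vy]."""
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1,  0],
                  [0, 0, 0,  1]], dtype=float)
    return F @ x

def nearest_match(prediction, detections):
    """Associate the predicted position with the closest detection,
    which keeps the track coherent through brief occlusions."""
    dists = [np.linalg.norm(prediction[:2] - d) for d in detections]
    return int(np.argmin(dists))

x = np.array([10.0, 20.0, 2.0, -1.0])     # at (10, 20), moving (2, -1)/frame
x_pred = predict_state(x)                  # predicted position (12, 19)
detections = [np.array([30.0, 5.0]),       # another person
              np.array([12.5, 18.5])]      # our subject, slightly off-prediction
print(nearest_match(x_pred, detections))   # 1: the nearby detection is chosen
```

When the subject is fully occluded for a few frames, the same prediction step can simply be iterated without an update, which is what provides the continuity of motion mentioned above.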
3D pose recovery aims at determining the configuration of the body parts in 3D space and estimating the orientation of the body with respect to the camera. This work will mainly focus on model-based techniques, since they are the ones most commonly used for 3D reconstruction. Model-based techniques rely on a mathematical representation of human body structure and motion dynamics. The 3D pose parameters are commonly estimated by iteratively matching a set of image features extracted from the current frame with the projection of the model onto the image plane. Thus, 3D pose parameters are determined by means of an energy minimization process.
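The project-match-minimize loop can be illustrated on a deliberately tiny problem: a single limb with one pose parameter (its angle), a toy orthographic projection, and a coarse-to-fine search standing in for the gradient-based minimizers used in practice. All names and the optimization scheme here are illustrative assumptions, not a description of any cited system.

```python
import numpy as np

def project_limb(theta, length=1.0):
    """Project a one-segment limb model onto the image plane
    for pose parameter `theta` (toy orthographic projection)."""
    return np.array([length * np.cos(theta), length * np.sin(theta)])

def energy(theta, observed):
    """Squared distance between the projected model and the image feature."""
    residual = project_limb(theta) - observed
    return float(residual @ residual)

def fit_pose(observed, iters=30):
    """Minimize the energy by iteratively refining the search interval
    around the best candidate angle (coarse-to-fine search)."""
    lo, hi = 0.0, np.pi
    best = lo
    for _ in range(iters):
        thetas = np.linspace(lo, hi, 9)
        best = thetas[np.argmin([energy(t, observed) for t in thetas])]
        span = (hi - lo) / 8
        lo, hi = best - span, best + span
    return best

observed = project_limb(np.pi / 3)   # synthetic "extracted image feature"
theta_hat = fit_pose(observed)
print(round(theta_hat, 3))           # ≈ 1.047, i.e. the true angle π/3
```

A full articulated model repeats this over dozens of joint parameters with a perspective camera, which is why robust initialization and motion priors matter so much in the real systems discussed here.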
Rather than recovering the exact configuration of the human body, human motion recognition aims to identify the action performed by a moving person. Most of the proposed techniques focus on identifying actions belonging to the same category. For example, the objective could be to recognize several aerobic exercises or tennis strokes, or some everyday actions such as sitting down, standing up, walking, running, or skipping.
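One common way to compare a motion against a set of known actions, within a single category such as the everyday actions above, is nearest-template matching under dynamic time warping (DTW), which tolerates actions performed at different speeds. The sketch below is a generic illustration with hypothetical feature sequences, not a method drawn from the works surveyed here.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def recognise(sequence, templates):
    """Label a motion by its nearest action template under DTW."""
    return min(templates, key=lambda label: dtw_distance(sequence, templates[label]))

# Toy per-frame feature (e.g., vertical hip position) for two actions.
templates = {"sitting down": [5, 4, 3, 2, 1],
             "standing up": [1, 2, 3, 4, 5]}
observed = [5, 5, 4, 2, 1, 1]            # a slower "sitting down"
print(recognise(observed, templates))    # prints "sitting down"
```

Real systems replace the scalar per-frame feature with richer descriptors (silhouette moments, joint angles) and the nearest-template rule with statistical models such as HMMs, but the sequence-alignment idea is the same.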
Next, some of the most recent approaches addressing human motion tracking and 3D pose recovery are presented, while the following subsection introduces some whole-body human motion recognition techniques. Previous surveys of vision-based human motion analysis have been carried out by Cédras and Shah (1995), Aggarwal and Cai (1999), Gavrila (1999), Moeslund and Granum (2001), and Moeslund, Hilton, and Kruger (2006).