Facial feature tracking has received considerable attention in recent decades for its applications in video gaming, man-machine interaction, model-based coding, and numerous other disciplines. Analyzing a head-and-shoulders sequence remains one of the most needed, yet challenging, problems in video processing today. Even so, years of research have reduced the problem of facial analysis to a numerical optimization in which classical search methods are often used. See Pearson (1995) for an excellent review of the evolution of model-based coding and Pandzic et al. (2003) for the current state of the art.
Much of the attention given to facial feature tracking has been directed toward model-based facial coding (MBFC). In model-based coding, the idea is to analyze an object (i.e., the face) and send high-level information about the animation of the object, such as movement and rotation, instead of sending raw video. The decoder uses its knowledge of the model to render the appropriate animation. The face, however, is a particularly difficult object to model because it requires not only rigid-body motion but also deformations to create expression. Modern facial models can recreate a human likeness with remarkable clarity. Figure 1 shows three reconstructed facial models and the original frames.
Figure 1. Selected frames of a model-based video. Top: the original video frames. Bottom: frames synthesized by a model-based coder. The frames are generated using a facial model; they are rendered sequences derived from stretching a facial “texture” over the three-dimensional facial wireframe. See Eisert (2000).
Using a model to transmit video opens the possibility of transmission at dynamic bandwidths (i.e., sending video at different bandwidths depending on the conditions of the transmission channel). In transform-based coding, dynamic bandwidth means increasing or decreasing the compression thresholds, reducing video quality when compression is high. In MBFC, dynamic bandwidth means sending a varying number of facial parameters, resulting in less movement of the model at high compression but not necessarily lower quality.
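The rate-adaptation idea above can be sketched as follows. This is an illustrative example, not part of any cited system: the parameter vector, the per-parameter bit cost, and the greedy largest-change selection rule are all assumptions made for the sketch.

```python
# Illustrative sketch (an assumption, not a cited coder): under a tighter bit
# budget the encoder transmits fewer (index, value) parameter pairs, so the
# model moves less at high compression, but each transmitted parameter is
# rendered at full quality by the decoder.

def select_parameters(deltas, budget_bits, bits_per_param=24):
    """Keep only the largest-magnitude parameter changes that fit the budget."""
    k = max(0, budget_bits // bits_per_param)
    # Rank parameter indices by how much each parameter changed this frame.
    ranked = sorted(range(len(deltas)), key=lambda i: abs(deltas[i]), reverse=True)
    return {i: deltas[i] for i in ranked[:k]}

frame_deltas = [0.01, -0.40, 0.00, 0.25, -0.02, 0.33]  # hypothetical per-frame changes
payload = select_parameters(frame_deltas, budget_bits=72)  # budget fits 3 parameters
```

At a generous budget the payload approaches the full parameter set; as the channel degrades, only the dominant motions survive, which matches the claim that high compression reduces model movement rather than rendered quality.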
This application promises to revolutionize video telephony and teleconferencing by drastically reducing the bandwidth required for transmission (Eisert, 2003). But there is no free lunch: the bandwidth reductions in MBFC must be paid for in computational analysis. Despite the advances in facial analysis, current work remains limited in several respects. Before MBFC can be adequately implemented, the following factors must be addressed:
The limitations of gradient-based optimization. This type of analysis, while showing promise for real-time implementation, inherently relies on a gradient approximation. This approximation limits the problem scope to facial sequences containing movements represented in the training of the gradient approximation (i.e., small head movements).
The use of static facial parameters. Current algorithms either use a hand-selected set of animation parameters on the face or use all facial animation parameters. It is unclear whether bandwidth could be further reduced using a dynamic set of animation parameters, or which set of parameters adequately represents all facial animation sequences.
The prohibitive use of computationally complex algorithms. Real-time analysis of head-and-shoulders sequences with direct methods has been achieved only by reducing the number of animation parameters optimized. The resulting frames are not of high enough quality to be realistically rendered.
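The gradient-based analysis described in the first limitation can be sketched with a toy linear model. Everything below is an assumed, simplified setup (a random Jacobian standing in for a trained gradient approximation, and a linear image-formation function), not the algorithm of any cited work; it only shows why such analysis reduces to a least-squares solve, and why validity is confined to the small motions the Jacobian was trained on.

```python
# Minimal sketch (an assumption, not a cited algorithm): a fixed, pre-trained
# Jacobian J linearly maps small parameter changes to image-intensity changes,
#     residual = I_target - I(p) ≈ J @ delta_p,
# so each frame is analyzed by one least-squares solve. Because J is estimated
# from training data containing only small motions, large head movements fall
# outside its valid linear range.
import numpy as np

rng = np.random.default_rng(0)
J = rng.standard_normal((50, 4))          # 50 pixels, 4 animation parameters
p_true = np.array([0.2, -0.1, 0.05, 0.3])  # hypothetical true parameters
image = lambda p: J @ p                    # toy linear image-formation model

p = np.zeros(4)                            # initial parameter estimate
residual = image(p_true) - image(p)        # difference image for this frame
delta_p, *_ = np.linalg.lstsq(J, residual, rcond=None)
p += delta_p                               # one analysis step per frame
```

In this exactly linear toy, a single step recovers the true parameters; for a real, nonlinear face model the same step is only a local approximation, which is precisely why the method is restricted to small head movements.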