Graphical Models for Representation and Recognition of Human Actions


Pradeep Natarajan (BBN Technologies, USA) and Ramakant Nevatia (University of Southern California, USA)
Copyright: © 2010 |Pages: 24
DOI: 10.4018/978-1-60566-900-7.ch003


Building a system for recognition of human actions from video involves two key problems: 1) designing low-level features that are both efficient to extract from videos and capable of distinguishing between events, and 2) developing a representation scheme that can bridge the large gap between low-level features and high-level event concepts, while also handling the uncertainty and errors inherent in any low-level video processing. Graphical models provide a natural framework for representing state transitions in events as well as the spatio-temporal constraints between actors and events. Hidden Markov models (HMMs) have been widely used in action recognition applications, but the basic representation has three key deficiencies: unrealistic models for the duration of a sub-event, no direct encoding of interactions among multiple agents, and no modeling of the inherent hierarchical organization of activities. Several extensions have been proposed to address one or more of these issues and have been successfully applied in various gesture and action recognition domains. More recently, conditional random fields (CRFs) have become increasingly popular, since they allow complex potential functions for modeling observations and state transitions, and also outperform HMMs when sufficient training data is available. The authors first review the various extensions of these graphical models, then present the theory of inference and learning in them, and finally discuss their applications in various domains.
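To make the HMM machinery referred to above concrete, the following is a minimal sketch of the standard forward algorithm, which computes the likelihood of an observation sequence under an HMM. The two-state model and all probability values here are hypothetical, chosen only for illustration; real action-recognition systems would use states corresponding to sub-events and emissions over extracted video features.

```python
# Minimal sketch of the HMM forward algorithm on a toy two-state
# model with two discrete observation symbols. All numbers below
# are hypothetical, for illustration only.

def forward(obs, pi, A, B):
    """Return P(obs | model) for an HMM with initial distribution pi,
    transition matrix A, and emission matrix B (states x symbols)."""
    n_states = len(pi)
    # Initialization: alpha[i] = pi[i] * B[i][obs[0]]
    alpha = [pi[i] * B[i][obs[0]] for i in range(n_states)]
    # Induction: sum over predecessor states at each time step
    for t in range(1, len(obs)):
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n_states)) * B[j][obs[t]]
            for j in range(n_states)
        ]
    # Termination: total probability of the observation sequence
    return sum(alpha)

# Toy model: states 0/1, observation symbols 0/1
pi = [0.6, 0.4]
A = [[0.7, 0.3],
     [0.4, 0.6]]
B = [[0.9, 0.1],
     [0.2, 0.8]]

print(forward([0, 1, 0], pi, A, B))
```

The same recursion underlies Viterbi decoding (replace the sum over predecessors with a max), which is what recognition systems typically use to recover the most likely state sequence.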
Chapter Preview


Systems for automated recognition of human gestures and actions are needed for a number of applications in human-computer interaction, including assistive technologies and intelligent environments. While the vision of semantic analysis and recognition of actions is compelling, there are several difficult challenges to overcome. These include bridging the gap between low-level sensor readings and high-level semantic concepts, modeling the uncertainties in the observed data, minimizing the requirement for large training sets, which are difficult to obtain, and accurately segmenting and recognizing actions from a continuous stream of observations in real time.

Several probabilistic models have been proposed over the years in various communities for activity recognition. (Intille & Bobick, 2001) demonstrates recognition of structured, multi-person actions by integrating goal-based primitives and temporal relationships into a Bayesian network (BN). (Chan et al., 2006) uses dynamic Bayesian networks (DBNs) to simultaneously link broken trajectories and recognize complex events. (Moore & Essa, 2002) uses stochastic context-free grammars (SCFGs) for activity representation and demonstrates the approach for recognizing player strategies in multi-player card games. (Ryoo & Aggarwal, 2006) also uses context-free grammars (CFGs) and presents a layered approach which first recognizes pose, then gestures and finally the atomic and complex events with the output of the lower levels fed to the higher levels.

While each of these formalisms has been successfully applied in various domains, hidden Markov models (HMMs) and their extensions have by far been the most popular in activity recognition. Besides their simplicity, they also have well-understood learning and inference algorithms, making them well suited to a wide range of applications. For example, (Starner, Weaver, & Pentland, 1998) recognizes complex gestures in American sign language (ASL) by modeling the actions of each hand with HMMs. (Vogler & Metaxas, 1999) introduces parallel hidden Markov models (PaHMM), also for recognizing ASL gestures, by modeling each hand's action as an independent HMM. In contrast, (Brand, Oliver, & Pentland, 1997) introduces coupled hidden Markov models (CHMM) to explicitly model multiple interacting processes and demonstrates them for recognizing tai-chi gestures. (Bui, Phung, & Venkatesh, 2004) adopts the hierarchical hidden Markov model (HHMM) for monitoring daily activities at multiple levels of abstraction. The abstract hidden Markov model (AHMM) (Bui, Venkatesh, & West, 2002) describes a related extension where a hierarchy of policies decides the action at any instant. (Hongeng & Nevatia, 2003) explores the use of explicit duration models using hidden semi-Markov models (HSMMs) to recognize video events and also presents an algorithm to reduce inference complexity under certain assumptions. (Duong, Bui, Phung, & Venkatesh, 2005) presents the switching hidden semi-Markov model (S-HSMM), a two-layered extension of the HSMM, and applies it to activity recognition and abnormality detection. More recently, (Natarajan & Nevatia, 2007a) combines these various extensions in a unified framework to simultaneously model hierarchy, duration and multi-channel interactions.
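The motivation for the explicit-duration HSMM work above can be seen with a one-line calculation: in a plain HMM, a state with self-transition probability a remains occupied for exactly d steps with probability a^(d-1)(1-a), a geometric distribution that always peaks at d = 1. This is a poor fit for sub-events whose typical duration is well above one frame. A short sketch, with a hypothetical self-transition probability:

```python
# Why plain HMMs motivate HSMMs: state durations in an HMM are
# geometrically distributed and always peak at d = 1, regardless
# of how long the modeled sub-event actually tends to last.
# The self-transition probability a = 0.8 is hypothetical.

def hmm_duration_pmf(a, d):
    """P(state duration == d) given self-transition probability a."""
    return (a ** (d - 1)) * (1 - a)

# Monotonically decreasing: 0.2, 0.16, 0.128, 0.1024, 0.08192
pmf = [hmm_duration_pmf(0.8, d) for d in range(1, 6)]
print(pmf)
```

An HSMM instead attaches an arbitrary duration distribution (e.g. Gaussian or empirical) to each state, at the cost of more expensive inference, which is what the cited complexity-reduction work addresses.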

In contrast to the generative HMMs, discriminative models like conditional random fields (CRFs) are becoming increasingly popular due to their flexibility and improved performance, especially when a large amount of labeled training data is available. (Sminchisescu, Kanaujia, Li, & Metaxas, 2005) applies CRFs to contextual motion recognition and shows encouraging results. (Wang, Quattoni, Morency, Demirdjian, & Darrell, 2006) introduces hidden conditional random fields (HCRFs), a two-layer extension of the basic CRF framework, for recognizing segmented gestures. Latent dynamic conditional random fields (LDCRFs) (Morency, Quattoni, & Darrell, 2007) extend HCRFs further to recognize gestures from a continuous, unsegmented stream. More recently, (Natarajan & Nevatia, 2008a) embeds complex shape and flow features into a CRF and demonstrates encouraging results for recognizing human actions in cluttered scenes with dynamic backgrounds.
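The core idea behind the CRF models above is that P(y | x) is defined directly, as proportional to the exponentiated sum of transition and observation potentials, with no generative model of x. The following is a minimal linear-chain sketch using brute-force normalization (tractable only for toy sizes; real systems use a forward-style dynamic program). The labels, features, and weights are all hypothetical, for illustration only.

```python
from itertools import product
from math import exp

# Minimal linear-chain CRF sketch: P(y | x) is proportional to
# exp(sum of transition and observation potentials). The labels
# and weights below are hypothetical, chosen for illustration.

def score(y, x, trans, emit):
    """Unnormalized log-score of label sequence y for observations x."""
    s = sum(emit[(y[t], x[t])] for t in range(len(x)))
    s += sum(trans[(y[t - 1], y[t])] for t in range(1, len(x)))
    return s

def crf_prob(y, x, labels, trans, emit):
    """P(y | x) via brute-force normalization (fine for toy sizes)."""
    Z = sum(exp(score(yp, x, trans, emit))
            for yp in product(labels, repeat=len(x)))
    return exp(score(y, x, trans, emit)) / Z

labels = ("walk", "run")
# Transition potentials favor staying in the same label
trans = {(a, b): (1.0 if a == b else 0.0) for a in labels for b in labels}
# Observation potentials tie labels to a toy speed feature
emit = {("walk", "slow"): 2.0, ("walk", "fast"): 0.0,
        ("run", "slow"): 0.0, ("run", "fast"): 2.0}

x = ("slow", "slow", "fast")
print(crf_prob(("walk", "walk", "run"), x, labels, trans, emit))
```

Because normalization is over whole label sequences given x, the potentials can be arbitrary functions of the full observation sequence, which is the flexibility the chapter contrasts with generative HMMs.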
