Generative Group Activity Analysis with Quaternion Descriptor


Guangyu Zhu (National University of Singapore, Singapore), Shuicheng Yan (National University of Singapore, Singapore), Tony X. Han (University of Missouri, USA) and Changsheng Xu (Chinese Academy of Sciences, China)
DOI: 10.4018/978-1-4666-1891-6.ch009


Activity understanding plays an essential role in video content analysis and remains a challenging open problem. Most previous research is limited either by the use of excessively localized features that do not sufficiently encapsulate the interaction context, or by a focus on purely discriminative models that ignore the interaction patterns altogether. In this chapter, a new approach is proposed to recognize human group activities. First, the authors design a new quaternion descriptor that describes the interactive aspects of activities in terms of appearance, dynamics, causality, and feedback, respectively. The designed descriptor, along with the conventional velocity and position features, is capable of delineating the individual and pairwise interactions in the activities. Second, considering both activity category and interaction variety, the authors propose an extended pLSA (probabilistic Latent Semantic Analysis) model with two hidden variables. This extended probabilistic graphical model, constructed on the quaternion descriptors, facilitates effective inference of activity categories as well as the exploration of activity interaction patterns. Extensive experiments on realistic movie and human group activity datasets validate that the multilevel features are effective for representing activity interactions and demonstrate that the graphical model is a promising paradigm for activity recognition.
Chapter Preview


Video-based human activity analysis is one of the most promising applications of computer vision and pattern recognition. Turaga et al. (2008) presented a recent survey of the major approaches pursued over the last two decades. A large amount of the existing work on this problem has focused on the relatively simple activities of a single person (Laptev, 2003; Liu, 2009; Niebles, 2008; Schuldt, 2004; Wang, 2009), e.g., sitting, walking, and hand-waving, and has achieved notable success. In recent years, the recognition of group activities with multiple participants (e.g., fighting and gathering) has been attracting increasing interest (Marszalek, 2009; Ni, 2009; Ryoo, 2007; Zhou, 2008) from both academia and industry.

According to the definition given by Turaga et al. (2008), an activity refers to a complex sequence of actions performed by several objects that may be interacting with each other; the interactions among the participants thus reflect the elementary characteristics of different activities. An effective interaction descriptor is therefore essential for developing sophisticated approaches to activity recognition. Most previous research stems from local representations in image processing. As shown in Figure 1(a), the common practice for constructing a local representation is to extract pattern descriptors (e.g., SIFT (Lowe, 2004)) at spatially salient points and then generate the feature representation (e.g., bag-of-SIFT) using the bag-of-words strategy. This successful paradigm has been naturally extended to video processing by extracting pattern descriptors at spatio-temporal salient points (Laptev, 2003; Liu, 2009; Marszalek, 2009; Niebles, 2008; Schuldt, 2004). Although these widely used local descriptors have been demonstrated to support the recognition of activities in scenes with occlusions and dynamic cluttered backgrounds, they represent only appearance and motion patterns. An effective feature descriptor for activity recognition should be able to describe the video in terms of object appearance and dynamic motion as well as interactive properties.
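To make the bag-of-words step concrete, the following minimal sketch (hypothetical code, not from the chapter; the function and variable names are illustrative) quantizes a set of local descriptors against a pre-built codebook and forms a normalized word-count histogram:

```python
import numpy as np

def bag_of_words(descriptors, codebook):
    """Assign each local descriptor to its nearest codeword and
    return a normalized word-count histogram over the codebook."""
    # squared Euclidean distance from each descriptor to each codeword
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)  # index of the nearest codeword per descriptor
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()   # normalize so the histogram sums to one

# toy example: six 2-D "descriptors" quantized against three codewords
rng = np.random.default_rng(0)
desc = rng.normal(size=(6, 2))
book = np.array([[0.0, 0.0], [2.0, 2.0], [-2.0, -2.0]])
h = bag_of_words(desc, book)
```

In practice the codebook would be learned by clustering (e.g., k-means) over descriptors pooled from training videos; here it is fixed for illustration.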

Figure 1.

Comparison of representation and modeling for image and video analysis: (a) Image representation and modeling; (b) Video representation and modeling

Given an activity descriptor, how to classify the activity category from the resulting feature representation is another key issue in activity recognition. Two types of approach are widely used: those based on generative models (Niebles, 2008; Wang, 2009) and those based on discriminative models (Laptev, 2003; Liu, 2009; Marszalek, 2009; Ni, 2009; Ryoo, 2007; Schuldt, 2004; Zhou, 2008). Considering the mechanism of human perception of group activity, the interactions between objects are first distinguished and then synthesized into the activity recognition result. Although discriminative models have been extensively employed because they are much easier to build, their construction essentially focuses on the differences among the activity classes while ignoring the interactive properties involved. Therefore, discriminative models cannot facilitate interaction analysis or reveal the interactive relations within the activities. In this chapter, we first investigate how to effectively represent video activities in the interaction context.

Figure 1(b) briefly illustrates the extraction of the new feature descriptor, namely the quaternion descriptor, in video processing. The quaternion descriptor consists of four types of components, covering the appearance, individual dynamics, pairwise causality, and feedback of the active objects in the video, respectively. These components describe the appearance and motion patterns and also encode the interaction properties of the activities. Using the bag-of-words method, the video is represented as a compact bag-of-quaternion feature vector. To recognize the activity category and facilitate the exploration of interaction patterns, we then propose to model and classify the activities in a generative framework based on an extended pLSA model. Interactions are modeled within the generative framework, which is able to explicitly infer the activity patterns.
