Action Detection by Fusing Hierarchically Filtered Motion with Spatiotemporal Interest Point Features

YingLi Tian (City University of New York, USA), Liangliang Cao (IBM – T. J. Watson Research Center, USA), Zicheng Liu (Microsoft Research, USA) and Zhengyou Zhang (Microsoft Research, USA)
DOI: 10.4018/978-1-4666-3682-8.ch012


This chapter addresses the problem of action detection from cluttered videos. In recent years, many feature extraction schemes have been designed to describe various aspects of actions. However, due to the difficulties of action detection, e.g., cluttered backgrounds and potential occlusions, a single type of feature cannot effectively solve the action detection problem in cluttered videos. In this chapter, the authors propose a new type of feature, Hierarchically Filtered Motion (HFM), and further investigate the fusion of HFM with Spatiotemporal Interest Point (STIP) features for action detection from cluttered videos. To detect actions effectively and efficiently, they propose a new approach that combines Gaussian Mixture Models (GMMs) with Branch-and-Bound search to locate actions of interest in cluttered videos. The proposed HFM features and action detection method have been evaluated on the classical KTH dataset and the challenging MSR Action Dataset II, which consists of crowded videos with moving people or vehicles in the background. Experimental results demonstrate that the proposed method significantly outperforms existing techniques, especially for action detection in crowded videos.
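The Branch-and-Bound search mentioned above can be sketched as a best-first search over sets of candidate windows, in the spirit of efficient subwindow search. The sketch below is a minimal illustration under assumed inputs: it takes a 2D grid of per-cell detection scores (e.g., aggregated per-location classifier responses, positive = action-like) and returns the axis-aligned rectangle maximizing the summed score. The scoring grid, the bound, and all names are illustrative assumptions, not the chapter's exact formulation.

```python
import heapq
import numpy as np

def _rect_sum(ii, t, b, l, r):
    """Sum of grid[t:b+1, l:r+1] via a one-padded integral image ii."""
    return ii[b + 1, r + 1] - ii[t, r + 1] - ii[b + 1, l] + ii[t, l]

def best_subwindow(scores):
    """Return (score, (top, bottom, left, right)) maximizing the window sum.

    scores: 2D array of per-cell detection scores (assumed inputs).
    """
    h, w = scores.shape
    pos = np.maximum(scores, 0.0)
    neg = np.minimum(scores, 0.0)
    ii_pos = np.pad(pos, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    ii_neg = np.pad(neg, ((1, 0), (1, 0))).cumsum(0).cumsum(1)

    def upper_bound(T, B, L, R):
        # Admissible bound over a set of rectangles: the largest member
        # contributes all positive mass, the smallest contributes the
        # unavoidable negative mass.
        ub = _rect_sum(ii_pos, T[0], B[1], L[0], R[1])
        if T[1] <= B[0] and L[1] <= R[0]:
            ub += _rect_sum(ii_neg, T[1], B[0], L[1], R[0])
        return ub

    # A state holds an interval for each of the four window coordinates.
    start = ((0, h - 1), (0, h - 1), (0, w - 1), (0, w - 1))
    heap = [(-upper_bound(*start), start)]
    while heap:
        bound, (T, B, L, R) = heapq.heappop(heap)
        if all(iv[0] == iv[1] for iv in (T, B, L, R)):
            # Singleton state: its bound equals the exact window score.
            return -bound, (T[0], B[0], L[0], R[0])
        # Split the widest coordinate interval in half; push both halves.
        ivs = [T, B, L, R]
        i = max(range(4), key=lambda k: ivs[k][1] - ivs[k][0])
        lo, hi = ivs[i]
        mid = (lo + hi) // 2
        for child in ((lo, mid), (mid + 1, hi)):
            new = list(ivs)
            new[i] = child
            # Keep only states that still contain a valid rectangle.
            if new[0][0] <= new[1][1] and new[2][0] <= new[3][1]:
                heapq.heappush(heap, (-upper_bound(*new), tuple(new)))
    return None
```

Because the bound never underestimates the best window in a state, the first singleton popped from the queue is the global optimum, typically after examining far fewer states than exhaustive enumeration of all rectangles.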
Chapter Preview


In the past few years, computer vision researchers have witnessed a surge of interest in human action analysis through videos. Human action recognition, which classifies a video into a predefined action category, was first studied under well-controlled laboratory scenarios, e.g., with clean backgrounds and no occlusions (Schuldt et al., 2004). Later research shows that action recognition is important for analyzing and organizing online videos (Liu et al., 2009). Moreover, action recognition plays a crucial role in building surveillance systems (Hu et al., 2009) and studying customer behaviors. With the growing number of web video clips (e.g., videos on YouTube) and surveillance systems, it has become very important to analyze video actions effectively.

An effective analysis of video actions requires action detection, which answers not only which action happens in a video, but also when and where the action happens in the video sequence. In other words, action detection determines the action category, location, and time in a video sequence, rather than simply assigning a video clip to one of the existing action labels. When a video contains multiple actions, simple action classification will not work. In practice, surveillance videos often contain multiple types of actions, where only action detection can provide meaningful results.

Action detection is a challenging task. As shown in Figure 1, the background is often cluttered, and people in crowds may occlude each other in complex scenes. It is difficult to distinguish actions of interest from other video contents: the appearance of an action of interest may be similar to that of the background, and the motion field of an action might be occluded by other moving objects in the scene. Due to the difficulty of locating human actions, most existing datasets of human actions (Blank et al., 2005; Schuldt et al., 2004) only address the action classification task without detecting the locations of actions; in these datasets, human actions are usually recorded with clean backgrounds, and each video clip mostly involves a single person who repeatedly performs one category of actions throughout the clip.

Figure 1.

Comparing action classification and detection. (a) For a classification task, we only need to estimate the category label for a given video. (b) For an action detection task, we need to estimate not only the category of an action but also the location of the action instance. The bounding box illustrates a desirable detection. It can be seen that the action detection task is crucial when there is a cluttered background and multiple persons in the scene.


In this chapter, we address the action detection problem by proposing a new type of feature, Hierarchically Filtered Motion (HFM), and further investigating the fusion of HFM with Spatiotemporal Interest Point (STIP) features (Dollar et al., 2005; Laptev and Lindeberg, 2003; Cao et al., 2010; Tian et al., 2011) for action detection from cluttered videos. An action is often associated with multiple visual measurements, which can be either appearance features (e.g., color, edge histograms) or motion features (e.g., optical flow, motion history). Different features describe different aspects of the visual characteristics and demand different metrics, so handling heterogeneous features becomes an important problem for action detection.
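One common way to handle heterogeneous feature channels with different metrics is late fusion: normalize each channel's detection scores to a comparable scale, then combine them with a weighted sum. The sketch below illustrates this idea only; the channel names, weights, and z-score normalization are assumptions for illustration, not the chapter's actual HFM/STIP fusion scheme.

```python
import numpy as np

def late_fusion(channel_scores, weights):
    """Fuse per-channel scores over the same set of candidate windows.

    channel_scores: dict mapping channel name -> score array.
    weights: dict mapping channel name -> fusion weight.
    Each channel is z-normalized first, so channels measured on very
    different scales become comparable before the weighted sum.
    """
    fused = None
    for name, w in weights.items():
        s = np.asarray(channel_scores[name], dtype=float)
        s = (s - s.mean()) / (s.std() + 1e-8)  # per-channel normalization
        fused = w * s if fused is None else fused + w * s
    return fused

# Example: two hypothetical channels scoring the same three candidate
# windows on very different scales; window 1 scores highest in both.
scores = {"HFM": [0.2, 0.9, 0.4], "STIP": [10.0, 30.0, 20.0]}
fused = late_fusion(scores, {"HFM": 0.5, "STIP": 0.5})
best = int(np.argmax(fused))  # -> 1
```

Without the normalization step, the STIP channel's larger numeric range would dominate the sum regardless of its actual reliability; normalizing first lets the weights, rather than the raw scales, control each channel's influence.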
