We present a discriminative approach to human action recognition. At the heart of our approach is the use of common spatial patterns (CSP), a spatial filter technique that transforms temporal feature data by using differences in variance between two classes. Such a transformation focuses on differences between classes, rather than on modeling each class individually. As a result, to distinguish between two classes, we can use simple distance metrics in the low-dimensional transformed space. The most likely class is found by pairwise evaluation of all discriminant functions, which can be done in real-time. Our image representations are silhouette boundary gradients, spatially binned into cells. We achieve scores of approximately 96% on the Weizmann human action dataset, and show that reasonable results can be obtained when training on only a single subject. We further compare our results with a recent exemplar-based approach. Future work is aimed at combining our approach with automatic human detection.
Introduction
Automatic recognition of human actions from video is an important step towards the goal of automatic understanding of human behavior. This understanding has many potential applications, including improved human-computer interaction, video surveillance, and automatic annotation and retrieval of stored video footage. In general, these applications demand classification of human movement into several broad categories. Real-time, robust processing is often an important requirement, although there is usually some control over the recording conditions. For example, human-computer interfaces require direct interaction. Another example is surveillance in the domain of domotics (home automation), where elderly people are monitored to enable them to live independently for a longer period of time.
In the development of a human action recognition algorithm, one issue is the type of image representation that is used. At one extreme, bag-of-words approaches (Batra et al., 2007; Niebles and Fei-Fei, 2007) have been used. At the other extreme, pose information is used (e.g. Ali et al. (2007)). In this chapter, we assume that the location of the human figure in the image is known. While this might seem unrealistic, related work by Thurau (2007) and Zhu et al. (2006) shows that this detection can be performed reliably and within reasonable time. Recent work on human detection by Wu and Nevatia (2007) and Lin et al. (2008) even deals with partial observations, but we do not consider these here. To encode the observation of the human figure, we use a grid-based silhouette descriptor, where each cell is a histogram of oriented boundary points. This representation resembles the concept of histograms of oriented gradients (HOG, Dalal and Triggs (2005)), as it models the spatial relations, yet is able to generalize over local variations.
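The grid-based descriptor can be sketched as follows. This is a minimal illustration, not the chapter's exact implementation: the grid size and number of orientation bins are hypothetical choices, and boundary points are located here via the gradient of the binary silhouette mask.

```python
import numpy as np

def silhouette_descriptor(mask, grid=(4, 4), n_bins=8):
    """Grid-based histogram of oriented silhouette boundary points.

    mask : 2D binary array containing the silhouette.
    grid, n_bins : illustrative settings, not the chapter's exact values.
    Returns a fixed-size feature vector for one frame.
    """
    mask = mask.astype(float)
    gy, gx = np.gradient(mask)            # gradients along rows, columns
    boundary = np.hypot(gx, gy) > 0       # nonzero gradient = boundary pixel
    # unsigned boundary orientation in [0, pi)
    orient = np.mod(np.arctan2(gy, gx), np.pi)

    h, w = mask.shape
    desc = np.zeros(grid + (n_bins,))
    ys, xs = np.nonzero(boundary)
    for y, x in zip(ys, xs):
        # assign the boundary point to a spatial cell and orientation bin
        cy = min(int(y * grid[0] / h), grid[0] - 1)
        cx = min(int(x * grid[1] / w), grid[1] - 1)
        b = min(int(orient[y, x] * n_bins / np.pi), n_bins - 1)
        desc[cy, cx, b] += 1
    return desc.ravel()
```

Because every frame yields a vector of the same length, a sequence becomes a matrix of features over time, which is the input format the CSP transform expects.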
For classification, we learn simple functions that can discriminate between two classes. Our main contribution is the application of common spatial patterns (CSP), a spatial filter technique that transforms temporal feature data by using differences in variance between two classes. After applying CSP, the first components of the transformed feature space contain high temporal variance for one class and low variance for the other class; this effect is reversed for the last components. For an unseen sequence, we calculate the histogram over time, using only a fraction of the transformed space (the first and last components). Each action is represented by the mean of the histograms of all corresponding training sequences, which is a very compact but somewhat naive representation. A simple classifier distinguishes between the two classes. All discriminant functions are evaluated pairwise to find the most likely action class. This introduces a significant amount of noise over the class labels, but works well for the given task. Note that CSP can be used with any image descriptor that is encoded as a vector of a fixed size, for example a histogram of codeword frequencies.
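The core CSP computation described above can be sketched as a generalized eigenvalue problem on the class-averaged covariance matrices. This is a minimal sketch under common CSP conventions, not the chapter's exact code; the trial format (features x time samples) and per-trial covariance normalization are assumptions.

```python
import numpy as np
from scipy.linalg import eigh

def csp(trials_a, trials_b, n_components=3):
    """Common spatial patterns for two classes.

    trials_* : lists of arrays, each (n_features, n_samples) -- one sequence.
    Returns W (n_features, 2*n_components): the first n_components columns
    give high variance for class a and low for class b; the last columns
    show the opposite effect.
    """
    def mean_cov(trials):
        covs = []
        for x in trials:
            x = x - x.mean(axis=1, keepdims=True)
            c = x @ x.T
            covs.append(c / np.trace(c))   # normalize per trial
        return np.mean(covs, axis=0)

    ca, cb = mean_cov(trials_a), mean_cov(trials_b)
    # generalized eigenproblem: ca w = lambda (ca + cb) w
    evals, evecs = eigh(ca, ca + cb)
    evecs = evecs[:, np.argsort(evals)[::-1]]  # sort by variance ratio, descending
    # keep only the first and last components of the transformed space
    return np.hstack([evecs[:, :n_components], evecs[:, -n_components:]])
```

Projecting an unseen sequence onto the columns of W yields a low-dimensional signal whose variance profile indicates the more likely of the two classes, which is what allows the simple distance-based classifier described above.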
We obtained competitive results on the publicly available Weizmann action dataset introduced in Blank et al. (2005). One advantage of our method is that we require relatively few training samples. In fact, despite considerable variation in action performance between persons, we obtain reasonable results when training on data from a single subject. Also, we avoid retraining all functions when adding a new class, as the discriminative functions are learned pairwise, instead of jointly over all classes. Another advantage is that our approach is fast. Training of our classification scheme takes well under one second for all actions, with unoptimized Matlab code on a standard PC. After calculating the image descriptors, which can be done efficiently using the integral image (Zhu et al., 2006), classification can be performed in real-time, as only a moderate number of simple functions have to be evaluated.
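The integral image (summed-area table) mentioned above makes the cell histograms cheap to compute: once the table is built, the sum over any rectangular cell takes four lookups regardless of cell size. A minimal sketch:

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero border: ii[y, x] = sum of img[:y, :x]."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1, x0:x1] in O(1) via four table lookups."""
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]
```

Building one table per orientation bin (an image of per-pixel bin counts) lets each cell's histogram entry be read off with a single `box_sum` call.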
In the next section, we discuss related work on action recognition from monocular video. Common spatial patterns, and the construction of the CSP classifiers, are discussed subsequently. We evaluate our approach on the Weizmann dataset and perform additional experiments to gain more insight into the strengths and limitations of our approach. Finally, we summarize our approach and compare our results to those that have previously been reported in the literature. An early version of this chapter appeared as Poppe and Poel (2008).