Detection and Classification of Interacting Persons

Scott Blunsden (European Commission Joint Research Centre, Italy) and Robert Fisher (University of Edinburgh, UK)
Copyright © 2010 | Pages: 15
DOI: 10.4018/978-1-60566-900-7.ch011


This chapter presents a way to classify interactions between people. Examples of the interactions we investigate are: people meeting one another, walking together, and fighting. A new feature set is proposed along with a corresponding classification method. Results are presented which show the new method performing significantly better than the previous state-of-the-art method proposed by Oliver et al. (2000).
Chapter Preview


This chapter presents an investigation into the classification of multiple-person interactions. There has been much previous work on identifying what activity an individual person is engaged in. Davis and Bobick (2001) used a moment-based representation built from extracted silhouettes, and Efros et al. (2003) modeled human activity by generating optical flow descriptions of a person’s actions. Descriptions were generated by first hand-tracking an individual, re-scaling to a standard size, and then taking the optical flow of the person’s actions over several frames. A database of these descriptions was created and matched against novel situations. This method was extended by Robertson (2006), who also included location information to give contextual information about a scene. Location information is of assistance when trying to determine whether someone is loitering or merely waiting at a road crossing. Following on from flow-based features, Dollar et al. (2005) extracted spatio-temporal features to identify sequences of actions.

Ribeiro and Santos-Victor (2005) took a different approach to classifying an individual’s actions: they computed multiple features from tracking (such as speed and eigenvectors of flow) and selected those features which best classified the person’s actions, using a classification tree in which each branch uses at most three features to classify an example.

The classification of interacting individuals was studied by Oliver et al. (2000), who used tracking to extract the speed of each person, their alignment, and the derivative of the distance between them. This information was then used to classify sequences with a coupled hidden Markov model (CHMM). Liu and Chua (2006) extended two-person classification to three-person sequences using a hidden Markov model (HMM) with an explicit role attribute; information derived from tracking provided features such as the relative angle between two persons, used to classify complete sequences. Xiang and Gong (2003) again used a CHMM, in their case to model interactions between vehicles on an aircraft runway. Their features are calculated by detecting significantly changed pixels over several frames. The appropriate model structure for representing a sequence is determined by the connections between the separate models, with goodness of fit measured by the Bayesian information criterion; using this method, a model representing the sequence’s actions is selected.
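Pairwise features of the kind described above (speed, alignment, and the derivative of the inter-person distance) are straightforward to compute from tracked positions. The sketch below is illustrative only: the function name, array layout, and frame rate are assumptions, not the chapter's implementation.

```python
import numpy as np

def interaction_features(p1, p2, fps=25.0):
    """Per-frame pairwise features of the kind used by Oliver et al. (2000).

    p1, p2: (T, 2) arrays of tracked image positions for two people.
    Returns a (T, 5) array: speed of each person, heading alignment,
    inter-person distance, and the derivative of that distance.
    (Function name and array layout are illustrative assumptions.)
    """
    v1 = np.gradient(p1, axis=0) * fps          # velocity in pixels/second
    v2 = np.gradient(p2, axis=0) * fps
    speed1 = np.linalg.norm(v1, axis=1)
    speed2 = np.linalg.norm(v2, axis=1)
    # Alignment: cosine of the angle between the two velocity vectors
    dot = (v1 * v2).sum(axis=1)
    denom = speed1 * speed2
    align = np.divide(dot, denom, out=np.zeros_like(dot), where=denom > 0)
    dist = np.linalg.norm(p1 - p2, axis=1)      # inter-person distance
    d_dist = np.gradient(dist) * fps            # derivative of distance
    return np.column_stack([speed1, speed2, align, dist, d_dist])
```

For two people walking together, such features give near-perfect alignment and a near-zero distance derivative, which is what makes them discriminative for the interaction classes discussed here.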

Multi-person interactions within a rigid formation were the focus of Khan and Shah (2005), who used a geometric model to detect rigid formations between people; a marching band is one such example. Intille and Bobick (2001) used a pre-defined Bayesian network to describe planned motions during American football games. Others, such as Perse et al. (2007), also use a pre-specified template to evaluate the current action being performed by many individuals. Pre-specified templates have likewise been used by Van Vu et al. (2003) and Hongeng and Nevatia (2001) within the context of surveillance applications.


Specifically What Are We Trying To Do?

Given an input video sequence, the goal is to automatically determine whether any interactions are taking place between two people, and if so, to identify the class of the interaction. Here we limit ourselves to pre-defined classes of interaction; to make the situation more realistic, there is also a ‘no interaction’ class. We seek to assign each frame a label from this predefined set. For example, a label may state that person 1 and person 2 are walking together in frame 56.
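The per-frame labelling task above can be sketched with a deliberately simple stand-in classifier. The class names and the nearest-centroid rule below are assumptions for illustration only, not the chapter's method; they show only the input/output shape of the problem (a feature vector per frame in, one label per frame out).

```python
import numpy as np

# Illustrative label set, including the 'no interaction' class;
# the exact class names here are assumptions.
LABELS = ["no interaction", "meet", "walk together", "fight"]

def classify_frames(features, centroids):
    """Assign every frame the label of its nearest class centroid.

    A minimal stand-in for a real per-frame classifier, included only
    to show the labelling setup. features: (T, D) array of per-frame
    feature vectors; centroids: dict mapping label -> (D,) vector.
    Returns a list of T labels, one per frame.
    """
    names = list(centroids)
    C = np.stack([centroids[k] for k in names])                       # (K, D)
    d = np.linalg.norm(features[:, None, :] - C[None, :, :], axis=2)  # (T, K)
    return [names[i] for i in d.argmin(axis=1)]
```

In the chapter itself the classifier is learned from labelled sequences rather than hand-specified centroids; the point here is only that output is one label per frame, drawn from a fixed set.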

The ability to automatically classify such interactions would be useful in many typical surveillance situations. It would also be useful in video summarization, where it would make it possible to focus only on specific interactions.
