A Survey of Human Activity Interpretation in Image and Video Sequence

Xin Xu, Li Chen, Xiaolong Zhang, Dongfang Chen, Xiaoming Liu, Xiaowei Fu
DOI: 10.4018/978-1-4666-3958-4.ch003

Abstract

A large amount of research has been dedicated to the interpretation of human activity in image and video sequences. This interest is largely due to the widespread deployment of video cameras for surveillance. In image and video sequence analysis, human activity detection and recognition is critically important: by detecting and understanding human activity, many surveillance-related applications can be realized, including city centre monitoring and consumer behavior analysis. Generally speaking, human activity interpretation in image and video sequences comprises two stages: human motion detection and human motion interpretation. In this chapter, the authors provide a comprehensive review of recent advances in both stages. Various methods for each issue are discussed to examine the state of the art. Finally, research challenges, possible applications, and future directions are discussed.

Introduction

After the tragic events of September 11 and the subsequent terrorist attacks around the world, the interpretation of image and video sequences has attracted much more attention, and it now finds a wide range of uses in visual surveillance related applications. By analyzing human activity in image and video sequences, especially abnormal human motion patterns, we can predict and recognize crime and antisocial behavior in real time, such as drunkenness, fights, vandalism, and breaking into shop windows.

Human activity interpretation aims to produce a description of human actions and interactions through the analysis of their motion patterns (Kautz, 1987). From a technical viewpoint, the interpretation of human activity in image and video sequences may be considered a classification problem over time-varying data. It can be divided into a two-level procedure: the lower level extracts human motion features from the image and video sequence, while the higher level detects and interprets the temporal human motion patterns. Visual information is first extracted from the image and video sequence and then represented by relevant features, which are matched against the features extracted from a group of labeled reference sequences representing typical human activities.
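
As a minimal sketch of this two-level scheme, the code below assumes the lower level has already produced a (T, D) array of per-frame motion features for every sequence; the higher level then labels a query sequence by nearest-neighbour matching against labeled references. The function names, the fixed-length resampling, and the Euclidean distance are illustrative assumptions, not the method of any particular work surveyed here.

# Sketch of the higher-level matching step, assuming per-frame feature
# vectors are already available for the query and the labeled references.
import numpy as np

def resample(seq, length):
    """Linearly resample a (T, D) feature sequence to a fixed temporal length."""
    t_old = np.linspace(0.0, 1.0, len(seq))
    t_new = np.linspace(0.0, 1.0, length)
    return np.stack([np.interp(t_new, t_old, seq[:, d]) for d in range(seq.shape[1])], axis=1)

def classify(query, references, length=50):
    """Return the label of the reference sequence closest to the query."""
    q = resample(query, length)
    best_label, best_dist = None, float("inf")
    for label, ref in references:
        dist = np.linalg.norm(q - resample(ref, length), axis=1).mean()
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Toy usage with two labeled reference motions and a perturbed query.
walk = np.cumsum(np.random.randn(60, 4), axis=0)
run = 3.0 * np.cumsum(np.random.randn(40, 4), axis=0)
print(classify(walk + 0.1 * np.random.randn(60, 4), [("walking", walk), ("running", run)]))

In practice, the higher level more commonly relies on a sequence model (e.g. an HMM) or an elastic distance such as dynamic time warping rather than fixed-length resampling; the sketch only makes the template-matching idea concrete.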

Generally speaking, the interpretation of human activity should consider three kinds of features: the characteristics of a single object, the global features of multiple objects, and the relation between objects and the background.

The investigation of single-object features for activity interpretation started early. Such features, including position, velocity, texture, shape, and color, can be extracted during the detection and tracking procedure and used to predict the corresponding information in the next time step. This prediction is then compared with the observed object information to check whether an activity has taken place. However, it should be noted that the occurrence of an activity may not only change the features of a single object but also influence the global features of multiple objects. In a public transportation application, for example, a traffic accident may be caused by a breakdown or a rear-end collision involving multiple vehicles. As a result, the global features of multiple objects should also be considered. The basic principle of global feature analysis is to detect activity using information derived from multiple objects, such as average speed, region occupancy, and relative positional variation. In addition, human activity cannot be analyzed without considering the influence of the environment. Most current methods for activity recognition assume a specific situation, which may encode a large amount of prior knowledge. This assumption can reduce the computational complexity of environment analysis and improve performance; however, it inevitably limits the generality of the technique. Thus background analysis should be taken into consideration during the recognition of human activity and can be used to extract the relation between foreground objects and the scene.
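
To make the first two feature families concrete, the following hypothetical illustration checks a single tracked object against a one-step constant-velocity prediction and computes simple global features (mean speed and occupied region extent) over several tracked objects. The track layout, units, and thresholds are assumptions made for this example only.

# Hypothetical single-object prediction check and multi-object global features.
import numpy as np

def prediction_residual(track):
    """Distance between the last observed position of a single object and a
    one-step constant-velocity prediction; track is a (T, 2) array, T >= 3."""
    predicted = track[-2] + (track[-2] - track[-3])
    return float(np.linalg.norm(track[-1] - predicted))

def global_features(tracks):
    """Global features of multiple objects: mean speed and occupied area."""
    speeds = [np.linalg.norm(np.diff(t, axis=0), axis=1).mean() for t in tracks]
    last_positions = np.stack([t[-1] for t in tracks])
    extent = last_positions.max(axis=0) - last_positions.min(axis=0)
    return {"mean_speed": float(np.mean(speeds)),
            "region_area": float(extent[0] * extent[1])}

# Toy usage: flag an object whose motion departs from its prediction.
tracks = [np.cumsum(np.random.randn(30, 2), axis=0) for _ in range(5)]
if prediction_residual(tracks[0]) > 2.0:   # hypothetical threshold
    print("possible activity on object 0")
print(global_features(tracks))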

Figure 1 illustrates the general framework for human activity interpretation. Cameras are used to capture image and video sequences. In order to improve the robustness of visual surveillance across applications, these cameras may be of different modalities, including thermal infrared cameras and visible color cameras. Using image processing techniques, the background can be separated from the image and video sequence and used to obtain the aforementioned three kinds of features. The image interpreting stage then includes two main procedures: human detection and segmentation, and human activity interpretation. Our emphasis in this chapter is the recent advances of the techniques in the image interpreting stage; thus we focus on these two main procedures and endeavor to provide a summary of the progress achieved in this research direction.
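
A minimal sketch of the lower half of this framework, assuming OpenCV is available, is given below: frames from a visible-light camera are passed through MOG2 background subtraction, and the resulting foreground blobs stand in for detected humans. A real system would add a dedicated human detector, a thermal or multi-camera front end where needed, and the activity interpretation stage on top; the camera index and size threshold here are assumptions.

# Background subtraction and blob detection sketch (OpenCV 4.x API).
import cv2

cap = cv2.VideoCapture(0)                       # camera index is an assumption
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)              # foreground mask
    mask = cv2.medianBlur(mask, 5)              # suppress isolated noise
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    for c in contours:
        if cv2.contourArea(c) > 500:            # hypothetical size threshold
            x, y, w, h = cv2.boundingRect(c)
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("detections", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()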

Figure 1. General framework for human activity interpretation

Beyond human activity interpretation, closely related fields include behavior understanding, motion interpretation, event detection, goal recognition, and intent prediction. As pointed out in (L. Liao, 2006), although these terms may emphasize different aspects of human activity, their essential goals are the same. Therefore, in this chapter, we use the term activity recognition and do not distinguish the minor differences among these terms.
