Real-Time Recognition of Basic Human Actions

Vassilis Syrris (Aristotle University of Thessaloniki, Greece)
Copyright: © 2010 | Pages: 18
DOI: 10.4018/978-1-60566-900-7.ch008


This work describes a simple, computationally efficient, appearance-based approach to both human pose recovery and real-time recognition of basic human actions. We apply a technique that captures the differences between two or more successive frames and use a threshold filter to detect the regions of the video frames where some type of human motion is observed. From each frame difference, the algorithm extracts an incomplete, unformed human body shape and generates a skeleton model that represents it in an abstract way. Finally, the recognition process is formulated as a time-series problem and handled by a robust and accurate prediction method, Support Vector Regression. The proposed technique could be employed in applications such as surveillance and security systems.
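The frame-differencing and thresholding step described above can be sketched in a few lines of NumPy. The function name, threshold value and synthetic test frames below are illustrative choices for the sketch, not the chapter's actual implementation:

```python
import numpy as np

def motion_mask(prev_frame, curr_frame, threshold=25):
    """Binary mask of pixels whose grayscale intensity changed by more
    than `threshold` between two successive frames (frame differencing)."""
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    return (diff > threshold).astype(np.uint8)

# Two synthetic 8x8 grayscale frames: a bright 2x2 "object" moves one pixel right.
prev_frame = np.zeros((8, 8), dtype=np.uint8)
curr_frame = np.zeros((8, 8), dtype=np.uint8)
prev_frame[3:5, 2:4] = 200
curr_frame[3:5, 3:5] = 200

mask = motion_mask(prev_frame, curr_frame)
# Nonzero only at the trailing and leading edges of the moving region.
print(mask[3:5, 2:6])
```

The mask isolates exactly the pixels where motion occurred; in a real system the cast to a signed type before subtracting is essential, since unsigned subtraction would wrap around instead of producing negative differences.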
Chapter Preview


Human action modeling, detection and recognition from video sequences (i.e., temporal series of frames/images) have many applications in robot/computer vision, indoor/outdoor surveillance and monitoring, human-computer interaction, computer graphics, virtual reality and video analysis (summarization, transmission, retrieval, compression, etc.). A typical automated human action recognition system is usually examined in terms of both its constituent components: hardware (processing units, cameras, networks, etc.) and software, that is, algorithms based mainly on computational/statistical techniques.

In this work, we focus on the latter; more specifically, the recognition problem confronted herein is the classification of elementary human actions, such as walking, hand-waving and jumping, captured by a fixed monocular camera. Many parameters affect the problem's complexity:

  • Intra- and inter-class variations: There are large natural differences in how actions belonging to the same class are performed (e.g. speed and pace length generate different types of walking). Moreover, the boundaries between two action classes are often hard to define (e.g. drawing and moving one hand).

  • Recording conditions: The use of one camera (single data source, monocular) or more (binocular or multiple cameras), a moving camera (autonomous robot) or a fixed one (still surveillance camera), indoor or outdoor shooting, image resolution (e.g. high quality reduces noise but raises storage and processing requirements), the distance of the observed object and its position relative to the camera, and so on.

  • Discrepancies in person localization (either spatial or temporal).

  • Motion alignment, required in order to compare two video sequences.

Human motion and event analysis has received much attention in the research community (some indicative reviews of the subject are Gao et al., 2004; Hu et al., 2004; Kumar et al., 2008; Moeslund & Granum, 2001). Nevertheless, it remains a core unsolved machine vision problem for several reasons: illumination conditions, depth calculation, complex backgrounds, variations in object appearance and posture, the representation of the human body or its transformation into a more compact and recognizable structure, the description and measurement of activities, and high pre-processing requirements are some of the factors obstructing an easy, sound and complete resolution of the problem.

The objective of this work is to describe a computational approach to a single-person action recognition problem. Along the way, we analyze the processing stages required to detect the object of interest, to identify appropriate content features that suitably represent it, to exploit the information underlying the trajectories of this object, and to define decision boundaries capable of classifying new, unseen human action videos. At the end of the chapter, the future directions section pinpoints the open research issues and stresses the subtle nuances that must be considered when dealing with such a demanding automated application.
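As a concrete illustration of the feature-extraction stage mentioned above, the sketch below reduces a binary silhouette (such as one obtained from a thresholded frame difference) to a compact, skeleton-like descriptor. The centroid-plus-extremities "star" abstraction used here is an illustrative stand-in, not the chapter's exact skeleton model:

```python
import numpy as np

def star_descriptor(mask):
    """Compact descriptor of a binary silhouette: offsets (dy, dx) of the
    topmost, bottommost, leftmost and rightmost foreground pixels relative
    to the silhouette centroid. An illustrative 'star'-style abstraction."""
    ys, xs = np.nonzero(mask)          # coordinates of foreground pixels
    cy, cx = ys.mean(), xs.mean()      # silhouette centroid
    top    = (ys.min() - cy, xs[ys.argmin()] - cx)
    bottom = (ys.max() - cy, xs[ys.argmax()] - cx)
    left   = (ys[xs.argmin()] - cy, xs.min() - cx)
    right  = (ys[xs.argmax()] - cy, xs.max() - cx)
    return np.array([top, bottom, left, right]).ravel()

# A vertical bar silhouette: extremities lie straight above/below the centroid.
mask = np.zeros((8, 8), dtype=np.uint8)
mask[2:7, 4] = 1
desc = star_descriptor(mask)
print(desc)
```

Descriptors of this kind are low-dimensional and translation-invariant (all offsets are relative to the centroid), which makes the per-frame sequence of descriptors a natural input to the time-series recognition stage.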


Video Content Analysis

In general, video content analysis relates to the following tasks:

  • a. Detecting objects of interest within the frames: a low-level stage of video processing where visual features such as color, image and pixel regions, texture, contours, corners, etc., are identified/selected/formulated.

  • b. Assigning meaning to the temporal development of these elements: a high-level stage (the recognition phase) where objects and events are classified into predefined classes that represent patterns of motion. The major issue here is how to map the low-level features (pixels, transformed features or statistical measures) to semantic content.

In addition, videos consist of massive amounts of raw information in the form of spatio-temporal pixel intensity/color variations, but most of this information is not directly relevant to the task of understanding and identifying the activity occurring in the video.
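The high-level stage (task b above) can be illustrated with a deliberately simple recognizer: a nearest-neighbour comparison between a query sequence of per-frame features and labelled exemplar sequences. This is a stand-in for the chapter's SVR-based time-series formulation; the labels, feature sequences and function name are illustrative assumptions:

```python
import numpy as np

def classify_sequence(query, labelled_sequences):
    """Assign an action label to a sequence of per-frame feature values by
    nearest-neighbour (frame-aligned Euclidean) distance to labelled exemplars.
    Illustrative stand-in for the chapter's SVR-based recognizer."""
    best_label, best_dist = None, np.inf
    for label, ref in labelled_sequences:
        d = np.linalg.norm(query - ref)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label

# Toy 1-D per-frame features: monotone drift = "walking", oscillation = "waving".
walk = np.linspace(0.0, 1.0, 10)
wave = np.sin(np.linspace(0.0, 4.0 * np.pi, 10))
exemplars = [("walking", walk), ("waving", wave)]

query = wave + 0.05            # a slightly noisy waving sequence
print(classify_sequence(query, exemplars))  # → waving
```

Even this toy version highlights the alignment issue noted earlier: a frame-aligned distance assumes the query and exemplar sequences have equal length and matching temporal phase, which real systems must handle explicitly (e.g. by temporal alignment or windowing).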

Most of the current methods employ a number of the following steps:
