Accelerating Deep Action Recognition Networks for Real-Time Applications

David Ivorra-Piqueres (University of Alicante, Alicante, Spain), John Alejandro Castro Vargas (University of Alicante, Alicante, Spain) and Pablo Martinez-Gonzalez (University of Alicante, Alicante, Spain)
Copyright: © 2019 |Pages: 16
DOI: 10.4018/IJCVIP.2019040102

Abstract

In this work, the authors propose several techniques for accelerating a modern action recognition pipeline. The article reviews several recent and popular action recognition works and selects two of them as tools for achieving this acceleration: temporal segment networks (TSN), a convolutional neural network (CNN) framework that uses a small number of video frames to obtain robust predictions and that won first place in the 2016 ActivityNet challenge, and MotionNet, a convolutional-transposed CNN that is capable of inferring optical flow from RGB frames. Alongside MotionNet, the article integrates new software for decoding videos that takes advantage of NVIDIA GPUs. As a proof of concept for this approach, the RGB stream of the TSN network is trained on videos loaded with NVIDIA Video Loader (NVVL), using a subset of daily actions from the University of Central Florida 101 dataset.

1. Introduction

Although the task of activity recognition has seen numerous breakthroughs in recent years thanks to new methodologies and the resurgence of deep learning techniques, this has not always been the case. For many years, despite being tackled from multiple perspectives, the problem of building a system capable of identifying which activity is being performed in a given scene remained largely unsolved. The state of the art includes approaches based on handcrafted traditional methods, machine learning, and deep learning:

  • Handcrafted features dominance. The first approaches were motivated by fundamental algorithms such as optical flow (Horn and Schunck, 1981), the Canny edge detector (Canny, 1986), the Hidden Markov Model (HMM) (Rabiner and Juang, 1986) and Dynamic Time Warping (DTW) (Bellman and Kalaba, 1959). Several of these methods are reviewed in (Gavrila, 1999) for hand and whole-body movements, which can be used to obtain relevant information for activity recognition.

  • Machine learning approaches. More modern methods use optical flow (Efros et al., 2003) to obtain temporal features over the sequences, together with machine learning algorithms such as Support Vector Machines (SVM) (Schüldt, Laptev and Caputo, 2004) to classify spatiotemporal features.

  • Deep learning. CNNs make it possible to obtain robust visual features from 2D images (Chéron and Laptev, 2015). More specifically, their variant adapted to three-dimensional data can extract both spatial and temporal features from sequences of images: in addition to the two spatial dimensions (height and width), there is a third dimension defined by time (frames) (Ji et al., 2013) (Simonyan and Zisserman, 2014).
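
To make the role of the third (temporal) dimension concrete, the following is a minimal sketch of a 3D convolution over a short grayscale clip. It is an illustration only, not the implementation used in any of the cited works: the function names, the tiny clip and the frame-difference kernel are all hypothetical, and the operation is written without padding or stride and without flipping the kernel (that is, as a cross-correlation, which is how convolution layers are usually computed in deep learning frameworks).

```python
# Illustrative sketch only: a "valid" 3D convolution over a grayscale clip
# stored as nested lists, clip[t][y][x]. All names here are hypothetical.

def conv3d_valid(clip, kernel):
    """Slide a (kt, kh, kw) kernel over a (T, H, W) clip with no padding."""
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):          # temporal positions (frames)
        plane = []
        for y in range(H - kh + 1):      # vertical positions (height)
            row = []
            for x in range(W - kw + 1):  # horizontal positions (width)
                acc = 0.0
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += clip[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


def shape(volume):
    """Return the (T, H, W) shape of a nested-list volume."""
    return (len(volume), len(volume[0]), len(volume[0][0]))


# Toy clip: 3 frames of 2x2 pixels (assumed values, for illustration).
clip = [
    [[0, 0], [0, 0]],
    [[1, 0], [0, 0]],
    [[1, 1], [0, 0]],
]

# A kernel spanning 2 frames and 1x1 pixels: next frame minus current frame.
kernel = [[[-1.0]], [[1.0]]]

response = conv3d_valid(clip, kernel)
print(shape(response))
print(response)
```

Because this kernel extends over two frames, each output value depends on how a pixel changes over time, which a purely 2D kernel applied frame by frame cannot capture. The output also has one fewer temporal position than the input, just as an unpadded 2D convolution shrinks height and width.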

2. Approach

In this section we review the most recent action recognition works, carried out in the past three years.

Online Inverse Reinforcement Learning (Rhinehart and Kitani, 2017) is a novel method for predicting future behaviors by modeling the interactions between the subject, objects, and their environment through a first-person mounted camera. Because the system uses online inverse reinforcement learning, it can continually discover new long-term goals and relationships. Similarly, using an approach akin to hybrid Siamese networks, (Mahmud, Hasan and Roy-Chowdhury, 2017) show that it is possible to simultaneously predict future activity labels and their starting times, by taking advantage of features of previously seen activities and of the objects currently present in the scene.

Thanks to the use of Single Shot MultiBox Detector (SSD) CNNs, the system proposed in (Singh et al., 2017) is capable of predicting both action labels and their corresponding bounding boxes in real time (28 FPS). Moreover, it can detect more than one action at the same time. All of this is accomplished by combining RGB image features with optical flow features extracted in real time (at the cost of lower optical flow quality and global accuracy) to create multiple action tubes.

In (Kong, Tao and Fu, 2017), to predict action class labels before the action finishes, the authors use features extracted from fully observed videos at training time to fill in the information missing from the incomplete videos being predicted. Thanks to this approach, their model also obtains a large speedup compared to similar methods.

A model capable of performing visual forecasting at different abstraction levels is presented in (Zeng et al., 2017). For example, the same model can be trained for future frame generation as well as for action anticipation. This is accomplished by following an inverse reinforcement learning approach. In addition, the model is constrained to imitate natural visual sequences at the pixel level.
