Simple Approach for Violence Detection in Real-Time Videos Using Pose Estimation With Azimuthal Displacement and Centroid Distance as Features

Felipe Boris De Moura Partika
Copyright: © 2022 |Pages: 9
DOI: 10.4018/IJCVIP.304462

Abstract

Detecting violence in real-time videos is not an easy task, even for the most advanced deep learning architectures, considering the subtle details of human behavior that differentiate an ordinary action from a violent one. Even with the advances of deep learning, human activity recognition (HAR) in videos can only be achieved at a huge computational cost, most of the time also requiring special hardware to reach an acceptable accuracy. We present in this paper a novel method for violence detection, a sub-area of HAR, which outperforms state-of-the-art methods in both speed and accuracy. Our method is based on features extracted with the pose estimation method OpenPose. These features are transformed into more representative elements in the context of violence detection and then fed to an LSTM neural network, which learns to identify violence. This work was inspired by violencedetector.org, the first open-source project for violence detection in real-time videos.

Introduction

In this paper, the author demonstrates a novel method for violence detection in real-time videos, allowing machines to better interpret human actions in videos, therefore being able to differentiate between ordinary and violent behavior.

This differentiation is very complex since no single parameter indicates the existence of violence on its own; rather, a complex combination of several parameters and their variations is required (Ullah et al., 2017), such as the positions of individuals’ body parts, the contact points between them, and how each part moves over time (Serpush & Rezaei, 2020). Additionally, human activity recognition (HAR) also requires the analysis of multiple frames, which exponentially increases the number of parameters to be processed (Sharif et al., 2019).

Currently, the field of computer vision is dominated by convolutional neural networks (CNNs) (Varior et al., 2016; Zeiler & Fergus, 2013), which have proven their worth as the basic building block of many deep learning architectures. The convolutions turn two-dimensional images into more abstract elements called activation maps, which hold the learned abstract representations of the images. Therefore, CNNs are capable of learning image details such as contours, edges, and patterns based on the pixel intensity variations of the image (Almaadeed et al., 2021), while remaining robust to small transformations in the input image (e.g., translation, scaling, skewing, and distortion) (LeCun et al., 1989). Unfortunately, convolutions normally come with a large memory footprint and a high computational cost.
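The effect of a convolution can be illustrated with a minimal NumPy sketch (not tied to any particular CNN library or to the paper's architecture): a small edge-detecting kernel slides over an image and produces an activation map that responds where pixel intensities change.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution: slides the kernel over the image and
    returns the resulting activation map."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge kernel responds strongly where intensity changes
# left-to-right, e.g. at the boundary of a bright region.
image = np.zeros((5, 5))
image[:, 3:] = 1.0                      # bright right half
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])
activation = conv2d(image, sobel_x)     # non-zero only near the edge
```

The activation map is zero over the uniform left half and peaks where the kernel straddles the dark-to-bright boundary, which is exactly the kind of local intensity pattern the text describes CNNs learning.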

Since CNNs were initially designed for 2D image classification, an additional temporal dimension had to be added to the architecture so it could handle the multiple frames required by HAR. This new architecture is known as the 3DCNN. The additional dimension increases the computational cost even further (Aktı et al., 2020), as described before.
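The cost of the extra dimension shows up directly in the parameter count of a single convolutional layer. The numbers below are illustrative only, not the paper's architecture: adding a temporal depth of 3 to a 3×3 kernel roughly triples the layer's weights.

```python
# Parameters of one conv layer = in_ch * out_ch * prod(kernel dims) + out_ch (bias).
def conv_params(in_ch, out_ch, *kernel_dims):
    n = in_ch * out_ch
    for d in kernel_dims:
        n *= d
    return n + out_ch

p2d = conv_params(3, 64, 3, 3)       # 2D CNN layer: 3x3 spatial kernel
p3d = conv_params(3, 64, 3, 3, 3)    # 3DCNN layer: adds temporal depth 3
```

With these example numbers, the 2D layer holds 1,792 parameters while its 3D counterpart holds 5,248, and the gap compounds across every layer and every frame window processed.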

Figure 1.

Process of pose estimation using OpenPose (Cao et al., 2019); (left) original image with the key points and connections; (right) only the key points and connections extracted from the first image


In order to address this increase in computational cost, the author’s method translates the real-time video data into a simpler and lighter representation from which the neural network can learn the subtle cues that indicate violence, based on features more representative than pixel intensity variation patterns (Javidani & Mahmoudi-Aznaveh, 2018).

For this purpose, the author’s method employs OpenPose (Cao et al., 2019) to localize the anatomical key points of human bodies through what OpenPose refers to as part affinity fields (PAFs). PAFs learn to associate body parts with individuals in the image, and this process achieves high accuracy and real-time performance (Kim & Lee, 2020). OpenPose (Cao et al., 2019) defines each individual through 18 key points (Figure 1).
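Assuming each individual's key points arrive as an N×2 array of (x, y) coordinates, the two features named in the title can be sketched as follows. This is an illustrative reconstruction, not the paper's code, and the function names are hypothetical.

```python
import numpy as np

def centroid(keypoints):
    """Mean position of an individual's key points (N x 2 array)."""
    return keypoints.mean(axis=0)

def centroid_distance(kp_a, kp_b):
    """Euclidean distance between the centroids of two individuals,
    e.g. to capture how close two people are to each other."""
    return float(np.linalg.norm(centroid(kp_a) - centroid(kp_b)))

def azimuthal_displacement(kp_prev, kp_curr):
    """Per-key-point change in angle (radians) relative to the body
    centroid between two consecutive frames, capturing how fast
    each body part rotates around the body."""
    def angles(kp):
        rel = kp - centroid(kp)
        return np.arctan2(rel[:, 1], rel[:, 0])
    return angles(kp_curr) - angles(kp_prev)
```

Features of this kind are a handful of scalars per person per frame, which is what makes the sequence light enough to feed to a recurrent network in real time.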

Figure 2.

The network architecture: OpenPose (Cao et al., 2019) extracts the key points, the resulting features are transformed into more representative information, and this information feeds a long short-term memory network

