Introduction
In this paper, the author presents a novel method for violence detection in real-time video. The method helps machines better interpret human actions in video, enabling them to distinguish ordinary from violent behavior.
This differentiation is complex because no single parameter indicates the presence of violence on its own; rather, a combination of several parameters and their variations is required (Ullah et al., 2017), such as the positions of individuals' body parts, the contact points between them, and how each part moves over time (Serpush & Rezaei, 2020). Additionally, human activity recognition (HAR) requires analyzing multiple frames, which greatly increases the number of parameters to be processed (Sharif et al., 2019).
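To make one of these cues concrete, a crude proxy for "contact points between individuals" is the minimum distance between any pair of key points belonging to two different people. The sketch below is purely illustrative and is not the paper's method; the people and coordinates are invented.

```python
# Illustrative sketch (not the paper's method): one crude "contact" cue is the
# minimum distance between any pair of key points from two different people.
import math

def min_joint_distance(person_a, person_b):
    """Each person is a list of (x, y) key points; returns the closest-pair distance."""
    return min(math.hypot(ax - bx, ay - by)
               for ax, ay in person_a for bx, by in person_b)

# Two toy "people" with two key points each.
a = [(0.0, 0.0), (1.0, 1.0)]
b = [(4.0, 5.0), (1.0, 2.0)]

closest = min_joint_distance(a, b)  # closest pair: (1, 1) vs (1, 2) -> 1.0
```

A small distance between, say, one person's wrist and another's head, sustained over several frames, is the kind of spatio-temporal pattern the cited works combine with many other cues.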
Currently, the field of computer vision is dominated by convolutional neural networks (CNNs) (Varior et al., 2016; Zeiler & Fergus, 2013), which have proven their worth and become the basic building block of many deep learning architectures. Convolutions turn two-dimensional images into more abstract elements called activation maps, which hold the learned abstract representations of the images. CNNs can therefore learn image details such as contours, edges, and patterns from the pixel-intensity variations of the image (Almaadeed et al., 2021) while remaining robust to small transformations of the input (e.g., translation, scaling, skewing, and distortion) (LeCun et al., 1989). Unfortunately, convolutions typically carry a large memory footprint and an expensive computational cost.
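The idea that an activation map responds to pixel-intensity variations can be shown with a minimal sketch: a vertical-edge kernel convolved over a toy image fires only where intensity changes. This is a generic illustration of convolution, not code from the paper; the image and kernel values are invented.

```python
# Minimal 2D convolution sketch (pure Python): a vertical-edge kernel responds
# strongly where pixel intensities change, illustrating how an activation map
# encodes contours and edges.

def conv2d(image, kernel):
    """Valid-mode 2D convolution (no padding, stride 1) on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out

# Toy 4x5 "image": dark left region (0), bright right region (9).
image = [
    [0, 0, 9, 9, 9],
    [0, 0, 9, 9, 9],
    [0, 0, 9, 9, 9],
    [0, 0, 9, 9, 9],
]

# Sobel-like vertical-edge kernel.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

activation = conv2d(image, kernel)
# High values (27) straddle the intensity boundary; 0 in the uniform region.
```

Because the same kernel slides over every position, a small translation of the edge simply shifts the response in the activation map, which is the robustness to small transformations noted above.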
Since CNNs were initially designed for 2D image classification, handling the multiple frames involved in HAR required adding a temporal dimension to the architecture. This new architecture is known as a 3DCNN. As described before, this additional dimension further increases the computational cost (Aktı et al., 2020).
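A back-of-envelope count of multiply-accumulate (MAC) operations makes the cost of the extra dimension concrete. The layer shapes below (112x112 frames, 16-frame clips, 3x3 or 3x3x3 kernels, 64 output channels) are illustrative assumptions, not figures from the paper.

```python
# Rough MAC-operation comparison for one "same"-padded, stride-1 conv layer.
# All shapes are illustrative assumptions, not taken from the paper.

def conv2d_macs(h, w, c_in, c_out, k):
    # Each output pixel needs k*k*c_in MACs; "same" padding keeps h x w.
    return h * w * c_out * (k * k * c_in)

def conv3d_macs(t, h, w, c_in, c_out, k):
    # The kernel now also spans k frames, and the output keeps all t frames.
    return t * h * w * c_out * (k * k * k * c_in)

macs_2d = conv2d_macs(112, 112, 3, 64, 3)      # one frame
macs_3d = conv3d_macs(16, 112, 112, 3, 64, 3)  # a 16-frame clip

ratio = macs_3d // macs_2d  # t * k = 16 * 3 = 48x more work in this sketch
```

The ratio grows linearly with clip length and kernel depth, which is why per-frame pose features followed by a lightweight sequence model can be so much cheaper than a 3DCNN.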
Figure 1.
Process of pose estimation using OpenPose (Cao et al., 2019); (left) original image with the key points and connections; (right) only the key points and connections extracted from the first image
To address this increase in computational cost, the author's method translates the real-time video into a simpler and lighter representation, based on features more representative than pixel-intensity variation patterns (Javidani & Mahmoudi-Aznaveh, 2018), from which the neural network can learn the subtle cues that indicate violence.
For this purpose, the author's method uses OpenPose (Cao et al., 2019) to localize the anatomical key points of human bodies through what OpenPose calls part affinity fields (PAFs). PAFs learn to associate body parts with the individuals in the image, and the process achieves high accuracy at real-time performance (Kim & Lee, 2020). OpenPose (Cao et al., 2019) represents each individual through 18 key points (Figure 1).
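A sketch of what "18 key points per individual" looks like as data: OpenPose's 18-point (COCO-style) output gives each joint an (x, y, confidence) triple, which can be flattened into a compact per-frame feature vector. The normalization scheme below (centering on the neck, scaling by neck-to-hip length) is an illustrative assumption, not the paper's exact transformation; the toy pose is invented.

```python
# Sketch: turn one person's 18 OpenPose-style key points (x, y, confidence)
# into a compact feature vector.  Centering on the neck and scaling by torso
# length is an assumed normalization, not the paper's exact transformation.
import math

NOSE, NECK, R_HIP = 0, 1, 8  # indices in the 18-point COCO-style layout

def normalize_pose(keypoints):
    """keypoints: list of 18 (x, y, confidence) tuples -> flat feature list."""
    nx, ny, _ = keypoints[NECK]
    hx, hy, _ = keypoints[R_HIP]
    torso = math.hypot(hx - nx, hy - ny) or 1.0  # guard against zero length
    features = []
    for x, y, conf in keypoints:
        features.append((x - nx) / torso)  # translation-invariant x
        features.append((y - ny) / torso)  # translation-invariant y
    return features  # 36 values per person per frame

# Toy pose: 18 made-up key points.
pose = [(100.0 + i, 50.0 + 4 * i, 0.9) for i in range(18)]
feats = normalize_pose(pose)
```

Dividing by torso length makes the features roughly scale-invariant, so a person near the camera and one far away yield comparable vectors, which is exactly the kind of representativeness the method seeks over raw pixels.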
Figure 2. The network architecture: OpenPose (Cao et al., 2019) extracts the key points, which are then transformed into more representative features that feed a long short-term memory (LSTM) network
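The data flow implied by Figure 2 can be sketched as follows: per-frame pose feature vectors are grouped into fixed-length sliding windows, the sequence shape an LSTM consumes. The window length and stride below are illustrative assumptions, not the paper's settings, and the feature stream is a placeholder.

```python
# Sketch of the Figure 2 pipeline's sequence stage: per-frame pose features
# are collected into fixed-length sliding windows for a recurrent network.
# Window length and stride are illustrative assumptions.

def make_sequences(frame_features, window=16, stride=8):
    """frame_features: list of per-frame feature vectors -> list of windows."""
    sequences = []
    for start in range(0, len(frame_features) - window + 1, stride):
        sequences.append(frame_features[start:start + window])
    return sequences

# Toy stream: 40 frames, each with a 36-value pose feature vector
# (18 key points x 2 normalized coordinates).
stream = [[0.0] * 36 for _ in range(40)]
seqs = make_sequences(stream)
# 4 overlapping windows of shape (16, 36), starting at frames 0, 8, 16, 24.
```

Each (16, 36) window is tiny compared with a (16, 112, 112, 3) raw-pixel clip, which is where the lighter footprint relative to a 3DCNN comes from.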