1. Introduction
The analysis of extremely dense crowds is essential to public safety. By predicting or warning of potentially hazardous incidents such as panic, casualties can be reduced or avoided. Crowd counting techniques provide the real-time number of pedestrians within the footage, which is crucial information for preventing stampedes. Conventional computer vision-based techniques for crowd counting extract features such as HOG (Xu et al., 2016), contour (Dong et al., 2007; Weikert et al., 2020) and spatial-temporal information (Wang, 2019) from image patches obtained with a sliding window, and feed these features to classifiers such as SVM (Xu et al., 2016; Tu et al., 2013; Zhao et al., 2017), random forest (Li & Zhou, 2016; Pham et al., 2015) and Markov models (Jalal et al., 2020) to determine whether a pedestrian exists in each patch. Once detection over the entire footage is completed, the total number of detected pedestrians is obtained. The major defect of conventional approaches is their poor performance at high crowd density: as the density increases, the pixel-wise information available for each pedestrian decreases drastically and more occlusions occur. Accurate detection of individuals then becomes difficult, causing a significant performance degradation.

To tackle these issues, regression-based approaches establish relations between the crowd distribution and certain global features of the entire footage, and estimate the total crowd count from them. Arteta et al. (2014) first introduced the concept of the density map, obtained by convolving the pedestrians' spatial positions in the training data with a Gaussian kernel. In the training phase, the extracted features and density maps are used to train the decoding model. In the testing phase, features are fed to the trained model to decode the density map, from which the crowd count is estimated.
This technique effectively addressed the problem of occlusion at high density, and inspired deep learning-based crowd counting techniques.
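The density-map construction described above can be illustrated with a minimal sketch. The helper below is hypothetical (not from any cited paper); it assumes only NumPy and SciPy. Each annotated head position contributes a unit impulse, and Gaussian smoothing spreads that mass spatially, so the map integrates to the crowd count:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def density_map(points, shape, sigma=4.0):
    """Build a ground-truth density map from annotated head positions.

    points: iterable of (x, y) pixel coordinates of pedestrian heads
    shape:  (height, width) of the output map
    sigma:  standard deviation of the Gaussian kernel, in pixels
    """
    dm = np.zeros(shape, dtype=np.float64)
    for x, y in points:
        # Place a unit impulse at each head position inside the frame.
        if 0 <= y < shape[0] and 0 <= x < shape[1]:
            dm[int(y), int(x)] += 1.0
    # Blurring with a normalized Gaussian preserves total mass, so the
    # integral of the map equals the number of annotated pedestrians.
    return gaussian_filter(dm, sigma=sigma, mode='constant')

# Three annotated heads -> the density map sums to (approximately) 3.
heads = [(30, 40), (31, 42), (100, 20)]
dm = density_map(heads, (128, 128))
print(dm.sum())
```

In practice, some works vary `sigma` per head (e.g. with the distance to neighboring heads) to account for perspective, but a fixed-width kernel suffices to convey the idea.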
The structure of deep learning-based techniques usually comprises a front-end (feature extraction) network and a back-end (density map generation) network (Cao et al., 2018; Li et al., 2018; Liu et al., 2019; Karthika, 2021; Ranjan et al., 2018; Sindagi & Patel, 2017; Zhang et al., 2016). The front-end network extracts multi-scale features from the image data, while the back-end network decodes these features into a density map. Instead of extracting patches with a sliding window, deep learning-based approaches process the entire image, enabling end-to-end training. As a result, their processing speed is often much faster than that of conventional methods, and their counting accuracy also surpasses conventional methods in most cases.
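As a toy illustration of this two-stage structure (a sketch only, not any of the cited architectures; the kernels are random and untrained, and all function names are hypothetical), a front-end can extract multi-scale feature maps with kernels of different sizes, and a back-end can fuse them into a single density map whose integral gives the count estimate:

```python
import numpy as np

def conv2d(x, k):
    """'Valid' cross-correlation of a single-channel map x with kernel k."""
    kh, kw = k.shape
    h, w = x.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def front_end(img, kernels):
    """Extract multi-scale features: one ReLU feature map per kernel size."""
    return [np.maximum(conv2d(img, k), 0.0) for k in kernels]

def back_end(feats, weights):
    """Decode features into one density map via a weighted fusion,
    cropping all maps to the smallest common spatial size."""
    h = min(f.shape[0] for f in feats)
    w = min(f.shape[1] for f in feats)
    return sum(wgt * f[:h, :w] for wgt, f in zip(weights, feats))

rng = np.random.default_rng(0)
img = rng.random((32, 32))                                   # stand-in image
kernels = [rng.standard_normal((s, s)) for s in (3, 5, 7)]   # multi-scale
feats = front_end(img, kernels)
dmap = back_end(feats, [0.1, 0.1, 0.1])
count = dmap.sum()   # the count estimate is the integral of the density map
```

Real systems replace these hand-rolled loops with learned convolutional layers (e.g. a VGG-style front-end and dilated-convolution back-end) trained end-to-end against ground-truth density maps, but the data flow is the same: image in, multi-scale features, density map out, count by summation.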