Crime Detection and Criminal Recognition to Intervene in Interpersonal Violence Using Deep Convolutional Neural Network With Transfer Learning

Interpersonal violence, such as physical and sexual abuse, eve-teasing, bullying, and hostage-taking, is a growing concern in our society. The criminals who directly or indirectly commit these crimes often escape trial for lack of proper evidence, as it is very difficult to collect photographic proof of the incident. A subject's corneal reflection has the potential to reveal images of bystanders. Motivated by this clue, the current paper proposes a novel approach that uses a convolutional neural network (CNN) with transfer learning to identify crime and recognize criminals from the corneal reflection image of the victim, called the Purkinje image. This study found that off-the-shelf CNNs can be fine-tuned to extract discriminative features from very low-resolution and noisy images. The procedure is validated on datasets developed for this work, comprising six different subjects captured under diverse conditions. The experiments confirm that the system can recognize criminals from corneal reflection images with an accuracy of 95.41%.


INTRODUCTION
Interpersonal violence refers to the intentional use of physical force or power, threat, psychological torture, or deprivation by one person against another (Mayo Clinic, 2017). It is a growing concern in our society: children and adolescents are the most likely targets, but anyone can be a victim of this brutality, including women and disabled persons. These types of violence can be difficult to identify because it is very hard, and sometimes impossible, to collect evidence during the offense. Such crimes also often start subtly and get worse over time. For these reasons, victims may suffer lifelong consequences. For example, children who are bullied are more likely to suffer from clinical anxiety, depression, and feelings of social isolation. They tend to avoid school, and eventually this continuous chronic strain makes them physically sick (Nansel, Craig, Overpeck, Saluja & Ruan, 2004). Moreover, the chances of suffering from agoraphobia, panic disorder, and anxiety in adulthood are 3 to 5 times higher for these reasons (Copeland, Wolke, Angold & Costello, 2013).
Social awareness programs and the necessary laws are in place, but criminals often go unpunished for lack of proper evidence. In all the above-mentioned incidents, photographs play a very crucial role in criminal investigations. If photographs (face images) can be taken during the incident, they provide visual knowledge of the offender as well as of the surrounding environment, which is desirable for identifying both the crime and the criminals. Recently, Jenkins and Kerr (2013) performed a primitive investigation through conventional methods on the detection of bystanders from the corneal reflections of a subject, known as Purkinje images (Sigut & Sidha, 2011). However, a detailed study from a practical perspective has not yet been performed. Thanks to technological advancements, capturing these images is now possible with an IoT-based tiny camera, and they can be processed using deep learning. Motivated by this research gap, we investigate criminal identification from corneal images with these advanced technologies. The current paper focuses on the crucial task of detecting and recognizing the faces of criminals from Purkinje images, as these are of low resolution and poor quality.
Moreover, there are many variations in image appearance, such as pose variation (frontal, non-frontal), occlusion, image orientation, illumination conditions, and facial expressions. Nowadays, computer vision has numerous applications in every sector of our lives, such as load forecasting for smart grids (Mukherjee, Mukherjee, Dey, De & Panigrahi, 2020), prediction of possible health issues (Mukherjee & Mukherjee, 2019), and intelligent sensing (Mukherjee, Panja & Dey, 2020).
Recently, deep convolutional neural networks (CNNs) have shown promising results in object detection and classification (Chen, Li & Li, 2020). The deep CNN approach has brought significant improvements in face detection (Garcia & Delakis, 2002; Osadchy, Cun & Miller, 2007), facial point detection (Sun, Wang & Tang, 2013), human attribute inference (Zhang, Paluri, Ranzato, Darrell & Bourdev, 2014), plant phenotyping (Jiang & Li, 2020), organism detection (Huang et al., 2019), and many other applications. These models achieve state-of-the-art results on several datasets and can enhance the success rate of criminal identification compared with conventional and other shallow machine learning methods. However, no such work has been done so far using deep CNNs. A method is therefore proposed here that takes face images from the corneal reflections and, after pre-processing, trains off-the-shelf CNNs with transfer learning.
The contributions of this work are summarized as follows:
1. Criminal identification and recognition through benchmark CNN methods are implemented to help reduce interpersonal violence.
2. Transfer learning is applied to retrain the CNN models, handling the problem with less effort and improved accuracy.
3. A dataset of 6000 unconstrained images of six different subjects (1000 per category) is developed.
4. The robustness of the technique is tested using noise, occlusion, rotation, and low-resolution images.
The rest of the paper is organized as follows: a literature review is presented in Section II. Section III describes the methodology along with the dataset. Experimental results and discussion are presented in Section IV. Finally, the paper is concluded in Section V.

LITERATURE REVIEW
As photographs are very crucial in criminal investigations, cameras are routinely seized during investigations. Sometimes it is desirable to identify the photographer or other individuals who were present at the scene or its surroundings but do not appear directly in the photographed images. Criminal identification and recognition are particularly important in those cases where the criminal photographs the victim.
In this perspective, Nishino and Nayar (2006) proposed a comprehensive framework capable of extracting different types of information embedded within a single catadioptric image of the eyes (the image formed by the combination of the cornea and a camera viewing the eyes). They computed an environment map from the reflections of the cornea as well as the iris. This map can recover the surrounding world of the person, revealing what the person is looking at. As a result, their model can determine the location and circumstances of the person at the time the image was captured. The environment map obtained from the corneal reflection enables several applications, such as scene panorama reconstruction (Nakazawa, Nitschke & Nishida, 2016), biometrics (Bowyer, Hollingsworth & Flynn, 2008), creation of a 3D model (Nishino & Nayar, 2006), super-resolution imaging (Nitschke & Nakazawa, 2012), illumination normalization in face recognition (Nakazawa & Nitschke, 2012), and point-of-gaze estimation (Kar & Corcoran, 2017; Nitschke, Nakazawa & Nishida, 2013; Ogawa, Nakazawa & Nishida, 2018). A study was conducted by Wan and his co-authors to remove reflection components from a mixture image and recover the reflected scene (Wan, Shi, Li, Duan & Kot, 2020). This is especially beneficial for surveillance and criminal investigations.
A large number of psychological and image-based studies have shown that humans can be identified from low-resolution photographs and videos if the identifier is already familiar with that particular face (Burton, Wilson, Cowan & Bruce, 1999; Jenkins & Kerr, 2013; Marciniak, Dabrowski, Chmielewska & Weychan, 2012). Yip and Sinha (2001) proposed a model that could identify dim photographs of familiar faces at a resolution as low as 7×10 pixels. Jenkins and Kerr (2013) used corneal reflection images to extract information hidden in an image. They experimented with constrained-environment (high-resolution) images and segmented out the images of bystanders from the reflections in the subjects' eyes. With manual experiments they obtained 71% accuracy when the participants were unfamiliar with the bystander's face, and an average accuracy of 84% when the face was familiar.
CNNs (or ConvNets), among the most extensively used deep learning architectures, have achieved state-of-the-art accuracy on face detection, recognition, and classification tasks (Albu, 2009; Khalajzadeh, Mansouri & Teshnehlab, 2014; Jiang & Learned-Miller, 2017; Khan, Sohail, Zahoora & Qureshi, 2019). Xinhua and Qian (2015) proposed a deep neural network-based face detection method to improve biometric recognition. They experimented with more than 13,000 face images and achieved a classification accuracy of more than 97%. Kamencay (Kamencay, Benčo, Miždoš & Radil, 2017) proposed a CNN-based face recognition method that can identify faces with a recognition accuracy of 98.3%. Kwolek (2005) proposed a combined architecture of Gabor filters and a CNN and obtained a detection accuracy of 87.5% on a sample dataset containing 1000 face and 10,000 non-face images. A study found that image degradation drops the classification accuracy of CNN models significantly (Pei, Huang, Zou, Zhang & Wang, 2019). Some studies aim at identifying faces from tiny (Chang, Lu, Liu, Zhou & Qiao, 2020), distorted (Dodge & Karam, 2016), or noisy (Wu, He, Sun & Tan, 2018) images. However, little work has been done on criminal detection from corneal reflection images. Hence, the present paper tries to fill this gap with a deep CNN-based automated system using transfer learning. A summary of these works is given in Table 1.

The Proposed Scheme
Our work aims to recognize criminals from Purkinje images. To do this, we first prepared the training and testing datasets. The training dataset is fed directly into the augmentation process to increase the data size. Figure 1 shows the step-wise schematic diagram of the proposed methodology.

Dataset Preparation
For the experiment, six volunteers (5 male, 1 female, aged between 23 and 26 years) were selected. They gave their consent to use and publish their photographs for research purposes only. We collected a total of 6000 unconstrained images of the six subjects (1000 of each). 80% of these images are used for training and the remaining 20% as the validation set for 5-fold cross-validation. For constructing the test dataset, a total of 3000 images (500 per subject) are captured in such a way that every image contains a criminal in the subject's corneal region. Some sample images and the sequential flow diagram of the preparation of a testing image are shown in Figure 2 and Figure 3, respectively. Only the training data is augmented, to improve accuracy and reduce overfitting. Cropping, shifts, random rotations, and flips are applied to each category during augmentation; a sketch of this step is given below. After augmentation, the size of the training dataset becomes 48,000.
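The snippet below is a minimal sketch of such an augmentation step using Keras' ImageDataGenerator. The directory name, image size, and the specific rotation, shift, and flip ranges are illustrative assumptions, not the exact settings used in the experiments.

```python
# Illustrative augmentation sketch (Keras). The ranges below are assumptions
# chosen for demonstration; the paper's exact augmentation parameters may differ.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=15,          # small random rotations
    width_shift_range=0.1,      # horizontal shifts
    height_shift_range=0.1,     # vertical shifts
    zoom_range=0.1,             # approximates random cropping
    horizontal_flip=True,       # random flips
    validation_split=0.2        # 80/20 train/validation split as in the text
)

train_flow = augmenter.flow_from_directory(
    "dataset/train", target_size=(224, 224),
    batch_size=32, class_mode="categorical", subset="training")
val_flow = augmenter.flow_from_directory(
    "dataset/train", target_size=(224, 224),
    batch_size=32, class_mode="categorical", subset="validation")
```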
The dataset developed in this work is available at https://github.com/reduan/Criminal-Identification-Dataset.git

Data Pre-Processing
After preparing the datasets, all test images are pre-processed as they include a huge amount of noise including iris texture contamination, shadows of eyelash and eyelid, non-linear distortion from the reflection of a curved surface, etc.
The graph-based visual saliency (GBVS) method is used to highlight the "remarkable" positions where the image is informative (Harel, Koch & Perona, 2007). Regions containing 60% of the saliency are cropped out; the remaining regions are considered background and removed. After that, a Gaussian high-pass filter is used to enhance the object present in the scene. If $D(a,b)$ denotes the distance of the point $(a,b)$ from the center of the frequency plane and $\sigma$ is the cutoff frequency, then the filter is defined by Eq. (1) (Asadi, Jamzad & Sajedi, 2008):

$$H(a,b) = 1 - e^{-D^{2}(a,b)/2\sigma^{2}} \qquad (1)$$

Eq. (2) is applied to generate the high-pass filtered image $X'$ from the input image $X$, where $F\{\cdot\}$ denotes the Fourier transform:

$$X' = F^{-1}\{F\{X\}\,H(a,b)\} \qquad (2)$$

A fuzzy filter, given in Eq. (3), with a kernel size of 3×3 pixels is applied to reduce Gaussian noise while keeping other features intact (Rahman, Haque, Rozario & Uddin, 2014), where $f_{max}$ and $\sigma$ denote the maximum intensity value among the 8 neighboring pixels and the standard deviation of all intensity values, respectively. Since color images are used here, this filter is applied separately to each of the R, G, and B components. A discrete wavelet transform based resolution enhancement is then applied to increase the resolution of the images (Khaire & Shelkikar, 2013). Finally, the resulting pre-processed images, which contain only a passport-style photograph of the criminal, are fed to the training and testing processes.
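As an illustration of the frequency-domain filtering step of Eqs. (1)-(2), the following NumPy sketch applies a Gaussian high-pass filter to a single image channel. The cutoff value `sigma` is an illustrative assumption.

```python
# Sketch of the Gaussian high-pass filtering step (Eqs. 1-2), applied per channel.
# The cutoff sigma is an illustrative assumption, not the paper's setting.
import numpy as np

def gaussian_highpass(channel, sigma=30.0):
    rows, cols = channel.shape
    a = np.arange(rows)[:, None] - rows / 2.0
    b = np.arange(cols)[None, :] - cols / 2.0
    d2 = a ** 2 + b ** 2                        # squared distance D^2(a, b) from the centre
    h = 1.0 - np.exp(-d2 / (2.0 * sigma ** 2))  # Gaussian high-pass transfer function (Eq. 1)
    spectrum = np.fft.fftshift(np.fft.fft2(channel))          # F{X}, centred spectrum
    filtered = np.fft.ifft2(np.fft.ifftshift(spectrum * h))   # inverse transform (Eq. 2)
    return np.real(filtered)                    # high-pass filtered image X'
```

For a color image, the same function would be applied separately to the R, G, and B channels, mirroring how the fuzzy noise filter is applied per channel in the text.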

Deep CNN With Transfer Learning
The augmented training images and pre-processed test images are used to retrain the models through transfer learning. Deep learning models need huge amounts of labeled data to achieve high recognition accuracy, and collecting such data is very expensive. For this reason, training CNNs on scarce datasets through transfer learning, initiated by Thrun (1996), is a suitable alternative. He investigated the role of previously learned features in applications with scarce training data. The dataset is retrained with 4 well-known deep CNNs: VGG19, InceptionV3, MobileNet, and NasNetMobile. A brief description of the architecture of these 4 deep CNNs is given in Table 2.
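A minimal Keras sketch of this transfer-learning setup is shown below: a MobileNet backbone pre-trained on ImageNet with a new six-class classification head. The layer-freezing policy, head size, and optimizer settings are assumptions for illustration rather than the exact configuration used in the paper.

```python
# Minimal transfer-learning sketch with a pre-trained MobileNet backbone.
# Freezing policy, head size, and optimiser settings are illustrative assumptions.
from tensorflow.keras.applications import MobileNet
from tensorflow.keras import layers, models

base = MobileNet(weights="imagenet", include_top=False,
                 input_shape=(224, 224, 3), pooling="avg")
base.trainable = False                        # keep the ImageNet features fixed at first

model = models.Sequential([
    base,
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(6, activation="softmax")     # six subjects in the developed dataset
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_flow, validation_data=val_flow, epochs=20)
```

The other three backbones (VGG19, InceptionV3, NasNetMobile) can be swapped in the same way by replacing the `MobileNet` import.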

Performance Metric
The overall performance of the system is measured from the confusion matrix in terms of accuracy, computed as the ratio of correctly classified images to the total number of images, as in Eq. (4):

$$\text{Accuracy} = \frac{\text{Correctly classified number of images}}{\text{Total number of images}} \times 100 \qquad (4)$$

For the binary view of each class, the entries of the confusion matrix are defined as follows:
True Positive (TP) - a criminal is identified as a criminal;
True Negative (TN) - an innocent person is identified as an innocent person;
False Positive (FP) - an innocent person is identified as a criminal;
False Negative (FN) - a criminal is identified as an innocent person.
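The short sketch below shows how the overall accuracy of Eq. (4) can be obtained from a multi-class confusion matrix; the matrix values are purely illustrative and are not the results reported in Table 5.

```python
# Overall accuracy from a multi-class confusion matrix (Eq. 4):
# correctly classified images lie on the diagonal.
import numpy as np

def overall_accuracy(confusion_matrix):
    correct = np.trace(confusion_matrix)   # correctly classified number of images
    total = confusion_matrix.sum()         # total number of images
    return 100.0 * correct / total

# Illustrative 3-class example (not the matrix reported in the paper):
cm = np.array([[480,  10,  10],
               [ 15, 470,  15],
               [  5,  20, 475]])
print(f"Accuracy: {overall_accuracy(cm):.2f}%")
```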

Experimental Results
The system is tested under varying conditions, such as noise, illumination, occlusion, and pose, to check the performance and robustness of the different CNNs. A sketch of the noise corruption used in these tests is given below.
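The following sketch shows how Gaussian and salt-and-pepper corruption of the kind illustrated in Figures 4 and 7 might be generated with scikit-image; the parameter names follow scikit-image's `random_noise`, and the values are illustrative assumptions rather than the exact noise levels reported in Tables 6 and 7.

```python
# Sketch of the noise corruption used to probe robustness (cf. Figures 4 and 7).
# Noise levels are illustrative; images are expected as float arrays in [0, 1].
from skimage.util import random_noise

def corrupt(image, kind="gaussian"):
    if kind == "gaussian":
        # additive Gaussian noise with zero mean
        return random_noise(image, mode="gaussian", mean=0.0, var=0.4)
    # salt & pepper noise: 'amount' is the fraction of corrupted pixels
    return random_noise(image, mode="s&p", amount=0.4)
```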

Discussion
Table 3 shows the comparative performance of the InceptionV3, VGG19, MobileNet, and NasNetMobile models on rotated (both clockwise and anticlockwise) images; MobileNet shows the highest performance. Table 4 demonstrates how the CNN models perform for criminal identification and recognition under uneven illumination along with pose variations (full profile and 3/4 profile). It shows that the recognition rate decreases by almost 10% when the illumination condition changes for the 3/4-profile and full-profile poses. Table 6 compares the performance of InceptionV3, NASNetMobile, VGG19, and MobileNet on images with salt-and-pepper noise. This result confirms that VGG19 and MobileNet perform better across noise densities than InceptionV3 and NASNetMobile. Table 7 shows a similar result for Gaussian noise, where MobileNet performs best. The experimental results confirm two things: (i) MobileNet is more accurate than the other methods under diverse illumination, occlusion, noise, and pose, and (ii) the robustness of the system is ensured. Some visual results are illustrated in Figure 4 and Figure 7. The confusion matrix for MobileNet, which gives the best performance, is shown in Table 5.
Figure 5 shows some incorrect recognition results. The misclassifications are mainly due to the difficulty of acquiring high-resolution test images: the images contain noise, including iris texture contamination, shadows of the eyelashes and eyelids, and non-linear distortion from reflection off a curved surface, as well as inter-class similarities. The accuracy and loss curves for MobileNet are shown in Figure 6.
The overall classification accuracy of the proposed criminal identification system for the different CNN models after transfer learning is shown in Table 8. The accuracy on the test dataset is calculated with and without pre-processing, and it improves considerably after pre-processing. Our extensive experiments show that MobileNet gives better results than the other CNNs. First of all, the proposed methodology uses non-contact advanced technology and is therefore preferable to conventional methods. Moreover, MobileNet has fewer parameters than the other networks, so it avoids overfitting more easily, which is not assured for the other CNNs. Its computation time is also much smaller than that of the other off-the-shelf CNNs.

CONCLUSION
As interpersonal violence is one of the major threats to our society, attention is given to intervening through image processing techniques. This paper used transfer learning to identify crime and recognize criminals. According to our experimental analysis, it is evident that pre-trained CNN models, originally trained for other object recognition tasks, can be effectively fine-tuned through transfer learning to identify criminals. We experimented with 4 different CNN models: VGG19, InceptionV3, MobileNet, and NasNetMobile. Among these pre-trained models, MobileNet attained the highest accuracy (95.41%). Nevertheless, some challenges remain in applying deep learning to identify criminals from corneal reflection images: (i) increasing the robustness of the scheme (especially for images captured from a distance) to enhance its identification accuracy, (ii) integrating images from both eyes to enhance the quality of the retrieved images as well as the identification accuracy, and (iii) investigating the practicability of an IoT-based tiny camera system to acquire and detect interpersonal violence by analyzing human emotions.

Informed Consent
Consent has been obtained from the six volunteers to use and publish their photographs only for research purposes.

Funding
No funding was received.

Conflict of Interest
The authors declare that they have no conflict of interest.

Figure 1. Schematic diagram of the proposed system.
Figure 2. Sample images from the developed dataset.

Figure 3. Flow diagram of the preparation of testing images: (a) original scene image of the criminal obtained from the cornea region of the victim after zooming and cropping; (b) image after saliency detection; (c) resultant passport-style photograph of the criminal after removing the background; (d) pre-processed image after Gaussian noise reduction and discrete wavelet transform based resolution enhancement; (e) final image for input to the deep CNN.


Figure 4. Images of an object under different noise levels: (a) original image; (b) image captured from the cornea; (c) image corrupted by Gaussian noise with density d = 0.5 (MobileNet and VGG19 recognized this face correctly, but InceptionV3 and NASNetMobile failed); (d) image corrupted by salt-and-pepper noise with mean = 0 and variance v = 0.4 (MobileNet and InceptionV3 recognized this face correctly, but VGG19 and NASNetMobile failed).

Figure 5. Some error-prone images from the test dataset. Subjects (a), (b), and (c) are classified correctly by MobileNet, but subjects (d) and (e) are misclassified due to noise and partial occlusion.

Figure 6. Accuracy and loss versus training epoch for MobileNet.

Table 8. Overall classification accuracy of the CNN models with and without pre-processing.
Figure 7. Error-prone images of two subjects from the test dataset under different noise levels: (a) original image; (b) image captured from the cornea; (c) image corrupted by Gaussian noise with density d = 0.7; (d) image corrupted by salt-and-pepper noise with mean = 0 and variance v = 0.7 (none of InceptionV3, MobileNet, VGG19, or NASNetMobile recognized this face).