An Optimization Algorithm for the Uncertainties of Classroom Expression Recognition Based on SCN

With the gradual application of facial expression recognition (FER) technology in various fields, facial expression datasets based on specific scenes have gradually increased, effectively improving application performance. However, facial images of students collected in real classroom scenes often suffer from problems such as front and rear occlusion, blurred images, and small targets. Moreover, current student classroom expression recognition technology faces several challenges as a result of sample uncertainties. Therefore, this paper proposes an optimization algorithm for these uncertainties based on SCN. The correction weight of each sample is calculated from its sample weight, and a loss function is designed according to the correction weight. A dynamic threshold is then obtained by combining the correction weight with the threshold in the noise relabeling module. Experimental results on public datasets and a self-built classroom expression dataset show that the optimization algorithm effectively improves the robustness of SCN to uncertain samples.


INTRODUCTION
Facial expressions are among the most natural, powerful, and pervasive signals humans use to express emotional states and intentions (Darwin & Prodger, 1998; Tian et al., 2001). As a result, facial expression recognition (FER) is widely used in social robotics, medical care, and driver fatigue monitoring. With the development of deep learning, the research focus of FER has shifted from shallow features to deep features, and researchers have made significant progress in FER by improving algorithms and applying large-scale datasets.
Despite the powerful feature learning capabilities of deep learning, problems still exist when it is applied to FER. Since deep neural networks require a large amount of training data to avoid overfitting, facial expression data collected in the laboratory can no longer meet the demand; hence, dataset collection has gradually shifted from the laboratory to the wild. This guarantees the large amounts of data required for deep learning, as in AffectNet (Mollahosseini et al., 2017), RAF-DB (S. Li et al., 2017), and ExpW (Zhang et al., 2018). However, during the construction of in-the-wild datasets of traditional or scene-specific facial expression categories, low-quality facial images, ambiguous facial expressions, and the subjectivity of annotators can lead to uncertainty in expression labels, which forms a key challenge of FER in the deep learning era.
In general, uncertain samples cause problems during training. First, they may lead the model to overfit uncertain samples. Second, they hinder the model's learning of facial features from reliable samples. Third, a high proportion of uncertain samples affects the convergence of the model early in training. Moreover, with the application of FER technology in various fields, traditional facial expression classifications and datasets can no longer meet practical needs, and more researchers are building facial expression datasets for specific occasions and reclassifying expressions. These datasets may suffer from uneven distribution of expressions, poor image quality, and ambiguous definitions of expressions, making the problem of label uncertainty even more prominent.
To solve the problem of sample uncertainty, Wang et al. (2020) proposed the Self-Cure Network (SCN) to suppress uncertainty in large-scale facial expression recognition. SCN consists of three modules: self-attention importance weighting, rank regularization, and noise relabeling. SCN first extracts facial features through a backbone convolutional neural network (CNN), and the self-attention importance weighting module then calculates a weight for each sample according to its facial features. The rank regularization module sorts the sample weights in descending order, divides them proportionally into high- and low-importance groups, and calculates the rank regularization loss (RR_Loss). Finally, the noise relabeling module changes the labels of low-importance samples when the difference between the maximum predicted probability and the probability of the given label is greater than a threshold.
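The three SCN modules described above can be sketched as follows. This is a minimal illustration based on the description in this paper, not the authors' implementation; the margin used in RR_Loss, the 7:3 grouping ratio, and the tensor shapes are assumptions.

```python
import numpy as np

def scn_step(features, labels, logits, attn_w, beta=0.7, margin=0.15, threshold=0.2):
    """One SCN-style pass over a batch: weight, rank-regularize, relabel.

    features: (N, D) facial features from the backbone CNN
    labels:   (N,) given (possibly noisy) integer labels
    logits:   (N, C) classifier outputs
    attn_w:   (D,) parameters of the attention FC layer
    """
    # 1) Self-attention importance weighting: one sigmoid weight per sample.
    alpha = 1.0 / (1.0 + np.exp(-(features @ attn_w)))

    # 2) Rank regularization: sort weights in descending order, split into
    #    high/low-importance groups, and penalize the case where the group
    #    means are not separated by at least a margin.
    order = np.argsort(-alpha)
    n_high = int(len(alpha) * beta)
    high, low = order[:n_high], order[n_high:]
    rr_loss = max(0.0, margin - (alpha[high].mean() - alpha[low].mean()))

    # 3) Noise relabeling: for low-importance samples, if the top predicted
    #    probability exceeds the given label's probability by more than the
    #    threshold, replace the label with the top prediction.
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)  # softmax
    new_labels = labels.copy()
    for i in low:
        if probs[i].max() - probs[i, labels[i]] > threshold:
            new_labels[i] = probs[i].argmax()
    return alpha, rr_loss, new_labels
```

A confidently mispredicted low-importance sample is relabeled to its predicted class, while high-importance samples are left untouched.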
SCN effectively suppresses uncertainty in the sample labels but is too cautious in setting the threshold. The fixed threshold reduces how often the noise relabeling module is triggered, making the network less robust to noise. This paper proposes a simple and effective optimization algorithm (i.e., a correction strategy) based on SCN to improve the network's robustness to label noise. First, the correction weights of the samples are calculated from the sample weights, and the correction weights are combined with the samples' prediction results to obtain the correction strategy loss. Second, the correction weights are combined with the threshold of the noise relabeling module to generate new dynamic thresholds, so that each sample has its own threshold.
The rest of the paper is structured as follows. First, we summarize the current domestic and international research on FER, student classroom facial expression recognition, and uncertainty in FER. We then introduce the correction strategy and its components. Finally, we present the experimental verification and analysis and summarize the results.

BACKGROUND
This section discusses existing work on facial expression recognition, student classroom facial expression recognition, and uncertainty in facial expression recognition, which is closely related to the proposed method.

Facial Expression Recognition
The three main steps in a FER system are (1) face detection, (2) feature extraction, and (3) facial expression classification. In the face detection stage, face detectors such as MTCNN (Zhang et al., 2016) and YOLO5Face (Qi et al., 2021) are commonly used to locate faces in complex scenes. YOLO5Face is a face detector implemented by Qi et al. based on the YOLOv5 object detector; it regards face detection as a general object detection task and adds a regression head containing five landmarks to achieve better face detection. Feature extraction is the core step in FER. Methods are based mainly on shallow features or deep features. Shallow-feature methods mainly include Gabor (Fogel & Sagi, 1989), LBP (Ahonen et al., 2004), and sparse learning (Shojaeilangari et al., 2015). Deep-feature methods are mainly improved network models built on backbones such as CNNs, RNNs, and GANs (Goodfellow et al., 2020). The integrity of features is an extremely important factor affecting FER accuracy. Shallow-feature methods rely heavily on obvious and complete image features; when environmental factors affect the integrity of face image features, recognition performance is poor. Deep learning has a powerful feature learning ability and can better extract facial features in complex environments. After learning the facial features, the last step of the FER system is expression classification. Deep networks adjust the backpropagation error by adding a loss layer at the end of the network, and the network can then directly output the predicted probability of each label. In CNNs, softmax loss is the most commonly used function, minimizing the cross-entropy between the predicted results and the targets. Tang (2013) demonstrated the advantage of end-to-end training with a linear support vector machine (SVM), minimizing a margin-based loss instead of cross-entropy. Similarly, Dapogny and Bailly (2018) studied deep neural forests (NFs) (Kontschieder et al., 2015), in which they replaced the softmax loss layer with NFs and achieved competitive results in FER.
The softmax loss layer keeps the features of different classes separated; however, facial expressions in real-world scenes show high interclass similarity and intraclass variation. Cai et al. (2018) proposed Island Loss, which reduces intraclass differences while expanding interclass differences to enhance the discriminative power of deep features. Furthermore, Zeng et al. (2018) proposed a new feature loss inspired by the similarity between shallow and deep features: the information of shallow features is embedded into the network's training process to provide complementary information to deep features in the early training stage.

Student Classroom Facial Expression Recognition
In the task of FER in students' classrooms, Fan et al. (2012) designed a FER system for video streams. The facial expression detection part of the system is realized by an algorithm based on facial skin color information and template matching. Chen (2013) designed a FER system for mobile learning based on a client-server architecture. The system locates feature points of key facial parts through an improved active shape model and extracts face shape and texture feature information. Li (2018) studied recognition methods for learning concentration in education and designed an AdaBoost face recognition algorithm based on Haar features that is effective for frontal face recognition. The results showed that the eyes and mouth are the most obvious expression features in the learning process. Finally, the fuzzy reasoning method was applied to FER.
The above methods are based mainly on shallow-feature FER for students in classrooms, and they are clearly insufficient in terms of recognizable classroom expression categories and accuracy. With the in-depth research and wide application of deep learning, FER based on deep features has made great progress. Zhou et al. (2017) introduced a combination of gradient boosting and CNNs to describe features of students' facial expressions: they trained a neural network, used it as an image feature extractor, and mapped features to a higher-dimensional space using gradient-boosted decision trees. Pan et al. (2021) designed a classroom teaching feedback system based on FER, which integrated deep learning and SVM algorithms to effectively improve multi-target FER. James and Riri (2019) used unobtrusive emotion detection to record student facial expressions with an RGB-Depth Microsoft Kinect camera, taking into account student convenience, response time, and cost efficiency; an Adaptive Network-Based Fuzzy Inference System algorithm was used to recognize the facial expressions.
Deep learning has significantly improved classroom expression recognition; however, current classroom expression recognition technology lags behind the mainstream FER methods of the same period. At the same time, it is difficult to standardize the classification of classroom expression categories, and the field lacks open databases. In addition, in real classroom scenarios, large numbers of people and poor-quality raw data often cause small targets, blurred faces, and serious occlusion. These lead to sample uncertainties, making classroom expression recognition research challenging.

Research on Uncertainty in Facial Expression Recognition
Uncertainty in FER tasks mainly results from blurred facial expressions, low-quality facial images, and label noise. In recent years, learning with noisy labels has been extensively studied in computer vision. To deal with noisy labels, Li et al. (2017) proposed a unified distillation framework that uses "side" information, including a small clean dataset and label relations in a knowledge graph, to hedge the risk of learning from noisy labels. Veit et al. (2017) used a multi-task network to jointly learn to clean noisy annotations and classify images. For the FER task, Bjørnsten and Zacher Sørensen (2017) analyzed how uncertainty in face image processing is caused by issues such as temporality and static images. Wu et al. (2018) proposed a variant of maxout activation called Max-Feature-Map (MFM). Unlike common maxout activations, MFM is achieved through a competitive relationship; it can not only separate the noise signal from the information signal but also perform feature selection between two feature maps. Zeng et al. (2018) first considered the problem of labeling inconsistencies between different FER datasets and proposed exploiting these uncertainties to improve FER. Fan et al. (2020) proposed Rayleigh loss to simultaneously minimize the intraclass distance and maximize the interclass distance, measuring the uncertainty of given samples by adding weights to the softmax function.
The SCN proposed by Wang et al. reduces the negative impact of low-quality samples on model training by relabeling uncertain samples. However, the RR_Loss in SCN is calculated from the means of the sample weights of the high-importance and low-importance groups, which cannot measure the gap between the predicted values and the targets well. Moreover, the threshold in the noise relabeling module is fixed and does not change dynamically with the weight of the relabeled sample, weakening the function of this module. SCN makes limited use of sample weight information, which serves only for sample grouping and RR_Loss calculation. In addition, the rank regularization module and the noise relabeling module are only loosely connected.

Correction Strategy
This paper proposes a simple and effective correction strategy to improve the robustness of the model to uncertain samples. This section first outlines the idea of the correction strategy and then introduces the two improvement directions for SCN and their detailed implementation.

Overview of Correction Strategy
The correction strategy is an improved algorithm based on SCN. The correction weights calculated from the sample weights are applied to the loss function of the rank regularization module and to the noise relabeling module, as shown in Fig. 1. The SCN in the figure has a simplified structure, and the red arrows indicate the improvements of the correction strategy over the original SCN. This paper first calculates a correction weight from the weight of each sample in the self-attention importance weighting module. Then, in the rank regularization module, the correction strategy loss function (CS_Loss) is calculated from the sample prediction results and correction weights to replace the original RR_Loss. In the noise relabeling module, the correction weights are combined with the threshold to generate new thresholds that change dynamically across samples. The correction strategy adds no extra computational cost and effectively improves the robustness of SCN to uncertain samples in a simple way.

Correction Weights
To suppress the importance of uncertain samples without affecting reliable samples, this paper designs correction weights to measure the impact on the model when samples are predicted incorrectly. The correction weight cw_i of the i-th sample is calculated from the sample weight a_i produced by the self-attention importance weighting module, where W_a represents the parameters of the fully connected attention layer and σ is the sigmoid function. This paper holds that a smaller loss should be incurred when a high-weight sample is predicted incorrectly and, conversely, a larger loss when a low-weight sample is predicted incorrectly. Using the correction weight information, a correction strategy loss function is designed over the M wrongly identified samples, where i indexes the i-th wrongly identified sample. During model training, the total loss combines the correction strategy loss with L_WCE, the logit-weighted cross-entropy loss used in SCN, through a trade-off ratio γ, where W_j is the j-th classifier. L_WCE is positively correlated with a, as suggested in Liu et al. (2017).
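Since the equations themselves did not survive typesetting here, the following sketch shows one way the quantities defined above, together with the dynamic threshold used in the next section, could fit together. The concrete forms — cw_i = 1 − a_i, averaging over the M misclassified samples, and scaling the base threshold by 2·cw — are assumptions consistent with the stated design goals (high-weight samples incur smaller losses when mispredicted; their relabeling threshold is reduced), not the authors' published formulas.

```python
import numpy as np

def correction_weights(alpha):
    """Assumed form cw = 1 - alpha: low-weight (uncertain-looking)
    samples receive large correction weights, so their mispredictions
    are penalized more, and vice versa."""
    return 1.0 - alpha

def cs_loss(probs, labels, cw):
    """Correction strategy loss over the M misclassified samples only
    (assumed: correction-weighted cross-entropy on the given label)."""
    preds = probs.argmax(axis=1)
    wrong = preds != labels
    m = wrong.sum()
    if m == 0:
        return 0.0
    return float((cw[wrong] * -np.log(probs[wrong, labels[wrong]])).sum() / m)

def dynamic_threshold(delta2, cw):
    """Assumed per-sample threshold for the noise relabeling module:
    raised for low-weight samples (cw large) and lowered for
    high-weight samples (cw small); equals delta2 when cw = 0.5."""
    return 2.0 * delta2 * cw
```

The total loss could then be a trade-off such as `gamma * cs + (1 - gamma) * wce`, matching the 1:1 ratio evaluated in the experiments, though the exact combination is likewise an assumption.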

Noise Relabeling Module Improvements
In the rank regularization module, each batch sorts the sample images in descending order of sample weight and divides them into two groups by a fixed proportion: the high-importance group and the low-importance group. Uncertain samples usually receive lower sample weights because of problems such as image blurring and are assigned to the low-importance group. The noise relabeling module in SCN calculates the difference between the maximum predicted probability of each sample in the low-importance group and the probability of the given label and compares this difference with the threshold. If the difference is greater than the threshold, the sample's label is replaced with the label corresponding to the maximum predicted probability; otherwise, no relabeling is performed. Like SCN, this paper only considers samples in the low-importance group in the noise relabeling module. However, this paper holds that sample weights vary widely even within the low-importance group, making a single fixed threshold for all samples unreasonable. When a high-weight sample is predicted incorrectly, the difference between the maximum predicted probability and the probability of the given label tends to be small; when a low-weight sample is predicted incorrectly, this difference tends to be large. Therefore, this paper associates the threshold with the correction weight of the sample, obtaining thresholds that change dynamically across samples: when the sample weight is high, the threshold is reduced; when the sample weight is low, the threshold is increased. The noise relabeling module is defined as follows, where y′ is the new label of the sample, δ2 is the threshold (δ1 is used for RR_Loss), P_max is the maximum predicted probability, P_gtInd is the predicted probability of the given label, and cw is
the correction weight of the sample, l_org is the original given label, and l_max is the label corresponding to the maximum predicted probability.

EXPERIMENTS

Datasets
RAF-DB contains 30,000 facial images annotated with basic or compound expressions by 40 trained human coders. In the experiments, only the six basic expressions and neutral images are used, giving 12,271 training images and 3,068 testing images. FERPlus (Barsoum et al., 2016) was extended from FER2013, a large-scale dataset collected with the Google search engine for the ICML 2013 Challenges; it consists of 28,709 training images, 3,589 validation images, and 3,589 testing images. The dataset has eight classes because the contempt expression is added.
ExpW contains 91,793 face images downloaded via Google Image Search, each manually annotated with one of the seven basic expression classes. By preprocessing the original data, this paper retains 79,455 face images usable for model training, with 78,577 images used for training and 8,728 for testing. The training-to-test ratio for each expression is 9:1.
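A per-expression 9:1 split of the kind described above can be sketched as a generic stratified split (illustrative only; this is not the authors' preprocessing code, and the shuffling and rounding details are assumptions):

```python
import random
from collections import defaultdict

def stratified_split(samples, labels, train_ratio=0.9, seed=0):
    """Split (sample, label) pairs so that each expression class keeps
    the same train:test ratio, e.g. 9:1 per class."""
    by_class = defaultdict(list)
    for s, y in zip(samples, labels):
        by_class[y].append(s)
    rng = random.Random(seed)
    train, test = [], []
    for y, items in by_class.items():
        rng.shuffle(items)                       # randomize within the class
        cut = int(len(items) * train_ratio)      # per-class 9:1 cut point
        train += [(s, y) for s in items[:cut]]
        test += [(s, y) for s in items[cut:]]
    return train, test
```

Splitting per class rather than globally keeps rare expressions represented in both sets at the same ratio.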

Student Classroom Expression Dataset
This paper uses professional cameras to record students in a real classroom scenario. First, one frame is extracted from the recorded video every 50 frames and saved as an image. The images are then input into the YOLO5Face model one by one, and the model returns the coordinates of each student's face region. To avoid distorting students' faces when images are resized in the deep learning preprocessing stage, facial regions were automatically cropped at a 1:1 aspect ratio, with the longer side of the face region used as the reference. After a large amount of raw data was obtained, the data were initially screened to remove poor-quality images. Finally, students' classroom expressions were divided into five categories: resistance, confusion, understanding, exhaustion, and neutrality, and the data were labeled and cross-checked by professionals.
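The square cropping step can be sketched as follows: the longer side of the detected face box is used for both dimensions so that later resizing does not distort the face. The box format and the centering/clamping behavior are assumptions; YOLO5Face itself is not invoked here.

```python
def square_crop_box(x1, y1, x2, y2, img_w, img_h):
    """Expand a face box (x1, y1, x2, y2) to a 1:1 aspect ratio using the
    longer side as reference, centered on the original box and clamped
    to the image bounds (so the square may shrink at image borders)."""
    w, h = x2 - x1, y2 - y1
    side = max(w, h)                      # longer side sets both dimensions
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    nx1 = int(max(0, cx - side / 2.0))
    ny1 = int(max(0, cy - side / 2.0))
    nx2 = int(min(img_w, nx1 + side))
    ny2 = int(min(img_h, ny1 + side))
    return nx1, ny1, nx2, ny2
```

The returned coordinates can be used directly to slice the image array before resizing to the network's input size.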
After annotation, the data were divided into a training set and a test set, with about four times as many training images as test images for each expression. The final student classroom expression dataset contains 9,739 training images and 2,435 test images.

Experiment Details
This paper verifies the algorithm's robustness to uncertain samples on three public datasets, keeping the experimental environment as close to that of SCN as possible. ResNet-18 is used as the backbone network, the ratio of the high-importance group to the low-importance group is 7:3, and the fixed threshold of the noise relabeling module is 0.2. The noise relabeling module is enabled from the tenth epoch, and training stops at 40 epochs.

Evaluation of Correction Strategy for Label Noises
To verify the improvement of the correction strategy over SCN, this paper first injects three levels of label noise (10%, 20%, and 30%) into the RAF-DB, FERPlus, and ExpW datasets. Specifically, 10%, 20%, or 30% of the training data are randomly selected and their labels changed to other random labels. SCN is used for model training, and the proposed algorithm is compared with the original SCN under two training schemes: training from scratch and training with the same pretrained model.
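The noise-injection protocol can be sketched as follows: a fixed fraction of training labels is flipped to a different, randomly chosen class, so that no "noisy" label accidentally keeps its original value. The exact sampling scheme is an assumption.

```python
import random

def add_label_noise(labels, num_classes, ratio, seed=0):
    """Flip `ratio` of the labels to a different, randomly chosen class."""
    rng = random.Random(seed)
    labels = list(labels)                       # work on a copy
    n_noisy = int(len(labels) * ratio)
    for i in rng.sample(range(len(labels)), n_noisy):
        # choose uniformly among the classes other than the current label
        choices = [c for c in range(num_classes) if c != labels[i]]
        labels[i] = rng.choice(choices)
    return labels
```

Running this once per noise level (0.1, 0.2, 0.3) produces the three corrupted training sets used in the comparison.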
As presented in Table 1, the algorithm improves the robustness of SCN to label noise overall, and its advantage grows as the noise ratio increases. Without pretraining, the algorithm achieves up to 1.80%, 1.03%, and 0.79% improvement on RAF-DB, FERPlus, and ExpW, respectively. With pretraining, the highest gains are 1.21%, 1.48%, and 0.48%. Through comparison, this paper finds that the correction strategy works better on small datasets: as the number of samples increases, the number of reliable samples grows, and the influence of uncertain samples on the network model gradually decreases.
As presented in Table 2, the same experiment is performed on the student classroom expression dataset. Without pretraining, the algorithm achieves 0.77%, 1.00%, and 2.13% improvement at 10%, 20%, and 30% label noise, respectively. With pretraining, the algorithm obtains 1.06%, 1.22%, and 0.53% improvement. The correction strategy therefore effectively improves the robustness of SCN to uncertain samples and suppresses sample uncertainties in classroom expression datasets.
This paper further adds random occlusions covering 5%, 10%, and 20% of the image, without label noise, on the RAF-DB dataset to investigate the robustness of the correction strategy in occluded environments. The experiments are carried out without pretraining to better compare the feature learning ability of the algorithms. As presented in Table 3, the algorithm improves on the baseline by 0.82% without occlusion and also shows improvements under the three proportions of random occlusion.
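The random-occlusion setting can be sketched with a simple cutout-style augmentation that zeroes a square region covering a given fraction of the image area. The occlusion shape, position sampling, and zero fill value are assumptions, not the authors' exact protocol.

```python
import numpy as np

def random_occlusion(img, area_ratio, rng=None):
    """Zero out a square patch covering about `area_ratio` of the image
    area at a random location (cutout-style occlusion)."""
    if rng is None:
        rng = np.random.default_rng(0)
    h, w = img.shape[:2]
    side = int(round((area_ratio * h * w) ** 0.5))  # square side for target area
    side = min(side, h, w)
    y = rng.integers(0, h - side + 1)               # random top-left corner
    x = rng.integers(0, w - side + 1)
    out = img.copy()
    out[y:y + side, x:x + side] = 0
    return out
```

Applying this with area_ratio of 0.05, 0.10, and 0.20 reproduces the three occlusion levels of the experiment.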
This paper then evaluates the effect of different ratios between L_CS and L_WCE on the RAF-DB dataset with 30% noise and no pretraining. As presented in Table 4, the best result is achieved when the ratio of the two loss functions is 1:1.
Finally, this paper selects five representative images from the RAF-DB dataset, tests the models trained with 30% noise and no pretraining, and compares the maximum predicted probability of the samples, as shown in Fig. 2. The selected photos are samples that are correctly predicted by both models. Even when the prediction is accurate, the maximum predicted probabilities differ considerably: the correction strategy raises the maximum predicted probability when recognition is accurate and effectively suppresses the influence of other labels on the correct label. The evaluation on public datasets with added noise demonstrates the robustness of the correction strategy to label noise. Analysis of the results shows that the correction strategy performs better on smaller datasets, because as the number of samples increases, the number of reliable samples grows and the influence of uncertain samples on the network model decreases. The correction strategy can better suppress the uncertainty of self-built datasets in specific scenarios, because such datasets are usually small and suffer from serious label noise. In this paper, validation is carried out on the self-built student classroom expression dataset, providing ideas for solving the uncertainty problem of FER in specific scenes. Self-built datasets differ considerably from public datasets because of their scenes, and dynamically adjusting algorithms to the characteristics of a scene helps apply advanced algorithms in various fields.

CONCLUSION
This paper proposed a correction strategy based on SCN and verified the algorithm experimentally from multiple perspectives. The experimental results show that the correction strategy effectively improves the robustness of the network to uncertain samples and strengthens its ability to learn robust facial expression features. The core of the correction strategy is to calculate the correction weight of each sample from its sample weight, to use the correction strategy loss function computed from the correction weights in place of the rank regularization loss function, and to combine the correction weights with the threshold of the noise relabeling module to generate more reliable dynamic thresholds. The experimental results on public datasets and the self-built student classroom expression dataset show that the correction strategy effectively improves the robustness of SCN to uncertain samples and deals with label uncertainty in datasets more effectively. The correction strategy has limitations when applied to large-scale datasets; however, such datasets often have serious sample uncertainty problems that affect their application. Therefore, exploring methods for suppressing sample uncertainty that scale to large datasets will be the next research direction.

Figure 1. Correction strategy based on SCN

Figure 2. Improvement of the correction strategy to the maximum predicted probability