Introduction
In the last few years, significant advances have been made in computer vision, robotics, and human-machine interaction. With growing applications in gaze estimation, self-driving cars, and assistance for the impaired, a reliable head pose estimation framework has become increasingly important. Prior research has used head pose estimation to understand how human attention works (Bergasa et al., 2008). It also fits applications such as analyzing human behavior and social interactions (Ba & Odobez, 2011). Head pose estimation becomes crucial in driver assistance systems, which slow down the vehicle when pedestrians are not aware of its presence (Geronimo, López, Sappa & Graf, 2010). Because of this significance, head pose estimation has been thoroughly investigated and explored across various fields.
Head pose estimation plays a prominent role in use cases such as anomaly detection, surveillance, human-computer interaction (HCI), and understanding behavioral dynamics in crowds (Baxter, Leach, Mukherjee & Robertson, 2015). Extreme facial orientations, varying illumination and resolution, makeup, and facial hair make head pose challenging to predict. Traditional methods achieved some success in head pose estimation using image processing techniques; for example, Histogram of Oriented Gradients (HOG) methods successfully predicted head poses in images and videos (Tran & Lee, 2011). These traditional methods were founded on discriminative/landmark-based or parameterized appearance-based models. They worked well for estimating head pose but were neither flexible nor robust to extreme variations in head pose.
With their development, convolutional neural networks (CNNs) became a popular choice for estimating head pose (Patacchiola & Cangelosi, 2017) because of their high efficiency. The efficiency of CNNs depends on the amount of well-annotated data available: the more annotated data we have, the better a CNN performs. But capturing a large, well-annotated dataset is difficult in most cases. While CNNs trained on large volumes of data are good at predicting head poses, they lack generalization. A good head pose estimator should be data efficient while matching the efficiency of CNNs. It should also adapt to unseen faces and perform better as more evidence of head pose features becomes available.
In recent years, few-shot learning techniques have grown in popularity for settings where little data is available. Meta-learning-based techniques have gained traction because they can be applied in few-shot settings and adapt well to unseen data (Sun, Liu, Chua & Schiele, 2018). These techniques use knowledge gained from previous experience to boost future performance. Meta-learners can learn a novel task from a limited training dataset and generalize to unseen tasks the model encounters in the future. This learning paradigm is called learning-to-learn. Meta-learning thus offers both better data efficiency and better computational efficiency.
This article extends the work of Joshi, Pant, Karn, Heikkonen and Kanth (2022). It revises the existing MAML-based approach and then proposes a novel approach using computationally efficient first-order model-agnostic meta-learning (FO-MAML). The novel approach performs well in head pose estimation while being computationally more efficient. One-, five-, and ten-shot experiments have been performed on the BIWI head pose dataset using MAML and FO-MAML, and the two approaches are compared in terms of accuracy and time complexity.
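To make the distinction concrete, the sketch below illustrates the core FO-MAML update on a toy one-dimensional regression problem, using analytic gradients. This is a minimal illustration only, not the implementation used in this article: the function names (`fomaml_step`, `make_task`) and the toy task distribution are hypothetical. The key point it shows is that FO-MAML applies the query-set gradient evaluated at the adapted weights directly to the meta-parameters, dropping the second-order terms that full MAML would backpropagate through the inner-loop step.

```python
import random

def mse_grad(w, data):
    # Gradient of mean squared error for the linear model y_hat = w * x.
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)

def fomaml_step(w, tasks, alpha=0.01, beta=0.01):
    """One FO-MAML meta-update.

    For each task, adapt on the support set (inner loop), then accumulate
    the query-set gradient at the adapted weights. Unlike full MAML, the
    adapted weights are treated as constants with respect to w, so no
    second-order derivatives are needed.
    """
    meta_grad = 0.0
    for support, query in tasks:
        w_adapted = w - alpha * mse_grad(w, support)  # inner-loop step
        meta_grad += mse_grad(w_adapted, query)       # first-order outer gradient
    return w - beta * meta_grad / len(tasks)

def make_task(slope, k=5):
    # Hypothetical toy task: k-shot 1-D regression with true slope `slope`,
    # split into a support set and a query set.
    xs = [random.uniform(-1, 1) for _ in range(2 * k)]
    data = [(x, slope * x) for x in xs]
    return data[:k], data[k:]

random.seed(0)
w = 0.0
for _ in range(2000):
    tasks = [make_task(random.uniform(1.0, 3.0)) for _ in range(4)]
    w = fomaml_step(w, tasks)
# After meta-training, w sits near the mean slope (~2.0) of the task
# distribution, i.e. an initialization that adapts quickly to any task.
```

Because the inner-loop Jacobian is never computed, each meta-update costs only two gradient evaluations per task, which is the source of FO-MAML's computational advantage over full MAML.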