One, Five, and Ten-Shot-Based Meta-Learning for Computationally Efficient Head Pose Estimation

Many real-world applications rely on head pose estimation. The performance of head pose estimation has significantly improved with techniques like convolutional neural networks (CNNs). However, CNNs require a large amount of data for training. This article presents a new framework for head pose estimation using a computationally efficient first-order model-agnostic meta-learning (FO-MAML)-based method and compares its performance with existing MAML-based approaches. Experiments in one-shot, five-shot, and ten-shot settings are done using MAML and FO-MAML. A mean average error (MAE_avg) of 7.72, 6.30, and 5.32 has been achieved in predicting head pose using MAML for the one-shot, five-shot, and ten-shot settings, respectively. Similarly, an MAE_avg of 8.33, 6.84, and 6.23 has been achieved in predicting head pose using FO-MAML for the one-shot, five-shot, and ten-shot settings, respectively. The computational complexity of an outer-loop update in MAML is found to be O(n²), whereas for FO-MAML it is O(n).


INTRODUCTION
In the last few years, significant advancement has been seen in the areas of computer vision, robotics, and human-machine interaction. With increasing areas of application in gaze estimation, self-driving cars, and assistance for the impaired, a reliable head pose estimation framework has become important. Prior research has been done on head pose estimation for understanding how human attention works (Bergasa et al., 2008). It also fits in applications such as analyzing human behavior and social interactions (Ba & Odobez, 2011). Head pose estimation becomes crucial in driver assistance systems to slow down the vehicle when pedestrians are not aware of the presence of the vehicle (Geronimo, López, Sappa & Graf, 2010). Because of this significance, head pose estimation has been thoroughly investigated and explored in various fields.
Head pose estimation plays a prominent role in use cases such as anomaly detection, surveillance, human-computer interaction (HCI), and understanding behavioral dynamics in crowds (Baxter, Leach, Mukherjee & Robertson, 2015). Extreme facial orientations, varying illumination and resolution, makeup, and the presence of hair on the human face make it challenging to predict head pose. Traditional methods gained some success in head pose estimation using image processing techniques. Histogram of Oriented Gradients (HOG) methods successfully predicted head poses in images and videos (Tran & Lee, 2011). The traditional methods for head pose estimation were founded on discriminative/landmark-based or parameterized appearance-based models. These approaches worked well in estimating head pose but were not flexible and robust to extreme variation in the head pose.
With their development, convolutional neural networks (CNNs) became a popular choice for estimating head poses because of their high efficiency (Patacchiola & Cangelosi, 2017). The efficiency of CNNs relies on the amount of well-annotated data samples: the more annotated data we have, the better a CNN will perform. But capturing a large and well-annotated dataset is difficult in most cases. Convolutional neural networks, when using a large volume of data, are good at predicting head poses, although they lack generalization. A good head pose estimator should be data efficient and have efficiency similar to that of CNNs. It should also adapt to unseen faces and perform much better as more and more evidence of head pose features becomes available.
In recent years, few-shot learning techniques have become popular when little data is available. Meta-learning-based techniques have gained popularity in the past few years, as they can be applied in few-shot settings and adapt well to unseen data (Sun, Liu, Chua & Schiele, 2018). These techniques can use the knowledge gained from previous experiences to boost future performance. Meta-learners can learn a novel task from a limited training dataset and use it to generalize to unseen tasks that the model encounters in the future. This learning method is called learning-to-learn. The use of meta-learning can provide better data and computational efficiency.
This article extends the work of Joshi, Pant, Karn, Heikkonen and Kanth (2022). The article revises the existing MAML-based approach and then proposes a novel approach using computationally efficient first-order model-agnostic meta-learning (FO-MAML). The novel approach performs well in head pose estimation and is computationally more efficient. One-shot, five-shot, and ten-shot experiments have been performed on the BIWI head pose dataset using MAML and FO-MAML, and a comparison has been made in terms of the accuracy and time complexity of both approaches.

BACKGROUND
Conventional techniques for head pose estimation used appearance templates. Ng and Gong (1999) extended support vector machines (SVMs) and modeled the appearance of human faces in 2D. Their work gave excellent results in multi-view face detection and head pose estimation. Sherrah et al. (1999) applied Gabor filters to improve the pose difference at each pose angle. They applied principal component analysis (PCA) to obtain the identity-invariant properties of human faces. The identity-invariant features were used to compute the head poses. The success of frontal face detection using supervised learning algorithms gave researchers a new direction to find methods for estimating head poses. Researchers extended single-face detector methods to multiple-face detector methods. Hence, detector array-based methods were developed (Osuna, Freund, & Girosit, 1997). In these techniques, a detector is trained with multiple images rather than comparing facial images to predefined templates. Mostly SVMs were implemented in detector array-based methods, and they were successful in predicting head poses.
Geometric models were later developed that used facial key points for computing head pose. Using facial key points, we do not need previous knowledge of the user's facial appearance or orientation (Jin & Tan, 2016). Such a method allows us to detect facial features based on anatomical landmarks of the face. However, it requires accurate edge and line detection of human facial key points. Sun et al. (2013) suggested an approach using convolutional neural networks (CNNs) for predicting the positions of facial key points. Their CNN locates anatomical facial landmarks with high accuracy from the entire face and estimates all the key points simultaneously. Liu et al. (2016) did similar work on sequential images. These methods gave excellent results in computing head poses, although it is exceptionally tedious to detect facial landmarks in human faces. Kashevnik et al. (2021) worked on detecting head pose angles in the context of a driver monitoring application using facial landmarks and showed that their method has superior performance to the state-of-the-art when the pose varies extremely. The authors experimented thoroughly and showed that their method could detect head pose angles when the rotation of the head is up to 70°. Srinivasan and Boyer (2002) studied view-based eigenspaces to estimate head poses. They claimed that eigenspaces of facial images were more efficient for accurate facial recognition, which can improve head pose estimation. Their method was more powerful and computationally less expensive than the existing methods. Zhu and Fujimura (2004) proposed a method based on PCA and 3D motion estimation. They further used the depth information of facial images to segment human faces even in a cluttered environment.
A few classification-based methods were developed using discretized head poses. Benfold and Reid (2008) used an ensemble of tree-based algorithms to detect color-invariant head pose. Fanelli et al. (2011) used depth data for estimating head pose using random forest regression. Yan et al. (2016) proposed a multi-task learning approach to calculate the head poses of mobile persons in an environment observed by multiple surveillance cameras. Enhancements in deep learning led to the development of non-linear regression-based head pose estimation methods. Ruiz et al. (2018) proposed a method that did not need facial key points for estimating head pose. They trained a multi-loss convolutional neural network (CNN) for predicting head pose angles. A real-time CNN-based approach similar to LeNet-5 (Lecun et al., 1998) has been suggested to predict head pose (Osadchy, Miller & Lecun, 2004). Their CNN-based architecture has more feature maps than LeNet-5. Other CNN-based methods used RGB images along with depth images. Ahn et al. (2015) used GoogLeNet (Szegedy et al., 2014) to estimate head poses using RGB and depth information. Venturelli et al. (2017) proposed a lightweight model having five convolution layers and three dense layers and achieved better performance. Recently, the work of Patacchiola and Cangelosi (2017) has shown the better performance of CNNs in estimating head pose using multiple datasets.
CNN-based methods give excellent results in estimating head pose. The demerit of using CNNs is that they require huge amounts of training data to find the best estimator and lack generalization ability on new data. A CNN is not good enough at adapting to a novel set of tasks. A model-independent meta-training approach was introduced that can make use of gradient descent for training (Finn, Abbeel, & Levine, 2017). Model-agnostic meta-learning (MAML) can be used in any model that is trained with gradient descent. Sun et al. (2019) suggested an approach to use a meta-learning technique in few-shot settings. Their approach showed that meta-learning performs admirably in few-shot learning. The fundamental concept of their work was to train a base learner with a small number of examples randomly selected from a set of tasks from a distribution. Then, the base learner is adapted to learn new tasks using identical examples. Antoniou et al. (2019) have shown that model-agnostic meta-learning (MAML) can be used to solve regression and classification problems. Park et al. (2019) have worked on meta-learning using MAML and suggested an architecture to estimate gaze in few-shot settings. For training a person-specific gaze estimation model using meta-learning, they utilized a small number of calibration samples. Their approach also allows for a far more accurate estimate of gaze direction for any new person. Few-shot learning for head pose estimation has been studied by Joshi et al. (2021). They used ResNet (He et al., 2016) as a feature extractor and trained a few-shot learner to estimate head poses. Joshi et al. (2022) suggested a model-agnostic meta-learning (MAML) based framework for head pose estimation. Their method gave better results in estimating head pose using MAML in few-shot settings.

METHODOLOGY
In this article, the authors extend the work of Joshi et al. (2022) by omitting the second-order gradients during model-agnostic meta-learning training. A three-stage architecture has been introduced, consisting of a Face Detection Framework, a Representation Learning Framework, and a First-Order Meta-Learning Framework. The architecture for head pose estimation using meta-learning is shown in Figure 1. The next sections explain each stage of head pose detection.

Dataset

For this work, the popular BIWI head pose dataset has been used to train the proposed meta-learner network (Fanelli, Weise, Gall, & Van Gool, 2011). This head pose dataset consists of 15,678 images of 20 different subjects in varying environmental conditions. Each image is of size 640 × 480 pixels. Along with the RGB images of the subjects, a corresponding depth image of size 640 × 480 pixels is given. The annotated head pose angles are also provided with the dataset. The head poses of the subjects in the dataset vary approximately ±75° in yaw, ±60° in pitch, and ±50° in roll. The ground truth for the 3D position and rotation of the heads has also been provided with the images of the subjects.

Experimental Testbench
PyTorch and PyTorch Lightning libraries have been used to create the variational autoencoder and meta-learner models. For hyperparameter tuning and neural architecture search, the Ray Tune library has been used (Liaw et al., 2018). The meta-training has been done on the free Google Colab platform, which provides an NVIDIA Tesla K80 GPU with 12 GB of memory. For computing the second-order gradients, the Higher library has been used with PyTorch Lightning (Grefenstette et al., 2019). The OpenCV library has been used for image processing where needed.

Data Preprocessing
Some images in the BIWI dataset contain multiple background objects. Hence, we have to identify and crop the human faces from the images. Face detection requires proper face localization and bounding box coordinate prediction. For this work, we use a multi-task cascaded convolutional neural network (MTCNN) for detecting faces and predicting bounding box coordinates (Zhang et al., 2016). After applying MTCNN, we crop the identified faces from the BIWI images to a size of 128 × 128. The dataset created from the cropped images is then split into two: the first fifteen subjects form a distribution from which few-shot tasks can be randomly sampled for training the meta-learner, whereas the remaining five subjects are used for testing the meta-learner. A total of 10,581 cropped faces have been used as a distribution to sample training tasks, and the remaining 2,638 faces were used for testing. During face detection, images with heavily occluded backgrounds, numerous objects, and extreme poses were eliminated.
The faces obtained from MTCNN were normalized before being passed into the representation learning framework. After min-max normalization, the pixel values lie between 0 and 1. Min-max normalization bounds the pixel values to a small standard range, which works well in the presence of outliers. Min-max normalization is computed as

x_norm = (x − x_min) / (x_max − x_min)

where x is the value of a pixel, and x_min and x_max are the minimum and maximum values of the pixels.
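As a concrete illustration, the normalization step above can be written in a few lines (a minimal NumPy sketch; the variable names and the sample values are illustrative):

```python
import numpy as np

def min_max_normalize(image: np.ndarray) -> np.ndarray:
    """Scale pixel values to [0, 1] using min-max normalization."""
    x_min, x_max = image.min(), image.max()
    return (image - x_min) / (x_max - x_min)

# Example: an 8-bit face crop with pixel values in [0, 255]
face = np.array([[0, 64], [128, 255]], dtype=np.float32)
normalized = min_max_normalize(face)
# normalized now lies entirely in [0, 1]
```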

Representation Learning Framework
This article presents a representation learning framework designed to learn important features from human faces in a latent space. There are many methods for representation learning, but variational autoencoders (VAE) are excellent for these sorts of tasks. Furthermore, VAE networks link representation learning with generative modeling, making it possible to generate usable data from scratch. Based on this, a variational autoencoder network has been implemented for this task. A VAE consists of an encoder and a decoder such that, given an input x to the network, the encoder encodes it into a latent variable z. In this article, a VAE is implemented to learn the characteristic features from the training images and produce a latent space of 200 features. The size of the latent space has been decided by repeatedly computing the reconstruction loss while varying the latent dimension z. The number of latent dimensions that gives the least combined error (KL-Divergence and Mean-Squared Error) while decoding is selected as the size of the latent embedding vector.
The encoder network contains five convolution layers. Each convolution layer uses the LeakyReLU activation function. A stride of size two and a dropout rate of 0.25 have been used. The first four convolutional layers use a kernel size of three. The last convolutional layer uses a kernel of size one. The outputs of the convolutional layers are flattened to obtain a 200-dimensional latent embedding vector. The latent embedding vector is then sent to a linear layer, which performs a linear transformation on it to generate 4096 latent features.
The decoder comprises four transposed convolution layers with the LeakyReLU activation function. A stride of two and a kernel size of three have been used for each transposed convolution layer. A dropout rate of 0.25 has been taken. The reconstructed images of size 128 × 128 are produced from the last transposed convolution layer. The kernel sizes in the encoder and decoder, the dropout rate, the number of filters, and the number of convolutional layers have been searched as part of hyperparameter tuning.
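The encoder described above can be sketched in PyTorch as follows. This is a hedged sketch, not the authors' implementation: the channel widths and the LeakyReLU negative slope are assumptions, while the kernel sizes (three for the first four layers, one for the last), stride of two, dropout of 0.25, 128 × 128 input, and 200-dimensional latent space follow the text:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """VAE encoder: five strided convolutions -> 200-d latent space."""
    def __init__(self, latent_dim: int = 200):
        super().__init__()
        chans = [3, 32, 64, 128, 256]  # assumed channel widths
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            # First four layers: kernel 3, stride 2 (128 -> 64 -> 32 -> 16 -> 8)
            layers += [nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
                       nn.LeakyReLU(0.1), nn.Dropout(0.25)]
        # Last layer: kernel 1, stride 2 (8 -> 4)
        layers += [nn.Conv2d(256, 256, kernel_size=1, stride=2), nn.LeakyReLU(0.1)]
        self.conv = nn.Sequential(*layers)
        self.fc_mu = nn.Linear(256 * 4 * 4, latent_dim)      # 4096 -> 200
        self.fc_logvar = nn.Linear(256 * 4 * 4, latent_dim)

    def forward(self, x):
        h = self.conv(x).flatten(1)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        # Reparameterization trick: sample z from N(mu, sigma^2)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return z, mu, logvar

enc = Encoder()
z, mu, logvar = enc(torch.randn(2, 3, 128, 128))  # batch of two RGB faces
```

With the assumed widths, flattening the 256 × 4 × 4 feature map yields exactly the 4096 features mentioned in the text.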

Meta-Learning Framework
In the meta-learning framework, we train an adaptive head pose estimator that finds the optimal parameters for a given training dataset and generalizes well to unseen data. To achieve this, the authors have implemented optimization-based meta-learning approaches, namely MAML and FO-MAML, and compared their performance in estimating head poses. Figure 3 shows the differences between the MAML and FO-MAML-based architectures.
Figure 3a shows a MAML-based approach for a single task. We can see that computing the meta-gradient in MAML requires backpropagating all the way through the computation graph of the inner-loop update. MAML differentiates through the adapted model f_{w'} while updating the parameter w. It means we need to compute second-order derivatives, i.e., the Hessian ∇²_w L(f_{w'}), during the meta-learning step. A computationally more efficient approach compared to MAML is FO-MAML, as shown in Figure 3b.
In FO-MAML, we use the first-order approximation ∇_{w'} L(f_{w'}) for updating the global parameter w. This approach removes the need for computing the Hessian. This allows us to omit the second-order gradients during learning, making the optimization of the meta-learner faster.

Model-Agnostic Meta-Learning (MAML)
MAML is a model-agnostic and task-agnostic approach that can swiftly train model parameters for quick adaptation to new tasks. The benefit of a MAML-based approach is in learning new tasks rapidly with only a few gradient updates to the model.

Algorithm 1: Generic Model-Agnostic Meta-Learning Algorithm
Require: D(T): distribution over tasks (human faces)
Require: α, β: learning rates
Step 1. Randomly initialize parameters w
Step 2. while not done do
Step 3. Sample a batch of tasks T_i ~ D(T)
Step 4. for all T_i do
Step 5. Evaluate ∇_w L_{T_i}(f_w) with respect to K samples
Step 6. Calculate adapted parameters using gradient descent: w_i' = w − α ∇_w L_{T_i}(f_w)
Step 7. end for
Step 8. Update w ← w − β ∇_w Σ_{T_i ~ D(T)} L_{T_i}(f_{w_i'})
Step 9. end while

Let us consider a model f_w having parameters w with t tasks, as shown in Algorithm 1. With just K samples taken at a time, the model f_w is trained with tasks T_i selected from the distribution D(T). This produces a reliable few-shot learner that can generalize well to novel examples taken from a new task T_i, again selected from the distribution D(T). Our objective is to minimize the loss L_{T_i} for task T_i; hence, the model f_w needs to be updated regularly. The proposed head pose estimator uses MAML and FO-MAML based models. Let M be the head pose estimator model that learns optimal parameters w* after training with meta-learning. The model is said to be optimized once it learns the optimal parameters w*. This results in a fine-tuned model M_{w*} that can generalize well to unseen sets of tasks sampled from the distribution D(T). We create sample sets consisting of a support set and a query set for training and testing. The meta-learner trains on the support set S_s^train and updates the weights w_t at step t using a few gradient steps. An inner learning rate α is used for computing the new gradients. A person p_train is selected from S_s^train and the parameter w is updated as shown below:

w_t' = w_t − α ∇_{w_t} L(f_{w_t})
The mean absolute error (MAE) in computing head pose angles is given by

MAE = (1/n) Σ_{i=1}^{n} |gt_i − ŷ_i|

where n is the number of samples in the support set S_s^train, gt_i are the ground-truth head pose angles, and ŷ_i are the predicted head pose angles. Using the updated weights w_t', we now compute the loss for the validation set S_q^train at step t + 1.
The gradients are computed with respect to the original weights w_t, using a learning rate β. Finally, the weights w_t are updated to minimize the validation loss, as shown below:

w_t ← w_t − β ∇_{w_t} L(f_{w_t'})
The meta-gradient updates continue until the model finds the optimal weights w*. After we achieve the optimal parameters w*, we can use the model for fast adaptation. Finally, we verify fast adaptation using sample images drawn from the validation set S_q^test. Joshi et al. (2022) suggested an approach to estimate accurate head poses using MAML; however, a significant computational expense comes with MAML, since it computes second-order derivatives during the backpropagation of meta-gradients. Hence, we implement a first-order approximation of the generic MAML algorithm that eliminates the second derivatives while backpropagating the meta-gradients through the inner-loop updates. Despite the absence of second-derivative terms, the resulting technique is still able to compute meta-gradients. Interestingly, this method performs almost identically to the one obtained by considering second-order derivatives. This implies that the gradients of the model at the post-update parameter values are responsible for most of the improvement in MAML, and that the second-order updates are less significant.
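The difference between the full MAML meta-gradient and its first-order approximation can be seen on a toy one-dimensional task, where both gradients can be computed by hand. This is a hypothetical example, not the paper's model: the inner loss is (w − t_s)² on a support target t_s, and the outer loss is (w' − t_q)² on a query target t_q:

```python
# Toy 1-D task: inner loss (w - t_s)^2, outer loss (w' - t_q)^2.
alpha = 0.1      # inner learning rate
w = 0.0          # initial meta-parameter
t_s, t_q = 1.0, 2.0

# Inner-loop adaptation: w' = w - alpha * dL_s/dw = w - 2*alpha*(w - t_s)
w_adapt = w - 2 * alpha * (w - t_s)            # 0.2

# Exact MAML meta-gradient: chain rule through the inner update,
# dL_q/dw = 2*(w' - t_q) * dw'/dw, with dw'/dw = (1 - 2*alpha)
grad_maml = 2 * (w_adapt - t_q) * (1 - 2 * alpha)   # -2.88

# FO-MAML: treat dw'/dw as the identity, keeping only the first-order term
grad_fomaml = 2 * (w_adapt - t_q)                   # -3.6
```

The two gradients differ only by the Jacobian factor (1 − 2α); when the inner learning rate is small, this factor is close to one, which is why FO-MAML often performs nearly as well as full MAML.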

First-Order Model-Agnostic Meta-Learning (FO-MAML)
Algorithm 2 shows the first-order model-agnostic meta-learning algorithm step by step. The training process remains similar to MAML except for updating the parameters at time t + 1. After we get the updated parameter w_t' at time t, the new gradients at time t + 1 are computed with respect to the newly updated parameter w_t', using β as the learning rate. Finally, the parameters w_t are updated with the objective of minimizing the validation loss, as shown below:

w_t ← w_t − β ∇_{w_t'} L(f_{w_t'})
Algorithm 2: First-Order Model-Agnostic Meta-Learning Algorithm
Require: D(T): distribution over tasks (human faces)
Require: α, β: learning rates
Step 1. Randomly initialize parameters w
Step 2. while not done do
Step 3. Sample a batch of tasks T_i ~ D(T)
Step 4. for all T_i do
Step 5. Evaluate ∇_w L_{T_i}(f_w) with respect to K samples
Step 6. Calculate adapted parameters using gradient descent: w_i' = w − α ∇_w L_{T_i}(f_w)
Step 7. end for
Step 8. Update w ← w − β Σ_{T_i ~ D(T)} ∇_{w_i'} L_{T_i}(f_{w_i'})
Step 9. end while
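Algorithm 2 can be condensed into a short runnable sketch. The toy tasks below are linear-regression problems standing in for head-pose tasks; the model, task distribution, and learning rates are illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 0.01, 0.001        # inner / outer learning rates (illustrative)
w = rng.normal(size=3)           # meta-parameters of a tiny linear model
w_init = w.copy()                # keep a copy to check that training moved w

def task_batch(n_tasks=8, k=5):
    """Sample toy regression tasks y = X @ theta with support/query sets."""
    for _ in range(n_tasks):
        theta = rng.normal(size=3)
        Xs, Xq = rng.normal(size=(k, 3)), rng.normal(size=(k, 3))
        yield (Xs, Xs @ theta), (Xq, Xq @ theta)

def grad(w, X, y):
    """Gradient of the mean-squared error 0.5 * mean((X @ w - y)^2)."""
    return X.T @ (X @ w - y) / len(y)

for episode in range(500):                      # Step 2: while not done
    meta_grad = np.zeros_like(w)
    for (Xs, ys), (Xq, yq) in task_batch():     # Steps 3-4: sample tasks
        w_i = w - alpha * grad(w, Xs, ys)       # Steps 5-6: inner adaptation
        meta_grad += grad(w_i, Xq, yq)          # Step 8: gradient at adapted w_i
    w = w - beta * meta_grad                    # first-order outer update
```

Note that the outer update backpropagates nothing through the inner step: the query-set gradient is simply evaluated at the adapted parameters w_i, which is exactly what makes the method first-order.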

Mathematical Formulation for MAML and its First-Order Approximation (FOMAML)
We are particularly interested in comparing the meta-objectives of FO-MAML and MAML. For this, let us look at the MAML optimization problem: find some initial parameters w such that the learner will have minimum loss L_τ after k SGD updates for a randomly selected task τ. Let us define a few notations for this formulation: let U_τ^k(w) denote the operator that performs k SGD updates on task τ starting from parameters w, so that the meta-objective is to minimize E_τ[L_τ(U_τ^k(w))]. For computing this update, we compute the gradient with respect to our initial parameters w. For simplicity, let us omit k, because it is constant for our computation.
The meta-gradient is

∇_w L_τ(U_τ(w)) = U_τ'(w) ∇_{w_new} L_τ(w_new)

where w_new = U_τ(w) and U_τ'(w) is the Jacobian matrix of the MAML update operation with respect to the initial parameters w. The update U_τ(w) adds the sequence of gradient vectors computed at each gradient step to the initial vector. First-Order MAML (FO-MAML) treats these gradients as constants, thus replacing the Jacobian U_τ'(w) with the identity operation. Thus, the gradient used by FO-MAML for outer-loop optimization is given by

g_FO-MAML = ∇_{w_new} L_τ(w_new)

This gradient can be used directly in the outer-loop optimization of FO-MAML:

w ← w − β g_FO-MAML

From these mathematical formulations, FO-MAML can be efficiently implemented as shown in Algorithm 3.

Meta-Learner Network
A multi-layer fully connected neural network is implemented as the meta-learner. The 200 latent embeddings are taken as input to a linear layer that produces 1000 features after a linear transformation. The features are then passed through three dense layers, each of which gives 1000 features. The final layer of the network is a fully connected linear layer that regresses three values as the head pose angles. LeakyReLU has been taken as the activation function, with a negative slope of 0.1. Similarly, a dropout rate of 0.25 is used. This regression model predicts the yaw, pitch, and roll angles. Figure 4 depicts the proposed architecture for meta-training. The number of output features, the number of dense layers, the dropout rate, and the activation function have been chosen as part of hyperparameter tuning.
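A minimal PyTorch sketch of this regression head is shown below. The layer widths, LeakyReLU slope of 0.1, and dropout of 0.25 follow the description above; the exact ordering of activation and dropout is an assumption:

```python
import torch
import torch.nn as nn

# Meta-learner head: 200 latent features -> 3 pose angles (yaw, pitch, roll)
meta_learner = nn.Sequential(
    nn.Linear(200, 1000), nn.LeakyReLU(0.1), nn.Dropout(0.25),   # input layer
    nn.Linear(1000, 1000), nn.LeakyReLU(0.1), nn.Dropout(0.25),  # dense layer 1
    nn.Linear(1000, 1000), nn.LeakyReLU(0.1), nn.Dropout(0.25),  # dense layer 2
    nn.Linear(1000, 1000), nn.LeakyReLU(0.1), nn.Dropout(0.25),  # dense layer 3
    nn.Linear(1000, 3),                                          # yaw, pitch, roll
)

z = torch.randn(16, 200)   # a batch of latent embeddings from the VAE
angles = meta_learner(z)   # shape: (16, 3)
```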

Hyperparameter Tuning
A grid search-based approach has been used to find the optimal hyperparameters of both the meta-learner and the variational autoencoder as part of hyperparameter optimization. For this, the Ray Tune library has been used. Using grid search, the meta-learning learning rates α and β were chosen as 0.01. Similarly, a learning rate of 0.001 has been taken for training the variational autoencoder. Ten inner gradient steps have been considered to train MAML and FO-MAML. Dropout rates of 0.25 have been taken for both the meta-learner and the VAE. For training the meta-learner, stochastic gradient descent (SGD) has been used, whereas Adam is used as the optimizer for the VAE. Other hyperparameters, such as the number of dense layers, the number of convolutional layers, and the number of features between dense layers, have been searched using Ray Tune.
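The principle of the grid search can be illustrated in plain Python (the paper uses Ray Tune for this; the search space below and the stand-in scoring function are hypothetical):

```python
import itertools

# Illustrative search space over a few of the hyperparameters mentioned above
search_space = {
    "inner_lr": [0.1, 0.01, 0.001],
    "dropout":  [0.1, 0.25, 0.5],
    "n_dense":  [2, 3, 4],
}

def validation_mae(config):
    """Stand-in for a real training run returning validation MAE.
    Here it simply favors inner_lr=0.01 and dropout=0.25."""
    return abs(config["inner_lr"] - 0.01) + abs(config["dropout"] - 0.25)

keys = list(search_space)
best_config, best_score = None, float("inf")
# Exhaustively evaluate every combination in the grid
for values in itertools.product(*(search_space[k] for k in keys)):
    config = dict(zip(keys, values))
    score = validation_mae(config)
    if score < best_score:
        best_config, best_score = config, score
```

In practice, `validation_mae` would launch a full training run per configuration, which is what a tuning library parallelizes across trials.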

Model Evaluation
The pixel-wise mean squared error (MSE) is computed during representation learning. The squared error between the actual and regenerated faces has been used to evaluate the performance of the VAE. For computing the similarity between actual and predicted head poses, the mean absolute error (MAE) has been used. The mean absolute error (MAE) between actual and predicted head poses is given in equation 11:

MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i|   (11)

where y_i are the ground-truth head pose angles and ŷ_i are the predicted head pose angles. The average of the three mean absolute errors in predicting the Euler angles is taken as the final score to evaluate the proposed model. It is used to assess the overall performance of the proposed architecture in predicting the Euler angles (yaw, pitch, and roll) of head poses. The mean average error is computed as

MAE_avg = (yaw_mae + pitch_mae + roll_mae) / 3

where yaw_mae, pitch_mae, and roll_mae represent the mean absolute errors in estimating the yaw, pitch, and roll angles, respectively.
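Computing MAE per Euler angle and the final MAE_avg score can be done in a few lines (the ground-truth and predicted angles below are made up for illustration):

```python
def mae(truth, pred):
    """Mean absolute error between ground-truth and predicted angles."""
    return sum(abs(t - p) for t, p in zip(truth, pred)) / len(truth)

# Hypothetical ground-truth and predicted angles (degrees) for four faces
yaw_gt,   yaw_pr   = [10.0, -5.0, 30.0, 0.0],  [12.0, -4.0, 27.0, 1.0]
pitch_gt, pitch_pr = [5.0, 0.0, -10.0, 20.0],  [6.0, -2.0, -9.0, 18.0]
roll_gt,  roll_pr  = [0.0, 15.0, -20.0, 5.0],  [1.0, 13.0, -22.0, 6.0]

yaw_mae   = mae(yaw_gt, yaw_pr)      # 1.75
pitch_mae = mae(pitch_gt, pitch_pr)  # 1.5
roll_mae  = mae(roll_gt, roll_pr)    # 1.5
mae_avg = (yaw_mae + pitch_mae + roll_mae) / 3
```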

RESULTS AND ANALYSIS
In this section, we analyze the results of the proposed representation learning and meta-learning frameworks. We present the results from one-shot, five-shot, and ten-shot experiments using the MAML and FO-MAML-based approaches to compare the performance of both methods.

Representation Learning
A variational autoencoder (VAE) is trained to generate latent embeddings that preserve the head pose. For this, facial images of size 128 × 128 are used to train the VAE. In this article, the VAE is trained for 20 epochs to generate 200 latent embeddings. The pixel-wise mean squared loss is computed for the performance evaluation of the VAE. Figure 5 shows the pixel-wise reconstruction loss of the VAE while regenerating the original faces from the latent embeddings. The mean squared error loss during reconstruction decreases with further training. The Kullback-Leibler divergence (KL-Divergence) loss during reconstruction is shown in Figure 6. During training, the reconstruction loss is given higher weight than the KL-Divergence loss (Asperti & Trentin, 2020). The total loss of the VAE becomes balanced, with the MSE loss decreasing and the KL-Divergence loss increasing over a few epochs of training.
The reconstruction loss and the KL-Divergence loss are combined to form the total loss of the VAE, which is shown in Figure 7. The embeddings produced by the VAE are used to train the meta-learner. Figure 8 shows the original and reconstructed facial images using the VAE.

One-Shot Settings Using MAML and FO-MAML
To construct support and query sets for MAML and FO-MAML training in one-shot scenarios, one sample from each of five different individuals is randomly chosen. The support set is utilized for training, while the query set is used to fine-tune the network parameters. The mean absolute error (MAE) has been used to evaluate the performance of the model in predicting head pose angles. We successfully predicted the head pose angles with a mean average error (MAE_avg) of 7.72 using MAML. For the experiment in one-shot settings, we trained the meta-learner for 250 episodes with a meta batch size of 64. The training and validation loss of the meta-learner using MAML in the one-shot setting is shown in Figure 9.
Similarly, we successfully predicted the head pose angles with an MAE_avg of 8.33 using FO-MAML. For the experiment using FO-MAML in one-shot settings, we trained the meta-learner for 250 episodes with a meta batch size of 64. Stochastic gradient descent (SGD) is used as the optimizer during meta-learning.
The training and validation loss of the meta-learner using FO-MAML in the one-shot setting is shown in Figure 10. Smoother results have been obtained in the case of MAML. This is because second-order gradients are used during backpropagation when adding the gradient vectors in each update step. The results also show almost similar performance when we ignore the second-order gradients during meta-learning training. To compare the time taken to update the meta-gradient in both the MAML and FO-MAML-based methods, we computed the time taken for a single outer-loop update. The MAML approach took about 7.15 msec for a single outer-loop update, whereas it took only 5.41 msec for FO-MAML.

Five-Shot Settings Using MAML and FO-MAML
Five samples from each of five different individuals are randomly chosen to create the support and query sets for training MAML and FO-MAML in five-shot scenarios. The mean absolute error (MAE) has been used to evaluate the performance of the model in predicting head pose angles. In five-shot settings, the head pose angles have been predicted with an MAE_avg of 6.30 using MAML and 6.84 using FO-MAML. For the experiment in five-shot settings, a similar setup is used as that of the one-shot settings, i.e., 250 episodes with a meta batch size of 64.

Ten-Shot Settings Using MAML and FO-MAML

In ten-shot settings, ten samples from each of five different individuals are randomly chosen to create the support and query sets. The head pose angles have been predicted with an MAE_avg of 5.32 using MAML and 6.23 using FO-MAML. The training and validation loss of the meta-learner using MAML in the ten-shot setting is shown in Figure 13, and Figure 14 shows the corresponding loss using FO-MAML.

Table 1 summarizes the mean average errors in the one-shot, five-shot, and ten-shot settings. The results show slightly better performance using the MAML algorithm in few-shot settings, with almost similar performance achieved using FO-MAML. These results confirm that performance remains nearly the same when we use FO-MAML instead of MAML: ignoring second-order gradients does not make a large difference during meta-training. The findings further confirm that the improvement in MAML mostly comes from the gradients of the objective at the post-update parameter values, and that the second-order gradients are less significant for the performance of MAML. The experiments also show that the FO-MAML algorithm is computationally more efficient in updating gradients than MAML: a single outer-loop update in FO-MAML is much faster than in MAML. The computational complexity of an outer-loop update in a single episode in MAML is found to be O(n²), whereas it is linear, i.e., O(n), for FO-MAML.
Table 2 summarizes the comparison of the proposed method with the state-of-the-art on the BIWI head pose dataset. For this, we take the results from the five-way five-shot settings and compare the average MAE.
The comparison with state-of-the-art methods shows that the MAML and FO-MAML-based approaches achieve comparable results in estimating head poses. Unlike existing methods, the MAML and FO-MAML-based methods adapt faster to unseen tasks that may come in the future.
Table 3 summarizes the average time taken to predict head pose in one-shot, five-shot, and ten-shot settings. The values are obtained by averaging the results of five experiments. The results show that FO-MAML is computationally more efficient than MAML in head pose estimation.

Fast Adaptation
A separate unseen test dataset has been used to analyze the fast adaptation of the meta-learner. In this article, we separated a small subset of the BIWI dataset containing 2,638 faces of five different subjects to analyze fast adaptation. The optimal parameters after meta-training are taken as the initial parameters for testing fast adaptation. For analyzing fast adaptation, we optimize only the inner loop for adapting to unseen tasks, considering ten inner steps for training. We randomly select a task from the test distribution and pass it through the meta-learner. Using the optimal MAML parameters, mean average errors of 8.10, 7.00, and 6.15 have been obtained in predicting head pose in the one-shot, five-shot, and ten-shot settings, respectively. Similarly, using the optimal FO-MAML parameters as the initial parameters, mean average errors of 8.65, 7.10, and 6.77 have been obtained in estimating head pose in the one-shot, five-shot, and ten-shot settings, respectively.

Comparing the Model Performance and Run-Time Performance by Varying the Inner Gradient Steps
For this experiment, we varied the number of inner gradient steps and measured the mean average error in predicting head pose angles, as well as the average time taken for the experiment to complete in seconds. The same hyperparameters used while training MAML were kept, except for the number of inner gradient loop steps, which was set to 5, 10, 15, and 20. The experiments were run five times each for both MAML and FO-MAML for 100 episodes (epochs); the mean average error and the average time taken were then noted. For simplicity, we conducted the experiment in the five-shot setting. The results are shown in Table 4 and Table 5, respectively. The results reveal that FO-MAML outperforms MAML in computational time. Similarly, a larger number of inner gradient steps does not significantly improve the performance of the model: approximately similar MAE has been obtained by varying the number of inner gradient steps. The error in the prediction of head pose angles is also quite comparable between MAML and FO-MAML.
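The timing protocol can be reproduced in miniature with a sketch like the following, where a toy linear-regression episode stands in for the meta-learner (`run_episode`, the step grid, and all sizes are illustrative assumptions, not the paper's code):

```python
import time
import numpy as np

def run_episode(inner_steps, alpha=0.01, dim=8):
    # one toy episode: adapt a linear regressor on a random support set
    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(5, dim)), rng.normal(size=5)
    w = np.zeros(dim)
    for _ in range(inner_steps):
        w -= alpha * X.T @ (X @ w - y) / len(y)
    return float(np.mean(np.abs(X @ w - y)))  # support-set MAE

def time_inner_steps(step_grid=(5, 10, 15, 20), episodes=100):
    # for each inner-step budget, record (mean MAE, wall-clock seconds)
    results = {}
    for k in step_grid:
        start = time.perf_counter()
        maes = [run_episode(k) for _ in range(episodes)]
        results[k] = (sum(maes) / len(maes), time.perf_counter() - start)
    return results
```

Wall-clock time grows roughly linearly with the inner-step budget, while the error typically plateaus, matching the observation that extra inner steps buy little accuracy.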
Figure 15 and Figure 16 show the model performance in terms of mean average error (MAE_avg) and the average completion time, respectively.

CONCLUSION
This article proposed a faster approach to training a head pose estimator network using first-order model-agnostic meta-learning (FO-MAML). Experiments in one-shot, five-shot, and ten-shot settings are implemented using MAML and FO-MAML. A mean average error (MAE_avg) of 7.72, 6.30, and 5.32 has been achieved using MAML for the one-shot, five-shot, and ten-shot settings, respectively. Similarly, a mean average error (MAE_avg) of 8.33, 6.84, and 6.23 is achieved using FO-MAML for the one-shot, five-shot, and ten-shot settings. The proposed approach is able to correctly detect head pose angles for faces within the range of ±75° in yaw, ±60° in pitch, and ±50° in roll using limited training samples. The computational complexity of an outer-loop update in MAML is found to be O(n²), whereas with FO-MAML it is linear, i.e., O(n). The results also suggest that second-order gradients are less significant for the performance of MAML: almost similar performance can be achieved by considering only the first-order meta-gradients. The study also confirms that a larger number of inner gradient steps does not significantly enhance the performance of the model; almost similar MAE_avg has been achieved by varying the number of inner gradient steps. The model has been successful in adapting to unseen test samples from the BIWI head pose dataset and predicted correct head pose angles using only ten inner gradient descent steps. The results also support the fact that FO-MAML is computationally more efficient than MAML while giving almost similar performance.
The experiments and results from this approach provide a robust starting ground for creating good head pose estimation applications in few-shot settings. However, meta-learning-based approaches face a few challenges when encountering highly varying task distributions. In the real world, task distributions are often diverse, making it challenging for meta-learners to adapt efficiently. Additionally, to apply meta-learning in few-shot settings, the tasks should be separable into support and query sets. Many head pose datasets cannot be separated into few-shot tasks, making the application of meta-learning more difficult. In the future, this approach could be verified on more complicated and highly varying datasets. A more efficient approach could be developed by using other meta-learning algorithms such as Reptile and implicit MAML (iMAML).

Figure 1. Illustration of head pose estimation using First-Order Model-Agnostic Meta-Learning architecture

Figure 2. The architecture of a variational autoencoder

Figure 3. The architecture of MAML and FO-MAML

Let us denote the support set and the query set for training as S; similarly, the support set and the query set for testing are put together as S_test. A random sampling strategy has been used to sample the support and the query sets from the distribution D(τ). The support and the query sets consist of pairs {(z, gt)}, where z is the latent representation of the input training sample and gt is the ground truth of the head pose angles. We generally select a small number of examples (≤ 20) for the support and the query set when training in the few-shot scenario. The cost function for the gradient update is the mean absolute error (MAE) in estimating yaw, pitch, and roll. Meta-learning computes the loss on the support set of a person p_test from the test set S_test to fine-tune and adapt our model M_w*.
Algorithm 3: Gradient update in First-Order Model-Agnostic Meta-Learning
Step 1. Sample a task τ from the distribution D(τ)
Step 2. Apply the update operator to yield ω_new = U_τ(ω)
Step 3. Compute the gradients at ω_new

Figure 4. Meta-learner architecture
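The steps of Algorithm 3 can be written compactly as follows; here α and β denote the inner- and outer-loop learning rates, an assumption since the excerpt does not fix those symbols:

```latex
% Inner loop: a gradient step on task \tau (Step 2, the update operator U_\tau)
\omega_{\mathrm{new}} = U_{\tau}(\omega) = \omega - \alpha \nabla_{\omega}\mathcal{L}_{\tau}(\omega)

% Outer loop (first-order, Step 3): use the gradient evaluated at the adapted
% parameters directly, ignoring the Jacobian of U_\tau itself
\omega \leftarrow \omega - \beta \nabla_{\omega_{\mathrm{new}}}\mathcal{L}_{\tau}(\omega_{\mathrm{new}})
```

Because the outer step never differentiates through U_τ, no Hessian-vector products are required, which is the source of the O(n) versus O(n²) difference in the outer-loop update cost.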

Figure 5. Mean squared error loss during reconstruction using a VAE

Figure 6. KL-divergence loss during reconstruction using a VAE

Figure 8. Original and reconstructed faces using a VAE

Figure 11. Training and validation loss in five-shot settings using MAML

Figure 11 shows the meta-training and meta-validation loss in terms of MAE. Again, the training and validation loss curves show more stability in training MAML than FO-MAML. The MAML algorithm took about 9.30 msec for a single outer-loop update, whereas FO-MAML only took about 6

Figure 13. Training and validation loss in ten-shot settings using MAML

Figure 15. Mean average error with varying number of inner gradient steps for 100 episodes