Object Tracking Based on Global Context Attention


Yucheng Wang, Xi Chen, Zhongjie Mao, Jia Yan
DOI: 10.4018/IJCINI.287595

Abstract

Previous research has shown that tracking algorithms unable to capture long-distance information tend to lose the object when the object is deformed, the illumination changes, or the background is disturbed by similar objects. To remedy this, this article proposes an object-tracking method that introduces the Global Context attention module into the Multi-Domain Network (MDNet) tracker. Through the Global Context attention module, the method learns a robust feature representation of the object, allowing it to better distinguish the object from the background in the presence of interference factors. Extensive experiments on the OTB2013, OTB2015, and UAV20L datasets show that the proposed method improves significantly on MDNet and performs competitively against mainstream tracking algorithms. The proposed method achieves especially good results when the video sequence contains object deformation, illumination change, or background interference from similar objects.

Introduction

Object tracking is a fundamental task in the field of computer vision. Given the object in the first frame of a video sequence, the tracker must accurately predict the object's position and size changes in the subsequent frames. Although much progress has been made in research on object tracking (Marvasti-Zadeh et al., 2021; Abbass et al., 2021), it remains a challenging task because the object is often disturbed by external factors, such as size change, illumination change, and occlusion (K. Zhang et al., 2014).

In recent years, object-tracking algorithms have mainly been divided into two categories according to the template type: tracking algorithms based on an explicit template and tracking algorithms based on an implicit template. Tracking algorithms based on Siamese neural networks, such as the Fully-Convolutional Siamese Network (SiamFC; Bertinetto et al., 2016) and the Deeper and Wider Siamese Network (Zhang & Peng, 2019), are representative of the first kind of method. They take the object branch as the template and find the region most similar to the object in the search region through cross-correlation. Representative of the second kind of method is the Multi-Domain Network tracking algorithm MDNet (Nam & Han, 2016). The algorithm divides the convolutional neural network into shared feature-extraction layers and a domain-specific layer. The shared layers are pre-trained on the dataset to learn common features of objects, whereas the domain-specific layer is trained online during tracking to produce a domain-specific visual tracking classifier. Because the implicit template-based method learns the appearance model online as the last fully connected layer, its accuracy is usually better than that of the explicit template-based methods built on Siamese neural networks.
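The cross-correlation step in the explicit-template methods can be illustrated with a small sketch: the template feature map is slid over the search-region feature map, and the position with the highest response is taken as the object location. This is a simplified, single-channel numpy illustration, not the actual SiamFC implementation, which applies a single batched convolution to deep multi-channel features.

```python
import numpy as np

def cross_correlation_score(template, search):
    """Slide the template over the search region and return a
    response map of similarity scores (naive single-channel
    cross-correlation, for illustration only)."""
    th, tw = template.shape
    sh, sw = search.shape
    out_h, out_w = sh - th + 1, sw - tw + 1
    response = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = search[i:i + th, j:j + tw]
            response[i, j] = np.sum(window * template)
    return response

# Toy example: plant the template at offset (2, 3) in a noisy search region.
rng = np.random.default_rng(0)
template = rng.standard_normal((4, 4))
search = rng.standard_normal((10, 10)) * 0.1
search[2:6, 3:7] = template  # plant the object
response = cross_correlation_score(template, search)
peak = np.unravel_index(np.argmax(response), response.shape)
print(peak)
```

In a real tracker, the peak of this response map (here, the planted offset) is mapped back to image coordinates to localize the object in the current frame.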

However, MDNet’s shared feature-extraction layers consist of only a few convolutional layers, so they cannot capture long-range dependency information. When the tracked object deforms, the illumination changes, or similar objects interfere in the background, the shared feature-extraction layers cannot extract discriminative and robust features of the tracked object from global information, greatly degrading tracking accuracy. Therefore, the tracker proposed in this article introduces the Global Context module (Y. Cao et al., 2019) into MDNet and combines the complete Global Context attention module with the split Global Context attention module to better enhance feature discrimination. The Global Context attention module simplifies the self-attention computation of non-local modules by learning a single attention map shared by all query positions in the feature map, and adopts a method similar to the Squeeze-and-Excitation Network (SENet; Hu et al., 2018) to obtain attention between channels. This not only captures spatial and channel attention at the same time, but also greatly reduces the number of additional parameters and the computational complexity.
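The three stages of the Global Context block described above (shared spatial attention, SE-style bottleneck transform, broadcast fusion) can be sketched as follows. This is a minimal numpy illustration under assumed shapes, with the LayerNorm inside the bottleneck omitted for brevity; the weight arrays stand in for the block's 1x1 convolutions and are not the authors' trained parameters.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def global_context_block(x, wk, w1, w2):
    """Sketch of a Global Context block (Cao et al., 2019).
    x:  (C, H, W) input feature map.
    wk: (C,)       context-modeling 1x1-conv weights.
    w1: (C//r, C)  bottleneck-down weights.
    w2: (C, C//r)  bottleneck-up weights."""
    C, H, W = x.shape
    flat = x.reshape(C, H * W)              # (C, N), N = H*W positions
    # 1) Context modeling: one attention map shared by all query positions
    attn = softmax(wk @ flat)               # (N,) spatial attention weights
    context = flat @ attn                   # (C,) global context vector
    # 2) SE-like bottleneck transform (LayerNorm omitted for brevity)
    hidden = np.maximum(w1 @ context, 0.0)  # ReLU
    delta = w2 @ hidden                     # (C,) per-channel modulation
    # 3) Fusion: broadcast-add the channel-wise term to every position
    return x + delta[:, None, None]

rng = np.random.default_rng(1)
C, H, W, r = 8, 5, 5, 4
x = rng.standard_normal((C, H, W))
y = global_context_block(x,
                         rng.standard_normal(C),
                         rng.standard_normal((C // r, C)),
                         rng.standard_normal((C, C // r)))
print(y.shape)  # output keeps the input shape
```

Because the attention map is shared across query positions, the spatial attention is computed once per feature map rather than once per position, which is where the savings over a non-local block come from; the fused term is a single per-channel offset applied everywhere.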

In summary, the main contributions of this work are the following:
