Introduction
Object tracking is a fundamental task in the field of computer vision. Given the object annotated in the first frame of a video sequence, the tracker must accurately predict the object's position and scale in the subsequent frames. Although much progress has been made in research on object tracking (Marvasti-Zadeh et al., 2021; Abbass et al., 2021), it remains a challenging task because the object is often disturbed by external factors such as scale change, illumination change, and occlusion (K. Zhang et al., 2014).
In recent years, object-tracking algorithms have mainly been divided into two categories according to the template type: algorithms based on explicit templates and algorithms based on implicit templates. Tracking algorithms based on Siamese neural networks, such as the Fully-Convolutional Siamese Network (SiamFC; Bertinetto et al., 2016) and the Deeper and Wider Siamese Network (Zhang & Peng, 2019), are representative of the first kind. They take the object branch as the template and locate the region most similar to the object within the search region through cross-correlation. Representative of the second kind is the Multi-Domain Network tracking algorithm MDNet (Nam & Han, 2016). This algorithm divides the convolutional neural network into shared feature-extraction layers and domain-specific layers. The shared layers are pre-trained on the dataset to learn features common to objects, whereas the domain-specific layers are trained online during tracking to produce a domain-specific visual tracking classifier. Because the implicit-template method learns the appearance model online as the last fully connected layer, its accuracy is usually better than that of explicit-template methods based on Siamese neural networks.
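The cross-correlation step used by the Siamese trackers above can be sketched as follows. This is an illustrative NumPy version that slides a template feature map over a search-region feature map; it is not SiamFC's actual (fully convolutional, batched) implementation, and the array shapes are chosen only for the example.

```python
import numpy as np

def cross_correlation(z, x):
    """Slide template features z (C, hz, wz) over search features x (C, Hx, Wx)
    and return a response map of dot-product similarities."""
    C, hz, wz = z.shape
    _, Hx, Wx = x.shape
    out = np.empty((Hx - hz + 1, Wx - wz + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # similarity of the template with one translated window of the search region
            out[i, j] = np.sum(z * x[:, i:i + hz, j:j + wz])
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 17, 17))      # toy search-region features
z = x[:, 5:11, 7:13].copy()               # plant the template inside the search region
resp = cross_correlation(z, x)
peak = np.unravel_index(resp.argmax(), resp.shape)
```

Because the template was cut from position (5, 7) of the search features, the response map peaks there; in a real tracker the peak of this map gives the predicted object location.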
However, MDNet’s shared feature-extraction part has only a few convolutional layers, so it cannot capture long-range dependency information. When the tracked object deforms, the illumination changes, or similar objects in the background interfere, the shared feature-extraction layers cannot extract more discriminative and robust features of the tracked object from the global information, and the tracking accuracy degrades considerably. Therefore, the tracker proposed in this article introduces the Global Context module (Y. Cao et al., 2019) into MDNet and combines the complete Global Context attention module with the split Global Context attention module to better enhance the discriminability of features. The Global Context attention module simplifies the self-attention operation of non-local modules by learning a single attention map shared by all query positions in the feature map, and adopts a method similar to the Squeeze-and-Excitation Network (SENet; Hu et al., 2018) to obtain attention between channels. This design captures spatial and channel attention at the same time while greatly reducing the number of additional parameters and the computational complexity.
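A minimal sketch of the Global Context block's two stages, a shared spatial attention map for context modeling followed by an SE-style channel bottleneck transform, might look like the NumPy code below. The weight shapes, the bottleneck ratio, and the omission of learned layer-norm affine parameters are simplifications for illustration, not the configuration used in this article.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def global_context_block(x, wk, w1, w2):
    """Simplified Global Context block (after Cao et al., 2019).
    x : (C, H, W) feature map
    wk: (C,) weights of the 1x1 context-modeling conv
    w1: (C//r, C), w2: (C, C//r) bottleneck transform weights
    """
    C, H, W = x.shape
    flat = x.reshape(C, H * W)
    # one attention map shared by all query positions
    # (this replaces the per-query maps of a non-local block)
    attn = softmax(wk @ flat)                # (HW,)
    context = flat @ attn                    # (C,) global context vector
    # SE-style bottleneck transform: layer norm + ReLU between two projections
    h = w1 @ context
    h = (h - h.mean()) / (h.std() + 1e-5)    # layer norm, no learned affine for brevity
    t = w2 @ np.maximum(h, 0.0)              # (C,) channel-wise modulation term
    # broadcast-add the same term onto every spatial position
    return x + t[:, None, None]

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 7, 7))
out = global_context_block(
    x,
    wk=rng.standard_normal(16),
    w1=rng.standard_normal((4, 16)),         # bottleneck ratio r = 4
    w2=rng.standard_normal((16, 4)),
)
```

Note that the added term is identical at every spatial position: the spatial attention is spent entirely on pooling the context vector, which is why the block is so much cheaper than a full non-local module.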
In summary, the main contributions of this work are the following: