Spatial-Temporal Feature-Based Sports Video Classification

Zengkai Wang (Jiaxing University, China)
Copyright: © 2021 |Pages: 19
DOI: 10.4018/IJACI.2021100105


Video classification has been an active research field in computer vision over the last few years. Its main purpose is to produce a relevant label for a video given its frames. Unlike image classification, which takes still pictures as input, video classification takes a sequence of images. The complex spatial and temporal structures of video sequences make understanding and computation difficult, and they should be modeled to improve classification performance. This work focuses on sports video classification but can be extended to other applications. In this paper, the authors propose a novel sports video classification method that processes video data using a convolutional neural network (CNN) with a spatial attention mechanism and a deep bidirectional long short-term memory (BiLSTM) network with a temporal attention mechanism. The method first extracts 28 frames from each input video and uses a classical pre-trained CNN to extract deep features; the spatial attention mechanism is applied to the CNN features to decide 'where' to look. The BiLSTM then models the long-term temporal dependence across the video frame sequence, and the temporal attention mechanism is employed to decide 'when' to look. Finally, the classification network outputs the label of the input video. To evaluate the feasibility and effectiveness of the proposed method, extensive experiments were conducted on the open, challenging sports video datasets Sports8 and Olympic16. The results show that the proposed CNN-BiLSTM network with spatial-temporal attention can effectively model the spatial-temporal characteristics of video sequences. The average classification accuracy on Sports8 is 98.8%, which is 6.8% higher than the existing method. An average classification accuracy of 90.46% is achieved on Olympic16, which is about 18% higher than the existing methods. The proposed approach outperforms the state-of-the-art methods, and the experimental results demonstrate its effectiveness.
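As a rough illustration of two steps named in the abstract, the sketch below picks 28 frame indices from a video and pools a set of feature vectors with attention weights. It is not the authors' implementation. The abstract does not say how the 28 frames are chosen, so evenly spaced sampling is an assumption, and the dot-product scoring against a `query` vector is a generic attention form chosen for illustration. The function names are hypothetical, and the pre-trained CNN and the BiLSTM are omitted entirely.

```python
import math


def sample_frame_indices(num_frames, num_samples=28):
    """Pick num_samples evenly spaced frame indices (assumed sampling scheme)."""
    if num_frames <= 0:
        raise ValueError("video has no frames")
    # Take the middle of each equal-length segment; short videos repeat indices.
    step = num_frames / num_samples
    return [min(num_frames - 1, int(step * (i + 0.5))) for i in range(num_samples)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, query):
    """Return the attention-weighted average of vectors and the weights used.

    Scores are dot products with `query` (an illustrative stand-in for the
    learned scoring function a real attention layer would have).
    """
    scores = [sum(v_i * q_i for v_i, q_i in zip(v, query)) for v in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
    return pooled, weights
```

The same pooling routine can stand in for both mechanisms: applied over the spatial locations of one frame's feature map it weights 'where' to look, and applied over the per-frame outputs of the sequence model it weights 'when' to look.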


With the rapid development of computer and Internet technology, mobile Internet technology and mobile communication devices are now widely used in various fields. Using mobile phones and other mobile devices to shoot, watch and share videos has become part of modern life and work (“Cisco Annual Internet Report (2018–2023) White Paper,”). Video has therefore become an important information carrier, and the volume of video on the network is growing geometrically. In recent years, the demand for automatic analysis of video content has been expanding. Over the past decade, video content understanding and recognition technologies have shown broad promise in surveillance (Angadi & Nandyal, 2020; Chakraborty, Bhattacharyya, & Chakraborty, 2018; Ullah et al., 2020), smart homes (Dai, Minciullo, Garattoni, Francesca, & Bremond, 2019), autonomous driving (Gao, Xu, Davis, Socher, & Xiong, 2019), and sports video analysis (Akçay, Seymen, Er, Çetin, & Karslıgil, 2019; Karlsson, 2017; Rafiq, Rafiq, Agyeman, Choi, & Jin, 2020). Sports video has the largest audience of all types of video, and a large number of sports videos are recorded every day. Indexing sports video by sports category is an important means of supporting post-match analysis, the formation of coaching tactics and follow-up processing. It is the basis for sports video summarization, semantic annotation and retrieval, and has great commercial potential and application value (Z. Wang, Yu, & He, 2016).

Video classification technology is an important research direction in the field of computer vision (Wu, Yao, Fu, & Jiang, 2017). Its main purpose is to analyze video content and assign videos to predefined classes according to objects, scenes, the actions of objects and the evolution of scenes, in order to supervise and classify videos (Rafiq et al., 2020). This process involves many fields, such as object detection, scene detection, image processing, pattern recognition and artificial intelligence, and covers almost every aspect of video processing. Video classification therefore embodies advanced, cutting-edge video processing technologies. Video can be regarded as a continuous sequence of images. However, because of the dynamic characteristics of video sequences and factors such as lighting conditions, background, camera angle and shade, scene changes are hard to distinguish amid large intra-class differences and high inter-class similarity, which makes video classification much more complex than single-image classification. Video classification has therefore always been a challenging task in the field of video analysis.

Video classification is essentially a pattern recognition problem, which mainly includes two steps: feature extraction and classification. Feature extraction is the core step of the problem. In the past few decades, with the development of feature extraction technology, video classification has made some progress, but it is still far from satisfactory. There remains a huge semantic gap between low-level features and high-level semantics (Guo, 2020). A video is composed of a series of images in a certain order, and the visual information in the images constitutes the visual information of the video. More importantly, the sequential information between images constitutes the temporal information of the video. This temporal information includes the motion of objects, the evolution of the scene and other information unique to video as a carrier. The complex temporal structures of video sequences make understanding and computation difficult, and they should be modeled to improve classification performance. However, existing feature extraction methods cannot fully capture the temporal information, or can capture only short-term, low-level action characteristics, resulting in insufficient feature representations. As video content on the Internet becomes more complex and informative, this problem becomes more acute. It is therefore of great significance to study video features, especially methods for extracting temporal features.
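The point that frame order carries information can be shown with a toy example. The sketch below is illustrative only and is not taken from the paper: each frame is reduced to a hypothetical one-dimensional feature (an object's horizontal position), and two simple descriptors are compared on a clip and its time-reversed copy. Averaging the per-frame features ignores order, while averaging the differences between consecutive frames does not.

```python
def mean_pool(frames):
    """Average the per-frame feature vectors; the result ignores frame order."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def mean_frame_difference(frames):
    """Average the change between consecutive frames; the result depends on order."""
    dim = len(frames[0])
    diffs = [[b[d] - a[d] for d in range(dim)] for a, b in zip(frames, frames[1:])]
    return mean_pool(diffs)


# Hypothetical feature: the horizontal position of an object in four frames.
forward = [[0.0], [1.0], [2.0], [3.0]]   # object moves left to right
backward = forward[::-1]                 # same frames, played in reverse

print(mean_pool(forward) == mean_pool(backward))   # True
print(mean_frame_difference(forward))              # [1.0]
print(mean_frame_difference(backward))             # [-1.0]
```

The order-blind descriptor cannot tell the two clips apart, whereas the difference-based one separates them by sign. Consecutive-frame differences only capture short-term motion, however, which is the limitation the paragraph above attributes to existing methods.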
