1 Introduction
With the development of related disciplines such as virtual reality and machine learning, human-computer interaction is moving in a more natural and pervasive direction. There is an urgent need to control systems or interact with digital content in virtual environments through natural actions, rather than through traditional dedicated input devices. Human-computer interaction is shifting from being computer-centered to being user-centered. Among all body parts, the human hand, as a dexterous and effective executive organ, plays a particularly important role in interaction: in daily life, people constantly use their hands to manipulate objects or to communicate with others. The aim of hand pose estimation is to recover the complete motion posture of the hand in a computer system, so that the computer or other devices can sense the spatial posture of the hand and act according to the user's instructions. Accurate hand pose estimation can not only drive realistic virtual hand movements but also enhance the user experience in human-computer interaction (Chakraborty et al., 2018). This helps computers better understand human behavior, which in turn makes interaction between humans and intelligent systems more intelligent.
As an important interaction modality in computer graphics, virtual reality, and human-computer interaction, gesture interaction provides a convenient, intuitive, and simple user experience. Gesture interaction and recognition are of great significance to virtual reality (De Smedt et al., 2016), 3D motion-sensing games (Duan et al., 2021), assisted medical surgery (Gao et al., 202), and other applications. However, because different gestures have a high degree of freedom (Hussain et al., 2017), acquired gesture image data are typically characterized by low resolution, cluttered backgrounds, occluded hands, fingers of differing shape and size, and individual differences. These factors make it difficult to represent different gesture features accurately, posing difficulties and challenges for gesture recognition (Hou et al., 2018).
Traditional gesture recognition is usually based on camera images, recognizing and classifying 2D gestures. Jiang et al. (2019) analyze the images of an image pyramid successively from low to high resolution, according to the changes in the geometric dimensions of the different hand parts obtained by segmentation. Li et al. (2019) and Moin et al. (2021) proposed an effective distance measure based on gesture shapes, the Finger-Earth Mover's Distance (FEMD), which compares the shape differences between gestures. Nasri et al. (2020) proposed a gesture recognition method based on the main direction of the gesture and Hausdorff-like distance template matching. This method places a strong constraint on the main gesture direction: the extracted main direction must be consistent with that of similar gestures in the training library, which limits the method's applicability.
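As background for the template-matching approach above, the symmetric Hausdorff distance between two contour point sets can be sketched as follows. This is a generic illustration, not the cited method; the toy contours and function names are hypothetical.

```python
import numpy as np

def directed_hausdorff(a, b):
    """Directed Hausdorff distance: the worst-case distance from a
    point in set a to its nearest neighbour in set b."""
    # pairwise Euclidean distances between (N, 2) and (M, 2) point sets
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1).max()

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two contour point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

# toy gesture contours (hypothetical): a stored template vs. an observation
template = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
observed = np.array([[0.1, 0.0], [1.0, 0.1], [0.9, 1.0]])
print(hausdorff(template, observed))  # small value: contours nearly match
```

In a template-matching recognizer, the observed contour would be compared against every template in the library and assigned the label of the nearest one.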
In recent years, many researchers have combined data modeling with graph structures suited to the characteristics of gesture sequence data, proposing graph convolutional networks (GCNs) (Rudi and Yuniarno) for action prediction. Because a GCN can fully exploit the spatial relationships within a gesture, this approach greatly improves performance. However, a fixed topology is not the best choice for describing diverse action samples, since it limits the scope of message passing between nodes. A graph structure that can be adjusted dynamically according to the data samples is therefore better suited to modeling diverse gestures. In addition, previous GCNs ignored the differing importance of channels: the features produced by some channels are crucial for motion recognition, while those in other channels play only a minor role. During feature extraction, more attention should be paid to the important channel features, and the unimportant channel information should be down-weighted. To adjust the graph structure dynamically according to the data samples, this paper proposes a gesture recognition algorithm based on a spatio-temporal graph convolutional network.
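The two ideas above, a sample-dependent adjacency added to the fixed skeleton topology and channel-wise attention over the resulting features, can be illustrated with a minimal NumPy sketch. All names and the specific formulas here are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_graph_conv(X, A_fixed, W, alpha):
    """One graph-convolution layer with a data-dependent adjacency term.

    X: node features, shape (N, C_in) -- e.g. N hand joints.
    A_fixed: normalized fixed skeleton adjacency, shape (N, N).
    W: projection weights, shape (C_in, C_out).
    alpha: mixing weight for the sample-dependent graph (hypothetical).
    """
    # sample-dependent affinity: pairwise similarity of joint features,
    # row-normalized so it behaves like an adjacency matrix
    A_dyn = softmax(X @ X.T / np.sqrt(X.shape[1]))
    A = A_fixed + alpha * A_dyn          # fixed topology + dynamic refinement
    H = A @ X @ W                        # aggregate neighbours, then project
    # channel attention: squeeze over nodes, then gate each channel
    gate = 1.0 / (1.0 + np.exp(-H.mean(axis=0)))   # sigmoid per channel
    return np.maximum(H * gate, 0.0)     # ReLU activation

# toy usage: 21 hand joints, 8 input channels, 16 output channels
rng = np.random.default_rng(0)
X = rng.normal(size=(21, 8))
A_fixed = np.eye(21)                     # placeholder skeleton adjacency
W = rng.normal(size=(8, 16))
out = dynamic_graph_conv(X, A_fixed, W, alpha=0.5)
print(out.shape)                         # (21, 16)
```

In a full spatio-temporal model, a layer like this would be applied per frame and interleaved with temporal convolutions over the frame axis; here only the spatial step is shown.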