Transforming Hand Gesture Recognition Into Image Classification Using Data Level Fusion: Methods, Framework, and Results


DOI: 10.4018/978-1-6684-7791-5.ch003

Abstract

Hand gesture recognition (HGR) is a form of perceptual computing with applications in human-machine interaction, virtual/augmented reality, and human behavior analysis. Within the HGR domain, several frameworks have been developed with different combinations of input modalities and neural network architectures, to varying levels of efficacy. Such frameworks maximize performance at the expense of increased computational and hardware requirements. These drawbacks can be mitigated by a skeleton-based framework that transforms the hand gesture recognition task into an image classification task. This chapter explores several temporal information condensation (via data-level fusion) methods for encoding dynamic gesture information into static images. The efficacies of these methods are compared, and the best ones are aggregated into a generalized HGR framework that is extensively evaluated on the CNR, FPHA, LMDHG, SHREC2017, and DHG1428 benchmark datasets. On these datasets, the framework's performance is competitive with other state-of-the-art frameworks.

Introduction

Hand Gesture Recognition (HGR) is a form of perceptual computing that allows computers to capture and interpret human hand gestures via mathematical algorithms and execute commands based on those gestures. HGR enhances computational devices with functionality supporting cutting-edge applications such as human-machine interaction, human behavior analysis and characterization, active and assisted living, virtual and augmented reality, and ambient intelligence. The human hand is a complex, deformable object that can assume a near-infinite number of poses, and its physical characteristics (size and color) vary widely across the population. Furthermore, HGR-based applications are deployed in environments with noisy inputs, occlusions, dynamic backgrounds, and real-time constraints.

An HGR framework must effectively tackle the challenges posed by the human hand and the application environment to meet the performance metrics set by developers and required by end users. These metrics include user-friendliness, computational complexity, hardware requirements, latency, and accuracy. Hand gestures can be categorized as static, in which the hand's pose and position do not change with time, or dynamic, in which the hand's pose and position change significantly over time. Most natural human gestures are dynamic, and such gestures pose an additional challenge to HGR frameworks because of the temporal dimension they introduce: an entire sequence of hand poses is required to understand the semantics of the gesture. Within the HGR domain, several studies have developed frameworks for dynamic hand gestures, using different combinations of input modalities and neural network architectures to varying levels of efficacy (Boulahia et al., 2017; Li et al., 2021; Lupinetti et al., 2020; Min et al., 2020; X. S. Nguyen et al., 2019; Sabater et al., 2021; Shi et al., 2020).
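To make the static/dynamic distinction concrete, a static gesture can be represented by a single hand pose (one skeleton frame), whereas a dynamic gesture is a time-ordered sequence of such frames. The minimal Python sketch below uses illustrative, hypothetical shapes only (22 hand joints per frame, as in skeleton datasets such as SHREC2017 and DHG1428, and a 60-frame sample); it is not taken from the chapter's implementation.

```python
# Illustrative data layout only (hypothetical shapes, not the chapter's code).
# A static gesture is one hand pose; a dynamic gesture is a sequence of poses.
import numpy as np

NUM_JOINTS = 22                                  # e.g., 22 hand joints per frame
NUM_FRAMES = 60                                  # e.g., a 60-frame gesture sample

static_gesture = np.zeros((NUM_JOINTS, 3))       # (joints, xyz): a single pose
dynamic_gesture = np.zeros((NUM_FRAMES, NUM_JOINTS, 3))  # (time, joints, xyz)
```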

However, outside research settings, the bigger picture of building practical, real-world HGR-based applications must be kept in mind. The frameworks cited above maximize performance at the expense of additional hardware requirements and increased computational complexity. For example, some input modalities require specialized, often expensive, sensors to capture the raw gesture inputs from the user. With multimodal networks, the hardware requirements increase financial cost and reduce user-friendliness while further complicating the HGR task. Similarly, the neural network architectures adopted are often deep and complex enough to require intensive computation during inference, which hurts their latency in real-time applications. Such frameworks also require more training data and heavier data augmentation to achieve acceptable gesture recognition performance.

To tackle these problems, Lupinetti et al. (2020) proposed a unimodal, skeleton-based dynamic HGR framework that encodes gesture information into RGB images via data-level fusion, thus transforming the relatively complex dynamic HGR task into a much simpler image classification task. The hand skeleton of connected joints effectively describes the hand's geometric shape, capturing richer semantic gesture information while eliminating noise from individual differences in physical hand characteristics. However, the data-level fusion method adopted by Lupinetti et al. (2020) has notable shortcomings: the generated images are noisy, insufficiently dissimilar across gesture classes, and visualized from possibly suboptimal view orientations (planes). These transformation issues reduce the performance of the framework's model and cast doubt on the efficacy of temporal information condensation for encoding spatiotemporal information into static images for classification purposes.
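To illustrate the general data-level fusion idea, the sketch below condenses a skeleton sequence into a single static image by projecting each joint's trajectory onto a 2D plane and color-coding time along the trajectory. It is a minimal sketch under assumed conventions (the function name, image size, and color map are hypothetical), not the method of Lupinetti et al. (2020) or the chapter's final framework.

```python
# Minimal sketch of temporal information condensation via data-level fusion.
# Assumes a gesture is an array of 3D joint positions with shape (T, J, 3).
# Names, image size, and color map are illustrative, not the authors' choices.
import numpy as np
import matplotlib.pyplot as plt


def gesture_to_image(joints, plane=(0, 1), out_path="gesture.png"):
    """Render a dynamic gesture as one static RGB image.

    Each joint's trajectory is projected onto the chosen 2D plane, and the
    color of every segment encodes its position in time, so the whole motion
    is condensed into a single picture an image classifier can consume.
    """
    T, J, _ = joints.shape
    cmap = plt.get_cmap("viridis")                     # dark = early, bright = late
    fig, ax = plt.subplots(figsize=(2.24, 2.24), dpi=100)  # roughly 224x224 pixels
    for j in range(J):
        xy = joints[:, j, :][:, list(plane)]           # project onto the 2D plane
        for t in range(T - 1):
            ax.plot(xy[t:t + 2, 0], xy[t:t + 2, 1],
                    color=cmap(t / max(T - 2, 1)), linewidth=1)
    ax.set_axis_off()
    fig.savefig(out_path, bbox_inches="tight", pad_inches=0)
    plt.close(fig)


# Example: encode a synthetic 60-frame, 22-joint gesture onto the xy plane.
demo_gesture = np.cumsum(np.random.randn(60, 22, 3) * 0.01, axis=0)
gesture_to_image(demo_gesture, plane=(0, 1))
```

Because the color gradient preserves temporal ordering, a standard CNN image classifier can in principle recover dynamic information from the resulting static picture; the choice of projection plane and encoding scheme is exactly what the data-level fusion methods explored in this chapter vary.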
