Long-Distance Gesture Recognition using Dynamic Neural Networks
IROS 2023

A demonstration of our method used to control a mobile robot

Abstract

Gestures form an important medium of communication between humans and machines. An overwhelming majority of existing gesture recognition methods are tailored to a scenario where humans and machines are located very close to each other. This short-distance assumption does not hold for several types of interaction, for example gesture-based interactions with a floor-cleaning robot or with a drone. Methods designed for short-distance recognition perform poorly on long-distance recognition because the gesture occupies only a small portion of the input data. Their performance degrades especially in resource-constrained settings, where they cannot effectively focus their limited compute on the gesturing subject. We propose a novel, accurate, and efficient method for recognizing gestures from longer distances. It uses a dynamic neural network to select features from gesture-containing spatial regions of the input sensor data for further processing. This helps the network focus on features important for gesture recognition while discarding background features early on, making it more compute-efficient than other techniques. We demonstrate the performance of our method on the LD-ConGR long-distance dataset, where it outperforms previous state-of-the-art methods in recognition accuracy and compute efficiency.

Proposed Model

We propose a 3D CNN pipeline that uses a dynamic neural network to selectively discard background features while preserving the features of the gesturing subject.
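
To make the dataflow concrete, here is a minimal PyTorch sketch of such a pipeline: a cheap 3D-CNN stem processes the full frame, the resulting feature map is split into a grid of spatial patches, and only the selected patch reaches the deeper, more expensive layers. The layer sizes, the 4x4 patch grid, and the mean-activation stand-in for the selector are placeholder choices of ours, not the paper's configuration; a learned selector in the spirit of the patch selection subnetwork is sketched in the next section.

```python
import torch
import torch.nn as nn

def split_into_patches(feats: torch.Tensor, grid: int = 4) -> torch.Tensor:
    """Split a feature map (B, C, T, H, W) into grid*grid non-overlapping
    spatial patches, shaped (B, grid*grid, C, T, H//grid, W//grid)."""
    B, C, T, H, W = feats.shape
    h, w = H // grid, W // grid
    feats = feats.reshape(B, C, T, grid, h, grid, w)
    return feats.permute(0, 3, 5, 1, 2, 4, 6).reshape(B, grid * grid, C, T, h, w)

class GesturePipeline(nn.Module):
    """Hypothetical pipeline: a cheap 3D-CNN stem on the full frame, patch
    selection, then a deeper 3D-CNN head that sees only the kept patch."""

    def __init__(self, num_classes: int = 10, grid: int = 4):
        super().__init__()
        self.grid = grid
        self.stem = nn.Sequential(          # runs on the whole input
            nn.Conv3d(3, 32, 3, stride=(1, 2, 2), padding=1),
            nn.BatchNorm3d(32),
            nn.ReLU(inplace=True),
        )
        self.head = nn.Sequential(          # runs on one patch only
            nn.Conv3d(32, 64, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(64, num_classes),
        )

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        feats = self.stem(clip)                          # (B, 32, T, H, W)
        patches = split_into_patches(feats, self.grid)   # (B, P, 32, T, h, w)
        # Placeholder selection: keep the most active patch. The actual
        # method scores patches with a learned binary gesture classifier.
        scores = patches.abs().mean(dim=(2, 3, 4, 5))    # (B, P)
        best = scores.argmax(dim=1)
        selected = patches[torch.arange(clip.shape[0]), best]
        return self.head(selected)                       # (B, num_classes)

# Example: two 16-frame RGB clips at 128x128 resolution.
logits = GesturePipeline()(torch.randn(2, 3, 16, 128, 128))
```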



The patch selection subnetwork forms the core of the pipeline. A binary gesture classifier assigns a score to each patch, reflecting the likelihood that the patch contains subject features. The patch with the maximum score is forwarded to the rest of the network, while the remaining feature patches are discarded. For more details, please refer to our paper.
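
Below is a minimal sketch of such a selection step. The per-patch scorer here (global pooling followed by a single linear layer) is an assumption of ours, standing in for the paper's binary gesture classifier.

```python
import torch
import torch.nn as nn

class PatchSelector(nn.Module):
    """Sketch of a patch selection subnetwork: a lightweight scorer rates
    each feature patch, and only the highest-scoring patch is forwarded.
    The scorer architecture is a placeholder, not the paper's design."""

    def __init__(self, in_channels: int):
        super().__init__()
        # Per-patch "contains a gesturing subject" score (binary logit).
        self.scorer = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),   # pool each patch to a single vector
            nn.Flatten(),
            nn.Linear(in_channels, 1),
        )

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (B, P, C, T, h, w) -- P spatial patches per clip.
        B, P = patches.shape[:2]
        flat = patches.flatten(0, 1)              # (B*P, C, T, h, w)
        scores = self.scorer(flat).view(B, P)     # (B, P) gesture logits
        best = scores.argmax(dim=1)               # index of winning patch
        # Keep only the selected patch; the rest are discarded.
        return patches[torch.arange(B), best]     # (B, C, T, h, w)

# Usage: 16 patches per clip, 32-channel features.
selector = PatchSelector(in_channels=32)
kept = selector(torch.randn(2, 16, 32, 16, 16, 16))   # (2, 32, 16, 16, 16)
```

Because only the maximum-scoring patch is forwarded, the downstream layers process roughly 1/P of the spatial features, which is where the compute savings over full-frame processing come from.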



Results

Our method outperforms all comparable baselines in both Top-1 accuracy and compute efficiency, as can be seen from its position in the top-left quadrant of the accuracy-efficiency plot.



We also measure the performance of our method and the baselines on a subset of videos in which all subjects are at the maximum distance of 4 meters. All methods deteriorate relative to their performance on the full test set, where subjects were at distances between 1 and 4 meters. Our method shows the least deterioration, demonstrating its suitability for long-distance recognition.

Acknowledgements

The website template was borrowed from Michaël Gharbi and Ref-NeRF.