Sequential Voting with Relational Box Fields for Active Object Detection

Qichen Fu
Xingyu Liu
Kris M. Kitani

Carnegie Mellon University

CVPR 2022

[Paper]
[GitHub]


Abstract

A key component of understanding hand-object interactions is the ability to identify the active object -- the object that is being manipulated by the human hand. In order to accurately localize the active object, any method must reason using information encoded by each image pixel, such as whether it belongs to the hand, the object, or the background. To leverage each pixel as evidence to determine the bounding box of the active object, we propose a pixel-wise voting function. Our pixel-wise voting function takes an initial bounding box as input and produces an improved bounding box of the active object as output. The voting function is designed so that each pixel inside the input bounding box votes for an improved bounding box, and the box with the majority vote is selected as the output. We call the collection of bounding boxes generated inside the voting function the Relational Box Field, as it characterizes a field of bounding boxes defined in relationship to the current bounding box. While our voting function is able to improve the bounding box of the active object, one round of voting is typically not enough to accurately localize the active object. Therefore, we repeatedly apply the voting function to sequentially improve the location of the bounding box. However, since it is known that repeatedly applying a one-step predictor (i.e., auto-regressive processing with our voting function) can cause a data distribution shift, we mitigate this issue using reinforcement learning (RL). We adopt standard RL to learn the voting function parameters and show that it provides a meaningful improvement over a standard supervised learning approach. We perform experiments on two large-scale datasets: 100DOH and MECCANO, improving AP50 performance by 8% and 30%, respectively, over the state of the art.


Talk

[Slides]


Method

To achieve robust object localization, especially under occlusion, we propose a voting function built on the Relational Box Field that allows each pixel in the image to vote for a bounding box of the active object. We then progressively refine this bounding box toward a more accurate localization of the active object in a sequential decision-making process modeled as a Markov decision process (MDP).
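The voting step above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: it assumes the Relational Box Field is given as a per-pixel array of `(left, top, right, bottom)` offsets (in the paper this field is predicted by a network, and the majority vote is realized differently), and it stands in the "majority vote" with a simple consensus rule that picks the vote most consistent with all the others.

```python
import numpy as np

def iou(a, b):
    """IoU between two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def vote(field, box):
    """One voting step over a relational box field.

    field: (H, W, 4) array; field[y, x] = (l, t, r, b) offsets, so pixel
           (x, y) votes for the box (x - l, y - t, x + r, y + b).
    box:   current box (x1, y1, x2, y2); only pixels inside it vote.
    Returns the vote with the highest total IoU agreement with the other
    votes -- a simple consensus stand-in for the majority vote.
    """
    x1, y1, x2, y2 = (int(round(v)) for v in box)
    votes = []
    for y in range(max(y1, 0), min(y2, field.shape[0])):
        for x in range(max(x1, 0), min(x2, field.shape[1])):
            l, t, r, b = field[y, x]
            votes.append((x - l, y - t, x + r, y + b))
    scores = [sum(iou(v, w) for w in votes) for v in votes]
    return votes[int(np.argmax(scores))]
```

The sequential refinement then amounts to calling `vote` repeatedly on its own output; in the paper, the field is re-predicted at every step and the voting function's parameters are trained with RL to counter the distribution shift of auto-regressive application, both of which this sketch omits.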


Results

We present qualitative results of our method below. In the figures, each green arrow points from a hand bounding box (blue) to the corresponding active object bounding box (red). The visualizations show that our method robustly detects the active object in scenes with overlapping objects and severe occlusions. Most failure cases are due to incorrect hand detection, motion blur, and insufficient features from tiny hands and objects. Please see our paper for further results and comparisons.
Qualitative Results on 100DOH Dataset
Qualitative Results on MECCANO Dataset


Analysis

We visualize the IoU (red indicates higher IoU) between the final active object box estimation (red) and the pixel-wise predictions inside the hand bounding box (blue). The visualizations show that our voting function favors predictions from informative hand parts, such as the fingers, over irrelevant regions such as the wrist and the background. For better visibility, each sample below shows only one hand-object pair.
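A heatmap of this kind can be reproduced from a per-pixel box field with a short loop. This is a hedged sketch, not the paper's visualization code: `field` (per-pixel `(l, t, r, b)` offsets), `hand_box`, and `final_box` are placeholders for the model's outputs, and integer box coordinates are assumed.

```python
import numpy as np

def iou(a, b):
    """IoU between two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def iou_heatmap(field, hand_box, final_box):
    """For each pixel inside the hand box, record the IoU between its
    predicted box and the final estimate (higher = more informative)."""
    x1, y1, x2, y2 = hand_box
    heat = np.zeros(field.shape[:2])
    for y in range(y1, y2):
        for x in range(x1, x2):
            l, t, r, b = field[y, x]
            heat[y, x] = iou((x - l, y - t, x + r, y + b), final_box)
    return heat
```

Rendering `heat` with a red colormap over the image would give a figure in the spirit of the ones above: pixels whose votes agree with the final box light up, while uninformative regions stay dark.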


Paper and Supplementary Material

Qichen Fu, Xingyu Liu, Kris M. Kitani
Sequential Voting with Relational Box Fields for Active Object Detection
CVPR, 2022.
(hosted on ArXiv)

[Bibtex]


Acknowledgements

This work is funded in part by JST AIP Acceleration Research, Grant Number JPMJCR20U1, Japan. This template was originally made by Phillip Isola and Richard Zhang for a colorful ECCV project; the code can be found here.