A key component of understanding hand-object interactions is the ability to identify the active object -- the object
that is being manipulated by the human hand.
To accurately localize the active object, any method must reason about the information encoded in each image pixel,
such as whether it belongs to the hand, the object, or the background.
To leverage each pixel as evidence to determine the bounding box of the active object, we propose a pixel-wise voting
function.
Our pixel-wise voting function takes an initial bounding box as input and produces an improved bounding box of the
active object as output.
The voting function is designed so that each pixel inside the input bounding box votes for an improved bounding box,
and the box with the majority vote is selected as the output.
We call the collection of bounding boxes generated inside the voting function the Relational Box Field, as it
characterizes a field of bounding boxes defined in relation to the current bounding box.
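To make the mechanism concrete, the sketch below shows one round of voting in NumPy. The `box_field` (one candidate box per pixel) would in practice be produced by a learned network, and the coordinate-wise median is used as a simple stand-in for the majority vote; the names and details here are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def vote_for_box(box_field, input_box):
    """One round of pixel-wise voting (illustrative sketch).

    box_field : (H, W, 4) array holding each pixel's predicted box
                (x1, y1, x2, y2) -- the Relational Box Field.
    input_box : (x1, y1, x2, y2) of the current bounding box.

    Every pixel inside the input box casts its predicted box as a
    vote; the consensus box is returned.
    """
    x1, y1, x2, y2 = (int(round(v)) for v in input_box)
    votes = box_field[y1:y2, x1:x2].reshape(-1, 4)  # votes from pixels inside the box
    return tuple(np.median(votes, axis=0))          # median as a proxy for the majority vote
```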
While our voting function improves the bounding box of the active object, one round of voting is typically not
enough to accurately localize it.
Therefore, we repeatedly apply the voting function to sequentially improve the location of the bounding box. However,
repeatedly applying a one-step predictor (i.e., auto-regressive processing with our voting function) is known to
cause a data distribution shift, so we mitigate this issue using reinforcement learning (RL).
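A minimal sketch of this auto-regressive refinement loop, assuming a hypothetical `predict_box_field` model that maps an image and a box to a Relational Box Field, and reusing `vote_for_box` from above; the fixed step count is an assumption for illustration:

```python
def refine_box(predict_box_field, image, init_box, num_steps=5):
    """Sequentially improve the box by feeding each output back in
    as the next input (the auto-regressive loop described above)."""
    box = init_box
    for _ in range(num_steps):
        box_field = predict_box_field(image, box)  # one-step prediction
        box = vote_for_box(box_field, box)         # voting yields the improved box
    return box
```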
We adopt standard RL to learn the voting function parameters and show that it provides a meaningful improvement over a
standard supervised learning approach.
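Since the text specifies only "standard RL," one plausible instantiation is a REINFORCE-style policy gradient, sketched below in PyTorch, that treats each voting step as an action and rewards the per-step IoU gain over the previous box; the reward design and all names here are assumptions, not the paper's exact training procedure.

```python
import torch

def policy_gradient_step(log_probs, ious, optimizer):
    """One REINFORCE-style update (an assumed instantiation of
    "standard RL", not necessarily the paper's).

    log_probs : per-step log-probabilities of the sampled boxes
                (scalar tensors that require grad).
    ious      : IoU of each box with the ground truth, where
                ious[0] is the IoU of the initial box.
    """
    # Reward each step by how much it improved the IoU.
    rewards = [ious[t + 1] - ious[t] for t in range(len(log_probs))]
    # Reward-to-go: sum of rewards from step t onward.
    returns = torch.cumsum(torch.tensor(rewards[::-1]), dim=0).flip(0)
    loss = -torch.stack(log_probs) @ returns  # maximize expected return
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```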
We perform experiments on two large-scale datasets: 100DOH and MECCANO, improving AP50 performance by 8% and 30%,
respectively, over the state of the art.