HOTR: End-to-End Human-Object Interaction Detection with Transformers
by
Bumsoo Kim, Junhyun Lee, Jaewoo Kang, Eun-Sol Kim, Hyunwoo J. Kim
2021
Abstract
Human-Object Interaction (HOI) detection is the task of identifying "a set of
interactions" in an image, which involves i) localizing the subject
(i.e., human) and target (i.e., object) of each interaction and ii)
classifying the interaction labels. Most existing methods address this task
indirectly, first detecting human and object instances and then inferring an
interaction for every pair of detected instances. In this paper, we present a
novel framework, referred to as HOTR, which directly predicts a set of <human,
object, interaction> triplets from an image based on a transformer
encoder-decoder architecture. Through set prediction, our method
effectively exploits the inherent semantic relationships in an image and does
not require the time-consuming post-processing that is the main bottleneck of
existing methods. Our proposed algorithm achieves state-of-the-art
performance on two HOI detection benchmarks with an inference time under 1 ms
after object detection.
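The triplet decoding described above can be illustrated with a toy sketch. This is not the authors' implementation: the shapes, variable names, and the dot-product pointer scoring are illustrative assumptions. The point it shows is that each interaction query decodes directly into a <human, object, action> triplet by pointing at detected instances, with no pairwise enumeration or post-processing step.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 8           # toy embedding dimension (assumption)
n_inst = 5      # number of detected instance representations (assumption)
k = 3           # number of interaction queries (assumption)
n_actions = 4   # toy action vocabulary size (assumption)

# Instance representations, one per detected box (stand-ins for detector output).
instances = rng.normal(size=(n_inst, d))

# Each interaction query emits a human pointer, an object pointer, and action
# logits; pointers select instances by similarity instead of enumerating pairs.
human_ptr = rng.normal(size=(k, d))
object_ptr = rng.normal(size=(k, d))
action_logits = rng.normal(size=(k, n_actions))

def decode_triplets(instances, human_ptr, object_ptr, action_logits):
    """Decode K interaction queries into <human, object, action> index triplets."""
    h_idx = (human_ptr @ instances.T).argmax(axis=1)   # best-matching human box
    o_idx = (object_ptr @ instances.T).argmax(axis=1)  # best-matching object box
    act = action_logits.argmax(axis=1)                 # predicted action class
    return list(zip(h_idx.tolist(), o_idx.tolist(), act.tolist()))

triplets = decode_triplets(instances, human_ptr, object_ptr, action_logits)
print(triplets)  # K triplets, one per query, with no pairwise post-processing
```

Because the decoding is a handful of matrix products and argmax operations over already-computed representations, it is cheap relative to object detection, which is consistent with the sub-millisecond post-detection inference time reported in the abstract.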
Archived Files and Locations
application/pdf, 1.0 MB — arxiv.org (repository), web.archive.org (webarchive)
arXiv: 2104.13682v1