TY - JOUR
T1 - Human-Centric Relation Segmentation
T2 - Dataset and Solution
AU - Liu, Si
AU - Wang, Zitian
AU - Gao, Yulu
AU - Ren, Lejian
AU - Liao, Yue
AU - Ren, Guanghui
AU - Li, Bo
AU - Yan, Shuicheng
N1 - Publisher Copyright: © 1979-2012 IEEE.
PY - 2022/9/1
Y1 - 2022/9/1
AB - Vision and language understanding techniques have achieved remarkable progress, but it is still difficult to handle problems involving very fine-grained details. For example, when a robot is told to 'bring me the book in the girl's left hand', most existing methods would fail if the girl holds a book in each hand. In this work, we introduce a new task named human-centric relation segmentation (HRS), a fine-grained case of human-object interaction detection (HOI-det). HRS aims to predict the relations between the human and surrounding entities and to identify the relation-correlated human parts, which are represented as pixel-level masks. For the above example, the HRS task produces results in the form of relation triplets (girl [left hand], hold, book) and extracts segmentation masks of the book, with which the robot can easily accomplish the grabbing task. Correspondingly, we collect a new Person In Context (PIC) dataset for this task, which contains 17,122 high-resolution images with densely annotated entity segmentation and relations, including 141 object categories, 23 relation categories and 25 semantic human parts. We also propose a Simultaneous Matching and Segmentation (SMS) framework as a solution to the HRS task. It contains three parallel branches for entity segmentation, subject-object matching and human parsing, respectively. Specifically, the entity segmentation branch obtains entity masks by dynamically generated conditional convolutions; the subject-object matching branch detects the existence of any relations, links the corresponding subjects and objects by displacement estimation and classifies the interacted human parts; and the human parsing branch generates the pixelwise human part labels. Outputs of the three branches are fused to produce the final HRS results. Extensive experiments on the PIC and V-COCO datasets show that the proposed SMS method outperforms baselines at an inference speed of 36 FPS. Notably, SMS outperforms the best-performing baseline, mm-KERN, with only 17.6 percent of the time cost. The dataset and code will be released at http://picdataset.com/challenge/index/.
KW - Human-centric relation segmentation
KW - human object interaction
KW - matching
KW - visual relation detection
UR - https://www.scopus.com/pages/publications/85105056679
U2 - 10.1109/TPAMI.2021.3075846
DO - 10.1109/TPAMI.2021.3075846
M3 - Article
C2 - 33905323
AN - SCOPUS:85105056679
SN - 0162-8828
VL - 44
SP - 4987
EP - 5001
JO - IEEE Transactions on Pattern Analysis and Machine Intelligence
JF - IEEE Transactions on Pattern Analysis and Machine Intelligence
IS - 9
ER -