TY - GEN
T1 - Entity Relation Fusion for Real-Time One-Stage Referring Expression Comprehension
AU - Yu, Hang
AU - Li, Weixin
AU - Li, Jiankai
AU - Du, Ye
N1 - Publisher Copyright:
© 2021 ACM.
PY - 2021/12/1
Y1 - 2021/12/1
AB - Referring Expression Comprehension (REC) is the task of grounding the object referred to by a language expression. Previous one-stage REC methods usually represent the whole query with a single language feature vector for grounding and perform no reasoning between different objects, despite the rich relation cues about objects contained in the language expression, which lowers their grounding accuracy. Additionally, these methods mostly use feature pyramid networks for multi-scale visual object feature extraction but ground on different feature layers separately, neglecting the connections between objects at different scales. To address these problems, we propose a novel one-stage REC method, the Entity Relation Fusion Network (ERFN), which locates the referred object by relation-guided reasoning over different objects. In ERFN, instead of grounding objects at each layer separately, we propose a Language Guided Multi-Scale Fusion (LGMSF) model that uses language to guide the fusion of representations of objects at different scales into one feature map. To model connections between different objects, we design a Relation Guided Feature Fusion (RGFF) model that extracts entities from the language expression to enhance the referred entity feature in the visual object feature map, and further extracts relations to guide object feature fusion based on the self-attention mechanism. Experimental results show that our method is competitive with state-of-the-art one-stage and two-stage REC methods while still running inference in real time.
KW - Entity relation fusion networks (ERFN)
KW - Language guided feature fusion (LGFF)
KW - Language guided multi-scale fusion (LGMSF)
KW - Referring expression comprehension
UR - https://www.scopus.com/pages/publications/85123049961
U2 - 10.1145/3469877.3490592
DO - 10.1145/3469877.3490592
M3 - Conference contribution
AN - SCOPUS:85123049961
T3 - ACM International Conference Proceeding Series
BT - Proceedings of the 3rd ACM International Conference on Multimedia in Asia, MMAsia 2021
PB - Association for Computing Machinery
T2 - 3rd ACM International Conference on Multimedia in Asia, MMAsia 2021
Y2 - 1 December 2021 through 3 December 2021
ER -