TY - GEN
T1 - Enriching Phrases with Coupled Pixel and Object Contexts for Panoptic Narrative Grounding
AU - Hui, Tianrui
AU - Ding, Zihan
AU - Huang, Junshi
AU - Wei, Xiaoming
AU - Wei, Xiaolin
AU - Dai, Jiao
AU - Han, Jizhong
AU - Liu, Si
N1 - Publisher Copyright:
© 2023 International Joint Conferences on Artificial Intelligence. All rights reserved.
PY - 2023
Y1 - 2023
N2 - Panoptic narrative grounding (PNG) aims to segment things and stuff objects in an image described by noun phrases of a narrative caption. As a multimodal task, an essential aspect of PNG is the visual-linguistic interaction between image and caption. The previous two-stage method aggregates visual contexts from offline-generated mask proposals to phrase features, which tend to be noisy and fragmentary. The recent one-stage method aggregates only pixel contexts from image features to phrase features, which may incur semantic misalignment due to lacking object priors. To realize more comprehensive visual-linguistic interaction, we propose to enrich phrases with coupled pixel and object contexts by designing a Phrase-Pixel-Object Transformer Decoder (PPO-TD), where both fine-grained part details and coarse-grained entity clues are aggregated to phrase features. In addition, we also propose a Phrase-Object Contrastive Loss (POCL) to pull closer the matched phrase-object pairs and push away unmatched ones for aggregating more precise object contexts from more phrase-relevant object tokens. Extensive experiments on the PNG benchmark show our method achieves new state-of-the-art performance with large margins.
AB - Panoptic narrative grounding (PNG) aims to segment things and stuff objects in an image described by noun phrases of a narrative caption. As a multimodal task, an essential aspect of PNG is the visual-linguistic interaction between image and caption. The previous two-stage method aggregates visual contexts from offline-generated mask proposals to phrase features, which tend to be noisy and fragmentary. The recent one-stage method aggregates only pixel contexts from image features to phrase features, which may incur semantic misalignment due to lacking object priors. To realize more comprehensive visual-linguistic interaction, we propose to enrich phrases with coupled pixel and object contexts by designing a Phrase-Pixel-Object Transformer Decoder (PPO-TD), where both fine-grained part details and coarse-grained entity clues are aggregated to phrase features. In addition, we also propose a Phrase-Object Contrastive Loss (POCL) to pull closer the matched phrase-object pairs and push away unmatched ones for aggregating more precise object contexts from more phrase-relevant object tokens. Extensive experiments on the PNG benchmark show our method achieves new state-of-the-art performance with large margins.
UR - https://www.scopus.com/pages/publications/85170355503
U2 - 10.24963/ijcai.2023/99
DO - 10.24963/ijcai.2023/99
M3 - 会议稿件
AN - SCOPUS:85170355503
T3 - IJCAI International Joint Conference on Artificial Intelligence
SP - 893
EP - 901
BT - Proceedings of the 32nd International Joint Conference on Artificial Intelligence, IJCAI 2023
A2 - Elkind, Edith
PB - International Joint Conferences on Artificial Intelligence
T2 - 32nd International Joint Conference on Artificial Intelligence, IJCAI 2023
Y2 - 19 August 2023 through 25 August 2023
ER -