TY - JOUR
T1 - Multi-Grained Contrastive Learning for Text-Supervised Open-Vocabulary Semantic Segmentation
AU - Liu, Yajie
AU - Ge, Pu
AU - Wang, Guodong
AU - Liu, Qingjie
AU - Huang, Di
N1 - Publisher Copyright: © 2025 Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2025/2/18
Y1 - 2025/2/18
AB - Learning open-vocabulary semantic segmentation (OVSS) from text supervision has recently received increasing attention for its promising potential in real-world applications. However, with only image-level supervision, such learning struggles to achieve dense and robust cross-modal alignment, which limits pixel-level prediction. In this article, we present a novel approach to this task, Multi-Grained Cross-modal Contrastive Learning (MGCCL). Specifically, unlike current solutions restricted by coarse image/object-text alignment, MGCCL constructs pseudo multi-granular semantic correspondences at the object, part, and pixel levels and combines them with hard sampling strategies to conduct cross-modal contrastive learning, significantly facilitating fine-grained alignment. Further, we develop an adaptive semantic unit that flexibly harnesses the learned multi-grained cross-modal alignment capabilities to mitigate the under- and over-segmentation issues arising from per-group and per-pixel units. Extensive experiments on eight segmentation benchmarks show that our approach delivers significant improvements over state-of-the-art counterparts.
KW - Fine-grained Cross-modal Alignment
KW - Open-vocabulary Semantic Segmentation
KW - Text Supervision
UR - https://www.scopus.com/pages/publications/105003321589
U2 - 10.1145/3711868
DO - 10.1145/3711868
M3 - Article
AN - SCOPUS:105003321589
SN - 1551-6857
VL - 21
JO - ACM Transactions on Multimedia Computing, Communications and Applications
JF - ACM Transactions on Multimedia Computing, Communications and Applications
IS - 3
M1 - 81
ER -