Multi-Grained Contrastive Learning for Text-Supervised Open-Vocabulary Semantic Segmentation

  • Beihang University

Research output: Contribution to journal › Article › peer-review

Abstract

Learning open-vocabulary semantic segmentation (OVSS) from text supervision has recently received increasing attention for its promising potential in real-world applications. However, only with image-level supervision, it struggles to achieve dense and robust cross-modal alignment and thus limits pixel-level predictions. In this article, we present a novel approach to this task with Multi-Grained Cross-modal Contrastive Learning, named MGCCL. Specifically, unlike current solutions restricted by coarse image/object-text alignment, MGCCL constructs pseudo multi-granular semantic correspondences at the object-, part-, and pixel-level and collaborates with hard sampling strategies to conduct cross-modal contrastive learning, significantly facilitating fine-grained alignment. Further, we develop an adaptive semantic unit which flexibly harnesses the learned multi-grained cross-modal alignment capabilities to effectively mitigate the under- and over-segmentation issues arising from the per-group and per-pixel units. Extensive experiments over a broad suite of eight segmentation benchmarks show that our approach delivers significant advancements over state-of-the-art counterparts, demonstrating its effectiveness.

Original language: English
Article number: 81
Journal: ACM Transactions on Multimedia Computing, Communications and Applications
Volume: 21
Issue number: 3
DOI
Publication status: Published - 18 Feb 2025
