Abstract
Learning open-vocabulary semantic segmentation (OVSS) from text supervision has recently received increasing attention for its promising potential in real-world applications. However, with only image-level supervision, it is difficult to achieve dense and robust cross-modal alignment, which limits pixel-level predictions. In this article, we present a novel approach to this task, Multi-Grained Cross-modal Contrastive Learning (MGCCL). Specifically, unlike current solutions restricted to coarse image/object-text alignment, MGCCL constructs pseudo multi-granular semantic correspondences at the object, part, and pixel levels and combines them with hard sampling strategies for cross-modal contrastive learning, significantly facilitating fine-grained alignment. Furthermore, we develop an adaptive semantic unit that flexibly harnesses the learned multi-grained cross-modal alignment to mitigate the under- and over-segmentation issues arising from per-group and per-pixel units. Extensive experiments on a broad suite of eight segmentation benchmarks show that our approach delivers significant improvements over state-of-the-art counterparts, demonstrating its effectiveness.
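To make the core idea concrete, below is a minimal PyTorch sketch of contrastive learning applied across several granularities of pseudo-paired visual and text embeddings. The symmetric InfoNCE form, the function names, and the per-granularity weighting are assumptions for illustration, not MGCCL's published implementation (which additionally uses hard sampling and an adaptive semantic unit).

```python
# Hypothetical sketch of multi-grained cross-modal contrastive learning.
# Granularity names and loss form are assumptions, not the paper's exact method.
import torch
import torch.nn.functional as F

def info_nce(visual, text, temperature=0.07):
    """Symmetric InfoNCE between L2-normalized visual and text embeddings.

    visual, text: (N, D) tensors whose i-th rows form a pseudo pair.
    """
    visual = F.normalize(visual, dim=-1)
    text = F.normalize(text, dim=-1)
    logits = visual @ text.t() / temperature            # (N, N) similarities
    targets = torch.arange(visual.size(0), device=visual.device)
    # Contrast in both directions: visual -> text and text -> visual.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def multi_grained_loss(grained_pairs, weights=None):
    """Sum contrastive losses over object-, part-, and pixel-level pairs.

    grained_pairs: dict mapping a granularity name to a (visual, text)
    embedding pair, e.g. {"object": (vo, to), "part": (vp, tp),
    "pixel": (vx, tx)}; each pair is contrasted independently.
    """
    weights = weights or {k: 1.0 for k in grained_pairs}
    return sum(weights[k] * info_nce(v, t)
               for k, (v, t) in grained_pairs.items())
```

In this reading, coarse image/object-text alignment corresponds to applying only the "object" term, while the finer "part" and "pixel" terms supply the dense supervision the abstract argues is missing under image-level labels alone.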
| Original language | English |
|---|---|
| Article number | 81 |
| Journal | ACM Transactions on Multimedia Computing, Communications and Applications |
| Volume | 21 |
| Issue number | 3 |
| DOIs | |
| State | Published - 18 Feb 2025 |
Keywords
- Fine-grained Cross-modal Alignment
- Open-vocabulary Semantic Segmentation
- Text Supervision