Multi-Grained Contrastive Learning for Text-Supervised Open-Vocabulary Semantic Segmentation

Research output: Contribution to journal › Article › peer-review

Abstract

Learning open-vocabulary semantic segmentation (OVSS) from text supervision has recently received increasing attention for its promising potential in real-world applications. However, with only image-level supervision, it is difficult to achieve dense and robust cross-modal alignment, which limits pixel-level predictions. In this article, we present a novel approach to this task based on Multi-Grained Cross-modal Contrastive Learning, named MGCCL. Specifically, unlike current solutions restricted to coarse image/object-text alignment, MGCCL constructs pseudo multi-granular semantic correspondences at the object, part, and pixel levels and combines them with hard sampling strategies for cross-modal contrastive learning, significantly facilitating fine-grained alignment. Furthermore, we develop an adaptive semantic unit that flexibly harnesses the learned multi-grained cross-modal alignment to mitigate the under- and over-segmentation issues arising from per-group and per-pixel units. Extensive experiments on a broad suite of eight segmentation benchmarks show that our approach delivers significant improvements over state-of-the-art counterparts, demonstrating its effectiveness.
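The abstract's multi-grained objective — contrastive alignment at several granularities, sharpened by hard negative sampling — can be sketched as below. This is a simplified, illustrative reconstruction, not the paper's implementation: the temperature, hard-negative count `k`, granularity weights, and function names are all assumptions, and the similarity matrices stand in for actual image-region/text embedding products.

```python
import numpy as np

def hard_negative_info_nce(sim, k=4, tau=0.07):
    """InfoNCE over a (N, N) similarity matrix whose diagonal holds the
    positive pairs, keeping only the k hardest (most similar) negatives
    per anchor. Illustrative sketch; k and tau are assumed values."""
    n = sim.shape[0]
    total = 0.0
    for i in range(n):
        pos = sim[i, i]
        negs = np.delete(sim[i], i)          # all cross-modal negatives
        hard = np.sort(negs)[-k:]            # keep the k hardest ones
        logits = np.concatenate(([pos], hard)) / tau
        logits -= logits.max()               # numerical stability
        # -log softmax probability of the positive pair
        total += -logits[0] + np.log(np.exp(logits).sum())
    return total / n

def multi_grained_loss(obj_sim, part_sim, pix_sim, weights=(1.0, 1.0, 1.0)):
    """Sum the contrastive loss over object-, part-, and pixel-level
    pseudo correspondences. Equal weights are an assumption."""
    sims = (obj_sim, part_sim, pix_sim)
    return sum(w * hard_negative_info_nce(s) for w, s in zip(weights, sims))
```

Under this sketch, a perfectly aligned similarity matrix (large diagonal, small off-diagonal) drives the loss toward zero, while confusable pairs at any granularity keep it high — which is the intuition behind pairing multi-granular correspondences with hard sampling.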

Original language: English
Article number: 81
Journal: ACM Transactions on Multimedia Computing, Communications and Applications
Volume: 21
Issue number: 3
DOIs
State: Published - 18 Feb 2025

Keywords

  • Fine-grained Cross-modal Alignment
  • Open-vocabulary Semantic Segmentation
  • Text Supervision
