Unveiling the Knowledge of CLIP for Training-Free Open-Vocabulary Semantic Segmentation

  • Yajie Liu
  • Guodong Wang
  • Jinjin Zhang
  • Qingjie Liu
  • Di Huang*

*Corresponding author for this work

Research output: Contribution to journal › Conference article › peer-review

Abstract

Training-free open-vocabulary semantic segmentation aims to explore the potential of frozen vision-language models (VLMs) for segmentation tasks. Recent works reformulate the inference process of CLIP and utilize the features from the final layer to reconstruct dense representations for segmentation, demonstrating promising performance. However, the final layer tends to prioritize global components over local representations, limiting the robustness and effectiveness of existing methods. In this paper, we propose CLIPSeg, a novel training-free framework that fully exploits the diverse knowledge across layers in CLIP for dense predictions. Our study unveils two key discoveries. First, the features in the middle layers exhibit higher locality awareness and feature coherence than those in the final layer; based on this, we propose the coherence-enhanced residual attention module, which generates semantic-aware attention. Second, despite not being directly aligned with the text, the deep layers capture valid local semantics that complement those in the final layer. Leveraging this insight, we introduce the deep semantic integration module to boost the patch semantics in the final block. Experiments conducted on 9 segmentation benchmarks with various CLIP models demonstrate that CLIPSeg consistently outperforms all training-free methods by substantial margins, e.g., a 7.8% improvement in average mIoU for CLIP with a ViT-L backbone, and competes with learning-based counterparts in generalizing to novel concepts in an efficient way.
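The two ideas in the abstract — building attention from the more locality-aware middle layers and fusing deep-layer patch semantics into the final block — can be sketched numerically. The sketch below is a hypothetical simplification under stated assumptions (plain dot-product attention, a single fusion weight `alpha`, and the function names `semantic_aware_attention` and `fuse_deep_semantics` are invented for illustration); it is not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def semantic_aware_attention(mid_feats, final_values):
    """Hypothetical sketch: compute patch-to-patch attention from
    middle-layer features (reported to be more locality-aware) and
    apply it to the final block's value features, instead of relying
    on the final layer's own query/key attention.

    mid_feats:    (N, d) patch features from a middle block
    final_values: (N, d) value projections from the final block
    """
    sim = mid_feats @ mid_feats.T / np.sqrt(mid_feats.shape[1])
    attn = softmax(sim, axis=-1)  # each row sums to 1
    return attn @ final_values

def fuse_deep_semantics(final_feats, deep_feats_list, alpha=0.5):
    """Hypothetical deep-semantic fusion: average patch features from
    several deep blocks and blend them into the final-block output
    with a fixed weight alpha (an assumed hyperparameter)."""
    deep_mean = np.mean(np.stack(deep_feats_list), axis=0)
    return (1 - alpha) * final_feats + alpha * deep_mean

# Toy usage with random stand-ins for CLIP features.
rng = np.random.default_rng(0)
N, d = 16, 8  # 16 patches, 8-dim features (toy sizes)
mid = rng.normal(size=(N, d))
vals = rng.normal(size=(N, d))
out = semantic_aware_attention(mid, vals)
fused = fuse_deep_semantics(out, [rng.normal(size=(N, d)) for _ in range(3)])
```

The resulting `fused` patch features would then be compared against CLIP text embeddings per patch to produce segmentation logits, which is the standard training-free recipe the abstract builds on.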

Original language: English
Pages (from-to): 5649-5657
Number of pages: 9
Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Volume: 39
Issue number: 6
DOIs
State: Published - 11 Apr 2025
Event: 39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025 - Philadelphia, United States
Duration: 25 Feb 2025 - 4 Mar 2025
