Abstract
Recent work has shown the great potential of visual prompt tuning (VPT) for adapting pre-trained vision transformers to various downstream tasks. However, most existing solutions optimize the prompts at each layer independently, thereby neglecting the task-relevant information encoded in prompt tokens across layers. Additionally, existing prompt structures are prone to interference from task-irrelevant noise in input images, which can adversely affect the sharing of task-relevant information. In this paper, we propose a novel VPT approach, SVPT. It incorporates a cross-layer dynamic connection (CDC) for input prompt tokens from adjacent layers, enabling effective sharing of task-relevant information. Furthermore, we design a dynamic aggregation (DA) module that facilitates selective sharing of information between layers. The combination of CDC and DA enhances the flexibility of the attention process within the VPT framework. Building upon these foundations, SVPT introduces an attentive enhancement (AE) mechanism that automatically identifies salient image tokens and refines them with prompt tokens in an additive manner. Extensive experiments on 24 image classification and semantic segmentation benchmarks demonstrate the advantages of the proposed SVPT over state-of-the-art counterparts.
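The abstract names three ideas without giving formulas: prompts connected across adjacent layers, selective (gated) sharing, and additive refinement of salient image tokens. The following is only a rough illustration of those ideas, not the SVPT implementation. Every name here (`gated_prompt_mix`, `enhance_salient_tokens`, `gate_logits`, `saliency`, `prompt_summary`, `top_k`) is hypothetical, and the scalar sigmoid gate and top-k selection are assumptions made for the sketch.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_prompt_mix(prev_prompts, layer_prompts, gate_logits):
    """Blend the previous layer's prompt tokens into this layer's prompts.

    prev_prompts, layer_prompts: lists of token vectors (lists of floats).
    gate_logits: one scalar per prompt token (assumed learnable; hypothetical).
    """
    mixed = []
    for prev, cur, logit in zip(prev_prompts, layer_prompts, gate_logits):
        g = sigmoid(logit)  # g near 1 keeps this layer's prompt, g near 0 reuses the previous one
        mixed.append([g * c + (1.0 - g) * p for p, c in zip(prev, cur)])
    return mixed


def enhance_salient_tokens(image_tokens, saliency, prompt_summary, top_k):
    """Add a prompt summary vector to the top_k most salient image tokens."""
    ranked = sorted(range(len(image_tokens)), key=lambda i: saliency[i], reverse=True)
    salient = set(ranked[:top_k])
    return [
        [x + s for x, s in zip(tok, prompt_summary)] if i in salient else list(tok)
        for i, tok in enumerate(image_tokens)
    ]


# A gate logit of 0 gives an even blend of the two layers' prompts.
mixed = gated_prompt_mix([[1.0, 1.0]], [[3.0, 5.0]], [0.0])

# Only the most salient token (index 1) receives the additive refinement.
enhanced = enhance_salient_tokens(
    [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]],
    saliency=[0.1, 0.9, 0.5],
    prompt_summary=[10.0, 20.0],
    top_k=1,
)
```

In a real model the gate and the saliency scores would presumably be computed from the tokens themselves (for example, from attention weights); here they are passed in directly to keep the sketch self-contained.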
| Original language | English |
|---|---|
| Pages (from-to) | 4527-4540 |
| Number of pages | 14 |
| Journal | IEEE Transactions on Image Processing |
| Volume | 34 |
| State | Published - 2025 |
Keywords
- Transfer learning
- parameter efficient fine-tuning
- visual prompt tuning
Fingerprint
Dive into the research topics of 'Sharing Task-Relevant Information in Visual Prompt Tuning by Cross-Layer Dynamic Connection'. Together they form a unique fingerprint.