TY - JOUR
T1 - Sharing Task-Relevant Information in Visual Prompt Tuning by Cross-Layer Dynamic Connection
AU - Zhou, Nan
AU - Chen, Jiaxin
AU - Huang, Di
N1 - Publisher Copyright: © 1992-2012 IEEE.
PY - 2025
Y1 - 2025
N2 - Recent progress has shown great potential of visual prompt tuning (VPT) when adapting pre-trained vision transformers to various downstream tasks. However, most existing solutions independently optimize prompts at each layer, thereby neglecting the usage of task-relevant information encoded in prompt tokens across layers. Additionally, existing prompt structures are prone to interference from task-irrelevant noise in input images, which can adversely affect the sharing of task-relevant information. In this paper, we propose a novel VPT approach, SVPT. It innovatively incorporates a cross-layer dynamic connection (CDC) for input prompt tokens from adjacent layers, enabling effective sharing of task-relevant information. Furthermore, we design a dynamic aggregation (DA) module that facilitates selective sharing of information between layers. The combination of CDC and DA enhances the flexibility of the attention process within the VPT framework. Building upon these foundations, SVPT introduces an attentive enhancement (AE) mechanism that automatically identifies salient image tokens and refines them with prompt tokens in an additive manner. Extensive experiments on 24 image classification and semantic segmentation benchmarks clearly demonstrate the advantages of the proposed SVPT, compared to the state-of-the-art counterparts.
KW - Transfer learning
KW - parameter efficient fine-tuning
KW - visual prompt tuning
UR - https://www.scopus.com/pages/publications/105010860560
U2 - 10.1109/TIP.2025.3587587
DO - 10.1109/TIP.2025.3587587
M3 - Article
C2 - 40658564
AN - SCOPUS:105010860560
SN - 1057-7149
VL - 34
SP - 4527
EP - 4540
JO - IEEE Transactions on Image Processing
JF - IEEE Transactions on Image Processing
ER -