
RMAdapter: Reconstruction-based Multi-Modal Adapter for Vision-Language Models

  • Xiang Lin
  • Weixin Li*
  • Shu Guo
  • Lihong Wang
  • Di Huang
  • *Corresponding author for this work
  • Beihang University
  • National Computer Network Emergency Response Technical Team/Coordination Center of China

Research output: Contribution to journal › Conference article › peer-review

Abstract

Pre-trained Vision-Language Models (VLMs), e.g. CLIP, have become essential tools in multimodal transfer learning. However, fine-tuning VLMs in few-shot scenarios poses significant challenges in balancing task-specific adaptation against generalization in the resulting model. Meanwhile, current research has predominantly focused on prompt-based adaptation methods, leaving adapter-based approaches underexplored and revealing notable performance gaps. To address these challenges, we introduce a novel Reconstruction-based Multimodal Adapter (RMAdapter), which leverages a dual-branch architecture. Unlike conventional single-branch adapters, RMAdapter consists of: (1) an adaptation branch that injects task-specific knowledge through parameter-efficient fine-tuning, and (2) a reconstruction branch that preserves general knowledge by reconstructing latent space features back into the original feature space. This design facilitates a dynamic balance between general and task-specific knowledge. Importantly, although RMAdapter introduces an additional reconstruction branch, it is carefully optimized to remain lightweight. By computing the reconstruction loss locally at each layer and sharing projection modules, the overall computational overhead is kept minimal. A consistency constraint is also incorporated to better regulate the trade-off between discriminability and generalization. We comprehensively evaluate the effectiveness of RMAdapter on three representative tasks: generalization to new categories, generalization to new target datasets, and domain generalization. Without relying on data augmentation or duplicate prompt designs, our RMAdapter consistently outperforms state-of-the-art approaches across all evaluation metrics.
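
The dual-branch idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify which projection is shared, the activation, the form of the reconstruction loss, or how losses are weighted, so the shared down-projection, ReLU, mean-squared error, and the names `DualBranchAdapter`, `scale`, and `total_loss` are all assumptions made for illustration. The consistency constraint is omitted.

```python
import numpy as np


class DualBranchAdapter:
    """Illustrative dual-branch adapter (hypothetical; not the paper's code)."""

    def __init__(self, dim, bottleneck, scale=0.1, seed=0):
        rng = np.random.default_rng(seed)
        # Assumption: the down-projection is the module shared by both branches.
        self.w_down = rng.normal(0.0, 0.02, size=(dim, bottleneck))
        # Adaptation branch: zero-initialised, so the adapter starts as identity.
        self.w_up = np.zeros((bottleneck, dim))
        # Reconstruction branch: maps the latent back to the input feature space.
        self.w_rec = rng.normal(0.0, 0.02, size=(bottleneck, dim))
        self.scale = scale  # hypothetical residual scaling factor

    def forward(self, x):
        # Shared latent features (ReLU is an assumed choice of activation).
        z = np.maximum(x @ self.w_down, 0.0)
        # Adaptation branch: residual update carrying task-specific knowledge.
        adapted = x + self.scale * (z @ self.w_up)
        # Reconstruction branch: loss computed locally at this layer
        # (mean-squared error is an assumed choice).
        recon = z @ self.w_rec
        recon_loss = float(np.mean((recon - x) ** 2))
        return adapted, recon_loss


def total_loss(task_loss, layer_recon_losses, weight=1.0):
    """Combine the task loss with the per-layer reconstruction losses."""
    return task_loss + weight * sum(layer_recon_losses)
```

In this sketch each adapted layer returns its own reconstruction loss, so no features need to be carried across layers to compute it, and the only parameters added by the second branch are those of `w_rec`.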

Original language: English
Pages (from-to): 23594-23602
Number of pages: 9
Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Volume: 40
Issue number: 28
DOIs
State: Published - 2026
Event: 40th AAAI Conference on Artificial Intelligence, AAAI 2026 - Singapore, Singapore
Duration: 20 Jan 2026 to 27 Jan 2026
