
CLIPFusion: Infrared and visible image fusion network based on image–text large model and adaptive learning

  • Dongdong Sun
  • Chuanyun Wang*
  • Tian Wang
  • Qian Gao
  • Qiong Liu
  • Linlin Wang

*Corresponding author for this work

  • Shenyang Aerospace University
  • Beijing Information Science & Technology University

Research output: Contribution to journal, Article, peer-reviewed

Abstract

Infrared and visible image fusion aims to integrate complementary multimodal images into fused images that are highly informative and visually effective, with a wide range of applications in automated driving, fault diagnosis and night vision. Because the task usually has no ground-truth labels to serve as a reference, the design of the loss function is strongly influenced by human subjectivity, which limits model performance. To address the lack of ground-truth labels, this paper designs a prompt generation network based on an image–text large model. The network learns text prompts for the different image types by constraining the distances, in the latent space of the image–text large model, between the unimodal and fused image prompts and their corresponding images. The learned prompt texts then serve as labels for fused image generation: the distance between the fused image and the different prompt texts is constrained in the same latent space. To further improve fused image quality, the fused images generated at different iterations are used to adaptively fine-tune the prompt generation network, which continuously improves the quality of the generated prompt text labels and indirectly improves the visual effect of the fused images. In addition, to minimise the influence of subjective information in the fused image generation process, a 3D convolution-based fused image generation network is proposed that integrates infrared and visible features through adaptive learning in additional dimensions. Extensive experiments show that the proposed model achieves good visual effects and quantitative metrics on infrared–visible image fusion tasks in military, automated driving and low-light scenarios, and generalises well to multi-focus image fusion and medical image fusion tasks.
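
The idea of using learned prompt texts as labels can be illustrated with a toy calculation. The sketch below is not the paper's loss: the abstract does not state which distance or objective is used, so it assumes a CLIP-style formulation (cosine similarity, a softmax over prompts, and a cross-entropy term toward the "fused" prompt). The function names, the temperature value of 0.07 and the three-dimensional toy embeddings are all hypothetical stand-ins for real image and text embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def prompt_guided_loss(image_emb, prompt_embs, target, temperature=0.07):
    """Cross-entropy of a softmax over scaled cosine similarities.

    Assumed formulation (not taken from the paper): the loss is small when
    the image embedding lies closer to the target prompt embedding than to
    the other prompt embeddings.
    """
    logits = [cosine_similarity(image_emb, p) / temperature for p in prompt_embs]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum_exp - logits[target]


# Hypothetical prompt embeddings, in the order: infrared, visible, fused.
prompts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
FUSED = 2

# Two hypothetical image embeddings: one near the fused prompt, one near
# the infrared prompt.
near_fused = [0.1, 0.1, 0.9]
near_infrared = [0.9, 0.1, 0.1]

loss_near = prompt_guided_loss(near_fused, prompts, FUSED)
loss_far = prompt_guided_loss(near_infrared, prompts, FUSED)
print(loss_near, loss_far)
```

Under these assumptions, an image embedding that sits close to the fused prompt receives a lower loss than one that resembles only the infrared prompt, which is the sense in which a prompt text can act as a label when no ground-truth fused image exists.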

Original language: English
Article number: 103042
Journal: Displays
Volume: 89
State: Published, Sep 2025

Keywords

  • Adaptive learning
  • Image fusion
  • Large model
  • Prompts
