HiCur-NPC: Hierarchical Feature Fusion Curriculum Learning for Multi-Modal Foundation Model in Nasopharyngeal Carcinoma

  • Zipei Wang
  • Mengjie Fang
  • Linglong Tang*
  • Jie Tian*
  • Di Dong*

*Corresponding author for this work

Research output: Contribution to journal, Article, peer-review

Abstract

Providing precise and comprehensive diagnostic information to clinicians is crucial for improving the treatment and prognosis of nasopharyngeal carcinoma. Multi-modal foundation models, which can integrate data from various sources, have the potential to significantly enhance clinical assistance. However, several challenges remain: (1) the lack of large-scale visual-language datasets for nasopharyngeal carcinoma; (2) the inability of existing pre-training and fine-tuning methods to capture the hierarchical features required for complex clinical tasks; (3) the limited visual perception of current foundation models, which stems from inadequate integration of multi-modal information. While curriculum learning can improve a model's ability to handle multiple tasks through systematic knowledge accumulation, it does not account for hierarchical features and their dependencies, which limits knowledge gains. To address these issues, we propose the Hierarchical Feature Fusion Curriculum Learning method, which consists of three stages: visual knowledge learning, coarse-grained alignment, and fine-grained fusion. First, we introduce the Hybrid Contrastive Masked Autoencoder to pre-train visual encoders on 755K multi-modal nasopharyngeal carcinoma images (CT, MRI, and endoscopy) to fully extract deep visual information. Then, we construct a 65K visual instruction fine-tuning dataset based on open-source data and clinician diagnostic reports, achieving coarse-grained alignment with visual information in a large language model. Finally, we design a Mixture of Experts Cross Attention structure for deep fine-grained fusion of global multi-modal information. Our model outperforms previously developed specialized models in all key clinical tasks for nasopharyngeal carcinoma, including diagnosis, report generation, tumor segmentation, and prognosis.
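
As an illustration only, the general idea behind a mixture-of-experts cross-attention layer can be sketched in plain Python. This is not the architecture from the paper, whose details the abstract does not give. The function name moe_cross_attention, the dense softmax gate, the per-expert query/key/value projections, and all sizes are assumptions chosen for the sketch: text tokens act as queries, visual tokens act as keys and values, and a gate mixes the outputs of several attention experts.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    # Multiply a (rows x cols) matrix, stored as a list of rows, by a vector.
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def attend(query, keys, values):
    # Scaled dot-product attention for a single query vector.
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def moe_cross_attention(text_tokens, visual_tokens, experts, gate_matrix):
    """Hypothetical mixture-of-experts cross attention (illustrative only).

    experts is a list of (w_q, w_k, w_v) projection matrices, one triple per
    expert; gate_matrix maps a text token to one score per expert.
    """
    fused, all_gates = [], []
    for token in text_tokens:
        # Dense gate: a softmax distribution over experts for this token.
        gates = softmax(matvec(gate_matrix, token))
        mixed = [0.0] * len(token)
        for gate, (w_q, w_k, w_v) in zip(gates, experts):
            query = matvec(w_q, token)
            keys = [matvec(w_k, v) for v in visual_tokens]
            values = [matvec(w_v, v) for v in visual_tokens]
            expert_out = attend(query, keys, values)
            # Weight each expert's output by its gate value and accumulate.
            mixed = [m + gate * o for m, o in zip(mixed, expert_out)]
        fused.append(mixed)
        all_gates.append(gates)
    return fused, all_gates


def random_matrix(rng, rows, cols, scale=0.1):
    # Random Gaussian matrix standing in for learned weights or features.
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim, n_text, n_visual, n_experts = 4, 2, 3, 2
text_tokens = random_matrix(rng, n_text, dim, 1.0)
visual_tokens = random_matrix(rng, n_visual, dim, 1.0)
experts = [
    tuple(random_matrix(rng, dim, dim) for _ in range(3))
    for _ in range(n_experts)
]
gate_matrix = random_matrix(rng, n_experts, dim)

fused, gates = moe_cross_attention(text_tokens, visual_tokens, experts, gate_matrix)
print(len(fused), len(fused[0]))
```

The sketch uses a dense gate, so every expert contributes to every token. Sparse top-k routing is a common alternative, and the abstract does not say which variant the authors use.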

Original language: English
Pages (from-to): 3997-4009
Number of pages: 13
Journal: IEEE Transactions on Medical Imaging
Volume: 44
Issue number: 10
State: Published - 2025

Keywords

  • Curriculum learning
  • feature fusion
  • multi-modal models
  • multi-tasks
  • nasopharyngeal carcinoma
