Multi-modal Segmentation via Medical Image-Text Fusion with Hierarchical Cross-Attention

  • Beihang University
  • Capital Medical University
  • VinUniversity

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

Accurate tumor delineation in radiotherapy requires synergistic analysis of multi-modal data. However, current automated methods are predominantly limited to single imaging modalities. We introduce a multi-modal segmentation framework that integrates 3D CT and MRI volumes with clinical text descriptions. Our architecture processes CT and MRI data through shared encoders with modality-specific normalization. A hierarchical cross-attention decoder enables multi-scale fusion of radiometric features and semantic text embeddings. Additionally, a text-guided boundary refinement module uses tumor location and quantity descriptors to accurately segment tumor regions. Evaluated on two public datasets, LiTS (CT+Text) and ATLAS (MRI+Text), our method achieved superior performance in tumor segmentation, with up to 16% improvement in mean Dice scores over existing state-of-the-art methods. Ablation studies confirmed the complementary benefits of image-text integration. The results demonstrate that our multi-modal learning approach enhances segmentation accuracy, particularly for small tumor regions.
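
For readers unfamiliar with the evaluation metric named in the abstract, the Dice score between a predicted mask P and a reference mask T is 2|P ∩ T| / (|P| + |T|). The following is a minimal sketch in plain Python, not the authors' evaluation code: the function names `dice_score` and `mean_dice`, the flattened-list mask representation, and the convention that two empty masks score 1.0 are all assumptions made for illustration.

```python
from typing import Sequence, Tuple


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice coefficient between two flattened binary masks (1 = tumor, 0 = background)."""
    if len(pred) != len(target):
        raise ValueError("masks must contain the same number of voxels")
    # Voxels labeled as tumor in both masks.
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


def mean_dice(pairs: Sequence[Tuple[Sequence[int], Sequence[int]]]) -> float:
    """Average the per-case Dice scores over (prediction, reference) pairs."""
    scores = [dice_score(pred, target) for pred, target in pairs]
    return sum(scores) / len(scores)


# Hypothetical toy masks: missing one voxel of a 2-voxel tumor
# costs far more Dice than missing one voxel of a 100-voxel tumor.
small = dice_score([1, 0], [1, 1])
large = dice_score([1] * 99 + [0], [1] * 100)
print(round(small, 3), round(large, 3))
```

The toy comparison at the end suggests why small tumor regions are a demanding case for this metric: a single missed voxel changes the score much more when the reference region is tiny.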

Original language: English
Title of host publication: Neural Information Processing - 32nd International Conference, ICONIP 2025, Proceedings
Editors: Tadahiro Taniguchi, Tadashi Kozuno, Chi Sing Andrew Leung, Junichiro Yoshimoto, Mufti Mahmud, Maryam Doborjeh, Kenji Doya
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 58-69
Number of pages: 12
ISBN (Print): 9789819540990
DOIs
State: Published - 2026
Event: 32nd International Conference on Neural Information Processing, ICONIP 2025 - Okinawa, Japan
Duration: 20 Nov 2025 to 24 Nov 2025

Publication series

Name: Communications in Computer and Information Science
Volume: 2757
ISSN (Print): 1865-0929
ISSN (Electronic): 1865-0937

Conference

Conference: 32nd International Conference on Neural Information Processing, ICONIP 2025
Country/Territory: Japan
City: Okinawa
Period: 20/11/25 to 24/11/25

Keywords

  • Hierarchical cross-attention
  • Medical image-text fusion
  • Multi-modal learning
