TY - GEN
T1 - Multi-modal Segmentation via Medical Image-Text Fusion with Hierarchical Cross-Attention
AU - Sun, Xuezheng
AU - Wan, Tao
AU - Xu, Jiankun
AU - Qin, Zengchang
N1 - Publisher Copyright: © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2026.
PY - 2026
Y1 - 2026
AB - Accurate tumor delineation in radiotherapy requires synergistic analysis of multi-modal data. However, current automated methods are predominantly limited to single imaging modalities. We introduce a multi-modal segmentation framework that integrates 3D CT and MRI volumes with clinical text descriptions. Our architecture processes CT and MRI data through shared encoders with modality-specific normalization. A hierarchical cross-attention decoder enables multi-scale fusion of radiometric features and semantic text embeddings. Additionally, a text-guided boundary refinement module uses tumor location and quantity descriptors to accurately segment tumor regions. Evaluated on two public datasets, LiTS (CT+Text) and ATLAS (MRI+Text), our method achieved superior performance in tumor segmentation, with up to 16% improvement in mean Dice scores over existing state-of-the-art methods. Ablation studies confirmed the complementary benefits of image-text integration. The results demonstrate that our multi-modal learning approach enhances segmentation accuracy, particularly for small tumor regions.
KW - Hierarchical cross-attention
KW - Medical image-text fusion
KW - Multi-modal learning
UR - https://www.scopus.com/pages/publications/105023309923
U2 - 10.1007/978-981-95-4100-3_5
DO - 10.1007/978-981-95-4100-3_5
M3 - Conference contribution
AN - SCOPUS:105023309923
SN - 9789819540990
T3 - Communications in Computer and Information Science
SP - 58
EP - 69
BT - Neural Information Processing - 32nd International Conference, ICONIP 2025, Proceedings
A2 - Taniguchi, Tadahiro
A2 - Kozuno, Tadashi
A2 - Leung, Chi Sing Andrew
A2 - Yoshimoto, Junichiro
A2 - Mahmud, Mufti
A2 - Doborjeh, Maryam
A2 - Doya, Kenji
PB - Springer Science and Business Media Deutschland GmbH
T2 - 32nd International Conference on Neural Information Processing, ICONIP 2025
Y2 - 20 November 2025 through 24 November 2025
ER -