TY - GEN
T1 - FEAT
T2 - 28th International Conference on Medical Image Computing and Computer Assisted Intervention, MICCAI 2025
AU - Wang, Huihan
AU - Yang, Zhiwen
AU - Zhang, Hui
AU - Zhao, Dan
AU - Wei, Bingzheng
AU - Xu, Yan
N1 - Publisher Copyright: © The Author(s), under exclusive license to Springer Nature Switzerland AG 2026.
PY - 2026
Y1 - 2026
N2 - Synthesizing high-quality medical videos remains a significant challenge due to the need for modeling both spatial consistency and temporal dynamics. Existing Transformer-based approaches face critical limitations, including insufficient channel interactions, high computational complexity from self-attention, and coarse denoising guidance from timestep embeddings when handling varying noise levels. In this work, we propose FEAT, a full-dimensional efficient attention Transformer, which addresses these issues through three key innovations: (1) a unified paradigm with sequential spatial-temporal-channel attention mechanisms to capture global dependencies across all dimensions, (2) a linear-complexity design for attention mechanisms in each dimension, utilizing weighted key-value attention and global channel attention, and (3) a residual value guidance module that provides fine-grained pixel-level guidance to adapt to different noise levels. We evaluate FEAT on standard benchmarks and downstream tasks, demonstrating that FEAT-S, with only 23% of the parameters of the state-of-the-art model Endora, achieves comparable or even superior performance. Furthermore, FEAT-L surpasses all comparison methods across multiple datasets, showcasing both superior effectiveness and scalability. The code is available online.
KW - Efficient Transformer
KW - Medical Video
KW - Video Generation
UR - https://www.scopus.com/pages/publications/105017970424
U2 - 10.1007/978-3-032-05114-1_26
DO - 10.1007/978-3-032-05114-1_26
M3 - Conference contribution
AN - SCOPUS:105017970424
SN - 9783032051134
T3 - Lecture Notes in Computer Science
SP - 267
EP - 277
BT - Medical Image Computing and Computer Assisted Intervention, MICCAI 2025 - 28th International Conference, 2025, Proceedings
A2 - Gee, James C.
A2 - Hong, Jaesung
A2 - Sudre, Carole H.
A2 - Golland, Polina
A2 - Park, Jinah
A2 - Alexander, Daniel C.
A2 - Iglesias, Juan Eugenio
A2 - Venkataraman, Archana
A2 - Kim, Jong Hyo
PB - Springer Science and Business Media Deutschland GmbH
Y2 - 23 September 2025 through 27 September 2025
ER -