FEAT: Full-Dimensional Efficient Attention Transformer for Medical Video Generation

  • Huihan Wang
  • Zhiwen Yang
  • Hui Zhang
  • Dan Zhao
  • Bingzheng Wei
  • Yan Xu*

  *Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Synthesizing high-quality medical videos remains a significant challenge due to the need for modeling both spatial consistency and temporal dynamics. Existing Transformer-based approaches face critical limitations, including insufficient channel interactions, high computational complexity from self-attention, and coarse denoising guidance from timestep embeddings when handling varying noise levels. In this work, we propose FEAT, a full-dimensional efficient attention Transformer, which addresses these issues through three key innovations: (1) a unified paradigm with sequential spatial-temporal-channel attention mechanisms to capture global dependencies across all dimensions, (2) a linear-complexity design for attention mechanisms in each dimension, utilizing weighted key-value attention and global channel attention, and (3) a residual value guidance module that provides fine-grained pixel-level guidance to adapt to different noise levels. We evaluate FEAT on standard benchmarks and downstream tasks, demonstrating that FEAT-S, with only 23% of the parameters of the state-of-the-art model Endora, achieves comparable or even superior performance. Furthermore, FEAT-L surpasses all comparison methods across multiple datasets, showcasing both superior effectiveness and scalability. Code is available here.
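To make the linear-complexity claim in point (2) concrete, here is a minimal sketch of a generic channel-wise attention, in which the attention map is C × C (channels) rather than N × N (tokens), so the cost grows linearly with the number of tokens. This is not the authors' implementation: the function names, the 1/sqrt(N) scaling, and the pure-Python list layout are all assumptions made for illustration, and the abstract does not specify how FEAT's global channel attention or weighted key-value attention are actually computed.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def channel_attention(q, k, v):
    """Generic channel-wise attention (illustrative, not FEAT's exact module).

    q, k, v: lists of N tokens, each a list of C channel values.
    The attention map is C x C, so the cost is O(N * C^2): linear in N.
    """
    n, c = len(q), len(q[0])
    scale = 1.0 / math.sqrt(n)  # assumed scaling, chosen for illustration
    # scores[i][j] = similarity between channel i of q and channel j of k,
    # accumulated over all tokens.
    scores = [
        [scale * sum(q[t][i] * k[t][j] for t in range(n)) for j in range(c)]
        for i in range(c)
    ]
    attn = [softmax(row) for row in scores]
    # Each output channel is a weighted mix of the value channels.
    return [
        [sum(attn[i][j] * v[t][j] for j in range(c)) for i in range(c)]
        for t in range(n)
    ]


def attention_map_entries(num_tokens, num_channels):
    """Size of the attention map for token-wise vs channel-wise attention."""
    return {
        "token_attention": num_tokens * num_tokens,
        "channel_attention": num_channels * num_channels,
    }
```

For example, with 1024 tokens and 64 channels, `attention_map_entries(1024, 64)` reports 1,048,576 entries for token-wise attention against 4,096 for channel-wise attention, which is the kind of saving a linear-complexity design targets.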

Original language: English
Title of host publication: Medical Image Computing and Computer Assisted Intervention, MICCAI 2025 - 28th International Conference, 2025, Proceedings
Editors: James C. Gee, Jaesung Hong, Carole H. Sudre, Polina Golland, Jinah Park, Daniel C. Alexander, Juan Eugenio Iglesias, Archana Venkataraman, Jong Hyo Kim
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 267-277
Number of pages: 11
ISBN (Print): 9783032051134
State: Published - 2026
Event: 28th International Conference on Medical Image Computing and Computer Assisted Intervention, MICCAI 2025 - Daejeon, Korea, Republic of
Duration: 23 Sep 2025 → 27 Sep 2025

Publication series

Name: Lecture Notes in Computer Science
Volume: 15968 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 28th International Conference on Medical Image Computing and Computer Assisted Intervention, MICCAI 2025
Country/Territory: Korea, Republic of
City: Daejeon
Period: 23/09/25 → 27/09/25

Keywords

  • Efficient Transformer
  • Medical Video
  • Video Generation
