
Attend and Replay: Efficient Action Understanding in Long Videos via Mechanistic Interpretability

  • Puyue Hou
  • Jinjin Zhang
  • Di Huang*
  • *Corresponding author for this work
  • Beihang University

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

Abstract

Efficient action understanding in long videos remains a significant challenge for multimodal large language models (MLLMs), primarily because target actions are difficult to localize within long frame sequences: unrelated actions in the video cause overwhelming interference. In this work, we focus on efficient action localization to enhance video understanding from an internal interpretability perspective, leveraging the relationship between text and video tokens to remove irrelevant tokens. By tracing the attention distribution across videos of varying frame lengths, we observe that unsuccessful action understanding directly correlates with unrelated actions receiving notable attention scores. Motivated by these findings, we propose an Attend and Replay method that efficiently locates critical action information and strengthens its semantic representation. The approach first reduces unrelated action tokens using an attention-guided spatiotemporal pruning strategy, then enriches target action tokens via a pivot-token aggregation method. Extensive experiments show that integrating our method with existing MLLMs (e.g., LLaVA-Video, Qwen2.5-VL, MiMo-VL) outperforms comparable approaches on various datasets while also delivering fast inference.
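The two stages named in the abstract (attention-guided spatiotemporal pruning, then pivot-token aggregation) can be illustrated with a minimal sketch. The abstract does not specify the scoring rule, the pruning criterion, or the aggregation formula, so everything below is an assumption made for illustration: the function name `prune_and_aggregate`, the top-k selection by mean attention, the choice of the most-attended token as the pivot, and the attention-weighted mean used for aggregation are all hypothetical and are not taken from the paper.

```python
# Illustrative sketch only. The pruning rule, pivot choice, and aggregation
# formula are assumptions; the abstract does not describe them.

def prune_and_aggregate(tokens, attn, keep_frames, keep_tokens):
    """tokens[f][t] is a feature vector (list of floats) for token t of frame f.
    attn[f][t] is an assumed, precomputed text-to-video attention score."""
    # Temporal pruning (assumed rule): keep the frames with the highest
    # mean attention, preserving their original temporal order.
    frame_score = [sum(a) / len(a) for a in attn]
    ranked = sorted(range(len(attn)), key=lambda f: frame_score[f], reverse=True)
    kept_frames = sorted(ranked[:keep_frames])

    output = []
    for f in kept_frames:
        # Spatial pruning (assumed rule): keep the most-attended tokens.
        order = sorted(range(len(attn[f])), key=lambda t: attn[f][t], reverse=True)
        kept = sorted(order[:keep_tokens])
        pivot = order[0]  # hypothetical pivot: the single most-attended token

        # Aggregation (assumed form): replace the pivot feature with the
        # attention-weighted mean of the kept tokens in the same frame.
        total = sum(attn[f][t] for t in kept)
        if total > 0:
            dim = len(tokens[f][pivot])
            merged = [
                sum(attn[f][t] * tokens[f][t][d] for t in kept) / total
                for d in range(dim)
            ]
        else:
            merged = list(tokens[f][pivot])  # no attention mass: leave pivot as-is

        frame_out = [merged if t == pivot else tokens[f][t] for t in kept]
        output.append((f, kept, frame_out))
    return output
```

Each returned tuple holds the kept frame index, the kept token indices within that frame, and the corresponding feature vectors with the pivot replaced by its aggregated version. In a real MLLM the attention scores would come from the model's own text-to-video attention maps, and the token budget would be tuned per model.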

Original language: English
Title of host publication: Proceedings - 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCV-W 2025
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 6661-6670
Number of pages: 10
ISBN (Electronic): 9798331589882
DOI
Publication status: Published - 2025
Event: 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCV-W 2025 - Honolulu, United States
Duration: 19 Oct 2025 → 20 Oct 2025

Publication series

Name: Proceedings - 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCV-W 2025

Conference

Conference: 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCV-W 2025
Country/Territory: United States
City: Honolulu
Period: 19/10/25 → 20/10/25
