Attend and Replay: Efficient Action Understanding in Long Videos via Mechanistic Interpretability

  • Puyue Hou
  • Jinjin Zhang
  • Di Huang*
  • *Corresponding author for this work
  • Beihang University

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

Abstract

Efficient action understanding in long videos remains a significant challenge for multimodal large language models (MLLMs), primarily because target actions are difficult to localize within long frame sequences, where unrelated actions cause overwhelming interference. In this work, we focus on efficient action localization to enhance video understanding from an internal interpretability perspective, leveraging the relationship between text and video tokens to remove irrelevant tokens. By tracing the attention distribution across videos of varying frame lengths, we observe that unsuccessful action understanding directly correlates with unrelated actions receiving notable attention scores. Motivated by these findings, we propose an Attend and Replay method that efficiently locates critical action information and strengthens its semantic representation. The approach first reduces unrelated action tokens using an attention-guided spatiotemporal pruning strategy, then enriches target action tokens via a pivot-token aggregation method. Extensive experiments show that integrating our method with existing MLLMs (e.g., LLaVA-Video, Qwen2.5-VL, MiMo-VL) achieves superior performance over competing methods on various datasets while also delivering fast inference.
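
The abstract names two stages, attention-guided pruning followed by pivot-token aggregation, without giving implementation details. The sketch below is only one possible reading of that outline, not the authors' method: it assumes each video token already has a scalar text-to-video attention score, keeps the top-scoring fraction, and appends a single attention-weighted mean of the kept tokens as a "pivot". The function names, the `keep_ratio` parameter, and the choice of a weighted mean are all hypothetical.

```python
import math
from typing import List, Sequence


def prune_by_attention(scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Return indices (in original order) of the highest-scoring video tokens."""
    if not scores:
        raise ValueError("scores must be non-empty")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    k = max(1, math.ceil(len(scores) * keep_ratio))
    # Rank token indices by attention score, highest first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore temporal order among the survivors.
    return sorted(ranked[:k])


def aggregate_pivot(
    tokens: Sequence[Sequence[float]],
    scores: Sequence[float],
    kept: Sequence[int],
) -> List[float]:
    """Attention-weighted mean of the kept tokens (a hypothetical pivot token)."""
    total = sum(scores[i] for i in kept)
    if total <= 0.0:
        # Fall back to a uniform average when no token carries attention mass.
        weights = [1.0 / len(kept)] * len(kept)
    else:
        weights = [scores[i] / total for i in kept]
    dim = len(tokens[kept[0]])
    return [sum(w * tokens[i][d] for w, i in zip(weights, kept)) for d in range(dim)]


def prune_then_aggregate(
    tokens: Sequence[Sequence[float]],
    scores: Sequence[float],
    keep_ratio: float = 0.5,
) -> List[List[float]]:
    """Assumed pipeline: drop low-attention tokens, then append one pivot token."""
    kept = prune_by_attention(scores, keep_ratio)
    pivot = aggregate_pivot(tokens, scores, kept)
    return [list(tokens[i]) for i in kept] + [pivot]
```

In a real MLLM the scores would come from the model's own attention maps and the tokens would be high-dimensional embeddings; plain Python lists are used here only to keep the example self-contained.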

Original language: English
Title of host publication: Proceedings - 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCV-W 2025
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 6661-6670
Number of pages: 10
ISBN (Electronic): 9798331589882
DOIs
State: Published - 2025
Event: 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCV-W 2025 - Honolulu, United States
Duration: 19 Oct 2025 to 20 Oct 2025

Publication series

Name: Proceedings - 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCV-W 2025

Conference

Conference: 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCV-W 2025
Country/Territory: United States
City: Honolulu
Period: 19/10/25 to 20/10/25
