TY - GEN
T1 - Attend and Replay
T2 - 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCV-W 2025
AU - Hou, Puyue
AU - Zhang, Jinjin
AU - Huang, Di
N1 - Publisher Copyright: © 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Efficient action understanding in long videos remains a significant challenge for multimodal large language models (MLLMs), primarily because target actions are difficult to localize within long frame sequences, where unrelated actions cause heavy interference. In this work, we focus on efficient action localization to enhance video understanding from an internal interpretability perspective, leveraging the relationship between text and video tokens to remove irrelevant tokens. By tracing the attention distribution across videos of varying frame lengths, we observe that unsuccessful action understanding correlates directly with unrelated actions receiving notable attention scores. Motivated by these findings, we propose an Attend and Replay method that efficiently locates critical action information and strengthens its semantic representation. The approach first reduces unrelated action tokens using an attention-guided spatiotemporal pruning strategy, then enriches target action tokens via a pivot-token aggregation method. Extensive experiments show that integrating our method with existing MLLMs (e.g., LLaVA-Video, Qwen2.5-VL, MiMo-VL) outperforms competing methods on various datasets while also delivering fast inference.
AB - Efficient action understanding in long videos remains a significant challenge for multimodal large language models (MLLMs), primarily because target actions are difficult to localize within long frame sequences, where unrelated actions cause heavy interference. In this work, we focus on efficient action localization to enhance video understanding from an internal interpretability perspective, leveraging the relationship between text and video tokens to remove irrelevant tokens. By tracing the attention distribution across videos of varying frame lengths, we observe that unsuccessful action understanding correlates directly with unrelated actions receiving notable attention scores. Motivated by these findings, we propose an Attend and Replay method that efficiently locates critical action information and strengthens its semantic representation. The approach first reduces unrelated action tokens using an attention-guided spatiotemporal pruning strategy, then enriches target action tokens via a pivot-token aggregation method. Extensive experiments show that integrating our method with existing MLLMs (e.g., LLaVA-Video, Qwen2.5-VL, MiMo-VL) outperforms competing methods on various datasets while also delivering fast inference.
UR - https://www.scopus.com/pages/publications/105035155788
U2 - 10.1109/ICCVW69036.2025.00688
DO - 10.1109/ICCVW69036.2025.00688
M3 - Conference contribution
AN - SCOPUS:105035155788
T3 - Proceedings - 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCV-W 2025
SP - 6661
EP - 6670
BT - Proceedings - 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCV-W 2025
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 19 October 2025 through 20 October 2025
ER -