TY - JOUR
T1 - MIRA: Multi-view Information Retrieval with Adaptive Routing for Test-time Long-video Comprehension
T2 - Transactions on Machine Learning Research
AU - Hao, Zecheng
AU - Ma, Wenxuan
AU - Cui, Yufeng
AU - Li, Shuang
AU - Wang, Xinlong
AU - Huang, Tiejun
N1 - Publisher Copyright:
© 2026, Transactions on Machine Learning Research. All rights reserved.
PY - 2026
Y1 - 2026
N2 - Foundational Multi-modal Large Language Models (MLLMs) have achieved rapid progress in handling complex tasks across diverse modalities. However, they still struggle to deliver satisfactory performance on Long-video Comprehension (LVC) tasks involving thousands of frames. Existing optimization strategies can be broadly categorized into LVC-specific fine-tuning, built-in token compression and training-free keyframe extraction, with the latter being most suitable for flexible deployment across various MLLMs. Unfortunately, current training-free approaches predominantly focus on query-frame relevance retrieval, overlooking other levels of visual information and the inherent heterogeneity of LVC tasks. In this work, we propose the Multi-view Information Retrieval with Adaptive Routing (MIRA) framework, which evaluates video frames using distinct metrics for relevance and causality, combines these scores to select a balanced pool of keyframes, and employs an adaptive feedback loop to tailor the retrieval process to different user queries, enabling more precise and sample-grained video comprehension. Extensive experiments demonstrate the advanced performance of our scheme across multiple challenging LVC benchmarks. For instance, integrating MIRA with Qwen-2.5-VL yields performance gains of 3.5% to 13.1% on LVB, VideoMME and MLVU.
UR - https://www.scopus.com/pages/publications/105032761705
M3 - Article
AN - SCOPUS:105032761705
SN - 2835-8856
VL - 2026-February
JO - Transactions on Machine Learning Research
JF - Transactions on Machine Learning Research
ER -