
MIRA: Multi-view Information Retrieval with Adaptive Routing for Test-time Long-video Comprehension

  • Zecheng Hao
  • Wenxuan Ma
  • Yufeng Cui
  • Shuang Li
  • Xinlong Wang
  • Tiejun Huang

Research output: Contribution to journal › Article › peer-review

Abstract

Foundational Multi-modal Large Language Models (MLLMs) have achieved rapid progress in handling complex tasks across diverse modalities. However, they still struggle to deliver satisfactory performance on Long-video Comprehension (LVC) tasks involving thousands of frames. Existing optimization strategies can be broadly categorized into LVC-specific fine-tuning, built-in token compression and training-free keyframe extraction, with the latter being most suitable for flexible deployment across various MLLMs. Unfortunately, current training-free approaches predominantly focus on query-frame relevance retrieval, overlooking other levels of visual information and the inherent heterogeneity of LVC tasks. In this work, we propose the Multi-view Information Retrieval with Adaptive Routing (MIRA) framework, which evaluates video frames using distinct metrics for relevance and causality, combines these scores to select a balanced pool of keyframes, and employs an adaptive feedback loop to tailor the retrieval process to different user queries, enabling more precise and sample-grained video comprehension. Extensive experiments demonstrate the advanced performance of our scheme across multiple challenging LVC benchmarks. For instance, integrating MIRA with Qwen-2.5-VL yields performance gains of 3.5% to 13.1% on LVB, VideoMME and MLVU.
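The abstract's keyframe-selection idea (score each frame on separate relevance and causality views, then combine the scores to pick a balanced keyframe pool) can be sketched as follows. This is an illustrative sketch only, not the authors' implementation; the weighting parameter `alpha` and the function `select_keyframes` are hypothetical names, and real relevance/causality scores would come from a query-conditioned MLLM or retrieval model rather than from toy arrays.

```python
import numpy as np

def select_keyframes(relevance, causality, k=8, alpha=0.5):
    """Combine per-frame relevance and causality scores and pick a top-k keyframe pool.

    relevance, causality: 1-D arrays of per-frame scores (hypothetical inputs,
    e.g. query-frame similarity and causal-importance estimates).
    alpha: assumed mixing weight between the two views.
    Returns the selected frame indices in temporal order.
    """
    relevance = np.asarray(relevance, dtype=float)
    causality = np.asarray(causality, dtype=float)
    combined = alpha * relevance + (1.0 - alpha) * causality
    # Take the k highest-scoring frames, then restore temporal order
    # so the MLLM sees them in sequence.
    top = np.argsort(combined)[::-1][:k]
    return np.sort(top).tolist()

# Toy example: frames 0, 3, and 5 score highest under the combined view.
rel = [0.9, 0.1, 0.4, 0.8, 0.2, 0.7]
cau = [0.2, 0.9, 0.3, 0.6, 0.1, 0.8]
print(select_keyframes(rel, cau, k=3))  # → [0, 3, 5]
```

The paper's adaptive routing would, per the abstract, additionally adjust this retrieval process per query via a feedback loop, which the fixed `alpha` here does not capture.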

Original language: English
Journal: Transactions on Machine Learning Research
2026-February
Publication status: Published - 2026
