
MoTIF: An end-to-end multimodal road traffic scene understanding foundation model

Zihe Wang, Haiyang Yu, Changxin Chen, Zhiyong Cui*, Yufeng Bi, Yilong Ren, Zijian Wang, Delan Kong, Jing Tian, Shoutong Yuan, Zhiqiang Li

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Video-based intelligent road detection is a critical component of modern intelligent transportation systems, playing a crucial role in comprehensive transportation planning and emergency traffic management. Current traffic scene perception methods built on conventional deep learning architectures have inherent limitations, including a heavy dependence on extensive manual annotations of specific traffic scenarios and on predefined rule configurations. These approaches exhibit constrained semantic representation capacity and limited generalizability across heterogeneous traffic scenarios. To address these challenges, this study proposes a novel end-to-end multimodal foundation model architecture that jointly generates dynamic traffic event detection outcomes and semantically rich contextual descriptions. By integrating low-rank adaptation (LoRA) and prompt fine-tuning as parameter-efficient fine-tuning strategies, we develop the multimodal road traffic scene understanding foundation model (MoTIF), which establishes cross-modal alignment between visual patterns and textual semantics. The framework demonstrates enhanced capability in extracting salient traffic targets and generating hierarchical scene representations, significantly improving automated detection efficiency in road video analytics. Notably, MoTIF exhibits contextual reasoning capabilities for interpreting implicit traffic events. Extensive evaluations on two real-world datasets, covering urban road intersection scenarios in Tianjin and highway monitoring systems in Shandong Province, show that MoTIF achieves an average score of 65.81 on multimodal scene understanding assessment and 83.33% event detection accuracy, outperforming mainstream benchmarks in both precision and computational efficiency.
This research advances multimodal learning paradigms for intelligent transportation systems while providing practical insights for adaptive traffic management applications.
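The abstract names low-rank adaptation (LoRA) as one of MoTIF's parameter-efficient fine-tuning strategies. The sketch below illustrates the general LoRA idea on a single linear layer: the pretrained weight W is frozen, and only a small low-rank update B·A (rank r much smaller than the layer width) is trained. All names, shapes, and hyperparameter values here are illustrative assumptions, not details from the paper.

```python
import numpy as np

# Illustrative LoRA forward pass for one linear layer (not MoTIF's actual code).
rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 8, 8, 2, 4  # rank r << layer width

W = rng.standard_normal((d_out, d_in))       # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, r))                     # trainable up-projection, zero-initialized

def lora_forward(x):
    # y = W x + (alpha / r) * B A x; during fine-tuning only A and B are updated.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# Because B starts at zero, the adapted layer initially matches the frozen one.
assert np.allclose(lora_forward(x), W @ x)

# Trainable parameters: r * (d_in + d_out) = 32, versus d_in * d_out = 64 frozen.
trainable = A.size + B.size
```

The zero initialization of B is the standard LoRA choice: it guarantees fine-tuning starts from the pretrained model's behavior, and the parameter count grows linearly in r rather than quadratically in the layer width.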

Original language: English
Article number: 100227
Journal: Communications in Transportation Research
Volume: 5
DOIs
State: Published - Dec 2025

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

  1. SDG 11 - Sustainable Cities and Communities

Keywords

  • Fine-tuning
  • Multimodal foundation model
  • Road traffic
  • Scene understanding
