
A Multiview-Integrated Framework for Traffic Scene Understanding Based on YOLO and LLM

  • Yixuan Zhao
  • Tian Ma
  • Zihe Wang
  • Ziyu Zhang
  • Chenxi Li
  • Shuai Liu
  • Zhiyong Cui
  • Mengqi Lv
  • Haiyang Yu*
  • Zixi Peng

*Corresponding author for this work

Research output: Contribution to journal; Article; peer-review

Abstract

Traffic scene understanding plays a crucial role in reasoning about and predicting relationships among entities in traffic images. It focuses on analyzing behavioral interaction patterns and global semantic associations to support higher-level traffic requirements. However, few existing frameworks achieve comprehensive scene understanding and semantic description in complex traffic environments. In particular, effective modeling of semantic associations across multiple views is still lacking. To address these challenges, we propose the multiview large language model (MVLLM), a framework that integrates YOLO-based object detection with the reasoning ability of large language models (LLMs). Through prompt engineering, MVLLM uses the visual information extracted by YOLO to constrain the semantic space and guide the LLM's reasoning, thereby enhancing its scene parsing capability. In addition, we design a Chain-of-Thought (CoT) reasoning mechanism that establishes spatiotemporal associations across multiple views and integrates the per-view scene understanding into a unified semantic description. The framework enables intent understanding for vehicles in dynamic environments, which can enhance driving safety. It also provides comprehensive semantic descriptions for traffic management agencies, supporting holistic analyses of vehicles, roads, and environmental contexts.
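
The idea of using detector output to constrain an LLM prompt can be illustrated with a minimal sketch. This is not the authors' implementation: the `Detection` fields, the `build_prompt` function, the confidence threshold, and the prompt wording are all hypothetical, and the detections are assumed to have been produced already by some object detector, one list per camera view.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One detector output; fields are illustrative, not the paper's schema."""
    label: str
    confidence: float
    box: tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def build_prompt(views: dict[str, list[Detection]], min_conf: float = 0.5) -> str:
    """Turn per-view detections into a step-by-step prompt (hypothetical wording)."""
    lines = [
        "You are analyzing a traffic scene captured from several camera views.",
        "Only reason about the objects listed below.",
    ]
    for view_name, detections in views.items():
        # Drop low-confidence detections so they do not enter the prompt.
        kept = [d for d in detections if d.confidence >= min_conf]
        lines.append(f"View '{view_name}':")
        if not kept:
            lines.append("  (no objects detected)")
        for d in kept:
            lines.append(f"  - {d.label} at box {d.box}, confidence {d.confidence:.2f}")
    # Ask for per-view reasoning first, then cross-view association.
    lines += [
        "Step 1: Describe each view separately.",
        "Step 2: Match objects that appear in more than one view.",
        "Step 3: Give one combined description of the scene.",
    ]
    return "\n".join(lines)
```

In this sketch the listed objects restrict what the model is asked to talk about, and the numbered steps stand in for a chain-of-thought style instruction; how MVLLM actually structures its prompts is not specified in the abstract.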

Original language: English
Article number: 2814128
Journal: Journal of Advanced Transportation
Volume: 2026
Issue number: 1
State: Published - 2026

Keywords

  • LLM
  • deep learning
  • multiview integration
  • road transportation
  • traffic scene understanding
