
SafeRoute: Enhancing Traffic Scene Understanding via a Unified Deep Learning and Multimodal LLM

  • Ankit Kumar Shaw
  • Chandan Kumar Sah
  • Xiaoli Lian*
  • Arsalan Shahid Baig
  • Tuopu Wen
  • Kun Jiang*
  • Mengmeng Yang
  • Diange Yang
  • Li Zhang
  • *Corresponding author for this work
  • Tsinghua University
  • Beihang University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Peer-review

Abstract

Autonomous vehicles (AVs) require highly reliable traffic sign recognition and robust lane detection to navigate safely in complex and dynamic environments. This paper presents SafeRoute, a unified perception framework that integrates deep learning with an instruction-tuned Multimodal Large Language Model (MLLM) for comprehensive road scene understanding. For traffic sign recognition, we benchmark three state-of-the-art architectures, ResNet-50, YOLOv8, and RT-DETR, which achieve accuracies of 99.8%, 98.0%, and 96.6%, respectively. To address the limitations of traditional vision-only lane detection methods under adverse conditions (e.g., occlusion, poor lighting, and road wear), we introduce an MLLM-based pipeline that is fine-tuned via instruction learning without requiring large-scale pretraining. At its core is a novel Multimodal Adapter that fuses CNN-derived spatial features with EVA-CLIP embeddings, enabling fine-grained visual grounding and robustness to occlusion. By feeding these visual tokens into a LLaMA-2 decoder, our system performs semantic-level reasoning and interpretable scene understanding, moving beyond segmentation to structured, language-based lane perception. Quantitatively, SafeRoute achieves a Frame Overall Accuracy (FRM) of 53.87%, a Question Overall Accuracy (QNS) of 82.83%, and lane detection accuracies of 99.6% in clear conditions and 93.0% at night. It also demonstrates robust reasoning in adverse conditions, with 88.4% accuracy in rain and 95.6% under lane degradation. Overall, SafeRoute introduces a new paradigm in AV perception: a unified, multimodal approach that significantly improves both the robustness and the explainability of lane detection in safety-critical scenarios.
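The abstract does not specify how the Multimodal Adapter combines its two feature streams. The sketch below assumes, purely for illustration, per-patch concatenation of CNN-derived features and EVA-CLIP embeddings followed by a learned linear projection into the LLaMA-2 hidden size, with the resulting visual tokens prepended to the embedded text prompt. The function names (`fuse_visual_tokens`, `build_decoder_input`), the fusion rule, the token ordering, and the toy weights are all hypothetical, and plain Python lists stand in for the tensors a real implementation would use.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Affine map weight @ x + bias for a single feature vector."""
    if any(len(row) != len(x) for row in weight) or len(weight) != len(bias):
        raise ValueError("weight/bias shapes do not match the input vector")
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def fuse_visual_tokens(cnn_feats: List[Vector], clip_feats: List[Vector],
                       weight: Matrix, bias: Vector) -> List[Vector]:
    """Hypothetical adapter: concatenate per-patch CNN and EVA-CLIP features,
    then project each fused vector into the decoder's hidden size."""
    if len(cnn_feats) != len(clip_feats):
        raise ValueError("expected one CNN feature vector per CLIP patch")
    tokens = []
    for cnn_vec, clip_vec in zip(cnn_feats, clip_feats):
        fused = cnn_vec + clip_vec  # list concatenation: [cnn ; clip]
        tokens.append(linear(fused, weight, bias))
    return tokens


def build_decoder_input(visual_tokens: List[Vector],
                        text_embeddings: List[Vector]) -> List[Vector]:
    """Prepend visual tokens to the embedded text prompt (assumed ordering)."""
    return visual_tokens + text_embeddings


# Toy example: 2 patches, 2-d CNN features, 1-d CLIP features, 2-d decoder.
cnn = [[1.0, 0.0], [0.0, 1.0]]
clip = [[0.5], [2.0]]
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 1.0]]
b = [0.0, 0.5]
visual = fuse_visual_tokens(cnn, clip, W, b)
sequence = build_decoder_input(visual, [[9.0, 9.0]])
print(visual)         # [[1.0, 1.0], [0.0, 3.5]]
print(len(sequence))  # 3
```

Concatenation keeps both views of each patch available to the projection; an actual adapter could just as well use cross-attention, gating, or deeper layers, which this sketch does not attempt to reproduce.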

Original language: English
Host publication title: Proceedings - 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCV-W 2025
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 4606-4615
Number of pages: 10
ISBN (electronic): 9798331589882
DOI
Publication status: Published - 2025
Event: 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCV-W 2025 - Honolulu, United States
Duration: 19 Oct 2025 to 20 Oct 2025

Publication series

Name: Proceedings - 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCV-W 2025

Conference

Conference: 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCV-W 2025
Country/Territory: United States
City: Honolulu
Period: 19/10/25 to 20/10/25
