Remote Sensing Spatiotemporal Vision-Language Models: A comprehensive survey

  • Beihang University
  • Key Laboratory of Precision Opto-Mechatronics Technology (Ministry of Education)
  • Inner Mongolia University

Research output: Contribution to journal, article, peer-reviewed

Abstract

The interpretation of multitemporal remote sensing imagery is critical for monitoring Earth's dynamic processes. However, previous change detection (CD) methods, which produce binary or semantic masks, fall short of providing human-readable insights into changes. Recent advances in vision-language models (VLMs) have opened a new frontier by fusing visual and linguistic modalities, enabling spatiotemporal vision-language understanding: models that not only capture spatial and temporal dependencies to recognize changes but also provide a richer interactive semantic analysis of temporal images (e.g., generate descriptive captions and answer natural language queries). In this survey, we present the first comprehensive review of remote sensing spatiotemporal VLMs (RS-STVLMs). The survey covers the evolution of models from early task-specific models to recent general foundation models that leverage powerful large language models (LLMs). We discuss progress in representative tasks, such as change captioning, change question answering, and change grounding. Moreover, we systematically dissect the fundamental components and key technologies underlying these models and review the datasets and evaluation metrics that have driven the field. By synthesizing task-level insights with a deep dive into shared architectural patterns, we aim to illuminate current achievements and chart promising directions for future research in spatiotemporal vision-language understanding for remote sensing. We will keep tracing related works at https://github.com/Chen-Yang-Liu/Awesome-RS-SpatioTemporal-VLMs.

Original language: English
Pages (from-to): 383-423
Number of pages: 41
Journal: IEEE Geoscience and Remote Sensing Magazine
Volume: 14
Issue: 1
DOI
Publication status: Published - 2026
