TY - JOUR
T1 - Remote Sensing Spatiotemporal Vision-Language Models
T2 - A comprehensive survey
AU - Liu, Chenyang
AU - Zhang, Jiafan
AU - Chen, Keyan
AU - Wang, Man
AU - Zou, Zhengxia
AU - Shi, Zhenwei
N1 - Publisher Copyright:
© 2013 IEEE.
PY - 2026
Y1 - 2026
N2 - The interpretation of multitemporal remote sensing imagery is critical for monitoring Earth's dynamic processes. However, previous change detection (CD) methods, which produce binary or semantic masks, fall short of providing human-readable insights into changes. Recent advances in vision-language models (VLMs) have opened a new frontier by fusing visual and linguistic modalities, enabling spatiotemporal vision-language understanding: models that not only capture spatial and temporal dependencies to recognize changes but also provide a richer interactive semantic analysis of temporal images (e.g., generate descriptive captions and answer natural language queries). In this survey, we present the first comprehensive review of remote sensing spatiotemporal VLMs (RS-STVLMs). The survey covers the evolution of models from early task-specific models to recent general foundation models that leverage powerful large language models (LLMs). We discuss progress in representative tasks, such as change captioning, change question answering, and change grounding. Moreover, we systematically dissect the fundamental components and key technologies underlying these models and review the datasets and evaluation metrics that have driven the field. By synthesizing task-level insights with a deep dive into shared architectural patterns, we aim to illuminate current achievements and chart promising directions for future research in spatiotemporal vision-language understanding for remote sensing. We will keep tracing related works at https://github.com/Chen-Yang-Liu/Awesome-RS-SpatioTemporal-VLMs.
UR - https://www.scopus.com/pages/publications/105015842026
U2 - 10.1109/MGRS.2025.3598283
DO - 10.1109/MGRS.2025.3598283
M3 - Article
AN - SCOPUS:105015842026
SN - 2473-2397
VL - 14
SP - 383
EP - 423
JO - IEEE Geoscience and Remote Sensing Magazine
JF - IEEE Geoscience and Remote Sensing Magazine
IS - 1
ER -