Text2Earth: Unlocking text-driven remote sensing image generation with a global-scale dataset and a foundation model

Research output: Contribution to journal › Article › peer-review

Abstract

Recently, generative foundation models (GFMs) have significantly advanced large-scale text-driven natural image generation and become a prominent research trend across various vertical domains. However, in the remote sensing field, there is still a lack of research on large-scale text-to-image (text2image) generation technology. Existing remote sensing image–text datasets are small in scale and confined to specific geographic areas and scene types. Moreover, existing text2image methods struggle to achieve global-scale coverage, multiresolution controllability, and unbounded image generation. To address these challenges, this article presents two key contributions: the Git-10M dataset and the Text2Earth foundation model. Git-10M is a global-scale image–text dataset consisting of 10.5 million image–text pairs, five times larger than the previous largest dataset. The dataset covers a wide range of geographic scenes and contains essential geospatial metadata, significantly surpassing existing datasets in both size and diversity. Building on Git-10M, we propose Text2Earth, a 1.3 billion-parameter GFM based on the diffusion framework to model global-scale remote sensing scenes. Text2Earth integrates a resolution guidance mechanism, enabling users to specify the spatial resolution of generated images. A dynamic condition adaptation (DCA) strategy is proposed for training and inference to improve image generation quality. Text2Earth not only excels in zero-shot text2image generation but also demonstrates robust generalization and flexibility across multiple tasks, including unbounded scene construction, image editing, and cross-modal image generation. This capability surpasses previous models, which were restricted to fixed image sizes and limited scene types. On the previous text2image benchmark dataset, Text2Earth outperforms previous models, improving the Fréchet inception distance (FID) by 26.23 and the zero-shot classification overall accuracy (Cls-OA) by 20.95 percentage points. Our project page is https://chen-yang-liu.github.io/Text2Earth/.
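For context on the kind of interface such a text2image GFM exposes, the sketch below runs a generic text-conditioned diffusion pipeline with the open-source Hugging Face diffusers library. The Stable Diffusion checkpoint name and the idea of stating the target spatial resolution inside the prompt are illustrative assumptions for demonstration only; they are not the Text2Earth model, its API, or its resolution guidance mechanism.

    # Minimal illustrative sketch, not the Text2Earth implementation:
    # a generic text-conditioned diffusion pipeline from the diffusers library.
    import torch
    from diffusers import StableDiffusionPipeline

    # Placeholder checkpoint (assumption); Text2Earth itself is not loaded here.
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16,
    ).to("cuda")

    # Expressing the desired spatial resolution in the prompt text is an
    # assumption, used only to illustrate resolution-aware prompting.
    prompt = (
        "Satellite image of a coastal city with a harbor and dense housing, "
        "spatial resolution about 1 m per pixel"
    )

    # Generate one 512x512 sample and save it to disk.
    image = pipe(prompt, height=512, width=512, num_inference_steps=30).images[0]
    image.save("coastal_city_sample.png")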

Original language: English
Pages (from-to): 238-259
Number of pages: 22
Journal: IEEE Geoscience and Remote Sensing Magazine
Volume: 13
Issue number: 3
DOIs
State: Published - 2025

