Learning Visual Representation for Autonomous Drone Navigation via a Contrastive World Model

  • Beihang University

Research output: Contribution to journal › Article › peer-review

Abstract

Visuomotor policy learning for vision-based navigation tasks remains challenging yet necessary for autonomous systems. Learning a task-specific policy from scratch simplifies the training pipeline but suffers from poor data efficiency and transferability. This problem tends to be even more intractable under a low-data regime. In this work, we present a self-supervised representation learning architecture that incorporates Spatial and Temporal information via a Contrastive world model (STC) to extract image representations for vision-based navigation tasks. Specifically, STC leverages a dynamics transition model based on a recurrent neural network to construct a joint low-dimensional latent space for spatial and temporal representations. We simultaneously optimize all components of this architecture using a multiobjective contrastive training loss. The resulting pretrained encoder acts as a standalone feature extractor that supports the policy learning procedure. We evaluate the final optimized visuomotor policy on both a simulated drone navigation environment and an out-of-domain dataset. Experimental results demonstrate that our proposed method outperforms task-specific and representative contrastive learning baselines in challenging, complex visual environments, with more than half the improvement in data efficiency, and provides significant gains in learning speed as well as final performance. Code and video are available at: https://github.com/yibow-wang/cwm4drone.

Impact Statement: Image data is one of the most critical perception methods for autonomous systems. However, visual observations are naturally high-dimensional and noisy, which complicates data processing and utilization. The visual representation learning method proposed in this work leads to a feature extractor that efficiently extracts task-relevant factors from the image observations. Evaluation results on vision-based drone navigation tasks show a satisfying improvement in generalization quality and data efficiency for policy learning. Because no task-specific component is used for representation learning, this approach can easily be adopted by various vision-based autonomous systems with little modification.
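
The abstract mentions a contrastive training loss but does not spell it out. As a rough illustration only, the sketch below computes a generic InfoNCE-style contrastive loss over a small batch of embeddings. It is not the authors' multiobjective STC loss; the function name `info_nce`, the `temperature` value, and the toy vectors are all assumptions made for this example.

```python
import math


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss for a batch of embedding pairs.

    positives[i] is the positive for anchors[i]; every other row of
    `positives` is treated as a negative for that anchor.
    Hypothetical helper for illustration, not the paper's loss.
    """
    def normalize(v):
        # Scale to unit length so the dot product is a cosine similarity.
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]

    total = 0.0
    for i, ai in enumerate(a):
        # Similarity of anchor i to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching index as the target class.
        total += log_sum - logits[i]
    return total / len(a)


# Toy check: correctly paired embeddings should score a lower loss
# than the same embeddings paired with the wrong partner.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

In a setup like the one described, the anchor and positive embeddings might come from, for example, two views of the same frame or a predicted and an observed latent state, but that pairing is a guess here and should be checked against the paper and the linked repository.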

Original language: English
Pages (from-to): 1263-1276
Number of pages: 14
Journal: IEEE Transactions on Artificial Intelligence
Volume: 5
Issue number: 3
State: Published - 1 Mar 2024

Keywords

  • Contrastive representation learning
  • vision-based navigation
  • visuomotor policy learning
  • world model.
