Generating Distance-Aware Human-to-Human Interaction Motions From Text Guidance

  • Jia Qi Zhang
  • Jia Jun Wang
  • Fang Lue Zhang
  • Miao Wang*

*Corresponding author for this work

  • Beihang University
  • Victoria University of Wellington
  • Zhongguancun Laboratory

Research output: Contribution to journal › Article › peer-review

Abstract

The growing demand for diverse and realistic character animations in video games and films has driven the development of natural language-controlled motion generation systems. While text-driven 3D human motion synthesis has made significant progress, generating realistic multi-person interactions remains a major challenge. Existing methods, such as denoising diffusion models and autoregressive frameworks, have explored interaction dynamics using attention mechanisms and causal modeling. However, they consistently overlook a critical physical constraint: the explicit spatial distance between interacting body parts, which is essential for producing semantically accurate and physically plausible interactions. To address this limitation, we propose InterDist, a novel masked generative Transformer model operating in a discrete state space. Our key idea is to decompose two-person motion into three components: two independent, interaction-agnostic single-person motion sequences and a separate interaction distance sequence. This formulation enables direct learning of both individual motion and dynamic spatial relationships from text prompts. We implement this via a VQ-VAE that jointly encodes independent motions and relative distances into discrete codebooks, followed by a bidirectional masked generative Transformer that models their joint distribution conditioned on text. To better align motion and language, we also introduce a cross-modal interaction module that strengthens text-motion association. Our approach ensures that the generated motions exhibit both semantic alignment with textual descriptions and plausible inter-character distances, setting a new benchmark for text-driven multi-person interaction generation.
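
The decomposition described in the abstract can be illustrated with a minimal sketch. It is not the InterDist implementation: the abstract does not say which body parts are paired, how distances are represented, or how the codebooks are trained, so the function names `interaction_distances` and `quantize`, the all-pairs joint distances, and the plain nearest-neighbour codebook lookup are all assumptions made for illustration.

```python
import math

def interaction_distances(motion_a, motion_b):
    """Per-frame distances between every joint of person A and person B.

    motion_a, motion_b: lists of frames; each frame is a list of (x, y, z)
    joint positions in a shared global coordinate system (an assumption).
    Returns result[t][i][j] = Euclidean distance between joint i of A and
    joint j of B at frame t.
    """
    return [
        [[math.dist(pa, pb) for pb in frame_b] for pa in frame_a]
        for frame_a, frame_b in zip(motion_a, motion_b)
    ]

def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry.

    This is the generic vector-quantization lookup used by VQ-VAE style
    models; the codebook here is hand-written, not learned.
    """
    return [
        min(range(len(codebook)), key=lambda k: math.dist(v, codebook[k]))
        for v in vectors
    ]

# Two frames, two joints per person (toy data, purely illustrative).
person_a = [[(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]] * 2
person_b = [[(3.0, 4.0, 0.0), (3.0, 5.0, 0.0)]] * 2

distances = interaction_distances(person_a, person_b)

# Flatten each frame's distance matrix into one vector and quantize it
# against a hypothetical two-entry codebook ("far apart" vs "touching").
flat = [[d for row in frame for d in row] for frame in distances]
toy_codebook = [[5.0, 5.0, 5.0, 5.0], [0.0, 0.0, 0.0, 0.0]]
tokens = quantize(flat, toy_codebook)
```

In this toy setup the distance sequence becomes one discrete token per frame, the kind of sequence a masked generative Transformer could model alongside the two single-person motion token streams.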

Original language: English
Pages (from-to): 2615-2627
Number of pages: 13
Journal: IEEE Transactions on Visualization and Computer Graphics
Volume: 32
Issue number: 3
DOIs
State: Published - Mar 2026

Keywords

  • Human-to-human interaction
  • masked generative model
  • motion synthesis
  • text-driven generation
