MaskScene: Hierarchical Conditional Masked Models for Real-time 3D Indoor Scene Synthesis

  • Xinyu Zhang
  • Yusen Liu
  • Qichuan Geng
  • Zhong Zhou*
  • Wenfeng Song

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Indoor scene synthesis is essential for creative industries, and recent advances in scene synthesis using diffusion and autoregressive models have shown promising results. However, existing models struggle to simultaneously achieve real-time performance, high visual fidelity, and flexible scene editability. To tackle these challenges, we propose MaskScene, a novel hierarchical conditional masked model for real-time 3D indoor scene synthesis and editing. Specifically, MaskScene introduces a hierarchical scene representation that explicitly encodes scene relationships, semantics, and tokenization. Based on this representation, we design a hierarchical conditional masked modeling architecture that enables parallel and iterative decoding, conditioned on both semantics and relationships. By masking local objects and leveraging the hierarchical structure of the scene, the model learns to infer and synthesize missing regions from partial observations, enabling rapid construction of 3D indoor environments that more accurately reflect real-world scenes. Compared to state-of-the-art methods, MaskScene generates scenes 80× faster and improves scene quality by 10%, while also supporting zero-shot editing, such as scene completion and rearrangement, without extra fine-tuning. Our project and dataset will be made public.

Original language: English
Journal: IEEE Transactions on Visualization and Computer Graphics
State: Accepted/In press - 2026

Keywords

  • 3D Indoor Scene Synthesis
  • Hierarchical Conditional Masked Model
  • Hierarchical scene representation
