MaskScene: Hierarchical Conditional Masked Models for Real-time 3D Indoor Scene Synthesis

Abstract
Indoor scene synthesis is essential for creative industries, and recent advances using diffusion and autoregressive models have shown promising results. However, existing models struggle to simultaneously achieve real-time performance, high visual fidelity, and flexible scene editability. To tackle these challenges, we propose MaskScene, a novel hierarchical conditional masked model for real-time 3D indoor scene synthesis and editing. Specifically, MaskScene introduces a hierarchical scene representation that explicitly encodes scene relationships, semantics, and tokenization. Based on this representation, we design a hierarchical conditional masked modeling architecture that enables parallel and iterative decoding, conditioned on both semantics and relationships. By masking local objects and leveraging the hierarchical structure of the scene, the model learns to infer and synthesize missing regions from partial observations, enabling rapid construction of 3D indoor environments that more accurately reflect real-world scenes. Compared to state-of-the-art methods, MaskScene achieves 80× faster generation and improves scene quality by 10%, while also supporting zero-shot editing, such as scene completion and rearrangement, without extra fine-tuning. Our project and dataset will be made publicly available.
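To make the "parallel and iterative decoding" idea concrete, the following is a minimal, illustrative sketch of MaskGIT-style iterative masked decoding: start from a fully masked token sequence and, at each step, commit only the most confident predictions according to a cosine unmasking schedule. This is an assumption-laden toy (the `toy_predict` stub, the cosine schedule, and all names are illustrative), not the paper's actual implementation, which conditions on scene semantics and relationships.

```python
import math
import random

MASK = -1  # sentinel for a masked object token (illustrative choice)

def toy_predict(tokens, vocab_size, rng):
    """Stand-in for the masked model: propose a token and a confidence
    score for every masked position. A real model would condition on
    scene semantics and relationships; here we sample randomly."""
    proposals = {}
    for i, t in enumerate(tokens):
        if t == MASK:
            proposals[i] = (rng.randrange(vocab_size), rng.random())
    return proposals

def iterative_masked_decode(num_tokens, vocab_size, steps=4, seed=0):
    """Parallel iterative decoding: begin fully masked, then at each
    step keep only the highest-confidence proposals, unmasking more
    positions per step following a cosine schedule."""
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens
    for step in range(1, steps + 1):
        proposals = toy_predict(tokens, vocab_size, rng)
        if not proposals:
            break
        # cosine schedule: fraction of positions that should stay masked
        mask_ratio = math.cos(math.pi / 2 * step / steps)
        num_keep = max(len(proposals) - int(mask_ratio * num_tokens), 1)
        # commit the most confident proposals in parallel
        ranked = sorted(proposals.items(), key=lambda kv: kv[1][1], reverse=True)
        for pos, (tok, _conf) in ranked[:num_keep]:
            tokens[pos] = tok
    return tokens
```

Because many positions are committed per step, the number of forward passes is the (small, fixed) number of schedule steps rather than the number of objects, which is the usual source of masked models' speed advantage over autoregressive decoding.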
| Field | Value |
|---|---|
| Original language | English |
| Journal | IEEE Transactions on Visualization and Computer Graphics |
| DOIs | |
| State | Accepted/In press - 2026 |
Keywords
- 3D Indoor Scene Synthesis
- Hierarchical Conditional Masked Model
- Hierarchical Scene Representation