
Text-Driven High-Quality 3D Human Generation via Variational Gradient Estimation and Latent Reward Models

Research output: Contribution to journal › Article › peer-review

Abstract

Recent advances in Score Distillation Sampling (SDS) have enabled text-driven 3D human generation, yet the standard classifier-free guidance (CFG) framework struggles with semantic misalignment and texture oversaturation due to limited model capacity. We propose a novel framework that decouples conditional and unconditional guidance via a dual-model strategy: a pretrained diffusion model ensures geometric stability, while a preference-tuned latent reward model enhances semantic fidelity. To further refine noise estimation, we introduce a lightweight U-shaped Swin Transformer (U-Swin) that regularizes predicted noise against the reward model, reducing gradient bias and local artifacts. Additionally, we design a time-varying noise weighting mechanism that dynamically balances the two guidance signals during denoising, improving stability and texture realism. Extensive experiments show that our method significantly improves alignment with textual descriptions, enhances texture details, and outperforms state-of-the-art baselines in both visual quality and semantic consistency.
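The dual-model guidance with time-varying weighting described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual formulation: the function name, the linear weight schedule, and the way the two noise estimates are blended into a standard CFG update are all assumptions.

```python
import numpy as np

def dual_guidance_noise(eps_pretrained, eps_uncond, eps_reward, t, T, w_cfg=7.5):
    """Hypothetical sketch of dual-model guidance: a pretrained diffusion
    model supplies a geometry-stabilizing conditional estimate, while a
    preference-tuned latent reward model supplies a semantically aligned
    one. A time-varying weight lam(t) shifts emphasis between the two
    signals over the denoising trajectory (linear schedule assumed here)."""
    lam = t / T  # hypothetical linear schedule in [0, 1]
    # Blend the two conditional noise estimates with the time-varying weight.
    eps_cond = (1.0 - lam) * eps_pretrained + lam * eps_reward
    # Standard classifier-free guidance combination using the blended estimate.
    return eps_uncond + w_cfg * (eps_cond - eps_uncond)
```

At one end of the schedule the pretrained model's estimate dominates, and at the other the reward model's does, so the guidance signal transitions smoothly between geometric stability and semantic fidelity.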

Original language: English
Article number: e70089
Journal: Computer Animation and Virtual Worlds
Volume: 37
Issue number: 1
DOIs
State: Published - 1 Jan 2026

Keywords

  • 3D human generation
  • classifier-free guidance
  • score distillation sampling
  • text-driven

