Abstract
Generating high-fidelity facial images that precisely match textual descriptions remains a challenging problem in computer vision. Although existing diffusion models perform well at text-to-image synthesis, they often struggle to capture fine details from complex, multi-attribute descriptions, which leads to missing entities or attributes and to incorrect attribute combinations. We propose AttriDiffuser, a model designed to ensure that each entity and attribute in a textual description is distinctly and accurately represented in the synthesized image. AttriDiffuser is a text-driven attribute diffusion adversarial model that strengthens the correspondence between textual attributes and image features by integrating an attribute-gating cross-attention mechanism into an adversarially enhanced diffusion model. It further extends conventional diffusion models with a face diversity discriminator, which augments adversarial training and promotes the generation of diverse yet precise facial images aligned with complex textual descriptions. We evaluate AttriDiffuser on the Multimodal VoxCeleb and CelebA-HQ datasets against other state-of-the-art models. The results show that it outperforms these baselines, synthesizing high-quality facial images that adhere closely to complex, multi-faceted textual descriptions. Our code and model will be made publicly available at https://github.com/sunmeng7/AttriDiffuser.
| Original language | English |
|---|---|
| Article number | 111447 |
| Journal | Pattern Recognition |
| Volume | 163 |
| State | Published - Jul 2025 |
Keywords
- Diffusion model
- Diversity face
- Facial synthesis
- Generative adversarial networks
- Text-to-facial generation
Fingerprint
Dive into the research topics of 'AttriDiffuser: Adversarially enhanced diffusion model for text-to-facial attribute image synthesis'. Together they form a unique fingerprint.