Abstract
Although conditional Neural Radiance Fields (NeRF) have achieved great success in modeling audio-driven talking portraits, generation quality is increasingly hampered by inefficient use of spatial information. This paper presents ER-NeRF, a novel conditional NeRF-based architecture for talking portrait synthesis, and its variant ER-NeRF++, which together achieve fast convergence, real-time rendering, and state-of-the-art performance with a small model size. Motivated by the unequal contribution of different spatial regions, we propose two modules in ER-NeRF to guide talking portrait modeling: (1) a compact and expressive Tri-Plane Hash Representation that improves the accuracy of dynamic head reconstruction by pruning empty spatial regions with three planar hash encoders; and (2) a Region Attention Module for audio–visual feature fusion, which includes a novel cross-modal attention mechanism that explicitly connects audio features with different spatial regions to provide local motion priors. Additionally, to tackle the difficulty of learning large facial motions, we propose a deformable variant, ER-NeRF++, which adds a Deformation Grid Transformer to enable the reuse of cross-regional spatial features for large motion representation. Compared to ER-NeRF, the ER-NeRF++ framework significantly improves facial motion quality while retaining fast training and real-time rendering. For the torso part, a direct Adaptive Pose Encoding is introduced to simplify the pose information for a better head-torso connection. Extensive experiments demonstrate that both proposed frameworks can efficiently render lifelike talking portrait videos with rich, realistic details, outperforming previous methods in image quality and audio-lip synchronization.
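The idea of encoding a 3D point with three planar hash encoders can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the projection onto the axis-aligned XY, YZ and XZ planes, the single resolution level, the table size, the feature width, and the Instant-NGP-style hash function are all assumptions chosen for illustration, and every name below is hypothetical.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
TABLE_SIZE = 2 ** 14        # entries in each planar hash table
FEAT_DIM = 2                # feature channels stored per entry
RESOLUTION = 64             # grid resolution of each 2D plane
PRIME_Y = 2654435761        # large prime commonly used in spatial hashing


def hash_2d(ix, iy):
    """Map non-negative integer 2D grid coordinates to hash-table slots."""
    ix = ix.astype(np.uint64)
    iy = iy.astype(np.uint64)
    slots = (ix ^ (iy * np.uint64(PRIME_Y))) % np.uint64(TABLE_SIZE)
    return slots.astype(np.int64)


def encode_plane(uv, table):
    """Bilinearly interpolate hashed features for 2D points in [0, 1]^2."""
    pos = uv * (RESOLUTION - 1)
    lo = np.clip(np.floor(pos).astype(np.int64), 0, RESOLUTION - 2)
    w = pos - lo  # fractional offsets inside the cell, shape (N, 2)
    out = np.zeros((uv.shape[0], table.shape[1]))
    for dx in (0, 1):
        for dy in (0, 1):
            idx = hash_2d(lo[:, 0] + dx, lo[:, 1] + dy)
            wx = w[:, 0] if dx else 1.0 - w[:, 0]
            wy = w[:, 1] if dy else 1.0 - w[:, 1]
            out += (wx * wy)[:, None] * table[idx]
    return out


def triplane_encode(xyz, tables):
    """Project 3D points onto three planes and concatenate the plane features."""
    planes = (xyz[:, [0, 1]], xyz[:, [1, 2]], xyz[:, [0, 2]])  # XY, YZ, XZ
    feats = [encode_plane(uv, t) for uv, t in zip(planes, tables)]
    return np.concatenate(feats, axis=1)


rng = np.random.default_rng(0)
tables = [rng.normal(scale=1e-2, size=(TABLE_SIZE, FEAT_DIM)) for _ in range(3)]
points = rng.uniform(size=(5, 3))            # 3D sample points in the unit cube
features = triplane_encode(points, tables)   # shape (5, 3 * FEAT_DIM)
```

In this sketch each plane only hashes a 2D grid, so two points that differ only in depth share the feature of the plane they both project onto, while the other two planes distinguish them. The concatenated feature would then be passed, together with audio features, to the rest of the network.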
| Original language | English |
|---|---|
| Article number | 102456 |
| Journal | Information Fusion |
| Volume | 110 |
| State | Published - Oct 2024 |
Keywords
- Attention mechanism
- Audio–visual fusion
- Multimodal transformer
- Neural radiance fields
- Talking portrait synthesis
Fingerprint
Dive into the research topics of 'ER-NeRF++: Efficient region-aware Neural Radiance Fields for high-fidelity talking portrait synthesis'. Together they form a unique fingerprint.