
ER-NeRF++: Efficient region-aware Neural Radiance Fields for high-fidelity talking portrait synthesis

  • Jiahe Li
  • Jiawei Zhang
  • Xiao Bai*
  • Jin Zheng
  • Jun Zhou
  • Lin Gu
  • *Corresponding author for this work
  • Beihang University
  • Griffith University Queensland
  • RIKEN
  • The University of Tokyo

Research output: Contribution to journal › Article › peer-review

Abstract

Despite conditional Neural Radiance Fields (NeRF) achieving great success in modeling audio-driven talking portraits, the generation quality is increasingly hampered by the lack of efficient use of space information. This paper presents ER-NeRF, a novel conditional NeRF-based architecture for talking portrait synthesis, and its variant version ER-NeRF++ to concurrently achieve fast convergence, real-time rendering, and state-of-the-art performance with small model size. Inspired by the unequal contribution of spatial regions, we propose two modules in ER-NeRF to guide the talking portrait modeling: (1) A compact and expressive Tri-Plane Hash Representation to improve the accuracy of dynamic head reconstruction by pruning empty spatial regions with three planar hash encoders. (2) A Region Attention Module for the audio–visual feature fusion, including a novel cross-modal attention mechanism to connect audio features with different spatial regions explicitly for local motion priors. Additionally, to tackle the difficulty in learning large facial motions, we propose a deformable variant ER-NeRF++ by including a Deformation Grid Transformer to enable the reuse of cross-regional spatial features for large motion representation. Compared to ER-NeRF, our ER-NeRF++ framework achieves a significant improvement in facial motion quality while maintaining the ability of fast training and real-time rendering. For the torso part, a direct Adaptive Pose Encoding is introduced to simplify the pose information for a better head-torso connection. Extensive experiments demonstrate that both of our proposed frameworks can efficiently render lifelike talking portrait videos with rich realistic details, performing better in image quality and audio-lip synchronization compared to previous methods.
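The idea of encoding a 3D point with three planar hash encoders can be illustrated with a minimal sketch. This is not the authors' implementation: the class names, the single-resolution grid, the table size, and the feature width are all assumptions chosen for brevity, and the XOR-with-primes hash follows the Instant-NGP convention rather than anything stated in the abstract. Each plane (XY, YZ, XZ) looks up features for the projected 2D coordinate and the three results are concatenated.

```python
import random

# Primes for XOR-based spatial hashing (Instant-NGP convention; an assumption here).
PRIME_X, PRIME_Y = 1, 2654435761


class PlaneHashEncoder:
    """Hypothetical single-resolution 2D hash encoder with bilinear interpolation."""

    def __init__(self, resolution, table_size, feat_dim, seed):
        rng = random.Random(seed)
        self.resolution = resolution  # grid vertices per axis, must be >= 2
        self.table_size = table_size
        self.feat_dim = feat_dim
        # Feature table; these values would be learned in a real model.
        self.table = [[rng.uniform(-0.01, 0.01) for _ in range(feat_dim)]
                      for _ in range(table_size)]

    def _lookup(self, ix, iy):
        # Map integer grid coordinates to a table slot.
        index = ((ix * PRIME_X) ^ (iy * PRIME_Y)) % self.table_size
        return self.table[index]

    def encode(self, u, v):
        """Encode a 2D point with u, v in [0, 1]."""
        cells = self.resolution - 1
        x, y = u * cells, v * cells
        x0 = min(int(x), cells - 1)  # clamp so the upper corner stays on the grid
        y0 = min(int(y), cells - 1)
        tx, ty = x - x0, y - y0
        corners = [
            (x0, y0, (1 - tx) * (1 - ty)),
            (x0 + 1, y0, tx * (1 - ty)),
            (x0, y0 + 1, (1 - tx) * ty),
            (x0 + 1, y0 + 1, tx * ty),
        ]
        out = [0.0] * self.feat_dim
        for ix, iy, weight in corners:
            feat = self._lookup(ix, iy)
            for k in range(self.feat_dim):
                out[k] += weight * feat[k]
        return out


class TriPlaneHashEncoder:
    """Hypothetical tri-plane encoder: one 2D hash encoder per axis-aligned plane."""

    def __init__(self, resolution=16, table_size=64, feat_dim=2):
        self.xy = PlaneHashEncoder(resolution, table_size, feat_dim, seed=0)
        self.yz = PlaneHashEncoder(resolution, table_size, feat_dim, seed=1)
        self.xz = PlaneHashEncoder(resolution, table_size, feat_dim, seed=2)

    def encode(self, x, y, z):
        # Project the 3D point onto each plane and concatenate the features.
        return self.xy.encode(x, y) + self.yz.encode(y, z) + self.xz.encode(x, z)


encoder = TriPlaneHashEncoder()
features = encoder.encode(0.2, 0.5, 0.9)
```

In this sketch each 2D table only has to cover a plane rather than a full 3D volume, which is one intuition for why a planar factorization can be compact; the abstract's claim about pruning empty spatial regions is the authors' and is not demonstrated by this toy example.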

Original language: English
Article number: 102456
Journal: Information Fusion
Volume: 110
State: Published - Oct 2024

Keywords

  • Attention mechanism
  • Audio–visual fusion
  • Multimodal transformer
  • Neural radiance fields
  • Talking portrait synthesis
