TY - GEN
T1 - DiffDance
T2 - 31st ACM International Conference on Multimedia, MM 2023
AU - Qi, Qiaosong
AU - Zhuo, Le
AU - Zhang, Aixi
AU - Liao, Yue
AU - Fang, Fei
AU - Liu, Si
AU - Yan, Shuicheng
N1 - Publisher Copyright: © 2023 Owner/Author.
PY - 2023/10/27
Y1 - 2023/10/27
AB - When hearing music, it is natural for people to dance to its rhythm. Automatic dance generation, however, is a challenging task due to the physical constraints of human motion and rhythmic alignment with target music. Conventional autoregressive methods introduce compounding errors during sampling and struggle to capture the long-term structure of dance sequences. To address these limitations, we present a novel cascaded motion diffusion model, DiffDance, designed for high-resolution, long-form dance generation. This model comprises a music-to-dance diffusion model and a sequence super-resolution diffusion model. To bridge the gap between music and motion for conditional generation, DiffDance employs a pretrained audio representation learning model to extract music embeddings and further align its embedding space to motion via contrastive loss. During training our cascaded diffusion model, we also incorporate multiple geometric losses to constrain the model outputs to be physically plausible and add a dynamic loss weight that adaptively changes over diffusion timesteps to facilitate sample diversity. Through comprehensive experiments performed on the benchmark dataset AIST++, we demonstrate that DiffDance is capable of generating realistic dance sequences that align effectively with the input music. These results are comparable to those achieved by state-of-the-art autoregressive methods.
KW - conditional generation
KW - diffusion model
KW - multimodal learning
KW - music-to-dance
UR - https://www.scopus.com/pages/publications/85179550154
U2 - 10.1145/3581783.3612307
DO - 10.1145/3581783.3612307
M3 - Conference contribution
AN - SCOPUS:85179550154
T3 - MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia
SP - 1374
EP - 1382
BT - MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
Y2 - 29 October 2023 through 3 November 2023
ER -