TY - GEN
T1 - Generative Spoken Language Modeling with Quantized Feature Enhancement
AU - Duan, Feiyu
AU - Li, Chen
AU - Wang, Keheng
AU - Wu, Si
AU - Yin, Chuantao
AU - Rong, Wenge
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - In the absence of text, training generative models directly on speech data through a next-token prediction task, similar to text-based language models, has been shown to be feasible. However, speech data encompasses more intricate feature information than text. To capitalize on these additional features, we propose feature-enhanced generative spoken language modeling (fGSLM). We calculate the difference between the original speech and its normalized version, and extract quantized features with a VQVAE-structured model. These features are subsequently integrated into generative spoken language modeling (GSLM) by fine-tuning the unit language model (uLM) through a multi-stream transformer. To evaluate the effectiveness of our model, we conduct experiments on the ProsAudit evaluation task in the Zero Resource Speech Challenge. Experimental results show that our model significantly improves prosody comprehension at both the sentence and lexical levels, and achieves superior performance over baseline models.
AB - In the absence of text, training generative models directly on speech data through a next-token prediction task, similar to text-based language models, has been shown to be feasible. However, speech data encompasses more intricate feature information than text. To capitalize on these additional features, we propose feature-enhanced generative spoken language modeling (fGSLM). We calculate the difference between the original speech and its normalized version, and extract quantized features with a VQVAE-structured model. These features are subsequently integrated into generative spoken language modeling (GSLM) by fine-tuning the unit language model (uLM) through a multi-stream transformer. To evaluate the effectiveness of our model, we conduct experiments on the ProsAudit evaluation task in the Zero Resource Speech Challenge. Experimental results show that our model significantly improves prosody comprehension at both the sentence and lexical levels, and achieves superior performance over baseline models.
UR - https://www.scopus.com/pages/publications/85204972693
U2 - 10.1109/IJCNN60899.2024.10651390
DO - 10.1109/IJCNN60899.2024.10651390
M3 - Conference contribution
AN - SCOPUS:85204972693
T3 - Proceedings of the International Joint Conference on Neural Networks
BT - 2024 International Joint Conference on Neural Networks, IJCNN 2024 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2024 International Joint Conference on Neural Networks, IJCNN 2024
Y2 - 30 June 2024 through 5 July 2024
ER -