Generative Spoken Language Modeling with Quantized Feature Enhancement

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

In the absence of text, training generative models directly on speech data through a next-token prediction task, similar to text-based language models, has been shown to be feasible. However, speech data carries more intricate feature information than text. To capitalize on these additional features, we propose feature-enhanced generative spoken language modeling (fGSLM). We compute the difference between the original speech and its normalized version, and extract quantized features with a VQ-VAE-structured model. These features are then integrated into the generative spoken language modeling (GSLM) framework by fine-tuning the unit language model (uLM) through a multi-stream transformer. To evaluate the effectiveness of our model, we conduct experiments on the ProsAudit evaluation task of the Zero Resource Speech Challenge. Experimental results show that our model significantly improves prosody comprehension at both the sentence and lexical levels, and outperforms baseline models.
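The core feature-extraction step sketched in the abstract — subtract a normalized rendition of the speech from the original, then discretize the residual with a VQ-VAE-style codebook — could look roughly as follows. This is a minimal NumPy sketch under stated assumptions, not the paper's implementation: the function and variable names are illustrative, and a learned codebook stands in for the trained VQ-VAE encoder.

```python
import numpy as np

def quantize_residual(orig_feats, norm_feats, codebook):
    """Discretize the original-minus-normalized residual by nearest-neighbour
    lookup in a codebook (illustrative stand-in for a VQ-VAE quantizer)."""
    # (T, D) residual; intuitively this carries prosodic detail removed by normalization
    residual = orig_feats - norm_feats
    # Euclidean distance from each frame to each of the K codebook vectors: (T, K)
    dists = np.linalg.norm(residual[:, None, :] - codebook[None, :, :], axis=-1)
    codes = dists.argmin(axis=1)          # (T,) discrete unit indices
    return codes, codebook[codes]         # indices and their quantized vectors

# Toy usage with random frame-level features (5 frames, 4 dims, 8 codes)
rng = np.random.default_rng(0)
orig = rng.normal(size=(5, 4))
norm = rng.normal(size=(5, 4))
codebook = rng.normal(size=(8, 4))
codes, quantized = quantize_residual(orig, norm, codebook)
```

In the paper these discrete codes would then be fed, alongside the usual speech units, into the uLM via a multi-stream transformer; that fine-tuning stage is not shown here.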

Original language: English
Title of host publication: 2024 International Joint Conference on Neural Networks, IJCNN 2024 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798350359312
State: Published - 2024
Event: 2024 International Joint Conference on Neural Networks, IJCNN 2024 - Yokohama, Japan
Duration: 30 Jun 2024 – 5 Jul 2024

Publication series

Name: Proceedings of the International Joint Conference on Neural Networks

Conference

Conference: 2024 International Joint Conference on Neural Networks, IJCNN 2024
Country/Territory: Japan
City: Yokohama
Period: 30/06/24 – 5/07/24
