TY - GEN
T1 - LyricWhiz
T2 - 24th International Society for Music Information Retrieval Conference, ISMIR 2023
AU - Zhuo, Le
AU - Yuan, Ruibin
AU - Pan, Jiahao
AU - Ma, Yinghao
AU - Li, Yizhi
AU - Zhang, Ge
AU - Liu, Si
AU - Dannenberg, Roger
AU - Fu, Jie
AU - Lin, Chenghua
AU - Benetos, Emmanouil
AU - Chen, Wenhu
AU - Xue, Wei
AU - Guo, Yike
N1 - Publisher Copyright:
© L Zhuo, R Yuan, and J Pan.
PY - 2023
Y1 - 2023
N2 - We introduce LyricWhiz, a robust, multilingual, and zero-shot automatic lyrics transcription method achieving state-of-the-art performance on various lyrics transcription datasets, even in challenging genres such as rock and metal. Our novel, training-free approach utilizes Whisper, a weakly supervised robust speech recognition model, and GPT-4, today's most performant chat-based large language model. In the proposed method, Whisper functions as the "ear" by transcribing the audio, while GPT-4 serves as the "brain," acting as an annotator with a strong performance for contextualized output selection and correction. Our experiments show that LyricWhiz significantly reduces Word Error Rate compared to existing methods in English and can effectively transcribe lyrics across multiple languages. Furthermore, we use LyricWhiz to create the first publicly available, large-scale, multilingual lyrics transcription dataset with a CC-BY-NC-SA copyright license, based on MTG-Jamendo, and offer a human-annotated subset for noise level estimation and evaluation. We anticipate that our proposed method and dataset will advance the development of multilingual lyrics transcription, a challenging and emerging task.
AB - We introduce LyricWhiz, a robust, multilingual, and zero-shot automatic lyrics transcription method achieving state-of-the-art performance on various lyrics transcription datasets, even in challenging genres such as rock and metal. Our novel, training-free approach utilizes Whisper, a weakly supervised robust speech recognition model, and GPT-4, today's most performant chat-based large language model. In the proposed method, Whisper functions as the "ear" by transcribing the audio, while GPT-4 serves as the "brain," acting as an annotator with a strong performance for contextualized output selection and correction. Our experiments show that LyricWhiz significantly reduces Word Error Rate compared to existing methods in English and can effectively transcribe lyrics across multiple languages. Furthermore, we use LyricWhiz to create the first publicly available, large-scale, multilingual lyrics transcription dataset with a CC-BY-NC-SA copyright license, based on MTG-Jamendo, and offer a human-annotated subset for noise level estimation and evaluation. We anticipate that our proposed method and dataset will advance the development of multilingual lyrics transcription, a challenging and emerging task.
UR - https://www.scopus.com/pages/publications/85179173818
M3 - Conference contribution
AN - SCOPUS:85179173818
T3 - 24th International Society for Music Information Retrieval Conference, ISMIR 2023 - Proceedings
SP - 343
EP - 351
BT - 24th International Society for Music Information Retrieval Conference, ISMIR 2023 - Proceedings
A2 - Sarti, Augusto
A2 - Antonacci, Fabio
A2 - Sandler, Mark
A2 - Bestagini, Paolo
A2 - Dixon, Simon
A2 - Liang, Beici
A2 - Richard, Gael
A2 - Pauwels, Johan
PB - International Society for Music Information Retrieval
Y2 - 5 November 2023 through 9 November 2023
ER -