TY - GEN
T1 - AECodec
T2 - 3rd International Conference on Artificial Intelligence, Human-Computer Interaction and Robotics, AIHCIR 2024
AU - Zhang, Lingfeng
AU - Chen, Lijiang
AU - Su, Yuye
AU - Cui, Chunfeng
AU - Zhao, Qi
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
AB - With the rapid development of communication devices and voice interactions, speech is playing an increasingly important role. Given the growing demand for voice transmission, efficiently transmitting high-quality speech signals within limited bandwidth becomes a challenge. This paper introduces a high-fidelity speech codec that utilizes both speech signals and electroglottograph signals for speech compression. The codec is capable of compressing and decompressing speech, significantly reducing the bitrate while maintaining high fidelity. The proposed codec employs a streaming encoder-decoder architecture and is trained in an end-to-end manner. In addition to using speech signals as input, the electroglottograph signal is used as an auxiliary input, leveraging its ability to capture vocal fold movement characteristics such as closure degree, closure speed, and cycle duration, thus enhancing the model’s feature extraction capability. Moreover, the encoder-decoder structure integrates a Transformer Encoder module with residual connections, further improving the model's ability to process time series data. To validate the effectiveness of this approach, we conducted extensive objective evaluations and experimental studies across various bandwidths, proving our approach is superior to the baseline methods.
KW - Codecs
KW - Residual neural networks
KW - Speech coding
UR - https://www.scopus.com/pages/publications/105004909442
U2 - 10.1109/AIHCIR65563.2024.00026
DO - 10.1109/AIHCIR65563.2024.00026
M3 - Conference contribution
AN - SCOPUS:105004909442
T3 - Proceedings - 2024 3rd International Conference on Artificial Intelligence, Human-Computer Interaction and Robotics, AIHCIR 2024
SP - 111
EP - 115
BT - Proceedings - 2024 3rd International Conference on Artificial Intelligence, Human-Computer Interaction and Robotics, AIHCIR 2024
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 15 November 2024 through 17 November 2024
ER -