
DA-KD: Difficulty-Aware Knowledge Distillation for Efficient Large Language Models

Research output: Contribution to journal › Conference article › peer-review

Abstract

Although knowledge distillation (KD) is an effective approach to improve the performance of a smaller LLM (i.e., the student model) by transferring knowledge from a large LLM (i.e., the teacher model), it still suffers from high training cost. Existing LLM distillation methods ignore the difficulty differences among samples, making the distillation of easy samples unnecessary and driving up distillation cost. In this paper, we propose a difficulty-aware knowledge distillation (DA-KD) framework for efficient knowledge distillation, in which we dynamically adjust the distillation dataset based on the difficulty of samples. We further observe that existing KD losses perform poorly when most samples in the distillation dataset are difficult, owing to unstable optimization and the neglect of hard samples. Therefore, we also propose a new KD loss called bidirectional discrepancy loss (BDL) for effective KD. Extensive experiments demonstrate that our DA-KD framework is both effective and efficient. Without bells and whistles, DA-KD outperforms existing state-of-the-art KD methods by 2% at half the training cost and even surpasses the teacher model with 4.7× compression.
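The core idea of dynamically adjusting the distillation dataset can be sketched as filtering out samples the student already handles well. The sketch below is an illustration only, not the paper's implementation: the function names (`select_hard_samples`), the use of forward KL divergence as the difficulty proxy, and the fixed `threshold` are all assumptions; the paper's actual difficulty measure and BDL loss may differ.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q), summed over the last (vocabulary) axis.
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def select_hard_samples(teacher_logits, student_logits, threshold):
    """Return indices of samples whose teacher-student divergence
    exceeds `threshold`; easy samples are dropped from distillation.

    Difficulty is proxied here by forward KL between teacher and
    student output distributions (a hypothetical choice).
    """
    p_teacher = softmax(teacher_logits)
    p_student = softmax(student_logits)
    difficulty = kl_divergence(p_teacher, p_student)
    return np.nonzero(difficulty > threshold)[0]

# Toy example: sample 0 disagrees with the teacher, sample 1 matches it.
teacher = np.array([[5.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
student = np.array([[0.0, 0.0, 5.0], [1.0, 0.0, 0.0]])
hard = select_hard_samples(teacher, student, threshold=0.5)
```

Only the indices in `hard` would be kept for the next round of distillation, so the training set shrinks as the student improves.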

Original language: English
Pages (from-to): 22379-22391
Number of pages: 13
Journal: Proceedings of Machine Learning Research
Volume: 267
Publication status: Published - 2025
Event: 42nd International Conference on Machine Learning, ICML 2025 - Vancouver, Canada
Duration: 13 Jul 2025 → 19 Jul 2025
