Abstract
Although knowledge distillation (KD) is an effective approach to improve the performance of a smaller LLM (i.e., the student model) by transferring knowledge from a large LLM (i.e., the teacher model), it still suffers from high training cost. Existing LLM distillation methods ignore the difficulty differences among samples, so easy samples are distilled unnecessarily, which leads to high distillation cost. In this paper, we propose a difficulty-aware knowledge distillation (DA-KD) framework for efficient knowledge distillation, in which we dynamically adjust the distillation dataset based on the difficulty of samples. We further observe that the existing KD loss performs poorly when most samples in the distillation dataset are difficult, because of unstable optimization and the neglect of hard samples. We therefore also propose a new KD loss, the bidirectional discrepancy loss (BDL), for effective KD. Extensive experiments demonstrate that our DA-KD framework is both effective and efficient. Without bells and whistles, DA-KD outperforms existing state-of-the-art KD methods by 2% with half the training cost and even surpasses the teacher model with 4.7× compression.
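The abstract's core idea of dynamically restricting distillation to difficult samples can be illustrated with a minimal sketch. This is not the paper's implementation: the difficulty measure (here, the per-sample teacher–student KL divergence), the fixed threshold, and the Hugging Face-style `.logits` model interface are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def per_sample_difficulty(student_logits, teacher_logits, temperature=1.0):
    """Proxy difficulty score: forward KL between teacher and student token
    distributions, averaged over the sequence. (Illustrative choice only;
    the paper's actual difficulty measure may differ.)"""
    # logits shape: (batch, seq_len, vocab)
    t_logp = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)
    kl = (t_logp.exp() * (t_logp - s_logp)).sum(dim=-1)  # (batch, seq_len)
    return kl.mean(dim=-1)                               # (batch,)

def select_hard_samples(dataloader, student, teacher, threshold, device="cpu"):
    """Keep only samples whose current difficulty exceeds `threshold`,
    dropping easy samples from the distillation set."""
    kept = []
    student.eval(); teacher.eval()
    with torch.no_grad():
        for batch in dataloader:
            input_ids = batch["input_ids"].to(device)
            # Assumes Hugging Face-style models returning an object with .logits
            s_logits = student(input_ids).logits
            t_logits = teacher(input_ids).logits
            diff = per_sample_difficulty(s_logits, t_logits)
            for i, d in enumerate(diff):
                if d.item() > threshold:
                    kept.append({k: v[i] for k, v in batch.items()})
    return kept
```

Re-running such a selection periodically during training would shrink the distillation set as the student masters easy samples, which is the efficiency mechanism the abstract describes.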
| Original language | English |
|---|---|
| Pages (from-to) | 22379-22391 |
| Number of pages | 13 |
| Journal | Proceedings of Machine Learning Research |
| Volume | 267 |
| Publication status | Published - 2025 |
| Event | 42nd International Conference on Machine Learning, ICML 2025 - Vancouver, Canada, Duration: 13 Jul 2025 → 19 Jul 2025 |