LMC: Lightweight Message Collection for Distributed Training of Deep Learning Models

  • Yihao Zhang
  • Yufan Wang
  • Jie Jia
  • Yi Liu
  • Beihang University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

As the scale of deep learning models continues to expand, distributed training has become an inevitable trend, but it is both time-consuming and expensive. Effective debugging and tuning are therefore crucial for distributed training, and the messages exchanged among training processes can be a useful source of information. However, there remains a lack of general-purpose communication logging systems that can efficiently capture message contents while maintaining compatibility with mainstream distributed deep learning frameworks. This paper proposes LMC, a lightweight communication collection system designed for distributed deep learning workloads. LMC employs non-intrusive function interception to transparently capture collective communication calls and associates each operation with runtime semantic context through dynamic context injection. To mitigate the impact of the massive volume of messages produced by frequent collective communications during training, LMC incorporates several message-reduction approaches. It also supports configurable logging granularity and generates structured, analyzable logs. Experiments are conducted on a CPU-GPU cluster, and the results demonstrate the effectiveness of LMC.

Original language: English
Title of host publication: Proceedings - 2025 27th IEEE International Conference on High Performance Computing and Communications, 11th IEEE International Conference on Data Science and Systems, 23rd IEEE International Conference on Smart City, 11th IEEE International Conference on Dependability in Sensor, Cloud, and Big Data Systems and Applications and 21st IEEE International Conference on Embedded Software and Systems, HPCC/DSS/SmartCity/DependSys/ICESS 2025
Editors: Jia Hu, Geyong Min, Haozhe Wang, Wang Miao, Lexi Xu, Nektarios Georgalas, Zhiwei Zhao, Rui Jin, Guangyao Pang, Wei Han, Fei Hao
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 419-427
Number of pages: 9
ISBN (electronic): 9798331568740
DOI
Publication status: Published - 2025
Event: 27th IEEE International Conference on High Performance Computing and Communications, HPCC 2025 - Exeter, United Kingdom
Duration: 13 Aug 2025 to 15 Aug 2025

Publication series

Name: Proceedings - 2025 27th IEEE International Conference on High Performance Computing and Communications, 11th IEEE International Conference on Data Science and Systems, 23rd IEEE International Conference on Smart City, 11th IEEE International Conference on Dependability in Sensor, Cloud, and Big Data Systems and Applications and 21st IEEE International Conference on Embedded Software and Systems, HPCC/DSS/SmartCity/DependSys/ICESS 2025

Conference

Conference: 27th IEEE International Conference on High Performance Computing and Communications, HPCC 2025
Country/Territory: United Kingdom
City: Exeter
Period: 13/08/25 to 15/08/25
