TY - GEN
T1 - LMC
T2 - 27th IEEE International Conference on High Performance Computing and Communications, HPCC 2025
AU - Zhang, Yihao
AU - Wang, Yufan
AU - Jia, Jie
AU - Liu, Yi
N1 - Publisher Copyright: © 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - As the scale of deep learning models continues to expand, distributed training has become inevitable, yet it is both time-consuming and expensive. Effective debugging and tuning are therefore crucial for distributed training, and the messages exchanged among training processes can be useful information for both. However, there remains a lack of general-purpose communication logging systems that can efficiently capture message contents while maintaining compatibility with mainstream distributed deep learning frameworks. This paper proposes LMC, a lightweight communication collection system designed for distributed deep learning workloads. LMC employs non-intrusive function interception to transparently capture collective communication calls and associates each operation with runtime semantic context through dynamic context injection. To mitigate the impact of the massive volume of messages produced by frequent collective communications during training, LMC incorporates several message-reduction approaches. It also supports configurable logging granularity and generates structured, analyzable logs. Experiments are conducted on a CPU-GPU cluster, and the results demonstrate the effectiveness of LMC.
AB - As the scale of deep learning models continues to expand, distributed training has become inevitable, yet it is both time-consuming and expensive. Effective debugging and tuning are therefore crucial for distributed training, and the messages exchanged among training processes can be useful information for both. However, there remains a lack of general-purpose communication logging systems that can efficiently capture message contents while maintaining compatibility with mainstream distributed deep learning frameworks. This paper proposes LMC, a lightweight communication collection system designed for distributed deep learning workloads. LMC employs non-intrusive function interception to transparently capture collective communication calls and associates each operation with runtime semantic context through dynamic context injection. To mitigate the impact of the massive volume of messages produced by frequent collective communications during training, LMC incorporates several message-reduction approaches. It also supports configurable logging granularity and generates structured, analyzable logs. Experiments are conducted on a CPU-GPU cluster, and the results demonstrate the effectiveness of LMC.
KW - Collective Communication
KW - Distributed Deep Learning
KW - Instrumentation
KW - Lightweight Message Collecting
UR - https://www.scopus.com/pages/publications/105022709080
U2 - 10.1109/HPCC67675.2025.00073
DO - 10.1109/HPCC67675.2025.00073
M3 - Conference contribution
AN - SCOPUS:105022709080
T3 - Proceedings - 2025 27th IEEE International Conference on High Performance Computing and Communications, 11th IEEE International Conference on Data Science and Systems, 23rd IEEE International Conference on Smart City, 11th IEEE International Conference on Dependability in Sensor, Cloud, and Big Data Systems and Applications and 21st IEEE International Conference on Embedded Software and Systems, HPCC/DSS/SmartCity/DependSys/ICESS 2025
SP - 419
EP - 427
BT - Proceedings - 2025 27th IEEE International Conference on High Performance Computing and Communications, 11th IEEE International Conference on Data Science and Systems, 23rd IEEE International Conference on Smart City, 11th IEEE International Conference on Dependability in Sensor, Cloud, and Big Data Systems and Applications and 21st IEEE International Conference on Embedded Software and Systems, HPCC/DSS/SmartCity/DependSys/ICESS 2025
A2 - Hu, Jia
A2 - Min, Geyong
A2 - Wang, Haozhe
A2 - Miao, Wang
A2 - Xu, Lexi
A2 - Georgalas, Nektarios
A2 - Zhao, Zhiwei
A2 - Jin, Rui
A2 - Pang, Guangyao
A2 - Han, Wei
A2 - Hao, Fei
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 13 August 2025 through 15 August 2025
ER -