
LMC: Lightweight Message Collection for Distributed Training of Deep Learning Models

  • Yihao Zhang
  • Yufan Wang
  • Jie Jia
  • Yi Liu
  • Beihang University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

As deep learning models continue to grow in scale, distributed training has become unavoidable, yet it is both time-consuming and expensive. Effective debugging and tuning are therefore crucial for distributed training, and the messages exchanged among training processes can be a useful source of information. However, there remains a lack of general-purpose communication logging systems that can efficiently capture message contents while maintaining compatibility with mainstream distributed deep learning frameworks. This paper proposes LMC, a lightweight communication collection system designed for distributed deep learning workloads. LMC employs non-intrusive function interception to transparently capture collective communication calls and associates each operation with runtime semantic context through dynamic context injection. To mitigate the impact of the massive volume of messages produced by frequent collective communications during training, LMC incorporates several message-reduction approaches. It also supports configurable logging granularity and generates structured, analyzable logs. Experiments conducted on a CPU-GPU cluster demonstrate the effectiveness of LMC.
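The interception-plus-context idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `comm` namespace, the `all_reduce` stand-in, and the `set_context`, `intercept`, and `LOG` names are all hypothetical, and a real system would patch a framework's communication functions rather than a toy one. It only shows how a call could be wrapped without modifying the caller and tagged with context set elsewhere at runtime.

```python
import contextvars
import functools
import types

# Hypothetical runtime context (e.g. training step, phase); names are assumptions.
_context = contextvars.ContextVar("comm_context", default={})
LOG = []  # in-memory stand-in for a structured log sink


def set_context(**fields):
    """Attach semantic fields to every call intercepted from now on."""
    _context.set({**_context.get(), **fields})


def intercept(namespace, name):
    """Swap namespace.<name> for a wrapper that records each call."""
    original = getattr(namespace, name)

    @functools.wraps(original)
    def wrapper(values, *args, **kwargs):
        result = original(values, *args, **kwargs)
        # Record the operation name, message size, and injected context.
        LOG.append({"op": name, "payload_len": len(values), **_context.get()})
        return result

    setattr(namespace, name, wrapper)
    return original


def _toy_all_reduce(values):
    """Toy stand-in for a collective: every slot receives the sum."""
    return [sum(values)] * len(values)


# Hypothetical communication namespace playing the role of a framework module.
comm = types.SimpleNamespace(all_reduce=_toy_all_reduce)

intercept(comm, "all_reduce")           # caller code stays unchanged
set_context(step=3, phase="backward")   # context injected at runtime
out = comm.all_reduce([1.0, 2.0, 3.0])
```

After the call, `LOG` holds one structured record combining the operation name, the payload length, and the `step` and `phase` fields, which is the kind of entry a log analyzer could later filter by granularity.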

Original language: English
Title of host publication: Proceedings - 2025 27th IEEE International Conference on High Performance Computing and Communications, 11th IEEE International Conference on Data Science and Systems, 23rd IEEE International Conference on Smart City, 11th IEEE International Conference on Dependability in Sensor, Cloud, and Big Data Systems and Applications and 21st IEEE International Conference on Embedded Software and Systems, HPCC/DSS/SmartCity/DependSys/ICESS 2025
Editors: Jia Hu, Geyong Min, Haozhe Wang, Wang Miao, Lexi Xu, Nektarios Georgalas, Zhiwei Zhao, Rui Jin, Guangyao Pang, Wei Han, Fei Hao
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 419-427
Number of pages: 9
ISBN (Electronic): 9798331568740
DOIs
State: Published - 2025
Event: 27th IEEE International Conference on High Performance Computing and Communications, HPCC 2025 - Exeter, United Kingdom
Duration: 13 Aug 2025 to 15 Aug 2025

Publication series

Name: Proceedings - 2025 27th IEEE International Conference on High Performance Computing and Communications, 11th IEEE International Conference on Data Science and Systems, 23rd IEEE International Conference on Smart City, 11th IEEE International Conference on Dependability in Sensor, Cloud, and Big Data Systems and Applications and 21st IEEE International Conference on Embedded Software and Systems, HPCC/DSS/SmartCity/DependSys/ICESS 2025

Conference

Conference: 27th IEEE International Conference on High Performance Computing and Communications, HPCC 2025
Country/Territory: United Kingdom
City: Exeter
Period: 13/08/25 to 15/08/25

Keywords

  • Collective Communication
  • Distributed Deep Learning
  • Instrumentation
  • Lightweight Message Collecting
