
Kair: A Statistical and Causal Approach to Pinpointing Stragglers in Distributed Model Training

  • Beihang University
  • Kuaishou
  • University of Leeds

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Distributed deep learning training on large-scale clusters underpins contemporary artificial intelligence. However, such training is particularly sensitive to stragglers, that is, slow workers that can significantly slow down the entire job. Observability tools are essential for identifying stragglers. Yet prevailing system profiling tools are either designed for single-node analysis and lack visibility across multiple workers, or they recognize stragglers but report only high-level symptoms, giving engineers insufficient insight into the underlying causes.

We design Kair, a robust production-standard observability tool. Kair uses a hierarchical approach that moves from statistical anomaly detection to causal inference. It employs Kolmogorov-Smirnov statistics to identify statistically anomalous workers and a causal path tracing algorithm to determine the specific operations, such as computation or communication, that are responsible for the delay. Kair has been evaluated in a production cluster of 2,048 NVIDIA A800 GPUs, where it was highly effective at detecting latent framework-level stragglers that conventional tools often overlook. It offers precise suggestions that markedly reduce processing inefficiencies and engineering workload.
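The abstract names Kolmogorov-Smirnov statistics as the first detection stage but gives no implementation details. The sketch below is only an illustration of the general idea, not Kair's method. It assumes that each worker reports a list of per-step durations and that a worker is compared against the pooled samples of all other workers. The function names, the worker labels, the sample data, and the 0.8 threshold are all hypothetical.

```python
from bisect import bisect_right
from typing import Dict, List


def ks_statistic(a: List[float], b: List[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    max_gap = 0.0
    # The largest gap between two step-function CDFs occurs at an observed sample value.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


def flag_anomalous_workers(
    step_times: Dict[str, List[float]], threshold: float = 0.8
) -> List[str]:
    """Flag workers whose step-time distribution deviates from the pooled rest.

    Assumption: pooling all other workers as the reference and using a fixed
    threshold are illustrative choices, not taken from the paper.
    """
    flagged = []
    for worker, samples in step_times.items():
        rest = [t for other, ts in step_times.items() if other != worker for t in ts]
        if rest and ks_statistic(samples, rest) > threshold:
            flagged.append(worker)
    return flagged


# Hypothetical per-step durations in milliseconds for four workers.
step_times = {
    "worker-0": [101.0, 99.5, 100.2, 100.8, 99.9],
    "worker-1": [100.4, 100.1, 99.7, 101.2, 100.0],
    "worker-2": [131.0, 128.4, 133.7, 129.9, 130.5],  # consistently slower
    "worker-3": [99.8, 100.6, 100.3, 99.4, 100.9],
}
print(flag_anomalous_workers(step_times))  # prints ['worker-2']
```

A distribution-level comparison like this can flag a worker that is consistently slower even when no single step looks extreme. Identifying which operation causes the delay would be the job of the later causal stage, which this sketch does not attempt.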

Original language: English
Title of host publication: Proceedings - 2025 40th IEEE/ACM International Conference on Automated Software Engineering, ASE 2025
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 3754-3759
Number of pages: 6
ISBN (Electronic): 9798350357332
DOI
Publication status: Published - 2025
Event: 2025 40th IEEE/ACM International Conference on Automated Software Engineering, ASE 2025 - Seoul, South Korea
Duration: 16 Nov 2025 to 20 Nov 2025

Publication series

Name: Proceedings - 2025 40th IEEE/ACM International Conference on Automated Software Engineering, ASE 2025

Conference

Conference: 2025 40th IEEE/ACM International Conference on Automated Software Engineering, ASE 2025
Country/Territory: South Korea
City: Seoul
Period: 16/11/25 to 20/11/25
