TY - GEN
T1 - Kair
T2 - 2025 40th IEEE/ACM International Conference on Automated Software Engineering, ASE 2025
AU - Yang, Yitang
AU - Liu, Junhong
AU - Chen, Jiapeng
AU - Sun, Xiaoyang
AU - Wo, Tianyu
AU - Hu, Chunming
AU - Song, Chengru
AU - Ouyang, Jin
AU - Yang, Renyu
N1 - Publisher Copyright: © 2025 IEEE.
PY - 2025
Y1 - 2025
AB - The distributed deep learning training process within large-scale clusters serves as the foundation of contemporary artificial intelligence. However, its inherent characteristics make it particularly sensitive to stragglers, specifically the presence of slow workers, which can significantly decelerate the entire procedure. Observability tools are essential for identifying stragglers within systems. However, the prevailing system profiling tools are either designed for single-node analysis, lacking visibility across multiple workers, or they recognize stragglers but only deliver high-level symptoms, providing engineers with insufficient insight into the underlying causes. We design Kair, a robust production-standard observability tool. Kair uses an innovative hierarchical approach, transitioning from statistical anomaly detection to causal inference. It employs Kolmogorov-Smirnov statistics for the identification of statistically anomalous workers and implements a causal path tracing algorithm to accurately determine the specific operations, such as computation or communication, that are responsible for the delay. Kair has been evaluated in a production cluster of 2,048 NVIDIA A800 GPUs and demonstrated high effectiveness in detecting latent stragglers at the framework level that are often overlooked by conventional tools. It offers precise suggestions that markedly reduce processing inefficiencies and engineering workload.
KW - Distributed Training
KW - Performance Analysis
KW - Straggler Detection
KW - System Observability
UR - https://www.scopus.com/pages/publications/105034696871
U2 - 10.1109/ASE63991.2025.00311
DO - 10.1109/ASE63991.2025.00311
M3 - Conference contribution
AN - SCOPUS:105034696871
T3 - Proceedings - 2025 40th IEEE/ACM International Conference on Automated Software Engineering, ASE 2025
SP - 3754
EP - 3759
BT - Proceedings - 2025 40th IEEE/ACM International Conference on Automated Software Engineering, ASE 2025
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 16 November 2025 through 20 November 2025
ER -