Kair: A Statistical and Causal Approach to Pinpointing Stragglers in Distributed Model Training

  • Yitang Yang
  • Junhong Liu
  • Jiapeng Chen
  • Xiaoyang Sun*
  • Tianyu Wo
  • Chunming Hu
  • Chengru Song
  • Jin Ouyang
  • Renyu Yang

*Corresponding author for this work

  • Beihang University
  • Kuaishou
  • University of Leeds

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

The distributed deep learning training process within large-scale clusters serves as the foundation of contemporary artificial intelligence. However, its inherent characteristics make it particularly sensitive to stragglers, specifically the presence of slow workers, which can significantly decelerate the entire procedure. Observability tools are essential for identifying stragglers within systems. However, the prevailing system profiling tools are either designed for single-node analysis, lacking visibility across multiple workers, or they recognize stragglers but only deliver high-level symptoms, providing engineers with insufficient insight into the underlying causes. We design Kair, a robust production-standard observability tool. Kair uses an innovative hierarchical approach, transitioning from statistical anomaly detection to causal inference. It employs Kolmogorov-Smirnov statistics for the identification of statistically anomalous workers and implements a causal path tracing algorithm to accurately determine the specific operations, such as computation or communication, that are responsible for the delay. Kair has been evaluated in a production cluster of 2,048 NVIDIA A800 GPUs and demonstrated high effectiveness in detecting latent stragglers at the framework level that are often overlooked by conventional tools. It offers precise suggestions that markedly reduce processing inefficiencies and engineering workload.
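
As a rough illustration of the statistical stage described in the abstract, the sketch below computes a two-sample Kolmogorov-Smirnov statistic between one worker's step times and the pooled step times of all other workers, then flags workers that are both distributionally different and slower. This is not Kair's implementation: the function names (`ks_statistic`, `flag_stragglers`), the `threshold` value, the median check, and the sample timings are all assumptions made for the example, and the abstract does not say how Kair applies the statistic.

```python
from bisect import bisect_right
from statistics import median


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    gap = 0.0
    # The gap can only change at observed values, so checking those suffices.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def flag_stragglers(step_times, threshold=0.5):
    """Flag workers whose step times differ from, and exceed, everyone else's.

    `threshold` is an arbitrary illustrative cutoff, not a value from the paper.
    """
    flagged = []
    for worker, times in step_times.items():
        # Pool the samples of every other worker as the reference distribution.
        rest = [t for w, ts in step_times.items() if w != worker for t in ts]
        if ks_statistic(times, rest) > threshold and median(times) > median(rest):
            flagged.append(worker)
    return flagged


# Hypothetical per-step durations in seconds for four workers.
step_times = {
    "worker0": [1.00, 1.02, 0.99, 1.01],
    "worker1": [1.01, 0.98, 1.00, 1.03],
    "worker2": [1.02, 1.00, 0.99, 1.01],
    "worker3": [1.40, 1.38, 1.45, 1.41],
}
stragglers = flag_stragglers(step_times)
print(stragglers)
```

Comparing whole distributions rather than means is what makes a KS-style test attractive here: a worker that is only occasionally slow still shifts its empirical CDF. The causal path tracing stage that attributes the delay to specific operations is not sketched, since the abstract gives no detail on it.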

Original language: English
Title of host publication: Proceedings - 2025 40th IEEE/ACM International Conference on Automated Software Engineering, ASE 2025
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 3754-3759
Number of pages: 6
ISBN (Electronic): 9798350357332
State: Published - 2025
Event: 2025 40th IEEE/ACM International Conference on Automated Software Engineering, ASE 2025 - Seoul, Korea, Republic of
Duration: 16 Nov 2025 to 20 Nov 2025

Publication series

Name: Proceedings - 2025 40th IEEE/ACM International Conference on Automated Software Engineering, ASE 2025

Conference

Conference: 2025 40th IEEE/ACM International Conference on Automated Software Engineering, ASE 2025
Country/Territory: Korea, Republic of
City: Seoul
Period: 16/11/25 to 20/11/25

Keywords

  • Distributed Training
  • Performance Analysis
  • Straggler Detection
  • System Observability
