TY - GEN
T1 - Vedrfolnir
T2 - ACM SIGCOMM 2025 Posters and Demos, Part of SIGCOMM 2025
AU - Chen, Yuxuan
AU - Zhang, Menghao
AU - Li, Xiheng
AU - Jiao, Fangzheng
AU - Hu, Chunming
N1 - Publisher Copyright: © 2025 Copyright held by the owner/author(s).
PY - 2025/9/10
Y1 - 2025/9/10
N2 - Collective communication becomes increasingly crucial as large language models rapidly evolve, but the RDMA it uses inevitably faces network performance anomalies (NPAs). Vedrfolnir is an accurate and efficient diagnosis system for RDMA NPAs in collective communication, which (1) constructs waiting graphs through algorithm decomposition, (2) adaptively detects anomalies while efficiently collecting diagnostic data, and (3) precisely analyzes performance bottlenecks and root causes. Evaluation shows that Vedrfolnir can achieve accurate diagnosis results with low overhead.
AB - Collective communication becomes increasingly crucial as large language models rapidly evolve, but the RDMA it uses inevitably faces network performance anomalies (NPAs). Vedrfolnir is an accurate and efficient diagnosis system for RDMA NPAs in collective communication, which (1) constructs waiting graphs through algorithm decomposition, (2) adaptively detects anomalies while efficiently collecting diagnostic data, and (3) precisely analyzes performance bottlenecks and root causes. Evaluation shows that Vedrfolnir can achieve accurate diagnosis results with low overhead.
KW - Collective Communication
KW - Network Performance Anomalies Diagnosis
KW - Remote Direct Memory Access
UR - https://www.scopus.com/pages/publications/105018223042
U2 - 10.1145/3744969.3748396
DO - 10.1145/3744969.3748396
M3 - Conference contribution
AN - SCOPUS:105018223042
T3 - SIGCOMM 2025 - Proceedings of the 2025 ACM SIGCOMM 2025 Posters and Demos
SP - 10
EP - 12
BT - SIGCOMM 2025 - Proceedings of the 2025 ACM SIGCOMM 2025 Posters and Demos
PB - Association for Computing Machinery, Inc
Y2 - 8 September 2025 through 11 September 2025
ER -