Vedrfolnir: RDMA Network Performance Anomalies Diagnosis in Collective Communications

  • Yuxuan Chen
  • Menghao Zhang*
  • Xiheng Li
  • Fangzheng Jiao
  • Hu Chunming
  * Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Peer-reviewed

Abstract

Collective communication becomes increasingly crucial as large language models rapidly evolve, but the RDMA it uses inevitably faces network performance anomalies (NPAs). Vedrfolnir is an accurate and efficient diagnosis system for RDMA NPAs in collective communication, which (1) constructs waiting graphs through algorithm decomposition, (2) adaptively detects anomalies while efficiently collecting diagnostic data, and (3) precisely analyzes performance bottlenecks and root causes. Evaluation shows that Vedrfolnir can achieve accurate diagnosis results with low overhead.

Original language: English
Title of host publication: SIGCOMM 2025 - Proceedings of the 2025 ACM SIGCOMM 2025 Posters and Demos
Publisher: Association for Computing Machinery, Inc
Pages: 10-12
Number of pages: 3
ISBN (electronic): 9798400720260
DOI
Publication status: Published - 10 Sep 2025
Event: ACM SIGCOMM 2025 Posters and Demos, Part of SIGCOMM 2025 - Coimbra, Portugal
Duration: 8 Sep 2025 to 11 Sep 2025

Publication series

Name: SIGCOMM 2025 - Proceedings of the 2025 ACM SIGCOMM 2025 Posters and Demos

Conference

Conference: ACM SIGCOMM 2025 Posters and Demos, Part of SIGCOMM 2025
Country/Territory: Portugal
City: Coimbra
Period: 8/09/25 to 11/09/25
