TY - GEN
T1 - Hawkeye
T2 - ACM SIGCOMM 2025 Conference, SIGCOMM 2025
AU - Wang, Shicheng
AU - Zhang, Menghao
AU - Li, Xiao
AU - Peng, Qiyang
AU - Yu, Haoyuan
AU - Wang, Zhiliang
AU - Xu, Mingwei
AU - Hu, Xiaohe
AU - Yang, Jiahai
AU - Shi, Xingang
N1 - Publisher Copyright: © 2025 Copyright held by the owner/author(s).
PY - 2025/8/27
Y1 - 2025/8/27
N2 - RDMA is becoming increasingly prevalent from private data centers to public multi-tenant clouds, due to its remarkable performance improvement. However, its lossless traffic control, i.e., PFC, introduces new complexities in network performance anomalies (NPAs) due to its cascading congestion spreading property, which usually incurs complaints from customers/applications about certain flows’ performance degradation. Existing studies fall short in fine-grained visibility of PFC impact and traceability of PFC causality, and are thus ineffective in diagnosing the root causes for RDMA NPAs. In this paper, we propose Hawkeye, an accurate and efficient RDMA NPA diagnosis system based on PFC provenance. Hawkeye comprises 1) a fine-grained PFC-aware telemetry mechanism to record the PFC impact on flows; 2) an in-network PFC causality analysis and tracing mechanism to quickly and efficiently collect causal telemetry for diagnosis; and 3) a provenance-based diagnosis algorithm to comprehensively present the anomaly breakdown, identifying the anomaly type and root causes accurately. Through extensive evaluations on both NS-3 simulations and a Tofino testbed, Hawkeye can quickly and accurately diagnose multiple RDMA NPAs with over 90% precision and 1-4 orders of magnitude lower overhead than baselines.
KW - Network Provenance
KW - Performance diagnosis
KW - Programmable Networks
KW - RDMA Networks
UR - https://www.scopus.com/pages/publications/105016123009
U2 - 10.1145/3718958.3750490
DO - 10.1145/3718958.3750490
M3 - Conference contribution
AN - SCOPUS:105016123009
T3 - SIGCOMM 2025 - ACM SIGCOMM 2025 Conference
SP - 481
EP - 495
BT - SIGCOMM 2025 - ACM SIGCOMM 2025 Conference
PB - Association for Computing Machinery, Inc
Y2 - 8 September 2025 through 11 September 2025
ER -