Hawkeye: Diagnosing RDMA Network Performance Anomalies with PFC Provenance

  • Shicheng Wang*
  • Menghao Zhang*
  • Xiao Li
  • Qiyang Peng
  • Haoyuan Yu
  • Zhiliang Wang*
  • Mingwei Xu
  • Xiaohe Hu
  • Jiahai Yang
  • Xingang Shi

* Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

RDMA is becoming increasingly prevalent from private data centers to public multi-tenant clouds, due to its remarkable performance improvement. However, its lossless traffic control, i.e., PFC, introduces new complexities in network performance anomalies (NPAs) due to its cascading congestion spreading property, which usually incurs complaints from customers/applications about certain flows’ performance degradation. Existing studies fall short in fine-grained visibility of PFC impact and traceability of PFC causality, and are thus ineffective in diagnosing the root causes for RDMA NPAs. In this paper, we propose Hawkeye, an accurate and efficient RDMA NPA diagnosis system based on PFC provenance. Hawkeye comprises 1) a fine-grained PFC-aware telemetry mechanism to record the PFC impact on flows; 2) an in-network PFC causality analysis and tracing mechanism to quickly and efficiently collect causal telemetry for diagnosis; and 3) a provenance-based diagnosis algorithm to comprehensively present the anomaly breakdown, identifying the anomaly type and root causes accurately. Extensive evaluations on both NS-3 simulations and a Tofino testbed show that Hawkeye quickly and accurately diagnoses multiple RDMA NPAs with over 90% precision and 1-4 orders of magnitude lower overhead than baselines.
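
The cascading congestion spreading mentioned in the abstract can be pictured as a chain of PAUSE relationships between switch ports. The following is a minimal sketch of tracing such a chain back to its origin, under stated assumptions; it is not the diagnosis algorithm from the paper. The `paused_by` map, the port names, and the function `trace_pause_roots` are all hypothetical and exist only for illustration.

```python
from collections import deque


def trace_pause_roots(paused_by, start):
    """Walk PAUSE edges backwards from `start` and return candidate origins.

    `paused_by` (hypothetical input) maps a port to the ports that sent it
    PFC PAUSE frames. A port with no senders is treated as a candidate
    origin of the congestion. If `start` itself was never paused, the
    result is simply [start].
    """
    seen = {start}            # guards against loops in the pause graph
    queue = deque([start])
    roots = []
    while queue:
        port = queue.popleft()
        senders = paused_by.get(port, [])
        if not senders:
            roots.append(port)  # nobody paused this port: chain ends here
        for sender in senders:
            if sender not in seen:
                seen.add(sender)
                queue.append(sender)
    return roots


# Hypothetical three-hop pause chain: sw3:p2 paused sw2:p3, which paused sw1:p1.
example = {
    "sw1:p1": ["sw2:p3"],
    "sw2:p3": ["sw3:p2"],
    "sw3:p2": [],
}
print(trace_pause_roots(example, "sw1:p1"))
```

In this sketch a pause loop with no unpaused port yields an empty list, which a real system would need to report separately, for example as a possible PFC deadlock.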

Original language: English
Title of host publication: SIGCOMM 2025 - ACM SIGCOMM 2025 Conference
Publisher: Association for Computing Machinery, Inc
Pages: 481-495
Number of pages: 15
ISBN (Electronic): 9798400715242
DOIs
State: Published - 27 Aug 2025
Event: ACM SIGCOMM 2025 Conference, SIGCOMM 2025 - Coimbra, Portugal
Duration: 8 Sep 2025 to 11 Sep 2025

Publication series

Name: SIGCOMM 2025 - ACM SIGCOMM 2025 Conference

Conference

Conference: ACM SIGCOMM 2025 Conference, SIGCOMM 2025
Country/Territory: Portugal
City: Coimbra
Period: 8/09/25 to 11/09/25

Keywords

  • Network Provenance
  • Performance diagnosis
  • Programmable Networks
  • RDMA Networks
