
A Lightweight and Flexible Tool for Distinguishing between Hardware Malfunctions and Program Bugs in Debugging Large-Scale Programs

  • Beihang University

Research output: Contribution to journal, Article, peer-reviewed

Abstract

In this paper, we propose a new technique for determining whether a program failure is caused by a hardware malfunction or by a program bug. The technique mitigates the impact that a shorter mean time between failures has on the debugging process on future exascale supercomputers, and it improves the productivity of debugging large-scale parallel programs. It detects program failures by observing abnormal message-passing behavior with distributed monitors, and it uses an event-driven mechanism to trigger global status checking among different node groups concurrently. In addition, both coarse-grained execution snapshots and fine-grained failure events can be provided for further failure diagnosis and bug analysis. We implement this technique as a user-space library named Failure Cause Resolver (FCR). Experimental results on the Tianhe-2 supercomputer demonstrate that the failure-detection latency of FCR is acceptable and that its overhead is negligible. FCR does not require administrative privileges and can be easily integrated into existing large-scale parallel programs.
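As a rough illustration of the idea in the abstract, the sketch below shows how a timeout on observed message traffic could trigger a status check over a node group and lead to a tentative cause. It is not FCR's actual interface or algorithm: the names `NodeStatus`, `stalled_nodes`, and `classify_failure`, the timeout value, and the classification rule are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class NodeStatus:
    """Hypothetical per-node record kept by a monitor (not part of FCR)."""
    node_id: int
    reachable: bool            # did the node answer the status probe?
    last_message_time: float   # time of the last observed message, in seconds


def stalled_nodes(statuses, now, timeout):
    """Return ids of nodes whose message traffic has been silent longer than timeout."""
    return [s.node_id for s in statuses if now - s.last_message_time > timeout]


def classify_failure(statuses, now, timeout):
    """Assumed rule for illustration only.

    No stalled node                 -> "healthy"
    Stall and an unreachable node   -> "suspected hardware malfunction"
    Stall but all nodes reachable   -> "suspected program bug"
    """
    if not stalled_nodes(statuses, now, timeout):
        return "healthy"
    if any(not s.reachable for s in statuses):
        return "suspected hardware malfunction"
    return "suspected program bug"


# One node group checked at time 100 s with a 30 s silence threshold.
group = [NodeStatus(0, True, 95.0), NodeStatus(1, True, 40.0)]
print(classify_failure(group, now=100.0, timeout=30.0))
```

In this sketch a stall is only the trigger event. The reachability probe is what separates the two causes, which loosely mirrors the abstract's combination of message-passing monitoring and global status checking.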

Original language: English
Article number: 8540813
Pages (from-to): 71892-71905
Number of pages: 14
Journal: IEEE Access
Volume: 6
State: Published - 2018

Keywords

  • Failure detection
  • Hardware malfunction
  • Parallel program bug
