
并行程序运行故障原因识别

Translated title of the contribution: Identifying causes of execution failure for parallel programs
  • Yi Liu
  • Yulin Gao
  • Guozhen Zhang
  • Beihang University

Research output: Contribution to journal › Article › peer-review

Abstract

As high-performance computing systems grow in scale and complexity, the mean time between failures is getting shorter, which increases the probability that a parallel program's execution fails because of hardware or software faults. In addition, programming errors (i.e., bugs) in parallel programs can also lead to execution failure. The approaches for dealing with these two types of execution failure are completely different; therefore, when an execution failure occurs, the programmer must determine whether it was caused by a system fault or a programming bug. In response to this issue, a system for identifying the causes of execution failures of parallel programs was designed and implemented on the basis of Slurm. The system supports all the features of Slurm and can additionally monitor job status and re-submit and re-run jobs. Experimental results from job execution show that the system can distinguish between the types of program execution failure. Experiments conducted with fault injection also demonstrate the accuracy of the system.

Original language: Chinese (Traditional)
Pages (from-to): 45-52
Number of pages: 8
Journal: Guofang Keji Daxue Xuebao/Journal of National University of Defense Technology
Volume: 44
Issue number: 5
State: Published - Oct 2022
