
Mitigating I/O Impact of Checkpointing on Large Scale Parallel Systems

  • Beihang University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Checkpointing is the most widely used technique in high performance computing systems for tolerating fail-stop errors and ensuring reliable execution of parallel applications. However, as high performance computers scale up, the number of processors and computing nodes increases rapidly, which amplifies the I/O impact of checkpointing on these systems. On reaching a checkpoint, all nodes generate checkpoint data and write it to the storage system simultaneously, placing a massive burst of traffic and data on the I/O infrastructure, including the interconnection network, the parallel file system and storage. To mitigate the I/O impact of checkpointing, this paper proposes a self-adaptive random delay approach to control the writing of checkpoint data. Each node still generates its checkpoint data at the same time, but writes it according to a self-adaptive random delay policy, which smooths the burst of traffic and data. Experimental and theoretical analysis results show that this approach can mitigate the I/O impact of checkpointing on large scale parallel systems.
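The abstract does not spell out the delay policy, so the following is only a minimal illustrative sketch of the general idea of smoothing a write burst with random delays. It is not the paper's algorithm. The function names, the uniform delay distribution, and the rule for sizing the delay window (`delay_window`) are all assumptions made for illustration.

```python
import random


def delay_window(n_nodes, bytes_per_node, storage_bandwidth):
    """Hypothetical adaptation rule (an assumption, not from the paper):
    spread the writes over the time the storage system would need to
    absorb all checkpoint data at its aggregate bandwidth."""
    return n_nodes * bytes_per_node / storage_bandwidth


def random_delays(n_nodes, window, seed=0):
    """Draw one start delay per node, uniformly from [0, window]."""
    rng = random.Random(seed)
    return [rng.uniform(0.0, window) for _ in range(n_nodes)]


def peak_concurrent_writers(start_times, write_duration):
    """Return the maximum number of nodes writing at the same instant."""
    events = []
    for t in start_times:
        events.append((t, 1))                    # write begins
        events.append((t + write_duration, -1))  # write ends
    # At equal timestamps, process ends (-1) before starts (+1) so that
    # back-to-back writes are not counted as overlapping.
    events.sort()
    active = peak = 0
    for _, delta in events:
        active += delta
        peak = max(peak, active)
    return peak


if __name__ == "__main__":
    n = 1000                                   # illustrative node count
    window = delay_window(n, 10**9, 10**11)    # illustrative sizes
    burst = peak_concurrent_writers([0.0] * n, write_duration=0.5)
    smoothed = peak_concurrent_writers(random_delays(n, window), 0.5)
    print(f"window={window}s, peak writers: {burst} -> {smoothed}")
```

With all nodes starting at time zero, every node writes concurrently. Drawing each start time from the delay window lowers the peak number of simultaneous writers, at the cost of a longer overall checkpoint phase.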

Original language: English
Title of host publication: Proceedings - 20th International Conference on High Performance Computing and Communications, 16th International Conference on Smart City and 4th International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2018
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 117-123
Number of pages: 7
ISBN (Electronic): 9781538666142
State: Published - 22 Jan 2019
Event: 20th International Conference on High Performance Computing and Communications, 16th IEEE International Conference on Smart City and 4th IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2018 - Exeter, United Kingdom
Duration: 28 Jun 2018 to 30 Jun 2018

Publication series

Name: Proceedings - 20th International Conference on High Performance Computing and Communications, 16th International Conference on Smart City and 4th International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2018

Conference

Conference: 20th International Conference on High Performance Computing and Communications, 16th IEEE International Conference on Smart City and 4th IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2018
Country/Territory: United Kingdom
City: Exeter
Period: 28/06/18 to 30/06/18

Keywords

  • Checkpoint
  • Fault tolerance
  • High performance computing
  • Resilience
