
AdapCK: Optimizing I/O for Checkpointing on Large-Scale High Performance Computing Systems

  • Jie Jia*
  • Yi Liu
  • Yanke Liu
  • Yifan Chen
  • Fang Lin

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

As high-performance computing (HPC) systems scale up, resilience has become an important challenge. Checkpointing, a widely used resilience technique for HPC systems, saves checkpoints of the system during the execution of parallel programs and, in case of failure, resumes execution from the most recent checkpoint. However, large-scale parallel programs often spawn thousands of processes, which results in large volumes of simultaneous writes at each checkpoint and puts pressure on the storage and parallel file systems of the HPC system. To tackle this problem, this paper proposes AdapCK, an I/O optimization scheme for checkpointing on large-scale HPC systems. AdapCK consists of two main parts: a load-balancing mechanism that balances checkpointing workloads across low-level storage volumes, and a throughput-aware mechanism for writing checkpoint data that reduces I/O contention and increases I/O bandwidth utilization. Experimental results show that AdapCK can reduce checkpoint time by more than 30%, and by up to 54.5%.
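To make the load-balancing idea concrete, here is a minimal generic sketch. It is not AdapCK's algorithm, which the abstract does not describe in detail; it is a standard greedy heuristic (largest item first, onto the least-loaded target) shown only to illustrate what balancing checkpoint writes across storage volumes can mean. The function name `assign_checkpoints`, the per-process sizes, and the volume count are all hypothetical.

```python
import heapq


def assign_checkpoints(sizes, num_volumes):
    """Assign each process's checkpoint to a storage volume.

    Hypothetical illustration, not the method from the paper:
    a greedy heuristic that places the largest checkpoints first,
    always onto the volume with the smallest load so far.

    sizes: list where sizes[rank] is the checkpoint size of that process.
    Returns (placement, loads): placement maps rank -> volume index,
    loads[v] is the total size assigned to volume v.
    """
    # Min-heap of (current load, volume index); all volumes start empty.
    heap = [(0, v) for v in range(num_volumes)]
    heapq.heapify(heap)

    placement = {}
    # Visit ranks from largest checkpoint to smallest.
    for rank in sorted(range(len(sizes)), key=lambda r: -sizes[r]):
        load, vol = heapq.heappop(heap)  # least-loaded volume
        placement[rank] = vol
        heapq.heappush(heap, (load + sizes[rank], vol))

    # Recompute the final per-volume loads from the placement.
    loads = [0] * num_volumes
    for rank, vol in placement.items():
        loads[vol] += sizes[rank]
    return placement, loads


# Made-up checkpoint sizes (arbitrary units) for five processes, two volumes.
sizes = [8, 7, 6, 5, 4]
placement, loads = assign_checkpoints(sizes, 2)
print(placement, loads)
```

A real scheme would also need to account for the runtime throughput of each volume, which is what the abstract's throughput-aware writing mechanism addresses; this sketch considers data volume only.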

Original language: English
Title of host publication: Euro-Par 2024
Subtitle of host publication: Parallel Processing - 30th European Conference on Parallel and Distributed Processing, Proceedings
Editors: Jesus Carretero, Javier Garcia-Blas, Sameer Shende, Ivona Brandic, Katzalin Olcoz, Martin Schreiber
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 342-355
Number of pages: 14
ISBN (Print): 9783031695827
DOIs
State: Published - 2024
Event: 30th International Conference on Parallel and Distributed Computing, Euro-Par 2024 - Madrid, Spain
Duration: 26 Aug 2024 to 30 Aug 2024

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 14803 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 30th International Conference on Parallel and Distributed Computing, Euro-Par 2024
Country/Territory: Spain
City: Madrid
Period: 26/08/24 to 30/08/24

Keywords

  • Checkpoint
  • DMTCP
  • Fault tolerance
  • High-Performance Computing
  • Parallel file system
