TY - GEN
T1 - Mitigating I/O Impact of Checkpointing on Large Scale Parallel Systems
AU - Wang, Nana
AU - Sun, Qingzheng
AU - Liu, Yi
AU - Qian, Depei
N1 - Publisher Copyright: © 2018 IEEE.
PY - 2019/1/22
Y1 - 2019/1/22
AB - Checkpointing is the most widely used technique in high performance computing systems to tolerate fail-stop errors and ensure reliable execution of parallel applications. However, with the scaling up of high performance computers, the number of processors and computing nodes increase rapidly, which brings I/O impact of checkpointing to the systems. On arriving at a checkpoint, all the nodes generate checkpoint data and write them to the storage system simultaneously, causing burst and massive traffics and data to the I/O infrastructure including interconnection network, parallel file system and storage. To mitigate the I/O impact of checkpointing, this paper proposes a self-adaptive random delay approach to control the writing of checkpointing data. By generating checkpoint data simultaneously in each node and writing the data according to a self-adaptive random delay policy, the burst traffic and data are smoothed. Experiment and theoretical analysis results show that this approach can mitigate I/O impact of checkpointing on large scale parallel systems.
KW - Checkpoint
KW - Fault tolerance
KW - High performance computing
KW - Resilience
UR - https://www.scopus.com/pages/publications/85062559201
U2 - 10.1109/HPCC/SmartCity/DSS.2018.00047
DO - 10.1109/HPCC/SmartCity/DSS.2018.00047
M3 - Conference contribution
AN - SCOPUS:85062559201
T3 - Proceedings - 20th International Conference on High Performance Computing and Communications, 16th International Conference on Smart City and 4th International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2018
SP - 117
EP - 123
BT - Proceedings - 20th International Conference on High Performance Computing and Communications, 16th International Conference on Smart City and 4th International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2018
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 20th International Conference on High Performance Computing and Communications, 16th IEEE International Conference on Smart City and 4th IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2018
Y2 - 28 June 2018 through 30 June 2018
ER -