Horseshoes and Hand Grenades: The Case for Approximate Coordination in Local Checkpointing Protocols

PM Widener and KB Ferreira and S Levy, EURO-PAR 2016: PARALLEL PROCESSING WORKSHOPS, 10104, 623-634 (2017).

DOI: 10.1007/978-3-319-58943-5_50

Fault tolerance poses a major challenge for future large-scale systems. Active research into coordinated, uncoordinated, and hybrid checkpointing systems has explored how the introduction of asynchrony can address anticipated scalability issues. While fully uncoordinated approaches have been shown to incur significant delays, the degree of synchronization required to keep overheads low has not yet been significantly addressed. In this paper, we use a simulation-based approach to show the impact of synchronization on local checkpoint activity. Specifically, we show that the degree of synchronization needed to keep the impact of local checkpointing low is attainable with current technology for a number of key production HPC workloads. Our work provides a critical analysis and comparison of synchronization and local checkpointing. This enables users and system administrators to fine-tune the checkpointing scheme to the application and system characteristics available.
