Deduplication Potential of HPC Applications' Checkpoints

J Kaiser and R Gad and T Suss and F Padua and L Nagel and A Brinkmann, 2016 IEEE International Conference on Cluster Computing (CLUSTER), 413-422 (2016).

DOI: 10.1109/CLUSTER.2016.32

HPC systems contain an increasing number of components, which decreases the mean time between failures. Checkpoint mechanisms help long-running applications overcome such failures, but writing the checkpoints puts pressure on the I/O backends. A viable way to relieve this pressure is to deduplicate the checkpoints. However, little is known about how much I/O deduplication can save within the checkpointing process of HPC applications. In this paper, we perform a broad study of the deduplication behavior of HPC application checkpointing and its impact on system design.
