Tracking scientific simulation using online time-series modelling

MN Dinh, CT Vo, and D Abramson, 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID 2020), 202-211 (2020).

DOI: 10.1109/CCGrid49817.2020.00-73

The growing compute power and complexity of supercomputing systems require smaller feature sizes and lower supply voltages in internal components. This trend makes unintended errors such as soft errors, potentially caused by random bit flips, inevitable given the sheer scale of the resources involved (such as CPU cores and memory). In this paper, we discuss a non-parametric statistical modelling technique for implementing a soft error detector. By exploiting temporal autocorrelation within key variables of a running scientific simulation, we introduce an automatic anomaly detection technique in which runtime data from a time-step based simulation is converted into a time series, and a time-series modelling technique is used to identify soft errors at runtime. Experiments with LAMMPS, a high-performance molecular dynamics simulator, and with PLUTO, an open-source astrophysical code, show that the time-series based detector keeps both its false-positive rate and its false-negative rate below 3% while incurring a performance overhead of only 6%.
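The general idea of treating a per-time-step variable as a time series and flagging values that break its temporal pattern can be illustrated with a minimal Python sketch. This is not the authors' detector: the abstract does not specify the model, so the sketch assumes a simple linear one-step extrapolation as the predictor and a median/MAD test on the prediction residuals. The function names `flip_bit` and `detect_soft_errors` and the parameters `window`, `threshold`, and `min_scale` are hypothetical, chosen only for illustration.

```python
import math
import struct
from collections import deque
from statistics import median


def flip_bit(value, bit):
    """Flip one bit (0 = least significant) of a float64 to emulate a soft error."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def detect_soft_errors(series, window=20, threshold=6.0, min_scale=1e-2):
    """Return the time-step indices whose value breaks the recent temporal pattern.

    Assumed model (illustrative only): each sample is predicted by linear
    extrapolation from the two previous samples, and a sample is flagged when
    its residual lies more than `threshold` robust standard deviations from
    the median of the last `window` accepted residuals.
    """
    clean = list(series)               # working copy; flagged samples get repaired
    residuals = deque(maxlen=window)   # residuals of recently accepted samples
    flagged = []
    for t in range(2, len(clean)):
        predicted = 2.0 * clean[t - 1] - clean[t - 2]
        residual = clean[t] - predicted
        if len(residuals) == window:
            centre = median(residuals)
            mad = median(abs(r - centre) for r in residuals)
            # 1.4826 * MAD estimates the standard deviation for Gaussian data;
            # min_scale is an absolute floor so very smooth signals do not
            # produce a near-zero scale.
            scale = max(1.4826 * mad, min_scale)
            if abs(residual - centre) > threshold * scale:
                flagged.append(t)
                clean[t] = predicted   # keep the bad value out of later predictions
                continue
        residuals.append(residual)
    return flagged


# Synthetic stand-in for one monitored simulation variable (an assumption,
# not LAMMPS or PLUTO output).
signal = [math.sin(0.05 * t) for t in range(300)]
corrupted = list(signal)
corrupted[150] = flip_bit(corrupted[150], 52)   # flip the lowest exponent bit

print(detect_soft_errors(signal))
print(detect_soft_errors(corrupted))
```

The `min_scale` floor is a design choice of this sketch: it sets an absolute tolerance that would have to be tuned to the magnitude of the monitored variable. Flips in low-order mantissa bits change a value too little to stand out from normal variation, so a detector of this kind would be expected to miss them.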
