Measurements of errors in large-scale computational simulations at runtime

MN Dinh and QM Nguyen, 2020 RIVF INTERNATIONAL CONFERENCE ON COMPUTING & COMMUNICATION TECHNOLOGIES (RIVF 2020), 291-297 (2020).

Verification of simulation codes often involves comparing the simulation output behavior to a known model using graphical displays or statistical tests. Such a process is challenging for large-scale scientific codes at runtime because they often involve thousands of processes and generate very large data structures. In our earlier work, we proposed a statistical framework for testing the correctness of large-scale applications using their runtime data. This paper studies the concept of 'distribution distance' and establishes the requirements for measuring the runtime differences between a verified stochastic simulation system and its larger-scale counterpart. The paper discusses two types of distribution distance: the χ² distance and the histogram distance. We prototype the verification methodology and evaluate its performance on two production simulation programs. All experiments were conducted on a 20,000-core Cray XE6.
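To illustrate the two distance measures named in the abstract, the following is a minimal Python sketch. The paper's exact definitions are not reproduced here, so this assumes two common forms: the symmetric χ² distance, 0.5 · Σ (pᵢ − qᵢ)² / (pᵢ + qᵢ), and an L1 (total-variation-style) histogram distance over normalized bin counts. The sample data and bin count are synthetic placeholders, not from the paper's experiments.

    import numpy as np

    def chi2_distance(p, q, eps=1e-12):
        # Symmetric chi-squared distance between two normalized histograms.
        # (An assumed common form; the paper's definition may differ.)
        p, q = np.asarray(p, float), np.asarray(q, float)
        return 0.5 * np.sum((p - q) ** 2 / (p + q + eps))

    def histogram_distance(p, q):
        # L1 distance between two normalized histograms (assumed form).
        p, q = np.asarray(p, float), np.asarray(q, float)
        return 0.5 * np.sum(np.abs(p - q))

    # Synthetic stand-ins for runtime output from a verified small-scale
    # run and its larger-scale counterpart.
    rng = np.random.default_rng(0)
    ref = rng.normal(0.0, 1.0, 10_000)
    large = rng.normal(0.05, 1.0, 10_000)

    # Bin both samples on a shared set of edges, then normalize.
    edges = np.histogram_bin_edges(np.concatenate([ref, large]), bins=50)
    p, _ = np.histogram(ref, bins=edges)
    q, _ = np.histogram(large, bins=edges)
    p, q = p / p.sum(), q / q.sum()

    print(f"chi^2 distance:     {chi2_distance(p, q):.4f}")
    print(f"histogram distance: {histogram_distance(p, q):.4f}")

A small distance suggests the scaled-up run's output distribution matches the verified reference; a large distance flags a behavioral difference worth investigating.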
