HPC Hardware Design Reliability Benchmarking With HDFIT

P Omland and A Netti and Y Peng and A Baldovin and M Paulitsch and G Espinosa and J Parra and G Hinz and A Knoll, IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 34, 995-1006 (2023).

DOI: 10.1109/TPDS.2023.3237777

Chips pack ever more, ever smaller transistors. Fault rates increase in turn and become more concerning, particularly at the scale of High-Performance Computing (HPC) systems: on one hand, hardware fault protection is costly - more than 10% silicon area for floating-point units; on the other, HPC users expect correct application output after the anticipated time of computation, but workloads are seldom bit-reproducible and tolerances in output are allowed for. Benign hardware faults causing errors within these tolerances are therefore acceptable: however, with abstract reliability targets such as 'undetected failures per time,' current HPC system design does not allow for pursuing trade-offs between reliability and performance with respect to faults. To address the above, we propose a user-centric reliability benchmark to specify HPC system reliability targets, allowing for better performance optimizations in hardware design, while meeting HPC user expectations. Our open-source Hardware Design Fault Injection Toolkit (HDFIT) enables - for the first time - end-to-end hardware design reliability experiments: from netlist-level fault injection to application output error. In a proof of concept we present an HPC general matrix multiply (GEMM) reliability study, targeting a series of popular applications, and using HDFIT to benchmark an open-source GEMM accelerator.
