Redesign and Accelerate the AIREBO Bond-Order Potential on the New Sunway Supercomputer
P Gao and XH Duan and B Schmidt and WB Wan and JX Guo and WS Zhang and L Gan and HH Fu and W Xue and WG Liu and GW Yang, IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 34, 3117-3132 (2023).
DOI: 10.1109/TPDS.2023.3321927
Molecular dynamics (MD) is one of the most crucial computer simulation methods for understanding real-world processes at the atomic level. Reactive potentials based on the bond-order concept can model dynamic bond breaking and formation with close to quantum mechanical (QM) precision without requiring expensive QM calculations. In this article, we focus on the adaptive intermolecular reactive empirical bond-order (AIREBO) potential in LAMMPS for the simulation of carbon and hydrocarbon systems on the new Sunway supercomputer. To achieve scalable performance, we propose a parallel two-level building scheme and a periodic buffering strategy for the tailored data design to exploit data locality and data reuse. Furthermore, we design two optimized nearest-neighbor access algorithms: the redistribution of accumulated coefficients algorithm and the double-end search connectivity algorithm. Finally, we implement parallel force computation with an AoS data layout and hardware/software co-cache. In addition, we design a low-overhead atomic-operation-based load balancing method and apply vectorization. Our AIREBO implementation achieves an overall speedup of nearly 20x on a single core group (CG), and runs more than 5x and 4x faster than an Intel Xeon E5 2680 v3 core and an Intel Xeon Gold 6138 core, respectively. Compared with the Intel accelerator package in LAMMPS, our implementation further achieves a 3.0x speedup over an Intel Xeon E5 2680 v3 core and outperforms an Intel Xeon Gold 6138 core. We validate the results with a 2,000,000-step run (i.e., 1 ns) completed in no more than 20.5 hours on a single node. Our experiments show that a simulation of 2,139,095,040 atoms on 798,720 cores ((1 MPE + 64 CPEs) x 12,288 processes) exhibits a parallel efficiency of 88% under weak scaling.
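The abstract refers to an AoS (array-of-structures) data layout for the per-atom force computation without showing it; the short C++ sketch below illustrates only the general idea of such a layout, using assumed field names (x, f, type) that are not taken from the paper's code or from LAMMPS.

    // Minimal illustrative sketch of an array-of-structures (AoS) per-atom
    // layout of the kind mentioned in the abstract; field names are assumed,
    // not the paper's actual data structures.
    #include <vector>

    struct AtomAoS {
        double x[3];  // position
        double f[3];  // accumulated force
        int    type;  // chemical species (e.g., C or H)
    };

    // With AoS, all data belonging to one atom is contiguous in memory, so a
    // compute element can fetch everything needed for one interaction in a
    // single transfer instead of gathering from several separate arrays.
    void zero_forces(std::vector<AtomAoS>& atoms) {
        for (auto& a : atoms) {
            a.f[0] = a.f[1] = a.f[2] = 0.0;
        }
    }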