Towards Communication Profile, Topology and Node Failure aware Process Placement

I Vardas and M Ploumidis and M Marazakis, 2020 IEEE 32ND INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE AND HIGH PERFORMANCE COMPUTING (SBAC-PAD 2020), 241-248 (2020).

DOI: 10.1109/SBAC-PAD49847.2020.00041

HPC systems need to keep growing in size to meet the ever-increasing demand for high levels of capability and capacity, often in tight time windows for urgent computation. However, increasing the size, complexity and heterogeneity of HPC systems also increases the risk and impact of system failures, which result in wasted resources and aborted jobs. A major contributor to job completion time is the cost of interprocess communication. To address performance and energy efficiency, several prior studies have targeted improvements in communication locality. To meet this goal, they derive a mapping of MPI processes to system nodes that reduces communication cost. However, such approaches disregard the effect of system failures. In this work, we propose a resource allocation approach for MPI jobs that considers both high performance and error resilience. Our approach, named Communication Profile, Topology and node Failure (CPTF), takes into account the application's communication profile, the system topology and the node failure probability when assigning job processes to nodes. We evaluate variants of CPTF through simulations of two MPI applications, one with a regular communication pattern (LAMMPS) and one with an irregular one (NPB-DT). In both cases, the variant of CPTF that strives to avoid failure-prone nodes and communication paths completes job batches in less time than the default resource allocation policy of Slurm. It also exhibits the lowest ratio of aborted jobs. The average improvement in batch completion time is 67% for NPB-DT and 34% for LAMMPS.
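
The abstract does not give the cost model used by CPTF, so the following Python sketch is only an illustration of the general idea of combining the three inputs it names: a communication profile, a topology and per-node failure probabilities. The function names, the `alpha` weight, the multiplicative failure penalty and the toy data are all assumptions made for this example and are not taken from the paper.

```python
from itertools import permutations


def placement_cost(mapping, comm, hops, p_fail, alpha=1.0):
    """Hypothetical cost of a placement (not the paper's formula).

    mapping[i] is the node hosting process i, comm maps a process pair
    (i, j) to its traffic volume, hops[a][b] is the hop distance between
    nodes a and b, and p_fail[a] is the failure probability of node a.
    """
    # Traffic-weighted hop count: heavy communicators should sit close.
    comm_cost = sum(
        volume * hops[mapping[i]][mapping[j]]
        for (i, j), volume in comm.items()
    )
    # Probability that every node used by the job stays up.
    p_ok = 1.0
    for node in set(mapping):
        p_ok *= 1.0 - p_fail[node]
    # Assumed penalty: inflate the cost by the chance of a job abort.
    return comm_cost * (1.0 + alpha * (1.0 - p_ok))


def best_placement(n_procs, nodes, comm, hops, p_fail, alpha=1.0):
    """Exhaustive search, one process per node (toy problem sizes only)."""
    return min(
        permutations(nodes, n_procs),
        key=lambda m: placement_cost(m, comm, hops, p_fail, alpha),
    )


# Toy data: four nodes on a line, node 1 is failure-prone.
hops = [[abs(a - b) for b in range(4)] for a in range(4)]
p_fail = [0.0, 0.5, 0.0, 0.0]
comm = {(0, 1): 10, (1, 2): 1}  # processes 0 and 1 talk the most

best = best_placement(3, range(4), comm, hops, p_fail)
print(best)  # (3, 2, 0): avoids node 1 at a slightly higher hop cost
```

In this toy case a purely locality-driven placement (`alpha=0`) would happily use node 1, while the failure-aware cost accepts one extra hop to stay on reliable nodes. A realistic allocator would replace the exhaustive search with a heuristic, since the number of placements grows factorially with job size.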
