Effective Sampling-Driven Performance Tools for GPU-Accelerated Supercomputers

M. Chabbi, K. Murthy, M. Fagan, and J. Mellor-Crummey, 2013 International Conference for High Performance Computing, Networking, Storage and Analysis (SC) (2013).

DOI: 10.1145/2503210.2503299

Performance analysis of GPU-accelerated systems requires a system-wide view that considers both CPU and GPU components. In this paper, we describe how to extend system-wide, sampling-based performance analysis methods to GPU-accelerated systems. Since current GPUs do not support sampling, our implementation required careful coordination of instrumentation-based performance data collection on GPUs with the sampling-based methods employed on CPUs. We also introduce a novel technique for analyzing systemic idleness in CPU/GPU systems. We demonstrate the effectiveness of our techniques with application case studies on Titan and Keeneland. Highlights of our case studies include: 1) we improved performance for LULESH 1.0 by 30%, 2) we identified a hardware performance problem on Keeneland, 3) we identified a scaling problem in LAMMPS derived from CUDA initialization, and 4) we identified a performance problem caused by GPU synchronization operations that suffer delays due to blocking system calls.
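
The abstract does not spell out how the GPU instrumentation data and the CPU samples are combined, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that GPU instrumentation yields kernel (start, end) time intervals and that each CPU sample records a timestamp plus a flag saying whether the CPU was waiting. All function names and the data layout are hypothetical.

```python
from bisect import bisect_right


def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def gpu_busy_at(merged, t):
    """Return True if time t lies in some half-open busy interval [start, end)."""
    starts = [s for s, _ in merged]
    i = bisect_right(starts, t) - 1
    return i >= 0 and t < merged[i][1]


def classify_samples(kernel_intervals, cpu_samples):
    """Count CPU samples by (cpu_state, gpu_state).

    kernel_intervals: (start, end) pairs from hypothetical GPU instrumentation.
    cpu_samples: (timestamp, cpu_waiting) pairs from a hypothetical CPU sampler.
    """
    merged = merge_intervals(kernel_intervals)
    counts = {}
    for t, cpu_waiting in cpu_samples:
        gpu_state = "gpu_busy" if gpu_busy_at(merged, t) else "gpu_idle"
        cpu_state = "cpu_waiting" if cpu_waiting else "cpu_working"
        key = (cpu_state, gpu_state)
        counts[key] = counts.get(key, 0) + 1
    return counts


if __name__ == "__main__":
    kernels = [(0, 10), (5, 15), (30, 40)]
    samples = [(2, False), (12, True), (20, False), (35, True), (40, False)]
    for key, n in sorted(classify_samples(kernels, samples).items()):
        print(key, n)
```

In this sketch, samples that land in the ("cpu_working", "gpu_idle") bucket indicate time when the GPU sat idle while the CPU executed code, which is the kind of systemic idleness a combined CPU/GPU view is meant to expose.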
