Evaluation of Inter- and Intra-node Data Transfer Efficiencies between GPU Devices and their Impact on Scalable Applications

A. J. Peña and S. R. Alam, Proceedings of the 2013 13th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2013), 144-151 (2013).

DOI: 10.1109/CCGrid.2013.15

Data movement is of high relevance for GPU computing. Communication and performance efficiencies of applications and systems with GPU accelerators depend on on- and off-node data paths, making tuning and optimization an increasingly complex task. In this paper we conduct an in-depth study to establish the parameters that influence the performance of data transfers between GPU devices located on the same node (on-node) and on separate nodes (off-node). We compare the most recent version of MVAPICH2, featuring seamless remote GPU transfers, with our own low-level benchmarks, and discuss the bottlenecks that may arise. Data path performance and bottlenecks between GPU devices are analyzed and compared for two substantially different systems: an IBM iDataPlex relying on an InfiniBand QDR fabric with two on-node GPU devices, and a Cray XK6 featuring a single GPU per node and connected through a Gemini interconnect. Finally, we adapt LAMMPS, a GPU-accelerated application, to benefit from efficient inter-GPU data transfers, and validate our findings.
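The "seamless remote GPU transfers" mentioned above refer to CUDA-aware MPI, in which device pointers are passed directly to MPI calls and the library chooses the on- or off-node data path internally. The following minimal sketch illustrates this pattern under assumed parameters (message size and tag are illustrative, not taken from the paper); it is not the authors' benchmark code.

```c
/* Minimal sketch of a CUDA-aware MPI exchange of the kind MVAPICH2
 * supports: GPU device pointers go straight to MPI_Send/MPI_Recv,
 * with no explicit staging through host memory. The buffer size
 * and tag are assumptions for illustration. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t n = 1 << 20;  /* assumed message size: 1 Mi doubles */
    double *d_buf;
    cudaMalloc((void **)&d_buf, n * sizeof(double));

    if (rank == 0) {
        /* Send directly from GPU memory on rank 0. */
        MPI_Send(d_buf, (int)n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Receive directly into GPU memory on rank 1. */
        MPI_Recv(d_buf, (int)n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```

Whether the library satisfies such a call via peer-to-peer copies over PCIe (on-node) or via the network fabric (off-node) is exactly the kind of data-path choice whose performance the paper evaluates.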
