Can Software Reliability Outperform Hardware Reliability on High Performance Interconnects? A Case Study with MPI over InfiniBand

M. J. Koop, R. Kumar, and D. K. Panda, ICS '08: Proceedings of the 2008 ACM International Conference on Supercomputing, 145-154 (2008).

An important part of modern supercomputing platforms is the network interconnect. As the number of computing nodes in clusters has increased, the role of the interconnect has become more important. Modern interconnects, such as InfiniBand, Quadrics, and Myrinet, have become popular due to their low latency and increased performance over traditional Ethernet. As these interconnects become more widely used and clusters continue to scale, design choices, such as where data reliability should be provided, become increasingly important. In this work we address the issue of network reliability design using InfiniBand as a case study. Unlike other high-performance interconnects, InfiniBand exposes both reliable and unreliable APIs. As part of our study we implement the Message Passing Interface (MPI) over the Unreliable Connection (UC) transport and compare it with MPI over the Reliable Connection (RC) and Unreliable Datagram (UD) transports. We detail the costs of reliability for different message patterns and show that providing reliability in software instead of hardware can increase performance by up to 25% in a molecular dynamics application (NAMD) on a 512-core InfiniBand cluster.
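For context, the sketch below is not from the paper; it is a minimal illustration of the mechanism the abstract refers to. In the InfiniBand verbs API, the reliability level is chosen per queue pair via the qp_type field at creation time: IBV_QPT_RC (hardware retransmission and delivery guarantees), IBV_QPT_UC (connected and in-order, but lost packets are never retried by the HCA, so an MPI layered over it must detect loss and retransmit in software), or IBV_QPT_UD (connectionless datagrams). Queue depths and error handling here are illustrative assumptions.

```c
#include <stdio.h>
#include <infiniband/verbs.h>

/* Create a queue pair of the requested transport type on an existing
 * protection domain and completion queue. */
static struct ibv_qp *create_qp(struct ibv_pd *pd, struct ibv_cq *cq,
                                enum ibv_qp_type type)
{
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .cap = {
            .max_send_wr  = 128,   /* queue depths are illustrative */
            .max_recv_wr  = 128,
            .max_send_sge = 1,
            .max_recv_sge = 1,
        },
        .qp_type = type,           /* IBV_QPT_RC, IBV_QPT_UC, or IBV_QPT_UD */
    };
    return ibv_create_qp(pd, &attr);
}

int main(void)
{
    struct ibv_device **devs = ibv_get_device_list(NULL);
    if (!devs || !devs[0]) { fprintf(stderr, "no IB device\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    struct ibv_cq *cq = ibv_create_cq(ctx, 256, NULL, NULL, 0);

    /* UC: the HCA delivers in order but never retransmits, so any
     * reliability (e.g., for MPI) must be provided in software. */
    struct ibv_qp *qp = create_qp(pd, cq, IBV_QPT_UC);
    printf("UC QP %screated\n", qp ? "" : "not ");

    if (qp) ibv_destroy_qp(qp);
    ibv_destroy_cq(cq);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```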
