Modern high-performance networks offer remote direct memory access (RDMA), which exposes a process's virtual address space to other processes in the network. The Message Passing Interface (MPI) specification has recently been extended with a programming interface called MPI-3 Remote Memory Access (MPI-3 RMA) for efficiently exploiting state-of-the-art RDMA features. MPI-3 RMA enables a powerful programming model that alleviates many of the downsides of message passing. In this work, we design and develop bufferless protocols that demonstrate how to implement this interface and support scaling to millions of cores with negligible memory consumption while providing the highest performance and minimal overheads. To arm programmers, we provide a spectrum of performance models for RMA functions that enable rigorous mathematical analysis of application performance and facilitate the development of codes that solve given tasks within specified time and energy budgets. We validate the usability of our library and models in several application studies running on up to half a million processes. In a wider sense, our work illustrates how to use RMA principles to accelerate computation- and data-intensive codes.
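For readers unfamiliar with the interface, the following minimal sketch illustrates the style of MPI-3 RMA programming using only standard MPI-3 one-sided calls (MPI_Win_allocate, MPI_Win_lock_all, MPI_Put, MPI_Win_flush, MPI_Win_sync); it is an illustrative example under common assumptions, not code from the library developed in this work.

```c
#include <mpi.h>
#include <stdio.h>

/* Illustrative MPI-3 RMA sketch: each process exposes one integer in a
 * window and writes its rank into the window of its right neighbor. */
int main(int argc, char **argv) {
    int rank, size, *buf;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Allocate window memory and expose it for remote access. */
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &buf, &win);
    *buf = -1;

    /* Passive-target epoch: lock all targets once, then communicate
     * without involving the remote CPUs. */
    MPI_Win_lock_all(0, win);

    int target = (rank + 1) % size;
    MPI_Put(&rank, 1, MPI_INT, target, 0, 1, MPI_INT, win);
    MPI_Win_flush(target, win);   /* complete the put at the target     */

    MPI_Barrier(MPI_COMM_WORLD);  /* all puts have landed everywhere    */
    MPI_Win_sync(win);            /* make remote updates visible locally */
    printf("rank %d received %d\n", rank, *buf);

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```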
1. Introduction
Supercomputers have driven progress in many domains of society by solving challenging and computationally intensive problems in fields such as climate modeling, weather prediction, engineering, and computational physics. More recently, the emergence of "Big Data" problems has led to an increased focus on designing high-performance architectures able to process enormous amounts of data in domains such as personalized medicine, computational biology, graph analytics, and data mining in general. For example, the recently established Graph500 list ranks supercomputers based on their ability to traverse enormous graphs; the results from November 2014 show that the most efficient machines can process up to 23 trillion edges per second in graphs with more than 2 trillion vertices.
Supercomputers consist of massively parallel nodes, each supporting up to hundreds of hardware threads in a single shared-memory domain. Up to tens of thousands of such nodes can be connected with a high-performance network, providing large-scale distributed-memory parallelism. For example, the Blue Waters machine has >700,000 cores and a peak computational performance of >13 petaflops.