Supercomputing is evolving toward hybrid and accelerator-based architectures with millions of cores. The Hardware/Hybrid Accelerated Cosmology Code (HACC) framework exploits this diverse landscape at the largest scales of problem size, obtaining high scalability and sustained performance. Developed to satisfy the science requirements of cosmological surveys, HACC melds particle and grid methods using a novel algorithmic structure that flexibly maps across architectures, including CPU/GPU, multi/many-core, and Blue Gene systems. In this Research Highlight, we demonstrate the success of HACC on two very different machines, the CPU/GPU system Titan and the BG/Q systems Sequoia and Mira, attaining very high levels of scalable performance. We demonstrate strong and weak scaling on Titan, obtaining up to 99.2% parallel efficiency, evolving 1.1 trillion particles. On Sequoia, we reach 13.94 PFlops (69.2% of peak) and 90% parallel efficiency on 1,572,864 cores, with 3.6 trillion particles, the largest cosmological benchmark yet performed. HACC design concepts are applicable to several other supercomputer applications.

### 1. Introduction: Simulating the Sky

Cosmological surveys are our windows to the grandest of all dynamical systems, the Universe itself. Scanning the sky over large areas and to great depths, modern surveys have brought us a remarkably simple, yet mysterious, model of the Universe, whose central pillars, dark matter and dark energy, point to new, and even more fundamental discoveries. The pace of progress continues unabatedthe next generation of sky surveys demand tools for scientific inference that far exceed current capabilities to extract information from observations.

The already important role of cosmological simulations is expected to undergo a sea change as the analysis of surveys moves over to an approach based entirely on forward models of the underlying physics, encompassing as well the complex details of survey measurements. Such an end-to-end paradigm, based on the ability to produce realistic "universes" on demand, will stress the available supercomputing power to its limits.

The desired improvements for simulations over the next decade are often stated in terms of orders of magnitude; high accuracy and robustness are central requirements to be met by this program. Because the simulations to be run areand will continue to bememory-limited on even the largest machines, stringent requirements must be simultaneously imposed on code performance and efficiency.

The rich structure of the current Universeplanets, stars, solar systems, galaxies, and yet larger collections of galaxies (clusters and filaments) all resulted from the growth of very small primordial fluctuations. These perturbations are observed today as very small temperature fluctuations in the cosmic microwave background, the fossil remnant radiation of an early hot phase of the Universe, which cooled as the Universe expanded. The primordial fluctuations grow due to the influence of gravitational attractionthis is known as the Jeans instabilityalthough the growth is slowed by the expansion of the Universe.

A number of observations have convincingly demonstrated that only roughly a fifth of the observed matter density arises from ordinary atomic matter, the rest being a form of matter that, while it behaves normally under gravity, has exceedingly weak interactions of any other kind. This "dark matter" dominates the formation of structuregalaxies and smaller units of structure such as stars, gas clouds, and planets, all live within much larger, extended clumps of dark matter.

Given the above picture, cosmic structure formation at large scales is well described by the gravitational VlasovPoisson equation,^{17} a six-dimensional partial differential equation for the Liouville flow (1) of the phase space probability distribution function, where the gravitational potential arises self-consistently from the Poisson equation (2):

The expansion history of the Universe is encoded in the time-dependence of the scale factor *a*(*t*) governed by the cosmological model, the Hubble parameter, *H* = */a, G* is Newton's constant, * _{c}* is the critical density (if the cosmic density is above

*, the universe recollapses, if below, it expands forever),*

_{c}*, the average mass density as a fraction of*

_{m}*,*

_{c}*(*

_{m}**x**) is the local mass density, and

*(*

_{m}**x**) is the dimensionless density contrast,

The VlasovPoisson equation is hopeless to solve as a PDE because of its high dimensionality and the development of nonlinear structureincluding complex multistreamingon ever finer scales, driven by the Jeans instability. Consequently, N-body methods, using tracer particles to sample *f*(**x, p**) are used; the particles follow Newton's equations, with the forces on the particles given by the gradient of the scalar potential (**x**).^{4}

Initial conditions are set at early times using the known properties of the primordial fluctuations. These perturbations, given by a smooth Gaussian random field, evolve into a "cosmic web" comprised of sheets, filaments, and mass concentrations called halos.^{21,24} The first stars and galaxies form in halos and then evolve as the halo distribution also evolves by a combination of dynamics, mass accretion and loss, and by halo mergers. In addition to gravity, gasdynamic, thermal, radiative, and other processes must also be modeled. Large-volume simulations usually incorporate the latter effects via semi-analytic modeling, since the overall spatial and temporal dynamic range is too vast to be encompassed in an individual simulation.

Elementary arguments easily demonstrate the scale of the challenge to be overcome. Survey depths are of order several Gpc (1 pc = 3.26 light-years); to follow bright galaxies, halos with a minimum mass of solar mass) must be tracked. To properly resolve these halos, the tracer particle mass should be and the force resolution should be small compared to the halo size, that is, kpc. This immediately implies a dynamic range (ratio of smallest resolved scale to box size) of a part in 10^{6} (Gpc/kpc) everywhere in the *entire* simulation volume (Figure 1). In terms of the number of simulation particles required, counts range from hundreds of billions to many trillions. Time-stepping criteria follow from a joint consideration of the force and mass resolution.^{20} Finally, stringent requirements on accuracy are imposed by the very small statistical errors in the observationssome observables must be computed at accuracies of a *fraction* of a percent.

For a cosmological simulation to be considered "high-resolution," *all* of the above demands must be met. In addition, throughput is a significant concern. Scientific inference from cosmological observations defines a statistical inverse problem where many runs of the forward problem are needed to estimate cosmological parameters and associated errors. For such analyses, hundreds of largescale, state of the art simulations will be required.^{12} The Hardware/Hybrid Accelerated Cosmology Code (HACC) framework meets these exacting conditions in the realm of performance and scalability across a variety of node-level architectures. In this Research Highlight, we will describe the basic ideas behind the approach, and present some representative results.

### 2. Current State of the Art

N-body simulations in cosmology have a long history, starting with Peebles' simulations of 300 particles in 1969,^{16} to today's largest simulations evolving more than a trillion particles. Initial *N*^{2} approaches gave way quickly to more efficient methods. Particle-mesh (PM) methods proved insufficient to obtain the required force resolution and were replaced by P^{3}M (ParticleParticle PM) algorithms (e.g., Ref.^{3}), and tree codes.^{18} Because of the high degree of clustering in cosmological simulations, P^{3}M codes have been mostly displaced by tree codes (nevertheless, as demonstrated by HACC, P^{3}M can be resurrected for CPU/GPU systems). To localize tree walks and make handling periodic boundary conditions easier, hybrid TreePM methods were introduced, and form the mainstay of gravity-only cosmology simulations. The most popular code used today, GADGET-2,^{22} is also a TreePM code.

At the same time as the tree-based methods were being developed, high-resolution grid-based PM approaches using Adaptive Mesh Refinement (AMR) also made their appearance, for example, Refs.^{2, 5, 23} In the case of large-volume survey simulations, the efficiency of AMR methods is reduced because clustering occurs over the entire simulation volume, leading to high AMR computational and memory costs (in turn degrading the available force and mass resolution). Consequently, the most successful AMR applications have involved the study of a smaller number of objects such as clusters of galaxies at high resolution and with different physics modules employed. For a comparison and overview of ten different codes spanning all algorithms discussed above see Ref.^{13}

Gas physics has been introduced into the structure formation codes either by using Eulerian methods following an AMR approach, or by using Smoothed Particle Hydrodynamics (SPH). Hybrid approaches are also being intensively explored. Physics at small scales (star formation, supernova feedback) is difficult to model self-consistently and is most often treated using phenomenological subgrid models. The parameters in these models are determined in part by fitting against observational data.

### 3. Multi-Architecture Challenge

**3.1. HACC architecture: two-level approach**

A modern code framework must confront issues raised by diverse architectures and programming models, as well as be able to respond to potentially disruptive future evolution. It should be able to gracefully incorporate multiple algorithms that interoperate with each other, to optimize them for the architecture at hand, and place minimal reliance on external resources that can potentially limit these abilities. The above remarks are a concise statement of HACC's design philosophy.

The strategy follows a two-level paradigm. As discussed in Section 2, the cosmological N-body problem is typically treated using both grid and particle-based approaches. Grid-based techniques are better suited to larger ("smooth") length scales, with particle methods having the opposite property. This suggests that the higher level of code organization should be grid-based, interacting with particle information at a lower level of the computational hierarchy. Following this central idea, HACC uses a hybrid parallel algorithmic structure, splitting the gravitational force calculation into a specially designed grid-based long/medium range spectral PM component that is essentially architecture-independent, and an architecture-adaptive particle-based short-range solver (Figure 2). The grid is responsible for four orders of magnitude of dynamic range, while the particle methods handle the critical two orders of magnitude at the shortest scales where particle clustering is maximal and the bulk of the time-stepping computation takes place.

The flexibility to respond to different nodal architectures is built into the short-range solvers; these can use direct particleparticle interactions, that is, a P^{3}M algorithm,^{15} as on Roadrunner and Titan, or use both tree and particleparticle methods as on the IBM BG/Q ("PPTreePM"). Access to multiple algorithms within the HACC framework also enables careful error testing and validation, for example, the P^{3}M and the PPTreePM versions agree to within 0.1% for the nonlinear power spectrum test in the code comparison suite of Ref.^{14}

HACC's multi-algorithmic structure attacks several common weaknesses of conventional particle codes including limited vectorization, indirection, complex data structures, lack of threading, and short interaction lists. It combines MPI with a variety of nodal programming models (e.g., CUDA, OpenCL, OpenMP) to readily adapt to different platforms. HACC has been ported to conventional and Cell or GPU-accelerated clusters, Blue Gene systems, and Intel Xeon Phi systems. HACC can run at full scale on *all* available supercomputer architectures. To showcase this flexibility, we present scaling results for two very different cases in Section 4: the IBM BG/Q systems Mira at ALCF and Sequoia at LLNL, and Titan at Oak Ridge Leadership Computing Facility (OLCF).

**3.2. HACC top level**

The top level of HACC's architecture consists of the domain decomposition, the medium/long-range force solver, and the interface to the short-range solver. All three aspects of this design involve new ideas to enhance flexibility and performance.

The spatial domain decomposition is in non-cubic 3-D blocks, but unlike guard zones in a typical PM method, full particle replication"particle overloading"is employed across domain boundaries. Overloading provides two crucial benefits. The first is that medium/long-range force calculations require no particle communication, and high-accuracy local force calculations require only sparse refreshes of the overloading zone (for details, see Refs.^{9, 19}).

The second major advantage of overloading is that it frees the local force solver from handling communication tasks, which are taken care of completely at the top level. Thus new "on-node" local methods can be plugged in easily with guaranteed scalability, requiring only local optimizations. All short-range methods in HACC are local to the MPI-rank and the locality can be fine-grained further. This can be used to lower the number of levels in tree algorithms and to parallelize across fine-grained particle interaction sub-volumes. The benefits bestowed by overloading come at only modest cost: the memory overhead for a large run is only 10%.

The long/medium range algorithm is based on a fast, spectrally filtered PM method incorporating several sophisticated features. The density field is generated from the particles using a Cloud-In-Cell (CIC) scheme^{15} and is then smoothed with an isotropizing spectral filter. The spectral filter reduces the anisotropy "noise" of the CIC scheme by over an order of magnitude without recourse to inflexible higher-order spatial particle deposition methods that are more commonly used. The noise reduction allows matching of the short and longer-range forces at small scales and confers the ability to use higher-order methods, both of which have important ramifications for accuracy and performance.

The Poisson solver uses a sixth-order influence function (spectral representation of the inverse Laplacian). The gradient of the scalar potential is obtained using low-noise fourth-order Super-Lanczos spectral differencing.^{10} The "Poisson-solve" in HACC is the composition of all the kernels discussed above within one single discrete sum, each component of the potential gradient requiring an independent FFT. HACC uses its own scalable, high performance 3-D FFT routine implemented using a 2-D pencil decomposition (Section 4).

**3.3. HACC short-range solvers**

The form of the short-range force is given by subtracting the filtered grid force from the exact Newtonian force. The filtered force was determined to high accuracy using randomly sampled particle pairs and then fitted to an expression with the correct large and small distance asymptotics. Thanks to the effective use of filtering (Section 3.2), this functional form is needed only over a small, compact region, and can be represented using a fifth-order polynomial expansion, resulting in the crucial ability to vectorize computations in the main force kernel.

For heterogeneous systems such as Titan, the long/medium-range solver operates at the CPU layer. Depending on the CPU/accelerator memory balance, two different modes may be used, (1) grids held on the CPU and particles on the accelerator, or, more commonly, (2) a streaming paradigm, with grid and particle information resident in CPU memory, and short-range computations streamed through the accelerator. In both cases, the local force solve is a direct particleparticle interaction, resulting in a hardware-accelerated P^{3}M code. The accelerated short-range algorithm on Titan outperforms the corresponding CPU-only TreePM version by more than an order of magnitude. We also have an implementation of the tree-based algorithm as used on the BG/Q, but with the tree-build and walk performed on the CPU, and the actual force evaluations performed on the GPU, leading to comparable performance as with the P^{3}M code (currently it is a factor of two or so slower).

The streaming of particles is carried out by partitioning the 3-D particle domain into 2-D data slabs. The slab width is between 3 and 4 grid cells, as the force resolution of the top level PM solver suffices for larger scales. Therefore, the short range force calculation on one slab only requires data from adjacent slabs. We dynamically store four slabs in the GPU memory at any given time, performing the P^{3}M algorithm on the middle slab. While the GPU performs its calculation on the chosen slab, the CPU host code simultaneously reads in the data results of the previous slab calculation, while writing the upcoming slab into GPU memory for later computation. The latency of reading and writing memory to the GPU is absorbed primarily by the GPU computation time as such memory movement can be performed simultaneously. This memory partitioning, not only eliminates the limited memory problem of the GPU, but also drastically reduces the cost of memory movement between the CPU and the GPU, a pernicious performance chokepoint in GPU codes. In fact, each iteration (i.e., computation of the middle slab) takes longer then the simultaneous memory push, eliminating extra time spent on memory movement.

For a many-core system (e.g., BG/Q or Intel Xeon Phi), the GPU strategy is obviously not applicable, and it is more efficient to change the short-range solver to a tree-based algorithm. HACC uses a recursive coordinate bisection (RCB) tree in conjunction with a highly tuned short-range polynomial force kernel. The implementation of the RCB tree, although not the force evaluation scheme, generally follows the discussion in Ref.^{6} Two core principles underlie the high performance of the RCB tree's design.

*Spatial Locality.* The RCB tree is built by recursively dividing particles into two groups, placing the dividing line at the center of mass coordinate perpendicular to the longest side of the box. Particles are then partitioned such that particles in each group occupy disjoint memory buffers. Local forces are computed one leaf node at a time. The particle data exhibits a high degree of spatial locality after the tree build; because the computation of the short-range force on the particles in any given leaf node, by construction, deals with particles only in nearby leaf nodes, the cache miss rate is extremely low.

*Walk Minimization.* In a traditional tree code, an interaction list is built and evaluated for each particle. The tree walk necessary to build the list is relatively slow because it involves complex conditional statements and "pointer chasing" operations. A direct *N*^{2} force calculation scales poorly as *N* grows, but for a small number of particles, a thoughtfully constructed kernel can still finish the computation in a small number of cycles. The RCB tree exploits our highly tuned force kernels to reduce the overall evaluation time by shifting workload away from the slow tree-walking and into the force kernel. On many systems, tens or hundreds of particles can be in each leaf node before the critical crossover point in computational efficiency is reached.

**3.4. Other features**

The time-stepping in HACC is based on a 2nd order split-operator symplectic scheme that sub-cycles the short/close-range evolution within long/medium-range "kick" maps where particle positions do not change but the velocities are updated. The number of sub-cycles can vary, depending on the force and mass resolution of the simulation, from *n _{c}* = 510. Local density estimates automatically provided by the RCB tree are used to enable adaptive time-stepping at the level of an individual leaf. HACC uses mixed precision computationdouble precision is used for the spectral component of the code, whereas single precision is adequate for the short/close-range particle force evaluations and particle time-stepping.

As emphasized above, HACC's performance and flexibility are not dependent on vendor-supplied or other high-performance libraries or linear algebra packages; the 3-D parallel FFT implementation in HACC couples high performance with a small memory footprint as compared to available libraries. Unlike some other N-body codes that have been specially tuned for performance, no special hardware use is associated with HACC, and assembly level programming is not required.

To summarize, the HACC framework integrates multiple algorithms and optimizes them across architectures; it has several performance-enhancing features, for example, overloading, spectral filtering and differentiation, mixed precision, compact local trees, and locally adaptive time-stepping. Finally, weak scaling is a function only of the spectral solver; HACC's 2-D domain decomposed FFT guarantees excellent performance and scaling properties (Section 4).

### 4. Performance

**4.1. Target architectures and environments**

The defining characteristic of HACC, as already discussed, is its ability to run on diverse architectures (multi/many-core as well as heterogenous) without sacrificing performance or scalability. We will showcase this on two very different architectures: the GPU-accelerated system Titan and the BG/Q systems Sequoia and Mira. These machines currently occupy rank two, three, and five in the Top 500 list (http://www.top500.org/). These architectures represent two very different approaches to parallel supercomputing, a smaller number of "hot" nodes with a larger flops/bandwidth imbalance (Titan) versus a larger number of lower compute intensity nodes, with a more balanced network configuration (Mira and Sequoia). It is important to note that while code ports to Titan have involved a fair degree of effort (measured in man-years), the initial transition of HACC to Titan took less than a two-person month.

Titan, a hybrid Cray XK7 system, is the third generation of major capability computing systems at the OLCF. The initial configuration was accepted in February 2012 and consisted of 18,688 compute nodes for a total of 299,008 AMD Opteron 6274 "Interlagos" processor cores and 960 NVIDIA X2090 "Fermi" Graphical Processing Units (GPU). The peak performance of the Opteron cores is 2.63 PFlops and the peak performance of the GPUs is 638 TFlops in double precision. In late 2012, the 960 NVIDIA X2090 processors were removed and replaced with 18,688 of NVIDIA's next generation Tesla K20X "Kepler" processors, with a total system peak performance in excess of 27 PFlops in double precision.

The BG/Q is the third generation of the IBM Blue Gene line of supercomputers. The BG/Q Compute chip (BQC) combines CPUs, caches, network, and a messaging unit on a single chip; each BG/Q node contains the BQC and 16 GB of DDR3 memory. Each BQC uses 17 augmented 64-bit PowerPC A2 cores with specific enhancements: (1) 4 hardware threads and a SIMD quad floating point unit (Quad Processor eXtension, QPX), (2) a sophisticated L1 prefetching unit with both stream and list prefetching, (3) a wake-up unit to reduce certain thread-to-thread interactions, and (4) transactional memory and speculative execution. Of the 17 BQC cores, 16 are for user applications and one for system services. Each core has access to a private 16 KB L1 data cache and a shared 32 MB multi-versioned L2 cache connected by a crossbar. The A2 core runs at 1.6 GHz and the QPX allows for 4 FMAs per cycle, translating to a peak performance of 204.8 GFlops for the BQC chip. The BG/Q network has a 5-D torus topology; each node has 10 communication links with a peak total bandwidth of 40 GB/s. Our results have been obtained on Sequoia, a 96 rack system (1,572,864 cores) with 20 PFlops peak performance and on Mira, a 48 rack system (786,432 cores) with 10 PFlops peak performance.

**4.2. Performance results**

The results are presented in three parts: performance and scaling of (i) the FFT and hence of the medium/long range solver, and of the full code on (ii) Titan, and (iii) on BG/Q systems. Weak and strong scaling results are shown for all cases. For the full code runs, the particle mass is and the force resolution, 6 kpc. All simulations are for a CDM (Cold Dark Matter) model with * _{m}* = 0.265. Simulations of cosmological surveys focus on large problem sizes, therefore the weak scaling properties are of primary interest. The full code exhibits essentially perfect weak scaling out to 16,384 nodes of Titan (90% of the system) at 92.2% parallel efficiency. It strong scales up to almost half of Titan on a problem with (only) 1024

^{3}particles.

On the BG/Q systems, HACC weak scales to the full machine size, achieving a performance of 13.94 PFlops on 96 racks, at around 6769% of peak in most cases (up to 69.37%) at an efficiency of 90%. We demonstrate strong scaling up to one rack on a 1024^{3} particle problem. Finally, the biggest test run evolved more than 3.6 trillion particles (15,360^{3}), exceeding by more than an order of magnitude the largest high-resolution cosmology run performed to date. Extensive details about the performance results on the BG/Q systems and how the high peak performance was achieved are given in Ref.^{7} Here we give a summary of those results.

**FFT scaling and the Poisson solver.** The weak scaling of HACC is controlled by the FFT that underlies the spectral Poisson solver (Section 3). To achieve extreme scalability, HACC has its own fast, portable, and memory-efficient pencil-decomposed, non-power-of-two FFT (data partitioned across a 2-D subgrid), allowing *N _{rank}* <

*N*, sufficient for use in any supercomputer in the foreseeable future. The FFT is composed of interleaved transposition and sequential 1-D FFT steps, where each transposition only involves a subset of all tasks; the transposition and 1-D FFT steps are overlapped and pipelined, with a reduction in communication hotspots in the interconnect. Details of the implementation are rather complex, requiring careful scheduling of communication phases in order to avoid deadlock.

^{2}_{FFT}Detailed timing information for the FFT on the BG/Q and Titan is given in the original SC'13 paper. Both strong and weak scaling tests were performed. For the strong scaling test, as ranks increase from 256 to 8192 (8 ranks per node on the BG/Q and one rank per node on Titan), the scaling remains close to ideal, similar on both machines. In the second set of scaling tests, the grid size per rank is held constant, at approximately 200^{3} for the BG/Q and 300^{3} for Titan (the last Titan run was increased to 400^{3} particles per rank). FFT scaling is demonstrated on the BG/Q up to 16 racks and to a size of 10,240^{3}. The performance is remarkably stable, predicting excellent FFT performance on the largest systems and, as shown in the next section, holding up to 96 racks. For our final full runs we measured the overall timing onlythe excellent scaling of the full code is proof of the FFT scaling up to 15,360^{3} grid sizes on more than 1.5 million cores.

**HACC scaling up to 16,384 nodes on Titan.** We present performance data for two cases: (1) weak scaling on Titan with up to 16,384 nodes; and (2) strong scaling on up to 8,192 nodes with a fixed-size simulation problem. Timing results are obtained by averaging over 15 substeps.

**Weak Scaling:** We ran with 32 million particles per node in a fixed (nodal) physical volume of (360 Mpc)^{3}, representative of the particle loading in actual large-scale simulations (the GPU version of the code was run with one MPI rank per node). The results are shown in Figure 3, including timing results for a 1.1 trillion particle run, where we have kept the volume per node the same but increased the number of particles per node by a factor of two to 64.5 million. This benchmark demonstrates essentially perfect weak scaling with respect to time to solution.

**Strong Scaling:** Many-core based architectures are tending inexorably towards a large number of heterogeneous cores per node with a concomitant decrease in the byte/flop ratio, a defining characteristic of exascale systems. For these future-looking reasonsanticipating the strong-scaling barrier for large-scale codesand for optimizing wall-clock at fixed problem size, it is important to establish the robustness of the strong scaling properties of the HACC algorithms.

We ran a fixed-size 1024^{3} particle problem while increasing the number of nodes from 32 to 8192, almost half of Titan. The results are shown in Figure 3. For the run with the smallest number of nodes, we utilize 30% of the available CPU memory (32 GB per node), for the largest, we use less than 1%. Up to 512 nodes, HACC strong-scales almost perfectly, after which the scaling degrades somewhat. This is not surprisingat this low particle loading the GPUs lack the computational work to hide the particle transfer penalty to the CPU. A utilization of less than 1% of the available memory is not a real-world scenario, yet even at this value, HACC performs extremely well.

**HACC scaling up to 96 racks of the BG/Q.** We present data for two cases: (1) weak scaling at 90% parallel efficiency with up to 1,572,864 cores (96 racks); (2) strong scaling with up to 16,384 cores with a fixed-size problem to explore future systems with lower memory per core. Timing results are obtained by averaging over 50 substeps.

**Weak Scaling:** We ran with 2 million particles per core, a typical particle loading in actual large-scale simulations on BG/Q systems. Tests with 4 million particles per core produce very similar results. As demonstrated in Figure 4, weak scaling is ideal up to 1,572,864 cores (96 racks), where HACC attains a peak performance of 13.94 PFlops and a time per particle per substep of 0.06 ns for the full high-resolution code. This problem, with 3.6 trillion particles, is the largest cosmological benchmark ever performed. The time to solution is set by the science use requirement, that is, running massive high-precision HACC simulations on a production basiswithin days rather than weeks. The performance achieved allows runs of 100 billion to trillions of particles in a day to a week of wall-clock time.

**Strong Scaling:** The problem set-up follows that on Titan, using 512 to 16,384 cores, and spanning a per node memory utilization of 57%, a typical production run value, to as low as 7%. The actual utilization scales by a factor of 8, instead of 32, because on 16,384 nodes we are running a very small simulation volume per rank with high overloading memory and compute cost. Given that the algorithms are designed to run at >50% of per node memory utilization to a factor of 4 less (15%), the strong scaling, as depicted in Figure 4 is impressive. It stays near-ideal throughout, slowing down at 16,384 cores, only because of extra computations in the overloaded regions. This test demonstrates that the HACC algorithms will work well in situations where the byte/flop ratio is significantly smaller than the optimal plateau for the BG/Q.

### 5. Updates and Future Evolution

**5.1. HACC updates**

After the original SC'13 paper (on which this Research Highlight is based) appeared, a number of improvements have been implemented that further increase the efficiency of the short-range solvers. These include improved data streaming to the GPU, a new task-based load-balancing scheme, multiple RCB trees/rank, and locally adaptive time-stepping.^{8}

On Titan, we determined that PCI bus latencies were best avoided by pushing data in larger blocks, as opposed to asynchronously pushing 2-D slabs. The new code calculates the maximum amount of memory that can be allocated on the GPU and pushes data of that size. With a larger data size, the calculation time is increased on the device, reducing the overall time spent moving data between the host CPU and GPU. The kernel itself was also updated and rewritten to lower the register pressure per thread, allowing for an occupancy of 100%, maximally hiding latencies in memory fetching. Other improvements include further loop unrolling, and actual alterations in the assembly code. The particle interaction algorithm itself was unaltered, but by analyzing various aspects of the PTX assembly, the compiled kernel was further optimized. This led to a three times to four times improvement.

The principle behind the load balancing technique is to partition each node volume into a set of overlapping data blocks, which contain "active" and "passive" particlesanalogous to the overloading particle scheme between nodes. Each block can independently perform the short-range force calculations on its data, where it correctly updates the interior active particles, and streams the bounding passive particles. In this form, one can picture each node as a separate HACC simulation, and the data blocks are the equivalent nodal volume decompositions with overloading zones. The scheme to perform a short-range force timestep is as follows: (1) Each node partitions itself into overlapping data blocks, (2) evolves the blocks independently, and (3) reconciles the active particles, whereby removing the unnecessary duplicated passive ones. Once the simulation data has been subdivided into smaller independent work items, these blocks can be communicated to any nodes that have the available resources to handle the extra workload. Load-balancing can now be performed; more details are given in Ref.^{11}

On the BG/Q, to increase the amount of parallel work in the short-range solver, HACC builds multiple RCB trees per rank. First, the particles are sorted into fixed bins, where each bin is roughly the length-scale of the short-range force. An RCB tree is constructed within each bin, and because this process is independent of all other bins, it can be done in parallel, providing a significant performance boost to the overall force computation. Note that when the force on the particles in each leaf node is computed, not only must the parent tree be searched, but so must the other 26 neighboring trees. Only nearest neighbors need to be considered because of the limited range of the short-range force. While searching neighboring trees adds computational expense, the trees are individually not as deep, and so the resulting walks are less expensive. Also, because we distribute "(leaf node, neighboring tree)" pairs among the threads, this scheme also increases the amount of available parallelism post-tree-build (which helps with thread-level load balancing). Overall this technique provides a significant performance advantage over using one large tree for the entire domain.^{8}

**5.2. HACC future evolution**

HACC framework development is tightly coupled to future architectures; in particular the two-level design naturally maps the lower level (short-range solvers, and, in the very near future, particle-based hydro-solvers) to the individual node architecture. Because of this feature, architecture co-design with HACC kernels is particularly straight-forward. Conversely, knowledge of the nodal architecture allows for ease of optimization and algorithmic choices within the framework.

The choice of programming models is also connected to this two-level structure. It appears very likely that in the near future ("pre-exascale" systems), the number of nodesdistinct pieces of hardware directly connected to the main system networkwill not be too different from 100,000 (or smaller). HACC has already demonstrated running at more than 1.5 million MPI ranks, and it is very unlikely that a significant improvement in this number will be needed. Programming models at the node level will evolve, and, as already demonstrated, HACC can easily adapt to these.

Future nodes will express unprecedented levels of concurrency; possibly thousands of independent threads. The grand challenge for pre-exascale applications will be how to adapt to this change. In principle, HACC has the potential to utilize a large number of independent streams effectively; precisely how to do this is a key focus area. In terms of the bytes/flop ratio, given a pre-exascale system with notional numbers of 100 PFlops and 10 PB of total memory, the results presented here show that HACC is ready for such a machine today.

There are a number of technical issues such as complex memory hierarchies (e.g., NVRAM) and bandwidth, power management, and resilience technologies that we continue to monitor. HACC is one of the benchmark codes for gathering information regarding future supercomputers at Argonne, Livermore, and Oak Ridge. This provides an opportunity to stay abreast of the latest advances; because HACC does not rely on "black box" libraries or packages, it retains the key advantage of allowing optimization to be a continuous process.

Many of the ideas and methods presented here are relatively general and can be re-purposed to benefit other HPC applications, especially in areas such as accelerator beam dynamics and plasma simulations, particle transport codes, and for molecular dynamics simulations.

An important aspect of HACC not mentioned so far is an associated parallel in situ analysis and visualization framework called CosmoTools that runs in tandem with HACC simulations to perform "on the fly" analysis (computation of summary statistics, halo finding, halo merger trees, etc.) and data reduction tasks. The very large simulations undertaken by the HACC framework make having such a capability an absolute requirement, as extensive post-processing of the raw data outputs is almost impossible to carry out. Future development of HACC will also involve associated development of CosmoTools.

Early science results from a number of HACC simulations include a suite of 64 billion particle runs for baryon acoustic oscillations predictions for Baryon Oscillation Spectroscopic Survey^{a} (BOSS) carried out on Roadrunner^{25} and a high-statistics study of galaxy cluster halo profiles.^{1} This has been followed by some very large simulations, among them the largest high-resolution N-body runs in cosmology to date. These include a 1.1 trillion particle simulation (Figure 1)^{8} run on Mira, a simulation with roughly half this number of particles, but an order of magnitude better mass resolution run on Titan,^{11} and a suite of thirty 64 billion particle simulations for the BOSS survey spanning multiple cosmologies, and designed to construct full-sky synthetic catalogs out to a redshift depth of z 0.8. The large Mira run has been used to generate synthetic galaxy catalogs for the next-generation Dark Energy Spectroscopic Instrument (DESI)^{b} and results from both of the large runs will be used to construct synthetic skies for the Large Synoptic Survey Telescope (LSST).^{c}

### Acknowledgments

We are indebted to Bob Walkup for running HACC on a prototype BG/Q system at IBM and to Dewey Dasher for help in arranging access. At ANL, we thank Susan Coghlan, Paul Messina, Mike Papka, Rick Stevens, and Tim Williams for obtaining allocations on different Blue Gene systems. At LLNL, we are grateful to Brian Carnes, Kim Cupps, David Fox, and Michel McCoy for providing access to Sequoia. At ORNL, we thank Bronson Messer and Jack Wells for assistance with Titan. This research used resources of the ALCF, which is supported by DOE/SC under contract DE-AC02-06CH11357 and resources of the OLCF, which is supported by DOE/SC under contract DE-AC05-00OR22725.

### References

1. Bhattacharya, S., Habib, S., Heitmann, K., Vikhlinin, A. Dark matter Halo profiles of massive clusters: Theory versus observations. *Astrophys. J.* 766 (2013), 32.

2. Bryan, G.L., Norman, M.L. In *12th Kingston Meeting on Theoretical Astrophysics, Proceedings of Meeting Held in Halifax; Nova Scotia (ASP Conference Series # 123)*, D.A. Clarke and M. Fall, eds. 1996; see also O'Shea, B.W., Nagamine, K., Springel, V., Hernquist, L., Norman, M.L. *Astrophys. J. Supp. 160* (2005), 1.

3. Couchman, H.M.P., Thomas, P.A., Pearce, F.R. Hydra: An adaptive-mesh implementation of P 3M-SPH *Astrophys. J. 452*, 797 (1995).

4. For a review of cosmological simulation methods, see also Dolag, K., Borgani, S., Schindler, S., Diaferio, A., Bykov, A.M. *Space Sci. Rev. 134* (2008), 229.

5. Fryxell, B., et al. FLASH: An adaptive mesh hydrodynamics code for modeling astrophysical thermonuclear flashes. *Astrophys. J. Supp. 131* (2000), 273.

6. Gafton, E., Rosswog, S. A fast recursive coordinate bisection tree for neighbour search and gravity. *Mon. Not. R. Astron. Soc. 418* (2011), 770.

7. Habib, S., Morozov, V., Finkel, H., Pope, A., Heitmann, K., Kumaran, K., Peterka, T., Insley, J., Daniel, D., Fasel, P., Frontiere, N., Luki, Z. arXiv:1211.4864, *Supercomputing 2012.*

8. Habib, S., Pope, A., Finkel, H., Frontiere, N., Heitmann, K., Daniel, D., Fasel, P., Morozov, V., Zagaris, G., Peterka, T., Vishwanath, V., Luki, Z., Sehrish, S., Liao, W.-K. HACC: Simulating sky surveys on state-of-the-art supercomputing architectures. *New Astron. 42* (2016), 49 arXiv:1410.2805 [astro-ph.IM].

9. Habib, S., Pope, A., Luki, Z., Daniel, D., Fasel, P., Desai, N., Heitmann, K., Hsu, C.-H., Ankeny, L., Mark, G., Bhattacharya, S., Ahrens, J. Hybrid petacomputing meets cosmology: The Roadrunner Universe project. *J. Phys. Conf. Ser.* 180 (2009), 012019.

10. Hamming, R.W. *Digital Filters.* Dover, Publications, Mineola, New York 1998.

11. Heitmann, K., Frontiere, N., Sewell, C., Habib, S., Pope, A., Finkel, H., Rizzi, S., Insley, J., Bhattacharya, S. The Q continuum simulation: Harnessing the power of GPU accelerated supercomputers. *J. - Astrophys. J. Supp.* 219 (2015), 34 arXiv:1411.3396 [astro-ph.CO].

12. Heitmann, K., Higdon, D., White, M., Habib, S., Williams, B.J., Lawrence, E., Wagner, C. The Coyote universe. II. Cosmological models and precision emulation of the nonlinear matter power spectrum *Astrophys. J. 705* (2009), 156.

13. Heitmann, K., Luki, Z., Fasel, P., Habib, S., Warren, M.S., White, M., Ahrens, J., Ankeny, L., Armstrong, R., O'Shea, B., Ricker, P.M., Springel, V., Stadel, J., Trac, H. The cosmic code comparison project. *Comput. Sci. Dis. 1* (2008), 015003.

14. Heitmann, K., Ricker, P.M., Warren, M.S., Habib, S. Robustness of cosmological simulations. I. Large-scale Structure. *Astrophys. J. Supp. 160* (2005), 28.

15. Hockney, R.W., Eastwood, J.W. *Computer Simulation Using Particles.* Adam Hilger, New York, 1988.

16. Peebles, P.J.E., Structure of the coma cluster of galaxies. *Astron. J. 75* (1970), 13.

17. Peebles, P.J.E. *The Large-Scale Structure of the Universe.* Princeton University Press, Princeton, New Jersey 1980.

18. Pfalzner, S., Gibbon, P. *Many-Body Tree Methods in Physics.* Cambridge University Press, 1996; see also Barnes, J., Hut, *P. Nature 324*, 446 (1986); Warren, M.S., Salmon, J.K. Technical Paper, Supercomputing, Cambridge University Press, New York, USA 1993.

19. Pope, A., Habib, S., Lukic, Z., Daniel, D., Fasel, P., Desai, N., Heitmann, K. *Comput. Sci. Eng. 12* (2010), 17.

20. The accelerated universe. Power, C., Navarro, J.F., Jenkins, A., Frenk, C.S., White, S.D.M., Springel, V., Stadel, J., Quinn, T. The inner structure of ACDM haloes - I. A numerical convergence study. *Mon. Not. R. Astron. Soc. 338* (2003), 14.

21. Shandarin, S.F., Zeldovich, Ya.B. The large-scale structure of the universe: Turbulence, intermittency, structures in a self-gravitating medium. *Rev. Mod. Phys. 61* (1989), 185.

22. Springel, V. The cosmological simulation code GADGET-2. *Mon. Not. R. Astron. Soc. 364* (2005), 1105.

23. Teyssier, R. Cosmological hydrodynamics with adaptive mesh refinement. A new high resolution code called RAMSES. *A&A 385* (2002), 337.

24. White, M. The mass of a halo. *Astron. and Astrophys. 367* (2001), 27.

25. White, M., Pope, A., Carlson, J., Heitmann, K., Habib, S., Fasel, P., Daniel, D., Luki, Z. Particle mesh simulations of the Ly forest and the signature of Baryon acoustic oscillations in the intergalactic medium. *Astrophys. J. 713* (2010), 383.

### Footnotes

a. http://www.sdss.org/surveys/boss/.

The original version of this paper was published in *SC'13, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis* (2013).

The submitted manuscript has been created by UChicago Argonne, LLC, Operator of Argonne National Laboratory ("Argonne"). Argonne, a U.S. Department of Energy Office of Science laboratory, is operated under Contract No. DE-AC02-06CH11357. The U.S. Government retains for itself, and others acting on its behalf, a paid-up nonexclusive, irrevocable worldwide license in said article to reproduce, prepare derivative works, distribute copies to the public, and perform publicly and display publicly, by or on behalf of the Government.

### Figures

Figure 1. Zoom-in visualization of the density field in a 1.07 trillion particle, 4.25 Gpc box-size HACC simulation with 6 kpc force resolution and particle mass, . The image, taken during a late stage of the evolution, illustrates the global spatial dynamic range covered, 10^{6}, although the finer details are not resolved by the visualization.

Figure 2. Informal representation of the HACC force evaluation hierarchy(1) long/medium-range contributions from a high-order grid-based, spectrally filtered particle-mesh (PM) solver, (2) medium/short-range contributions using a (rank-local) recursive coordinate bisection (RCB) tree algorithm (green region), (3) close-range contributions using direct particleparticle (PP) interactions (magenta). Parameters governing the cross-overs are discussed in the text.

Figure 3. Weak and strong scaling on Titan. Weak scaling is reported for 32 million particles per node. The time per substep per particle is shown as a function of the number of nodes: The performance and time to solution demonstrate essentially perfect scaling (black line). Strong scaling results are for a fixed-size problem1024^{3} particles in a 1.42 Gpc box. The final number of nodes is 8192, approximately half of Titan. Recent improvements in absolute performance are also shown (see Section 5.1).

Figure 4. Weak and strong scaling on the BG/Q; time per substep per particle (red) and overall performance (blue), as a function of the number of cores. The offset black lines indicate ideal scaling. Weak scaling (solid lines with crosses) is reported for 2 million particles per core for up to 96 racks. Performance and time to solution demonstrate essentially perfect scaling. Strong scaling (dashed lines, boxes) follows the same set up as for Titan (Figure 3), the number of cores going from 512 to 16,384. The timing scales nearly perfectly to 8192 cores, then degrades slightly; performance stays high throughout.

**©2017 ACM 0001-0782/17/01**

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from [email protected] or fax (212) 869-0481.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2017 ACM, Inc.