Fugaku Takes the Lead

Japan’s Arm-based Fugaku supercomputing system has been acknowledged as the world’s most powerful supercomputer. In June 2020, the system earned the top spot in the Top500 ranking of the 500 most powerful commercially available computer systems on the planet, for its performance on a longstanding metric for massive scientific computation. Although modern supercomputing tasks often emphasize somewhat different capabilities, Fugaku leads by other measures as well.

“It’s amazing on all benchmarks. This architecture just wins big time,” said Torsten Hoefler of the Swiss Federal Institute of Technology (ETH) Zurich. “It is a super-large step.” Hoefler shared the 2019 ACM Gordon Bell Prize with an ETH Zurich team for simulations of heat and quantum electronic flow in nanoscale transistors, performed in part on the previous Top500 leader, the Summit system at the U.S. Department of Energy’s Oak Ridge National Laboratory (ORNL) in Tennessee.

Fugaku’s performance on the Top500’s High-Performance Linpack (HPL) benchmark is an impressive 0.4 exaflop/s (one exaflop/s is 10¹⁸ floating-point operations per second), besting Summit by a factor of 2.8 for double-precision (64-bit) arithmetic. For faster, lower-precision operations, the Fugaku system has already exceeded an exaflop/s.

Figure. The Fugaku supercomputer, currently the world’s fastest, at the RIKEN Center for Computational Science in Kobe, Japan.

In his acceptance of the Top500 award, however, Satoshi Matsuoka, director of the Japanese government-funded RIKEN Center for Computational Science (R-CCS) in Kobe, stressed that the design, done in close collaboration with Fujitsu, was motivated by performance on real-world applications. “Our intention was never to build a machine that only beat the benchmarks,” said Matsuoka, who shared the ACM Gordon Bell Prize with a team of colleagues in 2011.

Top500 pioneer Jack Dongarra, of ORNL and the University of Tennessee at Knoxville, said three new systems in the U.S., and possibly others in China, were expected to achieve exaflop/s performance on 64-bit arithmetic within the next year. Even if its supremacy is fleeting, the Fugaku architecture includes innovations, notably vector arithmetic, that could ease programming and exemplify an alternate paradigm for designing high-performance computers.

Race to the Top

The Top500 list includes 500 powerful systems from around the world, but the few near the top get the most attention. These systems tend to be funded as national resources in major facilities like U.S. national laboratories and RIKEN, a research institute supported by the Japanese government. In this, and in their cost, the leading supercomputers are similar to scientific instruments like the Hubble Space Telescope. “The Fugaku machine is reported to be $1 billion U.S.” to develop and build, Dongarra said. “They’re pushing the technology and you pay a price for that.” Fugaku comprises 158,976 nodes (more than 7 million CPU cores) distributed among 432 racks. Including the support infrastructure, it draws some 30MW of electricity, enough to power some 20,000 U.S. homes.
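The scale figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers from this article, plus one assumption that is not stated here: that each node holds a single 48-core processor.

```python
# Figures quoted in the article.
NODES = 158_976
RACKS = 432
POWER_WATTS = 30e6   # about 30MW, including support infrastructure
HOMES = 20_000

# Assumption (not stated in the article): one 48-core processor per node.
CORES_PER_NODE = 48

total_cores = NODES * CORES_PER_NODE
avg_nodes_per_rack = NODES / RACKS
watts_per_home = POWER_WATTS / HOMES

print(f"{total_cores:,} CPU cores in total")
print(f"{avg_nodes_per_rack:.0f} nodes per rack on average")
print(f"{watts_per_home:.0f} W per home implied by the comparison")
```

Under that assumption the total is consistent with "more than 7 million" cores, and the power comparison implies an average draw of about 1.5kW per home.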

Unlike the Hubble, which only does astronomy, these systems run simulations that illuminate a diverse range of scientific challenges. “The top 10 machines are really built to solve problems that no other machine can solve,” said Hoefler, including “the big challenge problems in society” such as climate change, brain research, and recently the COVID-19 crisis. Their general-purpose design makes them slightly less efficient than a specialized machine, but ensures broad funding support. Their flagship status also precludes specialized chips, such as those being developed for machine learning. “I think people would think twice before they build a $200-million machine based on those chips,” Hoefler said, especially because the algorithms used for cutting-edge computation continue to evolve rapidly.

Fugaku is built around a Fujitsu processor designated A64FX, developed for this system in collaboration with ARM. It is expected to find use in other high-powered computers as well, including one system being developed by Cray and others marketed by Fujitsu. “The architecture that is pioneered by systems in the Top500 is going to be used in industry in order to solve real engineering problems,” Hoefler said.

Nonetheless, basing Fugaku on a dedicated chip is a departure from recent top supercomputer architectures, which leverage higher-volume chips designed for less-demanding applications. This approach offloads many costs of design and development needed to keep pace with advancing semiconductor technology. The off-the-shelf approach has its own risks, though. In the summer of 2020, Intel announced manufacturing problems with its latest chips, which may result in delays for the U.S.-based exascale supercomputers that will incorporate them.

Each A64FX chip, manufactured using TSMC’s 7nm FinFET process, contains almost 9 billion transistors and features 48 Armv8.2-A CPUs, whose reduced-instruction-set computing (RISC) design contrasts with most of the processors employed in the Top500. Dongarra says 94% of the Top500 machines use Intel processors, which offer complex-instruction-set computing (CISC) to programmers, while only three currently use ARM. Summit, however, uses the Power9 processor from IBM, which also has a RISC architecture.

TSMC’s Chip-on-Wafer-on-Substrate (CoWoS) process is used to stack high-bandwidth memory (HBM2) on top of the processor chip. “Our studies show that bandwidth is very important to sustain the speedup of the applications,” Matsuoka stressed. The chips also provide interfaces with an updated version of the Tofu interconnect, a system with a six-dimensional torus topology that was previously developed by Fujitsu.
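To give a feel for what a torus topology means, the following minimal sketch lists the direct neighbors of a node in a generic six-dimensional torus. It is not a model of Tofu itself: the function name and the dimension sizes are hypothetical, chosen only for illustration.

```python
def torus_neighbors(coord, dims):
    """Return the distinct direct neighbors of coord in a torus of shape dims."""
    neighbors = set()
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            nxt = list(coord)
            # Each dimension is a ring, so stepping past the edge wraps around.
            nxt[axis] = (coord[axis] + step) % size
            nxt = tuple(nxt)
            if nxt != coord:  # a ring of size 1 has no neighbor on that axis
                neighbors.add(nxt)
    return neighbors


# Hypothetical dimension sizes, for illustration only.
DIMS = (4, 4, 4, 3, 3, 3)
origin = (0, 0, 0, 0, 0, 0)

links = torus_neighbors(origin, DIMS)
print(len(links), "direct neighbors")
```

With every dimension of size three or more, each node has two neighbors per dimension, so a six-dimensional torus gives every node 12 direct links regardless of where it sits; there are no edge or corner nodes.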

Revenge of Vector Architecture

From an architectural perspective, the most dramatic choice is what Fugaku does not have: graphics processor units, or GPUs. These increasingly powerful computation-intensive chips, often made by Nvidia or AMD, frequently are used as cost-effective accelerators to offload intensive parallel computations from CPUs for both high-performance scientific computations and machine learning.

Instead, Fugaku’s CPUs incorporate instructions that ARM calls Scalable Vector Extension (SVE). Compared to GPUs, this vector architecture is “a more elegant and easier-to-compile architecture that’s trying to take advantage of that same level of parallelism,” said David Patterson, professor emeritus at the University of California at Berkeley and co-recipient (with John Hennessy) of the 2017 ACM A.M. Turing Award. “You can explain how it works to scientists, it’s got an elegance that lets it scale to very powerful computers with time, and it’s easy to compile for.”

“It has been a long time since the fastest computer on the Top500 had a vector processor in it,” Patterson noted. “Is that what things are going to look like more in the future? That’s going to be interesting to watch.”

Although fixed-length vector operations have been implemented elsewhere, SVE harkens back to the type of vector operations originally envisioned by Seymour Cray in his early supercomputers. “It’s not a fixed-size vector but a variable-size vector, where you can vectorize whole loops,” Hoefler said.
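The idea of vectorizing a whole loop without fixing the vector size can be illustrated with a small simulation. The sketch below is plain Python, not SVE code: the function name, the lane counts, and the list-based "predicate" are assumptions made for illustration. It processes an array in chunks of however many lanes the hardware is taken to offer, and masks off the lanes that fall past the end of the array.

```python
def vla_axpy(a, x, y, vector_length):
    """Compute a*x + y one vector at a time, for any number of lanes."""
    n = len(x)
    out = [0.0] * n
    i = 0
    while i < n:
        # Predicate: True for lanes that still fall inside the array.
        active = [i + lane < n for lane in range(vector_length)]
        for lane, on in enumerate(active):
            if on:  # inactive lanes do nothing, so no scalar tail loop is needed
                out[i + lane] = a * x[i + lane] + y[i + lane]
        i += vector_length
    return out


x = [float(i) for i in range(10)]
y = [1.0] * 10

# The same loop body runs unchanged for different (hypothetical) lane counts.
for lanes in (2, 8, 16):
    print(lanes, vla_axpy(2.0, x, y, lanes))
```

The point of the sketch is that the loop never mentions a specific vector width, so the same code gives the same result whether the machine handles 2 lanes or 16 at a time.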

GPUs traditionally force users to identify throughput-sensitive code and explicitly specify fine-grain parallelism for those operations. “In the Fugaku system, you don’t need to do that,” Hoefler said. “Fugaku is kind of the first serious implementation of those [ideas], at least since Cray’s time. Those could be really easier to program. I’m super-excited about this.”

CPUs also typically have needed more power than GPUs, but in the A64FX, “our power efficiency is pretty much in the range of GPUs or the latest breeds of specialized accelerators while being a general-purpose CPU,” Matsuoka said. “This was because we really tuned for high-performance computing.”

Decades of Progress

The Top500 has been tracking the exponential improvement in supercomputer performance since 1993, based on the Linpack benchmark Dongarra developed in 1979. At the time, he said, floating-point operations were expensive, so 64-bit matrix multiplications formed the core of the benchmark. The same metric is still used to judge the Top500 today.

Parallel computing has become particularly important as clock speeds on individual processors hit a ceiling due to chip heating and other issues. However, because any calculation has some parts that must be done serially, adding more processors in parallel gives diminishing returns in speedup.
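This diminishing return is usually expressed as Amdahl's law: if a fraction p of the work can run in parallel on N processors, the speedup is 1 / ((1 - p) + p / N). The short sketch below evaluates that formula; the 99% parallel fraction is an illustrative assumption, not a figure from any system discussed here.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Speedup predicted by Amdahl's law for a fixed-size problem."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


# Illustrative assumption: 99% of the work parallelizes perfectly.
P = 0.99
for n in (10, 100, 1_000, 1_000_000):
    print(f"{n:>9,} processors -> speedup {amdahl_speedup(P, n):6.1f}")
```

Even with a million processors, the speedup in this example stays below 100, because the 1% serial portion comes to dominate the running time.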

Nonetheless, more parallel processors do let researchers attack larger problems efficiently. “Not everybody wants to solve the same problem faster,” said Patterson. “Linpack really embraced that and allows people to solve any matrix size they want. The bigger the computer, the bigger the matrix. I don’t know how many people want to solve a problem that’s 10 million by 10 million dense matrix on a side, but that’s the problem they’re solving.” When Linpack was introduced, “these big matrices were the total workload that people were running on those machines,” agreed Hoefler, but “following Moore’s Law for 40 years, the matrices that people can solve on these machines today are way larger than what anybody would do in practice.”
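A rough calculation shows how large the problem Patterson describes is. The sketch below assumes 8 bytes per 64-bit entry and uses the standard leading-order operation count for solving a dense system by LU factorization, 2n³/3, ignoring lower-order terms. It also assumes the 0.4 exaflop/s rate quoted earlier is sustained throughout, so the time is an optimistic estimate rather than a measured result.

```python
N = 10_000_000            # matrix dimension from the quote
BYTES_PER_ENTRY = 8       # 64-bit floating point
RATE_FLOPS = 0.4e18       # 0.4 exaflop/s, assumed sustained

memory_bytes = N * N * BYTES_PER_ENTRY
lu_flops = 2 * N**3 / 3   # leading-order term only
seconds = lu_flops / RATE_FLOPS

print(f"matrix storage: {memory_bytes / 1e12:.0f} TB")
print(f"operations:     {lu_flops:.2e}")
print(f"time at rate:   {seconds / 60:.0f} minutes")
```

Under these assumptions the matrix alone occupies about 800 terabytes, and factoring it takes close to half an hour even at Fugaku's benchmark speed.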

“While it’s interesting from a historical perspective, it probably doesn’t really reflect the kind of performance we see for what I’ll call normal applications run on supercomputers,” Dongarra acknowledged. In particular, he said, even in intensive scientific calculations, such as solving the partial differential equations that appear in simulations of complex three-dimensional systems such as climate models, the matrices are sparse, meaning they have only a small number of non-zero entries, arranged in predictable patterns.
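A small example makes the contrast with dense matrices concrete. The sketch below stores a matrix in compressed sparse row (CSR) form, a common layout that keeps only the non-zero entries, and multiplies it by a vector. The test matrix is a one-dimensional discrete Laplacian, a simple stand-in for the kind of matrix that arises from partial differential equations; the function names are illustrative.

```python
def laplacian_1d_csr(n):
    """Build the n-by-n tridiagonal matrix (-1, 2, -1) in CSR form."""
    values, col_index, row_ptr = [], [], [0]
    for i in range(n):
        for j, v in ((i - 1, -1.0), (i, 2.0), (i + 1, -1.0)):
            if 0 <= j < n:
                values.append(v)
                col_index.append(j)
        row_ptr.append(len(values))  # where the next row starts
    return values, col_index, row_ptr


def csr_matvec(values, col_index, row_ptr, x):
    """Multiply a CSR matrix by x, touching only stored non-zeros."""
    y = []
    for row in range(len(row_ptr) - 1):
        total = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            total += values[k] * x[col_index[k]]
        y.append(total)
    return y


values, col_index, row_ptr = laplacian_1d_csr(5)
y = csr_matvec(values, col_index, row_ptr, [1.0] * 5)
print(len(values), "stored non-zeros out of", 5 * 5)
print(y)
```

Each output element here needs at most three multiply-adds but several indirect memory lookups, which is why sparse workloads tend to be limited by memory bandwidth rather than by arithmetic throughput.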

To assess such sparse-matrix operations, the Top500 team also tracks the HPCG (high-performance conjugate gradients) benchmark. In addition, machine-learning applications typically don’t require full 64-bit accuracy, so Dongarra and his colleagues have introduced a lower-precision version of HPL called HPL-AI. Fugaku ranks highest on both of these benchmarks as well, achieving 1.4 exaflop/s on HPL-AI.

Nonetheless, Patterson worries “whether the Linpack benchmark is leading to architecture innovations that allow important algorithms, or … we’re just creating one-trick ponies.” He has been supporting an alternative, known as MLPerf, which includes both the training and inference aspects of machine learning. It features a suite of tasks that are frequently updated, including, for example, a large-scale language model within two years of the research paper that introduced it. MLPerf also has an “open” category that leaves the implementation unspecified, to encourage algorithmic innovation. “The benchmark challenge is, how do you have a fair challenge and encourage innovation?” Patterson noted.

Still, Hoefler thinks the continuity of the Top500 provides important context for machines like Fugaku, and notes that machine learning algorithms still rely heavily on the same fused multiply-add operations that power matrix multiplications. “HPL is less relevant than it was, but I believe that it’s incredibly important from a historic perspective.”
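The role of the multiply-add is easy to see in a textbook matrix multiplication, sketched below with an added counter. Python performs the multiply and the add as two separate steps; hardware with a fused multiply-add instruction would do each marked line as a single operation. The function name and the counter are illustrative.

```python
def matmul_count_fma(a, b):
    """Multiply matrices a and b, counting multiply-add steps."""
    rows, inner, cols = len(a), len(b), len(b[0])
    c = [[0.0] * cols for _ in range(rows)]
    fma_count = 0
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for k in range(inner):
                acc = acc + a[i][k] * b[k][j]  # one multiply-add
                fma_count += 1
            c[i][j] = acc
    return c, fma_count


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c, fma_count = matmul_count_fma(a, b)
print(c, fma_count)
```

The innermost line is the entire computation: multiplying two n-by-n matrices this way takes n³ multiply-adds and nothing else, which is the same pattern found in the dense layers of neural networks.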

Further Reading

Top500: The List, www.top500.org

Report on the Fujitsu Fugaku System, Jack Dongarra, June 2020, https://bit.ly/2EQS6Yt

MLPerf Benchmarks, https://mlperf.org/