Spending time at SC10, the world's largest supercomputing conference, is a bit like drinking from a fire hose. In addition to the immense amount of technical information being presented, it is a great opportunity to meet with partners and collaborators. I think I only had about 45 minutes of unscheduled time the entire day! Today, I'll be reporting on three events that focused principally on heterogeneous high-performance computing.
The morning's highlight was the keynote address by Bill Dally, the Chief Scientist and Senior Vice President for Research at NVIDIA. He's also my boss (and in the distant past, my dissertation advisor), so don't be expecting any public criticism from me on his lecture! The title of Bill's talk was (get ready, Buzz Lightyear fans): "GPU Computing to Exascale and Beyond." He first focused on the energy efficiency of GPUs, showing the reported performance and energy consumption of systems in today's Top500. For example, the Tianhe-1A heterogeneous GPU computer (#1 on Top500) is about 2.5 times more efficient (flops/watt) than the homogeneous Jaguar machine at #2. The heterogeneous Tsubame 2.0 computer from the Tokyo Institute of Technology (the TiTech machine at #4) is about 50% more efficient than Tianhe-1A. The root cause of this difference is that the CPUs found in the homogeneous machines are optimized for the performance of a single thread, and employ the range of modern microprocessor optimizations including branch prediction, multiple forms of speculation, register renaming, dynamic scheduling, and latency-optimized caches. In contrast, GPUs are optimized for throughput and are able to use about 10x less energy per operation by omitting the power burden that comes with those CPU optimizations. However, such efficiency comes at a price--GPUs perform pretty poorly on single-threaded code.
The impact of the efficiency difference will be magnified by the Exascale computers being planned for 2018. Even assuming a 4x improvement in energy efficiency from technology scaling (smaller transistors) and 4x improvement in architectural efficiency for both CPUs and GPUs, the intrinsic relative inefficiency of CPUs leaves them at least 6x behind the efficiency of GPUs. This 6x could make the difference between a 20 megawatt (yes I said *megawatt*) computer and a 120 megawatt computer. At a million dollars per megawatt per year--this adds up to some real money.
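For a rough sense of scale, here is a back-of-the-envelope sketch in Python that uses only the figures quoted above: a 20 megawatt machine, a 6x efficiency gap, and roughly a million dollars per megawatt per year. The function and variable names are illustrative assumptions of mine, not anything from the talk.

```python
# Back-of-the-envelope power bill. The numbers come from the talk;
# the names and structure are illustrative assumptions.
COST_PER_MW_YEAR = 1_000_000  # dollars per megawatt per year (quoted rule of thumb)
EFFICIENCY_GAP = 6            # CPUs assumed at least 6x less efficient than GPUs


def annual_power_cost(megawatts, cost_per_mw_year=COST_PER_MW_YEAR):
    """Return the yearly electricity cost in dollars for a given power draw."""
    return megawatts * cost_per_mw_year


gpu_mw = 20                        # heterogeneous exascale machine
cpu_mw = gpu_mw * EFFICIENCY_GAP   # same work on a CPU-only machine: 120 MW

extra_per_year = annual_power_cost(cpu_mw) - annual_power_cost(gpu_mw)
print(f"GPU-based machine: ${annual_power_cost(gpu_mw):,} per year")
print(f"CPU-only machine:  ${annual_power_cost(cpu_mw):,} per year")
print(f"Difference:        ${extra_per_year:,} per year")
```

Under those assumptions the gap works out to about $100 million per year in electricity alone, before counting cooling or the facility itself.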
The challenge, of course, is harnessing the capabilities of heterogeneous systems to solve real problems that matter to science and society, and not just crunch out LINPACK scores. Bill made a case that NVIDIA's investment in languages such as CUDA is enabling programmers to port old applications and create new ones--and he gave a couple of examples of intriguing applications. One example in the medical field uses GPUs to reduce CT scan radiation dosage to patients by requiring fewer X-ray scans, which will ultimately reduce cancer rates. Another uses GPUs for molecular dynamics simulation to model the chemistry of surfactants and develop better cleaning products such as shampoo. While these applications are currently run on a small number of GPUs, I expect to see some impressive performance results of real scientific codes on the large heterogeneous GPU machines at the top of the Top500 over the coming months (the Tsubame folks are already reporting some now).
Finally, Bill talked a little about a new extreme-scale research project at NVIDIA called Echelon, funded in part by DARPA under the Ubiquitous High Performance Computing (UHPC) program. NVIDIA has teamed with Cray, Oak Ridge National Labs, and six leading universities to develop high performance, energy-efficient, and resilient architectures and programming systems. Bill showed the vision for a future heterogeneous computing system that eliminates artifacts of today's GPU systems such as separate memory spaces and a comparatively low-bandwidth I/O bus connecting the CPU and GPU. The Echelon design incorporates a large number of throughput-optimized cores and a smaller number of latency-optimized cores on a single chip, sharing a common memory system. Such a chip could deliver 20 TeraFLOPs and could be aggregated to form a 2.6 PetaFLOP rack. Reaching exascale would only require a few hundred racks, which is on-par with the top-end machines of today. To be clear--this is currently a research project, although I expect that some in the blogosphere will depict Dally's talk as an NVIDIA product announcement--ha!
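As a sanity check on the "few hundred racks" figure, the short Python sketch below works through the arithmetic from the per-chip and per-rack numbers quoted above. The chips-per-rack value is my own inference from dividing the two figures, not a number stated in the talk, and the variable names are illustrative.

```python
import math

# Figures quoted in the talk.
CHIP_FLOPS = 20e12    # 20 TeraFLOPs per chip
RACK_FLOPS = 2.6e15   # 2.6 PetaFLOPs per rack
EXAFLOP = 1e18        # target: one ExaFLOP

# Inferred, not stated in the talk: how many chips a rack would need to hold.
chips_per_rack = round(RACK_FLOPS / CHIP_FLOPS)

# Racks needed to reach an exaflop, rounded up to whole racks.
racks_for_exaflop = math.ceil(EXAFLOP / RACK_FLOPS)

print(f"Chips per rack (inferred): {chips_per_rack}")
print(f"Racks for one exaflop:     {racks_for_exaflop}")
print(f"Total chips (inferred):    {chips_per_rack * racks_for_exaflop}")
```

This gives roughly 130 chips per rack and about 385 racks for an exaflop of peak performance, which is consistent with "a few hundred racks."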
My biased opinion is that Bill painted a pretty compelling picture of the direction toward realizable exascale systems, but I don't think everyone in the audience was convinced. Some of the rumblings I heard indicated skepticism about whether heterogeneous computing systems could be programmed. My own analysis (which I presented in the "Round 2" panel described below) is that there is no choice. Energy constraints compel systems to deploy energy-optimized, throughput-oriented cores in concert with latency-optimized cores. For better or worse, the community needs to work together to define programming models that can exploit such systems and to ensure that the hardware includes mechanisms that enable the programming models to be effective and efficient.
Round 2 on heterogeneous computing systems was a panel organized by Jeff Vetter of Oak Ridge National Labs. Chuck Moore of AMD gave what I thought was an insightful definition of what heterogeneous computing is and is not. Heterogeneous computing cannot be a "Frankensystem" assembled from arbitrarily attached hardware and software. Nor is it a silver bullet that endows a computer with magical powers of efficiency or capability. Nor does it trivially solve the power/performance/programmability challenges facing high-performance computing. Instead, it is a system architecture framework that facilitates communication and passing of work between different concurrent subsystems. Further, it must enable nonprogrammers to easily exploit specialized or programmable functions. My sense was that the panel agreed that the next stage for heterogeneous computing platforms is to promote the "accelerator" to be a first-class citizen in the computing system. AMD has clearly embraced the hardware side of this equation with their recently announced lineup of "Fusion" chips. However, larger challenges lie in the software and programming systems. Kathy Yelick, director of NERSC, highlighted this challenge with her observation that there will clearly be one programming paradigm disruption on the road to exascale from the MPI/MPI+OpenMP models of today. She doesn't believe the community will tolerate two disruptions, and thinks we had better get the programming model right, and soon.
Round 3 was a panel session organized by Wu Feng, associate professor of computer science at Virginia Tech and keeper of the Green 500 list, on the three P's of heterogeneous computing: Performance, Power, and Programmability. Mike Houston of AMD made a compelling case that applications commonly exhibit braided parallelism, characterized by lots of conditional data parallelism--an assertion met by nods throughout the panel. The most animated member of the panel was Tim Mattson of Intel, who asserted that the importance of programmability far outweighed that of performance or power; if you can't program the machine, who cares what its peak performance or theoretical efficiency is. He was even more outspoken in his stance on open software standards, saying that no single company, including Intel, should control the programming language. This assertion later turned into a more heated discussion about the merits of OpenCL versus CUDA. While I won't bother identifying all of the protagonists in this debate, some panelists made an impassioned case about the risks highlighted by Mattson. Others made the case that the technology is still in its infancy and that a rush to standardize before the technology matures will stifle innovation. Despite that melodrama, the panelists heartily agreed that software is the biggest challenge (anyone sense a theme here?).
Both of the panel sessions were extremely popular, with standing room only in the rooms, an overflow room, and great audience participation. Jeff Vetter experimented with an audience feedback system that enabled anyone in the audience to respond to survey questions by texting from their cell phones. Once everyone is used to this kind of audience participation, I think it will be a pretty cool tool for future panel sessions.
That's all for now--more tomorrow.
Steve Keckler is the Director of Architecture Research at NVIDIA and Professor of Computer Science and Electrical and Computer Engineering at the University of Texas at Austin. He has conducted research in parallel computer architecture for 20 years and has co-authored more than 100 publications on the subject. Keckler's research and teaching awards include a Sloan Foundation Fellowship, an NSF CAREER award, the ACM Grace Murray Hopper award, the Edith and Peter O'Donnell award for Engineering, and the President's Associates Teaching Excellence Award at UT-Austin.