Computer designers are becoming increasingly concerned about the ending of Moore's Law, and what it means for users if the industry can no longer count on the idea that the density of logic circuits will double every two years, as it has for close to half a century. It may mean radical changes to the way users think about software.
Leading researchers in semiconductor design point out that, although logic density is butting up against physical limits, it does not necessarily spell the end of Moore's Law itself. Gordon Moore's speech at the 1975 International Electron Device Meeting (IEDM) predicted significant increases in chip size and improvements in circuit design as part of the scaling process, in addition to regular reductions in transistor size and interconnect spacing.
During a September virtual meeting of the IEEE International Roadmap for Devices and Systems group, chairman and Intel director of technology strategy Paolo Gargini argued, "Though Gordon made this clear, people have concentrated only on dimensional scaling. That's the reason why people have doubts about the next technology nodes. It appears as though we are in a crisis, but we are not, because of the other two components."
"Circuit cleverness" as described by Moore in his 1975 speech, has made a strong contribution in recent years. Philip Wong, professor of electrical engineering at Stanford University, says greater cooperation between circuit designers and the engineers who work on the core process technology has made it possible to eke more gains out of each new node than would be possible using just dimensional scaling. Advances such as burying power rails under transistors and stacking transistors should continue to provide some gains for perhaps two or three generations, out to the latter half of this decade. The remaining directions for future improvements at the physical level are to build out in terms of area by adding more layers of logic gates or other devices. Some warn, however, that this direction has its own imitations.
Neil Thompson, a research scientist at the Massachusetts Institut of Technology (MIT), says, "When you look at 3D (three-dimensional) integration, there are some near-term gains that are available. But heat-dissipation problems get worse when you place things on top of each other.
"It seems much more likely that this will turn out to be similar to what happened with processor cores. When multicore processors appeared, the promise was to keep doubling the number of cores. Initially we got an increase, and then got diminishing returns."
One option is to make more efficient use of the available transistor count. In the lecture to commemorate their 2017 ACM A.M. Turing Award, John Hennessy and David Patterson argued there is a rich vein to mine in highly specialized accelerators as a way of providing the performance that Moore's Law may not be able to support. Such accelerators dispense with the heavy overhead of general-purpose computing, much of it due to highly wasteful memory accesses caused by repeated instruction and data fetches.
Paul Kelly, professor of software technology at Imperial College, London, uses the term "Turing tariff" to refer to the cost of performing functions using general-purpose hardware. The term is based on the idea that the theoretical machine proposed by Alan Turing could perform any function, but not necessarily efficiently. An accelerator pays a lower Turing tariff for its intended functions because operations that are implicit in the module's circuitry need to be explicitly defined in software when run on a general-purpose processor.
A potential major advantage of moving to accelerator-rich designs in the future is that they do not even have to be confined to using conventional digital logic. The greater emphasis on artificial intelligence (AI) in mainstream computing has encouraged designers to look at alternatives to the CMOS technology used for today's processors that either perform processing in the analog domain or use novel switching devices based on electron spin or superconducting techniques to make dramatic energy savings. Though they suffer from poor accuracy and noise, analog and in-memory processors can shrink multipliers that need hundreds or thousands of transistors in the logic domain into just a handful.
Charles Leiserson, professor of computer science and engineering at MIT, says, "There is a lot of really interesting stuff in these approaches that will be helpful for specific, narrow applications. I continue to be impressed by hardware accelerators."
Users in high-performance computing fields such as machine learning have found accelerators, even with customized code, fail to sustain high throughput when used as part of larger applications. Job startup times and other overheads mean they often leave much of the available performance unused. "The cost-performance ratio is still with the multicores though," Leiserson adds, because of their relative fungibility and accessibility.
Even with more conventional architectures, communications overheads and the complexity of the memory hierarchy of any multicore implementation can easily trip up developers. "You take out some work from your computation and it slows down, and you say: 'what?' If that's your situation, you can't architect for that," Leiserson says. "We need more performance tools and we need hardware to help more there."
Leiserson and Thompson argue developers should go back to the basics of algorithmic analysis to get better predictability and apply it across entire subsystems. "The great achievement of algorithms is that you can predict coarse behavior by doing a back of the envelope analysis using big-O notation. Even if the constant in front of N is large, N-squared is going to be much worse," Leiserson says.
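Leiserson's point about constants can be checked with a few lines of arithmetic. The sketch below is only an illustration: the two cost functions and the constant of 1,000 are assumptions chosen for the example, not figures from the researchers.

```python
def linear_cost(n, c):
    """Steps taken by a hypothetical O(N) algorithm with a large constant c."""
    return c * n


def quadratic_cost(n):
    """Steps taken by a hypothetical O(N^2) algorithm with a constant of 1."""
    return n * n


def crossover(c):
    """Smallest n at which the quadratic algorithm becomes strictly slower."""
    n = 1
    while quadratic_cost(n) <= linear_cost(n, c):
        n += 1
    return n


if __name__ == "__main__":
    c = 1000  # assumed constant, chosen only for illustration
    print("crossover at n =", crossover(c))
    for n in (10**3, 10**6, 10**9):
        ratio = quadratic_cost(n) / linear_cost(n, c)
        print(f"n = {n:>10}: quadratic / linear = {ratio:g}")
```

Beyond the crossover point the gap grows in proportion to N, which is why the back-of-the-envelope analysis stays useful even when the constant is large.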
Researchers see potential improvements in code-generation technologies that understand the underlying hardware and its constraints far better than today and remain portable across target architectures through the use of runtime optimization and scheduling.
Jerónimo Castrillón, chair of compiler construction at Germany's Dresden Technical University, points to work at that institution into runtime software that can help manage workloads. "You can look at what hardware features you have and percolate them through the stack into the application programming interfaces. For that to work, you need to carry models of the application."
For example, if an accelerator is unavailable to one module because it is needed by another already running, the scheduler might opt for an alternative compiled for a more general-purpose core instead of holding up the entire application, assuming the compiled code contains enough information to make the analysis possible.
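A minimal sketch of that fallback decision follows. It assumes each module ships with several precompiled variants plus a runtime estimate for each; the names Variant, Module, and pick_variant, and the numbers in the usage example, are hypothetical and do not describe the Dresden runtime.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class Variant:
    target: str            # hardware the variant was compiled for, e.g. "accelerator"
    est_runtime_ms: float  # runtime estimate carried alongside the compiled code


@dataclass(frozen=True)
class Module:
    name: str
    variants: List[Variant]


def pick_variant(module: Module, busy_targets: Set[str]) -> Optional[Variant]:
    """Return the fastest variant whose target is free, or None if all are busy."""
    available = [v for v in module.variants if v.target not in busy_targets]
    if not available:
        return None  # the caller may then decide to wait
    return min(available, key=lambda v: v.est_runtime_ms)


if __name__ == "__main__":
    fft = Module("fft", [Variant("accelerator", 2.0), Variant("cpu", 15.0)])
    print(pick_variant(fft, busy_targets=set()))
    print(pick_variant(fft, busy_targets={"accelerator"}))
```

The decision is only as good as the estimates attached to each variant, which is the reason the compiled code has to carry a model of the application with it.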
Castrillón believes a shift to domain-specific languages (DSLs) for performance-sensitive parts of the application may be needed, because these can capture more of the developer's intent. "Usually people think you lose performance if you go to higher levels of abstraction, but it's not the case if you do the abstractions right."
Adds Kelly, "With a DSL, the tools can understand that one part is a graph, this other part is a mesh, whereas all a [C or C++] compiler can see is lowered code. Then the compiler is forced to make that uphill struggle to infer what is meant to happen."
Adaptive heterogeneous systems raise problems of verification and debug: how does the programmer know that a particular implementation still works when it has been re-optimized for a certain fabric at a certain time? One possibility is to use similar formal verification techniques to those employed by hardware designers to check that circuits are functionally equivalent to each other after they have been optimized.
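The idea behind equivalence checking can be shown on a toy scale. The sketch below compares two Boolean functions on every possible input; this brute-force approach is a stand-in chosen for brevity, since production tools are generally understood to rely on SAT solvers or decision diagrams rather than enumeration. The three example functions are hypothetical.

```python
from itertools import product


def equivalent(f, g, n_inputs):
    """Check that f and g agree on all 2**n_inputs Boolean input combinations."""
    return all(
        f(*bits) == g(*bits)
        for bits in product((False, True), repeat=n_inputs)
    )


def original(a, b, c):
    # Reference logic: (a AND b) OR (a AND c)
    return (a and b) or (a and c)


def optimized(a, b, c):
    # Factored form that needs one fewer AND gate
    return a and (b or c)


def broken(a, b, c):
    # An incorrect "optimization" that drops the c term
    return a and b


if __name__ == "__main__":
    print("optimized matches original:", equivalent(original, optimized, 3))
    print("broken matches original:", equivalent(original, broken, 3))
```

Enumeration doubles in cost with every added input, so it only works for small blocks, but the question being asked is the same one a formal tool answers.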
The issue of verification becomes far more difficult with accelerators that operate in the analog, rather than the digital, domain. These do not take the same approach to numerical precision, and their results come with bounded errors rather than exact values.
"Let's get real about investing in performance engineering. We can't just leave it to the technologists to give us more peformance every year."
AI developers have become accustomed to using loss functions and similar metrics to determine whether neural networks that operate at reduced precision or employ other approximation techniques will perform satisfactorily. Yet there are no methods for doing similar analyses of other types of program, such as physics simulation, where users expect to work with fixed, high-precision formats.
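To make the gap concrete, the sketch below measures the error of a reduced-precision dot product against a double-precision reference, the kind of check a numerical analysis would have to generalize. Everything in it is an assumption made for illustration: the uniform symmetric quantizer, the input range of [-1, 1], and the sample vectors.

```python
def quantize(x, bits, max_abs=1.0):
    """Round x to a signed uniform grid with the given bit width (bits >= 2)."""
    levels = 2 ** (bits - 1) - 1
    step = max_abs / levels
    clipped = max(-max_abs, min(max_abs, x))
    return round(clipped / step) * step


def dot(a, b):
    """Plain dot product in Python floats (IEEE double precision)."""
    return sum(x * y for x, y in zip(a, b))


def relative_error(a, b, bits):
    """Relative error of the quantized dot product; assumes the exact result is nonzero."""
    exact = dot(a, b)
    approx = dot([quantize(x, bits) for x in a],
                 [quantize(y, bits) for y in b])
    return abs(approx - exact) / abs(exact)


if __name__ == "__main__":
    a = [0.1 * i for i in range(1, 9)]
    b = [0.9 - 0.1 * i for i in range(1, 9)]
    for bits in (4, 8, 16):
        print(f"{bits:>2} bits: relative error = {relative_error(a, b, bits):.2e}")
```

A measurement like this covers one input. A physics code would need a bound that holds across all inputs and over many time steps, which is the analysis that does not yet exist.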
Kelly says more comprehensive numerical analysis will be vital to determining how well an analog accelerator can substitute for a more energy-hungry digital processor. Conventional formal-verification methods, today commonly used in hardware design to check that circuit optimizations are correct, do not handle uncertainty. Castrillón says advances in that field, such as probabilistic model checking, may provide a path towards tools that are able to verify the suitability of generated code for an application without demanding bit-level equivalence.
"I don't know if those things will compose. Or, you can have strong formal analysis on a large system," Castrillón says.
If composability is not possible, it might fall on programmers to define the levels of accuracy they can tolerate and, if a platform cannot meet them, to allocate the affected code modules to digital processors that consume more energy or perform the task more slowly.
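Such a policy could be sketched as follows, assuming each platform advertises a worst-case relative error and an energy cost per operation. The Platform type, the assign function, and all of the numbers are hypothetical placeholders rather than measurements of any real device.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Platform:
    name: str
    error_bound: float    # assumed worst-case relative error of results
    energy_per_op: float  # assumed energy cost, in arbitrary units


def assign(tolerances: Dict[str, float], platforms: List[Platform]) -> Dict[str, str]:
    """Map each module to the lowest-energy platform that meets its tolerance."""
    placement = {}
    for module, tolerance in tolerances.items():
        acceptable = [p for p in platforms if p.error_bound <= tolerance]
        if not acceptable:
            raise ValueError(f"no platform meets the tolerance of {module}")
        placement[module] = min(acceptable, key=lambda p: p.energy_per_op).name
    return placement


if __name__ == "__main__":
    platforms = [
        Platform("analog", error_bound=1e-2, energy_per_op=1.0),
        Platform("digital", error_bound=1e-15, energy_per_op=20.0),
    ]
    tolerances = {"nn_inference": 5e-2, "physics_step": 1e-9}
    print(assign(tolerances, platforms))
```

The programmer supplies only the tolerances. Which modules end up on the more energy-hungry digital processor then follows from the error bounds each platform can guarantee.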
Although automated code generators may be able to make better use of accelerators than they can do today, there is likely to remain a tension between them and general-purpose cores. Leiserson says while energy concerns push the balance in favor of special-purpose accelerators, generality will likely remain important. "If you have special-purpose hardware, to justify the area it uses, you better be able to use it most of the time."
If hardware generality continues to prove to be more viable, the main path to energy efficiency and performance in the transition away from the traditional approaches to scaling will be algorithmic in nature, Leiserson concludes. "Let's get real about investing in performance engineering. We can't just leave it to the technologists to give us more performance every year. Moore's Law made it so they didn't have to worry about that so much, but the wheel is turning."
Wong, H.-S.P., Akarvardar, K., Bokor, J., Hu, C., King-Liu, T.-J., Mitra, S., Plummer, J.D., and Salahuddin, S.
A Density Metric for Semiconductor Technology, Proceedings of the IEEE, Vol. 108, No. 4, April 2020
Hennessy, J.L. and Patterson, D.A.
A New Golden Age for Computer Architecture, Commun. ACM, Vol. 62, No. 2 (Feb. 2019)
Leiserson, C.E., Thompson, N.C., Emer, J.S., Kuszmaul, B.C., Lampson, B.W., Sanchez, D., and Schardl, T.B.
There's Plenty of Room at the Top: What Will Drive Computer Performance After Moore's Law? Science, 2020 June 5, 368(6495)
Völp, M. et al.
The Orchestration Stack: The Impossible Task of Designing Software for Unknown Future Post-CMOS Hardware, 1st International Workshop on Post-Moore's Era Supercomputing (2016)
©2021 ACM 0001-0782/21/2