A look under the hood of any major search, commerce, or social-networking site today will reveal a profusion of "deep-learning" algorithms. Over the past decade, these powerful artificial intelligence (AI) tools have been increasingly and successfully applied to image analysis, speech recognition, translation, and many other tasks. Indeed, the computational and power requirements of these algorithms now constitute a major and still-growing fraction of datacenter demand.
Designers often offload many of the highly parallel calculations to commercial hardware, especially graphics-processing units (GPUs) originally developed for rapid image rendering. These chips are especially well-suited to the computationally intensive "training" phase, which tunes system parameters using many validated examples. The "inference" phase, in which deep learning is deployed to process novel inputs, requires greater memory access and fast response, but has also historically been implemented with GPUs.
In response to the rapidly growing demand, however, companies are racing to develop hardware that more directly empowers deep learning, most urgently for inference but also for training. Most efforts focus on "accelerators" that, like GPUs, rapidly perform their specialized tasks under the loose direction of a general-purpose processor, although complete dedicated systems are also being explored. Most of the companies contacted for this article did not respond or declined to discuss their plans in this rapidly evolving and competitive field.
Deep Neural Networks
Neural networks, in use since the 1980s, were inspired by a simplified model of the human brain. Deep learning techniques take neural networks to much higher levels of complexity, their growing success enabled by enormous increases in computing power plus the availability of large databases of validated examples needed to train the systems in a particular domain.
The "neurons" in neural networks are simple computer processes that can be explicitly implemented in hardware, but are usually simulated digitally. Each neuron combines tens or hundreds of inputs, either from the outside world or the activity of other neurons, assigning higher weights to some than to others. The output activity of the neuron is computed based on a nonlinear function of how this weighted combination compares to a chosen threshold.
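The computation performed by a single simulated neuron can be sketched in a few lines of Python. This is only an illustration: the function name neuron_output, the example values, and the choice of the logistic (sigmoid) function as the nonlinearity are assumptions, since the article does not name a specific function.

```python
import math


def neuron_output(inputs, weights, threshold):
    """Return the activity of one simulated neuron.

    The neuron forms a weighted combination of its inputs, compares it
    to a threshold, and passes the difference through a nonlinearity.
    """
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    # Assumption: a logistic (sigmoid) nonlinearity; other choices are common.
    return 1.0 / (1.0 + math.exp(-(weighted_sum - threshold)))


# Hypothetical example: three inputs, the third weighted most heavily.
inputs = [1.0, 0.0, 1.0]
weights = [0.5, -0.3, 0.8]
print(neuron_output(inputs, weights, threshold=1.0))
```

With this choice of nonlinearity the output stays between 0 and 1, rising above 0.5 once the weighted combination exceeds the threshold.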
"Deep" neural networks arrange the neurons into layers (as many as tens of layers) that "infer" successively more abstract representations of the input data, ultimately producing a result such as a translated text or a determination of whether an image contains a pedestrian.
The number of layers, the specific interconnections within and between layers, the precise values of the weights, and the threshold behavior combine to give the response of the entire network to an input. As many as tens of millions of weights are required to specify the extensive interconnections between neurons. These parameters are determined during an exhaustive "training" process in which a model network is given huge numbers of examples with a known "correct" output.
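To see how weight counts reach the tens of millions, consider a rough sketch that counts the connections in a fully connected network, in which every neuron in one layer feeds every neuron in the next. The layer sizes below are hypothetical, chosen only for illustration, and count_weights is an illustrative helper rather than part of any library.

```python
def count_weights(layer_sizes, include_biases=False):
    """Count the weights in a fully connected network.

    layer_sizes lists the number of neurons in each layer, starting with
    the input layer. Every neuron connects to every neuron in the next
    layer, so adjacent layers of sizes a and b contribute a * b weights.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out
        if include_biases:
            total += n_out  # one threshold-like bias term per neuron
    return total


# Hypothetical network: 784 inputs, two hidden layers of 4,096, 10 outputs.
layer_sizes = [784, 4096, 4096, 10]
print(count_weights(layer_sizes))  # 20029440
```

Even this small stack of four layers needs about 20 million weights, most of them in the connections between the two wide hidden layers.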
When the networks are ultimately used for inference, the weights are generally kept fixed as the system is exposed to new inputs. Each of the many neurons in a layer performs an independent calculation (multiplying each of its inputs by an associated weight, adding the products, and doing a nonlinear computation to determine the output). Much of this computation can be framed as a matrix multiplication, which allows many steps to be done in parallel, said Christopher Fletcher, a computer scientist at the University of Illinois at Urbana-Champaign, and "looks like problems that we've been solving on GPUs and in high-performance computing for a very long time."
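The matrix framing described above can be made concrete with a small sketch. Each row of the weight matrix holds the weights of one neuron, so computing a whole layer is a matrix-vector product followed by a nonlinearity, and processing several inputs at once is a matrix-matrix product. The names layer_forward and batch_forward, the bias terms, and the ReLU nonlinearity are assumptions made for illustration.

```python
def relu(x):
    """A simple nonlinearity: pass positive values, clamp negatives to zero."""
    return x if x > 0.0 else 0.0


def layer_forward(weights, biases, inputs):
    """Compute one layer: one row of weights (and one bias) per neuron.

    Each neuron's calculation is independent of the others, which is
    what allows hardware to run them in parallel.
    """
    return [
        relu(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def batch_forward(weights, biases, batch):
    """Apply the same fixed weights to several inputs (a matrix-matrix product)."""
    return [layer_forward(weights, biases, inputs) for inputs in batch]


# Hypothetical layer with two neurons and two inputs each.
weights = [[1.0, -1.0], [0.5, 0.5]]
biases = [0.0, -1.0]
print(layer_forward(weights, biases, [2.0, 1.0]))  # [1.0, 0.5]
print(batch_forward(weights, biases, [[2.0, 1.0], [0.0, 1.0]]))
```

The pure-Python loops here run sequentially; the point is only that no neuron's result depends on another neuron in the same layer.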
During inference, unlike in offline training, rapid response is critical, whether in self-driving cars or in web applications. "Latency is the most important thing for cloud providers," Fletcher noted. In contrast, he said, traditional "GPUs are designed from the ground up for people who don't care about latency, but have so much work that as long as they get full throughput everything will turn out OK."
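The tension between latency and throughput can be illustrated with a toy cost model. Everything here is an assumption: the function batch_stats, the fixed per-batch overhead, and the per-item cost are hypothetical numbers, not measurements of any GPU.

```python
def batch_stats(batch_size, fixed_overhead_ms, per_item_ms):
    """Return (latency in ms, throughput in items per second) for one batch.

    Toy model: each batch pays a fixed overhead plus a cost per item, and
    no result is available until the whole batch has finished.
    """
    latency_ms = fixed_overhead_ms + batch_size * per_item_ms
    throughput_per_s = batch_size / latency_ms * 1000.0
    return latency_ms, throughput_per_s


# Hypothetical costs: 5 ms of overhead per batch, 0.1 ms per item.
for batch_size in (1, 8, 64):
    latency, throughput = batch_stats(batch_size, 5.0, 0.1)
    print(f"batch={batch_size:3d}  latency={latency:5.1f} ms  "
          f"throughput={throughput:7.1f} items/s")
```

Under these assumptions, larger batches amortize the overhead and raise throughput, but every request in the batch waits longer for its answer, which is the trade-off a latency-sensitive service cannot accept.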
Recognizing the importance of response time and anticipating increasing power demands by neural-network applications, cloud behemoth Google developed its own application-specific integrated circuit (ASIC) called a "tensor-processing unit," or TPU, for inference. Google reported in 2017 that, in its datacenters, the TPU ran common neural networks 15 to 30 times faster than a contemporary CPU or GPU, and used 30 to 80 times less power for the same computational performance (operations per second). To guarantee low latency, the designers streamlined the hardware and omitted common features that keep modern processors busy, but also demand more power. The critical matrix-multiplication unit uses a "systolic" design in which data flows between operations without being returned to memory.
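The systolic idea can be illustrated with a small software simulation. This is a generic textbook arrangement, not Google's design: it assumes an "output-stationary" grid in which each processing element keeps one running sum while values of the first matrix move right and values of the second move down, one step per cycle. The function name systolic_matmul is hypothetical.

```python
def systolic_matmul(A, B):
    """Multiply A (n x m) by B (m x p) on a simulated n x p systolic grid.

    Each processing element (PE) holds one accumulator. On every cycle,
    operands shift one PE to the right (from A) or one PE down (from B),
    so a value fetched once is reused by a whole row or column of PEs
    without being written back to memory in between.
    """
    n, m, p = len(A), len(A[0]), len(B[0])
    acc = [[0] * p for _ in range(n)]    # running sums, one per PE
    a_reg = [[0] * p for _ in range(n)]  # A operand currently in each PE
    b_reg = [[0] * p for _ in range(n)]  # B operand currently in each PE

    for t in range(n + m + p - 2):       # cycles until the last product
        for i in range(n):
            for j in range(p - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]            # shift right
            k = t - i                                    # row i is delayed i cycles
            a_reg[i][0] = A[i][k] if 0 <= k < m else 0
        for j in range(p):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]            # shift down
            k = t - j                                    # column j is delayed j cycles
            b_reg[0][j] = B[k][j] if 0 <= k < m else 0
        for i in range(n):
            for j in range(p):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]   # multiply-accumulate
    return acc


print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

The staggered entry times ensure that matching operands meet in the right PE on the right cycle; in hardware, all of the multiply-accumulate steps within a cycle would happen simultaneously.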
So far, Google seems to be unusual among Web giants in designing its own chip, rather than adapting commercially available alternatives. Microsoft, for example, has been using field-programmable gate arrays (FPGAs), which can be rewired after deployment to perform specific circuit functions. Facebook is collaborating with Intel to evaluate its ASIC, called the Neural Network Processor. That chip, aimed at artificial-intelligence applications, started life in Nervana, a startup that Intel acquired in 2016. Unsurprisingly, Nvidia, already the dominant vendor of GPUs, has released updated designs that it says will better support neural network applications, in both inference and training.
These chips follow a strategy that is familiar from other specialized applications, like gaming. Farming out the heavy calculations to a specialized accelerator chip sharing a bus with a general processor and memory allows rapid implementation of new ideas, and lets chip designers focus on dedicated circuits assuming all needed data will be at hand. However, the memory burdens posed by this "simplest" approach are likely to lead to systems with tighter integration, Fletcher said, such as bringing accelerator functions on-chip with the processor. "I think we will inevitably see the world move in that direction."
One technique exploited by the new chips is using low-precision, often fixed-point data, eight bits or even fewer, especially for inference. "Precision is the wild, wild west of deep learning research right now," said Illinois's Fletcher. "One of the major open questions in all of this as far as hardware accelerators are concerned is how far can you actually push this down without losing classification accuracy?"
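A rough sketch shows what trading precision for simpler arithmetic looks like. It assumes symmetric linear quantization, one common scheme among several, in which each list of values is mapped to signed 8-bit integers with a single scale factor. The functions quantize and quantized_dot and the example values are hypothetical, and no particular chip is claimed to work this way.

```python
def quantize(values, num_bits=8):
    """Map floats to signed integers that share one scale factor.

    Assumption: symmetric linear quantization, so the largest magnitude
    maps to the largest representable integer (127 for eight bits).
    """
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def quantized_dot(xq, x_scale, wq, w_scale):
    """Integer multiply-accumulate, rescaled to a float only at the end."""
    acc = sum(a * b for a, b in zip(xq, wq))
    return acc * x_scale * w_scale


# Hypothetical weights and inputs for a single neuron.
weights = [0.42, -1.27, 0.08, 0.91]
inputs = [0.5, 0.25, -1.0, 0.75]

exact = sum(x * w for x, w in zip(inputs, weights))
wq, w_scale = quantize(weights)
xq, x_scale = quantize(inputs)
approx = quantized_dot(xq, x_scale, wq, w_scale)

print("full precision:", exact)
print("8-bit approximation:", approx)
print("absolute error:", abs(exact - approx))
```

The inner loop uses only small integers, which is what makes low-precision hardware cheaper; the open question Fletcher describes is how few bits a network can tolerate before errors like this one begin to change its answers.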
Results from Google, Intel, and others show that such low-precision computations can be very powerful when the data is prepared correctly, which also opens opportunities for novel electronics. Indeed, neural networks were inspired by biological brains, and researchers in the 1980s implemented them with specialized hardware that mimicked features of brain architecture. Even within the last decade, large government-funded programs in both the U.S. and Europe pursued "neuromorphic" chips that operate on biology-inspired principles to improve performance and increase energy efficiency. Some of these projects, for example, directly hardwire many inputs to a single electronic neuron, while others communicate using short, asynchronous voltage spikes like biological neurons. Despite this history, however, the new AI chips all use traditional digital circuitry.
Qualcomm, for example, which sells many chips for cellphones, explored spiking networks under the U.S. Defense Advanced Research Projects Agency (DARPA) program SyNAPSE, along with startup Brain Corporation (in which Qualcomm has a financial stake). But Jeff Gehlhaar, Qualcomm's vice president for technology, said by email that those networks "had some limitations, which prevented us from bringing them to commercial status." For now, Qualcomm's Artificial Intelligence Platform aims to help designers exploit digital circuits for these applications. Still, Gehlhaar noted the results are being studied by others as "this field is getting a second look."
Indeed, although its NNP chip does not use the technology, Intel also announced a test chip called Loihi that uses spiking circuitry. IBM exploited its SyNAPSE work to develop powerful neuromorphic chip technology it called TrueNorth, and demonstrated its power in image recognition and other tasks.
Gill Pratt, a leader for SyNAPSE at DARPA and now at Toyota, said even though truly neuromorphic circuitry has not been adopted commercially yet, some of the lessons from that project are being leveraged in current designs. "Traditional digital does not mean lack of neuromorphic ideas," he stressed. In particular, "sparse computation" achieves dramatically higher energy efficiency by leaving large sections of the chip underused.
"Any system that is very power efficient will tend to be very sparse," Pratt said, the best example being the phenomenal computational power that our brains achieve with less than 20 watts of power.
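One way to picture the saving from sparse computation is to count the multiply-accumulate operations a neuron actually performs when most of its inputs are inactive. The sketch below is a software analogy only, with hypothetical function names and values; it is not a description of how any chip implements sparsity.

```python
def dense_dot(inputs, weights):
    """Multiply every input by its weight, whether or not it is zero."""
    total, macs = 0.0, 0
    for x, w in zip(inputs, weights):
        total += x * w
        macs += 1
    return total, macs


def sparse_dot(inputs, weights):
    """Skip inactive (zero) inputs, which cannot change the sum."""
    total, macs = 0.0, 0
    for x, w in zip(inputs, weights):
        if x == 0.0:
            continue  # no work done, and in hardware, little energy spent
        total += x * w
        macs += 1
    return total, macs


# Hypothetical activity pattern: only two of eight inputs are active.
inputs = [0.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0]
weights = [0.3, -0.2, 0.9, 0.4, 0.5, -0.7, 0.1, 0.6]

print(dense_dot(inputs, weights))   # same sum, 8 operations
print(sparse_dot(inputs, weights))  # same sum, 2 operations
```

Both functions return the same sum, but the sparse version does a quarter of the work here, and the gap widens as activity becomes sparser.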
Although power is critical for datacenters and especially for handheld devices, Pratt noted that even cars can face serious power challenges. Prototype advanced safety and self-driving features require thousands of watts, but would need much more to approach human capabilities, and Pratt thinks hardware will eventually need to exploit more neuromorphic principles. "I am extremely optimistic that is going to happen," he said. "It hasn't happened yet, because there have been a lot of performance improvements, both in terms of efficiency and raw compute horsepower, to be mined with traditional methods, but we are going to run out."
Jouppi, N.P., et al. In-Datacenter Performance Analysis of a Tensor Processing Unit, 44th International Symposium on Computer Architecture (ISCA), Toronto, Canada, June 26, 2017 https://arxiv.org/ftp/arxiv/papers/1704/1704.04760.pdf
Neuromorphic Computing Gets Ready for the (Really) Big Time, Communications, April 2014, pp. 13-15 https://cacm.acm.org/magazines/2014/6/175183-neuromorphic-computing-gets-ready-for-the-really-big-time/fulltext
U.S. Defense Advanced Research Projects Agency DARPA SyNAPSE Program http://www.artificialbrains.com/darpa-synapse-program
©2018 ACM 0001-0782/18/4