Neural-inspired computing models have captured our imagination from the very beginning of computer science; however, victories of this approach were modest until 2012 when AlexNet, a "deep" neural net of eight layers, achieved a dramatic improvement on the image classification problem. One key to AlexNet's success was its use of the increased computational power offered by graphics processing units (GPUs), and it's natural to ask: Just how far can we push the efficient computing of neural nets?
Computing capability has advanced with Moore's Law over these last three decades, but integrated circuit design costs have grown nearly as fast. Thus, any discussion of novel circuit architectures must be met with a sobering discussion of design costs. That said, a neural net accelerator has two big things going for it. First, it is a special-purpose accelerator. Since the end of single-thread performance scaling due to power density issues, integrated circuit architects have searched for clever ways to exploit the increasing transistor counts afforded by Moore's Law without increasing power dissipation. This has led to a resurgence of special-purpose accelerators that are able to provide 10-100x better energy efficiency than general-purpose processors when accelerating their special functions, and which consume practically no power when not in use.
Second, a neural net accelerator can accelerate a broad range of applications. Deep neural nets have begun to realize the promise that has intrigued so many for so long: a single, neuron-inspired computational model that offers superior results on a wide variety of problems. In particular, modern deep neural net models are winning competitions in computer vision, speech recognition, and text analytics. Without exaggeration, the list of victories achieved through the use of deep neural nets grows every week.
Like many other machine learning approaches, neural net development has two phases. The training phase is essentially an optimization problem in which parameter weights of neural net models are adjusted to minimize the error of the neural net on its training set. This is followed by the implementation or inference phase, in which the resulting neural net is deployed in its target application, such as a speech recognizer in a cellphone.
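The two phases can be made concrete with a deliberately tiny sketch (illustrative only, not drawn from the paper): gradient-descent training of a single linear neuron to fit y = 2x, followed by inference with the learned weight.

```python
# Minimal sketch of the two phases for one linear neuron (illustrative;
# the neuron, data, and hyperparameters here are invented for the example).

def train(samples, epochs=200, lr=0.05):
    """Training phase: adjust the weight to minimize squared error."""
    w = 0.0
    for _ in range(epochs):
        for x, y in samples:
            pred = w * x
            grad = 2 * (pred - y) * x   # d/dw of (w*x - y)^2
            w -= lr * grad
    return w

def infer(w, x):
    """Inference phase: deploy the learned weight on new inputs."""
    return w * x

data = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0)]
w = train(data)        # w converges toward 2.0
result = infer(w, 5.0) # so this lands near 10.0
```

Training is iterative and data-hungry, which is why it is typically done on large distributed systems; inference is a fixed feed-forward computation, which is what an embedded accelerator would run.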
Training neural nets is a highly distributed optimization problem in which interprocessor-communication costs quickly dominate local computational costs. On the other hand, the implementation of neural nets in embedded applications, such as cellphones, calls out for a special-purpose, energy-efficient accelerator. Thus, if I could cajole my circuit designer colleagues into designing only one circuit, it would surely be a special-purpose, energy-efficient accelerator that is flexible enough to provide efficient implementations of the growing family of neural net models. This is the goal of DianNao (diàn nǎo, Chinese for computer, or, literally "electric brain").
The DianNao accelerator family comprehensively considers the problem of designing a neural net accelerator, and the following paper shows a deep understanding of both neural net implementations and the issues in computer architecture that arise when building an accelerator for them. Neural net models are evolving rapidly, and a significant new neural network model is proposed every month, if not every week. Thus, a computer architect building an accelerator for neural nets must be familiar with their variety. A specialized architecture that isn't sufficiently flexible to accommodate a broad range of neural net models is certain to become quickly outdated, wasting the extensive chip design effort.
The DianNao family also engages the issues associated with building a processor architecture for a neural net accelerator and puts a particularly strong focus on efficiently supporting the memory access patterns of neural net computations. This includes minimizing both on-chip and off-chip memory transfers. Other members of the DianNao family include DaDianNao, ShiDianNao, and PuDianNao. DaDianNao (big computer) focuses on the challenges of efficiently computing neural nets with one billion or more model parameters. ShiDianNao (vision computer) is further specialized to reduce memory access requirements of Convolutional Neural Nets, a neural net family that is used for computer vision problems. While the number of problems solved by neural nets grows every week, some might wonder: Is this a fundamental change in the field, or will the pendulum swing back to favor a broader range of machine learning approaches? With the PuDianNao (general computer) architecture, the architects hedge their bets on this question by providing an accelerator for more traditional machine learning algorithms.
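The memory-access concern can be illustrated with a small sketch (my own simplification, not DianNao's actual dataflow): a fully connected layer is a matrix-vector product, and tiling the input vector lets a small on-chip buffer reuse each input tile across every output neuron, cutting off-chip transfers.

```python
# Minimal sketch (not DianNao's actual dataflow): tiling a fully connected
# layer so that each input tile, once loaded, is reused for all outputs.

def layer_tiled(weights, inputs, tile=4):
    """Compute outputs[i] = sum_j weights[i][j] * inputs[j], tile by tile."""
    n_out, n_in = len(weights), len(inputs)
    outputs = [0.0] * n_out
    for j0 in range(0, n_in, tile):          # fetch one input tile ("on-chip")
        in_tile = inputs[j0:j0 + tile]
        for i in range(n_out):               # reuse the tile for every output
            row = weights[i]
            outputs[i] += sum(row[j0 + k] * v for k, v in enumerate(in_tile))
    return outputs

out = layer_tiled([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0], tile=1)
```

Each input element is read from the large (off-chip) array once per tile pass rather than once per output neuron, which is the kind of reuse a specialized memory hierarchy is built to exploit.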
Despite, or perhaps because of, DianNao's two Best Paper Awards, some readers may think that building a neural network accelerator is just an academic enterprise. These doubts should be allayed by Google's announcement of the Tensor Processing Unit, a novel neural network accelerator deployed in their datacenters. These processors were recently used to help AlphaGo win at Go. It may be quite some time before we learn of the TPU's architecture, but details on the DianNao family are only a page away.
To view the accompanying paper, visit doi.acm.org/10.1145/2996864
The Digital Library is published by the Association for Computing Machinery. Copyright © 2016 ACM, Inc.