The following paper, "Simba: Scaling Deep-Learning Inference with Chiplet-Based Architecture," by Shao et al. presents a scalable deep learning accelerator architecture that tackles issues ranging from chip integration technology to workload partitioning and non-uniform latency effects on deep neural network performance. Through a hardware prototype, they present a timely study of cross-layer issues that will inform next-generation deep learning hardware, software, and neural network architectures.
Chip vendors face significant challenges: the continued slowing of Moore's Law, which lengthens the time between new technology nodes; skyrocketing manufacturing costs for silicon; and the end of Dennard scaling. In the absence of device scaling, domain specialization provides an opportunity for architects to deliver more performance and greater energy efficiency. However, domain specialization is an expensive proposition for chip manufacturers. The non-recurring engineering costs of producing silicon are exorbitant, including design and verification time for chips containing billions of transistors. Without significant market demand, it is difficult to justify this cost.
Fortunately for computer architects, machine learning is a domain where specialized hardware can reap performance and power benefits. Machine learning has seen widespread adoption in recent years, and its need for more compute, storage, and energy efficiency is only growing as models take on more complex tasks. Domain specialization improves performance and energy efficiency by eschewing all of the hardware in modern processors devoted to providing general-purpose capabilities. Furthermore, architectures targeting machine learning feature regular arrays of simple processing elements (primarily doing multiply-accumulate operations) that can potentially be scaled to large numbers and may offer opportunities to ease verification.
Given the billions of transistors that can fit on a single large die, is scaling up the number of processing elements in a machine learning accelerator trivial? The slowing of Moore's Law makes it increasingly difficult to pack more functionality on a single chip. If transistor sizes stay constant, more functionality could be integrated via larger chips. However, larger chips are undesirable due to significantly higher costs. Verification costs are higher. Manufacturing defects in densely packed logic can dramatically reduce the wafer yield. Lower yield translates into higher manufacturing cost.
A promising solution to combat these yield and verification challenges is to design and fabricate smaller chips (chiplets) and integrate those chiplets into one system via a package-level solution such as a silicon interposer or organic substrate. Small chiplets are cheap to manufacture; a manufacturing defect on a chiplet has a smaller impact on the total wafer yield. The reduced functionality of an individual chiplet is compensated for by integrating a large number of chiplets into the system. This concept of chiplet-based architectures has been explored in CPUs and in GPUs. Simba develops an architecture and hardware prototype to demonstrate how chiplets can be effectively employed in machine learning accelerators.
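The yield argument can be made concrete with a small back-of-the-envelope sketch. It assumes a simple Poisson defect model, in which the fraction of defect-free dies is exp(-D x A) for defect density D and die area A; this model, the function names, and the numbers in the usage note are illustrative assumptions, not figures from the Simba paper. It also assumes chiplets are tested before assembly, and it ignores packaging cost and the area spent on chiplet-to-chiplet links.

```python
import math


def poisson_yield(area_mm2, defects_per_cm2):
    """Expected fraction of defect-free dies under a Poisson defect model.

    Assumption: defects are randomly distributed and any defect kills the die.
    """
    area_cm2 = area_mm2 / 100.0  # 100 mm^2 per cm^2
    return math.exp(-defects_per_cm2 * area_cm2)


def silicon_per_good_system(total_area_mm2, num_dies, defects_per_cm2):
    """Wafer area (mm^2) consumed, on average, per working system.

    The system's logic is split evenly across num_dies dies. Assumption:
    dies are tested before packaging, so a defect discards only one die.
    """
    die_area = total_area_mm2 / num_dies
    # Each good die costs die_area / yield of silicon; a system needs num_dies.
    return num_dies * die_area / poisson_yield(die_area, defects_per_cm2)


if __name__ == "__main__":
    # Hypothetical numbers chosen only for illustration.
    total_area = 600.0      # mm^2 of logic in the whole system
    defect_density = 0.1    # defects per cm^2
    for n in (1, 4, 36):
        cost = silicon_per_good_system(total_area, n, defect_density)
        print(f"{n:2d} die(s): {cost:7.1f} mm^2 of silicon per good system")
```

Under these assumptions, the silicon consumed per working system falls as the same logic is divided into more, smaller dies, which is the cost effect described above. A fuller model would add the packaging and interconnect overheads that chiplets introduce.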
While the focus of the paper is a scalable approach to deliver increasing performance and energy efficiency in datacenter-scale inference accelerators, one exciting feature of the proposed chiplet-based approach is the ease with which it can be scaled across different market segments. Each chiplet can stand alone as a complete system; therefore, a single chiplet could be used as an edge device, or a small number of chiplets could be integrated for a consumer-class device. Given the design, verification, and manufacturing costs associated with fabricating silicon, a single chiplet design that delivers for all market segments is a compelling solution.
Another insightful aspect of this paper is the emphasis on hardware/software co-design. Given the myriad challenges facing hardware design and manufacturing, it is imperative that software systems be thoughtfully designed to combat any non-uniformities introduced by the hardware solutions. Non-uniform memory access (NUMA) effects have long been studied for multi-socket, multi-board designs. However, this study provides new insights specifically targeting machine learning applications and hierarchical interconnects with different bandwidth and latency characteristics that will be found in these future chiplet-based architectures. On the software side, they consider the impact of workload partitioning and communication-aware data placement. Through detailed case studies, this paper makes a compelling, evidence-based argument for co-design.
The deep neural network accelerator design space is rich with exciting start-ups and big-name companies producing new silicon. A number of open challenges and questions remain. In addition to hardware/software co-design, can the neural network architectures themselves be adapted to run more efficiently on the given hardware? If a single chiplet can serve a range of market segments, we need software and runtime solutions to adapt the network architecture to run efficiently on each instance of the system. How can we adapt this chiplet-based approach to build custom, heterogeneous hardware solutions at low cost? The hardware prototype in this paper provides a compelling foundation for further research in chiplet-based accelerator architectures.
To view the accompanying paper, visit doi.acm.org/10.1145/3460227
The Digital Library is published by the Association for Computing Machinery. Copyright © 2021 ACM, Inc.