General-purpose processors, while tremendously versatile, pay a huge cost for their flexibility by wasting over 99% of the energy in programmability overheads. We observe that reducing this waste requires tuning data storage and compute structures and their connectivity to the data-flow and data-locality patterns in the algorithms. Hence, by backing off from full programmability and instead targeting key data-flow patterns used in a domain, we can create efficient engines that can be programmed and reused across a wide range of applications within that domain.
We present the Convolution Engine (CE)a programmable processor specialized for the convolution-like data-flow prevalent in computational photography, computer vision, and video processing. The CE achieves energy efficiency by capturing data-reuse patterns, eliminating data transfer overheads, and enabling a large number of operations per memory access. We demonstrate that the CE is within a factor of 23× of the energy and area efficiency of custom units optimized for a single kernel. The CE improves energy and area efficiency by 815× over data-parallel Single Instruction Multiple Data (SIMD) engines for most image processing applications.
Processors, whether they are the relatively simple RISC cores in embedded platforms, or the multibillion transistor CPU chips in Server/Desktop computers, are extremely versatile computing machines. They can handle virtually any type of workload ranging from web applications, personal spreadsheets, image processing workloads, and embedded control applications to database and financial applications. Moreover, they benefit from well-established programming abstractions and development tools, and decades of programming knowledge making it very easy to code new applications.
Processors, however, are also inefficient computing machines. The overheads of predicting, fetching, decoding, scheduling, and committing instructions account for most of the power consumption in a general-purpose processor core.2, 7, 16 As a result they often consume up to 1000× more energy than a specialized hardware block designed to perform just that particular task. These specialized hardware blocks also typically offer hundreds of times higher performance using a smaller silicon area. Despite these large inefficiencies, processors form the core of most computing systems owing to their versatility and reuse.
Over decades, we have been able to scale up the performance of general-purpose processors without using excessive power, thanks to advances in semiconductor device technology. Each new technology generation exponentially reduced the switching energy of a logic gate enabling us to create bigger and more complex designs with modest increases in power. In recent years, however, the energy scaling has slowed down,12 thus we can no longer scale processor performance as we used to do. Today we fundamentally need to reduce energy waste if we want to scale performance at constant power.
This paper presents a novel highly efficient processor architecture for computational photography, image processing, and video processing applications, which we call the Convolution Engine (CE). With the proliferation of cheap high quality imagers, computational photography and computer vision applications are expected to be critical consumer computation workloads in coming years. Some example applications include annotated reality, gesture-based control, see-in the dark capability, and pulse measurement.
Many of these applications, however, will require multiple TeraOps/s of computation which is far beyond the capability of general processor cores especially mobile processors on a constrained power budget of less than 1 Watt. The three orders of magnitude advantage in compute efficiency of hardware accelerators, means that current mobile systems use heterogeneous computing chips combining processors and accelerators.11,15 An example accelerator is the video codec hardware employed in the mobile SOCs. However, these accelerators target either a single algorithm or small variations on an algorithm. Handling the diverse application set needed by these future smart devices requires a more programmable computing solution. Our CE provides that programmability while still keeping energy dissipation close to dedicated accelerators.
Our design approach is based on the observation that specialized units achieve most of their efficiency gains by tuning data storage structures to the data-flow and data-locality requirements of the algorithm. This tuning eliminates redundant data transfers and facilitates creation of closely coupled data-paths and storage structures allowing hundreds of low-energy operations to be performed for each instruction and data fetched. Processors can enjoy similar energy gains if they target computational motifs and data-flow patterns common to a wide-range of kernels in a domain. Our CE implements a generalized map-reduce abstraction, which describes a surprisingly large number of operations in image processing domain. The resulting design achieves up to two orders of magnitude lower energy consumption compared to a general-purpose processor and comes within 23× of dedicated hardware accelerators.
The next section provides an overview of why general processors consume so much power and the limitations of existing optimization strategies. Section 3 then introduces the convolution abstraction and the five application kernels we target in this study. Section 4 describes the CE architecture focusing primarily on features that improve energy efficiency and/or allow for flexibility and reuse. We then compare this CE to both general-purpose cores with SIMD extensions and to highly customized solutions for individual kernels in terms of energy and area efficiency. Section 5 shows that the CE is within a factor of 23× of custom units and almost 10× better than the SIMD solution for most applications.
The low efficiency of general-purpose processors is explained in Figure 1, which compares the energy dissipation of various arithmetic operations with the overall instruction energy for an extremely simple RISC processor. The energy dissipation of arithmetic operations that perform the useful work in a computation remains much lower than the energy wasted in the instruction overheads such as instruction fetch, decode, pipeline management, program sequencing, etc. The overhead is even worse for media processing applications which typically operate on short data requiring just 0.20.5 pJ (90 nm) of energy per operation, with the result that over 99% energy goes into overheads.
These overheads must be drastically reduced to increase energy efficiency. That places two constraints on the processor design, (i) the processor must execute hundreds of operations per instruction to sufficiently amortize instruction cost and (ii) the processor must also fetch little data, since even a cache hit costs 25 pJ (90 nm) per memory fetch, compared to 0.20.5 pJ for the arithmetic operations.
These two constraints seem contradictory as performing hundreds of operations per cycle would generally necessitate reading large amounts of data from the memory. These conditions could be reconciled, however, for algorithms where most instructions either operate on intermediate results produced by previous instructions, or reuse most of the input data used by the previous instructions. If an adequate storage structure is in place to retain this "past data" in the processor data-path, then large number of operations can be performed per instruction without needing frequent trips to the memory. Fortunately compute bound applications including most image processing and video processing algorithms are a good match for these constraints, providing large data-parallelism and data-reuse.
Most high-performance processors today already include SIMD units which are widely regarded as the most efficient general-purpose optimization for compute intensive applications. SIMD units typically achieve an order of magnitude energy reduction by simultaneously operating on many data operands in a single cycle (typically 816). However, as explained in Hameed et al.7 the resulting efficiency remains two orders of magnitude less than specialized hardware accelerators, as the SIMD model does not scale well to larger degrees of parallelism.
To better understand the architectural limitations of traditional SIMD units, consider the two-dimensional sum of absolute difference operation (SAD) operating on a 16-bit 8 × 8 block as shown in Listing 1.5 The 2D SAD operator is widely employed in multimedia applications such as H.264 video encoder to find the closest match for a 2D image subblock in a reference image or video frame. Listing 1 carries out this search for every location in a srchWinHeight × srch-Win-Width search window in the reference frame, resulting in four nested loops. All four loops are independent and can be simultaneously parallelized. At the same time, each SAD output substantially reuses the input data used to compute previous outputs, both in vertical and horizontal directions.
However, a typical SIMD unit with a register row size of 128 bits is only able to operate on elements that fit in one register row limiting the parallelism to the inner most loop. Trying to scale up the SIMD width to gain more parallelism requires either simultaneously reading multiple image rows from the register file (to parallelize across the 2nd most-inner loop), or simultaneously reading multiple overlapping rows of image data (to parallelize across multiple horizontal outputs). Neither support exists in the SIMD model.
Exploiting the data-reuse requires storing the complete 8 × 8 block in eight rows of the SIMD register file, so that the data is locally available for computing subsequent output pixels. This is straightforward in vertical direction, as every new output just needs one new row and the seven old rows could be reused. To get reuse in the horizontal direction, however, each of the eight rows must be shifted by one pixel before computing each new output pixel, incurring substantial instruction overhead. At the same time, since the shifting process destroys the old pixels, we are limited to getting reuse either in the vertical or horizontal direction but not both, with the result that each data item gets fetched eight times from the memory, wasting too much memory energy.
GPUs achieve a much higher degree of parallelism by using a large number of simple SIMD cores, each with local register resources, and very large memory bandwidth to maintain high computational throughput. While that results in great compute throughput, the energy consumption is also very high thanks to large data access cost. Measuring the performance and energy consumption of an optimized GPU implementation of H.264 SAD algorithm13 using GPUGPUSim simulator1 with GPU-Wattch energy model,9 we ascertain that the GPU implementation runs 40× faster compared to an embedded 128-bit SIMD unit, but also consumes 30× higher energy. That further illustrates the need to minimize memory accesses and provide low-cost local data-accesses.
As the next section shows, the SAD operator belongs to a large class of algorithms which can be described as convolution-like and have the desired parallelism and reuse characteristics for efficient execution. Next section discusses this computational abstraction and provides a number of example applications.