The development of high-performance software has always suffered from a tension between achieving high performance on the one hand and portability and simplicity on the other. Specializing an algorithm for optimal performance, taking into account the memory hierarchy and other architectural particulars, introduces architecture-specific detail. This obscures algorithmic structure and conflates the general with the specific, compromising simplicity and clarity. It also hurts portability to all but very similar architectures: even small changes, such as different cache sizes, can have substantial performance implications. Moreover, distinctly different architectures, such as CPUs, GPUs, and DSPs, often require fundamentally different optimization strategies. As a result, high-performance code is difficult to write, debug, maintain, and port.
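As a small illustration of how specialization obscures algorithmic structure, the sketch below contrasts a textbook matrix product with a cache-blocked (tiled) variant. It is a minimal sketch, not taken from any particular system: the function names and the tile size `TILE = 32` are assumptions chosen for illustration, and in pure Python the tiling shows only the structural cost, not a realistic speedup.

```python
def matmul_naive(a, b, n):
    """Textbook n x n matrix product on nested lists: the algorithm as stated."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


# Hypothetical tile size. In a real implementation this value would be tuned
# to the cache sizes of one specific machine, which is exactly the kind of
# architecture-specific detail that hurts portability.
TILE = 32


def matmul_tiled(a, b, n, tile=TILE):
    """Same product, restructured into tile x tile blocks for cache reuse."""
    c = [[0.0] * n for _ in range(n)]
    # Outer loops walk over blocks of the iteration space.
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                # Inner loops compute one block; min() handles ragged edges
                # when n is not a multiple of the tile size.
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            c[i][j] += aik * b[k][j]
    return c
```

Both functions compute the same result, but the three-loop algorithm has become six loops plus a machine-dependent parameter. Moving to an architecture with a different cache size would mean retuning `tile`, and targeting a GPU or DSP would likely call for a different restructuring altogether.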
Numerous research efforts have aimed to address this issue by applying automatic code transformations and other forms of compiler optimization. Ultimately, we would prefer that the software developer simply code the algorithm and leave it to the machine to specialize that algorithm to any particular architecture for efficient execution. In this ideal world, portability is a matter of retargeting a compiler's optimization engine. Unfortunately, architectural complexity and the lack of architectural models that are simultaneously sufficiently detailed and tractable have prevented us from realizing this vision.