In an era of mobile devices used as windows into services provided by computing in the cloud, the cost and reliability of services provided by large warehouse-scale computers [1] are paramount. These warehouse-scale computers are implemented with racks of servers, each one typically consisting of one or two processor chips but many memory chips. Even with a crash-tolerant application layer, understanding the sources and types of errors in server memory systems is still very important.
Similarly, as we look forward to exascale performance in more traditional supercomputing applications, even memory errors correctable through traditional error-correcting codes can have an outsized impact on the total system performance [3]. This is because in many systems, execution on a node with hardware-corrected errors that are logged in software runs significantly slower than on nodes without errors. Since execution of bulk synchronous parallel applications is only as fast as the slowest local computation, in a million-node computation the slowdown of one node from memory errors can end up delaying the entire million-node system.
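The bulk synchronous argument above can be sketched in a few lines of Python. This is a toy model, not a measurement: the function name `bsp_runtime`, the 1.0-second healthy step time, and the 1.5x slowdown on the affected node are all illustrative assumptions that do not come from the paper.

```python
def bsp_runtime(per_node_step_seconds, num_steps):
    """Total run time of a bulk synchronous computation.

    Every step ends at a barrier, so each step takes as long as the
    slowest node; the total is that maximum times the number of steps.
    """
    return num_steps * max(per_node_step_seconds)


NUM_NODES = 1_000_000
NUM_STEPS = 100
HEALTHY_STEP = 1.0   # assumed seconds per step on a node without errors
DEGRADED_STEP = 1.5  # assumed seconds per step on a node logging corrected errors

step_times = [HEALTHY_STEP] * NUM_NODES
baseline = bsp_runtime(step_times, NUM_STEPS)

# A single node out of a million starts logging corrected memory errors.
step_times[42] = DEGRADED_STEP
degraded = bsp_runtime(step_times, NUM_STEPS)

print(f"baseline: {baseline:.0f} s, with one slow node: {degraded:.0f} s")
```

Under these assumed numbers, the whole run slows by the same factor as the single affected node, because every other node waits at each barrier.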
At the system level, low-end PCs have historically not provided any error detection or correction capability, while servers have used error-correcting codes (ECC) that have enabled correction of a single error per codeword. This worked especially well when a different memory chip was used for each bit read or written on a memory bus (such as when using "x1" memory chips). However, in the last 15 years, as memory buses have become wider, more bits on the bus need to be read or written from each memory chip, leading to the use of memory chips that can provide four ("x4") or more bits at a time to a memory bus. Unfortunately, this increases the probability of errors correlated across multiple bits, such as when part of a chip address circuit fails. To handle cases where an entire chip's contribution to a memory bus is corrupted, chip-kill-correct error-correcting codes have been developed [2].
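To make the limits of single-error correction concrete, here is a minimal sketch using a textbook Hamming(7,4) code. It is a stand-in chosen for brevity, not the code used in server memory (which typically protects much wider words), and the function names `encode` and `decode` are illustrative. It shows that any single flipped bit in a codeword is corrected, while two flipped bits in the same codeword, as might happen when one multi-bit chip corrupts several of the bits it supplies, are silently miscorrected.

```python
from itertools import combinations


def encode(data_bits):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def decode(codeword):
    """Correct at most one flipped bit, then return the 4 data bits."""
    word = list(codeword)
    syndrome = 0
    for position, bit in enumerate(word, start=1):
        if bit:
            syndrome ^= position  # XOR of the positions holding a 1
    if syndrome:
        word[syndrome - 1] ^= 1   # a nonzero syndrome names the bit to flip
    return [word[2], word[4], word[5], word[6]]


data = [1, 0, 1, 1]
codeword = encode(data)

# Any single-bit error is corrected.
for i in range(7):
    corrupted = list(codeword)
    corrupted[i] ^= 1
    assert decode(corrupted) == data

# Any two-bit error in the same codeword is miscorrected to wrong data.
for i, j in combinations(range(7), 2):
    corrupted = list(codeword)
    corrupted[i] ^= 1
    corrupted[j] ^= 1
    assert decode(corrupted) != data

print("single-bit errors corrected; double-bit errors miscorrected")
```

Practical single-error-correcting memory codes usually add an extra parity bit so that double errors are at least detected rather than miscorrected; correcting the loss of an entire chip's bits requires the stronger chip-kill codes described above.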
Since the introduction of DRAMs in the mid-1970s, there has been much work on improving the reliability of individual DRAM devices. Some of the classic problems addressed were tolerance of radiation, from either impurities in the package or cosmic sources. In contrast, very little information has been published on the reliability of memory at the system level. There are several reasons for this. First, much of the industrial data is specific to particular memory or CPU vendors. Second, this industrial data typically focuses on configurations that are particularly problematic. As a result, neither DRAM, CPU, nor system vendors find it in their best interest to publish this data.
Nevertheless, to advance the field, knowledge of the types of memory errors, their frequencies, and the conditions that do or do not lead to higher error rates is of critical importance. To fill this gap, Bianca Schroeder, Eduardo Pinheiro, and Wolf-Dietrich Weber analyzed measurements of memory errors in a large fleet of commodity servers over a period of 2.5 years. They collected data on multiple DRAM capacities, technologies, and vendors (suitably anonymized), totaling millions of DIMM days.
They found that DRAM error behavior at the system level differs markedly from widely held assumptions. For example, they found DRAM error rates that are orders of magnitude higher than previously reported. Additionally, they found that temperature has a surprisingly small effect on memory errors. However, even though errors are rare, they are highly correlated in time on a given DIMM. Underscoring the importance of chip-kill error correction in servers, they found that the use of chip-kill error correction can reduce the rate of uncorrectable errors by a factor of 38.
I hope the following paper will motivate the collection and publication of even more large-scale system memory reliability data. This work and future studies will be instrumental in helping architects and system designers address and solve the real problems in memory system reliability, enabling cloud services that are both cost-effective and reliable, as well as efficiently extending supercomputing to the exascale.
3. Yelick, K. Ten Ways to Waste a Parallel Computer; http://isca09.cs.columbia.edu/ISCA09-WasteParallelComputer.pdf
©2011 ACM 0001-0782/11/0200 $10.00