In this era of Big Data, there is no such thing as too much storage.
Just ask Stephen Simms, manager of high-performance file systems at Indiana University (IU) in Bloomington, IN. IU is about to deploy its five-petabyte (PB) Data Capacitor II (DC II) Lustre file system, which will deliver nearly 10 times the storage capacity of the university’s first Data Capacitor and ingest data at triple the speed (close to 50 gigabytes (GB) per second).
"Five petabytes is not enough," Simms declares. "It will accommodate our storage requirements for a while, but we’ll look beyond five petabytes almost as soon as we get it up and running."
Simms helped write the proposal for the $1.72-million grant from the National Science Foundation (NSF) to build the University’s first Data Capacitor in 2005. That Request for Proposal (RFP) addressed impedance mismatch between data producer and data consumer; for example, a telescope might be able to produce data faster than a computer could process it. In electronics, Simms explains, a capacitor’s function is to even out the flow of electrons to create a steady stream; the Data Capacitor similarly stores data from a variety of inputs so that when the computer is available, it can access the data on the file system right away. The original Data Capacitor was big and fast by 2005 standards: it had half a petabyte of storage capacity, and could accept 14.5 GB per second.
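The capacitor analogy can be sketched as a simple buffer simulation. The Python below is an illustrative sketch only, not IU's software: the function name `simulate_buffer`, the per-step amounts, and the capacity are all hypothetical, chosen to show how a buffer absorbs a burst from a fast producer and drains it at the consumer's steady pace.

```python
def simulate_buffer(produced, consume_rate, capacity):
    """Simulate a fixed-size buffer between a bursty producer and a steady consumer.

    produced     -- amount of data arriving at each time step (hypothetical units)
    consume_rate -- maximum amount the consumer can take per time step
    capacity     -- maximum amount the buffer can hold

    Returns (levels, dropped): the buffer level after each step, and the
    total amount that did not fit and was lost.
    """
    level = 0.0
    dropped = 0.0
    levels = []
    for amount in produced:
        # Data arrives first; anything beyond capacity is lost.
        level += amount
        if level > capacity:
            dropped += level - capacity
            level = capacity
        # The consumer then drains at most consume_rate from the buffer.
        level -= min(level, consume_rate)
        levels.append(level)
    return levels, dropped


# A two-step burst of 10 units per step, consumed at 4 units per step.
burst = [10, 10, 0, 0, 0]
big_levels, big_dropped = simulate_buffer(burst, consume_rate=4, capacity=100)
small_levels, small_dropped = simulate_buffer(burst, consume_rate=4, capacity=8)
print(big_levels, big_dropped)
print(small_levels, small_dropped)
```

With a large enough buffer the burst is fully absorbed and drained over the following steps; with a small one, part of the burst is lost, which is the mismatch the proposal set out to address.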
Digital Data Deluge
IU’s professors and students perform data-intensive research in projects such as analyzing substances that may play a pivotal role in the early onset of Alzheimer’s disease. These initiatives consume tremendous storage resources. Simms notes that another of IU’s scientific projects produces four terabytes (TB) of gigapixel images daily. "DC II will give those images a place to go for rapid analysis. Work expands to consume all free time. Data, when left unchecked, will likewise fill all available free space," Simms says.
This type of structured and unstructured data deluge is spurring all classes of organizations to continuously increase their data storage capacities. Even leading-edge supercomputing sites like Lawrence Livermore National Laboratory and Oak Ridge National Laboratory, each of which has a high-performance storage system similar to, but much larger and faster than, DC II (with 55 PB and 40 PB of storage, respectively), find it challenging to stay ahead of storage requirements. Oak Ridge’s Titan supercomputer’s file system can move data at 1.4 TB per second.
"We’re confronted with near-exponential growth in data generated by multiple scientific initiatives," observes Galen Shipman, director of Compute and Data Environment for Science at Oak Ridge National Laboratory in Oak Ridge, TN. ORNL supports a myriad of scientific projects and initiatives including simulations for climate science modeling, high-energy physics and nuclear structure simulation experiments. "We’re constantly upping our game with respect to our Titan storage and Lustre file system to improve system scalability and resiliency. And ORNL continually increases capacity because high performance data analysis is becoming more pervasive," Shipman says.
The amount of data generated, processed and stored is growing on average 20% to 30% per year. According to market research firm International Data Corp. (IDC) in Framingham, MA, data storage demands will increase at a compound annual growth rate (CAGR) of 53% between 2011 and 2016. IDC attributes much of this storage growth to the rise of Big Data technology and services.
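To put the growth rates above side by side, the compounding can be worked out directly. This is a minimal sketch that assumes five annual compounding periods between 2011 and 2016 (the source does not state the period count); the function name `growth_multiple` is hypothetical.

```python
def growth_multiple(annual_rate, years):
    """Return the overall multiple after compounding annual_rate for `years` periods."""
    return (1.0 + annual_rate) ** years


# Assumption: 2011 to 2016 is treated as five annual compounding periods.
YEARS = 5

# Rates quoted in the article: 20% to 30% average growth, and a 53% CAGR.
for rate in (0.20, 0.30, 0.53):
    multiple = growth_multiple(rate, YEARS)
    print(f"{rate:.0%} per year for {YEARS} years -> {multiple:.2f}x")
```

Under that assumption, a 53% CAGR compounds to roughly 8.4 times the starting demand over five years, compared with about 2.5 to 3.7 times at 20% to 30% per year.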
Technology advancements like digital and video imaging and 16- and 32-slice CT scans, coupled with increasing user demands, are fast outpacing storage technology’s ability to keep up with capacity requirements. Data types such as high-definition video, audio, and animation are becoming de rigueur for corporate websites and presentations and as marketing, sales, and training tools. Technologies like special effects and high-definition imaging substantially raise the amount of data generated; for example, it takes twice as much space to store 3D video as it does 2D video.
IU’s 110,000 students and 17,800 faculty and professional staff produce and store tremendous amounts of Big Data related to a wide range of activities, including scientific, engineering, health, and geosciences research. IU supercomputers have processed marketing analyses for the University’s business school, as well as lighting simulations for the theatre department.
"Storage needs are multiplying like the Tribbles from Star Trek. It’s difficult to comprehend how much data we have," Simms says.
In addition, IU provides field support for a NASA geoscience initiative called Operation IceBridge, which collects data related to snow and ice levels and conditions at the North and South Poles. Simms says Operation IceBridge typically collects up to 80 TB of data in six weeks; with the addition of more advanced radar, that figure is expected to soar to 800 TB in the same timeframe.
Enter DC II, which will ingest data from a slower network and discharge it quickly into an HPC resource utilizing a faster network than its predecessor could. DC II is used in conjunction with Big Red II, IU’s one-petaFLOPS Cray XK supercomputer, which is equipped with more than 21,000 cores. "DC II will accept data across a 10 Gigabit Ethernet fabric and data will enter the Cray using 40-gigabit QDR InfiniBand," Simms says. The result is faster processing, which lets researchers extract knowledge from data sooner and shortens time to publication, he adds.
Simms and his team have also constructed a wide area network Data Capacitor (DC-WAN) that has been mounted in multiple geographically distributed locations. Because DC-WAN is a mountable resource, one can bring far-flung computational resources to bear on the same data set without having to use cumbersome data transfer tools. For example, the WAN file system enables researchers to run a code on multiple machines in disparate locations, writing the results to DC-WAN on IU’s Bloomington campus. "It’s a huge leap forward in terms of ease of use and efficiency. As you’re writing data to the file system, you can spy on the running job creating the data. You might discover three minutes into a simulation that your initial conditions were way off and you can kill the job, saving lots of compute cycles, or if the results are promising, you can plan your next simulation and submit it before the first one finishes," Simms says.
Howard Marks, founder and chief scientist at Deep Storage, a Santa Fe, NM-based consulting firm, notes that the use of high-performance storage systems is becoming commonplace. "It’s not a disruptive technology. Even large retailers like Target and Wal-Mart now have high end storage solutions capable of processing and storing 30, 40 and 55 PBs of data," Marks says. "Hadoop has reduced the cost of doing finely detailed analysis that can positively impact the bottom line."
"Our mission is to accelerate research by utilizing cutting-edge technology," Simms says. "Technology is only as good as what you can do with it. If DC II can help researchers better understand cancer or reverse the effects of Alzheimer’s, then it’s fulfilled its mission. Knowing that the Data Capacitor project has made a positive impact keeps me motivated to look for new ideas and solutions that will keep IU on the cutting edge."
Laura DiDio is principal at ITIC, an information technology consultancy near Boston.